Welcome to the SMPI/PARAVER integration

1 Installation

Here are the different files needed for this integration.

For this to work "system wide", these perl and sh scripts should be in the PATH. Eventually, they will be shipped with SMPI.

2 Achievements

2.1 April 2013 (Grenoble)

  • First prototype of a prv 2 csv conversion

2.2 October 2014 (Grenoble @ BSC, for Mont-Blanc)

  • First prototype of an integration of SMPI with Paraver

2.3 November 2014 (Grenoble and BSC @ Chicago, for JLESC)

  • Discussions between Judit and Arnaud on how to model a new machine in SimGrid.
  • Improved state naming with Harald
  • Minor cleanups

2.4 June 2015 (Grenoble @ BSC, for JLESC)

  • Moved this project from blog entry to the contrib section of SimGrid
  • Got access to Marenostrum and ensured that we can compile and run pj_dump and SMPI.
  • Major reorganization and cleanups of the scripts, revamp of the documentation.
  • Judit had an issue with a very simple trace she had generated directly with smpicc. The catch was that I was not converting Send and Recv yet, as BigDFT had only collective operations. This is now fixed.
  • Jesus confirmed to me that he was also running into this TCP_RTO issue on some machines, that fixing it was hard, and that being able to account for it would definitely be useful to understand how sensitive applications can be to it.
  • Jesus provided me with large traces to play with.
      tar jtf /home/alegrand/Work/SimGrid/bsc/lulesh.tar.bz2
    
    lulesh2.0_p1000_n500_t1.chop1.pcf
    lulesh2.0_p1000_n500_t1.chop1.prv
    lulesh2.0_p1000_n500_t1.chop1.row
    lulesh2.0_p1331_n666_t1.chop1.pcf
    lulesh2.0_p1331_n666_t1.chop1.prv
    lulesh2.0_p1331_n666_t1.chop1.row
    lulesh2.0_p216_n108_t1.chop1.pcf
    lulesh2.0_p216_n108_t1.chop1.prv
    lulesh2.0_p216_n108_t1.chop1.row
    lulesh2.0_p512_n256_t1.chop1.pcf
    lulesh2.0_p512_n256_t1.chop1.prv
    lulesh2.0_p512_n256_t1.chop1.row
    lulesh2.0_p729_n365_t1.chop1.pcf
    lulesh2.0_p729_n365_t1.chop1.prv
    lulesh2.0_p729_n365_t1.chop1.row
    

    It turns out my prv2pj.pl script currently fails on these traces as they comprise two MPI processes per node. I could easily fix it, but for now I have many issues with send and receive matching.

2.5 July 2015 (Grenoble @ BSC, for Harald Servat's PhD defense)

I kept investigating this Isend/Irecv/Wait matching issue and it seems much more difficult than expected. I think there is not enough information in the paraver format to do it properly.

3 Roadmap

3.1 TODO Interaction between Paraver and SMPI [0/2]

  • [ ] Make a model of Mare Nostrum, the Mont-blanc prototype, so that BSC staff can really play with SMPI. (Edit: this was discussed in Chicago with Judit. I explained here the SimGrid XML platform representation and she will try to play with SMPI and come back to me with questions).
  • [ ] Convert the 12 GB Nancy LU trace (700 processes on 3 clusters) to Paraver to see whether the behavior exhibited by ocelotl can be observed in Paraver. This involves slightly modifying the Paje to Paraver converter, which was designed for SMPI Paje traces.

    This trace was on flutin and I got it here: file:///exports/nancy_700_lu.C.700.pjdump.bz2

    Most of these issues are specific to this trace, so they can be ignored by anyone but me.

    • [ ] Fix the state name conversion and the event conversion
    • [ ] In pjdump2prv.pl there is probably something wrong with the number of communicators. I use $nb_nodes at the moment.
    • [ ] The resulting prv starts from the pjdump and I forgot to sort it. Could we give an option to pjdump so that it sorts it according to time?
    • [ ] Do not use state 0 as it's reserved for computation
    • [ ] Create a state and event for MPI application (derived from being outside MPI calls)
    • [ ] clock resolution issue
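
For the sorting item above, a minimal Python sketch of a post-processing step that reorders an already generated prv file. It is not part of pjdump2prv.pl. It assumes that the sixth colon-separated field of state (1:), event (2:) and communication (3:) records is the timestamp to sort on, and that every other line (header, communicator definitions) belongs at the top. The function name sort_prv is hypothetical.

```python
def sort_prv(lines):
    """Return the non-record lines first, then the records sorted by timestamp."""
    header, records = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split(":")
        # Records start with 1, 2 or 3 and carry their (first) timestamp in field 6.
        if fields[0] in ("1", "2", "3") and len(fields) > 5 and fields[5].isdigit():
            records.append((int(fields[5]), line))
        else:
            header.append(line)
    # list.sort is stable, so records sharing a timestamp keep their original order.
    records.sort(key=lambda record: record[0])
    return header + [line for _, line in records]
```

This holds the whole trace in memory, so it would only suit small traces; a trace of several gigabytes would need an external sort instead.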

4 Description of the interaction between Paraver and SMPI

We explain in this document how SMPI can be used as an alternative to Dimemas within the paraver framework. To this end, we need to make sure that SMPI can simulate paraver traces and output paraver traces.

Ideally, we would modify SMPI so that it can parse and generate such traces. It's an option we keep in mind as it would be much cleaner and faster, but it would require us to

  1. scavenge the Dimemas trace parsing code (in C/C++) and meld it with SMPI trace replay.
  2. make sure SMPI can directly generate the Paraver format.

This is potentially a lot of work to do within our time frame so instead we decided to go for simple trace conversions, i.e., a paraver to SMPI time-independent trace format conversion and a Paje to paraver conversion.

Some simple sample traces are available here:

4.1 Paraver to CSV and SMPI format Conversion

Method

Juan Gonzalez provided us with a description of the Paraver and Dimemas formats. The Paraver description is available here, i.e., from the Paraver documentation. Remember that the pcf file describes events, the row file defines the cpu/node/thread mapping and the prv file is the trace with all events. I reworked my old script overnight to convert from Paraver to csv, pjdump and the SMPI time-independent trace format. Unfortunately, in the morning, Juan explained to me that I should not trust the state records but only the event and communication records. Ideally, I should have worked from the Dimemas trace instead of the Paraver trace to obtain the SMPI trace, but at least this allowed me to get a converter to csv/pjdump, which is very useful to Damien for framesoc/ocelotl.

So I really struggled to make it work and had to make several assumptions and "Uggly hacks" (indicated in the code). In particular, something that is really ugly at the moment is that the V collective operations, where send and receive are process-specific, appear as many times as there are processes, and since I translate on the fly, I do not produce a correct input for SMPI. The easiest solution is probably to use two passes, but never mind for a first proof of concept.

head paraver_trace/bigdft_8_rl.csv
State, 1, MPI_STATE, 0, 10668, 10668, 0, Not created
State, 2, MPI_STATE, 0, 5118733, 5118733, 0, Not created
State, 3, MPI_STATE, 0, 9374527, 9374527, 0, Not created
State, 4, MPI_STATE, 0, 17510142, 17510142, 0, Not created
State, 5, MPI_STATE, 0, 5989994, 5989994, 0, Not created
State, 6, MPI_STATE, 0, 5737601, 5737601, 0, Not created
State, 7, MPI_STATE, 0, 5866978, 5866978, 0, Not created
State, 8, MPI_STATE, 0, 5891099, 5891099, 0, Not created
State, 1, MPI_STATE, 10668, 25576057, 25565389, 0, Running
State, 2, MPI_STATE, 5118733, 18655258, 13536525, 0, Running
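
A minimal Python sketch of how one prv state record could be turned into a CSV line shaped like the ones above. This is not the actual Perl script. It assumes that the second CSV column is the task number of the record, that the state-to-name table is a small hypothetical subset (the real names come from the pcf file), and it ignores the caveat above about state records not being trustworthy. The names STATE_NAMES and state_to_csv are made up for the illustration.

```python
# Hypothetical subset of the state names; a real conversion reads them from the .pcf file.
STATE_NAMES = {1: "Running", 2: "Not created"}

def state_to_csv(record, names=STATE_NAMES):
    """Convert a '1:cpu:appl:task:thread:begin:end:state' record to a CSV line."""
    fields = record.strip().split(":")
    if fields[0] != "1" or len(fields) != 8:
        raise ValueError("not a state record: %r" % record)
    task, begin, end, state = (int(fields[i]) for i in (3, 5, 6, 7))
    # Fall back to a generic label when the state number is not in the table.
    name = names.get(state, "state_%d" % state)
    return "State, %d, MPI_STATE, %d, %d, %d, 0, %s" % (task, begin, end, end - begin, name)
```

The third numeric column is simply the duration, i.e., end minus begin.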

TODO Regression tests

Currently, it works well on an old, small 8-node BigDFT Paraver trace.

TODO Cleanups

A few ugly things had to be done here (reduce, alltoallV, no handling of p2p operations, second/nanosecond issue, …) and need to be cleaned up.

TODO Extrae extension

Maybe it would be interesting to have an option that allows Extrae to trace all the parameters?

TODO Distinguish between MPI processes and nodes

Give a try to lulesh2.0_p216_n108_t1.chop1.

TODO Correctly handle (I)sends and (I)recv

Here is an excerpt from lulesh2.0_p216_n108_t1.chop1.prv

2:10:1:10:1:625871180:50000001:3    # 2:...50000001:3 is MPI_Isend
1:10:1:10:1:626136517:626252559:1
2:10:1:10:1:626136517:50000001:0    # 2:...50000001:0 is Outside MPI
3:10:1:10:1:625871180:626136517:46:1:46:1:429496:677301531:104544:1024
                                    # This is a communication starting at the same
                                    # time as the MPI_Isend (625871180) and ending
                                    # at 677301531. The 104544 is the size and the
                                    # 1024 is the tag.
                                    # This tells us that the emitter is process 10:1:10:1
                                    # while the receiver is process 46:1:46:1
1:10:1:10:1:626252559:626484354:10
2:10:1:10:1:626252559:50000001:3    # again, another MPI_Isend
1:10:1:10:1:626484354:626601813:1
2:10:1:10:1:626484354:50000001:0    # computing outside MPI, blabla
...
...
...                                 # and way later....
...
1:46:1:46:1:677298906:677301531:8
2:46:1:46:1:677298906:50000001:5    # And here finally comes the MPI_Wait on the receiver side
1:46:1:46:1:677301531:677416782:1
2:46:1:46:1:677301531:50000001:0    # followed by computations
                                    # But I could not find anything about the 
                                    # corresponding Irecv...
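
To make the layout of these three record types explicit, here is a minimal Python sketch of a parser for them. The field names (logical/physical send and receive times, size, tag) follow my reading of the excerpt and of the Paraver trace format, and the names State, Event, Comm and parse_record are hypothetical; none of this is taken from prv2pj.pl. Trailing "#" annotations such as the ones used in this document are stripped before parsing.

```python
from typing import NamedTuple, Optional, Tuple, Union

Process = Tuple[int, int, int, int]  # cpu:application:task:thread, e.g. 10:1:10:1

class State(NamedTuple):
    process: Process
    begin: int
    end: int
    state: int

class Event(NamedTuple):
    process: Process
    time: int
    pairs: Tuple[Tuple[int, int], ...]  # (event type, event value) pairs

class Comm(NamedTuple):
    sender: Process
    logical_send: int
    physical_send: int
    receiver: Process
    logical_recv: int
    physical_recv: int
    size: int
    tag: int

def parse_record(line: str) -> Optional[Union[State, Event, Comm]]:
    """Parse one prv line; return None for headers, blanks and unknown lines."""
    text = line.split("#", 1)[0].strip()  # drop annotations and header lines
    fields = text.split(":")
    if not text or not all(field.isdigit() for field in fields):
        return None
    f = [int(field) for field in fields]
    if f[0] == 1 and len(f) == 8:
        return State(tuple(f[1:5]), f[5], f[6], f[7])
    if f[0] == 2 and len(f) >= 8 and len(f) % 2 == 0:
        # An event record may carry several type:value pairs after the timestamp.
        return Event(tuple(f[1:5]), f[5], tuple(zip(f[6::2], f[7::2])))
    if f[0] == 3 and len(f) == 15:
        return Comm(tuple(f[1:5]), f[5], f[6], tuple(f[7:11]), f[11], f[12], f[13], f[14])
    return None
```

Applied to the "3:" line of the excerpt, this gives sender (10, 1, 10, 1), receiver (46, 1, 46, 1), size 104544 and tag 1024.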

So to summarize, here is where the different pieces of information about this particular communication appear in the file:

cd ~/Work/SimGrid/bsc/
grep -n -e 625871180 -e 677301531 -e 677298906 ~/Work/SimGrid/bsc/lulesh2.0_p216_n108_t1.chop1.prv | sed 's/:/      /'
870      1:10:1:10:1:625786054:625871180:1
872      1:10:1:10:1:625871180:626136517:10
873      2:10:1:10:1:625871180:50000001:3
876      3:10:1:10:1:625871180:626136517:46:1:46:1:429496:677301531:104544:1024
10046      1:46:1:46:1:677297322:677298906:1
10048      1:46:1:46:1:677298906:677301531:8
10049      2:46:1:46:1:677298906:50000001:5
10050      1:46:1:46:1:677301531:677416782:1
10051      2:46:1:46:1:677301531:50000001:0

There is thus no way to do an online conversion without storing every communication, so I'll go for a two-pass conversion. I first parse all the "3:" lines, which contain information about "who, what and when", and then I'll use this information when converting the "2:" lines, which explain how the communication is done.
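
A minimal Python sketch of such a two-pass scheme, restricted to Isend. It assumes that an Isend event and its "3:" record share the same sender and the same timestamp, as in the excerpt above. The function names and the output tuples are hypothetical and do not reflect the SMPI trace format or the actual script.

```python
from collections import defaultdict

P2P_TYPE = "50000001"  # event type of the point-to-point calls in the excerpt above
ISEND = "3"            # event value seen for MPI_Isend in the excerpt above

def index_communications(lines):
    """Pass 1: map (sender, send time) to the communications starting there."""
    index = defaultdict(list)
    for line in lines:
        f = line.strip().split(":")
        if f[0] == "3" and len(f) == 15:
            sender, receiver = ":".join(f[1:5]), ":".join(f[7:11])
            index[(sender, int(f[5]))].append(
                {"receiver": receiver, "size": int(f[13]), "tag": int(f[14])})
    return index

def convert_isends(lines, index):
    """Pass 2: yield (sender, receiver, size, tag) for each Isend that can be matched."""
    for line in lines:
        f = line.strip().split(":")
        if f[0] != "2" or len(f) < 8:
            continue
        if (P2P_TYPE, ISEND) not in zip(f[6::2], f[7::2]):
            continue
        process, time = ":".join(f[1:5]), int(f[5])
        # An Isend without any matching "3:" record is silently skipped.
        for comm in index.get((process, time), []):
            yield process, comm["receiver"], comm["size"], comm["tag"]
```

Both passes iterate over the same lines, so a file has to be read twice (or rewound with seek(0)) between the two calls.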

Unfortunately, even when doing this, I can get what I think is the corresponding wait but not the receive operation. And this is where things get strange. The first Irecv operations on this node appear way later:

grep -n -e '2:46:1:46:1:.*:50000001:[4]' ~/Work/SimGrid/bsc/lulesh2.0_p216_n108_t1.chop1.prv | sed 's/:/      /' | head
38519      2:46:1:46:1:730783842:50000001:4
38535      2:46:1:46:1:730792925:50000001:4
38539      2:46:1:46:1:730796759:50000001:4
38547      2:46:1:46:1:730800800:50000001:4
38553      2:46:1:46:1:730804050:50000001:4
38561      2:46:1:46:1:730807217:50000001:4
38569      2:46:1:46:1:730810342:50000001:4
38575      2:46:1:46:1:730813384:50000001:4
38593      2:46:1:46:1:730829009:50000001:4
38601      2:46:1:46:1:730832676:50000001:4

Actually, the first point-to-point operations on this process are Isends and Waits and then, way later, there are Irecvs.

grep -n -e '2:46:1:46:1:.*:50000001:[3456]'  ~/Work/SimGrid/bsc/lulesh2.0_p216_n108_t1.chop1.prv | sed 's/:/      /' | head -n 55
1541      2:46:1:46:1:643680033:50000001:3
1550      2:46:1:46:1:644017869:50000001:3
1560      2:46:1:46:1:644360913:50000001:3
1572      2:46:1:46:1:644700749:50000001:3
1591      2:46:1:46:1:645320878:50000001:3
1610      2:46:1:46:1:645643089:50000001:3
1637      2:46:1:46:1:645925591:50000001:3
1647      2:46:1:46:1:645953216:50000001:3
1652      2:46:1:46:1:645976341:50000001:3
1657      2:46:1:46:1:645999716:50000001:3
1662      2:46:1:46:1:646020758:50000001:3
1672      2:46:1:46:1:646042967:50000001:3
1682      2:46:1:46:1:646067342:50000001:3
1692      2:46:1:46:1:646085550:50000001:3
1702      2:46:1:46:1:646104551:50000001:3
1711      2:46:1:46:1:646127342:50000001:3
1724      2:46:1:46:1:646148468:50000001:3
1739      2:46:1:46:1:646170051:50000001:3
1744      2:46:1:46:1:646188135:50000001:3
1749      2:46:1:46:1:646202593:50000001:3
1762      2:46:1:46:1:646216468:50000001:3
1773      2:46:1:46:1:646241302:50000001:3
1778      2:46:1:46:1:646254093:50000001:3
1783      2:46:1:46:1:646267802:50000001:3
1796      2:46:1:46:1:646279260:50000001:3
1803      2:46:1:46:1:646291969:50000001:3
1808      2:46:1:46:1:646306552:50000001:6
10049      2:46:1:46:1:677298906:50000001:5
10079      2:46:1:46:1:677416782:50000001:5
10100      2:46:1:46:1:677530782:50000001:5
21162      2:46:1:46:1:707872381:50000001:5
21231      2:46:1:46:1:708029966:50000001:5
21459      2:46:1:46:1:708461719:50000001:5
34674      2:46:1:46:1:728184448:50000001:5
34694      2:46:1:46:1:728210073:50000001:5
34703      2:46:1:46:1:728236323:50000001:5
34712      2:46:1:46:1:728246740:50000001:5
34716      2:46:1:46:1:728256157:50000001:5
34730      2:46:1:46:1:728263990:50000001:5
34736      2:46:1:46:1:728273282:50000001:5
34744      2:46:1:46:1:728284823:50000001:5
34756      2:46:1:46:1:728293115:50000001:5
34768      2:46:1:46:1:728301365:50000001:5
34778      2:46:1:46:1:728313407:50000001:5
34796      2:46:1:46:1:728320824:50000001:5
34813      2:46:1:46:1:728330115:50000001:5
34821      2:46:1:46:1:728334740:50000001:5
34857      2:46:1:46:1:728352824:50000001:5
34869      2:46:1:46:1:728358116:50000001:5
38485      2:46:1:46:1:730759633:50000001:5
38489      2:46:1:46:1:730765425:50000001:5
38497      2:46:1:46:1:730769883:50000001:5
38501      2:46:1:46:1:730774300:50000001:5
38519      2:46:1:46:1:730783842:50000001:4
38535      2:46:1:46:1:730792925:50000001:4

So in the previous output, there are first 26 Isends, then a Waitall that probably generates the 26 corresponding Waits (but how can I know which requests were actually provided to the Waitall?) and then finally the corresponding series of MPI_Irecv that will be waited for way later. It turns out that on the receiver side, MPI handles the receptions while doing some wait on the Isend, so I have absolutely no way to match them, nor to know in which order the receives were done and on which particular receives the receiver was waiting (even if it seems that in this particular case it did not happen).
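
Counting such runs by eye in grep output is error-prone, so here is a minimal Python sketch that run-length encodes the point-to-point calls of one process. The value-to-name mapping (3 Isend, 4 Irecv, 5 Wait, 6 Waitall) is the one read off the listings in this document, and the names NAMES and call_runs are hypothetical.

```python
from itertools import groupby

P2P_TYPE = "50000001"
NAMES = {"3": "Isend", "4": "Irecv", "5": "Wait", "6": "Waitall"}

def call_runs(lines, process):
    """Return [(call name, count), ...] for consecutive identical calls of a process."""
    prefix = "2:" + process + ":"
    calls = []
    for line in lines:
        line = line.strip()
        if not line.startswith(prefix):
            continue
        f = line.split(":")
        # Keep only the point-to-point calls; value 0 (outside MPI) is ignored.
        for event_type, value in zip(f[6::2], f[7::2]):
            if event_type == P2P_TYPE and value in NAMES:
                calls.append(NAMES[value])
    return [(name, len(list(group))) for name, group in groupby(calls)]
```

Calling call_runs on the lines of the prv file with process "46:1:46:1" would give the same sequence as the grep above in condensed form.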

Actually, after discussing this with Judit, it appears that the trace was cut. So the waits we see on the receiver side actually correspond to Irecvs that are not present in the trace (and not to previous Isends as I initially thought). E.g.,

grep -n -e '3:.*46:1:46:1:.*' ~/Work/SimGrid/bsc/lulesh2.0_p216_n108_t1.chop1.prv | sed 's/:/      /' | head -n 54
876      3:10:1:10:1:625871180:626136517:46:1:46:1:429496:677301531:104544:1024
1369      3:9:1:9:1:636610547:636635631:46:1:46:1:456247:728238073:1584:1024
1544      3:46:1:46:1:643680033:643922576:10:1:10:1:241942:648985350:104544:1024
1555      3:46:1:46:1:644017869:644238078:82:1:82:1:248060:679151189:104544:1024
1570      3:46:1:46:1:644360913:644574248:40:1:40:1:109029:709356245:104544:1024
1581      3:46:1:46:1:644700749:644918042:52:1:52:1:300746:702191368:104544:1024
1594      3:46:1:46:1:645320878:645333170:45:1:45:1:455956:702425925:104544:1024
1633      3:46:1:46:1:645643089:645908674:47:1:47:1:202569:730677120:104544:1024
1645      3:46:1:46:1:645925591:645948883:39:1:39:1:119654:710792058:1584:1024
1650      3:46:1:46:1:645953216:645969966:4:1:4:1:189402:708996725:1584:1024
1655      3:46:1:46:1:645976341:645992758:9:1:9:1:268400:715031158:1584:1024
1660      3:46:1:46:1:645999716:646016675:53:1:53:1:691480:727903557:1584:1024
1670      3:46:1:46:1:646020758:646036925:88:1:88:1:403516:711836359:1584:1024
1680      3:46:1:46:1:646042967:646058425:83:1:83:1:238664:730162810:1584:1024
1690      3:46:1:46:1:646067342:646081717:51:1:51:1:356288:714355232:1584:1024
1700      3:46:1:46:1:646085550:646100717:76:1:76:1:244486:730628399:1584:1024
1707      3:46:1:46:1:646104551:646119176:81:1:81:1:313561:715956508:1584:1024
1719      3:46:1:46:1:646127342:646144801:41:1:41:1:290240:727544484:1584:1024
1735      3:46:1:46:1:646148468:646164718:16:1:16:1:297752:719185976:1584:1024
1742      3:46:1:46:1:646170051:646185510:11:1:11:1:89850:719665701:1584:1024
1747      3:46:1:46:1:646188135:646200635:3:1:3:1:202611:711050599:24:1024
1757      3:46:1:46:1:646202593:646214676:75:1:75:1:265069:722029411:24:1024
1769      3:46:1:46:1:646216468:646228301:5:1:5:1:478618:728457736:24:1024
1776      3:46:1:46:1:646241302:646251802:77:1:77:1:639886:733204626:24:1024
1781      3:46:1:46:1:646254093:646265802:15:1:15:1:314170:721488302:24:1024
1792      3:46:1:46:1:646267802:646277427:87:1:87:1:468891:722540417:24:1024
1801      3:46:1:46:1:646279260:646289885:17:1:17:1:71365:728537857:24:1024
1806      3:46:1:46:1:646291969:646304427:89:1:89:1:544023:728380413:24:1024
2076      3:88:1:88:1:648286682:648319557:46:1:46:1:462622:728258657:1584:1024
2206      3:4:1:4:1:648613124:648631708:46:1:46:1:453288:728228656:1584:1024
2658      3:51:1:51:1:651750721:651793430:46:1:46:1:480413:728274907:1584:1024
3363      3:83:1:83:1:654689958:654751209:46:1:46:1:476538:728265782:1584:1024
4280      3:82:1:82:1:656929413:657166832:46:1:46:1:433705:677419823:104544:1024
5292      3:52:1:52:1:660879922:661167926:46:1:46:1:440705:707876673:104544:1024
9355      3:45:1:45:1:675422726:675439059:46:1:46:1:443830:708035007:104544:1024
10276      3:81:1:81:1:678370724:678409683:46:1:46:1:486497:728295032:1584:1024
11017      3:39:1:39:1:681302334:681349627:46:1:46:1:449997:728192031:1584:1024
11410      3:75:1:75:1:682741815:682787982:46:1:46:1:500747:728336699:24:1024
12859      3:76:1:76:1:686359862:686386779:46:1:46:1:483788:728287407:1584:1024
13284      3:5:1:5:1:687447840:687458590:46:1:46:1:503830:728355032:24:1024
14768      3:17:1:17:1:691480245:691684582:46:1:46:1:515247:730771675:24:1024
15926      3:41:1:41:1:694613356:694662273:46:1:46:1:489372:728303032:1584:1024
16621      3:87:1:87:1:696696110:696711985:46:1:46:1:512122:730766883:24:1024
17045      3:16:1:16:1:698185630:698201922:46:1:46:1:492039:728315449:1584:1024
17677      3:40:1:40:1:700177777:700446073:46:1:46:1:437288:707713047:104544:1024
17774      3:11:1:11:1:700480087:700788134:46:1:46:1:495164:728322324:1584:1024
20077      3:53:1:53:1:705672360:705713485:46:1:46:1:459580:728248865:1584:1024
21080      3:3:1:3:1:707627740:707643615:46:1:46:1:497997:728332115:24:1024
21595      3:89:1:89:1:708657448:708778949:46:1:46:1:519164:730776092:24:1024
26894      3:15:1:15:1:717486244:717532578:46:1:46:1:509372:730762717:24:1024
33288      3:47:1:47:1:723885205:725113262:46:1:46:1:446997:727742403:104544:1024
38242      3:77:1:77:1:730635045:730651337:46:1:46:1:506622:730746717:24:1024
48766      3:46:1:46:1:771492475:771554143:10:1:10:1:728395763:778474104:209088:2048
48833      3:46:1:46:1:771814853:771848228:40:1:40:1:730372678:799469231:209088:2048

As can be seen, except for the last two entries, all the logical receive times are completely bogus. From this, we can make the following assumptions (which will be quite annoying if the control flow of the code is non-deterministic):

  • If we find an Isend, a Wait or an Irecv whose date does not correspond to a communication record (3:), we should just skip it.
  • If we find a Waitall, well, we need to think about it to be sure. :(

We will still have the issue that we should actually collect the request ids to do the matching correctly and to gracefully handle wait_any and wait_all.

4.2 Let's try to replay on SMPI

Method

This is the platform file I currently use for replaying:

graphene.xml

The script used for calling smpi is actually quite simple: smpi2pj.sh

DONE Use command line arguments

TODO Trace the running state (outside MPI)

Currently, it does not appear in the paje trace

4.3 Pjdump/smpi to Paraver Conversion

Method

Here is my ugly script with many hardcoded values: pjdump2prv.pl

DONE Collective naming

Improve the conversion to export events so that collective operation names are the same and things are easily comparable. This was done in Chicago with Harald.

DONE Factorization

There were originally two scripts (pjdump2prv.pl and pjsmpi2prv.pl). I've finally merged them into a single one: pjdump2prv.pl

TODO Links

Add links (arrows) so that bandwidth can be computed in paraver

4.4 Gluing everything together to allow calling SMPI

Method

The Dimemas wrapper called by paraver is file:///usr/local/stow/wxparaver-4.5.4-linux-x86_64/bin/dimemas-wrapper.sh

Here is how I proceeded. I first set the original wrapper aside as a backup:

  mv /usr/local/stow/wxparaver-4.5.4-linux-x86_64/bin/dimemas-wrapper.sh /usr/local/stow/wxparaver-4.5.4-linux-x86_64/bin/dimemas-wrapper.sh.backup

Basically, what I wanted to do is something like

perl prv2pj.pl
sh smpi2pj.sh >/dev/null
perl pjsmpi2prv.pl

So here is an equivalent version inspired from the dimemas wrapper: dimemas-wrapper.sh

TODO Library issue

When running inside Paraver, I can't call pj_dump from my Perl script. When trimming the fat to get an error message, here is what you get.

---> Input file is a paje trace. Running /home/alegrand/bin/pj_dump  /tmp/EXTRAE_Paraver_trace_mpich.sim.trace 2>&1
-----> /home/alegrand/bin/pj_dump: /usr/local/stow/wxparaver-4.5.4-linux-x86_64/lib/paraver-kernel/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /home/alegrand/lib/libpaje.so.1)
---> Intermediary file /tmp/EXTRAE_Paraver_trace_mpich.sim.pjdump

I think this is due to the fact that paraver is often statically compiled and must be doing something strange with dynamic libraries pre-loading.
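
One possible workaround, untested here, would be to launch pj_dump with an environment from which the Paraver library directories have been removed. The Python sketch below assumes that Paraver's libstdc++ is picked up through LD_LIBRARY_PATH, which the error message suggests but does not prove. The function name env_without is hypothetical.

```python
import os

def env_without(prefix, env=None, var="LD_LIBRARY_PATH"):
    """Return a copy of env where the entries of var located under prefix are removed."""
    env = dict(os.environ if env is None else env)
    kept = [path for path in env.get(var, "").split(os.pathsep)
            if path and not path.startswith(prefix)]
    if kept:
        env[var] = os.pathsep.join(kept)
    else:
        # Nothing left: drop the variable rather than exporting an empty one.
        env.pop(var, None)
    return env
```

The resulting dictionary could then be passed as the env argument of subprocess.run when calling pj_dump, with the wxparaver installation directory as prefix.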

TODO Better integration

Currently, we replace the Dimemas wrapper and the platform file is hardcoded… This should be changed so that the platform and deployment can be specified.