Welcome to SMPI Calibration's documentation!

The archive is available here

1 Welcome to SimGrid Calibration's documentation!

1.1 What is this all about?

In order to provide accurate timings for SMPI simulations on as many platforms as possible, it is vital to take several idiosyncrasies of each platform into account. With this tool, several MPI benchmarks are executed automatically on your machine. The subsequent statistical analysis of the collected data then determines several parameters used by SimGrid, making simulations much more reliable. These parameters are written to an XML file.

This software package is developed and maintained by the SimGrid team.

1.2 Prerequisites / Software Dependencies

  1. GNU R with the following packages:
    • plyr
    • knitr
    • XML
    • ggplot2

    On recent Debian-based systems, just run

       sudo apt-get install r-base r-cran-ggplot2 r-cran-plyr r-cran-xml
    

    knitr is, to the best of our knowledge, not packaged in Debian and thus has to be installed from within R with

       install.packages("knitr")
    
  2. libxml2. On Debian-based systems, just run
      sudo apt-get install libxml2-dev
    
  3. A working MPI implementation and two nodes with the minimum number of links between them. On a Debian system, you could, for example, run:
      sudo apt-get install openmpi-doc openmpi-bin libopenmpi-dev
    
  4. When you execute the script for the first time, it will attempt to install the packages required by R automatically.

    However, packages in the public R repository (CRAN) may require a newer version of R than is currently installed on your system. In this case, the easiest approach is to update R to the latest version (see below), update all your R packages and finally install plyr, ggplot2 and XML.

    The procedure to update to the latest version of R may depend on your system. For instance, when using an older Ubuntu release, you could proceed as described in the answer to this question on StackOverflow. After R is updated, issue this command in order to update all packages as well:

       update.packages(checkBuilt = TRUE, ask = FALSE)
    

1.3 Compiling

Once all dependencies are installed and available, you can simply compile by executing

  make

1.4 Running

Note
This section only describes how to execute the functionality this software package provides; it is presumed that you have already set up a working configuration. For details on how to configure this package, see the respective section below.

Execute the MPI benchmarks on your machine. When running on your actual platform, you also have to provide a hostfile; otherwise, both processes might be started on the same node, which defeats the purpose, since it is the network that is supposed to be benchmarked. Here is a typical call:

  mpirun -np 2 ./calibrate -s zoo_sizes -M 1000000

Note
If possible, you should run your experiments in isolation, that is, all network links, routers/switches, nodes etc. should only be allocated to you; otherwise, noise will influence the results of this benchmark.

This command uses a configuration file (generally called config.xml, which you have to create yourself; a working example is discussed in the section "Working out a full example"). Most configuration options can, however, also be set from the command line. All possible configuration options are described in the section "Configuration options".

After termination of the benchmarks, you can execute the statistical analysis via

  make NAME.html

where NAME needs to be replaced with your specific filename. (This command requires NAME.xml to exist.)

You can then open NAME.html in a web browser and view the results.

2 Tests employed by this package

2.1 MPI_Recv

This experiment measures the time spent in MPI_Recv calls. Beforehand, it runs some tests to determine how long to wait before calling MPI_Recv, so that the waiting time inside the call is minimized as much as possible. The test then simply sleeps for this amount of time before the call to MPI_Recv is eventually issued.

2.2 MPI_Isend

This experiment measures the time spent in MPI_Isend calls. Note that it is expected that this measure differs from the time measured for MPI_Send (in the pingpong experiment, see below).

2.3 Ping Pong

This experiment works as follows: First, node 0 sends a message to node 1; after the receive has been completed by node 1, it sends a message back to node 0 which in turn is received. The times measured here are the times required to return from the function calls; that is, it is measured how long it takes before MPI_Send and MPI_Recv, respectively, return.

2.4 MPI_Wtime

This test executes MPI_Wtime in a loop; only the total time taken for all executions is measured. This time is then used to compute the average time spent in a single MPI_Wtime call.

Note
This will always execute 10,000,000 times, independent of your value set via the iterations tag.
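
As a rough illustration of the averaging described above, the following Python sketch times a loop of clock calls and divides the total by the number of calls. It is not the code used by calibrate: time.perf_counter stands in for MPI_Wtime, and the function name average_call_time and the reduced call count are assumptions made for this example.

```python
import time


def average_call_time(clock, calls):
    """Time `calls` back-to-back invocations of `clock`; return the mean cost of one."""
    start = time.perf_counter()
    for _ in range(calls):
        clock()
    total = time.perf_counter() - start
    # Only the total is measured; the per-call time is derived from it.
    return total / calls


# time.perf_counter stands in for MPI_Wtime; 100,000 calls keep the demo short.
avg = average_call_time(time.perf_counter, 100_000)
print(f"average time per call: {avg:.3e} s")
```

Measuring only the total keeps the measurement overhead out of the individual calls, which matters when a single call is about as cheap as the timer itself.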

2.5 MPI_Iprobe

Node 0 and node 1 exchange messages via MPI_Send and MPI_Recv. However, before MPI_Recv is called, it is determined whether a message is ready to be received; this is done by executing MPI_Iprobe in a loop. Only then is MPI_Recv called and the message received.

In this test, only the time spent in the MPI_Iprobe calls is measured.

2.6 MPI_Test

Executes several MPI_Send calls and non-blocking MPI_Irecv calls; MPI_Test is then executed repeatedly to test whether the non-blocking operation has finished. Only the time spent in the MPI_Test operation is measured.

3 Configuration options

    <prefix value="machine_name" />

All output files (i.e., files containing data collected during the experiments and files generated by the statistical analysis) will be prefixed with the prefix machine_name. This allows you to prevent existing files from being overwritten.

    <dirname value="examples" />

All results obtained by the experiments will be stored in directory examples. Note that this directory is always relative to the calibrate binary - not to the location of the XML configuration file! Also, do not add trailing slashes.

    <sizeFile value="filename" />

The experiments will be executed with message sizes taken from filename; by default, this file is called zoo_sizes and contains 1000 different sizes, reaching up to 1 GB. All experiments will send messages with sizes found in the file specified by this option; there is no use of random sizes.

Note
Options minSize and maxSize (described below) can be used to limit sizes without actually changing the contents of filename.
Note
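Such a zoo_sizes file was generated using R with commands like the following:

    n = 1000;   # number of message sizes to generate
    m = 1;      # smallest exponent
    M = 30;     # largest exponent (2^30 is about 1 GB)
    x = ceiling(2**runif(n,min=m,max=M));   # exponents are drawn uniformly between m and M
    write(x,file="/tmp/zoo_sizes",sep="\n");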

Feel free to customize it to suit your needs.

    <minSize value="foo" />

Only values found in the file specified by the <sizeFile/> tag (see above) that are larger than or equal to foo will be considered for the experiments. Everything smaller than foo will not be used. foo is interpreted as kilobytes. The default value for this option is 0.

    <maxSize value="foo" />

Only values found in the file specified by the <sizeFile/> tag (see above) that are smaller than or equal to foo will be considered for the experiments. Everything larger than foo will not be used. foo is interpreted as kilobytes. The default value for this option is 0.
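
The effect of minSize and maxSize can be pictured with a small Python sketch. This is only an illustration, not the selection code in calibrate: the function name select_sizes and the sample values are made up, and for simplicity the bounds are compared directly against the sizes, without reproducing any unit conversion the tool may apply.

```python
def select_sizes(lines, min_size, max_size):
    """Keep the sizes from `lines` (one integer per line) within [min_size, max_size]."""
    sizes = (int(line) for line in lines if line.strip())
    # Both bounds are inclusive, matching "larger/smaller than or equal to".
    return [s for s in sizes if min_size <= s <= max_size]


# Made-up stand-in for the contents of a size file (one size per line).
example_lines = ["12", "800", "5120", "17408", "2000000"]
print(select_sizes(example_lines, 0, 1_000_000))  # [12, 800, 5120, 17408]
```

The size file itself is never modified; only the subset used by the experiments changes.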

    <iterations value="int" />

Determines how often the experiment will be executed for every message size, i.e., the sample size can be increased. Experiments should generally be executed multiple times in order to take one-time effects into account (such as congestion caused by other software). We found a value of around 10 reasonable.

    <outliers_threshold value="float" />

Determines how many outliers are removed from the samples before the statistical analysis. The value provided as float must be a value between 0 and 1, as this value will determine the quantiles.

The following images were all generated on the same dataset, with float=0, float=0.1, float=0.115 and float=0.5, respectively.

outliers_removal_value_0_0.png

Figure 1: Results for float=0

outliers_removal_value_0_1.png

Figure 2: Results for float=0.1

outliers_removal_value_0_115.png

Figure 3: Results for float=0.115

outliers_removal_value_0_5.png

Figure 4: Results for float=0.5

Note
Try to keep this value as low as possible, as higher values will remove more data from your samples; however, make sure that it is still high enough such that all (clearly recognizable) outliers are removed.

    <breakpoints_file value="filename" />

Breakpoints can be used to calculate a piecewise (segmented) regression. This can greatly improve results, as MPI implementations or hardware may employ different algorithms/techniques depending on the actual message size; hence, a small change in message size may at some point lead to vastly different behavior and results. For example, such a breakpoints file may contain:

    Limit, Name
    5120, Medium
    17408, Asynchronous

The two columns Limit and Name are used as follows: Limit sets the upper message size bound for the breakpoint (in bytes), i.e., a new regression will start at Limit. The first regression always starts at 0. Name is simply a descriptor for this breakpoint and will be shown in any generated plot.
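
To make the role of the breakpoints more concrete, here is a minimal Python sketch of a segmented regression. It is not the analysis performed by the R scripts of this package: the helper names (read_breakpoints, segment_index, fit_line, segmented_fit) are hypothetical, the samples are synthetic, and a plain least-squares line per segment is assumed.

```python
import bisect
import csv
import io

# Contents of the example breakpoints file shown above.
BREAKPOINTS = """Limit, Name
5120, Medium
17408, Asynchronous
"""


def read_breakpoints(text):
    """Return the list of (limit, name) pairs found in a breakpoints file."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    next(rows)  # skip the "Limit, Name" header
    return [(int(row[0]), row[1]) for row in rows if row]


def segment_index(size, limits):
    """Index of the segment containing `size`; a new segment starts at each limit."""
    return bisect.bisect_right(limits, size)


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def segmented_fit(samples, limits):
    """Fit one line per segment; returns {segment index: (slope, intercept)}."""
    groups = {}
    for size, duration in samples:
        groups.setdefault(segment_index(size, limits), []).append((size, duration))
    return {idx: fit_line(pts) for idx, pts in sorted(groups.items())}


limits = [limit for limit, _ in read_breakpoints(BREAKPOINTS)]

# Synthetic (size, time) samples, invented for this example only.
samples = [(s, 1e-6 + 1e-9 * s) for s in (1000, 2000, 4000)]
samples += [(s, 5e-6 + 2e-9 * s) for s in (6000, 10000, 16000)]
samples += [(s, 2e-5 + 5e-10 * s) for s in (20000, 40000, 80000)]

for idx, (slope, intercept) in segmented_fit(samples, limits).items():
    print(f"segment {idx}: time = {intercept:.3e} + {slope:.3e} * size")
```

Because each segment is fitted independently, moving a breakpoint changes which samples influence which line; this is why adjusting a limit can noticeably improve the fit.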

    <eager_threshold value="foo" />

Every message with a size larger than foo will use the rendezvous algorithm. The time spent in MPI_Send after crossing this limit will be assumed to be 0 (hence, the line representing the computed regression will plummet to 0). Thresholds for eager and detached messages depend on the library and the hardware used. Consult the documentation of your library on how to display this information if you can't determine it visually. (For asynchronous messages over Ethernet networks we saw values of 65536, while IB networks had values of 12288 or 17408 depending on the implementation. Medium messages on Ethernet networks had a value of 1550.)

    <detached_threshold value="foo" />

MPI_Send will not block until the end of the communication if the message is smaller than foo.

    <expected_bandwidth value="foo" />

This option is not used in the analysis itself but is used to compute the smpi/bw-factor parameter that is used by SimGrid. This parameter needs to be the bandwidth of your network, in bytes per second. So a 10 GBit/s network is 1.25 GByte/s and hence, this option would need to be set to 1.25e9 (1 GByte = 1e9 bytes).

The calibration script can then compute the smpi/bw-factor, which essentially scales the available bandwidth dynamically, based on the message size. (A factor of 0.5, for instance, means that only 50% of the bandwidth is actually usable.)

The importance of this option cannot be overstated; if this option is set incorrectly, your simulations will almost certainly be wrong. Check the manual of your cluster for the maximum bandwidth of your network.

Make sure you set the same value here as you do for your links in your platform.
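
The unit conversion for expected_bandwidth, and the way a bandwidth factor scales it, can be checked with a few lines of Python. This is a sketch for illustration only; the function names are made up and nothing here is part of the calibration scripts.

```python
def bits_to_bytes_per_second(bits_per_second):
    """Convert a link speed in bit/s to byte/s (8 bits per byte)."""
    return bits_per_second / 8


def usable_bandwidth(expected_bandwidth, bw_factor):
    """Bandwidth left once a bandwidth factor is applied."""
    return expected_bandwidth * bw_factor


expected = bits_to_bytes_per_second(10e9)   # 10 GBit/s network
print(expected)                             # 1250000000.0, i.e., 1.25e9
print(usable_bandwidth(expected, 0.5))      # 625000000.0, i.e., 50% of 1.25e9
```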

    <expected_latency value="foo" />

This is similar to the expected_bandwidth option above; ask your system administrators for the correct value.

Make sure you set the same value here as you do for the latency of your links in your platform.

4 Working out a full example: "stampede"

Note
The files used in this example can be found in the archive and are prefixed with "stampede".

In this section, all steps will be briefly reviewed. The configuration file "stampede.xml" looks as follows:

<?xml version="1.0"?>
<config id="Config">
<!-- prefix name for the output files -->
 <prefix value="stampede"/>
<!-- directory name for the output files (as seen from calibrate.c) -->
 <dirname value="examples"/>


<!-- Name of the file that contains all message sizes we can choose from. -->
 <sizeFile value="zoo_sizes"/>
<!-- Minimum size of the messages to send-->
 <minSize value="0"/>
<!-- Maximum size of the messages to send-->
 <maxSize value="1000000"/>
<!-- Number of iterations per size of message-->
 <iterations value="5"/>
<!-- Outliers removal - Remove n minimum and maximum times per run-->
 <outliers_threshold value="0.1"/>

<!-- File containing the breakpoints; breakpoints define the begin/end of any segment of the (segmented) regression -->
 <breakpoints_file value="stampede_breakpoints"/>
 <eager_threshold value="17408"/>
 <detached_threshold value="17408"/>

 <!-- This value will be used only when generating the output file, i.e., solely for your convenience -->
 <expected_bandwidth value="7e9"/>
 <!-- This value will be used only when generating the output file, i.e., solely for your convenience -->
 <expected_latency value="1e-5"/>
</config>
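
For readers who want to inspect such a configuration programmatically, the following Python sketch reads the value attribute of every option. It is not how calibrate itself parses the file (calibrate relies on libxml2); the function name read_config is hypothetical, and the embedded XML is a shortened copy of the example above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the stampede.xml example (most comments omitted).
CONFIG = """<?xml version="1.0"?>
<config id="Config">
 <!-- prefix name for the output files -->
 <prefix value="stampede"/>
 <dirname value="examples"/>
 <sizeFile value="zoo_sizes"/>
 <minSize value="0"/>
 <maxSize value="1000000"/>
 <iterations value="5"/>
 <outliers_threshold value="0.1"/>
 <breakpoints_file value="stampede_breakpoints"/>
 <eager_threshold value="17408"/>
 <detached_threshold value="17408"/>
 <expected_bandwidth value="7e9"/>
 <expected_latency value="1e-5"/>
</config>
"""


def read_config(text):
    """Map each option tag to its 'value' attribute (as a string)."""
    root = ET.fromstring(text)
    # XML comments are dropped by the default parser, so only options remain.
    return {child.tag: child.get("value") for child in root}


config = read_config(CONFIG)
print(config["prefix"], int(config["iterations"]), float(config["expected_bandwidth"]))
```

All values are returned as strings; converting them (for example with int or float) is left to the caller, since the options have different types.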

The calibration was then executed by running

    mpirun -np 2 ./calibrate -f stampede.xml

The data collected during the experiments is stored in files that were created during the execution. These files are later analyzed with R.

Note
The format of these files (i.e., meaning of columns) is as follows:
  • MPI_PingPong: MPI operation type, bytes sent/received, start time, total time taken for the send operation / time required to receive the last sent byte
  • MPI_Isend: MPI operation type, bytes sent, start time, total time

The statistical analysis uses these files and the data stored within. The breakpoints in the file stampede_breakpoints are initially defined as follows (they will be tweaked later on):

   Limit, Name
   5120, Medium
   17408, Asynchronous

Having run the experiments successfully, the statistical analysis can be dispatched. This is simple:

  make stampede.html

This command will create a file called stampede.html, which can be opened in any web browser to view information and several visualizations. The visualizations can be used to determine whether the computed regression is correct (the regression is depicted as a black line in the images). If not, the breakpoints should be tweaked accordingly. Thresholds for eager and detached messages depend on the library and the hardware used. Consult the documentation of your library on how to display this information if you can't determine it visually. (For Ethernet networks we saw values of 65536, while IB networks had values of 12288 or 17408 depending on the implementation.)

For MPI_Send, the previously shown configuration and breakpoints result in the regression as visualized below:

breakpoints_before.png

Obviously, the regression in the first segment (visualized in blue) is seriously influenced by the "jump" in the samples, as the curve is "pulled up" and does not match the available samples very well after around 1e+03. This can be amended by tweaking the breakpoints, as follows (in fact, only the first breakpoint was changed):

    Limit, Name
    4020, Medium
    17408, Asynchronous

Re-running the analysis results in a much smoother regression:

breakpoints_after.png

The curve now clearly fits the samples very well.

4.1 Generated output

The output generated by this example is an XML file, which configures SMPI for use with SimGrid. It can be included in your platform file.

The output itself looks like this:

5 Varying experimental results

The experiments conducted will only account for the state of the system at the time the experiments were run. This means that even seconds later, the same configuration may yield different results.

The following four figures illustrate this behavior; here, the system is a notebook that is only used for demonstration purposes, and all processes were run on this one machine. (When running the calibration on multiple nodes, you should see less variation.) The first three figures differ vastly in the part colored in blue, including different values for the regression. The fourth figure, on the other hand, shows that large messages required much more time than during the other runs.

same_configuration_different_results_1.png

Figure 7: Run 1

same_configuration_different_results_2.png

Figure 8: Run 2

same_configuration_different_results_3.png

Figure 9: Run 3

same_configuration_different_results_4.png

Figure 10: Run 4