SimGrid  3.19.1
Versatile Simulation of Distributed Systems
SimGrid Tutorial with MSG

Introduction

Settings

Warning
Before you take this tutorial, you should remember that the MSG interface is currently deprecated. It means that it will remain as is, inchanged, for a few years, but that new developments should use the new S4U interface instead. Unfortunately, the S4U tutorial is not written yet. Sorry about that.

This tutorial will guide your create and run your first SimGrid simulator. Let's consider the following scenario:

Assume we have a (possibly large) bunch of (possibly large) data to process and which originally reside on a server (a.k.a. master). For sake of simplicity, we assume all input file require the same amount of computation. We assume the server can be helped by a (possibly large) set of worker machines. What is the best way to organize the computations ?

image/svg+xml Master T T Worker Worker Worker Worker Worker How should the masterdistribute the tasks? T T T T T T T T T T T T T T T T T T T T T T T T T ?

Raised Questions

Although this looks like a very simple setting it raises several interesting questions:

  • Which algorithm should the master use to send workload?

    The provided code sends the tasks to the workers with a trivial round-robin algorithm. It would probably be more efficient if the workers were asking for tasks, to let the master distribute the tasks in a more cleaver way.

  • Should the worker specify how many tasks they want? Or should the master decide everything?

    The workers will starve if they don't get the tasks fast enough. One possibility to reduce latency would be to send tasks in pools instead of one by one. But if the pools are too big, the load balancing will likely get uneven, in particular when distributing the last tasks.

  • How does the quality of such algorithm dependent on the platform characteristics and on the task characteristics?

    Whenever the input communication time is very small compared to processing time and workers are homogeneous, it is likely that the round-robin algorithm performs very well. Would it still hold true when transfer time is not negligible and the platform is, say, a volunteer computing system ? What if some tasks are performed faster on some specific nodes?

  • The network topology interconnecting the master and the workers may be quite complicated. How does such a topology impact the previous result?

    When data transfers are the bottleneck, it is likely that a good modeling of the platform becomes essential. The SimGrid platform models are particularly handy to account for complex platform topologies.

  • What topology to use for the application?

    Is a flat master worker deployment sufficient? Should we go for a hierarchical algorithm, with some forwarders taking large pools of tasks from the master, each of them distributing their tasks to a sub-pool of workers? Or should we introduce super-peers, dupplicating the master's role in a peer-to-peer manner? Do the algorithms require a perfect knowledge of the network?

  • How is such an algorithm sensitive to external workload variation?

    What if bandwidth, latency and computing speed can vary with no warning? Shouldn't you study whether your algorithm is sensitive to such load variations?

  • Although an algorithm may be more efficient than another, how does it interfere with other applications?
  • Etc, etc.

As you can see, this very simple setting may need to evolve way beyond what you initially imagined. And this is a good news.

But don't believe the fools saying that all you need to study such settings is a simple discrete event simulator. Do you really want to reinvent the wheel, write your own tool, debug it, optimize it and validate its models against real settings for ages, or do you prefer to sit on the shoulders of a giant?
With SimGrid, you can forget about most technical details (but not all), and focus on your algorithm. The whole simulation mechanism is already working.

Envisionned Study

The following figure is a screenshot of triva visualizing a SimGrid simulation of two master worker applications (one in light gray and the other in dark gray) running in concurrence and showing resource usage over a long period of time.

sc3-description.png
Test

Getting Started

Prerequisite

In this example, we use Pajeng and Vite to visualize the result of SimGrid simulations. These external tools are usually very easy to install. On Debian and Ubuntu for example, you can get them as follows:

sudo apt-get install pajeng vite

Setting up and Compiling

The corresponding source files can be obtained online on GitLab. There is a button on the top right to download the whole directory in one archive file. If you wish, other platform files are available from this GitLab directory.

As you can see, there is already a little Makefile that compiles everything for you. If you struggle with the compilation, then you should double check your SimGrid installation. On need, please refer to the Troubleshooting your project setup section.

Discovering the provided simulator

Please compile and execute the provided simulator as follows:

make masterworker
./masterworker examples/platforms/small_platform.xml deployment0.xml

For a more "fancy" output, you can use simgrid-colorizer.

./masterworker examples/platforms/small_platform.xml deployment0.xml 2>&1 | simgrid-colorizer

If you installed SimGrid to a non-standard path, you may have to specify the full path to simgrid-colorizer on the above line, such as /opt/simgrid/bin/simgrid-colorizer. If you did not install it at all, you can find it in <simgrid_root_directory>/bin/colorize.

For a classical Gantt-Chart visualization, you can produce a Paje trace:

./masterworker platforms/platform.xml deployment0.xml --cfg=tracing:yes \
--cfg=tracing/msg/process:yes
pajeng simgrid.trace

Alternatively, you can use vite.

./masterworker platforms/platform.xml deployment0.xml --cfg=tracing:yes \
--cfg=tracing/msg/process:yes --cfg=tracing/basic:yes
vite simgrid.trace

Understanding this source code

Explore the doc/tuto-msg/masterworker.c source file. It contains 3 functions:

  • master: that's the code executed by the master process.
    It creates a large array containing all tasks, dispatches all tasks to the workers and then dispatch specific tasks which name is "finalize".
  • worker: each workers will execute this function.
    That's an infinite loop waiting for incomming tasks. We exit the loop if the name of the received task is "finalize", or process the task otherwise.
  • main: this setups the simulation.

How does SimGrid know that we need one master and several workers? Because it's written in the deployment file (called deployment0.xml), that we pass to MSG_create_environment() during the setup.

<?xml version='1.0'?>
<!DOCTYPE platform SYSTEM "http://simgrid.gforge.inria.fr/simgrid/simgrid.dtd">
<platform version="4.1">
<!-- The master actor (with some arguments) -->
<actor host="Tremblay" function="master">
<argument value="20"/> <!-- Number of tasks -->
<argument value="50000000"/> <!-- Computation size of tasks -->
<argument value="1000000"/> <!-- Communication size of tasks -->
<argument value="Jupiter"/> <!-- First worker -->
<argument value="Fafard"/> <!-- Second worker -->
<argument value="Ginette"/> <!-- Third worker -->
<argument value="Bourassa"/> <!-- Last worker -->
<argument value="Tremblay"/> <!-- Me! I can work too! -->
</actor>
<!-- The worker actor (with no argument) -->
<actor host="Tremblay" function="worker" on_failure="RESTART"/>
<actor host="Jupiter" function="worker" on_failure="RESTART"/>
<actor host="Fafard" function="worker" on_failure="RESTART"/>
<actor host="Ginette" function="worker" on_failure="RESTART"/>
<actor host="Bourassa" function="worker" on_failure="RESTART"/>
</platform>

Exercise 1: Simplifying the deployment file

In the provided example, the deployment file deployment0.xml is tightly connected to the platform file small_platform.xml and adding more workers quickly becomes a pain: You need to start them (at the bottom of the file), add to inform the master that they are available (in the master parameters list).

Instead, modify the simulator masterworker.c into masterworker-exo1.c so that the master launches a worker process on all the other machines at startup. The new deployment file deployment1.xml should be as simple as:

<?xml version='1.0'?>
<!DOCTYPE platform SYSTEM "http://simgrid.gforge.inria.fr/simgrid/simgrid.dtd">
<platform version="4.1">
<!-- The master actor (with some arguments) -->
<actor host="Tremblay" function="master">
<argument value="20"/> <!-- Number of tasks -->
<argument value="50000000"/> <!-- Computation size of tasks -->
<argument value="1000000"/> <!-- Communication size of tasks -->
</actor>
</platform>

For that, the master needs to retrieve the list of hosts declared in the platform, with the following functions (follow the links for their documentation):

Then, the master should start the worker processes with the following function:

Increasing configurability

The worker processes wait for incomming messages on a channel which name they need to know beforehand. In the provided code, each worker uses the name of its host as a channel name. You can see this in the receiver source code:

xbt_assert(res == MSG_OK, "MSG_task_receive failed");

This way, you can have at most one worker per host. To later study the behavior of concurrent applications on the platform, we need to alleviate this. Several solutions exist:

Now that the the master creates the workers, it knows their PID (process ID – given by MSG_process_get_pid()), so you could use it in the channel name.

Another possibility for the master is to determine a channel name before the process creation, and give that name as a parameter to the starting process. This is what the data parameter of MSG_process_create is meant for. You can pass any arbitrary pointer, and the created process can retrieve this value later with the MSG_process_get_data and MSG_process_self functions. Since we want later to study concurrent applications, it is advised to use a channel name such as master_name:worker_name.

A third possibility would be to inverse the communication architecture and have the workers pulling work from the master. This require to pass the master's channel to the workers.

Wrap up

In this exercise, we reduced the amount of configuration that our simulator requests. This is both a good idea, and a dangerous trend. This simplification is an application of the good old DRY/SPOT programming principle (Don't Repeat Yourself / Single Point Of Truth – more on wikipedia), and you really want your programming artefacts to follow these software engineering principles.

But at the same time, you should be careful in separating your scientific contribution (the master/wokers algorithm) and the artefacts used to test it (platform, deployment and workload). This is why SimGrid forces you to expres your platform and deployment files in XML instead of using a programming interface: it forces a clear separation of concerns between things that are of very different nature.

If you struggle with this exercise, have a look at our solution in doc/tuto-msg/masterworker-sol1.c This is not perfect at all, and many other solutions would have been possible, of course.

Exercise 2: Infinite amount of work, fixed experiment duration

In the current version, the number of tasks is defined through the worker arguments. Hence, tasks are created at the very beginning of the simulation. Instead, have the master dispatching tasks for a predetermined amount of time. The tasks must now be created on demand instead of beforehand.

Of course, usual time functions like gettimeofday will give you the time on your real machine, which is prety useless in the simulation. Instead, retrieve the time in the simulated world with MSG_get_clock.

You can still stop your workers with a specific task as previously, but other methods exist. You can forcefully stop processes with the following functions, but be warned that SimGrid traditionnally had issues with forcefully stopping procsses involved in computations or communications. We hope that it's better now, but YMMV.

int MSG_process_killall(int reset_PIDs);

Anyway, the new deployment deployment2.xml file should thus look like this:

<?xml version='1.0'?>
<!DOCTYPE platform SYSTEM "http://simgrid.gforge.inria.fr/simgrid/simgrid.dtd">
<platform version="4.1">
<actor host="Tremblay" function="master">
<argument value="600"/> <!-- Simulation timeout, in seconds -->
<argument value="50000000"/> <!-- Computation size of tasks -->
<argument value="1000000"/> <!-- Communication size of tasks -->
</actor>
</platform>

Controlling the message verbosity

Not all messages are equally informative, so you probably want to change most of the XBT_INFO into XBT_DEBUG so that they are hidden by default. You could for example show only the total number of tasks processed by default. You can still see the debug messages as follows:

./masterworker examples/platforms/small_platform.xml deployment2.xml --log=msg_test.thres:debug

Wrap up

Our imperfect solution to this exercise is available as doc/tuto-msg/masterworker-sol2.c But there is still much to improve in that code.

Exercise 3: Understanding how competing applications behave

It is now time to start several applications at once, with the following deployment3.xml file.

<?xml version='1.0'?>
<!DOCTYPE platform SYSTEM "http://simgrid.gforge.inria.fr/simgrid/simgrid.dtd">
<platform version="4.1">
<actor host="Tremblay" function="master">
<argument value="600"/> <!-- Simulation timeout -->
<argument value="50000000"/> <!-- Computation size of tasks -->
<argument value="10"/> <!-- Communication size of tasks -->
</actor>
<actor host="Fafard" function="master">
<argument value="600"/> <!-- Simulation timeout -->
<argument value="50000000"/> <!-- Computation size of tasks -->
<argument value="10"/> <!-- Communication size of tasks -->
</actor>
<actor host="Jupiter" function="master">
<argument value="600"/> <!-- Simulation timeout -->
<argument value="50000000"/> <!-- Computation size of tasks -->
<argument value="10"/> <!-- Communication size of tasks -->
</actor>
</platform>

Things happen when you do so, but it remains utterly difficult to understand what's happening exactely. Even visualizations with pajeng and Vite contain too much information to be useful: it is impossible to understand which task belong to which application. To fix this, we will categorize the tasks.

For that, first let each master create its own category of tasks with TRACE_category(), and then assign this category to each task using MSG_task_set_category().

The outcome can then be visualized as a Gantt-chart as follows:

./masterworker examples/platforms/small_platform.xml deployment3.xml --cfg=tracing:yes --cfg=tracing/msg/process:yes
vite simgrid.trace

Going further

vite is not enough to understand the situation, because it does not deal with categorization. That is why you should switch to R to visualize your outcomes, as explained on this page.

As usual, you can explore our imperfect solution, in doc/tuto-msg/masterworker-sol3.c.

Exercise 4: Better scheduling: FCFS

You don't need a very advanced visualization solution to notice that round-robin is completely suboptimal: most of the workers keep waiting for more work. We will move to a First-Come First-Served mechanism instead.

For that, your workers should explicitely request for work with a message sent to a channel that is specific to their master. The name of their private channel name should be attached (using the last parameter of MSG_task_create()) to the message sent, so that their master can answer.

The master should serve requests in a round-robin manner, until the time is up. Things get a bit more complex to stop the workers afterward: the master cannot simply send a terminating task, as the workers are blocked until their request for work is accepted. So instead, the master should wait for incomming requests even once the time is up, and answer with a terminating task.

Once it works, you will see that such as simple FCFS schema allows to double the amount of tasks handled over time in this case.

Going further

From this, many things can easily be added. For example, you could:

  • Allow workers to have several pending requests so as to overlap communication and computations as much as possible. Non-blocking communication will probably become handy here.
  • Add a performance measurement mechanism, enabling the master to make smart scheduling choices.
  • Test your code on other platforms, from the examples/platforms directory in your archive.
    What is the largest number of tasks requiring 50e6 flops and 1e5 bytes that you manage to distribute and process in one hour on g5k.xml (you should use deployment_general.xml)?
  • Optimize not only for the amount of tasks handled, but also for the total energy dissipated.
  • And so on. If you come up with a really nice extension, please share it with us so that we can extend this tutorial.

Where to go from here?

This tutorial is now terminated. You could keep reading the online documentation or tutorials, or you could head up to the example section to read some code.

TODO: Points to improve for the next time

  • Propose equivalent exercises and skeleton in java.
  • Propose a virtualbox image with everything (simgrid, pajeng, ...) already set up.
  • Ease the installation on mac OS X (binary installer) and windows.
  • Explain that programming in C or java and having a working development environment is a prerequisite.