SimGrid  3.15
Versatile Simulation of Distributed Systems
Getting Started: SimGrid Main Concepts

SimGrid is a framework to simulate distributed computer systems.

It can be used to either assess abstract algorithms, or to profile and debug real distributed applications. SimGrid enables studies in the domains of (data-)Grids, IaaS Clouds, Clusters, High Performance Computing, Volunteer Computing and Peer-to-Peer systems.

Technically speaking, SimGrid is a library. It is neither a graphical interface nor a command-line simulator running user scripts. The interaction with SimGrid is done by writing programs with the exposed functions to build your own simulator.

SimGrid offers many features, many options and many possibilities. The documentation aims at smoothing the learning curve. But nothing's perfect, and this documentation is really no exception here. Please help us improving it by reporting any issue that you see and proposing the content that is still missing.

SimGrid is a Free Software distributed under the LGPL licence. You are thus welcome to use it as you wish, or even to modify and distribute your version (as long as your version is as free as ours). It also means that SimGrid is developed by a vivid community of users and developers. We hope that you will come and join us!

SimGrid is the result of over 15 years of research from several groups, both in France and in the USA. It benefited of many funding from various research instances, including the ANR, Inria, CNRS, University of Lorraine, University of Hawai'i at Manoa, ENS Rennes and many others. Many thanks to our generous sponsors!

Typical Study based on SimGrid

Any SimGrid study entails the following components:

  • The studied Application. This can be either a distributed algorithm described in our simple APIs, or a full featured real parallel application using the MPI interface (or other).
  • The Virtual Platform. This is a description of a given distributed system (machines, links, disks, clusters, etc). Most of the platform files are written in XML althrough a Lua interface is under development. SimGrid makes it easy to augment the Virtual Platform with a Dynamic Scenario where for example the links are slowed down (because of external usage), the machines fail. You have even support to specify the applicative workload that you want to feed to your application.
  • The application's Deployment Description. In SimGrid terminology, the application is an inert set of source files and binaries. To make it run, you have to describe how your application should be deployed on the virtual platform. Specify which process is located on which host, along with its parameters.
  • The Platform Models. They describe how the virtual platform reacts to the actions of the application. For example, they compute the time taken by a given communication on the virtual platform. These models are already included in SimGrid, and you only need to pick one and maybe tweak its configuration to get your results.

These components are put together to run a simulation, that is an experiment or a probe. The result of one or many simulation provides an outcome (logs, visualization, statistical analysis) that help answering the question targeted by this study.

The questions that SimGrid can solve include the following:

  • Compare an Application to another. This is the classical use case for scientists, who use SimGrid to test how the solution that they contribute compares to the existing solutions from the literature.
  • Design the best Virtual Platform for a given Application. Tweaking the platform file is much easier than building a new real platform for testing purpose. SimGrid also allows co-design of the platform and the application by modifying both of them.
  • Debug Real Applications. With real systems, is sometimes difficult to reproduce the exact run leading to the bug that you are tracking. SimGrid gives you experimental reproducibility, clairevoyance (you can explore every part of the system, and your probe will not change the simulated state). It also makes it easy to mock some parts of the real system that are not under study.

SimGrid Execution Gears

Depending on the intended study, SimGrid can be run in several gears, that are different execution modes.

Simulation Gear. This is the most common gear, where you want to study how your application behaves on the virtual platform under the experimental scenario.

In this gear, SimGrid can provide information about the time taken by your application, the amount of energy dissipated by the platform to run your application and the detailed usage of each resource.

Model-Checking Gear. This can be seen as a sort of exhaustive testing gear, where every possible outcome of your application is explored. In some sense, this gear tests your application for all possible platforms that you could imagine (and more).

You just provide the application and its deployment (amount of processes and parameters), and the model-checker will litterally explore all possible outcomes by testing all possible message interleaving: if at some point a given process can either receive the message A first or the message B depending on the platform characteristics, the model-checker will explore the scenario where A arrives first, and then rewind to the same point to explore the scenarion where B arrives first.

This is a very powerful gear, where you can evaluate the correction of your application. It can verify either safety properties (asserts) or liveless properties stating for example that if a given event occures, then another given event will occur in a finite amount of steps. This gear is not only usable with the abstract algorithms developed on top of the SimGrid APIs, but also with real MPI applications (to some extend).

The main limit of Model Checking lays in the huge amount of scenarios to explore. SimGrid tries to explore only non-redundent scenarios thanks to classical reduction techniques (such as DPOR and statefull exploration) but the exploration may well never finish if you don't carefully adapt your application to this gear.

Another limit of this gear is that it does not use the performance models of the simulation gear. Time becomes discrete: You can say for example that the application took 42 steps to run, but there is no way to know the amount of seconds that it took or the amount of watts that it dissipated.

Finally, the model checker only explores the interleavings of computations and communications. Other factors such as thread execution interleaving are not considered by the SimGrid model checker.

The model checker may well miss existing issues, as it computes the possible outcomes from a given initial situation. There is no way to prove the correction of your application in all generality with this tool.

Benchmark Recording Gear. During debug sessions, continuous integration testing and other similar use cases, you are often only interested in the control flow. If your application apply filters to huge images split in small blocks, the filtered image is probably not what you are interested in. You are probably looking for a way to run each computation kernel only once, save on disk the time it takes and some other metadata. This code block can then be skipped in simulation and replaced by a synthetic block using the cached information. The virtual platform will take this block into account without requesting the real hosting machine to benchmark it.

SimGrid Success Stories

TBD

  • Many publications
  • Accurate speedup prediction for the Mont-Blanc cluster
  • It already happened that a divergence between the simulated outcome and the reality resulted from a testbed misconfiguration. In some sense, we fixed the reality because it was not getting the result that SimGrid correctly computed :)
  • Star-PU, BigDFT, TomP2P use SimGrid to chase their bugs and improve their efficiency.

SimGrid Limits

This framework is by no means the perfect holly grail able to solve every problem on earth.

SimGrid scope is limited to distributed systems. Real-time multithreaded systems are not in the scope. You could probably tweak SimGrid for such studies (or the framework could possibily be extended in this direction), but another framework specifically targeting this usecase would probably be more suited.

There is currently no support for IoT studies and wireless networks. The framework could certainly be improved in this direction, but this is still to be done.

There is no perfect model, only models adapted to your study. The SimGrid models target fast, large studies yet requesting a realistic results. In particular, our models abstract away parameters and phenomenon that are often irrelevant to the realism in our context.

SimGrid is simply not intended to any study that would mandate the abstracted phenomenon. Here are some studies that you should not do with SimGrid:

  • Studying the effect of L3 vs L2 cache effects on your application
  • Comparing variantes of TCP
  • Exploring pathological cases where TCP breaks down, resulting in abnormal executions.
  • Studying security aspects of your application, in presence of malicious agents.

Where to proceed next?

Now that you know about the basic concepts of SimGrid, you can give it a try. If it's not done yet, first install it. Then, proceed to the section on describing the application that you want to study.