
Computational Science of Computer Systems
Experimental Methodologies for Large-Scale Distributed Computing
Martin Quinson, Université de Lorraine / Inria Nancy
Joint work with many colleagues: P. Bédaride, H. Casanova, P.N. Clauss, G. Corona, A. Degomme, F. Desprez, S. Genaud, A. Giersch, M. Guthmuller, Arnaud Legrand, Stephan Merz, L. Nussbaum, C. Rosa, M. Stillwell, Frédéric Suter, C. Thiéry, and many interns.
January 21st, 2015, Rennes

About Me
Curriculum Vitæ
- 1999: Maîtrise (computer science), Université de Saint-Étienne
- 2003: PhD (computer science), ENS Lyon
- 2004: Post-doctoral researcher, University of California, Santa Barbara
- 2004: Temporary teaching assistant, Université de Grenoble
- Since 2005: Assistant professor, Université de Lorraine / Telecom Nancy
- 2011-2013: On leave at Inria Nancy - Grand Est
- March 2013: Habilitation thesis
- 2013-2014: Leader of the Algorille joint team (Algorithms for the Grid)
- 2015-: Member of the VeriDis joint team (Verification of Distributed Systems)
- Future? Professor at ENS Rennes (Myriads team)?
A Common Research Theme for 15 Years
- Discovery and modeling of large-scale HPC systems (since my M.S. work!)
- Making them usable by others: e.g., providing performance models to schedulers
- One of the main contributors to SimGrid, a scientific instrument for such studies

Modern Computers are Large and Complex
Massive Parallelism
1. Tianhe-2 (China): 3,120,000 cores, 18 MW
2. Titan (USA): 560,640 cores, 8 MW
3. Sequoia (USA): 1,572,864 cores, 8 MW
4. K Computer (Japan): 705,024 cores, 13 MW
5. Mira (USA): 786,432 cores, 4 MW
Computational Science → Exascale Systems
- Huge impact on all sciences, techniques, industries, and businesses
- 1 Exaflop = 10^18 operations: one million million million operations...

Not Only in Computational Science
- Google dissipates 300 MW; botnets control millions of zombie computers
- In addition, these systems are heterogeneous and dynamic
So, how do we study these beasts?

Computational Science of Computer Systems
My Research Field: Methodologies of Experimentation
- Goal: assess the performance and correctness of large-scale computer systems
- Question: are we really producing scientifically sound results?
- Main contribution: SimGrid, a simulator of large-scale computer systems
My approach: I am a physicist
- Empirically consider large-scale computer systems as natural objects
- They are eminently artificial artifacts, but their complexity reaches "natural" levels
- Other sciences routinely use computers to understand complex systems
[PPL'09], cited 52 times.

Simulating Distributed Systems
Simulation: Fastest Path from Idea to Data
- Get preliminary results from partial implementations
- Run an experimental campaign of thousands of runs within the week
- Test your scientific idea; don't fiddle with technical subtleties (yet)

[Figure: Idea or MPI code + Experimental Setup + Models → Simulation → Scientific Results]

Simulation: Easiest Way to Study Distributed Applications
- Everything is actually centralized: partially mock parts of your protocol
- No heisenbugs: (simulated) time does not change when you capture more data
- Clairvoyance: observe every bit of your application and platform
- High reproducibility: little to no variability
- Capacity planning: what if the network or the CPU were faster?
- Don't waste resources to debug and test

Simulation Challenges
[Figure: the same pipeline: Idea or MPI code + Experimental Setup + Models → Simulation → Scientific Results]
Challenges for the Tool Makers
- Validity: get realistic results (controlled experimental bias)
- Scalability: fast enough and big enough
- Open Science: integrated lab notes, experiment runner, post-processing (data provenance)

Simulation of Parallel/Distributed Systems
Network protocols: standards emerged (GTNetS, DaSSF, OMNeT++, NS3). A huge number of non-standard tools exist in other domains:
- Grid computing: OptorSim, ChicagoSim, GridSim, JFreeSim, ...
- Peer-to-peer: P2Psim, SimP2P, PeerSim, OverSim, ...
- Volunteer computing: SimBA, EmBOINC, SimBOINC, ...
- HPC/MPI: Dimemas, PSinS, BigSim, LogGoPSim, XSim, SST, ...
- Cloud computing: CloudSim, GroudSim, iCanCloud, GreenCloud, ...

This raises severe methodological/reproducibility issues:
- Short-lived, badly supported (software QA), sparse validity assessment

SimGrid: a 15-year-old joint project
- Versatile: Grid, P2P, Clouds, HPC, Volunteer Computing
- Collaborative (ANR, CNRS, universities, Inria); Open Source with an active community
- Widely used: 150 publications by 120 individuals, 30 contributors

http://simgrid.org/ [UKSim'08] cited 350 times, [JPDC'14]

SimGrid Key Features: Fluid Network Model
- Packet-level models: full network stack; inherently slow and hard to instantiate
- Simple models: delay-based, statistical distributions, coordinates. Very scalable, but no topology and no network congestion
- Fluid models: share the bandwidth between flows at macroscopic events
Bandwidth sharing as an optimization problem: on every link j,
    ∑_{i : flow i uses link j} ρ_i ≤ C_j
- Max-Min objective function: max(min_i ρ_i)
- Reno fairness: max(∑_i arctan(ρ_i))
- Vegas fairness: max(∑_i log(ρ_i))
We have implemented, (in)validated, and optimized these models for 10 years
- The classical "Observe, Analyze, Hypothesize, Test" loop (a toy solver is sketched below)
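To make the Max-Min objective concrete, here is a toy progressive-filling solver in C. It is a minimal sketch under simplifying assumptions (a static flows-on-links incidence matrix, no flow weights); the scenario and all names are illustrative, and this is not SimGrid's actual solver.

#include <stdio.h>

#define NFLOWS 3
#define NLINKS 2

/* uses[j][i] == 1 iff flow i crosses link j (made-up scenario) */
static const int uses[NLINKS][NFLOWS] = { {1, 1, 0}, {0, 1, 1} };
static double capacity[NLINKS] = { 100.0, 50.0 };  /* C_j, say in Mb/s */
static double rho[NFLOWS];                         /* computed rates rho_i */
static int saturated[NFLOWS];

int main(void) {
  int remaining = NFLOWS;
  while (remaining > 0) {
    /* Find the bottleneck: the link with the smallest fair share
       C_j / (number of still-active flows crossing j). */
    double share = 0; int bottleneck = -1;
    for (int j = 0; j < NLINKS; j++) {
      int active = 0;
      for (int i = 0; i < NFLOWS; i++)
        if (uses[j][i] && !saturated[i]) active++;
      if (active == 0) continue;
      double s = capacity[j] / active;
      if (bottleneck == -1 || s < share) { share = s; bottleneck = j; }
    }
    if (bottleneck == -1) break;  /* leftover flows cross no link */
    /* Saturate every active flow on the bottleneck at the fair share,
       and deduct its rate from every link it crosses. */
    for (int i = 0; i < NFLOWS; i++) {
      if (!uses[bottleneck][i] || saturated[i]) continue;
      rho[i] = share; saturated[i] = 1; remaining--;
      for (int j = 0; j < NLINKS; j++)
        if (uses[j][i]) capacity[j] -= share;
    }
  }
  for (int i = 0; i < NFLOWS; i++)
    printf("rho_%d = %g\n", i, rho[i]);  /* here: 75, 25, 25 */
  return 0;
}

The Reno and Vegas variants replace the flat fair share with a utility maximization, but the per-link capacity constraints keep the same structure.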

Validity Success Stories
Unmodified NAS CG on a TCP/Ethernet cluster (Grid'5000)

[Plot: CG speedup vs. number of nodes (8 to 128), comparing Real measurements with the SimGrid and LogGPS models]
Key aspects to obtain this result
- Network topology: contention (large messages) and synchronization (small messages)
- Applicative (collective) operations, borrowed from real implementations
- Instantiated platform models (matching observed effects, not the documentation)
- All included in SimGrid except the instantiation (which remains manual for now)
[IPDPS'11] cited 35 times, [SC'13]

Validity Success Stories
Unmodified NAS CG on a TCP/Ethernet cluster (Grid'5000)

[Plot: the same CG speedup experiment; at the largest node counts the Real curve collapses (network collapse) while the SimGrid and LogGPS models do not]
Discrepancy between simulation and real experiment. Why?
- Massive switch packet drops lead to 200 ms TCP timeouts!
- Tightly coupled: the whole application hangs until the timeout
- The noise is easy to model in the simulator, but useless for that very study
- Our noise-free performance prediction is more useful for detecting the real issue
[IPDPS'11] cited 35 times, [SC'13]

Agenda
Introduction
Computational Science of Computer Systems (CS2)
Simulation Models
Parallel Simulation of Discrete Event Systems
Dynamic Verification of Distributed Applications
Conclusion

Parallel Simulation of Discrete Event Systems
Classical parallel schema: split the whole applicative model across Logical Processes (LPs)
[Figure: four LPs exchanging events]
- Leads to good speedups, but still poor overall performance: dPeerSim drops from 4h with 2 LPs to 1h with 16 LPs, while sequential PeerSim needs only 50s
New approach: split at the virtualization layer (not in the simulation engine)
- The virtualization layer contains the threads (user stacks)
- Engine & models remain sequential
[Figure: timeline: user contexts U1-U3 run between engine steps t_n, t_n+1, t_n+2, across the "Models + Engines / Virtualization + Synchro / User" layers]
- Synchronization costs are of paramount importance
[Figure: simulation stack: workload/user code, virtualization layer, networking models, simulation engine, on top of the execution environment]

Efficient Parallel Fine-Grained Simulation: SimGrid is an Operating System
Simcalls separate processes, alleviating locking issues
- Very similar to syscalls in an operating system (a minimal handshake is sketched below)
[Figure: functional view: processes issue simcalls through the SimCall interface to maestro and the simulation models (the kernel); temporal view: isolated user contexts U1-U3 post simcall requests, and maestro M performs the actual interactions and answers them]
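To make the request/answer handshake concrete, here is a minimal sketch in C. It uses one pthread per role instead of SimGrid's co-routines, and all names (simcall_t, simcall_exec, maestro) are illustrative rather than SimGrid's API.

#include <pthread.h>
#include <stdio.h>

/* One pending request at a time, for a single user process. */
typedef struct {
  int pending;          /* 1 while the request awaits maestro */
  int request, answer;  /* marshalled argument and result */
  pthread_mutex_t mtx;
  pthread_cond_t cond;
} simcall_t;

static simcall_t sc = { 0, 0, 0,
  PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER };

/* User side: post the request, then block until maestro answers. */
static int simcall_exec(int req) {
  pthread_mutex_lock(&sc.mtx);
  sc.request = req;
  sc.pending = 1;
  pthread_cond_broadcast(&sc.cond);
  while (sc.pending)               /* sleep until answered */
    pthread_cond_wait(&sc.cond, &sc.mtx);
  int ans = sc.answer;
  pthread_mutex_unlock(&sc.mtx);
  return ans;
}

/* Maestro side: the only thread that touches simulation state,
   so the models themselves need no locking at all. */
static void *maestro(void *arg) {
  (void)arg;
  pthread_mutex_lock(&sc.mtx);
  while (!sc.pending)
    pthread_cond_wait(&sc.cond, &sc.mtx);
  sc.answer = sc.request * 2;      /* stand-in for running a model */
  sc.pending = 0;
  pthread_cond_broadcast(&sc.cond);
  pthread_mutex_unlock(&sc.mtx);
  return NULL;
}

int main(void) {
  pthread_t m;
  pthread_create(&m, NULL, maestro, NULL);
  printf("simcall answered: %d\n", simcall_exec(21));
  pthread_join(m, NULL);
  return 0;
}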
Leveraging Multicores
- More processes than cores → worker threads (that execute co-routines)
[Figure: functional view: maestro and the simulation models (the kernel) dispatch processes to worker threads; temporal view: threads T1-Tn execute user contexts between engine steps t_n and t_n+1, synchronized with fetch_add(), futex_wait(), futex_wake()]
Ideal Algorithm
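Below is a hedged sketch of that ideal algorithm on Linux: workers claim user contexts with an atomic fetch_add(), and the last worker to finish wakes maestro through a futex. run_user_context() is a stand-in for resuming a co-routine, the worker threads are assumed to be spawned elsewhere, and error handling is omitted.

#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <stdatomic.h>
#include <unistd.h>

#define NCTX 1024            /* user contexts runnable this round */
#define NWORKERS 4

static atomic_int next_ctx;                  /* next context to claim */
static atomic_int busy_workers = NWORKERS;   /* workers still running */
static int round_over;                       /* futex word maestro sleeps on */

static void run_user_context(int id) { (void)id; /* resume a co-routine */ }

static void futex_wait(int *addr, int val) {
  syscall(SYS_futex, addr, FUTEX_WAIT, val, NULL, NULL, 0);
}
static void futex_wake(int *addr) {
  syscall(SYS_futex, addr, FUTEX_WAKE, 1, NULL, NULL, 0);
}

/* Each worker thread runs this once per scheduling round. */
void worker_round(void) {
  for (int i = atomic_fetch_add(&next_ctx, 1); i < NCTX;
       i = atomic_fetch_add(&next_ctx, 1))
    run_user_context(i);     /* contexts of one round are independent */
  if (atomic_fetch_sub(&busy_workers, 1) == 1) {  /* last worker out */
    __atomic_store_n(&round_over, 1, __ATOMIC_SEQ_CST);
    futex_wake(&round_over); /* maestro may now apply the simcalls */
  }
}

/* Maestro: sleep until the round completes, then advance the models
   sequentially and start the next round. */
void maestro_wait_round(void) {
  while (!__atomic_load_n(&round_over, __ATOMIC_SEQ_CST))
    futex_wait(&round_over, 0);
}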

Performance Results
- Scenario: initialize Chord, then simulate 1000 seconds of the protocol
- Arbitrary time limit: 12 hours (simulations are killed afterward)
[Plot: running time in seconds vs. number of nodes (up to 2 million) for OverSim (OMNeT++), PeerSim, OverSim (simple underlay), SimGrid (sequential), and SimGrid (4 threads)]
Largest simulated scenario (size / time):
- OMNeT++: 10k nodes, 1h40
- PeerSim: 100k nodes, 4h36
- OverSim: 300k nodes, 10h
- SimGrid (sequential): 10k in 32s; 300k in 32mn; 2M in 6h18
- SimGrid (parallel): 10k in 130s; 300k in 40mn; 2M in 5h55
Memory usage: 18 kiB per process (of which 12 kiB of stack)
The first time that PDES runs (a little) faster than sequential DES. [CCGrid'12], cited 17 times

Agenda
Introduction
Computational Science of Computer Systems (CS2)
Simulation Models
Parallel Simulation of Discrete Event Systems
Dynamic Verification of Distributed Applications
Conclusion

Assessing the Correctness of HPC codes?
Writing distributed apps is notoriously difficult, but:

The Good Old Days
- MPI codes circumvented the issues with rigid communication patterns
- Performance first: fast code that rarely fail-stops ≫ correct but slow code

These Days are Now Over
- Rigid patterns do not scale! We now have to release the grip
- But this is dangerous! We now have to explicitly seek correctness

Slowly, old ignored problems resurface
- When tests are not enough anymore, turn to formal methods

Model Checking and Dynamic Verification
Automated Formal Methods
- Try to assess the correctness of a system by actively searching for faults
- If no fault is found after an exhaustive search, correctness is experimentally proved
- Dynamic verification: model checking applied to real applications
Exhaustive Exploration
[Figure: exploration tree of one verification run, interleaving iSend, iRecv, Wait, WaitTimeout, Test, and MC_RANDOM transitions of every process]
Model Checking: the Big Idea
- My preferred outcome: a counter-example
- I aim at bug finding, not certification

SimGridMC: Formal Methods in SimGrid
Verify any application that would run in SimGrid
- Reuse the simulator's light virtualization to mediate the apps' actions
- Replace the simulation kernel underneath with a model checker
- Test all causally possible orders of events to dynamically verify the app
- Take system-level checkpoints of the app to rewind it and explore another path (see the sketch below)
- This works with SMPI, and with MSG (our simple API to study CSP algorithms)
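As an illustration of that checkpoint/rewind loop, here is a hedged sketch in C; every helper is a stand-in for SimGridMC internals (the real tool snapshots the entire process image rather than returning a token):

#include <stdio.h>

/* Illustrative stand-ins for the model checker's machinery. */
typedef int snapshot_t;
static snapshot_t take_snapshot(void)      { return 0; }
static void restore_snapshot(snapshot_t s) { (void)s; }
static int  enabled_transitions(int depth) { return depth < 2 ? 2 : 0; }
static void execute_transition(int t)      { printf("exec t%d\n", t); }
static int  property_violated(void)        { return 0; }

/* Depth-first search over all causally possible event orderings:
   snapshot at each node, rewind between sibling branches. */
static void explore(int depth) {
  if (property_violated()) {
    printf("counter-example of depth %d found\n", depth);
    return;
  }
  snapshot_t here = take_snapshot();
  int n = enabled_transitions(depth);
  for (int t = 0; t < n; t++) {
    restore_snapshot(here);    /* rewind the app to this node */
    execute_transition(t);
    explore(depth + 1);
  }
}

int main(void) { explore(0); return 0; }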
[Figures: the SimGrid architecture: user code on the MSG, SMPI, or SIMDAG APIs, above the SIMIX layer of concurrent processes and synchronization abstractions, and the SURF kernel that solves the per-resource capacity constraints on the platform description; plus two explored execution trees and a per-process counter-example trace of iSend/iRecv/Wait/Test transitions]

Example: Out-of-Order Receive
- Two processes send a message to a third one
- The receiver expects the messages to arrive in order
- This may happen... or not
[Figure: schedule where rank 0 receives rank 1's send(1) first: x ← 1, y ← 2, so x < y holds]
int rank, x, y;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
  MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  MPI_Recv(&y, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  MC_assert(x < y);  /* checked by SimGridMC on every explored path */
} else {
  MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
}
[Figure: schedule where rank 0 receives rank 2's send(2) first: x ← 2, y ← 1, violating x < y]
**************************
*** PROPERTY NOT VALID ***
**************************
Counter-example execution trace:
[(1)recver] iRecv (dst=recver, buff=(verbose only), size=(verbose only))
[(3)sender] iSend (src=sender, buff=(verbose only), size=(verbose only))
[(1)recver] Wait (comm=(verbose only) [(3)sender -> (1)recver])
[(1)recver] iRecv (dst=recver, buff=(verbose only), size=(verbose only))
[(2)sender] iSend (src=sender, buff=(verbose only), size=(verbose only))
[(1)recver] Wait (comm=(verbose only) [(2)sender -> (1)recver])

Mitigating the State-Space Explosion
Many execution paths are redundant → cut the exploration when possible

Dynamic Partial-Order Reduction (DPOR)
- Works on histories: explore only one interleaving of independent transitions (an illustrative independence check is sketched below)
- Independence theorems: local events are independent; so are iSend and iRecv; ...
- Must be conservative (the soundness of the exploration is at stake!)
- It works well (for safety properties)
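For illustration, a conservative independence predicate over such transitions could look as follows; the transition_t struct and the mailbox rule are simplifications made up for this sketch, not SimGrid's actual predicate.

typedef enum { ISEND, IRECV, WAIT, TEST, LOCAL } action_t;

typedef struct {
  int process;       /* issuing process */
  action_t type;
  int mailbox;       /* communication endpoint, if any */
} transition_t;

/* Return 1 only when swapping a and b provably yields the same state;
   answering 0 is always safe, since DPOR must stay conservative. */
static int independent(const transition_t *a, const transition_t *b) {
  if (a->process == b->process)
    return 0;                          /* same process: program order */
  if (a->type == LOCAL || b->type == LOCAL)
    return 1;                          /* local events commute */
  if ((a->type == ISEND && b->type == IRECV) ||
      (a->type == IRECV && b->type == ISEND))
    return 1;                          /* theorem: iSend/iRecv commute */
  if (a->type == b->type && (a->type == ISEND || a->type == IRECV))
    return a->mailbox != b->mailbox;   /* distinct endpoints commute */
  return 0;                            /* anything else: assume dependent */
}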
System-Level State Equality
- Works on states: detect when a given state was previously explored
- Complementary to DPOR (but not compatible with it yet)
- Introspects the C/C++/Fortran app just like gdb does (plus some black magic)
[AVOCS'10] [PDP'15]

Some Results
Wild safety bug in our Chord implementation (≈ 500 lines of C)
- In simulation, the bug shows on large instances only; the model checker finds a small trace (1 s with DPOR)
Mocked liveness bug
- Buggy centralized mutual exclusion: the last client never obtains the critical section
- About 100 lines of code; state snapshot size: 5 MiB; verified with up to 7 processes
Verifying the MPICH3 compliance tests
- Looking for assertion failures, deadlocks, and non-progressive cycles
- 6 tests of ≈ 1300 lines each; state snapshot size: ≈ 4 MB
- We also verified several MPI2 collectives ☺ (but all good so far ☹)
Protocol-Wide Properties
- e.g., send-determinism: on each node, send and recv events always occur in the same order (this allows more efficient application checkpointing)
- Even harder than liveness properties (not expressible in LTL), but doable in SimGrid

Much more to say about SimGrid (too little time)
Hybrid Network Models
- Fluid model: captures contention in steady state for large messages
- LogOP model: captures intra-node delays and synchronization
- Also: MPI collectives, TCP peculiarities (slow start, cross-traffic), soon InfiniBand
Realistic Emulation
- SMPI: study real MPI applications within SimGrid
- Simterpose: study arbitrary real applications (ongoing)
High-Performance Simulation
- Fast enough: innovative PDES; efficient algorithms and implementations
- Big enough: scalable and versatile platform representation
Formal Verification of Distributed Apps
- Safety, liveness, or CTL properties, with DPOR or state-equality reductions

Research Project for the Next 10 Years
Make Large-Scale Distributed Systems Easier to Use
- They are pervasive in our connected societies, yet almost uncontrolled
- Research plan: a Computational Science of Computer Systems, i.e., leveraging computers to understand computers
- Expected visible outcome: propose a valgrind-like tool for distributed systems
- This is perfectly in line with what I have been doing for 15 years
Why in Myriads?
- Distributed systems, with a focus on experimentation (Grid'5000, etc.)
- Many works solving hard OS-level issues to help distributed systems
Why at ENS Rennes?
- We need more teachers who are proficient with systems internals
- Teaching is of paramount importance to me, with lots of activities on teaching CS: unplugged activities, programming exercisers, research groups, national days
- More on this on Friday at 12:30 :)

What is SMPI?
- A reimplementation of MPI on top of SimGrid
- Imagine a VM running real MPI applications on platforms that do not exist
- A horrible over-simplification, but you get the idea
- Computations run for real on your laptop; communications are faked
What is it good for?
- Performance prediction ("what-if?" scenarios): platform dimensioning, tuning the app's parameters
- Teaching parallel programming and HPC with a reduced technical burden: no need for real hardware, or to hack your hardware
Studies that you should NOT attempt with SMPI
- Predicting the impact of the L2 cache size on your code
- Interactions of TCP Reno vs. TCP Vegas vs. UDP
- Claiming a simulation of 1000 billion nodes

SimGrid Network Model: Measurements
[Plot: measured duration (seconds) of MPI_Send and MPI_Recv vs. message size (bytes), showing distinct regimes: Small, Medium1, Medium2, Detached, Large]
Hybrid model: three modes depending on the message size k (selection sketched below)
- Asynchronous (k ≤ S_a): the send returns immediately
- Detached (S_a < k ≤ S_d): the data is buffered; the sender returns before the receiver is ready
- Synchronous (k > S_d): rendezvous; the sender blocks until the receive is posted
[Figure: timelines of the sender Ps and receiver Pr in each of the three modes]
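A tiny sketch of how the mode can be selected from the message size k; in reality the thresholds S_a and S_d are calibrated per platform, and the values below are made up.

#include <stddef.h>

typedef enum { ASYNCHRONOUS, DETACHED, SYNCHRONOUS } send_mode_t;

static const size_t S_a = 4096;    /* async threshold (made-up value) */
static const size_t S_d = 65536;   /* detached threshold (made-up value) */

/* Pick the communication mode for a k-byte message. */
static send_mode_t send_mode(size_t k) {
  if (k <= S_a) return ASYNCHRONOUS; /* sender returns immediately */
  if (k <= S_d) return DETACHED;     /* buffered; no rendezvous needed */
  return SYNCHRONOUS;                /* sender blocks until recv is posted */
}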
Fluid model: accounts for contention and the network topology
[Figure: modeling a real cluster: cabinets of nodes 1-39, 40-74, 75-104, 105-144 with 1G up/down links per node and 10G uplinks, captured as a hierarchy of 1G, 1.5G, and 13G limiter links]

SimGrid Modeling of MPI

MPI Collectives
- SimGrid implements more than 120 algorithms for the 10 main MPI collectives
- The selection logic of OpenMPI and MPICH can be reproduced (see the sketch below)
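That selection logic boils down to decision functions over the message and communicator sizes. The sketch below illustrates the idea for allreduce; the algorithm names and thresholds are invented for the example and are not the actual OpenMPI/MPICH tables.

#include <stddef.h>

typedef enum { ALLREDUCE_RECURSIVE_DOUBLING, ALLREDUCE_RING,
               ALLREDUCE_SEGMENTED_RING } allreduce_algo_t;

/* Pick an allreduce algorithm from message and communicator sizes,
   in the spirit of the OpenMPI/MPICH decision functions. */
static allreduce_algo_t select_allreduce(size_t bytes, int nprocs) {
  if (bytes < 4096 || nprocs < 4)
    return ALLREDUCE_RECURSIVE_DOUBLING;  /* latency-bound regime */
  if (bytes < (size_t)1 << 20)
    return ALLREDUCE_RING;                /* bandwidth-optimal */
  return ALLREDUCE_SEGMENTED_RING;        /* pipeline very large messages */
}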
HPC Topologies
[Figure: torus, fat-tree, and AS-hierarchy platforms (AS1-AS7), each AS with its own routing: empty+coordinates, Full, Dijkstra, Floyd, or rule-based]
But also:
- External load (availability changes), host and link failures, energy (DVFS)
- Virtual machines that can be migrated; random platform generators

SimTerpose Project
Dream: simulate any application on top of SimGrid
Simulated Setup
[Figure: P1 listens on port 80 of simulated host A; P2 on simulated host B reaches it through the simulated network; the real port backing it is still to be determined]
Take 1: ptrace plumbing (a minimal interception loop is sketched below)
[Figure: on the hosting computer, the SimTerpose ptrace interceptor rewrites P2's connect(A,80) into connect(host,10042) and P1's accept(80) into accept(10042), 10042 being the real port]
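A minimal sketch of that interception loop on x86-64 Linux: the tracer stops the child at every syscall boundary and inspects the syscall number (each syscall stops twice, at entry and exit). The real SimTerpose additionally rewrites arguments and return values; "./app" is a placeholder.

#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
  pid_t child = fork();
  if (child == 0) {                            /* traced side */
    ptrace(PTRACE_TRACEME, 0, NULL, NULL);
    execlp("./app", "./app", NULL);            /* placeholder program */
    return 1;
  }
  int status;
  waitpid(child, &status, 0);                  /* stopped at execlp */
  while (!WIFEXITED(status)) {
    ptrace(PTRACE_SYSCALL, child, NULL, NULL); /* run to next syscall stop */
    waitpid(child, &status, 0);
    if (WIFSTOPPED(status)) {
      struct user_regs_struct regs;            /* x86-64 specific layout */
      ptrace(PTRACE_GETREGS, child, NULL, &regs);
      if (regs.orig_rax == SYS_connect)        /* rewrite the address here */
        fprintf(stderr, "connect() intercepted\n");
    }
  }
  return 0;
}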
Take 2: Full Emulation
[Figure: P1-P3 run behind the ptrace interceptor as threads inside the simulator, with direct memory access]
Current State
- Functional proof of concept: send/recv exchanges work
- Need to handle the other 200 syscalls: intercept them, store metadata, inform the simulator, and report the effects on processes
- Time and DNS need love at link time
- We are redeveloping a libc! (in a strange way ;)