sOMP: Simulating OpenMP Task-Based Applications with NUMA ...

14
sOMP: Simulating OpenMP Task-Based Applications with NUMA Effects DAOUDI Idriss 1 , VIROULEAU Philippe, THIBAULT Samuel, GAUTIER Thierry, AUMAGE Olivier [email protected] September 23, 2020 DAOUDI Idriss, VIROULEAU Philippe, THIBAULT Samuel, GAUTIER Thierry, AUMAGE Olivier ([email protected]) sOMP: Simulating OpenMP Task-Based Applications with NUMA Effects September 23, 2020 1 / 14

Transcript of sOMP: Simulating OpenMP Task-Based Applications with NUMA ...

Page 1: sOMP: Simulating OpenMP Task-Based Applications with NUMA ...

sOMP: Simulating OpenMP Task-Based Applicationswith NUMA Effects

DAOUDI Idriss1, VIROULEAU Philippe, THIBAULT Samuel,GAUTIER Thierry, AUMAGE Olivier

[email protected]

September 23, 2020

DAOUDI Idriss, VIROULEAU Philippe, THIBAULT Samuel, GAUTIER Thierry, AUMAGE Olivier ([email protected])sOMP: Simulating OpenMP Task-Based Applications with NUMA EffectsSeptember 23, 2020 1 / 14

Page 2: sOMP: Simulating OpenMP Task-Based Applications with NUMA ...

Context and objectives

Context:

anticipating applications behavior, studying, and designing algorithms

experiment with various scheduling algorithms in a reproduciblecontext

predictive tools to evaluate applications performances on existing andnon-existing systems

SimGrid → simulation of distributed infrastructures

MPI, tasks, threads...

OpenMP?In order to simulate OpenMP:

1 parallel tasks with data dependency2 parallel loops

Objectives:

predict the performances of task-based applications

take into account Non-Uniform Memory Access (NUMA) effects

DAOUDI Idriss, VIROULEAU Philippe, THIBAULT Samuel, GAUTIER Thierry, AUMAGE Olivier ([email protected])sOMP: Simulating OpenMP Task-Based Applications with NUMA EffectsSeptember 23, 2020 2 / 14

Page 3: sOMP: Simulating OpenMP Task-Based Applications with NUMA ...

State of the art

Distributed memory simulators:

SimGrid, Dimemas, BigSim, xSim...

Shared-memory simulators:

for specific architectures: Aversa et al. (hybrid MPI/OpenMP onSMP), Simany (multicore)...full system simulators: SimNUMA...for task-based applications: HLSMN tool (without considering taskdependencies)...

None of these tools takes into account task dependency andNUMA effects!

DAOUDI Idriss, VIROULEAU Philippe, THIBAULT Samuel, GAUTIER Thierry, AUMAGE Olivier ([email protected])sOMP: Simulating OpenMP Task-Based Applications with NUMA EffectsSeptember 23, 2020 3 / 14

Page 4: sOMP: Simulating OpenMP Task-Based Applications with NUMA ...

Methodology

1 OpenMP application runtime data collection (tracing)

2 NUMA architectures modeling

3 Building a task-based application simulator with SimGrid

4 Construction of a performance model to take into account memoryaccesses

5 Simulation of the behavior of OpenMP applications using theperformance model

DAOUDI Idriss, VIROULEAU Philippe, THIBAULT Samuel, GAUTIER Thierry, AUMAGE Olivier ([email protected])sOMP: Simulating OpenMP Task-Based Applications with NUMA EffectsSeptember 23, 2020 4 / 14

Page 5: sOMP: Simulating OpenMP Task-Based Applications with NUMA ...

Tracing with TiKKi

uses the OMPT API integrated in OpenMP 5.0

captures all the events required→ to build an OpenMP application task graph

generate several output forms of execution traces→ task graph, Gantt chart, sOMP trace format

sequence of parallel regions→ where the events of each task is recorded

one trace file per encountered parallel region

DAOUDI Idriss, VIROULEAU Philippe, THIBAULT Samuel, GAUTIER Thierry, AUMAGE Olivier ([email protected])sOMP: Simulating OpenMP Task-Based Applications with NUMA EffectsSeptember 23, 2020 5 / 14

Page 6: sOMP: Simulating OpenMP Task-Based Applications with NUMA ...

Modeling NUMA architectures with SimGrid

exploits the possibilities offered by SimGrid’s S4U API

NUMA node → cores + memory controller + links

exploit links parameters → model contention and concurrency

compromise between precision and simulation cost!→ around 70 times faster than real execution

(a) NUMA machine modeling (b) NUMA machine modeling usingSimGrid components

DAOUDI Idriss, VIROULEAU Philippe, THIBAULT Samuel, GAUTIER Thierry, AUMAGE Olivier ([email protected])sOMP: Simulating OpenMP Task-Based Applications with NUMA EffectsSeptember 23, 2020 6 / 14

Page 7: sOMP: Simulating OpenMP Task-Based Applications with NUMA ...

Task-based applications simulator: sOMP

Concept:

exploit traces collected from a sequential execution

execute the application on platforms modeled in the SimGrid senseusing the collected traces

simulate the execution time for a large number of cores

Components:

parser for trace and platform files

scheduler to submit jobs

single worker per simulated core to execute the tasks

performance model to refine the simulations

DAOUDI Idriss, VIROULEAU Philippe, THIBAULT Samuel, GAUTIER Thierry, AUMAGE Olivier ([email protected])sOMP: Simulating OpenMP Task-Based Applications with NUMA EffectsSeptember 23, 2020 7 / 14

Page 8: sOMP: Simulating OpenMP Task-Based Applications with NUMA ...

Communications-based performance model

Default task execution model in sOMP:

task execution time obtained via the trace filesadvance SimGrid’s simulated clock

Communications-based model using SimGrids’ communications:

trace file provide the list of memory operations performed by eachtask (R, RW);take into account those memory accesses;memory access → communication to the memory controllerthe size of each communication (in bytes) impact SimGrids’ internalclock→ contention and concurrency on the crossed links;

DAOUDI Idriss, VIROULEAU Philippe, THIBAULT Samuel, GAUTIER Thierry, AUMAGE Olivier ([email protected])sOMP: Simulating OpenMP Task-Based Applications with NUMA EffectsSeptember 23, 2020 8 / 14

Page 9: sOMP: Simulating OpenMP Task-Based Applications with NUMA ...

Evaluation

Architectures:

dual socket Intel Xeon Gold 6240 (CascadeLake)→ 2 sockets, 2 NUMA nodes, 36 cores

dual socket AMD EPYC 7452 (AMD Infinity)→ 2 sockets, 16 NUMA nodes, 64 cores

Kastors/Plasma[IWOMP2014]:

benchmark suite to evaluate the implementation of the OpenMP taskdependency paradigm;

several matrix factorization algorithms: Cholesky, QR, LU.

Metric:

precision error (%) = Tnative−TsimulatedTnative

x 100

positive → optimistic simulation, negative → pessimistic simulation

DAOUDI Idriss, VIROULEAU Philippe, THIBAULT Samuel, GAUTIER Thierry, AUMAGE Olivier ([email protected])sOMP: Simulating OpenMP Task-Based Applications with NUMA EffectsSeptember 23, 2020 9 / 14

Page 10: sOMP: Simulating OpenMP Task-Based Applications with NUMA ...

Architectures overview with HWLoc

Machine (1024GB total)

Package P#0

L3 (16MB)

L2 (512KB)

L1 (32KB)

Core P#0

PU P#0

L2 (512KB)

L1 (32KB)

Core P#1

PU P#1

L2 (512KB)

L1 (32KB)

Core P#2

PU P#2

L2 (512KB)

L1 (32KB)

Core P#3

PU P#3

NUMANode P#0 (64GB)

L3 (16MB)

L2 (512KB)

L1 (32KB)

Core P#4

PU P#4

L2 (512KB)

L1 (32KB)

Core P#5

PU P#5

L2 (512KB)

L1 (32KB)

Core P#6

PU P#6

L2 (512KB)

L1 (32KB)

Core P#7

PU P#7

NUMANode P#1 (64GB)

L3 (16MB)

L2 (512KB)

L1 (32KB)

Core P#8

PU P#8

L2 (512KB)

L1 (32KB)

Core P#9

PU P#9

L2 (512KB)

L1 (32KB)

Core P#10

PU P#10

L2 (512KB)

L1 (32KB)

Core P#11

PU P#11

NUMANode P#2 (64GB)

L3 (16MB)

L2 (512KB)

L1 (32KB)

Core P#12

PU P#12

L2 (512KB)

L1 (32KB)

Core P#13

PU P#13

L2 (512KB)

L1 (32KB)

Core P#14

PU P#14

L2 (512KB)

L1 (32KB)

Core P#15

PU P#15

NUMANode P#3 (64GB)

L3 (16MB)

L2 (512KB)

L1 (32KB)

Core P#16

PU P#16

L2 (512KB)

L1 (32KB)

Core P#17

PU P#17

L2 (512KB)

L1 (32KB)

Core P#18

PU P#18

L2 (512KB)

L1 (32KB)

Core P#19

PU P#19

NUMANode P#4 (64GB)

L3 (16MB)

L2 (512KB)

L1 (32KB)

Core P#20

PU P#20

L2 (512KB)

L1 (32KB)

Core P#21

PU P#21

L2 (512KB)

L1 (32KB)

Core P#22

PU P#22

L2 (512KB)

L1 (32KB)

Core P#23

PU P#23

NUMANode P#5 (64GB)

L3 (16MB)

L2 (512KB)

L1 (32KB)

Core P#24

PU P#24

L2 (512KB)

L1 (32KB)

Core P#25

PU P#25

L2 (512KB)

L1 (32KB)

Core P#26

PU P#26

L2 (512KB)

L1 (32KB)

Core P#27

PU P#27

NUMANode P#6 (64GB)

L3 (16MB)

L2 (512KB)

L1 (32KB)

Core P#28

PU P#28

L2 (512KB)

L1 (32KB)

Core P#29

PU P#29

L2 (512KB)

L1 (32KB)

Core P#30

PU P#30

L2 (512KB)

L1 (32KB)

Core P#31

PU P#31

NUMANode P#7 (64GB)

Package P#1

L3 (16MB)

L2 (512KB)

L1 (32KB)

Core P#32

PU P#32

L2 (512KB)

L1 (32KB)

Core P#33

PU P#33

L2 (512KB)

L1 (32KB)

Core P#34

PU P#34

L2 (512KB)

L1 (32KB)

Core P#35

PU P#35

NUMANode P#8 (64GB)

L3 (16MB)

L2 (512KB)

L1 (32KB)

Core P#36

PU P#36

L2 (512KB)

L1 (32KB)

Core P#37

PU P#37

L2 (512KB)

L1 (32KB)

Core P#38

PU P#38

L2 (512KB)

L1 (32KB)

Core P#39

PU P#39

NUMANode P#9 (64GB)

L3 (16MB)

L2 (512KB)

L1 (32KB)

Core P#40

PU P#40

L2 (512KB)

L1 (32KB)

Core P#41

PU P#41

L2 (512KB)

L1 (32KB)

Core P#42

PU P#42

L2 (512KB)

L1 (32KB)

Core P#43

PU P#43

NUMANode P#10 (64GB)

L3 (16MB)

L2 (512KB)

L1 (32KB)

Core P#44

PU P#44

L2 (512KB)

L1 (32KB)

Core P#45

PU P#45

L2 (512KB)

L1 (32KB)

Core P#46

PU P#46

L2 (512KB)

L1 (32KB)

Core P#47

PU P#47

NUMANode P#11 (64GB)

L3 (16MB)

L2 (512KB)

L1 (32KB)

Core P#48

PU P#48

L2 (512KB)

L1 (32KB)

Core P#49

PU P#49

L2 (512KB)

L1 (32KB)

Core P#50

PU P#50

L2 (512KB)

L1 (32KB)

Core P#51

PU P#51

NUMANode P#12 (64GB)

L3 (16MB)

L2 (512KB)

L1 (32KB)

Core P#52

PU P#52

L2 (512KB)

L1 (32KB)

Core P#53

PU P#53

L2 (512KB)

L1 (32KB)

Core P#54

PU P#54

L2 (512KB)

L1 (32KB)

Core P#55

PU P#55

NUMANode P#13 (64GB)

L3 (16MB)

L2 (512KB)

L1 (32KB)

Core P#56

PU P#56

L2 (512KB)

L1 (32KB)

Core P#57

PU P#57

L2 (512KB)

L1 (32KB)

Core P#58

PU P#58

L2 (512KB)

L1 (32KB)

Core P#59

PU P#59

NUMANode P#14 (64GB)

L3 (16MB)

L2 (512KB)

L1 (32KB)

Core P#60

PU P#60

L2 (512KB)

L1 (32KB)

Core P#61

PU P#61

L2 (512KB)

L1 (32KB)

Core P#62

PU P#62

L2 (512KB)

L1 (32KB)

Core P#63

PU P#63

NUMANode P#15 (64GB)

AMD EPYC 7452

Machine (192GB total)

Package P#0

L3 (25MB)

L2 (1024KB)

L1 (32KB)

Core P#0

PU P#0

L2 (1024KB)

L1 (32KB)

Core P#1

PU P#1

L2 (1024KB)

L1 (32KB)

Core P#2

PU P#2

L2 (1024KB)

L1 (32KB)

Core P#3

PU P#3

L2 (1024KB)

L1 (32KB)

Core P#4

PU P#4

L2 (1024KB)

L1 (32KB)

Core P#5

PU P#5

L2 (1024KB)

L1 (32KB)

Core P#6

PU P#6

L2 (1024KB)

L1 (32KB)

Core P#7

PU P#7

L2 (1024KB)

L1 (32KB)

Core P#8

PU P#8

L2 (1024KB)

L1 (32KB)

Core P#9

PU P#9

L2 (1024KB)

L1 (32KB)

Core P#10

PU P#10

L2 (1024KB)

L1 (32KB)

Core P#11

PU P#11

L2 (1024KB)

L1 (32KB)

Core P#12

PU P#12

L2 (1024KB)

L1 (32KB)

Core P#13

PU P#13

L2 (1024KB)

L1 (32KB)

Core P#14

PU P#14

L2 (1024KB)

L1 (32KB)

Core P#15

PU P#15

L2 (1024KB)

L1 (32KB)

Core P#16

PU P#16

L2 (1024KB)

L1 (32KB)

Core P#17

PU P#17

NUMANode P#0 (96GB)

Package P#1

L3 (25MB)

L2 (1024KB)

L1 (32KB)

Core P#18

PU P#18

L2 (1024KB)

L1 (32KB)

Core P#19

PU P#19

L2 (1024KB)

L1 (32KB)

Core P#20

PU P#20

L2 (1024KB)

L1 (32KB)

Core P#21

PU P#21

L2 (1024KB)

L1 (32KB)

Core P#22

PU P#22

L2 (1024KB)

L1 (32KB)

Core P#23

PU P#23

L2 (1024KB)

L1 (32KB)

Core P#24

PU P#24

L2 (1024KB)

L1 (32KB)

Core P#25

PU P#25

L2 (1024KB)

L1 (32KB)

Core P#26

PU P#26

L2 (1024KB)

L1 (32KB)

Core P#27

PU P#27

L2 (1024KB)

L1 (32KB)

Core P#28

PU P#28

L2 (1024KB)

L1 (32KB)

Core P#29

PU P#29

L2 (1024KB)

L1 (32KB)

Core P#30

PU P#30

L2 (1024KB)

L1 (32KB)

Core P#31

PU P#31

L2 (1024KB)

L1 (32KB)

Core P#32

PU P#32

L2 (1024KB)

L1 (32KB)

Core P#33

PU P#33

L2 (1024KB)

L1 (32KB)

Core P#34

PU P#34

L2 (1024KB)

L1 (32KB)

Core P#35

PU P#35

NUMANode P#1 (96GB)

Intel Xeon Gold 6240DAOUDI Idriss, VIROULEAU Philippe, THIBAULT Samuel, GAUTIER Thierry, AUMAGE Olivier ([email protected])sOMP: Simulating OpenMP Task-Based Applications with NUMA EffectsSeptember 23, 2020 10 / 14

Page 11: sOMP: Simulating OpenMP Task-Based Applications with NUMA ...

Results: Msize = 16384 x 16384, Bsize = 512 x 512

→ Cholesky algorithm on different architectures

-10

0

10

20

30

40

50

5 10 15 20 25 30 35

Pre

cis

ion e

rror

(%)

Number of cores

Intel Xeon Gold 6240 - Bloc size = 512 x 512

sOMPsOMP + Communications model

(c) on Intel

-10

0

10

20

30

40

50

10 20 30 40 50 60P

recis

ion e

rror

(%)

Number of cores

AMD EPYC 7452 - Bloc size = 512 x 512

sOMPsOMP + Communications model

(d) on AMD

→ less than 10% precision error with the communications-based model→ ideal for memory-bound applications→ similar results on LU algorithm

DAOUDI Idriss, VIROULEAU Philippe, THIBAULT Samuel, GAUTIER Thierry, AUMAGE Olivier ([email protected])sOMP: Simulating OpenMP Task-Based Applications with NUMA EffectsSeptember 23, 2020 11 / 14

Page 12: sOMP: Simulating OpenMP Task-Based Applications with NUMA ...

Results: Bsize = 768 x 768, on AMD machine

→ QR algorithm for different matrix sizes

-10

0

10

20

30

40

50

10 20 30 40 50 60

Pre

cis

ion e

rror

(%)

Number of cores

AMD EPYC 7452 - Matrix size = 8192 x 8192

sOMPsOMP + Communications model

-10

0

10

20

30

40

50

10 20 30 40 50 60

Pre

cis

ion e

rror

(%)

Number of cores

AMD EPYC 7452 - Matrix size = 16384 x 16384

sOMPsOMP + Communications model

→ impact of granularity and number of threads→ communications-based model is less efficient: QR is compute-bound

DAOUDI Idriss, VIROULEAU Philippe, THIBAULT Samuel, GAUTIER Thierry, AUMAGE Olivier ([email protected])sOMP: Simulating OpenMP Task-Based Applications with NUMA EffectsSeptember 23, 2020 12 / 14

Page 13: sOMP: Simulating OpenMP Task-Based Applications with NUMA ...

Summary

We developed sOMP for reproducible researches!

TiKKi→ obtaining sequential trace of application

sOMP→ simulate the application on modeled NUMA architectures

Communications-based model→ improve simulations by taking into account NUMA effects

Result→ very good precision for some applications!

DAOUDI Idriss, VIROULEAU Philippe, THIBAULT Samuel, GAUTIER Thierry, AUMAGE Olivier ([email protected])sOMP: Simulating OpenMP Task-Based Applications with NUMA EffectsSeptember 23, 2020 13 / 14

Page 14: sOMP: Simulating OpenMP Task-Based Applications with NUMA ...

Future work

Model data movements inside a cache→ introduce a level of cache (L3) in the simulator

Take into account various hardware affinity policies

Introduce other scheduling policies

Combine this work with MPI and GPUs simulations to model hybridMPI/OpenMP applications

Sources:

TiKKi: https://gitlab.inria.fr/openmp/tikki/-/wikis/home

sOMP: https://gitlab.inria.fr/idaoudi/omps/-/wikis/home

DAOUDI Idriss, VIROULEAU Philippe, THIBAULT Samuel, GAUTIER Thierry, AUMAGE Olivier ([email protected])sOMP: Simulating OpenMP Task-Based Applications with NUMA EffectsSeptember 23, 2020 14 / 14