INSTITUT DES SCIENCES ET TECHNOLOGIES
École doctorale n° 432 : Sciences des Métiers de l'Ingénieur

Doctorat européen ParisTech

THÈSE
pour obtenir le grade de docteur délivré par
l'École nationale supérieure des mines de Paris
Spécialité « Informatique temps-réel, robotique et automatique »

présentée et soutenue publiquement par
Dounia KHALDI
le 27 novembre 2013

Parallélisation automatique et statique de tâches sous contraintes de ressources – une approche générique –
Automatic Resource-Constrained Static Task Parallelization – A Generic Approach –

Directeur de thèse : François IRIGOIN
Encadrement de la thèse : Corinne ANCOURT

Jury
Corinne ANCOURT, Chargée de recherche, CRI, MINES ParisTech, Examinateur
Cédric BASTOUL, Professeur, Université de Strasbourg, Examinateur
Henri-Pierre CHARLES, Ingénieur chercheur en informatique, CEA, Rapporteur
François IRIGOIN, Directeur de recherche, CRI, MINES ParisTech, Directeur de thèse
Paul H J KELLY, Professeur, SPO, Imperial College London, Rapporteur

MINES ParisTech, Centre de recherche en informatique
35, rue Saint-Honoré, 77305 Fontainebleau Cedex, France



Abstract

This thesis intends to show how to efficiently exploit the parallelism present in applications in order to enjoy the performance benefits that multiprocessors can provide, using a new automatic task parallelization methodology for compilers. The key characteristics we focus on are resource constraints and static scheduling. This methodology includes the techniques required to decompose applications into tasks and generate equivalent parallel code, using a generic approach that targets different parallel languages, and thus different architectures. We apply this methodology in the existing tool PIPS, a comprehensive source-to-source compilation platform.

This thesis mainly focuses on three issues. First, since extracting task parallelism from sequential codes is a scheduling problem, we design and implement an efficient automatic scheduling algorithm, called BDSC, for parallelism detection; the result is a scheduled SDG, a new task graph data structure. In a second step, we design a new generic parallel intermediate representation extension, called SPIRE, in which parallelized code may be expressed. Finally, we wrap up our goal of automatic parallelization in a new BDSC- and SPIRE-based parallel code generator, which is integrated within the PIPS compiler framework. It targets both shared and distributed memory systems, using automatically generated OpenMP and MPI code.

Résumé

Le but de cette thèse est d'exploiter efficacement le parallélisme présent dans les applications informatiques séquentielles afin de bénéficier des performances fournies par les multiprocesseurs, en utilisant une nouvelle méthodologie pour la parallélisation automatique des tâches au sein des compilateurs. Les caractéristiques clés de notre approche sont la prise en compte des contraintes de ressources et le caractère statique de l'ordonnancement des tâches. Notre méthodologie contient les techniques nécessaires pour la décomposition des applications en tâches et la génération de code parallèle équivalent, en utilisant une approche générique qui vise différents langages et architectures parallèles. Nous implémentons cette méthodologie dans le compilateur source-à-source PIPS. Cette thèse répond principalement à trois questions. Primo, comme l'extraction du parallélisme de tâches des codes séquentiels est un problème d'ordonnancement, nous concevons et implémentons un algorithme d'ordonnancement efficace, que nous nommons BDSC, pour la détection du parallélisme ; le résultat est un SDG ordonnancé, qui est une nouvelle structure de données de graphe de tâches. Secondo, nous proposons une nouvelle extension générique des représentations intermédiaires séquentielles en des représentations intermédiaires parallèles, que nous nommons SPIRE, pour la représentation des codes parallèles. Enfin, nous développons, en utilisant BDSC et SPIRE, un générateur de code que nous intégrons dans PIPS. Ce générateur de code cible les systèmes à mémoire partagée et à mémoire distribuée via des codes OpenMP et MPI générés automatiquement.

Contents

Abstract

Résumé

1 Introduction
  1.1 Context
  1.2 Motivation
  1.3 Contributions
  1.4 Thesis Outline

2 Perspectives: Bridging Parallel Architectures and Software Parallelism
  2.1 Parallel Architectures
    2.1.1 Multiprocessors
    2.1.2 Memory Models
  2.2 Parallelism Paradigms
  2.3 Mapping Paradigms to Architectures
  2.4 Program Dependence Graph (PDG)
    2.4.1 Control Dependence Graph (CDG)
    2.4.2 Data Dependence Graph (DDG)
  2.5 List Scheduling
    2.5.1 Background
    2.5.2 Algorithm
  2.6 PIPS: Automatic Parallelizer and Code Transformation Framework
    2.6.1 Transformer Analysis
    2.6.2 Precondition Analysis
    2.6.3 Array Region Analysis
    2.6.4 "Complexity" Analysis
    2.6.5 PIPS (Sequential) IR
  2.7 Conclusion

3 Concepts in Task Parallel Programming Languages
  3.1 Introduction
  3.2 Mandelbrot Set Computation
  3.3 Task Parallelism Issues
    3.3.1 Task Creation
    3.3.2 Synchronization
    3.3.3 Atomicity
  3.4 Parallel Programming Languages
    3.4.1 Cilk
    3.4.2 Chapel
    3.4.3 X10 and Habanero-Java
    3.4.4 OpenMP
    3.4.5 MPI
    3.4.6 OpenCL
  3.5 Discussion and Comparison
  3.6 Conclusion

4 SPIRE: A Generic Sequential to Parallel Intermediate Representation Extension Methodology
  4.1 Introduction
  4.2 SPIRE, a Sequential to Parallel IR Extension
    4.2.1 Design Approach
    4.2.2 Execution
    4.2.3 Synchronization
    4.2.4 Data Distribution
  4.3 SPIRE Operational Semantics
    4.3.1 Sequential Core Language
    4.3.2 SPIRE as a Language Transformer
    4.3.3 Rewriting Rules
  4.4 Validation: SPIRE Application to LLVM IR
  4.5 Related Work: Parallel Intermediate Representations
  4.6 Conclusion

5 BDSC: A Memory-Constrained, Number of Processor-Bounded Extension of DSC
  5.1 Introduction
  5.2 The DSC List-Scheduling Algorithm
    5.2.1 The DSC Algorithm
    5.2.2 Dominant Sequence Length Reduction Warranty
  5.3 BDSC: A Resource-Constrained Extension of DSC
    5.3.1 DSC Weaknesses
    5.3.2 Resource Modeling
    5.3.3 Resource Constraint Warranty
    5.3.4 Efficient Task-to-Cluster Mapping
    5.3.5 The BDSC Algorithm
  5.4 Related Work: Scheduling Algorithms
  5.5 Conclusion

6 BDSC-Based Hierarchical Task Parallelization
  6.1 Introduction
  6.2 Hierarchical Sequence Dependence DAG Mapping
    6.2.1 Sequence Dependence DAG (SDG)
    6.2.2 Hierarchical SDG Mapping
  6.3 Sequential Cost Models Generation
    6.3.1 From Convex Polyhedra to Ehrhart Polynomials
    6.3.2 From Polynomials to Values
  6.4 Reachability Analysis: The Path Transformer
    6.4.1 Path Definition
    6.4.2 Path Transformer Algorithm
    6.4.3 Operations on Regions using Path Transformers
  6.5 BDSC-Based Hierarchical Scheduling (HBDSC)
    6.5.1 Closure of SDGs
    6.5.2 Recursive Top-Down Scheduling
    6.5.3 Iterative Scheduling for Resource Optimization
    6.5.4 Complexity of HBDSC Algorithm
    6.5.5 Parallel Cost Models
  6.6 Related Work: Task Parallelization Tools
  6.7 Conclusion

7 SPIRE-Based Parallel Code Transformations and Generation
  7.1 Introduction
  7.2 Parallel Unstructured to Structured Transformation
    7.2.1 Structuring Parallelism
    7.2.2 Hierarchical Parallelism
  7.3 From Shared Memory to Distributed Memory Transformation
    7.3.1 Related Work: Communications Generation
    7.3.2 Difficulties
    7.3.3 Equilevel and Hierarchical Communications
    7.3.4 Equilevel Communications Insertion Algorithm
    7.3.5 Hierarchical Communications Insertion Algorithm
    7.3.6 Communications Insertion Main Algorithm
    7.3.7 Conclusion and Future Work
  7.4 Parallel Code Generation
    7.4.1 Mapping Approach
    7.4.2 OpenMP Generation
    7.4.3 SPMDization: MPI Generation
  7.5 Conclusion

8 Experimental Evaluation with PIPS
  8.1 Introduction
  8.2 Implementation of HBDSC- and SPIRE-Based Parallelization Processes in PIPS
    8.2.1 Preliminary Passes
    8.2.2 Task Parallelization Passes
    8.2.3 OpenMP Related Passes
    8.2.4 MPI Related Passes: SPMDization
  8.3 Experimental Setting
    8.3.1 Benchmarks: ABF, Harris, Equake, IS and FFT
    8.3.2 Experimental Platforms: Austin and Cmmcluster
    8.3.3 Effective Cost Model Parameters
  8.4 Protocol
  8.5 BDSC vs. DSC
    8.5.1 Experiments on Shared Memory Systems
    8.5.2 Experiments on Distributed Memory Systems
  8.6 Scheduling Robustness
  8.7 Faust Parallel Scheduling vs. BDSC
    8.7.1 OpenMP Version in Faust
    8.7.2 Faust Programs: Karplus32 and Freeverb
    8.7.3 Experimental Comparison
  8.8 Conclusion

9 Conclusion
  9.1 Contributions
  9.2 Future Work

List of Tables

2.1 Multicores proliferation
2.2 Execution time estimations for Harris functions using PIPS complexity analysis
3.1 Summary of parallel languages constructs
4.1 Mapping of SPIRE to parallel languages constructs (terms in parentheses are not currently handled by SPIRE)
6.1 Execution and communication time estimations for Harris using PIPS default cost model (N and M variables represent the input image size)
6.2 Comparison summary between different parallelization tools
8.1 CPI for the Austin and Cmmcluster machines, plus the transfer cost of one byte β on the Cmmcluster machine (in cycles)
8.2 Run-time sensitivity of BDSC with respect to static cost estimation (in ms for Harris and ABF; in s for equake and IS)

List of Figures

1.1 Sequential C implementation of the main function of Harris
1.2 Harris algorithm data flow graph
1.3 Thesis outline: blue indicates thesis contributions; an ellipse, a process; and a rectangle, results
2.1 A typical multiprocessor architectural model
2.2 Memory models
2.3 Rewriting of data parallelism using the task parallelism model
2.4 Construction of the control dependence graph
2.5 Example of a C code and its data dependence graph
2.6 A Directed Acyclic Graph (left) and its associated data (right)
2.7 Example of transformer analysis
2.8 Example of precondition analysis
2.9 Example of array region analysis
2.10 Simplified Newgen definitions of the PIPS IR
3.1 Result of the Mandelbrot set
3.2 Sequential C implementation of the Mandelbrot set
3.3 Cilk implementation of the Mandelbrot set (P is the number of processors)
3.4 Chapel implementation of the Mandelbrot set
3.5 X10 implementation of the Mandelbrot set
3.6 A clock in X10 (left) and a phaser in Habanero-Java (right)
3.7 C OpenMP implementation of the Mandelbrot set
3.8 MPI implementation of the Mandelbrot set (P is the number of processors)
3.9 OpenCL implementation of the Mandelbrot set
3.10 A hide-and-seek game (X10, HJ, Cilk)
3.11 Example of an atomic directive in OpenMP
3.12 Data race on ptr with Habanero-Java
4.1 OpenCL example illustrating a parallel loop
4.2 forall in Chapel and its SPIRE core language representation
4.3 A C code and its unstructured parallel control flow graph representation
4.4 parallel sections in OpenMP and its SPIRE core language representation
4.5 OpenCL example illustrating spawn and barrier statements
4.6 Cilk and OpenMP examples illustrating an atomically-synchronized statement
4.7 X10 example illustrating a future task and its synchronization
4.8 A phaser in Habanero-Java and its SPIRE core language representation
4.9 Example of Pipeline Parallelism with phasers
4.10 MPI example illustrating a communication and its SPIRE core language representation
4.11 SPIRE core language representation of a non-blocking send
4.12 SPIRE core language representation of a non-blocking receive
4.13 SPIRE core language representation of a broadcast
4.14 Stmt and SPIRE(Stmt) syntaxes
4.15 Stmt sequential transition rules
4.16 SPIRE(Stmt) synchronized transition rules
4.17 if statement rewriting using while loops
4.18 Simplified Newgen definitions of the LLVM IR
4.19 A loop in C and its LLVM intermediate representation
4.20 SPIRE (LLVM IR)
4.21 An example of a spawned outlined sequence
5.1 A Directed Acyclic Graph (left) and its scheduling (right); starred top levels (*) correspond to the selected clusters
5.2 Result of DSC on the graph in Figure 5.1 without (left) and with (right) DSRW
5.3 A DAG amenable to cluster minimization (left) and its BDSC step-by-step scheduling (right)
5.4 DSC (left) and BDSC (right) cluster allocation
6.1 Abstract syntax trees: Statement syntax
6.2 Parallelization process: blue indicates thesis contributions; an ellipse, a process; and a rectangle, results
6.3 Example of a C code (left) and the DDG D of its internal S sequence (right)
6.4 SDGs of S (top) and S0 (bottom) computed from the DDG (see the right of Figure 6.3); S and S0 are specified in the left of Figure 6.3
6.5 SDG G for a part of equake S given in Figure 6.7; Gbody is the SDG of the body Sbody (logically included into G via H)
6.6 Example of the execution time estimation for the function MultiplY of the code Harris; each comment provides the complexity estimation of the statement below (N and M are assumed to be global variables)
6.7 Instrumented part of equake (Sbody is the inner loop sequence)
6.8 Numerical results of the instrumented part of equake (instrumented equake.in)
6.9 An example of a subtree of root forloop0
6.10 An example of a path execution trace from Sb to Se in the subtree forloop0
6.11 An example of a C code and the path transformer of the sequence between Sb and Se; M is ⊥ since there are no loops in the code
6.12 An example of a C code and the path transformer of the sequence of calls and test between Sb and Se; M is ⊥ since there are no loops in the code
6.13 An example of a C code and the path transformer between Sb and Se
6.14 Closure of the SDG G0 of the C code S0 in Figure 6.3, with additional import entry vertices
6.15 topsort(G) for the hierarchical scheduling of sequences
6.16 Scheduled SDG for Harris, using P=3 cores; the scheduling information via cluster(τ) is also printed inside each vertex of the SDG
6.17 After the first application of BDSC on S = sequence(S0;S1), failing with "Not enough memory"
6.18 Hierarchically (via H) Scheduled SDGs with memory resource minimization after the second application of BDSC (which succeeds) on sequence(S0;S1) (κi = cluster(σ(Si))); to keep the picture readable, only communication edges are figured in these SDGs
7.1 Parallel code transformations and generation: blue indicates this chapter contributions; an ellipse, a process; and a rectangle, results
7.2 Unstructured parallel control flow graph (a) and event (b) and structured representations (c)
7.3 A part of Harris and one possible SPIRE core language representation
7.4 SPIRE representation of a part of Harris, non optimized (left) and optimized (right)
7.5 A simple C code (hierarchical) and its SPIRE representation
7.6 An example of a shared memory C code and its SPIRE code efficient communication primitive-based distributed memory equivalent
7.7 Two examples of C code with non-uniform dependences
7.8 An example of a C code and its read and write array regions analysis for two communications from S0 to S1 and to S2
7.9 An example of a C code and its in and out array regions analysis for two communications from S0 to S1 and to S2
7.10 Scheme of communications between multiple possible levels of hierarchy in a parallel code; the levels are specified with the superscripts on τ's
7.11 Equilevel and hierarchical communications
7.12 A part of an SDG and the topological sort order of predecessors of τ4
7.13 Communications generation from a region using the region_to_coms function
7.14 Equilevel and hierarchical communications generation and SPIRE distributed memory representation of a C code
7.15 Graph illustration of communications made in the C code of Figure 7.14; equilevel in green arcs, hierarchical in black
7.16 Equilevel and hierarchical communications generation after optimizations: the current cluster executes the last nested spawn (left), and barriers with one spawn statement are removed (right)
7.17 SPIRE representation of a part of Harris and its OpenMP task generated code
7.18 SPIRE representation of a C code and its OpenMP hierarchical task generated code
7.19 SPIRE representation of a part of Harris and its MPI generated code
8.1 Parallelization process: blue indicates thesis contributions; an ellipse, a pass or a set of passes; and a rectangle, results
8.2 Executable (harris.tpips) for harris.c
8.3 A model of an MPI generated code
8.4 Scheduled SDG of the main function of ABF
8.5 Steps for the rewriting of data parallelism to task parallelism using the tiling and unrolling transformations
8.6 ABF and equake speedups with OpenMP
8.7 Speedups with OpenMP: impact of tiling (P=3)
8.8 Speedups with OpenMP for different class sizes (IS)
8.9 ABF and equake speedups with MPI
8.10 Speedups with MPI: impact of tiling (P=3)
8.11 Speedups with MPI for different class sizes (IS)
8.12 Run-time comparison (BDSC vs. Faust for Karplus32)
8.13 Run-time comparison (BDSC vs. Faust for Freeverb)

Chapter 1

Introduction

I live and don't know how long, I'll die and don't know when, I am going and don't know where, I wonder that I am happy. Martinus Von Biberach

1.1 Context

Moore's law [81] states that, over the history of computing hardware, the number of transistors on integrated circuits doubles approximately every two years. Recently, the shift from single-core microprocessors to parallel machine designs was driven partly by the growth in power consumption due to the high clock speeds required to improve performance in single-processor (or core) chips. The transistor count is still increasing, in order to integrate more cores and ensure proportional performance scaling.

In spite of the continued validity of Moore's law, the performance of individual cores stopped increasing after 2003. Thus, the scalability of applications, which calls for increased performance when resources are added, is not guaranteed. To understand this issue, it is necessary to take a look at Amdahl's law [16]. This law states that the speedup of a program using multiple processors is bounded by the time needed for its sequential part; in addition to the number of processors, the algorithm itself also limits the speedup.
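
In formula form (stated here for reference, with p denoting the parallelizable fraction of the sequential execution time and N the number of processors):

  speedup(N) = 1 / ((1 − p) + p/N)  ≤  1 / (1 − p)

so even with an unlimited number of processors, the speedup is capped by the purely sequential fraction 1 − p.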

In order to enjoy the performance benefits that multiprocessors can provide, one should exploit efficiently the parallelism present in applications. This is a tricky task for programmers, especially if the programmer is a physicist or a mathematician, or a computer scientist for whom understanding the application is difficult because it is outside their own field. Of course, we could tell the programmer to "think parallel", but humans tend to think sequentially. Therefore, one long-standing problem is, and will remain for some time, to detect parallelism in a sequential code and, automatically or not, write its equivalent efficient parallel code.

A parallel application expressed in a parallel programming model should be as portable as possible, i.e., one should not have to rewrite parallel code when targeting another machine. Yet the proliferation of parallel programming models makes the choice of one general model not obvious, unless there is a parallel language that is efficient and can be compiled to all types of current and future architectures, which is not yet the case. Therefore, the writing of parallel code needs to be based on a generic approach, i.e., on how well a range of different languages can be handled, in order to offer some chance of portability.


1.2 Motivation

In order to achieve good performance when programming for multiprocessors, and to go beyond having to "think parallel", a platform for automatic parallelization is required to exploit cores efficiently. Also, the proliferation of multi-core processors with shorter pipelines and lower clock rates, and the pressure that the simpler data parallelism model imposes on their memory bandwidth, have made coarse-grained parallelism inevitable for improving performance.

The first step in a parallelization process is to expose concurrency by decomposing applications into pipelined tasks, since an algorithm is often described as a collection of tasks. To illustrate the importance of task parallelism, we give a simple example: the Harris algorithm [94], which is based on pixelwise autocorrelation, using a chain of functions, namely Sobel, Multiplication, Gauss and Coarsity. Figure 1.1 shows this process. Several works have already parallelized and mapped this algorithm onto parallel architectures such as the CELL processor [93].

Figure 1.2 shows a possible instance of a manual partitioning of Harris (see also [93]). The Harris algorithm can be decomposed into a succession of two to three concurrent tasks (ellipses in the figure), assuming no exploitation of data parallelism. Automating this intuitive approach, and thus the automatic task extraction and parallelization of such applications, is our motivation in this thesis.

However, detecting task parallelism and generating efficient code rest on complex analyses of data dependences, communication, synchronization, etc. These different analyses can be provided by compilation frameworks. Thus, delegating the issue of "parallel thinking" to software is appealing, because it can be built upon existing sophisticated analyses of data dependences, and can handle the granularity present in sequential codes and the resource constraints, such as memory size or the number of CPUs, that also impact this process.

Automatic task parallelization has been studied for almost half a century. The problem of extracting an optimal parallelism with optimal communications is NP-complete [46]. Several works have tried to automate the parallelization of programs using different granularities. Metis [61] and other graph partitioning tools aim at assigning the same amount of processor work with small quantities of interprocessor communication, but the structure of the graph does not differentiate between loops, function calls, etc.; the crucial steps of graph construction and parallel code generation are missing. Sarkar [96] implements a compile-time method for the partitioning problem for multiprocessors: a program is partitioned into parallel tasks at compile time, and then these are merged until a partition with the smallest parallel execution time in the presence of overhead (scheduling and communication overhead) is found. Unfortunately, this algorithm does not address resource constraints, which are important factors for targeting real architectures.


void main(int argc, char *argv[])
{
  float (*Gx)[N*M], (*Gy)[N*M], (*Ixx)[N*M],
        (*Iyy)[N*M], (*Ixy)[N*M], (*Sxx)[N*M],
        (*Sxy)[N*M], (*Syy)[N*M], (*in)[N*M];

  in = InitHarris();
  /* Now we run the Harris procedure */
  /* Sobel */
  SobelX(Gx, in);
  SobelY(Gy, in);
  /* Multiply */
  MultiplY(Ixx, Gx, Gx);
  MultiplY(Iyy, Gy, Gy);
  MultiplY(Ixy, Gx, Gy);
  /* Gauss */
  Gauss(Sxx, Ixx);
  Gauss(Syy, Iyy);
  Gauss(Sxy, Ixy);
  /* Coarsity */
  CoarsitY(out, Sxx, Syy, Sxy);
  return;
}

Figure 1.1: Sequential C implementation of the main function of Harris

[Figure: data flow graph of Harris; in feeds SobelX and SobelY, which produce Gx and Gy; three MultiplY nodes produce Ixx, Ixy and Iyy; three Gauss nodes produce Sxx, Sxy and Syy; Coarsity combines them into out.]

Figure 1.2: Harris algorithm data flow graph
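
To make this manual partitioning concrete, the following fragment sketches how the independent stages of Figure 1.2 could be expressed with OpenMP sections (a hand-written illustration added here, not code generated by the thesis; it reuses the declarations of Figure 1.1):

  /* Sobel stage: the two gradients are independent */
  #pragma omp parallel sections
  {
    #pragma omp section
    SobelX(Gx, in);
    #pragma omp section
    SobelY(Gy, in);
  }
  /* Multiplication stage: the three products are independent */
  #pragma omp parallel sections
  {
    #pragma omp section
    MultiplY(Ixx, Gx, Gx);
    #pragma omp section
    MultiplY(Iyy, Gy, Gy);
    #pragma omp section
    MultiplY(Ixy, Gx, Gy);
  }
  /* the Gauss and Coarsity stages follow the same pattern */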

All parallelization tools are dedicated to a particular programming model: there is a lack of a generic abstraction of parallelism (multithreading, synchronization and data distribution). They thus do not usually address the issue of portability.

In this thesis, we develop an automatic task parallelization methodology for compilers; the key characteristics we focus on are resource constraints and static scheduling. It includes the techniques required to decompose applications into tasks and generate equivalent parallel code, using a generic approach that targets different parallel languages, and thus different architectures. We apply this methodology in the existing tool PIPS [58], a comprehensive source-to-source compilation platform.

1.3 Contributions

Our goal is to develop a new tool and algorithms for automatic task parallelization that take into account resource constraints and which can also be of general use for existing parallel languages. For this purpose, the main contributions of this thesis are:

1. a new BDSC-based hierarchical scheduling algorithm (HBDSC) that uses:

   • a new data structure, called the Sequence Data Dependence Graph (SDG), to represent partitioned parallel programs;

   • "Bounded DSC" (BDSC), an extension of the DSC [108] scheduling algorithm that simultaneously handles two resource constraints, namely a bounded amount of memory per processor and a bounded number of processors, which are key parameters when scheduling tasks on actual multiprocessors;

   • a new cost model based on execution time complexity estimation, convex polyhedral approximations of data array sizes and code instrumentation, for the labeling of SDG vertices and edges;

2. a new approach for the adaptation of automatic parallelization platforms to parallel languages, via:

   • SPIRE, a new, simple parallel intermediate representation (IR) extension methodology for designing the parallel IRs used in compilation frameworks;

   • the deployment of SPIRE for the automatic task-level parallelization of explicitly parallel programs on the PIPS [58] compilation framework;

3. an implementation, in the PIPS source-to-source compilation framework, of:

   • BDSC-based parallelization for programs encoded using the SPIRE-based PIPS IR;

   • SPIRE-based parallel code generation into two parallel languages: OpenMP [4] and MPI [3];

4. performance measurements for our parallelization approach, based on:

   • five significant programs targeting both shared and distributed memory architectures: the image and signal processing benchmarks (actually sample constituent algorithms, hopefully representative of classes of applications) Harris [53] and ABF [51], the SPEC2001 benchmark equake [22], the NAS parallel benchmark IS [85] and an FFT code [8];

   • their automatic translations into two parallel languages: OpenMP [4] and MPI [3];

   • and, finally, a comparative study between our task parallelization implementation in PIPS and that of the audio signal processing language Faust [86], using two Faust applications: Karplus32 and Freeverb. We chose Faust since it is an open-source compiler and it also automatically generates parallel tasks in OpenMP.

1.4 Thesis Outline

This thesis is organized into nine chapters. Figure 1.3 shows how these chapters can be put into the context of a parallelization chain.

Chapter 2 provides a short background on the concepts of architectures, parallelism, languages and compilers used in the next chapters.

Chapter 3 surveys seven current and efficient parallel programming languages in order to determine and classify their parallel constructs. This helps us define a core parallel language that plays the role of a generic parallel intermediate representation when parallelizing sequential programs.

Our proposal, called the SPIRE methodology (Sequential to Parallel Intermediate Representation Extension), is developed in Chapter 4. SPIRE leverages existing infrastructures to address both control and data parallel languages, while preserving as much as possible the existing analyses for sequential codes. To validate this approach in practice, we use PIPS, a comprehensive source-to-source compilation platform, as a use case to upgrade its sequential intermediate representation to a parallel one.

Since the main goal of this thesis is automatic task parallelization, extracting task parallelism from sequential codes is a key issue in this process, which we view as a scheduling problem. Chapter 5 introduces a new efficient automatic scheduling heuristic, called BDSC (Bounded Dominant Sequence Clustering), for parallel programs in the presence of resource constraints on the number of processors and their local memory size.

The parallelization process we introduce in this thesis uses BDSC to find a good scheduling of program tasks on target machines and SPIRE to generate parallel source code. Chapter 6 equips BDSC with a sophisticated cost model to yield a new parallelization algorithm.

Chapter 7 describes how we can generate equivalent parallel codes in two target languages, OpenMP and MPI, selected from the survey presented in Chapter 3. Thanks to SPIRE, we show how code generation is simple and efficient.

To verify the efficiency and robustness of our work, experimental results are presented in Chapter 8. They suggest that BDSC- and SPIRE-based parallelization, focused on efficient resource management, leads to significant parallelization speedups on both shared and distributed memory systems. We also compare our implementation of the generation of omp task in PIPS with the generation of omp sections in Faust.

We conclude in Chapter 9.


[Figure: the parallelization chain; a C source code goes through the PIPS analyses (Chapter 2), which produce complexities, convex array regions, the sequential IR and the DDG; the BDSC-based parallelizer (Chapters 5 and 6) produces SPIRE(PIPS IR) (Chapter 4); the OpenMP and MPI generation phases (Chapter 7) emit OpenMP and MPI code (Chapter 3), which is run to obtain the performance results (Chapter 8).]

Figure 1.3: Thesis outline: blue indicates thesis contributions; an ellipse, a process; and a rectangle, results

Chapter 2

Perspectives: Bridging Parallel Architectures and Software Parallelism

You talk when you cease to be at peace with your thoughts. Kahlil Gibran

"Anyone can build a fast CPU. The trick is to build a fast system." Attributed to Seymour Cray, this quote is even more pertinent when looking at multiprocessor systems that contain several fast processing units: parallel system architectures introduce subtle system features to achieve good performance. Real world applications, which operate on large amounts of data, must be able to deal with constraints such as memory requirements, code size and processor features. These constraints must also be addressed by parallelizing compilers, which deal with such applications from the domains of scientific, signal and image processing and translate sequential codes into efficient parallel ones. The multiplication of hardware intricacies increases the importance of software in achieving adequate performance.

This thesis was carried out under this perspective. Our goal is to develop a prototype for automatic task parallelization that generates a parallel version of an input sequential code using a scheduling algorithm. We also address code generation and parallel intermediate representation issues. More generally, we study how parallelism can be detected, represented and finally generated.

2.1 Parallel Architectures

Up to 2003, the performance of machines was improved by increasing clock frequencies. Yet relying on single-thread performance has been increasingly disappointing, due to the access latency to main memory. Moreover, other aspects of performance have become important lately, such as power consumption, energy dissipation and number of cores. For these reasons, the development of parallel processors with a large number of cores that run at slower frequencies, to reduce energy consumption, has been adopted and has had a significant impact on performance.

The market of parallel machines is wide, from vector platforms, accelerators, Graphical Processing Units (GPUs) and dual-cores to many-cores¹ with thousands of on-chip cores. However, the gains obtained when parallelizing and executing applications are limited by input-output issues (disk read/write), data communications among processors, memory access time, load unbalance, etc. When parallelizing an application, one must detect the available parallelism and map the different parallel tasks on the parallel machine. In this dissertation, we consider a task as a static notion, i.e., a list of instructions, while processes and threads (see Section 2.1.2) are running instances of tasks.

¹ A many-core contains 32 or more cores.

In our parallelization approach, we make trade-offs between the available parallelism and communications/locality; this latter point is related to the memory model of the target machine. Besides, the extracted parallelism should be correlated to the number of processors/cores available in the target machine. We also take into account the memory size and the number of processors.

2.1.1 Multiprocessors

The generic model of multiprocessors is a collection of cores, each potentially including both CPU and memory, attached to an interconnection network. Figure 2.1 illustrates an example of a multiprocessor architecture that contains two clusters; each cluster is a quad-core multiprocessor. In a multicore, cores can transfer data between their cache memories (level 2 cache, generally) without having to go through the main memory. Moreover, they may or may not share caches (the CPU-local level 1 cache in Figure 2.1, for example). This is not possible at the cluster level, where processor cache memories are not linked: data can be transferred to and from the main memory only.

[Figure: two quad-core clusters connected by an interconnect; in each cluster, four CPUs with private L1 caches share an L2 cache and a main memory.]

Figure 2.1: A typical multiprocessor architectural model

Flynn [45] classifies multiprocessors into three main groups: Single Instruction Single Data (SISD), Single Instruction Multiple Data (SIMD) and Multiple Instruction Multiple Data (MIMD). Multiprocessors are generally recognized to be MIMD architectures. We present the important branches of current multiprocessors, MPSoCs and multicore processors, since they are widely used in many applications such as signal processing and multimedia.

Multiprocessors System-on-Chip

A multiprocessor system-on-chip (MPSoC) integrates multiple CPUs in a hardware system. MPSoCs are commonly used for applications such as embedded multimedia on cell phones. The Lucent Daytona [9] is often considered to be the first MPSoC. One distinguishes two important forms of MPSoCs: homogeneous and heterogeneous multiprocessors. Homogeneous architectures include multiple cores that are similar in every aspect. On the other side, heterogeneous architectures use processors that differ in terms of instruction set architecture, functionality, frequency, etc. Field Programmable Gate Array (FPGA) platforms and the CELL processor² [42] are examples of heterogeneous architectures.

Multicore Processors

This model implements the shared memory model. The programming model in a multicore processor is the same for all its CPUs. The first multicore general-purpose processors were introduced in 2005 (Intel Core Duo). Intel Xeon Phi is a recent multicore chip with 60 homogeneous cores. The number of cores on a chip is expected to continue to grow, in order to construct more powerful computers. This emerging hardware approach is expected to deliver petaflop and exaflop performance, with the efficient capabilities needed to handle high-performance emerging applications. Table 2.1 illustrates the rapid proliferation of multicores and summarizes the important existing ones; note that IBM Sequoia, the second system in the Top500 list (top500.org) in November 2012, is a petascale Blue Gene/Q supercomputer that contains 98,304 compute nodes, where each compute node is a 16-core PowerPC A2 processor chip (sixth entry in the table). See [25] for an extended survey of multicore processors.

In this thesis, we provide tools that automatically generate parallel programs from sequential ones, targeting homogeneous multiprocessors or multicores. Since increasing the number of cores on a single chip creates challenges with memory and cache coherence, as well as communication between the cores, our aim in this thesis is to deliver the benefit and efficiency of multicore processors through the introduction of an efficient scheduling algorithm, to map the parallel tasks of a program on CPUs, in the presence of resource constraints on the number of processors and their local memory size.

² The Cell configuration has one Power Processing Element (PPE) on the core, acting as a controller for eight physical Synergistic Processing Elements (SPEs).


Year             | Number of cores         | Company | Processor
-----------------|-------------------------|---------|------------------
The early 1970s  | The 1st microprocessor  | Intel   | 4004
2005             | 2-core machine          | Intel   | Xeon Paxville DP
2005             | 2                       | AMD     | Opteron E6-265
2009             | 6                       | AMD     | Phenom II X6
2011             | 8                       | Intel   | Xeon E7-2820
2011             | 16                      | IBM     | PowerPC A2
2007             | 64                      | Tilera  | TILE64
2012             | 60                      | Intel   | Xeon Phi

Table 2.1: Multicores proliferation

2.1.2 Memory Models

The choice of a proper memory model to express parallel programs is an important issue in parallel language design. Indeed, the ways processes and threads communicate on the target architecture, and how this impacts the programmer's computation specifications, affect both performance and ease of programming. There are currently three main approaches (see Figure 2.2, extracted from [63]).

[Figure: the three memory models: in shared memory, threads or processes access a single address space; in distributed memory, each process has its own address space and exchanges messages; in PGAS, a global address space is partitioned among threads.]

Figure 2.2: Memory models

Shared Memory

Also called global address space, this model is the simplest one to use [11]. Here, the address spaces of the threads are mapped onto the global memory; no explicit data passing between threads is needed. However, synchronization is required between the threads that are writing and reading the same data to and from the shared memory. The OpenMP [4] and Cilk [26] languages use the shared memory model. Intel's Larrabee [99] is a homogeneous, general-purpose many-core machine with cache-coherent shared memory.
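
As a small illustration of this model (a sketch added here, not taken from the thesis; the array a and its size n are assumed to be declared elsewhere), the following C/OpenMP fragment lets several threads update the same shared variable, with an atomic directive providing the required synchronization:

  int sum = 0;                    /* shared by all threads */
  #pragma omp parallel for
  for (int i = 0; i < n; i++) {
    #pragma omp atomic            /* synchronizes concurrent writes to sum */
    sum += a[i];
  }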

Distributed Memory

In a distributed-memory environment, each processor is only able to address its own memory. The message passing model, which uses communication libraries, is usually adopted to write parallel programs for distributed-memory systems. These libraries provide routines to initiate and configure the messaging environment, as well as to send and receive data packets. Currently, the most popular high-level message-passing system for scientific and engineering applications is MPI (Message Passing Interface) [3]. OpenCL [69] uses a variation of the message passing memory model for GPUs. The Intel Single-chip Cloud Computer (SCC) [55] is a homogeneous, general-purpose many-core (48 cores) chip implementing the message passing memory model.
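
For example, a minimal point-to-point exchange with MPI in C could look as follows (a sketch added here; the buffer size and message tag are arbitrary, and argc/argv are those of main):

  #include <mpi.h>

  double buf[100];
  int rank;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0)        /* process 0 sends its buffer to process 1 */
    MPI_Send(buf, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
  else if (rank == 1)   /* process 1 receives it into its own memory */
    MPI_Recv(buf, 100, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  MPI_Finalize();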

Partitioned Global Address Space (PGAS)

PGAS-based languages combine the programming convenience of shared memory with the performance control of message passing, by logically partitioning a global address space into places, each thread being local to a place. From the programmer's point of view, programs have a single address space, and one task of a given thread may refer directly to the storage of a different thread. UPC [33], Fortress [13], Chapel [6], X10 [5] and Habanero-Java [30] use the PGAS memory model.

In this thesis, we target both shared and distributed memory systems, since efficiently allocating the tasks of an application on a target architecture requires reducing communication overheads and transfer costs for both shared and distributed memory architectures. By convention, we call the running instances of tasks processes in the case of distributed memory systems, and threads for shared memory and PGAS systems.

If reducing communications is obviously meaningful for distributed memory systems, it is also worthwhile on shared memory architectures, since this may impact cache behavior. Also, locality is another important issue: a parallelization process should be able to make trade-offs between locality in cache memory and communications. Indeed, mapping two tasks to the same processor keeps the data in its local memory, and even possibly its cache. This avoids copying data over the shared memory bus; therefore, transmission costs are decreased and bus contention is reduced.

Moreover, we take into account the memory size, since this is an important factor when parallelizing applications that operate on large amounts of data or when targeting embedded systems.


2.2 Parallelism Paradigms

Parallel computing has been studied since the beginning of computing. The market dominance of multi- and many-core processors, and the growing importance and increasing number of clusters in the Top500 list, are making parallelism a key concern when implementing large applications such as weather modeling [39] or nuclear simulations [36]. These important applications require a lot of computational power and thus need to be programmed to run on parallel supercomputers.

Expressing parallelism is the fundamental issue when breaking an application into concurrent parts in order to take advantage of a parallel computer and to execute these parts simultaneously on different CPUs. Parallel architectures support multiple levels of parallelism within a single chip: the PE (Processing Element) level [64], the thread level (e.g., Simultaneous MultiThreading (SMT)), the data level (e.g., Single Instruction Multiple Data (SIMD)), the instruction level (e.g., Very Long Instruction Word (VLIW)), etc. We present here the three main types of parallelism (data, task and pipeline).

Task Parallelism

This model corresponds to the distribution of the execution of processes or threads across different parallel computing cores/processors. A task can be defined as a unit of parallel work in a program. We can classify the implementation of its scheduling into two categories: dynamic and static.

• With dynamic task parallelism, tasks are dynamically created at run time and added to the work queue. The runtime scheduler is responsible for scheduling and synchronizing the tasks across the cores.

• With static task parallelism, the set of tasks is known statically and the scheduling is applied at compile time; the compiler generates and assigns the different tasks to the thread groups.
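
To make the dynamic case concrete, here is a small C/OpenMP sketch (added for illustration; work() and n are placeholders) in which tasks are created at run time and handed to the runtime scheduler:

  #pragma omp parallel
  #pragma omp single
  for (int i = 0; i < n; i++) {
    #pragma omp task firstprivate(i)  /* a new task is created at run time */
    work(i);
  }  /* tasks complete at the implicit barrier closing the single and parallel regions */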

Data Parallelism

This model implies that the same independent instruction is performed repeatedly and simultaneously on different data. Data parallelism, or loop-level parallelism, corresponds to the distribution of the data across different parallel cores/processors, dividing the loops into chunks³. Data-parallel execution puts a high pressure on the memory bandwidth of multi-core processors. Data parallelism can be expressed using task parallelism constructs; an example is provided in Figure 2.3, where loop iterations are distributed differently among the threads: in contiguous order (generally) on the left and in interleaved order on the right.

³ A chunk is a set of iterations.

forall (i = 0; i < height; ++i)
  for (j = 0; j < width; ++j)
    kernel();

(a) Data parallel example

for (i = 0; i < height; ++i)
  launchTask {
    for (j = 0; j < width; ++j)
      kernel();
  }

(b) Task parallel example

Figure 2.3: Rewriting of data parallelism using the task parallelism model

Pipeline Parallelism

Pipeline parallelism applies to chains of producers and consumers that contain many steps depending on each other. These dependences can be removed by interleaving these steps. Pipeline-parallel execution leads to reduced latency and buffering, but it introduces extra synchronization between producers and consumers, to keep them coupled in their execution.

In this thesis, we focus on task parallelism, often considered more difficult to handle than data parallelism since it lacks the regularity present in the latter model: processes (threads) run simultaneously different instructions, leading to different execution schedules and memory access patterns. Besides, the pressure that the data parallelism model puts on memory bandwidth, and the extra synchronization introduced by pipeline parallelism, have made coarse-grained parallelism inevitable for improving performance.

Moreover, data parallelism can be implemented using task parallelism (see Figure 2.3). Therefore, our implementation of task parallelism is also able to handle data parallelism, after transformation of the sequential code (see Section 8.4, which explains the protocol applied to task parallelize sequential applications).

Task management must address both control and data dependences in order to reduce execution and communication times. This is related to another key enabling issue, compilation, which we detail in the following section.

2.3 Mapping Paradigms to Architectures

In order to map efficiently a parallel program on a parallel architecture, one needs to write parallel programs in the presence of constraints of data and control dependences, communications and resources. Writing a parallel application implies the use of a target parallel language; we survey in Chapter 3 different recent and popular task parallel languages. We can split this parallelization process into two steps: understanding the sequential application, by analyzing it to extract parallelism, and then generating the equivalent parallel code. Before reaching the writing step, decisions should be made about an efficient task schedule (Chapter 5). To perform these operations automatically, a compilation framework is necessary to provide a base for analyses, implementation and experimentation. Moreover, to keep this parallel software development process simple, generic and language-independent, a general parallel core language is needed (Chapter 4).

Programming parallel machines requires a parallel programming language implementing one of the previous paradigms of parallelism. Since in this thesis we focus on extracting and generating task parallelism, we are interested in task parallel languages, which we detail in Chapter 3. We distinguish two approaches for mapping languages to parallel architectures. In the first case, the input is a parallel language: the user writes the parallel program by hand, which requires a great effort for detecting dependences and parallelism and for writing an equivalent, efficient parallel program. In the second case, the input is a sequential language: this calls for an automatic parallelizer, which is responsible for dependence analyses and scheduling, in order to generate efficient parallel code.

In this thesis, we adopt the second approach: we use the PIPS compiler, presented in Section 2.6, as a practical tool for code analysis and parallelization. We target both shared and distributed memory systems, using two parallel languages: OpenMP and MPI. They provide high-level programming constructs to express task creation, termination, synchronization, etc. In Chapter 7, we show how we generate parallel code for these two languages.

2.4 Program Dependence Graph (PDG)

Parallelism detection is based on the analyses of data and control dependences between instructions, encoded in an intermediate representation format. Parallelizers use various dependence graphs to represent such constraints. A program dependence graph (PDG) is a directed graph where the vertices represent blocks of statements and the edges represent essential control or data dependences. It encodes both control and data dependence information.

2.4.1 Control Dependence Graph (CDG)

The CDG represents a program as a graph of statements that are control dependent on the entry to the program. Control dependences represent control flow relationships that must be respected by any execution of the program, whether parallel or sequential.


Ferrante et al. [44] define control dependence as follows. Statement Y is control dependent on X when there is a path in the control flow graph (CFG) (see Figure 2.4a) from X to Y that does not contain the immediate forward dominator Z of X. Z forward dominates Y if all paths from Y include Z. The immediate forward dominator is the first forward dominator (closest to X). In the forward dominance tree (FDT) (see Figure 2.4b), vertices represent statements and edges represent the immediate forward dominance relation; the root of the tree is the exit of the CFG. For the construction of the CDG (see Figure 2.4c), one relies on explicit control flow and postdominance information. Unfortunately, when programs contain goto statements, explicit control flow and postdominance information are prerequisites for calculating control dependences.
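
As a small illustration (a fragment added here, mirroring the X, Y and Z labels of Figure 2.4), Y below is control dependent on the test X, whereas Z is not, since Z is the immediate forward dominator of X and executes on every path leaving it:

  if (c > 0) {    /* X: the test                              */
    y = f(a);     /* Y: executed only when the test is true   */
  }
  z = g(b);       /* Z: executed on all paths leaving X       */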

[Figure: (a) a control flow graph over statements 1 to 6, where statement 2 is X, statement 3 is Y and statement 5 is the immediate forward dominator Z of X; (b) the corresponding forward dominance tree rooted at the exit 6; (c) the resulting control dependence graph rooted at the entry E.]

Figure 2.4: Construction of the control dependence graph

Harrold et al. [54] have proposed an algorithm that does not require explicit control flow or postdominator information to compute exact control dependences. Indeed, for structured programs, the control dependences are obtained directly from the Abstract Syntax Tree (AST). Programs that contain what the authors call structured transfers of control, such as break, return or exit, can be handled as special cases [54]. For more unstructured transfers of control, via arbitrary gotos, additional processing based on the control flow graph needs to be performed.

In this thesis, we only handle the structured parts of a code, i.e., the ones that do not contain goto statements. Therefore, in this context, PIPS implements control dependences in its IR, since it is in the form of an AST, as above (for structured programs, the CDG and the AST are equivalent).


2.4.2 Data Dependence Graph (DDG)

The DDG is a subgraph of the program dependence graph that encodes data dependence relations between instructions that use/define the same data element. Figure 2.5 shows an example of a C code and its data dependence graph. This DDG has been generated automatically with PIPS. Data dependences in a DDG are basically of three kinds [88], as shown in the figure:

• true dependences, or true data-flow dependences, when a value of an instruction is used by a subsequent one (a.k.a. RAW), colored red in the figure;

• output dependences, between two definitions of the same item (a.k.a. WAW), colored blue in the figure;

• anti-dependences, between a use of an item and a subsequent write (a.k.a. WAR), colored green in the figure.
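
A minimal illustration (a three-statement fragment added here, distinct from the code of Figure 2.5):

  x = a + 1;    /* S1 */
  b = x * 2;    /* S2: true (RAW) dependence on S1 through x        */
  x = c - 3;    /* S3: output (WAW) dependence on S1 and            */
                /*     anti (WAR) dependence on S2, both through x  */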

void main()
{
  int a[10], b[10];
  int i, d, c;

  c = 42;
  for (i = 1; i <= 10; i++)
    a[i] = 0;
  for (i = 1; i <= 10; i++) {
    a[i] = bar(a[i]);
    d = foo(c);
    b[i] = a[i] + d;
  }
  return;
}

Figure 2.5: Example of a C code and its data dependence graph

Dependences are needed to define precedence constraints in programs. PIPS implements both control dependences (its IR is in the form of an AST) and data dependences (in the form of a dependence graph). These data structures are the basis for scheduling instructions, an important step in any parallelization process.


2.5 List Scheduling

Task scheduling is the process of specifying the order and assignment of start and end times to a set of tasks to be run concurrently on a multiprocessor, such that the completion time of the whole application is as small as possible, while respecting the dependence constraints of each task. Usually, the number of tasks exceeds the number of processors; thus some processors are dedicated to multiple tasks. Since finding the optimal solution of a general scheduling problem is NP-complete [46], providing an efficient heuristic to find a good solution is needed. This is a problem we address in Chapter 5.

2.5.1 Background

Scheduling approaches can be categorized in many ways. In preemptive techniques, the currently executing task can be preempted by another, higher-priority task, while in a non-preemptive scheme a task keeps its processor until termination. Preemptive scheduling algorithms are instrumental in avoiding possible deadlocks and in implementing real-time systems, where tasks must adhere to specified deadlines; however, preemptions generate run-time overhead. Moreover, when static predictions of task characteristics such as execution time and communication cost exist, task scheduling can be performed statically (offline). Otherwise, dynamic (online) schedulers must make run-time mapping decisions whenever new tasks arrive; this also introduces run-time overhead.

In the context of the automatic parallelization of scientific applications we focus on in this thesis, we are interested in non-preemptive static scheduling policies of parallelized code⁴. Even though the subject of static scheduling is rather mature (see Section 5.4), we believe the advent and widespread use of multi-core architectures, with the constraints they impose, warrant taking a fresh look at its potential. Indeed, static scheduling mechanisms have, first, the strong advantage of reducing run-time overheads, a key factor when considering execution time and energy usage metrics. Another important advantage of these schedulers over dynamic ones, at least over those not equipped with detailed static task information, is that the existence of efficient schedules is ensured prior to program execution. This is usually not an issue when time performance is the only goal at stake, but much more so when memory constraints might prevent a task from being executed at all on a given architecture. Finally, static schedules are predictable, which helps both at the specification level (if such a requirement has been introduced by designers) and at the debugging level. Given the breadth of the literature on scheduling, we introduce in this section the notion of list-scheduling heuristics [72], a class of scheduling heuristics that we use in this thesis.

⁴ Note that more expensive preemptive schedulers would be required if fairness concerns were high, which is not frequently the case in the applications we address here.

2.5.2 Algorithm

A labeled directed acyclic graph (DAG) G is defined as G = (T, E, C), where (1) T = vertices(G) is a set of n tasks (vertices) τ, annotated with an estimation of their execution time, task_time(τ); (2) E = edges(G) is a set of m directed edges e = (τi, τj) between two tasks, annotated with their dependences, edge_regions(e)⁵; and (3) C is an n × n sparse communication edge cost matrix, edge_cost(e). task_time(τ) and edge_cost(e) are assumed to be numerical constants, although we show how we lift this restriction in Section 6.3. The functions successors(τ, G) and predecessors(τ, G) return the list of immediate successors and predecessors of a task τ in the DAG G. Figure 2.6 provides an example of a simple graph with vertices τi; vertex times are listed in the vertex circles, while edge costs label arrows.

[Figure: a DAG whose entry and exit vertices have time 0 and whose tasks are τ1 (time 1), τ4 (time 2), τ2 (time 3) and τ3 (time 2); the entry feeds τ1 and τ4 at cost 0, τ1 → τ2 costs 1, τ4 → τ2 costs 2, τ4 → τ3 costs 1, and τ2 and τ3 reach the exit at cost 0.]

 step   task   tlevel   blevel
  1      τ4      0        7
  2      τ3      3        2
  3      τ1      0        5
  4      τ2      4        3

Figure 2.6: A Directed Acyclic Graph (left) and its associated data (right)

A list scheduling process provides, from a DAG G, a sequence of its vertices that satisfies the relationship imposed by E. Various heuristics try to minimize the schedule total length, possibly allocating the various vertices to different clusters, which ultimately will correspond to different processes or threads. A cluster κ is thus a list of tasks; if τ ∈ κ, we note cluster(τ) = κ. List scheduling is based on the notion of vertex priorities. The priority for each task τ is computed using the following attributes:

⁵ A region is defined in Section 2.6.3.

• The top level tlevel(τ, G) of a vertex τ is the length of the longest path from the entry vertex⁶ of G to τ. The length of a path is the sum of the communication costs of the edges and the computational times of the vertices along the path. Top levels are used to estimate the start times of vertices on processors: the top level is the earliest possible start time. Scheduling in an ascending order of top levels schedules vertices in a topological order. The algorithm for computing the top level of a vertex τ in a graph is given in Algorithm 1.

ALGORITHM 1: The top level of Task τ in Graph G

function tlevel(τ, G)
  tl = 0
  foreach τi ∈ predecessors(τ, G)
    level = tlevel(τi, G) + task_time(τi) + edge_cost(τi, τ)
    if (tl < level) then tl = level
  return tl
end

• The bottom level blevel(τ, G) of a vertex τ is the length of the longest path from τ to the exit vertex of G. The maximum of the bottom levels of vertices is the length cpl(G) of a graph's critical path, which is the longest path in the DAG G. The latest start time of a vertex τ is the difference (cpl(G) − blevel(τ, G)) between the critical path length and the bottom level of τ. Scheduling in a descending order of bottom levels tends to schedule critical path vertices first. The algorithm for computing the bottom level of τ in a graph is given in Algorithm 2.

ALGORITHM 2: The bottom level of Task τ in Graph G

function blevel(τ, G)
  bl = 0
  foreach τj ∈ successors(τ, G)
    level = blevel(τj, G) + edge_cost(τ, τj)
    if (bl < level) then bl = level
  return bl + task_time(τ)
end

To illustrate these notions, the top levels and bottom levels of each vertex of the graph presented in the left of Figure 2.6 are provided in the adjacent table.
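To make Algorithms 1 and 2 concrete, the following C sketch (our own illustrative code, not part of PIPS; the toy DAG, its task times and its edge costs are assumptions, not the graph of Figure 2.6) computes top and bottom levels by direct recursion over a dense edge-cost matrix.

/* Minimal sketch of Algorithms 1 and 2 on a hard-coded toy DAG. */
#include <stdio.h>

#define NTASKS 4                   /* tasks 0..3 */
static int task_time[NTASKS] = {1, 3, 2, 2};
/* edge_cost[i][j] < 0 means: no edge from task i to task j */
static int edge_cost[NTASKS][NTASKS] = {
  {-1,  1, -1, -1},                /* 0 -> 1 (cost 1) */
  {-1, -1, -1, -1},
  {-1, -1, -1, -1},
  {-1, -1,  2, -1},                /* 3 -> 2 (cost 2) */
};

/* Algorithm 1: longest path length from an entry down to task t (excluded). */
static int tlevel(int t) {
  int tl = 0;
  for (int p = 0; p < NTASKS; p++)
    if (edge_cost[p][t] >= 0) {
      int level = tlevel(p) + task_time[p] + edge_cost[p][t];
      if (tl < level) tl = level;
    }
  return tl;
}

/* Algorithm 2: longest path length from task t (included) to an exit. */
static int blevel(int t) {
  int bl = 0;
  for (int s = 0; s < NTASKS; s++)
    if (edge_cost[t][s] >= 0) {
      int level = blevel(s) + edge_cost[t][s];
      if (bl < level) bl = level;
    }
  return bl + task_time[t];
}

int main(void) {
  for (int t = 0; t < NTASKS; t++)
    printf("task %d: tlevel = %d, blevel = %d\n", t, tlevel(t), blevel(t));
  return 0;
}

Tasks with no predecessors get a top level of 0, which is consistent with the null-cost entry edges mentioned in the footnote below.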

⁶ One can always assume that a unique entry vertex exists, by adding it if need be and connecting it to the original ones with null-cost edges. This remark also applies to the exit vertex (see Figure 2.6).

The general algorithmic skeleton for list scheduling a graph G on P clusters (P can be infinite and is assumed to be always strictly positive) is provided in Algorithm 3: first, priorities priority(τ) are computed for all currently unscheduled vertices; then, the vertex with the highest priority is selected for scheduling; finally, this vertex is allocated to the cluster that offers the earliest start time. Function f characterizes each specific heuristic, while the set of clusters already allocated to tasks is clusters. Priorities need to be computed again for (a possibly updated) graph G after each scheduling of a task: task times and communication costs change when tasks are allocated to clusters. This is performed by the update_priority_values function call. Part of the allocate_task_to_cluster procedure is to ensure that cluster(τ) = κ, which indicates that Task τ is now scheduled on Cluster κ.

ALGORITHM 3: List scheduling of Graph G on P processors

procedure list_scheduling(G, P)
  clusters = ∅
  foreach τi ∈ vertices(G)
    priority(τi) = f(tlevel(τi, G), blevel(τi, G))
  UT = vertices(G)                 // unscheduled tasks
  while UT ≠ ∅
    τ = select_task_with_highest_priority(UT)
    κ = select_cluster(τ, G, P, clusters)
    allocate_task_to_cluster(τ, κ, G)
    update_graph(G)
    update_priority_values(G)
    UT = UT − {τ}
end
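As an illustration of the "earliest start time" allocation step, the sketch below is one possible way select_cluster and allocate_task_to_cluster could be realized; it is our own simplified assumption (reusing the toy DAG of the previous sketch), not the actual BDSC or PIPS implementation. A task may start on a cluster when the cluster is free and all its predecessors have completed, paying an edge cost only when a predecessor runs on another cluster.

#include <limits.h>
#include <stdio.h>

#define NTASKS 4
#define NCLUSTERS 2
static int task_time[NTASKS] = {1, 3, 2, 2};
static int edge_cost[NTASKS][NTASKS] = {      /* < 0: no edge */
  {-1, 1, -1, -1}, {-1, -1, -1, -1}, {-1, -1, -1, -1}, {-1, -1, 2, -1}};
static int cluster_of[NTASKS]  = {-1, -1, -1, -1};   /* -1: unscheduled */
static int finish_time[NTASKS] = {0, 0, 0, 0};
static int cluster_ready[NCLUSTERS] = {0, 0};

/* earliest time at which task tau could start on cluster kappa */
static int start_time_on(int tau, int kappa) {
  int start = cluster_ready[kappa];
  for (int p = 0; p < NTASKS; p++)
    if (edge_cost[p][tau] >= 0 && cluster_of[p] >= 0) {
      int ready = finish_time[p]
                + (cluster_of[p] == kappa ? 0 : edge_cost[p][tau]);
      if (ready > start) start = ready;
    }
  return start;
}

static int select_cluster(int tau) {
  int best = 0, best_start = INT_MAX;
  for (int k = 0; k < NCLUSTERS; k++) {
    int s = start_time_on(tau, k);
    if (s < best_start) { best_start = s; best = k; }
  }
  return best;
}

static void allocate_task_to_cluster(int tau, int kappa) {
  int s = start_time_on(tau, kappa);
  cluster_of[tau] = kappa;
  finish_time[tau] = s + task_time[tau];
  cluster_ready[kappa] = finish_time[tau];
}

int main(void) {
  /* assumed priority order for this toy DAG (decreasing bottom levels) */
  int order[NTASKS] = {3, 0, 1, 2};
  for (int i = 0; i < NTASKS; i++) {
    int tau = order[i], kappa = select_cluster(tau);
    allocate_task_to_cluster(tau, kappa);
    printf("task %d -> cluster %d, finishes at %d\n",
           tau, kappa, finish_time[tau]);
  }
  return 0;
}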

In this thesis, we introduce a new non-preemptive static list-scheduling heuristic called BDSC (Chapter 5) to extract task-level parallelism in the presence of resource constraints on the number of processors and their local memory size; it is based on a precise cost model and addresses both shared and distributed parallel memory architectures. We implement BDSC in PIPS.

2.6 PIPS: Automatic Parallelizer and Code Transformation Framework

We chose PIPS [58] to showcase our parallelization approach since it is readily available, well documented, and encodes both control and data dependences. PIPS is a source-to-source compilation framework for analyzing and transforming C and Fortran programs. PIPS implements polyhedral analyses and transformations that are interprocedural, and provides detailed static information about programs, such as use-def chains, data dependence graph, symbolic complexity, memory effects⁷, call graph, interprocedural control flow graph, etc.

Many program analyses and transformations can be applied; the most important are listed below.

Parallelization: Several algorithms are implemented, such as Allen & Kennedy's parallelization algorithm [14], which performs loop distribution and vectorization, selecting innermost loops for vector units, and coarse-grain parallelization based on the convex array regions of loops (see below), to avoid loop distribution and restrictions on loop bodies.

Scalar and array privatization: This analysis discovers variables whose values are local to a particular scope, usually a loop iteration.

Loop transformations [88]: These include unrolling, interchange, normalization, distribution, strip mining, tiling and index set splitting.

Function transformations: PIPS supports, for instance, the inlining, outlining and cloning of functions.

Other transformations: These classical code transformations include dead-code elimination, partial evaluation, 3-address code generation, control restructuring, loop-invariant code motion, forward substitution, etc.

We use many of these analyses for implementing our parallelization methodology in PIPS, e.g. the "complexity" and data dependence analyses. These static analyses use convex polyhedra to represent abstractions of the values within various domains. A convex polyhedron over Z is a set of integer points in the n-dimensional space Zⁿ and can be represented by a system of linear inequalities. These analyses provide precise information and results, which can be leveraged to perform automatic task parallelization (Chapter 6).
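For instance (an illustrative example of ours, not PIPS output), the iteration domain of a triangular loop nest such as for(i = 1; i <= 10; i++) for(j = 1; j <= i; j++) is the convex polyhedron

\[
\{\, (i, j) \in \mathbb{Z}^2 \mid 1 \le i,\; i \le 10,\; 1 \le j,\; j \le i \,\}
\]

i.e. a set of integer points described by a system of affine inequalities.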

We present below the precondition, transformer, array region and complexity analyses that are directly used in this thesis to compute dependences, execution and communication times; then we present the intermediate representation (IR) of PIPS, since we extend it in this thesis to a parallel intermediate representation (Chapter 4), which is necessary for simple and generic parallel code generation.

2.6.1 Transformer Analysis

A transformer is an affine relation between variables' values in the memory store before (old values) and after (new values) the execution of a statement. This analysis [58] is implemented in PIPS, where transformers are represented by convex polyhedra. Indeed, a transformer T is defined by a list of arguments and a predicate system labeled by T. A transformer is empty (⊥) when the set of the affine constraints of the system is not feasible; a transformer is equal to the identity when no variables are modified.

⁷ A representation of data read and written by a given statement.

This relation models the transformation, by the execution of a statement, of an input memory store into an output memory store. For instance, in Figure 2.7, the transformer of k, T(k), before and after the execution of the statement k++, is the relation {k = k_init + 1}, where k_init is the value before executing the increment statement; its value before the loop is k_init = 0.

T(i,k) {k = 0}
    int i, k = 0;
T(i,k) {i = k + 1, k_init = 0}
    for(i = 1; i <= 10; i++) {
T() {}
        a[i] = f(a[i]);
T(k) {k = k_init + 1}
        k++;
    }

Figure 2.7: Example of transformer analysis

The transformer analysis is used to compute preconditions (see next section, Section 2.6.2) and to compute the path transformer (see Section 6.4) that connects two memory stores. This analysis is necessary for our dependence test.

2.6.2 Precondition Analysis

A precondition analysis is a function that labels each statement in a program with its precondition. A precondition provides information about the values of variables: it is a condition that holds true before the execution of the statement.

This analysis [58] is implemented in PIPS, where preconditions are represented by convex polyhedra. Preconditions are determined before the computation of array regions (see next section, Section 2.6.3), in order to have precise information on the variables' values. They are also used to perform a sophisticated static complexity analysis (see Section 2.6.4). In Figure 2.8, for instance, the precondition P over the variables i and j before the second loop is {i = 11}, which is indeed the value of i after the execution of the first loop. Note that no information is available for j at this step.

int i, j;
P(i,j) {}
    for(i = 1; i <= 10; i++)
P(i,j) {1 ≤ i, i ≤ 10}
        a[i] = f(a[i]);
P(i,j) {i = 11}
    for(j = 6; j <= 20; j++)
P(i,j) {i = 11, 6 ≤ j, j ≤ 20}
        b[j] = g(a[j], a[j+1]);
P(i,j) {i = 11, j = 21}

Figure 2.8: Example of precondition analysis

2.6.3 Array Region Analysis

PIPS provides a precise intra- and inter-procedural analysis of array data flow, which is important for computing dependences for each array element. This analysis of array regions [35] is implemented in PIPS. A region r of an array T is an abstraction of a subset of its elements. Formally, a region r is a quadruplet made of (1) a reference v (the variable referenced in the region), (2) a region type t, (3) an approximation a for the region, EXACT when the region exactly represents the requested set of array elements, or MAY if it is an over- or under-approximation, and (4) c, a convex polyhedron containing equalities and inequalities whose parameters are the variables' values in the current memory store. We write <v−t−a−c> for Region r.

We distinguish four types of sets of regions: Rr, Rw, Ri and Ro. Read regions Rr and Write regions Rw contain the array elements respectively read or written by a statement. In regions Ri contain the array elements read and imported by a statement. Out regions Ro contain the array elements written and exported by a statement.
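The following C sketch is our own simplified model of this quadruplet, given only to fix ideas; the real PIPS implementation relies on NewGen-generated data structures, not on these declarations.

/* Illustrative, simplified encoding of an array region <v-t-a-c>. */
typedef enum { REGION_READ, REGION_WRITE, REGION_IN, REGION_OUT } region_type;
typedef enum { APPROX_EXACT, APPROX_MAY } region_approximation;

typedef struct {
  int   nconstraints;   /* number of affine (in)equalities                 */
  /* row i encodes sum_j coeff[i][j] * phi_j + constant[i] <= 0 (or == 0)  */
  int **coeff;
  int  *constant;
} polyhedron;

typedef struct {
  const char          *reference;      /* v: the array being described     */
  region_type          type;           /* t: Read, Write, In or Out        */
  region_approximation approximation;  /* a: EXACT or MAY                  */
  polyhedron           constraints;    /* c: polyhedron over the phi_i and */
                                       /*    the program variables         */
} region;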

For instance, in Figure 2.9, PIPS is able to infer the following sets of regions:

Rw1 = {<a(φ1)−W−EXACT−{1 ≤ φ1, φ1 ≤ 10}>,
       <b(φ1)−W−EXACT−{1 ≤ φ1, φ1 ≤ 10}>}

Rr2 = {<a(φ1)−R−EXACT−{6 ≤ φ1, φ1 ≤ 21}>}

where the Write regions Rw1 of Arrays a and b, modified in the first loop, are EXACTly the array elements with indices in the interval [1,10]. The Read regions Rr2 of Array a in the second loop represent EXACTly the elements with indices in [6,21].

<a(φ1)−R−EXACT−{1 ≤ φ1, φ1 ≤ 10}>
<a(φ1)−W−EXACT−{1 ≤ φ1, φ1 ≤ 10}>
<b(φ1)−W−EXACT−{1 ≤ φ1, φ1 ≤ 10}>
for(i = 1; i <= 10; i++) {
    a[i] = f(a[i]);
    b[i] = 42;
}
<a(φ1)−R−EXACT−{6 ≤ φ1, φ1 ≤ 21}>
<b(φ1)−W−EXACT−{6 ≤ φ1, φ1 ≤ 20}>
for(j = 6; j <= 20; j++)
    b[j] = g(a[j], a[j+1]);

Figure 2.9: Example of array region analysis

We define the following set of operations on sets of regions Ri (convex polyhedra):

1. regions_intersection(R1, R2) is a set of regions; each region r in this set is the intersection of two regions r1 ∈ R1 of type t1 and r2 ∈ R2 of type t2 with the same reference. The type of r is t1t2, and the convex polyhedron of r is the intersection of the two convex polyhedra of r1 and r2, which is also a convex polyhedron.

2. regions_difference(R1, R2) is a set of regions; each region r in this set is the difference of two regions r1 ∈ R1 and r2 ∈ R2 with the same reference. The type of r is t1t2, and the convex polyhedron of r is the set difference between the two convex polyhedra of r1 and r2. Since this is not generally a convex polyhedron, an under-approximation is computed in order to obtain a convex representation.

3. regions_union(R1, R2) is a set of regions; each region r in this set is the union of two regions r1 ∈ R1 of type t1 and r2 ∈ R2 of type t2 with the same reference. The type of r is t1t2, and the convex polyhedron of r is the union of the two convex polyhedra of r1 and r2, which again is not necessarily a convex polyhedron. An approximated convex hull is thus computed in order to return the smallest enclosing polyhedron.

In this thesis, array regions are used for communication cost estimation, dependence computation between statements, and data volume estimation for statements. For example, if we need to compute array dependences between two compound statements S1 and S2, we search for the array elements that are accessed in both statements. Therefore, we are dealing with the two statements' sets of regions, R1 and R2; the result is the intersection between the two sets of regions. However, the two sets of regions should be in the same memory store to make it possible to compare them: we should bring the set of regions R1 into the store of R2, by combining R1 with the changes performed, i.e. the transformer that connects the two memory stores. Thus, we compute the path transformer between S1 and S2 (see Section 6.4). Moreover, we use operations on array regions for our cost model generation (see Section 6.3) and communication generation (see Section 7.3).
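The pseudo-C sketch below illustrates this use of regions for a coarse flow dependence test; the function names mirror the operations described above, but the interfaces are our own assumptions, not the actual PIPS API.

/* Sketch only: region-based flow dependence test between S1 and S2. */
typedef struct set_of_regions set_of_regions;
typedef struct statement      statement;
typedef struct transformer    transformer;

extern set_of_regions *write_regions(statement *);            /* Rw(S)  */
extern set_of_regions *read_regions(statement *);             /* Rr(S)  */
extern transformer    *path_transformer(statement *, statement *);
extern set_of_regions *apply_transformer(set_of_regions *, transformer *);
extern set_of_regions *regions_intersection(set_of_regions *, set_of_regions *);
extern int             regions_empty_p(set_of_regions *);

/* true if S2 may (flow-)depend on S1 */
int flow_dependence_p(statement *s1, statement *s2) {
  /* bring the regions of S1 into the memory store of S2 */
  transformer *t = path_transformer(s1, s2);
  set_of_regions *w1 = apply_transformer(write_regions(s1), t);
  set_of_regions *r2 = read_regions(s2);
  return !regions_empty_p(regions_intersection(w1, r2));
}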

2.6.4 "Complexity" Analysis

"Complexities" are symbolic approximations of the static execution time estimation of statements, in terms of numbers of cycles. They are implemented in PIPS using polynomial approximations of execution times. Each statement is thus labeled with an expression, represented by a polynomial over program variables, assuming that each basic operation (addition, multiplication...) has a fixed, architecture-dependent execution time. Table 2.2 shows the complexity formulas generated for each function of Harris (see the code in Figure 1.1) using the PIPS complexity analysis, where the variables N and M represent the input image size.

Function     Complexity (polynomial)
InitHarris   9 × N × M
SobelX       60 × N × M
SobelY       60 × N × M
MultiplY     17 × N × M
Gauss        85 × N × M
CoarsitY     34 × N × M

Table 2.2: Execution time estimations for Harris functions using PIPS complexity analysis

We use this analysis, provided by PIPS, to determine an approximate execution time for each statement, in order to guide our scheduling heuristic.
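For instance (our own toy snippet and cost constants, not PIPS output), a simple rectangular loop nest yields a polynomial in its bounds:

/* Illustrative only: the estimated execution time of this function is
 * c*N*M cycles for some assumed, architecture-dependent constant c per
 * iteration, i.e. a polynomial over the program variables N and M.     */
void scale_image(int N, int M, double out[N][M], double in[N][M], double k) {
  for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++)
      out[i][j] = in[i][j] * k;   /* one load, one multiply, one store */
}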

2.6.5 PIPS (Sequential) IR

The PIPS intermediate representation (IR) [32] of sequential programs is a hierarchical data structure that embeds both control flow graphs and abstract syntax trees. We provide in this section a high-level description of the intermediate representation of PIPS, which targets the Fortran and C imperative programming languages. It is specified using Newgen [60], a Domain Specific Language for the definition of set equations, from which a dedicated API is automatically generated to manipulate (creation, access, IO operations) the data structures implementing these set elements. This section contains only a slightly simplified subset of the intermediate representation of PIPS, the part that is directly related to the parallel paradigms addressed in this thesis. The Newgen definition of this part is given in Figure 2.10.

instruction = call + forloop + sequence + unstructured

statement = instruction x declarations:entity*

entity = name:string x type x initial:value

call = function:entity x arguments:expression*

forloop = index:entity x
          lower:expression x upper:expression x
          step:expression x body:statement

sequence = statements:statement*

unstructured = entry:control x exit:control

control = statement x
          predecessors:control* x successors:control*

expression = syntax x normalized

syntax = reference + call + cast

reference = variable:entity x indices:expression*

type = area + void

Figure 2.10: Simplified Newgen definitions of the PIPS IR

• Control flow in PIPS IR is represented via instructions, members of the disjoint union (using the "+" symbol) set instruction. An instruction can be either a simple call or a compound instruction, i.e. a for loop, a sequence or a control flow graph. A call instruction represents built-in or user-defined function calls; for instance, assign statements are represented as calls to the "=" function.

• Instructions are included within statements, which are members of the cartesian product set statement that also incorporates the declarations of local variables; a whole function body is represented in PIPS IR as a statement. All named objects, such as user variables or built-in functions, in PIPS are members of the entity set (the value set denotes constants, while the "*" symbol introduces Newgen list sets).

• Compound instructions can be either (1) a loop instruction, which includes an iteration index variable with its lower, upper and increment expressions and a loop body, (2) a sequence, i.e. a succession of statements encoded as a list, or (3) a control flow graph (unstructured). In Newgen, a given set component such as expression can be distinguished using a prefix, such as lower and upper here (see the small example after this list).

• Programs that contain structured (exit, break and return) and unstructured (goto) transfers of control are handled in the PIPS intermediate representation via the unstructured set. An unstructured instruction has one entry and one exit control vertex; a control is a vertex in a graph, labeled with a statement and its lists of predecessor and successor control vertices. Executing an unstructured instruction amounts to following the control flow induced by the graph successor relationship, starting at the entry vertex, while executing the vertex statements, until the exit vertex is reached, if at all.
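As a small illustration (our own reading of the definitions of Figure 2.10, not actual PIPS output), the C loop below corresponds to a statement whose instruction is a forloop, whose body is itself a statement carrying a call to the "=" built-in entity:

for (i = 0; i < n; i++)   /* forloop: index = entity i; the lower, upper  */
                          /* and step expressions come from the loop header */
  a[i] = 0;               /* call: function = entity "=",                   */
                          /* arguments = { reference(a, {i}), 0 }           */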

In this thesis, we use the IR of PIPS [58] to showcase our approach, SPIRE, as a parallel extension formalism for existing sequential intermediate representations (Chapter 4).

2.7 Conclusion

The purpose of this chapter is to present the context of existing hardware architectures and parallel programming models within which our thesis work is developed. We also outline how to bridge these two levels using two approaches for writing parallel code: manual and automatic. Since in this thesis we adopt the automatic approach, this chapter also presents the notions of intermediate representation, control and data dependences, and scheduling, as well as the PIPS compilation framework in which we implement our automatic parallelization methodology.

In the next chapter, we survey in detail seven recent and popular parallel programming languages and their different parallel constructs. This analysis will help us to find a uniform core language that can be designed as a unique but generic parallel intermediate representation, through the development of the SPIRE methodology (see Chapter 4).

Chapter 3

Concepts in Task Parallel Programming Languages

L'inconscient est structuré comme un langage. (The unconscious is structured like a language.) Jacques Lacan

Existing parallel languages present portability issues; OpenMP, for instance, targets only shared memory systems. Parallelizing sequential programs to be run on different types of architectures requires writing the same application in different parallel programming languages. Unifying the wide variety of existing programming models in a generic core language, used as a parallel intermediate representation, can simplify this portability problem. To design this parallel IR, we survey existing parallel language constructs, looking for a trade-off between expressibility and conciseness of representation. This chapter surveys seven popular and efficient parallel language designs that tackle this difficult issue: Cilk, Chapel, X10, Habanero-Java, OpenMP, MPI and OpenCL. Using as a single running example a parallel implementation of the computation of the Mandelbrot set, this chapter describes how the fundamentals of task parallel programming, i.e. collective and point-to-point synchronization and mutual exclusion, are dealt with in these languages. We also discuss how these languages allocate and distribute data over memory. Our study suggests that, even though there are many keywords and notions introduced by these languages, they all boil down, as far as control issues are concerned, to three key task concepts: creation, synchronization and atomicity. Regarding memory models, these languages adopt one of three approaches: shared memory, distributed memory and PGAS (Partitioned Global Address Space).

3.1 Introduction

This chapter is a survey of the state of the art of important existing task parallel programming languages; it helps us achieve two of our goals in this thesis: (1) the development of a parallel intermediate representation, which should be as generic as possible to handle different parallel programming languages (this is made possible by the taxonomy produced at the end of this chapter), and (2) the choice of the languages we use as back-ends of our automatic parallelization process (these are determined by the knowledge of existing target parallel languages, their limits and their extensions).

Indeed, programming parallel machines as effectively as sequential ones would ideally require a language that provides high-level programming constructs to avoid the programming errors frequent when expressing parallelism. Programming languages adopt one of two ways to deal with this issue: (1) high-level languages hide the presence of parallelism at the software level, thus offering code that is easy to build and port, but whose performance is not guaranteed, and (2) low-level languages use explicit constructs for communication patterns and for specifying the number and placement of threads, but the resulting code is difficult to build and not very portable, although usually efficient. Recent programming models explore the best trade-offs between expressiveness and performance when addressing parallelism.

Traditionally, there are two general ways to break an application into concurrent parts, in order to take advantage of a parallel computer and execute them simultaneously on different CPUs: data and task parallelism. In data parallelism, the same instruction is performed repeatedly and simultaneously on different data. In task parallelism, the execution of different processes (threads) is distributed across multiple computing nodes. Task parallelism is often considered more difficult to specify than data parallelism, since it lacks the regularity present in the latter model; processes (threads) simultaneously run different instructions, leading to different execution schedules and memory access patterns. Task management must address both control and data dependences, in order to minimize execution and communication times.

This chapter describes how seven popular and efficient parallel programming language designs, either widely used or recently developed at DARPA¹, tackle the issue of task parallelism specification: Cilk, Chapel, X10, Habanero-Java, OpenMP, MPI and OpenCL. They are selected based on the richness of their constructs and their popularity; they provide simple, high-level parallel abstractions that cover most of the parallel programming language design spectrum. We use a popular parallel problem, the computation of the Mandelbrot set [1], as a running example. We consider this an interesting test case, since it exhibits a high level of embarrassing parallelism while its iteration space is not easily partitioned if one wants to have tasks of balanced run times. Our goal in this chapter is to study and compare aspects of task parallelism with our automatic parallelization aim in mind.

After this introduction, Section 3.2 presents our running example. We discuss the parallel language features specific to task parallelism, namely task creation, synchronization and atomicity, in Section 3.3. In Section 3.4, a selection of current and important parallel programming languages is described: Cilk, Chapel, X10, Habanero-Java, OpenMP, MPI and OpenCL.

¹ Three programming languages were developed as part of the DARPA HPCS program: Chapel, Fortress [13] and X10.

For each language, an implementation of the Mandelbrot set algorithm is presented. Section 3.5 contains a comparison, discussion and classification of these languages. We conclude in Section 3.6.

3.2 Mandelbrot Set Computation

The Mandelbrot set is a fractal set. For each complex number c ∈ C, the sequence of complex numbers zn(c) is defined by induction as follows: z0(c) = c and zn+1(c) = zn(c)² + c. The Mandelbrot set M is then defined as M = {c ∈ C : limn→∞ |zn(c)| < ∞}; thus M is the set of all complex numbers c for which the sequence zn(c) converges. One can show [1] that a finite limit for zn(c) exists only if the modulus of zm(c) is less than 2 for some positive m. We give a sequential C implementation of the computation of the Mandelbrot set in Figure 3.2. Running this program yields Figure 3.1, in which each complex number c is seen as a pixel, its color being related to its convergence property: the Mandelbrot set is the black shape in the middle of the figure.

Figure 3.1: Result of the Mandelbrot set

We use this base program as our test case for the parallel implementations, in Section 3.4, for the parallel languages we selected. This is an interesting case for illustrating parallel programming languages: (1) it is an embarrassingly parallel problem, since all computations of pixel colors can be performed simultaneously, and thus is obviously a good candidate for expressing parallelism, but (2) its efficient implementation is not obvious, since good load balancing cannot be achieved by simply grouping localized pixels together, because convergence can vary widely from one point to the next, due to the fractal nature of the Mandelbrot set.

unsigned long min_color = 0, max_color = 16777215;
unsigned int width = NPIXELS, height = NPIXELS,
             N = 2, maxiter = 10000;
double r_min = -N, r_max = N, i_min = -N, i_max = N;
double scale_r = (r_max - r_min)/width;
double scale_i = (i_max - i_min)/height;
double scale_color = (max_color - min_color)/maxiter;
Display *display; Window win; GC gc;
for (row = 0; row < height; ++row) {
  for (col = 0; col < width; ++col) {
    zr = zi = 0;
    // Scale c as display coordinates of current point
    cr = r_min + ((double) col * scale_r);
    ci = i_min + ((double) (height-1-row) * scale_i);
    // Iterates z = z*z+c while |z| < N or maxiter is reached
    k = 0;
    do {
      temp = zr*zr - zi*zi + cr;
      zi = 2*zr*zi + ci; zr = temp;
      ++k;
    } while (zr*zr + zi*zi < (N*N) && k < maxiter);
    // Set color and display point
    color = (ulong) ((k-1) * scale_color) + min_color;
    XSetForeground(display, gc, color);
    XDrawPoint(display, win, gc, col, row);
  }
}

Figure 3.2: Sequential C implementation of the Mandelbrot set

3.3 Task Parallelism Issues

Among the many issues related to parallel programming, the questions of task creation, synchronization and atomicity are particularly important when dealing with task parallelism, our focus in this thesis.

3.3.1 Task Creation

In this thesis, a task is a static notion, i.e. a list of instructions, while processes and threads are running instances of tasks. The creation of system-level task instances is an expensive operation, since its implementation, via processes, requires allocating and later possibly releasing system-specific resources. If a task has a short execution time, this overhead might make the overall computation quite inefficient. Another way to introduce parallelism is to use lighter, user-level tasks, called threads. In all languages addressed in this chapter, task management operations refer to such user-level tasks. The problem of finding the proper size of tasks, and hence their number, can be addressed at compile time or run time using heuristics.

In our Mandelbrot example, the parallel implementations we provide below use a static schedule that allocates a number of iterations of the loop row to a particular thread; we interleave successive iterations into distinct threads in a round-robin fashion, in order to group loop body computations into chunks of size height/P, where P is the (language-dependent) number of threads. Our intent here is to try to reach a good load balancing between threads.

3.3.2 Synchronization

Coordination in task-parallel programs is a major source of complexity. It is dealt with using synchronization primitives, for instance when a code fragment contains many phases of execution, where each phase should wait for the previous ones before proceeding. When a process or a thread exits before synchronizing on a barrier that other processes are waiting on, or when processes operate on different barriers in different orders, a deadlock occurs; programmers must avoid these situations (and be deadlock-free). Different forms of synchronization constructs exist, such as mutual exclusion when accessing shared resources using locks, join operations that wait for the termination of child threads, multiple synchronizations using barriers², and point-to-point synchronization using counting semaphores [95].

In our Mandelbrot example, we need to synchronize all pixel computations before exiting; we also need to use synchronization to deal with the atomic section (see the next section). Even though synchronization is rather simple in this example, caution is always needed; an example that may lead to deadlocks is mentioned in Section 3.5.

3.3.3 Atomicity

Access to shared resources requires atomic operations that, at any given time, can be executed by only one process or thread. Atomicity comes in two flavors: weak and strong [73]. A weak atomic statement is atomic only with respect to other explicitly atomic statements; no guarantee is made regarding interactions with non-isolated statements (not declared as atomic). By contrast, strong atomicity enforces non-interaction of atomic statements with all operations in the entire program. It usually requires specialized hardware support (e.g. atomic "compare and swap" operations), although a software implementation that treats non-explicitly atomic accesses as implicitly atomic single operations (using a single global lock) is possible.

In our Mandelbrot example, display accesses require a connection to the X server; drawing a given pixel is an atomic operation, since GUI-specific calls need synchronization. Moreover, two simple examples of atomic sections are provided in Section 3.5.

² The term "barrier" is used in various ways by authors [87]; we consider here that barriers are synchronization points that wait for the termination of sets of threads, defined in a language-dependent manner.

3.4 Parallel Programming Languages

We present here seven parallel programming languages and describe how they deal with the concepts introduced in the previous section. We also study how these languages distribute data over different processors. Given the large number of parallel languages that exist, we focus primarily on languages that are in current use, popular, and that support simple, high-level, task-oriented parallel abstractions.

3.4.1 Cilk

Cilk [103], developed at MIT, is a multithreaded parallel language based on C for shared memory systems. Cilk is designed for exploiting dynamic and asynchronous parallelism. A Cilk implementation of the Mandelbrot set is provided in Figure 3.3³.

cilk_lock_init(display_lock);
for (m = 0; m < P; m++)
  spawn compute_points(m);
sync;

cilk void compute_points(uint m) {
  for (row = m; row < height; row += P)
    for (col = 0; col < width; ++col) {
      // Initialization of c, k and z
      do {
        temp = zr*zr - zi*zi + cr;
        zi = 2*zr*zi + ci; zr = temp;
        ++k;
      } while (zr*zr + zi*zi < (N*N) && k < maxiter);
      color = (ulong) ((k-1) * scale_color) + min_color;
      cilk_lock(display_lock);
      XSetForeground(display, gc, color);
      XDrawPoint(display, win, gc, col, row);
      cilk_unlock(display_lock);
    }
}

Figure 3.3: Cilk implementation of the Mandelbrot set (P is the number of processors)

³ From now on, variable declarations are omitted unless required for the purpose of our presentation.

Task Parallelism

The cilk keyword identifies functions that can be spawned in parallel. A Cilk function may create threads to execute functions in parallel. The spawn keyword is used to create child tasks, such as compute_points in our example, when referring to Cilk functions.

Cilk introduces the notion of inlets [2], which are local Cilk functions defined to take the result of spawned tasks and use it (performing a reduction). The result should not be put in a variable in the parent function. All the variables of the function are available within an inlet. The abort statement allows aborting speculative work by terminating all of the already spawned children of a function; it must be called inside an inlet. Inlets are not used in our example.

Synchronization

The sync statement is a local barrier, used in our example to ensure task termination. It waits only for the spawned child tasks of the current procedure to complete, and not for all tasks currently being executed.

Atomic Section

Mutual exclusion is implemented using locks of type cilk_lockvar, such as display_lock in our example. The function cilk_lock is used to acquire a lock, blocking if it is already held; the function cilk_unlock is used to release a lock. Both functions take a single argument, which is an object of type cilk_lockvar; cilk_lock_init is used to initialize the lock object before it is used.

Data Distribution

In Cilk's shared memory model, all variables declared outside Cilk functions are shared. Also, variables addressed indirectly by passing pointers to spawned functions are shared. To avoid possible non-determinism due to data races, the programmer should avoid situations in which a task writes a variable that may be read or written concurrently by another task, or use the primitive cilk_fence, which ensures that all memory operations of a thread are committed before the execution of the next operation.

3.4.2 Chapel

Chapel [6], developed by Cray, supports both data and control flow parallelism and is designed around a multithreaded execution model based on PGAS for shared and distributed-memory systems. A Chapel implementation of the Mandelbrot set is provided in Figure 3.4.

coforall loc in Locales do
  on loc {
    for row in loc.id..height by numLocales do {
      for col in 1..width do {
        // Initialization of c, k and z
        do {
          temp = zr*zr - zi*zi + cr;
          zi = 2*zr*zi + ci; zr = temp;
          k = k+1;
        } while (zr*zr + zi*zi < (N*N) && k < maxiter);
        color = (ulong) ((k-1) * scale_color) + min_color;
        atomic {
          XSetForeground(display, gc, color);
          XDrawPoint(display, win, gc, col, row);
        }
      }
    }
  }

Figure 3.4: Chapel implementation of the Mandelbrot set

Task Parallelism

Chapel provides three types of task parallelism [6], two structured ones and one unstructured. cobegin{stmts} creates a task for each statement in stmts; the parent task waits for the stmts tasks to be completed. coforall is a loop variant of the cobegin statement, where each iteration of the coforall loop is a separate task and the main thread of execution does not continue until every iteration is completed. Finally, in begin{stmt}, the original parent task continues its execution after spawning a child running stmt.

Synchronization

In addition to cobegin and coforall, used in our example, which have an implicit synchronization at the end, synchronization variables of type sync can be used for coordinating parallel tasks. A sync [6] variable is either empty or full, with an additional data value. Reading an empty variable or writing to a full variable suspends the thread. Writing to an empty variable atomically changes its state to full. Reading a full variable consumes the value and atomically changes the state to empty.

Atomic Section

Chapel supports atomic sections: atomic{stmt} executes stmt atomically with respect to other threads. The precise semantics is still ongoing work [6].

Data Distribution

Chapel introduces a type called locale to refer to a unit of the machine resources on which a computation is running. A locale is a mapping of Chapel data and computations to the physical machine. In Figure 3.4, Array Locales represents the set of locale values corresponding to the machine resources on which this code is running; numLocales refers to the number of locales. Chapel also introduces new domain types to specify array distribution; they are not used in our example.

3.4.3 X10 and Habanero-Java

X10 [5], developed at IBM, is a distributed asynchronous dynamic parallel programming language for multi-core processors, symmetric shared-memory multiprocessors (SMPs), commodity clusters, high-end supercomputers and even embedded processors like Cell. An X10 implementation of the Mandelbrot set is provided in Figure 3.5.

Habanero-Java [30], under development at Rice University, is derived from X10 and introduces additional synchronization and atomicity primitives, surveyed below.

finish {
  for (m = 0; m < place.MAX_PLACES; m++) {
    place pl_row = place.places(m);
    async at (pl_row) {
      for (row = m; row < height; row += place.MAX_PLACES)
        for (col = 0; col < width; ++col) {
          // Initialization of c, k and z
          do {
            temp = zr*zr - zi*zi + cr;
            zi = 2*zr*zi + ci; zr = temp;
            ++k;
          } while (zr*zr + zi*zi < (N*N) && k < maxiter);
          color = (ulong) ((k-1) * scale_color) + min_color;
          atomic {
            XSetForeground(display, gc, color);
            XDrawPoint(display, win, gc, col, row);
          }
        }
    }
  }
}

Figure 3.5: X10 implementation of the Mandelbrot set

Task Parallelism

X10 provides two task creation primitives: (1) the async stmt construct creates a new asynchronous task that executes stmt while the current thread continues, and (2) the future exp expression launches a parallel task that returns the value of exp.

Synchronization

With finish stmt, the currently running task is blocked at the end of the finish clause, waiting until all the children spawned during the execution of stmt have terminated. The expression f.force() is used to get the actual value of the "future" task f.

X10 introduces a new synchronization concept: the clock. It acts as a barrier for a dynamically varying set of tasks [100] that operate in phases of execution, where each phase should wait for previous ones before proceeding. A task that uses a clock must first register with it (multiple clocks can be used). It then uses the statement next to signal to all the tasks that are registered with its clocks that it is ready to move to the following phase, and waits until all the clocks with which it is registered can advance. A clock can advance only when all the tasks that are registered with it have executed a next statement (see the left side of Figure 3.6).

Habanero-Java introduces phasers to extend this clock mechanism. A phaser is created and initialized to its first phase using the function new. The scope of a phaser is limited to the immediately enclosing finish statement; this constraint guarantees the deadlock-freedom safety property of phasers [100]. A task can be registered with zero or more phasers, using one of four registration modes: the first two are the traditional SIG and WAIT signal operations for producer-consumer synchronization; the SIG_WAIT mode implements barrier synchronization, while SIG_WAIT_SINGLE ensures, in addition, that its associated statement is executed by only one thread. As in X10, a next instruction is used to advance each phaser that this task is registered with to its next phase, in accordance with this task's registration mode, and waits on each phaser that the task is registered with in a WAIT submode (see the right side of Figure 3.6). Note that clocks and phasers are not used in our Mandelbrot example, since a collective barrier based on the finish statement is sufficient.

finish async {
  clock cl = clock.make();
  for(j = 1; j <= n; j++)
    async clocked(cl) {
      S;
      next;
      S';
    }
}

finish {
  phaser ph = new phaser();
  for(j = 1; j <= n; j++)
    async phased(ph<SIG_WAIT>) {
      S;
      next;
      S';
    }
}

Figure 3.6: A clock in X10 (left) and a phaser in Habanero-Java (right)

Atomic Section

When a thread enters an atomic statement, no other thread may enter it until the original thread terminates it.

Habanero-Java supports weak atomicity, using the isolated stmt primitive for mutual exclusion. The Habanero-Java implementation takes a single-lock approach to deal with isolated statements.

Data Distribution

In order to distribute data across processors, X10 and HJ introduce a type called place. A place is an address space within which a task may run; different places may, however, refer to the same physical processor and share physical memory. The program address space is partitioned into logically distinct places. place.MAX_PLACES, used in Figure 3.5, is the number of places available to a program.

3.4.4 OpenMP

OpenMP [4] is an application program interface providing a multi-threaded programming model for shared memory parallelism; it uses directives to extend sequential languages. A C OpenMP implementation of the Mandelbrot set is provided in Figure 3.7.

P = omp_get_num_threads();
#pragma omp parallel shared(height, width, scale_r, \
    scale_i, maxiter, scale_color, min_color, r_min, i_min) \
    private(row, col, k, m, color, temp, z, c)
#pragma omp single
{
  for (m = 0; m < P; m++)
#pragma omp task
    {
      for (row = m; row < height; row += P)
        for (col = 0; col < width; ++col) {
          // Initialization of c, k and z
          do {
            temp = zr*zr - zi*zi + cr;
            zi = 2*zr*zi + ci; zr = temp;
            ++k;
          } while (zr*zr + zi*zi < (N*N) && k < maxiter);
          color = (ulong) ((k-1) * scale_color) + min_color;
#pragma omp critical
          {
            XSetForeground(display, gc, color);
            XDrawPoint(display, win, gc, col, row);
          }
        }
    }
}

Figure 3.7: C OpenMP implementation of the Mandelbrot set

Task Parallelism

OpenMP allows dynamic (omp task) and static (omp section) scheduling models. A task instance is generated each time a thread (the encountering thread) encounters an omp task directive. This task may either be scheduled immediately on the same thread or deferred and assigned to any thread in a thread team, which is the group of threads created when an omp parallel directive is encountered. The omp sections directive is a non-iterative work-sharing construct. It specifies that the enclosed sections of code, declared with omp section, are to be divided among the threads in the team; these sections are independent blocks of code that the compiler can execute concurrently.
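For illustration (a small example of ours, not taken from the Mandelbrot code), the fragment below runs two independent blocks as sections; compute_left_half and compute_right_half are assumed user functions.

// Two independent blocks divided among the threads of the team.
extern void compute_left_half(void);
extern void compute_right_half(void);

void process_halves(void) {
  #pragma omp parallel sections
  {
    #pragma omp section
    compute_left_half();

    #pragma omp section
    compute_right_half();
  }
}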

Synchronization

OpenMP provides synchronization constructs that control the execution inside a team of threads: barrier and taskwait. When a thread encounters a barrier directive, it waits until all other threads in the team reach the same point; the scope of a barrier region is the innermost enclosing parallel region. The taskwait construct is a restricted barrier that blocks the thread until all child tasks created since the beginning of the current task are completed. The omp single directive identifies code that must be run by only one thread.

Atomic Section

The critical and atomic directives are used for identifying a section of code that must be executed by a single thread at a time. The atomic directive works faster than critical, since it only applies to single instructions and can thus often benefit from hardware support. Our implementation of the Mandelbrot set in Figure 3.7 uses critical for the drawing function (not a single instruction).

Data Distribution

OpenMP variables are either global (shared) or local (private); see Figure 3.7 for examples. A shared variable refers to one unique block of storage for all threads in the team. A private variable refers to a different block of storage for each thread. More memory access modes exist, such as firstprivate or lastprivate, that may require communication or copy operations.
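As a small illustration of these additional modes (our own example, unrelated to the Mandelbrot code), firstprivate initializes each thread's private copy from the value before the region, while lastprivate copies the value of the sequentially last iteration back out:

#include <stdio.h>
int main(void) {
  int x = 10, last = 0;
  // each thread gets a private x initialized to 10; after the loop,
  // last holds the value computed by the sequentially last iteration (i == 99)
  #pragma omp parallel for firstprivate(x) lastprivate(last)
  for (int i = 0; i < 100; i++)
    last = i + x;
  printf("last = %d\n", last);   // prints 109
  return 0;
}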

3.4.5 MPI

MPI [3] is a library specification for message-passing, used to program shared and distributed memory systems. Both point-to-point and collective communications are supported. The C MPI implementation of the Mandelbrot set is provided in Figure 3.8.

int rank;
MPI_Status status;
ierr = MPI_Init(&argc, &argv);
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
ierr = MPI_Comm_size(MPI_COMM_WORLD, &P);
for (m = 1; m < P; m++) {
  if(rank == m) {
    for (row = m; row < height; row += P) {
      for (col = 0; col < width; ++col) {
        // Initialization of c, k and z
        do {
          temp = zr*zr - zi*zi + cr;
          zi = 2*zr*zi + ci; zr = temp;
          ++k;
        } while (zr*zr + zi*zi < (N*N) && k < maxiter);
        color_msg[col] = (ulong) ((k-1) * scale_color) + min_color;
      }
      ierr = MPI_Send(color_msg, width, MPI_UNSIGNED_LONG, 0, 0,
                      MPI_COMM_WORLD);
    }
  }
}
if(rank == 0) {
  for (row = 0; row < height; row++) {
    ierr = MPI_Recv(data_msg, width, MPI_UNSIGNED_LONG,
                    MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);
    for (col = 0; col < width; ++col) {
      color = data_msg[col];
      XSetForeground(display, gc, color);
      XDrawPoint(display, win, gc, col, row);
    }
  }
}
ierr = MPI_Barrier(MPI_COMM_WORLD);
ierr = MPI_Finalize();

Figure 3.8: MPI implementation of the Mandelbrot set (P is the number of processors)

Task Parallelism

The starter process may be a separate process that is not part of the MPI application, or the rank 0 process may act as a starter process to launch the remaining MPI processes of the MPI application. MPI processes are created when MPI_Init is called; it also defines the initial universe intracommunicator for all processes, used to drive the various communications. A communicator determines the scope of communications. Processes are terminated using MPI_Finalize.

Synchronization

MPI_Barrier blocks until all processes in the communicator passed as an argument have reached this call.

Atomic Section

MPI does not support atomic sections. Indeed, the atomic section is a concept intrinsically linked to shared memory. In our example, the drawing function is executed by the master process (the rank 0 process).

Data Distribution

MPI supports the distributed memory model, where it proceeds by point-to-point and collective communications to transfer data between processors. In our implementation of the Mandelbrot set, we use the point-to-point routines MPI_Send and MPI_Recv to communicate between the master process (rank = 0) and the other processes.

3.4.6 OpenCL

OpenCL (Open Computing Language) [69] is a standard for programming heterogeneous multiprocessor platforms, where programs are divided into several parts: some, called "the kernels", execute on separate devices, e.g. GPUs with their own memories, and the others execute on the host CPU. The main object in OpenCL is the command queue, which is used to submit work to a device by enqueueing OpenCL commands to be executed. An OpenCL implementation of the Mandelbrot set is provided in Figure 3.9.

Task Parallelism

OpenCL provides the parallel construct clEnqueueTask, which enqueues a command requiring the execution of a kernel on a device by a work item (OpenCL thread). OpenCL uses two different models of execution of command queues: in-order, used for data parallelism, and out-of-order. In an out-of-order command queue, commands are executed as soon as possible, and no order is specified, except for wait and barrier events. We illustrate the out-of-order execution mechanism in Figure 3.9, but currently this is an optional feature and is thus not supported by many devices.

__kernel void kernel_main(complex c, uint maxiter, double scale_color,
                          uint m, uint P, ulong color[NPIXELS][NPIXELS])
{
  for (row = m; row < NPIXELS; row += P)
    for (col = 0; col < NPIXELS; ++col) {
      // Initialization of c, k and z
      do {
        temp = zr*zr - zi*zi + cr;
        zi = 2*zr*zi + ci; zr = temp;
        ++k;
      } while (zr*zr + zi*zi < (N*N) && k < maxiter);
      color[row][col] = (ulong) ((k-1) * scale_color);
    }
}

cl_int ret = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
ret = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_DEFAULT, 1,
                     &device_id, &ret_num_devices);
cl_context context = clCreateContext(NULL, 1, &device_id,
                                     NULL, NULL, &ret);
cQueue = clCreateCommandQueue(context, device_id,
                              OUT_OF_ORDER_EXEC_MODE_ENABLE, NULL);
P = CL_DEVICE_MAX_COMPUTE_UNITS;
memc = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(complex), c);
// Create read-only buffers with maxiter, scale_color and P too
memcolor = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                          sizeof(ulong)*height*width, NULL, NULL);
clEnqueueWriteBuffer(cQueue, memc, CL_TRUE, 0,
                     sizeof(complex), &c, 0, NULL, NULL);
// Enqueue write buffer with maxiter, scale_color and P too
program = clCreateProgramWithSource(context, 1, &program_source,
                                    NULL, NULL);
err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
kernel = clCreateKernel(program, "kernel_main", NULL);
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memc);
// Set kernel argument with
// memmaxiter, memscale_color, memP and memcolor too
for(m = 0; m < P; m++) {
  memm = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(uint), m);
  clEnqueueWriteBuffer(cQueue, memm, CL_TRUE, 0, sizeof(uint), &m, 0,
                       NULL, NULL);
  clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memm);
  clEnqueueTask(cQueue, kernel, 0, NULL, NULL);
}
clFinish(cQueue);
clEnqueueReadBuffer(cQueue, memcolor, CL_TRUE, 0, space, color, 0, NULL, NULL);
for (row = 0; row < height; ++row)
  for (col = 0; col < width; ++col) {
    XSetForeground(display, gc, color[col][row]);
    XDrawPoint(display, win, gc, col, row);
  }

Figure 3.9: OpenCL implementation of the Mandelbrot set

Synchronization

OpenCL distinguishes between two types of synchronization: coarse and fine. Coarse-grained synchronization, which deals with command queue operations, uses the construct clEnqueueBarrier, which defines a barrier synchronization point. Fine-grained synchronization, which covers synchronization at the GPU function call granularity level, uses OpenCL events via clEnqueueWaitForEvents calls.

Data transfers between the GPU memory and the host memory, via functions such as clEnqueueReadBuffer and clEnqueueWriteBuffer, also induce synchronization between blocking or non-blocking communication commands. Events returned by clEnqueue operations can be used to check if a non-blocking operation has completed.

Atomic Section

Atomic operations are only supported on integer data, via functions such as atom_add or atom_xchg. Currently, these are only supported by some devices, as part of an extension of the OpenCL standard. Since OpenCL lacks support for general atomic sections, the drawing function is executed by the host in Figure 3.9.

Data Distribution

Each work item can use (1) its private memory, (2) its local memory, which is shared between multiple work items, (3) its constant memory, which is closer to the processor than the global memory, and thus much faster to access, although slower than local memory, and (4) global memory, shared by all work items. Data is only accessible after being transferred from the host, using functions such as clEnqueueReadBuffer and clEnqueueWriteBuffer that move data in and out of a device.
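To make these four address spaces concrete, the toy OpenCL C kernel below (our own example, unrelated to the Mandelbrot code) uses the corresponding qualifiers:

// Toy kernel illustrating the OpenCL address spaces.
__kernel void scale(__global float *data,      // global: visible to all work items
                    __constant float *factor,  // constant: read-only, cached
                    __local float *scratch)    // local: shared within a work-group
{
  int i = get_global_id(0);
  float tmp = data[i];                         // tmp lives in private memory
  scratch[get_local_id(0)] = tmp;              // stage the value in local memory
  barrier(CLK_LOCAL_MEM_FENCE);                // work-group synchronization
  data[i] = scratch[get_local_id(0)] * factor[0];
}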

3.5 Discussion and Comparison

This section discusses the salient features of our surveyed languages. More specifically, we look at their design philosophy and the new concepts they introduce, how point-to-point synchronization is addressed in each of these languages, the various semantics of atomic sections, and the data distribution issues. We end by summarizing the key features of all the languages covered in this chapter.

Design Paradigms

Our overview study, based on a single running example, namely the computation of the Mandelbrot set, is admittedly somewhat biased, since each language has been designed with a particular application framework in mind, which may or may not be well adapted to a given application. Cilk is well suited to deal with divide-and-conquer strategies, something not put into practice in our example. On the contrary, X10, Chapel and Habanero-Java are high-level Partitioned Global Address Space languages that offer abstract notions such as places and locales, which were put to good use in our example. OpenCL is a low-level, verbose language that works across GPUs and CPUs; our example clearly illustrates that this approach does not provide much help here in terms of shrinking the semantic gap between specification and implementation. The OpenMP philosophy is to add compiler directives to parallelize parts of code on shared-memory machines; this helps programmers move incrementally from a sequential to a parallel implementation. While OpenMP suffers from portability issues, MPI tackles both shared and distributed memory systems using low-level support for communications.

New Concepts

Even though this thesis does not address data parallelism per se, note that Cilk is the only language that does not provide support for data parallelism; yet spawned threads can be used inside loops to simulate SIMD processing. Also, Cilk adds a facility to support speculative parallelism, enabling abort operations on spawned tasks via the abort statement. Habanero-Java introduces the isolated statement to specify the weak atomicity property. Phasers, in Habanero-Java, and clocks, in X10, are new high-level constructs for collective and point-to-point synchronization between varying sets of threads.

Point-to-Point Synchronization

We illustrate the way the surveyed languages address the difficult issue of point-to-point synchronization via a simple example, a hide-and-seek children's game, in Figure 3.10. X10 clocks or Habanero-Java phasers help express easily the different phases between threads. The notion of point-to-point synchronization cannot be expressed easily using OpenMP or Chapel⁴. We were not able to implement this game using Cilk high-level synchronization primitives, since sync, the only synchronization construct, is a local barrier for recursive tasks: it synchronizes only threads spawned in the current procedure, and thus not the two searcher and hider tasks. As mentioned above, this is not surprising given Cilk's approach to parallelism.

Atomic Section

The semantics and implementations of the various proposals for dealing with atomicity are rather subtle.

Atomic operations, which apply to single instructions, can be efficiently implemented, e.g. in X10, using non-blocking techniques such as the atomic

⁴ Busy-waiting techniques can be applied.

finish async {
  clock cl = clock.make();
  async clocked(cl) {
    count_to_a_number();
    next;
    start_searching();
  }
  async clocked(cl) {
    hide_oneself();
    next;
    continue_to_be_hidden();
  }
}

finish async {
  phaser ph = new phaser();
  async phased(ph) {
    count_to_a_number();
    next;
    start_searching();
  }
  async phased(ph) {
    hide_oneself();
    next;
    continue_to_be_hidden();
  }
}

cilk void searcher() {
  count_to_a_number();
  point_to_point_sync();   // missing
  start_searching();
}
cilk void hidder() {
  hide_oneself();
  point_to_point_sync();   // missing
  continue_to_be_hidden();
}
void main() {
  spawn searcher();
  spawn hidder();
}

Figure 3.10: A hide-and-seek game (X10, HJ, Cilk)

instruction compare-and-swap. In OpenMP, the atomic directive can be made to work faster than the critical directive when atomic operations are replaced with processor commands such as GLSC [71] instructions, or "gather-linked and scatter-conditional", which extend scatter-gather hardware to support advanced atomic memory operations; therefore, it is better to use this directive when protecting shared memory during elementary operations. Atomic operations can be used to update different elements of a data structure (arrays, records) in parallel, without using many explicit locks. In the example of Figure 3.11, the updates of different elements of Array x are allowed to occur in parallel. General atomic sections, on the other hand, serialize the execution of updates to elements via one lock.

With the weak atomicity model of Habanero-Java, the isolated keyword is used instead of atomic to make explicit the fact that the construct supports weak rather than strong isolation. In Figure 3.12, Threads 1 and 2 may access ptr simultaneously; since weakly atomic accesses are used, an atomic access to temp->next is not enforced.

#pragma omp parallel for shared(x, index, n)
for (i = 0; i < n; i++) {
#pragma omp atomic
  x[index[i]] += f(i);   // index is supposed injective
}

Figure 3.11: Example of an atomic directive in OpenMP

// Thread 1
ptr = head;              // non isolated statement
isolated {
  ready = true;
}

// Thread 2
isolated {
  if (ready)
    temp->next = ptr;
}

Figure 3.12: Data race on ptr with Habanero-Java

Data Distribution

PGAS languages offer a compromise between the fine level of control of data placement provided by the message passing model and the simplicity of the shared memory model. However, the physical reality is that different PGAS portions, although logically distinct, may refer to the same physical processor and share physical memory. Practical performance might thus not be as good as expected.

Regarding the shared memory model, despite its simplicity of programming, programmers have scarce support for expressing data locality, which could help improve performance in many cases. Debugging is also difficult when data races or deadlocks occur.

Finally, the message passing memory model, where processors have no direct access to the memories of other processors, can be seen as the most general one, in which programmers can both specify data distribution and control locality. Shared memory (where there is only one processor managing the whole memory) and PGAS (where one assumes that each portion is located on a distinct processor) models can be seen as particular instances of the message passing model, when converting implicit write and read operations into explicit send/receive message passing constructs.

Summary Table

We collect in Table 3.1 the main characteristics of each language addressed in this chapter. Even though we have not discussed the issue of data parallelism in this chapter, we nonetheless provide the main constructs used in each language to launch data parallel computations.

Language             | Task creation          | Task join                 | Point-to-point | Atomic section           | Data parallelism     | Memory model
Cilk (MIT)           | spawn, abort           | sync                      | —              | cilk_lock, cilk_unlock   | —                    | Shared
Chapel (Cray)        | begin, cobegin         | sync                      | sync           | sync, atomic             | forall, coforall     | PGAS (Locales)
X10 (IBM)            | async, future          | finish, force             | next           | atomic                   | foreach              | PGAS (Places)
Habanero-Java (Rice) | async, future          | finish, get               | next           | atomic, isolated         | foreach              | PGAS (Places)
OpenMP               | omp task, omp section  | omp taskwait, omp barrier | —              | omp critical, omp atomic | omp for              | Shared
OpenCL               | EnqueueTask            | Finish, EnqueueBarrier    | events         | atom_add                 | EnqueueNDRangeKernel | Message passing
MPI                  | MPI_spawn              | MPI_Finalize, MPI_Barrier | —              | —                        | MPI_Init             | Message passing

Table 3.1: Summary of parallel languages constructs

3.6 Conclusion

Using the Mandelbrot set computation as a running example, this chapter presents an up-to-date comparative overview of seven parallel programming languages: Cilk, Chapel, X10, Habanero-Java, OpenMP, MPI and OpenCL. These languages are in current use, popular, offer rich and highly abstract functionalities, and most support both data and task parallel execution models. The chapter describes how, in addition to data distribution and locality, the fundamentals of task parallel programming, namely task creation, collective and point-to-point synchronization, and mutual exclusion, are dealt with in these languages.

This study serves as the basis for the design of SPIRE, a sequential to parallel intermediate representation extension that we use to upgrade the intermediate representations of the PIPS [19] source-to-source compilation framework to represent task concepts in parallel languages. SPIRE is presented in the next chapter.

An earlier version of the work presented in this chapter was published in [66].

Chapter 4

SPIRE: A Generic Sequential to Parallel Intermediate Representation Extension Methodology

I was working on the proof of one of my poems all the morning, and took out a comma. In the afternoon I put it back again. Oscar Wilde

The research goal addressed in this thesis is the design of efficient automatic parallelization algorithms that use as a backend a uniform, simple and generic parallel intermediate core language. Instead of designing from scratch one such particular intermediate language, we take an indirect, more general route. We use the survey presented in Chapter 3 to design SPIRE, a new methodology for the design of parallel extensions of the intermediate representations used in compilation frameworks of sequential languages. It can be used to leverage existing infrastructures for sequential languages to address both control and data parallel constructs, while preserving as much as possible existing analyses for sequential code. In this chapter, we suggest to view this upgrade process as an "intermediate representation transformer" at the syntactic and semantic levels; we show this can be done via the introduction of only ten new concepts, collected in three groups, namely execution, synchronization and data distribution, precisely defined via a formal semantics and rewriting rules.

We use the sequential intermediate representation of PIPS, a comprehensive source-to-source compilation platform, as a use case for our approach of the definition of parallel intermediate languages. We introduce our SPIRE parallel primitives, extend PIPS intermediate representation and show how example code snippets from the OpenCL, Cilk, OpenMP, X10, Habanero-Java, MPI and Chapel parallel programming languages can be represented this way. A formal definition of SPIRE operational semantics is provided, built on top of the one used for the sequential intermediate representation. We assess the generality of our proposal by showing how a different sequential IR, namely LLVM, can be extended to handle parallelism using the SPIRE methodology.

Our primary goal with the development of SPIRE is to provide at a low cost powerful parallel program representations that ease the design of efficient automatic parallelization algorithms. More precisely, in this chapter, our intent is to describe SPIRE as a uniform, simple and generic core language transformer and apply it to PIPS IR. We illustrate the applicability of the resulting parallel IR by using it as parallel code target in Chapter 7, where we also address code generation issues.

4.1 Introduction

The growing importance of parallel computers and the search for efficient programming models led, and is still leading, to the proliferation of parallel programming languages such as, currently, Cilk [103], Chapel [6], X10 [5], Habanero-Java [30], OpenMP [4], OpenCL [69] or MPI [3], presented in Chapter 3. Our automatic task parallelization algorithm, which we integrate within PIPS [58], a comprehensive source-to-source compilation and optimization platform, should help in the adaptation of PIPS to such an evolution. To reach this goal, we introduce an internal intermediate representation (IR) to represent parallel programs into PIPS. The choice of a proper parallel IR is of key importance, since the efficiency and power of the transformations and optimizations the compiler PIPS can perform are closely related to the selection of a proper program representation paradigm. This seems to suggest the introduction of a dedicated IR for each parallel programming paradigm. Yet, it would be better from a software engineering point of view to find a unique parallel IR, as general and simple as possible.

Existing proposals for program representation techniques already provide a basis for the exploitation of parallelism via the encoding of control and/or data flow information. HPIR [111], PLASMA [89] or InsPIRe [57] are instances that operate at a high abstraction level, while the hierarchical task, stream or program dependence graphs (we survey these notions in Section 4.5) are better suited to graph-based approaches. Unfortunately, many existing compiler frameworks such as PIPS use traditional representations for sequential-only programs, and changing their internal data structures to deal with parallel constructs is a difficult and time-consuming task.

In this thesis, we choose to develop a methodology for the design of parallel extensions of the intermediate representations used in compilation frameworks of sequential languages in general, instead of developing a parallel intermediate representation for PIPS only. The main motivation behind the design of the methodology introduced in this chapter is to preserve the many years of development efforts invested in huge compiler platforms such as GCC (more than 7 million lines of code), PIPS (600 000 lines of code), LLVM (more than 1 million lines of code), when upgrading their intermediate representations to handle parallel languages as source languages or as targets for source-to-source transformations. We provide an evolutionary path for these large software developments via the introduction of the Sequential to Parallel Intermediate Representation Extension (SPIRE) methodology. We show that it can be plugged into existing compilers in a rather simple manner. SPIRE is based on only three key concepts: (1) the parallel vs sequential execution of groups of statements, such as sequences, loops and general control-flow graphs, (2) the global synchronization characteristics of statements and the specification of finer grain synchronization via the notion of events, and (3) the handling of data distribution for different memory models. To illustrate how this approach can be used in practice, we use SPIRE to extend the intermediate representation (IR) [32] of PIPS.

The design of SPIRE is the result of many trade-offs between generality and precision, abstraction and low-level concerns. On the one hand, and in particular when looking at source-to-source optimizing compiler platforms adapted to multiple source languages, one needs to (1) represent as many of the existing (and hopefully future) parallel constructs as possible, (2) minimize the number of additional built-in functions that hamper compilers optimization passes, and (3) preserve high-level structured parallel constructs to keep their properties, such as deadlock freedom, while minimizing the number of new concepts introduced in the parallel IR. On the other hand, keeping only a limited number of hardware-level notions in the IR, while good enough to deal with all parallel constructs, would entail convoluted rewritings of high-level parallel flows. We used an extensive survey of key parallel languages, namely Cilk, Chapel, X10, Habanero-Java, OpenMP, OpenCL and MPI, to guide our design of SPIRE, while showing how to express their relevant parallel constructs within SPIRE.

The remainder of this chapter is structured as follows. Our parallel extension proposal, SPIRE, is introduced in Section 4.2, where we illustrate its application to PIPS sequential IR, presented in Section 2.6.5; we also show how simple illustrative examples written in OpenCL, Cilk, OpenMP, X10, Habanero-Java, MPI and Chapel can be easily represented within SPIRE(PIPS IR)1. The formal operational semantics of SPIRE is given in Section 4.3. Section 4.4 shows the generality of the SPIRE methodology by showing its use on LLVM. We survey existing parallel IRs in Section 4.5. We conclude in Section 4.6.

4.2 SPIRE: a Sequential to Parallel IR Extension

In this section, we present in detail the SPIRE methodology, which is used to add parallel concepts to sequential IRs. After introducing our design philosophy, we describe the application of SPIRE on the PIPS IR presented in Section 2.6.5. We illustrate these SPIRE-derived constructs with code excerpts from various parallel programming languages; our intent is not to provide here general rewriting techniques from these to SPIRE, but to provide hints on how such rewritings might possibly proceed. Note that, in Section 4.4, using LLVM, we show that our methodology is general enough to be adapted to other IRs.

1 SPIRE(PIPS IR) means that the transformation function SPIRE is applied on the sequential IR of PIPS; it yields its parallel IR.

4.2.1 Design Approach

SPIRE intends to be a practical methodology to extend existing sequential IRs to parallelism constructs, either to generate parallel code from sequential programs or compile explicitly parallel programming languages. Interestingly, the idea of seeing the issue of parallelism as an extension over sequential concepts is in sync with Dijkstra's view that "parallelism or concurrency are operational concepts that refer not to the program, but to its execution" [40]. If one accepts such a vision, adding parallelism extensions to existing IRs, as advocated by our approach with SPIRE, can thus, at a fundamental level, not be seen as an afterthought but as a consequence of the fundamental nature of parallelism.

Our design of SPIRE does not intend to be minimalist, but to be integrable as seamlessly as possible within actual IRs, and general enough to handle as many parallel programming constructs as possible. To be successful, our design point must provide proper trade-offs between generality, expressibility and conciseness of representation. We used an extensive survey of existing parallel languages to guide us during this design process. Table 4.1, which extends the one provided in Chapter 3, summarizes the main characteristics of seven recent and widely used parallel languages: Cilk, Chapel, X10, Habanero-Java, OpenMP, OpenCL and MPI. The main constructs used in each language to launch task and data parallel computations, perform synchronization, introduce atomic sections and transfer data in the various memory models are listed. Our main finding from this analysis is that, to be able to deal with parallel programming, one simply needs to add to a given sequential IR the ability to specify (1) the parallel execution mechanism of groups of statements, (2) the synchronization behavior of statements, and (3) the layout of data, i.e., how memory is modeled in the parallel language.

The last line of Table 4.1 summarizes the approach we propose to map these programming concepts to our parallel intermediate representation extension. SPIRE is based on the introduction of only ten key notions, collected in three groups:

• execution, via the sequential and parallel constructs;

• synchronization, via the spawn, barrier, atomic, single, signal and wait constructs;


Language             | Parallelism               | Task creation          | Task join                 | Point-to-point | Atomic section           | Memory model        | Data distribution
Cilk (MIT)           | --                        | spawn                  | sync                      | --             | cilk_lock                | Shared              | --
Chapel (Cray)        | forall, coforall, cobegin | begin                  | sync                      | sync           | sync, atomic             | PGAS (Locales)      | (on)
X10 (IBM)            | foreach                   | async, future          | finish, force             | next           | atomic                   | PGAS (Places)       | (at)
Habanero-Java (Rice) | foreach                   | async, future          | finish, get               | next           | atomic, isolated         | PGAS (Places)       | (at)
OpenMP               | omp for, omp sections     | omp task, omp section  | omp taskwait, omp barrier | --             | omp critical, omp atomic | Shared              | private, shared
OpenCL               | EnqueueNDRangeKernel      | EnqueueTask            | Finish, EnqueueBarrier    | events         | atom_add                 | Distributed         | ReadBuffer, WriteBuffer
MPI                  | MPI_Init                  | MPI_spawn              | MPI_Finalize, MPI_Barrier | --             | --                       | Distributed         | MPI_Send, MPI_Recv
SPIRE                | sequential, parallel      | spawn                  | barrier                   | signal, wait   | atomic                   | Shared, Distributed | send, recv

Table 4.1: Mapping of SPIRE to parallel languages constructs (terms in parentheses are not currently handled by SPIRE)

• data distribution, via the send and recv constructs.

Small code snippets are provided below to sketch how the key constructs of these parallel languages can be encoded in practice within a SPIRE-extended parallel IR.

4.2.2 Execution

The issue of parallel vs sequential execution appears when dealing with groups of statements, which in our case study correspond to members of the forloop, sequence and unstructured sets presented in Section 2.6.5. To apply SPIRE to PIPS sequential IR, an execution attribute is added to these sequential set definitions:

forloop'      = forloop x execution

sequence'     = sequence x execution

unstructured' = unstructured x execution

The primed sets forloop' (expressing data parallelism), sequence' and unstructured' (implementing control parallelism) represent SPIREd-up sets for the PIPS parallel IR. Of course, the 'prime' notation is used here for pedagogical purpose only; in practice, an execution field is added in the existing IR representation. The definition of execution is straightforward:


execution = sequential:unit + parallel:unit

where unit denotes a set with one single element; this encodes a simple enumeration of cases for execution. A parallel execution attribute asks for all loop iterations, sequence statements and control nodes of unstructured instructions to be run concurrently. Note that there is no implicit barrier after the termination of these parallel statements; in Section 4.2.3, we introduce the annotation barrier that can be used to add an explicit barrier on these parallel statements, if need be.

For instance, a parallel execution construct can be used to represent data parallelism on GPUs, when expressed via the OpenCL clEnqueueNDRangeKernel function (see the top of Figure 4.1). This function call could be encoded within PIPS parallel IR as a parallel loop (see the bottom of Figure 4.1), each iteration executing the kernel function as a separate task, receiving the proper index value as an argument.

// Execute 'n' kernels in parallel
global_work_size[0] = n;
err = clEnqueueNDRangeKernel(cmd_queue, kernel,
                             1, NULL, global_work_size,
                             NULL, 0, NULL, NULL);

forloop(I, 1,
        global_work_size,
        1,
        kernel(),
        parallel)

Figure 4.1: OpenCL example illustrating a parallel loop

Another example, in the left side of Figure 4.2, from Chapel, illustrates its forall data parallelism construct, which will be encoded with a SPIRE parallel loop.

forall i in 1..n do
  t[i] = 0;

forloop(i, 1, n, 1,
        t[i] = 0, parallel)

Figure 4.2: forall in Chapel and its SPIRE core language representation

The main difference between a parallel sequence and a parallel unstructured is that, in a parallel unstructured, synchronizations are explicit, using arcs that represent dependences between statements. An example of an unstructured code is illustrated in Figure 4.3, where the graph in the right side is a parallel unstructured representation of the code in the left side. Nodes are statements and edges are data dependences between them; for example, the statement u=x+1 cannot be launched before the statement x=3 is terminated. Figure 4.4, from OpenMP, illustrates its parallel sections task parallelism construct, which will be encoded with a SPIRE parallel sequence. We can say that a parallel sequence is only suitable for fork-join parallelism, whereas any other unstructured form of parallelism can be encoded using the control flow graph (unstructured) set.

x = 3;     // statement 1
y = 5;     // statement 2
z = x + y; // statement 3
u = x + 1; // statement 4

[Parallel unstructured control flow graph: nop_entry -> x=3 and y=5; x=3 -> z=x+y and u=x+1; y=5 -> z=x+y; z=x+y and u=x+1 -> nop_exit]

Figure 4.3: A C code and its unstructured parallel control flow graph representation

#pragma omp sections nowait
{
  #pragma omp section
  x = foo();
  #pragma omp section
  y = bar();
}
z = baz(x, y);

sequence(
  sequence(x = foo(); y = bar(),
           parallel);
  z = baz(x, y),
  sequential)

Figure 4.4: parallel sections in OpenMP and its SPIRE core language representation

Representing the presence of parallelism using only the annotation of parallel constitutes a voluntary trade-off between conciseness and expressibility of representation. For example, representing the code of OpenCL in Figure 4.1 in the way we suggest may induce a loss of precision regarding the use of queues of tasks adopted in OpenCL, and lead to errors; great care must be taken when analyzing programs to avoid errors in translation between OpenCL and SPIRE, such as the scheduling of these tasks.


4.2.3 Synchronization

The issue of synchronization is a characteristic feature of the run-time behavior of one statement with respect to other statements. In parallel code, one usually distinguishes between two types of synchronization: (1) collective synchronization between threads using barriers, and (2) point-to-point synchronization between participating threads. We suggest this can be done in two parts.

Collective Synchronization

The first possibility is via the new synchronization attribute, which can be added to the specification of statements. SPIRE extends sequential intermediate representations in a straightforward way by adding a synchronization attribute to the specification of statements:

statement' = statement x synchronization

Coordination by synchronization in parallel programs is often dealt with via coding patterns such as barriers, used for instance when a code fragment contains many phases of parallel execution, where each phase should wait for the precedent ones to proceed. We define the synchronization set via high-level coordination characteristics useful for optimization purposes:

synchronization =
  none:unit + spawn:entity + barrier:unit +
  single:bool + atomic:reference

where S is the statement with the synchronization attribute

• none specifies the default behavior, i.e., independent with respect to other statements, for S;

• spawn induces the creation of an asynchronous task S, while the value of the corresponding entity is the user-chosen number of the thread that executes S. SPIRE thus, in addition to decorating parallel tasks using the parallel execution attribute to handle coarse grain parallelism, provides the possibility of specifying each task via the spawn synchronization attribute, in order to enable the encoding of a finer grain level of parallelism;

• barrier specifies that all the child threads spawned by the execution of S are suspended before exiting until they are all finished – an OpenCL example illustrating spawn, to encode the clEnqueueTask instruction, and barrier, to encode the clEnqueueBarrier instruction, is provided in Figure 4.5;


mode = OUT_OF_ORDER_EXEC_MODE_ENABLE;
commands = clCreateCommandQueue(context,
                                device_id, mode, &err);
clEnqueueTask(commands, kernel_A, 0, NULL, NULL);
clEnqueueTask(commands, kernel_B, 0, NULL, NULL);
// synchronize so that Kernel C starts only
// after Kernels A and B have finished
clEnqueueBarrier(commands);
clEnqueueTask(commands, kernel_C, 0, NULL, NULL);

barrier(
  spawn(zero, kernel_A());
  spawn(one, kernel_B())
);
spawn(zero, kernel_C())

Figure 4.5: OpenCL example illustrating spawn and barrier statements

• single ensures that S is executed by only one thread in its thread team (a thread team is the set of all the threads spawned within the innermost parallel forloop statement), and a barrier exists at the end of a single operation if its synchronization single value is true;

• atomic predicates the execution of S via the acquisition of a lock to ensure exclusive access: at any given time, S can be executed by only one thread. Locks are logical memory addresses, represented here by a member of the PIPS IR reference set presented in Section 2.6.5. An example illustrating how an atomic synchronization on the reference l in a SPIRE statement accessing Array x can be translated in Cilk (via Cilk_lock and Cilk_unlock) and OpenMP (critical) is provided in Figure 4.6.

Cilk_lockvar l;
Cilk_lock_init(l);
Cilk_lock(l);
x[index[i]] += f(i);
Cilk_unlock(l);

#pragma omp critical
x[index[i]] += f(i);

Figure 4.6: Cilk and OpenMP examples illustrating an atomically-synchronized statement


Event API: Point-to-Point Synchronization

The second possibility addresses the case of point-to-point synchronization between participating threads. Handling point-to-point synchronization using decorations on abstract syntax trees is too constraining when one has to deal with a varying set of threads that may belong to different parallel parent nodes. Thus, SPIRE suggests to deal with this last class of coordination using a new class of values, of the event type. SPIRE extends the underlying type system of the existing sequential IRs with a new basic type, namely event:

type' = type + event:unit

Values of type event are counters, in a manner reminiscent of semaphores. The programming interface for events is defined by the following atomic functions:

• event newEvent(int i) is the creation function of events, initialized with the integer i that specifies how many threads can execute wait on this event without being blocked;

• void freeEvent(event e) is the deallocation function of the event e;

• void signal(event e) increments by one the event value2 of e;

• void wait(event e) blocks the thread that calls it until the value of e is strictly greater than 0; when the thread is released, this value is decremented by one (a minimal sketch of a possible implementation of this API is given right after this list).
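The sketch below is only one possible way to realize these event counters on top of POSIX threads; it is an assumption of this rewrite, not the PIPS implementation, and the functions are renamed signalEvent/waitEvent to avoid clashing with the POSIX signal and wait identifiers.

#include <pthread.h>
#include <stdlib.h>

typedef struct {
    long            value;   /* counter; may be initialized to a negative value */
    pthread_mutex_t lock;
    pthread_cond_t  cond;
} event;

event *newEvent(long i)                 /* create an event initialized to i */
{
    event *e = malloc(sizeof *e);
    e->value = i;
    pthread_mutex_init(&e->lock, NULL);
    pthread_cond_init(&e->cond, NULL);
    return e;
}

void signalEvent(event *e)              /* increment; never blocks */
{
    pthread_mutex_lock(&e->lock);
    e->value++;
    pthread_cond_broadcast(&e->cond);   /* wake up blocked waiters */
    pthread_mutex_unlock(&e->lock);
}

void waitEvent(event *e)                /* block until value > 0, then decrement */
{
    pthread_mutex_lock(&e->lock);
    while (e->value <= 0)
        pthread_cond_wait(&e->cond, &e->lock);
    e->value--;
    pthread_mutex_unlock(&e->lock);
}

void freeEvent(event *e)                /* deallocate the event */
{
    pthread_cond_destroy(&e->cond);
    pthread_mutex_destroy(&e->lock);
    free(e);
}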

In a first example of possible use of this event API, the construct future used in X10 (see Figure 4.7) can be seen as the spawning of the computation of foo(). The end result is obtained via the call to the force method; such a mechanism can be easily implemented in SPIRE using an event attached to the running task: it is signaled when the task is completed and waited for by the force method.

future<int> Fi = future {foo()};
int i = Fi.force();

Figure 4.7: X10 example illustrating a future task and its synchronization

2 The void return type will be replaced by int in practice, to enable the handling of error values.

A second example, taken from Habanero-Java, illustrates how point-to-point synchronization primitives such as phasers and the next statement can be dealt with using the Event API (see Figure 4.8, left). The async phased keyword can be replaced by spawn. In this example, the next statement is equivalent to the following sequence:

signal(ph)

wait(ph)

signal(ph)

where the event ph is supposed to be initialized to newEvent(-(n-1)); the second signal is used to resume the suspended tasks in a chain-like fashion.

finish {
  phaser ph = new phaser();
  for (j = 1; j <= n; j++)
    async phased(ph<SIG_WAIT>) {
      S;
      next;
      S';
    }
}

barrier(
  ph = newEvent(-(n-1));
  j = 1;
  loop(j <= n,
       spawn(j,
             S;
             signal(ph);
             wait(ph);
             signal(ph);
             S');
       j = j+1);
  freeEvent(ph)
)

Figure 4.8: A phaser in Habanero-Java and its SPIRE core language representation

In order to illustrate the importance of the constructs implementing point-to-point synchronization, and thus the necessity to handle them, we show the code in Figure 4.9, extracted from [97], where phasers can be used to implement one of the main types of parallelism, namely pipeline parallelism (see Section 2.2).

finish {
  phaser[] ph = new phaser[m+1];
  for (int i = 1; i < m; i++)
    async phased(ph[i]<SIG>, ph[i-1]<WAIT>) {
      for (int j = 1; j < n; j++) {
        a[i][j] = foo(a[i][j], a[i][j-1], a[i-1][j-1]);
        next;
      }
    }
}

Figure 4.9: Example of Pipeline Parallelism with phasers


Representing high-level point-to-point synchronization such as phasers and futures using low-level functions of event type constitutes a trade-off between generality and conciseness (we can represent any type of synchronization via the low-level interface using events) and expressibility of representation (we lose some of the expressiveness and precision of these high-level constructs, such as the deadlock-freedom safety property of phasers [100]). Indeed, a Habanero-Java parallel code avoids deadlock since the scope of each used phaser is the immediately enclosing finish. However, the scope of the events is less restrictive: the user can put them wherever he wants, and this may lead to the appearance of deadlocks.

4.2.4 Data Distribution

The choice of a proper memory model to express parallel programs is an important issue when designing a generic intermediate representation. There are usually two main approaches to memory modeling: shared and message passing models. Since SPIRE is designed to extend existing IRs for sequential languages, it can be straightforwardly seen as using a shared memory model when parallel constructs are added. By convention, we say that spawn creates processes in the case of message passing memory models, and threads in the other case.

In order to take into account the explicit distribution required by the message passing memory model used in parallel languages such as MPI, SPIRE introduces the send and recv blocking functions for implementing communication between processes:

• void send(int dest, entity buf) transfers the value in Entity buf to the process numbered dest;

• void recv(int source, entity buf) receives in buf the value sent by Process source.

The MPI example in Figure 4.10 can be represented in SPIRE as a parallel loop with index rank of size iterations, whose body is the MPI code from MPI_Comm_size to MPI_Finalize. The communication of Variable sum from Process 1 to Process 0 can be handled with SPIRE send/recv functions.

Note that non-blocking communications can be easily implemented in SPIRE using the above primitives within spawned statements (see Figures 4.11 and 4.12).

A non-blocking receive should indicate the completion of the communication. We may use, for instance, finish_recv (see Figure 4.12), that can be accessed later by the receiver to test the status of the communication.

Also, broadcast collective communications, such as defined in MPI, can be represented via wrappers around send and recv operations. When the


MPI_Init(&argc, &argv[]);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if (rank == 0)
  MPI_Recv(sum, sizeof(sum), MPI_FLOAT, 1, 1,
           MPI_COMM_WORLD, &stat);
else if (rank == 1) {
  sum = 42;
  MPI_Send(sum, sizeof(sum), MPI_FLOAT, 0, 1,
           MPI_COMM_WORLD);
}
MPI_Finalize();

forloop(rank, 0, size, 1,
        if(rank==0,
           recv(one, sum),
           if(rank==1,
              sum = 42;
              send(zero, sum),
              nop)
          ),
        parallel)

Figure 4.10: MPI example illustrating a communication and its SPIRE core language representation

spawn(new, send(zero, sum))

Figure 4.11: SPIRE core language representation of a non-blocking send

atomic(finish_recv = false);
spawn(new, recv(one, sum);
           atomic(finish_recv = true)
)

Figure 4.12: SPIRE core language representation of a non-blocking receive

master process and receiver processes want to perform a broadcast operation, then, if this process is the master, its broadcast operation is equivalent to a loop over receivers with a call to send as body; otherwise (receiver), the broadcast is a recv function. Figure 4.13 illustrates the SPIRE representation of a broadcast collective communication of the value of sum.

Another interesting memory model for parallel programming has been introduced somewhat recently: the Partitioned Global Address Space [110].


if(rank==0,
   forloop(rank, 1, size, 1, send(rank, sum),
           parallel),
   recv(zero, sum))

Figure 4.13: SPIRE core language representation of a broadcast

The uses of the PGAS memory model in languages such as UPC [33], Habanero-Java, X10 and Chapel introduce various notions such as Place or Locale to label portions of a logically-shared memory that threads may access, in addition to complex APIs for distributing objects/arrays over these portions. Given the wide variety of current proposals, we leave the issue of integrating the PGAS model within the general methodology of SPIRE as future work.

4.3 SPIRE Operational Semantics

The purpose of the formal definition given in this section is to provide a solid basis for program analyses and transformations. It is a systematic way to specify our IR extension mechanism, something seldom present in IR definitions. It also illustrates how SPIRE leverages the syntactic and semantic level of sequential constructs to parallel ones, preserving the traits of the sequential operational semantics.

Fundamentally, at the syntactic and semantic levels, SPIRE is a methodology for expressing representation transformers, mapping the definition of a sequential language IR to a parallel version. We define the operational semantics of SPIRE in a two-step fashion: we introduce (1) a minimal core parallel language that we use to model fundamental SPIRE concepts and for which we provide a small-step operational semantics, and (2) rewriting rules that translate the more complex constructs of SPIRE into this core language.

4.3.1 Sequential Core Language

Illustrating the transformations induced by SPIRE requires the definition of a sequential IR basis, as is done above via PIPS IR. Since we focus here on the fundamentals, we use as core language a simpler, minimal, sequential language, Stmt. Its syntax is given in Figure 4.14, where we assume that the sets Ide of identifiers I and Exp of expressions E are given.

Sequential statements are: (1) nop for no operation, (2) I=E for an assignment of E to I, (3) S1;S2 for a sequence and (4) loop(E,S) for a while loop.


S ∈ Stmt =
    nop | I=E | S1;S2 | loop(E,S)

S ∈ SPIRE(Stmt) =
    nop | I=E | S1;S2 | loop(E,S) |
    spawn(I,S) |
    barrier(S) | barrier_wait(n) |
    wait(I) | signal(I) |
    send(I,I') | recv(I,I')

Figure 4.14: Stmt and SPIRE(Stmt) syntaxes

At the semantic level, a statement in Stmt is a very simple memory transformer. A memory m ∈ Memory is a mapping in Ide → Value, where values v ∈ Value = N + Bool can either be integers n ∈ N or booleans b ∈ Bool. The sequential operational semantics for Stmt, expressed as transition rules over configurations κ ∈ Configuration = Memory × Stmt, is given in Figure 4.15; we assume that the program is syntax- and type-correct. A transition (m, S) → (m', S') means that executing the statement S in a memory m yields a new memory m' and a new statement S'; we posit that the "→" relation is transitive. Rules 4.1 to 4.5 encode typical sequential small-step operational semantic rules for the sequential part of the core language. We assume that ζ ∈ Exp → Memory → Value is the usual function for expression evaluation.

           v = ζ(E)m
  ----------------------------                (4.1)
  (m, I=E) → (m[I → v], nop)

  (m, nop; S) → (m, S)                        (4.2)

       (m, S1) → (m', S1')
  -----------------------------               (4.3)
  (m, S1; S2) → (m', S1'; S2)

             ζ(E)m
  -----------------------------------         (4.4)
  (m, loop(E,S)) → (m, S; loop(E,S))

            ¬ζ(E)m
  ----------------------------                (4.5)
  (m, loop(E,S)) → (m, nop)

Figure 4.15: Stmt sequential transition rules


The semantics of a whole sequential program S is defined as the memory m such that (⊥, S) → (m, nop), if the execution of S terminates.

4.3.2 SPIRE as a Language Transformer

Syntax

At the syntactic level, SPIRE specifies how a grammar for a sequential language such as Stmt is transformed, i.e., extended with synchronized parallel statements. The grammar of SPIRE(Stmt) in Figure 4.14 adds to the sequential statements of Stmt (from now on synchronized using the default none) new parallel statements: a task creation spawn, a termination barrier, and two wait and signal operations on events, or send and recv operations for communication. Synchronizations single and atomic are defined via rewriting (see Subsection 4.3.3). The statement barrier_wait(n), added here for specifying the multiple-step behavior of the barrier statement in the semantics, is not accessible to the programmer. Figure 4.8 provides the SPIRE representation of a program example.

Semantic Domains

SPIRE is an intermediate representation transformer. As it extends grammars, it also extends semantics. The set of values manipulated by SPIRE(Stmt) statements extends the sequential Value domain with events e ∈ Event = N, that encode events' current values; we posit that ζ(newEvent(E))m = ζ(E)m.

Parallelism is managed in SPIRE via processes (or threads). We introduce control state functions π ∈ State = Proc → Configuration × Procs to keep track of the whole computation, mapping each process i ∈ Proc = N to its current configuration (i.e., the statement it executes and its own view of memory m) and the set c ∈ Procs = P(Proc) of the process children it has spawned during its execution.

In the following, we note domain(π) = {i ∈ Proc / π(i) is defined} the set of currently running processes, and π[i → (κ, c)] the state π extended at i with (κ, c). A process is said to be finished if and only if all its children processes in c are also finished, i.e., when only nop is left to execute: finished(π, c) = (∀i ∈ c, ∃ci ∈ Procs, ∃mi ∈ Memory, π(i) = ((mi, nop), ci) ∧ finished(π, ci)).

Memory Models

The memory model of sequential languages is a unique address space for identifier values. In our parallel extension, a configuration for a given process or thread includes its view of memory. We suggest to use the same semantic rules, detailed below, to deal with both shared and message passing memory rules. The distinction between these models, beside the additional use of send/receive constructs in the message passing model versus events in the shared one, is included in SPIRE via constraints we impose on the control states π used in computations. Namely, we introduce the variable pids ∈ Ide, such that m(pids) ∈ P(Ide), that is used to harvest the process identities, and we posit that, in the shared memory model, for all threads i and i' with π(i) = ((m, S), c) and π(i') = ((m', S'), c'), one has: ∀I ∈ (domain(m) ∩ domain(m')) − (m(pids) ∪ m'(pids)), m(I) = m'(I). No such constraint is needed for the message passing model. Regarding the notion of memory equality, note that the issue of private variables in threads would have to be introduced in full-fledged languages. As mentioned above, PGAS is left for future work; some sort of constraints based on the characteristics of the address space partitioning for places/locales would have to be introduced.

Semantic Rules

At the semantic level, SPIRE is thus a transition system transformer, mapping rules such as the ones in Figure 4.15 to parallel, synchronized transition rules in Figure 4.16. A transition π[i → ((m, S), c)] → π'[i → ((m', S'), c')] means that the i-th process, when executing S in a memory m, yields a new memory m' and a new control state π'[i → ((m', S'), c')], in which this process now will execute S'; additional children processes may have been created in c' compared to c. We posit that the "→" relation is transitive.

Rule 4.6 is a key rule to specify SPIRE transformer behavior, providing a bridge between the sequential and the SPIRE-extended parallel semantics: all processes can non-deterministically proceed along their sequential semantics "→", leading to valid evaluation steps along the parallel semantics. The interleaving between parallel processes in SPIRE(Stmt) is a consequence of (1) the non-deterministic choice of the value of i within domain(π) when selecting the transition to perform, and (2) the number of steps executed by the sequential semantics. Note that one might want to add even more non-determinism in our semantics: indeed, Rule 4.1 is atomic; loading the variables in E and performing the store operation to I are performed in one sequential step. To simplify this presentation, we do not provide the simple intermediate steps in the sequential evaluation semantics of Rule 4.1 that would have removed this artificial atomicity.

The remaining rules focus on parallel evaluation. In Rule 4.7, if Process n does not already exist, spawn adds to the state a new process n that inherits the parent memory m, in a fork-like manner; the set of processes spawned by n is initially equal to ∅. Otherwise, n keeps its memory mn and its set of spawned processes cn. In both cases, the value of pids in the memory of n is updated by the identifier I; n executes S and is added to the set of processes c spawned by i. Rule 4.8 implements a rendezvous: a

               κ → κ'
  ----------------------------------                       (4.6)
  π[i → (κ, c)] → π[i → (κ', c)]

  n = ζ(I)m
  ((mn, Sn), cn) = (n ∈ domain(π)) ? π(n) : ((m, S), ∅)
  m'n = mn[pids → mn(pids) ∪ {I}]
  -------------------------------------------------------  (4.7)
  π[i → ((m, spawn(I,S)), c)] →
      π[i → ((m, nop), c ∪ {n})][n → ((m'n, S), cn)]

  n ∉ domain(π) ∪ {i}
  -------------------------------------------------------  (4.8)
  π[i → ((m, barrier(S)), c)] →
      π[i → ((m, barrier_wait(n)), c)][n → ((m, S), ∅)]

  finished(π, {n})    π(n) = ((m', nop), c')
  -------------------------------------------------------  (4.9)
  π[i → ((m, barrier_wait(n)), c)] → π[i → ((m', nop), c)]

  n = ζ(I)m    n > 0
  -------------------------------------------------------  (4.10)
  π[i → ((m, wait(I)), c)] → π[i → ((m[I → n−1], nop), c)]

  n = ζ(I)m
  -------------------------------------------------------  (4.11)
  π[i → ((m, signal(I)), c)] → π[i → ((m[I → n+1], nop), c)]

  p' = ζ(P')m    p = ζ(P)m'
  -------------------------------------------------------  (4.12)
  π[p → ((m, send(P', I)), c)][p' → ((m', recv(P, I')), c')] →
      π[p → ((m, nop), c)][p' → ((m'[I' → m(I)], nop), c')]

Figure 4.16: SPIRE(Stmt) synchronized transition rules

new process n executes S, while process i is suspended as long as finished is not true; indeed, Rule 4.9 resumes execution of process i when all the child processes spawned by n have finished. In Rules 4.10 and 4.11, I is an event, that is a counting variable, used to control access to a resource or to perform a point-to-point synchronization, initialized via newEvent to a value equal to the number of processes that will be granted access to it. Its current value n is decremented every time a wait(I) statement is executed, and, when π(I) = n with n > 0, the resource can be used or the barrier can be crossed. In Rule 4.11, the current value n' of I is incremented; this is a non-blocking operation. In Rule 4.12, p and p' are two processes that communicate: p sends the datum I to p', while the latter consumes it in I'.

The semantics of a whole parallel program S is defined as the set of memories m such that ⊥[0 → ((⊥, S), ∅)] → π[0 → ((m, nop), c)], if S terminates.

4.3.3 Rewriting Rules

The SPIRE concepts not dealt with in the previous section are defined via their rewriting into the core language. This is the case for both the treatment of the execution attribute and the remaining coarse-grain synchronization constructs.

Execution

A parallel sequence of statements S1 and S2 is a pair of independent substatements executed simultaneously by spawned processes I1 and I2, respectively, i.e., is equivalent to:

spawn(I1, S1); spawn(I2, S2)

A parallel forloop (see Figure 4.2), with index I, lower expression low, upper expression up, step expression step and body S, is equivalent to:

I = low; loop(I <= up, spawn(I, S); I = I + step)

A parallel unstructured is rewritten as follows. All control nodes present in the transitive closure of the successor relation are rewritten in the same manner. Each control node C is characterized by a statement S, a predecessor list ps and a successor list ss. For each edge (c,C), where c is a predecessor of C in ps, an event e_{c,C} initialized at newEvent(0) is created, and similarly for ss. The whole unstructured construct is replaced by a sequential sequence of spawn(I, S_C), one for each C of the transitive closure of the successor relation, starting at the entry control node, where S_C is defined as follows:

barrier(spawn(one, wait(e_{ps[1],C})),
        ...,
        spawn(m, wait(e_{ps[m],C})));
S;
signal(e_{C,ss[1]}); ...; signal(e_{C,ss[m']})

where m and m' are the lengths of the ps and ss lists; L[j] is the j-th element of L.

The code corresponding to the statement z=x+y, taken from the parallel unstructured graph shown in Figure 4.3, is rewritten as:

spawn(three,
      barrier(spawn(one, wait(e13)),
              spawn(two, wait(e23)));
      z = x + y; // statement 3
      signal(e3exit)
)

Synchronization

A statement S with synchronization atomic(I) is rewritten as

wait(I); S; signal(I)

assuming that the assignment I = newEvent(1) is performed on the event identifier I at the very beginning of the main function. A wait on an event variable sets it to zero if it is currently equal to one, to prohibit other threads to enter the atomic section; the signal resets the event variable to one, to permit further access.

A statement S with a blocking synchronization single, i.e., equal to true, is equivalent, when it occurs within an enclosing innermost parallel forloop, to:

barrier(wait(I_S);
        if(first_S,
           S; first_S = false,
           nop);
        signal(I_S))

where first_S is a boolean variable that ensures that only one process among those spawned by the parallel loop will execute S; access to this variable is protected by the event I_S. Both first_S and I_S are respectively initialized, before loop entry, to true and newEvent(1).

The conditional if(E,S,S') can be rewritten using the core loop construct, as illustrated in Figure 4.17. The same rewriting can be used when the single synchronization is equal to false, corresponding to a non-blocking synchronization construct, except that no barrier is needed.

if(E,S,S')        stop = false;
                  loop(E ∧ ¬stop, S; stop = true);
                  loop(¬E ∧ ¬stop, S'; stop = true)

Figure 4.17: if statement rewriting using while loops

4.4 Validation: SPIRE Application to LLVM IR

Assessing the quality of a methodology that impacts the definition of a data structure as central for compilation frameworks as an intermediate representation is a difficult task. This section illustrates how it can be used on a different IR, namely LLVM, with minimal changes, thus providing support regarding the generality of our methodology. We show how simple it is to upgrade the LLVM IR to a "parallel LLVM IR" via the SPIRE transformation methodology.

LLVM [75] (Low-Level Virtual Machine) is an open source compilation framework that uses an intermediate representation in Static Single Assignment (SSA) [37] form. We chose the IR of LLVM to illustrate a second time our approach because LLVM is widely used in both academia and industry; another interesting feature of LLVM IR, compared to PIPS's, is that it sports a graph approach, while PIPS is mostly abstract syntax tree-based: each function is structured in LLVM as a control flow graph (CFG).

Figure 4.18 provides the definition of a significant subset of the sequential LLVM IR described in [75] (to keep notations simple in this thesis, we use the same Newgen language to write this specification):

• a function is a list of basic blocks, which are portions of code with one entry point and one exit point;

• a basic block has an entry label, a list of φ nodes and a list of instructions, and ends with a terminator instruction;

• φ nodes, which are the key elements of SSA, are used to merge the values coming from multiple basic blocks. A φ node is an assignment (represented here as a call expression) that takes as arguments an identifier and a list of pairs (value, label); it assigns to the identifier the value corresponding to the label of the block preceding the current one at run time;

• every basic block ends with a terminator, which is a control flow-altering instruction that specifies which block to execute after termination of the current one.

An example of the LLVM encoding of a simple for loop in three basic blocks is provided in Figure 4.19, where, in the first assignment, which uses a φ node, sum is 42 if it is the first time we enter the bb1 block (from entry), or the updated sum otherwise (from the branch bb).

Applying SPIRE to LLVM IR is, as illustrated above with PIPS, achieved in three steps, yielding the SPIREd parallel extension of the LLVM sequential IR provided in Figure 4.20:

• an execution attribute is added to function and block: a parallel basic block sees all its instructions launched in parallel, while all the blocks of a parallel function are seen as parallel tasks to be executed concurrently;


function             = blocks:block

block                = label:entity x phi_nodes:phi_node x
                       instructions:instruction x
                       terminator

phi_node             = call
instruction          = call

terminator           = conditional_branch +
                       unconditional_branch +
                       return

conditional_branch   = value:entity x label_true:entity x
                       label_false:entity
unconditional_branch = label:entity
return               = value:entity

Figure 4.18: Simplified Newgen definitions of the LLVM IR

sum = 42;
for(i = 0; i < 10; i++)
  sum = sum + 2;

entry:
  br label %bb1

bb:        ; preds = %bb1
  %0 = add nsw i32 %sum.0, 2
  %1 = add nsw i32 %i.0, 1
  br label %bb1

bb1:       ; preds = %bb, %entry
  %sum.0 = phi i32 [ 42, %entry ], [ %0, %bb ]
  %i.0 = phi i32 [ 0, %entry ], [ %1, %bb ]
  %2 = icmp sle i32 %i.0, 10
  br i1 %2, label %bb, label %bb2

Figure 4.19: A loop in C and its LLVM intermediate representation

• a synchronization attribute is added to instruction; therefore, an instruction can be annotated with spawn, barrier, single or atomic synchronization attributes. When one wants to consider a sequence of instructions as a whole, this sequence is first outlined in a "function", to be called instead; this new call instruction is then annotated with the proper synchronization attribute, such as spawn if the sequence must be considered as an asynchronous task; we provide an example in Figure 4.21. A similar technique is used for the other synchronization constructs barrier, single and atomic;

• as LLVM provides a set of intrinsic functions [75], SPIRE functions newEvent, freeEvent, signal and wait, for handling point-to-point synchronization, and send and recv, for handling data distribution, are added to this set.

Note that the use of SPIRE on the LLVM IR is not able to express


function'    = function x execution
block'       = block x execution
instruction' = instruction x synchronization

Intrinsic Functions:
  send, recv, signal, wait, newEvent, freeEvent

Figure 4.20: SPIRE(LLVM IR)

main() {
  int A, B, C;
  A = foo();
  B = bar(A);
  C = baz();
}

call_AB(int *A, int *B) {
  *A = foo();
  *B = bar(*A);
}

main() {
  int A, B, C;
  spawn(zero, call_AB(&A, &B));
  C = baz();
}

Figure 4.21: An example of a spawned outlined sequence

parallel loops as easily as was the case on PIPS IR. Indeed, the notion of a loop does not always exist in the definition of intermediate representations based on control flow graphs, including LLVM; it is an attribute of some of its nodes, which has to be added later on by a loop-detection program analysis phase. Of course, such an analysis could also be applied on the SPIRE-derived IR to thus recover this information. Once one knows that a particular loop is parallel, this can be encoded within SPIRE using the same technique as presented in Section 4.3.3.

More generally, even though SPIRE uses traditional parallel paradigms for code generation purposes, SPIRE-derived IRs are able to deal with more specific parallel constructs such as DOACROSS loops [56] or HELIX-like [29] approaches. HELIX is a technique to parallelize a loop in the presence of loop-carried dependences, by executing sequential segments while exploiting TLP between them and using prefetching of signal synchronization. Basically, a compiler would parse a given sequential program into sequential IR elements. Optimization compilation phases specific to particular parallel code generation paradigms, such as those above, will translate, whenever possible (specific data and control-flow analyses will be needed here), these sequential IR constructs into parallel loops, with the corresponding synchronization primitives, as need be. Code generation will then recognize such IR patterns and generate specific parallel instructions such as DOACROSS.


4.5 Related Work: Parallel Intermediate Representations

In this section, we review several different possible representations of parallel programs, both at the high, syntactic, and mid, graph-based, levels. We provide synthetic descriptions of the key existing IRs addressing issues similar to our thesis's. Bird's eye view comparisons with SPIRE are also given here.

Syntactic approaches to parallelism expression use abstract syntax tree nodes, while adding specific built-in functions for parallelism. For instance, the intermediate representation of the implementation of OpenMP in GCC (GOMP) [84] extends its three-address representation, GIMPLE [77]. The OpenMP parallel directives are replaced by specific built-ins in low- and high-level GIMPLE, and additional nodes in high-level GIMPLE, such as the sync_fetch_and_add built-in function for an atomic memory access addition. Similarly, Sarkar and Zhao introduce the high-level parallel intermediate representation HPIR [111], that decomposes Habanero-Java programs into region syntax trees, while maintaining additional data structures on the side: region control-flow graphs and region dictionaries, that contain information about the use and the definition of variables in region nodes. New abstract syntax tree nodes are introduced: AsyncRegionEntry and AsyncRegionExit are used to delimit tasks, while FinishRegionEntry and FinishRegionExit can be used in parallel sections. SPIRE borrows some of the ideas used in GOMP or HPIR, but frames them in more structured settings, while trying to be more language-neutral. In particular, we try to minimize the number of additional built-in functions, which have the drawback of hiding the abstract high-level structure of parallelism and affecting compiler optimization passes. Moreover, we focus on extending existing AST nodes rather than adding new ones (such as in HPIR), in order to not fatten the IR and avoid redundant analyses and transformations on the same basic constructs. Applying the SPIRE approach to systems such as GCC would have provided a minimal set of extensions that could have also been used for other implementations of parallel languages that rely on GCC as a backend, such as Cilk.

PLASMA is a programming framework for heterogeneous SIMD systems, with an IR [89] that abstracts data parallelism and vector instructions. It provides specific operators, such as add on vectors, and special instructions, such as reduce and par. While PLASMA abstracts SIMD implementation and compilation concepts for SIMD accelerators, SPIRE is more architecture-independent and also covers control parallelism.

InsPIRe is the parallel intermediate representation at the core of the source-to-source Insieme compiler [57] for C, C++, OpenMP, MPI and OpenCL programs. Parallel constructs are encoded using built-ins. SPIRE intends to also cover source-to-source optimization. We believe it could have been applied to Insieme sequential components, parallel constructs being defined as extensions of the sequential abstract syntax tree nodes of InsPIRe, instead of using numerous built-ins.

Turning now to mid-level intermediate representations, many systems rely on graph structures for representing sequential code and extend them for parallelism. The Hierarchical Task Graph [48] represents the program control flow. The hierarchy exposes the loop nesting structure: at each loop nesting level, the loop body is hierarchically represented as a single node that embeds a subgraph that has control and data dependence information associated with it. SPIRE is able to represent both structured and unstructured control-flow dependence, thus enabling recursively-defined optimization techniques to be applied easily. The hierarchical nature of underlying sequential IRs can be leveraged, via SPIRE, to their parallel extensions; this feature is used in the PIPS case addressed below.

A stream graph [31] is a dataflow representation introduced specifically for streaming languages. Nodes represent data reorganization and processing operations between streams, and edges, communications between nodes. The number of data samples defined and used by each node is supposed to be known statically. Each time a node is fired, it consumes a fixed number of elements of its inputs and produces a fixed number of elements on its outputs. SPIRE provides support for both data and control dependence information; streaming can be handled in SPIRE using its point-to-point synchronization primitives.

The parallel program graph (PPDG) [98] extends the program dependence graph [44], where vertices represent blocks of statements and edges, essential control or data dependences; mgoto control edges are added to represent task creation occurrences, and synchronization edges, to impose ordering on tasks. Kimble IR [24] uses an intermediate representation designed along the same lines, i.e., as hierarchical direct acyclic graphs (DAG) on top of GCC IR, GIMPLE. Parallelism is expressed there using new types of nodes: region, which is a subgraph of clusters, and cluster, a sequence of statements to be executed by one thread. Like PPDG and Kimble IR, SPIRE adopts an extension approach to "parallelize" existing sequential intermediate representations; this chapter showed that this can be defined as a general mechanism for parallel IR definitions and provides a formal specification of this concept.

LLVM IR [75] represents each function as a control flow graph. To encode parallel constructs, LLVM introduces the notion of metadata, such as llvm.loop.parallel for implementing parallel loops. A metadata is a string used as an annotation on the LLVM IR nodes. LLVM IR lacks support for other parallel constructs such as starting parallel threads, synchronizing them, etc. We presented in Section 4.4 our proposal for a parallel IR for LLVM, via the application of SPIRE to LLVM.


4.6 Conclusion

SPIRE is a new and general 3-step extension methodology for mapping any intermediate representation (IR) used in compilation platforms for representing sequential programming constructs to a parallel IR; one can leverage it for the source-to-source and high- to mid-level optimization of control-parallel languages and constructs.

The extension of an existing IR introduces (1) a parallel execution attribute for each group of statements, (2) a high-level synchronization attribute on each statement node and an API for low-level synchronization events, and (3) two built-ins for implementing communications in message passing memory systems. The formal semantics of SPIRE transformational definitions is specified using a two-tiered approach: a small-step operational semantics for its base parallel concepts and a rewriting mechanism for high-level constructs.

The SPIRE methodology is presented via a use case, the intermediate representation of PIPS, a powerful source-to-source compilation infrastructure for Fortran and C. We illustrate the generality of our approach by showing how SPIRE can be used to represent the constructs of the current parallel languages Cilk, Chapel, X10, Habanero-Java, OpenMP, OpenCL and MPI. To validate SPIRE, we show the application of SPIRE on another IR, namely the one of the widely-used LLVM compilation infrastructure.

The extension of the sequential IR of PIPS using SPIRE has been implemented in the PIPS middle-end and used for parallel code transformations and generation (see Chapter 7). We added the execution set (parallel and sequential attributes), the synchronization set (none, spawn, barrier, atomic and single attributes), events and data distribution calls. Chapter 8 provides performance data gathered using our implementation of SPIRE on PIPS IR, to illustrate the ability of the resulting parallel IR to efficiently express the parallelism present in scientific applications. The next chapter addresses the crucial step in any parallelization process, which is scheduling.

An earlier version of the work presented in this chapter was presented at CPC [67] and at the HiPEAC Computing Systems Week.

Chapter 5

BDSC: A Memory-Constrained, Number of Processor-Bounded Extension of DSC

Life is too unpredictable to live by a schedule. Anonymous

In the previous chapter, we have introduced an extension methodology for sequential intermediate representations to handle parallelism. In this chapter, as we are interested in this thesis in task parallelism, we focus on the next step: finding or extracting this parallelism from a sequential code, to be eventually expressed using a SPIRE-derived parallel intermediate representation. To reach that goal, we introduce BDSC, which extends Yang and Gerasoulis's Dominant Sequence Clustering (DSC) algorithm. BDSC is a new, efficient, automatic scheduling algorithm for parallel programs in the presence of resource constraints on the number of processors and their local memory size; it is based on the static estimation of both execution time and communication cost. Thanks to the handling of these two resource parameters in a single framework, BDSC can address both shared and distributed parallel memory architectures. As a whole, BDSC provides a trade-off between concurrency gain and communication overhead between processors, under two resource constraints: finite number of processors and amount of memory.

51 Introduction

One key problem when attempting to parallelize sequential programs is to find solutions to graph partitioning and scheduling problems, where vertices represent computational tasks and edges, data exchanges. Each vertex is labeled with an estimation of the time taken by the computation it performs; similarly, each edge is assigned a cost that reflects the amount of data that need to be exchanged between its adjacent vertices. Efficiency is strongly dependent here on the accuracy of the cost information encoded in the graph to be scheduled. Gathering such information is a difficult process


in general, in particular in our case where tasks are automatically generated from program code.

We thus introduce a new non-preemptive static scheduling heuristic that strives to give as small as possible schedule lengths, i.e., parallel execution time, in order to exploit the task-level parallelism possibly present in sequential programs, targeting homogeneous multiprocessors or multicores, while enforcing architecture-dependent constraints: the number of processors, the computational cost and memory use of each task and the communication cost for each task exchange, to provide hopefully significant speedups on shared and distributed computer architectures. Our technique, called BDSC, is based on an existing best-of-breed static scheduling heuristic, namely Yang and Gerasoulis's DSC (Dominant Sequence Clustering) list-scheduling algorithm [108] [47], that we equip with new heuristics that handle resource constraints. One key advantage of DSC over other scheduling policies (see Section 54), besides its already good performance when the number of processors is unlimited, is that it has been proven optimal for fork/join graphs; this is a serious asset given our focus on the program parallelization process, since task graphs representing parallel programs often use this particular graph pattern. Even though this property may be lost when constraints are taken into account, our experiments on scientific benchmarks suggest that our extension still provides good performance (see Chapter 8).

This chapter is organized as follows. Section 52 presents the original DSC algorithm that we intend to extend. We detail our heuristic extension, BDSC, in Section 53. Section 54 compares the main existing scheduling algorithms with BDSC. Finally, Section 55 concludes the chapter.

52 The DSC List-Scheduling Algorithm

In this section we present the list-scheduling heuristic called DSC [108] that we extend and enhance to be useful for automatic task parallelization under resource constraints.

521 The DSC Algorithm

DSC (Dominant Sequence Clustering) is a task list-scheduling heuristic for an unbounded number of processors. The objective is to minimize the top level of each task (see Section 252 for a review of the main notions of list scheduling). A DS (Dominant Sequence) is a path that has the longest length in a partially scheduled DAG; a graph critical path is thus a DS for the totally scheduled DAG. The DSC heuristic computes a Dominant Sequence (DS) after each vertex is processed, using tlevel(τ, G) + blevel(τ, G) as priority(τ). Since priorities are updated after each iteration, DSC computes dynamically the critical path based on both top level and bottom level


information. A ready vertex τ, i.e., a vertex all of whose predecessors have already been scheduled, on one of the current DSs, i.e., with the highest priority, is clustered with a predecessor τp when this reduces the top level of τ by zeroing, i.e., setting to zero, the cost of communications on the incident edge (τp, τ). As said in Section 252, task_time(τ) and edge_cost(e) are assumed to be numerical constants, although we show how we lift this restriction in Section 63.
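As a reminder of the two notions this priority relies on, the following C sketch (illustrative only, not part of DSC or PIPS; the array-based DAG encoding and the sample data, taken from Figure 51, are assumptions) computes top levels and bottom levels by a forward and a backward sweep over a topological order:

#include <stdio.h>

#define NTASKS 4

/* Illustrative sketch only: the DAG of Figure 51, with task times and
   edge costs stored in plain arrays (0 meaning "no edge"); this is not
   the PIPS representation of scheduled DAGs. */
static double task_time[NTASKS] = {1, 3, 2, 2};          /* tau1, tau2, tau3, tau4 */
static double edge_cost[NTASKS][NTASKS];
static int topo_order[NTASKS] = {3, 0, 2, 1};            /* tau4, tau1, tau3, tau2 */

int main(void) {
  double tlevel[NTASKS] = {0}, blevel[NTASKS] = {0};

  edge_cost[3][1] = 2;   /* (tau4, tau2) */
  edge_cost[3][2] = 1;   /* (tau4, tau3) */
  edge_cost[0][1] = 1;   /* (tau1, tau2) */

  /* tlevel(tau): longest path reaching tau, excluding task_time(tau): forward sweep */
  for (int k = 0; k < NTASKS; k++) {
    int v = topo_order[k];
    for (int u = 0; u < NTASKS; u++)
      if (edge_cost[u][v] > 0) {
        double t = tlevel[u] + task_time[u] + edge_cost[u][v];
        if (t > tlevel[v]) tlevel[v] = t;
      }
  }
  /* blevel(tau): longest path leaving tau, including task_time(tau): backward sweep */
  for (int k = NTASKS - 1; k >= 0; k--) {
    int v = topo_order[k];
    blevel[v] = task_time[v];
    for (int w = 0; w < NTASKS; w++)
      if (edge_cost[v][w] > 0) {
        double b = task_time[v] + edge_cost[v][w] + blevel[w];
        if (b > blevel[v]) blevel[v] = b;
      }
  }
  for (int v = 0; v < NTASKS; v++)
    printf("tau%d: tlevel=%g blevel=%g priority=%g\n",
           v + 1, tlevel[v], blevel[v], tlevel[v] + blevel[v]);
  return 0;
}

On this example, the program prints the tlevel, blevel and priority values of the first columns of the table in Figure 51 (e.g., priority(τ4) = 0 + 7 = 7).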

The DSC instance of the general list-scheduling Algorithm 3 presented in Section 252 is provided in Algorithm 4, where select_cluster is replaced by the code in Algorithm 5 (new_cluster extends clusters with a new empty cluster; its cluster_time is set to 0). The table in Figure 51 represents the result of scheduling the DAG in the same figure using the DSC algorithm.

ALGORITHM 4: DSC scheduling of Graph G on P processors

procedure DSC(G, P)
   clusters = ∅
   foreach τi ∈ vertices(G)
      priority(τi) = tlevel(τi, G) + blevel(τi, G)
   UT = vertices(G)                       // unscheduled tasks
   while UT ≠ ∅
      τ = select_task_with_highest_priority(UT)
      κ = select_cluster(τ, G, P, clusters)
      allocate_task_to_cluster(τ, κ, G)
      update_graph(G)
      update_priority_values(G)
      UT = UT - {τ}
end

ALGORITHM 5: DSC cluster selection for Task τ for Graph G on P processors

function select_cluster(τ, G, P, clusters)
   (min_τ, min_tlevel) = tlevel_decrease(τ, G)
   return (cluster(min_τ) ≠ cluster_undefined) ?
          cluster(min_τ) : new_cluster(clusters)
end

To decide which predecessor τp to select to be clustered with τ, DSC applies the minimization procedure tlevel_decrease, presented in Algorithm 6, which returns the predecessor that leads to the highest reduction of top level for τ if clustered together, and the resulting top level; if no zeroing is accepted, the vertex τ is kept in a new single-vertex cluster. More precisely, the minimization procedure tlevel_decrease for a task τ in Algorithm 6 tries to find the cluster cluster(min_τ) of one of its predecessors


[Figure (left): a four-task DAG with task times τ1:1, τ2:3, τ3:2, τ4:2 and edges (τ4, τ2) of cost 2, (τ4, τ3) of cost 1 and (τ1, τ2) of cost 1.]

step   task   tlevel   blevel   DS   scheduled tlevel
                                      κ0     κ1     κ2
 1     τ4      0        7        7    0*
 2     τ3      3        2        5    2      3*
 3     τ1      0        5        5                  0*
 4     τ2      4        3        7    2*            4

Figure 51: A Directed Acyclic Graph (left) and its scheduling (right); starred top levels (*) correspond to the selected clusters

τp that reduces the top level of τ as much as possible, by zeroing the cost of the edge (min_τ, τ). All clusters start at the same time, and each cluster is characterized by its running time, cluster_time(κ), which is the cumulated time of all tasks scheduled into κ; idle slots within clusters may exist and are also taken into account in this accumulation process. The condition cluster(τp) ≠ cluster_undefined is tested on predecessors of τ in order to make it possible to apply this procedure to ready and unready vertices; an unready vertex has at least one unscheduled predecessor.

ALGORITHM 6: Minimization DSC procedure for Task τ in Graph G

function tlevel_decrease(τ, G)
   min_tlevel = tlevel(τ, G)
   min_τ = τ
   foreach τp ∈ predecessors(τ, G) where cluster(τp) ≠ cluster_undefined
      start_time = cluster_time(cluster(τp))
      foreach τ′p ∈ predecessors(τ, G) where cluster(τ′p) ≠ cluster_undefined
         if (τp ≠ τ′p) then
            level = tlevel(τ′p, G) + task_time(τ′p) + edge_cost(τ′p, τ)
            start_time = max(level, start_time)
      if (min_tlevel > start_time) then
         min_tlevel = start_time
         min_τ = τp
   return (min_τ, min_tlevel)
end


522 Dominant Sequence Length Reduction Warranty

Dominant Sequence Length Reduction Warranty (DSRW) is a greedy heuristic within DSC that aims to further reduce the scheduling length. A vertex on the DS path with the highest priority can be ready or not ready. With the DSRW heuristic, DSC schedules the ready vertices first; but, if such a ready vertex τr is not on the DS path, DSRW verifies, using the procedure in Algorithm 7, that the corresponding zeroing does not affect later the reduction of the top levels of the DS vertices τu that are partially ready, i.e., such that there exists at least one unscheduled predecessor of τu. To do this, we check if the "partial top level" of τu, which does not take into account unexamined (unscheduled) predecessors and is computed using tlevel_decrease, is reducible once τr is scheduled.

ALGORITHM 7: DSRW optimization for Task τu when scheduling Task τr for Graph G

function DSRW(τr, τu, clusters, G)
   (min_τ, min_tlevel) = tlevel_decrease(τr, G)
   // before scheduling τr
   (τb, ptlevel_before) = tlevel_decrease(τu, G)
   // scheduling τr
   allocate_task_to_cluster(τr, cluster(min_τ), G)
   saved_edge_cost = edge_cost(min_τ, τr)
   edge_cost(min_τ, τr) = 0
   // after scheduling τr
   (τa, ptlevel_after) = tlevel_decrease(τu, G)
   if (ptlevel_after > ptlevel_before) then
      // (min_τ, τr) zeroing not accepted
      edge_cost(min_τ, τr) = saved_edge_cost
      return false
   return true
end

The table in Figure 51 illustrates an example where it is useful to apply the DSRW optimization. There, the DS column provides, for the task scheduled at each step, its priority, i.e., the length of its dominant sequence, while the last column represents, for each possible zeroing, the corresponding task top level; starred top levels (*) correspond to the selected clusters. Task τ4 is mapped to Cluster κ0 in the first step of DSC. Then τ3 is selected because it is the ready task with the highest priority. The mapping of τ3 to Cluster κ0 would reduce its top level from 3 to 2. But the zeroing


Without DSRW: κ0 = {τ4, τ3}, κ1 = {τ1, τ2}

With DSRW: κ0 = {τ4, τ2}, κ1 = {τ3}, κ2 = {τ1}

Figure 52: Result of DSC on the graph in Figure 51 without (left) and with (right) DSRW

of (τ4, τ3) affects the top level of τ2, τ2 being the unready task with the highest priority. Since the partial top level of τ2 is 2 with the zeroing of (τ4, τ2), but 4 after the zeroing of (τ4, τ3), DSRW will fail, and DSC allocates τ3 to a new cluster, κ1. Then τ1 is allocated to a new cluster, κ2, since it has no predecessors. Thus the zeroing of (τ4, τ2) is kept, thanks to the DSRW optimization: the total scheduling length is 5 (with DSRW) instead of 7 (without DSRW) (Figure 52).

53 BDSC: A Resource-Constrained Extension of DSC

This section details the key ideas at the core of our new scheduling process, BDSC, which extends DSC with a number of important features, namely (1) by verifying predefined memory constraints, (2) by targeting a bounded number of processors and (3) by trying to make this number as small as possible.

531 DSC Weaknesses

A good scheduling solution is a solution that is built carefully, by having knowledge about previously scheduled tasks and tasks to execute in the future. Yet, as stated in [72], "an algorithm that only considers bottom level or only top level cannot guarantee optimal solutions". Even though tests carried out on a variety of scheduling algorithms show that the best competitor among them is the DSC algorithm [101], and DSC is a policy that uses the critical path for computing dynamic priorities based on both the bottom level and the top level of each vertex, it has some limits in practice.

The key weakness of DSC for our purpose is that the number of processors cannot be predefined: DSC yields blind clusterings, disregarding resource issues. Therefore, in practice, a thresholding mechanism to limit the number of generated clusters should be introduced. When allocating new clusters, one should verify that the number of clusters does not exceed a predefined threshold P (Section 533). Also, zeroings should handle memory constraints, by verifying that the resulting clustering does not lead to cluster data sizes that exceed a predefined cluster memory threshold M (Section 533).


Finally, DSC may generate a lot of idle slots in the created clusters: it adds a new cluster when no zeroing is accepted, without verifying the possible existence of gaps in existing clusters. We handle this case in Section 534, adding an efficient idle cluster slot allocation routine in the task-to-cluster mapping process.

532 Resource Modeling

Since our extension deals with computer resources, we assume that each vertex in a task DAG is labeled with an additional piece of information, task_data(τ), which is an over-approximation of the memory space used by Task τ; its size is assumed to be always strictly less than M. A similar cluster_data function applies to clusters, where it represents the collective data space used by the tasks scheduled within it. Since BDSC, as DSC, needs execution times and communication costs to be numerical constants, we discuss in Section 63 how this information is computed.

Our improvement to the DSC heuristic intends to reach a trade-off between the gained parallelism and the communication overhead between processors, under two resource constraints: finite number of processors and amount of memory. We track these resources in our implementation of allocate_task_to_cluster, given in Algorithm 8; note that the aggregation function regions_union is defined in Section 263.

ALGORITHM 8: Task allocation of Task τ in Graph G to Cluster κ with resource management

procedure allocate_task_to_cluster(τ, κ, G)
   cluster(τ) = κ
   cluster_time(κ) = max(cluster_time(κ), tlevel(τ, G)) + task_time(τ)
   cluster_data(κ) = regions_union(cluster_data(κ), task_data(τ))
end

Efficiently allocating tasks on the target architecture requires reducing the communication overhead and transfer cost for both shared and distributed memory architectures. If zeroing operations, which reduce the start time of each task and nullify the corresponding edge cost, are obviously meaningful for distributed memory systems, they are also worthwhile on shared memory architectures. Merging two tasks in the same cluster keeps the data in the local memory, and possibly the cache, of each thread and avoids their copying over the shared memory bus. Therefore, transmission costs are decreased and bus contention is reduced.


533 Resource Constraint Warranty

Resource usage affects the performance of running programs. Thus, parallelization algorithms should try to limit the size of the memory used by tasks. BDSC introduces a new heuristic to control the amount of memory used by a cluster, via the user-defined memory upper bound parameter M. The limitation of the memory size of tasks is important when (1) executing large applications that operate on large amounts of data, (2) M represents the processor local1 (or cache) memory size, since, if the memory limitation is not respected, transfers between the global and local memories may occur during execution and may result in performance degradation, and (3) targeting embedded systems architectures. For each task τ, BDSC uses an over-approximation of the amount of data that τ allocates to perform read and write operations; it is used to check that the memory constraint of Cluster κ is satisfied whenever τ is included in κ. Algorithm 9 implements this memory constraint warranty (MCW); regions_union and data_size are functions that respectively merge data and yield the size (in bytes) of data (see Section 63 in the next chapter).

ALGORITHM 9: Resource constraint warranties on memory size M and processor number P

function MCW(τ, κ, M)
   merged = regions_union(cluster_data(κ), task_data(τ))
   return data_size(merged) ≤ M
end

function PCW(clusters, P)
   return |clusters| < P
end

The previous line of reasoning is well adapted to a distributed memory architecture. When dealing with a multicore equipped with a purely shared memory, such a per-cluster memory constraint is less meaningful. We can nonetheless keep the MCW constraint check within the BDSC algorithm, even in this case, if we set M to the size of the global shared memory. A positive by-product of this design choice is that BDSC is able, in the shared memory case, to reject computations that need more memory space than available, even within a single cluster.

Another scarce resource is the number of processors. In the original policy of DSC, when no zeroing for τ is accepted, i.e., no zeroing that would decrease its start time, τ is allocated to a new cluster. In order to limit the number of created clusters, we propose to introduce a user-defined cluster threshold P.

1 We handle one level of the memory hierarchy of the processors.


This processor constraint warranty, PCW, is defined in Algorithm 9.

534 Efficient Task-to-Cluster Mapping

In the original policy of DSC, when no zeroings are accepted, because none would decrease the start time of Vertex τ or DSRW failed, τ is allocated to a new cluster. This cluster creation is not necessary when idle slots are present at the end of other clusters; thus we suggest selecting instead one of these idle slots, if this cluster can decrease the start time of τ, without affecting the scheduling of the successors of the vertices already scheduled in this cluster. To ensure this, these successors must have already been scheduled, or they must be a subset of the successors of τ. Therefore, in order to use clusters efficiently and not introduce additional clusters unnecessarily, we propose to schedule τ to the cluster that verifies this optimizing constraint, if no zeroing is accepted.

This extension of DSC we introduce in BDSC thus amounts to replacing each assignment of the cluster of τ to a new cluster by a call to end_idle_clusters. The end_idle_clusters function, given in Algorithm 10, returns, among the idle clusters, the ones that finished the most recently before τ's top level, or the empty set if none is found. This assumes, of course, that τ's dependencies are compatible with this choice.

ALGORITHM 10: Efficiently mapping Task τ in Graph G to clusters, if possible

function end_idle_clusters(τ, G, clusters)
   idle_clusters = clusters
   foreach κ ∈ clusters
      if (cluster_time(κ) ≤ tlevel(τ, G)) then
         end_idle_p = TRUE
         foreach τκ ∈ vertices(G) where cluster(τκ) = κ
            foreach τs ∈ successors(τκ, G)
               end_idle_p ∧= cluster(τs) ≠ cluster_undefined ∨ τs ∈ successors(τ, G)
         if (¬end_idle_p) then
            idle_clusters = idle_clusters - {κ}
   last_clusters = argmax_{κ ∈ idle_clusters} cluster_time(κ)
   return (idle_clusters ≠ ∅) ? last_clusters : ∅
end

To illustrate the importance of this heuristic, suppose we have the DAG presented in Figure 53. The tables in Figure 54 exhibit the difference in scheduling obtained by DSC and our extension on this graph. We observe here that the number of clusters generated using DSC is 3, with 5 idle slots, while BDSC needs only 2 clusters, with 2 idle slots. Moreover, BDSC achieves a better load balancing than DSC, since it reduces the variance of


[Figure (left): a six-task DAG with task times τ1:1, τ2:5, τ3:2, τ4:3, τ5:2, τ6:2 and edge costs (τ1,τ2) = (τ1,τ3) = (τ2,τ4) = (τ2,τ5) = (τ4,τ6) = 1, (τ3,τ6) = 9 and (τ5,τ6) = 2.]

step   task   tlevel   blevel   DS   scheduled tlevel
                                      κ0     κ1
 1     τ1      0        15       15   0*
 2     τ3      2        13       15   1*
 3     τ2      2        12       14   3      2*
 4     τ4      8        6        14          7*
 5     τ5      8        6        14   8*     10
 6     τ6      13       2        15   11*    12

Figure 53: A DAG amenable to cluster minimization (left) and its BDSC step-by-step scheduling (right)

the clusters' execution loads, defined, for a given cluster, as the sum of the costs of all its tasks: Xκ = Σ_{τ∈κ} task_time(τ). Indeed, we compute first the average of costs

E[Xκ] = (Σ_{κ=0}^{|clusters|-1} Xκ) / |clusters|

and then the variance

V(Xκ) = (Σ_{κ=0}^{|clusters|-1} (Xκ - E[Xκ])²) / |clusters|

For BDSC, E(Xκ) = 15/2 = 7.5 and V(Xκ) = ((7 - 7.5)² + (8 - 7.5)²)/2 = 0.25. For DSC, E(Xκ) = 15/3 = 5 and V(Xκ) = ((5 - 5)² + (8 - 5)² + (2 - 5)²)/3 = 6.

Finally, with our efficient task-to-cluster mapping, in addition to decreasing the number of generated clusters, we also gain in the total execution time, since our approach reduces communication costs by allocating tasks to the same cluster, e.g., by eliminating edge_cost(τ5, τ6) = 2 in Figure 54; thus, as shown in this figure (the last column accumulates the execution time for the clusters), the execution time with DSC is 14, but is 13 with BDSC.
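As a quick check of these numbers, the small C program below (illustrative only; the cluster loads are read off Figure 54) computes the average and variance of the cluster execution loads:

#include <stdio.h>

/* Illustrative only: computes E[X] and V(X) over cluster execution loads.
   The two load vectors below are read off Figure 54 (BDSC: κ0 = 7, κ1 = 8;
   DSC: κ0 = 5, κ1 = 8, κ2 = 2). */
static void load_stats(const char *name, const double *x, int n) {
  double mean = 0.0, var = 0.0;
  for (int i = 0; i < n; i++) mean += x[i];
  mean /= n;
  for (int i = 0; i < n; i++) var += (x[i] - mean) * (x[i] - mean);
  var /= n;
  printf("%s: E = %g, V = %g\n", name, mean, var);
}

int main(void) {
  double bdsc_loads[] = {7, 8};
  double dsc_loads[]  = {5, 8, 2};
  load_stats("BDSC", bdsc_loads, 2);   /* E = 7.5, V = 0.25 */
  load_stats("DSC",  dsc_loads, 3);    /* E = 5,   V = 6    */
  return 0;
}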

To get a feeling for the way BDSC operates, we detail the steps taken to get this better scheduling in the table of Figure 53. BDSC is equivalent to DSC until Step 5, where κ0 is chosen by our cluster mapping heuristic: since successors(τ3, G) ⊂ successors(τ5, G), no new cluster needs to be allocated.

535 The BDSC Algorithm

BDSC extends the list-scheduling template DSC provided in Algorithm 4 by taking into account the various extensions discussed above. In a nutshell, the BDSC select_cluster function, which decides in which cluster κ a task τ should be allocated, tries successively the four following strategies:


DSC:                                BDSC:
κ0    κ1    κ2    cumulative time   κ0    κ1    cumulative time
τ1                 1                τ1           1
τ3    τ2           7                τ3    τ2     7
      τ4    τ5     10               τ5    τ4     10
τ6                 14               τ6           13

Figure 54: DSC (left) and BDSC (right) cluster allocation

1. choose κ among the clusters of τ's predecessors that decrease the start time of τ, under MCW and DSRW constraints;

2. or assign κ using our efficient task-to-cluster mapping strategy, under the additional constraint MCW;

3. or create a new cluster, if the PCW constraint is satisfied;

4. otherwise, choose the cluster among all clusters in MCW_clusters_min under the constraint MCW. Note that, in this worst-case scenario, the top level of τ can be increased, leading to a decrease in performance, since the length of the graph critical path is also increased.

BDSC is described in Algorithms 11 and 12; the entry graph Gu is the whole unscheduled program DAG, P the maximum number of processors and M the maximum amount of memory available in a cluster. UT denotes the set of unexamined tasks at each BDSC iteration, RL the set of ready tasks and URL the set of unready ones. We schedule the vertices of G according to the four rules above, in a descending order of the vertices' priorities. Each time a task τr has been scheduled, all the newly readied vertices are added to the set RL (ready list) by the update_ready_set function.

BDSC returns a scheduled graph, i.e., an updated graph where some zeroings may have been performed and for which the clusters function yields the clusters needed by the given schedule; this schedule includes, beside the new graph, the cluster allocation function on tasks, cluster. If not enough memory is available, BDSC returns the original graph and signals its failure by setting clusters to the empty set. In Chapter 6, we introduce an alternative solution in case of failure, using an iterative hierarchical scheduling approach.

We suggest applying here an additional heuristic, in which, if multiple vertices have the same priority, the vertex with the greatest bottom level is chosen for τr (likewise for τu) to be scheduled first, to favor the successors that have the longest path from τr to the exit vertex. Also, an optimization could be performed when calling update_priority_values(G): indeed, after each cluster allocation, only the top levels of the successors of τr need to be recomputed, instead of those of the whole graph.


ALGORITHM 11: BDSC scheduling Graph Gu under processor and memory bounds P and M

function BDSC(Gu, P, M)
   if (P ≤ 0) then
      return error('Not enough processors', Gu)
   G = graph_copy(Gu)
   foreach τi ∈ vertices(G)
      priority(τi) = tlevel(τi, G) + blevel(τi, G)
   UT = vertices(G)
   RL = {τ ∈ UT / predecessors(τ, G) = ∅}
   URL = UT - RL
   clusters = ∅
   while UT ≠ ∅
      τr = select_task_with_highest_priority(RL)
      (τm, min_tlevel) = tlevel_decrease(τr, G)
      if (τm ≠ τr and MCW(τr, cluster(τm), M)) then
         τu = select_task_with_highest_priority(URL)
         if (priority(τr) < priority(τu)) then
            if (¬DSRW(τr, τu, clusters, G)) then
               if (PCW(clusters, P)) then
                  κ = new_cluster(clusters)
                  allocate_task_to_cluster(τr, κ, G)
               else
                  if (¬find_cluster(τr, G, clusters, P, M)) then
                     return error('Not enough memory', Gu)
         else
            allocate_task_to_cluster(τr, cluster(τm), G)
            edge_cost(τm, τr) = 0
      else if (¬find_cluster(τr, G, clusters, P, M)) then
         return error('Not enough memory', Gu)
      update_priority_values(G)
      UT = UT - {τr}
      RL = update_ready_set(RL, τr, G)
      URL = UT - RL
   clusters(G) = clusters
   return G
end

Theorem 1. The time complexity of Algorithm 11 (BDSC) is O(n³), n being the number of vertices in Graph G.

Proof. In the "while" loop of BDSC, the most expensive computation among the different if or else branches is the function end_idle_clusters, used in find_cluster, which locates an existing cluster suitable to allocate Task τ to; such reuse intends to optimize the use of the limited number of processors.


ALGORITHM 12: Attempt to allocate κ in clusters for Task τ in Graph G under processor and memory bounds P and M, returning true if successful

function find_cluster(τ, G, clusters, P, M)
   MCW_idle_clusters =
      {κ ∈ end_idle_clusters(τ, G, clusters, P) / MCW(τ, κ, M)}
   if (MCW_idle_clusters ≠ ∅) then
      κ = choose_any(MCW_idle_clusters)
      allocate_task_to_cluster(τ, κ, G)
   else if (PCW(clusters, P)) then
      allocate_task_to_cluster(τ, new_cluster(clusters), G)
   else
      MCW_clusters = {κ ∈ clusters / MCW(τ, κ, M)}
      MCW_clusters_min = argmin_{κ ∈ MCW_clusters} cluster_time(κ)
      if (MCW_clusters_min ≠ ∅) then
         κ = choose_any(MCW_clusters_min)
         allocate_task_to_cluster(τ, κ, G)
      else
         return false
   return true
end

function error(m, G)
   clusters(G) = ∅
   return G
end

Its complexity is proportional to

Σ_{τ ∈ vertices(G)} |successors(τ, G)|

which is of worst-case complexity O(n²). Thus the total cost for the n iterations of the "while" loop is O(n³).

Even though BDSC's worst-case complexity is larger than DSC's, which is O(n² log(n)) [108], it remains polynomial, with a small exponent. Our experiments (see Chapter 8) showed this theoretical slowdown not to be a significant factor in practice.

54 Related Work Scheduling Algorithms

In this section, we survey the main existing scheduling algorithms and compare them to BDSC. Given the breadth of the literature on this subject, we limit this presentation to heuristics that implement static scheduling of


task DAGs. Static scheduling algorithms can be divided into three principal classes: guided random, metaheuristic and heuristic techniques.

Guided randomization and metaheuristic techniques use possibly random choices with guiding information gained from previous search results to construct new possible solutions. Genetic algorithms are examples of these techniques: they proceed by performing a global search, working in parallel on a population, to increase the chance to achieve a global optimum. The Genetic Algorithm for Task Scheduling (GATS 1.0) [38] is an example of GAs. Overall, these techniques usually take a larger than necessary time to find a good solution [72].

Cuckoo Search (CS) [109] is an example of the metaheuristic class. CS combines the behavior of cuckoo species (throwing their eggs in the nests of other birds) with the Levy flight [28]. In metaheuristics, finding an initial population and deciding when a solution is good are problems that affect the quality of the solution and the complexity of the algorithm.

The third class is heuristics, including mostly list-scheduling-based algorithms (see Section 252). We compare BDSC with eight scheduling algorithms for a bounded number of processors: HLFET, ISH, MCP, HEFT, CEFT, ELT, LPGS and LSGP.

The Highest Level First with Estimated Times (HLFET) [10] and Insertion Scheduling Heuristic (ISH) [70] algorithms use static bottom levels for ordering; scheduling is performed according to a descending order of bottom levels. To schedule a task, they select the cluster that offers the earliest execution time, using a non-insertion approach, i.e., not taking into account idle slots within existing clusters to insert that task. If scheduling a given task introduces an idle slot, ISH adds the possibility of inserting, from the ready list, tasks that can be scheduled to this idle slot. Since, in both algorithms, only bottom levels are used for scheduling purposes, optimal schedules for fork/join graphs cannot be guaranteed.

The Modified Critical Path (MCP) algorithm [107] uses the latest start times, i.e., the critical path length minus blevel, as task priorities. It constructs a list of tasks in an ascending order of latest start times and searches for the cluster yielding the earliest execution, using the insertion approach. As before, it cannot guarantee optimal schedules for fork/join structures.

The Heterogeneous Earliest-Finish-Time (HEFT) algorithm [105] selects the cluster that minimizes the earliest finish time, using the insertion approach. The priority of a task, its upward rank, is the task bottom level. Since this algorithm is based on bottom levels only, it cannot guarantee optimal schedules for fork/join structures.

The Constrained Earliest Finish Time (CEFT) [68] algorithm schedules tasks on heterogeneous systems. It uses the concept of constrained critical paths (CCPs), which represent the tasks ready at each step of the scheduling process. CEFT schedules the tasks in the CCPs using the finish time in the


entire CCP. Because CEFT schedules critical-path tasks first, it cannot guarantee optimal schedules for fork and join structures, even if sufficient processors are provided.

In contrast to these five algorithms, BDSC preserves the DSC characteristic of optimal scheduling for fork/join structures, since it uses the critical path length for computing dynamic priorities, based on bottom levels and top levels, when the number of processors available for BDSC is greater than or equal to the number of processors used when scheduling with DSC. HLFET, ISH and MCP guarantee that the current critical path will not increase, but they do not attempt to decrease the critical path length. BDSC decreases the length of the DS (Dominant Sequence) of each task and starts with a ready vertex, to simplify the computation time of new priorities. When resource scarcity is a factor, BDSC introduces a simple two-step heuristic for task allocation: to schedule tasks, it first searches for possible idle slots in already existing clusters and, otherwise, picks a cluster with enough memory. Our experiments suggest that such an approach provides good schedules (see Chapter 8).

The Locally Parallel-Globally Sequential (LPGS) [79] and Locally Sequential-Globally Parallel (LSGP) [59] algorithms are two techniques that, from a schedule for an unbounded number of clusters, remap the solutions to a bounded number of clusters. In LSGP, clusters are partitioned into blocks, each block being assigned to one cluster (locally sequential); the blocks are handled separately by different clusters, which can be run in parallel (globally parallel). LPGS links each original one-block cluster to one processor (locally parallel); blocks are executed sequentially (globally sequential). BDSC computes a bounded schedule on the fly and covers many more possibilities of scheduling than LPGS and LSGP.

The Extended Latency Time (ELT) algorithm [102] assigns tasks to a parallel machine with shared memory. It uses the attribute of synchronization time instead of communication time, because the latter does not exist in a machine with shared memory. BDSC targets both shared and distributed memory systems.

55 Conclusion

This chapter presents the memory-constrained, number-of-processor-bounded list-scheduling heuristic BDSC, which extends the DSC (Dominant Sequence Clustering) algorithm. This extension improves upon DSC by dealing with memory- and number-of-processors-constrained parallel architectures and introducing an efficient task-to-cluster mapping strategy.

This chapter leaves pending many questions about the practical use of BDSC and how to estimate the execution times and communication costs as numerical constants. The next chapter answers these questions and others


by showing (1) the practical use of BDSC within a hierarchical scheduling algorithm, called HBDSC, for parallelizing scientific applications, and (2) how BDSC will benefit from a sophisticated execution and communication cost model based on either static code analysis results provided by a compiler infrastructure such as PIPS or a dynamic-based instrumentation assessment tool.

Chapter 6

BDSC-Based Hierarchical Task Parallelization

All sciences are now under the obligation to prepare the ground for the future task of the philosopher, which is to solve the problem of value, to determine the true hierarchy of values. Friedrich Nietzsche

To recapitulate what we have presented so far in the previous chapters, recall that we have handled two key issues in our automatic task parallelization process, namely (1) the design of SPIRE, to provide parallel program representations, and (2) the introduction of the BDSC list-scheduling heuristic, for mapping DAGs in the presence of resource constraints on the number of processors and their local memory size. In order to reach our goal of automatically task parallelizing sequential codes, we develop in this chapter an approach, based on BDSC, that first maps this code into a DAG, then generates a cost model to label the edges and vertices of this DAG, and finally applies BDSC to sequential applications.

Therefore, this chapter introduces a new BDSC-based hierarchical scheduling algorithm that we call HBDSC. It introduces a new DAG data structure, called the Sequence Dependence DAG (SDG), to represent partitioned parallel programs. Moreover, in order to label this SDG's vertices and edges, HBDSC uses a new cost model based on time complexity measures, convex polyhedral approximations of data array sizes and code instrumentation.

61 Introduction

If static scheduling algorithms are key issues in sequential program parallelization, they need to be properly integrated into compilation platforms to be used in practice. These platforms are in particular expected to provide the data required for scheduling purposes. However, gathering such information is a difficult problem in general, in particular in our case where tasks are automatically generated from program code. Therefore, in addition to BDSC, this chapter also introduces new static program analyses to gather the information required to perform scheduling under resource constraints, namely a static execution and communication cost model, a data dependence graph to provide scheduling constraints, and static information regarding the volume of data exchanged between program tasks.

In this chapter, we detail how BDSC can be used in practice to schedule applications. We show (1) how to build, from an existing program source


code, what we call a Sequence Dependence DAG (SDG), which plays the role of DAG G used in Chapter 5, (2) how to generate the numerical costs of vertices and edges in SDGs, and (3) how to perform what we call Hierarchical Scheduling (HBDSC) for SDGs, to generate parallel code encoded in SPIRE(PIPS IR). We use PIPS to illustrate how these new techniques can be implemented in an optimizing compilation platform.

PIPS represents user code as abstract syntax trees (see Section 265). We define a subset of its grammar in Figure 61, limited to the statements S at stake in this chapter. Econd, Elower and Eupper are expressions, while I is an identifier. The semantics of these constructs is straightforward. Note that, in PIPS, assignments are seen as function calls where left hand sides are parameters passed by reference. We use the notion of unstructured with an execution attribute equal to parallel (defined in Chapter 4) to represent parallel code.

S ∈ Statement = sequence(S1;...;Sn) |
                test(Econd, St, Sf) |
                forloop(I, Elower, Eupper, Sbody) |
                call |
                unstructured(Centry, Cexit, parallel)

C ∈ Control = control(S, Lsucc, Lpred)

L ∈ Control

Figure 61: Abstract syntax trees Statement syntax

The remainder of this chapter is structured as follows. Section 62 introduces the partitioning of source codes into Sequence Dependence DAGs (SDGs). Section 63 presents our cost models for the labeling of vertices and edges of SDGs. We introduce in Section 64 an important analysis, the path transformer, that we use within our cost models. A new BDSC-based hierarchical scheduling algorithm (HBDSC) is introduced in Section 65. Section 66 compares the main existing parallelization platforms with our BDSC-based hierarchical parallelization approach. Finally, Section 67 concludes the chapter. Figure 62 summarizes the processes of parallelization applied in this chapter.

62 Hierarchical Sequence Dependence DAG Mapping

In order to partition real applications, which include loops, tests and other structured constructs1, into task dependence DAGs, our approach is

1 In this thesis, we only handle structured parts of a code, i.e., the ones that do not contain goto statements. Therefore, within this context, PIPS implements control dependences in its IR, since it is equivalent to an AST (for structured programs, CDG and AST are equivalent).


[Figure 62: Parallelization process; blue indicates thesis contributions, an ellipse a process and a rectangle results. The flow chart starts from the C source code and input data; PIPS analyses produce the DDG, convex array regions, complexity polynomials and symbolic communication and memory sizes, from which the Sequence Dependence Graph (SDG) is built from the sequential IR. If the cost polynomials are monomials on the same base, they are approximated statically; otherwise the source is instrumented and executed to obtain a numerical profile. HBDSC then produces a schedule as unstructured SPIRE(PIPS IR).]



to first build a Sequence Dependence DAG (SDG), which will be the input for the BDSC algorithm. Then we use the code, presented in the form of an AST, to define a hierarchical mapping function, that we call H, to map each sequence statement of the code to its SDG. H is used as the input of the HBDSC algorithm. We present in this section what SDGs are and how an H is built upon them.

621 Sequence Dependence DAG (SDG)

A Sequence Dependence DAG G is a data dependence DAG where task vertices τ are labeled with statements, while control dependences are encoded in the abstract syntax trees of statements. Any statement S can label a DAG vertex, i.e., each vertex τ contains a statement S, which corresponds to the code it runs when scheduled. We assume that there exist two functions, vertex_statement and statement_vertex, such that, on their respective domains of definition, they satisfy S = vertex_statement(τ) and statement_vertex(S, G) = τ. In contrast to the usual program dependence graph defined in [44], an SDG is thus not built only on simple instructions, represented here as call statements: compound statements, such as test statements (both true and false branches) and loop nests, may constitute indivisible vertices of the SDG.

Design Analysis

For a statement S, one computes its sequence dependence DAG G using a recursive descent (depth-first traversal) along the structure of S, creating a vertex for each statement in a sequence and building loop body, true branch and false branch statements as recursively nested graphs. Definition 1 specifies this relation of hierarchical nesting between statements, which we call "enclosed".

Definition 1. A statement S2 is enclosed into the statement S1, noted S2 ⊂ S1, if and only if either:

1. S2 = S1, i.e., S1 and S2 have the same lexicographical order position within the whole AST to which they belong; this ordering of statements, based on their statement number, renders statements textually comparable;

2. or S1 is a loop, S′2 its body statement, and S2 ⊂ S′2;

3. or S1 is a test, S′2 its true or false branch, and S2 ⊂ S′2;

4. or S1 is a sequence and there exists S′2, one of the statements in this sequence, such that S2 ⊂ S′2.


This relation can also be applied to the vertices τi of these statements Si in a graph, such that vertex_statement(τi) = Si; we note S2 ⊂ S1 ⟺ τ2 ⊂ τ1.
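For illustration, a minimal C sketch of this enclosed test, over an AST shaped like the grammar of Figure 61, could look as follows (the struct layout and helper names are hypothetical and are not the PIPS IR):

#include <stdbool.h>

/* Hypothetical AST node mirroring the grammar of Figure 61;
   this is not the PIPS IR, only an illustration of Definition 1. */
typedef enum { CALL, SEQUENCE, FORLOOP, TEST } kind_t;

typedef struct stmt {
  kind_t kind;
  int number;                 /* statement number (lexicographic position) */
  struct stmt **children;     /* sequence items, loop body, or test branches */
  int nchildren;
} stmt_t;

/* enclosed(s2, s1) implements S2 ⊂ S1 of Definition 1 */
bool enclosed(const stmt_t *s2, const stmt_t *s1) {
  if (s1->number == s2->number)          /* case 1: same statement */
    return true;
  switch (s1->kind) {                    /* cases 2-4: recurse into substatements */
  case FORLOOP:
  case TEST:
  case SEQUENCE:
    for (int i = 0; i < s1->nchildren; i++)
      if (enclosed(s2, s1->children[i]))
        return true;
    return false;
  default:                               /* a call encloses only itself */
    return false;
  }
}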

In Definition 2, we provide a functional specification of the construction of a Sequence Dependence DAG G where the vertices are compound statements, and thus under the enclosed relation. To create connections within this set of vertices, we keep an edge between two vertices if the possibly compound statements in the vertices are data dependent, using the recursive function prune; we rely on static analyses, such as the one provided in PIPS, to get this data dependence information, in the form of a data dependence graph that we call D in the definition.

Definition 2. Given a data dependence graph D = (T, E)2 and a sequence of statements S = sequence(S1; S2; ...; Sm), its SDG is G = (N, prune(A), 0_m²), where:

• N = {n1, n2, ..., nm}, where nj = vertex_statement(Sj), ∀j ∈ {1, ..., m};

• A = {(ni, nj) / ∃(τ, τ′) ∈ E, τ ⊂ ni ∧ τ′ ⊂ nj};

• prune(A′ ∪ {(n, n′)}) = prune(A′)              if ∃(n0, n′0) ∈ A′, (n0 ⊂ n ∧ n′0 ⊂ n′)
                          prune(A′) ∪ {(n, n′)}  otherwise

• prune(∅) = ∅;

• 0_m² is a m × m communication edge cost matrix, initially equal to the zero matrix.

Note that G is a quotient graph of D; moreover, we assume that D does not contain inter-iteration dependencies.

Construction Algorithm

Algorithm 13 specifies the construction of the sequence dependence DAG G of a statement that is a sequence of substatements S1, S2, ..., Sm. G initially is an empty graph, i.e., its sets of vertices and edges are empty, and its communication matrix is the zero matrix.

From the data dependence graph D, provided via the function DDG that we suppose given, Function SDG(S) adds vertices and edges to G according to Definition 2, using the functions add_vertex and add_edge. First, we group dependences between compound statements Si to form cumulated dependences, using dependences between their enclosed statements Se; this

2 T = vertices(D) is a set of n tasks (vertices) τ, and E is a set of m edges (τi, τj) between two tasks.


ALGORITHM 13: Construction of the sequence dependence DAG G from a statement S = sequence(S1; S2; ...; Sm)

function SDG(sequence(S1; S2; ...; Sm))
   G = (∅, ∅, 0_m²)
   D = DDG(sequence(S1; S2; ...; Sm))
   foreach Si ∈ {S1, S2, ..., Sm}
      τi = statement_vertex(Si, D)
      add_vertex(τi, G)
      foreach Se / Se ⊂ Si
         τe = statement_vertex(Se, D)
         foreach τj ∈ successors(τe, D)
            Sj = vertex_statement(τj)
            if (∃ k ∈ [1, m] / Sj ⊂ Sk) then
               τk = statement_vertex(Sk, D)
               edge_regions(τi, τk) ∪= edge_regions(τe, τj)
               if (¬∃ (τi, τk) ∈ edges(G))
                  add_edge((τi, τk), G)
   return G
end

yields the successors τj. Then we search for another compound statement Sk to form the second side of the dependence, τk, the first being τi. We then keep the dependences that label the edges of D, using edge_regions of the new edge. Finally, we add the dependence edge (τi, τk) to G.

Figure 64 illustrates the construction, from the DDG given in Figure 63 (right; the meaning of colors is given in Section 242), of the SDG of the C code (left). The figure contains two SDGs, corresponding to the two sequences in the code: the body S0 of the first loop (in blue) also has an SDG, G0. Note how the dependences between the two loops have been deduced from the dependences of their enclosed statements (their loop bodies). These SDGs and their printouts have been generated automatically with PIPS3.

622 Hierarchical SDG Mapping

A hierarchical SDG mapping function H maps each statement S to an SDG G = H(S) if S is a sequence statement; otherwise G is equal to ⊥.

Proposition 1. ∀τ ∈ vertices(H(S′)), vertex_statement(τ) ⊂ S′.

Our introduction of the SDG and of its hierarchical mapping to different statements via H is motivated by the following observations, which also support our design decisions.

3 We assume that declarations and codes are not mixed.


void main() {
   int a[10], b[10];
   int i, d, c;
   /* S: */
   c = 42;
   for (i = 1; i <= 10; i++)
      a[i] = 0;
   for (i = 1; i <= 10; i++) {
      /* S0: */
      a[i] = bar(a[i]);
      d = foo(c);
      b[i] = a[i] + d;
   }
   return;
}

Figure 63: Example of a C code (left) and the DDG D of its internal S sequence (right)

1. The true and false statements of a test are control dependent upon the condition of the test statement, while every statement within a loop (i.e., the statements of its body) is control dependent upon the loop statement header. If we define a control area as a set of statements transitively linked by the control dependence relation, our SDG construction process ensures that the control area of the statement of a given vertex is in the vertex. This way, we keep all the control dependences of a task in our SDG within itself.

2. We decided to consider test statements as single vertices in the SDG, to ensure that they are scheduled on one cluster4, which guarantees the execution of the enclosed code (true or false statements), whichever branch is taken, on this cluster.

3. We do not group successive simple call instructions into a single "basic block" vertex in the SDG, in order to let BDSC fuse the corresponding

4 A cluster is a logical entity which will correspond to one process or thread.


Figure 64: SDGs of S (top) and S0 (bottom), computed from the DDG (see the right of Figure 63); S and S0 are specified in the left of Figure 63

statements, so as to maximize parallelism and minimize communications. Note that PIPS performs interprocedural analyses, which will allow call sequences to be efficiently scheduled, whether these calls represent trivial assignments or complex function calls.

In Algorithm 13, Function SDG yields the sequence dependence DAG of a sequence statement S. We use it in Function SDG_mapping, presented in Algorithm 14, to update the mapping function H for statements S that can be a sequence or a non-sequence; the goal is to build a hierarchy between these SDGs, via the recursive call H = SDG_mapping(S, ⊥).

Figure 65 shows the construction of H, where S is the last part of the equake code excerpt given in Figure 67; note how the body Sbody of the last loop is also an SDG, Gbody. Note that H(Sbody) = Gbody and H(S) = G. These SDGs have been generated automatically with PIPS; we use the Graphviz tool for pretty printing [50].


ALGORITHM 14: Recursive mapping of an SDG G to each sequence statement S via the function H

function SDG_mapping(S, H)
   switch (S)
      case call:
         return H[S → ⊥]
      case sequence(S1;...;Sn):
         foreach Si ∈ {S1, ..., Sn}
            H = SDG_mapping(Si, H)
         return H[S → SDG(S)]
      case forloop(I, Elower, Eupper, Sbody):
         H = SDG_mapping(Sbody, H)
         return H[S → ⊥]
      case test(Econd, St, Sf):
         H = SDG_mapping(St, H)
         H = SDG_mapping(Sf, H)
         return H[S → ⊥]
end

Figure 65: SDG G for a part S of equake, given in Figure 67; Gbody is the SDG of the body Sbody (logically included into G via H)


63 Sequential Cost Models Generation

Since the volume of data used or exchanged by SDG tasks and their execution times are key factors in the BDSC scheduling process, we need to assess this information as precisely as possible. PIPS provides an intra- and inter-procedural analysis of array data flow, called array regions analysis (see Section 263), that computes dependences for each array element access. For each statement S, two types of sets of regions are considered: read_regions(S) and write_regions(S)5 contain the array elements respectively read and written by S. Moreover, in PIPS, array region analysis is not limited to array variables but is extended to scalar variables, which can be considered as arrays with no dimensions [34].

631 From Convex Polyhedra to Ehrhart Polynomials

Our analysis uses the following set of operations on the sets of regions Ri of Statement Si, presented in Section 263: regions_intersection and regions_union. Note that regions must be defined with respect to a common memory store for these operations to be properly defined. Two sets of regions R1 and R2 should be defined in the same memory store to make it possible to compare them; we should bring a set of regions R1 to the store of R2 by combining R2 with the possible store changes introduced in between, what we call the "path transformer", which connects the two memory stores. We introduce the path transformer analysis in Section 64.

Since we are interested in the size of array regions, to precisely assess communication costs and memory requirements, we compute Ehrhart polynomials [41], which represent the number of integer points contained in a given parameterized polyhedron, from this region. To manipulate these polynomials, we use various operations, using the Ehrhart API provided by the polylib library [90].

Communication Edge Cost

To assess the communication cost between two SDG vertices, τ1 as source and τ2 as sink vertex, we rely on the number of bytes involved in dependences of type "read after write" (RAW), using the read and write regions as follows:

Rw1 = write_regions(vertex_statement(τ1))
Rr2 = read_regions(vertex_statement(τ2))
edge_cost_bytes(τ1, τ2) = Σ_{r ∈ regions_intersection(Rw1, Rr2)} ehrhart(r)

5 in_regions(S) and out_regions(S) are not suitable for our context, unless in_regions and out_regions are recomputed after each statement scheduling. That is why we use read_regions and write_regions, which are independent of the schedule.


For example, the array dependence between the two statements S1 = MultiplY(Ixx, Gx, Gx) and S2 = Gauss(Sxx, Ixx) in the code of Harris presented in Figure 11 is on Ixx, of size N × M, where the N and M variables represent the input image size. Therefore, S1 should communicate the array Ixx to S2 when completed. We look for the array elements that are accessed in both statements (written in S1 and read in S2); we obtain the following two sets of regions, Rw1 and Rr2:

Rw1 = <Ixx(φ1)-W-EXACT- 1 ≤ φ1, φ1 ≤ N × M >
Rr2 = <Ixx(φ1)-R-EXACT- 1 ≤ φ1, φ1 ≤ N × M >

The result of the intersection between these two sets of regions is as follows

Rw1 ∩ Rr2 = <Ixx(φ1)-WR-EXACT- 1 ≤ φ1, φ1 ≤ N × M >

The sum of the Ehrhart polynomials resulting from the set of regions Rw1 ∩ Rr2, which yields the volume of this set of regions, representing the number of its elements in bytes, is as follows, where 4 is the number of bytes in a float:

edge_cost_bytes(τ1, τ2) = ehrhart(<Ixx(φ1)-WR-EXACT- 1 ≤ φ1, φ1 ≤ N × M >)
                        = 4 × N × M

In practice, in our experiments, in order to compute the communication time, this polynomial, which represents the message size to communicate in number of bytes, is multiplied by a transfer time of one byte (β); it is then added to the latency time (α). These two coefficients depend on the specific target machine; we note:

edge_cost(τ1, τ2) = α + β × edge_cost_bytes(τ1, τ2)

The default cost model in PIPS considers that α = 0 and β = 1. Moreover, in some cases, the array region analysis cannot provide the required information. In the case of shared memory codes, BDSC can then proceed and generate an efficient schedule without the edge cost information; otherwise, in the case of distributed memory codes, edge_cost(τ1, τ2) = ∞ is generated as a result for such cases. We thus decide to schedule τ1 and τ2 in the same cluster.
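A minimal C sketch of this computation, assuming the message size in bytes (or a negative flag when region information is missing) has already been obtained, could be:

#include <math.h>

/* Illustrative sketch only: alpha (latency) and beta (per-byte transfer time)
   are machine-dependent parameters; bytes < 0 encodes the case where array
   region analysis could not provide the message size. */
double edge_cost(double alpha, double beta, double bytes, int distributed_memory) {
  if (bytes < 0)                         /* missing region information */
    return distributed_memory ? INFINITY /* forces the two tasks into one cluster */
                              : 0.0;     /* shared memory: one possible convention,
                                            proceed without the edge cost */
  return alpha + beta * bytes;
}

With the PIPS default model (α = 0, β = 1), the edge cost is simply the number of bytes transferred.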

Local Storage Task Data

To provide an estimation of the volume of data used by each task τ, we use the number of bytes of data read and written by the task statement, via the following definitions, assuming that the task statement contains no internal allocation of data:


S = vertex_statement(τ)
task_data(τ) = regions_union(read_regions(S), write_regions(S))

As above, we define data_size as follows, for any set of regions R:

data_size(R) = Σ_{r ∈ R} ehrhart(r)

For instance, in the Harris main function, task_data(τ), where τ is the vertex labeled with the call statement S1 = MultiplY(Ixx, Gx, Gx), is equal to:

R = {<Gx(φ1)-R-EXACT- 1 ≤ φ1, φ1 ≤ N × M >,
     <Ixx(φ1)-W-EXACT- 1 ≤ φ1, φ1 ≤ N × M >}

The size of this set of regions is

data_size(R) = 2 × N × M

When the array region analysis cannot provide the required information, the size of the array contained in its declaration is returned as an upper bound of the accessed array elements for such cases.

Execution Task Time

In order to determine an average execution time for each vertex in the SDG, we use a static execution time approach, based on a program complexity analysis provided by PIPS. There, each statement S is automatically labeled with an expression, represented by a polynomial over program variables, via the complexity_estimation(S) function, that denotes an estimation of the execution time of this statement, assuming that each basic operation (addition, multiplication...) has a fixed, architecture-dependent execution time. In practice, to perform our experiments on a specific target machine, we use a table, COMPLEXITY_COST_TABLE, that, for each basic operation, provides a coefficient of its execution time on this machine. This sophisticated static complexity analysis is based on inter-procedural information such as preconditions. Using this approach, one can define task_time as:

task_time(τ) = complexity_estimation(vertex_statement(τ))

For instance, for the call statement S1 = MultiplY(Ixx, Gx, Gx) in the Harris main function, task_time(τ), where τ is the vertex labeled with S1, is equal to N × M × 17 + 2, as detailed in Figure 66. Thanks to the interprocedural analysis of PIPS, this information can be reported to the call to the function MultiplY in the main function of Harris. Note that we use here a default cost model, where the coefficient of each basic operation is equal to 1.

When this analysis cannot provide the required information (currently, the complexity estimation of a while loop is not computed), our algorithm computes a naive schedule, where we assume that task_time(τ) = ∞; the quality (in terms of efficiency) of this schedule is however not guaranteed.
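As an illustration of how such a cost table can be used (the operation categories and values below are hypothetical, not the actual contents of PIPS's COMPLEXITY_COST_TABLE), statically counted operations can be weighted by per-operation cycle costs:

#include <stdio.h>

/* Hypothetical per-operation costs, in cycles; the real
   COMPLEXITY_COST_TABLE is machine-dependent and more detailed. */
enum { OP_ADD, OP_MUL, OP_MEM, NOPS };
static const double cost_table[NOPS] = {1.0, 1.0, 1.0};   /* default model: each operation costs 1 */

/* Cost of a statement given its (statically estimated) operation counts */
double statement_cost(long n_add, long n_mul, long n_mem) {
  return n_add * cost_table[OP_ADD]
       + n_mul * cost_table[OP_MUL]
       + n_mem * cost_table[OP_MEM];
}

int main(void) {
  long N = 1024, M = 1024;
  /* With the default model, Figure 66 estimates MultiplY at N*M*17 + 2 cycles. */
  printf("task_time(MultiplY) = %ld cycles\n", N * M * 17 + 2);
  printf("a statement with 4 adds, 3 muls, 2 memory accesses = %g cycles\n",
         statement_cost(4, 3, 2));
  return 0;
}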


void MultiplY(float M[N*M], float X[N*M], float Y[N*M]) {
   // N*M*17 + 2
   for (i = 0; i < N; i += 1)
      // M*17 + 2
      for (j = 0; j < M; j += 1)
         // 17
         M[z(i,j)] = X[z(i,j)] * Y[z(i,j)];
}

Figure 66: Example of the execution time estimation for the function MultiplY of the code Harris; each comment provides the complexity estimation of the statement below (N and M are assumed to be global variables)

632 From Polynomials to Values

We have just seen how to represent cost data and time information in terms of polynomials; yet running BDSC requires actual values. This is particularly a problem when computing tlevels and blevels, since cost and time are cumulated there. We convert cost information into time by assuming that communication times are proportional to costs, which amounts in particular to setting the communication latency α to zero. This assumption is validated by our experimental results and by the fact that data arrays are usually large in the application domain we target.

Static Approach Polynomials Approximation

When the program variables used in the above-defined polynomials are numerical values, each polynomial is a constant; this happens to be the case for one of our applications, ABF. However, when input data are unknown at compile time (as for the Harris application), we suggest using a very simple heuristic to replace the polynomials by numerical constants. When all polynomials at stake are monomials on the same base, we simply keep the coefficient of these monomials as costs. Even though this heuristic appears naive at first, it actually is quite useful in the Harris application. Table 61 shows the complexities for time estimation and communication (the last line) costs generated for each function of Harris, using the PIPS default operation and communication cost model, where the N and M variables represent the input image size.
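A sketch of this heuristic, assuming each cost polynomial has already been reduced to a (coefficient, base) pair where the base is a symbolic product of program variables such as "N*M" (this representation and the function names are illustrative, not the PIPS/polylib API), might be:

#include <string.h>
#include <stdbool.h>

/* Illustrative only: a monomial c * base, where base is a symbolic product
   of program variables such as "N*M"; this is not the PIPS/polylib API. */
typedef struct { double coefficient; const char *base; } monomial_t;

/* If all cost polynomials are monomials on the same base, keep their
   coefficients as numerical costs (as in Table 61); otherwise report failure
   so that the instrumentation-based approach is used instead. */
bool approximate_costs(const monomial_t *polys, int n, double *costs_out) {
  for (int i = 0; i < n; i++) {
    if (polys[i].base == NULL || strcmp(polys[i].base, polys[0].base) != 0)
      return false;                     /* not monomials on a common base */
    costs_out[i] = polys[i].coefficient;
  }
  return true;
}

On Harris, all bases are N*M, so the heuristic keeps the coefficients of the last column of Table 61.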

In our implementation (see Section 833), we use a more precise operation and communication cost model that depends on the target machine. It relies on a number of parameters of a given CPU, such as the cost of an addition, of a multiplication, or the communication cost of one byte, for an Intel CPU, all these in numbers of cycle units (CPI). In Section 833, we


Function             Complexity and Transfer (polynomial)   Numerical estimation

InitHarris           9 × N × M                               9
SobelX               60 × N × M                              60
SobelY               60 × N × M                              60
MultiplY             17 × N × M                              17
Gauss                85 × N × M                              85
CoarsitY             34 × N × M                              34

One image transfer   4 × N × M                               4

Table 61: Execution and communication time estimations for Harris using PIPS default cost model (N and M variables represent the input image size)

detail these parameters for the CPUs used.

Dynamic Approach Instrumentation

The general case deals with polynomials that are functions of many variables, such as the ones that occur in the SPEC2001 benchmark equake and that depend on the variables ARCHelems or ARCHnodes and timesteps (see the non-bold code in Figure 67, which shows a part of the equake code).

In such cases, we first automatically instrument the input sequential code and run it once in order to obtain the numerical values of the polynomials. The instrumented code contains the initial user code plus instructions that compute the numerical values of the cost polynomials for each statement. BDSC is then applied, using this cost information, to yield the final parallel program. Note that this approach is sound, since BDSC ensures that the value of a variable (and thus of a polynomial) is the same whichever scheduling is used. Of course, this approach works well, as our experiments show (see Chapter 8), when a program's complexity and communication polynomials do not change when some of its input parameters are modified (partial evaluation). This is the case for many signal processing benchmarks, where performance is mostly a function of structure parameters, such as image size, and is independent of the actual signal (pixel) values upon which the program acts.

We show an example of this final case, using a part of the instrumented equake code6, in Figure 67. The added instrumentation instructions are fprintf statements, the second parameter of which represents the statement number of the following statement, and the third the value of its execution time, for task_time instrumentation. For edge_cost instrumentation, the second parameter is the number of the incident statements of the

6We do not show the instrumentation on the statements inside the loops, for readability purposes.


FILE *finstrumented = fopen("instrumented_equake.in", "w");

fprintf(finstrumented, "task_time 62 = %ld\n", 179*ARCHelems + 3);
for (i = 0; i < ARCHelems; i++)
  for (j = 0; j < 4; j++)
    cor[j] = ARCHvertex[i][j];

fprintf(finstrumented, "task_time 163 = %ld\n", 20*ARCHnodes + 3);
for (i = 0; i <= ARCHnodes - 1; i += 1)
  for (j = 0; j <= 2; j += 1)
    disp[disptplus][i][j] = 0.0;

fprintf(finstrumented, "edge_cost 163 -> 166 = %ld\n", ARCHnodes * 9);
fprintf(finstrumented, "task_time 166 = %ld\n", 110*ARCHnodes + 106);
smvp_opt(ARCHnodes, K, ARCHmatrixcol, ARCHmatrixindex,
         disp[dispt], disp[disptplus]);

fprintf(finstrumented, "task_time 167 = %d\n", 6);
time = iter*Exc.dt;

fprintf(finstrumented, "task_time 168 = %ld\n", 510*ARCHnodes + 3);
for (i = 0; i < ARCHnodes; i++)        /* S */
  for (j = 0; j < 3; j++) {            /* Sbody */
    disp[disptplus][i][j] *= -Exc.dt*Exc.dt;
    disp[disptplus][i][j] +=
      2.0*M[i][j]*disp[dispt][i][j] - (M[i][j] -
      Exc.dt/2.0*C[i][j]) * disp[disptminus][i][j] -
      Exc.dt*Exc.dt*(M23[i][j]*phi2(time)/2.0 +
                     C23[i][j]*phi1(time)/2.0 +
                     V23[i][j]*phi0(time)/2.0);
    disp[disptplus][i][j] =
      disp[disptplus][i][j] / (M[i][j] + Exc.dt/2.0*C[i][j]);
    vel[i][j] = 0.5/Exc.dt*(disp[disptplus][i][j] -
                            disp[disptminus][i][j]);
  }

fprintf(finstrumented, "task_time 175 = %d\n", 2);
disptminus = dispt;
fprintf(finstrumented, "task_time 176 = %d\n", 2);
dispt = disptplus;
fprintf(finstrumented, "task_time 177 = %d\n", 2);
disptplus = i;

Figure 6.7: Instrumented part of equake (Sbody is the inner loop sequence)

After execution of the instrumented code, the numerical results of the polynomials are printed in the file instrumented_equake.in, presented in Figure 6.8. This file is an input for the PIPS implementation of BDSC. We parse this file in order to extract the execution and communication time estimations (in numbers of cycles) to be used as parameters for the computation of the top levels and bottom levels necessary for the BDSC algorithm (see Chapter 5); a minimal sketch of such a parser is given after Figure 6.8.



task_time 62 = 35192835

task_time 163 = 718743

edge_cost 163 -> 166 = 323433

task_time 166 = 3953176

task_time 167 = 6

task_time 168 = 18327873

task_time 175 = 2

task_time 176 = 2

task_time 177 = 2

Figure 6.8: Numerical results of the instrumented part of equake (instrumented_equake.in)
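To give an idea of how such a file can be consumed, the following minimal C sketch parses the two line formats shown in Figure 6.8; the function name and the way the extracted values are printed are ours (hypothetical) and do not reflect the actual PIPS implementation:

#include <stdio.h>

/* Minimal sketch: read "task_time <id> = <value>" and
   "edge_cost <src> -> <dst> = <value>" lines, as in Figure 6.8. */
void parse_instrumentation_file(const char *path)
{
  FILE *f = fopen(path, "r");
  char line[256];
  long id, src, dst, value;
  if (f == NULL) return;
  while (fgets(line, sizeof line, f) != NULL) {
    if (sscanf(line, "task_time %ld = %ld", &id, &value) == 2)
      printf("vertex %ld: task_time = %ld cycles\n", id, value);
    else if (sscanf(line, "edge_cost %ld -> %ld = %ld", &src, &dst, &value) == 3)
      printf("edge %ld -> %ld: cost = %ld cycles\n", src, dst, value);
  }
  fclose(f);
}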

This section supplies the BDSC scheduling algorithm with the weights on the vertices (execution time and data storage) and on the edges (communication cost) of its input graph. Indeed, we exploit the complexity and array region analyses provided by PIPS to harvest all this information statically. However, since two sets of regions must be defined in the same memory store to make their comparison possible, the second set of regions has to be combined with the path transformer, which we define in the next section, and which connects the two memory stores of these two sets of regions.

6.4 Reachability Analysis: The Path Transformer

Our cost model, defined in Section 6.3, uses affine relations such as convex array regions to estimate the communication and data volumes induced by the execution of statements. We used a set of operations on array regions, such as the function regions_intersection (see Section 6.3.1). However, this is not sufficient, because regions should be combined with what we call path transformers in order to propagate the memory stores used in them. Indeed, regions must be defined with respect to a common memory store for region operations to be properly performed.

A path transformer permits the comparison of array regions of statements originally defined in different memory stores. The path transformer between two statements computes the possible changes performed by a piece of code delimited by two statements, Sbegin and Send, enclosed within a statement S. A path transformer is represented by a convex polyhedron over program variables, and its computation is based on transformers (see Section 2.6.1); it is thus a system of linear inequalities.


Our path transformer analysis can be seen as a special case of reachability analysis, a particular kind of graph analysis in which one checks whether a given final state can be reached from an initial one. In the domain of programming languages, one practical application of this technique is the computation, for general control-flow graphs, of the transitive closure of the relation corresponding to statement transitions (see for instance [106]). Since the scientific applications we target in this thesis mostly use for loops, we propose in this section a dedicated analysis, called path transformer, that provides, in the restricted case of for loops and tests, analytical expressions for such transitive closures.

6.4.1 Path Definition

A “path” specifies a statement execution instance, including the mapping of loop indices to values and the specific syntactic statement of interest. A path is defined by a pair P = (M, S), where M is a mapping function from Ide to Exp7 and S is a statement specified by its syntactic path from the top program syntax tree to the particular statement. Indeed, a particular statement Si may correspond to a particular iteration M(i) within a set of iterations (created by a loop of index i). Paths are used to store the values of the indices of the loops enclosing statements. Every loop on the path has its entry in M; indices of loops not related to it are not present.
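As a purely illustrative data-structure view of this pair P = (M, S), one could represent a path as an index mapping plus a statement reference; the C type names below are ours (hypothetical) and do not correspond to the actual PIPS data structures:

/* Illustrative sketch of a path P = (M, S); all names are hypothetical. */
typedef struct {
  const char *index;        /* loop index identifier (an element of Ide)       */
  const char *value;        /* expression it is mapped to (an element of Exp)  */
} index_binding;

typedef struct {
  index_binding *bindings;  /* the mapping M: one entry per relevant enclosing loop */
  int            nb_bindings;
  void          *statement; /* S: the syntactic path to the statement of interest   */
} path;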

We provide a simple program example in Figure 6.9 that shows the AST of a code containing a loop. We search for the path transformer between Sb and Se; we suppose that S, the smallest subtree that encloses the two statements, is forloop0. As the two statements are enclosed in a loop, we have to specify the iterations of these statements inside the loop, e.g., M(i) = 2 for Sb and M(i) = 5 for Se. The path execution trace between Sb and Se is illustrated in Figure 6.10. The chosen path runs through Sb for the second iteration of the loop until Se for the fifth iteration.

In this section, we assume that Send is reachable [43] from Sbegin. This means that, for each enclosing loop S (Sbegin ⊂ S and Send ⊂ S) where S = forloop(I, Elower, Eupper, Sbody), Elower ≤ Eupper (loops are always executed) and, for each mapping of an index of these loops, Elower ≤ ibegin ≤ iend ≤ Eupper. Moreover, since there is no execution trace from the true branch St to the false branch Sf of a test statement Stest, Send is not reachable from Sbegin if, for each mapping of an index of a loop that encloses Stest, ibegin = iend, or if there is no loop enclosing the test.

6.4.2 Path Transformer Algorithm

In this section, we present an algorithm that computes the affine relations between program variables at any two points, or “paths”, of the program, from the memory store of Sbegin to the memory store of Send, along any sequential execution that links the two memory stores.

7Ide and Exp are the sets of identifiers I and expressions E, respectively.


forloop0
  index: i    lower bound: 1    upper bound: 10    body: sequence0
    sequence0 = [S0, S1, Sb, S2, Se, S3]

Figure 6.9: An example of a subtree of root forloop0

Sb(2), S2(2), Se(2), S3(2), S0(3), S1(3), Sb(3), S2(3), Se(3), S3(3), S0(4), S1(4), Sb(4), S2(4), Se(4), S3(4), S0(5), S1(5), Sb(5), S2(5), Se(5)

Figure 6.10: An example of a path execution trace from Sb to Se in the subtree forloop0

path_transformer(S, Pbegin, Pend) is the path transformer of the subtree of S with all vertices not on the execution trace from path Pbegin to path Pend pruned. It is the path transformer between statements Sbegin and Send if Pbegin = (Mbegin, Sbegin) and Pend = (Mend, Send).

path_transformer uses a variable mode to indicate the current context on the compile-time version of the execution trace.


Modes are propagated along the depth-first, left-to-right traversal of statements: (1) mode = sequence when there is no loop on the trace at the current statement level; (2) mode = prologue on the prologue part of an enclosing loop; (3) mode = permanent on the full part of an enclosing loop; and (4) mode = epilogue on the epilogue part of an enclosing loop. The algorithm is described in Algorithm 15.

A first call to the recursive function pt_on(S, Pbegin, Pend, mode) with mode = sequence is made. In this function, many path transformer operations are performed. A path transformer T is recursively defined by:

T = id | ⊥ | T1 ∘ T2 | T1 ⊔ T2 | T^k | T(S)

These operations on path transformers are specified as follows:

• id is the identity transformer, used when no variables are modified, i.e., they are all constrained by an equality.

• ⊥ denotes the undefined (empty) transformer, i.e., one in which the set of affine constraints of the system is not feasible.

• '∘' denotes the composition of transformers: T1 ∘ T2 is overapproximated by the union of the constraints in T1 (on variables x1, ..., xn, x″1, ..., x″p) with the constraints in T2 (on variables x″1, ..., x″p, x′1, ..., x′n′), projected on x1, ..., xn, x′1, ..., x′n′ in order to eliminate the “intermediate” variables x″1, ..., x″p (this definition is extracted from [76]). Note that T ∘ id = T for all T. (A small worked example of these operations is given after this list.)

• '⊔' denotes the convex hull of its two transformer arguments. The usual union of two transformers is a union of two convex polyhedra, which is not necessarily a convex polyhedron; a convex hull approximation is required in order to obtain a convex result: the convex hull operation provides the smallest polyhedron enclosing the union.

• 'T^k' denotes the effect of k iterations of a transformer T. In practice, heuristics are used to compute a transitive closure approximation T*; thus, this operation generally provides an approximation of the transformer T^k. Note that PIPS uses here the Affine Derivative Closure algorithm [18].

• T(S) is the transformer of a statement S, defined in Section 2.6.1; we suppose these transformers already exist. T is defined by a list of arguments and a predicate system labeled by T.
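As a small worked example of our own (not taken from [76]), consider the one-variable transformers T1 = T(i) {i = 2} and T2 = T(i) {i = i_init + 3}, where i_init denotes the value of i in the store in which the transformer is applied. Then:

T1 ∘ T2 = T(i) {i = 5}                 (the intermediate value 2 is eliminated by projection)
T1 ⊔ (T1 ∘ T2) = T(i) {2 ≤ i ≤ 5}      (convex hull of the two polyhedra)
T2^k = T(i) {i = i_init + 3k}          (effect of k iterations of T2)

and a transitive closure approximation T2* may only retain, for instance, the weaker constraint i ≥ i_init.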

We distinguish several cases in the pt_on(S, (Mbegin, Sbegin), (Mend, Send), mode) function. The computation of the path transformer begins when S is Sbegin, i.e., when the path is beginning: we return the transformer of Sbegin.


ALGORITHM 15: AST-based Path Transformer algorithm

function path_transformer(S, (Mbegin,Sbegin), (Mend,Send))
  return pt_on(S, (Mbegin,Sbegin), (Mend,Send), sequence)
end

function pt_on(S, (Mbegin,Sbegin), (Mend,Send), mode)
  if (Send = S) then return id
  elseif (Sbegin = S) then return T(S)
  switch (S)
    case call:
      return (mode = permanent
              or mode = sequence and Sbegin ≤ S and S < Send
              or mode = prologue and Sbegin ≤ S
              or mode = epilogue and S < Send) ? T(S) : id
    case sequence(S1;S2;...;Sn):
      if (|S1;S2;...;Sn| = 0) then return id
      else
        rest = sequence(S2;...;Sn)
        return pt_on(S1, (Mbegin,Sbegin), (Mend,Send), mode) ∘
               pt_on(rest, (Mbegin,Sbegin), (Mend,Send), mode)
    case test(Econd, St, Sf):
      Tt = pt_on(St, (Mbegin,Sbegin), (Mend,Send), mode)
      Tf = pt_on(Sf, (Mbegin,Sbegin), (Mend,Send), mode)
      if ((Sbegin ⊂ St and Send ⊂ Sf) or (Send ⊂ St and Sbegin ⊂ Sf)) then
        if (mode = permanent) then
          return Tt ⊔ Tf
        else
          return ⊥
      else if (Sbegin ⊂ St or Send ⊂ St) then
        return T(Econd) ∘ Tt
      else if (Sbegin ⊂ Sf or Send ⊂ Sf) then
        return T(¬Econd) ∘ Tf
      else
        return T(S)
    case forloop(I, Elower, Eupper, Sbody):
      return pt_on_loop(S, (Mbegin,Sbegin), (Mend,Send), mode)
end

The computation ends when S is Send: there, we close the path and return the identity transformer id. The other cases depend on the type of the statement, as follows.


Call

When S is a function call, we return the transformer of S if S belongs to the path from Sbegin to Send, and id otherwise; we use the lexicographic ordering < to compare statement paths.

Sequence

When S is a sequence of n statements, we return the composition of the transformers of each statement in this sequence. An example of the result provided by the path transformer algorithm for a sequence is illustrated in Figure 6.11. The left side presents a code fragment that contains a sequence of four instructions. The right side shows the resulting path transformer between the two statements Sb and Se: it is the composition of the transformers of the first three statements; note that the transformer of Se is not included.

S:
  Sb:  i = 42;
       i = i+5;
       i = i*3;
  Se:  i = i+4;

path_transformer(S, (⊥,Sb), (⊥,Se)) =
  T(i) {i = 141}

Figure 6.11: An example of a C code and the path transformer of the sequence between Sb and Se; M is ⊥ since there are no loops in the code
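Following the definition of '∘' given above, the result of Figure 6.11 can be retraced step by step (this decomposition is ours):

T(Sb) = T(i) {i = 42}
T(i = i+5) = T(i) {i = i_init + 5}
T(i = i*3) = T(i) {i = 3 i_init}
T(Sb) ∘ T(i = i+5) = T(i) {i = 47}
(T(Sb) ∘ T(i = i+5)) ∘ T(i = i*3) = T(i) {i = 141}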

Test

When S is a test, the path transformer is computed according to the locations of Sbegin and Send in the branches of the test. If one is in the true branch and the other in the false branch, the path transformer is the union of the true and false transformers iff the mode is permanent, i.e., there is at least one loop enclosing S. If only one of the two statements is enclosed in the true or in the false branch of the test, the transformer of the condition (true or false) is composed with the transformer of the corresponding enclosing branch (true or false) and is returned. An example of the result provided by the path transformer algorithm for a test is illustrated in Figure 6.12. The left side presents a code fragment that contains a test. The right side shows the resulting path transformer between the two statements Sb and Se: it is the composition of the transformers of the statements from Sb to the statement just before the test, of the transformer of the true condition (i>0), and of the transformers of the statements in the true branch just before Se; note that the transformer of Se is not included.


S:
  Sb:  i = 4;
       i = i*3;
       if (i>0) {
         k1++;
  Se:    i = i+4;
       } else {
         k2++;
         i--;
       }

path_transformer(S, (⊥,Sb), (⊥,Se)) =
  T(i, k1) {i = 12, k1 = k1_init + 1}

Figure 6.12: An example of a C code and the path transformer of the sequence of calls and test between Sb and Se; M is ⊥ since there are no loops in the code

Loop

When S is a loop, we call the function pt_on_loop (see Algorithm 16). Note that, if loop indices are not specified for a statement in the map function, the algorithm posits ibegin equal to the lower bound of the loop and iend equal to the upper one. In this function, a composition of two transformers is computed, depending on the value of mode: (1) the prologue transformer, from the iteration ibegin until the upper bound of the loop, or until the iteration iend if Send belongs to this loop; and (2) the epilogue transformer, from the lower bound of the loop until the upper bound of the loop, or until the iteration iend if Send belongs to this loop. These two transformers are computed using the function iter, which we show in Algorithm 17.

iter computes the transformer of a number of iterations of a loop S when loop bounds are constant or symbolic functions, via the predicate function is_constant_p on an expression, which returns true if this expression is constant. It computes a loop transformer that is the transformer of the proper number of iterations of this loop: it is equal to the transformer of the looping condition Tenter, composed with the transformer of the loop body, composed with the transformer of the incrementation statement Tinc. In practice, when h - l is not a constant, a transitive closure approximation T* is computed using the function affine_derivative_closure already implemented in PIPS, which generally provides an approximation of the transformer T^(h-l+1).

The case where ibegin = iend means that we execute only one iteration of the loop, from Sbegin to Send. Therefore, we compute Ts, i.e., a transformer on the loop body with the sequence mode, using Function pt_on_loop_one_iteration, presented in Algorithm 17.


ALGORITHM 16: Case for loops of the path transformer algorithm

function pt_on_loop(S, (Mbegin,Sbegin), (Mend,Send), mode)
  forloop(I, Elower, Eupper, Sbody) = S
  low = ibegin = (undefined(Mbegin(I))) ? Elower : Mbegin(I)
  high = iend = (undefined(Mend(I))) ? Eupper : Mend(I)
  Tinit = T(I = Elower)
  Tenter = T(I <= Eupper)
  Tinc = T(I = I+1)
  Texit = T(I > Eupper)
  if (mode = permanent) then
    T = iter(S, (Mbegin,Sbegin), (Mend,Send), Elower, Eupper)
    return Tinit ∘ T ∘ Texit
  Ts = pt_on_loop_one_iteration(S, (Mbegin,Sbegin), (Mend,Send))
  Tp = Te = p = e = id
  if (mode = prologue or mode = sequence) then
    if (Sbegin ⊂ S)
      p = pt_on(Sbody, (Mbegin,Sbegin), (Mend,Send), prologue) ∘ Tinc
      low = ibegin+1
    high = (mode = sequence) ? iend-1 : Eupper
    Tp = p ∘ iter(S, (Mbegin,Sbegin), (Mend,Send), low, high)
  if (mode = epilogue or mode = sequence) then
    if (Send ⊂ S)
      e = Tenter ∘ pt_on(Sbody, (Mbegin,Sbegin), (Mend,Send), epilogue)
      high = (mode = sequence) ? -1 : iend-1
    low = Elower
    Te = iter(S, (Mbegin,Sbegin), (Mend,Send), low, high) ∘ e
  T = Tp ∘ Te
  T = ((Sbegin ⊄ S) ? Tinit : id) ∘ T ∘ ((Send ⊄ S) ? Texit : id)
  if (is_constant_p(Eupper) and is_constant_p(Elower)) then
    if (mode = sequence and ibegin = iend) then
      return Ts
    else
      return T
  else
    return T ⊔ Ts
end

This eliminates the statements outside the path from Sbegin to Send. Moreover, since with symbolic parameters we cannot have this precision (we may execute one iteration or more), a union of T and Ts is returned. The result should be composed with the transformer of the index initialization Tinit and/or the exit condition of the loop Texit, depending on the positions of Sbegin and Send in the loop.

An example of the result provided by the path transformer algorithm is illustrated in Figure 6.13. The left side presents a code that contains a sequence of two loops and an incrementation instruction, which makes the computation of the path transformer necessary when performing parallelization.


ALGORITHM 17: Transformers for one iteration and a constant/symbolic number of iterations h - l + 1 of a loop S

function pt_on_loop_one_iteration(S, (Mbegin,Sbegin), (Mend,Send))
  forloop(I, Elower, Eupper, Sbody) = S
  Tinit = T(I = Elower)
  Tenter = T(I <= Eupper)
  Tinc = T(I = I+1)
  Texit = T(I > Eupper)
  Ts = pt_on(Sbody, (Mbegin,Sbegin), (Mend,Send), sequence)
  Ts = (Sbegin ⊂ S and Send ⊄ S) ? Ts ∘ Tinc ∘ Texit : Ts
  Ts = (Send ⊂ S and Sbegin ⊄ S) ? Tinit ∘ Tenter ∘ Ts : Ts
  return Ts
end

function iter(S, (Mbegin,Sbegin), (Mend,Send), l, h)
  forloop(I, Elower, Eupper, Sbody) = S
  Tenter = T(I <= h)
  Tinc = T(I = I+1)
  Tbody = pt_on(Sbody, (Mbegin,Sbegin), (Mend,Send), permanent)
  Tcomplete_body = Tenter ∘ Tbody ∘ Tinc
  if (is_constant_p(Eupper) and is_constant_p(Elower))
    return T(I = l) ∘ (Tcomplete_body)^(h-l+1)
  else
    Tstar = affine_derivative_closure(Tcomplete_body)
    return Tstar
end

Indeed, comparing array accesses in the two loops must take into account the fact that the n used in the first loop is not the same as the n used in the second one, because it is incremented between the two loops. The right side shows the resulting path transformer between the two statements Sb and Se, i.e., the result of the call path_transformer(S, (Mb,Sb), (Me,Se)). We suppose here that Mb(i) = 0 and Me(j) = n - 1. Notice that the constraint n = n_init + 1 is important to detect that n is changed between the two loops via the n++ instruction.

An optimization, for the case of a statement S of type sequence or loop, would be to return the transformer T(S) of S if S contains neither Sbegin nor Send.


S:
 l1:
  for (i = 0; i < n; i++) {
    Sb:  a[i] = a[i]+1;
         k1++;
  }
  n++;
 l2:
  for (j = 0; j < n; j++) {
         k2++;
    Se:  b[j] = a[j]+a[j+1];
  }

path_transformer(S, (Mb,Sb), (Me,Se)) =
  T(i, j, k1, k2, n) {
    i + k1_init = i_init + k1,
    j + k2_init = k2 - 1,
    n = n_init + 1,
    i_init + 1 ≤ i,
    n_init ≤ i,
    k2_init + 1 ≤ k2,
    k2 ≤ k2_init + n
  }

Figure 6.13: An example of a C code and the path transformer between Sb and Se

6.4.3 Operations on Regions Using Path Transformers

In order to make possible operations combining the sets of regions of two statements S1 and S2, the set of regions R1 of S1 must be brought into the same store as the second set of regions R2 of S2. Therefore, we first compute the path transformer that links the two statements of the two sets of regions, as follows:

T12 = path_transformer(S, (M1, S1), (M2, S2))

where S is a statement that verifies S1 ⊂ S and S2 ⊂ S; M1 and M2 contain information about the indices of these statements in the loops possibly enclosing them in S. Then, we compose the second set of regions R2 with T12 to bring R2 into a common memory store with R1. We thus use the operation region_transformer_compose, already implemented in PIPS [34], as follows:

R′2 = region_transformer_compose(R2, T12)

We thus update the operations defined in Section 2.6.3: (1) the intersection of the two sets of regions R1 and R2 is obtained via this call:

regions_intersection(R1, R′2)

(2) the difference between R1 and R2 is obtained via this call:

regions_difference(R1, R′2)

and (3) the union of R1 and R2 is obtained via this call:


regions_union(R1, R′2)

For example, if we want to compute the RAW dependences between the Write regions Rw1 of Loop l1 and the Read regions Rr2 of Loop l2 in the code of Figure 6.13, we must first find T12, the path transformer between the two loops. The result is then:

RAW(l1, l2) = regions_intersection(Rw1, region_transformer_compose(Rr2, T12))

If we assume that the size of Arrays a and b is n = 20, for example, we obtain the following results:

Rw1 = <a(φ1)-W-EXACT-{0 ≤ φ1, φ1 ≤ 19, φ1 + 1 ≤ n}>
Rr2 = <a(φ1)-R-EXACT-{0 ≤ φ1, φ1 ≤ 19, φ1 ≤ n}>
T12 = T(i, k1, n) {k1 = k1_init + i, n = n_init + 1, i ≥ 0, i ≥ n_init}
R′r2 = <a(φ1)-R-EXACT-{0 ≤ φ1, φ1 ≤ 19, φ1 ≤ n + 1}>

Rw1 ∩ R′r2 = <a(φ1)-WR-EXACT-{0 ≤ φ1, φ1 ≤ 19, φ1 + 1 ≤ n}>

Notice how the φ1 accesses to Array a differ in Rr2 and R′r2 (after composition with T12), because of the equality n = n_init + 1 in T12, due to the modification of n between the two loops via the n++ instruction.

In our context of combining path transformers with regions to compute communication costs, dependences and data volumes between the vertices of a DAG, for each loop of index i, M(i) must be set to the lower bound of the loop and iend to the upper one (the default values). We are not interested here in computing the path transformer between statements of specific iterations. However, our algorithm is general enough to handle, for example, loop parallelism, or transformations on loops where one wants access to the statements of a particular iteration.

6.5 BDSC-Based Hierarchical Scheduling (HBDSC)

Now that all the information needed by the basic version of BDSC presented in the previous chapter has been gathered, we detail in this section how we suggest to adapt it to the different SDGs linked hierarchically via the mapping function H, introduced in Section 6.2.2 on page 96, in order to eventually generate nested parallel code when possible. We adopt in this section the unstructured parallel programming model of Section 4.2.2, since it offers the freedom to implement arbitrary parallel patterns and since SDGs implement this model. Therefore, we use the parallel unstructured construct of SPIRE to encode the generated parallel code.


6.5.1 Closure of SDGs

An SDG G of a nested sequence S may depend upon statements outside S. We introduce the function closure, defined in Algorithm 18, to provide a self-contained version of G. There, G is completed with a set of entry vertices and edges in order to recover the dependences coming from outside S. We use the predecessors τp in the DDG D to find these dependences Rs; for each dependence, we create an entry vertex and schedule it in the same cluster as τp. Function reference is used here to find the reference expression of the dependence. Note that these import vertices will be used to schedule S using BDSC, in order to compute the updated values of tlevel and blevel impacted by the communication cost of each vertex in G relative to all the SDGs in H. We set their task time to 0.

ALGORITHM 18: Computation of the closure of the SDG G of a statement S = sequence(S1;...;Sn)

function closure(sequence(S1;...;Sn), G)
  D = DDG(sequence(S1;...;Sn))
  foreach Si ∈ {S1, S2, ..., Sn}
    τd = statement_vertex(Si, D)
    τk = statement_vertex(Si, G)
    foreach τp such that (τp ∈ predecessors(τd, D) and τp ∉ vertices(G))
      Rs = regions_intersection(read_regions(Si),
                                write_regions(vertex_statement(τp)))
      foreach R ∈ Rs
        sentry = import(reference(R))
        τentry = new_vertex(sentry)
        cluster(τentry) = cluster(τp)
        add_vertex(τentry, G)
        add_edge((τentry, τk), G)
  return G
end

For instance, in the C code given in the left of Figure 6.3, parallelizing S0 may decrease its execution time, but it may also increase the execution time of the loop indexed by i; this could be due to communications from outside S0, if we presume, to simplify, that every statement is scheduled on a different cluster. The corresponding SDG of S0, presented in Figure 6.4, is not complete, because of dependences, not explicit in this DAG, coming from outside S0. In order to make G0 self-contained, we add these dependences via a set of additional import entry vertices τentry, scheduled in the same clusters as the corresponding statements outside S0. We extract these dependences from the DDG presented in the right of Figure 6.3, as illustrated in Function closure; the closure of G0 is given in Figure 6.14. The scheduling of S0 needs to take into account all the dependences coming from outside S0: we thus add three vertices to encode the communication costs of i, a[i] and c at the entry of S0.



Figure 6.14: Closure of the SDG G0 of the C code S0 in Figure 6.3, with additional import entry vertices

6.5.2 Recursive Top-Down Scheduling

Hierarchically scheduling a given statement S of SDG H(S) in a cluster κ is seen here as the definition of a hierarchical schedule σ that maps each substatement s of S to σ(s) = (s′, κ, n). If there are enough processor and memory resources to schedule S using BDSC, (s′, κ, n) is a triplet made of a parallel statement s′ = parallel(σ(s)), the cluster κ = cluster(σ(s)) where s is allocated, and the number n = nbclusters(σ(s)) of clusters that the inner scheduling of s′ requires. Otherwise, scheduling is impossible, and the program stops. In a scheduled statement, all sequences are replaced by parallel unstructured statements.

A successful call to the HBDSC(S, H, κ, P, M, σ) function, defined in Algorithm 19, which assumes that P is strictly positive, yields a new version of σ that schedules S into κ and takes into account all the substatements of S; only P clusters, with a data size of at most M each, can be used for scheduling. σ[S → (S′, κ, n)] is the function equal to σ except for S, where its value is (S′, κ, n). H is the function that yields an SDG for each S to be scheduled using BDSC.
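To fix ideas, σ can be viewed as a dictionary from statements to such triplets. The C sketch below is only illustrative; the type names are hypothetical and do not correspond to the actual PIPS data structures:

/* Illustrative sketch of one hierarchical schedule entry; names are hypothetical. */
typedef struct statement_s statement;   /* a (sequential or parallel) statement    */
typedef struct cluster_s   cluster;     /* a logical processor (cluster)           */

typedef struct {
  statement *parallel_stmt;  /* s' = parallel(sigma(s))                            */
  cluster   *alloc_cluster;  /* kappa = cluster(sigma(s))                          */
  int        nbclusters;     /* n = nbclusters(sigma(s)), clusters used inside s'  */
} schedule_entry;

/* sigma itself maps every scheduled substatement s to its schedule_entry,
   for instance via a hash table keyed by the statement.                    */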

Our approach is top-down, in order to yield tasks that are as coarse grained as possible when dealing with sequences.


ALGORITHM 19: BDSC-based update of Schedule σ for Statement S of SDG H(S), with P and M constraints

function HBDSC(S, H, κ, P, M, σ)
  switch (S)
    case call:
      return σ[S → (S, κ, 0)]
    case sequence(S1;...;Sn):
      Gseq = closure(S, H(S))
      G′ = BDSC(Gseq, P, M, σ)
      iter = 0
      do
        σ′ = HBDSC_step(G′, H, κ, P, M, σ)
        G = G′
        G′ = BDSC(Gseq, P, M, σ′)
        if (clusters(G′) = ∅) then
          abort('Unable to schedule')
        iter++
        σ = σ′
      while (completion_time(G′) < completion_time(G) and
             |clusters(G′)| ≤ |clusters(G)| and
             iter ≤ MAX_ITER)
      return σ[S → (dag_to_unstructured(G), κ, |clusters(G)|)]
    case forloop(I, Elower, Eupper, Sbody):
      σ′ = HBDSC(Sbody, H, κ, P, M, σ)
      (S′body, κbody, nbclustersbody) = σ′(Sbody)
      return σ′[S → (forloop(I, Elower, Eupper, S′body), κ, nbclustersbody)]
    case test(Econd, St, Sf):
      σ′ = HBDSC(St, H, κ, P, M, σ)
      σ″ = HBDSC(Sf, H, κ, P, M, σ′)
      (S′t, κt, nbclusterst) = σ″(St)
      (S′f, κf, nbclustersf) = σ″(Sf)
      return σ″[S → (test(Econd, S′t, S′f), κ, max(nbclusterst, nbclustersf))]
end

In the HBDSC function, we distinguish four cases of statements. First, the constructs of loops8 and tests are simply traversed, scheduling information being recursively gathered in different SDGs. Then, for a call statement, there is no descent into the call graph: the call statement is returned.

8Regarding parallel loops: since we adopt the task parallelism paradigm, note that, initially, it may be useful to apply the tiling transformation and then perform a full unrolling of the outer loop (we give more details in the protocol of our experiments in Section 8.4). This way, the input code contains more potentially parallel tasks, resulting from the initial (parallel) loop.


In order to handle the corresponding called function, the different functions have to be treated separately. Finally, for a sequence S, one first accesses its SDG and computes the closure Gseq of this DAG, using the function closure defined above. Next, Gseq is scheduled using BDSC to generate a scheduled SDG G′.

The hierarchical scheduling process is then recursively performed, within Function HBDSC_step defined in Algorithm 20, to take into account the substatements of S, on each statement s of each task of G′. There, G′ is traversed along a topological sort-ordered descent using the function topsort(G′), which yields a list of stages of computation, each cluster_stage being a list of independent lists L of tasks τ, one L for each cluster κ generated by BDSC for this particular stage, in the topological order.

The recursive hierarchical scheduling via HBDSC, within the function HBDSC_step, of each statement s = vertex_statement(τ) may take advantage of at most P′ available clusters, since |cluster_stage| clusters are already reserved to schedule the current stage cluster_stage of tasks for Statement S. It yields a new scheduling function σs. Otherwise, if no clusters are available, all the statements enclosed in s are scheduled on the same cluster κ as their parent. We use the straightforward function same_cluster_mapping (not provided here) to affect recursively (se, κ, 0) to σ(se), for each se enclosed in s.

Figure 6.15 illustrates the various entities involved in the computation of such a scheduling function. Note that one needs to be careful, in HBDSC_step, to ensure that each rescheduled substatement s is allocated a number of clusters consistent with the one used when computing its parallel execution time: we check the condition nbclusters′s ≥ nbclusterss, which ensures that the parallelism assumed when computing time complexities within s remains available.

Figure 6.15: topsort(G) for the hierarchical scheduling of sequences

Cluster allocation information, for each substatement s whose vertex in G′ is τ, is maintained in σ via the recursive call to HBDSC, this time with the current cluster κ = cluster(τ). For the non-sequence constructs in Function HBDSC, cluster information is set to κ, the current cluster.


ALGORITHM 20: Iterative hierarchical scheduling step for DAG fixpoint computation

function HBDSC_step(G′, H, κ, P, M, σ)
  foreach cluster_stage ∈ topsort(G′)
    P′ = P - |cluster_stage|
    foreach L ∈ cluster_stage
      nbclustersL = 0
      foreach τ ∈ L
        s = vertex_statement(τ)
        if (P′ ≤ 0) then
          σ = σ[s → same_cluster_mapping(s, κ, σ)]
        else
          nbclusterss = (s ∈ domain(σ)) ? nbclusters(σ(s)) : 0
          σs = HBDSC(s, H, cluster(τ), P′, M, σ)
          nbclusters′s = nbclusters(σs(s))
          if (nbclusters′s ≥ nbclusterss and
              task_time(τ, σ) ≥ task_time(τ, σs)) then
            nbclusterss = nbclusters′s
            σ = σs
          nbclustersL = max(nbclustersL, nbclusterss)
      P′ -= nbclustersL
  return σ
end


The scheduling of a sequence yields a parallel unstructured statement: we use the function dag_to_unstructured(G), which returns a SPIRE unstructured statement Su from the SDG G, where the vertices of G become the statement control vertices su of Su and the edges of G constitute the lists of successors Lsucc of the su, while the lists of predecessors Lpred of the su are deduced from Lsucc. Note that the vertices and edges of G are not changed by scheduling; the scheduling information, however, is saved in σ.
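The overall shape of this conversion can be sketched as follows in C; every type and accessor name here is hypothetical and only meant to illustrate the vertex/edge traversal, not to reflect the actual PIPS/SPIRE implementation:

/* Hypothetical SDG and SPIRE-side types and accessors. */
typedef struct dag_s     dag;      /* the scheduled SDG      */
typedef struct vertex_s  vertex;   /* an SDG vertex          */
typedef struct control_s control;  /* a SPIRE control vertex */

extern int      nb_vertices(const dag *G);
extern vertex  *vertex_of(const dag *G, int i);
extern int      vertex_index(const dag *G, const vertex *v);
extern int      nb_successors(const vertex *v);
extern vertex  *successor_of(const vertex *v, int j);
extern control *make_control(vertex *v);                 /* control vertex for v */
extern void     add_successor(control *c, control *s);   /* appends to Lsucc     */
extern void     add_predecessor(control *c, control *p); /* appends to Lpred     */

/* Sketch: one control vertex per SDG vertex; Lsucc mirrors the SDG edges,
   and Lpred is deduced by inverting them. */
void dag_to_unstructured_sketch(const dag *G, control **ctrl)
{
  int n = nb_vertices(G);
  for (int i = 0; i < n; i++)
    ctrl[i] = make_control(vertex_of(G, i));
  for (int i = 0; i < n; i++) {
    vertex *v = vertex_of(G, i);
    for (int j = 0; j < nb_successors(v); j++) {
      int k = vertex_index(G, successor_of(v, j));
      add_successor(ctrl[i], ctrl[k]);
      add_predecessor(ctrl[k], ctrl[i]);
    }
  }
}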

As an application of our scheduling algorithm to a real application, Figure 6.16 shows the scheduled SDG for Harris obtained using P = 3, generated automatically with PIPS using the Graphviz tool.

6.5.3 Iterative Scheduling for Resource Optimization

BDSC is called in HBDSC before substatements are hierarchically scheduled. However, a unique pass over substatements could be suboptimal, since parallelism may exist within substatements; it may be discovered by later recursive calls to HBDSC.


Figure 6.16: Scheduled SDG for Harris using P = 3 cores; the scheduling information, via cluster(τ), is also printed inside each vertex of the SDG

Yet, if this parallelism had been known ahead of time, the previous values of task_time used by BDSC would possibly have been smaller, which could have had an impact on the higher-level scheduling. In order to address this issue, our hierarchical scheduling algorithm iterates the top-down pass HBDSC_step on the new DAG G′, in which BDSC takes these modified task complexities into account; iteration continues while G′ provides a smaller DAG scheduling length than G and while the iteration limit MAX_ITER has not been reached. We compute the completion time of the DAG G as follows:

completion_time(G) = max_{κ ∈ clusters(G)} cluster_time(κ)
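This definition transcribes directly into a maximum over the clusters used by G; the following C sketch uses hypothetical accessor names:

/* Hypothetical accessors on a scheduled DAG. */
typedef struct sched_dag_s     sched_dag;
typedef struct sched_cluster_s sched_cluster;

extern int            nb_clusters_of(const sched_dag *G);
extern sched_cluster *cluster_of(const sched_dag *G, int k);
extern double         cluster_time(const sched_cluster *c);  /* end date of the cluster */

/* completion_time(G) = max over the clusters kappa of cluster_time(kappa). */
double completion_time(const sched_dag *G)
{
  double max_time = 0.0;
  for (int k = 0; k < nb_clusters_of(G); k++) {
    double t = cluster_time(cluster_of(G, k));
    if (t > max_time)
      max_time = t;
  }
  return max_time;
}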

One constraint, due to the iterative nature of the hierarchical scheduling, is that, in BDSC, zeroing cannot be performed between the import entry vertices and their successors. This keeps an independence, in terms of allocated clusters, between the different levels of the hierarchy. Indeed, at a higher level, for S, if we assume that we have scheduled the parts Se inside (hierarchically) S, attempting to reschedule S iteratively cancels the previous schedule of S but maintains the schedule of Se, and vice versa. Therefore, for each sequence, we have to deal with a new set of clusters, and thus zeroing cannot be performed between these entry vertices and their successors.

Note that our top-down, iterative, hierarchical scheduling approach also helps dealing with limited memory resources. If BDSC fails at first because not enough memory is available for a given task, the HBDSC_step function is nonetheless called to schedule nested statements, possibly loosening the memory constraints by distributing some of the work on less memory-challenged additional clusters. This might enable the subsequent call to BDSC to succeed.


For instance, in Figure 6.17, we assume that P = 3 and that the memory M of each available cluster κ0, κ1 or κ2 is less than the size of Array a. Therefore, the first application of BDSC on S = sequence(S0;S1) will fail (“Not enough memory”). Thanks to the while loop introduced in the hierarchical scheduling algorithm, the scheduling of the hierarchical parts inside S0 and S1 hopefully minimizes the task data of each statement. Thus, a second schedule succeeds, as illustrated in Figure 6.18. Note that the edge cost between tasks S0 and S1 is equal to 0 (the region on the edge in the figure is equal to the empty set). Indeed, the communication is made at an inner level, between κ1 and κ3, due to the entry vertex added to H(S1body). This cost is obtained after the hierarchical scheduling of S; it is thus a parallel edge cost: we define the parallel task data and edge cost of a parallel statement task in Section 6.5.5.

H(S):

  S0:  for (i = 1; i <= n; i++) {   /* S0body */
         a[i] = foo(); bar(b[i]);
       }
            |
            |  <a(φ1)-WR-EXACT-{1 ≤ φ1, φ1 ≤ n}>
            v
  S1:  for (i = 1; i <= n; i++) {   /* S1body */
         sum += a[i]; c[i] = baz();
       }

Figure 6.17: After the first application of BDSC on S = sequence(S0;S1), failing with “Not enough memory”

6.5.4 Complexity of HBDSC Algorithm

Theorem 2. The time complexity of Algorithm 19 (HBDSC) over Statement S is O(k^n), where n is the number of call statements in S and k is a constant greater than 1.

Proof. Let t(l) be the worst-case time complexity of our hierarchical scheduling algorithm on a structured statement S of hierarchical level9 l.

9Levels represent the hierarchy structure between statements of the AST and are counted up from the leaves to the root.


H(S):
  S0 (on κ0):  for (i = 1; i <= n; i++) { /* S0body */ a[i] = foo(); bar(b[i]); }
  S1 (on κ0):  for (i = 1; i <= n; i++) { /* S1body */ sum += a[i]; c[i] = baz(); }

H(S0body):
  a[i] = foo()   (on κ1)          bar(b[i])   (on κ2)

closure(S1body, H(S1body)):
  import(a[i])   (on κ1)  --- <a(φ1)-WR-EXACT-{φ1 = i}> --->  sum += a[i]   (on κ3)
  c[i] = baz()   (on κ4)

Figure 6.18: Hierarchically (via H) scheduled SDGs with memory resource minimization, after the second application of BDSC (which succeeds) on sequence(S0;S1) (κi = cluster(σ(Si))). To keep the picture readable, only communication edges are shown in these SDGs

Time complexity increases significantly only for sequences, loops and tests being simply managed by straightforward recursive calls of HBDSC on substatements. For a sequence S, t(l) is proportional to the time complexity of BDSC followed by a call to HBDSC_step; the proportionality constant is k = MAX_ITER (supposed to be greater than 1).

The time complexity of BDSC for a sequence of m statements is at most O(m^3) (see Theorem 1). Assuming that all subsequences have a maximum number m of (possibly compound) statements, the time complexity of the hierarchical scheduling step function is the time complexity of the topological sort algorithm followed by a recursive call to HBDSC, and is thus O(m^2 + m t(l - 1)). Thus, t(l) is at most proportional to k(m^3 + m^2 + m t(l - 1)) ∼ k m^3 + k m t(l - 1). Since t(l) is an arithmetico-geometric series, its analytical value is


t(l) = ((km)^l (km^3 + km - 1) - km^3) / (km - 1) ∼ (km)^l m^2.

Let lS be the level of the whole Statement S. The worst performance occurs when the structure of S is flat, i.e., when lS ∼ n and m is O(1); hence t(n) = t(lS) ∼ k^n.

Even though the worst-case time complexity of HBDSC is exponential, we expect, and our experiments suggest, that it behaves more tamely on actual, properly structured code. Indeed, note that lS ∼ log_m(n) if S is balanced, for some large constant m; in this case, t(n) ∼ (km)^log(n), showing a subexponential time complexity.

6.5.5 Parallel Cost Models

In Section 6.3, we present the sequential cost models usable in the case of sequential codes, i.e., for each first call to BDSC. When a substatement Se of S (Se ⊂ S) is parallelized, the parameters task_time, task_data and edge_cost are modified for Se, and thus for S. Hierarchical scheduling must therefore use extended definitions of task_time, task_data and edge_cost for tasks τ whose statements S = vertex_statement(τ) are parallel unstructured statements, extending the definitions provided in Section 6.3, which still apply to non-unstructured statements. For such a case, we assume that BDSC and the other relevant functions take σ as an additional argument, to access the scheduling result associated with statement sequences, and handle the modified definitions of task_time, edge_cost and task_data. These functions are defined as follows.

Parallel Task Time

When the statement S = vertex_statement(τ) corresponds to a parallel unstructured statement (i.e., such that S ∈ domain(σ)), let Su = parallel(σ(S)) be this parallel unstructured statement. We define task_time as follows, where Function clusters returns the set of clusters used to schedule Su and Function control_statements returns the statements of Su:

task_time(τ, σ) = max_{κ ∈ clusters(Su, σ)} cluster_time(κ)

clusters(Su, σ) = {cluster(σ(s)) : s ∈ control_statements(Su)}

Otherwise, the previous definition of task_time on S, provided in Section 6.3, is used. The recursive computation of complexities accesses this new definition in the case of parallel unstructured statements.

Parallel Task Data and Edge Cost

To define the task_data and edge_cost parameters, which depend on the computation of array regions10, we need extended versions of the read_regions and write_regions functions that take into account the allocation information to clusters. These functions take σ as an additional argument to access the allocation information cluster(σ):

10Note that every operation combining the regions of two statements used in this chapter assumes the use of the path transformer, via the function region_transformer_compose, as illustrated in Section 6.4.3, in order to bring the regions of the two statements into a common memory store.



S = vertex_statement(τ)
task_data(τ, σ) = regions_union(read_regions(S, σ), write_regions(S, σ))

Rw1 = write_regions(vertex_statement(τ1), σ)
Rr2 = read_regions(vertex_statement(τ2), σ)
edge_cost_bytes(τ1, τ2, σ) = Σ_{r ∈ regions_intersection(Rw1, Rr2)} ehrhart(r)
edge_cost(τ1, τ2, σ) = α + β × edge_cost_bytes(τ1, τ2, σ)

When statements are parallel unstructured statements Su, we propose to add a condition on the accumulation of regions in the sequential recursive computation functions of these regions, as follows, where Function stmts_in_cluster returns the set of statements scheduled in the same cluster as Su:

read_regions(Su, σ) = regions_union_{s ∈ stmts_in_cluster(Su, σ)} read_regions(s, σ)

write_regions(Su, σ) = regions_union_{s ∈ stmts_in_cluster(Su, σ)} write_regions(s, σ)

stmts_in_cluster(Su, σ) = {s ∈ control_statements(Su) : cluster(σ(s)) = cluster(σ(Su))}

Otherwise, the sequential definitions of read_regions and write_regions on S, provided in Section 6.3, are used. The recursive computation of array regions accesses this new definition in the case of parallel unstructured statements.

6.6 Related Work: Task Parallelization Tools

In this section, we review several other approaches that intend to automate the parallelization of programs, using different granularities and scheduling policies. Given the breadth of the literature on this subject, we limit this presentation to approaches that focus on static list-scheduling methods, and compare them with BDSC.

Sarkar's work on the partitioning and scheduling of parallel programs [96] for multiprocessors introduced a compile-time method in which a GR (Graphical Representation) program is partitioned into parallel tasks at compile time.



A GR graph has four kinds of vertices: “simple”, to represent an indivisible sequential computation; “function call”; “parallel”, to represent parallel loops; and “compound”, for conditional instructions. Sarkar presents an approximation parallelization algorithm. Starting with an initial fine-granularity partition P0, tasks (chosen by heuristics) are iteratively merged until the coarsest partition Pn (with one task containing all vertices) is reached, after n iterations. The partition Pmin with the smallest parallel execution time in the presence of overhead (scheduling and communication overhead) is chosen. For scheduling, Sarkar introduces the EZ (Edge-Zeroing) algorithm, which uses bottom levels for ordering; it is based on edge weights for clustering: all edges are examined, from the largest edge weight to the smallest, and the algorithm proceeds by zeroing the highest edge weight if the completion time decreases. While this algorithm is based only on the bottom level, for an unbounded number of processors, and does not recompute the priorities after zeroings, BDSC adds resource constraints and is based on both bottom levels and dynamic top levels.

The OSCAR Fortran Compiler [62] is used as a multigrain parallelizer from Fortran to parallelized OpenMP Fortran. OSCAR partitions a program into a macro-task graph, where vertices represent macro-tasks of three kinds, namely basic, repetition and subroutine blocks. The coarse-grain task parallelization proceeds as follows. First, the macro-tasks are generated by decomposition of the source program. Then, a macro-flow graph is generated to represent data and control dependences on macro-tasks. The macro-task graph is subsequently generated via the analysis of parallelism among macro-tasks, using an earliest-executable-condition analysis that represents the conditions under which a given macro-task may begin its execution at the earliest time, assuming precedence constraints. If a macro-task graph has only data dependence edges, macro-tasks are assigned to processors by static scheduling. If a macro-task graph has both data and control dependence edges, macro-tasks are assigned to processors at run time by a dynamic scheduling routine. In addition to dealing with a richer set of resource constraints, BDSC targets both shared and distributed memory systems, with a cost model based on communication, used data and time estimations.

Pedigree [83] is a compilation tool based on the program dependence graph (PDG). The PDG is extended by adding a new type of vertex, a “Par” vertex, which groups children vertices reachable via the same branch conditions. Pedigree proceeds by estimating a latency for each vertex and data dependence edge weights in the PDG. The scheduling process orders the children and assigns them to a subset of the processors. For scheduling, vertices with minimum early and late times are given the highest priority; the highest-priority ready vertex is selected for scheduling, based on the synchronization overhead and latency. While Pedigree operates on assembly code, PIPS and our extension for task parallelism using BDSC offer a higher-level, source-to-source parallelization framework. Moreover, Pedigree-generated code is specialized for symmetric multiprocessors only, while BDSC targets many architecture types, thanks to its resource constraints and cost models.



The SPIR (Signal Processing Intermediate Representation) compiler [31] takes a sequential dataflow program as input and generates a multithreaded parallel program for an embedded multicore system. First, SPIR builds a stream graph, where a vertex corresponds to a kernel function call or to the condition of an “if” statement, and an edge denotes a transfer of data between two kernel function calls or a control transfer by an “if” statement (true or false). Then, for task scheduling purposes, given a stream graph and a target platform, the task scheduler assigns each vertex to a processor of the target platform; it allocates stream buffers and generates DMA operations under given memory and timing constraints. The degree of automation of BDSC is larger than SPIR's, because this latter system needs several keyword extensions, plus C code denoting the streaming scope within applications. Also, the granularity in SPIR is the function, whereas BDSC uses several granularity levels.

We collect in Table 6.2 the main characteristics of each parallelization tool addressed in this section.

                blevel  tlevel  Resource     Dependence       Execution    Communication   Memory
                                constraints  control   data   time estim.  time estim.     model
HBDSC             ✓       ✓        ✓            ✗        ✓        ✓             ✓          Shared/distributed
Sarkar's work     ✓       ✗        ✗            ✗        ✓        ✓             ✓          Shared/distributed
OSCAR             ✗       ✓        ✗            ✓        ✓        ✓             ✗          Shared
Pedigree          ✓       ✓        ✗            ✓        ✓        ✓             ✗          Shared
SPIR              ✗       ✓        ✓            ✓        ✓        ✓             ✓          Shared

Table 6.2: Comparison summary between different parallelization tools

6.7 Conclusion

This chapter presents our BDSC-based hierarchical task parallelization approach, which uses a new data structure called SDG and a symbolic execution and communication cost model based on either static code analysis


or a dynamic, instrumentation-based assessment tool. The HBDSC scheduling algorithm maps hierarchically, via a mapping function H, each task of the SDG to one cluster. The result is the generation of a parallel code represented in SPIRE(PIPS IR), using the parallel unstructured construct. We have integrated this approach within the PIPS automatic parallelization compiler infrastructure (see Section 8.2 for the implementation details).

The next chapter illustrates two parallel transformations of SPIRE(PIPS IR) parallel code and the generation of OpenMP and MPI codes from SPIRE(PIPS IR).

An earlier version of the work presented in the previous chapter (Chapter 5) and a part of the work presented in this chapter have been submitted to Parallel Computing [65].

Chapter 7

SPIRE-Based Parallel Code Transformations and Generation

Satisfaction lies in the effort, not in the attainment; full effort is full victory.

Gandhi

In the previous chapter, we describe how sequential programs are analyzed, scheduled and represented in a parallel intermediate representation. Their parallel versions are expressed as graphs using the parallel unstructured attribute of SPIRE, but such graphs are not directly expressible in parallel languages. To test the flexibility of this parallelization approach, this chapter covers three aspects of parallel code transformations and generation: (1) the transformation of SPIRE(PIPS IR)1 code from the unstructured parallel programming model to a structured (high-level) model using the spawn and barrier constructs, (2) the transformation from the shared to the distributed memory programming model, and (3) the generation of code targeting parallel languages from SPIRE(PIPS IR). We have implemented our SPIRE-derived parallel IR, together with the BDSC-based task parallelization algorithm, in the PIPS middle-end. We generate both OpenMP and MPI code from the same parallel IR, using the PIPS backend and its prettyprinters.

7.1 Introduction

Generally, the process of source-to-source code generation is a difficult task. In fact, compilers use at least two intermediate representations when transforming a source code written in one programming language, abstracted in a first IR, into a source code written in another programming language, abstracted in a second IR. Moreover, this difficulty is compounded whenever the resulting source code is to be executed on a machine not targeted by its language. The portability issue was a key factor when we designed SPIRE as a generic approach (Chapter 4). Indeed, representing different parallel constructs from different parallel programming languages makes the code generation process generic. In the context of this thesis, we need to generate the equivalent parallel version of a sequential code; representing the parallel code in SPIRE facilitates parallel code generation in different parallel programming languages.

1Recall that SPIRE(PIPS IR) means that the transformation function SPIRE is applied to the sequential IR of PIPS; it yields a parallel IR of PIPS.



The parallel code encoded in SPIRE(PIPS IR) that we generate in Chapter 6 uses an unstructured parallel programming model based on static task graphs. However, structured programming constructs increase the level of abstraction when writing parallel codes, and structured parallel programs are easier to analyze. Moreover, the parallel unstructured construct does not exist in parallel languages. In this chapter, we use SPIRE both for the input unstructured parallel code and for the output structured parallel code, by translating the SPIRE unstructured construct into spawn and barrier constructs. Another parallel transformation we address is the transformation of a SPIRE shared memory code into a SPIRE distributed memory one, in order to adapt the SPIRE code to message-passing libraries/languages such as MPI, and thus to distributed memory system architectures.

A parallel program transformation takes the parallel intermediate representation of a code fragment and yields a new, equivalent parallel representation for it. Such transformations are useful for optimization purposes, or to enable other optimizations that intend to improve execution time, data locality, energy, memory use, etc. While parallel code generation outputs a final code, written in a parallel programming language, that can be compiled by its compiler and run on a target machine, a parallel transformation generates an intermediate code written in a parallel intermediate representation.

In this chapter, a parallel transformation takes a hierarchical schedule σ, defined in Section 6.5.2 on page 118, and yields another schedule σ′, which maps each statement S in domain(σ) to σ′(S) = (S′, κ, n), such that S′ belongs to the resulting parallel IR, κ = cluster(σ(S)) is the cluster where S is allocated, and n = nbclusters(σ(S)) is the number of clusters that the inner scheduling of S′ requires. A cluster stands for a processor.

Today's typical architectures have multiple hierarchical levels: a multiprocessor has many nodes, each node has multiple clusters, each cluster has multiple cores, and each core has multiple physical threads. Therefore, in addition to distinguishing parallelism in terms of the abstraction level of parallel constructs, such as structured and unstructured, we can also classify parallelism into two types in terms of code structure: equilevel, when it is generated within a sequence, i.e., at zero hierarchy level, and hierarchical, or multilevel, when one manages the code hierarchy, i.e., multiple levels. In this chapter, we handle these two types of parallelism in order to enhance the performance of the generated parallel code.

The remainder of this chapter is structured into three sections. Firstly, Section 7.2 presents the first of the two parallel transformations we designed: the unstructured to structured parallel code transformation. Secondly, the transformation from shared memory to distributed memory codes is presented in Section 7.3. Moving to code generation issues, Section 7.4 details the generation of parallel programs from SPIRE(PIPS IR): first, Section 7.4.1 shows the mapping approach between SPIRE and different parallel languages; then, as illustrating cases, Section 7.4.2 introduces the generation of OpenMP and Section 7.4.3 the generation of MPI from SPIRE(PIPS IR). Finally, we conclude in Section 7.5. Figure 7.1 summarizes the parallel code transformations and generation steps applied in this chapter.



[Figure 7.1 (diagram): the unstructured SPIRE(PIPS IR) code, together with its schedule and SDG, goes through a structuring pass, yielding structured SPIRE(PIPS IR); a data transfer generation pass then produces a send/recv-based IR; from there, either OpenMP task generation (spawn → omp task) yields an OpenMP pragma-based IR, printed by the OpenMP prettyprinter as OpenMP code, or SPMDization (spawn → guards) yields an MPI-based IR, printed by the MPI prettyprinter as MPI code.]

Figure 7.1: Parallel code transformations and generation; blue indicates the contributions of this chapter, an ellipse a process, and a rectangle results


7.2 Parallel Unstructured to Structured Transformation

We present in Chapter 4 the application of the SPIRE methodology to the PIPS IR case; Chapter 6 shows how to construct unstructured parallel code presented in SPIRE(PIPS IR). In this section, we detail the algorithm transforming unstructured (mid, graph-based level) parallel code into structured (high, syntactic level) code expressed in SPIRE(PIPS IR), and the implementation of the structured model in PIPS. The goal behind structuring parallelism is to increase the level of abstraction of parallel programs and to simplify analysis and optimization passes. Another reason is the fact that the parallel unstructured construct does not exist in parallel languages. However, structuring has a cost in terms of performance if the code contains arbitrary parallel patterns. For instance, the example already presented in Figure 4.3, which uses unstructured constructs to form a graph, and copied again for convenience in Figure 7.2, can be represented in a low-level, unstructured way using events ((b) of Figure 7.2) and in a structured way using spawn2 and barrier constructs ((c) of Figure 7.2). The structured representation (c) minimizes control overhead, but it is a trade-off between structuring and performance: if the tasks have the same execution time, the structured way is better than the unstructured one, since no synchronizations using events are necessary. Therefore, depending on the run-time properties of the tasks, it may be better to run one form or the other.

7.2.1 Structuring Parallelism

Function unstructured_to_structured, presented in Algorithm 21, uses a hierarchical schedule σ, which maps each substatement s of S to σ(s) = (s′, κ, n) (see Section 6.5), to handle the new statements. It specifies how the unstructured scheduled statement parallel(σ(S)), presented in Section 6.5, is used to generate a structured parallel statement, also encoded in SPIRE(PIPS IR), using spawn and barrier constructs. Note that another transformation would be to generate events and spawn constructs from the unstructured constructs (as in (b) of Figure 7.2).

For an unstructured statement S (the other constructs are simply traversed; parallel statements are reconstructed using the original statements), its control vertices are traversed along a topological sort-ordered descent. Since an unstructured statement is, by definition, also a graph (control statements are the vertices, while the predecessor (resp. successor) control statements Sn of a control statement S define edges from Sn to S (resp. from S to Sn)), we use the same idea as in Section 6.5 (topsort), but this time starting from an unstructured statement.

2Note that, in this chapter, we assume for simplification purposes that constants, instead of temporary identifiers having the corresponding values, are used as the first parameter of spawn statements.


(a) [Control flow graph: entry nop; then x = 3 and y = 5; then u = x + 1 and z = x + y; exit nop.]

(b) Event-based representation:

    barrier(
        ex = newEvent(0)
        spawn(0,
            x = 3
            signal(ex)
            u = x + 1
        )
        spawn(1,
            y = 5
            wait(ex)
            z = x + y
        )
        freeEvent(ex)
    )

(c) Structured representation:

    barrier(
        spawn(0, x = 3)
        spawn(1, y = 5)
    )
    barrier(
        spawn(0, u = x + 1)
        spawn(1, z = x + y)
    )

Figure 7.2: Unstructured parallel control flow graph (a), and event-based (b) and structured (c) representations.

We thus use Function topsort_unstructured(S), which yields a list of stages of computation, each cluster stage being a list of independent lists L of statements s.

Cluster stages are seen as a set of fork-join structures; a fork-join structure can be seen as a set of spawn statements (to be launched in parallel) enclosed in a barrier statement. It can also be seen as a parallel sequence. In our implementation of structured code generation, we choose to generate spawn and barrier constructs instead of parallel sequence constructs because, in contrast to the parallel sequence construct, the spawn construct allows the implementation of recursive parallel tasks. Besides, we can represent a parallel loop via the spawn construct, but it cannot be implemented using a parallel sequence.

New statements have to be added to domain(σ). Therefore, in order to set the third argument of σ (nbclusters(σ)) for these new statements, we use the function structured_clusters, which harvests the clusters used to schedule the list of statements in a sequence.


ALGORITHM 21: Bottom-up hierarchical unstructured to structured SPIRE transformation, updating a parallel schedule σ for Statement S

    function unstructured_to_structured(S, p, σ)
        κ = cluster(σ(S)); n = nbclusters(σ(S))
        switch (S)
        case call:
            return σ
        case unstructured(Centry, Cexit, parallel):
            Sstructured = []
            foreach cluster_stage ∈ topsort_unstructured(S)
                (Sbarrier, σ) = spawns_barrier(cluster_stage, κ, p, σ)
                nbarrier = |structured_clusters(statements(Sbarrier), σ)|
                σ = σ[Sbarrier → (Sbarrier, κ, nbarrier)]
                Sstructured ||= [Sbarrier]
            S′ = sequence(Sstructured)
            return σ[S → (S′, κ, n)]
        case forloop(I, Elower, Eupper, Sbody):
            σ = unstructured_to_structured(Sbody, p, σ)
            S′ = forloop(I, Elower, Eupper, parallel(σ(Sbody)))
            return σ[S → (S′, κ, n)]
        case test(Econd, St, Sf):
            σ = unstructured_to_structured(St, p, σ)
            σ = unstructured_to_structured(Sf, p, σ)
            S′ = test(Econd, parallel(σ(St)), parallel(σ(Sf)))
            return σ[S → (S′, κ, n)]
    end

    function structured_clusters({S1, S2, ..., Sn}, σ)
        clusters = ∅
        foreach Si ∈ {S1, S2, ..., Sn}
            if (cluster(σ(Si)) ∉ clusters)
                clusters ∪= {cluster(σ(Si))}
        return clusters
    end

Recall that Function statements of a sequence returns its list of statements.

We use Function spawns_barrier, detailed in Algorithm 22, to form spawn and barrier statements. In a cluster stage, each list L for each cluster κs corresponds to a spawn statement Sspawn. In order to attach a cluster number, as an entity, to the spawn statement, as required by SPIRE, we associate a cluster number to the cluster κs returned by cluster(σ(s)) via the cluster_number function; we thus posit that two clusters are equal if they have the same cluster number. However, since BDSC maps tasks to logical clusters in Algorithm 19, with numbers in [0, p − 1],


we convert this logical number to a physical one using the formula nphysical = P − p + cluster_number(κs), where P is the total number of clusters available in the target machine.
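A minimal sketch of this renumbering (the helper name and its use of plain int values are ours, not part of PIPS):

    /* Logical-to-physical cluster renumbering used when generating
       spawn statements: BDSC numbers logical clusters in [0, p-1];
       the last p physical clusters of the machine, [P-p, P-1], are
       reserved for them. */
    static int physical_cluster_number(int P, int p, int logical_number)
    {
        return P - p + logical_number;
    }
    /* Example: with P = 4 and p = 2, logical cluster 1 maps to 4 - 2 + 1 = 3. */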

ALGORITHM 22: Spawns-barrier generation for a cluster stage

    function spawns_barrier(cluster_stage, κ, p, σ)
        p′ = p - |cluster_stage|
        SL = []
        foreach L ∈ cluster_stage
            L′ = []
            nbclustersL = 0
            foreach s ∈ L
                κs = cluster(σ(s))        // all s have the same schedule κs
                if (p′ > 0)
                    σ = unstructured_to_structured(s, p′, σ)
                L′ ||= [parallel(σ(s))]
                nbclustersL = max(nbclustersL, nbclusters(σ(s)))
            Sspawn = sequence(L′)
            nphysical = P - p + cluster_number(κs)
            cluster_number(κs) = nphysical
            synchronization(Sspawn) = spawn(nphysical)
            σ = σ[Sspawn → (Sspawn, κs, |structured_clusters(L′, σ)|)]
            SL ||= [Sspawn]
            p′ -= nbclustersL
        Sbarrier = sequence(SL)
        synchronization(Sbarrier) = barrier()
        return (Sbarrier, σ)
    end

Spawn statements SL of one cluster stage are collected in a barrier statement Sbarrier as follows. First, the list of statements SL is constructed, using the || symbol for the concatenation of two lists of arguments. Then, Statement Sbarrier is created using Function sequence from the list of statements SL. Synchronization attributes should be set for the new statements Sspawn and Sbarrier; we thus use the functions spawn(I) and barrier() to set the synchronization attribute, spawn or barrier, of the different statements.

Figure 7.3 depicts an example of a SPIRE structured representation (right-hand side) of a part of the C implementation of the main function of Harris, presented in the left-hand side. Note that the scheduling results from our BDSC-based task parallelization process using P = 3 (we use three clusters because the maximum parallelism in Harris is three). Here, up to three parallel tasks are enclosed in a barrier for each chain of functions Sobel, Multiplication and Gauss. The synchronization annotations spawn and barrier are printed within the SPIRE generated code;


Sequential C code:

    in = InitHarris();
    // Sobel
    SobelX(Gx, in);
    SobelY(Gy, in);
    // Multiply
    MultiplY(Ixx, Gx, Gx);
    MultiplY(Iyy, Gy, Gy);
    MultiplY(Ixy, Gx, Gy);
    // Gauss
    Gauss(Sxx, Ixx);
    Gauss(Syy, Iyy);
    Gauss(Sxy, Ixy);
    // Coarsity
    CoarsitY(out, Sxx, Syy, Sxy);

SPIRE representation:

    barrier(
        spawn(0, InitHarris(in))
    )
    barrier(
        spawn(0, SobelX(Gx, in))
        spawn(1, SobelY(Gy, in))
    )
    barrier(
        spawn(0, MultiplY(Ixx, Gx, Gx))
        spawn(1, MultiplY(Iyy, Gy, Gy))
        spawn(2, MultiplY(Ixy, Gx, Gy))
    )
    barrier(
        spawn(0, Gauss(Sxx, Ixx))
        spawn(1, Gauss(Syy, Iyy))
        spawn(2, Gauss(Sxy, Ixy))
    )
    barrier(
        spawn(0, CoarsitY(out, Sxx, Syy, Sxy))
    )

Figure 7.3: A part of Harris and one possible SPIRE core language representation.

remember that the synchronization attribute is added to each statement (see Section 4.2.3 on page 56).

Multiple parallel code configuration schemes can be generated. In a first scheme, illustrated in Function spawns_barrier, we activate other clusters from an enclosing one, i.e., the enclosing cluster executes the spawn instructions (launches the tasks). A second scheme would be to use the enclosing cluster to execute one of the enclosed tasks: there is no need to generate the last task as a spawn statement, since it is executed by the current cluster. We show an example in Figure 7.4.

Although the barrier synchronization of statements with only one spawn statement can be omitted as another optimization, barrier(spawn(i, S)) = S, we choose to keep these synchronizations for pedagogical purposes, and plan to apply this optimization in a separate phase, as a parallel transformation, or to handle it at the code generation level.

7.2.2 Hierarchical Parallelism

Function unstructured_to_structured is recursive, in order to handle hierarchy in code. An example is provided in Figure 7.5, where we assume that P = 4. We use two clusters, κ0 and κ1, to schedule the cluster stage {S1, S2};


Non-optimized:

    spawn(0, Gauss(Sxx, Ixx))
    spawn(1, Gauss(Syy, Iyy))
    spawn(2, Gauss(Sxy, Ixy))

Optimized:

    spawn(1, Gauss(Sxx, Ixx))
    spawn(2, Gauss(Syy, Iyy))
    Gauss(Sxy, Ixy)

Figure 7.4: SPIRE representation of a part of Harris, non-optimized (left) and optimized (right).

the remaining clusters are then used to schedule the statements enclosed in S1 and S2. Note that Statement S12, where p = 2, is mapped to the physical cluster number 3, obtained from its logical number 1 (3 = 4 − 2 + 1).

Sequential C code:

    // S
    // S0
    for(i = 1; i <= N; i++) {
        A[i] = 5;               // S01
        B[i] = 3;               // S02
    }
    // S1
    for(i = 1; i <= N; i++) {
        A[i] = foo(A[i]);       // S11
        C[i] = bar();           // S12
    }
    // S2
    for(i = 1; i <= N; i++)
        B[i] = baz(B[i]);       // S21
    // S3
    for(i = 1; i <= N; i++)
        C[i] += A[i] + B[i];    // S31

SPIRE representation:

    barrier(
        spawn(0, forloop(i, 1, N, 1,
            barrier(
                spawn(1, B[i] = 3)
                spawn(2, A[i] = 5)
            )
            sequential))
    )
    barrier(
        spawn(0, forloop(i, 1, N, 1,
            barrier(
                spawn(2, A[i] = foo(A[i]))
                spawn(3, C[i] = bar())
            )
            sequential))
        spawn(1,
            forloop(i, 1, N, 1,
                B[i] = baz(B[i])
                sequential))
    )
    barrier(
        spawn(0, forloop(i, 1, N, 1,
            C[i] += A[i]+B[i]
            sequential))
    )

Figure 7.5: A simple C code (hierarchical) and its SPIRE representation.

Barrier statements Sstructured, formed from all cluster stages Sbarrier in the sequence S, constitute the new sequence S′ that we build in Function unstructured_to_structured; by default, the execution attribute of this sequence is sequential.

The code in the right-hand side of Figure 7.5 contains barrier statements with only one spawn statement, which can be omitted (as an optimization).


Moreover, in the same figure, note how S01 and S02 are apparently reversed; this is a consequence, in our Algorithm 21, of the function topsort_unstructured, where statements are generated in the order of the graph traversal.

The execution attribute of the loops in Figure 7.5 is sequential, but they are actually parallel. Note that, regarding these loops, since we adopt the task parallelism paradigm, it may be useful to apply the tiling transformation and then perform full unrolling of the outer loop; we give more details on this transformation in the protocol of our experiments in Section 8.4. This way, the input code contains more parallel tasks to spawn: loops with a sequential execution attribute resulting from the parallel loops.
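For illustration, here is a hedged sketch (the array, bounds and tile size are ours) of a parallel loop after tiling with tile size T and full unrolling of the outer tile loop; each resulting sequential loop is a candidate task for the scheduler:

    /* Illustrative only: a parallel loop prepared for task parallelization.
       N is assumed to be a multiple of T; both values are ours. */
    #define N 1024
    #define T 256                /* tile size */

    void prepare(double A[N])
    {
        /* Original parallel loop:
             for (int i = 0; i < N; i++) A[i] = A[i] + 1.0;
           After tiling with tile size T and full unrolling of the outer
           tile loop, each of the N/T = 4 tiles below is a sequential loop
           that can be scheduled as a separate task (a spawn): */
        for (int i = 0;     i < T;   i++) A[i] = A[i] + 1.0;  /* task 0 */
        for (int i = T;     i < 2*T; i++) A[i] = A[i] + 1.0;  /* task 1 */
        for (int i = 2*T;   i < 3*T; i++) A[i] = A[i] + 1.0;  /* task 2 */
        for (int i = 3*T;   i < N;   i++) A[i] = A[i] + 1.0;  /* task 3 */
    }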

7.3 From Shared Memory to Distributed Memory Transformation

This section provides a second illustration of the type of parallel transformations that can be performed thanks to SPIRE: the transformation from shared memory to distributed memory. Given the complexity of this issue, as illustrated by the vast existing related work (see Section 7.3.1), the work reported in this section is preliminary and exploratory.

The send and recv primitives of SPIRE provide a simple communication API, adaptable to message-passing libraries/languages such as MPI, and thus to distributed memory system architectures. In these architectures, the data domain of whole programs should be partitioned into disjoint parts and mapped to different processes that have separate memory spaces; if a process accesses a non-local variable, communications must be performed. In our case, send and recv SPIRE functions have to be added to the parallel intermediate representation of distributed memory code.

7.3.1 Related Work: Communications Generation

Many researchers have worked on generating communication routines between several processors (general purpose, graphical, accelerator) and on optimizing the generated code. A large body of literature has addressed this problem since the early 1990s; however, no practical and efficient general solution currently exists [27].

Amarasinghe and Lam [15] generate the send and receive instructions necessary for communication between multiple processors. Goumas et al. [49] automatically generate message-passing code for tiled iteration spaces. However, these two works handle only perfectly nested loops with uniform dependences. Moreover, the data parallelism code generation proposed in [15]


and [49] produces an SPMD (single program multiple data) program to be run on each processor; this is different from our context of task parallelism with its hierarchical approach.

Bondhugula [27], in his automatic distributed memory code generation technique, uses the polyhedral framework to compute communication sets between computations under a given iteration of the parallel dimension of a loop. Once more, the parallel model is nested loops. Indeed, he "only" needs to transfer data written in one iteration and read in another iteration of the same loop, or to aggregate final writes for all data once the entire loop is finished. In our case, we need to transfer data from one or many iterations of a loop to different loop(s), knowing that the data spaces of these loops may be interlaced.

Cetus contains a source-to-source OpenMP to MPI transformation [23] where, at the end of a parallel construct, each participating process communicates the shared data it has produced and that other processes may use.

STEP [78] is a tool that transforms a parallel program containing high-level OpenMP directives into a message-passing program. To generate MPI communications, STEP constructs (1) send updated array regions, needed at the end of a parallel section, and (2) receive array regions, needed at the beginning of a parallel section. Habel et al. [52] present a directive-based programming model for hybrid distributed and shared memory systems; using the pragmas distribute and gridify, they distribute nested loop dimensions and array data among processors.

The OpenMP to GPGPU compiler framework [74] transforms OpenMP into GPU code. In order to generate communications, it inserts memory transfer calls for all shared data accessed by the kernel functions and defined by the host, and for data written in the kernels and used by the host.

Amini et al. [17] generate memory transfers between a computer host and its hardware GPU (CPU-GPU) using data flow analysis; this analysis is also based on array regions.

The main difference with our context of task parallelism with hierarchy is that these works generate communications between different iterations of nested loops, or between one host and GPUs; we, instead, have to transfer data between two different entire loops or between two iterations of different loops.

In this section, we describe how to figure out which pieces of data have to be transferred; we construct array region transfers in the context of a distributed-memory, homogeneous architecture.

7.3.2 Difficulties

Distributed-memory code generation becomes a much harder problem than the ones addressed in the above related works (see Section 7.3.1), because of the hierarchical aspect of our methodology. In this section, we illustrate


how communication generation imposes a set of challenges in terms of efficiency and correctness of the generated code, and of the coherence of data communications.

Inter-Level Communications

In the example in Figure 7.6, we suppose given this schedule, which, although not efficient, is useful to illustrate the difficulty of sending an array element by element from the source to the sink of a dependence extracted from the DDG. The difficulty occurs when this source and sink belong to different levels of hierarchy in the code, i.e., of the loop tree or of test branches.

Shared memory C code:

    // S
    // S0
    for(i = 0; i < N; i++) {
        A[i] = 5;               // S01
        B[i] = 3;               // S02
    }
    // S1
    for(i = 1; i < N; i++) {
        A[i] = foo(A[i-1]);     // S11
        B[i] = bar(B[i]);       // S12
    }

SPIRE distributed memory equivalent:

    forloop(i, 0, N-1, 1,
        barrier(
            spawn(0,
                A[i] = 5
                if(i==0,
                    asend(1, A[i]),
                    nop))
            spawn(1,
                B[i] = 3
                asend(0, B[i]))
        ), sequential)
    forloop(i, 1, N-1, 1,
        barrier(
            spawn(1,
                if(i==1,
                    recv(0, A[i-1]),
                    nop)
                A[i] = foo(A[i-1]))
            spawn(0,
                recv(1, B[i])
                B[i] = bar(B[i]))
        ), sequential)

Figure 7.6: An example of a shared memory C code and its SPIRE code, an efficient communication primitive-based distributed memory equivalent.

Communicating element by element is simple in the case of Array B in S12, since no loop-carried dependence exists. But when a dependence distance is greater than zero, as is the case of Array A in S11, sending only A[0] to S11 is required; generating exact communications in such cases, however, requires more sophisticated static analyses than the ones we have. Note that asend is a macro instruction for an asynchronous send; it is detailed in Section 7.3.3.


Non-Uniform Dependence Distance

In previous works (see Section 7.3.1), only uniform (constant) loop-carried dependences were considered. When a dependence is non-uniform, i.e., the dependence distance is not constant, as in the two codes of Figure 7.7, knowing how long the dependences on Array A are, and generating exact communications, is complex. The problem is even more complicated when nested loops are used and arrays are multidimensional. Generating exact communications in such cases requires more sophisticated static analyses than the ones we have.

    // S                              // S
    // S0                             // S0
    for(i = 0; i < N; i++) {          for(i = 0; i < N; i++) {
        A[i] = 5;        // S01           A[i] = 5;              // S01
        B[i] = 3;        // S02           B[i] = 3;              // S02
    }                                 }
    // S1                             // S1
    for(i = 0; i < N; i++) {          for(i = 0; i < N; i++) {
        A[i] = A[i-p] + a;   // S11       A[i] = A[p*i-q] + a;   // S11
        B[i] = A[i] * B[i];  // S12       B[i] = A[i] * B[i];    // S12
        p = B[i];                     }
    }

Figure 7.7: Two examples of C code with non-uniform dependences.

Exact Analyses

We presented in Section 6.3 our cost models; we rely on array region analysis to overapproximate communications. This analysis is not always exact: PIPS generates approximations (see Section 2.6.3 on page 24), noted by the MAY attribute, if an over- or under-approximation has been used because the region cannot be exactly represented by a convex polyhedron. While BDSC scheduling is still robust in spite of these approximations (see Section 8.6), they can affect the correctness of the generated communications. For instance, in the code presented in Figure 7.8, the quality of the array regions analysis may go from EXACT to MAY if a region contains holes, as is the case of S0. Also, while the region of A in the case of S1 remains EXACT, because the exact region is rectangular, the region of A in the case of S2 is MAY, since it contains holes.

Approximations are also propagated when computing the in and out regions for Array A (see Figure 7.9).


    // S0
    // <A(φ1)-R-MAY-{1 ≤ φ1, φ1 ≤ N}>
    // <A(φ1)-W-MAY-{1 ≤ φ1, φ1 ≤ N}>
    for(i = 1; i <= N; i++)
        if(i != N/2)
            A[i] = foo(A[i]);
    // S1
    // <A(φ1)-R-EXACT-{2 ≤ φ1, φ1 ≤ 2×N}>
    // <B(φ1)-W-EXACT-{1 ≤ φ1, φ1 ≤ N}>
    for(i = 1; i <= N; i++)
        // <A(φ1)-R-EXACT-{i+1 ≤ φ1, φ1 ≤ i+N}>
        // <B(φ1)-W-EXACT-{φ1 = i}>
        for(j = 1; j <= N; j++)
            B[i] = bar(A[i+j]);
    // S2
    // <A(φ1)-R-MAY-{2 ≤ φ1, φ1 ≤ 2×N}>
    // <B(φ1)-W-EXACT-{1 ≤ φ1, φ1 ≤ N}>
    for(j = 1; j <= N; j++)
        B[j] = baz(A[2*j]);

Figure 7.8: An example of a C code and its read and write array regions analysis, for two communications, from S0 to S1 and to S2.

In this chapter, communications are computed using in and out regions, such as for communicating A from S0 to S1 and to S2 in Figure 7.9.

    // S0
    // <A[φ1]-IN-MAY-{1 ≤ φ1, φ1 ≤ N}>
    // <A[φ1]-OUT-MAY-{2 ≤ φ1, φ1 ≤ 2N, φ1 ≤ N}>
    for(i = 1; i <= N; i++)
        if (i != n/2)
            A[i] = foo(A[i]);
    // S1
    // <A[φ1]-IN-EXACT-{2 ≤ φ1, φ1 ≤ 2N}>
    for(i = 1; i <= n; i += 1)
        // <A[φ1]-IN-EXACT-{i+1 ≤ φ1, φ1 ≤ i+N}>
        for(j = 1; j <= n; j += 1)
            B[i] = bar(A[i+j]);
    // S2
    // <A[φ1]-IN-MAY-{2 ≤ φ1, φ1 ≤ 2N}>
    for(j = 1; j <= n; j += 1)
        B[j] = baz(A[j*2]);

Figure 7.9: An example of a C code and its in and out array regions analysis, for two communications, from S0 to S1 and to S2.


7.3.3 Equilevel and Hierarchical Communications

As transferring data from one task to another when they are not at the same level of hierarchy is unstructured and too complex, designing a compilation scheme for such cases might be overly complex and might even jeopardize the correctness of the distributed memory code. In this thesis, we propose a compositional, static solution that is structured and correct; it may communicate more data than necessary, but their coherence is guaranteed.

Communication Scheme

We adopt a hierarchical scheme of communication, between multiple levels of nesting of tasks and inside one level of hierarchy. An example of this scheme is illustrated in Figure 7.10, where each box represents a task τ. Each box is seen as a host of the boxes it encloses, and an edge represents a data dependence. This scheme illustrates an example of configuration with two nested levels of hierarchy; the level is specified with the superscript on τ in the figure.

Once structured parallelism is generated (sequential sequences of barrier and spawn statements), the second step is to add send and recv SPIRE call statements, assuming a message-passing language model (see the note below). We detail in this section our specification of communications insertion in SPIRE.

We divide parallelism into hierarchical and equilevel parallelism, and distinguish between two types of communications, namely inside a sequence, within one hierarchy level (equilevel), and between hierarchy levels (hierarchical).

In Figure 7.11, where edges represent possible communications, τ_0^(1), τ_1^(1) and τ_2^(1) are enclosed in the vertex τ_1^(0), which is in the same sequence as τ_0^(0). The dependence (τ_0^(0), τ_1^(0)), for instance, denotes an equilevel communication of A; also, the dependence (τ_0^(1), τ_2^(1)) denotes an equilevel communication of B, while (τ_1^(0), τ_0^(1)) is a hierarchical communication of A. We use two main steps to construct communications, via the equilevel_com and hierarchical_com functions presented below.

The key limitation of our method is that it cannot enforce memory constraints when applying the BDSC-based hierarchical scheduling, which uses an iterative approach on the number of applications of BDSC (see Section 6.5). Therefore, our proposed solution for the generation of communications cannot be applied when strict limitations on memory are imposed at scheduling time. However, our code generator can easily verify whether the data used and written at each level can be maintained in the specific clusters.

As a summary, compared to the works presented in Section 7.3.1, we do not improve upon them (regarding non-affine dependences or efficiency).

(Note that communications can also be constructed from an unstructured code; one has to manipulate unstructured statements rather than sequence ones.)


[Figure 7.10 shows three identical two-level nestings of task boxes (τa(0) enclosing τa(1) and τb(1), themselves enclosing τa(2), τb(2), τc(2) and τd(2)), linked by communication edges.]

Figure 7.10: Scheme of communications between multiple possible levels of hierarchy in a parallel code; the levels are specified with the superscripts on the τ's.

However, we extend their ideas to our problem of generating communications for task parallelism within a hierarchical code structure, which is more complex than theirs.

Assumptions

For our communication scheme to be valid, the following assumptions must be met for proper communications generation:

• all write regions are exact, in order to guarantee the coherence of data communications;


Figure 7.11: Equilevel and hierarchical communications.

• non-uniform dependences are not handled;

• more data than necessary may be communicated, but their coherence is guaranteed thanks to the topological sorting of the tasks involved in communications; this ordering is also necessary to avoid deadlocks;

• the order in which the send and receive operations are performed is matched, in order to avoid deadlock situations;

• the HBDSC algorithm is applied without the memory constraint (M = ∞);

• the input shared-memory code is a structured parallel code, with spawn and barrier constructs;

• all send operations are non-blocking, in order to avoid deadlocks. We presented in SPIRE (see Section 4.2.4 on page 60) how non-blocking communications can be implemented in SPIRE using the blocking primitives within spawned statements. Therefore, we assume that we use an additional process Ps for performing non-blocking send communications, and a macro instruction asend for asynchronous send, as follows:

    asend(dest, data) = spawn(Ps, send(dest, data))
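The following is a minimal, hedged sketch of how this asend macro could be realized in C with POSIX threads (the helper names, the struct and the send_blocking primitive are ours, not part of SPIRE or PIPS): the caller returns immediately, while a spawned helper, playing the role of the process Ps, performs the blocking send.

    #include <pthread.h>
    #include <stdlib.h>

    /* Assumed blocking send primitive, provided by the target runtime. */
    extern void send_blocking(int dest, void *data, size_t size);

    struct asend_arg { int dest; void *data; size_t size; };

    static void *asend_worker(void *p)      /* plays the role of process Ps */
    {
        struct asend_arg *a = p;
        send_blocking(a->dest, a->data, a->size);   /* blocking send */
        free(a);
        return NULL;
    }

    /* asend(dest, data) = spawn(Ps, send(dest, data)): the caller returns
       immediately; the data buffer must stay live until the send completes. */
    static void asend(int dest, void *data, size_t size)
    {
        struct asend_arg *a = malloc(sizeof *a);
        a->dest = dest; a->data = data; a->size = size;
        pthread_t t;
        pthread_create(&t, NULL, asend_worker, a);
        pthread_detach(t);
    }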

7.3.4 Equilevel Communications Insertion Algorithm

Function equilevel_com, presented in Algorithm 23, first computes the equilevel data (array regions) to communicate from (when neighbors = predecessors) or to (when neighbors = successors) Si, inside an SDG G (a code sequence), using the function equilevel_regions presented in Algorithm 24.


After the call to the equilevel_regions function, equilevel_com creates a new sequence statement S′i via the sequence function; S′i contains the receive communication calls, Si, and the asend communication calls. After that, it updates σ with S′i and returns it. Recall that the || symbol is used for the concatenation of two lists.

ALGORITHM 23: Equilevel communications for Task τi inside SDG G, updating a schedule σ

    function equilevel_com(τi, G, σ, neighbors)
        Si = vertex_statement(τi)
        coms = equilevel_regions(Si, G, σ, neighbors)
        L = (neighbors = predecessors) ? coms : ∅     // recv calls before Si
        L ||= [Si]
        L ||= (neighbors = successors) ? coms : ∅     // asend calls after Si
        S′i = sequence(L)
        return σ[Si → (S′i, cluster(σ(Si)), nbclusters(σ(Si)))]
    end

Function equilevel_regions traverses all neighbors τn (successors or predecessors) of τi (the task of Si) in order to generate asend or recv call statements for τi if they are not in the same cluster. Neighbors are traversed through a topological sort-ordered descent, in order to impose an order between the multiple send or receive communication calls between each pair of clusters.

ALGORITHM 24: Array regions for Statement Si to be transferred inside SDG G (equilevel)

    function equilevel_regions(Si, G, σ, neighbors)
        κi = cluster(σ(Si))
        τi = statement_vertex(Si, G)
        coms = ∅
        foreach τn ∈ topsort_ordered(G, neighbors(τi, G))
            Sn = vertex_statement(τn)
            κn = cluster(σ(Sn))
            if (κi ≠ κn)
                R = (neighbors = successors) ? transfer_data(τi, τn)
                                             : transfer_data(τn, τi)
                coms ||= com_calls(neighbors, R, κn)
        return coms
    end

We thus guarantee the coherence of communicated data and avoid communication deadlocks. We use the function topsort_ordered, which first computes the topological sort of the transitive closure of G and then accesses the neighbors of τi in this order.


For instance, the graph on the right of Figure 7.12 shows the result of the topological sort of the predecessors of τ4 of the graph on the left. Function equilevel_regions returns the list of communications performed inside G by Si.

[Figure 7.12 shows, on the left, a part of an SDG with vertices τ1, τ2, τ3 and τ4, whose edges are labeled with the communicated data (A, N), and, on the right, the topological sort order of the predecessors of τ4.]

Figure 7.12: A part of an SDG and the topological sort order of the predecessors of τ4.

Communication calls are generated in two steps (the two functions presented in Algorithm 25).

First, the transfer_data function is used to compute which data elements have to be communicated, in the form of array regions. It returns the elements that are involved in a RAW dependence between two tasks τ and τsucc. It filters the out and in regions of the two tasks in order to get these dependences; it uses edge_regions for this filtering.

Edge regions are computed during the construction of the SDG, in order to maintain the dependences that label the edges of the DDG (see Section 6.2.1 on page 95). The intersection between the two results is returned (see the note below). For instance, in the graph on the right of Figure 7.12, if N is also written in τ1, filtering the out regions of τ1 avoids receiving N twice (N comes from τ1 and τ3).

Second, communication calls are constructed via the function com_calls which, from the cluster κ of the statement source (asend) or sink (recv) of the communication and from the set of data in terms of regions R, generates a list of communication calls coms. These communication calls are the result of the function region_to_coms, already implemented in PIPS [20]; this function converts a region into code made of send/receive instructions that represent the communications of the integer points contained in a given parameterized polyhedron of this region. An example is illustrated in Figure 7.13, where the write regions are exact and the other required assumptions are verified.

(We assume that Rr is combined with the path transformer between the statements of τ and τsucc.)


ALGORITHM 25: Two functions to construct a communication call: data to transfer and call generation

    function transfer_data(τ, τsucc)
        R′o = R′i = ∅
        Ro = out_regions(vertex_statement(τ))
        foreach ro ∈ Ro
            if (∃ r ∈ edge_regions(τ, τsucc) | entity(r) = entity(ro))
                R′o ∪= {ro}
        Ri = in_regions(vertex_statement(τsucc))
        foreach ri ∈ Ri
            if (∃ r ∈ edge_regions(τ, τsucc) | entity(r) = entity(ri))
                R′i ∪= {ri}
        return regions_intersection(R′o, R′i)
    end

    function com_calls(neighbors, R, κ)
        coms = ∅
        i = cluster_number(κ)
        com = (neighbors = successors) ? "asend" : "recv"
        foreach r ∈ R
            coms ||= region_to_coms(r, com, i)
        return coms
    end

In (b) of the figure, the intersection between the in regions of S2 and the out regions of S1 is converted into a loop of communication calls using the region_to_coms function. Note that the current version of our implementation prototype of region_to_coms is in fact unable to manage this somewhat complicated case. Indeed, the task running on Cluster 0 needs to know the value of the structure variable M to generate a proper communication with the task on Cluster 1 (see Figure 7.13 (b)); M would need to be sent by Cluster 1, where it is known, before the sending loop is run. Thus, our current implementation assumes that all variables in the code generated by region_to_coms are known in each communicating cluster. For instance, in Figure 7.13, with M replaced by N, region_to_coms will perform correctly. This constraint was not a limitation in the experimental benchmarks tested in Chapter 8.

Also, in the part of the main function presented in the left of Figure 7.3, the statement MultiplY(Ixy, Gx, Gy), scheduled on Cluster number 2, needs to receive both Gx and Gy from Clusters number 1 and 0 (see Figure 6.16). The receive calls list thus generated for this statement using our algorithm is (see the note below):

(Since the whole array is received, the region is converted into a simple instruction; no loop is needed.)


(a)

    // S1
    // <A(φ1)-W-EXACT-{1 ≤ φ1, φ1 ≤ N}>
    // <A[φ1]-OUT-MAY-{2 ≤ φ1, φ1 ≤ 2×M}>
    for(i = 1; i <= N; i++)
        A[i] = foo();
    // S2
    // <A[φ1]-R-MAY-{2 ≤ φ1, φ1 ≤ 2×M}>
    // <A[φ1]-IN-MAY-{2 ≤ φ1, φ1 ≤ 2×M}>
    for(i = 1; i <= M; i++)
        B[i] = bar(A[i*2]);

(b)

    spawn(0,
        for(i = 1; i <= N; i++)
            A[i] = foo();
        for(i = 2; i <= 2*M; i++)
            asend(1, A[i]);
    )
    spawn(1,
        for(i = 2; i <= 2*M; i++)
            recv(0, A[i]);
        for(i = 1; i <= M; i++)
            B[i] = bar(A[i*2]);
    )

Figure 7.13: Communications generation from a region using the region_to_coms function.

    recv(1, Gy)
    recv(0, Gx)

7.3.5 Hierarchical Communications Insertion Algorithm

Data transfers are also needed between different levels of the code hierarchy, because each enclosed task requires data from its enclosing one, and vice versa. To construct these hierarchical communications, we introduce Function hierarchical_com, presented in Algorithm 26.

This function generates asend or recv call statements for each task S and its hierarchical parent (enclosing vertex), scheduled in Cluster κp. First, it constructs the list of transfers Rh, which gathers the data to be communicated to the enclosed vertex; Rh is defined as the in regions or the out regions of the task S. Then, from this set of transfers Rh, communication calls coms (recv from the enclosing vertex, or asend to it) are generated.


ALGORITHM 26: Hierarchical communications for Statement S to its hierarchical parent, clustered in κp

    function hierarchical_com(S, neighbors, σ, κp)
        Rh = (neighbors = successors) ? out_regions(S)
                                      : in_regions(S)
        coms = com_calls(neighbors, Rh, κp)
        seq = ((neighbors = predecessors) ? coms : ∅)
              || [S]
              || ((neighbors = successors) ? coms : ∅)
        S′ = sequence(seq)
        σ = σ[S → (S′, cluster(σ(S)), nbclusters(σ(S)))]
        return (σ, Rh)
    end

Finally, σ, updated with the new statement S′, and the set of regions Rh, which represents the data to be communicated, this time on the side of the enclosing vertex, are returned.

For instance, in Figure 7.11, for the task τ_0^(1), the data elements Rh to send from τ_0^(1) to τ_1^(0) are the out regions of the statement labeled τ_0^(1). Thus, the hierarchical asend calls from τ_0^(1) to its hierarchical parent τ_1^(0) are constructed using Rh as argument.

7.3.6 Communications Insertion Main Algorithm

Function communications_insertion, presented in Algorithm 27, constructs a new code in which the asend and recv primitives are generated. It uses a hierarchical schedule σ that maps each substatement s of S to σ(s) = (s′, κ, n) (see Section 6.5) to handle new statements. It also uses the hierarchical SDG mapping function H, which we use to recover the SDG of each statement (see Section 6.2).

Equilevel

To construct equilevel communications, all substatements Si of a sequence S are traversed. For each Si, communications are generated inside each sequence via two calls to the function equilevel_com presented above: one with the function neighbors equal to successors, to compute asend calls; the other with predecessors, to compute recv calls. This ensures that each send operation is matched to an appropriately ordered receive operation, avoiding deadlock situations. We use the function H, which returns an SDG G for S; G is essential to access the dependences, and thus the communications, between the vertices τ of G whose statements s = vertex_statement(τ) are in S.


ALGORITHM 27: Communication primitives (asend and recv) insertion in SPIRE

    function communications_insertion(S, H, σ, κp)
        κ = cluster(σ(S)); n = nbclusters(σ(S))
        switch (S)
        case call:
            return σ
        case sequence(S1; ...; Sn):
            G = H(S)
            h_recvs = h_sends = ∅
            foreach τi ∈ topsort(G)
                Si = vertex_statement(τi)
                κi = cluster(σ(Si))
                σ = equilevel_com(τi, G, σ, predecessors)
                σ = equilevel_com(τi, G, σ, successors)
                if (κp ≠ NULL and κp ≠ κi)
                    (σ, h_Rrecv) =
                        hierarchical_com(Si, predecessors, σ, κp)
                    h_sends ||= com_calls(successors, h_Rrecv, κi)
                    (σ, h_Rsend) =
                        hierarchical_com(Si, successors, σ, κp)
                    h_recvs ||= com_calls(predecessors, h_Rsend, κi)
                σ = communications_insertion(Si, H, σ, κi)
            S′ = parallel(σ(S))
            S′′ = sequence(h_sends || [S′] || h_recvs)
            return σ[S′ → (S′′, κ, n)]
        case forloop(I, Elower, Eupper, Sbody):
            σ = communications_insertion(Sbody, H, σ, κp)
            (S′body, κbody, nbclustersbody) = σ(Sbody)
            return σ[S → (forloop(I, Elower, Eupper, S′body), κ, n)]
        case test(Econd, St, Sf):
            σ = communications_insertion(St, H, σ, κp)
            σ = communications_insertion(Sf, H, σ, κp)
            (S′t, κt, nbclusterst) = σ(St)
            (S′f, κf, nbclustersf) = σ(Sf)
            return σ[S → (test(Econd, S′t, S′f), κ, n)]
    end

Hierarchical

We also construct hierarchical communications, which represent data transfers between different levels of hierarchy, because each enclosed task requires data from its enclosing one. We use the function hierarchical_com presented above, which generates asend or recv call statements for each task S and its hierarchical parent (enclosing vertex), scheduled in Cluster κp.


Thus, the argument κp is propagated in communications_insertion, in order to keep correct the information about the cluster of the enclosing vertex. The returned sets of regions h_Rsend and h_Rrecv, which represent the data to be communicated at the start and at the end of the enclosed vertex τi, are used to construct, respectively, the send and recv communication calls on the side of the enclosing vertex of S. Note that hierarchical send communications need to be performed before running S′, to ensure that subtasks can run. Moreover, no deadlocks can occur in this scheme, because the enclosing task executes its send/receive statements sequentially before spawning its enclosed tasks, which run concurrently.

The top level of a program presents no hierarchy. In order not to generate hierarchical communications that are not necessary, as an optimization, we use the test (κp ≠ NULL), knowing that, for the first call to the function communications_insertion, κp is set to NULL. This is necessary to prevent the generation of hierarchical communications for Level 0.

Example

As an example, we apply the phase of communication generation to the code presented in Figure 7.5 to generate SPIRE code with communications; both equilevel (noted with the comment equilevel) and hierarchical (noted with the comment hierarchical) communications are illustrated in the right of Figure 7.14. Moreover, a graphical illustration of these communications is given in Figure 7.15.

7.3.7 Conclusion and Future Work

As said before, the work presented in this section is a preliminary and exploratory solution that, we believe, is useful for suggesting the potential of our task parallelization in the distributed memory domain. It may also serve as a base for future work to generate more efficient code than ours.

In fact, the solution proposed in this section may communicate more data than necessary, but their coherence is guaranteed thanks to the topological sorting of the tasks involved in communications, at each level of generation: equilevel (ordering of the neighbors of every task in an SDG) and hierarchical (ordering of the tasks of an SDG). Moreover, as future work, we intend to optimize the generated code by eliminating redundant communications and by aggregating small messages into larger ones.

For instance, hierarchical communications transfer all out regions from the enclosed task to its enclosing task. A more efficient scheme would be to send not all out regions of the enclosed task, but only the regions that will not be modified afterward. This is another track of optimization, to eliminate redundant communications, that we leave for future work.

Moreover, we optimize the generated code by imposing that the current cluster executes the last nested spawn.


Sequential C code:

    for(i = 0; i < N; i++) {
        A[i] = 5;
        B[i] = 3;
    }
    for(i = 0; i < N; i++) {
        A[i] = foo(A[i]);
        C[i] = bar();
    }
    for(i = 0; i < N; i++)
        B[i] = baz(B[i]);
    for(i = 0; i < N; i++)
        C[i] += A[i] + B[i];

SPIRE distributed memory representation with communications:

    barrier(
        spawn(0, forloop(i, 1, N, 1,
            asend(1, i)                  // hierarchical
            asend(2, i)                  // hierarchical
            barrier(
                spawn(1,
                    recv(0, i)           // hierarchical
                    B[i] = 3
                    asend(0, B[i]))      // hierarchical
                spawn(2,
                    recv(0, i)           // hierarchical
                    A[i] = 5
                    asend(0, A[i]))      // hierarchical
            )
            recv(1, B[i])                // hierarchical
            recv(2, A[i])                // hierarchical
            sequential)
        asend(1, B))                     // equilevel
    )
    barrier(
        spawn(0, forloop(i, 1, N, 1,
            asend(2, i)                  // hierarchical
            asend(2, A[i])               // hierarchical
            asend(3, i)                  // hierarchical
            barrier(
                spawn(2,
                    recv(0, i)           // hierarchical
                    recv(0, A[i])        // hierarchical
                    A[i] = foo(A[i])
                    asend(0, A[i]))      // hierarchical
                spawn(3,
                    recv(0, i)           // hierarchical
                    C[i] = bar()
                    asend(0, C[i]))      // hierarchical
            )
            recv(2, A[i])                // hierarchical
            recv(3, C[i])                // hierarchical
            sequential))
        spawn(1,
            recv(0, B)                   // equilevel
            forloop(i, 1, N, 1,
                B[i] = baz(B[i]), sequential)
            asend(0, B))                 // equilevel
    )
    barrier(
        spawn(0,
            recv(1, B)                   // equilevel
            forloop(i, 1, N, 1,
                C[i] += A[i]+B[i], sequential))
    )

Figure 7.14: Equilevel and hierarchical communications generation and SPIRE distributed memory representation of a C code.


[Figure 7.15 shows, for the code of Figure 7.14, the four loops distributed over clusters κ0 to κ3 on two hierarchy levels (Level 0 and Level 1), with the communicated data (i, A[i], B[i], C[i], B) attached to the communication arcs.]

Figure 7.15: Graph illustration of the communications made in the C code of Figure 7.14: equilevel in green arcs, hierarchical in black.

The result of the application of this optimization to the code presented in the right of Figure 7.14 is illustrated in the left of Figure 7.16. We also eliminate barriers with one spawn statement; the result is illustrated in the right of Figure 7.16.

7.4 Parallel Code Generation

SPIRE is designed to fit different programming languages, thus facilitating the transformation of codes for different types of parallel systems. Existing parallel language constructs can be mapped into SPIRE. We show in this section how this intermediate representation simplifies the targeting of various languages via a simple mapping approach, and use the prettyprinted generations of OpenMP and MPI as an illustration.

7.4.1 Mapping Approach

Our mapping technique to a parallel language is made of three steps, namely (1) the generation of parallel tasks, by translating spawn attributes, and of synchronization statements, by translating barrier attributes, to this language;


Left (the current cluster executes the last nested spawn):

    barrier(
        forloop(i, 1, N, 1,
            asend(1, i)
            barrier(
                spawn(1,
                    recv(0, i)
                    B[i] = 3
                    asend(0, B[i]))
                A[i] = 5
            )
            recv(1, B[i])
            sequential)
    )
    barrier(
        spawn(1, forloop(i, 1, N, 1,
            asend(2, i)
            asend(2, A[i])
            barrier(
                spawn(2,
                    recv(0, i)
                    recv(0, A[i])
                    A[i] = foo(A[i])
                    asend(0, A[i]))
                C[i] = bar()
            )
            recv(2, A[i])
            sequential))
        forloop(i, 1, N, 1,
            B[i] = baz(B[i])
            sequential)
    )
    barrier(
        forloop(i, 1, N, 1,
            C[i] += A[i]+B[i]
            sequential)
    )

Right (barriers with one spawn statement are also removed):

    forloop(i, 1, N, 1,
        asend(1, i)
        barrier(
            spawn(1,
                recv(0, i)
                B[i] = 3
                asend(0, B[i]))
            A[i] = 5
        )
        recv(1, B[i])
        sequential)
    barrier(
        spawn(1, forloop(i, 1, N, 1,
            asend(2, i)
            asend(2, A[i])
            barrier(
                spawn(2,
                    recv(0, i)
                    recv(0, A[i])
                    A[i] = foo(A[i])
                    asend(0, A[i]))
                C[i] = bar()
            )
            recv(2, A[i])
            sequential))
        forloop(i, 1, N, 1,
            B[i] = baz(B[i])
            sequential)
    )
    forloop(i, 1, N, 1,
        C[i] += A[i]+B[i]
        sequential)

Figure 7.16: Equilevel and hierarchical communications generation after optimizations: the current cluster executes the last nested spawn (left), and barriers with one spawn statement are removed (right).

(2) the insertion of communications, by translating the asend and recv primitives to this target language, if the latter uses a distributed memory model; and (3) the generation of hierarchical (nested) parallelism, by exploiting the code hierarchy. For efficiency reasons, we assume that the creation of tasks is done once, at the beginning of programs; after that, we just synchronize the tasks, which are thus persistent.

Task Parallelism

The parallel code is generated in a fork/join manner: a barrier statement encloses spawned parallel tasks. Mapping SPIRE to an actual parallel language consists in transforming the barrier and spawn annotations of SPIRE into the equivalent calls or pragmas available in the target language.

Data Distribution

If the target language uses a shared-memory paradigm, no additional changes are needed to handle this language. Otherwise, if it is a message-passing library/language, a pass of insertion of the appropriate asend and recv primitives of SPIRE is necessary (see Section 7.3). After that, we generate the equivalent communication calls of the corresponding language from SPIRE, by transforming the asend and recv primitives into these equivalent communication calls.

Hierarchical Parallelism

Throughout this thesis, we take into account the hierarchy structure of code to achieve maximum efficiency. Indeed, exploiting the hierarchy of code when scheduling it via the HBDSC algorithm, within the SPIRE transformation pass via the unstructured_to_structured algorithm, and also when generating both equilevel and hierarchical communications, makes it possible to handle hierarchical parallelism and thus to generate more parallelism. Moreover, we take into account the ability of parallel languages to implement hierarchical parallelism, and thus improve the performance of the generated parallel code. Figure 7.5 on page 138 illustrates zero and one levels of hierarchy in the generated SPIRE code (right); we have generated equilevel and hierarchical SPIRE(PIPS IR).

7.4.2 OpenMP Generation

OpenMP 3.0 has been extended with tasks, and thus supports both task and hierarchical task parallelism for shared memory systems. For efficiency, since the creation of threads is done once at the beginning of the program, the omp parallel pragma is used only once in the program. This section details the generation of OpenMP task parallelism features using SPIRE.
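As an illustration of this single-team scheme, here is a hedged sketch of the overall shape that the generated OpenMP program is assumed to take (the function and variable names are ours):

    #include <omp.h>

    void compute(void);    /* assumed: body produced by the generator */

    int main(void)
    {
        #pragma omp parallel    /* the team of threads is created once */
        {
            #pragma omp single  /* one thread encounters the task constructs */
            compute();          /* contains the generated omp task pragmas */
        }
        return 0;
    }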

Task Parallelism

OpenMP supports two types of task parallelism scheduling models: dynamic, via the omp task directive, and static, via the omp section directive (see Section 3.4.4 on page 39).


In our implementation of the OpenMP back-end generation, we choose to generate omp task instead of omp section tasks, for four reasons, namely:

1. the HBDSC scheduling algorithm uses a static approach, and we thus want to combine the static results of HBDSC with the run-time dynamic scheduling of OpenMP provided when using omp task, in order to improve scheduling;

2. unlike the omp section directive, the omp task directive allows the implementation of recursive parallel tasks (see the sketch after this list);

3. omp section may only be used in an omp sections construct, and thus we cannot implement a parallel loop using omp sections;

4. the actual implementations of OpenMP do not provide the possibility of nesting many omp sections, and thus cannot take advantage of the BDSC-based hierarchical parallelization (HBDSC) feature to exploit more parallelism.
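To illustrate the second reason, here is a classic, hedged sketch (not code produced by our generator) of a recursive parallel task that omp task can express but omp section cannot:

    /* Recursive task parallelism, expressible with omp task only.
       Must be called from inside a parallel region, e.g. under
       #pragma omp single. */
    long fib(int n)
    {
        long x, y;
        if (n < 2)
            return n;
        #pragma omp task shared(x)
        x = fib(n - 1);
        #pragma omp task shared(y)
        y = fib(n - 2);
        #pragma omp taskwait
        return x + y;
    }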

As regards the use of the OpenMP task construct, we apply a simple process: a SPIRE statement annotated with the synchronization attribute spawn is decorated with the pragma omp task. An example is illustrated in Figure 7.17, where the SPIRE code on the left-hand side is rewritten in OpenMP on the right-hand side.

Synchronization

In OpenMP, the user can specify synchronization explicitly, to handle control and data dependences, using the omp single pragma for example. There, only one thread (omp single pragma) will encounter a task construct (omp task), to be scheduled and executed by any thread in the team (the set of threads created inside an omp parallel pragma). Therefore, a statement annotated with the SPIRE synchronization attribute barrier is decorated with the pragma omp single. An example is illustrated in Figure 7.17.

Hierarchical Parallelism

SPIRE handles multiple levels of parallelism, providing a trade-off between parallelism and synchronization overhead; this is taken into account at scheduling time using our BDSC-based hierarchical scheduling algorithm. In order to generate hierarchical parallelism in OpenMP, the same process as above of translating the spawn and barrier SPIRE annotations should be applied. However, synchronizing nested tasks (with synchronization attribute barrier) using the omp single pragma cannot be done, because of a restriction of OpenMP-compliant implementations: single regions may not be closely nested inside explicit task regions.


SPIRE representation:

    barrier(
        spawn(0, Gauss(Sxx, Ixx))
        spawn(1, Gauss(Syy, Iyy))
        spawn(2, Gauss(Sxy, Ixy))
    )

Generated OpenMP code:

    #pragma omp single
    {
        #pragma omp task
        Gauss(Sxx, Ixx);
        #pragma omp task
        Gauss(Syy, Iyy);
        #pragma omp task
        Gauss(Sxy, Ixy);
    }

Figure 7.17: SPIRE representation of a part of Harris and its OpenMP task generated code.

Therefore, we use the pragma omp taskwait for synchronization when encountering a statement annotated with a barrier synchronization. The second barrier in the left of Figure 7.18 implements nested parallel tasks; omp taskwait is then used (right of the figure) to synchronize the two tasks spawned inside the loop. Note also that the barrier statements that enclose only a spawn(0) statement (left of the figure) are omitted in the translation (right of the figure), for optimization purposes.

SPIRE representation:

    barrier(
        spawn(0,
            forloop(i, 1, N, 1,
                barrier(
                    spawn(1, B[i] = 3)
                    spawn(2, A[i] = 5)
                )
                sequential))
    )
    barrier(
        spawn(0,
            forloop(i, 1, N, 1,
                C[i] = A[i]+B[i]
                sequential))
    )

Generated OpenMP code:

    #pragma omp single
    {
        for(i = 1; i <= N; i += 1) {
            #pragma omp task
            B[i] = 3;
            #pragma omp task
            A[i] = 5;
            #pragma omp taskwait
        }
        for(i = 1; i <= N; i += 1)
            C[i] = A[i]+B[i];
    }

Figure 7.18: SPIRE representation of a C code and its OpenMP hierarchical task generated code.

7.4.3 SPMDization: MPI Generation

MPI is a message-passing library; it is widely used to program distributed-memory systems. We show here the generation of communications between processes.


In our technique, we adopt a model often called "SPMDization" of parallelism, where processes are created once at the beginning of the program (static creation), each process executes the same code, and the number of processes remains unchanged during execution. This allows to reduce the process creation and synchronization overheads. We state in Section 3.4.5 on page 40 that MPI processes are created when the MPI_Init function is called; it also defines the universal intracommunicator MPI_COMM_WORLD for all processes, to drive the various communications. We use the rank-of-process parameter rank0 to manage the relation between process and task; this rank is obtained by calling the function MPI_Comm_rank(MPI_COMM_WORLD, &rank0). This section details the generation of MPI using SPIRE.

Task Parallelism

A simple SPMDization technique is applied, where a statement with the synchronization attribute spawn, of parameter entity0 specifying the cluster number (process, in the case of MPI), is filtered by an if statement on the condition rank0 == entity0. An example is illustrated in Figure 7.19.

Communication and Synchronization

MPI supports distributed memory systems; we have to generate communications between the different processes. The MPI_Isend and MPI_Recv functions are called to replace the communication primitives asend and recv of SPIRE; MPI_Isend [3] starts a nonblocking send. These communication functions are sufficient to ensure the synchronization between different processes. An example is illustrated in Figure 7.19.
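Since Figure 7.19 below shows only the send side, here is a hedged sketch (the function name and parameters are ours) of how a SPIRE recv primitive, such as recv(1, Syy), could be translated on the receiving process:

    #include <mpi.h>

    /* Hedged sketch: translation of the SPIRE primitive recv(src, data)
       into a blocking MPI receive; 'count' is the size of the transferred
       region (e.g. N*M, assumed from the context of Figure 7.19). */
    static void recv_region(float *data, int count, int src)
    {
        MPI_Status status;
        MPI_Recv(data, count, MPI_FLOAT, src, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
    }
    /* e.g., on process 0: recv_region(Syy, N*M, 1); matches the MPI_Isend
       issued by process 1 in Figure 7.19. */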

Hierarchical Parallelism

Most people use only MPI_COMM_WORLD as a communicator between MPI processes. This universal intracommunicator is not enough for handling hierarchical parallelism. MPI implementations offer the possibility of creating subset communicators via the function MPI_Comm_split [80], which creates new communicators from an initially defined communicator; each new communicator can be used in a level of hierarchy of the parallel code. Another scheme consists in gathering the code of each process and then assigning the code to the corresponding process.
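As a hedged sketch of this possibility (the level_of mapping below is ours, purely illustrative), each hierarchy level could be given its own communicator via MPI_Comm_split:

    #include <mpi.h>

    /* Assumed mapping from the global rank to the hierarchy level the
       process works on; how it is computed is not specified here. */
    extern int level_of(int world_rank);

    static MPI_Comm level_communicator(void)
    {
        int world_rank;
        MPI_Comm level_comm;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        /* Processes with the same color end up in the same communicator;
           the key (here world_rank) orders the ranks inside it. */
        MPI_Comm_split(MPI_COMM_WORLD, level_of(world_rank), world_rank,
                       &level_comm);
        return level_comm;
    }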

In our implementation of the generation of MPI code from SPIRE code, we leave the implementation of hierarchical MPI as future work; we implement only flat MPI code. Another type of hierarchical MPI is obtained by merging MPI and OpenMP together (hybrid programming) [12]; this is also called multi-core-aware message-passing (MPI). We also discuss this in the future work section (see Section 9.2).


SPIRE representation:

    barrier(
        spawn(0,
            Gauss(Sxx, Ixx)
        )
        spawn(1,
            Gauss(Syy, Iyy)
            asend(0, Syy)
        )
        spawn(2,
            Gauss(Sxy, Ixy)
            asend(0, Sxy)
        )
    )

Generated MPI code:

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank0);
    if (rank0 == 0)
        Gauss(Sxx, Ixx);
    if (rank0 == 1) {
        Gauss(Syy, Iyy);
        MPI_Isend(Syy, N*M, MPI_FLOAT, 0, MPI_ANY_TAG,
                  MPI_COMM_WORLD, &request);
    }
    if (rank0 == 2) {
        Gauss(Sxy, Ixy);
        MPI_Isend(Sxy, N*M, MPI_FLOAT, 0, MPI_ANY_TAG,
                  MPI_COMM_WORLD, &request);
    }
    MPI_Finalize();

Figure 7.19: SPIRE representation of a part of Harris and its MPI generated code.

7.5 Conclusion

The goal of this chapter is to lay the first brick of a framework of parallel code transformations and generation based on our BDSC- and SPIRE-based hierarchical task parallelization techniques. We present two parallel transformations: first, from unstructured to structured parallelism, for a high-level abstraction of parallel constructs; and second, the conversion of shared memory programs into distributed ones. Both transformations are expressed by combining SPIRE constructs. Moreover, we show the generation of task parallelism and communications at both levels: equilevel and hierarchical.


Finally, we generate both OpenMP and MPI from the same parallel IR derived from SPIRE, except for hierarchical MPI, which we leave as future work.

The next chapter provides experimental results using OpenMP and MPI programs generated via the methodology presented in this thesis, based on the HBDSC scheduling algorithm and SPIRE-derived parallel languages.

Chapter 8

Experimental Evaluation with PIPS

Experience is a lantern tied behind our backs which illuminates the path. Confucius

In this chapter, we describe the integration of SPIRE and HBDSC within the PIPS compiler infrastructure and their application to the parallelization of five significant programs, targeting both shared and distributed memory architectures: the image and signal processing benchmarks Harris and ABF, the SPEC2001 benchmark equake, the NAS parallel benchmark IS, and an FFT code. We present the experimental results we obtained for these applications; our experiments suggest that HBDSC's focus on efficient resource management leads to significant parallelization speedups on both shared and distributed memory systems, improving upon DSC results, as shown by the comparison of the sequential and parallelized versions of these five applications running on both OpenMP and MPI frameworks.

We also provide a comparative study between our task parallelization implementation in PIPS and that of the audio signal processing language Faust, using two Faust applications: Karplus32 and Freeverb.

8.1 Introduction

The HBDSC algorithm presented in Chapter 6 has been designed to offer better task parallelism extraction performance for parallelizing compilers than traditional list-scheduling techniques such as DSC. To verify its effectiveness, BDSC has been implemented in the PIPS compiler infrastructure and tested on programs written in C. Both the HBDSC algorithm and SPIRE are implemented in PIPS. Recall that HBDSC is the hierarchical layer that adapts a scheduling algorithm (DSC, BDSC), which handles a DAG (sequence), to a hierarchical code. The analyses for estimating the execution times of tasks and the dependences and communications between tasks, used for computing tlevel and blevel, are also implemented in PIPS.

In this chapter, we provide experimental BDSC-vs-DSC comparison results based on the parallelization of five benchmarks, namely ABF [51], Harris [94], equake [22], IS [85] and FFT [8]. We chose these particular benchmarks since they are well-known and exhibit task parallelism that we hope our approach will be able to take advantage of.


Note that, since our focus in this chapter is to compare BDSC with prior work, namely DSC, our experiments in this chapter do not address the hierarchical component of HBDSC.

BDSC uses numerical constants to label its input graph; we show in Section 6.3 how we lift this restriction, using static and dynamic approaches based on array region and complexity analyses. In this chapter, we study experimentally the input sensitivity issues of the task and communication time estimations, in order to assess the scheduling robustness of BDSC, since it relies on the approximations provided by these analyses.

Several existing systems already automate the parallelization of programs targeting different languages (see Section 6.6), such as the non-open-source compiler OSCAR [62]. With the goal of comparing our approach with others, we choose the audio signal processing language Faust [86], because it is an open-source compiler that also automatically generates parallel tasks in OpenMP.

The remainder of this chapter is structured as follows. We detail the implementation of HBDSC in Section 8.2. Section 8.3 presents the scientific benchmarks under study (Harris, ABF, equake, IS and FFT), the two target machines that we use for our measurements, and also the effective cost model parameters that depend on each target machine. Section 8.4 explains the process we follow to parallelize these benchmarks. Section 8.5 provides performance results when these benchmarks are parallelized on the PIPS platform, targeting a shared memory architecture and a distributed memory architecture. We also assess the sensitivity of our parallelization technique to the accuracy of the static approximations of the code execution time used in task scheduling in Section 8.6. Section 8.7 provides a comparative study between our task parallelization implementation in PIPS and that of Faust, when dealing with audio signal processing benchmarks.

8.2 Implementation of HBDSC- and SPIRE-Based Parallelization Processes in PIPS

We have implemented the HBDSC parallelization (Chapter 6) and some SPIRE-based parallel code transformation and generation (Chapter 7) algorithms in PIPS. PIPS is structured as a collection of independent code analysis and transformation phases that we call passes. This section describes the order and the composition of the passes that we have implemented in PIPS to enable the parallelization of sequential C77 language programs. Figure 8.1 summarizes the parallelization process.

These passes are managed by the Pipsmake library (see the URL below). Pipsmake manages objects that are resources stored in memory and/or on disk. The passes are

Pipsmake documentation: http://cri.ensmp.fr/PIPS/pipsmake.html


[Figure 8.1 depicts the flow: the C source code and input data feed the PIPS analyses (complexities, convex array regions, DDG) on the sequential IR; the Sequence Dependence Graph (SDG) and the cost models (static, instrumentation, numerical profile) feed HBDSC, which produces an unstructured SPIRE(PIPS IR) schedule; structuring yields structured SPIRE(PIPS IR), from which the OpenMP and MPI generations produce the OpenMP and MPI codes.]

Figure 8.1: Parallelization process: blue indicates thesis contributions; an ellipse, a pass or a set of passes; and a rectangle, results.

described by generic rules that indicate used and produced resources; examples of these rules are given in the following subsections.


Pipsmake maintains the global data consistency intra- and inter-procedurally.

The goal of the following passes is to generate the parallel code expressed in SPIRE(PIPS IR), using the HBDSC scheduling algorithm, then communication primitives, and finally OpenMP and MPI codes, in order to automate the task parallelization of sequential programs.

Tpips (http://www.cri.ensmp.fr/pips/tpips-user-manual.htdoc/tpips-user-manual.pdf) is the command-line interface and scripting language of the PIPS system. The execution of the necessary passes is obtained by typing their names in the command-line interpreter of PIPS, tpips. Figure 8.2 depicts harris.tpips, the script used to execute our shared and distributed memory parallelization processes on the benchmark Harris; it chains the passes described in the following subsections.

   # Create a workspace for the program harris.c
   create harris harris.c

   apply SEQUENCE_DEPENDENCE_GRAPH[main]
   shell dot -Tpng harris.database/main/main_sdg.dot > main_sdg.png
   shell gqview main_sdg.png

   setproperty HBDSC_NB_CLUSTERS 3
   apply HBDSC_PARALLELIZATION[main]
   apply SPIRE_UNSTRUCTURED_TO_STRUCTURED[main]

   # Print the result on stdout
   display PRINTED_FILE[main]

   echo OMP style
   activate OPENMP_TASK_GENERATION
   activate PRINT_PARALLELIZEDOMPTASK_CODE
   display PARALLELPRINTED_FILE[main]

   echo MPI style
   apply HBDSC_GEN_COMMUNICATIONS[main]
   activate MPI_TASK_GENERATION
   activate PRINT_PARALLELIZEDMPI_CODE
   display PARALLELPRINTED_FILE[main]

   # Print the scheduled SDG of harris
   shell dot -Tpng harris.database/main/main_ssdg.dot > main_ssdg.png
   shell gqview main_scheduled_ssdg.png

   close
   quit

Figure 8.2: Executable (harris.tpips) for harris.c.



8.2.1 Preliminary Passes

Sequence Dependence Graph Pass

Pass sequence_dependence_graph generates the Sequence Dependence Graph (SDG), as explained in Section 6.2. It uses in particular the dg resource (data dependence graph) of PIPS, and outputs the resulting SDG in the resource sdg.

   sequence_dependence_graph   > MODULE.sdg
           < PROGRAM.entities
           < MODULE.code
           < MODULE.proper_effects
           < MODULE.dg
           < MODULE.regions
           < MODULE.transformers
           < MODULE.preconditions
           < MODULE.cumulated_effects
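As an illustration of what this pass computes, consider the following small C sequence (made up for the example, not taken from the benchmarks); the vertices of its SDG are its three loop statements and its edges are the data dependences between them:

   /* Illustrative input sequence; S1, S2 and S3 name its three statements. */
   void sdg_example(int n, int a[n], int b[n], int c[n])
   {
       for (int i = 0; i < n; i++)         /* S1: writes a            */
           a[i] = i;
       for (int i = 0; i < n; i++)         /* S2: writes b            */
           b[i] = 2 * i;
       for (int i = 0; i < n; i++)         /* S3: reads a and b       */
           c[i] = a[i] + b[i];
   }
   /* Expected SDG: vertices {S1, S2, S3}; flow-dependence edges S1->S3 (on a)
      and S2->S3 (on b); no edge between S1 and S2, so they may become two
      parallel tasks once the graph is scheduled. */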

Code Instrumentation Pass

We model the communication costs, data sizes and execution times of the different tasks with polynomials. These polynomials have to be converted to numerical values in order to be used by the BDSC algorithm. We thus instrument, using the pass hbdsc_code_instrumentation, the input sequential code and run it once in order to obtain the numerical values of the polynomials. The instrumented code contains the initial user code plus statements that compute the values of the cost polynomials for each statement (as detailed in Section 6.3.2).

   hbdsc_code_instrumentation  > MODULE.code
           < PROGRAM.entities
           < MODULE.code
           < MODULE.proper_effects
           < MODULE.dg
           < MODULE.regions
           < MODULE.summary_complexity
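To give a concrete picture, here is a hedged sketch of what an instrumented fragment could look like; the file name, the task label and the polynomial 3*n are made up for the example and are not the actual output of hbdsc_code_instrumentation:

   #include <stdio.h>

   void instrumented_fragment(int n, double in[n], double coef[n], double out[n])
   {
       /* Original user statement (a loop of the sequential code): */
       for (int i = 0; i < n; i++)
           out[i] = in[i] * coef[i];

       /* Added instrumentation: evaluate the cost polynomial of the loop above
          for the current value of n (illustratively, 3*n cycles) and record it
          for later use by BDSC. */
       FILE *instr = fopen("abf.instrumented", "a");   /* hypothetical file name */
       if (instr != NULL) {
           fprintf(instr, "task_12 execution_time = %ld\n", (long)(3L * n));
           fclose(instr);
       }
   }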

Path Transformer Pass

The path transformer (see Section 6.4) between two statements computes the possible changes performed by a piece of code delimited by two statements, Sbegin and Send, enclosed within a statement S. The goal of the following pass is to compute the path transformer between two statements labeled by the labels sbegin and ssend in the code resource. It outputs a resource file that contains the resulting transformer.


   path_transformer            > MODULE.path_transformer_file
           < PROGRAM.entities
           < MODULE.code
           < MODULE.proper_effects
           < MODULE.transformers
           < MODULE.preconditions
           < MODULE.cumulated_effects
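As an illustration, consider the following small made-up fragment; sbegin and ssend stand for the two labels given to the pass, and the final comment states the affine relation one would expect the resulting path transformer to contain:

   void path_transformer_example(int x)
   {
       int y, i;
   sbegin:
       y = 0;                    /* Sbegin: first labeled statement           */
       for (i = 0; i < 10; i++)
           x = x + 2;            /* the delimited path adds 2 to x, ten times */
   ssend:
       y = x;                    /* Send: second labeled statement            */
       (void) y;
   }
   /* One would expect the computed path transformer to contain, among others,
      the affine constraint x#end = x#init + 20 (and i#end = 10), so that array
      regions expressed in the store of Sbegin can be compared with regions
      expressed in the store of Send. */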

8.2.2 Task Parallelization Passes

HBDSC Task Parallelization Pass

Pass hbdsc_parallelization applies HBDSC on the resulting SDG (the input resource sdg) and generates parallel code in an unstructured form, using the resource sdg implementing the scheduled SDG. Resource schedule implements our hierarchical schedule function σ (see Section 6.5): it maps each statement to a given cluster. This pass implements the HBDSC algorithm presented in Section 6.5, Algorithm 19. Note that Algorithm 19 uses the unstructured construct instead of the scheduled SDG to represent unstructured parallelism; our prototype uses the SDG instead, for historical reasons. This pass uses the analyses of regions, preconditions, transformers, complexities and effects, and outputs the scheduled SDG in the resource sdg and the function σ in the resource schedule.

   hbdsc_parallelization       > MODULE.sdg
                               > MODULE.schedule
           < PROGRAM.entities
           < MODULE.code
           < MODULE.proper_effects
           < MODULE.sdg
           < MODULE.regions
           < MODULE.summary_complexity
           < MODULE.transformers
           < MODULE.preconditions
           < MODULE.cumulated_effects

Parallelism Structuring Pass

Pass hbdsc_parallelization generates unstructured parallelism. The goal of the following pass is to structure this parallelism. Pass spire_unstructured_to_structured transforms the parallelism generated in the form of graphs (SDGs) into spawn and barrier constructs. It uses the scheduled SDG and the schedule function σ (the resource schedule), generates a new (structured) code, and updates the schedule resource, since it adds to the schedule the newly created statements of the new code. This pass implements Algorithm unstructured_to_structured, presented in Section 7.2, Algorithm 21.

   spire_unstructured_to_structured  > MODULE.code
                                     > MODULE.schedule
           < PROGRAM.entities
           < MODULE.code
           < MODULE.sdg
           < MODULE.schedule

HBDSC Tuning

In PIPS, one can use different properties to parameterize passes and select the precision required or the algorithm to use. Default properties are defined by PIPS, but they can be redefined by the user, using the command setproperty when the tpips interface is used.

The following properties are used by Pass hbdsc_parallelization to allow the user to define the number of clusters and the memory size, based on the properties of the target architecture.

The number of clusters is set by default to 4, and the memory size to -1, which means that we assume an infinite amount of memory.

HBDSC_NB_CLUSTERS 4

HBDSC_MEMORY_SIZE -1

In order to control the granularity of parallel tasks, we introduce the following property. By default, we generate the maximum amount of parallelism in codes; otherwise, i.e. when COSTLY_TASKS_ONLY is TRUE, only loops and function calls are used as parallel tasks. This is important in order to make a trade-off between task creation overhead and task execution time.

COSTLY_TASKS_ONLY FALSE

The next property specifies whether communication data and time information is to be extracted from the result of the instrumentation, stored in the file named by HBDSC_INSTRUMENTED_FILE (a dynamic analysis), or whether the static cost models have to be used. The default option is an empty string; it means that the static results have to be used.

HBDSC_INSTRUMENTED_FILE ""
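For example, a user targeting an 8-core machine could override these defaults from a tpips script as follows; the property names are the ones above, while the numerical values (including the unit assumed for the memory size) and the instrumentation file name are purely illustrative:

   setproperty HBDSC_NB_CLUSTERS 8
   setproperty HBDSC_MEMORY_SIZE 16000000000
   setproperty COSTLY_TASKS_ONLY TRUE
   setproperty HBDSC_INSTRUMENTED_FILE "abf.instrumented"
   apply HBDSC_PARALLELIZATION[main]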

8.2.3 OpenMP Related Passes

SPIRE is designed to help generate code for different types of parallel languages, just like different parallel languages can be mapped into SPIRE. The four following passes show how this intermediate representation simplifies the task of producing code for various languages such as OpenMP and MPI (see Section 8.2.4).


OpenMP Task Generation Pass

Pass openmp_task_generation generates OpenMP task parallel code from SPIRE(PIPS IR), as detailed in Section 7.4.2. It replaces spawn constructs by the OpenMP directive omp task, and barrier constructs by omp single or omp taskwait. It outputs a new resource, parallelized_code, for the OpenMP code.

   openmp_task_generation      > MODULE.parallelized_code
           < PROGRAM.entities
           < MODULE.code
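As an illustration, here is a hedged sketch of the kind of OpenMP task code this pass produces for two independent statements followed by a dependent one; the functions f, g and h and the variables are placeholders, not actual PIPS output:

   #include <omp.h>

   static double f(double v) { return v + 1.0; }   /* placeholder tasks */
   static double g(double v) { return v * 2.0; }
   static double h(double u, double v) { return u + v; }

   void kernel(double x, double y, double *c)
   {
       double a, b;
   #pragma omp parallel
   #pragma omp single              /* barrier construct mapped to omp single */
       {
   #pragma omp task shared(a)      /* first spawn becomes an omp task        */
           a = f(x);
   #pragma omp task shared(b)      /* second spawn becomes an omp task       */
           b = g(y);
   #pragma omp taskwait            /* barrier mapped to omp taskwait         */
           *c = h(a, b);
       }
   }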

OpenMP Prettyprint Pass

The next pass, print_parallelizedOMPTASK_code, is a pretty-printing function. It outputs the code decorated with OpenMP (omp task) directives, using the parallelized_code resource.

   print_parallelizedOMPTASK_code  > MODULE.parallelprinted_file
           < PROGRAM.entities
           < MODULE.parallelized_code

8.2.4 MPI Related Passes: SPMDization

Automatic Data Distribution Pass

Pass hbdsc_gen_communications generates send and recv primitives. It implements the algorithm communications_construction presented in Section 7.3, Algorithm 27. It takes as input the parallel code (< symbol) and the previously computed resources of the sequential code (= symbol). It generates a new parallel code that contains communication calls.

   hbdsc_gen_communications    > MODULE.code
           < PROGRAM.entities
           < MODULE.code
           = MODULE.dg
           = MODULE.schedule
           = MODULE.proper_effects
           = MODULE.preconditions
           = MODULE.regions

MPI Task Generation Pass

The same process applies for MPI. The pass mpi_task_generation generates MPI parallel code from SPIRE(PIPS IR), as detailed in Section 7.4.3. It replaces spawn constructs by guards (if statements), barrier constructs by MPI_Barrier, send functions by MPI_Isend and recv functions by MPI_Recv. Note that we leave the issue of generation of hierarchical MPI as future work; our implementation generates only flat MPI code (a sequence of non-nested tasks).

   mpi_task_generation         > MODULE.parallelized_code
           < PROGRAM.entities
           < MODULE.code

MPI Prettyprint Pass

Pass print_parallelizedMPI_code outputs the code decorated with MPI instructions.

   print_parallelizedMPI_code  > MODULE.parallelprinted_file
           < PROGRAM.entities
           < MODULE.parallelized_code

We generate an MPI code in the format described in Figure 8.3.

   /* declare variables */
   MPI_Init(&argc, &argv);
   /* parse arguments */
   /* main program */
   MPI_Finalize();

Figure 8.3: A model of an MPI generated code.
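Putting these elements together, here is a hedged sketch of what a generated flat MPI program could look like for two tasks mapped on two processes; the functions f, g and h and the data sizes are placeholders, not actual PIPS output:

   #include <mpi.h>

   static double f(double v) { return v + 1.0; }   /* placeholder tasks */
   static double g(double v) { return v * 2.0; }
   static double h(double u, double v) { return u + v; }

   int main(int argc, char *argv[])
   {
       int rank;
       double x = 1.0, y = 2.0, a = 0.0, b = 0.0, c = 0.0;

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       if (rank == 0) {                /* guard replacing the spawn on cluster 0 */
           MPI_Request req;
           a = f(x);
           MPI_Isend(&a, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);  /* send */
           MPI_Wait(&req, MPI_STATUS_IGNORE);
       }
       if (rank == 1) {                /* guard replacing the spawn on cluster 1 */
           b = g(y);
           MPI_Recv(&a, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); /* recv */
           c = h(a, b);                /* dependent task, mapped on cluster 1    */
       }
       (void) c;
       MPI_Finalize();
       return 0;
   }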

8.3 Experimental Setting

We carried out experiments to compare the DSC and BDSC algorithms. BDSC provides significant parallelization speedups on both shared and distributed memory systems on the set of benchmarks that we present in this section. We also present the cost model parameters that depend on the target machine.

8.3.1 Benchmarks: ABF, Harris, Equake, IS and FFT

For comparing DSC and BDSC on both shared and distributed memory systems, we use five benchmarks, ABF, Harris, equake, IS and FFT, briefly presented below.


ABF

The signal processing benchmark ABF (Adaptive Beam Forming) [51] is a 1065-line program that performs adaptive spatial radar signal processing for an array of smart radar antennas, in order to transmit or receive signals in different directions and enhance these signals by minimizing interference and noise. It has been developed by Thales [104] and is publicly available. Figure 8.4 shows the scheduled SDG of the function main of this benchmark on three clusters (κ0, κ1 and κ2). This function contains 7 parallel loops, while the degree of task parallelism is only three (there are at most three parallel tasks). In Section 8.4, we show how we use these loops to create more parallel tasks.

Harris

The image processing algorithm Harris [53] is a 105-line corner detector used to identify points of interest in an image. It uses several functions (operators such as auto-correlation) applied to each pixel of an input image. It is used for feature extraction, motion detection and image matching. In Harris [94], at most three tasks can be executed in parallel. Figure 6.16 on page 122 shows the scheduled SDG for Harris obtained using three processors (a.k.a. clusters).

Equake (SPEC Benchmarks)

The 1432-line SPEC benchmark equake [22] is used in the simulation of seismic wave propagation in large valleys, based on computations performed on an unstructured mesh that locally resolves wavelengths, using the Archimedes finite element method. Equake contains four hot loops: three in the function main and one in the function smvp_opt. A part of the function main is illustrated in Figure 6.5 on page 99. We use these four loops for parallelization.

IS (NAS Parallel Benchmarks)

Integer Sort (IS) is one of the eleven benchmarks in the NAS Parallel Benchmarks suite [85]. The serial version contains 1076 lines. We exploit the hot function rank of IS, which exhibits a degree of task parallelism of two and contains two parallel loops.

FFT

Figure 8.4: Scheduled SDG of the main function of ABF.

Fast Fourier Transform (FFT) is widely used in many areas of science, mathematics and engineering. We use the implementation of FFT3 of [8]. In FFT, for N complex points, both time domain decomposition and frequency domain synthesis require three loops. The outer loop runs through the log2 N stages. The middle loop (chunks) moves through each of the individual frequency spectra (chunks) in the stage being worked on. The innermost loop uses the Butterfly pattern to compute the points for each frequency of the spectrum. We use the middle loop for parallelization, since it is parallel (each chunk can be launched in parallel).
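To make this loop structure concrete, here is a hedged skeleton of the three loops; the butterfly computation is a simplified placeholder without twiddle factors, not the code of [8]. Only the middle loop over chunks is parallel, and it is the one we exploit:

   /* Skeleton of the FFT loop nest described above (illustrative only). */
   void fft_skeleton(int log2n, int n, double re[], double im[])
   {
       for (int stage = 0; stage < log2n; stage++) {       /* outer loop: sequential */
           int span = 1 << stage;
           /* middle loop: each chunk is independent, hence a candidate task */
           for (int chunk = 0; chunk < n; chunk += 2 * span) {
               for (int k = 0; k < span; k++) {            /* innermost: butterflies */
                   int i = chunk + k, j = chunk + k + span;
                   double tr = re[j], ti = im[j];          /* placeholder butterfly, */
                   re[j] = re[i] - tr; im[j] = im[i] - ti; /* no twiddle factors     */
                   re[i] = re[i] + tr; im[i] = im[i] + ti;
               }
           }
       }
   }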


8.3.2 Experimental Platforms: Austin and Cmmcluster

We use the two following experimental platforms for our performance evaluation tests.

Austin

The first target machine is a host Linux (Ubuntu) shared memory machine with a 2-socket AMD quadcore Opteron (8 cores), with M = 16 GB of RAM, running at 2.4 GHz. We call this machine Austin in the remainder of this chapter. The installed version of gcc is 4.6.3, which supports OpenMP 3.0 (we use gcc to compile, with -fopenmp for OpenMP programs and -O3 as optimization flag).

Cmmcluster

The second machine that we target for our performance measurements is a host Linux (RedHat) distributed memory machine with 6 dual-core Intel(R) Xeon(R) processors, with M = 32 GB of RAM per processor, running at 2.5 GHz. We call this machine Cmmcluster in the remainder of this chapter. The installed versions of gcc and Open MPI are 4.4.6 and 1.6.2, respectively; they are used for compilation (we use mpicc to compile MPI programs, with -O3 as optimization flag).

8.3.3 Effective Cost Model Parameters

In the default cost model of PIPS, the number of clock cycles per instruction (CPI) is equal to 1, and the number of clock cycles for the transfer of one byte is β = 1. Instead of using this default cost model, we use an effective model for each target machine, Austin and Cmmcluster. Table 8.1 summarizes the cost model for the integer and floating-point operations of addition, subtraction, multiplication and division, and the transfer cost of one byte. The parameters of the arithmetic operations of this table are extracted from agner.org (http://www.agner.org/optimize/instruction_tables.pdf), which gives measurements of CPU clock cycles for AMD (Austin) and Intel (Cmmcluster) CPUs. The transfer cost of one byte in number of cycles, β, has been measured using a simple MPI program that sends a number of bytes over the Cmmcluster memory network. Since Austin is a shared memory machine that we use to execute OpenMP codes, we do not need to measure β for this machine (NA).

   Operation \ Machine        Austin (int / float)    Cmmcluster (int / float)
   Addition (+)                    2.5 / 2                   3 / 3
   Subtraction (-)                 2.5 / 2                   3 / 3
   Multiplication (*)                3 / 2                  11 / 3
   Division (/)                     40 / 30                 46 / 39
   One byte transfer (β)            NA / NA                2.5 / 2.5

Table 8.1: CPI for the Austin and Cmmcluster machines, plus the transfer cost of one byte β on the Cmmcluster machine (in cycles).

8.4 Protocol

We have extended PIPS with our implementation in C of BDSC-based hierarchical scheduling (HBDSC). To compute the static execution time and communication cost estimates needed by HBDSC, we relied upon the PIPS run-time complexity analysis and a more realistic, architecture-dependent communication cost matrix (see Table 8.1). For each code S of our test benchmarks, PIPS performed automatic task parallelization, applying our hierarchical scheduling process HBDSC(S, P, M, ⊥) (using either BDSC or DSC) on these sequential programs to yield a schedule σunstructured (unstructured code). Then PIPS transforms the schedule σunstructured via unstructured_to_structured(S, P, σunstructured) to yield a new schedule σstructured. PIPS automatically generated an OpenMP [4] version from the parallel statements encoded in SPIRE(PIPS IR) in σstructured(S), using omp task directives; another version, in MPI [3], was also generated. We also applied the DSC scheduling process on these benchmarks and generated the corresponding OpenMP and MPI codes. The execution times have been measured using the gettimeofday time function.
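To make the use of these cost parameters concrete, here is a hedged numerical sketch of how a complexity polynomial is converted into a cycle and time estimate with the Austin values of Table 8.1; the polynomial and the problem size are illustrative and not taken from the benchmarks:

   /* A statement whose complexity analysis yields n float additions and
      n float multiplications has, with the Austin CPIs of Table 8.1
      (2 cycles each), an estimated cost of 2n + 2n = 4n cycles; at 2.4 GHz
      and n = 10^6, this is about 1.7 ms. The code spells out this arithmetic. */
   #include <stdio.h>

   int main(void)
   {
       double n = 1.0e6;                     /* illustrative problem size    */
       double add_cpi = 2.0, mul_cpi = 2.0;  /* float add/mul CPIs on Austin */
       double cycles = n * add_cpi + n * mul_cpi;
       double seconds = cycles / 2.4e9;      /* Austin clock frequency       */
       printf("estimated cost: %.0f cycles, %.2f ms\n", cycles, seconds * 1e3);
       return 0;
   }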

Compilation times for these benchmarks were quite reasonable, the longest (equake) being 84 seconds. In this last instance, most of the time (79 seconds) was spent by PIPS to gather semantic information such as regions, complexities and dependences; our prototype implementation of HBDSC is only responsible for the remaining 5 seconds.

We ran all these parallelized codes on the two shared and distributed memory computing systems presented in Section 8.3.2. To increase the available coarse-grain task parallelism in our test suite, we have used both unmodified and modified versions of our benchmarks. We tiled and fully unrolled by hand the most costly loops in ABF (7 loops), FFT (2 loops) and equake (4 loops); the tiling factor for the BDSC version depends on the number of available processors P (tile size = number of loop iterations / P), while we had to find the proper one for DSC, since DSC puts no constraint on the number of needed processors but returns the number of processors its scheduling requires. For Harris and IS, our experiments have looked at both tiled and untiled versions of the benchmarks. The full unrolling that we apply for our experiments creates a sequence of statements from a nested loop. An example of loop tiling and full unrolling of a loop of the ABF benchmark is illustrated in Figure 8.5. The function COR estimates covariance based on a part of the received signal sel_out, averaged on Nb_rg range gates and nb_pul pulses. Note that the applied loop tiling is equivalent to loop chunking [82].

   // Parallel loop
   for (i = 0; i < 19; i++)
       COR(Nb_rg, nb_pul, sel_out[i], CORR_out[i]);

   (a) Loop on Function COR

   for (i = 0; i < 19; i += 5)
       for (j = i; j < min(i+5, 19); j++)
           COR(Nb_rg, nb_pul, sel_out[j], CORR_out[j]);

   (b) Loop block tiling on Function COR (tile size = 5 for P = 4)

   for (j = 0; j < 5; j++)
       COR(Nb_rg, nb_pul, sel_out[j], CORR_out[j]);
   for (j = 5; j < 10; j++)
       COR(Nb_rg, nb_pul, sel_out[j], CORR_out[j]);
   for (j = 10; j < 15; j++)
       COR(Nb_rg, nb_pul, sel_out[j], CORR_out[j]);
   for (j = 15; j < 19; j++)
       COR(Nb_rg, nb_pul, sel_out[j], CORR_out[j]);

   (c) Full loop unrolling on Function COR

Figure 8.5: Steps for the rewriting of data parallelism to task parallelism, using the tiling and unrolling transformations.

For our experiments, we use M = 16 GB for Austin and M = 32 GB for Cmmcluster; thus, the schedules used for our experiments satisfy these memory sizes.

8.5 BDSC vs. DSC

We apply our parallelization process on the benchmarks ABF, Harris, equake and IS, and execute the parallel versions on both the Austin and Cmmcluster machines. As for FFT, we get an OpenMP version and study how more performance can be extracted using streaming languages.

8.5.1 Experiments on Shared Memory Systems

We measured the execution time of the parallel OpenMP codes on P = 1, 2, 4, 6 and 8 cores of Austin.

Figure 8.6 shows the performance results of the generated OpenMP code for the two versions scheduled using DSC and BDSC, on ABF and equake. The speedup data show that the DSC algorithm is not scalable on these examples when the number of cores is increased; this is due to the generation of more clusters (task creation overhead) with empty slots (poor potential parallelism and bad load balancing) than with BDSC, a costly decision given that, when the number of clusters exceeds P, they have to share the same core as multiple threads.

[Figure 8.6 (plot): speedup vs. number of cores (1, 2, 4, 6, 8) for ABF-DSC, ABF-BDSC, equake-DSC and equake-BDSC; sequential times: ABF 0.35 s, equake 230 s.]

Figure 8.6: ABF and equake speedups with OpenMP.

Figure 8.7 presents the speedup obtained using P = 3, since the maximum parallelism in Harris is three, assuming no exploitation of data parallelism, for two parallel versions: BDSC with and BDSC without tiling of the kernel CoarsitY (we tiled by 3). The performance is given using three different input image sizes: 1024×1024, 2048×1024 and 2048×2048. The best speedup corresponds to the tiled version with BDSC, because in this case the three cores are fully loaded. The DSC version (not shown in the figure) yields the same results as our versions, because the code can be scheduled using three cores.

Figure 8.8 shows the performance results of the generated OpenMP code on the NAS benchmark IS after applying BDSC. The maximum task parallelism without tiling in IS is two, which is shown in the first subchart; the other subcharts are obtained after tiling. The program has been run with three IS input classes (A, B and C [85]). The bad performance of our implementation for Class A programs is due to the large task creation overhead, which dwarfs the potential parallelism gains, even more limited here because of the small size of Class A data.

As for FFT, after tiling and fully unrolling the middle loop of FFT and parallelizing it, an OpenMP version is generated using PIPS. Since we only exploit the parallelism of the middle loop, this is not sufficient to obtain good performance: indeed, a maximum speedup of 1.5 is obtained for an FFT of size n = 2^20 on Austin (8 cores). The sequential outer loop of log2 n iterations is what makes the performance that bad. However, our technique can be seen as a first step of parallelization for languages such as OpenStream.


[Figure 8.7 (plot): speedup of 3 threads vs. sequential for Harris-OpenMP and Harris-tiled-OpenMP, for image sizes 1024x1024 (sequential = 183 ms), 2048x1024 (sequential = 345 ms) and 2048x2048 (sequential = 684 ms).]

Figure 8.7: Speedups with OpenMP: impact of tiling (P = 3).

[Figure 8.8 (plot): OpenMP speedup vs. sequential for the versions OMP (2 tasks), OMP-tiled-2, OMP-tiled-4, OMP-tiled-6 and OMP-tiled-8, for Class A (sequential = 1.68 s), Class B (sequential = 4.55 s) and Class C (sequential = 28 s).]

Figure 8.8: Speedups with OpenMP for different class sizes (IS).


OpenStream [92] is a data-flow extension of OpenMP in which dynamic dependent tasks can be defined. In [91], an efficient version of the FFT is provided where, in addition to parallelizing the middle loop (data parallelism), it exploits possible streaming parallelism between the outer loop stages. Indeed, once a chunk is executed, the chunks that get their dependences satisfied in the next stage start execution (pipeline parallelism). This way, a maximum speedup of 6.6 is obtained on a 4-socket AMD quadcore Opteron with 16 cores, for an FFT of size n = 2^20 [91].

Our parallelization of FFT does not automatically generate pipeline parallelism, but it automatically generates parallel tasks that may also be coupled as streams. In fact, to use OpenStream, the user has to write the input streaming code manually. We can say that, thanks to the code generated using our approach, where the parallel tasks are automatically generated, the user just has to connect these tasks by providing (manually) the data flow necessary to exploit additional potential pipeline parallelism.

8.5.2 Experiments on Distributed Memory Systems

We measured the execution time of the parallel codes on P = 1, 2, 4 and 6 processors of Cmmcluster.

Figure 8.9 presents the speedups of the parallel MPI versions vs. the sequential versions of ABF and equake, using P = 2, 4 and 6 processors. As before, the DSC algorithm is not scalable when the number of processors is increased, since the generation of more clusters with empty slots leads to higher process scheduling costs on processors and a larger communication volume between them.

[Figure 8.9 (plot): speedup vs. number of processors (1, 2, 4, 6) for ABF-DSC, ABF-BDSC, equake-DSC and equake-BDSC; sequential times: ABF 0.25 s, equake 160 s.]

Figure 8.9: ABF and equake speedups with MPI.

Figure 8.10 presents the speedups of the parallel MPI versions vs. the sequential version of Harris, using three processors. The tiled version with BDSC gives the same result as the non-tiled version: the communication overhead, when the three tiled loops are scheduled on three different processors, is so important that BDSC scheduled them on the same processor; this thus led to a schedule equivalent to the one of the non-tiled version. Compared to OpenMP, the speedups decrease when the image size is increased, because the amount of communication between processors increases. The DSC version (not shown in the figure) gives the same results as the BDSC version, because the code can be scheduled using three processors.

Figure 8.11 shows the performance results of the generated MPI code on the NAS benchmark IS after application of BDSC. The same analysis as the one for OpenMP applies here, in addition to communication overhead issues.


[Figure 8.10 (plot): speedup of 3 processes vs. sequential for Harris-MPI and Harris-tiled-MPI, for image sizes 1024x1024 (sequential = 97 ms), 2048x1024 (sequential = 244 ms) and 2048x2048 (sequential = 442 ms).]

Figure 8.10: Speedups with MPI: impact of tiling (P = 3).

[Figure 8.11 (plot): MPI speedup vs. sequential for the versions MPI (2 tasks), MPI-tiled-2, MPI-tiled-4 and MPI-tiled-6, for Class A (sequential = 0.26 s), Class B (sequential = 1.69 s) and Class C (sequential = 13.57 s).]

Figure 8.11: Speedups with MPI for different class sizes (IS).

8.6 Scheduling Robustness

Since our BDSC scheduling heuristic relies on numerical approximations of the execution and communication times of tasks, one needs to assess its sensitivity to the accuracy of these estimations. Since a mathematical analysis of this issue is made difficult by the heuristic nature of BDSC, and in fact of scheduling processes in general, we provide below experimental data that show that our approach is rather robust.

In practice, we ran multiple versions of each benchmark using various static execution and communication cost models:

• the naive variant, in which all execution times and communication costs are supposed constant (only data dependences are enforced during the scheduling process);

• the default BDSC cost model described above;


• a biased BDSC cost model, where we modulated each execution time and communication cost value randomly by at most ∆ (the default BDSC cost model would thus correspond to ∆ = 0).

Our intent, with the introduction of different cost models, is to assess how small to large differences in our estimation of task times and communication costs impact the performance of BDSC-scheduled parallel code. We would expect that parallelization based on the naive cost model would yield the worst schedules, thus motivating our use of complexity analysis for parallelization purposes, if the schedules that use our default cost model are indeed better. Adding small random biases to task times and communication costs should not modify the schedules too much (demonstrating stability), while adding larger ones might, showing the quality of the default cost model used for parallelization.
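A minimal sketch of the kind of perturbation used for the biased variant is shown below; the exact way ∆ is applied in our experiments may differ, this simply illustrates increasing each static estimate by a random amount of at most ∆ percent:

   #include <stdlib.h>

   /* Perturb a static cost estimate by a random increment of at most
      delta_percent percent (delta_percent = 0 gives the default model). */
   static double biased_cost(double static_cost, double delta_percent)
   {
       double r = (double) rand() / RAND_MAX;              /* uniform in [0, 1] */
       return static_cost * (1.0 + (delta_percent / 100.0) * r);
   }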

Table 8.2 provides, for four benchmarks (Harris, ABF, equake and IS) and each execution environment (OpenMP and MPI), the worst execution time obtained within batches of about 20 runs of programs scheduled using the naive, default and biased cost models. For this last case, we only kept in the table the entries corresponding to significant values of ∆, namely those at which, for at least one benchmark, the running time changed. So, for instance, when running ABF on OpenMP, the naive approach run time is 321 ms, while BDSC clocks at 214 ms; adding random increments to the task communication and execution estimations provided by our cost model (Section 6.3) of up to, but not including, 80% does not change the scheduling, and thus the running time. At 80%, the running time increases to 230 ms, and reaches 297 ms when ∆ = 3,000%.

   Benchmark  Language  BDSC  Naive  ∆=50(%)  80   100  200  1000  3000
   Harris     OpenMP     153   277     153    153  153  153   153   277
              MPI        303   378     303    303  303  303   303   378
   ABF        OpenMP     214   321     214    230  230  246   246   297
              MPI        240   310     240    260  260  287   287   310
   equake     OpenMP      58   134      58     58   80   80    80   102
              MPI        106   206     106    106  162  162   162   188
   IS         OpenMP      16    35      20     20   20   25    25    29
              MPI         25    50      32     32   32   39    39    46

Table 8.2: Run-time sensitivity of BDSC with respect to static cost estimation (in ms for Harris and ABF, in s for equake and IS).

As expected, the naive variant always provides schedules that have the worst execution times, thus motivating the introduction of performance estimation in the scheduling process. Even more interestingly, our experiments show that one needs to introduce rather large task time and communication cost estimation errors, i.e. values of ∆, to make the BDSC-based scheduling process switch to less efficient schedules. This set of experimental data thus suggests that BDSC is a rather useful and robust heuristic, well adapted to the efficient parallelization of scientific benchmarks.

8.7 Faust Parallel Scheduling vs. BDSC

In music the passions enjoy themselves. (Friedrich Nietzsche)

Faust (Functional AUdio STream) [86] is a programming language for real-time signal processing and synthesis. Faust programs (DSP code) are translated into equivalent imperative programs (C, C++, Java, etc.). In this section, we compare our approach, which generates omp task directives, with the Faust approach, which generates omp section directives.

8.7.1 OpenMP Version in Faust

The Faust compiler produces a single sample computation loop using the scalar code generator. Then, the vectorial code generator splits the sample processing loop into several simpler loops that communicate by vectors, and produces a DAG. Next, this DAG is topologically sorted in order to detect loops that can be executed in parallel. After that, the parallel scheme generates OpenMP parallel section directives using this DAG, in which each node is a computation loop; the granularity of tasks is only at the loop level. Finally, the compiler generates OpenMP sections directives, in which a pragma section is generated for each loop.
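For comparison with the omp task code we generate, the following hedged sketch shows the shape of such sections-based code for two independent loops of one DAG level (illustrative only, not actual Faust output):

   /* Two independent computation loops of one DAG level, one omp section each. */
   void faust_sections_shape(int count, float *v1, float *v2)
   {
   #pragma omp parallel sections
       {
   #pragma omp section
           for (int i = 0; i < count; i++) { v1[i] = 0.5f * v1[i]; }   /* loop L1 */
   #pragma omp section
           for (int i = 0; i < count; i++) { v2[i] = v2[i] + 1.0f; }   /* loop L2 */
       }
   }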

8.7.2 Faust Programs: Karplus32 and Freeverb

In order to compare our approach with Faust's in generating task parallelism, we use Karplus32 and Freeverb as running examples [7].

Karplus32

The 200-line Karplus32 program is used to simulate the sound produced by a string excitation passed through a resonator to modify the final sound. It is an advanced version of a typical Karplus algorithm, with 32 resonators in parallel instead of only one resonator.

Freeverb

The 800-line Freeverb program is free software that implements artificial reverberation. It filters echoes to generate a signal with reverberation. It uses four Schroeder allpass filters in series and eight parallel Schroeder-Moorer filtered-feedback comb-filters for each audio channel.


8.7.3 Experimental Comparison

Figure 8.12 shows the performance results of the generated OpenMP code on Karplus32 after application of BDSC and of Faust, for two sample buffer sizes, count equal to 1024 and 2048. Note that BDSC is performed on the C codes of these two programs generated using Faust. The hot function in Karplus32 (compute, of 90 lines) contains only 2 sections in parallel; sequential code represents about 60% of the whole code of Function compute. Since we have only two threads that execute only two tasks, Faust execution is a little better than BDSC. The benefit of the dynamic scheduling of omp task cannot be shown in this case, due to the small amount of parallelism in this program.

[Figure 8.12 (plot): speedup of 2 threads vs. sequential for Faust and BDSC on Karplus32(1024) (sequential = 12 ms) and Karplus32(2048) (sequential = 18 ms).]

Figure 8.12: Run-time comparison (BDSC vs. Faust, for Karplus32).

Figure 8.13 shows the performance results of the generated OpenMP code on Freeverb after application of BDSC and of Faust, for two sizes, count equal to 1024 and 2048. The hot function in Freeverb (compute, of 500 lines) contains up to 16 sections in parallel (the degree of task parallelism is equal to 16); sequential code represents about 3% of the whole code of Function compute. BDSC generates better schedules, since it is based on sophisticated cost models and uses top level and bottom level values for ordering, while Faust proceeds by a topological ordering (only top level) and is based only on dependence analysis. In this case, results show that BDSC generates better schedules and that the generation of omp task instead of omp section is beneficial. Furthermore, the dynamic scheduler of OpenMP can make better decisions at run time.

[Figure 8.13 (plot): speedup of 8 threads vs. sequential for Faust and BDSC on Freeverb(1024) (sequential = 40 ms) and Freeverb(2048) (sequential = 80 ms).]

Figure 8.13: Run-time comparison (BDSC vs. Faust, for Freeverb).

8.8 Conclusion

This chapter describes the integration of SPIRE and HBDSC within the PIPS compiler infrastructure, and their application to the automatic task parallelization of seven sequential programs. We illustrate the impact of this integration using the signal processing benchmark ABF (Adaptive Beam Forming), the image processing benchmark Harris, the SPEC benchmark equake and the NAS parallel benchmark IS, on both shared and distributed memory systems. We also provide a comparative study between our task parallelization implementation in PIPS and that of the audio signal processing language Faust, using two Faust programs, Karplus32 and Freeverb. Experimental results show that BDSC provides more efficient schedules than DSC and the Faust scheduler in these cases.

The application of BDSC to the parallelization of the FFT (OpenMP) shows the limits of our task parallelization approach, but we suggest that an interface with streaming languages such as OpenStream could be derived from our work.

Chapter 9

Conclusion

He who passes by the most beautiful story of his life will only have the age of his regrets, and all the sighs in the world will not be able to soothe his soul. (Yasmina Khadra)

Designing an auto-parallelizing compiler is a big challenge: the interest in compiling for parallel architectures has exploded in the last decade with the widespread availability of manycores and multiprocessors, and the need for automatic task parallelization, which is a tricky task, is still unsatisfied. In this thesis, we have exploited the parallelism present in applications in order to take advantage of the performance benefits that multiprocessors can provide. Indeed, we have delegated the "think parallel" mindset to the compiler, which detects parallelism in sequential codes and automatically writes the equivalent efficient parallel code.

We have developed an automatic task parallelization methodology for compilers; the key characteristics we focus on are resource constraints and static scheduling. It contains all the techniques required to decompose applications into tasks and generate equivalent parallel code, using a generic approach that targets different parallel languages and different architectures. We have applied this extension methodology in the existing tool PIPS, a comprehensive source-to-source compilation platform.

To conclude this dissertation, we review its main contributions and results, and present future work opportunities.

9.1 Contributions

This thesis contains five main contributions.

HBDSC (Chapter 6)

We have extended the DSC scheduling algorithm to handle simultaneously two resource constraints, namely a bounded amount of memory per processor and a bounded number of processors, which are key parameters when scheduling tasks on actual multiprocessors. We have called this extension Bounded DSC (BDSC). The practical use of BDSC in a hierarchical manner led to the development of a BDSC-based hierarchical scheduling algorithm (HBDSC).

A standard scheduling algorithm works on a DAG (a sequence of statements). To apply HBDSC on real applications, our approach was to first build a Sequence Dependence DAG (SDG), which is the input of the BDSC algorithm. Then we used the code, presented in the form of an AST, to define a hierarchical mapping function, which we call H, that maps each sequence statement of the code to its SDG. H is used as input of the HBDSC algorithm.

Since the volume of data used or exchanged by SDG tasks and their execution times are key factors in the BDSC scheduling process, we estimate this information as precisely as possible. We provide new cost models, based on execution time complexity estimation and convex polyhedral approximations of array sizes, for the labeling of SDG vertices and edges. We look first at the size of array regions, to estimate precisely the communication costs and the data sizes used by tasks. Then we compute Ehrhart polynomials that represent the size of the messages to communicate and the volume of data, in number of bytes; they are then converted into numbers of cycles. Besides, the execution time of each task is computed using a static execution time approach, based on a program complexity analysis provided by PIPS; complexities, in number of cycles, are also represented via polynomials. Finally, we use a static method, when possible, or a code instrumentation method to convert these polynomials into the numerical values required by BDSC.

Path Transformer Analysis (Chapter 6)

The cost models that we define in this thesis use the affine relations present in array regions to estimate the communication and data volumes induced by the execution of statements. We used a set of operations on array regions, such as the function regions_intersection (see Section 6.3.1). However, one must be careful, since region operations should be defined with respect to a common memory store. In this thesis, we introduced a new analysis that we call the "path transformer" analysis. A path transformer permits the comparison of array regions of statements originally defined in different memory stores. We present a recursive algorithm for the computation of path transformers between two statements, at specific iterations of their enclosing loops (if they exist). This algorithm computes affine relations between program variables at any two points in the program, from the memory store of the first statement to the memory store of the second one, along any sequential execution that links the two memory stores.

SPIRE (Chapter 4)

An automatic parallelization platform should be adapted to different parallel languages. In this thesis, we introduced SPIRE, a new and general 3-step extension methodology for mapping any intermediate representation (IR) used in compilation platforms for representing sequential programming constructs to a parallel IR. This extension of an existing IR introduces (1) a parallel execution attribute for each group of statements, (2) a high-level synchronization attribute on each statement and an API for low-level synchronization events, and (3) two built-ins for implementing communications in message-passing memory systems.

A formal semantics of SPIRE transformational definitions is specified using a two-tiered approach: a small-step operational semantics for its base parallel concepts and a rewriting mechanism for high-level constructs. The SPIRE methodology is presented via a use case, the intermediate representation of PIPS, a powerful source-to-source compilation infrastructure for Fortran and C. In addition, we apply SPIRE to another IR, namely the one of the widely used LLVM compilation infrastructure. This semantics could be used in the future for proving the legality of parallel transformations of parallel codes.

Parallel Code Transformations and Generation (Chapter 7)

In order to test the flexibility of the parallelization approach presented in this thesis, we laid the first brick of a framework of parallel code transformations and generation, using our BDSC- and SPIRE-based hierarchical task parallelization techniques. We provide two parallel transformations of SPIRE-based constructs: first, from unstructured to structured parallelism, for a high-level abstraction of parallel constructs; and second, the conversion of shared memory programs to distributed ones. Moreover, we show the generation of task parallelism and communications at different AST levels: equilevel and hierarchical. Finally, we generate both OpenMP (hierarchical) and MPI (only flat) from the same parallel IR derived from SPIRE.

Implementation and Experimental Results (Chapter 8)

A prototype of the algorithms presented in this thesis has been implemented in the PIPS source-to-source compilation framework: (1) HBDSC-based automatic parallelization for programs encoded in SPIRE(PIPS IR), including BDSC, HBDSC, the SDG, the cost models and the path transformer, plus DSC for comparison purposes; (2) SPIRE-based parallel code generation, targeting two parallel languages, OpenMP and MPI; and (3) two SPIRE-based parallel code transformations, from unstructured to structured code and from shared to distributed memory code.

We apply our parallelization technique to scientific applications. We provide performance measurements for our parallelization approach, based on five significant programs: the image and signal processing benchmarks Harris and ABF, the SPEC2001 benchmark equake, the NAS parallel benchmark IS, and the FFT, targeting both shared and distributed memory architectures. We use their automatic translations into two parallel languages: OpenMP and MPI. Finally, we provide a comparative study between our task parallelization implementation in PIPS and that of the audio signal processing language Faust, using two Faust applications: Karplus32 and Freeverb. Our results illustrate the ability of SPIRE(PIPS IR) to efficiently express the parallelism present in scientific applications. Moreover, our experimental data suggest that our parallelization methodology is able to provide parallel codes that encode a significant amount of parallelism, and is thus well adapted to the design of parallel target formats for the efficient parallelization of scientific applications on both shared and distributed memory systems.

9.2 Future Work

The auto-parallelizing problem is a set of many tricky subproblems, as illustrated throughout this thesis and in the summary of contributions presented above. Many new ideas and opportunities for further research have popped up all along this thesis.

SPIRE

SPIRE is a "work in progress" for the extension of sequential IRs to completely fit multiple languages and different models. Future work will address the representation via SPIRE of the PGAS memory model, of other parallel execution paradigms such as speculation, and of more programming features such as exceptions. Another interesting question is to see how many of the existing program transformations performed by compilers using sequential IRs can be preserved when these are extended via SPIRE.

HBDSC Scheduling

HBDSC assumes that all processors are homogeneous; it does not handle heterogeneous devices efficiently. However, one can probably handle these architectures easily by adding another dimension to our cost models. We suggest adding a "which processor" parameter to all the functions determining the cost models, such as task_time (we have to define the task time when the task is mapped on a given processor) and edge_cost (we have to define its cost when the two tasks of the edge are mapped on given processors). Moreover, each processor κi has its own memory Mi. Thus, our parallelization process may subsequently target an adapted language for heterogeneous architectures, such as StarSs [21]. StarSs introduces an extension to the task mechanism in OpenMP 3.0. It adds many features to the construct pragma omp task, namely (1) dependences between tasks, using the clauses input, output and inout, and (2) targeting of a task to hardware accelerators, using pragma omp target device(device-name-list), knowing that an accelerator can be an SPE, a GPU, etc.


Moreover, it might be interesting to try to find a more efficient hierarchical processor allocation strategy, which would address load balancing and task granularity issues, in order to yield an even better HBDSC-based parallelization process.

Cost Models

Currently, there are some cases where our cost models cannot provide precise or even approximate information. When the array is a pointer (for example, with the declaration float *A, where A is used as an array), the compiler has no information about the size of the array; currently, for such cases, we have to change the programmer's code manually to help the compiler. Moreover, the complexity estimation of a while loop is currently not computed, but there is already on-going work (at CRI) on this issue, based on transformer analysis.

Parallel Code Transformations and Generation

Future work could address more transformations for parallel languages (à la [82]) encoded in SPIRE. Moreover, the optimization of communications in the generated code, by eliminating redundant communications and aggregating small messages into larger ones, is an important issue, since our current implementation communicates more data than necessary.

In our implementation of MPI code generation from SPIRE code, we left the implementation of hierarchical MPI as future work. Besides, another type of hierarchical MPI can be obtained by merging MPI and OpenMP together (hybrid programming). However, since we use in this thesis a hierarchical approach in both HBDSC and SPIRE generation, OpenMP can be seen in this case as a sub-hierarchical model of MPI, i.e., OpenMP tasks could be enclosed within MPI ones.

In the eye of the small, small things are huge; but for the great souls, huge things appear small. (Al-Mutanabbi)

Bibliography

[1] The Mandelbrot Set. http://warp.povusers.org/Mandelbrot.

[2] Cilk 5.4.6 Reference Manual. Supercomputing Technologies Group, MIT Laboratory for Computer Science, http://supertech.lcs.mit.edu/cilk, 1998.

[3] MPI: A Message-Passing Interface Standard. http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf, Sep 2012.

[4] OpenMP Application Program Interface. http://www.openmp.org/mp-documents/OpenMP_4.0_RC2.pdf, Mar 2013.

[5] X10 Language Specification, Feb 2013.

[6] Chapel Language Specification 0.796. Cray Inc., 901 Fifth Avenue, Suite 1000, Seattle, WA 98164, Oct 21, 2010.

[7] Faust, an Audio Signal Processing Language. http://faust.grame.fr.

[8] StreamIt, a Programming Language and a Compilation Infrastructure. http://groups.csail.mit.edu/cag/streamit.

[9] B. Ackland, A. Anesko, D. Brinthaupt, S. Daubert, A. Kalavade, J. Knobloch, E. Micca, M. Moturi, C. Nicol, J. O'Neill, J. Othmer, E. Sackinger, K. Singh, J. Sweet, C. Terman, and J. Williams. A single-chip 1.6-billion 16-b MAC/s multiprocessor DSP. Solid-State Circuits, IEEE Journal of, 35(3):412–424, Mar 2000.

[10] T. L. Adam, K. M. Chandy, and J. R. Dickson. A Comparison of List Schedules For Parallel Processing Systems. Commun. ACM, 17(12):685–690, Dec 1974.

[11] S. V. Adve and K. Gharachorloo. Shared Memory Consistency Models: A Tutorial. IEEE Computer, 29:66–76, 1996.

[12] S. Alam, R. Barrett, J. Kuehn, and S. Poole. Performance Characterization of a Hierarchical MPI Implementation on Large-scale Distributed-memory Platforms. In Parallel Processing, 2009. ICPP '09. International Conference on, pages 132–139, 2009.

[13] E. Allen, D. Chase, J. Hallett, V. Luchangco, J.-W. Maessen, S. Ryu Jr., and S. Tobin-Hochstadt. The Fortress Language Specification. Technical report, Sun Microsystems, Inc., 2007.

[14] R. Allen and K. Kennedy. Automatic Translation of FORTRAN Programs to Vector Form. ACM Trans. Program. Lang. Syst., 9(4):491–542, Oct 1987.

[15] S. P. Amarasinghe and M. S. Lam. Communication Optimization and Code Generation for Distributed Memory Machines. In Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation, PLDI '93, pages 126–138, New York, NY, USA, 1993. ACM.

[16] G. M. Amdahl. Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities. Solid-State Circuits Newsletter, IEEE, 12(3):19–20, Summer 2007.

[17] M. Amini, F. Coelho, F. Irigoin, and R. Keryell. Static Compilation Analysis for Host-Accelerator Communication Optimization. In 24th Int. Workshop on Languages and Compilers for Parallel Computing (LCPC), Fort Collins, Colorado, USA, Sep 2011. Also Technical Report MINES ParisTech A476CRI.

[18] C. Ancourt, F. Coelho, and F. Irigoin. A Modular Static Analysis Approach to Affine Loop Invariants Detection. Electr. Notes Theor. Comput. Sci., 267(1):3–16, 2010.

[19] C. Ancourt, B. Creusillet, F. Coelho, F. Irigoin, P. Jouvelot, and R. Keryell. PIPS: a Workbench for Interprocedural Program Analyses and Parallelization. In Meeting on data parallel languages and compilers for portable parallel computing, 1994.

[20] C. Ancourt and F. Irigoin. Scanning Polyhedra with DO Loops. In Proceedings of the third ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP '91, pages 39–50, New York, NY, USA, 1991. ACM.

[21] E. Ayguade, R. M. Badia, P. Bellens, D. Cabrera, A. Duran, R. Ferrer, M. Gonzalez, F. D. Igual, D. Jimenez-Gonzalez, and J. Labarta. Extending OpenMP to Survive the Heterogeneous Multi-Core Era. International Journal of Parallel Programming, pages 440–459, 2010.

[22] H. Bao, J. Bielak, O. Ghattas, L. F. Kallivokas, D. R. O'Hallaron, J. R. Shewchuk, and J. Xu. Large-scale Simulation of Elastic Wave Propagation in Heterogeneous Media on Parallel Computers. Computer Methods in Applied Mechanics and Engineering, 152(1–2):85–102, 1998.

[23] A. Basumallik, S.-J. Min, and R. Eigenmann. Programming Distributed Memory Systems Using OpenMP. In Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International, pages 1–8, 2007.

[24] N. Benoit and S. Louise. Kimble: a Hierarchical Intermediate Representation for Multi-Grain Parallelism. In F. Bouchez, S. Hack, and E. Visser, editors, Proceedings of the Workshop on Intermediate Representations, pages 21–28, 2011.

[25] G. Blake, R. Dreslinski, and T. Mudge. A Survey of Multicore Processors. Signal Processing Magazine, IEEE, 26(6):26–37, Nov 2009.

[26] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An Efficient Multithreaded Runtime System. In Journal of Parallel and Distributed Computing, pages 207–216, 1995.

[27] U. Bondhugula. Automatic Distributed Memory Code Generation using the Polyhedral Framework. Technical Report IISc-CSA-TR-2011-3, Indian Institute of Science, Bangalore, 2011.

[28] C. T. Brown, L. S. Liebovitch, and R. Glendon. Levy Flights in Dobe Ju/'hoansi Foraging Patterns. Human Ecology, 35(1):129–138, 2007.

[29] S. Campanoni, T. Jones, G. Holloway, V. J. Reddi, G.-Y. Wei, and D. Brooks. HELIX: Automatic Parallelization of Irregular Programs for Chip Multiprocessing. In Proceedings of the Tenth International Symposium on Code Generation and Optimization, CGO '12, pages 84–93, New York, NY, USA, 2012. ACM.

[30] V. Cave, J. Zhao, and V. Sarkar. Habanero-Java: the New Adventures of Old X10. In 9th International Conference on the Principles and Practice of Programming in Java (PPPJ), Aug 2011.

[31] Y. Choi, Y. Lin, N. Chong, S. Mahlke, and T. Mudge. Stream Compilation for Real-Time Embedded Multicore Systems. In Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO '09, pages 210–220, Washington, DC, USA, 2009.

[32] F. Coelho, P. Jouvelot, C. Ancourt, and F. Irigoin. Data and Process Abstraction in PIPS Internal Representation. In F. Bouchez, S. Hack, and E. Visser, editors, Proceedings of the Workshop on Intermediate Representations, pages 77–84, 2011.

[33] T. U. Consortium. http://upc.gwu.edu/documentation.html, Jun 2005.

[34] B. Creusillet. Analyses de Régions de Tableaux et Applications. PhD thesis, MINES ParisTech, A295CRI, Décembre 1996.

[35] B. Creusillet and F. Irigoin. Interprocedural Array Region Analyses. International Journal of Parallel Programming, 24(6):513–546, 1996.

[36] E. Cuevas, A. Garcia, F. J. Fernandez, R. J. Gadea, and J. Cordon. Importance of Simulations for Nuclear and Aeronautical Inspections with Ultrasonic and Eddy Current Testing. Simulation in NDT, Online Workshop in www.ndt.net, Sep 2010.

[37] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck. Efficiently Computing Static Single Assignment Form and the Control Dependence Graph. ACM Trans. Program. Lang. Syst., 13(4):451–490, Oct 1991.

[38] M. I. Daoud and N. N. Kharma. GATS 1.0: a Novel GA-based Scheduling Algorithm for Task Scheduling on Heterogeneous Processor Nets. In GECCO'05, pages 2209–2210, 2005.

[39] J. B. Dennis, G. R. Gao, and K. W. Todd. Modeling The Weather With a Data Flow Supercomputer. IEEE Trans. Computers, pages 592–603, 1984.

[40] E. W. Dijkstra. Re: "Formal Derivation of Strongly Correct Parallel Programs" by Axel van Lamsweerde and M. Sintzoff. Circulated privately, 1977.

[41] E. Ehrhart. Polynômes arithmétiques et méthode de polyèdres en combinatoire. International Series of Numerical Mathematics, page 35, 1977.

[42] A. E. Eichenberger, J. K. O'Brien, K. M. O'Brien, P. Wu, T. Chen, P. H. Oden, D. A. Prener, J. C. Shepherd, B. So, Z. Sura, A. Wang, T. Zhang, P. Zhao, M. K. Gschwind, R. Archambault, Y. Gao, and R. Koo. Using Advanced Compiler Technology to Exploit the Performance of the Cell Broadband Engine™ Architecture. IBM Syst. J., 45(1):59–84, Jan 2006.

[43] P. Feautrier. Dataflow Analysis of Array and Scalar References. International Journal of Parallel Programming, 20, 1991.

[44] J. Ferrante, K. J. Ottenstein, and J. D. Warren. The Program Dependence Graph And Its Use In Optimization. ACM Trans. Program. Lang. Syst., 9(3):319–349, July 1987.

[45] M. J. Flynn. Some Computer Organizations and Their Effectiveness. Computers, IEEE Transactions on, C-21(9):948–960, Sep 1972.

[46] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1990.

[47] A. Gerasoulis, S. Venugopal, and T. Yang. Clustering Task Graphs for Message Passing Architectures. SIGARCH Comput. Archit. News, 18(3b):447–456, June 1990.

[48] M. Girkar and C. D. Polychronopoulos. Automatic Extraction of Functional Parallelism from Ordinary Programs. IEEE Trans. Parallel Distrib. Syst., 3:166–178, Mar 1992.

[49] G. Goumas, N. Drosinos, M. Athanasaki, and N. Koziris. Message-Passing Code Generation for Non-rectangular Tiling Transformations. Parallel Computing, 32, 2006.

[50] Graphviz, Graph Visualization Software. http://www.graphviz.org.

[51] L. Griffiths. A Simple Adaptive Algorithm for Real-Time Processing in Antenna Arrays. Proceedings of the IEEE, 57:1696–1704, 1969.

[52] R. Habel, F. Silber-Chaussumier, and F. Irigoin. Generating Efficient Parallel Programs for Distributed Memory Systems. Technical Report CRIA-523, MINES ParisTech and Télécom SudParis, 2013.

[53] C. Harris and M. Stephens. A Combined Corner and Edge Detector. In Proceedings of the 4th Alvey Vision Conference, pages 147–151, 1988.

[54] M. J. Harrold, B. Malloy, and G. Rothermel. Efficient Construction of Program Dependence Graphs. SIGSOFT Softw. Eng. Notes, 18(3):160–170, July 1993.

[55] J. Howard, S. Dighe, Y. Hoskote, S. Vangal, D. Finan, G. Ruhl, D. Jenkins, H. Wilson, N. Borkar, G. Schrom, F. Pailet, S. Jain, T. Jacob, S. Yada, S. Marella, P. Salihundam, V. Erraguntla, M. Konow, M. Riepen, G. Droege, J. Lindemann, M. Gries, T. Apel, K. Henriss, T. Lund-Larsen, S. Steibl, S. Borkar, V. De, R. Van der Wijngaart, and T. Mattson. A 48-Core IA-32 message-passing processor with DVFS in 45nm CMOS. In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2010 IEEE International, pages 108–109, Feb 2010.

[56] A. Hurson, J. T. Lim, K. M. Kavi, and B. Lee. Parallelization of DOALL and DOACROSS Loops – a Survey. In M. V. Zelkowitz, editor, Emphasizing Parallel Programming Techniques, volume 45 of Advances in Computers, pages 53–103. Elsevier, 1997.

[57] Insieme. Insieme – an Optimization System for OpenMP, MPI and OpenCL Programs. http://www.dps.uibk.ac.at/insieme/architecture.html, 2011.

[58] F. Irigoin, P. Jouvelot, and R. Triolet. Semantical Interprocedural Parallelization: An Overview of the PIPS Project. In ICS, pages 244–251, 1991.

[59] K. Jainandunsing. Optimal Partitioning Scheme for Wavefront/Systolic Array Processors. In Proceedings of IEEE Symposium on Circuits and Systems, pages 940–943, 1986.

[60] P. Jouvelot and R. Triolet. Newgen: A Language Independent Program Generator. Technical Report CRIA-191, MINES ParisTech, Jul 1989.

[61] G. Karypis and V. Kumar. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM Journal on Scientific Computing, 1998.

[62] H. Kasahara, H. Honda, A. Mogi, A. Ogura, K. Fujiwara, and S. Narita. A Multi-Grain Parallelizing Compilation Scheme for OSCAR (Optimally Scheduled Advanced Multiprocessor). In Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing, pages 283–297, London, UK, 1992. Springer-Verlag.

[63] A. Kayi and T. El-Ghazawi. PGAS Languages. wsdmhp09.hpcl.gwu.edu/kayi.pdf, 2009.

[64] P. Kenyon, P. Agrawal, and S. Seth. High-Level Microprogramming: an Optimizing C Compiler for a Processing Element of a CAD Accelerator. In Microprogramming and Microarchitecture. Micro 23. Proceedings of the 23rd Annual Workshop and Symposium, pages 97–106, Nov 1990.

[65] D. Khaldi, P. Jouvelot, and C. Ancourt. Parallelizing with BDSC, a Resource-Constrained Scheduling Algorithm for Shared and Distributed Memory Systems. Technical Report CRIA-499 (Submitted to Parallel Computing), MINES ParisTech, 2012.

[66] D. Khaldi, P. Jouvelot, C. Ancourt, and F. Irigoin. Task Parallelism and Data Distribution: An Overview of Explicit Parallel Programming Languages. In H. Kasahara and K. Kimura, editors, LCPC, volume 7760 of Lecture Notes in Computer Science, pages 174–189. Springer, 2012.

[67] D. Khaldi, P. Jouvelot, F. Irigoin, and C. Ancourt. SPIRE: A Methodology for Sequential to Parallel Intermediate Representation Extension. In Proceedings of the 17th Workshop on Compilers for Parallel Computing, CPC'13, 2013.

[68] M. A. Khan. Scheduling for Heterogeneous Systems using Constrained Critical Paths. Parallel Computing, 38(4–5):175–193, 2012.

[69] Khronos OpenCL Working Group. The OpenCL Specification, Nov. 2012.

[70] B. Kruatrachue and T. G. Lewis. Duplication Scheduling Heuristics (DSH): A New Precedence Task Scheduler for Parallel Processor Systems. Oregon State University, Corvallis, OR, 1987.

[71] S. Kumar, D. Kim, M. Smelyanskiy, Y.-K. Chen, J. Chhugani, C. J. Hughes, C. Kim, V. W. Lee, and A. D. Nguyen. Atomic Vector Operations on Chip Multiprocessors. SIGARCH Comput. Archit. News, 36(3):441–452, June 2008.

[72] Y.-K. Kwok and I. Ahmad. Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors. ACM Comput. Surv., 31:406–471, Dec. 1999.

[73] J. Larus and C. Kozyrakis. Transactional Memory. Commun. ACM, 51:80–88, Jul. 2008.

[74] S. Lee, S.-J. Min, and R. Eigenmann. OpenMP to GPGPU: a Compiler Framework for Automatic Translation and Optimization. In Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '09, pages 101–110, New York, NY, USA, 2009. ACM.

[75] The LLVM Development Team. The LLVM Reference Manual (Version 2.6). http://llvm.org/docs/LangRef.html, Mar. 2010.

[76] V. Maisonneuve. Convex Invariant Refinement by Control Node Splitting: a Heuristic Approach. Electron. Notes Theor. Comput. Sci., 288:49–59, Dec. 2012.

[77] J. Merrill. GENERIC and GIMPLE: a New Tree Representation for Entire Functions. In GCC Developers Summit 2003, pages 171–180, 2003.

[78] D. Millot, A. Muller, C. Parrot, and F. Silber-Chaussumier. STEP: a Distributed OpenMP for Coarse-Grain Parallelism Tool. In Proceedings of the 4th International Conference on OpenMP in a New Era of Parallelism, IWOMP'08, pages 83–99, Berlin, Heidelberg, 2008. Springer-Verlag.


[79] D. I. Moldovan and J. A. B. Fortes. Partitioning and Mapping Algorithms into Fixed Size Systolic Arrays. IEEE Trans. Comput., 35(1):1–12, Jan. 1986.

[80] A. Moody, D. Ahn, and B. Supinski. Exascale Algorithms for Generalized MPI_Comm_split. In Y. Cotronis, A. Danalis, D. Nikolopoulos, and J. Dongarra, editors, Recent Advances in the Message Passing Interface, volume 6960 of Lecture Notes in Computer Science, pages 9–18. Springer Berlin Heidelberg, 2011.

[81] G. Moore. Cramming More Components Onto Integrated Circuits. Proceedings of the IEEE, 86(1):82–85, Jan. 1998.

[82] V. K. Nandivada, J. Shirako, J. Zhao, and V. Sarkar. A Transformation Framework for Optimizing Task-Parallel Programs. ACM Trans. Program. Lang. Syst., 35(1):3:1–3:48, Apr. 2013.

[83] C. J. Newburn and J. P. Shen. Automatic Partitioning of Signal Processing Programs for Symmetric Multiprocessors. In IEEE PACT, pages 269–280. IEEE Computer Society, 1996.

[84] D. Novillo. OpenMP and Automatic Parallelization in GCC. In the Proceedings of the GCC Developers Summit, Jun. 2006.

[85] NPB. NAS Parallel Benchmarks. http://www.nas.nasa.gov/publications/npb.html.

[86] Y. Orlarey, D. Fober, and S. Letz. Adding Automatic Parallelization to Faust. In Linux Audio Conference, 2009.

[87] D. A. Padua, editor. Encyclopedia of Parallel Computing. Springer, 2011.

[88] D. A. Padua and M. J. Wolfe. Advanced Compiler Optimizations for Supercomputers. Commun. ACM, 29:1184–1201, Dec. 1986.

[89] S. Pai, R. Govindarajan, and M. J. Thazhuthaveetil. PLASMA: Portable Programming for SIMD Heterogeneous Accelerators. In Workshop on Language, Compiler, and Architecture Support for GPGPU, held in conjunction with HPCA/PPoPP 2010, Bangalore, India, Jan. 9, 2010.

[90] PolyLib. A Library of Polyhedral Functions. http://icps.u-strasbg.fr/polylib.

[91] A. Pop and A. Cohen. A Stream-Computing Extension to OpenMP. In Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers, HiPEAC '11, pages 5–14, New York, NY, USA, 2011. ACM.


[92] A. Pop and A. Cohen. OpenStream: Expressiveness and Data-Flow Compilation of OpenMP Streaming Programs. ACM Trans. Archit. Code Optim., 9(4):53:1–53:25, Jan. 2013.

[93] T. Saidani. Optimisation multi-niveau d'une application de traitement d'images sur machines parallèles. PhD thesis, Université Paris-Sud, Nov. 2012.

[94] T. Saidani, L. Lacassagne, J. Falcou, C. Tadonki, and S. Bouaziz. Parallelization Schemes for Memory Optimization on the Cell Processor: A Case Study on the Harris Corner Detector. In P. Stenstrom, editor, Transactions on High-Performance Embedded Architectures and Compilers III, volume 6590 of Lecture Notes in Computer Science, pages 177–200. Springer Berlin Heidelberg, 2011.

[95] V. Sarkar. Synchronization Using Counting Semaphores. In ICS'88, pages 627–637, 1988.

[96] V. Sarkar. Partitioning and Scheduling Parallel Programs for Multiprocessors. MIT Press, Cambridge, MA, USA, 1989.

[97] V. Sarkar. COMP 322: Principles of Parallel Programming. http://www.cs.rice.edu/~vs3/PDF/comp322-lec9-f09-v4.pdf, 2009.

[98] V. Sarkar and B. Simons. Parallel Program Graphs and their Classification. In U. Banerjee, D. Gelernter, A. Nicolau, and D. A. Padua, editors, LCPC, volume 768 of Lecture Notes in Computer Science, pages 633–655. Springer, 1993.

[99] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, P. Dubey, S. Junkins, A. Lake, R. Cavin, R. Espasa, E. Grochowski, T. Juan, M. Abrash, J. Sugerman, and P. Hanrahan. Larrabee: A Many-Core x86 Architecture for Visual Computing. Micro, IEEE, 29(1):10–21, 2009.

[100] J. Shirako, D. M. Peixotto, V. Sarkar, and W. N. Scherer. Phasers: A Unified Deadlock-Free Construct for Collective and Point-To-Point Synchronization. In ICS'08, pages 277–288, New York, NY, USA, 2008. ACM.

[101] M. Solar. A Scheduling Algorithm to Optimize Parallel Processes. In Chilean Computer Science Society, 2008. SCCC '08. International Conference of the, pages 73–78, 2008.

[102] M. Solar and M. Inostroza. A Scheduling Algorithm to Optimize Real-World Applications. In Proceedings of the 24th International Conference on Distributed Computing Systems Workshops - W7: EC (ICDCSW'04) - Volume 7, ICDCSW '04, pages 858–862, Washington, DC, USA, 2004. IEEE Computer Society.


[103] Supercomputing Technologies Group, Massachusetts Institute of Technology Laboratory for Computer Science. Cilk 5.4.6 Reference Manual, Nov. 2001.

[104] Thales. THALES Group. http://www.thalesgroup.com.

[105] H. Topcuouglu, S. Hariri, and M.-Y. Wu. Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing. IEEE Trans. Parallel Distrib. Syst., 13(3):260–274, Mar. 2002.

[106] S. Verdoolaege, A. Cohen, and A. Beletska. Transitive Closures of Affine Integer Tuple Relations and Their Overapproximations. In Proceedings of the 18th International Conference on Static Analysis, SAS'11, pages 216–232, Berlin, Heidelberg, 2011. Springer-Verlag.

[107] M.-Y. Wu and D. D. Gajski. Hypertool: A Programming Aid For Message-Passing Systems. IEEE Trans. on Parallel and Distributed Systems, 1:330–343, 1990.

[108] T. Yang and A. Gerasoulis. DSC: Scheduling Parallel Tasks on an Unbounded Number of Processors. IEEE Trans. Parallel Distrib. Syst., 5(9):951–967, 1994.

[109] X.-S. Yang and S. Deb. Cuckoo Search via Lévy Flights. In Nature & Biologically Inspired Computing, 2009. NaBIC 2009. World Congress on, pages 210–214, 2009.

[110] K. Yelick, D. Bonachea, W.-Y. Chen, P. Colella, K. Datta, J. Duell, S. L. Graham, P. Hargrove, P. Hilfinger, P. Husbands, C. Iancu, A. Kamil, R. Nishtala, J. Su, W. Michael, and T. Wen. Productivity and Performance Using Partitioned Global Address Space Languages. In Proceedings of the 2007 International Workshop on Parallel Symbolic Computation, PASCO '07, pages 24–32, New York, NY, USA, 2007. ACM.

[111] J. Zhao and V. Sarkar. Intermediate Language Extensions for Parallelism. In Proceedings of the Compilation of the Co-located Workshops on DSM'11, TMC'11, AGERE!'11, AOOPES'11, NEAT'11, & VMIL'11, SPLASH '11 Workshops, pages 329–340, New York, NY, USA, 2011. ACM.


Abstract

This thesis intends to show how to efficiently exploit the parallelism present in applications in order to enjoy the performance benefits that multiprocessors can provide, using a new automatic task parallelization methodology for compilers. The key characteristics we focus on are resource constraints and static scheduling. This methodology includes the techniques required to decompose applications into tasks and generate equivalent parallel code, using a generic approach that targets both different parallel languages and architectures. We apply this methodology in the existing tool PIPS, a comprehensive source-to-source compilation platform.

This thesis mainly focuses on three issues. First, since extracting task parallelism from sequential codes is a scheduling problem, we design and implement an efficient, automatic scheduling algorithm called BDSC for parallelism detection; the result is a scheduled SDG, a new task graph data structure. In a second step, we design a new generic parallel intermediate representation extension called SPIRE, in which parallelized code may be expressed. Finally, we wrap up our goal of automatic parallelization in a new BDSC- and SPIRE-based parallel code generator, which is integrated within the PIPS compiler framework. It targets both shared and distributed memory systems using automatically generated OpenMP and MPI code.

Résumé

The goal of this thesis is to efficiently exploit the parallelism present in sequential applications in order to benefit from the performance provided by multiprocessors, using a new methodology for automatic task parallelization within compilers. The key characteristics of our approach are the consideration of resource constraints and the static nature of task scheduling. Our methodology includes the techniques required to decompose applications into tasks and to generate equivalent parallel code, using a generic approach that targets different parallel languages and architectures. We implement this methodology in the source-to-source compiler PIPS. This thesis mainly addresses three questions. First, since extracting task parallelism from sequential codes is a scheduling problem, we design and implement an efficient scheduling algorithm, which we call BDSC, for parallelism detection; the result is a scheduled SDG, a new task graph data structure. Second, we propose a new generic extension of sequential intermediate representations into parallel intermediate representations, which we call SPIRE, for the representation of parallel codes. Finally, using BDSC and SPIRE, we develop a code generator that we integrate into PIPS. This code generator targets shared memory and distributed memory systems via automatically generated OpenMP and MPI codes.

Contents

Abstract ii

Résumé iv

1 Introduction 1
1.1 Context 1
1.2 Motivation 2
1.3 Contributions 4
1.4 Thesis Outline 5

2 Perspectives: Bridging Parallel Architectures and Software Parallelism 8
2.1 Parallel Architectures 8
2.1.1 Multiprocessors 9
2.1.2 Memory Models 11
2.2 Parallelism Paradigms 13
2.3 Mapping Paradigms to Architectures 14
2.4 Program Dependence Graph (PDG) 15
2.4.1 Control Dependence Graph (CDG) 15
2.4.2 Data Dependence Graph (DDG) 17
2.5 List Scheduling 18
2.5.1 Background 18
2.5.2 Algorithm 19
2.6 PIPS: Automatic Parallelizer and Code Transformation Framework 21
2.6.1 Transformer Analysis 22
2.6.2 Precondition Analysis 23
2.6.3 Array Region Analysis 24
2.6.4 "Complexity" Analysis 26
2.6.5 PIPS (Sequential) IR 26
2.7 Conclusion 28

3 Concepts in Task Parallel Programming Languages 29
3.1 Introduction 29
3.2 Mandelbrot Set Computation 31
3.3 Task Parallelism Issues 32
3.3.1 Task Creation 32
3.3.2 Synchronization 33
3.3.3 Atomicity 33
3.4 Parallel Programming Languages 34


3.4.1 Cilk 34
3.4.2 Chapel 35
3.4.3 X10 and Habanero-Java 37
3.4.4 OpenMP 39
3.4.5 MPI 40
3.4.6 OpenCL 42
3.5 Discussion and Comparison 44
3.6 Conclusion 48

4 SPIRE: A Generic Sequential to Parallel Intermediate Representation Extension Methodology 49
4.1 Introduction 50
4.2 SPIRE, a Sequential to Parallel IR Extension 51
4.2.1 Design Approach 52
4.2.2 Execution 53
4.2.3 Synchronization 56
4.2.4 Data Distribution 60
4.3 SPIRE Operational Semantics 62
4.3.1 Sequential Core Language 62
4.3.2 SPIRE as a Language Transformer 64
4.3.3 Rewriting Rules 67
4.4 Validation: SPIRE Application to LLVM IR 68
4.5 Related Work: Parallel Intermediate Representations 72
4.6 Conclusion 74

5 BDSC: A Memory-Constrained, Number of Processor-Bounded Extension of DSC 75
5.1 Introduction 75
5.2 The DSC List-Scheduling Algorithm 76
5.2.1 The DSC Algorithm 76
5.2.2 Dominant Sequence Length Reduction Warranty 79
5.3 BDSC: A Resource-Constrained Extension of DSC 80
5.3.1 DSC Weaknesses 80
5.3.2 Resource Modeling 81
5.3.3 Resource Constraint Warranty 82
5.3.4 Efficient Task-to-Cluster Mapping 83
5.3.5 The BDSC Algorithm 84
5.4 Related Work: Scheduling Algorithms 87
5.5 Conclusion 89

6 BDSC-Based Hierarchical Task Parallelization 91
6.1 Introduction 91
6.2 Hierarchical Sequence Dependence DAG Mapping 92
6.2.1 Sequence Dependence DAG (SDG) 94


6.2.2 Hierarchical SDG Mapping 96
6.3 Sequential Cost Models Generation 100
6.3.1 From Convex Polyhedra to Ehrhart Polynomials 100
6.3.2 From Polynomials to Values 103
6.4 Reachability Analysis: The Path Transformer 106
6.4.1 Path Definition 107
6.4.2 Path Transformer Algorithm 107
6.4.3 Operations on Regions using Path Transformers 115
6.5 BDSC-Based Hierarchical Scheduling (HBDSC) 116
6.5.1 Closure of SDGs 117
6.5.2 Recursive Top-Down Scheduling 118
6.5.3 Iterative Scheduling for Resource Optimization 121
6.5.4 Complexity of HBDSC Algorithm 123
6.5.5 Parallel Cost Models 125
6.6 Related Work: Task Parallelization Tools 126
6.7 Conclusion 128

7 SPIRE-Based Parallel Code Transformations and Generation 130
7.1 Introduction 130
7.2 Parallel Unstructured to Structured Transformation 133
7.2.1 Structuring Parallelism 133
7.2.2 Hierarchical Parallelism 137
7.3 From Shared Memory to Distributed Memory Transformation 139
7.3.1 Related Work: Communications Generation 139
7.3.2 Difficulties 140
7.3.3 Equilevel and Hierarchical Communications 144
7.3.4 Equilevel Communications Insertion Algorithm 146
7.3.5 Hierarchical Communications Insertion Algorithm 150
7.3.6 Communications Insertion Main Algorithm 151
7.3.7 Conclusion and Future Work 153
7.4 Parallel Code Generation 155
7.4.1 Mapping Approach 155
7.4.2 OpenMP Generation 157
7.4.3 SPMDization: MPI Generation 159
7.5 Conclusion 161

8 Experimental Evaluation with PIPS 163
8.1 Introduction 163
8.2 Implementation of HBDSC- and SPIRE-Based Parallelization Processes in PIPS 164
8.2.1 Preliminary Passes 167
8.2.2 Task Parallelization Passes 168
8.2.3 OpenMP Related Passes 169


8.2.4 MPI Related Passes: SPMDization 170
8.3 Experimental Setting 171
8.3.1 Benchmarks: ABF, Harris, Equake, IS and FFT 171
8.3.2 Experimental Platforms: Austin and Cmmcluster 174
8.3.3 Effective Cost Model Parameters 174
8.4 Protocol 174
8.5 BDSC vs. DSC 176
8.5.1 Experiments on Shared Memory Systems 176
8.5.2 Experiments on Distributed Memory Systems 179
8.6 Scheduling Robustness 180
8.7 Faust Parallel Scheduling vs. BDSC 182
8.7.1 OpenMP Version in Faust 182
8.7.2 Faust Programs: Karplus32 and Freeverb 182
8.7.3 Experimental Comparison 183
8.8 Conclusion 183

9 Conclusion 185
9.1 Contributions 185
9.2 Future Work 188

List of Tables

2.1 Multicores proliferation 11
2.2 Execution time estimations for Harris functions using PIPS complexity analysis 26
3.1 Summary of parallel languages constructs 48
4.1 Mapping of SPIRE to parallel languages constructs (terms in parentheses are not currently handled by SPIRE) 53
6.1 Execution and communication time estimations for Harris using PIPS default cost model (N and M variables represent the input image size) 104
6.2 Comparison summary between different parallelization tools 128
8.1 CPI for the Austin and Cmmcluster machines, plus the transfer cost of one byte β on the Cmmcluster machine (in cycles) 175
8.2 Run-time sensitivity of BDSC with respect to static cost estimation (in ms for Harris and ABF; in s for equake and IS) 181

List of Figures

1.1 Sequential C implementation of the main function of Harris 3
1.2 Harris algorithm data flow graph 3
1.3 Thesis outline: blue indicates thesis contributions; an ellipse, a process; and a rectangle, results 7
2.1 A typical multiprocessor architectural model 9
2.2 Memory models 11
2.3 Rewriting of data parallelism using the task parallelism model 14
2.4 Construction of the control dependence graph 16
2.5 Example of a C code and its data dependence graph 17
2.6 A Directed Acyclic Graph (left) and its associated data (right) 19
2.7 Example of transformer analysis 23
2.8 Example of precondition analysis 24
2.9 Example of array region analysis 25
2.10 Simplified Newgen definitions of the PIPS IR 27
3.1 Result of the Mandelbrot set 31
3.2 Sequential C implementation of the Mandelbrot set 32
3.3 Cilk implementation of the Mandelbrot set (P is the number of processors) 34
3.4 Chapel implementation of the Mandelbrot set 36
3.5 X10 implementation of the Mandelbrot set 37
3.6 A clock in X10 (left) and a phaser in Habanero-Java (right) 38
3.7 C OpenMP implementation of the Mandelbrot set 39
3.8 MPI implementation of the Mandelbrot set (P is the number of processors) 41
3.9 OpenCL implementation of the Mandelbrot set 43
3.10 A hide-and-seek game (X10, HJ, Cilk) 46
3.11 Example of an atomic directive in OpenMP 47
3.12 Data race on ptr with Habanero-Java 47
4.1 OpenCL example illustrating a parallel loop 54
4.2 forall in Chapel and its SPIRE core language representation 54
4.3 A C code and its unstructured parallel control flow graph representation 55
4.4 parallel sections in OpenMP and its SPIRE core language representation 55
4.5 OpenCL example illustrating spawn and barrier statements 57
4.6 Cilk and OpenMP examples illustrating an atomically-synchronized statement 57


4.7 X10 example illustrating a future task and its synchronization 58
4.8 A phaser in Habanero-Java and its SPIRE core language representation 59
4.9 Example of Pipeline Parallelism with phasers 59
4.10 MPI example illustrating a communication and its SPIRE core language representation 61
4.11 SPIRE core language representation of a non-blocking send 61
4.12 SPIRE core language representation of a non-blocking receive 61
4.13 SPIRE core language representation of a broadcast 62
4.14 Stmt and SPIRE(Stmt) syntaxes 63
4.15 Stmt sequential transition rules 63
4.16 SPIRE(Stmt) synchronized transition rules 66
4.17 if statement rewriting using while loops 68
4.18 Simplified Newgen definitions of the LLVM IR 70
4.19 A loop in C and its LLVM intermediate representation 70
4.20 SPIRE (LLVM IR) 71
4.21 An example of a spawned outlined sequence 71
5.1 A Directed Acyclic Graph (left) and its scheduling (right); starred top levels (*) correspond to the selected clusters 78
5.2 Result of DSC on the graph in Figure 5.1 without (left) and with (right) DSRW 80
5.3 A DAG amenable to cluster minimization (left) and its BDSC step-by-step scheduling (right) 84
5.4 DSC (left) and BDSC (right) cluster allocation 85
6.1 Abstract syntax trees: Statement syntax 92
6.2 Parallelization process: blue indicates thesis contributions; an ellipse, a process; and a rectangle, results 93
6.3 Example of a C code (left) and the DDG D of its internal S sequence (right) 97
6.4 SDGs of S (top) and S0 (bottom) computed from the DDG (see the right of Figure 6.3); S and S0 are specified in the left of Figure 6.3 98
6.5 SDG G for a part of equake S given in Figure 6.7; Gbody is the SDG of the body Sbody (logically included into G via H) 99
6.6 Example of the execution time estimation for the function MultiplY of the code Harris; each comment provides the complexity estimation of the statement below (N and M are assumed to be global variables) 103
6.7 Instrumented part of equake (Sbody is the inner loop sequence) 105
6.8 Numerical results of the instrumented part of equake (instrumented equake.in) 106
6.9 An example of a subtree of root forloop0 108


6.10 An example of a path execution trace from Sb to Se in the subtree forloop0 108
6.11 An example of a C code and the path transformer of the sequence between Sb and Se; M is ⊥ since there are no loops in the code 111
6.12 An example of a C code and the path transformer of the sequence of calls and test between Sb and Se; M is ⊥ since there are no loops in the code 112
6.13 An example of a C code and the path transformer between Sb and Se 115
6.14 Closure of the SDG G0 of the C code S0 in Figure 6.3, with additional import entry vertices 118
6.15 topsort(G) for the hierarchical scheduling of sequences 120
6.16 Scheduled SDG for Harris, using P=3 cores; the scheduling information via cluster(τ) is also printed inside each vertex of the SDG 122
6.17 After the first application of BDSC on S = sequence(S0,S1), failing with "Not enough memory" 123
6.18 Hierarchically (via H) Scheduled SDGs with memory resource minimization after the second application of BDSC (which succeeds) on sequence(S0,S1) (κi = cluster(σ(Si))); to keep the picture readable, only communication edges are figured in these SDGs 124
7.1 Parallel code transformations and generation: blue indicates this chapter contributions; an ellipse, a process; and a rectangle, results 132
7.2 Unstructured parallel control flow graph (a) and event (b) and structured representations (c) 134
7.3 A part of Harris and one possible SPIRE core language representation 137
7.4 SPIRE representation of a part of Harris, non optimized (left) and optimized (right) 138
7.5 A simple C code (hierarchical) and its SPIRE representation 138
7.6 An example of a shared memory C code and its SPIRE code efficient communication primitive-based distributed memory equivalent 141
7.7 Two examples of C code with non-uniform dependences 142
7.8 An example of a C code and its read and write array regions analysis for two communications, from S0 to S1 and to S2 143
7.9 An example of a C code and its in and out array regions analysis for two communications, from S0 to S1 and to S2 143


7.10 Scheme of communications between multiple possible levels of hierarchy in a parallel code; the levels are specified with the superscripts on τ's 145
7.11 Equilevel and hierarchical communications 146
7.12 A part of an SDG and the topological sort order of predecessors of τ4 148
7.13 Communications generation from a region using the region_to_coms function 150
7.14 Equilevel and hierarchical communications generation and SPIRE distributed memory representation of a C code 154
7.15 Graph illustration of communications made in the C code of Figure 7.14; equilevel in green arcs, hierarchical in black 155
7.16 Equilevel and hierarchical communications generation after optimizations: the current cluster executes the last nested spawn (left), and barriers with one spawn statement are removed (right) 156
7.17 SPIRE representation of a part of Harris and its OpenMP task generated code 159
7.18 SPIRE representation of a C code and its OpenMP hierarchical task generated code 159
7.19 SPIRE representation of a part of Harris and its MPI generated code 161
8.1 Parallelization process: blue indicates thesis contributions; an ellipse, a pass or a set of passes; and a rectangle, results 165
8.2 Executable (harris.tpips) for harris.c 166
8.3 A model of an MPI generated code 171
8.4 Scheduled SDG of the main function of ABF 173
8.5 Steps for the rewriting of data parallelism to task parallelism using the tiling and unrolling transformations 176
8.6 ABF and equake speedups with OpenMP 177
8.7 Speedups with OpenMP: impact of tiling (P=3) 178
8.8 Speedups with OpenMP for different class sizes (IS) 178
8.9 ABF and equake speedups with MPI 179
8.10 Speedups with MPI: impact of tiling (P=3) 180
8.11 Speedups with MPI for different class sizes (IS) 180
8.12 Run-time comparison (BDSC vs. Faust for Karplus32) 183
8.13 Run-time comparison (BDSC vs. Faust for Freeverb) 184

Chapter 1

Introduction

I live and don't know how long, I'll die and don't know when, I am going and don't know where, I wonder that I am happy. Martinus Von Biberach

1.1 Context

Moore's law [81] states that, over the history of computing hardware, the number of transistors on integrated circuits doubles approximately every two years. Recently, the shift from fabricating microprocessors to parallel machine designs was partly due to the growth in power consumption caused by the high clock speeds required to improve performance in single-processor (or core) chips. The transistor count is still increasing in order to integrate more cores and ensure proportional performance scaling.

In spite of the validity of Moore's law till now, the performance of cores stopped increasing after 2003. Thus, the scalability of applications, which calls for increased performance when resources are added, is not guaranteed. To understand this issue, it is necessary to take a look at Amdahl's law [16]. This law states that the speedup of a program using multiple processors is bounded by the time needed for its sequential part; in addition to the number of processors, the algorithm also limits the speedup.
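For reference, Amdahl's law can be written as the following bound, where f denotes the fraction of the execution time that can be parallelized and P the number of processors (the formulation and notation are ours, not quoted from [16]):

\[
  S(P) \;=\; \frac{1}{(1-f) + \dfrac{f}{P}},
  \qquad
  \lim_{P \to \infty} S(P) \;=\; \frac{1}{1-f}.
\]

For instance, if only f = 90% of a program can run in parallel, the speedup can never exceed 10, however many processors are added.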

In order to enjoy the performance benefits that multiprocessors can provide, one should exploit efficiently the parallelism present in applications. This is a tricky task for programmers, especially if the programmer is a physicist or a mathematician, or is a computer scientist for whom understanding the application is difficult since it is not his own. Of course, we could say to the programmer: "think parallel". But humans tend to think sequentially. Therefore, one past problem is, and will be for some time, to detect parallelism in a sequential code and, automatically or not, write its equivalent efficient parallel code.

A parallel application, expressed in a parallel programming model, should be as portable as possible, i.e., one should not have to rewrite parallel code when targeting another machine. Yet the proliferation of parallel programming models makes the choice of one general model not obvious, unless there is a parallel language that is efficient and able to be compiled to all types of current and future architectures, which is not yet the case. Therefore, the writing of a parallel code needs to be based on a generic approach, i.e., on how well a range of different languages can be handled, in order to offer some chance of portability.


1.2 Motivation

In order to achieve good performance when programming for multiprocessors and go beyond having to "think parallel", a platform for automatic parallelization is required to exploit cores efficiently. Also, the proliferation of multi-core processors with shorter pipelines and lower clock rates, and the pressure that the simpler data parallelism model imposes on their memory bandwidth, have made coarse grained parallelism inevitable for improving performance.

The first step in a parallelization process is to expose concurrency by decomposing applications into pipelined tasks, since an algorithm is often described as a collection of tasks. To illustrate the importance of task parallelism, we give a simple example: the Harris algorithm [94], which is based on pixelwise autocorrelation, using a chain of functions, namely Sobel, Multiplication, Gauss and Coarsity. Figure 1.1 shows this process. Several works have already parallelized and mapped this algorithm on parallel architectures such as the CELL processor [93].

Figure 1.2 shows a possible instance of a manual partitioning of Harris (see also [93]). The Harris algorithm can be decomposed into a succession of two to three concurrent tasks (ellipses in the figure), assuming no exploitation of data parallelism. Automating this intuitive approach, and thus the automatic task extraction and parallelization of such applications, is our motivation in this thesis.

However, detecting task parallelism and generating efficient code rest on complex analyses of data dependences, communication, synchronization, etc. These different analyses can be provided by compilation frameworks. Thus, delegating the issue of "parallel thinking" to software is appealing, because it can be built upon existing sophisticated analyses of data dependences, and can handle the granularity present in sequential codes and the resource constraints, such as memory size or the number of CPUs, that also impact this process.

Automatic task parallelization has been studied for almost half a century. The problem of extracting an optimal parallelism with optimal communications is NP-complete [46]. Several works have tried to automate the parallelization of programs using different granularities. Metis [61] and other graph partitioning tools aim at assigning the same amount of processor work with small quantities of interprocessor communication, but the structure of the graph does not differentiate between loops, function calls, etc. The crucial steps of graph construction and parallel code generation are missing. Sarkar [96] implements a compile-time method for the partitioning problem for multiprocessors. A program is partitioned into parallel tasks at compile time, and then these are merged until a partition with the smallest parallel execution time in the presence of overhead (scheduling and communication overhead) is found. Unfortunately, this algorithm does not


void main(int argc, char *argv[])
{
  float (*Gx)[N*M], (*Gy)[N*M], (*Ixx)[N*M],
        (*Iyy)[N*M], (*Ixy)[N*M], (*Sxx)[N*M],
        (*Sxy)[N*M], (*Syy)[N*M], (*in)[N*M];
  in = InitHarris();
  /* Now we run the Harris procedure */
  /* Sobel */
  SobelX(Gx, in);
  SobelY(Gy, in);
  /* Multiply */
  MultiplY(Ixx, Gx, Gx);
  MultiplY(Iyy, Gy, Gy);
  MultiplY(Ixy, Gx, Gy);
  /* Gauss */
  Gauss(Sxx, Ixx);
  Gauss(Syy, Iyy);
  Gauss(Sxy, Ixy);
  /* Coarsity */
  CoarsitY(out, Sxx, Syy, Sxy);
  return;
}

Figure 1.1: Sequential C implementation of the main function of Harris

[Figure 1.2 shows the Harris data flow graph: in feeds SobelX and SobelY, whose outputs Gx and Gy feed three MultiplY nodes producing Ixx, Ixy and Iyy; these feed three Gauss nodes producing Sxx, Sxy and Syy, which feed Coarsity, producing out.]

Figure 1.2: Harris algorithm data flow graph
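To make this partition concrete, one possible hand-written OpenMP rendering of the task graph of Figure 1.2 is sketched below. It is only an illustration written for this discussion, not the code produced by the parallelization process developed in this thesis, and it reuses the declarations, initialization and function prototypes of Figure 1.1.

/* Declarations and the call to InitHarris() as in Figure 1.1. */
/* Sobel stage: two concurrent tasks. */
#pragma omp parallel sections
{
  #pragma omp section
  SobelX(Gx, in);
  #pragma omp section
  SobelY(Gy, in);
}
/* Multiplication stage: three concurrent tasks. */
#pragma omp parallel sections
{
  #pragma omp section
  MultiplY(Ixx, Gx, Gx);
  #pragma omp section
  MultiplY(Iyy, Gy, Gy);
  #pragma omp section
  MultiplY(Ixy, Gx, Gy);
}
/* Gauss stage: three concurrent tasks. */
#pragma omp parallel sections
{
  #pragma omp section
  Gauss(Sxx, Ixx);
  #pragma omp section
  Gauss(Syy, Iyy);
  #pragma omp section
  Gauss(Sxy, Ixy);
}
/* Coarsity: a single final task. */
CoarsitY(out, Sxx, Syy, Sxy);

Each parallel sections block ends with an implicit barrier, so the stages execute in sequence while the tasks inside a stage run concurrently, which matches the two-to-three-task partition discussed above.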

address resource constraints, which are important factors for targeting real architectures.

All parallelization tools are dedicated to a particular programming model: there is a lack of a generic abstraction of parallelism (multithreading, synchronization and data distribution). They thus do not usually address the issue of portability.

In this thesis, we develop an automatic task parallelization methodology for compilers; the key characteristics we focus on are resource constraints and static scheduling. It includes the techniques required to decompose applications into tasks and generate equivalent parallel code, using a generic approach that targets different parallel languages and thus different architectures. We apply this methodology in the existing tool PIPS [58], a comprehensive source-to-source compilation platform.

1.3 Contributions

Our goal is to develop a new tool and algorithms for automatic task parallelization that take into account resource constraints and which can also be of general use for existing parallel languages. For this purpose, the main contributions of this thesis are:

1. a new BDSC-based hierarchical scheduling algorithm (HBDSC) that uses:

• a new data structure, called the Sequence Data Dependence Graph (SDG), to represent partitioned parallel programs;

• "Bounded DSC" (BDSC), an extension of the DSC [108] scheduling algorithm that simultaneously handles two resource constraints, namely a bounded amount of memory per processor and a bounded number of processors, which are key parameters when scheduling tasks on actual multiprocessors;

• a new cost model based on execution time complexity estimation, convex polyhedral approximations of data array sizes and code instrumentation for the labeling of SDG vertices and edges;

2. a new approach for the adaptation of automatic parallelization platforms to parallel languages via:

• SPIRE, a new simple parallel intermediate representation (IR) extension methodology for designing the parallel IRs used in compilation frameworks;

• the deployment of SPIRE for automatic task-level parallelization of explicitly parallel programs on the PIPS [58] compilation framework;

3. an implementation, in the PIPS source-to-source compilation framework, of:

• BDSC-based parallelization for programs encoded using the SPIRE-based PIPS IR;

• SPIRE-based parallel code generation into two parallel languages, OpenMP [4] and MPI [3];

4. performance measurements for our parallelization approach, based on:


• five significant programs, targeting both shared and distributed memory architectures: the image and signal processing benchmarks, actually sample constituent algorithms hopefully representative of classes of applications, Harris [53] and ABF [51]; the SPEC2001 benchmark equake [22]; the NAS parallel benchmark IS [85]; and an FFT code [8];

• their automatic translations into two parallel languages, OpenMP [4] and MPI [3];

• and, finally, a comparative study between our task parallelization implementation in PIPS and that of the audio signal processing language Faust [86], using two Faust applications: Karplus32 and Freeverb. We chose Faust since it is an open-source compiler and it also automatically generates parallel tasks in OpenMP.

1.4 Thesis Outline

This thesis is organized in nine chapters. Figure 1.3 shows how these chapters can be put into the context of a parallelization chain.

Chapter 2 provides a short background to the concepts of architectures, parallelism, languages and compilers used in the next chapters.

Chapter 3 surveys seven current and efficient parallel programming languages in order to determine and classify their parallel constructs. This helps us to define a core parallel language that plays the role of a generic parallel intermediate representation when parallelizing sequential programs.

Our proposal, called the SPIRE methodology (Sequential to Parallel Intermediate Representation Extension), is developed in Chapter 4. SPIRE leverages existing infrastructures to address both control and data parallel languages, while preserving as much as possible existing analyses for sequential codes. To validate this approach in practice, we use PIPS, a comprehensive source-to-source compilation platform, as a use case to upgrade its sequential intermediate representation to a parallel one.

Since the main goal of this thesis is automatic task parallelization, extracting task parallelism from sequential codes is a key issue in this process, which we view as a scheduling problem. Chapter 5 introduces a new efficient automatic scheduling heuristic called BDSC (Bounded Dominant Sequence Clustering) for parallel programs in the presence of resource constraints on the number of processors and their local memory size.

The parallelization process we introduce in this thesis uses BDSC to find a good scheduling of program tasks on target machines and SPIRE to generate parallel source code. Chapter 6 equips BDSC with a sophisticated cost model to yield a new parallelization algorithm.

Chapter 7 describes how we can generate equivalent parallel codes in two target languages, OpenMP and MPI, selected from the survey presented in Chapter 3. Thanks to SPIRE, we show how code generation is simple and efficient.

To verify the efficiency and robustness of our work, experimental results are presented in Chapter 8. They suggest that BDSC- and SPIRE-based parallelization, focused on efficient resource management, leads to significant parallelization speedups on both shared and distributed memory systems. We also compare our implementation of the generation of omp task in PIPS with the generation of omp sections in Faust.

We conclude in Chapter 9.


[Figure 1.3 depicts the parallelization chain: C source code is processed by PIPS analyses (Chapter 2), producing complexities and convex array regions, the sequential IR and the DDG; the BDSC-based parallelizer (Chapters 5 and 6) produces SPIRE(PIPS IR) (Chapter 4), from which OpenMP generation and MPI generation (Chapter 7) yield OpenMP and MPI code (Chapter 3), which are run to obtain the performance results (Chapter 8).]

Figure 1.3: Thesis outline: blue indicates thesis contributions; an ellipse, a process; and a rectangle, results

Chapter 2

Perspectives: Bridging Parallel Architectures and Software Parallelism

You talk when you cease to be at peace with your thoughts. Kahlil Gibran

"Anyone can build a fast CPU. The trick is to build a fast system." Attributed to Seymour Cray, this quote is even more pertinent when looking at multiprocessor systems that contain several fast processing units; parallel system architectures introduce subtle system features to achieve good performance. Real world applications, which operate on large amounts of data, must be able to deal with constraints such as memory requirements, code size and processor features. These constraints must also be addressed by parallelizing compilers that deal with such applications, from the domains of scientific, signal and image processing, and translate sequential codes into efficient parallel ones. The multiplication of hardware intricacies increases the importance of software in order to achieve adequate performance.

This thesis was carried out under this perspective. Our goal is to develop a prototype for automatic task parallelization that generates a parallel version of an input sequential code using a scheduling algorithm. We also address code generation and parallel intermediate representation issues. More generally, we study how parallelism can be detected, represented and finally generated.

2.1 Parallel Architectures

Up to 2003, the performance of machines was improved by increasing clock frequencies. Yet relying on single-thread performance has been increasingly disappointing, due to access latency to main memory. Moreover, other aspects of performance have been of importance lately, such as power consumption, energy dissipation and number of cores. For these reasons, the development of parallel processors, with a large number of cores that run at slower frequencies to reduce energy consumption, has been adopted and has had a significant impact on performance.

The market of parallel machines is wide, from vector platforms, accelerators, Graphical Processing Units (GPUs) and dual-core to manycore¹ with thousands of on-chip cores. However, gains obtained when parallelizing and executing applications are limited by input-output issues (disk read/write), data communications among processors, memory access time, load unbalance, etc. When parallelizing an application, one must detect the available parallelism and map the different parallel tasks on the parallel machine. In this dissertation, we consider a task as a static notion, i.e., a list of instructions, while processes and threads (see Section 2.1.2) are running instances of tasks.

¹ Contains 32 or more cores.

In our parallelization approach, we make trade-offs between the available parallelism and communications/locality; this latter point is related to the memory model of the target machine. Besides, extracted parallelism should be correlated to the number of processors/cores available in the target machine. We also take into account the memory size and the number of processors.

2.1.1 Multiprocessors

The generic model of multiprocessors is a collection of cores, each potentially including both CPU and memory, attached to an interconnection network. Figure 2.1 illustrates an example of a multiprocessor architecture that contains two clusters; each cluster is a quadcore multiprocessor. In a multicore, cores can transfer data between their cache memory (level 2 cache, generally) without having to go through the main memory. Moreover, they may or may not share caches (CPU-local level 1 cache, in Figure 2.1, for example). This is not possible at the cluster level, where processor cache memories are not linked; data can be transferred to and from the main memory only.

[Figure 2.1 shows two quadcore clusters: in each cluster, every CPU has a private L1 cache, the four CPUs share an L2 cache and a main memory, and the two main memories are connected through an interconnect.]

Figure 2.1: A typical multiprocessor architectural model

Flynn [45] classifies multiprocessors in three main groups: Single Instruction Single Data (SISD), Single Instruction Multiple Data (SIMD) and Multiple Instruction Multiple Data (MIMD). Multiprocessors are generally recognized to be MIMD architectures. We present the important branches of current multiprocessors, MPSoCs and multicore processors, since they are widely used in many applications such as signal processing and multimedia.

Multiprocessors System-on-Chip

A multiprocessor system-on-chip (MPSoC) integrates multiple CPUs in a hardware system. MPSoCs are commonly used for applications such as embedded multimedia on cell phones. The first MPSoC is often considered to be the Lucent Daytona [9]. One distinguishes two important forms of MPSoCs: homogeneous and heterogeneous multiprocessors. Homogeneous architectures include multiple cores that are similar in every aspect. On the other side, heterogeneous architectures use processors that are different in terms of the instruction set architecture, functionality, frequency, etc. Field Programmable Gate Array (FPGA) platforms and the CELL processor² [42] are examples of heterogeneous architectures.

Multicore Processors

This model implements the shared memory model. The programming model in a multicore processor is the same for all its CPUs. The first multicore general purpose processors were introduced in 2005 (Intel Core Duo). Intel Xeon Phi is a recent multicore chip with 60 homogeneous cores. The number of cores on a chip is expected to continue to grow, to construct more powerful computers. This emerging hardware approach will deliver petaflop and exaflop performance with the efficient capabilities needed to handle high-performance emerging applications. Table 2.1 illustrates the rapid proliferation of multicores and summarizes the important existing ones; note that IBM Sequoia, the second system in the Top500 list (top500.org) in November 2012, is a petascale Blue Gene/Q supercomputer that contains 98,304 compute nodes, where each computing node is a 16-core PowerPC A2 processor chip (sixth entry in the table). See [25] for an extended survey of multicore processors.

In this thesis, we provide tools that automatically generate parallel programs from sequential ones, targeting homogeneous multiprocessors or multicores. Since increasing the number of cores on a single chip creates challenges with memory and cache coherence, as well as communication between the cores, our aim in this thesis is to deliver the benefit and efficiency of multicore processors by the introduction of an efficient scheduling algorithm, to map parallel tasks of a program on CPUs, in the presence of resource constraints on the number of processors and their local memory size.

² The Cell configuration has one Power Processing Element (PPE) on the core, acting as a controller for eight physical Synergistic Processing Elements (SPEs).


Year              Number of cores          Company   Processor
The early 1970s   The 1st microprocessor   Intel     4004
2005              2-core machine           Intel     Xeon Paxville DP
2005              2                        AMD       Opteron E6-265
2009              6                        AMD       Phenom II X6
2011              8                        Intel     Xeon E7-2820
2011              16                       IBM       PowerPC A2
2007              64                       Tilera    TILE64
2012              60                       Intel     Xeon Phi

Table 2.1: Multicores proliferation

2.1.2 Memory Models

The choice of a proper memory model to express parallel programs is an important issue in parallel language design. Indeed, the ways processes and threads communicate using the target architecture and impact the programmer's computation specifications affect both performance and ease of programming. There are currently three main approaches (see Figure 2.2, extracted from [63]).

[Figure 2.2 sketches the three memory models (shared memory, distributed memory and PGAS) in terms of threads/processes, memory accesses, address spaces and messages.]

Figure 2.2: Memory models

Shared Memory

Also called global address space, this model is the simplest one to use [11]. Here, the address spaces of the threads are mapped onto the global memory; no explicit data passing between threads is needed. However, synchronization is required between the threads that are writing and reading the same data to and from the shared memory. The OpenMP [4] and Cilk [26] languages use the shared memory model. Intel's Larrabee [99] is a homogeneous, general-purpose, many-core machine with cache-coherent shared memory.
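As a minimal illustration of this need for synchronization (the example is ours, not taken from these languages' specifications), the following C/OpenMP program protects a shared accumulator with a critical section; without it, concurrent updates of sum could be lost:

#include <stdio.h>
#include <omp.h>

int main(void)
{
  double sum = 0.0;                    /* shared variable in the global address space */
  #pragma omp parallel for
  for (int i = 1; i <= 1000; i++) {
    #pragma omp critical               /* synchronizes the threads updating sum */
    sum += 1.0 / i;
  }
  printf("sum = %f\n", sum);
  return 0;
}

In practice, an OpenMP reduction clause would be the idiomatic way to write this loop; the critical section is used here only to make the synchronization explicit.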

Distributed Memory

In a distributed-memory environment, each processor is only able to address its own memory. The message passing model, which uses communication libraries, is usually adopted to write parallel programs for distributed-memory systems. These libraries provide routines to initiate and configure the messaging environment, as well as to send and receive data packets. Currently, the most popular high-level message-passing system for scientific and engineering applications is MPI (Message Passing Interface) [3]. OpenCL [69] uses a variation of the message passing memory model for GPUs. The Intel Single-chip Cloud Computer (SCC) [55] is a homogeneous, general-purpose, many-core (48 cores) chip implementing the message passing memory model.
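A minimal message-passing sketch in C with MPI (ours, for illustration only; it must be run with at least two processes, e.g. mpirun -np 2): process 0 sends an array that process 1 explicitly receives, since each process can only address its own memory.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int rank, data[4] = {1, 2, 3, 4};
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {
    /* Sender: the buffer lives in process 0's private memory. */
    MPI_Send(data, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
  } else if (rank == 1) {
    /* Receiver: the message is copied into process 1's own buffer. */
    MPI_Recv(data, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("process 1 received %d %d %d %d\n", data[0], data[1], data[2], data[3]);
  }
  MPI_Finalize();
  return 0;
}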

Partitioned Global Address Space (PGAS)

PGAS-based languages combine the programming convenience of sharedmemory with the performance control of message passing by partitioninglogically a global address space into places each thread is local to a placeFrom the programmerrsquos point of view programs have a single address spaceand one task of a given thread may refer directly to the storage of a differentthread UPC [33] Fortress [13] Chapel [6] X10 [5] and Habanero-Java [30]use the PGAS memory model

In this thesis, we target both shared and distributed memory systems, since efficiently allocating the tasks of an application on a target architecture requires reducing communication overheads and transfer costs for both shared and distributed memory architectures. By convention, we call running instances of tasks processes in the case of distributed memory systems, and threads for shared memory and PGAS systems.

If reducing communications is obviously meaningful for distributed memory systems, it is also worthwhile on shared memory architectures, since this may impact cache behavior. Locality is another important issue: a parallelization process should be able to make trade-offs between locality in cache memory and communications. Indeed, mapping two tasks to the same processor keeps the data in its local memory, and possibly even in its cache; this avoids copying data over the shared memory bus. Therefore, transmission costs are decreased and bus contention is reduced.

Moreover, we take into account the memory size, since this is an important factor when parallelizing applications that operate on large amounts of data or when targeting embedded systems.


2.2 Parallelism Paradigms

Parallel computing has been studied since computers exist. The market dominance of multi- and many-core processors, and the growing importance and increasing number of clusters in the Top500 list, are making parallelism a key concern when implementing large applications such as weather modeling [39] or nuclear simulations [36]. These important applications require a lot of computational power and thus need to be programmed to run on parallel supercomputers.

Expressing parallelism is the fundamental issue when breaking an application into concurrent parts, in order to take advantage of a parallel computer and to execute these parts simultaneously on different CPUs. Parallel architectures support multiple levels of parallelism within a single chip: the PE (Processing Element) level [64], the thread level (e.g., Simultaneous MultiThreading (SMT)), the data level (e.g., Single Instruction Multiple Data (SIMD)), the instruction level (e.g., Very Long Instruction Word (VLIW)), etc. We present here the three main types of parallelism (data, task and pipeline).

Task Parallelism

This model corresponds to the distribution of the execution of processes or threads across different parallel computing cores/processors. A task can be defined as a unit of parallel work in a program. We can classify its scheduling implementations into two categories, dynamic and static:

• With dynamic task parallelism, tasks are dynamically created at runtime and added to the work queue. The runtime scheduler is responsible for scheduling and synchronizing the tasks across the cores.

• With static task parallelism, the set of tasks is known statically and the scheduling is applied at compile time: the compiler generates and assigns the different tasks to the thread groups (a small sketch contrasting the two categories is given below).
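As an illustration only (not part of this thesis; work_a and work_b are hypothetical functions), the following C fragment contrasts the two categories using the OpenMP constructs surveyed in Chapter 3: omp task defers the scheduling decision to the runtime, while omp sections fixes the task-to-code mapping at compile time.

void work_a(void); void work_b(void);

void dynamic_tasks(void)
{
  #pragma omp parallel
  #pragma omp single
  {
    #pragma omp task      /* task instance created and scheduled at run time */
    work_a();
    #pragma omp task
    work_b();
  }
}

void static_tasks(void)
{
  #pragma omp parallel sections
  {
    #pragma omp section   /* mapping decided at compile time */
    work_a();
    #pragma omp section
    work_b();
  }
}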

Data Parallelism

This model implies that the same independent instruction is performed repeatedly and simultaneously on different data. Data parallelism, or loop-level parallelism, corresponds to the distribution of the data across different parallel cores/processors, dividing the loops into chunks3. Data-parallel execution puts a high pressure on the memory bandwidth of multi-core processors. Data parallelism can be expressed using task parallelism constructs; an example is provided in Figure 2.3, where loop iterations are distributed

3 A chunk is a set of iterations.


differently among the threads: in contiguous order (generally) on the left, and in interleaved order on the right.

forall (i = 0; i < height; ++i)
  for (j = 0; j < width; ++j)
    kernel();

(a) Data parallel example

for (i = 0; i < height; ++i)
  launchTask {
    for (j = 0; j < width; ++j)
      kernel();
  }

(b) Task parallel example

Figure 2.3: Rewriting of data parallelism using the task parallelism model

Pipeline Parallelism

Pipeline parallelism applies to chains of producers and consumers that contain many steps depending on each other. These dependences can be removed by interleaving these steps. Pipeline parallel execution leads to reduced latency and buffering, but it introduces extra synchronization between producers and consumers to keep them coupled in their execution.

In this thesis, we focus on task parallelism, often considered more difficult to handle than data parallelism since it lacks the regularity present in the latter model: processes (threads) simultaneously run different instructions, leading to different execution schedules and memory access patterns. Besides, the pressure that the data parallelism model puts on memory bandwidth and the extra synchronization introduced by pipeline parallelism have made coarse-grained parallelism inevitable for improving performance.

Moreover, data parallelism can be implemented using task parallelism (see Figure 2.3). Therefore, our implementation of task parallelism is also able to handle data parallelism, after transformation of the sequential code (see Section 8.4, which explains the protocol applied to task parallelize sequential applications).

Task management must address both control and data dependences in order to reduce execution and communication times. This is related to another key enabling issue, compilation, which we detail in the following section.

2.3 Mapping Paradigms to Architectures

In order to map efficiently a parallel program onto a parallel architecture, one needs to write parallel programs in the presence of constraints of data and control dependences, communications and resources. Writing a parallel


application implies the use of a target parallel language; we survey in Chapter 3 different recent and popular task parallel languages. We can split this parallelization process into two steps: understanding the sequential application by analyzing it to extract parallelism, and then generating the equivalent parallel code. Before reaching the writing step, decisions should be made about an efficient task schedule (Chapter 5). To perform these operations automatically, a compilation framework is necessary to provide a base for the implementation of analyses and for experimentation. Moreover, to keep this parallel software development process simple, generic and language-independent, a general parallel core language is needed (Chapter 4).

Programming parallel machines requires a parallel programming language implementing one of the previous paradigms of parallelism. Since in this thesis we focus on extracting and generating task parallelism, we are interested in task parallel languages, which we detail in Chapter 3. We distinguish two approaches for mapping languages to parallel architectures. In the first case, the entry is a parallel language: the user wrote the parallel program by hand, which requires a great effort for detecting dependences and parallelism and for writing an equivalent, efficient parallel program. In the second case, the entry is a sequential language: this calls for an automatic parallelizer, which is responsible for dependence analyses and scheduling in order to generate efficient parallel code.

In this thesis, we adopt the second approach: we use the PIPS compiler, presented in Section 2.6, as a practical tool for code analysis and parallelization. We target both shared and distributed memory systems using two parallel languages, OpenMP and MPI. They provide high-level programming constructs to express task creation, termination, synchronization, etc. In Chapter 7, we show how we generate parallel code for these two languages.

2.4 Program Dependence Graph (PDG)

Parallelism detection is based on the analysis of data and control dependences between instructions, encoded in an intermediate representation format. Parallelizers use various dependence graphs to represent such constraints. A program dependence graph (PDG) is a directed graph in which the vertices represent blocks of statements and the edges represent essential control or data dependences; it encodes both control and data dependence information.

2.4.1 Control Dependence Graph (CDG)

The CDG represents a program as a graph of statements that are control dependent on the entry to the program. Control dependences represent control flow relationships that must be respected by any execution of the program, whether parallel or sequential.


Ferrante et al. [44] define control dependence as follows: Statement Y is control dependent on X when there is a path in the control flow graph (CFG) (see Figure 2.4a) from X to Y that does not contain the immediate forward dominator Z of X. Z forward dominates Y if all paths from Y include Z; the immediate forward dominator is the first forward dominator (closest to X). In the forward dominance tree (FDT) (see Figure 2.4b), vertices represent statements, edges represent the immediate forward dominance relation, and the root of the tree is the exit of the CFG. The construction of the CDG (see Figure 2.4c) relies on explicit control flow and postdominance information. Unfortunately, when programs contain goto statements, explicit control flow and postdominance information are prerequisites for calculating control dependences.

[Figure omitted: three panels showing, on a small example, (a) the control flow graph (CFG), (b) the forward dominance tree (FDT) and (c) the control dependence graph (CDG).]

Figure 2.4: Construction of the control dependence graph

Harrold et al. [54] have proposed an algorithm that does not require explicit control flow or postdominator information to compute exact control dependences. Indeed, for structured programs, the control dependences are obtained directly from the Abstract Syntax Tree (AST). Programs that contain what the authors call structured transfers of control, such as break, return or exit, can be handled as special cases [54]. For more unstructured transfers of control via arbitrary gotos, additional processing based on the control flow graph needs to be performed.

In this thesis, we only handle structured parts of a code, i.e., the ones that do not contain goto statements. Therefore, in this context, PIPS implements control dependences implicitly in its IR, since it is in the form of an AST (for structured programs, CDG and AST are equivalent).


2.4.2 Data Dependence Graph (DDG)

The DDG is a subgraph of the program dependence graph that encodes data dependence relations between instructions that use/define the same data element. Figure 2.5 shows an example of a C code and its data dependence graph; this DDG has been generated automatically with PIPS. Data dependences in a DDG are basically of three kinds [88], as shown in the figure:

• true dependences, or true data-flow dependences, when a value produced by an instruction is used by a subsequent one (also known as RAW), colored red in the figure;

• output dependences, between two definitions of the same item (also known as WAW), colored blue in the figure;

• anti-dependences, between a use of an item and a subsequent write (also known as WAR), colored green in the figure.

void main()
{
  int a[10], b[10];
  int i, d, c;
  c = 42;
  for(i = 1; i <= 10; i++)
    a[i] = 0;
  for(i = 1; i <= 10; i++) {
    a[i] = bar(a[i]);
    d = foo(c);
    b[i] = a[i] + d;
  }
  return;
}

Figure 2.5: Example of a C code and its data dependence graph

Dependences are needed to define precedence constraints in programs. PIPS implements both control dependences (via its IR, in the form of an AST) and data dependences (in the form of a dependence graph). These data structures are the basis for scheduling instructions, an important step in any parallelization process.


2.5 List Scheduling

Task scheduling is the process of specifying the order and assignment of start and end times to a set of tasks to be run concurrently on a multiprocessor, such that the completion time of the whole application is as small as possible while respecting the dependence constraints of each task. Usually, the number of tasks exceeds the number of processors; thus some processors are dedicated to multiple tasks. Since finding the optimal solution of a general scheduling problem is NP-complete [46], providing an efficient heuristic to find a good solution is needed. This is the problem we address in Chapter 5.

2.5.1 Background

Scheduling approaches can be categorized in many ways. In preemptive techniques, the currently executing task can be preempted by another, higher-priority task, while in a non-preemptive scheme, a task keeps its processor until termination. Preemptive scheduling algorithms are instrumental in avoiding possible deadlocks and in implementing real-time systems, where tasks must adhere to specified deadlines; however, preemptions generate run-time overhead. Moreover, when static predictions of task characteristics such as execution time and communication cost exist, task scheduling can be performed statically (offline). Otherwise, dynamic (online) schedulers must make run-time mapping decisions whenever new tasks arrive; this also introduces run-time overhead.

In the context of the automatic parallelization of scientific applications, which we focus on in this thesis, we are interested in non-preemptive static scheduling policies of parallelized code4. Even though the subject of static scheduling is rather mature (see Section 5.4), we believe the advent and widespread use of multi-core architectures, with the constraints they impose, warrant taking a fresh look at its potential. Indeed, static scheduling mechanisms have, first, the strong advantage of reducing run-time overheads, a key factor when considering execution time and energy usage metrics. Another important advantage of these schedulers over dynamic ones, at least over those not equipped with detailed static task information, is that the existence of efficient schedules is ensured prior to program execution. This is usually not an issue when time performance is the only goal at stake, but much more so when memory constraints might prevent a task from being executed at all on a given architecture. Finally, static schedules are predictable, which helps both at the specification level (if such a requirement has been introduced by designers) and at the debugging level. Given the breadth of the literature on scheduling, we introduce in this section the notion of list-scheduling heuristics [72], a class of scheduling heuristics that we use in this thesis.

4 Note that more expensive preemptive schedulers would be required if fairness concerns were high, which is not frequently the case in the applications we address here.

2.5.2 Algorithm

A labeled directed acyclic graph (DAG) G is defined as G = (T, E, C), where (1) T = vertices(G) is a set of n tasks (vertices) τ, annotated with an estimation of their execution time, task_time(τ); (2) E = edges(G) is a set of m directed edges e = (τi, τj) between two tasks, annotated with their dependences, edge_regions(e)5; and (3) C is an n × n sparse communication edge cost matrix, edge_cost(e). task_time(τ) and edge_cost(e) are assumed to be numerical constants, although we show how we lift this restriction in Section 6.3. The functions successors(τ, G) and predecessors(τ, G) return the list of immediate successors and predecessors of a task τ in the DAG G. Figure 2.6 provides an example of a simple graph with vertices τi; vertex times are listed in the vertex circles, while edge costs label arrows.
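To make this data structure concrete, here is a minimal C sketch (our own illustration, not PIPS code; all names and fields are assumptions) of such a labeled DAG:

typedef struct task {
    int    id;                 /* vertex identifier                     */
    double task_time;          /* estimated execution time of the task  */
} task_t;

typedef struct edge {
    int    source, sink;       /* indices of the two dependent tasks    */
    double edge_cost;          /* estimated communication cost (from C) */
} edge_t;

typedef struct dag {
    int     n_tasks, n_edges;
    task_t *tasks;             /* T = vertices(G)                       */
    edge_t *edges;             /* E = edges(G), labeled by C            */
} dag_t;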

[Figure omitted: a DAG with an entry vertex (time 0), vertices τ1 (time 1), τ2 (time 3), τ3 (time 2) and τ4 (time 2), and an exit vertex (time 0), connected by edges of cost 0, 1 or 2. The associated scheduling data are:]

step   task   tlevel   blevel
  1     τ4      0        7
  2     τ3      3        2
  3     τ1      0        5
  4     τ2      4        3

Figure 2.6: A directed acyclic graph (left) and its associated data (right)

A list scheduling process provides, from a DAG G, a sequence of its vertices that satisfies the relationship imposed by E. Various heuristics try to minimize the total schedule length, possibly allocating the various vertices to different clusters, which ultimately will correspond to different processes or threads. A cluster κ is thus a list of tasks; if τ ∈ κ, we note cluster(τ) = κ. List scheduling is based on the notion of vertex priorities. The priority for each task τ is computed using the following attributes:

• The top level tlevel(τ, G) of a vertex τ is the length of the longest

5 A region is defined in Section 2.6.3.


path from the entry vertex6 of G to τ. The length of a path is the sum of the communication costs of the edges and the computational times of the vertices along the path. Top levels are used to estimate the start times of vertices on processors: the top level is the earliest possible start time, and scheduling in an ascending order of top levels schedules vertices in a topological order. The algorithm for computing the top level of a vertex τ in a graph is given in Algorithm 1.

ALGORITHM 1: The top level of task τ in graph G

function tlevel(τ, G)
  tl = 0
  foreach τi ∈ predecessors(τ, G)
    level = tlevel(τi, G) + task_time(τi) + edge_cost(τi, τ)
    if (tl < level) then tl = level
  return tl
end

• The bottom level blevel(τ, G) of a vertex τ is the length of the longest path from τ to the exit vertex of G. The maximum of the bottom levels of vertices is the length cpl(G) of the graph's critical path, which is the longest path in the DAG G. The latest start time of a vertex τ is the difference (cpl(G) − blevel(τ, G)) between the critical path length and the bottom level of τ. Scheduling in a descending order of bottom levels tends to schedule critical path vertices first. The algorithm for computing the bottom level of τ in a graph is given in Algorithm 2.

ALGORITHM 2: The bottom level of task τ in graph G

function blevel(τ, G)
  bl = 0
  foreach τj ∈ successors(τ, G)
    level = blevel(τj, G) + edge_cost(τ, τj)
    if (bl < level) then bl = level
  return bl + task_time(τ)
end

To illustrate these notions, the top levels and bottom levels of each vertex of the graph presented on the left of Figure 2.6 are provided in the adjacent table.
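As a worked example (our own reading of the table of Figure 2.6), the critical path length of this DAG is cpl(G) = max over τ of blevel(τ, G) = 7, reached for τ4; the latest start time of τ3 is thus cpl(G) − blevel(τ3, G) = 7 − 2 = 5, while τ4, which lies on the critical path, has latest start time 7 − 7 = 0, equal to its top level.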

6 One can always assume that a unique entry vertex exists, by adding it if need be and connecting it to the original ones with null-cost edges. This remark also applies to the exit vertex (see Figure 2.6).


The general algorithmic skeleton for list scheduling a graph G on P clusters (P can be infinite and is assumed to be always strictly positive) is provided in Algorithm 3: first, priorities priority(τ) are computed for all currently unscheduled vertices; then, the vertex with the highest priority is selected for scheduling; finally, this vertex is allocated to the cluster that offers the earliest start time. Function f characterizes each specific heuristic, while the set of clusters already allocated to tasks is clusters. Priorities need to be computed again for the (possibly updated) graph G after each scheduling of a task, since task times and communication costs change when tasks are allocated to clusters; this is performed by the update_priority_values function call. Part of the allocate_task_to_cluster procedure is to ensure that cluster(τ) = κ, which indicates that Task τ is now scheduled on Cluster κ.

ALGORITHM 3: List scheduling of graph G on P processors

procedure list_scheduling(G, P)
  clusters = ∅
  foreach τi ∈ vertices(G)
    priority(τi) = f(tlevel(τi, G), blevel(τi, G))
  UT = vertices(G)                      // unscheduled tasks
  while UT ≠ ∅
    τ = select_task_with_highest_priority(UT)
    κ = select_cluster(τ, G, P, clusters)
    allocate_task_to_cluster(τ, κ, G)
    update_graph(G)
    update_priority_values(G)
    UT = UT - {τ}
end

In this thesis, we introduce a new non-preemptive static list-scheduling heuristic called BDSC (Chapter 5) to extract task-level parallelism in the presence of resource constraints on the number of processors and their local memory size; it is based on a precise cost model and addresses both shared and distributed parallel memory architectures. We implement BDSC in PIPS.

2.6 PIPS Automatic Parallelizer and Code Transformation Framework

We chose PIPS [58] to showcase our parallelization approach since it is readily available, well-documented, and encodes both control and data dependences. PIPS is a source-to-source compilation framework for analyzing and transforming C and Fortran programs. PIPS implements polyhedral


analyses and transformations that are interprocedural and provides detailed static information about programs, such as use-def chains, the data dependence graph, symbolic complexity, memory effects7, the call graph, the interprocedural control flow graph, etc.

Many program analyses and transformations can be applied; the most important are listed below.

Parallelization: Several algorithms are implemented, such as Allen & Kennedy's parallelization algorithm [14], which performs loop distribution and vectorization, selecting innermost loops for vector units, and coarse-grain parallelization based on the convex array regions of loops (see below), to avoid loop distribution and restrictions on loop bodies.

Scalar and array privatization: This analysis discovers variables whose values are local to a particular scope, usually a loop iteration.

Loop transformations [88]: These include unrolling, interchange, normalization, distribution, strip mining, tiling, index set splitting, etc.

Function transformations: PIPS supports, for instance, the inlining, outlining and cloning of functions.

Other transformations: These classical code transformations include dead-code elimination, partial evaluation, 3-address code generation, control restructuring, loop invariant code motion, forward substitution, etc.

We use many of these analyses for implementing our parallelization methodology in PIPS, e.g., the "complexity" and data dependence analyses. These static analyses use convex polyhedra to represent abstractions of the values within various domains. A convex polyhedron over Z is a set of integer points in the n-dimensional space Zn that can be represented by a system of linear inequalities. These analyses provide precise information, which can be leveraged to perform automatic task parallelization (Chapter 6).
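As a simple illustration (our own, not PIPS output), the set of integer points {(i, j) ∈ Z² | 1 ≤ i, i ≤ 10, 6 ≤ j, j ≤ 20} is a convex polyhedron of Z², defined by the four affine inequalities i ≥ 1, −i ≥ −10, j ≥ 6 and −j ≥ −20; such systems of linear constraints are the representation manipulated by the analyses presented below.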

We present below the precondition, transformer, array region and complexity analyses that are directly used in this thesis to compute dependences, execution and communication times; then we present the intermediate representation (IR) of PIPS, since we extend it in this thesis into a parallel intermediate representation (Chapter 4), which is necessary for simple and generic parallel code generation.

2.6.1 Transformer Analysis

A transformer is an affine relation between the values of variables in the memory store before (old values) and after (new values) the execution of a statement. This analysis [58] is implemented in PIPS, where transformers are

7 A representation of the data read and written by a given statement.


represented by convex polyhedra. Indeed, a transformer T is defined by a list of arguments and a predicate system labeled by T; a transformer is empty (⊥) when the set of affine constraints of the system is not feasible, and a transformer is equal to the identity when no variables are modified.

This relation models the transformation, by the execution of a statement, of an input memory store into an output memory store. For instance, in Figure 2.7, the transformer of k, T(k), before and after the execution of the statement k++, is the relation {k = kinit + 1}, where kinit is the value of k before executing the increment statement; its value before the loop is kinit = 0.

T(i, k) {k = 0}
int i, k = 0;

T(i, k) {i = k + 1, kinit = 0}
for(i = 1; i <= 10; i++) {

  T() {}
  a[i] = f(a[i]);

  T(k) {k = kinit + 1}
  k++;
}

Figure 2.7: Example of transformer analysis

The transformer analysis is used to compute preconditions (see Section 2.6.2) and to compute the path transformer (see Section 6.4) that connects two memory stores. This analysis is necessary for our dependence test.

2.6.2 Precondition Analysis

A precondition analysis is a function that labels each statement in a program with its precondition. A precondition provides information about the values of variables: it is a condition that holds true before the execution of the statement.

This analysis [58] is implemented in PIPS, where preconditions are represented by convex polyhedra. Preconditions are determined before the computation of array regions (see Section 2.6.3), in order to have precise information on the variables' values. They are also used to perform a sophisticated static complexity analysis (see Section 2.6.4). In Figure 2.8, for instance, the precondition P over the variables i and j before the second loop is {i = 11}, which is indeed the value of i after the execution of the first loop. Note that no information is available for j at this step.


int i, j;

P(i, j) {}
for(i = 1; i <= 10; i++) {
  P(i, j) {1 <= i, i <= 10}
  a[i] = f(a[i]);
}
P(i, j) {i = 11}
for(j = 6; j <= 20; j++) {
  P(i, j) {i = 11, 6 <= j, j <= 20}
  b[j] = g(a[j], a[j+1]);
}
P(i, j) {i = 11, j = 21}

Figure 2.8: Example of precondition analysis

2.6.3 Array Region Analysis

PIPS provides a precise intra- and inter-procedural analysis of array data flow, which is important for computing dependences for each array element. This analysis of array regions [35] is implemented in PIPS. A region r of an array T is an abstraction of a subset of its elements. Formally, a region r is a quadruplet made of (1) a reference v (the variable referenced in the region), (2) a region type t, (3) an approximation a for the region, EXACT when the region exactly represents the requested set of array elements, or MAY if it is an over- or under-approximation, and (4) c, a convex polyhedron containing equalities and inequalities, whose parameters are the values of the variables in the current memory store. We write <v-t-a-c> for Region r.

We distinguish four types of sets of regions: Rr, Rw, Ri and Ro. Read (Rr) and Write (Rw) regions contain the array elements respectively read or written by a statement. In regions (Ri) contain the array elements read and imported by a statement. Out regions (Ro) contain the array elements written and exported by a statement.

For instance, in Figure 2.9, PIPS is able to infer the following sets of regions:

Rw1 = { <a(φ1)-W-EXACT-{1 ≤ φ1, φ1 ≤ 10}>,
        <b(φ1)-W-EXACT-{1 ≤ φ1, φ1 ≤ 10}> }

Rr2 = { <a(φ1)-R-EXACT-{6 ≤ φ1, φ1 ≤ 21}> }

where the Write regions Rw1 of the arrays a and b modified in the first loop are EXACTly the array elements with indices in the interval [1,10], and the Read region Rr2 of the array a in the second loop EXACTly represents the elements with indices in [6,21].


// <a(φ1)-R-EXACT-{1 ≤ φ1, φ1 ≤ 10}>
// <a(φ1)-W-EXACT-{1 ≤ φ1, φ1 ≤ 10}>
// <b(φ1)-W-EXACT-{1 ≤ φ1, φ1 ≤ 10}>
for(i = 1; i <= 10; i++) {
  a[i] = f(a[i]);
  b[i] = 42;
}

// <a(φ1)-R-EXACT-{6 ≤ φ1, φ1 ≤ 21}>
// <b(φ1)-W-EXACT-{6 ≤ φ1, φ1 ≤ 20}>
for(j = 6; j <= 20; j++)
  b[j] = g(a[j], a[j+1]);

Figure 2.9: Example of array region analysis

We define the following operations on sets of regions (whose constraint systems are convex polyhedra):

1. regions_intersection(R1, R2) is a set of regions; each region r in this set is the intersection of two regions r1 ∈ R1 of type t1 and r2 ∈ R2 of type t2 with the same reference. The type of r combines t1 and t2, and the convex polyhedron of r is the intersection of the two convex polyhedra of r1 and r2, which is also a convex polyhedron.

2. regions_difference(R1, R2) is a set of regions; each region r in this set is the difference of two regions r1 ∈ R1 and r2 ∈ R2 with the same reference. The type of r combines t1 and t2, and the convex polyhedron of r is the set difference between the two convex polyhedra of r1 and r2. Since this is not generally a convex polyhedron, an under-approximation is computed in order to obtain a convex representation.

3. regions_union(R1, R2) is a set of regions; each region r in this set is the union of two regions r1 ∈ R1 of type t1 and r2 ∈ R2 of type t2 with the same reference. The type of r combines t1 and t2, and the convex polyhedron of r is the union of the two convex polyhedra of r1 and r2, which again is not necessarily a convex polyhedron; an approximated convex hull is thus computed in order to return the smallest enclosing polyhedron.

In this thesis, array regions are used for communication cost estimation, dependence computation between statements, and data volume estimation for statements. For example, if we need to compute array dependences between two compound statements S1 and S2, we search for the array elements that are accessed in both statements; we therefore deal with the two statements' sets of regions, R1 and R2, and the result is the intersection between the two sets of regions. However, the two sets of regions should be in the same


memory store to make it possible to compare them: we should bring the set of regions R1 into the store of R2, by combining R1 with the changes performed in between, i.e., the transformer that connects the two memory stores. Thus, we compute the path transformer between S1 and S2 (see Section 6.4). Moreover, we use operations on array regions for our cost model generation (see Section 6.3) and communication generation (see Section 7.3).
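As a hand-computed illustration (not PIPS output, and using an informal notation for the combined region type), intersecting the Write region of the array a in the first loop of Figure 2.9 with its Read region in the second loop yields the elements accessed by both statements:

regions_intersection({<a(φ1)-W-EXACT-{1 ≤ φ1, φ1 ≤ 10}>},
                     {<a(φ1)-R-EXACT-{6 ≤ φ1, φ1 ≤ 21}>})
    = {<a(φ1)-W/R-EXACT-{6 ≤ φ1, φ1 ≤ 10}>}

i.e., the elements a[6] to a[10], which induce a dependence between the two loops.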

2.6.4 "Complexity" Analysis

"Complexities" are symbolic approximations of the static execution time estimation of statements, in terms of numbers of cycles. They are implemented in PIPS using polynomial approximations of execution times. Each statement is thus labeled with an expression, represented by a polynomial over program variables, assuming that each basic operation (addition, multiplication...) has a fixed, architecture-dependent execution time. Table 2.2 shows the complexity formulas generated for each function of Harris (see the code in Figure 1.1) using the PIPS complexity analysis, where the N and M variables represent the input image size.

Function     Complexity (polynomial)

InitHarris   9 × N × M
SobelX       60 × N × M
SobelY       60 × N × M
MultiplY     17 × N × M
Gauss        85 × N × M
CoarsitY     34 × N × M

Table 2.2: Execution time estimations for Harris functions using the PIPS complexity analysis

We use this analysis provided by PIPS to determine an approximate execution time for each statement, in order to guide our scheduling heuristic.
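For instance, assuming a 1024 × 1024 input image (an assumption of ours, not a figure from the thesis experiments), the SobelX complexity of Table 2.2 evaluates to 60 × 1024 × 1024 ≈ 6.3 × 10^7 cycles; the relative weights of these polynomials (e.g., Gauss being roughly 85/60 times costlier than SobelX) give the scheduler an estimate of the relative task weights.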

2.6.5 PIPS (Sequential) IR

The PIPS intermediate representation (IR) [32] of sequential programs is a hierarchical data structure that embeds both control flow graphs and abstract syntax trees. We provide in this section a high-level description of the intermediate representation of PIPS, which targets the Fortran and C imperative programming languages; it is specified using Newgen [60], a domain-specific language for the definition of set equations, from which a dedicated API is automatically generated to manipulate (creation, access, IO operations) the data structures implementing these set elements. This section contains only a slightly simplified subset of the intermediate representation of PIPS, the


part that is directly related to the parallel paradigms addressed in this thesis. The Newgen definition of this part is given in Figure 2.10.

instruction   = call + forloop + sequence + unstructured
statement     = instruction x declarations:entity*
entity        = name:string x type x initial:value
call          = function:entity x arguments:expression*
forloop       = index:entity x
                lower:expression x upper:expression x
                step:expression x body:statement
sequence      = statements:statement*
unstructured  = entry:control x exit:control
control       = statement x
                predecessors:control* x successors:control*
expression    = syntax x normalized
syntax        = reference + call + cast
reference     = variable:entity x indices:expression*
type          = area + void

Figure 2.10: Simplified Newgen definitions of the PIPS IR

• Control flow in the PIPS IR is represented via instructions, members of the disjoint union (using the "+" symbol) set instruction. An instruction can be either a simple call or a compound instruction, i.e., a for loop, a sequence or a control flow graph. A call instruction represents built-in or user-defined function calls; for instance, assign statements are represented as calls to the "=" function (an informal sketch of this encoding is given after this list).

• Instructions are included within statements, which are members of the cartesian product set statement that also incorporates the declarations of local variables; a whole function body is represented in the PIPS IR as a statement. All named objects, such as user variables or built-in functions, are members of the entity set in PIPS (the value set denotes constants, while the "*" symbol introduces Newgen list sets).

• Compound instructions can be either (1) a loop instruction, which includes an iteration index variable with its lower, upper and increment expressions and a loop body, (2) a sequence, i.e., a succession of statements encoded as a list, or (3) a control flow graph (unstructured). In Newgen, a given set component such as expression can be distinguished using a prefix, such as lower and upper here.

• Programs that contain structured (exit, break and return) and unstructured (goto) transfers of control are handled in the PIPS intermediate representation via the unstructured set. An unstructured


instruction has one entry and one exit control vertex; a control is a vertex in a graph, labeled with a statement and its lists of predecessor and successor control vertices. Executing an unstructured instruction amounts to following the control flow induced by the graph successor relationship, starting at the entry vertex, while executing the vertex statements, until the exit vertex is reached, if at all.
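As an illustration only (the concrete syntax below is ours and only approximates Newgen values), the C assignment x = y + 1 would thus be encoded as a call to the "=" entity whose arguments are a reference and a nested call:

call(function:  "=",
     arguments: [ reference(variable: x, indices: []),
                  call(function: "+",
                       arguments: [ reference(variable: y, indices: []), 1 ]) ])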

In this thesis, we use the IR of PIPS [58] to showcase our approach: SPIRE, a parallel extension formalism for existing sequential intermediate representations (Chapter 4).

2.7 Conclusion

The purpose of this chapter is to present the context of existing hardware architectures and parallel programming models for which our thesis work is developed. We also outline how to bridge these two levels, using two approaches for writing parallel programs: manual and automatic. Since in this thesis we adopt the automatic approach, this chapter also presents the notions of intermediate representation, control and data dependences, and scheduling, as well as the PIPS compilation framework in which we implement our automatic parallelization methodology.

In the next chapter, we survey in detail seven recent and popular parallel programming languages and their different parallel constructs. This analysis will help us to find a uniform core language that can be designed as a unique but generic parallel intermediate representation, through the development of the SPIRE methodology (see Chapter 4).

Chapter 3

Concepts in Task Parallel Programming Languages

The unconscious is structured like a language. (Jacques Lacan)

Existing parallel languages present portability issues, such as OpenMP, which targets only shared memory systems. Parallelizing sequential programs to be run on different types of architectures requires writing the same application in different parallel programming languages. Unifying the wide variety of existing programming models in a generic core language, used as a parallel intermediate representation, can simplify this portability problem. To design this parallel IR, we use a survey of existing parallel language constructs, which provide a trade-off between expressibility and conciseness of representation. This chapter surveys seven popular and efficient parallel language designs that tackle this difficult issue: Cilk, Chapel, X10, Habanero-Java, OpenMP, MPI and OpenCL. Using as a single running example a parallel implementation of the computation of the Mandelbrot set, this chapter describes how the fundamentals of task parallel programming, i.e., collective and point-to-point synchronization and mutual exclusion, are dealt with in these languages. We also discuss how these languages allocate and distribute data over memory. Our study suggests that, even though there are many keywords and notions introduced by these languages, they all boil down, as far as control issues are concerned, to three key task concepts: creation, synchronization and atomicity. Regarding memory models, these languages adopt one of three approaches: shared memory, distributed memory and PGAS (Partitioned Global Address Space).

3.1 Introduction

This chapter is a state of the art of important existing task parallel programming languages, which helps us achieve two of our goals in this thesis: (1) the development of a parallel intermediate representation, which should be as generic as possible to handle different parallel programming languages; this is made possible using the taxonomy produced at the end of this chapter; and (2) the choice of the languages we use as back-ends of our automatic parallelization process; these are determined by the knowledge of existing target parallel languages, their limits and their extensions.


Indeed, programming parallel machines as effectively as sequential ones would ideally require a language that provides high-level programming constructs, to avoid the programming errors frequent when expressing parallelism. Programming languages adopt one of two ways to deal with this issue: (1) high-level languages hide the presence of parallelism at the software level, thus offering code that is easy to build and port, but whose performance is not guaranteed, and (2) low-level languages use explicit constructs for communication patterns and for specifying the number and placement of threads, but the resulting code is difficult to build and not very portable, although usually efficient. Recent programming models explore the best trade-offs between expressiveness and performance when addressing parallelism.

Traditionally, there are two general ways to break an application into concurrent parts in order to take advantage of a parallel computer and execute them simultaneously on different CPUs: data and task parallelism. In data parallelism, the same instruction is performed repeatedly and simultaneously on different data. In task parallelism, the execution of different processes (threads) is distributed across multiple computing nodes. Task parallelism is often considered more difficult to specify than data parallelism, since it lacks the regularity present in the latter model: processes (threads) simultaneously run different instructions, leading to different execution schedules and memory access patterns. Task management must address both control and data dependences in order to minimize execution and communication times.

This chapter describes how seven popular and efficient parallel programming language designs, either widely used or recently developed at DARPA1, tackle the issue of task parallelism specification: Cilk, Chapel, X10, Habanero-Java, OpenMP, MPI and OpenCL. They are selected based on the richness of their constructs and their popularity; they provide simple high-level parallel abstractions that cover most of the parallel programming language design spectrum. We use a popular parallel problem, the computation of the Mandelbrot set [1], as a running example. We consider this an interesting test case, since it exhibits a high level of embarrassing parallelism, while its iteration space is not easily partitioned if one wants tasks of balanced run times. Our goal with this chapter is to study and compare aspects of task parallelism, in view of our automatic parallelization aim.

After this introduction, Section 3.2 presents our running example. We discuss the parallel language features specific to task parallelism, namely task creation, synchronization and atomicity, in Section 3.3. In Section 3.4, a selection of current and important parallel programming languages are described: Cilk, Chapel, X10, Habanero-Java, OpenMP, MPI and OpenCL.

1 Three programming languages developed as part of the DARPA HPCS program: Chapel, Fortress [13], X10.


For each language, an implementation of the Mandelbrot set algorithm is presented. Section 3.5 contains a comparison, discussion and classification of these languages. We conclude in Section 3.6.

3.2 Mandelbrot Set Computation

The Mandelbrot set is a fractal set. For each complex number c ∈ C, the sequence of complex numbers zn(c) is defined by induction as follows: z0(c) = c and zn+1(c) = zn(c)² + c. The Mandelbrot set M is then defined as M = {c ∈ C | lim_{n→∞} |zn(c)| < ∞}; thus M is the set of all complex numbers c for which the series zn(c) converges. One can show [1] that a finite limit for zn(c) exists only if the modulus of zm(c) is less than 2 for some positive m. We give a sequential C implementation of the computation of the Mandelbrot set in Figure 3.2. Running this program yields Figure 3.1, in which each complex number c is seen as a pixel, its color being related to its convergence property: the Mandelbrot set is the black shape in the middle of the figure.

Figure 3.1: Result of the Mandelbrot set

We use this base program as our test case for the parallel implementations in Section 3.4, for the parallel languages we selected. This is an interesting case for illustrating parallel programming languages: (1) it is an embarrassingly parallel problem, since all computations of pixel colors can be performed simultaneously, and thus is obviously a good candidate for expressing parallelism, but (2) its efficient implementation is not obvious, since good load balancing cannot be achieved by simply grouping localized pixels together, because convergence can vary widely from one point to the next, due to the fractal nature of the Mandelbrot set.


unsigned long min_color = 0, max_color = 16777215;
unsigned int width = NPIXELS, height = NPIXELS,
             N = 2, maxiter = 10000;
double r_min = -N, r_max = N, i_min = -N, i_max = N;
double scale_r = (r_max - r_min)/width;
double scale_i = (i_max - i_min)/height;
double scale_color = (max_color - min_color)/maxiter;
Display *display; Window win; GC gc;
for (row = 0; row < height; ++row) {
  for (col = 0; col < width; ++col) {
    zr = zi = 0;
    /* Scale c as display coordinates of current point */
    cr = r_min + ((double) col * scale_r);
    ci = i_min + ((double) (height-1-row) * scale_i);
    /* Iterate z = z*z+c while |z| < N, or maxiter is reached */
    k = 0;
    do {
      temp = zr*zr - zi*zi + cr;
      zi = 2*zr*zi + ci; zr = temp;
      ++k;
    } while (zr*zr + zi*zi < (N*N) && k < maxiter);
    /* Set color and display point */
    color = (ulong) ((k-1) * scale_color) + min_color;
    XSetForeground(display, gc, color);
    XDrawPoint(display, win, gc, col, row);
  }
}

Figure 3.2: Sequential C implementation of the Mandelbrot set

3.3 Task Parallelism Issues

Among the many issues related to parallel programming, the questions of task creation, synchronization and atomicity are particularly important when dealing with task parallelism, our focus in this thesis.

3.3.1 Task Creation

In this thesis, a task is a static notion, i.e., a list of instructions, while processes and threads are running instances of tasks. Creation of system-level task instances is an expensive operation, since its implementation via processes requires allocating, and later possibly releasing, system-specific resources. If a task has a short execution time, this overhead might make the overall computation quite inefficient. Another way to introduce parallelism is to use lighter user-level tasks, called threads. In all the languages addressed in this chapter, task management operations refer to such user-level tasks. The problem of finding the proper size of tasks, and hence the number of tasks, can be decided at compile or run time, using heuristics.

In our Mandelbrot example, the parallel implementations we provide


below use a static schedule that allocates a number of iterations of the loop row to a particular thread; we interleave successive iterations into distinct threads in a round-robin fashion, in order to group loop body computations into chunks of size height/P, where P is the (language-dependent) number of threads. Our intent here is to try to reach a good load balancing between threads.

3.3.2 Synchronization

Coordination in task-parallel programs is a major source of complexity. It is dealt with using synchronization primitives, for instance when a code fragment contains many phases of execution, where each phase should wait for the previous ones before proceeding. When a process or a thread exits before synchronizing on a barrier that other processes are waiting on, or when processes operate on different barriers using different orders, a deadlock occurs; programmers must avoid these situations (and be deadlock-free). Different forms of synchronization constructs exist, such as mutual exclusion when accessing shared resources using locks, join operations that terminate child threads, multiple synchronizations using barriers2, and point-to-point synchronization using counting semaphores [95].

In our Mandelbrot example, we need to synchronize all pixel computations before exiting; we also need to use synchronization to deal with the atomic section (see the next section). Even though synchronization is rather simple in this example, caution is always needed; an example that may lead to deadlocks is mentioned in Section 3.5.

3.3.3 Atomicity

Access to shared resources requires atomic operations that, at any given time, can be executed by only one process or thread. Atomicity comes in two flavors: weak and strong [73]. A weak atomic statement is atomic only with respect to other explicitly atomic statements; no guarantee is made regarding interactions with non-isolated statements (not declared as atomic). By opposition, strong atomicity enforces non-interaction of atomic statements with all operations in the entire program. It usually requires specialized hardware support (e.g., atomic "compare and swap" operations), although a software implementation that treats non-explicitly atomic accesses as implicitly atomic single operations (using a single global lock) is possible.
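As an illustration only (not part of the thesis examples; the function name atomic_add_int is ours), the following C11 fragment sketches how an atomic update can be built on such a compare-and-swap primitive:

#include <stdatomic.h>

/* Atomically add 'value' to '*counter': retry the compare-and-swap
   until no other thread has modified the counter in between.        */
void atomic_add_int(atomic_int *counter, int value)
{
    int old = atomic_load(counter);
    while (!atomic_compare_exchange_weak(counter, &old, old + value))
        ;  /* on failure, 'old' is reloaded with the current value */
}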

In our Mandelbrot example, display accesses require a connection to the X server; drawing a given pixel is an atomic operation, since GUI-specific calls

2 The term "barrier" is used in various ways by authors [87]; we consider here that barriers are synchronization points that wait for the termination of sets of threads, defined in a language-dependent manner.


need synchronization. Moreover, two simple examples of atomic sections are provided in Section 3.5.

3.4 Parallel Programming Languages

We present here seven parallel programming languages and describe how they deal with the concepts introduced in the previous section. We also study how these languages distribute data over different processors. Given the large number of parallel languages that exist, we focus primarily on languages that are in current use and popular, and that support simple high-level task-oriented parallel abstractions.

3.4.1 Cilk

Cilk [103], developed at MIT, is a multithreaded parallel language based on C for shared memory systems. Cilk is designed for exploiting dynamic and asynchronous parallelism. A Cilk implementation of the Mandelbrot set is provided in Figure 3.3³.

cilk_lock_init(display_lock);
for (m = 0; m < P; m++)
  spawn compute_points(m);
sync;

cilk void compute_points(uint m)
{
  for (row = m; row < height; row += P)
    for (col = 0; col < width; ++col) {
      // Initialization of c, k and z
      do {
        temp = zr*zr - zi*zi + cr;
        zi = 2*zr*zi + ci; zr = temp;
        ++k;
      } while (zr*zr + zi*zi < (N*N) && k < maxiter);
      color = (ulong) ((k-1) * scale_color) + min_color;
      cilk_lock(display_lock);
      XSetForeground(display, gc, color);
      XDrawPoint(display, win, gc, col, row);
      cilk_unlock(display_lock);
    }
}

Figure 3.3: Cilk implementation of the Mandelbrot set (P is the number of processors)

3 From now on, variable declarations are omitted unless required for the purpose of our presentation.


Task Parallelism

The cilk keyword identifies functions that can be spawned in parallel. A Cilk function may create threads to execute functions in parallel. The spawn keyword is used to create child tasks, such as compute_points in our example, when referring to Cilk functions.

Cilk introduces the notion of inlets [2], which are local Cilk functions defined to take the result of spawned tasks and use it (performing a reduction); the result should not be put in a variable of the parent function, and all the variables of the function are available within an inlet. abort allows aborting a speculative work by terminating all of the already spawned children of a function; it must be called inside an inlet. Inlets are not used in our example; a small sketch is given below.
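As an illustration only (a minimal sketch in MIT Cilk syntax, not taken from the thesis), an inlet can accumulate the results of spawned calls into a local variable of the parent function, here for the classical Fibonacci example:

cilk int fib(int n)
{
    int result = 0;
    inlet void accumulate(int x)
    {
        result += x;   /* runs atomically with respect to the parent frame */
    }
    if (n < 2) return n;
    accumulate(spawn fib(n - 1));
    accumulate(spawn fib(n - 2));
    sync;              /* wait for both spawned children */
    return result;
}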

Synchronization

The sync statement is a local barrier, used in our example to ensure task termination: it waits only for the spawned child tasks of the current procedure to complete, and not for all tasks currently being executed.

Atomic Section

Mutual exclusion is implemented using locks of type cilk_lockvar, such as display_lock in our example. The function cilk_lock is used to test a lock and blocks if it is already acquired; the function cilk_unlock is used to release a lock. Both functions take a single argument, which is an object of type cilk_lockvar; cilk_lock_init is used to initialize the lock object before it is used.

Data Distribution

In Cilk's shared memory model, all variables declared outside Cilk functions are shared. Also, variables addressed indirectly by passing pointers to spawned functions are shared. To avoid possible non-determinism due to data races, the programmer should either avoid the situation where a task writes a variable that may be read or written concurrently by another task, or use the primitive cilk_fence, which ensures that all memory operations of a thread are committed before the next operation is executed.

3.4.2 Chapel

Chapel [6], developed by Cray, supports both data and control flow parallelism and is designed around a multithreaded execution model based on PGAS, for shared and distributed-memory systems. A Chapel implementation of the Mandelbrot set is provided in Figure 3.4.


coforall loc in Locales do
  on loc {
    for row in loc.id..height by numLocales do
      for col in 1..width do {
        // Initialization of c, k and z
        do {
          temp = zr*zr - zi*zi + cr;
          zi = 2*zr*zi + ci; zr = temp;
          k = k+1;
        } while (zr*zr + zi*zi < (N*N) && k < maxiter);
        color = (ulong) ((k-1) * scale_color) + min_color;
        atomic {
          XSetForeground(display, gc, color);
          XDrawPoint(display, win, gc, col, row);
        }
      }
  }

Figure 3.4: Chapel implementation of the Mandelbrot set

Task Parallelism

Chapel provides three types of task parallelism [6], two structured ones and one unstructured. cobegin{stmts} creates a task for each statement in stmts; the parent task waits for the stmts tasks to be completed. coforall is a loop variant of the cobegin statement, where each iteration of the coforall loop is a separate task and the main thread of execution does not continue until every iteration is completed. Finally, in begin{stmt}, the original parent task continues its execution after spawning a child running stmt.

Synchronization

In addition to cobegin and coforall, used in our example, which have an implicit synchronization at the end, synchronization variables of type sync can be used for coordinating parallel tasks. A sync [6] variable is either empty or full, with an additional data value. Reading an empty variable or writing to a full variable suspends the thread. Writing to an empty variable atomically changes its state to full. Reading a full variable consumes the value and atomically changes the state to empty.

Atomic Section

Chapel supports atomic sections: atomic{stmt} executes stmt atomically with respect to other threads. The precise semantics is still ongoing work [6].


Data Distribution

Chapel introduces a type called locale to refer to a unit of the machine resources on which a computation is running. A locale is a mapping of Chapel data and computations to the physical machine. In Figure 3.4, the array Locales represents the set of locale values corresponding to the machine resources on which this code is running; numLocales refers to the number of locales. Chapel also introduces new domain types to specify array distribution; they are not used in our example.

3.4.3 X10 and Habanero-Java

X10 [5], developed at IBM, is a distributed asynchronous dynamic parallel programming language for multi-core processors, symmetric shared-memory multiprocessors (SMPs), commodity clusters, high-end supercomputers and even embedded processors like Cell. An X10 implementation of the Mandelbrot set is provided in Figure 3.5.

Habanero-Java [30], under development at Rice University, is derived from X10 and introduces additional synchronization and atomicity primitives, surveyed below.

finish {
  for (m = 0; m < place.MAX_PLACES; m++) {
    place pl_row = place.places(m);
    async at (pl_row) {
      for (row = m; row < height; row += place.MAX_PLACES)
        for (col = 0; col < width; ++col) {
          // Initialization of c, k and z
          do {
            temp = zr*zr - zi*zi + cr;
            zi = 2*zr*zi + ci; zr = temp;
            ++k;
          } while (zr*zr + zi*zi < (N*N) && k < maxiter);
          color = (ulong) ((k-1) * scale_color) + min_color;
          atomic {
            XSetForeground(display, gc, color);
            XDrawPoint(display, win, gc, col, row);
          }
        }
    }
  }
}

Figure 3.5: X10 implementation of the Mandelbrot set

Task Parallelism

X10 provides two task creation primitives: (1) the async stmt construct creates a new asynchronous task that executes stmt while the current thread continues, and (2) the future exp expression launches a parallel task that returns the value of exp.


Synchronization

With finish stmt, the current running task is blocked at the end of the finish clause, waiting until all the children spawned during the execution of stmt have terminated. The expression f.force() is used to get the actual value of the "future" task f.

X10 introduces a new synchronization concept: the clock. It acts as a barrier for a dynamically varying set of tasks [100] that operate in phases of execution, where each phase should wait for previous ones before proceeding. A task that uses a clock must first register with it (multiple clocks can be used). It then uses the statement next to signal to all the tasks that are registered with its clocks that it is ready to move to the following phase, and waits until all the clocks with which it is registered can advance. A clock can advance only when all the tasks that are registered with it have executed a next statement (see the left side of Figure 3.6).

Habanero-Java introduces phasers to extend this clock mechanism. A phaser is created and initialized to its first phase using the function new. The scope of a phaser is limited to the immediately enclosing finish statement; this constraint guarantees the deadlock-freedom safety property of phasers [100]. A task can be registered with zero or more phasers, using one of four registration modes: the first two are the traditional SIG and WAIT signal operations for producer-consumer synchronization; the SIG_WAIT mode implements barrier synchronization, while SIG_WAIT_SINGLE ensures, in addition, that its associated statement is executed by only one thread. As in X10, a next instruction is used to advance each phaser that this task is registered with to its next phase, in accordance with this task's registration mode, and waits on each phaser that the task is registered with in a WAIT submode (see the right side of Figure 3.6). Note that clocks and phasers are not used in our Mandelbrot example, since a collective barrier based on the finish statement is sufficient.

finish async {
  clock cl = clock.make();
  for(j = 1; j <= n; j++)
    async clocked(cl) {
      S;
      next;
      S';
    }
}

finish {
  phaser ph = new phaser();
  for(j = 1; j <= n; j++)
    async phased(ph<SIG_WAIT>) {
      S;
      next;
      S';
    }
}

Figure 3.6: A clock in X10 (left) and a phaser in Habanero-Java (right)


Atomic Section

When a thread enters an atomic statement, no other thread may enter it until the original thread terminates it.

Habanero-Java supports weak atomicity using the isolated stmt primitive for mutual exclusion. The Habanero-Java implementation takes a single-lock approach to deal with isolated statements.

Data Distribution

In order to distribute data across processors, X10 and HJ introduce a type called place. A place is an address space within which a task may run; different places may however refer to the same physical processor and share physical memory. The program address space is partitioned into logically distinct places. place.MAX_PLACES, used in Figure 3.5, is the number of places available to a program.

3.4.4 OpenMP

OpenMP [4] is an application program interface providing a multi-threaded programming model for shared memory parallelism; it uses directives to extend sequential languages. A C OpenMP implementation of the Mandelbrot set is provided in Figure 3.7.

P = omp_get_num_threads();
#pragma omp parallel shared(height, width, scale_r, \
    scale_i, maxiter, scale_color, min_color, r_min, i_min) \
    private(row, col, k, m, color, temp, z, c)
#pragma omp single
{
  for (m = 0; m < P; m++) {
    #pragma omp task
    {
      for (row = m; row < height; row += P) {
        for (col = 0; col < width; ++col) {
          // Initialization of c, k and z
          do {
            temp = zr*zr - zi*zi + cr;
            zi = 2*zr*zi + ci; zr = temp;
            ++k;
          } while (zr*zr + zi*zi < (N*N) && k < maxiter);
          color = (ulong) ((k-1) * scale_color) + min_color;
          #pragma omp critical
          {
            XSetForeground(display, gc, color);
            XDrawPoint(display, win, gc, col, row);
          }
        }
      }
    }
  }
}

Figure 3.7: C OpenMP implementation of the Mandelbrot set


Task Parallelism

OpenMP allows dynamic (omp task) and static (omp section) scheduling models. A task instance is generated each time a thread (the encountering thread) encounters an omp task directive. This task may either be scheduled immediately on the same thread or deferred and assigned to any thread in a thread team, which is the group of threads created when an omp parallel directive is encountered. The omp sections directive is a non-iterative work-sharing construct: it specifies that the enclosed sections of code, declared with omp section, are to be divided among the threads in the team; these sections are independent blocks of code that the compiler can execute concurrently.
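The contrast between the two models can be sketched as follows; this is an illustrative fragment only (foo and bar are placeholder functions, not part of the Mandelbrot example).

  #include <omp.h>
  void foo(void); void bar(void);

  void tasks_and_sections(void)
  {
    #pragma omp parallel
    {
      #pragma omp single        /* one encountering thread creates the tasks */
      {
        #pragma omp task        /* deferred, may run on any thread of the team */
        foo();
        #pragma omp task
        bar();
      }                         /* implicit barrier: both tasks are completed here */

      #pragma omp sections      /* static work-sharing among the team */
      {
        #pragma omp section
        foo();
        #pragma omp section
        bar();
      }
    }
  }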

Synchronization

OpenMP provides synchronization constructs that control the execution inside a team of threads: barrier and taskwait. When a thread encounters a barrier directive, it waits until all other threads in the team reach the same point; the scope of a barrier region is the innermost enclosing parallel region. The taskwait construct is a restricted barrier that blocks the thread until all child tasks created since the beginning of the current task are completed. The omp single directive identifies code that must be run by only one thread.
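A minimal illustration of these constructs, with placeholder functions phase1, phase2 and child_work, might look as follows.

  void phase1(void); void phase2(void); void child_work(void);

  void sync_example(void)
  {
    #pragma omp parallel
    {
      phase1();
      #pragma omp barrier       /* every thread of the team waits here */
      phase2();

      #pragma omp single
      {
        #pragma omp task
        child_work();
        #pragma omp taskwait    /* waits only for the child tasks created above */
      }
    }
  }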

Atomic Section

The critical and atomic directives are used for identifying a section of code that must be executed by a single thread at a time. The atomic directive works faster than critical, since it only applies to single instructions and can thus often benefit from hardware support. Our implementation of the Mandelbrot set in Figure 3.7 uses critical for the drawing function (not a single instruction).

Data Distribution

OpenMP variables are either global (shared) or local (private); see Figure 3.7 for examples. A shared variable refers to one unique block of storage for all threads in the team. A private variable refers to a different block of storage for each thread. More memory access modes exist, such as firstprivate or lastprivate, that may require communication or copy operations.
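A small, illustrative sketch of these copy semantics (x, last and n are hypothetical variables) could be:

  void copy_modes(int n)
  {
    int x = 42, last = 0;
    #pragma omp parallel for firstprivate(x) lastprivate(last)
    for (int i = 0; i < n; i++) {
      /* each thread gets a private x, copy-initialized to 42 */
      last = i + x;   /* the value from the sequentially last iteration
                         is copied back to the shared variable last    */
    }
  }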

3.4.5 MPI

MPI [3] is a library specification for message passing, used to program shared and distributed memory systems. Both point-to-point and collective communication are supported. The C MPI implementation of the Mandelbrot set is provided in Figure 3.8.

int rank;
MPI_Status status;
ierr = MPI_Init(&argc, &argv);
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
ierr = MPI_Comm_size(MPI_COMM_WORLD, &P);
for (m = 1; m < P; m++)
  if (rank == m)
    for (row = m; row < height; row += P) {
      for (col = 0; col < width; ++col) {
        // Initialization of c, k and z
        do {
          temp = z.r*z.r - z.i*z.i + c.r;
          z.i = 2*z.r*z.i + c.i; z.r = temp;
          ++k;
        } while (z.r*z.r + z.i*z.i < (N*N) && k < maxiter);
        color_msg[col] = (ulong) ((k-1) * scale_color) + min_color;
      }
      ierr = MPI_Send(color_msg, width, MPI_UNSIGNED_LONG, 0, 0,
                      MPI_COMM_WORLD);
    }
if (rank == 0)
  for (row = 0; row < height; row++) {
    ierr = MPI_Recv(data_msg, width, MPI_UNSIGNED_LONG,
                    MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);
    for (col = 0; col < width; ++col) {
      color = data_msg[col];
      XSetForeground(display, gc, color);
      XDrawPoint(display, win, gc, col, row);
    }
  }
ierr = MPI_Barrier(MPI_COMM_WORLD);
ierr = MPI_Finalize();

Figure 3.8: MPI implementation of the Mandelbrot set (P is the number of processors)

Task Parallelism

The starter process may be a separate process that is not part of the MPI application, or the rank 0 process may act as a starter process to launch the remaining MPI processes of the MPI application. MPI processes are created when MPI_Init is called; this call also defines the initial universe intracommunicator for all processes, used to drive the various communications. A communicator determines the scope of communications. Processes are terminated using MPI_Finalize.
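A minimal C skeleton of this lifecycle, independent of the Mandelbrot example, would look roughly as follows.

  #include <mpi.h>

  int main(int argc, char *argv[])
  {
    int rank, size;
    MPI_Init(&argc, &argv);               /* creates the MPI processes and the initial
                                             universe intracommunicator MPI_COMM_WORLD */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* identity of this process                  */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* number of processes                       */
    /* ... computations and communications scoped by MPI_COMM_WORLD ...                */
    MPI_Finalize();                       /* terminates the MPI processes              */
    return 0;
  }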

Synchronization

MPI_Barrier blocks until all processes in the communicator passed as an argument have reached this call.

Atomic Section

MPI does not support atomic sections. Indeed, the atomic section is a concept intrinsically linked to shared memory. In our example, the drawing function is executed by the master process (the rank 0 process).

Data Distribution

MPI supports the distributed memory model, where it proceeds by point-to-point and collective communications to transfer data between processors. In our implementation of the Mandelbrot set, we use the point-to-point routines MPI_Send and MPI_Recv to communicate between the master process (rank = 0) and the other processes.

3.4.6 OpenCL

OpenCL (Open Computing Language) [69] is a standard for programming heterogeneous multiprocessor platforms, where programs are divided into several parts: some, called "the kernels", execute on separate devices, e.g. GPUs, with their own memories, while the others execute on the host CPU. The main object in OpenCL is the command queue, which is used to submit work to a device by enqueueing OpenCL commands to be executed. An OpenCL implementation of the Mandelbrot set is provided in Figure 3.9.

Task Parallelism

OpenCL provides the parallel construct clEnqueueTask, which enqueues a command requiring the execution of a kernel on a device by a work item (OpenCL thread). OpenCL uses two different models of execution of command queues: in-order, used for data parallelism, and out-of-order. In an out-of-order command queue, commands are executed as soon as possible, and no order is specified, except for wait and barrier events. We illustrate the out-of-order execution mechanism in Figure 3.9, but currently this is an optional feature and is thus not supported by many devices.
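As a hedged sketch of how ordering can still be imposed in an out-of-order queue through wait events (context, device_id and the kernels are assumed to have been created beforehand, using the OpenCL 1.x API):

  cl_int err;
  cl_event done_a;
  cl_command_queue q = clCreateCommandQueue(context, device_id,
                           CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
  clEnqueueTask(q, kernel_a, 0, NULL, &done_a);  /* runs as soon as possible     */
  clEnqueueTask(q, kernel_b, 1, &done_a, NULL);  /* ordered only w.r.t. kernel_a */
  clEnqueueTask(q, kernel_c, 0, NULL, NULL);     /* unordered w.r.t. a and b     */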


__kernel void kernel_main(complex c, uint maxiter, double scale_color,
                          uint m, uint P, ulong color[NPIXELS][NPIXELS])
{
  for (row = m; row < NPIXELS; row += P)
    for (col = 0; col < NPIXELS; ++col) {
      // Initialization of c, k and z
      do {
        temp = z.r*z.r - z.i*z.i + c.r;
        z.i = 2*z.r*z.i + c.i; z.r = temp;
        ++k;
      } while (z.r*z.r + z.i*z.i < (N*N) && k < maxiter);
      color[row][col] = (ulong) ((k-1) * scale_color);
    }
}

cl_int ret = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
ret = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_DEFAULT, 1,
                     &device_id, &ret_num_devices);
cl_context context = clCreateContext(NULL, 1, &device_id,
                                     NULL, NULL, &ret);
cQueue = clCreateCommandQueue(context, device_id,
                              OUT_OF_ORDER_EXEC_MODE_ENABLE, NULL);
P = CL_DEVICE_MAX_COMPUTE_UNITS;
memc = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(complex), c);
// Create read-only buffers with maxiter, scale_color and P too
memcolor = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                          sizeof(ulong)*height*width, NULL, NULL);
clEnqueueWriteBuffer(cQueue, memc, CL_TRUE, 0,
                     sizeof(complex), &c, 0, NULL, NULL);
// Enqueue write buffer with maxiter, scale_color and P too
program = clCreateProgramWithSource(context, 1, &program_source,
                                    NULL, NULL);
err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
kernel = clCreateKernel(program, "kernel_main", NULL);
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memc);
// Set kernel arguments with
// memmaxiter, memscale_color, memP and memcolor too
for (m = 0; m < P; m++) {
  memm = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(uint), m);
  clEnqueueWriteBuffer(cQueue, memm, CL_TRUE, 0, sizeof(uint), &m, 0,
                       NULL, NULL);
  clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memm);
  clEnqueueTask(cQueue, kernel, 0, NULL, NULL);
}
clFinish(cQueue);
clEnqueueReadBuffer(cQueue, memcolor, CL_TRUE, 0, space, color, 0, NULL, NULL);
for (row = 0; row < height; ++row)
  for (col = 0; col < width; ++col) {
    XSetForeground(display, gc, color[col][row]);
    XDrawPoint(display, win, gc, col, row);
  }

Figure 3.9: OpenCL implementation of the Mandelbrot set

Synchronization

OpenCL distinguishes between two types of synchronization: coarse and fine. Coarse-grained synchronization, which deals with command queue operations, uses the construct clEnqueueBarrier, which defines a barrier synchronization point. Fine-grained synchronization, which covers synchronization at the GPU function call granularity level, uses OpenCL events via clEnqueueWaitForEvents calls.

Data transfers between the GPU memory and the host memory, via functions such as clEnqueueReadBuffer and clEnqueueWriteBuffer, also induce synchronization, between blocking or non-blocking communication commands. Events returned by clEnqueue operations can be used to check if a non-blocking operation has completed.
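For instance, a non-blocking read might be overlapped with host work and completed later via its event (cQueue, membuf, nbytes and host_ptr are illustrative; OpenCL 1.x API assumed):

  cl_event read_done;
  clEnqueueReadBuffer(cQueue, membuf, CL_FALSE /* non-blocking */, 0,
                      nbytes, host_ptr, 0, NULL, &read_done);
  /* ... host work overlapped with the transfer ... */
  clWaitForEvents(1, &read_done);   /* host-side wait for completion */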

Atomic Section

Atomic operations are only supported on integer data, via functions such as atom_add or atom_xchg. Currently, these are only supported by some devices, as part of an extension of the OpenCL standard. Since OpenCL lacks support for general atomic sections, the drawing function is executed by the host in Figure 3.9.
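A sketch of such an integer atomic operation inside a kernel, assuming the cl_khr_global_int32_base_atomics extension is available on the device, is given below; the kernel itself is illustrative only.

  #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

  __kernel void tally(__global int *counter, __global const int *flag)
  {
    size_t i = get_global_id(0);
    if (flag[i])
      atom_add(counter, 1);   /* atomic integer read-modify-write on global memory */
  }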

Data Distribution

Each work item can either use (1) its private memory, (2) its local memory, which is shared between multiple work items, (3) its constant memory, which is closer to the processor than the global memory, and thus much faster to access, although slower than local memory, and (4) global memory, shared by all work items. Data is only accessible after being transferred from the host, using functions such as clEnqueueReadBuffer and clEnqueueWriteBuffer that move data in and out of a device.
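These four address spaces appear as qualifiers in kernel code; the following kernel is only an illustrative sketch.

  __kernel void address_spaces(__global float *g,     /* shared by all work items   */
                               __local  float *l,     /* shared within a work group */
                               __constant float *c)   /* read-only, cached          */
  {
    __private float p = c[0];                         /* per-work-item storage      */
    l[get_local_id(0)] = g[get_global_id(0)] + p;
  }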

3.5 Discussion and Comparison

This section discusses the salient features of our surveyed languages. More specifically, we look at their design philosophy and the new concepts they introduce, how point-to-point synchronization is addressed in each of these languages, the various semantics of atomic sections and the data distribution issues. We end up summarizing the key features of all the languages covered in this chapter.

Design Paradigms

Our overview study, based on a single running example, namely the computation of the Mandelbrot set, is admittedly somewhat biased, since each language has been designed with a particular application framework in mind, which may or may not be well adapted to a given application. Cilk is well suited to deal with divide-and-conquer strategies, something not put into practice in our example. On the contrary, X10, Chapel and Habanero-Java are high-level Partitioned Global Address Space languages that offer abstract notions such as places and locales, which were put to good use in our example. OpenCL is a low-level, verbose language that works across GPUs and CPUs; our example clearly illustrates that this approach is not providing much help here in terms of shrinking the semantic gap between specification and implementation. The OpenMP philosophy is to add compiler directives to parallelize parts of code on shared-memory machines; this helps programmers move incrementally from a sequential to a parallel implementation. While OpenMP suffers from portability issues, MPI tackles both shared and distributed memory systems using a low-level support for communications.

New Concepts

Even though this thesis does not address data parallelism per se, note that Cilk is the only language that does not provide support for data parallelism; yet spawned threads can be used inside loops to simulate SIMD processing. Also, Cilk adds a facility to support speculative parallelism, enabling spawned-task abort operations via the abort statement. Habanero-Java introduces the isolated statement to specify the weak atomicity property. Phasers, in Habanero-Java, and clocks, in X10, are new high-level constructs for collective and point-to-point synchronization between varying sets of threads.

Point-to-Point Synchronization

We illustrate the way the surveyed languages address the difficult issue of point-to-point synchronization via a simple example, a hide-and-seek children game, in Figure 3.10. X10 clocks or Habanero-Java phasers help express easily the different phases between threads. The notion of point-to-point synchronization cannot be expressed easily using OpenMP or Chapel(4). We were not able to implement this game using Cilk high-level synchronization primitives, since sync, the only synchronization construct, is a local barrier for recursive tasks: it synchronizes only threads spawned in the current procedure, and thus not the two searcher and hider tasks. As mentioned above, this is not surprising given Cilk's approach to parallelism.

Atomic Section

The semantics and implementations of the various proposals for dealing with atomicity are rather subtle.

Atomic operations, which apply to single instructions, can be efficiently implemented, e.g. in X10, using non-blocking techniques such as the atomic instruction compare-and-swap.

(4) Busy-waiting techniques can be applied.


// X10
finish async {
  clock cl = clock.make();
  async clocked(cl) {
    count_to_a_number();
    next;
    start_searching();
  }
  async clocked(cl) {
    hide_oneself();
    next;
    continue_to_be_hidden();
  }
}

// Habanero-Java
finish async {
  phaser ph = new phaser();
  async phased(ph) {
    count_to_a_number();
    next;
    start_searching();
  }
  async phased(ph) {
    hide_oneself();
    next;
    continue_to_be_hidden();
  }
}

// Cilk
cilk void searcher() {
  count_to_a_number();
  point_to_point_sync();   // missing
  start_searching();
}
cilk void hidder() {
  hide_oneself();
  point_to_point_sync();   // missing
  continue_to_be_hidden();
}
void main() {
  spawn searcher();
  spawn hidder();
}

Figure 3.10: A hide-and-seek game (X10, HJ, Cilk)

In OpenMP, the atomic directive can be made to work faster than the critical directive when atomic operations are replaced with processor commands such as GLSC [71] instructions, or "gather-linked and scatter-conditional", which extend scatter-gather hardware to support advanced atomic memory operations; therefore, it is better to use this directive when protecting shared memory during elementary operations. Atomic operations can be used to update different elements of a data structure (arrays, records) in parallel without using many explicit locks. In the example of Figure 3.11, the updates of different elements of Array x are allowed to occur in parallel. General atomic sections, on the other hand, serialize the execution of updates to elements via one lock.

With the weak atomicity model of Habanero-Java, the isolated keyword is used instead of atomic to make explicit the fact that the construct supports weak rather than strong isolation. In Figure 3.12, Threads 1 and 2 may access ptr simultaneously, since weakly atomic accesses are used; an atomic access to temp->next is not enforced.


#pragma omp parallel for shared(x, index, n)
for (i = 0; i < n; i++) {
  #pragma omp atomic
  x[index[i]] += f(i);   // index is supposed injective
}

Figure 3.11: Example of an atomic directive in OpenMP

// Thread 1
ptr = head;              // non-isolated statement
isolated {
  ready = true;
}

// Thread 2
isolated {
  if (ready)
    temp->next = ptr;
}

Figure 3.12: Data race on ptr with Habanero-Java

Data Distribution

PGAS languages offer a compromise between the fine level of control of data placement provided by the message passing model and the simplicity of the shared memory model. However, the physical reality is that different PGAS portions, although logically distinct, may refer to the same physical processor and share physical memory. Practical performance might thus not be as good as expected.

Regarding the shared memory model, despite its simplicity of programming, programmers have scarce support for expressing data locality, which could help improve performance in many cases. Debugging is also difficult when data races or deadlocks occur.

Finally, the message passing memory model, where processors have no direct access to the memories of other processors, can be seen as the most general one, in which programmers can both specify data distribution and control locality. Shared memory (where there is only one processor managing the whole memory) and PGAS (where one assumes that each portion is located on a distinct processor) models can be seen as particular instances of the message passing model, obtained by converting implicit write and read operations into explicit send/receive message passing constructs.
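As an illustration of this conversion, assuming a two-process MPI program in which rank has been obtained from MPI_Comm_rank, an implicit shared-memory read of sum by Process 0 becomes an explicit message:

  float sum, y;
  /* Shared memory: Process 1 writes sum, Process 0 reads it implicitly:
   *     y = sum + 1;
   * Message passing: the same exchange is made explicit with send/receive. */
  if (rank == 1) {
    MPI_Send(&sum, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
  } else if (rank == 0) {
    MPI_Recv(&sum, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    y = sum + 1;
  }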

Summary Table

We collect in Table 3.1 the main characteristics of each language addressed in this chapter. Even though we have not discussed the issue of data parallelism in this chapter, we nonetheless provide the main constructs used in each language to launch data parallel computations.


Language             | Task creation         | Task join                  | Point-to-point | Atomic section           | Data parallelism     | Memory model
Cilk (MIT)           | spawn, abort          | sync                       | —              | cilk_lock, cilk_unlock   | —                    | Shared
Chapel (Cray)        | begin, cobegin        | sync                       | sync           | sync, atomic             | forall, coforall     | PGAS (Locales)
X10 (IBM)            | async, future         | finish, force              | next           | atomic                   | foreach              | PGAS (Places)
Habanero-Java (Rice) | async, future         | finish, get                | next           | atomic, isolated         | foreach              | PGAS (Places)
OpenMP               | omp task, omp section | omp taskwait, omp barrier  | —              | omp critical, omp atomic | omp for              | Shared
OpenCL               | EnqueueTask           | Finish, EnqueueBarrier     | events         | atom_add                 | EnqueueNDRangeKernel | Message passing
MPI                  | MPI_spawn             | MPI_Finalize, MPI_Barrier  | —              | —                        | MPI_Init             | Message passing

Table 3.1: Summary of parallel languages constructs

3.6 Conclusion

Using the Mandelbrot set computation as a running example, this chapter presents an up-to-date comparative overview of seven parallel programming languages: Cilk, Chapel, X10, Habanero-Java, OpenMP, MPI and OpenCL. These languages are in current use, popular, offer rich and highly abstract functionalities, and most support both data and task parallel execution models. The chapter describes how, in addition to data distribution and locality, the fundamentals of task parallel programming, namely task creation, collective and point-to-point synchronization, and mutual exclusion, are dealt with in these languages.

This study serves as the basis for the design of SPIRE, a sequential to parallel intermediate representation extension that we use to upgrade the intermediate representations of the PIPS [19] source-to-source compilation framework so as to represent task concepts in parallel languages. SPIRE is presented in the next chapter.

An earlier version of the work presented in this chapter was published in [66].

Chapter 4

SPIRE: A Generic Sequential to Parallel Intermediate Representation Extension Methodology

I was working on the proof of one of my poems all the morning, and took out a comma. In the afternoon I put it back again. Oscar Wilde

The research goal addressed in this thesis is the design of efficient automatic parallelization algorithms that use as a backend a uniform, simple and generic parallel intermediate core language. Instead of designing from scratch one such particular intermediate language, we take an indirect, more general route. We use the survey presented in Chapter 3 to design SPIRE, a new methodology for the design of parallel extensions of the intermediate representations used in compilation frameworks of sequential languages. It can be used to leverage existing infrastructures for sequential languages to address both control and data parallel constructs, while preserving as much as possible existing analyses for sequential code. In this chapter, we suggest to view this upgrade process as an "intermediate representation transformer" at the syntactic and semantic levels; we show this can be done via the introduction of only ten new concepts, collected in three groups, namely execution, synchronization and data distribution, precisely defined via a formal semantics and rewriting rules.

We use the sequential intermediate representation of PIPS, a comprehensive source-to-source compilation platform, as a use case for our approach to the definition of parallel intermediate languages. We introduce our SPIRE parallel primitives, extend the PIPS intermediate representation and show how example code snippets from the OpenCL, Cilk, OpenMP, X10, Habanero-Java, MPI and Chapel parallel programming languages can be represented this way. A formal definition of the SPIRE operational semantics is provided, built on top of the one used for the sequential intermediate representation. We assess the generality of our proposal by showing how a different sequential IR, namely LLVM, can be extended to handle parallelism using the SPIRE methodology.

Our primary goal with the development of SPIRE is to provide, at a low cost, powerful parallel program representations that ease the design of efficient automatic parallelization algorithms. More precisely, in this chapter, our intent is to describe SPIRE as a uniform, simple and generic core language transformer, and to apply it to PIPS IR. We illustrate the applicability of the resulting parallel IR by using it as a parallel code target in Chapter 7, where we also address code generation issues.

4.1 Introduction

The growing importance of parallel computers and the search for efficient programming models led, and is still leading, to the proliferation of parallel programming languages such as, currently, Cilk [103], Chapel [6], X10 [5], Habanero-Java [30], OpenMP [4], OpenCL [69] or MPI [3], presented in Chapter 3. Our automatic task parallelization algorithm, which we integrate within PIPS [58], a comprehensive source-to-source compilation and optimization platform, should help in the adaptation of PIPS to such an evolution. To reach this goal, we introduce an internal intermediate representation (IR) to represent parallel programs in PIPS. The choice of a proper parallel IR is of key importance, since the efficiency and power of the transformations and optimizations the compiler PIPS can perform are closely related to the selection of a proper program representation paradigm. This seems to suggest the introduction of a dedicated IR for each parallel programming paradigm. Yet, it would be better, from a software engineering point of view, to find a unique parallel IR, as general and simple as possible.

Existing proposals for program representation techniques already provide a basis for the exploitation of parallelism via the encoding of control and/or data flow information. HPIR [111], PLASMA [89] or InsPIRe [57] are instances that operate at a high abstraction level, while the hierarchical task, stream or program dependence graphs (we survey these notions in Section 4.5) are better suited to graph-based approaches. Unfortunately, many existing compiler frameworks such as PIPS use traditional representations for sequential-only programs, and changing their internal data structures to deal with parallel constructs is a difficult and time-consuming task.

In this thesis, we choose to develop a methodology for the design of parallel extensions of the intermediate representations used in compilation frameworks of sequential languages in general, instead of developing a parallel intermediate representation for PIPS only. The main motivation behind the design of the methodology introduced in this chapter is to preserve the many years of development efforts invested in huge compiler platforms such as GCC (more than 7 million lines of code), PIPS (600,000 lines of code) or LLVM (more than 1 million lines of code), when upgrading their intermediate representations to handle parallel languages, as source languages or as targets for source-to-source transformations. We provide an evolutionary path for these large software developments via the introduction of the Sequential to Parallel Intermediate Representation Extension (SPIRE) methodology. We show that it can be plugged into existing compilers in a rather simple manner. SPIRE is based on only three key concepts: (1) the parallel vs. sequential execution of groups of statements such as sequences, loops and general control-flow graphs, (2) the global synchronization characteristics of statements and the specification of finer grain synchronization via the notion of events, and (3) the handling of data distribution for different memory models. To illustrate how this approach can be used in practice, we use SPIRE to extend the intermediate representation (IR) [32] of PIPS.

The design of SPIRE is the result of many trade-offs between generality and precision, abstraction and low-level concerns. On the one hand, and in particular when looking at source-to-source optimizing compiler platforms adapted to multiple source languages, one needs to (1) represent as many of the existing (and hopefully future) parallel constructs as possible, (2) minimize the number of additional built-in functions that hamper compiler optimization passes, and (3) preserve high-level structured parallel constructs to keep their properties, such as deadlock freedom, while minimizing the number of new concepts introduced in the parallel IR. On the other hand, keeping only a limited number of hardware-level notions in the IR, while good enough to deal with all parallel constructs, would entail convoluted rewritings of high-level parallel flows. We used an extensive survey of key parallel languages, namely Cilk, Chapel, X10, Habanero-Java, OpenMP, OpenCL and MPI, to guide our design of SPIRE, while showing how to express their relevant parallel constructs within SPIRE.

The remainder of this chapter is structured as follows. Our parallel extension proposal, SPIRE, is introduced in Section 4.2, where we illustrate its application to the PIPS sequential IR presented in Section 2.6.5; we also show how simple illustrative examples written in OpenCL, Cilk, OpenMP, X10, Habanero-Java, MPI and Chapel can be easily represented within SPIRE(PIPS IR)(1). The formal operational semantics of SPIRE is given in Section 4.3. Section 4.4 shows the generality of the SPIRE methodology by showing its use on LLVM. We survey existing parallel IRs in Section 4.5. We conclude in Section 4.6.

4.2 SPIRE: a Sequential to Parallel IR Extension

In this section, we present in detail the SPIRE methodology, which is used to add parallel concepts to sequential IRs. After introducing our design philosophy, we describe the application of SPIRE on the PIPS IR presented in Section 2.6.5. We illustrate these SPIRE-derived constructs with code excerpts from various parallel programming languages; our intent is not to provide here general rewriting techniques from these to SPIRE, but to provide hints on how such rewritings might possibly proceed. Note that, in Section 4.4, using LLVM, we show that our methodology is general enough to be adapted to other IRs.

(1) SPIRE(PIPS IR) means that the transformation function SPIRE is applied on the sequential IR of PIPS; it yields its parallel IR.

4.2.1 Design Approach

SPIRE intends to be a practical methodology to extend existing sequential IRs to parallelism constructs, either to generate parallel code from sequential programs or to compile explicitly parallel programming languages. Interestingly, the idea of seeing the issue of parallelism as an extension over sequential concepts is in sync with Dijkstra's view that "parallelism or concurrency are operational concepts that refer not to the program, but to its execution" [40]. If one accepts such a vision, adding parallelism extensions to existing IRs, as advocated by our approach with SPIRE, can thus, at a fundamental level, not be seen as an afterthought but as a consequence of the fundamental nature of parallelism.

Our design of SPIRE does not intend to be minimalist, but to be as seamlessly as possible integrable within actual IRs, and general enough to handle as many parallel programming constructs as possible. To be successful, our design point must provide proper trade-offs between generality, expressibility and conciseness of representation. We used an extensive survey of existing parallel languages to guide us during this design process. Table 4.1, which extends the one provided in Chapter 3, summarizes the main characteristics of seven recent and widely used parallel languages: Cilk, Chapel, X10, Habanero-Java, OpenMP, OpenCL and MPI. The main constructs used in each language to launch task and data parallel computations, perform synchronization, introduce atomic sections and transfer data in the various memory models are listed. Our main finding from this analysis is that, to be able to deal with parallel programming, one simply needs to add to a given sequential IR the ability to specify (1) the parallel execution mechanism of groups of statements, (2) the synchronization behavior of statements, and (3) the layout of data, i.e. how memory is modeled in the parallel language.

The last line of Table 4.1 summarizes the approach we propose to map these programming concepts to our parallel intermediate representation extension. SPIRE is based on the introduction of only ten key notions, collected in three groups:

• execution, via the sequential and parallel constructs;

• synchronization, via the spawn, barrier, atomic, single, signal and wait constructs;


Language             | Parallelism               | Task creation         | Task join                 | Point-to-point | Atomic section           | Memory model        | Data distribution
Cilk (MIT)           | —                         | spawn                 | sync                      | —              | cilk_lock                | Shared              | —
Chapel (Cray)        | forall, coforall, cobegin | begin                 | sync                      | sync           | sync, atomic             | PGAS (Locales)      | (on)
X10 (IBM)            | foreach                   | async, future         | finish, force             | next           | atomic                   | PGAS (Places)       | (at)
Habanero-Java (Rice) | foreach                   | async, future         | finish, get               | next           | atomic, isolated         | PGAS (Places)       | (at)
OpenMP               | omp for, omp sections     | omp task, omp section | omp taskwait, omp barrier | —              | omp critical, omp atomic | Shared              | private, shared
OpenCL               | EnqueueNDRangeKernel      | EnqueueTask           | Finish, EnqueueBarrier    | events         | atom_add                 | Distributed         | ReadBuffer, WriteBuffer
MPI                  | MPI_Init                  | MPI_spawn             | MPI_Finalize, MPI_Barrier | —              | —                        | Distributed         | MPI_Send, MPI_Recv
SPIRE                | sequential, parallel      | spawn                 | barrier                   | signal, wait   | atomic                   | Shared, Distributed | send, recv

Table 4.1: Mapping of SPIRE to parallel languages constructs (terms in parentheses are not currently handled by SPIRE)

• data distribution, via the send and recv constructs.

Small code snippets are provided below to sketch how the key constructs of these parallel languages can be encoded in practice within a SPIRE-extended parallel IR.

4.2.2 Execution

The issue of parallel vs. sequential execution appears when dealing with groups of statements, which in our case study correspond to members of the forloop, sequence and unstructured sets presented in Section 2.6.5. To apply SPIRE to the PIPS sequential IR, an execution attribute is added to these sequential set definitions:

forloop'      = forloop x execution;
sequence'     = sequence x execution;
unstructured' = unstructured x execution;

The primed sets forloop' (expressing data parallelism), sequence' and unstructured' (implementing control parallelism) represent SPIREd-up sets for the PIPS parallel IR. Of course, the 'prime' notation is used here for pedagogical purpose only; in practice, an execution field is added in the existing IR representation. The definition of execution is straightforward:


execution = sequential:unit + parallel:unit;

where unit denotes a set with one single element; this encodes a simple enumeration of cases for execution. A parallel execution attribute asks for all loop iterations, sequence statements and control nodes of unstructured instructions to be run concurrently. Note that there is no implicit barrier after the termination of these parallel statements; in Section 4.2.3, we introduce the annotation barrier that can be used to add an explicit barrier on these parallel statements, if need be.

For instance, a parallel execution construct can be used to represent data parallelism on GPUs, when expressed via the OpenCL clEnqueueNDRangeKernel function (see the top of Figure 4.1). This function call could be encoded within the PIPS parallel IR as a parallel loop (see the bottom of Figure 4.1), each iteration executing the kernel function as a separate task, receiving the proper index value as an argument.

// Execute 'n' kernels in parallel
global_work_size[0] = n;
err = clEnqueueNDRangeKernel(cmd_queue, kernel,
                             1, NULL, global_work_size,
                             NULL, 0, NULL, NULL);

forloop(I, 1,
        global_work_size,
        1,
        kernel(),
        parallel)

Figure 4.1: OpenCL example illustrating a parallel loop

Another example, on the left side of Figure 4.2, taken from Chapel, illustrates its forall data parallelism construct, which will be encoded with a SPIRE parallel loop.

forall i in 1..n do
  t[i] = 0;

forloop(i, 1, n, 1,
        t[i] = 0, parallel)

Figure 4.2: forall in Chapel and its SPIRE core language representation

The main difference between a parallel sequence and a parallel unstructured is that, in a parallel unstructured, synchronizations are explicit, using arcs that represent dependences between statements. An example of an unstructured code is illustrated in Figure 4.3, where the graph on the right side is a parallel unstructured representation of the code on the left side. Nodes are statements and edges are data dependences between them: for example, the statement u=x+1 cannot be launched before the statement x=3 is terminated. Figure 4.4, taken from OpenMP, illustrates its parallel sections task parallelism construct, which will be encoded with a SPIRE parallel sequence. We can say that a parallel sequence is only suitable for fork-join parallelism, whereas any other unstructured form of parallelism can be encoded using the control flow graph (unstructured) set.

x = 3;      // statement 1
y = 5;      // statement 2
z = x + y;  // statement 3
u = x + 1;  // statement 4

(Right side: a parallel control flow graph with nodes nop_entry, x=3, y=5, z=x+y, u=x+1 and nop_exit, whose edges are the data dependences between these statements.)

Figure 4.3: A C code and its unstructured parallel control flow graph representation

#pragma omp sections nowait
{
  #pragma omp section
  x = foo();
  #pragma omp section
  y = bar();
}
z = baz(x, y);

sequence(
  sequence(x = foo(), y = bar(),
           parallel),
  z = baz(x, y),
  sequential)

Figure 4.4: parallel sections in OpenMP and its SPIRE core language representation

Representing the presence of parallelism using only the annotation parallel constitutes a voluntary trade-off between conciseness and expressibility of representation. For example, representing the OpenCL code of Figure 4.1 in the way we suggest may induce a loss of precision regarding the use of queues of tasks adopted in OpenCL, and lead to errors; great care must be taken when analyzing programs to avoid errors in the translation between OpenCL and SPIRE, such as the scheduling of these tasks.


4.2.3 Synchronization

The issue of synchronization is a characteristic feature of the run-time behavior of one statement with respect to other statements. In parallel code, one usually distinguishes between two types of synchronization: (1) collective synchronization between threads, using barriers, and (2) point-to-point synchronization between participating threads. We suggest this can be done in two parts.

Collective Synchronization

The first possibility is via the new synchronization attribute, which can be added to the specification of statements. SPIRE extends sequential intermediate representations in a straightforward way by adding a synchronization attribute to the specification of statements:

statement' = statement x synchronization;

Coordination by synchronization in parallel programs is often dealt with via coding patterns such as barriers, used for instance when a code fragment contains many phases of parallel execution, where each phase should wait for the precedent ones before proceeding. We define the synchronization set via high-level coordination characteristics useful for optimization purposes:

synchronization =
    none:unit + spawn:entity + barrier:unit +
    single:bool + atomic:reference;

where S is the statement with the synchronization attribute

• none specifies the default behavior, i.e. independent with respect to other statements, for S;

• spawn induces the creation of an asynchronous task S, while the value of the corresponding entity is the user-chosen number of the thread that executes S. SPIRE thus, in addition to decorating parallel tasks using the parallel execution attribute to handle coarse grain parallelism, provides the possibility of specifying each task via the spawn synchronization attribute, in order to enable the encoding of a finer grain level of parallelism;

• barrier specifies that all the child threads spawned by the execution of S are suspended before exiting, until they are all finished. An OpenCL example illustrating spawn, to encode the clEnqueueTask instruction, and barrier, to encode the clEnqueueBarrier instruction, is provided in Figure 4.5;


mode = OUT_OF_ORDER_EXEC_MODE_ENABLE;
commands = clCreateCommandQueue(context,
                                device_id, mode, &err);
clEnqueueTask(commands, kernel_A, 0, NULL, NULL);
clEnqueueTask(commands, kernel_B, 0, NULL, NULL);
// synchronize so that Kernel C starts only
// after Kernels A and B have finished
clEnqueueBarrier(commands);
clEnqueueTask(commands, kernel_C, 0, NULL, NULL);

barrier(
  spawn(zero, kernel_A()),
  spawn(one, kernel_B())
);
spawn(zero, kernel_C())

Figure 4.5: OpenCL example illustrating spawn and barrier statements

• single ensures that S is executed by only one thread in its thread team (a thread team is the set of all the threads spawned within the innermost parallel forloop statement), and a barrier exists at the end of a single operation if its synchronization single value is true;

• atomic predicates the execution of S via the acquisition of a lock, to ensure exclusive access; at any given time, S can be executed by only one thread. Locks are logical memory addresses, represented here by a member of the PIPS IR reference set presented in Section 2.6.5. An example illustrating how an atomic synchronization on the reference l in a SPIRE statement accessing Array x can be translated in Cilk (via Cilk_lock and Cilk_unlock) and OpenMP (critical) is provided in Figure 4.6.

Cilk_lockvar l;
Cilk_lock_init(l);
Cilk_lock(l);
x[index[i]] += f(i);
Cilk_unlock(l);

#pragma omp critical
x[index[i]] += f(i);

Figure 4.6: Cilk and OpenMP examples illustrating an atomically-synchronized statement


Event API: Point-to-Point Synchronization

The second possibility addresses the case of point-to-point synchronization between participating threads. Handling point-to-point synchronization using decorations on abstract syntax trees is too constraining when one has to deal with a varying set of threads that may belong to different parallel parent nodes. Thus, SPIRE suggests to deal with this last class of coordination using a new class of values, of the event type. SPIRE extends the underlying type system of the existing sequential IRs with a new basic type, namely event:

type' = type + event:unit;

Values of type event are counters, in a manner reminiscent of semaphores. The programming interface for events is defined by the following atomic functions:

• event newEvent(int i) is the creation function of events, initialized with the integer i that specifies how many threads can execute wait on this event without being blocked;

• void freeEvent(event e) is the deallocation function of the event e;

• void signal(event e) increments by one the event value of e (in practice, the void return type will be replaced by int, to enable the handling of error values);

• void wait(event e) blocks the thread that calls it until the value of e is strictly greater than 0. When the thread is released, this value is decremented by one.
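As an illustration only, this counting behavior could be realized in a shared-memory setting with POSIX threads roughly as follows; SPIRE itself does not prescribe any implementation, and the names signal_event and wait_event are used here to avoid clashing with the C library signal.

  #include <pthread.h>
  #include <stdlib.h>

  typedef struct {                 /* counter protected by a mutex/condition pair */
    int value;
    pthread_mutex_t lock;
    pthread_cond_t cond;
  } event;

  event *newEvent(int i) {
    event *e = malloc(sizeof(event));
    e->value = i;
    pthread_mutex_init(&e->lock, NULL);
    pthread_cond_init(&e->cond, NULL);
    return e;
  }

  void signal_event(event *e) {    /* "signal": increment and wake waiters */
    pthread_mutex_lock(&e->lock);
    e->value++;
    pthread_cond_broadcast(&e->cond);
    pthread_mutex_unlock(&e->lock);
  }

  void wait_event(event *e) {      /* "wait": block until value > 0, then decrement */
    pthread_mutex_lock(&e->lock);
    while (e->value <= 0)
      pthread_cond_wait(&e->cond, &e->lock);
    e->value--;
    pthread_mutex_unlock(&e->lock);
  }

  void freeEvent(event *e) {
    pthread_mutex_destroy(&e->lock);
    pthread_cond_destroy(&e->cond);
    free(e);
  }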

In a first example of possible use of this event API, the construct future used in X10 (see Figure 4.7) can be seen as the spawning of the computation of foo(). The end result is obtained via the call to the force method; such a mechanism can be easily implemented in SPIRE using an event attached to the running task: it is signaled when the task is completed and waited by the force method.

future<int> Fi = future foo();
int i = Fi.force();

Figure 4.7: X10 example illustrating a future task and its synchronization

A second example, taken from Habanero-Java, illustrates how point-to-point synchronization primitives such as phasers and the next statement can be dealt with using the Event API (see Figure 4.8, left). The async phased keyword can be replaced by spawn. In this example, the next statement is equivalent to the following sequence:

signal(ph);
wait(ph);
signal(ph)

where the event ph is supposed initialized to newEvent(-(n-1)); the second signal is used to resume the suspended tasks in a chain-like fashion.

// Habanero-Java
finish {
  phaser ph = new phaser();
  for (j = 1; j <= n; j++)
    async phased(ph<SIG_WAIT>) {
      S;
      next;
      S';
    }
}

// SPIRE
barrier(
  ph = newEvent(-(n-1));
  j = 1;
  loop(j <= n,
       spawn(j,
             S;
             signal(ph);
             wait(ph);
             signal(ph);
             S');
       j = j+1);
  freeEvent(ph)
)

Figure 4.8: A phaser in Habanero-Java and its SPIRE core language representation

In order to illustrate the importance of the constructs implementing point-to-point synchronization, and thus the necessity to handle them, we show the code in Figure 4.9, extracted from [97], where phasers are used to implement one of the main types of parallelism, namely pipeline parallelism (see Section 2.2).

finish {
  phaser[] ph = new phaser[m+1];
  for (int i = 1; i < m; i++)
    async phased(ph[i]<SIG>, ph[i-1]<WAIT>) {
      for (int j = 1; j < n; j++) {
        a[i][j] = foo(a[i][j], a[i][j-1], a[i-1][j-1]);
        next;
      }
    }
}

Figure 4.9: Example of pipeline parallelism with phasers


Representing high-level point-to-point synchronization such as phasers and futures using low-level functions of event type constitutes a trade-off between generality and expressibility of representation: we can represent any type of synchronization via the low-level interface using events, but we lose some of the expressiveness and precision of these high-level constructs, such as the deadlock-freedom safety property of phasers [100]. Indeed, a Habanero-Java parallel code avoids deadlock since the scope of each used phaser is the immediately enclosing finish. However, the scope of the events is less restrictive: the user can put them wherever he wants, and this may lead to the appearance of deadlocks.

4.2.4 Data Distribution

The choice of a proper memory model to express parallel programs is an important issue when designing a generic intermediate representation. There are usually two main approaches to memory modeling: shared and message passing models. Since SPIRE is designed to extend existing IRs for sequential languages, it can be straightforwardly seen as using a shared memory model when parallel constructs are added. By convention, we say that spawn creates processes in the case of message passing memory models, and threads in the other case.

In order to take into account the explicit distribution required by the message passing memory model used in parallel languages such as MPI, SPIRE introduces the send and recv blocking functions for implementing communication between processes:

• void send(int dest, entity buf) transfers the value in Entity buf to the process numbered dest;

• void recv(int source, entity buf) receives in buf the value sent by Process source.

The MPI example in Figure 4.10 can be represented in SPIRE as a parallel loop with index rank of size iterations, whose body is the MPI code from MPI_Comm_size to MPI_Finalize. The communication of Variable sum from Process 1 to Process 0 can be handled with the SPIRE send/recv functions.

Note that non-blocking communications can be easily implemented in SPIRE using the above primitives within spawned statements (see Figures 4.11 and 4.12).

A non-blocking receive should indicate the completion of the communication. We may use, for instance, finish_recv (see Figure 4.12), which can be accessed later by the receiver to test the status of the communication.

Also, broadcast collective communications, such as those defined in MPI, can be represented via wrappers around send and recv operations.


MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if (rank == 0) {
  MPI_Recv(sum, sizeof(sum), MPI_FLOAT, 1, 1,
           MPI_COMM_WORLD, &stat);
} else if (rank == 1) {
  sum = 42;
  MPI_Send(sum, sizeof(sum), MPI_FLOAT, 0, 1,
           MPI_COMM_WORLD);
}
MPI_Finalize();

forloop(rank, 0, size, 1,
        if(rank==0,
           recv(one, sum),
           if(rank==1,
              sum = 42;
              send(zero, sum),
              nop)
        ),
        parallel)

Figure 4.10: MPI example illustrating a communication and its SPIRE core language representation

spawn(new, send(zero, sum))

Figure 4.11: SPIRE core language representation of a non-blocking send

atomic(finish_recv = false);
spawn(new,
      recv(one, sum);
      atomic(finish_recv = true)
)

Figure 4.12: SPIRE core language representation of a non-blocking receive

When the master process and the receiver processes want to perform a broadcast operation, then, if the current process is the master, its broadcast operation is equivalent to a loop over the receivers, with a call to send as body; otherwise (receiver), the broadcast is a recv function. Figure 4.13 illustrates the SPIRE representation of a broadcast collective communication of the value of sum.

Another interesting memory model for parallel programming has been introduced somewhat recently: the Partitioned Global Address Space [110].


if(rank==0,
   forloop(rank, 1, size, 1, send(rank, sum),
           parallel),
   recv(zero, sum))

Figure 4.13: SPIRE core language representation of a broadcast

The uses of the PGAS memory model in languages such as UPC [33], Habanero-Java, X10 and Chapel introduce various notions such as Place or Locale to label portions of a logically-shared memory that threads may access, in addition to complex APIs for distributing objects/arrays over these portions. Given the wide variety of current proposals, we leave the issue of integrating the PGAS model within the general methodology of SPIRE as future work.

4.3 SPIRE Operational Semantics

The purpose of the formal definition given in this section is to provide a solid basis for program analyses and transformations. It is a systematic way to specify our IR extension mechanism, something seldom present in IR definitions. It also illustrates how SPIRE leverages the syntactic and semantic levels of sequential constructs to parallel ones, preserving the traits of the sequential operational semantics.

Fundamentally, at the syntactic and semantic levels, SPIRE is a methodology for expressing representation transformers, mapping the definition of a sequential language IR to a parallel version. We define the operational semantics of SPIRE in a two-step fashion: we introduce (1) a minimal core parallel language that we use to model fundamental SPIRE concepts, and for which we provide a small-step operational semantics, and (2) rewriting rules that translate the more complex constructs of SPIRE into this core language.

4.3.1 Sequential Core Language

Illustrating the transformations induced by SPIRE requires the definition of a sequential IR basis, as is done above via the PIPS IR. Since we focus here on the fundamentals, we use as core language a simpler, minimal sequential language, Stmt. Its syntax is given in Figure 4.14, where we assume that the sets Ide of identifiers I and Exp of expressions E are given.

Sequential statements are: (1) nop for no operation, (2) I=E for an assignment of E to I, (3) S1;S2 for a sequence and (4) loop(E,S) for a while loop.


S ∈ Stmt =
    nop | I=E | S1;S2 | loop(E,S)

S ∈ SPIRE(Stmt) =
    nop | I=E | S1;S2 | loop(E,S) |
    spawn(I,S) |
    barrier(S) | barrier_wait(n) |
    wait(I) | signal(I) |
    send(I,I') | recv(I,I')

Figure 4.14: Stmt and SPIRE(Stmt) syntaxes

At the semantic level, a statement in Stmt is a very simple memory transformer. A memory m ∈ Memory is a mapping in Ide → Value, where values v ∈ Value = N + Bool can either be integers n ∈ N or booleans b ∈ Bool. The sequential operational semantics for Stmt, expressed as transition rules over configurations κ ∈ Configuration = Memory × Stmt, is given in Figure 4.15; we assume that the program is syntax- and type-correct. A transition (m, S) → (m′, S′) means that executing the statement S in a memory m yields a new memory m′ and a new statement S′; we posit that the "→" relation is transitive. Rules 4.1 to 4.5 encode typical sequential small-step operational semantic rules for the sequential part of the core language. We assume that ζ ∈ Exp → Memory → Value is the usual function for expression evaluation.

          v = ζ(E)m
  ─────────────────────────────                      (4.1)
  (m, I=E) → (m[I → v], nop)

  (m, nop; S) → (m, S)                               (4.2)

       (m, S1) → (m′, S1′)
  ─────────────────────────────                      (4.3)
  (m, S1; S2) → (m′, S1′; S2)

            ζ(E)m
  ───────────────────────────────────                (4.4)
  (m, loop(E,S)) → (m, S; loop(E,S))

           ¬ζ(E)m
  ─────────────────────────────                      (4.5)
  (m, loop(E,S)) → (m, nop)

Figure 4.15: Stmt sequential transition rules


The semantics of a whole sequential program S is defined as the memory m such that (⊥, S) → (m, nop), if the execution of S terminates.

4.3.2 SPIRE as a Language Transformer

Syntax

At the syntactic level, SPIRE specifies how a grammar for a sequential language such as Stmt is transformed, i.e. extended, with synchronized parallel statements. The grammar of SPIRE(Stmt) in Figure 4.14 adds to the sequential statements of Stmt (from now on synchronized using the default none) new parallel statements: a task creation spawn, a termination barrier, and two wait and signal operations on events, or send and recv operations for communication. The synchronizations single and atomic are defined via rewriting (see Subsection 4.3.3). The statement barrier_wait(n), added here for specifying the multiple-step behavior of the barrier statement in the semantics, is not accessible to the programmer. Figure 4.8 provides the SPIRE representation of a program example.

Semantic Domains

SPIRE is an intermediate representation transformer. As it extends grammars, it also extends semantics: the set of values manipulated by SPIRE(Stmt) statements extends the sequential Value domain with events e ∈ Event = N, that encode events' current values; we posit that ζ(newEvent(E))m = ζ(E)m.

Parallelism is managed in SPIRE via processes (or threads). We introduce control state functions π ∈ State = Proc → Configuration × Procs to keep track of the whole computation, mapping each process i ∈ Proc = N to its current configuration (i.e. the statement it executes and its own view of memory m) and to the set c ∈ Procs = P(Proc) of the children processes it has spawned during its execution.

In the following, we note domain(π) = {i ∈ Proc / π(i) is defined} the set of currently running processes, and π[i → (κ, c)] the state π extended at i with (κ, c). A process is said to be finished if and only if all its children processes in c are also finished, i.e. when only nop is left to execute: finished(π, c) = (∀i ∈ c, ∃c_i ∈ Procs, ∃m_i ∈ Memory, π(i) = ((m_i, nop), c_i) ∧ finished(π, c_i)).

Memory Models

The memory model of sequential languages is a unique address space for identifier values. In our parallel extension, a configuration for a given process or thread includes its view of memory. We suggest to use the same semantic rules, detailed below, to deal with both the shared and message passing memory models. The distinction between these models, beside the additional use of send/receive constructs in the message passing model versus events in the shared one, is included in SPIRE via constraints we impose on the control states π used in computations. Namely, we introduce the variable _pids ∈ Ide, such that m(_pids) ∈ P(Ide), that is used to harvest the process identities, and we posit that, in the shared memory model, for all threads i and i′ with π(i) = ((m, S), c) and π(i′) = ((m′, S′), c′), one has ∀I ∈ (domain(m) ∩ domain(m′)) − (m(_pids) ∪ m′(_pids)), m(I) = m′(I). No such constraint is needed for the message passing model. Regarding the notion of memory equality, note that the issue of private variables in threads would have to be introduced in full-fledged languages. As mentioned above, PGAS is left for future work; some sort of constraints, based on the characteristics of the address space partitioning for places/locales, would have to be introduced.

Semantic Rules

At the semantic level, SPIRE is thus a transition system transformer, mapping rules such as the ones in Figure 4.15 to parallel, synchronized transition rules in Figure 4.16. A transition π[i → ((m, S), c)] → π′[i → ((m′, S′), c′)] means that the i-th process, when executing S in a memory m, yields a new memory m′ and a new control state π′[i → ((m′, S′), c′)], in which this process now will execute S′; additional children processes may have been created in c′ compared to c. We posit that the "→" relation is transitive.

Rule 4.6 is a key rule to specify the SPIRE transformer behavior, providing a bridge between the sequential and the SPIRE-extended parallel semantics: all processes can non-deterministically proceed along their sequential semantics "→", leading to valid evaluation steps along the parallel semantics. The interleaving between parallel processes in SPIRE(Stmt) is a consequence of (1) the non-deterministic choice of the value of i within domain(π) when selecting the transition to perform, and (2) the number of steps executed by the sequential semantics. Note that one might want to add even more non-determinism in our semantics: indeed, Rule 4.1 is atomic; loading the variables in E and performing the store operation to I are performed in one sequential step. To simplify this presentation, we do not provide the simple intermediate steps in the sequential evaluation semantics of Rule 4.1 that would have removed this artificial atomicity.

The remaining rules focus on parallel evaluation. In Rule 4.7, if Process n does not already exist, spawn adds to the state a new process n that inherits the parent memory m, in a fork-like manner; the set of processes spawned by n is initially equal to ∅. Otherwise, n keeps its memory m_n and its set of spawned processes c_n. In both cases, the value of _pids in the memory of n is updated by the identifier I; n executes S, and is added to the set of processes c spawned by i.


             κ → κ′
  ──────────────────────────────────                              (4.6)
  π[i → (κ, c)] → π[i → (κ′, c)]

  n = ζ(I)m
  ((m_n, S_n), c_n) = if n ∈ domain(π) then π(n) else ((m, S), ∅)
  m′_n = m_n[_pids → m_n(_pids) ∪ {I}]
  ──────────────────────────────────────────────────────────────  (4.7)
  π[i → ((m, spawn(I,S)), c)] →
        π[i → ((m, nop), c ∪ {n})][n → ((m′_n, S), c_n)]

              n ∉ domain(π) ∪ {i}
  ──────────────────────────────────────────────────────────────  (4.8)
  π[i → ((m, barrier(S)), c)] →
        π[i → ((m, barrier_wait(n)), c)][n → ((m, S), ∅)]

  finished(π, {n})        π(n) = ((m′, nop), c′)
  ──────────────────────────────────────────────────────────────  (4.9)
  π[i → ((m, barrier_wait(n)), c)] → π[i → ((m′, nop), c)]

           n = ζ(I)m        n > 0
  ──────────────────────────────────────────────────────────────  (4.10)
  π[i → ((m, wait(I)), c)] → π[i → ((m[I → n−1], nop), c)]

                 n = ζ(I)m
  ──────────────────────────────────────────────────────────────  (4.11)
  π[i → ((m, signal(I)), c)] → π[i → ((m[I → n+1], nop), c)]

  p′ = ζ(P′)m        p = ζ(P)m′
  ──────────────────────────────────────────────────────────────  (4.12)
  π[p → ((m, send(P′, I)), c)][p′ → ((m′, recv(P, I′)), c′)] →
        π[p → ((m, nop), c)][p′ → ((m′[I′ → m(I)], nop), c′)]

Figure 4.16: SPIRE(Stmt) synchronized transition rules

Rule 4.8 implements a rendezvous: a new process n executes S, while process i is suspended as long as finished is not true; indeed, Rule 4.9 resumes the execution of process i when all the child processes spawned by n have finished. In Rules 4.10 and 4.11, I is an event, that is a counting variable used to control access to a resource or to perform a point-to-point synchronization, initialized via newEvent to a value equal to the number of processes that will be granted access to it. Its current value n is decremented every time a wait(I) statement is executed and, when this value n satisfies n > 0, the resource can be used or the barrier can be crossed. In Rule 4.11, the current value n of I is incremented; this is a non-blocking operation. In Rule 4.12, p and p′ are two processes that communicate: p sends the datum I to p′, while the latter consumes it in I′.

The semantics of a whole parallel program S is defined as the set of memories m such that ⊥[0 → ((⊥, S), ∅)] → π[0 → ((m, nop), c)], if S terminates.

4.3.3 Rewriting Rules

The SPIRE concepts not dealt with in the previous section are defined via their rewriting into the core language. This is the case for both the treatment of the execution attribute and the remaining coarse-grain synchronization constructs.

Execution

A parallel sequence of statements S1 and S2 is a pair of independent substatements executed simultaneously by spawned processes I1 and I2, respectively, i.e. it is equivalent to:

spawn(I1, S1); spawn(I2, S2)

A parallel forloop (see Figure 4.2) with index I, lower expression low, upper expression up, step expression step and body S is equivalent to:

I = low;
loop(I <= up,
     spawn(I, S);
     I = I + step)

A parallel unstructured is rewritten as follows. All control nodes present in the transitive closure of the successor relation are rewritten in the same manner. Each control node C is characterized by a statement S, a predecessor list ps and a successor list ss. For each edge (c,C), where c is a predecessor of C in ps, an event e_cC, initialized at newEvent(0), is created, and similarly for ss. The whole unstructured construct is replaced by a sequential sequence of spawn(I, Sc), one for each C of the transitive closure of the successor relation, starting at the entry control node, where Sc is defined as follows:

barrier(spawn(one, wait(e_{ps[1],C})),
        ...,
        spawn(m, wait(e_{ps[m],C})));
S;
signal(e_{C,ss[1]}); ...; signal(e_{C,ss[m′]})

where m and m′ are the lengths of the ps and ss lists; L[j] is the j-th element of L.

The code corresponding to the statement z=x+y, taken from the parallel unstructured graph shown in Figure 4.3, is rewritten as:

spawn(three,
      barrier(spawn(one, wait(e_{1,3})),
              spawn(two, wait(e_{2,3})));
      z = x + y;                /* statement 3 */
      signal(e_{3,exit})
     )

Synchronization

A statement S with synchronization atomic(I) is rewritten as:

wait(I); S; signal(I)

assuming that the assignment I = newEvent(1) is performed on the event identifier I at the very beginning of the main function. A wait on an event variable sets it to zero if it is currently equal to one, to prohibit other threads from entering the atomic section; the signal resets the event variable to one to permit further access.

A statement S with a blocking synchronization single, i.e., equal to true, is equivalent, when it occurs within an enclosing innermost parallel forloop, to:

barrier(wait(I_S);
        if(first_S,
           S; first_S = false,
           nop);
        signal(I_S))

where first_S is a boolean variable that ensures that only one process among those spawned by the parallel loop will execute S; access to this variable is protected by the event I_S. Both first_S and I_S are respectively initialized, before loop entry, to true and newEvent(1).

The conditional if(E,S,S′) can be rewritten using the core loop construct, as illustrated in Figure 4.17. The same rewriting can be used when the single synchronization is equal to false, corresponding to a non-blocking synchronization construct, except that no barrier is needed.

if(E, S, S′)   ≡   stop = false;
                   loop(E ∧ ¬stop, S; stop = true);
                   loop(¬E ∧ ¬stop, S′; stop = true)

Figure 4.17: if statement rewriting using while loops

4.4 Validation: SPIRE Application to LLVM IR

Assessing the quality of a methodology that impacts the definition of a data structure as central for compilation frameworks as an intermediate representation is a difficult task. This section illustrates how it can be used on a different IR, namely that of LLVM, with minimal changes, thus providing support regarding the generality of our methodology. We show how simple it is to upgrade the LLVM IR to a "parallel LLVM IR" via the SPIRE transformation methodology.

LLVM [75] (Low-Level Virtual Machine) is an open source compilation framework that uses an intermediate representation in Static Single Assignment (SSA) [37] form. We chose the IR of LLVM to illustrate our approach a second time because LLVM is widely used in both academia and industry; another interesting feature of LLVM IR, compared to PIPS's, is that it sports a graph approach, while PIPS is mostly abstract syntax tree-based: each function is structured in LLVM as a control flow graph (CFG).

Figure 4.18 provides the definition of a significant subset of the sequential LLVM IR described in [75] (to keep notations simple in this thesis, we use the same Newgen language to write this specification):

• a function is a list of basic blocks, which are portions of code with one entry and one exit point;

• a basic block has an entry label, a list of φ nodes and a list of instructions, and ends with a terminator instruction;

• φ nodes, which are the key elements of SSA, are used to merge the values coming from multiple basic blocks. A φ node is an assignment (represented here as a call expression) that takes as arguments an identifier and a list of pairs (value, label); it assigns to the identifier the value corresponding to the label of the block preceding the current one at run time;

• every basic block ends with a terminator, which is a control flow-altering instruction that specifies which block to execute after termination of the current one.

An example of the LLVM encoding of a simple for loop in three basic blocks is provided in Figure 4.19, where, in the first assignment, which uses a φ node, sum is 42 if it is the first time we enter the bb1 block (from entry), or its updated value otherwise (from the branch bb).

Applying SPIRE to LLVM IR is, as illustrated above with PIPS, achieved in three steps, yielding the SPIREd parallel extension of the LLVM sequential IR provided in Figure 4.20:

• an execution attribute is added to function and block: a parallel basic block sees all its instructions launched in parallel, while all the blocks of a parallel function are seen as parallel tasks to be executed concurrently;


function             = blocks:block*
block                = label:entity x phi_nodes:phi_node* x
                       instructions:instruction* x
                       terminator
phi_node             = call
instruction          = call
terminator           = conditional_branch +
                       unconditional_branch +
                       return
conditional_branch   = value:entity x label_true:entity x
                       label_false:entity
unconditional_branch = label:entity
return               = value:entity

Figure 4.18: Simplified Newgen definitions of the LLVM IR

sum = 42;
for(i = 0; i < 10; i++)
  sum = sum + 2;

entry:
  br label %bb1

bb:                       ; preds = %bb1
  %0 = add nsw i32 %sum0, 2
  %1 = add nsw i32 %i0, 1
  br label %bb1

bb1:                      ; preds = %bb, %entry
  %sum0 = phi i32 [42, %entry], [%0, %bb]
  %i0 = phi i32 [0, %entry], [%1, %bb]
  %2 = icmp sle i32 %i0, 10
  br i1 %2, label %bb, label %bb2

Figure 4.19: A loop in C and its LLVM intermediate representation

• a synchronization attribute is added to instruction; therefore, an instruction can be annotated with spawn, barrier, single or atomic synchronization attributes. When one wants to consider a sequence of instructions as a whole, this sequence is first outlined in a "function", to be called instead; this new call instruction is then annotated with the proper synchronization attribute, such as spawn if the sequence must be considered as an asynchronous task; we provide an example in Figure 4.21. A similar technique is used for the other synchronization constructs barrier, single and atomic;

• as LLVM provides a set of intrinsic functions [75], the SPIRE functions newEvent, freeEvent, signal and wait, for handling point-to-point synchronization, and send and recv, for handling data distribution, are added to this set.

Note that the use of SPIRE on the LLVM IR is not able to express


function′    = function x execution
block′       = block x execution
instruction′ = instruction x synchronization

Intrinsic functions:
send, recv, signal, wait, newEvent, freeEvent

Figure 4.20: SPIRE(LLVM IR)

main() {
  int A, B, C;

  A = foo();
  B = bar(A);
  C = baz();
}

call_AB(int *A, int *B) {
  *A = foo();
  *B = bar(*A);
}

main() {
  int A, B, C;

  spawn(zero, call_AB(&A, &B));
  C = baz();
}

Figure 4.21: An example of a spawned outlined sequence

parallel loops as easily as was the case on PIPS IR. Indeed, the notion of a loop does not always exist in the definition of intermediate representations based on control flow graphs, including LLVM: it is an attribute of some of its nodes, which has to be added later on by a loop-detection program analysis phase. Of course, such an analysis could also be applied on the SPIRE-derived IR to recover this information. Once one knows that a particular loop is parallel, this can be encoded within SPIRE using the same technique as presented in Section 4.3.3.

More generally, even though SPIRE uses traditional parallel paradigms for code generation purposes, SPIRE-derived IRs are able to deal with more specific parallel constructs such as DOACROSS loops [56] or HELIX-like [29] approaches. HELIX is a technique to parallelize a loop in the presence of loop-carried dependences by executing sequential segments, exploiting TLP between them and using prefetching of signal synchronization. Basically, a compiler would parse a given sequential program into sequential IR elements. Optimization compilation phases specific to particular parallel code generation paradigms, such as those above, will translate, whenever possible (specific data and control-flow analyses will be needed here), these sequential IR constructs into parallel loops with the corresponding synchronization primitives, as need be. Code generation will then recognize such IR patterns and generate specific parallel instructions such as DOACROSS.


4.5 Related Work: Parallel Intermediate Representations

In this section, we review several different possible representations of parallel programs, both at the high, syntactic, and mid, graph-based, levels. We provide synthetic descriptions of the key existing IRs addressing issues similar to those of this thesis. Bird's eye view comparisons with SPIRE are also given here.

Syntactic approaches to parallelism expression use abstract syntax tree nodes, while adding specific built-in functions for parallelism. For instance, the intermediate representation of the implementation of OpenMP in GCC (GOMP) [84] extends its three-address representation, GIMPLE [77]. The OpenMP parallel directives are replaced by specific built-ins in low- and high-level GIMPLE, and by additional nodes in high-level GIMPLE, such as the sync_fetch_and_add built-in function for an atomic memory access addition. Similarly, Sarkar and Zhao introduce the high-level parallel intermediate representation HPIR [111], which decomposes Habanero-Java programs into region syntax trees, while maintaining additional data structures on the side: region control-flow graphs and region dictionaries, which contain information about the use and the definition of variables in region nodes. New abstract syntax tree nodes are introduced: AsyncRegionEntry and AsyncRegionExit are used to delimit tasks, while FinishRegionEntry and FinishRegionExit can be used in parallel sections. SPIRE borrows some of the ideas used in GOMP or HPIR, but frames them in more structured settings while trying to be more language-neutral. In particular, we try to minimize the number of additional built-in functions, which have the drawback of hiding the abstract high-level structure of parallelism and affecting compiler optimization passes. Moreover, we focus on extending existing AST nodes rather than adding new ones (such as in HPIR), in order not to fatten the IR and to avoid redundant analyses and transformations on the same basic constructs. Applying the SPIRE approach to systems such as GCC would have provided a minimal set of extensions that could also have been used for other implementations of parallel languages that rely on GCC as a backend, such as Cilk.

PLASMA is a programming framework for heterogeneous SIMD systems, with an IR [89] that abstracts data parallelism and vector instructions. It provides specific operators, such as add on vectors, and special instructions, such as reduce and par. While PLASMA abstracts SIMD implementation and compilation concepts for SIMD accelerators, SPIRE is more architecture-independent and also covers control parallelism.

InsPIRe is the parallel intermediate representation at the core of the source-to-source Insieme compiler [57] for C, C++, OpenMP, MPI and OpenCL programs. Parallel constructs are encoded using built-ins. SPIRE


intends to also cover source-to-source optimization. We believe it could have been applied to Insieme sequential components, parallel constructs being defined as extensions of the sequential abstract syntax tree nodes of InsPIRe, instead of using numerous built-ins.

Turning now to mid-level intermediate representations, many systems rely on graph structures for representing sequential code and extend them for parallelism. The Hierarchical Task Graph [48] represents the program control flow. The hierarchy exposes the loop nesting structure: at each loop nesting level, the loop body is hierarchically represented as a single node that embeds a subgraph that has control and data dependence information associated with it. SPIRE is able to represent both structured and unstructured control-flow dependence, thus enabling recursively-defined optimization techniques to be applied easily. The hierarchical nature of underlying sequential IRs can be leveraged, via SPIRE, to their parallel extensions; this feature is used in the PIPS case addressed below.

A stream graph [31] is a dataflow representation introduced specifically for streaming languages. Nodes represent data reorganization and processing operations between streams, and edges, communications between nodes. The number of data samples defined and used by each node is supposed to be known statically. Each time a node is fired, it consumes a fixed number of elements of its inputs and produces a fixed number of elements on its outputs. SPIRE provides support for both data and control dependence information; streaming can be handled in SPIRE using its point-to-point synchronization primitives.

The parallel program graph (PPDG) [98] extends the program dependence graph [44], where vertices represent blocks of statements and edges, essential control or data dependences; mgoto control edges are added to represent task creation occurrences, and synchronization edges, to impose ordering on tasks. Kimble IR [24] uses an intermediate representation designed along the same lines, i.e., as a hierarchical directed acyclic graph (DAG) on top of the GCC IR GIMPLE. Parallelism is expressed there using new types of nodes: region, which is a subgraph of clusters, and cluster, a sequence of statements to be executed by one thread. Like PPDG and Kimble IR, SPIRE adopts an extension approach to "parallelize" existing sequential intermediate representations; this chapter showed that this can be defined as a general mechanism for parallel IR definitions and provides a formal specification of this concept.

LLVM IR [75] represents each function as a control flow graph. To encode parallel constructs, LLVM introduces the notion of metadata, such as llvm.loop.parallel for implementing parallel loops. A metadata is a string used as an annotation on LLVM IR nodes. LLVM IR lacks support for other parallel constructs such as starting parallel threads, synchronizing them, etc. We presented in Section 4.4 our proposal of a parallel IR for LLVM, via the application of SPIRE to LLVM.


4.6 Conclusion

SPIRE is a new and general 3-step extension methodology for mapping any intermediate representation (IR) used in compilation platforms for representing sequential programming constructs to a parallel IR; one can leverage it for the source-to-source and high- to mid-level optimization of control-parallel languages and constructs.

The extension of an existing IR introduces (1) a parallel execution attribute for each group of statements, (2) a high-level synchronization attribute on each statement node and an API for low-level synchronization events, and (3) two built-ins for implementing communications in message-passing memory systems. The formal semantics of SPIRE transformational definitions is specified using a two-tiered approach: a small-step operational semantics for its base parallel concepts and a rewriting mechanism for high-level constructs.

The SPIRE methodology is presented via a use case, the intermediate representation of PIPS, a powerful source-to-source compilation infrastructure for Fortran and C. We illustrate the generality of our approach by showing how SPIRE can be used to represent the constructs of the current parallel languages Cilk, Chapel, X10, Habanero-Java, OpenMP, OpenCL and MPI. To validate SPIRE, we show the application of SPIRE on another IR, namely the one of the widely-used LLVM compilation infrastructure.

The extension of the sequential IR of PIPS using SPIRE has been implemented in the PIPS middle-end and used for parallel code transformations and generation (see Chapter 7). We added the execution set (parallel and sequential attributes), the synchronization set (none, spawn, barrier, atomic and single attributes), events and data distribution calls. Chapter 8 provides performance data gathered using our implementation of SPIRE on PIPS IR, to illustrate the ability of the resulting parallel IR to efficiently express the parallelism present in scientific applications. The next chapter addresses the crucial step in any parallelization process, which is scheduling.

An earlier version of the work presented in this chapter was presented at CPC [67] and at the HiPEAC Computing Systems Week.

Chapter 5

BDSC: A Memory-Constrained, Number of Processor-Bounded Extension of DSC

Life is too unpredictable to live by a schedule. Anonymous

In the previous chapter, we have introduced an extension methodology for sequential intermediate representations to handle parallelism. In this chapter, as we are interested in this thesis in task parallelism, we focus on the next step: finding or extracting this parallelism from a sequential code, to be eventually expressed using a SPIRE-derived parallel intermediate representation. To reach that goal, we introduce BDSC, which extends Yang and Gerasoulis's Dominant Sequence Clustering (DSC) algorithm. BDSC is a new efficient automatic scheduling algorithm for parallel programs in the presence of resource constraints on the number of processors and their local memory size; it is based on the static estimation of both execution time and communication cost. Thanks to the handling of these two resource parameters in a single framework, BDSC can address both shared and distributed parallel memory architectures. As a whole, BDSC provides a trade-off between concurrency gain and communication overhead between processors, under two resource constraints: finite number of processors and amount of memory.

5.1 Introduction

One key problem when attempting to parallelize sequential programs is to find solutions to graph partitioning and scheduling problems, where vertices represent computational tasks and edges, data exchanges. Each vertex is labeled with an estimation of the time taken by the computation it performs; similarly, each edge is assigned a cost that reflects the amount of data that needs to be exchanged between its adjacent vertices. Efficiency is strongly dependent here on the accuracy of the cost information encoded in the graph to be scheduled. Gathering such information is a difficult process


in general, in particular in our case, where tasks are automatically generated from program code.

We thus introduce a new non-preemptive static scheduling heuristic that strives to give as small as possible schedule lengths, i.e., parallel execution time, in order to exploit the task-level parallelism possibly present in sequential programs, targeting homogeneous multiprocessors or multicores, while enforcing architecture-dependent constraints: the number of processors, the computational cost and memory use of each task, and the communication cost for each task exchange, to provide hopefully significant speedups on shared and distributed computer architectures. Our technique, called BDSC, is based on an existing best-of-breed static scheduling heuristic, namely Yang and Gerasoulis's DSC (Dominant Sequence Clustering) list-scheduling algorithm [108] [47], which we equip with new heuristics that handle resource constraints. One key advantage of DSC over other scheduling policies (see Section 5.4), besides its already good performance when the number of processors is unlimited, is that it has been proven optimal for fork/join graphs; this is a serious asset given our focus on the program parallelization process, since task graphs representing parallel programs often use this particular graph pattern. Even though this property may be lost when constraints are taken into account, our experiments on scientific benchmarks suggest that our extension still provides good performance (see Chapter 8).

This chapter is organized as follows. Section 5.2 presents the original DSC algorithm that we intend to extend. We detail our heuristic extension, BDSC, in Section 5.3. Section 5.4 compares the main existing scheduling algorithms with BDSC. Finally, Section 5.5 concludes the chapter.

5.2 The DSC List-Scheduling Algorithm

In this section, we present the list-scheduling heuristic called DSC [108], which we extend and enhance to make it useful for automatic task parallelization under resource constraints.

5.2.1 The DSC Algorithm

DSC (Dominant Sequence Clustering) is a task list-scheduling heuristic for an unbounded number of processors. The objective is to minimize the top level of each task (see Section 2.5.2 for a review of the main notions of list scheduling). A DS (Dominant Sequence) is a path that has the longest length in a partially scheduled DAG; a graph critical path is thus a DS for the totally scheduled DAG. The DSC heuristic computes a Dominant Sequence (DS) after each vertex is processed, using tlevel(τ,G) + blevel(τ,G) as priority(τ). Since priorities are updated after each iteration, DSC computes dynamically the critical path, based on both top level and bottom level


information. A ready vertex τ, i.e., a vertex all of whose predecessors have already been scheduled, on one of the current DSs, i.e., with the highest priority, is clustered with a predecessor τp when this reduces the top level of τ by zeroing, i.e., setting to zero, the cost of communications on the incident edge (τp, τ). As said in Section 2.5.2, task_time(τ) and edge_cost(e) are assumed to be numerical constants, although we show how we lift this restriction in Section 6.3.

The DSC instance of the general list-scheduling Algorithm 3, presented in Section 2.5.2, is provided in Algorithm 4, where select_cluster is replaced by the code in Algorithm 5 (new_cluster extends clusters with a new empty cluster; its cluster_time is set to 0). The table in Figure 5.1 represents the result of scheduling the DAG in the same figure using the DSC algorithm.

ALGORITHM 4: DSC scheduling of Graph G on P processors

procedure DSC(G, P)
  clusters = ∅
  foreach τi ∈ vertices(G)
    priority(τi) = tlevel(τi, G) + blevel(τi, G)
  UT = vertices(G)                      // unscheduled tasks
  while UT ≠ ∅
    τ = select_task_with_highest_priority(UT)
    κ = select_cluster(τ, G, P, clusters)
    allocate_task_to_cluster(τ, κ, G)
    update_graph(G)
    update_priority_values(G)
    UT = UT - {τ}
end

ALGORITHM 5: DSC cluster selection for Task τ for Graph G on P processors

function select_cluster(τ, G, P, clusters)
  (min_τ, min_tlevel) = tlevel_decrease(τ, G)
  return (cluster(min_τ) ≠ cluster_undefined) ?
         cluster(min_τ) : new_cluster(clusters)
end

To decide which predecessor τp to select to be clustered with τ, DSC applies the minimization procedure tlevel_decrease, presented in Algorithm 6, which returns the predecessor that leads to the highest reduction of top level for τ if clustered together, and the resulting top level; if no zeroing is accepted, the vertex τ is kept in a new single-vertex cluster. More precisely, the minimization procedure tlevel_decrease, for a task τ in Algorithm 6, tries to find the cluster cluster(min_τ) of one of its predecessors


[Figure 5.1: A Directed Acyclic Graph (left) and its scheduling (right); starred top levels (*) correspond to the selected clusters. The DAG contains four tasks, of task times τ1:1, τ2:3, τ3:2 and τ4:2, with edge costs 2, 1 and 1. The scheduling table gives, for each step, the selected task, its tlevel, blevel and DS, and its scheduled tlevel in clusters κ0, κ1, κ2: step 1: τ4 (tlevel 0, blevel 7, DS 7), scheduled tlevel 0; step 2: τ3 (3, 2, 5), scheduled tlevels 2 and 3; step 3: τ1 (0, 5, 5), scheduled tlevel 0; step 4: τ2 (4, 3, 7), scheduled tlevels 2 and 4.]

τp that reduces the top level of τ as much as possible, by zeroing the cost of the edge (min_τ, τ). All clusters start at the same time, and each cluster is characterized by its running time, cluster_time(κ), which is the cumulated time of all tasks scheduled into κ; idle slots within clusters may exist and are also taken into account in this accumulation process. The condition cluster(τp) ≠ cluster_undefined is tested on predecessors of τ in order to make it possible to apply this procedure to ready and unready vertices; an unready vertex has at least one unscheduled predecessor.
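To make the priority computation used by DSC concrete, the following minimal C sketch (an illustration written for this section, not part of PIPS) computes top levels, bottom levels and priorities for the DAG of Figure 5.1. The edge set used below, τ1→τ2 (cost 1), τ4→τ2 (cost 2) and τ4→τ3 (cost 1), is our reading of the figure, consistent with the tlevel and blevel values listed in its scheduling table.

#include <stdio.h>

#define N 4                                   /* tasks tau_1 .. tau_4          */

static int task_time[N] = {1, 3, 2, 2};       /* costs of tau_1 .. tau_4       */

/* edge_cost[p][s] > 0 iff there is an edge tau_{p+1} -> tau_{s+1}             */
static int edge_cost[N][N] = {
  {0, 1, 0, 0},                               /* tau_1 -> tau_2, cost 1        */
  {0, 0, 0, 0},
  {0, 0, 0, 0},
  {0, 2, 1, 0},                               /* tau_4 -> tau_2 (2), -> tau_3 (1) */
};

/* length of the longest path arriving at t, excluding t itself */
static int tlevel(int t) {
  int best = 0;
  for (int p = 0; p < N; p++)
    if (edge_cost[p][t] > 0) {
      int v = tlevel(p) + task_time[p] + edge_cost[p][t];
      if (v > best) best = v;
    }
  return best;
}

/* length of the longest path leaving t, including t itself */
static int blevel(int t) {
  int best = 0;
  for (int s = 0; s < N; s++)
    if (edge_cost[t][s] > 0) {
      int v = edge_cost[t][s] + blevel(s);
      if (v > best) best = v;
    }
  return task_time[t] + best;
}

int main(void) {
  for (int t = 0; t < N; t++)
    printf("tau_%d: tlevel=%d blevel=%d priority=%d\n",
           t + 1, tlevel(t), blevel(t), tlevel(t) + blevel(t));
  return 0;
}

It prints priorities 5, 7, 5 and 7: τ2 and τ4, with priority 7, lie on the Dominant Sequence, matching the DS column of the table in Figure 5.1.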

ALGORITHM 6: Minimization DSC procedure for Task τ in Graph G

function tlevel_decrease(τ, G)
  min_tlevel = tlevel(τ, G)
  min_τ = τ
  foreach τp ∈ predecessors(τ, G) where cluster(τp) ≠ cluster_undefined
    start_time = cluster_time(cluster(τp))
    foreach τ′p ∈ predecessors(τ, G) where cluster(τ′p) ≠ cluster_undefined
      if (τp ≠ τ′p) then
        level = tlevel(τ′p, G) + task_time(τ′p) + edge_cost(τ′p, τ)
        start_time = max(level, start_time)
    if (min_tlevel > start_time) then
      min_tlevel = start_time
      min_τ = τp
  return (min_τ, min_tlevel)
end


5.2.2 Dominant Sequence Length Reduction Warranty

Dominant Sequence Length Reduction Warranty (DSRW) is a greedy heuristic within DSC that aims to further reduce the scheduling length. A vertex on the DS path with the highest priority can be ready or not ready. With the DSRW heuristic, DSC schedules the ready vertices first but, if such a ready vertex τr is not on the DS path, DSRW verifies, using the procedure in Algorithm 7, that the corresponding zeroing does not affect later the reduction of the top levels of the DS vertices τu that are partially ready, i.e., such that there exists at least one unscheduled predecessor of τu. To do this, we check if the "partial top level" of τu, which does not take into account unexamined (unscheduled) predecessors and is computed using tlevel_decrease, is reducible once τr is scheduled.

ALGORITHM 7: DSRW optimization for Task τu when scheduling Task τr for Graph G

function DSRW(τr, τu, clusters, G)
  (min_τ, min_tlevel) = tlevel_decrease(τr, G)
  // before scheduling τr
  (τb, ptlevel_before) = tlevel_decrease(τu, G)
  // scheduling τr
  allocate_task_to_cluster(τr, cluster(min_τ), G)
  saved_edge_cost = edge_cost(min_τ, τr)
  edge_cost(min_τ, τr) = 0
  // after scheduling τr
  (τa, ptlevel_after) = tlevel_decrease(τu, G)
  if (ptlevel_after > ptlevel_before) then
    // (min_τ, τr) zeroing not accepted
    edge_cost(min_τ, τr) = saved_edge_cost
    return false
  return true
end

The table in Figure 5.1 illustrates an example where it is useful to apply the DSRW optimization. There, the DS column provides, for the task scheduled at each step, its priority, i.e., the length of its dominant sequence, while the last column represents, for each possible zeroing, the corresponding task top level; starred top levels (*) correspond to the selected clusters. Task τ4 is mapped to Cluster κ0 in the first step of DSC. Then τ3 is selected because it is the ready task with the highest priority. The mapping of τ3 to Cluster κ0 would reduce its top level from 3 to 2. But the zeroing


Without DSRW:          With DSRW:
κ0: τ4, τ3             κ0: τ4, τ2
κ1: τ1, τ2             κ1: τ1
                       κ2: τ3

Figure 5.2: Result of DSC on the graph in Figure 5.1, without (left) and with (right) DSRW

of (τ4, τ3) affects the top level of τ2, τ2 being the unready task with the highest priority. Since the partial top level of τ2 is 2 with the zeroing of (τ4, τ2), but 4 after the zeroing of (τ4, τ3), DSRW will fail, and DSC allocates τ3 to a new cluster, κ1. Then τ1 is allocated to a new cluster, κ2, since it has no predecessors. Thus the zeroing of (τ4, τ2) is kept thanks to the DSRW optimization: the total scheduling length is 5 (with DSRW) instead of 7 (without DSRW) (Figure 5.2).
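The two partial top levels of τ2 compared by DSRW in this example can be recomputed with a few lines of C (illustrative arithmetic only, using the task times and edge costs of Figure 5.1; variable names are ours):

#include <stdio.h>

/* Figure 5.1 data: task_time(tau_4) = 2, task_time(tau_3) = 2,
   edge_cost(tau_4, tau_2) = 2; tau_4 is scheduled at time 0 in kappa_0.      */
int main(void) {
  int t4 = 2, t3 = 2, e42 = 2;

  /* zeroing (tau_4, tau_2): tau_2 would start right after tau_4 in kappa_0   */
  int ptlevel_zero_42 = 0 + t4 + 0;                          /* = 2           */

  /* zeroing (tau_4, tau_3) instead: tau_2 starts either after kappa_0 =
     {tau_4, tau_3} finishes, or after paying the edge (tau_4, tau_2)         */
  int after_k0 = t4 + t3, after_edge = t4 + e42;             /* 4 and 4       */
  int ptlevel_zero_43 = after_k0 < after_edge ? after_k0 : after_edge;

  printf("%d vs %d\n", ptlevel_zero_42, ptlevel_zero_43);    /* prints 2 vs 4 */
  return 0;    /* 4 > 2: DSRW rejects the (tau_4, tau_3) zeroing              */
}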

5.3 BDSC: A Resource-Constrained Extension of DSC

This section details the key ideas at the core of our new scheduling process, BDSC, which extends DSC with a number of important features, namely (1) by verifying predefined memory constraints, (2) by targeting a bounded number of processors, and (3) by trying to make this number as small as possible.

5.3.1 DSC Weaknesses

A good scheduling solution is a solution that is built carefully, by having knowledge about previously scheduled tasks and tasks to execute in the future. Yet, as stated in [72], "an algorithm that only considers bottom level or only top level cannot guarantee optimal solutions". Even though tests carried out on a variety of scheduling algorithms show that the best competitor among them is the DSC algorithm [101], and DSC is a policy that uses the critical path for computing dynamic priorities, based on both the bottom level and the top level of each vertex, it has some limits in practice.

The key weakness of DSC for our purpose is that the number of processors cannot be predefined: DSC yields blind clusterings, disregarding resource issues. Therefore, in practice, a thresholding mechanism to limit the number of generated clusters should be introduced. When allocating new clusters, one should verify that the number of clusters does not exceed a predefined threshold P (Section 5.3.3). Also, zeroings should handle memory constraints, by verifying that the resulting clustering does not lead to cluster data sizes that exceed a predefined cluster memory threshold M (Section 5.3.3).


Finally, DSC may generate a lot of idle slots in the created clusters. It adds a new cluster when no zeroing is accepted, without verifying the possible existence of gaps in existing clusters. We handle this case in Section 5.3.4, adding an efficient idle cluster slot allocation routine in the task-to-cluster mapping process.

5.3.2 Resource Modeling

Since our extension deals with computer resources, we assume that each vertex in a task DAG is labeled with an additional information, task_data(τ), which is an over-approximation of the memory space used by Task τ; its size is assumed to be always strictly less than M. A similar cluster_data function applies to clusters, where it represents the collective data space used by the tasks scheduled within them. Since BDSC, like DSC, needs execution times and communication costs to be numerical constants, we discuss in Section 6.3 how this information is computed.

Our improvement to the DSC heuristic intends to reach a trade-off between the gained parallelism and the communication overhead between processors, under two resource constraints: finite number of processors and amount of memory. We track these resources in our implementation of allocate_task_to_cluster, given in Algorithm 8; note that the aggregation function regions_union is defined in Section 2.6.3.

ALGORITHM 8: Task allocation of Task τ in Graph G to Cluster κ with resource management

procedure allocate_task_to_cluster(τ, κ, G)
  cluster(τ) = κ
  cluster_time(κ) = max(cluster_time(κ), tlevel(τ, G)) + task_time(τ)
  cluster_data(κ) = regions_union(cluster_data(κ), task_data(τ))
end

Efficiently allocating tasks on the target architecture requires reducing the communication overhead and transfer cost for both shared and distributed memory architectures. If zeroing operations, which reduce the start time of each task and nullify the corresponding edge cost, are obviously meaningful for distributed memory systems, they are also worthwhile on shared memory architectures. Merging two tasks in the same cluster keeps the data in the local memory, and possibly even the cache, of each thread, and avoids their copying over the shared memory bus. Therefore, transmission costs are decreased and bus contention is reduced.


5.3.3 Resource Constraint Warranty

Resource usage affects the performance of running programs. Thus, parallelization algorithms should try to limit the size of the memory used by tasks. BDSC introduces a new heuristic to control the amount of memory used by a cluster, via the user-defined memory upper bound parameter M. The limitation of the memory size of tasks is important when (1) executing large applications that operate on large amounts of data, (2) M represents the processor local¹ (or cache) memory size, since, if the memory limitation is not respected, transfers between the global and local memories may occur during execution and may result in performance degradation, and (3) targeting embedded systems architectures. For each task τ, BDSC uses an over-approximation of the amount of data that τ allocates to perform read and write operations; it is used to check that the memory constraint of Cluster κ is satisfied whenever τ is included in κ. Algorithm 9 implements this memory constraint warranty (MCW); regions_union and data_size are functions that respectively merge data and yield the size (in bytes) of data (see Section 6.3 in the next chapter).

ALGORITHM 9: Resource constraint warranties, on memory size M and processor number P

function MCW(τ, κ, M)
  merged = regions_union(cluster_data(κ), task_data(τ))
  return data_size(merged) ≤ M
end

function PCW(clusters, P)
  return |clusters| < P
end
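As a rough illustration of how these two warranties behave, here is a self-contained C sketch in which data footprints are reduced to plain byte counts and regions_union is over-approximated by a sum; the actual implementation manipulates convex array regions and a precise data_size, as described in Section 6.3, and all names and numbers below are ours, chosen only for the example.

#include <stdbool.h>
#include <stdio.h>

typedef long footprint;                        /* data footprint, in bytes     */

/* Crude stand-in for regions_union: assume disjoint regions, so the union
   is at most the sum of the two footprints (an over-approximation).           */
static footprint regions_union(footprint a, footprint b) { return a + b; }

static bool MCW(footprint cluster_data, footprint task_data, long M) {
  return regions_union(cluster_data, task_data) <= M;   /* memory warranty     */
}

static bool PCW(int nb_clusters, int P) {
  return nb_clusters < P;                      /* room left for a new cluster? */
}

int main(void) {
  long M = 64 * 1024;                          /* per-cluster memory bound     */
  footprint cluster_data = 40 * 1024;          /* data already in the cluster  */
  footprint task_data = 30 * 1024;             /* candidate task footprint     */
  printf("MCW: %d, PCW: %d\n",
         MCW(cluster_data, task_data, M),      /* 0: 70 KB exceeds 64 KB       */
         PCW(3, 4));                           /* 1: a 4th cluster is allowed  */
  return 0;
}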

The previous line of reasoning is well adapted to a distributed memory architecture. When dealing with a multicore equipped with a purely shared memory, such a per-cluster memory constraint is less meaningful. We can nonetheless keep the MCW constraint check within the BDSC algorithm, even in this case, if we set M to the size of the global shared memory. A positive by-product of this design choice is that BDSC is able, in the shared memory case, to reject computations that need more memory space than available, even within a single cluster.

Another scarce resource is the number of processors. In the original policy of DSC, when no zeroing for τ is accepted, i.e., none that would decrease its start time, τ is allocated to a new cluster. In order to limit the number of created clusters, we propose to introduce a user-defined cluster threshold

¹ We handle one level of the memory hierarchy of the processors.


P. This processor constraint warranty, PCW, is defined in Algorithm 9.

5.3.4 Efficient Task-to-Cluster Mapping

In the original policy of DSC, when no zeroing is accepted, because none would decrease the start time of Vertex τ or DSRW failed, τ is allocated to a new cluster. This cluster creation is not necessary when idle slots are present at the end of other clusters; thus, we suggest to select instead one of these idle slots, if this cluster can decrease the start time of τ without affecting the scheduling of the successors of the vertices already scheduled in this cluster. To ensure this, these successors must have already been scheduled, or they must be a subset of the successors of τ. Therefore, in order to use clusters efficiently and not introduce additional clusters without need, we propose to schedule τ to a cluster that verifies this optimizing constraint, if no zeroing is accepted.

This extension of DSC that we introduce in BDSC thus amounts to replacing each assignment of the cluster of τ to a new cluster by a call to end_idle_clusters. The end_idle_clusters function, given in Algorithm 10, returns, among the idle clusters, the ones that finished the most recently before τ's top level, or the empty set if none is found. This assumes, of course, that τ's dependencies are compatible with this choice.

ALGORITHM 10: Efficiently mapping Task τ in Graph G to clusters, if possible

function end_idle_clusters(τ, G, clusters)
  idle_clusters = clusters
  foreach κ ∈ clusters
    if (cluster_time(κ) ≤ tlevel(τ, G)) then
      end_idle_p = TRUE
      foreach τκ ∈ vertices(G) where cluster(τκ) = κ
        foreach τs ∈ successors(τκ, G)
          end_idle_p ∧= cluster(τs) ≠ cluster_undefined ∨ τs ∈ successors(τ, G)
      if (¬end_idle_p) then
        idle_clusters = idle_clusters - {κ}
  last_clusters = argmax_{κ ∈ idle_clusters} cluster_time(κ)
  return (idle_clusters ≠ ∅) ? last_clusters : ∅
end

To illustrate the importance of this heuristic, suppose we have the DAG presented in Figure 5.3. The tables in Figure 5.4 exhibit the difference in scheduling obtained by DSC and our extension on this graph. We observe here that the number of clusters generated using DSC is 3, with 5 idle slots, while BDSC needs only 2 clusters, with 2 idle slots. Moreover, BDSC achieves a better load balancing than DSC, since it reduces the variance of


[Figure 5.3: A DAG amenable to cluster minimization (left) and its BDSC step-by-step scheduling (right). The DAG contains six tasks, of task times τ1:1, τ2:5, τ3:2, τ4:3, τ5:2 and τ6:2, with edge costs 1, 1, 1, 1, 1, 9 and 2. The scheduling table gives, for each step, the selected task, its tlevel, blevel and DS, and its scheduled tlevel in clusters κ0 and κ1: step 1: τ1 (tlevel 0, blevel 15, DS 15), scheduled tlevel 0; step 2: τ3 (2, 13, 15), scheduled tlevel 1; step 3: τ2 (2, 12, 14), scheduled tlevels 3 and 2; step 4: τ4 (8, 6, 14), scheduled tlevel 7; step 5: τ5 (8, 6, 14), scheduled tlevels 8 and 10; step 6: τ6 (13, 2, 15), scheduled tlevels 11 and 12.]

the clusters' execution loads, defined, for a given cluster, as the sum of the costs of all its tasks: X_κ = Σ_{τ∈κ} task_time(τ). Indeed, we compute first the average of costs

E[X_κ] = (Σ_{κ=0}^{|clusters|−1} X_κ) / |clusters|

and then the variance

V(X_κ) = (Σ_{κ=0}^{|clusters|−1} (X_κ − E[X_κ])²) / |clusters|

For BDSC, E[X_κ] = 15/2 = 7.5 and V(X_κ) = ((7 − 7.5)² + (8 − 7.5)²)/2 = 0.25. For DSC, E[X_κ] = 15/3 = 5 and V(X_κ) = ((5 − 5)² + (8 − 5)² + (2 − 5)²)/3 = 6.

Finally, with our efficient task-to-cluster mapping, in addition to decreasing the number of generated clusters, we also gain in total execution time, since our approach reduces communication costs by allocating tasks to the same cluster, e.g., by eliminating edge_cost(τ5, τ6) = 2 in Figure 5.4; thus, as shown in this figure (the last column accumulates the execution time for the clusters), the execution time with DSC is 14, but is 13 with BDSC.

To get a feeling for the way BDSC operates, we detail the steps taken to get this better scheduling in the table of Figure 5.3. BDSC is equivalent to DSC until Step 5, where κ0 is chosen by our cluster mapping heuristic: since successors(τ3, G) ⊂ successors(τ5, G), no new cluster needs to be allocated.

5.3.5 The BDSC Algorithm

BDSC extends the list-scheduling template DSC provided in Algorithm 4 by taking into account the various extensions discussed above. In a nutshell, the BDSC select_cluster function, which decides in which cluster κ a task τ should be allocated, tries successively the four following strategies:


[Figure 5.4: DSC (left) and BDSC (right) cluster allocation. DSC uses three clusters (κ0, κ1, κ2), reaching cumulative times 1, 7, 10 and 14 as τ1, {τ3, τ2}, {τ4, τ5} and τ6 are allocated; BDSC uses only two clusters (κ0, κ1), reaching cumulative times 1, 7, 10 and 13 as τ1, {τ3, τ2}, {τ5, τ4} and τ6 are allocated.]

1. choose κ among the clusters of τ's predecessors that decrease the start time of τ, under MCW and DSRW constraints;

2. or assign κ using our efficient task-to-cluster mapping strategy, under the additional constraint MCW;

3. or create a new cluster, if the PCW constraint is satisfied;

4. otherwise, choose the cluster among all clusters in MCW_clusters_min, under the constraint MCW. Note that, in this worst case scenario, the top level of τ can be increased, leading to a decrease in performance, since the length of the graph critical path is also increased.

BDSC is described in Algorithms 11 and 12; the entry graph Gu is the whole unscheduled program DAG, P the maximum number of processors, and M the maximum amount of memory available in a cluster. UT denotes the set of unexamined tasks at each BDSC iteration, RL the set of ready tasks, and URL the set of unready ones. We schedule the vertices of G according to the four rules above, in a descending order of the vertices' priorities. Each time a task τr has been scheduled, all the newly readied vertices are added to the set RL (ready list) by the update_ready_set function.

BDSC returns a scheduled graph, i.e., an updated graph where some zeroings may have been performed and for which the clusters function yields the clusters needed by the given schedule; this schedule includes, beside the new graph, the cluster allocation function on tasks, cluster. If not enough memory is available, BDSC returns the original graph and signals its failure by setting clusters to the empty set. In Chapter 6, we introduce an alternative solution in case of failure, using an iterative hierarchical scheduling approach.

We suggest to apply here an additional heuristic, in which, if multiple vertices have the same priority, the vertex with the greatest bottom level is chosen for τr (likewise for τu) to be scheduled first, to favor the successors that have the longest path from τr to the exit vertex. Also, an optimization could be performed when calling update_priority_values(G): indeed, after each cluster allocation, only the top levels of the successors of τr need to be recomputed, instead of those of the whole graph.


ALGORITHM 11: BDSC scheduling of Graph Gu, under processor and memory bounds P and M

function BDSC(Gu, P, M)
  if (P ≤ 0) then
    return error('Not enough processors', Gu)
  G = graph_copy(Gu)
  foreach τi ∈ vertices(G)
    priority(τi) = tlevel(τi, G) + blevel(τi, G)
  UT = vertices(G)
  RL = {τ ∈ UT / predecessors(τ, G) = ∅}
  URL = UT - RL
  clusters = ∅
  while UT ≠ ∅
    τr = select_task_with_highest_priority(RL)
    (τm, min_tlevel) = tlevel_decrease(τr, G)
    if (τm ≠ τr ∧ MCW(τr, cluster(τm), M)) then
      τu = select_task_with_highest_priority(URL)
      if (priority(τr) < priority(τu)) then
        if (¬DSRW(τr, τu, clusters, G)) then
          if (PCW(clusters, P)) then
            κ = new_cluster(clusters)
            allocate_task_to_cluster(τr, κ, G)
          else
            if (¬find_cluster(τr, G, clusters, P, M)) then
              return error('Not enough memory', Gu)
      else
        allocate_task_to_cluster(τr, cluster(τm), G)
        edge_cost(τm, τr) = 0
    else if (¬find_cluster(τr, G, clusters, P, M)) then
      return error('Not enough memory', Gu)
    update_priority_values(G)
    UT = UT - {τr}
    RL = update_ready_set(RL, τr, G)
    URL = UT - RL
  clusters(G) = clusters
  return G
end

Theorem 1. The time complexity of Algorithm 11 (BDSC) is O(n³), n being the number of vertices in Graph G.

Proof. In the "while" loop of BDSC, the most expensive computation among the different if or else branches is the function end_idle_clusters, used in find_cluster, which locates an existing cluster suitable for allocating Task τ there; such reuse intends to optimize the use of the limited number of processors.


ALGORITHM 12: Attempt to allocate κ in clusters for Task τ in Graph G, under processor and memory bounds P and M, returning true if successful

function find_cluster(τ, G, clusters, P, M)
  MCW_idle_clusters =
    {κ ∈ end_idle_clusters(τ, G, clusters, P) / MCW(τ, κ, M)}
  if (MCW_idle_clusters ≠ ∅) then
    κ = choose_any(MCW_idle_clusters)
    allocate_task_to_cluster(τ, κ, G)
  else if (PCW(clusters, P)) then
    allocate_task_to_cluster(τ, new_cluster(clusters), G)
  else
    MCW_clusters = {κ ∈ clusters / MCW(τ, κ, M)}
    MCW_clusters_min = argmin_{κ ∈ MCW_clusters} cluster_time(κ)
    if (MCW_clusters_min ≠ ∅) then
      κ = choose_any(MCW_clusters_min)
      allocate_task_to_cluster(τ, κ, G)
    else
      return false
  return true
end

function error(m, G)
  clusters(G) = ∅
  return G
end

Its complexity is proportional to

Σ_{τ ∈ vertices(G)} |successors(τ, G)|

which equals the number of edges of G and is thus of worst case complexity O(n²). Thus the total cost for the n iterations of the "while" loop is O(n³).

Even though BDSC's worst case complexity is larger than DSC's, which is O(n² log(n)) [108], it remains polynomial, with a small exponent. Our experiments (see Chapter 8) showed that this theoretical slowdown is not a significant factor in practice.

5.4 Related Work: Scheduling Algorithms

In this section, we survey the main existing scheduling algorithms and compare them to BDSC. Given the breadth of the literature on this subject, we limit this presentation to heuristics that implement static scheduling of


task DAGs. Static scheduling algorithms can be divided into three principal classes: guided random, metaheuristic and heuristic techniques.

Guided randomization and metaheuristic techniques use possibly random choices, with guiding information gained from previous search results, to construct new possible solutions. Genetic algorithms are examples of these techniques: they proceed by performing a global search, working in parallel on a population to increase the chance of achieving a global optimum. The Genetic Algorithm for Task Scheduling (GATS 1.0) [38] is an example of GAs. Overall, these techniques usually take a larger than necessary time to find a good solution [72].

Cuckoo Search (CS) [109] is an example of the metaheuristic class. CS combines the behavior of cuckoo species (throwing their eggs in the nests of other birds) with the Levy flight [28]. In metaheuristics, finding an initial population and deciding when a solution is good are problems that affect the quality of the solution and the complexity of the algorithm.

The third class is heuristics, including mostly list-scheduling-based algorithms (see Section 2.5.2). We compare BDSC with eight scheduling algorithms for a bounded number of processors: HLFET, ISH, MCP, HEFT, CEFT, ELT, LPGS and LSGP.

The Highest Level First with Estimated Times (HLFET) [10] and Insertion Scheduling Heuristic (ISH) [70] algorithms use static bottom levels for ordering; scheduling is performed according to a descending order of bottom levels. To schedule a task, they select the cluster that offers the earliest execution time, using a non-insertion approach, i.e., not taking into account idle slots within existing clusters to insert that task. If scheduling a given task introduces an idle slot, ISH adds the possibility of inserting, from the ready list, tasks that can be scheduled in this idle slot. Since, in both algorithms, only bottom levels are used for scheduling purposes, optimal schedules for fork/join graphs cannot be guaranteed.

The Modified Critical Path (MCP) algorithm [107] uses the latest start times, i.e., the critical path length minus blevel, as task priorities. It constructs a list of tasks in an ascending order of latest start times and searches for the cluster yielding the earliest execution, using the insertion approach. As before, it cannot guarantee optimal schedules for fork/join structures.

The Heterogeneous Earliest-Finish-Time (HEFT) algorithm [105] selects the cluster that minimizes the earliest finish time, using the insertion approach. The priority of a task, its upward rank, is the task bottom level. Since this algorithm is based on bottom levels only, it cannot guarantee optimal schedules for fork/join structures.

The Constrained Earliest Finish Time (CEFT) [68] algorithm schedules tasks on heterogeneous systems. It uses the concept of constrained critical paths (CCPs), which represent the tasks ready at each step of the scheduling process. CEFT schedules the tasks in the CCPs using the finish time in the


entire CCP. The fact that CEFT schedules critical path tasks first cannot guarantee optimal schedules for fork and join structures, even if sufficient processors are provided.

In contrast to these five algorithms, BDSC preserves the DSC characteristic of optimal scheduling for fork/join structures, since it uses the critical path length for computing dynamic priorities, based on bottom levels and top levels, when the number of processors available for BDSC is greater than or equal to the number of processors used when scheduling with DSC. HLFET, ISH and MCP guarantee that the current critical path will not increase, but they do not attempt to decrease the critical path length. BDSC decreases the length of the DS (Dominant Sequence) of each task and starts with a ready vertex, to simplify the computation time of new priorities. When resource scarcity is a factor, BDSC introduces a simple two-step heuristic for task allocation: to schedule tasks, it first searches for possible idle slots in already existing clusters, and otherwise picks a cluster with enough memory. Our experiments suggest that such an approach provides good schedules (see Chapter 8).

The Locally Parallel-Globally Sequential (LPGS) [79] and Locally Sequential-Globally Parallel (LSGP) [59] algorithms are two techniques that, from a schedule for an unbounded number of clusters, remap the solutions to a bounded number of clusters. In LSGP, clusters are partitioned into blocks, each block being assigned to one cluster (locally sequential); the blocks are handled separately by different clusters, which can be run in parallel (globally parallel). LPGS links each original one-block cluster to one processor (locally parallel); blocks are executed sequentially (globally sequential). BDSC computes a bounded schedule on the fly and covers many more possibilities of scheduling than LPGS and LSGP.

The Extended Latency Time (ELT) algorithm [102] assigns tasks to a parallel machine with shared memory. It uses the attribute of synchronization time instead of communication time, because the latter does not exist in a machine with shared memory. BDSC targets both shared and distributed memory systems.

5.5 Conclusion

This chapter presents the memory-constrained, number-of-processors-bounded, list-scheduling heuristic BDSC, which extends the DSC (Dominant Sequence Clustering) algorithm. This extension improves upon DSC by dealing with memory- and number-of-processors-constrained parallel architectures and by introducing an efficient task-to-cluster mapping strategy.

This chapter leaves pending many questions about the practical use of BDSC and about how to estimate the execution times and communication costs as numerical constants. The next chapter answers these questions, and others,


by showing (1) the practical use of BDSC within a hierarchical scheduling algorithm, called HBDSC, for parallelizing scientific applications, and (2) how BDSC will benefit from a sophisticated execution and communication cost model, based on either static code analysis results provided by a compiler infrastructure such as PIPS or a dynamic instrumentation-based assessment tool.

Chapter 6

BDSC-Based Hierarchical Task Parallelization

All sciences are now under the obligation to prepare the ground for the future task of the philosopher, which is to solve the problem of value, to determine the true hierarchy of values. Friedrich Nietzsche

To recapitulate what we have presented so far in the previous chapters, recall that we have handled two key issues in our automatic task parallelization process, namely (1) the design of SPIRE, to provide parallel program representations, and (2) the introduction of the BDSC list-scheduling heuristic, for mapping DAGs in the presence of resource constraints on the number of processors and their local memory size. In order to reach our goal of automatically task parallelizing sequential codes, we develop in this chapter an approach, based on BDSC, that first maps this code into a DAG, then generates a cost model to label the edges and vertices of this DAG, and finally applies BDSC to sequential applications.

Therefore, this chapter introduces a new BDSC-based hierarchical scheduling algorithm that we call HBDSC. It introduces a new DAG data structure, called the Sequence Dependence DAG (SDG), to represent partitioned parallel programs. Moreover, in order to label this SDG's vertices and edges, HBDSC uses a new cost model based on time complexity measures, convex polyhedral approximations of data array sizes and code instrumentation.

6.1 Introduction

If static scheduling algorithms are key issues in sequential program parallelization, they need to be properly integrated into compilation platforms to be used in practice. These platforms are in particular expected to provide the data required for scheduling purposes. However, gathering such information is a difficult problem in general, in particular in our case, where tasks are automatically generated from program code. Therefore, in addition to BDSC, this chapter also introduces new static program analyses to gather the information required to perform scheduling under resource constraints, namely a static execution and communication cost model, a data dependence graph to provide scheduling constraints, and static information regarding the volume of data exchanged between program tasks.

In this chapter, we detail how BDSC can be used in practice to schedule applications. We show (1) how to build, from an existing program source


code, what we call a Sequence Dependence DAG (SDG), which plays the role of DAG G used in Chapter 5, (2) how to generate the numerical costs of vertices and edges in SDGs, and (3) how to perform what we call Hierarchical Scheduling (HBDSC) for SDGs, to generate parallel code encoded in SPIRE(PIPS IR). We use PIPS to illustrate how these new techniques can be implemented in an optimizing compilation platform.

PIPS represents user code as abstract syntax trees (see Section 2.6.5). We define a subset of its grammar in Figure 6.1, limited to the statements S at stake in this chapter. Econd, Elower and Eupper are expressions, while I is an identifier. The semantics of these constructs is straightforward. Note that, in PIPS, assignments are seen as function calls, where left hand sides are parameters passed by reference. We use the notion of unstructured with an execution attribute equal to parallel (defined in Chapter 4) to represent parallel code.

S ∈ Statement = sequence(S1;...;Sn) |
                test(Econd, St, Sf) |
                forloop(I, Elower, Eupper, Sbody) |
                call |
                unstructured(Centry, Cexit, parallel)
C ∈ Control   = control(S, Lsucc, Lpred)
L ∈ Control

Figure 6.1: Abstract syntax trees Statement syntax
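As an illustration of this grammar, the small C function below (ours, not taken from the thesis benchmarks) is annotated, in comments, with the Figure 6.1 constructor each of its statements would be represented by:

/* Hypothetical C fragment; comments name the Figure 6.1 constructors. */
void f(int n, int a[], int *s) {
  *s = 0;                         /* call (assignment seen as a call)  */
  for (int i = 0; i < n; i++)     /* forloop(i, 0, n-1, S_body)        */
    if (a[i] > 0)                 /* test(E_cond, S_t, S_f)            */
      *s = *s + a[i];             /* call                              */
  /* the function body as a whole: sequence(S_1, ..., S_n)             */
}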

The remainder of this chapter is structured as follows. Section 6.2 introduces the partitioning of source codes into Sequence Dependence DAGs (SDGs). Section 6.3 presents our cost models for the labeling of vertices and edges of SDGs. We introduce in Section 6.4 an important analysis, the path transformer, which we use within our cost models. A new BDSC-based hierarchical scheduling algorithm (HBDSC) is introduced in Section 6.5. Section 6.6 compares the main existing parallelization platforms with our BDSC-based hierarchical parallelization approach. Finally, Section 6.7 concludes the chapter. Figure 6.2 summarizes the parallelization processes applied in this chapter.

6.2 Hierarchical Sequence Dependence DAG Mapping

In order to partition real applications, which include loops, tests and other structured constructs¹, into dependence DAGs, our approach is

¹ In this thesis, we only handle structured parts of a code, i.e., the ones that do not contain goto statements. Therefore, within this context, PIPS implements control dependences in its IR, since it is equivalent to an AST (for structured programs, CDG and AST are equivalent).


[Figure 6.2: Parallelization process: blue indicates thesis contributions; an ellipse, a process; and a rectangle, results. Starting from the C source code and input data, PIPS analyses produce the DDG, convex array regions and complexity polynomials; from the sequential IR, the Sequence Dependence Graph (SDG) is built, together with symbolic communication and memory size polynomials; depending on whether these polynomials are monomials on the same base, they are either approximated or the source is instrumented and executed to obtain a numerical profile; HBDSC then produces the scheduled, unstructured SPIRE(PIPS IR).]



to first build a Sequence Dependence DAG (SDG), which will be the input for the BDSC algorithm. Then we use the code, presented in the form of an AST, to define a hierarchical mapping function, which we call H, that maps each sequence statement of the code to its SDG. H is used as the input of the HBDSC algorithm. We present in this section what SDGs are and how an H is built upon them.

6.2.1 Sequence Dependence DAG (SDG)

A Sequence Dependence DAG G is a data dependence DAG where task vertices τ are labeled with statements, while control dependences are encoded in the abstract syntax trees of statements. Any statement S can label a DAG vertex, i.e., each vertex τ contains a statement S, which corresponds to the code it runs when scheduled. We assume that there exist two functions vertex_statement and statement_vertex such that, on their respective domains of definition, they satisfy S = vertex_statement(τ) and statement_vertex(S,G) = τ. In contrast to the usual program dependence graph defined in [44], an SDG is thus not built only on simple instructions, represented here as call statements: compound statements, such as test statements (both true and false branches) and loop nests, may constitute indivisible vertices of the SDG.

Design Analysis

For a statement S, one computes its sequence dependence DAG G using a recursive descent (depth-first traversal) along the structure of S, creating a vertex for each statement in a sequence and building loop body, true branch and false branch statements as recursively nested graphs. Definition 1 specifies this relation of hierarchical nesting between statements, which we call "enclosed".

Definition 1. A statement S2 is enclosed in the statement S1, noted S2 ⊂ S1, if and only if either:

1. S2 = S1, i.e., S1 and S2 have the same lexicographical order position within the whole AST to which they belong; this ordering of statements, based on their statement number, renders statements textually comparable;

2. or S1 is a loop, S′2 its body statement, and S2 ⊂ S′2;

3. or S1 is a test, S′2 its true or false branch, and S2 ⊂ S′2;

4. or S1 is a sequence and there exists S′2, one of the statements in this sequence, such that S2 ⊂ S′2.


This relation can also be applied to the vertices τi of these statements Si in a graph, such that vertex_statement(τi) = Si; we note S2 ⊂ S1 ⇔ τ2 ⊂ τ1.
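For instance, in the following hypothetical C fragment (ours, written only to illustrate Definition 1), the innermost assignment is enclosed in the test, which is itself enclosed in the loop:

/* Hypothetical fragment used only to illustrate the enclosed relation. */
void zero_positives(int n, int a[]) {
  for (int i = 0; i < n; i++) {  /* S1: loop statement                        */
    if (a[i] > 0)                /* S2: test; S2 ⊂ S1 (case 2, via the body)  */
      a[i] = 0;                  /* S3: call; S3 ⊂ S2 ⊂ S1 (cases 3, then 2)  */
  }
}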

In Definition 2, we provide a functional specification of the construction of a Sequence Dependence DAG G, where the vertices are compound statements, related by the enclosed relation. To create connections within this set of vertices, we keep an edge between two vertices if the possibly compound statements in the vertices are data dependent, using the recursive function prune; we rely on static analyses, such as the one provided in PIPS, to get this data dependence information, in the form of a data dependence graph that we call D in the definition.

Definition 2: Given a data dependence graph D = (T, E)² and a sequence of statements S = sequence(S1; S2; ...; Sm), its SDG is G = (N, prune(A), 0m²), where:

• N = {n1, n2, ..., nm}, where nj = vertex_statement(Sj), ∀j ∈ {1, ..., m};

• A = {(ni, nj) / ∃(τ, τ′) ∈ E, τ ⊂ ni ∧ τ′ ⊂ nj};

• prune(A′ ∪ {(n, n′)}) = prune(A′) if ∃(n0, n′0) ∈ A′, (n0 ⊂ n ∧ n′0 ⊂ n′),
  prune(A′) ∪ {(n, n′)} otherwise;

• prune(∅) = ∅;

• 0m² is an m × m communication edge cost matrix, initially equal to the zero matrix.

Note that G is a quotient graph of D; moreover, we assume that D does not contain inter-iteration dependences.
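The prune step can be read as duplicate elimination of coarse-grained edges under the enclosed relation. Below is a minimal, self-contained sketch (our own illustration, not the PIPS code): vertices are plain integers and enclosed() is reduced to equality, standing in for the full recursive test of Definition 1.

#include <stdbool.h>
#include <stdio.h>

typedef struct { int src, dst; } Edge;

/* Stand-in for the "⊂" relation of Definition 1 (equality only here). */
static bool enclosed(int a, int b) { return a == b; }

/* Keeps an edge (n, n') only if no already-kept edge (n0, n0')
   satisfies n0 ⊂ n and n0' ⊂ n', as in Definition 2. */
static int prune(const Edge *in, int n_in, Edge *out) {
  int n_out = 0;
  for (int i = 0; i < n_in; i++) {
    bool subsumed = false;
    for (int j = 0; j < n_out; j++)
      if (enclosed(out[j].src, in[i].src) && enclosed(out[j].dst, in[i].dst))
        subsumed = true;
    if (!subsumed)
      out[n_out++] = in[i];
  }
  return n_out;
}

int main(void) {
  Edge a[] = { {1, 2}, {1, 2}, {1, 3} };   /* the duplicate (1,2) is pruned */
  Edge kept[3];
  int n = prune(a, 3, kept);
  for (int i = 0; i < n; i++)
    printf("(%d, %d)\n", kept[i].src, kept[i].dst);
  return 0;
}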

Construction Algorithm

Algorithm 13 specifies the construction of the sequence dependence DAG G of a statement that is a sequence of substatements S1, S2, ..., Sm. G is initially an empty graph, i.e., its sets of vertices and edges are empty and its communication matrix is the zero matrix.

From the data dependence graph D, provided via the function DDG that we suppose given, Function SDG(S) adds vertices and edges to G according to Definition 2, using the functions add_vertex and add_edge. First, we group dependences between compound statements Si to form cumulated dependences, using dependences between their enclosed statements Se; this

²T = vertices(D) is a set of n tasks (vertices) τ, and E is a set of m edges (τi, τj) between two tasks.


ALGORITHM 13: Construction of the sequence dependence DAG G from a statement S = sequence(S1; S2; ...; Sm)

function SDG(sequence(S1; S2; ...; Sm))
  G = (∅, ∅, 0m²)
  D = DDG(sequence(S1; S2; ...; Sm))
  foreach Si ∈ {S1, S2, ..., Sm}
    τi = statement_vertex(Si, D)
    add_vertex(τi, G)
    foreach Se / Se ⊂ Si
      τe = statement_vertex(Se, D)
      foreach τj ∈ successors(τe, D)
        Sj = vertex_statement(τj)
        if (∃ k ∈ [1, m] / Sj ⊂ Sk) then
          τk = statement_vertex(Sk, D)
          edge_regions(τi, τk) ∪= edge_regions(τe, τj)
          if (∄ (τi, τk) ∈ edges(G))
            add_edge((τi, τk), G)
  return G
end

yields the successors τj. Then we search for another compound statement Sk to form the second side of the dependence, τk, the first being τi. We then keep the dependences that label the edges of D, using edge_regions, on the new edge. Finally, we add the dependence edge (τi, τk) to G.

Figure 6.4 illustrates the construction, from the DDG given in Figure 6.3 (right; the meaning of colors is given in Section 2.4.2), of the SDG of the C code (left). The figure contains two SDGs, corresponding to the two sequences in the code: the body S0 of the first loop (in blue) also has an SDG, G0. Note how the dependences between the two loops have been deduced from the dependences of their enclosed statements (their loop bodies). These SDGs and their printouts have been generated automatically with PIPS³.

6.2.2 Hierarchical SDG Mapping

A hierarchical SDG mapping function H maps each statement S to an SDG G = H(S) if S is a sequence statement; otherwise, G is equal to ⊥.

Proposition 1: ∀τ ∈ vertices(H(S′)), vertex_statement(τ) ⊂ S′.

Our introduction of the SDG and of its hierarchical mapping to different statements via H is motivated by the following observations, which also support our design decisions:

³We assume that declarations and code are not mixed.


void main() {
  int a[10], b[10];
  int i, d, c;
  /* S */
  {
    c = 42;
    for (i = 1; i <= 10; i++)
      a[i] = 0;
    for (i = 1; i <= 10; i++) {
      /* S0 */
      a[i] = bar(a[i]);
      d = foo(c);
      b[i] = a[i]+d;
    }
  }
  return;
}

Figure 6.3: Example of a C code (left) and the DDG D of its internal S sequence (right)

1. The true and false statements of a test are control dependent upon the condition of the test statement, while every statement within a loop (i.e., the statements of its body) is control dependent upon the loop statement header. If we define a control area as a set of statements transitively linked by the control dependence relation, our SDG construction process ensures that the control area of the statement of a given vertex is in the vertex. This way, we keep all the control dependences of a task in our SDG within the task itself.

2. We decided to consider test statements as single vertices in the SDG, to ensure that they are scheduled on one cluster⁴; this guarantees the execution of the enclosed code (true or false statements), whichever branch is taken, on this cluster.

3. We do not group successive simple call instructions into a single "basic block" vertex in the SDG, in order to let BDSC fuse the corresponding statements so as to maximize parallelism and minimize communications. Note that PIPS performs interprocedural analyses, which allow call sequences to be efficiently scheduled, whether these calls represent trivial assignments or complex function calls.

⁴A cluster is a logical entity which will correspond to one process or thread.

Figure 6.4: SDGs of S (top) and S0 (bottom), computed from the DDG (see the right of Figure 6.3); S and S0 are specified in the left of Figure 6.3

In Algorithm 13, Function SDG yields the sequence dependence DAG of a sequence statement S. We use it in Function SDG_mapping, presented in Algorithm 14, to update the mapping function H for statements S that can be a sequence or a non-sequence; the goal is to build a hierarchy between these SDGs, via the recursive call H = SDG_mapping(S, ⊥).

Figure 6.5 shows the construction of H, where S is the last part of the equake code excerpt given in Figure 6.7; note how the body Sbody of the last loop also has an SDG, Gbody. Note that H(Sbody) = Gbody and H(S) = G. These SDGs have been generated automatically with PIPS; we use the Graphviz tool for pretty printing [50].


ALGORITHM 14: Recursive mapping of an SDG G to each sequence statement S via the function H

function SDG_mapping(S, H)
  switch (S)
    case call:
      return H[S → ⊥]
    case sequence(S1; ...; Sn):
      foreach Si ∈ {S1, ..., Sn}
        H = SDG_mapping(Si, H)
      return H[S → SDG(S)]
    case forloop(I, Elower, Eupper, Sbody):
      H = SDG_mapping(Sbody, H)
      return H[S → ⊥]
    case test(Econd, St, Sf):
      H = SDG_mapping(St, H)
      H = SDG_mapping(Sf, H)
      return H[S → ⊥]
end

Figure 6.5: SDG G for a part of equake S given in Figure 6.7; Gbody is the SDG of the body Sbody (logically included into G via H)


6.3 Sequential Cost Models Generation

Since the volume of data used or exchanged by SDG tasks and their execution times are key factors in the BDSC scheduling process, we need to assess this information as precisely as possible. PIPS provides an intra- and inter-procedural analysis of array data flow, called array regions analysis (see Section 2.6.3), that computes dependences for each array element access. For each statement S, two types of sets of regions are considered: read_regions(S) and write_regions(S)⁵ contain the array elements respectively read and written by S. Moreover, in PIPS, array region analysis is not limited to array variables but is extended to scalar variables, which can be considered as arrays with no dimensions [34].

6.3.1 From Convex Polyhedra to Ehrhart Polynomials

Our analysis uses the following set of operations on the sets of regions Ri of Statements Si, presented in Section 2.6.3: regions_intersection and regions_union. Note that regions must be defined with respect to a common memory store for these operations to be properly defined. Two sets of regions R1 and R2 should be defined in the same memory store to make it possible to compare them: we should bring a set of regions R1 to the store of R2 by combining R2 with the possible store changes introduced in between, what we call the "path transformer", which connects the two memory stores. We introduce the path transformer analysis in Section 6.4.

Since we are interested in the size of array regions, to precisely assess communication costs and memory requirements, we compute Ehrhart polynomials [41], which represent the number of integer points contained in a given parameterized polyhedron, from these regions. To manipulate these polynomials, we use various operations of the Ehrhart API provided by the polylib library [90].

Communication Edge Cost

To assess the communication cost between two SDG vertices, τ1 as source and τ2 as sink, we rely on the number of bytes involved in "read after write" (RAW) dependences, using the read and write regions as follows:

Rw1 = write_regions(vertex_statement(τ1))
Rr2 = read_regions(vertex_statement(τ2))
edge_cost_bytes(τ1, τ2) = Σ_{r ∈ regions_intersection(Rw1, Rr2)} ehrhart(r)

⁵in_regions(S) and out_regions(S) are not suitable in our context, unless in_regions and out_regions are recomputed after each statement is scheduled. That is why we use read_regions and write_regions, which are independent of the schedule.


For example, the array dependence between the two statements S1 = MultiplY(Ixx, Gx, Gx) and S2 = Gauss(Sxx, Ixx) in the code of Harris, presented in Figure 1.1, is on Ixx, of size N × M, where the N and M variables represent the input image size. Therefore, S1 should communicate the array Ixx to S2 when completed. We look for the array elements that are accessed in both statements (written in S1 and read in S2); we obtain the following two sets of regions, Rw1 and Rr2:

Rw1 = <Ixx(φ1)-W-EXACT-{1 ≤ φ1, φ1 ≤ N×M}>
Rr2 = <Ixx(φ1)-R-EXACT-{1 ≤ φ1, φ1 ≤ N×M}>

The result of the intersection between these two sets of regions is as follows

Rw1 ∩ Rr2 = <Ixx(φ1)-WR-EXACT-{1 ≤ φ1, φ1 ≤ N×M}>

The sum of the Ehrhart polynomials resulting from the set of regions Rw1 ∩ Rr2, which yields the volume of this set of regions, i.e., the number of its elements in bytes, is as follows, where 4 is the number of bytes in a float:

edge_cost_bytes(τ1, τ2) = ehrhart(<Ixx(φ1)-WR-EXACT-{1 ≤ φ1, φ1 ≤ N×M}>)
                        = 4 × N × M

In practice, in our experiments, in order to compute the communication time, this polynomial, which represents the size of the message to communicate in number of bytes, is multiplied by the transfer time of one byte (β); the latency time (α) is then added. These two coefficients depend on the specific target machine; we note:

edge_cost(τ1, τ2) = α + β × edge_cost_bytes(τ1, τ2)

The default cost model in PIPS considers that α = 0 and β = 1. Moreover, in some cases, the array region analysis cannot provide the required information. In the case of shared memory codes, BDSC can nevertheless proceed and generate an efficient schedule without the edge_cost information; otherwise, in the case of distributed memory codes, edge_cost(τ1, τ2) = ∞ is generated for such cases, and we thus decide to schedule τ1 and τ2 in the same cluster.
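To make the α/β model concrete, the short sketch below evaluates edge_cost for the Harris message size 4 × N × M computed above; the non-default α and β values are arbitrary placeholders, not measured machine parameters.

#include <stdio.h>

/* edge_cost = alpha + beta * edge_cost_bytes (alpha: latency, beta: per-byte cost) */
static double edge_cost(double alpha, double beta, double bytes) {
  return alpha + beta * bytes;
}

int main(void) {
  long N = 1024, M = 1024;                 /* input image size (example) */
  double bytes = 4.0 * N * M;              /* Ehrhart-based size of Ixx  */
  printf("PIPS default model : %.0f\n", edge_cost(0.0, 1.0, bytes));
  printf("placeholder machine: %.0f\n", edge_cost(1000.0, 0.5, bytes));
  return 0;
}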

Local Storage Task Data

To provide an estimation of the volume of data used by each task τ, we use the number of bytes of data read and written by the task statement, via the following definitions, assuming that the task statement contains no internal allocation of data:


S = vertex_statement(τ)
task_data(τ) = regions_union(read_regions(S), write_regions(S))

As above, we define data_size as follows, for any set of regions R:

data_size(R) = Σ_{r ∈ R} ehrhart(r)

For instance, in the Harris main function, task_data(τ), where τ is the vertex labeled with the call statement S1 = MultiplY(Ixx, Gx, Gx), is equal to:

R = {<Gx(φ1)-R-EXACT-{1 ≤ φ1, φ1 ≤ N×M}>,
     <Ixx(φ1)-W-EXACT-{1 ≤ φ1, φ1 ≤ N×M}>}

The size of this set of regions is:

data_size(R) = 2 × N × M

When the array region analysis cannot provide the required information, the size of the array given in its declaration is returned as an upper bound of the number of accessed array elements.

Execution Task Time

In order to determine an average execution time for each vertex in the SDG, we use a static execution time approach based on a program complexity analysis provided by PIPS. There, each statement S is automatically labeled with an expression, represented by a polynomial over program variables via the complexity_estimation(S) function, that denotes an estimation of the execution time of this statement, assuming that each basic operation (addition, multiplication...) has a fixed, architecture-dependent execution time. In practice, to perform our experiments on a specific target machine, we use a table, COMPLEXITY_COST_TABLE, that provides, for each basic operation, the coefficient of its execution time on this machine. This sophisticated static complexity analysis is based on inter-procedural information such as preconditions. Using this approach, one can define task_time as:

task_time(τ) = complexity_estimation(vertex_statement(τ))

For instance, for the call statement S1 = MultiplY(Ixx, Gx, Gx) in the Harris main function, task_time(τ), where τ is the vertex labeled with S1, is equal to N × M × 17 + 2, as detailed in Figure 6.6. Thanks to the interprocedural analysis of PIPS, this information can be reported to the call to the function MultiplY in the main function of Harris. Note that we use here a default cost model, where the coefficient of each basic operation is equal to 1.

When this analysis cannot provide the required information (currently, the complexity estimation of a while loop is not computed), our algorithm computes a naive schedule, where we assume that task_time(τ) = ∞; the quality (in terms of efficiency) of this schedule is, however, not guaranteed.


void MultiplY(float M[N*M], float X[N*M], float Y[N*M]) {
  // N*M*17 + 2
  for(i = 0; i < N; i += 1)
    // M*17 + 2
    for(j = 0; j < M; j += 1)
      // 17
      M[z(i,j)] = X[z(i,j)] * Y[z(i,j)];
}

Figure 6.6: Example of the execution time estimation for the function MultiplY of the code Harris; each comment provides the complexity estimation of the statement below (N and M are assumed to be global variables)
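As an illustration of how an architecture-dependent operation cost table turns such operation counts into cycle estimates, here is a small sketch; the table layout, operation set and coefficients are hypothetical, not the actual PIPS COMPLEXITY_COST_TABLE.

#include <stdio.h>

enum { OP_ADD, OP_MUL, OP_LOAD, OP_STORE, NB_OPS };

/* Hypothetical per-operation costs in cycles for some target machine. */
static const double cost_table[NB_OPS] = {
  [OP_ADD] = 1.0, [OP_MUL] = 5.0, [OP_LOAD] = 4.0, [OP_STORE] = 4.0
};

/* Weighted cost of one execution of a statement, given its operation counts. */
static double weighted_time(const long counts[NB_OPS]) {
  double t = 0.0;
  for (int op = 0; op < NB_OPS; op++)
    t += cost_table[op] * counts[op];
  return t;
}

int main(void) {
  /* Made-up operation counts for one iteration of an innermost statement. */
  long counts[NB_OPS] = { [OP_ADD] = 2, [OP_MUL] = 1, [OP_LOAD] = 2, [OP_STORE] = 1 };
  long N = 1024, M = 1024;
  printf("task_time estimate: %.0f cycles\n", (double)N * M * weighted_time(counts));
  return 0;
}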

6.3.2 From Polynomials to Values

We have just seen how to represent cost, data and time information in terms of polynomials; yet running BDSC requires actual values. This is particularly a problem when computing tlevels and blevels, since costs and times are accumulated there. We convert cost information into time by assuming that communication times are proportional to costs, which amounts in particular to setting the communication latency α to zero. This assumption is validated by our experimental results and by the fact that data arrays are usually large in the application domain we target.

Static Approach: Polynomial Approximation

When the program variables used in the above-defined polynomials have known numerical values, each polynomial is a constant; this happens to be the case for one of our applications, ABF. However, when input data are unknown at compile time (as for the Harris application), we suggest to use a very simple heuristic to replace the polynomials by numerical constants: when all the polynomials at stake are monomials on the same base, we simply keep the coefficients of these monomials as costs. Even though this heuristic appears naive at first, it actually is quite useful for the Harris application. Table 6.1 shows the complexities for time estimation and the communication costs (last line) generated for each function of Harris using the PIPS default operation and communication cost model, where the N and M variables represent the input image size.

In our implementation (see Section 8.3.3), we use a more precise operation and communication cost model that depends on the target machine. It relies on a number of parameters of a given CPU, such as the cost of an addition, of a multiplication or of the communication of one byte, for an Intel CPU, all of these in numbers of cycles (CPI). In Section 8.3.3, we detail these parameters for the CPUs used.

Function             Complexity and Transfer (polynomial)   Numerical estimation
InitHarris           9 × N × M                               9
SobelX               60 × N × M                              60
SobelY               60 × N × M                              60
MultiplY             17 × N × M                              17
Gauss                85 × N × M                              85
CoarsitY             34 × N × M                              34
One image transfer   4 × N × M                               4

Table 6.1: Execution and communication time estimations for Harris using the PIPS default cost model (the N and M variables represent the input image size)

Dynamic Approach: Instrumentation

The general case deals with polynomials that are functions of many variables, such as the ones that occur in the SPEC2001 benchmark equake and that depend on the variables ARCHelems, ARCHnodes and timesteps (see the non-bold code in Figure 6.7, which shows a part of the equake code).

In such cases, we first automatically instrument the input sequential code and run it once in order to obtain the numerical values of the polynomials. The instrumented code contains the initial user code plus instructions that compute the numerical values of the cost polynomials for each statement. BDSC is then applied, using this cost information, to yield the final parallel program. Note that this approach is sound, since BDSC ensures that the value of a variable (and thus of a polynomial) is the same whichever scheduling is used. Of course, this approach works well, as our experiments show (see Chapter 8), when a program's complexity and communication polynomials do not change when some of its input parameters are modified (partial evaluation). This is the case for many signal processing benchmarks, where performance is mostly a function of structure parameters, such as image size, and is independent of the actual signal (pixel) values upon which the program acts.

We show an example of this final case using a part of the instrumented equake code⁶ in Figure 6.7. The added instrumentation instructions are fprintf statements: for task_time instrumentation, the second parameter represents the statement number of the following statement, and the third the value of its execution time. For edge_cost instrumentation, the second parameter is the number of the incident statements of the

⁶We do not show the instrumentation on the statements inside the loops, for readability purposes.


FILE *finstrumented = fopen("instrumented_equake.in", "w");

fprintf(finstrumented, "task_time 62 = %ld\n", 179*ARCHelems + 3);
for (i = 0; i < ARCHelems; i++)
  for (j = 0; j < 4; j++)
    cor[j] = ARCHvertex[i][j];

fprintf(finstrumented, "task_time 163 = %ld\n", 20*ARCHnodes + 3);
for (i = 0; i <= ARCHnodes-1; i += 1)
  for (j = 0; j <= 2; j += 1)
    disp[disptplus][i][j] = 0.0;

fprintf(finstrumented, "edge_cost 163 -> 166 = %ld\n", ARCHnodes * 9);
fprintf(finstrumented, "task_time 166 = %ld\n", 110*ARCHnodes + 106);
smvp_opt(ARCHnodes, K, ARCHmatrixcol, ARCHmatrixindex,
         disp[dispt], disp[disptplus]);

fprintf(finstrumented, "task_time 167 = %d\n", 6);
time = iter*Exc.dt;

fprintf(finstrumented, "task_time 168 = %ld\n", 510*ARCHnodes + 3);
/* S */
for (i = 0; i < ARCHnodes; i++)
  for (j = 0; j < 3; j++) {
    /* Sbody */
    disp[disptplus][i][j] *= -Exc.dt*Exc.dt;
    disp[disptplus][i][j] +=
      2.0*M[i][j]*disp[dispt][i][j] - (M[i][j] -
      Exc.dt/2.0*C[i][j])*disp[disptminus][i][j] -
      Exc.dt*Exc.dt*(M23[i][j]*phi2(time)/2.0 +
                     C23[i][j]*phi1(time)/2.0 +
                     V23[i][j]*phi0(time)/2.0);
    disp[disptplus][i][j] =
      disp[disptplus][i][j] / (M[i][j] + Exc.dt/2.0*C[i][j]);
    vel[i][j] = 0.5/Exc.dt*(disp[disptplus][i][j] -
                            disp[disptminus][i][j]);
  }

fprintf(finstrumented, "task_time 175 = %d\n", 2);
disptminus = dispt;
fprintf(finstrumented, "task_time 176 = %d\n", 2);
dispt = disptplus;
fprintf(finstrumented, "task_time 177 = %d\n", 2);
disptplus = i;

Figure 6.7: Instrumented part of equake (Sbody is the inner loop sequence)

edge, and the third the value of the edge_cost polynomial. After execution of the instrumented code, the numerical results of the polynomials are printed in the file instrumented_equake.in, presented in Figure 6.8. This file is an input for the PIPS implementation of BDSC: we parse this file in order to extract the execution and communication time estimations (in numbers of cycles) to be used as parameters for the computation of the top levels and bottom levels necessary for the BDSC algorithm (see Chapter 5).

task_time 62 = 35192835

task_time 163 = 718743

edge_cost 163 -> 166 = 323433

task_time 166 = 3953176

task_time 167 = 6

task_time 168 = 18327873

task_time 175 = 2

task_time 176 = 2

task_time 177 = 2

Figure 6.8: Numerical results of the instrumented part of equake (instrumented_equake.in)
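For illustration, a minimal reader for this file format could look as follows (this is our own sketch, not the parser used in PIPS):

#include <stdio.h>

int main(void) {
  FILE *f = fopen("instrumented_equake.in", "r");
  if (!f) { perror("instrumented_equake.in"); return 1; }
  char line[256];
  while (fgets(line, sizeof line, f)) {
    long stmt, src, dst, value;
    if (sscanf(line, "task_time %ld = %ld", &stmt, &value) == 2)
      printf("statement %ld: task_time = %ld cycles\n", stmt, value);
    else if (sscanf(line, "edge_cost %ld -> %ld = %ld", &src, &dst, &value) == 3)
      printf("edge %ld -> %ld: edge_cost = %ld cycles\n", src, dst, value);
  }
  fclose(f);
  return 0;
}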

This section has presented the information our BDSC scheduling algorithm relies upon: weights on the vertices (execution times and storage data) and on the edges (communication costs) of its input graph. Indeed, we exploit the complexity and array region analyses provided by PIPS to harvest all this information statically. However, since two sets of regions must be defined in the same memory store to make it possible to compare them, the second set of regions has to be combined with the path transformer, which we define in the next section and which connects the two memory stores of these two sets of regions.

6.4 Reachability Analysis: The Path Transformer

Our cost model, defined in Section 6.3, uses affine relations such as convex array regions to estimate the communication and data volumes induced by the execution of statements. We used a set of operations on array regions, such as the function regions_intersection (see Section 6.3.1). However, this is not sufficient, because regions should be combined with what we call path transformers, in order to propagate the memory stores used in them. Indeed, regions must be defined with respect to a common memory store for region operations to be properly performed.

A path transformer makes it possible to compare array regions of statements originally defined in different memory stores. The path transformer between two statements computes the possible changes performed by a piece of code delimited by two statements, Sbegin and Send, enclosed within a statement S. A path transformer is represented by a convex polyhedron over program variables, and its computation is based on transformers (see Section 2.6.1); it is thus a system of linear inequalities.


Our path transformer analysis can be seen as a special case of reachability analysis, a particular kind of graph analysis where one checks whether a given final state can be reached from an initial one. In the domain of programming languages, one can find a practical application of this technique in the computation, for general control-flow graphs, of the transitive closure of the relation corresponding to statement transitions (see for instance [106]). Since the scientific applications we target in this thesis use mostly for loops, we propose in this section a dedicated analysis, called path transformer, that provides, in the restricted case of for loops and tests, analytical expressions for such transitive closures.

6.4.1 Path Definition

A "path" specifies a statement execution instance, including the mapping of loop indices to values and the specific syntactic statement of interest. A path is defined by a pair P = (M, S), where M is a mapping function from Ide to Exp⁷ and S is a statement, specified by its syntactic path from the top program syntax tree to this particular statement. Indeed, a particular statement Si may correspond to a particular iteration M(i) within a set of iterations (created by a loop of index i). Paths are used to store the values of the indices of the loops enclosing statements. Every loop on the path has its entry in M; indices of loops not related to it are not present.

We provide a simple program example in Figure 6.9 that shows the AST of a code that contains a loop. We search for the path transformer between Sb and Se; we suppose that S, the smallest subtree that encloses the two statements, is forloop0. As the two statements are enclosed in a loop, we have to specify the indices of these statements inside the loop, e.g., M(i) = 2 for Sb and M(i) = 5 for Se. The path execution trace between Sb and Se is illustrated in Figure 6.10: the chosen path runs through Sb for the second iteration of the loop until Se for the fifth iteration.

In this section, we assume that Send is reachable [43] from Sbegin. This means that, for each enclosing loop S (Sbegin ⊂ S and Send ⊂ S), where S = forloop(I, Elower, Eupper, Sbody), Elower ≤ Eupper (loops are always executed), and, for each mapping of an index of these loops, Elower ≤ ibegin ≤ iend ≤ Eupper. Moreover, since there is no execution trace from the true branch St to the false branch Sf of a test statement Stest, Send is not reachable from Sbegin if, for each mapping of an index of a loop that encloses Stest, ibegin = iend, or if there is no loop enclosing the test.

6.4.2 Path Transformer Algorithm

In this section, we present an algorithm that computes affine relations between program variables at any two points, or "paths", of the program, from

⁷Ide and Exp are the sets of identifiers I and expressions E, respectively.


forloop0(i, 1, 10, sequence0([S0; S1; Sb; S2; Se; S3]))

Figure 6.9: An example of a subtree of root forloop0

Sb(2); S2(2); Se(2); S3(2);
S0(3); S1(3); Sb(3); S2(3); Se(3); S3(3);
S0(4); S1(4); Sb(4); S2(4); Se(4); S3(4);
S0(5); S1(5); Sb(5); S2(5); Se(5)

Figure 6.10: An example of a path execution trace from Sb to Se in the subtree forloop0

the memory store of Sbegin to the memory store of Send, along any sequential execution that links the two memory states. path_transformer(S, Pbegin, Pend) is the path transformer of the subtree of S with all vertices not on the Pbegin-to-Pend path execution trace pruned; it is the path transformer between statements Sbegin and Send if Pbegin = (Mbegin, Sbegin) and Pend = (Mend, Send).

path_transformer uses a variable mode to indicate the current context on the compile-time version of the execution trace. Modes are propagated along the depth-first, left-to-right traversal of statements: (1) mode = sequence when there is no loop on the trace at the current statement level; (2) mode = prologue on the prologue part of an enclosing loop; (3) mode = permanent on the full part of an enclosing loop; and (4) mode = epilogue on the epilogue part of an enclosing loop. The algorithm is described in Algorithm 15.

A first call to the recursive function pt_on(S, Pbegin, Pend, mode), with mode = sequence, is made. In this function, many path transformer operations are performed. A path transformer T is recursively defined by:

T ::= id | ⊥ | T1 ◦ T2 | T1 ⊔ T2 | T^k | T(S)

These operations on path transformers are specified as follows

• id is the identity transformer, used when no variables are modified, i.e., they are all constrained by an equality;

• ⊥ denotes the undefined (empty) transformer, i.e., one in which the set of affine constraints of the system is not feasible;

• '◦' denotes the composition of transformers: T1 ◦ T2 is overapproximated by the union of the constraints in T1 (on variables x1, ..., xn, x″1, ..., x″p) with the constraints in T2 (on variables x″1, ..., x″p, x′1, ..., x′n′), then projected on x1, ..., xn, x′1, ..., x′n′ to eliminate the "intermediate" variables x″1, ..., x″p (this definition is extracted from [76]). Note that T ◦ id = T, for all T;

• '⊔' denotes the convex hull of its transformer arguments. The usual union of two transformers is a union between two convex polyhedra, which is not necessarily a convex polyhedron; a convex hull approximation is required in order to obtain a convex result. The convex hull operation provides the smallest polyhedron enclosing the union;

• '^k' denotes the effect of k iterations of a transformer T. In practice, heuristics are used to compute a transitive closure approximation T*; thus this operation generally provides an approximation of the transformer T^k. Note that PIPS uses here the Affine Derivative Closure algorithm [18];

• T(S) is the transformer of a statement S, defined in Section 2.6.1; we suppose these transformers already exist. A transformer T is defined by a list of arguments and a predicate system labeled by T.

We distinguish several cases in the pt_on(S, (Mbegin, Sbegin), (Mend, Send), mode) function. The computation of the path transformer begins when S is Sbegin, i.e., the path is beginning; we return the transformer of


ALGORITHM 15: AST-based Path Transformer algorithm

function path_transformer(S, (Mbegin, Sbegin), (Mend, Send))
  return pt_on(S, (Mbegin, Sbegin), (Mend, Send), sequence)
end

function pt_on(S, (Mbegin, Sbegin), (Mend, Send), mode)
  if (Send = S) then return id
  elseif (Sbegin = S) then return T(S)
  switch (S)
    case call:
      return (mode = permanent
              or (mode = sequence and Sbegin ≤ S and S < Send)
              or (mode = prologue and Sbegin ≤ S)
              or (mode = epilogue and S < Send)) ? T(S) : id
    case sequence(S1; S2; ...; Sn):
      if (|{S1, S2, ..., Sn}| = 0) then return id
      else
        rest = sequence(S2; ...; Sn)
        return pt_on(S1, (Mbegin, Sbegin), (Mend, Send), mode) ◦
               pt_on(rest, (Mbegin, Sbegin), (Mend, Send), mode)
    case test(Econd, St, Sf):
      Tt = pt_on(St, (Mbegin, Sbegin), (Mend, Send), mode)
      Tf = pt_on(Sf, (Mbegin, Sbegin), (Mend, Send), mode)
      if ((Sbegin ⊂ St and Send ⊂ Sf) or (Send ⊂ St and Sbegin ⊂ Sf)) then
        if (mode = permanent) then
          return Tt ⊔ Tf
        else
          return ⊥
      else if (Sbegin ⊂ St or Send ⊂ St) then
        return T(Econd) ◦ Tt
      else if (Sbegin ⊂ Sf or Send ⊂ Sf) then
        return T(¬Econd) ◦ Tf
      else
        return T(S)
    case forloop(I, Elower, Eupper, Sbody):
      return pt_on_loop(S, (Mbegin, Sbegin), (Mend, Send), mode)
end

Sbegin. It ends when S is Send; there, we close the path and we return the identity transformer id. The other cases depend on the type of the statement, as follows.


Call

When S is a function call, we return the transformer of S, if S belongs to the path from Sbegin to Send, and id otherwise; we use the lexicographic ordering < to compare statement paths.

Sequence

When S is a sequence of n statements, we return the composition of the transformers of each statement in this sequence. An example of the result provided by the path transformer algorithm for a sequence is illustrated in Figure 6.11. The left side presents a code fragment that contains a sequence of four instructions. The right side shows the resulting path transformer between the two statements Sb and Se: it is the composition of the transformers of the first three statements; note that the transformer of Se is not included.

/* S */
{
  Sb: i = 42;
  i = i + 5;
  i = i * 3;
  Se: i = i + 4;
}

path_transformer(S, (⊥, Sb), (⊥, Se)) =
  T(i) {i = 141}

Figure 6.11: An example of a C code and the path transformer of the sequence between Sb and Se; M is ⊥, since there are no loops in the code
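The constant i = 141 in this transformer can be checked by simply executing the statements that lie on the path (a small check of our own, not part of the thesis toolchain):

#include <assert.h>
#include <stdio.h>

int main(void) {
  int i;
  i = 42;            /* Sb                                       */
  i = i + 5;         /* on the path                              */
  i = i * 3;         /* on the path; transformer of Se excluded  */
  assert(i == 141);  /* memory store observed when reaching Se   */
  printf("i = %d\n", i);
  return 0;
}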

Test

When S is a test, the path transformer is computed according to the location of Sbegin and Send in the branches of the test. If one is in the true statement and the other in the false statement of the test, the path transformer is the union of the true and false transformers iff the mode is permanent, i.e., there is at least one loop enclosing S. If only one of the two statements is enclosed in the true or in the false branch of the test, the transformer of the condition (true or false) is composed with the transformer of the corresponding enclosing branch (true or false) and is returned. An example of the result provided by the path transformer algorithm for a test is illustrated in Figure 6.12. The left side presents a code fragment that contains a test. The right side shows the resulting path transformer between the two statements Sb and Se: it is the composition of the transformers of the statements from Sb to the statement just before the test, of the transformer of the true condition (i>0), and of the transformers of the statements in the true branch just before Se; note that the transformer of Se is not included.


/* S */
{
  Sb: i = 4;
  i = i * 3;
  if (i > 0) {
    k1++;
    Se: i = i + 4;
  } else {
    k2++;
    i--;
  }
}

path_transformer(S, (⊥, Sb), (⊥, Se)) =
  T(i, k1) {i = 12, k1 = k1#init + 1}

Figure 6.12: An example of a C code and the path transformer of the sequence of calls and test between Sb and Se; M is ⊥, since there are no loops in the code

Loop

When S is a loop, we call the function pt_on_loop (see Algorithm 16). Note that, if loop indices are not specified for a statement in the map function, the algorithm sets ibegin to the lower bound of the loop and iend to the upper one. In this function, a composition of two transformers is computed, depending on the value of mode: (1) the prologue transformer, from the iteration ibegin until the upper bound of the loop, or until the iteration iend if Send belongs to this loop; and (2) the epilogue transformer, from the lower bound of the loop until the upper bound of the loop, or until the iteration iend if Send belongs to this loop. These two transformers are computed using the function iter, which we show in Algorithm 17.

iter computes the transformer of a number of iterations of a loop S, when loop bounds are constant or symbolic functions, via the predicate function is_constant_p on an expression, which returns true if this expression is constant. It computes a loop transformer that is the transformer of the proper number of iterations of this loop: it is equal to the transformer of the looping condition, Tenter, composed with the transformer of the loop body, composed with the transformer of the incrementation statement, Tinc. In practice, when h − l is not a constant, a transitive closure approximation T* is computed, using the function affine_derivative_closure already implemented in PIPS, and generally provides an approximation of the transformer T^(h−l+1).

The case where ibegin = iend means that we execute only one iteration of the loop from Sbegin to Send. Therefore, we compute Ts, i.e., a transformer on the loop body with the sequence mode, using Function


ALGORITHM 16: Case for loops of the path transformer algorithm

function pt_on_loop(S, (Mbegin, Sbegin), (Mend, Send), mode)
  forloop(I, Elower, Eupper, Sbody) = S
  low = ibegin = (undefined(Mbegin(I))) ? Elower : Mbegin(I)
  high = iend = (undefined(Mend(I))) ? Eupper : Mend(I)
  Tinit = T(I = Elower)
  Tenter = T(I <= Eupper)
  Tinc = T(I = I+1)
  Texit = T(I > Eupper)
  if (mode = permanent) then
    T = iter(S, (Mbegin, Sbegin), (Mend, Send), Elower, Eupper)
    return Tinit ◦ T ◦ Texit
  Ts = pt_on_loop_one_iteration(S, (Mbegin, Sbegin), (Mend, Send))
  Tp = Te = p = e = id
  if (mode = prologue or mode = sequence) then
    if (Sbegin ⊂ S)
      p = pt_on(Sbody, (Mbegin, Sbegin), (Mend, Send), prologue) ◦ Tinc
      low = ibegin + 1
    high = (mode = sequence) ? iend - 1 : Eupper
    Tp = p ◦ iter(S, (Mbegin, Sbegin), (Mend, Send), low, high)
  if (mode = epilogue or mode = sequence) then
    if (Send ⊂ S)
      e = Tenter ◦ pt_on(Sbody, (Mbegin, Sbegin), (Mend, Send), epilogue)
    high = (mode = sequence) ? -1 : iend - 1
    low = Elower
    Te = iter(S, (Mbegin, Sbegin), (Mend, Send), low, high) ◦ e
  T = Tp ◦ Te
  T = ((Sbegin ⊄ S) ? Tinit : id) ◦ T ◦ ((Send ⊄ S) ? Texit : id)
  if (is_constant_p(Eupper) and is_constant_p(Elower)) then
    if (mode = sequence and ibegin = iend) then
      return Ts
    else
      return T
  else
    return T ⊔ Ts
end

pt_on_loop_one_iteration, presented in Algorithm 17; this eliminates the statements outside the path from Sbegin to Send. Moreover, since with symbolic parameters we cannot have this precision (we may execute one iteration or more), a union of T and Ts is returned. The result should be composed with the transformer of the index initialization Tinit and/or the exit condition of the loop Texit, depending on the position of Sbegin and Send in the loop.

An example of the result provided by the path transformer algorithm is illustrated in Figure 6.13. The left side presents a code that contains a


ALGORITHM 17: Transformers for one iteration and for a constant/symbolic number of iterations h − l + 1 of a loop S

function pt_on_loop_one_iteration(S, (Mbegin, Sbegin), (Mend, Send))
  forloop(I, Elower, Eupper, Sbody) = S
  Tinit = T(I = Elower)
  Tenter = T(I <= Eupper)
  Tinc = T(I = I+1)
  Texit = T(I > Eupper)
  Ts = pt_on(Sbody, (Mbegin, Sbegin), (Mend, Send), sequence)
  Ts = (Sbegin ⊂ S and Send ⊄ S) ? Ts ◦ Tinc ◦ Texit : Ts
  Ts = (Send ⊂ S and Sbegin ⊄ S) ? Tinit ◦ Tenter ◦ Ts : Ts
  return Ts
end

function iter(S, (Mbegin, Sbegin), (Mend, Send), l, h)
  forloop(I, Elower, Eupper, Sbody) = S
  Tenter = T(I <= h)
  Tinc = T(I = I+1)
  Tbody = pt_on(Sbody, (Mbegin, Sbegin), (Mend, Send), permanent)
  Tcomplete_body = Tenter ◦ Tbody ◦ Tinc
  if (is_constant_p(Eupper) and is_constant_p(Elower))
    return T(I = l) ◦ (Tcomplete_body)^(h−l+1)
  else
    Tstar = affine_derivative_closure(Tcomplete_body)
    return Tstar
end

sequence of two loops and an incrementation instruction, which makes the computation of the path transformer necessary when performing parallelization. Indeed, comparing array accesses in the two loops must take into account the fact that the n used in the first loop is not the same as the n in the second one, because it is incremented between the two loops. The right side shows the resulting path transformer between the two statements Sb and Se, i.e., the result of the call path_transformer(S, (Mb, Sb), (Me, Se)). We suppose here that Mb(i) = 0 and Me(j) = n − 1. Notice that the constraint n = n#init + 1 is important to detect that n is changed between the two loops via the n++ instruction.

An optimization, for the case of a statement S of type sequence or loop, would be to return the transformer of S, T(S), if S contains neither Sbegin nor Send.


/* S */
{
  l1:
  for (i = 0; i < n; i++) {
    Sb: a[i] = a[i]+1;
    k1++;
  }
  n++;
  l2:
  for (j = 0; j < n; j++) {
    k2++;
    Se: b[j] = a[j]+a[j+1];
  }
}

path_transformer(S, (Mb, Sb), (Me, Se)) =
  T(i, j, k1, k2, n) {
    i + k1#init = i#init + k1,
    j + k2#init = k2 − 1,
    n = n#init + 1,
    i#init + 1 ≤ i,
    n#init ≤ i,
    k2#init + 1 ≤ k2,
    k2 ≤ k2#init + n
  }

Figure 6.13: An example of a C code and the path transformer between Sb and Se

6.4.3 Operations on Regions Using Path Transformers

In order to make possible operations combining the sets of regions of two statements S1 and S2, the set of regions R1 of S1 must be brought to the store of the second set of regions, R2, of S2. Therefore, we first compute the path transformer that links the two statements of the two sets of regions, as follows:

T12 = path_transformer(S, (M1, S1), (M2, S2))

where S is a statement that verifies S1 ⊂ S and S2 ⊂ S; M1 and M2 contain information about the indices of these statements in the potentially existing loops in S. Then we compose the second set of regions, R2, with T12 to bring R2 to a common memory store with R1. We thus use the operation region_transformer_compose, already implemented in PIPS [34], as follows:

R′2 = region_transformer_compose(R2, T12)

We thus update the operations defined in Section 2.6.3: (1) the intersection of the two sets of regions R1 and R2 is obtained via the call

regions_intersection(R1, R′2)

(2) the difference between R1 and R2 is obtained via the call

regions_difference(R1, R′2)

and (3) the union of R1 and R2 is obtained via the call

regions_union(R1, R′2)

For example, if we want to compute the RAW dependences between the Write regions Rw1 of Loop l1 and the Read regions Rr2 of Loop l2 in the code in Figure 6.13, we must first find T12, the path transformer between the two loops. The result is then:

RAW(l1, l2) = regions_intersection(Rw1,
                region_transformer_compose(Rr2, T12))

If we assume that the size of Arrays a and b is n = 20, for example, we obtain the following results:

Rw1 = <a(φ1)-W-EXACT-{0 ≤ φ1, φ1 ≤ 19, φ1 + 1 ≤ n}>
Rr2 = <a(φ1)-R-EXACT-{0 ≤ φ1, φ1 ≤ 19, φ1 ≤ n}>
T12 = T(i, k1, n) {k1 = k1#init + i, n = n#init + 1, i ≥ 0, i ≥ n#init}
R′r2 = <a(φ1)-R-EXACT-{0 ≤ φ1, φ1 ≤ 19, φ1 ≤ n + 1}>
Rw1 ∩ R′r2 = <a(φ1)-WR-EXACT-{0 ≤ φ1, φ1 ≤ 19, φ1 + 1 ≤ n}>

Notice how the φ1 accesses to Array a differ in Rr2 and R′r2 (after composition with T12), because of the equality n = n#init + 1 in T12, due to the modification of n between the two loops via the n++ instruction.

In our context of combining path transformers with regions to compute communication costs, dependences and data volumes between the vertices of a DAG, for each loop of index i, M(i) must be set to the lower bound of the loop and iend to the upper one (the default values). We are not interested here in computing the path transformer between statements of specific iterations. However, our algorithm is general enough to handle, for example, loop parallelism or transformations on loops where one wants access to statements of a particular iteration.

6.5 BDSC-Based Hierarchical Scheduling (HBDSC)

Now that all the information needed by the basic version of BDSC presented in the previous chapter has been gathered, we detail in this section how we suggest to adapt it to the different SDGs linked hierarchically via the mapping function H introduced in Section 6.2.2, in order to eventually generate nested parallel code when possible. We adopt in this section the unstructured parallel programming model of Section 4.2.2, since it offers the freedom to implement arbitrary parallel patterns and since SDGs implement this model. Therefore, we use the parallel unstructured construct of SPIRE to encode the generated parallel code.


6.5.1 Closure of SDGs

An SDG G of a nested sequence S may depend upon statements outside S. We introduce the function closure, defined in Algorithm 18, to provide a self-contained version of G. There, G is completed with a set of entry vertices and edges in order to recover the dependences coming from outside S. We use the predecessors τp in the DDG D to find these dependences Rs; for each dependence, we create an entry vertex and we schedule it in the same cluster as τp. Function reference is used here to find the reference expression of the dependence. Note that these import vertices will be used when scheduling S with BDSC, in order to compute the updated values of tlevel and blevel, impacted by the communication cost of each vertex in G relative to all the SDGs in H. We set their task_time to 0.

ALGORITHM 18: Computation of the closure of the SDG G of a statement S = sequence(S1; ...; Sn)

function closure(sequence(S1; ...; Sn), G)
  D = DDG(sequence(S1; ...; Sn))
  foreach Si ∈ {S1, S2, ..., Sn}
    τd = statement_vertex(Si, D)
    τk = statement_vertex(Si, G)
    foreach τp / (τp ∈ predecessors(τd, D) and τp ∉ vertices(G))
      Rs = regions_intersection(read_regions(Si),
                                write_regions(vertex_statement(τp)))
      foreach R ∈ Rs
        sentry = import(reference(R))
        τentry = new_vertex(sentry)
        cluster(τentry) = cluster(τp)
        add_vertex(τentry, G)
        add_edge((τentry, τk), G)
  return G
end

For instance, in the C code given in the left of Figure 6.3, parallelizing S0 may decrease its execution time, but it may also increase the execution time of the loop indexed by i; this could be due to communications coming from outside S0, if we presume, to simplify, that every statement is scheduled on a different cluster. The corresponding SDG of S0, presented in Figure 6.4, is not complete because of dependences, not explicit in this DAG, coming from outside S0. In order to make G0 self-contained, we add these dependences via a set of additional import entry vertices τentry, scheduled in the same clusters as the corresponding statements outside S0. We extract these dependences from the DDG presented in the right of Figure 6.3, as illustrated in Function closure; the closure of G0 is given in Figure 6.14. The scheduling of S0 needs to take into account all the dependences coming from outside S0: we thus add three vertices, to encode the communication costs of i, a[i] and c at the entry of S0.

Figure 6.14: Closure of the SDG G0 of the C code S0 in Figure 6.3, with additional import entry vertices

6.5.2 Recursive Top-Down Scheduling

Hierarchically scheduling a given statement S of SDG H(S) in a cluster κ is seen here as the definition of a hierarchical schedule σ, which maps each substatement s of S to σ(s) = (s′, κ, n). If there are enough processor and memory resources to schedule S using BDSC, (s′, κ, n) is a triplet made of a parallel statement s′ = parallel(σ(s)), the cluster κ = cluster(σ(s)) where s is being allocated and the number n = nbclusters(σ(s)) of clusters the inner scheduling of s′ requires. Otherwise, scheduling is impossible, and the program stops. In a scheduled statement, all sequences are replaced by parallel unstructured statements.

A successful call to the HBDSC(S, H, κ, P, M, σ) function, defined in Algorithm 19, which assumes that P is strictly positive, yields a new version of σ that schedules S into κ and takes into account all the substatements of S; only P clusters, with a data size of at most M each, can be used for scheduling. σ[S → (S′, κ, n)] is the function equal to σ, except for S, where its value is (S′, κ, n). H is the function that yields an SDG for each S to be scheduled using BDSC.

Our approach is top-down, in order to yield tasks that are as coarse grained as possible when dealing with sequences. In the HBDSC function, we


ALGORITHM 19: BDSC-based update of Schedule σ for Statement S of SDG H(S), with P and M constraints

function HBDSC(S, H, κ, P, M, σ)
  switch (S)
    case call:
      return σ[S → (S, κ, 0)]
    case sequence(S1; ...; Sn):
      Gseq = closure(S, H(S))
      G′ = BDSC(Gseq, P, M, σ)
      iter = 0
      do
        σ′ = HBDSC_step(G′, H, κ, P, M, σ)
        G = G′
        G′ = BDSC(Gseq, P, M, σ′)
        if (clusters(G′) = ∅) then
          abort('Unable to schedule')
        iter++
        σ = σ′
      while (completion_time(G′) < completion_time(G) and
             |clusters(G′)| ≤ |clusters(G)| and
             iter ≤ MAX_ITER)
      return σ[S → (dag_to_unstructured(G), κ, |clusters(G)|)]
    case forloop(I, Elower, Eupper, Sbody):
      σ′ = HBDSC(Sbody, H, κ, P, M, σ)
      (S′body, κbody, nbclustersbody) = σ′(Sbody)
      return σ′[S → (forloop(I, Elower, Eupper, S′body), κ, nbclustersbody)]
    case test(Econd, St, Sf):
      σ′ = HBDSC(St, H, κ, P, M, σ)
      σ″ = HBDSC(Sf, H, κ, P, M, σ′)
      (S′t, κt, nbclusterst) = σ″(St)
      (S′f, κf, nbclustersf) = σ″(Sf)
      return σ″[S → (test(Econd, S′t, S′f), κ, max(nbclusterst, nbclustersf))]
end

distinguish four cases of statements. First, the constructs of loops⁸ and tests are simply traversed, scheduling information being recursively gathered in different SDGs. Then, for a call statement, there is no descent in the call

⁸Regarding parallel loops: since we adopt the task parallelism paradigm, note that, initially, it may be useful to apply the tiling transformation and then perform full unrolling of the outer loop (we give more details on the protocol of our experiments in Section 8.4). This way, the input code contains more potentially parallel tasks, resulting from the initial (parallel) loop.


graph; the call statement is returned. In order to handle the corresponding called function, one has to treat the different functions separately. Finally, for a sequence S, one first accesses its SDG and computes a closure of this DAG, Gseq, using the function closure defined above. Next, Gseq is scheduled using BDSC to generate a scheduled SDG, G′.

The hierarchical scheduling process is then recursively performed, to take into account the substatements of S, within Function HBDSC_step, defined in Algorithm 20, on each statement s of each task of G′. There, G′ is traversed along a topological sort-ordered descent using the function topsort: topsort(G′) yields a list of stages of computation, each cluster_stage being a list of independent lists L of tasks τ, one L for each cluster κ generated by BDSC for this particular stage in the topological order.

The recursive hierarchical scheduling, via HBDSC within the function HBDSC_step, of each statement s = vertex_statement(τ) may take advantage of at most P′ available clusters, since |cluster_stage| clusters are already reserved to schedule the current stage cluster_stage of tasks for Statement S. It yields a new scheduling function σs. Otherwise, if no clusters are available, all the statements enclosed in s are scheduled on the same cluster as their parent, κ. We use the straightforward function same_cluster_mapping (not provided here) to recursively assign (se, κ, 0) to σ(se), for each se enclosed in s.

Figure 6.15 illustrates the various entities involved in the computation of such a scheduling function. Note that one needs to be careful, in HBDSC_step, to ensure that each rescheduled substatement s is allocated a number of clusters consistent with the one used when computing its parallel execution time: we check the condition nbclusters′s ≥ nbclusterss, which ensures that the parallelism assumed when computing time complexities within s remains available.

Figure 6.15: topsort(G) for the hierarchical scheduling of sequences

Cluster allocation information for each substatement s, whose vertex in G′ is τ, is maintained in σ via the recursive call to HBDSC, this time with the current cluster κ = cluster(τ).


ALGORITHM 20: Iterative hierarchical scheduling step for DAG fixpoint computation

function HBDSC_step(G′, H, κ, P, M, σ)
  foreach cluster_stage ∈ topsort(G′)
    P′ = P - |cluster_stage|
    foreach L ∈ cluster_stage
      nbclustersL = 0
      foreach τ ∈ L
        s = vertex_statement(τ)
        if (P′ ≤ 0) then
          σ = σ[s → same_cluster_mapping(s, κ, σ)]
        else
          nbclusterss = (s ∈ domain(σ)) ? nbclusters(σ(s)) : 0
          σs = HBDSC(s, H, cluster(τ), P′, M, σ)
          nbclusters′s = nbclusters(σs(s))
          if (nbclusters′s ≥ nbclusterss and
              task_time(τ, σ) ≥ task_time(τ, σs)) then
            nbclusterss = nbclusters′s
            σ = σs
          nbclustersL = max(nbclustersL, nbclusterss)
      P′ -= nbclustersL
  return σ
end

For the non-sequence constructs, in Function HBDSC, cluster information is set to κ, the current cluster.

The scheduling of a sequence yields a parallel unstructured statement: we use the function dag_to_unstructured(G), which returns a SPIRE unstructured statement Su built from the SDG G, where the vertices of G are the statement control vertices su of Su and the edges of G constitute the lists of successors Lsucc of the su, while the lists of predecessors Lpred of the su are deduced from the Lsucc. Note that the vertices and edges of G are not changed before and after scheduling; however, the scheduling information is saved in σ.

As an application of our scheduling algorithm on a real application, Figure 6.16 shows the scheduled SDG for Harris, obtained using P = 3 and generated automatically with PIPS, using the Graphviz tool.

6.5.3 Iterative Scheduling for Resource Optimization

Figure 6.16: Scheduled SDG for Harris, using P = 3 cores; the scheduling information, via cluster(τ), is also printed inside each vertex of the SDG

BDSC is called in HBDSC before substatements are hierarchically scheduled. However, a unique pass over substatements could be suboptimal, since parallelism may exist within substatements; it may be discovered by later recursive calls to HBDSC. Yet, if this parallelism had been known ahead of time, the previous values of task_time used by BDSC would possibly have been smaller, which could have had an impact on the higher-level scheduling. In order to address this issue, our hierarchical scheduling algorithm iterates the top-down pass HBDSC_step on the new DAG G′, in which BDSC takes into account these modified task complexities; iteration continues while G′ provides a smaller DAG scheduling length than G and the iteration limit MAX_ITER has not been reached. We compute the completion time of the DAG G as follows:

completion_time(G) = max_{κ ∈ clusters(G)} cluster_time(κ)
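A minimal sketch of this completion-time computation and of the iteration test used in Algorithm 19 is given below; the data layout (one array of cluster times per scheduled DAG) is our own simplification, not the PIPS data structures.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* completion_time(G) = max over clusters of cluster_time(κ) */
static double completion_time(const double *cluster_time, size_t nb_clusters) {
  double max = 0.0;
  for (size_t k = 0; k < nb_clusters; k++)
    if (cluster_time[k] > max)
      max = cluster_time[k];
  return max;
}

/* Iteration test of Algorithm 19: keep iterating while G' is strictly
   shorter than G, uses no more clusters, and MAX_ITER is not reached. */
static bool keep_iterating(const double *g, size_t ng,
                           const double *gp, size_t ngp,
                           int iter, int max_iter) {
  return completion_time(gp, ngp) < completion_time(g, ng)
      && ngp <= ng
      && iter <= max_iter;
}

int main(void) {
  double g[]  = { 12.0, 30.0, 25.0 };   /* cluster times of G  */
  double gp[] = { 18.0, 22.0 };         /* cluster times of G' */
  printf("iterate? %d\n", keep_iterating(g, 3, gp, 2, 1, 5));  /* 1: 22 < 30 */
  return 0;
}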

One constraint, due to the iterative nature of the hierarchical scheduling, is that, in BDSC, zeroing cannot be performed between the import entry vertices and their successors. This keeps an independence, in terms of allocated clusters, between the different levels of the hierarchy. Indeed, at a higher level, for S, if we assume that we have scheduled the parts Se inside (hierarchically) S, attempting to reschedule S iteratively cancels the previous schedule of S but maintains the schedule of the Se, and vice versa. Therefore, for each sequence, we have to deal with a new set of clusters, and thus zeroing cannot be performed between these entry vertices and their successors.

Note that our top-down, iterative, hierarchical scheduling approach also helps dealing with limited memory resources. If BDSC fails at first because not enough memory is available for a given task, the HBDSC_step function is nonetheless called to schedule nested statements, possibly loosening up the memory constraints by distributing some of the work on additional, less memory-challenged clusters. This might enable the subsequent call to BDSC to succeed.


For instance, in Figure 6.17, we assume that P = 3 and that the memory M of each available cluster, κ0, κ1 or κ2, is less than the size of Array a. Therefore, the first application of BDSC on S = sequence(S0;S1) will fail ("Not enough memory"). Thanks to the while loop introduced in the hierarchical scheduling algorithm, the scheduling of the hierarchical parts inside S0 and S1 hopefully minimizes the task_data of each statement. Thus, a second schedule succeeds, as illustrated in Figure 6.18. Note that the edge_cost between tasks S0 and S1 is equal to 0 (the region on the edge in the figure is equal to the empty set). Indeed, the communication is made at an inner level, between κ1 and κ3, due to the entry vertex added to H(S1body). This cost is obtained after the hierarchical scheduling of S; it is thus a parallel edge cost (we define the parallel task_data and edge_cost of a parallel statement task in Section 6.5.5).

S0:
for (i = 1; i <= n; i++) {   /* S0body */
  a[i] = foo();
  bar(b[i]);
}

S1:
for (i = 1; i <= n; i++) {   /* S1body */
  sum += a[i];
  c[i] = baz();
}

In H(S), the edge from S0 to S1 is labeled with the region <a(φ1)-WR-EXACT-{1 ≤ φ1, φ1 ≤ n}>.

Figure 6.17: After the first application of BDSC on S = sequence(S0;S1), failing with "Not enough memory"

6.5.4 Complexity of HBDSC Algorithm

Theorem 2. The time complexity of Algorithm 19 (HBDSC) over Statement S is O(kⁿ), where n is the number of call statements in S and k a constant greater than 1.

Proof. Let t(l) be the worst-case time complexity of our hierarchical scheduling algorithm on the structured statement S of hierarchical level⁹ l.

⁹Levels represent the hierarchy structure between the statements of the AST and are counted up from the leaves to the root.


[Figure 6.18 shows the hierarchically scheduled SDGs: in H(S), both S0 (for (i=1; i<=n; i++) { a[i]=foo(); bar(b[i]); }) and S1 (for (i=1; i<=n; i++) { sum+=a[i]; c[i]=baz(); }) are scheduled on cluster κ0; in H(S0body), a[i]=foo() is on κ1 and bar(b[i]) on κ2; in closure(S1body, H(S1body)), an import(a[i]) entry vertex on κ1 communicates the region <a(φ1)-WR-EXACT-{φ1 = i}> to sum+=a[i] on κ3, while c[i]=baz() is on κ4.]

Figure 6.18: Hierarchically (via H) scheduled SDGs with memory resource minimization, after the second application of BDSC (which succeeds) on sequence(S0;S1) (κi = cluster(σ(Si))). To keep the picture readable, only communication edges are shown in these SDGs

Time complexity increases significantly only in sequences, loops and tests being simply managed by straightforward recursive calls of HBDSC on substatements. For a sequence S, t(l) is proportional to the time complexity of BDSC, followed by a call to HBDSC_step; the proportionality constant is k = MAX_ITER (supposed to be greater than 1).

The time complexity of BDSC for a sequence of m statements is at most O(m³) (see Theorem 1). Assuming that all subsequences have a maximum number m of (possibly compound) statements, the time complexity of the hierarchical scheduling step function is the time complexity of the topological sort algorithm followed by a recursive call to HBDSC, and is thus O(m² + m·t(l−1)). Thus t(l) is at most proportional to k(m³ + m² + m·t(l−1)) ∼ km³ + km·t(l−1). Since t(l) is an arithmetico-geometric series, its analytical


value is t(l) = ((km)^l (km³ + km − 1) − km³)/(km − 1) ∼ (km)^l m². Let lS be the level of the whole Statement S. The worst performance occurs when the structure of S is flat, i.e., when lS ∼ n and m is O(1); hence t(n) = t(lS) ∼ kⁿ.

Even though the worst-case time complexity of HBDSC is exponential, we expect, and our experiments suggest, that it behaves more tamely on actual, properly structured code. Indeed, note that lS ∼ log_m(n) if S is balanced, for some large constant m; in this case, t(n) ∼ (km)^log(n), showing a subexponential time complexity.

6.5.5 Parallel Cost Models

In Section 6.3, we presented the sequential cost models usable in the case of sequential codes, i.e., for each first call to BDSC. When a substatement Se of S (Se ⊂ S) is parallelized, the parameters task_time, task_data and edge_cost are modified for Se, and thus for S. Hierarchical scheduling must therefore use extended definitions of task_time, task_data and edge_cost for tasks τ whose statements S = vertex_statement(τ) are parallel unstructured statements, extending the definitions provided in Section 6.3, which still apply to non-unstructured statements. For such a case, we assume that BDSC and the other relevant functions take σ as an additional argument, to access the scheduling result associated to statement sequences, and handle the modified definitions of task_time, edge_cost and task_data. These functions are defined as follows.

Parallel Task Time

When a statement S = vertex_statement(τ) corresponds to a parallel unstructured statement (i.e., such that S ∈ domain(σ)), let Su = parallel(σ(S)) be this parallel unstructured statement. We define task_time as follows, where Function clusters returns the set of clusters used to schedule Su and Function control_statements returns the statements of Su:

task_time(τ, σ) = max_{κ ∈ clusters(Su, σ)} cluster_time(κ)
clusters(Su, σ) = {cluster(σ(s)) / s ∈ control_statements(Su)}

Otherwise, the previous definition of task_time on S, provided in Section 6.3, is used. The recursive computation of complexities accesses this new definition in the case of parallel unstructured statements.

Parallel Task Data and Edge Cost

To define the task data and edge cost parameters, which depend on the computation of array regions10, we need extended versions of the read regions and write regions functions that take into account the allocation information to clusters. These functions take σ as an additional argument to access the allocation information cluster(σ).

10. Note that every operation combining regions of two statements used in this chapter assumes the use of the path transformer, via the function region transformer compose, as illustrated in Section 6.4.3, in order to bring the regions of the two statements into a common memory store.

S = vertex statement(τ)
task data(τ, σ) = regions union(read regions(S, σ), write regions(S, σ))

Rw1 = write regions(vertex statement(τ1), σ)
Rr2 = read regions(vertex statement(τ2), σ)

edge cost bytes(τ1, τ2, σ) = Σ_{r ∈ regions intersection(Rw1, Rr2)} ehrhart(r)

edge cost(τ1, τ2, σ) = α + β × edge cost bytes(τ1, τ2, σ)
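The following self-contained Python sketch illustrates these two formulas under a strong simplification of ours (not the PIPS implementation): a region is reduced to a one-dimensional integer interval [lo, hi], so that ehrhart(r) is simply its number of integer points.

    # Toy sketch of the edge cost model on 1-D interval "regions".
    def ehrhart(region):
        lo, hi = region
        return max(0, hi - lo + 1)          # number of integer points

    def regions_intersection(rw1, rr2):
        out = []
        for (l1, h1) in rw1:
            for (l2, h2) in rr2:
                lo, hi = max(l1, l2), min(h1, h2)
                if lo <= hi:
                    out.append((lo, hi))
        return out

    def edge_cost_bytes(rw1, rr2):
        return sum(ehrhart(r) for r in regions_intersection(rw1, rr2))

    def edge_cost(rw1, rr2, alpha, beta):
        return alpha + beta * edge_cost_bytes(rw1, rr2)

    # tau1 writes A[1..N], tau2 reads A[2..2N]: 9 points are exchanged.
    N = 10
    print(edge_cost([(1, N)], [(2, 2 * N)], alpha=5, beta=0.1))   # 5.9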

When statements are parallel unstructured statements Su, we propose to add a condition on the accumulation of regions in the sequential recursive computation functions of these regions, as follows, where Function stmts in cluster returns the set of statements scheduled in the same cluster as Su:

read regions(Su, σ) = regions union_{s ∈ stmts in cluster(Su, σ)} read regions(s, σ)

write regions(Su, σ) = regions union_{s ∈ stmts in cluster(Su, σ)} write regions(s, σ)

stmts in cluster(Su, σ) = {s ∈ control statements(Su) / cluster(σ(s)) = cluster(σ(Su))}

Otherwise, the sequential definitions of read regions and write regions

on S provided in Section 6.3 are used. The recursive computation of array regions accesses this new definition in the case of parallel unstructured statements.
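A minimal Python sketch of this restricted accumulation, under our own toy assumptions (statement names, a cluster map sigma and a region map regions are all invented for illustration):

    # Only the statements scheduled on the same cluster as Su contribute
    # to its accumulated read (or write) regions.
    sigma   = {"Su": 0, "s1": 0, "s2": 1, "s3": 0}   # statement -> cluster
    regions = {"s1": {"A"}, "s2": {"B"}, "s3": {"C"}}

    def stmts_in_cluster(su, control_statements):
        return [s for s in control_statements if sigma[s] == sigma[su]]

    def read_regions(su, control_statements):
        acc = set()
        for s in stmts_in_cluster(su, control_statements):
            acc |= regions[s]          # regions union in the text
        return acc

    # s2 is scheduled elsewhere, so only the regions of s1 and s3 remain.
    print(read_regions("Su", ["s1", "s2", "s3"]))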

6.6 Related Work: Task Parallelization Tools

In this section, we review several other approaches that intend to automate the parallelization of programs, using different granularities and scheduling policies. Given the breadth of literature on this subject, we limit this presentation to approaches that focus on static list-scheduling methods and compare them with BDSC.

Sarkar's work on the partitioning and scheduling of parallel programs [96] for multiprocessors introduced a compile-time method where a GR (Graphical Representation) program is partitioned into parallel tasks at compile



time. A GR graph has four kinds of vertices: "simple", to represent an indivisible sequential computation, "function call", "parallel", to represent parallel loops, and "compound", for conditional instructions. Sarkar presents an approximation parallelization algorithm. Starting with an initial fine-granularity partition P0, tasks (chosen by heuristics) are iteratively merged till the coarsest partition Pn (with one task containing all vertices) is reached after n iterations. The partition Pmin with the smallest parallel execution time in the presence of overhead (scheduling and communication overhead) is chosen. For scheduling, Sarkar introduces the EZ (Edge-Zeroing) algorithm, which uses bottom levels for ordering; it is based on edge weights for clustering: all edges are examined from the largest edge weight to the smallest, and the algorithm proceeds by zeroing the highest edge weight if the completion time decreases. While this algorithm is based only on the bottom level, assumes an unbounded number of processors and does not recompute the priorities after zeroings, BDSC adds resource constraints and is based on both bottom levels and dynamic top levels.

The OSCAR Fortran Compiler [62] is used as a multigrain parallelizer from Fortran to parallelized OpenMP Fortran. OSCAR partitions a program into a macro-task graph, where vertices represent macro-tasks of three kinds, namely basic, repetition and subroutine blocks. The coarse-grain task parallelization proceeds as follows. First, the macro-tasks are generated by decomposition of the source program. Then, a macro-flow graph is generated to represent data and control dependences on macro-tasks. The macro-task graph is subsequently generated via the analysis of parallelism among macro-tasks, using an earliest executable condition analysis that represents the conditions under which a given macro-task may begin its execution at the earliest time, assuming precedence constraints. If a macro-task graph has only data dependence edges, macro-tasks are assigned to processors by static scheduling. If a macro-task graph has both data and control dependence edges, macro-tasks are assigned to processors at run time by a dynamic scheduling routine. In addition to dealing with a richer set of resource constraints, BDSC targets both shared and distributed memory systems, with a cost model based on communication, used data and time estimations.

Pedigree [83] is a compilation tool based on the program dependence graph (PDG). The PDG is extended by adding a new type of vertex, a "Par" vertex, which groups children vertices reachable via the same branch conditions. Pedigree proceeds by estimating a latency for each vertex and data dependence edge weights in the PDG. The scheduling process orders the children and assigns them to a subset of the processors. For scheduling, vertices with minimum early and late times are given highest priority; the highest-priority ready vertex is selected for scheduling based on the synchronization overhead and latency. While Pedigree operates on assembly code, PIPS and our extension for task parallelism using BDSC offer a higher-level, source-to-source parallelization framework. Moreover, Pedigree-generated


code is specialized for symmetric multiprocessors only, while BDSC targets many architecture types thanks to its resource constraints and cost models.

The SPIR (Signal Processing Intermediate Representation) compiler [31] takes a sequential dataflow program as input and generates a multithreaded parallel program for an embedded multicore system. First, SPIR builds a stream graph, where a vertex corresponds to a kernel function call or to the condition of an "if" statement; an edge denotes a transfer of data between two kernel function calls or a control transfer by an "if" statement (true or false). Then, for task scheduling purposes, given a stream graph and a target platform, the task scheduler assigns each vertex to a processor in the target platform. It allocates stream buffers and generates DMA operations under given memory and timing constraints. The degree of automation of BDSC is larger than SPIR's, because the latter system needs several keyword extensions, plus the C code denoting the streaming scope within applications. Also, the granularity in SPIR is a function, whereas BDSC uses several granularity levels.

We collect in Table 6.2 the main characteristics of each parallelization tool addressed in this section.

Tool           | blevel | tlevel | Resource    | Dependence | Dependence | Execution time | Communication   | Memory model
               |        |        | constraints | control    | data       | estimation     | time estimation |
---------------+--------+--------+-------------+------------+------------+----------------+-----------------+---------------------
HBDSC          |   √    |   √    |     √       |     X      |     √      |       √        |        √        | Shared, distributed
Sarkar's work  |   √    |   X    |     X       |     X      |     √      |       √        |        √        | Shared, distributed
OSCAR          |   X    |   √    |     X       |     √      |     √      |       √        |        X        | Shared
Pedigree       |   √    |   √    |     X       |     √      |     √      |       √        |        X        | Shared
SPIR           |   X    |   √    |     √       |     √      |     √      |       √        |        √        | Shared

Table 6.2: Comparison summary between different parallelization tools

6.7 Conclusion

This chapter presents our BDSC-based hierarchical task parallelization approach, which uses a new data structure called SDG and a symbolic execution and communication cost model based on either static code analysis or a dynamic, instrumentation-based assessment tool. The HBDSC scheduling algorithm maps hierarchically, via a mapping function H, each task of the SDG to one cluster. The result is the generation of parallel code represented in SPIRE(PIPS IR), using the parallel unstructured construct. We have integrated this approach within the PIPS automatic parallelization compiler infrastructure (see Section 8.2 for the implementation details).

The next chapter illustrates two parallel transformations of SPIRE(PIPS IR) parallel code and the generation of OpenMP and MPI codes from SPIRE(PIPS IR).

An earlier version of the work presented in the previous chapter (Chapter 5) and a part of the work presented in this chapter have been submitted to Parallel Computing [65].

Chapter 7

SPIRE-Based Parallel Code Transformations and Generation

Satisfaction lies in the effort, not in the attainment; full effort is full victory.

Gandhi

In the previous chapter, we describe how sequential programs are analyzed, scheduled and represented in a parallel intermediate representation. Their parallel versions are expressed as graphs using the parallel unstructured attribute of SPIRE, but such graphs are not directly expressible in parallel languages. To test the flexibility of this parallelization approach, this chapter covers three aspects of parallel code transformations and generation: (1) the transformation of SPIRE(PIPS IR)1 code from the unstructured parallel programming model to a structured (high-level) model, using the spawn and barrier constructs, (2) the transformation from the shared to the distributed memory programming model, and (3) the targeting of parallel languages from SPIRE(PIPS IR). We have implemented our SPIRE-derived parallel IR and our BDSC-based task parallelization algorithm in the PIPS middle-end. We generate both OpenMP and MPI code from the same parallel IR, using the PIPS backend and its prettyprinters.

7.1 Introduction

Generally, the process of source-to-source code generation is a difficult task. In fact, compilers use at least two intermediate representations when transforming a source code written in a programming language, abstracted in a first IR, to a source code written in another programming language, abstracted in a second IR. Moreover, this difficulty is compounded whenever the resulting source code is to be executed on a machine not targeted by its language. The portability issue was a key factor when we designed SPIRE as a generic approach (Chapter 4). Indeed, representing different parallel constructs from different parallel programming languages makes the code generation process generic. In the context of this thesis, we need to generate the equivalent parallel version of a sequential code; representing the parallel code in SPIRE facilitates parallel code generation in different parallel programming languages.

1. Recall that SPIRE(PIPS IR) means that the SPIRE transformation function is applied to the sequential IR of PIPS; it yields a parallel IR of PIPS.

The parallel code encoded in SPIRE(PIPS IR) that we generate in Chapter 6 uses an unstructured parallel programming model based on static task graphs. However, structured programming constructs increase the level of abstraction when writing parallel codes, and structured parallel programs are easier to analyze. Moreover, parallel unstructured does not exist in parallel languages. In this chapter, we use SPIRE both for the input unstructured parallel code and the output structured parallel code, by translating the SPIRE unstructured construct to spawn and barrier constructs. Another parallel transformation we address is to transform a SPIRE shared memory code into a SPIRE distributed memory one, in order to adapt the SPIRE code to message-passing libraries/languages such as MPI, and thus to distributed memory system architectures.

A parallel program transformation takes the parallel intermediate representation of a code fragment and yields a new, equivalent parallel representation for it. Such transformations are useful for optimization purposes, or to enable other optimizations that intend to improve execution time, data locality, energy, memory use, etc. While parallel code generation outputs a final code, written in a parallel programming language, that can be compiled by its compiler and run on a target machine, a parallel transformation generates an intermediate code written in a parallel intermediate representation.

In this chapter, a parallel transformation takes a hierarchical schedule σ, defined in Section 6.5.2 on page 118, and yields another schedule σ′, which maps each statement S in domain(σ) to σ′(S) = (S′, κ, n), such that S′ belongs to the resulting parallel IR, κ = cluster(σ(S)) is the cluster where S is allocated, and n = nbclusters(σ(S)) is the number of clusters the inner scheduling of S′ requires. A cluster stands for a processor.
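The triple structure of σ′(S) can be pictured with a small, self-contained Python sketch (the class name ScheduleEntry and the rewrite helper are ours, purely for illustration):

    # Toy sketch of a hierarchical schedule entry: each statement maps to
    # (S', cluster(sigma(S)), nbclusters(sigma(S))).
    from dataclasses import dataclass

    @dataclass
    class ScheduleEntry:
        parallel_stmt: str   # S', the statement in the resulting parallel IR
        cluster: int         # kappa, the cluster S is allocated to
        nbclusters: int      # clusters required by the inner scheduling of S'

    def transform(sigma, rewrite):
        # A parallel transformation rewrites S'; cluster and nbclusters
        # are carried over, as in the definition above.
        return {s: ScheduleEntry(rewrite(e.parallel_stmt), e.cluster, e.nbclusters)
                for s, e in sigma.items()}

    sigma = {"S0": ScheduleEntry("unstructured(S0)", 0, 2)}
    print(transform(sigma, lambda s: s.replace("unstructured", "structured")))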

Today's typical architectures have multiple hierarchical levels. A multiprocessor has many nodes, each node has multiple clusters, each cluster has multiple cores, and each core has multiple physical threads. Therefore, in addition to distinguishing parallelism in terms of abstraction of parallel constructs, such as structured and unstructured, we can also classify parallelism into two types in terms of code structure: equilevel, when it is generated within a sequence, i.e., at zero hierarchy level, and hierarchical, or multilevel, when one manages the code hierarchy, i.e., multiple levels. In this chapter, we handle these two types of parallelism in order to enhance the performance of the generated parallel code.

The remainder of this chapter is structured into three sections. Firstly, Section 7.2 presents the first of the two parallel transformations we designed, the unstructured to structured parallel code transformation. Secondly, the transformation from shared memory to distributed memory codes is presented in Section 7.3. Moving to code generation issues, Section 7.4 details the generation of parallel programs from SPIRE(PIPS IR): first, Section 7.4.1 shows the mapping approach between SPIRE and different parallel languages; then, as illustrating cases, Section 7.4.2 introduces the generation of OpenMP and Section 7.4.3 the generation of MPI from SPIRE(PIPS IR). Finally, we conclude in Section 7.5. Figure 7.1 summarizes the parallel code transformations and generation applied in this chapter.

[Figure: flow of the transformations. Unstructured SPIRE(PIPS IR), with Schedule and SDG, goes through Structuring to yield Structured SPIRE(PIPS IR), then through Data Transfer Generation to yield a send/recv-based IR; from there, either OpenMP Task Generation (spawn → omp task) yields an OpenMP pragma-based IR, prettyprinted into OpenMP Code, or SPMDization (spawn → guards) yields an MPI-based IR, prettyprinted into MPI Code.]

Figure 7.1: Parallel code transformations and generation; blue indicates this chapter's contributions, an ellipse a process, and a rectangle results.


7.2 Parallel Unstructured to Structured Transformation

We present in Chapter 4 the application of the SPIRE methodology to the PIPS IR case; Chapter 6 shows how to construct unstructured parallel code presented in SPIRE(PIPS IR). In this section, we detail the algorithm transforming unstructured (mid, graph-based level) parallel code into structured (high, syntactic level) code expressed in SPIRE(PIPS IR), and the implementation of the structured model in PIPS. The goal behind structuring parallelism is to increase the level of abstraction of parallel programs and simplify analysis and optimization passes. Another reason is the fact that parallel unstructured does not exist in parallel languages. However, this has a cost in terms of performance if the code contains arbitrary parallel patterns. For instance, the example already presented in Figure 4.3, which uses unstructured constructs to form a graph, and copied again for convenience in Figure 7.2, can be represented in a low-level unstructured way using events ((b) of Figure 7.2) and in a structured way using spawn2 and barrier constructs ((c) of Figure 7.2). The structured representation (c) minimizes control overhead. Moreover, it is a trade-off between structuring and performance. Indeed, if the tasks have the same execution time, the structured way is better than the unstructured one, since no synchronizations using events are necessary. Therefore, depending on the run-time properties of the tasks, it may be better to run one form or the other.

7.2.1 Structuring Parallelism

Function unstructured to structured, presented in Algorithm 21, uses a hierarchical schedule σ that maps each substatement s of S to σ(s) = (s′, κ, n) (see Section 6.5) to handle new statements. It specifies how the unstructured scheduled statement parallel(σ(S)), presented in Section 6.5, is used to generate a structured parallel statement, encoded also in SPIRE(PIPS IR) using spawn and barrier constructs. Note that another transformation would be to generate events and spawn constructs from the unstructured constructs (as in (b) of Figure 7.2).

For an unstructured statement S (other constructs are simply traversed; parallel statements are reconstructed using the original statements), its control vertices are traversed along a topological sort-ordered descent. As an unstructured statement, by definition, is also a graph (control statements are the vertices, while predecessor (resp. successor) control statements Sn of a control statement S yield edges from Sn to S (resp. from S to Sn)), we use the same idea as in Section 6.5 (topsort), but this time from an

2. Note that, in this chapter, we assume for simplification purposes that constants, instead of temporary identifiers having the corresponding value, are used as the first parameter of spawn statements.


(a) [control flow graph: nop (entry); x = 3 and y = 5 in parallel; then u = x + 1 and z = x + y; nop (exit)]

(b)
barrier(
   ex = newEvent(0);
   spawn(0,
      x = 3;
      signal(ex);
      u = x + 1);
   spawn(1,
      y = 5;
      wait(ex);
      z = x + y);
   freeEvent(ex)
)

(c)
barrier(
   spawn(0, x = 3);
   spawn(1, y = 5)
)
barrier(
   spawn(0, u = x + 1);
   spawn(1, z = x + y)
)

Figure 7.2: Unstructured parallel control flow graph (a), and event-based (b) and structured (c) representations.

unstructured statement. We thus use Function topsort unstructured(S), which yields a list of stages of computation, each cluster stage being a list of independent lists L of statements s.

Cluster stages are seen as a set of fork-join structures; a fork-join structure can be seen as a set of spawn statements (to be launched in parallel) enclosed in a barrier statement. It can also be seen as a parallel sequence. In our implementation of structured code generation, we choose to generate spawn and barrier constructs instead of parallel sequence constructs because, in contrast to the parallel sequence construct, the spawn construct allows the implementation of recursive parallel tasks. Besides, we can represent a parallel loop via the spawn construct, but it cannot be implemented using parallel sequence.

New statements have to be added to domain(σ). Therefore, in order to set the third argument of σ (nbclusters(σ)) for these new statements, we use the function structured clusters, which harvests the clusters used to sched-


ALGORITHM 21: Bottom-up hierarchical unstructured to structured SPIRE transformation, updating a parallel schedule σ for Statement S

function unstructured_to_structured(S, p, σ)
   κ = cluster(σ(S))
   n = nbclusters(σ(S))
   switch (S)
   case call:
      return σ
   case unstructured(Centry, Cexit, parallel):
      Sstructured = []
      foreach cluster_stage ∈ topsort_unstructured(S)
         (Sbarrier, σ) = spawns_barrier(cluster_stage, κ, p, σ)
         nbarrier = |structured_clusters(statements(Sbarrier), σ)|
         σ = σ[Sbarrier → (Sbarrier, κ, nbarrier)]
         Sstructured ||= [Sbarrier]
      S′ = sequence(Sstructured)
      return σ[S → (S′, κ, n)]
   case forloop(I, Elower, Eupper, Sbody):
      σ = unstructured_to_structured(Sbody, p, σ)
      S′ = forloop(I, Elower, Eupper, parallel(σ(Sbody)))
      return σ[S → (S′, κ, n)]
   case test(Econd, St, Sf):
      σ = unstructured_to_structured(St, p, σ)
      σ = unstructured_to_structured(Sf, p, σ)
      S′ = test(Econd, parallel(σ(St)), parallel(σ(Sf)))
      return σ[S → (S′, κ, n)]
end

function structured_clusters({S1, S2, ..., Sn}, σ)
   clusters = ∅
   foreach Si ∈ {S1, S2, ..., Sn}
      if (cluster(σ(Si)) ∉ clusters)
         clusters ∪= {cluster(σ(Si))}
   return clusters
end

ule the list of statements in a sequence. Recall that Function statements of a sequence returns its list of statements.

We use Function spawns barrier, detailed in Algorithm 22, to form spawn and barrier statements. In a cluster stage, each list L for each cluster κs corresponds to a spawn statement Sspawn. In order to attach a cluster number, as an entity, to the spawn statement, as required by SPIRE, we associate a cluster number to the cluster κs returned by cluster(σ(s)), via the cluster number function; we thus posit that two clusters are equal if they have the same cluster number. However, since BDSC maps tasks to logical clusters in Algorithm 19, with numbers in [0, p − 1], we convert this logical number to a physical one using the formula nphysical = P − p + cluster number(κs), where P is the total number of clusters available in the target machine.

ALGORITHM 22: Spawns-barrier generation for a cluster stage

function spawns_barrier(cluster_stage, κ, p, σ)
   p′ = p - |cluster_stage|
   SL = []
   foreach L ∈ cluster_stage
      L′ = []
      nbclustersL = 0
      foreach s ∈ L
         κs = cluster(σ(s))           // all s have the same schedule κs
         if (p′ > 0)
            σ = unstructured_to_structured(s, p′, σ)
         L′ ||= [parallel(σ(s))]
         nbclustersL = max(nbclustersL, nbclusters(σ(s)))
      Sspawn = sequence(L′)
      nphysical = P - p + cluster_number(κs)
      cluster_number(κs) = nphysical
      synchronization(Sspawn) = spawn(nphysical)
      σ = σ[Sspawn → (Sspawn, κs, |structured_clusters(L′, σ)|)]
      SL ||= [Sspawn]
      p′ -= nbclustersL
   Sbarrier = sequence(SL)
   synchronization(Sbarrier) = barrier()
   return (Sbarrier, σ)
end

Spawn statements SL of one cluster stage are collected in a barrier statement Sbarrier as follows. First, the list of statements SL is constructed, using the || symbol for the concatenation of two lists of arguments. Then, Statement Sbarrier is created using Function sequence from the list of statements SL. Synchronization attributes should be set for the new statements Sspawn and Sbarrier; we thus use the functions spawn(I) and barrier() to set the synchronization attribute, spawn or barrier, of these statements.

Figure 7.3 depicts an example of a SPIRE structured representation (in the right-hand side) of a part of the C implementation of the main function of Harris, presented in the left-hand side. Note that the scheduling results from our BDSC-based task parallelization process using P = 3 (we use three clusters because the maximum parallelism in Harris is three). Here, up to three parallel tasks are enclosed in a barrier for each chain of functions Sobel, Multiplication and Gauss. The synchronization annotations spawn and barrier are printed within the SPIRE generated code;

in = InitHarris();
/* Sobel */
SobelX(Gx, in);
SobelY(Gy, in);
/* Multiply */
MultiplY(Ixx, Gx, Gx);
MultiplY(Iyy, Gy, Gy);
MultiplY(Ixy, Gx, Gy);
/* Gauss */
Gauss(Sxx, Ixx);
Gauss(Syy, Iyy);
Gauss(Sxy, Ixy);
/* Coarsity */
CoarsitY(out, Sxx, Syy, Sxy);

barrier(
   spawn(0, InitHarris(in))
)
barrier(
   spawn(0, SobelX(Gx, in)),
   spawn(1, SobelY(Gy, in))
)
barrier(
   spawn(0, MultiplY(Ixx, Gx, Gx)),
   spawn(1, MultiplY(Iyy, Gy, Gy)),
   spawn(2, MultiplY(Ixy, Gx, Gy))
)
barrier(
   spawn(0, Gauss(Sxx, Ixx)),
   spawn(1, Gauss(Syy, Iyy)),
   spawn(2, Gauss(Sxy, Ixy))
)
barrier(
   spawn(0, CoarsitY(out, Sxx, Syy, Sxy))
)

Figure 7.3: A part of Harris, and one possible SPIRE core language representation.

remember that the synchronization attribute is added to each statement (see Section 4.2.3 on page 56).

Multiple parallel code configuration schemes can be generated. In a first scheme, illustrated in Function spawns barrier, we activate other clusters from an enclosing one, i.e., the enclosing cluster executes the spawn instructions (launches the tasks). A second scheme would be to use the enclosing cluster to execute one of the enclosed tasks; there is no need to generate the last task as a spawn statement, since it is executed by the current cluster. We show an example in Figure 7.4.

Although barrier synchronization of statements with only one spawn statement can be omitted, as another optimization (barrier(spawn(i, S)) = S), we choose to keep these synchronizations for pedagogical purposes, and plan to apply this optimization in a separate phase, as a parallel transformation, or to handle it at the code generation level.
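A minimal Python sketch of this simplification, on a toy nested-tuple encoding of SPIRE statements that we invent here for illustration (("barrier", [stmts]), ("spawn", i, stmt), or a plain string for a call):

    # Toy sketch of the barrier(spawn(i, S)) = S simplification.
    def simplify(stmt):
        kind = stmt[0] if isinstance(stmt, tuple) else None
        if kind == "barrier":
            body = [simplify(s) for s in stmt[1]]
            if len(body) == 1 and isinstance(body[0], tuple) and body[0][0] == "spawn":
                return simplify(body[0][2])      # barrier(spawn(i, S)) -> S
            return ("barrier", body)
        if kind == "spawn":
            return ("spawn", stmt[1], simplify(stmt[2]))
        return stmt

    code = ("barrier", [("spawn", 0, "CoarsitY(out, Sxx, Syy, Sxy)")])
    print(simplify(code))   # the call alone, with no enclosing barrier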

7.2.2 Hierarchical Parallelism

Function unstructured to structured is recursive, in order to handle hierarchy in code. An example is provided in Figure 7.5, where we assume that P = 4. We use two clusters, κ0 and κ1, to schedule the cluster stage {S1, S2};


spawn(0, Gauss(Sxx, Ixx))         spawn(1, Gauss(Sxx, Ixx))
spawn(1, Gauss(Syy, Iyy))         spawn(2, Gauss(Syy, Iyy))
spawn(2, Gauss(Sxy, Ixy))         Gauss(Sxy, Ixy)

Figure 7.4: SPIRE representation of a part of Harris, non optimized (left) and optimized (right).

the remaining clusters are then used to schedule the statements enclosed in S1 and S2. Note that Statement S12, where p = 2, is mapped to the physical cluster number 3 from its logical number 1 (3 = 4 − 2 + 1).

// S
// S0
for(i = 1; i <= N; i++) {
   A[i] = 5;               // S01
   B[i] = 3;               // S02
}
// S1
for(i = 1; i <= N; i++) {
   A[i] = foo(A[i]);       // S11
   C[i] = bar();           // S12
}
// S2
for(i = 1; i <= N; i++)
   B[i] = baz(B[i]);       // S21
// S3
for(i = 1; i <= N; i++)
   C[i] += A[i] + B[i];    // S31

barrier(
   spawn(0, forloop(i, 1, N, 1,
      barrier(
         spawn(1, B[i] = 3),
         spawn(2, A[i] = 5)
      ),
      sequential))
)
barrier(
   spawn(0, forloop(i, 1, N, 1,
      barrier(
         spawn(2, A[i] = foo(A[i])),
         spawn(3, C[i] = bar())
      ),
      sequential)),
   spawn(1, forloop(i, 1, N, 1,
      B[i] = baz(B[i]),
      sequential))
)
barrier(
   spawn(0, forloop(i, 1, N, 1,
      C[i] += A[i]+B[i],
      sequential))
)

Figure 7.5: A simple C code (hierarchical) and its SPIRE representation.

Barrier statements Sstructured, formed from all cluster stages Sbarrier in the sequence S, constitute the new sequence S′ that we build in Function unstructured to structured; by default, the execution attribute of this sequence is sequential.

The code in the right-hand side of Figure 7.5 contains barrier state-


ments with only one spawn statement, which could be omitted (as an optimization). Moreover, in the same figure, note how S01 and S02 are apparently reversed; this is a consequence, in our Algorithm 21, of the function topsort unstructured, where statements are generated in the order of the graph traversal.

The execution attribute of the loops in Figure 7.5 is sequential, but they are actually parallel. Note that, regarding these loops, since we adopt the task parallelism paradigm, it may be useful to apply the tiling transformation and then perform full unrolling of the outer loop; we give more details on this transformation in the protocol of our experiments, in Section 8.4. This way, the input code contains more parallel tasks: spawn loops with a sequential execution attribute resulting from the parallel loops.

7.3 From Shared Memory to Distributed Memory Transformation

This section provides a second illustration of the type of parallel transformations that can be performed thanks to SPIRE: the shared memory to distributed memory transformation. Given the complexity of this issue, as illustrated by the vast existing related work (see Section 7.3.1), the work reported in this section is preliminary and exploratory.

The send and recv primitives of SPIRE provide a simple communication API, adaptable to message-passing libraries/languages such as MPI, and thus to distributed memory system architectures. In these architectures, the data domain of whole programs should be partitioned into disjoint parts and mapped to different processes that have separate memory spaces. If a process accesses a non-local variable, communications must be performed. In our case, send and recv SPIRE functions have to be added to the parallel intermediate representation of distributed memory code.

7.3.1 Related Work: Communications Generation

Many researchers have addressed the generation of communication routines between several processors (general purpose, graphical, accelerator) and have tried to optimize the generated code. A large amount of literature already exists, and much work has been done to address this problem since the early 1990s. However, no practical and efficient general solution currently exists [27].

Amarasinghe and Lam [15] generate the send and receive instructions necessary for communication between multiple processors. Goumas et al. [49] automatically generate message-passing code for tiled iteration spaces. However, these two works handle only perfectly nested loops with uniform dependences. Moreover, the data parallelism code generation proposed in [15]


and [49] produces an SPMD (single program multiple data) program to be run on each processor; this is different from our context of task parallelism, with its hierarchical approach.

Bondhugula [27], in his automatic distributed memory code generation technique, uses the polyhedral framework to compute communication sets between computations under a given iteration of the parallel dimension of a loop. Once more, the parallel model is nested loops. Indeed, he "only" needs to transfer data written in one iteration and read in another iteration of the same loop, or final writes for all data, to be aggregated once the entire loop is finished. In our case, we need to transfer data from one or many iterations of a loop to different loop(s), knowing that the data spaces of these loops may be interlaced.

Cetus contains a source-to-source OpenMP to MPI transformation [23] where, at the end of a parallel construct, each participating process communicates the shared data it has produced and that other processes may use.

STEP [78] is a tool that transforms a parallel program containing high-level OpenMP directives into a message-passing program. To generate MPI communications, STEP constructs (1) send updated array regions, needed at the end of a parallel section, and (2) receive array regions, needed at the beginning of a parallel section. Habel et al. [52] present a directive-based programming model for hybrid distributed and shared memory systems. Using the pragmas distribute and gridify, they distribute nested loop dimensions and array data among processors.

The OpenMP to GPGPU compiler framework [74] transforms OpenMP into GPU code. In order to generate communications, it inserts memory transfer calls for all shared data accessed by the kernel functions and defined by the host, and for data written in the kernels and used by the host.

Amini et al. [17] generate memory transfers between a computer host and its hardware GPU (CPU-GPU) using data flow analysis; the latter is also based on array regions.

The main difference with our context of task parallelism with hierarchy is that these works generate communications between different iterations of nested loops, or between one host and GPUs. Indeed, we have to transfer data between two different entire loops or between two iterations of different loops.

In this section, we describe how to figure out which pieces of data have to be transferred; we construct array region transfers in the context of a distributed-memory homogeneous architecture.

7.3.2 Difficulties

Distributed-memory code generation becomes a problem much harder than the ones addressed in the above related works (see Section 7.3.1), because of the hierarchical aspect of our methodology. In this section, we illustrate how communication generation imposes a set of challenges in terms of efficiency and correctness of the generated code and of the coherence of data communications.

Inter-Level Communications

In the example in Figure 7.6, we suppose given this schedule which, although not efficient, is useful to illustrate the difficulty of sending an array, element by element, from the source to the sink of a dependence extracted from the DDG. The difficulty occurs when this source and sink belong to different levels of hierarchy in the code, i.e., the loop tree or test branches.

// S
// S0
for(i = 0; i < N; i++) {
   A[i] = 5;               // S01
   B[i] = 3;               // S02
}
// S1
for(i = 1; i < N; i++) {
   A[i] = foo(A[i-1]);     // S11
   B[i] = bar(B[i]);       // S12
}

forloop(i, 0, N-1, 1,
   barrier(
      spawn(0, A[i] = 5;
         if(i==0,
            asend(1, A[i]),
            nop)),
      spawn(1, B[i] = 3;
         asend(0, B[i]))
   ), sequential)
forloop(i, 1, N-1, 1,
   barrier(
      spawn(1, if(i==1,
            recv(0, A[i-1]),
            nop);
         A[i] = foo(A[i-1])),
      spawn(0, recv(1, B[i]);
         B[i] = bar(B[i]))
   ), sequential)

Figure 7.6: An example of a shared memory C code (top) and its efficient, communication primitive-based, distributed memory SPIRE equivalent (bottom).

Communicating element by element is simple in the case of Array B in S12, since no loop-carried dependence exists. But, when a dependence distance is greater than zero, as is the case of Array A in S11, sending only A[0] to S11 is required; generating exact communications in such cases, though, requires more sophisticated static analyses that we do not have. Note that asend is a macro instruction for an asynchronous send; it is detailed in Section 7.3.3.

Non-Uniform Dependence Distance

In previous works (see Section 7.3.1), only uniform (constant) loop-carried dependences were considered. When a dependence is non-uniform, i.e., the dependence distance is not constant, as in the two codes of Figure 7.7, knowing how long dependences are for Array A and generating exact communications in such cases is complex. The problem is even more complicated when nested loops are used and arrays are multidimensional. Generating exact communications in such cases requires more sophisticated static analyses that we do not have.

// First example (S)
// S0
for(i = 0; i < N; i++) {
   A[i] = 5;               // S01
   B[i] = 3;               // S02
}
// S1
for(i = 0; i < N; i++) {
   A[i] = A[i-p] + a;      // S11
   B[i] = A[i] * B[i];     // S12
   p = B[i];
}

// Second example (S)
// S0
for(i = 0; i < N; i++) {
   A[i] = 5;               // S01
   B[i] = 3;               // S02
}
// S1
for(i = 0; i < N; i++) {
   A[i] = A[p*i-q] + a;    // S11
   B[i] = A[i] * B[i];     // S12
}

Figure 7.7: Two examples of C code with non-uniform dependences.

Exact Analyses

We presented our cost models in Section 6.3; we rely on array region analysis to overapproximate communications. This analysis is not always exact: PIPS generates approximations (see Section 2.6.3 on page 24), noted by the MAY attribute, if an over- or under-approximation has been used because the region cannot be exactly represented by a convex polyhedron. While BDSC scheduling is still robust in spite of these approximations (see Section 8.6), they can affect the correctness of the generated communications. For instance, in the code presented in Figure 7.8, the quality of the array region analysis may go from EXACT to MAY if a region contains holes, as is the case of S0. Also, while the region of A in the case of S1 remains EXACT, because the exact region is rectangular, the region of A in the case of S2 is MAY, since it contains holes.

Approximations are also propagated when computing the in and out regions for Array A (see Figure 7.9). In this chapter, communications are computed


// S0
// < A(φ1)-R-MAY-{1 ≤ φ1, φ1 ≤ N} >
// < A(φ1)-W-MAY-{1 ≤ φ1, φ1 ≤ N} >
for(i = 1; i <= N; i++)
   if(i != N/2)
      A[i] = foo(A[i]);

// S1
// < A(φ1)-R-EXACT-{2 ≤ φ1, φ1 ≤ 2×N} >
// < B(φ1)-W-EXACT-{1 ≤ φ1, φ1 ≤ N} >
for(i = 1; i <= N; i++)
   // < A(φ1)-R-EXACT-{i+1 ≤ φ1, φ1 ≤ i+N} >
   // < B(φ1)-W-EXACT-{φ1 = i} >
   for(j = 1; j <= N; j++)
      B[i] = bar(A[i+j]);

// S2
// < A(φ1)-R-MAY-{2 ≤ φ1, φ1 ≤ 2×N} >
// < B(φ1)-W-EXACT-{1 ≤ φ1, φ1 ≤ N} >
for(j = 1; j <= N; j++)
   B[j] = baz(A[2*j]);

Figure 7.8: An example of a C code and its read and write array region analyses, for two communications, from S0 to S1 and to S2.

using in and out regions, such as when communicating A from S0 to S1 and to S2 in Figure 7.9.

// S0
// < A[φ1]-IN-MAY-{1 ≤ φ1, φ1 ≤ N} >
// < A[φ1]-OUT-MAY-{2 ≤ φ1, φ1 ≤ 2N, φ1 ≤ N} >
for(i = 1; i <= N; i++)
   if(i != n/2)
      A[i] = foo(A[i]);

// S1
// < A[φ1]-IN-EXACT-{2 ≤ φ1, φ1 ≤ 2N} >
for(i = 1; i <= n; i += 1)
   // < A[φ1]-IN-EXACT-{i+1 ≤ φ1, φ1 ≤ i+N} >
   for(j = 1; j <= n; j += 1)
      B[i] = bar(A[i+j]);

// S2
// < A[φ1]-IN-MAY-{2 ≤ φ1, φ1 ≤ 2N} >
for(j = 1; j <= n; j += 1)
   B[j] = baz(A[j*2]);

Figure 7.9: An example of a C code and its in and out array region analyses, for two communications, from S0 to S1 and to S2.


7.3.3 Equilevel and Hierarchical Communications

Since transferring data from one task to another when they are not at the same level of hierarchy is unstructured and too complex, designing a compilation scheme for such cases might be overly complex and might even jeopardize the correctness of the distributed memory code. In this thesis, we propose a compositional, static solution that is structured and correct; it may communicate more data than necessary, but their coherence is guaranteed.

Communication Scheme

We adopt a hierarchical scheme of communication, between multiple levels of nesting of tasks and inside one level of hierarchy. An example of this scheme is illustrated in Figure 7.10, where each box represents a task τ. Each box is seen as a host of its enclosed boxes, and an edge represents a data dependence. This scheme illustrates an example of configuration with two nested levels of hierarchy; the level is specified by the superscript on τ in the figure.

Once structured parallelism is generated (sequential sequences of barrier and spawn statements)3, the second step is to add send and recv SPIRE call statements, assuming a message-passing language model. We detail in this section our specification of communication insertion in SPIRE.

We divide parallelism into hierarchical and equilevel parallelism, and distinguish between two types of communications, namely inside a sequence, at zero hierarchy level (equilevel), and between hierarchy levels (hierarchical).

In Figure 7.11, where edges represent possible communications, τ_0^(1), τ_1^(1) and τ_2^(1) are enclosed in the vertex τ_1^(0), which is in the same sequence as τ_0^(0). The dependence (τ_0^(0), τ_1^(0)), for instance, denotes an equilevel communication of A; the dependence (τ_0^(1), τ_2^(1)) also denotes an equilevel communication, of B, while (τ_1^(0), τ_0^(1)) is a hierarchical communication of A. We use two main steps to construct communications, via the equilevel com and hierarchical com functions that are presented below.

The key limitation of our method is that it cannot enforce memory constraints when applying the BDSC-based hierarchical scheduling, which uses an iterative approach on the number of applications of BDSC (see Section 6.5). Therefore, our proposed solution for the generation of communications cannot be applied when strict limitations on memory are imposed at scheduling time. However, our code generator can easily verify whether the data used and written at each level can be maintained in the specific clusters.

In summary, compared to the works presented in Section 7.3.1, we do not improve upon them (regarding non-affine dependences or efficiency).

3. Note that communications can also be constructed from an unstructured code; one has to manipulate unstructured statements rather than sequence ones.


[Figure: nested box diagrams illustrating the communication scheme. An outer task τ_a^(0) encloses the tasks τ_a^(1) and τ_b^(1), which in turn enclose the tasks τ_a^(2), τ_b^(2), τ_c^(2) and τ_d^(2); edges represent data dependences within one level and across levels.]

Figure 7.10: Scheme of communications between multiple possible levels of hierarchy in a parallel code; the levels are specified by the superscripts on the τ's.

However, we extend their ideas to our problem of generating communications, in task parallelism, within a hierarchical code structure, which is more complex than theirs.

Assumptions

For our communication scheme to be valid, the following assumptions must be met for proper communication generation:

• all write regions are exact, in order to guarantee the coherence of data communications;


Figure 7.11: Equilevel and hierarchical communications

• non-uniform dependences are not handled;

• more data than necessary may be communicated, but their coherence is guaranteed thanks to the topological sorting of the tasks involved in communications; this ordering is also necessary to avoid deadlocks;

• the order in which the send and receive operations are performed is matched, in order to avoid deadlock situations;

• the HBDSC algorithm is applied without the memory constraint (M = ∞);

• the input shared-memory code is a structured parallel code, with spawn and barrier constructs;

• all send operations are non-blocking, in order to avoid deadlocks. We presented in SPIRE (see Section 4.2.4 on page 60) how non-blocking communications can be implemented in SPIRE using the blocking primitives within spawned statements. Therefore, we assume that we use an additional process Ps for performing non-blocking send communications, and a macro instruction asend for asynchronous send, as follows:

asend(dest, data) = spawn(Ps, send(dest, data))

7.3.4 Equilevel Communications Insertion Algorithm

Function equilevel com, presented in Algorithm 23, first computes the equilevel data (array regions) to communicate from (when neighbors = predecessors) or to (when neighbors = successors) Si inside an SDG G (a code sequence), using the function equilevel regions presented in Algorithm 24.

After the call to the equilevel regions function, equilevel com creates a new sequence statement S′i via the sequence function; S′i contains the receive communication calls, Si, and the asend communication calls. After that, it updates σ with S′i and returns it. Recall that the || symbol is used for the concatenation of two lists.

ALGORITHM 23: Equilevel communications for Task τi inside SDG G, updating a schedule σ

function equilevel_com(τi, G, σ, neighbors)
   Si = vertex_statement(τi)
   coms = equilevel_regions(Si, G, σ, neighbors)
   L = (neighbors = successors) ? coms : ∅
   L ||= [Si]
   L ||= (neighbors = predecessors) ? coms : ∅
   S′i = sequence(L)
   return σ[Si → (S′i, cluster(σ(Si)), nbclusters(σ(Si)))]
end

Function equilevel regions traverses all the neighbors τn (successors or predecessors) of τi (the task of Si), in order to generate asend or recv call statements for τi if they are not in the same cluster. Neighbors are traversed in a topological sort-ordered descent, in order to impose an order between multiple send or receive communication calls between each pair of clusters.

ALGORITHM 24: Array regions for Statement Si to be transferred inside SDG G (equilevel)

function equilevel_regions(Si, G, σ, neighbors)
   κi = cluster(σ(Si))
   τi = statement_vertex(Si, G)
   coms = ∅
   foreach τn ∈ topsort_ordered(G, neighbors(τi, G))
      Sn = vertex_statement(τn)
      κn = cluster(σ(Sn))
      if (κi ≠ κn)
         R = (neighbors = successors) ? transfer_data(τi, τn)
                                       : transfer_data(τn, τi)
         coms ||= com_calls(neighbors, R, κn)
   return coms
end

We thus guarantee the coherence of the communicated data and avoid communication deadlocks. We use the function topsort ordered, which first computes the topological sort of the transitive closure of G and then accesses the neighbors of τi in this order. For instance, the graph in the right of Figure 7.12 shows the result of the topological sort of the predecessors of τ4 of the graph in the left. Function equilevel regions returns the list of communications performed inside G by Si.
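A small, self-contained Python sketch of this ordering step (our own toy graph and names, not the PIPS code; since a graph and its transitive closure share the same topological orders, the sketch simply sorts the graph itself):

    # Toy sketch of topsort_ordered: return the requested neighbor set in a
    # global topological order, so that matching send/receive pairs are
    # always generated in the same order on both sides.
    from graphlib import TopologicalSorter

    def topsort_ordered(graph, neighbors):
        # graph: vertex -> set of predecessors
        order = list(TopologicalSorter(graph).static_order())
        return [v for v in order if v in neighbors]

    graph = {"tau1": set(), "tau2": set(),
             "tau3": {"tau1"}, "tau4": {"tau1", "tau3"}}
    print(topsort_ordered(graph, {"tau1", "tau3"}))   # ['tau1', 'tau3']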

[Figure: on the left, an SDG with vertices τ1, τ2, τ3 and τ4, where τ4 receives A and N from its predecessors; on the right, the predecessors of τ4 (τ1 and τ3, with their A and N edges) listed in topological sort order.]

Figure 7.12: A part of an SDG, and the topological sort order of the predecessors of τ4.

Communication calls are generated in two steps (the two functions presented in Algorithm 25).

First, the transfer data function is used to compute which data elements have to be communicated, in the form of array regions. It returns the elements that are involved in a RAW dependence between the two tasks τ and τsucc. It filters the out and in regions of the two tasks in order to get these dependences; it uses edge regions for this filtering. Edge regions are computed during the construction of the SDG, in order to maintain the dependences that label the edges of the DDG (see Section 6.2.1 on page 95). The intersection between the two results is returned4. For instance, in the graph on the right of Figure 7.12, if N is also written in τ1, filtering the out regions of τ1 avoids receiving N twice (N comes from τ1 and τ3).

Second, communication calls are constructed via the function com calls which, from the cluster κ of the statement source (asend) or sink (recv) of the communication and the set of data in terms of regions R, generates a list of communication calls coms. These communication calls are the result of the function region to coms, already implemented in PIPS [20]. This function converts a region into a code of send/receive instructions that represent the communications of the integer points contained in a given parameterized polyhedron from this region. An example is illustrated in Figure 7.13, where the write regions are exact and the other required assumptions are verified.

4. We assume that Rr is combined with the path transformer between the statements of τ and τsucc.


ALGORITHM 25: Two functions to construct a communication call: data to transfer and call generation

function transfer_data(τ, τsucc)
   R′o = R′i = ∅
   Ro = out_regions(vertex_statement(τ))
   foreach ro ∈ Ro
      if (∃ r ∈ edge_regions(τ, τsucc) / entity(r) = entity(ro))
         R′o ∪= {ro}
   Ri = in_regions(vertex_statement(τsucc))
   foreach ri ∈ Ri
      if (∃ r ∈ edge_regions(τ, τsucc) / entity(r) = entity(ri))
         R′i ∪= {ri}
   return regions_intersection(R′o, R′i)
end

function com_calls(neighbors, R, κ)
   coms = ∅
   i = cluster_number(κ)
   com = (neighbors = successors) ? "asend" : "recv"
   foreach r ∈ R
      coms ||= region_to_coms(r, com, i)
   return coms
end

In (b) of the figure, the intersection between the in regions of S2 and the out regions of S1 is converted into a loop of communication calls, using the region to coms function. Note that the current version of our implementation prototype of region to coms is in fact unable to manage this somewhat complicated case. Indeed, the task running on Cluster 0 needs to know the value of the structure variable M to generate a proper communication with the task on Cluster 1 (see Figure 7.13 (b)); M would need to be sent by Cluster 1, where it is known, before the sending loop is run. Thus, our current implementation assumes that all variables in the code generated by region to coms are known in each communicating cluster. For instance, in Figure 7.13, with M replaced by N, region to coms performs correctly. This constraint was not a limitation in the experimental benchmarks tested in Chapter 8.
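To illustrate what such a conversion produces for a simple one-dimensional region, here is a toy Python sketch (our own simplification, not the PIPS region to coms implementation): it emits a loop of element-wise communication calls over the integer points of the region, in the spirit of Figure 7.13 (b).

    # Toy sketch: turn a 1-D region into a loop of element-wise transfers.
    def region_to_coms(array, lower, upper, com, cluster):
        # com is "asend" or "recv"; cluster is the other end of the transfer.
        return [f"for(i = {lower}; i <= {upper}; i++)",
                f"   {com}({cluster}, {array}[i]);"]

    print("\n".join(region_to_coms("A", 2, "2*M", "asend", 1)))
    # for(i = 2; i <= 2*M; i++)
    #    asend(1, A[i]);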

Also, in the part of the main function presented in the left of Figure 7.3, the statement MultiplY(Ixy, Gx, Gy), scheduled on Cluster number 2, needs to receive both Gx and Gy from Clusters number 1 and 0 (see Figure 6.16). The receive call list thus generated for this statement using our algorithm is5:

5. Since the whole array is received, the region is converted into a simple instruction; no loop is needed.


// S1
// < A(φ1)-W-EXACT-{1 ≤ φ1, φ1 ≤ N} >
// < A[φ1]-OUT-MAY-{2 ≤ φ1, φ1 ≤ 2×M} >
for(i = 1; i <= N; i++)
   A[i] = foo();
// S2
// < A[φ1]-R-MAY-{2 ≤ φ1, φ1 ≤ 2×M} >
// < A[φ1]-IN-MAY-{2 ≤ φ1, φ1 ≤ 2×M} >
for(i = 1; i <= M; i++)
   B[i] = bar(A[i*2]);
(a)

spawn(0,
   for(i = 1; i <= N; i++)
      A[i] = foo();
   for(i = 2; i <= 2*M; i++)
      asend(1, A[i]);
)
spawn(1,
   for(i = 2; i <= 2*M; i++)
      recv(0, A[i]);
   for(i = 1; i <= M; i++)
      B[i] = bar(A[i*2]);
)
(b)

Figure 7.13: Communication generation from a region using the region to coms function.

recv(1, Gy)
recv(0, Gx)

7.3.5 Hierarchical Communications Insertion Algorithm

Data transfers are also needed between the different levels of the code hierarchy, because each enclosed task requires data from its enclosing one, and vice versa. To construct these hierarchical communications, we introduce Function hierarchical com, presented in Algorithm 26.

This function generates asend or recv call statements for each task S and its hierarchical parent (enclosing vertex), scheduled in Cluster κp. First, it constructs the list of transfers Rh that gathers the data to be communicated to the enclosed vertex; Rh can be defined as the in regions or the out regions of a task S. Then, from this set of transfers Rh, communication calls coms (recv from the enclosing vertex, or asend to it) are generated. Finally, σ, updated with the new statement S, and the set of


ALGORITHM 26: Hierarchical communications for Statement S to its hierarchical parent, clustered in κp

function hierarchical_com(S, neighbors, σ, κp)
   Rh = (neighbors = successors) ? out_regions(S)
                                  : in_regions(S)
   coms = com_calls(neighbors, Rh, κp)
   seq = ((neighbors = predecessors) ? coms : ∅)
         || [S]
         || ((neighbors = successors) ? coms : ∅)
   S′ = sequence(seq)
   σ = σ[S → (S′, cluster(σ(S)), nbclusters(σ(S)))]
   return (σ, Rh)
end

regions Rh, which represents the data to be communicated, this time from the side of the enclosing vertex, are returned.

For instance, in Figure 7.11, for the task τ_0^(1), the data elements Rh to send from τ_0^(1) to τ_1^(0) are the out regions of the statement labeled τ_0^(1). Thus, the hierarchical asend call from τ_0^(1) to its hierarchical parent τ_1^(0) is constructed using Rh as argument.

7.3.6 Communications Insertion Main Algorithm

Function communications insertion, presented in Algorithm 27, constructs a new code in which asend and recv primitives are generated. It uses a hierarchical schedule σ, which maps each substatement s of S to σ(s) = (s′, κ, n) (see Section 6.5), to handle new statements. It also uses the hierarchical SDG mapping function H, which we use to recover the SDG of each statement (see Section 6.2).

Equilevel

To construct equilevel communications, all substatements Si in a sequence S are traversed. For each Si, communications are generated inside each sequence via two calls to the function equilevel com presented above: one with the function neighbors equal to successors, to compute asend calls, and the other with predecessors, to compute recv calls. This ensures that each send operation is matched to an appropriately ordered receive operation, avoiding deadlock situations. We use the function H that returns an SDG G for S; G is essential to access the dependences, and thus the communications, between the vertices τ of G whose statements s = vertex statement(τ) are in S.


ALGORITHM 27: Communication primitives asend and recv insertion in SPIRE

function communications_insertion(S, H, σ, κp)
   κ = cluster(σ(S))
   n = nbclusters(σ(S))
   switch (S)
   case call:
      return σ
   case sequence(S1; ...; Sn):
      G = H(S)
      h_recvs = h_sends = ∅
      foreach τi ∈ topsort(G)
         Si = vertex_statement(τi)
         κi = cluster(σ(Si))
         σ = equilevel_com(τi, G, σ, predecessors)
         σ = equilevel_com(τi, G, σ, successors)
         if (κp ≠ NULL and κp ≠ κi)
            (σ, h_Rrecv) = hierarchical_com(Si, predecessors, σ, κp)
            h_sends ||= com_calls(successors, h_Rrecv, κi)
            (σ, h_Rsend) = hierarchical_com(Si, successors, σ, κp)
            h_recvs ||= com_calls(predecessors, h_Rsend, κi)
         σ = communications_insertion(Si, H, σ, κi)
      S′ = parallel(σ(S))
      S″ = sequence(h_sends || [S′] || h_recvs)
      return σ[S′ → (S″, κ, n)]
   case forloop(I, Elower, Eupper, Sbody):
      σ = communications_insertion(Sbody, H, σ, κp)
      (S′body, κbody, nbclustersbody) = σ(Sbody)
      return σ[S → (forloop(I, Elower, Eupper, S′body), κ, n)]
   case test(Econd, St, Sf):
      σ = communications_insertion(St, H, σ, κp)
      σ = communications_insertion(Sf, H, σ, κp)
      (S′t, κt, nbclusterst) = σ(St)
      (S′f, κf, nbclustersf) = σ(Sf)
      return σ[S → (test(Econd, S′t, S′f), κ, n)]
end

Hierarchical

We also construct the hierarchical communications that represent data transfers between different levels of hierarchy, because each enclosed task requires data from its enclosing one. We use the function hierarchical com, presented above, which generates asend or recv call statements for each task S and its hierarchical parent (enclosing vertex), scheduled in Cluster κp. Thus, the argument κp is propagated in communications insertion, in order to keep correct the information about the cluster of the enclosing vertex. The returned set of regions, h_Rsend or h_Rrecv, which represents the data to be communicated at the start and end of the enclosed vertex τi, is used to construct, respectively, the send and recv communication calls on the side of the enclosing vertex of S. Note that hierarchical send communications need to be performed before running S′, to ensure that the subtasks can run. Moreover, no deadlock can occur in this scheme, because the enclosing task executes its send/receive statements sequentially before spawning its enclosed tasks, which run concurrently.

The top level of a program presents no hierarchy. In order not to generate hierarchical communications that are not necessary, as an optimization, we use the test (κp ≠ NULL), knowing that, for the first call to the function communications insertion, κp is set to NULL. This is necessary to prevent the generation of hierarchical communications for Level 0.

Example

As an example, we apply the communication generation phase to the code presented in Figure 7.5 to generate SPIRE code with communications; both equilevel (noted with // equilevel comments) and hierarchical (noted with // hierarchical comments) communications are illustrated in the right of Figure 7.14. Moreover, a graphical illustration of these communications is given in Figure 7.15.

7.3.7 Conclusion and Future Work

As said before, the work presented in this section is a preliminary and exploratory solution that we believe is useful for suggesting the potential of our task parallelization in the distributed memory domain. It may also serve as a base for future work to generate more efficient code than ours.

In fact, the solution proposed in this section may communicate more data than necessary, but their coherence is guaranteed thanks to the topological sorting of the tasks involved in communications, at each level of generation: equilevel (ordering of the neighbors of every task in an SDG) and hierarchical (ordering of the tasks of an SDG). Moreover, as future work, we intend to optimize the generated code by eliminating redundant communications and aggregating small messages into larger ones.

For instance, hierarchical communications transfer all out regions from the enclosed task to its enclosing task. A more efficient scheme would be not to send all the out regions of the enclosed task, but only the regions that will not be modified afterward. This is another track of optimization, to eliminate redundant communications, that we leave for future work.

Moreover, we optimize the generated code by imposing that the current


for(i = 0; i < N; i++) {
   A[i] = 5;
   B[i] = 3;
}
for(i = 0; i < N; i++) {
   A[i] = foo(A[i]);
   C[i] = bar();
}
for(i = 0; i < N; i++)
   B[i] = baz(B[i]);
for(i = 0; i < N; i++)
   C[i] += A[i] + B[i];

barrier(
   spawn(0, forloop(i, 1, N, 1,
      asend(1, i),            // hierarchical
      asend(2, i),            // hierarchical
      barrier(
         spawn(1,
            recv(0, i),       // hierarchical
            B[i] = 3,
            asend(0, B[i])),  // hierarchical
         spawn(2,
            recv(0, i),       // hierarchical
            A[i] = 5,
            asend(0, A[i]))   // hierarchical
      ),
      recv(1, B[i]),          // hierarchical
      recv(2, A[i]),          // hierarchical
      sequential),
   asend(1, B)))              // equilevel
barrier(
   spawn(0, forloop(i, 1, N, 1,
      asend(2, i),            // hierarchical
      asend(2, A[i]),         // hierarchical
      asend(3, i),            // hierarchical
      barrier(
         spawn(2,
            recv(0, i),       // hierarchical
            recv(0, A[i]),    // hierarchical
            A[i] = foo(A[i]),
            asend(0, A[i])),  // hierarchical
         spawn(3,
            recv(0, i),       // hierarchical
            C[i] = bar(),
            asend(0, C[i]))   // hierarchical
      ),
      recv(2, A[i]),          // hierarchical
      recv(3, C[i]),          // hierarchical
      sequential)),
   spawn(1,
      recv(0, B),             // equilevel
      forloop(i, 1, N, 1,
         B[i] = baz(B[i]), sequential),
      asend(0, B)))           // equilevel
barrier(
   spawn(0,
      recv(1, B),             // equilevel
      forloop(i, 1, N, 1,
         C[i] += A[i]+B[i], sequential)))

Figure 7.14: Equilevel and hierarchical communications generation, and SPIRE distributed memory representation of a C code.


[Figure: the four loops of the code of Figure 7.14, drawn as boxes on two hierarchy levels with their clusters. At Level 0, κ0 runs the loop A[i]=5; B[i]=3 (enclosing κ1 and κ2 at Level 1), κ0 runs the loop A[i]=foo(A[i]); C[i]=bar() (enclosing κ2 and κ3 at Level 1), κ1 runs the loop B[i]=baz(B[i]) and κ0 runs the loop C[i]+=A[i]+B[i]. Hierarchical arcs carry i, A[i], B[i] and C[i]; equilevel arcs carry B.]

Figure 7.15: Graph illustration of the communications made in the C code of Figure 7.14; equilevel in green arcs, hierarchical in black.

cluster executes the last nested spawn. The result of the application of this optimization to the code presented in the right of Figure 7.14 is illustrated in the left of Figure 7.16. We also eliminate barriers with only one spawn statement; the result is illustrated in the right of Figure 7.16.

7.4 Parallel Code Generation

SPIRE is designed to fit different programming languages, thus facilitating the transformation of codes for different types of parallel systems. Existing parallel language constructs can be mapped into SPIRE. We show in this section how this intermediate representation simplifies the targeting of various languages, via a simple mapping approach, and use the prettyprinted generations of OpenMP and MPI as an illustration.

7.4.1 Mapping Approach

Our mapping technique to a parallel language is made of three steps, detailed below: (1) the generation of parallel tasks and synchronization statements, (2) the insertion of communications, and (3) the generation of hierarchical (nested) parallelism.


// Left of Figure 7.16: the current cluster executes the last nested spawn
barrier(
  forloop(i, 1, N, 1,
    asend(1, i);
    barrier(
      spawn(1,
        recv(0, i);
        B[i] = 3;
        asend(0, B[i]));
      A[i] = 5
    );
    recv(1, B[i]),
    sequential)
)
barrier(
  spawn(1, forloop(i, 1, N, 1,
    asend(2, i);
    asend(2, A[i]);
    barrier(
      spawn(2,
        recv(0, i);
        recv(0, A[i]);
        A[i] = foo(A[i]);
        asend(0, A[i]));
      C[i] = bar()
    );
    recv(2, A[i]),
    sequential));
  forloop(i, 1, N, 1,
    B[i] = baz(B[i]),
    sequential)
)
barrier(
  forloop(i, 1, N, 1,
    C[i] += A[i]+B[i],
    sequential)
)

// Right of Figure 7.16: barriers with a single spawn statement are removed
forloop(i, 1, N, 1,
  asend(1, i);
  barrier(
    spawn(1,
      recv(0, i);
      B[i] = 3;
      asend(0, B[i]));
    A[i] = 5
  );
  recv(1, B[i]),
  sequential)
barrier(
  spawn(1, forloop(i, 1, N, 1,
    asend(2, i);
    asend(2, A[i]);
    barrier(
      spawn(2,
        recv(0, i);
        recv(0, A[i]);
        A[i] = foo(A[i]);
        asend(0, A[i]));
      C[i] = bar()
    );
    recv(2, A[i]),
    sequential));
  forloop(i, 1, N, 1,
    B[i] = baz(B[i]),
    sequential)
)
forloop(i, 1, N, 1,
  C[i] += A[i]+B[i],
  sequential)

Figure 7.16: Equilevel and hierarchical communications generation after optimizations: the current cluster executes the last nested spawn (left), and barriers with one spawn statement are removed (right)

In the first step, parallel tasks are generated by translating spawn attributes, and synchronization statements by translating barrier attributes, to the target language. In the second step, communications are inserted by translating the asend and recv primitives to this target language, if the latter uses a distributed memory model. The third step generates hierarchical (nested) parallelism by exploiting the code hierarchy. For efficiency reasons, we assume that the creation of tasks is done once at the beginning of programs; after that, we just synchronize


the tasks, which are thus persistent.

Task Parallelism

The parallel code is generated in a fork/join manner: a barrier statement encloses the spawned parallel tasks. Mapping SPIRE to an actual parallel language consists in transforming the barrier and spawn annotations of SPIRE into the equivalent calls or pragmas available in the target language.

Data Distribution

If the target language uses a shared-memory paradigm, no additional changes are needed to handle this language. Otherwise, if it is a message-passing library or language, a pass inserting the appropriate asend and recv primitives of SPIRE is necessary (see Section 7.3). After that, we generate the corresponding communication calls of the target language by transforming the asend and recv primitives into these equivalent calls.

Hierarchical Parallelism

Throughout this thesis, we take into account the hierarchical structure of code to achieve maximum efficiency. Indeed, exploiting the hierarchy of code when scheduling it via the HBDSC algorithm, within the SPIRE transformation pass based on the unstructured_to_structured algorithm, and also when generating both equilevel and hierarchical communications, made it possible to handle hierarchical parallelism and thus generate more parallelism. Moreover, we take into account the ability of parallel languages to implement hierarchical parallelism and thus improve the performance of the generated parallel code. Figure 7.5 on page 138 illustrates zero and one level of hierarchy in the generated SPIRE code (right): we have generated equilevel and hierarchical SPIRE(PIPS IR).

7.4.2 OpenMP Generation

OpenMP 3.0 has been extended with tasks and thus supports both task and hierarchical task parallelism for shared memory systems. For efficiency, since the creation of threads is done once at the beginning of the program, the omp parallel pragma is used only once in the program. This section details the generation of OpenMP task parallelism features using SPIRE.
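As a minimal sketch (not actual PIPS output), the generated OpenMP program can be organized as follows, with a single parallel region and a single construct enclosing all generated tasks:

#include <omp.h>

int main(int argc, char *argv[])
{
  /* threads are created once for the whole program */
  #pragma omp parallel
  #pragma omp single   /* one thread generates the tasks */
  {
    /* the generated omp task constructs (see Figure 7.17) go here */
  }
  return 0;
}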

Task Parallelism

OpenMP supports two types of task parallelism: a dynamic scheduling model, via the omp task directive, and a static one, via the omp section directive (see Section 3.4.4 on page 39).


In our implementation of the OpenMP back-end generation, we choose to generate omp task instead of omp section tasks, for four reasons, namely:

1. the HBDSC scheduling algorithm uses a static approach, and thus we want to combine the static results of HBDSC with the run-time dynamic scheduling of OpenMP provided when using omp task, in order to improve scheduling;

2. unlike the omp section directive, the omp task directive allows the implementation of recursive parallel tasks;

3. omp section may only be used within an omp sections construct, and thus we cannot implement a parallel loop using omp sections;

4. current implementations of OpenMP do not provide the possibility of nesting many omp sections constructs, and thus cannot take advantage of the BDSC-based hierarchical parallelization (HBDSC) feature to exploit more parallelism.

As regards the use of the OpenMP task construct, we apply a simple process: a SPIRE statement annotated with the synchronization attribute spawn is decorated with the pragma omp task. An example is illustrated in Figure 7.17, where the SPIRE code on the left-hand side is rewritten in OpenMP on the right-hand side.

Synchronization

In OpenMP, the user can explicitly specify synchronization to handle control and data dependences, using the omp single pragma for example. There, only one thread (omp single pragma) will encounter a task construct (omp task), to be scheduled and executed by any thread in the team (the set of threads created inside an omp parallel pragma). Therefore, a statement annotated with the SPIRE synchronization attribute barrier is decorated with the pragma omp single. An example is illustrated in Figure 7.17.

Hierarchical Parallelism

SPIRE handles multiple levels of parallelism, providing a trade-off between parallelism and synchronization overhead. This is taken into account at scheduling time using our BDSC-based hierarchical scheduling algorithm. In order to generate hierarchical parallelism in OpenMP, the same process as above of translating spawn and barrier SPIRE annotations should be applied. However, synchronizing nested tasks (with synchronization attribute barrier) via the omp single pragma cannot be done, because of a restriction of OpenMP-compliant implementations: single regions may not be closely nested inside explicit task regions. Therefore, we use the pragma


// SPIRE representation
barrier(
  spawn(0, Gauss(Sxx, Ixx)),
  spawn(1, Gauss(Syy, Iyy)),
  spawn(2, Gauss(Sxy, Ixy))
)

// Generated OpenMP task code
#pragma omp single
{
  #pragma omp task
  Gauss(Sxx, Ixx);
  #pragma omp task
  Gauss(Syy, Iyy);
  #pragma omp task
  Gauss(Sxy, Ixy);
}

Figure 7.17: SPIRE representation of a part of Harris and its generated OpenMP task code

omp taskwait for synchronization when encountering a statement annotated with a barrier synchronization. The second barrier on the left of Figure 7.18 implements nested parallel tasks; omp taskwait is then used (right of the figure) to synchronize the two tasks spawned inside the loop. Note also that a barrier statement with zero spawn statements (left of the figure) is omitted in the translation (right of the figure) for optimization purposes.

// SPIRE representation
barrier(
  spawn(0,
    forloop(i, 1, N, 1,
      barrier(
        spawn(1, B[i] = 3),
        spawn(2, A[i] = 5)
      ),
      sequential))
)
barrier(
  spawn(0,
    forloop(i, 1, N, 1,
      C[i] = A[i]+B[i],
      sequential))
)

// Generated hierarchical OpenMP task code
#pragma omp single
{
  for (i = 1; i <= N; i += 1) {
    #pragma omp task
    B[i] = 3;
    #pragma omp task
    A[i] = 5;
    #pragma omp taskwait
  }
  for (i = 1; i <= N; i += 1)
    C[i] = A[i]+B[i];
}

Figure 7.18: SPIRE representation of a C code and its generated hierarchical OpenMP task code

7.4.3 SPMDization: MPI Generation

MPI is a message-passing library widely used to program distributed-memory systems. We show here the generation of communications between


processes. In our technique, we adopt a model of parallelism often called "SPMDization", where processes are created once at the beginning of the program (static creation), each process executes the same code, and the number of processes remains unchanged during execution. This reduces the process creation and synchronization overheads. We state in Section 3.4.5 on page 40 that MPI processes are created when the MPI_Init function is called; it also defines the universal intracommunicator MPI_COMM_WORLD for all processes to drive the various communications. We use the process rank parameter rank0 to manage the relation between process and task; this rank is obtained by calling the function MPI_Comm_rank(MPI_COMM_WORLD, &rank0). This section details the generation of MPI using SPIRE.

Task Parallelism

A simple SPMDization technique is applied, where a statement with the synchronization attribute spawn of parameter entity0, which specifies the cluster number (a process in the case of MPI), is guarded by an if statement on the condition rank0 == entity0. An example is illustrated in Figure 7.19.

Communication and Synchronization

MPI supports distributed memory systems: we have to generate communications between different processes. The MPI_Isend and MPI_Recv functions are called to replace the communication primitives asend and recv of SPIRE; MPI_Isend [3] starts a nonblocking send. These communication functions are sufficient to ensure the synchronization between different processes. An example is illustrated in Figure 7.19.

Hierarchical Parallelism

Most people use only MPI_COMM_WORLD as a communicator between MPI processes. This universal intracommunicator is not enough for handling hierarchical parallelism. MPI implementations offer the possibility of creating subset communicators via the function MPI_Comm_split [80], which creates new communicators from an initially defined communicator. Each new communicator can be used at one level of hierarchy in the parallel code. Another scheme consists in gathering the code of each process and then assigning this code to the corresponding process.
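As a hedged illustration of the first scheme (not implemented in our prototype), the sketch below splits MPI_COMM_WORLD into sub-communicators, one per top-level task, that could then drive the communications of one nested level of hierarchy; the mapping of ranks onto two top-level tasks is purely illustrative:

#include <mpi.h>
#include <stdio.h>

/* a minimal sketch, not generated by PIPS */
int main(int argc, char *argv[])
{
  int rank0, subrank, color;
  MPI_Comm subcomm;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank0);
  color = rank0 % 2;                 /* illustrative mapping onto 2 top-level tasks */
  MPI_Comm_split(MPI_COMM_WORLD, color, rank0, &subcomm);
  MPI_Comm_rank(subcomm, &subrank);
  printf("process %d works in sub-communicator %d with rank %d\n",
         rank0, color, subrank);
  /* nested asend/recv calls would use subcomm instead of MPI_COMM_WORLD */
  MPI_Comm_free(&subcomm);
  MPI_Finalize();
  return 0;
}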

In our implementation of the generation of MPI code from SPIRE code, we leave the implementation of hierarchical MPI as future work; we implement only flat MPI code. Another type of hierarchical MPI is obtained by merging MPI and OpenMP together (hybrid programming) [12]; this is also called multi-core-aware message passing (MPI). We also discuss this in the future work section (see Section 9.2).


// SPIRE representation
barrier(
  spawn(0,
    Gauss(Sxx, Ixx)
  ),
  spawn(1,
    Gauss(Syy, Iyy);
    asend(0, Syy)
  ),
  spawn(2,
    Gauss(Sxy, Ixy);
    asend(0, Sxy)
  )
)

// Generated MPI code
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank0);
if (rank0 == 0)
  Gauss(Sxx, Ixx);
if (rank0 == 1) {
  Gauss(Syy, Iyy);
  MPI_Isend(Syy, N*M, MPI_FLOAT, 0, MPI_ANY_TAG,
            MPI_COMM_WORLD, &request);
}
if (rank0 == 2) {
  Gauss(Sxy, Ixy);
  MPI_Isend(Sxy, N*M, MPI_FLOAT, 0, MPI_ANY_TAG,
            MPI_COMM_WORLD, &request);
}
MPI_Finalize();

Figure 7.19: SPIRE representation of a part of Harris and its generated MPI code

7.5 Conclusion

The goal of this chapter is to lay the first brick of a framework of parallel code transformations and generation using our BDSC- and SPIRE-based hierarchical task parallelization techniques. We present two parallel transformations: first, from unstructured to structured parallelism, for high-level abstraction of parallel constructs; and second, the conversion of shared memory programs to distributed ones. Both transformations are performed by manipulating SPIRE constructs. Moreover, we show the


generation of task parallelism and communications at both levels, equilevel and hierarchical. Finally, we generate both OpenMP and MPI from the same parallel IR derived from SPIRE, except for hierarchical MPI, which we leave as future work.

The next chapter provides experimental results for OpenMP and MPI programs generated via the methodology presented in this thesis, based on the HBDSC scheduling algorithm and SPIRE-derived parallel languages.

Chapter 8

Experimental Evaluation with PIPS

Experience is a lantern tied behind our backs which illuminates the path. Confucius

In this chapter, we describe the integration of SPIRE and HBDSC within the PIPS compiler infrastructure and their application to the parallelization of five significant programs, targeting both shared and distributed memory architectures: the image and signal processing benchmarks Harris and ABF, the SPEC2001 benchmark equake, the NAS parallel benchmark IS, and an FFT code. We present the experimental results we obtained for these applications; our experiments suggest that HBDSC's focus on efficient resource management leads to significant parallelization speedups on both shared and distributed memory systems, improving upon DSC results, as shown by the comparison of the sequential and parallelized versions of these five applications running on both OpenMP and MPI frameworks.

We also provide a comparative study between our task parallelization implementation in PIPS and that of the audio signal processing language Faust, using two Faust applications: Karplus32 and Freeverb.

8.1 Introduction

The HBDSC algorithm presented in Chapter 6 has been designed to offer better task parallelism extraction performance for parallelizing compilers than traditional list-scheduling techniques such as DSC. To verify its effectiveness, BDSC has been implemented in the PIPS compiler infrastructure and tested on programs written in C. Both the HBDSC algorithm and SPIRE are implemented in PIPS. Recall that HBDSC is the hierarchical layer that adapts a scheduling algorithm (DSC, BDSC), which handles a DAG (sequence), to a hierarchical code. The analyses estimating task execution times, dependences and communications between tasks, needed for computing tlevel and blevel, are also implemented in PIPS.

In this chapter, we provide experimental BDSC-vs-DSC comparison results based on the parallelization of five benchmarks, namely ABF [51], Harris [94], equake [22], IS [85] and FFT [8]. We chose these particular benchmarks since they are well known and exhibit task parallelism that we hope our approach will be able to take advantage of. Note


that, since our focus in this chapter is to compare BDSC with prior work, namely DSC, our experiments do not address the hierarchical component of HBDSC.

BDSC uses numerical constants to label its input graph; we show in Section 6.3 how we lift this restriction using static and dynamic approaches based on array region and complexity analyses. In this chapter, we study experimentally the sensitivity of the task and communication time estimations to the input, in order to assess the scheduling robustness of BDSC, since it relies on approximations provided by these analyses.

Several existing systems already automate the parallelization of programs targeting different languages (see Section 6.6), such as the non-open-source compiler OSCAR [62]. With the goal of comparing our approach with others, we choose the audio signal processing language Faust [86], because its compiler is open source and also automatically generates parallel tasks in OpenMP.

The remainder of this chapter is structured as follows. We detail the implementation of HBDSC in Section 8.2. Section 8.3 presents the scientific benchmarks under study, Harris, ABF, equake, IS and FFT, the two target machines that we use for our measurements, and the effective cost model parameters that depend on each target machine. Section 8.4 explains the process we follow to parallelize these benchmarks. Section 8.5 provides performance results when these benchmarks are parallelized on the PIPS platform, targeting both a shared memory architecture and a distributed memory architecture. We also assess, in Section 8.6, the sensitivity of our parallelization technique to the accuracy of the static approximations of the code execution times used in task scheduling. Section 8.7 provides a comparative study between our task parallelization implementation in PIPS and that of Faust when dealing with audio signal processing benchmarks.

8.2 Implementation of HBDSC- and SPIRE-Based Parallelization Processes in PIPS

We have implemented the HBDSC parallelization (Chapter 6) and some SPIRE-based parallel code transformation and generation (Chapter 7) algorithms in PIPS. PIPS is structured as a collection of independent code analysis and transformation phases that we call passes. This section describes the order and the composition of the passes that we have implemented in PIPS to enable the parallelization of sequential C programs. Figure 8.1 summarizes the parallelization processes.

These passes are managed by the Pipsmake library1. Pipsmake manages objects that are resources stored in memory and/or on disk. The passes are

1 http://cri.ensmp.fr/PIPS/pipsmake.html


[Figure: parallelization process flow: C source code → PIPS analyses (complexities, convex array regions, DDG; input data, sequential IR) → Sequence Dependence Graph (SDG) → cost models (static or instrumentation, numerical profile) → HBDSC → unstructured SPIRE(PIPS IR) and schedule → structuring → structured SPIRE(PIPS IR) → OpenMP generation and MPI generation → OpenMP code and MPI code.]

Figure 8.1: Parallelization process: blue indicates thesis contributions, an ellipse a pass or a set of passes, and a rectangle results

described by generic rules that indicate used and produced resources. Examples of these rules are illustrated in the following subsections. Pipsmake


maintains global data consistency intra- and inter-procedurally.

The goal of the following passes is to generate the parallel code expressed in SPIRE(PIPS IR) using the HBDSC scheduling algorithm, then communication primitives, and finally OpenMP and MPI codes, in order to automate the task parallelization of sequential programs.

Tpips2 is the command-line interface and scripting language of the PIPS system. The execution of the necessary passes is obtained by typing their names in the command-line interpreter of PIPS, tpips. Figure 8.2 depicts harris.tpips, the script used to execute our shared and distributed memory parallelization processes on the benchmark Harris. It chains the passes described in the following subsections.

# Create a workspace for the program harris.c
create harris harris.c

apply SEQUENCE_DEPENDENCE_GRAPH[main]
shell dot -Tpng harris.database/main/main_sdg.dot > main_sdg.png
shell gqview main_sdg.png

setproperty HBDSC_NB_CLUSTERS 3
apply HBDSC_PARALLELIZATION[main]
apply SPIRE_UNSTRUCTURED_TO_STRUCTURED[main]

# Print the result on stdout
display PRINTED_FILE[main]

echo OMP style
activate OPENMP_TASK_GENERATION
activate PRINT_PARALLELIZEDOMPTASK_CODE
display PARALLELPRINTED_FILE[main]

echo MPI style
apply HBDSC_GEN_COMMUNICATIONS[main]
activate MPI_TASK_GENERATION
activate PRINT_PARALLELIZEDMPI_CODE
display PARALLELPRINTED_FILE[main]

# Print the scheduled SDG of harris
shell dot -Tpng harris.database/main/main_ssdg.dot > main_ssdg.png
shell gqview main_scheduled_ssdg.png

close
quit

Figure 8.2: Executable tpips script (harris.tpips) for harris.c

2 http://www.cri.ensmp.fr/pips/tpips-user-manual.htdoc/tpips-user-manual.pdf


8.2.1 Preliminary Passes

Sequence Dependence Graph Pass

Pass sequence_dependence_graph generates the Sequence Dependence Graph (SDG), as explained in Section 6.2. It uses in particular the dg resource (data dependence graph) of PIPS and outputs the resulting SDG in the resource sdg.

sequence_dependence_graph       > MODULE.sdg
        < PROGRAM.entities
        < MODULE.code
        < MODULE.proper_effects
        < MODULE.dg
        < MODULE.regions
        < MODULE.transformers
        < MODULE.preconditions
        < MODULE.cumulated_effects

Code Instrumentation Pass

We model the communication costs, data sizes and execution times of the different tasks with polynomials. These polynomials have to be converted to numerical values in order to be used by the BDSC algorithm. We thus instrument the input sequential code using the pass hbdsc_code_instrumentation and run it once, in order to obtain the numerical values of the polynomials. The instrumented code contains the initial user code plus statements that compute the values of the cost polynomials for each statement (as detailed in Section 6.3.2).

hbdsc_code_instrumentation      > MODULE.code
        < PROGRAM.entities
        < MODULE.code
        < MODULE.proper_effects
        < MODULE.dg
        < MODULE.regions
        < MODULE.summary_complexity
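As a schematic illustration only (not actual PIPS output; the real pass emits its own statement identifiers and output format), the instrumented code adds, before a statement, a statement that evaluates its cost polynomial with the actual values of the program variables:

#include <stdio.h>

/* hypothetical sketch: the complexity polynomial of the loop below,
   assumed here to be 3*n cycles, is evaluated at run time and dumped
   so that BDSC can later use the numerical value */
void kernel(int n, double a[])
{
  fprintf(stderr, "task_time(loop_1) = %d\n", 3 * n);
  for (int i = 0; i < n; i++)
    a[i] = a[i] * 2.0 + 1.0;
}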

Path Transformer Pass

The path transformer (see Section 6.4) between two statements computes the possible changes performed by a piece of code delimited by two statements Sbegin and Send enclosed within a statement S. The goal of the following pass is to compute the path transformer between two statements labeled sbegin and send in the code resource. It outputs a resource file that contains the resulting transformer.


path_transformer                > MODULE.path_transformer_file
        < PROGRAM.entities
        < MODULE.code
        < MODULE.proper_effects
        < MODULE.transformers
        < MODULE.preconditions
        < MODULE.cumulated_effects
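As a purely illustrative example (hypothetical labels, and a PIPS-like notation that may differ from the actual output of this pass), the path transformer is an affine relation linking the values of the variables in the memory stores of the two labeled statements:

/* hypothetical labels sbegin and send inside some function body */
sbegin: i = i + 1;
        j = 2 * i;
send:   k = i + j;

/* an illustrative path transformer from sbegin to send:
   T(i, j) { i == i#init + 1, j == 2*i }                    */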

8.2.2 Task Parallelization Passes

HBDSC Task Parallelization Pass

Pass hbdsc_parallelization applies HBDSC on the resulting SDG (the input resource sdg) and generates parallel code in an unstructured form, using the resource sdg to implement the scheduled SDG. Resource schedule implements our hierarchical schedule function σ (see Section 6.5): it maps each statement to a given cluster. This pass implements the HBDSC algorithm presented in Section 6.5, Algorithm 19. Note that Algorithm 19 uses the unstructured construct instead of the scheduled SDG to represent unstructured parallelism; our prototype uses the SDG instead, for historical reasons. This pass uses the analyses of regions, preconditions, transformers, complexities and effects, and outputs the scheduled SDG in the resource sdg and the function σ in the resource schedule.

hbdsc_parallelization           > MODULE.sdg
        < PROGRAM.entities      > MODULE.schedule
        < MODULE.code
        < MODULE.proper_effects
        < MODULE.sdg
        < MODULE.regions
        < MODULE.summary_complexity
        < MODULE.transformers
        < MODULE.preconditions
        < MODULE.cumulated_effects

Parallelism Structuring Pass

Pass hbdsc_parallelization generates unstructured parallelism. The goal of the following pass is to structure this parallelism. Pass spire_unstructured_to_structured transforms the parallelism generated in the form of graphs (SDGs) into spawn and barrier constructs. It uses the scheduled SDG and the schedule function σ (the resource schedule), generates a new (structured) code, and updates the schedule resource, since it adds to the schedule the newly created statements of the new code. This pass


implements the algorithm unstructured_to_structured presented in Section 7.2, Algorithm 21.

spire_unstructured_to_structured        > MODULE.code
        < PROGRAM.entities              > MODULE.schedule
        < MODULE.code
        < MODULE.sdg
        < MODULE.schedule

HBDSC Tuning

In PIPS, one can use different properties to parameterize passes and select the precision required or the algorithm to use. Default properties are defined by PIPS, but they can be redefined by the user with the command setproperty when the tpips interface is used.

The following properties are used by pass hbdsc_parallelization to allow the user to define the number of clusters and the memory size, based on the characteristics of the target architecture.

The number of clusters is set by default to 4, and the memory size to -1, which means that we assume infinite memory.

HBDSC_NB_CLUSTERS 4

HBDSC_MEMORY_SIZE -1

In order to control the granularity of parallel tasks, we introduce the following property. By default, we generate the maximum parallelism in codes. Otherwise, i.e., when COSTLY_TASKS_ONLY is TRUE, only loops and function calls are used as parallel tasks. This is important in order to trade off task creation overhead against task execution time.

COSTLY_TASKS_ONLY FALSE

The next property specifies whether communication data and time information is to be extracted from the result of the instrumentation and stored into the file HBDSC_INSTRUMENTED_FILE, which corresponds to a dynamic analysis, or whether the static cost models have to be used. The default option is an empty string; it means that static results are used.

HBDSC_INSTRUMENTED_FILE
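For instance, a tpips script could override these defaults as follows; the values and the instrumentation file name below are purely illustrative and depend on the target machine:

setproperty HBDSC_NB_CLUSTERS 8
setproperty HBDSC_MEMORY_SIZE 16000000000
setproperty COSTLY_TASKS_ONLY TRUE
setproperty HBDSC_INSTRUMENTED_FILE "instrumented_profile.txt"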

8.2.3 OpenMP Related Passes

SPIRE is designed to help generate code for different types of parallel languages, just as different parallel languages can be mapped into SPIRE. The four following passes show how this intermediate representation simplifies the task of producing code for various languages such as OpenMP and MPI (see Section 8.2.4).


OpenMP Task Generation Pass

Pass openmp_task_generation generates OpenMP task parallel code from SPIRE(PIPS IR), as detailed in Section 7.4.2. It replaces spawn constructs with the OpenMP directive omp task, and barrier constructs with omp single or omp taskwait. It outputs a new resource, parallelized_code, for the OpenMP code.

openmp_task_generation          > MODULE.parallelized_code
        < PROGRAM.entities
        < MODULE.code

OpenMP Prettyprint Pass

The next pass, print_parallelizedOMPTASK_code, is a prettyprinting function. It outputs the code decorated with OpenMP (omp task) directives, using the parallelized_code resource.

print_parallelizedOMPTASK_code  > MODULE.parallelprinted_file
        < PROGRAM.entities
        < MODULE.parallelized_code

8.2.4 MPI Related Passes: SPMDization

Automatic Data Distribution Pass

Pass hbdsc_gen_communications generates send and recv primitives. It implements the algorithm communications_construction presented in Section 7.3, Algorithm 27. It takes as input the parallel code (< symbol) and the previously computed resources of the sequential code (= symbol). It generates a new parallel code that contains communication calls.

hbdsc_gen_communications        > MODULE.code
        < PROGRAM.entities
        < MODULE.code
        = MODULE.dg
        = MODULE.schedule
        = MODULE.proper_effects
        = MODULE.preconditions
        = MODULE.regions

MPI Task Generation Pass

The same process applies for MPI. The pass mpi_task_generation generates MPI parallel code from SPIRE(PIPS IR), as detailed in Section 7.4.3. It


replaces spawn constructs with guards (if statements), barrier constructs with MPI_Barrier, send functions with MPI_Isend, and recv functions with MPI_Recv. Note that we leave the issue of the generation of hierarchical MPI as future work; our implementation generates only flat MPI code (a sequence of non-nested tasks).

mpi_task_generation             > MODULE.parallelized_code
        < PROGRAM.entities
        < MODULE.code

MPI Prettyprint Pass

Pass print_parallelizedMPI_code outputs the code decorated with MPI instructions.

print_parallelizedMPI_code      > MODULE.parallelprinted_file
        < PROGRAM.entities
        < MODULE.parallelized_code

We generate MPI code in the format described in Figure 8.3.

/* declare variables */
MPI_Init(&argc, &argv);
/* parse arguments */
/* main program */
MPI_Finalize();

Figure 8.3: A model of generated MPI code

8.3 Experimental Setting

We carried out experiments to compare the DSC and BDSC algorithms. BDSC provides significant parallelization speedups on both shared and distributed memory systems on the set of benchmarks that we present in this section. We also present the cost model parameters that depend on the target machine.

8.3.1 Benchmarks: ABF, Harris, Equake, IS and FFT

For comparing DSC and BDSC on both shared and distributed memory systems, we use five benchmarks, ABF, Harris, Equake, IS and FFT, briefly presented below.


ABF

The signal processing benchmark ABF (Adaptive Beam Forming) [51] is a 1,065-line program that performs adaptive spatial radar signal processing for an array of smart radar antennas, in order to transmit or receive signals in different directions and enhance these signals by minimizing interference and noise. It has been developed by Thales [104] and is publicly available. Figure 8.4 shows the scheduled SDG of the function main of this benchmark on three clusters (κ0, κ1 and κ2). This function contains 7 parallel loops, while the degree of task parallelism is only three (there are three parallel tasks at most). In Section 8.4, we show how we use these loops to create more parallel tasks.

Harris

The image processing algorithm Harris [53] is a 105-line corner detector used to identify points of interest. It uses several functions (operators such as auto-correlation) applied to each pixel of an input image. It is used for feature extraction, motion detection and image matching. In Harris [94], at most three tasks can be executed in parallel. Figure 6.16 on page 122 shows the scheduled SDG of Harris obtained using three processors (a.k.a. clusters).

Equake (SPEC Benchmarks)

The 1,432-line SPEC benchmark equake [22] is used in the simulation of seismic wave propagation in large valleys, based on computations performed on an unstructured mesh that locally resolves wavelengths, using the Archimedes finite element method. Equake contains four hot loops: three in the function main and one in the function smvp_opt. A part of the function main is illustrated in Figure 6.5 on page 99. We use these four loops for parallelization.

IS (NAS Parallel Benchmarks)

Integer Sort (IS) is one of the eleven benchmarks in the NAS Parallel Benchmarks suite [85]. The serial version contains 1,076 lines. We exploit the hot function rank of IS, which exhibits a degree of task parallelism of two and contains two parallel loops.

FFT

Fast Fourier Transform (FFT) is widely used in many areas of science, mathematics and engineering. We use the FFT implementation of [8]. In FFT, for N complex points, both time domain decomposition and frequency domain synthesis require three loops. The outer loop runs through the log2 N stages.


Figure 8.4: Scheduled SDG of the main function of ABF

The middle loop moves through each of the individual frequency spectra (chunks) in the stage being worked on. The innermost loop uses the butterfly pattern to compute the points for each frequency of the spectrum. We use the middle loop for parallelization, since it is parallel (each chunk can be processed in parallel).


8.3.2 Experimental Platforms: Austin and Cmmcluster

We use the two following experimental platforms for our performance evaluation tests.

Austin

The first target machine is a Linux (Ubuntu) shared memory machine with a 2-socket AMD quadcore Opteron (8 cores) and M = 16 GB of RAM, running at 2.4 GHz. We call this machine Austin in the remainder of this chapter. The installed gcc version is 4.6.3, which supports OpenMP 3.0 (we compile with gcc, using -fopenmp for OpenMP programs and -O3 as optimization flag).

Cmmcluster

The second machine that we target for our performance measurements is a Linux (RedHat) distributed memory machine with 6 dual-core Intel(R) Xeon(R) processors and M = 32 GB of RAM per processor, running at 2.5 GHz. We call this machine Cmmcluster in the remainder of this chapter. The versions gcc 4.4.6 and Open MPI 1.6.2 are installed and used for compilation (we use mpicc to compile MPI programs and -O3 as optimization flag).

8.3.3 Effective Cost Model Parameters

In the default cost model of PIPS, the number of clock cycles per instruction (CPI) is equal to 1, and the number of clock cycles for the transfer of one byte is β = 1. Instead of using this default cost model, we use an effective model for each target machine, Austin and Cmmcluster. Table 8.1 summarizes the cost model for the integer and floating-point operations of addition, subtraction, multiplication and division, plus the transfer cost of one byte. The parameters of the arithmetic operations of this table are extracted from agner.org3, which gives measurements of CPU clock cycles for AMD (Austin) and Intel (Cmmcluster) CPUs. The transfer cost of one byte in number of cycles, β, has been measured by using a simple MPI program that sends a number of bytes over the Cmmcluster network. Since Austin is a shared memory machine that we use to execute OpenMP codes, we do not need to measure β for this machine (NA).
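A minimal sketch of such a measurement program is given below (not the exact code used for Table 8.1); it times the transfer of nbytes bytes between two processes with MPI_Wtime and derives a per-byte cost, which can then be converted into cycles using the clock frequency:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* a minimal sketch; run with at least 2 MPI processes */
int main(int argc, char *argv[])
{
  int rank, nbytes = 1 << 20;
  char *buf = malloc(nbytes);

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Barrier(MPI_COMM_WORLD);
  double t0 = MPI_Wtime();
  if (rank == 0)
    MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
  else if (rank == 1)
    MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  double t1 = MPI_Wtime();
  if (rank == 1)
    printf("seconds per byte: %g\n", (t1 - t0) / nbytes);
  MPI_Finalize();
  free(buf);
  return 0;
}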

8.4 Protocol

We have extended PIPS with our implementation in C of BDSC-based hierarchical scheduling (HBDSC). To compute the static execution time and

3 http://www.agner.org/optimize/instruction_tables.pdf


Operation \ Machine        Austin          Cmmcluster
                         int    float    int    float
Addition (+)             2.5      2        3      3
Subtraction (-)          2.5      2        3      3
Multiplication (*)         3      2       11      3
Division (/)              40     30       46     39
One byte transfer (β)     NA     NA      2.5    2.5

Table 8.1: CPI for the Austin and Cmmcluster machines, plus the transfer cost of one byte β on the Cmmcluster machine (in cycles)

communication cost estimates needed by HBDSC, we relied upon the PIPS run-time complexity analysis and a more realistic, architecture-dependent communication cost matrix (see Table 8.1). For each code S of our test benchmarks, PIPS performed automatic task parallelization, applying our hierarchical scheduling process HBDSC(S, P, M, ⊥) (using either BDSC or DSC) on these sequential programs to yield a schedule σunstructured (unstructured code). Then PIPS transforms the schedule σunstructured via unstructured_to_structured(S, P, σunstructured) to yield a new schedule σstructured. PIPS automatically generated an OpenMP [4] version from the parallel statements encoded in SPIRE(PIPS IR) in σstructured(S), using omp

task directives; another version in MPI [3] was also generated. We also applied the DSC scheduling process on these benchmarks and generated the corresponding OpenMP and MPI codes. The execution times have been measured using the gettimeofday function.
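As a minimal sketch (not the exact harness we used), timing a parallelized region with gettimeofday looks as follows:

#include <stdio.h>
#include <sys/time.h>

/* a minimal sketch of the timing harness */
int main(void)
{
  struct timeval start, end;
  gettimeofday(&start, NULL);
  /* ... parallelized benchmark code ... */
  gettimeofday(&end, NULL);
  double seconds = (end.tv_sec - start.tv_sec)
                 + (end.tv_usec - start.tv_usec) * 1e-6;
  printf("execution time: %f s\n", seconds);
  return 0;
}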

Compilation times for these benchmarks were quite reasonable, the longest (equake) being 84 seconds. In this last instance, most of the time (79 seconds) was spent by PIPS to gather semantic information such as regions, complexities and dependences; our prototype implementation of HBDSC is only responsible for the remaining 5 seconds.

We ran all these parallelized codes on the two shared and distributed memory computing systems presented in Section 8.3.2. To increase the available coarse-grain task parallelism in our test suite, we have used both unmodified and modified versions of our benchmarks. We tiled and fully unrolled by hand the most costly loops in ABF (7 loops), FFT (2 loops) and equake (4 loops); the tiling factor for the BDSC version depends on the number of available processors P (tile size = number of loop iterations / P), while we had to find the proper one for DSC, since DSC puts no constraints on the number of needed processors but returns the number of processors its scheduling requires. For Harris and IS, our experiments have looked at both tiled and untiled versions of the benchmarks. The full unrolling that we apply for our experiments creates a sequence of statements from a nested loop. An example of loop tiling and full unrolling of a loop of


the ABF benchmark is illustrated in Figure 8.5. The function COR estimates covariance based on a part of the received signal sel_out, averaged on Nb_rg range gates and nb_pul pulses. Note that the applied loop tiling is equivalent to loop chunking [82].

// (a) Loop on function COR (parallel loop)
for (i = 0; i < 19; i++)
  COR(Nb_rg, nb_pul, sel_out[i], CORR_out[i]);

// (b) Loop block tiling on function COR (tile size = 5 for P = 4)
for (i = 0; i < 19; i += 5)
  for (j = i; j < min(i+5, 19); j++)
    COR(Nb_rg, nb_pul, sel_out[j], CORR_out[j]);

// (c) Full loop unrolling on function COR
for (j = 0; j < 5; j++)
  COR(Nb_rg, nb_pul, sel_out[j], CORR_out[j]);
for (j = 5; j < 10; j++)
  COR(Nb_rg, nb_pul, sel_out[j], CORR_out[j]);
for (j = 10; j < 15; j++)
  COR(Nb_rg, nb_pul, sel_out[j], CORR_out[j]);
for (j = 15; j < 19; j++)
  COR(Nb_rg, nb_pul, sel_out[j], CORR_out[j]);

Figure 8.5: Steps for the rewriting of data parallelism into task parallelism using the tiling and unrolling transformations

For our experiments, we use M = 16 GB for Austin and M = 32 GB for Cmmcluster. Thus, the schedules used for our experiments satisfy these memory sizes.

8.5 BDSC vs DSC

We apply our parallelization process to the benchmarks ABF, Harris, Equake and IS, and execute the parallel versions on both the Austin and Cmmcluster machines. As for FFT, we get an OpenMP version and study how more performance could be extracted using streaming languages.

8.5.1 Experiments on Shared Memory Systems

We measured the execution times of the parallel OpenMP codes on P = 1, 2, 4, 6 and 8 cores of Austin.

Figure 8.6 shows the performance results of the generated OpenMP code for the two versions scheduled using DSC and BDSC on ABF and equake. The speedup data show that the DSC algorithm is not scalable on these examples when the number of cores is increased; this is due to the generation


of more clusters (task creation overhead) with empty slots (poor potential parallelism and bad load balancing) than with BDSC, a costly decision given that, when the number of clusters exceeds P, they have to share the same core as multiple threads.

[Chart: speedups of ABF and equake with OpenMP (DSC vs BDSC) for 1, 2, 4, 6 and 8 cores; series: ABF-DSC, ABF-BDSC, equake-DSC, equake-BDSC; sequential times: ABF 0.35 s, equake 230 s; x-axis: number of cores; y-axis: speedup.]

Figure 8.6: ABF and equake speedups with OpenMP

Figure 8.7 presents the speedups obtained using P = 3, since the maximum parallelism in Harris is three, assuming no exploitation of data parallelism, for two parallel versions: BDSC with and BDSC without tiling of the kernel CoarsitY (we tiled by 3). The performance is given for three different input image sizes: 1024×1024, 2048×1024 and 2048×2048. The best speedup corresponds to the tiled version with BDSC, because in this case the three cores are fully loaded. The DSC version (not shown in the figure) yields the same results as our versions, because the code can be scheduled using three cores.

Figure 8.8 shows the performance results of the generated OpenMP code on the NAS benchmark IS after applying BDSC. The maximum task parallelism without tiling in IS is two, which is shown in the first subchart; the other subcharts are obtained after tiling. The program has been run with three IS input classes (A, B and C [85]). The poor performance of our implementation for Class A programs is due to the large task creation overhead, which dwarfs the potential parallelism gains, even more limited here because of the small size of Class A data.

As for FFT, after tiling and fully unrolling the middle loop of FFT and parallelizing it, an OpenMP version is generated using PIPS. Since we only exploit the parallelism of the middle loop, this is not sufficient to obtain good performance. Indeed, a maximum speedup of 1.5 is obtained for an FFT of size n = 2^20 on Austin (8 cores). The sequential outer loop of log2 n iterations explains this poor performance. However, our technique can be


[Chart: Harris speedups with OpenMP (3 threads vs sequential) for image sizes 1024×1024 (sequential = 183 ms), 2048×1024 (345 ms) and 2048×2048 (684 ms); series: Harris-OpenMP and Harris-tiled-OpenMP.]

Figure 8.7: Speedups with OpenMP: impact of tiling (P=3)

[Chart: IS speedups with OpenMP vs sequential for the benchmark versions OMP (2 tasks), OMP-tiled-2, OMP-tiled-4, OMP-tiled-6 and OMP-tiled-8; input classes A (sequential = 1.68 s), B (sequential = 4.55 s) and C (sequential = 28 s).]

Figure 8.8: Speedups with OpenMP for different class sizes (IS)

seen as a first step of parallelization for languages such as OpenStream.

OpenStream [92] is a data-flow extension of OpenMP in which dynamic dependent tasks can be defined. In [91], an efficient version of the FFT is provided where, in addition to parallelizing the middle loop (data parallelism), possible streaming parallelism between the outer loop stages is exploited. Indeed, once a chunk is executed, the chunks whose dependences are satisfied in the next stage start execution (pipeline parallelism). This way, a maximum speedup of 6.6 is obtained on a 4-socket AMD quadcore Opteron with 16 cores for an FFT of size n = 2^20 [91].

Our parallelization of FFT does not automatically generate pipeline parallelism, but automatically generates parallel tasks that may also be coupled as streams. In fact, to use OpenStream, the user has to write the input streaming code manually. We can say that, thanks to the code generated


using our approach, where the parallel tasks are automatically generated, the user just has to connect these tasks by providing (manually) the data flow necessary to exploit additional potential pipeline parallelism.

8.5.2 Experiments on Distributed Memory Systems

We measured the execution times of the parallel codes on P = 1, 2, 4 and 6 processors of Cmmcluster.

Figure 8.9 presents the speedups of the parallel MPI versions of ABF and equake vs the sequential ones, using P = 2, 4 and 6 processors. As before, the DSC algorithm is not scalable when the number of processors is increased, since the generation of more clusters with empty slots leads to higher process scheduling costs on the processors and a larger communication volume between them.

[Chart: speedups of ABF and equake with MPI (DSC vs BDSC) for 1, 2, 4 and 6 processors; series: ABF-DSC, ABF-BDSC, equake-DSC, equake-BDSC; sequential times: ABF 0.25 s, equake 160 s; x-axis: number of processors; y-axis: speedup.]

Figure 8.9: ABF and equake speedups with MPI

Figure 8.10 presents the speedups of the parallel MPI version of Harris vs the sequential one using three processors. The tiled version with BDSC gives the same result as the non-tiled version: the communication overhead is so important when the three tiled loops are scheduled on three different processors that BDSC scheduled them on the same processor; this led to a schedule equivalent to that of the non-tiled version. Compared to OpenMP, the speedups decrease when the image size is increased, because the amount of communication between processors increases. The DSC version (not shown in the figure) gives the same results as the BDSC version, because the code can be scheduled using three processors.

Figure 8.11 shows the performance results of the generated MPI code on the NAS benchmark IS after application of BDSC. The same analysis as the one for OpenMP applies here, in addition to communication overhead issues.


[Chart: Harris speedups with MPI (3 processes vs sequential) for image sizes 1024×1024 (sequential = 97 ms), 2048×1024 (244 ms) and 2048×2048 (442 ms); series: Harris-MPI and Harris-tiled-MPI.]

Figure 8.10: Speedups with MPI: impact of tiling (P=3)

[Chart: IS speedups with MPI vs sequential for the benchmark versions MPI (2 tasks), MPI-tiled-2, MPI-tiled-4 and MPI-tiled-6; input classes A (sequential = 0.26 s), B (sequential = 1.69 s) and C (sequential = 13.57 s).]

Figure 8.11: Speedups with MPI for different class sizes (IS)

8.6 Scheduling Robustness

Since our BDSC scheduling heuristic relies on numerical approximations of the execution and communication times of tasks, one needs to assess its sensitivity to the accuracy of these estimations. Since a mathematical analysis of this issue is made difficult by the heuristic nature of BDSC, and in fact of scheduling processes in general, we provide below experimental data that show that our approach is rather robust.

In practice, we ran multiple versions of each benchmark using various static execution and communication cost models:

• the naive variant, in which all execution times and communication costs are assumed constant (only data dependences are enforced during the scheduling process);

• the default BDSC cost model described above;


• a biased BDSC cost model, where we modulated each execution time and communication cost value randomly by at most ∆ (the default BDSC cost model thus corresponds to ∆ = 0); see the sketch after this list.
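Such a perturbation can be implemented as follows (a minimal sketch, assuming ∆ is given as a percentage and using the C library rand function; the actual code in our prototype may differ):

#include <stdlib.h>

/* a minimal sketch: modulate a cost (task time or edge cost, in
   cycles) by a random factor of at most +/- delta_percent;
   delta_percent = 0 returns the default BDSC cost unchanged */
static double biased_cost(double cost, double delta_percent)
{
  /* value uniformly drawn in [-delta_percent, +delta_percent] */
  double r = ((double)rand() / RAND_MAX) * 2.0 - 1.0;
  return cost * (1.0 + r * delta_percent / 100.0);
}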

Our intent with the introduction of these different cost models is to assess how small to large differences in our estimation of task times and communication costs impact the performance of BDSC-scheduled parallel code. We would expect parallelization based on the naive cost model to yield the worst schedules, thus motivating our use of complexity analysis for parallelization purposes if the schedules that use our default cost model are indeed better. Adding small random biases to task times and communication costs should not modify the schedules too much (demonstrating stability), while adding larger ones might, showing the quality of the default cost model used for parallelization.

Table 8.2 provides, for four benchmarks (Harris, ABF, equake and IS) and each execution environment (OpenMP and MPI), the worst execution time obtained within batches of about 20 runs of programs scheduled using the naive, default and biased cost models. For this last case, we only kept in the table the entries corresponding to significant values of ∆, namely those at which, for at least one benchmark, the running time changed. So, for instance, when running ABF on OpenMP, the naive approach run time is 321 ms, while BDSC clocks at 214; adding random increments of up to but not including 80% to the task communication and execution estimations provided by our cost model (Section 6.3) does not change the scheduling, and thus the running time. At 80%, the running time increases to 230, and reaches 297 when ∆ = 3,000%. As expected, the naive variant always provides

Benchmark  Language   BDSC   Naive   ∆=50(%)   80    100   200   1000   3000
Harris     OpenMP      153    277      153     153   153   153    153    277
           MPI         303    378      303     303   303   303    303    378
ABF        OpenMP      214    321      214     230   230   246    246    297
           MPI         240    310      240     260   260   287    287    310
equake     OpenMP       58    134       58      58    80    80     80    102
           MPI         106    206      106     106   162   162    162    188
IS         OpenMP       16     35       20      20    20    25     25     29
           MPI          25     50       32      32    32    39     39     46

Table 8.2: Run-time sensitivity of BDSC with respect to static cost estimation (in ms for Harris and ABF, in s for equake and IS)

schedules that have the worst execution times, thus motivating the introduction of performance estimation in the scheduling process. Even more interestingly, our experiments show that one needs to introduce rather large task time and communication cost estimation errors, i.e., values of ∆, to make the BDSC-based scheduling process switch to less efficient schedules. This set of experimental data thus suggests that BDSC is a rather useful and


robust heuristic, well adapted to the efficient parallelization of scientific benchmarks.

8.7 Faust Parallel Scheduling vs BDSC

In music the passions enjoy themselves. Friedrich Nietzsche

Faust (Functional AUdio STream) [86] is a programming language for real-time signal processing and synthesis. Faust programs (DSP code) are translated into equivalent imperative programs (C, C++, Java, etc.). In this section, we compare our approach of generating omp task directives with the Faust approach of generating omp section directives.

8.7.1 OpenMP Version in Faust

The Faust compiler produces a single sample computation loop using the scalar code generator. Then, the vectorial code generator splits the sample processing loop into several simpler loops that communicate by vectors and produces a DAG. Next, this DAG is topologically sorted in order to detect loops that can be executed in parallel. After that, the parallel scheme generates OpenMP parallel sections directives using this DAG, in which each node is a computation loop; the granularity of tasks is only at the loop level. Finally, the compiler generates OpenMP sections directives, in which a pragma section is generated for each loop.
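Schematically, the Faust back-end thus produces code of the following shape (a simplified sketch, not actual Faust output; the vectors, the gains and the function name are hypothetical), where each independent computation loop of the sorted DAG becomes one section:

/* a simplified sketch of the shape of Faust-style output */
void compute_sketch(int count, const float *input,
                    float *vec1, float *vec2)
{
  #pragma omp parallel sections
  {
    #pragma omp section
    for (int i = 0; i < count; i++)   /* first independent loop of the DAG */
      vec1[i] = input[i] * 0.5f;
    #pragma omp section
    for (int i = 0; i < count; i++)   /* second independent loop of the DAG */
      vec2[i] = input[i] * 0.25f;
  }
}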

8.7.2 Faust Programs: Karplus32 and Freeverb

In order to compare our approach with Faust's in generating task parallelism, we use Karplus32 and Freeverb as running examples [7].

Karplus32

The 200-line Karplus32 program is used to simulate the sound produced by a string excitation passed through a resonator to modify the final sound. It is an advanced version of a typical Karplus algorithm, with 32 resonators in parallel instead of only one resonator.

Freeverb

The 800-line Freeverb program is free software that implements artificial reverberation. It filters echoes to generate a signal with reverberation. It uses four Schroeder allpass filters in series and eight parallel Schroeder-Moorer filtered-feedback comb filters for each audio channel.


8.7.3 Experimental Comparison

Figure 8.12 shows the performance results of the OpenMP code generated for Karplus32 after application of BDSC and Faust, for two sample buffer sizes, count equal to 1024 and 2048. Note that BDSC is performed on the C codes of these two programs generated using Faust. The hot function in Karplus32 (compute, of 90 lines) contains only 2 sections in parallel; sequential code represents about 60% of the whole code of the function compute. Since we have only two threads that execute only two tasks, the Faust execution is a little better than BDSC's. The benefit of the dynamic scheduling of omp task cannot be shown in this case, due to the small amount of parallelism in this program.

[Chart: Karplus32 speedups (2 threads vs sequential) for count = 1024 (sequential = 12 ms) and count = 2048 (sequential = 18 ms); series: Faust and BDSC.]

Figure 8.12: Run-time comparison (BDSC vs Faust for Karplus32)

Figure 8.13 shows the performance results of the OpenMP code generated for Freeverb after application of BDSC and Faust, for two sizes, count equal to 1024 and 2048. The hot function in Freeverb (compute, of 500 lines) contains up to 16 sections in parallel (the degree of task parallelism is equal to 16); sequential code represents about 3% of the whole code of the function compute. BDSC generates better schedules, since it is based on sophisticated cost models and uses top level and bottom level values for ordering, while Faust proceeds by a topological ordering (top level only) and is based only on dependence analysis. In this case, results show that BDSC generates better schedules and that the generation of omp task instead of omp section is beneficial. Furthermore, the dynamic scheduler of OpenMP can make better decisions at run time.

8.8 Conclusion

This chapter describes the integration of SPIRE and HBDSC within the PIPS compiler infrastructure and their application to the automatic task


[Chart: Freeverb speedups (8 threads vs sequential) for count = 1024 (sequential = 40 ms) and count = 2048 (sequential = 80 ms); series: Faust and BDSC.]

Figure 8.13: Run-time comparison (BDSC vs Faust for Freeverb)

parallelization of seven sequential programs. We illustrate the impact of this integration using the signal processing benchmark ABF (Adaptive Beam Forming), the image processing benchmark Harris, the SPEC benchmark equake and the NAS parallel benchmark IS, on both shared and distributed memory systems. We also provide a comparative study between our task parallelization implementation in PIPS and that of the audio signal processing language Faust, using two Faust programs, Karplus32 and Freeverb. Experimental results show that BDSC provides more efficient schedules than DSC and the Faust scheduler in these cases.

The application of BDSC to the parallelization of the FFT (OpenMP) shows the limits of our task parallelization approach, but we suggest that an interface with streaming languages such as OpenStream could be derived from our work.

Chapter 9

Conclusion

He who misses out on the most beautiful story of his life will be left only with the age of his regrets, and all the sighs in the world could not soothe his soul. Yasmina Khadra

Designing an auto-parallelizing compiler is a big challenge: the interest in compiling for parallel architectures has exploded in the last decade with the widespread availability of manycores and multiprocessors, yet the need for automatic task parallelization, which is a tricky task, is still unsatisfied. In this thesis, we have exploited the parallelism present in applications in order to take advantage of the performance benefits that multiprocessors can provide. Indeed, we have delegated the "think parallel" mindset to the compiler, which detects parallelism in sequential codes and automatically writes the equivalent efficient parallel code.

We have developed an automatic task parallelization methodology for compilers; the key characteristics we focus on are resource constraints and static scheduling. It contains all the techniques required to decompose applications into tasks and generate equivalent parallel code, using a generic approach that targets different parallel languages and different architectures. We have applied this methodology in the existing tool PIPS, a comprehensive source-to-source compilation platform.

To conclude this dissertation, we review its main contributions and results, and present future work opportunities.

9.1 Contributions

This thesis contains five main contributions.

HBDSC (Chapter 6)

We have extended the DSC scheduling algorithm to handle simultaneously two resource constraints, namely a bounded amount of memory per processor and a bounded number of processors, which are key parameters when scheduling tasks on actual multiprocessors. We have called this extension Bounded DSC (BDSC). The practical use of BDSC in a hierarchical manner led to the development of a BDSC-based hierarchical scheduling algorithm (HBDSC).

A standard scheduling algorithm works on a DAG (a sequence of statements). To apply HBDSC to real applications, our approach was to first


build a Sequence Dependence DAG (SDG), which is the input of the BDSC algorithm. Then, we used the code, presented in the form of an AST, to define a hierarchical mapping function, which we call H, that maps each sequence statement of the code to its SDG. H is used as input of the HBDSC algorithm.

Since the volume of data used or exchanged by SDG tasks and their execution times are key factors in the BDSC scheduling process, we estimate this information as precisely as possible. We provide new cost models based on execution time complexity estimation and convex polyhedral approximations of array sizes for the labeling of SDG vertices and edges. We first look at the sizes of array regions to precisely estimate communication costs and the data sizes used by tasks. Then, we compute Ehrhart polynomials that represent the size of the messages to communicate and the volume of data, in number of bytes; they are then converted into numbers of cycles. Besides, the execution time for each task is computed using a static execution time approach based on a program complexity analysis provided by PIPS; complexities, in number of cycles, are also represented via polynomials. Finally, we use a static (when possible) or a code instrumentation method to convert these polynomials into the numerical values required by BDSC.

Path Transformer Analysis (Chapter 6)

The cost models that we define in this thesis use the affine relations of array regions to estimate the communication and data volumes induced by the execution of statements. We used a set of operations on array regions, such as the function regions_intersection (see Section 6.3.1). However, one must be careful, since region operations should be defined with respect to a common memory store. In this thesis, we introduced a new analysis that we call "path transformer" analysis. A path transformer permits the comparison of array regions of statements originally defined in different memory stores. We present a recursive algorithm for the computation of path transformers between two statements, for specific iterations of their enclosing loops (if they exist). This algorithm computes affine relations between program variables at any two points in the program, from the memory store of the first statement to the memory store of the second one, along any sequential execution that links the two memory stores.

SPIRE (Chapter 4)

An automatic parallelization platform should be adapted to different parallel languages. In this thesis, we introduced SPIRE, which is a new and general 3-step extension methodology for mapping any intermediate representation (IR) used in compilation platforms for representing sequential programming constructs to a parallel IR. This extension of an existing IR introduces (1) a parallel execution attribute for each group of statements, (2) a high-level


synchronization attribute on each statement and an API for low-level synchronization events, and (3) two built-ins for implementing communications in message-passing memory systems.

A formal semantics of SPIRE transformational definitions is specified using a two-tiered approach: a small-step operational semantics for its base parallel concepts and a rewriting mechanism for high-level constructs. The SPIRE methodology is presented via a use case, the intermediate representation of PIPS, a powerful source-to-source compilation infrastructure for Fortran and C. In addition, we apply SPIRE to another IR, namely the one of the widely used LLVM compilation infrastructure. This semantics could be used in the future for proving the legality of parallel transformations of parallel codes.

Parallel Code Transformations and Generation (Chapter 7)

In order to test the flexibility of the parallelization approach presented in this thesis, we laid the first brick of a framework of parallel code transformations and generation using our BDSC- and SPIRE-based hierarchical task parallelization techniques. We provide two parallel transformations of SPIRE-based constructs: first, from unstructured to structured parallelism, for high-level abstraction of parallel constructs; and second, the conversion of shared memory programs to distributed ones. Moreover, we show the generation of task parallelism and communications at different AST levels: equilevel and hierarchical. Finally, we generate both OpenMP (hierarchical) and MPI (flat only) from the same parallel IR derived from SPIRE.

Implementation and Experimental Results (Chapter 8)

A prototype of the algorithms presented in this thesis has been implemented in the PIPS source-to-source compilation framework: (1) HBDSC-based automatic parallelization for programs encoded in SPIRE(PIPS IR), including BDSC, HBDSC, the SDG, the cost models and the path transformer, plus DSC for comparison purposes; (2) SPIRE-based parallel code generation targeting two parallel languages, OpenMP and MPI; and (3) two SPIRE-based parallel code transformations, from unstructured to structured code and from shared to distributed memory code.

We apply our parallelization technique to scientific applications. We provide performance measurements for our parallelization approach based on five significant programs, the image and signal processing benchmarks Harris and ABF, the SPEC2001 benchmark equake, the NAS parallel benchmark IS and the FFT, targeting both shared and distributed memory architectures. We use their automatic translations into two parallel languages: OpenMP and MPI. Finally, we provide a comparative study between our task parallelization implementation in PIPS and that of the audio signal processing language Faust, using two Faust applications: Karplus32 and Freeverb. Our results illustrate the ability of SPIRE(PIPS IR) to efficiently express the parallelism present in scientific applications. Moreover, our experimental data suggest that our parallelization methodology is able to provide parallel codes that encode a significant amount of parallelism and is thus well adapted to the design of parallel target formats for the efficient parallelization of scientific applications on both shared and distributed memory systems.

9.2 Future Work

The automatic parallelization problem is a set of many tricky subproblems, as illustrated throughout this thesis and in the summary of contributions presented above. Many new ideas and opportunities for further research have emerged all along this work.

SPIRE

SPIRE is a "work in progress" for the extension of sequential IRs to completely fit multiple languages and different models. Future work will address the representation via SPIRE of the PGAS memory model, of other parallel execution paradigms such as speculation, and of more programming features such as exceptions. Another interesting question is to see how many of the existing program transformations performed by compilers using sequential IRs can be preserved when these are extended via SPIRE.

HBDSC Scheduling

HBDSC assumes that all processors are homogeneous; it does not handle heterogeneous devices efficiently. However, such architectures could probably be accommodated easily by adding another dimension to our cost models. We suggest adding a "which processor" parameter to all the functions defining the cost models, such as task_time (whose value would depend on which processor the task is mapped on) and edge_cost (whose value would depend on which processors the two tasks of the edge are mapped on). Moreover, each processor κi has its own memory Mi. Thus, our parallelization process may subsequently target an adapted language for heterogeneous architectures such as StarSs [21]. StarSs introduces an extension to the task mechanism in OpenMP 3.0. It adds many features to the construct pragma omp task, namely (1) dependences between tasks, using the clauses input, output and inout, and (2) the targeting of a task to hardware accelerators, using pragma omp target device(device-name-list), knowing that an accelerator can be an SPE, a GPU, etc.
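
A hedged sketch of what such StarSs-like annotations could look like on the Harris functions (the clause names follow the description above; the device name is purely illustrative):

#pragma omp task output(Gx)
SobelX(Gx, in);

#pragma omp task output(Gy)
SobelY(Gy, in);

// Runs only after both producers, thanks to its input clauses
#pragma omp task input(Gx, Gy) output(Ixy)
MultiplY(Ixy, Gx, Gy);

// Offloading a task to a hardware accelerator (SPE, GPU, ...)
#pragma omp target device(gpu)
#pragma omp task input(Ixy) output(Sxy)
Gauss(Sxy, Ixy);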


Moreover, it might be interesting to try to find a more efficient hierarchical processor allocation strategy, which would address load balancing and task granularity issues in order to yield an even better HBDSC-based parallelization process.

Cost Models

Currently, there are some cases where our cost models cannot provide precise or even approximate information. When the array is a pointer (for example with the declaration float *A, where A is then used as an array), the compiler has no information about the size of the array; for such cases, we currently have to change the programmer's code manually to help the compiler. Moreover, the complexity estimation of a while loop is not yet computed, but there is already ongoing work at CRI on this issue, based on transformer analysis.
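
A minimal example of the problematic case, together with one possible manual change (our own illustration, not necessarily the exact rewriting used in our experiments):

// The region accessed through A depends on n, but the declaration
// "float *A" carries no array extent the compiler could exploit.
void scale(float *A, int n) {
  for (int i = 0; i < n; i++)
    A[i] = 2.0f * A[i];
}

// Possible manual help: expose the extent in the type (C99 VLA parameter).
void scale_sized(int n, float A[n]) {
  for (int i = 0; i < n; i++)
    A[i] = 2.0f * A[i];
}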

Parallel Code Transformations and Generation

Future work could address more transformations for parallel languages (à la [82]) encoded in SPIRE. Moreover, the optimization of communications in the generated code, by eliminating redundant communications and aggregating small messages into larger ones, is an important issue, since our current implementation communicates more data than necessary.

In our implementation of MPI code generation from SPIRE code, we left the implementation of hierarchical MPI as future work. Besides, another type of hierarchical MPI can be obtained by merging MPI and OpenMP together (hybrid programming). However, since we use in this thesis a hierarchical approach in both HBDSC and SPIRE generation, OpenMP can be seen in this case as a sub-hierarchical model of MPI, i.e., OpenMP tasks could be enclosed within MPI ones.
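
A small, hypothetical sketch of such a nesting, in which each MPI process executes one outer task whose body is itself parallelized with OpenMP tasks:

#include <mpi.h>

// Inner level: OpenMP tasks inside one MPI process.
static void outer_task_body(double *a, double *b, int n) {
  #pragma omp parallel
  #pragma omp single
  {
    #pragma omp task
    for (int i = 0; i < n; i++) a[i] *= 2.0;
    #pragma omp task
    for (int i = 0; i < n; i++) b[i] += 1.0;
    #pragma omp taskwait
  }
}

// Outer level: flat MPI between processes.
void hybrid(double *a, double *b, int n) {
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {
    outer_task_body(a, b, n);
    MPI_Send(a, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
  } else if (rank == 1) {
    MPI_Recv(a, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    outer_task_body(a, b, n);
  }
}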

"In the eye of the small, small things are huge, but for the great souls, huge things appear small." (Al-Mutanabbi)

Bibliography

[1] The Mandelbrot Set. http://warp.povusers.org/Mandelbrot/.

[2] Cilk 5.4.6 Reference Manual. Supercomputing Technologies Group, MIT Laboratory for Computer Science. http://supertech.lcs.mit.edu/cilk/, 1998.

[3] MPI: A Message-Passing Interface Standard. http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf, Sep. 2012.

[4] OpenMP Application Program Interface. http://www.openmp.org/mp-documents/OpenMP_4.0_RC2.pdf, Mar. 2013.

[5] X10 Language Specification Feb 2013

[6] Chapel Language Specification 0.796. Cray Inc., 901 Fifth Avenue, Suite 1000, Seattle, WA 98164, Oct. 21, 2010.

[7] Faust, an Audio Signal Processing Language. http://faust.grame.fr.

[8] StreamIt, a Programming Language and a Compilation Infrastructure. http://groups.csail.mit.edu/cag/streamit.

[9] B. Ackland, A. Anesko, D. Brinthaupt, S. Daubert, A. Kalavade, J. Knobloch, E. Micca, M. Moturi, C. Nicol, J. O'Neill, J. Othmer, E. Sackinger, K. Singh, J. Sweet, C. Terman, and J. Williams. A single-chip, 1.6-billion, 16-b MAC/s multiprocessor DSP. Solid-State Circuits, IEEE Journal of, 35(3):412–424, Mar. 2000.

[10] T L Adam K M Chandy and J R Dickson A Comparison ofList Schedules For Parallel Processing Systems Commun ACM17(12)685ndash690 Dec 1974

[11] S V Adve and K Gharachorloo Shared Memory Consistency ModelsA Tutorial IEEE Computer 2966ndash76 1996

[12] S Alam R Barrett J Kuehn and S Poole Performance Char-acterization of a Hierarchical MPI Implementation on Large-scaleDistributed-memory Platforms In Parallel Processing 2009 ICPPrsquo09 International Conference on pages 132ndash139 2009

[13] E Allen D Chase J Hallett V Luchangco J-W Maessen S RyuJr and S Tobin-Hochstadt The Fortress Language SpecificationTechnical report Sun Microsystems Inc 2007


[14] R Allen and K Kennedy Automatic Translation of FORTRAN Pro-grams to Vector Form ACM Trans Program Lang Syst 9(4)491ndash542 Oct 1987

[15] S P Amarasinghe and M S Lam Communication Optimization andCode Generation for Distributed Memory Machines In Proceedings ofthe ACM SIGPLAN 1993 conference on Programming language designand implementation PLDI rsquo93 pages 126ndash138 New York NY USA1993 ACM

[16] G M Amdahl Validity of the Single Processor Approach to AchievingLarge Scale Computing Capabilities Solid-State Circuits NewsletterIEEE 12(3)19 ndash20 summer 2007

[17] M Amini F Coelho F Irigoin and R Keryell Static CompilationAnalysis for Host-Accelerator Communication Optimization In 24thInt Workshop on Languages and Compilers for Parallel Computing(LCPC) Fort Collins Colorado USA Sep 2011 Also Technical Re-port MINES ParisTech A476CRI

[18] C Ancourt F Coelho and F Irigoin A Modular Static AnalysisApproach to Affine Loop Invariants Detection Electr Notes TheorComput Sci 267(1)3ndash16 2010

[19] C Ancourt B Creusillet F Coelho F Irigoin P Jouvelot andR Keryell PIPS a Workbench for Interprocedural Program Anal-yses and Parallelization In Meeting on data parallel languages andcompilers for portable parallel computing 1994

[20] C Ancourt and F Irigoin Scanning Polyhedra with DO Loops InProceedings of the third ACM SIGPLAN Symposium on Principlesand Practice of Parallel Programming PPOPP rsquo91 pages 39ndash50 NewYork NY USA 1991 ACM

[21] E Ayguade R M Badia P Bellens D Cabrera A Duran R Fer-rer M Gonzalez F D Igual D Jimenez-Gonzalez and J LabartaExtending OpenMP to Survive the Heterogeneous Multi-Core EraInternational Journal of Parallel Programming pages 440ndash459 2010

[22] H. Bao, J. Bielak, O. Ghattas, L. F. Kallivokas, D. R. O'Hallaron, J. R. Shewchuk, and J. Xu. Large-scale Simulation of Elastic Wave Propagation in Heterogeneous Media on Parallel Computers. Computer Methods in Applied Mechanics and Engineering, 152(1–2):85–102, 1998.

[23] A. Basumallik, S.-J. Min, and R. Eigenmann. Programming Distributed Memory Systems Using OpenMP. In Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International, pages 1–8, 2007.

[24] N Benoit and S Louise Kimble a Hierarchical Intermediate Rep-resentation for Multi-Grain Parallelism In F Bouchez S Hack andE Visser editors Proceedings of the Workshop on Intermediate Rep-resentations pages 21ndash28 2011

[25] G Blake R Dreslinski and T Mudge A Survey of Multicore Pro-cessors Signal Processing Magazine IEEE 26(6)26ndash37 Nov 2009

[26] R D Blumofe C F Joerg B C Kuszmaul C E Leiserson K HRandall and Y Zhou Cilk An Efficient Multithreaded RuntimeSystem In Journal of Parallel and Distributed Computing pages 207ndash216 1995

[27] U Bondhugula Automatic Distributed Memory Code Generationusing the Polyhedral Framework Technical Report IISc-CSA-TR-2011-3 Indian Institute of Science Bangalore 2011

[28] C. T. Brown, L. S. Liebovitch, and R. Glendon. Lévy Flights in Dobe Ju/'hoansi Foraging Patterns. Human Ecology, 35(1):129–138, 2007.

[29] S Campanoni T Jones G Holloway V J Reddi G-Y Wei andD Brooks HELIX Automatic Parallelization of Irregular Programsfor Chip Multiprocessing In Proceedings of the Tenth InternationalSymposium on Code Generation and Optimization CGO rsquo12 pages84ndash93 New York NY USA 2012 ACM

[30] V Cave J Zhao and V Sarkar Habanero-Java the New Adventuresof Old X10 In 9th International Conference on the Principles andPractice of Programming in Java (PPPJ) Aug 2011

[31] Y Choi Y Lin N Chong S Mahlke and T Mudge Stream Compi-lation for Real-Time Embedded Multicore Systems In Proceedings ofthe 7th annual IEEEACM International Symposium on Code Gener-ation and Optimization CGO rsquo09 pages 210ndash220 Washington DCUSA 2009

[32] F Coelho P Jouvelot C Ancourt and F Irigoin Data and ProcessAbstraction in PIPS Internal Representation In F Bouchez S Hackand E Visser editors Proceedings of the Workshop on IntermediateRepresentations pages 77ndash84 2011

[33] T. U. Consortium. http://upc.gwu.edu/documentation.html, Jun. 2005.


[34] B. Creusillet. Analyses de Régions de Tableaux et Applications. PhD thesis, MINES ParisTech, A/295/CRI, Décembre 1996.

[35] B Creusillet and F Irigoin Interprocedural Array Region AnalysesInternational Journal of Parallel Programming 24(6)513ndash546 1996

[36] E Cuevas A Garcia F J JFernandez R J Gadea and J CordonImportance of Simulations for Nuclear and Aeronautical Inspectionswith Ultrasonic and Eddy Current Testing Simulation in NDT OnlineWorkshop in wwwndtnet Sep 2010

[37] R Cytron J Ferrante B K Rosen M N Wegman and F K ZadeckEfficiently Computing Static Single Assignment Form and the ControlDependence Graph ACM Trans Program Lang Syst 13(4)451ndash490Oct 1991

[38] M I Daoud and N N Kharma GATS 10 a Novel GA-basedScheduling Algorithm for Task Scheduling on Heterogeneous ProcessorNets In GECCOrsquo05 pages 2209ndash2210 2005

[39] J B Dennis G R Gao and K W Todd Modeling The WeatherWith a Data Flow Supercomputer IEEE Trans Computers pages592ndash603 1984

[40] E. W. Dijkstra. Re: "Formal Derivation of Strongly Correct Parallel Programs" by Axel van Lamsweerde and M. Sintzoff. Circulated privately, 1977.

[41] E. Ehrhart. Polynômes arithmétiques et méthode des polyèdres en combinatoire. International Series of Numerical Mathematics, page 35, 1977.

[42] A. E. Eichenberger, J. K. O'Brien, K. M. O'Brien, P. Wu, T. Chen, P. H. Oden, D. A. Prener, J. C. Shepherd, B. So, Z. Sura, A. Wang, T. Zhang, P. Zhao, M. K. Gschwind, R. Archambault, Y. Gao, and R. Koo. Using Advanced Compiler Technology to Exploit the Performance of the Cell Broadband Engine Architecture. IBM Syst. J., 45(1):59–84, Jan. 2006.

[43] P Feautrier Dataflow Analysis of Array and Scalar References In-ternational Journal of Parallel Programming 20 1991

[44] J Ferrante K J Ottenstein and J D Warren The Program De-pendence Graph And Its Use In Optimization ACM Trans ProgramLang Syst 9(3)319ndash349 July 1987

[45] M J Flynn Some Computer Organizations and Their EffectivenessComputers IEEE Transactions on C-21(9)948 ndash960 Sep 1972


[46] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1990.

[47] A Gerasoulis S Venugopal and T Yang Clustering Task Graphsfor Message Passing Architectures SIGARCH Comput Archit News18(3b)447ndash456 June 1990

[48] M Girkar and C D Polychronopoulos Automatic Extraction of Func-tional Parallelism from Ordinary Programs IEEE Trans Parallel Dis-trib Syst 3166ndash178 Mar 1992

[49] G Goumas N Drosinos M Athanasaki and N Koziris Message-Passing Code Generation for Non-rectangular Tiling TransformationsParallel Computing 32 2006

[50] Graphviz, Graph Visualization Software. http://www.graphviz.org.

[51] L Griffiths A Simple Adaptive Algorithm for Real-Time Processingin Antenna Arrays Proceedings of the IEEE 571696 ndash 1704 1969

[52] R Habel F Silber-Chaussumier and F Irigoin Generating EfficientParallel Programs for Distributed Memory Systems Technical ReportCRIA-523 MINES ParisTech and Telecom SudParis 2013

[53] C Harris and M Stephens A Combined Corner and Edge Detector InProceedings of the 4th Alvey Vision Conference pages 147ndash151 1988

[54] M J Harrold B Malloy and G Rothermel Efficient Construc-tion of Program Dependence Graphs SIGSOFT Softw Eng Notes18(3)160ndash170 July 1993

[55] J Howard S Dighe Y Hoskote S Vangal D Finan G RuhlD Jenkins H Wilson N Borkar G Schrom F Pailet S Jain T Ja-cob S Yada S Marella P Salihundam V Erraguntla M KonowM Riepen G Droege J Lindemann M Gries T Apel K HenrissT Lund-Larsen S Steibl S Borkar V De R Van der Wijngaartand T Mattson A 48-Core IA-32 message-passing processor withDVFS in 45nm CMOS In Solid-State Circuits Conference Digest ofTechnical Papers (ISSCC) 2010 IEEE International pages 108ndash109Feb

[56] A Hurson J T Lim K M Kavi and B Lee Parallelization ofDOALL and DOACROSS Loops ndash a Survey In M V Zelkowitzeditor Emphasizing Parallel Programming Techniques volume 45 ofAdvances in Computers pages 53 ndash 103 Elsevier 1997

[57] Insieme. Insieme: an Optimization System for OpenMP, MPI and OpenCL Programs. http://www.dps.uibk.ac.at/insieme/architecture.html, 2011.

[58] F Irigoin P Jouvelot and R Triolet Semantical InterproceduralParallelization An Overview of the PIPS Project In ICS pages 244ndash251 1991

[59] K Jainandunsing Optimal Partitioning Scheme for WavefrontSys-tolic Array Processors In Proceedings of IEEE Symposium on Circuitsand Systems pages 940ndash943 1986

[60] P Jouvelot and R Triolet Newgen A Language Independent Pro-gram Generator Technical report CRIA-191 MINES ParisTech Jul1989

[61] G Karypis and V Kumar A Fast and High Quality Multilevel Schemefor Partitioning Irregular Graphs SIAM Journal on Scientific Com-puting 1998

[62] H Kasahara H Honda A Mogi A Ogura K Fujiwara andS Narita A Multi-Grain Parallelizing Compilation Scheme for OS-CAR (Optimally Scheduled Advanced Multiprocessor) In Proceedingsof the Fourth International Workshop on Languages and Compilersfor Parallel Computing pages 283ndash297 London UK 1992 Springer-Verlag

[63] A. Kayi and T. El-Ghazawi. PGAS Languages. wsdmhp09.hpcl.gwu.edu/kayi.pdf, 2009.

[64] P Kenyon P Agrawal and S Seth High-Level Microprogrammingan Optimizing C Compiler for a Processing Element of a CAD Accel-erator In Microprogramming and Microarchitecture Micro 23 Pro-ceedings of the 23rd Annual Workshop and Symposium Workshop onpages 97ndash106 nov 1990

[65] D Khaldi P Jouvelot and C Ancourt Parallelizing with BDSCa Resource-Constrained Scheduling Algorithm for Shared and Dis-tributed Memory Systems Technical Report CRIA-499 (Submittedto Parallel Computing) MINES ParisTech 2012

[66] D Khaldi P Jouvelot C Ancourt and F Irigoin Task Parallelismand Data Distribution An Overview of Explicit Parallel ProgrammingLanguages In H Kasahara and K Kimura editors LCPC volume7760 of Lecture Notes in Computer Science pages 174ndash189 Springer2012


[67] D. Khaldi, P. Jouvelot, F. Irigoin, and C. Ancourt. SPIRE: A Methodology for Sequential to Parallel Intermediate Representation Extension. In Proceedings of the 17th Workshop on Compilers for Parallel Computing, CPC '13, 2013.

[68] M A Khan Scheduling for Heterogeneous Systems using ConstrainedCritical Paths Parallel Computing 38(4ndash5)175 ndash 193 2012

[69] Khronos OpenCL Working Group The OpenCL Specification Nov2012

[70] B Kruatrachue and T G Lewis Duplication Scheduling Heuristics(DSH) A New Precedence Task Scheduler for Parallel Processor Sys-tems Oregon State University Corvallis OR 1987

[71] S Kumar D Kim M Smelyanskiy Y-K Chen J Chhugani C JHughes C Kim V W Lee and A D Nguyen Atomic Vector Op-erations on Chip Multiprocessors SIGARCH Comput Archit News36(3)441ndash452 June 2008

[72] Y-K Kwok and I Ahmad Static Scheduling Algorithms for Allocat-ing Directed Task Graphs to Multiprocessors ACM Comput Surv31406ndash471 Dec 1999

[73] J Larus and C Kozyrakis Transactional Memory Commun ACM5180ndash88 Jul 2008

[74] S Lee S-J Min and R Eigenmann OpenMP to GPGPU a CompilerFramework for Automatic Translation and Optimization In Proceed-ings of the 14th ACM SIGPLAN symposium on Principles and practiceof parallel programming PPoPP rsquo09 pages 101ndash110 New York NYUSA 2009 ACM

[75] The LLVM Development Team. The LLVM Reference Manual (Version 2.6). http://llvm.org/docs/LangRef.html, Mar. 2010.

[76] V Maisonneuve Convex Invariant Refinement by Control Node Split-ting a Heuristic Approach Electron Notes Theor Comput Sci28849ndash59 Dec 2012

[77] J Merrill GENERIC and GIMPLE a New Tree Representation forEntire Functions In GCC developers summit 2003 pages 171ndash1802003

[78] D. Millot, A. Muller, C. Parrot, and F. Silber-Chaussumier. STEP: a Distributed OpenMP for Coarse-Grain Parallelism Tool. In Proceedings of the 4th International Conference on OpenMP in a New Era of Parallelism, IWOMP '08, pages 83–99, Berlin, Heidelberg, 2008. Springer-Verlag.


[79] D I Moldovan and J A B Fortes Partitioning and Mapping Algo-rithms into Fixed Size Systolic Arrays IEEE Trans Comput 35(1)1ndash12 Jan 1986

[80] A Moody D Ahn and B Supinski Exascale Algorithms for Gener-alized MPI Comm split In Y Cotronis A Danalis D Nikolopoulosand J Dongarra editors Recent Advances in the Message PassingInterface volume 6960 of Lecture Notes in Computer Science pages9ndash18 Springer Berlin Heidelberg 2011

[81] G Moore Cramming More Components Onto Integrated CircuitsProceedings of the IEEE 86(1)82 ndash85 jan 1998

[82] V K Nandivada J Shirako J Zhao and V Sarkar A Transforma-tion Framework for Optimizing Task-Parallel Programs ACM TransProgram Lang Syst 35(1)31ndash348 Apr 2013

[83] C J Newburn and J P Shen Automatic Partitioning of Signal Pro-cessing Programs for Symmetric Multiprocessors In IEEE PACTpages 269ndash280 IEEE Computer Society 1996

[84] D Novillo OpenMP and Automatic Parallelization in GCC In theProceedings of the GCC Developers Summit Jun 2006

[85] NPB. NAS Parallel Benchmarks. http://www.nas.nasa.gov/publications/npb.html.

[86] Y Orlarey D Fober and S Letz Adding Automatic Parallelizationto Faust In Linux Audio Conference 2009

[87] D A Padua editor Encyclopedia of Parallel Computing Springer2011

[88] D A Padua and M J Wolfe Advanced Compiler Optimizations forSupercomputers Commun ACM 291184ndash1201 Dec 1986

[89] S Pai R Govindarajan and M J Thazhuthaveetil PLASMAPortable Programming for SIMD Heterogeneous Accelerators InWorkshop on Language Compiler and Architecture Support forGPGPU held in conjunction with HPCAPPoPP 2010 BangaloreIndia Jan 9 2010

[90] PolyLib, A Library of Polyhedral Functions. http://icps.u-strasbg.fr/polylib/.

[91] A. Pop and A. Cohen. A Stream-Computing Extension to OpenMP. In Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers, HiPEAC '11, pages 5–14, New York, NY, USA, 2011. ACM.


[92] A Pop and A Cohen OpenStream Expressiveness and Data-FlowCompilation of OpenMP Streaming Programs ACM Trans ArchitCode Optim 9(4)531ndash5325 Jan 2013

[93] T. Saidani. Optimisation multi-niveau d'une application de traitement d'images sur machines parallèles. PhD thesis, Université Paris-Sud, Nov. 2012.

[94] T Saidani L Lacassagne J Falcou C Tadonki and S Bouaziz Par-allelization Schemes for Memory Optimization on the Cell ProcessorA Case Study on the Harris Corner Detector In P Stenstrom editorTransactions on High-Performance Embedded Architectures and Com-pilers III volume 6590 of Lecture Notes in Computer Science pages177ndash200 Springer Berlin Heidelberg 2011

[95] V Sarkar Synchronization Using Counting Semaphores In ICSrsquo88pages 627ndash637 1988

[96] V Sarkar Partitioning and Scheduling Parallel Programs for Multi-processors MIT Press Cambridge MA USA 1989

[97] V. Sarkar. COMP 322: Principles of Parallel Programming. http://www.cs.rice.edu/~vs3/PDF/comp322-lec9-f09-v4.pdf, 2009.

[98] V Sarkar and B Simons Parallel Program Graphs and their Classi-fication In U Banerjee D Gelernter A Nicolau and D A Paduaeditors LCPC volume 768 of Lecture Notes in Computer Sciencepages 633ndash655 Springer 1993

[99] L Seiler D Carmean E Sprangle T Forsyth P Dubey S JunkinsA Lake R Cavin R Espasa E Grochowski T Juan M AbrashJ Sugerman and P Hanrahan Larrabee A Many-Core x86 Archi-tecture for Visual Computing Micro IEEE 29(1)10ndash21 2009

[100] J Shirako D M Peixotto V Sarkar and W N Scherer PhasersA Unified Deadlock-Free Construct for Collective and Point-To-PointSynchronization In ICSrsquo08 pages 277ndash288 New York NY USA2008 ACM

[101] M Solar A Scheduling Algorithm to Optimize Parallel ProcessesIn Chilean Computer Science Society 2008 SCCC rsquo08 InternationalConference of the pages 73ndash78 2008

[102] M Solar and M Inostroza A Scheduling Algorithm to Optimize Real-World Applications In Proceedings of the 24th International Confer-ence on Distributed Computing Systems Workshops - W7 EC (ICD-CSWrsquo04) - Volume 7 ICDCSW rsquo04 pages 858ndash862 Washington DCUSA 2004 IEEE Computer Society


[103] Supercomputing Technologies Group, Massachusetts Institute of Technology Laboratory for Computer Science. Cilk 5.4.6 Reference Manual, Nov. 2001.

[104] Thales. THALES Group. http://www.thalesgroup.com.

[105] H Topcuouglu S Hariri and M-y Wu Performance-Effectiveand Low-Complexity Task Scheduling for Heterogeneous ComputingIEEE Trans Parallel Distrib Syst 13(3)260ndash274 Mar 2002

[106] S Verdoolaege A Cohen and A Beletska Transitive Closures ofAffine Integer Tuple Relations and Their Overapproximations InProceedings of the 18th International Conference on Static AnalysisSASrsquo11 pages 216ndash232 Berlin Heidelberg 2011 Springer-Verlag

[107] M-Y Wu and D D Gajski Hypertool A Programming Aid ForMessage-Passing Systems IEEE Trans on Parallel and DistributedSystems 1330ndash343 1990

[108] T Yang and A Gerasoulis DSC Scheduling Parallel Tasks on an Un-bounded Number of Processors IEEE Trans Parallel Distrib Syst5(9)951ndash967 1994

[109] X.-S. Yang and S. Deb. Cuckoo Search via Lévy Flights. In Nature & Biologically Inspired Computing, 2009. NaBIC 2009. World Congress on, pages 210–214, 2009.

[110] K Yelick D Bonachea W-Y Chen P Colella K Datta J Du-ell S L Graham P Hargrove P Hilfinger P Husbands C IancuA Kamil R Nishtala J Su W Michael and T Wen Productivityand Performance Using Partitioned Global Address Space LanguagesIn Proceedings of the 2007 International Workshop on Parallel Sym-bolic Computation PASCO rsquo07 pages 24ndash32 New York NY USA2007 ACM

[111] J. Zhao and V. Sarkar. Intermediate Language Extensions for Parallelism. In Proceedings of the compilation of the co-located workshops on DSM'11, TMC'11, AGERE!'11, AOOPES'11, NEAT'11 & VMIL'11, SPLASH '11 Workshops, pages 329–340, New York, NY, USA, 2011. ACM.

Résumé

Le but de cette thèse est d'exploiter efficacement le parallélisme présent dans les applications informatiques séquentielles afin de bénéficier des performances fournies par les multiprocesseurs, en utilisant une nouvelle méthodologie pour la parallélisation automatique des tâches au sein des compilateurs. Les caractéristiques clés de notre approche sont la prise en compte des contraintes de ressources et le caractère statique de l'ordonnancement des tâches. Notre méthodologie contient les techniques nécessaires pour la décomposition des applications en tâches et la génération de code parallèle équivalent, en utilisant une approche générique qui vise différents langages et architectures parallèles. Nous implémentons cette méthodologie dans le compilateur source-à-source PIPS. Cette thèse répond principalement à trois questions. Primo, comme l'extraction du parallélisme de tâches des codes séquentiels est un problème d'ordonnancement, nous concevons et implémentons un algorithme d'ordonnancement efficace, que nous nommons BDSC, pour la détection du parallélisme ; le résultat est un SDG ordonnancé, qui est une nouvelle structure de données de graphe de tâches. Secundo, nous proposons une nouvelle extension générique des représentations intermédiaires séquentielles en des représentations intermédiaires parallèles, que nous nommons SPIRE, pour la représentation des codes parallèles. Enfin, nous développons, en utilisant BDSC et SPIRE, un générateur de code que nous intégrons dans PIPS. Ce générateur de code cible les systèmes à mémoire partagée et à mémoire distribuée via des codes OpenMP et MPI générés automatiquement.

Contents

Abstract
Résumé
1 Introduction
  1.1 Context
  1.2 Motivation
  1.3 Contributions
  1.4 Thesis Outline
2 Perspectives: Bridging Parallel Architectures and Software Parallelism
  2.1 Parallel Architectures
    2.1.1 Multiprocessors
    2.1.2 Memory Models
  2.2 Parallelism Paradigms
  2.3 Mapping Paradigms to Architectures
  2.4 Program Dependence Graph (PDG)
    2.4.1 Control Dependence Graph (CDG)
    2.4.2 Data Dependence Graph (DDG)
  2.5 List Scheduling
    2.5.1 Background
    2.5.2 Algorithm
  2.6 PIPS: Automatic Parallelizer and Code Transformation Framework
    2.6.1 Transformer Analysis
    2.6.2 Precondition Analysis
    2.6.3 Array Region Analysis
    2.6.4 "Complexity" Analysis
    2.6.5 PIPS (Sequential) IR
  2.7 Conclusion
3 Concepts in Task Parallel Programming Languages
  3.1 Introduction
  3.2 Mandelbrot Set Computation
  3.3 Task Parallelism Issues
    3.3.1 Task Creation
    3.3.2 Synchronization
    3.3.3 Atomicity
  3.4 Parallel Programming Languages
    3.4.1 Cilk
    3.4.2 Chapel
    3.4.3 X10 and Habanero-Java
    3.4.4 OpenMP
    3.4.5 MPI
    3.4.6 OpenCL
  3.5 Discussion and Comparison
  3.6 Conclusion
4 SPIRE: A Generic Sequential to Parallel Intermediate Representation Extension Methodology
  4.1 Introduction
  4.2 SPIRE, a Sequential to Parallel IR Extension
    4.2.1 Design Approach
    4.2.2 Execution
    4.2.3 Synchronization
    4.2.4 Data Distribution
  4.3 SPIRE Operational Semantics
    4.3.1 Sequential Core Language
    4.3.2 SPIRE as a Language Transformer
    4.3.3 Rewriting Rules
  4.4 Validation: SPIRE Application to LLVM IR
  4.5 Related Work: Parallel Intermediate Representations
  4.6 Conclusion
5 BDSC: A Memory-Constrained, Number of Processor-Bounded Extension of DSC
  5.1 Introduction
  5.2 The DSC List-Scheduling Algorithm
    5.2.1 The DSC Algorithm
    5.2.2 Dominant Sequence Length Reduction Warranty
  5.3 BDSC: A Resource-Constrained Extension of DSC
    5.3.1 DSC Weaknesses
    5.3.2 Resource Modeling
    5.3.3 Resource Constraint Warranty
    5.3.4 Efficient Task-to-Cluster Mapping
    5.3.5 The BDSC Algorithm
  5.4 Related Work: Scheduling Algorithms
  5.5 Conclusion
6 BDSC-Based Hierarchical Task Parallelization
  6.1 Introduction
  6.2 Hierarchical Sequence Dependence DAG Mapping
    6.2.1 Sequence Dependence DAG (SDG)
    6.2.2 Hierarchical SDG Mapping
  6.3 Sequential Cost Models Generation
    6.3.1 From Convex Polyhedra to Ehrhart Polynomials
    6.3.2 From Polynomials to Values
  6.4 Reachability Analysis: The Path Transformer
    6.4.1 Path Definition
    6.4.2 Path Transformer Algorithm
    6.4.3 Operations on Regions using Path Transformers
  6.5 BDSC-Based Hierarchical Scheduling (HBDSC)
    6.5.1 Closure of SDGs
    6.5.2 Recursive Top-Down Scheduling
    6.5.3 Iterative Scheduling for Resource Optimization
    6.5.4 Complexity of HBDSC Algorithm
    6.5.5 Parallel Cost Models
  6.6 Related Work: Task Parallelization Tools
  6.7 Conclusion
7 SPIRE-Based Parallel Code Transformations and Generation
  7.1 Introduction
  7.2 Parallel Unstructured to Structured Transformation
    7.2.1 Structuring Parallelism
    7.2.2 Hierarchical Parallelism
  7.3 From Shared Memory to Distributed Memory Transformation
    7.3.1 Related Work: Communications Generation
    7.3.2 Difficulties
    7.3.3 Equilevel and Hierarchical Communications
    7.3.4 Equilevel Communications Insertion Algorithm
    7.3.5 Hierarchical Communications Insertion Algorithm
    7.3.6 Communications Insertion Main Algorithm
    7.3.7 Conclusion and Future Work
  7.4 Parallel Code Generation
    7.4.1 Mapping Approach
    7.4.2 OpenMP Generation
    7.4.3 SPMDization: MPI Generation
  7.5 Conclusion
8 Experimental Evaluation with PIPS
  8.1 Introduction
  8.2 Implementation of HBDSC- and SPIRE-Based Parallelization Processes in PIPS
    8.2.1 Preliminary Passes
    8.2.2 Task Parallelization Passes
    8.2.3 OpenMP Related Passes
    8.2.4 MPI Related Passes: SPMDization
  8.3 Experimental Setting
    8.3.1 Benchmarks: ABF, Harris, Equake, IS and FFT
    8.3.2 Experimental Platforms: Austin and Cmmcluster
    8.3.3 Effective Cost Model Parameters
  8.4 Protocol
  8.5 BDSC vs. DSC
    8.5.1 Experiments on Shared Memory Systems
    8.5.2 Experiments on Distributed Memory Systems
  8.6 Scheduling Robustness
  8.7 Faust Parallel Scheduling vs. BDSC
    8.7.1 OpenMP Version in Faust
    8.7.2 Faust Programs: Karplus32 and Freeverb
    8.7.3 Experimental Comparison
  8.8 Conclusion
9 Conclusion
  9.1 Contributions
  9.2 Future Work

List of Tables

2.1 Multicores proliferation
2.2 Execution time estimations for Harris functions using PIPS complexity analysis
3.1 Summary of parallel languages constructs
4.1 Mapping of SPIRE to parallel languages constructs (terms in parentheses are not currently handled by SPIRE)
6.1 Execution and communication time estimations for Harris using PIPS default cost model (N and M variables represent the input image size)
6.2 Comparison summary between different parallelization tools
8.1 CPI for the Austin and Cmmcluster machines, plus the transfer cost of one byte β on the Cmmcluster machine (in cycles)
8.2 Run-time sensitivity of BDSC with respect to static cost estimation (in ms for Harris and ABF, in s for equake and IS)

List of Figures

1.1 Sequential C implementation of the main function of Harris
1.2 Harris algorithm data flow graph
1.3 Thesis outline: blue indicates thesis contributions, an ellipse a process, and a rectangle results
2.1 A typical multiprocessor architectural model
2.2 Memory models
2.3 Rewriting of data parallelism using the task parallelism model
2.4 Construction of the control dependence graph
2.5 Example of a C code and its data dependence graph
2.6 A Directed Acyclic Graph (left) and its associated data (right)
2.7 Example of transformer analysis
2.8 Example of precondition analysis
2.9 Example of array region analysis
2.10 Simplified Newgen definitions of the PIPS IR
3.1 Result of the Mandelbrot set
3.2 Sequential C implementation of the Mandelbrot set
3.3 Cilk implementation of the Mandelbrot set (P is the number of processors)
3.4 Chapel implementation of the Mandelbrot set
3.5 X10 implementation of the Mandelbrot set
3.6 A clock in X10 (left) and a phaser in Habanero-Java (right)
3.7 C OpenMP implementation of the Mandelbrot set
3.8 MPI implementation of the Mandelbrot set (P is the number of processors)
3.9 OpenCL implementation of the Mandelbrot set
3.10 A hide-and-seek game (X10, HJ, Cilk)
3.11 Example of an atomic directive in OpenMP
3.12 Data race on ptr with Habanero-Java
4.1 OpenCL example illustrating a parallel loop
4.2 forall in Chapel and its SPIRE core language representation
4.3 A C code and its unstructured parallel control flow graph representation
4.4 parallel sections in OpenMP and its SPIRE core language representation
4.5 OpenCL example illustrating spawn and barrier statements
4.6 Cilk and OpenMP examples illustrating an atomically-synchronized statement
4.7 X10 example illustrating a future task and its synchronization
4.8 A phaser in Habanero-Java and its SPIRE core language representation
4.9 Example of Pipeline Parallelism with phasers
4.10 MPI example illustrating a communication and its SPIRE core language representation
4.11 SPIRE core language representation of a non-blocking send
4.12 SPIRE core language representation of a non-blocking receive
4.13 SPIRE core language representation of a broadcast
4.14 Stmt and SPIRE(Stmt) syntaxes
4.15 Stmt sequential transition rules
4.16 SPIRE(Stmt) synchronized transition rules
4.17 if statement rewriting using while loops
4.18 Simplified Newgen definitions of the LLVM IR
4.19 A loop in C and its LLVM intermediate representation
4.20 SPIRE (LLVM IR)
4.21 An example of a spawned outlined sequence
5.1 A Directed Acyclic Graph (left) and its scheduling (right); starred top levels (*) correspond to the selected clusters
5.2 Result of DSC on the graph in Figure 5.1 without (left) and with (right) DSRW
5.3 A DAG amenable to cluster minimization (left) and its BDSC step-by-step scheduling (right)
5.4 DSC (left) and BDSC (right) cluster allocation
6.1 Abstract syntax trees Statement syntax
6.2 Parallelization process: blue indicates thesis contributions, an ellipse a process, and a rectangle results
6.3 Example of a C code (left) and the DDG D of its internal S sequence (right)
6.4 SDGs of S (top) and S0 (bottom) computed from the DDG (see the right of Figure 6.3); S and S0 are specified in the left of Figure 6.3
6.5 SDG G for a part of equake S given in Figure 6.7; Gbody is the SDG of the body Sbody (logically included into G via H)
6.6 Example of the execution time estimation for the function MultiplY of the code Harris; each comment provides the complexity estimation of the statement below (N and M are assumed to be global variables)
6.7 Instrumented part of equake (Sbody is the inner loop sequence)
6.8 Numerical results of the instrumented part of equake (instrumented_equake.in)
6.9 An example of a subtree of root forloop0
6.10 An example of a path execution trace from Sb to Se in the subtree forloop0
6.11 An example of a C code and the path transformer of the sequence between Sb and Se; M is ⊥ since there are no loops in the code
6.12 An example of a C code and the path transformer of the sequence of calls and test between Sb and Se; M is ⊥ since there are no loops in the code
6.13 An example of a C code and the path transformer between Sb and Se
6.14 Closure of the SDG G0 of the C code S0 in Figure 6.3, with additional import entry vertices
6.15 topsort(G) for the hierarchical scheduling of sequences
6.16 Scheduled SDG for Harris using P=3 cores; the scheduling information via cluster(τ) is also printed inside each vertex of the SDG
6.17 After the first application of BDSC on S = sequence(S0;S1), failing with "Not enough memory"
6.18 Hierarchically (via H) Scheduled SDGs with memory resource minimization after the second application of BDSC (which succeeds) on sequence(S0;S1) (κi = cluster(σ(Si))); to keep the picture readable, only communication edges are figured in these SDGs
7.1 Parallel code transformations and generation: blue indicates this chapter contributions, an ellipse a process, and a rectangle results
7.2 Unstructured parallel control flow graph (a) and event (b) and structured representations (c)
7.3 A part of Harris and one possible SPIRE core language representation
7.4 SPIRE representation of a part of Harris, non optimized (left) and optimized (right)
7.5 A simple C code (hierarchical) and its SPIRE representation
7.6 An example of a shared memory C code and its SPIRE code efficient communication primitive-based distributed memory equivalent
7.7 Two examples of C code with non-uniform dependences
7.8 An example of a C code and its read and write array regions analysis for two communications from S0 to S1 and to S2
7.9 An example of a C code and its in and out array regions analysis for two communications from S0 to S1 and to S2
7.10 Scheme of communications between multiple possible levels of hierarchy in a parallel code; the levels are specified with the superscripts on τ's
7.11 Equilevel and hierarchical communications
7.12 A part of an SDG and the topological sort order of predecessors of τ4
7.13 Communications generation from a region using region_to_coms function
7.14 Equilevel and hierarchical communications generation and SPIRE distributed memory representation of a C code
7.15 Graph illustration of communications made in the C code of Figure 7.14; equilevel in green arcs, hierarchical in black
7.16 Equilevel and hierarchical communications generation after optimizations: the current cluster executes the last nested spawn (left) and barriers with one spawn statement are removed (right)
7.17 SPIRE representation of a part of Harris and its OpenMP task generated code
7.18 SPIRE representation of a C code and its OpenMP hierarchical task generated code
7.19 SPIRE representation of a part of Harris and its MPI generated code
8.1 Parallelization process: blue indicates thesis contributions, an ellipse a pass or a set of passes, and a rectangle results
8.2 Executable (harris.tpips) for harris.c
8.3 A model of an MPI generated code
8.4 Scheduled SDG of the main function of ABF
8.5 Steps for the rewriting of data parallelism to task parallelism using the tiling and unrolling transformations
8.6 ABF and equake speedups with OpenMP
8.7 Speedups with OpenMP: impact of tiling (P=3)
8.8 Speedups with OpenMP for different class sizes (IS)
8.9 ABF and equake speedups with MPI
8.10 Speedups with MPI: impact of tiling (P=3)
8.11 Speedups with MPI for different class sizes (IS)
8.12 Run-time comparison (BDSC vs. Faust for Karplus32)
8.13 Run-time comparison (BDSC vs. Faust for Freeverb)

Chapter 1

Introduction

"I live and don't know how long, I'll die and don't know when, I am going and don't know where, I wonder that I am happy." Martinus Von Biberach

1.1 Context

Moore's law [81] states that, over the history of computing hardware, the number of transistors on integrated circuits doubles approximately every two years. Recently, the shift from fabricating microprocessors to parallel machine designs occurred partly because of the growth in power consumption due to the high clock speeds required to improve performance in single-processor (or core) chips. The transistor count is still increasing in order to integrate more cores and ensure proportional performance scaling.

In spite of the validity of Moore's law till now, the performance of cores stopped increasing after 2003. Thus the scalability of applications, which calls for increased performance when resources are added, is not guaranteed. To understand this issue, it is necessary to take a look at Amdahl's law [16]. This law states that the speedup of a program using multiple processors is bounded by the time needed for its sequential part; in addition to the number of processors, the algorithm itself also limits the speedup.
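
In its textbook form (a standard statement, not a quotation from [16]), if p denotes the parallelizable fraction of the sequential execution time and N the number of processors, the achievable speedup is

\[ S(N) = \frac{1}{(1-p) + p/N} \le \frac{1}{1-p}, \]

so the sequential fraction 1 − p caps the benefit of adding processors, whatever their number.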

In order to enjoy the performance benefits that multiprocessors can provide, one should exploit efficiently the parallelism present in applications. This is a tricky task for programmers, especially if the programmer is a physicist or a mathematician, or is a computer scientist for whom understanding the application is difficult, since it is not his own. Of course, we could tell the programmer: "think parallel". But humans tend to think sequentially. Therefore, one long-standing problem is, and will remain for some time, to detect parallelism in a sequential code and, automatically or not, write its equivalent efficient parallel code.

A parallel application, expressed in a parallel programming model, should be as portable as possible, i.e., one should not have to rewrite parallel code when targeting another machine. Yet the proliferation of parallel programming models makes the choice of one general model far from obvious, unless there is a parallel language that is efficient and can be compiled to all types of current and future architectures, which is not yet the case. Therefore, this writing of a parallel code needs to be based on a generic approach, i.e., on how well a range of different languages can be handled, in order to offer some chance of portability.


1.2 Motivation

In order to achieve good performance when programming for multiprocessors, and go beyond having to "think parallel", a platform for automatic parallelization is required to exploit cores efficiently. Also, the proliferation of multi-core processors with shorter pipelines and lower clock rates, and the pressure that the simpler data parallelism model imposes on their memory bandwidth, have made coarse-grained parallelism inevitable for improving performance.

The first step in a parallelization process is to expose concurrency by decomposing applications into pipelined tasks, since an algorithm is often described as a collection of tasks. To illustrate the importance of task parallelism, we give a simple example: the Harris algorithm [94], which is based on pixelwise autocorrelation, using a chain of functions, namely Sobel, Multiplication, Gauss and Coarsity. Figure 1.1 shows this process. Several works have already parallelized and mapped this algorithm on parallel architectures such as the CELL processor [93].

Figure 1.2 shows a possible instance of a manual partitioning of Harris (see also [93]). The Harris algorithm can be decomposed into a succession of two to three concurrent tasks (ellipses in the figure), assuming no exploitation of data parallelism. Automating this intuitive approach, and thus the automatic task extraction and parallelization of such applications, is our motivation in this thesis.

However, detecting task parallelism and generating efficient code rest on complex analyses of data dependences, communication, synchronization, etc. These different analyses can be provided by compilation frameworks. Thus, delegating the issue of "parallel thinking" to software is appealing, because it can be built upon existing sophisticated analyses of data dependences, and can handle the granularity present in sequential codes and the resource constraints, such as memory size or the number of CPUs, that also impact this process.

Automatic task parallelization has been studied for almost half a century. The problem of extracting an optimal parallelism with optimal communications is NP-complete [46]. Several works have tried to automate the parallelization of programs using different granularities. Metis [61] and other graph partitioning tools aim at assigning the same amount of processor work with small quantities of interprocessor communication, but the structure of the graph does not differentiate between loops, function calls, etc.; the crucial steps of graph construction and parallel code generation are missing. Sarkar [96] implements a compile-time method for the partitioning problem for multiprocessors: a program is partitioned into parallel tasks at compile time, and then these are merged until a partition with the smallest parallel execution time in the presence of overhead (scheduling and communication overhead) is found. Unfortunately, this algorithm does not


void main(int argc, char *argv[]) {
  float (*Gx)[N*M], (*Gy)[N*M], (*Ixx)[N*M],
        (*Iyy)[N*M], (*Ixy)[N*M], (*Sxx)[N*M],
        (*Sxy)[N*M], (*Syy)[N*M], (*in)[N*M];

  in = InitHarris();

  // Now we run the Harris procedure

  // Sobel
  SobelX(Gx, in);
  SobelY(Gy, in);

  // Multiply
  MultiplY(Ixx, Gx, Gx);
  MultiplY(Iyy, Gy, Gy);
  MultiplY(Ixy, Gx, Gy);

  // Gauss
  Gauss(Sxx, Ixx);
  Gauss(Syy, Iyy);
  Gauss(Sxy, Ixy);

  // Coarsity
  CoarsitY(out, Sxx, Syy, Sxy);

  return;
}

Figure 1.1: Sequential C implementation of the main function of Harris

[Figure: data flow graph of Harris -- the input image in feeds SobelX and SobelY, whose outputs Gx and Gy feed three MultiplY nodes producing Ixx, Ixy and Iyy; these feed three Gauss nodes producing Sxx, Sxy and Syy, which are finally combined by Coarsity into out.]

Figure 12 Harris algorithm data flow graph

address resource constraints, which are important factors for targeting real architectures.

All parallelization tools are dedicated to a particular programming model: there is a lack of a generic abstraction of parallelism (multithreading, synchronization and data distribution). They thus do not usually address the issue of portability.

In this thesis we develop an automatic task parallelization methodology for compilers: the key characteristics we focus on are resource constraints and static scheduling. It includes the techniques required to decompose applications into tasks and generate equivalent parallel code, using a generic approach that targets different parallel languages and thus different architectures.


We apply this methodology in the existing tool PIPS [58], a comprehensive source-to-source compilation platform.

13 Contributions

Our goal is to develop a new tool and algorithms for automatic task parallelization that take into account resource constraints and which can also be of general use for existing parallel languages. For this purpose, the main contributions of this thesis are:

1. a new BDSC-based hierarchical scheduling algorithm (HBDSC) that uses:

   • a new data structure, called the Sequence Data Dependence Graph (SDG), to represent partitioned parallel programs;

   • "Bounded DSC" (BDSC), an extension of the DSC [108] scheduling algorithm that simultaneously handles two resource constraints, namely a bounded amount of memory per processor and a bounded number of processors, which are key parameters when scheduling tasks on actual multiprocessors;

   • a new cost model based on execution time complexity estimation, convex polyhedral approximations of data array sizes and code instrumentation, for the labeling of SDG vertices and edges;

2. a new approach for the adaptation of automatic parallelization platforms to parallel languages, via:

   • SPIRE, a new, simple, parallel intermediate representation (IR) extension methodology for designing the parallel IRs used in compilation frameworks;

   • the deployment of SPIRE for the automatic task-level parallelization of explicitly parallel programs on the PIPS [58] compilation framework;

3. an implementation, in the PIPS source-to-source compilation framework, of:

   • BDSC-based parallelization for programs encoded using the SPIRE-based PIPS IR;

   • SPIRE-based parallel code generation into two parallel languages: OpenMP [4] and MPI [3];

4. performance measurements for our parallelization approach, based on:

   • five significant programs, targeting both shared and distributed memory architectures: the image and signal processing benchmarks (actually sample constituent algorithms, hopefully representative of classes of applications) Harris [53] and ABF [51], the SPEC2001 benchmark equake [22], the NAS parallel benchmark IS [85] and an FFT code [8];

   • their automatic translations into two parallel languages: OpenMP [4] and MPI [3];

   • and finally, a comparative study between our task parallelization implementation in PIPS and that of the audio signal processing language Faust [86], using two Faust applications: Karplus32 and Freeverb. We choose Faust since it is an open-source compiler and it also generates parallel tasks automatically in OpenMP.

14 Thesis Outline

This thesis is organized in nine chapters. Figure 13 shows how these chapters can be put into the context of a parallelization chain.

Chapter 2 provides a short background to the concepts of architectures, parallelism, languages and compilers used in the next chapters.

Chapter 3 surveys seven current and efficient parallel programming languages, in order to determine and classify their parallel constructs. This helps us to define a core parallel language that plays the role of a generic parallel intermediate representation when parallelizing sequential programs.

Our proposal, called the SPIRE methodology (Sequential to Parallel Intermediate Representation Extension), is developed in Chapter 4. SPIRE leverages existing infrastructures to address both control and data parallel languages, while preserving as much as possible the existing analyses for sequential codes. To validate this approach in practice, we use PIPS, a comprehensive source-to-source compilation platform, as a use case to upgrade its sequential intermediate representation to a parallel one.

Since the main goal of this thesis is automatic task parallelization, extracting task parallelism from sequential codes is a key issue in this process, which we view as a scheduling problem. Chapter 5 introduces a new efficient automatic scheduling heuristic called BDSC (Bounded Dominant Sequence Clustering) for parallel programs in the presence of resource constraints on the number of processors and their local memory size.

The parallelization process we introduce in this thesis uses BDSC to find a good scheduling of program tasks on target machines, and SPIRE to generate parallel source code. Chapter 6 equips BDSC with a sophisticated cost model, to yield a new parallelization algorithm.

Chapter 7 describes how we can generate equivalent parallel codes in two target languages, OpenMP and MPI, selected from the survey presented in


Chapter 3. Thanks to SPIRE, we show how code generation is simple and efficient.

To verify the efficiency and robustness of our work, experimental results are presented in Chapter 8. They suggest that BDSC- and SPIRE-based parallelization, focused on efficient resource management, leads to significant parallelization speedups on both shared and distributed memory systems. We also compare our implementation of the generation of omp task in PIPS with the generation of omp sections in Faust.

We conclude in Chapter 9


[Figure: the parallelization chain -- C source code is fed to the PIPS analyses (Chapter 2), which produce complexities and convex array regions, the sequential IR and the DDG (Chapter 2); these feed the BDSC-based parallelizer (Chapters 5 and 6), which produces SPIRE(PIPS IR) (Chapter 4); OpenMP generation and MPI generation (Chapter 7) then produce OpenMP and MPI code (Chapter 3), which are run to obtain the performance results (Chapter 8).]

Figure 13 Thesis outline: blue indicates thesis contributions, an ellipse a process, and a rectangle results

Chapter 2

Perspectives: Bridging Parallel Architectures and Software Parallelism

You talk when you cease to be at peace with your thoughts. Kahlil Gibran

"Anyone can build a fast CPU. The trick is to build a fast system." Attributed to Seymour Cray, this quote is even more pertinent when looking at multiprocessor systems that contain several fast processing units: parallel system architectures introduce subtle system features to achieve good performance. Real world applications, which operate on large amounts of data, must be able to deal with constraints such as memory requirements, code size and processor features. These constraints must also be addressed by parallelizing compilers, which are related to such applications, from the domains of scientific, signal and image processing, and which translate sequential codes into efficient parallel ones. The multiplication of hardware intricacies increases the importance of software in order to achieve adequate performance.

This thesis was carried out under this perspective. Our goal is to develop a prototype for automatic task parallelization that generates a parallel version of an input sequential code using a scheduling algorithm. We also address code generation and parallel intermediate representation issues. More generally, we study how parallelism can be detected, represented and finally generated.

21 Parallel Architectures

Up to 2003, the performance of machines was improved by increasing clock frequencies. Yet relying on single-thread performance has been increasingly disappointing, due to access latency to main memory. Moreover, other aspects of performance have been of importance lately, such as power consumption, energy dissipation and number of cores. For these reasons, the development of parallel processors, with a large number of cores that run at slower frequencies to reduce energy consumption, has been adopted and has had a significant impact on performance.

The market of parallel machines is wide, from vector platforms, accelerators, Graphical Processing Units (GPUs) and dual-core to many-core (i.e., 32 cores or more) processors with


thousands of on-chip cores. However, the gains obtained when parallelizing and executing applications are limited by input-output issues (disk read/write), data communications among processors, memory access time, load unbalance, etc. When parallelizing an application, one must detect the available parallelism and map the different parallel tasks on the parallel machine. In this dissertation we consider a task as a static notion, i.e., a list of instructions, while processes and threads (see Section 212) are running instances of tasks.

In our parallelization approach, we make trade-offs between the available parallelism and communications/locality; this latter point is related to the memory model of the target machine. Besides, the extracted parallelism should be correlated to the number of processors/cores available in the target machine. We also take into account the memory size and the number of processors.

211 Multiprocessors

The generic model of multiprocessors is a collection of cores, each potentially including both CPU and memory, attached to an interconnection network. Figure 21 illustrates an example of a multiprocessor architecture that contains two clusters; each cluster is a quad-core multiprocessor. In a multicore, cores can transfer data between their cache memories (level 2 cache, generally) without having to go through the main memory. Moreover, they may or may not share caches (CPU-local level 1 cache in Figure 21, for example). This is not possible at the cluster level, where processor cache memories are not linked; data can be transferred to and from the main memory only.

[Figure: two quad-core clusters; in each cluster, four CPUs have private L1 caches and share an L2 cache connected to the cluster's main memory; the two main memories are linked by an interconnect.]

Figure 21 A typical multiprocessor architectural model

Flynn [45] classifies multiprocessors in three main groups: Single Instruction Single Data (SISD), Single Instruction Multiple Data (SIMD) and


Multiple Instruction Multiple Data (MIMD). Multiprocessors are generally recognized to be MIMD architectures. We present the important branches of current multiprocessors, MPSoCs and multicore processors, since they are widely used in many applications such as signal processing and multimedia.

Multiprocessors System-on-Chip

A multiprocessor system-on-chip (MPSoC) integrates multiple CPUs in a hardware system. MPSoCs are commonly used for applications such as embedded multimedia on cell phones. The Lucent Daytona [9] is often considered to be the first MPSoC. One distinguishes two important forms of MPSoCs: homogeneous and heterogeneous multiprocessors. Homogeneous architectures include multiple cores that are similar in every aspect. On the other side, heterogeneous architectures use processors that are different in terms of instruction set architecture, functionality, frequency, etc. Field Programmable Gate Array (FPGA) platforms and the CELL processor2 [42] are examples of heterogeneous architectures.

Multicore Processors

This model implements the shared memory model: the programming model in a multicore processor is the same for all its CPUs. The first multicore general-purpose processors were introduced in 2005 (Intel Core Duo). Intel Xeon Phi is a recent multicore chip with 60 homogeneous cores. The number of cores on a chip is expected to continue to grow, to construct more powerful computers. This emerging hardware approach will deliver petaflop and exaflop performance with the efficient capabilities needed to handle high-performance emerging applications. Table 21 illustrates the rapid proliferation of multicores and summarizes the important existing ones; note that IBM Sequoia, the second system in the Top500 list (top500.org) in November 2012, is a petascale Blue Gene/Q supercomputer that contains 98304 compute nodes, where each compute node is a 16-core PowerPC A2 processor chip (sixth entry in the table). See [25] for an extended survey of multicore processors.

In this thesis, we provide tools that automatically generate parallel programs from sequential ones, targeting homogeneous multiprocessors or multicores. Since increasing the number of cores on a single chip creates challenges with memory and cache coherence, as well as communication between the cores, our aim in this thesis is to deliver the benefit and efficiency of multicore processors by introducing an efficient scheduling algorithm to map the parallel tasks of a program on CPUs, in the presence of resource constraints on the number of processors and their local memory size.

2The Cell configuration has one Power Processing Element (PPE) on the core, acting as a controller for eight physical Synergistic Processing Elements (SPEs).


Year              Number of cores         Company   Processor
The early 1970s   The 1st microprocessor  Intel     4004
2005              2-core machine          Intel     Xeon Paxville DP
2005              2                       AMD       Opteron E6-265
2009              6                       AMD       Phenom II X6
2011              8                       Intel     Xeon E7-2820
2011              16                      IBM       PowerPC A2
2007              64                      Tilera    TILE64
2012              60                      Intel     Xeon Phi

Table 21 Multicores proliferation

212 Memory Models

The choice of a proper memory model to express parallel programs is an important issue in parallel language design. Indeed, the ways processes and threads communicate using the target architecture, and the way this affects how programmers specify their computations, impact both performance and ease of programming. There are currently three main approaches (see Figure 22, extracted from [63]).

[Figure: the three memory models -- shared memory (threads/processes accessing a single address space), distributed memory (separate address spaces communicating via messages) and PGAS (a global address space partitioned among threads).]

Figure 22 Memory models

Shared Memory

Also called global address space, this model is the simplest one to use [11]. Here the address spaces of the threads are mapped onto the global memory; no explicit data passing between threads is needed. However, synchronization is required between the threads that are writing and reading the same


data to and from the shared memory. The OpenMP [4] and Cilk [26] languages use the shared memory model. Intel's Larrabee [99] is a homogeneous, general-purpose, many-core machine with cache-coherent shared memory.
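To illustrate this model, here is a minimal OpenMP sketch in C (an illustration added here, not code from the thesis; the array size N and the computation are arbitrary choices): all threads access the same array a directly, and the reduction clause provides the synchronization needed to combine their partial sums.

#include <omp.h>
#include <stdio.h>
#define N 1000

int main(void)
{
  static double a[N];                        /* a single, shared address space   */
  double sum = 0.0;
  #pragma omp parallel for                   /* threads write distinct elements  */
  for (int i = 0; i < N; i++)
    a[i] = 2.0 * i;
  #pragma omp parallel for reduction(+:sum)  /* synchronized combination of sums */
  for (int i = 0; i < N; i++)
    sum += a[i];
  printf("sum = %f\n", sum);
  return 0;
}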

Distributed Memory

In a distributed-memory environment, each processor is only able to address its own memory. The message passing model, which uses communication libraries, is usually adopted to write parallel programs for distributed-memory systems. These libraries provide routines to initiate and configure the messaging environment, as well as to send and receive data packets. Currently, the most popular high-level message-passing system for scientific and engineering applications is MPI (Message Passing Interface) [3]. OpenCL [69] uses a variation of the message passing memory model for GPUs. The Intel Single-chip Cloud Computer (SCC) [55] is a homogeneous, general-purpose, many-core (48 cores) chip implementing the message passing memory model.
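A minimal MPI sketch in C of this model (again an illustration, not code from the thesis): each process owns a private copy of value, and data moves only through explicit messages.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
  int rank, value = 0;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {                  /* process 0 sends its private data ...     */
    value = 42;
    MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
  } else if (rank == 1) {           /* ... and process 1 receives it explicitly */
    MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("process 1 received %d\n", value);
  }
  MPI_Finalize();
  return 0;
}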

Partitioned Global Address Space (PGAS)

PGAS-based languages combine the programming convenience of shared memory with the performance control of message passing, by logically partitioning a global address space into places; each thread is local to a place. From the programmer's point of view, programs have a single address space, and one task of a given thread may refer directly to the storage of a different thread. UPC [33], Fortress [13], Chapel [6], X10 [5] and Habanero-Java [30] use the PGAS memory model.
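For instance, a UPC fragment in this spirit could look like the following sketch (illustrative only; it assumes a UPC compiler, and the array size is arbitrary): the shared array is partitioned among the threads, and upc_forall lets each thread execute the iterations whose element has affinity to it, while remote elements remain directly addressable.

#include <upc.h>
#define N 1024

shared double a[N];          /* one logical array, partitioned over all threads */

int main(void)
{
  int i;
  upc_forall (i = 0; i < N; i++; &a[i])  /* affinity expression: run iteration  */
    a[i] = (double) MYTHREAD;            /* i on the thread that owns a[i]      */
  upc_barrier;                           /* collective synchronization          */
  return 0;
}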

In this thesis we target both shared and distributed memory systems, since efficiently allocating the tasks of an application on a target architecture requires reducing communication overheads and transfer costs for both shared and distributed memory architectures. By convention, we call running instances of tasks processes in the case of distributed memory systems, and threads for shared memory and PGAS systems.

If reducing communications is obviously meaningful for distributed memory systems, it is also worthwhile on shared memory architectures, since this may impact cache behavior. Also, locality is another important issue; a parallelization process should be able to make trade-offs between locality in cache memory and communications. Indeed, mapping two tasks to the same processor keeps the data in its local memory, and even possibly its cache. This avoids copying data over the shared memory bus. Therefore transmission costs are decreased and bus contention is reduced.

Moreover, we take into account the memory size, since this is an important factor when parallelizing applications that operate on large amounts of data or when targeting embedded systems.


22 Parallelism Paradigms

Parallel computing has been studied since the advent of computers. The market dominance of multi- and many-core processors, and the growing importance and increasing number of clusters in the Top500 list, are making parallelism a key concern when implementing large applications such as weather modeling [39] or nuclear simulations [36]. These important applications require a lot of computational power and thus need to be programmed to run on parallel supercomputers.

Parallelism expression is the fundamental issue when breaking an application into concurrent parts, in order to take advantage of a parallel computer and to execute these parts simultaneously on different CPUs. Parallel architectures support multiple levels of parallelism within a single chip: the PE (Processing Element) level [64], the thread level (e.g., Simultaneous Multithreading (SMT)), the data level (e.g., Single Instruction Multiple Data (SIMD)), the instruction level (e.g., Very Long Instruction Word (VLIW)), etc. We present here the three main types of parallelism (data, task and pipeline).

Task Parallelism

This model corresponds to the distribution of the execution of processes or threads across different parallel computing cores/processors. A task can be defined as a unit of parallel work in a program. Its scheduling implementations can be classified into two categories, dynamic and static; a short sketch contrasting the two is given after the list below.

• With dynamic task parallelism, tasks are dynamically created at run time and added to the work queue. The runtime scheduler is responsible for scheduling and synchronizing the tasks across the cores.

• With static task parallelism, the set of tasks is known statically and the scheduling is applied at compile time; the compiler generates and assigns the different tasks to the thread groups.
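The following C sketch (illustrative, not from the thesis; process() is a hypothetical work function) shows the dynamic flavor: tasks are created at run time with OpenMP's task construct, and the runtime scheduler maps them onto the available threads, whereas a static approach would fix this mapping at compile time.

#include <omp.h>

void process(int i);                  /* hypothetical unit of work              */

void run_tasks(int n)
{
  #pragma omp parallel                /* a team of threads ...                  */
  #pragma omp single                  /* ... one of which creates the tasks     */
  for (int i = 0; i < n; i++) {
    #pragma omp task firstprivate(i)  /* tasks are queued and scheduled at run  */
    process(i);                       /* time by the OpenMP runtime             */
  }
}                                     /* implicit barrier: all tasks done here  */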

Data Parallelism

This model implies that the same independent instruction is performed repeatedly and simultaneously on different data. Data parallelism, or loop-level parallelism, corresponds to the distribution of the data across different parallel cores/processors, dividing the loops into chunks3. Data-parallel execution puts a high pressure on the memory bandwidth of multi-core processors. Data parallelism can be expressed using task parallelism constructs; an example is provided in Figure 23, where loop iterations are distributed

3A chunk is a set of iterations


differently among the threads: in contiguous order in the left part of the figure, and in interleaved order in the right part.

forall (i = 0; i < height; ++i)
  for (j = 0; j < width; ++j)
    kernel();

(a) Data parallel example

for (i = 0; i < height; ++i)
  launchTask {
    for (j = 0; j < width; ++j)
      kernel();
  }

(b) Task parallel example

Figure 23 Rewriting of data parallelism using the task parallelism model

Pipeline Parallelism

Pipeline parallelism applies to chains of producers and consumers that contain many steps depending on each other. These dependences can be removed by interleaving these steps. Pipeline parallel execution leads to reduced latency and buffering, but it introduces extra synchronization between producers and consumers, to keep them coupled in their execution.

In this thesis we focus on task parallelism, often considered more difficult to handle than data parallelism since it lacks the regularity present in the latter model: processes (threads) run different instructions simultaneously, leading to different execution schedules and memory access patterns. Besides, the pressure of the data parallelism model on memory bandwidth, and the extra synchronization introduced by pipeline parallelism, have made coarse-grained parallelism inevitable for improving performance.

Moreover, data parallelism can be implemented using task parallelism (see Figure 23). Therefore, our implementation of task parallelism is also able to handle data parallelism, after transformation of the sequential code (see Section 84, which explains the protocol applied to task parallelize sequential applications).

Task management must address both control and data dependences, in order to reduce execution and communication times. This is related to another key enabling issue, compilation, which we detail in the following section.

23 Mapping Paradigms to Architectures

In order to map efficiently a parallel program on a parallel architecture, one needs to write parallel programs in the presence of constraints of data and control dependences, communications and resources. Writing a parallel


application implies the use of a target parallel language; we survey in Chapter 3 different recent and popular task parallel languages. We can split this parallelization process into two steps: understanding the sequential application by analyzing it to extract parallelism, and then generating the equivalent parallel code. Before reaching the writing step, decisions should be made about an efficient task schedule (Chapter 5). To perform these operations automatically, a compilation framework is necessary to provide a base for analyses, implementation and experimentation. Moreover, to keep this parallel software development process simple, generic and language-independent, a general parallel core language is needed (Chapter 4).

Programming parallel machines requires a parallel programming language implementing one of the previous paradigms of parallelism. Since in this thesis we focus on extracting and generating task parallelism, we are interested in task parallel languages, which we detail in Chapter 3. We distinguish two approaches for mapping languages to parallel architectures. In the first case, the entry is a parallel language: the user writes the parallel program by hand, which requires a great effort for detecting dependences and parallelism and for writing an equivalent, efficient parallel program. In the second case, the entry is a sequential language: this calls for an automatic parallelizer, which is responsible for dependence analyses and scheduling, in order to generate efficient parallel code.

In this thesis we adopt the second approach: we use the PIPS compiler, presented in Section 26, as a practical tool for code analysis and parallelization. We target both shared and distributed memory systems, using two parallel languages, OpenMP and MPI. They provide high-level programming constructs to express task creation, termination, synchronization, etc. In Chapter 7 we show how we generate parallel code for these two languages.

24 Program Dependence Graph (PDG)

Parallelism detection is based on the analyses of data and control dependences between instructions, encoded in an intermediate representation format. Parallelizers use various dependence graphs to represent such constraints. A program dependence graph (PDG) is a directed graph where the vertices represent blocks of statements and the edges represent essential control or data dependences. It encodes both control and data dependence information.

241 Control Dependence Graph (CDG)

The CDG represents a program as a graph of statements that are control dependent on the entry to the program. Control dependences represent control flow relationships that must be respected by any execution of the program, whether parallel or sequential.


Ferrante et al. [44] define control dependence as follows: Statement Y is control dependent on X when there is a path in the control flow graph (CFG) (see Figure 24a) from X to Y that does not contain the immediate forward dominator Z of X. Z forward dominates Y if all paths from Y include Z. The immediate forward dominator is the first forward dominator (closest to X). In the forward dominance tree (FDT) (see Figure 24b), vertices represent statements, edges represent the immediate forward dominance relation, and the root of the tree is the exit of the CFG. For the construction of the CDG (see Figure 24c), one relies on explicit control flow and postdominance information. Unfortunately, when programs contain goto statements, explicit control flow and postdominance information are prerequisites for calculating control dependences.

[Figure: (a) a control flow graph (CFG) over statements 1 to 6, with X = 2, Y = 3 and Z = 5; (b) the corresponding forward dominance tree (FDT); (c) the resulting control dependence graph (CDG) rooted at the entry E.]

Figure 24 Construction of the control dependence graph

Harrold et al. [54] have proposed an algorithm that does not require explicit control flow or postdominator information to compute exact control dependences. Indeed, for structured programs, the control dependences are obtained directly from the Abstract Syntax Tree (AST). Programs that contain what the authors call structured transfers of control, such as break, return or exit, can be handled as special cases [54]. For more unstructured transfers of control via arbitrary gotos, additional processing based on the control flow graph needs to be performed.

In this thesis we only handle the structured parts of a code, i.e., the ones that do not contain goto statements. Therefore, in this context, PIPS implements control dependences implicitly in its IR, since the latter is in the form of an AST (for structured programs, CDG and AST are equivalent).


242 Data Dependence Graph (DDG)

The DDG is a subgraph of the program dependence graph that encodes data dependence relations between instructions that use/define the same data element. Figure 25 shows an example of a C code and its data dependence graph; this DDG has been generated automatically with PIPS. Data dependences in a DDG are basically of three kinds [88], as shown in the figure:

• true dependences, or true data-flow dependences, when a value of an instruction is used by a subsequent one (a.k.a. RAW), colored red in the figure;

• output dependences, between two definitions of the same item (a.k.a. WAW), colored blue in the figure;

• anti-dependences, between a use of an item and a subsequent write (a.k.a. WAR), colored green in the figure.

void main()
{
  int a[10], b[10];
  int i, d, c;
  c = 42;
  for (i = 1; i <= 10; i++)
    a[i] = 0;
  for (i = 1; i <= 10; i++) {
    a[i] = bar(a[i]);
    d = foo(c);
    b[i] = a[i] + d;
  }
  return;
}

Figure 25 Example of a C code and its data dependence graph

Dependences are needed to define precedence constraints in programs. PIPS implements both control dependences (via its IR, in the form of an AST) and data dependences (in the form of a dependence graph). These data structures are the basis for scheduling instructions, an important step in any parallelization process.


25 List Scheduling

Task scheduling is the process of specifying the order and assignment of start and end times to a set of tasks to be run concurrently on a multiprocessor, such that the completion time of the whole application is as small as possible while respecting the dependence constraints of each task. Usually, the number of tasks exceeds the number of processors; thus some processors are dedicated to multiple tasks. Since finding the optimal solution of a general scheduling problem is NP-complete [46], providing an efficient heuristic to find a good solution is needed. This is a problem we address in Chapter 5.

251 Background

Scheduling approaches can be categorized in many ways. In preemptive techniques, the currently executing task can be preempted by another, higher-priority task, while in a non-preemptive scheme a task keeps its processor until termination. Preemptive scheduling algorithms are instrumental in avoiding possible deadlocks and for implementing real-time systems, where tasks must adhere to specified deadlines; however, preemptions generate run-time overhead. Moreover, when static predictions of task characteristics, such as execution time and communication cost, exist, task scheduling can be performed statically (offline). Otherwise, dynamic (online) schedulers must make run-time mapping decisions whenever new tasks arrive; this also introduces run-time overhead.

In the context of the automatic parallelization of scientific applications that we focus on in this thesis, we are interested in non-preemptive static scheduling policies of parallelized code4. Even though the subject of static scheduling is rather mature (see Section 54), we believe that the advent and widespread use of multi-core architectures, with the constraints they impose, warrants taking a fresh look at its potential. Indeed, static scheduling mechanisms have, first, the strong advantage of reducing run-time overheads, a key factor when considering execution time and energy usage metrics. One other important advantage of these schedulers over dynamic ones, at least over those not equipped with detailed static task information, is that the existence of efficient schedules is ensured prior to program execution. This is usually not an issue when time performance is the only goal at stake, but much more so when memory constraints might prevent a task being executed at all on a given architecture. Finally, static schedules are predictable, which helps both at the specification (if such a requirement has been introduced by designers) and debugging levels. Given the breadth of the literature on scheduling, we introduce in this section the notion of list-scheduling heuristics [72], a class of scheduling heuristics that we use in this thesis.

4Note that more expensive preemptive schedulers would be required if fairness concernswere high which is not frequently the case in the applications we address here



252 Algorithm

A labeled directed acyclic graph (DAG) G is defined as G = (T, E, C), where (1) T = vertices(G) is a set of n tasks (vertices) τ, annotated with an estimation of their execution time, task_time(τ); (2) E = edges(G) is a set of m directed edges e = (τi, τj) between two tasks, annotated with their dependences, edge_regions(e)5; and (3) C is an n × n sparse communication edge cost matrix, edge_cost(e). task_time(τ) and edge_cost(e) are assumed to be numerical constants, although we show how we lift this restriction in Section 63. The functions successors(τ, G) and predecessors(τ, G) return the list of immediate successors and predecessors of a task τ in the DAG G. Figure 26 provides an example of a simple graph with vertices τi; vertex times are listed in the vertex circles, while edge costs label arrows.

[Figure: a DAG with an entry vertex (time 0), tasks τ1 (time 1), τ4 (time 2), τ2 (time 3), τ3 (time 2) and an exit vertex (time 0); edges are labeled with communication costs 0, 0, 2, 1, 1, 0, 0.]

step   task   tlevel   blevel
1      τ4     0        7
2      τ3     3        2
3      τ1     0        5
4      τ2     4        3

Figure 26 A Directed Acyclic Graph (left) and its associated data (right)

A list scheduling process provides, from a DAG G, a sequence of its vertices that satisfies the relationship imposed by E. Various heuristics try to minimize the total schedule length, possibly allocating the various vertices in different clusters, which ultimately will correspond to different processes or threads. A cluster κ is thus a list of tasks; if τ ∈ κ, we note cluster(τ) = κ. List scheduling is based on the notion of vertex priorities. The priority for each task τ is computed using the following attributes:

• The top level tlevel(τ, G) of a vertex τ is the length of the longest

5A region is defined in Section 263


path from the entry vertex6 of G to τ. The length of a path is the sum of the communication costs of the edges and the computational times of the vertices along the path. Top levels are used to estimate the start times of vertices on processors: the top level is the earliest possible start time. Scheduling in an ascending order of top levels schedules vertices in a topological order. The algorithm for computing the top level of a vertex τ in a graph is given in Algorithm 1.

ALGORITHM 1 The top level of Task τ in Graph G

function tlevel(τ, G)
  tl = 0
  foreach τi ∈ predecessors(τ, G)
    level = tlevel(τi, G) + task_time(τi) + edge_cost(τi, τ)
    if (tl < level) then tl = level
  return tl
end

• The bottom level blevel(τ, G) of a vertex τ is the length of the longest path from τ to the exit vertex of G. The maximum of the bottom levels of vertices is the length cpl(G) of the graph's critical path, which is the longest path in the DAG G. The latest start time of a vertex τ is the difference (cpl(G) − blevel(τ, G)) between the critical path length and the bottom level of τ. Scheduling in a descending order of bottom levels tends to schedule critical path vertices first. The algorithm for computing the bottom level of τ in a graph is given in Algorithm 2.

ALGORITHM 2 The bottom level of Task τ in Graph G

function blevel(τ, G)
  bl = 0
  foreach τj ∈ successors(τ, G)
    level = blevel(τj, G) + edge_cost(τ, τj)
    if (bl < level) then bl = level
  return bl + task_time(τ)
end

To illustrate these notions, the top levels and bottom levels of each vertex of the graph presented on the left of Figure 26 are provided in the adjacent table.
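As a worked example (reading the vertex times and edge costs off Figure 26 -- our reading assumes that τ4 has the two successors τ2 and τ3, with edge costs 2 and 1 respectively, and that τ1 precedes τ2 with edge cost 1), the entries of the table can be recomputed with Algorithms 1 and 2:

tlevel(τ2) = max(tlevel(τ4) + task_time(τ4) + edge_cost(τ4, τ2),
                 tlevel(τ1) + task_time(τ1) + edge_cost(τ1, τ2))
           = max(0 + 2 + 2, 0 + 1 + 1) = 4

blevel(τ4) = task_time(τ4) + max(edge_cost(τ4, τ2) + blevel(τ2),
                                 edge_cost(τ4, τ3) + blevel(τ3))
           = 2 + max(2 + 3, 1 + 2) = 7

both of which match the table.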

6One can always assume that a unique entry vertex exists, by adding it if need be and connecting it to the original ones with null-cost edges. This remark also applies to the exit vertex (see Figure 26).


The general algorithmic skeleton for list scheduling a graph G on P clusters (P can be infinite and is assumed to be always strictly positive) is provided in Algorithm 3: first, priorities priority(τ) are computed for all currently unscheduled vertices; then, the vertex with the highest priority is selected for scheduling; finally, this vertex is allocated to the cluster that offers the earliest start time. Function f characterizes each specific heuristic, while the set of clusters already allocated to tasks is clusters. Priorities need to be computed again for a (possibly updated) graph G after each scheduling of a task: task times and communication costs change when tasks are allocated to clusters. This is performed by the update_priority_values function call. Part of the allocate_task_to_cluster procedure is to ensure that cluster(τ) = κ, which indicates that Task τ is now scheduled on Cluster κ.

ALGORITHM 3 List scheduling of Graph G on P processors

procedure list_scheduling(G, P)
  clusters = ∅
  foreach τi ∈ vertices(G)
    priority(τi) = f(tlevel(τi, G), blevel(τi, G))
  UT = vertices(G)                      // unscheduled tasks
  while UT ≠ ∅
    τ = select_task_with_highest_priority(UT)
    κ = select_cluster(τ, G, P, clusters)
    allocate_task_to_cluster(τ, κ, G)
    update_graph(G)
    update_priority_values(G)
    UT = UT - {τ}
end

In this thesis we introduce a new non-preemptive static list-scheduling heuristic called BDSC (Chapter 5) to extract task-level parallelism in the presence of resource constraints on the number of processors and their local memory size; it is based on a precise cost model and addresses both shared and distributed parallel memory architectures. We implement BDSC in PIPS.

26 PIPS Automatic Parallelizer and Code Transformation Framework

We chose PIPS [58] to showcase our parallelization approach since it is readily available, well documented, and encodes both control and data dependences. PIPS is a source-to-source compilation framework for analyzing and transforming C and Fortran programs. PIPS implements polyhedral


analyses and transformations that are interprocedural, and provides detailed static information about programs, such as use-def chains, the data dependence graph, symbolic complexity, memory effects7, the call graph, the interprocedural control flow graph, etc.

Many program analyses and transformations can be applied; the most important are listed below.

Parallelization Several algorithms are implemented, such as Allen & Kennedy's parallelization algorithm [14], which performs loop distribution and vectorization selecting innermost loops for vector units, and coarse-grain parallelization based on the convex array regions of loops (see below), used to avoid loop distribution and restrictions on loop bodies.

Scalar and array privatization This analysis discovers variables whose values are local to a particular scope, usually a loop iteration.

Loop transformations [88] These include unrolling, interchange, normalization, distribution, strip mining, tiling and index set splitting.

Function transformations PIPS supports, for instance, the inlining, outlining and cloning of functions.

Other transformations These classical code transformations include dead-code elimination, partial evaluation, 3-address code generation, control restructuring, loop invariant code motion, forward substitution, etc.

We use many of these analyses for implementing our parallelization methodology in PIPS, e.g., the "complexity" and data dependence analyses. These static analyses use convex polyhedra to represent abstractions of the values within various domains. A convex polyhedron over Z is a set of integer points in the n-dimensional space Zn that can be represented by a system of linear inequalities. These analyses give impressive information and results, which can be leveraged to perform automatic task parallelization (Chapter 6).
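For instance (an illustration added here, not an example from the text), the set of points

{ (i, j) ∈ Z² | 1 ≤ i, i ≤ 10, 6 ≤ j, j ≤ 20 }

is a convex polyhedron defined by four affine inequalities over the program variables i and j; the preconditions and array regions shown in the next sections are exactly such systems of constraints.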

We present below the transformer, precondition, array region and complexity analyses, which are directly used in this thesis to compute dependences, execution times and communication times; then we present the intermediate representation (IR) of PIPS, since we extend it in this thesis into a parallel intermediate representation (Chapter 4), which is necessary for simple and generic parallel code generation.

261 Transformer Analysis

A transformer is an affine relation between the values of variables in the memory store before (old values) and after (new values) the execution of a statement. This analysis [58] is implemented in PIPS, where transformers are

7A representation of data read and written by a given statement


represented by convex polyhedra. Indeed, a transformer T is defined by a list of arguments and a predicate system labeled by T; a transformer is empty (⊥) when the set of the affine constraints of the system is not feasible, and a transformer is equal to the identity when no variables are modified.

This relation models the transformation, by the execution of a statement, of an input memory store into an output memory store. For instance, in Figure 27, the transformer of k, T(k), before and after the execution of the statement k++, is the relation k = kinit + 1, where kinit is the value of k before executing the increment statement; its value before the loop is kinit = 0.

T(i, k) {k = 0}
int i, k = 0;

T(i, k) {i = k + 1, kinit = 0}
for (i = 1; i <= 10; i++) {

  T() {}
  a[i] = f(a[i]);

  T(k) {k = kinit + 1}
  k++;
}

Figure 27 Example of transformer analysis

The transformer analysis is used to compute preconditions (see next section, Section 262) and to compute the path transformer (see Section 64) that connects two memory stores. This analysis is necessary for our dependence test.

262 Precondition Analysis

A precondition analysis is a function that labels each statement in a program with its precondition. A precondition provides information about the values of variables: it is a condition that holds true before the execution of the statement.

This analysis [58] is implemented in PIPS, where preconditions are represented by convex polyhedra. Preconditions are determined before the computation of array regions (see next section, Section 263), in order to have precise information on the variable values. They are also used to perform a sophisticated static complexity analysis (see Section 264). In Figure 28, for instance, the precondition P over the variables i and j before the second loop is i = 11, which is indeed the value of i after the execution of the first loop. Note that no information is available for j at this step.


int i, j;

P(i, j) {}
for (i = 1; i <= 10; i++) {
  P(i, j) {1 <= i, i <= 10}
  a[i] = f(a[i]);
}
P(i, j) {i = 11}
for (j = 6; j <= 20; j++) {
  P(i, j) {i = 11, 6 <= j, j <= 20}
  b[j] = g(a[j], a[j+1]);
}
P(i, j) {i = 11, j = 21}

Figure 28 Example of precondition analysis

263 Array Region Analysis

PIPS provides a precise intra- and inter-procedural analysis of array data flow. The latter is important for computing dependences for each array element. This analysis of array regions [35] is implemented in PIPS. A region r of an array T is an abstraction of a subset of its elements. Formally, a region r is a quadruplet made of (1) a reference v (the variable referenced in the region); (2) a region type t; (3) an approximation a for the region, EXACT when the region exactly represents the requested set of array elements, or MAY if it is an over- or under-approximation; and (4) c, a convex polyhedron containing equalities and inequalities whose parameters are the values of the variables in the current memory store. We write < v − t − a − c > for Region r.

We distinguish four types of sets of regions: Rr, Rw, Ri and Ro. Read regions Rr and Write regions Rw contain the array elements respectively read or written by a statement. In regions Ri contain the array elements read and imported by a statement. Out regions Ro contain the array elements written and exported by a statement.

For instance, in Figure 29, PIPS is able to infer the following sets of regions:

Rw1 = {< a(φ1) − W − EXACT − {1 ≤ φ1, φ1 ≤ 10} >,
       < b(φ1) − W − EXACT − {1 ≤ φ1, φ1 ≤ 10} >}

Rr2 = {< a(φ1) − R − EXACT − {6 ≤ φ1, φ1 ≤ 21} >}

where the Write regions Rw1 of arrays a and b modified in the first loop are EXACTly the array elements with indices in the interval [1,10]. The Read regions Rr2 of array a in the second loop represent EXACTly the elements with indices in [6,21].


< a(φ1) − R − EXACT − {1 ≤ φ1, φ1 ≤ 10} >
< a(φ1) − W − EXACT − {1 ≤ φ1, φ1 ≤ 10} >
< b(φ1) − W − EXACT − {1 ≤ φ1, φ1 ≤ 10} >
for (i = 1; i <= 10; i++) {
  a[i] = f(a[i]);
  b[i] = 42;
}

< a(φ1) − R − EXACT − {6 ≤ φ1, φ1 ≤ 21} >
< b(φ1) − W − EXACT − {6 ≤ φ1, φ1 ≤ 20} >
for (j = 6; j <= 20; j++)
  b[j] = g(a[j], a[j+1]);

Figure 29 Example of array region analysis

We define the following set of operations on sets of regions Ri (convex polyhedra):

1. regions_intersection(R1, R2) is a set of regions; each region r in this set is the intersection of two regions r1 ∈ R1 of type t1 and r2 ∈ R2 of type t2 with the same reference. The type of r combines t1 and t2, and the convex polyhedron of r is the intersection of the two convex polyhedra of r1 and r2, which is also a convex polyhedron.

2. regions_difference(R1, R2) is a set of regions; each region r in this set is the difference of two regions r1 ∈ R1 and r2 ∈ R2 with the same reference. The type of r combines t1 and t2, and the convex polyhedron of r is the set difference between the two convex polyhedra of r1 and r2. Since this is not generally a convex polyhedron, an under-approximation is computed in order to obtain a convex representation.

3. regions_union(R1, R2) is a set of regions; each region r in this set is the union of two regions r1 ∈ R1 of type t1 and r2 ∈ R2 of type t2 with the same reference. The type of r combines t1 and t2, and the convex polyhedron of r is the union of the two convex polyhedra r1 and r2, which again is not necessarily a convex polyhedron. An approximated convex hull is thus computed, in order to return the smallest enclosing polyhedron.

In this thesis, array regions are used for communication cost estimation, dependence computation between statements, and data volume estimation for statements. For example, if we need to compute array dependences between two compound statements S1 and S2, we search for the array elements that are accessed in both statements. Therefore, we are dealing with two sets of regions, R1 and R2, one per statement; the result is the intersection between the two sets of regions. However, the two sets of regions should be in the same


memory store to make it possible to compare them: we need to bring the set of regions R1 into the store of R2, by combining R1 with the changes performed, i.e., the transformer that connects the two memory stores. Thus we compute the path transformer between S1 and S2 (see Section 64). Moreover, we use operations on array regions for our cost model generation (see Section 63) and for communications generation (see Section 73).
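As a small worked example (derived here from the regions of Figure 29, once both sets are expressed in the same memory store), the dependence between the two loops on array a is obtained by intersecting its Write region from the first loop with its Read region from the second loop:

regions_intersection(Rw1, Rr2), restricted to array a, yields the elements satisfying {6 ≤ φ1, φ1 ≤ 10}, i.e., a[6] through a[10]; these elements are written by the first loop and read by the second one, which creates a true (RAW) dependence between the two loops.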

264 ldquoComplexityrdquo Analysis

"Complexities" are symbolic approximations of the static execution time estimation of statements, in terms of numbers of cycles. They are implemented in PIPS using polynomial approximations of execution times. Each statement is thus labeled with an expression, represented by a polynomial over program variables, assuming that each basic operation (addition, multiplication) has a fixed, architecture-dependent execution time. Table 22 shows the complexity formulas generated for each function of Harris (see the code in Figure 11) using the PIPS complexity analysis, where the N and M variables represent the input image size.

Function    Complexity (polynomial)
InitHarris  9 × N × M
SobelX      60 × N × M
SobelY      60 × N × M
MultiplY    17 × N × M
Gauss       85 × N × M
CoarsitY    34 × N × M

Table 22 Execution time estimations for Harris functions using PIPS complexity analysis

We use this analysis, provided by PIPS, to determine an approximate execution time for each statement, in order to guide our scheduling heuristic.

265 PIPS (Sequential) IR

The PIPS intermediate representation (IR) [32] of sequential programs is a hierarchical data structure that embeds both control flow graphs and abstract syntax trees. We provide in this section a high-level description of the intermediate representation of PIPS, which targets the Fortran and C imperative programming languages; it is specified using Newgen [60], a Domain Specific Language for the definition of set equations, from which a dedicated API is automatically generated to manipulate (creation, access, IO operations) the data structures implementing these set elements. This section contains only a slightly simplified subset of the intermediate representation of PIPS, the


part that is directly related to the parallel paradigms addressed in this thesis. The Newgen definition of this part is given in Figure 210.

instruction   = call + forloop + sequence + unstructured
statement     = instruction x declarations:entity*
entity        = name:string x type x initial:value
call          = function:entity x arguments:expression*
forloop       = index:entity x
                lower:expression x upper:expression x
                step:expression x body:statement
sequence      = statements:statement*
unstructured  = entry:control x exit:control
control       = statement x
                predecessors:control* x successors:control*
expression    = syntax x normalized
syntax        = reference + call + cast
reference     = variable:entity x indices:expression*
type          = area + void

Figure 210 Simplified Newgen definitions of the PIPS IR

• Control flow in PIPS IR is represented via instructions, members of the disjoint union (using the "+" symbol) set instruction. An instruction can be either a simple call or a compound instruction, i.e., a for loop, a sequence or a control flow graph. A call instruction represents built-in or user-defined function calls; for instance, assign statements are represented as calls to the "=" function.

• Instructions are included within statements, which are members of the cartesian product set statement that also incorporates the declarations of local variables; a whole function body is represented in PIPS IR as a statement. All named objects, such as user variables or built-in functions, are in PIPS members of the entity set (the value set denotes constants, while the "*" symbol introduces Newgen list sets).

• Compound instructions can be either (1) a loop instruction, which includes an iteration index variable with its lower, upper and increment expressions and a loop body; (2) a sequence, i.e., a succession of statements, encoded as a list; or (3) a control flow graph (unstructured). In Newgen, a given set component such as expression can be distinguished using a prefix, such as lower and upper here.

• Programs that contain structured (exit, break and return) and unstructured (goto) transfers of control are handled in the PIPS intermediate representation via the unstructured set. An unstructured


instruction has one entry and one exit control vertex; a control is a vertex in a graph, labeled with a statement and its lists of predecessor and successor control vertices. Executing an unstructured instruction amounts to following the control flow induced by the graph successor relationship, starting at the entry vertex, while executing the vertex statements, until the exit vertex is reached, if at all.

In this thesis we use the IR of PIPS [58] to showcase our approach, SPIRE, a parallel extension formalism for existing sequential intermediate representations (Chapter 4).

27 Conclusion

The purpose of this chapter is to present the context of existing hardware architectures and parallel programming models for which our thesis work is developed. We also outline how to bridge these two levels, using two approaches for writing parallel code: manual and automatic. Since in this thesis we adopt the automatic approach, this chapter also presents the notions of intermediate representation, control and data dependences, and scheduling, as well as the PIPS compilation framework, in which we implement our automatic parallelization methodology.

In the next chapter, we survey in detail seven recent and popular parallel programming languages and their different parallel constructs. This analysis will help us find a uniform core language that can be designed as a unique but generic parallel intermediate representation, via the development of the SPIRE methodology (see Chapter 4).

Chapter 3

Concepts in Task Parallel Programming Languages

The unconscious is structured like a language. Jacques Lacan

Existing parallel languages present portability issues; OpenMP, for instance, targets only shared memory systems. Parallelizing sequential programs to be run on different types of architectures requires writing the same application in different parallel programming languages. Unifying the wide variety of existing programming models in a generic core language, used as a parallel intermediate representation, can simplify the portability problem. To design this parallel IR, we survey existing parallel language constructs, looking for a trade-off between expressibility and conciseness of representation. This chapter surveys seven popular and efficient parallel language designs that tackle this difficult issue: Cilk, Chapel, X10, Habanero-Java, OpenMP, MPI and OpenCL. Using as a single running example a parallel implementation of the computation of the Mandelbrot set, this chapter describes how the fundamentals of task parallel programming, i.e., collective and point-to-point synchronization and mutual exclusion, are dealt with in these languages. We also discuss how these languages allocate and distribute data over memory. Our study suggests that, even though there are many keywords and notions introduced by these languages, they all boil down, as far as control issues are concerned, to three key task concepts: creation, synchronization and atomicity. Regarding memory models, these languages adopt one of three approaches: shared memory, distributed memory and PGAS (Partitioned Global Address Space).

31 Introduction

This chapter is a state of the art of important existing task parallel programming languages, meant to help us achieve two of our goals in this thesis: (1) the development of a parallel intermediate representation, which should be as generic as possible to handle different parallel programming languages; this is made possible by the taxonomy produced at the end of this chapter; (2) the choice of the languages we use as back-ends of our automatic parallelization process; these are determined by the knowledge of existing target parallel languages, their limits and their extensions.


Indeed, programming parallel machines as effectively as sequential ones would ideally require a language that provides high-level programming constructs, to avoid the programming errors frequent when expressing parallelism. Programming languages adopt one of two ways to deal with this issue: (1) high-level languages hide the presence of parallelism at the software level, thus offering code that is easy to build and port, but whose performance is not guaranteed; and (2) low-level languages use explicit constructs for communication patterns and for specifying the number and placement of threads, but the resulting code is difficult to build and not very portable, although usually efficient. Recent programming models explore the best trade-offs between expressiveness and performance when addressing parallelism.

Traditionally, there are two general ways to break an application into concurrent parts, in order to take advantage of a parallel computer and execute them simultaneously on different CPUs: data and task parallelism. In data parallelism, the same instruction is performed repeatedly and simultaneously on different data. In task parallelism, the execution of different processes (threads) is distributed across multiple computing nodes. Task parallelism is often considered more difficult to specify than data parallelism, since it lacks the regularity present in the latter model: processes (threads) run different instructions simultaneously, leading to different execution schedules and memory access patterns. Task management must address both control and data dependences, in order to minimize execution and communication times.

This chapter describes how seven popular and efficient parallel programming language designs, either widely used or recently developed at DARPA1, tackle the issue of task parallelism specification: Cilk, Chapel, X10, Habanero-Java, OpenMP, MPI and OpenCL. They are selected based on the richness of their constructs and their popularity; they provide simple high-level parallel abstractions that cover most of the parallel programming language design spectrum. We use a popular parallel problem, the computation of the Mandelbrot set [1], as a running example. We consider this an interesting test case, since it exhibits a high level of embarrassing parallelism, while its iteration space is not easily partitioned if one wants to have tasks of balanced run times. Our goal with this chapter is to study and compare aspects of task parallelism for our automatic parallelization aim.

After this introduction, Section 32 presents our running example. We discuss the parallel language features specific to task parallelism, namely task creation, synchronization and atomicity, in Section 33. In Section 34, a selection of current and important parallel programming languages is described: Cilk, Chapel, X10, Habanero-Java, OpenMP, MPI and OpenCL.

1Three programming languages developed as part of the DARPA HPCS program: Chapel, Fortress [13] and X10.


For each language, an implementation of the Mandelbrot set algorithm is presented. Section 35 contains a comparison, discussion and classification of these languages. We conclude in Section 36.

32 Mandelbrot Set Computation

The Mandelbrot set is a fractal set. For each complex number c ∈ C, the sequence of complex numbers zn(c) is defined by induction as follows: z0(c) = c and zn+1(c) = zn(c)² + c. The Mandelbrot set M is then defined as M = {c ∈ C | limn→∞ |zn(c)| < ∞}; thus M is the set of all complex numbers c for which the series zn(c) converges. One can show [1] that a finite limit for zn(c) exists only if the modulus of zm(c) is less than 2 for some positive m. We give a sequential C implementation of the computation of the Mandelbrot set in Figure 32. Running this program yields Figure 31, in which each complex number c is seen as a pixel, its color being related to its convergence property: the Mandelbrot set is the black shape in the middle of the figure.

Figure 31 Result of the Mandelbrot set

We use this base program as our test case for the parallel implementations, written in the selected parallel languages, of Section 34. This is an interesting case for illustrating parallel programming languages: (1) it is an embarrassingly parallel problem, since all computations of pixel colors can be performed simultaneously, and thus it is obviously a good candidate for expressing parallelism; but (2) its efficient implementation is not obvious, since good load balancing cannot be achieved by simply grouping localized pixels together, because convergence can vary widely from one point to the next, due to the fractal nature of the Mandelbrot set.


unsigned long min_color = 0, max_color = 16777215;
unsigned int width = NPIXELS, height = NPIXELS,
             N = 2, maxiter = 10000;
double r_min = -N, r_max = N, i_min = -N, i_max = N;
double scale_r = (r_max - r_min)/width;
double scale_i = (i_max - i_min)/height;
double scale_color = (max_color - min_color)/maxiter;
Display *display; Window win; GC gc;
for (row = 0; row < height; ++row) {
  for (col = 0; col < width; ++col) {
    zr = zi = 0;
    /* Scale c as display coordinates of current point */
    cr = r_min + ((double) col * scale_r);
    ci = i_min + ((double) (height-1-row) * scale_i);
    /* Iterate z = z*z+c while |z| < N or maxiter is reached */
    k = 0;
    do {
      temp = zr*zr - zi*zi + cr;
      zi = 2*zr*zi + ci; zr = temp;
      ++k;
    } while (zr*zr + zi*zi < (N*N) && k < maxiter);
    /* Set color and display point */
    color = (ulong) ((k-1) * scale_color) + min_color;
    XSetForeground(display, gc, color);
    XDrawPoint(display, win, gc, col, row);
  }
}

Figure 3.2: Sequential C implementation of the Mandelbrot set

3.3 Task Parallelism Issues

Among the many issues related to parallel programming, the questions of task creation, synchronization and atomicity are particularly important when dealing with task parallelism, our focus in this thesis.

3.3.1 Task Creation

In this thesis, a task is a static notion, i.e., a list of instructions, while processes and threads are running instances of tasks. Creation of system-level task instances is an expensive operation, since its implementation, via processes, requires allocating and later possibly releasing system-specific resources. If a task has a short execution time, this overhead might make the overall computation quite inefficient. Another way to introduce parallelism is to use lighter, user-level tasks, called threads. In all languages addressed in this chapter, task management operations refer to such user-level tasks. The problem of finding the proper size of tasks, and hence the number of tasks, can be decided at compile or run time, using heuristics.

In our Mandelbrot example, the parallel implementations we provide


below use a static schedule that allocates a number of iterations of the loop row to a particular thread; we interleave successive iterations into distinct threads in a round-robin fashion, in order to group loop body computations into chunks of size height/P, where P is the (language-dependent) number of threads. Our intent here is to try to reach a good load balancing between threads.
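As an illustration, the following minimal C sketch (our own, independent of any particular parallel language; compute_pixel, thread_id and P are names we introduce) shows this interleaved, round-robin assignment of rows to threads:

/* Round-robin (cyclic) assignment of the 'row' loop to P threads:      */
/* thread 'thread_id' executes rows thread_id, thread_id+P, ...,        */
/* i.e., roughly height/P rows per thread.                              */
void compute_rows(unsigned int thread_id, unsigned int P,
                  unsigned int height, unsigned int width)
{
  for (unsigned int row = thread_id; row < height; row += P)
    for (unsigned int col = 0; col < width; ++col)
      compute_pixel(row, col);   /* hypothetical per-pixel Mandelbrot work */
}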

3.3.2 Synchronization

Coordination in task-parallel programs is a major source of complexity. It is dealt with using synchronization primitives, for instance when a code fragment contains many phases of execution, where each phase should wait for the previous ones before proceeding. When a process or a thread exits before synchronizing on a barrier that other processes are waiting on, or when processes operate on different barriers using different orders, a deadlock occurs; programmers must avoid these situations and keep their code deadlock-free. Different forms of synchronization constructs exist, such as mutual exclusion when accessing shared resources using locks, join operations that terminate child threads, multiple synchronizations using barriers², and point-to-point synchronization using counting semaphores [95].
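As a concrete, language-neutral illustration of point-to-point synchronization, the following minimal C sketch uses a POSIX counting semaphore; the producer/consumer function names are ours and error handling is omitted:

#include <pthread.h>
#include <semaphore.h>

sem_t data_ready;                    /* counting semaphore, initially 0         */

void *producer(void *arg) {
  /* ... produce a value ... */
  sem_post(&data_ready);             /* signal: increment the counter           */
  return NULL;
}

void *consumer(void *arg) {
  sem_wait(&data_ready);             /* block until the counter is positive     */
  /* ... consume the value ... */
  return NULL;
}

int main(void) {
  pthread_t p, c;
  sem_init(&data_ready, 0, 0);       /* shared between threads, initial value 0 */
  pthread_create(&p, NULL, producer, NULL);
  pthread_create(&c, NULL, consumer, NULL);
  pthread_join(p, NULL); pthread_join(c, NULL);
  sem_destroy(&data_ready);
  return 0;
}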

In our Mandelbrot example, we need to synchronize all pixel computations before exiting; we also need to use synchronization to deal with the atomic section (see the next section). Even though synchronization is rather simple in this example, caution is always needed; an example that may lead to deadlocks is mentioned in Section 3.5.

3.3.3 Atomicity

Access to shared resources requires atomic operations that, at any given time, can be executed by only one process or thread. Atomicity comes in two flavors: weak and strong [73]. A weak atomic statement is atomic only with respect to other explicitly atomic statements; no guarantee is made regarding interactions with non-isolated statements (those not declared as atomic). By opposition, strong atomicity enforces non-interaction of atomic statements with all operations in the entire program. It usually requires specialized hardware support (e.g., atomic "compare and swap" operations), although a software implementation that treats non-explicitly atomic accesses as implicitly atomic single operations (using a single global lock) is possible.

In our Mandelbrot example, display accesses require connection to the X server; drawing a given pixel is an atomic operation since GUI-specific calls

² The term "barrier" is used in various ways by authors [87]; we consider here that barriers are synchronization points that wait for the termination of sets of threads, defined in a language-dependent manner.


need synchronization. Moreover, two simple examples of atomic sections are provided in Section 3.5.

3.4 Parallel Programming Languages

We present here seven parallel programming languages and describe how they deal with the concepts introduced in the previous section. We also study how these languages distribute data over different processors. Given the large number of parallel languages that exist, we focus primarily on languages that are in current use and popular, and that support simple, high-level, task-oriented parallel abstractions.

3.4.1 Cilk

Cilk [103], developed at MIT, is a multithreaded parallel language based on C for shared memory systems. Cilk is designed for exploiting dynamic and asynchronous parallelism. A Cilk implementation of the Mandelbrot set is provided in Figure 3.3³.

cilk_lock_init(display_lock);
for (m = 0; m < P; m++)
  spawn compute_points(m);
sync;

cilk void compute_points(uint m) {
  for (row = m; row < height; row += P)
    for (col = 0; col < width; ++col) {
      /* Initialization of c, k and z */
      do {
        temp = zr*zr - zi*zi + cr;
        zi = 2*zr*zi + ci; zr = temp;
        ++k;
      } while (zr*zr + zi*zi < (N*N) && k < maxiter);
      color = (ulong) ((k-1) * scale_color) + min_color;
      cilk_lock(display_lock);
      XSetForeground(display, gc, color);
      XDrawPoint(display, win, gc, col, row);
      cilk_unlock(display_lock);
    }
}

Figure 3.3: Cilk implementation of the Mandelbrot set (P is the number of processors)

³ From now on, variable declarations are omitted unless required for the purpose of our presentation.


Task Parallelism

The cilk keyword identifies functions that can be spawned in parallel. A Cilk function may create threads to execute functions in parallel. The spawn keyword is used to create child tasks, such as compute_points in our example, when referring to Cilk functions.

Cilk introduces the notion of inlets [2], which are local Cilk functions defined to take the result of spawned tasks and use it (performing a reduction); the result should not be put in a variable in the parent function. All the variables of the function are available within an inlet. The abort statement allows one to abort a speculative work by terminating all of the already spawned children of a function; it must be called inside an inlet. Inlets are not used in our example.
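For illustration only, the following minimal sketch, adapted from the classical recursive Fibonacci example rather than from our Mandelbrot code, hints at how an inlet can accumulate the results of spawned tasks:

cilk int fib(int n)
{
  int sum = 0;
  inlet void accumulate(int result)   /* runs atomically w.r.t. the parent frame */
  {
    sum += result;
  }
  if (n < 2) return n;
  accumulate(spawn fib(n - 1));
  accumulate(spawn fib(n - 2));
  sync;                               /* wait for both spawned children          */
  return sum;
}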

Synchronization

The sync statement is a local barrier, used in our example to ensure task termination; it waits only for the spawned child tasks of the current procedure to complete, and not for all tasks currently being executed.

Atomic Section

Mutual exclusion is implemented using locks of type cilk_lockvar, such as display_lock in our example. The function cilk_lock is used to test a lock and block if it is already acquired; the function cilk_unlock is used to release a lock. Both functions take a single argument, which is an object of type cilk_lockvar; cilk_lock_init is used to initialize the lock object before it is used.

Data Distribution

In Cilk's shared memory model, all variables declared outside Cilk functions are shared. Variables addressed indirectly, by passing pointers to spawned functions, are also shared. To avoid possible non-determinism due to data races, the programmer should avoid situations where a task writes a variable that may be read or written concurrently by another task, or use the primitive cilk_fence, which ensures that all memory operations of a thread are committed before the next operation executes.

3.4.2 Chapel

Chapel [6], developed by Cray, supports both data and control flow parallelism and is designed around a multithreaded execution model based on PGAS, for shared and distributed-memory systems. A Chapel implementation of the Mandelbrot set is provided in Figure 3.4.


coforall loc in Locales do
  on loc {
    for row in loc.id..height by numLocales do {
      for col in 1..width do {
        // Initialization of c, k and z
        do {
          temp = zr*zr - zi*zi + cr;
          zi = 2*zr*zi + ci; zr = temp;
          k = k+1;
        } while (zr*zr + zi*zi < (N*N) && k < maxiter);
        color = (ulong) ((k-1) * scale_color) + min_color;
        atomic {
          XSetForeground(display, gc, color);
          XDrawPoint(display, win, gc, col, row);
        }
      }
    }
  }

Figure 3.4: Chapel implementation of the Mandelbrot set

Task Parallelism

Chapel provides three types of task parallelism [6], two structured ones and one unstructured. cobegin{stmts} creates a task for each statement in stmts; the parent task waits for the stmts tasks to be completed. coforall is a loop variant of the cobegin statement, where each iteration of the coforall loop is a separate task and the main thread of execution does not continue until every iteration is completed. Finally, in begin{stmt}, the original parent task continues its execution after spawning a child running stmt.

Synchronization

In addition to cobegin and coforall, used in our example, which have an implicit synchronization at the end, synchronization variables of type sync

can be used for coordinating parallel tasks. A sync [6] variable is either empty or full, with an additional data value. Reading an empty variable or writing to a full variable suspends the thread. Writing to an empty variable atomically changes its state to full. Reading a full variable consumes the value and atomically changes the state to empty.

Atomic Section

Chapel supports atomic sections: atomic{stmt} executes stmt atomically with respect to other threads. The precise semantics is still ongoing work [6].


Data Distribution

Chapel introduces a type called locale to refer to a unit of the machine resources on which a computation is running. A locale is a mapping of Chapel data and computations to the physical machine. In Figure 3.4, Array Locales represents the set of locale values corresponding to the machine resources on which this code is running; numLocales refers to the number of locales. Chapel also introduces new domain types to specify array distribution; they are not used in our example.

3.4.3 X10 and Habanero-Java

X10 [5], developed at IBM, is a distributed asynchronous dynamic parallel programming language for multi-core processors, symmetric shared-memory multiprocessors (SMPs), commodity clusters, high end supercomputers and even embedded processors like Cell. An X10 implementation of the Mandelbrot set is provided in Figure 3.5.

Habanero-Java [30], under development at Rice University, is derived from X10 and introduces additional synchronization and atomicity primitives, surveyed below.

finish {
  for (m = 0; m < place.MAX_PLACES; m++) {
    place pl_row = place.places(m);
    async at (pl_row) {
      for (row = m; row < height; row += place.MAX_PLACES)
        for (col = 0; col < width; ++col) {
          // Initialization of c, k and z
          do {
            temp = zr*zr - zi*zi + cr;
            zi = 2*zr*zi + ci; zr = temp;
            ++k;
          } while (zr*zr + zi*zi < (N*N) && k < maxiter);
          color = (ulong) ((k-1) * scale_color) + min_color;
          atomic {
            XSetForeground(display, gc, color);
            XDrawPoint(display, win, gc, col, row);
          }
        }
    }
  }
}

Figure 3.5: X10 implementation of the Mandelbrot set

Task Parallelism

X10 provides two task creation primitives: (1) the async stmt construct creates a new asynchronous task that executes stmt, while the current thread continues, and (2) the future exp expression launches a parallel task that returns the value of exp.


Synchronization

With finish stmt, the current running task is blocked at the end of the finish clause, waiting till all the children spawned during the execution of stmt have terminated. The expression f.force() is used to get the actual value of the "future" task f.

X10 introduces a new synchronization concept: the clock. It acts as a barrier for a dynamically varying set of tasks [100] that operate in phases of execution, where each phase should wait for previous ones before proceeding. A task that uses a clock must first register with it (multiple clocks can be used). It then uses the statement next to signal to all the tasks that are registered with its clocks that it is ready to move to the following phase, and waits until all the clocks with which it is registered can advance. A clock can advance only when all the tasks that are registered with it have executed a next statement (see the left side of Figure 3.6).

Habanero-Java introduces phasers to extend this clock mechanism. A phaser is created and initialized to its first phase using the function new. The scope of a phaser is limited to the immediately enclosing finish statement; this constraint guarantees the deadlock-freedom safety property of phasers [100]. A task can be registered with zero or more phasers, using one of four registration modes: the first two are the traditional SIG and WAIT signal operations for producer-consumer synchronization; the SIG_WAIT mode implements barrier synchronization, while SIG_WAIT_SINGLE ensures, in addition, that its associated statement is executed by only one thread. As in X10, a next instruction is used to advance each phaser that this task is registered with to its next phase, in accordance with this task's registration mode, and waits on each phaser that the task is registered with with a WAIT submode (see the right side of Figure 3.6). Note that clocks and phasers are not used in our Mandelbrot example, since a collective barrier based on the finish statement is sufficient.

// Left: a clock in X10
finish async {
  clock cl = clock.make();
  for (j = 1; j <= n; j++)
    async clocked(cl) {
      S;
      next;
      S';
    }
}

// Right: a phaser in Habanero-Java
finish {
  phaser ph = new phaser();
  for (j = 1; j <= n; j++)
    async phased(ph<SIG_WAIT>) {
      S;
      next;
      S';
    }
}

Figure 3.6: A clock in X10 (left) and a phaser in Habanero-Java (right)


Atomic Section

When a thread enters an atomic statement, no other thread may enter it until the original thread terminates it.

Habanero-Java supports weak atomicity using the isolated stmt primitive for mutual exclusion. The Habanero-Java implementation takes a single-lock approach to deal with isolated statements.

Data Distribution

In order to distribute data across processors, X10 and HJ introduce a type called place. A place is an address space within which a task may run; different places may, however, refer to the same physical processor and share physical memory. The program address space is partitioned into logically distinct places. place.MAX_PLACES, used in Figure 3.5, is the number of places available to a program.

3.4.4 OpenMP

OpenMP [4] is an application program interface providing a multi-threaded programming model for shared memory parallelism; it uses directives to extend sequential languages. A C OpenMP implementation of the Mandelbrot set is provided in Figure 3.7.

P = omp_get_num_threads();
#pragma omp parallel shared(height, width, scale_r, \
    scale_i, maxiter, scale_color, min_color, r_min, i_min) \
  private(row, col, k, m, color, temp, z, c)
#pragma omp single
{
  for (m = 0; m < P; m++)
    #pragma omp task
    for (row = m; row < height; row += P) {
      for (col = 0; col < width; ++col) {
        /* Initialization of c, k and z */
        do {
          temp = zr*zr - zi*zi + cr;
          zi = 2*zr*zi + ci; zr = temp;
          ++k;
        } while (zr*zr + zi*zi < (N*N) && k < maxiter);
        color = (ulong) ((k-1) * scale_color) + min_color;
        #pragma omp critical
        {
          XSetForeground(display, gc, color);
          XDrawPoint(display, win, gc, col, row);
        }
      }
    }
}

Figure 3.7: C OpenMP implementation of the Mandelbrot set


Task Parallelism

OpenMP allows dynamic (omp task) and static (omp section) scheduling models. A task instance is generated each time a thread (the encountering thread) encounters an omp task directive. This task may either be scheduled immediately on the same thread, or deferred and assigned to any thread in a thread team, which is the group of threads created when an omp parallel directive is encountered. The omp sections directive is a non-iterative work-sharing construct: it specifies that the enclosed sections of code, declared with omp section, are to be divided among the threads in the team; these sections are independent blocks of code that the compiler can execute concurrently.
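For illustration, a minimal C sketch of the static work-sharing construct could look as follows (foo and bar are placeholder functions):

#pragma omp parallel sections
{
  #pragma omp section
  x = foo();     /* executed by one thread of the team                  */
  #pragma omp section
  y = bar();     /* may be executed concurrently by another thread      */
}                /* implicit barrier at the end of the sections region  */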

Synchronization

OpenMP provides synchronization constructs that control the execution inside a team of threads: barrier and taskwait. When a thread encounters a barrier directive, it waits until all other threads in the team reach the same point; the scope of a barrier region is the innermost enclosing parallel region. The taskwait construct is a restricted barrier that blocks the thread until all child tasks created since the beginning of the current task are completed. The omp single directive identifies code that must be run by only one thread.
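The following short C sketch (work_a, work_b and finalize are placeholder functions of ours) hints at how these constructs combine:

#pragma omp parallel
{
  #pragma omp single nowait   /* one thread creates the tasks...              */
  {
    #pragma omp task
    work_a();
    #pragma omp task
    work_b();
    #pragma omp taskwait      /* ...and waits for its own child tasks only    */
  }
  #pragma omp barrier         /* all threads of the team synchronize here     */
  finalize();
}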

Atomic Section

The critical and atomic directives are used for identifying a section of code that must be executed by a single thread at a time. The atomic directive works faster than critical, since it only applies to single instructions and can thus often benefit from hardware support. Our implementation of the Mandelbrot set in Figure 3.7 uses critical for the drawing function (which is not a single instruction).

Data Distribution

OpenMP variables are either global (shared) or local (private); see Figure 3.7 for examples. A shared variable refers to one unique block of storage for all threads in the team. A private variable refers to a different block of storage for each thread. More memory access modes exist, such as firstprivate or lastprivate, that may require communication or copy operations.
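The following minimal C sketch (the variable names are ours) illustrates these data-sharing attributes:

int base = 10, last = 0;
#pragma omp parallel for firstprivate(base) lastprivate(last)
for (int i = 0; i < 100; i++) {
  /* 'base' is private to each thread but initialized from the master copy; */
  /* 'last' is private, and the value written by the sequentially last      */
  /* iteration is copied back to the shared variable after the loop.        */
  last = i + base;
}
/* here last == 109 */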

3.4.5 MPI

MPI [3] is a library specification for message passing, used to program shared and distributed memory systems. Both point-to-point and collective


communication are supported. The C MPI implementation of the Mandelbrot set is provided in Figure 3.8.

int rank;
MPI_Status status;
ierr = MPI_Init(&argc, &argv);
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
ierr = MPI_Comm_size(MPI_COMM_WORLD, &P);
for (m = 1; m < P; m++)
  if (rank == m) {
    for (row = m; row < height; row += P) {
      for (col = 0; col < width; ++col) {
        /* Initialization of c, k and z */
        do {
          temp = zr*zr - zi*zi + cr;
          zi = 2*zr*zi + ci; zr = temp;
          ++k;
        } while (zr*zr + zi*zi < (N*N) && k < maxiter);
        color_msg[col] = (ulong) ((k-1) * scale_color) + min_color;
      }
      ierr = MPI_Send(color_msg, width, MPI_UNSIGNED_LONG, 0, 0,
                      MPI_COMM_WORLD);
    }
  }
if (rank == 0) {
  for (row = 0; row < height; row++) {
    ierr = MPI_Recv(data_msg, width, MPI_UNSIGNED_LONG,
                    MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);
    for (col = 0; col < width; ++col) {
      color = data_msg[col];
      XSetForeground(display, gc, color);
      XDrawPoint(display, win, gc, col, row);
    }
  }
}
ierr = MPI_Barrier(MPI_COMM_WORLD);
ierr = MPI_Finalize();

Figure 3.8: MPI implementation of the Mandelbrot set (P is the number of processors)

Task Parallelism

The starter process may be a separate process that is not part of the MPI application, or the rank 0 process may act as a starter process to launch the remaining MPI processes of the MPI application. MPI processes are created when MPI_Init is called; this call also defines the initial universe intracommunicator used by all processes to drive the various communications. A communicator determines the scope of communications. Processes are terminated using


MPI_Finalize.

Synchronization

MPI_Barrier blocks until all processes in the communicator passed as an argument have reached this call.

Atomic Section

MPI does not support atomic sections; indeed, the atomic section is a concept intrinsically linked to shared memory. In our example, the drawing function is executed by the master process (the rank 0 process).

Data Distribution

MPI supports the distributed memory model, where it proceeds by point-to-point and collective communications to transfer data between processors. In our implementation of the Mandelbrot set, we used the point-to-point routines MPI_Send and MPI_Recv to communicate between the master process (rank = 0) and the other processes.

3.4.6 OpenCL

OpenCL (Open Computing Language) [69] is a standard for programming heterogeneous multiprocessor platforms where programs are divided into several parts: some, called "the kernels", that execute on separate devices, e.g., GPUs, with their own memories, and the others that execute on the host CPU. The main object in OpenCL is the command queue, which is used to submit work to a device by enqueueing OpenCL commands to be executed. An OpenCL implementation of the Mandelbrot set is provided in Figure 3.9.

Task Parallelism

OpenCL provides the parallel construct clEnqueueTask, which enqueues a command requiring the execution of a kernel on a device by a work item (OpenCL thread). OpenCL uses two different models of execution of command queues: in-order, used for data parallelism, and out-of-order. In an out-of-order command queue, commands are executed as soon as possible, and no order is specified, except for wait and barrier events. We illustrate the out-of-order execution mechanism in Figure 3.9, but currently this is an optional feature and is thus not supported by many devices.


__kernel void kernel_main(complex c, uint maxiter, double scale_color,
                          uint m, uint P, ulong color[NPIXELS][NPIXELS])
{
  for (row = m; row < NPIXELS; row += P)
    for (col = 0; col < NPIXELS; ++col) {
      /* Initialization of c, k and z */
      do {
        temp = zr*zr - zi*zi + cr;
        zi = 2*zr*zi + ci; zr = temp;
        ++k;
      } while (zr*zr + zi*zi < (N*N) && k < maxiter);
      color[row][col] = (ulong) ((k-1) * scale_color);
    }
}

cl_int ret = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
ret = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_DEFAULT, 1,
                     &device_id, &ret_num_devices);
cl_context context = clCreateContext(NULL, 1, &device_id,
                                     NULL, NULL, &ret);
cQueue = clCreateCommandQueue(context, device_id,
                              OUT_OF_ORDER_EXEC_MODE_ENABLE, NULL);
P = CL_DEVICE_MAX_COMPUTE_UNITS;
memc = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(complex), c);
/* Create read-only buffers with maxiter, scale_color and P too */
memcolor = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                          sizeof(ulong)*height*width, NULL, NULL);
clEnqueueWriteBuffer(cQueue, memc, CL_TRUE, 0,
                     sizeof(complex), &c, 0, NULL, NULL);
/* Enqueue write buffer with maxiter, scale_color and P too */
program = clCreateProgramWithSource(context, 1, &program_source,
                                    NULL, NULL);
err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
kernel = clCreateKernel(program, "kernel_main", NULL);
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memc);
/* Set kernel arguments with
   memmaxiter, memscale_color, memP and memcolor too */
for (m = 0; m < P; m++) {
  memm = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(uint), m);
  clEnqueueWriteBuffer(cQueue, memm, CL_TRUE, 0, sizeof(uint), &m, 0,
                       NULL, NULL);
  clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memm);
  clEnqueueTask(cQueue, kernel, 0, NULL, NULL);
}
clFinish(cQueue);
clEnqueueReadBuffer(cQueue, memcolor, CL_TRUE, 0, space, color, 0, NULL, NULL);
for (row = 0; row < height; ++row)
  for (col = 0; col < width; ++col) {
    XSetForeground(display, gc, color[col][row]);
    XDrawPoint(display, win, gc, col, row);
  }

Figure 3.9: OpenCL implementation of the Mandelbrot set

Synchronization

OpenCL distinguishes between two types of synchronization: coarse and fine. Coarse-grained synchronization, which deals with command queue operations, uses the construct clEnqueueBarrier, which defines a barrier synchronization point. Fine-grained synchronization, which covers


synchronization at the GPU function call granularity level, uses OpenCL events via clEnqueueWaitForEvents calls.

Data transfers between the GPU memory and the host memory, via functions such as clEnqueueReadBuffer and clEnqueueWriteBuffer, also induce synchronization between blocking or non-blocking communication commands. Events returned by clEnqueue operations can be used to check if a non-blocking operation has completed.
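For instance, a non-blocking read can be tied to an event and waited for later, roughly as in the following sketch (host_buffer and size_in_bytes are names we introduce; cQueue and memcolor are those of Figure 3.9):

cl_event read_done;
/* Non-blocking read: returns immediately, 'read_done' tracks completion */
clEnqueueReadBuffer(cQueue, memcolor, CL_FALSE, 0, size_in_bytes,
                    host_buffer, 0, NULL, &read_done);
/* ... other work can be enqueued or performed here ... */
clWaitForEvents(1, &read_done);      /* block until the read has completed */
clReleaseEvent(read_done);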

Atomic Section

Atomic operations are only supported on integer data, via functions such as atom_add or atom_xchg. Currently, these are only supported by some devices, as part of an extension of the OpenCL standard. Since OpenCL lacks support for general atomic sections, the drawing function is executed by the host in Figure 3.9.

Data Distribution

Each work item can use (1) its private memory, (2) its local memory, which is shared between multiple work items, (3) its constant memory, which is closer to the processor than the global memory, and thus much faster to access, although slower than local memory, and (4) global memory, shared by all work items. Data is only accessible after being transferred from the host, using functions such as clEnqueueReadBuffer and clEnqueueWriteBuffer that move data in and out of a device.
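These four address spaces correspond to qualifiers in OpenCL C kernel code, as in the following illustrative signature (the names are ours):

__kernel void scale(__global   float *out,    /* (4) visible to all work items  */
                    __constant float *coeff,  /* (3) read-only, cached          */
                    __local    float *tile)   /* (2) shared within a work group */
{
  __private int i = get_global_id(0);         /* (1) private to this work item  */
  out[i] = coeff[0] * out[i];
}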

3.5 Discussion and Comparison

This section discusses the salient features of our surveyed languages. More specifically, we look at their design philosophy and the new concepts they introduce, how point-to-point synchronization is addressed in each of these languages, the various semantics of atomic sections and the data distribution issues. We end up summarizing the key features of all the languages covered in this chapter.

Design Paradigms

Our overview study, based on a single running example, namely the computation of the Mandelbrot set, is admittedly somewhat biased, since each language has been designed with a particular application framework in mind, which may or may not be well adapted to a given application. Cilk is well suited to deal with divide-and-conquer strategies, something not put into practice in our example. On the contrary, X10, Chapel and Habanero-Java


are high-level Partitioned Global Address Space languages that offer abstract notions such as places and locales, which were put to good use in our example. OpenCL is a low-level, verbose language that works across GPUs and CPUs; our example clearly illustrates that this approach is not providing much help here in terms of shrinking the semantic gap between specification and implementation. The OpenMP philosophy is to add compiler directives to parallelize parts of code on shared-memory machines; this helps programmers move incrementally from a sequential to a parallel implementation. While OpenMP suffers from portability issues, MPI tackles both shared and distributed memory systems using a low-level support for communications.

New Concepts

Even though this thesis does not address data parallelism per se, note that Cilk is the only language that does not provide support for data parallelism; yet spawned threads can be used inside loops to simulate SIMD processing. Also, Cilk adds a facility to support speculative parallelism, enabling spawned-task abort operations via the abort statement. Habanero-Java introduces the isolated statement to specify the weak atomicity property. Phasers, in Habanero-Java, and clocks, in X10, are new high-level constructs for collective and point-to-point synchronization between varying sets of threads.

Point-to-Point Synchronization

We illustrate the way the surveyed languages address the difficult issue of point-to-point synchronization via a simple example, a hide-and-seek children's game, in Figure 3.10. X10 clocks or Habanero-Java phasers help express easily the different phases between threads. The notion of point-to-point synchronization cannot be expressed easily using OpenMP or Chapel⁴. We were not able to implement this game using Cilk high-level synchronization primitives, since sync, the only synchronization construct, is a local barrier for recursive tasks: it synchronizes only threads spawned in the current procedure, and thus not the two searcher and hider tasks. As mentioned above, this is not surprising given Cilk's approach to parallelism.

Atomic Section

The semantics and implementations of the various proposals for dealing with atomicity are rather subtle.

Atomic operations, which apply to single instructions, can be efficiently implemented, e.g., in X10, using non-blocking techniques such as the atomic

⁴ Busy-waiting techniques can be applied.


// X10
finish async {
  clock cl = clock.make();
  async clocked(cl) {
    count_to_a_number();
    next;
    start_searching();
  }
  async clocked(cl) {
    hide_oneself();
    next;
    continue_to_be_hidden();
  }
}

// Habanero-Java (HJ)
finish async {
  phaser ph = new phaser();
  async phased(ph) {
    count_to_a_number();
    next;
    start_searching();
  }
  async phased(ph) {
    hide_oneself();
    next;
    continue_to_be_hidden();
  }
}

// Cilk
cilk void searcher() {
  count_to_a_number();
  point_to_point_sync();   /* missing */
  start_searching();
}

cilk void hider() {
  hide_oneself();
  point_to_point_sync();   /* missing */
  continue_to_be_hidden();
}

void main() {
  spawn searcher();
  spawn hider();
}

Figure 3.10: A hide-and-seek game (X10, HJ, Cilk)

instruction compare-and-swap. In OpenMP, the atomic directive can be made to work faster than the critical directive when atomic operations are replaced with processor commands such as GLSC [71] instructions, or "gather-linked and scatter-conditional", which extend scatter-gather hardware to support advanced atomic memory operations; therefore, it is better to use this directive when protecting shared memory during elementary operations. Atomic operations can be used to update different elements of a data structure (arrays, records) in parallel, without using many explicit locks. In the example of Figure 3.11, the updates of different elements of Array x are allowed to occur in parallel. General atomic sections, on the other hand, serialize the execution of updates to elements via one lock.

With the weak atomicity model of Habanero-Java, the isolated keyword is used instead of atomic to make explicit the fact that the construct supports weak rather than strong isolation. In Figure 3.12, Threads 1 and 2 may access ptr simultaneously; since weakly atomic accesses are used, an atomic access to temp->next is not enforced.


#pragma omp parallel for shared(x, index, n)
for (i = 0; i < n; i++) {
  #pragma omp atomic
  x[index[i]] += f(i);     /* index is supposed injective */
}

Figure 3.11: Example of an atomic directive in OpenMP

// Thread 1
ptr = head;          // non-isolated statement
isolated {
  ready = true;
}

// Thread 2
isolated {
  if (ready)
    temp->next = ptr;
}

Figure 3.12: Data race on ptr with Habanero-Java

Data Distribution

PGAS languages offer a compromise between the fine level of control of data placement provided by the message passing model and the simplicity of the shared memory model. However, the physical reality is that different PGAS portions, although logically distinct, may refer to the same physical processor and share physical memory. Practical performance might thus not be as good as expected.

Regarding the shared memory model, despite its simplicity of programming, programmers have scarce support for expressing data locality, which could help improve performance in many cases. Debugging is also difficult when data races or deadlocks occur.

Finally, the message passing memory model, where processors have no direct access to the memories of other processors, can be seen as the most general one, in which programmers can both specify data distribution and control locality. Shared memory (where there is only one processor managing the whole memory) and PGAS (where one assumes that each portion is located on a distinct processor) models can be seen as particular instances of the message passing model, when converting implicit write and read operations into explicit send/receive message passing constructs.

Summary Table

We collect in Table 3.1 the main characteristics of each language addressed in this chapter. Even though we have not discussed the issue of data parallelism in this chapter, we nonetheless provide the main constructs used in each language to launch data parallel computations.


Cilk (MIT)
  Task creation: spawn, abort; task join: sync; point-to-point: none; atomic section: cilk_lock, cilk_unlock; data parallelism: none; memory model: Shared.

Chapel (Cray)
  Task creation: begin, cobegin; task join: sync; point-to-point: sync; atomic section: sync, atomic; data parallelism: forall, coforall; memory model: PGAS (Locales).

X10 (IBM)
  Task creation: async, future; task join: finish, force; point-to-point: next; atomic section: atomic; data parallelism: foreach; memory model: PGAS (Places).

Habanero-Java (Rice)
  Task creation: async, future; task join: finish, get; point-to-point: next; atomic section: atomic, isolated; data parallelism: foreach; memory model: PGAS (Places).

OpenMP
  Task creation: omp task, omp section; task join: omp taskwait, omp barrier; point-to-point: none; atomic section: omp critical, omp atomic; data parallelism: omp for; memory model: Shared.

OpenCL
  Task creation: EnqueueTask; task join: Finish, EnqueueBarrier; point-to-point: events; atomic section: atom_add; data parallelism: EnqueueNDRangeKernel; memory model: Message passing.

MPI
  Task creation: MPI_spawn; task join: MPI_Finalize, MPI_Barrier; point-to-point: none; atomic section: none; data parallelism: MPI_Init; memory model: Message passing.

Table 3.1: Summary of parallel language constructs

3.6 Conclusion

Using the Mandelbrot set computation as a running example, this chapter presents an up-to-date comparative overview of seven parallel programming languages: Cilk, Chapel, X10, Habanero-Java, OpenMP, MPI and OpenCL. These languages are in current use, popular, offer rich and highly abstract functionalities, and most of them support both data and task parallel execution models. The chapter describes how, in addition to data distribution and locality, the fundamentals of task parallel programming, namely task creation, collective and point-to-point synchronization, and mutual exclusion, are dealt with in these languages.

This study serves as the basis for the design of SPIRE, a sequential to parallel intermediate representation extension that we use to upgrade the intermediate representations of the PIPS [19] source-to-source compilation framework to represent task concepts in parallel languages. SPIRE is presented in the next chapter.

An earlier version of the work presented in this chapter was published in [66].

Chapter 4

SPIRE: A Generic Sequential to Parallel Intermediate Representation Extension Methodology

I was working on the proof of one of my poems all the morning, and took out a comma. In the afternoon I put it back again. Oscar Wilde

The research goal addressed in this thesis is the design of efficient automatic parallelization algorithms that use, as a backend, a uniform, simple and generic parallel intermediate core language. Instead of designing from scratch one such particular intermediate language, we take an indirect, more general route. We use the survey presented in Chapter 3 to design SPIRE, a new methodology for the design of parallel extensions of the intermediate representations used in compilation frameworks of sequential languages. It can be used to leverage existing infrastructures for sequential languages to address both control and data parallel constructs, while preserving as much as possible existing analyses for sequential code. In this chapter, we suggest to view this upgrade process as an "intermediate representation transformer" at the syntactic and semantic levels; we show this can be done via the introduction of only ten new concepts, collected in three groups, namely execution, synchronization and data distribution, precisely defined via a formal semantics and rewriting rules.

We use the sequential intermediate representation of PIPS, a comprehensive source-to-source compilation platform, as a use case for our approach to the definition of parallel intermediate languages. We introduce our SPIRE parallel primitives, extend the PIPS intermediate representation and show how example code snippets from the OpenCL, Cilk, OpenMP, X10, Habanero-Java, MPI and Chapel parallel programming languages can be represented this way. A formal definition of SPIRE operational semantics is provided, built on top of the one used for the sequential intermediate representation. We assess the generality of our proposal by showing how a different sequential IR, namely LLVM, can be extended to handle parallelism using the SPIRE methodology.

Our primary goal with the development of SPIRE is to provide, at a low cost, powerful parallel program representations that ease the design of efficient


automatic parallelization algorithms. More precisely, in this chapter, our intent is to describe SPIRE as a uniform, simple and generic core language transformer, and to apply it to the PIPS IR. We illustrate the applicability of the resulting parallel IR by using it as the parallel code target in Chapter 7, where we also address code generation issues.

4.1 Introduction

The growing importance of parallel computers and the search for efficient programming models led, and is still leading, to the proliferation of parallel programming languages such as, currently, Cilk [103], Chapel [6], X10 [5], Habanero-Java [30], OpenMP [4], OpenCL [69] or MPI [3], presented in Chapter 3. Our automatic task parallelization algorithm, which we integrate within PIPS [58], a comprehensive source-to-source compilation and optimization platform, should help in the adaptation of PIPS to such an evolution. To reach this goal, we introduce an internal intermediate representation (IR) to represent parallel programs in PIPS. The choice of a proper parallel IR is of key importance, since the efficiency and power of the transformations and optimizations the compiler PIPS can perform are closely related to the selection of a proper program representation paradigm. This seems to suggest the introduction of a dedicated IR for each parallel programming paradigm. Yet it would be better, from a software engineering point of view, to find a unique parallel IR, as general and simple as possible.

Existing proposals for program representation techniques already provide a basis for the exploitation of parallelism via the encoding of control and/or data flow information. HPIR [111], PLASMA [89] or InsPIRe [57] are instances that operate at a high abstraction level, while the hierarchical task, stream or program dependence graphs (we survey these notions in Section 4.5) are better suited to graph-based approaches. Unfortunately, many existing compiler frameworks, such as PIPS, use traditional representations for sequential-only programs, and changing their internal data structures to deal with parallel constructs is a difficult and time-consuming task.

In this thesis, we choose to develop a methodology for the design of parallel extensions of the intermediate representations used in compilation frameworks of sequential languages in general, instead of developing a parallel intermediate representation for PIPS only. The main motivation behind the design of the methodology introduced in this chapter is to preserve the many years of development efforts invested in huge compiler platforms such as GCC (more than 7 million lines of code), PIPS (600 000 lines of code) and LLVM (more than 1 million lines of code) when upgrading their intermediate representations to handle parallel languages, as source languages or as targets for source-to-source transformations. We provide an


evolutionary path for these large software developments via the introduction of the Sequential to Parallel Intermediate Representation Extension (SPIRE) methodology. We show that it can be plugged into existing compilers in a rather simple manner. SPIRE is based on only three key concepts: (1) the parallel vs. sequential execution of groups of statements, such as sequences, loops and general control-flow graphs, (2) the global synchronization characteristics of statements and the specification of finer grain synchronization via the notion of events, and (3) the handling of data distribution for different memory models. To illustrate how this approach can be used in practice, we use SPIRE to extend the intermediate representation (IR) [32] of PIPS.

The design of SPIRE is the result of many trade-offs between generality and precision, abstraction and low-level concerns. On the one hand, and in particular when looking at source-to-source optimizing compiler platforms adapted to multiple source languages, one needs to (1) represent as many of the existing (and hopefully future) parallel constructs as possible, (2) minimize the number of additional built-in functions that hamper compiler optimization passes, and (3) preserve high-level structured parallel constructs, to keep their properties such as deadlock freedom, while minimizing the number of new concepts introduced in the parallel IR. On the other hand, keeping only a limited number of hardware-level notions in the IR, while good enough to deal with all parallel constructs, would entail convoluted rewritings of high-level parallel flows. We used an extensive survey of key parallel languages, namely Cilk, Chapel, X10, Habanero-Java, OpenMP, OpenCL and MPI, to guide our design of SPIRE, while showing how to express their relevant parallel constructs within SPIRE.

The remainder of this chapter is structured as follows. Our parallel extension proposal, SPIRE, is introduced in Section 4.2, where we illustrate its application to the PIPS sequential IR presented in Section 2.6.5; we also show how simple illustrative examples written in OpenCL, Cilk, OpenMP, X10, Habanero-Java, MPI and Chapel can be easily represented within SPIRE(PIPS IR)¹. The formal operational semantics of SPIRE is given in Section 4.3. Section 4.4 shows the generality of the SPIRE methodology by showing its use on LLVM. We survey existing parallel IRs in Section 4.5. We conclude in Section 4.6.

4.2 SPIRE: a Sequential to Parallel IR Extension

In this section, we present in detail the SPIRE methodology, which is used to add parallel concepts to sequential IRs. After introducing our design philosophy, we describe the application of SPIRE to the PIPS IR presented in Section 2.6.5. We illustrate these SPIRE-derived constructs with code

¹ SPIRE(PIPS IR) means that the SPIRE transformation function is applied to the sequential IR of PIPS; it yields its parallel IR.


excerpts from various parallel programming languages; our intent is not to provide here general rewriting techniques from these to SPIRE, but to provide hints on how such rewritings might possibly proceed. Note that, in Section 4.4, using LLVM, we show that our methodology is general enough to be adapted to other IRs.

4.2.1 Design Approach

SPIRE intends to be a practical methodology to extend existing sequential IRs to parallelism constructs, either to generate parallel code from sequential programs or to compile explicitly parallel programming languages. Interestingly, the idea of seeing the issue of parallelism as an extension over sequential concepts is in sync with Dijkstra's view that "parallelism or concurrency are operational concepts that refer not to the program, but to its execution" [40]. If one accepts such a vision, adding parallelism extensions to existing IRs, as advocated by our approach with SPIRE, can thus, at a fundamental level, not be seen as an afterthought but as a consequence of the fundamental nature of parallelism.

Our design of SPIRE does not intend to be minimalist, but to be as seamlessly as possible integrable within actual IRs, and general enough to handle as many parallel programming constructs as possible. To be successful, our design point must provide proper trade-offs between generality, expressibility and conciseness of representation. We used an extensive survey of existing parallel languages to guide us during this design process. Table 4.1, which extends the one provided in Chapter 3, summarizes the main characteristics of seven recent and widely used parallel languages: Cilk, Chapel, X10, Habanero-Java, OpenMP, OpenCL and MPI. The main constructs used in each language to launch task and data parallel computations, perform synchronization, introduce atomic sections and transfer data in the various memory models are listed. Our main finding from this analysis is that, to be able to deal with parallel programming, one simply needs to add to a given sequential IR the ability to specify (1) the parallel execution mechanism of groups of statements, (2) the synchronization behavior of statements, and (3) the layout of data, i.e., how memory is modeled in the parallel language.

The last line of Table 4.1 summarizes the approach we propose to map these programming concepts to our parallel intermediate representation extension. SPIRE is based on the introduction of only ten key notions, collected in three groups:

• execution, via the sequential and parallel constructs;

• synchronization, via the spawn, barrier, atomic, single, signal and wait constructs;


Cilk (MIT)
  Parallelism: none; task creation: spawn; task join: sync; point-to-point: none; atomic section: cilk_lock; memory model: Shared; data distribution: none.

Chapel (Cray)
  Parallelism: forall, coforall, cobegin; task creation: begin; task join: sync; point-to-point: sync; atomic section: sync, atomic; memory model: PGAS (Locales); data distribution: (on).

X10 (IBM)
  Parallelism: foreach; task creation: async, future; task join: finish, force; point-to-point: next; atomic section: atomic; memory model: PGAS (Places); data distribution: (at).

Habanero-Java (Rice)
  Parallelism: foreach; task creation: async, future; task join: finish, get; point-to-point: next; atomic section: atomic, isolated; memory model: PGAS (Places); data distribution: (at).

OpenMP
  Parallelism: omp for, omp sections; task creation: omp task, omp section; task join: omp taskwait, omp barrier; point-to-point: none; atomic section: omp critical, omp atomic; memory model: Shared; data distribution: private, shared.

OpenCL
  Parallelism: EnqueueNDRangeKernel; task creation: EnqueueTask; task join: Finish, EnqueueBarrier; point-to-point: events; atomic section: atom_add; memory model: Distributed; data distribution: ReadBuffer, WriteBuffer.

MPI
  Parallelism: MPI_Init; task creation: MPI_spawn; task join: MPI_Finalize, MPI_Barrier; point-to-point: none; atomic section: none; memory model: Distributed; data distribution: MPI_Send, MPI_Recv.

SPIRE
  Parallelism: sequential, parallel; task creation: spawn; task join: barrier; point-to-point: signal, wait; atomic section: atomic; memory model: Shared, Distributed; data distribution: send, recv.

Table 4.1: Mapping of SPIRE to parallel languages constructs (terms in parentheses are not currently handled by SPIRE)

• data distribution, via the send and recv constructs.

Small code snippets are provided below to sketch how the key constructs of these parallel languages can be encoded in practice within a SPIRE-extended parallel IR.

4.2.2 Execution

The issue of parallel vs. sequential execution appears when dealing with groups of statements, which in our case study correspond to members of the forloop, sequence and unstructured sets presented in Section 2.6.5. To apply SPIRE to the PIPS sequential IR, an execution attribute is added to these sequential set definitions:

forloop'      = forloop x execution

sequence'     = sequence x execution

unstructured' = unstructured x execution

The primed sets forloop' (expressing data parallelism), and sequence'

and unstructured' (implementing control parallelism), represent SPIREd-up sets for the PIPS parallel IR. Of course, the 'prime' notation is used here for pedagogical purposes only; in practice, an execution field is added to the existing IR representation. The definition of execution is straightforward:


execution = sequential:unit + parallel:unit

where unit denotes a set with one single element; this encodes a simple enumeration of cases for execution. A parallel execution attribute asks for all loop iterations, sequence statements and control nodes of unstructured instructions to be run concurrently. Note that there is no implicit barrier after the termination of these parallel statements; in Section 4.2.3, we introduce the annotation barrier that can be used to add an explicit barrier on these parallel statements, if need be.

For instance, a parallel execution construct can be used to represent data parallelism on GPUs, when expressed via the OpenCL

clEnqueueNDRangeKernel function (see the top of Figure 4.1). This function call could be encoded within the PIPS parallel IR as a parallel loop (see the bottom of Figure 4.1), each iteration executing the kernel function as a separate task, receiving the proper index value as an argument.

/* Execute 'n' kernels in parallel */
global_work_size[0] = n;
err = clEnqueueNDRangeKernel(cmd_queue, kernel,
                             1, NULL, global_work_size,
                             NULL, 0, NULL, NULL);

forloop(I, 1,
        global_work_size,
        1,
        kernel(),
        parallel)

Figure 4.1: OpenCL example illustrating a parallel loop

Another example, on the left side of Figure 4.2, from Chapel, illustrates its forall data parallelism construct, which will be encoded with a SPIRE parallel loop.

forall i in 1..n do
  t[i] = 0;

forloop(i, 1, n, 1,
        t[i] = 0, parallel)

Figure 4.2: forall in Chapel and its SPIRE core language representation

The main difference between a parallel sequence and a parallel unstructured is that, in a parallel unstructured, synchronizations are explicit, using arcs that represent dependences between statements. An example of an unstructured code is illustrated in Figure 4.3, where the graph on the right side is a parallel unstructured representation of the code on the left side. Nodes are statements and edges are data dependences between them; for


example, the statement u=x+1 cannot be launched before the statement x=3 is terminated. Figure 4.4, from OpenMP, illustrates its parallel sections

task parallelism construct, which will be encoded with a SPIRE parallel sequence. We can say that a parallel sequence is only suitable for fork-join parallelism, whereas any other, unstructured form of parallelism can be encoded using the control flow graph (unstructured) set.

x = 3;      // statement 1
y = 5;      // statement 2
z = x + y;  // statement 3
u = x + 1;  // statement 4

[Control flow graph: a nop entry node precedes x=3 and y=5, which may run in parallel; z=x+y depends on both x=3 and y=5, u=x+1 depends on x=3; both lead to a nop exit node.]

Figure 4.3: A C code and its unstructured parallel control flow graph representation

#pragma omp sections nowait
{
  #pragma omp section
  x = foo();
  #pragma omp section
  y = bar();
}
z = baz(x, y);

sequence(
  sequence(x = foo(), y = bar(),
           parallel),
  z = baz(x, y),
  sequential)

Figure 4.4: parallel sections in OpenMP and its SPIRE core language representation

Representing the presence of parallelism using only the parallel annotation constitutes a voluntary trade-off between conciseness and expressibility of representation. For example, representing the OpenCL code of Figure 4.1 in the way we suggest may induce a loss of precision regarding the use of the queues of tasks adopted in OpenCL, and lead to errors; great care must be taken when analyzing programs to avoid errors in translation between OpenCL and SPIRE, such as in the scheduling of these tasks.


4.2.3 Synchronization

The issue of synchronization is a characteristic feature of the run-time behavior of one statement with respect to other statements. In parallel code, one usually distinguishes between two types of synchronization: (1) collective synchronization between threads using barriers, and (2) point-to-point synchronization between participating threads. We suggest this can be done in two parts.

Collective Synchronization

The first possibility is via a new synchronization attribute, which can be added to the specification of statements: SPIRE extends sequential intermediate representations in a straightforward way by adding a synchronization attribute to the specification of statements:

statement' = statement x synchronization

Coordination by synchronization in parallel programs is often dealt with via coding patterns such as barriers, used for instance when a code fragment contains many phases of parallel execution, where each phase should wait for the precedent ones to proceed. We define the synchronization set via high-level coordination characteristics, useful for optimization purposes:

synchronization =
    none:unit + spawn:entity + barrier:unit +
    single:bool + atomic:reference

where S is the statement with the synchronization attribute:

• none specifies the default behavior, i.e., independent with respect to other statements, for S;

• spawn induces the creation of an asynchronous task S, while the value of the corresponding entity is the user-chosen number of the thread that executes S. SPIRE thus, in addition to decorating parallel tasks using the parallel execution attribute to handle coarse grain parallelism, provides the possibility of specifying each task via the spawn

synchronization attribute, in order to enable the encoding of a finer grain level of parallelism;

• barrier specifies that all the child threads spawned by the execution of S are suspended before exiting, until they are all finished; an OpenCL example, illustrating spawn to encode the clEnqueueTask instruction and barrier to encode the clEnqueueBarrier instruction, is provided in Figure 4.5;


mode = OUT_OF_ORDER_EXEC_MODE_ENABLE;
commands = clCreateCommandQueue(context,
                                device_id, mode, &err);
clEnqueueTask(commands, kernel_A, 0, NULL, NULL);
clEnqueueTask(commands, kernel_B, 0, NULL, NULL);
/* synchronize so that Kernel C starts only
   after Kernels A and B have finished */
clEnqueueBarrier(commands);
clEnqueueTask(commands, kernel_C, 0, NULL, NULL);

barrier(
  spawn(zero, kernel_A()),
  spawn(one, kernel_B())
)
spawn(zero, kernel_C())

Figure 4.5: OpenCL example illustrating spawn and barrier statements

• single ensures that S is executed by only one thread in its thread team (a thread team is the set of all the threads spawned within the innermost parallel forloop statement); a barrier exists at the end of a single operation if its synchronization single value is true;

• atomic predicates the execution of S via the acquisition of a lock, to ensure exclusive access: at any given time, S can be executed by only one thread. Locks are logical memory addresses, represented here by a member of the PIPS IR reference set presented in Section 2.6.5. An example illustrating how an atomic synchronization on the reference l

in a SPIRE statement accessing Array x can be translated in Cilk (via Cilk_lock and Cilk_unlock) and in OpenMP (critical) is provided in Figure 4.6.

Cilk_lockvar l;
Cilk_lock_init(l);
Cilk_lock(l);
x[index[i]] += f(i);
Cilk_unlock(l);

#pragma omp critical
x[index[i]] += f(i);

Figure 4.6: Cilk and OpenMP examples illustrating an atomically-synchronized statement


Event API: Point-to-Point Synchronization

The second possibility addresses the case of point-to-point synchronization between participating threads. Handling point-to-point synchronization using decorations on abstract syntax trees is too constraining when one has to deal with a varying set of threads that may belong to different parallel parent nodes. Thus, SPIRE suggests to deal with this last class of coordination using a new class of values, of the event type: SPIRE extends the underlying type system of the existing sequential IRs with a new basic type, namely event:

type' = type + event:unit

Values of type event are counters, in a manner reminiscent of semaphores. The programming interface for events is defined by the following atomic functions (a short usage sketch follows the list):

• event newEvent(int i) is the creation function of events, initialized with the integer i, which specifies how many threads can execute wait

on this event without being blocked;

• void freeEvent(event e) is the deallocation function of the event e;

• void signal(event e) increments by one the event value² of e;

• void wait(event e) blocks the thread that calls it until the value of e is strictly greater than 0. When the thread is released, this value is decremented by one.
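As a minimal usage sketch of this API, written in the SPIRE core language notation of the figures below (the task numbers and the statements S_p and S_c are ours), a producer task can release a consumer task as follows:

e = newEvent(0);                 /* a wait(e) blocks until a signal(e) occurs */
barrier(
  spawn(zero,                    /* producer task                             */
        S_p,                     /* produce the data                          */
        signal(e)),              /* release the consumer                      */
  spawn(one,                     /* consumer task                             */
        wait(e),                 /* block until the producer has signaled     */
        S_c)                     /* consume the data                          */
)
freeEvent(e)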

In a first example of possible use of this event API, the construct future used in X10 (see Figure 4.7) can be seen as the spawning of the computation of foo(). The end result is obtained via the call to the force method; such a mechanism can be easily implemented in SPIRE using an event attached to the running task: it is signaled when the task is completed and waited for by the force method.

future<int> Fi = future { foo() };
int i = Fi.force();

Figure 4.7: X10 example illustrating a future task and its synchronization

A second example, taken from Habanero-Java, illustrates how point-to-point synchronization primitives such as phasers and the next statement can be dealt with using the event API (see the left side of Figure 4.8). The async phased

² The void return type will be replaced by int in practice, to enable the handling of error values.


keyword can be replaced by spawn. In this example, the next statement is equivalent to the following sequence:

signal(ph)

wait(ph)

signal(ph)

where the event ph is supposed to be initialized to newEvent(-(n-1)); the second signal is used to resume the suspended tasks in a chain-like fashion.

// Left: Habanero-Java
finish {
  phaser ph = new phaser();
  for (j = 1; j <= n; j++)
    async phased(ph<SIG_WAIT>) {
      S;
      next;
      S';
    }
}

// Right: SPIRE core language
barrier(
  ph = newEvent(-(n-1)),
  j = 1,
  loop(j <= n,
       spawn(j,
             S,
             signal(ph),
             wait(ph),
             signal(ph),
             S'),
       j = j+1),
  freeEvent(ph)
)

Figure 4.8: A phaser in Habanero-Java and its SPIRE core language representation

In order to illustrate the importance of the constructs implementing point-to-point synchronization, and thus the necessity to handle them, we show the code in Figure 4.9, extracted from [97], where phasers are used to implement one of the main types of parallelism, namely pipeline parallelism (see Section 2.2).

finish {
  phaser[] ph = new phaser[m+1];
  for (int i = 1; i < m; i++)
    async phased(ph[i]<SIG>, ph[i-1]<WAIT>) {
      for (int j = 1; j < n; j++) {
        a[i][j] = foo(a[i][j], a[i][j-1], a[i-1][j-1]);
        next;
      }
    }
}

Figure 4.9: Example of pipeline parallelism with phasers


Representing high-level point-to-point synchronization, such as phasers and futures, using low-level functions of event type constitutes a trade-off between generality (we can represent any type of synchronization via the low-level event interface) and expressibility of representation (we lose some of the expressiveness and precision of these high-level constructs, such as the deadlock-freedom safety property of phasers [100]). Indeed, a Habanero-Java parallel code avoids deadlock since the scope of each used phaser is the immediately enclosing finish. However, the scope of events is less restrictive: the user can put them wherever he wants, and this may lead to the appearance of deadlocks.

4.2.4 Data Distribution

The choice of a proper memory model to express parallel programs is an important issue when designing a generic intermediate representation. There are usually two main approaches to memory modeling: shared and message passing models. Since SPIRE is designed to extend existing IRs for sequential languages, it can be straightforwardly seen as using a shared memory model when parallel constructs are added. By convention, we say that spawn creates processes in the case of message passing memory models, and threads in the other case.

In order to take into account the explicit distribution required by the message passing memory model used in parallel languages such as MPI, SPIRE introduces the send and recv blocking functions for implementing communication between processes:

• void send(int dest, entity buf) transfers the value in Entity buf to the process numbered dest;

• void recv(int source, entity buf) receives in buf the value sent by Process source.

The MPI example in Figure 4.10 can be represented in SPIRE as a parallel loop with index rank of size iterations, whose body is the MPI code from MPI_Comm_size to MPI_Finalize. The communication of Variable sum from Process 1 to Process 0 can be handled with the SPIRE send/recv functions.

Note that non-blocking communications can be easily implemented in SPIRE using the above primitives within spawned statements (see Figures 4.11 and 4.12).

A non-blocking receive should indicate the completion of the communication. We may use, for instance, finish_recv (see Figure 4.12), which can be accessed later by the receiver to test the status of the communication.

Also, broadcast collective communications, such as defined in MPI, can be represented via wrappers around send and recv operations.


MPI_Init(&argc, &argv[]);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if (rank == 0) {
  MPI_Recv(sum, sizeof(sum), MPI_FLOAT, 1, 1,
           MPI_COMM_WORLD, &stat);
}
else if (rank == 1) {
  sum = 42;
  MPI_Send(sum, sizeof(sum), MPI_FLOAT, 0, 1,
           MPI_COMM_WORLD);
}
MPI_Finalize();

forloop(rank, 0, size, 1,
   if(rank==0,
      recv(one, sum),
      if(rank==1,
         sum = 42;
         send(zero, sum),
         nop)
   ),
   parallel)

Figure 4.10: MPI example illustrating a communication and its SPIRE core language representation

spawn(new, send(zero, sum))

Figure 4.11: SPIRE core language representation of a non-blocking send

atomic(finish_recv = false);
spawn(new,
      recv(one, sum);
      atomic(finish_recv = true)
)

Figure 4.12: SPIRE core language representation of a non-blocking receive

When the master process and receiver processes want to perform a broadcast operation, then, if this process is the master, its broadcast operation is equivalent to a loop over the receivers, with a call to send as body; otherwise (receiver), the broadcast is a recv function. Figure 4.13 illustrates the SPIRE representation of a broadcast collective communication of the value of sum.

Another interesting memory model for parallel programming has been introduced somewhat recently: the Partitioned Global Address Space [110].


if(rank==0,
   forloop(rank, 1, size, 1,
           send(rank, sum),
           parallel),
   recv(zero, sum))

Figure 4.13: SPIRE core language representation of a broadcast
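As an illustration of this wrapper scheme, a broadcast built only from point-to-point MPI operations could be sketched as follows; the function name my_broadcast and the message tag are placeholders, and this mirrors the structure of Figure 4.13 rather than the actual SPIRE-to-MPI code generation.

    #include <mpi.h>

    /* Sketch: broadcast of one float via point-to-point operations only. */
    void my_broadcast(float *sum, int root, MPI_Comm comm) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        if (rank == root) {
            /* master: loop over receivers, one send per process */
            for (int dest = 0; dest < size; dest++)
                if (dest != root)
                    MPI_Send(sum, 1, MPI_FLOAT, dest, 0, comm);
        } else {
            /* receiver: a single matching recv from the master */
            MPI_Recv(sum, 1, MPI_FLOAT, root, 0, comm, MPI_STATUS_IGNORE);
        }
    }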

The uses of the PGAS memory model in languages such as UPC [33], Habanero-Java, X10 and Chapel introduce various notions such as Place or Locale to label portions of a logically-shared memory that threads may access, in addition to complex APIs for distributing objects/arrays over these portions. Given the wide variety of current proposals, we leave the issue of integrating the PGAS model within the general methodology of SPIRE as future work.

4.3 SPIRE Operational Semantics

The purpose of the formal definition given in this section is to provide a solid basis for program analyses and transformations. It is a systematic way to specify our IR extension mechanism, something seldom present in IR definitions. It also illustrates how SPIRE leverages the syntactic and semantic levels of sequential constructs to yield parallel ones, preserving the traits of the sequential operational semantics.

Fundamentally, at the syntactic and semantic levels, SPIRE is a methodology for expressing representation transformers, mapping the definition of a sequential language IR to a parallel version. We define the operational semantics of SPIRE in a two-step fashion: we introduce (1) a minimal core parallel language that we use to model fundamental SPIRE concepts and for which we provide a small-step operational semantics, and (2) rewriting rules that translate the more complex constructs of SPIRE into this core language.

4.3.1 Sequential Core Language

Illustrating the transformations induced by SPIRE requires the definition of a sequential IR basis, as is done above via the PIPS IR. Since we focus here on the fundamentals, we use as core language a simpler, minimal sequential language, Stmt. Its syntax is given in Figure 4.14, where we assume that the sets Ide of identifiers I and Exp of expressions E are given.

Sequential statements are: (1) nop, for no operation, (2) I=E, for an assignment of E to I, (3) S1;S2, for a sequence, and (4) loop(E,S), for a while loop.


S ∈ Stmt =
    nop | I=E | S1;S2 | loop(E,S)

S ∈ SPIRE(Stmt) =
    nop | I=E | S1;S2 | loop(E,S) |
    spawn(I,S) |
    barrier(S) | barrier_wait(n) |
    wait(I) | signal(I) |
    send(I,I') | recv(I,I')

Figure 4.14: Stmt and SPIRE(Stmt) syntaxes

At the semantic level, a statement in Stmt is a very simple memory transformer. A memory m ∈ Memory is a mapping in Ide → Value, where values v ∈ Value = N + Bool can either be integers n ∈ N or booleans b ∈ Bool. The sequential operational semantics for Stmt, expressed as transition rules over configurations κ ∈ Configuration = Memory × Stmt, is given in Figure 4.15; we assume that the program is syntax- and type-correct. A transition (m, S) → (m', S') means that executing the statement S in a memory m yields a new memory m' and a new statement S'; we posit that the "→" relation is transitive. Rules 4.1 to 4.5 encode typical sequential small-step operational semantic rules for the sequential part of the core language. We assume that ζ ∈ Exp → Memory → Value is the usual function for expression evaluation.

$$\frac{v = \zeta(E)\,m}{(m,\ I{=}E) \rightarrow (m[I \rightarrow v],\ \mathtt{nop})} \quad (4.1)$$

$$(m,\ \mathtt{nop};S) \rightarrow (m,\ S) \quad (4.2)$$

$$\frac{(m,\ S_1) \rightarrow (m',\ S'_1)}{(m,\ S_1;S_2) \rightarrow (m',\ S'_1;S_2)} \quad (4.3)$$

$$\frac{\zeta(E)\,m}{(m,\ \mathtt{loop}(E,S)) \rightarrow (m,\ S;\mathtt{loop}(E,S))} \quad (4.4)$$

$$\frac{\neg\,\zeta(E)\,m}{(m,\ \mathtt{loop}(E,S)) \rightarrow (m,\ \mathtt{nop})} \quad (4.5)$$

Figure 4.15: Stmt sequential transition rules


The semantics of a whole sequential program S is defined as the memory m such that (⊥, S) → (m, nop), if the execution of S terminates.

4.3.2 SPIRE as a Language Transformer

Syntax

At the syntactic level, SPIRE specifies how a grammar for a sequential language such as Stmt is transformed, i.e., extended with synchronized parallel statements. The grammar of SPIRE(Stmt) in Figure 4.14 adds to the sequential statements of Stmt (from now on synchronized using the default none) new parallel statements: a task creation spawn, a termination barrier, and two wait and signal operations on events, or send and recv operations for communication. The synchronizations single and atomic are defined via rewriting (see Subsection 4.3.3). The statement barrier_wait(n), added here for specifying the multiple-step behavior of the barrier statement in the semantics, is not accessible to the programmer. Figure 4.8 provides the SPIRE representation of a program example.

Semantic Domains

SPIRE is an intermediate representation transformer. As it extends grammars, it also extends semantics. The set of values manipulated by SPIRE(Stmt) statements extends the sequential Value domain with events e ∈ Event = N, which encode events' current values; we posit that ζ(newEvent(E))m = ζ(E)m.

Parallelism is managed in SPIRE via processes (or threads). We introduce control state functions π ∈ State = Proc → Configuration × Procs to keep track of the whole computation, mapping each process i ∈ Proc = N to its current configuration (i.e., the statement it executes and its own view of memory m) and the set c ∈ Procs = P(Proc) of the process children it has spawned during its execution.

In the following, we note domain(π) = {i ∈ Proc / π(i) is defined} the set of currently running processes, and π[i → (κ, c)] the state π extended at i with (κ, c). A process is said to be finished if and only if all its children processes in c are also finished, i.e., when only nop is left to execute:

finished(π, c) = (∀i ∈ c, ∃ci ∈ Procs, ∃mi ∈ Memory, π(i) = ((mi, nop), ci) ∧ finished(π, ci))

Memory Models

The memory model of sequential languages is a unique address space for identifier values. In our parallel extension, a configuration for a given process or thread includes its view of memory. We suggest using the same semantic rules, detailed below, to deal with both shared and message passing memory models.


The distinction between these models, besides the additional use of send/receive constructs in the message passing model versus events in the shared one, is included in SPIRE via constraints we impose on the control states π used in computations. Namely, we introduce the variable _pids ∈ Ide, such that m(_pids) ∈ P(Ide), which is used to harvest the process identities, and we posit that, in the shared memory model, for all threads i and i' with π(i) = ((m, S), c) and π(i') = ((m', S'), c'), one has:

∀I ∈ (domain(m) ∩ domain(m')) − (m(_pids) ∪ m'(_pids)), m(I) = m'(I)

No such constraint is needed for the message passing model. Regarding the notion of memory equality, note that the issue of private variables in threads would have to be introduced in full-fledged languages. As mentioned above, PGAS is left for future work; some sort of constraints, based on the characteristics of the address space partitioning for places/locales, would have to be introduced.

Semantic Rules

At the semantic level, SPIRE is thus a transition system transformer, mapping rules such as the ones in Figure 4.15 to the parallel, synchronized transition rules in Figure 4.16. A transition π[i → ((m, S), c)] → π'[i → ((m', S'), c')] means that the i-th process, when executing S in a memory m, yields a new memory m' and a new control state π'[i → ((m', S'), c')] in which this process will now execute S'; additional children processes may have been created in c' compared to c. We posit that the "→" relation is transitive.

Rule 4.6 is a key rule to specify SPIRE transformer behavior, providing a bridge between the sequential and the SPIRE-extended parallel semantics: all processes can non-deterministically proceed along their sequential semantics "→", leading to valid evaluation steps along the parallel semantics. The interleaving between parallel processes in SPIRE(Stmt) is a consequence of (1) the non-deterministic choice of the value of i within domain(π) when selecting the transition to perform, and (2) the number of steps executed by the sequential semantics. Note that one might want to add even more non-determinism in our semantics: indeed, Rule 4.1 is atomic; loading the variables in E and performing the store operation to I are performed in one sequential step. To simplify this presentation, we do not provide the simple intermediate steps in the sequential evaluation semantics of Rule 4.1 that would have removed this artificial atomicity.

The remaining rules focus on parallel evaluation. In Rule 4.7, if Process n does not already exist, spawn adds to the state a new process n that inherits the parent memory m in a fork-like manner; the set of processes spawned by n is initially equal to ∅. Otherwise, n keeps its memory m_n and its set of spawned processes c_n. In both cases, the value of _pids in the memory of n is updated by the identifier I; n executes S and is added to the set of processes c spawned by i. Rule 4.8 implements a rendezvous:


$$\frac{\kappa \rightarrow \kappa'}{\pi[i \rightarrow (\kappa,\,c)] \rightarrow \pi[i \rightarrow (\kappa',\,c)]} \quad (4.6)$$

$$\frac{\begin{array}{c} n = \zeta(I)\,m \qquad ((m_n,S_n),c_n) = (n \in domain(\pi))\ ?\ \pi(n)\ :\ ((m,S),\emptyset) \\ m'_n = m_n[\_pids \rightarrow m_n(\_pids) \cup \{I\}] \end{array}}{\pi[i \rightarrow ((m,\,\mathtt{spawn}(I,S)),\,c)] \rightarrow \pi[i \rightarrow ((m,\,\mathtt{nop}),\,c \cup \{n\})][n \rightarrow ((m'_n,\,S),\,c_n)]} \quad (4.7)$$

$$\frac{n \notin domain(\pi) \cup \{i\}}{\pi[i \rightarrow ((m,\,\mathtt{barrier}(S)),\,c)] \rightarrow \pi[i \rightarrow ((m,\,\mathtt{barrier\_wait}(n)),\,c)][n \rightarrow ((m,\,S),\,\emptyset)]} \quad (4.8)$$

$$\frac{finished(\pi,\{n\}) \qquad \pi(n) = ((m',\,\mathtt{nop}),\,c')}{\pi[i \rightarrow ((m,\,\mathtt{barrier\_wait}(n)),\,c)] \rightarrow \pi[i \rightarrow ((m',\,\mathtt{nop}),\,c)]} \quad (4.9)$$

$$\frac{n = \zeta(I)\,m \qquad n > 0}{\pi[i \rightarrow ((m,\,\mathtt{wait}(I)),\,c)] \rightarrow \pi[i \rightarrow ((m[I \rightarrow n-1],\,\mathtt{nop}),\,c)]} \quad (4.10)$$

$$\frac{n = \zeta(I)\,m}{\pi[i \rightarrow ((m,\,\mathtt{signal}(I)),\,c)] \rightarrow \pi[i \rightarrow ((m[I \rightarrow n+1],\,\mathtt{nop}),\,c)]} \quad (4.11)$$

$$\frac{p' = \zeta(P')\,m \qquad p = \zeta(P)\,m'}{\pi[p \rightarrow ((m,\,\mathtt{send}(P',I)),\,c)][p' \rightarrow ((m',\,\mathtt{recv}(P,I')),\,c')] \rightarrow \pi[p \rightarrow ((m,\,\mathtt{nop}),\,c)][p' \rightarrow ((m'[I' \rightarrow m(I)],\,\mathtt{nop}),\,c')]} \quad (4.12)$$

Figure 4.16: SPIRE(Stmt) synchronized transition rules

a new process n executes S, while process i is suspended as long as finished is not true; indeed, Rule 4.9 resumes execution of process i when all the child processes spawned by n have finished. In Rules 4.10 and 4.11, I is an event, that is a counting variable, used to control access to a resource or to perform a point-to-point synchronization, initialized via newEvent to a value equal to the number of processes that will be granted access to it. Its current value n is decremented every time a wait(I) statement is executed and, when n > 0, the resource can be used or the barrier can be crossed.


In Rule 4.11, the current value of I is incremented; this is a non-blocking operation. In Rule 4.12, p and p' are two processes that communicate: p sends the datum I to p', while the latter consumes it in I'.

The semantics of a whole parallel program S is defined as the set of memories m such that ⊥[0 → ((⊥, S), ∅)] → π[0 → ((m, nop), c)], if S terminates.

4.3.3 Rewriting Rules

The SPIRE concepts not dealt with in the previous section are defined via their rewriting into the core language. This is the case for both the treatment of the execution attribute and the remaining coarse-grain synchronization constructs.

Execution

A parallel sequence of statements S1 and S2 is a pair of independent substatements executed simultaneously by spawned processes I1 and I2, respectively, i.e., it is equivalent to:

spawn(I1, S1); spawn(I2, S2)

A parallel forloop (see Figure 4.2), with index I, lower expression low, upper expression up, step expression step and body S, is equivalent to:

I = low; loop(I <= up, spawn(I, S); I = I+step)
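For intuition only, a shared-memory realization of this rewriting can be sketched in C by spawning one thread per iteration; body stands for the loop body S, and the joins at the end are only there so that the sketch terminates cleanly (they are not part of the rewriting itself).

    #include <pthread.h>
    #include <stdint.h>
    #include <stdlib.h>

    extern void body(long i);              /* placeholder for the loop body S */

    static void *iteration(void *arg) {    /* one spawned process per iteration */
        body((long)(intptr_t)arg);
        return NULL;
    }

    /* forloop(I, low, up, step, S, parallel): one thread per iteration */
    void parallel_forloop(long low, long up, long step) {
        long n = (up >= low) ? (up - low) / step + 1 : 0;
        pthread_t *t = malloc(n * sizeof(pthread_t));
        long k = 0;
        for (long i = low; i <= up; i += step)       /* I = low; loop(I <= up, ...) */
            pthread_create(&t[k++], NULL, iteration, (void *)(intptr_t)i);
        for (long j = 0; j < n; j++)                 /* only for the sketch's cleanup */
            pthread_join(t[j], NULL);
        free(t);
    }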

A parallel unstructured is rewritten as follows. All control nodes present in the transitive closure of the successor relation are rewritten in the same manner. Each control node C is characterized by a statement S, a predecessor list ps and a successor list ss. For each edge (c, C), where c is a predecessor of C in ps, an event e_cC, initialized at newEvent(0), is created, and similarly for ss. The whole unstructured construct is replaced by a sequential sequence of spawn(I, S_C), one for each C of the transitive closure of the successor relation, starting at the entry control node, where S_C is defined as follows:

barrier(spawn(one, wait(e_ps[1]C));
        ...
        spawn(m, wait(e_ps[m]C)));
S;
signal(e_Css[1]); ...; signal(e_Css[m'])

where m and m' are the lengths of the ps and ss lists; L[j] is the j-th element of L.

The code corresponding to the statement z=x+y, taken from the parallel unstructured graph shown in Figure 4.3, is rewritten as:

spawn(three,
      barrier(spawn(one, wait(e_13));
              spawn(two, wait(e_23)));
      z = x + y;                  // statement 3
      signal(e_3exit)
)

Synchronization

A statement S with synchronization atomic(I) is rewritten as:

wait(I); S; signal(I)

assuming that the assignment I = newEvent(1) is performed on the event identifier I at the very beginning of the main function. A wait on an event variable sets it to zero if it is currently equal to one, to prohibit other threads from entering the atomic section; the signal resets the event variable to one to permit further access.
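Using the hypothetical C event implementation sketched earlier, this rewriting amounts to protecting a critical section with an event initialized to 1, for instance:

    /* Sketch: atomic(I) rewritten as wait(I); S; signal(I).
       counter and atomic_increment are placeholders, not part of SPIRE. */
    static event I;            /* initialized once: I = newEvent(1); */
    static int counter = 0;

    void atomic_increment(void) {
        wait_event(I);         /* takes the event from 1 to 0: section closed */
        counter = counter + 1; /* S: at most one thread executes this at a time */
        signal_event(I);       /* back to 1: another thread may enter */
    }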

A statement S with a blocking synchronization single, i.e., equal to true, is equivalent, when it occurs within an enclosing innermost parallel forloop, to:

barrier(wait(I_S);
        if(first_S,
           S; first_S = false,
           nop);
        signal(I_S))

where first_S is a boolean variable that ensures that only one process among those spawned by the parallel loop will execute S; access to this variable is protected by the event I_S. Both first_S and I_S are respectively initialized, before loop entry, to true and newEvent(1).

The conditional if(E,S,S') can be rewritten using the core loop construct, as illustrated in Figure 4.17. The same rewriting can be used when the single synchronization is equal to false, corresponding to a non-blocking synchronization construct, except that no barrier is needed.

if(E,S,S')        stop = false;
                  loop(E ∧ ¬stop,
                       S; stop = true);
                  loop(¬E ∧ ¬stop,
                       S'; stop = true)

Figure 4.17: if statement rewriting using while loops

4.4 Validation: SPIRE Application to LLVM IR

Assessing the quality of a methodology that impacts the definition of a data structure as central for compilation frameworks as an intermediate representation is a difficult task. This section illustrates how it can be used on a different IR, namely that of LLVM, with minimal changes, thus providing support regarding the generality of our methodology. We show how simple it is to upgrade the LLVM IR to a "parallel LLVM IR" via the SPIRE transformation methodology.

LLVM [75] (Low-Level Virtual Machine) is an open-source compilation framework that uses an intermediate representation in Static Single Assignment (SSA) [37] form. We chose the IR of LLVM to illustrate our approach a second time because LLVM is widely used in both academia and industry; another interesting feature of the LLVM IR, compared to PIPS's, is that it sports a graph approach, while PIPS is mostly abstract syntax tree-based: each function is structured in LLVM as a control flow graph (CFG).

Figure 4.18 provides the definition of a significant subset of the sequential LLVM IR described in [75] (to keep notations simple in this thesis, we use the same Newgen language to write this specification):

• a function is a list of basic blocks, which are portions of code with one entry and one exit point;

• a basic block has an entry label, a list of φ nodes and a list of instructions, and ends with a terminator instruction;

• φ nodes, which are the key elements of SSA, are used to merge the values coming from multiple basic blocks. A φ node is an assignment (represented here as a call expression) that takes as arguments an identifier and a list of pairs (value, label); it assigns to the identifier the value corresponding to the label of the block preceding the current one at run time;

• every basic block ends with a terminator, which is a control flow-altering instruction that specifies which block to execute after termination of the current one.

An example of the LLVM encoding of a simple for loop in three basic blocks is provided in Figure 4.19, where, in the first assignment, which uses a φ node, sum is 42 if it is the first time we enter the bb1 block (from entry), or sum otherwise (from the branch bb).

Applying SPIRE to the LLVM IR is, as illustrated above with PIPS, achieved in three steps, yielding the SPIREd parallel extension of the LLVM sequential IR provided in Figure 4.20:

• an execution attribute is added to function and block: a parallel basic block sees all its instructions launched in parallel, while all the blocks of a parallel function are seen as parallel tasks to be executed concurrently;


function              = blocks:block* ;
block                 = label:entity x phi_nodes:phi_node* x
                        instructions:instruction* x
                        terminator ;
phi_node              = call ;
instruction           = call ;
terminator            = conditional_branch +
                        unconditional_branch +
                        return ;
conditional_branch    = value:entity x label_true:entity x
                        label_false:entity ;
unconditional_branch  = label:entity ;
return                = value:entity ;

Figure 4.18: Simplified Newgen definitions of the LLVM IR

sum = 42;
for(i=0; i<10; i++)
   sum = sum + 2;

entry:
   br label %bb1

bb:                       ; preds = %bb1
   %0 = add nsw i32 %sum0, 2
   %1 = add nsw i32 %i0, 1
   br label %bb1

bb1:                      ; preds = %bb, %entry
   %sum0 = phi i32 [ 42, %entry ], [ %0, %bb ]
   %i0 = phi i32 [ 0, %entry ], [ %1, %bb ]
   %2 = icmp sle i32 %i0, 10
   br i1 %2, label %bb, label %bb2

Figure 4.19: A loop in C and its LLVM intermediate representation

• a synchronization attribute is added to instruction; therefore an instruction can be annotated with spawn, barrier, single or atomic synchronization attributes. When one wants to consider a sequence of instructions as a whole, this sequence is first outlined in a "function", to be called instead; this new call instruction is then annotated with the proper synchronization attribute, such as spawn if the sequence must be considered as an asynchronous task; we provide an example in Figure 4.21. A similar technique is used for the other synchronization constructs barrier, single and atomic;

• as LLVM provides a set of intrinsic functions [75], the SPIRE functions newEvent, freeEvent, signal and wait, for handling point-to-point synchronization, and send and recv, for handling data distribution, are added to this set.

Note that the use of SPIRE on the LLVM IR is not able to express


function'     = function x execution
block'        = block x execution
instruction'  = instruction x synchronization

Intrinsic Functions:
   send, recv, signal, wait, newEvent, freeEvent

Figure 4.20: SPIRE(LLVM IR)

main() {
   int A, B, C;
   A = foo();
   B = bar(A);
   C = baz();
}

call_AB(int *A, int *B) {
   *A = foo();
   *B = bar(*A);
}

main() {
   int A, B, C;
   spawn(zero, call_AB(&A, &B));
   C = baz();
}

Figure 4.21: An example of a spawned outlined sequence

parallel loops as easily as was the case on the PIPS IR. Indeed, the notion of a loop does not always exist in the definition of intermediate representations based on control flow graphs, including LLVM; it is an attribute of some of its nodes, which has to be added later on by a loop-detection program analysis phase. Of course, such an analysis could also be applied on the SPIRE-derived IR to recover this information. Once one knows that a particular loop is parallel, this can be encoded within SPIRE using the same technique as presented in Section 4.3.3.

More generally, even though SPIRE uses traditional parallel paradigms for code generation purposes, SPIRE-derived IRs are able to deal with more specific parallel constructs such as DOACROSS loops [56] or HELIX-like [29] approaches. HELIX is a technique to parallelize a loop in the presence of loop-carried dependences by executing sequential segments while exploiting TLP between them and using prefetching of signal synchronization. Basically, a compiler would parse a given sequential program into sequential IR elements. Optimization compilation phases specific to particular parallel code generation paradigms, such as those above, will translate, whenever possible (specific data and control-flow analyses will be needed here), these sequential IR constructs into parallel loops, with the corresponding synchronization primitives, as need be. Code generation will then recognize such IR patterns and generate specific parallel instructions, such as DOACROSS.


4.5 Related Work: Parallel Intermediate Representations

In this section, we review several different possible representations of parallel programs, both at the high, syntactic, and mid, graph-based, levels. We provide synthetic descriptions of the key existing IRs addressing issues similar to our thesis's. Bird's-eye view comparisons with SPIRE are also given here.

Syntactic approaches to parallelism expression use abstract syntax tree nodes, while adding specific built-in functions for parallelism. For instance, the intermediate representation of the implementation of OpenMP in GCC (GOMP) [84] extends its three-address representation, GIMPLE [77]. The OpenMP parallel directives are replaced by specific built-ins in low- and high-level GIMPLE, and additional nodes in high-level GIMPLE, such as the sync_fetch_and_add built-in function for an atomic memory access addition. Similarly, Sarkar and Zhao introduce the high-level parallel intermediate representation HPIR [111] that decomposes Habanero-Java programs into region syntax trees, while maintaining additional data structures on the side: region control-flow graphs and region dictionaries that contain information about the use and the definition of variables in region nodes. New abstract syntax tree nodes are introduced: AsyncRegionEntry and AsyncRegionExit are used to delimit tasks, while FinishRegionEntry and FinishRegionExit can be used in parallel sections. SPIRE borrows some of the ideas used in GOMP or HPIR, but frames them in more structured settings while trying to be more language-neutral. In particular, we try to minimize the number of additional built-in functions, which have the drawback of hiding the abstract high-level structure of parallelism and affecting compiler optimization passes. Moreover, we focus on extending existing AST nodes rather than adding new ones (such as in HPIR), in order not to fatten the IR and to avoid redundant analyses and transformations on the same basic constructs. Applying the SPIRE approach to systems such as GCC would have provided a minimal set of extensions that could have also been used for other implementations of parallel languages that rely on GCC as a backend, such as Cilk.

PLASMA is a programming framework for heterogeneous SIMD systems, with an IR [89] that abstracts data parallelism and vector instructions. It provides specific operators such as add on vectors, and special instructions such as reduce and par. While PLASMA abstracts SIMD implementation and compilation concepts for SIMD accelerators, SPIRE is more architecture-independent and also covers control parallelism.

InsPIRe is the parallel intermediate representation at the core of the source-to-source Insieme compiler [57] for C, C++, OpenMP, MPI and OpenCL programs. Parallel constructs are encoded using built-ins. SPIRE intends to also cover source-to-source optimization. We believe it could have been applied to Insieme sequential components, parallel constructs being defined as extensions of the sequential abstract syntax tree nodes of InsPIRe, instead of using numerous built-ins.

Turning now to mid-level intermediate representations, many systems rely on graph structures for representing sequential code and extend them for parallelism. The Hierarchical Task Graph [48] represents the program control flow. The hierarchy exposes the loop nesting structure: at each loop nesting level, the loop body is hierarchically represented as a single node that embeds a subgraph with control and data dependence information associated with it. SPIRE is able to represent both structured and unstructured control-flow dependence, thus enabling recursively-defined optimization techniques to be applied easily. The hierarchical nature of underlying sequential IRs can be leveraged, via SPIRE, to their parallel extensions; this feature is used in the PIPS case addressed above.

A stream graph [31] is a dataflow representation introduced specifically for streaming languages. Nodes represent data reorganization and processing operations between streams, and edges, communications between nodes. The number of data samples defined and used by each node is supposed to be known statically. Each time a node is fired, it consumes a fixed number of elements of its inputs and produces a fixed number of elements on its outputs. SPIRE provides support for both data and control dependence information; streaming can be handled in SPIRE using its point-to-point synchronization primitives.

The parallel program graph (PPDG) [98] extends the program dependence graph [44], where vertices represent blocks of statements and edges, essential control or data dependences; mgoto control edges are added to represent task creation occurrences, and synchronization edges, to impose ordering on tasks. Kimble IR [24] uses an intermediate representation designed along the same lines, i.e., as a hierarchical directed acyclic graph (DAG) on top of the GCC IR GIMPLE. Parallelism is expressed there using new types of nodes: region, which is a subgraph of clusters, and cluster, a sequence of statements to be executed by one thread. Like PPDG and Kimble IR, SPIRE adopts an extension approach to "parallelize" existing sequential intermediate representations; this chapter showed that this can be defined as a general mechanism for parallel IR definitions, and provides a formal specification of this concept.

LLVM IR [75] represents each function as a control flow graph. To encode parallel constructs, LLVM introduces the notion of metadata, such as llvm.loop.parallel for implementing parallel loops. A metadata is a string used as an annotation on LLVM IR nodes. LLVM IR lacks support for other parallel constructs such as starting parallel threads, synchronizing them, etc. We presented in Section 4.4 our proposal for a parallel IR for LLVM via the application of SPIRE to LLVM.


4.6 Conclusion

SPIRE is a new and general 3-step extension methodology for mapping any intermediate representation (IR) used in compilation platforms for representing sequential programming constructs to a parallel IR; one can leverage it for the source-to-source and high- to mid-level optimization of control-parallel languages and constructs.

The extension of an existing IR introduces (1) a parallel execution attribute for each group of statements, (2) a high-level synchronization attribute on each statement node and an API for low-level synchronization events, and (3) two built-ins for implementing communications in message passing memory systems. The formal semantics of SPIRE transformational definitions is specified using a two-tiered approach: a small-step operational semantics for its base parallel concepts and a rewriting mechanism for high-level constructs.

The SPIRE methodology is presented via a use case, the intermediate representation of PIPS, a powerful source-to-source compilation infrastructure for Fortran and C. We illustrate the generality of our approach by showing how SPIRE can be used to represent the constructs of the current parallel languages Cilk, Chapel, X10, Habanero-Java, OpenMP, OpenCL and MPI. To validate SPIRE, we show the application of SPIRE to another IR, namely the one of the widely-used LLVM compilation infrastructure.

The extension of the sequential IR of PIPS using SPIRE has been implemented in the PIPS middle-end and used for parallel code transformations and generation (see Chapter 7). We added the execution set (parallel and sequential attributes), the synchronization set (none, spawn, barrier, atomic and single attributes), events and data distribution calls. Chapter 8 provides performance data gathered using our implementation of SPIRE on the PIPS IR, to illustrate the ability of the resulting parallel IR to efficiently express the parallelism present in scientific applications. The next chapter addresses the crucial step in any parallelization process, which is scheduling.

An earlier version of the work presented in this chapter was presented at CPC [67] and at the HiPEAC Computing Systems Week.

Chapter 5

BDSC: A Memory-Constrained, Number of Processor-Bounded Extension of DSC

Life is too unpredictable to live by a schedule. (Anonymous)

In the previous chapter, we have introduced an extension methodology for sequential intermediate representations to handle parallelism. In this chapter, since we are interested in this thesis in task parallelism, we focus on the next step: finding or extracting this parallelism from a sequential code, to be eventually expressed using a SPIRE-derived parallel intermediate representation. To reach that goal, we introduce BDSC, which extends Yang and Gerasoulis's Dominant Sequence Clustering (DSC) algorithm. BDSC is a new, efficient, automatic scheduling algorithm for parallel programs in the presence of resource constraints on the number of processors and their local memory size; it is based on the static estimation of both execution time and communication cost. Thanks to the handling of these two resource parameters in a single framework, BDSC can address both shared and distributed parallel memory architectures. As a whole, BDSC provides a trade-off between concurrency gain and communication overhead between processors, under two resource constraints: finite number of processors and amount of memory.

5.1 Introduction

One key problem when attempting to parallelize sequential programs is to find solutions to graph partitioning and scheduling problems, where vertices represent computational tasks and edges, data exchanges. Each vertex is labeled with an estimation of the time taken by the computation it performs; similarly, each edge is assigned a cost that reflects the amount of data that needs to be exchanged between its adjacent vertices. Efficiency is strongly dependent here on the accuracy of the cost information encoded in the graph to be scheduled. Gathering such information is a difficult process


in general, in particular in our case, where tasks are automatically generated from program code.

We thus introduce a new non-preemptive static scheduling heuristic that strives to give as small as possible schedule lengths, i.e., parallel execution time, in order to exploit the task-level parallelism possibly present in sequential programs, targeting homogeneous multiprocessors or multicores, while enforcing architecture-dependent constraints: the number of processors, the computational cost and memory use of each task, and the communication cost for each task exchange, to provide hopefully significant speedups on shared and distributed computer architectures. Our technique, called BDSC, is based on an existing best-of-breed static scheduling heuristic, namely Yang and Gerasoulis's DSC (Dominant Sequence Clustering) list-scheduling algorithm [108] [47], which we equip with new heuristics that handle resource constraints. One key advantage of DSC over other scheduling policies (see Section 5.4), besides its already good performance when the number of processors is unlimited, is that it has been proven optimal for fork/join graphs; this is a serious asset given our focus on the program parallelization process, since task graphs representing parallel programs often use this particular graph pattern. Even though this property may be lost when constraints are taken into account, our experiments on scientific benchmarks suggest that our extension still provides good performance (see Chapter 8).

This chapter is organized as follows. Section 5.2 presents the original DSC algorithm that we intend to extend. We detail our heuristic extension, BDSC, in Section 5.3. Section 5.4 compares the main existing scheduling algorithms with BDSC. Finally, Section 5.5 concludes the chapter.

5.2 The DSC List-Scheduling Algorithm

In this section, we present the list-scheduling heuristic called DSC [108] that we extend and enhance to be useful for automatic task parallelization under resource constraints.

5.2.1 The DSC Algorithm

DSC (Dominant Sequence Clustering) is a task list-scheduling heuristic for an unbounded number of processors. The objective is to minimize the top level of each task (see Section 2.5.2 for a review of the main notions of list scheduling). A DS (Dominant Sequence) is a path that has the longest length in a partially scheduled DAG; a graph critical path is thus a DS for the totally scheduled DAG. The DSC heuristic computes a Dominant Sequence (DS) after each vertex is processed, using tlevel(τ, G) + blevel(τ, G) as priority(τ). Since priorities are updated after each iteration, DSC computes dynamically the critical path based on both top-level and bottom-level information.


A ready vertex τ, i.e., a vertex all of whose predecessors have already been scheduled, on one of the current DSs, i.e., with the highest priority, is clustered with a predecessor τp when this reduces the top level of τ, by zeroing, i.e., setting to zero, the cost of communications on the incident edge (τp, τ). As said in Section 2.5.2, task_time(τ) and edge_cost(e) are assumed to be numerical constants, although we show how we lift this restriction in Section 6.3.
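To make this priority computation concrete, the following C sketch computes tlevel, blevel and the resulting priorities on a small hard-coded DAG; the definitions follow the standard ones recalled in Section 2.5.2, the DAG is hypothetical (its costs are merely chosen so that the resulting priorities match the table of Figure 5.1), and this is not the PIPS implementation.

    #include <stdio.h>

    #define N 4                            /* number of tasks in the toy DAG */
    static double task_time[N]    = {1, 3, 2, 2};
    static double edge_cost[N][N] = {{0}}; /* 0 means no edge */

    /* tlevel: longest path from an entry vertex to tau, excluding task_time(tau) */
    static double tlevel(int tau) {
        double t = 0;
        for (int p = 0; p < N; p++)
            if (edge_cost[p][tau] > 0) {
                double l = tlevel(p) + task_time[p] + edge_cost[p][tau];
                if (l > t) t = l;
            }
        return t;
    }

    /* blevel: longest path from tau to an exit vertex, including task_time(tau) */
    static double blevel(int tau) {
        double b = task_time[tau];
        for (int s = 0; s < N; s++)
            if (edge_cost[tau][s] > 0) {
                double l = task_time[tau] + edge_cost[tau][s] + blevel(s);
                if (l > b) b = l;
            }
        return b;
    }

    int main(void) {
        /* hypothetical edges (indices 0..3 standing for tau_1..tau_4) */
        edge_cost[0][1] = 1;   /* tau_1 -> tau_2 */
        edge_cost[3][1] = 2;   /* tau_4 -> tau_2 */
        edge_cost[3][2] = 1;   /* tau_4 -> tau_3 */
        for (int tau = 0; tau < N; tau++)
            printf("task %d: priority = tlevel + blevel = %.0f\n",
                   tau + 1, tlevel(tau) + blevel(tau));
        return 0;
    }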

The DSC instance of the general list-scheduling Algorithm 3, presented in Section 2.5.2, is provided in Algorithm 4, where select_cluster is replaced by the code in Algorithm 5 (new_cluster extends clusters with a new empty cluster; its cluster_time is set to 0). The table in Figure 5.1 represents the result of scheduling the DAG in the same figure using the DSC algorithm.

ALGORITHM 4: DSC scheduling of Graph G on P processors

procedure DSC(G, P)
  clusters = ∅;
  foreach τi ∈ vertices(G)
    priority(τi) = tlevel(τi, G) + blevel(τi, G);
  UT = vertices(G);                      // unscheduled tasks
  while UT ≠ ∅
    τ = select_task_with_highest_priority(UT);
    κ = select_cluster(τ, G, P, clusters);
    allocate_task_to_cluster(τ, κ, G);
    update_graph(G);
    update_priority_values(G);
    UT = UT - {τ};
end

ALGORITHM 5: DSC cluster selection for Task τ for Graph G on P processors

function select_cluster(τ, G, P, clusters)
  (min_τ, min_tlevel) = tlevel_decrease(τ, G);
  return (cluster(min_τ) ≠ cluster_undefined) ?
         cluster(min_τ) : new_cluster(clusters);
end

To decide which predecessor τp to select to be clustered with τ, DSC applies the minimization procedure tlevel_decrease, presented in Algorithm 6, which returns the predecessor that leads to the highest reduction of top level for τ if clustered together, and the resulting top level; if no zeroing is accepted, the vertex τ is kept in a new single-vertex cluster. More precisely, the minimization procedure tlevel_decrease for a task τ, in Algorithm 6, tries to find the cluster cluster(min_τ) of one of its predecessors


[DAG: tasks τ1 (time 1), τ2 (time 3), τ3 (time 2), τ4 (time 2); edge costs 2, 1, 1]

step  task  tlevel  blevel  DS   scheduled tlevel
                                 κ0     κ1     κ2
 1     τ4     0       7      7    0*
 2     τ3     3       2      5    2      3*
 3     τ1     0       5      5                  0*
 4     τ2     4       3      7    2*     4

Figure 5.1: A Directed Acyclic Graph (left) and its scheduling (right); starred top levels (*) correspond to the selected clusters

τp that reduces the top level of τ as much as possible, by zeroing the cost of the edge (min_τ, τ). All clusters start at the same time, and each cluster is characterized by its running time, cluster_time(κ), which is the cumulated time of all tasks scheduled into κ; idle slots within clusters may exist and are also taken into account in this accumulation process. The condition cluster(τp) ≠ cluster_undefined is tested on predecessors of τ in order to make it possible to apply this procedure to ready and unready vertices; an unready vertex has at least one unscheduled predecessor.

ALGORITHM 6: Minimization DSC procedure for Task τ in Graph G

function tlevel_decrease(τ, G)
  min_tlevel = tlevel(τ, G);
  min_τ = τ;
  foreach τp ∈ predecessors(τ, G)
          where cluster(τp) ≠ cluster_undefined
    start_time = cluster_time(cluster(τp));
    foreach τ'p ∈ predecessors(τ, G) where
            cluster(τ'p) ≠ cluster_undefined
      if (τp ≠ τ'p) then
        level = tlevel(τ'p, G) + task_time(τ'p) + edge_cost(τ'p, τ);
        start_time = max(level, start_time);
    if (min_tlevel > start_time) then
      min_tlevel = start_time;
      min_τ = τp;
  return (min_τ, min_tlevel);
end


5.2.2 Dominant Sequence Length Reduction Warranty

Dominant Sequence Length Reduction Warranty (DSRW) is a greedy heuristic within DSC that aims to further reduce the scheduling length. A vertex on the DS path with the highest priority can be ready or not ready. With the DSRW heuristic, DSC schedules the ready vertices first, but, if such a ready vertex τr is not on the DS path, DSRW verifies, using the procedure in Algorithm 7, that the corresponding zeroing does not affect later the reduction of the top levels of the DS vertices τu that are partially ready, i.e., such that there exists at least one unscheduled predecessor of τu. To do this, we check if the "partial top level" of τu, which does not take into account unexamined (unscheduled) predecessors and is computed using tlevel_decrease, is reducible once τr is scheduled.

ALGORITHM 7: DSRW optimization for Task τu when scheduling Task τr for Graph G

function DSRW(τr, τu, clusters, G)
  (min_τ, min_tlevel) = tlevel_decrease(τr, G);

  // before scheduling τr
  (τb, ptlevel_before) = tlevel_decrease(τu, G);

  // scheduling τr
  allocate_task_to_cluster(τr, cluster(min_τ), G);
  saved_edge_cost = edge_cost(min_τ, τr);
  edge_cost(min_τ, τr) = 0;

  // after scheduling τr
  (τa, ptlevel_after) = tlevel_decrease(τu, G);
  if (ptlevel_after > ptlevel_before) then
    // (min_τ, τr) zeroing not accepted
    edge_cost(min_τ, τr) = saved_edge_cost;
    return false;
  return true;
end

The table in Figure 5.1 illustrates an example where it is useful to apply the DSRW optimization. There, the DS column provides, for the task scheduled at each step, its priority, i.e., the length of its dominant sequence, while the last column represents, for each possible zeroing, the corresponding task top level; starred top levels (*) correspond to the selected clusters. Task τ4 is mapped to Cluster κ0 in the first step of DSC. Then τ3 is selected because it is the ready task with the highest priority. The mapping of τ3 to Cluster κ0 would reduce its top level from 3 to 2. But the zeroing


κ0    κ1              κ0    κ1    κ2
τ4    τ1              τ4    τ1    τ3
τ3    τ2              τ2

Figure 5.2: Result of DSC on the graph in Figure 5.1 without (left) and with (right) DSRW

of (τ4, τ3) affects the top level of τ2, τ2 being the unready task with the highest priority. Since the partial top level of τ2 is 2 with the zeroing of (τ4, τ2), but 4 after the zeroing of (τ4, τ3), DSRW will fail, and DSC allocates τ3 to a new cluster, κ1. Then τ1 is allocated to a new cluster, κ2, since it has no predecessors. Thus the zeroing of (τ4, τ2) is kept, thanks to the DSRW optimization: the total scheduling length is 5 (with DSRW) instead of 7 (without DSRW) (Figure 5.2).

5.3 BDSC: A Resource-Constrained Extension of DSC

This section details the key ideas at the core of our new scheduling process, BDSC, which extends DSC with a number of important features, namely (1) by verifying predefined memory constraints, (2) by targeting a bounded number of processors, and (3) by trying to make this number as small as possible.

5.3.1 DSC Weaknesses

A good scheduling solution is a solution that is built carefully, by having knowledge about previously scheduled tasks and tasks to execute in the future. Yet, as stated in [72], "an algorithm that only considers bottom level or only top level cannot guarantee optimal solutions". Even though tests carried out on a variety of scheduling algorithms show that the best competitor among them is the DSC algorithm [101], and DSC is a policy that uses the critical path for computing dynamic priorities, based on both the bottom level and the top level of each vertex, it has some limits in practice.

The key weakness of DSC for our purpose is that the number of processors cannot be predefined; DSC yields blind clusterings, disregarding resource issues. Therefore, in practice, a thresholding mechanism to limit the number of generated clusters should be introduced. When allocating new clusters, one should verify that the number of clusters does not exceed a predefined threshold P (Section 5.3.3). Also, zeroings should handle memory constraints, by verifying that the resulting clustering does not lead to cluster data sizes that exceed a predefined cluster memory threshold M (Section 5.3.3).


Finally, DSC may generate a lot of idle slots in the created clusters. It adds a new cluster when no zeroing is accepted, without verifying the possible existence of gaps in existing clusters. We handle this case in Section 5.3.4, adding an efficient idle cluster slot allocation routine in the task-to-cluster mapping process.

5.3.2 Resource Modeling

Since our extension deals with computer resources, we assume that each vertex in a task DAG is labeled with an additional piece of information, task_data(τ), which is an over-approximation of the memory space used by Task τ; its size is assumed to be always strictly less than M. A similar cluster_data function applies to clusters, where it represents the collective data space used by the tasks scheduled within it. Since BDSC, like DSC, needs execution times and communication costs to be numerical constants, we discuss in Section 6.3 how this information is computed.

Our improvement to the DSC heuristic intends to reach a trade-off between the gained parallelism and the communication overhead between processors, under two resource constraints: finite number of processors and amount of memory. We track these resources in our implementation of allocate_task_to_cluster, given in Algorithm 8; note that the aggregation function regions_union is defined in Section 2.6.3.

ALGORITHM 8: Task allocation of Task τ in Graph G to Cluster κ with resource management

procedure allocate_task_to_cluster(τ, κ, G)
  cluster(τ) = κ;
  cluster_time(κ) = max(cluster_time(κ), tlevel(τ, G)) +
                    task_time(τ);
  cluster_data(κ) = regions_union(cluster_data(κ),
                                  task_data(τ));
end

Efficiently allocating tasks on the target architecture requires reducing the communication overhead and transfer cost for both shared and distributed memory architectures. If zeroing operations, which reduce the start time of each task and nullify the corresponding edge cost, are obviously meaningful for distributed memory systems, they are also worthwhile on shared memory architectures. Merging two tasks in the same cluster keeps the data in the local memory, and possibly the cache, of each thread and avoids their copying over the shared memory bus. Therefore, transmission costs are decreased and bus contention is reduced.


5.3.3 Resource Constraint Warranty

Resource usage affects the performance of running programs. Thus, parallelization algorithms should try to limit the size of the memory used by tasks. BDSC introduces a new heuristic to control the amount of memory used by a cluster, via the user-defined memory upper bound parameter M. The limitation of the memory size of tasks is important when (1) executing large applications that operate on large amounts of data, (2) M represents the processor local¹ (or cache) memory size, since, if the memory limitation is not respected, transfers between the global and local memories may occur during execution and may result in performance degradation, and (3) targeting embedded systems architectures. For each task τ, BDSC uses an over-approximation of the amount of data that τ allocates to perform read and write operations; it is used to check that the memory constraint of Cluster κ is satisfied whenever τ is included in κ. Algorithm 9 implements this memory constraint warranty (MCW); regions_union and data_size are functions that respectively merge data and yield the size (in bytes) of data (see Section 6.3 in the next chapter).

ALGORITHM 9: Resource constraint warranties, on memory size M and processor number P

function MCW(τ, κ, M)
  merged = regions_union(cluster_data(κ), task_data(τ));
  return data_size(merged) ≤ M;
end

function PCW(clusters, P)
  return |clusters| < P;
end

The previous line of reasoning is well adapted to a distributed memory architecture. When dealing with a multicore equipped with a purely shared memory, such a per-cluster memory constraint is less meaningful. We can nonetheless keep the MCW constraint check within the BDSC algorithm, even in this case, if we set M to the size of the global shared memory. A positive by-product of this design choice is that BDSC is able, in the shared memory case, to reject computations that need more memory space than available, even within a single cluster.

Another scarce resource is the number of processors. In the original policy of DSC, when no zeroing for τ is accepted, i.e., none that would decrease its start time, τ is allocated to a new cluster. In order to limit the number of created clusters, we propose to introduce a user-defined cluster threshold P.

¹We handle one level of the memory hierarchy of the processors.


This processor constraint warranty, PCW, is defined in Algorithm 9.

5.3.4 Efficient Task-to-Cluster Mapping

In the original policy of DSC, when no zeroings are accepted, because none would decrease the start time of Vertex τ, or DSRW failed, τ is allocated to a new cluster. This cluster creation is not necessary when idle slots are present at the end of other clusters; thus we suggest to select instead one of these idle slots, if this cluster can decrease the start time of τ, without affecting the scheduling of the successors of the vertices already scheduled in this cluster. To ensure this, these successors must have already been scheduled, or they must be a subset of the successors of τ. Therefore, in order to use clusters efficiently and not introduce additional clusters without needing them, we propose to schedule τ to this cluster, which verifies this optimizing constraint, if no zeroing is accepted.

This extension of DSC that we introduce in BDSC thus amounts to replacing each assignment of the cluster of τ to a new cluster by a call to end_idle_clusters. The end_idle_clusters function, given in Algorithm 10, returns, among the idle clusters, the ones that finished the most recently before τ's top level, or the empty set if none is found. This assumes, of course, that τ's dependencies are compatible with this choice.

ALGORITHM 10: Efficiently mapping Task τ in Graph G to clusters, if possible

function end_idle_clusters(τ, G, clusters)
  idle_clusters = clusters;
  foreach κ ∈ clusters
    if (cluster_time(κ) ≤ tlevel(τ, G)) then
      end_idle_p = TRUE;
      foreach τκ ∈ vertices(G) where cluster(τκ) = κ
        foreach τs ∈ successors(τκ, G)
          end_idle_p ∧= cluster(τs) ≠ cluster_undefined ∨
                        τs ∈ successors(τ, G);
      if (¬end_idle_p) then
        idle_clusters = idle_clusters - {κ};
  last_clusters = argmax_{κ ∈ idle_clusters} cluster_time(κ);
  return (idle_clusters ≠ ∅) ? last_clusters : ∅;
end

To illustrate the importance of this heuristic, suppose we have the DAG presented in Figure 5.3. The tables in Figure 5.4 exhibit the difference in scheduling obtained by DSC and our extension on this graph. We observe here that the number of clusters generated using DSC is 3, with 5 idle slots, while BDSC needs only 2 clusters, with 2 idle slots. Moreover, BDSC achieves a better load balancing than DSC, since it reduces the variance of


[DAG: tasks τ1 (time 1), τ2 (time 5), τ3 (time 2), τ4 (time 3), τ5 (time 2), τ6 (time 2);
 edge costs 1, 1, 1, 1, 1, 9, 2]

step  task  tlevel  blevel  DS   scheduled tlevel
                                 κ0      κ1
 1     τ1     0       15     15    0*
 2     τ3     2       13     15    1*
 3     τ2     2       12     14    3       2*
 4     τ4     8        6     14            7*
 5     τ5     8        6     14    8*     10
 6     τ6    13        2     15   11*     12

Figure 5.3: A DAG amenable to cluster minimization (left) and its BDSC step-by-step scheduling (right)

the clusters' execution loads, defined, for a given cluster, as the sum of the costs of all its tasks: $X_\kappa = \sum_{\tau \in \kappa} task\_time(\tau)$. Indeed, we compute first the average of the costs

$$E[X_\kappa] = \frac{\sum_{\kappa=0}^{|clusters|-1} X_\kappa}{|clusters|}$$

and then the variance

$$V(X_\kappa) = \frac{\sum_{\kappa=0}^{|clusters|-1} (X_\kappa - E[X_\kappa])^2}{|clusters|}$$

For BDSC, $E[X_\kappa] = \frac{15}{2} = 7.5$ and $V(X_\kappa) = \frac{(7-7.5)^2+(8-7.5)^2}{2} = 0.25$. For DSC, $E[X_\kappa] = \frac{15}{3} = 5$ and $V(X_\kappa) = \frac{(5-5)^2+(8-5)^2+(2-5)^2}{3} = 6$.

Finally, with our efficient task-to-cluster mapping, in addition to decreasing the number of generated clusters, we also gain in the total execution time, since our approach reduces communication costs by allocating tasks to the same cluster, e.g., by eliminating edge_cost(τ5, τ6) = 2 in Figure 5.4; thus, as shown in this figure (the last column accumulates the execution time for the clusters), the execution time with DSC is 14, but is 13 with BDSC.

To get a feeling for the way BDSC operates, we detail the steps taken to get this better scheduling in the table of Figure 5.3. BDSC is equivalent to DSC until Step 5, where κ0 is chosen by our cluster mapping heuristic, since successors(τ3, G) ⊂ successors(τ5, G); no new cluster needs to be allocated.

5.3.5 The BDSC Algorithm

BDSC extends the list-scheduling template DSC, provided in Algorithm 4, by taking into account the various extensions discussed above. In a nutshell, the BDSC select_cluster function, which decides in which cluster κ a task τ should be allocated, tries successively the four following strategies:


κ0    κ1    κ2    cumulative time         κ0    κ1    cumulative time
τ1                       1                τ1                  1
τ3    τ2                 7                τ3    τ2            7
      τ4    τ5          10                τ5    τ4           10
τ6                      14                τ6                 13

Figure 5.4: DSC (left) and BDSC (right) cluster allocation

1. choose κ among the clusters of τ's predecessors that decrease the start time of τ, under MCW and DSRW constraints;

2. or assign κ using our efficient task-to-cluster mapping strategy, under the additional constraint MCW;

3. or create a new cluster if the PCW constraint is satisfied;

4. otherwise, choose the cluster among all clusters in MCW_clusters_min under the constraint MCW. Note that, in this worst case scenario, the top level of τ can be increased, leading to a decrease in performance, since the length of the graph critical path is also increased.

BDSC is described in Algorithms 11 and 12; the entry graph Gu is the whole unscheduled program DAG, P the maximum number of processors, and M the maximum amount of memory available in a cluster. UT denotes the set of unexamined tasks at each BDSC iteration, RL the set of ready tasks, and URL the set of unready ones. We schedule the vertices of G according to the four rules above, in a descending order of the vertices' priorities. Each time a task τr has been scheduled, all the newly readied vertices are added to the set RL (ready list) by the update_ready_set function.

BDSC returns a scheduled graph, i.e., an updated graph where some zeroings may have been performed and for which the clusters function yields the clusters needed by the given schedule; this schedule includes, besides the new graph, the cluster allocation function on tasks, cluster. If not enough memory is available, BDSC returns the original graph and signals its failure by setting clusters to the empty set. In Chapter 6, we introduce an alternative solution in case of failure, using an iterative hierarchical scheduling approach.

We suggest to apply here an additional heuristic, in which, if multiple vertices have the same priority, the vertex with the greatest bottom level is chosen for τr (likewise for τu) to be scheduled first, to favor the successors that have the longest path from τr to the exit vertex. Also, an optimization could be performed when calling update_priority_values(G): indeed, after each cluster allocation, only the top levels of the successors of τr need to be recomputed, instead of those of the whole graph.


ALGORITHM 11: BDSC scheduling of Graph Gu, under processor and memory bounds P and M

function BDSC(Gu, P, M)
  if (P ≤ 0) then
    return error("Not enough processors", Gu);
  G = graph_copy(Gu);
  foreach τi ∈ vertices(G)
    priority(τi) = tlevel(τi, G) + blevel(τi, G);
  UT = vertices(G);
  RL = {τ ∈ UT / predecessors(τ, G) = ∅};
  URL = UT - RL;
  clusters = ∅;
  while UT ≠ ∅
    τr = select_task_with_highest_priority(RL);
    (τm, min_tlevel) = tlevel_decrease(τr, G);
    if (τm ≠ τr ∧ MCW(τr, cluster(τm), M)) then
      τu = select_task_with_highest_priority(URL);
      if (priority(τr) < priority(τu)) then
        if (¬DSRW(τr, τu, clusters, G)) then
          if (PCW(clusters, P)) then
            κ = new_cluster(clusters);
            allocate_task_to_cluster(τr, κ, G);
          else
            if (¬find_cluster(τr, G, clusters, P, M)) then
              return error("Not enough memory", Gu);
      else
        allocate_task_to_cluster(τr, cluster(τm), G);
        edge_cost(τm, τr) = 0;
    else if (¬find_cluster(τr, G, clusters, P, M)) then
      return error("Not enough memory", Gu);
    update_priority_values(G);
    UT = UT - {τr};
    RL = update_ready_set(RL, τr, G);
    URL = UT - RL;
  clusters(G) = clusters;
  return G;
end

Theorem 1. The time complexity of Algorithm 11 (BDSC) is O(n³), n being the number of vertices in Graph G.

Proof. In the "while" loop of BDSC, the most expensive computation among the different if or else branches is the function end_idle_clusters, used in find_cluster, which locates an existing cluster suitable for allocating Task τ there; such reuse intends to optimize the use of the limited number of processors.


ALGORITHM 12: Attempt to allocate κ in clusters for Task τ in Graph G, under processor and memory bounds P and M, returning true if successful

function find_cluster(τ, G, clusters, P, M)
  MCW_idle_clusters =
    {κ ∈ end_idle_clusters(τ, G, clusters, P) /
     MCW(τ, κ, M)};
  if (MCW_idle_clusters ≠ ∅) then
    κ = choose_any(MCW_idle_clusters);
    allocate_task_to_cluster(τ, κ, G);
  else if (PCW(clusters, P)) then
    allocate_task_to_cluster(τ, new_cluster(clusters), G);
  else
    MCW_clusters = {κ ∈ clusters / MCW(τ, κ, M)};
    MCW_clusters_min = argmin_{κ ∈ MCW_clusters} cluster_time(κ);
    if (MCW_clusters_min ≠ ∅) then
      κ = choose_any(MCW_clusters_min);
      allocate_task_to_cluster(τ, κ, G);
    else
      return false;
  return true;
end

function error(m, G)
  clusters(G) = ∅;
  return G;
end

Its complexity is proportional to

$$\sum_{\tau \in vertices(G)} |successors(\tau, G)|$$

which is of worst-case complexity O(n²). Thus the total cost for the n iterations of the "while" loop is O(n³).

Even though BDSCrsquos worst case complexity is larger than DSCrsquos whichis O(n2log(n)) [108] it remains polynomial with a small exponent Ourexperiments (see Chapter 8) showed that theoretical slowdown not to be asignificant factor in practice

5.4 Related Work: Scheduling Algorithms

In this section, we survey the main existing scheduling algorithms and compare them to BDSC. Given the breadth of the literature on this subject, we limit this presentation to heuristics that implement static scheduling of task DAGs. Static scheduling algorithms can be divided into three principal classes: guided random, metaheuristic and heuristic techniques.

Guided randomization and metaheuristic techniques use possibly random choices with guiding information gained from previous search results to construct new possible solutions. Genetic algorithms are examples of these techniques: they proceed by performing a global search, working in parallel on a population, to increase the chance of achieving a global optimum. The Genetic Algorithm for Task Scheduling (GATS 1.0) [38] is an example of GAs. Overall, these techniques usually take a larger than necessary time to find a good solution [72].

Cuckoo Search (CS) [109] is an example of the metaheuristic class. CS combines the behavior of cuckoo species (throwing their eggs in the nests of other birds) with the Lévy flight [28]. In metaheuristics, finding an initial population and deciding when a solution is good are problems that affect the quality of the solution and the complexity of the algorithm.

The third class is heuristics, including mostly list-scheduling-based algorithms (see Section 2.5.2). We compare BDSC with eight scheduling algorithms for a bounded number of processors: HLFET, ISH, MCP, HEFT, CEFT, ELT, LPGS and LSGP.

The Highest Level First with Estimated Times (HLFET) [10] and Insertion Scheduling Heuristic (ISH) [70] algorithms use static bottom levels for ordering: scheduling is performed according to a descending order of bottom levels. To schedule a task, they select the cluster that offers the earliest execution time, using a non-insertion approach, i.e., not taking into account idle slots within existing clusters to insert that task. If scheduling a given task introduces an idle slot, ISH adds the possibility of inserting, from the ready list, tasks that can be scheduled into this idle slot. Since, in both algorithms, only bottom levels are used for scheduling purposes, optimal schedules for fork/join graphs cannot be guaranteed.

The Modified Critical Path (MCP) algorithm [107] uses the latest start times, i.e., the critical path length minus blevel, as task priorities. It constructs a list of tasks in an ascending order of latest start times and searches for the cluster yielding the earliest execution, using the insertion approach. As before, it cannot guarantee optimal schedules for fork/join structures.

The Heterogeneous Earliest-Finish-Time (HEFT) algorithm [105] selects the cluster that minimizes the earliest finish time, using the insertion approach. The priority of a task, its upward rank, is the task bottom level. Since this algorithm is based on bottom levels only, it cannot guarantee optimal schedules for fork/join structures.

The Constrained Earliest Finish Time (CEFT) [68] algorithm schedules tasks on heterogeneous systems. It uses the concept of constrained critical paths (CCPs), which represent the tasks ready at each step of the scheduling process. CEFT schedules the tasks in the CCPs using their finish times in the entire CCP. Because CEFT schedules critical-path tasks first, it cannot guarantee optimal schedules for fork and join structures, even if sufficient processors are provided.

In contrast to these five algorithms, BDSC preserves the DSC characteristic of optimal scheduling for fork/join structures, since it uses the critical path length for computing dynamic priorities, based on bottom levels and top levels, when the number of processors available for BDSC is greater than or equal to the number of processors used when scheduling with DSC. HLFET, ISH and MCP guarantee that the current critical path will not increase, but they do not attempt to decrease the critical path length. BDSC decreases the length of the DS (Dominant Sequence) of each task and starts with a ready vertex, to simplify the computation time of new priorities. When resource scarcity is a factor, BDSC introduces a simple two-step heuristic for task allocation: to schedule a task, it first searches for possible idle slots in already existing clusters and, otherwise, picks a cluster with enough memory. Our experiments suggest that such an approach provides good schedules (see Chapter 8).

The Locally Parallel-Globally Sequential (LPGS) [79] and Locally Sequential-Globally Parallel (LSGP) [59] algorithms are two techniques that, from a schedule for an unbounded number of clusters, remap the solutions to a bounded number of clusters. In LSGP, clusters are partitioned into blocks, each block being assigned to one cluster (locally sequential); the blocks are handled separately by different clusters, which can be run in parallel (globally parallel). LPGS links each original one-block cluster to one processor (locally parallel); blocks are executed sequentially (globally sequential). BDSC computes a bounded schedule on the fly and covers many more scheduling possibilities than LPGS and LSGP.

The Extended Latency Time (ELT) algorithm [102] assigns tasks to a parallel machine with shared memory. It uses the attribute of synchronization time instead of communication time, because the latter does not exist in a machine with shared memory. BDSC targets both shared and distributed memory systems.

5.5 Conclusion

This chapter presents the memory-constrained, number-of-processors-bounded, list-scheduling heuristic BDSC, which extends the DSC (Dominant Sequence Clustering) algorithm. This extension improves upon DSC by dealing with memory- and number-of-processors-constrained parallel architectures and by introducing an efficient task-to-cluster mapping strategy.

This chapter leaves pending many questions about the practical use of BDSC and about how to estimate the execution times and communication costs as numerical constants. The next chapter answers these questions, and others, by showing (1) the practical use of BDSC within a hierarchical scheduling algorithm, called HBDSC, for parallelizing scientific applications, and (2) how BDSC benefits from a sophisticated execution and communication cost model, based either on static code analysis results provided by a compiler infrastructure such as PIPS or on a dynamic, instrumentation-based assessment tool.

Chapter 6

BDSC-Based Hierarchical Task Parallelization

All sciences are now under the obligation to prepare the ground for the future task of the philosopher, which is to solve the problem of value, to determine the true hierarchy of values.
Friedrich Nietzsche

To recapitulate what we have presented so far in the previous chapters, recall that we have handled two key issues in our automatic task parallelization process, namely (1) the design of SPIRE, to provide parallel program representations, and (2) the introduction of the BDSC list-scheduling heuristic, for mapping DAGs in the presence of resource constraints on the number of processors and their local memory size. In order to reach our goal of automatically task parallelizing sequential codes, we develop in this chapter an approach, based on BDSC, that first maps this code into a DAG, then generates a cost model to label the edges and vertices of this DAG, and finally applies BDSC to sequential applications.

This chapter therefore introduces a new BDSC-based hierarchical scheduling algorithm, which we call HBDSC. It introduces a new DAG data structure, called the Sequence Dependence DAG (SDG), to represent partitioned parallel programs. Moreover, in order to label this SDG's vertices and edges, HBDSC uses a new cost model based on time complexity measures, convex polyhedral approximations of data array sizes and code instrumentation.

6.1 Introduction

If static scheduling algorithms are key issues in sequential program parallelization, they need to be properly integrated into compilation platforms to be used in practice. These platforms are in particular expected to provide the data required for scheduling purposes. However, gathering such information is a difficult problem in general, in particular in our case, where tasks are automatically generated from program code. Therefore, in addition to BDSC, this chapter also introduces new static program analyses to gather the information required to perform scheduling under resource constraints, namely a static execution and communication cost model, a data dependence graph to provide scheduling constraints, and static information regarding the volume of data exchanged between program tasks.

In this chapter, we detail how BDSC can be used in practice to schedule applications. We show (1) how to build, from an existing program source code, what we call a Sequence Dependence DAG (SDG), which plays the role of the DAG G used in Chapter 5, (2) how to generate the numerical costs of vertices and edges in SDGs, and (3) how to perform what we call Hierarchical Scheduling (HBDSC) for SDGs, to generate parallel code encoded in SPIRE(PIPS IR). We use PIPS to illustrate how these new techniques can be implemented in an optimizing compilation platform.

PIPS represents user code as abstract syntax trees (see Section 2.6.5). We define a subset of its grammar in Figure 6.1, limited to the statements S at stake in this chapter. Econd, Elower and Eupper are expressions, while I is an identifier. The semantics of these constructs is straightforward. Note that, in PIPS, assignments are seen as function calls, where left hand sides are parameters passed by reference. We use the notion of unstructured, with an execution attribute equal to parallel (defined in Chapter 4), to represent parallel code.

S ∈ Statement = sequence(S1; ...; Sn) |
                test(Econd, St, Sf) |
                forloop(I, Elower, Eupper, Sbody) |
                call |
                unstructured(Centry, Cexit, parallel)

C ∈ Control   = control(S, Lsucc, Lpred)

L ∈ Control

Figure 6.1: Abstract syntax trees: Statement syntax

The remainder of this chapter is structured as follows. Section 6.2 introduces the partitioning of source codes into Sequence Dependence DAGs (SDGs). Section 6.3 presents our cost models for the labeling of vertices and edges of SDGs. We introduce in Section 6.4 an important analysis, the path transformer, that we use within our cost models. A new BDSC-based hierarchical scheduling algorithm (HBDSC) is introduced in Section 6.5. Section 6.6 compares the main existing parallelization platforms with our BDSC-based hierarchical parallelization approach. Finally, Section 6.7 concludes the chapter. Figure 6.2 summarizes the processes of parallelization applied in this chapter.

6.2 Hierarchical Sequence Dependence DAG Mapping

In order to partition real applications, which include loops, tests and other structured constructs¹, into dependence DAGs, our approach is

¹ In this thesis, we only handle structured parts of a code, i.e., the ones that do not contain goto statements. Therefore, within this context, PIPS implements control dependences in its IR, since it is equivalent to an AST (for structured programs, CDG and AST are equivalent).

[Figure: flowchart of the parallelization process. C source code and input data are fed to PIPS analyses (DDG, convex array regions, complexity polynomials), producing symbolic communication and memory-size polynomials and the sequential IR, from which the Sequence Dependence Graph (SDG) is built. If the polynomials are monomials on the same base, they are approximated statically ("yes" branch); otherwise ("no" branch), the source is instrumented and executed to collect a numerical profile. HBDSC then produces the unstructured SPIRE(PIPS IR) schedule.]

Figure 6.2: Parallelization process: blue indicates thesis contributions, an ellipse a process, and a rectangle results


to first build a Sequence Dependence DAG (SDG), which will be the input for the BDSC algorithm. Then we use the code, presented in the form of an AST, to define a hierarchical mapping function, that we call H, to map each sequence statement of the code to its SDG. H is used as the input of the HBDSC algorithm. We present in this section what SDGs are and how an H is built upon them.

6.2.1 Sequence Dependence DAG (SDG)

A Sequence Dependence DAG G is a data dependence DAG where task vertices τ are labeled with statements, while control dependences are encoded in the abstract syntax trees of statements. Any statement S can label a DAG vertex, i.e., each vertex τ contains a statement S, which corresponds to the code it runs when scheduled. We assume that there exist two functions vertex_statement and statement_vertex such that, on their respective domains of definition, they satisfy S = vertex_statement(τ) and statement_vertex(S, G) = τ. In contrast to the usual program dependence graph defined in [44], an SDG is thus not built only on simple instructions, represented here as call statements: compound statements, such as test statements (both true and false branches) and loop nests, may constitute indivisible vertices of the SDG.

Design Analysis

For a statement S, one computes its sequence dependence DAG G using a recursive descent (depth-first traversal) along the structure of S, creating a vertex for each statement in a sequence and building loop body, true branch and false branch statements as recursively nested graphs. Definition 1 specifies this relation of hierarchical nesting between statements, which we call "enclosed".

Definition 1: A statement S2 is enclosed into the statement S1, noted S2 ⊂ S1, if and only if either:

1. S2 = S1, i.e., S1 and S2 have the same lexicographical order position within the whole AST to which they belong; this ordering of statements, based on their statement number, renders statements textually comparable;

2. or S1 is a loop, S′2 its body statement, and S2 ⊂ S′2;

3. or S1 is a test, S′2 its true or false branch, and S2 ⊂ S′2;

4. or S1 is a sequence and there exists S′2, one of the statements in this sequence, such that S2 ⊂ S′2.


This relation can also be applied to the vertices τi of these statements Si in a graph, such that vertex_statement(τi) = Si; we note S2 ⊂ S1 ⟺ τ2 ⊂ τ1.

In Definition 2, we provide a functional specification of the construction of a Sequence Dependence DAG G, where the vertices are compound statements, and thus under the enclosed relation. To create connections within this set of vertices, we keep an edge between two vertices if the possibly compound statements in the vertices are data dependent, using the recursive function prune; we rely on static analyses, such as the one provided in PIPS, to get this data dependence information, in the form of a data dependence graph that we call D in the definition.

Definition 2: Given a data dependence graph D = (T, E)² and a sequence of statements S = sequence(S1; S2; ...; Sm), its SDG is G = (N, prune(A), 0m²), where:

• N = {n1, n2, ..., nm}, where nj = vertex_statement(Sj), ∀j ∈ {1, ..., m};

• A = {(ni, nj) / ∃(τ, τ′) ∈ E, τ ⊂ ni ∧ τ′ ⊂ nj};

• prune(A′ ∪ {(n, n′)}) = prune(A′), if ∃(n0, n′0) ∈ A′, (n0 ⊂ n ∧ n′0 ⊂ n′);
  prune(A′ ∪ {(n, n′)}) = prune(A′) ∪ {(n, n′)}, otherwise;

• prune(∅) = ∅;

• 0m² is an m × m communication edge cost matrix, initially equal to the zero matrix.

Note that G is a quotient graph of D; moreover, we assume that D does not contain inter-iteration dependencies.
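To make the prune function of Definition 2 more concrete, here is a minimal C sketch over an array of edges; the edge type and the encloses predicate (standing for the ⊂ relation of Definition 1) are assumptions introduced for this illustration, not the PIPS implementation.

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical vertex and edge types; encloses(inner, outer) is assumed
   to implement the "enclosed" relation inner ⊂ outer of Definition 1. */
typedef int vertex_t;
typedef struct { vertex_t src, dst; } edge_t;

extern bool encloses(vertex_t inner, vertex_t outer);

/* Mirror of the recursive definition: a candidate edge (n, n') is kept
   only if no already-kept edge (n0, n0') satisfies n0 ⊂ n and n0' ⊂ n'.
   Kept edges are compacted in place; the new edge count is returned. */
static size_t prune(edge_t *a, size_t m)
{
   size_t kept = 0;
   for (size_t i = 0; i < m; i++) {
      bool redundant = false;
      for (size_t j = 0; j < kept && !redundant; j++)
         if (encloses(a[j].src, a[i].src) && encloses(a[j].dst, a[i].dst))
            redundant = true;
      if (!redundant)
         a[kept++] = a[i];
   }
   return kept;
}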

Construction Algorithm

Algorithm 13 specifies the construction of the sequence dependence DAG G of a statement that is a sequence of substatements S1, S2, ..., Sm. G is initially an empty graph, i.e., its sets of vertices and edges are empty, and its communication matrix is the zero matrix.

From the data dependence graph D, provided via the function DDG that we suppose given, Function SDG(S) adds vertices and edges to G, according to Definition 2, using the functions add_vertex and add_edge. First, we group dependences between compound statements Si to form cumulated dependences, using dependences between their enclosed statements Se; this

² T = vertices(D) is a set of n tasks (vertices) τ, and E is a set of m edges (τi, τj) between two tasks.


ALGORITHM 13: Construction of the sequence dependence DAG G from a statement S = sequence(S1; S2; ...; Sm)

function SDG(sequence(S1; S2; ...; Sm))
   G = (∅, ∅, 0m²)
   D = DDG(sequence(S1; S2; ...; Sm))
   foreach Si ∈ {S1, S2, ..., Sm}
      τi = statement_vertex(Si, D)
      add_vertex(τi, G)
      foreach Se / Se ⊂ Si
         τe = statement_vertex(Se, D)
         foreach τj ∈ successors(τe, D)
            Sj = vertex_statement(τj)
            if (∃ k ∈ [1, m] / Sj ⊂ Sk) then
               τk = statement_vertex(Sk, D)
               edge_regions(τi, τk) ∪= edge_regions(τe, τj)
               if (¬∃ (τi, τk) ∈ edges(G))
                  add_edge((τi, τk), G)
   return G
end

yields the successors τj. Then we search for another compound statement Sk to form the second end of the dependence, τk, the first being τi. We then keep the dependences that label the edges of D, using edge_regions, on the new edge. Finally, we add the dependence edge (τi, τk) to G.

Figure 6.4 illustrates the construction, from the DDG given in Figure 6.3 (right; the meaning of colors is given in Section 2.4.2), of the SDG of the C code (left). The figure contains two SDGs, corresponding to the two sequences in the code: the loop body S0 (in blue) also has its own SDG, G0. Note how the dependences between the two loops have been deduced from the dependences of their enclosed statements (their loop bodies). These SDGs and their printouts have been generated automatically with PIPS³.

6.2.2 Hierarchical SDG Mapping

A hierarchical SDG mapping function H maps each statement S to an SDG G = H(S) if S is a sequence statement; otherwise, G is equal to ⊥.

Proposition 1: ∀τ ∈ vertices(H(S′)), vertex_statement(τ) ⊂ S′.

Our introduction of the SDG and of its hierarchical mapping to different statements via H is motivated by the following observations, which also support our design decisions.

³ We assume that declarations and codes are not mixed.


void main() {
   int a[10], b[10];
   int i, d, c;
   /* S */
   c = 42;
   for (i = 1; i <= 10; i++)
      a[i] = 0;
   for (i = 1; i <= 10; i++) {
      /* S0 */
      a[i] = bar(a[i]);
      d = foo(c);
      b[i] = a[i]+d;
   }
   return;
}

Figure 6.3: Example of a C code (left) and the DDG D of its internal S sequence (right)

1. The true and false statements of a test are control dependent upon the condition of the test statement, while every statement within a loop (i.e., the statements of its body) is control dependent upon the loop statement header. If we define a control area as a set of statements transitively linked by the control dependence relation, our SDG construction process ensures that the control area of the statement of a given vertex is in the vertex. This way, we keep all the control dependences of a task in our SDG within itself.

2. We decided to consider test statements as single vertices in the SDG, to ensure that they are scheduled on one cluster⁴, which guarantees the execution of the enclosed code (true or false statements), whichever branch is taken, on this cluster.

3. We do not group successive simple call instructions into a single "basic block" vertex in the SDG, in order to let BDSC fuse the corresponding statements so as to maximize parallelism and minimize communications. Note that PIPS performs interprocedural analyses, which allow call sequences to be efficiently scheduled, whether these calls represent trivial assignments or complex function calls.

⁴ A cluster is a logical entity which will correspond to one process or thread.


Figure 6.4: SDGs of S (top) and S0 (bottom), computed from the DDG (see the right of Figure 6.3); S and S0 are specified in the left of Figure 6.3


In Algorithm 13, Function SDG yields the sequence dependence DAG of a sequence statement S. We use it in Function SDG_mapping, presented in Algorithm 14, to update the mapping function H for statements S that can be a sequence or a non-sequence; the goal is to build a hierarchy between these SDGs, via the recursive call H = SDG_mapping(S, ⊥).

Figure 6.5 shows the construction of H, where S is the last part of the equake code excerpt given in Figure 6.7; note how the body Sbody of the last loop is also an SDG, Gbody. Note that H(Sbody) = Gbody and H(S) = G. These SDGs have been generated automatically with PIPS; we use the Graphviz tool for pretty printing [50].


ALGORITHM 14: Recursive mapping of an SDG G to each sequence statement S via the function H

function SDG_mapping(S, H)
   switch (S)
      case call:
         return H[S → ⊥]
      case sequence(S1; ...; Sn):
         foreach Si ∈ {S1, ..., Sn}
            H = SDG_mapping(Si, H)
         return H[S → SDG(S)]
      case forloop(I, Elower, Eupper, Sbody):
         H = SDG_mapping(Sbody, H)
         return H[S → ⊥]
      case test(Econd, St, Sf):
         H = SDG_mapping(St, H)
         H = SDG_mapping(Sf, H)
         return H[S → ⊥]
end

Figure 6.5: SDG G for a part of equake S given in Figure 6.7; Gbody is the SDG of the body Sbody (logically included into G via H)


6.3 Sequential Cost Models Generation

Since the volume of data used or exchanged by SDG tasks and their execution times are key factors in the BDSC scheduling process, we need to assess this information as precisely as possible. PIPS provides an intra- and inter-procedural analysis of array data flow, called array regions analysis (see Section 2.6.3), that computes dependences for each array element access. For each statement S, two types of sets of regions are considered: read_regions(S) and write_regions(S)⁵ contain the array elements respectively read and written by S. Moreover, in PIPS, array region analysis is not limited to array variables but is extended to scalar variables, which can be considered as arrays with no dimensions [34].

6.3.1 From Convex Polyhedra to Ehrhart Polynomials

Our analysis uses the following set of operations on the sets of regions Ri of Statements Si, presented in Section 2.6.3: regions_intersection and regions_union. Note that regions must be defined with respect to a common memory store for these operations to be properly defined. Since two sets of regions R1 and R2 should be defined in the same memory store to make it possible to compare them, we bring the set of regions R1 to the store of R2, by combining R2 with the possible store changes introduced in between, what we call the "path transformer", which connects the two memory stores. We introduce the path transformer analysis in Section 6.4.

Since we are interested in the size of array regions, to precisely assess communication costs and memory requirements, we compute Ehrhart polynomials [41], which represent the number of integer points contained in a given parameterized polyhedron, from each region. To manipulate these polynomials, we use various operations from the Ehrhart API provided by the polylib library [90].

Communication Edge Cost

To assess the communication cost between two SDG vertices, τ1 as source and τ2 as sink vertex, we rely on the number of bytes involved in dependences of type "read after write" (RAW), using the read and write regions as follows:

   Rw1 = write_regions(vertex_statement(τ1))
   Rr2 = read_regions(vertex_statement(τ2))
   edge_cost_bytes(τ1, τ2) = Σ_{r ∈ regions_intersection(Rw1, Rr2)} ehrhart(r)

⁵ in_regions(S) and out_regions(S) are not suitable for our context unless in_regions and out_regions are recomputed after each statement scheduling. That is why we use read_regions and write_regions, which are independent of the schedule.


For example, the array dependence between the two statements S1 = MultiplY(Ixx, Gx, Gx) and S2 = Gauss(Sxx, Ixx) in the code of Harris presented in Figure 1.1 is carried by Ixx, of size N × M, where the N and M variables represent the input image size. Therefore, S1 should communicate the array Ixx to S2 when completed. We look for the array elements that are accessed in both statements (written in S1 and read in S2); we obtain the following two sets of regions, Rw1 and Rr2:

   Rw1 = <Ixx(φ1)-W-EXACT-{1 ≤ φ1, φ1 ≤ N × M}>
   Rr2 = <Ixx(φ1)-R-EXACT-{1 ≤ φ1, φ1 ≤ N × M}>

The result of the intersection between these two sets of regions is as follows:

   Rw1 ∩ Rr2 = <Ixx(φ1)-WR-EXACT-{1 ≤ φ1, φ1 ≤ N × M}>

The sum of the Ehrhart polynomials resulting from the set Rw1 ∩ Rr2 of regions, which yields the volume of this set of regions, i.e., the number of its elements in bytes, is as follows, where 4 is the number of bytes in a float:

   edge_cost_bytes(τ1, τ2) = ehrhart(<Ixx(φ1)-WR-EXACT-{1 ≤ φ1, φ1 ≤ N × M}>)
                           = 4 × N × M

In practice, in our experiments, in order to compute the communication time, this polynomial, which represents the message size to communicate in number of bytes, is multiplied by the transfer time of one byte (β); it is then added to the latency time (α). These two coefficients depend on the specific target machine; we note:

   edge_cost(τ1, τ2) = α + β × edge_cost_bytes(τ1, τ2)

The default cost model in PIPS considers that α = 0 and β = 1. Moreover, in some cases, the array region analysis cannot provide the required information. Therefore, in the case of shared memory codes, BDSC can proceed and generate an efficient schedule without the edge_cost information; otherwise, in the case of distributed memory codes, edge_cost(τ1, τ2) = ∞ is generated as a result for such cases. We then decide to schedule τ1 and τ2 in the same cluster.
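The following minimal C sketch illustrates this affine model; the coefficient values and the example image size are hypothetical placeholders, not measured machine parameters (the PIPS default corresponds to alpha = 0 and beta = 1).

#include <stdio.h>

/* Machine-dependent coefficients; the values below are illustrative
   (the PIPS default cost model amounts to alpha = 0 and beta = 1). */
static const double alpha = 0.0;  /* latency of one message    */
static const double beta  = 1.0;  /* transfer time of one byte */

/* edge_cost(tau1, tau2) = alpha + beta * edge_cost_bytes(tau1, tau2). */
static double edge_cost(double edge_cost_bytes)
{
   return alpha + beta * edge_cost_bytes;
}

int main(void)
{
   /* Harris example: one float image, i.e., 4 * N * M bytes;
      N and M are chosen here only for illustration. */
   const double N = 1024.0, M = 1024.0;
   printf("edge_cost = %.0f\n", edge_cost(4.0 * N * M));
   return 0;
}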

Local Storage Task Data

To provide an estimation of the volume of data used by each task τ, we use the number of bytes of data read and written by the task statement, via the following definitions, assuming that the task statement contains no internal allocation of data:


   S = vertex_statement(τ)
   task_data(τ) = regions_union(read_regions(S),
                                write_regions(S))

As above, we define data_size as follows, for any set of regions R:

   data_size(R) = Σ_{r ∈ R} ehrhart(r)

For instance, in Harris main function, task_data(τ), where τ is the vertex labeled with the call statement S1 = MultiplY(Ixx, Gx, Gx), is equal to:

   R = {<Gx(φ1)-R-EXACT-{1 ≤ φ1, φ1 ≤ N × M}>,
        <Ixx(φ1)-W-EXACT-{1 ≤ φ1, φ1 ≤ N × M}>}

The size of this set of regions is:

   data_size(R) = 2 × N × M

When the array region analysis cannot provide the required information, the size of the array contained in its declaration is returned as an upper bound of the accessed array elements for such cases.

Execution Task Time

In order to determine an average execution time for each vertex in the SDG, we use a static execution time approach, based on a program complexity analysis provided by PIPS. There, each statement S is automatically labeled with an expression, represented by a polynomial over program variables, via the complexity_estimation(S) function, which denotes an estimation of the execution time of this statement, assuming that each basic operation (addition, multiplication...) has a fixed, architecture-dependent execution time. In practice, to perform our experiments on a specific target machine, we use a table, COMPLEXITY_COST_TABLE, that provides, for each basic operation, a coefficient of its execution time on this machine. This sophisticated static complexity analysis is based on inter-procedural information such as preconditions. Using this approach, one can define task_time as:

   task_time(τ) = complexity_estimation(vertex_statement(τ))

For instance, for the call statement S1 = MultiplY(Ixx, Gx, Gx) in Harris main function, task_time(τ), where τ is the vertex labeled with S1, is equal to N × M × 17 + 2, as detailed in Figure 6.6. Thanks to the interprocedural analysis of PIPS, this information can be reported to the call to the function MultiplY in the main function of Harris. Note that we use here a default cost model, where the coefficient of each basic operation is equal to 1.

When this analysis cannot provide the required information (currently, the complexity estimation of a while loop is not computed), our algorithm still computes a naive schedule, in which we assume that task_time(τ) = ∞; the quality (in terms of efficiency) of this schedule is, however, not guaranteed.


void MultiplY(float M[N*M], float X[N*M], float Y[N*M]) {
   // N*M*17 + 2
   for (i = 0; i < N; i += 1)
      // M*17 + 2
      for (j = 0; j < M; j += 1)
         // 17
         M[z(i,j)] = X[z(i,j)] * Y[z(i,j)];
}

Figure 6.6: Example of the execution time estimation for the function MultiplY of the code Harris; each comment provides the complexity estimation of the statement below (N and M are assumed to be global variables)
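As a worked instance of this estimation, with a hypothetical input image of N = M = 1024 (a value chosen here only for illustration) and the default unit-cost model, the complexity polynomial of Figure 6.6 evaluates to:

   task_time(τ) = N × M × 17 + 2 = 1024 × 1024 × 17 + 2 = 17,825,794

abstract time units for the vertex labeled with S1 = MultiplY(Ixx, Gx, Gx).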

6.3.2 From Polynomials to Values

We have just seen how to represent cost, data and time information in terms of polynomials; yet running BDSC requires actual values. This is particularly a problem when computing tlevels and blevels, since cost and time are cumulated there. We convert cost information into time by assuming that communication times are proportional to costs, which amounts in particular to setting the communication latency α to zero. This assumption is validated by our experimental results and by the fact that data arrays are usually large in the application domain we target.

Static Approach: Polynomials Approximation

When the program variables used in the above-defined polynomials are numerical values, each polynomial is a constant; this happens to be the case for one of our applications, ABF. However, when input data are unknown at compile time (as for the Harris application), we suggest using a very simple heuristic to replace the polynomials by numerical constants. When all polynomials at stake are monomials on the same base, we simply keep the coefficient of these monomials as costs. Even though this heuristic appears naive at first, it actually is quite useful in the Harris application. Table 6.1 shows the complexities for time estimation and communication costs (the last line) generated for each function of Harris using the PIPS default operation and communication cost model, where the N and M variables represent the input image size.

In our implementation (see Section 8.3.3), we use a more precise operation and communication cost model that depends on the target machine. It relies on a number of parameters of a given CPU, such as the cost of an addition, of a multiplication or of the communication of one byte, for an Intel CPU, all these in numbers of cycle units (CPI). In Section 8.3.3, we


Function               Complexity and Transfer     Numerical
                       (polynomial)                estimation
InitHarris             9 × N × M                   9
SobelX                 60 × N × M                  60
SobelY                 60 × N × M                  60
MultiplY               17 × N × M                  17
Gauss                  85 × N × M                  85
CoarsitY               34 × N × M                  34
One image transfer     4 × N × M                   4

Table 6.1: Execution and communication time estimations for Harris using the PIPS default cost model (the N and M variables represent the input image size)

detail these parameters for the CPUs used.
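A brief justification of why this coefficient-only heuristic is safe for Harris (our observation, under the assumption, visible in Table 6.1, that every execution time and communication cost carries the same N × M factor): dividing all vertex and edge weights by the common factor N × M rescales every top level, bottom level and completion time by the same positive constant,

   tlevel'(τ) = tlevel(τ) / (N × M),   blevel'(τ) = blevel(τ) / (N × M),

so all the time-based comparisons BDSC performs (priorities, start times, completion times) are left unchanged; only the memory-related checks, which are handled separately through data sizes, are not covered by this argument.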

Dynamic Approach: Instrumentation

The general case deals with polynomials that are functions of many variables, such as the ones that occur in the SPEC2001 benchmark equake and that depend on the variables ARCHelems or ARCHnodes and timessteps (see the non-bold code in Figure 6.7, which shows a part of the equake code).

In such cases, we first automatically instrument the input sequential code and run it once in order to obtain the numerical values of the polynomials. The instrumented code contains the initial user code plus instructions that compute the numerical values of the cost polynomials for each statement. BDSC is then applied, using this cost information, to yield the final parallel program. Note that this approach is sound, since BDSC ensures that the value of a variable (and thus of a polynomial) is the same whichever scheduling is used. Of course, this approach works well, as our experiments show (see Chapter 8), when a program's complexity and communication polynomials do not change when some of its input parameters are modified (partial evaluation). This is the case for many signal processing benchmarks, where performance is mostly a function of structure parameters, such as image size, and is independent of the actual signal (pixel) values upon which the program acts.

We show an example of this final case using a part of the instrumented equake code⁶ in Figure 6.7. The added instrumentation instructions are fprintf statements, the second parameter of which represents the statement number of the following statement and the third the value of its execution time, for task_time instrumentation. For edge_cost instrumentation, the second parameter is the number of the incident statements of the

⁶ We do not show the instrumentation on the statements inside the loops, for readability purposes.


FILE *finstrumented = fopen("instrumented_equake.in", "w");

fprintf(finstrumented, "task_time 62 = %ld\n", 179*ARCHelems + 3);
for (i = 0; i < ARCHelems; i++)
   for (j = 0; j < 4; j++)
      cor[j] = ARCHvertex[i][j];

fprintf(finstrumented, "task_time 163 = %ld\n", 20*ARCHnodes + 3);
for (i = 0; i <= ARCHnodes-1; i += 1)
   for (j = 0; j <= 2; j += 1)
      disp[disptplus][i][j] = 0.0;

fprintf(finstrumented, "edge_cost 163 -> 166 = %ld\n", ARCHnodes * 9);
fprintf(finstrumented, "task_time 166 = %ld\n", 110*ARCHnodes + 106);
smvp_opt(ARCHnodes, K, ARCHmatrixcol, ARCHmatrixindex,
         disp[dispt], disp[disptplus]);

fprintf(finstrumented, "task_time 167 = %d\n", 6);
time = iter*Exc.dt;

fprintf(finstrumented, "task_time 168 = %ld\n", 510*ARCHnodes + 3);
/* S */
for (i = 0; i < ARCHnodes; i++)
   for (j = 0; j < 3; j++) {
      /* Sbody */
      disp[disptplus][i][j] *= -Exc.dt*Exc.dt;
      disp[disptplus][i][j] +=
         2.0*M[i][j]*disp[dispt][i][j] - (M[i][j] -
         Exc.dt/2.0*C[i][j])*disp[disptminus][i][j] -
         Exc.dt*Exc.dt*(M23[i][j]*phi2(time)/2.0 +
                        C23[i][j]*phi1(time)/2.0 +
                        V23[i][j]*phi0(time)/2.0);
      disp[disptplus][i][j] =
         disp[disptplus][i][j]/(M[i][j] + Exc.dt/2.0*C[i][j]);
      vel[i][j] = 0.5/Exc.dt*(disp[disptplus][i][j] -
                              disp[disptminus][i][j]);
   }

fprintf(finstrumented, "task_time 175 = %d\n", 2);
disptminus = dispt;
fprintf(finstrumented, "task_time 176 = %d\n", 2);
dispt = disptplus;
fprintf(finstrumented, "task_time 177 = %d\n", 2);
disptplus = i;

Figure 6.7: Instrumented part of equake (Sbody is the inner loop sequence)

edge, and the third the edge cost polynomial value. After execution of the instrumented code, the numerical results of the polynomials are printed in the file instrumented_equake.in, presented in Figure 6.8. This file is an input for the PIPS implementation of BDSC. We parse this file in order to extract the execution and communication estimation times (in numbers of cycles), to be used as parameters for the computation of the top levels and bottom levels necessary for the BDSC algorithm (see Chapter 5).

task_time 62 = 35192835

task_time 163 = 718743

edge_cost 163 -> 166 = 323433

task_time 166 = 3953176

task_time 167 = 6

task_time 168 = 18327873

task_time 175 = 2

task_time 176 = 2

task_time 177 = 2

Figure 6.8: Numerical results of the instrumented part of equake (instrumented_equake.in)
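As an illustration of how such a numerical profile can be consumed, here is a minimal C sketch that parses the two kinds of lines shown in Figure 6.8; the arrays and bounds below are hypothetical and do not reflect how PIPS actually stores these values.

#include <stdio.h>

#define MAX_STMT 256  /* hypothetical bound on statement numbers */

static long task_time[MAX_STMT];            /* indexed by statement number */
static long edge_cost[MAX_STMT][MAX_STMT];  /* indexed by (source, sink)   */

/* Parse "task_time <stmt> = <value>" and
   "edge_cost <src> -> <dst> = <value>" lines. */
static void parse_profile(const char *filename)
{
   FILE *f = fopen(filename, "r");
   char line[256];
   int s, d;
   long v;
   if (f == NULL)
      return;
   while (fgets(line, sizeof line, f) != NULL) {
      if (sscanf(line, "task_time %d = %ld", &s, &v) == 2 &&
          s >= 0 && s < MAX_STMT)
         task_time[s] = v;
      else if (sscanf(line, "edge_cost %d -> %d = %ld", &s, &d, &v) == 3 &&
               s >= 0 && s < MAX_STMT && d >= 0 && d < MAX_STMT)
         edge_cost[s][d] = v;
   }
   fclose(f);
}

int main(void)
{
   parse_profile("instrumented_equake.in");
   printf("task_time(166) = %ld, edge_cost(163 -> 166) = %ld\n",
          task_time[166], edge_cost[163][166]);
   return 0;
}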

This section supplies the BDSC scheduling algorithm with what it relies upon: weights on the vertices (execution time and storage data) and edges (communication cost) of its entry graph. Indeed, we exploit the complexity and array region analyses provided by PIPS to harvest all this information statically. However, since two sets of regions must be defined in the same memory store to make it possible to compare them, the second set of regions must be combined with the path transformer, which we define in the next section, and which connects the two memory stores of these two sets of regions.

6.4 Reachability Analysis: The Path Transformer

Our cost model, defined in Section 6.3, uses affine relations, such as convex array regions, to estimate the communications and data volumes induced by the execution of statements. We used a set of operations on array regions, such as the function regions_intersection (see Section 6.3.1). However, this is not sufficient, because regions should be combined with what we call path transformers, in order to propagate the memory stores used in them. Indeed, regions must be defined with respect to a common memory store for region operations to be properly performed.

A path transformer permits the comparison of array regions of statements originally defined in different memory stores. The path transformer between two statements computes the possible changes performed by a piece of code delimited by two statements Sbegin and Send, enclosed within a statement S. A path transformer is represented by a convex polyhedron over program variables, and its computation is based on transformers (see Section 2.6.1); thus it is a system of linear inequalities.


Our path transformer analysis can be seen as a special case of reachability analysis, a particular kind of graph analysis where one checks whether a given final state can be reached from an initial one. In the domain of programming languages, one can find a practical application of this technique in the computation, for general control-flow graphs, of the transitive closure of the relation corresponding to statement transitions (see for instance [106]). Since the scientific applications we target in this thesis use mostly for loops, we propose in this section a dedicated analysis, called path transformer, that provides, in the restricted case of for loops and tests, analytical expressions for such transitive closures.

6.4.1 Path Definition

A "path" specifies a statement execution instance, including the mapping of loop indices to values and the specific syntactic statement of interest. A path is defined by a pair P = (M, S), where M is a mapping function from Ide to Exp⁷ and S is a statement, specified by its syntactic path from the top program syntax tree to the particular statement. Indeed, a particular statement Si may correspond to a particular iteration M(i) within a set of iterations (created by a loop of index i). Paths are used to store the values of the indices of the loops enclosing statements. Every loop on the path has its entry in M; indices of loops not related to it are not present.

We provide a simple program example in Figure 6.9 that shows the AST of a code that contains a loop. We search for the path transformer between Sb and Se; we suppose that S, the smallest subtree that encloses the two statements, is forloop0. As the two statements are enclosed into a loop, we have to specify the indices of these statements inside the loop, e.g., M(i) = 2 for Sb and M(i) = 5 for Se. The path execution trace between Sb and Se is illustrated in Figure 6.10. The chosen path runs through Sb at the second iteration of the loop until Se at the fifth iteration.

In this section, we assume that Send is reachable [43] from Sbegin. This means that, for each enclosing loop S (Sbegin ⊂ S and Send ⊂ S), where S = forloop(I, Elower, Eupper, Sbody), Elower ≤ Eupper (loops are always executed) and, for each mapping of an index of these loops, Elower ≤ ibegin ≤ iend ≤ Eupper. Moreover, since there is no execution trace from the true branch St to the false branch Sf of a test statement Stest, Send is not reachable from Sbegin if, for each mapping of an index of a loop that encloses Stest, ibegin = iend, or if there is no loop enclosing the test.

6.4.2 Path Transformer Algorithm

In this section, we present an algorithm that computes affine relations between program variables at any two points, or "paths", in the program, from

⁷ Ide and Exp are the sets of identifiers I and expressions E, respectively.


forloop0
   index: i,  lower bound: 1,  upper bound: 10,
   body: sequence0 = [S0, S1, Sb, S2, Se, S3]

Figure 6.9: An example of a subtree of root forloop0

Sb(2), S2(2), Se(2), S3(2),
S0(3), S1(3), Sb(3), S2(3), Se(3), S3(3),
S0(4), S1(4), Sb(4), S2(4), Se(4), S3(4),
S0(5), S1(5), Sb(5), S2(5), Se(5)

Figure 6.10: An example of a path execution trace from Sb to Se in the subtree forloop0

the memory store of Sbegin to the memory store of Send, along any sequential execution that links the two memory states. path_transformer(S, Pbegin, Pend) is the path transformer of the subtree of S with all vertices not on the Pbegin-to-Pend path execution trace pruned. It is the path transformer between statements Sbegin and Send if Pbegin = (Mbegin, Sbegin) and Pend = (Mend, Send).

path_transformer uses a variable mode to indicate the current context on the compile-time version of the execution trace. Modes are propagated along the depth-first, left-to-right traversal of statements: (1) mode = sequence, when there is no loop on the trace at the current statement level; (2) mode = prologue, on the prologue part of an enclosing loop; (3) mode = permanent, on the full part of an enclosing loop; and (4) mode = epilogue, on the epilogue part of an enclosing loop. The algorithm is described in Algorithm 15.

A first call to the recursive function pt_on(S, Pbegin, Pend, mode), with mode = sequence, is made. In this function, many path transformer operations are performed. A path transformer T is recursively defined by:

   T ::= id | ⊥ | T1 ∘ T2 | T1 ⊔ T2 | T^k | T(S)

These operations on path transformers are specified as follows:

• id is the transformer identity, used when no variables are modified, i.e., they are all constrained by an equality;

• ⊥ denotes the undefined (empty) transformer, i.e., one in which the set of affine constraints of the system is not feasible;

• '∘' denotes the operation of composition of transformers. T1 ∘ T2 is overapproximated by the union of the constraints in T1 (on variables x1, ..., xn, x″1, ..., x″p) with the constraints in T2 (on variables x″1, ..., x″p, x′1, ..., x′n′), then projected on x1, ..., xn, x′1, ..., x′n′ to eliminate the "intermediate" variables x″1, ..., x″p (this definition is extracted from [76]). Note that T ∘ id = T, for all T; a small worked example is given right after this list;

• '⊔' denotes the convex hull of its transformer arguments. The usual union of two transformers is a union between two convex polyhedra, which is not necessarily a convex polyhedron; a convex hull approximation is required in order to obtain a convex result: the convex hull operation provides the smallest polyhedron enclosing the union;

• 'k' denotes the effect of k iterations of a transformer T. In practice, heuristics are used to compute a transitive closure approximation T*; thus this operation generally provides an approximation of the transformer T^k. Note that PIPS uses here the Affine Derivative Closure algorithm [18];

• T(S) is the transformer of a statement S, defined in Section 2.6.1; we suppose these transformers already exist. T is defined by a list of arguments and a predicate system labeled by T.
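A small worked example of the composition operation (the two transformers below are chosen by us for illustration; they are not taken from PIPS output):

   T1 = T(i) {i' = i + 5}     (transformer of i = i + 5)
   T2 = T(i) {i' = 3 i}       (transformer of i = i * 3)

To build T1 ∘ T2, the intermediate store is renamed into i'':

   {i'' = i + 5} ∪ {i' = 3 i''}

and i'' is projected out, which yields

   T1 ∘ T2 = T(i) {i' = 3 i + 15},

the exact effect of executing i = i + 5; i = i * 3 in program order. Composing T(i) {i' = 42} on the left then gives {i' = 141}, the constant found in Figure 6.11.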

We distinguish several cases in the pt_on(S, (Mbegin, Sbegin), (Mend, Send), mode) function. The computation of the path transformer begins when S is Sbegin, i.e., the path is beginning; we return the transformer of


ALGORITHM 15: AST-based path transformer algorithm

function path_transformer(S, (Mbegin, Sbegin), (Mend, Send))
   return pt_on(S, (Mbegin, Sbegin), (Mend, Send), sequence)
end

function pt_on(S, (Mbegin, Sbegin), (Mend, Send), mode)
   if (Send = S) then return id
   elseif (Sbegin = S) then return T(S)
   switch (S)
      case call:
         return (mode = permanent
                 or (mode = sequence and Sbegin <= S and S < Send)
                 or (mode = prologue and Sbegin <= S)
                 or (mode = epilogue and S < Send)) ? T(S) : id
      case sequence(S1; S2; ...; Sn):
         if (|{S1, S2, ..., Sn}| = 0) then return id
         else
            rest = sequence(S2; ...; Sn)
            return pt_on(S1, (Mbegin, Sbegin), (Mend, Send), mode) ∘
                   pt_on(rest, (Mbegin, Sbegin), (Mend, Send), mode)
      case test(Econd, St, Sf):
         Tt = pt_on(St, (Mbegin, Sbegin), (Mend, Send), mode)
         Tf = pt_on(Sf, (Mbegin, Sbegin), (Mend, Send), mode)
         if ((Sbegin ⊂ St and Send ⊂ Sf) or (Send ⊂ St and Sbegin ⊂ Sf)) then
            if (mode = permanent) then
               return Tt ⊔ Tf
            else
               return ⊥
         else if (Sbegin ⊂ St or Send ⊂ St) then
            return T(Econd) ∘ Tt
         else if (Sbegin ⊂ Sf or Send ⊂ Sf) then
            return T(¬Econd) ∘ Tf
         else
            return T(S)
      case forloop(I, Elower, Eupper, Sbody):
         return pt_on_loop(S, (Mbegin, Sbegin), (Mend, Send), mode)
end

Sbegin. It ends when S is Send; there, we close the path and we return the identity transformer id. The other cases depend on the type of the statement, as follows.


Call

When S is a function call, we return the transformer of S if S belongs to the paths from Sbegin to Send, and id otherwise; we use the lexicographic ordering < to compare statement paths.

Sequence

When S is a sequence of n statements, we return the composition of the transformers of each statement in this sequence. An example of the result provided by the path transformer algorithm for a sequence is illustrated in Figure 6.11. The left side presents a code fragment that contains a sequence of four instructions. The right side shows the resulting path transformer between the two statements Sb and Se: it is the result of the composition of the transformers of the first three statements; note that the transformer of Se is not included.

/* S */
{
   /* Sb */
   i = 42;
   i = i+5;
   i = i*3;
   /* Se */
   i = i+4;
}

path_transformer(S, (⊥, Sb), (⊥, Se)) =
   T(i) {i = 141}

Figure 6.11: An example of a C code and the path transformer of the sequence between Sb and Se; M is ⊥, since there are no loops in the code

Test

When S is a test, the path transformer is computed according to the location of Sbegin and Send in the branches of the test. If one is in the true statement and the other in the false statement of the test, the path transformer is the union of the true and false transformers iff the mode is permanent, i.e., there is at least one loop enclosing S. If only one of the two statements is enclosed in the true or in the false branch of the test, the transformer of the condition (true or false) is composed with the transformer of the corresponding enclosing branch (true or false) and is returned. An example of the result provided by the path transformer algorithm for a test is illustrated in Figure 6.12. The left side presents a code fragment that contains a test. The right side shows the resulting path transformer between the two statements Sb and Se: it is the result of the composition of the transformers of the statements from Sb to the statement just before the test, with the transformer of the true condition (i>0) and of the statements in the true branch just before Se; note that the transformer of Se is not included.


/* S */
{
   /* Sb */
   i = 4;
   i = i*3;
   if (i > 0) {
      k1++;
      /* Se */
      i = i+4;
   }
   else {
      k2++;
      i--;
   }
}

path_transformer(S, (⊥, Sb), (⊥, Se)) =
   T(i, k1) {
      i = 12,
      k1 = k1#init + 1
   }

Figure 6.12: An example of a C code and the path transformer of the sequence of calls and test between Sb and Se; M is ⊥, since there are no loops in the code

Loop

When S is a loop, we call the function pt_on_loop (see Algorithm 16). Note that, if loop indices are not specified for a statement in the map function, the algorithm posits ibegin equal to the lower bound of the loop and iend to the upper one. In this function, a composition of two transformers is computed, depending on the value of mode: (1) the prologue transformer, from the iteration ibegin until the upper bound of the loop, or the iteration iend if Send belongs to this loop; and (2) the epilogue transformer, from the lower bound of the loop until the upper bound of the loop, or the iteration iend if Send belongs to this loop. These two transformers are computed using the function iter, which we show in Algorithm 17.

iter computes the transformer of a number of iterations of a loop S, when loop bounds are constant or symbolic functions, via the predicate function is_constant_p on an expression, which returns true if this expression is constant. It computes a loop transformer that is the transformer of the proper number of iterations of this loop; it is equal to the transformer of the looping condition Tenter, composed with the transformer of the loop body, composed with the transformer of the incrementation statement Tinc. In practice, when h − l is not a constant, a transitive closure approximation T* is computed, using the function affine_derivative_closure already implemented in PIPS, and generally provides an approximation of the transformer T^(h−l+1).

The case where ibegin = iend means that we execute only one iteration of the loop from Sbegin to Send. Therefore, we compute Ts, i.e., a transformer on the loop body with the sequence mode, using Function


ALGORITHM 16: Case for loops of the path transformer algorithm

function pt_on_loop(S, (Mbegin, Sbegin), (Mend, Send), mode)
   forloop(I, Elower, Eupper, Sbody) = S
   low = ibegin = (undefined(Mbegin(I))) ? Elower : Mbegin(I)
   high = iend = (undefined(Mend(I))) ? Eupper : Mend(I)
   Tinit = T(I = Elower)
   Tenter = T(I <= Eupper)
   Tinc = T(I = I+1)
   Texit = T(I > Eupper)
   if (mode = permanent) then
      T = iter(S, (Mbegin, Sbegin), (Mend, Send), Elower, Eupper)
      return Tinit ∘ T ∘ Texit
   Ts = pt_on_loop_one_iteration(S, (Mbegin, Sbegin), (Mend, Send))
   Tp = Te = p = e = id
   if (mode = prologue or mode = sequence) then
      if (Sbegin ⊂ S)
         p = pt_on(Sbody, (Mbegin, Sbegin), (Mend, Send), prologue) ∘ Tinc
         low = ibegin+1
      high = (mode = sequence) ? iend-1 : Eupper
      Tp = p ∘ iter(S, (Mbegin, Sbegin), (Mend, Send), low, high)
   if (mode = epilogue or mode = sequence) then
      if (Send ⊂ S)
         e = Tenter ∘ pt_on(Sbody, (Mbegin, Sbegin), (Mend, Send), epilogue)
         high = (mode = sequence) ? -1 : iend-1
      low = Elower
      Te = iter(S, (Mbegin, Sbegin), (Mend, Send), low, high) ∘ e
   T = Tp ∘ Te
   T = ((Sbegin ⊄ S) ? Tinit : id) ∘ T ∘ ((Send ⊄ S) ? Texit : id)
   if (is_constant_p(Eupper) and is_constant_p(Elower)) then
      if (mode = sequence and ibegin = iend) then
         return Ts
      else
         return T
   else
      return T ⊔ Ts
end

pt_on_loop_one_iteration, presented in Algorithm 17. This eliminates the statements outside the path from Sbegin to Send. Moreover, since with symbolic parameters we cannot have this precision (we execute one iteration or more), a union of T and Ts is returned. The result should be composed with the transformer of the index initialization Tinit and/or the exit condition of the loop Texit, depending on the position of Sbegin and Send in the loop.

An example of the result provided by the path transformer algorithm is illustrated in Figure 6.13. The left side presents a code that contains a


ALGORITHM 17: Transformers for one iteration and for a constant/symbolic number of iterations h − l + 1 of a loop S

function pt_on_loop_one_iteration(S, (Mbegin, Sbegin), (Mend, Send))
   forloop(I, Elower, Eupper, Sbody) = S
   Tinit = T(I = Elower)
   Tenter = T(I <= Eupper)
   Tinc = T(I = I+1)
   Texit = T(I > Eupper)
   Ts = pt_on(Sbody, (Mbegin, Sbegin), (Mend, Send), sequence)
   Ts = (Sbegin ⊂ S and Send ⊄ S) ? Ts ∘ Tinc ∘ Texit : Ts
   Ts = (Send ⊂ S and Sbegin ⊄ S) ? Tinit ∘ Tenter ∘ Ts : Ts
   return Ts
end

function iter(S, (Mbegin, Sbegin), (Mend, Send), l, h)
   forloop(I, Elower, Eupper, Sbody) = S
   Tenter = T(I <= h)
   Tinc = T(I = I+1)
   Tbody = pt_on(Sbody, (Mbegin, Sbegin), (Mend, Send), permanent)
   Tcomplete_body = Tenter ∘ Tbody ∘ Tinc
   if (is_constant_p(Eupper) and is_constant_p(Elower))
      return T(I = l) ∘ (Tcomplete_body)^(h−l+1)
   else
      Tstar = affine_derivative_closure(Tcomplete_body)
      return Tstar
end

sequence of two loops and an incrementation instruction, which makes the computation of the path transformer required when performing parallelization. Indeed, comparing array accesses in the two loops must take into account the fact that n used in the first loop is not the same as n in the second one, because it is incremented between the two loops. The right side shows the resulting path transformer between the two statements Sb and Se: it is the result of the call to path_transformer(S, (Mb, Sb), (Me, Se)). We suppose here that Mb(i) = 0 and Me(j) = n − 1. Notice that the constraint n = n#init + 1 is important to detect that n is changed between the two loops, via the n++ instruction.

An optimization, for the cases of a statement S of type sequence or loop, would be to return the transformer of S, T(S), if S contains neither Sbegin nor Send.


/* S */
{
   /* l1 */
   for (i = 0; i < n; i++) {
      /* Sb */
      a[i] = a[i]+1;
      k1++;
   }
   n++;
   /* l2 */
   for (j = 0; j < n; j++) {
      k2++;
      /* Se */
      b[j] = a[j]+a[j+1];
   }
}

path_transformer(S, (Mb, Sb), (Me, Se)) =
   T(i, j, k1, k2, n) {
      i + k1#init = i#init + k1,
      j + k2#init = k2 - 1,
      n = n#init + 1,
      i#init + 1 <= i,
      n#init <= i,
      k2#init + 1 <= k2,
      k2 <= k2#init + n
   }

Figure 6.13: An example of a C code and the path transformer between Sb and Se

6.4.3 Operations on Regions using Path Transformers

In order to make possible operations combining the sets of regions of two statements S1 and S2, the set of regions R1 of S1 must be brought to the store of the second set of regions, R2, of S2. Therefore, we first compute the path transformer that links the two statements of the two sets of regions, as follows:

   T12 = path_transformer(S, (M1, S1), (M2, S2))

where S is a statement that verifies S1 ⊂ S and S2 ⊂ S; M1 and M2 contain information about the indices of these statements in the potentially existing loops in S. Then, we compose the second set of regions, R2, with T12, to bring R2 to a common memory store with R1. We thus use the operation region_transformer_compose, already implemented in PIPS [34], as follows:

   R′2 = region_transformer_compose(R2, T12)

We thus update the operations defined in Section 2.6.3: (1) the intersection of the two sets of regions R1 and R2 is obtained via this call:

   regions_intersection(R1, R′2)

(2) the difference between R1 and R2 is obtained via this call:

   regions_difference(R1, R′2)

and (3) the union of R1 and R2 is obtained via this call:

   regions_union(R1, R′2)

For example, if we want to compute the RAW dependences between the Write regions Rw1 of Loop l1 and the Read regions Rr2 of Loop l2 in the code in Figure 6.13, we must first find T12, the path transformer between the two loops. The result is then:

   RAW(l1, l2) = regions_intersection(Rw1,
                    region_transformer_compose(Rr2, T12))

If we assume, for example, that the size of Arrays a and b is n = 20, we obtain the following results:

   Rw1 = <a(φ1)-W-EXACT-{0 ≤ φ1, φ1 ≤ 19, φ1 + 1 ≤ n}>
   Rr2 = <a(φ1)-R-EXACT-{0 ≤ φ1, φ1 ≤ 19, φ1 ≤ n}>
   T12 = T(i, k1, n) {k1 = k1#init + i, n = n#init + 1, i ≥ 0, i ≥ n#init}
   R′r2 = <a(φ1)-R-EXACT-{0 ≤ φ1, φ1 ≤ 19, φ1 ≤ n + 1}>
   Rw1 ∩ R′r2 = <a(φ1)-WR-EXACT-{0 ≤ φ1, φ1 ≤ 19, φ1 + 1 ≤ n}>

Notice how the φ1 accesses to Array a differ in Rr2 and R′r2 (after composition with T12), because of the equality n = n#init + 1 in T12, due to the modification of n between the two loops via the n++ instruction.

In our context of combining path transformers with regions to compute communication costs, dependences and data volumes between the vertices of a DAG, for each loop of index i, M(i) must be set to the lower bound of the loop and iend to the upper one (the default values). We are not interested here in computing the path transformer between statements of specific iterations. However, our algorithm is general enough to handle loop parallelism, for example, or transformations on loops where one wants access to statements of a particular iteration.

6.5 BDSC-Based Hierarchical Scheduling (HBDSC)

Now that all the information needed by the basic version of BDSC presented in the previous chapter has been gathered, we detail in this section how we suggest to adapt it to different SDGs, linked hierarchically via the mapping function H introduced in Section 6.2.2 on page 96, in order to eventually generate nested parallel code when possible. We adopt in this section the unstructured parallel programming model of Section 4.2.2, since it offers the freedom to implement arbitrary parallel patterns and since SDGs implement this model. Therefore, we use the parallel unstructured construct of SPIRE to encode the generated parallel code.


6.5.1 Closure of SDGs

An SDG G of a nested sequence S may depend upon statements outside S. We introduce the function closure, defined in Algorithm 18, to provide a self-contained version of G. There, G is completed with a set of entry vertices and edges, in order to recover dependences coming from outside S. We use the predecessors τp in the DDG D to find these dependences Rs; for each dependence, we create an entry vertex and we schedule it in the same cluster as τp. Function reference is used here to find the reference expression of the dependence. Note that these import vertices will be used to schedule S using BDSC, in order to compute the updated values of tlevel and blevel impacted by the communication cost of each vertex in G relative to all SDGs in H. We set their task_time to 0.

ALGORITHM 18: Computation of the closure of the SDG G of a statement S = sequence(S1; ...; Sn)

function closure(sequence(S1; ...; Sn), G)
   D = DDG(sequence(S1; ...; Sn))
   foreach Si ∈ {S1, S2, ..., Sn}
      τd = statement_vertex(Si, D)
      τk = statement_vertex(Si, G)
      foreach τp / (τp ∈ predecessors(τd, D) and τp ∉ vertices(G))
         Rs = regions_intersection(read_regions(Si),
                                   write_regions(vertex_statement(τp)))
         foreach R ∈ Rs
            sentry = import(reference(R))
            τentry = new_vertex(sentry)
            cluster(τentry) = cluster(τp)
            add_vertex(τentry, G)
            add_edge((τentry, τk), G)
   return G
end

For instance, in the C code given in the left of Figure 6.3, parallelizing S0 may decrease its execution time, but it may also increase the execution time of the loop indexed by i; this could be due to communications from outside S0, if we presume, to simplify, that every statement is scheduled on a different cluster. The corresponding SDG of S0, presented in Figure 6.4, is not complete, because of dependences, not explicit in this DAG, coming from outside S0. In order to make G0 self-contained, we add these dependences via a set of additional import entry vertices τentry, scheduled in the same cluster as the corresponding statements outside S0. We extract these dependences from the DDG presented in the right of Figure 6.3, as illustrated in Function closure; the closure of G0 is given in Figure 6.14. The scheduling of S0 needs to take into account all the dependences coming from outside S0; we


thus add three vertices to encode the communication costs of i, a[i] and c at the entry of S0.

Figure 6.14: Closure of the SDG G0 of the C code S0 in Figure 6.3, with additional import entry vertices

6.5.2 Recursive Top-Down Scheduling

Hierarchically scheduling a given statement S of SDG H(S) in a cluster κ is seen here as the definition of a hierarchical schedule σ which maps each substatement s of S to σ(s) = (s′, κ, n). If there are enough processor and memory resources to schedule S using BDSC, (s′, κ, n) is a triplet made of a parallel statement s′ = parallel(σ(s)), the cluster κ = cluster(σ(s)) where s is being allocated, and the number n = nbclusters(σ(s)) of clusters the inner scheduling of s′ requires. Otherwise, scheduling is impossible, and the program stops. In a scheduled statement, all sequences are replaced by parallel unstructured statements.
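To make the shape of σ concrete, the following C sketch models the triplet (s′, κ, n) attached to each scheduled statement and the lookup σ(s); the type and field names (schedule_entry, stmt, cluster_id) are ours, not the PIPS implementation.

  /* A minimal sketch of the schedule triplet (s', kappa, n). */
  typedef struct stmt stmt;          /* opaque (parallel or sequential) statement */
  typedef int cluster_id;            /* kappa: the cluster allocated to s         */

  typedef struct {
    stmt      *parallel_stmt;        /* s' = parallel(sigma(s))                   */
    cluster_id cluster;              /* kappa = cluster(sigma(s))                 */
    int        nbclusters;           /* n = nbclusters(sigma(s))                  */
  } schedule_entry;

  /* sigma represented as an association list from statements to triplets. */
  typedef struct schedule {
    stmt            *key;
    schedule_entry   value;
    struct schedule *next;
  } schedule;

  /* sigma(s): returns NULL when s is not in domain(sigma). */
  static schedule_entry *schedule_lookup(schedule *sigma, stmt *s) {
    for (; sigma != NULL; sigma = sigma->next)
      if (sigma->key == s)
        return &sigma->value;
    return NULL;
  }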

A successful call to the HBDSC(S, H, κ, P, M, σ) function, defined in Algorithm 19, which assumes that P is strictly positive, yields a new version of σ that schedules S into κ and takes into account all substatements of S; only P clusters, with a data size at most M each, can be used for scheduling. σ[S → (S′, κ, n)] is the function equal to σ except for S, where its value is (S′, κ, n). H is the function that yields an SDG for each S to be scheduled using BDSC.

Our approach is top-down, in order to yield tasks that are as coarse-grained as possible when dealing with sequences. In the HBDSC function, we


ALGORITHM 19: BDSC-based update of Schedule σ for Statement S of SDG H(S), with P and M constraints

function HBDSC(S, H, κ, P, M, σ)
  switch (S)
    case call:
      return σ[S → (S, κ, 0)]
    case sequence(S1;...;Sn):
      Gseq = closure(S, H(S))
      G′ = BDSC(Gseq, P, M, σ)
      iter = 0
      do
        σ′ = HBDSC_step(G′, H, κ, P, M, σ)
        G = G′
        G′ = BDSC(Gseq, P, M, σ′)
        if (clusters(G′) = ∅) then
          abort('Unable to schedule')
        iter++
        σ = σ′
      while (completion_time(G′) < completion_time(G) and
             |clusters(G′)| ≤ |clusters(G)| and
             iter ≤ MAX_ITER)
      return σ[S → (dag_to_unstructured(G), κ, |clusters(G)|)]
    case forloop(I, Elower, Eupper, Sbody):
      σ′ = HBDSC(Sbody, H, κ, P, M, σ)
      (S′body, κbody, nbclustersbody) = σ′(Sbody)
      return σ′[S → (forloop(I, Elower, Eupper, S′body), κ, nbclustersbody)]
    case test(Econd, St, Sf):
      σ′ = HBDSC(St, H, κ, P, M, σ)
      σ″ = HBDSC(Sf, H, κ, P, M, σ′)
      (S′t, κt, nbclusterst) = σ″(St)
      (S′f, κf, nbclustersf) = σ″(Sf)
      return σ″[S → (test(Econd, S′t, S′f), κ, max(nbclusterst, nbclustersf))]
end

distinguish four cases of statements. First, the constructs of loops⁸ and tests are simply traversed, scheduling information being recursively gathered in different SDGs. Then, for a call statement, there is no descent in the call

⁸ Regarding parallel loops, since we adopt the task parallelism paradigm, note that, initially, it may be useful to apply the tiling transformation and then perform full unrolling of the outer loop (we give more details in the protocol of our experiments in Section 8.4). This way, the input code contains more potentially parallel tasks, resulting from the initial (parallel) loop.


graph: the call statement is returned. In order to handle the corresponding call function, one has to treat separately the different functions. Finally, for a sequence S, one first accesses its SDG and computes a closure of this DAG, Gseq, using the function closure defined above. Next, Gseq is scheduled using BDSC, to generate a scheduled SDG G′.

The hierarchical scheduling process is then recursively performed, to take into account the substatements of S, within Function HBDSC_step, defined in Algorithm 20, on each statement s of each task of G′. There, G′ is traversed along a topological sort-ordered descent using the function topsort(G′), which yields a list of stages of computation, each cluster_stage being a list of independent lists L of tasks τ, one L for each cluster κ generated by BDSC for this particular stage in the topological order.

The recursive hierarchical scheduling, via HBDSC, within the function HBDSC_step, of each statement s = vertex_statement(τ) may take advantage of at most P′ available clusters, since |cluster_stage| clusters are already reserved to schedule the current stage cluster_stage of tasks for Statement S. It yields a new scheduling function σs. Otherwise, if no clusters are available, all statements enclosed into s are scheduled on the same cluster as their parent, κ. We use the straightforward function same_cluster_mapping (not provided here) to assign recursively (se, κ, 0) to σ(se), for each se enclosed into s.
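Since same_cluster_mapping is not listed in the text, the following C sketch shows one way the described behavior could be realized; the child accessors and the schedule update helper are assumptions, only the mapping of every enclosed statement to (se, κ, 0) is taken from the description above.

  /* Sketch of same_cluster_mapping; nb_children, child and sigma_update
     are hypothetical helpers over the IR and the schedule. */
  typedef struct stmt stmt;
  typedef int cluster_id;
  typedef struct schedule schedule;

  extern int   nb_children(stmt *s);
  extern stmt *child(stmt *s, int i);
  extern void  sigma_update(schedule *sigma, stmt *s,
                            stmt *sprime, cluster_id kappa, int nbclusters);

  /* Map s and, recursively, every enclosed statement se to (se, kappa, 0). */
  static void same_cluster_mapping(stmt *s, cluster_id kappa, schedule *sigma) {
    sigma_update(sigma, s, s, kappa, 0);
    for (int i = 0; i < nb_children(s); i++)
      same_cluster_mapping(child(s, i), kappa, sigma);
  }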

Figure 6.15 illustrates the various entities involved in the computation of such a scheduling function. Note that one needs to be careful, in HBDSC_step,

to ensure that each rescheduled substatement s is allocated a number of clusters consistent with the one used when computing its parallel execution time: we check the condition nbclusters′s ≥ nbclusterss, which ensures that the parallelism assumed when computing time complexities within s remains available.

Figure 6.15: topsort(G) for the hierarchical scheduling of sequences

Cluster allocation information for each substatement s, whose vertex in G′ is τ, is maintained in σ via the recursive call to HBDSC, this time with


ALGORITHM 20: Iterative hierarchical scheduling step, for DAG fixpoint computation

function HBDSC_step(G′, H, κ, P, M, σ)
  foreach cluster_stage ∈ topsort(G′)
    P′ = P - |cluster_stage|
    foreach L ∈ cluster_stage
      nbclustersL = 0
      foreach τ ∈ L
        s = vertex_statement(τ)
        if (P′ ≤ 0) then
          σ = σ[s → same_cluster_mapping(s, κ, σ)]
        else
          nbclusterss = (s ∈ domain(σ)) ? nbclusters(σ(s)) : 0
          σs = HBDSC(s, H, cluster(τ), P′, M, σ)
          nbclusters′s = nbclusters(σs(s))
          if (nbclusters′s ≥ nbclusterss and
              task_time(τ, σ) ≥ task_time(τ, σs)) then
            nbclusterss = nbclusters′s
            σ = σs
        nbclustersL = max(nbclustersL, nbclusterss)
      P′ -= nbclustersL
  return σ
end

the current cluster κ = cluster(τ). For the non-sequence constructs in Function HBDSC, cluster information is set to κ, the current cluster.

The scheduling of a sequence yields a parallel unstructured statement: we use the function dag_to_unstructured(G), which returns a SPIRE unstructured statement Su from the SDG G, where the vertices of G are the statement control vertices su of Su, and the edges of G constitute the list of successors Lsucc of su, while the list of predecessors Lpred of su is deduced from Lsucc. Note that the vertices and edges of G are not changed before and after scheduling; however, the scheduling information is saved in σ.
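The following C sketch (hypothetical names, not the PIPS data structures) illustrates the shape of the result: each control vertex keeps its successor list copied from the SDG edges, and the predecessor lists are rebuilt from the successor lists in two passes.

  #include <stdlib.h>

  /* Hypothetical control vertex of a SPIRE unstructured statement. */
  typedef struct control {
    struct stmt     *statement;   /* the task statement of the SDG vertex   */
    struct control **succ;        /* Lsucc: copied from the SDG edges       */
    int              nsucc;
    struct control **pred;        /* Lpred: deduced from the successor lists*/
    int              npred;
  } control;

  /* Rebuild every predecessor list from the successor lists. */
  static void deduce_predecessors(control **vertices, int n) {
    for (int i = 0; i < n; i++)
      vertices[i]->npred = 0;
    for (int i = 0; i < n; i++)        /* first pass: count predecessors    */
      for (int j = 0; j < vertices[i]->nsucc; j++)
        vertices[i]->succ[j]->npred++;
    for (int i = 0; i < n; i++) {      /* allocate, then reuse npred as a   */
      vertices[i]->pred = malloc(vertices[i]->npred * sizeof(control *));
      vertices[i]->npred = 0;          /* fill index                        */
    }
    for (int i = 0; i < n; i++)        /* second pass: fill                 */
      for (int j = 0; j < vertices[i]->nsucc; j++) {
        control *s = vertices[i]->succ[j];
        s->pred[s->npred++] = vertices[i];
      }
  }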

As an application of our scheduling algorithm on a real application, Figure 6.16 shows the scheduled SDG for Harris, obtained using P = 3 and generated automatically with PIPS using the Graphviz tool.

6.5.3 Iterative Scheduling for Resource Optimization

BDSC is called in HBDSC before substatements are hierarchically scheduled. However, a unique pass over substatements could be suboptimal, since parallelism may exist within substatements. It may be discovered by later recursive calls to HBDSC. Yet, if this parallelism had been known ahead of


Figure 6.16: Scheduled SDG for Harris, using P = 3 cores; the scheduling information, via cluster(τ), is also printed inside each vertex of the SDG

time, previous values of task_time used by BDSC would have been possibly smaller, which could have had an impact on the higher-level scheduling. In order to address this issue, our hierarchical scheduling algorithm iterates the top-down pass HBDSC_step on the new DAG G′, in which BDSC takes into account these modified task complexities; iteration continues while G′

provides a smaller DAG scheduling length than G and the iteration limit MAX_ITER has not been reached. We compute the completion time of the DAG G as follows:

completion_time(G) = max_{κ ∈ clusters(G)} cluster_time(κ)
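A direct C transcription of this maximum could look as follows; cluster_time and the array of clusters used by G are assumed helpers, not PIPS code.

  typedef int cluster_id;

  extern double cluster_time(cluster_id kappa);   /* assumed cost-model helper */

  /* completion_time(G): makespan over the clusters used by G. */
  static double completion_time(const cluster_id *clusters, int nclusters) {
    double max = 0.0;
    for (int i = 0; i < nclusters; i++) {
      double t = cluster_time(clusters[i]);
      if (t > max)
        max = t;
    }
    return max;
  }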

One constraint, due to the iterative nature of the hierarchical scheduling, is that, in BDSC, zeroing cannot be performed between the import entry vertices and their successors. This keeps an independence, in terms of allocated clusters, between the different levels of the hierarchy. Indeed, at a higher level, for S, if we assume that we have scheduled the parts Se inside (hierarchically) S, attempting to reschedule S iteratively cancels the precedent schedule of S but maintains the schedule of Se, and vice versa. Therefore, for each sequence, we have to deal with a new set of clusters, and thus zeroing cannot be performed between these entry vertices and their successors.

Note that our top-down iterative hierarchical scheduling approach also helps dealing with limited memory resources. If BDSC fails at first because not enough memory is available for a given task, the HBDSC_step function is nonetheless called to schedule nested statements, possibly loosening up the memory constraints by distributing some of the work on less memory-challenged additional clusters. This might enable the subsequent call to BDSC to succeed.


For instance, in Figure 6.17, we assume that P = 3 and that the memory M of each available cluster κ0, κ1 or κ2 is less than the size of Array a. Therefore, the first application of BDSC on S = sequence(S0;S1) will fail ("Not enough memory"). Thanks to the while loop introduced in the hierarchical scheduling algorithm, the scheduling of the hierarchical parts inside S0 and S1 hopefully minimizes the task data of each statement. Thus, a second schedule succeeds, as is illustrated in Figure 6.18. Note that the edge cost

between tasks S0 and S1 is equal to 0 (the region on the edge in the figure is equal to the empty set). Indeed, the communication is made at an inner level, between κ1 and κ3, due to the entry vertex added to H(S1body). This cost is obtained after the hierarchical scheduling of S; it is thus a parallel edge cost (we define the parallel task_data and edge_cost of a parallel statement task in Section 6.5.5).

S0:
  for(i = 1; i <= n; i++) {   // S0body
    a[i] = foo(); bar(b[i]);
  }
S1:
  for(i = 1; i <= n; i++) {   // S1body
    sum += a[i]; c[i] = baz();
  }

H(S), with the edge from S0 to S1 labeled <a(φ1)-WR-EXACT-{1 ≤ φ1, φ1 ≤ n}>

Figure 6.17: After the first application of BDSC on S = sequence(S0;S1), failing with "Not enough memory"

6.5.4 Complexity of HBDSC Algorithm

Theorem 2. The time complexity of Algorithm 19 (HBDSC) over Statement S is O(k^n), where n is the number of call statements in S and k a constant greater than 1.

Proof. Let t(l) be the worst-case time complexity of our hierarchical scheduling algorithm on the structured statement S of hierarchical level⁹ l.

⁹ Levels represent the hierarchy structure between statements of the AST and are counted up from leaves to the root.


[Figure 6.18 shows the hierarchically scheduled SDGs: in H(S), both S0 (for(i=1;i<=n;i++) S0body: a[i]=foo(); bar(b[i]);) and S1 (for(i=1;i<=n;i++) S1body: sum+=a[i]; c[i]=baz();) are mapped to κ0; in H(S0body), a[i]=foo() is mapped to κ1 and bar(b[i]) to κ2; in closure(S1body, H(S1body)), an import(a[i]) entry vertex is mapped to κ1, sum+=a[i] to κ3 and c[i]=baz() to κ4, the communication edge being labeled <a(φ1)-WR-EXACT-{φ1 = i}>.]

Figure 6.18: Hierarchically (via H) scheduled SDGs, with memory resource minimization, after the second application of BDSC (which succeeds) on sequence(S0;S1) (κi = cluster(σ(Si))). To keep the picture readable, only communication edges are shown in these SDGs.

Time complexity increases significantly only in sequences, loops and tests being simply managed by straightforward recursive calls of HBDSC on substatements. For a sequence S, t(l) is proportional to the time complexity of BDSC, followed by a call to HBDSC_step; the proportionality constant is k = MAX_ITER (supposed to be greater than 1).

The time complexity of BDSC for a sequence of m statements is at most O(m^3) (see Theorem 1). Assuming that all subsequences have a maximum number m of (possibly compound) statements, the time complexity of the hierarchical scheduling step function is the time complexity of the topological sort algorithm followed by a recursive call to HBDSC, and is thus O(m^2 + m·t(l − 1)). Thus, t(l) is at most proportional to k(m^3 + m^2 + m·t(l − 1)) ∼ km^3 + km·t(l − 1). Since t(l) is an arithmetico-geometric series, its analytical


value t(l) is ((km)^l (km^3 + km − 1) − km^3) / (km − 1) ∼ (km)^l m^2. Let lS be the level of the whole Statement S. The worst performance occurs when the structure of S is flat, i.e., when lS ∼ n and m is O(1); hence t(n) = t(lS) ∼ k^n.

Even though the worst-case time complexity of HBDSC is exponential, we expect, and our experiments suggest, that it behaves more tamely on actual, properly structured code. Indeed, note that lS ∼ log_m(n) if S is balanced, for some large constant m; in this case, t(n) ∼ (km)^{log(n)}, showing a subexponential time complexity.

6.5.5 Parallel Cost Models

In Section 6.3, we present the sequential cost models, usable in the case of sequential codes, i.e., for each first call to BDSC. When a substatement Se of S (Se ⊂ S) is parallelized, the parameters task_time, task_data and edge_cost are modified for Se, and thus for S. Thus, hierarchical scheduling must use extended definitions of task_time, task_data and edge_cost

for tasks τ using statements S = vertex_statement(τ) that are parallel unstructured statements, extending the definitions provided in Section 6.3, which still apply to non-unstructured statements. For such a case, we assume that BDSC and other relevant functions take σ as an additional argument, to access the scheduling result associated to statement sequences and handle the modified definitions of task_time, edge_cost and task_data. These functions are defined as follows.

Parallel Task Time

When statements S = vertex_statement(τ) correspond to parallel unstructured statements, let Su = parallel(σ(S)) be the parallel unstructured statement (i.e., such that S ∈ domain(σ)). We define task_time as follows, where Function clusters returns the set of clusters used to schedule Su, and Function control_statements is used to return the statements from Su:

task_time(τ, σ) = max_{κ ∈ clusters(Su, σ)} cluster_time(κ)
clusters(Su, σ) = {cluster(σ(s)) | s ∈ control_statements(Su)}

Otherwise, the previous definition of task_time on S, provided in Section 6.3, is used. The recursive computation of complexities accesses this new definition in the case of parallel unstructured statements.
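A C sketch of this extended task_time (assumed helper functions; the PIPS implementation is not shown in the text) simply takes the maximum cluster_time over the control statements of Su; duplicates in the cluster set do not change the maximum, so enumerating statements is enough.

  typedef struct stmt stmt;
  typedef int cluster_id;

  extern int        nb_control_statements(stmt *su);   /* assumed accessors   */
  extern stmt      *control_statement(stmt *su, int i);
  extern cluster_id cluster_of(stmt *s);               /* cluster(sigma(s))   */
  extern double     cluster_time(cluster_id kappa);

  static double parallel_task_time(stmt *su) {
    double max = 0.0;
    for (int i = 0; i < nb_control_statements(su); i++) {
      double t = cluster_time(cluster_of(control_statement(su, i)));
      if (t > max)
        max = t;
    }
    return max;
  }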

Parallel Task Data and Edge Cost

To define the task_data and edge_cost parameters, which depend on the computation of array regions¹⁰, we need extended versions of the read_regions

¹⁰ Note that every operation combining regions of two statements used in this chapter assumes the use of the path transformer, via the function region_transformer_compose, as illustrated in Section 6.4.3, in order to bring the regions of the two statements into a common memory store.


and write_regions functions that take into account the allocation information to clusters. These functions take σ as an additional argument, to access the allocation information cluster(σ).

S = vertex_statement(τ)
task_data(τ, σ) = regions_union(read_regions(S, σ), write_regions(S, σ))

Rw1 = write_regions(vertex_statement(τ1), σ)
Rr2 = read_regions(vertex_statement(τ2), σ)
edge_cost_bytes(τ1, τ2, σ) = Σ_{r ∈ regions_intersection(Rw1, Rr2)} ehrhart(r)
edge_cost(τ1, τ2, σ) = α + β × edge_cost_bytes(τ1, τ2, σ)
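This affine model can be read as a startup latency (α) plus a per-byte transfer cost (β); a minimal C sketch follows, where the numeric values are placeholders and the Ehrhart-based byte counting is abstracted behind an assumed helper.

  /* Affine communication cost: alpha (latency) + beta * transferred bytes.
     ALPHA, BETA and edge_cost_bytes are assumptions, not measured values.  */
  #define ALPHA 1.0e-6    /* startup latency, in seconds                    */
  #define BETA  1.0e-9    /* per-byte cost, i.e. inverse bandwidth          */

  extern double edge_cost_bytes(int tau1, int tau2);  /* sum of ehrhart(r)  */

  static double edge_cost(int tau1, int tau2) {
    return ALPHA + BETA * edge_cost_bytes(tau1, tau2);
  }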

When statements are parallel unstructured statements Su, we propose to add a condition on the accumulation of regions in the sequential recursive computation functions of these regions, as follows, where Function stmts_in_cluster

returns the set of statements scheduled in the same cluster as Su:

read_regions(Su, σ) = regions_union_{s ∈ stmts_in_cluster(Su, σ)} read_regions(s, σ)
write_regions(Su, σ) = regions_union_{s ∈ stmts_in_cluster(Su, σ)} write_regions(s, σ)
stmts_in_cluster(Su, σ) = {s ∈ control_statements(Su) | cluster(σ(s)) = cluster(σ(Su))}

Otherwise, the sequential definitions of read_regions and write_regions

on S, provided in Section 6.3, are used. The recursive computation of array regions accesses this new definition in the case of parallel unstructured statements.

6.6 Related Work: Task Parallelization Tools

In this section, we review several other approaches that intend to automate the parallelization of programs, using different granularities and scheduling policies. Given the breadth of literature on this subject, we limit this presentation to approaches that focus on static list-scheduling methods, and compare them with BDSC.

Sarkar's work on the partitioning and scheduling of parallel programs [96] for multiprocessors introduced a compile-time method where a GR (Graphical Representation) program is partitioned into parallel tasks at compile



time. A GR graph has four kinds of vertices: "simple", to represent an indivisible sequential computation; "function call"; "parallel", to represent parallel loops; and "compound", for conditional instructions. Sarkar presents an approximation parallelization algorithm. Starting with an initial fine-granularity partition P0, tasks (chosen by heuristics) are iteratively merged till the coarsest partition Pn (with one task containing all vertices), after n iterations, is reached. The partition Pmin with the smallest parallel execution time in the presence of overhead (scheduling and communication overhead) is chosen. For scheduling, Sarkar introduces the EZ (Edge-Zeroing) algorithm, which uses bottom levels for ordering; it is based on edge weights for clustering: all edges are examined, from the largest edge weight to the smallest, and the algorithm proceeds by zeroing the highest edge weight if the completion time decreases. While this algorithm is based only on the bottom level, for an unbounded number of processors, and does not recompute the priorities after zeroings, BDSC adds resource constraints and is based on both bottom levels and dynamic top levels.

The OSCAR Fortran Compiler [62] is used as a multigrain parallelizer from Fortran to parallelized OpenMP Fortran. OSCAR partitions a program into a macro-task graph, where vertices represent macro-tasks of three kinds, namely basic, repetition and subroutine blocks. The coarse-grain task parallelization proceeds as follows. First, the macro-tasks are generated by decomposition of the source program. Then, a macro-flow graph is generated, to represent data and control dependences on macro-tasks. The macro-task graph is subsequently generated via the analysis of parallelism among macro-tasks, using an earliest executable condition analysis that represents the conditions on which a given macro-task may begin its execution at the earliest time, assuming precedence constraints. If a macro-task graph has only data dependence edges, macro-tasks are assigned to processors by static scheduling. If a macro-task graph has both data and control dependence edges, macro-tasks are assigned to processors at run time by a dynamic scheduling routine. In addition to dealing with a richer set of resource constraints, BDSC targets both shared and distributed memory systems, with a cost model based on communication, used data and time estimations.

Pedigree [83] is a compilation tool based on the program dependence graph (PDG). The PDG is extended by adding a new type of vertex, a "Par" vertex, which groups children vertices reachable via the same branch conditions. Pedigree proceeds by estimating a latency for each vertex and data dependence edge weights in the PDG. The scheduling process orders the children and assigns them to a subset of the processors. For scheduling, vertices with minimum early and late times are given highest priority; the highest-priority ready vertex is selected for scheduling, based on the synchronization overhead and latency. While Pedigree operates on assembly code, PIPS and our extension for task parallelism using BDSC offer a higher-level, source-to-source parallelization framework. Moreover, Pedigree-generated


code is specialized for only symmetric multiprocessors, while BDSC targets many architecture types, thanks to its resource constraints and cost models.

The SPIR (Signal Processing Intermediate Representation) compiler [31] takes a sequential dataflow program as input and generates a multithreaded parallel program for an embedded multicore system. First, SPIR builds a stream graph, where a vertex corresponds to a kernel function call or to the condition of an "if" statement; an edge denotes a transfer of data between two kernel function calls, or a control transfer by an "if" statement (true or false). Then, for task scheduling purposes, given a stream graph and a target platform, the task scheduler assigns each vertex to a processor in the target platform. It allocates stream buffers and generates DMA operations under given memory and timing constraints. The degree of automation of BDSC is larger than SPIR's, because this latter system needs several keyword extensions plus the C code denoting the streaming scope within applications. Also, the granularity in SPIR is a function, whereas BDSC uses several granularity levels.

We collect in Table 6.2 the main characteristics of each parallelization tool addressed in this section.

Tool            blevel  tlevel  Resource     Dependence  Dependence  Execution time  Communication     Memory model
                                constraints  control     data        estimation      time estimation
HBDSC           √       √       √            X           √           √               √                 Shared, distributed
Sarkar's work   √       X       X            X           √           √               √                 Shared, distributed
OSCAR           X       √       X            √           √           √               X                 Shared
Pedigree        √       √       X            √           √           √               X                 Shared
SPIR            X       √       √            √           √           √               √                 Shared

Table 6.2: Comparison summary between different parallelization tools

6.7 Conclusion

This chapter presents our BDSC-based hierarchical task parallelization approach, which uses a new data structure called SDG and a symbolic execution and communication cost model based on either static code analysis


or a dynamic-based instrumentation assessment tool. The HBDSC scheduling algorithm maps hierarchically, via a mapping function H, each task of the SDG to one cluster. The result is the generation of a parallel code represented in SPIRE(PIPS IR), using the parallel unstructured construct. We have integrated this approach within the PIPS automatic parallelization compiler infrastructure (see Section 8.2 for the implementation details).

The next chapter illustrates two parallel transformations of SPIRE(PIPS IR) parallel code and the generation of OpenMP and MPI codes from SPIRE(PIPS IR).

An earlier version of the work presented in the previous chapter (Chapter 5), and a part of the work presented in this chapter, have been submitted to Parallel Computing [65].

Chapter 7

SPIRE-Based Parallel Code Transformations and Generation

Satisfaction lies in the effort, not in the attainment; full effort is full victory.

Gandhi

In the previous chapter, we describe how sequential programs are analyzed, scheduled and represented in a parallel intermediate representation. Their parallel versions are expressed as graphs using the parallel unstructured

attribute of SPIRE, but such graphs are not directly expressible in parallel languages. To test the flexibility of this parallelization approach, this chapter covers three aspects of parallel code transformations and generation: (1) the transformation of SPIRE(PIPS IR)¹ code from the unstructured parallel programming model to a structured (high-level) model, using the spawn

and barrier constructs; (2) the transformation from the shared to the distributed memory programming model; and (3) the code targeting of parallel languages from SPIRE(PIPS IR). We have implemented our SPIRE-derived parallel IR and the BDSC-based task parallelization algorithm in the PIPS middle-end. We generate both OpenMP and MPI code from the same parallel IR, using the PIPS backend and its prettyprinters.

7.1 Introduction

Generally, the process of source-to-source code generation is a difficult task. In fact, compilers use at least two intermediate representations when transforming a source code written in a programming language, abstracted in a first IR, to a source code written in another programming language, abstracted in a second IR. Moreover, this difficulty is compounded whenever the resulting input source code is to be executed on a machine not targeted by its language. The portability issue was a key factor when we designed SPIRE as a generic approach (Chapter 4). Indeed, representing different parallel constructs from different parallel programming languages makes the code generation process generic. In the context of this thesis, we need to generate the equivalent parallel version of a sequential code; representing

¹ Recall that SPIRE(PIPS IR) means that the SPIRE transformation function is applied on the sequential IR of PIPS; it yields a parallel IR of PIPS.


the parallel code in SPIRE facilitates parallel code generation in different parallel programming languages.

The parallel code encoded in SPIRE(PIPS IR) that we generate in Chapter 6 uses an unstructured parallel programming model based on static task graphs. However, structured programming constructs increase the level of abstraction when writing parallel codes, and structured parallel programs are easier to analyze. Moreover, parallel unstructured does not exist in parallel languages. In this chapter, we use SPIRE both for the input unstructured parallel code and the output structured parallel code, by translating the SPIRE unstructured construct to spawn and barrier constructs. Another parallel transformation we address is to transform a SPIRE shared memory code into a SPIRE distributed memory one, in order to adapt the SPIRE code to message-passing libraries/languages such as MPI, and thus distributed memory system architectures.

A parallel program transformation takes the parallel intermediate representation of a code fragment and yields a new, equivalent parallel representation for it. Such transformations are useful for optimization purposes, or to enable other optimizations that intend to improve execution time, data locality, energy, memory use, etc. While parallel code generation outputs a final code, written in a parallel programming language, that can be compiled by its compiler and run on a target machine, a parallel transformation generates an intermediate code written in a parallel intermediate representation.

In this chapter, a parallel transformation takes a hierarchical schedule σ, defined in Section 6.5.2 on page 118, and yields another schedule σ′, which maps each statement S in domain(σ) to σ′(S) = (S′, κ, n), such that S′ belongs to the resulting parallel IR, κ = cluster(σ(S)) is the cluster where S is allocated, and n = nbclusters(σ(S)) is the number of clusters the inner scheduling of S′ requires. A cluster stands for a processor.

Today's typical architectures have multiple hierarchical levels. A multiprocessor has many nodes, each node has multiple clusters, each cluster has multiple cores, and each core has multiple physical threads. Therefore, in addition to distinguishing parallelism in terms of abstraction of parallel constructs, such as structured and unstructured, we can also classify parallelism in two types in terms of code structure: equilevel, when it is generated within a sequence, i.e., zero hierarchy level, and hierarchical, or multilevel, when one manages the code hierarchy, i.e., multiple levels. In this chapter, we handle these two types of parallelism, in order to enhance the performance of the generated parallel code.

The remainder of this chapter is structured into three sections. Firstly, Section 7.2 presents the first of the two parallel transformations we designed: the unstructured to structured parallel code transformation. Secondly, the transformation from shared memory to distributed memory codes is presented in Section 7.3. Moving to code generation issues, Section 7.4


details the generation of parallel programs from SPIRE(PIPS IR): first, Section 7.4.1 shows the mapping approach between SPIRE and different parallel languages; then, as illustrating cases, Section 7.4.2 introduces the generation of OpenMP and Section 7.4.3 the generation of MPI from SPIRE(PIPS IR). Finally, we conclude in Section 7.5. Figure 7.1 summarizes the parallel code transformations and generation applied in this chapter.

[Figure 7.1 depicts the flow: Unstructured SPIRE(PIPS IR), Schedule SDG → Structuring → Structured SPIRE(PIPS IR) → Data Transfer Generation → send/recv-Based IR; then either OpenMP Task Generation (spawn → omp task) → OpenMP Pragma-Based IR → OpenMP Prettyprinter → OpenMP Code, or SPMDization (spawn → guards) → MPI-Based IR → MPI Prettyprinter → MPI Code.]

Figure 7.1: Parallel code transformations and generation; blue indicates this chapter's contributions, an ellipse, a process, and a rectangle, results


7.2 Parallel Unstructured to Structured Transformation

We present in Chapter 4 the application of the SPIRE methodology to the PIPS IR case; Chapter 6 shows how to construct unstructured parallel code presented in SPIRE(PIPS IR). In this section, we detail the algorithm transforming unstructured (mid, graph-based level) parallel code to structured (high, syntactic level) code expressed in SPIRE(PIPS IR), and the implementation of the structured model in PIPS. The goal behind structuring parallelism is to increase the level of abstraction of parallel programs and to simplify analysis and optimization passes. Another reason is the fact that parallel unstructured does not exist in parallel languages. However, this has a cost in terms of performance, if the code contains arbitrary parallel patterns. For instance, the example already presented in Figure 4.3, which uses unstructured constructs to form a graph, and copied again for convenience in Figure 7.2, can be represented in a low-level unstructured way, using events ((b) of Figure 7.2), and in a structured way, using spawn² and barrier constructs ((c) of Figure 7.2). The structured representation (c) minimizes control overhead. Moreover, it is a trade-off between structuring and performance. Indeed, if the tasks have the same execution time, the structured way is better than the unstructured one, since no synchronizations using events are necessary. Therefore, depending on the run-time properties of the tasks, it may be better to run one form or the other.

7.2.1 Structuring Parallelism

Function unstructured_to_structured, presented in Algorithm 21, uses a hierarchical schedule σ that maps each substatement s of S to σ(s) = (s′, κ, n) (see Section 6.5) to handle new statements. It specifies how the unstructured scheduled statement parallel(σ(S)), presented in Section 6.5, is used to generate a structured parallel statement, encoded also in SPIRE(PIPS IR), using spawn and barrier constructs. Note that another transformation would be to generate events and spawn constructs from the unstructured constructs (as in (b) of Figure 7.2).

For an unstructured statement S (other constructs are simply traversed; parallel statements are reconstructed using the original statements), its control vertices are traversed along a topological sort-ordered descent. As an unstructured statement, by definition, is also a graph (control statements are the vertices, while the predecessor (resp. successor) control statements Sn of a control statement S define edges from Sn to S (resp. from S to Sn)), we use the same idea as in Section 6.5 (topsort), but this time starting from an

² Note that, in this chapter, we assume, for simplification purposes, that constants, instead of temporary identifiers having the corresponding value, are used as the first parameter of spawn statements.


(a) the control flow graph: nop (entry); x=3 and y=5; z=x+y and u=x+1; nop (exit)

(b)
barrier(
  ex = newEvent(0);
  spawn(0,
    x = 3;
    signal(ex);
    u = x + 1
  );
  spawn(1,
    y = 5;
    wait(ex);
    z = x + y
  );
  freeEvent(ex)
)

(c)
barrier(
  spawn(0, x = 3);
  spawn(1, y = 5)
);
barrier(
  spawn(0, u = x + 1);
  spawn(1, z = x + y)
)

Figure 7.2: Unstructured parallel control flow graph (a), and event (b) and structured (c) representations

unstructured statement. We thus use Function topsort_unstructured(S), which yields a list of stages of computation, each cluster_stage being a list of independent lists L of statements s.

Cluster stages are seen as a set of fork-join structures; a fork-join structure can be seen as a set of spawn statements (to be launched in parallel) enclosed in a barrier statement. It can also be seen as a parallel sequence. In our implementation of structured code generation, we choose to generate spawn and barrier constructs instead of parallel sequence constructs because, in contrast to the parallel sequence construct, the spawn construct allows the implementation of recursive parallel tasks. Besides, we can represent a parallel loop via the spawn construct, but it cannot be implemented using parallel sequence.

New statements have to be added to domain(σ). Therefore, in order to set the third argument of σ (nbclusters(σ)) for these new statements, we use the function structured_clusters, which harvests the clusters used to sched-


ALGORITHM 21: Bottom-up hierarchical unstructured to structured SPIRE transformation, updating a parallel schedule σ for Statement S

function unstructured_to_structured(S, p, σ)
  κ = cluster(σ(S))
  n = nbclusters(σ(S))
  switch (S)
    case call:
      return σ
    case unstructured(Centry, Cexit, parallel):
      Sstructured = []
      foreach cluster_stage ∈ topsort_unstructured(S)
        (Sbarrier, σ) = spawns_barrier(cluster_stage, κ, p, σ)
        nbarrier = |structured_clusters(statements(Sbarrier), σ)|
        σ = σ[Sbarrier → (Sbarrier, κ, nbarrier)]
        Sstructured ||= [Sbarrier]
      S′ = sequence(Sstructured)
      return σ[S → (S′, κ, n)]
    case forloop(I, Elower, Eupper, Sbody):
      σ = unstructured_to_structured(Sbody, p, σ)
      S′ = forloop(I, Elower, Eupper, parallel(σ(Sbody)))
      return σ[S → (S′, κ, n)]
    case test(Econd, St, Sf):
      σ = unstructured_to_structured(St, p, σ)
      σ = unstructured_to_structured(Sf, p, σ)
      S′ = test(Econd, parallel(σ(St)), parallel(σ(Sf)))
      return σ[S → (S′, κ, n)]
end

function structured_clusters({S1, S2, ..., Sn}, σ)
  clusters = ∅
  foreach Si ∈ {S1, S2, ..., Sn}
    if (cluster(σ(Si)) ∉ clusters)
      clusters ∪= {cluster(σ(Si))}
  return clusters
end

ule the list of statements in a sequence. Recall that Function statements

of a sequence returns its list of statements.

We use Function spawns_barrier, detailed in Algorithm 22, to form spawn and barrier statements. In a cluster stage, each list L for each cluster κs corresponds to a spawn statement Sspawn. In order to attach a cluster number, as an entity, to the spawn statement, as required by SPIRE, we associate a cluster number to the cluster κs returned by cluster(σ(s)), via the cluster_number function; we posit thus that two clusters are equal if they have the same cluster number. However, since BDSC maps tasks


to logical clusters in Algorithm 19, with numbers in [0, p − 1], we convert this logical number to a physical one using the formula nphysical = P − p + cluster_number(κs); P is the total number of clusters available in the target machine.
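A minimal helper for this conversion (the function name is ours) makes the arithmetic explicit; for instance, with P = 4 and p = 2, logical cluster 1 is mapped to physical cluster 3 = 4 − 2 + 1.

  /* Convert a logical cluster number (in [0, p-1]) into a physical one,
     given P, the total number of clusters of the target machine. */
  static int physical_cluster(int P, int p, int logical) {
    return P - p + logical;     /* e.g. physical_cluster(4, 2, 1) == 3 */
  }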

ALGORITHM 22: Spawns-barrier generation for a cluster stage

function spawns_barrier(cluster_stage, κ, p, σ)
  p′ = p - |cluster_stage|
  SL = []
  foreach L ∈ cluster_stage
    L′ = []
    nbclustersL = 0
    foreach s ∈ L
      κs = cluster(σ(s))            // all s have the same schedule κs
      if (p′ > 0)
        σ = unstructured_to_structured(s, p′, σ)
      L′ ||= [parallel(σ(s))]
      nbclustersL = max(nbclustersL, nbclusters(σ(s)))
    Sspawn = sequence(L′)
    nphysical = P - p + cluster_number(κs)
    cluster_number(κs) = nphysical
    synchronization(Sspawn) = spawn(nphysical)
    σ = σ[Sspawn → (Sspawn, κs, |structured_clusters(L′, σ)|)]
    SL ||= [Sspawn]
    p′ -= nbclustersL
  Sbarrier = sequence(SL)
  synchronization(Sbarrier) = barrier()
  return (Sbarrier, σ)
end

Spawn statements SL of one cluster stage are collected in a barrier statement Sbarrier as follows. First, the list of statements SL is constructed, using the || symbol for the concatenation of two lists of arguments. Then, Statement Sbarrier is created, using Function sequence, from the list of statements SL. Synchronization attributes should be set for the new statements Sspawn and Sbarrier: we thus use the functions spawn(I) and barrier() to set the synchronization attribute, spawn or barrier, of the different statements.

Figure 7.3 depicts an example of a SPIRE structured representation (in the right hand side) of a part of the C implementation of the main function of Harris, presented in the left hand side. Note that the scheduling results from our BDSC-based task parallelization process, using P = 3 (we use three clusters because the maximum parallelism in Harris is three). Here, up to three parallel tasks are enclosed in a barrier for each chain of functions Sobel, Multiplication and Gauss. The synchronization annotations spawn and barrier are printed within the SPIRE generated code;


in = InitHarris();
// Sobel
SobelX(Gx, in);
SobelY(Gy, in);
// Multiply
MultiplY(Ixx, Gx, Gx);
MultiplY(Iyy, Gy, Gy);
MultiplY(Ixy, Gx, Gy);
// Gauss
Gauss(Sxx, Ixx);
Gauss(Syy, Iyy);
Gauss(Sxy, Ixy);
// Coarsity
CoarsitY(out, Sxx, Syy, Sxy);

barrier(
  spawn(0, InitHarris(in))
);
barrier(
  spawn(0, SobelX(Gx, in));
  spawn(1, SobelY(Gy, in))
);
barrier(
  spawn(0, MultiplY(Ixx, Gx, Gx));
  spawn(1, MultiplY(Iyy, Gy, Gy));
  spawn(2, MultiplY(Ixy, Gx, Gy))
);
barrier(
  spawn(0, Gauss(Sxx, Ixx));
  spawn(1, Gauss(Syy, Iyy));
  spawn(2, Gauss(Sxy, Ixy))
);
barrier(
  spawn(0, CoarsitY(out, Sxx, Syy, Sxy))
)

Figure 7.3: A part of Harris, and one possible SPIRE core language representation

remember that the synchronization attribute is added to each statement (see Section 4.2.3 on page 56).

Multiple parallel code configuration schemes can be generated. In a first scheme, illustrated in Function spawns_barrier, we activate other clusters from an enclosing one, i.e., the enclosing cluster executes the spawn instructions (launches the tasks). A second scheme would be to use the enclosing cluster to execute one of the enclosed tasks; there is no need to generate the last task as a spawn statement, since it is executed by the current cluster. We show an example in Figure 7.4.

Although the barrier synchronization of statements with only one spawn statement can be omitted, as another optimization (barrier(spawn(i, S)) = S), we choose to keep these synchronizations for pedagogical purposes, and plan to apply this optimization in a separate phase, as a parallel transformation, or to handle it at the code generation level.

7.2.2 Hierarchical Parallelism

Function unstructured_to_structured is recursive, in order to handle hierarchy in code. An example is provided in Figure 7.5, where we assume that P = 4. We use two clusters, κ0 and κ1, to schedule the cluster stage {S1, S2};


spawn(0, Gauss(Sxx, Ixx));
spawn(1, Gauss(Syy, Iyy));
spawn(2, Gauss(Sxy, Ixy))

spawn(1, Gauss(Sxx, Ixx));
spawn(2, Gauss(Syy, Iyy));
Gauss(Sxy, Ixy)

Figure 7.4: SPIRE representation of a part of Harris, non optimized (left) and optimized (right)

the remaining clusters are then used to schedule statements enclosed in S1 and S2. Note that Statement S12, where p = 2, is mapped to the physical cluster number 3 from its logical number 1 (3 = 4 − 2 + 1).

S {
  S0:
  for(i = 1; i <= N; i++) {
    A[i] = 5;              // S01
    B[i] = 3;              // S02
  }
  S1:
  for(i = 1; i <= N; i++) {
    A[i] = foo(A[i]);      // S11
    C[i] = bar();          // S12
  }
  S2:
  for(i = 1; i <= N; i++)
    B[i] = baz(B[i]);      // S21
  S3:
  for(i = 1; i <= N; i++)
    C[i] += A[i] + B[i];   // S31
}

barrier(
  spawn(0, forloop(i, 1, N, 1,
    barrier(
      spawn(1, B[i] = 3);
      spawn(2, A[i] = 5)
    ), sequential))
);
barrier(
  spawn(0, forloop(i, 1, N, 1,
    barrier(
      spawn(2, A[i] = foo(A[i]));
      spawn(3, C[i] = bar())
    ), sequential));
  spawn(1, forloop(i, 1, N, 1,
    B[i] = baz(B[i]), sequential))
);
barrier(
  spawn(0, forloop(i, 1, N, 1,
    C[i] += A[i]+B[i], sequential))
)

Figure 7.5: A simple C code (hierarchical), and its SPIRE representation

Barrier statements Sstructured, formed from all cluster stages Sbarrier in the sequence S, constitute the new sequence S′ that we build in Function unstructured_to_structured; by default, the execution attribute of this sequence is sequential.

The code in the right hand side of Figure 7.5 contains barrier


statements with only one spawn statement, which can be omitted (as an optimization). Moreover, in the same figure, note how S01 and S02 are apparently reversed; this is a consequence, in our Algorithm 21, of the function topsort_unstructured, where statements are generated in the order of the graph traversal.

The execution attribute of the loops in Figure 7.5 is sequential, but they are actually parallel. Note that, regarding these loops, since we adopt the task parallelism paradigm, it may be useful to apply the tiling transformation and then perform full unrolling of the outer loop. We give more details of this transformation in the protocol of our experiments, in Section 8.4. This way, the input code contains more parallel tasks: spawn loops with a sequential execution attribute, resulting from the parallel loops.

7.3 From Shared Memory to Distributed Memory Transformation

This section provides a second illustration of the type of parallel transformations that can be performed thanks to SPIRE: from shared memory to distributed memory transformation. Given the complexity of this issue, as illustrated by the vast existing related work (see Section 7.3.1), the work reported in this section is preliminary and exploratory.

The send and recv primitives of SPIRE provide a simple communication API, adaptable to message-passing libraries/languages such as MPI, and thus to distributed memory system architectures. In these architectures, the data domain of whole programs should be partitioned into disjoint parts and mapped to different processes that have separate memory spaces. If a process accesses a non-local variable, communications must be performed. In our case, send and recv SPIRE functions have to be added to the parallel intermediate representation of distributed memory code.

7.3.1 Related Work: Communications Generation

Many researchers have been generating communication routines between several processors (general purpose, graphical, accelerator) and have been trying to optimize the generated code. A large amount of literature already exists, and a large amount of work has been done to address this problem since the early 1990s. However, no practical and efficient general solution currently exists [27].

Amarasinghe and Lam [15] generate the send and receive instructions necessary for communication between multiple processors. Goumas et al. [49] automatically generate message-passing code for tiled iteration spaces. However, these two works handle only perfectly nested loops with uniform dependences. Moreover, the data parallelism code generation proposed in [15]


and [49] produces an SPMD (single program, multiple data) program, to be run on each processor; this is different from our context of task parallelism with its hierarchical approach.

Bondhugula [27], in his automatic distributed memory code generation technique, uses the polyhedral framework to compute communication sets between computations under a given iteration of the parallel dimension of a loop. Once more, the parallel model is nested loops. Indeed, he "only" needs to transfer data written in one iteration and read in another iteration of the same loop, or final writes, for all data to be aggregated once the entire loop is finished. In our case, we need to transfer data from one or many iterations of a loop to different loop(s), knowing that the data spaces of these loops may be interlaced.

Cetus contains a source-to-source OpenMP to MPI transformation [23] where, at the end of a parallel construct, each participating process communicates the shared data it has produced and that other processes may use.

STEP [78] is a tool that transforms a parallel program containing high-level OpenMP directives into a message-passing program. To generate MPI communications, STEP constructs (1) send updated array regions, needed at the end of a parallel section, and (2) receive array regions, needed at the beginning of a parallel section. Habel et al. [52] present a directive-based programming model for hybrid distributed and shared memory systems. Using the pragmas distribute and gridify, they distribute nested loop dimensions and array data among processors.

The OpenMP to GPGPU compiler framework [74] transforms OpenMP to GPU code. In order to generate communications, it inserts memory transfer calls for all shared data accessed by the kernel functions and defined by the host, and for data written in the kernels and used by the host.

Amini et al. [17] generate memory transfers between a computer host and its hardware GPU (CPU-GPU) using data flow analysis; this latter is also based on array regions.

The main difference with our context of task parallelism with hierarchy is that these works generate communications between different iterations of nested loops, or between one host and GPUs. Indeed, we have to transfer data between two different entire loops or between two iterations of different loops.

In this section, we describe how to figure out which pieces of data have to be transferred. We construct array region transfers in the context of a distributed-memory homogeneous architecture.

7.3.2 Difficulties

Distributed-memory code generation becomes a problem much harder than the ones addressed in the above related works (see Section 7.3.1), because of the hierarchical aspect of our methodology. In this section, we illustrate


how communication generation imposes a set of challenges, in terms of efficiency and correctness of the generated code and of the coherence of data communications.

Inter-Level Communications

In the example in Figure 7.6, we suppose given this schedule which, although not efficient, is useful to illustrate the difficulty of sending an array, element by element, from the source to the sink of a dependence extracted from the DDG. The difficulty occurs when this source and sink belong to different levels of hierarchy in the code, i.e., the loop tree or test branches.

S {
  S0:
  for(i = 0; i < N; i++) {
    A[i] = 5;              // S01
    B[i] = 3;              // S02
  }
  S1:
  for(i = 1; i < N; i++) {
    A[i] = foo(A[i-1]);    // S11
    B[i] = bar(B[i]);      // S12
  }
}

forloop(i, 0, N-1, 1,
  barrier(
    spawn(0, A[i] = 5;
             if (i == 0, asend(1, A[i]), nop));
    spawn(1, B[i] = 3;
             asend(0, B[i]))
  ), sequential);
forloop(i, 1, N-1, 1,
  barrier(
    spawn(1, if (i == 1, recv(0, A[i-1]), nop);
             A[i] = foo(A[i-1]));
    spawn(0, recv(1, B[i]);
             B[i] = bar(B[i]))
  ), sequential)

Figure 7.6: An example of a shared memory C code, and its SPIRE code, efficient communication primitive-based, distributed memory equivalent

Communicating element by element is simple in the case of Array B in S12, since no loop-carried dependence exists. But when a dependence distance is greater than zero, as is the case of Array A in S11, sending only A[0] to S11 is required, although generating exact communications in such cases requires more sophisticated static analyses that we do not have. Note that asend is a macro instruction for an asynchronous send; it is detailed in Section 7.3.3.


Non-Uniform Dependence Distance

In previous works (see Section 7.3.1), only uniform (constant) loop-carried dependences were considered. When a dependence is non-uniform, i.e., the dependence distance is not constant, as in the two codes in Figure 7.7, knowing how long dependences are for Array A and generating exact communications in such cases is complex. The problem is even more complicated when nested loops are used and arrays are multidimensional. Generating exact communications in such cases requires more sophisticated static analyses that we do not have.

S {
  S0:
  for(i = 0; i < N; i++) {
    A[i] = 5;              // S01
    B[i] = 3;              // S02
  }
  S1:
  for(i = 0; i < N; i++) {
    A[i] = A[i-p] + a;     // S11
    B[i] = A[i] * B[i];    // S12
    p = B[i];
  }
}

S {
  S0:
  for(i = 0; i < N; i++) {
    A[i] = 5;              // S01
    B[i] = 3;              // S02
  }
  S1:
  for(i = 0; i < N; i++) {
    A[i] = A[p*i-q] + a;   // S11
    B[i] = A[i] * B[i];    // S12
  }
}

Figure 7.7: Two examples of C code with non-uniform dependences

Exact Analyses

We presented in Section 6.3 our cost models; we rely on array region analysis to overapproximate communications. This analysis is not always exact: PIPS generates approximations (see Section 2.6.3 on page 24), noted by the MAY attribute, if an over- or under-approximation has been used because the region cannot be exactly represented by a convex polyhedron. While BDSC scheduling is still robust in spite of these approximations (see Section 8.6), they can affect the correctness of the generated communications. For instance, in the code presented in Figure 7.8, the quality of the array region analysis may pass from EXACT to MAY if a region contains holes, as is the case of S0. Also, while the region of A in the case of S1 remains EXACT, because the exact region is rectangular, the region of A in the case of S2 is MAY, since it contains holes.

Approximations are also propagated when computing in and out regions for Array A (see Figure 7.9). In this chapter, communications are computed


// S0
// <A(φ1)-R-MAY-{1 ≤ φ1, φ1 ≤ N}>
// <A(φ1)-W-MAY-{1 ≤ φ1, φ1 ≤ N}>
for(i = 1; i <= N; i++)
  if (i != N/2)
    A[i] = foo(A[i]);

// S1
// <A(φ1)-R-EXACT-{2 ≤ φ1, φ1 ≤ 2×N}>
// <B(φ1)-W-EXACT-{1 ≤ φ1, φ1 ≤ N}>
for(i = 1; i <= N; i++)
  // <A(φ1)-R-EXACT-{i+1 ≤ φ1, φ1 ≤ i+N}>
  // <B(φ1)-W-EXACT-{φ1 = i}>
  for(j = 1; j <= N; j++)
    B[i] = bar(A[i+j]);

// S2
// <A(φ1)-R-MAY-{2 ≤ φ1, φ1 ≤ 2×N}>
// <B(φ1)-W-EXACT-{1 ≤ φ1, φ1 ≤ N}>
for(j = 1; j <= N; j++)
  B[j] = baz(A[2*j]);

Figure 7.8: An example of a C code and its read and write array regions analysis, for two communications, from S0 to S1 and to S2

using in and out regions, such as communicating A from S0 to S1 and to S2 in Figure 7.9.

// S0
// <A[φ1]-IN-MAY-{1 ≤ φ1, φ1 ≤ N}>
// <A[φ1]-OUT-MAY-{2 ≤ φ1, φ1 ≤ 2N, φ1 ≤ N}>
for(i = 1; i <= N; i++)
  if (i != n/2)
    A[i] = foo(A[i]);

// S1
// <A[φ1]-IN-EXACT-{2 ≤ φ1, φ1 ≤ 2N}>
for(i = 1; i <= n; i += 1)
  // <A[φ1]-IN-EXACT-{i+1 ≤ φ1, φ1 ≤ i+N}>
  for(j = 1; j <= n; j += 1)
    B[i] = bar(A[i+j]);

// S2
// <A[φ1]-IN-MAY-{2 ≤ φ1, φ1 ≤ 2N}>
for(j = 1; j <= n; j += 1)
  B[j] = baz(A[j*2]);

Figure 7.9: An example of a C code and its in and out array regions analysis, for two communications, from S0 to S1 and to S2


7.3.3 Equilevel and Hierarchical Communications

Since transferring data from one task to another when they are not at the same level of hierarchy is unstructured and too complex, designing a compilation scheme for such cases might be overly complex and might even possibly jeopardize the correctness of the distributed memory code. In this thesis, we propose a compositional static solution that is structured and correct, but which may communicate more data than necessary; their coherence, however, is guaranteed.

Communication Scheme

We adopt a hierarchical scheme of communication, between multiple levels of nesting of tasks and inside one level of hierarchy. An example of this scheme is illustrated in Figure 7.10, where each box represents a task τ. Each box is seen as a host of the boxes it encloses, and an edge represents a data dependence. This scheme illustrates an example of configuration with two nested levels of hierarchy; the level is specified with the superscript on τ in the figure.

Once structured parallelism is generated (sequential sequences of barrier and spawn statements)³, the second step is to add send and recv SPIRE call statements, when assuming a message-passing language model. We detail in this section our specification of communications insertion in SPIRE.

We divide parallelism into hierarchical and equilevel parallelisms, and distinguish between two types of communications, namely inside a sequence, zero hierarchy level (equilevel), and between hierarchy levels (hierarchical).

In Figure 7.11, where edges represent possible communications, τ^(1)_0, τ^(1)_1 and τ^(1)_2 are enclosed in the vertex τ^(0)_1 that is in the same sequence as τ^(0)_0. The dependence (τ^(0)_0, τ^(0)_1), for instance, denotes an equilevel communication of A; also, the dependence (τ^(1)_0, τ^(1)_2) denotes an equilevel communication of B, while (τ^(0)_1, τ^(1)_0) is a hierarchical communication of A. We use two main steps to construct communications, via the equilevel_com and hierarchical_com functions that are presented below.

The key limitation of our method is that it cannot enforce memory constraints when applying the BDSC-based hierarchical scheduling, which uses an iterative approach on the number of applications of BDSC (see Section 6.5). Therefore, our proposed solution for the generation of communications cannot be applied when strict limitations on memory are imposed at scheduling time. However, our code generator can easily verify if the data used and written at each level can be maintained in the specific clusters.

As a summary, compared to the works presented in Section 7.3.1, we do not improve upon them (regarding non-affine dependences or efficiency).

³ Note that communications can also be constructed from an unstructured code; one has to manipulate unstructured statements rather than sequence ones.


[Figure 7.10 shows three instances of a two-level nesting of task boxes: a top-level task τ^(0)_a encloses τ^(1)_a, which contains τ^(2)_a and τ^(2)_b, and τ^(1)_b, which contains τ^(2)_c and τ^(2)_d; edges between boxes represent data dependences across and within hierarchy levels.]

Figure 7.10: Scheme of communications between multiple possible levels of hierarchy in a parallel code; the levels are specified with the superscripts on the τ's

However, we extend their ideas to our problem of generating communications in task parallelism within a hierarchical code structure, which is more complex than theirs.

Assumptions

For our communication scheme to be valid, the following assumptions must be met for proper communications generation:

• all write regions are exact, in order to guarantee the coherence of data communications;


Figure 7.11: Equilevel and hierarchical communications

• non-uniform dependences are not handled;

• more data than necessary are communicated, but their coherence is guaranteed, thanks to the topological sorting of the tasks involved in communications; this ordering is also necessary to avoid deadlocks;

• the order in which the send and receive operations are performed is matched, in order to avoid deadlock situations;

• the HBDSC algorithm is applied without the memory constraint (M = ∞);

• the input shared-memory code is a structured parallel code, with spawn

and barrier constructs;

• all send operations are non-blocking, in order to avoid deadlocks. We presented in SPIRE (see Section 4.2.4 on page 60) how non-blocking communications can be implemented in SPIRE, using the blocking primitives within spawned statements. Therefore, we assume that we use an additional process Ps for performing non-blocking send communications, and a macro instruction asend for asynchronous send (one possible MPI realization is sketched after this list), as follows:

asend(dest, data) = spawn(Ps, send(dest, data))
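With MPI as the target message-passing library, asend could also be realized directly with a non-blocking send, instead of an extra sending process; the sketch below is one possible mapping, not the code generated by PIPS, and the buffer size, datatype and tag handling are simplified assumptions.

  #include <mpi.h>

  /* One possible realization of asend(dest, data): a non-blocking MPI send.
     Here data is a raw byte buffer; a real backend would derive the count
     and datatype from the transferred array region. */
  static MPI_Request asend(int dest, void *data, int nbytes) {
    MPI_Request request;
    MPI_Isend(data, nbytes, MPI_BYTE, dest, /* tag */ 0,
              MPI_COMM_WORLD, &request);
    return request;
  }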

7.3.4 Equilevel Communications Insertion Algorithm

Function equilevel_com, presented in Algorithm 23, first computes the equilevel data (array regions) to communicate from (when neighbors = predecessors) or to (when neighbors = successors) Si, inside an SDG G (a code sequence), using the function equilevel_regions presented in Algorithm 24.


After the call to the equilevel_regions function, equilevel_com creates a new sequence statement S′i, via the sequence function; S′i contains the receive communication calls, Si, and the asend communication calls. After that, it updates σ with S′i and returns it. Recall that the || symbol is used for the concatenation of two lists.

ALGORITHM 23: Equilevel communications for Task τi inside SDG G, updating a schedule σ

function equilevel_com(τi, G, σ, neighbors)
  Si = vertex_statement(τi)
  coms = equilevel_regions(Si, G, σ, neighbors)
  L = (neighbors = predecessors) ? coms : ∅
  L ||= [Si]
  L ||= (neighbors = successors) ? coms : ∅
  S′i = sequence(L)
  return σ[Si → (S′i, cluster(σ(Si)), nbclusters(σ(Si)))]
end

Function equilevel_regions traverses all neighbors τn (successors or predecessors) of τi (the task of Si) in order to generate asend or recv call statements for τi, if they are not in the same cluster. Neighbors are traversed in a topological-sort order, in order to impose an order between the multiple send or receive communication calls between each pair of clusters.

ALGORITHM 24: Array regions for Statement Si to be transferred inside SDG G (equilevel)

    function equilevel_regions(Si, G, σ, neighbors)
      κi = cluster(σ(Si));
      τi = statement_vertex(Si, G);
      coms = ∅;
      foreach τn ∈ topsort_ordered(G, neighbors(τi, G))
        Sn = vertex_statement(τn);
        κn = cluster(σ(Sn));
        if (κi ≠ κn)
          R = (neighbors = successors) ? transfer_data(τi, τn)
                                       : transfer_data(τn, τi);
          coms ||= com_calls(neighbors, R, κn);
      return coms;
    end

We thus guarantee the coherence of communicated data and avoid communication deadlocks. We use the function topsort_ordered, which first computes the topological sort of the transitive closure of G and then accesses the neighbors of τi in that order. For instance, the graph on the right of Figure 7.12 shows the result of the topological sort of the predecessors of τ4 in the graph on the left. Function equilevel_regions returns the list of communications performed inside G by Si.

Figure 7.12: A part of an SDG and the topological sort order of the predecessors of τ4.

Communication calls are generated in two steps (the two functions presented in Algorithm 25).

First, the transfer_data function is used to compute which data elements have to be communicated, in the form of array regions. It returns the elements that are involved in a RAW dependence between two tasks τ and τsucc. It filters the out and in regions of the two tasks in order to get these dependences; it uses edge_regions for this filtering. The edge_regions are computed during the construction of the SDG, in order to maintain the dependences that label the edges of the DDG (see Section 6.2.1 on page 95). The intersection between the two results is returned. For instance, in the graph on the right of Figure 7.12, if N is also written in τ1, filtering the out regions of τ1 avoids receiving N twice (N comes from τ1 and τ3).

Second, communication calls are constructed via the function com_calls, which, from the cluster κ of the statement source (asend) or sink (recv) of the communication and the set of data in terms of regions R, generates a list of communication calls coms. These communication calls are the result of the function region_to_coms, already implemented in PIPS [20]. This function converts a region into code made of send/receive instructions that represent the communications of the integer points contained in the given parameterized polyhedron of this region. An example is illustrated in Figure 7.13, where the write regions are exact and the other required assumptions are verified.

Note: we assume that Rr is combined with the path transformer between the statements of τ and τsucc.

ALGORITHM 25: Two functions to construct a communication call: data to transfer and call generation

    function transfer_data(τ, τsucc)
      R'o = R'i = ∅;
      Ro = out_regions(vertex_statement(τ));
      foreach ro ∈ Ro
        if (∃ r ∈ edge_regions(τ, τsucc) | entity(r) = entity(ro))
          R'o ∪= ro;
      Ri = in_regions(vertex_statement(τsucc));
      foreach ri ∈ Ri
        if (∃ r ∈ edge_regions(τ, τsucc) | entity(r) = entity(ri))
          R'i ∪= ri;
      return regions_intersection(R'o, R'i);
    end

    function com_calls(neighbors, R, κ)
      coms = ∅;
      i = cluster_number(κ);
      com = (neighbors = successors) ? "asend" : "recv";
      foreach r ∈ R
        coms ||= region_to_coms(r, com, i);
      return coms;
    end

In (b) of the figure, the intersection between the in regions of S2 and the out regions of S1 is converted into a loop of communication calls using the region_to_coms function. Note that the current version of our implementation prototype of region_to_coms is in fact unable to manage this somewhat complicated case. Indeed, the task running on Cluster 0 needs to know the value of the structure variable M to generate a proper communication with the task on Cluster 1 (see Figure 7.13 (b)); M would need to be sent by Cluster 1, where it is known, before the sending loop is run. Thus, our current implementation assumes that all variables in the code generated by region_to_coms are known in each communicating cluster. For instance, in Figure 7.13, with M replaced by N, region_to_coms performs correctly. This constraint was not a limitation in the experimental benchmarks tested in Chapter 8.

Also, in the part of the main function presented on the left of Figure 7.3, the statement MultiplY(Ixy, Gx, Gy), scheduled on Cluster number 2, needs to receive both Gx and Gy from Clusters number 1 and 0 (see Figure 6.16). The receive call list thus generated for this statement using our algorithm is shown after Figure 7.13. (Since the whole array is received, the region is converted into a simple instruction; no loop is needed.)

    S1:
    //  < A(φ1)-W-EXACT-{1 ≤ φ1, φ1 ≤ N} >
    //  < A[φ1]-OUT-MAY-{2 ≤ φ1, φ1 ≤ 2×M} >
    for(i = 1; i <= N; i++)
      A[i] = foo();

    S2:
    //  < A[φ1]-R-MAY-{2 ≤ φ1, φ1 ≤ 2×M} >
    //  < A[φ1]-IN-MAY-{2 ≤ φ1, φ1 ≤ 2×M} >
    for(i = 1; i <= M; i++)
      B[i] = bar(A[i*2]);

    (a)

    spawn(0,
      for(i = 1; i <= N; i++)
        A[i] = foo();
      for(i = 2; i <= 2*M; i++)
        asend(1, A[i]);
    )
    spawn(1,
      for(i = 2; i <= 2*M; i++)
        recv(0, A[i]);
      for(i = 1; i <= M; i++)
        B[i] = bar(A[i*2]);
    )

    (b)

Figure 7.13: Communication generation from a region, using the region_to_coms function.

    recv(1, Gy);
    recv(0, Gx);

7.3.5 Hierarchical Communications Insertion Algorithm

Data transfers are also needed between different levels of the code hierarchy, because each enclosed task requires data from its enclosing one, and vice versa. To construct these hierarchical communications, we introduce Function hierarchical_com, presented in Algorithm 26.

This function generates asend or recv call statements for each task S and its hierarchical parent (enclosing vertex), scheduled in Cluster κp. First, it constructs the list of transfers Rh that gathers the data to be communicated to the enclosed vertex; Rh is defined as the in_regions or out_regions of the task S. Then, from this set of transfers Rh, communication calls coms (recv from the enclosing vertex, or asend to it) are generated.

ALGORITHM 26: Hierarchical communications for Statement S to its hierarchical parent, clustered in κp

    function hierarchical_com(S, neighbors, σ, κp)
      Rh = (neighbors = successors) ? out_regions(S) : in_regions(S);
      coms = com_calls(neighbors, Rh, κp);
      seq =  ((neighbors = predecessors) ? coms : ∅)
          || [S]
          || ((neighbors = successors)   ? coms : ∅);
      S' = sequence(seq);
      σ = σ[S → (S', cluster(σ(S)), nbclusters(σ(S)))];
      return (σ, Rh);
    end

Finally, σ, updated with the new statement S', and the set of regions Rh, which represents the data to be communicated, this time on the side of the enclosing vertex, are returned.

For instance, in Figure 7.11, for the task τ0^(1), the data elements Rh to send from τ0^(1) to τ1^(0) are the out regions of the statement labeled τ0^(1). Thus, the hierarchical asend call from τ0^(1) to its hierarchical parent τ1^(0) is constructed using Rh as argument.

7.3.6 Communications Insertion Main Algorithm

Function communications_insertion, presented in Algorithm 27, constructs a new code in which asend and recv primitives are generated. It uses a hierarchical schedule σ that maps each substatement s of S to σ(s) = (s', κ, n) (see Section 6.5) to handle new statements. It also uses the hierarchical SDG mapping function H, which we use to recover the SDG of each statement (see Section 6.2).

Equilevel

To construct equilevel communications, all substatements Si of a sequence S are traversed. For each Si, communications are generated inside each sequence via two calls to the function equilevel_com presented above: one with the function neighbors equal to successors, to compute asend calls; the other with predecessors, to compute recv calls. This ensures that each send operation is matched to an appropriately ordered receive operation, avoiding deadlock situations. We use the function H that returns an SDG G for S; G is essential to access the dependences, and thus the communications, between the vertices τ of G whose statements s = vertex_statement(τ) are in S.

ALGORITHM 27: Communication primitives (asend and recv) insertion in SPIRE

    function communications_insertion(S, H, σ, κp)
      κ = cluster(σ(S));
      n = nbclusters(σ(S));
      switch (S)
      case call:
        return σ;
      case sequence(S1; ...; Sn):
        G = H(S);
        h_recvs = h_sends = ∅;
        foreach τi ∈ topsort(G)
          Si = vertex_statement(τi);
          κi = cluster(σ(Si));
          σ = equilevel_com(τi, G, σ, predecessors);
          σ = equilevel_com(τi, G, σ, successors);
          if (κp ≠ NULL and κp ≠ κi)
            (σ, h_Rrecv) = hierarchical_com(Si, predecessors, σ, κp);
            h_sends ||= com_calls(successors, h_Rrecv, κi);
            (σ, h_Rsend) = hierarchical_com(Si, successors, σ, κp);
            h_recvs ||= com_calls(predecessors, h_Rsend, κi);
          σ = communications_insertion(Si, H, σ, κi);
        S'  = parallel(σ(S));
        S'' = sequence(h_sends || [S'] || h_recvs);
        return σ[S' → (S'', κ, n)];
      case forloop(I, Elower, Eupper, Sbody):
        σ = communications_insertion(Sbody, H, σ, κp);
        (S'body, κbody, nbclustersbody) = σ(Sbody);
        return σ[S → (forloop(I, Elower, Eupper, S'body), κ, n)];
      case test(Econd, St, Sf):
        σ = communications_insertion(St, H, σ, κp);
        σ = communications_insertion(Sf, H, σ, κp);
        (S't, κt, nbclusterst) = σ(St);
        (S'f, κf, nbclustersf) = σ(Sf);
        return σ[S → (test(Econd, S't, S'f), κ, n)];
    end

Hierarchical

We also construct hierarchical communications, which represent data transfers between different levels of hierarchy, because each enclosed task requires data from its enclosing one. We use the function hierarchical_com presented above, which generates asend or recv call statements for each task S and its hierarchical parent (enclosing vertex), scheduled in Cluster κp. Thus, the argument κp is propagated in communications_insertion in order to keep the information about the cluster of the enclosing vertex correct. The returned set of regions, h_Rsend or h_Rrecv, which represents the data to be communicated at the start and end of the enclosed vertex τi, is used to construct, respectively, the send and recv communication calls on the side of the enclosing vertex of S. Note that hierarchical send communications need to be performed before running S', to ensure that subtasks can run. Moreover, no deadlocks can occur in this scheme, because the enclosing task executes send/receive statements sequentially, before spawning its enclosed tasks, which run concurrently.

The top level of a program presents no hierarchy. In order not to generate hierarchical communications that are not necessary, as an optimization, we use the test (κp ≠ NULL), knowing that, for the first call to the function communications_insertion, κp is set to NULL. This is necessary to prevent the generation of hierarchical communications for Level 0.

Example

As an example, we apply the communication generation phase to the code presented in Figure 7.5 to generate SPIRE code with communications; both equilevel (noted with // equilevel comments) and hierarchical (noted with // hierarchical comments) communications are illustrated on the right of Figure 7.14. Moreover, a graphical illustration of these communications is given in Figure 7.15.

7.3.7 Conclusion and Future Work

As said before, the work presented in this section is a preliminary and exploratory solution that we believe is useful for suggesting the potential of our task parallelization in the distributed-memory domain. It may also serve as a basis for future work to generate more efficient code than ours.

In fact, the solution proposed in this section may communicate more data than necessary, but their coherence is guaranteed thanks to the topological sorting of the tasks involved in communications at each level of generation: equilevel (ordering of the neighbors of every task in an SDG) and hierarchical (ordering of the tasks of an SDG). Moreover, as future work, we intend to optimize the generated code by eliminating redundant communications and by aggregating small messages into larger ones.

For instance, hierarchical communications transfer all out regions from the enclosed task to its enclosing task. A more efficient scheme would be to send not all out regions of the enclosed task, but only the regions that will not be modified afterward. This is another track of optimization, to eliminate redundant communications, that we leave for future work.

Moreover, we optimize the generated code further, as described after Figures 7.14 and 7.15.

    // Sequential C code
    for(i = 0; i < N; i++) {
      A[i] = 5;
      B[i] = 3;
    }
    for(i = 0; i < N; i++) {
      A[i] = foo(A[i]);
      C[i] = bar();
    }
    for(i = 0; i < N; i++)
      B[i] = baz(B[i]);
    for(i = 0; i < N; i++)
      C[i] += A[i] + B[i];

    // SPIRE distributed-memory representation
    barrier(
      spawn(0, forloop(i, 1, N, 1,
        asend(1, i);                 // hierarchical
        asend(2, i);                 // hierarchical
        barrier(
          spawn(1,
            recv(0, i);              // hierarchical
            B[i] = 3;
            asend(0, B[i]))          // hierarchical
          spawn(2,
            recv(0, i);              // hierarchical
            A[i] = 5;
            asend(0, A[i]))          // hierarchical
        );
        recv(1, B[i]);               // hierarchical
        recv(2, A[i]),               // hierarchical
        sequential);
      asend(1, B)))                  // equilevel
    barrier(
      spawn(0, forloop(i, 1, N, 1,
        asend(2, i);                 // hierarchical
        asend(2, A[i]);              // hierarchical
        asend(3, i);                 // hierarchical
        barrier(
          spawn(2,
            recv(0, i);              // hierarchical
            recv(0, A[i]);           // hierarchical
            A[i] = foo(A[i]);
            asend(0, A[i]))          // hierarchical
          spawn(3,
            recv(0, i);              // hierarchical
            C[i] = bar();
            asend(0, C[i]))          // hierarchical
        );
        recv(2, A[i]);               // hierarchical
        recv(3, C[i]),               // hierarchical
        sequential))
      spawn(1, recv(0, B);           // equilevel
        forloop(i, 1, N, 1,
          B[i] = baz(B[i]), sequential);
        asend(0, B)))                // equilevel
    barrier(
      spawn(0, recv(1, B);           // equilevel
        forloop(i, 1, N, 1,
          C[i] += A[i]+B[i], sequential)))

Figure 7.14: Equilevel and hierarchical communication generation, and SPIRE distributed-memory representation of a C code (sequential code above, SPIRE code below).

Figure 7.15: Graphical illustration of the communications made in the C code of Figure 7.14: equilevel in green arcs, hierarchical in black.

This optimization imposes that the current cluster execute the last nested spawn. The result of applying it to the code presented on the right of Figure 7.14 is illustrated on the left of Figure 7.16. We also eliminate barriers that contain a single spawn statement; the result is illustrated on the right of Figure 7.16.

7.4 Parallel Code Generation

SPIRE is designed to fit different programming languages, thus facilitating the transformation of codes for different types of parallel systems. Existing parallel language constructs can be mapped into SPIRE. We show in this section how this intermediate representation simplifies the targeting of various languages, via a simple mapping approach, and use the prettyprinted generation of OpenMP and MPI as an illustration.

7.4.1 Mapping Approach

Our mapping technique to a parallel language is made of three steps, namely (1) the generation of parallel tasks, by translating spawn attributes, and of synchronization statements, by translating barrier attributes, to this language; (2) the insertion of communications, by translating asend and recv primitives to this target language, if the latter uses a distributed-memory model; and (3) the generation of hierarchical (nested) parallelism, by exploiting the code hierarchy.

    // After imposing that the current cluster executes the last nested spawn
    barrier(
      forloop(i, 1, N, 1,
        asend(1, i);
        barrier(
          spawn(1,
            recv(0, i);
            B[i] = 3;
            asend(0, B[i]))
          A[i] = 5;
        );
        recv(1, B[i]),
        sequential)
    )
    barrier(
      spawn(1, forloop(i, 1, N, 1,
        asend(2, i);
        asend(2, A[i]);
        barrier(
          spawn(2,
            recv(0, i);
            recv(0, A[i]);
            A[i] = foo(A[i]);
            asend(0, A[i]))
          C[i] = bar();
        );
        recv(2, A[i]),
        sequential))
      forloop(i, 1, N, 1,
        B[i] = baz(B[i]),
        sequential)
    )
    barrier(
      forloop(i, 1, N, 1,
        C[i] += A[i]+B[i],
        sequential)
    )

    // After also removing barriers that contain a single spawn statement
    forloop(i, 1, N, 1,
      asend(1, i);
      barrier(
        spawn(1,
          recv(0, i);
          B[i] = 3;
          asend(0, B[i]))
        A[i] = 5;
      );
      recv(1, B[i]),
      sequential)
    barrier(
      spawn(1, forloop(i, 1, N, 1,
        asend(2, i);
        asend(2, A[i]);
        barrier(
          spawn(2,
            recv(0, i);
            recv(0, A[i]);
            A[i] = foo(A[i]);
            asend(0, A[i]))
          C[i] = bar();
        );
        recv(2, A[i]),
        sequential))
      forloop(i, 1, N, 1,
        B[i] = baz(B[i]),
        sequential)
    )
    forloop(i, 1, N, 1,
      C[i] += A[i]+B[i],
      sequential)

Figure 7.16: Equilevel and hierarchical communication generation after optimizations: the current cluster executes the last nested spawn (first listing), and barriers with a single spawn statement are removed (second listing).

For efficiency reasons, we assume that the creation of tasks is done once, at the beginning of programs; after that, we just synchronize the tasks, which are thus persistent.

Task Parallelism

The parallel code is generated in a fork/join manner: a barrier statement encloses spawned parallel tasks. Mapping SPIRE to an actual parallel language consists in transforming the barrier and spawn annotations of SPIRE into the equivalent calls or pragmas available in the target language.

Data Distribution

If the target language uses a shared-memory paradigm, no additional changes are needed to handle it. Otherwise, if it is a message-passing library/language, a pass inserting the appropriate asend and recv primitives of SPIRE is necessary (see Section 7.3). After that, we generate the equivalent communication calls of the corresponding language from SPIRE, by transforming asend and recv primitives into these equivalent communication calls.

Hierarchical Parallelism

Throughout this thesis, we take into account the hierarchical structure of code to achieve maximum efficiency. Indeed, exploiting the hierarchy of code when scheduling it via the HBDSC algorithm, within the SPIRE transformation pass via the unstructured_to_structured algorithm, and also when generating both equilevel and hierarchical communications, made it possible to handle hierarchical parallelism and thus to generate more parallelism. Moreover, we take into account the ability of parallel languages to implement hierarchical parallelism, and thus improve the performance of the generated parallel code. Figure 7.5 on page 138 illustrates zero and one levels of hierarchy in the generated SPIRE code (right): we have generated equilevel and hierarchical SPIRE(PIPS IR).

7.4.2 OpenMP Generation

OpenMP 3.0 has been extended with tasks and thus supports both task and hierarchical task parallelism for shared-memory systems. For efficiency, since the creation of threads is done once at the beginning of the program, the omp parallel pragma is used once in the program. This section details the generation of OpenMP task parallelism features using SPIRE.

Task Parallelism

OpenMP supports two types of task parallelism scheduling models: dynamic, via the omp task directive, and static, via the omp section directive (see Section 3.4.4 on page 39). In our implementation of the OpenMP back-end generation, we choose to generate omp task instead of omp section tasks, for four reasons, namely:

1. the HBDSC scheduling algorithm uses a static approach, and we want to combine the static results of HBDSC with the run-time dynamic scheduling of OpenMP provided when using omp task, in order to improve scheduling;

2. unlike the omp section directive, the omp task directive allows the implementation of recursive parallel tasks;

3. omp section may only be used within an omp sections construct, and thus we cannot implement a parallel loop using omp sections;

4. current implementations of OpenMP do not provide the possibility of nesting many omp sections, and thus cannot take advantage of the BDSC-based hierarchical parallelization (HBDSC) feature to exploit more parallelism.

As regards the use of the OpenMP task construct, we apply a simple process: a SPIRE statement annotated with the synchronization attribute spawn is decorated with the pragma omp task. An example is illustrated in Figure 7.17, where the SPIRE code is rewritten in OpenMP.

Synchronization

In OpenMP, the user can specify synchronization explicitly, to handle control and data dependences, using for example the omp single pragma. There, only one thread (omp single pragma) will encounter a task construct (omp task), to be scheduled and executed by any thread in the team (the set of threads created inside an omp parallel pragma). Therefore, a statement annotated with the SPIRE synchronization attribute barrier is decorated with the pragma omp single. An example is illustrated in Figure 7.17.

Hierarchical Parallelism

SPIRE handles multiple levels of parallelism, providing a trade-off between parallelism and synchronization overhead. This is taken into account at scheduling time, using our BDSC-based hierarchical scheduling algorithm. In order to generate hierarchical parallelism in OpenMP, the same process as above of translating spawn and barrier SPIRE annotations should be applied. However, synchronizing nested tasks (with synchronization attribute barrier) using the omp single pragma cannot be done, because of a restriction of OpenMP-compliant implementations: single regions may not be closely nested inside explicit task regions. Therefore, we use the pragma omp taskwait, as described after Figure 7.17.

    barrier(
      spawn(0, Gauss(Sxx , Ixx))
      spawn(1, Gauss(Syy , Iyy))
      spawn(2, Gauss(Sxy , Ixy))
    )

    #pragma omp single
    {
      #pragma omp task
      Gauss(Sxx , Ixx);
      #pragma omp task
      Gauss(Syy , Iyy);
      #pragma omp task
      Gauss(Sxy , Ixy);
    }

Figure 7.17: SPIRE representation of a part of Harris (top) and its generated OpenMP task code (bottom).

We use the pragma omp taskwait for synchronization when encountering a statement annotated with a barrier synchronization attribute. The second barrier on the left of Figure 7.18 implements nested parallel tasks; omp taskwait is then used (right of the figure) to synchronize the two tasks spawned inside the loop. Note also that a barrier statement enclosing no spawn statement (left of the figure) is omitted in the translation (right of the figure), for optimization purposes.

    barrier(
      spawn(0,
        forloop(i, 1, N, 1,
          barrier(
            spawn(1, B[i] = 3)
            spawn(2, A[i] = 5)
          ),
          sequential))
    )
    barrier(
      spawn(0,
        forloop(i, 1, N, 1,
          C[i] = A[i]+B[i],
          sequential))
    )

    #pragma omp single
    {
      for(i = 1; i <= N; i += 1) {
        #pragma omp task
        B[i] = 3;
        #pragma omp task
        A[i] = 5;
        #pragma omp taskwait
      }
      for(i = 1; i <= N; i += 1)
        C[i] = A[i]+B[i];
    }

Figure 7.18: SPIRE representation of a C code (top) and its generated hierarchical OpenMP task code (bottom).

7.4.3 SPMDization: MPI Generation

MPI is a message-passing library; it is widely used to program distributed-memory systems. We show here the generation of communications between processes. In our technique, we adopt a model often called "SPMDization" of parallelism, where processes are created once at the beginning of the program (static creation), each process executes the same code, and the number of processes remains unchanged during execution. This allows the process creation and synchronization overheads to be reduced. We stated in Section 3.4.5 on page 40 that MPI processes are created when the MPI_Init function is called; it also defines the universal intracommunicator MPI_COMM_WORLD, used by all processes to drive the various communications. We use the rank-of-process parameter rank0 to manage the relation between process and task; this rank is obtained by calling the function MPI_Comm_rank(MPI_COMM_WORLD, &rank0). This section details the generation of MPI using SPIRE.

Task Parallelism

A simple SPMDization technique is applied, where a statement with the synchronization attribute spawn of parameter entity0, which specifies the cluster number (a process, in the case of MPI), is guarded by an if statement on the condition rank0 == entity0. An example is illustrated in Figure 7.19.

Communication and Synchronization

Since MPI targets distributed-memory systems, we have to generate communications between the different processes. The MPI_Isend and MPI_Recv functions are called to replace the asend and recv communication primitives of SPIRE; MPI_Isend [3] starts a non-blocking send. These communication functions are sufficient to ensure the synchronization between different processes. An example is illustrated in Figure 7.19.

Hierarchical Parallelism

Most users rely only on MPI_COMM_WORLD as a communicator between MPI processes. This universal intracommunicator is not enough for handling hierarchical parallelism. MPI implementations offer the possibility of creating subset communicators via the function MPI_Comm_split [80], which creates new communicators from an initially defined communicator; each new communicator can be used at one level of hierarchy in the parallel code. Another scheme consists in gathering the code of each process and then assigning the code to the corresponding process.
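The following is a minimal sketch of how such sub-communicators could be derived for one hierarchy level; the color/key choice is an assumption made for illustration and is not part of our implementation, which generates flat MPI only.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank0, subrank, color;
        MPI_Comm level1_comm;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank0);

        /* Group processes two by two: each pair forms one enclosed level. */
        color = rank0 / 2;
        MPI_Comm_split(MPI_COMM_WORLD, color, rank0, &level1_comm);
        MPI_Comm_rank(level1_comm, &subrank);

        printf("global rank %d -> group %d, local rank %d\n",
               rank0, color, subrank);

        MPI_Comm_free(&level1_comm);
        MPI_Finalize();
        return 0;
    }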

In our implementation of the generation of MPI code from SPIRE code, we leave the implementation of hierarchical MPI as future work; we implement only flat MPI code. Another type of hierarchical MPI is obtained by merging MPI and OpenMP together (hybrid programming) [12]; this is also called multi-core-aware message passing (MPI). We also discuss this in the future work section (see Section 9.2).

    barrier(
      spawn(0,
        Gauss(Sxx , Ixx);
      )
      spawn(1,
        Gauss(Syy , Iyy);
        asend(0, Syy);
      )
      spawn(2,
        Gauss(Sxy , Ixy);
        asend(0, Sxy);
      )
    )

    MPI_Init (&argc , &argv );
    MPI_Comm_rank (MPI_COMM_WORLD , &rank0 );
    if (rank0 == 0)
      Gauss(Sxx , Ixx);
    if (rank0 == 1) {
      Gauss(Syy , Iyy);
      MPI_Isend(Syy , N*M, MPI_FLOAT , 0, MPI_ANY_TAG ,
                MPI_COMM_WORLD , &request );
    }
    if (rank0 == 2) {
      Gauss(Sxy , Ixy);
      MPI_Isend(Sxy , N*M, MPI_FLOAT , 0, MPI_ANY_TAG ,
                MPI_COMM_WORLD , &request );
    }
    MPI_Finalize ();

Figure 7.19: SPIRE representation of a part of Harris (top) and its generated MPI code (bottom).

7.5 Conclusion

The goal of this chapter is to lay the first brick of a framework of parallel code transformations and generation using our BDSC- and SPIRE-based hierarchical task parallelization techniques. We present two parallel transformations: first, from unstructured to structured parallelism, for a high-level abstraction of parallel constructs; and second, the conversion of shared-memory programs into distributed-memory ones. Both transformations are expressed by combining SPIRE constructs. Moreover, we show the generation of task parallelism and communications at both levels: equilevel and hierarchical. Finally, we generate both OpenMP and MPI from the same parallel IR derived from SPIRE, except for hierarchical MPI, which we leave as future work.

The next chapter provides experimental results using OpenMP and MPI programs generated via the methodology presented in this thesis, based on the HBDSC scheduling algorithm and SPIRE-derived parallel languages.

Chapter 8

Experimental Evaluation with PIPS

Experience is a lantern tied behind our backs, which illuminates the path. (Confucius)

In this chapter, we describe the integration of SPIRE and HBDSC within the PIPS compiler infrastructure, and their application to the parallelization of five significant programs, targeting both shared- and distributed-memory architectures: the image and signal processing benchmarks Harris and ABF, the SPEC2001 benchmark equake, the NAS parallel benchmark IS, and an FFT code. We present the experimental results we obtained for these applications; our experiments suggest that HBDSC's focus on efficient resource management leads to significant parallelization speedups on both shared- and distributed-memory systems, improving upon DSC results, as shown by the comparison of the sequential and parallelized versions of these five applications running on both OpenMP and MPI frameworks.

We also provide a comparative study between our task parallelization implementation in PIPS and that of the audio signal processing language Faust, using two Faust applications: Karplus32 and Freeverb.

8.1 Introduction

The HBDSC algorithm presented in Chapter 6 has been designed to offer better task parallelism extraction performance for parallelizing compilers than traditional list-scheduling techniques such as DSC. To verify its effectiveness, BDSC has been implemented in the PIPS compiler infrastructure and tested on programs written in C. Both the HBDSC algorithm and SPIRE are implemented in PIPS. Recall that HBDSC is the hierarchical layer that adapts a scheduling algorithm (DSC, BDSC...), which handles a DAG (sequence), to a hierarchical code. The analyses estimating the execution times of tasks and the dependences and communications between tasks, needed for computing tlevel and blevel, are also implemented in PIPS.

In this chapter, we provide experimental BDSC-vs-DSC comparison results based on the parallelization of five benchmarks, namely ABF [51], Harris [94], equake [22], IS [85] and FFT [8]. We chose these particular benchmarks since they are well known and exhibit task parallelism that we hope our approach will be able to take advantage of. Note that, since our focus in this chapter is to compare BDSC with prior work, namely DSC, our experiments in this chapter do not address the hierarchical component of HBDSC.

BDSC uses numerical constants to label its input graph; we show in Section 6.3 how we lift this restriction using static and dynamic approaches, based on array region and complexity analyses. In this chapter, we study experimentally the input sensitivity issues of the task and communication time estimations, in order to assess the scheduling robustness of BDSC, since it relies on the approximations provided by these analyses.

Several existing systems already automate the parallelization of programs targeting different languages (see Section 6.6), such as the non-open-source compiler OSCAR [62]. With the goal of comparing our approach with others, we choose the audio signal processing language Faust [86], because it is an open-source compiler and also automatically generates parallel tasks in OpenMP.

The remainder of this chapter is structured as follows. We detail the implementation of HBDSC in Section 8.2. Section 8.3 presents the scientific benchmarks under study (Harris, ABF, equake, IS and FFT), the two target machines that we use for our measurements, and also the effective cost model parameters that depend on each target machine. Section 8.4 explains the process we follow to parallelize these benchmarks. Section 8.5 provides performance results when these benchmarks are parallelized on the PIPS platform, targeting a shared-memory architecture and a distributed-memory architecture. We also assess the sensitivity of our parallelization technique to the accuracy of the static approximations of the code execution time used in task scheduling, in Section 8.6. Section 8.7 provides a comparative study between our task parallelization implementation in PIPS and that of Faust, when dealing with audio signal processing benchmarks.

8.2 Implementation of HBDSC- and SPIRE-Based Parallelization Processes in PIPS

We have implemented the HBDSC parallelization (Chapter 6) and some SPIRE-based parallel code transformation and generation (Chapter 7) algorithms in PIPS. PIPS is structured as a collection of independent code analysis and transformation phases, which we call passes. This section describes the order and the composition of the passes that we have implemented in PIPS to enable the parallelization of sequential C77 language programs; Figure 8.1 summarizes the parallelization processes.

These passes are managed by the Pipsmake library (http://cri.ensmp.fr/PIPS/pipsmake.html). Pipsmake manages objects that are resources stored in memory and/or on disk.

Figure 8.1: Parallelization process: C source code → PIPS analyses (complexities, convex array regions) → DDG → Sequence Dependence Graph (SDG) → cost models (static, instrumentation, numerical profile) → HBDSC → unstructured SPIRE(PIPS IR) schedule → structuring → structured SPIRE(PIPS IR) → OpenMP / MPI generation → OpenMP and MPI codes. Blue indicates thesis contributions, an ellipse a pass or a set of passes, and a rectangle results.

The passes are described by generic rules that indicate used and produced resources; examples of these rules are illustrated in the following subsections. Pipsmake maintains global data consistency intra- and inter-procedurally.

The goal of the following passes is to generate the parallel code expressed in SPIRE(PIPS IR) using the HBDSC scheduling algorithm, then communication primitives, and finally OpenMP and MPI codes, in order to automate the task parallelization of sequential programs.

Tpips (http://www.cri.ensmp.fr/pips/tpips-user-manual.htdoc/tpips-user-manual.pdf) is the command-line interface and scripting language of the PIPS system. The execution of the necessary passes is obtained by typing their names in the command-line interpreter of PIPS, tpips. Figure 8.2 depicts harris.tpips, the script used to execute our shared- and distributed-memory parallelization processes on the benchmark Harris; it chains the passes described in the following subsections.

    # Create a workspace for the program harris.c
    create harris harris.c

    apply SEQUENCE_DEPENDENCE_GRAPH[main]
    shell dot -Tpng harris.database/main/main_sdg.dot > main_sdg.png
    shell gqview main_sdg.png

    setproperty HBDSC_NB_CLUSTERS 3
    apply HBDSC_PARALLELIZATION[main]
    apply SPIRE_UNSTRUCTURED_TO_STRUCTURED[main]

    # Print the result on stdout
    display PRINTED_FILE[main]

    echo OMP style
    activate OPENMP_TASK_GENERATION
    activate PRINT_PARALLELIZEDOMPTASK_CODE
    display PARALLELPRINTED_FILE[main]

    echo MPI style
    apply HBDSC_GEN_COMMUNICATIONS[main]
    activate MPI_TASK_GENERATION
    activate PRINT_PARALLELIZEDMPI_CODE
    display PARALLELPRINTED_FILE[main]

    # Print the scheduled SDG of harris
    shell dot -Tpng harris.database/main/main_ssdg.dot > main_ssdg.png
    shell gqview main_scheduled_ssdg.png

    close
    quit

Figure 8.2: Executable tpips script (harris.tpips) for harris.c.

8.2.1 Preliminary Passes

Sequence Dependence Graph Pass

Pass sequence_dependence_graph generates the Sequence Dependence Graph (SDG), as explained in Section 6.2. It uses in particular the dg resource (data dependence graph) of PIPS and outputs the resulting SDG in the resource sdg.

    sequence_dependence_graph       > MODULE.sdg
            < PROGRAM.entities
            < MODULE.code
            < MODULE.proper_effects
            < MODULE.dg
            < MODULE.regions
            < MODULE.transformers
            < MODULE.preconditions
            < MODULE.cumulated_effects

Code Instrumentation Pass

We model the communication costs, data sizes and execution times of the different tasks with polynomials. These polynomials have to be converted to numerical values in order to be used by the BDSC algorithm. We thus instrument the input sequential code, using the pass hbdsc_code_instrumentation, and run it once, in order to obtain the numerical values of the polynomials. The instrumented code contains the initial user code plus statements that compute the values of the cost polynomials for each statement (as detailed in Section 6.3.2).
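As a purely illustrative sketch of what such instrumentation may look like (the cost polynomial, variable names and output format below are hypothetical, not the actual PIPS output), each instrumented statement is preceded by code that evaluates its cost polynomial with the run-time values of its parameters:

    #include <stdio.h>

    /* Hypothetical shape of instrumented code for one loop nest whose
       complexity polynomial is assumed to be 3*N*M + 5*N. */
    #define N 100
    #define M 200

    double A[N][M];
    double foo(double x) { return x + 1.0; }

    int main(void) {
        FILE *instrumentation_file = fopen("instrumentation.out", "w");
        /* emitted instrumentation: evaluate the cost polynomial */
        fprintf(instrumentation_file, "task_time = %ld\n",
                (long)(3L * N * M + 5L * N));
        /* original user statement */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++)
                A[i][j] = foo(A[i][j]);
        fclose(instrumentation_file);
        return 0;
    }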

    hbdsc_code_instrumentation      > MODULE.code
            < PROGRAM.entities
            < MODULE.code
            < MODULE.proper_effects
            < MODULE.dg
            < MODULE.regions
            < MODULE.summary_complexity

Path Transformer Pass

The path transformer (see Section 6.4) between two statements computes the possible changes performed by a piece of code delimited by two statements, Sbegin and Send, enclosed within a statement S. The goal of the following pass is to compute the path transformer between the two statements labeled sbegin and send in the code resource. It outputs a resource file that contains the resulting transformer.

    path_transformer                > MODULE.path_transformer_file
            < PROGRAM.entities
            < MODULE.code
            < MODULE.proper_effects
            < MODULE.transformers
            < MODULE.preconditions
            < MODULE.cumulated_effects

8.2.2 Task Parallelization Passes

HBDSC Task Parallelization Pass

Pass hbdsc_parallelization applies HBDSC to the resulting SDG (the input resource dg) and generates parallel code in an unstructured form, using the resource dg implementing the scheduled SDG. Resource schedule implements our hierarchical schedule function σ (see Section 6.5): it maps each statement to a given cluster. This pass implements the HBDSC algorithm presented in Section 6.5, Algorithm 19. Note that Algorithm 19 uses the unstructured construct instead of the scheduled SDG to represent unstructured parallelism; our prototype uses the SDG instead, for historical reasons. This pass uses the analyses of regions, preconditions, transformers, complexities and effects, and outputs the scheduled SDG in the resource dg and the function σ in the resource schedule.

    hbdsc_parallelization           > MODULE.sdg
                                    > MODULE.schedule
            < PROGRAM.entities
            < MODULE.code
            < MODULE.proper_effects
            < MODULE.sdg
            < MODULE.regions
            < MODULE.summary_complexity
            < MODULE.transformers
            < MODULE.preconditions
            < MODULE.cumulated_effects

Parallelism Structuring Pass

Pass hbdsc_parallelization generates unstructured parallelism; the goal of the following pass is to structure this parallelism. Pass spire_unstructured_to_structured transforms the parallelism generated in the form of graphs (SDGs) into spawn and barrier constructs. It uses the scheduled SDG and the schedule function σ (the resource schedule), generates a new (structured) code, and updates the schedule resource, since it adds to the schedule the newly created statements of the new code. This pass implements Algorithm unstructured_to_structured, presented in Section 7.2, Algorithm 21.

    spire_unstructured_to_structured  > MODULE.code
                                      > MODULE.schedule
            < PROGRAM.entities
            < MODULE.code
            < MODULE.sdg
            < MODULE.schedule

HBDSC Tuning

In PIPS, one can use different properties to parameterize passes and select the precision required or the algorithm to use. Default properties are defined by PIPS, but they can be redefined by the user, using the command setproperty when the tpips interface is used.

The following properties are used by Pass hbdsc_parallelization to let the user define the number of clusters and the memory size, based on the characteristics of the target architecture. The number of clusters is set by default to 4, and the memory size to -1, which means that we have infinite memory.

    HBDSC_NB_CLUSTERS 4
    HBDSC_MEMORY_SIZE -1

In order to control the granularity of parallel tasks, we introduce the following property. By default, we generate the maximum parallelism in codes; otherwise, i.e. when COSTLY_TASKS_ONLY is TRUE, only loops and function calls are used as parallel tasks. This is important in order to make a trade-off between task creation overhead and task execution time.

    COSTLY_TASKS_ONLY FALSE

The next property specifies whether communication data and time information is to be extracted from the result of the instrumentation and stored in the HBDSC_INSTRUMENTED_FILE file (a dynamic analysis), or whether the static cost models have to be used. The default option is an empty string; it means that static results have to be used.

    HBDSC_INSTRUMENTED_FILE ""

8.2.3 OpenMP Related Passes

SPIRE is designed to help generate code for different types of parallel languages, just as different parallel languages can be mapped into SPIRE. The four following passes show how this intermediate representation simplifies the task of producing code for various languages, such as OpenMP and MPI (see Section 8.2.4).

OpenMP Task Generation Pass

Pass openmp_task_generation generates OpenMP task parallel code from SPIRE(PIPS IR), as detailed in Section 7.4.2. It replaces spawn constructs by the OpenMP directive omp task, and barrier constructs by omp single or omp taskwait. It outputs a new resource, parallelized_code, for the OpenMP code.

    openmp_task_generation          > MODULE.parallelized_code
            < PROGRAM.entities
            < MODULE.code

OpenMP Prettyprint Pass

The next pass, print_parallelizedOMPTASK_code, is a prettyprinting function. It outputs the code decorated with OpenMP (omp task) directives, using the parallelized_code resource.

    print_parallelizedOMPTASK_code  > MODULE.parallelprinted_file
            < PROGRAM.entities
            < MODULE.parallelized_code

8.2.4 MPI Related Passes: SPMDization

Automatic Data Distribution Pass

Pass hbdsc_gen_communications generates send and recv primitives. It implements the communications insertion algorithm presented in Section 7.3, Algorithm 27. It takes as input the parallel code (< symbol) and the preceding resources of the sequential code (= symbol), and generates a new parallel code that contains the communication calls.

    hbdsc_gen_communications        > MODULE.code
            < PROGRAM.entities
            < MODULE.code
            = MODULE.dg
            = MODULE.schedule
            = MODULE.proper_effects
            = MODULE.preconditions
            = MODULE.regions

MPI Task Generation Pass

The same process applies for MPI. Pass mpi_task_generation generates MPI parallel code from SPIRE(PIPS IR), as detailed in Section 7.4.3. It replaces spawn constructs by guards (if statements), barrier constructs by MPI_Barrier, send functions by MPI_Isend, and recv functions by MPI_Recv. Note that we leave the issue of generating hierarchical MPI as future work; our implementation generates only flat MPI code (a sequence of non-nested tasks).

    mpi_task_generation             > MODULE.parallelized_code
            < PROGRAM.entities
            < MODULE.code

MPI Prettyprint Pass

Pass print_parallelizedMPI_code outputs the code decorated with MPI instructions.

    print_parallelizedMPI_code      > MODULE.parallelprinted_file
            < PROGRAM.entities
            < MODULE.parallelized_code

We generate MPI code in the format described in Figure 8.3.

    /* declare variables */
    MPI_Init (&argc , &argv );
    /* parse arguments */
    /* main program */
    MPI_Finalize ();

Figure 8.3: A model of the generated MPI code.

8.3 Experimental Setting

We carried out experiments to compare the DSC and BDSC algorithms. BDSC provides significant parallelization speedups on both shared- and distributed-memory systems on the set of benchmarks that we present in this section. We also present the cost model parameters that depend on the target machine.

8.3.1 Benchmarks: ABF, Harris, Equake, IS and FFT

For comparing DSC and BDSC on both shared- and distributed-memory systems, we use five benchmarks, ABF, Harris, Equake, IS and FFT, briefly presented below.

ABF

The signal processing benchmark ABF (Adaptive Beam Forming) [51] is a 1,065-line program that performs adaptive spatial radar signal processing for an array of smart radar antennas, in order to transmit or receive signals in different directions and enhance these signals by minimizing interference and noise. It has been developed by Thales [104] and is publicly available. Figure 8.4 shows the scheduled SDG of the function main of this benchmark on three clusters (κ0, κ1 and κ2). This function contains 7 parallel loops, while the degree of task parallelism is only three (there are at most three parallel tasks). In Section 8.4, we show how we use these loops to create more parallel tasks.

Harris

The image processing algorithm Harris [53] is a 105-line corner detector used to identify points of interest in an image. It uses several functions (operators such as auto-correlation) applied to each pixel of an input image. It is used for feature extraction, motion detection and image matching. In Harris [94], at most three tasks can be executed in parallel. Figure 6.16 on page 122 shows the scheduled SDG for Harris obtained using three processors (aka clusters).

Equake (SPEC Benchmarks)

The 1,432-line SPEC benchmark equake [22] is used in the simulation of seismic wave propagation in large valleys, based on computations performed on an unstructured mesh that locally resolves wavelengths, using the Archimedes finite element method. Equake contains four hot loops: three in the function main and one in the function smvp_opt. A part of the function main is illustrated in Figure 6.5 on page 99. We use these four loops for parallelization.

IS (NAS Parallel Benchmarks)

Integer Sort (IS) is one of the eleven benchmarks in the NAS Parallel Benchmarks suite [85]. The serial version contains 1,076 lines. We exploit the hot function rank of IS, which contains a degree of task parallelism of two and two parallel loops.

FFT

The Fast Fourier Transform (FFT) is widely used in many areas of science, mathematics and engineering. We use the FFT implementation of [8]. In FFT, for N complex points, both time-domain decomposition and frequency-domain synthesis require three loops. The outer loop runs through the log2 N stages. The middle loop moves through each of the individual frequency spectra (chunks) in the stage being worked on. The innermost loop uses the butterfly pattern to compute the points of each frequency of the spectrum. We use the middle loop for parallelization, since it is parallel (each chunk can be processed in parallel).

Figure 8.4: Scheduled SDG of the main function of ABF.

8.3.2 Experimental Platforms: Austin and Cmmcluster

We use the two following experimental platforms for our performance evaluation tests.

Austin

The first target machine is a Linux (Ubuntu) shared-memory machine with a 2-socket AMD quadcore Opteron (8 cores), with M = 16 GB of RAM, running at 2.4 GHz. We call this machine Austin in the remainder of this chapter. gcc 4.6.3 is installed, with support for OpenMP 3.0 (we use gcc to compile the OpenMP programs, with the -fopenmp and -O3 flags).

Cmmcluster

The second machine that we target for our performance measurements is a Linux (RedHat) distributed-memory machine with 6 dual-core Intel(R) Xeon(R) processors, with M = 32 GB of RAM per processor, running at 2.5 GHz. We call this machine Cmmcluster in the remainder of this chapter. gcc 4.4.6 and Open MPI 1.6.2 are installed and used for compilation (we use mpicc to compile the MPI programs, with the -O3 optimization flag).

8.3.3 Effective Cost Model Parameters

In the default cost model of PIPS, the number of clock cycles per instruction (CPI) is equal to 1, and the number of clock cycles for the transfer of one byte is β = 1. Instead of using this default cost model, we use an effective model for each target machine, Austin and Cmmcluster. Table 8.1 summarizes the cost model for the integer and floating-point operations of addition, subtraction, multiplication and division, and the transfer cost of one byte. The parameters for the arithmetic operations of this table are extracted from agner.org (http://www.agner.org/optimize/instruction_tables.pdf), which gives measurements of CPU clock cycles for AMD (Austin) and Intel (Cmmcluster) CPUs. The transfer cost of one byte, in number of cycles, β, has been measured by using a simple MPI program that sends a number of bytes over the Cmmcluster memory network. Since Austin is a shared-memory machine, which we use to execute OpenMP codes, we do not need to measure β for this machine (N/A).
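The sketch below illustrates the kind of micro-benchmark that can be used to estimate β; the message size, the single round trip and the conversion with the 2.5 GHz clock of Cmmcluster are assumptions made for illustration, not the exact program used in the thesis.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        int rank, n = 1 << 20;                 /* 1 MB message */
        char *buf = calloc(n, 1);
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        t0 = MPI_Wtime();
        if (rank == 0) {                       /* one round trip between ranks 0 and 1 */
            MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
        t1 = MPI_Wtime();

        if (rank == 0)   /* beta in cycles/byte = one-way time * clock frequency / bytes */
            printf("beta ~ %g cycles/byte\n", (t1 - t0) / 2.0 * 2.5e9 / n);

        MPI_Finalize();
        free(buf);
        return 0;
    }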

8.4 Protocol

We have extended PIPS with our implementation, in C, of the BDSC-based hierarchical scheduling algorithm (HBDSC). To compute the static execution time and communication cost estimates needed by HBDSC, we relied upon the PIPS run-time complexity analysis and a more realistic, architecture-dependent communication cost matrix (see Table 8.1).

    Operation                   Austin (int / float)    Cmmcluster (int / float)
    Addition (+)                      25 / 2                     3 / 3
    Subtraction (-)                   25 / 2                     3 / 3
    Multiplication (×)                 3 / 2                    11 / 3
    Division (/)                      40 / 30                   46 / 39
    One-byte transfer (β)            N/A / N/A                  25 / 25

Table 8.1: CPI for the Austin and Cmmcluster machines, plus the transfer cost of one byte, β, on the Cmmcluster machine (in cycles).

For each code S of our test benchmarks, PIPS performed automatic task parallelization and applied our hierarchical scheduling process HBDSC(S, P, M, ⊥) (using either BDSC or DSC) to these sequential programs, to yield a schedule σunstructured (unstructured code). Then PIPS transforms this schedule, via unstructured_to_structured(S, P, σunstructured), to yield a new schedule σstructured. PIPS automatically generated an OpenMP [4] version from the parallel statements encoded in SPIRE(PIPS IR) in σstructured(S), using omp task directives; another version, in MPI [3], was also generated. We also applied the DSC scheduling process to these benchmarks and generated the corresponding OpenMP and MPI codes. The execution times have been measured using the gettimeofday time function.
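For reference, the measurement wrapper is nothing more than wall-clock timing around the region of interest; the sketch below, with a dummy workload, only illustrates the use of gettimeofday and is not the thesis harness itself.

    #include <stdio.h>
    #include <sys/time.h>

    static double wall_time(void) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    int main(void) {
        double start = wall_time();
        volatile double acc = 0.0;            /* dummy workload under measurement */
        for (long i = 0; i < 100000000L; i++)
            acc += 1e-9;
        double elapsed = wall_time() - start;
        printf("elapsed: %.6f s (acc = %f)\n", elapsed, acc);
        return 0;
    }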

Compilation times for these benchmarks were quite reasonable, the longest (equake) being 84 seconds. In this last instance, most of the time (79 seconds) was spent by PIPS to gather semantic information such as regions, complexities and dependences; our prototype implementation of HBDSC is only responsible for the remaining 5 seconds.

We ran all these parallelized codes on the two shared- and distributed-memory computing systems presented in Section 8.3.2. To increase the available coarse-grain task parallelism in our test suite, we have used both unmodified and modified versions of our benchmarks. We tiled and fully unrolled by hand the most costly loops in ABF (7 loops), FFT (2 loops) and equake (4 loops); the tiling factor for the BDSC version depends on the number of available processors P (tile size = number of loop iterations / P), while we had to find the proper one for DSC, since DSC puts no constraints on the number of needed processors but returns the number of processors its scheduling requires. For Harris and IS, our experiments have looked at both tiled and untiled versions of the benchmarks. The full unrolling that we apply for our experiments creates a sequence of statements from a nested loop. An example of loop tiling and full unrolling of a loop of the ABF benchmark is illustrated in Figure 8.5. The function COR estimates covariance based on a part of the received signal sel_out, averaged on Nb_rg range gates and nb_pul pulses. Note that the applied loop tiling is equivalent to loop chunking [82].

    /* (a) Parallel loop on Function COR */
    for (i = 0; i < 19; i++)
      COR(Nb_rg , nb_pul , sel_out[i], CORR_out[i]);

    /* (b) Loop block tiling on Function COR (tile size = 5 for P = 4) */
    for (i = 0; i < 19; i += 5)
      for (j = i; j < min(i+5, 19); j++)
        COR(Nb_rg , nb_pul , sel_out[j], CORR_out[j]);

    /* (c) Full loop unrolling on Function COR */
    for (j = 0; j < 5; j++)
      COR(Nb_rg , nb_pul , sel_out[j], CORR_out[j]);
    for (j = 5; j < 10; j++)
      COR(Nb_rg , nb_pul , sel_out[j], CORR_out[j]);
    for (j = 10; j < 15; j++)
      COR(Nb_rg , nb_pul , sel_out[j], CORR_out[j]);
    for (j = 15; j < 19; j++)
      COR(Nb_rg , nb_pul , sel_out[j], CORR_out[j]);

Figure 8.5: Steps of the rewriting of data parallelism into task parallelism, using the tiling and unrolling transformations.

For our experiments, we use M = 16 GB for Austin and M = 32 GB for Cmmcluster. Thus, the schedules used for our experiments satisfy these memory sizes.

8.5 BDSC vs. DSC

We apply our parallelization process to the benchmarks ABF, Harris, Equake and IS, and execute the parallel versions on both the Austin and Cmmcluster machines. As for FFT, we get an OpenMP version and study how more performance could be extracted using streaming languages.

8.5.1 Experiments on Shared Memory Systems

We measured the execution time of the parallel OpenMP codes on P = 1, 2, 4, 6 and 8 cores of Austin.

Figure 8.6 shows the performance results of the generated OpenMP code for the two versions scheduled using DSC and BDSC, on ABF and equake. The speedup data show that the DSC algorithm is not scalable on these examples when the number of cores is increased; this is due to the generation of more clusters (task creation overhead) with empty slots (poor potential parallelism and bad load balancing) than with BDSC, a costly decision given that, when the number of clusters exceeds P, they have to share the same core as multiple threads.

Figure 8.6: ABF and equake speedups with OpenMP, for 1, 2, 4, 6 and 8 cores (sequential times: ABF 0.35 s, equake 230 s); series: ABF-DSC, ABF-BDSC, equake-DSC, equake-BDSC.

Figure 8.7 presents the speedups obtained using P = 3, since the maximum parallelism in Harris is three, assuming no exploitation of data parallelism, for two parallel versions: BDSC with and BDSC without tiling of the kernel CoarsitY (we tiled by 3). The performance is given for three different input image sizes: 1024×1024, 2048×1024 and 2048×2048. The best speedup corresponds to the tiled version with BDSC, because in this case the three cores are fully loaded. The DSC version (not shown in the figure) yields the same results as our versions, because the code can be scheduled using three cores.

Figure 8.8 shows the performance results of the generated OpenMP code on the NAS benchmark IS after applying BDSC. The maximum task parallelism without tiling in IS is two, which is shown in the first subchart; the other subcharts are obtained after tiling. The program has been run with three IS input classes (A, B and C [85]). The bad performance of our implementation for Class A programs is due to the large task creation overhead, which dwarfs the potential parallelism gains, even more limited here because of the small size of Class A data.

As for FFT, after tiling and fully unrolling the middle loop of FFT and parallelizing it, an OpenMP version is generated using PIPS. Since we only exploit the parallelism of the middle loop, this is not sufficient to obtain good performance: indeed, a maximum speedup of 1.5 is obtained for an FFT of size n = 2^20 on Austin (8 cores). The sequential outer loop of log2 n iterations makes the performance that bad.

Figure 8.7: Speedups with OpenMP: impact of tiling (P = 3), for the Harris-OpenMP and Harris-tiled-OpenMP versions on image sizes 1024×1024, 2048×1024 and 2048×2048 (sequential times: 183 ms, 345 ms and 684 ms, respectively).

Figure 8.8: Speedups with OpenMP for different class sizes (IS), for the versions OMP (2 tasks), OMP-tiled-2, OMP-tiled-4, OMP-tiled-6 and OMP-tiled-8 (Classes A, B and C).

However, our technique can be seen as a first step of parallelization for languages such as OpenStream.

OpenStream [92] is a data-flow extension of OpenMP in which dynamic dependent tasks can be defined. In [91], an efficient version of the FFT is provided where, in addition to parallelizing the middle loop (data parallelism), possible streaming parallelism between the outer loop stages is exploited: once a chunk has been executed, the chunks whose dependences are satisfied in the next stage start executing (pipeline parallelism). This way, a maximum speedup of 6.6 is obtained on a 4-socket AMD quadcore Opteron with 16 cores, for an FFT of size n = 2^20 [91].

Our parallelization of FFT does not automatically generate pipeline parallelism, but it automatically generates parallel tasks that may also be coupled as streams. In fact, to use OpenStream, the user has to write the input streaming code manually. We can say that, thanks to the code generated using our approach, where the parallel tasks are automatically generated, the user just has to connect these tasks, by providing (manually) the data flow necessary to exploit additional potential pipeline parallelism.

8.5.2 Experiments on Distributed Memory Systems

We measured the execution time of the parallel codes on P = 1, 2, 4 and 6 processors of Cmmcluster.

Figure 8.9 presents the speedups of the parallel MPI versions of ABF and equake vs. the sequential ones, using P = 2, 4 and 6 processors. As before, the DSC algorithm is not scalable when the number of processors is increased, since the generation of more clusters with empty slots leads to a higher process scheduling cost on the processors and a higher communication volume between them.

Figure 8.9: ABF and equake speedups with MPI, for 1, 2, 4 and 6 processors (sequential times: ABF 0.25 s, equake 160 s); series: ABF-DSC, ABF-BDSC, equake-DSC, equake-BDSC.

Figure 810 presents the speedups of the parallel MPI vs sequentialversions of Harris using three processors The tiled version with BDSC givesthe same result as the non-tiled version since the communication overheadis so important when the three tiled loops are scheduled on three differentprocessors that BDSC scheduled them on the same processor this led thusto a schedule equivalent to the one of the non-tiled version Compared toOpenMP the speedups decrease when the image size is increased because theamount of communication between processors increases The DSC version(not shown on the figure) gives the same results as the BDSC version becausethe code can be scheduled using three processors

Figure 811 shows the performance results of the generated MPI codeon the NAS benchmark IS after application of BDSC The same analysis asthe one for OpenMP applies here in addition to communication overheadissues


Figure 8.10: Speedups with MPI: impact of tiling (P=3). [Bar chart: speedup of the Harris-MPI and Harris-tiled-MPI versions over the sequential one, using 3 processes, for image sizes 1024x1024, 2048x1024 and 2048x2048 (sequential times: 97 ms, 244 ms and 442 ms, respectively).]

Figure 8.11: Speedups with MPI for different class sizes (IS). [Bar chart: speedup of the MPI (2 tasks), MPI-tiled-2, MPI-tiled-4 and MPI-tiled-6 versions over the sequential one, for Class A (sequential = 0.26 s), Class B (sequential = 1.69 s) and Class C (sequential = 13.57 s).]

8.6 Scheduling Robustness

Since our BDSC scheduling heuristic relies on numerical approximations of the execution and communication times of tasks, one needs to assess its sensitivity to the accuracy of these estimations. Since a mathematical analysis of this issue is made difficult by the heuristic nature of BDSC, and in fact of scheduling processes in general, we provide below experimental data showing that our approach is rather robust.

In practice, we ran multiple versions of each benchmark, using various static execution and communication cost models:

• the naive variant, in which all execution times and communication costs are assumed constant (only data dependences are enforced during the scheduling process);

• the default BDSC cost model described above;


• a biased BDSC cost model, where we modulated each execution time and communication cost value randomly by at most ∆ (the default BDSC cost model thus corresponds to ∆ = 0).

Our intent, with the introduction of these different cost models, is to assess how small to large differences in our estimation of task times and communication costs impact the performance of BDSC-scheduled parallel code. We would expect parallelization based on the naive cost model to yield the worst schedules, thus motivating our use of complexity analysis for parallelization purposes, if the schedules that use our default cost model are indeed better. Adding small random biases to task times and communication costs should not modify the schedules much (demonstrating stability), while adding larger ones might, showing the quality of the default cost model used for parallelization.
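As a concrete illustration, the following sketch shows one plausible way of implementing such a biased cost model. It is not the actual PIPS code; the function name biased_cost is hypothetical, and we assume here that the perturbation is a uniform relative factor of at most ∆ percent around the default estimation.

    #include <stdlib.h>

    /* Perturb a statically estimated cost (task time or edge cost, in cycles)
       by a random relative factor of at most +/- delta_percent;
       delta_percent = 0 recovers the default BDSC cost model. */
    double biased_cost(double estimated_cost, double delta_percent)
    {
        double r = (double) rand() / RAND_MAX;                       /* uniform in [0, 1] */
        double factor = 1.0 + (2.0 * r - 1.0) * delta_percent / 100.0;
        double biased = estimated_cost * factor;
        return biased > 0.0 ? biased : 0.0;                          /* keep costs non-negative */
    }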

Table 8.2 provides, for four benchmarks (Harris, ABF, equake and IS) and two execution environments (OpenMP and MPI), the worst execution time obtained within batches of about 20 runs of programs scheduled using the naive, default and biased cost models. For this last case, we only kept in the table the entries corresponding to significant values of ∆, namely those at which, for at least one benchmark, the running time changed. So, for instance, when running ABF on OpenMP, the naive approach run time is 321 ms, while BDSC clocks at 214 ms; adding random increments to the task communication and execution estimations provided by our cost model (Section 6.3) of up to, but not including, 80% does not change the scheduling and thus the running time. At 80%, the running time increases to 230 ms, and it reaches 297 ms when ∆ = 3,000%.

Benchmark  Language  BDSC  Naive  ∆ = 50%   80%   100%   200%   1000%   3000%
Harris     OpenMP     153    277       153   153    153    153     153     277
           MPI        303    378       303   303    303    303     303     378
ABF        OpenMP     214    321       214   230    230    246     246     297
           MPI        240    310       240   260    260    287     287     310
equake     OpenMP      58    134        58    58     80     80      80     102
           MPI        106    206       106   106    162    162     162     188
IS         OpenMP      16     35        20    20     20     25      25      29
           MPI         25     50        32    32     32     39      39      46

Table 8.2: Run-time sensitivity of BDSC with respect to static cost estimation (in ms for Harris and ABF, in s for equake and IS).

As expected, the naive variant always provides schedules that have the worst execution times, thus motivating the introduction of performance estimation in the scheduling process. Even more interestingly, our experiments show that one needs to introduce rather large task time and communication cost estimation errors, i.e., values of ∆, to make the BDSC-based scheduling process switch to less efficient schedules. This set of experimental data thus suggests that BDSC is a rather useful and


robust heuristic, well adapted to the efficient parallelization of scientific benchmarks.

8.7 Faust Parallel Scheduling vs. BDSC

In music the passions enjoy themselves. (Friedrich Nietzsche)

Faust (Functional AUdio STream) [86] is a programming language for real-time signal processing and synthesis. FAUST programs (DSP code) are translated into equivalent imperative programs (C, C++, Java, etc.). In this section, we compare our approach, which generates omp task directives, with the Faust approach, which generates omp section directives.

8.7.1 OpenMP Version in Faust

The Faust compiler first produces a single sample computation loop, using the scalar code generator. Then the vectorial code generator splits the sample processing loop into several simpler loops that communicate by vectors, and produces a DAG. Next, this DAG is topologically sorted in order to detect loops that can be executed in parallel. After that, the parallel scheme generates OpenMP parallel sections directives using this DAG, in which each node is a computation loop; the granularity of tasks is thus only at the loop level. Finally, the compiler generates OpenMP sections directives in which a pragma section is generated for each loop.
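The difference between the two generation schemes can be illustrated on a toy DAG level made of two independent loops. The sketch below is not actual Faust or PIPS output; the function names and the loop bodies are hypothetical.

    #include <stddef.h>

    /* Faust-like scheme: one OpenMP section per parallel computation loop */
    void faust_like_sections(const float *in, float *v1, float *v2, size_t n)
    {
        #pragma omp parallel sections
        {
            #pragma omp section
            for (size_t i = 0; i < n; i++) v1[i] = in[i] * 0.5f;   /* loop 1 of the DAG */
            #pragma omp section
            for (size_t i = 0; i < n; i++) v2[i] = in[i] + 1.0f;   /* loop 2 of the DAG */
        }
    }

    /* Task-based scheme: tasks are created by one thread and scheduled
       dynamically by the OpenMP runtime */
    void task_based_version(const float *in, float *v1, float *v2, size_t n)
    {
        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task
            for (size_t i = 0; i < n; i++) v1[i] = in[i] * 0.5f;
            #pragma omp task
            for (size_t i = 0; i < n; i++) v2[i] = in[i] + 1.0f;
            #pragma omp taskwait
        }
    }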

8.7.2 Faust Programs: Karplus32 and Freeverb

In order to compare our approach with Faust's in generating task parallelism, we use Karplus32 and Freeverb as running examples [7].

Karplus32

The 200-line Karplus32 program is used to simulate the sound produced by a string excitation passed through a resonator to modify the final sound. It is an advanced version of a typical Karplus algorithm, with 32 resonators in parallel instead of one resonator only.

Freeverb

The 800-line Freeverb program is free software that implements artificial reverberation. It filters echoes to generate a signal with reverberation. It uses four Schroeder allpass filters in series and eight parallel Schroeder-Moorer filtered-feedback comb filters for each audio channel.


8.7.3 Experimental Comparison

Figure 8.12 shows the performance results of the OpenMP code generated on Karplus32 after application of BDSC and of Faust, for two sample buffer sizes: count equals 1024 and 2048. Note that BDSC is applied to the C code generated from these two programs using Faust. The hot function in Karplus32 (compute, 90 lines) contains only 2 sections in parallel; sequential code represents about 60% of the whole code of Function compute. Since we have only two threads that execute only two tasks, Faust execution is a little better than BDSC. The benefit of the dynamic scheduling of omp task cannot be seen in this case, due to the small amount of parallelism in this program.

Figure 8.12: Run-time comparison (BDSC vs. Faust for Karplus32). [Bar chart: speedup of the Faust and BDSC versions over the sequential one, using 2 threads, for count = 1024 (sequential = 12 ms) and count = 2048 (sequential = 18 ms).]

Figure 8.13 shows the performance results of the OpenMP code generated on Freeverb after application of BDSC and of Faust, for two sizes: count equals 1024 and 2048. The hot function in Freeverb (compute, 500 lines) contains up to 16 sections in parallel (the degree of task parallelism is equal to 16); sequential code represents about 3% of the whole code of Function compute. BDSC generates better schedules since it is based on sophisticated cost models and uses both top level and bottom level values for ordering, while Faust proceeds by a topological ordering (only top level) and relies only on dependence analysis. In this case, the results show that BDSC generates better schedules and that the generation of omp task instead of omp section is beneficial. Furthermore, the dynamic scheduler of OpenMP can make better decisions at run time.

8.8 Conclusion

This chapter describes the integration of SPIRE and HBDSC within the PIPS compiler infrastructure and their application to the automatic task parallelization of seven sequential programs.


Figure 8.13: Run-time comparison (BDSC vs. Faust for Freeverb). [Bar chart: speedup of the Faust and BDSC versions over the sequential one, using 8 threads, for count = 1024 (sequential = 40 ms) and count = 2048 (sequential = 80 ms).]

We illustrate the impact of this integration using the signal processing benchmark ABF (Adaptive Beam Forming), the image processing benchmark Harris, the SPEC benchmark equake and the NAS parallel benchmark IS, on both shared and distributed memory systems. We also provide a comparative study between our task parallelization implementation in PIPS and that of the audio signal processing language Faust, using two Faust programs: Karplus32 and Freeverb. Experimental results show that BDSC provides more efficient schedules than DSC and the Faust scheduler in these cases.

The application of BDSC to the parallelization of the FFT (OpenMP) shows the limits of our task parallelization approach, but we suggest that an interface with streaming languages such as OpenStream could be derived from our work.

Chapter 9

Conclusion

He who passes by the most beautiful story of his life will have only the age of his regrets, and all the sighs in the world could not soothe his soul. (Yasmina Khadra)

Designing an auto-parallelizing compiler is a big challenge: the interest in compiling for parallel architectures has exploded in the last decade, with the widespread availability of manycores and multiprocessors, and the need for automatic task parallelization, which is a tricky task, is still unsatisfied. In this thesis, we have exploited the parallelism present in applications in order to take advantage of the performance benefits that multiprocessors can provide. Indeed, we have delegated the "think parallel" mindset to the compiler, which detects parallelism in sequential codes and automatically writes the equivalent efficient parallel code.

We have developed an automatic task parallelization methodology for compilers; the key characteristics we focus on are resource constraints and static scheduling. It contains all the techniques required to decompose applications into tasks and generate equivalent parallel code, using a generic approach that targets different parallel languages and different architectures. We have applied this extension methodology in the existing tool PIPS, a comprehensive source-to-source compilation platform.

To conclude this dissertation, we review its main contributions and results, and present future work opportunities.

9.1 Contributions

This thesis contains five main contributions.

HBDSC (Chapter 6)

We have extended the DSC scheduling algorithm to handle simultaneously two resource constraints, namely a bounded amount of memory per processor and a bounded number of processors, which are key parameters when scheduling tasks on actual multiprocessors. We have called this extension Bounded DSC (BDSC). The practical use of BDSC in a hierarchical manner led to the development of a BDSC-based hierarchical scheduling algorithm (HBDSC).

A standard scheduling algorithm works on a DAG (a sequence of statements). To apply HBDSC to real applications, our approach was to first


build a Sequence Dependence DAG (SDG), which is the input of the BDSC algorithm. Then we used the code, presented in the form of an AST, to define a hierarchical mapping function, which we call H, that maps each sequence statement of the code to its SDG. H is used as input of the HBDSC algorithm.

Since the volume of data used or exchanged by SDG tasks and their execution times are key factors in the BDSC scheduling process, we estimate this information as precisely as possible. We provide new cost models, based on execution time complexity estimation and convex polyhedral approximations of array sizes, for the labeling of SDG vertices and edges. We look first at the size of array regions to estimate precisely the communication costs and the data sizes used by tasks. Then we compute Ehrhart polynomials that represent the size of the messages to communicate and the volume of data, in number of bytes; they are then converted into numbers of cycles. Besides, the execution time of each task is computed using a static execution time approach, based on a program complexity analysis provided by PIPS; complexities, in number of cycles, are also represented via polynomials. Finally, we use a static analysis, when possible, or a code instrumentation method to convert these polynomials into the numerical values required by BDSC.
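As an illustration of these conversions, consider a hypothetical loop "for (i = 0; i < N; i++) A[i] = B[i] + 1;" operating on float arrays (this example is ours, not taken from the benchmarks). The write region on A is {A[phi] | 0 <= phi <= N-1}; its Ehrhart polynomial counts N elements, i.e., 4N bytes. Writing beta for the transfer cost of one byte and CPI for the cycles per instruction of the target machine (cf. Table 8.1), and c for the per-iteration instruction count given by the complexity polynomial (a symbol we introduce here for the example), the cost models would then produce values of the form:

    edge_cost = 4N * beta        (communication cost, in cycles)
    task_time = c * N * CPI      (execution time, in cycles)

If N is known statically, these polynomials evaluate directly to numbers; otherwise, the instrumentation step provides the numerical value of N at run time.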

Path Transformer Analysis (Chapter 6)

The cost models that we define in this thesis use the affine relations present in array regions to estimate the communications and data volumes induced by the execution of statements. We use a set of operations on array regions, such as the function regions_intersection (see Section 6.3.1). However, one must be careful, since region operations should be defined with respect to a common memory store. In this thesis, we introduced a new analysis that we call "path transformer" analysis. A path transformer permits the comparison of array regions of statements originally defined in different memory stores. We present a recursive algorithm for the computation of path transformers between two statements, at specific iterations of their enclosing loops (if they exist). This algorithm computes affine relations between program variables at any two points in the program, from the memory store of the first statement to the memory store of the second one, along any sequential execution that links the two memory stores.
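A small, purely illustrative example (the code and names below are ours) of why such an analysis is needed:

    /* The two assignments access A in different memory stores, since i is
       modified on the path between them. */
    void example(int A[4])
    {
        int i = 0;
        A[i] = 0;        /* s1: write region {A[phi] | phi == i}, in the store before the increment */
        i = i + 2;
        A[i] = 1;        /* s2: write region {A[phi] | phi == i}, in the store after the increment  */
        /* The path transformer between s1 and s2 captures the affine relation
           i_s2 = i_s1 + 2; composing it with the regions shows, in a common
           store, that the two writes do not overlap (phi_s2 = phi_s1 + 2). */
    }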

SPIRE (Chapter 4)

An automatic parallelization platform should be adapted to different parallel languages. In this thesis, we introduced SPIRE, a new and general 3-step extension methodology for mapping any intermediate representation (IR) used in compilation platforms for representing sequential programming constructs to a parallel IR. This extension of an existing IR introduces (1) a parallel execution attribute for each group of statements, (2) a high-level


synchronization attribute on each statement, together with an API for low-level synchronization events, and (3) two built-ins for implementing communications in message-passing memory systems.

A formal semantics of SPIRE transformational definitions is specified using a two-tiered approach: a small-step operational semantics for its base parallel concepts and a rewriting mechanism for high-level constructs. The SPIRE methodology is presented via a use case, the intermediate representation of PIPS, a powerful source-to-source compilation infrastructure for Fortran and C. In addition, we apply SPIRE to another IR, namely the one of the widely-used LLVM compilation infrastructure. This semantics could be used in the future for proving the legality of parallel transformations of parallel codes.

Parallel Code Transformations and Generation (Chapter 7)

In order to test the flexibility of the parallelization approach presented in this thesis, we put the first brick of a framework of parallel code transformations and generation, using our BDSC- and SPIRE-based hierarchical task parallelization techniques. We provide two parallel transformations of SPIRE-based constructs: first, from unstructured to structured parallelism, for high-level abstraction of parallel constructs; and second, the conversion of shared memory programs to distributed ones. Moreover, we show the generation of task parallelism and communications at different AST levels: equilevel and hierarchical. Finally, we generate both OpenMP (hierarchical) and MPI (only flat) from the same parallel IR derived from SPIRE.
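As a simple illustration of the shared- to distributed-memory conversion, the sketch below shows the kind of send/receive pair that gets inserted when a value produced by a task mapped on one process is used by a task mapped on another. This is a hand-written schematic example, not the code actually emitted by our PIPS passes; the function name two_tasks is hypothetical.

    #include <mpi.h>

    /* Two tasks scheduled on processes 0 and 1; the array a produced by the
       first task is communicated to the second one. */
    void two_tasks(double *a, int n, int rank)
    {
        if (rank == 0) {
            for (int i = 0; i < n; i++)           /* task tau_0: produces a */
                a[i] = (double) i;
            MPI_Send(a, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   /* inserted communication */
        } else if (rank == 1) {
            MPI_Recv(a, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            double s = 0.0;
            for (int i = 0; i < n; i++)           /* task tau_1: consumes a */
                s += a[i];
            (void) s;
        }
    }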

Implementation and Experimental Results (Chapter 8)

A prototype of the algorithms presented in this thesis has been implemented in the PIPS source-to-source compilation framework: (1) HBDSC-based automatic parallelization for programs encoded in SPIRE(PIPS IR), including BDSC, HBDSC, the SDG, the cost models and the path transformer, plus DSC for comparison purposes; (2) SPIRE-based parallel code generation, targeting two parallel languages, OpenMP and MPI; and (3) two SPIRE-based parallel code transformations, from unstructured to structured code and from shared to distributed memory code.

We apply our parallelization technique to scientific applications. We provide performance measurements for our parallelization approach, based on five significant programs: the image and signal processing benchmarks Harris and ABF, the SPEC2001 benchmark equake, the NAS parallel benchmark IS and the FFT, targeting both shared and distributed memory architectures. We use their automatic translations into two parallel languages: OpenMP and MPI. Finally, we provide a comparative study between our task parallelization implementation in PIPS and that of the audio signal


processing language Faust, using two Faust applications: Karplus32 and Freeverb. Our results illustrate the ability of SPIRE(PIPS IR) to efficiently express the parallelism present in scientific applications. Moreover, our experimental data suggest that our parallelization methodology is able to provide parallel codes that encode a significant amount of parallelism and is thus well adapted to the design of parallel target formats for the efficient parallelization of scientific applications, on both shared and distributed memory systems.

9.2 Future Work

The auto-parallelizing problem is a set of many tricky subproblems, as illustrated along this thesis and in the summary of contributions presented above. Many new ideas and opportunities for further research have popped up all along this thesis.

SPIRE

SPIRE is a "work in progress" for the extension of sequential IRs to completely fit multiple languages and different models. Future work will address the representation via SPIRE of the PGAS memory model, of other parallel execution paradigms such as speculation, and of more programming features such as exceptions. Another interesting question is to see how many of the existing program transformations performed by compilers using sequential IRs can be preserved when these are extended via SPIRE.

HBDSC Scheduling

HBDSC assumes that all processors are homogeneous; it does not handle heterogeneous devices efficiently. However, one can probably fit in these architectures easily by adding another dimension to our cost models. We suggest adding a "which processor" parameter to all the functions defining the cost models, such as task_time (its value has to be defined depending on which processor the task is mapped on) and edge_cost (its value has to be defined depending on which processors the two tasks of the edge are mapped on). Moreover, each processor κi has its own memory Mi. Thus, our parallelization process may subsequently target an adapted language aimed at heterogeneous architectures, such as StarSs [21]. StarSs introduces an extension to the task mechanism in OpenMP 3.0. It adds many features to the construct pragma omp task, namely (1) dependences between tasks, using the clauses input, output and inout, and (2) targeting of a task to hardware accelerators, using pragma omp target device(device-name-list), knowing that an accelerator can be an SPE, a GPU, etc.
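A schematic example of the kind of annotations this would allow, following the clauses named above; the exact StarSs syntax may differ, and block_filter and ACC are hypothetical names:

    /* Declare a task, its inter-task dependences and its target device */
    #pragma omp target device (ACC)              /* map the task to an accelerator (e.g., an SPE or a GPU) */
    #pragma omp task input (in) output (out)     /* dependences used by the runtime scheduler */
    void block_filter(const float *in, float *out, int n);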


Moreover, it might be interesting to try to find a more efficient hierarchical processor allocation strategy, which would address load balancing and task granularity issues in order to yield an even better HBDSC-based parallelization process.

Cost Models

Currently, there are some cases where our cost models cannot provide precise, or even approximate, information. When the array is a pointer (for example with the declaration float *A, where A is used as an array), the compiler has no information about the size of the array. For such cases, we currently have to change the programmer's code manually to help the compiler. Moreover, the complexity estimation of a while loop is currently not computed, but there is already ongoing work (at CRI) on this issue, based on transformer analysis.

Parallel Code Transformations and Generation

Future work could address more transformations for parallel languages (à la [82]) encoded in SPIRE. Moreover, the optimization of communications in the generated code, by eliminating redundant communications and aggregating small messages into larger ones, is an important issue, since our current implementation communicates more data than necessary.

In our implementation of MPI code generation from SPIRE code, we left the implementation of hierarchical MPI as future work. Besides, another type of hierarchical MPI can be obtained by merging MPI and OpenMP together (hybrid programming). However, since we use in this thesis a hierarchical approach in both HBDSC and SPIRE generation, OpenMP can be seen in this case as a sub-hierarchical model of MPI, i.e., OpenMP tasks could be enclosed within MPI ones.
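A minimal sketch of this hybrid scheme, where OpenMP tasks are enclosed within MPI processes; the work function is a hypothetical placeholder, not part of our implementation:

    #include <mpi.h>
    #include <stdio.h>

    static void work(int rank, int subtask)      /* placeholder for a scheduled sub-cluster */
    {
        printf("process %d runs subtask %d\n", rank, subtask);
    }

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* outer, MPI level: one process per top-level cluster */
        #pragma omp parallel
        #pragma omp single
        {
            /* inner, OpenMP level: tasks for the sub-clusters of this process */
            #pragma omp task
            work(rank, 0);
            #pragma omp task
            work(rank, 1);
            #pragma omp taskwait
        }

        MPI_Finalize();
        return 0;
    }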

In the eye of the small, small things are huge, but for the great souls, huge things appear small. (Al-Mutanabbi)

Bibliography

[1] The Mandelbrot Set. http://warp.povusers.org/Mandelbrot.

[2] Cilk 5.4.6 Reference Manual. Supercomputing Technologies Group, MIT Laboratory for Computer Science. http://supertech.lcs.mit.edu/cilk, 1998.

[3] MPI: A Message-Passing Interface Standard. http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf, Sep. 2012.

[4] OpenMP Application Program Interface. http://www.openmp.org/mp-documents/OpenMP_4.0_RC2.pdf, Mar. 2013.

[5] X10 Language Specification, Feb. 2013.

[6] Chapel Language Specification 0.796. Cray Inc., 901 Fifth Avenue, Suite 1000, Seattle, WA 98164, Oct. 21, 2010.

[7] Faust, an Audio Signal Processing Language. http://faust.grame.fr.

[8] StreamIt, a Programming Language and a Compilation Infrastructure. http://groups.csail.mit.edu/cag/streamit.

[9] B. Ackland, A. Anesko, D. Brinthaupt, S. Daubert, A. Kalavade, J. Knobloch, E. Micca, M. Moturi, C. Nicol, J. O'Neill, J. Othmer, E. Sackinger, K. Singh, J. Sweet, C. Terman, and J. Williams. A single-chip, 1.6-billion, 16-b MAC/s multiprocessor DSP. Solid-State Circuits, IEEE Journal of, 35(3):412–424, Mar. 2000.

[10] T. L. Adam, K. M. Chandy, and J. R. Dickson. A Comparison of List Schedules For Parallel Processing Systems. Commun. ACM, 17(12):685–690, Dec. 1974.

[11] S. V. Adve and K. Gharachorloo. Shared Memory Consistency Models: A Tutorial. IEEE Computer, 29:66–76, 1996.

[12] S. Alam, R. Barrett, J. Kuehn, and S. Poole. Performance Characterization of a Hierarchical MPI Implementation on Large-scale Distributed-memory Platforms. In Parallel Processing, 2009. ICPP '09. International Conference on, pages 132–139, 2009.

[13] E. Allen, D. Chase, J. Hallett, V. Luchangco, J.-W. Maessen, S. Ryu, G. L. Steele Jr., and S. Tobin-Hochstadt. The Fortress Language Specification. Technical report, Sun Microsystems, Inc., 2007.


[14] R. Allen and K. Kennedy. Automatic Translation of FORTRAN Programs to Vector Form. ACM Trans. Program. Lang. Syst., 9(4):491–542, Oct. 1987.

[15] S. P. Amarasinghe and M. S. Lam. Communication Optimization and Code Generation for Distributed Memory Machines. In Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation, PLDI '93, pages 126–138, New York, NY, USA, 1993. ACM.

[16] G. M. Amdahl. Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities. Solid-State Circuits Newsletter, IEEE, 12(3):19–20, Summer 2007.

[17] M. Amini, F. Coelho, F. Irigoin, and R. Keryell. Static Compilation Analysis for Host-Accelerator Communication Optimization. In 24th Int. Workshop on Languages and Compilers for Parallel Computing (LCPC), Fort Collins, Colorado, USA, Sep. 2011. Also Technical Report MINES ParisTech A/476/CRI.

[18] C. Ancourt, F. Coelho, and F. Irigoin. A Modular Static Analysis Approach to Affine Loop Invariants Detection. Electr. Notes Theor. Comput. Sci., 267(1):3–16, 2010.

[19] C. Ancourt, B. Creusillet, F. Coelho, F. Irigoin, P. Jouvelot, and R. Keryell. PIPS: a Workbench for Interprocedural Program Analyses and Parallelization. In Meeting on data parallel languages and compilers for portable parallel computing, 1994.

[20] C. Ancourt and F. Irigoin. Scanning Polyhedra with DO Loops. In Proceedings of the third ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP '91, pages 39–50, New York, NY, USA, 1991. ACM.

[21] E. Ayguade, R. M. Badia, P. Bellens, D. Cabrera, A. Duran, R. Ferrer, M. Gonzalez, F. D. Igual, D. Jimenez-Gonzalez, and J. Labarta. Extending OpenMP to Survive the Heterogeneous Multi-Core Era. International Journal of Parallel Programming, pages 440–459, 2010.

[22] H. Bao, J. Bielak, O. Ghattas, L. F. Kallivokas, D. R. O'Hallaron, J. R. Shewchuk, and J. Xu. Large-scale Simulation of Elastic Wave Propagation in Heterogeneous Media on Parallel Computers. Computer Methods in Applied Mechanics and Engineering, 152(1–2):85–102, 1998.

[23] A. Basumallik, S.-J. Min, and R. Eigenmann. Programming Distributed Memory Systems Using OpenMP. In Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International, pages 1–8, 2007.

[24] N. Benoit and S. Louise. Kimble, a Hierarchical Intermediate Representation for Multi-Grain Parallelism. In F. Bouchez, S. Hack, and E. Visser, editors, Proceedings of the Workshop on Intermediate Representations, pages 21–28, 2011.

[25] G. Blake, R. Dreslinski, and T. Mudge. A Survey of Multicore Processors. Signal Processing Magazine, IEEE, 26(6):26–37, Nov. 2009.

[26] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An Efficient Multithreaded Runtime System. In Journal of Parallel and Distributed Computing, pages 207–216, 1995.

[27] U. Bondhugula. Automatic Distributed Memory Code Generation using the Polyhedral Framework. Technical Report IISc-CSA-TR-2011-3, Indian Institute of Science, Bangalore, 2011.

[28] C. T. Brown, L. S. Liebovitch, and R. Glendon. Levy Flights in Dobe Ju'hoansi Foraging Patterns. Human Ecology, 35(1):129–138, 2007.

[29] S. Campanoni, T. Jones, G. Holloway, V. J. Reddi, G.-Y. Wei, and D. Brooks. HELIX: Automatic Parallelization of Irregular Programs for Chip Multiprocessing. In Proceedings of the Tenth International Symposium on Code Generation and Optimization, CGO '12, pages 84–93, New York, NY, USA, 2012. ACM.

[30] V. Cave, J. Zhao, and V. Sarkar. Habanero-Java: the New Adventures of Old X10. In 9th International Conference on the Principles and Practice of Programming in Java (PPPJ), Aug. 2011.

[31] Y. Choi, Y. Lin, N. Chong, S. Mahlke, and T. Mudge. Stream Compilation for Real-Time Embedded Multicore Systems. In Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO '09, pages 210–220, Washington, DC, USA, 2009.

[32] F. Coelho, P. Jouvelot, C. Ancourt, and F. Irigoin. Data and Process Abstraction in PIPS Internal Representation. In F. Bouchez, S. Hack, and E. Visser, editors, Proceedings of the Workshop on Intermediate Representations, pages 77–84, 2011.

[33] T. U. Consortium. http://upc.gwu.edu/documentation.html, Jun. 2005.


[34] B. Creusillet. Analyses de Régions de Tableaux et Applications. PhD thesis, MINES ParisTech, A/295/CRI, December 1996.

[35] B. Creusillet and F. Irigoin. Interprocedural Array Region Analyses. International Journal of Parallel Programming, 24(6):513–546, 1996.

[36] E. Cuevas, A. Garcia, F. J. Fernandez, R. J. Gadea, and J. Cordon. Importance of Simulations for Nuclear and Aeronautical Inspections with Ultrasonic and Eddy Current Testing. Simulation in NDT, Online Workshop in www.ndt.net, Sep. 2010.

[37] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck. Efficiently Computing Static Single Assignment Form and the Control Dependence Graph. ACM Trans. Program. Lang. Syst., 13(4):451–490, Oct. 1991.

[38] M. I. Daoud and N. N. Kharma. GATS 1.0: a Novel GA-based Scheduling Algorithm for Task Scheduling on Heterogeneous Processor Nets. In GECCO '05, pages 2209–2210, 2005.

[39] J. B. Dennis, G. R. Gao, and K. W. Todd. Modeling The Weather With a Data Flow Supercomputer. IEEE Trans. Computers, pages 592–603, 1984.

[40] E. W. Dijkstra. Re: "Formal Derivation of Strongly Correct Parallel Programs" by Axel van Lamsweerde and M. Sintzoff. Circulated privately, 1977.

[41] E. Ehrhart. Polynômes arithmétiques et méthode des polyèdres en combinatoire. International Series of Numerical Mathematics, page 35, 1977.

[42] A. E. Eichenberger, J. K. O'Brien, K. M. O'Brien, P. Wu, T. Chen, P. H. Oden, D. A. Prener, J. C. Shepherd, B. So, Z. Sura, A. Wang, T. Zhang, P. Zhao, M. K. Gschwind, R. Archambault, Y. Gao, and R. Koo. Using Advanced Compiler Technology to Exploit the Performance of the Cell Broadband Engine™ Architecture. IBM Syst. J., 45(1):59–84, Jan. 2006.

[43] P. Feautrier. Dataflow Analysis of Array and Scalar References. International Journal of Parallel Programming, 20, 1991.

[44] J. Ferrante, K. J. Ottenstein, and J. D. Warren. The Program Dependence Graph And Its Use In Optimization. ACM Trans. Program. Lang. Syst., 9(3):319–349, July 1987.

[45] M. J. Flynn. Some Computer Organizations and Their Effectiveness. Computers, IEEE Transactions on, C-21(9):948–960, Sep. 1972.


[46] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1990.

[47] A. Gerasoulis, S. Venugopal, and T. Yang. Clustering Task Graphs for Message Passing Architectures. SIGARCH Comput. Archit. News, 18(3b):447–456, June 1990.

[48] M. Girkar and C. D. Polychronopoulos. Automatic Extraction of Functional Parallelism from Ordinary Programs. IEEE Trans. Parallel Distrib. Syst., 3:166–178, Mar. 1992.

[49] G. Goumas, N. Drosinos, M. Athanasaki, and N. Koziris. Message-Passing Code Generation for Non-rectangular Tiling Transformations. Parallel Computing, 32, 2006.

[50] Graphviz, Graph Visualization Software. http://www.graphviz.org.

[51] L. Griffiths. A Simple Adaptive Algorithm for Real-Time Processing in Antenna Arrays. Proceedings of the IEEE, 57:1696–1704, 1969.

[52] R. Habel, F. Silber-Chaussumier, and F. Irigoin. Generating Efficient Parallel Programs for Distributed Memory Systems. Technical Report CRI/A-523, MINES ParisTech and Télécom SudParis, 2013.

[53] C. Harris and M. Stephens. A Combined Corner and Edge Detector. In Proceedings of the 4th Alvey Vision Conference, pages 147–151, 1988.

[54] M. J. Harrold, B. Malloy, and G. Rothermel. Efficient Construction of Program Dependence Graphs. SIGSOFT Softw. Eng. Notes, 18(3):160–170, July 1993.

[55] J. Howard, S. Dighe, Y. Hoskote, S. Vangal, D. Finan, G. Ruhl, D. Jenkins, H. Wilson, N. Borkar, G. Schrom, F. Pailet, S. Jain, T. Jacob, S. Yada, S. Marella, P. Salihundam, V. Erraguntla, M. Konow, M. Riepen, G. Droege, J. Lindemann, M. Gries, T. Apel, K. Henriss, T. Lund-Larsen, S. Steibl, S. Borkar, V. De, R. Van der Wijngaart, and T. Mattson. A 48-Core IA-32 Message-Passing Processor with DVFS in 45nm CMOS. In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2010 IEEE International, pages 108–109, Feb. 2010.

[56] A. Hurson, J. T. Lim, K. M. Kavi, and B. Lee. Parallelization of DOALL and DOACROSS Loops – a Survey. In M. V. Zelkowitz, editor, Emphasizing Parallel Programming Techniques, volume 45 of Advances in Computers, pages 53–103. Elsevier, 1997.


[57] Insieme. Insieme - an Optimization System for OpenMP, MPI and OpenCL Programs. http://www.dps.uibk.ac.at/insieme/architecture.html, 2011.

[58] F. Irigoin, P. Jouvelot, and R. Triolet. Semantical Interprocedural Parallelization: An Overview of the PIPS Project. In ICS, pages 244–251, 1991.

[59] K. Jainandunsing. Optimal Partitioning Scheme for Wavefront/Systolic Array Processors. In Proceedings of IEEE Symposium on Circuits and Systems, pages 940–943, 1986.

[60] P. Jouvelot and R. Triolet. Newgen: A Language Independent Program Generator. Technical Report CRI/A-191, MINES ParisTech, Jul. 1989.

[61] G. Karypis and V. Kumar. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM Journal on Scientific Computing, 1998.

[62] H. Kasahara, H. Honda, A. Mogi, A. Ogura, K. Fujiwara, and S. Narita. A Multi-Grain Parallelizing Compilation Scheme for OSCAR (Optimally Scheduled Advanced Multiprocessor). In Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing, pages 283–297, London, UK, 1992. Springer-Verlag.

[63] A. Kayi and T. El-Ghazawi. PGAS Languages. wsdmhp09.hpcl.gwu.edu/kayi.pdf, 2009.

[64] P. Kenyon, P. Agrawal, and S. Seth. High-Level Microprogramming: an Optimizing C Compiler for a Processing Element of a CAD Accelerator. In Microprogramming and Microarchitecture, Micro 23, Proceedings of the 23rd Annual Workshop and Symposium, Workshop on, pages 97–106, Nov. 1990.

[65] D. Khaldi, P. Jouvelot, and C. Ancourt. Parallelizing with BDSC, a Resource-Constrained Scheduling Algorithm for Shared and Distributed Memory Systems. Technical Report CRI/A-499 (Submitted to Parallel Computing), MINES ParisTech, 2012.

[66] D. Khaldi, P. Jouvelot, C. Ancourt, and F. Irigoin. Task Parallelism and Data Distribution: An Overview of Explicit Parallel Programming Languages. In H. Kasahara and K. Kimura, editors, LCPC, volume 7760 of Lecture Notes in Computer Science, pages 174–189. Springer, 2012.


[67] D. Khaldi, P. Jouvelot, F. Irigoin, and C. Ancourt. SPIRE: A Methodology for Sequential to Parallel Intermediate Representation Extension. In Proceedings of the 17th Workshop on Compilers for Parallel Computing, CPC '13, 2013.

[68] M. A. Khan. Scheduling for Heterogeneous Systems using Constrained Critical Paths. Parallel Computing, 38(4–5):175–193, 2012.

[69] Khronos OpenCL Working Group. The OpenCL Specification, Nov. 2012.

[70] B. Kruatrachue and T. G. Lewis. Duplication Scheduling Heuristics (DSH): A New Precedence Task Scheduler for Parallel Processor Systems. Oregon State University, Corvallis, OR, 1987.

[71] S. Kumar, D. Kim, M. Smelyanskiy, Y.-K. Chen, J. Chhugani, C. J. Hughes, C. Kim, V. W. Lee, and A. D. Nguyen. Atomic Vector Operations on Chip Multiprocessors. SIGARCH Comput. Archit. News, 36(3):441–452, June 2008.

[72] Y.-K. Kwok and I. Ahmad. Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors. ACM Comput. Surv., 31:406–471, Dec. 1999.

[73] J. Larus and C. Kozyrakis. Transactional Memory. Commun. ACM, 51:80–88, Jul. 2008.

[74] S. Lee, S.-J. Min, and R. Eigenmann. OpenMP to GPGPU: a Compiler Framework for Automatic Translation and Optimization. In Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP '09, pages 101–110, New York, NY, USA, 2009. ACM.

[75] The LLVM Development Team. The LLVM Reference Manual (Version 2.6). http://llvm.org/docs/LangRef.html, Mar. 2010.

[76] V. Maisonneuve. Convex Invariant Refinement by Control Node Splitting: a Heuristic Approach. Electron. Notes Theor. Comput. Sci., 288:49–59, Dec. 2012.

[77] J. Merrill. GENERIC and GIMPLE: a New Tree Representation for Entire Functions. In GCC Developers Summit 2003, pages 171–180, 2003.

[78] D. Millot, A. Muller, C. Parrot, and F. Silber-Chaussumier. STEP: a Distributed OpenMP for Coarse-Grain Parallelism Tool. In Proceedings of the 4th international conference on OpenMP in a new era of parallelism, IWOMP '08, pages 83–99, Berlin, Heidelberg, 2008. Springer-Verlag.


[79] D. I. Moldovan and J. A. B. Fortes. Partitioning and Mapping Algorithms into Fixed Size Systolic Arrays. IEEE Trans. Comput., 35(1):1–12, Jan. 1986.

[80] A. Moody, D. Ahn, and B. Supinski. Exascale Algorithms for Generalized MPI_Comm_split. In Y. Cotronis, A. Danalis, D. Nikolopoulos, and J. Dongarra, editors, Recent Advances in the Message Passing Interface, volume 6960 of Lecture Notes in Computer Science, pages 9–18. Springer Berlin Heidelberg, 2011.

[81] G. Moore. Cramming More Components Onto Integrated Circuits. Proceedings of the IEEE, 86(1):82–85, Jan. 1998.

[82] V. K. Nandivada, J. Shirako, J. Zhao, and V. Sarkar. A Transformation Framework for Optimizing Task-Parallel Programs. ACM Trans. Program. Lang. Syst., 35(1):3:1–3:48, Apr. 2013.

[83] C. J. Newburn and J. P. Shen. Automatic Partitioning of Signal Processing Programs for Symmetric Multiprocessors. In IEEE PACT, pages 269–280. IEEE Computer Society, 1996.

[84] D. Novillo. OpenMP and Automatic Parallelization in GCC. In the Proceedings of the GCC Developers Summit, Jun. 2006.

[85] NPB. NAS Parallel Benchmarks. http://www.nas.nasa.gov/publications/npb.html.

[86] Y. Orlarey, D. Fober, and S. Letz. Adding Automatic Parallelization to Faust. In Linux Audio Conference, 2009.

[87] D. A. Padua, editor. Encyclopedia of Parallel Computing. Springer, 2011.

[88] D. A. Padua and M. J. Wolfe. Advanced Compiler Optimizations for Supercomputers. Commun. ACM, 29:1184–1201, Dec. 1986.

[89] S. Pai, R. Govindarajan, and M. J. Thazhuthaveetil. PLASMA: Portable Programming for SIMD Heterogeneous Accelerators. In Workshop on Language, Compiler, and Architecture Support for GPGPU, held in conjunction with HPCA/PPoPP 2010, Bangalore, India, Jan. 9, 2010.

[90] PolyLib, A Library of Polyhedral Functions. http://icps.u-strasbg.fr/polylib.

[91] A. Pop and A. Cohen. A Stream-Computing Extension to OpenMP. In Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers, HiPEAC '11, pages 5–14, New York, NY, USA, 2011. ACM.


[92] A. Pop and A. Cohen. OpenStream: Expressiveness and Data-Flow Compilation of OpenMP Streaming Programs. ACM Trans. Archit. Code Optim., 9(4):53:1–53:25, Jan. 2013.

[93] T. Saidani. Optimisation multi-niveau d'une application de traitement d'images sur machines parallèles. PhD thesis, Université Paris-Sud, Nov. 2012.

[94] T. Saidani, L. Lacassagne, J. Falcou, C. Tadonki, and S. Bouaziz. Parallelization Schemes for Memory Optimization on the Cell Processor: A Case Study on the Harris Corner Detector. In P. Stenstrom, editor, Transactions on High-Performance Embedded Architectures and Compilers III, volume 6590 of Lecture Notes in Computer Science, pages 177–200. Springer Berlin Heidelberg, 2011.

[95] V. Sarkar. Synchronization Using Counting Semaphores. In ICS '88, pages 627–637, 1988.

[96] V. Sarkar. Partitioning and Scheduling Parallel Programs for Multiprocessors. MIT Press, Cambridge, MA, USA, 1989.

[97] V. Sarkar. COMP 322: Principles of Parallel Programming. http://www.cs.rice.edu/~vs3/PDF/comp322-lec9-f09-v4.pdf, 2009.

[98] V. Sarkar and B. Simons. Parallel Program Graphs and their Classification. In U. Banerjee, D. Gelernter, A. Nicolau, and D. A. Padua, editors, LCPC, volume 768 of Lecture Notes in Computer Science, pages 633–655. Springer, 1993.

[99] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, P. Dubey, S. Junkins, A. Lake, R. Cavin, R. Espasa, E. Grochowski, T. Juan, M. Abrash, J. Sugerman, and P. Hanrahan. Larrabee: A Many-Core x86 Architecture for Visual Computing. Micro, IEEE, 29(1):10–21, 2009.

[100] J. Shirako, D. M. Peixotto, V. Sarkar, and W. N. Scherer. Phasers: A Unified Deadlock-Free Construct for Collective and Point-To-Point Synchronization. In ICS '08, pages 277–288, New York, NY, USA, 2008. ACM.

[101] M. Solar. A Scheduling Algorithm to Optimize Parallel Processes. In Chilean Computer Science Society, 2008. SCCC '08. International Conference of the, pages 73–78, 2008.

[102] M. Solar and M. Inostroza. A Scheduling Algorithm to Optimize Real-World Applications. In Proceedings of the 24th International Conference on Distributed Computing Systems Workshops - W7: EC (ICDCSW '04) - Volume 7, ICDCSW '04, pages 858–862, Washington, DC, USA, 2004. IEEE Computer Society.


[103] Supercomputing Technologies Group, Massachusetts Institute of Technology Laboratory for Computer Science. Cilk 5.4.6 Reference Manual, Nov. 2001.

[104] Thales. THALES Group. http://www.thalesgroup.com.

[105] H. Topcuouglu, S. Hariri, and M.-Y. Wu. Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing. IEEE Trans. Parallel Distrib. Syst., 13(3):260–274, Mar. 2002.

[106] S. Verdoolaege, A. Cohen, and A. Beletska. Transitive Closures of Affine Integer Tuple Relations and Their Overapproximations. In Proceedings of the 18th International Conference on Static Analysis, SAS '11, pages 216–232, Berlin, Heidelberg, 2011. Springer-Verlag.

[107] M.-Y. Wu and D. D. Gajski. Hypertool: A Programming Aid For Message-Passing Systems. IEEE Trans. on Parallel and Distributed Systems, 1:330–343, 1990.

[108] T. Yang and A. Gerasoulis. DSC: Scheduling Parallel Tasks on an Unbounded Number of Processors. IEEE Trans. Parallel Distrib. Syst., 5(9):951–967, 1994.

[109] X.-S. Yang and S. Deb. Cuckoo Search via Levy Flights. In Nature & Biologically Inspired Computing, 2009. NaBIC 2009. World Congress on, pages 210–214, 2009.

[110] K. Yelick, D. Bonachea, W.-Y. Chen, P. Colella, K. Datta, J. Duell, S. L. Graham, P. Hargrove, P. Hilfinger, P. Husbands, C. Iancu, A. Kamil, R. Nishtala, J. Su, W. Michael, and T. Wen. Productivity and Performance Using Partitioned Global Address Space Languages. In Proceedings of the 2007 International Workshop on Parallel Symbolic Computation, PASCO '07, pages 24–32, New York, NY, USA, 2007. ACM.

[111] J. Zhao and V. Sarkar. Intermediate Language Extensions for Parallelism. In Proceedings of the compilation of the co-located workshops on DSM'11, TMC'11, AGERE!'11, AOOPES'11, NEAT'11 & VMIL'11, SPLASH '11 Workshops, pages 329–340, New York, NY, USA, 2011. ACM.


Contents

Abstract ii

Résumé iv

1 Introduction 1
1.1 Context 1
1.2 Motivation 2
1.3 Contributions 4
1.4 Thesis Outline 5

2 Perspectives: Bridging Parallel Architectures and Software Parallelism 8
2.1 Parallel Architectures 8
2.1.1 Multiprocessors 9
2.1.2 Memory Models 11
2.2 Parallelism Paradigms 13
2.3 Mapping Paradigms to Architectures 14
2.4 Program Dependence Graph (PDG) 15
2.4.1 Control Dependence Graph (CDG) 15
2.4.2 Data Dependence Graph (DDG) 17
2.5 List Scheduling 18
2.5.1 Background 18
2.5.2 Algorithm 19
2.6 PIPS: Automatic Parallelizer and Code Transformation Framework 21
2.6.1 Transformer Analysis 22
2.6.2 Precondition Analysis 23
2.6.3 Array Region Analysis 24
2.6.4 "Complexity" Analysis 26
2.6.5 PIPS (Sequential) IR 26
2.7 Conclusion 28

3 Concepts in Task Parallel Programming Languages 29
3.1 Introduction 29
3.2 Mandelbrot Set Computation 31
3.3 Task Parallelism Issues 32
3.3.1 Task Creation 32
3.3.2 Synchronization 33
3.3.3 Atomicity 33
3.4 Parallel Programming Languages 34


3.4.1 Cilk 34
3.4.2 Chapel 35
3.4.3 X10 and Habanero-Java 37
3.4.4 OpenMP 39
3.4.5 MPI 40
3.4.6 OpenCL 42
3.5 Discussion and Comparison 44
3.6 Conclusion 48

4 SPIRE: A Generic Sequential to Parallel Intermediate Representation Extension Methodology 49
4.1 Introduction 50
4.2 SPIRE, a Sequential to Parallel IR Extension 51
4.2.1 Design Approach 52
4.2.2 Execution 53
4.2.3 Synchronization 56
4.2.4 Data Distribution 60
4.3 SPIRE Operational Semantics 62
4.3.1 Sequential Core Language 62
4.3.2 SPIRE as a Language Transformer 64
4.3.3 Rewriting Rules 67
4.4 Validation: SPIRE Application to LLVM IR 68
4.5 Related Work: Parallel Intermediate Representations 72
4.6 Conclusion 74

5 BDSC: A Memory-Constrained, Number of Processor-Bounded Extension of DSC 75
5.1 Introduction 75
5.2 The DSC List-Scheduling Algorithm 76
5.2.1 The DSC Algorithm 76
5.2.2 Dominant Sequence Length Reduction Warranty 79
5.3 BDSC: A Resource-Constrained Extension of DSC 80
5.3.1 DSC Weaknesses 80
5.3.2 Resource Modeling 81
5.3.3 Resource Constraint Warranty 82
5.3.4 Efficient Task-to-Cluster Mapping 83
5.3.5 The BDSC Algorithm 84
5.4 Related Work: Scheduling Algorithms 87
5.5 Conclusion 89

6 BDSC-Based Hierarchical Task Parallelization 91
6.1 Introduction 91
6.2 Hierarchical Sequence Dependence DAG Mapping 92
6.2.1 Sequence Dependence DAG (SDG) 94


6.2.2 Hierarchical SDG Mapping 96
6.3 Sequential Cost Models Generation 100
6.3.1 From Convex Polyhedra to Ehrhart Polynomials 100
6.3.2 From Polynomials to Values 103
6.4 Reachability Analysis: The Path Transformer 106
6.4.1 Path Definition 107
6.4.2 Path Transformer Algorithm 107
6.4.3 Operations on Regions using Path Transformers 115
6.5 BDSC-Based Hierarchical Scheduling (HBDSC) 116
6.5.1 Closure of SDGs 117
6.5.2 Recursive Top-Down Scheduling 118
6.5.3 Iterative Scheduling for Resource Optimization 121
6.5.4 Complexity of HBDSC Algorithm 123
6.5.5 Parallel Cost Models 125
6.6 Related Work: Task Parallelization Tools 126
6.7 Conclusion 128

7 SPIRE-Based Parallel Code Transformations and Generation 130
7.1 Introduction 130
7.2 Parallel Unstructured to Structured Transformation 133
7.2.1 Structuring Parallelism 133
7.2.2 Hierarchical Parallelism 137
7.3 From Shared Memory to Distributed Memory Transformation 139
7.3.1 Related Work: Communications Generation 139
7.3.2 Difficulties 140
7.3.3 Equilevel and Hierarchical Communications 144
7.3.4 Equilevel Communications Insertion Algorithm 146
7.3.5 Hierarchical Communications Insertion Algorithm 150
7.3.6 Communications Insertion Main Algorithm 151
7.3.7 Conclusion and Future Work 153
7.4 Parallel Code Generation 155
7.4.1 Mapping Approach 155
7.4.2 OpenMP Generation 157
7.4.3 SPMDization: MPI Generation 159
7.5 Conclusion 161

8 Experimental Evaluation with PIPS 163
8.1 Introduction 163
8.2 Implementation of HBDSC- and SPIRE-Based Parallelization Processes in PIPS 164
8.2.1 Preliminary Passes 167
8.2.2 Task Parallelization Passes 168
8.2.3 OpenMP Related Passes 169


8.2.4 MPI Related Passes: SPMDization 170
8.3 Experimental Setting 171
8.3.1 Benchmarks: ABF, Harris, Equake, IS and FFT 171
8.3.2 Experimental Platforms: Austin and Cmmcluster 174
8.3.3 Effective Cost Model Parameters 174
8.4 Protocol 174
8.5 BDSC vs. DSC 176
8.5.1 Experiments on Shared Memory Systems 176
8.5.2 Experiments on Distributed Memory Systems 179
8.6 Scheduling Robustness 180
8.7 Faust Parallel Scheduling vs. BDSC 182
8.7.1 OpenMP Version in Faust 182
8.7.2 Faust Programs: Karplus32 and Freeverb 182
8.7.3 Experimental Comparison 183
8.8 Conclusion 183

9 Conclusion 185
9.1 Contributions 185
9.2 Future Work 188

List of Tables

2.1 Multicores proliferation 11
2.2 Execution time estimations for Harris functions using PIPS complexity analysis 26
3.1 Summary of parallel languages constructs 48
4.1 Mapping of SPIRE to parallel languages constructs (terms in parentheses are not currently handled by SPIRE) 53
6.1 Execution and communication time estimations for Harris using PIPS default cost model (N and M variables represent the input image size) 104
6.2 Comparison summary between different parallelization tools 128
8.1 CPI for the Austin and Cmmcluster machines, plus the transfer cost of one byte β on the Cmmcluster machine (in cycles) 175
8.2 Run-time sensitivity of BDSC with respect to static cost estimation (in ms for Harris and ABF, in s for equake and IS) 181

List of Figures

11 Sequential C implementation of the main function of Harris 3

12 Harris algorithm data flow graph 3

13 Thesis outline blue indicates thesis contributions an ellipse a process and a rectangle results 7

21 A typical multiprocessor architectural model 9

22 Memory models 11

23 Rewriting of data parallelism using the task parallelism model 14

24 Construction of the control dependence graph 16

25 Example of a C code and its data dependence graph 17

26 A Directed Acyclic Graph (left) and its associated data (right) 19

27 Example of transformer analysis 23

28 Example of precondition analysis 24

29 Example of array region analysis 25

210 Simplified Newgen definitions of the PIPS IR 27

31 Result of the Mandelbrot set 31

32 Sequential C implementation of the Mandelbrot set 32

33 Cilk implementation of the Mandelbrot set (P is the number of processors) 34

34 Chapel implementation of the Mandelbrot set 36

35 X10 implementation of the Mandelbrot set 37

36 A clock in X10 (left) and a phaser in Habanero-Java (right) 38

37 C OpenMP implementation of the Mandelbrot set 39

38 MPI implementation of the Mandelbrot set (P is the number of processors) 41

39 OpenCL implementation of the Mandelbrot set 43

310 A hide-and-seek game (X10 HJ Cilk) 46

311 Example of an atomic directive in OpenMP 47

312 Data race on ptr with Habanero-Java 47

41 OpenCL example illustrating a parallel loop 54

42 forall in Chapel and its SPIRE core language representation 54

43 A C code and its unstructured parallel control flow graph representation 55

44 parallel sections in OpenMP and its SPIRE core language representation 55

45 OpenCL example illustrating spawn and barrier statements 57

46 Cilk and OpenMP examples illustrating an atomically-synchronized statement 57


47 X10 example illustrating a future task and its synchronization 58

48 A phaser in Habanero-Java and its SPIRE core language representation 59

49 Example of Pipeline Parallelism with phasers 59

410 MPI example illustrating a communication and its SPIRE core language representation 61

411 SPIRE core language representation of a non-blocking send 61

412 SPIRE core language representation of a non-blocking receive 61

413 SPIRE core language representation of a broadcast 62

414 Stmt and SPIRE(Stmt) syntaxes 63

415 Stmt sequential transition rules 63

416 SPIRE(Stmt) synchronized transition rules 66

417 if statement rewriting using while loops 68

418 Simplified Newgen definitions of the LLVM IR 70

419 A loop in C and its LLVM intermediate representation 70

420 SPIRE (LLVM IR) 71

421 An example of a spawned outlined sequence 71

51 A Directed Acyclic Graph (left) and its scheduling (right) starred top levels (*) correspond to the selected clusters 78

52 Result of DSC on the graph in Figure 51 without (left) and with (right) DSRW 80

53 A DAG amenable to cluster minimization (left) and its BDSC step-by-step scheduling (right) 84

54 DSC (left) and BDSC (right) cluster allocation 85

61 Abstract syntax trees Statement syntax 92

62 Parallelization process blue indicates thesis contributions an ellipse a process and a rectangle results 93

63 Example of a C code (left) and the DDG D of its internal S sequence (right) 97

64 SDGs of S (top) and S0 (bottom) computed from the DDG (see the right of Figure 63) S and S0 are specified in the left of Figure 63 98

65 SDG G for a part of equake S given in Figure 67 Gbody is the SDG of the body Sbody (logically included into G via H) 99

66 Example of the execution time estimation for the function MultiplY of the code Harris each comment provides the complexity estimation of the statement below (N and M are assumed to be global variables) 103

67 Instrumented part of equake (Sbody is the inner loop sequence) 105

68 Numerical results of the instrumented part of equake (instrumented equake.in) 106

69 An example of a subtree of root forloop0 108


610 An example of a path execution trace from Sb to Se in the subtree forloop0 108

611 An example of a C code and the path transformer of the sequence between Sb and Se M is ⊥ since there are no loops in the code 111

612 An example of a C code and the path transformer of the sequence of calls and test between Sb and Se M is ⊥ since there are no loops in the code 112

613 An example of a C code and the path transformer between Sb and Se 115

614 Closure of the SDG G0 of the C code S0 in Figure 63 with additional import entry vertices 118

615 topsort(G) for the hierarchical scheduling of sequences 120

616 Scheduled SDG for Harris using P=3 cores the scheduling information via cluster(τ) is also printed inside each vertex of the SDG 122

617 After the first application of BDSC on S = sequence(S0S1) failing with "Not enough memory" 123

618 Hierarchically (via H) Scheduled SDGs with memory resource minimization after the second application of BDSC (which succeeds) on sequence(S0S1) (κi = cluster(σ(Si))) To keep the picture readable only communication edges are figured in these SDGs 124

71 Parallel code transformations and generation blue indicates this chapter contributions an ellipse a process and a rectangle results 132

72 Unstructured parallel control flow graph (a) and event (b) and structured representations (c) 134

73 A part of Harris and one possible SPIRE core language representation 137

74 SPIRE representation of a part of Harris non optimized (left) and optimized (right) 138

75 A simple C code (hierarchical) and its SPIRE representation 138

76 An example of a shared memory C code and its SPIRE code efficient communication primitive-based distributed memory equivalent 141

77 Two examples of C code with non-uniform dependences 142

78 An example of a C code and its read and write array regions analysis for two communications from S0 to S1 and to S2 143

79 An example of a C code and its in and out array regions analysis for two communications from S0 to S1 and to S2 143


710 Scheme of communications between multiple possible levels of hierarchy in a parallel code the levels are specified with the superscripts on τ's 145

711 Equilevel and hierarchical communications 146

712 A part of an SDG and the topological sort order of predecessors of τ4 148

713 Communications generation from a region using the region_to_coms function 150

714 Equilevel and hierarchical communications generation and SPIRE distributed memory representation of a C code 154

715 Graph illustration of communications made in the C code of Figure 714 equilevel in green arcs hierarchical in black 155

716 Equilevel and hierarchical communications generation after optimizations the current cluster executes the last nested spawn (left) and barriers with one spawn statement are removed (right) 156

717 SPIRE representation of a part of Harris and its OpenMP task generated code 159

718 SPIRE representation of a C code and its OpenMP hierarchical task generated code 159

719 SPIRE representation of a part of Harris and its MPI generated code 161

81 Parallelization process blue indicates thesis contributions an ellipse a pass or a set of passes and a rectangle results 165

82 Executable (harris.tpips) for harris.c 166

83 A model of an MPI generated code 171

84 Scheduled SDG of the main function of ABF 173

85 Steps for the rewriting of data parallelism to task parallelism using the tiling and unrolling transformations 176

86 ABF and equake speedups with OpenMP 177

87 Speedups with OpenMP impact of tiling (P=3) 178

88 Speedups with OpenMP for different class sizes (IS) 178

89 ABF and equake speedups with MPI 179

810 Speedups with MPI impact of tiling (P=3) 180

811 Speedups with MPI for different class sizes (IS) 180

812 Run-time comparison (BDSC vs Faust for Karplus32) 183

813 Run-time comparison (BDSC vs Faust for Freeverb) 184

Chapter 1

Introduction

I live and don't know how long, I'll die and don't know when, I am going and don't know where, I wonder that I am happy. Martinus Von Biberach

11 Context

Moore's law [81] states that, over the history of computing hardware, the number of transistors on integrated circuits doubles approximately every two years. Recently, the shift from fabricating microprocessors to parallel machine designs was partly due to the growth in power consumption caused by the high clock speeds required to improve performance in single-processor (or core) chips. The transistor count is still increasing in order to integrate more cores and ensure proportional performance scaling.

In spite of the validity of Moore's law till now, the performance of cores stopped increasing after 2003. Thus the scalability of applications, which calls for increased performance when resources are added, is not guaranteed. To understand this issue, it is necessary to take a look at Amdahl's law [16]. This law states that the speedup of a program using multiple processors is bounded by the time needed for its sequential part; in addition to the number of processors, the algorithm itself also limits the speedup.

In order to enjoy the performance benefits that multiprocessors can provide, one should efficiently exploit the parallelism present in applications. This is a tricky task for programmers, especially if the programmer is a physicist or a mathematician, or a computer scientist for whom understanding the application is difficult since it is not their own. Of course, we could say to the programmer: "think parallel". But humans tend to think sequentially. Therefore, one long-standing problem is, and will be for some time, to detect parallelism in a sequential code and, automatically or not, write its equivalent efficient parallel code.

A parallel application, expressed in a parallel programming model, should be as portable as possible, i.e. one should not have to rewrite parallel code when targeting another machine. Yet the proliferation of parallel programming models makes the choice of one general model not obvious, unless there is a parallel language that is efficient and able to be compiled to all types of current and future architectures, which is not yet the case. Therefore, the writing of a parallel code needs to be based on a generic approach, i.e. on how well a range of different languages can be handled, in order to offer some chance of portability.


12 Motivation

In order to achieve good performance when programming for multiprocessors, and go beyond having to "think parallel", a platform for automatic parallelization is required to exploit cores efficiently. Also, the proliferation of multi-core processors with shorter pipelines and lower clock rates, and the pressure that the simpler data parallelism model imposes on their memory bandwidth, have made coarse-grained parallelism inevitable for improving performance.

The first step in a parallelization process is to expose concurrency by decomposing applications into pipelined tasks, since an algorithm is often described as a collection of tasks. To illustrate the importance of task parallelism, we give a simple example: the Harris algorithm [94], which is based on pixelwise autocorrelation, using a chain of functions, namely Sobel, Multiplication, Gauss and Coarsity. Figure 11 shows this process. Several works have already parallelized and mapped this algorithm on parallel architectures such as the CELL processor [93].

Figure 12 shows a possible instance of a manual partitioning of Harris (see also [93]). The Harris algorithm can be decomposed into a succession of two to three concurrent tasks (ellipses in the figure), assuming no exploitation of data parallelism. Automating this intuitive approach, and thus the automatic task extraction and parallelization of such applications, is our motivation in this thesis.

However, detecting task parallelism and generating efficient code rest on complex analyses of data dependences, communication, synchronization, etc. These different analyses can be provided by compilation frameworks. Thus, delegating the issue of "parallel thinking" to software is appealing, because it can be built upon existing sophisticated analyses of data dependences, and can handle the granularity present in sequential codes and the resource constraints, such as memory size or the number of CPUs, that also impact this process.

Automatic task parallelization has been studied for almost half a century. The problem of extracting an optimal parallelism with optimal communications is NP-complete [46]. Several works have tried to automate the parallelization of programs using different granularities. Metis [61] and other graph partitioning tools aim at assigning the same amount of processor work with small quantities of interprocessor communication, but the structure of the graph does not differentiate between loops, function calls, etc.; the crucial steps of graph construction and parallel code generation are missing. Sarkar [96] implements a compile-time method for the partitioning problem for multiprocessors: a program is partitioned into parallel tasks at compile time, and then these are merged until a partition with the smallest parallel execution time in the presence of overhead (scheduling and communication overhead) is found. Unfortunately, this algorithm does not


void main(int argc, char *argv[]) {
  float (*Gx)[N*M], (*Gy)[N*M], (*Ixx)[N*M],
        (*Iyy)[N*M], (*Ixy)[N*M], (*Sxx)[N*M],
        (*Sxy)[N*M], (*Syy)[N*M], (*in)[N*M];
  in = InitHarris();
  /* Now we run the Harris procedure */
  /* Sobel */
  SobelX(Gx, in);
  SobelY(Gy, in);
  /* Multiply */
  MultiplY(Ixx, Gx, Gx);
  MultiplY(Iyy, Gy, Gy);
  MultiplY(Ixy, Gx, Gy);
  /* Gauss */
  Gauss(Sxx, Ixx);
  Gauss(Syy, Iyy);
  Gauss(Sxy, Ixy);
  /* Coarsity */
  CoarsitY(out, Sxx, Syy, Sxy);
  return;
}

Figure 11 Sequential C implementation of the main function of Harris

[Figure: Harris data flow graph: in → SobelX, SobelY → Gx, Gy → MultiplY (three instances) → Ixx, Ixy, Iyy → Gauss (three instances) → Sxx, Sxy, Syy → Coarsity → out]

Figure 12 Harris algorithm data flow graph

address resource constraints, which are important factors for targeting real architectures.

All parallelization tools are dedicated to a particular programming model: there is a lack of a generic abstraction of parallelism (multithreading, synchronization and data distribution). They thus do not usually address the issue of portability.

In this thesis, we develop an automatic task parallelization methodology for compilers: the key characteristics we focus on are resource constraints and static scheduling. It includes the techniques required to decompose applications into tasks and generate equivalent parallel code, using a generic approach that targets different parallel languages and thus different


architectures. We apply this methodology in the existing tool PIPS [58], a comprehensive source-to-source compilation platform.

13 Contributions

Our goal is to develop a new tool and algorithms for automatic task parallelization that take into account resource constraints and which can also be of general use for existing parallel languages. For this purpose, the main contributions of this thesis are:

1. a new BDSC-based hierarchical scheduling algorithm (HBDSC) that uses:

• a new data structure, called the Sequence Data Dependence Graph (SDG), to represent partitioned parallel programs;

• "Bounded DSC" (BDSC), an extension of the DSC [108] scheduling algorithm that simultaneously handles two resource constraints, namely a bounded amount of memory per processor and a bounded number of processors, which are key parameters when scheduling tasks on actual multiprocessors;

• a new cost model based on execution time complexity estimation, convex polyhedral approximations of data array sizes and code instrumentation for the labeling of SDG vertices and edges;

2. a new approach for the adaptation of automatic parallelization platforms to parallel languages via:

• SPIRE, a new simple parallel intermediate representation (IR) extension methodology for designing the parallel IRs used in compilation frameworks;

• the deployment of SPIRE for automatic task-level parallelization of explicitly parallel programs on the PIPS [58] compilation framework;

3. an implementation in the PIPS source-to-source compilation framework of:

• BDSC-based parallelization for programs encoded using SPIRE-based PIPS IR;

• SPIRE-based parallel code generation into two parallel languages: OpenMP [4] and MPI [3];

4. performance measurements for our parallelization approach, based on:


• five significant programs, targeting both shared and distributed memory architectures: the image and signal processing benchmarks, actually sample constituent algorithms hopefully representative of classes of applications, Harris [53] and ABF [51]; the SPEC2001 benchmark equake [22]; the NAS parallel benchmark IS [85]; and an FFT code [8];

• their automatic translations into two parallel languages: OpenMP [4] and MPI [3];

• and finally, a comparative study between our task parallelization implementation in PIPS and that of the audio signal processing language Faust [86], using two Faust applications: Karplus32 and Freeverb. We choose Faust since it is an open-source compiler and it also automatically generates parallel tasks in OpenMP.

14 Thesis Outline

This thesis is organized in nine chapters. Figure 13 shows how these chapters can be put into the context of a parallelization chain.

Chapter 2 provides a short background to the concepts of architectures, parallelism, languages and compilers used in the next chapters.

Chapter 3 surveys seven current and efficient parallel programming languages in order to determine and classify their parallel constructs. This helps us to define a core parallel language that plays the role of a generic parallel intermediate representation when parallelizing sequential programs.

Our proposal, called the SPIRE methodology (Sequential to Parallel Intermediate Representation Extension), is developed in Chapter 4. SPIRE leverages existing infrastructures to address both control and data parallel languages, while preserving as much as possible existing analyses for sequential codes. To validate this approach in practice, we use PIPS, a comprehensive source-to-source compilation platform, as a use case to upgrade its sequential intermediate representation to a parallel one.

Since the main goal of this thesis is automatic task parallelization, extracting task parallelism from sequential codes is a key issue in this process, which we view as a scheduling problem. Chapter 5 introduces a new efficient automatic scheduling heuristic called BDSC (Bounded Dominant Sequence Clustering) for parallel programs in the presence of resource constraints on the number of processors and their local memory size.

The parallelization process we introduce in this thesis uses BDSC to find a good scheduling of program tasks on target machines and SPIRE to generate parallel source code. Chapter 6 equips BDSC with a sophisticated cost model to yield a new parallelization algorithm.

Chapter 7 describes how we can generate equivalent parallel codes in two target languages, OpenMP and MPI, selected from the survey presented in


Chapter 3. Thanks to SPIRE, we show how code generation is simple and efficient.

To verify the efficiency and robustness of our work, experimental results are presented in Chapter 8. They suggest that BDSC- and SPIRE-based parallelization, focused on efficient resource management, leads to significant parallelization speedups on both shared and distributed memory systems. We also compare our implementation of the generation of omp task in PIPS with the generation of omp sections in Faust.

We conclude in Chapter 9


[Figure: parallelization chain: C Source Code → PIPS Analyses (Chapter 2), producing Complexities, Convex Array Regions, Sequential IR and DDG (Chapter 2) → BDSC-Based Parallelizer (Chapters 5 and 6) → SPIRE(PIPS IR) (Chapter 4) → OpenMP Generation and MPI Generation (Chapter 7) → OpenMP Code and MPI Code (Chapter 3) → Run → Performance Results (Chapter 8)]

Figure 13 Thesis outline blue indicates thesis contributions an ellipse a process and a rectangle results

Chapter 2

Perspectives: Bridging Parallel Architectures and Software Parallelism

You talk when you cease to be at peace with your thoughts. Kahlil Gibran

"Anyone can build a fast CPU. The trick is to build a fast system." Attributed to Seymour Cray, this quote is even more pertinent when looking at multiprocessor systems that contain several fast processing units: parallel system architectures introduce subtle system features to achieve good performance. Real world applications, which operate on large amounts of data, must be able to deal with constraints such as memory requirements, code size and processor features. These constraints must also be addressed by parallelizing compilers, which deal with such applications from the domains of scientific, signal and image processing and translate sequential codes into efficient parallel ones. The multiplication of hardware intricacies increases the importance of software in order to achieve adequate performance.

This thesis was carried out under this perspective. Our goal is to develop a prototype for automatic task parallelization that generates a parallel version of an input sequential code using a scheduling algorithm. We also address code generation and parallel intermediate representation issues. More generally, we study how parallelism can be detected, represented and finally generated.

21 Parallel Architectures

Up to 2003, the performance of machines was improved by increasing clock frequencies. Yet relying on single-thread performance has been increasingly disappointing due to access latency to main memory. Moreover, other aspects of performance have been of importance lately, such as power consumption, energy dissipation and number of cores. For these reasons, the development of parallel processors, with a large number of cores that run at slower frequencies to reduce energy consumption, has been adopted and has had a significant impact on performance.

The market of parallel machines is wide, from vector platforms, accelerators, Graphical Processing Units (GPUs) and dual-cores to many-cores1 with

1Contains 32 or more cores


thousands of on-chip cores. However, gains obtained when parallelizing and executing applications are limited by input-output issues (disk read/write), data communications among processors, memory access time, load imbalance, etc. When parallelizing an application, one must detect the available parallelism and map different parallel tasks on the parallel machine. In this dissertation, we consider a task as a static notion, i.e. a list of instructions, while processes and threads (see Section 212) are running instances of tasks.

In our parallelization approach, we make trade-offs between the available parallelism and communications/locality; this latter point is related to the memory model of the target machine. Besides, extracted parallelism should be correlated to the number of processors/cores available in the target machine. We also take into account the memory size and the number of processors.

211 Multiprocessors

The generic model of multiprocessors is a collection of cores, each potentially including both CPU and memory, attached to an interconnection network. Figure 21 illustrates an example of a multiprocessor architecture that contains two clusters; each cluster is a quadcore multiprocessor. In a multicore, cores can transfer data between their cache memories (level 2 cache generally) without having to go through the main memory. Moreover, they may or may not share caches (CPU-local level 1 cache in Figure 21, for example). This is not possible at the cluster level, where processor cache memories are not linked; data can be transferred to and from the main memory only.

[Figure: two quad-core clusters connected by an interconnect; each CPU has a local L1 cache, each cluster shares an L2 cache and a main memory]

Figure 21 A typical multiprocessor architectural model

Flynn [45] classifies multiprocessors in three main groups: Single Instruction Single Data (SISD), Single Instruction Multiple Data (SIMD) and


Multiple Instruction Multiple Data (MIMD). Multiprocessors are generally recognized to be MIMD architectures. We present the important branches of current multiprocessors, MPSoCs and multicore processors, since they are widely used in many applications such as signal processing and multimedia.

Multiprocessors System-on-Chip

A multiprocessor system-on-chip (MPSoC) integrates multiple CPUs in a hardware system. MPSoCs are commonly used for applications such as embedded multimedia on cell phones. The Lucent Daytona [9] is often considered to be the first MPSoC. One distinguishes two important forms of MPSoCs: homogeneous and heterogeneous multiprocessors. Homogeneous architectures include multiple cores that are similar in every aspect. On the other hand, heterogeneous architectures use processors that are different in terms of instruction set architecture, functionality, frequency, etc. Field Programmable Gate Array (FPGA) platforms and the CELL processor2 [42] are examples of heterogeneous architectures.

Multicore Processors

This model implements the shared memory model. The programming model in a multicore processor is the same for all its CPUs. The first multicore general-purpose processors were introduced in 2005 (Intel Core Duo). Intel Xeon Phi is a recent multicore chip with 60 homogeneous cores. The number of cores on a chip is expected to continue to grow to construct more powerful computers. This emerging hardware approach will deliver petaflop and exaflop performance with the efficient capabilities needed to handle high-performance emerging applications. Table 21 illustrates the rapid proliferation of multicores and summarizes the important existing ones; note that IBM Sequoia, the second system in the Top500 list (top500.org) in November 2012, is a petascale Blue Gene/Q supercomputer that contains 98,304 compute nodes, where each compute node is a 16-core PowerPC A2 processor chip (sixth entry in the table). See [25] for an extended survey of multicore processors.

In this thesis, we provide tools that automatically generate parallel programs from sequential ones, targeting homogeneous multiprocessors or multicores. Since increasing the number of cores on a single chip creates challenges with memory and cache coherence, as well as communication between the cores, our aim in this thesis is to deliver the benefit and efficiency of multicore processors through the introduction of an efficient scheduling algorithm to map parallel tasks of a program on CPUs, in the presence of resource constraints on the number of processors and their local memory size.

2The Cell configuration has one Power Processing Element (PPE) on the core, acting as a controller for eight physical Synergistic Processing Elements (SPEs).


Year Number of cores Company Processor

The early 1970s The 1st microprocessor Intel 4004

2005 2-core machine Intel Xeon Paxville DP

2005 2 AMD Opteron E6-265

2009 6 AMD Phenom II X6

2011 8 Intel Xeon E7-2820

2011 16 IBM PowerPC A2

2007 64 Tilera TILE64

2012 60 Intel Xeon Phi

Table 21 Multicores proliferation

212 Memory Models

The choice of a proper memory model to express parallel programs is an important issue in parallel language design. Indeed, the ways processes and threads communicate using the target architecture and impact the programmer's computation specifications affect both performance and ease of programming. There are currently three main approaches (see Figure 22, extracted from [63]).

[Figure: the three memory models (Shared Memory, Distributed Memory, PGAS), showing threads/processes, address spaces, memory accesses and messages]

Figure 22 Memory models

Shared Memory

Also called global address space, this model is the simplest one to use [11]. Here, the address spaces of the threads are mapped onto the global memory; no explicit data passing between threads is needed. However, synchronization is required between the threads that are writing and reading the same


data to and from the shared memory. The OpenMP [4] and Cilk [26] languages use the shared memory model. Intel's Larrabee [99] is a homogeneous general-purpose many-core machine with cache-coherent shared memory.

Distributed Memory

In a distributed-memory environment, each processor is only able to address its own memory. The message passing model, which uses communication libraries, is usually adopted to write parallel programs for distributed-memory systems. These libraries provide routines to initiate and configure the messaging environment as well as to send and receive data packets. Currently, the most popular high-level message-passing system for scientific and engineering applications is MPI (Message Passing Interface) [3]. OpenCL [69] uses a variation of the message passing memory model for GPUs. The Intel Single-chip Cloud Computer (SCC) [55] is a homogeneous general-purpose many-core (48 cores) chip implementing the message passing memory model.

Partitioned Global Address Space (PGAS)

PGAS-based languages combine the programming convenience of shared memory with the performance control of message passing by logically partitioning a global address space into places; each thread is local to a place. From the programmer's point of view, programs have a single address space and one task of a given thread may refer directly to the storage of a different thread. UPC [33], Fortress [13], Chapel [6], X10 [5] and Habanero-Java [30] use the PGAS memory model.

In this thesis, we target both shared and distributed memory systems, since efficiently allocating the tasks of an application on a target architecture requires reducing communication overheads and transfer costs for both shared and distributed memory architectures. By convention, we call running instances of tasks processes in the case of distributed memory systems, and threads for shared memory and PGAS systems.

If reducing communications is obviously meaningful for distributed memory systems, it is also worthwhile on shared memory architectures, since this may impact cache behavior. Also, locality is another important issue; a parallelization process should be able to make trade-offs between locality in cache memory and communications. Indeed, mapping two tasks to the same processor keeps the data in its local memory, and even possibly its cache. This avoids copying data over the shared memory bus. Therefore, transmission costs are decreased and bus contention is reduced.

Moreover, we take into account the memory size, since this is an important factor when parallelizing applications that operate on large amounts of data or when targeting embedded systems.


22 Parallelism Paradigms

Parallel computing has been studied since the advent of computers. The market dominance of multi- and many-core processors, and the growing importance and increasing number of clusters in the Top500 list, are making parallelism a key concern when implementing large applications such as weather modeling [39] or nuclear simulations [36]. These important applications require a lot of computational power and thus need to be programmed to run on parallel supercomputers.

Parallelism expression is the fundamental issue when breaking an application into concurrent parts in order to take advantage of a parallel computer and to execute these parts simultaneously on different CPUs. Parallel architectures support multiple levels of parallelism within a single chip: the PE (Processing Element) level [64], the thread level (e.g. Simultaneous MultiThreading (SMT)), the data level (e.g. Single Instruction Multiple Data (SIMD)), the instruction level (e.g. Very Long Instruction Word (VLIW)), etc. We present here the three main types of parallelism (data, task and pipeline).

Task Parallelism

This model corresponds to the distribution of the execution of processes or threads across different parallel computing cores/processors. A task can be defined as a unit of parallel work in a program. We can classify its implementation of scheduling in two categories: dynamic and static.

• With dynamic task parallelism, tasks are dynamically created at run time and added to the work queue. The runtime scheduler is responsible for scheduling and synchronizing the tasks across the cores.

• With static task parallelism, the set of tasks is known statically and the scheduling is applied at compile time; the compiler generates and assigns different tasks for the thread groups.

Data Parallelism

This model implies that the same independent instruction is performed repeatedly and simultaneously on different data. Data parallelism, or loop-level parallelism, corresponds to the distribution of the data across different parallel cores/processors, dividing the loops into chunks3. Data-parallel execution puts a high pressure on the memory bandwidth of multi-core processors. Data parallelism can be expressed using task parallelism constructs; an example is provided in Figure 23, where loop iterations are distributed

3A chunk is a set of iterations


differently among the threads: in contiguous order generally on the left, and in interleaved order on the right.

forall (i = 0; i < height; ++i)
  for (j = 0; j < width; ++j)
    kernel();

(a) Data parallel example

for (i = 0; i < height; ++i)
  launchTask {
    for (j = 0; j < width; ++j)
      kernel();
  };

(b) Task parallel example

Figure 23 Rewriting of data parallelism using the task parallelism model

Pipeline Parallelism

Pipeline parallelism applies to chains of producers and consumers that contain many steps depending on each other. These dependences can be removed by interleaving these steps. Pipeline parallelism execution leads to reduced latency and buffering, but it introduces extra synchronization between producers and consumers to keep them coupled in their execution.

In this thesis, we focus on task parallelism, often considered more difficult to handle than data parallelism since it lacks the regularity present in the latter model; processes (threads) simultaneously run different instructions, leading to different execution schedules and memory access patterns. Besides, the pressure of the data parallelism model on memory bandwidth and the extra synchronization introduced by pipeline parallelism have made coarse-grained parallelism inevitable for improving performance.

Moreover, data parallelism can be implemented using task parallelism (see Figure 23). Therefore, our implementation of task parallelism is also able to handle data parallelism, after transformation of the sequential code (see Section 84, which explains the protocol applied to task parallelize sequential applications).

Task management must address both control and data dependences in order to reduce execution and communication times. This is related to another key enabling issue, compilation, which we detail in the following section.

23 Mapping Paradigms to Architectures

In order to map efficiently a parallel program on a parallel architecture, one needs to write parallel programs in the presence of constraints of data and control dependences, communications and resources. Writing a parallel


application implies the use of a target parallel language; we survey in Chapter 3 different recent and popular task parallel languages. We can split this parallelization process into two steps: understanding the sequential application by analyzing it to extract parallelism, and then generating the equivalent parallel code. Before reaching the writing step, decisions should be made about an efficient task schedule (Chapter 5). To perform these operations automatically, a compilation framework is necessary to provide a base for analyses, implementation and experimentation. Moreover, to keep this parallel software development process simple, generic and language-independent, a general parallel core language is needed (Chapter 4).

Programming parallel machines requires a parallel programming language with one of the previous paradigms of parallelism. Since in this thesis we focus on extracting and generating task parallelism, we are interested in task parallel languages, which we detail in Chapter 3. We distinguish two approaches for mapping languages to parallel architectures. In the first case, the entry is a parallel language: the user writes the parallel program by hand; this requires a great subsequent effort for detecting dependences and parallelism and writing an equivalent efficient parallel program. In the second case, the entry is a sequential language: this calls for an automatic parallelizer, which is responsible for dependence analyses and scheduling in order to generate an efficient parallel code.

In this thesis, we adopt the second approach: we use the PIPS compiler, presented in Section 26, as a practical tool for code analysis and parallelization. We target both shared and distributed memory systems using two parallel languages, OpenMP and MPI. They provide high-level programming constructs to express task creation, termination, synchronization, etc. In Chapter 7, we show how we generate parallel code for these two languages.

24 Program Dependence Graph (PDG)

Parallelism detection is based on the analyses of data and control dependences between instructions, encoded in an intermediate representation format. Parallelizers use various dependence graphs to represent such constraints. A program dependence graph (PDG) is a directed graph where the vertices represent blocks of statements and the edges represent essential control or data dependences. It encodes both control and data dependence information.

241 Control Dependence Graph (CDG)

The CDG represents a program as a graph of statements that are control dependent on the entry to the program. Control dependences represent control flow relationships that must be respected by any execution of the program, whether parallel or sequential.


Ferrante et al [44] define control dependence as follows: Statement Y is control dependent on X when there is a path in the control flow graph (CFG) (see Figure 24a) from X to Y that does not contain the immediate forward dominator Z of X. Z forward dominates Y if all paths from Y include Z. The immediate forward dominator is the first forward dominator (closest to X). In the forward dominance tree (FDT) (see Figure 24b), vertices represent statements, edges represent the immediate forward dominance relation, and the root of the tree is the exit of the CFG. For the construction of the CDG (see Figure 24c), one relies on explicit control flow and postdominance information. Unfortunately, when programs contain goto statements, explicit control flow and postdominance information are prerequisites for calculating control dependences.

[Figure: (a) Control flow graph (CFG), (b) Forward dominance tree (FDT), (c) Control dependence graph (CDG)]

Figure 24 Construction of the control dependence graph
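As a minimal complementary illustration (this snippet is ours and is not part of Figure 24; the function name is hypothetical), control dependence can be seen directly on a small C fragment:

/* Illustrative fragment, not taken from PIPS or the thesis figures. */
int control_dependence_example(int c) {
    int y = 0, z;
    if (c)         /* X: the test decides whether Y executes              */
        y = 1;     /* Y: control dependent on X                           */
    z = y + 1;     /* Z: executed on every path leaving X, so it is the
                      immediate forward dominator of X and is therefore
                      not control dependent on X                          */
    return z;
}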

Harrold et al [54] have proposed an algorithm that does not require explicit control flow or postdominator information to compute exact control dependences. Indeed, for structured programs, the control dependences are obtained directly from the Abstract Syntax Tree (AST). Programs that contain what the authors call structured transfers of control, such as break, return or exit, can be handled as special cases [54]. For more unstructured transfers of control via arbitrary gotos, additional processing based on the control flow graph needs to be performed.

In this thesis, we only handle structured parts of a code, i.e. the ones that do not contain goto statements. Therefore, in this context, PIPS implements control dependences in its IR, since it is in the form of an AST, as above (for structured programs, CDG and AST are equivalent).


242 Data Dependence Graph (DDG)

The DDG is a subgraph of the program dependence graph that encodes data dependence relations between instructions that use/define the same data element. Figure 25 shows an example of a C code and its data dependence graph. This DDG has been generated automatically with PIPS. Data dependences in a DDG are basically of three kinds [88], as shown in the figure:

• true dependences, or true data-flow dependences, when a value of an instruction is used by a subsequent one (aka RAW), colored red in the figure;

• output dependences, between two definitions of the same item (aka WAW), colored blue in the figure;

• anti-dependences, between a use of an item and a subsequent write (aka WAR), colored green in the figure.

void main() {
  int a[10], b[10];
  int i, d, c;
  c = 42;
  for(i = 1; i <= 10; i++)
    a[i] = 0;
  for(i = 1; i <= 10; i++) {
    a[i] = bar(a[i]);
    d = foo(c);
    b[i] = a[i]+d;
  }
  return;
}

Figure 25 Example of a C code and its data dependence graph
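To make the three kinds of data dependences concrete, here is a minimal scalar example of ours (it is not the code of Figure 25; the function name is hypothetical); the comments indicate which pair of statements forms each dependence:

int dependence_kinds(void) {
    int x, y, z;
    x = 1;       /* S1: writes x                                          */
    y = x + 1;   /* S2: reads x: true (flow) dependence S1 -> S2 (RAW)    */
    x = 2;       /* S3: writes x: anti-dependence S2 -> S3 (WAR) and
                    output dependence S1 -> S3 (WAW)                      */
    z = x + y;   /* S4: reads x and y: RAW dependences S3 -> S4, S2 -> S4 */
    return z;
}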

Dependences are needed to define precedence constraints in programs. PIPS implements both control (its IR is in the form of an AST) and data (in the form of a dependence graph) dependences. These data structures are the basis for scheduling instructions, an important step in any parallelization process.


25 List Scheduling

Task scheduling is the process of specifying the order and assignment of start and end times to a set of tasks, to be run concurrently on a multiprocessor, such that the completion time of the whole application is as small as possible, while respecting the dependence constraints of each task. Usually, the number of tasks exceeds the number of processors; thus some processors are dedicated to multiple tasks. Since finding the optimal solution of a general scheduling problem is NP-complete [46], providing an efficient heuristic to find a good solution is needed. This is a problem we address in Chapter 5.

251 Background

Scheduling approaches can be categorized in many ways. In preemptive techniques, the currently executing task can be preempted by another higher-priority task, while in a non-preemptive scheme, a task keeps its processor until termination. Preemptive scheduling algorithms are instrumental in avoiding possible deadlocks and for implementing real-time systems, where tasks must adhere to specified deadlines; however, preemptions generate run-time overhead. Moreover, when static predictions of task characteristics, such as execution time and communication cost, exist, task scheduling can be performed statically (offline). Otherwise, dynamic (online) schedulers must make run-time mapping decisions whenever new tasks arrive; this also introduces run-time overhead.

In the context of the automatic parallelization of scientific applications we focus on in this thesis, we are interested in non-preemptive static scheduling policies of parallelized code4. Even though the subject of static scheduling is rather mature (see Section 54), we believe the advent and widespread use of multi-core architectures, with the constraints they impose, warrant taking a fresh look at its potential. Indeed, static scheduling mechanisms have, first, the strong advantage of reducing run-time overheads, a key factor when considering execution time and energy usage metrics. One other important advantage of these schedulers over dynamic ones, at least over those not equipped with detailed static task information, is that the existence of efficient schedules is ensured prior to program execution. This is usually not an issue when time performance is the only goal at stake, but much more so when memory constraints might prevent a task being executed at all on a given architecture. Finally, static schedules are predictable, which helps both at the specification (if such a requirement has been introduced by designers) and debugging levels. Given the breadth of the literature on scheduling, we introduce in this section the notion of list-scheduling heuristics [72], a class of scheduling heuristics that we use in this thesis.

4Note that more expensive preemptive schedulers would be required if fairness concerns were high, which is not frequently the case in the applications we address here.



252 Algorithm

A labeled directed acyclic graph (DAG) G is defined as G = (T, E, C), where (1) T = vertices(G) is a set of n tasks (vertices) τ, annotated with an estimation of their execution time, task_time(τ); (2) E = edges(G) is a set of m directed edges e = (τi, τj) between two tasks, annotated with their dependences, edge_regions(e)5; and (3) C is an n × n sparse communication edge cost matrix, edge_cost(e). task_time(τ) and edge_cost(e) are assumed to be numerical constants, although we show how we lift this restriction in Section 63. The functions successors(τ, G) and predecessors(τ, G) return the list of immediate successors and predecessors of a task τ in the DAG G. Figure 26 provides an example of a simple graph with vertices τi; vertex times are listed in the vertex circles, while edge costs label arrows.

[Figure: a DAG with entry and exit vertices (time 0) and tasks τ1 (time 1), τ2 (time 3), τ3 (time 2), τ4 (time 2); edge costs label the arrows]

step  task  tlevel  blevel
1     τ4    0       7
2     τ3    3       2
3     τ1    0       5
4     τ2    4       3

Figure 26 A Directed Acyclic Graph (left) and its associated data (right)
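For illustration only, the labeled DAG G = (T, E, C) of this section could be encoded in C as follows; the structure and the field names (task_time, edge_cost) simply mirror the notation above and are not the PIPS data structures.

#define MAX_TASKS 16

/* A small dense encoding of a labeled DAG: task_time annotates the
   vertices, edge[i][j] records a dependence from task i to task j,
   and edge_cost[i][j] gives its communication cost. */
typedef struct {
    int    n;                               /* number of tasks              */
    double task_time[MAX_TASKS];            /* execution time estimations   */
    int    edge[MAX_TASKS][MAX_TASKS];      /* 1 if (i, j) is an edge       */
    double edge_cost[MAX_TASKS][MAX_TASKS]; /* communication cost of (i, j) */
} dag;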

A list scheduling process provides, from a DAG G, a sequence of its vertices that satisfies the relationship imposed by E. Various heuristics try to minimize the schedule total length, possibly allocating the various vertices in different clusters, which ultimately will correspond to different processes or threads. A cluster κ is thus a list of tasks; if τ ∈ κ, we note cluster(τ) = κ. List scheduling is based on the notion of vertex priorities. The priority for each task τ is computed using the following attributes:

• The top level tlevel(τ, G) of a vertex τ is the length of the longest

5A region is defined in Section 263


path from the entry vertex6 of G to τ. The length of a path is the sum of the communication costs of the edges and the computational times of the vertices along the path. Top levels are used to estimate the start times of vertices on processors: the top level is the earliest possible start time. Scheduling in an ascending order of top levels schedules vertices in a topological order. The algorithm for computing the top level of a vertex τ in a graph is given in Algorithm 1.

ALGORITHM 1 The top level of Task τ in Graph G

function tlevel(τ, G)
  tl = 0
  foreach τi ∈ predecessors(τ, G)
    level = tlevel(τi, G) + task_time(τi) + edge_cost(τi, τ)
    if (tl < level) then tl = level
  return tl
end

• The bottom level blevel(τ, G) of a vertex τ is the length of the longest path from τ to the exit vertex of G. The maximum of the bottom levels of vertices is the length cpl(G) of the graph's critical path, which is the longest path in the DAG G. The latest start time of a vertex τ is the difference (cpl(G) − blevel(τ, G)) between the critical path length and the bottom level of τ. Scheduling in a descending order of bottom levels tends to schedule critical path vertices first. The algorithm for computing the bottom level of τ in a graph is given in Algorithm 2.

ALGORITHM 2 The bottom level of Task τ in Graph G

function blevel(τ, G)
  bl = 0
  foreach τj ∈ successors(τ, G)
    level = blevel(τj, G) + edge_cost(τ, τj)
    if (bl < level) then bl = level
  return bl + task_time(τ)
end

To illustrate these notions, the top levels and bottom levels of each vertex of the graph presented in the left of Figure 26 are provided in the adjacent table.

6One can always assume that a unique entry vertex exists, by adding it if need be and connecting it to the original ones with null-cost edges. This remark also applies to the exit vertex (see Figure 26).
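As a sketch, Algorithms 1 and 2 can be transcribed directly in C over the illustrative dag structure introduced after Figure 26 (again our own illustration, not the PIPS implementation; memoization would be needed for large graphs):

/* Top level (Algorithm 1): length of the longest path from the entry
   vertex to task t, obtained by recursion over the predecessors of t. */
double tlevel(const dag *g, int t) {
    double tl = 0.0;
    for (int i = 0; i < g->n; i++) {
        if (!g->edge[i][t]) continue;  /* i is not a predecessor of t */
        double level = tlevel(g, i) + g->task_time[i] + g->edge_cost[i][t];
        if (tl < level) tl = level;
    }
    return tl;
}

/* Bottom level (Algorithm 2): length of the longest path from task t
   to the exit vertex, obtained by recursion over the successors of t. */
double blevel(const dag *g, int t) {
    double bl = 0.0;
    for (int j = 0; j < g->n; j++) {
        if (!g->edge[t][j]) continue;  /* j is not a successor of t */
        double level = blevel(g, j) + g->edge_cost[t][j];
        if (bl < level) bl = level;
    }
    return bl + g->task_time[t];
}

On the DAG of Figure 26, these functions should reproduce the tlevel and blevel values listed in the adjacent table.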


The general algorithmic skeleton for list scheduling a graph G on P clusters (P can be infinite and is assumed to be always strictly positive) is provided in Algorithm 3: first, priorities priority(τ) are computed for all currently unscheduled vertices; then, the vertex with the highest priority is selected for scheduling; finally, this vertex is allocated to the cluster that offers the earliest start time. Function f characterizes each specific heuristic, while the set of clusters already allocated to tasks is clusters. Priorities need to be computed again for (a possibly updated) graph G after each scheduling of a task: task times and communication costs change when tasks are allocated to clusters. This is performed by the update_priority_values

function call. Part of the allocate_task_to_cluster procedure is to ensure that cluster(τ) = κ, which indicates that Task τ is now scheduled on Cluster κ.

ALGORITHM 3 List scheduling of Graph G on P processors

procedure list_scheduling(G, P)
  clusters = ∅
  foreach τi ∈ vertices(G)
    priority(τi) = f(tlevel(τi, G), blevel(τi, G))
  UT = vertices(G)                 /* unscheduled tasks */
  while UT ≠ ∅
    τ = select_task_with_highest_priority(UT)
    κ = select_cluster(τ, G, P, clusters)
    allocate_task_to_cluster(τ, κ, G)
    update_graph(G)
    update_priority_values(G)
    UT = UT - {τ}
end

In this thesis, we introduce a new non-preemptive static list-scheduling heuristic, called BDSC (Chapter 5), to extract task-level parallelism in the presence of resource constraints on the number of processors and their local memory size; it is based on a precise cost model and addresses both shared and distributed parallel memory architectures. We implement BDSC in PIPS.

26 PIPS Automatic Parallelizer and Code Transformation Framework

We chose PIPS [58] to showcase our parallelization approach since it is readily available, well documented and encodes both control and data dependences. PIPS is a source-to-source compilation framework for analyzing and transforming C and Fortran programs. PIPS implements polyhedral


analyses and transformations that are inter-procedural, and provides detailed static information about programs, such as use-def chains, data dependence graph, symbolic complexity, memory effects7, call graph, interprocedural control flow graph, etc.

Many program analyses and transformations can be applied; the most important are listed below.

Parallelization Several algorithms are implemented, such as Allen & Kennedy's parallelization algorithm [14], which performs loop distribution and vectorization, selecting innermost loops for vector units, and coarse grain parallelization based on the convex array regions of loops (see below), to avoid loop distribution and restrictions on loop bodies.

Scalar and array privatization This analysis discovers variables whose values are local to a particular scope, usually a loop iteration.

Loop transformations [88] These include unrolling, interchange, normalization, distribution, strip mining, tiling and index set splitting.

Function transformations PIPS supports, for instance, the inlining, outlining and cloning of functions.

Other transformations These classical code transformations include dead-code elimination, partial evaluation, 3-address code generation, control restructuring, loop invariant code motion, forward substitution, etc.

We use many of these analyses for implementing our parallelization methodology in PIPS, e.g. the "complexity" and data dependence analyses. These static analyses use convex polyhedra to represent abstractions of the values within various domains. A convex polyhedron over Z is a set of integer points in the n-dimensional space Zn, and can be represented by a system of linear inequalities. These analyses provide detailed information and results, which can be leveraged to perform automatic task parallelization (Chapter 6).

We present below the precondition, transformer, array region and complexity analyses that are directly used in this thesis to compute dependences, execution and communication times. Then, we present the intermediate representation (IR) of PIPS, since we extend it in this thesis to a parallel intermediate representation (Chapter 4), which is necessary for simple and generic parallel code generation.

261 Transformer Analysis

A transformer is an affine relation between variable values in the memory store before (old values) and after (new values) the execution of a statement. This analysis [58] is implemented in PIPS, where transformers are

7A representation of data read and written by a given statement


represented by convex polyhedra. Indeed, a transformer T is defined by a list of arguments and a predicate system labeled by T; a transformer is empty (⊥) when the set of the affine constraints of the system is not feasible; a transformer is equal to the identity when no variables are modified.

This relation models the transformation, by the execution of a statement, of an input memory store into an output memory store. For instance, in Figure 27, the transformer of k, T(k), before and after the execution of the statement k++, is the relation k = kinit + 1, where kinit is the value of k before executing the increment statement; its value before the loop is kinit = 0.

T(i,k) {k = 0}
int i, k = 0;

T(i,k) {i = k+1, kinit = 0}
for(i = 1; i <= 10; i++) {

   T() {}
   a[i] = f(a[i]);

   T(k) {k = kinit+1}
   k++;
}

Figure 27 Example of transformer analysis

The transformer analysis is used to compute preconditions (see next section, Section 262) and to compute the path transformer (see Section 64) that connects two memory stores. This analysis is necessary for our dependence test.

262 Precondition Analysis

A precondition analysis is a function that labels each statement in a program with its precondition. A precondition provides information about the values of variables: it is a condition that holds true before the execution of the statement.

This analysis [58] is implemented in PIPS, where preconditions are represented by convex polyhedra. Preconditions are determined before the computation of array regions (see next section, Section 263), in order to have precise information on the variable values. They are also used to perform a sophisticated static complexity analysis (see Section 264). In Figure 28, for instance, the precondition P over the variables i and j before the second loop is i = 11, which is indeed the value of i after the execution of the first loop. Note that no information is available for j at this step.


int i, j;

P(i,j) {}
for(i = 1; i <= 10; i++) {
   P(i,j) {1 ≤ i, i ≤ 10}
   a[i] = f(a[i]);
}
P(i,j) {i = 11}
for(j = 6; j <= 20; j++) {
   P(i,j) {i = 11, 6 ≤ j, j ≤ 20}
   b[j] = g(a[j], a[j+1]);
}
P(i,j) {i = 11, j = 21}

Figure 28 Example of precondition analysis

263 Array Region Analysis

PIPS provides a precise intra- and inter-procedural analysis of array data flow. The latter is important for computing dependences for each array element. This analysis of array regions [35] is implemented in PIPS. A region r of an array T is an abstraction of a subset of its elements. Formally, a region r is a quadruplet of (1) a reference v (variable referenced in the region), (2) a region type t, (3) an approximation a for the region, EXACT when the region exactly represents the requested set of array elements, or MAY if it is an over- or under-approximation, and (4) c, a convex polyhedron containing equalities and inequalities, where parameters are the variable values in the current memory store. We write < v−t−a−c > for Region r.

We distinguish four types of sets of regions: Rr, Rw, Ri and Ro. Read Rr and Write Rw regions contain the array elements respectively read or written by a statement. In regions Ri contain the array elements read and imported by a statement. Out regions Ro contain the array elements written and exported by a statement.

For instance, in Figure 29, PIPS is able to infer the following sets of regions:

Rw1 = {< a(φ1)−W−EXACT−{1 ≤ φ1, φ1 ≤ 10} >,
       < b(φ1)−W−EXACT−{1 ≤ φ1, φ1 ≤ 10} >}

Rr2 = {< a(φ1)−R−EXACT−{6 ≤ φ1, φ1 ≤ 21} >}

where the Write regions Rw1 of arrays a and b, modified in the first loop, are EXACTly the array elements with indices in the interval [1,10]. The Read region Rr2 of array a in the second loop represents EXACTly the elements with indices in [6,21].


< a(φ1)−R−EXACT−{1 ≤ φ1, φ1 ≤ 10} >
< a(φ1)−W−EXACT−{1 ≤ φ1, φ1 ≤ 10} >
< b(φ1)−W−EXACT−{1 ≤ φ1, φ1 ≤ 10} >
for(i = 1; i <= 10; i++) {
  a[i] = f(a[i]);
  b[i] = 42;
}
< a(φ1)−R−EXACT−{6 ≤ φ1, φ1 ≤ 21} >
< b(φ1)−W−EXACT−{6 ≤ φ1, φ1 ≤ 20} >
for(j = 6; j <= 20; j++)
  b[j] = g(a[j], a[j+1]);

Figure 29 Example of array region analysis

We define the following set of operations on sets of regions Ri (convex polyhedra):

1. regions_intersection(R1, R2) is a set of regions; each region r in this set is the intersection of two regions r1 ∈ R1 of type t1 and r2 ∈ R2 of type t2 with the same reference. The type of r is t1t2, and the convex polyhedron of r is the intersection of the two convex polyhedra of r1 and r2, which is also a convex polyhedron.

2 regions difference(R1R2) is a set of regions each region r in thisset is the difference of two regions r1 isin R1 and r2 isin R2 with the samereference The type of r is t1t2 and the convex polyhedron of r is theset difference between the two convex polyhedra of r1 and r2 Sincethis is not generally a convex polyhedron an under approximation iscomputed in order to obtain a convex representation

3 regions union(R1R2) is a set of regions each region r in this set isthe union of two regions r1 isin R1 of type t1 and r2 isin R2 of type t2 withthe same reference The type of r is t1t2 and the convex polyhedronof r is the union of the two convex polyhedra r1 and r2 which is againnot necessarily a convex polyhedron An approximated convex hull isthus computed in order to return the smallest enclosing polyhedron

In this thesis array regions are used for communication cost estimationdependence computation between statements and data volume estimationfor statements For example if we need to compute array dependences be-tween two compound statements S1 and S2 we search for the array elementsthat are accessed in both statements Therefore we are dealing with twostatements set of regions R1 and R2 the result is the intersection betweenthe two sets of regions However two sets of regions should be in the same

26 CHAPTER 2 PERSPECTIVES BRIDGING PARALLEL

memory store to make it possible to compare them we should to bring aset of region R1 to the store of R2 by combining R1 with the changes per-formed ie the transformer that connects the two memory stores Thus wecompute the path transformer between S1 and S2 (see Section 64) More-over we use operations on array regions for our cost model generation (seeSection 63) and communications generation (see Section 73)
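As a simple worked illustration, consider the regions inferred for Figure 2.9: intersecting the Write region of Array a in the first loop, {1 ≤ φ1, φ1 ≤ 10}, with its Read region in the second loop, {6 ≤ φ1, φ1 ≤ 21}, yields the non-empty polyhedron {6 ≤ φ1, φ1 ≤ 10}. Both loops therefore access the elements a[6..10], and a flow dependence between them must be assumed.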

2.6.4 “Complexity” Analysis

“Complexities” are symbolic approximations of the static execution time estimation of statements, in terms of numbers of cycles. They are implemented in PIPS using polynomial approximations of execution times. Each statement is thus labeled with an expression, represented by a polynomial over program variables, assuming that each basic operation (addition, multiplication) has a fixed, architecture-dependent execution time. Table 2.2 shows the complexity formulas generated for each function of Harris (see the code in Figure 1.1) using PIPS complexity analysis, where the N and M variables represent the input image size.

Function     Complexity (polynomial)
InitHarris   9 × N × M
SobelX       60 × N × M
SobelY       60 × N × M
MultiplY     17 × N × M
Gauss        85 × N × M
CoarsitY     34 × N × M

Table 2.2: Execution time estimations for Harris functions using PIPS complexity analysis

We use this analysis provided by PIPS to determine an approximate execution time for each statement, in order to guide our scheduling heuristic.
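For instance, the SobelX entry of Table 2.2 can be read as follows: the function sweeps the whole N × M input image once, and its loop body is charged about 60 basic-operation cycles per pixel by the cost model, hence the symbolic estimate 60 × N × M, which remains a polynomial since N and M are unknown at compile time.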

2.6.5 PIPS (Sequential) IR

The PIPS intermediate representation (IR) [32] of sequential programs is a hierarchical data structure that embeds both control flow graphs and abstract syntax trees. We provide in this section a high-level description of the intermediate representation of PIPS, which targets the Fortran and C imperative programming languages; it is specified using Newgen [60], a Domain Specific Language for the definition of set equations, from which a dedicated API is automatically generated to manipulate (creation, access, IO operations) the data structures implementing these set elements. This section contains only a slightly simplified subset of the intermediate representation of PIPS, the part that is directly related to the parallel paradigms addressed in this thesis. The Newgen definition of this part is given in Figure 2.10.

instruction  = call + forloop + sequence + unstructured
statement    = instruction x declarations:entity*
entity       = name:string x type x initial:value
call         = function:entity x arguments:expression*
forloop      = index:entity x
               lower:expression x upper:expression x
               step:expression x body:statement
sequence     = statements:statement*
unstructured = entry:control x exit:control
control      = statement x
               predecessors:control* x successors:control*
expression   = syntax x normalized
syntax       = reference + call + cast
reference    = variable:entity x indices:expression*
type         = area + void

Figure 2.10: Simplified Newgen definitions of the PIPS IR

• Control flow in PIPS IR is represented via instructions, members of the disjoint union (using the “+” symbol) set instruction. An instruction can be either a simple call or a compound instruction, i.e., a for loop, a sequence or a control flow graph. A call instruction represents built-in or user-defined function calls; for instance, assign statements are represented as calls to the “=” function (see the sketch after this list).

• Instructions are included within statements, which are members of the cartesian product set statement that also incorporates the declarations of local variables; a whole function body is represented in PIPS IR as a statement. All named objects, such as user variables or built-in functions, in PIPS are members of the entity set (the value set denotes constants, while the “*” symbol introduces Newgen list sets).

• Compound instructions can be either (1) a loop instruction, which includes an iteration index variable with its lower, upper and increment expressions and a loop body, (2) a sequence, i.e., a succession of statements encoded as a list, or (3) a control flow graph (unstructured). In Newgen, a given set component such as expression can be distinguished using a prefix, such as lower and upper here.

• Programs that contain structured (exit, break and return) and unstructured (goto) transfers of control are handled in the PIPS intermediate representation via the unstructured set. An unstructured instruction has one entry and one exit control vertices; a control is a vertex in a graph, labeled with a statement and its lists of predecessor and successor control vertices. Executing an unstructured instruction amounts to following the control flow induced by the graph successor relationship, starting at the entry vertex, while executing the vertex statements, until the exit vertex is reached, if at all.
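As announced above, here is an informal sketch (not actual Newgen syntax nor the PIPS API, only an illustration of the nesting) of how an assignment such as a[i] = f(a[i]) from Figure 2.8 is encoded: the statement wraps a call instruction whose function entity is “=” and whose arguments are the left-hand-side reference and the right-hand-side call expression.

statement
  instruction: call
    function:  entity "="
    arguments: [ expression(reference(a, [i])),             // a[i]
                 expression(call(f, [reference(a, [i])])) ]  // f(a[i])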

In this thesis, we use the IR of PIPS [58] to showcase our approach: SPIRE, a parallel extension formalism for existing sequential intermediate representations (Chapter 4).

2.7 Conclusion

The purpose of this chapter is to present the context of existing hardware architectures and parallel programming models within which our thesis work is developed. We also outline how to bridge these two levels using two approaches for writing parallel code: the manual and automatic approaches. Since in this thesis we adopt the automatic approach, this chapter also presents the notions of intermediate representation, control and data dependences, scheduling, and the PIPS compilation framework in which we implement our automatic parallelization methodology.

In the next chapter, we survey in detail seven recent and popular parallel programming languages and their different parallel constructs. This analysis will help us find a uniform core language that can be designed as a unique but generic parallel intermediate representation, via the development of the SPIRE methodology (see Chapter 4).

Chapter 3

Concepts in Task Parallel Programming Languages

The unconscious is structured like a language. Jacques Lacan

Existing parallel languages present portability issues, such as OpenMP, which targets only shared memory systems. Parallelizing sequential programs to be run on different types of architectures requires writing the same application in different parallel programming languages. Unifying the wide variety of existing programming models in a generic core language, used as a parallel intermediate representation, can simplify this portability problem. To design this parallel IR, we use a survey of existing parallel language constructs, which provide a trade-off between expressibility and conciseness of representation. This chapter surveys seven popular and efficient parallel language designs that tackle this difficult issue: Cilk, Chapel, X10, Habanero-Java, OpenMP, MPI and OpenCL. Using as single running example a parallel implementation of the computation of the Mandelbrot set, this chapter describes how the fundamentals of task parallel programming, i.e., collective and point-to-point synchronization and mutual exclusion, are dealt with in these languages. We also discuss how these languages allocate and distribute data over memory. Our study suggests that, even though many keywords and notions are introduced by these languages, they all boil down, as far as control issues are concerned, to three key task concepts: creation, synchronization and atomicity. Regarding memory models, these languages adopt one of three approaches: shared memory, distributed memory and PGAS (Partitioned Global Address Space).

3.1 Introduction

This chapter is a state of the art of important existing task parallel programming languages, meant to help us achieve two of our goals in this thesis: (1) the development of a parallel intermediate representation, which should be as generic as possible to handle different parallel programming languages; this is made possible by the taxonomy produced at the end of this chapter; and (2) the choice of the languages we use as back-ends of our automatic parallelization process; these are determined by the knowledge of existing target parallel languages, their limits and their extensions.


Indeed, programming parallel machines as effectively as sequential ones would ideally require a language that provides high-level programming constructs, to avoid the programming errors frequent when expressing parallelism. Programming languages adopt one of two ways to deal with this issue: (1) high-level languages hide the presence of parallelism at the software level, thus offering code that is easy to build and port, but whose performance is not guaranteed, and (2) low-level languages use explicit constructs for communication patterns and for specifying the number and placement of threads, but the resulting code is difficult to build and not very portable, although usually efficient. Recent programming models explore the best trade-offs between expressiveness and performance when addressing parallelism.

Traditionally, there are two general ways to break an application into concurrent parts in order to take advantage of a parallel computer and execute them simultaneously on different CPUs: data and task parallelism. In data parallelism, the same instruction is performed repeatedly and simultaneously on different data. In task parallelism, the execution of different processes (threads) is distributed across multiple computing nodes. Task parallelism is often considered more difficult to specify than data parallelism, since it lacks the regularity present in the latter model; processes (threads) run simultaneously different instructions, leading to different execution schedules and memory access patterns. Task management must address both control and data dependences, in order to minimize execution and communication times.

This chapter describes how seven popular and efficient parallel programming language designs, either widely used or recently developed at DARPA1, tackle the issue of task parallelism specification: Cilk, Chapel, X10, Habanero-Java, OpenMP, MPI and OpenCL. They are selected based on the richness of their constructs and their popularity; they provide simple high-level parallel abstractions that cover most of the parallel programming language design spectrum. We use a popular parallel problem, the computation of the Mandelbrot set [1], as a running example. We consider this an interesting test case, since it exhibits a high level of embarrassing parallelism, while its iteration space is not easily partitioned if one wants to have tasks of balanced run times. Our goal with this chapter is to study and compare aspects around task parallelism for our automatic parallelization aim.

After this introduction, Section 3.2 presents our running example. We discuss the parallel language features specific to task parallelism, namely task creation, synchronization and atomicity, in Section 3.3. In Section 3.4, a selection of current and important parallel programming languages is described: Cilk, Chapel, X10, Habanero-Java, OpenMP, MPI and OpenCL.

1 Three programming languages developed as part of the DARPA HPCS program: Chapel, Fortress [13] and X10.


For each language, an implementation of the Mandelbrot set algorithm is presented. Section 3.5 contains a comparison, discussion and classification of these languages. We conclude in Section 3.6.

3.2 Mandelbrot Set Computation

The Mandelbrot set is a fractal set. For each complex c ∈ C, the set of complex numbers zn(c) is defined by induction as follows: z0(c) = c and zn+1(c) = zn(c)^2 + c. The Mandelbrot set M is then defined as M = {c ∈ C / lim_{n→∞} zn(c) < ∞}; thus M is the set of all complex numbers c for which the series zn(c) converges. One can show [1] that a finite limit for zn(c) exists only if the modulus of zm(c) is less than 2, for some positive m. We give a sequential C implementation of the computation of the Mandelbrot set in Figure 3.2. Running this program yields Figure 3.1, in which each complex c is seen as a pixel, its color being related to its convergence property: the Mandelbrot set is the black shape in the middle of the figure.

Figure 3.1: Result of the Mandelbrot set

We use this base program as our test case in our parallel implementations, in Section 3.4, for the parallel languages we selected. This is an interesting case for illustrating parallel programming languages: (1) it is an embarrassingly parallel problem, since all computations of pixel colors can be performed simultaneously, and thus is obviously a good candidate for expressing parallelism, but (2) its efficient implementation is not obvious, since good load balancing cannot be achieved by simply grouping localized pixels together, because convergence can vary widely from one point to the next, due to the fractal nature of the Mandelbrot set.


unsigned long min_color = 0, max_color = 16777215;
unsigned int width = NPIXELS, height = NPIXELS,
             N = 2, maxiter = 10000;
double r_min = -N, r_max = N, i_min = -N, i_max = N;
double scale_r = (r_max - r_min)/width;
double scale_i = (i_max - i_min)/height;
double scale_color = (max_color - min_color)/maxiter;
Display *display; Window win; GC gc;
for (row = 0; row < height; ++row) {
  for (col = 0; col < width; ++col) {
    zr = zi = 0;
    /* Scale c as display coordinates of current point */
    cr = r_min + ((double) col * scale_r);
    ci = i_min + ((double) (height-1-row) * scale_i);
    /* Iterate z = z*z+c while |z| < N or maxiter is reached */
    k = 0;
    do {
      temp = zr*zr - zi*zi + cr;
      zi = 2*zr*zi + ci; zr = temp;
      ++k;
    } while (zr*zr + zi*zi < (N*N) && k < maxiter);
    /* Set color and display point */
    color = (ulong) ((k-1) * scale_color) + min_color;
    XSetForeground(display, gc, color);
    XDrawPoint(display, win, gc, col, row);
  }
}

Figure 3.2: Sequential C implementation of the Mandelbrot set

3.3 Task Parallelism Issues

Among the many issues related to parallel programming, the questions of task creation, synchronization and atomicity are particularly important when dealing with task parallelism, our focus in this thesis.

3.3.1 Task Creation

In this thesis, a task is a static notion, i.e., a list of instructions, while processes and threads are running instances of tasks. Creation of system-level task instances is an expensive operation, since its implementation via processes requires allocating, and later possibly releasing, system-specific resources. If a task has a short execution time, this overhead might make the overall computation quite inefficient. Another way to introduce parallelism is to use lighter, user-level tasks called threads. In all languages addressed in this chapter, task management operations refer to such user-level tasks. The problem of finding the proper size of tasks, and hence the number of tasks, can be decided at compile or run time using heuristics.

In our Mandelbrot example, the parallel implementations we provide below use a static schedule that allocates a number of iterations of the loop row to a particular thread; we interleave successive iterations into distinct threads in a round-robin fashion, in order to group loop body computations into chunks of size height/P, where P is the (language-dependent) number of threads. Our intent here is to try to reach a good load balancing between threads.
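As an illustration, here is a minimal C sketch of this interleaving pattern, which the implementations below all follow in some form; process_row is a hypothetical helper standing for the per-row pixel computation.

void process_row(unsigned int row); /* hypothetical per-row Mandelbrot computation */

/* Thread m (0 <= m < P) handles rows m, m+P, m+2*P, ... so that each   */
/* thread gets about height/P rows, interleaved in round-robin fashion. */
void compute_rows(unsigned int m, unsigned int P, unsigned int height)
{
  for (unsigned int row = m; row < height; row += P)
    process_row(row);
}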

3.3.2 Synchronization

Coordination in task-parallel programs is a major source of complexity. It is dealt with using synchronization primitives, for instance when a code fragment contains many phases of execution, where each phase should wait for the preceding ones before proceeding. When a process or a thread exits before synchronizing on a barrier that other processes are waiting on, or when processes operate on different barriers using different orders, a deadlock occurs. Programmers must avoid these situations (and be deadlock-free). Different forms of synchronization constructs exist, such as mutual exclusion when accessing shared resources using locks, join operations that terminate child threads, multiple synchronizations using barriers2, and point-to-point synchronization using counting semaphores [95].
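As a minimal illustration of the last form, the C sketch below uses POSIX counting semaphores (an API outside the languages surveyed in this chapter) for point-to-point synchronization: a producer thread signals a consumer thread that a datum is ready.

#include <pthread.h>
#include <semaphore.h>

sem_t ready;  /* counting semaphore, initialized to 0 with sem_init(&ready, 0, 0) */
int datum;

void *producer(void *arg) {
  datum = 42;        /* produce the value */
  sem_post(&ready);  /* point-to-point signal: one unit becomes available */
  return NULL;
}

void *consumer(void *arg) {
  sem_wait(&ready);  /* blocks until the producer has posted */
  /* datum can now be read safely */
  return NULL;
}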

In our Mandelbrot example, we need to synchronize all pixel computations before exiting; we also need to use synchronization to deal with the atomic section (see the next section). Even though synchronization is rather simple in this example, caution is always needed; an example that may lead to deadlocks is mentioned in Section 3.5.

3.3.3 Atomicity

Access to shared resources requires atomic operations that, at any given time, can be executed by only one process or thread. Atomicity comes in two flavors: weak and strong [73]. A weak atomic statement is atomic only with respect to other explicitly atomic statements; no guarantee is made regarding interactions with non-isolated statements (not declared as atomic). By opposition, strong atomicity enforces non-interaction of atomic statements with all operations in the entire program. It usually requires specialized hardware support (e.g., atomic “compare and swap” operations), although a software implementation that treats non-explicitly atomic accesses as implicitly atomic single operations (using a single global lock) is possible.

In our Mandelbrot example, display accesses require connection to the X server; drawing a given pixel is an atomic operation, since GUI-specific calls need synchronization. Moreover, two simple examples of atomic sections are provided in Section 3.5.

2 The term “barrier” is used in various ways by authors [87]; we consider here that barriers are synchronization points that wait for the termination of sets of threads, defined in a language-dependent manner.

3.4 Parallel Programming Languages

We present here seven parallel programming languages and describe how they deal with the concepts introduced in the previous section. We also study how these languages distribute data over different processors. Given the large number of parallel languages that exist, we focus primarily on languages that are in current use and popular, and that support simple, high-level, task-oriented parallel abstractions.

3.4.1 Cilk

Cilk [103], developed at MIT, is a multithreaded parallel language based on C for shared memory systems. Cilk is designed for exploiting dynamic and asynchronous parallelism. A Cilk implementation of the Mandelbrot set is provided in Figure 3.3.

cilk_lock_init(display_lock);
for (m = 0; m < P; m++)
  spawn compute_points(m);
sync;

cilk void compute_points(uint m)
{
  for (row = m; row < height; row += P)
    for (col = 0; col < width; ++col) {
      // Initialization of c, k and z
      do {
        temp = zr*zr - zi*zi + cr;
        zi = 2*zr*zi + ci; zr = temp;
        ++k;
      } while (zr*zr + zi*zi < (N*N) && k < maxiter);
      color = (ulong) ((k-1) * scale_color) + min_color;
      cilk_lock(display_lock);
      XSetForeground(display, gc, color);
      XDrawPoint(display, win, gc, col, row);
      cilk_unlock(display_lock);
    }
}

Figure 3.3: Cilk implementation of the Mandelbrot set (P is the number of processors)

3 From now on, variable declarations are omitted unless required for the purpose of our presentation.


Task Parallelism

The cilk keyword identifies functions that can be spawned in parallel. A Cilk function may create threads to execute functions in parallel. The spawn keyword is used to create child tasks, such as compute_points in our example, when referring to Cilk functions.

Cilk introduces the notion of inlets [2], which are local Cilk functions defined to take the result of spawned tasks and use it (performing a reduction). The result should not be put in a variable in the parent function. All the variables of the function are available within an inlet. abort allows one to abort a speculative work by terminating all of the already spawned children of a function; it must be called inside an inlet. Inlets are not used in our example.

Synchronization

The sync statement is a local barrier, used in our example to ensure task termination. It waits only for the spawned child tasks of the current procedure to complete, and not for all tasks currently being executed.

Atomic Section

Mutual exclusion is implemented using locks of type cilk_lockvar, such as display_lock in our example. The function cilk_lock is used to test a lock and blocks if it is already acquired; the function cilk_unlock is used to release a lock. Both functions take a single argument, which is an object of type cilk_lockvar. cilk_lock_init is used to initialize the lock object before it is used.

Data Distribution

In Cilk's shared memory model, all variables declared outside Cilk functions are shared. Also, variables addressed indirectly by passing of pointers to spawned functions are shared. To avoid possible non-determinism due to data races, the programmer should either avoid the situation where a task writes a variable that may be read or written concurrently by another task, or use the primitive cilk_fence, which ensures that all memory operations of a thread are committed before the next operation is executed.

3.4.2 Chapel

Chapel [6], developed by Cray, supports both data and control flow parallelism and is designed around a multithreaded execution model based on PGAS for shared and distributed-memory systems. A Chapel implementation of the Mandelbrot set is provided in Figure 3.4.


coforall loc in Locales do
  on loc {
    for row in loc.id..height by numLocales do {
      for col in 1..width do {
        // Initialization of c, k and z
        do {
          temp = zr*zr - zi*zi + cr;
          zi = 2*zr*zi + ci; zr = temp;
          k = k+1;
        } while (zr*zr + zi*zi < (N*N) && k < maxiter);
        color = (ulong) ((k-1) * scale_color) + min_color;
        atomic {
          XSetForeground(display, gc, color);
          XDrawPoint(display, win, gc, col, row);
        }
      }
    }
  }

Figure 3.4: Chapel implementation of the Mandelbrot set

Task Parallelism

Chapel provides three types of task parallelism [6], two structured ones and one unstructured: cobegin{stmts} creates a task for each statement in stmts; the parent task waits for the stmts tasks to be completed. coforall is a loop variant of the cobegin statement, where each iteration of the coforall loop is a separate task and the main thread of execution does not continue until every iteration is completed. Finally, in begin{stmt}, the original parent task continues its execution after spawning a child running stmt.

Synchronization

In addition to cobegin and coforall, used in our example, which have an implicit synchronization at the end, synchronization variables of type sync can be used for coordinating parallel tasks. A sync [6] variable is either empty or full, with an additional data value. Reading an empty variable and writing in a full variable suspends the thread. Writing to an empty variable atomically changes its state to full. Reading a full variable consumes the value and atomically changes the state to empty.

Atomic Section

Chapel supports atomic sections: atomic{stmt} executes stmt atomically with respect to other threads. The precise semantics is still ongoing work [6].

Data Distribution

Chapel introduces a type called locale to refer to a unit of the machine resources on which a computation is running. A locale is a mapping of Chapel data and computations to the physical machine. In Figure 3.4, Array Locales represents the set of locale values corresponding to the machine resources on which this code is running; numLocales refers to the number of locales. Chapel also introduces new domain types to specify array distribution; they are not used in our example.

3.4.3 X10 and Habanero-Java

X10 [5], developed at IBM, is a distributed asynchronous dynamic parallel programming language for multi-core processors, symmetric shared-memory multiprocessors (SMPs), commodity clusters, high-end supercomputers and even embedded processors like Cell. An X10 implementation of the Mandelbrot set is provided in Figure 3.5.

Habanero-Java [30], under development at Rice University, is derived from X10 and introduces additional synchronization and atomicity primitives, surveyed below.

finish {
  for (m = 0; m < place.MAX_PLACES; m++) {
    place pl_row = place.places(m);
    async at (pl_row) {
      for (row = m; row < height; row += place.MAX_PLACES)
        for (col = 0; col < width; ++col) {
          // Initialization of c, k and z
          do {
            temp = zr*zr - zi*zi + cr;
            zi = 2*zr*zi + ci; zr = temp;
            ++k;
          } while (zr*zr + zi*zi < (N*N) && k < maxiter);
          color = (ulong) ((k-1) * scale_color) + min_color;
          atomic {
            XSetForeground(display, gc, color);
            XDrawPoint(display, win, gc, col, row);
          }
        }
    }
  }
}

Figure 3.5: X10 implementation of the Mandelbrot set

Task Parallelism

X10 provides two task creation primitives: (1) the async stmt construct creates a new asynchronous task that executes stmt, while the current thread continues, and (2) the future exp expression launches a parallel task that returns the value of exp.


Synchronization

With finish stmt, the current running task is blocked at the end of the finish clause, waiting till all the children spawned during the execution of stmt have terminated. The expression f.force() is used to get the actual value of the “future” task f.

X10 introduces a new synchronization concept: the clock. It acts as a barrier for a dynamically varying set of tasks [100] that operate in phases of execution, where each phase should wait for previous ones before proceeding. A task that uses a clock must first register with it (multiple clocks can be used). It then uses the statement next to signal to all the tasks that are registered with its clocks that it is ready to move to the following phase, and waits until all the clocks with which it is registered can advance. A clock can advance only when all the tasks that are registered with it have executed a next statement (see the left side of Figure 3.6).

Habanero-Java introduces phasers to extend this clock mechanism. A phaser is created and initialized to its first phase using the function new. The scope of a phaser is limited to the immediately enclosing finish statement; this constraint guarantees the deadlock-freedom safety property of phasers [100]. A task can be registered with zero or more phasers, using one of four registration modes: the first two are the traditional SIG and WAIT signal operations for producer-consumer synchronization; the SIG_WAIT mode implements barrier synchronization, while SIG_WAIT_SINGLE ensures, in addition, that its associated statement is executed by only one thread. As in X10, a next instruction is used to advance each phaser that this task is registered with to its next phase, in accordance with this task's registration mode, and to wait on each phaser that the task is registered with with a WAIT submode (see the right side of Figure 3.6). Note that clocks and phasers are not used in our Mandelbrot example, since a collective barrier based on the finish statement is sufficient.

// X10 (left)
finish async {
  clock cl = clock.make();
  for(j = 1; j <= n; j++)
    async clocked(cl) {
      S;
      next;
      S';
    }
}

// Habanero-Java (right)
finish {
  phaser ph = new phaser();
  for(j = 1; j <= n; j++)
    async phased(ph<SIG_WAIT>) {
      S;
      next;
      S';
    }
}

Figure 3.6: A clock in X10 (left) and a phaser in Habanero-Java (right)


Atomic Section

When a thread enters an atomic statement, no other thread may enter it until the original thread terminates it.

Habanero-Java supports weak atomicity, using the isolated stmt primitive for mutual exclusion. The Habanero-Java implementation takes a single-lock approach to deal with isolated statements.

Data Distribution

In order to distribute data across processors, X10 and HJ introduce a type called place. A place is an address space within which a task may run; different places may however refer to the same physical processor and share physical memory. The program address space is partitioned into logically distinct places. place.MAX_PLACES, used in Figure 3.5, is the number of places available to a program.

3.4.4 OpenMP

OpenMP [4] is an application program interface providing a multi-threaded programming model for shared memory parallelism; it uses directives to extend sequential languages. A C OpenMP implementation of the Mandelbrot set is provided in Figure 3.7.

P = omp_get_num_threads();
#pragma omp parallel shared(height, width, scale_r, \
        scale_i, maxiter, scale_color, min_color, r_min, i_min) \
        private(row, col, k, m, color, temp, z, c)
#pragma omp single
{
  for (m = 0; m < P; m++)
    #pragma omp task
    for (row = m; row < height; row += P)
      for (col = 0; col < width; ++col) {
        // Initialization of c, k and z
        do {
          temp = zr*zr - zi*zi + cr;
          zi = 2*zr*zi + ci; zr = temp;
          ++k;
        } while (zr*zr + zi*zi < (N*N) && k < maxiter);
        color = (ulong) ((k-1) * scale_color) + min_color;
        #pragma omp critical
        {
          XSetForeground(display, gc, color);
          XDrawPoint(display, win, gc, col, row);
        }
      }
}

Figure 3.7: C OpenMP implementation of the Mandelbrot set


Task Parallelism

OpenMP allows dynamic (omp task) and static (omp section) scheduling models. A task instance is generated each time a thread (the encountering thread) encounters an omp task directive. This task may either be scheduled immediately on the same thread, or deferred and assigned to any thread in a thread team, which is the group of threads created when an omp parallel directive is encountered. The omp sections directive is a non-iterative work-sharing construct. It specifies that the enclosed sections of code, declared with omp section, are to be divided among the threads in the team; these sections are independent blocks of code that the compiler can execute concurrently.
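A minimal sketch of this static omp sections model, assuming two hypothetical independent functions:

#pragma omp parallel
{
  #pragma omp sections
  {
    #pragma omp section
    work_on_left_half();   /* hypothetical independent block */
    #pragma omp section
    work_on_right_half();  /* hypothetical independent block */
  } /* implicit barrier at the end of the sections construct */
}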

Synchronization

OpenMP provides synchronization constructs that control the execution inside a thread team: barrier and taskwait. When a thread encounters a barrier directive, it waits until all other threads in the team reach the same point; the scope of a barrier region is the innermost enclosing parallel region. The taskwait construct is a restricted barrier that blocks the thread until all child tasks created since the beginning of the current task are completed. The omp single directive identifies code that must be run by only one thread.
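A minimal sketch contrasting taskwait with the implicit barrier of a parallel region, assuming two hypothetical task bodies:

#pragma omp parallel
#pragma omp single
{
  #pragma omp task
  child_a();             /* hypothetical child task */
  #pragma omp task
  child_b();             /* hypothetical child task */
  #pragma omp taskwait   /* waits only for child_a and child_b */
} /* implicit barrier here: all threads of the team synchronize */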

Atomic Section

The critical and atomic directives are used for identifying a section of code that must be executed by a single thread at a time. The atomic directive works faster than critical, since it only applies to single instructions and can thus often benefit from hardware support. Our implementation of the Mandelbrot set in Figure 3.7 uses critical for the drawing function (not a single instruction).

Data Distribution

OpenMP variables are either global (shared) or local (private); see Figure 3.7 for examples. A shared variable refers to one unique block of storage for all threads in the team. A private variable refers to a different block of storage for each thread. More memory access modes exist, such as firstprivate or lastprivate, that may require communication or copy operations.

3.4.5 MPI

MPI [3] is a library specification for message passing, used to program shared and distributed memory systems. Both point-to-point and collective communication are supported. The C MPI implementation of the Mandelbrot set is provided in Figure 3.8.

int rank;
MPI_Status status;
ierr = MPI_Init(&argc, &argv);
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
ierr = MPI_Comm_size(MPI_COMM_WORLD, &P);
for (m = 1; m < P; m++)
  if(rank == m)
    for (row = m; row < height; row += P) {
      for (col = 0; col < width; ++col) {
        // Initialization of c, k and z
        do {
          temp = zr*zr - zi*zi + cr;
          zi = 2*zr*zi + ci; zr = temp;
          ++k;
        } while (zr*zr + zi*zi < (N*N) && k < maxiter);
        color_msg[col] = (ulong) ((k-1) * scale_color) + min_color;
      }
      ierr = MPI_Send(color_msg, width, MPI_UNSIGNED_LONG, 0, 0,
                      MPI_COMM_WORLD);
    }
if(rank == 0)
  for (row = 0; row < height; row++) {
    ierr = MPI_Recv(data_msg, width, MPI_UNSIGNED_LONG,
                    MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);
    for (col = 0; col < width; ++col) {
      color = data_msg[col];
      XSetForeground(display, gc, color);
      XDrawPoint(display, win, gc, col, row);
    }
  }
ierr = MPI_Barrier(MPI_COMM_WORLD);
ierr = MPI_Finalize();

Figure 3.8: MPI implementation of the Mandelbrot set (P is the number of processors)

Task Parallelism

The starter process may be a separate process that is not part of the MPI application, or the rank 0 process may act as a starter process to launch the remaining MPI processes of the MPI application. MPI processes are created when MPI_Init is called; it also defines the initial universe intracommunicator for all processes to drive various communications. A communicator determines the scope of communications. Processes are terminated using MPI_Finalize.

Synchronization

MPI_Barrier blocks until all processes in the communicator passed as an argument have reached this call.

Atomic Section

MPI does not support atomic sections. Indeed, the atomic section is a concept intrinsically linked to shared memory. In our example, the drawing function is executed by the master process (the rank 0 process).

Data Distribution

MPI supports the distributed memory model, where it proceeds by point-to-point and collective communications to transfer data between processors. In our implementation of the Mandelbrot set, we used the point-to-point routines MPI_Send and MPI_Recv to communicate between the master process (rank = 0) and the other processes.

3.4.6 OpenCL

OpenCL (Open Computing Language) [69] is a standard for programming heterogeneous multiprocessor platforms where programs are divided into several parts: some, called “the kernels”, that execute on separate devices, e.g., GPUs, with their own memories, and the others that execute on the host CPU. The main object in OpenCL is the command queue, which is used to submit work to a device by enqueueing OpenCL commands to be executed. An OpenCL implementation of the Mandelbrot set is provided in Figure 3.9.

Task Parallelism

OpenCL provides the parallel construct clEnqueueTask, which enqueues a command requiring the execution of a kernel on a device by a work item (OpenCL thread). OpenCL uses two different models of execution of command queues: in-order, used for data parallelism, and out-of-order. In an out-of-order command queue, commands are executed as soon as possible, and no order is specified, except for wait and barrier events. We illustrate the out-of-order execution mechanism in Figure 3.9, but currently this is an optional feature and is thus not supported by many devices.


__kernel void kernel_main(complex c, uint maxiter, double scale_color,
                          uint m, uint P, ulong color[NPIXELS][NPIXELS])
{
  for (row = m; row < NPIXELS; row += P)
    for (col = 0; col < NPIXELS; ++col) {
      // Initialization of c, k and z
      do {
        temp = zr*zr - zi*zi + cr;
        zi = 2*zr*zi + ci; zr = temp;
        ++k;
      } while (zr*zr + zi*zi < (N*N) && k < maxiter);
      color[row][col] = (ulong) ((k-1) * scale_color);
    }
}

cl_int ret = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
ret = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_DEFAULT, 1,
                     &device_id, &ret_num_devices);
cl_context context = clCreateContext(NULL, 1, &device_id,
                                     NULL, NULL, &ret);
cQueue = clCreateCommandQueue(context, device_id,
                              OUT_OF_ORDER_EXEC_MODE_ENABLE, NULL);
P = CL_DEVICE_MAX_COMPUTE_UNITS;
memc = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(complex), c);
// Create read-only buffers with maxiter, scale_color and P too
memcolor = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                          sizeof(ulong)*height*width, NULL, NULL);
clEnqueueWriteBuffer(cQueue, memc, CL_TRUE, 0,
                     sizeof(complex), &c, 0, NULL, NULL);
// Enqueue write buffers with maxiter, scale_color and P too
program = clCreateProgramWithSource(context, 1, &program_source,
                                    NULL, NULL);
err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
kernel = clCreateKernel(program, "kernel_main", NULL);
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memc);
// Set kernel arguments with
// memmaxiter, memscale_color, memP and memcolor too
for(m = 0; m < P; m++) {
  memm = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(uint), m);
  clEnqueueWriteBuffer(cQueue, memm, CL_TRUE, 0, sizeof(uint), &m, 0,
                       NULL, NULL);
  clSetKernelArg(kernel, 3, sizeof(cl_mem), (void *)&memm);
  clEnqueueTask(cQueue, kernel, 0, NULL, NULL);
}
clFinish(cQueue);
clEnqueueReadBuffer(cQueue, memcolor, CL_TRUE, 0, space, color, 0, NULL, NULL);
for (row = 0; row < height; ++row)
  for (col = 0; col < width; ++col) {
    XSetForeground(display, gc, color[col][row]);
    XDrawPoint(display, win, gc, col, row);
  }

Figure 3.9: OpenCL implementation of the Mandelbrot set

Synchronization

OpenCL distinguishes between two types of synchronization: coarse and fine. Coarse-grained synchronization, which deals with command queue operations, uses the construct clEnqueueBarrier, which defines a barrier synchronization point. Fine-grained synchronization, which covers synchronization at the GPU function call granularity level, uses OpenCL events via clEnqueueWaitForEvents calls.

Data transfers between the GPU memory and the host memory, via functions such as clEnqueueReadBuffer and clEnqueueWriteBuffer, also induce synchronization between blocking or non-blocking communication commands. Events returned by clEnqueue operations can be used to check if a non-blocking operation has completed.
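A minimal sketch of this event mechanism, reusing the cQueue and memcolor objects of Figure 3.9 (the size expression is an assumption): the read is enqueued as non-blocking (CL_FALSE), and the returned event is then used to wait for its completion.

cl_event read_done;
/* Non-blocking read: the call returns immediately, the transfer completes later */
clEnqueueReadBuffer(cQueue, memcolor, CL_FALSE, 0,
                    sizeof(ulong)*height*width, color,
                    0, NULL, &read_done);
/* ... other host-side work can overlap with the transfer here ... */
clWaitForEvents(1, &read_done);  /* host blocks until the read has completed */
clReleaseEvent(read_done);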

Atomic Section

Atomic operations are only supported on integer data, via functions such as atom_add or atom_xchg. Currently, these are only supported by some devices, as part of an extension of the OpenCL standard. Since OpenCL lacks support for general atomic sections, the drawing function is executed by the host in Figure 3.9.

Data Distribution

Each work item can either use (1) its private memory, (2) its local memory, which is shared between multiple work items, (3) its constant memory, which is closer to the processor than the global memory and thus much faster to access, although slower than local memory, and (4) global memory, shared by all work items. Data is only accessible after being transferred from the host, using functions such as clEnqueueReadBuffer and clEnqueueWriteBuffer that move data in and out of a device.
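These four memory spaces correspond to the __private, __local, __constant and __global address space qualifiers of OpenCL C; the hypothetical kernel signature below only illustrates where each qualifier appears.

__kernel void sketch(__global   ulong  *out,      /* global: visible to all work items        */
                     __constant double *coeffs,   /* constant: read-only, close to the device */
                     __local    float  *scratch)  /* local: shared within one work group      */
{
  __private int k = 0;  /* private: one copy per work item (the default for kernel locals) */
  /* ... */
}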

3.5 Discussion and Comparison

This section discusses the salient features of our surveyed languages. More specifically, we look at their design philosophy and the new concepts they introduce, how point-to-point synchronization is addressed in each of these languages, the various semantics of atomic sections, and the data distribution issues. We end by summarizing the key features of all the languages covered in this chapter.

Design Paradigms

Our overview study, based on a single running example, namely the computation of the Mandelbrot set, is admittedly somewhat biased, since each language has been designed with a particular application framework in mind, which may or may not be well adapted to a given application. Cilk is well suited to deal with divide-and-conquer strategies, something not put into practice in our example. On the contrary, X10, Chapel and Habanero-Java are high-level Partitioned Global Address Space languages that offer abstract notions such as places and locales, which were put to good use in our example. OpenCL is a low-level, verbose language that works across GPUs and CPUs; our example clearly illustrates that this approach is not providing much help here in terms of shrinking the semantic gap between specification and implementation. The OpenMP philosophy is to add compiler directives to parallelize parts of code on shared-memory machines; this helps programmers move incrementally from a sequential to a parallel implementation. While OpenMP suffers from portability issues, MPI tackles both shared and distributed memory systems using a low-level support for communications.

New Concepts

Even though this thesis does not address data parallelism per se, note that Cilk is the only language that does not provide support for data parallelism; yet spawned threads can be used inside loops to simulate SIMD processing. Also, Cilk adds a facility to support speculative parallelism, enabling spawned task abort operations via the abort statement. Habanero-Java introduces the isolated statement to specify the weak atomicity property. Phasers in Habanero-Java and clocks in X10 are new high-level constructs for collective and point-to-point synchronization between varying sets of threads.

Point-to-Point Synchronization

We illustrate the way the surveyed languages address the difficult issue of point-to-point synchronization via a simple example, a hide-and-seek children game, in Figure 3.10. X10 clocks or Habanero-Java phasers help express easily the different phases between threads. The notion of point-to-point synchronization cannot be expressed easily using OpenMP or Chapel4. We were not able to implement this game using Cilk high-level synchronization primitives, since sync, the only synchronization construct, is a local barrier for recursive tasks: it synchronizes only threads spawned in the current procedure, and thus not the two searcher and hider tasks. As mentioned above, this is not surprising, given Cilk's approach to parallelism.

Atomic Section

The semantics and implementations of the various proposals for dealing with atomicity are rather subtle.

Atomic operations, which apply to single instructions, can be efficiently implemented, e.g., in X10, using non-blocking techniques such as the atomic instruction compare-and-swap.

4 Busy-waiting techniques can be applied.


// X10
finish async {
  clock cl = clock.make();
  async clocked(cl) {
    count_to_a_number();
    next;
    start_searching();
  }
  async clocked(cl) {
    hide_oneself();
    next;
    continue_to_be_hidden();
  }
}

// Habanero-Java
finish async {
  phaser ph = new phaser();
  async phased(ph) {
    count_to_a_number();
    next;
    start_searching();
  }
  async phased(ph) {
    hide_oneself();
    next;
    continue_to_be_hidden();
  }
}

// Cilk
cilk void searcher() {
  count_to_a_number();
  point_to_point_sync(); // missing
  start_searching();
}
cilk void hidder() {
  hide_oneself();
  point_to_point_sync(); // missing
  continue_to_be_hidden();
}
void main() {
  spawn searcher();
  spawn hidder();
}

Figure 3.10: A hide-and-seek game (X10, HJ, Cilk)

In OpenMP, the atomic directive can be made to work faster than the critical directive when atomic operations are replaced with processor commands such as GLSC [71] instructions, or “gather-linked and scatter-conditional”, which extend scatter-gather hardware to support advanced atomic memory operations; it is therefore better to use this directive when protecting shared memory during elementary operations. Atomic operations can be used to update different elements of a data structure (arrays, records) in parallel without using many explicit locks. In the example of Figure 3.11, the updates of different elements of Array x are allowed to occur in parallel. General atomic sections, on the other hand, serialize the execution of updates to elements via one lock.

With the weak atomicity model of Habanero-Java, the isolated keyword is used instead of atomic to make explicit the fact that the construct supports weak rather than strong isolation. In Figure 3.12, Threads 1 and 2 may access ptr simultaneously; since weakly atomic accesses are used, an atomic access to temp->next is not enforced.


#pragma omp parallel for shared(x, index, n)
for (i = 0; i < n; i++) {
  #pragma omp atomic
  x[index[i]] += f(i); // index is supposed injective
}

Figure 3.11: Example of an atomic directive in OpenMP

// Thread 1
ptr = head;          // non isolated statement
isolated {
  ready = true;
}

// Thread 2
isolated {
  if (ready)
    temp->next = ptr;
}

Figure 3.12: Data race on ptr with Habanero-Java

Data Distribution

PGAS languages offer a compromise between the fine level of control of data placement provided by the message passing model and the simplicity of the shared memory model. However, the physical reality is that different PGAS portions, although logically distinct, may refer to the same physical processor and share physical memory. Practical performance might thus not be as good as expected.

Regarding the shared memory model, despite its simplicity of programming, programmers have scarce support for expressing data locality, which could help improve performance in many cases. Debugging is also difficult when data races or deadlocks occur.

Finally, the message passing memory model, where processors have no direct access to the memories of other processors, can be seen as the most general one, in which programmers can both specify data distribution and control locality. The shared memory (where there is only one processor managing the whole memory) and PGAS (where one assumes that each portion is located on a distinct processor) models can be seen as particular instances of the message passing model, obtained by converting implicit write and read operations into explicit send/receive message passing constructs.
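A minimal C/MPI sketch of that conversion, with hypothetical variables x and y: an implicit shared-memory assignment x = y, where y is produced by another processor, becomes an explicit message.

/* Shared memory: processor 1 simply reads what processor 0 wrote:  x = y;  */
/* Message passing: the same transfer must be made explicit.                */
if (rank == 0)
  MPI_Send(&y, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   /* owner of y sends it */
else if (rank == 1)
  MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
           MPI_STATUS_IGNORE);                          /* x receives y's value */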

Summary Table

We collect in Table 3.1 the main characteristics of each language addressed in this chapter. Even though we have not discussed the issue of data parallelism in this chapter, we nonetheless provide the main constructs used in each language to launch data parallel computations.


Language        Task creation  Task join       Point-to-point  Atomic section   Data parallelism  Memory model
Cilk (MIT)      spawn, abort   sync            --              cilk_lock,       --                Shared
                                                               cilk_unlock
Chapel (Cray)   begin,         sync            sync            sync,            forall,           PGAS
                cobegin                                        atomic           coforall          (Locales)
X10 (IBM)       async,         finish,         next            atomic           foreach           PGAS
                future         force                                                              (Places)
Habanero-       async,         finish,         next            atomic,          foreach           PGAS
Java (Rice)     future         get                             isolated                           (Places)
OpenMP          omp task,      omp taskwait,   --              omp critical,    omp for           Shared
                omp section    omp barrier                     omp atomic
OpenCL          EnqueueTask    Finish,         events          atom_add         EnqueueND-        Message
                               EnqueueBarrier                                   RangeKernel       passing
MPI             MPI_spawn      MPI_Finalize,   --              --               MPI_Init          Message
                               MPI_Barrier                                                        passing

Table 3.1: Summary of parallel languages constructs

3.6 Conclusion

Using the Mandelbrot set computation as a running example, this chapter presents an up-to-date comparative overview of seven parallel programming languages: Cilk, Chapel, X10, Habanero-Java, OpenMP, MPI and OpenCL. These languages are in current use, popular, offer rich and highly abstract functionalities, and most support both data and task parallel execution models. The chapter describes how, in addition to data distribution and locality, the fundamentals of task parallel programming, namely task creation, collective and point-to-point synchronization, and mutual exclusion, are dealt with in these languages.

This study serves as the basis for the design of SPIRE, a sequential to parallel intermediate representation extension that we use to upgrade the intermediate representations of the PIPS [19] source-to-source compilation framework, in order to represent task concepts in parallel languages. SPIRE is presented in the next chapter.

An earlier version of the work presented in this chapter was published in [66].

Chapter 4

SPIRE: A Generic Sequential to Parallel Intermediate Representation Extension Methodology

I was working on the proof of one of my poems all the morning, and took out a comma. In the afternoon I put it back again. Oscar Wilde

The research goal addressed in this thesis is the design of efficient automatic parallelization algorithms that use, as a backend, a uniform, simple and generic parallel intermediate core language. Instead of designing from scratch one such particular intermediate language, we take an indirect, more general route. We use the survey presented in Chapter 3 to design SPIRE, a new methodology for the design of parallel extensions of the intermediate representations used in compilation frameworks of sequential languages. It can be used to leverage existing infrastructures for sequential languages to address both control and data parallel constructs, while preserving as much as possible existing analyses for sequential code. In this chapter, we suggest to view this upgrade process as an “intermediate representation transformer” at the syntactic and semantic levels; we show this can be done via the introduction of only ten new concepts, collected in three groups, namely execution, synchronization and data distribution, precisely defined via a formal semantics and rewriting rules.

We use the sequential intermediate representation of PIPS, a comprehensive source-to-source compilation platform, as a use case for our approach to the definition of parallel intermediate languages. We introduce our SPIRE parallel primitives, extend the PIPS intermediate representation and show how example code snippets from the OpenCL, Cilk, OpenMP, X10, Habanero-Java, MPI and Chapel parallel programming languages can be represented this way. A formal definition of the SPIRE operational semantics is provided, built on top of the one used for the sequential intermediate representation. We assess the generality of our proposal by showing how a different sequential IR, namely LLVM, can be extended to handle parallelism using the SPIRE methodology.

Our primary goal with the development of SPIRE is to provide, at a low cost, powerful parallel program representations that ease the design of efficient automatic parallelization algorithms. More precisely, in this chapter, our intent is to describe SPIRE as a uniform, simple and generic core language transformer and apply it to the PIPS IR. We illustrate the applicability of the resulting parallel IR by using it as parallel code target in Chapter 7, where we also address code generation issues.

4.1 Introduction

The growing importance of parallel computers and the search for efficient programming models led, and is still leading, to the proliferation of parallel programming languages such as, currently, Cilk [103], Chapel [6], X10 [5], Habanero-Java [30], OpenMP [4], OpenCL [69] or MPI [3], presented in Chapter 3. Our automatic task parallelization algorithm, which we integrate within PIPS [58], a comprehensive source-to-source compilation and optimization platform, should help in the adaptation of PIPS to such an evolution. To reach this goal, we introduce an internal intermediate representation (IR) to represent parallel programs within PIPS. The choice of a proper parallel IR is of key importance, since the efficiency and power of the transformations and optimizations the PIPS compiler can perform are closely related to the selection of a proper program representation paradigm. This seems to suggest the introduction of a dedicated IR for each parallel programming paradigm. Yet it would be better, from a software engineering point of view, to find a unique parallel IR, as general and simple as possible.

Existing proposals for program representation techniques already provide a basis for the exploitation of parallelism via the encoding of control and/or data flow information. HPIR [111], PLASMA [89] or InsPIRe [57] are instances that operate at a high abstraction level, while the hierarchical task, stream or program dependence graphs (we survey these notions in Section 4.5) are better suited to graph-based approaches. Unfortunately, many existing compiler frameworks such as PIPS use traditional representations for sequential-only programs, and changing their internal data structures to deal with parallel constructs is a difficult and time-consuming task.

In this thesis, we choose to develop a methodology for the design of parallel extensions of the intermediate representations used in compilation frameworks of sequential languages in general, instead of developing a parallel intermediate representation for PIPS only. The main motivation behind the design of the methodology introduced in this chapter is to preserve the many years of development efforts invested in huge compiler platforms such as GCC (more than 7 million lines of code), PIPS (600 000 lines of code) or LLVM (more than 1 million lines of code) when upgrading their intermediate representations to handle parallel languages, as source languages or as targets for source-to-source transformations. We provide an evolutionary path for these large software developments via the introduction of the Sequential to Parallel Intermediate Representation Extension (SPIRE) methodology. We show that it can be plugged into existing compilers in a rather simple manner. SPIRE is based on only three key concepts: (1) the parallel vs. sequential execution of groups of statements such as sequences, loops and general control-flow graphs, (2) the global synchronization characteristics of statements and the specification of finer grain synchronization via the notion of events, and (3) the handling of data distribution for different memory models. To illustrate how this approach can be used in practice, we use SPIRE to extend the intermediate representation (IR) [32] of PIPS.

The design of SPIRE is the result of many trade-offs between generality and precision, abstraction and low-level concerns. On the one hand, and in particular when looking at source-to-source optimizing compiler platforms adapted to multiple source languages, one needs to (1) represent as many of the existing (and hopefully future) parallel constructs as possible, (2) minimize the number of additional built-in functions that hamper compiler optimization passes, and (3) preserve high-level structured parallel constructs to keep their properties, such as deadlock freedom, while minimizing the number of new concepts introduced in the parallel IR. On the other hand, keeping only a limited number of hardware-level notions in the IR, while good enough to deal with all parallel constructs, would entail convoluted rewritings of high-level parallel flows. We used an extensive survey of key parallel languages, namely Cilk, Chapel, X10, Habanero-Java, OpenMP, OpenCL and MPI, to guide our design of SPIRE, while showing how to express their relevant parallel constructs within SPIRE.

The remainder of this chapter is structured as follows. Our parallel extension proposal, SPIRE, is introduced in Section 4.2, where we illustrate its application to the PIPS sequential IR presented in Section 2.6.5; we also show how simple illustrative examples written in OpenCL, Cilk, OpenMP, X10, Habanero-Java, MPI and Chapel can be easily represented within SPIRE(PIPS IR)1. The formal operational semantics of SPIRE is given in Section 4.3. Section 4.4 shows the generality of the SPIRE methodology by showing its use on LLVM. We survey existing parallel IRs in Section 4.5. We conclude in Section 4.6.

4.2 SPIRE, a Sequential to Parallel IR Extension

In this section, we present in detail the SPIRE methodology, which is used to add parallel concepts to sequential IRs. After introducing our design philosophy, we describe the application of SPIRE to the PIPS IR presented in Section 2.6.5. We illustrate these SPIRE-derived constructs with code excerpts from various parallel programming languages; our intent is not to provide here general rewriting techniques from these to SPIRE, but to provide hints on how such rewritings might possibly proceed. Note that in Section 4.4, using LLVM, we show that our methodology is general enough to be adapted to other IRs.

1 SPIRE(PIPS IR) means that the transformation function SPIRE is applied to the sequential IR of PIPS; it yields its parallel IR.

4.2.1 Design Approach

SPIRE intends to be a practical methodology to extend existing sequential IRs to parallelism constructs, either to generate parallel code from sequential programs or to compile explicitly parallel programming languages. Interestingly, the idea of seeing the issue of parallelism as an extension over sequential concepts is in sync with Dijkstra's view that “parallelism or concurrency are operational concepts that refer not to the program, but to its execution” [40]. If one accepts such a vision, adding parallelism extensions to existing IRs, as advocated by our approach with SPIRE, can thus, at a fundamental level, not be seen as an afterthought but as a consequence of the fundamental nature of parallelism.

Our design of SPIRE does not intend to be minimalist, but to be as seamlessly as possible integrable within actual IRs, and general enough to handle as many parallel programming constructs as possible. To be successful, our design point must provide proper trade-offs between generality, expressibility and conciseness of representation. We used an extensive survey of existing parallel languages to guide us during this design process. Table 4.1, which extends the one provided in Chapter 3, summarizes the main characteristics of seven recent and widely used parallel languages: Cilk, Chapel, X10, Habanero-Java, OpenMP, OpenCL and MPI. The main constructs used in each language to launch task and data parallel computations, perform synchronization, introduce atomic sections and transfer data in the various memory models are listed. Our main finding from this analysis is that, to be able to deal with parallel programming, one simply needs to add to a given sequential IR the ability to specify (1) the parallel execution mechanism of groups of statements, (2) the synchronization behavior of statements, and (3) the layout of data, i.e., how memory is modeled in the parallel language.

The last line of Table 4.1 summarizes the approach we propose to map these programming concepts to our parallel intermediate representation extension. SPIRE is based on the introduction of only ten key notions, collected in three groups:

bull execution via the sequential and parallel constructs

bull synchronization via the spawn barrier atomic single signal andwait constructs


(Column 2 covers Execution; columns 3 to 6, Synchronization; columns 7 and 8, the Memory model.)

| Language | Parallelism | Task creation | Task join | Point-to-point | Atomic section | Memory model | Data distribution |
|---|---|---|---|---|---|---|---|
| Cilk (MIT) | — | spawn | sync | — | Cilk_lock | Shared | — |
| Chapel (Cray) | forall, coforall, cobegin | begin | sync | sync | sync, atomic | PGAS (Locales) | (on) |
| X10 (IBM) | foreach | async, future | finish, force | next | atomic | PGAS (Places) | (at) |
| Habanero-Java (Rice) | foreach | async, future | finish, get | next | atomic, isolated | PGAS (Places) | (at) |
| OpenMP | omp for, omp sections | omp task, omp section | omp taskwait, omp barrier | — | omp critical, omp atomic | Shared | private, shared |
| OpenCL | EnqueueNDRangeKernel | EnqueueTask | Finish, EnqueueBarrier | events | atom_add | Distributed | ReadBuffer, WriteBuffer |
| MPI | MPI_Init | MPI_spawn | MPI_Finalize, MPI_Barrier | — | — | Distributed | MPI_Send, MPI_Recv |
| SPIRE | sequential, parallel | spawn | barrier | signal, wait | atomic | Shared, Distributed | send, recv |

Table 4.1: Mapping of SPIRE to parallel languages constructs (terms in parentheses are not currently handled by SPIRE)

• data distribution, via the send and recv constructs.

Small code snippets are provided below to sketch how the key constructs of these parallel languages can be encoded in practice within a SPIRE-extended parallel IR.

4.2.2 Execution

The issue of parallel vs. sequential execution appears when dealing with groups of statements, which in our case study correspond to members of the forloop, sequence and unstructured sets presented in Section 2.6.5. To apply SPIRE to the PIPS sequential IR, an execution attribute is added to these sequential set definitions:

forloop' = forloop x execution

sequence' = sequence x execution

unstructured' = unstructured x execution

The primed sets forloop' (expressing data parallelism) and sequence' and unstructured' (implementing control parallelism) represent SPIREd-up sets for the PIPS parallel IR. Of course, the 'prime' notation is used here for pedagogical purposes only; in practice, an execution field is added in the existing IR representation. The definition of execution is straightforward:


execution = sequential:unit + parallel:unit

where unit denotes a set with one single element; this encodes a simple enumeration of cases for execution. A parallel execution attribute asks for all loop iterations, sequence statements and control nodes of unstructured instructions to be run concurrently. Note that there is no implicit barrier after the termination of these parallel statements; in Section 4.2.3, we introduce the annotation barrier that can be used to add an explicit barrier on these parallel statements if need be.
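To make this extension mechanism concrete, the following minimal C sketch (our own illustration; the type and field names are hypothetical, not the actual PIPS/Newgen ones) shows how an execution field could be grafted onto an existing loop node of a sequential IR:

    /* Hypothetical sketch of the SPIRE 'execution' extension on a C-based IR. */
    typedef enum { EXEC_SEQUENTIAL, EXEC_PARALLEL } execution;

    struct statement;                   /* existing sequential IR node, unchanged */

    typedef struct forloop {
        struct statement *body;         /* ... existing sequential fields ...     */
        execution exec;                 /* SPIRE addition, yielding forloop'      */
    } forloop;

    /* Marking a loop parallel is then a one-field update; the sequential
       part of the IR is left untouched. */
    static void mark_parallel(forloop *loop) { loop->exec = EXEC_PARALLEL; }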

For instance, a parallel execution construct can be used to represent data parallelism on GPUs when expressed via the OpenCL clEnqueueNDRangeKernel function (see the top of Figure 4.1). This function call could be encoded within the PIPS parallel IR as a parallel loop (see the bottom of Figure 4.1), each iteration executing the kernel function as a separate task receiving the proper index value as an argument.

// Execute 'n' kernels in parallel
global_work_size[0] = n;
err = clEnqueueNDRangeKernel(cmd_queue, kernel,
                             1, NULL, global_work_size,
                             NULL, 0, NULL, NULL);

forloop(I, 1,
        global_work_size,
        1,
        kernel(),
        parallel)

Figure 4.1: OpenCL example illustrating a parallel loop

Another example, in the left side of Figure 4.2, from Chapel, illustrates its forall data parallelism construct, which will be encoded with a SPIRE parallel loop.

forall i in 1..n do
  t[i] = 0;

forloop(i, 1, n, 1,
        t[i] = 0, parallel)

Figure 4.2: forall in Chapel and its SPIRE core language representation

The main difference between a parallel sequence and a parallel unstructured is that, in a parallel unstructured, synchronizations are explicit, using arcs that represent dependences between statements. An example of an unstructured code is illustrated in Figure 4.3, where the graph in the right side is a parallel unstructured representation of the code in the left side. Nodes are statements and edges are data dependences between them; for example, the statement u=x+1 cannot be launched before the statement x=3 is terminated. Figure 4.4, from OpenMP, illustrates its parallel sections task parallelism construct, which will be encoded with a SPIRE parallel sequence. We can say that a parallel sequence is only suitable for fork-join parallelism, whereas any other unstructured form of parallelism can be encoded using the control flow graph (unstructured) set.

x = 3;      // statement 1
y = 5;      // statement 2
z = x + y;  // statement 3
u = x + 1;  // statement 4

[Control flow graph: entry node (nop) leading to x=3 and y=5; x=3 leading to z=x+y and u=x+1; y=5 leading to z=x+y; all leading to the exit node (nop).]

Figure 4.3: A C code and its unstructured parallel control flow graph representation

#pragma omp sections nowait
{
  #pragma omp section
  x = foo();
  #pragma omp section
  y = bar();
}
z = baz(x,y);

sequence(
  sequence(x = foo(), y = bar(),
           parallel),
  z = baz(x,y),
  sequential)

Figure 4.4: parallel sections in OpenMP and its SPIRE core language representation

Representing the presence of parallelism using only the annotation parallel constitutes a voluntary trade-off between conciseness and expressibility of representation. For example, representing the OpenCL code of Figure 4.1 in the way we suggest may induce a loss of precision regarding the use of queues of tasks adopted in OpenCL, and lead to errors; great care must be taken when analyzing programs to avoid errors in translation between OpenCL and SPIRE, such as the scheduling of these tasks.


4.2.3 Synchronization

The issue of synchronization is a characteristic feature of the run-time behavior of one statement with respect to other statements. In parallel code, one usually distinguishes between two types of synchronization: (1) collective synchronization between threads using barriers, and (2) point-to-point synchronization between participating threads. We suggest this can be done in two parts.

Collective Synchronization

The first possibility is via the new synchronization attribute, which can be added to the specification of statements. SPIRE extends sequential intermediate representations in a straightforward way by adding a synchronization attribute to the specification of statements:

statement' = statement x synchronization

Coordination by synchronization in parallel programs is often dealt with via coding patterns such as barriers, used for instance when a code fragment contains many phases of parallel execution where each phase should wait for the precedent ones to proceed. We define the synchronization set via high-level coordination characteristics useful for optimization purposes:

synchronization =
    none:unit + spawn:entity + barrier:unit +
    single:bool + atomic:reference

where S is the statement with the synchronization attribute

• none specifies the default behavior, i.e., independent with respect to other statements, for S;

• spawn induces the creation of an asynchronous task S, while the value of the corresponding entity is the user-chosen number of the thread that executes S. SPIRE thus, in addition to decorating parallel tasks using the parallel execution attribute to handle coarse-grain parallelism, provides the possibility of specifying each task via the spawn synchronization attribute, in order to enable the encoding of a finer grain level of parallelism;

• barrier specifies that all the child threads spawned by the execution of S are suspended before exiting until they are all finished; an OpenCL example illustrating spawn (to encode the clEnqueueTask instruction) and barrier (to encode the clEnqueueBarrier instruction) is provided in Figure 4.5;


mode = OUT_OF_ORDER_EXEC_MODE_ENABLE;
commands = clCreateCommandQueue(context,
                                device_id, mode, &err);
clEnqueueTask(commands, kernel_A, 0, NULL, NULL);
clEnqueueTask(commands, kernel_B, 0, NULL, NULL);
// synchronize so that Kernel C starts only
// after Kernels A and B have finished
clEnqueueBarrier(commands);
clEnqueueTask(commands, kernel_C, 0, NULL, NULL);

barrier(
  spawn(zero, kernel_A()),
  spawn(one, kernel_B())
)
spawn(zero, kernel_C())

Figure 4.5: OpenCL example illustrating spawn and barrier statements

• single ensures that S is executed by only one thread in its thread team (a thread team is the set of all the threads spawned within the innermost parallel forloop statement); a barrier exists at the end of a single operation if its synchronization single value is true;

• atomic predicates the execution of S via the acquisition of a lock to ensure exclusive access; at any given time, S can be executed by only one thread. Locks are logical memory addresses, represented here by a member of the PIPS IR reference set presented in Section 2.6.5. An example illustrating how an atomic synchronization on the reference l in a SPIRE statement accessing Array x can be translated in Cilk (via Cilk_lock and Cilk_unlock) and OpenMP (critical) is provided in Figure 4.6.

Cilk_lockvar l;
Cilk_lock_init(l);
Cilk_lock(l);
x[index[i]] += f(i);
Cilk_unlock(l);

#pragma omp critical
x[index[i]] += f(i);

Figure 4.6: Cilk and OpenMP examples illustrating an atomically-synchronized statement


Event API: Point-to-Point Synchronization

The second possibility addresses the case of point-to-point synchronization between participating threads. Handling point-to-point synchronization using decorations on abstract syntax trees is too constraining when one has to deal with a varying set of threads that may belong to different parallel parent nodes. Thus SPIRE suggests to deal with this last class of coordination using a new class of values, of the event type. SPIRE extends the underlying type system of the existing sequential IRs with a new basic type, namely event:

type' = type + event:unit

Values of type event are counters, in a manner reminiscent of semaphores. The programming interface for events is defined by the following atomic functions:

• event newEvent(int i) is the creation function of events, initialized with the integer i that specifies how many threads can execute wait on this event without being blocked;

• void freeEvent(event e) is the deallocation function of the event e;

• void signal(event e) increments by one the event value2 of e;

• void wait(event e) blocks the thread that calls it until the value of e is strictly greater than 0; when the thread is released, this value is decremented by one.
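As an illustration of these semantics, here is one possible shared-memory realization of the event API, sketched with POSIX threads; this is our own sketch, not the SPIRE or PIPS runtime, and the functions are renamed signal_event/wait_event to avoid clashing with the C library signal:

    /* Hypothetical pthread-based implementation of the SPIRE event API. */
    #include <pthread.h>
    #include <stdlib.h>

    typedef struct event {
        int value;                      /* counter; may be initialized negative */
        pthread_mutex_t lock;
        pthread_cond_t  cond;
    } *event;

    event newEvent(int i) {
        event e = malloc(sizeof(*e));
        e->value = i;
        pthread_mutex_init(&e->lock, NULL);
        pthread_cond_init(&e->cond, NULL);
        return e;
    }

    void freeEvent(event e) {
        pthread_cond_destroy(&e->cond);
        pthread_mutex_destroy(&e->lock);
        free(e);
    }

    void signal_event(event e) {        /* 'signal' in SPIRE */
        pthread_mutex_lock(&e->lock);
        e->value++;                      /* non-blocking increment */
        pthread_cond_broadcast(&e->cond);
        pthread_mutex_unlock(&e->lock);
    }

    void wait_event(event e) {           /* 'wait' in SPIRE */
        pthread_mutex_lock(&e->lock);
        while (e->value <= 0)            /* block until the value is > 0 */
            pthread_cond_wait(&e->cond, &e->lock);
        e->value--;                      /* consume one unit upon release */
        pthread_mutex_unlock(&e->lock);
    }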

In a first example of possible use of this event API, the construct future used in X10 (see Figure 4.7) can be seen as the spawning of the computation of foo(). The end result is obtained via the call to the force method; such a mechanism can be easily implemented in SPIRE using an event attached to the running task: it is signaled when the task is completed and waited by the force method.

future<int> Fi = future foo();

int i = Fi.force();

Figure 4.7: X10 example illustrating a future task and its synchronization

A second example, taken from Habanero-Java, illustrates how point-to-point synchronization primitives such as phasers and the next statement can be dealt with using the Event API (see Figure 4.8, left). The async phased

2 The void return type will be replaced by int in practice, to enable the handling of error values.


keyword can be replaced by spawn. In this example, the next statement is equivalent to the following sequence:

signal(ph)

wait(ph)

signal(ph)

where the event ph is supposed to be initialized to newEvent(-(n-1)); the second signal is used to resume the suspended tasks in a chain-like fashion.

finish {
  phaser ph = new phaser();
  for (j = 1; j <= n; j++)
    async phased(ph<SIG_WAIT>) {
      S;
      next;
      S';
    }
}

barrier(
  ph = newEvent(-(n-1));
  j = 1;
  loop(j <= n,
       spawn(j,
             S;
             signal(ph);
             wait(ph);
             signal(ph);
             S');
       j = j+1);
  freeEvent(ph)
)

Figure 4.8: A phaser in Habanero-Java and its SPIRE core language representation
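Reusing the pthread-based event sketch given earlier, the next emulation for n tasks could be packaged as the following hypothetical C helper, assuming ph was created with newEvent(-(n-1)) before the tasks were spawned:

    /* 'next' emulation with the event API sketched above (our own names). */
    typedef struct event *event;        /* type and functions from the sketch */
    void wait_event(event e);
    void signal_event(event e);

    void next_phase(event ph) {
        signal_event(ph);   /* arrive: the last arriving task makes value > 0 */
        wait_event(ph);     /* block until all n tasks have arrived           */
        signal_event(ph);   /* release the next suspended task, chain-like    */
    }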

In order to illustrate the importance of the constructs implementing point-to-point synchronization, and thus the necessity to handle them, we show the code in Figure 4.9, extracted from [97], where phasers can be used to implement one of the main types of parallelism, namely pipeline parallelism (see Section 2.2).

finish {
  phaser[] ph = new phaser[m+1];
  for (int i = 1; i < m; i++)
    async phased(ph[i]<SIG>, ph[i-1]<WAIT>) {
      for (int j = 1; j < n; j++) {
        a[i][j] = foo(a[i][j], a[i][j-1], a[i-1][j-1]);
        next;
      }
    }
}

Figure 4.9: Example of pipeline parallelism with phasers


Representing high-level point-to-point synchronization such as phasers and futures using low-level functions of event type constitutes a trade-off between generality (we can represent any type of synchronization via the low-level interface using events) and expressibility of representation (we lose some of the expressiveness and precision of these high-level constructs, such as the deadlock-freedom safety property of phasers [100]). Indeed, a Habanero-Java parallel code avoids deadlock since the scope of each used phaser is the immediately enclosing finish; however, the scope of the events is less restrictive: the user can put them wherever he wants, and this may lead to deadlocks.

4.2.4 Data Distribution

The choice of a proper memory model to express parallel programs is an important issue when designing a generic intermediate representation. There are usually two main approaches to memory modeling: shared and message passing models. Since SPIRE is designed to extend existing IRs for sequential languages, it can be straightforwardly seen as using a shared memory model when parallel constructs are added. By convention, we say that spawn creates processes in the case of message passing memory models, and threads in the other case.

In order to take into account the explicit distribution required by the message passing memory model used in parallel languages such as MPI, SPIRE introduces the send and recv blocking functions for implementing communication between processes:

• void send(int dest, entity buf) transfers the value in Entity buf to the process numbered dest;

• void recv(int source, entity buf) receives in buf the value sent by Process source.

The MPI example in Figure 4.10 can be represented in SPIRE as a parallel loop with index rank of size iterations, whose body is the MPI code from MPI_Comm_size to MPI_Finalize. The communication of Variable sum from Process 1 to Process 0 can be handled with the SPIRE send/recv functions.

Note that non-blocking communications can be easily implemented in SPIRE using the above primitives within spawned statements (see Figures 4.11 and 4.12).

A non-blocking receive should indicate the completion of the communication. We may use, for instance, finish_recv (see Figure 4.12), which can be accessed later by the receiver to test the status of the communication.
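For reference, the MPI construct whose completion-flag idea finish_recv mirrors is the non-blocking receive MPI_Irecv, tested with MPI_Test; a minimal C sketch (source and tag values follow Figure 4.10):

    /* Non-blocking receive in plain MPI, for comparison with Figure 4.12. */
    #include <mpi.h>

    void nonblocking_recv(float *sum) {
        MPI_Request req;
        int finish_recv = 0;                     /* cf. finish_recv in Figure 4.12 */

        MPI_Irecv(sum, 1, MPI_FLOAT, 1, 1, MPI_COMM_WORLD, &req);
        while (!finish_recv) {
            /* ... other useful work could overlap the communication here ... */
            MPI_Test(&req, &finish_recv, MPI_STATUS_IGNORE);
        }
    }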

Also, broadcast collective communications, such as defined in MPI, can be represented via wrappers around send and recv operations. When the


MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if (rank == 0) {
  MPI_Recv(sum, sizeof(sum), MPI_FLOAT, 1, 1,
           MPI_COMM_WORLD, &stat);
} else if (rank == 1) {
  sum = 42;
  MPI_Send(sum, sizeof(sum), MPI_FLOAT, 0, 1,
           MPI_COMM_WORLD);
}
MPI_Finalize();

forloop(rank, 0, size, 1,
        if(rank==0,
           recv(one, sum),
           if(rank==1,
              sum = 42;
              send(zero, sum),
              nop)),
        parallel)

Figure 4.10: MPI example illustrating a communication and its SPIRE core language representation

spawn(new, send(zero, sum))

Figure 4.11: SPIRE core language representation of a non-blocking send

atomic(finish_recv = false);
spawn(new,
      recv(one, sum);
      atomic(finish_recv = true))

Figure 4.12: SPIRE core language representation of a non-blocking receive

master process and receiver processes want to perform a broadcast operation, then, if this process is the master, its broadcast operation is equivalent to a loop over receivers, with a call to send as body; otherwise (receiver), the broadcast is a recv function. Figure 4.13 illustrates the SPIRE representation of a broadcast collective communication of the value of sum.

Another interesting memory model for parallel programming has been introduced somewhat recently: the Partitioned Global Address Space [110].


if(rank==0,
   forloop(rank, 1, size, 1, send(rank, sum),
           parallel),
   recv(zero, sum))

Figure 4.13: SPIRE core language representation of a broadcast
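A C rendering of such a wrapper, in the spirit of Figure 4.13 and with the variable names of Figure 4.10, could look as follows; the standard collective MPI_Bcast achieves the same effect in a single call:

    /* Hand-rolled broadcast: the master loops over sends, others recv once. */
    #include <mpi.h>

    void broadcast_sum(float *sum, int rank, int size) {
        if (rank == 0) {
            for (int r = 1; r < size; r++)
                MPI_Send(sum, 1, MPI_FLOAT, r, 0, MPI_COMM_WORLD);
        } else {
            MPI_Recv(sum, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
        /* Equivalent collective: MPI_Bcast(sum, 1, MPI_FLOAT, 0, MPI_COMM_WORLD); */
    }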

The uses of the PGAS memory model in languages such as UPC [33], Habanero-Java, X10 and Chapel introduce various notions such as Place or Locale to label portions of a logically-shared memory that threads may access, in addition to complex APIs for distributing objects/arrays over these portions. Given the wide variety of current proposals, we leave the issue of integrating the PGAS model within the general methodology of SPIRE as future work.

4.3 SPIRE Operational Semantics

The purpose of the formal definition given in this section is to provide a solid basis for program analyses and transformations. It is a systematic way to specify our IR extension mechanism, something seldom present in IR definitions. It also illustrates how SPIRE leverages the syntactic and semantic levels of sequential constructs to parallel ones, preserving the traits of the sequential operational semantics.

Fundamentally, at the syntactic and semantic levels, SPIRE is a methodology for expressing representation transformers, mapping the definition of a sequential language IR to a parallel version. We define the operational semantics of SPIRE in a two-step fashion: we introduce (1) a minimal core parallel language that we use to model fundamental SPIRE concepts and for which we provide a small-step operational semantics, and (2) rewriting rules that translate the more complex constructs of SPIRE into this core language.

4.3.1 Sequential Core Language

Illustrating the transformations induced by SPIRE requires the definition of a sequential IR basis, as is done above via the PIPS IR. Since we focus here on the fundamentals, we use as core language a simpler, minimal sequential language, Stmt. Its syntax is given in Figure 4.14, where we assume that the sets Ide of identifiers I and Exp of expressions E are given.

Sequential statements are: (1) nop, for no operation; (2) I=E, for an assignment of E to I; (3) S1;S2, for a sequence; and (4) loop(E,S), for a while loop.


S ∈ Stmt =
    nop | I=E | S1;S2 | loop(E,S)

S ∈ SPIRE(Stmt) =
    nop | I=E | S1;S2 | loop(E,S) |
    spawn(I,S) |
    barrier(S) | barrier_wait(n) |
    wait(I) | signal(I) |
    send(I,I') | recv(I,I')

Figure 4.14: Stmt and SPIRE(Stmt) syntaxes

At the semantic level, a statement in Stmt is a very simple memory transformer. A memory m ∈ Memory is a mapping in Ide → Value, where values v ∈ Value = N + Bool can either be integers n ∈ N or booleans b ∈ Bool. The sequential operational semantics for Stmt, expressed as transition rules over configurations κ ∈ Configuration = Memory × Stmt, is given in Figure 4.15; we assume that the program is syntax- and type-correct. A transition (m, S) → (m′, S′) means that executing the statement S in a memory m yields a new memory m′ and a new statement S′; we posit that the "→" relation is transitive. Rules 4.1 to 4.5 encode typical sequential small-step operational semantic rules for the sequential part of the core language. We assume that ζ ∈ Exp → Memory → Value is the usual function for expression evaluation.

\[
\frac{v = \zeta(E)\,m}{(m,\ \mathtt{I{=}E}) \rightarrow (m[I \rightarrow v],\ \mathtt{nop})} \tag{4.1}
\]

\[
(m,\ \mathtt{nop};S) \rightarrow (m,\ S) \tag{4.2}
\]

\[
\frac{(m,\ S_1) \rightarrow (m',\ S_1')}{(m,\ S_1;S_2) \rightarrow (m',\ S_1';S_2)} \tag{4.3}
\]

\[
\frac{\zeta(E)\,m}{(m,\ \mathtt{loop(E,S)}) \rightarrow (m,\ S;\mathtt{loop(E,S)})} \tag{4.4}
\]

\[
\frac{\neg\,\zeta(E)\,m}{(m,\ \mathtt{loop(E,S)}) \rightarrow (m,\ \mathtt{nop})} \tag{4.5}
\]

Figure 4.15: Stmt sequential transition rules


The semantics of a whole sequential program S is defined as the memory m such that (⊥, S) → (m, nop), if the execution of S terminates.
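To make these rules concrete, the following C sketch (entirely ours, not part of PIPS or SPIRE) evaluates the sequential core language in a big-step manner, computing the same final memory as repeated application of the small-step rules (4.1)-(4.5); it is exercised on the code of Figure 4.3:

    #include <stdio.h>

    enum kind { NOP, ASSIGN, SEQ, LOOP };

    struct stmt {
        enum kind kind;
        int var;                          /* ASSIGN: index of I in memory      */
        int (*expr)(const int *m);        /* ASSIGN/LOOP: expression, i.e. ζ(E)m */
        const struct stmt *s1, *s2;       /* SEQ: S1;S2 -- LOOP: body in s1    */
    };

    static void run(int *m, const struct stmt *s) {
        switch (s->kind) {
        case NOP:                                   break;  /* rule (4.2)       */
        case ASSIGN: m[s->var] = s->expr(m);        break;  /* rule (4.1)       */
        case SEQ:    run(m, s->s1); run(m, s->s2);  break;  /* rule (4.3)       */
        case LOOP:                                          /* rules (4.4)-(4.5) */
            while (s->expr(m)) run(m, s->s1);
            break;
        }
    }

    /* Example: x=3; y=5; z=x+y; u=x+1 (the code of Figure 4.3). */
    enum { X, Y, Z, U, NVARS };
    static int three(const int *m) { (void)m; return 3; }
    static int five (const int *m) { (void)m; return 5; }
    static int sum  (const int *m) { return m[X] + m[Y]; }
    static int xinc (const int *m) { return m[X] + 1; }

    int main(void) {
        struct stmt sx = { ASSIGN, X, three, 0, 0 };
        struct stmt sy = { ASSIGN, Y, five,  0, 0 };
        struct stmt sz = { ASSIGN, Z, sum,   0, 0 };
        struct stmt su = { ASSIGN, U, xinc,  0, 0 };
        struct stmt q2 = { SEQ, 0, 0, &sz, &su };
        struct stmt q1 = { SEQ, 0, 0, &sy, &q2 };
        struct stmt q0 = { SEQ, 0, 0, &sx, &q1 };
        int mem[NVARS] = { 0 };
        run(mem, &q0);
        printf("z=%d u=%d\n", mem[Z], mem[U]);   /* prints z=8 u=4 */
        return 0;
    }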

4.3.2 SPIRE as a Language Transformer

Syntax

At the syntactic level, SPIRE specifies how a grammar for a sequential language such as Stmt is transformed, i.e., extended with synchronized parallel statements. The grammar of SPIRE(Stmt) in Figure 4.14 adds to the sequential statements of Stmt (from now on synchronized using the default none) new parallel statements: a task creation spawn, a termination barrier, and two wait and signal operations on events, or send and recv operations for communication. Synchronizations single and atomic are defined via rewriting (see Subsection 4.3.3). The statement barrier_wait(n), added here for specifying the multiple-step behavior of the barrier statement in the semantics, is not accessible to the programmer. Figure 4.8 provides the SPIRE representation of a program example.

Semantic Domains

SPIRE is an intermediate representation transformer. As it extends grammars, it also extends semantics. The set of values manipulated by SPIRE(Stmt) statements extends the sequential Value domain with events e ∈ Event = N, that encode events' current values; we posit that ζ(newEvent(E))m = ζ(E)m.

Parallelism is managed in SPIRE via processes (or threads). We introduce control state functions π ∈ State = Proc → Configuration × Procs to keep track of the whole computation, mapping each process i ∈ Proc = N to its current configuration (i.e., the statement it executes and its own view of memory m) and the set c ∈ Procs = P(Proc) of the children processes it has spawned during its execution.

In the following, we note domain(π) = {i ∈ Proc | π(i) is defined} the set of currently running processes, and π[i → (κ, c)] the state π extended at i with (κ, c). A process is said to be finished if and only if all its children processes in c are also finished, i.e., when only nop is left to execute: finished(π, c) = (∀i ∈ c, ∃ci ∈ Procs, ∃mi ∈ Memory, π(i) = ((mi, nop), ci) ∧ finished(π, ci)).

Memory Models

The memory model of sequential languages is a unique address space for identifier values. In our parallel extension, a configuration for a given process or thread includes its view of memory. We suggest to use the same semantic rules, detailed below, to deal with both shared and message passing memory models. The distinction between these models, beside the additional


use of send/receive constructs in the message passing model versus events in the shared one, is included in SPIRE via constraints we impose on the control states π used in computations. Namely, we introduce the variable pids ∈ Ide, such that m(pids) ∈ P(Ide), that is used to harvest the process identities, and we posit that, in the shared memory model, for all threads i and i′ with π(i) = ((m, S), c) and π(i′) = ((m′, S′), c′), one has ∀I ∈ (domain(m) ∩ domain(m′)) − (m(pids) ∪ m′(pids)), m(I) = m′(I). No such constraint is needed for the message passing model. Regarding the notion of memory equality, note that the issue of private variables in threads would have to be introduced in full-fledged languages. As mentioned above, PGAS is left for future work; some sort of constraints based on the characteristics of the address space partitioning for places/locales would have to be introduced.

Semantic Rules

At the semantic level, SPIRE is thus a transition system transformer, mapping rules such as the ones in Figure 4.15 to parallel synchronized transition rules in Figure 4.16. A transition π[i → ((m, S), c)] → π′[i → ((m′, S′), c′)] means that the i-th process, when executing S in a memory m, yields a new memory m′ and a new control state π′[i → ((m′, S′), c′)] in which this process will now execute S′; additional children processes may have been created in c′ compared to c. We posit that the "→" relation is transitive.

Rule 4.6 is a key rule to specify the SPIRE transformer behavior, providing a bridge between the sequential and the SPIRE-extended parallel semantics: all processes can non-deterministically proceed along their sequential semantics "→", leading to valid evaluation steps along the parallel semantics "→". The interleaving between parallel processes in SPIRE(Stmt) is a consequence of (1) the non-deterministic choice of the value of i within domain(π) when selecting the transition to perform, and (2) the number of steps executed by the sequential semantics. Note that one might want to add even more non-determinism in our semantics: indeed, Rule 4.1 is atomic; loading the variables in E and performing the store operation to I are performed in one sequential step. To simplify this presentation, we do not provide the simple intermediate steps in the sequential evaluation semantics of Rule 4.1 that would have removed this artificial atomicity.

The remaining rules focus on parallel evaluation. In Rule 4.7, if Process n does not already exist, spawn adds to the state a new process n that inherits the parent memory m in a fork-like manner; the set of processes spawned by n is initially equal to ∅. Otherwise, n keeps its memory mn and its set of spawned processes cn. In both cases, the value of pids in the memory of n is updated by the identifier I; n executes S and is added to the set of processes c spawned by i. Rule 4.8 implements a rendezvous: a


\[
\frac{\kappa \rightarrow \kappa'}{\pi[i \rightarrow (\kappa,\ c)] \rightarrow \pi[i \rightarrow (\kappa',\ c)]} \tag{4.6}
\]

\[
\frac{\begin{array}{c}
n = \zeta(I)m \qquad ((m_n,\ S_n),\ c_n) = (n \in domain(\pi))\ ?\ \pi(n) : ((m,\ S),\ \emptyset) \\
m'_n = m_n[pids \rightarrow m_n(pids) \cup \{I\}]
\end{array}}{
\pi[i \rightarrow ((m,\ \mathtt{spawn(I,S)}),\ c)] \rightarrow
\pi[i \rightarrow ((m,\ \mathtt{nop}),\ c \cup \{n\})][n \rightarrow ((m'_n,\ S),\ c_n)]} \tag{4.7}
\]

\[
\frac{n \notin domain(\pi) \cup \{i\}}{
\pi[i \rightarrow ((m,\ \mathtt{barrier(S)}),\ c)] \rightarrow
\pi[i \rightarrow ((m,\ \mathtt{barrier\_wait(n)}),\ c)][n \rightarrow ((m,\ S),\ \emptyset)]} \tag{4.8}
\]

\[
\frac{finished(\pi,\ \{n\}) \qquad \pi(n) = ((m',\ \mathtt{nop}),\ c')}{
\pi[i \rightarrow ((m,\ \mathtt{barrier\_wait(n)}),\ c)] \rightarrow \pi[i \rightarrow ((m',\ \mathtt{nop}),\ c)]} \tag{4.9}
\]

\[
\frac{n = \zeta(I)m \qquad n > 0}{
\pi[i \rightarrow ((m,\ \mathtt{wait(I)}),\ c)] \rightarrow \pi[i \rightarrow ((m[I \rightarrow n-1],\ \mathtt{nop}),\ c)]} \tag{4.10}
\]

\[
\frac{n = \zeta(I)m}{
\pi[i \rightarrow ((m,\ \mathtt{signal(I)}),\ c)] \rightarrow \pi[i \rightarrow ((m[I \rightarrow n+1],\ \mathtt{nop}),\ c)]} \tag{4.11}
\]

\[
\frac{p' = \zeta(\mathtt{P'})m \qquad p = \zeta(\mathtt{P})m'}{
\begin{array}{l}
\pi[p \rightarrow ((m,\ \mathtt{send(P',I)}),\ c)][p' \rightarrow ((m',\ \mathtt{recv(P,I')}),\ c')] \rightarrow \\
\pi[p \rightarrow ((m,\ \mathtt{nop}),\ c)][p' \rightarrow ((m'[\mathtt{I'} \rightarrow m(\mathtt{I})],\ \mathtt{nop}),\ c')]
\end{array}} \tag{4.12}
\]

Figure 4.16: SPIRE(Stmt) synchronized transition rules

new process n executes S, while process i is suspended as long as finished is not true; indeed, Rule 4.9 resumes execution of process i when all the child processes spawned by n have finished. In Rules 4.10 and 4.11, I is an event, that is a counting variable used to control access to a resource or to perform a point-to-point synchronization, initialized via newEvent to a value equal to the number of processes that will be granted access to it. Its current value n is decremented every time a wait(I) statement is executed and, when n > 0, the resource can be used or the barrier can be crossed. In Rule 4.11, the current value of I is incremented; this is a non-blocking operation. In Rule 4.12, p and p′ are two processes that communicate: p sends the datum I to p′, while the latter consumes it in I′.

The semantics of a whole parallel program S is defined as the set of memories m such that ⊥[0 → ((⊥, S), ∅)] → π[0 → ((m, nop), c)], if S terminates.

4.3.3 Rewriting Rules

The SPIRE concepts not dealt with in the previous section are defined via their rewriting into the core language. This is the case for both the treatment of the execution attribute and the remaining coarse-grain synchronization constructs.

Execution

A parallel sequence of statements S1 and S2 is a pair of independent substatements executed simultaneously by spawned processes I1 and I2, respectively, i.e., it is equivalent to:

spawn(I1,S1); spawn(I2,S2)

A parallel forloop (see Figure 4.2) with index I, lower expression low, upper expression up, step expression step and body S is equivalent to:

I=low; loop(I<=up, spawn(I,S); I=I+step)
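As an illustration of how such a rewritten parallel forloop might be lowered on a shared-memory target, here is a C sketch using POSIX threads, with one spawned task per iteration; the loop body is the one of Figure 4.2, and all names are ours, not generated by PIPS:

    #include <pthread.h>

    #define N 8
    static int t[N + 1];

    /* S: the loop body, parameterized by the iteration index I. */
    static void *body(void *arg) {
        long i = (long)arg;
        t[i] = 0;                       /* cf. the Chapel forall of Figure 4.2 */
        return NULL;
    }

    int main(void) {
        pthread_t tasks[N + 1];
        long low = 1, up = N, step = 1;

        /* I=low; loop(I<=up, spawn(I,S); I=I+step) */
        for (long i = low; i <= up; i += step)
            pthread_create(&tasks[i], NULL, body, (void *)i);

        /* An explicit barrier (Section 4.2.3) would be expressed separately;
           here we simply join every spawned task before exiting. */
        for (long i = low; i <= up; i += step)
            pthread_join(tasks[i], NULL);
        return 0;
    }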

A parallel unstructured is rewritten as follows. All control nodes present in the transitive closure of the successor relation are rewritten in the same manner. Each control node C is characterized by a statement S, a predecessor list ps and a successor list ss. For each edge (c,C), where c is a predecessor of C in ps, an event e_{c,C} initialized at newEvent(0) is created, and similarly for ss. The whole unstructured construct is replaced by a sequential sequence of spawn(I,S_C), one for each C of the transitive closure of the successor relation starting at the entry control node, where S_C is defined as follows:

barrier(spawn(one, wait(e_{ps[1],C})),
        ...,
        spawn(m, wait(e_{ps[m],C})));
S;
signal(e_{C,ss[1]}); ...; signal(e_{C,ss[m']})

where m and m' are the lengths of the ps and ss lists; L[j] is the j-th element of L.

The code corresponding to the statement z=x+y, taken from the parallel unstructured graph shown in Figure 4.3, is rewritten as:

spawn(three,
      barrier(spawn(one, wait(e_{1,3})),
              spawn(two, wait(e_{2,3})));
      z = x + y;                       // statement 3
      signal(e_{3,exit})
)

Synchronization

A statement S with synchronization atomic(I) is rewritten as

wait(I); S; signal(I)

assuming that the assignment I = newEvent(1) is performed on the event identifier I at the very beginning of the main function. A wait on an event variable sets it to zero if it is currently equal to one, to prohibit other threads from entering the atomic section; the signal resets the event variable to one to permit further access.
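A C sketch of this rewriting, with a POSIX counting semaphore standing in for the event I (initialized once to 1) and the statement of Figure 4.6 as S; this is our own illustration of the wait(I); S; signal(I) pattern:

    #include <semaphore.h>

    static sem_t I;                      /* event I; initialized elsewhere with
                                            sem_init(&I, 0, 1);                 */
    static double x[1024];
    double f(int i);

    void atomic_statement(int *index, int i) {
        sem_wait(&I);                    /* wait(I): 1 -> 0, or block           */
        x[index[i]] += f(i);             /* S, cf. Figure 4.6                   */
        sem_post(&I);                    /* signal(I): 0 -> 1                   */
    }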

A statement S with a blocking synchronization single, i.e., equal to true, is equivalent, when it occurs within an enclosing innermost parallel forloop, to:

barrier(wait(I_S);
        if(first_S,
           S; first_S = false,
           nop);
        signal(I_S))

where first_S is a boolean variable that ensures that only one process among those spawned by the parallel loop will execute S; access to this variable is protected by the event I_S. Both first_S and I_S are respectively initialized, before loop entry, to true and newEvent(1).

The conditional if(E,S,S') can be rewritten using the core loop construct, as illustrated in Figure 4.17. The same rewriting can be used when the single synchronization is equal to false, corresponding to a non-blocking synchronization construct, except that no barrier is needed.

if(E,S,S')

stop = false;
loop(E ∧ ¬stop, S; stop = true);
loop(¬E ∧ ¬stop, S'; stop = true)

Figure 4.17: if statement rewriting using while loops

4.4 Validation: SPIRE Application to LLVM IR

Assessing the quality of a methodology that impacts the definition of a data structure as central for compilation frameworks as an intermediate representation is a difficult task. This section illustrates how it can be used on a different IR, namely LLVM, with minimal changes, thus providing support regarding the generality of our methodology. We show how simple it is to upgrade the LLVM IR to a "parallel LLVM IR" via the SPIRE transformation methodology.

LLVM [75] (Low-Level Virtual Machine) is an open source compilation framework that uses an intermediate representation in Static Single Assignment (SSA) [37] form. We chose the IR of LLVM to illustrate our approach a second time because LLVM is widely used in both academia and industry; another interesting feature of the LLVM IR, compared to PIPS's, is that it sports a graph approach, while PIPS is mostly abstract syntax tree-based; each function is structured in LLVM as a control flow graph (CFG).

Figure 4.18 provides the definition of a significant subset of the sequential LLVM IR described in [75] (to keep notations simple in this thesis, we use the same Newgen language to write this specification):

• a function is a list of basic blocks, which are portions of code with one entry point and one exit point;

• a basic block has an entry label, a list of φ nodes and a list of instructions, and ends with a terminator instruction;

• φ nodes, which are the key elements of SSA, are used to merge the values coming from multiple basic blocks. A φ node is an assignment (represented here as a call expression) that takes as arguments an identifier and a list of pairs (value, label); it assigns to the identifier the value corresponding to the label of the block preceding the current one at run time;

• every basic block ends with a terminator, which is a control-flow-altering instruction that specifies which block to execute after termination of the current one.

An example of the LLVM encoding of a simple for loop in three basic blocks is provided in Figure 4.19, where, in the first assignment, which uses a φ node, sum.0 is 42 if it is the first time we enter the bb1 block (from entry), or the updated sum otherwise (from the branch bb).

Applying SPIRE to the LLVM IR is, as illustrated above with PIPS, achieved in three steps, yielding the SPIREd parallel extension of the LLVM sequential IR provided in Figure 4.20:

• an execution attribute is added to function and block: a parallel basic block sees all its instructions launched in parallel, while all the blocks of a parallel function are seen as parallel tasks to be executed concurrently;


function = blocks:block

block = label:entity x phi_nodes:phi_node x
        instructions:instruction x
        terminator

phi_node = call

instruction = call

terminator = conditional_branch +
             unconditional_branch +
             return

conditional_branch = value:entity x label_true:entity x
                     label_false:entity

unconditional_branch = label:entity

return = value:entity

Figure 4.18: Simplified Newgen definitions of the LLVM IR

sum = 42;
for(i=0; i<10; i++)
  sum = sum + 2;

entry:
  br label %bb1

bb:                               ; preds = %bb1
  %0 = add nsw i32 %sum.0, 2
  %1 = add nsw i32 %i.0, 1
  br label %bb1

bb1:                              ; preds = %bb, %entry
  %sum.0 = phi i32 [ 42, %entry ], [ %0, %bb ]
  %i.0 = phi i32 [ 0, %entry ], [ %1, %bb ]
  %2 = icmp sle i32 %i.0, 10
  br i1 %2, label %bb, label %bb2

Figure 4.19: A loop in C and its LLVM intermediate representation

• a synchronization attribute is added to instruction; therefore, an instruction can be annotated with spawn, barrier, single or atomic synchronization attributes. When one wants to consider a sequence of instructions as a whole, this sequence is first outlined in a "function", to be called instead; this new call instruction is then annotated with the proper synchronization attribute, such as spawn if the sequence must be considered as an asynchronous task (we provide an example in Figure 4.21). A similar technique is used for the other synchronization constructs barrier, single and atomic;

• as LLVM provides a set of intrinsic functions [75], the SPIRE functions newEvent, freeEvent, signal and wait, for handling point-to-point synchronization, and send and recv, for handling data distribution, are added to this set.

Note that the use of SPIRE on the LLVM IR is not able to express


function' = function x execution

block' = block x execution

instruction' = instruction x synchronization

Intrinsic Functions:
  send, recv, signal, wait, newEvent, freeEvent

Figure 4.20: SPIRE(LLVM IR)

main() {
  int A, B, C;
  A = foo();
  B = bar(A);
  C = baz();
}

call_AB(int *A, int *B) {
  *A = foo();
  *B = bar(*A);
}

main() {
  int A, B, C;
  spawn(zero, call_AB(&A, &B));
  C = baz();
}

Figure 4.21: An example of a spawned outlined sequence

parallel loops as easily as was the case on the PIPS IR. Indeed, the notion of a loop does not always exist in the definition of intermediate representations based on control flow graphs, including LLVM; it is an attribute of some of its nodes, which has to be added later on by a loop-detection program analysis phase. Of course, such an analysis could also be applied on the SPIRE-derived IR to recover this information. Once one knows that a particular loop is parallel, this can be encoded within SPIRE using the same technique as presented in Section 4.3.3.

More generally, even though SPIRE uses traditional parallel paradigms for code generation purposes, SPIRE-derived IRs are able to deal with more specific parallel constructs such as DOACROSS loops [56] or HELIX-like [29] approaches. HELIX is a technique to parallelize a loop in the presence of loop-carried dependences, by executing sequential segments while exploiting TLP between them and using prefetching of signal synchronization. Basically, a compiler would parse a given sequential program into sequential IR elements. Optimization compilation phases specific to particular parallel code generation paradigms, such as those above, will translate, whenever possible (specific data and control-flow analyses will be needed here), these sequential IR constructs into parallel loops with the corresponding synchronization primitives, as need be. Code generation will then recognize such IR patterns and generate specific parallel instructions such as DOACROSS.


4.5 Related Work: Parallel Intermediate Representations

In this section, we review several different possible representations of parallel programs, both at the high, syntactic, and mid, graph-based, levels. We provide synthetic descriptions of the key existing IRs addressing issues similar to our thesis's. Bird's eye view comparisons with SPIRE are also given here.

Syntactic approaches to parallelism expression use abstract syntax tree nodes, while adding specific built-in functions for parallelism. For instance, the intermediate representation of the implementation of OpenMP in GCC (GOMP) [84] extends its three-address representation, GIMPLE [77]. The OpenMP parallel directives are replaced by specific built-ins in low- and high-level GIMPLE, and additional nodes in high-level GIMPLE, such as the sync_fetch_and_add built-in function for an atomic memory access addition. Similarly, Sarkar and Zhao introduce the high-level parallel intermediate representation HPIR [111], which decomposes Habanero-Java programs into region syntax trees, while maintaining additional data structures on the side: region control-flow graphs and region dictionaries, which contain information about the use and the definition of variables in region nodes. New abstract syntax tree nodes are introduced: AsyncRegionEntry and AsyncRegionExit are used to delimit tasks, while FinishRegionEntry and FinishRegionExit can be used in parallel sections. SPIRE borrows some of the ideas used in GOMP or HPIR, but frames them in more structured settings while trying to be more language-neutral. In particular, we try to minimize the number of additional built-in functions, which have the drawback of hiding the abstract high-level structure of parallelism and affecting compiler optimization passes. Moreover, we focus on extending existing AST nodes rather than adding new ones (such as in HPIR), in order not to fatten the IR and to avoid redundant analyses and transformations on the same basic constructs. Applying the SPIRE approach to systems such as GCC would have provided a minimal set of extensions that could have also been used for other implementations of parallel languages that rely on GCC as a backend, such as Cilk.

PLASMA is a programming framework for heterogeneous SIMD systems, with an IR [89] that abstracts data parallelism and vector instructions. It provides specific operators, such as add on vectors, and special instructions, such as reduce and par. While PLASMA abstracts SIMD implementation and compilation concepts for SIMD accelerators, SPIRE is more architecture-independent and also covers control parallelism.

InsPIRe is the parallel intermediate representation at the core of the source-to-source Insieme compiler [57] for C, C++, OpenMP, MPI and OpenCL programs. Parallel constructs are encoded using built-ins. SPIRE


intends to also cover source-to-source optimization. We believe it could have been applied to Insieme sequential components, parallel constructs being defined as extensions of the sequential abstract syntax tree nodes of InsPIRe, instead of using numerous built-ins.

Turning now to mid-level intermediate representations, many systems rely on graph structures for representing sequential code and extend them for parallelism. The Hierarchical Task Graph [48] represents the program control flow. The hierarchy exposes the loop nesting structure; at each loop nesting level, the loop body is hierarchically represented as a single node that embeds a subgraph that has control and data dependence information associated with it. SPIRE is able to represent both structured and unstructured control-flow dependence, thus enabling recursively-defined optimization techniques to be applied easily. The hierarchical nature of underlying sequential IRs can be leveraged via SPIRE to their parallel extensions; this feature is used in the PIPS case addressed below.

A stream graph [31] is a dataflow representation introduced specifically for streaming languages. Nodes represent data reorganization and processing operations between streams, and edges communications between nodes. The number of data samples defined and used by each node is supposed to be known statically. Each time a node is fired, it consumes a fixed number of elements of its inputs and produces a fixed number of elements on its outputs. SPIRE provides support for both data and control dependence information; streaming can be handled in SPIRE using its point-to-point synchronization primitives.

The parallel program graph (PPDG) [98] extends the program dependence graph [44], where vertices represent blocks of statements and edges essential control or data dependences; mgoto control edges are added to represent task creation occurrences, and synchronization edges to impose ordering on tasks. Kimble IR [24] uses an intermediate representation designed along the same lines, i.e., as hierarchical directed acyclic graphs (DAG) on top of the GCC IR GIMPLE. Parallelism is expressed there using new types of nodes: region, which is a subgraph of clusters, and cluster, a sequence of statements to be executed by one thread. Like PPDG and Kimble IR, SPIRE adopts an extension approach to "parallelize" existing sequential intermediate representations; this chapter showed that this can be defined as a general mechanism for parallel IR definitions, and provides a formal specification of this concept.

LLVM IR [75] represents each function as a control flow graph. To encode parallel constructs, LLVM introduces the notion of metadata, such as llvm.loop.parallel for implementing parallel loops. A metadata is a string used as an annotation on the LLVM IR nodes. LLVM IR lacks support for other parallel constructs such as starting parallel threads, synchronizing them, etc. We presented in Section 4.4 our proposal for a parallel IR for LLVM, via the application of SPIRE to LLVM.


4.6 Conclusion

SPIRE is a new and general 3-step extension methodology for mapping any intermediate representation (IR) used in compilation platforms for representing sequential programming constructs to a parallel IR; one can leverage it for the source-to-source and high- to mid-level optimization of control-parallel languages and constructs.

The extension of an existing IR introduces (1) a parallel execution attribute for each group of statements, (2) a high-level synchronization attribute on each statement node and an API for low-level synchronization events, and (3) two built-ins for implementing communications in message passing memory systems. The formal semantics of SPIRE transformational definitions is specified using a two-tiered approach: a small-step operational semantics for its base parallel concepts and a rewriting mechanism for high-level constructs.

The SPIRE methodology is presented via a use case, the intermediate representation of PIPS, a powerful source-to-source compilation infrastructure for Fortran and C. We illustrate the generality of our approach by showing how SPIRE can be used to represent the constructs of the current parallel languages Cilk, Chapel, X10, Habanero-Java, OpenMP, OpenCL and MPI. To validate SPIRE, we show the application of SPIRE on another IR, namely the one of the widely-used LLVM compilation infrastructure.

The extension of the sequential IR of PIPS using SPIRE has been implemented in the PIPS middle-end and used for parallel code transformations and generation (see Chapter 7). We added the execution set (parallel and sequential attributes), the synchronization set (none, spawn, barrier, atomic and single attributes), events and data distribution calls. Chapter 8 provides performance data gathered using our implementation of SPIRE on the PIPS IR, to illustrate the ability of the resulting parallel IR to efficiently express the parallelism present in scientific applications. The next chapter addresses the crucial step in any parallelization process, which is scheduling.

An earlier version of the work presented in this chapter was presented at CPC [67] and at the HiPEAC Computing Systems Week.

Chapter 5

BDSC: A Memory-Constrained, Number of Processor-Bounded Extension of DSC

Life is too unpredictable to live by a schedule. (Anonymous)

In the previous chapter, we have introduced an extension methodology for sequential intermediate representations to handle parallelism. In this chapter, as we are interested in this thesis in task parallelism, we focus on the next step: finding or extracting this parallelism from a sequential code, to be eventually expressed using a SPIRE-derived parallel intermediate representation. To reach that goal, we introduce BDSC, which extends Yang and Gerasoulis's Dominant Sequence Clustering (DSC) algorithm. BDSC is a new, efficient, automatic scheduling algorithm for parallel programs in the presence of resource constraints on the number of processors and their local memory size; it is based on the static estimation of both execution time and communication cost. Thanks to the handling of these two resource parameters in a single framework, BDSC can address both shared and distributed parallel memory architectures. As a whole, BDSC provides a trade-off between concurrency gain and communication overhead between processors, under two resource constraints: finite number of processors and amount of memory.

5.1 Introduction

One key problem when attempting to parallelize sequential programs is to find solutions to graph partitioning and scheduling problems, where vertices represent computational tasks and edges data exchanges. Each vertex is labeled with an estimation of the time taken by the computation it performs; similarly, each edge is assigned a cost that reflects the amount of data that needs to be exchanged between its adjacent vertices. Efficiency is strongly dependent here on the accuracy of the cost information encoded in the graph to be scheduled. Gathering such information is a difficult process


in general, in particular in our case, where tasks are automatically generated from program code.

We thus introduce a new non-preemptive static scheduling heuristic that strives to give as small as possible schedule lengths, i.e., parallel execution time, in order to exploit the task-level parallelism possibly present in sequential programs, targeting homogeneous multiprocessors or multicores, while enforcing architecture-dependent constraints (the number of processors, the computational cost and memory use of each task, and the communication cost for each task exchange), to provide hopefully significant speedups on shared and distributed computer architectures. Our technique, called BDSC, is based on an existing best-of-breed static scheduling heuristic, namely Yang and Gerasoulis's DSC (Dominant Sequence Clustering) list-scheduling algorithm [108] [47], that we equip with new heuristics that handle resource constraints. One key advantage of DSC over other scheduling policies (see Section 5.4), besides its already good performance when the number of processors is unlimited, is that it has been proven optimal for fork/join graphs; this is a serious asset given our focus on the program parallelization process, since task graphs representing parallel programs often use this particular graph pattern. Even though this property may be lost when constraints are taken into account, our experiments on scientific benchmarks suggest that our extension still provides good performance (see Chapter 8).

This chapter is organized as follows. Section 5.2 presents the original DSC algorithm that we intend to extend. We detail our heuristic extension, BDSC, in Section 5.3. Section 5.4 compares the main existing scheduling algorithms with BDSC. Finally, Section 5.5 concludes the chapter.

5.2 The DSC List-Scheduling Algorithm

In this section, we present the list-scheduling heuristic called DSC [108], that we extend and enhance to be useful for automatic task parallelization under resource constraints.

5.2.1 The DSC Algorithm

DSC (Dominant Sequence Clustering) is a task list-scheduling heuristic for an unbounded number of processors. The objective is to minimize the top level of each task (see Section 2.5.2 for a review of the main notions of list scheduling). A DS (Dominant Sequence) is a path that has the longest length in a partially scheduled DAG; a graph critical path is thus a DS for the totally scheduled DAG. The DSC heuristic computes a Dominant Sequence (DS) after each vertex is processed, using tlevel(τ,G) + blevel(τ,G) as priority(τ). Since priorities are updated after each iteration, DSC computes the critical path dynamically, based on both top level and bottom level


information. A ready vertex τ, i.e., a vertex all of whose predecessors have already been scheduled, on one of the current DSs, i.e., with the highest priority, is clustered with a predecessor τp when this reduces the top level of τ, by zeroing, i.e., setting to zero, the cost of communications on the incident edge (τp, τ). As said in Section 2.5.2, task_time(τ) and edge_cost(e) are assumed to be numerical constants, although we show how we lift this restriction in Section 6.3.
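As a worked example, the following C sketch (ours, not the PIPS implementation) computes top levels, bottom levels and priorities for the DAG of Figure 5.1 as we read it (task times τ1=1, τ2=3, τ3=2, τ4=2; edges τ4→τ2 with cost 2, τ4→τ3 with cost 1, τ1→τ2 with cost 1), reproducing the DS column of the scheduling table:

    #include <stdio.h>

    #define N 4
    enum { T1, T2, T3, T4 };
    static const int task_time[N] = { 1, 3, 2, 2 };
    static const int has_edge[N][N]  = { [T1][T2] = 1, [T4][T2] = 1, [T4][T3] = 1 };
    static const int edge_cost[N][N] = { [T1][T2] = 1, [T4][T2] = 2, [T4][T3] = 1 };

    /* tlevel(τ): longest path from an entry vertex to τ, excluding τ itself. */
    static int tlevel(int t) {
        int best = 0;
        for (int p = 0; p < N; p++)
            if (has_edge[p][t]) {
                int v = tlevel(p) + task_time[p] + edge_cost[p][t];
                if (v > best) best = v;
            }
        return best;
    }

    /* blevel(τ): longest path from τ to an exit vertex, including τ itself. */
    static int blevel(int t) {
        int best = 0;
        for (int s = 0; s < N; s++)
            if (has_edge[t][s]) {
                int v = edge_cost[t][s] + blevel(s);
                if (v > best) best = v;
            }
        return task_time[t] + best;
    }

    int main(void) {
        const char *name[N] = { "t1", "t2", "t3", "t4" };
        for (int t = 0; t < N; t++)        /* priorities: 5, 7, 5, 7 (DS column) */
            printf("%s: tlevel=%d blevel=%d priority=%d\n",
                   name[t], tlevel(t), blevel(t), tlevel(t) + blevel(t));
        return 0;
    }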

The DSC instance of the general list-scheduling Algorithm 3, presented in Section 2.5.2, is provided in Algorithm 4, where select_cluster is replaced by the code in Algorithm 5 (new_cluster extends clusters with a new empty cluster; its cluster_time is set to 0). The table in Figure 5.1 represents the result of scheduling the DAG in the same figure using the DSC algorithm.

ALGORITHM 4: DSC scheduling of Graph G on P processors

procedure DSC(G, P)
  clusters = ∅
  foreach τi ∈ vertices(G)
    priority(τi) = tlevel(τi, G) + blevel(τi, G)
  UT = vertices(G)                    // unscheduled tasks
  while UT ≠ ∅
    τ = select_task_with_highest_priority(UT)
    κ = select_cluster(τ, G, P, clusters)
    allocate_task_to_cluster(τ, κ, G)
    update_graph(G)
    update_priority_values(G)
    UT = UT - {τ}
end

ALGORITHM 5: DSC cluster selection for Task τ for Graph G on P processors

function select_cluster(τ, G, P, clusters)
  (min_τ, min_tlevel) = tlevel_decrease(τ, G)
  return (cluster(min_τ) ≠ cluster_undefined) ?
         cluster(min_τ) : new_cluster(clusters)
end

To decide which predecessor τp to select to be clustered with τ, DSC applies the minimization procedure tlevel_decrease, presented in Algorithm 6, which returns the predecessor that leads to the highest reduction of top level for τ if clustered together, and the resulting top level; if no zeroing is accepted, the vertex τ is kept in a new single-vertex cluster. More precisely, the minimization procedure tlevel_decrease for a task τ in Algorithm 6 tries to find the cluster cluster(min_τ) of one of its predecessors


[DAG (left): task times τ1 = 1, τ2 = 3, τ3 = 2, τ4 = 2; edges τ4→τ2 (cost 2), τ4→τ3 (cost 1), τ1→τ2 (cost 1).]

| step | task | tlevel | blevel | DS | scheduled tlevel: κ0 | κ1 | κ2 |
|------|------|--------|--------|----|----------------------|----|----|
| 1    | τ4   | 0      | 7      | 7  | 0*                   |    |    |
| 2    | τ3   | 3      | 2      | 5  | 2                    | 3* |    |
| 3    | τ1   | 0      | 5      | 5  |                      |    | 0* |
| 4    | τ2   | 4      | 3      | 7  | 2*                   |    | 4  |

Figure 5.1: A Directed Acyclic Graph (left) and its scheduling (right); starred top levels (*) correspond to the selected clusters

τp that reduces the top level of τ as much as possible, by zeroing the cost of the edge (min_τ, τ). All clusters start at the same time, and each cluster is characterized by its running time, cluster_time(κ), which is the cumulated time of all tasks scheduled into κ; idle slots within clusters may exist and are also taken into account in this accumulation process. The condition cluster(τp) ≠ cluster_undefined is tested on predecessors of τ, in order to make it possible to apply this procedure to ready and unready vertices; an unready vertex has at least one unscheduled predecessor.

ALGORITHM 6: Minimization DSC procedure for Task τ in Graph G

function tlevel_decrease(τ, G)
  min_tlevel = tlevel(τ, G)
  min_τ = τ
  foreach τp ∈ predecessors(τ, G)
          where cluster(τp) ≠ cluster_undefined
    start_time = cluster_time(cluster(τp))
    foreach τ′p ∈ predecessors(τ, G) where
            cluster(τ′p) ≠ cluster_undefined
      if(τp ≠ τ′p) then
        level = tlevel(τ′p, G) + task_time(τ′p) + edge_cost(τ′p, τ)
        start_time = max(level, start_time)
    if(min_tlevel > start_time) then
      min_tlevel = start_time
      min_τ = τp
  return (min_τ, min_tlevel)
end


5.2.2 Dominant Sequence Length Reduction Warranty

Dominant Sequence Length Reduction Warranty (DSRW) is a greedy heuristic within DSC that aims to further reduce the scheduling length. A vertex on the DS path with the highest priority can be ready or not ready. With the DSRW heuristic, DSC schedules the ready vertices first, but, if such a ready vertex τr is not on the DS path, DSRW verifies, using the procedure in Algorithm 7, that the corresponding zeroing does not affect later the reduction of the top levels of the DS vertices τu that are partially ready, i.e., such that there exists at least one unscheduled predecessor of τu. To do this, we check if the "partial top level" of τu, which does not take into account unexamined (unscheduled) predecessors and is computed using tlevel_decrease, is reducible once τr is scheduled.

ALGORITHM 7: DSRW optimization for Task τu when scheduling Task τr for Graph G

function DSRW(τr, τu, clusters, G)
  (min_τ, min_tlevel) = tlevel_decrease(τr, G)

  // before scheduling τr
  (τb, ptlevel_before) = tlevel_decrease(τu, G)

  // scheduling τr
  allocate_task_to_cluster(τr, cluster(min_τ), G)
  saved_edge_cost = edge_cost(min_τ, τr)
  edge_cost(min_τ, τr) = 0

  // after scheduling τr
  (τa, ptlevel_after) = tlevel_decrease(τu, G)
  if (ptlevel_after > ptlevel_before) then
    // (min_τ, τr) zeroing not accepted
    edge_cost(min_τ, τr) = saved_edge_cost
    return false
  return true
end

The table in Figure 5.1 illustrates an example where it is useful to apply the DSRW optimization. There, the DS column provides, for the task scheduled at each step, its priority, i.e., the length of its dominant sequence, while the last column represents, for each possible zeroing, the corresponding task top level; starred top levels (*) correspond to the selected clusters. Task τ4 is mapped to Cluster κ0 in the first step of DSC. Then τ3 is selected because it is the ready task with the highest priority. The mapping of τ3 to Cluster κ0 would reduce its top level from 3 to 2. But the zeroing


Without DSRW:            With DSRW:
κ0 = {τ4, τ3}            κ0 = {τ4, τ2}
κ1 = {τ1, τ2}            κ1 = {τ3}
                         κ2 = {τ1}

Figure 5.2: Result of DSC on the graph in Figure 5.1, without (left) and with (right) DSRW

of (τ4 τ3) affects the top level of τ2 τ2 being the unready task with thehighest priority Since the partial top level of τ2 is 2 with the zeroing of(τ4τ2) but 4 after the zeroing of (τ4τ3) DSRW will fail and DSC allocatesτ3 to a new cluster κ1 Then τ1 is allocated to a new cluster κ2 sinceit has no predecessors Thus the zeroing of (τ4τ2) is kept thanks to theDSRW optimization the total scheduling length is 5 (with DSRW) insteadof 7 (without DSRW) (Figure 52)

53 BDSC A Resource-Constrained Extension ofDSC

This section details the key ideas at the core of our new scheduling processBDSC which extends DSC with a number of important features namely(1) by verifying predefined memory constraints (2) by targeting a boundednumber of processors and (3) by trying to make this number as small aspossible

5.3.1 DSC Weaknesses

A good scheduling solution is a solution that is built carefully, by having knowledge about previously scheduled tasks and tasks to execute in the future. Yet, as stated in [72], "an algorithm that only considers bottom level or only top level cannot guarantee optimal solutions". Even though tests carried out on a variety of scheduling algorithms show that the best competitor among them is the DSC algorithm [101], and DSC is a policy that uses the critical path for computing dynamic priorities based on both the bottom level and the top level of each vertex, it has some limits in practice.

The key weakness of DSC for our purpose is that the number of processors cannot be predefined; DSC yields blind clusterings, disregarding resource issues. Therefore, in practice, a thresholding mechanism to limit the number of generated clusters should be introduced. When allocating new clusters, one should verify that the number of clusters does not exceed a predefined threshold P (Section 5.3.3). Also, zeroings should handle memory constraints, by verifying that the resulting clustering does not lead to cluster data sizes that exceed a predefined cluster memory threshold M (Section 5.3.3).


Finally, DSC may generate a lot of idle slots in the created clusters. It adds a new cluster when no zeroing is accepted, without verifying the possible existence of gaps in existing clusters. We handle this case in Section 5.3.4, adding an efficient idle cluster slot allocation routine in the task-to-cluster mapping process.

5.3.2 Resource Modeling

Since our extension deals with computer resources, we assume that each vertex in a task DAG is labeled with an additional information, task_data(τ), which is an over-approximation of the memory space used by Task τ; its size is assumed to be always strictly less than M. A similar cluster_data function applies to clusters, where it represents the collective data space used by the tasks scheduled within it. Since BDSC, as DSC, needs execution times and communication costs to be numerical constants, we discuss in Section 6.3 how this information is computed.

Our improvement to the DSC heuristic intends to reach a tradeoff between the gained parallelism and the communication overhead between processors, under two resource constraints: finite number of processors and amount of memory. We track these resources in our implementation of allocate_task_to_cluster, given in Algorithm 8; note that the aggregation function regions_union is defined in Section 2.6.3.

ALGORITHM 8: Task allocation of Task τ in Graph G to Cluster κ, with resource management

procedure allocate_task_to_cluster(τ, κ, G)
  cluster(τ) = κ
  cluster_time(κ) = max(cluster_time(κ), tlevel(τ, G)) + task_time(τ)
  cluster_data(κ) = regions_union(cluster_data(κ), task_data(τ))
end

Efficiently allocating tasks on the target architecture requires reducing the communication overhead and transfer cost, for both shared and distributed memory architectures. If zeroing operations, which reduce the start time of each task and nullify the corresponding edge cost, are obviously meaningful for distributed memory systems, they are also worthwhile on shared memory architectures. Merging two tasks in the same cluster keeps the data in the local memory, and even possibly the cache, of each thread, and avoids their copying over the shared memory bus; therefore, transmission costs are decreased and bus contention is reduced.


5.3.3 Resource Constraint Warranty

Resource usage affects the performance of running programs. Thus, parallelization algorithms should try to limit the size of the memory used by tasks. BDSC introduces a new heuristic to control the amount of memory used by a cluster, via the user-defined memory upper bound parameter M. The limitation of the memory size of tasks is important when (1) executing large applications that operate on large amounts of data, (2) M represents the processor local1 (or cache) memory size, since, if the memory limitation is not respected, transfers between the global and local memories may occur during execution and may result in performance degradation, and (3) targeting embedded system architectures. For each task τ, BDSC uses an over-approximation of the amount of data that τ allocates to perform read and write operations; it is used to check that the memory constraint of Cluster κ is satisfied whenever τ is included in κ. Algorithm 9 implements this memory constraint warranty (MCW); regions_union and data_size are functions that respectively merge data and yield the size (in bytes) of data (see Section 6.3 in the next chapter).

ALGORITHM 9: Resource constraint warranties, on memory size M and processor number P

function MCW(τ, κ, M)
  merged = regions_union(cluster_data(κ), task_data(τ))
  return data_size(merged) ≤ M
end

function PCW(clusters, P)
  return |clusters| < P
end

The previous line of reasoning is well adapted to a distributed memory architecture. When dealing with a multicore equipped with a purely shared memory, such a per-cluster memory constraint is less meaningful. We can nonetheless keep the MCW constraint check within the BDSC algorithm, even in this case, if we set M to the size of the global shared memory. A positive by-product of this design choice is that BDSC is able, in the shared memory case, to reject computations that need more memory space than available, even within a single cluster.

Another scarce resource is the number of processors. In the original policy of DSC, when no zeroing for τ is accepted, i.e., one that would decrease its start time, τ is allocated to a new cluster. In order to limit the number of created clusters, we propose to introduce a user-defined cluster threshold P. This processor constraint warranty, PCW, is defined in Algorithm 9.

1. We handle one level of the memory hierarchy of the processors.

5.3.4 Efficient Task-to-Cluster Mapping

In the original policy of DSC, when no zeroings are accepted, because none would decrease the start time of Vertex τ or DSRW failed, τ is allocated to a new cluster. This cluster creation is not necessary when idle slots are present at the end of other clusters; thus, we suggest to select instead one of these idle slots, if this cluster can decrease the start time of τ without affecting the scheduling of the successors of the vertices already scheduled in this cluster. To ensure this, these successors must have already been scheduled, or they must be a subset of the successors of τ. Therefore, in order to use clusters efficiently and not introduce additional clusters without need, we propose to schedule τ to the cluster that verifies this optimizing constraint, if no zeroing is accepted.

This extension of DSC that we introduce in BDSC thus amounts to replacing each assignment of the cluster of τ to a new cluster by a call to end_idle_clusters. The end_idle_clusters function, given in Algorithm 10, returns, among the idle clusters, the ones that finished the most recently before τ's top level, or the empty set if none is found. This assumes, of course, that τ's dependencies are compatible with this choice.

ALGORITHM 10: Efficiently mapping Task τ in Graph G to clusters, if possible

function end_idle_clusters(τ, G, clusters)
  idle_clusters = clusters
  foreach κ ∈ clusters
    if (cluster_time(κ) ≤ tlevel(τ, G)) then
      end_idle_p = TRUE
      foreach τκ ∈ vertices(G) where cluster(τκ) = κ
        foreach τs ∈ successors(τκ, G)
          end_idle_p ∧= cluster(τs) ≠ cluster_undefined ∨ τs ∈ successors(τ, G)
      if (¬end_idle_p) then
        idle_clusters = idle_clusters - {κ}
  last_clusters = argmax_{κ ∈ idle_clusters} cluster_time(κ)
  return (idle_clusters ≠ ∅) ? last_clusters : ∅
end

To illustrate the importance of this heuristic, suppose we have the DAG presented in Figure 5.3. The tables in Figure 5.4 exhibit the difference in scheduling obtained by DSC and our extension on this graph. We observe here that the number of clusters generated using DSC is 3, with 5 idle slots, while BDSC needs only 2 clusters, with 2 idle slots. Moreover, BDSC achieves a better load balancing than DSC, since it reduces the variance of the clusters' execution loads, defined, for a given cluster, as the sum of the costs of all its tasks:

X_κ = Σ_{τ ∈ κ} task_time(τ)

Indeed, we compute first the average of costs,

E[X_κ] = (Σ_{κ=0}^{|clusters|-1} X_κ) / |clusters|

and then the variance,

V(X_κ) = (Σ_{κ=0}^{|clusters|-1} (X_κ - E[X_κ])²) / |clusters|

For BDSC, E(X_κ) = 15/2 = 7.5 and V(X_κ) = ((7 - 7.5)² + (8 - 7.5)²)/2 = 0.25. For DSC, E(X_κ) = 15/3 = 5 and V(X_κ) = ((5 - 5)² + (8 - 5)² + (2 - 5)²)/3 = 6.

Finally, with our efficient task-to-cluster mapping, in addition to decreasing the number of generated clusters, we also gain in total execution time, since our approach reduces communication costs by allocating tasks to the same cluster, e.g., by eliminating edge_cost(τ5, τ6) = 2 in Figure 5.4; thus, as shown in this figure (the last column accumulates the execution time for the clusters), the execution time with DSC is 14, but is 13 with BDSC.

[Figure 5.3, left part: a DAG over tasks τ1, ..., τ6 with task times 1, 5, 2, 3, 2, 2 and edge costs 1, 1, 1, 1, 1, 9, 2; diagram omitted.]

step  task  tlevel  blevel  DS   scheduled tlevel (κ0, κ1)
1     τ1    0       15      15   0
2     τ3    2       13      15   1
3     τ2    2       12      14   3, 2
4     τ4    8       6       14   7
5     τ5    8       6       14   8, 10
6     τ6    13      2       15   11, 12

Figure 5.3: A DAG amenable to cluster minimization and its BDSC step-by-step scheduling.

To get a feeling for the way BDSC operates, we detail the steps taken to get this better scheduling in the table of Figure 5.3. BDSC is equivalent to DSC until Step 5, where κ0 is chosen by our cluster mapping heuristic, since successors(τ3, G) ⊂ successors(τ5, G); no new cluster needs to be allocated.

κ0    κ1    κ2    cumulative time
τ1                 1
τ3    τ2           7
τ4          τ5     10
τ6                 14

κ0    κ1    cumulative time
τ1                 1
τ3    τ2           7
τ5    τ4           10
τ6                 13

Figure 5.4: DSC (first table) and BDSC (second table) cluster allocation.

5.3.5 The BDSC Algorithm

BDSC extends the list scheduling template DSC provided in Algorithm 4 by taking into account the various extensions discussed above. In a nutshell, the BDSC select_cluster function, which decides in which cluster κ a task τ should be allocated, tries successively the four following strategies:

1. choose κ among the clusters of τ's predecessors that decrease the start time of τ, under MCW and DSRW constraints;

2. or assign κ using our efficient task-to-cluster mapping strategy, under the additional constraint MCW;

3. or create a new cluster, if the PCW constraint is satisfied;

4. otherwise, choose the cluster among all clusters in MCW_clusters_min, under the constraint MCW. Note that, in this worst case scenario, the top level of τ can be increased, leading to a decrease in performance, since the length of the graph critical path is also increased.

BDSC is described in Algorithms 11 and 12; the entry graph Gu is the whole unscheduled program DAG, P the maximum number of processors, and M the maximum amount of memory available in a cluster. UT denotes the set of unexamined tasks at each BDSC iteration, RL the set of ready tasks, and URL the set of unready ones. We schedule the vertices of G according to the four rules above, in a descending order of the vertices' priorities. Each time a task τr has been scheduled, all the newly readied vertices are added to the set RL (ready list) by the update_ready_set function.

BDSC returns a scheduled graph, i.e., an updated graph where some zeroings may have been performed and for which the clusters function yields the clusters needed by the given schedule; this schedule includes, beside the new graph, the cluster allocation function on tasks, cluster. If not enough memory is available, BDSC returns the original graph and signals its failure by setting clusters to the empty set. In Chapter 6, we introduce an alternative solution in case of failure, using an iterative hierarchical scheduling approach.

We suggest to apply here an additional heuristic in which if multiplevertices have the same priority the vertex with the greatest bottom level ischosen for τr (likewise for τu) to be scheduled first to favor the successorsthat have the longest path from τr to the exit vertex Also an optimizationcould be performed when calling update priority values(G) indeed aftereach cluster allocation only the top levels of the successors of τr need to berecomputed instead of those of the whole graph
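A minimal C sketch of this tie-breaking rule is given below, using a hypothetical task record (the PIPS implementation differs): among ready tasks, the highest priority wins, and ties are broken in favor of the greatest bottom level.

#include <stdio.h>

typedef struct {
    double priority;   /* tlevel + blevel */
    double blevel;
} task_t;

/* Index of the task to schedule first: highest priority,
   ties broken in favor of the greatest bottom level. */
static int select_highest_priority(const task_t *ready, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (ready[i].priority > ready[best].priority ||
            (ready[i].priority == ready[best].priority &&
             ready[i].blevel > ready[best].blevel))
            best = i;
    return best;
}

int main(void)
{
    task_t rl[] = { {15.0, 13.0}, {15.0, 6.0}, {14.0, 12.0} };
    printf("selected index: %d\n", select_highest_priority(rl, 3)); /* prints 0 */
    return 0;
}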


ALGORITHM 11: BDSC scheduling of Graph Gu, under processor and memory bounds P and M

function BDSC(Gu, P, M)
  if (P ≤ 0) then
    return error('Not enough processors', Gu)
  G = graph_copy(Gu)
  foreach τi ∈ vertices(G)
    priority(τi) = tlevel(τi, G) + blevel(τi, G)
  UT = vertices(G)
  RL = {τ ∈ UT / predecessors(τ, G) = ∅}
  URL = UT - RL
  clusters = ∅
  while UT ≠ ∅
    τr = select_task_with_highest_priority(RL)
    (τm, min_tlevel) = tlevel_decrease(τr, G)
    if (τm ≠ τr and MCW(τr, cluster(τm), M)) then
      τu = select_task_with_highest_priority(URL)
      if (priority(τr) < priority(τu)) then
        if (¬DSRW(τr, τu, clusters, G)) then
          if (PCW(clusters, P)) then
            κ = new_cluster(clusters)
            allocate_task_to_cluster(τr, κ, G)
          else
            if (¬find_cluster(τr, G, clusters, P, M)) then
              return error('Not enough memory', Gu)
      else
        allocate_task_to_cluster(τr, cluster(τm), G)
        edge_cost(τm, τr) = 0
    else if (¬find_cluster(τr, G, clusters, P, M)) then
      return error('Not enough memory', Gu)
    update_priority_values(G)
    UT = UT - {τr}
    RL = update_ready_set(RL, τr, G)
    URL = UT - RL
  clusters(G) = clusters
  return G
end

Theorem 1. The time complexity of Algorithm 11 (BDSC) is O(n³), n being the number of vertices in Graph G.

Proof. In the "while" loop of BDSC, the most expensive computation among the different if or else branches is the function end_idle_clusters, used in find_cluster, which locates an existing cluster suitable to allocate Task τ to; such reuse intends to optimize the use of the limited number of processors.


ALGORITHM 12: Attempt to allocate a cluster κ in clusters for Task τ in Graph G, under processor and memory bounds P and M, returning true if successful

function find_cluster(τ, G, clusters, P, M)
  MCW_idle_clusters =
    {κ ∈ end_idle_clusters(τ, G, clusters, P) / MCW(τ, κ, M)}
  if (MCW_idle_clusters ≠ ∅) then
    κ = choose_any(MCW_idle_clusters)
    allocate_task_to_cluster(τ, κ, G)
  else if (PCW(clusters, P)) then
    allocate_task_to_cluster(τ, new_cluster(clusters), G)
  else
    MCW_clusters = {κ ∈ clusters / MCW(τ, κ, M)}
    MCW_clusters_min = argmin_{κ ∈ MCW_clusters} cluster_time(κ)
    if (MCW_clusters_min ≠ ∅) then
      κ = choose_any(MCW_clusters_min)
      allocate_task_to_cluster(τ, κ, G)
    else
      return false
  return true
end

function error(m, G)
  clusters(G) = ∅
  return G
end

Its complexity is proportional to

Σ_{τ ∈ vertices(G)} |successors(τ, G)|

which is of worst case complexity O(n²). Thus, the total cost for n iterations of the "while" loop is O(n³).

Even though BDSC's worst case complexity is larger than DSC's, which is O(n² log(n)) [108], it remains polynomial, with a small exponent. Our experiments (see Chapter 8) showed this theoretical slowdown not to be a significant factor in practice.
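Written out in LaTeX, the counting argument of the proof is simply the following (a restatement of the bound above, not a new result):

\[
\sum_{\tau \in \mathrm{vertices}(G)} |\mathrm{successors}(\tau, G)|
  \;=\; |E| \;\le\; n(n-1) \;=\; O(n^2),
\qquad
\underbrace{n}_{\text{while iterations}} \times O(n^2) \;=\; O(n^3).
\]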

5.4 Related Work: Scheduling Algorithms

In this section, we survey the main existing scheduling algorithms and compare them to BDSC. Given the breadth of the literature on this subject, we limit this presentation to heuristics that implement static scheduling of task DAGs. Static scheduling algorithms can be divided into three principal classes: guided random, metaheuristic and heuristic techniques.

Guided randomization and metaheuristic techniques use possibly random choices, with guiding information gained from previous search results, to construct new possible solutions. Genetic algorithms are examples of these techniques; they proceed by performing a global search, working in parallel on a population to increase the chance to achieve a global optimum. The Genetic Algorithm for Task Scheduling (GATS 1.0) [38] is an example of GAs. Overall, these techniques usually take a larger than necessary time to find a good solution [72].

Cuckoo Search (CS) [109] is an example of the metaheuristic class. CS combines the behavior of cuckoo species (throwing their eggs in the nests of other birds) with the Lévy flight [28]. In metaheuristics, finding an initial population and deciding when a solution is good are problems that affect the quality of the solution and the complexity of the algorithm.

The third class is heuristics, including mostly list-scheduling-based algorithms (see Section 2.5.2). We compare BDSC with eight scheduling algorithms for a bounded number of processors: HLFET, ISH, MCP, HEFT, CEFT, ELT, LPGS and LSGP.

The Highest Level First with Estimated Times (HLFET) [10] and Insertion Scheduling Heuristic (ISH) [70] algorithms use static bottom levels for ordering; scheduling is performed according to a descending order of bottom levels. To schedule a task, they select the cluster that offers the earliest execution time, using a non-insertion approach, i.e., not taking into account idle slots within existing clusters to insert that task. If scheduling a given task introduces an idle slot, ISH adds the possibility of inserting, from the ready list, tasks that can be scheduled into this idle slot. Since, in both algorithms, only bottom levels are used for scheduling purposes, optimal schedules for fork/join graphs cannot be guaranteed.

The Modified Critical Path (MCP) algorithm [107] uses the latest start times, i.e., the critical path length minus blevel, as task priorities. It constructs a list of tasks in an ascending order of latest start times and searches for the cluster yielding the earliest execution, using the insertion approach. As before, it cannot guarantee optimal schedules for fork/join structures.

The Heterogeneous Earliest-Finish-Time (HEFT) algorithm [105] selects the cluster that minimizes the earliest finish time, using the insertion approach. The priority of a task, its upward rank, is the task bottom level. Since this algorithm is based on bottom levels only, it cannot guarantee optimal schedules for fork/join structures.

The Constrained Earliest Finish Time (CEFT) [68] algorithm schedules tasks on heterogeneous systems. It uses the concept of constrained critical paths (CCPs), which represent the tasks ready at each step of the scheduling process. CEFT schedules the tasks in the CCPs using the finish time in the entire CCP. The fact that CEFT schedules critical path tasks first cannot guarantee optimal schedules for fork and join structures, even if sufficient processors are provided.

In contrast to these five algorithms, BDSC preserves the DSC characteristics of optimal scheduling for fork/join structures, since it uses the critical path length for computing dynamic priorities, based on bottom levels and top levels, when the number of processors available for BDSC is greater than or equal to the number of processors used when scheduling using DSC. HLFET, ISH and MCP guarantee that the current critical path will not increase, but they do not attempt to decrease the critical path length. BDSC decreases the length of the DS (Dominant Sequence) of each task and starts with a ready vertex, to simplify the computation time of new priorities. When resource scarcity is a factor, BDSC introduces a simple two-step heuristic for task allocation: to schedule tasks, it first searches for possible idle slots in already existing clusters and, otherwise, picks a cluster with enough memory. Our experiments suggest that such an approach provides good schedules (see Chapter 8).

The Locally Parallel-Globally Sequential (LPGS) [79] and Locally Sequential-Globally Parallel (LSGP) [59] algorithms are two techniques that, from a schedule for an unbounded number of clusters, remap the solutions to a bounded number of clusters. In LSGP, clusters are partitioned into blocks, each block being assigned to one cluster (locally sequential); the blocks are handled separately by different clusters, which can be run in parallel (globally parallel). LPGS links each original one-block cluster to one processor (locally parallel); blocks are executed sequentially (globally sequential). BDSC computes a bounded schedule on the fly and covers many more other possibilities of scheduling than LPGS and LSGP.

The Extended Latency Time (ELT) algorithm [102] assigns tasks to a parallel machine with shared memory. It uses the attribute of synchronization time instead of communication time, because the latter does not exist in a machine with shared memory. BDSC targets both shared and distributed memory systems.

5.5 Conclusion

This chapter presents BDSC, a memory-constrained, processor-bounded list-scheduling heuristic which extends the DSC (Dominant Sequence Clustering) algorithm. This extension improves upon DSC by dealing with memory- and number-of-processors-constrained parallel architectures and by introducing an efficient task-to-cluster mapping strategy.

This chapter leaves pending many questions about the practical use of BDSC and about how to estimate the execution times and communication costs as numerical constants. The next chapter answers these questions, and others, by showing (1) the practical use of BDSC within a hierarchical scheduling algorithm, called HBDSC, for parallelizing scientific applications, and (2) how BDSC will benefit from a sophisticated execution and communication cost model, based on either static code analysis results provided by a compiler infrastructure such as PIPS or a dynamic-based instrumentation assessment tool.

Chapter 6

BDSC-Based Hierarchical Task Parallelization

All sciences are now under the obligation to prepare the ground for the future task of the philosopher, which is to solve the problem of value, to determine the true hierarchy of values.
Friedrich Nietzsche

To make a recapitulation of what we have presented so far in the previous chapters, recall that we have handled two key issues in our automatic task parallelization process, namely (1) the design of SPIRE, to provide parallel program representations, and (2) the introduction of the BDSC list-scheduling heuristic, for mapping DAGs in the presence of resource constraints on the number of processors and their local memory size. In order to reach our goal of automatically task parallelizing sequential codes, we develop in this chapter an approach based on BDSC that first maps this code into a DAG, then generates a cost model to label the edges and vertices of this DAG, and finally applies BDSC on sequential applications.

Therefore, this chapter introduces a new BDSC-based hierarchical scheduling algorithm, which we call HBDSC. It introduces a new DAG data structure, called the Sequence Dependence DAG (SDG), to represent partitioned parallel programs. Moreover, in order to label this SDG's vertices and edges, HBDSC uses a new cost model based on time complexity measures, convex polyhedral approximations of data array sizes and code instrumentation.

6.1 Introduction

If static scheduling algorithms are key issues in sequential program parallelization, they need to be properly integrated into compilation platforms to be used in practice. These platforms are, in particular, expected to provide the data required for scheduling purposes. However, gathering such information is a difficult problem in general, in particular in our case, where tasks are automatically generated from program code. Therefore, in addition to BDSC, this chapter also introduces new static program analyses to gather the information required to perform scheduling under resource constraints, namely a static execution and communication cost model, a data dependence graph to provide scheduling constraints, and static information regarding the volume of data exchanged between program tasks.

In this chapter, we detail how BDSC can be used in practice to schedule applications. We show (1) how to build, from an existing program source code, what we call a Sequence Dependence DAG (SDG), which plays the role of the DAG G used in Chapter 5, (2) how to generate the numerical costs of vertices and edges in SDGs, and (3) how to perform what we call Hierarchical Scheduling (HBDSC) for SDGs, to generate parallel code encoded in SPIRE(PIPS IR). We use PIPS to illustrate how these new techniques can be implemented in an optimizing compilation platform.

PIPS represents user code as abstract syntax trees (see Section 2.6.5). We define a subset of its grammar in Figure 6.1, limited to the statements S at stake in this chapter. Econd, Elower and Eupper are expressions, while I is an identifier. The semantics of these constructs is straightforward. Note that, in PIPS, assignments are seen as function calls, where left hand sides are parameters passed by reference. We use the notion of unstructured with an execution attribute equal to parallel (defined in Chapter 4) to represent parallel code.

S ∈ Statement ::= sequence(S1; ...; Sn) |
                  test(Econd, St, Sf) |
                  forloop(I, Elower, Eupper, Sbody) |
                  call |
                  unstructured(Centry, Cexit, parallel)
C ∈ Control   ::= control(S, Lsucc, Lpred)
L ∈ Control

Figure 6.1: Abstract syntax tree Statement syntax.

The remainder of this chapter is structured as follows. Section 6.2 introduces the partitioning of source codes into Sequence Dependence DAGs (SDGs). Section 6.3 presents our cost models for the labeling of vertices and edges of SDGs. We introduce in Section 6.4 an important analysis, the path transformer, which we use within our cost models. A new BDSC-based hierarchical scheduling algorithm (HBDSC) is introduced in Section 6.5. Section 6.6 compares the main existing parallelization platforms with our BDSC-based hierarchical parallelization approach. Finally, Section 6.7 concludes the chapter. Figure 6.2 summarizes the processes of parallelization applied in this chapter.

6.2 Hierarchical Sequence Dependence DAG Mapping

In order to partition real applications, which include loops, tests and other structured constructs1, into task-level dependence DAGs, our approach is to first build a Sequence Dependence DAG (SDG), which will be the input for the BDSC algorithm. Then we use the code, presented in the form of an AST, to define a hierarchical mapping function, which we call H, that maps each sequence statement of the code to its SDG. H is used as the input of the HBDSC algorithm. We present in this section what SDGs are and how an H is built upon them.

1. In this thesis, we only handle structured parts of a code, i.e., the ones that do not contain goto statements. Therefore, within this context, PIPS implements control dependences in its IR, since it is equivalent to an AST (for structured programs, CDG and AST are equivalent).

Figure 6.2: Parallelization process: blue indicates thesis contributions; an ellipse, a process; and a rectangle, results. [Flowchart omitted; it chains the PIPS analyses (DDG, convex array regions, complexity polynomials), the SDG construction and the symbolic communication and memory size polynomials, then either polynomial approximation (when the polynomials are monomials on the same base) or instrumentation, execution and a numerical profile, and finally HBDSC, producing an unstructured SPIRE(PIPS IR) schedule.]

6.2.1 Sequence Dependence DAG (SDG)

A Sequence Dependence DAG G is a data dependence DAG where task vertices τ are labeled with statements, while control dependences are encoded in the abstract syntax trees of statements. Any statement S can label a DAG vertex, i.e., each vertex τ contains a statement S, which corresponds to the code it runs when scheduled. We assume that there exist two functions, vertex_statement and statement_vertex, such that, on their respective domains of definition, they satisfy S = vertex_statement(τ) and statement_vertex(S, G) = τ. In contrast to the usual program dependence graph defined in [44], an SDG is thus not built only on simple instructions, represented here as call statements; compound statements, such as test statements (both true and false branches) and loop nests, may constitute indivisible vertices of the SDG.
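To make this labeling concrete, here is a minimal C sketch of one possible in-memory shape for an SDG vertex and its outgoing edges; the field names are hypothetical and this is only an illustration, not the PIPS data structure.

#include <stddef.h>

/* Illustrative SDG representation (hypothetical, not the PIPS one). */
typedef struct statement statement_t;   /* opaque AST node */

typedef struct sdg_edge {
    struct sdg_vertex *sink;            /* sink task of the dependence        */
    long               edge_cost;       /* RAW volume in bytes (Section 6.3)  */
    struct sdg_edge   *next;            /* next outgoing edge                 */
} sdg_edge_t;

typedef struct sdg_vertex {
    statement_t *stmt;                  /* vertex_statement(tau)              */
    long         task_time;             /* complexity estimation              */
    long         task_data;             /* data footprint over-approximation  */
    sdg_edge_t  *succs;                 /* outgoing dependence edges          */
} sdg_vertex_t;

typedef struct {
    sdg_vertex_t *vertices;             /* one vertex per sequence statement  */
    size_t        n_vertices;
} sdg_t;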

Design Analysis

For a statement S, one computes its sequence dependence DAG G using a recursive descent (depth-first traversal) along the structure of S, creating a vertex for each statement in a sequence and building loop body, true branch and false branch statements as recursively nested graphs. Definition 1 specifies this relation of hierarchical nesting between statements, which we call "enclosed".

Definition 1. A statement S2 is enclosed into the statement S1, noted S2 ⊂ S1, if and only if either:

1. S2 = S1, i.e., S1 and S2 have the same lexicographical order position within the whole AST to which they belong; this ordering of statements, based on their statement number, renders statements textually comparable;

2. or S1 is a loop, S′2 its body statement, and S2 ⊂ S′2;

3. or S1 is a test, S′2 its true or false branch, and S2 ⊂ S′2;

4. or S1 is a sequence and there exists S′2, one of the statements in this sequence, such that S2 ⊂ S′2.

62 HIERARCHICAL SEQUENCE DEPENDENCE DAG MAPPING 95

This relation can also be applied to the vertices τi of these statements Si in a graph, such that vertex_statement(τi) = Si; we note S2 ⊂ S1 ⟺ τ2 ⊂ τ1.
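As a small, hypothetical C illustration of Definition 1 (the statement labels in the comments are ours):

#include <stdio.h>

int main(void)
{
    int a[10] = {0};
    for (int i = 0; i < 10; i++)  /* S1: forloop statement                  */
        if (a[i] == 0)            /* S'2: test, enclosed in S1 (rule 2)      */
            a[i] = i;             /* S2: in the true branch of S'2 (rule 3),
                                     hence S2 is enclosed in S1              */
    printf("%d\n", a[9]);         /* prints 9 */
    return 0;
}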

In Definition 2, we provide a functional specification of the construction of a Sequence Dependence DAG G, where the vertices are compound statements and thus under the enclosed relation. To create connections within this set of vertices, we keep an edge between two vertices if the possibly compound statements in the vertices are data dependent, using the recursive function prune; we rely on static analyses, such as the one provided in PIPS, to get this data dependence information, in the form of a data dependence graph that we call D in the definition.

Definition 2. Given a data dependence graph D = (T, E)2 and a sequence of statements S = sequence(S1; S2; ...; Sm), its SDG is G = (N, prune(A), 0_{m²}), where:

• N = {n1, n2, ..., nm}, where nj = vertex_statement(Sj), ∀j ∈ {1, ..., m};

• A = {(ni, nj) / ∃(τ, τ′) ∈ E, τ ⊂ ni ∧ τ′ ⊂ nj};

• prune(A′ ∪ {(n, n′)}) = prune(A′) if ∃(n0, n′0) ∈ A′, (n0 ⊂ n ∧ n′0 ⊂ n′), and prune(A′) ∪ {(n, n′)} otherwise;

• prune(∅) = ∅;

• 0_{m²} is an m×m communication edge cost matrix, initially equal to the zero matrix.

Note that G is a quotient graph of D; moreover, we assume that D does not contain inter-iteration dependences.

Construction Algorithm

Algorithm 13 specifies the construction of the sequence dependence DAG G of a statement that is a sequence of substatements S1, S2, ..., Sm. G initially is an empty graph, i.e., its sets of vertices and edges are empty, and its communication matrix is the zero matrix.

From the data dependence graph D, provided via the function DDG that we suppose given, Function SDG(S) adds vertices and edges to G, according to Definition 2, using the functions add_vertex and add_edge. First, we group dependences between compound statements Si to form cumulated dependences, using dependences between their enclosed statements Se; this yields the successors τj. Then we search for another compound statement Sk to form the second side of the dependence, τk, the first being τi. We then keep the dependences that label the edges of D, using edge_regions of the new edge. Finally, we add the dependence edge (τi, τk) to G.

2. T = vertices(D) is a set of n tasks (vertices) τ, and E is a set of m edges (τi, τj) between two tasks.

ALGORITHM 13: Construction of the sequence dependence DAG G from a statement S = sequence(S1; S2; ...; Sm)

function SDG(sequence(S1; S2; ...; Sm))
  G = (∅, ∅, 0_{m²})
  D = DDG(sequence(S1; S2; ...; Sm))
  foreach Si ∈ {S1, S2, ..., Sm}
    τi = statement_vertex(Si, D)
    add_vertex(τi, G)
    foreach Se / Se ⊂ Si
      τe = statement_vertex(Se, D)
      foreach τj ∈ successors(τe, D)
        Sj = vertex_statement(τj)
        if (∃ k ∈ [1, m] / Sj ⊂ Sk) then
          τk = statement_vertex(Sk, D)
          edge_regions(τi, τk) ∪= edge_regions(τe, τj)
          if (¬∃ (τi, τk) ∈ edges(G))
            add_edge((τi, τk), G)
  return G
end

Figure 6.4 illustrates the construction, from the DDG given in Figure 6.3 (right; the meaning of colors is given in Section 2.4.2), of the SDG of the C code (left). The figure contains two SDGs, corresponding to the two sequences in the code: the body S0 of the first loop (in blue) also has an SDG, G0. Note how the dependences between the two loops have been deduced from the dependences of their enclosed statements (their loop bodies). These SDGs and their printouts have been generated automatically with PIPS3.

6.2.2 Hierarchical SDG Mapping

A hierarchical SDG mapping function H maps each statement S to an SDG G = H(S), if S is a sequence statement; otherwise, G is equal to ⊥.

Proposition 1. ∀τ ∈ vertices(H(S′)), vertex_statement(τ) ⊂ S′.

Our introduction of the SDG and of its hierarchical mapping to different statements via H is motivated by the following observations, which also support our design decisions.

3. We assume that declarations and code are not mixed.


void main() {
   int a[10], b[10];
   int i, d, c;
   /* S */
   c = 42;
   for (i = 1; i <= 10; i++)
      a[i] = 0;
   for (i = 1; i <= 10; i++) {
      /* S0 */
      a[i] = bar(a[i]);
      d = foo(c);
      b[i] = a[i]+d;
   }
   return;
}

Figure 6.3: Example of a C code (left) and the DDG D of its internal sequence S (right). [Graph omitted.]

1. The true and false statements of a test are control dependent upon the condition of the test statement, while every statement within a loop (i.e., the statements of its body) is control dependent upon the loop statement header. If we define a control area as a set of statements transitively linked by the control dependence relation, our SDG construction process ensures that the control area of the statement of a given vertex is in the vertex. This way, we keep all the control dependences of a task in our SDG within itself.

2. We decided to consider test statements as single vertices in the SDG, to ensure that they are scheduled on one cluster4; this guarantees the execution of the enclosed code (true or false statements), whichever branch is taken, on this cluster.

3. We do not group successive simple call instructions into a single "basic block" vertex in the SDG, in order to let BDSC fuse the corresponding statements, so as to maximize parallelism and minimize communications. Note that PIPS performs interprocedural analyses, which will allow call sequences to be efficiently scheduled, whether these calls represent trivial assignments or complex function calls.

4. A cluster is a logical entity which will correspond to one process or thread.

Figure 6.4: SDGs of S (top) and S0 (bottom), computed from the DDG (see the right of Figure 6.3); S and S0 are specified in the left of Figure 6.3. [Graphs omitted.]

In Algorithm 13, Function SDG yields the sequence dependence DAG of a sequence statement S. We use it in Function SDG_mapping, presented in Algorithm 14, to update the mapping function H for statements S that can be a sequence or a non-sequence; the goal is to build a hierarchy between these SDGs, via the recursive call H = SDG_mapping(S, ⊥).

Figure 6.5 shows the construction of H, where S is the last part of the equake code excerpt given in Figure 6.7; note how the body Sbody of the last loop is also an SDG, Gbody. Note that H(Sbody) = Gbody and H(S) = G. These SDGs have been generated automatically with PIPS; we use the Graphviz tool for pretty printing [50].


ALGORITHM 14: Recursive mapping of an SDG G to each sequence statement S via the function H

function SDG_mapping(S, H)
  switch (S)
    case call:
      return H[S → ⊥]
    case sequence(S1; ...; Sn):
      foreach Si ∈ {S1, ..., Sn}
        H = SDG_mapping(Si, H)
      return H[S → SDG(S)]
    case forloop(I, Elower, Eupper, Sbody):
      H = SDG_mapping(Sbody, H)
      return H[S → ⊥]
    case test(Econd, St, Sf):
      H = SDG_mapping(St, H)
      H = SDG_mapping(Sf, H)
      return H[S → ⊥]
end

Figure 6.5: SDG G for a part of equake, S, given in Figure 6.7; Gbody is the SDG of the body Sbody (logically included into G via H). [Graphs omitted.]


6.3 Sequential Cost Models Generation

Since the volume of data used or exchanged by SDG tasks and their execution times are key factors in the BDSC scheduling process, we need to assess this information as precisely as possible. PIPS provides an intra- and inter-procedural analysis of array data flow, called array regions analysis (see Section 2.6.3), that computes dependences for each array element access. For each statement S, two types of sets of regions are considered: read_regions(S) and write_regions(S)5 contain the array elements respectively read and written by S. Moreover, in PIPS, array region analysis is not limited to array variables, but is extended to scalar variables, which can be considered as arrays with no dimensions [34].

6.3.1 From Convex Polyhedra to Ehrhart Polynomials

Our analysis uses the following set of operations on the sets of regions Ri of Statement Si, presented in Section 2.6.3: regions_intersection and regions_union. Note that regions must be defined with respect to a common memory store for these operations to be properly defined. Two sets of regions R1 and R2 should be defined in the same memory store to make it possible to compare them; we should bring the set of regions R1 to the store of R2, by combining R2 with the possible store changes introduced in between, what we call the "path transformer", which connects the two memory stores. We introduce the path transformer analysis in Section 6.4.

Since we are interested in the size of array regions, to precisely assess communication costs and memory requirements, we compute Ehrhart polynomials [41], which represent the number of integer points contained in a given parameterized polyhedron, from each region. To manipulate these polynomials, we use various operations, via the Ehrhart API provided by the polylib library [90].

Communication Edge Cost

To assess the communication cost between two SDG vertices, τ1 as source and τ2 as sink vertex, we rely on the number of bytes involved in dependences of type "read after write" (RAW), using the read and write regions as follows:

Rw1 = write_regions(vertex_statement(τ1))
Rr2 = read_regions(vertex_statement(τ2))
edge_cost_bytes(τ1, τ2) = Σ_{r ∈ regions_intersection(Rw1, Rr2)} ehrhart(r)

5. in_regions(S) and out_regions(S) are not suitable for our context, unless in_regions and out_regions are recomputed after each statement scheduling. That is why we use read_regions and write_regions, which are independent of the schedule.


For example, the array dependence between the two statements S1 = MultiplY(Ixx, Gx, Gx) and S2 = Gauss(Sxx, Ixx), in the code of Harris presented in Figure 1.1, is on Ixx, of size N×M, where the N and M variables represent the input image size. Therefore, S1 should communicate the array Ixx to S2 when completed. We look for the array elements that are accessed in both statements (written in S1 and read in S2); we obtain the following two sets of regions, Rw1 and Rr2:

Rw1 = < Ixx(φ1)-W-EXACT-{1 ≤ φ1, φ1 ≤ N×M} >
Rr2 = < Ixx(φ1)-R-EXACT-{1 ≤ φ1, φ1 ≤ N×M} >

The result of the intersection between these two sets of regions is as follows:

Rw1 ∩ Rr2 = < Ixx(φ1)-W-R-EXACT-{1 ≤ φ1, φ1 ≤ N×M} >

The sum of the Ehrhart polynomials resulting from the set Rw1 ∩ Rr2 of regions, which yields the volume of this set of regions, i.e., the number of its elements in bytes, is as follows, where 4 is the number of bytes in a float:

edge_cost_bytes(τ1, τ2) = ehrhart(< Ixx(φ1)-W-R-EXACT-{1 ≤ φ1, φ1 ≤ N×M} >)
                        = 4 × N × M

In practice, in our experiments, in order to compute the communication time, this polynomial, which represents the message size to communicate in number of bytes, is multiplied by the transfer time of one byte (β); it is then added to the latency time (α). These two coefficients are dependent on the specific target machine; we note:

edge_cost(τ1, τ2) = α + β × edge_cost_bytes(τ1, τ2)

The default cost model in PIPS considers that α = 0 and β = 1. Moreover, in some cases, the array region analysis cannot provide the required information. In the case of shared memory codes, BDSC can then proceed and generate an efficient schedule without the edge_cost information; otherwise, in the case of distributed memory codes, edge_cost(τ1, τ2) = ∞ is generated as a result for such cases, and we thus decide to schedule τ1 and τ2 in the same cluster.
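For instance, the following C sketch evaluates edge_cost for the Harris Ixx transfer of the previous example, under assumed values of α, β, N and M (all illustrative; the PIPS default model corresponds to α = 0 and β = 1):

#include <stdio.h>

static double edge_cost(double alpha, double beta, double bytes)
{
    return alpha + beta * bytes;   /* latency + per-byte transfer time */
}

int main(void)
{
    const double alpha = 0.0;      /* assumed latency (cycles)          */
    const double beta  = 1.0;      /* assumed cost of one byte (cycles) */
    long N = 1024, M = 1024;       /* assumed image size                */
    double bytes = 4.0 * N * M;    /* Ehrhart count times sizeof(float) */
    printf("edge_cost(Ixx) = %.0f cycles\n", edge_cost(alpha, beta, bytes));
    return 0;
}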

Local Storage Task Data

To provide an estimation of the volume of data used by each task τ, we use the number of bytes of data read and written by the task statement, via the following definitions, assuming that the task statement contains no internal allocation of data:

S = vertex_statement(τ)
task_data(τ) = regions_union(read_regions(S), write_regions(S))

As above, we define data_size as follows, for any set of regions R:

data_size(R) = Σ_{r ∈ R} ehrhart(r)

For instance, in the Harris main function, task_data(τ), where τ is the vertex labeled with the call statement S1 = MultiplY(Ixx, Gx, Gx), is equal to:

R = {< Gx(φ1)-R-EXACT-{1 ≤ φ1, φ1 ≤ N×M} >,
     < Ixx(φ1)-W-EXACT-{1 ≤ φ1, φ1 ≤ N×M} >}

The size of this set of regions is:

data_size(R) = 2 × N × M

When the array region analysis cannot provide the required information, the size of the array deduced from its declaration is returned as an upper bound of the number of accessed array elements in such cases.

Execution Task Time

In order to determine an average execution time for each vertex in the SDG, we use a static execution time approach, based on a program complexity analysis provided by PIPS. There, each statement S is automatically labeled with an expression, represented by a polynomial over program variables, via the complexity_estimation(S) function, which denotes an estimation of the execution time of this statement, assuming that each basic operation (addition, multiplication...) has a fixed, architecture-dependent execution time. In practice, to perform our experiments on a specific target machine, we use a table, COMPLEXITY_COST_TABLE, that, for each basic operation, provides a coefficient of its execution time on this machine. This sophisticated static complexity analysis is based on inter-procedural information such as preconditions. Using this approach, one can define task_time as:

task_time(τ) = complexity_estimation(vertex_statement(τ))

For instance, for the call statement S1 = MultiplY(Ixx, Gx, Gx) in the Harris main function, task_time(τ), where τ is the vertex labeled with S1, is equal to N × M × 17 + 2, as detailed in Figure 6.6. Thanks to the interprocedural analysis of PIPS, this information can be reported to the call to the function MultiplY in the main function of Harris. Note that we use here a default cost model, where the coefficient of each basic operation is equal to 1.

When this analysis cannot provide the required information (currently, the complexity estimation of a while loop is not computed), our algorithm still computes a naive schedule; in such cases, we assume that task_time(τ) = ∞, but the quality (in terms of efficiency) of this schedule is not guaranteed.
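As an illustration of how such a polynomial becomes a cycle count, the sketch below evaluates the MultiplY estimation 17 × N × M + 2 of Figure 6.6 and weights it by a single hypothetical cycles-per-operation coefficient; the actual COMPLEXITY_COST_TABLE is per-operation, which we do not reproduce here.

#include <stdio.h>

/* Hypothetical uniform weight: cycles per basic operation. The PIPS
   COMPLEXITY_COST_TABLE is finer-grained (one coefficient per operation). */
static const double CYCLES_PER_OP = 1.0;

static double task_time_MultiplY(long N, long M)
{
    double ops = 17.0 * N * M + 2.0;  /* complexity estimation of Figure 6.6 */
    return ops * CYCLES_PER_OP;
}

int main(void)
{
    printf("task_time(MultiplY) = %.0f cycles\n", task_time_MultiplY(1024, 1024));
    return 0;
}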


void MultiplY(float M[N*M], float X[N*M], float Y[N*M]) {
   // N*M*17+2
   for(i = 0; i < N; i += 1)
      // M*17+2
      for(j = 0; j < M; j += 1)
         // 17
         M[z(i,j)] = X[z(i,j)]*Y[z(i,j)];
}

Figure 6.6: Example of the execution time estimation for the function MultiplY of the code Harris; each comment provides the complexity estimation of the statement below it (N and M are assumed to be global variables).

6.3.2 From Polynomials to Values

We have just seen how to represent cost data and time information in terms of polynomials; yet, running BDSC requires actual values. This is particularly a problem when computing tlevels and blevels, since cost and time are cumulated there. We convert cost information into time by assuming that communication times are proportional to costs, which amounts in particular to setting the communication latency α to zero. This assumption is validated by our experimental results and by the fact that data arrays are usually large in the application domain we target.

Static Approach: Polynomials Approximation

When the program variables used in the above-defined polynomials are numerical values, each polynomial is a constant; this happens to be the case for one of our applications, ABF. However, when input data are unknown at compile time (as for the Harris application), we suggest to use a very simple heuristic to replace the polynomials by numerical constants. When all polynomials at stake are monomials on the same base, we simply keep the coefficient of these monomials as costs. Even though this heuristic appears naive at first, it actually is quite useful in the Harris application. Table 6.1 shows the complexities for time estimation and communication costs (the last line) generated for each function of Harris, using the PIPS default operation and communication cost model, where the N and M variables represent the input image size.

In our implementation (see Section 8.3.3), we use a more precise operation and communication cost model that depends on the target machine. It relies on a number of parameters of a given CPU, such as the cost of an addition, of a multiplication or of the communication of one byte, for an Intel CPU, all these in numbers of cycle units (CPI). In Section 8.3.3, we detail these parameters for the CPUs used.

Function              Complexity and Transfer (polynomial)   Numerical estimation
InitHarris            9×N×M                                  9
SobelX                60×N×M                                 60
SobelY                60×N×M                                 60
MultiplY              17×N×M                                 17
Gauss                 85×N×M                                 85
CoarsitY              34×N×M                                 34
One image transfer    4×N×M                                  4

Table 6.1: Execution and communication time estimations for Harris, using the PIPS default cost model (the N and M variables represent the input image size).

Dynamic Approach: Instrumentation

The general case deals with polynomials that are functions of many variables, such as the ones that occur in the SPEC2001 benchmark equake, which depend on the variables ARCHelems or ARCHnodes and timesteps (see the non-bold code in Figure 6.7, which shows a part of the equake code).

In such cases, we first automatically instrument the input sequential code and run it once, in order to obtain the numerical values of the polynomials. The instrumented code contains the initial user code, plus instructions that compute the numerical values of the cost polynomials for each statement. BDSC is then applied, using this cost information, to yield the final parallel program. Note that this approach is sound, since BDSC ensures that the value of a variable (and thus of a polynomial) is the same whichever scheduling is used. Of course, this approach works well, as our experiments show (see Chapter 8), when a program's complexity and communication polynomials do not change when some of its input parameters are modified (partial evaluation). This is the case for many signal processing benchmarks, where performance is mostly a function of structure parameters, such as image size, and is independent of the actual signal (pixel) values upon which the program acts.

We show an example of this final case, using a part of the instrumented equake code6, in Figure 6.7. The added instrumentation instructions are fprintf statements; their second parameter represents the statement number of the following statement, and the third the value of its execution time, for task_time instrumentation. For edge_cost instrumentation, the second parameter is the number of the incident statements of the edge, and the third the edge_cost polynomial value. After execution of the instrumented code, the numerical results of the polynomials are printed in the file instrumented_equake.in, presented in Figure 6.8. This file is an input for the PIPS implementation of BDSC. We parse this file in order to extract the execution and communication estimation times (in number of cycles), to be used as parameters for the computation of the top levels and bottom levels necessary for the BDSC algorithm (see Chapter 5).

6. We do not show the instrumentation on the statements inside the loops, for readability purposes.

FILE *finstrumented = fopen("instrumented_equake.in", "w");

fprintf(finstrumented, "task_time 62 = %ld\n", 179*ARCHelems + 3);
for (i = 0; i < ARCHelems; i++)
   for (j = 0; j < 4; j++)
      cor[j] = ARCHvertex[i][j];

fprintf(finstrumented, "task_time 163 = %ld\n", 20*ARCHnodes + 3);
for (i = 0; i <= ARCHnodes - 1; i += 1)
   for (j = 0; j <= 2; j += 1)
      disp[disptplus][i][j] = 0.0;

fprintf(finstrumented, "edge_cost 163 -> 166 = %ld\n", ARCHnodes*9);
fprintf(finstrumented, "task_time 166 = %ld\n", 110*ARCHnodes + 106);
smvp_opt(ARCHnodes, K, ARCHmatrixcol, ARCHmatrixindex,
         disp[dispt], disp[disptplus]);

fprintf(finstrumented, "task_time 167 = %d\n", 6);
time = iter*Exc.dt;

fprintf(finstrumented, "task_time 168 = %ld\n", 510*ARCHnodes + 3);
/* S */
for (i = 0; i < ARCHnodes; i++)
   for (j = 0; j < 3; j++) {
      /* Sbody */
      disp[disptplus][i][j] *= -Exc.dt*Exc.dt;
      disp[disptplus][i][j] +=
         2.0*M[i][j]*disp[dispt][i][j] - (M[i][j] -
         Exc.dt/2.0*C[i][j])*disp[disptminus][i][j] -
         Exc.dt*Exc.dt*(M23[i][j]*phi2(time)/2.0 +
                        C23[i][j]*phi1(time)/2.0 +
                        V23[i][j]*phi0(time)/2.0);
      disp[disptplus][i][j] =
         disp[disptplus][i][j]/(M[i][j] + Exc.dt/2.0*C[i][j]);
      vel[i][j] = 0.5/Exc.dt*(disp[disptplus][i][j] -
                              disp[disptminus][i][j]);
   }

fprintf(finstrumented, "task_time 175 = %d\n", 2);
disptminus = dispt;
fprintf(finstrumented, "task_time 176 = %d\n", 2);
dispt = disptplus;
fprintf(finstrumented, "task_time 177 = %d\n", 2);
disptplus = i;

Figure 6.7: Instrumented part of equake (Sbody is the inner loop sequence).

task_time 62 = 35192835

task_time 163 = 718743

edge_cost 163 -gt 166 = 323433

task_time 166 = 3953176

task_time 167 = 6

task_time 168 = 18327873

task_time 175 = 2

task_time 176 = 2

task_time 177 = 2

Figure 6.8: Numerical results of the instrumented part of equake (instrumented_equake.in).
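One possible way to read this profile back into cost tables is sketched below; the format strings simply mirror the lines of Figure 6.8, and the actual PIPS parser may differ.

#include <stdio.h>

/* Read back the profile of Figure 6.8: "task_time <stmt> = <cycles>" and
   "edge_cost <src> -> <dst> = <cycles>" lines. */
int main(void)
{
    FILE *f = fopen("instrumented_equake.in", "r");
    if (f == NULL)
        return 1;
    char line[256];
    while (fgets(line, sizeof line, f) != NULL) {
        int s, d;
        long v;
        if (sscanf(line, "task_time %d = %ld", &s, &v) == 2)
            printf("statement %d runs in %ld cycles\n", s, v);
        else if (sscanf(line, "edge_cost %d -> %d = %ld", &s, &d, &v) == 3)
            printf("edge %d -> %d costs %ld cycles\n", s, d, v);
    }
    fclose(f);
    return 0;
}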

This section supplies the BDSC scheduling algorithm with the weights it relies upon, on the vertices (execution time and storage data) and edges (communication cost) of its entry graph. Indeed, we exploit the complexity and array region analyses provided by PIPS to harvest all this information statically. However, since two sets of regions must be defined in the same memory store to make it possible to compare them, the second set of regions must be combined with the path transformer, which we define in the next section and which connects the two memory stores of these two sets of regions.

6.4 Reachability Analysis: The Path Transformer

Our cost model, defined in Section 6.3, uses affine relations, such as convex array regions, to estimate the communications and data volumes induced by the execution of statements. We used a set of operations on array regions, such as the function regions_intersection (see Section 6.3.1). However, this is not sufficient, because regions should be combined with what we call path transformers, in order to propagate the memory stores used in them. Indeed, regions must be defined with respect to a common memory store for region operations to be properly performed.

A path transformer permits to compare array regions of statements originally defined in different memory stores. The path transformer between two statements computes the possible changes performed by a piece of code delimited by two statements, Sbegin and Send, enclosed within a statement S. A path transformer is represented by a convex polyhedron over program variables, and its computation is based on transformers (see Section 2.6.1); thus, it is a system of linear inequalities.


Our path transformer analysis can be seen as a special case of reachability analysis, a particular kind of graph analysis where one checks whether a given final state can be reached from an initial one. In the domain of programming languages, one can find a practical application of this technique in the computation, for general control-flow graphs, of the transitive closure of the relation corresponding to statement transitions (see for instance [106]). Since the scientific applications we target in this thesis use mostly for loops, we propose in this section a dedicated analysis, called path transformer, that provides, in the restricted case of for loops and tests, analytical expressions for such transitive closures.

6.4.1 Path Definition

A "path" specifies a statement execution instance, including the mapping of loop indices to values and the specific syntactic statement of interest. A path is defined by a pair P = (M, S), where M is a mapping function from Ide to Exp7 and S is a statement, specified by its syntactic path from the top program syntax tree to the particular statement. Indeed, a particular statement Si may correspond to a particular iteration M(i) within a set of iterations (created by a loop of index i). Paths are used to store the values of the indices of the loops enclosing statements. Every loop on the path has its entry in M; indices of loops not related to it are not present.

We provide a simple program example in Figure 6.9, which shows the AST of a code that contains a loop. We search for the path transformer between Sb and Se; we suppose that S, the smallest subtree that encloses the two statements, is forloop0. As the two statements are enclosed into a loop, we have to specify the indices of these statements inside the loop, e.g., M(i) = 2 for Sb and M(i) = 5 for Se. The path execution trace between Sb and Se is illustrated in Figure 6.10. The chosen path runs through Sb, for the second iteration of the loop, until Se, for the fifth iteration.

In this section, we assume that Send is reachable [43] from Sbegin. This means that, for each enclosing loop S (Sbegin ⊂ S and Send ⊂ S), where S = forloop(I, Elower, Eupper, Sbody), Elower ≤ Eupper (loops are always executed), and, for each mapping of an index of these loops, Elower ≤ ibegin ≤ iend ≤ Eupper. Moreover, since there is no execution trace from the true branch St to the false branch Sf of a test statement Stest, Send is not reachable from Sbegin if, for each mapping of an index of a loop that encloses Stest, ibegin = iend, or if there is no loop enclosing the test.

6.4.2 Path Transformer Algorithm

In this section, we present an algorithm that computes the affine relations between program variables at any two points, or "paths", in the program, from the memory store of Sbegin to the memory store of Send, along any sequential execution that links the two memory states. path_transformer(S, Pbegin, Pend) is the path transformer of the subtree of S with all vertices not on the Pbegin-to-Pend path execution trace pruned. It is the path transformer between the statements Sbegin and Send if Pbegin = (Mbegin, Sbegin) and Pend = (Mend, Send).

7. Ide and Exp are the sets of identifiers I and expressions E, respectively.

forloop0(i, 1, 10, sequence0([S0, S1, Sb, S2, Se, S3]))

Figure 6.9: An example of a subtree of root forloop0.

Sb(2), S2(2), Se(2), S3(2), S0(3), S1(3), Sb(3), S2(3), Se(3), S3(3), S0(4), S1(4), Sb(4), S2(4), Se(4), S3(4), S0(5), S1(5), Sb(5), S2(5), Se(5)

Figure 6.10: An example of a path execution trace from Sb to Se in the subtree forloop0.

path_transformer uses a variable mode to indicate the current context on the compile-time version of the execution trace. Modes are propagated along the depth-first, left-to-right traversal of statements: (1) mode = sequence, when there is no loop on the trace at the current statement level; (2) mode = prologue, on the prologue part of an enclosing loop; (3) mode = permanent, on the full part of an enclosing loop; and (4) mode = epilogue, on the epilogue part of an enclosing loop. The algorithm is described in Algorithm 15.

A first call to the recursive function pt on(S Pbegin Pend mode) withmode = sequence is made In this function many path transformer oper-ations are performed A path transformer T is recursively defined by

T = id | perp | T1 T2 | T1 t T2 | T k | T (S)

These operations on path transformers are specified as follows

• id is the identity transformer, used when no variables are modified, i.e., they are all constrained by an equality.

• ⊥ denotes the undefined (empty) transformer, i.e., one in which the set of affine constraints of the system is not feasible.

• '∘' denotes the composition of transformers: T1 ∘ T2 is overapproximated by the union of the constraints in T1 (on variables x1,...,xn, x″1,...,x″p) with the constraints in T2 (on variables x″1,...,x″p, x′1,...,x′n′), then projected on x1,...,xn, x′1,...,x′n′ to eliminate the "intermediate" variables x″1,...,x″p (this definition is extracted from [76]). Note that T ∘ id = T for all T (a small worked example follows this list).

• '⊔' denotes the convex hull of its transformer arguments. The usual union of two transformers is a union between two convex polyhedra, and is not necessarily a convex polyhedron; a convex hull polyhedron approximation is required in order to obtain a convex result: the convex hull operation provides the smallest polyhedron enclosing the union.

• T^k denotes the effect of k iterations of a transformer T. In practice, heuristics are used to compute a transitive closure approximation T*; thus, this operation generally provides an approximation of the transformer T^k. Note that PIPS uses here the Affine Derivative Closure algorithm [18].

• T(S) is the transformer of a statement S, defined in Section 2.6.1; we suppose these transformers already exist. T is defined by a list of arguments and a predicate system labeled by T.
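As a small worked example of the composition operation (this computation is ours, for illustration only): let T1 = T(i) {i = i_init + 5} and T2 = T(i) {i = 3 i_init}. Writing i″ for the intermediate value of i, T1 ∘ T2 is the system {i″ = i_init + 5, i = 3 i″}; projecting out i″ yields T1 ∘ T2 = T(i) {i = 3 i_init + 15}, i.e., the combined effect of first adding 5 and then multiplying by 3.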

We distinguish several cases in the pt_on(S, (Mbegin, Sbegin), (Mend, Send), mode) function. The computation of the path transformer begins when S is Sbegin, i.e., the path is beginning: we return the transformer of


ALGORITHM 15: AST-based Path Transformer algorithm

function path_transformer(S, (Mbegin,Sbegin), (Mend,Send))
  return pt_on(S, (Mbegin,Sbegin), (Mend,Send), sequence)
end

function pt_on(S, (Mbegin,Sbegin), (Mend,Send), mode)
  if (Send = S) then return id
  elseif (Sbegin = S) then return T(S)
  switch (S)
    case call:
      return (mode = permanent
              or mode = sequence and Sbegin ≤ S and S < Send
              or mode = prologue and Sbegin ≤ S
              or mode = epilogue and S < Send) ? T(S) : id
    case sequence(S1;S2;...;Sn):
      if (|S1;S2;...;Sn| = 0) then return id
      else
        rest = sequence(S2;...;Sn)
        return pt_on(S1, (Mbegin,Sbegin), (Mend,Send), mode) ∘
               pt_on(rest, (Mbegin,Sbegin), (Mend,Send), mode)
    case test(Econd, St, Sf):
      Tt = pt_on(St, (Mbegin,Sbegin), (Mend,Send), mode)
      Tf = pt_on(Sf, (Mbegin,Sbegin), (Mend,Send), mode)
      if ((Sbegin ⊂ St and Send ⊂ Sf) or (Send ⊂ St and Sbegin ⊂ Sf)) then
        if (mode = permanent) then
          return Tt ⊔ Tf
        else
          return ⊥
      else if (Sbegin ⊂ St or Send ⊂ St) then
        return T(Econd) ∘ Tt
      else if (Sbegin ⊂ Sf or Send ⊂ Sf) then
        return T(¬Econd) ∘ Tf
      else
        return T(S)
    case forloop(I, Elower, Eupper, Sbody):
      return pt_on_loop(S, (Mbegin,Sbegin), (Mend,Send), mode)
end

Sbegin. It ends when S is Send: there, we close the path and we return the identity transformer id. The other cases depend on the type of the statement, as follows.


Call

When S is a function call, we return the transformer of S if S belongs to the paths from Sbegin to Send, and id otherwise; we use the lexicographic ordering < to compare statement paths.

Sequence

When S is a sequence of n statements, we return the composition of the transformers of each statement in this sequence. An example of the result provided by the path transformer algorithm for a sequence is illustrated in Figure 6.11. The left side presents a code fragment that contains a sequence of four instructions. The right side shows the resulting path transformer between the two statements Sb and Se. It shows the result of the composition of the transformers of the three first statements; note that the transformer of Se is not included.

S {
  Sb: i = 42;
      i = i+5;
      i = i*3;
  Se: i = i+4;
}

path_transformer(S, (⊥,Sb), (⊥,Se)) = T(i) {i = 141}

Figure 6.11: An example of a C code and the path transformer of the sequence between Sb and Se; M is ⊥, since there are no loops in the code
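As a quick arithmetic check of Figure 6.11 (this computation is ours, not part of the original figure): composing the transformers of the first three statements gives i = 42, then i = 42 + 5 = 47, then i = 47 * 3 = 141; the transformer of Se (i = i + 4) is excluded, since the path transformer stops just before Send. Hence the result T(i) {i = 141}.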

Test

When S is a test, the path transformer is computed according to the location of Sbegin and Send in the branches of the test. If one is in the true statement and the other in the false statement of the test, the path transformer is the union of the true and false transformers iff the mode is permanent, i.e., there is at least one loop enclosing S. If only one of the two statements is enclosed in the true or in the false branch of the test, the transformer of the condition (true or false) is composed with the transformer of the corresponding enclosing branch (true or false) and is returned. An example of the result provided by the path transformer algorithm for a test is illustrated in Figure 6.12. The left side presents a code fragment that contains a test. The right side shows the resulting path transformer between the two statements Sb and Se. It shows the result of the composition of the transformers of the statements from Sb to the statement just before the test, the transformer of the condition true (i>0) and the statements in the true branch just before Se; note that the transformer of Se is not included.


S {
  Sb: i = 4;
      i = i*3;
      if (i > 0) {
        k1++;
  Se:   i = i+4;
      } else {
        k2++;
        i--;
      }
}

path_transformer(S, (⊥,Sb), (⊥,Se)) = T(i, k1) {i = 12, k1 = k1_init + 1}

Figure 6.12: An example of a C code and the path transformer of the sequence of calls and test between Sb and Se; M is ⊥, since there are no loops in the code

Loop

When S is a loop, we call the function pt_on_loop (see Algorithm 16). Note that, if loop indices are not specified for a statement in the map function, the algorithm posits ibegin equal to the lower bound of the loop and iend to the upper one. In this function, a composition of two transformers is computed, depending on the value of mode: (1) the prologue transformer, from the iteration ibegin until the upper bound of the loop, or until the iteration iend if Send belongs to this loop, and (2) the epilogue transformer, from the lower bound of the loop until the upper bound of the loop, or until the iteration iend if Send belongs to this loop. These two transformers are computed using the function iter, which we show in Algorithm 17.

iter computes the transformer of a number of iterations of a loop S, when loop bounds are constant or symbolic functions, via the predicate function is_constant_p on an expression, which returns true if this expression is constant. It computes a loop transformer that is the transformer of the proper number of iterations of this loop. It is equal to the transformer of the looping condition Tenter, composed with the transformer of the loop body, composed with the transformer of the incrementation statement Tinc. In practice, when h − l is not a constant, a transitive closure approximation T* is computed, using the function already implemented in PIPS, affine_derivative_closure, and generally provides an approximation of the transformer T^(h−l+1).

The case where ibegin = iend means that we execute only one iteration of the loop, from Sbegin to Send. Therefore, we compute Ts, i.e., a transformer on the loop body with the sequence mode, using Function


ALGORITHM 16: Case for loops of the path transformer algorithm

function pt_on_loop(S, (Mbegin,Sbegin), (Mend,Send), mode)
  forloop(I, Elower, Eupper, Sbody) = S
  low = ibegin = undefined(Mbegin(I)) ? Elower : Mbegin(I)
  high = iend = undefined(Mend(I)) ? Eupper : Mend(I)
  Tinit = T(I = Elower)
  Tenter = T(I <= Eupper)
  Tinc = T(I = I+1)
  Texit = T(I > Eupper)
  if (mode = permanent) then
    T = iter(S, (Mbegin,Sbegin), (Mend,Send), Elower, Eupper)
    return Tinit ∘ T ∘ Texit
  Ts = pt_on_loop_one_iteration(S, (Mbegin,Sbegin), (Mend,Send))
  Tp = Te = p = e = id
  if (mode = prologue or mode = sequence) then
    if (Sbegin ⊂ S)
      p = pt_on(Sbody, (Mbegin,Sbegin), (Mend,Send), prologue) ∘ Tinc
      low = ibegin+1
    high = (mode = sequence) ? iend-1 : Eupper
    Tp = p ∘ iter(S, (Mbegin,Sbegin), (Mend,Send), low, high)
  if (mode = epilogue or mode = sequence) then
    if (Send ⊂ S)
      e = Tenter ∘ pt_on(Sbody, (Mbegin,Sbegin), (Mend,Send), epilogue)
      high = (mode = sequence) ? -1 : iend-1
    low = Elower
    Te = iter(S, (Mbegin,Sbegin), (Mend,Send), low, high) ∘ e
  T = Tp ∘ Te
  T = ((Sbegin ⊄ S) ? Tinit : id) ∘ T ∘ ((Send ⊄ S) ? Texit : id)
  if (is_constant_p(Eupper) and is_constant_p(Elower)) then
    if (mode = sequence and ibegin = iend) then
      return Ts
    else
      return T
  else
    return T ⊔ Ts
end

pt_on_loop_one_iteration, presented in Algorithm 17. This eliminates the statements outside the path from Sbegin to Send. Moreover, since with symbolic parameters we cannot have this precision (we may execute one iteration or more), a union of T and Ts is returned. The result should be composed with the transformer of the index initialization Tinit and/or the exit condition of the loop Texit, depending on the position of Sbegin and Send in the loop.

An example of the result provided by the path transformer algorithm is illustrated in Figure 6.13. The left side presents a code that contains a


ALGORITHM 17: Transformers for one iteration and a constant/symbolic number of iterations h−l+1 of a loop S

function pt_on_loop_one_iteration(S, (Mbegin,Sbegin), (Mend,Send))
  forloop(I, Elower, Eupper, Sbody) = S
  Tinit = T(I = Elower)
  Tenter = T(I <= Eupper)
  Tinc = T(I = I+1)
  Texit = T(I > Eupper)
  Ts = pt_on(Sbody, (Mbegin,Sbegin), (Mend,Send), sequence)
  Ts = (Sbegin ⊂ S and Send ⊄ S) ? Ts ∘ Tinc ∘ Texit : Ts
  Ts = (Send ⊂ S and Sbegin ⊄ S) ? Tinit ∘ Tenter ∘ Ts : Ts
  return Ts
end

function iter(S, (Mbegin,Sbegin), (Mend,Send), l, h)
  forloop(I, Elower, Eupper, Sbody) = S
  Tenter = T(I <= h)
  Tinc = T(I = I+1)
  Tbody = pt_on(Sbody, (Mbegin,Sbegin), (Mend,Send), permanent)
  Tcomplete_body = Tenter ∘ Tbody ∘ Tinc
  if (is_constant_p(Eupper) and is_constant_p(Elower))
    return T(I = l) ∘ (Tcomplete_body)^(h−l+1)
  else
    Tstar = affine_derivative_closure(Tcomplete_body)
    return Tstar
end

sequence of two loops and an incrementation instruction, which makes the computation of the path transformer required when performing parallelization. Indeed, comparing array accesses in the two loops must take into account the fact that n used in the first loop is not the same as n in the second one, because it is incremented between the two loops. The right side shows the resulting path transformer between the two statements Sb and Se. It shows the result of the call to path_transformer(S, (Mb,Sb), (Me,Se)). We suppose here that Mb(i) = 0 and Me(j) = n−1. Notice that the constraint n = n_init + 1 is important to detect that n is changed between the two loops via the n++ instruction.

An optimization, for the cases of a statement S of type sequence or loop, would be to return the transformer T(S) of S if S contains neither Sbegin nor Send.


S {
 l1:
   for (i = 0; i < n; i++) {
 Sb:  a[i] = a[i]+1;
      k1++;
   }
   n++;
 l2:
   for (j = 0; j < n; j++) {
      k2++;
 Se:  b[j] = a[j]+a[j+1];
   }
}

path_transformer(S, (Mb,Sb), (Me,Se)) =
T(i, j, k1, k2, n) {
  i+k1_init = i_init+k1,
  j+k2_init = k2−1,
  n = n_init+1,
  i_init+1 ≤ i,
  n_init ≤ i,
  k2_init+1 ≤ k2,
  k2 ≤ k2_init+n
}

Figure 6.13: An example of a C code and the path transformer between Sb and Se

6.4.3 Operations on Regions using Path Transformers

In order to make possible operations combining the sets of regions of two statements S1 and S2, the set of regions R1 of S1 must be brought to the store of the second set of regions, R2, of S2. Therefore, we first compute the path transformer that links the two statements of the two sets of regions, as follows:

T12 = path_transformer(S, (M1, S1), (M2, S2))

where S is a statement that verifies S1 ⊂ S and S2 ⊂ S; M1 and M2 contain information about the indices of these statements in the potentially existing loops in S. Then, we compose the second set of regions R2 with T12, to bring R2 to a common memory store with R1. We thus use the operation region_transformer_compose, already implemented in PIPS [34], as follows:

R′2 = region_transformer_compose(R2, T12)

We thus update the operations defined in Section 2.6.3: (1) the intersection of the two sets of regions R1 and R2 is obtained via the call regions_intersection(R1, R′2); (2) the difference between R1 and R2 is obtained via the call regions_difference(R1, R′2); and (3) the union of R1 and R2 is obtained via the call regions_union(R1, R′2).

For example, if we want to compute RAW dependences between the Write regions Rw1 of Loop l1 and the Read regions Rr2 of Loop l2 in the code in Figure 6.13, we must first find T12, the path transformer between the two loops. The result is then:

RAW(l1, l2) = regions_intersection(Rw1, region_transformer_compose(Rr2, T12))

If we assume the size of Arrays a and b is n = 20, for example, we obtain the following results:

Rw1 = <a(φ1)-W-EXACT-{0 ≤ φ1, φ1 ≤ 19, φ1 + 1 ≤ n}>
Rr2 = <a(φ1)-R-EXACT-{0 ≤ φ1, φ1 ≤ 19, φ1 ≤ n}>
T12 = T(i, k1, n) {k1 = k1_init + i, n = n_init + 1, i ≥ 0, i ≥ n_init}
R′r2 = <a(φ1)-R-EXACT-{0 ≤ φ1, φ1 ≤ 19, φ1 ≤ n + 1}>

Rw1 ∩ R′r2 = <a(φ1)-WR-EXACT-{0 ≤ φ1, φ1 ≤ 19, φ1 + 1 ≤ n}>

Notice how the φ1 accesses to Array a differ in Rr2 and R′r2 (after composition with T12), because of the equality n = n_init + 1 in T12, due to the modification of n between the two loops via the n++ instruction.

In our context of combining path transformers with regions to compute communication costs, dependences and data volumes between the vertices of a DAG, for each loop of index i, Mbegin(i) must be set to the lower bound of the loop and Mend(i) to the upper one (the default values). We are not interested here in computing the path transformer between statements of specific iterations. However, our algorithm is general enough to handle loop parallelism, for example, or transformations on loops where one wants access to statements of a particular iteration.

6.5 BDSC-Based Hierarchical Scheduling (HBDSC)

Now that all the information needed by the basic version of BDSC presented in the previous chapter has been gathered, we detail in this section how we suggest adapting it to different SDGs, linked hierarchically via the mapping function H introduced in Section 6.2.2 on page 96, in order to eventually generate nested parallel code when possible. We adopt in this section the unstructured parallel programming model of Section 4.2.2, since it offers the freedom to implement arbitrary parallel patterns and since SDGs implement this model. Therefore, we use the parallel unstructured construct of SPIRE to encode the generated parallel code.


6.5.1 Closure of SDGs

An SDG G of a nested sequence S may depend upon statements outside S. We introduce the function closure, defined in Algorithm 18, to provide a self-contained version of G. There, G is completed with a set of entry vertices and edges, in order to recover dependences coming from outside S. We use the predecessors τp in the DDG D to find these dependences Rs; for each dependence, we create an entry vertex, and we schedule it in the same cluster as τp. Function reference is used here to find the reference expression of the dependence. Note that these import vertices will be used to schedule S using BDSC, in order to compute the updated values of tlevel and blevel, impacted by the communication cost of each vertex in G relative to all SDGs in H. We set their task time to 0.

ALGORITHM 18: Computation of the closure of the SDG G of a statement S = sequence(S1;...;Sn)

function closure(sequence(S1;...;Sn), G)
  D = DDG(sequence(S1;...;Sn))
  foreach Si ∈ {S1, S2, ..., Sn}
    τd = statement_vertex(Si, D)
    τk = statement_vertex(Si, G)
    foreach τp / (τp ∈ predecessors(τd, D) and τp ∉ vertices(G))
      Rs = regions_intersection(read_regions(Si),
                                write_regions(vertex_statement(τp)))
      foreach R ∈ Rs
        sentry = import(reference(R))
        τentry = new_vertex(sentry)
        cluster(τentry) = cluster(τp)
        add_vertex(τentry, G)
        add_edge((τentry, τk), G)
  return G
end

For instance, in the C code given in the left of Figure 6.3, parallelizing S0 may decrease its execution time, but it may also increase the execution time of the loop indexed by i; this could be due to communications from outside S0, if we presume, to simplify, that every statement is scheduled on a different cluster. The corresponding SDG of S0, presented in Figure 6.4, is not complete, because of dependences, not explicit in this DAG, coming from outside S0. In order to make G0 self-contained, we add these dependences via a set of additional import entry vertices τentry, scheduled in the same cluster as the corresponding statements outside S0. We extract these dependences from the DDG presented in the right of Figure 6.3, as illustrated in Function closure; the closure of G0 is given in Figure 6.14. The scheduling of S0 needs to take into account all the dependences coming from outside S0; we


thus add three vertices to encode the communication costs of i, a[i] and c at the entry of S0.

Figure 6.14: Closure of the SDG G0 of the C code S0 in Figure 6.3, with additional import entry vertices

6.5.2 Recursive Top-Down Scheduling

Hierarchically scheduling a given statement S of SDG H(S) in a cluster κ is seen here as the definition of a hierarchical schedule σ, which maps each substatement s of S to σ(s) = (s′, κ, n). If there are enough processor and memory resources to schedule S using BDSC, (s′, κ, n) is a triplet made of a parallel statement s′ = parallel(σ(s)), the cluster κ = cluster(σ(s)) where s is being allocated, and the number n = nbclusters(σ(s)) of clusters the inner scheduling of s′ requires. Otherwise, scheduling is impossible, and the program stops. In a scheduled statement, all sequences are replaced by parallel unstructured statements.

A successful call to the HBDSC(S, H, κ, P, M, σ) function, defined in Algorithm 19, which assumes that P is strictly positive, yields a new version of σ that schedules S into κ and takes into account all substatements of S; only P clusters, with a data size at most M each, can be used for scheduling. σ[S → (S′, κ, n)] is the function equal to σ except for S, where its value is (S′, κ, n). H is the function that yields an SDG for each S to be scheduled using BDSC.
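To fix ideas, a hierarchical schedule entry σ(S) = (S′, κ, n) can be pictured as a small record attached to each scheduled statement. The following C sketch is purely illustrative: the type and field names are ours and do not correspond to the actual PIPS internals.

    /* Illustrative only: a record standing for sigma(S) = (S', kappa, n). */
    typedef struct statement statement;   /* opaque statement type (assumed) */

    typedef struct {
        statement *parallel_stmt;   /* S'    = parallel(sigma(S))    */
        int        cluster;         /* kappa = cluster(sigma(S))     */
        int        nbclusters;      /* n     = nbclusters(sigma(S))  */
    } schedule_entry;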

Our approach is top-down, in order to yield tasks that are as coarse grained as possible when dealing with sequences. In the HBDSC function, we


ALGORITHM 19: BDSC-based update of Schedule σ for Statement S of SDG H(S), with P and M constraints

function HBDSC(S, H, κ, P, M, σ)
  switch (S)
    case call:
      return σ[S → (S, κ, 0)]
    case sequence(S1;...;Sn):
      Gseq = closure(S, H(S))
      G′ = BDSC(Gseq, P, M, σ)
      iter = 0
      do
        σ′ = HBDSC_step(G′, H, κ, P, M, σ)
        G = G′
        G′ = BDSC(Gseq, P, M, σ′)
        if (clusters(G′) = ∅) then
          abort('Unable to schedule')
        iter++
        σ = σ′
      while (completion_time(G′) < completion_time(G) and
             |clusters(G′)| ≤ |clusters(G)| and
             iter ≤ MAX_ITER)
      return σ[S → (dag_to_unstructured(G), κ, |clusters(G)|)]
    case forloop(I, Elower, Eupper, Sbody):
      σ′ = HBDSC(Sbody, H, κ, P, M, σ)
      (S′body, κbody, nbclustersbody) = σ′(Sbody)
      return σ′[S → (forloop(I, Elower, Eupper, S′body), κ, nbclustersbody)]
    case test(Econd, St, Sf):
      σ′ = HBDSC(St, H, κ, P, M, σ)
      σ″ = HBDSC(Sf, H, κ, P, M, σ′)
      (S′t, κt, nbclusters_t) = σ″(St)
      (S′f, κf, nbclusters_f) = σ″(Sf)
      return σ″[S → (test(Econd, S′t, S′f), κ, max(nbclusters_t, nbclusters_f))]
end

distinguish four cases of statements. First, the constructs of loops⁸ and tests are simply traversed, scheduling information being recursively gathered in different SDGs. Then, for a call statement, there is no descent in the call

⁸ Regarding parallel loops, since we adopt the task parallelism paradigm, note that it may initially be useful to apply the tiling transformation and then perform full unrolling of the outer loop (we give more details in the protocol of our experiments, in Section 8.4). This way, the input code contains more potentially parallel tasks, resulting from the initial (parallel) loop.


graph: the call statement is returned. In order to handle the corresponding called function, one has to treat the different functions separately. Finally, for a sequence S, one first accesses its SDG and computes a closure of this DAG, Gseq, using the function closure defined above. Next, Gseq is scheduled using BDSC, to generate a scheduled SDG G′.

The hierarchical scheduling process is then recursively performed, to take into account the substatements of S, within Function HBDSC_step, defined in Algorithm 20, on each statement s of each task of G′. There, G′ is traversed along a topological sort-ordered descent using the function topsort(G′), which yields a list of stages of computation, each cluster_stage being a list of independent lists L of tasks τ, one L for each cluster κ generated by BDSC for this particular stage, in the topological order.

The recursive hierarchical scheduling via HBDSC, within the function HBDSC_step, of each statement s = vertex_statement(τ) may take advantage of at most P′ available clusters, since |cluster_stage| clusters are already reserved to schedule the current stage cluster_stage of tasks for Statement S. It yields a new scheduling function σs. Otherwise, if no clusters are available, all statements enclosed into s are scheduled on the same cluster as their parent, κ. We use the straightforward function same_cluster_mapping (not provided here) to recursively assign (se, κ, 0) to σ(se) for each se enclosed into s.

Figure 6.15 illustrates the various entities involved in the computation of such a scheduling function. Note that one needs to be careful, in HBDSC_step, to ensure that each rescheduled substatement s is allocated a number of clusters consistent with the one used when computing its parallel execution time: we check the condition nbclusters′s ≥ nbclusterss, which ensures that the parallelism assumed when computing time complexities within s remains available.

Figure 6.15: topsort(G) for the hierarchical scheduling of sequences

Cluster allocation information, for each substatement s whose vertex in G′ is τ, is maintained in σ via the recursive call to HBDSC, this time with


ALGORITHM 20: Iterative hierarchical scheduling step for DAG fixpoint computation

function HBDSC_step(G′, H, κ, P, M, σ)
  foreach cluster_stage ∈ topsort(G′)
    P′ = P - |cluster_stage|
    foreach L ∈ cluster_stage
      nbclustersL = 0
      foreach τ ∈ L
        s = vertex_statement(τ)
        if (P′ ≤ 0) then
          σ = σ[s → same_cluster_mapping(s, κ, σ)]
        else
          nbclusterss = (s ∈ domain(σ)) ? nbclusters(σ(s)) : 0
          σs = HBDSC(s, H, cluster(τ), P′, M, σ)
          nbclusters′s = nbclusters(σs(s))
          if (nbclusters′s ≥ nbclusterss and
              task_time(τ, σ) ≥ task_time(τ, σs)) then
            nbclusterss = nbclusters′s
            σ = σs
          nbclustersL = max(nbclustersL, nbclusterss)
      P′ -= nbclustersL
  return σ
end

the current cluster κ = cluster(τ). For the non-sequence constructs in Function HBDSC, the cluster information is set to κ, the current cluster.

The scheduling of a sequence yields a parallel unstructured statement: we use the function dag_to_unstructured(G), which returns a SPIRE unstructured statement Su from the SDG G, where the vertices of G are the statement control vertices su of Su, and the edges of G constitute the list of successors Lsucc of su, while the list of predecessors Lpred of su is deduced from Lsucc. Note that the vertices and edges of G are not changed before and after scheduling; however, the scheduling information is saved in σ.

As an application of our scheduling algorithm on a real application, Figure 6.16 shows the scheduled SDG for Harris, obtained using P = 3 and generated automatically with PIPS using the Graphviz tool.

6.5.3 Iterative Scheduling for Resource Optimization

BDSC is called in HBDSC before substatements are hierarchically scheduled. However, a single pass over substatements could be suboptimal, since parallelism may exist within substatements: it may be discovered by later recursive calls to HBDSC. Yet, if this parallelism had been known ahead of


Figure 6.16: Scheduled SDG for Harris, using P = 3 cores; the scheduling information, via cluster(τ), is also printed inside each vertex of the SDG

time, previous values of task_time used by BDSC would possibly have been smaller, which could have had an impact on the higher-level scheduling. In order to address this issue, our hierarchical scheduling algorithm iterates the top-down pass HBDSC_step on the new DAG G′, in which BDSC takes into account these modified task complexities; iteration continues while G′ provides a smaller DAG scheduling length than G and the iteration limit MAX_ITER has not been reached. We compute the completion time of the DAG G as follows:

completion_time(G) = max_{κ ∈ clusters(G)} cluster_time(κ)

One constraint, due to the iterative nature of the hierarchical scheduling, is that, in BDSC, zeroing cannot be performed between the import entry vertices and their successors. This keeps an independence, in terms of allocated clusters, between the different levels of the hierarchy. Indeed, at a higher level, for S, if we assume that we have scheduled the parts Se inside (hierarchically) S, attempting to reschedule S iteratively cancels the previous schedule of S but maintains the schedule of Se, and vice versa. Therefore, for each sequence, we have to deal with a new set of clusters, and thus zeroing cannot be performed between these entry vertices and their successors.

Note that our top-down, iterative, hierarchical scheduling approach also helps dealing with limited memory resources. If BDSC fails at first because not enough memory is available for a given task, the HBDSC_step function is nonetheless called to schedule nested statements, possibly loosening up the memory constraints by distributing some of the work on less memory-challenged additional clusters. This might enable the subsequent call to BDSC to succeed.


For instance, in Figure 6.17, we assume that P = 3 and that the memory M of each available cluster κ0, κ1 or κ2 is less than the size of Array a. Therefore, the first application of BDSC on S = sequence(S0;S1) will fail ("Not enough memory"). Thanks to the while loop introduced in the hierarchical scheduling algorithm, the scheduling of the hierarchical parts inside S0 and S1 hopefully minimizes the task data of each statement. Thus, a second schedule succeeds, as is illustrated in Figure 6.18. Note that the edge cost between tasks S0 and S1 is equal to 0 (the region on the edge in the figure is equal to the empty set). Indeed, the communication is made at an inner level, between κ1 and κ3, due to the entry vertex added to H(S1body). This cost is obtained after the hierarchical scheduling of S; it is thus a parallel edge cost (we define the parallel task data and edge cost of a parallel statement task in Section 6.5.5).

H(S):

  S0:
    for (i = 1; i <= n; i++) {   /* S0body */
      a[i] = foo(); bar(b[i]);
    }

  S1:
    for (i = 1; i <= n; i++) {   /* S1body */
      sum += a[i]; c[i] = baz();
    }

  <a(Φ1)-WR-EXACT-{1 ≤ Φ1, Φ1 ≤ n}>

Figure 6.17: After the first application of BDSC on S = sequence(S0;S1), failing with "Not enough memory"

6.5.4 Complexity of HBDSC Algorithm

Theorem 2. The time complexity of Algorithm 19 (HBDSC) over Statement S is O(k^n), where n is the number of call statements in S and k is a constant greater than 1.

Proof. Let t(l) be the worst-case time complexity of our hierarchical scheduling algorithm on the structured statement S of hierarchical level⁹

⁹ Levels represent the hierarchy structure between statements of the AST, and are counted up from the leaves to the root.


H(S0body):   κ1: a[i] = foo();    κ2: bar(b[i]);

closure(S1body, H(S1body)):
  κ1: import(a[i])  →  κ3: sum += a[i];    κ4: c[i] = baz();
  edge label: <a(Φ1)-WR-EXACT-{Φ1 = i}>

H(S):
  S0 (κ0): for (i = 1; i <= n; i++) { /* S0body */ a[i] = foo(); bar(b[i]); }
  S1 (κ0): for (i = 1; i <= n; i++) { /* S1body */ sum += a[i]; c[i] = baz(); }

Figure 6.18: Hierarchically (via H) scheduled SDGs, with memory resource minimization, after the second application of BDSC (which succeeds) on sequence(S0;S1) (κi = cluster(σ(Si))). To keep the picture readable, only communication edges are figured in these SDGs

l. Time complexity increases significantly only in sequences, loops and tests being simply managed by straightforward recursive calls of HBDSC on substatements. For a sequence S, t(l) is proportional to the time complexity of BDSC followed by a call to HBDSC_step; the proportionality constant is k = MAX_ITER (supposed to be greater than 1).

The time complexity of BDSC for a sequence of m statements is at most O(m³) (see Theorem 1). Assuming that all subsequences have a maximum number m of (possibly compound) statements, the time complexity of the hierarchical scheduling step function is the time complexity of the topological sort algorithm followed by a recursive call to HBDSC, and is thus O(m² + m t(l−1)). Thus t(l) is at most proportional to k(m³ + m² + m t(l−1)) ∼ km³ + km t(l−1). Since t(l) is an arithmetico-geometric series, its analytical value is

t(l) = ((km)^l (km³ + km − 1) − km³) / (km − 1) ∼ (km)^l m².

Let lS be the level for the whole Statement S. The worst performance occurs when the structure of S is flat, i.e., when lS ∼ n and m is O(1); hence t(n) = t(lS) ∼ k^n.

Even though the worst-case time complexity of HBDSC is exponential, we expect, and our experiments suggest, that it behaves more tamely on actual, properly structured code. Indeed, note that lS ∼ log_m(n) if S is balanced, for some large constant m; in this case, t(n) ∼ (km)^{log(n)}, showing a subexponential time complexity.

6.5.5 Parallel Cost Models

In Section 6.3, we present the sequential cost models usable in the case of sequential codes, i.e., for each first call to BDSC. When a substatement Se of S (Se ⊂ S) is parallelized, the parameters task_time, task_data and edge_cost are modified for Se, and thus for S. Thus, hierarchical scheduling must use extended definitions of task_time, task_data and edge_cost for tasks τ whose statements S = vertex_statement(τ) are parallel unstructured statements, extending the definitions provided in Section 6.3, which still apply to non-unstructured statements. For such a case, we assume that BDSC and other relevant functions take σ as an additional argument, to access the scheduling result associated to statement sequences, and handle the modified definitions of task_time, edge_cost and task_data. These functions are defined as follows.

Parallel Task Time

When Statement S = vertex_statement(τ) corresponds to a parallel unstructured statement (i.e., such that S ∈ domain(σ)), let Su = parallel(σ(S)) be this parallel unstructured statement. We define task_time as follows, where Function clusters returns the set of clusters used to schedule Su and Function control_statements returns the statements of Su:

task_time(τ, σ) = max_{κ ∈ clusters(Su, σ)} cluster_time(κ)

clusters(Su, σ) = {cluster(σ(s)) / s ∈ control_statements(Su)}

Otherwise, the previous definition of task_time on S, provided in Section 6.3, is used. The recursive computation of complexities accesses this new definition in the case of parallel unstructured statements.

Parallel Task Data and Edge Cost

To define the task_data and edge_cost parameters, which depend on the computation of array regions¹⁰, we need extended versions of the read_regions

¹⁰ Note that every operation combining regions of two statements used in this chapter assumes the use of the path transformer, via the function region_transformer_compose, as illustrated in Section 6.4.3, in order to bring the regions of the two statements to a common memory store.


and write_regions functions that take into account the allocation information to clusters. These functions take σ as an additional argument, to access the allocation information cluster(σ):

S = vertex_statement(τ)

task_data(τ, σ) = regions_union(read_regions(S, σ), write_regions(S, σ))

Rw1 = write_regions(vertex_statement(τ1), σ)
Rr2 = read_regions(vertex_statement(τ2), σ)

edge_cost_bytes(τ1, τ2, σ) = Σ_{r ∈ regions_intersection(Rw1, Rr2)} ehrhart(r)

edge_cost(τ1, τ2, σ) = α + β × edge_cost_bytes(τ1, τ2, σ)

When statements are parallel unstructured statements Su, we propose to add a condition on the accumulation of regions in the sequential recursive computation functions of these regions, as follows, where Function stmts_in_cluster returns the set of statements scheduled in the same cluster as Su:

read_regions(Su, σ) = regions_union_{s ∈ stmts_in_cluster(Su, σ)} read_regions(s, σ)

write_regions(Su, σ) = regions_union_{s ∈ stmts_in_cluster(Su, σ)} write_regions(s, σ)

stmts_in_cluster(Su, σ) = {s ∈ control_statements(Su) / cluster(σ(s)) = cluster(σ(Su))}

Otherwise, the sequential definitions of read_regions and write_regions on S, provided in Section 6.3, are used. The recursive computation of array regions accesses this new definition in the case of parallel unstructured statements.

6.6 Related Work: Task Parallelization Tools

In this section, we review several other approaches that intend to automate the parallelization of programs, using different granularities and scheduling policies. Given the breadth of literature on this subject, we limit this presentation to approaches that focus on static list-scheduling methods, and compare them with BDSC.

Sarkar's work on the partitioning and scheduling of parallel programs [96] for multiprocessors introduced a compile-time method where a GR (Graphical Representation) program is partitioned into parallel tasks at compile


time. A GR graph has four kinds of vertices: "simple", to represent an indivisible sequential computation; "function call"; "parallel", to represent parallel loops; and "compound", for conditional instructions. Sarkar presents an approximation parallelization algorithm. Starting with an initial fine granularity partition P0, tasks (chosen by heuristics) are iteratively merged till the coarsest partition Pn (with one task containing all vertices), after n iterations, is reached. The partition Pmin with the smallest parallel execution time in the presence of overhead (scheduling and communication overhead) is chosen. For scheduling, Sarkar introduces the EZ (Edge-Zeroing) algorithm, which uses bottom levels for ordering; it is based on edge weights for clustering: all edges are examined, from the largest edge weight to the smallest, and the algorithm proceeds by zeroing the highest edge weight if the completion time decreases. While this algorithm is based only on the bottom level, for an unbounded number of processors, and does not recompute the priorities after zeroings, BDSC adds resource constraints and is based on both bottom levels and dynamic top levels.

The OSCAR Fortran Compiler [62] is used as a multigrain parallelizer from Fortran to parallelized OpenMP Fortran. OSCAR partitions a program into a macro-task graph, where vertices represent macro-tasks of three kinds, namely basic, repetition and subroutine blocks. The coarse grain task parallelization proceeds as follows. First, the macro-tasks are generated by decomposition of the source program. Then, a macro-flow graph is generated to represent data and control dependences on macro-tasks. The macro-task graph is subsequently generated via the analysis of parallelism among macro-tasks, using an earliest executable condition analysis that represents the conditions on which a given macro-task may begin its execution at the earliest time, assuming precedence constraints. If a macro-task graph has only data dependence edges, macro-tasks are assigned to processors by static scheduling. If a macro-task graph has both data and control dependence edges, macro-tasks are assigned to processors at run time by a dynamic scheduling routine. In addition to dealing with a richer set of resource constraints, BDSC targets both shared and distributed memory systems, with a cost model based on communication, used data and time estimations.

Pedigree [83] is a compilation tool based on the program dependence graph (PDG). The PDG is extended by adding a new type of vertex, a "Par" vertex, which groups children vertices reachable via the same branch conditions. Pedigree proceeds by estimating a latency for each vertex and data dependence edge weights in the PDG. The scheduling process orders the children and assigns them to a subset of the processors. For scheduling, vertices with minimum early and late times are given highest priority; the highest-priority ready vertex is selected for scheduling, based on the synchronization overhead and latency. While Pedigree operates on assembly code, PIPS and our extension for task parallelism using BDSC offer a higher-level, source-to-source parallelization framework. Moreover, Pedigree-generated


code is specialized for symmetric multiprocessors only, while BDSC targets many architecture types, thanks to its resource constraints and cost models.

The SPIR (Signal Processing Intermediate Representation) compiler [31] takes a sequential dataflow program as input and generates a multithreaded parallel program for an embedded multicore system. First, SPIR builds a stream graph, where a vertex corresponds to a kernel function call or to the condition of an "if" statement; an edge denotes a transfer of data between two kernel function calls, or a control transfer by an "if" statement (true or false). Then, for task scheduling purposes, given a stream graph and a target platform, the task scheduler assigns each vertex to a processor in the target platform. It allocates stream buffers and generates DMA operations under given memory and timing constraints. The degree of automation of BDSC is larger than SPIR's, because the latter system needs several keyword extensions plus the C code denoting the streaming scope within applications. Also, the granularity in SPIR is a function, whereas BDSC uses several granularity levels.

We collect in Table 6.2 the main characteristics of each parallelization tool addressed in this section.

              blevel  tlevel  Resource     Dependence        Execution  Communication  Memory
                              constraints  control   data    time est.  time est.      model
HBDSC           √       √       √            X         √       √          √            Shared/distributed
Sarkar's work   √       X       X            X         √       √          √            Shared/distributed
OSCAR           X       √       X            √         √       √          X            Shared
Pedigree        √       √       X            √         √       √          X            Shared
SPIR            X       √       √            √         √       √          √            Shared

Table 6.2: Comparison summary between different parallelization tools

6.7 Conclusion

This chapter presents our BDSC-based hierarchical task parallelization approach, which uses a new data structure called SDG and a symbolic execution and communication cost model based on either static code analysis


or a dynamic-based instrumentation assessment tool. The HBDSC scheduling algorithm maps hierarchically, via a mapping function H, each task of the SDG to one cluster. The result is the generation of a parallel code represented in SPIRE(PIPS IR), using the parallel unstructured construct. We have integrated this approach within the PIPS automatic parallelization compiler infrastructure (see Section 8.2 for the implementation details).

The next chapter illustrates two parallel transformations of SPIRE(PIPS IR) parallel code and the generation of OpenMP and MPI codes from SPIRE(PIPS IR).

An earlier version of the work presented in the previous chapter (Chapter 5) and a part of the work presented in this chapter have been submitted to Parallel Computing [65].

Chapter 7

SPIRE-Based Parallel Code Transformations and Generation

Satisfaction lies in the effort, not in the attainment; full effort is full victory.
Gandhi

In the previous chapter, we described how sequential programs are analyzed, scheduled and represented in a parallel intermediate representation. Their parallel versions are expressed as graphs using the parallel unstructured attribute of SPIRE, but such graphs are not directly expressible in parallel languages. To test the flexibility of this parallelization approach, this chapter covers three aspects of parallel code transformations and generation: (1) the transformation of SPIRE(PIPS IR)¹ code from the unstructured parallel programming model to a structured (high-level) model, using the spawn and barrier constructs; (2) the transformation from the shared to the distributed memory programming model; and (3) the code targeting of parallel languages from SPIRE(PIPS IR). We have implemented our SPIRE-derived parallel IR and our BDSC-based task parallelization algorithm in the PIPS middle-end. We generate both OpenMP and MPI code from the same parallel IR, using the PIPS backend and its prettyprinters.

7.1 Introduction

Generally, the process of source-to-source code generation is a difficult task. In fact, compilers use at least two intermediate representations when transforming a source code written in a programming language, abstracted in a first IR, to a source code written in another programming language, abstracted in a second IR. Moreover, this difficulty is compounded whenever the resulting source code is to be executed on a machine not targeted by its language. The portability issue was a key factor when we designed SPIRE as a generic approach (Chapter 4). Indeed, representing different parallel constructs from different parallel programming languages makes the code generation process generic. In the context of this thesis, we need to generate the equivalent parallel version of a sequential code; representing

¹ Recall that SPIRE(PIPS IR) means that the SPIRE transformation function is applied to the sequential IR of PIPS; it yields a parallel IR of PIPS.


the parallel code in SPIRE facilitates parallel code generation in different parallel programming languages.

The parallel code encoded in SPIRE(PIPS IR) that we generate in Chapter 6 uses an unstructured parallel programming model based on static task graphs. However, structured programming constructs increase the level of abstraction when writing parallel codes, and structured parallel programs are easier to analyze. Moreover, parallel unstructured does not exist in parallel languages. In this chapter, we use SPIRE both for the input unstructured parallel code and the output structured parallel code, by translating the SPIRE unstructured construct to spawn and barrier constructs. Another parallel transformation we address is to transform a SPIRE shared memory code into a SPIRE distributed memory one, in order to adapt the SPIRE code to message-passing libraries/languages such as MPI, and thus distributed memory system architectures.

A parallel program transformation takes the parallel intermediate representation of a code fragment and yields a new, equivalent parallel representation for it. Such transformations are useful for optimization purposes, or to enable other optimizations that intend to improve execution time, data locality, energy, memory use, etc. While parallel code generation outputs a final code, written in a parallel programming language, that can be compiled by its compiler and run on a target machine, a parallel transformation generates an intermediate code written in a parallel intermediate representation.

In this chapter, a parallel transformation takes a hierarchical schedule σ, defined in Section 6.5.2 on page 118, and yields another schedule σ′, which maps each statement S in domain(σ) to σ′(S) = (S′, κ, n), such that S′ belongs to the resulting parallel IR, κ = cluster(σ(S)) is the cluster where S is allocated, and n = nbclusters(σ(S)) is the number of clusters the inner scheduling of S′ requires. A cluster stands for a processor.

Today's typical architectures have multiple hierarchical levels. A multiprocessor has many nodes, each node has multiple clusters, each cluster has multiple cores, and each core has multiple physical threads. Therefore, in addition to distinguishing the parallelism in terms of abstraction of parallel constructs, such as structured and unstructured, we can also classify parallelism in two types in terms of code structure: equilevel, when it is generated within a sequence, i.e., zero hierarchy level; and hierarchical, or multilevel, when one manages the code hierarchy, i.e., multiple levels. In this chapter, we handle these two types of parallelism in order to enhance the performance of the generated parallel code.

The remainder of this chapter is structured into three sections. Firstly, Section 7.2 presents the first of the two parallel transformations we designed, the unstructured to structured parallel code transformation. Secondly, the transformation from shared memory to distributed memory codes is presented in Section 7.3. Moving to code generation issues, Section 7.4


details the generation of parallel programs from SPIRE(PIPS IR): first, Section 7.4.1 shows the mapping approach between SPIRE and different parallel languages, and then, as illustrating cases, Section 7.4.2 introduces the generation of OpenMP and Section 7.4.3 the generation of MPI from SPIRE(PIPS IR). Finally, we conclude in Section 7.5. Figure 7.1 summarizes the parallel code transformations and generation applied in this chapter.

Unstructured SPIRE(PIPS IR), Schedule, SDG
  → (Structuring) → Structured SPIRE(PIPS IR)
  → (Data Transfer Generation) → send/recv-Based IR
      → (OpenMP Task Generation: spawn → omp task) → OpenMP Pragma-Based IR → (OpenMP Prettyprinter) → OpenMP Code
      → (SPMDization: spawn → guards) → MPI-Based IR → (MPI Prettyprinter) → MPI Code

Figure 7.1: Parallel code transformations and generation; blue indicates this chapter's contributions, an ellipse a process, and a rectangle results


7.2 Parallel Unstructured to Structured Transformation

We present in Chapter 4 the application of the SPIRE methodology to the PIPS IR case; Chapter 6 shows how to construct unstructured parallel code presented in SPIRE(PIPS IR). In this section, we detail the algorithm transforming unstructured (mid, graph-based level) parallel code to structured (high, syntactic level) code expressed in SPIRE(PIPS IR), and the implementation of the structured model in PIPS. The goal behind structuring parallelism is to increase the level of abstraction of parallel programs and to simplify analysis and optimization passes. Another reason is the fact that parallel unstructured does not exist in parallel languages. However, this has a cost in terms of performance if the code contains arbitrary parallel patterns. For instance, the example already presented in Figure 4.3, which uses unstructured constructs to form a graph, and copied again for convenience in Figure 7.2, can be represented in a low-level, unstructured way using events ((b) of Figure 7.2) and in a structured way using spawn² and barrier constructs ((c) of Figure 7.2). The structured representation (c) minimizes control overhead. Moreover, it is a trade-off between structuring and performance. Indeed, if the tasks have the same execution time, the structured way is better than the unstructured one, since no synchronizations using events are necessary. Therefore, depending on the run-time properties of the tasks, it may be better to run one form or the other.

7.2.1 Structuring Parallelism

Function unstructured_to_structured, presented in Algorithm 21, uses a hierarchical schedule σ that maps each substatement s of S to σ(s) = (s′, κ, n) (see Section 6.5) to handle new statements. It specifies how the unstructured scheduled statement parallel(σ(S)), presented in Section 6.5, is used to generate a structured parallel statement, encoded also in SPIRE(PIPS IR), using spawn and barrier constructs. Note that another transformation would be to generate events and spawn constructs from the unstructured constructs (as in (b) of Figure 7.2).

For an unstructured statement S (other constructs are simply traversed; parallel statements are reconstructed using the original statements), its control vertices are traversed along a topological sort-ordered descent. As an unstructured statement, by definition, is also a graph (control statements are the vertices, while the predecessor (resp. successor) control statements Sn of a control statement S define edges from Sn to S (resp. from S to Sn)), we use the same idea as in Section 6.5's topsort, but this time from an

² Note that, in this chapter, we assume for simplification purposes that constants, instead of temporary identifiers having the corresponding value, are used as the first parameter of spawn statements.


(a) Unstructured parallel control flow graph: between nop_entry and nop_exit, the statements x=3 and y=5 start in parallel; u=x+1 follows x=3, and z=x+y follows both x=3 and y=5.

(b) Event-based representation:

barrier(
  ex = newEvent(0);
  spawn(0,
    x = 3;
    signal(ex);
    u = x + 1;
  );
  spawn(1,
    y = 5;
    wait(ex);
    z = x + y;
  );
  freeEvent(ex);
)

(c) Structured representation:

barrier(
  spawn(0, x = 3);
  spawn(1, y = 5);
)
barrier(
  spawn(0, u = x + 1);
  spawn(1, z = x + y);
)

Figure 7.2: Unstructured parallel control flow graph (a), and event-based (b) and structured (c) representations

unstructured statement. We thus use Function topsort_unstructured(S), which yields a list of stages of computation, each cluster_stage being a list of independent lists L of statements s.

Cluster stages are seen as a set of fork-join structures: a fork-join structure can be seen as a set of spawn statements (to be launched in parallel) enclosed in a barrier statement. It can also be seen as a parallel sequence. In our implementation of structured code generation, we choose to generate spawn and barrier constructs instead of parallel sequence constructs because, in contrast to the parallel sequence construct, the spawn construct allows the implementation of recursive parallel tasks. Besides, we can represent a parallel loop via the spawn construct, but it cannot be implemented using parallel sequence.

New statements have to be added to domain(σ). Therefore, in order to set the third argument of σ (nbclusters(σ)) for these new statements, we use the function structured_clusters, which harvests the clusters used to schedule


ALGORITHM 21: Bottom-up hierarchical unstructured to structured SPIRE transformation, updating a parallel schedule σ for Statement S

function unstructured_to_structured(S, p, σ)
  κ = cluster(σ(S))
  n = nbclusters(σ(S))
  switch (S)
    case call:
      return σ
    case unstructured(Centry, Cexit, parallel):
      Sstructured = []
      foreach cluster_stage ∈ topsort_unstructured(S)
        (Sbarrier, σ) = spawns_barrier(cluster_stage, κ, p, σ)
        nbarrier = |structured_clusters(statements(Sbarrier), σ)|
        σ = σ[Sbarrier → (Sbarrier, κ, nbarrier)]
        Sstructured ||= [Sbarrier]
      S′ = sequence(Sstructured)
      return σ[S → (S′, κ, n)]
    case forloop(I, Elower, Eupper, Sbody):
      σ = unstructured_to_structured(Sbody, p, σ)
      S′ = forloop(I, Elower, Eupper, parallel(σ(Sbody)))
      return σ[S → (S′, κ, n)]
    case test(Econd, St, Sf):
      σ = unstructured_to_structured(St, p, σ)
      σ = unstructured_to_structured(Sf, p, σ)
      S′ = test(Econd, parallel(σ(St)), parallel(σ(Sf)))
      return σ[S → (S′, κ, n)]
end

function structured_clusters({S1, S2, ..., Sn}, σ)
  clusters = ∅
  foreach Si ∈ {S1, S2, ..., Sn}
    if (cluster(σ(Si)) ∉ clusters)
      clusters ∪= {cluster(σ(Si))}
  return clusters
end

the list of statements in a sequence. Recall that Function statements of a sequence returns its list of statements.

We use Function spawns_barrier, detailed in Algorithm 22, to form spawn and barrier statements. In a cluster stage, each list L for each cluster κs corresponds to a spawn statement Sspawn. In order to assign a cluster number, as an entity, to the spawn statement, as required by SPIRE, we associate a cluster number to the cluster κs returned by cluster(σ(s)), via the cluster_number function; we thus posit that two clusters are equal if they have the same cluster number. However, since BDSC maps tasks


to logical clusters in Algorithm 19, with numbers in [0, p−1], we convert this logical number to a physical one using the formula nphysical = P − p + cluster_number(κs); P is the total number of clusters available in the target machine.

ALGORITHM 22: Spawns-barrier generation for a cluster stage

function spawns_barrier(cluster_stage, κ, p, σ)
  p′ = p - |cluster_stage|
  SL = []
  foreach L ∈ cluster_stage
    L′ = []
    nbclustersL = 0
    foreach s ∈ L
      κs = cluster(σ(s))            // all s have the same schedule κs
      if (p′ > 0)
        σ = unstructured_to_structured(s, p′, σ)
      L′ ||= [parallel(σ(s))]
      nbclustersL = max(nbclustersL, nbclusters(σ(s)))
    Sspawn = sequence(L′)
    nphysical = P - p + cluster_number(κs)
    cluster_number(κs) = nphysical
    synchronization(Sspawn) = spawn(nphysical)
    σ = σ[Sspawn → (Sspawn, κs, |structured_clusters(L′, σ)|)]
    SL ||= [Sspawn]
    p′ -= nbclustersL
  Sbarrier = sequence(SL)
  synchronization(Sbarrier) = barrier()
  return (Sbarrier, σ)
end

Spawn statements SL of one cluster stage are collected in a barrier statement Sbarrier as follows. First, the list of statements SL is constructed, using the || symbol for the concatenation of two lists of arguments. Then, Statement Sbarrier is created using Function sequence from the list of statements SL. Synchronization attributes should be set for the new statements Sspawn and Sbarrier: we thus use the functions spawn(I) and barrier() to set the synchronization attribute, spawn or barrier, of the different statements.

Figure 7.3 depicts an example of a SPIRE structured representation (in the right hand side) of a part of the C implementation of the main function of Harris, presented in the left hand side. Note that the scheduling results from our BDSC-based task parallelization process, using P = 3 (we use three clusters because the maximum parallelism in Harris is three). Here, up to three parallel tasks are enclosed in a barrier for each chain of functions Sobel, Multiplication and Gauss. The synchronization annotations spawn and barrier are printed within the SPIRE generated code;


/* Sequential Harris (left) */
in = InitHarris();
/* Sobel */
SobelX(Gx, in);
SobelY(Gy, in);
/* Multiply */
MultiplY(Ixx, Gx, Gx);
MultiplY(Iyy, Gy, Gy);
MultiplY(Ixy, Gx, Gy);
/* Gauss */
Gauss(Sxx, Ixx);
Gauss(Syy, Iyy);
Gauss(Sxy, Ixy);
/* Coarsity */
CoarsitY(out, Sxx, Syy, Sxy);

/* SPIRE structured representation (right) */
barrier(
  spawn(0, InitHarris(in))
)
barrier(
  spawn(0, SobelX(Gx, in));
  spawn(1, SobelY(Gy, in))
)
barrier(
  spawn(0, MultiplY(Ixx, Gx, Gx));
  spawn(1, MultiplY(Iyy, Gy, Gy));
  spawn(2, MultiplY(Ixy, Gx, Gy))
)
barrier(
  spawn(0, Gauss(Sxx, Ixx));
  spawn(1, Gauss(Syy, Iyy));
  spawn(2, Gauss(Sxy, Ixy))
)
barrier(
  spawn(0, CoarsitY(out, Sxx, Syy, Sxy))
)

Figure 7.3: A part of Harris, and one possible SPIRE core language representation
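Looking ahead to the code generation flow summarized in Figure 7.1 (spawn → omp task), one possible hand translation of the Sobel barrier block above into OpenMP tasks could look like the following sketch. This is only an illustration, written by hand under the assumption that each spawn maps to an omp task and the enclosing barrier to a taskwait; it is not the actual output of the PIPS OpenMP prettyprinter of Section 7.4.2, and the float types and stub bodies are ours.

    #include <omp.h>

    /* Stubs standing in for the Harris kernels (illustrative only). */
    static void SobelX(float *Gx, const float *in) { (void)Gx; (void)in; }
    static void SobelY(float *Gy, const float *in) { (void)Gy; (void)in; }

    /* barrier(spawn(0, SobelX(Gx, in)); spawn(1, SobelY(Gy, in))) */
    void sobel_stage(float *Gx, float *Gy, const float *in)
    {
      #pragma omp parallel
      #pragma omp single
      {
        #pragma omp task
        SobelX(Gx, in);
        #pragma omp task
        SobelY(Gy, in);
        #pragma omp taskwait   /* plays the role of the enclosing barrier */
      }
    }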

remember that the synchronization attribute is added to each statement (see Section 4.2.3 on page 56).

Multiple parallel code configuration schemes can be generated. In a first scheme, illustrated in Function spawns_barrier, we activate other clusters from an enclosing one, i.e., the enclosing cluster executes the spawn instructions (it launches the tasks). A second scheme would be to use the enclosing cluster to execute one of the enclosed tasks; there is no need to generate the last task as a spawn statement, since it is executed by the current cluster. We show an example in Figure 7.4.

Although the barrier synchronization of statements with only one spawn statement can be omitted, as another optimization (barrier(spawn(i, S)) = S), we choose to keep these synchronizations for pedagogical purposes, and plan to apply this optimization in a separate phase, as a parallel transformation, or to handle it at the code generation level.

7.2.2 Hierarchical Parallelism

Function unstructured_to_structured is recursive, in order to handle hierarchy in code. An example is provided in Figure 7.5, where we assume that P = 4. We use two clusters, κ0 and κ1, to schedule the cluster stage {S1, S2};


spawn(0, Gauss(Sxx, Ixx));        spawn(1, Gauss(Sxx, Ixx));
spawn(1, Gauss(Syy, Iyy));        spawn(2, Gauss(Syy, Iyy));
spawn(2, Gauss(Sxy, Ixy));        Gauss(Sxy, Ixy);

Figure 7.4: SPIRE representation of a part of Harris, non optimized (left) and optimized (right)

the remaining clusters are then used to schedule the statements enclosed in S1 and S2. Note that Statement S12, where p = 2, is mapped to the physical cluster number 3 for its logical number 1 (3 = 4 − 2 + 1).

S {
 S0: for (i = 1; i <= N; i++) {
       A[i] = 5;            /* S01 */
       B[i] = 3;            /* S02 */
     }
 S1: for (i = 1; i <= N; i++) {
       A[i] = foo(A[i]);    /* S11 */
       C[i] = bar();        /* S12 */
     }
 S2: for (i = 1; i <= N; i++) {
       B[i] = baz(B[i]);    /* S21 */
     }
 S3: for (i = 1; i <= N; i++) {
       C[i] += A[i] + B[i]; /* S31 */
     }
}

barrier(
  spawn(0, forloop(i, 1, N, 1,
    barrier(
      spawn(1, B[i] = 3);
      spawn(2, A[i] = 5)
    ), sequential))
)
barrier(
  spawn(0, forloop(i, 1, N, 1,
    barrier(
      spawn(2, A[i] = foo(A[i]));
      spawn(3, C[i] = bar())
    ), sequential));
  spawn(1, forloop(i, 1, N, 1,
    B[i] = baz(B[i]), sequential))
)
barrier(
  spawn(0, forloop(i, 1, N, 1,
    C[i] += A[i]+B[i], sequential))
)

Figure 7.5: A simple C code (hierarchical), and its SPIRE representation

Barrier statements Sstructured, formed from all cluster stages Sbarrier in the sequence S, constitute the new sequence S′ that we build in Function unstructured_to_structured; by default, the execution attribute of this sequence is sequential.

The code in the right hand side of Figure 7.5 contains barrier


statements with only one spawn statement, which can be omitted (as an optimization). Moreover, in the same figure, note how S01 and S02 are apparently reversed; this is a consequence, in our Algorithm 21, of the function topsort_unstructured, where statements are generated in the order of the graph traversal.

The execution attribute of the loops in Figure 7.5 is sequential, but they are actually parallel. Note that, regarding these loops, since we adopt the task parallelism paradigm, it may be useful to apply the tiling transformation and then perform full unrolling of the outer loop; we give more details of this transformation in the protocol of our experiments, in Section 8.4. This way, the input code contains more parallel tasks: spawn loops, with a sequential execution attribute, resulting from the parallel loops.

7.3 From Shared Memory to Distributed Memory Transformation

This section provides a second illustration of the type of parallel transformations that can be performed thanks to SPIRE: the transformation from shared memory to distributed memory. Given the complexity of this issue, as illustrated by the vast existing related work (see Section 7.3.1), the work reported in this section is preliminary and exploratory.

The send and recv primitives of SPIRE provide a simple communication API, adaptable for message-passing libraries/languages such as MPI, and thus distributed memory system architectures. In these architectures, the data domain of whole programs should be partitioned into disjoint parts and mapped to different processes that have separate memory spaces. If a process accesses a non-local variable, communications must be performed. In our case, send and recv SPIRE functions have to be added to the parallel intermediate representation of distributed memory code.

7.3.1 Related Work: Communications Generation

Many researchers have been generating communication routines between several processors (general purpose, graphical, accelerator) and have been trying to optimize the generated code. A large amount of literature already exists, and a large amount of work has been done to address this problem since the early 1990s. However, no practical and efficient general solution currently exists [27].

Amarasinghe and Lam [15] generate send and receive instructions necessary for communication between multiple processors. Goumas et al. [49] automatically generate message-passing code for tiled iteration spaces. However, these two works handle only perfectly nested loops with uniform dependences. Moreover, the data parallelism code generation proposed in [15] and [49] produces an SPMD (single program, multiple data) program to be run on each processor; this is different from our context of task parallelism, with its hierarchical approach.

Bondhugula [27], in his automatic distributed memory code generation technique, uses the polyhedral framework to compute communication sets between computations under a given iteration of the parallel dimension of a loop. Once more, the parallel model is nested loops. Indeed, he "only" needs to transfer written data from one iteration to another, read-mode, iteration of the same loop, or final writes for all data, to be aggregated once the entire loop is finished. In our case, we need to transfer data from one or many iterations of a loop to different loop(s), knowing that the data spaces of these loops may be interlaced.

Cetus contains a source-to-source OpenMP to MPI transformation [23] where, at the end of a parallel construct, each participating process communicates the shared data it has produced and that other processes may use.

STEP [78] is a tool that transforms a parallel program containing high-level OpenMP directives into a message-passing program. To generate MPI communications, STEP constructs (1) sends of updated array regions needed at the end of a parallel section, and (2) receives of array regions needed at the beginning of a parallel section. Habel et al. [52] present a directive-based programming model for hybrid distributed and shared memory systems. Using the pragmas distribute and gridify, they distribute nested loop dimensions and array data among processors.

The OpenMP to GPGPU compiler framework [74] transforms OpenMP to GPU code. In order to generate communications, it inserts memory transfer calls for all shared data accessed by the kernel functions and defined by the host, and for data written in the kernels and used by the host.

Amini et al. [17] generate memory transfers between a computer host and its hardware GPU (CPU-GPU) using data flow analysis; the latter is also based on array regions.

The main difference with our context of task parallelism with hierarchy is that these works generate communications between different iterations of nested loops, or between one host and GPUs. Indeed, we have to transfer data between two different entire loops, or between two iterations of different loops.

In this section, we describe how to figure out which pieces of data have to be transferred. We construct array region transfers in the context of a distributed-memory, homogeneous architecture.

7.3.2 Difficulties

Distributed-memory code generation becomes a problem much harder than the ones addressed in the above related works (see Section 7.3.1), because of the hierarchical aspect of our methodology. In this section, we illustrate how communication generation imposes a set of challenges in terms of efficiency and correctness of the generated code, and of coherence of the data communications.

Inter-Level Communications

In the example in Figure 7.6, we suppose given this schedule which, although not efficient, is useful to illustrate the difficulty of sending an array element by element from the source to the sink of a dependence extracted from the DDG. The difficulty occurs when this source and sink belong to different levels of hierarchy in the code, i.e., of the loop tree or test branches.

// Shared memory C code
S {
  S0: for(i = 0; i < N; i++) {
        A[i] = 5;                 // S01
        B[i] = 3;                 // S02
      }
  S1: for(i = 1; i < N; i++) {
        A[i] = foo(A[i-1]);       // S11
        B[i] = bar(B[i]);         // S12
      }
}

// SPIRE distributed memory equivalent
forloop(i, 0, N-1, 1,
  barrier(
    spawn(0, A[i] = 5;
             if(i==0,
                asend(1, A[i]),
                nop)),
    spawn(1, B[i] = 3;
             asend(0, B[i]))
  ), sequential)
forloop(i, 1, N-1, 1,
  barrier(
    spawn(1, if(i==1,
                recv(0, A[i-1]),
                nop);
             A[i] = foo(A[i-1])),
    spawn(0, recv(1, B[i]);
             B[i] = bar(B[i]))
  ), sequential)

Figure 7.6: An example of a shared memory C code and its SPIRE, communication primitive-based, distributed memory equivalent

Communicating element by element is simple in the case of Array B in S12, since no loop-carried dependence exists. But when a dependence distance is greater than zero, such as is the case of Array A in S11, sending only A[0] to S11 is required, although generating exact communications in such cases requires more sophisticated static analyses that we do not have. Note that asend is a macro instruction for an asynchronous send; it is detailed in Section 7.3.3.


Non-Uniform Dependence Distance

In previous works (see Section 7.3.1), only uniform (constant) loop-carried dependences were considered. When a dependence is non-uniform, i.e., when the dependence distance is not constant, as in the two codes in Figure 7.7, knowing how long dependences are for Array A and generating exact communications in such cases is complex. The problem is even more complicated when nested loops are used and arrays are multidimensional. Generating exact communications in such cases requires more sophisticated static analyses that we do not have.

// First example
S {
  S0: for(i = 0; i < N; i++) {
        A[i] = 5;                 // S01
        B[i] = 3;                 // S02
      }
  S1: for(i = 0; i < N; i++) {
        A[i] = A[i-p] + a;        // S11
        B[i] = A[i] * B[i];       // S12
        p = B[i];
      }
}

// Second example
S {
  S0: for(i = 0; i < N; i++) {
        A[i] = 5;                 // S01
        B[i] = 3;                 // S02
      }
  S1: for(i = 0; i < N; i++) {
        A[i] = A[p*i-q] + a;      // S11
        B[i] = A[i] * B[i];       // S12
      }
}

Figure 7.7: Two examples of C code with non-uniform dependences

Exact Analyses

We presented in Section 6.3 our cost models; we rely on array region analysis to overapproximate communications. This analysis is not always exact. PIPS generates approximations (see Section 2.6.3 on page 24), noted by the MAY attribute, if an over- or under-approximation has been used because the region cannot be exactly represented by a convex polyhedron. While BDSC scheduling is still robust in spite of these approximations (see Section 8.6), they can affect the correctness of the generated communications. For instance, in the code presented in Figure 7.8, the quality of the array regions analysis may pass from EXACT to MAY if a region contains holes, as is the case for S0. Also, while the region of A in the case of S1 remains EXACT, because the exact region is rectangular, the region of A in the case of S2 is MAY, since it contains holes.

Approximations are also propagated when computing in and out regions for Array A (see Figure 7.9).


// S0
// < A(φ1)-R-MAY-{1 ≤ φ1, φ1 ≤ N} >
// < A(φ1)-W-MAY-{1 ≤ φ1, φ1 ≤ N} >
for(i = 1; i <= N; i++)
  if(i != N/2)
    A[i] = foo(A[i]);

// S1
// < A(φ1)-R-EXACT-{2 ≤ φ1, φ1 ≤ 2*N} >
// < B(φ1)-W-EXACT-{1 ≤ φ1, φ1 ≤ N} >
for(i = 1; i <= N; i++)
  // < A(φ1)-R-EXACT-{i+1 ≤ φ1, φ1 ≤ i+N} >
  // < B(φ1)-W-EXACT-{φ1 = i} >
  for(j = 1; j <= N; j++)
    B[i] = bar(A[i+j]);

// S2
// < A(φ1)-R-MAY-{2 ≤ φ1, φ1 ≤ 2*N} >
// < B(φ1)-W-EXACT-{1 ≤ φ1, φ1 ≤ N} >
for(j = 1; j <= N; j++)
  B[j] = baz(A[2*j]);

Figure 7.8: An example of a C code and its read and write array regions analysis, for two communications, from S0 to S1 and to S2

In this chapter, communications are computed using in and out regions, such as communicating A from S0 to S1 and to S2 in Figure 7.9.

// S0
// < A[φ1]-IN-MAY-{1 ≤ φ1, φ1 ≤ N} >
// < A[φ1]-OUT-MAY-{2 ≤ φ1, φ1 ≤ 2*N, φ1 ≤ N} >
for(i = 1; i <= N; i++)
  if(i != N/2)
    A[i] = foo(A[i]);

// S1
// < A[φ1]-IN-EXACT-{2 ≤ φ1, φ1 ≤ 2*N} >
for(i = 1; i <= N; i += 1)
  // < A[φ1]-IN-EXACT-{i+1 ≤ φ1, φ1 ≤ i+N} >
  for(j = 1; j <= N; j += 1)
    B[i] = bar(A[i+j]);

// S2
// < A[φ1]-IN-MAY-{2 ≤ φ1, φ1 ≤ 2*N} >
for(j = 1; j <= N; j += 1)
  B[j] = baz(A[2*j]);

Figure 7.9: An example of a C code and its in and out array regions analysis, for two communications, from S0 to S1 and to S2


7.3.3 Equilevel and Hierarchical Communications

As transferring data from one task to another when they are not at the same level of hierarchy is unstructured and too complex, designing a compilation scheme for such cases might be overly complex and even possibly jeopardize the correctness of the distributed memory code. In this thesis, we propose a compositional static solution that is structured and correct; it may communicate more data than necessary, but their coherence is guaranteed.

Communication Scheme

We adopt a hierarchical scheme of communication, between multiple levels of nesting of tasks and inside one level of hierarchy. An example of this scheme is illustrated in Figure 7.10, where each box represents a task τ. Each box is seen as a host of its enclosed boxes, and an edge represents a data dependence. This scheme illustrates an example of configuration with two nested levels of hierarchy; the level is specified with the superscript on τ in the figure.

Once structured parallelism is generated (sequential sequences of barrier and spawn statements)³, the second step is to add send and recv SPIRE call statements, assuming a message-passing language model. We detail in this section our specification of communications insertion in SPIRE.

We divide parallelism into hierarchical and equilevel parallelisms, and distinguish between two types of communications, namely inside a sequence, at the same hierarchy level (equilevel), and between hierarchy levels (hierarchical).

In Figure 7.11, where edges represent possible communications, τ_0^(1), τ_1^(1) and τ_2^(1) are enclosed in the vertex τ_1^(0), which is in the same sequence as τ_0^(0). The dependence (τ_0^(0), τ_1^(0)), for instance, denotes an equilevel communication of A; also, the dependence (τ_0^(1), τ_2^(1)) denotes an equilevel communication of B, while (τ_1^(0), τ_0^(1)) is a hierarchical communication of A. We use two main steps to construct communications, via the equilevel_com and hierarchical_com functions, which are presented below.

The key limitation of our method is that it cannot enforce memory constraints when applying the BDSC-based hierarchical scheduling, which uses an iterative approach on the number of applications of BDSC (see Section 6.5). Therefore, our proposed solution for the generation of communications cannot be applied when strict limitations on memory are imposed at scheduling time. However, our code generator can easily verify if the data used and written at each level can be maintained in the specific clusters.

As a summary, compared to the works presented in Section 7.3.1, we do not improve upon them (regarding non-affine dependences or efficiency).

3. Note that communications can also be constructed from an unstructured code; one has to manipulate unstructured statements rather than sequence ones.


Figure 7.10: Scheme of communications between multiple possible levels of hierarchy in a parallel code; the levels are specified with the superscripts on the τ's.

However, we extend their ideas to our problem of generating communications in task parallelism within a hierarchical code structure, which is more complex than theirs.

Assumptions

For our communication scheme to be valid, the following assumptions must be met for proper communications generation:

• all write regions are exact, in order to guarantee the coherence of data communications;


Figure 7.11: Equilevel and hierarchical communications

• non-uniform dependences are not handled;

• more data than necessary may be communicated, but their coherence is guaranteed thanks to the topological sorting of the tasks involved in communications; this ordering is also necessary to avoid deadlocks;

• the order in which the send and receive operations are performed is matched, in order to avoid deadlock situations;

• the HBDSC algorithm is applied without the memory constraint (M = ∞);

• the input shared-memory code is a structured parallel code, with spawn and barrier constructs;

• all send operations are non-blocking, in order to avoid deadlocks. We presented in SPIRE (see Section 4.2.4 on page 60) how non-blocking communications can be implemented in SPIRE, using the blocking primitives within spawned statements. Therefore, we assume that we use an additional process Ps for performing non-blocking send communications, and a macro instruction asend for asynchronous send, as follows:

  asend(dest, data) = spawn(Ps, send(dest, data))

7.3.4 Equilevel Communications Insertion Algorithm

Function equilevel com presented in Algorithm 23 computes first the equi-level data (array regions) to communicate from (when neighbors = prede-cessors) or to (when neighbors = successors) Si inside an SDG G (a codesequence) using the function equilevel regions presented in Algorithm 24


After the call to the equilevel_regions function, equilevel_com creates a new sequence statement S'i via the sequence function. S'i contains the receive communication calls, Si, and the asend communication calls. After that, it updates σ with S'i and returns it. Recall that the || symbol is used for the concatenation of two lists.

ALGORITHM 23: Equilevel communications for Task τi inside SDG G, updating a schedule σ

function equilevel_com(τi, G, σ, neighbors)
  Si = vertex_statement(τi);
  coms = equilevel_regions(Si, G, σ, neighbors);
  L  = (neighbors = predecessors) ? coms : ∅;
  L ||= [Si];
  L ||= (neighbors = successors) ? coms : ∅;
  S'i = sequence(L);
  return σ[Si → (S'i, cluster(σ(Si)), nbclusters(σ(Si)))];
end

Function equilevel regions traverses all neighbors τn (successors orpredecessors) of τi (task of Si) in order to generate asend or recv statementscalls for τi if they are not in the same cluster Neighbors are traversedthroughout a topological sort-ordered descent in order to impose an orderbetween multiple send or receive communication calls between each pair ofclusters

ALGORITHM 24: Array regions for Statement Si to be transferred inside SDG G (equilevel)

function equilevel_regions(Si, G, σ, neighbors)
  κi = cluster(σ(Si));
  τi = statement_vertex(Si, G);
  coms = ∅;
  foreach τn ∈ topsort_ordered(G, neighbors(τi, G))
    Sn = vertex_statement(τn);
    κn = cluster(σ(Sn));
    if(κi ≠ κn)
      R = (neighbors = successors) ? transfer_data(τi, τn)
                                   : transfer_data(τn, τi);
      coms ||= com_calls(neighbors, R, κn);
  return coms;
end

We thus guarantee the coherence of communicated data and avoid communication deadlocks. We use the function topsort_ordered, which first computes the topological sort of the transitive closure of G and then accesses the neighbors of τi in that order. For instance, the graph in the right of Figure 7.12 shows the result of the topological sort of the predecessors of τ4 of the graph in the left. Function equilevel_regions returns the list of communications performed inside G by Si.

Figure 7.12: A part of an SDG and the topological sort order of the predecessors of τ4

Communication calls are generated in two steps (the two functions presented in Algorithm 25).

First, the transfer_data function is used to compute which data elements have to be communicated, in the form of array regions. It returns the elements that are involved in a RAW dependence between two tasks τ and τsucc. It filters the out and in regions of the two tasks in order to get these dependences; it uses edge_regions for this filtering. Edge regions are computed during the construction of the SDG, in order to maintain the dependences that label the edges of the DDG (see Section 6.2.1 on page 95). The intersection between the two results is returned⁴. For instance, in the graph on the right of Figure 7.12, if N is also written in τ1, filtering the out regions of τ1 avoids receiving N twice (N comes from τ1 and τ3).

Second, communication calls are constructed via the function com_calls that, from the cluster κ of the statement source (asend) or sink (recv) of the communication and the set of data in terms of regions R, generates a list of communication calls coms. These communication calls are the result of the function region_to_coms, already implemented in PIPS [20]. This function converts a region to a code of send/receive instructions that represent the communications of the integer points contained in a given parameterized polyhedron from this region. An example is illustrated in Figure 7.13, where the write regions are exact and the other required assumptions are verified.

4. We assume that Rr is combined with the path transformer between the statements of τ and τsucc.


ALGORITHM 25: Two functions to construct a communication call: data to transfer and call generation

function transfer_data(τ, τsucc)
  R'o = R'i = ∅;
  Ro = out_regions(vertex_statement(τ));
  foreach ro ∈ Ro
    if(∃ r ∈ edge_regions(τ, τsucc) | entity(r) = entity(ro))
      R'o ∪= ro;
  Ri = in_regions(vertex_statement(τsucc));
  foreach ri ∈ Ri
    if(∃ r ∈ edge_regions(τ, τsucc) | entity(r) = entity(ri))
      R'i ∪= ri;
  return regions_intersection(R'o, R'i);
end

function com_calls(neighbors, R, κ)
  coms = ∅;
  i = cluster_number(κ);
  com = (neighbors = successors) ? "asend" : "recv";
  foreach r ∈ R
    coms ||= region_to_coms(r, com, i);
  return coms;
end

In (b) of the figure, the intersection between the in regions of S2 and the out regions of S1 is converted to a loop of communication calls, using the region_to_coms function. Note that the current version of our implementation prototype of region_to_coms is in fact unable to manage this somewhat complicated case. Indeed, the task running on Cluster 0 needs to know the value of the structure variable M to generate a proper communication with the task on Cluster 1 (see Figure 7.13 (b)); M would need to be sent by Cluster 1, where it is known, before the sending loop is run. Thus, our current implementation assumes that all variables in the code generated by region_to_coms are known in each communicating cluster. For instance, in Figure 7.13, with M replaced by N, region_to_coms will perform correctly. This constraint was not a limitation in the experimental benchmarks tested in Chapter 8.

Also, in the part of the main function presented in the left of Figure 7.3, the statement MultiplY(Ixy, Gx, Gy), scheduled on Cluster number 2, needs to receive both Gx and Gy from Clusters number 1 and 0 (see Figure 6.16). The receive calls list thus generated for this statement using our algorithm is⁵:

5. Since the whole array is received, the region is converted to a simple instruction; no loop is needed.


// (a)
// S1
// < A(φ1)-W-EXACT-{1 ≤ φ1, φ1 ≤ N} >
// < A[φ1]-OUT-MAY-{2 ≤ φ1, φ1 ≤ 2*M} >
for(i = 1; i <= N; i++)
  A[i] = foo();
// S2
// < A[φ1]-R-MAY-{2 ≤ φ1, φ1 ≤ 2*M} >
// < A[φ1]-IN-MAY-{2 ≤ φ1, φ1 ≤ 2*M} >
for(i = 1; i <= M; i++)
  B[i] = bar(A[i*2]);

// (b)
spawn(0,
  for(i = 1; i <= N; i++)
    A[i] = foo();
  for(i = 2; i <= 2*M; i++)
    asend(1, A[i]);
)
spawn(1,
  for(i = 2; i <= 2*M; i++)
    recv(0, A[i]);
  for(i = 1; i <= M; i++)
    B[i] = bar(A[i*2]);
)

Figure 7.13: Communications generation from a region, using the region_to_coms function

recv(1, Gy);
recv(0, Gx);

7.3.5 Hierarchical Communications Insertion Algorithm

Data transfers are also needed between different levels of the code hierarchy, because each enclosed task requires data from its enclosing one, and vice versa. To construct these hierarchical communications, we introduce Function hierarchical_com, presented in Algorithm 26.

This function generates asend or recv statement calls for each task S and its hierarchical parent (enclosing vertex), scheduled in Cluster κp. First, it constructs the list of transfers Rh that gathers the data to be communicated to the enclosed vertex; Rh can be defined as the in_regions or out_regions of a task S. Then, from this set of transfers Rh, communication calls coms (recv from the enclosing vertex, or asend to it) are generated.


ALGORITHM 26: Hierarchical communications for Statement S to its hierarchical parent, clustered in κp

function hierarchical_com(S, neighbors, σ, κp)
  Rh = (neighbors = successors) ? out_regions(S)
                                : in_regions(S);
  coms = com_calls(neighbors, Rh, κp);
  seq =  ((neighbors = predecessors) ? coms : ∅)
      || [S]
      || ((neighbors = successors) ? coms : ∅);
  S' = sequence(seq);
  σ = σ[S → (S', cluster(σ(S)), nbclusters(σ(S)))];
  return (σ, Rh);
end

Finally, σ, updated with the new statement S', and the set of regions Rh, which represents the data to be communicated, this time from the side of the enclosing vertex, are returned.

For instance, in Figure 7.11, for the task τ_0^(1), the data elements Rh to send from τ_0^(1) to τ_1^(0) are the out_regions of the statement labeled τ_0^(1). Thus, the hierarchical asend call from τ_0^(1) to its hierarchical parent τ_1^(0) is constructed using Rh as argument.

7.3.6 Communications Insertion Main Algorithm

Function communications_insertion, presented in Algorithm 27, constructs a new code (asend and recv primitives are generated). It uses a hierarchical schedule σ, which maps each substatement s of S to σ(s) = (s', κ, n) (see Section 6.5), to handle new statements. It also uses the hierarchical SDG mapping function H, which we use to recover the SDG of each statement (see Section 6.2).

Equilevel

To construct equilevel communications, all substatements Si in a sequence S are traversed. For each Si, communications are generated inside each sequence via two calls to the function equilevel_com presented above: one with the function neighbors equal to successors, to compute asend calls; the other with the function predecessors, to compute recv calls. This ensures that each send operation is matched to an appropriately-ordered receive operation, avoiding deadlock situations. We use the function H, which returns an SDG G for S; G is essential to access the dependences, and thus the communications, between the vertices τ of G whose statements s = vertex_statement(τ) are in S.


ALGORITHM 27: Communication primitives (asend and recv) insertion in SPIRE

function communications_insertion(S, H, σ, κp)
  κ = cluster(σ(S));
  n = nbclusters(σ(S));
  switch (S)
    case call:
      return σ;
    case sequence(S1; ...; Sn):
      G = H(S);
      h_recvs = h_sends = ∅;
      foreach τi ∈ topsort(G)
        Si = vertex_statement(τi);
        κi = cluster(σ(Si));
        σ = equilevel_com(τi, G, σ, predecessors);
        σ = equilevel_com(τi, G, σ, successors);
        if(κp ≠ NULL and κp ≠ κi)
          (σ, h_Rrecv) = hierarchical_com(Si, predecessors, σ, κp);
          h_sends ||= com_calls(successors, h_Rrecv, κi);
          (σ, h_Rsend) = hierarchical_com(Si, successors, σ, κp);
          h_recvs ||= com_calls(predecessors, h_Rsend, κi);
        σ = communications_insertion(Si, H, σ, κi);
      S'  = parallel(σ(S));
      S'' = sequence(h_sends || [S'] || h_recvs);
      return σ[S' → (S'', κ, n)];
    case forloop(I, Elower, Eupper, Sbody):
      σ = communications_insertion(Sbody, H, σ, κp);
      (S'body, κbody, nbclustersbody) = σ(Sbody);
      return σ[S → (forloop(I, Elower, Eupper, S'body), κ, n)];
    case test(Econd, St, Sf):
      σ = communications_insertion(St, H, σ, κp);
      σ = communications_insertion(Sf, H, σ, κp);
      (S't, κt, nbclusterst) = σ(St);
      (S'f, κf, nbclustersf) = σ(Sf);
      return σ[S → (test(Econd, S't, S'f), κ, n)];
end

Hierarchical

We also construct hierarchical communications, which represent data transfers between different levels of hierarchy, because each enclosed task requires data from its enclosing one. We use the function hierarchical_com, presented above, which generates asend or recv statement calls for each task S and its hierarchical parent (enclosing vertex), scheduled in Cluster κp. Thus, the argument κp is propagated in communications_insertion, in order to keep the information about the cluster of the enclosing vertex correct. The returned set of regions h_Rsend or h_Rrecv, which represents the data to be communicated at the start and end of the enclosed vertex τi, is used to construct, respectively, send and recv communication calls on the side of the enclosing vertex of S. Note that hierarchical send communications need to be performed before running S', to ensure that subtasks can run. Moreover, no deadlocks can occur in this scheme, because the enclosing task executes send/receive statements sequentially before spawning its enclosed tasks, which run concurrently.

The top level of a program presents no hierarchy. In order not to generate hierarchical communications that are not necessary, as an optimization, we use the test (κp ≠ NULL), knowing that, for the first call to the function communications_insertion, κp is set to NULL. This is necessary to prevent the generation of hierarchical communications for Level 0.

Example

As an example, we apply the phase of communication generation to the code presented in Figure 7.5 to generate SPIRE code with communications; both equilevel (noted with comments // equilevel) and hierarchical (noted with comments // hierarchical) communications are illustrated in the right of Figure 7.14. Moreover, a graphical illustration of these communications is given in Figure 7.15.

7.3.7 Conclusion and Future Work

As said before, the work presented in this section is a preliminary and exploratory solution that we believe is useful for suggesting the potential of our task parallelization in the distributed memory domain. It may also serve as a basis for future work to generate more efficient code than ours.

In fact, the solution proposed in this section may communicate more data than necessary, but their coherence is guaranteed thanks to the topological sorting of the tasks involved in communications, at each level of generation: equilevel (ordering of the neighbors of every task in an SDG) and hierarchical (ordering of the tasks of an SDG). Moreover, as future work, we intend to optimize the generated code by eliminating redundant communications and aggregating small messages into larger ones.
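As an informal illustration of the message-aggregation idea, and not of code our generator currently emits, the contiguous region A[2 .. 2*M] of Figure 7.13 could be shipped as a single MPI message instead of 2*M − 1 element-wise sends; the function below is a hypothetical sketch of such an aggregated transfer.

#include <mpi.h>

/* Hypothetical aggregated send: one message for the whole region
   A[2 .. 2*M], instead of one asend per array element. */
void send_region_aggregated(float A[], int M, int dest, MPI_Request *req)
{
  MPI_Isend(&A[2], 2 * M - 1, MPI_FLOAT, dest, 0 /* tag */,
            MPI_COMM_WORLD, req);
}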

For instance, hierarchical communications transfer all out regions from the enclosed task to its enclosing task. A more efficient scheme would be to not send all the out regions of the enclosed task, but only the regions that will not be modified afterward. This is another track of optimization, to eliminate redundant communications, that we leave for future work.

Moreover, we optimize the generated code by imposing that the current cluster executes the last nested spawn.


// Sequential C code
for(i = 0; i < N; i++) {
  A[i] = 5;
  B[i] = 3;
}
for(i = 0; i < N; i++) {
  A[i] = foo(A[i]);
  C[i] = bar();
}
for(i = 0; i < N; i++)
  B[i] = baz(B[i]);
for(i = 0; i < N; i++)
  C[i] += A[i] + B[i];

// SPIRE distributed memory representation
barrier(
  spawn(0, forloop(i, 1, N, 1,
      asend(1, i),                 // hierarchical
      asend(2, i),                 // hierarchical
      barrier(
        spawn(1,
          recv(0, i),              // hierarchical
          B[i] = 3,
          asend(0, B[i])),         // hierarchical
        spawn(2,
          recv(0, i),              // hierarchical
          A[i] = 5,
          asend(0, A[i]))          // hierarchical
      ),
      recv(1, B[i]),               // hierarchical
      recv(2, A[i]),               // hierarchical
      sequential),
    asend(1, B))                   // equilevel
)
barrier(
  spawn(0, forloop(i, 1, N, 1,
      asend(2, i),                 // hierarchical
      asend(2, A[i]),              // hierarchical
      asend(3, i),                 // hierarchical
      barrier(
        spawn(2,
          recv(0, i),              // hierarchical
          recv(0, A[i]),           // hierarchical
          A[i] = foo(A[i]),
          asend(0, A[i])),         // hierarchical
        spawn(3,
          recv(0, i),              // hierarchical
          C[i] = bar(),
          asend(0, C[i]))          // hierarchical
      ),
      recv(2, A[i]),               // hierarchical
      recv(3, C[i]),               // hierarchical
      sequential)),
  spawn(1,
    recv(0, B),                    // equilevel
    forloop(i, 1, N, 1,
      B[i] = baz(B[i]), sequential),
    asend(0, B))                   // equilevel
)
barrier(
  spawn(0,
    recv(1, B),                    // equilevel
    forloop(i, 1, N, 1,
      C[i] += A[i]+B[i], sequential))
)

Figure 7.14: Equilevel and hierarchical communications generation, and SPIRE distributed memory representation of a C code


Figure 7.15: Graph illustration of the communications made in the C code of Figure 7.14: equilevel in green arcs, hierarchical in black.

The result of the application of this optimization on the code presented in the right of Figure 7.14 is illustrated in the left of Figure 7.16. We also eliminate barriers with only one spawn statement; the result is illustrated in the right of Figure 7.16.

7.4 Parallel Code Generation

SPIRE is designed to fit different programming languages, thus facilitating the transformation of codes for different types of parallel systems. Existing parallel language constructs can be mapped into SPIRE. We show in this section how this intermediate representation simplifies the targeting of various languages, via a simple mapping approach, and use the prettyprinted generations of OpenMP and MPI as an illustration.

7.4.1 Mapping Approach

Our mapping technique to a parallel language is made of three steps, namely (1) the generation of parallel tasks, by translating spawn attributes, and of synchronization statements, by translating barrier attributes, to this language; (2) the insertion of communications, by translating the asend and recv primitives to this target language, if the latter uses a distributed memory model; and (3) the generation of hierarchical (nested) parallelism, by exploiting the code hierarchy. For efficiency reasons, we assume that the creation of tasks is done once, at the beginning of programs; after that, we just synchronize the tasks, which are thus persistent.


// Left: the current cluster executes the last nested spawn
barrier(
  forloop(i, 1, N, 1,
    asend(1, i),
    barrier(
      spawn(1,
        recv(0, i),
        B[i] = 3,
        asend(0, B[i])),
      A[i] = 5
    ),
    recv(1, B[i]),
    sequential)
)
barrier(
  spawn(1, forloop(i, 1, N, 1,
    asend(2, i),
    asend(2, A[i]),
    barrier(
      spawn(2,
        recv(0, i),
        recv(0, A[i]),
        A[i] = foo(A[i]),
        asend(0, A[i])),
      C[i] = bar()
    ),
    recv(2, A[i]),
    sequential)),
  forloop(i, 1, N, 1,
    B[i] = baz(B[i]),
    sequential)
)
barrier(
  forloop(i, 1, N, 1,
    C[i] += A[i]+B[i],
    sequential)
)

// Right: barriers with one spawn statement are removed
forloop(i, 1, N, 1,
  asend(1, i),
  barrier(
    spawn(1,
      recv(0, i),
      B[i] = 3,
      asend(0, B[i])),
    A[i] = 5
  ),
  recv(1, B[i]),
  sequential)
barrier(
  spawn(1, forloop(i, 1, N, 1,
    asend(2, i),
    asend(2, A[i]),
    barrier(
      spawn(2,
        recv(0, i),
        recv(0, A[i]),
        A[i] = foo(A[i]),
        asend(0, A[i])),
      C[i] = bar()
    ),
    recv(2, A[i]),
    sequential)),
  forloop(i, 1, N, 1,
    B[i] = baz(B[i]),
    sequential)
)
forloop(i, 1, N, 1,
  C[i] += A[i]+B[i],
  sequential)

Figure 7.16: Equilevel and hierarchical communications generation after optimizations: the current cluster executes the last nested spawn (left), and barriers with one spawn statement are removed (right)


Task Parallelism

The parallel code is generated in a fork/join manner: a barrier statement encloses spawned parallel tasks. Mapping SPIRE to an actual parallel language consists in transforming the barrier and spawn annotations of SPIRE into the equivalent calls or pragmas available in the target language.

Data Distribution

If the target language uses a shared-memory paradigm, no additional changes are needed to handle this language. Otherwise, if it is a message-passing library/language, a pass of insertion of the appropriate asend and recv primitives of SPIRE is necessary (see Section 7.3). After that, we generate the equivalent communication calls of the corresponding language from SPIRE, by transforming the asend and recv primitives into these equivalent communication calls.

Hierarchical Parallelism

Throughout this thesis, we take into account the hierarchy structure of code to achieve maximum efficiency. Indeed, exploiting the hierarchy of the code when scheduling it via the HBDSC algorithm, within the SPIRE transformation pass via the unstructured_to_structured algorithm, and also when generating both equilevel and hierarchical communications, made it possible to handle hierarchical parallelism, in order to generate more parallelism. Moreover, we take into account the ability of parallel languages to implement hierarchical parallelism, and thus improve the performance of the generated parallel code. Figure 7.5 on page 138 illustrates zero and one level of hierarchy in the generated SPIRE code (right); we have generated equilevel and hierarchical SPIRE(PIPS IR).

7.4.2 OpenMP Generation

OpenMP 3.0 has been extended with tasks, and thus supports both task and hierarchical task parallelism for shared memory systems. For efficiency, since the creation of threads is done once at the beginning of the program, the omp parallel pragma is used once in the program. This section details the generation of OpenMP task parallelism features using SPIRE.
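The overall shape of such a generated program can be pictured with the following minimal sketch, which is ours and not the exact output of our prettyprinter; work_a and work_b are placeholder functions.

#include <stdio.h>

void work_a(void) { printf("task a\n"); }   /* placeholder task bodies */
void work_b(void) { printf("task b\n"); }

int main(void)
{
  #pragma omp parallel          /* thread team created once          */
  {
    #pragma omp single          /* a single thread spawns the tasks  */
    {
      #pragma omp task
      work_a();
      #pragma omp task
      work_b();
      #pragma omp taskwait      /* synchronize the spawned tasks     */
    }
  }
  return 0;
}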

Task Parallelism

OpenMP supports two types of task parallelism scheduling models: the dynamic one, via the omp task directive, and the static one, via the omp section directive (see Section 3.4.4 on page 39). In our implementation of the OpenMP back-end generation, we choose to generate omp task instead of omp section tasks, for four reasons, namely:

1. the HBDSC scheduling algorithm uses a static approach, and thus we want to combine the static results of HBDSC with the run-time dynamic scheduling of OpenMP provided when using omp task, in order to improve scheduling;

2. unlike the omp section directive, the omp task directive allows the implementation of recursive parallel tasks;

3. omp section may only be used in an omp sections construct, and thus we cannot implement a parallel loop using omp sections;

4. the actual implementations of OpenMP do not provide the possibility of nesting many omp sections, and thus cannot take advantage of the BDSC-based hierarchical parallelization (HBDSC) feature to exploit more parallelism.

As regards the use of the OpenMP task construct, we apply a simple process: a SPIRE statement annotated with the synchronization attribute spawn is decorated by the pragma omp task. An example is illustrated in Figure 7.17, where the SPIRE code in the left hand side is rewritten in OpenMP on the right hand side.

Synchronization

In OpenMP, the user can specify explicit synchronization to handle control and data dependences, using the omp single pragma for example. There, only one thread (omp single pragma) will encounter a task construct (omp task) to be scheduled and executed by any thread in the team (the set of threads created inside an omp parallel pragma). Therefore, a statement annotated with the SPIRE synchronization attribute barrier is decorated by the pragma omp single. An example is illustrated in Figure 7.17.

Hierarchical Parallelism

SPIRE handles multiple levels of parallelism, providing a trade-off between parallelism and synchronization overhead. This is taken into account at scheduling time, using our BDSC-based hierarchical scheduling algorithm. In order to generate hierarchical parallelism in OpenMP, the same process as above of translating spawn and barrier SPIRE annotations should be applied. However, synchronizing nested tasks (with synchronization attribute barrier) using the omp single pragma cannot be used, because of a restriction of OpenMP-compliant implementations: single regions may not be closely nested inside explicit task regions.

// SPIRE
barrier(
  spawn(0, Gauss(Sxx, Ixx)),
  spawn(1, Gauss(Syy, Iyy)),
  spawn(2, Gauss(Sxy, Ixy))
)

// OpenMP
#pragma omp single
{
  #pragma omp task
  Gauss(Sxx, Ixx);
  #pragma omp task
  Gauss(Syy, Iyy);
  #pragma omp task
  Gauss(Sxy, Ixy);
}

Figure 7.17: SPIRE representation of a part of Harris, and its OpenMP task generated code

Therefore, we use the pragma omp taskwait for synchronization when encountering a statement annotated with a barrier synchronization. The second barrier in the left of Figure 7.18 implements nested parallel tasks; omp taskwait is then used (right of the figure) to synchronize the two tasks spawned inside the loop. Note also that the barrier statement with zero spawn statement (left of the figure) is omitted in the translation (right of the figure), for optimization purposes.

// SPIRE
barrier(
  spawn(0,
    forloop(i, 1, N, 1,
      barrier(
        spawn(1, B[i] = 3),
        spawn(2, A[i] = 5)
      ),
      sequential))
)
barrier(
  spawn(0,
    forloop(i, 1, N, 1,
      C[i] = A[i]+B[i],
      sequential))
)

// OpenMP
#pragma omp single
{
  for(i = 1; i <= N; i += 1) {
    #pragma omp task
    B[i] = 3;
    #pragma omp task
    A[i] = 5;
    #pragma omp taskwait
  }
  for(i = 1; i <= N; i += 1)
    C[i] = A[i]+B[i];
}

Figure 7.18: SPIRE representation of a C code, and its OpenMP hierarchical task generated code

7.4.3 SPMDization: MPI Generation

MPI is a message-passing library; it is widely used to program distributed-memory systems. We show here the generation of communications between processes. In our technique, we adopt a model often called "SPMDization" of parallelism, where processes are created once at the beginning of the program (static creation), each process executes the same code, and the number of processes remains unchanged during execution. This allows us to reduce the process creation and synchronization overheads. We stated in Section 3.4.5 on page 40 that MPI processes are created when the MPI_Init function is called; it also defines the universal intracommunicator MPI_COMM_WORLD, for all processes to drive the various communications. We use the rank-of-process parameter rank0 to manage the relation between process and task; this rank is obtained by calling the function MPI_Comm_rank(MPI_COMM_WORLD, &rank0). This section details the generation of MPI using SPIRE.

Task Parallelism

A simple SPMDization technique is applied, where a statement with the synchronization attribute spawn, of parameter entity0 that specifies the cluster number (process, in the case of MPI), is filtered by an if statement on the condition rank0 == entity0. An example is illustrated in Figure 7.19.

Communication and Synchronization

Since MPI supports distributed memory systems, we have to generate communications between different processes. The MPI_Isend and MPI_Recv functions are called to replace the communication primitives asend and recv of SPIRE; MPI_Isend [3] starts a nonblocking send. These communication functions are sufficient to ensure the synchronization between different processes. An example is illustrated in Figure 7.19.

Hierarchical Parallelism

Most people use only MPI_COMM_WORLD as a communicator between MPI processes. This universal intracommunicator is not enough for handling hierarchical parallelism. MPI implementations offer the possibility of creating subset communicators via the function MPI_Comm_split [80], which creates new communicators from an initially defined communicator. Each new communicator can be used in a level of hierarchy in the parallel code. Another scheme consists in gathering the code of each process and then assigning the code to the corresponding process.
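A minimal sketch of that first scheme, which we did not implement and give only as an illustration, would derive one sub-communicator per hierarchy level by using the level number as the color passed to MPI_Comm_split:

#include <mpi.h>

/* Illustrative only: processes calling this with the same 'level'
   end up in the same sub-communicator; 'rank' keeps their ordering. */
void split_by_level(int level, MPI_Comm *level_comm)
{
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_split(MPI_COMM_WORLD, level, rank, level_comm);
}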

In our implementation of the generation of MPI code from SPIRE code, we leave the implementation of hierarchical MPI as future work; we implement only flat MPI code. Another type of hierarchical MPI is obtained by merging MPI and OpenMP together (hybrid programming) [12]; this is also called multi-core-aware message-passing (MPI). We also discuss this in the future work section (see Section 9.2).


// SPIRE
barrier(
  spawn(0,
    Gauss(Sxx, Ixx)
  ),
  spawn(1,
    Gauss(Syy, Iyy),
    asend(0, Syy)
  ),
  spawn(2,
    Gauss(Sxy, Ixy),
    asend(0, Sxy)
  )
)

// MPI
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank0);
if (rank0 == 0) {
  Gauss(Sxx, Ixx);
}
if (rank0 == 1) {
  Gauss(Syy, Iyy);
  MPI_Isend(Syy, N*M, MPI_FLOAT, 0, MPI_ANY_TAG,
            MPI_COMM_WORLD, &request);
}
if (rank0 == 2) {
  Gauss(Sxy, Ixy);
  MPI_Isend(Sxy, N*M, MPI_FLOAT, 0, MPI_ANY_TAG,
            MPI_COMM_WORLD, &request);
}
MPI_Finalize();

Figure 7.19: SPIRE representation of a part of Harris, and its MPI generated code

7.5 Conclusion

The goal of this chapter is to put the first brick of a framework of parallel code transformations and generation using our BDSC- and SPIRE-based hierarchical task parallelization techniques. We present two parallel transformations: first, from unstructured to structured parallelism, for a high-level abstraction of parallel constructs; and, second, the conversion of shared memory programs to distributed ones. For these two transformations, we juggle with SPIRE constructs to perform them. Moreover, we show the generation of task parallelism and communications at both levels: equilevel and hierarchical. Finally, we generate both OpenMP and MPI from the same parallel IR derived from SPIRE, except for the hierarchical MPI, which we leave as future work.

The next chapter provides experimental results using OpenMP and MPI programs generated via the methodology presented in this thesis, based on the HBDSC scheduling algorithm and SPIRE-derived parallel languages.

Chapter 8

Experimental Evaluation with PIPS

Experience is a lantern tied behind our backs which illuminates the path. — Confucius

In this chapter, we describe the integration of SPIRE and HBDSC within the PIPS compiler infrastructure, and their application to the parallelization of five significant programs, targeting both shared and distributed memory architectures: the image and signal processing benchmarks Harris and ABF, the SPEC2001 benchmark equake, the NAS parallel benchmark IS, and an FFT code. We present the experimental results we obtained for these applications; our experiments suggest that HBDSC's focus on efficient resource management leads to significant parallelization speedups on both shared and distributed memory systems, improving upon DSC results, as shown by the comparison of the sequential and parallelized versions of these five applications running on both OpenMP and MPI frameworks.

We also provide a comparative study between our task parallelization implementation in PIPS and that of the audio signal processing language Faust, using two Faust applications: Karplus32 and Freeverb.

8.1 Introduction

The HBDSC algorithm presented in Chapter 6 has been designed to offer better task parallelism extraction performance for parallelizing compilers than traditional list-scheduling techniques such as DSC. To verify its effectiveness, BDSC has been implemented in the PIPS compiler infrastructure and tested on programs written in C. Both the HBDSC algorithm and SPIRE are implemented in PIPS. Recall that HBDSC is the hierarchical layer that adapts a scheduling algorithm (DSC, BDSC), which handles a DAG (sequence), for a hierarchical code. The analyses for the estimation of task execution times, dependences, and communications between tasks, used for computing tlevel and blevel, are also implemented in PIPS.

In this chapter, we provide experimental BDSC-vs-DSC comparison results based on the parallelization of five benchmarks, namely ABF [51], Harris [94], equake [22], IS [85] and FFT [8]. We chose these particular benchmarks since they are well-known benchmarks and exhibit task parallelism that we hope our approach will be able to take advantage of.


Note that, since our focus in this chapter is to compare BDSC with prior work, namely DSC, our experiments in this chapter do not address the hierarchical component of HBDSC.

BDSC uses numerical constants to label its input graph; we show in Section 6.3 how we lift this restriction, using static and dynamic approaches based on array region and complexity analyses. In this chapter, we study experimentally the input sensitivity issues of the task and communication time estimations, in order to assess the scheduling robustness of BDSC, since it relies on the approximations provided by these analyses.

Several existing systems already automate the parallelization of programs targeting different languages (see Section 6.6), such as the non-open-source compiler OSCAR [62]. With the goal of comparing our approach with others, we chose the audio signal processing language Faust [86], because it is an open-source compiler and also automatically generates parallel tasks in OpenMP.

The remainder of this chapter is structured as follows. We detail the implementation of HBDSC in Section 8.2. Section 8.3 presents the scientific benchmarks under study (Harris, ABF, equake, IS and FFT), the two target machines that we use for our measurements, and also the effective cost model parameters that depend on each target machine. Section 8.4 explains the process we follow to parallelize these benchmarks. Section 8.5 provides performance results when these benchmarks are parallelized on the PIPS platform, targeting a shared memory architecture and a distributed memory architecture. We also assess the sensitivity of our parallelization technique to the accuracy of the static approximations of the code execution time used in task scheduling, in Section 8.6. Section 8.7 provides a comparative study between our task parallelization implementation in PIPS and that of Faust, when dealing with audio signal processing benchmarks.

8.2 Implementation of HBDSC- and SPIRE-Based Parallelization Processes in PIPS

We have implemented the HBDSC parallelization (Chapter 6) and some SPIRE-based parallel code transformation and generation (Chapter 7) algorithms in PIPS. PIPS is structured as a collection of independent code analysis and transformation phases, which we call passes. This section describes the order and the composition of the passes that we have implemented in PIPS to enable the parallelization of sequential C77 language programs. Figure 8.1 summarizes the parallelization processes.

These passes are managed by the Pipsmake library¹. Pipsmake manages objects that are resources stored in memory and/or on disk.

1. http://cri.ensmp.fr/PIPS/pipsmake.html


Figure 8.1: Parallelization process: blue indicates thesis contributions, an ellipse a pass or a set of passes, and a rectangle results. The chain is: C source code → PIPS analyses (complexities, convex array regions) → sequential IR and DDG → Sequence Dependence Graph (SDG) → cost models (static instrumentation, numerical profile, input data) → HBDSC → unstructured SPIRE(PIPS IR) and schedule → structuring → structured SPIRE(PIPS IR) → OpenMP generation and MPI generation → OpenMP code and MPI code.

The passes are described by generic rules that indicate used and produced resources; examples of these rules are illustrated in the following subsections. Pipsmake maintains the global data consistency intra- and inter-procedurally.

The goal of the following passes is to generate the parallel code expressed in SPIRE(PIPS IR) using the HBDSC scheduling algorithm, then communication primitives, and finally OpenMP and MPI codes, in order to automate the task parallelization of sequential programs.

Tpips² is the command-line interface and scripting language of the PIPS system. The execution of the necessary passes is obtained by typing their names in the command line interpreter of PIPS, tpips. Figure 8.2 depicts harris.tpips, the script used to execute our shared and distributed memory parallelization processes on the benchmark Harris. It chains the passes described in the following subsections.

# Create a workspace for the program harris.c
create harris harris.c

apply SEQUENCE_DEPENDENCE_GRAPH[main]
shell dot -Tpng harris.database/main/main_sdg.dot > main_sdg.png
shell gqview main_sdg.png

setproperty HBDSC_NB_CLUSTERS 3
apply HBDSC_PARALLELIZATION[main]
apply SPIRE_UNSTRUCTURED_TO_STRUCTURED[main]

# Print the result on stdout
display PRINTED_FILE[main]

echo OMP style
activate OPENMP_TASK_GENERATION
activate PRINT_PARALLELIZEDOMPTASK_CODE
display PARALLELPRINTED_FILE[main]

echo MPI style
apply HBDSC_GEN_COMMUNICATIONS[main]
activate MPI_TASK_GENERATION
activate PRINT_PARALLELIZEDMPI_CODE
display PARALLELPRINTED_FILE[main]

# Print the scheduled SDG of harris
shell dot -Tpng harris.database/main/main_ssdg.dot > main_ssdg.png
shell gqview main_scheduled_ssdg.png

close
quit

Figure 8.2: Executable (harris.tpips) for harris.c

2. http://www.cri.ensmp.fr/pips/tpips-user-manual.htdoc/tpips-user-manual.pdf


8.2.1 Preliminary Passes

Sequence Dependence Graph Pass

Pass sequence_dependence_graph generates the Sequence Dependence Graph (SDG), as explained in Section 6.2. It uses in particular the dg resource (data dependence graph) of PIPS, and outputs the resulting SDG in the resource sdg.

sequence_dependence_graph       > MODULE.sdg
        < PROGRAM.entities
        < MODULE.code
        < MODULE.proper_effects
        < MODULE.dg
        < MODULE.regions
        < MODULE.transformers
        < MODULE.preconditions
        < MODULE.cumulated_effects

Code Instrumentation Pass

We model the communication costs, data sizes and execution times of the different tasks with polynomials. These polynomials have to be converted to numerical values in order to be used by the BDSC algorithm. We thus instrument, using the pass hbdsc_code_instrumentation, the input sequential code, and run it once in order to obtain the numerical values of the polynomials. The instrumented code contains the initial user code plus statements that compute the values of the cost polynomials for each statement (as detailed in Section 6.3.2).
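For illustration, the following minimal sketch, ours and not the actual output of the pass, shows the shape of such an instrumented statement; the task label T1, the cost coefficients c0 and c1 and the output file are invented placeholders.

#include <stdio.h>

/* Hypothetical instrumented task: the original loop is kept, and an
   added statement evaluates its symbolic cost polynomial c0 + c1*N
   with the run-time value of N, recording it for later use by BDSC. */
void instrumented_task(float A[], long N, long c0, long c1, FILE *cost_file)
{
  long i;
  for (i = 0; i < N; i++)           /* original user loop (task T1) */
    A[i] = A[i] * 2.0f;             /* placeholder computation      */
  fprintf(cost_file, "T1 %ld\n", c0 + c1 * N);
}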

hbdsc_code_instrumentation      > MODULE.code
        < PROGRAM.entities
        < MODULE.code
        < MODULE.proper_effects
        < MODULE.dg
        < MODULE.regions
        < MODULE.summary_complexity

Path Transformer Pass

The path transformer (see Section 6.4) between two statements computes the possible changes performed by a piece of code delimited by two statements, Sbegin and Send, enclosed within a statement S. The goal of the following pass is to compute the path transformer between two statements labeled by the labels sbegin and ssend in the code resource. It outputs a resource file that contains the resulting transformer.


path_transformer                > MODULE.path_transformer_file
        < PROGRAM.entities
        < MODULE.code
        < MODULE.proper_effects
        < MODULE.transformers
        < MODULE.preconditions
        < MODULE.cumulated_effects
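A hypothetical tpips fragment, assuming the naming convention used in Figure 8.2 (a pass foo is invoked as FOO[module]), would then look as follows; how the two statement labels are selected is left out of this sketch.

apply PATH_TRANSFORMER[main]
display PATH_TRANSFORMER_FILE[main]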

8.2.2 Task Parallelization Passes

HBDSC Task Parallelization Pass

Pass hbdsc_parallelization applies HBDSC on the resulting SDG (the input resource sdg) and generates a parallel code in an unstructured form, using the resource sdg implementing the scheduled SDG. Resource schedule implements our hierarchical schedule function σ (see Section 6.5); it maps each statement to a given cluster. This pass implements the HBDSC algorithm presented in Section 6.5, Algorithm 19. Note that Algorithm 19 uses the unstructured construct, instead of the scheduled SDG, to represent unstructured parallelism; our prototype uses the SDG instead, for historical reasons. This pass uses the analyses of regions, preconditions, transformers, complexities and effects, and outputs the scheduled SDG in the resource sdg and a function σ in the resource schedule.

hbdsc_parallelization           > MODULE.sdg
                                > MODULE.schedule
        < PROGRAM.entities
        < MODULE.code
        < MODULE.proper_effects
        < MODULE.sdg
        < MODULE.regions
        < MODULE.summary_complexity
        < MODULE.transformers
        < MODULE.preconditions
        < MODULE.cumulated_effects

Parallelism Structuring Pass

Pass hbdsc_parallelization generates unstructured parallelism. The goal of the following pass is to structure this parallelism. Pass spire_unstructured_to_structured transforms the parallelism generated in the form of graphs (SDGs) into spawn and barrier constructs. It uses the scheduled SDG and the schedule function σ (the resource schedule), generates a new (structured) code, and updates the schedule resource, since it adds to the schedule the newly created statements of the new code. This pass implements Algorithm unstructured_to_structured, presented in Section 7.2, Algorithm 21.

spire_unstructured_to_structured        > MODULE.code
                                        > MODULE.schedule
        < PROGRAM.entities
        < MODULE.code
        < MODULE.sdg
        < MODULE.schedule

HBDSC Tuning

In PIPS, one can use different properties to parameterize passes and to select the precision required or the algorithm to use. Default properties are defined by PIPS, but they can be redefined by the user, using the command setproperty when the tpips interface is used.

The following properties are used by Pass hbdsc_parallelization, to allow the user to define the number of clusters and the memory size, based on the properties of the target architecture.

The number of clusters is set by default to 4, and the memory size to -1, which means that we have infinite memory.

HBDSC_NB_CLUSTERS 4

HBDSC_MEMORY_SIZE -1

In order to control the granularity of parallel tasks, we introduce the following property. By default, we generate the maximum parallelism in codes; otherwise, i.e., when COSTLY_TASKS_ONLY is TRUE, only loops and function calls are used as parallel tasks. This is important in order to make a trade-off between task creation overhead and task execution time.

COSTLY_TASKS_ONLY FALSE

The next property specifies whether communication data and time information is to be extracted from the result of the instrumentation and stored into the HBDSC_INSTRUMENTED_FILE file, which is a dynamic analysis, or whether the static cost models have to be used. The default option is an empty string; it means that static results have to be used.

HBDSC_INSTRUMENTED_FILE
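The following tpips fragment, whose property values are ours and purely illustrative (in particular, the unit expected by HBDSC_MEMORY_SIZE is assumed), shows how these properties could be combined for a given target:

setproperty HBDSC_NB_CLUSTERS 8
setproperty HBDSC_MEMORY_SIZE 16000000
setproperty COSTLY_TASKS_ONLY TRUE
setproperty HBDSC_INSTRUMENTED_FILE "instrumented_harris.in"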

8.2.3 OpenMP Related Passes

SPIRE is designed to help generate code for different types of parallel languages, just like different parallel languages can be mapped into SPIRE. The four following passes show how this intermediate representation simplifies the task of producing code for various languages, such as OpenMP and MPI (see Section 8.2.4).


OpenMP Task Generation Pass

Pass openmp_task_generation generates OpenMP task parallel code from SPIRE(PIPS IR), as detailed in Section 7.4.2. It replaces spawn constructs by the OpenMP directive omp task, and barrier constructs by omp single or omp taskwait. It outputs a new resource, parallelized_code, for the OpenMP code.

openmp_task_generation          > MODULE.parallelized_code
        < PROGRAM.entities
        < MODULE.code

OpenMP Prettyprint Pass

The next pass, print_parallelizedOMPTASK_code, is a pretty printing function. It outputs the code decorated with OpenMP (omp task) directives, using the parallelized_code resource.

print_parallelizedOMPTASK_code  > MODULE.parallelprinted_file
        < PROGRAM.entities
        < MODULE.parallelized_code

8.2.4 MPI Related Passes: SPMDization

Automatic Data Distribution Pass

Pass hbdsc_gen_communications generates send and recv primitives. It implements the algorithm communications_insertion presented in Section 7.3, Algorithm 27. It takes as input the parallel code (< symbol) and the precedent resources of the sequential code (= symbol). It generates a new parallel code that contains communication calls.

hbdsc_gen_communications        > MODULE.code
        < PROGRAM.entities
        < MODULE.code
        = MODULE.dg
        = MODULE.schedule
        = MODULE.proper_effects
        = MODULE.preconditions
        = MODULE.regions

MPI Task Generation Pass

The same process applies for MPI. The pass mpi_task_generation generates MPI parallel code from SPIRE(PIPS IR), as detailed in Section 7.4.3. It replaces spawn constructs by guards (if statements), barrier constructs by MPI barriers, send functions by MPI_Isend, and recv functions by MPI_Recv. Note that we leave the issue of the generation of hierarchical MPI as future work; our implementation generates only flat MPI code (a sequence of non-nested tasks).

mpi_task_generation             > MODULE.parallelized_code
        < PROGRAM.entities
        < MODULE.code

MPI Prettyprint Pass

Pass print_parallelizedMPI_code outputs the code decorated with MPI instructions.

print_parallelizedMPI_code      > MODULE.parallelprinted_file
        < PROGRAM.entities
        < MODULE.parallelized_code

We generate MPI code in the format described in Figure 8.3.

// declare variables
MPI_Init(&argc, &argv);
// parse arguments
// main program
MPI_Finalize();

Figure 8.3: A model of an MPI generated code

8.3 Experimental Setting

We carried out experiments to compare the DSC and BDSC algorithms. BDSC provides significant parallelization speedups on both shared and distributed memory systems, on the set of benchmarks that we present in this section. We also present the cost model parameters that depend on the target machine.

8.3.1 Benchmarks: ABF, Harris, Equake, IS and FFT

For comparing DSC and BDSC on both shared and distributed memory systems, we use five benchmarks, ABF, Harris, Equake, IS and FFT, briefly presented below.


ABF

The signal processing benchmark ABF (Adaptive Beam Forming) [51] is a 1065-line program that performs adaptive spatial radar signal processing for an array of smart radar antennas, in order to transmit or receive signals in different directions and enhance these signals by minimizing interference and noise. It has been developed by Thales [104] and is publicly available. Figure 8.4 shows the scheduled SDG of the function main of this benchmark on three clusters (κ0, κ1 and κ2). This function contains 7 parallel loops, while the degree of task parallelism is only three (there are three parallel tasks at most). In Section 8.4, we show how we use these loops to create more parallel tasks.

Figure 8.4: Scheduled SDG of the main function of ABF

Harris

The image processing algorithm Harris [53] is a 105-line corner detector used to identify points of interest in an image. It uses several functions (operators such as auto-correlation) applied to each pixel of an input image. It is used for feature extraction, for motion detection and for image matching. In Harris [94], at most three tasks can be executed in parallel. Figure 6.16 on page 122 shows the scheduled SDG for Harris obtained using three processors (aka clusters).

Equake (SPEC Benchmarks)

The 1432-line SPEC benchmark equake [22] is used in the simulation of seismic wave propagation in large valleys, based on computations performed on an unstructured mesh that locally resolves wavelengths, using the Archimedes finite element method. Equake contains four hot loops: three in the function main and one in the function smvp_opt. A part of the function main is illustrated in Figure 6.5 on page 99. We use these four loops for parallelization.

IS (NAS Parallel Benchmarks)

Integer Sort (IS) is one of the eleven benchmarks in the NAS Parallel Benchmarks suite [85]. The serial version contains 1076 lines. We exploit the hot function rank of IS, which contains a degree of task parallelism of two and two parallel loops.

FFT

Fast Fourier Transform (FFT) is widely used in many areas of science, mathematics and engineering. We use the implementation FFT3 of [8]. In FFT, for N complex points, both time domain decomposition and frequency domain synthesis require three loops. The outer loop runs through the log2 N stages. The middle loop (chunks) moves through each of the individual frequency spectra (chunks) in the stage being worked on. The innermost loop uses the butterfly pattern to compute the points for each frequency of the spectrum. We use the middle loop for parallelization, since it is parallel (each chunk can be launched in parallel).
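A minimal skeleton of this loop nest (our own sketch, with the butterfly body stubbed out; FFT3's actual code differs) makes the structure explicit:

    #include <math.h>

    /* Skeleton of the three FFT loops described above (butterfly body omitted). */
    void fft_skeleton(double *re, double *im, int n)
    {
      int log2n = (int) (log((double) n) / log(2.0));
      for (int s = 1; s <= log2n; s++) {      /* sequential outer loop: log2(n) stages       */
        int m = 1 << s;                       /* span of one chunk at this stage             */
        for (int k = 0; k < n; k += m) {      /* middle loop: independent chunks, parallel   */
          for (int j = 0; j < m / 2; j++) {   /* innermost loop: butterfly computations      */
            /* butterfly(re, im, k + j, k + j + m / 2, s); -- stub */
            (void) re; (void) im;
          }
        }
      }
    }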


8.3.2 Experimental Platforms: Austin and Cmmcluster

We use the two following experimental platforms for our performance evaluation tests.

Austin

The first target machine is a Linux (Ubuntu) shared memory host with a 2-socket AMD quadcore Opteron (8 cores) and M = 16 GB of RAM, running at 2.4 GHz. We call this machine Austin in the remainder of this chapter. gcc 4.6.3 is installed, which supports OpenMP 3.0 (we compile OpenMP programs with gcc, using the -fopenmp and -O3 optimization flags).

Cmmcluster

The second machine that we target for our performance measurements is a Linux (RedHat) distributed memory host with 6 dual-core Intel(R) Xeon(R) processors and M = 32 GB of RAM per processor, running at 2.5 GHz. We call this machine Cmmcluster in the remainder of this chapter. gcc 4.4.6 and Open MPI 1.6.2 are installed and used for compilation (we compile MPI programs with mpicc, using the -O3 optimization flag).

8.3.3 Effective Cost Model Parameters

In the default cost model of PIPS, the number of clock cycles per instruction (CPI) is equal to 1, and the number of clock cycles for the transfer of one byte is β = 1. Instead of using this default cost model, we use an effective model for each target machine, Austin and Cmmcluster. Table 8.1 summarizes the cost model for the integer and floating-point operations of addition, subtraction, multiplication and division, and the transfer cost of one byte. The parameters of the arithmetic operations of this table are extracted from agner.org³, which gives measurements of CPU clock cycles for AMD (Austin) and Intel (Cmmcluster) CPUs. The transfer cost of one byte in number of cycles, β, has been measured using a simple MPI program that sends a number of bytes over the Cmmcluster memory network. Since Austin is a shared memory machine that we use to execute OpenMP codes, we do not need to measure β for this machine (N/A).
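The following is a minimal sketch of such a measurement, in the spirit of the MPI program mentioned above (this is not the program actually used; the message size, the assumed clock frequency constant and the output format are ours):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NBYTES (1 << 20)        /* number of bytes sent in one message      */
    #define CLOCK_HZ 2.5e9          /* Cmmcluster clock frequency (assumption)  */

    int main(int argc, char **argv)
    {
      int rank;
      char *buf = malloc(NBYTES);

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      double t0 = MPI_Wtime();
      if (rank == 0)
        MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
      else if (rank == 1)
        MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      double t1 = MPI_Wtime();
      if (rank == 1)                /* beta ~ elapsed time x clock rate / message size */
        printf("beta ~ %g cycles/byte\n", (t1 - t0) * CLOCK_HZ / NBYTES);
      MPI_Barrier(MPI_COMM_WORLD);
      MPI_Finalize();
      free(buf);
      return 0;
    }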

8.4 Protocol

We have extended PIPS with our implementation in C of BDSC-based hierarchical scheduling (HBDSC).

3. http://www.agner.org/optimize/instruction_tables.pdf


Operation \ Machine        Austin            Cmmcluster
                           int     float     int     float
Addition (+)               2.5     2         3       3
Subtraction (-)            2.5     2         3       3
Multiplication (x)         3       2         11      3
Division (/)               40      30        46      39
One byte transfer (β)      N/A     N/A       2.5     2.5

Table 8.1: CPI for the Austin and Cmmcluster machines, plus the transfer cost of one byte β on the Cmmcluster machine (in cycles)
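To make the use of these parameters concrete, here is a small worked example of our own (the decimal points in the extracted table are our reading, e.g., 2.5 rather than 25, and the data sizes are arbitrary):

    task time  = 10^6 float additions x 2 cycles            = 2 000 000 cycles   (Austin)
    edge cost  = 8 000 bytes x beta  = 8 000 x 2.5 cycles   = 20 000 cycles      (Cmmcluster)

That is, a task executing one million float additions on Austin is charged two million cycles, and an SDG edge transferring an array of 1 000 doubles between two Cmmcluster processors is charged twenty thousand cycles.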

To compute the static execution time and communication cost estimates needed by HBDSC, we relied upon the PIPS run-time complexity analysis and a more realistic, architecture-dependent communication cost matrix (see Table 8.1). For each code S of our test benchmarks, PIPS performed automatic task parallelization and applied our hierarchical scheduling process HBDSC(S, P, M, ⊥) (using either BDSC or DSC) on these sequential programs to yield a schedule σunstructured (unstructured code). Then PIPS transforms the schedule σunstructured via unstructured_to_structured(S, P, σunstructured) to yield a new schedule σstructured. PIPS automatically generated an OpenMP [4] version from the parallel statements encoded in SPIRE(PIPS IR) in σstructured(S), using omp task directives; another version in MPI [3] was also generated. We also applied the DSC scheduling process on these benchmarks and generated the corresponding OpenMP and MPI codes. The execution times have been measured using the gettimeofday time function.
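The measurement harness is a straightforward use of gettimeofday; a minimal sketch (the kernel() call stands for the parallelized benchmark body and is hypothetical) is:

    #include <stdio.h>
    #include <sys/time.h>

    extern void kernel(void);       /* parallelized benchmark body (placeholder) */

    int main(void)
    {
      struct timeval start, end;
      gettimeofday(&start, NULL);
      kernel();
      gettimeofday(&end, NULL);
      double ms = (end.tv_sec - start.tv_sec) * 1e3
                + (end.tv_usec - start.tv_usec) * 1e-3;
      printf("execution time: %.3f ms\n", ms);
      return 0;
    }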

Compilation times for these benchmarks were quite reasonable, the longest (equake) being 84 seconds. In this last instance, most of the time (79 seconds) was spent by PIPS to gather semantic information such as regions, complexities and dependences; our prototype implementation of HBDSC is only responsible for the remaining 5 seconds.

We ran all these parallelized codes on the two shared and distributed memory computing systems presented in Section 8.3.2. To increase the available coarse-grain task parallelism in our test suite, we have used both unmodified and modified versions of our benchmarks. We tiled and fully unrolled by hand the most costly loops in ABF (7 loops), FFT (2 loops) and equake (4 loops); the tiling factor for the BDSC version depends on the number of available processors P (tile size = number of loop iterations / P), while we had to find the proper one for DSC, since DSC puts no constraints on the number of needed processors but returns the number of processors its scheduling requires. For Harris and IS, our experiments have looked at both tiled and untiled versions of the benchmarks. The full unrolling that we apply for our experiments creates a sequence of statements from a nested loop. An example of loop tiling and full unrolling of a loop of the ABF benchmark is illustrated in Figure 8.5. The function COR estimates covariance based on a part of the received signal sel_out, averaged on Nb_rg range gates and nb_pul pulses. Note that the applied loop tiling is equivalent to loop chunking [82].

Parallel loop:

    for (i = 0; i < 19; i++)
        COR(Nb_rg, nb_pul, sel_out[i], CORR_out[i]);

    (a) Loop on Function COR

    for (i = 0; i < 19; i += 5)
        for (j = i; j < min(i+5, 19); j++)
            COR(Nb_rg, nb_pul, sel_out[j], CORR_out[j]);

    (b) Loop block tiling on Function COR (tile size = 5 for P = 4)

    for (j = 0; j < 5; j++)
        COR(Nb_rg, nb_pul, sel_out[j], CORR_out[j]);
    for (j = 5; j < 10; j++)
        COR(Nb_rg, nb_pul, sel_out[j], CORR_out[j]);
    for (j = 10; j < 15; j++)
        COR(Nb_rg, nb_pul, sel_out[j], CORR_out[j]);
    for (j = 15; j < 19; j++)
        COR(Nb_rg, nb_pul, sel_out[j], CORR_out[j]);

    (c) Full loop unrolling on Function COR

Figure 8.5: Steps for the rewriting of data parallelism to task parallelism using the tiling and unrolling transformations
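As a concrete instance of the tile-size rule mentioned above (the rounding is our reading, chosen so that all iterations are covered), the COR loop of Figure 8.5 has 19 iterations, so for P = 4:

    tile size = ceil(19 / 4) = 5

which matches the bounds of the tiled loop in Figure 8.5(b) and of the four chunks obtained after full unrolling in Figure 8.5(c).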

For our experiments, we use M = 16 GB for Austin and M = 32 GB for Cmmcluster. Thus, the schedules used for our experiments satisfy these memory sizes.

8.5 BDSC vs. DSC

We apply our parallelization process on the benchmarks ABF, Harris, Equake and IS. We execute the parallel versions on both the Austin and Cmmcluster machines. As for FFT, we get an OpenMP version and study how more performance can be extracted using streaming languages.

8.5.1 Experiments on Shared Memory Systems

We measured the execution time of the parallel OpenMP codes on P = 1, 2, 4, 6 and 8 cores of Austin.

Figure 8.6 shows the performance results of the generated OpenMP code for the two versions scheduled using DSC and BDSC on ABF and equake. The speedup data show that the DSC algorithm is not scalable on these examples when the number of cores is increased; this is due to the generation of more clusters (task creation overhead) with empty slots (poor potential parallelism and bad load balancing) than with BDSC, a costly decision given that, when the number of clusters exceeds P, they have to share the same core as multiple threads.

Figure 8.6: ABF and equake speedups with OpenMP (speedup vs. number of cores, from 1 to 8, for the ABF-DSC, ABF-BDSC, equake-DSC and equake-BDSC versions)

Figure 8.7 presents the speedups obtained using P = 3, since the maximum parallelism in Harris is three, assuming no exploitation of data parallelism, for two parallel versions: BDSC with and BDSC without tiling of the kernel CoarsitY (we tiled by 3). The performance is given using three different input image sizes: 1024x1024, 2048x1024 and 2048x2048. The best speedup corresponds to the tiled version with BDSC because, in this case, the three cores are fully loaded. The DSC version (not shown in the figure) yields the same results as our versions because the code can be scheduled using three cores.

Figure 8.8 shows the performance results of the generated OpenMP code on the NAS benchmark IS after applying BDSC. The maximum task parallelism without tiling in IS is two, which is shown in the first subchart; the other subcharts are obtained after tiling. The program has been run with three IS input classes (A, B and C [85]). The bad performance of our implementation for Class A programs is due to the large task creation overhead, which dwarfs the potential parallelism gains, even more limited here because of the small size of Class A data.

As for FFT, after tiling and fully unrolling the middle loop of FFT and parallelizing it, an OpenMP version is generated using PIPS. Since we only exploit the parallelism of the middle loop, this is not sufficient to obtain good performance. Indeed, a maximum speedup of 1.5 is obtained for an FFT of size n = 2^20 on Austin (8 cores). The sequential outer loop of log2 n iterations makes the performance that bad.


Figure 8.7: Speedups with OpenMP: impact of tiling (P = 3) (speedup of three threads vs. sequential for Harris-OpenMP and Harris-tiled-OpenMP, for image sizes 1024x1024, 2048x1024 and 2048x2048)

Figure 8.8: Speedups with OpenMP for different class sizes (IS) (speedup of OpenMP vs. sequential for the OMP (2 tasks), OMP-tiled-2, OMP-tiled-4, OMP-tiled-6 and OMP-tiled-8 versions, for Classes A, B and C)

However, our technique can be seen as a first step of parallelization for languages such as OpenStream.

OpenStream [92] is a data-flow extension of OpenMP in which dynamic dependent tasks can be defined. In [91], an efficient version of the FFT is provided where, in addition to parallelizing the middle loop (data parallelism), it exploits possible streaming parallelism between the outer loop stages. Indeed, once a chunk is executed, the chunks that get their dependences satisfied in the next stage start execution (pipeline parallelism). This way, a maximum speedup of 6.6 on a 4-socket AMD quadcore Opteron with 16 cores is obtained for an FFT of size n = 2^20 [91].

Our parallelization of FFT does not automatically generate pipeline parallelism, but it automatically generates parallel tasks that may also be coupled as streams. In fact, to use OpenStream, the user has to write the input streaming code manually. We can say that, thanks to the code generated using our approach, where the parallel tasks are automatically generated, the user just has to connect these tasks by providing (manually) the data flow necessary to exploit additional potential pipeline parallelism.

8.5.2 Experiments on Distributed Memory Systems

We measured the execution time of the parallel codes on P = 1, 2, 4 and 6 processors of Cmmcluster.

Figure 8.9 presents the speedups of the parallel MPI versions vs. the sequential versions of ABF and equake, using P = 2, 4 and 6 processors. As before, the DSC algorithm is not scalable when the number of processors is increased, since the generation of more clusters with empty slots leads to higher process scheduling cost on processors and to a larger communication volume between them.

Figure 8.9: ABF and equake speedups with MPI (speedup vs. number of processors, from 1 to 6, for the ABF-DSC, ABF-BDSC, equake-DSC and equake-BDSC versions)

Figure 8.10 presents the speedups of the parallel MPI versions vs. the sequential version of Harris, using three processors. The tiled version with BDSC gives the same result as the non-tiled version: the communication overhead is so important when the three tiled loops are scheduled on three different processors that BDSC scheduled them on the same processor; this thus led to a schedule equivalent to the one of the non-tiled version. Compared to OpenMP, the speedups decrease when the image size is increased, because the amount of communication between processors increases. The DSC version (not shown in the figure) gives the same results as the BDSC version because the code can be scheduled using three processors.

Figure 8.11 shows the performance results of the generated MPI code on the NAS benchmark IS after application of BDSC. The same analysis as the one for OpenMP applies here, in addition to communication overhead issues.


Figure 8.10: Speedups with MPI: impact of tiling (P = 3) (speedup of three processes vs. sequential for Harris-MPI and Harris-tiled-MPI, for image sizes 1024x1024, 2048x1024 and 2048x2048)

Figure 8.11: Speedups with MPI for different class sizes (IS) (speedup of MPI vs. sequential for the MPI (2 tasks), MPI-tiled-2, MPI-tiled-4 and MPI-tiled-6 versions, for Classes A, B and C)

8.6 Scheduling Robustness

Since our BDSC scheduling heuristic relies on numerical approximations of the execution and communication times of tasks, one needs to assess its sensitivity to the accuracy of these estimations. Since a mathematical analysis of this issue is made difficult by the heuristic nature of BDSC, and in fact of scheduling processes in general, we provide below experimental data that show that our approach is rather robust.

In practice, we ran multiple versions of each benchmark using various static execution and communication cost models:

• the naive variant, in which all execution times and communication costs are supposed constant (only data dependences are enforced during the scheduling process);

• the default BDSC cost model described above;


• a biased BDSC cost model, where we modulated each execution time and communication cost value randomly by at most ∆ (the default BDSC cost model would thus correspond to ∆ = 0).

Our intent with the introduction of different cost models is to assess how small to large differences in our estimation of task times and communication costs impact the performance of BDSC-scheduled parallel code. We would expect that parallelization based on the naive cost model would yield the worst schedules, thus motivating our use of complexity analysis for parallelization purposes if the schedules that use our default cost model are indeed better. Adding small random biases to task times and communication costs should not modify the schedules too much (to demonstrate stability), while adding larger ones might, showing the quality of the default cost model used for parallelization.
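A minimal sketch of the biasing scheme described here (our own illustration; the actual instrumentation used for these experiments may differ) simply inflates each static estimate by a random increment of at most ∆:

    #include <stdlib.h>

    /* Return a cost increased by a random fraction of at most delta
       (delta = 0.5 means "at most 50%"); delta = 0 recovers the default model. */
    static double biased_cost(double estimated_cost, double delta)
    {
      double r = (double) rand() / RAND_MAX;   /* r uniform in [0, 1] */
      return estimated_cost * (1.0 + delta * r);
    }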

Table 8.2 provides, for four benchmarks (Harris, ABF, equake and IS) and each execution environment (OpenMP and MPI), the worst execution time obtained within batches of about 20 runs of programs scheduled using the naive, default and biased cost models. For this last case, we only kept in the table the entries corresponding to significant values of ∆, namely those at which, for at least one benchmark, the running time changed. So, for instance, when running ABF on OpenMP, the naive approach run time is 321 ms, while BDSC clocks at 214; adding random increments of up to, but not including, 80% to the task communication and execution estimations provided by our cost model (Section 6.3) does not change the scheduling, and thus the running time. At 80%, the running time increases to 230, and reaches 297 when ∆ = 3000%.

Benchmark   Language   BDSC   Naive   ∆=50%   80%    100%   200%   1000%   3000%
Harris      OpenMP     153    277     153     153    153    153    153     277
            MPI        303    378     303     303    303    303    303     378
ABF         OpenMP     214    321     214     230    230    246    246     297
            MPI        240    310     240     260    260    287    287     310
equake      OpenMP     58     134     58      58     80     80     80      102
            MPI        106    206     106     106    162    162    162     188
IS          OpenMP     16     35      20      20     20     25     25      29
            MPI        25     50      32      32     32     39     39      46

Table 8.2: Run-time sensitivity of BDSC with respect to static cost estimation (in ms for Harris and ABF; in s for equake and IS)

As expected, the naive variant always provides schedules that have the worst execution times, thus motivating the introduction of performance estimation in the scheduling process. Even more interestingly, our experiments show that one needs to introduce rather large task time and communication cost estimation errors, i.e., large values of ∆, to make the BDSC-based scheduling process switch to less efficient schedules. This set of experimental data thus suggests that BDSC is a rather useful and robust heuristic, well adapted to the efficient parallelization of scientific benchmarks.

8.7 Faust Parallel Scheduling vs. BDSC

In music the passions enjoy themselves. Friedrich Nietzsche

Faust (Functional AUdio STream) [86] is a programming language for real-time signal processing and synthesis. FAUST programs (DSP code) are translated into equivalent imperative programs (C, C++, Java, etc.). In this section, we compare our approach of generating omp task directives with the Faust approach of generating omp section directives.

8.7.1 OpenMP Version in Faust

The Faust compiler first produces a single sample computation loop using the scalar code generator. Then the vectorial code generator splits the sample processing loop into several simpler loops that communicate by vectors, and produces a DAG. Next, this DAG is topologically sorted in order to detect loops that can be executed in parallel. After that, the parallel scheme generates OpenMP parallel section directives using this DAG, in which each node is a computation loop; the granularity of tasks is only at the loop level. Finally, the compiler generates OpenMP sections directives, in which a pragma section is generated for each loop.
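The difference between the two generation schemes can be summarized on a toy example of our own (f1, f2, v1, v2, in and n are hypothetical): Faust emits one omp section per computation loop, whereas our BDSC-based pass emits omp task constructs, which the OpenMP runtime may schedule dynamically:

    void compare(int n, const float *in, float *v1, float *v2,
                 float (*f1)(float), float (*f2)(float))
    {
      /* Faust-style generation: one omp section per computation loop. */
      #pragma omp parallel sections
      {
        #pragma omp section
        for (int i = 0; i < n; i++) v1[i] = f1(in[i]);
        #pragma omp section
        for (int i = 0; i < n; i++) v2[i] = f2(in[i]);
      }

      /* BDSC-based generation: omp task constructs, scheduled dynamically. */
      #pragma omp parallel
      #pragma omp single
      {
        #pragma omp task
        for (int i = 0; i < n; i++) v1[i] = f1(in[i]);
        #pragma omp task
        for (int i = 0; i < n; i++) v2[i] = f2(in[i]);
        #pragma omp taskwait
      }
    }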

8.7.2 Faust Programs: Karplus32 and Freeverb

In order to compare our approach with Faust's in generating task parallelism, we use Karplus32 and Freeverb as running examples [7].

Karplus32

The 200-line Karplus32 program is used to simulate the sound produced by a string excitation passed through a resonator to modify the final sound. It is an advanced version of a typical Karplus algorithm, with 32 resonators in parallel instead of one resonator only.

Freeverb

The 800-line Freeverb program is free software that implements artificial reverberation. It filters echoes to generate a signal with reverberation. It uses four Schroeder allpass filters in series and eight parallel Schroeder-Moorer filtered-feedback comb-filters for each audio channel.


8.7.3 Experimental Comparison

Figure 8.12 shows the performance results of the generated OpenMP code on Karplus32, after application of BDSC and Faust, for two sample buffer sizes, count = 1024 and 2048. Note that BDSC is applied to the C code generated from these two programs by Faust. The hot function in Karplus32 (compute, 90 lines) contains only 2 sections in parallel; sequential code represents about 60% of the whole code of Function compute. Since we have only two threads that execute only two tasks, Faust execution is a little better than BDSC. The benefit of the dynamic scheduling of omp task cannot be shown in this case, due to the small amount of parallelism in this program.

Figure 8.12: Run-time comparison (BDSC vs. Faust for Karplus32) (speedup of two threads vs. sequential, for count = 1024 and 2048)

Figure 8.13 shows the performance results of the generated OpenMP code on Freeverb, after application of BDSC and Faust, for two sizes, count = 1024 and 2048. The hot function in Freeverb (compute, 500 lines) contains up to 16 sections in parallel (the degree of task parallelism is equal to 16); sequential code represents about 3% of the whole code of Function compute. BDSC generates better schedules since it is based on sophisticated cost models and uses top level and bottom level values for ordering, while Faust proceeds by a topological ordering (only top level) and is based only on dependence analysis. In this case, results show that BDSC generates better schedules and that the generation of omp task instead of omp section is beneficial. Furthermore, the dynamic scheduler of OpenMP can make better decisions at run time.

Figure 8.13: Run-time comparison (BDSC vs. Faust for Freeverb) (speedup of eight threads vs. sequential, for count = 1024 and 2048)

8.8 Conclusion

This chapter describes the integration of SPIRE and HBDSC within the PIPS compiler infrastructure and their application to the automatic task parallelization of seven sequential programs. We illustrate the impact of this integration using the signal processing benchmark ABF (Adaptive Beam Forming), the image processing benchmark Harris, the SPEC benchmark equake and the NAS parallel benchmark IS, on both shared and distributed memory systems. We also provide a comparative study between our task parallelization implementation in PIPS and that of the audio signal processing language Faust, using two Faust programs, Karplus32 and Freeverb. Experimental results show that BDSC provides more efficient schedules than DSC and the Faust scheduler in these cases.

The application of BDSC for the parallelization of the FFT (OpenMP) shows the limits of our task parallelization approach, but we suggest that an interface with streaming languages such as OpenStream could be derived from our work.

Chapter 9

Conclusion

He who lets the most beautiful story of his life pass him by will be left with only the age of his regrets, and all the sighs in the world cannot soothe his soul. Yasmina Khadra

Designing an auto-parallelizing compiler is a big challenge: the interest in compiling for parallel architectures has exploded in the last decade with the widespread availability of manycores and multiprocessors, and the need for automatic task parallelization, which is a tricky task, is still unsatisfied. In this thesis, we have exploited the parallelism present in applications in order to take advantage of the performance benefits that multiprocessors can provide. Indeed, we have delegated the "think parallel" mindset to the compiler, which detects parallelism in sequential codes and automatically writes the equivalent efficient parallel code.

We have developed an automatic task parallelization methodology for compilers; the key characteristics we focus on are resource constraints and static scheduling. It contains all the techniques required to decompose applications into tasks and generate equivalent parallel code, using a generic approach that targets different parallel languages and different architectures. We have applied this extension methodology in the existing tool PIPS, a comprehensive source-to-source compilation platform.

To conclude this dissertation, we review its main contributions and results, and present future work opportunities.

9.1 Contributions

This thesis contains five main contributions.

HBDSC (Chapter 6)

We have extended the DSC scheduling algorithm to handle simultaneously two resource constraints, namely a bounded amount of memory per processor and a bounded number of processors, which are key parameters when scheduling tasks on actual multiprocessors. We have called this extension Bounded DSC (BDSC). The practical use of BDSC in a hierarchical manner led to the development of a BDSC-based hierarchical scheduling algorithm (HBDSC).

A standard scheduling algorithm works on a DAG (a sequence of statements). To apply HBDSC on real applications, our approach was to first build a Sequence Dependence DAG (SDG), which is the input of the BDSC algorithm. Then we used the code, presented in the form of an AST, to define a hierarchical mapping function, which we call H, that maps each sequence statement of the code to its SDG. H is used as input of the HBDSC algorithm.

Since the volume of data used or exchanged by SDG tasks and their execution times are key factors in the BDSC scheduling process, we estimate this information as precisely as possible. We provide new cost models, based on execution time complexity estimation and convex polyhedral approximations of array sizes, for the labeling of SDG vertices and edges. We look first at the size of array regions to estimate precisely the communication costs and the data sizes used by tasks. Then we compute Ehrhart polynomials that represent the size of the messages to communicate and the volume of data, in number of bytes; they are then converted into numbers of cycles. Besides, the execution time of each task is computed using a static execution time approach based on a program complexity analysis provided by PIPS; complexities, in number of cycles, are also represented via polynomials. Finally, we use a static method, when possible, or a code instrumentation method, to convert these polynomials into the numerical values required by BDSC.

Path Transformer Analysis (Chapter 6)

The cost models that we define in this thesis use the affine relations found in array regions to estimate the communication and data volumes induced by the execution of statements. We used a set of operations on array regions, such as the function regions_intersection (see Section 6.3.1). However, one must be careful, since region operations should be defined with respect to a common memory store. In this thesis, we introduced a new analysis that we call "path transformer" analysis. A path transformer makes it possible to compare array regions of statements originally defined in different memory stores. We present a recursive algorithm for the computation of path transformers between two statements, for specific iterations of their enclosing loops (if they exist). This algorithm computes affine relations between program variables at any two points in the program, from the memory store of the first statement to the memory store of the second one, along any sequential execution that links the two memory stores.

SPIRE (Chapter 4)

An automatic parallelization platform should be adapted to different parallel languages. In this thesis, we introduced SPIRE, a new and general 3-step extension methodology for mapping any intermediate representation (IR) used in compilation platforms for representing sequential programming constructs to a parallel IR. This extension of an existing IR introduces (1) a parallel execution attribute for each group of statements, (2) a high-level synchronization attribute on each statement and an API for low-level synchronization events, and (3) two built-ins for implementing communications in message-passing memory systems.

A formal semantics of SPIRE transformational definitions is specified using a two-tiered approach: a small-step operational semantics for its base parallel concepts and a rewriting mechanism for high-level constructs. The SPIRE methodology is presented via a use case, the intermediate representation of PIPS, a powerful source-to-source compilation infrastructure for Fortran and C. In addition, we apply SPIRE to another IR, namely the one of the widely used LLVM compilation infrastructure. This semantics could be used in the future for proving the legality of parallel transformations of parallel codes.

Parallel Code Transformations and Generation (Chapter 7)

In order to test the flexibility of the parallelization approach presented in this thesis, we have laid the first brick of a framework of parallel code transformations and generation, using our BDSC- and SPIRE-based hierarchical task parallelization techniques. We provide two parallel transformations of SPIRE-based constructs: first, from unstructured to structured parallelism, for a high-level abstraction of parallel constructs; and second, the conversion of shared memory programs to distributed ones. Moreover, we show the generation of task parallelism and communications at different AST levels: equilevel and hierarchical. Finally, we generate both OpenMP (hierarchical) and MPI (only flat) from the same parallel IR derived from SPIRE.

Implementation and Experimental Results (Chapter 8)

A prototype of the algorithms presented in this thesis has been implemented in the PIPS source-to-source compilation framework: (1) HBDSC-based automatic parallelization for programs encoded in SPIRE(PIPS IR), including BDSC, HBDSC, the SDG, the cost models and the path transformer, plus DSC for comparison purposes; (2) SPIRE-based parallel code generation, targeting two parallel languages, OpenMP and MPI; and (3) two SPIRE-based parallel code transformations, from unstructured to structured code and from shared to distributed memory code.

We apply our parallelization technique to scientific applications. We provide performance measurements for our parallelization approach based on five significant programs, the image and signal processing benchmarks Harris and ABF, the SPEC2001 benchmark equake, the NAS parallel benchmark IS and the FFT, targeting both shared and distributed memory architectures. We use their automatic translations into two parallel languages, OpenMP and MPI. Finally, we provide a comparative study between our task parallelization implementation in PIPS and that of the audio signal processing language Faust, using two Faust applications, Karplus32 and Freeverb. Our results illustrate the ability of SPIRE(PIPS IR) to efficiently express the parallelism present in scientific applications. Moreover, our experimental data suggest that our parallelization methodology is able to provide parallel codes that encode a significant amount of parallelism and is thus well adapted to the design of parallel target formats for the efficient parallelization of scientific applications on both shared and distributed memory systems.

9.2 Future Work

The auto-parallelization problem is a set of many tricky subproblems, as illustrated throughout this thesis and in the summary of contributions presented above. Many new ideas and opportunities for further research have popped up all along this work.

SPIRE

SPIRE is a "work in progress" for the extension of sequential IRs to completely fit multiple languages and different models. Future work will address the representation via SPIRE of the PGAS memory model, of other parallel execution paradigms such as speculation, and of more programming features such as exceptions. Another interesting question is to see how many of the existing program transformations performed by compilers using sequential IRs can be preserved when these are extended via SPIRE.

HBDSC Scheduling

HBDSC assumes that all processors are homogeneous; it does not handle heterogeneous devices efficiently. However, one can probably accommodate these architectures by adding another dimension to our cost models. We suggest adding a "which processor" parameter to all the functions defining the cost models, such as task_time (its value would then depend on the processor the task is mapped on) and edge_cost (its value would depend on the processors the two tasks of the edge are mapped on). Moreover, each processor κi has its own memory Mi. Thus, our parallelization process may subsequently target an adapted language for heterogeneous architectures such as StarSs [21]. StarSs introduces an extension to the task mechanism in OpenMP 3.0. It adds many features to the construct pragma omp task, namely (1) dependences between tasks, using the clauses input, output and inout, and (2) targeting of a task to hardware accelerators, using pragma omp target device(device-name-list), knowing that an accelerator can be an SPE, a GPU, etc.


Moreover, it might be interesting to try to find a more efficient hierarchical processor allocation strategy, which would address load balancing and task granularity issues, in order to yield an even better HBDSC-based parallelization process.

Cost Models

Currently, there are some cases where our cost models cannot provide precise or even approximate information. When the array is a pointer (for example with the declaration float *A, where A is used as an array), the compiler has no information about the size of the array; currently, for such cases, we have to change the programmer's code manually to help the compiler. Moreover, the complexity estimation of a while loop is currently not computed, but there is already on-going work at CRI on this issue, based on transformer analysis.

Parallel Code Transformations and Generation

Future work could address more transformations for parallel languages (à la [82]) encoded in SPIRE. Moreover, the optimization of communications in the generated code, by eliminating redundant communications and aggregating small messages into larger ones, is an important issue, since our current implementation communicates more data than necessary.

In our implementation of MPI code generation from SPIRE code, we left the implementation of hierarchical MPI as future work. Besides, another type of hierarchical MPI can be obtained by merging MPI and OpenMP together (hybrid programming). However, since we use in this thesis a hierarchical approach in both HBDSC and SPIRE generation, OpenMP can be seen in this case as a sub-hierarchical model of MPI, i.e., OpenMP tasks could be enclosed within MPI ones.

In the eye of the small, small things are huge, but for the great souls, huge things appear small. Al-Mutanabbi

Bibliography

[1] The Mandelbrot Set. http://warp.povusers.org/Mandelbrot.

[2] Cilk 5.4.6 Reference Manual. Supercomputing Technologies Group, MIT Laboratory for Computer Science. http://supertech.lcs.mit.edu/cilk, 1998.

[3] MPI: A Message-Passing Interface Standard. http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf, Sep 2012.

[4] OpenMP Application Program Interface. http://www.openmp.org/mp-documents/OpenMP_4.0_RC2.pdf, Mar 2013.

[5] X10 Language Specification, Feb 2013.

[6] Chapel Language Specification 0.796. Cray Inc., 901 Fifth Avenue, Suite 1000, Seattle, WA 98164, Oct 21, 2010.

[7] Faust, an Audio Signal Processing Language. http://faust.grame.fr.

[8] StreamIt, a Programming Language and a Compilation Infrastructure. http://groups.csail.mit.edu/cag/streamit.

[9] B. Ackland, A. Anesko, D. Brinthaupt, S. Daubert, A. Kalavade, J. Knobloch, E. Micca, M. Moturi, C. Nicol, J. O'Neill, J. Othmer, E. Sackinger, K. Singh, J. Sweet, C. Terman, and J. Williams. A single-chip, 1.6-billion, 16-b MAC/s multiprocessor DSP. Solid-State Circuits, IEEE Journal of, 35(3):412–424, Mar 2000.

[10] T. L. Adam, K. M. Chandy, and J. R. Dickson. A Comparison of List Schedules for Parallel Processing Systems. Commun. ACM, 17(12):685–690, Dec 1974.

[11] S. V. Adve and K. Gharachorloo. Shared Memory Consistency Models: A Tutorial. IEEE Computer, 29:66–76, 1996.

[12] S. Alam, R. Barrett, J. Kuehn, and S. Poole. Performance Characterization of a Hierarchical MPI Implementation on Large-scale Distributed-memory Platforms. In Parallel Processing, 2009. ICPP '09. International Conference on, pages 132–139, 2009.

[13] E. Allen, D. Chase, J. Hallett, V. Luchangco, J.-W. Maessen, S. Ryu, G. L. Steele Jr., and S. Tobin-Hochstadt. The Fortress Language Specification. Technical report, Sun Microsystems, Inc., 2007.


[14] R. Allen and K. Kennedy. Automatic Translation of FORTRAN Programs to Vector Form. ACM Trans. Program. Lang. Syst., 9(4):491–542, Oct 1987.

[15] S. P. Amarasinghe and M. S. Lam. Communication Optimization and Code Generation for Distributed Memory Machines. In Proceedings of the ACM SIGPLAN 1993 Conference on Programming Language Design and Implementation, PLDI '93, pages 126–138, New York, NY, USA, 1993. ACM.

[16] G. M. Amdahl. Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities. Solid-State Circuits Newsletter, IEEE, 12(3):19–20, Summer 2007.

[17] M. Amini, F. Coelho, F. Irigoin, and R. Keryell. Static Compilation Analysis for Host-Accelerator Communication Optimization. In 24th Int. Workshop on Languages and Compilers for Parallel Computing (LCPC), Fort Collins, Colorado, USA, Sep 2011. Also Technical Report MINES ParisTech A476CRI.

[18] C. Ancourt, F. Coelho, and F. Irigoin. A Modular Static Analysis Approach to Affine Loop Invariants Detection. Electr. Notes Theor. Comput. Sci., 267(1):3–16, 2010.

[19] C. Ancourt, B. Creusillet, F. Coelho, F. Irigoin, P. Jouvelot, and R. Keryell. PIPS: a Workbench for Interprocedural Program Analyses and Parallelization. In Meeting on Data Parallel Languages and Compilers for Portable Parallel Computing, 1994.

[20] C. Ancourt and F. Irigoin. Scanning Polyhedra with DO Loops. In Proceedings of the Third ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP '91, pages 39–50, New York, NY, USA, 1991. ACM.

[21] E. Ayguade, R. M. Badia, P. Bellens, D. Cabrera, A. Duran, R. Ferrer, M. Gonzalez, F. D. Igual, D. Jimenez-Gonzalez, and J. Labarta. Extending OpenMP to Survive the Heterogeneous Multi-Core Era. International Journal of Parallel Programming, pages 440–459, 2010.

[22] H. Bao, J. Bielak, O. Ghattas, L. F. Kallivokas, D. R. O'Hallaron, J. R. Shewchuk, and J. Xu. Large-scale Simulation of Elastic Wave Propagation in Heterogeneous Media on Parallel Computers. Computer Methods in Applied Mechanics and Engineering, 152(1–2):85–102, 1998.

[23] A. Basumallik, S.-J. Min, and R. Eigenmann. Programming Distributed Memory Systems Using OpenMP. In Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International, pages 1–8, 2007.

[24] N. Benoit and S. Louise. Kimble, a Hierarchical Intermediate Representation for Multi-Grain Parallelism. In F. Bouchez, S. Hack, and E. Visser, editors, Proceedings of the Workshop on Intermediate Representations, pages 21–28, 2011.

[25] G. Blake, R. Dreslinski, and T. Mudge. A Survey of Multicore Processors. Signal Processing Magazine, IEEE, 26(6):26–37, Nov 2009.

[26] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An Efficient Multithreaded Runtime System. In Journal of Parallel and Distributed Computing, pages 207–216, 1995.

[27] U. Bondhugula. Automatic Distributed Memory Code Generation using the Polyhedral Framework. Technical Report IISc-CSA-TR-2011-3, Indian Institute of Science, Bangalore, 2011.

[28] C. T. Brown, L. S. Liebovitch, and R. Glendon. Levy Flights in Dobe Ju'hoansi Foraging Patterns. Human Ecology, 35(1):129–138, 2007.

[29] S. Campanoni, T. Jones, G. Holloway, V. J. Reddi, G.-Y. Wei, and D. Brooks. HELIX: Automatic Parallelization of Irregular Programs for Chip Multiprocessing. In Proceedings of the Tenth International Symposium on Code Generation and Optimization, CGO '12, pages 84–93, New York, NY, USA, 2012. ACM.

[30] V. Cave, J. Zhao, and V. Sarkar. Habanero-Java: the New Adventures of Old X10. In 9th International Conference on the Principles and Practice of Programming in Java (PPPJ), Aug 2011.

[31] Y. Choi, Y. Lin, N. Chong, S. Mahlke, and T. Mudge. Stream Compilation for Real-Time Embedded Multicore Systems. In Proceedings of the 7th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO '09, pages 210–220, Washington, DC, USA, 2009.

[32] F. Coelho, P. Jouvelot, C. Ancourt, and F. Irigoin. Data and Process Abstraction in PIPS Internal Representation. In F. Bouchez, S. Hack, and E. Visser, editors, Proceedings of the Workshop on Intermediate Representations, pages 77–84, 2011.

[33] T. U. Consortium. http://upc.gwu.edu/documentation.html, Jun 2005.


[34] B. Creusillet. Analyses de Régions de Tableaux et Applications. PhD thesis, MINES ParisTech, A295CRI, Décembre 1996.

[35] B. Creusillet and F. Irigoin. Interprocedural Array Region Analyses. International Journal of Parallel Programming, 24(6):513–546, 1996.

[36] E. Cuevas, A. Garcia, F. J. JFernandez, R. J. Gadea, and J. Cordon. Importance of Simulations for Nuclear and Aeronautical Inspections with Ultrasonic and Eddy Current Testing. Simulation in NDT, Online Workshop in www.ndt.net, Sep 2010.

[37] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck. Efficiently Computing Static Single Assignment Form and the Control Dependence Graph. ACM Trans. Program. Lang. Syst., 13(4):451–490, Oct 1991.

[38] M. I. Daoud and N. N. Kharma. GATS 1.0: a Novel GA-based Scheduling Algorithm for Task Scheduling on Heterogeneous Processor Nets. In GECCO'05, pages 2209–2210, 2005.

[39] J. B. Dennis, G. R. Gao, and K. W. Todd. Modeling The Weather With a Data Flow Supercomputer. IEEE Trans. Computers, pages 592–603, 1984.

[40] E. W. Dijkstra. Re: "Formal Derivation of Strongly Correct Parallel Programs" by Axel van Lamsweerde and M. Sintzoff. Circulated privately, 1977.

[41] E. Ehrhart. Polynômes arithmétiques et méthode de polyèdres en combinatoire. International Series of Numerical Mathematics, page 35, 1977.

[42] A. E. Eichenberger, J. K. O'Brien, K. M. O'Brien, P. Wu, T. Chen, P. H. Oden, D. A. Prener, J. C. Shepherd, B. So, Z. Sura, A. Wang, T. Zhang, P. Zhao, M. K. Gschwind, R. Archambault, Y. Gao, and R. Koo. Using Advanced Compiler Technology to Exploit the Performance of the Cell Broadband Engine Architecture. IBM Syst. J., 45(1):59–84, Jan 2006.

[43] P. Feautrier. Dataflow Analysis of Array and Scalar References. International Journal of Parallel Programming, 20, 1991.

[44] J. Ferrante, K. J. Ottenstein, and J. D. Warren. The Program Dependence Graph and Its Use in Optimization. ACM Trans. Program. Lang. Syst., 9(3):319–349, July 1987.

[45] M. J. Flynn. Some Computer Organizations and Their Effectiveness. Computers, IEEE Transactions on, C-21(9):948–960, Sep 1972.


[46] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1990.

[47] A. Gerasoulis, S. Venugopal, and T. Yang. Clustering Task Graphs for Message Passing Architectures. SIGARCH Comput. Archit. News, 18(3b):447–456, June 1990.

[48] M. Girkar and C. D. Polychronopoulos. Automatic Extraction of Functional Parallelism from Ordinary Programs. IEEE Trans. Parallel Distrib. Syst., 3:166–178, Mar 1992.

[49] G. Goumas, N. Drosinos, M. Athanasaki, and N. Koziris. Message-Passing Code Generation for Non-rectangular Tiling Transformations. Parallel Computing, 32, 2006.

[50] Graphviz. Graph Visualization Software. http://www.graphviz.org.

[51] L. Griffiths. A Simple Adaptive Algorithm for Real-Time Processing in Antenna Arrays. Proceedings of the IEEE, 57:1696–1704, 1969.

[52] R. Habel, F. Silber-Chaussumier, and F. Irigoin. Generating Efficient Parallel Programs for Distributed Memory Systems. Technical Report CRIA-523, MINES ParisTech and Telecom SudParis, 2013.

[53] C. Harris and M. Stephens. A Combined Corner and Edge Detector. In Proceedings of the 4th Alvey Vision Conference, pages 147–151, 1988.

[54] M. J. Harrold, B. Malloy, and G. Rothermel. Efficient Construction of Program Dependence Graphs. SIGSOFT Softw. Eng. Notes, 18(3):160–170, July 1993.

[55] J. Howard, S. Dighe, Y. Hoskote, S. Vangal, D. Finan, G. Ruhl, D. Jenkins, H. Wilson, N. Borkar, G. Schrom, F. Pailet, S. Jain, T. Jacob, S. Yada, S. Marella, P. Salihundam, V. Erraguntla, M. Konow, M. Riepen, G. Droege, J. Lindemann, M. Gries, T. Apel, K. Henriss, T. Lund-Larsen, S. Steibl, S. Borkar, V. De, R. Van der Wijngaart, and T. Mattson. A 48-Core IA-32 Message-Passing Processor with DVFS in 45nm CMOS. In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2010 IEEE International, pages 108–109, Feb 2010.

[56] A. Hurson, J. T. Lim, K. M. Kavi, and B. Lee. Parallelization of DOALL and DOACROSS Loops – a Survey. In M. V. Zelkowitz, editor, Emphasizing Parallel Programming Techniques, volume 45 of Advances in Computers, pages 53–103. Elsevier, 1997.


[57] Insieme. Insieme – an Optimization System for OpenMP, MPI and OpenCL Programs. http://www.dps.uibk.ac.at/insieme/architecture.html, 2011.

[58] F. Irigoin, P. Jouvelot, and R. Triolet. Semantical Interprocedural Parallelization: An Overview of the PIPS Project. In ICS, pages 244–251, 1991.

[59] K. Jainandunsing. Optimal Partitioning Scheme for Wavefront/Systolic Array Processors. In Proceedings of IEEE Symposium on Circuits and Systems, pages 940–943, 1986.

[60] P. Jouvelot and R. Triolet. Newgen: A Language Independent Program Generator. Technical report CRIA-191, MINES ParisTech, Jul 1989.

[61] G. Karypis and V. Kumar. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM Journal on Scientific Computing, 1998.

[62] H. Kasahara, H. Honda, A. Mogi, A. Ogura, K. Fujiwara, and S. Narita. A Multi-Grain Parallelizing Compilation Scheme for OSCAR (Optimally Scheduled Advanced Multiprocessor). In Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing, pages 283–297, London, UK, 1992. Springer-Verlag.

[63] A. Kayi and T. El-Ghazawi. PGAS Languages. wsdmhp09.hpcl.gwu.edu/kayi.pdf, 2009.

[64] P. Kenyon, P. Agrawal, and S. Seth. High-Level Microprogramming: an Optimizing C Compiler for a Processing Element of a CAD Accelerator. In Microprogramming and Microarchitecture, Micro 23, Proceedings of the 23rd Annual Workshop and Symposium, Workshop on, pages 97–106, Nov 1990.

[65] D. Khaldi, P. Jouvelot, and C. Ancourt. Parallelizing with BDSC, a Resource-Constrained Scheduling Algorithm for Shared and Distributed Memory Systems. Technical Report CRIA-499 (submitted to Parallel Computing), MINES ParisTech, 2012.

[66] D. Khaldi, P. Jouvelot, C. Ancourt, and F. Irigoin. Task Parallelism and Data Distribution: An Overview of Explicit Parallel Programming Languages. In H. Kasahara and K. Kimura, editors, LCPC, volume 7760 of Lecture Notes in Computer Science, pages 174–189. Springer, 2012.


[67] D. Khaldi, P. Jouvelot, F. Irigoin, and C. Ancourt. SPIRE: A Methodology for Sequential to Parallel Intermediate Representation Extension. In Proceedings of the 17th Workshop on Compilers for Parallel Computing, CPC'13, 2013.

[68] M. A. Khan. Scheduling for Heterogeneous Systems using Constrained Critical Paths. Parallel Computing, 38(4–5):175–193, 2012.

[69] Khronos OpenCL Working Group. The OpenCL Specification, Nov 2012.

[70] B. Kruatrachue and T. G. Lewis. Duplication Scheduling Heuristics (DSH): A New Precedence Task Scheduler for Parallel Processor Systems. Oregon State University, Corvallis, OR, 1987.

[71] S. Kumar, D. Kim, M. Smelyanskiy, Y.-K. Chen, J. Chhugani, C. J. Hughes, C. Kim, V. W. Lee, and A. D. Nguyen. Atomic Vector Operations on Chip Multiprocessors. SIGARCH Comput. Archit. News, 36(3):441–452, June 2008.

[72] Y.-K. Kwok and I. Ahmad. Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors. ACM Comput. Surv., 31:406–471, Dec 1999.

[73] J. Larus and C. Kozyrakis. Transactional Memory. Commun. ACM, 51:80–88, Jul 2008.

[74] S. Lee, S.-J. Min, and R. Eigenmann. OpenMP to GPGPU: a Compiler Framework for Automatic Translation and Optimization. In Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '09, pages 101–110, New York, NY, USA, 2009. ACM.

[75] The LLVM Development Team. The LLVM Reference Manual (Version 2.6). http://llvm.org/docs/LangRef.html, Mar 2010.

[76] V. Maisonneuve. Convex Invariant Refinement by Control Node Splitting: a Heuristic Approach. Electron. Notes Theor. Comput. Sci., 288:49–59, Dec 2012.

[77] J. Merrill. GENERIC and GIMPLE: a New Tree Representation for Entire Functions. In GCC Developers Summit 2003, pages 171–180, 2003.

[78] D. Millot, A. Muller, C. Parrot, and F. Silber-Chaussumier. STEP: a Distributed OpenMP for Coarse-Grain Parallelism Tool. In Proceedings of the 4th International Conference on OpenMP in a New Era of Parallelism, IWOMP'08, pages 83–99, Berlin, Heidelberg, 2008. Springer-Verlag.


[79] D. I. Moldovan and J. A. B. Fortes. Partitioning and Mapping Algorithms into Fixed Size Systolic Arrays. IEEE Trans. Comput., 35(1):1–12, Jan 1986.

[80] A. Moody, D. Ahn, and B. Supinski. Exascale Algorithms for Generalized MPI_Comm_split. In Y. Cotronis, A. Danalis, D. Nikolopoulos, and J. Dongarra, editors, Recent Advances in the Message Passing Interface, volume 6960 of Lecture Notes in Computer Science, pages 9–18. Springer Berlin Heidelberg, 2011.

[81] G. Moore. Cramming More Components Onto Integrated Circuits. Proceedings of the IEEE, 86(1):82–85, Jan 1998.

[82] V. K. Nandivada, J. Shirako, J. Zhao, and V. Sarkar. A Transformation Framework for Optimizing Task-Parallel Programs. ACM Trans. Program. Lang. Syst., 35(1):3:1–3:48, Apr 2013.

[83] C. J. Newburn and J. P. Shen. Automatic Partitioning of Signal Processing Programs for Symmetric Multiprocessors. In IEEE PACT, pages 269–280. IEEE Computer Society, 1996.

[84] D. Novillo. OpenMP and Automatic Parallelization in GCC. In the Proceedings of the GCC Developers Summit, Jun 2006.

[85] NPB. NAS Parallel Benchmarks. http://www.nas.nasa.gov/publications/npb.html.

[86] Y. Orlarey, D. Fober, and S. Letz. Adding Automatic Parallelization to Faust. In Linux Audio Conference, 2009.

[87] D. A. Padua, editor. Encyclopedia of Parallel Computing. Springer, 2011.

[88] D. A. Padua and M. J. Wolfe. Advanced Compiler Optimizations for Supercomputers. Commun. ACM, 29:1184–1201, Dec 1986.

[89] S. Pai, R. Govindarajan, and M. J. Thazhuthaveetil. PLASMA: Portable Programming for SIMD Heterogeneous Accelerators. In Workshop on Language, Compiler, and Architecture Support for GPGPU, held in conjunction with HPCA/PPoPP 2010, Bangalore, India, Jan 9, 2010.

[90] PolyLib. A Library of Polyhedral Functions. http://icps.u-strasbg.fr/polylib.

[91] A. Pop and A. Cohen. A Stream-Computing Extension to OpenMP. In Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers, HiPEAC '11, pages 5–14, New York, NY, USA, 2011. ACM.


[92] A. Pop and A. Cohen. OpenStream: Expressiveness and Data-Flow Compilation of OpenMP Streaming Programs. ACM Trans. Archit. Code Optim., 9(4):53:1–53:25, Jan 2013.

[93] T. Saidani. Optimisation multi-niveau d'une application de traitement d'images sur machines parallèles. PhD thesis, Université Paris-Sud, Nov 2012.

[94] T. Saidani, L. Lacassagne, J. Falcou, C. Tadonki, and S. Bouaziz. Parallelization Schemes for Memory Optimization on the Cell Processor: A Case Study on the Harris Corner Detector. In P. Stenstrom, editor, Transactions on High-Performance Embedded Architectures and Compilers III, volume 6590 of Lecture Notes in Computer Science, pages 177–200. Springer Berlin Heidelberg, 2011.

[95] V. Sarkar. Synchronization Using Counting Semaphores. In ICS'88, pages 627–637, 1988.

[96] V. Sarkar. Partitioning and Scheduling Parallel Programs for Multiprocessors. MIT Press, Cambridge, MA, USA, 1989.

[97] V. Sarkar. COMP 322: Principles of Parallel Programming. http://www.cs.rice.edu/~vs3/PDF/comp322-lec9-f09-v4.pdf, 2009.

[98] V. Sarkar and B. Simons. Parallel Program Graphs and their Classification. In U. Banerjee, D. Gelernter, A. Nicolau, and D. A. Padua, editors, LCPC, volume 768 of Lecture Notes in Computer Science, pages 633–655. Springer, 1993.

[99] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, P. Dubey, S. Junkins, A. Lake, R. Cavin, R. Espasa, E. Grochowski, T. Juan, M. Abrash, J. Sugerman, and P. Hanrahan. Larrabee: A Many-Core x86 Architecture for Visual Computing. Micro, IEEE, 29(1):10–21, 2009.

[100] J. Shirako, D. M. Peixotto, V. Sarkar, and W. N. Scherer. Phasers: A Unified Deadlock-Free Construct for Collective and Point-To-Point Synchronization. In ICS'08, pages 277–288, New York, NY, USA, 2008. ACM.

[101] M. Solar. A Scheduling Algorithm to Optimize Parallel Processes. In Chilean Computer Science Society, 2008. SCCC '08. International Conference of the, pages 73–78, 2008.

[102] M. Solar and M. Inostroza. A Scheduling Algorithm to Optimize Real-World Applications. In Proceedings of the 24th International Conference on Distributed Computing Systems Workshops - W7: EC (ICDCSW'04) - Volume 7, ICDCSW '04, pages 858–862, Washington, DC, USA, 2004. IEEE Computer Society.


[103] Supercomputing Technologies Group, Massachusetts Institute of Technology Laboratory for Computer Science. Cilk 5.4.6 Reference Manual, Nov 2001.

[104] Thales. THALES Group. http://www.thalesgroup.com.

[105] H. Topcuouglu, S. Hariri, and M.-Y. Wu. Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing. IEEE Trans. Parallel Distrib. Syst., 13(3):260–274, Mar 2002.

[106] S. Verdoolaege, A. Cohen, and A. Beletska. Transitive Closures of Affine Integer Tuple Relations and Their Overapproximations. In Proceedings of the 18th International Conference on Static Analysis, SAS'11, pages 216–232, Berlin, Heidelberg, 2011. Springer-Verlag.

[107] M.-Y. Wu and D. D. Gajski. Hypertool: A Programming Aid for Message-Passing Systems. IEEE Trans. on Parallel and Distributed Systems, 1:330–343, 1990.

[108] T. Yang and A. Gerasoulis. DSC: Scheduling Parallel Tasks on an Unbounded Number of Processors. IEEE Trans. Parallel Distrib. Syst., 5(9):951–967, 1994.

[109] X.-S. Yang and S. Deb. Cuckoo Search via Levy Flights. In Nature & Biologically Inspired Computing, 2009. NaBIC 2009. World Congress on, pages 210–214, 2009.

[110] K. Yelick, D. Bonachea, W.-Y. Chen, P. Colella, K. Datta, J. Duell, S. L. Graham, P. Hargrove, P. Hilfinger, P. Husbands, C. Iancu, A. Kamil, R. Nishtala, J. Su, W. Michael, and T. Wen. Productivity and Performance Using Partitioned Global Address Space Languages. In Proceedings of the 2007 International Workshop on Parallel Symbolic Computation, PASCO '07, pages 24–32, New York, NY, USA, 2007. ACM.

[111] J. Zhao and V. Sarkar. Intermediate Language Extensions for Parallelism. In Proceedings of the compilation of the co-located workshops on DSM'11, TMC'11, AGERE'11, AOOPES'11, NEAT'11, & VMIL'11, SPLASH '11 Workshops, pages 329–340, New York, NY, USA, 2011. ACM.



341 Cilk 34

342 Chapel 35

343 X10 and Habanero-Java 37

344 OpenMP 39

345 MPI 40

346 OpenCL 42

35 Discussion and Comparison 44

36 Conclusion 48

4 SPIRE: A Generic Sequential to Parallel Intermediate Representation Extension Methodology 49

41 Introduction 50

42 SPIRE a Sequential to Parallel IR Extension 51

421 Design Approach 52

422 Execution 53

423 Synchronization 56

424 Data Distribution 60

43 SPIRE Operational Semantics 62

431 Sequential Core Language 62

432 SPIRE as a Language Transformer 64

433 Rewriting Rules 67

44 Validation SPIRE Application to LLVM IR 68

45 Related Work Parallel Intermediate Representations 72

46 Conclusion 74

5 BDSC: A Memory-Constrained, Number of Processor-Bounded Extension of DSC 75

51 Introduction 75

52 The DSC List-Scheduling Algorithm 76

521 The DSC Algorithm 76

522 Dominant Sequence Length Reduction Warranty 79

53 BDSC A Resource-Constrained Extension of DSC 80

531 DSC Weaknesses 80

532 Resource Modeling 81

533 Resource Constraint Warranty 82

534 Efficient Task-to-Cluster Mapping 83

535 The BDSC Algorithm 84

54 Related Work Scheduling Algorithms 87

55 Conclusion 89

6 BDSC-Based Hierarchical Task Parallelization 91

61 Introduction 91

62 Hierarchical Sequence Dependence DAG Mapping 92

621 Sequence Dependence DAG (SDG) 94


622 Hierarchical SDG Mapping 96

63 Sequential Cost Models Generation 100

631 From Convex Polyhedra to Ehrhart Polynomials 100

632 From Polynomials to Values 103

64 Reachability Analysis The Path Transformer 106

641 Path Definition 107

642 Path Transformer Algorithm 107

643 Operations on Regions using Path Transformers 115

65 BDSC-Based Hierarchical Scheduling (HBDSC) 116

651 Closure of SDGs 117

652 Recursive Top-Down Scheduling 118

653 Iterative Scheduling for Resource Optimization 121

654 Complexity of HBDSC Algorithm 123

655 Parallel Cost Models 125

66 Related Work Task Parallelization Tools 126

67 Conclusion 128

7 SPIRE-Based Parallel Code Transformations and Generation 130

71 Introduction 130

72 Parallel Unstructured to Structured Transformation 133

721 Structuring Parallelism 133

722 Hierarchical Parallelism 137

73 From Shared Memory to Distributed Memory Transformation 139

731 Related Work Communications Generation 139

732 Difficulties 140

733 Equilevel and Hierarchical Communications 144

734 Equilevel Communications Insertion Algorithm 146

735 Hierarchical Communications Insertion Algorithm 150

736 Communications Insertion Main Algorithm 151

737 Conclusion and Future Work 153

74 Parallel Code Generation 155

741 Mapping Approach 155

742 OpenMP Generation 157

743 SPMDization MPI Generation 159

75 Conclusion 161

8 Experimental Evaluation with PIPS 163

81 Introduction 163

82 Implementation of HBDSC- and SPIRE-Based Parallelization Processes in PIPS 164

821 Preliminary Passes 167

822 Task Parallelization Passes 168

823 OpenMP Related Passes 169


824 MPI Related Passes: SPMDization 170

83 Experimental Setting 171

831 Benchmarks: ABF, Harris, Equake, IS and FFT 171

832 Experimental Platforms: Austin and Cmmcluster 174

833 Effective Cost Model Parameters 174

84 Protocol 174

85 BDSC vs. DSC 176

851 Experiments on Shared Memory Systems 176

852 Experiments on Distributed Memory Systems 179

86 Scheduling Robustness 180

87 Faust Parallel Scheduling vs. BDSC 182

871 OpenMP Version in Faust 182

872 Faust Programs: Karplus32 and Freeverb 182

873 Experimental Comparison 183

88 Conclusion 183

9 Conclusion 185

91 Contributions 185

92 Future Work 188

List of Tables

21 Multicores proliferation 11

22 Execution time estimations for Harris functions using PIPS complexity analysis 26

31 Summary of parallel languages constructs 48

41 Mapping of SPIRE to parallel languages constructs (terms in parentheses are not currently handled by SPIRE) 53

61 Execution and communication time estimations for Harris using PIPS default cost model (N and M variables represent the input image size) 104

62 Comparison summary between different parallelization tools 128

81 CPI for the Austin and Cmmcluster machines, plus the transfer cost of one byte β on the Cmmcluster machine (in cycles) 175

82 Run-time sensitivity of BDSC with respect to static cost estimation (in ms for Harris and ABF; in s for equake and IS) 181

List of Figures

11 Sequential C implementation of the main function of Harris 3

12 Harris algorithm data flow graph 3

13 Thesis outline blue indicates thesis contributions an ellipsea process and a rectangle results 7

21 A typical multiprocessor architectural model 9

22 Memory models 11

23 Rewriting of data parallelism using the task parallelism model 14

24 Construction of the control dependence graph 16

25 Example of a C code and its data dependence graph 17

26 A Directed Acyclic Graph (left) and its associated data (right) 19

27 Example of transformer analysis 23

28 Example of precondition analysis 24

29 Example of array region analysis 25

210 Simplified Newgen definitions of the PIPS IR 27

31 Result of the Mandelbrot set 31

32 Sequential C implementation of the Mandelbrot set 32

33 Cilk implementation of the Mandelbrot set (P is the numberof processors) 34

34 Chapel implementation of the Mandelbrot set 36

35 X10 implementation of the Mandelbrot set 37

36 A clock in X10 (left) and a phaser in Habanero-Java (right) 38

37 C OpenMP implementation of the Mandelbrot set 39

38 MPI implementation of the Mandelbrot set (P is the numberof processors) 41

39 OpenCL implementation of the Mandelbrot set 43

310 A hide-and-seek game (X10 HJ Cilk) 46

311 Example of an atomic directive in OpenMP 47

312 Data race on ptr with Habanero-Java 47

41 OpenCL example illustrating a parallel loop 54

42 forall in Chapel and its SPIRE core language representation 54

43 A C code and its unstructured parallel control flow graphrepresentation 55

44 parallel sections in OpenMP and its SPIRE core lan-guage representation 55

45 OpenCL example illustrating spawn and barrier statements 57

46 Cilk and OpenMP examples illustrating an atomically-synchronizedstatement 57



47 X10 example illustrating a future task and its synchronization 58

48 A phaser in Habanero-Java and its SPIRE core languagerepresentation 59

49 Example of Pipeline Parallelism with phasers 59

410 MPI example illustrating a communication and its SPIREcore language representation 61

411 SPIRE core language representation of a non-blocking send 61

412 SPIRE core language representation of a non-blocking receive 61

413 SPIRE core language representation of a broadcast 62

414 Stmt and SPIRE(Stmt) syntaxes 63

415 Stmt sequential transition rules 63

416 SPIRE(Stmt) synchronized transition rules 66

417 if statement rewriting using while loops 68

418 Simplified Newgen definitions of the LLVM IR 70

419 A loop in C and its LLVM intermediate representation 70

420 SPIRE (LLVM IR) 71

421 An example of a spawned outlined sequence 71

51 A Directed Acyclic Graph (left) and its scheduling (right)starred top levels () correspond to the selected clusters 78

52 Result of DSC on the graph in Figure 51 without (left) andwith (right) DSRW 80

53 A DAG amenable to cluster minimization (left) and its BDSCstep-by-step scheduling (right) 84

54 DSC (left) and BDSC (right) cluster allocation 85

61 Abstract syntax trees Statement syntax 92

62 Parallelization process blue indicates thesis contributions anellipse a process and a rectangle results 93

63 Example of a C code (left) and the DDG D of its internal Ssequence (right) 97

64 SDGs of S (top) and S0 (bottom) computed from the DDG(see the right of Figure 63) S and S0 are specified in the leftof Figure 63 98

65 SDG G for a part of equake S given in Figure 67 Gbody isthe SDG of the body Sbody (logically included into G via H) 99

66 Example of the execution time estimation for the functionMultiplY of the code Harris each comment provides thecomplexity estimation of the statement below (N and M areassumed to be global variables) 103

67 Instrumented part of equake (Sbody is the inner loop sequence)105

68 Numerical results of the instrumented part of equake (instrumen-ted equakein) 106

69 An example of a subtree of root forloop0 108


610 An example of a path execution trace from Sb to Se in thesubtree forloop0 108

611 An example of a C code and the path transformer of thesequence between Sb and Se M is perp since there are no loopsin the code 111

612 An example of a C code and the path transformer of thesequence of calls and test between Sb and Se M is perp sincethere are no loops in the code 112

613 An example of a C code and the path transformer betweenSb and Se 115

614 Closure of the SDG G0 of the C code S0 in Figure 63 withadditional import entry vertices 118

615 topsort(G) for the hierarchical scheduling of sequences 120

616 Scheduled SDG for Harris using P=3 cores the schedulinginformation via cluster(τ) is also printed inside each vertexof the SDG 122

617 After the first application of BDSC on S = sequence(S0S1)failing with ldquoNot enough memoryrdquo 123

618 Hierarchically (viaH) Scheduled SDGs with memory resourceminimization after the second application of BDSC (whichsucceeds) on sequence(S0S1) (κi = cluster(σ(Si))) Tokeep the picture readable only communication edges are fig-ured in these SDGs 124

71 Parallel code transformations and generation blue indicatesthis chapter contributions an ellipse a process and a rect-angle results 132

72 Unstructured parallel control flow graph (a) and event (b)and structured representations (c) 134

73 A part of Harris and one possible SPIRE core language rep-resentation 137

74 SPIRE representation of a part of Harris non optimized (left)and optimized (right) 138

75 A simple C code (hierarchical) and its SPIRE representation 138

76 An example of a shared memory C code and its SPIRE codeefficient communication primitive-based distributed memoryequivalent 141

77 Two examples of C code with non-uniform dependences 142

78 An example of a C code and its read and write array regionsanalysis for two communications from S0 to S1 and to S2 143

79 An example of a C code and its in and out array regionsanalysis for two communications from S0 to S1 and to S2 143


710 Scheme of communications between multiple possible levels of hierarchy in a parallel code; the levels are specified with the superscripts on τ's 145

711 Equilevel and hierarchical communications 146

712 A part of an SDG and the topological sort order of predecessors of τ4 148

713 Communications generation from a region using the region_to_coms function 150

714 Equilevel and hierarchical communications generation and SPIRE distributed memory representation of a C code 154

715 Graph illustration of communications made in the C code of Figure 714: equilevel in green arcs, hierarchical in black 155

716 Equilevel and hierarchical communications generation after optimizations: the current cluster executes the last nested spawn (left) and barriers with one spawn statement are removed (right) 156

717 SPIRE representation of a part of Harris and its OpenMP task generated code 159

718 SPIRE representation of a C code and its OpenMP hierarchical task generated code 159

719 SPIRE representation of a part of Harris and its MPI generated code 161

81 Parallelization process: blue indicates thesis contributions, an ellipse a pass or a set of passes, and a rectangle results 165

82 Executable (harris.tpips) for harris.c 166

83 A model of an MPI generated code 171

84 Scheduled SDG of the main function of ABF 173

85 Steps for the rewriting of data parallelism to task parallelism using the tiling and unrolling transformations 176

86 ABF and equake speedups with OpenMP 177

87 Speedups with OpenMP: impact of tiling (P=3) 178

88 Speedups with OpenMP for different class sizes (IS) 178

89 ABF and equake speedups with MPI 179

810 Speedups with MPI: impact of tiling (P=3) 180

811 Speedups with MPI for different class sizes (IS) 180

812 Run-time comparison (BDSC vs Faust for Karplus32) 183

813 Run-time comparison (BDSC vs Faust for Freeverb) 184

Chapter 1

Introduction

"I live and don't know how long, I'll die and don't know when, I am going and don't know where, I wonder that I am happy." (Martinus von Biberach)

1.1 Context

Moore's law [81] states that, over the history of computing hardware, the number of transistors on integrated circuits doubles approximately every two years. The recent shift from single-processor chip designs to parallel machines is partly due to the growth in power consumption caused by the high clock speeds required to improve performance in single-processor (or core) chips. The transistor count is still increasing, in order to integrate more cores and ensure proportional performance scaling.

In spite of the validity of Moore's law until now, the performance of individual cores stopped increasing after 2003. Thus the scalability of applications, which calls for increased performance when resources are added, is not guaranteed. To understand this issue, it is necessary to take a look at Amdahl's law [16]. This law states that the speedup of a program using multiple processors is bounded by the time needed for its sequential part; in addition to the number of processors, the algorithm itself also limits the speedup.
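For reference, Amdahl's bound can be written as follows (a standard formulation; the notation s for the sequential fraction and P for the number of processors is ours, not taken from this document):

% Amdahl's law: s is the fraction of the execution time that is sequential,
% P the number of processors.
\[
  \mathrm{speedup}(P) \;=\; \frac{1}{\,s + \frac{1-s}{P}\,} \;\le\; \frac{1}{s}.
\]
% Worked example: if 10% of a program is sequential (s = 0.1), then
% speedup(8) = 1/(0.1 + 0.9/8) which is approximately 4.7, and no number
% of processors can push the speedup beyond 1/s = 10.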

In order to enjoy the performance benefits that multiprocessors can provide, one should exploit efficiently the parallelism present in applications. This is a tricky task for programmers, especially if the programmer is a physicist or a mathematician, or even a computer scientist for whom understanding the application is difficult since it is not his own. Of course, we could tell the programmer: "think parallel". But humans tend to think sequentially. Therefore, one long-standing problem is, and will remain for some time, to detect parallelism in a sequential code and, automatically or not, write its equivalent efficient parallel code.

A parallel application, expressed in a parallel programming model, should be as portable as possible, i.e. one should not have to rewrite parallel code when targeting another machine. Yet the proliferation of parallel programming models makes the choice of one general model not obvious, unless there is a parallel language that is both efficient and compilable to all types of current and future architectures, which is not yet the case. Therefore, the writing of parallel code needs to be based on a generic approach, i.e. on how well a range of different languages can be handled, in order to offer some chance of portability.


1.2 Motivation

In order to achieve good performance when programming for multiprocessors, and to go beyond having to "think parallel", a platform for automatic parallelization is required to exploit cores efficiently. Also, the proliferation of multi-core processors with shorter pipelines and lower clock rates, and the pressure that the simpler data parallelism model imposes on their memory bandwidth, have made coarse-grained parallelism inevitable for improving performance.

The first step in a parallelization process is to expose concurrency by decomposing applications into pipelined tasks, since an algorithm is often described as a collection of tasks. To illustrate the importance of task parallelism, we give a simple example: the Harris algorithm [94], which is based on pixelwise autocorrelation using a chain of functions, namely Sobel, Multiplication, Gauss and Coarsity. Figure 1.1 shows this process. Several works have already parallelized and mapped this algorithm onto parallel architectures such as the CELL processor [93].

Figure 1.2 shows a possible instance of a manual partitioning of Harris (see also [93]). The Harris algorithm can be decomposed into a succession of two to three concurrent tasks (ellipses in the figure), assuming no exploitation of data parallelism. Automating this intuitive approach, and thus the automatic task extraction and parallelization of such applications, is our motivation in this thesis.

However, detecting task parallelism and generating efficient code rest on complex analyses of data dependences, communication, synchronization, etc. These different analyses can be provided by compilation frameworks. Thus, delegating the issue of "parallel thinking" to software is appealing, because it can be built upon existing sophisticated analyses of data dependences, and can handle the granularity present in sequential codes and the resource constraints, such as memory size or the number of CPUs, that also impact this process.

Automatic task parallelization has been studied for almost half a century. The problem of extracting an optimal parallelism with optimal communications is NP-complete [46]. Several works have tried to automate the parallelization of programs using different granularities. Metis [61] and other graph partitioning tools aim at assigning the same amount of processor work with small quantities of interprocessor communication, but the structure of the graph does not differentiate between loops, function calls, etc.; the crucial steps of graph construction and parallel code generation are missing. Sarkar [96] implements a compile-time method for the partitioning problem for multiprocessors: a program is partitioned into parallel tasks at compile time, and then these are merged until a partition with the smallest parallel execution time in the presence of overhead (scheduling and communication overhead) is found. Unfortunately, this algorithm does not


void main(int argc, char *argv[]) {
  float (*Gx)[N*M], (*Gy)[N*M], (*Ixx)[N*M],
        (*Iyy)[N*M], (*Ixy)[N*M], (*Sxx)[N*M],
        (*Sxy)[N*M], (*Syy)[N*M], (*in)[N*M], (*out)[N*M];

  in = InitHarris();

  /* Now we run the Harris procedure */

  /* Sobel */
  SobelX(Gx, in);
  SobelY(Gy, in);

  /* Multiply */
  MultiplY(Ixx, Gx, Gx);
  MultiplY(Iyy, Gy, Gy);
  MultiplY(Ixy, Gx, Gy);

  /* Gauss */
  Gauss(Sxx, Ixx);
  Gauss(Syy, Iyy);
  Gauss(Sxy, Ixy);

  /* Coarsity */
  CoarsitY(out, Sxx, Syy, Sxy);

  return;
}

Figure 1.1: Sequential C implementation of the main function of Harris

Figure 1.2: Harris algorithm data flow graph (in → SobelX/SobelY → Gx/Gy → MultiplY (×3) → Ixx/Ixy/Iyy → Gauss (×3) → Sxx/Sxy/Syy → Coarsity → out)

address resource constraints, which are important factors for targeting real architectures.

All parallelization tools are dedicated to a particular programming model: there is a lack of a generic abstraction of parallelism (multithreading, synchronization and data distribution). They thus do not usually address the issue of portability.

In this thesis we develop an automatic task parallelization methodology for compilers: the key characteristics we focus on are resource constraints and static scheduling. It includes the techniques required to decompose applications into tasks and generate equivalent parallel code, using a generic approach that targets different parallel languages and thus different architectures. We apply this methodology to the existing tool PIPS [58], a comprehensive source-to-source compilation platform.

1.3 Contributions

Our goal is to develop a new tool and algorithms for automatic task parallelization that take into account resource constraints and that can also be of general use for existing parallel languages. For this purpose, the main contributions of this thesis are:

1. a new BDSC-based hierarchical scheduling algorithm (HBDSC) that uses:

   • a new data structure, called the Sequence Data Dependence Graph (SDG), to represent partitioned parallel programs;

   • "Bounded DSC" (BDSC), an extension of the DSC [108] scheduling algorithm that simultaneously handles two resource constraints, namely a bounded amount of memory per processor and a bounded number of processors, which are key parameters when scheduling tasks on actual multiprocessors;

   • a new cost model based on execution time complexity estimation, convex polyhedral approximations of data array sizes and code instrumentation for the labeling of SDG vertices and edges;

2. a new approach for the adaptation of automatic parallelization platforms to parallel languages, via:

   • SPIRE, a new, simple, parallel intermediate representation (IR) extension methodology for designing the parallel IRs used in compilation frameworks;

   • the deployment of SPIRE for the automatic task-level parallelization of explicitly parallel programs on the PIPS [58] compilation framework;

3. an implementation in the PIPS source-to-source compilation framework of:

   • BDSC-based parallelization for programs encoded using the SPIRE-based PIPS IR;

   • SPIRE-based parallel code generation into two parallel languages: OpenMP [4] and MPI [3];

4. performance measurements for our parallelization approach, based on:

   • five significant programs, targeting both shared and distributed memory architectures: the image and signal processing benchmarks (actually sample constituent algorithms, hopefully representative of classes of applications) Harris [53] and ABF [51], the SPEC2001 benchmark equake [22], the NAS parallel benchmark IS [85] and an FFT code [8];

   • their automatic translations into two parallel languages: OpenMP [4] and MPI [3];

   • and, finally, a comparative study between our task parallelization implementation in PIPS and that of the audio signal processing language Faust [86], using two Faust applications: Karplus32 and Freeverb. We choose Faust since it is an open-source compiler that also automatically generates parallel tasks in OpenMP.

1.4 Thesis Outline

This thesis is organized in nine chapters. Figure 1.3 shows how these chapters can be put into the context of a parallelization chain.

Chapter 2 provides a short background on the concepts of architectures, parallelism, languages and compilers used in the next chapters.

Chapter 3 surveys seven current and efficient parallel programming languages in order to determine and classify their parallel constructs. This helps us define a core parallel language that plays the role of a generic parallel intermediate representation when parallelizing sequential programs.

Our proposal, called the SPIRE methodology (Sequential to Parallel Intermediate Representation Extension), is developed in Chapter 4. SPIRE leverages existing infrastructures to address both control and data parallel languages, while preserving as much as possible the existing analyses for sequential codes. To validate this approach in practice, we use PIPS, a comprehensive source-to-source compilation platform, as a use case to upgrade its sequential intermediate representation to a parallel one.

Since the main goal of this thesis is automatic task parallelization, extracting task parallelism from sequential codes is a key issue in this process, which we view as a scheduling problem. Chapter 5 introduces a new efficient automatic scheduling heuristic called BDSC (Bounded Dominant Sequence Clustering) for parallel programs in the presence of resource constraints on the number of processors and their local memory size.

The parallelization process we introduce in this thesis uses BDSC to find a good scheduling of program tasks on target machines and SPIRE to generate parallel source code. Chapter 6 equips BDSC with a sophisticated cost model to yield a new parallelization algorithm.

Chapter 7 describes how we can generate equivalent parallel codes in two target languages, OpenMP and MPI, selected from the survey presented in Chapter 3. Thanks to SPIRE, we show how code generation is simple and efficient.

To verify the efficiency and robustness of our work, experimental results are presented in Chapter 8. They suggest that BDSC- and SPIRE-based parallelization, focused on efficient resource management, leads to significant parallelization speedups on both shared and distributed memory systems. We also compare our implementation of the generation of omp task in PIPS with the generation of omp sections in Faust.

We conclude in Chapter 9

Figure 1.3: Thesis outline: blue indicates thesis contributions, an ellipse a process, and a rectangle results. The chain goes from the C source code through the PIPS analyses, complexities, convex array regions, sequential IR and DDG (Chapter 2) to the BDSC-based parallelizer (Chapters 5 and 6), expressed in SPIRE(PIPS IR) (Chapter 4), then to OpenMP and MPI generation (Chapter 7), the OpenMP and MPI codes (Chapter 3), and finally to runs producing the performance results (Chapter 8).

Chapter 2

Perspectives: Bridging Parallel Architectures and Software Parallelism

"You talk when you cease to be at peace with your thoughts." (Kahlil Gibran)

"Anyone can build a fast CPU. The trick is to build a fast system." Attributed to Seymour Cray, this quote is even more pertinent when looking at multiprocessor systems that contain several fast processing units: parallel system architectures introduce subtle system features to achieve good performance. Real world applications, which operate on large amounts of data, must be able to deal with constraints such as memory requirements, code size and processor features. These constraints must also be addressed by parallelizing compilers, which deal with such applications, from the domains of scientific, signal and image processing, and translate sequential codes into efficient parallel ones. The multiplication of hardware intricacies increases the importance of software in order to achieve adequate performance.

This thesis was carried out under this perspective. Our goal is to develop a prototype for automatic task parallelization that generates a parallel version of an input sequential code using a scheduling algorithm. We also address code generation and parallel intermediate representation issues. More generally, we study how parallelism can be detected, represented and finally generated.

2.1 Parallel Architectures

Up to 2003, the performance of machines was improved by increasing clock frequencies. Yet relying on single-thread performance has been increasingly disappointing, due to the access latency to main memory. Moreover, other aspects of performance have lately become important, such as power consumption, energy dissipation and number of cores. For these reasons, the development of parallel processors, with a large number of cores that run at slower frequencies to reduce energy consumption, has been adopted and has had a significant impact on performance.

The market of parallel machines is wide, from vector platforms, accelerators, Graphical Processing Units (GPUs) and dual-core to many-core¹ processors with thousands of on-chip cores. However, the gains obtained when parallelizing and executing applications are limited by input-output issues (disk read/write), data communications among processors, memory access time, load unbalance, etc. When parallelizing an application, one must detect the available parallelism and map the different parallel tasks onto the parallel machine. In this dissertation we consider a task as a static notion, i.e. a list of instructions, while processes and threads (see Section 2.1.2) are running instances of tasks.

¹ Contains 32 or more cores.

In our parallelization approach, we make trade-offs between the available parallelism and communications/locality; this latter point is related to the memory model of the target machine. Besides, the extracted parallelism should be correlated to the number of processors/cores available in the target machine. We also take into account the memory size and the number of processors.

2.1.1 Multiprocessors

The generic model of multiprocessors is a collection of cores, each potentially including both CPU and memory, attached to an interconnection network. Figure 2.1 illustrates an example of a multiprocessor architecture that contains two clusters, each cluster being a quad-core multiprocessor. In a multicore, cores can transfer data between their cache memories (level 2 cache, generally) without having to go through the main memory. Moreover, they may or may not share caches (the CPU-local level 1 cache in Figure 2.1, for example). This is not possible at the cluster level, where processor cache memories are not linked: data can be transferred to and from the main memory only.

Figure 2.1: A typical multiprocessor architectural model: two clusters of four CPUs, each CPU with its own L1 cache, an L2 cache shared within each cluster, a main memory per cluster and an interconnect linking the clusters.

Flynn [45] classifies multiprocessors into three main groups: Single Instruction Single Data (SISD), Single Instruction Multiple Data (SIMD) and Multiple Instruction Multiple Data (MIMD). Multiprocessors are generally recognized to be MIMD architectures. We present the important branches of current multiprocessors, MPSoCs and multicore processors, since they are widely used in many applications such as signal processing and multimedia.

Multiprocessors System-on-Chip

A multiprocessor system-on-chip (MPSoC) integrates multiple CPUs in a hardware system. MPSoCs are commonly used for applications, such as embedded multimedia, on cell phones. The first MPSoC is often considered to be the Lucent Daytona [9]. One distinguishes two important forms of MPSoCs: homogeneous and heterogeneous multiprocessors. Homogeneous architectures include multiple cores that are similar in every aspect. On the other hand, heterogeneous architectures use processors that differ in terms of instruction set architecture, functionality, frequency, etc. Field Programmable Gate Array (FPGA) platforms and the CELL processor² [42] are examples of heterogeneous architectures.

Multicore Processors

This model implements the shared memory model. The programming model in a multicore processor is the same for all its CPUs. The first multicore general purpose processors were introduced in 2005 (Intel Core Duo); Intel Xeon Phi is a recent multicore chip with 60 homogeneous cores. The number of cores on a chip is expected to continue to grow, in order to construct more powerful computers. This emerging hardware approach is expected to deliver petaflop and exaflop performance, with the efficient capabilities needed to handle high-performance emerging applications. Table 2.1 illustrates the rapid proliferation of multicores and summarizes the important existing ones; note that IBM Sequoia, the second system in the Top500 list (top500.org) in November 2012, is a petascale Blue Gene/Q supercomputer that contains 98,304 compute nodes, each computing node being a 16-core PowerPC A2 processor chip (sixth entry in the table). See [25] for an extended survey of multicore processors.

In this thesis we provide tools that automatically generate parallel programs from sequential ones, targeting homogeneous multiprocessors or multicores. Since increasing the number of cores on a single chip creates challenges with memory and cache coherence, as well as communication between the cores, our aim in this thesis is to deliver the benefit and efficiency of multicore processors through the introduction of an efficient scheduling algorithm, to map the parallel tasks of a program onto CPUs, in the presence of resource constraints on the number of processors and their local memory size.

² The Cell configuration has one Power Processing Element (PPE) on the core, acting as a controller for eight physical Synergistic Processing Elements (SPEs).


Year              Number of cores          Company   Processor
The early 1970s   The 1st microprocessor   Intel     4004
2005              2-core machine           Intel     Xeon Paxville DP
2005              2                        AMD       Opteron E6-265
2009              6                        AMD       Phenom II X6
2011              8                        Intel     Xeon E7-2820
2011              16                       IBM       PowerPC A2
2007              64                       Tilera    TILE64
2012              60                       Intel     Xeon Phi

Table 2.1: Multicores proliferation

2.1.2 Memory Models

The choice of a proper memory model to express parallel programs is an important issue in parallel language design. Indeed, the ways processes and threads communicate using the target architecture, and how they impact the programmer's computation specifications, affect both performance and ease of programming. There are currently three main approaches (see Figure 2.2, extracted from [63]).

Figure 2.2: Memory models: shared memory, distributed memory and PGAS (legend: thread/process, memory access, address space, messages).

Shared Memory

Also called global address space, this model is the simplest one to use [11]. Here, the address spaces of the threads are mapped onto the global memory; no explicit data passing between threads is needed. However, synchronization is required between the threads that are writing and reading the same data to and from the shared memory. The OpenMP [4] and Cilk [26] languages use the shared memory model. Intel's Larrabee [99] is a homogeneous, general-purpose, many-core machine with cache-coherent shared memory.

Distributed Memory

In a distributed-memory environment, each processor is only able to address its own memory. The message passing model, which uses communication libraries, is usually adopted to write parallel programs for distributed-memory systems. These libraries provide routines to initiate and configure the messaging environment, as well as to send and receive data packets. Currently, the most popular high-level message-passing system for scientific and engineering applications is MPI (Message Passing Interface) [3]. OpenCL [69] uses a variation of the message passing memory model for GPUs. The Intel Single-chip Cloud Computer (SCC) [55] is a homogeneous, general-purpose, many-core (48 cores) chip implementing the message passing memory model.

Partitioned Global Address Space (PGAS)

PGAS-based languages combine the programming convenience of shared memory with the performance control of message passing by logically partitioning a global address space into places, each thread being local to a place. From the programmer's point of view, programs have a single address space, and one task of a given thread may refer directly to the storage of a different thread. UPC [33], Fortress [13], Chapel [6], X10 [5] and Habanero-Java [30] use the PGAS memory model.

In this thesis we target both shared and distributed memory systems, since efficiently allocating the tasks of an application on a target architecture requires reducing communication overheads and transfer costs for both shared and distributed memory architectures. By convention, we call the running instances of tasks processes in the case of distributed memory systems, and threads for shared memory and PGAS systems.

If reducing communications is obviously meaningful for distributed memory systems, it is also worthwhile on shared memory architectures, since it may impact cache behavior. Also, locality is another important issue: a parallelization process should be able to make trade-offs between locality in cache memory and communications. Indeed, mapping two tasks to the same processor keeps the data in its local memory, and even possibly in its cache. This avoids copying it over the shared memory bus; therefore, transmission costs are decreased and bus contention is reduced.

Moreover, we take into account the memory size, since this is an important factor when parallelizing applications that operate on large amounts of data or when targeting embedded systems.
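To make the practical difference between these memory models concrete, here is a minimal, hedged sketch (our own illustration, not code from this thesis) of the same reduction written for the shared memory model with OpenMP and for the distributed memory model with MPI; the USE_MPI macro is only a convention of this example.

#include <stdio.h>

#ifdef USE_MPI
#include <mpi.h>
int main(int argc, char *argv[]) {
  int rank, size;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  /* Each process owns a private slice: explicit communication is needed. */
  double local = 0.0, global = 0.0;
  for (int i = rank; i < 1000; i += size)
    local += (double)i;
  MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0) printf("sum = %g\n", global);
  MPI_Finalize();
  return 0;
}
#else
int main(void) {
  /* All threads address the same data: only synchronization is needed. */
  double sum = 0.0;
  #pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < 1000; i++)
    sum += (double)i;
  printf("sum = %g\n", sum);
  return 0;
}
#endif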


2.2 Parallelism Paradigms

Parallel computing has been studied for as long as computers have existed. The market dominance of multi- and many-core processors, and the growing importance and increasing number of clusters in the Top500 list, are making parallelism a key concern when implementing large applications such as weather modeling [39] or nuclear simulations [36]. These important applications require a lot of computational power, and thus need to be programmed to run on parallel supercomputers.

Parallelism expression is the fundamental issue when breaking an application into concurrent parts in order to take advantage of a parallel computer and to execute these parts simultaneously on different CPUs. Parallel architectures support multiple levels of parallelism within a single chip: the PE (Processing Element) level [64], the thread level (e.g. Simultaneous MultiThreading (SMT)), the data level (e.g. Single Instruction Multiple Data (SIMD)), the instruction level (e.g. Very Long Instruction Word (VLIW)), etc. We present here the three main types of parallelism (data, task and pipeline).

Task Parallelism

This model corresponds to the distribution of the execution processes or threads across different parallel computing cores/processors. A task can be defined as a unit of parallel work in a program. The implementation of its scheduling can be classified into two categories: dynamic and static.

• With dynamic task parallelism, tasks are dynamically created at runtime and added to the work queue. The runtime scheduler is responsible for scheduling and synchronizing the tasks across the cores.

• With static task parallelism, the set of tasks is known statically and the scheduling is applied at compile time; the compiler generates and assigns the different tasks to the thread groups.

Data Parallelism

This model implies that the same independent instruction is performed repeatedly and simultaneously on different data. Data parallelism, or loop-level parallelism, corresponds to the distribution of the data across different parallel cores/processors, dividing the loops into chunks³. Data-parallel execution puts a high pressure on the memory bandwidth of multi-core processors. Data parallelism can be expressed using task parallelism constructs; an example is provided in Figure 2.3, where loop iterations are distributed differently among the threads: generally in contiguous order on the left, and in interleaved order on the right.

³ A chunk is a set of iterations.

forall (i = 0; i < height; ++i)
  for (j = 0; j < width; ++j)
    kernel();

(a) Data parallel example

for (i = 0; i < height; ++i)
  launchTask {
    for (j = 0; j < width; ++j)
      kernel();
  }

(b) Task parallel example

Figure 2.3: Rewriting of data parallelism using the task parallelism model
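As an illustration, the two generic forms of Figure 2.3 could be rendered in OpenMP as follows (a hedged sketch: the forall and launchTask constructs above are abstract, and kernel, height and width are placeholders of this example, not names from the thesis):

extern void kernel(void);

/* Data parallel version: the iteration space of the outer loop is
 * divided into chunks, one chunk of rows per thread. */
void data_parallel(int height, int width) {
  #pragma omp parallel for
  for (int i = 0; i < height; ++i)
    for (int j = 0; j < width; ++j)
      kernel();
}

/* Task parallel rewriting: one task is spawned per row. */
void task_parallel(int height, int width) {
  #pragma omp parallel
  #pragma omp single
  for (int i = 0; i < height; ++i) {
    #pragma omp task firstprivate(i)
    for (int j = 0; j < width; ++j)
      kernel();
  }
}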

Pipeline Parallelism

Pipeline parallelism applies to chains of producers and consumers that contain many steps depending on each other. These dependences can be removed by interleaving these steps. Pipeline parallel execution leads to reduced latency and buffering, but it introduces extra synchronization between producers and consumers, to keep them coupled in their execution.

In this thesis we focus on task parallelism, often considered more difficult to handle than data parallelism since it lacks the regularity present in the latter model; processes (threads) run different instructions simultaneously, leading to different execution schedules and memory access patterns. Besides, the pressure that the data parallelism model puts on memory bandwidth and the extra synchronization introduced by pipeline parallelism have made coarse-grained parallelism inevitable for improving performance.

Moreover, data parallelism can be implemented using task parallelism (see Figure 2.3). Therefore, our implementation of task parallelism is also able to handle data parallelism, after transformation of the sequential code (see Section 8.4, which explains the protocol applied to task parallelize sequential applications).

Task management must address both control and data dependences, in order to reduce execution and communication times. This is related to another key enabling issue, compilation, which we detail in the following section.

2.3 Mapping Paradigms to Architectures

In order to map efficiently a parallel program onto a parallel architecture, one needs to write parallel programs in the presence of constraints of data and control dependences, communications and resources. Writing a parallel application implies the use of a target parallel language; we survey in Chapter 3 different recent and popular task parallel languages. We can split this parallelization process into two steps: understanding the sequential application, by analyzing it to extract parallelism, and then generating the equivalent parallel code. Before reaching the writing step, decisions should be made about an efficient task schedule (Chapter 5). To perform these operations automatically, a compilation framework is necessary to provide a base for analyses, implementation and experimentation. Moreover, to keep this parallel software development process simple, generic and language-independent, a general parallel core language is needed (Chapter 4).

Programming parallel machines requires a parallel programming language implementing one of the previous paradigms of parallelism. Since in this thesis we focus on extracting and generating task parallelism, we are interested in task parallel languages, which we detail in Chapter 3. We distinguish two approaches for mapping languages to parallel architectures. In the first case, the entry is a parallel language: the user wrote the parallel program by hand, which requires a great effort for detecting dependences and parallelism and writing an equivalent efficient parallel program. In the second case, the entry is a sequential language: this calls for an automatic parallelizer, which is responsible for dependence analyses and scheduling in order to generate efficient parallel code.

In this thesis we adopt the second approach: we use the PIPS compiler, presented in Section 2.6, as a practical tool for code analysis and parallelization. We target both shared and distributed memory systems, using two parallel languages: OpenMP and MPI. They provide high-level programming constructs to express task creation, termination, synchronization, etc. In Chapter 7, we show how we generate parallel code for these two languages.

2.4 Program Dependence Graph (PDG)

Parallelism detection is based on the analysis of the data and control dependences between instructions, encoded in an intermediate representation format. Parallelizers use various dependence graphs to represent such constraints. A program dependence graph (PDG) is a directed graph where the vertices represent blocks of statements and the edges represent essential control or data dependences; it encodes both control and data dependence information.

2.4.1 Control Dependence Graph (CDG)

The CDG represents a program as a graph of statements that are control dependent on the entry to the program. Control dependences represent control flow relationships that must be respected by any execution of the program, whether parallel or sequential.


Ferrante et al. [44] define control dependence as follows: Statement Y is control dependent on X when there is a path in the control flow graph (CFG) (see Figure 2.4a) from X to Y that does not contain the immediate forward dominator Z of X. Z forward dominates Y if all paths from Y include Z; the immediate forward dominator is the first forward dominator (closest to X). In the forward dominance tree (FDT) (see Figure 2.4b), vertices represent statements and edges represent the immediate forward dominance relation; the root of the tree is the exit of the CFG. The construction of the CDG (see Figure 2.4c) relies on explicit control flow and postdominance information. Unfortunately, when programs contain goto statements, explicit control flow and postdominance information are prerequisites for calculating control dependences.

Figure 2.4: Construction of the control dependence graph: (a) control flow graph (CFG), (b) forward dominance tree (FDT), (c) control dependence graph (CDG).

Harrold et al. [54] have proposed an algorithm that does not require explicit control flow or postdominator information to compute exact control dependences. Indeed, for structured programs, the control dependences are obtained directly from the Abstract Syntax Tree (AST). Programs that contain what the authors call structured transfers of control, such as break, return or exit, can be handled as special cases [54]. For more unstructured transfers of control, via arbitrary gotos, additional processing based on the control flow graph needs to be performed.

In this thesis we only handle the structured parts of a code, i.e. the ones that do not contain goto statements. In this context, PIPS thus implements control dependences directly in its IR, since the latter is in the form of an AST (for structured programs, CDG and AST are equivalent).
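A minimal illustration of this structured case (our own example, not taken from the thesis): the control dependences below can be read directly off the AST of the if statement.

/* S2 and S3 are control dependent on the test of S1: whether they
 * execute is decided by the branch taken at S1. S4 is not control
 * dependent on S1, since it executes on every path. */
void example(int c, int *x, int *y) {
  if (c > 0) {      /* S1 */
    *x = 1;         /* S2: control dependent on S1 */
    *y = *x + 1;    /* S3: control dependent on S1 (and data dependent on S2) */
  }
  *y = *y + 2;      /* S4: not control dependent on S1 */
}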


2.4.2 Data Dependence Graph (DDG)

The DDG is a subgraph of the program dependence graph that encodes the data dependence relations between instructions that use/define the same data element. Figure 2.5 shows an example of a C code and its data dependence graph; this DDG has been generated automatically with PIPS. Data dependences in a DDG are basically of three kinds [88], as shown in the figure:

• true dependences, or true data-flow dependences, when a value of an instruction is used by a subsequent one (aka RAW), colored red in the figure;

• output dependences, between two definitions of the same item (aka WAW), colored blue in the figure;

• anti-dependences, between a use of an item and a subsequent write (aka WAR), colored green in the figure.

A small generic example of these three kinds is given right after this list.
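The following snippet (our own generic example, distinct from the code of Figure 2.5) exhibits the three kinds on three statements:

/* Illustration of the three dependence kinds. */
void deps(int a, int b) {
  int x, y;
  x = a + 1;   /* s1: writes x */
  y = x * 2;   /* s2: true (RAW) dependence on s1, x is read after being written */
  x = b - 3;   /* s3: anti (WAR) dependence on s2 (x is rewritten after being read)
                  and output (WAW) dependence on s1 (x is written twice) */
  (void) y;    /* keep compilers from warning about the unused value */
}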

void main() {
  int a[10], b[10];
  int i, d, c;

  c = 42;
  for (i = 1; i <= 10; i++)
    a[i] = 0;
  for (i = 1; i <= 10; i++) {
    a[i] = bar(a[i]);
    d = foo(c);
    b[i] = a[i] + d;
  }
  return;
}

Figure 2.5: Example of a C code and its data dependence graph

Dependences are needed to define precedence constraints in programs. PIPS implements both control dependences (in its IR, in the form of an AST) and data dependences (in the form of a dependence graph). These data structures are the basis for scheduling instructions, an important step in any parallelization process.


2.5 List Scheduling

Task scheduling is the process of specifying the order and assignment of start and end times to a set of tasks, to be run concurrently on a multiprocessor, such that the completion time of the whole application is as small as possible, while respecting the dependence constraints of each task. Usually, the number of tasks exceeds the number of processors; thus some processors are dedicated to multiple tasks. Since finding the optimal solution of a general scheduling problem is NP-complete [46], providing an efficient heuristic to find a good solution is needed. This is the problem we address in Chapter 5.

2.5.1 Background

Scheduling approaches can be categorized in many ways. In preemptive techniques, the currently executing task can be preempted by another, higher-priority task, while in a non-preemptive scheme a task keeps its processor until termination. Preemptive scheduling algorithms are instrumental in avoiding possible deadlocks and for implementing real-time systems, where tasks must adhere to specified deadlines; however, preemptions generate run-time overhead. Moreover, when static predictions of task characteristics such as execution time and communication cost exist, task scheduling can be performed statically (offline). Otherwise, dynamic (online) schedulers must make run-time mapping decisions whenever new tasks arrive; this also introduces run-time overhead.

In the context of the automatic parallelization of scientific applicationswe focus on in this thesis we are interested in non-preemptive static schedul-ing policies of parallelized code4 Even though the subject of static schedul-ing is rather mature (see Section 54) we believe the advent and widespreaduse of multi-core architectures with the constraints they impose warrantto take a fresh look at its potential Indeed static scheduling mechanismshave first the strong advantage of reducing run-time overheads a key fac-tor when considering execution time and energy usage metrics One otherimportant advantage of these schedulers over dynamic ones at least overthose not equipped with detailed static task information is that the ex-istence of efficient schedules is ensured prior to program execution Thisis usually not an issue when time performance is the only goal at stakebut much more so when memory constraints might prevent a task beingexecuted at all on a given architecture Finally static schedules are pre-dictable which helps both at the specification (if such a requirement hasbeen introduced by designers) and debugging levels Given the breadth ofthe literature on scheduling we introduce in this section the notion of list-scheduling heuristics [72] since we use it in this thesis which is a class of

4. Note that more expensive preemptive schedulers would be required if fairness concerns were high, which is not frequently the case in the applications we address here.


scheduling heuristics.

2.5.2 Algorithm

A labeled directed acyclic graph (DAG) G is defined as G = (T, E, C), where (1) T = vertices(G) is a set of n tasks (vertices) τ, annotated with an estimation of their execution time, task_time(τ); (2) E = edges(G) is a set of m directed edges e = (τi, τj) between two tasks, annotated with their dependences, edge_regions(e)5; and (3) C is an n × n sparse communication edge cost matrix, edge_cost(e). task_time(τ) and edge_cost(e) are assumed to be numerical constants, although we show how we lift this restriction in Section 6.3. The functions successors(τ, G) and predecessors(τ, G) return the list of immediate successors and predecessors of a task τ in the DAG G. Figure 2.6 provides an example of a simple graph with vertices τi; vertex times are listed in the vertex circles, while edge costs label arrows.

Vertices and times:  entry (0);  τ1 (1), τ4 (2);  τ2 (3), τ3 (2);  exit (0)
Edges and costs:     entry→τ1 (0), entry→τ4 (0), τ1→τ2 (1), τ4→τ2 (2), τ4→τ3 (1), τ2→exit (0), τ3→exit (0)

step   task   tlevel   blevel
  1     τ4      0        7
  2     τ3      3        2
  3     τ1      0        5
  4     τ2      4        3

Figure 2.6: A Directed Acyclic Graph (left) and its associated data (right)

A list scheduling process provides, from a DAG G, a sequence of its vertices that satisfies the relationship imposed by E. Various heuristics try to minimize the schedule total length, possibly allocating the various vertices to different clusters, which ultimately will correspond to different processes or threads. A cluster κ is thus a list of tasks; if τ ∈ κ, we note cluster(τ) = κ. List scheduling is based on the notion of vertex priorities. The priority of each task τ is computed using the following attributes:

• The top level tlevel(τ, G) of a vertex τ is the length of the longest

5. A region is defined in Section 2.6.3.


path from the entry vertex6 of G to τ. The length of a path is the sum of the communication costs of the edges and the computational times of the vertices along the path. Top levels are used to estimate the start times of vertices on processors: the top level is the earliest possible start time. Scheduling in an ascending order of top levels schedules vertices in a topological order. The algorithm for computing the top level of a vertex τ in a graph is given in Algorithm 1.

ALGORITHM 1: The top level of Task τ in Graph G

function tlevel(τ, G)
   tl = 0
   foreach τi ∈ predecessors(τ, G)
      level = tlevel(τi, G) + task_time(τi) + edge_cost(τi, τ)
      if (tl < level) then tl = level
   return tl
end

• The bottom level blevel(τ, G) of a vertex τ is the length of the longest path from τ to the exit vertex of G. The maximum of the bottom levels of the vertices is the length cpl(G) of the graph's critical path, which is the longest path in the DAG G. The latest start time of a vertex τ is the difference (cpl(G) − blevel(τ, G)) between the critical path length and the bottom level of τ. Scheduling in a descending order of bottom levels tends to schedule critical path vertices first. The algorithm for computing the bottom level of τ in a graph is given in Algorithm 2.

ALGORITHM 2: The bottom level of Task τ in Graph G

function blevel(τ, G)
   bl = 0
   foreach τj ∈ successors(τ, G)
      level = blevel(τj, G) + edge_cost(τ, τj)
      if (bl < level) then bl = level
   return bl + task_time(τ)
end

To illustrate these notions, the top levels and bottom levels of each vertex of the graph presented in the left of Figure 2.6 are provided in the adjacent table.
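As a worked instance of Algorithms 1 and 2 (a sketch that assumes the vertex times and edge costs of Figure 2.6 as listed above, with null-cost edges from the entry vertex), the table entries of τ2 and τ4 are obtained as follows:

\begin{align*}
tlevel(\tau_2) &= \max(tlevel(\tau_1) + task\_time(\tau_1) + edge\_cost(\tau_1,\tau_2),\\
               &\qquad\;\; tlevel(\tau_4) + task\_time(\tau_4) + edge\_cost(\tau_4,\tau_2))\\
               &= \max(0 + 1 + 1,\; 0 + 2 + 2) = 4\\
blevel(\tau_4) &= task\_time(\tau_4) + \max(blevel(\tau_2) + edge\_cost(\tau_4,\tau_2),\\
               &\qquad\;\; blevel(\tau_3) + edge\_cost(\tau_4,\tau_3))\\
               &= 2 + \max(3 + 2,\; 2 + 1) = 7
\end{align*}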

6. One can always assume that a unique entry vertex exists, by adding it if need be and connecting it to the original ones with null-cost edges. This remark also applies to the exit vertex (see Figure 2.6).


The general algorithmic skeleton for list scheduling a graph G on P clusters (P can be infinite and is assumed to be always strictly positive) is provided in Algorithm 3: first, priorities priority(τ) are computed for all currently unscheduled vertices; then, the vertex with the highest priority is selected for scheduling; finally, this vertex is allocated to the cluster that offers the earliest start time. Function f characterizes each specific heuristic, while the set of clusters already allocated to tasks is clusters. Priorities need to be computed again for (a possibly updated) graph G after each scheduling of a task: task times and communication costs change when tasks are allocated to clusters. This is performed by the update_priority_values

function call. Part of the allocate_task_to_cluster procedure is to ensure that cluster(τ) = κ, which indicates that Task τ is now scheduled on Cluster κ.

ALGORITHM 3: List scheduling of Graph G on P processors

procedure list_scheduling(G, P)
   clusters = ∅
   foreach τi ∈ vertices(G)
      priority(τi) = f(tlevel(τi, G), blevel(τi, G))
   UT = vertices(G)                       // unscheduled tasks
   while UT ≠ ∅
      τ = select_task_with_highest_priority(UT)
      κ = select_cluster(τ, G, P, clusters)
      allocate_task_to_cluster(τ, κ, G)
      update_graph(G)
      update_priority_values(G)
      UT = UT − {τ}
end
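To make this skeleton concrete, the following is a minimal, self-contained C sketch of Algorithm 3, assuming the DAG of Figure 2.6 (without its entry and exit vertices), the priority f = blevel, and a greedy earliest-start cluster choice; the helper names (cluster_ready, earliest_start, ...) are illustrative and do not correspond to the actual PIPS/BDSC implementation.

#include <stdio.h>

#define NT 4                 /* tasks tau1..tau4                       */
#define NC 2                 /* number of clusters (P)                 */
#define NOEDGE -1

static int task_time[NT] = {1, 3, 2, 2};
static int edge_cost[NT][NT] = {            /* edge_cost[i][j]: tau_{i+1} -> tau_{j+1} */
  {NOEDGE, 1, NOEDGE, NOEDGE},              /* tau1 -> tau2 (1)                        */
  {NOEDGE, NOEDGE, NOEDGE, NOEDGE},
  {NOEDGE, NOEDGE, NOEDGE, NOEDGE},
  {NOEDGE, 2, 1, NOEDGE}                    /* tau4 -> tau2 (2), tau4 -> tau3 (1)      */
};
static int cluster_of[NT];                  /* -1 while unscheduled                    */
static int finish_time[NT];
static int cluster_ready[NC];               /* time at which each cluster is free      */

static int blevel(int t) {                  /* Algorithm 2, constant costs             */
  int bl = 0;
  for (int s = 0; s < NT; s++)
    if (edge_cost[t][s] != NOEDGE) {
      int level = blevel(s) + edge_cost[t][s];
      if (bl < level) bl = level;
    }
  return bl + task_time[t];
}

/* Earliest start of t on cluster c: after the cluster is free and after all
   predecessors have finished (plus communication if they run elsewhere).   */
static int earliest_start(int t, int c) {
  int est = cluster_ready[c];
  for (int p = 0; p < NT; p++)
    if (edge_cost[p][t] != NOEDGE) {
      int ready = finish_time[p] + (cluster_of[p] == c ? 0 : edge_cost[p][t]);
      if (est < ready) est = ready;
    }
  return est;
}

static int schedulable(int t) {             /* all predecessors already scheduled?     */
  for (int p = 0; p < NT; p++)
    if (edge_cost[p][t] != NOEDGE && cluster_of[p] < 0) return 0;
  return 1;
}

int main(void) {
  for (int t = 0; t < NT; t++) cluster_of[t] = -1;
  for (int n = 0; n < NT; n++) {            /* schedule one task per iteration         */
    int best_t = -1;
    for (int t = 0; t < NT; t++)            /* ready task with the highest priority    */
      if (cluster_of[t] < 0 && schedulable(t) &&
          (best_t < 0 || blevel(t) > blevel(best_t)))
        best_t = t;
    int best_c = 0;                         /* cluster offering the earliest start     */
    for (int c = 1; c < NC; c++)
      if (earliest_start(best_t, c) < earliest_start(best_t, best_c)) best_c = c;
    int start = earliest_start(best_t, best_c);
    cluster_of[best_t] = best_c;
    finish_time[best_t] = start + task_time[best_t];
    cluster_ready[best_c] = finish_time[best_t];
    printf("tau%d -> cluster %d, start %d\n", best_t + 1, best_c, start);
  }
  return 0;
}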

In this thesis, we introduce a new non-preemptive static list-scheduling heuristic called BDSC (Chapter 5) to extract task-level parallelism in the presence of resource constraints on the number of processors and their local memory size; it is based on a precise cost model and addresses both shared and distributed parallel memory architectures. We implement BDSC in PIPS.

2.6 PIPS Automatic Parallelizer and Code Transformation Framework

We chose PIPS [58] to showcase our parallelization approach, since it is readily available, well documented and encodes both control and data dependences. PIPS is a source-to-source compilation framework for analyzing and transforming C and Fortran programs. PIPS implements polyhedral


analyses and transformations that are inter-procedural and provides detailed static information about programs, such as use-def chains, the data dependence graph, symbolic complexity, memory effects7, the call graph, the interprocedural control flow graph, etc.

Many program analyses and transformations can be applied; the most important ones are listed below.

Parallelization: Several algorithms are implemented, such as Allen & Kennedy's parallelization algorithm [14], which performs loop distribution and vectorization, selecting innermost loops for vector units, and coarse-grain parallelization, based on the convex array regions of loops (see below), to avoid loop distribution and restrictions on loop bodies.

Scalar and array privatization: This analysis discovers variables whose values are local to a particular scope, usually a loop iteration.

Loop transformations [88]: These include unrolling, interchange, normalization, distribution, strip mining, tiling, index set splitting, etc.

Function transformations: PIPS supports, for instance, the inlining, outlining and cloning of functions.

Other transformations: These classical code transformations include dead-code elimination, partial evaluation, 3-address code generation, control restructuring, loop-invariant code motion, forward substitution, etc.

We use many of these analyses for implementing our parallelization methodology in PIPS, e.g. the "complexity" and data dependence analyses. These static analyses use convex polyhedra to represent abstractions of the values within various domains. A convex polyhedron over Z is a set of integer points in the n-dimensional space Zn, and can be represented by a system of linear inequalities. These analyses provide precise information that can be leveraged to perform automatic task parallelization (Chapter 6).
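For instance, the iteration domain of a triangular loop nest such as for(i=1;i<=10;i++) for(j=0;j<=i;j++) can be abstracted by the convex polyhedron below (an illustrative example, not taken from PIPS output):

\[
P = \{\, (i,j) \in \mathbb{Z}^2 \mid 1 \le i,\ i \le 10,\ 0 \le j,\ j \le i \,\}
\]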

We present below the precondition, transformer, array region and complexity analyses that are directly used in this thesis to compute dependences, execution and communication times. Then we present the intermediate representation (IR) of PIPS, since we extend it in this thesis to a parallel intermediate representation (Chapter 4), which is necessary for simple and generic parallel code generation.

2.6.1 Transformer Analysis

A transformer is an affine relation between variable values in the memory store before (old values) and after (new values) the execution of a statement. This analysis [58] is implemented in PIPS, where transformers are

7. A representation of the data read and written by a given statement.


represented by convex polyhedra. Indeed, a transformer T is defined by a list of arguments and a predicate system labeled by T; a transformer is empty (⊥) when the set of affine constraints of the system is not feasible; a transformer is equal to the identity when no variables are modified.

This relation models the transformation, by the execution of a statement, of an input memory store into an output memory store. For instance, in Figure 2.7, the transformer of k, T(k), before and after the execution of the statement k++, is the relation {k = k_init + 1}, where k_init is the value of k before executing the increment statement; its value before the loop is k_init = 0.

T(i,k) {k = 0}
   int i, k = 0;

T(i,k) {i = k+1, k_init = 0}
   for(i = 1; i <= 10; i++) {

T() {}
      a[i] = f(a[i]);

T(k) {k = k_init+1}
      k++;
   }

Figure 2.7: Example of transformer analysis

The transformer analysis is used to compute preconditions (see next section, Section 2.6.2) and to compute the path transformer (see Section 6.4) that connects two memory stores. This analysis is necessary for our dependence test.

2.6.2 Precondition Analysis

A precondition analysis is a function that labels each statement in a program with its precondition. A precondition provides information about the values of variables: it is a condition that holds true before the execution of the statement.

This analysis [58] is implemented in PIPS, where preconditions are represented by convex polyhedra. Preconditions are determined before the computation of array regions (see next section, Section 2.6.3), in order to have precise information on the variable values. They are also used to perform a sophisticated static complexity analysis (see Section 2.6.4). In Figure 2.8, for instance, the precondition P over the variables i and j before the second loop is {i = 11}, which is indeed the value of i after the execution of the first loop. Note that no information is available for j at this step.


int i, j;

P(i,j) {}
   for(i = 1; i <= 10; i++)

P(i,j) {1 ≤ i, i ≤ 10}
      a[i] = f(a[i]);

P(i,j) {i = 11}
   for(j = 6; j <= 20; j++)

P(i,j) {i = 11, 6 ≤ j, j ≤ 20}
      b[j] = g(a[j], a[j+1]);

P(i,j) {i = 11, j = 21}

Figure 2.8: Example of precondition analysis

2.6.3 Array Region Analysis

PIPS provides a precise intra- and inter-procedural analysis of array data flow, which is important for computing dependences for each array element. This analysis of array regions [35] is implemented in PIPS. A region r of an array T is an abstraction of a subset of its elements. Formally, a region r is a quadruplet of (1) a reference v (variable referenced in the region), (2) a region type t, (3) an approximation a for the region, EXACT when the region exactly represents the requested set of array elements, or MAY if it is an over- or under-approximation, and (4) c, a convex polyhedron containing equalities and inequalities, whose parameters are the variable values in the current memory store. We write <v−t−a−c> for Region r.

We distinguish four types of sets of regions: Rr, Rw, Ri and Ro. Read Rr and Write Rw regions contain the array elements respectively read or written by a statement. In regions Ri contain the array elements read and imported by a statement. Out regions Ro contain the array elements written and exported by a statement.

For instance, in Figure 2.9, PIPS is able to infer the following sets of regions:

Rw1 = {<a(φ1)−W−EXACT−{1 ≤ φ1, φ1 ≤ 10}>,
       <b(φ1)−W−EXACT−{1 ≤ φ1, φ1 ≤ 10}>}

Rr2 = {<a(φ1)−R−EXACT−{6 ≤ φ1, φ1 ≤ 21}>}

where the Write regions Rw1 of arrays a and b, modified in the first loop, are EXACTly the array elements with indices in the interval [1,10]. The Read regions Rr2 of array a in the second loop represent EXACTly the elements with indices in [6,21].


<a(φ1)−R−EXACT−{1 ≤ φ1, φ1 ≤ 10}>
<a(φ1)−W−EXACT−{1 ≤ φ1, φ1 ≤ 10}>
<b(φ1)−W−EXACT−{1 ≤ φ1, φ1 ≤ 10}>
for(i = 1; i <= 10; i++) {
   a[i] = f(a[i]);
   b[i] = 42;
}

<a(φ1)−R−EXACT−{6 ≤ φ1, φ1 ≤ 21}>
<b(φ1)−W−EXACT−{6 ≤ φ1, φ1 ≤ 20}>
for(j = 6; j <= 20; j++)
   b[j] = g(a[j], a[j+1]);

Figure 2.9: Example of array region analysis

We define the following set of operations on sets of regions Ri (convex polyhedra):

1. regions_intersection(R1, R2) is a set of regions; each region r in this set is the intersection of two regions r1 ∈ R1 of type t1 and r2 ∈ R2 of type t2 with the same reference. The type of r is t1t2, and the convex polyhedron of r is the intersection of the two convex polyhedra of r1 and r2, which is also a convex polyhedron.

2. regions_difference(R1, R2) is a set of regions; each region r in this set is the difference of two regions r1 ∈ R1 and r2 ∈ R2 with the same reference. The type of r is t1t2, and the convex polyhedron of r is the set difference between the two convex polyhedra of r1 and r2. Since this is not generally a convex polyhedron, an under-approximation is computed in order to obtain a convex representation.

3. regions_union(R1, R2) is a set of regions; each region r in this set is the union of two regions r1 ∈ R1 of type t1 and r2 ∈ R2 of type t2 with the same reference. The type of r is t1t2, and the convex polyhedron of r is the union of the two convex polyhedra of r1 and r2, which is again not necessarily a convex polyhedron. An approximated convex hull is thus computed in order to return the smallest enclosing polyhedron.

In this thesis, array regions are used for communication cost estimation, dependence computation between statements, and data volume estimation for statements. For example, if we need to compute array dependences between two compound statements S1 and S2, we search for the array elements that are accessed in both statements. Therefore, we are dealing with the two statements' sets of regions R1 and R2; the result is the intersection between the two sets of regions. However, the two sets of regions should be in the same


memory store to make it possible to compare them: we should bring the set of regions R1 to the store of R2, by combining R1 with the changes performed between the two statements, i.e. the transformer that connects the two memory stores. Thus, we compute the path transformer between S1 and S2 (see Section 6.4). Moreover, we use operations on array regions for our cost model generation (see Section 6.3) and communication generation (see Section 7.3).
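As a small worked example of how these operations support dependence testing, consider the regions of Figure 2.9: intersecting, for Array a, the polyhedron of the Write region of the first loop with that of the Read region of the second loop (once both are expressed in the same memory store) gives

\[
\{1 \le \phi_1,\ \phi_1 \le 10\} \cap \{6 \le \phi_1,\ \phi_1 \le 21\} = \{6 \le \phi_1,\ \phi_1 \le 10\} \neq \emptyset,
\]

so the elements a[6] to a[10] are written by the first loop and read by the second one, and a data dependence exists between the two loops.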

2.6.4 "Complexity" Analysis

"Complexities" are symbolic approximations of the static execution time estimation of statements, in terms of numbers of cycles. They are implemented in PIPS using polynomial approximations of execution times. Each statement is thus labeled with an expression, represented by a polynomial over program variables, assuming that each basic operation (addition, multiplication...) has a fixed, architecture-dependent execution time. Table 2.2 shows the complexity formulas generated for each function of Harris (see the code in Figure 1.1) using the PIPS complexity analysis, where the N and M variables represent the input image size.

Function      Complexity (polynomial)
InitHarris    9 × N × M
SobelX        60 × N × M
SobelY        60 × N × M
MultiplY      17 × N × M
Gauss         85 × N × M
CoarsitY      34 × N × M

Table 2.2: Execution time estimations for Harris functions using the PIPS complexity analysis

We use this analysis provided by PIPS to determine an approximate execution time for each statement, in order to guide our scheduling heuristic.
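As an illustration of how such polynomials are composed (a sketch; the actual per-operation cycle counts used by PIPS are architecture-dependent), a loop nest scanning an N × M image satisfies

\[
complexity\big(\texttt{for(i..N) for(j..M) S}\big) = N \times M \times complexity(S) + loop\ overhead,
\]

so a loop body whose statements cost about 60 cycles yields an estimate of the form 60 × N × M, as reported for SobelX in Table 2.2.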

2.6.5 PIPS (Sequential) IR

The PIPS intermediate representation (IR) [32] of sequential programs is a hierarchical data structure that embeds both control flow graphs and abstract syntax trees. We provide in this section a high-level description of the intermediate representation of PIPS, which targets the Fortran and C imperative programming languages; it is specified using Newgen [60], a domain-specific language for the definition of set equations, from which a dedicated API is automatically generated to manipulate (creation, access, IO operations) the data structures implementing these set elements. This section contains only a slightly simplified subset of the intermediate representation of PIPS, the


part that is directly related to the parallel paradigms addressed in this thesis. The Newgen definition of this part is given in Figure 2.10.

instruction = call + forloop + sequence + unstructured

statement = instruction x declarations:entity*

entity = name:string x type x initial:value

call = function:entity x arguments:expression*

forloop = index:entity x
          lower:expression x upper:expression x
          step:expression x body:statement

sequence = statements:statement*

unstructured = entry:control x exit:control

control = statement x
          predecessors:control* x successors:control*

expression = syntax x normalized

syntax = reference + call + cast

reference = variable:entity x indices:expression*

type = area + void

Figure 2.10: Simplified Newgen definitions of the PIPS IR

• Control flow in PIPS IR is represented via instructions, members of the disjoint union (using the "+" symbol) set instruction. An instruction can be either a simple call or a compound instruction, i.e., a for loop, a sequence or a control flow graph. A call instruction represents built-in or user-defined function calls; for instance, assign statements are represented as calls to the "=" function.

• Instructions are included within statements, which are members of the cartesian product set statement that also incorporates the declarations of local variables; a whole function body is represented in PIPS IR as a statement. All named objects, such as user variables or built-in functions, are members in PIPS of the entity set (the value set denotes constants, while the "*" symbol introduces Newgen list sets).

• Compound instructions can be either (1) a loop instruction, which includes an iteration index variable with its lower, upper and increment expressions, and a loop body, (2) a sequence, i.e., a succession of statements encoded as a list, or (3) a control flow graph (unstructured). In Newgen, a given set component such as expression can be distinguished using a prefix, such as lower and upper here.

• Programs that contain structured (exit, break and return) and unstructured (goto) transfers of control are handled in the PIPS intermediate representation via the unstructured set. An unstructured


instruction has one entry and one exit control vertices; a control is a vertex in a graph, labeled with a statement and its lists of predecessor and successor control vertices. Executing an unstructured instruction amounts to following the control flow induced by the graph successor relationship, starting at the entry vertex, while executing the vertex statements, until the exit vertex is reached, if at all.
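To make these bullets concrete, here is a small, hypothetical C encoding of a fragment of Figure 2.10 (the real PIPS data structures are generated by Newgen, and their names and API differ). It builds the statement a[i] = 0, represented, as explained above, as a call to the "=" entity whose arguments are the reference a[i] and the constant 0 (constants being entities called with no arguments).

#include <stdlib.h>
#include <string.h>

typedef struct entity     { char *name; } entity;
typedef struct expression expression;
typedef struct reference  { entity *variable; expression **indices; int nindices; } reference;
typedef struct call       { entity *function; expression **arguments; int nargs; } call;

typedef enum { SYNTAX_REFERENCE, SYNTAX_CALL } syntax_tag;
struct expression { syntax_tag tag; union { reference *ref; call *c; } u; };

typedef enum { INSTR_CALL, INSTR_FORLOOP, INSTR_SEQUENCE, INSTR_UNSTRUCTURED } instr_tag;
typedef struct instruction { instr_tag tag; union { call *c; } u; } instruction;
typedef struct statement   { instruction *instr; entity **declarations; int ndecls; } statement;

static entity *make_entity(const char *name) {
   entity *e = malloc(sizeof *e); e->name = strdup(name); return e;
}
static expression *make_ref_expr(entity *v, expression **idx, int n) {
   reference *r = malloc(sizeof *r); r->variable = v; r->indices = idx; r->nindices = n;
   expression *e = malloc(sizeof *e); e->tag = SYNTAX_REFERENCE; e->u.ref = r; return e;
}
static expression *make_call_expr(entity *f, expression **args, int n) {
   call *c = malloc(sizeof *c); c->function = f; c->arguments = args; c->nargs = n;
   expression *e = malloc(sizeof *e); e->tag = SYNTAX_CALL; e->u.c = c; return e;
}

int main(void) {
   /* a[i] = 0  ==>  call to "=" with arguments (reference a[i], constant 0) */
   entity *a = make_entity("a"), *i = make_entity("i");
   entity *assign = make_entity("="), *zero = make_entity("0");
   expression *idx[1]  = { make_ref_expr(i, NULL, 0) };
   expression *args[2] = { make_ref_expr(a, idx, 1), make_call_expr(zero, NULL, 0) };
   call *asg = malloc(sizeof *asg);
   asg->function = assign; asg->arguments = args; asg->nargs = 2;
   instruction instr = { INSTR_CALL, { .c = asg } };
   statement s = { &instr, NULL, 0 };   /* a statement wraps the instruction and declarations */
   (void) s;
   return 0;
}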

In this thesis, we use the IR of PIPS [58] to showcase our approach: SPIRE, a parallel extension formalism for existing sequential intermediate representations (Chapter 4).

2.7 Conclusion

The purpose of this chapter is to present the context of existing hardware architectures and parallel programming models for which our thesis work is developed. We also outline how to bridge these two levels using two approaches for writing parallel code: manual and automatic. Since in this thesis we adopt the automatic approach, this chapter also presents the notions of intermediate representation, control and data dependences, and scheduling, as well as the PIPS compilation framework in which we implement our automatic parallelization methodology.

In the next chapter, we survey in detail seven recent and popular parallel programming languages and their different parallel constructs. This analysis will help us find a uniform core language that can be designed as a unique but generic parallel intermediate representation, via the development of the SPIRE methodology (see Chapter 4).

Chapter 3

Concepts in Task Parallel Programming Languages

The unconscious is structured like a language. (Jacques Lacan)

Existing parallel languages present portability issues; OpenMP, for instance, targets only shared memory systems. Parallelizing sequential programs to be run on different types of architectures requires writing the same application in different parallel programming languages. Unifying the wide variety of existing programming models in a generic core language, used as a parallel intermediate representation, can simplify this portability problem. To design this parallel IR, we use a survey of existing parallel language constructs, which provide a trade-off between expressibility and conciseness of representation. This chapter surveys seven popular and efficient parallel language designs that tackle this difficult issue: Cilk, Chapel, X10, Habanero-Java, OpenMP, MPI and OpenCL. Using as a single running example a parallel implementation of the computation of the Mandelbrot set, this chapter describes how the fundamentals of task parallel programming, i.e., collective and point-to-point synchronization and mutual exclusion, are dealt with in these languages. We also discuss how these languages allocate and distribute data over memory. Our study suggests that, even though there are many keywords and notions introduced by these languages, they all boil down, as far as control issues are concerned, to three key task concepts: creation, synchronization and atomicity. Regarding memory models, these languages adopt one of three approaches: shared memory, distributed memory and PGAS (Partitioned Global Address Space).

3.1 Introduction

This chapter is a state-of-the-art survey of important existing task parallel programming languages, which helps us achieve two of our goals in this thesis: (1) the development of a parallel intermediate representation, which should be as generic as possible to handle different parallel programming languages; this is made possible by the taxonomy produced at the end of this chapter; and (2) the choice of the languages we use as back-ends of our automatic parallelization process; these are determined by the knowledge of existing target parallel languages, their limits and their extensions.


Indeed, programming parallel machines as effectively as sequential ones would ideally require a language that provides high-level programming constructs, to avoid the programming errors frequent when expressing parallelism. Programming languages adopt one of two ways to deal with this issue: (1) high-level languages hide the presence of parallelism at the software level, thus offering code that is easy to build and port, but whose performance is not guaranteed, and (2) low-level languages use explicit constructs for communication patterns and for specifying the number and placement of threads, but the resulting code is difficult to build and not very portable, although usually efficient. Recent programming models explore the best trade-offs between expressiveness and performance when addressing parallelism.

Traditionally, there are two general ways to break an application into concurrent parts, in order to take advantage of a parallel computer and execute them simultaneously on different CPUs: data and task parallelism. In data parallelism, the same instruction is performed repeatedly and simultaneously on different data. In task parallelism, the execution of different processes (threads) is distributed across multiple computing nodes. Task parallelism is often considered more difficult to specify than data parallelism, since it lacks the regularity present in the latter model; processes (threads) run different instructions simultaneously, leading to different execution schedules and memory access patterns. Task management must address both control and data dependences, in order to minimize execution and communication times.

This chapter describes how seven popular and efficient parallel programming language designs, either widely used or recently developed at DARPA1, tackle the issue of task parallelism specification: Cilk, Chapel, X10, Habanero-Java, OpenMP, MPI and OpenCL. They are selected based on the richness of their constructs and their popularity; they provide simple, high-level parallel abstractions that cover most of the parallel programming language design spectrum. We use a popular parallel problem, the computation of the Mandelbrot set [1], as a running example. We consider this an interesting test case, since it exhibits a high level of embarrassing parallelism, while its iteration space is not easily partitioned if one wants to have tasks of balanced run times. Our goal with this chapter is to study and compare aspects of task parallelism with our automatic parallelization aim in mind.

After this introduction, Section 3.2 presents our running example. We discuss the parallel language features specific to task parallelism, namely task creation, synchronization and atomicity, in Section 3.3. In Section 3.4, a selection of current and important parallel programming languages are described: Cilk, Chapel, X10, Habanero-Java, OpenMP, MPI and OpenCL.

1. Three programming languages developed as part of the DARPA HPCS program: Chapel, Fortress [13] and X10.


For each language, an implementation of the Mandelbrot set algorithm is presented. Section 3.5 contains a comparison, discussion and classification of these languages. We conclude in Section 3.6.

3.2 Mandelbrot Set Computation

The Mandelbrot set is a fractal set. For each complex number c ∈ C, the sequence of complex numbers zn(c) is defined by induction as follows: z0(c) = c and z_{n+1}(c) = z_n(c)^2 + c. The Mandelbrot set M is then defined as M = {c ∈ C | lim_{n→∞} |z_n(c)| < ∞}; thus M is the set of all complex numbers c for which the series zn(c) converges. One can show [1] that a finite limit for zn(c) exists only if the modulus of zm(c) is less than 2 for some positive m. We give a sequential C implementation of the computation of the Mandelbrot set in Figure 3.2. Running this program yields Figure 3.1, in which each complex number c is seen as a pixel, its color being related to its convergence property; the Mandelbrot set is the black shape in the middle of the figure.

Figure 3.1: Result of the Mandelbrot set

We use this base program as our test case for the parallel implementations, in Section 3.4, of the parallel languages we selected. This is an interesting case for illustrating parallel programming languages: (1) it is an embarrassingly parallel problem, since all computations of pixel colors can be performed simultaneously, and thus is obviously a good candidate for expressing parallelism, but (2) its efficient implementation is not obvious, since good load balancing cannot be achieved by simply grouping localized pixels together, because convergence can vary widely from one point to the next, due to the fractal nature of the Mandelbrot set.


unsigned long min_color = 0, max_color = 16777215;
unsigned int width = NPIXELS, height = NPIXELS,
             N = 2, maxiter = 10000;
double r_min = -N, r_max = N, i_min = -N, i_max = N;
double scale_r = (r_max - r_min)/width;
double scale_i = (i_max - i_min)/height;
double scale_color = (max_color - min_color)/maxiter;
Display *display; Window win; GC gc;
for (row = 0; row < height; ++row) {
   for (col = 0; col < width; ++col) {
      zr = zi = 0;
      /* Scale c as display coordinates of current point */
      cr = r_min + ((double) col * scale_r);
      ci = i_min + ((double) (height-1-row) * scale_i);
      /* Iterate z = z*z+c while |z| < N or maxiter is reached */
      k = 0;
      do {
         temp = zr*zr - zi*zi + cr;
         zi = 2*zr*zi + ci; zr = temp;
         ++k;
      } while (zr*zr + zi*zi < (N*N) && k < maxiter);
      /* Set color and display point */
      color = (ulong) ((k-1) * scale_color) + min_color;
      XSetForeground(display, gc, color);
      XDrawPoint(display, win, gc, col, row);
   }
}

Figure 3.2: Sequential C implementation of the Mandelbrot set

3.3 Task Parallelism Issues

Among the many issues related to parallel programming, the questions of task creation, synchronization and atomicity are particularly important when dealing with task parallelism, our focus in this thesis.

3.3.1 Task Creation

In this thesis, a task is a static notion, i.e., a list of instructions, while processes and threads are running instances of tasks. Creation of system-level task instances is an expensive operation, since its implementation, via processes, requires allocating and later possibly releasing system-specific resources. If a task has a short execution time, this overhead might make the overall computation quite inefficient. Another way to introduce parallelism is to use lighter, user-level tasks, called threads. In all languages addressed in this chapter, task management operations refer to such user-level tasks. The problem of finding the proper size of tasks, and hence the number of tasks, can be decided at compile or run times, using heuristics.

In our Mandelbrot example, the parallel implementations we provide


below use a static schedule that allocates a number of iterations of the loop row to a particular thread; we interleave successive iterations into distinct threads in a round-robin fashion, in order to group loop body computations into chunks of size height/P, where P is the (language-dependent) number of threads. Our intent here is to try to reach a good load balancing between threads, as sketched below.
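The following small C sketch shows this static round-robin assignment; compute_pixel is a placeholder for the per-pixel color computation of Figure 3.2.

/* Thread m (0 <= m < P) handles rows m, m+P, m+2P, ..., i.e. about height/P rows. */
extern void compute_pixel(int row, int col);   /* placeholder for the loop body of Figure 3.2 */

void compute_rows(int m, int P, int height, int width) {
   for (int row = m; row < height; row += P)
      for (int col = 0; col < width; ++col)
         compute_pixel(row, col);
}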

3.3.2 Synchronization

Coordination in task-parallel programs is a major source of complexity. It is dealt with using synchronization primitives, for instance when a code fragment contains many phases of execution, where each phase should wait for the precedent ones to proceed. When a process or a thread exits before synchronizing on a barrier that other processes are waiting on, or when processes operate on different barriers using different orders, a deadlock occurs. Programmers must avoid these situations (and be deadlock-free). Different forms of synchronization constructs exist, such as mutual exclusion when accessing shared resources using locks, join operations that terminate child threads, multiple synchronizations using barriers2, and point-to-point synchronization using counting semaphores [95].

In our Mandelbrot example, we need to synchronize all pixel computations before exiting; we also need to use synchronization to deal with the atomic section (see the next section). Even though synchronization is rather simple in this example, caution is always needed; an example that may lead to deadlocks is mentioned in Section 3.5.

3.3.3 Atomicity

Access to shared resources requires atomic operations that, at any given time, can be executed by only one process or thread. Atomicity comes in two flavors: weak and strong [73]. A weak atomic statement is atomic only with respect to other explicitly atomic statements; no guarantee is made regarding interactions with non-isolated statements (not declared as atomic). By opposition, strong atomicity enforces non-interaction of atomic statements with all operations in the entire program. It usually requires specialized hardware support (e.g., atomic "compare and swap" operations), although a software implementation that treats non-explicitly atomic accesses as implicitly atomic single operations (using a single global lock) is possible.

In our Mandelbrot example, display accesses require a connection to the X server; drawing a given pixel is an atomic operation, since GUI-specific calls

2. The term "barrier" is used in various ways by authors [87]; we consider here that barriers are synchronization points that wait for the termination of sets of threads, defined in a language-dependent manner.


need synchronization. Moreover, two simple examples of atomic sections are provided in Section 3.5.

3.4 Parallel Programming Languages

We present here seven parallel programming languages and describe how they deal with the concepts introduced in the previous section. We also study how these languages distribute data over different processors. Given the large number of parallel languages that exist, we focus primarily on languages that are in current use and popular, and that support simple, high-level, task-oriented parallel abstractions.

3.4.1 Cilk

Cilk [103], developed at MIT, is a multithreaded parallel language based on C for shared memory systems. Cilk is designed for exploiting dynamic and asynchronous parallelism. A Cilk implementation of the Mandelbrot set is provided in Figure 3.3.

cilk_lock_init(display_lock);
for (m = 0; m < P; m++)
   spawn compute_points(m);
sync;

cilk void compute_points(uint m) {
   for (row = m; row < height; row += P)
      for (col = 0; col < width; ++col) {
         /* Initialization of c, k and z */
         do {
            temp = zr*zr - zi*zi + cr;
            zi = 2*zr*zi + ci; zr = temp;
            ++k;
         } while (zr*zr + zi*zi < (N*N) && k < maxiter);
         color = (ulong) ((k-1) * scale_color) + min_color;
         cilk_lock(display_lock);
         XSetForeground(display, gc, color);
         XDrawPoint(display, win, gc, col, row);
         cilk_unlock(display_lock);
      }
}

Figure 3.3: Cilk implementation of the Mandelbrot set (P is the number of processors)

3. From now on, variable declarations are omitted, unless required for the purpose of our presentation.


Task Parallelism

The cilk keyword identifies functions that can be spawned in parallel. A Cilk function may create threads to execute functions in parallel. The spawn keyword is used to create child tasks, such as compute_points in our example, when referring to Cilk functions.

Cilk introduces the notion of inlets [2], which are local Cilk functions defined to take the result of spawned tasks and use it (performing a reduction); the result should not be put in a variable in the parent function. All the variables of the function are available within an inlet. abort allows aborting a speculative work by terminating all of the already spawned children of a function; it must be called inside an inlet. Inlets are not used in our example.

Synchronization

The sync statement is a local barrier, used in our example to ensure task termination. It waits only for the spawned child tasks of the current procedure to complete, and not for all tasks currently being executed.

Atomic Section

Mutual exclusion is implemented using locks of type cilk_lockvar, such as display_lock in our example. The function cilk_lock is used to test a lock and blocks if it is already acquired; the function cilk_unlock is used to release a lock. Both functions take a single argument, which is an object of type cilk_lockvar; cilk_lock_init is used to initialize the lock object before it is used.

Data Distribution

In Cilk's shared memory model, all variables declared outside Cilk functions are shared. Also, variables addressed indirectly, by passing pointers to spawned functions, are shared. To avoid possible non-determinism due to data races, the programmer should avoid situations where a task writes a variable that may be read or written concurrently by another task, or use the primitive cilk_fence, which ensures that all memory operations of a thread are committed before the next operation is executed.

3.4.2 Chapel

Chapel [6], developed by Cray, supports both data and control flow parallelism and is designed around a multithreaded execution model based on PGAS for shared and distributed-memory systems. A Chapel implementation of the Mandelbrot set is provided in Figure 3.4.


coforall loc in Locales do
   on loc {
      for row in loc.id..height by numLocales do {
         for col in 1..width do {
            // Initialization of c, k and z
            do {
               temp = zr*zr - zi*zi + cr;
               zi = 2*zr*zi + ci; zr = temp;
               k = k+1;
            } while (zr*zr + zi*zi < (N*N) && k < maxiter);
            color = (ulong) ((k-1) * scale_color) + min_color;
            atomic {
               XSetForeground(display, gc, color);
               XDrawPoint(display, win, gc, col, row);
            }
         }
      }
   }

Figure 3.4: Chapel implementation of the Mandelbrot set

Task Parallelism

Chapel provides three types of task parallelism [6], two structured ones and one unstructured. cobegin{stmts} creates a task for each statement in stmts; the parent task waits for the stmts tasks to be completed. coforall is a loop variant of the cobegin statement, where each iteration of the coforall loop is a separate task and the main thread of execution does not continue until every iteration is completed. Finally, in begin{stmt}, the original parent task continues its execution after spawning a child running stmt.

Synchronization

In addition to cobegin and coforall, used in our example, which have an implicit synchronization at the end, synchronization variables of type sync

can be used for coordinating parallel tasks. A sync [6] variable is either empty or full, with an additional data value. Reading an empty variable and writing to a full variable both suspend the thread. Writing to an empty variable atomically changes its state to full. Reading a full variable consumes the value and atomically changes the state to empty.

Atomic Section

Chapel supports atomic sections: atomic {stmt} executes stmt atomically with respect to other threads. The precise semantics is still ongoing work [6].


Data Distribution

Chapel introduces a type called locale to refer to a unit of the machine resources on which a computation is running. A locale is a mapping of Chapel data and computations to the physical machine. In Figure 3.4, the array Locales represents the set of locale values corresponding to the machine resources on which this code is running; numLocales refers to the number of locales. Chapel also introduces new domain types to specify array distribution; they are not used in our example.

3.4.3 X10 and Habanero-Java

X10 [5], developed at IBM, is a distributed asynchronous dynamic parallel programming language for multi-core processors, symmetric shared-memory multiprocessors (SMPs), commodity clusters, high-end supercomputers and even embedded processors like Cell. An X10 implementation of the Mandelbrot set is provided in Figure 3.5.

Habanero-Java [30], under development at Rice University, is derived from X10 and introduces additional synchronization and atomicity primitives, surveyed below.

finish {
   for (m = 0; m < place.MAX_PLACES; m++) {
      place pl_row = place.places(m);
      async at (pl_row) {
         for (row = m; row < height; row += place.MAX_PLACES)
            for (col = 0; col < width; ++col) {
               // Initialization of c, k and z
               do {
                  temp = zr*zr - zi*zi + cr;
                  zi = 2*zr*zi + ci; zr = temp;
                  ++k;
               } while (zr*zr + zi*zi < (N*N) && k < maxiter);
               color = (ulong) ((k-1) * scale_color) + min_color;
               atomic {
                  XSetForeground(display, gc, color);
                  XDrawPoint(display, win, gc, col, row);
               }
            }
      }
   }
}

Figure 3.5: X10 implementation of the Mandelbrot set

Task Parallelism

X10 provides two task creation primitives: (1) the async {stmt} construct creates a new asynchronous task that executes stmt, while the current thread continues, and (2) the future {exp} expression launches a parallel task that returns the value of exp.


Synchronization

With finish stmt, the current running task is blocked at the end of the finish clause, waiting until all the children spawned during the execution of stmt have terminated. The expression f.force() is used to get the actual value of the "future" task f.

X10 introduces a new synchronization concept: the clock. It acts as a barrier for a dynamically varying set of tasks [100] that operate in phases of execution, where each phase should wait for previous ones before proceeding. A task that uses a clock must first register with it (multiple clocks can be used). It then uses the statement next to signal to all the tasks that are registered with its clocks that it is ready to move to the following phase, and waits until all the clocks with which it is registered can advance. A clock can advance only when all the tasks that are registered with it have executed a next statement (see the left side of Figure 3.6).

Habanero-Java introduces phasers to extend this clock mechanism. A phaser is created and initialized to its first phase using the function new. The scope of a phaser is limited to the immediately enclosing finish statement; this constraint guarantees the deadlock-freedom safety property of phasers [100]. A task can be registered with zero or more phasers, using one of four registration modes: the first two are the traditional SIG and WAIT signal operations for producer-consumer synchronization; the SIG_WAIT mode implements barrier synchronization, while SIG_WAIT_SINGLE ensures, in addition, that its associated statement is executed by only one thread. As in X10, a next instruction is used to advance each phaser that this task is registered with to its next phase, in accordance with this task's registration mode, and waits on each phaser the task is registered with with a WAIT submode (see the right side of Figure 3.6). Note that clocks and phasers are not used in our Mandelbrot example, since a collective barrier based on the finish statement is sufficient.

// X10:
finish async {
   clock cl = clock.make();
   for(j = 1; j <= n; j++)
      async clocked(cl) {
         S;
         next;
         S';
      }
}

// Habanero-Java:
finish {
   phaser ph = new phaser();
   for(j = 1; j <= n; j++)
      async phased(ph<SIG_WAIT>) {
         S;
         next;
         S';
      }
}

Figure 3.6: A clock in X10 (left) and a phaser in Habanero-Java (right)


Atomic Section

When a thread enters an atomic statement, no other thread may enter it until the original thread terminates it.

Habanero-Java supports weak atomicity using the isolated stmt primitive for mutual exclusion. The Habanero-Java implementation takes a single-lock approach to deal with isolated statements.

Data Distribution

In order to distribute data across processors, X10 and HJ introduce a type called place. A place is an address space within which a task may run; different places may, however, refer to the same physical processor and share physical memory. The program address space is partitioned into logically distinct places. place.MAX_PLACES, used in Figure 3.5, is the number of places available to a program.

3.4.4 OpenMP

OpenMP [4] is an application program interface providing a multi-threaded programming model for shared memory parallelism; it uses directives to extend sequential languages. A C OpenMP implementation of the Mandelbrot set is provided in Figure 3.7.

P = omp_get_num_threads();
#pragma omp parallel shared(height, width, scale_r, \
        scale_i, maxiter, scale_color, min_color, r_min, i_min) \
        private(row, col, k, m, color, temp, z, c)
#pragma omp single
{
   for (m = 0; m < P; m++)
#pragma omp task
      for (row = m; row < height; row += P)
         for (col = 0; col < width; ++col) {
            /* Initialization of c, k and z */
            do {
               temp = zr*zr - zi*zi + cr;
               zi = 2*zr*zi + ci; zr = temp;
               ++k;
            } while (zr*zr + zi*zi < (N*N) && k < maxiter);
            color = (ulong) ((k-1) * scale_color) + min_color;
#pragma omp critical
            {
               XSetForeground(display, gc, color);
               XDrawPoint(display, win, gc, col, row);
            }
         }
}

Figure 3.7: C OpenMP implementation of the Mandelbrot set


Task Parallelism

OpenMP allows dynamic (omp task) and static (omp section) scheduling models. A task instance is generated each time a thread (the encountering thread) encounters an omp task directive. This task may either be scheduled immediately on the same thread, or deferred and assigned to any thread in a thread team, which is the group of threads created when an omp parallel directive is encountered. The omp sections directive is a non-iterative work-sharing construct. It specifies that the enclosed sections of code, declared with omp section, are to be divided among the threads in the team; these sections are independent blocks of code that the compiler can execute concurrently.

Synchronization

OpenMP provides synchronization constructs that control the execution inside a team of threads: barrier and taskwait. When a thread encounters a barrier directive, it waits until all other threads in the team reach the same point; the scope of a barrier region is the innermost enclosing parallel region. The taskwait construct is a restricted barrier that blocks the thread until all child tasks created since the beginning of the current task are completed. The omp single directive identifies code that must be run by only one thread.

Atomic Section

The critical and atomic directives are used for identifying a section of code that must be executed by a single thread at a time. The atomic directive works faster than critical, since it only applies to single instructions and can thus often benefit from hardware support. Our implementation of the Mandelbrot set in Figure 3.7 uses critical for the drawing function (which is not a single instruction).

Data Distribution

OpenMP variables are either global (shared) or local (private); see Figure 3.7 for examples. A shared variable refers to one unique block of storage for all threads in the team. A private variable refers to a different block of storage for each thread. More memory access modes exist, such as firstprivate or lastprivate, that may require communication or copy operations.
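A small stand-alone C example (a sketch, independent of the Mandelbrot code) illustrating these storage attributes: x is copied into each thread, while y remains shared.

#include <omp.h>
#include <stdio.h>

int main(void) {
   int x = 42, y = 0;
   #pragma omp parallel firstprivate(x) shared(y)
   {
      x += omp_get_thread_num();      /* each thread updates its own copy of x */
      #pragma omp atomic
      y += 1;                         /* all threads update the unique, shared y */
   }
   printf("x = %d, y = %d\n", x, y);  /* x is still 42 here; y equals the number of threads */
   return 0;
}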

3.4.5 MPI

MPI [3] is a library specification for message passing, used to program shared and distributed memory systems. Both point-to-point and collective


communication are supported. The C MPI implementation of the Mandelbrot set is provided in Figure 3.8.

int rank;
MPI_Status status;
ierr = MPI_Init(&argc, &argv);
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
ierr = MPI_Comm_size(MPI_COMM_WORLD, &P);
for (m = 1; m < P; m++)
   if(rank == m) {
      for (row = m; row < height; row += P) {
         for (col = 0; col < width; ++col) {
            /* Initialization of c, k and z */
            do {
               temp = zr*zr - zi*zi + cr;
               zi = 2*zr*zi + ci; zr = temp;
               ++k;
            } while (zr*zr + zi*zi < (N*N) && k < maxiter);
            color_msg[col] = (ulong) ((k-1) * scale_color) + min_color;
         }
         ierr = MPI_Send(color_msg, width, MPI_UNSIGNED_LONG, 0, 0,
                         MPI_COMM_WORLD);
      }
   }
if(rank == 0) {
   for (row = 0; row < height; row++) {
      ierr = MPI_Recv(data_msg, width, MPI_UNSIGNED_LONG,
                      MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);
      for (col = 0; col < width; ++col) {
         color = data_msg[col];
         XSetForeground(display, gc, color);
         XDrawPoint(display, win, gc, col, row);
      }
   }
}
ierr = MPI_Barrier(MPI_COMM_WORLD);
ierr = MPI_Finalize();

Figure 3.8: MPI implementation of the Mandelbrot set (P is the number of processors)

Task Parallelism

The starter process may be a separate process that is not part of the MPI application, or the rank 0 process may act as a starter process to launch the remaining MPI processes of the MPI application. MPI processes are created when MPI_Init is called; it also defines the initial universe intracommunicator used by all processes to drive the various communications. A communicator determines the scope of communications. Processes are terminated using


MPI_Finalize.

Synchronization

MPI_Barrier blocks until all processes in the communicator passed as an argument have reached this call.

Atomic Section

MPI does not support atomic sections. Indeed, the atomic section is a concept intrinsically linked to shared memory. In our example, the drawing function is executed by the master process (the rank 0 process).

Data Distribution

MPI supports the distributed memory model, where it proceeds by point-to-point and collective communications to transfer data between processors. In our implementation of the Mandelbrot set, we use the point-to-point routines MPI_Send and MPI_Recv to communicate between the master process (rank = 0) and the other processes.

3.4.6 OpenCL

OpenCL (Open Computing Language) [69] is a standard for programming heterogeneous multiprocessor platforms where programs are divided into several parts: some, called "the kernels", execute on separate devices, e.g., GPUs, with their own memories, and the others execute on the host CPU. The main object in OpenCL is the command queue, which is used to submit work to a device by enqueueing OpenCL commands to be executed. An OpenCL implementation of the Mandelbrot set is provided in Figure 3.9.

Task Parallelism

OpenCL provides the parallel construct clEnqueueTask, which enqueues a command requiring the execution of a kernel on a device by a work item (OpenCL thread). OpenCL uses two different models of execution of command queues: in-order, used for data parallelism, and out-of-order. In an out-of-order command queue, commands are executed as soon as possible, and no order is specified, except for wait and barrier events. We illustrate the out-of-order execution mechanism in Figure 3.9, but currently this is an optional feature and is thus not supported by many devices.


__kernel void kernel_main(complex c, uint maxiter, double scale_color,
                          uint m, uint P, ulong color[NPIXELS][NPIXELS])
{
   for (row = m; row < NPIXELS; row += P)
      for (col = 0; col < NPIXELS; ++col) {
         /* Initialization of c, k and z */
         do {
            temp = zr*zr - zi*zi + cr;
            zi = 2*zr*zi + ci; zr = temp;
            ++k;
         } while (zr*zr + zi*zi < (N*N) && k < maxiter);
         color[row][col] = (ulong) ((k-1) * scale_color);
      }
}

cl_int ret = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
ret = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_DEFAULT, 1,
                     &device_id, &ret_num_devices);
cl_context context = clCreateContext(NULL, 1, &device_id,
                                     NULL, NULL, &ret);
cQueue = clCreateCommandQueue(context, device_id,
                              OUT_OF_ORDER_EXEC_MODE_ENABLE, NULL);
P = CL_DEVICE_MAX_COMPUTE_UNITS;
memc = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(complex), c);
/* Create read-only buffers with maxiter, scale_color and P too */
memcolor = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                          sizeof(ulong)*height*width, NULL, NULL);
clEnqueueWriteBuffer(cQueue, memc, CL_TRUE, 0,
                     sizeof(complex), &c, 0, NULL, NULL);
/* Enqueue write buffer with maxiter, scale_color and P too */
program = clCreateProgramWithSource(context, 1, &program_source,
                                    NULL, NULL);
err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
kernel = clCreateKernel(program, "kernel_main", NULL);
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memc);
/* Set kernel arguments with memmaxiter, memscale_color, memP and memcolor too */
for(m = 0; m < P; m++) {
   memm = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(uint), m);
   clEnqueueWriteBuffer(cQueue, memm, CL_TRUE, 0, sizeof(uint), &m, 0,
                        NULL, NULL);
   clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memm);
   clEnqueueTask(cQueue, kernel, 0, NULL, NULL);
}
clFinish(cQueue);
clEnqueueReadBuffer(cQueue, memcolor, CL_TRUE, 0, space, color, 0, NULL, NULL);
for (row = 0; row < height; ++row)
   for (col = 0; col < width; ++col) {
      XSetForeground(display, gc, color[col][row]);
      XDrawPoint(display, win, gc, col, row);
   }

Figure 3.9: OpenCL implementation of the Mandelbrot set

Synchronization

OpenCL distinguishes between two types of synchronization: coarse and fine. Coarse-grained synchronization, which deals with command-queue operations, uses the construct clEnqueueBarrier, which defines a barrier synchronization point. Fine-grained synchronization, which covers


synchronization at the GPU function call granularity level, uses OpenCL events, via clEnqueueWaitForEvents calls.

Data transfers between the GPU memory and the host memory, via functions such as clEnqueueReadBuffer and clEnqueueWriteBuffer, also induce synchronization between blocking or non-blocking communication commands. Events returned by clEnqueue operations can be used to check if a non-blocking operation has completed.
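For instance, a non-blocking read can be synchronized explicitly through its event; the following C fragment is a sketch that reuses the cQueue, memcolor and color variables of Figure 3.9 and assumes they are already set up.

cl_event read_done;
clEnqueueReadBuffer(cQueue, memcolor, CL_FALSE /* non-blocking */, 0,
                    sizeof(unsigned long) * height * width, color,
                    0, NULL, &read_done);
clWaitForEvents(1, &read_done);   /* blocks until the transfer has completed */
clReleaseEvent(read_done);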

Atomic Section

Atomic operations are only supported on integer data, via functions such as atom_add or atom_xchg. Currently, these are only supported by some devices, as part of an extension of the OpenCL standard. Since OpenCL lacks support for general atomic sections, the drawing function is executed by the host in Figure 3.9.

Data Distribution

Each work item can either use (1) its private memory, (2) its local memory, which is shared between multiple work items, (3) its constant memory, which is closer to the processor than the global memory and thus much faster to access, although slower than local memory, and (4) global memory, shared by all work items. Data is only accessible after being transferred from the host, using functions such as clEnqueueReadBuffer and clEnqueueWriteBuffer that move data in and out of a device.

3.5 Discussion and Comparison

This section discusses the salient features of our surveyed languages. More specifically, we look at their design philosophy and the new concepts they introduce, how point-to-point synchronization is addressed in each of these languages, the various semantics of atomic sections and the data distribution issues. We end up summarizing the key features of all the languages covered in this chapter.

Design Paradigms

Our overview study, based on a single running example, namely the computation of the Mandelbrot set, is admittedly somewhat biased, since each language has been designed with a particular application framework in mind, which may or may not be well adapted to a given application. Cilk is well suited to deal with divide-and-conquer strategies, something not put into practice in our example. On the contrary, X10, Chapel and Habanero-Java


are high-level Partitioned Global Address Space languages that offer abstract notions such as places and locales, which were put to good use in our example. OpenCL is a low-level, verbose language that works across GPUs and CPUs; our example clearly illustrates that this approach is not providing much help here in terms of shrinking the semantic gap between specification and implementation. The OpenMP philosophy is to add compiler directives to parallelize parts of code on shared-memory machines; this helps programmers move incrementally from a sequential to a parallel implementation. While OpenMP suffers from portability issues, MPI tackles both shared and distributed memory systems, using a low-level support for communications.

New Concepts

Even though this thesis does not address data parallelism per se, note that Cilk is the only language that does not provide support for data parallelism; yet spawned threads can be used inside loops to simulate SIMD processing. Also, Cilk adds a facility to support speculative parallelism, enabling spawned task abort operations via the abort statement. Habanero-Java introduces the isolated statement to specify the weak atomicity property. Phasers, in Habanero-Java, and clocks, in X10, are new high-level constructs for collective and point-to-point synchronization between varying sets of threads.

Point-to-Point Synchronization

We illustrate the way the surveyed languages address the difficult issue of point-to-point synchronization via a simple example, a hide-and-seek children game, in Figure 3.10. X10 clocks or Habanero-Java phasers help express easily the different phases between threads. The notion of point-to-point synchronization cannot be expressed easily using OpenMP or Chapel4. We were not able to implement this game using Cilk high-level synchronization primitives, since sync, the only synchronization construct, is a local barrier for recursive tasks: it synchronizes only threads spawned in the current procedure, and thus not the two searcher and hider tasks. As mentioned above, this is not surprising, given Cilk's approach to parallelism.

Atomic Section

The semantics and implementations of the various proposals for dealing with atomicity are rather subtle.

Atomic operations, which apply to single instructions, can be efficiently implemented, e.g. in X10, using non-blocking techniques such as the atomic

4. Busy-waiting techniques can be applied.


// X10:
finish async {
   clock cl = clock.make();
   async clocked(cl) {
      count_to_a_number();
      next;
      start_searching();
   }
   async clocked(cl) {
      hide_oneself();
      next;
      continue_to_be_hidden();
   }
}

// Habanero-Java:
finish async {
   phaser ph = new phaser();
   async phased(ph) {
      count_to_a_number();
      next;
      start_searching();
   }
   async phased(ph) {
      hide_oneself();
      next;
      continue_to_be_hidden();
   }
}

/* Cilk: */
cilk void searcher() {
   count_to_a_number();
   point_to_point_sync();   /* missing */
   start_searching();
}
cilk void hidder() {
   hide_oneself();
   point_to_point_sync();   /* missing */
   continue_to_be_hidden();
}
void main() {
   spawn searcher();
   spawn hidder();
}

Figure 3.10: A hide-and-seek game (X10, HJ, Cilk)

instruction compare-and-swap. In OpenMP, the atomic directive can be made to work faster than the critical directive, when atomic operations are replaced with processor commands such as GLSC [71] instructions, or "gather-linked and scatter-conditional", which extend scatter-gather hardware to support advanced atomic memory operations; therefore, it is better to use this directive when protecting shared memory during elementary operations. Atomic operations can be used to update different elements of a data structure (arrays, records) in parallel, without using many explicit locks. In the example of Figure 3.11, the updates of different elements of Array x are allowed to occur in parallel. General atomic sections, on the other hand, serialize the execution of updates to elements via one lock.

With the weak atomicity model of Habanero-Java, the isolated keyword is used instead of atomic to make explicit the fact that the construct supports weak rather than strong isolation. In Figure 3.12, Threads 1 and 2 may access ptr simultaneously, since weakly atomic accesses are used; an atomic access to temp->next is not enforced.


#pragma omp parallel for shared(x, index, n)
for (i = 0; i < n; i++)
  #pragma omp atomic
  x[index[i]] += f(i);  // index is supposed injective

Figure 3.11: Example of an atomic directive in OpenMP

// Thread 1
ptr = head;           // non-isolated statement
isolated {
  ready = true;
}

// Thread 2
isolated {
  if (ready)
    temp->next = ptr;
}

Figure 3.12: Data race on ptr with Habanero-Java

Data Distribution

PGAS languages offer a compromise between the fine level of control of data placement provided by the message passing model and the simplicity of the shared memory model. However, the physical reality is that different PGAS portions, although logically distinct, may refer to the same physical processor and share physical memory. Practical performance might thus not be as good as expected.

Regarding the shared memory model, despite its simplicity of programming, programmers have scarce support for expressing data locality, which could help improve performance in many cases. Debugging is also difficult when data races or deadlocks occur.

Finally, the message passing memory model, where processors have no direct access to the memories of other processors, can be seen as the most general one, in which programmers can both specify data distribution and control locality. Shared memory (where there is only one processor managing the whole memory) and PGAS (where one assumes that each portion is located on a distinct processor) models can be seen as particular instances of the message passing model, when converting implicit write and read operations with explicit send/receive message passing constructs.
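To make this conversion concrete, consider a minimal sketch (ours, not taken from the running example; the rank numbering and variable names are illustrative): an implicit shared-memory read of a variable y owned by another process becomes an explicit message exchange when moving to MPI.

/* Shared memory model: any process may simply read y. */
x = y;

/* Message passing model: the owner of y (say, rank 1) sends it explicitly. */
if (rank == 1)
    MPI_Send(&y, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
else if (rank == 0) {
    MPI_Recv(&y, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &status);
    x = y;
}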

Summary Table

We collect in Table 3.1 the main characteristics of each language addressed in this chapter. Even though we have not discussed the issue of data parallelism in this chapter, we nonetheless provide the main constructs used in each language to launch data parallel computations.


Language | Task creation | Task join | Point-to-point | Atomic section | Data parallelism | Memory model
Cilk (MIT) | spawn | sync, abort | — | cilk_lock, cilk_unlock | — | Shared
Chapel (Cray) | begin, cobegin | sync | sync | sync, atomic | forall, coforall | PGAS (Locales)
X10 (IBM) | async, future | finish, force | next | atomic | foreach | PGAS (Places)
Habanero-Java (Rice) | async, future | finish, get | next | atomic, isolated | foreach | PGAS (Places)
OpenMP | omp task, omp section | omp taskwait, omp barrier | — | omp critical, omp atomic | omp for | Shared
OpenCL | EnqueueTask | Finish, EnqueueBarrier | events | atom_add | EnqueueNDRangeKernel | Message passing
MPI | MPI_spawn | MPI_Finalize, MPI_Barrier | — | — | MPI_Init | Message passing

Table 3.1: Summary of parallel languages constructs

3.6 Conclusion

Using the Mandelbrot set computation as a running example, this chapter presents an up-to-date comparative overview of seven parallel programming languages: Cilk, Chapel, X10, Habanero-Java, OpenMP, MPI and OpenCL. These languages are in current use, popular, offer rich and highly abstract functionalities and most support both data and task parallel execution models. The chapter describes how, in addition to data distribution and locality, the fundamentals of task parallel programming, namely task creation, collective and point-to-point synchronization and mutual exclusion, are dealt with in these languages.

This study serves as the basis for the design of SPIRE, a sequential to parallel intermediate representation extension that we use to upgrade the intermediate representations of the PIPS [19] source-to-source compilation framework to represent task concepts in parallel languages. SPIRE is presented in the next chapter.

An earlier version of the work presented in this chapter was published in [66].

Chapter 4

SPIRE: A Generic Sequential to Parallel Intermediate Representation Extension Methodology

I was working on the proof of one of my poems all the morning, and took out a comma. In the afternoon I put it back again. Oscar Wilde

The research goal addressed in this thesis is the design of efficient automatic parallelization algorithms that use as a backend a uniform, simple and generic parallel intermediate core language. Instead of designing from scratch one such particular intermediate language, we take an indirect, more general route. We use the survey presented in Chapter 3 to design SPIRE, a new methodology for the design of parallel extensions of the intermediate representations used in compilation frameworks of sequential languages. It can be used to leverage existing infrastructures for sequential languages to address both control and data parallel constructs, while preserving as much as possible existing analyses for sequential code. In this chapter, we suggest to view this upgrade process as an "intermediate representation transformer" at the syntactic and semantic levels; we show this can be done via the introduction of only ten new concepts, collected in three groups, namely execution, synchronization and data distribution, precisely defined via a formal semantics and rewriting rules.

We use the sequential intermediate representation of PIPS, a comprehensive source-to-source compilation platform, as a use case for our approach of the definition of parallel intermediate languages. We introduce our SPIRE parallel primitives, extend PIPS intermediate representation and show how example code snippets from the OpenCL, Cilk, OpenMP, X10, Habanero-Java, MPI and Chapel parallel programming languages can be represented this way. A formal definition of SPIRE operational semantics is provided, built on top of the one used for the sequential intermediate representation. We assess the generality of our proposal by showing how a different sequential IR, namely LLVM, can be extended to handle parallelism using the SPIRE methodology.

Our primary goal with the development of SPIRE is to provide, at a low cost, powerful parallel program representations that ease the design of efficient automatic parallelization algorithms. More precisely, in this chapter, our intent is to describe SPIRE as a uniform, simple and generic core language transformer, and apply it to PIPS IR. We illustrate the applicability of the resulting parallel IR by using it as parallel code target in Chapter 7, where we also address code generation issues.

4.1 Introduction

The growing importance of parallel computers and the search for efficient programming models led, and is still leading, to the proliferation of parallel programming languages such as, currently, Cilk [103], Chapel [6], X10 [5], Habanero-Java [30], OpenMP [4], OpenCL [69] or MPI [3], presented in Chapter 3. Our automatic task parallelization algorithm, which we integrate within PIPS [58], a comprehensive source-to-source compilation and optimization platform, should help in the adaptation of PIPS to such an evolution. To reach this goal, we introduce an internal intermediate representation (IR) to represent parallel programs into PIPS. The choice of a proper parallel IR is of key importance, since the efficiency and power of the transformations and optimizations the compiler PIPS can perform are closely related to the selection of a proper program representation paradigm. This seems to suggest the introduction of a dedicated IR for each parallel programming paradigm. Yet, it would be better, from a software engineering point of view, to find a unique parallel IR, as general and simple as possible.

Existing proposals for program representation techniques already provide a basis for the exploitation of parallelism via the encoding of control and/or data flow information. HPIR [111], PLASMA [89] or InsPIRe [57] are instances that operate at a high abstraction level, while the hierarchical task, stream or program dependence graphs (we survey these notions in Section 4.5) are better suited to graph-based approaches. Unfortunately, many existing compiler frameworks such as PIPS use traditional representations for sequential-only programs, and changing their internal data structures to deal with parallel constructs is a difficult and time-consuming task.

In this thesis, we choose to develop a methodology for the design of parallel extensions of the intermediate representations used in compilation frameworks of sequential languages in general, instead of developing a parallel intermediate representation for PIPS only. The main motivation behind the design of the methodology introduced in this chapter is to preserve the many years of development efforts invested in huge compiler platforms such as GCC (more than 7 million lines of code), PIPS (600,000 lines of code), LLVM (more than 1 million lines of code), when upgrading their intermediate representations to handle parallel languages, as source languages or as targets for source-to-source transformations. We provide an evolutionary path for these large software developments via the introduction of the Sequential to Parallel Intermediate Representation Extension (SPIRE) methodology. We show that it can be plugged into existing compilers in a rather simple manner. SPIRE is based on only three key concepts: (1) the parallel vs. sequential execution of groups of statements such as sequences, loops and general control-flow graphs, (2) the global synchronization characteristics of statements and the specification of finer grain synchronization via the notion of events, and (3) the handling of data distribution for different memory models. To illustrate how this approach can be used in practice, we use SPIRE to extend the intermediate representation (IR) [32] of PIPS.

The design of SPIRE is the result of many trade-offs between generality and precision, abstraction and low-level concerns. On the one hand, and in particular when looking at source-to-source optimizing compiler platforms adapted to multiple source languages, one needs to (1) represent as many of the existing (and hopefully future) parallel constructs, (2) minimize the number of additional built-in functions that hamper compilers' optimization passes, and (3) preserve high-level structured parallel constructs to keep their properties, such as being deadlock-free, while minimizing the number of new concepts introduced in the parallel IR. On the other hand, keeping only a limited number of hardware-level notions in the IR, while good enough to deal with all parallel constructs, would entail convoluted rewritings of high-level parallel flows. We used an extensive survey of key parallel languages, namely Cilk, Chapel, X10, Habanero-Java, OpenMP, OpenCL and MPI, to guide our design of SPIRE, while showing how to express their relevant parallel constructs within SPIRE.

The remainder of this chapter is structured as follows. Our parallel extension proposal, SPIRE, is introduced in Section 4.2, where we illustrate its application to PIPS sequential IR, presented in Section 2.6.5; we also show how simple illustrative examples written in OpenCL, Cilk, OpenMP, X10, Habanero-Java, MPI and Chapel can be easily represented within SPIRE(PIPS IR)¹. The formal operational semantics of SPIRE is given in Section 4.3. Section 4.4 shows the generality of the SPIRE methodology by showing its use on LLVM. We survey existing parallel IRs in Section 4.5. We conclude in Section 4.6.

¹ SPIRE(PIPS IR) means that the function of transformation SPIRE is applied on the sequential IR of PIPS; it yields its parallel IR.

4.2 SPIRE: a Sequential to Parallel IR Extension

In this section, we present in detail the SPIRE methodology, which is used to add parallel concepts to sequential IRs. After introducing our design philosophy, we describe the application of SPIRE on the PIPS IR presented in Section 2.6.5. We illustrate these SPIRE-derived constructs with code excerpts from various parallel programming languages; our intent is not to provide here general rewriting techniques from these to SPIRE, but to provide hints on how such rewritings might possibly proceed. Note that, in Section 4.4, using LLVM, we show that our methodology is general enough to be adapted to other IRs.

4.2.1 Design Approach

SPIRE intends to be a practical methodology to extend existing sequential IRs to parallelism constructs, either to generate parallel code from sequential programs or compile explicitly parallel programming languages. Interestingly, the idea of seeing the issue of parallelism as an extension over sequential concepts is in sync with Dijkstra's view that "parallelism or concurrency are operational concepts that refer not to the program, but to its execution" [40]. If one accepts such a vision, adding parallelism extensions to existing IRs, as advocated by our approach with SPIRE, can thus, at a fundamental level, not be seen as an afterthought but as a consequence of the fundamental nature of parallelism.

Our design of SPIRE does not intend to be minimalist, but to be as seamlessly as possible integrable within actual IRs and general enough to handle as many parallel programming constructs as possible. To be successful, our design point must provide proper trade-offs between generality, expressibility and conciseness of representation. We used an extensive survey of existing parallel languages to guide us during this design process. Table 4.1, which extends the one provided in Chapter 3, summarizes the main characteristics of seven recent and widely used parallel languages: Cilk, Chapel, X10, Habanero-Java, OpenMP, OpenCL and MPI. The main constructs used in each language to launch task and data parallel computations, perform synchronization, introduce atomic sections and transfer data in the various memory models are listed. Our main finding from this analysis is that, to be able to deal with parallel programming, one simply needs to add to a given sequential IR the ability to specify (1) the parallel execution mechanism of groups of statements, (2) the synchronization behavior of statements, and (3) the layout of data, i.e., how memory is modeled in the parallel language.

The last line of Table 4.1 summarizes the approach we propose to map these programming concepts to our parallel intermediate representation extension. SPIRE is based on the introduction of only ten key notions, collected in three groups:

• execution, via the sequential and parallel constructs;

• synchronization, via the spawn, barrier, atomic, single, signal and wait constructs;


Language | Parallelism | Task creation | Task join | Point-to-point | Atomic section | Memory model | Data distribution
Cilk (MIT) | — | spawn | sync | — | cilk_lock | Shared | —
Chapel (Cray) | forall, coforall, cobegin | begin | sync | sync | sync, atomic | PGAS (Locales) | (on)
X10 (IBM) | foreach | async, future | finish, force | next | atomic | PGAS (Places) | (at)
Habanero-Java (Rice) | foreach | async, future | finish, get | next | atomic, isolated | PGAS (Places) | (at)
OpenMP | omp for, omp sections | omp task, omp section | omp taskwait, omp barrier | — | omp critical, omp atomic | Shared | private, shared
OpenCL | EnqueueNDRangeKernel | EnqueueTask | Finish, EnqueueBarrier | events | atom_add | Distributed | ReadBuffer, WriteBuffer
MPI | MPI_Init | MPI_spawn | MPI_Finalize, MPI_Barrier | — | — | Distributed | MPI_Send, MPI_Recv
SPIRE | sequential, parallel | spawn | barrier | signal, wait | atomic | Shared, Distributed | send, recv

Table 4.1: Mapping of SPIRE to parallel languages constructs (terms in parentheses are not currently handled by SPIRE)

• data distribution, via send and recv constructs.

Small code snippets are provided below to sketch how the key constructs of these parallel languages can be encoded in practice within a SPIRE-extended parallel IR.

4.2.2 Execution

The issue of parallel vs. sequential execution appears when dealing with groups of statements, which in our case study correspond to members of the forloop, sequence and unstructured sets presented in Section 2.6.5. To apply SPIRE to PIPS sequential IR, an execution attribute is added to these sequential set definitions:

forloop'      = forloop x execution;
sequence'     = sequence x execution;
unstructured' = unstructured x execution;

The primed sets forloop' (expressing data parallelism) and sequence' and unstructured' (implementing control parallelism) represent SPIREd-up sets for the PIPS parallel IR. Of course, the 'prime' notation is used here for pedagogical purpose only; in practice, an execution field is added in the existing IR representation. The definition of execution is straightforward:


execution = sequential:unit + parallel:unit;

where unit denotes a set with one single element; this encodes a simple enumeration of cases for execution. A parallel execution attribute asks for all loop iterations, sequence statements and control nodes of unstructured instructions to be run concurrently. Note that there is no implicit barrier after the termination of these parallel statements; in Section 4.2.3, we introduce the annotation barrier, which can be used to add an explicit barrier on these parallel statements, if need be.
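To illustrate the role of this barrier annotation, here is a minimal sketch of ours, written in the SPIRE core-language notation used throughout this chapter (S1 and S2 are placeholder statements):

// Parallel sequence: S1 and S2 may run concurrently, and no implicit
// barrier follows their termination.
sequence(S1, S2, parallel)

// The same parallel sequence wrapped with the barrier synchronization
// attribute (Section 4.2.3): execution resumes only once S1 and S2 are done.
barrier(sequence(S1, S2, parallel))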

For instance, a parallel execution construct can be used to represent data parallelism on GPUs, when expressed via the OpenCL clEnqueueNDRangeKernel function (see the top of Figure 4.1). This function call could be encoded within PIPS parallel IR as a parallel loop (see the bottom of Figure 4.1), each iteration executing the kernel function as a separate task, receiving the proper index value as an argument.

// Execute 'n' kernels in parallel
global_work_size[0] = n;
err = clEnqueueNDRangeKernel(cmd_queue, kernel,
                             1, NULL, global_work_size,
                             NULL, 0, NULL, NULL);

forloop(I, 1,
        global_work_size,
        1,
        kernel(),
        parallel)

Figure 4.1: OpenCL example illustrating a parallel loop

Another example, in the left side of Figure 4.2, from Chapel, illustrates its forall data parallelism construct, which will be encoded with a SPIRE parallel loop.

forall i in 1..n do
  t[i] = 0;

forloop(i, 1, n, 1,
        t[i] = 0, parallel)

Figure 4.2: forall in Chapel and its SPIRE core language representation

The main difference between a parallel sequence and a parallel unstructured is that, in a parallel unstructured, synchronizations are explicit, using arcs that represent dependences between statements. An example of an unstructured code is illustrated in Figure 4.3, where the graph in the right side is a parallel unstructured representation of the code in the left side. Nodes are statements and edges are data dependences between them; for example, the statement u=x+1 cannot be launched before the statement x=3 is terminated. Figure 4.4, from OpenMP, illustrates its parallel sections task parallelism construct, which will be encoded with a SPIRE parallel sequence. We can say that a parallel sequence is only suitable for fork-join parallelism fashion, whereas any other unstructured form of parallelism can be encoded using the control flow graph (unstructured) set.

x = 3;      // statement 1
y = 5;      // statement 2
z = x + y;  // statement 3
u = x + 1;  // statement 4

[Parallel unstructured control flow graph: an entry node (nop) precedes the nodes x=3 and y=5; data dependence arcs go from x=3 to z=x+y and to u=x+1, and from y=5 to z=x+y; all nodes reach the exit node (nop).]

Figure 4.3: A C code and its unstructured parallel control flow graph representation

#pragma omp sections nowait
{
  #pragma omp section
  x = foo();
  #pragma omp section
  y = bar();
}
z = baz(x, y);

sequence(
  sequence(x = foo(); y = bar(),
           parallel),
  z = baz(x, y),
  sequential)

Figure 4.4: parallel sections in OpenMP and its SPIRE core language representation

Representing the presence of parallelism using only the annotation of parallel constitutes a voluntary trade-off between conciseness and expressibility of representation. For example, representing the code of OpenCL in Figure 4.1 in the way we suggest may induce a loss of precision regarding the use of queues of tasks adopted in OpenCL and lead to errors; great care must be taken when analyzing programs to avoid errors in translation between OpenCL and SPIRE, such as the scheduling of these tasks.


4.2.3 Synchronization

The issue of synchronization is a characteristic feature of the run-time behavior of one statement with respect to other statements. In parallel code, one usually distinguishes between two types of synchronization: (1) collective synchronization between threads using barriers, and (2) point-to-point synchronization between participating threads. We suggest this can be done in two parts.

Collective Synchronization

The first possibility is via the new synchronization attribute, which can be added to the specification of statements. SPIRE extends sequential intermediate representations in a straightforward way, by adding a synchronization attribute to the specification of statements:

statement' = statement x synchronization;

Coordination by synchronization in parallel programs is often dealt with via coding patterns such as barriers, used for instance when a code fragment contains many phases of parallel execution, where each phase should wait for the precedent ones to proceed. We define the synchronization set via high-level coordination characteristics, useful for optimization purposes:

synchronization = none:unit + spawn:entity + barrier:unit +
                  single:bool + atomic:reference;

where S is the statement with the synchronization attribute:

• none specifies the default behavior, i.e., independent with respect to other statements, for S;

• spawn induces the creation of an asynchronous task S, while the value of the corresponding entity is the user-chosen number of the thread that executes S. SPIRE thus, in addition to decorating parallel tasks using the parallel execution attribute to handle coarse grain parallelism, provides the possibility of specifying each task via the spawn synchronization attribute, in order to enable the encoding of finer grain level of parallelism;

• barrier specifies that all the child threads spawned by the execution of S are suspended before exiting until they are all finished – an OpenCL example illustrating spawn, to encode the clEnqueueTask instruction, and barrier, to encode the clEnqueueBarrier instruction, is provided in Figure 4.5;


mode = OUT_OF_ORDER_EXEC_MODE_ENABLE;
commands = clCreateCommandQueue(context,
                                device_id, mode, &err);
clEnqueueTask(commands, kernel_A, 0, NULL, NULL);
clEnqueueTask(commands, kernel_B, 0, NULL, NULL);
// synchronize so that Kernel C starts only
// after Kernels A and B have finished
clEnqueueBarrier(commands);
clEnqueueTask(commands, kernel_C, 0, NULL, NULL);

barrier(
  spawn(zero, kernel_A()),
  spawn(one, kernel_B())
)
spawn(zero, kernel_C())

Figure 4.5: OpenCL example illustrating spawn and barrier statements

• single ensures that S is executed by only one thread in its thread team (a thread team is the set of all the threads spawned within the innermost parallel forloop statement), and a barrier exists at the end of a single operation if its synchronization single value is true;

• atomic predicates the execution of S via the acquisition of a lock to ensure exclusive access; at any given time, S can be executed by only one thread. Locks are logical memory addresses, represented here by a member of the PIPS IR reference set presented in Section 2.6.5. An example illustrating how an atomic synchronization on the reference l in a SPIRE statement accessing Array x can be translated in Cilk (via Cilk_lock and Cilk_unlock) and OpenMP (critical) is provided in Figure 4.6.

Cilk_lockvar l;
Cilk_lock_init(l);
Cilk_lock(l);
x[index[i]] += f(i);
Cilk_unlock(l);

#pragma omp critical
x[index[i]] += f(i);

Figure 4.6: Cilk and OpenMP examples illustrating an atomically-synchronized statement
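For completeness, a single-annotated statement corresponds, for instance, to an OpenMP single construct; the sketch below is ours (the wrapper notation single(...) and the bound number_of_threads simply mirror, by analogy, the barrier(...) and atomic(...) wrappers used in this chapter, and x = init() is an arbitrary statement):

#pragma omp parallel
{
  #pragma omp single
  x = init();   // executed by one thread of the team, implicit barrier after
}

forloop(i, 1, number_of_threads, 1,
        single(x = init()),   // single with value true, hence with a barrier
        parallel)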


Event API: Point-to-Point Synchronization

The second possibility addresses the case of point-to-point synchronization between participating threads. Handling point-to-point synchronization using decorations on abstract syntax trees is too constraining when one has to deal with a varying set of threads that may belong to different parallel parent nodes. Thus, SPIRE suggests to deal with this last class of coordination using a new class of values, of the event type. SPIRE extends the underlying type system of the existing sequential IRs with a new basic type, namely event:

type' = type + event:unit;

Values of type event are counters, in a manner reminiscent of semaphores. The programming interface for events is defined by the following atomic functions:

• event newEvent(int i) is the creation function of events, initialized with the integer i that specifies how many threads can execute wait on this event without being blocked;

• void freeEvent(event e) is the deallocation function of the event e;

• void signal(event e) increments by one the event value² of e;

• void wait(event e) blocks the thread that calls it until the value of e is strictly greater than 0. When the thread is released, this value is decremented by one.

² The void return type will be replaced by int in practice, to enable the handling of error values.

In a first example of possible use of this event API, the construct future used in X10 (see Figure 4.7) can be seen as the spawning of the computation of foo(). The end result is obtained via the call to the force method; such a mechanism can be easily implemented in SPIRE using an event attached to the running task: it is signaled when the task is completed and waited by the force method.

future<int> Fi = future {foo();};
int i = Fi.force();

Figure 4.7: X10 example illustrating a future task and its synchronization
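A possible encoding of this pattern in the SPIRE core language, using the event API above, is sketched below (this is our illustration, not part of SPIRE itself; Fi_done and tmp are hypothetical names, and one is an arbitrary thread number):

Fi_done = newEvent(0);       // no wait can proceed before a signal
spawn(one,
      tmp = foo();           // the future's computation
      signal(Fi_done));
// ... the spawning thread may do other work here ...
wait(Fi_done);               // force(): block until foo() has completed
i = tmp;
freeEvent(Fi_done)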

A second example, taken from Habanero-Java, illustrates how point-to-point synchronization primitives such as phasers and the next statement can be dealt with using the Event API (see Figure 4.8, left). The async phased keyword can be replaced by spawn. In this example, the next statement is equivalent to the following sequence:

signal(ph);
wait(ph);
signal(ph)

where the event ph is supposed initialized to newEvent(-(n-1)); the second signal is used to resume the suspended tasks in a chain-like fashion.

finish {
  phaser ph = new phaser();
  for (j = 1; j <= n; j++)
    async phased(ph<SIG_WAIT>) {
      S;
      next;
      S';
    }
}

barrier(
  ph = newEvent(-(n-1));
  j = 1;
  loop(j <= n,
       spawn(j,
             S;
             signal(ph);
             wait(ph);
             signal(ph);
             S');
       j = j+1);
  freeEvent(ph)
)

Figure 4.8: A phaser in Habanero-Java and its SPIRE core language representation

In order to illustrate the importance of the constructs implementing point-to-point synchronization, and thus the necessity to handle them, we show the code in Figure 4.9, extracted from [97], where phasers can be used to implement one of the main types of parallelism, namely pipeline parallelism (see Section 2.2).

finish {
  phaser[] ph = new phaser[m+1];
  for (int i = 1; i < m; i++)
    async phased(ph[i]<SIG>, ph[i-1]<WAIT>) {
      for (int j = 1; j < n; j++) {
        a[i][j] = foo(a[i][j], a[i][j-1], a[i-1][j-1]);
        next;
      }
    }
}

Figure 4.9: Example of Pipeline Parallelism with phasers


Representing high-level point-to-point synchronization such as phasers and futures using low-level functions of event type constitutes a trade-off between generality (we can represent any type of synchronization via the low-level interface using events) and expressibility of representation (we lose some of the expressiveness and precision of these high-level constructs, such as the deadlock-freedom safety property of phasers [100]). Indeed, a Habanero-Java parallel code avoids deadlock since the scope of each used phaser is the immediately enclosing finish. However, the scope of the events is less restrictive: the user can put them wherever he wants, and this may lead to the appearance of deadlocks.

4.2.4 Data Distribution

The choice of a proper memory model to express parallel programs is an important issue when designing a generic intermediate representation. There are usually two main approaches to memory modeling: shared and message passing models. Since SPIRE is designed to extend existing IRs for sequential languages, it can be straightforwardly seen as using a shared memory model when parallel constructs are added. By convention, we say that spawn creates processes in the case of message passing memory models, and threads in the other case.

In order to take into account the explicit distribution required by the message passing memory model used in parallel languages such as MPI, SPIRE introduces the send and recv blocking functions for implementing communication between processes:

• void send(int dest, entity buf) transfers the value in Entity buf to the process numbered dest;

• void recv(int source, entity buf) receives in buf the value sent by Process source.

The MPI example in Figure 4.10 can be represented in SPIRE as a parallel loop with index rank of size iterations, whose body is the MPI code from MPI_Comm_size to MPI_Finalize. The communication of Variable sum from Process 1 to Process 0 can be handled with SPIRE send/recv functions.

Note that non-blocking communications can be easily implemented in SPIRE using the above primitives within spawned statements (see Figures 4.11 and 4.12).

A non-blocking receive should indicate the completion of the communication. We may use, for instance, finish_recv (see Figure 4.12), which can be accessed later by the receiver to test the status of the communication.

Also, broadcast collective communications, such as defined in MPI, can be represented via wrappers around send and recv operations. When the


MPI_Init(&argc, &argv[]);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if (rank == 0) {
  MPI_Recv(sum, sizeof(sum), MPI_FLOAT, 1, 1,
           MPI_COMM_WORLD, &stat);
} else if (rank == 1) {
  sum = 42;
  MPI_Send(sum, sizeof(sum), MPI_FLOAT, 0, 1,
           MPI_COMM_WORLD);
}
MPI_Finalize();

forloop(rank, 0, size, 1,
        if(rank==0,
           recv(one, sum),
           if(rank==1,
              sum = 42;
              send(zero, sum),
              nop)),
        parallel)

Figure 4.10: MPI example illustrating a communication, and its SPIRE core language representation

spawn(new, send(zero, sum))

Figure 4.11: SPIRE core language representation of a non-blocking send

atomic(finish_recv = false);
spawn(new,
      recv(one, sum);
      atomic(finish_recv = true)
)

Figure 4.12: SPIRE core language representation of a non-blocking receive

master process and receiver processes want to perform a broadcast operation, then, if this process is the master, its broadcast operation is equivalent to a loop over receivers, with a call to send as body; otherwise (receiver), the broadcast is a recv function. Figure 4.13 illustrates the SPIRE representation of a broadcast collective communication of the value of sum.

Another interesting memory model for parallel programming has been introduced somewhat recently: the Partitioned Global Address Space [110].


if(rank==0,
   forloop(rank, 1, size, 1, send(rank, sum),
           parallel),
   recv(zero, sum))

Figure 4.13: SPIRE core language representation of a broadcast

The uses of the PGAS memory model in languages such as UPC [33], Habanero-Java, X10 and Chapel introduce various notions such as Place or Locale to label portions of a logically-shared memory that threads may access, in addition to complex APIs for distributing objects/arrays over these portions. Given the wide variety of current proposals, we leave the issue of integrating the PGAS model within the general methodology of SPIRE as future work.

4.3 SPIRE Operational Semantics

The purpose of the formal definition given in this section is to provide a solid basis for program analyses and transformations. It is a systematic way to specify our IR extension mechanism, something seldom present in IR definitions. It also illustrates how SPIRE leverages the syntactic and semantic level of sequential constructs to parallel ones, preserving the traits of the sequential operational semantics.

Fundamentally, at the syntactic and semantic levels, SPIRE is a methodology for expressing representation transformers, mapping the definition of a sequential language IR to a parallel version. We define the operational semantics of SPIRE in a two-step fashion: we introduce (1) a minimal core parallel language that we use to model fundamental SPIRE concepts, and for which we provide a small-step operational semantics, and (2) rewriting rules that translate the more complex constructs of SPIRE into this core language.

4.3.1 Sequential Core Language

Illustrating the transformations induced by SPIRE requires the definition of a sequential IR basis, as is done above via PIPS IR. Since we focus here on the fundamentals, we use as core language a simpler, minimal sequential language, Stmt. Its syntax is given in Figure 4.14, where we assume that the sets Ide of identifiers I and Exp of expressions E are given.

Sequential statements are: (1) nop for no operation, (2) I=E for an assignment of E to I, (3) S1;S2 for a sequence and (4) loop(E,S) for a while loop.


S ∈ Stmt =
    nop | I=E | S1;S2 | loop(E,S)

S ∈ SPIRE(Stmt) =
    nop | I=E | S1;S2 | loop(E,S) |
    spawn(I,S) |
    barrier(S) | barrier_wait(n) |
    wait(I) | signal(I) |
    send(I,I′) | recv(I,I′)

Figure 4.14: Stmt and SPIRE(Stmt) syntaxes

At the semantic level, a statement in Stmt is a very simple memory transformer. A memory m ∈ Memory is a mapping in Ide → Value, where values v ∈ Value = N + Bool can either be integers n ∈ N or booleans b ∈ Bool. The sequential operational semantics for Stmt, expressed as transition rules over configurations κ ∈ Configuration = Memory × Stmt, is given in Figure 4.15; we assume that the program is syntax- and type-correct. A transition (m, S) → (m′, S′) means that executing the statement S in a memory m yields a new memory m′ and a new statement S′; we posit that the "→" relation is transitive. Rules 4.1 to 4.5 encode typical sequential small-step operational semantic rules for the sequential part of the core language. We assume that ζ ∈ Exp → Memory → Value is the usual function for expression evaluation.

         v = ζ(E)m
  ──────────────────────────────                    (4.1)
  (m, I=E) → (m[I ↦ v], nop)

  (m, nop; S) → (m, S)                              (4.2)

      (m, S1) → (m′, S′1)
  ──────────────────────────────                    (4.3)
  (m, S1; S2) → (m′, S′1; S2)

           ζ(E)m
  ─────────────────────────────────────             (4.4)
  (m, loop(E,S)) → (m, S; loop(E,S))

          ¬ζ(E)m
  ──────────────────────────────                    (4.5)
  (m, loop(E,S)) → (m, nop)

Figure 4.15: Stmt sequential transition rules


The semantics of a whole sequential program S is defined as the memory m such that (⊥, S) → (m, nop), if the execution of S terminates.
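For instance (a short illustrative trace of our own), for the program x=1; loop(x<=2, x=x+1), these rules yield:

(⊥, x=1; loop(x<=2, x=x+1))
  → ([x ↦ 1], nop; loop(x<=2, x=x+1))      (Rules 4.1 and 4.3)
  → ([x ↦ 1], loop(x<=2, x=x+1))           (Rule 4.2)
  → ([x ↦ 1], x=x+1; loop(x<=2, x=x+1))    (Rule 4.4, since ζ(x<=2)[x ↦ 1] holds)
  → ([x ↦ 2], loop(x<=2, x=x+1))           (Rules 4.1, 4.3 and 4.2)
  → ([x ↦ 2], x=x+1; loop(x<=2, x=x+1))    (Rule 4.4)
  → ([x ↦ 3], loop(x<=2, x=x+1))           (Rules 4.1, 4.3 and 4.2)
  → ([x ↦ 3], nop)                         (Rule 4.5)

and the semantics of this program is thus the memory [x ↦ 3].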

4.3.2 SPIRE as a Language Transformer

Syntax

At the syntactic level, SPIRE specifies how a grammar for a sequential language such as Stmt is transformed, i.e., extended with synchronized parallel statements. The grammar of SPIRE(Stmt) in Figure 4.14 adds to the sequential statements of Stmt (from now on synchronized using the default none) new parallel statements: a task creation spawn, a termination barrier, and two wait and signal operations on events, or send and recv operations for communication. Synchronizations single and atomic are defined via rewriting (see Subsection 4.3.3). The statement barrier_wait(n), added here for specifying the multiple-step behavior of the barrier statement in the semantics, is not accessible to the programmer. Figure 4.8 provides the SPIRE representation of a program example.

Semantic Domains

SPIRE is an intermediate representation transformer. As it extends grammars, it also extends semantics. The set of values manipulated by SPIRE(Stmt) statements extends the sequential Value domain with events e ∈ Event = N, which encode events' current values; we posit that ζ(newEvent(E))m = ζ(E)m.

Parallelism is managed in SPIRE via processes (or threads). We introduce control state functions π ∈ State = Proc → Configuration × Procs to keep track of the whole computation, mapping each process i ∈ Proc = N to its current configuration (i.e., the statement it executes and its own view of memory m) and the set c ∈ Procs = P(Proc) of the process children it has spawned during its execution.

In the following, we note domain(π) = {i ∈ Proc | π(i) is defined} the set of currently running processes, and π[i ↦ (κ, c)] the state π extended at i with (κ, c). A process is said to be finished if and only if all its children processes in c are also finished, i.e., when only nop is left to execute: finished(π, c) = (∀i ∈ c, ∃ci ∈ Procs, ∃mi ∈ Memory, π(i) = ((mi, nop), ci) ∧ finished(π, ci)).

Memory Models

The memory model of sequential languages is a unique address space for identifier values. In our parallel extension, a configuration for a given process or thread includes its view of memory. We suggest to use the same semantic rules, detailed below, to deal with both shared and message passing memory rules. The distinction between these models, beside the additional use of send/receive constructs in the message passing model versus events in the shared one, is included in SPIRE via constraints we impose on the control states π used in computations. Namely, we introduce the variable pids ∈ Ide, such that m(pids) ∈ P(Ide), which is used to harvest the process identities, and we posit that, in the shared memory model, for all threads i and i′ with π(i) = ((m, S), c) and π(i′) = ((m′, S′), c′), one has ∀I ∈ (domain(m) ∩ domain(m′)) − (m(pids) ∪ m′(pids)), m(I) = m′(I). No such constraint is needed for the message passing model. Regarding the notion of memory equality, note that the issue of private variables in threads would have to be introduced in full-fledged languages. As mentioned above, PGAS is left for future work; some sort of constraints based on the characteristics of the address space partitioning for places/locales would have to be introduced.

Semantic Rules

At the semantic level, SPIRE is thus a transition system transformer, mapping rules such as the ones in Figure 4.15 to the parallel synchronized transition rules in Figure 4.16. A transition π[i ↦ ((m, S), c)] ⇒ π′[i ↦ ((m′, S′), c′)] means that the i-th process, when executing S in a memory m, yields a new memory m′ and a new control state π′[i ↦ ((m′, S′), c′)] in which this process now will execute S′; additional children processes may have been created in c′ compared to c. We posit that the "⇒" relation is transitive.

Rule 4.6 is a key rule to specify SPIRE transformer behavior, providing a bridge between the sequential and the SPIRE-extended parallel semantics: all processes can non-deterministically proceed along their sequential semantics "→", leading to valid evaluation steps along the parallel semantics "⇒". The interleaving between parallel processes in SPIRE(Stmt) is a consequence of (1) the non-deterministic choice of the value of i within domain(π) when selecting the transition to perform and (2) the number of steps executed by the sequential semantics. Note that one might want to add even more non-determinism in our semantics; indeed, Rule 4.1 is atomic: loading the variables in E and performing the store operation to I are performed in one sequential step. To simplify this presentation, we do not provide the simple intermediate steps in the sequential evaluation semantics of Rule 4.1 that would have removed this artificial atomicity.

The remaining rules focus on parallel evaluation. In Rule 4.7, if Process n does not already exist, spawn adds to the state a new process n that inherits the parent memory m, in a fork-like manner; the set of processes spawned by n is initially equal to ∅. Otherwise, n keeps its memory m_n and its set of spawned processes c_n. In both cases, the value of pids in the memory of n is updated by the identifier I; n executes S and is added to the set of processes c spawned by i. Rule 4.8 implements a rendezvous: a


                   κ → κ′
  ──────────────────────────────────────────                  (4.6)
  π[i ↦ (κ, c)] ⇒ π[i ↦ (κ′, c)]

  n = ζ(I)m
  ((m_n, S_n), c_n) = π(n) if n ∈ domain(π), ((m, S), ∅) otherwise
  m′_n = m_n[pids ↦ m_n(pids) ∪ {I}]
  ──────────────────────────────────────────────────────────  (4.7)
  π[i ↦ ((m, spawn(I,S)), c)] ⇒
      π[i ↦ ((m, nop), c ∪ {n})][n ↦ ((m′_n, S), c_n)]

  n ∉ domain(π) ∪ {i}
  ──────────────────────────────────────────────────────────  (4.8)
  π[i ↦ ((m, barrier(S)), c)] ⇒
      π[i ↦ ((m, barrier_wait(n)), c)][n ↦ ((m, S), ∅)]

  finished(π, {n})    π(n) = ((m′, nop), c′)
  ──────────────────────────────────────────────────────────  (4.9)
  π[i ↦ ((m, barrier_wait(n)), c)] ⇒ π[i ↦ ((m′, nop), c)]

  n = ζ(I)m    n > 0
  ──────────────────────────────────────────────────────────  (4.10)
  π[i ↦ ((m, wait(I)), c)] ⇒ π[i ↦ ((m[I ↦ n−1], nop), c)]

  n = ζ(I)m
  ──────────────────────────────────────────────────────────  (4.11)
  π[i ↦ ((m, signal(I)), c)] ⇒ π[i ↦ ((m[I ↦ n+1], nop), c)]

  p′ = ζ(P′)m    p = ζ(P)m′
  ──────────────────────────────────────────────────────────  (4.12)
  π[p ↦ ((m, send(P′,I)), c)][p′ ↦ ((m′, recv(P,I′)), c′)] ⇒
      π[p ↦ ((m, nop), c)][p′ ↦ ((m′[I′ ↦ m(I)], nop), c′)]

Figure 4.16: SPIRE(Stmt) synchronized transition rules

new process n executes S, while Process i is suspended as long as finished is not true; indeed, Rule 4.9 resumes execution of Process i when all the child processes spawned by n have finished. In Rules 4.10 and 4.11, I is an event, that is a counting variable used to control access to a resource or to perform a point-to-point synchronization, initialized via newEvent to a value equal to the number of processes that will be granted access to it. Its current value n is decremented every time a wait(I) statement is executed, and when π(I) = n with n > 0, the resource can be used or the barrier can be crossed. In Rule 4.11, the current value n of I is incremented; this is a non-blocking operation. In Rule 4.12, p and p′ are two processes that communicate: p sends the datum I to p′, while this latter consumes it in I′.

The semantics of a whole parallel program S is defined as the set of memories m such that ⊥[0 ↦ ((⊥, S), ∅)] ⇒ π[0 ↦ ((m, nop), c)], if S terminates.

4.3.3 Rewriting Rules

The SPIRE concepts not dealt with in the previous section are defined via their rewriting into the core language. This is the case for both the treatment of the execution attribute and the remaining coarse-grain synchronization constructs.

Execution

A parallel sequence of statements S1 and S2 is a pair of independent substatements executed simultaneously by spawned processes I1 and I2, respectively, i.e., is equivalent to:

spawn(I1, S1); spawn(I2, S2)
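Applied, for instance, to the parallel sequence of Figure 4.4 (our own instantiation of this rule, with arbitrary process numbers one and two), this yields:

spawn(one, x = foo());
spawn(two, y = bar());
z = baz(x, y)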

A parallel forloop (see Figure 4.2), with index I, lower expression low, upper expression up, step expression step and body S, is equivalent to:

I = low; loop(I <= up, spawn(I, S); I = I+step)
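For example (our own instantiation of this rule), the parallel loop of Figure 4.2, forloop(i, 1, n, 1, t[i] = 0, parallel), is rewritten as:

i = 1;
loop(i <= n,
     spawn(i, t[i] = 0);    // one spawned task per iteration, numbered by i
     i = i+1)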

A parallel unstructured is rewritten as follows. All control nodes present in the transitive closure of the successor relation are rewritten in the same manner. Each control node C is characterized by a statement S, a predecessor list ps and a successor list ss. For each edge (c,C), where c is a predecessor of C in ps, an event e_cC, initialized at newEvent(0), is created, and similarly for ss. The whole unstructured construct is replaced by a sequential sequence of spawn(I,Sc), one for each C of the transitive closure of the successor relation starting at the entry control node, where Sc is defined as follows:

barrier(spawn(one, wait(e_ps[1]C)),
        ...,
        spawn(m, wait(e_ps[m]C)));
S;
signal(e_Css[1]); ...; signal(e_Css[m'])

where m and m' are the lengths of the ps and ss lists; L[j] is the j-th element of L.

The code corresponding to the statement z=x+y, taken from the parallel unstructured graph shown in Figure 4.3, is rewritten as:

spawn(three,
      barrier(spawn(one, wait(e_13)),
              spawn(two, wait(e_23)));
      z = x + y;             // statement 3
      signal(e_3exit)
)

Synchronization

A statement S with synchronization atomic(I) is rewritten as:

wait(I); S; signal(I)

assuming that the assignment I = newEvent(1) is performed on the event identifier I at the very beginning of the main function. A wait on an event variable sets it to zero, if it is currently equal to one, to prohibit other threads to enter the atomic section; the signal resets the event variable to one, to permit further access.
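For example (our own instantiation of this rule), the atomically-synchronized update of Figure 4.6, namely the statement x[index[i]] += f(i) protected via the reference l, becomes:

// assuming l = newEvent(1) was executed at the beginning of main
wait(l);
x[index[i]] += f(i);
signal(l)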

A statement S with a blocking synchronization single, i.e., equal to true, is equivalent, when it occurs within an enclosing innermost parallel forloop, to:

barrier(wait(I_S);
        if(first_S,
           S; first_S = false,
           nop);
        signal(I_S))

where first_S is a boolean variable that ensures that only one process among those spawned by the parallel loop will execute S; access to this variable is protected by the event I_S. Both first_S and I_S are respectively initialized, before loop entry, to true and newEvent(1).

The conditional if(E,S,S′) can be rewritten using the core loop construct, as illustrated in Figure 4.17. The same rewriting can be used when the single synchronization is equal to false, corresponding to a non-blocking synchronization construct, except that no barrier is needed.

if(E,S,S′)

stop = false;
loop(E ∧ ¬stop, S; stop = true);
loop(¬E ∧ ¬stop, S′; stop = true)

Figure 4.17: if statement rewriting using while loops

4.4 Validation: SPIRE Application to LLVM IR

Assessing the quality of a methodology that impacts the definition of a data structure as central for compilation frameworks as an intermediate representation is a difficult task. This section illustrates how it can be used on a different IR, namely LLVM, with minimal changes, thus providing support regarding the generality of our methodology. We show how simple it is to upgrade the LLVM IR to a "parallel LLVM IR" via the SPIRE transformation methodology.

LLVM [75] (Low-Level Virtual Machine) is an open source compilation framework that uses an intermediate representation in Static Single Assignment (SSA) [37] form. We chose the IR of LLVM to illustrate a second time our approach because LLVM is widely used in both academia and industry; another interesting feature of LLVM IR, compared to PIPS's, is that it sports a graph approach, while PIPS is mostly abstract syntax tree-based: each function is structured in LLVM as a control flow graph (CFG).

Figure 4.18 provides the definition of a significant subset of the sequential LLVM IR described in [75] (to keep notations simple in this thesis, we use the same Newgen language to write this specification):

• a function is a list of basic blocks, which are portions of code with one entry and one exit points;

• a basic block has an entry label, a list of φ nodes and a list of instructions, and ends with a terminator instruction;

• φ nodes, which are the key elements of SSA, are used to merge the values coming from multiple basic blocks. A φ node is an assignment (represented here as a call expression) that takes as arguments an identifier and a list of pairs (value, label); it assigns to the identifier the value corresponding to the label of the block preceding the current one at run time;

• every basic block ends with a terminator, which is a control flow-altering instruction that specifies which block to execute after termination of the current one.

An example of the LLVM encoding of a simple for loop in three basic blocks is provided in Figure 4.19, where, in the first assignment, which uses a φ node, sum is 42 if it is the first time we enter the bb1 block (from entry), or sum otherwise (from the branch bb).

Applying SPIRE to LLVM IR is, as illustrated above with PIPS, achieved in three steps, yielding the SPIREd parallel extension of the LLVM sequential IR provided in Figure 4.20:

• an execution attribute is added to function and block: a parallel basic block sees all its instructions launched in parallel, while all the blocks of a parallel function are seen as parallel tasks to be executed concurrently;


function             = blocks:block* ;
block                = label:entity x phi_nodes:phi_node* x
                       instructions:instruction* x
                       terminator ;
phi_node             = call ;
instruction          = call ;
terminator           = conditional_branch +
                       unconditional_branch +
                       return ;
conditional_branch   = value:entity x label_true:entity x
                       label_false:entity ;
unconditional_branch = label:entity ;
return               = value:entity ;

Figure 4.18: Simplified Newgen definitions of the LLVM IR

sum = 42;
for (i = 0; i < 10; i++)
  sum = sum + 2;

entry:
  br label %bb1

bb:                      ; preds = %bb1
  %0 = add nsw i32 %sum.0, 2
  %1 = add nsw i32 %i.0, 1
  br label %bb1

bb1:                     ; preds = %bb, %entry
  %sum.0 = phi i32 [42, %entry], [%0, %bb]
  %i.0 = phi i32 [0, %entry], [%1, %bb]
  %2 = icmp sle i32 %i.0, 10
  br i1 %2, label %bb, label %bb2

Figure 4.19: A loop in C and its LLVM intermediate representation

• a synchronization attribute is added to instruction; therefore, an instruction can be annotated with spawn, barrier, single or atomic synchronization attributes. When one wants to consider a sequence of instructions as a whole, this sequence is first outlined in a "function", to be called instead; this new call instruction is then annotated with the proper synchronization attribute, such as spawn if the sequence must be considered as an asynchronous task; we provide an example in Figure 4.21. A similar technique is used for the other synchronization constructs barrier, single and atomic;

• as LLVM provides a set of intrinsic functions [75], SPIRE functions newEvent, freeEvent, signal and wait for handling point-to-point synchronization, and send and recv for handling data distribution, are added to this set.

Note that the use of SPIRE on the LLVM IR is not able to express


function'    = function x execution ;
block'       = block x execution ;
instruction' = instruction x synchronization ;

Intrinsic functions:
send, recv, signal, wait, newEvent, freeEvent

Figure 4.20: SPIRE(LLVM IR)

main() {
  int A, B, C;
  A = foo();
  B = bar(A);
  C = baz();
}

call_AB(int *A, int *B) {
  *A = foo();
  *B = bar(*A);
}
main() {
  int A, B, C;
  spawn(zero, call_AB(&A, &B));
  C = baz();
}

Figure 4.21: An example of a spawned outlined sequence

parallel loops as easily as was the case on PIPS IR. Indeed, the notion of a loop does not always exist in the definition of intermediate representations based on control flow graphs, including LLVM; it is an attribute of some of its nodes, which has to be added later on by a loop-detection program analysis phase. Of course, such an analysis could also be applied on the SPIRE-derived IR to thus recover this information. Once one knows that a particular loop is parallel, this can be encoded within SPIRE using the same technique as presented in Section 4.3.3.

More generally, even though SPIRE uses traditional parallel paradigms for code generation purposes, SPIRE-derived IRs are able to deal with more specific parallel constructs such as DOACROSS loops [56] or HELIX-like [29] approaches. HELIX is a technique to parallelize a loop in the presence of loop-carried dependences, by executing sequential segments while exploiting TLP between them and using prefetching of signal synchronization. Basically, a compiler would parse a given sequential program into sequential IR elements. Optimization compilation phases specific to particular parallel code generation paradigms, such as those above, will translate, whenever possible (specific data and control-flow analyses will be needed here), these sequential IR constructs into parallel loops, with the corresponding synchronization primitives, as need be. Code generation will then recognize such IR patterns and generate specific parallel instructions such as DOACROSS.


4.5 Related Work: Parallel Intermediate Representations

In this section, we review several different possible representations of parallel programs, both at the high, syntactic, and mid, graph-based, levels. We provide synthetic descriptions of the key existing IRs addressing similar issues to our thesis's. Bird's eye view comparisons with SPIRE are also given here.

Syntactic approaches to parallelism expression use abstract syntax tree nodes, while adding specific built-in functions for parallelism. For instance, the intermediate representation of the implementation of OpenMP in GCC (GOMP) [84] extends its three-address representation, GIMPLE [77]. The OpenMP parallel directives are replaced by specific built-ins in low- and high-level GIMPLE, and additional nodes in high-level GIMPLE, such as the sync_fetch_and_add built-in function for an atomic memory access addition. Similarly, Sarkar and Zhao introduce the high-level parallel intermediate representation HPIR [111], which decomposes Habanero-Java programs into region syntax trees, while maintaining additional data structures on the side: region control-flow graphs and region dictionaries, which contain information about the use and the definition of variables in region nodes. New abstract syntax tree nodes are introduced: AsyncRegionEntry and AsyncRegionExit are used to delimit tasks, while FinishRegionEntry and FinishRegionExit can be used in parallel sections. SPIRE borrows some of the ideas used in GOMP or HPIR, but frames them in more structured settings while trying to be more language-neutral. In particular, we try to minimize the number of additional built-in functions, which have the drawback of hiding the abstract high-level structure of parallelism and affecting compiler optimization passes. Moreover, we focus on extending existing AST nodes rather than adding new ones (such as in HPIR), in order to not fatten the IR and avoid redundant analyses and transformations on the same basic constructs. Applying the SPIRE approach to systems such as GCC would have provided a minimal set of extensions that could have also been used for other implementations of parallel languages that rely on GCC as a backend, such as Cilk.

PLASMA is a programming framework for heterogeneous SIMD systems, with an IR [89] that abstracts data parallelism and vector instructions. It provides specific operators, such as add on vectors, and special instructions, such as reduce and par. While PLASMA abstracts SIMD implementation and compilation concepts for SIMD accelerators, SPIRE is more architecture-independent and also covers control parallelism.

InsPIRe is the parallel intermediate representation at the core of the source-to-source Insieme compiler [57] for C, C++, OpenMP, MPI and OpenCL programs. Parallel constructs are encoded using built-ins. SPIRE intends to also cover source-to-source optimization. We believe it could have been applied to Insieme sequential components, parallel constructs being defined as extensions of the sequential abstract syntax tree nodes of InsPIRe, instead of using numerous built-ins.

Turning now to mid-level intermediate representations, many systems rely on graph structures for representing sequential code and extend them for parallelism. The Hierarchical Task Graph [48] represents the program control flow. The hierarchy exposes the loop nesting structure: at each loop nesting level, the loop body is hierarchically represented as a single node that embeds a subgraph that has control and data dependence information associated with it. SPIRE is able to represent both structured and unstructured control-flow dependence, thus enabling recursively-defined optimization techniques to be applied easily. The hierarchical nature of underlying sequential IRs can be leveraged, via SPIRE, to their parallel extensions; this feature is used in the PIPS case addressed below.

A stream graph [31] is a dataflow representation introduced specifically for streaming languages. Nodes represent data reorganization and processing operations between streams, and edges, communications between nodes. The number of data samples defined and used by each node is supposed to be known statically. Each time a node is fired, it consumes a fixed number of elements of its inputs and produces a fixed number of elements on its outputs. SPIRE provides support for both data and control dependence information; streaming can be handled in SPIRE using its point-to-point synchronization primitives.

The parallel program graph (PPDG) [98] extends the program dependence graph [44], where vertices represent blocks of statements and edges, essential control or data dependences; mgoto control edges are added to represent task creation occurrences, and synchronization edges, to impose ordering on tasks. Kimble IR [24] uses an intermediate representation designed along the same lines, i.e., as hierarchical direct acyclic graphs (DAG) on top of GCC IR, GIMPLE. Parallelism is expressed there using new types of nodes: region, which is a subgraph of clusters, and cluster, a sequence of statements to be executed by one thread. Like PPDG and Kimble IR, SPIRE adopts an extension approach to "parallelize" existing sequential intermediate representations; this chapter showed that this can be defined as a general mechanism for parallel IR definitions, and provides a formal specification of this concept.

LLVM IR [75] represents each function as a control flow graph. To encode parallel constructs, LLVM introduces the notion of metadata, such as llvm.loop.parallel for implementing parallel loops. A metadata is a string used as an annotation on the LLVM IR nodes. LLVM IR lacks support for other parallel constructs such as starting parallel threads, synchronizing them, etc. We presented in Section 4.4 our proposal for a parallel IR for LLVM, via the application of SPIRE to LLVM.


46 Conclusion

SPIRE is a new and general 3-step extension methodology for mapping any intermediate representation (IR) used in compilation platforms for representing sequential programming constructs to a parallel IR; one can leverage it for the source-to-source and high- to mid-level optimization of control-parallel languages and constructs.

The extension of an existing IR introduces (1) a parallel execution attribute for each group of statements, (2) a high-level synchronization attribute on each statement node and an API for low-level synchronization events, and (3) two built-ins for implementing communications in message-passing memory systems. The formal semantics of SPIRE transformational definitions is specified using a two-tiered approach: a small-step operational semantics for its base parallel concepts and a rewriting mechanism for high-level constructs.
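To make this three-part extension more concrete, here is a minimal, hypothetical sketch, not the actual SPIRE or PIPS data structures, of how a sequential statement node could be wrapped with the execution and synchronization attributes and the two communication built-ins described above; all names are illustrative.

    from dataclasses import dataclass, field
    from enum import Enum
    from typing import Any, List

    class Execution(Enum):        # (1) parallel execution attribute for statement groups
        SEQUENTIAL = 0
        PARALLEL = 1

    class Synchronization(Enum):  # (2) high-level synchronization attribute per statement
        NONE = 0
        SPAWN = 1
        BARRIER = 2
        ATOMIC = 3
        SINGLE = 4

    @dataclass
    class SpireStatement:
        """A sequential IR statement extended a la SPIRE (illustrative only)."""
        node: Any                                   # the original sequential IR node
        execution: Execution = Execution.SEQUENTIAL
        synchronization: Synchronization = Synchronization.NONE
        children: List["SpireStatement"] = field(default_factory=list)

    # (3) communication built-ins for message-passing memory systems (hypothetical names)
    def send(dest: int, data: Any) -> None: ...
    def recv(src: int, data: Any) -> None: ...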

The SPIRE methodology is presented via a use case, the intermediate representation of PIPS, a powerful source-to-source compilation infrastructure for Fortran and C. We illustrate the generality of our approach by showing how SPIRE can be used to represent the constructs of the current parallel languages Cilk, Chapel, X10, Habanero-Java, OpenMP, OpenCL and MPI. To validate SPIRE, we show the application of SPIRE on another IR, namely the one of the widely used LLVM compilation infrastructure.

The extension of the sequential IR of PIPS using SPIRE has been implemented in the PIPS middle-end and used for parallel code transformations and generation (see Chapter 7). We added the execution set (parallel and sequential attributes), the synchronization set (none, spawn, barrier, atomic and single attributes), events and data distribution calls. Chapter 8 provides performance data gathered using our implementation of SPIRE on the PIPS IR, to illustrate the ability of the resulting parallel IR to efficiently express the parallelism present in scientific applications. The next chapter addresses the crucial step in any parallelization process, which is scheduling.

An earlier version of the work presented in this chapter was presented at CPC [67] and at the HiPEAC Computing Systems Week.

Chapter 5

BDSC: A Memory-Constrained, Number of Processor-Bounded Extension of DSC

Life is too unpredictable to live by a schedule. Anonymous

In the previous chapter, we have introduced an extension methodology for sequential intermediate representations to handle parallelism. In this chapter, as we are interested in this thesis in task parallelism, we focus on the next step: finding or extracting this parallelism from a sequential code, to be eventually expressed using a SPIRE-derived parallel intermediate representation. To reach that goal, we introduce BDSC, which extends Yang and Gerasoulis's Dominant Sequence Clustering (DSC) algorithm. BDSC is a new efficient automatic scheduling algorithm for parallel programs in the presence of resource constraints on the number of processors and their local memory size; it is based on the static estimation of both execution time and communication cost. Thanks to the handling of these two resource parameters in a single framework, BDSC can address both shared and distributed parallel memory architectures. As a whole, BDSC provides a trade-off between concurrency gain and communication overhead between processors, under two resource constraints: finite number of processors and amount of memory.

5.1 Introduction

One key problem when attempting to parallelize sequential programs is to find solutions to graph partitioning and scheduling problems, where vertices represent computational tasks and edges, data exchanges. Each vertex is labeled with an estimation of the time taken by the computation it performs; similarly, each edge is assigned a cost that reflects the amount of data that needs to be exchanged between its adjacent vertices. Efficiency is strongly dependent here on the accuracy of the cost information encoded in the graph to be scheduled. Gathering such information is a difficult process in general, in particular in our case, where tasks are automatically generated from program code.

We thus introduce a new non-preemptive static scheduling heuristic that strives to give as small as possible schedule lengths, i.e., parallel execution time, in order to exploit the task-level parallelism possibly present in sequential programs, targeting homogeneous multiprocessors or multicores, while enforcing architecture-dependent constraints: the number of processors, the computational cost and memory use of each task, and the communication cost for each task exchange, to provide hopefully significant speedups on shared and distributed computer architectures. Our technique, called BDSC, is based on an existing best-of-breed static scheduling heuristic, namely Yang and Gerasoulis's DSC (Dominant Sequence Clustering) list-scheduling algorithm [108] [47], which we equip with new heuristics that handle resource constraints. One key advantage of DSC over other scheduling policies (see Section 5.4), besides its already good performance when the number of processors is unlimited, is that it has been proven optimal for fork/join graphs; this is a serious asset given our focus on the program parallelization process, since task graphs representing parallel programs often use this particular graph pattern. Even though this property may be lost when constraints are taken into account, our experiments on scientific benchmarks suggest that our extension still provides good performance (see Chapter 8).

This chapter is organized as follows. Section 5.2 presents the original DSC algorithm that we intend to extend. We detail our heuristic extension, BDSC, in Section 5.3. Section 5.4 compares the main existing scheduling algorithms with BDSC. Finally, Section 5.5 concludes the chapter.

5.2 The DSC List-Scheduling Algorithm

In this section, we present the list-scheduling heuristic called DSC [108] that we extend and enhance to make it useful for automatic task parallelization under resource constraints.

5.2.1 The DSC Algorithm

DSC (Dominant Sequence Clustering) is a task list-scheduling heuristic for an unbounded number of processors. The objective is to minimize the top level of each task (see Section 2.5.2 for a review of the main notions of list scheduling). A DS (Dominant Sequence) is a path that has the longest length in a partially scheduled DAG; a graph critical path is thus a DS for the totally scheduled DAG. The DSC heuristic computes a Dominant Sequence (DS) after each vertex is processed, using tlevel(τ, G) + blevel(τ, G) as priority(τ). Since priorities are updated after each iteration, DSC computes dynamically the critical path based on both top level and bottom level information. A ready vertex τ, i.e., a vertex all of whose predecessors have already been scheduled, on one of the current DSs, i.e., with the highest priority, is clustered with a predecessor τp when this reduces the top level of τ by zeroing, i.e., setting to zero, the cost of communications on the incident edge (τp, τ). As said in Section 2.5.2, task_time(τ) and edge_cost(e) are assumed to be numerical constants, although we show how we lift this restriction in Section 6.3.
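As an illustration, the following minimal sketch (the DAG is given as plain dictionaries of task times, edge costs and successor lists; all names are ours, not those of PIPS) computes the top levels, bottom levels and hence the DSC priorities used above. The example data are the task times and edge costs reconstructed from the scheduling table of Figure 5.1.

    from collections import defaultdict

    def topological_order(succs):
        """Kahn's algorithm; succs maps each task to the list of its successors."""
        indeg = defaultdict(int)
        for t, ss in succs.items():
            indeg.setdefault(t, 0)
            for s in ss:
                indeg[s] += 1
        ready, order = [t for t, d in indeg.items() if d == 0], []
        while ready:
            t = ready.pop()
            order.append(t)
            for s in succs.get(t, []):
                indeg[s] -= 1
                if indeg[s] == 0:
                    ready.append(s)
        return order

    def dsc_priorities(task_time, edge_cost, succs):
        """priority(t) = tlevel(t) + blevel(t), as in DSC."""
        order = topological_order(succs)
        preds = defaultdict(list)
        for t, ss in succs.items():
            for s in ss:
                preds[s].append(t)
        tlevel, blevel = {}, {}
        for t in order:            # longest path from an entry vertex to t (t excluded)
            tlevel[t] = max((tlevel[p] + task_time[p] + edge_cost[p, t] for p in preds[t]), default=0)
        for t in reversed(order):  # longest path from t to an exit vertex (t included)
            blevel[t] = task_time[t] + max((edge_cost[t, s] + blevel[s] for s in succs.get(t, [])), default=0)
        return {t: tlevel[t] + blevel[t] for t in order}

    # DAG of Figure 5.1 (costs reconstructed from its scheduling table)
    task_time = {"t1": 1, "t2": 3, "t3": 2, "t4": 2}
    edge_cost = {("t4", "t3"): 1, ("t4", "t2"): 2, ("t1", "t2"): 1}
    succs = {"t1": ["t2"], "t4": ["t3", "t2"], "t2": [], "t3": []}
    print(dsc_priorities(task_time, edge_cost, succs))  # t4: 7, t2: 7, t3: 5, t1: 5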

The DSC instance of the general list-scheduling Algorithm 3, presented in Section 2.5.2, is provided in Algorithm 4, where select_cluster is replaced by the code in Algorithm 5 (new_cluster extends clusters with a new empty cluster; its cluster_time is set to 0). The table in Figure 5.1 represents the result of scheduling the DAG in the same figure using the DSC algorithm.

ALGORITHM 4: DSC scheduling of Graph G on P processors

procedure DSC(G, P)
  clusters = ∅
  foreach τi ∈ vertices(G)
    priority(τi) = tlevel(τi, G) + blevel(τi, G)
  UT = vertices(G)                          // unscheduled tasks
  while UT ≠ ∅
    τ = select_task_with_highest_priority(UT)
    κ = select_cluster(τ, G, P, clusters)
    allocate_task_to_cluster(τ, κ, G)
    update_graph(G)
    update_priority_values(G)
    UT = UT - {τ}
end

ALGORITHM 5: DSC cluster selection for Task τ for Graph G on P processors

function select_cluster(τ, G, P, clusters)
  (min_τ, min_tlevel) = tlevel_decrease(τ, G)
  return (cluster(min_τ) ≠ cluster_undefined) ? cluster(min_τ) : new_cluster(clusters)
end

To decide which predecessor τp to select to be clustered with τ, DSC applies the minimization procedure tlevel_decrease, presented in Algorithm 6, which returns the predecessor that leads to the highest reduction of top level for τ if clustered together, and the resulting top level; if no zeroing is accepted, the vertex τ is kept in a new single-vertex cluster. More precisely, the minimization procedure tlevel_decrease for a task τ in Algorithm 6 tries to find the cluster cluster(min_τ) of one of its predecessors


[Figure 5.1: A Directed Acyclic Graph (left) and its scheduling (right); starred top levels (*) correspond to the selected clusters. The DAG vertices are τ1 (task time 1), τ2 (3), τ3 (2) and τ4 (2); its edges are (τ4, τ3) with cost 1, (τ4, τ2) with cost 2 and (τ1, τ2) with cost 1.]

step   task   tlevel   blevel   DS   scheduled tlevel (κ0, κ1, κ2)
1      τ4     0        7        7    0*
2      τ3     3        2        5    2, 3*
3      τ1     0        5        5    0*
4      τ2     4        3        7    2*, 4

τp that reduces the top level of τ as much as possible, by zeroing the cost of the edge (min_τ, τ). All clusters start at the same time, and each cluster is characterized by its running time, cluster_time(κ), which is the cumulated time of all tasks scheduled into κ; idle slots within clusters may exist and are also taken into account in this accumulation process. The condition cluster(τp) ≠ cluster_undefined is tested on predecessors of τ in order to make it possible to apply this procedure to ready and unready vertices; an unready vertex has at least one unscheduled predecessor.

ALGORITHM 6: Minimization DSC procedure for Task τ in Graph G

function tlevel_decrease(τ, G)
  min_tlevel = tlevel(τ, G)
  min_τ = τ
  foreach τp ∈ predecessors(τ, G) where cluster(τp) ≠ cluster_undefined
    start_time = cluster_time(cluster(τp))
    foreach τ′p ∈ predecessors(τ, G) where cluster(τ′p) ≠ cluster_undefined
      if (τp ≠ τ′p) then
        level = tlevel(τ′p, G) + task_time(τ′p) + edge_cost(τ′p, τ)
        start_time = max(level, start_time)
    if (min_tlevel > start_time) then
      min_tlevel = start_time
      min_τ = τp
  return (min_τ, min_tlevel)
end


5.2.2 Dominant Sequence Length Reduction Warranty

Dominant Sequence Length Reduction Warranty (DSRW) is a greedy heuristic within DSC that aims to further reduce the scheduling length. A vertex on the DS path with the highest priority can be ready or not ready. With the DSRW heuristic, DSC schedules the ready vertices first, but if such a ready vertex τr is not on the DS path, DSRW verifies, using the procedure in Algorithm 7, that the corresponding zeroing does not affect later the reduction of the top levels of the DS vertices τu that are partially ready, i.e., such that there exists at least one unscheduled predecessor of τu. To do this, we check if the "partial top level" of τu, which does not take into account unexamined (unscheduled) predecessors and is computed using tlevel_decrease, is reducible once τr is scheduled.

ALGORITHM 7: DSRW optimization for Task τu when scheduling Task τr for Graph G

function DSRW(τr, τu, clusters, G)
  (min_τ, min_tlevel) = tlevel_decrease(τr, G)
  // partial top level of τu before scheduling τr
  (τb, ptlevel_before) = tlevel_decrease(τu, G)
  // schedule τr
  allocate_task_to_cluster(τr, cluster(min_τ), G)
  saved_edge_cost = edge_cost(min_τ, τr)
  edge_cost(min_τ, τr) = 0
  // partial top level of τu after scheduling τr
  (τa, ptlevel_after) = tlevel_decrease(τu, G)
  if (ptlevel_after > ptlevel_before) then
    // (min_τ, τr) zeroing not accepted
    edge_cost(min_τ, τr) = saved_edge_cost
    return false
  return true
end

The table in Figure 5.1 illustrates an example where it is useful to apply the DSRW optimization. There, the DS column provides, for the task scheduled at each step, its priority, i.e., the length of its dominant sequence, while the last column represents, for each possible zeroing, the corresponding task top level; starred top levels (*) correspond to the selected clusters. Task τ4 is mapped to Cluster κ0 in the first step of DSC. Then τ3 is selected, because it is the ready task with the highest priority. The mapping of τ3 to Cluster κ0 would reduce its top level from 3 to 2. But the zeroing


without DSRW:  κ0 = {τ4, τ3},  κ1 = {τ1, τ2}
with DSRW:     κ0 = {τ4, τ2},  κ1 = {τ3},  κ2 = {τ1}

Figure 5.2: Result of DSC on the graph in Figure 5.1 without (left) and with (right) DSRW

of (τ4, τ3) affects the top level of τ2, τ2 being the unready task with the highest priority. Since the partial top level of τ2 is 2 with the zeroing of (τ4, τ2), but 4 after the zeroing of (τ4, τ3), DSRW will fail, and DSC allocates τ3 to a new cluster, κ1. Then τ1 is allocated to a new cluster, κ2, since it has no predecessors. Thus, the zeroing of (τ4, τ2) is kept, thanks to the DSRW optimization: the total scheduling length is 5 (with DSRW) instead of 7 (without DSRW) (Figure 5.2).

5.3 BDSC: A Resource-Constrained Extension of DSC

This section details the key ideas at the core of our new scheduling process, BDSC, which extends DSC with a number of important features, namely (1) by verifying predefined memory constraints, (2) by targeting a bounded number of processors and (3) by trying to make this number as small as possible.

5.3.1 DSC Weaknesses

A good scheduling solution is a solution that is built carefully, by having knowledge about previously scheduled tasks and tasks to execute in the future. Yet, as stated in [72], "an algorithm that only considers bottom level or only top level cannot guarantee optimal solutions". Even though tests carried out on a variety of scheduling algorithms show that the best competitor among them is the DSC algorithm [101], and DSC is a policy that uses the critical path for computing dynamic priorities based on both the bottom level and the top level for each vertex, it has some limits in practice.

The key weakness of DSC for our purpose is that the number of processors cannot be predefined; DSC yields blind clusterings, disregarding resource issues. Therefore, in practice, a thresholding mechanism to limit the number of generated clusters should be introduced. When allocating new clusters, one should verify that the number of clusters does not exceed a predefined threshold P (Section 5.3.3). Also, zeroings should handle memory constraints, by verifying that the resulting clustering does not lead to cluster data sizes that exceed a predefined cluster memory threshold M (Section 5.3.3).


Finally, DSC may generate a lot of idle slots in the created clusters. It adds a new cluster when no zeroing is accepted, without verifying the possible existence of gaps in existing clusters. We handle this case in Section 5.3.4, adding an efficient idle cluster slot allocation routine to the task-to-cluster mapping process.

5.3.2 Resource Modeling

Since our extension deals with computer resources, we assume that each vertex in a task DAG is labeled with an additional information, task_data(τ), which is an over-approximation of the memory space used by Task τ; its size is assumed to be always strictly less than M. A similar cluster_data function applies to clusters, where it represents the collective data space used by the tasks scheduled within it. Since BDSC, like DSC, needs execution times and communication costs to be numerical constants, we discuss in Section 6.3 how this information is computed.

Our improvement to the DSC heuristic intends to reach a trade-off between the gained parallelism and the communication overhead between processors, under two resource constraints: finite number of processors and amount of memory. We track these resources in our implementation of allocate_task_to_cluster, given in Algorithm 8; note that the aggregation function regions_union is defined in Section 2.6.3.

ALGORITHM 8: Task allocation of Task τ in Graph G to Cluster κ with resource management

procedure allocate_task_to_cluster(τ, κ, G)
  cluster(τ) = κ
  cluster_time(κ) = max(cluster_time(κ), tlevel(τ, G)) + task_time(τ)
  cluster_data(κ) = regions_union(cluster_data(κ), task_data(τ))
end

Efficiently allocating tasks on the target architecture requires reducing the communication overhead and transfer cost for both shared and distributed memory architectures. If zeroing operations, which reduce the start time of each task and nullify the corresponding edge cost, are obviously meaningful for distributed memory systems, they are also worthwhile on shared memory architectures. Merging two tasks in the same cluster keeps the data in the local memory, and possibly the cache, of each thread and avoids copying them over the shared memory bus. Therefore, transmission costs are decreased and bus contention is reduced.


5.3.3 Resource Constraint Warranty

Resource usage affects the performance of running programs. Thus, parallelization algorithms should try to limit the size of the memory used by tasks. BDSC introduces a new heuristic to control the amount of memory used by a cluster, via the user-defined memory upper bound parameter M. The limitation of the memory size of tasks is important when (1) executing large applications that operate on large amounts of data, (2) M represents the processor local1 (or cache) memory size, since, if the memory limitation is not respected, transfers between the global and local memories may occur during execution and may result in performance degradation, and (3) targeting embedded system architectures. For each task τ, BDSC uses an over-approximation of the amount of data that τ allocates to perform read and write operations; it is used to check that the memory constraint of Cluster κ is satisfied whenever τ is included in κ. Algorithm 9 implements this memory constraint warranty (MCW); regions_union and data_size are functions that respectively merge data and yield the size (in bytes) of data (see Section 6.3 in the next chapter).

ALGORITHM 9: Resource constraint warranties on memory size M and processor number P

function MCW(τ, κ, M)
  merged = regions_union(cluster_data(κ), task_data(τ))
  return data_size(merged) ≤ M
end

function PCW(clusters, P)
  return |clusters| < P
end

The previous line of reasoning is well adapted to a distributed memory architecture. When dealing with a multicore equipped with a purely shared memory, such a per-cluster memory constraint is less meaningful. We can nonetheless keep the MCW constraint check within the BDSC algorithm, even in this case, if we set M to the size of the global shared memory. A positive by-product of this design choice is that BDSC is able, in the shared memory case, to reject computations that need more memory space than available, even within a single cluster.

Another scarce resource is the number of processors. In the original policy of DSC, when no zeroing for τ is accepted, i.e., none would decrease its start time, τ is allocated to a new cluster. In order to limit the number of created clusters, we propose to introduce a user-defined cluster threshold P. This processor constraint warranty, PCW, is defined in Algorithm 9.

1We handle one level of the memory hierarchy of the processors.

5.3.4 Efficient Task-to-Cluster Mapping

In the original policy of DSC, when no zeroings are accepted, because none would decrease the start time of Vertex τ or DSRW failed, τ is allocated to a new cluster. This cluster creation is not necessary when idle slots are present at the end of other clusters; thus we suggest to select instead one of these clusters, if it can decrease the start time of τ without affecting the scheduling of the successors of the vertices already scheduled in this cluster. To ensure this, these successors must have already been scheduled, or they must be a subset of the successors of τ. Therefore, in order to use clusters efficiently and not introduce additional clusters needlessly, we propose to schedule τ to the cluster that satisfies this optimizing constraint, if no zeroing is accepted.

This extension of DSC that we introduce in BDSC thus amounts to replacing each assignment of the cluster of τ to a new cluster by a call to end_idle_clusters. The end_idle_clusters function, given in Algorithm 10, returns, among the idle clusters, the ones that finished the most recently before τ's top level, or the empty set if none is found. This assumes, of course, that τ's dependencies are compatible with this choice.

ALGORITHM 10: Efficiently mapping Task τ in Graph G to clusters, if possible

function end_idle_clusters(τ, G, clusters)
  idle_clusters = clusters
  foreach κ ∈ clusters
    if (cluster_time(κ) ≤ tlevel(τ, G)) then
      end_idle_p = TRUE
      foreach τκ ∈ vertices(G) where cluster(τκ) = κ
        foreach τs ∈ successors(τκ, G)
          end_idle_p ∧= cluster(τs) ≠ cluster_undefined ∨ τs ∈ successors(τ, G)
      if (¬end_idle_p) then
        idle_clusters = idle_clusters - {κ}
  last_clusters = argmax_{κ ∈ idle_clusters} cluster_time(κ)
  return (idle_clusters ≠ ∅) ? last_clusters : ∅
end

To illustrate the importance of this heuristic, suppose we have the DAG presented in Figure 5.3. The tables in Figure 5.4 exhibit the difference in scheduling obtained by DSC and our extension on this graph. We observe here that the number of clusters generated using DSC is 3, with 5 idle slots, while BDSC needs only 2 clusters, with 2 idle slots. Moreover, BDSC achieves a better load balancing than DSC, since it reduces the variance of


[Figure 5.3: A DAG amenable to cluster minimization (left) and its BDSC step-by-step scheduling (right). The DAG vertices are τ1 (task time 1), τ2 (5), τ3 (2), τ4 (3), τ5 (2) and τ6 (2); edge costs are 1, except edge (τ5, τ6) with cost 2 and edge (τ3, τ6) with cost 9.]

step   task   tlevel   blevel   DS   scheduled tlevel (κ0, κ1)
1      τ1     0        15       15   0*
2      τ3     2        13       15   1*
3      τ2     2        12       14   3, 2*
4      τ4     8        6        14   7*
5      τ5     8        6        14   8*, 10
6      τ6     13       2        15   11*, 12

the clusters' execution loads, defined for a given cluster as the sum of the costs of all its tasks: X_κ = Σ_{τ ∈ κ} task_time(τ). Indeed, we compute first the average of costs

E[X_κ] = (Σ_{κ=0}^{|clusters|-1} X_κ) / |clusters|

and then the variance

V(X_κ) = (Σ_{κ=0}^{|clusters|-1} (X_κ - E[X_κ])²) / |clusters|

For BDSC, E(X_κ) = 15/2 = 7.5 and V(X_κ) = ((7 - 7.5)² + (8 - 7.5)²)/2 = 0.25. For DSC, E(X_κ) = 15/3 = 5 and V(X_κ) = ((5 - 5)² + (8 - 5)² + (2 - 5)²)/3 = 6.

Finally, with our efficient task-to-cluster mapping, in addition to decreasing the number of generated clusters, we also gain in the total execution time, since our approach reduces communication costs by allocating tasks to the same cluster, e.g., by eliminating edge_cost(τ5, τ6) = 2 in Figure 5.4; thus, as shown in this figure (the last column accumulates the execution time for the clusters), the execution time with DSC is 14, but is 13 with BDSC.
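As a quick check of these numbers, here is a minimal sketch (cluster contents taken from Figure 5.4, task times from Figure 5.3) that recomputes the load average and variance for both clusterings.

    def load_stats(clustering, task_time):
        """Per-cluster load X_k = sum of its task times; returns (mean, variance) over clusters."""
        loads = [sum(task_time[t] for t in cluster) for cluster in clustering]
        mean = sum(loads) / len(loads)
        variance = sum((x - mean) ** 2 for x in loads) / len(loads)
        return mean, variance

    task_time = {"t1": 1, "t2": 5, "t3": 2, "t4": 3, "t5": 2, "t6": 2}
    dsc_clusters  = [["t1", "t3", "t4", "t6"], ["t2"], ["t5"]]   # loads 8, 5, 2
    bdsc_clusters = [["t1", "t3", "t5", "t6"], ["t2", "t4"]]     # loads 7, 8

    print(load_stats(dsc_clusters, task_time))   # (5.0, 6.0)
    print(load_stats(bdsc_clusters, task_time))  # (7.5, 0.25)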

To get a feeling for the way BDSC operates, we detail the steps taken to get this better scheduling in the table of Figure 5.3. BDSC is equivalent to DSC until Step 5, where κ0 is chosen by our cluster mapping heuristic, since successors(τ3, G) ⊂ successors(τ5, G); no new cluster needs to be allocated.

5.3.5 The BDSC Algorithm

BDSC extends the list scheduling template DSC, provided in Algorithm 4, by taking into account the various extensions discussed above. In a nutshell, the BDSC select_cluster function, which decides in which cluster κ a task τ should be allocated, tries successively the four following strategies:


DSC:                                   BDSC:
κ0   κ1   κ2   cumulative time         κ0   κ1   cumulative time
τ1              1                      τ1         1
τ3   τ2         7                      τ3   τ2    7
τ4        τ5    10                     τ5   τ4    10
τ6              14                     τ6         13

Figure 5.4: DSC (left) and BDSC (right) cluster allocation

1. choose κ among the clusters of τ's predecessors that decrease the start time of τ, under MCW and DSRW constraints;

2. or assign κ using our efficient task-to-cluster mapping strategy, under the additional constraint MCW;

3. or create a new cluster, if the PCW constraint is satisfied;

4. otherwise, choose the cluster among all clusters in MCW_clusters_min under the constraint MCW. Note that, in this worst case scenario, the top level of τ can be increased, leading to a decrease in performance, since the length of the graph critical path is also increased.

BDSC is described in Algorithms 11 and 12; the entry graph Gu is the whole unscheduled program DAG, P the maximum number of processors and M the maximum amount of memory available in a cluster. UT denotes the set of unexamined tasks at each BDSC iteration, RL the set of ready tasks and URL the set of unready ones. We schedule the vertices of G according to the four rules above, in a descending order of the vertices' priorities. Each time a task τr has been scheduled, all the newly readied vertices are added to the set RL (ready list) by the update_ready_set function.

BDSC returns a scheduled graph, i.e., an updated graph where some zeroings may have been performed and for which the clusters function yields the clusters needed by the given schedule; this schedule includes, besides the new graph, the cluster allocation function on tasks, cluster. If not enough memory is available, BDSC returns the original graph and signals its failure by setting clusters to the empty set. In Chapter 6, we introduce an alternative solution in case of failure, using an iterative hierarchical scheduling approach.

We suggest to apply here an additional heuristic, in which, if multiple vertices have the same priority, the vertex with the greatest bottom level is chosen for τr (likewise for τu) to be scheduled first, to favor the successors that have the longest path from τr to the exit vertex. Also, an optimization could be performed when calling update_priority_values(G): indeed, after each cluster allocation, only the top levels of the successors of τr need to be recomputed, instead of those of the whole graph.
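A minimal sketch of this tie-breaking rule (names are ours): select_task_with_highest_priority can simply order candidates lexicographically by (priority, blevel).

    def select_task_with_highest_priority(tasks, priority, blevel):
        """Pick the task with the highest priority; break ties with the greatest bottom level."""
        return max(tasks, key=lambda t: (priority[t], blevel[t]))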


ALGORITHM 11: BDSC scheduling Graph Gu under processor and memory bounds P and M

function BDSC(Gu, P, M)
  if (P ≤ 0) then
    return error('Not enough processors', Gu)
  G = graph_copy(Gu)
  foreach τi ∈ vertices(G)
    priority(τi) = tlevel(τi, G) + blevel(τi, G)
  UT = vertices(G)                              // unexamined tasks
  RL = {τ ∈ UT / predecessors(τ, G) = ∅}        // ready list
  URL = UT - RL                                 // unready list
  clusters = ∅
  while UT ≠ ∅
    τr = select_task_with_highest_priority(RL)
    (τm, min_tlevel) = tlevel_decrease(τr, G)
    if (τm ≠ τr ∧ MCW(τr, cluster(τm), M)) then
      τu = select_task_with_highest_priority(URL)
      if (priority(τr) < priority(τu)) then
        if (¬DSRW(τr, τu, clusters, G)) then    // zeroing rejected by DSRW
          if (PCW(clusters, P)) then
            κ = new_cluster(clusters)
            allocate_task_to_cluster(τr, κ, G)
          else
            if (¬find_cluster(τr, G, clusters, P, M)) then
              return error('Not enough memory', Gu)
      else
        allocate_task_to_cluster(τr, cluster(τm), G)
        edge_cost(τm, τr) = 0                   // zeroing accepted
    else if (¬find_cluster(τr, G, clusters, P, M)) then
      return error('Not enough memory', Gu)
    update_priority_values(G)
    UT = UT - {τr}
    RL = update_ready_set(RL, τr, G)
    URL = UT - RL
  clusters(G) = clusters
  return G
end

Theorem 1. The time complexity of Algorithm 11 (BDSC) is O(n³), n being the number of vertices in Graph G.

Proof. In the "while" loop of BDSC, the most expensive computation among the different if or else branches is the function end_idle_clusters, used in find_cluster, which locates an existing cluster suitable to allocate Task τ there; such reuse intends to optimize the use of the limited number of processors.


ALGORITHM 12: Attempt to allocate a cluster in clusters for Task τ in Graph G, under processor and memory bounds P and M, returning true if successful

function find_cluster(τ, G, clusters, P, M)
  MCW_idle_clusters = {κ ∈ end_idle_clusters(τ, G, clusters, P) / MCW(τ, κ, M)}
  if (MCW_idle_clusters ≠ ∅) then
    κ = choose_any(MCW_idle_clusters)
    allocate_task_to_cluster(τ, κ, G)
  else if (PCW(clusters, P)) then
    allocate_task_to_cluster(τ, new_cluster(clusters), G)
  else
    MCW_clusters = {κ ∈ clusters / MCW(τ, κ, M)}
    MCW_clusters_min = argmin_{κ ∈ MCW_clusters} cluster_time(κ)
    if (MCW_clusters_min ≠ ∅) then
      κ = choose_any(MCW_clusters_min)
      allocate_task_to_cluster(τ, κ, G)
    else
      return false
  return true
end

function error(m, G)
  clusters(G) = ∅
  return G
end

Its complexity is proportional to

Σ_{τ ∈ vertices(G)} |successors(τ, G)|

which is of worst case complexity O(n²). Thus, the total cost for n iterations of the "while" loop is O(n³).

Even though BDSC's worst case complexity is larger than DSC's, which is O(n² log(n)) [108], it remains polynomial, with a small exponent. Our experiments (see Chapter 8) showed that this theoretical slowdown is not a significant factor in practice.

5.4 Related Work: Scheduling Algorithms

In this section, we survey the main existing scheduling algorithms and compare them to BDSC. Given the breadth of the literature on this subject, we limit this presentation to heuristics that implement static scheduling of task DAGs. Static scheduling algorithms can be divided into three principal classes: guided random, metaheuristic and heuristic techniques.

Guided randomization and metaheuristic techniques use possibly random choices, with guiding information gained from previous search results, to construct new possible solutions. Genetic algorithms are examples of these techniques; they proceed by performing a global search, working in parallel on a population to increase the chance of achieving a global optimum. The Genetic Algorithm for Task Scheduling (GATS 1.0) [38] is an example of GAs. Overall, these techniques usually take a larger than necessary time to find a good solution [72].

Cuckoo Search (CS) [109] is an example of the metaheuristic class. CS combines the behavior of cuckoo species (throwing their eggs in the nests of other birds) with the Levy flight [28]. In metaheuristics, finding an initial population and deciding when a solution is good are problems that affect the quality of the solution and the complexity of the algorithm.

The third class is heuristics, including mostly list-scheduling-based algorithms (see Section 2.5.2). We compare BDSC with eight scheduling algorithms for a bounded number of processors: HLFET, ISH, MCP, HEFT, CEFT, ELT, LPGS and LSGP.

The Highest Level First with Estimated Times (HLFET) [10] and Insertion Scheduling Heuristic (ISH) [70] algorithms use static bottom levels for ordering; scheduling is performed according to a descending order of bottom levels. To schedule a task, they select the cluster that offers the earliest execution time, using a non-insertion approach, i.e., not taking into account idle slots within existing clusters to insert that task. If scheduling a given task introduces an idle slot, ISH adds the possibility of inserting, from the ready list, tasks that can be scheduled into this idle slot. Since, in both algorithms, only bottom levels are used for scheduling purposes, optimal schedules for fork/join graphs cannot be guaranteed.

The Modified Critical Path (MCP) algorithm [107] uses the latest start times, i.e., the critical path length minus blevel, as task priorities. It constructs a list of tasks in an ascending order of latest start times and searches for the cluster yielding the earliest execution, using the insertion approach. As before, it cannot guarantee optimal schedules for fork/join structures.

The Heterogeneous Earliest-Finish-Time (HEFT) algorithm [105] selects the cluster that minimizes the earliest finish time, using the insertion approach. The priority of a task, its upward rank, is the task bottom level. Since this algorithm is based on bottom levels only, it cannot guarantee optimal schedules for fork/join structures.

The Constrained Earliest Finish Time (CEFT) [68] algorithm schedules tasks on heterogeneous systems. It uses the concept of constrained critical paths (CCPs), which represent the tasks ready at each step of the scheduling process. CEFT schedules the tasks in the CCPs using the finish time in the entire CCP. The fact that CEFT schedules critical path tasks first cannot guarantee optimal schedules for fork and join structures, even if sufficient processors are provided.

In contrast to these five algorithms, BDSC preserves the DSC characteristics of optimal scheduling for fork/join structures, since it uses the critical path length for computing dynamic priorities, based on bottom levels and top levels, when the number of processors available for BDSC is greater than or equal to the number of processors used when scheduling with DSC. HLFET, ISH and MCP guarantee that the current critical path will not increase, but they do not attempt to decrease the critical path length. BDSC decreases the length of the DS (Dominant Sequence) of each task and starts with a ready vertex to simplify the computation time of new priorities. When resource scarcity is a factor, BDSC introduces a simple two-step heuristic for task allocation: to schedule tasks, it first searches for possible idle slots in already existing clusters and, otherwise, picks a cluster with enough memory. Our experiments suggest that such an approach provides good schedules (see Chapter 8).

The Locally Parallel-Globally Sequential (LPGS) [79] and Locally Sequential-Globally Parallel (LSGP) [59] algorithms are two techniques that, from a schedule for an unbounded number of clusters, remap the solutions to a bounded number of clusters. In LSGP, clusters are partitioned into blocks, each block being assigned to one cluster (locally sequential); the blocks are handled separately by different clusters, which can be run in parallel (globally parallel). LPGS links each original one-block cluster to one processor (locally parallel); blocks are executed sequentially (globally sequential). BDSC computes a bounded schedule on the fly and covers many more possibilities of scheduling than LPGS and LSGP.

The Extended Latency Time (ELT) algorithm [102] assigns tasks to a parallel machine with shared memory. It uses the attribute of synchronization time instead of communication time, because the latter does not exist in a machine with shared memory. BDSC targets both shared and distributed memory systems.

5.5 Conclusion

This chapter presents the memory-constrained, number-of-processors-bounded, list-scheduling heuristic BDSC, which extends the DSC (Dominant Sequence Clustering) algorithm. This extension improves upon DSC by dealing with memory- and number-of-processors-constrained parallel architectures and by introducing an efficient task-to-cluster mapping strategy.

This chapter leaves pending many questions about the practical use of BDSC and about how to estimate the execution times and communication costs as numerical constants. The next chapter answers these questions, and others, by showing (1) the practical use of BDSC within a hierarchical scheduling algorithm, called HBDSC, for parallelizing scientific applications, and (2) how BDSC will benefit from a sophisticated execution and communication cost model, based either on static code analysis results provided by a compiler infrastructure such as PIPS or on a dynamic-based instrumentation assessment tool.

Chapter 6

BDSC-Based Hierarchical Task Parallelization

All sciences are now under the obligation to prepare the ground for the future task of the philosopher, which is to solve the problem of value, to determine the true hierarchy of values. Friedrich Nietzsche

To recapitulate what we have presented so far in the previous chapters, recall that we have handled two key issues in our automatic task parallelization process, namely (1) the design of SPIRE, to provide parallel program representations, and (2) the introduction of the BDSC list-scheduling heuristic, for mapping DAGs in the presence of resource constraints on the number of processors and their local memory size. In order to reach our goal of automatically task parallelizing sequential codes, we develop in this chapter an approach based on BDSC that first maps this code into a DAG, then generates a cost model to label the edges and vertices of this DAG, and finally applies BDSC on sequential applications.

Therefore, this chapter introduces a new BDSC-based hierarchical scheduling algorithm, which we call HBDSC. It introduces a new DAG data structure, called the Sequence Dependence DAG (SDG), to represent partitioned parallel programs. Moreover, in order to label this SDG's vertices and edges, HBDSC uses a new cost model based on time complexity measures, convex polyhedral approximations of data array sizes and code instrumentation.

6.1 Introduction

If static scheduling algorithms are key issues in sequential program parallelization, they need to be properly integrated into compilation platforms to be used in practice. These platforms are in particular expected to provide the data required for scheduling purposes. However, gathering such information is a difficult problem in general, in particular in our case, where tasks are automatically generated from program code. Therefore, in addition to BDSC, this chapter also introduces new static program analyses to gather the information required to perform scheduling under resource constraints, namely a static execution and communication cost model, a data dependence graph to provide scheduling constraints, and static information regarding the volume of data exchanged between program tasks.

In this chapter, we detail how BDSC can be used, in practice, to schedule applications. We show (1) how to build, from an existing program source code, what we call a Sequence Dependence DAG (SDG), which plays the role of the DAG G used in Chapter 5, (2) how to generate the numerical costs of vertices and edges in SDGs, and (3) how to perform what we call Hierarchical Scheduling (HBDSC) for SDGs, to generate parallel code encoded in SPIRE(PIPS IR). We use PIPS to illustrate how these new techniques can be implemented in an optimizing compilation platform.

PIPS represents user code as abstract syntax trees (see Section 2.6.5). We define a subset of its grammar in Figure 6.1, limited to the statements S at stake in this chapter. Econd, Elower and Eupper are expressions, while I is an identifier. The semantics of these constructs is straightforward. Note that, in PIPS, assignments are seen as function calls, where left hand sides are parameters passed by reference. We use the notion of unstructured with an execution attribute equal to parallel (defined in Chapter 4) to represent parallel code.

S ∈ Statement ::= sequence(S1; ...; Sn) |
                  test(Econd, St, Sf) |
                  forloop(I, Elower, Eupper, Sbody) |
                  call |
                  unstructured(Centry, Cexit, parallel)

C ∈ Control   ::= control(S, Lsucc, Lpred)

L ∈ Control

Figure 6.1: Abstract syntax trees Statement syntax

The remainder of this chapter is structured as follows. Section 6.2 introduces the partitioning of source codes into Sequence Dependence DAGs (SDGs). Section 6.3 presents our cost models for the labeling of vertices and edges of SDGs. We introduce in Section 6.4 an important analysis, the path transformer, that we use within our cost models. A new BDSC-based hierarchical scheduling algorithm (HBDSC) is introduced in Section 6.5. Section 6.6 compares the main existing parallelization platforms with our BDSC-based hierarchical parallelization approach. Finally, Section 6.7 concludes the chapter. Figure 6.2 summarizes the parallelization processes applied in this chapter.

6.2 Hierarchical Sequence Dependence DAG Mapping

In order to partition into tasks real applications, which include loops, tests and other structured constructs1, into dependence DAGs, our approach is

1In this thesis, we only handle structured parts of a code, i.e., the ones that do not contain goto statements. Therefore, within this context, PIPS implements control dependences in its IR, since it is equivalent to an AST (for structured programs, CDG and AST are equivalent).


[Figure 6.2: Parallelization process; blue indicates thesis contributions, an ellipse a process and a rectangle results. The chart connects: C source code and input data; PIPS analyses (DDG, convex array regions, complexity polynomials); the sequential IR; the Sequence Dependence Graph (SDG); the symbolic communication and memory size polynomials; the test "Are polynomials monomials and on the same base?", leading either (yes) to a polynomial approximation or (no) to instrumentation, an instrumented source, its execution and a numerical profile; and finally HBDSC, producing an unstructured SPIRE(PIPS IR) schedule.]


to first build a Sequence Dependence DAG (SDG), which will be the input for the BDSC algorithm. Then we use the code, presented in the form of an AST, to define a hierarchical mapping function, which we call H, to map each sequence statement of the code to its SDG. H is used as the input of the HBDSC algorithm. We present in this section what SDGs are and how an H is built upon them.

6.2.1 Sequence Dependence DAG (SDG)

A Sequence Dependence DAG G is a data dependence DAG where task vertices τ are labeled with statements, while control dependences are encoded in the abstract syntax trees of statements. Any statement S can label a DAG vertex, i.e., each vertex τ contains a statement S, which corresponds to the code it runs when scheduled. We assume that there exist two functions, vertex_statement and statement_vertex, such that, on their respective domains of definition, they satisfy S = vertex_statement(τ) and statement_vertex(S, G) = τ. In contrast to the usual program dependence graph defined in [44], an SDG is thus not built only on simple instructions, represented here as call statements: compound statements, such as test statements (both true and false branches) and loop nests, may constitute indivisible vertices of the SDG.

Design Analysis

For a statement S, one computes its sequence dependence DAG G using a recursive descent (depth-first traversal) along the structure of S, creating a vertex for each statement in a sequence and building loop body, true branch and false branch statements as recursively nested graphs. Definition 1 specifies this relation of hierarchical nesting between statements, which we call "enclosed".

Definition 1: A statement S2 is enclosed in the statement S1, noted S2 ⊂ S1, if and only if either:

1. S2 = S1, i.e., S1 and S2 have the same lexicographical order position within the whole AST to which they belong; this ordering of statements, based on their statement number, renders statements textually comparable;

2. or S1 is a loop, S′2 its body statement, and S2 ⊂ S′2;

3. or S1 is a test, S′2 its true or false branch, and S2 ⊂ S′2;

4. or S1 is a sequence and there exists S′2, one of the statements in this sequence, such that S2 ⊂ S′2.


This relation can also be applied to the vertices τi of these statements Si in a graph, such that vertex_statement(τi) = Si; we note S2 ⊂ S1 ⟺ τ2 ⊂ τ1.
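A minimal sketch of this predicate over a toy AST is given below; node kinds mirror Figure 6.1, but the dictionary representation is illustrative only, not the PIPS IR.

    def enclosed(s2, s1):
        """True if s2 is (recursively) nested in s1, following cases 1-4 of Definition 1."""
        if s2 is s1:                                   # case 1: same statement
            return True
        kind = s1["kind"]
        if kind == "forloop":                          # case 2: loop body
            return enclosed(s2, s1["body"])
        if kind == "test":                             # case 3: true or false branch
            return enclosed(s2, s1["true"]) or enclosed(s2, s1["false"])
        if kind == "sequence":                         # case 4: one of the sequence statements
            return any(enclosed(s2, child) for child in s1["children"])
        return False                                   # calls have no sub-statements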

In Definition 2, we provide a functional specification of the construction of a Sequence Dependence DAG G, where the vertices are compound statements, and thus under the enclosed relation. To create connections within this set of vertices, we keep an edge between two vertices if the possibly compound statements in the vertices are data dependent, using the recursive function prune; we rely on static analyses, such as the one provided in PIPS, to get this data dependence information, in the form of a data dependence graph that we call D in the definition.

Definition 2: Given a data dependence graph D = (T, E)2 and a sequence of statements S = sequence(S1; S2; ...; Sm), its SDG is G = (N, prune(A), 0_{m²}), where:

• N = {n1, n2, ..., nm}, where nj = vertex_statement(Sj), ∀j ∈ {1, ..., m};

• A = {(ni, nj) / ∃(τ, τ′) ∈ E, τ ⊂ ni ∧ τ′ ⊂ nj};

• prune(A′ ∪ {(n, n′)}) = prune(A′) if ∃(n0, n′0) ∈ A′, (n0 ⊂ n ∧ n′0 ⊂ n′);
  prune(A′ ∪ {(n, n′)}) = prune(A′) ∪ {(n, n′)} otherwise;

• prune(∅) = ∅;

• 0_{m²} is an m×m communication edge cost matrix, initially equal to the zero matrix.

Note that G is a quotient graph of D; moreover, we assume that D does not contain inter-iteration dependencies.
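As an illustration, here is a minimal sketch of this construction: dependence edges between innermost statements are lifted to their enclosing top-level statements, and duplicate edges between the same pair of vertices are pruned. Names are ours, and self-dependences internal to a single vertex are simply dropped, an assumption made for this sketch.

    def build_sdg_edges(top_level, dep_edges, enclosed):
        """Lift DDG edges (t, t') to the enclosing top-level statements, then prune duplicates."""
        lifted = []
        for (t, t_prime) in dep_edges:
            src = next((n for n in top_level if enclosed(t, n)), None)
            dst = next((n for n in top_level if enclosed(t_prime, n)), None)
            if src is not None and dst is not None and src is not dst:
                lifted.append((src, dst))
        pruned = []
        for edge in lifted:          # prune: keep a single edge per pair of vertices
            if edge not in pruned:
                pruned.append(edge)
        return pruned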

Construction Algorithm

Algorithm 13 specifies the construction of the sequence dependence DAG G of a statement that is a sequence of S1, S2, ..., Sm substatements. G is initially an empty graph, i.e., its sets of vertices and edges are empty, and its communication matrix is the zero matrix.

From the data dependence graph D, provided via the function DDG that we suppose given, Function SDG(S) adds vertices and edges to G, according to Definition 2, using the functions add_vertex and add_edge. First, we group dependences between compound statements Si to form cumulated dependences, using dependences between their enclosed statements Se; this

2T = vertices(D) is a set of n tasks (vertices) τ, and E is a set of m edges (τi, τj) between two tasks.


ALGORITHM 13: Construction of the sequence dependence DAG G from a statement S = sequence(S1; S2; ...; Sm)

function SDG(sequence(S1; S2; ...; Sm))
  G = (∅, ∅, 0_{m²})
  D = DDG(sequence(S1; S2; ...; Sm))
  foreach Si ∈ {S1, S2, ..., Sm}
    τi = statement_vertex(Si, D)
    add_vertex(τi, G)
    foreach Se / Se ⊂ Si
      τe = statement_vertex(Se, D)
      foreach τj ∈ successors(τe, D)
        Sj = vertex_statement(τj)
        if (∃ k ∈ [1, m] / Sj ⊂ Sk) then
          τk = statement_vertex(Sk, D)
          edge_regions(τi, τk) ∪= edge_regions(τe, τj)
          if (¬∃ (τi, τk) ∈ edges(G))
            add_edge((τi, τk), G)
  return G
end

yields the successors τj. Then we search for another compound statement Sk to form the second side of the dependence, τk, the first being τi. We then keep the dependences that label the edges of D, using edge_regions, on the new edge. Finally, we add the dependence edge (τi, τk) to G.

Figure 6.4 illustrates the construction, from the DDG given in Figure 6.3 (right; meaning of colors given in Section 2.4.2), of the SDG of the C code (left). The figure contains two SDGs, corresponding to the two sequences in the code: the body S0 of the first loop (in blue) also has an SDG, G0. Note how the dependences between the two loops have been deduced from the dependences of their enclosed statements (their loop bodies). These SDGs and their printouts have been generated automatically with PIPS3.

6.2.2 Hierarchical SDG Mapping

A hierarchical SDG mapping function H maps each statement S to an SDG G = H(S) if S is a sequence statement; otherwise, G is equal to ⊥.

Proposition 1: ∀τ ∈ vertices(H(S′)), vertex_statement(τ) ⊂ S′.

Our introduction of the SDG and its hierarchical mapping to different statements via H is motivated by the following observations, which also support our design decisions:

3We assume that declarations and codes are not mixed.


void main() {
   int a[10], b[10];
   int i, d, c;
   /* S */
   c = 42;
   for(i = 1; i <= 10; i++)
      a[i] = 0;
   for(i = 1; i <= 10; i++) {
      /* S0 */
      a[i] = bar(a[i]);
      d = foo(c);
      b[i] = a[i]+d;
   }
   return;
}

Figure 6.3: Example of a C code (left) and the DDG D of its internal S sequence (right)

1. The true and false statements of a test are control dependent upon the condition of the test statement, while every statement within a loop (i.e., the statements of its body) is control dependent upon the loop statement header. If we define a control area as a set of statements transitively linked by the control dependence relation, our SDG construction process ensures that the control area of the statement of a given vertex is in the vertex. This way, we keep all the control dependences of a task in our SDG within itself.

2. We decided to consider test statements as single vertices in the SDG, to ensure that they are scheduled on one cluster4; this guarantees the execution of the enclosed code (true or false statements), whichever branch is taken, on this cluster.

3. We do not group successive simple call instructions into a single "basic block" vertex in the SDG, in order to let BDSC fuse the corresponding

4A cluster is a logical entity which will correspond to one process or thread.


Figure 6.4: SDGs of S (top) and S0 (bottom), computed from the DDG (see the right of Figure 6.3); S and S0 are specified in the left of Figure 6.3

statements, so as to maximize parallelism and minimize communications. Note that PIPS performs interprocedural analyses, which will allow call sequences to be efficiently scheduled, whether these calls represent trivial assignments or complex function calls.

In Algorithm 13, Function SDG yields the sequence dependence DAG of a sequence statement S. We use it in Function SDG_mapping, presented in Algorithm 14, to update the mapping function H for statements S that can be a sequence or a non-sequence; the goal is to build a hierarchy between these SDGs, via the recursive call H = SDG_mapping(S, ⊥).

Figure 6.5 shows the construction of H, where S is the last part of the equake code excerpt given in Figure 6.7; note how the body Sbody of the last loop is also an SDG, Gbody. Note that H(Sbody) = Gbody and H(S) = G. These SDGs have been generated automatically with PIPS; we use the Graphviz tool for pretty printing [50].


ALGORITHM 14: Recursive mapping of an SDG G to each sequence statement S via the function H

function SDG_mapping(S, H)
  switch (S)
    case call:
      return H[S → ⊥]
    case sequence(S1; ...; Sn):
      foreach Si ∈ {S1, ..., Sn}
        H = SDG_mapping(Si, H)
      return H[S → SDG(S)]
    case forloop(I, Elower, Eupper, Sbody):
      H = SDG_mapping(Sbody, H)
      return H[S → ⊥]
    case test(Econd, St, Sf):
      H = SDG_mapping(St, H)
      H = SDG_mapping(Sf, H)
      return H[S → ⊥]
end

Figure 6.5: SDG G for a part of equake S given in Figure 6.7; Gbody is the SDG of the body Sbody (logically included into G via H)


6.3 Sequential Cost Models Generation

Since the volume of data used or exchanged by SDG tasks and their execution times are key factors in the BDSC scheduling process, we need to assess this information as precisely as possible. PIPS provides an intra- and inter-procedural analysis of array data flow, called array regions analysis (see Section 2.6.3), that computes dependences for each array element access. For each statement S, two types of sets of regions are considered: read_regions(S) and write_regions(S)5 contain the array elements respectively read and written by S. Moreover, in PIPS, array region analysis is not limited to array variables but is extended to scalar variables, which can be considered as arrays with no dimensions [34].

6.3.1 From Convex Polyhedra to Ehrhart Polynomials

Our analysis uses the following set of operations on sets of regions Ri of Statement Si, presented in Section 2.6.3: regions_intersection and regions_union. Note that regions must be defined with respect to a common memory store for these operations to be properly defined. Two sets of regions R1 and R2 should be defined in the same memory store to make it possible to compare them; we should bring a set of regions R1 to the store of R2 by combining R2 with the possible store changes introduced in between, what we call the "path transformer", which connects the two memory stores. We introduce the path transformer analysis in Section 6.4.

Since we are interested in the size of array regions, to precisely assess communication costs and memory requirements, we compute Ehrhart polynomials [41], which represent the number of integer points contained in a given parameterized polyhedron, from this region. To manipulate these polynomials, we use various operations of the Ehrhart API provided by the polylib library [90].

Communication Edge Cost

To assess the communication cost between two SDG vertices, τ1 as source and τ2 as sink vertices, we rely on the number of bytes involved in dependences of type "read after write" (RAW), using the read and write regions as follows:

Rw1 = write_regions(vertex_statement(τ1))
Rr2 = read_regions(vertex_statement(τ2))
edge_cost_bytes(τ1, τ2) = Σ_{r ∈ regions_intersection(Rw1, Rr2)} ehrhart(r)

5in_regions(S) and out_regions(S) are not suitable for our context, unless in_regions and out_regions are recomputed after each statement is scheduled. That is why we use read_regions and write_regions, which are independent of the schedule.


For example, the array dependence between the two statements S1 = MultiplY(Ixx, Gx, Gx) and S2 = Gauss(Sxx, Ixx) in the code of Harris presented in Figure 1.1 is Ixx, of size N × M, where the N and M variables represent the input image size. Therefore, S1 should communicate the array Ixx to S2 when completed. We look for the array elements that are accessed in both statements (written in S1 and read in S2); we obtain the following two sets of regions, Rw1 and Rr2:

Rw1 = < Ixx(φ1)-W-EXACT- 1 ≤ φ1, φ1 ≤ N × M >
Rr2 = < Ixx(φ1)-R-EXACT- 1 ≤ φ1, φ1 ≤ N × M >

The result of the intersection between these two sets of regions is as follows:

Rw1 ∩ Rr2 = < Ixx(φ1)-WR-EXACT- 1 ≤ φ1, φ1 ≤ N × M >

The sum of the Ehrhart polynomials resulting from the set Rw1 ∩ Rr2 of regions, which yields the volume of this set of regions, i.e., the number of its elements in bytes, is as follows, where 4 is the number of bytes in a float:

edge_cost_bytes(τ1, τ2) = ehrhart(< Ixx(φ1)-WR-EXACT- 1 ≤ φ1, φ1 ≤ N × M >)
                        = 4 × N × M

In practice, in our experiments, in order to compute the communication time, this polynomial, which represents the size of the message to communicate in number of bytes, is multiplied by the transfer time of one byte (β); it is then added to the latency time (α). These two coefficients are dependent on the specific target machine; we note:

edge_cost(τ1, τ2) = α + β × edge_cost_bytes(τ1, τ2)

The default cost model in PIPS considers that α = 0 and β = 1. Moreover, in some cases, the array region analysis cannot provide the required information. In the case of shared memory codes, BDSC can proceed and generate an efficient schedule without the edge_cost information; otherwise, in the case of distributed memory codes, edge_cost(τ1, τ2) = ∞ is generated as a result for such cases. We will thus decide to schedule τ1 and τ2 in the same cluster.
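A minimal sketch of this cost computation, assuming the region volumes have already been reduced to byte counts; the α and β values shown are placeholders, not measured machine parameters.

    import math

    def edge_cost(volume_bytes, alpha=0.0, beta=1.0):
        """Communication cost: latency plus per-byte transfer time (PIPS default: alpha=0, beta=1).
        volume_bytes is None when the array region analysis could not conclude."""
        if volume_bytes is None:
            return math.inf        # forces the two tasks into the same cluster (distributed case)
        return alpha + beta * volume_bytes

    # Example: the Harris Ixx dependence of size 4*N*M bytes, for a 1024x1024 image
    N = M = 1024
    print(edge_cost(4 * N * M))    # 4194304.0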

Local Storage Task Data

To provide an estimation of the volume of data used by each task τ, we use the number of bytes of data read and written by the task statement, via the following definitions, assuming that the task statement contains no internal allocation of data:


S = vertex_statement(τ)
task_data(τ) = regions_union(read_regions(S), write_regions(S))

As above, we define data_size as follows, for any set of regions R:

data_size(R) = Σ_{r ∈ R} ehrhart(r)

For instance, in the Harris main function, task_data(τ), where τ is the vertex labeled with the call statement S1 = MultiplY(Ixx, Gx, Gx), is equal to:

R = {< Gx(φ1)-R-EXACT- 1 ≤ φ1, φ1 ≤ N × M >,
     < Ixx(φ1)-W-EXACT- 1 ≤ φ1, φ1 ≤ N × M >}

The size of this set of regions is:

data_size(R) = 2 × N × M

When the array region analysis cannot provide the required information, the size of the array contained in its declaration is returned as an upper bound of the accessed array elements for such cases.

Execution Task Time

In order to determine an average execution time for each vertex in the SDG, we use a static execution time approach, based on a program complexity analysis provided by PIPS. There, each statement S is automatically labeled with an expression, represented by a polynomial over program variables, via the complexity_estimation(S) function, that denotes an estimation of the execution time of this statement, assuming that each basic operation (addition, multiplication...) has a fixed, architecture-dependent execution time. In practice, to perform our experiments on a specific target machine, we use a table, COMPLEXITY_COST_TABLE, that, for each basic operation, provides a coefficient of its execution time on this machine. This sophisticated static complexity analysis is based on inter-procedural information, such as preconditions. Using this approach, one can define task_time as:

task_time(τ) = complexity_estimation(vertex_statement(τ))

For instance, for the call statement S1 = MultiplY(Ixx, Gx, Gx) in the Harris main function, task_time(τ), where τ is the vertex labeled with S1, is equal to N × M × 17 + 2, as detailed in Figure 6.6. Thanks to the interprocedural analysis of PIPS, this information can be reported to the call to the function MultiplY in the main function of Harris. Note that we use here a default cost model, where the coefficient of each basic operation is equal to 1.

When this analysis cannot provide the required information (currently, the complexity estimation of a while loop is not computed), our algorithm computes a naive schedule; in such cases, we assume that task_time(τ) = ∞, but the quality (in terms of efficiency) of this schedule is not guaranteed.


void MultiplY(float M[N*M], float X[N*M], float Y[N*M]) {
   // N*M*17+2
   for(i = 0; i < N; i += 1)
      // M*17+2
      for(j = 0; j < M; j += 1)
         // 17
         M[z(i,j)] = X[z(i,j)]*Y[z(i,j)];
}

Figure 6.6: Example of the execution time estimation for the function MultiplY of the code Harris; each comment provides the complexity estimation of the statement below (N and M are assumed to be global variables)

6.3.2 From Polynomials to Values

We have just seen how to represent cost, data and time information in terms of polynomials; yet running BDSC requires actual values. This is particularly a problem when computing tlevels and blevels, since costs and times are cumulated there. We convert cost information into time by assuming that communication times are proportional to costs, which amounts in particular to setting the communication latency α to zero. This assumption is validated by our experimental results and by the fact that data arrays are usually large in the application domain we target.

Static Approach: Polynomials Approximation

When the program variables used in the above-defined polynomials are numerical values, each polynomial is a constant; this happens to be the case for one of our applications, ABF. However, when input data are unknown at compile time (as for the Harris application), we suggest to use a very simple heuristic to replace the polynomials by numerical constants. When all polynomials at stake are monomials on the same base, we simply keep the coefficients of these monomials as costs. Even though this heuristic appears naive at first, it actually is quite useful in the Harris application. Table 6.1 shows the complexities for time estimation and communication costs (the last line) generated for each function of Harris, using the PIPS default operation and communication cost model, where the N and M variables represent the input image size.
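A minimal sketch of this heuristic, with polynomials represented as {monomial: coefficient} dictionaries, a representation chosen for illustration only:

    def approximate(polynomials):
        """If every cost polynomial is a monomial on the same base (e.g. N*M),
        return their coefficients as constant costs; otherwise return None,
        signalling that the dynamic (instrumentation) approach is needed."""
        bases = set()
        for poly in polynomials.values():
            if len(poly) != 1:                # not a monomial
                return None
            bases.update(poly.keys())
        if len(bases) != 1:                   # monomials, but not on the same base
            return None
        (base,) = bases
        return {name: poly[base] for name, poly in polynomials.items()}

    # Harris task-time polynomials from Table 6.1, all monomials in N*M
    harris = {"InitHarris": {"N*M": 9}, "SobelX": {"N*M": 60}, "SobelY": {"N*M": 60},
              "MultiplY": {"N*M": 17}, "Gauss": {"N*M": 85}, "CoarsitY": {"N*M": 34}}
    print(approximate(harris))
    # {'InitHarris': 9, 'SobelX': 60, 'SobelY': 60, 'MultiplY': 17, 'Gauss': 85, 'CoarsitY': 34}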

In our implementation (see Section 8.3.3), we use a more precise operation and communication cost model that depends on the target machine. It relies on a number of parameters of a given CPU, such as the cost of an addition, of a multiplication, or of the communication of one byte, for an Intel CPU, all of these in numbers of cycle units (CPI). In Section 8.3.3, we detail these parameters for the CPUs used.


Function               Complexity and Transfer (polynomial)   Numerical estimation
InitHarris             9 × N × M                              9
SobelX                 60 × N × M                             60
SobelY                 60 × N × M                             60
MultiplY               17 × N × M                             17
Gauss                  85 × N × M                             85
CoarsitY               34 × N × M                             34
One image transfer     4 × N × M                              4

Table 6.1: Execution and communication time estimations for Harris, using the PIPS default cost model (the N and M variables represent the input image size).


Dynamic Approach: Instrumentation

The general case deals with polynomials that are functions of many variables, such as the ones that occur in the SPEC2001 benchmark equake, which depend on the variables ARCHelems or ARCHnodes and timesteps (see the non-bold code in Figure 6.7, which shows a part of the equake code).

In such cases, we first instrument automatically the input sequential code and run it once, in order to obtain the numerical values of the polynomials. The instrumented code contains the initial user code, plus instructions that compute the numerical values of the cost polynomials for each statement. BDSC is then applied, using this cost information, to yield the final parallel program. Note that this approach is sound, since BDSC ensures that the value of a variable (and thus of a polynomial) is the same whichever scheduling is used. Of course, this approach works well, as our experiments show (see Chapter 8), when a program's complexity and communication polynomials do not change when some of its input parameters are modified (partial evaluation). This is the case for many signal processing benchmarks, where performance is mostly a function of structure parameters, such as image size, and is independent of the actual signal (pixel) values upon which the program acts.

We show an example of this final case, using a part of the instrumented equake code6, in Figure 6.7. The added instrumentation instructions are fprintf statements: for task_time instrumentation, the second parameter represents the statement number of the following statement, and the third the value of its execution time; for edge_cost instrumentation, the second parameter identifies the two statements incident to the edge, and the third the value of the edge cost polynomial.

6 We do not show the instrumentation on the statements inside the loops, for readability purposes.


FILE *finstrumented = fopen("instrumented_equake.in", "w");

fprintf(finstrumented, "task_time 62 = %ld\n", 179 * ARCHelems + 3);
for (i = 0; i < ARCHelems; i++) {
  for (j = 0; j < 4; j++)
    cor[j] = ARCHvertex[i][j];
}

fprintf(finstrumented, "task_time 163 = %ld\n", 20 * ARCHnodes + 3);
for (i = 0; i <= ARCHnodes - 1; i += 1)
  for (j = 0; j <= 2; j += 1)
    disp[disptplus][i][j] = 0.0;

fprintf(finstrumented, "edge_cost 163 -> 166 = %ld\n", ARCHnodes * 9);
fprintf(finstrumented, "task_time 166 = %ld\n", 110 * ARCHnodes + 106);
smvp_opt(ARCHnodes, K, ARCHmatrixcol, ARCHmatrixindex,
         disp[dispt], disp[disptplus]);

fprintf(finstrumented, "task_time 167 = %d\n", 6);
time = iter * Exc.dt;

fprintf(finstrumented, "task_time 168 = %ld\n", 510 * ARCHnodes + 3);
for (i = 0; i < ARCHnodes; i++)      /* S */
  for (j = 0; j < 3; j++) {
    /* Sbody */
    disp[disptplus][i][j] *= -Exc.dt * Exc.dt;
    disp[disptplus][i][j] +=
      2.0 * M[i][j] * disp[dispt][i][j] - (M[i][j] -
      Exc.dt / 2.0 * C[i][j]) * disp[disptminus][i][j] -
      Exc.dt * Exc.dt * (M23[i][j] * phi2(time) / 2.0 +
                         C23[i][j] * phi1(time) / 2.0 +
                         V23[i][j] * phi0(time) / 2.0);
    disp[disptplus][i][j] =
      disp[disptplus][i][j] / (M[i][j] + Exc.dt / 2.0 * C[i][j]);
    vel[i][j] = 0.5 / Exc.dt * (disp[disptplus][i][j] -
                                disp[disptminus][i][j]);
  }

fprintf(finstrumented, "task_time 175 = %d\n", 2);
disptminus = dispt;
fprintf(finstrumented, "task_time 176 = %d\n", 2);
dispt = disptplus;
fprintf(finstrumented, "task_time 177 = %d\n", 2);
disptplus = i;

Figure 6.7: Instrumented part of equake (Sbody is the inner loop sequence).

After execution of the instrumented code, the numerical results of the polynomials are printed in the file instrumented_equake.in, presented in Figure 6.8. This file is an input for the PIPS implementation of BDSC. We parse this file in order to extract the execution and communication time estimations (in numbers of cycles) to be used as parameters for the computation of the top levels and bottom levels necessary for the BDSC algorithm (see Chapter 5).



task_time 62 = 35192835
task_time 163 = 718743
edge_cost 163 -> 166 = 323433
task_time 166 = 3953176
task_time 167 = 6
task_time 168 = 18327873
task_time 175 = 2
task_time 176 = 2
task_time 177 = 2

Figure 6.8: Numerical results of the instrumented part of equake (instrumented_equake.in).
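As an illustration of this parsing step (a sketch, not the actual PIPS code), the instrumented_equake.in file of Figure 6.8 can be read back into task-time and edge-cost maps as follows:

import re

def parse_instrumentation(path):
    """Read 'task_time <stmt> = <value>' and 'edge_cost <s1> -> <s2> = <value>'
    lines produced by the instrumented code."""
    task_time, edge_cost = {}, {}
    with open(path) as f:
        for line in f:
            m = re.match(r"task_time (\d+) = (\d+)", line)
            if m:
                task_time[int(m.group(1))] = int(m.group(2))
                continue
            m = re.match(r"edge_cost (\d+) -> (\d+) = (\d+)", line)
            if m:
                edge_cost[(int(m.group(1)), int(m.group(2)))] = int(m.group(3))
    return task_time, edge_cost

# task_time, edge_cost = parse_instrumentation("instrumented_equake.in")
# task_time[166] == 3953176 ; edge_cost[(163, 166)] == 323433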

This section has described how the BDSC scheduling algorithm obtains the weights it relies upon, on the vertices (execution time and storage data) and edges (communication cost) of its input graph: we exploit the complexity and array region analyses provided by PIPS to gather all this information statically. However, since two sets of regions can only be compared if they are defined in the same memory store, the second set of regions must be combined with the path transformer, which we define in the next section and which connects the two memory stores of these two sets of regions.

6.4 Reachability Analysis: The Path Transformer

Our cost model, defined in Section 6.3, uses affine relations such as convex array regions to estimate the communication and data volumes induced by the execution of statements. We used a set of operations on array regions, such as the function regions_intersection (see Section 6.3.1). However, this is not sufficient, because regions should be combined with what we call path transformers, in order to propagate the memory stores used in them. Indeed, regions must be defined with respect to a common memory store for region operations to be properly performed.

A path transformer permits to compare array regions of statements originally defined in different memory stores. The path transformer between two statements computes the possible changes performed by a piece of code delimited by two statements, Sbegin and Send, enclosed within a statement S. A path transformer is represented by a convex polyhedron over program variables, and its computation is based on transformers (see Section 2.6.1); it is thus a system of linear inequalities.


Our path transformer analysis can be seen as a special case of reachability analysis, a particular kind of graph analysis where one checks whether a given final state can be reached from an initial one. In the domain of programming languages, one practical application of this technique is the computation, for general control-flow graphs, of the transitive closure of the relation corresponding to statement transitions (see for instance [106]). Since the scientific applications we target in this thesis mostly use for loops, we propose in this section a dedicated analysis, called path transformer, that provides, in the restricted case of for loops and tests, analytical expressions for such transitive closures.

6.4.1 Path Definition

A "path" specifies a statement execution instance, including the mapping of loop indices to values and the specific syntactic statement of interest. A path is defined by a pair P = (M, S), where M is a mapping function from Ide to Exp7 and S is a statement specified by its syntactic path from the top program syntax tree to the particular statement. Indeed, a particular statement Si may correspond to a particular iteration M(i) within a set of iterations (created by a loop of index i). Paths are used to store the values of the indices of the loops enclosing statements. Every loop on the path has its entry in M; indices of loops not related to it are not present.

We provide a simple program example in Figure 6.9, which shows the AST of a code that contains a loop. We search for the path transformer between Sb and Se; we suppose that S, the smallest subtree that encloses the two statements, is forloop0. As the two statements are enclosed within a loop, we have to specify the iterations of these statements inside the loop, e.g., M(i) = 2 for Sb and M(i) = 5 for Se. The path execution trace between Sb and Se is illustrated in Figure 6.10: the chosen path runs from Sb at the second iteration of the loop to Se at the fifth iteration.
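A path can be encoded very simply; the sketch below (a toy encoding with hypothetical names, not the PIPS data structure) represents P = (M, S) as a mapping from enclosing-loop indices to iteration values, together with the syntactic path to the statement:

from dataclasses import dataclass, field

@dataclass
class Path:
    """P = (M, S): M maps enclosing-loop indices to iteration values,
    S is the syntactic path from the root of the AST to the statement."""
    M: dict = field(default_factory=dict)
    S: tuple = ()          # e.g. ("forloop0", "sequence0", "Sb")

# Paths of Figures 6.9/6.10: Sb at iteration i = 2, Se at iteration i = 5.
Pbegin = Path(M={"i": 2}, S=("forloop0", "sequence0", "Sb"))
Pend   = Path(M={"i": 5}, S=("forloop0", "sequence0", "Se"))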

In this section, we assume that Send is reachable [43] from Sbegin. This means that, for each enclosing loop S (Sbegin ⊂ S and Send ⊂ S), where S = forloop(I, Elower, Eupper, Sbody), Elower <= Eupper (loops are always executed) and, for each mapping of an index of these loops, Elower <= ibegin <= iend <= Eupper. Moreover, since there is no execution trace from the true branch St to the false branch Sf of a test statement Stest, Send is not reachable from Sbegin if, for each mapping of an index of a loop that encloses Stest, ibegin = iend, or if there is no loop enclosing the test.

6.4.2 Path Transformer Algorithm

In this section, we present an algorithm that computes the affine relations between program variables at any two points, or "paths", in the program, from the memory store of Sbegin to the memory store of Send, along any sequential execution that links the two memory stores.

7 Ide and Exp are the sets of identifiers I and expressions E, respectively.


forloop0
   index: i, lower bound: 1, upper bound: 10
   body:  sequence0 = [S0; S1; Sb; S2; Se; S3]

Figure 6.9: An example of a subtree of root forloop0.

Sb(2); S2(2); Se(2); S3(2);
S0(3); S1(3); Sb(3); S2(3); Se(3); S3(3);
S0(4); S1(4); Sb(4); S2(4); Se(4); S3(4);
S0(5); S1(5); Sb(5); S2(5); Se(5)

Figure 6.10: An example of a path execution trace from Sb to Se in the subtree forloop0.

path_transformer(S, Pbegin, Pend) is the path transformer of the subtree of S in which all vertices not on the execution trace from path Pbegin to path Pend have been pruned. It is the path transformer between statements Sbegin and Send if Pbegin = (Mbegin, Sbegin) and Pend = (Mend, Send).

path_transformer uses a variable mode to indicate the current context on the compile-time version of the execution trace. Modes are propagated along the depth-first, left-to-right traversal of statements: (1) mode = sequence, when there is no loop on the trace at the current statement level; (2) mode = prologue, on the prologue part of an enclosing loop; (3) mode = permanent, on the full part of an enclosing loop; and (4) mode = epilogue, on the epilogue part of an enclosing loop. The algorithm is described in Algorithm 15.

A first call to the recursive function pt_on(S, Pbegin, Pend, mode), with mode = sequence, is made. In this function, several path transformer operations are performed. A path transformer T is recursively defined by:

T ::= id | ⊥ | T1 ∘ T2 | T1 ⊔ T2 | T^k | T(S)

These operations on path transformers are specified as follows:

• id is the identity transformer, used when no variables are modified, i.e., they are all constrained by an equality.

• ⊥ denotes the undefined (empty) transformer, i.e., one in which the set of affine constraints of the system is not feasible.

• '∘' denotes the composition of transformers: T1 ∘ T2 is overapproximated by the union of the constraints of T1 (over the variables x1, ..., xn, x''1, ..., x''p) with the constraints of T2 (over the variables x''1, ..., x''p, x'1, ..., x'n'), then projected onto x1, ..., xn, x'1, ..., x'n' to eliminate the "intermediate" variables x''1, ..., x''p (this definition is taken from [76]). Note that T ∘ id = T, for all T.

• '⊔' denotes the convex hull of its two transformer arguments. The usual union of two transformers is a union of two convex polyhedra, which is not necessarily a convex polyhedron; a convex hull approximation is required in order to obtain a convex result. The convex hull operation provides the smallest polyhedron enclosing the union.

• '^k' denotes the effect of k iterations of a transformer T. In practice, heuristics are used to compute a transitive closure approximation T*; thus, this operation generally provides an approximation of the transformer T^k. Note that PIPS uses here the Affine Derivative Closure algorithm [18].

• T(S) is the transformer of a statement S, defined in Section 2.6.1; we suppose these transformers already exist. A transformer T is defined by a list of arguments and a predicate system labeled by T.

We distinguish several cases in the pt_on(S, (Mbegin, Sbegin), (Mend, Send), mode) function. The computation of the path transformer begins when S is Sbegin, i.e., the path is beginning: we return the transformer of Sbegin.


ALGORITHM 15: AST-based Path Transformer algorithm

function path_transformer(S, (Mbegin, Sbegin), (Mend, Send))
  return pt_on(S, (Mbegin, Sbegin), (Mend, Send), sequence)
end

function pt_on(S, (Mbegin, Sbegin), (Mend, Send), mode)
  if (Send = S) then return id
  elseif (Sbegin = S) then return T(S)
  switch (S)
    case call:
      return (mode = permanent
              or mode = sequence and Sbegin <= S and S < Send
              or mode = prologue and Sbegin <= S
              or mode = epilogue and S < Send) ? T(S) : id
    case sequence(S1; S2; ...; Sn):
      if (|S1; S2; ...; Sn| = 0) then return id
      else
        rest = sequence(S2; ...; Sn)
        return pt_on(S1, (Mbegin, Sbegin), (Mend, Send), mode) ∘
               pt_on(rest, (Mbegin, Sbegin), (Mend, Send), mode)
    case test(Econd, St, Sf):
      Tt = pt_on(St, (Mbegin, Sbegin), (Mend, Send), mode)
      Tf = pt_on(Sf, (Mbegin, Sbegin), (Mend, Send), mode)
      if ((Sbegin ⊂ St and Send ⊂ Sf) or (Send ⊂ St and Sbegin ⊂ Sf)) then
        if (mode = permanent) then
          return Tt ⊔ Tf
        else
          return ⊥
      else if (Sbegin ⊂ St or Send ⊂ St) then
        return T(Econd) ∘ Tt
      else if (Sbegin ⊂ Sf or Send ⊂ Sf) then
        return T(!Econd) ∘ Tf
      else
        return T(S)
    case forloop(I, Elower, Eupper, Sbody):
      return pt_on_loop(S, (Mbegin, Sbegin), (Mend, Send), mode)
end

The computation ends when S is Send: there, we close the path and return the identity transformer id. The other cases depend on the type of the statement, as follows.


Call

When S is a function call, we return the transformer of S if S belongs to the path from Sbegin to Send, and id otherwise; we use the lexicographic ordering < to compare statement paths.

Sequence

When S is a sequence of n statements, we return the composition of the transformers of each statement in this sequence. An example of the result provided by the path transformer algorithm for a sequence is illustrated in Figure 6.11. The left side presents a code fragment that contains a sequence of four instructions. The right side shows the resulting path transformer between the two statements Sb and Se: it is the composition of the transformers of the first three statements; note that the transformer of Se is not included.

S:
   Sb: i = 42;
       i = i+5;
       i = i*3;
   Se: i = i+4;

path_transformer(S, (⊥, Sb), (⊥, Se)) =
   T(i) {
     i = 141
   }

Figure 6.11: An example of a C code and the path transformer of the sequence between Sb and Se; M is ⊥ since there are no loops in the code.
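As a toy illustration of composition on this example (a sketch that reduces a transformer on the single variable i to an affine map i' = a*i + b, which is far weaker than the polyhedral transformers actually used by PIPS):

class AffineTransformer:
    """Transformer on one variable: i' = a*i + b (id is a=1, b=0)."""
    def __init__(self, a=1, b=0):
        self.a, self.b = a, b
    def compose(self, other):
        # apply self first, then other: i'' = other.a*(self.a*i + self.b) + other.b
        return AffineTransformer(other.a * self.a, other.a * self.b + other.b)
    def apply(self, i):
        return self.a * i + self.b

# Transformers of the statements preceding Se in Figure 6.11:
t_sb   = AffineTransformer(0, 42)   # i = 42
t_add5 = AffineTransformer(1, 5)    # i = i + 5
t_mul3 = AffineTransformer(3, 0)    # i = i * 3
path = t_sb.compose(t_add5).compose(t_mul3)
print(path.apply(0))                # 141, matching T(i) { i = 141 }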

Test

When S is a test, the path transformer is computed according to the location of Sbegin and Send in the branches of the test. If one is in the true statement and the other in the false statement of the test, the path transformer is the union of the true and false transformers iff the mode is permanent, i.e., there is at least one loop enclosing S. If only one of the two statements is enclosed in the true or in the false branch of the test, the transformer of the condition (true or false) is composed with the transformer of the corresponding enclosing branch (true or false) and is returned. An example of the result provided by the path transformer algorithm for a test is illustrated in Figure 6.12. The left side presents a code fragment that contains a test. The right side shows the resulting path transformer between the two statements Sb and Se: it is the composition of the transformers of the statements from Sb to the statement just before the test, of the transformer of the true condition (i>0), and of the transformers of the statements in the true branch located before Se; note that the transformer of Se is not included.


S:
   Sb: i = 4;
       i = i*3;
       if (i>0) {
          k1++;
   Se:    i = i+4;
       } else {
          k2++;
          i--;
       }

path_transformer(S, (⊥, Sb), (⊥, Se)) =
   T(i, k1) {
     i = 12,
     k1 = k1_init + 1
   }

Figure 6.12: An example of a C code and the path transformer of the sequence of calls and test between Sb and Se; M is ⊥ since there are no loops in the code.

Loop

When S is a loop, we call the function pt_on_loop (see Algorithm 16). Note that, if loop indices are not specified for a statement in the mapping function, the algorithm sets ibegin equal to the lower bound of the loop and iend to the upper one. In this function, a composition of two transformers is computed, depending on the value of mode: (1) the prologue transformer, from the iteration ibegin until the upper bound of the loop, or until the iteration iend if Send belongs to this loop; and (2) the epilogue transformer, from the lower bound of the loop until the upper bound of the loop, or until the iteration iend if Send belongs to this loop. These two transformers are computed using the function iter, shown in Algorithm 17.

iter computes the transformer of a number of iterations of a loop S, when loop bounds are constant or symbolic functions, via the predicate function is_constant_p on an expression, which returns true if this expression is constant. It computes a loop transformer that is the transformer of the proper number of iterations of this loop: it is equal to the transformer of the looping condition Tenter, composed with the transformer of the loop body, composed with the transformer of the incrementation statement Tinc. In practice, when h - l is not a constant, a transitive closure approximation T* is computed, using the function affine_derivative_closure already implemented in PIPS, and generally provides an approximation of the transformer T^(h-l+1).

The case where ibegin = iend means that we execute only one iteration of the loop, from Sbegin to Send. Therefore, we compute Ts, i.e., a transformer on the loop body with the sequence mode, using the function pt_on_loop_one_iteration presented in Algorithm 17.


ALGORITHM 16: Case for loops of the path transformer algorithm

function pt_on_loop(S, (Mbegin, Sbegin), (Mend, Send), mode)
  forloop(I, Elower, Eupper, Sbody) = S
  low  = ibegin = undefined(Mbegin(I)) ? Elower : Mbegin(I)
  high = iend   = undefined(Mend(I))   ? Eupper : Mend(I)
  Tinit  = T(I = Elower)
  Tenter = T(I <= Eupper)
  Tinc   = T(I = I+1)
  Texit  = T(I > Eupper)
  if (mode = permanent) then
    T = iter(S, (Mbegin, Sbegin), (Mend, Send), Elower, Eupper)
    return Tinit ∘ T ∘ Texit
  Ts = pt_on_loop_one_iteration(S, (Mbegin, Sbegin), (Mend, Send))
  Tp = Te = p = e = id
  if (mode = prologue or mode = sequence) then
    if (Sbegin ⊂ S)
      p = pt_on(Sbody, (Mbegin, Sbegin), (Mend, Send), prologue) ∘ Tinc
      low = ibegin+1
    high = (mode = sequence) ? iend-1 : Eupper
    Tp = p ∘ iter(S, (Mbegin, Sbegin), (Mend, Send), low, high)
  if (mode = epilogue or mode = sequence) then
    if (Send ⊂ S)
      e = Tenter ∘ pt_on(Sbody, (Mbegin, Sbegin), (Mend, Send), epilogue)
      high = (mode = sequence) ? -1 : iend-1
    low = Elower
    Te = iter(S, (Mbegin, Sbegin), (Mend, Send), low, high) ∘ e
  T = Tp ∘ Te
  T = ((Sbegin ⊄ S) ? Tinit : id) ∘ T ∘ ((Send ⊄ S) ? Texit : id)
  if (is_constant_p(Eupper) and is_constant_p(Elower)) then
    if (mode = sequence and ibegin = iend) then
      return Ts
    else
      return T
  else
    return T ⊔ Ts
end

This eliminates the statements outside the path from Sbegin to Send. Moreover, since with symbolic parameters we cannot achieve this precision (we may execute one iteration or more), a union of T and Ts is returned. The result is composed with the transformer of the index initialization Tinit and/or with the exit condition of the loop Texit, depending on the positions of Sbegin and Send in the loop.

An example of the result provided by the path transformer algorithm is illustrated in Figure 6.13. The left side presents a code that contains a sequence of two loops and an incrementation instruction, which makes the computation of the path transformer required when performing parallelization.


ALGORITHM 17: Transformers for one iteration, and for a constant/symbolic number of iterations h - l + 1, of a loop S

function pt_on_loop_one_iteration(S, (Mbegin, Sbegin), (Mend, Send))
  forloop(I, Elower, Eupper, Sbody) = S
  Tinit  = T(I = Elower)
  Tenter = T(I <= Eupper)
  Tinc   = T(I = I+1)
  Texit  = T(I > Eupper)
  Ts = pt_on(Sbody, (Mbegin, Sbegin), (Mend, Send), sequence)
  Ts = (Sbegin ⊂ S and Send ⊄ S) ? Ts ∘ Tinc ∘ Texit : Ts
  Ts = (Send ⊂ S and Sbegin ⊄ S) ? Tinit ∘ Tenter ∘ Ts : Ts
  return Ts
end

function iter(S, (Mbegin, Sbegin), (Mend, Send), l, h)
  forloop(I, Elower, Eupper, Sbody) = S
  Tenter = T(I <= h)
  Tinc   = T(I = I+1)
  Tbody = pt_on(Sbody, (Mbegin, Sbegin), (Mend, Send), permanent)
  Tcomplete_body = Tenter ∘ Tbody ∘ Tinc
  if (is_constant_p(Eupper) and is_constant_p(Elower))
    return T(I = l) ∘ (Tcomplete_body)^(h-l+1)
  else
    Tstar = affine_derivative_closure(Tcomplete_body)
    return Tstar
end

Indeed, comparing array accesses in the two loops must take into account the fact that the n used in the first loop is not the same as the n in the second one, because it is incremented between the two loops. The right side shows the resulting path transformer between the two statements Sb and Se: it is the result of the call to path_transformer(S, (Mb, Sb), (Me, Se)). We suppose here that Mb(i) = 0 and Me(j) = n - 1. Notice that the constraint n = n_init + 1 is important to detect that n is changed between the two loops, via the n++ instruction.

An optimization, for the case of a statement S of type sequence or loop, would be to return the transformer T(S) of S when S contains neither Sbegin nor Send.


S:
 l1:
   for(i=0; i<n; i++) {
Sb:   a[i] = a[i]+1;
      k1++;
   }
   n++;
 l2:
   for(j=0; j<n; j++) {
      k2++;
Se:   b[j] = a[j]+a[j+1];
   }

path_transformer(S, (Mb, Sb), (Me, Se)) =
   T(i, j, k1, k2, n) {
     i + k1_init = i_init + k1,
     j + k2_init = k2 - 1,
     n = n_init + 1,
     i_init + 1 <= i,
     n_init <= i,
     k2_init + 1 <= k2,
     k2 <= k2_init + n
   }

Figure 6.13: An example of a C code and the path transformer between Sb and Se.

6.4.3 Operations on Regions Using Path Transformers

In order to make possible operations combining the sets of regions of two statements S1 and S2, the set of regions R1 of S1 must be brought to the store of the second set of regions R2 of S2. Therefore, we first compute the path transformer that links the two statements of the two sets of regions, as follows:

T12 = path_transformer(S, (M1, S1), (M2, S2))

where S is a statement that verifies S1 ⊂ S and S2 ⊂ S; M1 and M2 contain information about the indices of these statements in the loops potentially enclosing them in S. Then, we compose the second set of regions R2 with T12, to bring R2 to a common memory store with R1. We use for this the operation region_transformer_compose, already implemented in PIPS [34], as follows:

R'2 = region_transformer_compose(R2, T12)

We thus update the operations defined in Section 2.6.3: (1) the intersection of the two sets of regions R1 and R2 is obtained via this call:

regions_intersection(R1, R'2)

(2) the difference between R1 and R2 is obtained via this call:

regions_difference(R1, R'2)

and (3) the union of R1 and R2 is obtained via this call:

regions_union(R1, R'2)

For example, if we want to compute the RAW dependences between the Write regions Rw1 of Loop l1 and the Read regions Rr2 of Loop l2 in the code in Figure 6.13, we must first find T12, the path transformer between the two loops. The result is then:

RAW(l1, l2) = regions_intersection(Rw1, region_transformer_compose(Rr2, T12))

If we assume that the size of Arrays a and b is n = 20, for example, we obtain the following results:

Rw1   = <a(φ1)-W-EXACT-{0 <= φ1, φ1 <= 19, φ1+1 <= n}>
Rr2   = <a(φ1)-R-EXACT-{0 <= φ1, φ1 <= 19, φ1 <= n}>
T12   = T(i, k1, n) {k1 = k1_init + i, n = n_init + 1, i >= 0, i >= n_init}
R'r2  = <a(φ1)-R-EXACT-{0 <= φ1, φ1 <= 19, φ1 <= n+1}>
Rw1 ∩ R'r2 = <a(φ1)-W-R-EXACT-{0 <= φ1, φ1 <= 19, φ1+1 <= n}>

Notice how the φ1 accesses to Array a differ in Rr2 and R'r2 (after composition with T12), because of the equality n = n_init + 1 in T12, due to the modification of n between the two loops via the n++ instruction.

In our context of combining path transformers with regions to compute communication costs, dependences and data volumes between the vertices of a DAG, for each loop of index i, M(i) must be set to the lower bound of the loop and iend to the upper one (the default values). We are not interested here in computing the path transformer between statements of specific iterations. However, our algorithm is general enough to handle, for example, loop parallelism, or transformations on loops where one wants access to statements of a particular iteration.

6.5 BDSC-Based Hierarchical Scheduling (HBDSC)

Now that all the information needed by the basic version of BDSC presented in the previous chapter has been gathered, we detail in this section how we suggest to adapt it to the different SDGs linked hierarchically via the mapping function H, introduced in Section 6.2.2 on page 96, in order to eventually generate nested parallel code when possible. We adopt in this section the unstructured parallel programming model of Section 4.2.2, since it offers the freedom to implement arbitrary parallel patterns and since SDGs implement this model. Therefore, we use the parallel unstructured construct of SPIRE to encode the generated parallel code.


6.5.1 Closure of SDGs

An SDG G of a nested sequence S may depend upon statements outside S. We introduce the function closure, defined in Algorithm 18, to provide a self-contained version of G. There, G is completed with a set of entry vertices and edges, in order to recover the dependences coming from outside S. We use the predecessors τp in the DDG D to find these dependences Rs; for each dependence, we create an entry vertex and we schedule it in the same cluster as τp. Function reference is used here to find the reference expression of the dependence. Note that these import vertices will be used when scheduling S with BDSC, in order to compute the updated values of tlevel and blevel impacted by the communication cost of each vertex in G relative to all SDGs in H. We set their task_time to 0.

ALGORITHM 18: Computation of the closure of the SDG G of a statement S = sequence(S1;...;Sn)

function closure(sequence(S1;...;Sn), G)
  D = DDG(sequence(S1;...;Sn))
  foreach Si ∈ {S1, S2, ..., Sn}
    τd = statement_vertex(Si, D)
    τk = statement_vertex(Si, G)
    foreach τp such that (τp ∈ predecessors(τd, D) and τp ∉ vertices(G))
      Rs = regions_intersection(read_regions(Si),
                                write_regions(vertex_statement(τp)))
      foreach R ∈ Rs
        sentry = import(reference(R))
        τentry = new_vertex(sentry)
        cluster(τentry) = cluster(τp)
        add_vertex(τentry, G)
        add_edge((τentry, τk), G)
  return G
end

For instance, in the C code given in the left of Figure 6.3, parallelizing S0 may decrease its execution time, but it may also increase the execution time of the loop indexed by i; this could be due to communications from outside S0, if we presume, to simplify, that every statement is scheduled on a different cluster. The corresponding SDG of S0, presented in Figure 6.4, is not complete, because of dependences coming from outside S0 that are not explicit in this DAG. In order to make G0 self-contained, we add these dependences via a set of additional import entry vertices τentry, scheduled in the same clusters as the corresponding statements outside S0. We extract these dependences from the DDG presented in the right of Figure 6.3, as illustrated in Function closure; the closure of G0 is given in Figure 6.14. The scheduling of S0 needs to take into account all the dependences coming from outside S0: we thus add three vertices to encode the communication costs of i, a[i] and c at the entry of S0.

Figure 6.14: Closure of the SDG G0 of the C code S0 in Figure 6.3, with additional import entry vertices.

6.5.2 Recursive Top-Down Scheduling

Hierarchically scheduling a given statement S of SDG H(S) in a cluster κ is seen here as the definition of a hierarchical schedule σ, which maps each substatement s of S to σ(s) = (s', κ, n). If there are enough processor and memory resources to schedule S using BDSC, (s', κ, n) is a triplet made of a parallel statement s' = parallel(σ(s)), the cluster κ = cluster(σ(s)) where s is being allocated, and the number n = nbclusters(σ(s)) of clusters that the inner scheduling of s' requires. Otherwise, scheduling is impossible, and the program stops. In a scheduled statement, all sequences are replaced by parallel unstructured statements.

A successful call to the HBDSC(S, H, κ, P, M, σ) function, defined in Algorithm 19, which assumes that P is strictly positive, yields a new version of σ that schedules S into κ and takes into account all substatements of S; only P clusters, with a data size of at most M each, can be used for scheduling. σ[S → (S', κ, n)] is the function equal to σ, except for S, where its value is (S', κ, n). H is the function that yields an SDG for each statement S to be scheduled using BDSC.
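As an illustration of what such a hierarchical schedule can look like concretely (a sketch with hypothetical names, not the PIPS data structure), σ may be kept as a map from statements to (parallel statement, cluster, number of inner clusters) triplets:

from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class ScheduleEntry:
    parallel: Any      # s' : the scheduled (parallel) version of the statement
    cluster: int       # κ  : the cluster the statement is allocated to
    nbclusters: int    # n  : clusters needed by the inner scheduling of s'

# σ as a dictionary keyed by statement identifiers; σ[S -> (S', κ, n)] is a
# functional update, i.e. a copy of σ with the binding for S replaced.
def sigma_update(sigma: Dict[str, ScheduleEntry], stmt: str,
                 entry: ScheduleEntry) -> Dict[str, ScheduleEntry]:
    new_sigma = dict(sigma)
    new_sigma[stmt] = entry
    return new_sigma

sigma: Dict[str, ScheduleEntry] = {}
sigma = sigma_update(sigma, "S0",
                     ScheduleEntry(parallel="S0'", cluster=0, nbclusters=2))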

Our approach is top-down, in order to yield tasks that are as coarse grained as possible when dealing with sequences. In the HBDSC function, we


ALGORITHM 19: BDSC-based update of Schedule σ for Statement S of SDG H(S), with P and M constraints

function HBDSC(S, H, κ, P, M, σ)
  switch (S)
    case call:
      return σ[S → (S, κ, 0)]
    case sequence(S1;...;Sn):
      Gseq = closure(S, H(S))
      G' = BDSC(Gseq, P, M, σ)
      iter = 0
      do
        σ' = HBDSC_step(G', H, κ, P, M, σ)
        G = G'
        G' = BDSC(Gseq, P, M, σ')
        if (clusters(G') = ∅) then
          abort('Unable to schedule')
        iter++
        σ = σ'
      while (completion_time(G') < completion_time(G) and
             |clusters(G')| <= |clusters(G)| and
             iter <= MAX_ITER)
      return σ[S → (dag_to_unstructured(G), κ, |clusters(G)|)]
    case forloop(I, Elower, Eupper, Sbody):
      σ' = HBDSC(Sbody, H, κ, P, M, σ)
      (S'body, κbody, nbclustersbody) = σ'(Sbody)
      return σ'[S → (forloop(I, Elower, Eupper, S'body), κ, nbclustersbody)]
    case test(Econd, St, Sf):
      σ' = HBDSC(St, H, κ, P, M, σ)
      σ'' = HBDSC(Sf, H, κ, P, M, σ')
      (S't, κt, nbclusters_t) = σ''(St)
      (S'f, κf, nbclusters_f) = σ''(Sf)
      return σ''[S → (test(Econd, S't, S'f), κ, max(nbclusters_t, nbclusters_f))]
end

distinguish four cases of statements. First, the constructs of loops8 and tests are simply traversed, scheduling information being recursively gathered in their different SDGs. Then, for a call statement, there is no descent into the call

8 Regarding parallel loops, since we adopt the task parallelism paradigm, note that, initially, it may be useful to apply the tiling transformation and then perform a full unrolling of the outer loop (we give more details in the protocol of our experiments, in Section 8.4). This way, the input code contains more potentially parallel tasks, resulting from the initial (parallel) loop.


graph: the call statement is returned. In order to handle the corresponding called function, one has to treat the different functions separately. Finally, for a sequence S, one first accesses its SDG and computes a closure Gseq of this DAG, using the function closure defined above. Next, Gseq is scheduled using BDSC, to generate a scheduled SDG G'.

The hierarchical scheduling process is then recursively performed, to take into account the substatements of S, within Function HBDSC_step, defined in Algorithm 20, on each statement s of each task of G'. There, G' is traversed along a topological sort-ordered descent, using the function topsort(G'), which yields a list of stages of computation, each cluster_stage being a list of independent lists L of tasks τ, one L for each cluster κ generated by BDSC for this particular stage in the topological order.

The recursive hierarchical scheduling, via HBDSC, within the function HBDSC_step, of each statement s = vertex_statement(τ) may take advantage of at most P' available clusters, since |cluster_stage| clusters are already reserved to schedule the current stage cluster_stage of tasks for Statement S. It yields a new scheduling function σs. Otherwise, if no clusters are available, all statements enclosed in s are scheduled on the same cluster κ as their parent. We use the straightforward function same_cluster_mapping (not provided here) to assign recursively (se, κ, 0) to σ(se), for each se enclosed in s.

Figure 6.15 illustrates the various entities involved in the computation of such a scheduling function. Note that one needs to be careful, in HBDSC_step, to ensure that each rescheduled substatement s is allocated a number of clusters consistent with the one used when computing its parallel execution time: we check the condition nbclusters's >= nbclusterss, which ensures that the parallelism assumed when computing time complexities within s remains available.

Figure 6.15: topsort(G) for the hierarchical scheduling of sequences.

Cluster allocation information, for each substatement s whose vertex in G' is τ, is maintained in σ via the recursive call to HBDSC, this time with the current cluster κ = cluster(τ). For the non-sequence constructs in Function HBDSC, cluster information is set to κ, the current cluster.


ALGORITHM 20: Iterative hierarchical scheduling step for DAG fixpoint computation

function HBDSC_step(G', H, κ, P, M, σ)
  foreach cluster_stage ∈ topsort(G')
    P' = P - |cluster_stage|
    foreach L ∈ cluster_stage
      nbclustersL = 0
      foreach τ ∈ L
        s = vertex_statement(τ)
        if (P' <= 0) then
          σ = σ[s → same_cluster_mapping(s, κ, σ)]
        else
          nbclusterss = (s ∈ domain(σ)) ? nbclusters(σ(s)) : 0
          σs = HBDSC(s, H, cluster(τ), P', M, σ)
          nbclusters's = nbclusters(σs(s))
          if (nbclusters's >= nbclusterss and
              task_time(τ, σ) >= task_time(τ, σs)) then
            nbclusterss = nbclusters's
            σ = σs
          nbclustersL = max(nbclustersL, nbclusterss)
      P' -= nbclustersL
  return σ
end


The scheduling of a sequence yields a parallel unstructured statement. We use the function dag_to_unstructured(G), which returns a SPIRE unstructured statement Su from the SDG G, where the vertices of G are the statement control vertices su of Su and the edges of G constitute the lists of successors Lsucc of the su, while the lists of predecessors Lpred of the su are deduced from Lsucc. Note that the vertices and edges of G are not changed before and after scheduling; the scheduling information, however, is saved in σ.

As an application of our scheduling algorithm to a real application, Figure 6.16 shows the scheduled SDG of Harris, obtained using P = 3 and generated automatically with PIPS, using the Graphviz tool.

6.5.3 Iterative Scheduling for Resource Optimization

BDSC is called in HBDSC before substatements are hierarchically scheduled. However, a unique pass over substatements could be suboptimal, since parallelism may exist within substatements; it may be discovered by later recursive calls to HBDSC. Yet, if this parallelism had been known ahead of time, the previous values of task_time used by BDSC would possibly have been smaller, which could have had an impact on the higher-level scheduling.


Figure 6.16: Scheduled SDG for Harris, using P = 3 cores; the scheduling information via cluster(τ) is also printed inside each vertex of the SDG.

In order to address this issue, our hierarchical scheduling algorithm iterates the top-down pass HBDSC_step on the new DAG G', in which BDSC takes into account these modified task complexities; the iteration continues while G' provides a smaller DAG scheduling length than G and while the iteration limit MAX_ITER has not been reached. We compute the completion time of the DAG G as follows:

completion_time(G) = max_{κ ∈ clusters(G)} cluster_time(κ)
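A small sketch of this stopping criterion (illustrative only; cluster_time is assumed to be available as a map from clusters to their accumulated execution times, and MAX_ITER is a placeholder value):

def completion_time(cluster_time):
    """cluster_time: {cluster: accumulated time of the tasks scheduled on it}."""
    return max(cluster_time.values())

def keep_iterating(new_sched, old_sched, iteration, MAX_ITER=5):
    # Iterate while the new schedule is strictly shorter, does not use more
    # clusters, and the iteration limit has not been reached (cf. Algorithm 19).
    return (completion_time(new_sched) < completion_time(old_sched)
            and len(new_sched) <= len(old_sched)
            and iteration <= MAX_ITER)

print(keep_iterating({0: 80, 1: 75}, {0: 120, 1: 10}, iteration=1))   # True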

One constraint, due to the iterative nature of the hierarchical scheduling, is that, in BDSC, zeroing cannot be performed between the import entry vertices and their successors. This keeps an independence, in terms of allocated clusters, between the different levels of the hierarchy. Indeed, at a higher level, for S, if we assume that we have already scheduled (hierarchically) the parts Se inside S, attempting to reschedule S iteratively cancels the previous schedule of S but maintains the schedule of the Se, and vice versa. Therefore, for each sequence, we have to deal with a new set of clusters, and thus zeroing cannot be performed between these entry vertices and their successors.

Note that our top-down, iterative, hierarchical scheduling approach also helps dealing with limited memory resources. If BDSC fails at first, because not enough memory is available for a given task, the HBDSC_step function is nonetheless called to schedule nested statements, possibly loosening the memory constraints by distributing some of the work on less memory-challenged additional clusters. This might enable the subsequent call to BDSC to succeed.


For instance, in Figure 6.17, we assume that P = 3 and that the memory M of each available cluster κ0, κ1 or κ2 is less than the size of Array a. Therefore, the first application of BDSC on S = sequence(S0;S1) will fail ("Not enough memory"). Thanks to the while loop introduced in the hierarchical scheduling algorithm, the scheduling of the hierarchical parts inside S0 and S1 hopefully minimizes the task data of each statement. Thus, a second schedule succeeds, as illustrated in Figure 6.18. Note that the edge cost between tasks S0 and S1 is equal to 0 (the region on the edge in the figure is equal to the empty set). Indeed, the communication is made at an inner level, between κ1 and κ3, due to the entry vertex added to H(S1body). This cost is obtained after the hierarchical scheduling of S; it is thus a parallel edge cost: we define the parallel task_data and edge_cost of a parallel statement task in Section 6.5.5.

H(S):

  S0: for(i=1; i<=n; i++) {   // S0body
        a[i] = foo(); bar(b[i]);
      }
        |
        |  <a(φ1)-W-R-EXACT-{1 <= φ1, φ1 <= n}>
        v
  S1: for(i=1; i<=n; i++) {   // S1body
        sum += a[i]; c[i] = baz();
      }

Figure 6.17: After the first application of BDSC on S = sequence(S0;S1), failing with "Not enough memory".

6.5.4 Complexity of the HBDSC Algorithm

Theorem 2. The time complexity of Algorithm 19 (HBDSC) over Statement S is O(k^n), where n is the number of call statements in S and k a constant greater than 1.

Proof. Let t(l) be the worst-case time complexity of our hierarchical scheduling algorithm on a structured statement S of hierarchical level9 l.

9 Levels represent the hierarchy structure between statements of the AST and are counted up from the leaves to the root.


H(S):
  S0 (κ0): for(i=1; i<=n; i++)   // S0body
             { a[i] = foo(); bar(b[i]); }
  S1 (κ0): for(i=1; i<=n; i++)   // S1body
             { sum += a[i]; c[i] = baz(); }

H(S0body):
  κ1: a[i] = foo();        κ2: bar(b[i]);

closure(S1body, H(S1body)):
  κ1: import(a[i])  --<a(φ1)-W-R-EXACT-{φ1 = i}>-->  κ3: sum += a[i];
  κ4: c[i] = baz();

Figure 6.18: Hierarchically (via H) scheduled SDGs with memory resource minimization, after the second application of BDSC (which succeeds) on sequence(S0;S1) (κi = cluster(σ(Si))). To keep the picture readable, only communication edges are shown in these SDGs.

Time complexity increases significantly only in sequences, loops and tests being simply managed by straightforward recursive calls of HBDSC on substatements. For a sequence S, t(l) is proportional to the time complexity of BDSC, followed by a call to HBDSC_step; the proportionality constant is k = MAX_ITER (supposed to be greater than 1).

The time complexity of BDSC for a sequence of m statements is at most O(m^3) (see Theorem 1). Assuming that all subsequences have a maximum number m of (possibly compound) statements, the time complexity of the hierarchical scheduling step function is the time complexity of the topological sort algorithm, followed by a recursive call to HBDSC, and is thus O(m^2 + m t(l-1)). Thus, t(l) is at most proportional to k(m^3 + m^2 + m t(l-1)) ~ km^3 + km t(l-1). Since t(l) is an arithmetico-geometric series, its analytical value is

t(l) = ((km)^l (km^3 + km - 1) - km^3) / (km - 1) ~ (km)^l m^2

Let lS be the level of the whole Statement S. The worst performance occurs when the structure of S is flat, i.e., when lS ~ n and m is O(1); hence t(n) = t(lS) ~ k^n.

Even though the worst-case time complexity of HBDSC is exponential, we expect, and our experiments suggest, that it behaves more tamely on actual, properly structured code. Indeed, note that lS ~ log_m(n) if S is balanced, for some large constant m; in this case, t(n) ~ (km)^log(n), showing a subexponential time complexity.

6.5.5 Parallel Cost Models

In Section 6.3, we presented the sequential cost models, usable in the case of sequential codes, i.e., for each first call to BDSC. When a substatement Se of S (Se ⊂ S) is parallelized, the parameters task_time, task_data and edge_cost are modified for Se, and thus for S. Thus, hierarchical scheduling must use extended definitions of task_time, task_data and edge_cost for tasks τ whose statements S = vertex_statement(τ) are parallel unstructured statements, extending the definitions provided in Section 6.3, which still apply to non-unstructured statements. For such a case, we assume that BDSC and the other relevant functions take σ as an additional argument, to access the scheduling result associated to statement sequences, and handle the modified definitions of task_time, edge_cost and task_data. These functions are defined as follows.

Parallel Task Time

When a statement S = vertex_statement(τ) corresponds to a parallel unstructured statement (i.e., S ∈ domain(σ)), let Su = parallel(σ(S)) be this parallel unstructured statement. We define task_time as follows, where Function clusters returns the set of clusters used to schedule Su and Function control_statements returns the statements of Su:

task_time(τ, σ) = max_{κ ∈ clusters(Su, σ)} cluster_time(κ)

clusters(Su, σ) = {cluster(σ(s)) | s ∈ control_statements(Su)}

Otherwise, the previous definition of task_time on S, provided in Section 6.3, is used. The recursive computation of complexities accesses this new definition in the case of parallel unstructured statements.
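A sketch of this extended definition (illustrative only; statements, clusters and per-statement times are given here as plain Python values rather than PIPS objects):

from collections import defaultdict

def parallel_task_time(control_statements, sigma, seq_task_time):
    """task_time of a parallel unstructured statement Su: the longest
    accumulated time over the clusters its inner statements are mapped to.
    sigma: {statement: cluster}; seq_task_time: {statement: time}."""
    cluster_time = defaultdict(int)
    for s in control_statements:
        cluster_time[sigma[s]] += seq_task_time[s]
    return max(cluster_time.values())

# Two statements on cluster 0, one on cluster 1:
print(parallel_task_time(["s1", "s2", "s3"],
                         {"s1": 0, "s2": 0, "s3": 1},
                         {"s1": 10, "s2": 5, "s3": 12}))   # 15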

Parallel Task Data and Edge Cost

To define the task_data and edge_cost parameters, which depend on the computation of array regions10, we need extended versions of the read_regions and write_regions functions that take into account the allocation information to clusters. These functions take σ as an additional argument, to access the allocation information cluster(σ).

10 Note that every operation combining the regions of two statements used in this chapter assumes the use of the path transformer, via the function region_transformer_compose as illustrated in Section 6.4.3, in order to bring the regions of the two statements into a common memory store.

S = vertex_statement(τ)
task_data(τ, σ) = regions_union(read_regions(S, σ), write_regions(S, σ))

Rw1 = write_regions(vertex_statement(τ1), σ)
Rr2 = read_regions(vertex_statement(τ2), σ)
edge_cost_bytes(τ1, τ2, σ) = Σ_{r ∈ regions_intersection(Rw1, Rr2)} ehrhart(r)
edge_cost(τ1, τ2, σ) = α + β × edge_cost_bytes(τ1, τ2, σ)

When statements are parallel unstructured statements Su, we propose to add a condition on the accumulation of regions in the sequential, recursive computation functions of these regions, as follows, where Function stmts_in_cluster returns the set of statements scheduled in the same cluster as Su:

read_regions(Su, σ)  = regions_union_{s ∈ stmts_in_cluster(Su, σ)} read_regions(s, σ)
write_regions(Su, σ) = regions_union_{s ∈ stmts_in_cluster(Su, σ)} write_regions(s, σ)
stmts_in_cluster(Su, σ) = {s ∈ control_statements(Su) | cluster(σ(s)) = cluster(σ(Su))}

Otherwise, the sequential definitions of read_regions and write_regions on S, provided in Section 6.3, are used. The recursive computation of array regions accesses this new definition in the case of parallel unstructured statements.
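As a toy illustration of these cost parameters (a sketch; regions are reduced here to plain sets of array elements, and α, β are placeholder machine parameters, not values taken from PIPS):

ALPHA = 100   # assumed communication latency (cycles)
BETA = 2      # assumed cost per transferred byte (cycles/byte)

def edge_cost_bytes(write_regions_t1, read_regions_t2, element_size=4):
    """Bytes to transfer between two tasks: size of the intersection of the
    regions written by the first task and read by the second one."""
    return len(write_regions_t1 & read_regions_t2) * element_size

def edge_cost(write_regions_t1, read_regions_t2):
    return ALPHA + BETA * edge_cost_bytes(write_regions_t1, read_regions_t2)

# Task 1 writes a[0..9], Task 2 reads a[5..14]: 5 shared elements of 4 bytes.
w1 = {("a", i) for i in range(10)}
r2 = {("a", i) for i in range(5, 15)}
print(edge_cost(w1, r2))   # 100 + 2*20 = 140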

6.6 Related Work: Task Parallelization Tools

In this section, we review several other approaches that intend to automate the parallelization of programs, using different granularities and scheduling policies. Given the breadth of the literature on this subject, we limit this presentation to approaches that focus on static list-scheduling methods, and compare them with BDSC.

Sarkar's work on the partitioning and scheduling of parallel programs [96] for multiprocessors introduced a compile-time method where a GR (Graphical Representation) program is partitioned into parallel tasks at compile time.



A GR graph has four kinds of vertices: "simple", to represent an indivisible sequential computation; "function call"; "parallel", to represent parallel loops; and "compound", for conditional instructions. Sarkar presents an approximate parallelization algorithm. Starting with an initial fine-granularity partition P0, tasks (chosen by heuristics) are iteratively merged until the coarsest partition Pn (with one task containing all vertices) is reached, after n iterations. The partition Pmin with the smallest parallel execution time in the presence of overhead (scheduling and communication overhead) is chosen. For scheduling, Sarkar introduces the EZ (Edge-Zeroing) algorithm, which uses bottom levels for ordering; it is based on edge weights for clustering: all edges are examined, from the largest edge weight to the smallest, and the algorithm proceeds by zeroing the highest edge weight if the completion time decreases. While this algorithm is based only on the bottom level, targets an unbounded number of processors and does not recompute the priorities after zeroings, BDSC adds resource constraints and is based on both bottom levels and dynamic top levels.

The OSCAR Fortran Compiler [62] is used as a multigrain parallelizer from Fortran to parallelized OpenMP Fortran. OSCAR partitions a program into a macro-task graph, where vertices represent macro-tasks of three kinds, namely basic, repetition and subroutine blocks. The coarse-grain task parallelization proceeds as follows. First, the macro-tasks are generated by decomposition of the source program. Then, a macro-flow graph is generated, to represent data and control dependences on macro-tasks. The macro-task graph is subsequently generated via the analysis of parallelism among macro-tasks, using an earliest-executable-condition analysis that represents the conditions on which a given macro-task may begin its execution at the earliest time, assuming precedence constraints. If a macro-task graph has only data dependence edges, macro-tasks are assigned to processors by static scheduling. If a macro-task graph has both data and control dependence edges, macro-tasks are assigned to processors at run time by a dynamic scheduling routine. In addition to dealing with a richer set of resource constraints, BDSC targets both shared and distributed memory systems, with a cost model based on communication, used data and time estimations.

Pedigree [83] is a compilation tool based on the program dependence graph (PDG). The PDG is extended by adding a new type of vertex, a "Par" vertex, which groups children vertices reachable via the same branch conditions. Pedigree proceeds by estimating a latency for each vertex and data dependence edge weights in the PDG. The scheduling process orders the children and assigns them to a subset of the processors. For scheduling, vertices with minimum early and late times are given the highest priority; the highest-priority ready vertex is selected for scheduling, based on the synchronization overhead and latency. While Pedigree operates on assembly code, PIPS and our extension for task parallelism using BDSC offer a higher-level, source-to-source parallelization framework. Moreover, Pedigree-generated code is specialized for symmetric multiprocessors only, while BDSC targets many architecture types, thanks to its resource constraints and cost models.



The SPIR (Signal Processing Intermediate Representation) compiler [31] takes a sequential dataflow program as input and generates a multithreaded parallel program for an embedded multicore system. First, SPIR builds a stream graph, where a vertex corresponds to a kernel function call or to the condition of an "if" statement, and an edge denotes a transfer of data between two kernel function calls or a control transfer by an "if" statement (true or false). Then, for task scheduling purposes, given a stream graph and a target platform, the task scheduler assigns each vertex to a processor of the target platform; it allocates stream buffers and generates DMA operations, under given memory and timing constraints. The degree of automation of BDSC is larger than SPIR's, because this latter system needs several keyword extensions added to the C code, denoting the streaming scope within applications. Also, the granularity in SPIR is the function, whereas BDSC uses several granularity levels.

We collect in Table 6.2 the main characteristics of each parallelization tool addressed in this section.

Tool           blevel  tlevel  Resource     Dependence       Execution time  Communication    Memory model
                               constraints  control   data   estimation      time estimation
HBDSC          √       √       √            X         √      √               √                Shared/distributed
Sarkar's work  √       X       X            X         √      √               √                Shared/distributed
OSCAR          X       √       X            √         √      √               X                Shared
Pedigree       √       √       X            √         √      √               X                Shared
SPIR           X       √       √            √         √      √               √                Shared

Table 6.2: Comparison summary between different parallelization tools.

6.7 Conclusion

This chapter presents our BDSC-based hierarchical task parallelization approach, which uses a new data structure, called SDG, and a symbolic execution and communication cost model based either on static code analysis or on a dynamic, instrumentation-based assessment tool. The HBDSC scheduling algorithm maps hierarchically, via a mapping function H, each task of the SDG to one cluster. The result is the generation of a parallel code represented in SPIRE(PIPS IR), using the parallel unstructured construct. We have integrated this approach within the PIPS automatic parallelization compiler infrastructure (see Section 8.2 for the implementation details).

The next chapter illustrates two parallel transformations of SPIRE(PIPS IR) parallel code and the generation of OpenMP and MPI codes from SPIRE(PIPS IR).

An earlier version of the work presented in the previous chapter (Chapter 5), together with a part of the work presented in this chapter, has been submitted to Parallel Computing [65].

Chapter 7

SPIRE-Based Parallel Code Transformations and Generation

Satisfaction lies in the effort, not in the attainment; full effort is full victory.

Gandhi

In the previous chapter, we described how sequential programs are analyzed, scheduled and represented in a parallel intermediate representation. Their parallel versions are expressed as graphs, using the parallel unstructured attribute of SPIRE, but such graphs are not directly expressible in parallel languages. To test the flexibility of this parallelization approach, this chapter covers three aspects of parallel code transformations and generation: (1) the transformation of SPIRE(PIPS IR)1 code from the unstructured parallel programming model to a structured (high-level) model, using the spawn and barrier constructs; (2) the transformation from the shared to the distributed memory programming model; and (3) the targeting of parallel languages from SPIRE(PIPS IR). We have implemented our SPIRE-derived parallel IR and the BDSC-based task parallelization algorithm in the PIPS middle-end. We generate both OpenMP and MPI code from the same parallel IR, using the PIPS backend and its prettyprinters.

7.1 Introduction

Generally, the process of source-to-source code generation is a difficult task. In fact, compilers use at least two intermediate representations when transforming a source code written in a programming language, abstracted in a first IR, into a source code written in another programming language, abstracted in a second IR. Moreover, this difficulty is compounded whenever the resulting source code is to be executed on a machine not targeted by its language. The portability issue was a key factor when we designed SPIRE as a generic approach (Chapter 4). Indeed, representing different parallel constructs from different parallel programming languages makes the code generation process generic. In the context of this thesis, we need to generate the equivalent parallel version of a sequential code; representing the parallel code in SPIRE facilitates parallel code generation in different parallel programming languages.

1 Recall that SPIRE(PIPS IR) means that the SPIRE transformation function is applied to the sequential IR of PIPS; it yields a parallel IR of PIPS.



The parallel code encoded in SPIRE(PIPS IR) that we generate in Chapter 6 uses an unstructured parallel programming model based on static task graphs. However, structured programming constructs increase the level of abstraction when writing parallel codes, and structured parallel programs are easier to analyze. Moreover, the parallel unstructured construct does not exist in parallel languages. In this chapter, we use SPIRE both for the input unstructured parallel code and for the output structured parallel code, by translating the SPIRE unstructured construct into spawn and barrier constructs. Another parallel transformation we address is to transform a SPIRE shared memory code into a SPIRE distributed memory one, in order to adapt the SPIRE code to message-passing libraries/languages such as MPI, and thus to distributed memory system architectures.

A parallel program transformation takes the parallel intermediate representation of a code fragment and yields a new, equivalent parallel representation for it. Such transformations are useful for optimization purposes or to enable other optimizations that intend to improve execution time, data locality, energy, memory use, etc. While parallel code generation outputs a final code, written in a parallel programming language, that can be compiled by its compiler and run on a target machine, a parallel transformation generates an intermediate code written in a parallel intermediate representation.

In this chapter, a parallel transformation takes a hierarchical schedule σ, defined in Section 6.5.2 on page 118, and yields another schedule σ', which maps each statement S in domain(σ) to σ'(S) = (S', κ, n), such that S' belongs to the resulting parallel IR, κ = cluster(σ(S)) is the cluster where S is allocated, and n = nbclusters(σ(S)) is the number of clusters the inner scheduling of S' requires. A cluster stands for a processor.

Today's typical architectures have multiple hierarchical levels: a multiprocessor has many nodes, each node has multiple clusters, each cluster has multiple cores, and each core has multiple physical threads. Therefore, in addition to distinguishing parallelism in terms of the abstraction level of parallel constructs, such as structured and unstructured, we can also classify parallelism into two types in terms of code structure: equilevel, when it is generated within a sequence, i.e., at zero hierarchy levels, and hierarchical, or multilevel, when one manages the code hierarchy, i.e., multiple levels. In this chapter, we handle these two types of parallelism, in order to enhance the performance of the generated parallel code.

The remainder of this chapter is structured into three sections. Firstly, Section 7.2 presents the first of the two parallel transformations we designed, the unstructured-to-structured parallel code transformation. Secondly, the transformation from shared memory to distributed memory codes is presented in Section 7.3. Moving to code generation issues, Section 7.4


details the generation of parallel programs from SPIRE(PIPS IR): first, Section 7.4.1 shows the mapping approach between SPIRE and different parallel languages; then, as illustrating cases, Section 7.4.2 introduces the generation of OpenMP and Section 7.4.3 the generation of MPI from SPIRE(PIPS IR). Finally, we conclude in Section 7.5. Figure 7.1 summarizes the parallel code transformations and generation steps applied in this chapter.

Unstructured SPIRE(PIPS IR), Scheduled SDG
  -> [Structuring]
     Structured SPIRE(PIPS IR)
       -> [OpenMP Task Generation: spawn -> omp task]
          OpenMP Pragma-Based IR
            -> [Prettyprinter OpenMP] -> OpenMP Code
       -> [Data Transfer Generation]
          send/recv-Based IR
            -> [SPMDization: spawn -> guards]
               MPI-Based IR
                 -> [Prettyprinter MPI] -> MPI Code

Figure 7.1: Parallel code transformations and generation: blue indicates this chapter's contributions, an ellipse a process and a rectangle results.


7.2 Parallel Unstructured to Structured Transformation

We presented in Chapter 4 the application of the SPIRE methodology to the PIPS IR case; Chapter 6 shows how to construct unstructured parallel code represented in SPIRE(PIPS IR). In this section, we detail the algorithm transforming unstructured (mid, graph-based level) parallel code into structured (high, syntactic level) code expressed in SPIRE(PIPS IR), and the implementation of the structured model in PIPS. The goal behind structuring parallelism is to increase the level of abstraction of parallel programs and to simplify analysis and optimization passes. Another reason is the fact that the parallel unstructured construct does not exist in parallel languages. However, this has a cost in terms of performance if the code contains arbitrary parallel patterns. For instance, the example already presented in Figure 4.3, which uses unstructured constructs to form a graph, and copied again for convenience in Figure 7.2, can be represented in a low-level, unstructured way using events ((b) of Figure 7.2) and in a structured way using spawn2 and barrier constructs ((c) of Figure 7.2). The structured representation (c) minimizes control overhead. There is, however, a trade-off between structuring and performance: if the tasks have the same execution time, the structured version is better than the unstructured one, since no synchronizations using events are necessary. Therefore, depending on the run-time properties of the tasks, it may be better to run one form or the other.

7.2.1 Structuring Parallelism

Function unstructured_to_structured, presented in Algorithm 21, uses a hierarchical schedule σ, which maps each substatement s of S to σ(s) = (s′, κ, n) (see Section 6.5), to handle new statements. It specifies how the unstructured scheduled statement parallel(σ(S)), presented in Section 6.5, is used to generate a structured parallel statement, also encoded in SPIRE(PIPS IR), using spawn and barrier constructs. Note that another transformation would be to generate events and spawn constructs from the unstructured constructs (as in (b) of Figure 7.2).

For an unstructured statement S (other constructs are simply traversed, parallel statements being reconstructed using the original statements), its control vertices are traversed along a topological sort-ordered descent. Since an unstructured statement is, by definition, also a graph (control statements are the vertices, while the predecessor (resp. successor) control statements Sn of a control statement S define edges from Sn to S (resp. from S to Sn)), we use the same idea as in Section 6.5 (topsort), but this time starting from an unstructured statement.

2. Note that, in this chapter, we assume for simplification purposes that constants, instead of temporary identifiers having the corresponding value, are used as the first parameter of spawn statements.


(a) Control-flow graph with vertices: nop (entry); x = 3; y = 5; z = x + y; u = x + 1; nop (exit).

(b) Event-based representation:

barrier(
   ex = newEvent(0);
   spawn(0,
      x = 3;
      signal(ex);
      u = x + 1);
   spawn(1,
      y = 5;
      wait(ex);
      z = x + y);
   freeEvent(ex)
)

(c) Structured representation:

barrier(
   spawn(0, x = 3);
   spawn(1, y = 5)
);
barrier(
   spawn(0, u = x + 1);
   spawn(1, z = x + y)
)

Figure 7.2: Unstructured parallel control flow graph (a), and its event-based (b) and structured (c) representations.

We thus use Function topsort_unstructured(S), which yields a list of stages of computation, each cluster stage being a list of independent lists L of statements s.
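To make this stage-by-stage traversal concrete, here is a minimal C sketch of one way such a staged topological sort could be computed over a DAG, using Kahn's algorithm level by level. The dag_t type and field names are hypothetical; this is not the PIPS implementation, and the grouping of each stage into per-cluster sublists is omitted.

#include <stdlib.h>

/* Hypothetical DAG: n vertices, successor lists succ[v] of length nsucc[v],
   and in-degrees indeg[v]. */
typedef struct { int n; int **succ; int *nsucc; int *indeg; } dag_t;

/* Returns the number of stages; stages[k] holds the vertices whose
   predecessors all belong to earlier stages, i.e., one cluster stage. */
int topsort_stages(dag_t *g, int ***stages_out, int **sizes_out) {
    int *indeg = malloc(g->n * sizeof(int));
    int **stages = malloc(g->n * sizeof(int *));
    int *sizes = malloc(g->n * sizeof(int));
    int done = 0, nstages = 0;
    for (int v = 0; v < g->n; v++) indeg[v] = g->indeg[v];
    while (done < g->n) {
        /* collect all current sources: they form one stage */
        int *stage = malloc(g->n * sizeof(int)), sz = 0;
        for (int v = 0; v < g->n; v++)
            if (indeg[v] == 0) stage[sz++] = v;
        if (sz == 0) break;          /* cycle: cannot happen for a DAG */
        /* remove the stage's vertices from the remaining graph */
        for (int k = 0; k < sz; k++) {
            int v = stage[k];
            indeg[v] = -1;           /* mark as processed */
            for (int j = 0; j < g->nsucc[v]; j++)
                indeg[g->succ[v][j]]--;
        }
        stages[nstages] = stage;
        sizes[nstages++] = sz;
        done += sz;
    }
    free(indeg);
    *stages_out = stages;
    *sizes_out = sizes;
    return nstages;
}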

Cluster stages are seen as a set of fork-join structures: a fork-join structure can be seen as a set of spawn statements (to be launched in parallel) enclosed in a barrier statement; it can also be seen as a parallel sequence. In our implementation of structured code generation, we choose to generate spawn and barrier constructs instead of parallel sequence constructs because, in contrast to the parallel sequence construct, the spawn construct allows the implementation of recursive parallel tasks. Besides, we can represent a parallel loop via the spawn construct, but it cannot be implemented using a parallel sequence.

New statements have to be added to domain(σ). Therefore, in order to set the third argument of σ (nbclusters(σ)) for these new statements, we use the function structured_clusters, which harvests the clusters used to schedule the list of statements in a sequence.


ALGORITHM 21: Bottom-up hierarchical unstructured-to-structured SPIRE transformation, updating a parallel schedule σ for Statement S

function unstructured_to_structured(S, p, σ)
   κ = cluster(σ(S)); n = nbclusters(σ(S));
   switch (S)
   case call:
      return σ
   case unstructured(Centry, Cexit, parallel):
      Sstructured = [];
      foreach cluster_stage ∈ topsort_unstructured(S)
         (Sbarrier, σ) = spawns_barrier(cluster_stage, κ, p, σ);
         nbarrier = |structured_clusters(statements(Sbarrier), σ)|;
         σ = σ[Sbarrier → (Sbarrier, κ, nbarrier)];
         Sstructured ||= [Sbarrier];
      S′ = sequence(Sstructured);
      return σ[S → (S′, κ, n)]
   case forloop(I, Elower, Eupper, Sbody):
      σ = unstructured_to_structured(Sbody, p, σ);
      S′ = forloop(I, Elower, Eupper, parallel(σ(Sbody)));
      return σ[S → (S′, κ, n)]
   case test(Econd, St, Sf):
      σ = unstructured_to_structured(St, p, σ);
      σ = unstructured_to_structured(Sf, p, σ);
      S′ = test(Econd, parallel(σ(St)), parallel(σ(Sf)));
      return σ[S → (S′, κ, n)]
end

function structured_clusters([S1, S2, ..., Sn], σ)
   clusters = ∅;
   foreach Si ∈ [S1, S2, ..., Sn]
      if (cluster(σ(Si)) ∉ clusters)
         clusters ∪= {cluster(σ(Si))};
   return clusters
end

Recall that Function statements of a sequence returns its list of statements.

We use Function spawns_barrier, detailed in Algorithm 22, to form spawn and barrier statements. In a cluster stage, each list L for each cluster κs corresponds to a spawn statement Sspawn. In order to assign a cluster number, as an entity, to the spawn statement, as required by SPIRE, we associate a cluster number to the cluster κs returned by cluster(σ(s)) via the cluster_number function; we thus posit that two clusters are equal if they have the same cluster number. However, since BDSC maps tasks to logical clusters in Algorithm 19, with numbers in [0, p − 1], we convert this logical number to a physical one using the formula nphysical = P − p + cluster_number(κs), where P is the total number of clusters available in the target machine.
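As a purely illustrative instance of this conversion (these numbers are not taken from any benchmark of this thesis): if the target machine offers P = 8 clusters and p = 3 clusters are still available at the current hierarchy level, a task that BDSC scheduled on logical cluster 1 is mapped to the physical cluster nphysical = 8 - 3 + 1 = 6.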

ALGORITHM 22: Spawns-barrier generation for a cluster stage

function spawns_barrier(cluster_stage, κ, p, σ)
   p′ = p − |cluster_stage|;
   SL = [];
   foreach L ∈ cluster_stage
      L′ = [];
      nbclustersL = 0;
      foreach s ∈ L
         κs = cluster(σ(s));              // all s have the same schedule κs
         if (p′ > 0)
            σ = unstructured_to_structured(s, p′, σ);
            L′ ||= [parallel(σ(s))];
            nbclustersL = max(nbclustersL, nbclusters(σ(s)));
      Sspawn = sequence(L′);
      nphysical = P − p + cluster_number(κs);
      cluster_number(κs) = nphysical;
      synchronization(Sspawn) = spawn(nphysical);
      σ = σ[Sspawn → (Sspawn, κs, |structured_clusters(L′, σ)|)];
      SL ||= [Sspawn];
      p′ −= nbclustersL;
   Sbarrier = sequence(SL);
   synchronization(Sbarrier) = barrier();
   return (Sbarrier, σ)
end

The spawn statements SL of one cluster stage are collected in a barrier statement Sbarrier as follows. First, the list of statements SL is constructed, using the || symbol for the concatenation of two lists of arguments. Then, Statement Sbarrier is created using Function sequence from the list of statements SL. Synchronization attributes have to be set for the new statements Sspawn and Sbarrier; we thus use the functions spawn(I) and barrier() to set the synchronization attribute, spawn or barrier, of the different statements.

Figure 7.3 depicts an example of a SPIRE structured representation (in the right-hand side) of a part of the C implementation of the main function of Harris, presented in the left-hand side. Note that the scheduling results from our BDSC-based task parallelization process, using P = 3 (we use three clusters because the maximum parallelism in Harris is three). Here, up to three parallel tasks are enclosed in a barrier for each chain of functions Sobel, Multiplication and Gauss. The synchronization annotations spawn and barrier are printed within the SPIRE generated code.


Left (C code, part of the main function of Harris):

in = InitHarris();
/* Sobel */
SobelX(Gx, in);
SobelY(Gy, in);
/* Multiply */
MultiplY(Ixx, Gx, Gx);
MultiplY(Iyy, Gy, Gy);
MultiplY(Ixy, Gx, Gy);
/* Gauss */
Gauss(Sxx, Ixx);
Gauss(Syy, Iyy);
Gauss(Sxy, Ixy);
/* Coarsity */
CoarsitY(out, Sxx, Syy, Sxy);

Right (SPIRE representation):

barrier(
   spawn(0, InitHarris(in))
);
barrier(
   spawn(0, SobelX(Gx, in));
   spawn(1, SobelY(Gy, in))
);
barrier(
   spawn(0, MultiplY(Ixx, Gx, Gx));
   spawn(1, MultiplY(Iyy, Gy, Gy));
   spawn(2, MultiplY(Ixy, Gx, Gy))
);
barrier(
   spawn(0, Gauss(Sxx, Ixx));
   spawn(1, Gauss(Syy, Iyy));
   spawn(2, Gauss(Sxy, Ixy))
);
barrier(
   spawn(0, CoarsitY(out, Sxx, Syy, Sxy))
)

Figure 7.3: A part of Harris, and one possible SPIRE core language representation.

Remember that the synchronization attribute is added to each statement (see Section 4.2.3 on page 56).

Multiple parallel code configuration schemes can be generated. In a first scheme, illustrated in Function spawns_barrier, we activate other clusters from an enclosing one, i.e., the enclosing cluster executes the spawn instructions (launches the tasks). A second scheme would be to use the enclosing cluster to execute one of the enclosed tasks: there is no need to generate the last task as a spawn statement, since it is executed by the current cluster. We show an example in Figure 7.4.

Although the barrier synchronization of statements with only one spawn statement can be omitted as another optimization (barrier(spawn(i, S)) = S), we choose to keep these synchronizations for pedagogical purposes, and plan to apply this optimization in a separate phase, as a parallel transformation, or to handle it at the code generation level.

7.2.2 Hierarchical Parallelism

Function unstructured_to_structured is recursive, in order to handle hierarchy in code. An example is provided in Figure 7.5, where we assume that P = 4. We use two clusters, κ0 and κ1, to schedule the cluster stage {S1, S2};


Non-optimized (left):

spawn(0, Gauss(Sxx, Ixx));
spawn(1, Gauss(Syy, Iyy));
spawn(2, Gauss(Sxy, Ixy));

Optimized (right):

spawn(1, Gauss(Sxx, Ixx));
spawn(2, Gauss(Syy, Iyy));
Gauss(Sxy, Ixy);

Figure 7.4: SPIRE representation of a part of Harris, non-optimized (left) and optimized (right).

the remaining clusters are then used to schedule the statements enclosed in S1 and S2. Note that Statement S12, where p = 2, is mapped to the physical cluster number 3 (3 = 4 − 2 + 1) for its logical number 1.

Left (C code S):

// S0
for(i = 1; i <= N; i++) {
   A[i] = 5;                // S01
   B[i] = 3;                // S02
}
// S1
for(i = 1; i <= N; i++) {
   A[i] = foo(A[i]);        // S11
   C[i] = bar();            // S12
}
// S2
for(i = 1; i <= N; i++)
   B[i] = baz(B[i]);        // S21
// S3
for(i = 1; i <= N; i++)
   C[i] += A[i] + B[i];     // S31

Right (SPIRE representation):

barrier(
   spawn(0, forloop(i, 1, N, 1,
      barrier(
         spawn(1, B[i] = 3);
         spawn(2, A[i] = 5)
      ),
      sequential))
);
barrier(
   spawn(0, forloop(i, 1, N, 1,
      barrier(
         spawn(2, A[i] = foo(A[i]));
         spawn(3, C[i] = bar())
      ),
      sequential));
   spawn(1, forloop(i, 1, N, 1,
      B[i] = baz(B[i]),
      sequential))
);
barrier(
   spawn(0, forloop(i, 1, N, 1,
      C[i] += A[i] + B[i],
      sequential))
)

Figure 7.5: A simple C code (hierarchical) and its SPIRE representation.

The barrier statements Sstructured, formed from all the cluster stages Sbarrier in the sequence S, constitute the new sequence S′ that we build in Function unstructured_to_structured; by default, the execution attribute of this sequence is sequential.

The code in the right-hand side of Figure 7.5 contains barrier statements with only one spawn statement, which can be omitted (as an optimization). Moreover, in the same figure, note how S01 and S02 are apparently reversed; this is a consequence, in our Algorithm 21, of the function topsort_unstructured, where statements are generated in the order of the graph traversal.

The execution attribute of the loops in Figure 7.5 is sequential, but they are actually parallel. Note that, regarding these loops, since we adopt the task parallelism paradigm, it may be useful to apply the tiling transformation and then perform a full unrolling of the outer loop (a sketch is given below); we give more details on this transformation in the protocol of our experiments, in Section 8.4. This way, the input code contains more parallel tasks: spawn loops with a sequential execution attribute resulting from the parallel loops.
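As an illustration of this idea (a minimal sketch only, assuming for simplicity that N is even and that the tile size is N/2; the function names original and tiled_and_unrolled are hypothetical), tiling followed by full unrolling of the outer tile loop turns one parallel loop into two independent sequential loops, each of which can then be scheduled as a separate task:

/* Original parallel loop: iterations are independent. */
void original(int N, int A[], int (*foo)(int)) {
    for (int i = 0; i < N; i++)
        A[i] = foo(A[i]);
}

/* After tiling (tile size N/2) and full unrolling of the outer tile loop:
   two independent sequential loops, candidates for two spawn statements. */
void tiled_and_unrolled(int N, int A[], int (*foo)(int)) {
    for (int i = 0; i < N / 2; i++)      /* candidate for spawn(0, ...) */
        A[i] = foo(A[i]);
    for (int i = N / 2; i < N; i++)      /* candidate for spawn(1, ...) */
        A[i] = foo(A[i]);
}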

7.3 From Shared Memory to Distributed Memory Transformation

This section provides a second illustration of the type of parallel transformations that can be performed thanks to SPIRE: the transformation from shared memory to distributed memory code. Given the complexity of this issue, as illustrated by the vast existing related work (see Section 7.3.1), the work reported in this section is preliminary and exploratory.

The send and recv primitives of SPIRE provide a simple communication API, adaptable to message-passing libraries/languages such as MPI, and thus to distributed memory system architectures. In these architectures, the data domain of whole programs should be partitioned into disjoint parts and mapped to different processes, which have separate memory spaces. If a process accesses a non-local variable, communications must be performed. In our case, send and recv SPIRE functions have to be added to the parallel intermediate representation of distributed memory code.

7.3.1 Related Work: Communication Generation

Many researchers have been generating communication routines between several processors (general purpose, graphical, accelerator) and have been trying to optimize the generated code. A large amount of literature already exists, and much work has been done to address this problem since the early 1990s. However, no practical and efficient general solution currently exists [27].

Amarasinghe and Lam [15] generate the send and receive instructions necessary for communication between multiple processors. Goumas et al. [49] automatically generate message-passing code for tiled iteration spaces. However, these two works handle only perfectly nested loops with uniform dependences. Moreover, the data parallel code generation proposed in [15] and [49] produces an SPMD (single program, multiple data) program to be run on each processor; this is different from our context of task parallelism, with its hierarchical approach.

Bondhugula [27], in his automatic distributed memory code generation technique, uses the polyhedral framework to compute communication sets between computations under a given iteration of the parallel dimension of a loop. Once more, the parallel model is nested loops. Indeed, he "only" needs to transfer written data from one iteration to another iteration of the same loop that reads them, or to aggregate final writes for all data once the entire loop is finished. In our case, we need to transfer data from one or many iterations of a loop to one or many different loops, knowing that the data spaces of these loops may be interlaced.

Cetus contains a source-to-source OpenMP to MPI transformation [23] where, at the end of a parallel construct, each participating process communicates the shared data it has produced and that other processes may use.

STEP [78] is a tool that transforms a parallel program containing high-level OpenMP directives into a message-passing program. To generate MPI communications, STEP constructs (1) send updated array regions, needed at the end of a parallel section, and (2) receive array regions, needed at the beginning of a parallel section. Habel et al. [52] present a directive-based programming model for hybrid distributed and shared memory systems. Using the pragmas distribute and gridify, they distribute nested loop dimensions and array data among processors.

The OpenMP to GPGPU compiler framework [74] transforms OpenMP into GPU code. In order to generate communications, it inserts memory transfer calls for all shared data accessed by the kernel functions and defined by the host, and for data written in the kernels and used by the host.

Amini et al. [17] generate memory transfers between a computer host and its hardware GPU (CPU-GPU) using data flow analysis; the latter is also based on array regions.

The main difference with our context of task parallelism with hierarchy is that these works generate communications between different iterations of nested loops or between one host and GPUs. Indeed, we have to transfer data between two different entire loops or between two iterations of different loops.

In this section, we describe how to figure out which pieces of data have to be transferred. We construct array region transfers in the context of a distributed-memory, homogeneous architecture.

7.3.2 Difficulties

Distributed-memory code generation becomes a problem much harder than the ones addressed in the above related works (see Section 7.3.1), because of the hierarchical aspect of our methodology. In this section, we illustrate how communication generation imposes a set of challenges in terms of efficiency and correctness of the generated code, and of the coherence of data communications.

Inter-Level Communications

In the example in Figure 7.6, we suppose this schedule given; although not efficient, it is useful to illustrate the difficulty of sending an array element by element from the source to the sink of a dependence extracted from the DDG. The difficulty occurs when this source and this sink belong to different levels of hierarchy in the code, i.e., of the loop tree or of test branches.

Left (shared memory C code S):

// S0
for(i = 0; i < N; i++) {
   A[i] = 5;                // S01
   B[i] = 3;                // S02
}
// S1
for(i = 1; i < N; i++) {
   A[i] = foo(A[i-1]);      // S11
   B[i] = bar(B[i]);        // S12
}

Right (SPIRE distributed memory equivalent):

forloop(i, 0, N-1, 1,
   barrier(
      spawn(0,
         A[i] = 5;
         if(i == 0,
            asend(1, A[i]),
            nop));
      spawn(1,
         B[i] = 3;
         asend(0, B[i]))
   ),
   sequential);
forloop(i, 1, N-1, 1,
   barrier(
      spawn(1,
         if(i == 1,
            recv(0, A[i-1]),
            nop);
         A[i] = foo(A[i-1]));
      spawn(0,
         recv(1, B[i]);
         B[i] = bar(B[i]))
   ),
   sequential)

Figure 7.6: An example of a shared memory C code and its SPIRE, communication primitive-based, distributed memory equivalent.

Communicating element by element is simple in the case of Array B in S12, since no loop-carried dependence exists. But, when a dependence distance is greater than zero, as is the case of Array A in S11, sending only A[0] to S11 is required, although generating exact communications in such cases requires more sophisticated static analyses that we do not have. Note that asend is a macro instruction for an asynchronous send; it is detailed in Section 7.3.3.


Non-Uniform Dependence Distance

In previous works (see Section 7.3.1), only uniform (constant) loop-carried dependences were considered. When a dependence is non-uniform, i.e., when the dependence distance is not constant, as in the two codes of Figure 7.7, knowing how long the dependences on Array A are, and generating exact communications in such cases, is complex. The problem is even more complicated when nested loops are used and arrays are multidimensional. Generating exact communications in such cases requires more sophisticated static analyses that we do not have.

First code:

// S0
for(i = 0; i < N; i++) {
   A[i] = 5;                // S01
   B[i] = 3;                // S02
}
// S1
for(i = 0; i < N; i++) {
   A[i] = A[i-p] + a;       // S11
   B[i] = A[i] * B[i];      // S12
   p = B[i];
}

Second code:

// S0
for(i = 0; i < N; i++) {
   A[i] = 5;                // S01
   B[i] = 3;                // S02
}
// S1
for(i = 0; i < N; i++) {
   A[i] = A[p*i-q] + a;     // S11
   B[i] = A[i] * B[i];      // S12
}

Figure 7.7: Two examples of C code with non-uniform dependences.

Exact Analyses

We presented in Section 6.3 our cost models; we rely on array region analysis to overapproximate communications. This analysis is not always exact: PIPS generates approximations (see Section 2.6.3 on page 24), denoted by the MAY attribute, if an over- or under-approximation has been used because the region cannot be exactly represented by a convex polyhedron. While BDSC scheduling is still robust in spite of these approximations (see Section 8.6), they can affect the correctness of the generated communications. For instance, in the code presented in Figure 7.8, the quality of the array region analysis may go from EXACT to MAY if a region contains holes, as is the case for S0. Also, while the region of A in the case of S1 remains EXACT, because the exact region is rectangular, the region of A in the case of S2 is MAY, since it contains holes.

Approximations are also propagated when computing the in and out regions for Array A (see Figure 7.9).


// S0
// <A(φ1)-R-MAY-{1 ≤ φ1, φ1 ≤ N}>
// <A(φ1)-W-MAY-{1 ≤ φ1, φ1 ≤ N}>
for(i = 1; i <= N; i++)
   if(i != N/2)
      A[i] = foo(A[i]);

// S1
// <A(φ1)-R-EXACT-{2 ≤ φ1, φ1 ≤ 2×N}>
// <B(φ1)-W-EXACT-{1 ≤ φ1, φ1 ≤ N}>
for(i = 1; i <= N; i++)
   // <A(φ1)-R-EXACT-{i+1 ≤ φ1, φ1 ≤ i+N}>
   // <B(φ1)-W-EXACT-{φ1 = i}>
   for(j = 1; j <= N; j++)
      B[i] = bar(A[i+j]);

// S2
// <A(φ1)-R-MAY-{2 ≤ φ1, φ1 ≤ 2×N}>
// <B(φ1)-W-EXACT-{1 ≤ φ1, φ1 ≤ N}>
for(j = 1; j <= N; j++)
   B[j] = baz(A[2*j]);

Figure 7.8: An example of a C code and its read and write array region analyses, for two communications, from S0 to S1 and from S0 to S2.

In this chapter, communications are computed using in and out regions, such as communicating A from S0 to S1 and to S2 in Figure 7.9.

// S0
// <A[φ1]-IN-MAY-{1 ≤ φ1, φ1 ≤ N}>
// <A[φ1]-OUT-MAY-{2 ≤ φ1, φ1 ≤ 2N, φ1 ≤ N}>
for(i = 1; i <= N; i++)
   if(i != N/2)
      A[i] = foo(A[i]);

// S1
// <A[φ1]-IN-EXACT-{2 ≤ φ1, φ1 ≤ 2N}>
for(i = 1; i <= n; i += 1)
   // <A[φ1]-IN-EXACT-{i+1 ≤ φ1, φ1 ≤ i+N}>
   for(j = 1; j <= n; j += 1)
      B[i] = bar(A[i+j]);

// S2
// <A[φ1]-IN-MAY-{2 ≤ φ1, φ1 ≤ 2N}>
for(j = 1; j <= n; j += 1)
   B[j] = baz(A[j*2]);

Figure 7.9: An example of a C code and its in and out array region analyses, for two communications, from S0 to S1 and from S0 to S2.


7.3.3 Equilevel and Hierarchical Communications

Since transferring data from one task to another when they are not at the same level of hierarchy is unstructured and too complex, designing a compilation scheme for such cases might be overly complex and could even jeopardize the correctness of the distributed memory code. In this thesis, we propose a compositional static solution that is structured and correct; it may communicate more data than necessary, but their coherence is guaranteed.

Communication Scheme

We adopt a hierarchical scheme of communication, between multiple levels of nesting of tasks and inside one level of hierarchy. An example of this scheme is illustrated in Figure 7.10, where each box represents a task τ. Each box is seen as a host of its enclosed boxes, and an edge represents a data dependence. This scheme illustrates an example of a configuration with two nested levels of hierarchy; the level is specified by the superscript on τ in the figure.

Once structured parallelism is generated (sequential sequences of barrier and spawn statements)³, the second step is to add send and recv SPIRE call statements, assuming a message-passing language model. We detail in this section our specification of the insertion of communications in SPIRE.

We divide parallelism into hierarchical and equilevel parallelism, and distinguish between two types of communications, namely inside a sequence, at zero hierarchy level (equilevel), and between hierarchy levels (hierarchical).

In Figure 7.11, where edges represent possible communications, τ^(1)_0, τ^(1)_1 and τ^(1)_2 are enclosed in the vertex τ^(0)_1, which is in the same sequence as τ^(0)_0. The dependence (τ^(0)_0, τ^(0)_1), for instance, denotes an equilevel communication of A; also, the dependence (τ^(1)_0, τ^(1)_2) denotes an equilevel communication of B, while (τ^(0)_1, τ^(1)_0) is a hierarchical communication of A. We use two main steps to construct communications, via the equilevel_com and hierarchical_com functions presented below.

The key limitation of our method is that it cannot enforce memory constraints when applying the BDSC-based hierarchical scheduling, which uses an iterative approach on the number of applications of BDSC (see Section 6.5). Therefore, our proposed solution for the generation of communications cannot be applied when strict limitations on memory are imposed at scheduling time. However, our code generator can easily verify whether the data used and written at each level can be maintained in the specific clusters.

As a summary, compared to the works presented in Section 7.3.1, we do not improve upon them (regarding non-affine dependences or efficiency).

3. Note that communications can also be constructed from unstructured code; one has to manipulate unstructured statements rather than sequence ones.


Figure 7.10: Scheme of communications between multiple possible levels of hierarchy in a parallel code; the levels are specified by the superscripts on the τ's. (Diagram: three copies of a two-level nesting of task boxes, τ^(0)_a enclosing τ^(1)_a and τ^(1)_b, which themselves enclose τ^(2)_a, τ^(2)_b, τ^(2)_c and τ^(2)_d, with data dependence edges drawn within and across levels.)

However, we extend their ideas to our problem of generating communications in task parallelism within a hierarchical code structure, which is more complex than theirs.

Assumptions

For our communication scheme to be valid, the following assumptions must be met for proper communication generation:

• all write regions are exact, in order to guarantee the coherence of data communications;


Figure 7.11: Equilevel and hierarchical communications.

• non-uniform dependences are not handled;

• more data than necessary may be communicated, but their coherence is guaranteed thanks to the topological sorting of the tasks involved in communications; this ordering is also necessary to avoid deadlocks;

• the order in which the send and receive operations are performed is matched, in order to avoid deadlock situations;

• the HBDSC algorithm is applied without the memory constraint (M = ∞);

• the input shared-memory code is a structured parallel code, with spawn and barrier constructs;

• all send operations are non-blocking, in order to avoid deadlocks. We presented in SPIRE (see Section 4.2.4 on page 60) how non-blocking communications can be implemented in SPIRE using the blocking primitives within spawned statements. Therefore, we assume that we use an additional process Ps for performing non-blocking send communications, and a macro instruction asend for asynchronous send, as follows (a minimal MPI-level sketch is given after this list):

  asend(dest, data) = spawn(Ps, send(dest, data))
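To illustrate why non-blocking sends avoid deadlocks, here is a minimal MPI sketch; the helper name async_send and the exchange example are hypothetical, and this is only one possible realization (at the MPI level rather than with a spawned process Ps) of what asend assumes.

#include <mpi.h>

/* Hypothetical helper mirroring SPIRE's asend: a non-blocking send whose
   completion is checked later, so the sender can proceed to its receives
   without risking a send/send deadlock. */
static MPI_Request async_send(const void *data, int count, MPI_Datatype t,
                              int dest) {
    MPI_Request req;
    MPI_Isend(data, count, t, dest, 0, MPI_COMM_WORLD, &req);
    return req;
}

/* Usage sketch: ranks 0 and 1 exchange an array. With blocking sends on
   both sides this pattern could deadlock; with MPI_Isend it cannot. */
void exchange(double *mine, double *other, int n, int rank) {
    int peer = 1 - rank;
    MPI_Request req = async_send(mine, n, MPI_DOUBLE, peer);
    MPI_Recv(other, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}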

7.3.4 Equilevel Communications Insertion Algorithm

Function equilevel_com, presented in Algorithm 23, first computes the equilevel data (array regions) to communicate, coming from the predecessors of Si (when neighbors = predecessors) or going to its successors (when neighbors = successors), inside an SDG G (a code sequence), using the function equilevel_regions presented in Algorithm 24.


After the call to the equilevel_regions function, equilevel_com creates a new sequence statement S′i via the sequence function; S′i contains the receive communication calls, Si, and the asend communication calls. After that, it updates σ with S′i and returns it. Recall that the || symbol is used for the concatenation of two lists.

ALGORITHM 23: Equilevel communications for Task τi inside SDG G, updating a schedule σ

function equilevel_com(τi, G, σ, neighbors)
   Si = vertex_statement(τi);
   coms = equilevel_regions(Si, G, σ, neighbors);
   L = (neighbors = predecessors) ? coms : ∅;     // recv calls come before Si
   L ||= [Si];
   L ||= (neighbors = successors) ? coms : ∅;     // asend calls come after Si
   S′i = sequence(L);
   return σ[Si → (S′i, cluster(σ(Si)), nbclusters(σ(Si)))]
end

Function equilevel_regions traverses all the neighbors τn (successors or predecessors) of τi (the task of Si), in order to generate asend or recv call statements for τi if they are not in the same cluster. Neighbors are traversed via a topologically sorted descent, in order to impose an order between the multiple send or receive communication calls between each pair of clusters.

ALGORITHM 24: Array regions for Statement Si to be transferred inside SDG G (equilevel)

function equilevel_regions(Si, G, σ, neighbors)
   κi = cluster(σ(Si));
   τi = statement_vertex(Si, G);
   coms = ∅;
   foreach τn ∈ topsort_ordered(G, neighbors(τi, G))
      Sn = vertex_statement(τn);
      κn = cluster(σ(Sn));
      if (κi ≠ κn)
         R = (neighbors = successors) ? transfer_data(τi, τn)
                                      : transfer_data(τn, τi);
         coms ||= com_calls(neighbors, R, κn);
   return coms
end

We thus guarantee the coherence of communicated data and avoid communication deadlocks. We use the function topsort_ordered, which first computes the topological sort of the transitive closure of G and then accesses the neighbors of τi in this order. For instance, the graph on the right of Figure 7.12 shows the result of the topological sort of the predecessors of τ4 of the graph on the left. Function equilevel_regions returns the list of communications performed inside G by Si.

Figure 7.12: A part of an SDG (left) and the topological sort order of the predecessors of τ4 (right). (Diagram: vertices τ1, τ2, τ3 and τ4, with edges labeled by the communicated data A and N.)

Communication calls are generated in two steps (the two functions presented in Algorithm 25).

First, the transfer_data function is used to compute which data elements have to be communicated, in the form of array regions. It returns the elements that are involved in a RAW dependence between the two tasks τ and τsucc. It filters the out and in regions of the two tasks in order to get these dependences; it uses edge_regions for this filtering. edge_regions are computed during the construction of the SDG, in order to maintain the dependences that label the edges of the DDG (see Section 6.2.1 on page 95). The intersection between the two results is returned⁴. For instance, in the graph on the right of Figure 7.12, if N is also written in τ1, filtering the out regions of τ1 avoids receiving N twice (N comes from τ1 and τ3).

Second, communication calls are constructed via the function com_calls which, from the cluster κ of the source statement (asend) or sink statement (recv) of the communication, and from the set of data in terms of regions R, generates a list of communication calls coms. These communication calls are the result of the function region_to_coms, already implemented in PIPS [20]. This function converts a region into a code of send/receive instructions that represent the communications of the integer points contained in a given parameterized polyhedron from this region. An example is illustrated in Figure 7.13, where the write regions are exact and the other required assumptions are verified.

4. We assume that Rr is combined with the path transformer between the statements of τ and τsucc.


ALGORITHM 25: Two functions to construct a communication call: data to transfer and call generation

function transfer_data(τ, τsucc)
   R′o = R′i = ∅;
   Ro = out_regions(vertex_statement(τ));
   foreach ro ∈ Ro
      if (∃ r ∈ edge_regions(τ, τsucc) such that entity(r) = entity(ro))
         R′o ∪= {ro};
   Ri = in_regions(vertex_statement(τsucc));
   foreach ri ∈ Ri
      if (∃ r ∈ edge_regions(τ, τsucc) such that entity(r) = entity(ri))
         R′i ∪= {ri};
   return regions_intersection(R′o, R′i)
end

function com_calls(neighbors, R, κ)
   coms = ∅;
   i = cluster_number(κ);
   com = (neighbors = successors) ? "asend" : "recv";
   foreach r ∈ R
      coms ||= region_to_coms(r, com, i);
   return coms
end

In (b) of the figure, the intersection between the in regions of S2 and the out regions of S1 is converted into a loop of communication calls, using the region_to_coms function. Note that the current version of our implementation prototype of region_to_coms is in fact unable to manage this somewhat complicated case. Indeed, the task running on Cluster 0 needs to know the value of the structure variable M to generate a proper communication with the task on Cluster 1 (see Figure 7.13 (b)); M would need to be sent by Cluster 1, where it is known, before the sending loop is run. Thus, our current implementation assumes that all the variables in the code generated by region_to_coms are known in each communicating cluster. For instance, in Figure 7.13, with M replaced by N, region_to_coms performs correctly. This constraint was not a limitation in the experimental benchmarks tested in Chapter 8.

Also, in the part of the main function presented in the left of Figure 7.3, the statement MultiplY(Ixy, Gx, Gy), scheduled on Cluster number 2, needs to receive both Gx and Gy from Clusters number 1 and 0 (see Figure 6.16). The receive call list thus generated for this statement using our algorithm is the following⁵ (given after Figure 7.13):

5. Since the whole array is received, the region is converted into a simple instruction; no loop is needed.


(a)

// S1
// <A(φ1)-W-EXACT-{1 ≤ φ1, φ1 ≤ N}>
// <A[φ1]-OUT-MAY-{2 ≤ φ1, φ1 ≤ 2×M}>
for(i = 1; i <= N; i++)
   A[i] = foo();
// S2
// <A[φ1]-R-MAY-{2 ≤ φ1, φ1 ≤ 2×M}>
// <A[φ1]-IN-MAY-{2 ≤ φ1, φ1 ≤ 2×M}>
for(i = 1; i <= M; i++)
   B[i] = bar(A[i*2]);

(b)

spawn(0,
   for(i = 1; i <= N; i++)
      A[i] = foo();
   for(i = 2; i <= 2*M; i++)
      asend(1, A[i])
);
spawn(1,
   for(i = 2; i <= 2*M; i++)
      recv(0, A[i]);
   for(i = 1; i <= M; i++)
      B[i] = bar(A[i*2])
)

Figure 7.13: Communication generation from a region, using the region_to_coms function.

recv(1, Gy);
recv(0, Gx);

7.3.5 Hierarchical Communications Insertion Algorithm

Data transfers are also needed between different levels of the code hierarchy, because each enclosed task requires data from its enclosing one, and vice versa. To construct these hierarchical communications, we introduce Function hierarchical_com, presented in Algorithm 26.

This function generates asend or recv call statements for each task S relative to its hierarchical parent (enclosing vertex), scheduled in Cluster κp. First, it constructs the list of transfers Rh, which gathers the data to be communicated to or from the enclosed vertex; Rh is defined as the in_regions or the out_regions of the task S. Then, from this set of transfers Rh, the communication calls coms (recv from the enclosing vertex, or asend to it) are generated.


ALGORITHM 26: Hierarchical communications for Statement S to its hierarchical parent, clustered in κp

function hierarchical_com(S, neighbors, σ, κp)
   Rh = (neighbors = successors) ? out_regions(S) : in_regions(S);
   coms = com_calls(neighbors, Rh, κp);
   seq = (neighbors = predecessors) ? coms : ∅;
   seq ||= [S];
   seq ||= (neighbors = successors) ? coms : ∅;
   S′ = sequence(seq);
   σ = σ[S → (S′, cluster(σ(S)), nbclusters(σ(S)))];
   return (σ, Rh)
end

Finally, σ, updated with the new statement S′, and the set of regions Rh (which represents the data to be communicated, this time on the side of the enclosing vertex) are returned.

For instance, in Figure 7.11, for the task τ^(1)_0, the data elements Rh to send from τ^(1)_0 to τ^(0)_1 are the out regions of the statement labeled τ^(1)_0. Thus, the hierarchical asend call from τ^(1)_0 to its hierarchical parent τ^(0)_1 is constructed using Rh as argument.

7.3.6 Communications Insertion Main Algorithm

Function communications_insertion, presented in Algorithm 27, constructs a new code in which the asend and recv primitives are generated. It uses a hierarchical schedule σ, which maps each substatement s of S to σ(s) = (s′, κ, n) (see Section 6.5), to handle the new statements. It also uses the hierarchical SDG mapping function H, which we use to recover the SDG of each statement (see Section 6.2).

Equilevel

To construct equilevel communications, all the substatements Si of a sequence S are traversed. For each Si, communications are generated inside each sequence via two calls to the function equilevel_com presented above: one with the function neighbors equal to successors, to compute the asend calls; the other with predecessors, to compute the recv calls. This ensures that each send operation is matched to an appropriately-ordered receive operation, avoiding deadlock situations. We use the function H, which returns an SDG G for S; G is essential to access the dependences, and thus the communications, between the vertices τ of G whose statements s = vertex_statement(τ) are in S.


ALGORITHM 27: Communication primitives (asend and recv) insertion in SPIRE

function communications_insertion(S, H, σ, κp)
   κ = cluster(σ(S)); n = nbclusters(σ(S));
   switch (S)
   case call:
      return σ
   case sequence(S1; ...; Sn):
      G = H(S);
      h_recvs = h_sends = ∅;
      foreach τi ∈ topsort(G)
         Si = vertex_statement(τi);
         κi = cluster(σ(Si));
         σ = equilevel_com(τi, G, σ, predecessors);
         σ = equilevel_com(τi, G, σ, successors);
         if (κp ≠ NULL and κp ≠ κi)
            (σ, h_Rrecv) = hierarchical_com(Si, predecessors, σ, κp);
            h_sends ||= com_calls(successors, h_Rrecv, κi);
            (σ, h_Rsend) = hierarchical_com(Si, successors, σ, κp);
            h_recvs ||= com_calls(predecessors, h_Rsend, κi);
         σ = communications_insertion(Si, H, σ, κi);
      S′ = parallel(σ(S));
      S″ = sequence(h_sends || [S′] || h_recvs);
      return σ[S′ → (S″, κ, n)]
   case forloop(I, Elower, Eupper, Sbody):
      σ = communications_insertion(Sbody, H, σ, κp);
      (S′body, κbody, nbclustersbody) = σ(Sbody);
      return σ[S → (forloop(I, Elower, Eupper, S′body), κ, n)]
   case test(Econd, St, Sf):
      σ = communications_insertion(St, H, σ, κp);
      σ = communications_insertion(Sf, H, σ, κp);
      (S′t, κt, nbclusterst) = σ(St);
      (S′f, κf, nbclustersf) = σ(Sf);
      return σ[S → (test(Econd, S′t, S′f), κ, n)]
end

Hierarchical

We also construct hierarchical communications, which represent data transfers between different levels of the hierarchy, because each enclosed task requires data from its enclosing one. We use the function hierarchical_com presented above, which generates asend or recv call statements for each task S relative to its hierarchical parent (enclosing vertex), scheduled in Cluster κp. Thus, the argument κp is propagated in communications_insertion, in order to keep the information on the cluster of the enclosing vertex correct. The returned sets of regions h_Rsend and h_Rrecv, which represent the data to be communicated at the start and at the end of the enclosed vertex τi, are used to construct, respectively, the send and recv communication calls on the side of the enclosing vertex of S. Note that hierarchical send communications need to be performed before running S′, to ensure that subtasks can run. Moreover, no deadlock can occur in this scheme, because the enclosing task executes its send/receive statements sequentially before spawning its enclosed tasks, which run concurrently.

The top level of a program presents no hierarchy. In order not to generate hierarchical communications that are not necessary, as an optimization, we use the test (κp ≠ NULL), knowing that, for the first call to the function communications_insertion, κp is set to NULL. This is necessary to prevent the generation of hierarchical communications for Level 0.

Example

As an example, we apply the communication generation phase to the code presented in Figure 7.5, to generate SPIRE code with communications; both the equilevel (marked with the comment equilevel) and the hierarchical (marked with the comment hierarchical) communications are illustrated in the right of Figure 7.14. Moreover, a graphical illustration of these communications is given in Figure 7.15.

7.3.7 Conclusion and Future Work

As said before, the work presented in this section is a preliminary and exploratory solution, which we believe is useful for suggesting the potential of our task parallelization approach in the distributed memory domain. It may also serve as a basis for future work to generate more efficient code than ours.

In fact, the solution proposed in this section may communicate more data than necessary, but their coherence is guaranteed thanks to the topological sorting of the tasks involved in communications, at each level of generation: equilevel (ordering of the neighbors of every task in an SDG) and hierarchical (ordering of the tasks of an SDG). Moreover, as future work, we intend to optimize the generated code by eliminating redundant communications and by aggregating small messages into larger ones.

For instance, hierarchical communications transfer all the out regions from the enclosed task to its enclosing task. A more efficient scheme would be not to send all the out regions of the enclosed task, but only the regions that will not be modified afterward. This is another track of optimization, to eliminate redundant communications, that we leave for future work.

Moreover, we optimize the generated code by imposing that the current cluster executes the last nested spawn.


Left (sequential C code):

for(i = 0; i < N; i++) {
   A[i] = 5;
   B[i] = 3;
}
for(i = 0; i < N; i++) {
   A[i] = foo(A[i]);
   C[i] = bar();
}
for(i = 0; i < N; i++)
   B[i] = baz(B[i]);
for(i = 0; i < N; i++)
   C[i] += A[i] + B[i];

Right (SPIRE distributed memory representation):

barrier(
   spawn(0, forloop(i, 1, N, 1,
      asend(1, i);                 // hierarchical
      asend(2, i);                 // hierarchical
      barrier(
         spawn(1,
            recv(0, i);            // hierarchical
            B[i] = 3;
            asend(0, B[i]));       // hierarchical
         spawn(2,
            recv(0, i);            // hierarchical
            A[i] = 5;
            asend(0, A[i]))        // hierarchical
      );
      recv(1, B[i]);               // hierarchical
      recv(2, A[i]),               // hierarchical
      sequential);
      asend(1, B))                 // equilevel
);
barrier(
   spawn(0, forloop(i, 1, N, 1,
      asend(2, i);                 // hierarchical
      asend(2, A[i]);              // hierarchical
      asend(3, i);                 // hierarchical
      barrier(
         spawn(2,
            recv(0, i);            // hierarchical
            recv(0, A[i]);         // hierarchical
            A[i] = foo(A[i]);
            asend(0, A[i]));       // hierarchical
         spawn(3,
            recv(0, i);            // hierarchical
            C[i] = bar();
            asend(0, C[i]))        // hierarchical
      );
      recv(2, A[i]);               // hierarchical
      recv(3, C[i]),               // hierarchical
      sequential));
   spawn(1,
      recv(0, B);                  // equilevel
      forloop(i, 1, N, 1,
         B[i] = baz(B[i]),
         sequential);
      asend(0, B))                 // equilevel
);
barrier(
   spawn(0,
      recv(1, B);                  // equilevel
      forloop(i, 1, N, 1,
         C[i] += A[i] + B[i],
         sequential))
)

Figure 7.14: Equilevel and hierarchical communication generation, and SPIRE distributed memory representation of a C code.


Figure 7.15: Graph illustration of the communications made in the C code of Figure 7.14: equilevel communications in green arcs, hierarchical ones in black. (Diagram: at Level 0 of the hierarchy, the four loops are scheduled on clusters κ0 and κ1; at Level 1, their enclosed statements are scheduled on clusters κ1, κ2 and κ3; the communicated data include i, A[i], B[i], C[i] and B.)

The result of applying this optimization to the code presented in the right of Figure 7.14 is illustrated in the left of Figure 7.16. We also eliminate barriers enclosing only one spawn statement; the result is illustrated in the right of Figure 7.16.

7.4 Parallel Code Generation

SPIRE is designed to fit different programming languages, thus facilitating the transformation of codes toward different types of parallel systems. Existing parallel language constructs can be mapped onto SPIRE. We show in this section how this intermediate representation simplifies the targeting of various languages, via a simple mapping approach, and use the prettyprinted generation of OpenMP and MPI as an illustration.

7.4.1 Mapping Approach

Our mapping technique to a parallel language is made of three steps, namely (1) the generation of parallel tasks and synchronization statements, by translating spawn and barrier attributes to this language; (2) the insertion of communications, by translating the asend and recv primitives to this target language, if the latter uses a distributed memory model; and (3) the generation of hierarchical (nested) parallelism, by exploiting the code hierarchy.

156 CHAPTER 7 SPIRE-BASED PARALLEL CODE

Left (the current cluster executes the last nested spawn):

barrier(
   forloop(i, 1, N, 1,
      asend(1, i);
      barrier(
         spawn(1,
            recv(0, i);
            B[i] = 3;
            asend(0, B[i]));
         A[i] = 5
      );
      recv(1, B[i]),
      sequential)
);
barrier(
   spawn(1, forloop(i, 1, N, 1,
      asend(2, i);
      asend(2, A[i]);
      barrier(
         spawn(2,
            recv(0, i);
            recv(0, A[i]);
            A[i] = foo(A[i]);
            asend(0, A[i]));
         C[i] = bar()
      );
      recv(2, A[i]),
      sequential));
   forloop(i, 1, N, 1,
      B[i] = baz(B[i]),
      sequential)
);
barrier(
   forloop(i, 1, N, 1,
      C[i] += A[i] + B[i],
      sequential)
)

Right (barriers with a single spawn statement are also removed):

forloop(i, 1, N, 1,
   asend(1, i);
   barrier(
      spawn(1,
         recv(0, i);
         B[i] = 3;
         asend(0, B[i]));
      A[i] = 5
   );
   recv(1, B[i]),
   sequential);
barrier(
   spawn(1, forloop(i, 1, N, 1,
      asend(2, i);
      asend(2, A[i]);
      barrier(
         spawn(2,
            recv(0, i);
            recv(0, A[i]);
            A[i] = foo(A[i]);
            asend(0, A[i]));
         C[i] = bar()
      );
      recv(2, A[i]),
      sequential));
   forloop(i, 1, N, 1,
      B[i] = baz(B[i]),
      sequential)
);
forloop(i, 1, N, 1,
   C[i] += A[i] + B[i],
   sequential)

Figure 7.16: Equilevel and hierarchical communication generation after optimizations: the current cluster executes the last nested spawn (left), and barriers with a single spawn statement are removed (right).

For efficiency reasons, we assume that the creation of tasks is done once at the beginning of programs; after that, we just synchronize the tasks, which are thus persistent.

Task Parallelism

The parallel code is generated in a fork/join manner: a barrier statement encloses the spawned parallel tasks. Mapping SPIRE to an actual parallel language consists in transforming the barrier and spawn annotations of SPIRE into the equivalent calls or pragmas available in the target language.

Data Distribution

If the target language uses a shared-memory paradigm, no additional changes are needed to handle it. Otherwise, if it is a message-passing library/language, a pass inserting the appropriate asend and recv primitives of SPIRE is necessary (see Section 7.3). After that, we generate the equivalent communication calls of the corresponding language from SPIRE, by transforming the asend and recv primitives into these equivalent communication calls.

Hierarchical Parallelism

Throughout this thesis, we take into account the hierarchical structure of code to achieve maximum efficiency. Indeed, exploiting the hierarchy of code when scheduling it via the HBDSC algorithm, within the SPIRE transformation pass via the unstructured_to_structured algorithm, and also when generating both equilevel and hierarchical communications, made it possible to handle hierarchical parallelism and thus to generate more parallelism. Moreover, we take into account the ability of parallel languages to implement hierarchical parallelism, and thus improve the performance of the generated parallel code. Figure 7.5 on page 138 illustrates zero and one level of hierarchy in the generated SPIRE code (right): we have generated equilevel and hierarchical SPIRE(PIPS IR).

7.4.2 OpenMP Generation

OpenMP 3.0 has been extended with tasks, and thus supports both task and hierarchical task parallelism for shared memory systems. For efficiency, since the creation of threads is done once at the beginning of the program, the omp parallel pragma is used once in the program (a minimal sketch of the resulting skeleton is given below). This section details the generation of OpenMP task parallelism features using SPIRE.
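As a minimal sketch of the overall skeleton this setup implies (the task bodies work_a and work_b are hypothetical placeholders, not part of any benchmark of this thesis), one parallel region is opened once, and a single thread inside it generates the tasks:

#include <omp.h>
#include <stdio.h>

static void work_a(void) { printf("task a\n"); }  /* hypothetical task bodies */
static void work_b(void) { printf("task b\n"); }

int main(void) {
    /* Threads are created once, at the beginning of the program ... */
    #pragma omp parallel
    {
        /* ... and a single thread generates the tasks, which any thread
           of the team may execute. */
        #pragma omp single
        {
            #pragma omp task
            work_a();
            #pragma omp task
            work_b();
            /* the implicit barrier at the end of the single region waits
               for the completion of the generated tasks */
        }
    }
    return 0;
}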

Task Parallelism

OpenMP supports two types of task parallelism scheduling models: dynamic, via the omp task directive, and static, via the omp section directive (see Section 3.4.4 on page 39). In our implementation of the OpenMP back-end generation, we choose to generate omp task instead of omp section tasks, for four reasons, namely:

1. the HBDSC scheduling algorithm uses a static approach, and we thus want to combine the static results of HBDSC with the run-time dynamic scheduling that OpenMP provides when using omp task, in order to improve scheduling;

2. unlike the omp section directive, the omp task directive allows the implementation of recursive parallel tasks;

3. omp section may only be used within an omp sections construct, and thus we cannot implement a parallel loop using omp sections;

4. the actual implementations of OpenMP do not provide the possibility of nesting many omp sections, and thus cannot take advantage of the BDSC-based hierarchical parallelization (HBDSC) feature to exploit more parallelism.

As regards the use of the OpenMP task construct, we apply a simple process: a SPIRE statement annotated with the synchronization attribute spawn is decorated with the pragma omp task. An example is illustrated in Figure 7.17, where the SPIRE code on the left-hand side is rewritten in OpenMP on the right-hand side.

Synchronization

In OpenMP, the user can specify synchronization explicitly, to handle control and data dependences, using, for example, the omp single pragma. There, only one thread (omp single pragma) will encounter a task construct (omp task), to be scheduled and executed by any thread in the team (the set of threads created inside an omp parallel pragma). Therefore, a statement annotated with the SPIRE synchronization attribute barrier is decorated with the pragma omp single. An example is illustrated in Figure 7.17.

Hierarchical Parallelism

SPIRE handles multiple levels of parallelism, providing a trade-off between parallelism and synchronization overhead. This is taken into account at scheduling time, using our BDSC-based hierarchical scheduling algorithm. In order to generate hierarchical parallelism in OpenMP, the same process as above, of translating the spawn and barrier SPIRE annotations, should be applied. However, synchronizing nested tasks (with the synchronization attribute barrier) using the omp single pragma cannot be done, because of a restriction of OpenMP-compliant implementations: single regions may not be closely nested inside explicit task regions. Therefore, we use the pragma omp taskwait for synchronization when encountering a statement annotated with a barrier synchronization.


Left (SPIRE):

barrier(
   spawn(0, Gauss(Sxx, Ixx));
   spawn(1, Gauss(Syy, Iyy));
   spawn(2, Gauss(Sxy, Ixy))
)

Right (generated OpenMP):

#pragma omp single
{
   #pragma omp task
   Gauss(Sxx, Ixx);
   #pragma omp task
   Gauss(Syy, Iyy);
   #pragma omp task
   Gauss(Sxy, Ixy);
}

Figure 7.17: SPIRE representation of a part of Harris, and its generated OpenMP task code.

The second barrier in the left of Figure 7.18 implements nested parallel tasks; omp taskwait is then used (right of the figure) to synchronize the two tasks spawned inside the loop. Note also that a barrier statement whose only content is a spawn(0, ...) statement (left of the figure) is omitted in the translation (right of the figure), for optimization purposes.

Left (SPIRE):

barrier(
   spawn(0,
      forloop(i, 1, N, 1,
         barrier(
            spawn(1, B[i] = 3);
            spawn(2, A[i] = 5)
         ),
         sequential))
);
barrier(
   spawn(0,
      forloop(i, 1, N, 1,
         C[i] = A[i] + B[i],
         sequential))
)

Right (generated OpenMP):

#pragma omp single
{
   for(i = 1; i <= N; i += 1) {
      #pragma omp task
      B[i] = 3;
      #pragma omp task
      A[i] = 5;
      #pragma omp taskwait
   }
   for(i = 1; i <= N; i += 1)
      C[i] = A[i] + B[i];
}

Figure 7.18: SPIRE representation of a C code, and its generated hierarchical OpenMP task code.

7.4.3 SPMDization: MPI Generation

MPI is a message-passing library; it is widely used to program distributed-memory systems. We show here the generation of communications between processes. In our technique, we adopt a model often called "SPMDization" of parallelism, where processes are created once at the beginning of the program (static creation), each process executes the same code, and the number of processes remains unchanged during execution. This allows the process creation and synchronization overheads to be reduced. We stated in Section 3.4.5 on page 40 that MPI processes are created when the MPI_Init function is called; it also defines the universal intracommunicator MPI_COMM_WORLD for all processes to drive the various communications. We use the rank-of-process parameter rank0 to manage the relation between processes and tasks; this rank is obtained by calling the function MPI_Comm_rank(MPI_COMM_WORLD, &rank0). This section details the generation of MPI using SPIRE.

Task Parallelism

A simple SPMDization technique is applied, where a statement with the synchronization attribute spawn, of parameter entity0 specifying the cluster number (a process, in the case of MPI), is filtered by an if statement on the condition rank0 == entity0. An example is illustrated in Figure 7.19.

Communication and Synchronization

MPI supports distributed memory systems: we have to generate communications between the different processes. The MPI_Isend and MPI_Recv functions are called to replace the communication primitives asend and recv of SPIRE; MPI_Isend [3] starts a non-blocking send. These communication functions are sufficient to ensure the synchronization between the different processes. An example is illustrated in Figure 7.19.

Hierarchical Parallelism

Most people use only MPI_COMM_WORLD as the communicator between MPI processes. This universal intracommunicator is not enough for handling hierarchical parallelism. The MPI implementation offers the possibility of creating subset communicators via the function MPI_Comm_split [80], which creates new communicators from an initially defined communicator; each new communicator can be used in one level of hierarchy of the parallel code (a sketch is given below). Another scheme consists in gathering the code of each process and then assigning this code to the corresponding process.
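As an illustration of the first scheme, which this thesis leaves as future work, a sub-communicator per hierarchy level could be derived from MPI_COMM_WORLD with MPI_Comm_split; the function level_communicator and the idea of using the level number as the split color are hypothetical choices, not part of our implementation.

#include <mpi.h>

/* Hypothetical: each process knows the hierarchy level of the code part
   it executes; processes of the same level share a sub-communicator. */
MPI_Comm level_communicator(int my_level) {
    int world_rank;
    MPI_Comm level_comm;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    /* color = level: processes with the same color end up in the same
       sub-communicator; key = world rank keeps the original ordering. */
    MPI_Comm_split(MPI_COMM_WORLD, my_level, world_rank, &level_comm);
    return level_comm;
}

/* The communications of a given level would then use level_comm instead of
   MPI_COMM_WORLD, and release it afterwards with MPI_Comm_free(&level_comm). */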

In our implementation of the generation of MPI code from SPIRE code, we leave the implementation of hierarchical MPI as future work; we implement only flat MPI code. Another type of hierarchical MPI is obtained by merging MPI and OpenMP together (hybrid programming) [12]; this is also called multi-core-aware message passing (MPI). We also discuss this in the future work section (see Section 9.2).


Left (SPIRE):

barrier(
   spawn(0,
      Gauss(Sxx, Ixx));
   spawn(1,
      Gauss(Syy, Iyy);
      asend(0, Syy));
   spawn(2,
      Gauss(Sxy, Ixy);
      asend(0, Sxy))
)

Right (generated MPI):

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank0);
if (rank0 == 0)
   Gauss(Sxx, Ixx);
if (rank0 == 1) {
   Gauss(Syy, Iyy);
   MPI_Isend(Syy, N*M, MPI_FLOAT, 0, MPI_ANY_TAG,
             MPI_COMM_WORLD, &request);
}
if (rank0 == 2) {
   Gauss(Sxy, Ixy);
   MPI_Isend(Sxy, N*M, MPI_FLOAT, 0, MPI_ANY_TAG,
             MPI_COMM_WORLD, &request);
}
MPI_Finalize();

Figure 7.19: SPIRE representation of a part of Harris, and its generated MPI code.

7.5 Conclusion

The goal of this chapter is to lay the first brick of a framework of parallel code transformations and generation based on our BDSC- and SPIRE-based hierarchical task parallelization techniques. We present two parallel transformations: first, from unstructured to structured parallelism, for a high-level abstraction of parallel constructs; and, second, the conversion of shared memory programs into distributed ones. For these two transformations, we juggle with SPIRE constructs in order to perform them. Moreover, we show the generation of task parallelism and communications at both levels, equilevel and hierarchical. Finally, we generate both OpenMP and MPI from the same parallel IR derived from SPIRE, except for hierarchical MPI, which we leave as future work.

The next chapter provides experimental results using the OpenMP and MPI programs generated via the methodology presented in this thesis, based on the HBDSC scheduling algorithm and SPIRE-derived parallel languages.

Chapter 8

Experimental Evaluation with PIPS

Experience is a lantern tied behind our backs, which illuminates the path. Confucius

In this chapter, we describe the integration of SPIRE and HBDSC within the PIPS compiler infrastructure, and their application to the parallelization of five significant programs, targeting both shared and distributed memory architectures: the image and signal processing benchmarks Harris and ABF, the SPEC2001 benchmark equake, the NAS parallel benchmark IS, and an FFT code. We present the experimental results we obtained for these applications; our experiments suggest that HBDSC's focus on efficient resource management leads to significant parallelization speedups on both shared and distributed memory systems, improving upon DSC results, as shown by the comparison of the sequential and parallelized versions of these five applications running on both OpenMP and MPI frameworks.

We also provide a comparative study between our task parallelization implementation in PIPS and that of the audio signal processing language Faust, using two Faust applications: Karplus32 and Freeverb.

8.1 Introduction

The HBDSC algorithm presented in Chapter 6 has been designed to offer better task parallelism extraction performance for parallelizing compilers than traditional list-scheduling techniques such as DSC. To verify its effectiveness, BDSC has been implemented in the PIPS compiler infrastructure and tested on programs written in C. Both the HBDSC algorithm and SPIRE are implemented in PIPS; recall that HBDSC is the hierarchical layer that adapts a scheduling algorithm (DSC, BDSC), which handles a DAG (sequence), to a hierarchical code. The analyses estimating the execution times of tasks and the dependences and communications between tasks, used to compute tlevel and blevel, are also implemented in PIPS.

In this chapter, we provide experimental BDSC-vs-DSC comparison results based on the parallelization of five benchmarks, namely ABF [51], Harris [94], equake [22], IS [85] and FFT [8]. We chose these particular benchmarks because they are well known and exhibit task parallelism that we hope our approach will be able to take advantage of. Note that, since our focus in this chapter is to compare BDSC with prior work, namely DSC, our experiments in this chapter do not address the hierarchical component of HBDSC.

BDSC uses numerical constants to label its input graph; we showed in Section 6.3 how we lift this restriction, using static and dynamic approaches based on array region and complexity analyses. In this chapter, we study experimentally the input sensitivity issues on the task and communication time estimations, in order to assess the scheduling robustness of BDSC, since it relies on the approximations provided by these analyses.

Several existing systems already automate the parallelization of programs targeting different languages (see Section 6.6), such as the non-open-source compiler OSCAR [62]. With the goal of comparing our approach with others, we chose the audio signal processing language Faust [86], because it is an open-source compiler that also automatically generates parallel tasks in OpenMP.

The remainder of this chapter is structured as follows. We detail the implementation of HBDSC in Section 8.2. Section 8.3 presents the scientific benchmarks under study, Harris, ABF, equake, IS and FFT, the two target machines that we use for our measurements, and also the effective cost model parameters that depend on each target machine. Section 8.4 explains the process we follow to parallelize these benchmarks. Section 8.5 provides performance results when these benchmarks are parallelized on the PIPS platform, targeting a shared memory architecture and a distributed memory architecture. We also assess the sensitivity of our parallelization technique to the accuracy of the static approximations of the code execution time used in task scheduling, in Section 8.6. Section 8.7 provides a comparative study between our task parallelization implementation in PIPS and that of Faust, when dealing with audio signal processing benchmarks.

8.2 Implementation of HBDSC- and SPIRE-Based Parallelization Processes in PIPS

We have implemented the HBDSC parallelization (Chapter 6) and some of the SPIRE-based parallel code transformation and generation (Chapter 7) algorithms in PIPS. PIPS is structured as a collection of independent code analysis and transformation phases, which we call passes. This section describes the order and the composition of the passes that we have implemented in PIPS to enable the parallelization of sequential C77 language programs. Figure 8.1 summarizes the parallelization process.

These passes are managed by the Pipsmake library1. Pipsmake manages objects that are resources stored in memory and/or on disk. The passes are described by generic rules that indicate the resources they use and produce.

1 http://cri.ensmp.fr/PIPS/pipsmake.html


[Figure 8.1: Parallelization process. Blue indicates thesis contributions; an ellipse, a pass or a set of passes; and a rectangle, results. The process chains: C source code → PIPS analyses (complexities, convex array regions) → sequence dependence graph (SDG) → cost models (static, or numerical profile via instrumentation) → HBDSC → unstructured SPIRE(PIPS IR) schedule → structuring → structured SPIRE(PIPS IR) → OpenMP and MPI generation → OpenMP code / MPI code.]

Examples of these rules are illustrated in the following subsections. Pipsmake maintains global data consistency intra- and inter-procedurally.

The goal of the following passes is to generate parallel code expressed in SPIRE(PIPS IR) using the HBDSC scheduling algorithm, then communication primitives, and finally OpenMP and MPI codes, in order to automate the task parallelization of sequential programs.

Tpips2 is the command-line interface and scripting language of the PIPS system. The execution of the necessary passes is obtained by typing their names in the command-line interpreter of PIPS, tpips. Figure 8.2 depicts harris.tpips, the script used to execute our shared and distributed memory parallelization processes on the benchmark Harris. It chains the passes described in the following subsections.

# Create a workspace for the program harris.c
create harris harris.c

apply SEQUENCE_DEPENDENCE_GRAPH[main]
shell dot -Tpng harris.database/main/main_sdg.dot > main_sdg.png
shell gqview main_sdg.png

setproperty HBDSC_NB_CLUSTERS 3
apply HBDSC_PARALLELIZATION[main]
apply SPIRE_UNSTRUCTURED_TO_STRUCTURED[main]

# Print the result on stdout
display PRINTED_FILE[main]

echo OMP style
activate OPENMP_TASK_GENERATION
activate PRINT_PARALLELIZEDOMPTASK_CODE
display PARALLELPRINTED_FILE[main]

echo MPI style
apply HBDSC_GEN_COMMUNICATIONS[main]
activate MPI_TASK_GENERATION
activate PRINT_PARALLELIZEDMPI_CODE
display PARALLELPRINTED_FILE[main]

# Print the scheduled SDG of harris
shell dot -Tpng harris.database/main/main_ssdg.dot > main_ssdg.png
shell gqview main_scheduled_ssdg.png

close
quit

Figure 8.2: Executable tpips script (harris.tpips) for harris.c.

2 http://www.cri.ensmp.fr/pips/tpips-user-manual.htdoc/tpips-user-manual.pdf


8.2.1 Preliminary Passes

Sequence Dependence Graph Pass

Pass sequence_dependence_graph generates the Sequence Dependence Graph (SDG), as explained in Section 6.2. It uses in particular the dg resource (data dependence graph) of PIPS and outputs the resulting SDG in the resource sdg.

sequence_dependence_graph       > MODULE.sdg
        < PROGRAM.entities
        < MODULE.code
        < MODULE.proper_effects
        < MODULE.dg
        < MODULE.regions
        < MODULE.transformers
        < MODULE.preconditions
        < MODULE.cumulated_effects

Code Instrumentation Pass

We model the communication costs, data sizes and execution times of the different tasks with polynomials. These polynomials have to be converted to numerical values in order to be used by the BDSC algorithm. We thus instrument the input sequential code, using the pass hbdsc_code_instrumentation, and run it once in order to obtain the numerical values of the polynomials. The instrumented code contains the initial user code plus statements that compute the values of the cost polynomials for each statement (as detailed in Section 6.3.2).

hbdsc_code_instrumentation      > MODULE.code
        < PROGRAM.entities
        < MODULE.code
        < MODULE.proper_effects
        < MODULE.dg
        < MODULE.regions
        < MODULE.summary_complexity
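As an illustration of what such instrumentation may look like, the sketch below shows a user loop followed by a hypothetical logging statement that evaluates and records the cost polynomial of that loop once its parameter n is known at run time. The file name, the label loop_1 and the exact polynomial are illustrative assumptions, not the actual output of hbdsc_code_instrumentation.

    #include <stdio.h>

    void example(int n, double a[n], double b[n])
    {
        /* Original user statement: a simple copy loop. */
        for (int i = 0; i < n; i++)
            a[i] = b[i];

        /* Hypothetical instrumentation: assume the cost polynomial of the
           loop above is 3*n (two memory accesses and one assignment per
           iteration); its numerical value is written out so that BDSC can
           later reuse it as a constant label. */
        FILE *f = fopen("instrumented_example.in", "a"); /* assumed file name */
        if (f) {
            fprintf(f, "loop_1 complexity = %d\n", 3 * n);
            fclose(f);
        }
    }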

Path Transformer Pass

The path transformer (see Section 6.4) between two statements computes the possible changes performed by a piece of code delimited by two statements Sbegin and Send, enclosed within a statement S. The goal of the following pass is to compute the path transformer between two statements labeled sbegin and send in the code resource. It outputs a resource file that contains the resulting transformer.


path_transformer        > MODULE.path_transformer_file
        < PROGRAM.entities
        < MODULE.code
        < MODULE.proper_effects
        < MODULE.transformers
        < MODULE.preconditions
        < MODULE.cumulated_effects
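For intuition, here is a small hand-written example of what a path transformer expresses; the code and the affine relation in the comments are illustrative, in the spirit of PIPS transformers, and are not output produced by the path_transformer pass.

    void f(int n)
    {
        int i = 0, s = 0;
      sbegin:
        s = s + n;        /* Sbegin: first statement of the path */
        i = i + 1;
        i = i + 2;
      send:
        s = s + i;        /* Send: last statement of the path */
        /* A path transformer between sbegin and send relates the memory
           store before Sbegin to the store before Send; here it could be
           written as the affine relation
               T(i, s) { i == i#init + 3, s == s#init + n }
           where i#init and s#init denote the values of i and s in the
           initial store (notation assumed for illustration). */
    }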

8.2.2 Task Parallelization Passes

HBDSC Task Parallelization Pass

Pass hbdsc_parallelization applies HBDSC on the resulting SDG (the input resource sdg) and generates parallel code in an unstructured form, using the resource sdg to implement the scheduled SDG. Resource schedule implements our hierarchical schedule function σ (see Section 6.5): it maps each statement to a given cluster. This pass implements the HBDSC algorithm presented in Section 6.5, Algorithm 19. Note that Algorithm 19 uses the unstructured construct instead of a scheduled SDG to represent unstructured parallelism; our prototype uses the SDG instead, for historical reasons. This pass uses the analyses of regions, preconditions, transformers, complexities and effects, and outputs the scheduled SDG in the resource sdg and the function σ in the resource schedule.

hbdsc_parallelization   > MODULE.sdg
                        > MODULE.schedule
        < PROGRAM.entities
        < MODULE.code
        < MODULE.proper_effects
        < MODULE.sdg
        < MODULE.regions
        < MODULE.summary_complexity
        < MODULE.transformers
        < MODULE.preconditions
        < MODULE.cumulated_effects

Parallelism Structuring Pass

Pass hbdsc_parallelization generates unstructured parallelism. The goal of the following pass is to structure this parallelism. Pass spire_unstructured_to_structured transforms the parallelism generated in the form of graphs (SDGs) into spawn and barrier constructs. It uses the scheduled SDG and the schedule function σ (the resource schedule), generates a new (structured) code, and updates the schedule resource, since it adds to the schedule the newly created statements of the new code. This pass implements Algorithm unstructured_to_structured presented in Section 7.2, Algorithm 21.

spire_unstructured_to_structured        > MODULE.code
                                        > MODULE.schedule
        < PROGRAM.entities
        < MODULE.code
        < MODULE.sdg
        < MODULE.schedule

HBDSC Tuning

In PIPS, one can use different properties to parameterize passes and select the precision required or the algorithm to use. Default properties are defined by PIPS, but they can be redefined by the user with the command setproperty when the tpips interface is used.

The following properties are used by Pass hbdsc_parallelization to allow the user to define the number of clusters and the memory size, based on the characteristics of the target architecture.

The number of clusters is set by default to 4, and the memory size to -1, which means that memory is considered infinite.

HBDSC_NB_CLUSTERS 4

HBDSC_MEMORY_SIZE -1

In order to control the granularity of parallel tasks, we introduce the following property. By default, we generate the maximum parallelism in codes. Otherwise, i.e., when COSTLY_TASKS_ONLY is TRUE, only loops and function calls are used as parallel tasks. This is important in order to make a trade-off between task creation overhead and task execution time.

COSTLY_TASKS_ONLY FALSE

The next property specifies whether communication, data and time information is to be extracted from the result of the instrumentation and stored in the HBDSC_INSTRUMENTED_FILE file, which is a dynamic analysis, or whether the static cost models have to be used. The default option is an empty string, which means that static results have to be used.

HBDSC_INSTRUMENTED_FILE
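For instance, a user targeting a quad-core machine with coarse-grain tasks only and an instrumentation-based cost model might set these properties in a tpips script before applying hbdsc_parallelization, as sketched below. The values, the memory-size unit and the exact quoting syntax for the string property are illustrative assumptions.

    setproperty HBDSC_NB_CLUSTERS 4
    setproperty HBDSC_MEMORY_SIZE 2000000000
    setproperty COSTLY_TASKS_ONLY TRUE
    setproperty HBDSC_INSTRUMENTED_FILE "instrumented_example.in"
    apply HBDSC_PARALLELIZATION[main]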

8.2.3 OpenMP Related Passes

SPIRE is designed to help generate code for different types of parallel languages, just as different parallel languages can be mapped into SPIRE. The four following passes show how this intermediate representation simplifies the task of producing code for various languages such as OpenMP and MPI (see Section 8.2.4).


OpenMP Task Generation Pass

Pass openmp_task_generation generates OpenMP task parallel code from SPIRE(PIPS IR), as detailed in Section 7.4.2. It replaces spawn constructs by the OpenMP directive omp task, and barrier constructs by omp single or omp taskwait. It outputs a new resource, parallelized_code, for the OpenMP code.

openmp_task_generation  > MODULE.parallelized_code
        < PROGRAM.entities
        < MODULE.code
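The following hand-written fragment illustrates the shape of such generated code: two tasks that were spawned on different clusters become omp task constructs nested in an omp single region, and the barrier that joins them becomes a taskwait. It is only a sketch of the expected output, not code actually produced by the pass; the loop bodies and cluster names are invented.

    #include <omp.h>

    void two_tasks_openmp(int n, double in[n], double out1[n], double out2[n])
    {
        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task          /* was: spawn on cluster kappa_0 */
            for (int i = 0; i < n; i++)
                out1[i] = in[i] * in[i];

            #pragma omp task          /* was: spawn on cluster kappa_1 */
            for (int i = 0; i < n; i++)
                out2[i] = 2.0 * in[i];

            #pragma omp taskwait      /* was: barrier */
        }
    }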

OpenMP Prettyprint Pass

The next pass, print_parallelizedOMPTASK_code, is a pretty-printing function. It outputs the code decorated with OpenMP (omp task) directives, using the parallelized_code resource.

print_parallelizedOMPTASK_code  > MODULE.parallelprinted_file
        < PROGRAM.entities
        < MODULE.parallelized_code

8.2.4 MPI Related Passes: SPMDization

Automatic Data Distribution Pass

Pass hbdsc_gen_communications generates send and recv primitives. It implements the algorithm communications_construction presented in Section 7.3, Algorithm 27. It takes as input the parallel code (< symbol) and the preceding resources of the sequential code (= symbol). It generates a new parallel code that contains communication calls.

hbdsc_gen_communications        > MODULE.code
        < PROGRAM.entities
        < MODULE.code
        = MODULE.dg
        = MODULE.schedule
        = MODULE.proper_effects
        = MODULE.preconditions
        = MODULE.regions

MPI Task Generation Pass

The same process applies for MPI. The pass mpi_task_generation generates MPI parallel code from SPIRE(PIPS IR), as detailed in Section 7.4.3. It replaces spawn constructs by guards (if statements), barrier constructs by MPI_Barrier, send functions by MPI_Isend and recv functions by MPI_Recv. Note that we leave the generation of hierarchical MPI as future work; our implementation generates only flat MPI code (a sequence of non-nested tasks).

mpi_task_generation     > MODULE.parallelized_code
        < PROGRAM.entities
        < MODULE.code
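As a rough illustration of this translation scheme, the sketch below shows two former spawn constructs turned into rank guards, with an asynchronous send and a matching receive for a value produced by process 0 and consumed by process 1. The loop bodies, message tag and communicated array are illustrative assumptions, not the pass's actual output.

    #include <mpi.h>

    void two_tasks_mpi(double *a, double *b, int n, int rank)
    {
        MPI_Request req;

        if (rank == 0) {                    /* was: spawn on cluster kappa_0 */
            for (int i = 0; i < n; i++)
                a[i] = a[i] + 1.0;
            /* a is used by the task mapped on process 1 */
            MPI_Isend(a, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        } else if (rank == 1) {             /* was: spawn on cluster kappa_1 */
            MPI_Recv(a, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            for (int i = 0; i < n; i++)
                b[i] = 2.0 * a[i];
        }
        MPI_Barrier(MPI_COMM_WORLD);        /* was: barrier */
    }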

MPI Prettyprint Pass

Pass print_parallelizedMPI_code outputs the code decorated with MPI instructions.

print_parallelizedMPI_code      > MODULE.parallelprinted_file
        < PROGRAM.entities
        < MODULE.parallelized_code

We generate MPI code in the format described in Figure 8.3.

/* declare variables */
MPI_Init(&argc, &argv);
/* parse arguments */
/* main program */
MPI_Finalize();

Figure 8.3: A model of an MPI generated code.

8.3 Experimental Setting

We carried out experiments to compare the DSC and BDSC algorithms: BDSC provides significant parallelization speedups on both shared and distributed memory systems on the set of benchmarks presented in this section. We also present the cost model parameters that depend on the target machine.

8.3.1 Benchmarks: ABF, Harris, Equake, IS and FFT

For comparing DSC and BDSC on both shared and distributed memory systems, we use five benchmarks, ABF, Harris, equake, IS and FFT, briefly presented below.


ABF

The signal processing benchmark ABF (Adaptive Beam Forming) [51] is a 1,065-line program that performs adaptive spatial radar signal processing for an array of smart radar antennas, in order to transmit or receive signals in different directions and enhance these signals by minimizing interference and noise. It has been developed by Thales [104] and is publicly available. Figure 8.4 shows the scheduled SDG of the function main of this benchmark on three clusters (κ0, κ1 and κ2). This function contains 7 parallel loops, while the degree of task parallelism is only three (there are at most three parallel tasks). In Section 8.4, we show how we use these loops to create more parallel tasks.

Harris

The image processing algorithm Harris [53] is a 105-line corner detector used to identify points of interest. It applies several functions (operators such as auto-correlation) to each pixel of an input image. It is used for feature extraction, motion detection and image matching. In Harris [94], at most three tasks can be executed in parallel. Figure 6.16 on page 122 shows the scheduled SDG for Harris obtained using three processors (a.k.a. clusters).

Equake (SPEC Benchmarks)

The 1,432-line SPEC benchmark equake [22] simulates seismic wave propagation in large valleys, based on computations performed on an unstructured mesh that locally resolves wavelengths, using the Archimedes finite element method. Equake contains four hot loops: three in the function main and one in the function smvp_opt. A part of the function main is illustrated in Figure 6.5 on page 99. We use these four loops for parallelization.

IS (NAS Parallel Benchmarks)

Integer Sort (IS) is one of the eleven benchmarks in the NAS Parallel Benchmarks suite [85]. The serial version contains 1,076 lines. We exploit the hot function rank of IS, which exhibits a degree of task parallelism of two and contains two parallel loops.

FFT

[Figure 8.4: Scheduled SDG of the main function of ABF.]

Fast Fourier Transform (FFT) is widely used in many areas of science, mathematics and engineering. We use the implementation of FFT3 of [8]. In FFT, for N complex points, both time domain decomposition and frequency domain synthesis require three loops. The outer loop runs through the log2 N stages. The middle loop moves through each of the individual frequency spectra (chunks) in the stage being worked on. The innermost loop uses the butterfly pattern to compute the points for each frequency of the spectrum. We use the middle loop for parallelization, since it is parallel (each chunk can be processed in parallel).
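The three-loop organization described above can be sketched as follows; this is only a structural skeleton (the butterfly arithmetic is elided), meant to show which loop our approach parallelizes, and it is not the actual FFT code of [8].

    /* Structural sketch of an iterative radix-2 FFT on n = 2^m points. */
    void fft_skeleton(int m /* log2(n) */)
    {
        int n = 1 << m;
        for (int stage = 0; stage < m; stage++) {          /* sequential outer loop */
            int chunk_size = 1 << (stage + 1);
            /* Middle loop over independent chunks: this is the loop that we
               tile, fully unroll and schedule as parallel tasks. */
            for (int chunk = 0; chunk < n; chunk += chunk_size) {
                for (int k = chunk; k < chunk + chunk_size / 2; k++) {
                    /* butterfly computation on points k and k + chunk_size/2
                       (elided) */
                }
            }
        }
    }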


8.3.2 Experimental Platforms: Austin and Cmmcluster

We use the two following experimental platforms for our performance evaluation tests.

Austin

The first target machine is a Linux (Ubuntu) shared memory host with a 2-socket AMD quad-core Opteron (8 cores) and M = 16 GB of RAM, running at 2.4 GHz. We call this machine Austin in the remainder of this chapter. The installed gcc version is 4.6.3, which supports OpenMP 3.0 (we use gcc with -fopenmp to compile OpenMP programs and -O3 as optimization flag).

Cmmcluster

The second machine that we target for our performance measurements is a Linux (RedHat) distributed memory host with 6 dual-core Intel(R) Xeon(R) processors and M = 32 GB of RAM per processor, running at 2.5 GHz. We call this machine Cmmcluster in the remainder of this chapter. The installed versions are gcc 4.4.6 and Open MPI 1.6.2 (we use mpicc to compile MPI programs and -O3 as optimization flag).

8.3.3 Effective Cost Model Parameters

In the default cost model of PIPS, the number of clock cycles per instruction (CPI) is equal to 1, and the number of clock cycles for the transfer of one byte is β = 1. Instead of using this default cost model, we use an effective model for each target machine, Austin and Cmmcluster. Table 8.1 summarizes the cost model for the integer and floating-point operations of addition, subtraction, multiplication and division, and the transfer cost of one byte. The parameters for the arithmetic operations in this table are extracted from agner.org3, which gives measurements of CPU clock cycles for AMD (Austin) and Intel (Cmmcluster) CPUs. The transfer cost of one byte, in number of cycles, β, has been measured using a simple MPI program that sends a number of bytes over the Cmmcluster memory network. Since Austin is a shared memory machine that we use to execute OpenMP codes, we do not need to measure β for this machine (NA).

8.4 Protocol

We have extended PIPS with our implementation in C of BDSC-based hierarchical scheduling (HBDSC). To compute the static execution time and communication cost estimates needed by HBDSC, we relied upon the PIPS run-time complexity analysis and a more realistic, architecture-dependent communication cost matrix (see Table 8.1).

3 http://www.agner.org/optimize/instruction_tables.pdf


Operation                 Austin          Cmmcluster
                        int   float      int   float
Addition (+)            2.5     2          3     3
Subtraction (-)         2.5     2          3     3
Multiplication (×)       3      2         11     3
Division (/)            40     30         46    39
One byte transfer (β)   NA     NA         2.5   2.5

Table 8.1: CPI for the Austin and Cmmcluster machines, plus the transfer cost of one byte β on the Cmmcluster machine (in cycles).
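As a hand-computed illustration of how such CPI values turn a symbolic complexity into cycles, the helper below estimates the cost of an invented statement that performs nadd floating-point additions, nmul floating-point multiplications and transfers nbytes bytes, using the Cmmcluster column of Table 8.1 (and assuming the reconstructed value β = 2.5 cycles per byte). It is only a sketch of the conversion, not the actual PIPS cost-model code.

    /* Cycle estimate implied by Table 8.1 for the Cmmcluster machine. */
    static double estimated_cycles(double nadd, double nmul, double nbytes)
    {
        const double cpi_fadd = 3.0, cpi_fmul = 3.0, beta = 2.5;
        return nadd * cpi_fadd + nmul * cpi_fmul + nbytes * beta;
    }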

For each code S of our test benchmarks, PIPS performed automatic task parallelization, applying our hierarchical scheduling process HBDSC(S, P, M, ⊥) (using either BDSC or DSC) on these sequential programs to yield a schedule σunstructured (unstructured code). Then PIPS transforms the schedule σunstructured via unstructured_to_structured(S, P, σunstructured) to yield a new schedule σstructured. PIPS automatically generated an OpenMP [4] version from the parallel statements encoded in SPIRE(PIPS IR) in σstructured(S), using omp task directives; another version in MPI [3] was also generated. We also applied the DSC scheduling process on these benchmarks and generated the corresponding OpenMP and MPI codes. The execution times have been measured using the gettimeofday time function.
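The wall-clock measurement pattern we refer to is the usual gettimeofday bracket around the region of interest; the sketch below is a generic illustration, not an excerpt from our measurement harness.

    #include <stddef.h>
    #include <sys/time.h>

    /* Returns the elapsed wall-clock time, in seconds, of run(), a
       placeholder for the (sequential or parallel) region being timed. */
    double time_region(void (*run)(void))
    {
        struct timeval start, end;
        gettimeofday(&start, NULL);
        run();
        gettimeofday(&end, NULL);
        return (end.tv_sec - start.tv_sec)
             + (end.tv_usec - start.tv_usec) * 1e-6;
    }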

Compilation times for these benchmarks were quite reasonable, the longest (equake) being 84 seconds. In this last instance, most of the time (79 seconds) was spent by PIPS to gather semantic information such as regions, complexities and dependences; our prototype implementation of HBDSC is only responsible for the remaining 5 seconds.

We ran all these parallelized codes on the two shared and distributed memory computing systems presented in Section 8.3.2. To increase the available coarse-grain task parallelism in our test suite, we have used both unmodified and modified versions of our benchmarks. We tiled and fully unrolled by hand the most costly loops in ABF (7 loops), FFT (2 loops) and equake (4 loops); the tiling factor for the BDSC version depends on the number of available processors P (tile size = number of loop iterations / P), while we had to find the proper one for DSC, since DSC puts no constraint on the number of needed processors but returns the number of processors its schedule requires. For Harris and IS, our experiments have looked at both tiled and untiled versions of the benchmarks. The full unrolling that we apply for our experiments creates a sequence of statements from a nested loop. An example of loop tiling and full unrolling of a loop of the ABF benchmark is illustrated in Figure 8.5. The function COR estimates covariance based on a part of the received signal sel_out, averaged on Nb_rg range gates and nb_pul pulses. Note that the applied loop tiling is equivalent to loop chunking [82].

/* Parallel loop */
for (i = 0; i < 19; i++)
    COR(Nb_rg, nb_pul, sel_out[i], CORR_out[i]);

(a) Loop on Function COR

for (i = 0; i < 19; i += 5)
    for (j = i; j < min(i+5, 19); j++)
        COR(Nb_rg, nb_pul, sel_out[j], CORR_out[j]);

(b) Loop block tiling on Function COR (tile size = 5 for P = 4)

for (j = 0; j < 5; j++)
    COR(Nb_rg, nb_pul, sel_out[j], CORR_out[j]);
for (j = 5; j < 10; j++)
    COR(Nb_rg, nb_pul, sel_out[j], CORR_out[j]);
for (j = 10; j < 15; j++)
    COR(Nb_rg, nb_pul, sel_out[j], CORR_out[j]);
for (j = 15; j < 19; j++)
    COR(Nb_rg, nb_pul, sel_out[j], CORR_out[j]);

(c) Full loop unrolling on Function COR

Figure 8.5: Steps for the rewriting of data parallelism to task parallelism using the tiling and unrolling transformations.

For our experiments, we use M = 16 GB for Austin and M = 32 GB for Cmmcluster; the schedules used for our experiments thus satisfy these memory sizes.

8.5 BDSC vs. DSC

We apply our parallelization process on the benchmarks ABF, Harris, equake and IS, and execute the parallel versions on both the Austin and Cmmcluster machines. As for FFT, we generate an OpenMP version and study how more performance could be extracted using streaming languages.

8.5.1 Experiments on Shared Memory Systems

We measured the execution time of the parallel OpenMP codes on P = 1, 2, 4, 6 and 8 cores of Austin.

Figure 8.6 shows the performance results of the generated OpenMP code for the two versions scheduled using DSC and BDSC on ABF and equake. The speedup data show that the DSC algorithm is not scalable on these examples when the number of cores is increased; this is due to the generation of more clusters (task creation overhead) with empty slots (poor potential parallelism and bad load balancing) than with BDSC, a costly decision given that, when the number of clusters exceeds P, they have to share the same core as multiple threads.

[Figure 8.6: ABF and equake speedups with OpenMP (DSC vs. BDSC) on 1 to 8 cores; sequential times: ABF 0.35 s, equake 230 s.]

Figure 8.7 presents the speedups obtained using P = 3, since the maximum parallelism in Harris is three, assuming no exploitation of data parallelism, for two parallel versions: BDSC with and without tiling of the kernel CoarsitY (we tiled by 3). The performance is given for three different input image sizes: 1024x1024, 2048x1024 and 2048x2048. The best speedup corresponds to the tiled version with BDSC, because in this case the three cores are fully loaded. The DSC version (not shown in the figure) yields the same results as our versions, because the code can be scheduled using three cores.

Figure 8.8 shows the performance results of the generated OpenMP code on the NAS benchmark IS after applying BDSC. The maximum task parallelism without tiling in IS is two, which is shown in the first subchart; the other subcharts are obtained after tiling. The program has been run with three IS input classes (A, B and C [85]). The bad performance of our implementation for Class A programs is due to the large task creation overhead, which dwarfs the potential parallelism gains, even more limited here because of the small size of Class A data.

As for FFT, after tiling and fully unrolling the middle loop of FFT and parallelizing it, an OpenMP version is generated using PIPS. Since we only exploit the parallelism of the middle loop, this is not sufficient to obtain good performance: a maximum speedup of 1.5 is obtained for an FFT of size n = 2^20 on Austin (8 cores). The sequential outer loop of log2 n iterations explains this poor performance. However, our technique can be seen as a first step of parallelization for languages such as OpenStream.


[Figure 8.7: Speedups with OpenMP: impact of tiling (P = 3). Harris-OpenMP and Harris-tiled-OpenMP, 3 threads vs. sequential, for image sizes 1024x1024 (sequential = 183 ms), 2048x1024 (345 ms) and 2048x2048 (684 ms).]

[Figure 8.8: Speedups with OpenMP for different class sizes (IS). Versions: OMP (2 tasks), OMP-tiled-2, OMP-tiled-4, OMP-tiled-6, OMP-tiled-8; Class A (sequential = 1.68 s), Class B (4.55 s), Class C (28 s).]

OpenStream [92] is a data-flow extension of OpenMP in which dynamically dependent tasks can be defined. In [91], an efficient version of the FFT is provided where, in addition to parallelizing the middle loop (data parallelism), possible streaming parallelism between the outer loop stages is exploited: once a chunk is executed, the chunks whose dependences are satisfied in the next stage start execution (pipeline parallelism). This way, a maximum speedup of 6.6 is obtained on a 4-socket AMD quad-core Opteron with 16 cores for an FFT of size n = 2^20 [91].

Our parallelization of FFT does not automatically generate pipeline parallelism, but it does automatically generate parallel tasks that may also be coupled as streams. In fact, to use OpenStream, the user has to write the input streaming code manually. With the code generated using our approach, where the parallel tasks are automatically created, the user just has to connect these tasks by providing (manually) the data flow necessary to exploit additional potential pipeline parallelism.

8.5.2 Experiments on Distributed Memory Systems

We measured the execution time of the parallel codes on P = 1, 2, 4 and 6 processors of Cmmcluster.

Figure 8.9 presents the speedups of the parallel MPI versions of ABF and equake over the sequential versions, using P = 2, 4 and 6 processors. As before, the DSC algorithm is not scalable when the number of processors is increased, since the generation of more clusters with empty slots leads to higher process scheduling cost on processors and a larger communication volume between them.

[Figure 8.9: ABF and equake speedups with MPI (DSC vs. BDSC) on 1 to 6 processors; sequential times: ABF 0.25 s, equake 160 s.]

Figure 8.10 presents the speedups of the parallel MPI versions of Harris over the sequential version using three processors. The tiled version with BDSC gives the same result as the non-tiled version: the communication overhead is so important when the three tiled loops are scheduled on three different processors that BDSC scheduled them on the same processor, which leads to a schedule equivalent to that of the non-tiled version. Compared to OpenMP, the speedups decrease when the image size is increased, because the amount of communication between processors increases. The DSC version (not shown in the figure) gives the same results as the BDSC version, because the code can be scheduled using three processors.

Figure 8.11 shows the performance results of the generated MPI code on the NAS benchmark IS after application of BDSC. The same analysis as for OpenMP applies here, in addition to communication overhead issues.


[Figure 8.10: Speedups with MPI: impact of tiling (P = 3). Harris-MPI and Harris-tiled-MPI, 3 processes vs. sequential, for image sizes 1024x1024 (sequential = 97 ms), 2048x1024 (244 ms) and 2048x2048 (442 ms).]

[Figure 8.11: Speedups with MPI for different class sizes (IS). Versions: MPI (2 tasks), MPI-tiled-2, MPI-tiled-4, MPI-tiled-6; Class A (sequential = 0.26 s), Class B (1.69 s), Class C (13.57 s).]

8.6 Scheduling Robustness

Since our BDSC scheduling heuristic relies on numerical approximations of the execution and communication times of tasks, one needs to assess its sensitivity to the accuracy of these estimations. Since a mathematical analysis of this issue is made difficult by the heuristic nature of BDSC, and in fact of scheduling processes in general, we provide below experimental data showing that our approach is rather robust.

In practice, we ran multiple versions of each benchmark using various static execution and communication cost models:

• the naive variant, in which all execution times and communication costs are supposed constant (only data dependences are enforced during the scheduling process);

• the default BDSC cost model described above;


• a biased BDSC cost model, where we modulated each execution time and communication cost value randomly by at most ∆ (the default BDSC cost model thus corresponds to ∆ = 0); a sketch of this perturbation is given right after this list.
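The helper below sketches the kind of perturbation meant by the biased variant: each statically estimated cost is multiplied by a random factor drawn in [1, 1 + ∆]. It is an illustrative reconstruction of the experiment, not the code actually used to produce Table 8.2.

    #include <stdlib.h>

    /* Perturb a statically estimated cost (task time or edge cost) by a
       random factor of at most `delta` (e.g., delta = 0.8 for 80%). */
    static double biased_cost(double static_cost, double delta)
    {
        double r = (double)rand() / RAND_MAX;   /* r in [0, 1] */
        return static_cost * (1.0 + delta * r);
    }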

Our intent in introducing different cost models is to assess how small to large differences in our estimation of task times and communication costs impact the performance of BDSC-scheduled parallel code. We would expect parallelization based on the naive cost model to yield the worst schedules, thus motivating our use of complexity analysis for parallelization purposes, if the schedules that use our default cost model are indeed better. Adding small random biases to task times and communication costs should not modify the schedules much (demonstrating stability), while adding larger ones might, showing the quality of the default cost model used for parallelization.

Table 8.2 provides, for four benchmarks (Harris, ABF, equake and IS) and two execution environments (OpenMP and MPI), the worst execution time obtained within batches of about 20 runs of programs scheduled using the naive, default and biased cost models. For this last case, we only kept in the table the entries corresponding to significant values of ∆, namely those at which, for at least one benchmark, the running time changed. For instance, when running ABF on OpenMP, the naive approach run time is 321 ms, while BDSC clocks in at 214 ms; adding random increments of up to, but not including, 80% to the task communication and execution estimations provided by our cost model (Section 6.3) does not change the scheduling, and thus the running time. At 80%, the running time increases to 230 ms and reaches 297 ms when ∆ = 3,000%. As expected, the naive variant always provides schedules that have the worst execution times, thus motivating the introduction of performance estimation in the scheduling process.

Benchmark   Language   BDSC   Naive   ∆ = 50%   80%   100%   200%   1000%   3000%
Harris      OpenMP      153     277       153   153    153    153     153     277
            MPI         303     378       303   303    303    303     303     378
ABF         OpenMP      214     321       214   230    230    246     246     297
            MPI         240     310       240   260    260    287     287     310
equake      OpenMP       58     134        58    58     80     80      80     102
            MPI         106     206       106   106    162    162     162     188
IS          OpenMP       16      35        20    20     20     25      25      29
            MPI          25      50        32    32     32     39      39      46

Table 8.2: Run-time sensitivity of BDSC with respect to static cost estimation (in ms for Harris and ABF; in s for equake and IS).

Even more interestingly, our experiments show that one needs to introduce rather large task time and communication cost estimation errors, i.e., large values of ∆, to make the BDSC-based scheduling process switch to less efficient schedules. This set of experimental data thus suggests that BDSC is a rather useful and robust heuristic, well adapted to the efficient parallelization of scientific benchmarks.

8.7 Faust Parallel Scheduling vs. BDSC

In music the passions enjoy themselves. (Friedrich Nietzsche)

Faust (Functional AUdio STream) [86] is a programming language for real-time signal processing and synthesis. Faust programs (DSP code) are translated into equivalent imperative programs (C, C++, Java, etc.). In this section, we compare our approach, which generates omp task directives, with the Faust approach, which generates omp section directives.

8.7.1 OpenMP Version in Faust

The Faust compiler produces a single sample computation loop using the scalar code generator. Then the vectorial code generator splits the sample processing loop into several simpler loops that communicate by vectors and produces a DAG. Next, this DAG is topologically sorted in order to detect loops that can be executed in parallel. After that, the parallel scheme generates OpenMP parallel sections directives using this DAG, in which each node is a computation loop; the granularity of tasks is only at the loop level. Finally, the compiler generates an OpenMP sections directive in which a pragma section is generated for each loop.
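To make the resulting code shape concrete, the fragment below shows the Faust-style omp sections form on two invented, independent loops, where the set of parallel branches is fixed syntactically, in contrast with the omp task form generated by our approach (sketched in Section 8.2.3), where tasks are scheduled dynamically by the OpenMP runtime. This snippet is an illustrative sketch, not Faust compiler output.

    #include <omp.h>

    /* Faust-like shape: one omp section per independent computation loop. */
    void branches_with_sections(int n, const float *in,
                                float *v1, float *v2, float g1, float g2)
    {
        #pragma omp parallel sections
        {
            #pragma omp section
            for (int i = 0; i < n; i++) v1[i] = in[i] * g1;
            #pragma omp section
            for (int i = 0; i < n; i++) v2[i] = in[i] * g2;
        }
    }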

8.7.2 Faust Programs: Karplus32 and Freeverb

In order to compare our approach with Faust's in generating task parallelism, we use Karplus32 and Freeverb as running examples [7].

Karplus32

The 200-line Karplus32 program simulates the sound produced by a string excitation passed through a resonator to modify the final sound. It is an advanced version of a typical Karplus algorithm, with 32 resonators in parallel instead of only one.

Freeverb

The 800-line Freeverb program is free software that implements artificial reverberation. It filters echoes to generate a signal with reverberation. It uses four Schroeder allpass filters in series and eight parallel Schroeder-Moorer filtered-feedback comb filters for each audio channel.


8.7.3 Experimental Comparison

Figure 8.12 shows the performance results of the OpenMP code generated for Karplus32 by BDSC and by Faust, for two sample buffer sizes (count equals 1024 and 2048). Note that BDSC is applied to the C code generated from these two programs by Faust. The hot function in Karplus32 (compute, 90 lines) contains only 2 sections in parallel; sequential code represents about 60% of the whole code of function compute. Since we have only two threads that execute only two tasks, the Faust version performs slightly better than BDSC. The benefit of the dynamic scheduling of omp task cannot be seen in this case, due to the small amount of parallelism in this program.

[Figure 8.12: Run-time comparison (BDSC vs. Faust for Karplus32): speedups with 2 threads vs. sequential for count = 1024 (sequential = 12 ms) and count = 2048 (sequential = 18 ms).]

Figure 8.13 shows the performance results of the OpenMP code generated for Freeverb by BDSC and by Faust, for two sizes (count equals 1024 and 2048). The hot function in Freeverb (compute, 500 lines) contains up to 16 sections in parallel (the degree of task parallelism is equal to 16); sequential code represents about 3% of the whole code of function compute. BDSC generates better schedules, since it is based on sophisticated cost models and uses both top level and bottom level values for ordering, while Faust proceeds by a topological ordering (top level only) and is based only on dependence analysis. In this case, results show that BDSC generates better schedules and that the generation of omp task instead of omp section is beneficial. Furthermore, the dynamic scheduler of OpenMP can make better decisions at run time.

8.8 Conclusion

This chapter describes the integration of SPIRE and HBDSC within the PIPS compiler infrastructure and their application to the automatic task parallelization of seven sequential programs.


[Figure 8.13: Run-time comparison (BDSC vs. Faust for Freeverb): speedups with 8 threads vs. sequential for count = 1024 (sequential = 40 ms) and count = 2048 (sequential = 80 ms).]

We illustrate the impact of this integration using the signal processing benchmark ABF (Adaptive Beam Forming), the image processing benchmark Harris, the SPEC benchmark equake and the NAS parallel benchmark IS, on both shared and distributed memory systems. We also provide a comparative study between our task parallelization implementation in PIPS and that of the audio signal processing language Faust, using two Faust programs, Karplus32 and Freeverb. Experimental results show that BDSC provides more efficient schedules than DSC and the Faust scheduler in these cases.

The application of BDSC to the parallelization of the FFT (OpenMP) shows the limits of our task parallelization approach, but we suggest that an interface with streaming languages such as OpenStream could be derived from our work.

Chapter 9

Conclusion

"He who passes by the most beautiful story of his life will be left only with the age of his regrets, and all the sighs in the world could not soothe his soul." (Yasmina Khadra)

Designing an auto-parallelizing compiler is a big challenge: interest in compiling for parallel architectures has exploded in the last decade with the widespread availability of manycores and multiprocessors, and the need for automatic task parallelization, which is a tricky problem, is still unsatisfied. In this thesis, we have exploited the parallelism present in applications in order to take advantage of the performance benefits that multiprocessors can provide. Indeed, we have delegated the "think parallel" mindset to the compiler, which detects parallelism in sequential codes and automatically writes equivalent, efficient parallel code.

We have developed an automatic task parallelization methodology for compilers; the key characteristics we focus on are resource constraints and static scheduling. It contains all the techniques required to decompose applications into tasks and generate equivalent parallel code, using a generic approach that targets different parallel languages and different architectures. We have applied this extension methodology in the existing tool PIPS, a comprehensive source-to-source compilation platform.

To conclude this dissertation, we review its main contributions and results, and present future work opportunities.

9.1 Contributions

This thesis contains five main contributions

HBDSC (Chapter 6)

We have extended the DSC scheduling algorithm to handle simultaneously two resource constraints, namely a bounded amount of memory per processor and a bounded number of processors, which are key parameters when scheduling tasks on actual multiprocessors. We have called this extension Bounded DSC (BDSC). The practical use of BDSC in a hierarchical manner led to the development of a BDSC-based hierarchical scheduling algorithm (HBDSC).

A standard scheduling algorithm works on a DAG (a sequence of statements). To apply HBDSC to real applications, our approach is to first build a Sequence Dependence DAG (SDG), which is the input of the BDSC algorithm. Then we use the code, presented in the form of an AST, to define a hierarchical mapping function, called H, that maps each sequence statement of the code to its SDG; H is used as input of the HBDSC algorithm.

Since the volume of data used or exchanged by SDG tasks and their execution times are key factors in the BDSC scheduling process, we estimate this information as precisely as possible. We provide new cost models based on execution time complexity estimation and convex polyhedral approximations of array sizes for the labeling of SDG vertices and edges. We look first at the size of array regions to estimate precisely the communication costs and the data sizes used by tasks. Then we compute Ehrhart polynomials that represent the size of the messages to communicate and the volume of data, in number of bytes; these are then converted into numbers of cycles. Besides, the execution time of each task is computed using a static approach based on a program complexity analysis provided by PIPS; complexities, in number of cycles, are also represented by polynomials. Finally, we use a static method when possible, or a code instrumentation method otherwise, to convert these polynomials into the numerical values required by BDSC.

Path Transformer Analysis (Chapter 6)

The cost models that we define in this thesis use the affine relations occurring in array regions to estimate the communications and data volumes induced by the execution of statements. We use a set of operations on array regions, such as the function regions_intersection (see Section 6.3.1). However, one must be careful, since region operations should be defined with respect to a common memory store. In this thesis, we introduce a new analysis that we call the "path transformer" analysis. A path transformer permits the comparison of array regions of statements originally defined in different memory stores. We present a recursive algorithm for the computation of path transformers between two statements, for specific iterations of their enclosing loops (if they exist). This algorithm computes affine relations between program variables at any two points in the program, from the memory store of the first statement to the memory store of the second one, along any sequential execution that links the two memory stores.

SPIRE (Chapter 4)

An automatic parallelization platform should be adapted to different parallel languages. In this thesis, we introduce SPIRE, a new and general 3-step extension methodology for mapping any intermediate representation (IR) used in compilation platforms for representing sequential programming constructs to a parallel IR. This extension of an existing IR introduces (1) a parallel execution attribute for each group of statements, (2) a high-level synchronization attribute on each statement and an API for low-level synchronization events, and (3) two built-ins for implementing communications in message-passing memory systems.

A formal semantics of SPIRE transformational definitions is specified using a two-tiered approach: a small-step operational semantics for its base parallel concepts and a rewriting mechanism for high-level constructs. The SPIRE methodology is presented via a use case, the intermediate representation of PIPS, a powerful source-to-source compilation infrastructure for Fortran and C. In addition, we apply SPIRE to another IR, namely that of the widely used LLVM compilation infrastructure. This semantics could be used in the future for proving the legality of parallel transformations of parallel codes.

Parallel Code Transformations and Generation (Chapter 7)

In order to test the flexibility of the parallelization approach presented in this thesis, we have laid the first brick of a framework of parallel code transformations and generation using our BDSC- and SPIRE-based hierarchical task parallelization techniques. We provide two parallel transformations of SPIRE-based constructs: first, from unstructured to structured parallelism, for high-level abstraction of parallel constructs; and second, the conversion of shared memory programs to distributed ones. Moreover, we show the generation of task parallelism and communications at different AST levels: equilevel and hierarchical. Finally, we generate both OpenMP (hierarchical) and MPI (flat only) from the same parallel IR derived from SPIRE.

Implementation and Experimental Results (Chapter 8)

A prototype of the algorithms presented in this thesis has been implemented in the PIPS source-to-source compilation framework: (1) HBDSC-based automatic parallelization for programs encoded in SPIRE(PIPS IR), including BDSC, HBDSC, the SDG, the cost models and the path transformer, plus DSC for comparison purposes; (2) SPIRE-based parallel code generation, targeting two parallel languages, OpenMP and MPI; and (3) two SPIRE-based parallel code transformations, from unstructured to structured code and from shared to distributed memory code.

We apply our parallelization technique to scientific applications. We provide performance measurements for our parallelization approach based on five significant programs, the image and signal processing benchmarks Harris and ABF, the SPEC2001 benchmark equake, the NAS parallel benchmark IS and the FFT, targeting both shared and distributed memory architectures. We use their automatic translations into two parallel languages: OpenMP and MPI. Finally, we provide a comparative study between our task parallelization implementation in PIPS and that of the audio signal processing language Faust, using two Faust applications, Karplus32 and Freeverb. Our results illustrate the ability of SPIRE(PIPS IR) to efficiently express the parallelism present in scientific applications. Moreover, our experimental data suggest that our parallelization methodology is able to provide parallel codes that encode a significant amount of parallelism and is thus well adapted to the design of parallel target formats for the efficient parallelization of scientific applications on both shared and distributed memory systems.

9.2 Future Work

The auto-parallelization problem is a set of many tricky subproblems, as illustrated throughout this thesis and in the summary of contributions presented above. Many new ideas and opportunities for further research have popped up all along this thesis.

SPIRE

SPIRE is a "work in progress" for the extension of sequential IRs to completely fit multiple languages and different models. Future work will address the representation via SPIRE of the PGAS memory model, of other parallel execution paradigms such as speculation, and of more programming features such as exceptions. Another interesting question is to see how many of the existing program transformations performed by compilers on sequential IRs can be preserved when these IRs are extended via SPIRE.

HBDSC Scheduling

HBDSC assumes that all processors are homogeneous; it does not handle heterogeneous devices efficiently. However, one could probably accommodate such architectures by adding another dimension to our cost models. We suggest adding a "which processor" parameter to all the functions defining the cost models, such as task_time (its value would depend on the processor the task is mapped on) and edge_cost (its value would depend on the processors the two tasks of the edge are mapped on). Moreover, each processor κi has its own memory Mi. Our parallelization process may then subsequently target a language adapted to heterogeneous architectures such as StarSs [21]. StarSs introduces an extension to the task mechanism of OpenMP 3.0: it adds many features to the construct pragma omp task, namely (1) dependences between tasks, using the clauses input, output and inout, and (2) targeting of a task to hardware accelerators, using pragma omp target device(device-name-list), knowing that an accelerator can be an SPE, a GPU, etc.


Moreover, it might be interesting to try to find a more efficient hierarchical processor allocation strategy, which would address load balancing and task granularity issues, in order to yield an even better HBDSC-based parallelization process.

Cost Models

Currently, there are some cases where our cost models cannot provide precise, or even approximate, information. When the array is a pointer (for example with the declaration float *A, where A is used as an array), the compiler has no information about the size of the array; currently, for such cases, we have to change the programmer's code manually to help the compiler. Moreover, the complexity estimation of a while loop is currently not computed, but there is already ongoing work at CRI on this issue, based on transformer analysis.

Parallel Code Transformations and Generation

Future work could address more transformations for parallel languages (à la [82]) encoded in SPIRE. Moreover, the optimization of communications in the generated code, by eliminating redundant communications and aggregating small messages into larger ones, is an important issue, since our current implementation communicates more data than necessary.

In our implementation of MPI code generation from SPIRE code, we left the implementation of hierarchical MPI as future work. Besides, another type of hierarchical MPI can be obtained by merging MPI and OpenMP together (hybrid programming). However, since we use in this thesis a hierarchical approach in both HBDSC and SPIRE generation, OpenMP can be seen in this case as a sub-hierarchical model of MPI, i.e., OpenMP tasks could be enclosed within MPI ones.

"In the eye of the small, small things are huge, but for the great souls, huge things appear small." (Al-Mutanabbi)

Bibliography

[1] The Mandelbrot Set. http://warp.povusers.org/Mandelbrot/.

[2] Cilk 5.4.6 Reference Manual. Supercomputing Technologies Group, MIT Laboratory for Computer Science. http://supertech.lcs.mit.edu/cilk, 1998.

[3] MPI: A Message-Passing Interface Standard. http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf, Sep 2012.

[4] OpenMP Application Program Interface. http://www.openmp.org/mp-documents/OpenMP_4.0_RC2.pdf, Mar 2013.

[5] X10 Language Specification Feb 2013

[6] Chapel Language Specification 0.796. Cray Inc., 901 Fifth Avenue, Suite 1000, Seattle, WA 98164, Oct 21, 2010.

[7] Faust, an Audio Signal Processing Language. http://faust.grame.fr.

[8] StreamIt, a Programming Language and a Compilation Infrastructure. http://groups.csail.mit.edu/cag/streamit.

[9] B Ackland A Anesko D Brinthaupt S Daubert A KalavadeJ Knobloch E Micca M Moturi C Nicol J OrsquoNeill J OthmerE Sackinger K Singh J Sweet C Terman and J Williams Asingle-chip 16-billion 16-b macs multiprocessor dsp Solid-StateCircuits IEEE Journal of 35(3)412 ndash424 Mar 2000

[10] T L Adam K M Chandy and J R Dickson A Comparison ofList Schedules For Parallel Processing Systems Commun ACM17(12)685ndash690 Dec 1974

[11] S V Adve and K Gharachorloo Shared Memory Consistency ModelsA Tutorial IEEE Computer 2966ndash76 1996

[12] S Alam R Barrett J Kuehn and S Poole Performance Char-acterization of a Hierarchical MPI Implementation on Large-scaleDistributed-memory Platforms In Parallel Processing 2009 ICPPrsquo09 International Conference on pages 132ndash139 2009

[13] E Allen D Chase J Hallett V Luchangco J-W Maessen S RyuJr and S Tobin-Hochstadt The Fortress Language SpecificationTechnical report Sun Microsystems Inc 2007



[14] R Allen and K Kennedy Automatic Translation of FORTRAN Pro-grams to Vector Form ACM Trans Program Lang Syst 9(4)491ndash542 Oct 1987

[15] S P Amarasinghe and M S Lam Communication Optimization andCode Generation for Distributed Memory Machines In Proceedings ofthe ACM SIGPLAN 1993 conference on Programming language designand implementation PLDI rsquo93 pages 126ndash138 New York NY USA1993 ACM

[16] G M Amdahl Validity of the Single Processor Approach to AchievingLarge Scale Computing Capabilities Solid-State Circuits NewsletterIEEE 12(3)19 ndash20 summer 2007

[17] M Amini F Coelho F Irigoin and R Keryell Static CompilationAnalysis for Host-Accelerator Communication Optimization In 24thInt Workshop on Languages and Compilers for Parallel Computing(LCPC) Fort Collins Colorado USA Sep 2011 Also Technical Re-port MINES ParisTech A476CRI

[18] C Ancourt F Coelho and F Irigoin A Modular Static AnalysisApproach to Affine Loop Invariants Detection Electr Notes TheorComput Sci 267(1)3ndash16 2010

[19] C Ancourt B Creusillet F Coelho F Irigoin P Jouvelot andR Keryell PIPS a Workbench for Interprocedural Program Anal-yses and Parallelization In Meeting on data parallel languages andcompilers for portable parallel computing 1994

[20] C Ancourt and F Irigoin Scanning Polyhedra with DO Loops InProceedings of the third ACM SIGPLAN Symposium on Principlesand Practice of Parallel Programming PPOPP rsquo91 pages 39ndash50 NewYork NY USA 1991 ACM

[21] E. Ayguade, R. M. Badia, P. Bellens, D. Cabrera, A. Duran, R. Ferrer, M. Gonzalez, F. D. Igual, D. Jimenez-Gonzalez, and J. Labarta. Extending OpenMP to Survive the Heterogeneous Multi-Core Era. International Journal of Parallel Programming, pages 440–459, 2010.

[22] H. Bao, J. Bielak, O. Ghattas, L. F. Kallivokas, D. R. O'Hallaron, J. R. Shewchuk, and J. Xu. Large-scale Simulation of Elastic Wave Propagation in Heterogeneous Media on Parallel Computers. Computer Methods in Applied Mechanics and Engineering, 152(1–2):85–102, 1998.

[23] A Basumallik S-J Min and R Eigenmann Programming Dis-tributed Memory Sytems Using OpenMP In Parallel and Distributed


Processing Symposium 2007 IPDPS 2007 IEEE International pages1ndash8 2007

[24] N Benoit and S Louise Kimble a Hierarchical Intermediate Rep-resentation for Multi-Grain Parallelism In F Bouchez S Hack andE Visser editors Proceedings of the Workshop on Intermediate Rep-resentations pages 21ndash28 2011

[25] G Blake R Dreslinski and T Mudge A Survey of Multicore Pro-cessors Signal Processing Magazine IEEE 26(6)26ndash37 Nov 2009

[26] R D Blumofe C F Joerg B C Kuszmaul C E Leiserson K HRandall and Y Zhou Cilk An Efficient Multithreaded RuntimeSystem In Journal of Parallel and Distributed Computing pages 207ndash216 1995

[27] U Bondhugula Automatic Distributed Memory Code Generationusing the Polyhedral Framework Technical Report IISc-CSA-TR-2011-3 Indian Institute of Science Bangalore 2011

[28] C T Brown L S Liebovitch and R Glendon Levy Flights in DobeJursquohoansi Foraging Patterns Human Ecology 35(1)129ndash138 2007

[29] S Campanoni T Jones G Holloway V J Reddi G-Y Wei andD Brooks HELIX Automatic Parallelization of Irregular Programsfor Chip Multiprocessing In Proceedings of the Tenth InternationalSymposium on Code Generation and Optimization CGO rsquo12 pages84ndash93 New York NY USA 2012 ACM

[30] V Cave J Zhao and V Sarkar Habanero-Java the New Adventuresof Old X10 In 9th International Conference on the Principles andPractice of Programming in Java (PPPJ) Aug 2011

[31] Y Choi Y Lin N Chong S Mahlke and T Mudge Stream Compi-lation for Real-Time Embedded Multicore Systems In Proceedings ofthe 7th annual IEEEACM International Symposium on Code Gener-ation and Optimization CGO rsquo09 pages 210ndash220 Washington DCUSA 2009

[32] F Coelho P Jouvelot C Ancourt and F Irigoin Data and ProcessAbstraction in PIPS Internal Representation In F Bouchez S Hackand E Visser editors Proceedings of the Workshop on IntermediateRepresentations pages 77ndash84 2011

[33] T U Consortium httpupcgwuedudocumentationhtml Jun2005


[34] B Creusillet Analyses de Regions de Tableaux et Applications PhDthesis MINES ParisTech A295CRI Decembre 1996

[35] B Creusillet and F Irigoin Interprocedural Array Region AnalysesInternational Journal of Parallel Programming 24(6)513ndash546 1996

[36] E Cuevas A Garcia F J JFernandez R J Gadea and J CordonImportance of Simulations for Nuclear and Aeronautical Inspectionswith Ultrasonic and Eddy Current Testing Simulation in NDT OnlineWorkshop in wwwndtnet Sep 2010

[37] R Cytron J Ferrante B K Rosen M N Wegman and F K ZadeckEfficiently Computing Static Single Assignment Form and the ControlDependence Graph ACM Trans Program Lang Syst 13(4)451ndash490Oct 1991

[38] M I Daoud and N N Kharma GATS 10 a Novel GA-basedScheduling Algorithm for Task Scheduling on Heterogeneous ProcessorNets In GECCOrsquo05 pages 2209ndash2210 2005

[39] J B Dennis G R Gao and K W Todd Modeling The WeatherWith a Data Flow Supercomputer IEEE Trans Computers pages592ndash603 1984

[40] E W Dijkstra Re ldquoFormal Derivation of Strongly Correct Paral-lel Programsrdquo by Axel van Lamsweerde and MSintzoff circulatedprivately 1977

[41] E Ehrhart Polynomes arithmetiques et methode de polyedres en com-binatoire International Series of Numerical Mathematics page 351977

[42] A E Eichenberger J K OrsquoBrien K M OrsquoBrien P Wu T ChenP H Oden D A Prener J C Shepherd B So Z Sura A WangT Zhang P Zhao M K Gschwind R Archambault Y Gao andR Koo Using Advanced Compiler Technology to Exploit the Perfor-mance of the Cell Broadband EngineTM Architecture IBM Syst J45(1)59ndash84 Jan 2006

[43] P Feautrier Dataflow Analysis of Array and Scalar References In-ternational Journal of Parallel Programming 20 1991

[44] J Ferrante K J Ottenstein and J D Warren The Program De-pendence Graph And Its Use In Optimization ACM Trans ProgramLang Syst 9(3)319ndash349 July 1987

[45] M J Flynn Some Computer Organizations and Their EffectivenessComputers IEEE Transactions on C-21(9)948 ndash960 Sep 1972


[46] M R Garey and D S Johnson Computers and Intractability AGuide to the Theory of NP-Completeness W H Freeman amp CoNew York NY USA 1990

[47] A Gerasoulis S Venugopal and T Yang Clustering Task Graphsfor Message Passing Architectures SIGARCH Comput Archit News18(3b)447ndash456 June 1990

[48] M Girkar and C D Polychronopoulos Automatic Extraction of Func-tional Parallelism from Ordinary Programs IEEE Trans Parallel Dis-trib Syst 3166ndash178 Mar 1992

[49] G Goumas N Drosinos M Athanasaki and N Koziris Message-Passing Code Generation for Non-rectangular Tiling TransformationsParallel Computing 32 2006

[50] Graphviz Graph Visualization Software httpwwwgraphviz

org

[51] L. Griffiths. A Simple Adaptive Algorithm for Real-Time Processing in Antenna Arrays. Proceedings of the IEEE, 57:1696–1704, 1969.

[52] R Habel F Silber-Chaussumier and F Irigoin Generating EfficientParallel Programs for Distributed Memory Systems Technical ReportCRIA-523 MINES ParisTech and Telecom SudParis 2013

[53] C. Harris and M. Stephens. A Combined Corner and Edge Detector. In Proceedings of the 4th Alvey Vision Conference, pages 147–151, 1988.

[54] M J Harrold B Malloy and G Rothermel Efficient Construc-tion of Program Dependence Graphs SIGSOFT Softw Eng Notes18(3)160ndash170 July 1993

[55] J Howard S Dighe Y Hoskote S Vangal D Finan G RuhlD Jenkins H Wilson N Borkar G Schrom F Pailet S Jain T Ja-cob S Yada S Marella P Salihundam V Erraguntla M KonowM Riepen G Droege J Lindemann M Gries T Apel K HenrissT Lund-Larsen S Steibl S Borkar V De R Van der Wijngaartand T Mattson A 48-Core IA-32 message-passing processor withDVFS in 45nm CMOS In Solid-State Circuits Conference Digest ofTechnical Papers (ISSCC) 2010 IEEE International pages 108ndash109Feb

[56] A Hurson J T Lim K M Kavi and B Lee Parallelization ofDOALL and DOACROSS Loops ndash a Survey In M V Zelkowitzeditor Emphasizing Parallel Programming Techniques volume 45 ofAdvances in Computers pages 53 ndash 103 Elsevier 1997


[57] Insieme Insieme - an Optimization System for OpenMP MPIand OpenCL Programs httpwwwdpsuibkacatinsieme

architecturehtml 2011

[58] F Irigoin P Jouvelot and R Triolet Semantical InterproceduralParallelization An Overview of the PIPS Project In ICS pages 244ndash251 1991

[59] K Jainandunsing Optimal Partitioning Scheme for WavefrontSys-tolic Array Processors In Proceedings of IEEE Symposium on Circuitsand Systems pages 940ndash943 1986

[60] P Jouvelot and R Triolet Newgen A Language Independent Pro-gram Generator Technical report CRIA-191 MINES ParisTech Jul1989

[61] G Karypis and V Kumar A Fast and High Quality Multilevel Schemefor Partitioning Irregular Graphs SIAM Journal on Scientific Com-puting 1998

[62] H. Kasahara, H. Honda, A. Mogi, A. Ogura, K. Fujiwara, and S. Narita. A Multi-Grain Parallelizing Compilation Scheme for OSCAR (Optimally Scheduled Advanced Multiprocessor). In Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing, pages 283–297, London, UK, 1992. Springer-Verlag.

[63] A Kayi and T El-Ghazawi PGAS Languages wsdmhp09hpclgwuedukayipdf 2009

[64] P Kenyon P Agrawal and S Seth High-Level Microprogrammingan Optimizing C Compiler for a Processing Element of a CAD Accel-erator In Microprogramming and Microarchitecture Micro 23 Pro-ceedings of the 23rd Annual Workshop and Symposium Workshop onpages 97ndash106 nov 1990

[65] D Khaldi P Jouvelot and C Ancourt Parallelizing with BDSCa Resource-Constrained Scheduling Algorithm for Shared and Dis-tributed Memory Systems Technical Report CRIA-499 (Submittedto Parallel Computing) MINES ParisTech 2012

[66] D Khaldi P Jouvelot C Ancourt and F Irigoin Task Parallelismand Data Distribution An Overview of Explicit Parallel ProgrammingLanguages In H Kasahara and K Kimura editors LCPC volume7760 of Lecture Notes in Computer Science pages 174ndash189 Springer2012


[67] D. Khaldi, P. Jouvelot, F. Irigoin, and C. Ancourt. SPIRE: A Methodology for Sequential to Parallel Intermediate Representation Extension. In Proceedings of the 17th Workshop on Compilers for Parallel Computing, CPC'13, 2013.

[68] M. A. Khan. Scheduling for Heterogeneous Systems using Constrained Critical Paths. Parallel Computing, 38(4–5):175–193, 2012.

[69] Khronos OpenCL Working Group. The OpenCL Specification, Nov. 2012.

[70] B. Kruatrachue and T. G. Lewis. Duplication Scheduling Heuristics (DSH): A New Precedence Task Scheduler for Parallel Processor Systems. Oregon State University, Corvallis, OR, 1987.

[71] S. Kumar, D. Kim, M. Smelyanskiy, Y.-K. Chen, J. Chhugani, C. J. Hughes, C. Kim, V. W. Lee, and A. D. Nguyen. Atomic Vector Operations on Chip Multiprocessors. SIGARCH Comput. Archit. News, 36(3):441–452, June 2008.

[72] Y.-K. Kwok and I. Ahmad. Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors. ACM Comput. Surv., 31:406–471, Dec. 1999.

[73] J. Larus and C. Kozyrakis. Transactional Memory. Commun. ACM, 51:80–88, July 2008.

[74] S. Lee, S.-J. Min, and R. Eigenmann. OpenMP to GPGPU: a Compiler Framework for Automatic Translation and Optimization. In Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '09, pages 101–110, New York, NY, USA, 2009. ACM.

[75] The LLVM Development Team. The LLVM Reference Manual (Version 2.6). http://llvm.org/docs/LangRef.html, Mar. 2010.

[76] V. Maisonneuve. Convex Invariant Refinement by Control Node Splitting: a Heuristic Approach. Electron. Notes Theor. Comput. Sci., 288:49–59, Dec. 2012.

[77] J. Merrill. GENERIC and GIMPLE: a New Tree Representation for Entire Functions. In GCC Developers Summit 2003, pages 171–180, 2003.

[78] D. Millot, A. Muller, C. Parrot, and F. Silber-Chaussumier. STEP: a Distributed OpenMP for Coarse-Grain Parallelism Tool. In Proceedings of the 4th International Conference on OpenMP in a New Era of Parallelism, IWOMP'08, pages 83–99, Berlin, Heidelberg, 2008. Springer-Verlag.


[79] D. I. Moldovan and J. A. B. Fortes. Partitioning and Mapping Algorithms into Fixed Size Systolic Arrays. IEEE Trans. Comput., 35(1):1–12, Jan. 1986.

[80] A. Moody, D. Ahn, and B. Supinski. Exascale Algorithms for Generalized MPI_Comm_split. In Y. Cotronis, A. Danalis, D. Nikolopoulos, and J. Dongarra, editors, Recent Advances in the Message Passing Interface, volume 6960 of Lecture Notes in Computer Science, pages 9–18. Springer Berlin Heidelberg, 2011.

[81] G. Moore. Cramming More Components Onto Integrated Circuits. Proceedings of the IEEE, 86(1):82–85, Jan. 1998.

[82] V. K. Nandivada, J. Shirako, J. Zhao, and V. Sarkar. A Transformation Framework for Optimizing Task-Parallel Programs. ACM Trans. Program. Lang. Syst., 35(1):3:1–3:48, Apr. 2013.

[83] C. J. Newburn and J. P. Shen. Automatic Partitioning of Signal Processing Programs for Symmetric Multiprocessors. In IEEE PACT, pages 269–280. IEEE Computer Society, 1996.

[84] D. Novillo. OpenMP and Automatic Parallelization in GCC. In the Proceedings of the GCC Developers Summit, June 2006.

[85] NPB. NAS Parallel Benchmarks. http://www.nas.nasa.gov/publications/npb.html.

[86] Y. Orlarey, D. Fober, and S. Letz. Adding Automatic Parallelization to Faust. In Linux Audio Conference, 2009.

[87] D. A. Padua, editor. Encyclopedia of Parallel Computing. Springer, 2011.

[88] D. A. Padua and M. J. Wolfe. Advanced Compiler Optimizations for Supercomputers. Commun. ACM, 29:1184–1201, Dec. 1986.

[89] S. Pai, R. Govindarajan, and M. J. Thazhuthaveetil. PLASMA: Portable Programming for SIMD Heterogeneous Accelerators. In Workshop on Language, Compiler, and Architecture Support for GPGPU, held in conjunction with HPCA/PPoPP 2010, Bangalore, India, Jan. 9, 2010.

[90] PolyLib. A Library of Polyhedral Functions. http://icps.u-strasbg.fr/polylib.

[91] A. Pop and A. Cohen. A Stream-Computing Extension to OpenMP. In Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers, HiPEAC '11, pages 5–14, New York, NY, USA, 2011. ACM.


[92] A. Pop and A. Cohen. OpenStream: Expressiveness and Data-Flow Compilation of OpenMP Streaming Programs. ACM Trans. Archit. Code Optim., 9(4):53:1–53:25, Jan. 2013.

[93] T. Saidani. Optimisation multi-niveau d'une application de traitement d'images sur machines parallèles. PhD thesis, Université Paris-Sud, Nov. 2012.

[94] T. Saidani, L. Lacassagne, J. Falcou, C. Tadonki, and S. Bouaziz. Parallelization Schemes for Memory Optimization on the Cell Processor: A Case Study on the Harris Corner Detector. In P. Stenstrom, editor, Transactions on High-Performance Embedded Architectures and Compilers III, volume 6590 of Lecture Notes in Computer Science, pages 177–200. Springer Berlin Heidelberg, 2011.

[95] V. Sarkar. Synchronization Using Counting Semaphores. In ICS'88, pages 627–637, 1988.

[96] V. Sarkar. Partitioning and Scheduling Parallel Programs for Multiprocessors. MIT Press, Cambridge, MA, USA, 1989.

[97] V. Sarkar. COMP 322: Principles of Parallel Programming. http://www.cs.rice.edu/~vs3/PDF/comp322-lec9-f09-v4.pdf, 2009.

[98] V. Sarkar and B. Simons. Parallel Program Graphs and their Classification. In U. Banerjee, D. Gelernter, A. Nicolau, and D. A. Padua, editors, LCPC, volume 768 of Lecture Notes in Computer Science, pages 633–655. Springer, 1993.

[99] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, P. Dubey, S. Junkins, A. Lake, R. Cavin, R. Espasa, E. Grochowski, T. Juan, M. Abrash, J. Sugerman, and P. Hanrahan. Larrabee: A Many-Core x86 Architecture for Visual Computing. Micro, IEEE, 29(1):10–21, 2009.

[100] J. Shirako, D. M. Peixotto, V. Sarkar, and W. N. Scherer. Phasers: A Unified Deadlock-Free Construct for Collective and Point-To-Point Synchronization. In ICS'08, pages 277–288, New York, NY, USA, 2008. ACM.

[101] M. Solar. A Scheduling Algorithm to Optimize Parallel Processes. In Chilean Computer Science Society, 2008. SCCC '08. International Conference of the, pages 73–78, 2008.

[102] M. Solar and M. Inostroza. A Scheduling Algorithm to Optimize Real-World Applications. In Proceedings of the 24th International Conference on Distributed Computing Systems Workshops - W7: EC (ICDCSW'04) - Volume 7, ICDCSW '04, pages 858–862, Washington, DC, USA, 2004. IEEE Computer Society.


[103] Supercomputing Technologies Group, Massachusetts Institute of Technology Laboratory for Computer Science. Cilk 5.4.6 Reference Manual, Nov. 2001.

[104] Thales. THALES Group. http://www.thalesgroup.com.

[105] H. Topcuouglu, S. Hariri, and M.-Y. Wu. Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing. IEEE Trans. Parallel Distrib. Syst., 13(3):260–274, Mar. 2002.

[106] S. Verdoolaege, A. Cohen, and A. Beletska. Transitive Closures of Affine Integer Tuple Relations and Their Overapproximations. In Proceedings of the 18th International Conference on Static Analysis, SAS'11, pages 216–232, Berlin, Heidelberg, 2011. Springer-Verlag.

[107] M.-Y. Wu and D. D. Gajski. Hypertool: A Programming Aid for Message-Passing Systems. IEEE Trans. on Parallel and Distributed Systems, 1:330–343, 1990.

[108] T. Yang and A. Gerasoulis. DSC: Scheduling Parallel Tasks on an Unbounded Number of Processors. IEEE Trans. Parallel Distrib. Syst., 5(9):951–967, 1994.

[109] X.-S. Yang and S. Deb. Cuckoo Search via Lévy Flights. In Nature & Biologically Inspired Computing, 2009. NaBIC 2009. World Congress on, pages 210–214, 2009.

[110] K. Yelick, D. Bonachea, W.-Y. Chen, P. Colella, K. Datta, J. Duell, S. L. Graham, P. Hargrove, P. Hilfinger, P. Husbands, C. Iancu, A. Kamil, R. Nishtala, J. Su, W. Michael, and T. Wen. Productivity and Performance Using Partitioned Global Address Space Languages. In Proceedings of the 2007 International Workshop on Parallel Symbolic Computation, PASCO '07, pages 24–32, New York, NY, USA, 2007. ACM.

[111] J. Zhao and V. Sarkar. Intermediate Language Extensions for Parallelism. In Proceedings of the Compilation of the Co-located Workshops on DSM'11, TMC'11, AGERE'11, AOOPES'11, NEAT'11 & VMIL'11, SPLASH '11 Workshops, pages 329–340, New York, NY, USA, 2011. ACM.

• Abstract
• Résumé
• Introduction
  • Context
  • Motivation
  • Contributions
  • Thesis Outline
• Perspectives: Bridging Parallel Architectures and Software Parallelism
  • Parallel Architectures
    • Multiprocessors
    • Memory Models
  • Parallelism Paradigms
  • Mapping Paradigms to Architectures
  • Program Dependence Graph (PDG)
    • Control Dependence Graph (CDG)
    • Data Dependence Graph (DDG)
  • List Scheduling
    • Background
    • Algorithm
  • PIPS: Automatic Parallelizer and Code Transformation Framework
    • Transformer Analysis
    • Precondition Analysis
    • Array Region Analysis
    • “Complexity” Analysis
    • PIPS (Sequential) IR
  • Conclusion
• Concepts in Task Parallel Programming Languages
  • Introduction
  • Mandelbrot Set Computation
  • Task Parallelism Issues
    • Task Creation
    • Synchronization
    • Atomicity
  • Parallel Programming Languages
    • Cilk
    • Chapel
    • X10 and Habanero-Java
    • OpenMP
    • MPI
    • OpenCL
  • Discussion and Comparison
  • Conclusion
• SPIRE: A Generic Sequential to Parallel Intermediate Representation Extension Methodology
  • Introduction
  • SPIRE, a Sequential to Parallel IR Extension
    • Design Approach
    • Execution
    • Synchronization
    • Data Distribution
  • SPIRE Operational Semantics
    • Sequential Core Language
    • SPIRE as a Language Transformer
    • Rewriting Rules
  • Validation: SPIRE Application to LLVM IR
  • Related Work: Parallel Intermediate Representations
  • Conclusion
• BDSC: A Memory-Constrained, Number of Processor-Bounded Extension of DSC
  • Introduction
  • The DSC List-Scheduling Algorithm
    • The DSC Algorithm
    • Dominant Sequence Length Reduction Warranty
  • BDSC: A Resource-Constrained Extension of DSC
    • DSC Weaknesses
    • Resource Modeling
    • Resource Constraint Warranty
    • Efficient Task-to-Cluster Mapping
    • The BDSC Algorithm
  • Related Work: Scheduling Algorithms
  • Conclusion
• BDSC-Based Hierarchical Task Parallelization
  • Introduction
  • Hierarchical Sequence Dependence DAG Mapping
    • Sequence Dependence DAG (SDG)
    • Hierarchical SDG Mapping
  • Sequential Cost Models Generation
    • From Convex Polyhedra to Ehrhart Polynomials
    • From Polynomials to Values
  • Reachability Analysis: The Path Transformer
    • Path Definition
    • Path Transformer Algorithm
    • Operations on Regions using Path Transformers
  • BDSC-Based Hierarchical Scheduling (HBDSC)
    • Closure of SDGs
    • Recursive Top-Down Scheduling
    • Iterative Scheduling for Resource Optimization
    • Complexity of HBDSC Algorithm
    • Parallel Cost Models
  • Related Work: Task Parallelization Tools
  • Conclusion
• SPIRE-Based Parallel Code Transformations and Generation
  • Introduction
  • Parallel Unstructured to Structured Transformation
    • Structuring Parallelism
    • Hierarchical Parallelism
  • From Shared Memory to Distributed Memory Transformation
    • Related Work: Communications Generation
    • Difficulties
    • Equilevel and Hierarchical Communications
    • Equilevel Communications Insertion Algorithm
    • Hierarchical Communications Insertion Algorithm
    • Communications Insertion Main Algorithm
    • Conclusion and Future Work
  • Parallel Code Generation
    • Mapping Approach
    • OpenMP Generation
    • SPMDization: MPI Generation
  • Conclusion
• Experimental Evaluation with PIPS
  • Introduction
  • Implementation of HBDSC- and SPIRE-Based Parallelization Processes in PIPS
    • Preliminary Passes
    • Task Parallelization Passes
    • OpenMP Related Passes
    • MPI Related Passes: SPMDization
  • Experimental Setting
    • Benchmarks: ABF, Harris, Equake, IS and FFT
    • Experimental Platforms: Austin and Cmmcluster
    • Effective Cost Model Parameters
  • Protocol
  • BDSC vs. DSC
    • Experiments on Shared Memory Systems
    • Experiments on Distributed Memory Systems
  • Scheduling Robustness
  • Faust Parallel Scheduling vs. BDSC
    • OpenMP Version in Faust
    • Faust Programs: Karplus32 and Freeverb
    • Experimental Comparison
  • Conclusion
• Conclusion
  • Contributions
  • Future Work