Adaptability

$\begin{bmatrix} 7 & 3 & 6 \\ 0 & 1 & 8 \\ 0 & 0 & 5 \end{bmatrix}$

Data vary

Resources vary

Application

Adaptation is needed to improve performance

MiniSymposium: Adaptive Algorithms for Scientific Computing

• 9h45 Adaptive algorithms - Theory and applications – Collective work - AHA Team, Jean-Louis Roch, INRIA-CNRS Grenoble, France

• 10h15 Hybrids in exact linear algebra – Dave Saunders, U. Delaware, USA

• 10h45 Adaptive programming with hierarchical multiprocessor tasks – Thomas Rauber, U. Bayreuth, Germany

• 11h15 Cache-oblivious algorithms – Michael Bender, Stony Brook U., USA

Why adaptive algorithms ?

$\begin{bmatrix} 7 & 3 & 6 \\ 0 & 1 & 8 \\ 0 & 0 & 5 \end{bmatrix}$

Data vary

Resource availability is versatile

Adaptations

Scheduling
• planning (scheduling): computation volume / heterogeneity
• redistribution (load-balancing)

Goal of AHA: an integrated view of adaptation

Algorithmic approach: self-adaptive combination of algorithms, with a global behaviour justified from a theoretical point of view

Measurements on resources

Measurements on data

Algorithm choice
• sequential / parallel(s)
• approximate / exact
• in memory / out of core

Calibration
• pre-tuning: block size / cache, choice of instructions
• priority management

Adaptive-grain parallel algorithms

Example: prefix computation

[email protected]

MOAIS project (www-id.imag.fr/MOAIS)
ID-IMAG Laboratory (CNRS-INRIA INPG-UJF)

How to adapt the application?

• By minimizing communications (adaptive granularity)
  – e.g. amortizing synchronizations in the simulation [Beaumont, Daoudi, Maillard, Manneback, Roch - PMAA 2004]

• By controlling latency (interactivity constraints) and overhead:
  – FlowVR [Allard, Menier, Raffin]

• By managing node failures and resilience [checkpoint/restart] [checkers]
  – FlowCert [Jafar, Krings, Leprevost; Roch, Varrette]

• By adapting granularity
  – malleable tasks [Trystram, Mounié]
  – dataflow cactus-stack: Athapascan/Kaapi [Gautier]
  – recursive parallelism by « work-stealing » [Blumofe-Leiserson 98, Cilk, Athapascan, ... ] [Bender Rabin 2002]

• Self-adaptive grain algorithms
  – dynamic extraction of parallelism [Daoudi, Gautier, Revire, Roch - J. TSI 2005] [Roch, Traore, Bernard - … ]

Adaptive-grain parallel algorithms: some examples

• Scheduling of fine-grain parallel programs: work-stealing

• Adaptive-grain algorithms: principle of a dynamic « cascade »; example of the iterated product

• Sequential-parallel coupling: example of prefix computation

In « practice »: coarse granularity
• Splitting into p = #resources
• Drawback: the architecture is heterogeneous and dynamic: $\Pi_i(t)$ = speed of processor i at time t

In « theory »: fine granularity
• Maximal parallelism
• Drawback: overhead of task management

How to choose/adapt granularity?

[Figure: data-flow graph on the variables a and b, with tasks H(a), O(b,7), F(2,a), G(a,b), H(b)]

High potential degree of parallelism

Greedy scheduling

• « Work » $W_1$ = #operations: the sequential time
• « Depth » $W_\infty$ = #ops on a critical path: the parallel time $T_\infty$ on an unbounded number of resources

• Homogeneous case [Graham 69]: greedy scheduling: no processor is idle while a task is ready

  $T_p < W_1/p + (1 - 1/p).W_\infty \Rightarrow T_p < W_1/p + W_\infty$

• Heterogeneous case [Jaffe 80]: maximum utilization schedule: if i < p tasks are ready, assign them to the i fastest processors

• High utilization schedule [Bender 02], with parameter B: if i < p tasks are ready, the fastest idle processor is at most B times faster than the slowest busy processor

  $T_p < W_1/(p.\Pi_{ave}) + B.W_\infty/\Pi_{ave}$
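To make the homogeneous bound concrete, the following sketch simulates a greedy schedule of unit-time tasks and checks Graham's bound in its non-strict form, multiplied by p so that only integers are involved. The names `greedy_schedule`, `GreedyResult` and `satisfies_graham_bound` are hypothetical, and the input format (predecessor lists with tasks numbered in topological order) is an assumption made for the illustration, not something taken from the slides.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Result of simulating a greedy schedule of unit-time tasks.
struct GreedyResult {
    std::size_t work;      // W1: total number of unit tasks
    std::size_t depth;     // W_inf: number of tasks on a longest path
    std::size_t makespan;  // T_p: number of steps used on p processors
};

// preds[v] lists the tasks that must finish before task v can start.
// Assumption: tasks are numbered in topological order, i.e. every
// predecessor of v has an index smaller than v.
GreedyResult greedy_schedule(const std::vector<std::vector<std::size_t>>& preds,
                             std::size_t p) {
    assert(p >= 1);
    const std::size_t n = preds.size();

    // Depth: length of the longest chain ending at each task.
    std::vector<std::size_t> level(n, 1);
    std::size_t depth = 0;
    for (std::size_t v = 0; v < n; ++v) {
        for (std::size_t u : preds[v]) level[v] = std::max(level[v], level[u] + 1);
        depth = std::max(depth, level[v]);
    }

    // Greedy rule: at each step run up to p ready tasks, so a processor
    // stays idle only when fewer than p tasks are ready.
    std::vector<bool> done(n, false);
    std::size_t finished = 0, steps = 0;
    while (finished < n) {
        std::vector<std::size_t> batch;
        for (std::size_t v = 0; v < n && batch.size() < p; ++v) {
            if (done[v]) continue;
            bool ready = true;
            for (std::size_t u : preds[v]) ready = ready && done[u];
            if (ready) batch.push_back(v);
        }
        for (std::size_t v : batch) done[v] = true;  // tasks finish at end of step
        finished += batch.size();
        ++steps;
    }
    return GreedyResult{n, depth, steps};
}

// Checks p * T_p <= W1 + (p - 1) * W_inf, i.e. Graham's bound times p.
bool satisfies_graham_bound(const GreedyResult& r, std::size_t p) {
    return p * r.makespan <= r.work + (p - 1) * r.depth;
}
```

For instance, a chain of three tasks plus five independent tasks has $W_1 = 8$ and $W_\infty = 3$; on p = 2 processors the simulated greedy schedule needs 4 steps, within the bound 8/2 + (1/2).3 = 5.5.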

Work stealing

• Distributed randomized implementation of greedy scheduling
  – Each processor manages locally the tasks it creates
  – When idle, a processor steals the oldest ready task from a remote, non-idle processor (randomly chosen)

• Implementation: local stack = deque [Cilk, Kaapi]
  – Local parallelism is implemented by a sequential function call
  – Local sequential execution must be correct => restrictions: series-parallel (Cilk), reference order (Kaapi)

• On heterogeneous processors:
  – Slight modification: when a processor steals from a busy processor that is B times slower, it preempts its task

• Benefits:
  – with good probability, #successful steals < $p.W_\infty$: few task migrations [Blumofe 98, Narlikar 01, Bender 02, Revire-Roch 03, ....]
  – suited to heterogeneous architectures [Bender-Rabin 02]
  – $T_p < W_1/(p.\Pi_{ave}) + O(W_\infty/\Pi_{ave})$ with good probability

=> How to have $W_\infty$ small and $W_1$ = #ops of the sequential algorithm?
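The deque discipline described above (the owner works on its newest task, a thief takes the oldest one) can be sketched as follows. This is a single-threaded model only: the class name `WorkDeque` and its methods are hypothetical, and a real runtime needs synchronization between owner and thief (the slides cite the THE protocol for Cilk and compare&swap for Kaapi), which is deliberately left out here.

```cpp
#include <cassert>
#include <deque>
#include <utility>

// Single-threaded model of one processor's local task deque.
// Assumption: no concurrency; real implementations must protect
// the case where owner and thief reach the last task together.
template <typename Task>
class WorkDeque {
public:
    // Owner side: a newly created task goes to the bottom of the deque.
    void push(Task t) { tasks_.push_back(std::move(t)); }

    // Owner side: take the newest task (stack order, like a function call).
    bool pop(Task& out) {
        if (tasks_.empty()) return false;
        out = std::move(tasks_.back());
        tasks_.pop_back();
        return true;
    }

    // Thief side: take the oldest ready task from the top of the deque.
    bool steal(Task& out) {
        if (tasks_.empty()) return false;
        out = std::move(tasks_.front());
        tasks_.pop_front();
        return true;
    }

    bool empty() const { return tasks_.empty(); }

private:
    std::deque<Task> tasks_;
};
```

Stealing the oldest task tends to hand the thief the largest piece of remaining work in a recursive computation, which is one reason the number of successful steals stays small.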

Best case: the parallel algorithm is efficient

• $W_\infty$ is small and $W_1 = W_{seq}$

• The parallel algorithm is an optimal sequential one
  – Examples: parallel D&C algorithms

• Implementation: work-first principle: no overhead when tasks are executed locally

• Examples:
  – Cilk: THE protocol
  – Kaapi: compare&swap only

Experimentation: knary benchmark

• SMP architecture: Origin 3800 (32 procs), Cilk / Athapascan
• Distributed architecture: iCluster, Athapascan

#procs   Speed-up
8        7.83
16       15.6
32       30.9
64       59.2
100      90.1

Ts = 2397 s, T1 = 2435

But usually, when $W_\infty$ is small, $W_1 \gg W_{seq}$

• Solution: mix both a sequential and a parallel algorithm

• Basic technique:
  – Use the parallel algorithm down to a certain « grain »; then use the sequential one
  – Problem: $T_\infty$ increases, and so do the number of migrations … and the inefficiency ;o(

• Work-preserving speed-up [Bini-Pan 94] = cascading technique [Jaja 92]: careful interplay of both algorithms to build one with both $T_\infty$ small and $T_1 = O(T_s)$
  – Divide the sequential algorithm into blocks
  – Each block is computed with the (non-optimal) parallel algorithm
  – Drawback: sequential at coarse grain and parallel at fine grain ;o(

• Adaptive grain: the dual approach: parallelism is extracted from any sequential task

How to obtain an efficient fine-grain algorithm?

• Hypothesis for the efficiency of work-stealing:
  – the parallel algorithm is « work-optimal »
  – $T_\infty$ is very small (recursive parallelism)

• Problem: fine-grain ($T_\infty$ small) parallel algorithms may involve a large overhead with respect to an efficient sequential algorithm:
  – overhead due to parallelism creation and synchronization
  – but also arithmetic overhead

Self-adaptive grain algorithms

• Recursive computations
  – Local sequential computation

• Special case:
  – recursive extraction of parallelism when a resource becomes idle
  – but local execution of a sequential algorithm

• Hypothesis: two algorithms:
  – 1 sequential: SeqCompute
  – 1 parallel: LastPartComputation => at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm

• Examples:
  – iterated product [Vernizzi]
  – gzip / compression [Kerfali]
  – MPEG-4 / H264 [Bernard ….]
  – prefix computation [Traore]

Self-adaptive grain algorithm

Principle: save the parallelism overhead by favouring a sequential algorithm
=> use the parallel algorithm only if a processor becomes idle, by extracting parallelism from a sequential computation

Hypothesis: two algorithms:
  – 1 sequential: SeqCompute
  – 1 parallel: LastPartComputation => at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm

Examples:
  – iterated product [Vernizzi]
  – gzip / compression [Kerfali]
  – MPEG-4 / H264 [Bernard ….]
  – prefix computation [Traore]

[Diagram: SeqCompute; Extract_par splits off LastPartComputation, on which a new SeqCompute starts]

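Indeed, parallelism often costs ...

E.g. prefix computation: $P_1 = a_0*a_1$, $P_2 = a_0*a_1*a_2$, …, $P_n = a_0*a_1*…*a_n$

• Sequential algorithm: P[0] = a[0]; for (i = 1; i <= n; i++) P[i] = P[i-1] * a[i];

  $T_1 = n$

• Parallel algorithm: multiply the adjacent pairs $a_0*a_1$, $a_2*a_3$, …, $a_{n-1}*a_n$ in parallel; compute Prefix(n/2) recursively on these products, which gives $P_1, P_3, …, P_n$; then obtain the remaining $P_2, P_4, …, P_{n-1}$ with one more parallel multiplication step.

  $T_\infty = 2.\log n$ but $T_1 = 2.n$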
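The arithmetic overhead of the parallel prefix can be observed by counting multiplications. The sketch below runs both schemes sequentially and only counts operations; the function names `prefix_seq` and `prefix_rec` are hypothetical, and the recursive version is one common way to realize the pairing scheme of the figure, not necessarily the exact variant used by the authors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential prefix product: p[i] = a[0] * ... * a[i].
// Uses one multiplication per element after the first.
std::vector<long> prefix_seq(const std::vector<long>& a, std::size_t& ops) {
    std::vector<long> p(a.size());
    if (a.empty()) return p;
    p[0] = a[0];
    for (std::size_t i = 1; i < a.size(); ++i) {
        p[i] = p[i - 1] * a[i];
        ++ops;
    }
    return p;
}

// Recursive pairing scheme, executed sequentially here to count operations.
// Each of the two loops marked "parallel step" has independent iterations.
std::vector<long> prefix_rec(const std::vector<long>& a, std::size_t& ops) {
    const std::size_t n = a.size();
    if (n <= 1) return a;

    // Parallel step 1: multiply adjacent pairs.
    std::vector<long> pairs(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        pairs[i] = a[2 * i] * a[2 * i + 1];
        ++ops;
    }

    // Recursive call on half the size: q[i] = a[0] * ... * a[2i+1].
    std::vector<long> q = prefix_rec(pairs, ops);

    std::vector<long> p(n);
    p[0] = a[0];
    for (std::size_t i = 0; i < n / 2; ++i) p[2 * i + 1] = q[i];

    // Parallel step 2: fill the even positions from the odd ones.
    for (std::size_t i = 2; i < n; i += 2) {
        p[i] = q[i / 2 - 1] * a[i];
        ++ops;
    }
    return p;
}
```

On 8 inputs the sequential loop performs 7 multiplications while the recursive scheme performs 11, which illustrates the roughly n versus roughly 2n gap stated above.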

Adaptive prefix computation

• Any (parallel) algorithm with depth $T_\infty = d$ performs at least 2n-d operations

• Lower bound on p identical processors: 2n/(p+1)
  – Block algorithm + pipeline [Nicolau 2000]

• Adaptive scheme:
  – One process performs the sequential computation
  – p-1 processes perform a parallel « segmented » prefix computation:

  $T_p < 2n/((p+1).\Pi_{ave}) + O(\log n/\Pi_{ave})$

Adaptive prefix versus optimal on identical processors

Single-user context: adaptive is equivalent to:
  - sequential on 1 processor
  - optimal parallel-2 proc. on 2 processors
  - …
  - optimal parallel-8 proc. on 8 processors

Adaptive prefix with variable speeds

Multi-user context: adaptive is the fastest, with a 15% benefit over a static-grain algorithm

- Lower bound: decreasing the parallel time increases the number of operations: #ops > 2n.(1-1/p)

- Adaptive-grain algorithm with provable performance: dynamic cascading of two algorithms (sequential/parallel) [TSI 2005]

- Theorem: $T = 2n/(p^*+1) + O(\log n)$, ~ optimal on processors with average speed $p^*$ [soon 2006]

[Figure: parallel versus adaptive algorithm under external load]

The race: sequential / fixed parallel / adaptive prefix

Race between 9 algorithms (44 processes) on an octo-SMP

[Bar chart: execution time (seconds), axis from 0 to 25, for 9 algorithms: Sequential; Parallel 2, 3, 4, 5, 6, 7 and 8 proc.; Adaptive 8 proc.]

Conclusion

• Adaptive algorithm with provable performance -> also confirmed by first experiments

• To experiment:
  - on SMP at fine grain [floating-point prefix sum] (memory, pinning work-stealers to CPUs)
  - on distributed heterogeneous architectures

• The scheme (and its complexity analysis) appears general -> apply the technique to other problems [AHA]

• Benefit: « static » fine grain, but dynamic control

• Drawback: possible overhead of the parallel algorithm [e.g. prefix computation]

Implementation of work-stealing

[Diagram: processor P runs f1() { …. fork f2 ; … } from its stack; the forked task f2 is stolen by processor P']

Hypothesis: a sequential schedule is valid + non-preemptive execution of ready tasks

Generic self-adaptive grain algorithm

Illustration: f(i), i=1..100

1. CPU A runs SeqComp(w) and computes f(1); the remaining work W=2..100 stays available through LastPart(w).

2. CPU A continues SeqComp(w): f(1); f(2) are computed; W=3..100.

3. CPU B becomes idle and runs LastPart(w) on W=3..100.

4. LastPart(w) on CPU B splits the remaining work: w keeps W=3..51 and a new work w' receives W'=52..100, with its own SeqComp(w') and LastPart(w').

5. CPU B runs SeqComp(w') and computes f(52); W'=53..100; LastPart(w) and LastPart(w') remain available for further extraction.
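The split shown in the illustration (W=3..100 becoming W=3..51 and W'=52..100) can be reproduced with a small work descriptor. This is a minimal single-threaded sketch: `Work`, `extract_seq` and `extract_par` are hypothetical names modelled on the extractSeq/extractPar idea, intervals are half-open, and the synchronization a real runtime needs between the sequential owner and the thief is omitted.

```cpp
#include <cassert>

// Remaining work as a half-open interval of indices [first, last).
// Assumption: illustrative model only, not the Kaapi descriptor.
struct Work {
    int first;
    int last;
    bool empty() const { return first >= last; }
};

// Sequential side: take the next index from the front.
// Returns false when no work is left.
bool extract_seq(Work& w, int& i) {
    if (w.empty()) return false;
    i = w.first++;
    return true;
}

// Parallel side, called on behalf of an idle processor: split off the
// last half of the remaining work. Returns false if there is too
// little work left to split.
bool extract_par(Work& w, Work& stolen) {
    if (w.last - w.first < 2) return false;
    const int mid = w.first + (w.last - w.first) / 2;
    stolen = Work{mid, w.last};
    w.last = mid;
    return true;
}
```

With work 1..100 stored as [1, 101), two sequential extractions followed by one parallel extraction leave [3, 52) to the victim and give [52, 101) to the thief, i.e. 3..51 and 52..100 as in the illustration. No task is created until a processor actually asks for work, which is where the saving over a static fine-grain decomposition comes from.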

Adaptivity

• Kaapi: reification, interaction with the environment (adding resources), …

• But also: impact on algorithms / scheduling

• Example: work-stealing based algorithms
  – Recursive parallel computations
  – Local sequential computation
  – Special case: recursive extraction of parallelism when a resource becomes idle, but local execution of a sequential algorithm

• Example: prefix computation
  – Sequential: n operations
  – Parallel on p identical resources: at least 2n.(p/(p+1)) operations
  – Adaptive with work-stealing:
    • Coupling sequential and parallel partial-prefix computation
    • May benefit from an unbounded number of resources
    • Performance on p processors of variable speeds: 2n/(p+1) + O(log n)


E.g. triangular system solving: A.x = b, with A lower triangular

• Sequential algorithm: $T_1 = n^2/2$; $T_\infty = n$ (fine grain)

  1/ $x_1 = b_1 / a_{11}$

  2/ For k=2..n: $b_k = b_k - a_{k1}.x_1$

  The system of dimension n is thus reduced to a system of dimension n-1.
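Applied repeatedly, the two steps above give the usual forward substitution. The sketch below is one straightforward way to write it, assuming a dense lower triangular matrix with a nonzero diagonal stored as a vector of rows; the function name `solve_lower` is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Solves A.x = b for a lower triangular matrix A.
// Assumptions: A is square, of the same dimension as b, and has a
// nonzero diagonal. b is taken by value because it is updated in place.
std::vector<double> solve_lower(const std::vector<std::vector<double>>& A,
                                std::vector<double> b) {
    const std::size_t n = b.size();
    std::vector<double> x(n);
    for (std::size_t j = 0; j < n; ++j) {
        // Step 1: the first unknown of the current (n - j)-dimensional system.
        x[j] = b[j] / A[j][j];
        // Step 2: eliminate it from the remaining equations.
        for (std::size_t k = j + 1; k < n; ++k) {
            b[k] -= A[k][j] * x[j];
        }
    }
    return x;
}
```

The inner loop performs about $n^2/2$ multiply-subtract operations in total, and the n values of x are produced one after the other, which matches $T_1 = n^2/2$ and $T_\infty = n$.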

E.g. triangular system solving: A.x = b

• Sequential algorithm: $T_1 = n^2/2$; $T_\infty = n$ (fine grain)

• Using parallel matrix inversion: $T_1 = n^3$; $T_\infty = \log^2 n$ (fine grain)

  $A^{-1} = \begin{bmatrix} A_{11} & 0 \\ A_{21} & A_{22} \end{bmatrix}^{-1} = \begin{bmatrix} A_{11}^{-1} & 0 \\ S & A_{22}^{-1} \end{bmatrix}$ with $S = -A_{22}^{-1}.A_{21}.A_{11}^{-1}$, and $x = A^{-1}.b$

• Self-adaptive granularity algorithm: $T_1 = n^2$; $T_\infty = n.\log n$

  [Diagram labels: ExtractPar and self-adaptive scalar product; self-adaptive sequential algorithm; self-adaptive matrix inversion; choice of h = m]

Adaptive-grain parallel algorithms: some examples

• Scheduling of fine-grain parallel programs: work-stealing and efficiency

• Adaptive-grain algorithms: principle of a dynamic « cascade »

• Examples
  – Iterated product, prefix
  – gzip compression
  – Triangular system inversion
  – 3D vision / oct-tree computation

Iterated product: sequential, parallel, adaptive [Davide Vernizzi]

● Sequential:

  ● Input: array of n values

  ● Output: $\sum_{i=1}^{n} f(x_i)$

  ● C/C++ code:

        for (i=0; i<n; i++)
          res += atoi(x[i]);

● Parallel algorithm:

  ● recursive computation by blocks (binary tree with merge)

  ● Block size = pagesize

  ● Kaapi code: Athapascan API

Experiment: parallel <=> adaptive

Variant: sum of pages

● Input: a set of n pages. Each page is an array of values

● Output: one page in which each element is the sum of the elements of the same index in the input pages:

  $res[j] = \sum_{i=0}^{n-1} f(pages[i][j])$

● C/C++ code:

      for (i=0; i<n; i++)
        for (j=0; j<pageSize; j++)
          res[j] += f (pages[i][j]);

Experiment:
- the parallel algorithm costs about twice as much as the sequential algorithm
- the adaptive algorithm has an efficiency close to 1

Demonstration on ensibull

Script:

    [vernizzd@ensibull demo]$ more go-tout.sh
    #!/bin/sh
    ./spg /tmp/data &
    ./ppg /tmp/data 1 --a1 -thread.poolsize 3 &
    ./apg /tmp/data 1 --a1 -thread.poolsize 3 &

Result:

    [vernizzd@ensibull demo]$ ./go-tout.sh
    Page size: 4096
    Memory allocated
    Memory allocated
    0:In main: th = 1, parallel
    0: -----------------------------------------
    0: res = -2.048e+07
    0: time = 0.408178 s          <- ADAPTIVE (3 procs)
    0: Threads created: 54
    0: -----------------------------------------
    0: res = -2.048e+07
    0: time = 0.964014 s          <- PARALLEL (3 procs)
    0: #fork = 7497
    0: -----------------------------------------
    : -----------------------------------------
    : res = -2.048e+07
    : time = 1.15204 s            <- SEQUENTIAL (1 proc)
    : -----------------------------------------

Where does the difference come from? … The program sources

Source codes for the sum of pages:

- parallel / binary tree

- adaptive by coupling:
  - sequential + Fork<LastPartComp>
  - LastPartComp: (recursive) generation of 3 tasks

Parallel algorithm:

    struct Iterated {
      void operator() (a1::Shared_w<Page> res, int start, int stop) {
        if ( (stop-start) < 2 ) {
          // If max num of pages is reached, sequential algorithm
          Page resLocal (pageSize);
          IteratedSeq(start, resLocal);
          res.write(resLocal);
        } else {
          // If max num of pages is not reached
          int half = (start+stop)/2;
          a1::Shared<Page> res1;  // First thread result
          a1::Shared<Page> res2;  // Second thread result
          a1::Fork<Iterated> () (res1, start, half);  // First thread
          a1::Fork<Iterated> () (res2, half, stop);   // Second thread
          a1::Fork<Merge> () (res, res1, res2);       // Merging results...
        }
      }
    };

Adaptive parallelization

● Block computation on inputs split into k blocks:

  ● 1 block = pagesize
  ● Independent execution of the k tasks
  ● Merge of the results

Adaptive algorithm (1/3)

● Hypothesis: non-preemptive scheduling, of work-stealing type

● Adaptive sequential coupling:

    void Adaptative (a1::Shared_w<Page> *resLocal, DescWork dw) {
      // cout << "Adaptative" << endl;
      a1::Shared <Page> resLPC;
      a1::Fork<LPC>() (resLPC, dw);   // parallel side: extraction task

      Page resSeq (pageSize);
      AdaptSeq (dw, &resSeq);         // sequential side, run locally
      a1::Fork <Merge> () (resLPC, *resLocal, resSeq);
    }

Adaptive algorithm (2/3)

● Sequential side:

    void AdaptSeq (DescWork dw, Page *resSeq) {
      DescLocalWork w;
      Page resLoc (pageSize);
      double k;
      // Take pages one by one from the shared work descriptor
      while (!dw.desc->extractSeq(&w)) {
        for (int i=0; i<pageSize; i++ ) {
          k = resLoc.get (i) + (double) buff[w*pageSize+i];
          resLoc.put(i, k);
        }
      }
      *resSeq = resLoc;
    }

Adaptive algorithm (3/3)

● Extraction side = parallel algorithm:

    struct LPC {
      void operator () (a1::Shared_w<Page> resLPC, DescWork dw) {
        DescWork dw2;
        dw2.Allocate();
        dw2.desc->l.initialize();
        // Try to extract part of the remaining work into dw2
        if (dw.desc->extractPar(&dw2)) {
          a1::Shared<Page> res2;
          a1::Fork<AdaptativeMain>() (res2, dw2.desc->i, dw2.desc->j);
          a1::Shared<Page> resLPCold;
          a1::Fork<LPC>() (resLPCold, dw);
          a1::Fork<MergeLPC>() (resLPCold, res2, resLPC);
        }
      }
    };

Adaptive parallelization

● A single computation task is started for all the inputs

● The remaining work is divided only when a processor becomes idle

● Fewer tasks, fewer merges

Example 2: parallelizing gzip

• Gzip:
  – Widely used (web) and costly, although of linear complexity
  – Source code: 10000 lines of C, complex data structures
  – Principle: LZ77 + Huffman tree

• Why gzip?
  – P-complete problem, but a practical parallelization is possible
  – Drawback: every (known) parallelization induces an overhead -> loss of compression ratio

How to parallelize gzip?

[Diagram: sequential algorithm: input file => on-the-fly compression => compressed file. Parallelization: input file => partition into blocks (static or dynamic) => parallel compression => compressed blocks]

• « Easy » parallelization, 100% compatible with gzip/gunzip

• Problems: loss of compression ratio, grain depends on the machine, overhead

Adaptive-grain gzip parallelization

[Diagram: input file => SeqComp: on-the-fly compression => output compressed file; LastPartComputation: dynamic partition in blocks => parallel compression => output compressed blocks; cat]

Overhead in compressed file size

File size   Gzip       Adaptive 2 procs   Adaptive 8 procs   Adaptive 16 procs
0.86 MB     272573     275692             280660             280660
5.2 MB      1.023 MB   1.027 MB           1.05 MB            1.08 MB
9.4 MB      6.60 MB    6.62 MB            6.73 MB            6.79 MB
10 MB       1.12 MB    1.13 MB            1.14 MB            1.17 MB

Gain in time T

5.2 MB      3.35 s   0.96 s   0.55 s
9.4 MB      7.67 s   6.73 s   6.79 s
10 MB       6.79 s   1.71 s   0.88 s

Performances

4-processor computer (Pentium 4x200 MHz)

[Chart: time (in seconds), axis from 0 to 90, versus size of file (Ko), from 1,106 to 21,914; series: Sequential gzip, Athapascan gzip]
