Adaptability
[Figure: example input data, the upper-triangular matrix with rows (7 3 6), (0 1 8), (0 0 5)]
The data vary. The resources vary.
Application
Adaptation is needed to improve performance
Minisymposium: Adaptive Algorithms for Scientific Computing
• 9h45 Adaptive algorithms - Theory and applications
  – Collective work - AHA Team. Jean-Louis Roch, INRIA-CNRS Grenoble, France
• 10h15 Hybrids in exact linear algebra
  – Dave Saunders, U. Delaware, USA
• 10h45 Adaptive programming with hierarchical multiprocessor tasks
  – Thomas Rauber, U. Bayreuth, Germany
• 11h15 Cache-oblivious algorithms
  – Michael Bender, Stony Brook U., USA
Why adaptive algorithms ?
Data vary. Resource availability varies.
Adaptations
Scheduling
• planning (scheduling): computation volume / heterogeneity
• redistribution (load balancing)
Goal of AHA: an integrated view of adaptation
Algorithmic approach: self-adaptive combination of algorithms, with a global behavior that is justified from a theoretical point of view
Measurements on resources
Measurements on data
Algorithm choice
• sequential / parallel
• approximate / exact
• in memory / out of core
Calibration
• pre-tuning: block size / cache, instruction selection
• priority management
Parallel algorithms with adaptive grain
Example: prefix computation
Projet MOAIS (www-id.imag.fr/MOAIS), Laboratoire ID-IMAG (CNRS-INRIA INPG-UJF)
How to adapt the application?
• By minimizing communications (adaptive granularity)
  • e.g. amortizing synchronizations in the simulation [Beaumont, Daoudi, Maillard, Manneback, Roch - PMAA 2004]
• By controlling latency (interactivity constraints) and overhead
  • FlowVR [Allard, Menier, Raffin]
• By managing node failures and resilience [checkpoint/restart] [checkers]
  • FlowCert [Jafar, Krings, Leprevost; Roch, Varrette]
• By adapting granularity
  • malleable tasks [Trystram, Mounié]
  • dataflow cactus-stack: Athapascan/Kaapi [Gautier]
  • recursive parallelism by « work-stealing » [Blumofe-Leiserson 98, Cilk, Athapascan, ...] [Bender-Rabin 2002]
  • self-adaptive grain algorithms
    • dynamic extraction of parallelism [Daoudi, Gautier, Revire, Roch - J. TSI 2005] [Roch, Traore, Bernard - …]
Parallel algorithms with adaptive grain: a few examples
• Scheduling of fine-grain parallel programs: work-stealing
• Adaptive-grain algorithms: principle of a dynamic « cascade »; example of the iterated product
• Sequential-parallel coupling: example of prefix computation
In « practice »: coarse granularity
• Splitting into p = #resources
• Drawback: heterogeneous, dynamic architecture: $\Pi_i(t)$ = speed of processor i at time t
In « theory »: fine granularity
• Maximal parallelism
• Drawback: overhead of task management
How to choose/adapt granularity?
[Figure: dataflow graph of the tasks H(a), O(b,7), F(2,a), G(a,b), H(b) over the data a and b: high potential degree of parallelism]
Greedy scheduling
« Work »: sequential time, $W_1$ = #operations
« Depth »: parallel time on ∞ resources, $W_\infty$ = #ops on a critical path
• Homogeneous case [Graham 69]: greedy scheduling: a processor is idle only when no task is ready
  $T_p < W_1/p + (1 - 1/p) \cdot W_\infty \Rightarrow T_p < W_1/p + W_\infty$
• Heterogeneous case [Jaffe 80]: maximum utilization schedule
  If i < p tasks are ready, assign them to the i fastest processors
• High utilization schedule [Bender 02], parameter B:
  If i < p tasks are ready, the fastest idle processor is at most B times faster than the slowest busy processor
  $T_p < W_1/(p \cdot \Pi_{ave}) + B \cdot W_\infty/\Pi_{ave}$
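As a worked instance of the homogeneous bound, write $W_1$ for the work and $W_\infty$ for the depth, and take illustrative values that are not from the talk: $W_1 = 10^6$, $W_\infty = 10^3$, $p = 10$.

```latex
T_p \;<\; \frac{W_1}{p} + \left(1 - \frac{1}{p}\right) W_\infty
     \;=\; \frac{10^6}{10} + 0.9 \times 10^3
     \;=\; 100\,900
```

Under these assumed values the depth term adds less than 1% to the ideal time $W_1/p = 100\,000$, which is why a small depth matters for greedy schedules.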
Work stealing
• Distributed randomized implementation of greedy scheduling
• Each processor locally manages the tasks it creates
• When idle, a processor steals the oldest ready task from a remote, non-idle processor (chosen at random)
• Implementation: local stack = deque [Cilk, Kaapi]
  • Local parallelism is implemented by sequential function calls
  • Local sequential execution must be correct => restrictions
    • series-parallel order (Cilk), reference order (Kaapi)
• On heterogeneous processors:
  • Slight modification: when a processor steals from a busy processor that is B times slower, it preempts that processor's task
• Benefits:
  => with good probability, #successful steals < p.$W_\infty$: few task migrations [Blumofe 98, Narlikar 01, Bender 02, Revire-Roch 03, ....]
  => suited to heterogeneous architectures [Bender-Rabin 02]
  • $T_p < W_1/(p \cdot \Pi_{ave}) + O(W_\infty/\Pi_{ave})$ with good probability
=> How to get $W_\infty$ small and $W_1$ = #ops of the sequential algorithm?
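The deque discipline described above can be sketched in a few lines. This is a single-threaded model only, not the Cilk or Kaapi implementation: the type `TaskDeque` and its member names are hypothetical, tasks are plain integers, and no synchronization is shown.

```cpp
#include <deque>

// Hypothetical single-threaded model of one processor's local task deque.
// Tasks are identified by integers; front = oldest task, back = newest task.
struct TaskDeque {
    std::deque<int> tasks;

    // The owner pushes each task it creates on top of its local stack.
    void push(int task) { tasks.push_back(task); }

    // The owner takes the newest task first (stack order, like a function call).
    bool pop_local(int& task) {
        if (tasks.empty()) return false;
        task = tasks.back();
        tasks.pop_back();
        return true;
    }

    // An idle remote processor takes the oldest ready task.
    bool steal(int& task) {
        if (tasks.empty()) return false;
        task = tasks.front();
        tasks.pop_front();
        return true;
    }
};
```

The owner and the thief work at opposite ends of the deque, so in a real concurrent implementation they only need to synchronize when the deque is nearly empty.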
Best case: the parallel algorithm is efficient
• $W_\infty$ is small and $W_1 = W_{seq}$
• The parallel algorithm is also an optimal sequential one
  – Examples: parallel D&C algorithms
• Implementation: work-first principle: no overhead when tasks are executed locally
• Examples:
  – Cilk: THE protocol
  – Kaapi: compare-and-swap only
Experimentation: knary benchmark
SMP architecture: Origin 3800 (32 procs), Cilk / Athapascan
Distributed architecture: iCluster, Athapascan

#procs   Speed-up
8        7.83
16       15.6
32       30.9
64       59.2
100      90.1

Ts = 2397 s, T1 = 2435 s
But usually, when $W_\infty$ is small, $W_1 \gg W_{seq}$
• Solution: mix the sequential and the parallel algorithms
• Basic technique:
  • Run the parallel algorithm down to a certain « grain », then switch to the sequential one
  • Problem: $T_\infty$ also increases, and so do the number of migrations … and the inefficiency ;o(
• Work-preserving speed-up [Bini-Pan 94] = cascading technique [Jaja 92]
  Careful interplay of both algorithms to build one with both $T_\infty$ small and $T_1 = O(T_s)$
  • Divide the sequential algorithm into blocks
  • Each block is computed with the (non-optimal) parallel algorithm
  • Drawback: sequential at coarse grain and parallel at fine grain ;o(
• Adaptive grain: the dual approach: parallelism is extracted from any sequential task
How to obtain an efficient fine-grain algorithm?
• Hypotheses for the efficiency of work-stealing:
  • the parallel algorithm is « work-optimal »
  • $T_\infty$ is very small (recursive parallelism)
• Problem:
  • Fine-grain ($T_\infty$ small) parallel algorithms may involve a large overhead compared with an efficient sequential algorithm:
    • overhead due to parallelism creation and synchronization
    • but also arithmetic overhead
Self-adaptive grain algorithms
• Recursive computations
  – Local sequential computation
• Special case:
  – recursive extraction of parallelism when a resource becomes idle
  – but local execution of a sequential algorithm
• Hypothesis: two algorithms:
  • one sequential: SeqCompute
  • one parallel: LastPartComputation => at any time, parallelism can be extracted from the remaining computations of the sequential algorithm
• Examples:
  – iterated product [Vernizzi]
  – gzip / compression [Kerfali]
  – MPEG-4 / H264 [Bernard ….]
  – prefix computation [Traore]
Self-adaptive grain algorithm
Principle: avoid the overhead of parallelism by favoring a sequential algorithm
=> use the parallel algorithm only when a processor becomes idle, by extracting parallelism from a sequential computation
[Figure: SeqCompute; Extract_par; LastPartComputation; SeqCompute]
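Indeed, parallelism often costs ...
E.g. prefix computation: $P_1 = a_0 * a_1$, $P_2 = a_0 * a_1 * a_2$, …, $P_n = a_0 * a_1 * … * a_n$
• Sequential algorithm:
    P[0] = a[0];
    for (i = 1; i <= n; i++) P[i] = P[i-1] * a[i];
  $T_1 = n$
• Parallel algorithm:
  [Figure: adjacent pairs $a_0 * a_1$, $a_2 * a_3$, …, $a_{n-1} * a_n$ are multiplied in parallel; Prefix(n/2) applied to these products yields $P_1, P_3, …, P_n$; one more parallel step of multiplications yields $P_2, P_4, …, P_{n-1}$]
  $T_\infty = 2 \log n$ but $T_1 = 2n$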
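To make the arithmetic overhead concrete, the two schemes can be compared by counting multiplications. The sketch below is an illustration, not the code of the talk: `prefix_seq` and `prefix_rec` are hypothetical names, the array is 0-indexed with `n` elements, and the recursive version runs sequentially while following the structure of the parallel algorithm (pairwise products, recursion on half the size, final fix-up).

```cpp
#include <cstddef>
#include <vector>

// Sequential prefix: P[i] = a[0] * ... * a[i]. Returns the number of multiplications.
std::size_t prefix_seq(const std::vector<long>& a, std::vector<long>& P) {
    P.assign(a.size(), 0);
    if (a.empty()) return 0;
    std::size_t ops = 0;
    P[0] = a[0];
    for (std::size_t i = 1; i < a.size(); ++i) {
        P[i] = P[i - 1] * a[i];
        ++ops;
    }
    return ops;
}

// Recursive pairwise scheme (same result, executed sequentially here).
// Each of the three loops is a set of independent operations.
std::size_t prefix_rec(const std::vector<long>& a, std::vector<long>& P) {
    const std::size_t n = a.size();
    P.assign(n, 0);
    if (n == 0) return 0;
    if (n == 1) { P[0] = a[0]; return 0; }
    std::size_t ops = 0;
    // Step 1: multiply adjacent pairs.
    std::vector<long> pairs(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        pairs[i] = a[2 * i] * a[2 * i + 1];
        ++ops;
    }
    // Step 2: prefix of the pair products; Q[i] = a[0] * ... * a[2i+1].
    std::vector<long> Q;
    ops += prefix_rec(pairs, Q);
    // Step 3: odd positions come from Q, even positions need one more product.
    P[0] = a[0];
    for (std::size_t i = 0; i < n / 2; ++i) P[2 * i + 1] = Q[i];
    for (std::size_t i = 2; i < n; i += 2) {
        P[i] = Q[i / 2 - 1] * a[i];
        ++ops;
    }
    return ops;
}
```

For 8 elements the sequential loop performs 7 multiplications and the recursive scheme 11, and the ratio approaches 2 as the size grows, which is the overhead the slide points at.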
Adaptive prefix computation
• Any (parallel) algorithm with depth $T_\infty = d$ performs at least 2n - d operations
• Lower bound on p identical processors: 2n/(p+1)
  – Block algorithm + pipeline [Nicolau 2000]
• Adaptive scheme:
  – One process performs the sequential computation
  – p-1 processes perform a parallel « segmented » prefix computation:
  $T_p < 2n/((p+1) \cdot \Pi_{ave}) + O(\log n/\Pi_{ave})$
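The 2n/(p+1) lower bound follows from the operation count in one line. The derivation below is a sketch under the usual assumptions, which are not spelled out on the slide: p identical unit-speed processors, one operation per processor per time step, and a depth d that cannot exceed the execution time $T_p$.

```latex
p \, T_p \;\ge\; \#\text{ops} \;\ge\; 2n - d \;\ge\; 2n - T_p
\quad\Longrightarrow\quad
T_p \;\ge\; \frac{2n}{p+1}
```

The first inequality says that p processors execute at most $p \, T_p$ operations in time $T_p$; the second is the operation-count bound stated above.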
Adaptive prefix versus optimal on identical processors
Adaptive prefix with variable speeds
Single-user context: adaptive is equivalent to:
- sequential on 1 processor
- optimal parallel-2 proc. on 2 processors
- …
- optimal parallel-8 proc. on 8 processors
Multi-user context: adaptive is the fastest, with a 15% benefit over a static-grain algorithm
- Lower bound: decreasing the parallel time => #ops increases > 2n.(1-1/p)
- Adaptive grain algorithm with provable performance: dynamic cascading of two algorithms (sequential/parallel) [TSI 2005]
- Theorem: $T = 2n/(p^* + 1) + O(\log n)$ ~ optimal on processors with average speed p* [soon 2006]
[Figure: parallel versus adaptive prefix under external load]
The race: sequential / fixed parallel / adaptive prefix
Race between 9 algorithms (44 processes) on an octo-SMP
[Figure: bar chart of execution time (seconds, axis from 0 to 25) for Sequential, Parallel 2 proc. to Parallel 8 proc., and Adaptive 8 proc.]
Conclusion
Adaptive algorithm with provable performance
-> also confirmed by first experiments
Still to experiment:
- on SMP at fine grain [floating-point prefix sum] (memory, pinning work-stealers to CPUs)
- on distributed heterogeneous architectures
The scheme (and its complexity analysis) appears general
- to be applied to other problems [AHA]
Annex
Implementation of work-stealing
Hypothesis: a sequential schedule is valid
[Figure: processor P runs f1() { …. fork f2 ; … } on its stack; processor P' steals from the stack of P]
+ non-preemptive execution of ready tasks
• Benefit: « static » fine grain, but dynamic control
• Drawback: possible overhead of the parallel algorithm [e.g. prefix computation]
Generic self-adaptive grain algorithm
Illustration: f(i), i = 1..100
• Step 1: SeqComp(w) on CPU A has computed f(1); LastPart(w) describes the remaining work W = 2..100
• Step 2: SeqComp(w) on CPU A has computed f(1); f(2); remaining work W = 3..100
• Step 3: CPU B becomes idle and runs LastPart(w) on W = 3..100
• Step 4: LastPart(w) on CPU B splits the remaining work: CPU A keeps W = 3..51; a new pair SeqComp(w') / LastPart(w') receives W' = 52..100
• Step 5: SeqComp(w') on CPU B computes f(52); remaining work W' = 53..100, while SeqComp(w) on CPU A goes on with W = 3..51
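The illustration above can be replayed with a small work descriptor. This is a hedged, single-threaded sketch: the `Work` type and the `extract_seq` / `extract_par` names are hypothetical (they only echo the extractSeq / extractPar calls used later in the talk), and the choice of giving away the last half of the remaining indices is an assumption that matches the 3..51 / 52..100 split of the slides.

```cpp
// Hypothetical descriptor of the remaining work: indices lo..hi, inclusive.
struct Work {
    int lo;
    int hi;

    bool empty() const { return lo > hi; }

    // SeqComp side: take the next index to compute; false when nothing is left.
    bool extract_seq(int& i) {
        if (empty()) return false;
        i = lo++;
        return true;
    }

    // LastPart side: hand the last half of the remaining indices to a thief.
    // Returns false when there is too little work left to split.
    bool extract_par(Work& stolen) {
        const int remaining = hi - lo + 1;
        if (remaining < 2) return false;
        const int mid = lo + (remaining + 1) / 2;  // first index given away
        stolen = Work{mid, hi};
        hi = mid - 1;
        return true;
    }
};
```

Starting from `Work{1, 100}`, two calls to `extract_seq` followed by one `extract_par` leave 3..51 to the victim and 52..100 to the thief. A real implementation would also need to protect `lo` and `hi` against concurrent access, which this sketch leaves out.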
Adaptivity
• Kaapi: reification, interaction with the environment (adding resources), …
• But also: impact on algorithms / scheduling
• Example: work-stealing based algorithms
  – Recursive parallel computations
  – Local sequential computation
  – Special case:
    • recursive extraction of parallelism when a resource becomes idle
    • but local execution of a sequential algorithm
• Example: prefix computation
  – Sequential: n operations
  – Parallel on p identical resources: at least 2n.(p/(p+1)) operations
  – Adaptive with work-stealing:
    • Coupling of a sequential and a parallel partial-prefix computation
    • May benefit from an unbounded number of resources
    • Performance on p processors of variable speeds: 2n/(p+1) + O(log n)
E.g. triangular system solving: A.x = b, with A lower triangular
• Sequential algorithm: $T_1 = n^2/2$; $T_\infty = n$ (fine grain)
  1/ $x_1 = b_1 / a_{11}$
  2/ For k = 2..n: $b_k = b_k - a_{k1} . x_1$
  => the system of dimension n reduces to a system of dimension n-1
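The two steps above, applied repeatedly, give the usual forward substitution. The sketch below assumes a dense lower-triangular matrix stored row by row with a non-zero diagonal; the function name `solve_lower` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Solves A.x = b for a lower-triangular A with a non-zero diagonal.
// b is taken by value because it is updated in place, as in step 2/.
std::vector<double> solve_lower(const std::vector<std::vector<double>>& A,
                                std::vector<double> b) {
    const std::size_t n = b.size();
    std::vector<double> x(n);
    for (std::size_t j = 0; j < n; ++j) {
        assert(A[j][j] != 0.0);
        // Step 1/: solve for the first unknown of the current system.
        x[j] = b[j] / A[j][j];
        // Step 2/: remove its contribution from the remaining equations,
        // which leaves a triangular system of dimension one less.
        for (std::size_t k = j + 1; k < n; ++k) {
            b[k] -= A[k][j] * x[j];
        }
    }
    return x;
}
```

The inner loop of step 2/ consists of independent updates, while the outer loop is inherently sequential, which matches the fine-grain depth of n given on the slide.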
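E.g. triangular system solving: A.x = b, with A lower triangular
• Sequential algorithm: $T_1 = n^2/2$; $T_\infty = n$ (fine grain)
• Using parallel matrix inversion: $T_1 = n^3$; $T_\infty = \log^2 n$ (fine grain)
  $A^{-1} = \begin{pmatrix} A_{11} & 0 \\ A_{21} & A_{22} \end{pmatrix}^{-1} = \begin{pmatrix} A_{11}^{-1} & 0 \\ S & A_{22}^{-1} \end{pmatrix}$ with $S = -A_{22}^{-1} . A_{21} . A_{11}^{-1}$, and $x = A^{-1} . b$
• Self-adaptive granularity algorithm: $T_1 = n^2$; $T_\infty = n . \log n$
  [Figure: blocks of the system A.x = b, labeled: ExtractPar and self-adaptive scalar product; self-adaptive sequential algorithm; self-adaptive matrix inversion; choice of h = m]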
Parallel algorithms with adaptive grain: a few examples
• Scheduling of fine-grain parallel programs: work-stealing and efficiency
• Adaptive-grain algorithms: principle of a dynamic « cascade »
• Examples
  • Iterated product, prefix
  • gzip compression
  • Inversion of triangular systems
  • 3D vision / oct-tree computation
Iterated product: sequential, parallel, adaptive [Davide Vernizzi]
● Sequential:
  ● Input: array of n values
  ● Output: $\sum_{i=1}^{n} f(x_i)$
  ● C/C++ code:
      for (i=0; i<n; i++)
          res += atoi(x[i]);
● Parallel algorithm:
  ● recursive computation by blocks (binary tree with merge)
  ● Block size = pagesize
  ● Kaapi code: Athapascan API
Experiment: parallel <=> adaptive
Variant: sum of pages
● Input: a set of n pages. Each page is an array of values
● Output: one page in which each element is the sum of the elements with the same index in the input pages
● C/C++ code:
    for (i=0; i<n; i++)
        for (j=0; j<pageSize; j++)
            res[j] += f(pages[i][j]);
  $res_j = \sum_{i=0}^{n-1} f(page_{i,j})$
Experiment:
- the parallel algorithm costs about twice as much as the sequential algorithm
- the adaptive algorithm has an efficiency close to 1
Demonstration on ensibull
Script:
[vernizzd@ensibull demo]$ more go-tout.sh
#!/bin/sh
./spg /tmp/data &
./ppg /tmp/data 1 --a1 -thread.poolsize 3 &
./apg /tmp/data 1 --a1 -thread.poolsize 3 &
Result:
[vernizzd@ensibull demo]$ ./go-tout.sh
Page size: 4096
Memory allocated
Memory allocated
0:In main: th = 1, parallel
0: -----------------------------------------
0: res = -2.048e+07
0: time = 0.408178 s ADAPTATIF (3 procs)
0: Threads created: 54
0: -----------------------------------------
0: res = -2.048e+07
0: time = 0.964014 s PARALLELE (3 procs)
0: #fork = 7497
0: -----------------------------------------
: -----------------------------------------
: res = -2.048e+07
: time = 1.15204 s SEQUENTIEL (1 proc)
: -----------------------------------------
Where does the difference come from? … The program sources
Source code for the sum of pages:
- parallel / binary tree
- adaptive by coupling:
  - sequential + Fork<LastPartComp>
  - LastPartComp: (recursive) generation of 3 tasks
Parallel algorithm
struct Iterated {
    void operator() (a1::Shared_w<Page> res, int start, int stop) {
        if ((stop - start) < 2) {
            // If max num of pages is reached, sequential algorithm
            Page resLocal(pageSize);
            IteratedSeq(start, resLocal);
            res.write(resLocal);
        } else {
            // If max num of pages is not reached: split the range in two
            int half = (start + stop) / 2;
            a1::Shared<Page> res1; // First thread result
            a1::Shared<Page> res2; // Second thread result
            a1::Fork<Iterated>() (res1, start, half); // First thread
            a1::Fork<Iterated>() (res2, half, stop);  // Second thread
            a1::Fork<Merge>() (res, res1, res2);      // Merging results...
        }
    }
};
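One plausible contributor to the extra cost of the binary-tree scheme is the merging itself, and it can be checked by counting additions. The sketch below is an assumption-laden model, not the Athapascan code: it uses plain `std::vector` pages, takes f as the identity, runs the tree sequentially, and the names `sum_seq` and `sum_tree` are hypothetical. Task creation and scheduling costs are ignored.

```cpp
#include <cstddef>
#include <vector>

using Page = std::vector<double>;

// Sequential sum: accumulate every page into res. Returns the number of additions.
std::size_t sum_seq(const std::vector<Page>& pages, Page& res) {
    std::size_t adds = 0;
    for (const Page& p : pages) {
        for (std::size_t j = 0; j < res.size(); ++j) {
            res[j] += p[j];
            ++adds;
        }
    }
    return adds;
}

// Binary tree over pages [start, stop), with start < stop: a leaf handles one
// page, an inner node merges two partial pages. res must be zero-initialized.
std::size_t sum_tree(const std::vector<Page>& pages, std::size_t start,
                     std::size_t stop, Page& res) {
    std::size_t adds = 0;
    if (stop - start < 2) {
        for (std::size_t j = 0; j < res.size(); ++j) {
            res[j] += pages[start][j];
            ++adds;
        }
        return adds;
    }
    const std::size_t half = (start + stop) / 2;
    Page r1(res.size(), 0.0), r2(res.size(), 0.0);
    adds += sum_tree(pages, start, half, r1);
    adds += sum_tree(pages, half, stop, r2);
    for (std::size_t j = 0; j < res.size(); ++j) {  // merge step
        res[j] = r1[j] + r2[j];
        ++adds;
    }
    return adds;
}
```

With 8 pages of 4 values, the sequential loop performs 32 additions and the tree 60: one pass per leaf plus one full-page merge per inner node, which is close to twice the sequential work.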
Adaptive parallelization
● Block computation on inputs split into k blocks:
  ● 1 block = pagesize
  ● Independent execution of the k tasks
  ● Merging of the results
Adaptive algorithm (1/3)
● Hypothesis: non-preemptive scheduling, of work-stealing type
● Adaptive sequential coupling:
void Adaptative (a1::Shared_w<Page> *resLocal, DescWork dw) {
    // cout << "Adaptative" << endl;
    a1::Shared<Page> resLPC;
    // Fork the extraction task; it only does work if a processor steals it
    a1::Fork<LPC>() (resLPC, dw);
    // Meanwhile, run the sequential algorithm locally
    Page resSeq(pageSize);
    AdaptSeq(dw, &resSeq);
    // Merge the sequential result with the stolen parts
    a1::Fork<Merge>() (resLPC, *resLocal, resSeq);
}
Adaptive algorithm (2/3)
● Sequential side:
void AdaptSeq (DescWork dw, Page *resSeq) {
    DescLocalWork w;
    Page resLoc(pageSize);
    double k;
    // Take the next page from the descriptor until extractSeq reports that none is left
    while (!dw.desc->extractSeq(&w)) {
        for (int i = 0; i < pageSize; i++) {
            k = resLoc.get(i) + (double) buff[w*pageSize+i];
            resLoc.put(i, k);
        }
    }
    *resSeq = resLoc;
}
Adaptive algorithm (3/3)
● Extraction side = parallel algorithm:
struct LPC {
    void operator() (a1::Shared_w<Page> resLPC, DescWork dw) {
        DescWork dw2;
        dw2.Allocate();
        dw2.desc->l.initialize();
        // Try to split off part of the remaining work into dw2
        if (dw.desc->extractPar(&dw2)) {
            a1::Shared<Page> res2;
            a1::Fork<AdaptativeMain>() (res2, dw2.desc->i, dw2.desc->j);
            a1::Shared<Page> resLPCold;
            a1::Fork<LPC>() (resLPCold, dw);
            a1::Fork<MergeLPC>() (resLPCold, res2, resLPC);
        }
    }
};
Adaptive parallelization
● A single computation task is started for all the inputs
● The remaining work is divided only when a processor becomes idle
● Fewer tasks, fewer merges
Example 2: parallelizing gzip
• Gzip:
  • Widely used (web) and costly, although of linear complexity
  • Source code: 10000 lines of C, complex data structures
  • Principle: LZ77 + Huffman tree
• Why gzip?
  • P-complete problem, but a practical parallelization is possible
  • Drawback: every (known) parallelization induces an overhead
    -> loss of compression ratio
How to parallelize gzip?
[Figure: algorithm: input file => on-the-fly compression => compressed file]
[Figure: parallelization: input file => static partition into blocks (or dynamic partition into blocks) => parallel compression => compressed blocks]
« Easy » parallelization, 100% compatible with gzip/gunzip
Problems: loss of compression ratio, grain depends on the machine, overhead
Adaptive-grain parallelization of gzip
[Figure: input file => SeqComp: on-the-fly compression => output compressed file; LastPartComputation: dynamic partition into blocks => parallel compression => output compressed blocks; outputs joined with cat]
Compressed-file size overhead
File size   Gzip       Adaptive 2 procs   Adaptive 8 procs   Adaptive 16 procs
0.86 MB     272573     275692             280660             280660
5.2 MB      1.023 MB   1.027 MB           1.05 MB            1.08 MB
9.4 MB      6.60 MB    6.62 MB            6.73 MB            6.79 MB
10 MB       1.12 MB    1.13 MB            1.14 MB            1.17 MB

Gain in T
5.2 MB      3.35 s     0.96 s     0.55 s
9.4 MB      7.67 s     6.73 s     6.79 s
10 MB       6.79 s     1.71 s     0.88 s
Performances
[Figure: time (in seconds, axis from 0 to 90) versus size of file (Ko, from 1,106 to 21,914) for sequential gzip and Athapascan gzip, on a 4-processor computer (Pentium 4x200 MHz)]