PP16-lec3-arch2

1.1 Parallel Processing sp2016 lec#3 Dr M Shamim Baig 

Transcript of PP16-lec3-arch2

Page 1: PP16-lec3-arch2

8/16/2019 PP16-lec3-arch2

http://slidepdf.com/reader/full/pp16-lec3-arch2 1/23

1.1

Parallel Processing sp2016

lec#3

Dr M Shamim Baig 

Page 2: PP16-lec3-arch2


Implicit Parallel Architectures:

ILP processors

• Pipelined Processors

• Superscalar Processor 

• VLIW Processor

Page 3: PP16-lec3-arch2


Pipeline Performance

• Instruction / Arithmetic-unit Pipeline

• Ideal pipeline Speed-up calculation & Limits

• Chained Pipeline Performance

• The speed-up of a pipeline is eventually limited by the number of stages & the time of the slowest stage.

• For this reason, conventional processors relied on very deep pipelines (a 20-stage pipeline is an example of a deep pipeline, compared with a normal pipeline of only a few stages).
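The two limits named above (number of stages, slowest stage) can be sketched numerically. This is a minimal model, assuming no stalls and that the clock period equals the slowest stage time; the function names and the stage latencies are illustrative and not from the slides.

```python
def pipelined_time(n_instr, stage_times):
    """Time for n_instr instructions on a pipeline with the given stage latencies."""
    cycle = max(stage_times)             # clock period is set by the slowest stage
    k = len(stage_times)                 # number of stages
    return (k + n_instr - 1) * cycle     # k cycles for the first result, then one per cycle


def speedup(n_instr, stage_times):
    """Speed-up over running each instruction start to finish without overlap."""
    unpipelined = n_instr * sum(stage_times)
    return unpipelined / pipelined_time(n_instr, stage_times)


balanced = [1.0] * 5                     # ideal case: speed-up approaches the stage count
unbalanced = [1.0, 1.0, 2.0, 0.5, 0.5]   # same total work, but one slow stage

for name, stages in (("balanced", balanced), ("unbalanced", unbalanced)):
    print(name, round(speedup(1000, stages), 2))
```

With equal stages the speed-up tends towards the number of stages as the instruction count grows; with one stage twice as slow as the average, it is capped at half of that.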

Page 4: PP16-lec3-arch2


Pipeline Performance Bottlenecks

• Pipeline has following performance bottlenecks

  - Resource Constraint

  - Data Dependency

  - Branch Prediction

• Approx. every 5th instruction is a conditional jump. This requires very accurate branch prediction.

• The penalty of a prediction error grows with the depth of the pipeline, since a larger number of instructions will have to be flushed.

• Hence need for better solutions (than deep pipeline)
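A rough way to see why the misprediction penalty grows with depth is an effective cycles-per-instruction estimate. This is a simplified sketch: the 10% misprediction rate and the assumption that a flush costs depth - 1 cycles are illustrative choices, not figures from the lecture; the branch fraction of 0.2 corresponds to "every 5th instruction".

```python
def effective_cpi(depth, branch_fraction=0.2, mispredict_rate=0.1):
    """Average cycles per instruction for a pipeline of the given depth."""
    # Assumed model: a mispredicted branch flushes the younger instructions
    # already in flight, costing roughly (depth - 1) cycles.
    flush_penalty = depth - 1
    return 1.0 + branch_fraction * mispredict_rate * flush_penalty


for depth in (5, 10, 20):
    print(depth, round(effective_cpi(depth), 2))
```

Under these assumptions the same predictor accuracy costs several times more per instruction on a 20-stage pipeline than on a 5-stage one.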

Page 5: PP16-lec3-arch2


Implicit Parallel Architectures:

ILP processors

• Pipelined Processor

• Superscalar Processor

• VLIW Processor

Page 6: PP16-lec3-arch2


Superscalar Processor

• One simple way of alleviating the deep pipeline bottlenecks is to use multiple (concurrent) short pipelines.

• Issue multiple independent instructions simultaneously

  - Examples: MIPS 10000, PowerPC & Pentium

• The question then becomes one of selecting or scheduling these instructions for simultaneous issuing.

Page 7: PP16-lec3-arch2


Superscalar Scheduler

• Superscalar scheduler is in-chip hardware that looks at a number of instructions in an instruction queue at runtime & selects an appropriate number of instructions to execute concurrently.

• Scheduling of instructions concurrently is determined by a number of factors:

  - Resolve Data Dependency Issues

  - Resolve Resource Constraint Issues

  - Resolve Branch Prediction Issues

• Cost / complexity of Scheduler hardware & its performance constraints (discussed later) are important issues of superscalar processors.
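The first two factors can be illustrated with a small check that decides whether two instructions may issue in the same cycle. This is only a sketch: the `Instr` record, the register names and the unit names are hypothetical, the dependency test is deliberately conservative (no register renaming), and branch prediction is not modelled.

```python
from collections import namedtuple

# Hypothetical instruction record: operation, destination register,
# source operands, and the kind of functional unit it needs.
Instr = namedtuple("Instr", "op dest srcs unit")


def can_co_issue(first, second, free_units):
    """True if `second` may issue in the same cycle as the earlier `first`."""
    # Data dependency: second reads or overwrites what first writes,
    # or second overwrites a register that first still has to read.
    raw = first.dest in second.srcs
    waw = first.dest == second.dest
    war = second.dest in first.srcs
    if raw or waw or war:
        return False
    # Resource constraint: enough functional units of each kind must be free.
    needed = {}
    for ins in (first, second):
        needed[ins.unit] = needed.get(ins.unit, 0) + 1
    return all(free_units.get(unit, 0) >= count for unit, count in needed.items())


load_a = Instr("load", "R1", ("addr_a",), "mem")
load_b = Instr("load", "R2", ("addr_b",), "mem")
add_ab = Instr("add", "R3", ("R1", "R2"), "alu")

print(can_co_issue(load_a, load_b, {"mem": 2, "alu": 1}))  # independent loads, two memory units
print(can_co_issue(load_a, add_ab, {"mem": 2, "alu": 1}))  # add reads R1 written by the load
print(can_co_issue(load_a, load_b, {"mem": 1, "alu": 1}))  # only one memory unit available
```

A hardware scheduler performs checks of this kind across the whole instruction queue every cycle, which is where its cost and complexity come from.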

Page 8: PP16-lec3-arch2


Example: two-way superscalar execution of instructions

[Figure: two-way superscalar execution timeline with pipeline stages IF, ID, NA, NA, WB; stages marked NA are not required.]

The example illustrates that different instruction mixes with identical semantics can take significantly different execution times.

An execution-unit constraint or a data dependency can cause additional delays compared with the ideal pipeline.

Page 9: PP16-lec3-arch2


Superscalar Execution: Resource Waste

• In the above example, there is some wastage of execution-unit resources.

[Figure: the same pipeline diagram (stages IF, ID, NA, NA, WB; stages marked NA are not required).]

Page 10: PP16-lec3-arch2


Superscalar Execution: Efficiency Considerations

• Not all functional units can be kept busy at all times.

• If during a cycle, no functional units are utilized, this is referred to as vertical waste.

• If during a cycle, only some of the functional units are utilized, this is referred to as horizontal waste.

• Due to limited parallelism in typical instruction traces (dependencies) & limited time/scope of the scheduler to extract parallelism, the performance of superscalar processors is eventually limited.

• Conventional microprocessors typically support four-way superscalar execution.
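Vertical and horizontal waste can be counted from a per-cycle record of how many functional units were busy. The sketch below assumes waste is measured in idle issue slots; the trace values are made up for illustration.

```python
def issue_slot_waste(busy_per_cycle, width):
    """Return (vertical waste, horizontal waste, utilization) in issue slots."""
    # Vertical waste: cycles in which no functional unit was used at all.
    vertical = sum(width for busy in busy_per_cycle if busy == 0)
    # Horizontal waste: cycles in which only some of the units were used.
    horizontal = sum(width - busy for busy in busy_per_cycle if 0 < busy < width)
    utilization = sum(busy_per_cycle) / (width * len(busy_per_cycle))
    return vertical, horizontal, utilization


trace = [2, 2, 1, 0, 1, 1]    # busy units in each cycle of a two-way machine
vertical, horizontal, utilization = issue_slot_waste(trace, width=2)
print(vertical, horizontal, round(utilization, 2))
```

Used slots, vertical waste and horizontal waste always add up to width times the number of cycles.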

Page 11: PP16-lec3-arch2


Superscalar Execution: Instruction Issue Mechanisms

• In the simpler model, instructions can be issued only in the order in which they are encountered, i.e. if the second instruction cannot be issued because it has a data dependency with the first, only one instruction is issued in the cycle. This is called in-order issue.

• In a more aggressive model, instructions can be issued out of order. In this case, if the second instruction has data dependencies with the first, but the third instruction does not, the first and third instructions can be co-scheduled. This is also called dynamic issue.

• Performance of in-order issue is generally limited.
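The difference between the two issue models can be sketched with a toy simulator. It assumes every instruction completes in one cycle, unlimited functional units, and a dependency list that only refers to earlier instructions; the function name and the four-instruction example are illustrative.

```python
def issue_cycles(deps, width=2, in_order=True):
    """deps[i] is the set of earlier instructions that instruction i depends on.
    Returns the cycle in which each instruction is issued."""
    n = len(deps)
    issued_at = [None] * n
    cycle = 0
    while any(c is None for c in issued_at):
        issued_now = []
        for i in range(n):
            if issued_at[i] is not None:
                continue
            # Ready only if every producer was issued in an earlier cycle.
            ready = all(issued_at[d] is not None and issued_at[d] < cycle
                        for d in deps[i])
            if ready and len(issued_now) < width:
                issued_now.append(i)
            elif in_order:
                break          # in-order: a blocked instruction stalls all later ones
        for i in issued_now:
            issued_at[i] = cycle
        cycle += 1
    return issued_at


# Instruction 1 depends on 0; instruction 3 depends on 2; 0 and 2 are independent.
deps = [set(), {0}, set(), {2}]
print(issue_cycles(deps, in_order=True))    # second instruction blocks the third
print(issue_cycles(deps, in_order=False))   # first and third are co-scheduled
```

In this example in-order issue needs three cycles, while out-of-order issue finishes in two by pairing the first and third instructions.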

Page 12: PP16-lec3-arch2


Implicit Parallel Architectures:

ILP processors

• Pipelined Processor

• Superscalar Processor

• VLIW Processor

Page 13: PP16-lec3-arch2


Very Long Instruction Word (VLIW) Processors

• Hardware cost/complexity & the time/scope constraint of runtime scheduling are the major issues in superscalar design.

• To address these issues, VLIW processors rely on compile-time analysis to identify & bundle together instructions that can be executed concurrently.

• These instructions are packed & dispatched together, thus the name very long instruction word.

• Typical VLIW processors are limited to 4 to 8-way parallelism. Variants of this concept are employed in Intel IA-64 processors & TI TMS320 C6xxx DSPs.
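Compile-time bundling can be sketched as a greedy packer that fills each long instruction word with operations whose inputs are already available and pads the rest with no-ops. This is a simplified illustration, not how any particular VLIW compiler works; the operation names, the dependency lists and the width are assumptions.

```python
def pack_bundles(ops, deps, width=4):
    """Greedily pack operations into bundles of `width` slots.
    deps[i] lists the indices of the operations that ops[i] depends on."""
    placed = {}                      # operation index -> bundle number
    bundles = []
    remaining = list(range(len(ops)))
    while remaining:
        # An operation is eligible once all its producers sit in earlier bundles.
        ready = [i for i in remaining if all(d in placed for d in deps[i])]
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies")
        chosen = ready[:width]
        for i in chosen:
            placed[i] = len(bundles)
        remaining = [i for i in remaining if i not in placed]
        # Unused slots are filled with explicit no-ops.
        bundles.append([ops[i] for i in chosen] + ["nop"] * (width - len(chosen)))
    return bundles


ops = ["a", "b", "c", "d", "e"]
deps = [[], [], [0, 1], [2], []]     # c needs a and b; d needs c; e is independent
for bundle in pack_bundles(ops, deps, width=2):
    print(bundle)
```

The `nop` slots are the compile-time counterpart of the horizontal waste seen in superscalar execution.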

Page 14: PP16-lec3-arch2


TMS320C6x has dual data paths & orthogonal instruction units which boost overall performance

A high performance DSP: 8-way VLIW processor

Page 15: PP16-lec3-arch2


Comparison: Superscalar vs Very Long Instruction Word (VLIW)

• Superscalar implements the Scheduler as in-chip Hardware, while VLIW implements it in compiler software.

• Superscalar schedules concurrent instructions at runtime, while VLIW does it at compile-time.

• Superscalar scheduler scope is limited to a few instructions from the instruction-queue, while the VLIW scheduler has a bigger context (may be the full program) to process.

• Due to more time & context, the VLIW scheduler can use more powerful algorithms (e.g. loop unrolling, branch prediction etc.) giving better results, which Superscalar can't afford.

• Compilers, however, do not have runtime information (e.g. cache misses, branch variable state etc.), so VLIW Scheduling is inherently more conservative than Superscalar.
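Loop unrolling, one of the compile-time techniques mentioned above, can be shown by hand on a dot product. The sketch is written in Python only to show the transformation; a VLIW compiler would apply it to machine code, and the unroll factor of 4 is an arbitrary choice.

```python
def dot_rolled(a, b):
    """Original loop: every iteration depends on the previous value of acc."""
    acc = 0
    for i in range(len(a)):
        acc += a[i] * b[i]
    return acc


def dot_unrolled4(a, b):
    """Unrolled by 4 with separate accumulators: the four multiply-adds in
    the loop body are independent and could be packed into one bundle."""
    n = len(a)
    acc0 = acc1 = acc2 = acc3 = 0
    i = 0
    while i + 4 <= n:
        acc0 += a[i] * b[i]
        acc1 += a[i + 1] * b[i + 1]
        acc2 += a[i + 2] * b[i + 2]
        acc3 += a[i + 3] * b[i + 3]
        i += 4
    for j in range(i, n):            # leftover iterations when n is not a multiple of 4
        acc0 += a[j] * b[j]
    return acc0 + acc1 + acc2 + acc3


a = list(range(10))
b = list(range(10, 20))
print(dot_rolled(a, b) == dot_unrolled4(a, b))
```

Both versions compute the same result for integer inputs; with floating-point data the changed summation order can alter rounding slightly.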

Page 16: PP16-lec3-arch2


Explicitly Parallel Processor Architectures:

Task-level Parallelism

Page 17: PP16-lec3-arch2


Page 18: PP16-lec3-arch2


Flynn's Classification for Parallel Processor Architecture

• Instruction Stream & Data Stream based classification (SISD, MISD, SIMD, MIMD)

• Processing units in parallel computers either operate under the centralized control of a single control unit or work independently.

• If there is a single control unit that dispatches the same instruction to various processors (that work on different data), the model is referred to as single instruction stream, multiple data stream (SIMD).

• If each processor has its own control unit, each processor can execute different instructions on different data items. This model is called multiple instruction stream, multiple data stream (MIMD).

Page 19: PP16-lec3-arch2


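SIMD and MIMD Processors

A typical SIMD architecture (a) and a typical MIMD architecture (b).

[Figure: (a) SIMD: a single instruction stream IS drives data streams DS1, DS2, DS3, ..., DSn-1, DSn. (b) MIMD: instruction streams IS1, IS2, ..., ISn-1, ISn, each with its own data stream DS1, DS2, ..., DSn-1, DSn.]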

Page 20: PP16-lec3-arch2


SIMD Processors

• Some of the earliest parallel computers such as the Illiac IV, MPP, DAP, CM-2, and MasPar MP-1 belonged to this class of machines.

• Variants of this concept have found use in co-processing units such as the MMX units in Intel processors, DSP chips such as the Sharc & Nvidia's GPUs.

• SIMD relies on the regular structure of computations (such as those in image processing).

• It is often necessary to selectively turn off operations on certain data items. For this reason, most SIMD programming paradigms allow for an "activity mask", which determines if a processor should participate in a computation or not.

Page 21: PP16-lec3-arch2


Ex: Conditional Execution in SIMD Processors

Executing a conditional statement on an SIMD computer with four processors: (a) the conditional statement; (b) the execution of the statement in two steps.
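The two-step execution can be imitated in software with an explicit activity mask. The figure is not reproduced in this transcript, so the conditional used here (`if B == 0: C = A else: C = A / B`) and the data values are assumptions chosen for illustration; each list index stands for one processor.

```python
def simd_conditional(A, B):
    """Emulate `if B == 0: C = A else: C = A / B` on lock-step processors."""
    n = len(A)
    C = [None] * n
    mask = [b == 0 for b in B]       # activity mask: which processors take the "if" branch

    # Step 1: all processors see the instruction C = A, only masked-in ones execute it.
    for lane in range(n):
        if mask[lane]:
            C[lane] = A[lane]

    # Step 2: the mask is inverted and the "else" instruction C = A / B is broadcast.
    for lane in range(n):
        if not mask[lane]:
            C[lane] = A[lane] / B[lane]

    return C, mask


C, mask = simd_conditional([5, 4, 1, 0], [0, 2, 1, 0])
print(mask)
print(C)
```

In each step the processors that are masked out sit idle, so a conditional costs the time of both branches even though every processor only needs one of them.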

Page 22: PP16-lec3-arch2


Programming Models: MPMD / SPMD

• In contrast to SIMD processors, MIMD processors can execute different programs on different processors.

• There are two programming models for PP, called Multiple/Single Program Multiple-Data (MPMD / SPMD), which execute different/same programs on different processors.

• SIMD supports only the SPMD model. Although MIMD supports both models of programming (MPMD & SPMD), SPMD is the preferred choice due to software management.

• Examples of MIMD-platforms include current generation Sun Ultra Servers, SGI Origin Servers, multiprocessor PCs, workstation clusters & IBM SP.
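The two programming models can be contrasted in a few lines. This is only a sketch: the "processors" are simulated by calling the functions one after another, and the names `spmd_program`, `producer` and `consumer` are hypothetical.

```python
def spmd_program(rank, size, data):
    """SPMD: the single program every processor runs; behaviour differs only by rank."""
    my_part = data[rank::size]                  # each rank works on its own slice
    partial = sum(my_part)
    role = "root" if rank == 0 else "worker"    # rank-dependent branch inside one program
    return role, partial


def producer(data):
    """MPMD: first of two different programs, one per processor."""
    return [x * x for x in data]


def consumer(values):
    """MPMD: second program, run on another processor."""
    return sum(values)


data = list(range(10))

# SPMD: four copies of the same program, distinguished only by rank.
results = [spmd_program(rank, 4, data) for rank in range(4)]
print(results, sum(partial for _, partial in results))

# MPMD: two different programs cooperating on the same job.
print(consumer(producer(data)))
```

Having one program to build, distribute and debug is the software-management advantage that makes SPMD the usual choice on MIMD machines.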

Page 23: PP16-lec3-arch2


Comparison: SIMD vs MIMD

• Control flow: Synchronous in SIMD vs Asynchronous in MIMD

• Programming-model: SIMD supports only the SPMD prog-model, while MIMD supports both (SPMD & MPMD) prog-models

• Cost: SIMD computers require less hardware than MIMD computers (single control unit).

  - However, since SIMD processors are specially designed, they tend to be expensive and have long design cycles.

  - In contrast, MIMD processors can be built from inexpensive off-the-shelf components with relatively little effort in a short time.

• Flexibility: SIMD performs very well for specialized / regular applications but not for all applications, while MIMD is more flexible & general purpose.