Asanovic/DevadasSpring 2002
6.823
Simple InstructionPipelining
Krste AsanovicLaboratory for Computer Science
Massachusetts Institute of Technology
Asanovic/Devadas Spring 2002Processor Performance 6.823
Equation Time = Instructions * Cycles * Time
Program Program Instruction Cycle
� Instructions per program depends on source code, compiler technology, and ISA
� Microcoded DLX from last lecture had cycles per instruction (CPI) of around 7 minimum
� Time per cycle for microcoded DLX fixed by microcode cycle time
— mostly ROM access + next µPC select logic
Asanovic/DevadasSpring 2002
6.823Pipelined DLX
To pipeline DLX:
� First build unpipelined DLX with CPI=1
� Next, add pipeline registers to reduce cycle time while maintaining CPI=1
Asanovic/DevadasSpring 2002
6.823A Simple Memory Model WriteEnable
Address ReadData
WriteData
MAGIC RAM
Clock
Reads and writes are always completed in one cycle� a Read can be done any time (i.e. combinational)� a Write is performed at the rising clock edge
if it is enabled ⇒ the write address, data, and enable
must be stable at the clock edge
Asanovic/DevadasSpring 2002
6.823Datapath for ALU Instructions
OpCode
rd1
rs1 rs2
rd2
inst<25:21> inst<20:16>
GPRs z
ALU
inst<15:0>
inst<31:26> <5:0>
ExtSel OpSel
ALU Control
Imm Ext
BSrc
clk
inst<15:11>
RegWrite
ws wd
we
RegDst
0x4 Add
clk
addr inst
Inst. Memory
PC
0x4 Add
clk
addr inst
Inst. Memory
PC
rf2 / rf3 Reg / Imm
6 5 5 5 5 6 rf3 ← (rf1) func (rf2)
opcode rf1 rf2 immediate rf2 ← (rf1) op immediate 0 rf1 rf2 func 0 rf3
Asanovic/DevadasDatapath for Memory Spring 2002 6.823
Instructions Should program and data memory be separate?
Harvard style: separate (Aiken and Mark 1 influence) - read-only program memory - read/write data memory
at some level the two memories have to be the same
Princeton style: the same (von Neumann’s influence) - A Load or Store instruction requires
accessing the memory more than once during its execution
Asanovic/DevadasLoad/Store Instructions: Spring 2002 6.823
Harvard-Style Datapath
RegWrite MemWrite 0x4
Add
clk
addr inst
Inst. Memory
PC
WBSrc ALU / Mem
“base”
disp
ALU Control
z ALU
clk
addr
wdata
rdataData Memory
we
clk
rd1
GPRs
rs1 rs2
ws wd rd2
we
Imm Ext
OpCode RegDst ExtSel OpSel BSrc
6 5 5 16 addressing mode (rf1) + displacementopcode rf1 rf2 ent displacem
31 26 25 21 20 16 15 0
rf1 is the base registerrf2 is the destination of a Load or the source for a Store
Asanovic/Devadas Spring 2002
Memory Hierarchy c.2002 6.823
Desktop & Small Server
Proc I$
D$
0.5-2ns 2-3 clk
8~64KB
On-chip Caches
L2
<10ns 5~15 clk 0.25-2MB
SRAM/ eDRAM
Interleaved Banks of DRAM
Hard DiskOff-chip L3 Cache
< 25ns 15~50 clk
1~8MB
~150ns100~300 clk64M~1GB
~10ms seek time~107 clk
20~100GB
Our memory model is a good approximation of the hierarchical memory system when we hit in the on-chip cache
Asanovic/Devadas Spring 2002Conditional Branches 6.823
PCSrc ( ~j / j ) RegWrite MemWrite WBSrc
0x4
clk
RegDst BSrcExtSelOpCode
Add
z
Add
OpSel
clk
zero?
clk
addr inst
Inst. Memory
PC rd1
GPRs
rs1 rs2 ws wd rd2
we
Imm Ext
addr
wdata
rdata Data Memory
we ALU
ALU Control
Asanovic/Devadas Spring 2002Register-Indirect Jumps 6.823
PCSrc ( ~j / j RInd / j PCR ) RegWrite MemWrite WBSrc
0x4
clk
Add
z
Add clk
Jump & Link?
clk
addr inst
Inst. Memory
PC rd1
GPRs
rs1 rs2
ws wd rd2
we
Imm Ext
addr
wdata
rdataData Memory
we ALU
ALU Control
OpCode RegDst ExtSel OpSel BSrc zero?
Asanovic/DevadasSpring 2002Jump & Link 6.823
0x4
clk
RegDst
RegWrite
BSrc zero?
WBSrc ALU / Mem / PC
31
ExtSelOpCode
Add
z
Add
OpSel
clk
MemWritePCSrc
clk
addr inst
Inst. Memory
PC rd1
GPRs
rs1 rs2
ws wd rd2
we
Imm Ext
addr
wdata
rdataData Memory
we ALU
ALU Control
rf3 / rf2 / R31
Asanovic/Devadas Spring 2002PC-Relative Jumps 6.823
RegWrite MemWrite WBSrc
0x4
clk
RegDst
31
OpCode
Add
z
Add
clk
PCSrc
clk
addr inst
Inst. Memory
PC rd1
GPRs
rs1 rs2
ws wd rd2
we
Imm Ext
addr
wdata
rdataData Memory
we ALU
No new datapath required
ALU Control
ExtSel OpSel BSrc zero?Ext16 / Ext26
Asanovic/DevadasSpring 2002
6.823
Hardwired Control is pureCombinational Logic:
Unpipelined DLX
op code
zero?
combinational logic
ExtSel BSrc OpSel MemWrite WBSrc RegDst RegWrite PCSrc
Asanovic/Devadas Spring 2002ALU Control & Immediate 6.823
Extension
Inst<31:26> (Opcode)
Decode Map
Inst<5:0> (Func)
ALUop
0? +
OpSel ( Func, Op, +, 0? )
ExtSel ( sExt16, uExt16,sExt26, High16)
Asanovic/Devadas Spring 2002Hardwired Control worksheet 6.823
PCSrc RegWrite MemWrite WBSrc
0x4
clk
PCR / RInd / ~j ALU / Mem / PC
inst<25:21> inst<20:16>
inst<15:11>
inst<25:0>
inst<31:26><5:0>
31
0x4 Add
z
Add
RegDst
BSrc Reg / Imm
zero?
OpCode
clk
Add
clk
addr inst
Inst. Memory
PC rd1
GPRs
rs1 rs2
ws wd rd2
we
Imm Ext
addr
wdata
rdataData Memory
we ALU
ALU Control
ExtSel OpSelrf2 / rf3 / sExt16/uExt16/ Func/R31 sExt26/High16 Op/+ / 0?
Asanovic/DevadasHardwired Control Table Spring 2002 6.823
BSrc = Reg / Imm WBSrc = ALU / Mem / PC RegDst = rf2 / rf3 / R31 PCSrc1 = j / ~j PCSrc2 = PCR / RInd
JR JALR
JAL J BEQZz=1
BEQZz=0
SW LW ALUiu ALUi ALUu ALU
PC Src
RegDest
WB Src
RegWrite
Mem Write
OpSel
BSrcExt Sel
Opcode
* * o yes RIndPC R31 RInd* * o no * *
PCRsExt26 * * no yes PC R31 PCRsExt26 * * no no * *
~jsExt16 * ? no no * * PCRsExt16 * ? no no * *
~jsExt16 Imm + yes no * *
~jImm Op no yes ALU rf2
~j* Reg Func no yes ALU rf3 ~j* Reg Func no yes ALU rf3
~jsExt16 Imm Op no yes ALU rf2
~jsExt16 Imm + no yes Mem rf2 uExt16
* n* n
00
Asanovic/DevadasSpring 2002
6.823
Hardwired Unpipelined Machine
� Simple
� One instruction per cycle
� Why wasn’t this a popular machine style?
Asanovic/DevadasSpring 2002
6.823Unpipelined DLX
Clock period must be sufficiently long for all of the following steps to be “completed”:
1. instruction fetch 2. decode and register fetch 3. ALU operation 4. data fetch if required 5. register write-back setup time
⇒ tC > tIFetch + tRFetch + tALU+ tDMem+ tRWB
� At the rising edge of the following clock, the PC, the register file and the memory are updated
Asanovic/Devadas Spring 2002Pipelined DLX Datapath 6.823
addr
wdata
rdata Data Memory
we ALU
Imm Ext
PC
0x4 Add
addr rdata
Inst. Memory
rd1
GPRs
rs1 rs2
ws wd rd2
we
write -back phase
fetch phase
execute phase
decode & Reg-fetch phase
memory phase
IR PC
Clock period can be reduced by dividing the execution of an instruction into multiple cycles
tC > max {tIM, tRF, tALU, tDM, tRW} = tDM (probably) However, CPI will increase unless instructions are pipelined
Asanovic/Devadas Spring 2002An Ideal Pipeline 6.823
stage stage stage 2 3 4
stage 1
� All objects go through the same stages
� No sharing of resources between any two stages
� Propagation delay through all pipeline stages is equal
� The scheduling of an object entering the pipeline is not affected by the objects in other stages
These conditions generally hold for industrial assembly lines. An instruction pipeline, however, cannot satisfy the last condition. Why?
Asanovic/DevadasSpring 2002
6.823Pipelining History
� Some very early machines had limited pipelined execution (e.g., Zuse Z4, WWII) � Usually overlap fetch of next instruction with current execution
� IBM Stretch first major “supercomputer” incorporating extensive pipelining, result bypassing, and branch prediction � project started in 1954, delivered in 1961 � didn’t meet initial performance goal of 100x faster with 10x
faster circuits � up to 11 macroinstructions in pipeline at same time � microcode engine highly pipelined also (up to 6
microinstructions in pipeline at same time) � Stretch was origin of 8-bit byte and lower case characters,
carried on into IBM 360
Asanovic/DevadasSpring 2002
6.823How to divide the datapathinto stages
Suppose memory is significantly slower than other stages. In particular, suppose
tIM = tDM = 10 units tALU = 5 units tRF = tRW = 1 unit
Since the slowest stage determines the clock, it may be possible to combine some stages without any loss of performance
Asanovic/DevadasSpring 2002
6.823Minimizing Critical Path
write -back phase
fetch phase
decode & Reg-fetch phase
memory phase
addr
wdata
rdata Data Memory
we ALU
Imm Ext
PC
0 x4 Add
addr rdata
Inst. Memory
rd1
GPRs
rs1 rs2
ws wd rd2
we
IR
& execute
tC > max {tIM, tRF + tALU, tDM, tRW}
Write-back stage takes much less time than other stages. Suppose we combined it with the memory phase
⇒ increase the critical path by 10%
Asanovic/DevadasMaximum Speedup by Spring 2002 6.823
PipeliningFor the 4-stage pipeline, given
tIM = tDM = 10 units, tALU = 5 units, tRF = tRW= 1 unit tC could be reduced from 27 units to 10 units
⇒ speedup = 2.7
However, if tIM = tDM = tALU = tRF = tRW = 5 unitsThe same 4-stage pipeline can reduce tC from 25 units to 10 units
⇒ speedup = 2.5
But, since tIM = tDM = tALU = tRF = tRW, it is possible to achieve higher speedup with more stages in the pipeline.
A 5-stage pipeline can reduce tC from 25 units to 5 units ⇒ speedup = 5
Asanovic/DevadasSpring 2002
6.823
Technology Assumptions We will assume
• A small amount of very fast memory (caches) backed up by a large, slower memory • Fast ALU (at least for integers) • Multiported Register files (slower!).
It makes the following timing assumption valid
tIM ≈ tRF ≈ tALU ≈ tDM ≈ tRW
A 5-stage pipelined Harvard-style architecture will be the focus of our detailed design
Asanovic/Devadas5-Stage Pipelined Execution Spring 2002 6.823
fetch
PC
0x4 Add
addr
wdata
rdata Memory
we
executedecode & Reg-fetch memory
IR rd1
GPRs
rs1 rs2 ws wd rd2
we
Imm Ext
addr
wdata
rdata Memory
we ALU
write -back
phase phase phase phase phase(IF) (ID) (EX) (MA) (WB)
time t0 t1 t2 t3 t4 t5 t6 t7 . . . .instruction1 IF1 ID1 EX1 MA1 WB1
instruction2 IF2 ID2 EX2 MA2 WB2
instruction3 IF3 ID3 EX3 MA3 WB3
instruction4 IF4 ID4 EX4 MA4 WB4
instruction5 IF5 ID5 EX5 MA5 WB5
Asanovic/Devadas5-Stage Pipelined Execution Spring 2002 6.823
Resource Usage Diagram
fetch
PC
0x4 Add
addr
wdata
rdata Memory
we
executedecode & Reg-fetch memory
IR rd1
GPRs
rs1 rs2 ws wd rd2
we
Imm Ext
addr
wdata
rdata Memory
we ALU
write -back
phase phase phase phase phase(IF) (ID) (EX) (MA) (WB)
time t0 t1 t2 t3 t4 t5 t6 t7 . . . . IF I1 I2 I3 I4 I5
ID I1 I2 I3 I4 I5
EX I1 I2 I3 I4 I5
MA I1 I2 I3 I4 I5
WB I1 I2 I3 I4 I5Res
ourc
es
Asanovic/DevadasPipelined Execution: Spring 2002 6.823
ALU Instructions
not quite correct!
31PC A
B
Y
R
MD1 MD2
addr inst
Inst Memory
0x4 Add
IR rd1
GPRs
rs1 rs2
ws wd rd2
we
Imm Ext
addr
wdata
rdata Data Memory
we ALU
Asanovic/DevadasPipelined Execution: Spring 2002 6.823
Need for Several IR’s
31 IR IR IR
PC A
B
Y
R
MD1 MD2
addr inst
Inst Memory
0x4 Add
IR rd1
GPRs
rs1 rs2
ws wd rd2
we
Imm Ext
addr
wdata
rdata Data Memory
we ALU
Asanovic/DevadasSpring 2002
6.823IRs and Control points
31 IR IR IR
PC A
B
Y
R
MD1 MD2
addr inst
Inst Memory
0x4 Add
IR rd1
GPRs
rs1 rs2
ws wd rd2
we
Imm Ext
addr
wdata
rdata Data Memory
we ALU
Are control points connected properly? - Load/Store instructions - ALU instructions
Asanovic/DevadasSpring 2002Pipelined DLX Datapath 6.823
without jumps
31 IR IR IR
PC A
B
Y
R
MD1 MD2
addr inst
Inst Memory
0x4 Add
IR rd1
GPRs
rs1 rs2
ws wd rd2
we
Imm Ext
addr
wdata
rdata Data Memory
we ALU
ExtSel
OpSel
BSrc
RegDst
WBSrc MemWrite
RegWrite
Asanovic/DevadasSpring 2002
6.823
How Instructions can Interact with each other in a Pipeline
• An instruction in the pipeline may need a resource being used by another instruction in the pipeline
structural hazard
• An instruction may produce data that is needed by a later instruction
data hazard
• In the extreme case, an instruction may determine the next instruction to be executed
control hazard (branches, interrupts,...)
Asanovic/DevadasSpring 2002
6.823Feedback to Resolve Hazards
stage 1
stage 2
stage 3
stage 4
FB1 FB2 FB3 FB4
Controlling pipeline in this manner works providedthe instruction at stage i+1 can complete withoutany interference from instructions in stages 1 to i
(otherwise deadlocks may occur)
Feedback to previous stages is used to stall or kill instructions
Top Related