7/29/2019 Lec4a Supp
Chapter 4A: The Processor
Mary Jane Irwin ( www.cse.psu.edu/~mji )
[Adapted from Computer Organization and Design, 4th Edition, Patterson & Hennessy]
CSE431 Chapter 4A.1 Irwin, PSU, 2008
Review: MIPS (RISC) Design Principles
Simplicity favors regularity
- fixed size instructions
- small number of instruction formats
- opcode always the first 6 bits
Smaller is faster
- limited instruction set
- limited number of registers in register file
- limited number of addressing modes
Make the common case fast
- arithmetic operands from the register file (load-store machine)
- allow instructions to contain immediate operands
Good design demands good compromises
- three instruction formats
The Processor: Datapath & Control
Simplified MIPS implementation containing only:
- memory-reference instructions: lw, sw
- arithmetic-logical instructions: add, sub, and, or, slt
- control flow instructions: beq, j
Generic implementation
- use the program counter (PC) to supply the instruction address and fetch the instruction from memory (and update the PC)
- decode the instruction (and read registers)
- execute the instruction
[Figure: Fetch (PC = PC+4) -> Decode -> Exec loop]
All instructions (except j) use the ALU after reading the registers
How? memory-reference? arithmetic? control flow?
Aside: Clocking Methodologies
The clocking methodology defines when data in a state element is valid and stable relative to the clock
- State elements - a memory element such as a register
- Edge-triggered - all state changes occur on a clock edge
Typical execution
- read contents of state elements -> send values through combinational logic -> write results to one or more state elements
[Figure: state element -> combinational logic -> state element, with one clock cycle between clock edges]
Assumes state elements are written on every clock cycle; if not, need explicit write control signal
- write occurs only when both the write control is asserted and the clock edge occurs
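The write rule above can be modeled in a few lines. This is only an illustrative sketch, not part of the original slides: the `Register` class and its method names are hypothetical, and the clock edge is modeled as an explicit method call.

```python
class Register:
    """Toy edge-triggered state element with an explicit write control signal."""

    def __init__(self, value=0):
        self.value = value           # output visible to combinational logic
        self._next = value           # input value waiting for the clock edge
        self._write_enable = False   # write control signal

    def set_input(self, data, write_enable=True):
        # Combinational logic drives the input during the cycle;
        # the stored value does not change yet.
        self._next = data
        self._write_enable = write_enable

    def clock_edge(self):
        # State changes only on the clock edge, and only if write is asserted.
        if self._write_enable:
            self.value = self._next
        self._write_enable = False


reg = Register(5)
reg.set_input(9, write_enable=False)
reg.clock_edge()
held = reg.value          # write control not asserted: old value is kept

reg.set_input(9, write_enable=True)
before_edge = reg.value   # still the old value until the edge arrives
reg.clock_edge()
updated = reg.value       # new value is visible after the edge
```

A state element such as the PC, which is written every cycle, would simply keep the write control asserted all the time.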
Fetching Instructions
Fetching instructions involves
- reading the instruction from the Instruction Memory
- updating the PC value to be the address of the next (sequential) instruction
[Figure: PC -> Instruction Memory (Read Address in, Instruction out); an Add unit computes PC + 4 and feeds it back to the PC]
- PC is updated every clock cycle, so it does not need an explicit write control signal, just a clock signal
- Reading from the Instruction Memory is a combinational activity, so it doesn't need an explicit read control signal
Decoding Instructions
Decoding instructions involves
- sending the fetched instruction's opcode and function field bits to the control unit
[Figure: the Instruction feeds the Control Unit and the Register File (inputs Read Addr 1, Read Addr 2, Write Addr, Write Data; outputs Read Data 1, Read Data 2)]
- reading two values from the Register File
  - Register File addresses are contained in the instruction
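Because the fields sit at fixed bit positions, decoding reduces to shifts and masks. The sketch below assumes the standard MIPS32 field layout (op in bits 31-26, rs 25-21, rt 20-16, rd 15-11, shamt 10-6, funct 5-0); the function name `decode_fields` is hypothetical.

```python
def decode_fields(instr):
    """Split a 32-bit MIPS instruction word into its fixed-position fields."""
    return {
        "op":     (instr >> 26) & 0x3F,   # bits 31-26, sent to the control unit
        "rs":     (instr >> 21) & 0x1F,   # bits 25-21, Read Addr 1
        "rt":     (instr >> 16) & 0x1F,   # bits 20-16, Read Addr 2
        "rd":     (instr >> 11) & 0x1F,   # bits 15-11, R-type destination
        "shamt":  (instr >> 6) & 0x1F,    # bits 10-6
        "funct":  instr & 0x3F,           # bits 5-0, sent to the ALU control
        "imm16":  instr & 0xFFFF,         # bits 15-0, I-type offset
        "target": instr & 0x03FFFFFF,     # bits 25-0, J-type target
    }


# add $8, $9, $10 encodes as 0x012A4020
fields = decode_fields(0x012A4020)
```

All fields are extracted unconditionally, which mirrors the hardware: the register file is read before the control unit has finished deciding what the instruction is.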
Executing R Format Operations
R format operations (add, sub, slt, and, or)
R-type: op [31-26] | rs [25-21] | rt [20-16] | rd [15-11] | shamt [10-6] | funct [5-0]
- perform operation (op and funct) on values in rs and rt
- store the result back into the Register File (into location rd)
[Figure: Register File outputs Read Data 1 and Read Data 2 feed the ALU (ALU control input; overflow and zero outputs); the ALU result returns to Write Data; RegWrite controls the Register File write]
- Note that the Register File is not written every cycle (e.g., sw), so we need an explicit write control signal for the Register File
Executing Load and Store Operations
Load and store operations involve
- compute memory address by adding the base register (read from the Register File during decode) to the 16-bit sign-extended offset field in the instruction
- store value (read from the Register File during decode) written to the Data Memory
- load value, read from the Data Memory, written to the Register File
[Figure: R format datapath extended with Sign Extend (16 -> 32) on the second ALU input and a Data Memory (Address, Write Data, Read Data) controlled by MemWrite and MemRead]
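The address computation for lw and sw (base register plus sign-extended 16-bit offset) can be sketched as follows. The helper names are hypothetical, and 32-bit wraparound of the sum is an assumption of this sketch.

```python
def sign_extend16(imm16):
    """Interpret the low 16 bits as a two's complement value (16 -> 32 extension)."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def effective_address(base, imm16):
    """ALU add of the base register value and the sign-extended offset."""
    return (base + sign_extend16(imm16)) & 0xFFFFFFFF


# lw $t0, -4($sp) with $sp = 0x1000: the offset field holds 0xFFFC
addr_negative = effective_address(0x1000, 0xFFFC)
# sw $t0, 8($sp)
addr_positive = effective_address(0x1000, 0x0008)
```

Sign extension is what lets a 16-bit field reach addresses both above and below the base register.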
Executing Branch Operations
Branch operations involve
- compare the operands read from the Register File during decode for equality (zero ALU output)
- compute the branch target address by adding the updated PC to the 16-bit sign-extended offset field in the instruction, shifted left by 2
[Figure: the PC + 4 adder output and the sign-extended (16 -> 32), shifted-left-2 offset feed a second Add unit that produces the branch target address; the ALU compares Read Data 1 and Read Data 2 and sends its zero output to the branch control logic]
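A minimal sketch of the beq next-PC selection, assuming the usual MIPS convention that the offset counts words relative to PC + 4. The function names are hypothetical.

```python
def sign_extend16(imm16):
    """Interpret the low 16 bits as a two's complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def branch_target(pc, imm16):
    """Updated PC (PC + 4) plus the sign-extended offset shifted left by 2."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF


def next_pc_beq(pc, rs_val, rt_val, imm16):
    """Select the branch target when the ALU subtraction yields zero."""
    zero = ((rs_val - rt_val) & 0xFFFFFFFF) == 0
    return branch_target(pc, imm16) if zero else (pc + 4) & 0xFFFFFFFF


taken = next_pc_beq(0x100, 7, 7, 3)        # operands equal, forward by 3 words
not_taken = next_pc_beq(0x100, 7, 8, 3)    # operands differ, fall through
backward = next_pc_beq(0x100, 0, 0, 0xFFFF)  # offset -1 word
```

The shift by 2 converts the word offset into a byte offset, since every instruction is 4 bytes long.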
Executing Jump Operations
Jump operation involves
- replace the lower 28 bits of the PC with the lower 26 bits of the fetched instruction shifted left by 2 bits
[Figure: the low 26 bits of the instruction read from Instruction Memory pass through Shift left 2 to form the 28-bit low part of the jump address; an Add unit computes PC + 4]
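The jump address formation can be sketched as below. The function name is hypothetical, and taking the upper four bits from PC + 4 (rather than from the PC itself) is an assumption that follows the standard MIPS definition.

```python
def jump_target(pc, instr):
    """Form the 32-bit jump address from the PC and the fetched j instruction."""
    low28 = (instr & 0x03FFFFFF) << 2      # 26-bit target field shifted left 2
    pc_plus4 = (pc + 4) & 0xFFFFFFFF
    return (pc_plus4 & 0xF0000000) | low28  # keep bits 31-28, replace bits 27-0


# j with target field 0x0100000 (opcode 2 in bits 31-26) -> word 0x08100000
near = jump_target(0x00400000, 0x08100000)
far = jump_target(0x10000008, 0x08100000)
```

Because only 28 bits are replaced, a jump can reach any instruction within the current 256 MB region selected by the upper four PC bits.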
Creating a Single Datapath from the Parts
Assemble the datapath segments and add control lines and multiplexors as needed
Single cycle design - fetch, decode and execute each instruction in one clock cycle
- no datapath resource can be used more than once per instruction, so some must be duplicated (e.g., separate Instruction Memory and Data Memory, several adders)
- multiplexors needed at the input of shared elements with control lines to do the selection
- write signals to control writing to the Register File and Data Memory
Cycle time is determined by length of the longest path
Fetch, R, and Memory Access Portions
[Figure: combined datapath for instruction fetch, R format, and memory access: PC, Add (+4), Instruction Memory, Register File, Sign Extend, ALU (ovf and zero outputs), and Data Memory, with control signals RegWrite, ALUSrc, ALU control, MemWrite, MemRead, and MemtoReg]
Adding the Control
Selecting the operations to perform (ALU, Register File and Memory read/write)
Controlling the flow of data (multiplexor inputs)
R-type: op [31-26] | rs [25-21] | rt [20-16] | rd [15-11] | shamt [10-6] | funct [5-0]
I-type: op [31-26] | rs [25-21] | rt [20-16] | address offset [15-0]
J-type: op [31-26] | target address [25-0]
Observations
- op field always in bits 31-26
- addr of registers to be read are always specified by the rs field (bits 25-21) and rt field (bits 20-16); for lw and sw rs is the base register
- addr. of register to be written is in one of two places: in rt (bits 20-16) for lw; in rd (bits 15-11) for R-type instructions
- offset for beq, lw, and sw always in bits 15-0
Single Cycle Datapath with Control Unit
[Figure: single cycle datapath with Control Unit. Instr[31-26] feeds the Control Unit, which generates RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite; Instr[25-21] and Instr[20-16] select the registers to read; a multiplexor controlled by RegDst selects Instr[20-16] or Instr[15-11] as the write address; Instr[15-0] is sign-extended (16 -> 32); Instr[5-0] feeds the ALU control; PCSrc selects between PC + 4 and the branch target]
R-type Instruction Data/Control Flow
[Figure: the single cycle datapath with Control Unit, showing the data and control flow for an R-type instruction]
Load Word Instruction Data/Control Flow
[Figure: the single cycle datapath with Control Unit, showing the data and control flow for a load word instruction]
Branch Instruction Data/Control Flow
[Figure: the single cycle datapath with Control Unit, showing the data and control flow for a branch instruction]
Adding the Jump Operation
[Figure: the single cycle datapath with Control Unit extended for jump. Instr[25-0] is shifted left 2 (26 -> 28 bits) and combined with PC+4[31-28] to form the 32-bit jump address; an additional multiplexor controlled by the new Jump signal selects the jump address as the next PC]
Instruction Times (Critical Paths)
What is the clock cycle time assuming negligible delays for muxes, control unit, sign extend, PC access, shift left 2, wires, setup and hold times except:
- Instruction and Data Memory (200 ps)
- ALU and adders (200 ps)
- Register File access, reads or writes (100 ps)
Instr. | I Mem | Reg Rd | ALU Op | D Mem | Reg Wr | Total
R-type |       |        |        |       |        |
load   |       |        |        |       |        |
store  |       |        |        |       |        |
beq    |       |        |        |       |        |
jump   |       |        |        |       |        |
Instruction Critical Paths
Calculate cycle time assuming negligible delays for muxes, control unit, sign extend, PC access, shift left 2, wires, setup and hold times except:
- Instruction and Data Memory (200 ps)
- ALU and adders (200 ps)
- Register File access, reads or writes (100 ps)
All times in ps:
Instr. | I Mem | Reg Rd | ALU Op | D Mem | Reg Wr | Total
R-type | 200   | 100    | 200    |       | 100    | 600
load   | 200   | 100    | 200    | 200   | 100    | 800
store  | 200   | 100    | 200    | 200   |        | 700
beq    | 200   | 100    | 200    |       |        | 500
jump   | 200   |        |        |       |        | 200
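The totals can be recomputed by summing the delay of each unit an instruction class passes through. This is a small sketch; the dictionary names are hypothetical, and the 100 ps register file delay is read off the Reg Rd and Reg Wr columns.

```python
# Unit delays in picoseconds, as given for the critical path table.
DELAY_PS = {"I Mem": 200, "Reg Rd": 100, "ALU Op": 200, "D Mem": 200, "Reg Wr": 100}

# Functional units on each instruction class's critical path.
UNITS_USED = {
    "R-type": ["I Mem", "Reg Rd", "ALU Op", "Reg Wr"],
    "load":   ["I Mem", "Reg Rd", "ALU Op", "D Mem", "Reg Wr"],
    "store":  ["I Mem", "Reg Rd", "ALU Op", "D Mem"],
    "beq":    ["I Mem", "Reg Rd", "ALU Op"],
    "jump":   ["I Mem"],
}


def instruction_time_ps(name):
    """Sum the delays of every unit on the instruction's critical path."""
    return sum(DELAY_PS[unit] for unit in UNITS_USED[name])


totals = {name: instruction_time_ps(name) for name in UNITS_USED}
# The single cycle clock must accommodate the slowest instruction.
cycle_time_ps = max(totals.values())
```

The load instruction uses all five units, so it sets the single cycle clock period for every instruction.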
Single Cycle Disadvantages & Advantages
Uses the clock cycle inefficiently - the clock cycle must be timed to accommodate the slowest instruction
- especially problematic for more complex instructions like floating point multiply
[Figure: clock timing diagram; lw fills Cycle 1, while sw finishes early in Cycle 2 and leaves the rest of the cycle as waste]
May be wasteful of area since some functional units (e.g., adders) must be duplicated since they cannot be shared during a clock cycle
but
Is simple and easy to understand
How Can We Make It Faster?
Start fetching and executing the next instruction before the current one has completed
- Pipelining - (all?) modern processors are pipelined for performance
- Remember the performance equation: CPU time = CPI * CC * IC
Under ideal conditions and with a large number of instructions, the speedup from pipelining is approximately equal to the number of pipe stages
- A five stage pipeline is nearly five times faster because the CC is nearly five times faster
Fetch (and execute) more than one instruction at a time
- Superscalar processing - stay tuned
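The ideal speedup can be derived from the performance equation. The derivation below is a sketch under stated assumptions that are not in the slides: n instructions, k perfectly balanced stages, a pipelined cycle time of CC_single / k, and no stalls. The symbols n, k, and CC_single are introduced here for illustration.

```latex
\begin{aligned}
T_{\text{single}} &= n \cdot CC_{\text{single}} \\
T_{\text{pipe}}   &= (k + n - 1) \cdot \frac{CC_{\text{single}}}{k} \\
\text{speedup}    &= \frac{T_{\text{single}}}{T_{\text{pipe}}}
                   = \frac{n\,k}{k + n - 1}
                   \;\longrightarrow\; k \quad (n \to \infty)
\end{aligned}
```

The k - 1 extra cycles are the time to fill the pipeline; they become negligible only when n is large, which is why the claim is stated for a large number of instructions.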
The Five Stages of Load Instruction
     | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5
lw   | IFetch  | Dec     | Exec    | Mem     | WB
IFetch: Instruction Fetch and Update PC
Dec: Registers Fetch and Instruction Decode
Exec: Execute R-type; calculate memory address
Mem: Read/write the data from/to the Data Memory
WB: Write the result data into the register file
A Pipelined MIPS Processor
Start the next instruction before the current one has completed
- improves throughput - total amount of work done in a given time
- instruction latency (execution time, delay time, response time - time from the start of an instruction to its completion) is not reduced
[Figure: pipeline timing over Cycles 1 to 8; successive instructions (sw and an R-type among them) each pass through IFetch, Dec, Exec, Mem, WB, each starting one cycle after the previous instruction]
- clock cycle (pipeline stage time) is limited by the slowest stage
- for some stages don't need the whole clock cycle (e.g., WB)
- for some instructions, some stages are wasted cycles (i.e., nothing is done during that cycle for that instruction)
Single Cycle versus Pipeline
Single Cycle Implementation (CC = 800 ps):
[Figure: clock timing diagram; lw fills Cycle 1, while sw finishes early in Cycle 2 and leaves the rest of the cycle as waste]
Pipeline Implementation (CC = 200 ps):
[Figure: lw and the following instructions (ending with an R-type) each pass through IFetch, Dec, Exec, Mem, WB, overlapped one stage apart]
To complete an entire instruction in the pipelined case takes 1000 ps (as compared to 800 ps for the single cycle case). Why?
How long does each take to complete 1,000,000 adds?
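One way to work the 1,000,000 adds question is sketched below. It assumes an ideal five stage pipeline with no stalls, so the pipelined time is the fill time of four cycles plus one cycle per instruction. The function names are hypothetical.

```python
def single_cycle_time_ps(n_instructions, cc_ps=800):
    """Every instruction takes one full (slowest-instruction) clock cycle."""
    return n_instructions * cc_ps


def pipelined_time_ps(n_instructions, stages=5, cc_ps=200):
    """(stages - 1) cycles to fill the pipeline, then one completion per cycle."""
    return (stages - 1 + n_instructions) * cc_ps


n = 1_000_000
single_ps = single_cycle_time_ps(n)   # 800,000,000 ps = 800 microseconds
pipe_ps = pipelined_time_ps(n)        # 200,000,800 ps, about 200 microseconds
speedup = single_ps / pipe_ps         # close to 4, not 5
latency_ps = 5 * 200                  # one instruction through all five stages
```

The speedup is about 4 rather than 5 because the stages are unbalanced: every stage is stretched to the 200 ps of the slowest stage, which is also why a single instruction takes 1000 ps instead of 800 ps.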
MIPS Pipeline Datapath Additions/Mods
Pipeline stages: IF: IFetch | ID: Dec | EX: Execute | MEM: MemAccess | WB: WriteBack
[Figure: the single cycle datapath divided into five stages by the pipeline state registers IF/ID, ID/EX, EX/MEM, and MEM/WB, all clocked by the System Clock]
MIPS Pipeline Control Path Modifications
- All control signals can be determined during Decode and held in the state registers between pipeline stages
[Figure: pipelined datapath with a Control unit in the decode stage; the control signals (RegDst, ALUOp, ALUSrc, Branch, MemRead, MemtoReg, RegWrite, PCSrc) are carried forward through the pipeline state registers to the stage that uses them]
Pipeline Control
IF Stage: read Instr Memory (always asserted) and write PC (on System Clock)
Stage  | EX     | EX     | EX     | EX     | MEM  | MEM     | MEM      | WB       | WB
Signal | RegDst | ALUOp1 | ALUOp0 | ALUSrc | Brch | MemRead | MemWrite | RegWrite | MemtoReg
R-type | 1      | 1      | 0      | 0      | 0    | 0       | 0        | 1        | 0
lw     | 0      | 0      | 0      | 1      | 0    | 1       | 0        | 1        | 1
sw     | X      | 0      | 0      | 1      | 0    | 0       | 1        | 0        | X
beq    | X      | 0      | 1      | 0      | 1    | 0       | 0        | 0        | X
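The way control signals travel with an instruction can be sketched as a lookup table grouped by the stage that consumes each signal. The names `CONTROL`, `CARRIED`, and `signals_in` are hypothetical. The sw row uses the standard textbook values, which is an assumption here, and `None` stands for a don't-care (X).

```python
X = None  # don't care

# Control signals per instruction class, grouped by the stage that uses them.
CONTROL = {
    "R-type": {"EX": {"RegDst": 1, "ALUOp1": 1, "ALUOp0": 0, "ALUSrc": 0},
               "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
               "WB": {"RegWrite": 1, "MemtoReg": 0}},
    "lw":     {"EX": {"RegDst": 0, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
               "MEM": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
               "WB": {"RegWrite": 1, "MemtoReg": 1}},
    "sw":     {"EX": {"RegDst": X, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},  # assumed row
               "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
               "WB": {"RegWrite": 0, "MemtoReg": X}},
    "beq":    {"EX": {"RegDst": X, "ALUOp1": 0, "ALUOp0": 1, "ALUSrc": 0},
               "MEM": {"Branch": 1, "MemRead": 0, "MemWrite": 0},
               "WB": {"RegWrite": 0, "MemtoReg": X}},
}

# Each pipeline register only needs the groups for the stages still ahead.
CARRIED = {"ID/EX": ("EX", "MEM", "WB"), "EX/MEM": ("MEM", "WB"), "MEM/WB": ("WB",)}


def signals_in(instr, pipe_reg):
    """Control signals held in the given pipeline register for this instruction."""
    merged = {}
    for group in CARRIED[pipe_reg]:
        merged.update(CONTROL[instr][group])
    return merged


lw_at_id_ex = signals_in("lw", "ID/EX")     # all nine signals
lw_at_mem_wb = signals_in("lw", "MEM/WB")   # only the write-back signals remain
```

Dropping each group once its stage has passed is what keeps the later pipeline registers narrower than ID/EX.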
Graphically Representing MIPS Pipeline
[Figure: one instruction drawn as the stage sequence IM, Reg, ALU, DM, Reg]
Can help with answering questions like:
- How many cycles does it take to execute this code?
- What is the ALU doing during cycle 4?
- Is there a hazard, why does it occur, and how can it be fixed?
Why Pipeline? For Performance!
[Figure: Time (clock cycles) on the horizontal axis and instruction order on the vertical axis; each instruction (Inst 1 through Inst 4 are labeled) passes through IM, Reg, ALU, DM, Reg, offset by one cycle from the previous one]
Once the pipeline is full, one instruction is completed every cycle, so CPI = 1
Can Pipelining Get Us Into Trouble?
Yes: Pipeline Hazards
- structural hazards: attempt to use the same resource by two different instructions at the same time
- data hazards: attempt to use data before it is ready
  - An instruction's source operand(s) are produced by a prior instruction still in the pipeline
- control hazards: attempt to make a decision about program control flow before the condition has been evaluated and the new PC target address calculated
  - branch and jump instructions, exceptions
Can usually resolve hazards by waiting
- pipeline control must detect the hazard
- and take action to resolve hazards
A Single Memory Would Be a Structural Hazard
[Figure: Time (clock cycles) vs. instruction order, each instruction passing through Mem, Reg, ALU, Mem, Reg; in one cycle an earlier instruction is reading data from memory while a later instruction is reading an instruction from memory]
Fix with separate instr and data memories (I$ and D$)
How About Register File Access?
[Figure: Time (clock cycles) vs. instruction order for add $1, followed by Inst 1, Inst 2, and add $2,$1, each passing through IM, Reg, ALU, DM, Reg; the first add writes $1 in the same cycle in which the last add reads it. Two clock edges are labeled: the clock edge that controls register writing and the clock edge that controls loading of pipeline state registers]
Fix register file access hazard by doing reads in the second half of the cycle and writes in the first half
Register Usage Can Cause Data Hazards
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg per instruction) for the sequence below, in instruction order]
add $1,
sub $4,$1,$5
and $6,$1,$7
or $8,$1,$9
xor $4,$1,$5
Read before write data hazard
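A read before write hazard check for a sequence like this one can be sketched as follows. The sketch assumes a five stage pipeline with no forwarding and a register file that writes in the first half of the cycle and reads in the second half, so only the two instructions immediately after a producer can read a stale value. The `Instr` tuple and `raw_hazards` function are hypothetical, and the source operands of the first add are left empty because they are not given.

```python
from collections import namedtuple

Instr = namedtuple("Instr", "text dest srcs")


def raw_hazards(program, window=2):
    """Return (producer, consumer, register) triples for read before write hazards.

    window is how many following instructions read the register file
    before the producer has written its result back.
    """
    hazards = []
    for j, consumer in enumerate(program):
        for i in range(max(0, j - window), j):
            producer = program[i]
            if producer.dest is not None and producer.dest in consumer.srcs:
                hazards.append((producer.text, consumer.text, producer.dest))
    return hazards


program = [
    Instr("add $1,", "$1", ()),                 # source operands not specified
    Instr("sub $4,$1,$5", "$4", ("$1", "$5")),
    Instr("and $6,$1,$7", "$6", ("$1", "$7")),
    Instr("or $8,$1,$9", "$8", ("$1", "$9")),
    Instr("xor $4,$1,$5", "$4", ("$1", "$5")),
]
found = raw_hazards(program)
```

Under these assumptions the sub and the and read $1 too early, the or reads it in the same cycle it is written, and the xor reads it one cycle later.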
Loads Can Cause Data Hazards
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg per instruction) for the sequence below, in instruction order]
lw $1,
sub $4,$1,$5
and $6,$1,$7
or $8,$1,$9
xor $4,$1,$5
Load-use data hazard
Branch Instructions Cause Control Hazards
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg per instruction) in instruction order, including Inst 3 and Inst 4]
Other Pipeline Structures Are Possible
What about the (slow) multiply operation?
- make the clock twice as slow or
- let it take two cycles (since it doesn't use the DM stage)
[Figure: pipeline IM, Reg, ALU, DM, Reg with a MUL unit spanning two stages]
What if the data memory access is twice as slow as the instruction memory?
- make the clock twice as slow or
- let data memory access take two cycles (and keep the same clock rate)
[Figure: pipeline IM, Reg, ALU, DM1, DM2, Reg]
Other Sample Pipeline Alternatives
[Figure: two alternative pipelines. ARM7: stages IM, Reg, EX (labels: IM access; decode, reg access; ALU op, DM access, shift/rotate, commit result). A deeper pipeline with stages IM1, IM2, Reg, SHFT, ALU, DM1, DM2, Reg (labels: PC update, BTB access, start IM access; IM access; decode, reg 1 access; shift/rotate, reg 2 access; ALU op; start DM access, exception; DM write)]
Summary
All modern day processors use pipelining
Pipelining doesn't help latency of single task, it helps throughput of entire workload
Potential speedup: a CPI of 1 and a fast CC
- Unbalanced pipe stages make for inefficiencies
- The time to fill the pipeline and the time to drain it can impact speedup for deep pipelines and short code runs
Must detect and resolve hazards
- Stalling negatively affects CPI (makes CPI greater than the ideal of 1)
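The effect of stalls on CPI can be made concrete with a short calculation. This is a sketch with made-up example numbers; the function name `effective_cpi` is hypothetical.

```python
def effective_cpi(instruction_count, stall_cycles, ideal_cpi=1.0):
    """Average cycles per instruction once stall cycles are included."""
    return ideal_cpi + stall_cycles / instruction_count


# Illustrative numbers only: 1000 instructions incurring 250 stall cycles.
cpi = effective_cpi(1000, 250)
no_stall_cpi = effective_cpi(1000, 0)
```

Each stall cycle is a cycle in which no instruction completes, so the average can only move above the ideal value of 1.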