Download - Lec4a Supp

7/29/2019 Lec4a Supp

1/43

Chapter 4A: The Processor,

Mary Jane Irwin ( www.cse.psu.edu/~mji )

[Adapted from Computer Organization and Design, 4th Edition,

CSE431 Chapter 4A.1 Irwin, PSU, 2008

, ,


2/43

Review: MIPS (RISC) Design Principles

Simplicity favors regularity

z fixed size instructions

z

z opcode always the first 6 bits

Smaller is faster

z limited instruction set

z limited number of registers in register filez m e num er o a ress ng mo es

Make the common case fast

-z allow instructions to contain immediate operands


z

three instruction formats


3/43

The Processor: Datapath & Control

z memory-reference instructions: lw, sw

z arithmetic-logical instructions: add, sub, and, or, slt

z control flow instructions:beq, j

Generic implementation

z use the program counter (PC) to supplythe instruction address and fetch the

instruction from memory (and update the PC)

FetchPC = PC+4

DecodeExec

z decode the instruction (and read registers)

z execute the instruction

registers


How? memory-reference? arithmetic? control flow?


4/43

Aside: Clocking Methodologies

element is valid and stable relative to the clockz State elements - a memory element such as a register

z Edge-triggered all state changes occur on a clock edge

Typical executionz read contents of state elements -> send values throu h

combinational logic -> write results to one or more state elements

Stateelement StateelementCombinationallogic

clock

one clock cycle

Assumes state elements are written on every clockcycle; if not, need explicit write control signal


z write occurs only when both the write control is asserted and the

clock edge occurs


5/43

Fetching Instructions

z reading the instruction from the Instruction Memory

z updating the PC value to be the address of the next

Add

Instruction

4

Fetch

PC = PC+4

ReadAddress

InstructionPCDecodeExec

z PC is updated every clock cycle, so it does not need anexplicit write control signal just a clock signal


z Reading from the Instruction Memory is a combinational

activity, so it doesnt need an explicit read control signal


6/43

Decoding Instructions

z sending the fetched instructions opcode and function field

bits to the control unit

Control

Unit

FetchPC = PC+4

Read Addr 1Re ister

Read

DecodeExec

and Instruction

Write Data

Read Addr 2

Write AddrFile

Data 1

ReadData 2

z reading two values from the Register File


- Register File addresses are contained in the instruction


7/43

Executing R Format Operations

, , , ,

R-type:

31 25 20 15 5 0

o rs rt rd functshamt

10

z

perform operation (op and funct) on values in rs and rtz store the result back into the Register File (into location rd)

ALU controlRegWrite

Instruction

ea r

Read Addr 2

Write Addr

Register

File

ReadData 1

ReadALU

overflow

zero

FetchPC = PC+4

DecodeExec

Write Data Data 2


z o e a eg s er e s no wr en every cyc e e.g. sw , so

we need an explicit write control signal for the Register File


8/43

Executing Load and Store Operations

z compute memory address by adding the base register (read from

the Register File during decode) to the 16-bit signed-extended

z store value (read from the Register File during decode) written tothe Data Memory

, ,File

overflow

ALU controlRegWrite MemWrite

Instruction

ea r

Read Addr 2

Write Addr

Register

File

ReadData 1

ReadALU

Data

Memory

Address

Read Data

Write Data Data 2 r e a a

Sign MemRead


Extend16 32


9/43

Executing Branch Operations

z compare the operands read from the Register File during decode

for equality (zeroALU output)

the 16-bit signed-extended offset field in the instrAdd Branch

ALU control

Shift

left 2

4address

Read Addr 1

RegisterRead

Data 1

zero

PC

(to branchcontrol logic)

ns ruc on

Write Data

Write AddrFile

ReadData 2

ALU


Sign

Extend16 32


10/43

Executing Jump Operations

z replace the lower 28 bits of the PC with the lower 26 bits of the

fetched instruction shifted left by 2 bits

Add

4

Read

Instruction

MemoryShift

left 2

Jumpaddress

28

Address 26



11/43

Creating a Single Datapath from the Parts

Assemble the datapath segments and add control linesand multiplexors as needed

,

instructions in one clock cyclez no datapath resource can be used more than once per

instruction, so some must be duplicated (e.g., separateInstruction Memory and Data Memory, several adders)

z multiplexors needed at the input of shared elements withcontrol lines to do the selection

z write signals to control writing to the Register File and DataMemor

Cycle time is determined by length of the longest path



12/43

Fetch, R, and Memory Access Portions

MemtoReg

Instruction

Add

4

Read Addr 1

ovf

zero

ALU controlRegWrite MemWriteALUSrc

ReadAddress

Instruction

Memory

PC

Read Addr 2

Write Addr

Register

File

Data 1

ReadData 2

ALU

Data

Memory

Write Data

Read Data

r e a a

MemReadSign



13/43

Adding the Control

,and Memory read/write)

Controlling the flow of data (multiplexor inputs)

R-type:

31 25 20 15 5 0

op rs rt rd functshamt

10

I-Type: op rs rt address offset

serva ons

z op field always

in bits 31-26 31 25 0

z addr of registersto be read arealways specified by the

J-type: op target address

rs e s - an r e s - ; or w an sw rs s e aseregister

z addr. of register to be written is in one oftwo places in rt (bits 20-16)


- -

z offset for beq, lw, and sw always in bits 15-0


14/43

Single Cycle Datapath with Control Unit

Add

4 ShiftAdd

PCSrc

0

1

MemWriteMemReadMemtoReg

ALUSrc

ALUOp

Control

UnitInstr[31-26]

Branch

Instruction Read Addr 1Read

ovf

RegWrite

Address

RegDst

Instr[25-21]

ReadAddress

Instr[31-0]

Memory

PC

Read Addr 2

Write Addr

eg s er

File

Data 1

ReadData 2

ALU

zeroData

Memory

Write Data

Read Data 1

1

0

00

Instr[20-16]

Instr[15

Sign

Extend16 32ALU

control

Instr[15-0]

-11]


Instr[5-0]


15/43

R-type Instruction Data/Control Flow

Add

4 ShiftAdd

PCSrc

0

1


ALUSrc

ALUOp

Control

UnitInstr[31-26]

Branch


ovf

RegWrite

Address

RegDst

Instr[25-21]

ReadAddress

Instr[31-0]

Memory

PC

Read Addr 2

Write Addr

eg s er

File

Data 1

ReadData 2

ALU

zeroData

Memory

Write Data

Read Data 1

1

0

00

Instr[20-16]

Instr[15

Sign

Extend16 32ALU

control

Instr[15-0]

-11]


Instr[5-0]


16/43

Load Word Instruction Data/Control Flow

Add

4 ShiftAdd

PCSrc

0

1


ALUSrc

ALUOp

Control

UnitInstr[31-26]

Branch


ovf

RegWrite

Address

RegDst

Instr[25-21]

ReadAddress

Instr[31-0]

Memory

PC

Read Addr 2

Write Addr

eg s er

File

Data 1

ReadData 2

ALU

zeroData

Memory

Write Data

Read Data 1

1

0

00

Instr[20-16]

Instr[15

Sign

Extend16 32ALU

control

Instr[15-0]

-11]


Instr[5-0]


17/43

Load Word Instruction Data/Control Flow

Add

4 ShiftAdd

PCSrc

0

1


ALUSrc

ALUOp

Control

UnitInstr[31-26]

Branch


ovf

RegWrite

Address

RegDst

Instr[25-21]

ReadAddress

Instr[31-0]

Memory

PC

Read Addr 2

Write Addr

eg s er

File

Data 1

ReadData 2

ALU

zeroData

Memory

Write Data

Read Data 1

1

0

00

Instr[20-16]

Instr[15

Sign

Extend16 32ALU

control

Instr[15-0]

-11]


Instr[5-0]


18/43

Branch Instruction Data/Control Flow

Add

4 ShiftAdd

PCSrc

0

1


ALUSrc

ALUOp

Control

UnitInstr[31-26]

Branch


ovf

RegWrite

Address

RegDst

Instr[25-21]

ReadAddress

Instr[31-0]

Memory

PC

Read Addr 2

Write Addr

eg s er

File

Data 1

ReadData 2

ALU

zeroData

Memory

Write Data

Read Data 1

1

0

00

Instr[20-16]

Instr[15

Sign

Extend16 32ALU

control

Instr[15-0]

-11]


Instr[5-0]


19/43

Branch Instruction Data/Control Flow

Add

4 ShiftAdd

PCSrc

0

1


ALUSrc

ALUOp

Control

UnitInstr[31-26]

Branch


ovf

RegWrite

Address

RegDst

Instr[25-21]

ReadAddress

Instr[31-0]

Memory

PC

Read Addr 2

Write Addr

eg s er

File

Data 1

ReadData 2

ALU

zeroData

Memory

Write Data

Read Data 1

1

0

00

Instr[20-16]

Instr[15

Sign

Extend16 32ALU

control

Instr[15-0]

-11]


Instr[5-0]


20/43

Adding the Jump Operation

Instr 25-0

Add

4 ShiftAdd

0

1

Shift

left 2

0

3226PC+4[31-28]

28

MemWrite

MemReadMemtoReg

ALUSrc

left 2 PCSrcALUOp

Control

UnitInstr[31-26]

Branch

Jump

Read Addr 1

ovf

RegWriteRegDst

Instr[25-21]

ReadAddress

Instr[31-0]

ns ruc on

Memory

PC

Read Addr 2

Write Addr

Register

File

ReadData 1

ReadALU

zeroData

Memory

Address

Read Data 1

1

0

0

Instr[20-16]

Write Data

Sign

ExtendALU

1

Instr[15-0]

ns r-11]


Instr[5-0]


21/43

Instruction Times (Critical Paths)

delays for muxes, control unit, sign extend, PC access,shift left 2, wires, setup and hold times except:

z Instruction and Data Memory (200 ps)

zALU and adders (200 ps)

Instr. I Mem Reg Rd ALU Op D Mem Reg Wr Total

-type

load

store

beq


jump


22/43

Instruction Critical Paths

delays for muxes, control unit, sign extend, PC access,shift left 2, wires, setup and hold times except:

z Instruction and Data Memory (200 ps)

zALU and adders (200 ps)

Instr. I Mem Reg Rd ALU Op D Mem Reg Wr Total

-type

load

200 100 200 100 600

200 100 200 200 100 800

store

beq

200 100 200 200 700

200 100 200 500


jump 200 200


23/43

Single Cycle Disadvantages & Advantages

ses e c oc cyc e ne c en y e c oc cyc e musbe timed to accommodate the slowest instructionz especially problematic for more complex instructions like

floating point multiply

Cycle 1 Cycle 2

lw sw Waste

May be wasteful of area since some functional units. .,

shared during a clock cycle

but


Is simple and easy to understand


24/43

How Can We Make It Faster?

Start fetching and executing the next instruction beforethe current one has completed

z Pipelining (all?) modern processors are pipelined forperformance

z Rememberthe performance equation:CPU time = CPI * CC * IC

Underideal conditions and with a large number of

instructions, the speedup from pipelining is

z A five stage pipeline is nearly five times faster because the CC isnearly five times faster

Fetch (and execute) more than one instruction at a time


z upersca ar process ng s ay une


25/43

The Five Stages of Load Instruction

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5

e c ec xec emw

IFetch: Instruction Fetch and Update PC

Dec: Registers Fetch and Instruction Decode Exec: Execute R-type; calculate memory address

Mem: Read/write the data from/to the Data Memor

WB: Write the result data into the register file



26/43

A Pipelined MIPS Processor

completed

z improves throughput - total amount of work done in a given time

z instruction latency (execution time, delay time, response time -

time from the start of an instruction to its completion) is notreduced

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 7Cycle 6 Cycle 8

sw IFetch Dec Exec Mem WB

- IFetch Dec Exec Mem WB

- clock cycle (pipeline stage time) is limited by the slowest stage

- for some sta es dont need the whole clock c cle e. . WB


- for some instructions, some stages are wasted cycles (i.e.,

nothing is done during that cycle for that instruction)


27/43

Single Cycle versus Pipeline

Clk

Single Cycle Implementation (CC = 800 ps):Cycle 1 Cycle 2

lw sw Waste

lw IFetch Dec Exec Mem WB

Pipeline Implementation (CC = 200 ps): ps

IFetch Dec Exec Mem WBR-type

To complete an entire instruction in the pipelined casetakes 1000 ps (as compared to 800 ps for the singlec cle case . Wh ?


How long does each take to complete 1,000,000 adds ?


28/43


29/43

MIPS Pipeline Datapath Additions/Mods

IF:IFetch ID:Dec EX:Execute MEM:

MemAccess

WB:

WriteBack

IF/ID ID/EX EX/MEM

Add

4

Read Addr 1

Shift

left 2

Add MEM/WB

ReadAddress

Memory

PC

Read Addr 2

Write Addr

eg s er

File

eaData 1

ReadData 2

ALU

Memory

Address ReadData

r e a a

16 32

Sign

Extend


System Clock


30/43

MIPS Pipeline Control Path Modifications

z and held in the state registers between pipeline stages

PCSrc

IF/ID

EX/MEM

Control

Instruction

4

Read Addr 1Register Read

Shift

left 2

Add

Data

MEM/WB

RegWriteBranch

ReadAddressP

C

Write Data

Read Addr 2

Write AddrFile

Data 1

ReadData 2

ALU

emory

Address

Write Data

ReadData

MemtoRegALUSrc

16 32

Sign

Extend

ALU

cntrlMemRead

ALUOp

CSE431 Chapter 4A.30 Irwin, PSU, 2008RegDst


31/43

Pipeline Control

IF Stage: read Instr Memory (always asserted) and writePC (on System Clock)

EX Stage MEM Stage WB Stage

Reg

Dst

ALU

Op1

ALU

Op0

ALU

Src

Brch Mem

Read

Mem

Write

Reg

Write

Mem

toRegR 1 1 0 0 0 0 0 1 0

l w 0 0 0 1 0 1 0 1 1

sw

beq X 0 1 0 1 0 0 0 X



32/43

Graphically Representing MIPS Pipeline

AL

UIM Reg DM Reg

z How many cycles does it take to execute this code?

z What is the ALU doing during cycle 4?

z Is there a hazard, why does it occur, and how can it be fixed?



33/43

Why Pipeline? For Performance!

me c oc cyc es

A Once the

I

ns

Inst 1

LUeg eg

AL

IM Re DM Re

p pe ne s u ,

one instructionis completed

r.

O Inst 2

U

A

LUIM Reg DM Reg

,CPI = 1

r

d

e Inst 3ALUIM Reg DM Reg

Inst 4ALUIM Reg DM Reg



34/43

Can Pipelining Get Us Into Trouble?

Yes: Pipeline Hazardsz structural hazards: attempt to use the same resource by two

different instructions at the same time

z data hazards: attempt to use data before it is ready

- An instructions source operand(s) are produced by a prior

z control hazards: attempt to make a decision about programcontrol flow before the condition has been evaluated and the

- branch and jump instructions, exceptions

Can usually resolve hazards by waiting

z i eline control must detect the hazard


z and take action to resolve hazards


35/43

A Single Memory Would Be a Structural Hazard

me c oc cyc es

A Reading data fromI

ns

Inst 1

LUem eg em eg

AL

Mem Re Mem Re

memory

r.

O Inst 2

U

A

LUMem Reg Mem Reg

r

d

e Inst 3ALUMem Reg Mem Reg

Inst 4ALUMem Reg Mem RegReading instruction

from memor


Fix with separate instr and data memories (I$ and D$)


36/43

How About Register File Access?Time clock c cles

AL

IM Reg DM Reg

Fix register fileaccess hazard badd$1,

n

s

t Inst 1

U

ALUIM Reg DM Reg

doing reads in the

second half of thecycle and writes in

r.

O Inst 2AL

UIM Reg DM Reg

the first half

d

e

r

ALUIM Reg DM Regadd $2,$1,


clock edge that controlsregister writing

c oc e ge a con ro sloading of pipeline state

registers


37/43

Register Usage Can Cause Data Hazards

A

I

ns

,

sub $4 $1 $5

LUeg eg

AL

IM Re DM Re

r.

O and $6,$1,$7

U

A

LUIM Reg DM Reg

r

d

e or $8,$1,$9ALUIM Reg DM Reg

xor $4,$1,$5ALUIM Reg DM Reg


Read before write data hazard


38/43

Register Usage Can Cause Data Hazards

A

LU

AL

IM Reg DM Reg

,

sub $4 $1 $5

ALUIM

RegDM Reg

and $6,$1,$7ALUIM Reg DM Regor $8,$1,$9

ALUIM Reg DM Regxor $4,$1,$5


Read before write data hazard


39/43

Loads Can Cause Data Hazards

A

I

ns

,

sub $4 $1 $5

LU

AL

IM Reg DM Reg

r.

O and $6,$1,$7

ALUIM

RegDM Reg

r

d

e or $8,$1,$9ALUIM Reg DM Reg

xor $4,$1,$5

ALUIM Reg DM Reg


Load-use data hazard


40/43

Branch Instructions Cause Control Hazards

A

I

ns

LUeg eg

AL

IM Re DM Re

r.

O Inst 3

U

A

LUIM Reg DM Reg

r

d

e Inst 4ALUIM Reg DM Reg



41/43

Other Pipeline Structures Are Possible

z Make the clock twice as slow or

z let it take two cycles (since it doesnt use the DM stage)

A

MUL

U

What if the data memor access is twice as slow asthe instruction memory?

z make the clock twice as slow or

ALUIM Reg DM1 RegDM2

clock rate)



42/43

Other Sample Pipeline Alternatives

ARM7IM Reg EX

IM access reg

access

DM access

shift/rotatecommit result

ALUIM1 IM2 DM1

Reg

DM2Reg SHFT

PC updateBTB access

decodere 1 access ALU o

DM write

start IM accessIM access

shift/rotatereg 2 access

start DM accessexception



43/43

Summary

All modern day processors use pipelining

Pipelining doesnt help latency of single task, it helps

Potential speedup: a CPI of 1 and fast a CC

z Unbalanced pipe stages makes for inefficiencies

z The time to fill i eline and time to drain it can im actspeedup for deep pipelines and short code runs

Must detect and resolve hazards

z

Stalling negatively affects CPI (makes CPI less than the idealof 1)