Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware...

46
Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook Lattice QCD software for the IBM Blue Gene/P architecture Stefan Krieg <[email protected]> Jülich Supercomputing Center, Forschungszentrum Jülich and FB C, Mathematik und Naturwissenschaften, Bergische Universität Wuppertal Joint Seminar on High Performance Computing Trinity College, Dublin - Bergische Universität Wuppertal Stefan Krieg 30.01.2009 BlueGene software for LQCD FZ Jülich, BU Wuppertal

Transcript of Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware...

Page 1: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Lattice QCD software for the IBM Blue Gene/Parchitecture

Stefan Krieg<[email protected]>

Jülich Supercomputing Center,Forschungszentrum Jülich

andFB C, Mathematik und Naturwissenschaften,

Bergische Universität Wuppertal

Joint Seminar on High Performance ComputingTrinity College, Dublin - Bergische Universität Wuppertal

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 2: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

OutlineLattice QCD

LQCD basicsSimulation algorithm

Blue Gene/P hardwareBGP systemBGP functionality

Lattice QCD software for Blue GeneThe Wilson kernelCommunicationSerial code

Performance resultsKernel PerformanceMixed prec. inverters

Conclusion and outlook

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 3: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

OutlineLattice QCD

LQCD basicsSimulation algorithm

Blue Gene/P hardwareBGP systemBGP functionality

Lattice QCD software for Blue GeneThe Wilson kernelCommunicationSerial code

Performance resultsKernel PerformanceMixed prec. inverters

Conclusion and outlook

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 4: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Some Lattice QCD basics

OutlineLattice QCD

LQCD basicsSimulation algorithm

Blue Gene/P hardwareBGP systemBGP functionality

Lattice QCD software for Blue GeneThe Wilson kernelCommunicationSerial code

Performance resultsKernel PerformanceMixed prec. inverters

Conclusion and outlook

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 5: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Some Lattice QCD basics

The lattice

Lattice QCD (LQCD) is defined on a 4 dim. periodic lattice;LQCD is a way to define QCD in a mathematically precise way

Key ingedients are:I The Quarks living on the

lattice sitesI The Gluons living on the

lattice linksI Typically the LQCD action

connects only neighboringsites (plain Wilson)

Simulations of LQCD are the only available method to directly accessthe low energy regime of QCD

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 6: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Some Lattice QCD basics

QCD particle spectrum (Blue Gene, BMW coll.)

Science 322, 1224 (2008)

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 7: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Simulation algorithm

OutlineLattice QCD

LQCD basicsSimulation algorithm

Blue Gene/P hardwareBGP systemBGP functionality

Lattice QCD software for Blue GeneThe Wilson kernelCommunicationSerial code

Performance resultsKernel PerformanceMixed prec. inverters

Conclusion and outlook

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 8: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Simulation algorithm

Importance SamplingI After Wick-rotation, the path-integral becomes equivalent to a

partition function of statistical mechanics

〈O〉F ,GQCD =

∫DψDψDAO(ψ,ψ,A) exp

{−SQCD

E (ψ,ψ,A)}

=

∫DA det[D] 〈O(A)〉F exp

{−SQCD

E,G (A)}

I The Boltzmann-weight can be interpreted as probabilityI if Ai has the probability p ∝ exp{−SQCD

E (Ai )} to appear in theensemble of A’s then

〈O〉F ,GQCD =

1N

N∑i=0

〈O(Ai )〉F ,

gives a statistical estimate of an operator expectation value

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 9: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Simulation algorithm

Simulation algorithm: HMCThe fermionic determinant∫ ∏

i

dηidη†i exp{−

∑ij

η†i Dijηj} = det[D]

is included via the pseudofermions (pos. definite K ):∫ ∏i

dcidc∗i exp{−∑

ij

c∗i Kijcj} =1

det[K ]

The Hybrid Monte Carlo algorithm now proceeds as follows1. Generate a pseudofermion vector2. Molecular Dynamics evolution of the gauge links (⇒ Inversions)3. Metropolis accept reject step (energy: E = S + 1

2 Π2)

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 10: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

OutlineLattice QCD

LQCD basicsSimulation algorithm

Blue Gene/P hardwareBGP systemBGP functionality

Lattice QCD software for Blue GeneThe Wilson kernelCommunicationSerial code

Performance resultsKernel PerformanceMixed prec. inverters

Conclusion and outlook

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 11: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Blue Gene/P system

OutlineLattice QCD

LQCD basicsSimulation algorithm

Blue Gene/P hardwareBGP systemBGP functionality

Lattice QCD software for Blue GeneThe Wilson kernelCommunicationSerial code

Performance resultsKernel PerformanceMixed prec. inverters

Conclusion and outlook

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 12: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Blue Gene/P system

Blue Gene/P system structure

Chip4 processors

13.6 GF/s Compute Card1 chip, 13.6 GF/s

2.0 GB DDR2(4.0GB optional)

Node Card(32 chips 4x4x2)

32 compute, 0-1 IO cards435 GF/s, 64 GB Rack

32 Node CardsCabled 8x8x1613.9 TF/s, 2 TB

System72 Racks, 72x32x32

1 PF/s, 144 TB

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 13: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Blue Gene/P system

The Blue Gene/P node

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 14: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Blue Gene/P system

Blue Gene/P networksBlue Gene/P main networks are:

I The 3 dimensional torus networkI Bandwidth (node): 6*2*425MB/s =

5.1GB/sI HW latency (1 hop): 100ns (32B),

800ns (256B packet)I HW latency (worst): 3.2µsI DMA controlled: overlapping

communication and computation

I The tree network for collectivesI Bandwidth (node): 2*850MB/s =

1.7GB/sI HW latency (worst): 3.5µs

I Global barrier network

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 15: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Blue Gene/P functionality

OutlineLattice QCD

LQCD basicsSimulation algorithm

Blue Gene/P hardwareBGP systemBGP functionality

Lattice QCD software for Blue GeneThe Wilson kernelCommunicationSerial code

Performance resultsKernel PerformanceMixed prec. inverters

Conclusion and outlook

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 16: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Blue Gene/P functionality

DMA communications

DMADMA

DataData

countercounter

FIFOFIFOFIFOFIFOFIFOFIFO

Memory

upd.

load/storesend/

recv.

read/store

upd.

Compute Node

Torus HWTorus HW

DMADMA

Compute Node

DMA is capable of:I Direct-put: put data into

memory on destination nodeI MemFifo comms: put data

into reception fifo ondestination node

I Remote-get: put a descriptorinto injection fifo ondestination node

I Prefetch-only: prefetch datainto L3 (no transfers)

I Destination node can be thenode itself (local transfers)

DMA “directly” programmable: SPI

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 17: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Blue Gene/P functionality

“Double Hummer” FPU

The “Double Hummer” FPUfeatures:

I Instructions optimizedfor complex arithmetic:only 2 instructionsrequired for complexmultiplication

I 32 primary + 32secondary registers

I Capability to load 16Byte quadword

I 5 stage pipeline

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 18: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Blue Gene/P functionality

Complex multiplication with “Double Hummer” FPUB

A

C

B

A

C

C

**

+-

ss

**

s s

Complex multiplication (A×B=C)

Re(C) = (Re(A)Re(B)− Im(A)Im(B))

Im(C) = (Im(A)Re(B) + Re(A)Im(B))

Required:I a cross copy primary multiply

(2+1 register operands)I a cross mixed negative

secondary multiply-add(3+1 register operands)

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 19: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

OutlineLattice QCD

LQCD basicsSimulation algorithm

Blue Gene/P hardwareBGP systemBGP functionality

Lattice QCD software for Blue GeneThe Wilson kernelCommunicationSerial code

Performance resultsKernel PerformanceMixed prec. inverters

Conclusion and outlook

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 20: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

The Wilson kernel

OutlineLattice QCD

LQCD basicsSimulation algorithm

Blue Gene/P hardwareBGP systemBGP functionality

Lattice QCD software for Blue GeneThe Wilson kernelCommunicationSerial code

Performance resultsKernel PerformanceMixed prec. inverters

Conclusion and outlook

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 21: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

The Wilson kernel

Wilson kernel properties

I Wilson kernel is sparse matrix vector multiplicationI Sparse: memory footprint of kernel scales linearly with NI Wilson kernel is a first order derivative connecting NNI However “NN coefficients” are random matrices

Blue Gene/P implementation strategy:1. (scalar) Spin project forward

2. (comm) Start communication forward

3. (scalar) Spin project backward and SU(3) multiply

4. (comm) Wait forward; Start communication backward

5. (scalar) SU(3) multiply fwd. and sum up

6. (comm) Wait backward

7. (scalar) Add backward

⇒ Calculations and communications overlap.

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 22: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

The Wilson kernel

Wilson kernel communication pattern

CPU0 CPU3

CPU2CPU1

Match 4 dim. periodicphysics lattice tocommunication hardware:

I Put 3 dimensionsalong torus directions

I Use local transfers for4th dimension:

Core0 → Core1↑ ↓

Core2 → Core3

⇒ 4 dimensional torus

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 23: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Using the DMA

OutlineLattice QCD

LQCD basicsSimulation algorithm

Blue Gene/P hardwareBGP systemBGP functionality

Lattice QCD software for Blue GeneThe Wilson kernelCommunicationSerial code

Performance resultsKernel PerformanceMixed prec. inverters

Conclusion and outlook

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 24: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Using the DMA

Injection Fifo

Fifo starting address

Fifo head

Fifo tail

Fifo ending address

DMA

DMA

Inject new

Fifohead&tailwrap around

descriptor

counterdec.

Fifo:contiguouschunk ofDDR Mem.

send

descriptor

Torus HW

I Fifo is a contiguous chunk ofDDR memory

I The Fifo hasI Starting addressI Ending addressI HeadI Tail

I Message descriptors areinjected into injection fifo

I DMA “executes” a descriptor,then updates fifo head

I New descriptor is injected atfifo tail, tail is then updated

I Tail is wrapped around

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 25: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Using the DMA

Counters

Two tyes of counters:

I Reception (rDMA) countersI Base address: associated virtual address in main memory;

incoming data is stored relative to this addressI Max address: associated virtual address in main memory; defines

“memory reception window”I Counter value: is decreased by DMA when receiving a message

associated to the counter

I Injection (iDMA) countersI Base address: associated virtual address in main memory; data is

send relative to this addressI Counter value: is decreased by DMA when sending a message

associated to the counter

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 26: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Using the DMA

Descriptors

Target X,Y,Z

Rec. counter ID

Send offset

Recv offset

Inj. counter ID

Hints, …

Inj. counter Grp

Rec. counter Grp

Msg. length

A message descriptor contains (direct-put)I Coordinates of the destination node

(nonlocal comm.)I Send offset relative to inj. counter base

addressI Injection counter group and IDI Reception offset relative to rec. counter

base address (on destination node)I Reception counter group and ID (on

destination node)I Message length

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 27: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Using the DMA

SPI software layer

SPI contains low level function calls toI access and manipulate fifos, counters etc.I create descriptors, inject descriptorsI use the global interrupt and collective network

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 28: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Using the DMA

Counters and descriptors

For e.g. a “direct-put“ using SPI one could proceed as follows:1. Allocate 1 injection fifo, 1 reception and 1 injection counter

(if you do not have them already)

2. Set the injection counter base address e.g. to the smallest virtual address of thedata that you want to communicate

3. Set the injection counter value to the number of bytes you intend to send

4. (destination node) send the reception counter base address to the smallestvirtual address where you want to store received data

5. Set the reception counter max address

6. Set the reception counter value to the number of bytes send

For each continuous chunk of data, inject 1 descriptor into the inj. fifo1. Calculate the address offset relative to the injection counter base

2. Calculate the address offset relative to the reception counter base

3. Give the message size

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 29: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Using the DMA

Summary: kernel persistent communication

Setting up the persistent communication for the kernel with SPI1. Allocate 6+2 injection fifos, 2 reception and 2 injection counters

2. Set the reception counter base and max and the injection counter baseaddresses (same on every node)

3. Deactivate fifos and counters.

4. Calculate the offsets and message sizes etc. (block-strided-move) and inject thedescriptors

5. Move each fifo head to the fifo tail

6. Activate fifos and counters

To start the communication1. Set the reception and injection counter values

2. Move the fifo heads to the fifo start

And the DMA does its work in the background.Poll counters to make sure communication is completed.

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 30: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Using the “Double Hummer” FPU

OutlineLattice QCD

LQCD basicsSimulation algorithm

Blue Gene/P hardwareBGP systemBGP functionality

Lattice QCD software for Blue GeneThe Wilson kernelCommunicationSerial code

Performance resultsKernel PerformanceMixed prec. inverters

Conclusion and outlook

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 31: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Using the “Double Hummer” FPU

Problems

Making efficient use of the “Double Hummer” FPU is not easy:

I Secondary FPU and secondary registers only accessible throughoedipus instructions

I Oedipus instructions always work on 16 Byte aligned 16 Bytequadwords = 2 double precision floating point numbers→ User has to make sure data is aligned

I Compilers frequently fail to rearrange code properly→ “Simdization” fails.

⇒ To get maximum performance one needs to go low level

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 32: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Using the “Double Hummer” FPU

Fast and easy: XL intrinsics

I IBM XL compilers provide built-in functions (intrinsics) that mapto (e.g. floating point) assembly instructionse.g. __lfpd, __stfpd, __fpmadd and __dcbt

I Intrinsics operate on “double _Complex” variables that map toregisters

I Number of variables is not limited (like registers)I Scheduling will be done by compiler

⇒ Comparatively easy to use.I LQCD code has large parts optimize using intrinsicsI Use a set of macros that implement basic mathematical

operationsI With macros optimization is almost trivial

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 33: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Using the “Double Hummer” FPU

Example: daxpy

0

0.2

0.4

0.6

0.8

1

32241680

flops

/cyc

le

vector length[KBytes]

450d unrolled, alignedESSL BG

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 34: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Using the “Double Hummer” FPU

Example: caxpy

0

0.2

0.4

0.6

0.8

1

1.2

1.4

32241680

flops

/cyc

le

vector length[KBytes]

450d450d unrolled, aligned

intrinsics

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 35: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Using the “Double Hummer” FPU

Faster and annoying: assembly

When using the intrinsics, the compiler has great influence on theperformance.

I For more control use (gcc inline) assemblyI Kernel serial code written in assemblyI Uses explicit prefetches (dcbt)I All scheduling and register allocation done by handI No problem if all that is already there (BGL kernel)

⇒ Performance typically another 10% better compared to intrinsics

⇒ Code generation typically 10 times slower compared to intrinsics

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 36: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

OutlineLattice QCD

LQCD basicsSimulation algorithm

Blue Gene/P hardwareBGP systemBGP functionality

Lattice QCD software for Blue GeneThe Wilson kernelCommunicationSerial code

Performance resultsKernel PerformanceMixed prec. inverters

Conclusion and outlook

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 37: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Kernel performance

OutlineLattice QCD

LQCD basicsSimulation algorithm

Blue Gene/P hardwareBGP systemBGP functionality

Lattice QCD software for Blue GeneThe Wilson kernelCommunicationSerial code

Performance resultsKernel PerformanceMixed prec. inverters

Conclusion and outlook

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 38: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Kernel performance

Wilson kernel performance (strong scaling)

0

10

20

30

40

50

60

70

80

4K 16K 32K 64K

1 4 8 16

Perf

orm

ance

/TFl

op/s

# Cores

Number of BG/P racks

37.5% PeakWilson kernel shows

I almost perfect strongscaling,

I a large scaling range,

I perfect weak scaling,

I and reaches 37.5% ofabsolute peak

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 39: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Kernel performance

Performance comparison BGL/BGP

0

5

10

15

20

25

30

35

2K 8K 16K 0

2

4

6

8

101 4 8

Perf

orm

ance

/%Pe

ak

Ker

nel m

emor

y fo

otpr

int/M

Byt

e

# Cores

Number of BG/L racks

perf singleperf doublemem single

mem double

0

10

20

30

40

50

4K 16K 32K 64K 0

2

4

6

8

101 4 8 16

Perf

orm

ance

/%Pe

ak

Ker

nel m

emor

y fo

otpr

int/M

Byt

e

# Cores

Number of BG/P racks

perf singleperf doublemem single

mem double

I Optimized comm. + calc. I Large scaling region

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 40: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Performance of mixed precision inverters

OutlineLattice QCD

LQCD basicsSimulation algorithm

Blue Gene/P hardwareBGP systemBGP functionality

Lattice QCD software for Blue GeneThe Wilson kernelCommunicationSerial code

Performance resultsKernel PerformanceMixed prec. inverters

Conclusion and outlook

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 41: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Performance of mixed precision inverters

Mixed precision via iterative refinement

I Single precision kernel has best performance and largest scalingregion

I Valence and sea sector calculations require double precisionaccuracy

I Solution: Mixed precision invertersUse a less precise A−1 = a−1:

1. Compute b − Ax0 = r2. Compute a−1r = A−1r + rs, with |rs| = ε|r |3. Update x: x1 = x0 + a−1r

⇒ |b − Ax1| = |b − Ax0 − Aa−1r | = |r − AA−1r − rs| = ε|r |I This can be used to speed up a generic CGI Precision of a−1 can be tuned

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 42: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Performance of mixed precision inverters

Mixed precision GMRESR

relGMRESR(A,b, ε)

{computes x with ‖Ax − b‖ ≤ ε · ‖b‖ via relaxed GMRESR}

x = 0; {initial value}r = b;U = C = []; {empty matrix}while ‖r‖ > ε · ‖b‖ do

solve Au = r to relative accuracy ξ {preconditioner}compute c with ‖Au − c‖ ≤ ε · ‖b‖ · ‖u‖/‖r‖;for i=1:size(C,2) doβ = C[:, i]† · c;c = c − β · C[:, i];u = u − β · U[:, i];

end forc = c/‖c‖; u = u/‖c‖;C = [C, c]; U = [U, u];α = c† · r ;x = x + α · u;r = r − α · c;

end while

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 43: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Performance of mixed precision inverters

Mixed precision inverter performance

0 20 40 60

0.0001

0.001

0.01

0.1

1

sec

WilsonCG-64 vs. CG-mix

0 5 10 15 20 25

0.0001

0.001

0.01

0.1

1

1000 sec

overlapCG-64 vs. GMRES-mix

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 44: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

OutlineLattice QCD

LQCD basicsSimulation algorithm

Blue Gene/P hardwareBGP systemBGP functionality

Lattice QCD software for Blue GeneThe Wilson kernelCommunicationSerial code

Performance resultsKernel PerformanceMixed prec. inverters

Conclusion and outlook

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 45: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Conclusion and Outlook: LQCD software

Blue Gene/L and Blue Gene/P low level software layerI is the core of all the simulation codes used within BMW projects

on Blue GeneI contains all performance critical routinesI extends, combined with mixed action inverters, the scaling

region of the code to over ten thousands of CPUsI delivers up to 37.5% sustained performanceI has been successfully used within the LQCD simulations to test

and verify the Blue Gene’s at JülichIn the future

I further optimize software layerI include support for CELL BE

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal

Page 46: Lattice QCD software for the IBM Blue Gene/P architecture · Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance resultsConclusion and outlook Lattice

Lattice QCD Blue Gene/P hardware Lattice QCD software for Blue Gene Performance results Conclusion and outlook

Excuse me? THEORETIC? Surely you must be joking... Boeing!

Stefan Krieg 30.01.2009

BlueGene software for LQCD FZ Jülich, BU Wuppertal