
Design Space Exploration for Multicore Architectures: A Power/Performance/Thermal View

Matteo Monchiero (1)
(1) Dipartimento di Elettronica e Informazione, Politecnico di Milano, Via Ponzio 34/5, 20133 Milano, Italy
[email protected]

Ramon Canal (2)
(2) Dept. of Computer Architecture, Universitat Politècnica de Catalunya, Cr. Jordi Girona 1-3, 08034 Barcelona, Spain
[email protected]

Antonio González (2,3)
(3) Intel Barcelona Research Center, Intel Labs - Universitat Politècnica de Catalunya, Cr. Jordi Girona 27-29, 08034 Barcelona, Spain
[email protected]

ABSTRACT

Multicore architectures dominate the recent microprocessor design trend. This is due to several reasons: better performance, the thread-level parallelism available in modern applications, diminishing ILP returns, better thermal/power scaling (many small cores dissipate less than a single large and complex one), and ease and reuse of design.

This paper presents a thorough evaluation of multicore architectures. The architecture we target is composed of a configurable number of cores, a memory hierarchy consisting of private L1 and L2 caches, and a shared bus interconnect. We consider parallel shared-memory applications. We explore the design space spanned by the number of cores, the L2 cache size and the processor complexity, showing the behavior of the different configurations/applications with respect to performance, energy consumption and temperature. Design tradeoffs are analyzed, stressing the interdependency of the metrics and design factors. In particular, we evaluate several chip floorplans. Their power/thermal characteristics are analyzed, showing the importance of considering thermal effects at the architectural level to achieve the best design choice.

1. INTRODUCTION

Main semiconductor firms have recently been proposing microprocessor solutions composed of a few cores integrated on a single chip [1–3]. This approach, named Chip Multiprocessor (CMP), permits efficient handling of the power/thermal issues dominating deep sub-micron technologies and makes it easy to exploit the thread-level parallelism of modern applications.

Power has been recognized as a first-class design constraint [4], and many works in the literature target the analysis and optimization of power consumption. Issues related to chip thermal behavior have been addressed only recently, but have emerged as one of the most important factors driving a wide range of architectural decisions [5].

This paper targets a CMP system composed of multiple cores, each one with a private L1 and L2 cache hierarchy. This approach has been adopted in some industrial products, e.g. the Intel Montecito [1] and Pentium D [6], and the AMD Athlon X2 [7]. It maximizes design reuse, since this architectural style does not require a re-design of the secondary cache, as a shared-L2 architecture would.

Unlike many recent works on design space exploration of CMPs, we consider parallel shared-memory applications, which can be considered the natural workload for a small-scale multiprocessor. We target several scientific and multimedia programs. Scientific applications represent the traditional benchmark for multiprocessors, while multimedia programs represent a promising way of exploiting parallelism in everyday computing.

Our experimental framework consists of a detailed microarchitectural simulator [8], integrated with the Wattch [9] and CACTI [10] power models. Thermal effects have been modeled at the functional-unit granularity by using HotSpot [5]. This environment makes a fast and accurate exploration of the target design space possible.

The main contribution of this work is the analysis of several design energy/performance trade-offs when varying core complexity, L2 cache size and number of cores, for parallel applications. In particular, we discuss the interdependence of energy/thermal efficiency, performance and the architectural-level chip floorplan. Our findings can be summarized as follows:

• We show that the L2 cache is an important factor in determining chip thermal behavior.

• We show that large CMPs of fairly narrow cores are the best solution for energy-delay.

• We show that floorplan geometric characteristics are important for chip temperature.

This paper is organized as follows. Section 2 presents the related work. Metrics are defined in Section 3. The target design space, as well as the description of the considered architecture, is presented in Section 4. The experimental framework used in this paper is introduced in Section 5. Section 6 discusses performance/energy/thermal results for the proposed configurations. Section 7 presents an analysis of the spatial distribution of the temperature for selected chip floorplans. Finally, conclusions are drawn in Section 8.

2. RELATED WORK

Many works have appeared exploring the design space of CMPs from the point of view of different metrics and application domains. This paper extends previous work in several respects. In detail:

Page 2: New Design Space Exploration for Multicore Architectures: A …people.ac.upc.edu/rcanal/pdf/ics06.pdf · 2006. 5. 22. · Design Space Exploration for Multicore Architectures: A Power/Performance/Thermal

Huh et al. [11] evaluate the impact of several design factors on performance. The authors discuss the interactions of core complexity, cache hierarchy, and available off-chip bandwidth. The paper focuses on a workload composed of single-threaded applications. It is shown that out-of-order cores are more effective than in-order ones. Furthermore, bandwidth limitations can force large L2 caches to be used, therefore reducing the number of cores. A similar study was conducted by Ekman et al. [12] for scientific parallel programs.

Kaxiras et al. [13] deal with multimedia workloads, especially targeting DSPs for mobile phones. The paper discusses the energy efficiency of SMT and CMP organizations under given performance constraints. They show that both approaches can be more efficient than a single-threaded processor, and claim that SMT has some advantages in terms of power.

Grochowski et al. [14] discuss how to achieve the best performance for scalar and parallel codes in a power-constrained environment. The idea proposed in their paper is to dynamically vary the amount of energy expended to process instructions according to the amount of parallelism available in the software. They evaluate several architectures, showing that a combination of voltage/frequency scaling and asymmetric cores represents the best approach.

In [15], Li and Martinez study the power/performance implications of parallel computing on CMPs. They use a mixed analytical-experimental model to explore parallel efficiency, the number of processors used, and voltage/frequency scaling. They show that parallel computation can bring significant power savings when the number of processors used and the voltage/frequency scaling levels are properly chosen.

These works [14–16] are orthogonal to ours. Our experimental framework is similar to that of [15, 16]. We also consider parallel applications and the same benchmarks as in [16]. The power model we use accounts for thermal effects on leakage energy, but, unlike the authors of [15, 16], we also provide a temperature-dependent model for L2 caches. This makes it possible to capture some particular behaviors of temperature and energy consumption, since L2 caches may occupy more than half of the chip area in some configurations.

In [17], a comparison of SMT and CMP architectures is carried out. The authors consider performance, power and temperature metrics. They show that SMT and CMP exhibit similar thermal behavior, but the sources of heating are different: localized hotspots for SMTs, global heating for CMPs. They find that CMPs are more efficient for computation-intensive applications, while SMTs perform better for memory-bound programs due to the larger L2 cache available. The paper also discusses the suitability of several dynamic thermal management techniques for CMPs and SMTs.

Li et al. [18] conduct a thorough exploration of the multi-dimensional design space for CMPs. They show that thermal constraints dominate other physical constraints such as pin bandwidth and power delivery. According to the authors, thermal constraints tend to favor shallow pipelines and narrower cores, and tend to reduce the optimal number of cores and L2 cache size.

The focus of [17, 18] is on single-threaded applications. In this paper, we target a different benchmark set, composed of explicitly parallel applications.

In [19], Kumar et al. provide an analysis of several on-chip interconnects. The authors show the importance of considering the interconnect while designing a CMP and argue that careful co-design of the interconnect and the other architectural entities is needed.

To the best of the authors' knowledge, the problem of architectural-level floorplanning has not been addressed for multicore architectures. Several chip floorplans have been proposed as instruments for thermal/power models, but the interactions between floorplanning issues and CMP power/performance characteristics have not been examined until now.

In [20], Sankaranarayanan et al. discuss some issues related to chip floorplanning for single-core processors. In particular, the authors show that chip temperature variation is sensitive to three main parameters: the lateral spreading of heat in the silicon, the power density, and the temperature-dependent leakage power. We also use these parameters to describe the thermal behavior of CMPs.

The idea of distributing the microarchitecture to reduce temperature is exploited in [21], where the organization of a distributed front-end for clustered microarchitectures is proposed and evaluated.

Ku et al. [22] analyze the inter-dependence of temperature and leakage energy in cache memories. They focus on single-core workloads. The authors analyze several low-power techniques and propose a temperature-aware cache management approach. In this paper, we account for temperature effects on leakage in the memory hierarchy, stressing its importance for making the best design decisions.

This paper's contribution differs from the previous ones, since it combines a power/performance exploration for parallel applications with an analysis of the interactions with the chip floorplan. Our model takes into account thermal effects and the temperature dependence of leakage energy.

3. METRICS

This section describes the metrics that we use in this paper to characterize the performance and power efficiency of the CMP configurations we explore.

We measure the performance of the system when running a given parallel application by using the execution time (or delay), computed as the time needed to complete the execution of the program. Delay is often preferred to IPC for multiprocessor studies, since it accounts for the different instruction counts that may arise when different system configurations are used. This mostly happens when the number of cores is varied and the dynamic instruction count changes to deal with a different number of parallel threads.

Power efficiency is evaluated through the system energy, i.e. the energy needed to run a given application. We also account for thermal effects, and we therefore report the average temperature across the chip. Furthermore, in Section 7, we evaluate several alternative floorplans by means of the spatial distribution of the temperature (the thermal map of the chip).

The Energy-Delay Product (EDP) and Energy-Delay² Product (ED2P) are typically used to evaluate energy-delay trade-offs. EDP is usually preferred when targeting low-power devices, while ED2P is preferred when focusing on high-performance systems, since it gives higher priority to performance over power. In this paper, we use ED2P. In any case, we also discuss EDnP optimality for the considered design space, using concepts taken from [23]. In particular, we adopt the notion of hardware intensity to properly characterize the complexity of different configurations.
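For reference, these metrics can be written compactly, using E for the total energy and D for the execution time (this is simply the standard formulation of the quantities named above, not a new model):

\[
\mathrm{ED^{n}P} = E \cdot D^{n}, \qquad \mathrm{EDP} = E \cdot D \;(n=1), \qquad \mathrm{ED^{2}P} = E \cdot D^{2} \;(n=2).
\]

Following the verbal definition in [23], the hardware intensity can be read as \( \eta = (\Delta E / E) / (\Delta P / P) \), where P = 1/D denotes performance and the variations refer to an architectural modification.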

4. DESIGN SPACE

We consider a shared-memory multiprocessor, composed of several independent out-of-order cores (each one with a private L2 cache), communicating on a shared bus. The detailed microarchitecture is discussed in Section 5. In this section, we focus on the target design space, which is composed of the following architecture parameters:

• Number of cores (2, 4, 8). This is the number of cores present in the system.


Table 1: Chip Area [mm²]

#cores  L2 Size [KB]  2-issue  4-issue  6-issue  8-issue
2       256           45       49       62       74
2       512           63       68       80       93
2       1024          108      113      124      138
2       2048          187      192      202      216
4       256           81       90       112      138
4       512           117      126      148      174
4       1024          199      208      228      255
4       2048          350      359      379      406
8       256           162      180      224      275
8       512           235      253      297      348
8       1024          398      416      456      511
8       2048          700      719      758      812

• Issue width (2, 4, 6, 8). We modulate the number of instructions which can be issued in parallel to the integer/floating-point reservation stations. According to this parameter, many microarchitectural blocks are scaled; see Table 2 for details. The issue width is therefore an index of the core complexity.

• L2 cache size (256, 512, 1024, 2048 KB). This is the size of the private L2 cache of each processor.

For each configuration in terms of number of cores and L2 cache size, we consider a different chip floorplan. As in some related works on thermal analysis [5], the floorplan of the core is modeled on the Alpha EV6 (21264) [24]. The area of several units (register files, issue and Ld/St queues, rename units, and FUs) has been scaled according to size and complexity [25]. Each floorplan has been redesigned to minimize dead space and to avoid increasing the delay of the critical loops.

Figure 1 illustrates the methodology used to build the floorplans we use. In particular, to obtain a reasonable aspect ratio, we defined two different layouts: one for small caches, 256KB and 512KB (Figure 1.a), and another one for large caches, 1024KB and 2048KB (Figure 1.b). The floorplans for 4 and 8 cores are built by using the 2-core floorplan as a base unit. The base unit for small caches is shown in Figure 1.a: core 0 (P0) and core 1 (P1) are placed side by side, and the L2 caches are placed in front of each core. In the base unit for large caches (Figure 1.b), the L2 is split in two pieces, one in front of the core and the other one beside it; in this way, each processor+cache unit is roughly square. For 4 and 8 cores, this floorplan is replicated and possibly mirrored, trying to obtain the aspect ratio closest to 1 (a perfect square). Figures 1.c and 1.d show how the 4-core and 8-core floorplans have been obtained, respectively, from the 2-core and 4-core floorplans. For all cache/core configurations, each floorplan has the shape/organization outlined in Figure 1, but different sizes (omitted for the sake of clarity). Table 1 shows the chip area for each design. For each configuration, the shared bus is placed according to the communication requirements; its size is derived from [19].
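To make the construction concrete, the following sketch (with hypothetical dimensions and helper names; the actual floorplans are drawn from scaled Alpha EV6 unit areas and also allow mirroring) shows how a 2-core base unit can be tiled into 4- and 8-core chips while keeping the aspect ratio as close as possible to 1:

def aspect_ratio(w, h):
    # Aspect ratio >= 1 of a w x h rectangle (1.0 means a perfect square).
    return max(w, h) / min(w, h)

def tile(base_w, base_h, copies):
    # Try all grid arrangements of `copies` base units and keep the
    # bounding box whose aspect ratio is closest to 1.
    best = None
    for rows in range(1, copies + 1):
        if copies % rows:
            continue
        cols = copies // rows
        w, h = cols * base_w, rows * base_h
        if best is None or aspect_ratio(w, h) < aspect_ratio(*best):
            best = (w, h)
    return best

# Hypothetical 9 mm x 5 mm base unit (two cores plus their L2 caches),
# replicated for a 4-core chip (2 copies) and an 8-core chip (4 copies).
print(tile(9.0, 5.0, 2))   # -> (9.0, 10.0), close to square
print(tile(9.0, 5.0, 4))   # -> (18.0, 10.0), preferred over (9.0, 20.0)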

5. EXPERIMENTAL SETUP

The simulation infrastructure that we use in this paper is composed of a microarchitecture simulator modeling a configurable CMP, integrated power models, and a temperature modeling tool.

The CMP simulator is SESC [8]. It models a multiprocessor composed of a configurable number of cores.

Figure 1: Chip floorplans for 2, 4, 8 cores, with 256/512KB L2 caches (a) and 1024/2048KB L2 caches (b); (c) 4 cores, (d) 8 cores. Each base unit contains two cores (P0, P1) with their L1 caches, the private L2 caches, and the shared bus.

Table 2 reports the main parameters of the simulated architectures. Each core is an out-of-order superscalar processor, with private caches (instruction L1, data L1, and a unified L2 for instructions and data). The 4-issue microarchitecture models the Alpha 21264 [24]. For each issue width, all the processor structures have been scaled accordingly, as shown in Table 2.

Inter-processor communication takes place on a high-bandwidth shared bus (57 GB/s), pipelined and clocked at half the core clock (see Table 2). The coherence protocol acts directly among the L2 caches and is MESI snoopy-based. This protocol generates invalidate/write requests between the L1 and L2 caches to ensure the coherency of shared data. Memory ordering follows a weak consistency model.
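To illustrate what "MESI snoopy-based" means for a single cache line, the sketch below gives a generic, simplified MESI transition table (an illustration of the textbook protocol, not SESC's exact implementation); the states are Modified, Exclusive, Shared, and Invalid:

def on_local(state, event, others_have_copy):
    # Next state after a read or write by the owning core (simplified MESI).
    if event == "read":
        if state == "I":
            return "S" if others_have_copy else "E"
        return state                      # M, E, S are unchanged by a local read
    if event == "write":
        return "M"                        # any state ends up Modified
    raise ValueError(event)

def on_snoop(state, event):
    # Next state after observing another core's bus transaction.
    if event == "BusRd":
        return "S" if state in ("M", "E") else state   # M/E downgrade (and flush)
    if event == "BusRdX":
        return "I"                        # another writer invalidates our copy
    raise ValueError(event)

# Example: core 0 writes a line, core 1 then reads it and writes it.
s0 = on_local("I", "write", others_have_copy=False)   # core 0: I -> M
s1 = on_local("I", "read", others_have_copy=True)     # core 1: I -> S
s0 = on_snoop(s0, "BusRd")                            # core 0: M -> S
s1 = on_local(s1, "write", others_have_copy=True)     # core 1: S -> M (invalidation)
s0 = on_snoop(s0, "BusRdX")                           # core 0: S -> I
print(s0, s1)                                         # -> I M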

The power model integrated in the simulator is based on Wattch [9] for processor structures, CACTI [10] for caches, and Orion [26] for buses.

The thermal model is based on HotSpot-3.0.2 [5]. HotSpot uses dynamic power traces and the chip floorplan to drive the thermal simulation. As outputs, it provides the transient behavior and the steady-state temperatures. Following [5], we chose a standard cooling solution featuring a thermal resistance of 1.0 K/W and an ambient air temperature of 45°C. We carefully select initial temperature values to ensure that the thermal simulation converges to a stationary state.

HotSpot has been augmented with a temperature-dependent leakage model, based on [27]. This model accounts for the different transistor densities of functional units and memories. Temperature dependence is modeled by varying the amount of leakage according to the appropriate exponential relationship, as in [27]. At each iteration of the thermal simulation, the leakage contribution is calculated and 'injected' into each HotSpot grid element.
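The feedback between temperature and leakage can be sketched as follows (a minimal illustration with hypothetical block powers, coefficients, and a trivial stand-in for the thermal solver; the real flow uses HotSpot's grid model and the exponential model of [27]):

import math

# Hypothetical per-block dynamic power [W] and leakage at a reference temperature [W].
blocks = {
    "IntExec": {"dyn": 2.0, "leak_ref": 0.15},
    "L2":      {"dyn": 0.8, "leak_ref": 1.20},   # dense SRAM: large leakage share
}

T_REF = 60.0   # reference temperature for leak_ref [deg C] (assumed)
BETA  = 0.045  # exponential coefficient [1/deg C], roughly 4-5% per deg C (cf. [22])

def leakage(block, temp_c):
    # Leakage grows exponentially with temperature, as in [27].
    return block["leak_ref"] * math.exp(BETA * (temp_c - T_REF))

def thermal_step(total_power):
    # Stand-in for the thermal solver: ambient 45 deg C plus a linear heating term.
    return 45.0 + 5.0 * total_power

# Iterate power -> temperature -> leakage until the temperatures settle.
temps = {name: T_REF for name in blocks}
for _ in range(100):
    new_temps = {name: thermal_step(blk["dyn"] + leakage(blk, temps[name]))
                 for name, blk in blocks.items()}
    converged = all(abs(new_temps[n] - temps[n]) < 1e-3 for n in temps)
    temps = new_temps
    if converged:
        break

print({name: round(t, 1) for name, t in temps.items()})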

Table 3 lists the benchmarks we selected: three scientific applications from the SPLASH-2 suite [28] and the MPEG-2 encoder/decoder from ALPbench [29]. All benchmarks have been run to completion and statistics have been collected over the whole program run,


Table 2: Architecture configuration. Issue-width-dependent parameters are listed as 2-/4-/6-/8-issue values; the issue width is used as the representative of a core configuration throughout the paper.

Processor
  Frequency                  3 GHz @ 70 nm
  Vdd                        0.9 V
  Core area (w/ L1$) [mm²]   31 / 39 / 55 / 78
  Branch penalty             7 cycles
  Branch unit                BTB (1K entries, 2-way), Alpha-style hybrid branch predictor (3.7KB), RAS (32 entries)
  Fetch/issue/retire width   2/2/4 | 4/4/6 | 6/6/8 | 8/8/10
  Issue queue size           16 / 32 / 48 / 64
  INT I-window size          10 / 20 / 30 / 40
  FP I-window size           5 / 15 / 25 / 35
  INT registers              40 / 80 / 120 / 160
  FP registers               32 / 72 / 112 / 152
  ROB size                   40 / 80 / 120 / 160
  LdSt/Int/FP units          1/2/1 | 1/4/2 | 1/6/3 | 1/8/4
  Ld/St queue entries        16/16 | 32/32 | 48/48 | 64/64
  IL1                        64KB, 2-way, 64B block, 2 ports; hit/miss lat 2/1 cycles
  ITLB entries               64
  DL1                        64KB, 2-way, 64B block, write-through; hit/miss lat 2/1 cycles; MAF size 8
  DTLB entries               128
  Unified L2                 256 KB / 512 KB / 1024 KB / 2048 KB; 8-way, 64B block, write-back, 2 ports; hit/miss lat 10/4 cycles; MAF size 32
  Coherence protocol         MESI

CMP
  #cores                     2 / 4 / 8
  Shared bus                 76B, 1.5 GHz, 10 cycles delay; bandwidth 57 GB/s
  Memory bus bandwidth       6 GB/s
  Memory lat                 490 cycles

Table 3: Benchmarks

Benchmark   #graduated instructions (M)   Description – Problem size
FMM         4387–5741                     Fast Multipole Method – 16k particles, 10 steps
mpeg2dec    1168                          MPEG-2 decoder – flowg (Stanford), 352×240, 10 frames
mpeg2enc    4275                          MPEG-2 encoder – flowg (Stanford), 352×240, 10 frames
VOLREND     1425–1888                     Volume rendering using ray casting – head, 50 viewpoints
WATER-NS    1780                          Forces and potentials of water molecules – 512 molecules, 50 steps

after skipping initialization. For thermal simulations, we used the power trace of the whole benchmark run (dumped every 10,000 cycles). We used the standard data sets for SPLASH-2, while we limited MPEG-2 to 10 frames. Table 3 shows the total number of graduated instructions for each application. This number is fairly constant across all the simulated configurations for all the benchmarks, except for FMM and VOLREND, whose variation (23-24%) is related to different thread spawning and synchronization overheads as the number of cores is scaled.

6. PERFORMANCE AND POWER EVALUATION

In this section, we evaluate the simulated CMP configurations from the point of view of performance and energy. We consider the interdependence of these metrics as well as chip physical characteristics (temperature and floorplan). We base this analysis on average values across all the simulated benchmarks, apart from Section 6.4, which presents a characterization of each benchmark.

Figure 2: Execution time (delay). Each group of bars (L2 cache size varying from 256KB to 2048KB) is labeled with <#procs>-<issue width>.

6.1 Performance

Figure 2 shows the average execution time for all the configurations, in terms of cores, issue width and L2 caches. Results in terms of IPC show the same behavior.

It can be observed that when scaling the system from 2 to 4, and from 4 to 8 processors, the delay is reduced by 47% at each step. This trend is homogeneous across the other parameter variations. It means that these applications achieve high parallel efficiency, and that the communication overhead is not appreciable.

By varying the issue width, for a given processor count and L2 size, the largest speedup is observed when moving from 2- to 4-issue (21%). If the issue width is increased further, the delay improvement saturates (5.5% from 4- to 6-issue, and 3.3% from 6- to 8-issue), because of diminishing ILP returns. This trend is also seen (though less dramatically) for the 4- and 8-core configurations.

The impact of the L2 cache size is at most 4-5% (from 256KB to 2048KB) when the other parameters are fixed. Each time the L2 size is doubled, the speedup is around 1.2-2%. This behavior is also orthogonal to the other parameters, since the trend can be seen across all configurations.

6.2 Energy

In this section, we analyze the effect of varying the number of cores, the issue width and the L2 cache on the system energy.

Number of cores. Figure 3 reports the energy behavior. It can be seen that, when the number of processors is increased, the system energy increases only slightly, even though the area nearly doubles as the core count doubles. For example, for a 4-issue 1024KB configuration, the energy increase rate is 5% (both from 2 to 4 cores and from 4 to 8 cores). In fact, the leakage energy increase, due to the larger chip, is mitigated by the reduction of the execution time (i.e., the time the chip is powered on and leaking is shorter).

On the other hand, when considering configurations with a given resource budget and similar area (see Table 1), the energy typically decreases. For a total issue width of 16 and 2MB of total cache, the energy decreases by 37% from 2 to 4 cores. From 4 to 8 cores, it slightly increases, but the 8-core chip is 22% larger. In this case, the shorter delay makes the difference, letting relatively large CMPs achieve good energy efficiency.

Issue width. By varying the issue width and, therefore, the core


Figure 3: Energy. Each group of bars (L2 cache size varying from 256KB to 2048KB) is labeled with <#procs>-<issue width>.

Figure 4: Leakage energy. Each bar is labeled <#issue>-<L2 size>. The core/L2 leakage breakdown is represented by a dash superposed on each bar.

complexity, energy has a minimum at 4-issue. This occurs for 2, 4 and 8 processors. The 4-issue cores offer the best balance between complexity, power consumption and performance. The 2-issue microarchitecture cannot efficiently extract the available parallelism (leakage due to the longer delay dominates). 8-issue cores need much more energy to extract little more parallelism than the 4- and 6-issue ones, and are therefore quite inefficient.

L2 size. The L2 cache is the major contributor to chip area, so it affects leakage energy considerably. The static energy component in our simulations ranges from 37% (2P 8-issue 256KB-L2) to 72% (8P 2-issue 2048KB-L2). Figure 4 shows the leakage energy and the core-leakage/L2-leakage breakdown. Most of the leakage is due to the L2 caches, and this component scales linearly with the L2 cache size. The static power of the core increases as the issue width and the number of cores are increased. Nevertheless, this effect is mitigated by larger L2 caches, since they offer enhanced heat-spreading capability across the chip (the overall temperature decreases, resulting in less leakage energy).

Average chip temperature is shown in Figure 5. Despite abso-

Figure 5: Average chip temperature. Each group of bars (L2 cache size varying from 256KB to 2048KB) is labeled with <#procs>-<issue width>.

Figure 6: Energy-Delay scatter plot for the simulated configurations, with curves of constant EDP, ED1.5P, and ED2P.

lute variation is at most 1.7°C, even a small temperature variation can correspond to an appreciable energy change. In [22], SPICE simulation results are reported showing that 1°C translates into around 4-5% leakage variation. In our simulations, the temperature reduction due to larger caches causes the reduction of core leakage energy shown in Figure 4: a 5% reduction in core leakage energy corresponds to a 0.5°C reduction of the average chip temperature. In contrast, leakage in the cache is much more sensitive to area than to temperature, so the leakage increase due to the larger area cannot be compensated by the temperature decrease.

6.3 Energy Delay

Figure 6 reports the Energy-Delay scatter plot for the design space, together with several curves of constant EDnP, with n = 1, 1.5, and 2. Furthermore, we connected with a line all the points of the energy-efficient family [23]. In the Energy-Delay plane, these points form the convex hull of all possible configurations of the design space.
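To make the selection procedure explicit, the following sketch (with made-up sample points; the actual values are those plotted in Figure 6) extracts the non-dominated configurations and the EDnP optimum from (delay, energy) pairs:

# Hypothetical (delay [s], energy [J]) points for a few configurations.
configs = {
    "2c-4i-256KB":  (1.60, 11.0),
    "4c-4i-256KB":  (0.85, 12.0),
    "8c-4i-256KB":  (0.45, 14.0),
    "8c-8i-2048KB": (0.42, 24.0),
}

def ednp(delay, energy, n):
    # Energy * Delay^n (EDP for n=1, ED2P for n=2).
    return energy * delay ** n

def efficient(points):
    # Keep points not dominated in both delay and energy; the convex hull of
    # this set is the energy-efficient family of [23].
    return {name: (d, e) for name, (d, e) in points.items()
            if not any(d2 <= d and e2 <= e and (d2, e2) != (d, e)
                       for d2, e2 in points.values())}

print("efficient:", sorted(efficient(configs)))
for n in (1, 1.5, 2):
    best = min(configs, key=lambda name: ednp(*configs[name], n))
    print(f"ED{n}P optimum:", best)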

The best configuration from the Energy-Delay point of view is


Figure 8: IPC for each benchmark.

Figure 9: Energy per cycle for each benchmark.

Figure 10: Average temperature for each benchmark (4P 8-issue 512KB-L2 vs. 8P 4-issue 256KB-L2).

Figure 7: Energy Delay² Product. Each group of bars (L2 cache size varying from 256KB to 2048KB) is labeled with <#procs>-<issue width>.

the 8-core 4-issue 256KB-L2 one. This is optimal for EDnP with n < 2.5, as can be intuitively observed in Figure 6. If n ≥ 2.5, the optimum moves to 512KB-L2 and therefore to the points of the energy-efficient family with higher hardware intensity¹.

When minimizing energy, the optimal solution is the 2-core 4-issue 256KB-L2 configuration (low hardware intensity). The 2-core 2-issue configurations are penalized by poor performance, while the 4-core 4-issue 256KB-L2 one represents an interesting trade-off, featuring only 5.5% worse energy and 4.8% better delay.

Nevertheless, one could also consider other metrics such as area (see Table 1) or average temperature (see Figure 5). In the first case (area), architectures with smaller core counts may be preferred. For example, the 4-core 4-issue 256KB-L2 configuration needs only 90 mm², while the optimal one (8-core 4-issue 256KB-L2) requires 180 mm² and features 8.7% better delay. Otherwise, if chip temperature is to be minimized, larger caches may be preferred. For example, the 8-core 4-

¹ The hardware intensity is defined as the ratio of the relative increase in energy to the corresponding relative gain in performance achievable by architectural modification. See [23] for a detailed discussion.

issue 2048KB-L2 configuration features a 0.7°C lower average temperature. This comes at the expense of significant leakage power consumption in the L2 caches (see Figure 4).

In Figure 6, we represent as squares the points with 2MB of total L2 cache (equivalent to the optimum). They form three clusters, each one corresponding to a core/cache configuration: 8/256KB, 4/512KB and 2/1024KB. These points show that increasing the number of cores, while decreasing the L2 size of each processor, improves the energy-delay behavior. Furthermore, when considering the same total issue width (8-core 4-issue and 4-core 8-issue), the advantage of reducing the issue width while increasing the core count becomes evident. The parallelism of the applications in the considered benchmark set is exploited much better by multicore architectures than by complex monolithic cores. It can also be observed that each cluster progressively 'rotates' when moving towards the optimum, making different issue widths more similar from the point of view of delay.

Figure 7 shows the Energy Delay² Product for the simulated configurations. ED2P strongly favors high-performance architectures. In fact, 8-core CMPs, regardless of the other parameters, are better than any other architecture: from 2% (w.r.t. large caches) up to 68% (w.r.t. the ED2P optimum).

These results show that the rule of thumb that configurations offering the same ILP (#cores × issue width) have similar energy/performance trade-offs is not totally accurate when considering ED2P (it still holds for EDP). For example, the 4-issue 2-core 512KB CMP is equivalent to the 2-issue 4-core 256KB CMP in terms of hardware budget, but not in ED2P. Note that both configurations have a total issue width of 8 and 1MB of L2 cache. Nevertheless, the area budget favors the 2-core over the 4-core design (68 mm² vs. 81 mm² die area).

Cache size has a relatively small impact on performance (see Figure 2), while it is important for energy. For the considered data sets, a 256KB L2 cache per processor can be considered sufficient; moreover, it minimizes energy. It should be pointed out that the L2 size affects core leakage and L2 leakage differently (see Figure 4). Intelligent cache management, aware of which data are actually used, could considerably reduce the dependence of leakage on the L2 size, making larger caches more attractive when considering energy/delay trade-offs.

6.4 Per-Benchmark Results

Figures 8, 9, and 10 show IPC, energy per cycle, and average temperature for each benchmark. Results are reported for two configurations: the optimal one (4-issue, 8 procs, 256KB L2) and a configuration with an equivalent total cache/issue budget, i.e. 8-issue, 4 procs, 512KB L2. These benchmarks have been selected since they represent two different kinds of workload. The mpeg2 decoder and encoder are integer video applications, stressing the integer units and the memory system. FMM, VOLREND, and WATER-NS are floating-point scientific benchmarks, each featuring different characteristics in terms of IPC and energy.

IPC ranges from around 3 (FMM, 4-core) to 12 (WATER-NS, 8-core), showing up to a 100% difference across the benchmarks. Note that the IPC of mpeg2enc and FMM is much lower than that of VOLREND and mpeg2dec (3-5 vs. 5-8), while WATER-NS stands out with really high IPC (9-12). On the other hand, the energy per cycle does not vary much across benchmarks (24-32 nJ/cycle), and it is also quite similar for the 4-core and 8-core CMPs. Temperature variability is 1.7°C across the benchmarks. The temperature difference between the two architectures reported in the graphs is about 0.5°C for mpeg2enc and FMM, 0.25°C for VOLREND and mpeg2dec, and negligible for WATER-NS.

7. TEMPERATURE SPATIAL DISTRIBUTION

This section discusses the spatial distribution of the temperature (i.e. the chip thermal map) and its relationship with the floorplan design. We first analyze several different floorplan designs for CMPs. We then discuss the temperature distribution while varying processor core complexity.

7.1 Evaluation of Floorplans

The main goal of this section is to discuss how floorplan design at the system level can impact chip thermal behavior. Several floorplan topologies, featuring different core/cache placements, are analyzed. We reduce the scope of this part of the work to 4-core, 4-issue CMPs with L2 cache sizes of 256KB and 1024KB; similar results are obtained for the other configurations. Figure 11 shows the floorplans we have considered in addition to those in Figure 1. Floorplans are classified with respect to the processor positions in the die:

• Paired (Figure 1). Cores are paired and placed on alternate sides of the die.

• Lined up (Figure 11.b and Figure 11.d). Cores are lined up along one side of the die.

• Centered (Figure 11.a and Figure 11.c). Cores are placed in the middle of the die.

Area and aspect ratio are roughly equivalent for each cache size and topology.

We report the thermal map for each floorplan type and cache size. This is the stationary solution obtained by HotSpot. We use the grid model, making sure that the grid spacing is 100µm for each floorplan. The grid spacing determines the magnitude of the hotspots, since each grid point roughly reports the average temperature of the 120µm × 120µm region surrounding it. Our constant grid spacing ensures that hotspot magnitudes are comparable across designs.
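As a small illustration of how the per-floorplan figures in Table 4 follow from the grid output (hypothetical values; HotSpot's actual file format is not reproduced here), the maximum and average temperatures are simple reductions over the fixed-spacing grid:

# Hypothetical steady-state temperature grid [deg C] at fixed spacing; a real
# die at 100 um spacing has on the order of 10^4 grid points.
grid = [
    [61.0, 62.5, 63.0, 61.5],
    [62.0, 74.2, 72.1, 61.8],
    [61.2, 71.9, 70.5, 61.4],
]

cells = [t for row in grid for t in row]
t_max = max(cells)
t_avg = sum(cells) / len(cells)
# Because the spacing is the same for every floorplan, these two numbers
# can be compared directly across designs.
print(f"Max {t_max:.1f} C, Avg {t_avg:.1f} C")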

Power traces are from the WATER-NS benchmark. We selected this application as representative of the behavior of all the benchmarks; the conclusions drawn in this analysis apply to each simulated application.

In each thermal map (take Figure 12 as an example), several common phenomena can be observed. Temperature peaks correspond to each processor, due to the hotspots in the core.

Figure 11: Additional floorplans taken into account for the evaluation of the spatial distribution of the temperature: (a) 256KB centered, (b) 256KB lined up, (c) 1024KB centered, (d) 1024KB lined up.

Table 4: Average and maximum chip temperature for different floorplans

              L2 = 256KB               L2 = 1024KB
Floorplan     Max [°C]   Av. [°C]      Max [°C]   Av. [°C]
Paired        74.2       63.8          73.4       62.2
Lined up      74.0       63.5          74.2       62.2
Centered      72.4       63.6          73.0       62.2

The nature of the hotspots will be analyzed in the next section. Caches – in particular the L2 cache – and the shared bus are cooler, apart from the portions which are heated by their proximity to hot units.

The paired floorplans for 256KB (Figure 12) and 1024KB (Figure 15) show similar thermal characteristics. The temperature of the two hotspots of each core (the hottest is the FP unit, the other one is the integer window coupled with the Ld/St queue) ranges from 73.1/71.4 to 74.2/72°C for 256KB L2. The hottest peaks are the ones closest to the die corners (FP units on the right side), since they have fewer sides available for spreading heat into the silicon. For 1024KB L2, the hotspot temperatures are between 72.5/71.5 and 73.4/72°C. In this case the hottest peaks are between each core pair, because of thermal coupling between processors.

The same phenomena appear for the lined-up floorplans (Figure 13 and Figure 16). The rightmost hotspots suffer from the corner effect, while the inner ones suffer from thermal coupling.

Figure 14 shows an alternative design, in which processors are placed in the center of the die (see Figure 11.a). Their orientation is also different, since they are back to back.

As can be observed in Table 4, the maximum temperature is lowered significantly (by 1.6/1.8°C). This is due to the increased spreading perimeter of the processors: two sides face the surrounding silicon, and another side faces the bus (the bus goes across the die, see Figure 11.a). The centered floorplans are the only ones with a fairly hot bus; in this case, leakage in the bus can be significant. Furthermore, for this floorplan, the temperature between the backs of the processors is quite 'high' (69.4°C), unlike other floorplans featuring relatively 'cold' backs (67.9°C).

For the 1024KB-centered floorplan (Figure 17) these characteristics are less evident. The maximum temperature decrease is only 0.4/1.2°C with respect to the paired and lined-up floorplans. In particular, it is small when compared with the paired one. This is reasonable considering that, in the paired floorplan, the cache (1024KB) surrounds each pair of cores, therefore providing enough spreading area.

Overall, the choice of the cache size and the relative placing of


Figure 12: Thermal map for the 256KB-paired floorplan (Figure 1).

Figure 13: Thermal map for the 256KB-lined up floorplan (Figure 11.b).

Figure 14: Thermal map for the 256KB-centered floorplan (Figure 11.a).

Figure 15: Thermal map for the 1024KB-paired floorplan (Figure 1).

Figure 16: Thermal map for the 1024KB-lined up floorplan (Figure 11.d).

Figure 17: Thermal map for the 1024KB-centered floorplan (Figure 11.c).

the processors can be important in determining the chip thermal map. Layouts where processors are placed in the center and where L2 caches surround the cores typically feature lower peak temperatures. The magnitude of the temperature decrease, with respect to alternative floorplans, is limited to approximately 1-2°C. The impact on the system energy can be considered negligible, since L2 leakage dominates. Despite this, such a reduction of the hotspot temperature can lead to leakage reductions localized at the hotspot sources.

7.2 Processor Complexity

Figure 18 reports several thermal maps for different processor issue widths. We selected the WATER-NS benchmark, a floating-point one, since this enables us to discuss heating in both FP and integer units. The CMP configuration is 2-core 256KB-L2.

The hottest units are the FP ALU, the Ld/St queue and the integer instruction window. Depending on the issue width, the temperatures of the integer window and the Ld/St queue vary. For 2- and 4-issue, the FP ALU dominates, while as the issue width is scaled up, the integer window and the Ld/St queue become the hottest units of the die. This is due to the super-linear increase of power density in these structures.

Hotspot magnitude depends on the issue width – i.e. the power density of the microarchitecture blocks – and on the 'compactness' of the floorplan. The 4-issue chip exhibits higher temperature peaks than the 6-issue one, because the floorplan of the 6-issue core has many cold units surrounding its hotspots (e.g. see the FP ALU hotspots). Interleaving hot and cold blocks is an effective method to provide spreading silicon to hot areas.

Thermal coupling of hotspots exists for various units shown in Figure 18, and it is often caused by inter-processor interaction. For example, in the 2-issue floorplan, the Ld/St queue of the left processor is thermally coupled with the FP ALU of the right one. In the 4-issue one, the integer execution unit is warmed by the coupling between the Ld/St queue and the FP ALU of the two cores. Intra-processor coupling can be observed in the 4-issue design, between the FP ALU and the FP register file, and in the 8-issue one, between the integer window and the Ld/St queue.

7.3 Discussion

Different factors determine hotspots in the processors. The power density of each microarchitecture unit provides a proxy for temperature, but does not suffice to explain effects due to thermal


Figure 18: Thermal maps of the 2-core 256KB-L2 CMP for VOLREND, varying issue width (from left to right: 2-, 4-, 6-, and 8-issue); the temperature scale spans 60–76°C.

coupling and spreading area. It is important to consider the geometric characteristics of the floorplan. We summarize those impacting chip heating as follows:

• Proximity of hot units. If two or more hotspots are close together, this produces thermal coupling and therefore raises the temperature locally.

• Relative positions of hot and cold units. A floorplan interleaving hot and cold units will result in lower global power density (and therefore lower temperature).

• Available spreading silicon. Units placed in a position that limits their spreading perimeter, e.g. in a corner of the die, will reach higher temperatures.

These principles, all related to the common idea of lowering the global power density, apply to the core microarchitecture as well as to the CMP system architecture. In the first case, they suggest keeping hot units, like register files and instruction queues, away from each other. For CMPs, they can be translated as follows (a toy illustration of these rules is sketched after the list):

• Proximity of the cores. The interaction between two cores placed side by side can generate thermal coupling between some units of the processors.

• Relative position of cores and L2 caches. If L2 caches are placed to surround the cores, this results in better heat spreading and lower chip temperature.

• Position of cores in the die. Processors placed in the center of the die offer better thermal behavior than processors in the die corners or along the die edge.
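As a toy illustration of these rules (hypothetical block list and weights; this is not a tool used in this work), a floorplan can be given a rough thermal-risk score by penalizing hot blocks that sit next to other hot blocks or at the die boundary:

# Hypothetical floorplan description: power-density class, adjacency, placement.
floorplan = {
    "P0_FPAdd": {"hot": True,  "corner": True,  "adjacent": ["P0_LdStQ"]},
    "P0_LdStQ": {"hot": True,  "corner": False, "adjacent": ["P0_FPAdd", "L2_0"]},
    "L2_0":     {"hot": False, "corner": False, "adjacent": ["P0_LdStQ"]},
}

def thermal_risk(fp):
    # +1 per hot block, +1 per hot-hot adjacency (thermal coupling),
    # +1 for a hot block in a corner (reduced spreading perimeter).
    score = 0
    for blk in fp.values():
        if not blk["hot"]:
            continue
        score += 1
        score += sum(1 for n in blk["adjacent"] if fp[n]["hot"])
        score += 1 if blk["corner"] else 0
    return score

print(thermal_risk(floorplan))   # lower is better; here 3 + 2 + 0 = 5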

8. CONCLUSIONS AND FUTURE WORK

In this paper, we have discussed the impact of the choice of several architectural parameters of CMPs on power/performance/thermal metrics. Our conclusions apply to multicore architectures composed of processors with private L2 caches and running parallel applications. Our contributions can be summarized as follows:

• The choice of L2 cache size is not only a matter of performance/area, but also of temperature.

• According to ED2P, the best CMP configuration consists of a large number of fairly narrow cores. Wide cores can be considered too power hungry to be competitive.

• Floorplan geometric characteristics are important for chip temperature. The relative position of processors and caches must be carefully determined to avoid hotspot thermal coupling.

Several other factors, such as area, yield and reliability, may also affect design choices. At the same time, orthogonal architectural alternatives such as heterogeneous cores, L3 caches, and complex interconnects might be present in future CMPs. Evaluating all these alternatives is left for future research.

9. ACKNOWLEDGMENTS

This work has been supported by the Generalitat de Catalunya under grant SGR-00950, the Spanish Ministry of Education and Science under grants TIN2004-07739-C02-01, TIN2004-03072 and HI2005-0299, and the Ministero dell'Istruzione, dell'Università e della Ricerca under the MAIS-FIRB project.

10. REFERENCES

[1] C. McNairy and R. Bhatia. Montecito: A dual-core, dual-thread Itanium processor. IEEE Micro, pages 10–20, March/April 2005.
[2] P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: A 32-way multithreaded Sparc processor. IEEE Micro, pages 21–29, March/April 2005.
[3] R. Kalla, B. Sinharoy, and J. M. Tendler. IBM Power5 chip: A dual-core multithreaded processor. IEEE Micro, pages 40–47, March/April 2004.
[4] Trevor Mudge. Power: A first class constraint for future architectures. In HPCA '00: Proceedings of the 6th International Symposium on High-Performance Computer Architecture, Washington, DC, USA, 2000. IEEE Computer Society.
[5] Kevin Skadron, Mircea R. Stan, Karthik Sankaranarayanan, Wei Huang, Sivakumar Velusamy, and David Tarjan. Temperature-aware microarchitecture: Modeling and implementation. ACM Trans. Archit. Code Optim., 1(1):94–125, 2004.
[6] Intel. White paper: Superior performance with dual-core. Technical report, Intel, 2005.
[7] Kelly Quinn, Jessica Yang, and Vernon Turner. The next evolution in enterprise computing: The convergence of multicore x86 processing and 64-bit operating systems – white paper. Technical report, Advanced Micro Devices Inc., April 2005.
[8] Jose Renau, Basilio Fraguela, James Tuck, Wei Liu, Milos Prvulovic, Luis Ceze, Smruti Sarangi, Paul Sack, Karin Strauss, and Pablo Montesinos. SESC simulator, January 2005. http://sesc.sourceforge.net.
[9] David Brooks, Vivek Tiwari, and Margaret Martonosi. Wattch: A framework for architectural-level power analysis and optimizations. In Proc. of the 27th International Symposium on Computer Architecture, pages 83–94. ACM Press, 2000.
[10] Premkishore Shivakumar and Norman P. Jouppi. CACTI 3.0: An integrated cache timing, power, and area model. Technical Report 2001/2, Western Research Laboratory, Compaq, 2001.
[11] J. Huh, D. Burger, and S. Keckler. Exploring the design space of future CMPs. In PACT '01: Proceedings of the 10th International Conference on Parallel Architectures and Compilation Techniques, pages 199–210, Washington, DC, USA, 2001. IEEE Computer Society.
[12] M. Ekman and P. Stenstrom. Performance and power impact of issue-width in chip-multiprocessor cores. In ICPP '03: Proceedings of the 2003 International Conference on Parallel Processing, pages 359–369, Washington, DC, USA, 2003. IEEE Computer Society.
[13] Stefanos Kaxiras, Girija Narlikar, Alan D. Berenbaum, and Zhigang Hu. Comparing power consumption of an SMT and a CMP DSP for mobile phone workloads. In CASES '01: Proceedings of the 2001 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, pages 211–220, New York, NY, USA, 2001. ACM Press.
[14] Ed Grochowski, Ronny Ronen, John Shen, and Hong Wang. Best of both latency and throughput. In ICCD '04: Proceedings of the IEEE International Conference on Computer Design, pages 236–243, Washington, DC, USA, 2004. IEEE Computer Society.
[15] J. Li and J. F. Martinez. Power-performance implications of thread-level parallelism on chip multiprocessors. In ISPASS '05: Proceedings of the 2005 International Symposium on Performance Analysis of Systems and Software, pages 124–134, Washington, DC, USA, 2005. IEEE Computer Society.
[16] J. Li and J. F. Martinez. Dynamic power-performance adaptation of parallel computation on chip multiprocessors. In HPCA '06: Proceedings of the 12th International Symposium on High Performance Computer Architecture, Washington, DC, USA, 2006. IEEE Computer Society.
[17] Yingmin Li, David Brooks, Zhigang Hu, and Kevin Skadron. Performance, energy, and thermal considerations for SMT and CMP architectures. In HPCA '05: Proceedings of the 11th International Symposium on High-Performance Computer Architecture, pages 71–82, Washington, DC, USA, 2005. IEEE Computer Society.
[18] Y. Li, B. Lee, D. Brooks, Z. Hu, and K. Skadron. CMP design space exploration subject to physical constraints. In HPCA '06: Proceedings of the 12th International Symposium on High Performance Computer Architecture, Washington, DC, USA, 2006. IEEE Computer Society.
[19] Rakesh Kumar, Victor Zyuban, and Dean M. Tullsen. Interconnections in multi-core architectures: Understanding mechanisms, overheads and scaling. In ISCA '05: Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 408–419, Washington, DC, USA, 2005. IEEE Computer Society.
[20] K. Sankaranarayanan, S. Velusamy, M. Stan, and K. Skadron. A case for thermal-aware floorplanning at the microarchitectural level. Journal of Instruction Level Parallelism, 2005.
[21] Pedro Chaparro, Grigorios Magklis, Jose Gonzalez, and Antonio Gonzalez. Distributing the frontend for temperature reduction. In HPCA '05: Proceedings of the 11th International Symposium on High-Performance Computer Architecture, pages 61–70, Washington, DC, USA, 2005. IEEE Computer Society.
[22] Ja Chun Ku, Serkan Ozdemir, Gokhan Memik, and Yehea Ismail. Thermal management of on-chip caches through power density minimization. In MICRO 38: Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture, pages 283–293, Washington, DC, USA, 2005. IEEE Computer Society.
[23] V. Zyuban and P. N. Strenski. Balancing hardware intensity in microprocessor pipelines. IBM Journal of Research & Development, 47(5-6):585–598, 2003.
[24] R. E. Kessler. The Alpha 21264 microprocessor. IEEE Micro, 19(2):24–36, 1999.
[25] Subbarao Palacharla, Norman P. Jouppi, and J. E. Smith. Complexity-effective superscalar processors. In ISCA '97: Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 206–218, New York, NY, USA, 1997. ACM Press.
[26] Hang-Sheng Wang, Xinping Zhu, Li-Shiuan Peh, and Sharad Malik. Orion: A power-performance simulator for interconnection networks. In Proc. of the 35th Annual ACM/IEEE International Symposium on Microarchitecture, pages 294–305, 2002.
[27] Weiping Liao, Lei He, and K. M. Lepak. Temperature and supply voltage aware performance and power modeling at microarchitecture level. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 24(7):1042–1053, 2005.
[28] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In ISCA '95: Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 24–36, New York, NY, USA, 1995. ACM Press.
[29] Man-Lap Li, Ruchira Sasanka, Sarita V. Adve, Yen-Kuang Chen, and Eric Debes. The ALPBench benchmark suite for complex multimedia applications. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC-2005), Washington, DC, USA, 2005. IEEE Computer Society.