
An Adaptive Chip Multiprocessor Cache Hierarchy

Dissertation
Author: M. W. Alexander Settle

This thesis entitled:
An Adaptive Chip Multiprocessor Cache Hierarchy
written by M. W. Alexander Settle
has been approved for the Department of Electrical and Computer Engineering

Daniel A. Connors
Prof. Andrew Plesczkun
Prof. Dirk Grunwald

The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above mentioned discipline.

Settle, M. W. Alexander (Ph.D., Electrical Engineering)
An Adaptive Chip Multiprocessor Cache Hierarchy
Thesis directed by Prof. Daniel A. Connors

Abstract

System performance is increasingly coupled to cache hierarchy design as chip multiprocessors (CMP) increase in core count across generations. Higher core counts require large last level cache (LLC) capacities to avoid costly off-chip memory bandwidth and the inherent bottleneck of memory requests from multiple active cores. Currently there are two divisions of thought in CMP cache design: shared versus private last level caches. At the center of the issue is that CMP systems can improve different workloads: the throughput of multiple independent single-threaded applications and the high-performance demands of parallel multi-threaded applications. Consequently, maximizing the improvement of CMP performance in each of these domains requires opposing design concepts. As a result, it is necessary to investigate the behaviors of both shared and private LLC design models, as well as investigate an adaptive LLC approach that works for multiple workloads.

This thesis proposes a scalable CMP cache hierarchy design that shields applications from inter-process interference while offering a generous per core last level cache capacity and low round trip memory latencies. A combination of parallel and serial data mining applications from the Nu-Mine Bench suite along with scientific workloads from the SpecOMP suite are used to evaluate the cache hierarchy performance. The results show that the scalable CMP cache hierarchy decreases the average memory latency of the parallel workloads by 45% against the private cache configuration and an average of 15% against the shared cache. In addition, the memory bandwidth is 25% lower than the private cache bandwidth for parallel applications and 30% lower for the serial workloads.

Dedication

To Mary, Ben, and Charlie Settle

Acknowledgements

This work was made possible by a Fulbright Grant for the 2004-2005 academic year. I am grateful to Antonio González and Enric Gibert for hosting me at the UPC and helping me develop my thesis research; to all the ‘becarios’ at the UPC who made my stay in Spain truly memorable; to Mary for making it possible for me to study and raise a family; to Ben for going to grade school in a foreign country; to Matthew for offering round the clock computer support on Hemisphere; to VJ for numerous insightful discussions about computer architecture research; and to Dan for continuously supporting my research.

Computer time was provided by NSF ARI Grant #CDA-9601817, NSF MRI Grant #CNS-0420873, NASA AIST grant #NAG2-1646, DOE SciDAC grant #DE-FG02-04ER63870, NSF sponsorship of the National Center for Atmospheric Research, and a grant from the IBM Shared University Research (SUR) program.

Contents

Chapter 1: Introduction
  1.1 Current CMP Cache Architectures
    1.1.1 Latency
    1.1.2 Capacity
    1.1.3 Interference
  1.2 Capacity vs. Latency
  1.3 Industry Trends in CMP Cache Architectures
  1.4 Cache Hierarchy Design Considerations
  1.5 Workload Dependence
    1.5.1 Parallel Workloads
    1.5.2 Serial Workloads
  1.6 Dynamic Cache Organization
  1.7 Thesis Statement, Contributions and Outline

Chapter 2: Measuring Shared Cache Interference
  2.1 Interference Analysis
  2.2 Storage Allocation for Shared Associative Caches
    2.2.1 Replacement Algorithms
    2.2.2 Measuring Capacity Misses
    2.2.3 LRU Stack Hit Order
    2.2.4 Reuse Interval
  2.3 Potential to Cause Interference
  2.4 Correlation of Interference to Counters
    2.4.1 Shared Cache Performance Metrics
    2.4.2 Interference Prediction
    2.4.3 Relative Access Rate
    2.4.4 Local Miss Rate
    2.4.5 LRU Stack Rank
    2.4.6 Linear Regression Correlation
  2.5 Extension to Existing Hardware Counters
    2.5.1 Fine Grain Access Counters
    2.5.2 Activity Vectors
    2.5.3 Inter-thread Eviction Counters
  2.6 Summary

Chapter 3: Dynamic Cache Management System
  3.1 Cache Partitioning Description
  3.2 Virtual Partitions
  3.3 Physical Partitions
  3.4 Partition Management
    3.4.1 Precision Model: Upper Bound on Partitioning Performance
    3.4.2 Shadow Cache: Precision Model Search Space
  3.5 Hardware Cache Partitioning Algorithms
    3.5.1 Heuristic Based Cache Partitioning Algorithm
    3.5.2 Cache Line Reuse Based Partitioning
    3.5.3 Reuse Based Partition Algorithm
  3.6 Hardware Based Cache Partitioning Results
  3.7 Fairness
  3.8 Cache Partitioning Performance Results
  3.9 Summary

Chapter 4: A Scalable CMP Cache Hierarchy
  4.1 Introduction
  4.2 Workloads
    4.2.1 Parallel Workloads
  4.3 Scalable LLC
  4.4 Parallel Application Performance
  4.5 Serial Application Performance
    4.5.1 Fairness
    4.5.2 Latency
  4.6 Summary

Chapter 5: Experimental Method
  5.1 Simulation Infrastructure
  5.2 Workloads Studied
  5.3 Real Hardware Measurements
  5.4 OS Scheduler Modifications
  5.5 Linux Scheduler Implementation

Chapter 6: Related Work
  6.1 CMP Cache Hierarchies
  6.2 Innovative Uniprocessor Cache Organizations
  6.3 Cache Sharing
  6.4 OS Scheduling for MT Performance
  6.5 Transition to Multithreaded Processors
    6.5.1 Coarse Grain Multithreading
    6.5.2 Fine Grain Multithreading
    6.5.3 SMT Simultaneous Multithreading
    6.5.4 Performance Issues
    6.5.5 Compilers

Chapter 7: Conclusions

Bibliography

Tables
  2.1 Working Set Sizes
  2.2 512KB Shared LLC Interference Indicators
  4.1 OpenMP Applications
  4.2 Unloaded Cache Access Latencies
  4.3 Serial Workload Name Key
  5.1 Memory Simulator Configuration

Figures
  1.1 Shared Cache Architectures
  1.2 Cache Latency vs. Capacity
  2.1 Inter-process Interference
  2.2 Cumulative Shared LLC Raw Miss Count
  2.3 LRU Stack Abstraction
  2.4 LRU Stack for All Bmarks
  2.5 LRU Rank Hit Contribution
  2.6 Early Eviction Optimization
  2.7 Line Address Reuse
  2.8 Relative Miss Rate to Cache Line Allocation Relation
  2.9 Interference Counter Correlation
  2.10 Twolf Hit Distribution Across Sets
  2.11 GCC Twolf Per Set Miss Counts
  2.12 GCC Per Set Miss Counts
  2.13 Illustrating the Basic Construction of Activity Vectors
  2.14 Inter-thread Interference Graph
  3.1 Gzip Mcf Optimized Cache Line Allocation
  3.2 Partitioned Cache
  3.3 Model Checker
  3.4 Shadow Cache
  3.5 Shadow Cache Performance
  3.6 gzip mcf Optimized Cache Line Allocation
  3.7 Cache Sharing Fairness
  3.8 Cache Sharing Standard Deviation
  3.9 L2 Cache Allocation Model Speedup
  3.10 L2 Cache Fairness
  3.11 L2 Cache Hit Count
  3.12 2MB L2 Cache Speedup
  3.13 CMP L2 Cache Hit Count
  4.1 Scalable CMP Cache Hierarchy
  4.2 Average Memory Latency Across LLC Organizations
  4.3 Memory Bandwidth of Parallel Applications Across LLC Organizations
  4.4 L2 Access Components
  4.5 4M Shared LLC Cache Fairness
  4.6 4M Scalable LLC Cache Fairness
  4.7 Memory Latency of Serial Applications Across LLC Organizations
  4.8 Memory Bandwidth of Serial Applications Across LLC Organizations

Chapter 1: Introduction

In 2005 the computing industry abandoned processor clock frequency as the primary indicator of processor performance. This abrupt shift in marketing rhetoric signaled the realization of a looming problem in CPU design: frequency scaling is bounded by power dissipation and wire delay. In the past, chip makers had been able to rely on new process technologies to deliver sizable increases in processor frequencies; however, at scales of tens of nanometers [29, 4], the physical properties of CMOS transistors impede the rate of increase in clock speeds. Power dissipation has suddenly become a primary CPU design consideration because, under the latest process technologies, processors can actually malfunction or even melt when run at very high frequencies. The wire delay problem has been addressed by increasing the processor pipeline depth, which enables a faster clock frequency by decreasing the amount of work done at each pipeline stage. Extremely deep pipelines, however, exposed uniprocessors to performance loss from branch mispredictions due to the time required to recover from flushing the pipeline. The commercially unsuccessful Intel Pentium 4 processor highlights the need for an industry shift in processor design strategies. The shift away from clock frequency has made performance improvements across processor generations more dependent on architectural enhancements. The latest process technologies support more than 1 billion transistors on a single chip, which is sufficient to accommodate several processors. Increased transistor densities per chip have ushered in the commercial adoption of Chip Multiprocessors (CMP).

CMPs achieve greater performance through parallelism, making them less dependent on frequency scaling. As a result, CMPs have become the premier architecture for high performance computing since 2005.

The CMP design strategy has realized compelling performance gains; however, the presence of many cores introduces a host of new performance bottlenecks. The additional cores per chip increase the memory bandwidth demand, raising the need for large last level caches (LLC) to reduce the number of off-chip memory references [45, 31]. The resulting CMP challenge is to balance the transistor budget between core count and on-chip cache storage to provide the highest possible performance across scientific and general purpose workloads. The design of the cache hierarchy, therefore, plays a critical role in the balance between core count and on-chip storage capacity. Currently there are two divisions of thought in CMP cache design: shared versus private last level caches. At the center of the issue is that CMP systems must improve different workload demands: the throughput of multiple independent single-threaded applications and the high-performance demands of parallel multi-threaded applications. Consequently, maximizing the improvement of CMP performance in each of these domains requires opposing design concepts. As a result, it is necessary to investigate the behaviors of both shared and private LLC designs relative to each workload's requirements. In practice, neither model is sufficient to satisfy each of the workload classifications. This thesis introduces an adaptable, scalable cache hierarchy that provides compelling performance improvements for both parallel and serial workloads relative to the shared and private cache hierarchies.

1.1 Current CMP Cache Architectures

In the shared cache model, the LLC is shared by all the cores, while the higher levels are typically private to each core. The shared LLC has the advantage of making the capacity of the lowest level appear exceptionally large to each hardware context.

The increased capacity available to each core can improve the hit rate, which reduces the off-chip memory bandwidth. However, the access latency to such a large storage block can outweigh the benefits attributed to the additional capacity. Shared LLCs are also the most attractive hierarchy design for parallel applications because they eliminate the need to replicate copies of shared data. Private cache hierarchies, on the other hand, offer less capacity to each context but with a considerably lower access latency. Additionally, private cache hierarchies force a static partition between the hardware contexts, eliminating inter-process cache interference for independent serial applications. The resulting drop in inter-process interference makes the per core performance repeatable, which is a requirement for the embedded computing domain. Parallel applications, however, experience an increase in inter-process cache interference in the private cache model because shared cache lines are replicated across the caches, reducing the effective cache capacity. The design tradeoffs for the two models are centered on the impacts of latency, capacity and interference relative to the workloads run on the system.

1.1.1 Latency

The round trip memory system latency determines the performance of a given cache hierarchy. Several variables impact the round trip latency, such as the number of cache levels, the unloaded access latency of each level, the coherency model, and queuing delays that result from an oversubscription of bandwidth to main memory. Each of these metrics has a different weight on the overall latency based on the type of application running on the machine. For the case of serial applications, bandwidth is a first order concern, especially as the core counts increase. Due to the trend towards larger core counts, the hierarchy must also be scalable. This thesis evaluates the round trip latency from a scalability perspective in order to arrive at the optimal hierarchy for high performance CMP systems.
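The weighting of these latency components can be illustrated with a simple average memory access time (AMAT) calculation. The C sketch below is a minimal back-of-the-envelope model, not the simulation infrastructure used in this thesis; the level names, hit rates, and cycle counts are invented example values.

    /* Back-of-the-envelope average memory access time (AMAT) model.
     * Illustrative only: the hit rates and cycle counts below are invented
     * example values, not measurements from this thesis. */
    #include <stdio.h>

    struct level {
        const char *name;
        double hit_rate;        /* fraction of accesses that hit at this level */
        double access_cycles;   /* unloaded access latency of this level       */
    };

    /* AMAT = L1_latency + miss(L1) * (L2_latency + miss(L2) * (... + memory)) */
    static double amat(const struct level *lv, int n, double mem_cycles)
    {
        double penalty = mem_cycles;
        /* Fold the hierarchy from memory back toward the core. */
        for (int i = n - 1; i >= 0; i--)
            penalty = lv[i].access_cycles + (1.0 - lv[i].hit_rate) * penalty;
        return penalty;
    }

    int main(void)
    {
        /* Hypothetical two-level hierarchy in front of DRAM. */
        struct level hier[] = {
            { "L1",  0.95,  2.0 },   /* private first-level data cache     */
            { "LLC", 0.60, 20.0 },   /* shared or private last-level cache */
        };
        double mem = 300.0;          /* off-chip round trip in cycles      */

        printf("AMAT = %.1f cycles\n", amat(hier, 2, mem));
        return 0;
    }

Queuing delay and coherency traffic would add further terms to the off-chip component, but even this simple model shows how the LLC's hit rate and unloaded access latency trade off directly against each other in the round trip latency.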

1.1.2 Capacity

The effective LLC capacity available to each application is a significant contributor to the memory subsystem performance. The first order effect of a large capacity is an improved hit rate, which decreases the net miss latency of the memory subsystem. However, a large capacity cache comes at the expense of increased access latency and the potential for inter-process cache interference. In a CMP machine, the LLC storage capacity ultimately determines the bandwidth requirements for the system. Larger capacity caches decrease the bandwidth, which in turn lowers the round trip memory system latency. In order to provide a cache hierarchy that converges to the performance requirements of a diverse application set, the latency and bandwidth tradeoffs with respect to capacity must be evaluated.

1.1.3 Interference

Interference occurs in both the parallel and serial application domains, yet it manifests in different ways. In parallel applications, interference occurs when a shared cache line is accessed by a remote peer cache. In this case, the shared line is copied to the requesting cache at the expense of evicting another line. If the sharing component is very large relative to the total LLC references, the shared data interferes with the private local data, which leads to a decrease in the effective cache capacity. For serial applications, interference occurs in shared caches when cache lines for a given application are evicted in order to allocate storage to a different hardware context. Due to the conflicting demands placed on the system relative to interference, an adaptable cache hierarchy must be able to manage the interference from both application domains.
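The serial-workload form of interference, where one context's cache lines are evicted to make room for another's, can be made concrete with a small software model. The sketch below is purely illustrative: it models a single 4-way LRU set shared by two synthetic reference streams and counts cross-thread evictions. It is not the activity vector or inter-thread eviction counter hardware developed in Chapter 2.

    /* Toy model of inter-process interference in one set of a shared,
     * 4-way LRU cache. An eviction counts as interference when the victim
     * line belongs to a different thread than the requester. */
    #include <stdio.h>

    #define WAYS 4

    struct line { long tag; int owner; int valid; };

    static struct line set[WAYS];       /* index 0 = MRU, index WAYS-1 = LRU */
    static long cross_evictions[2];     /* other-thread lines evicted on behalf of each thread */

    static void cache_access(long tag, int thread)
    {
        int i, pos = -1;
        for (i = 0; i < WAYS; i++)
            if (set[i].valid && set[i].tag == tag) { pos = i; break; }

        if (pos < 0) {                  /* miss: victim is the LRU way */
            struct line victim = set[WAYS - 1];
            if (victim.valid && victim.owner != thread)
                cross_evictions[thread]++;   /* evicted someone else's line */
            pos = WAYS - 1;
            set[pos].tag = tag;
            set[pos].owner = thread;
            set[pos].valid = 1;
        }
        /* promote the touched line to the MRU position */
        struct line touched = set[pos];
        for (i = pos; i > 0; i--)
            set[i] = set[i - 1];
        set[0] = touched;
    }

    int main(void)
    {
        for (long t = 0; t < 1000; t++) {
            for (int k = 0; k < 4; k++)
                cache_access(4 * t + k, 0);   /* thread 0: streaming, no reuse     */
            cache_access(1000000 + t % 2, 1); /* thread 1: two-line working set    */
        }
        printf("cross-thread evictions: caused by t0 = %ld, by t1 = %ld\n",
               cross_evictions[0], cross_evictions[1]);
        return 0;
    }

In this toy example the streaming thread repeatedly displaces the other thread's small working set, and the displaced thread's refills evict lines in turn; counting such cross-thread evictions is one way to quantify how much of a shared cache's miss traffic is due to interference rather than capacity.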

Figure 1.1: Shared Cache Architectures. (a) CMP Shared; (b) CMP Private.

Figure 1.1a shows the shared cache configuration for a two level hierarchy. The design provides one large LLC shared among all the first level data caches. Coherency is managed on chip at the bus between the LLC and the first level data caches. This leads to a reduction in memory bandwidth because there are no snoop transactions at the memory bus interface. The bandwidth is further reduced by the larger effective capacity available to the LLC. There are two significant drawbacks to the shared model, however. First, the large capacity leads to an increase in unloaded access latency, which increases non-linearly with cache size. Transistor scaling has consistently outpaced metal wire scaling with each new process generation. The result is that the actual distance a message can travel in one clock cycle has decreased significantly. The wire delay problem is causing cache access latencies to increase rapidly as the capacity increases. The rapid rate of increase of access latency relative to the cache capacity dampens the contribution to performance provided by the improved hit rates associated with larger caches. The second problem is managing the inter-process interference. Without active management of the shared cache, there is nothing to prevent a single rogue process from consuming all the cache storage at the expense of the other applications in the system. Design techniques such as tiled [75] caches help control the large access latencies but they are unable to address the problem of exposure to other applications.

Figure 1.1b shows the private LLC organization.

In this configuration, the total LLC capacity is equivalent to the shared case; however, the effective capacity is less because shared lines must be duplicated across the caches. In this model though, each private cache has a shorter access latency than the shared cache, making it possible for some applications to achieve greater performance due to the reduced memory latency. The working set size relative to the private cache size plays a significant role in determining the round trip latency. If the private LLC has sufficient capacity to cover the working set, then the private cache organization can outperform the shared cache. However, even for applications with large components of shared data, an insufficiently sized private cache slice causes the system to be dominated by capacity misses and a resulting increase in bandwidth. In this model the coherency is maintained at the off-chip memory bus, which adds to the per core bandwidth demands. The LLCs are inclusive, obviating the need to run a coherency engine between the first level caches. The comparatively large drop in access latency associated with smaller LLCs can often outweigh any performance loss from an increased miss rate. For this reason, private LLC configurations are a compelling model for certain application types. In both of these models, and throughout this thesis, only the data reference stream is considered. Commercial CMPs such as the Intel Xeon and Montecito [33, 34] manage the instruction and data streams separately with trace caches and dedicated instruction caches respectively.

1.2 Capacity vs. Latency

Parallel applications clearly benefit from the large fully shared cache shown in Figure 1.1a; however, as the cache size increases so does the access latency. At the last level cache there is a delicate balance between the performance gained by improving the hit rate and the performance lost by increasing the access latency. This problem has been addressed by several techniques [15, 38, 45, 52, 75] in the literature. Figure 1.2 shows the relationship between cache capacity and access latency over a range of process technologies.

Figure 1.2: Cache Latency vs. Capacity (access time versus capacity for the 65nm, 45nm, and 32nm process generations).

The curves were generated from a CACTI [57] model of the most current process generations, 65nm, 45nm, and 32nm, over a range of capacities from 1MB to 32MB. The results show that for smaller cache sizes there is a nearly linear relation between cache size and latency, but as the capacities approach 30MB the relation becomes super linear, with the rate of increase in latency getting larger with each cache capacity step. Figure 1.2 illustrates the need to develop technologies that reduce the access latencies for large caches.

Solutions to the latency problem primarily rely on making the last level cache a collection of smaller caches that are either connected over a bus or a network. For the case of NUCA [38, 15, 9], cache lines with high reference frequencies are migrated to smaller low latency cache slices closer to their originating cores. While these optimizations are able to provide lower average latencies, they do so at the expense of capacity due to the need to replicate data either for sharing or migration purposes. Upon close inspection of latency optimizations for large last level caches, the design trend leads to a hybrid organization that borrows from both the shared and private configurations.
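The balance Figure 1.2 alludes to can be sketched numerically. The following C fragment sweeps LLC capacities with two invented curves, a power-law miss rate and a super-linear access latency, and reports the capacity that minimizes AMAT. The curves and constants are stand-ins chosen only to reproduce the qualitative trend described above; the actual numbers in this thesis come from CACTI and detailed simulation.

    /* Toy sweep of the capacity/latency balance. Both curves are invented
     * stand-ins: miss rate follows a power law in capacity, access time
     * grows super-linearly with capacity. Compile with -lm. */
    #include <stdio.h>
    #include <math.h>

    static double llc_latency_cycles(double mb)   /* invented wire-delay trend   */
    {
        return 8.0 + 3.0 * sqrt(mb) + 0.4 * mb;
    }

    static double llc_miss_rate(double mb)        /* invented power-law miss curve */
    {
        return 0.30 * pow(mb, -0.5);
    }

    int main(void)
    {
        const double l1_cycles = 2.0, l1_miss = 0.05, mem_cycles = 300.0;
        double best_mb = 0.0, best_amat = 1e9;

        for (double mb = 1.0; mb <= 32.0; mb *= 2.0) {
            double amat = l1_cycles +
                          l1_miss * (llc_latency_cycles(mb) +
                                     llc_miss_rate(mb) * mem_cycles);
            printf("%5.1f MB LLC: access %5.1f cycles, miss %4.1f%%, AMAT %5.2f\n",
                   mb, llc_latency_cycles(mb), 100.0 * llc_miss_rate(mb), amat);
            if (amat < best_amat) { best_amat = amat; best_mb = mb; }
        }
        printf("lowest AMAT at %.0f MB for these made-up curves\n", best_mb);
        return 0;
    }

For these made-up curves the AMAT bottoms out at an intermediate capacity: beyond that point the extra hits no longer pay for the additional access latency, which is exactly the tension described above.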

Using a network to connect the caches, as in [75, 66], improves the latency problem without decreasing capacity; however, the network hop latency plays a significant part in determining the total round trip memory access time. In this case, the shared last level cache can be viewed as a collection of smaller private caches connected over a network. Increases in capacity lead to potentially more nodes on the network and longer travel time between nodes. In all of these cases though, the shared last level cache design is moving away from a large monolithic storage block in order to manage latency.

1.3 Industry Trends in CMP Cache Architectures

Presently, all of the major chip makers have released CMP processors, many of them supporting multithreading on each core. While this trend illustrates a clear movement toward greater thread level parallelism in processor architectures, there does not yet appear to be a consensus on the best cache hierarchy model. Both the AMD Athlon T1 [2] and the Intel Itanium Montecito [34] use private cache hierarchies per core. The Montecito processor offers a generous 12MB of last level cache per core, which is primarily due to the fact that in-order machines do not have dedicated hardware for hiding latency other than large caches. Superscalar cores require much smaller capacities, usually between 512KB and 1MB, but as application working set sizes continue to increase the caches will inevitably need to be larger to accommodate them. The dual core Athlon, for example, provides 1MB of private L2 cache for each core, which is sufficiently large for many modern applications; however, a quad core implementation of the same architecture may not have sufficient chip area to support 4MB of total cache storage. As both the core counts and working set sizes continue to increase, the per core capacity requirements may soon reach a point where it is no longer feasible to provide large private storage for each core. The per core capacity requirements underscore the importance of converging on a scalable CMP cache architecture.

The Sparc Niagara [70] demonstrates the drive in industry towards large degrees of thread level parallelism. The architecture supports 8 cores, each of which has up to 4 hardware contexts. The 3MB L2 cache is shared across all of the cores, meaning that at any time there could be up to 32 software contexts competing for shared cache resources. Since the cache is shared in this design, each core can use up to the entire 3MB of storage. If the Niagara had instead used private caches, the cores would be limited to a maximum of only 384KB of LLC storage, which is less than the 512KB offered by most modern superscalar processors. Additionally, since the Niagara provides 4 hardware contexts per core, a single core running a parallel application would likely suffer severe performance penalties with only 384KB of L2 cache. This architecture provides a good example of the competition in the transistor budget between two of the most fundamental components to performance, core count and on-chip cache storage. The Intel Xeon and Yonah processors [33], both dual core x86 implementations, also use a shared L2 cache, in this case 2MB. Although the total capacity is the same as for the private cache Athlon, each individual core has the ability to use up to 2MB of L2 cache. The additional storage can lead to considerable performance improvements for some applications; however, it exposes the others to performance loss from inter-process interference. The IBM Power 5 [63] also provides 2MB of L2 cache shared across two cores. Each core supports 2 hardware thread contexts implemented with simultaneous multithreading (SMT) technology. Both the Power 5 and Power 6 include a level 3 cache, but it is not directly in the data path between memory and L2. In this case the L3 is shared and is modeled after a victim cache, so it only stores data that was evicted from L2. These CMP implementations reinforce the practical need for scalable cache hierarchies that can adapt to the performance demands of multiple workload classes.
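The per-core capacity arithmetic behind these examples is simple but worth making explicit. The short sketch below, using the capacities quoted above, contrasts the fixed per-core share of a statically partitioned (private) LLC with the up-to-full-array share that a shared LLC permits.

    /* Per-core LLC share under static (private) partitioning versus the
     * upper bound a shared LLC allows, using the capacities quoted above. */
    #include <stdio.h>

    int main(void)
    {
        struct { const char *chip; double llc_mb; int cores; } cfg[] = {
            { "Sun Niagara",       3.0, 8 },  /* 3MB shared across 8 cores */
            { "Intel Xeon/Yonah",  2.0, 2 },  /* 2MB shared across 2 cores */
            { "AMD Athlon (dual)", 2.0, 2 },  /* 2 x 1MB private L2        */
        };
        for (int i = 0; i < 3; i++)
            printf("%-18s %4.1f MB / %d cores -> %4.0f KB per core if private, "
                   "up to %4.0f KB if shared\n",
                   cfg[i].chip, cfg[i].llc_mb, cfg[i].cores,
                   1024.0 * cfg[i].llc_mb / cfg[i].cores,
                   1024.0 * cfg[i].llc_mb);
        return 0;
    }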

1.4 Cache Hierarchy Design Considerations

Modern CMP systems are effectively multiprocessors in a single package, and as such, there is an expectation that each core will deliver high performance and have predictable execution times across invocations. The performance predictability of each application is less of a concern in multi-threaded architectures such as SMT, where the additional hardware contexts are used primarily as a way to improve the throughput of the base uniprocessor [71]. In the performance domain though, CMP systems must be capable of delivering high per core performance for serial applications, in addition to providing competitive results for parallel workloads. The requirements for each of these application classes mean that the cache hierarchy must be designed to expose the synergies of their runtime characteristics. This thesis uses latency, capacity and interference as the primary performance metrics to argue for a scalable, configurable cache architecture.

Measuring the performance of CMP cache hierarchies proves challenging in the multi-programmed application domain. In this case, there are two different performance models. The first model considers the combined IPC of the system. In this model, a single high performing application can account for a large enough system IPC that the performance of the other applications is rendered insignificant. The second model concerns the performance of each individual core. In this model, each core must achieve near optimal performance in order to be considered successful. Although maintaining high per core performance may lead to a lower combined system performance, the second approach ensures that each core receives access to the resources it needs to maintain competitive, repeatable performance. For both of these performance models, understanding the following terms is critical for evaluating the effectiveness of the cache architecture.

Performance: Performance is the execution time of the job mix running on the system. This metric accounts for the overall throughput of the machine and is not concerned with the execution rate of individual applications. In this reference frame, the performance of a single application can dominate the system IPC.

Fairness: Fairness is the rate of forward progress that each application makes relative to its performance on a uniprocessor machine with the same resources. Good fairness indicates that all applications are operating close to their uniprocessor performance. When per core performance is the driving issue, the cache system must be optimized to improve fairness.
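These two definitions can be turned into numbers in several ways. The sketch below uses hypothetical per-application IPC values and one common style of metric: summed IPC for throughput, and the ratio of each application's shared-run IPC to its stand-alone IPC for forward progress, with fairness summarized as the worst-to-best ratio of those progress values. The specific formula and the IPC numbers are illustrative choices, not the exact metrics defined later in this thesis.

    /* One simple way to quantify Performance and Fairness as defined above.
     * The per-application IPC values are hypothetical, and the fairness
     * summary (worst/best forward progress) is an illustrative choice. */
    #include <stdio.h>

    int main(void)
    {
        /* per-core IPC: running alone vs. sharing the LLC */
        double ipc_alone[]  = { 1.20, 0.80, 1.50, 0.60 };
        double ipc_shared[] = { 1.10, 0.30, 1.45, 0.55 };
        int n = 4;

        double throughput = 0.0, min_ratio = 1e9, max_ratio = 0.0;
        for (int i = 0; i < n; i++) {
            double ratio = ipc_shared[i] / ipc_alone[i];  /* forward progress */
            throughput += ipc_shared[i];
            if (ratio < min_ratio) min_ratio = ratio;
            if (ratio > max_ratio) max_ratio = ratio;
            printf("app %d: %.0f%% of its uniprocessor performance\n",
                   i, 100.0 * ratio);
        }
        printf("system IPC = %.2f, fairness (worst/best progress) = %.2f\n",
               throughput, min_ratio / max_ratio);
        return 0;
    }

Note how a single cache-starved application (the second one here) barely dents the summed IPC yet drags the fairness summary down, which is precisely the gap between the two performance models discussed above.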

Shared cache architectures achieve good performance by favoring the applications which access the shared cache most frequently, at the expense of the others. While this tends to make the performance of the chip appear high, it does not do anything to regulate the per core performance, making fairness low. Ideally, the CMP should offer high performance and fairness by exploiting the benefits offered by the shared cache, while ensuring that the interference penalties are kept to a minimum. For any CMP system the core count and cache capacity requirements for the target applications drive the decisions of whether or not to implement shared caching. For the private cache case, internal fragmentation is a significant concern. Internal fragmentation occurs when there is sufficient capacity to accommodate the total system wide capacity demands, yet one or more of the caches is over-utilized while the others are under-utilized.

In a dual core system, for example, if the working set size of a given process is less than the total system wide capacity yet larger than the capacity of a private last level cache, then it would suffer a performance penalty in the private cache model. If the other core were simultaneously running an application that adequately fit in the private cache, then the system would suffer from internal fragmentation. Even though the total last level capacity would be sufficient for both applications, the larger would suffer a performance loss. If the same applications were run in a shared cache system with an equivalent cache capacity, then both applications could potentially run at near optimal performance.

The concern is that the applications could cause conflict misses on each other, which would offset the benefit of the larger capacity. Despite the penalty associated with internal fragmentation, the execution times of each application remain independent of the cache footprints of the others. If the CMP is designed to run applications with similar working set sizes, then internal fragmentation may pose less of a problem than for a machine targeted for general use. In a properly managed shared cache organization, however, both applications can access sufficient storage while being isolated from direct interference from the others, leading to greater processor performance.

1.5 Workload Dependence

Support for both parallel and serial applications is critical to modern microprocessor development. Until the advent of system on a chip computing, the sequential model had been the most prevalent. The availability of CMP and multi-threaded systems has led a drive in the computing industry to encourage developers to adopt the parallel programming model in order to better leverage the new architectures. Due to the breadth of the software application space, it is crucial that the latest chip architectures provide compelling performance for both parallel and serial programming paradigms. This thesis examines the runtime characteristics of these application classes in order to argue for a unified cache hierarchy design that delivers compelling performance for each application type. In order for a cache hierarchy design to be successful, the impacts of latency, bandwidth, capacity, and inter-process interference must be weighed relative to the working set sizes and data sharing properties of the target applications. The performance demands of a cache hierarchy which targets only parallel applications are considerably different than for serial applications. As a result, a design that is intended to support a breadth of application types will need to leverage the best attributes from each programming paradigm.

1.5.1 Parallel Workloads

In the parallel application domain, there are several techniques for enabling communication and data sharing across the software contexts. The most common programming model is to have a master process create several child processes with unique address spaces that communicate by sockets or reserved shared memory regions. In this model the processes are usually executing the same instructions but have the ability to perform the computations in parallel. Web servers such as Apache [7] typically operate in this fashion. An alternative to the multi-process programming model is to use a software thread library that provides for lightweight software processes, or threads, that share the same address space as the parent thread. In this model each process has its own private stack space while the global data allocated to the heap is shared across the processes. This model provides more fine-grained control of the parallel algorithms used to process shared data. The variability in the techniques for managing shared data suggests that parallel applications will have different constraints placed on the shared cache depending on the programming model.

1.5.1.1 Workload Traits

This thesis studies applications that use the software thread programming model. These applications typically employ software threads to work on a shared data structure contained in the body of a loop. The threads are usually executing the same instructions while processing different regions of the shared data. The applications have varying degrees of data sharing depending upon the implementation of the parallel loops.
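A minimal C sketch of this threading model is shown below: the worker threads share a heap-allocated array while each keeps its own stack, and each thread processes a disjoint region of the shared data inside a loop. The parallel workloads evaluated in this thesis (SpecOMP, Nu-Mine Bench) are written with OpenMP and thread libraries rather than this hand-rolled pthreads code; the example only illustrates the sharing pattern described above.

    /* Sketch of the software-thread model: shared heap data, private
     * stacks, each thread working on its own region of a shared array.
     * Compile with -lpthread. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NTHREADS 4
    #define N        (1 << 20)

    static double *shared_data;              /* heap: visible to every thread */

    struct slice { int id; long lo, hi; double partial; };  /* per-thread work descriptor */

    static void *worker(void *arg)
    {
        struct slice *s = arg;
        double sum = 0.0;                    /* stack: private to this thread */
        for (long i = s->lo; i < s->hi; i++) /* disjoint region of shared data */
            sum += shared_data[i] * shared_data[i];
        s->partial = sum;
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        struct slice sl[NTHREADS];

        shared_data = malloc(N * sizeof *shared_data);
        for (long i = 0; i < N; i++)
            shared_data[i] = 1.0 / (double)(i + 1);

        for (int t = 0; t < NTHREADS; t++) {
            sl[t] = (struct slice){ t, t * (N / NTHREADS),
                                    (t + 1) * (N / NTHREADS), 0.0 };
            pthread_create(&tid[t], NULL, worker, &sl[t]);
        }
        double total = 0.0;
        for (int t = 0; t < NTHREADS; t++) {
            pthread_join(tid[t], NULL);
            total += sl[t].partial;
        }
        printf("sum of squares = %f\n", total);
        free(shared_data);
        return 0;
    }

Because every thread reads the same heap array, a shared LLC can hold one copy of the hot data for all cores, whereas a private LLC organization would replicate whatever portions several cores touch, which is exactly the effective-capacity tradeoff discussed in Section 1.1.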

1.5.1.2 Impact on Cache Model

The use of shared data and the working set size of each software thread play important roles in determining the expected performance of any cache hierarchy. The parallel applications examined in this thesis are evaluated relative to how efficiently the shared data is distributed across the LLC storage, in addition to the latency and bandwidth required to service requests to shared data.

1.5.2 Serial Workloads

In the shared cache model, the temporal and spatial localities of each process determine their respective storage demands, and in turn are indicators of the amount of interference likely to occur. Since shared caches use replacement policies such as LRU that were designed for uniprocessors, the behavior of any single application in a shared cache is dependent on the properties of the others. The uniprocessor replacement algorithms lead to non-deterministic run times for each application on a CMP machine. Embedded applications are particularly vulnerable to this problem, because they usually have hard real-time scheduling deadlines to meet. The shared cache model introduces the possibility that these deadlines will be missed, unless there is a way to isolate the high priority jobs from the others. In spite of the lower per core capacity of private caches, they are better suited for embedded applications because they offer runtime performance predictability. Given the tradeoffs inherent to each cache hierarchy model, the target workload characteristics play a commanding role in determining the processor's cache architecture.

1.5.2.1 Workload Traits

When serial applications are run in parallel on a CMP processor, the cache capacity demands of each application, coupled with the relative LLC access rates between the applications, determine how the applications will behave under the selected cache hierarchy model.

This thesis exploits these traits to drive optimizations that manage the performance and fairness of independent serial applications.

1.5.2.2 Impact on Cache Model

The runtime characteristics of the target workloads determine how the applications will behave under a given cache hierarchy. In both the private and shared cases, the effects of cache capacity and latency play significant roles in determining the overall performance benefit. This thesis uses the balance between latency and capacity to motivate innovative cache hierarchy designs that improve the serial application performance without sacrificing parallel application performance.

1.6 Dynamic Cache Organization

Rather than fixing either a private or shared cache hierarchy, a memory system can be adaptable and exploit feedback to make the cache resources map to the application properties at run time. In this design scenario, a dynamically managed cache can be made to favor parallel applications in some modes while shielding independent applications from inter-process interference. The resulting model can adjust the distribution of cache resources based on the properties of the co-scheduled workloads. Additionally, the proposed cache organization is scalable with respect to increased CMP core counts. In order to simultaneously manage the latency and bandwidth requirements of the CMP system, the proposed hierarchy adopts a hybrid model that inherits the most attractive qualities of both the private and shared organizations. Providing dynamic management of shared cache resources requires runtime detection of inter-process interference, which cannot be measured directly as the cache hit rate can. Furthermore, changes in interference lead to a shift in both the performance and fairness of the individual applications, both of which are difficult to detect and change at runtime. This thesis advocates a scalable cache hierarchy that supports a breadth of characteristic software
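To make the idea of feedback-driven cache management concrete, the fragment below sketches one simple, invented policy for a way-partitioned shared LLC: each interval, the core with the higher miss rate takes a way from the other, down to a floor of two ways. The counters, interval samples, and policy are illustrative assumptions only; the partitioning mechanisms and algorithms this thesis actually proposes are developed in Chapters 2 and 3.

    /* Illustrative feedback loop for a way-partitioned shared LLC shared by
     * two cores. Every interval, the core suffering the higher miss rate
     * takes one way from the other, subject to a floor. Invented policy and
     * sample counter values, not the thesis's partitioning algorithm. */
    #include <stdio.h>

    #define TOTAL_WAYS 16
    #define MIN_WAYS    2

    struct feedback { long accesses, misses; };   /* per-core interval counters */

    static void rebalance(int ways[2], const struct feedback fb[2])
    {
        double mr0 = (double)fb[0].misses / (double)(fb[0].accesses + 1);
        double mr1 = (double)fb[1].misses / (double)(fb[1].accesses + 1);

        if (mr0 > mr1 && ways[1] > MIN_WAYS)      { ways[0]++; ways[1]--; }
        else if (mr1 > mr0 && ways[0] > MIN_WAYS) { ways[1]++; ways[0]--; }
    }

    int main(void)
    {
        int ways[2] = { TOTAL_WAYS / 2, TOTAL_WAYS / 2 };

        /* made-up interval samples: core 0 is cache-hungry, core 1 streams */
        struct feedback intervals[4][2] = {
            { { 10000, 4000 }, { 50000, 2000 } },
            { { 11000, 3500 }, { 52000, 2100 } },
            { { 10500, 3000 }, { 51000, 2000 } },
            { { 10200, 2600 }, { 50500, 1900 } },
        };

        for (int t = 0; t < 4; t++) {
            rebalance(ways, intervals[t]);
            printf("interval %d: core0 gets %2d ways, core1 gets %2d ways\n",
                   t, ways[0], ways[1]);
        }
        return 0;
    }

The sketch only shows the shape of the feedback loop: runtime counters summarize each core's demand, and the partition shifts capacity toward the core that benefits from it while guaranteeing every core a minimum allocation.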
