
Analytically Modeling the Memory Hierarchy Performance of Modern Processor Systems

ProQuest Dissertations and Theses, 2011
Dissertation
Author: Fang Liu
Abstract:
The first goal of this dissertation is to understand how cache parameters and application behavior influence the number of context switch misses an application suffers. We characterize a previously-unreported type of context switch miss that occurs as an artifact of the interaction between the cache replacement policy and an application's temporal reuse behavior. We characterize the behavior of these "reordered misses" for various applications, cache sizes, and amounts of cache perturbation. As a second contribution, we develop an analytical model that reveals the mathematical relationship between cache design parameters, an application's temporal reuse pattern, and the number of context switch misses the application suffers. We validate the model against simulation studies and find it is sufficiently accurate in predicting the trends of context switch misses with regard to the amount of cache perturbation. The mathematical relationship provided by the model allows us to derive insights into precisely why some applications are more vulnerable to context switch misses than others. Through a case study on prefetching, we find that prefetching tends to aggravate context switch misses, and that a less aggressive prefetching technique can reduce their number. We also investigate how cache size affects context switch misses. Our study shows that under relatively heavy workloads, the worst-case number of context switch misses for an application tends to increase proportionally with cache size, to an extent that may completely negate the reduction in other types of cache misses.

The second goal of this dissertation is to propose a simple yet powerful analytical model that gives us the ability to answer several important questions: (1) How does off-chip bandwidth partitioning improve system performance? (2) In what situations is the performance improvement high or low, and what factors determine that? (3) In what way do cache and bandwidth partitioning interact, and is the interaction negative or positive? (4) Can a theoretically optimum bandwidth partition be derived, and if so, what factors affect it? We believe understanding the answers to these questions is very valuable to CMP system designers in coming up with strategies to deal with the scarcity of off-chip bandwidth in future CMPs.

Modern high performance processors widely employ hardware prefetching techniques to hide long memory access latency. While very useful, hardware prefetching tends to aggravate the bandwidth wall, where system performance is increasingly limited by the availability of off-chip pin bandwidth in CMPs, because of its high bandwidth consumption. Prior studies have proposed improving the prefetching policy or partitioning bandwidth among cores to improve bandwidth usage. However, they either study the techniques in isolation, leaving the significant interaction unexplored, or perform them in an ad-hoc manner, missing important insights. The third goal of this dissertation is to investigate how hardware prefetching and memory bandwidth partitioning impact CMP system performance and how they interact. To reach this goal, we propose an analytical model-based study. The model includes a composite prefetching metric that can help determine under which conditions prefetching can improve system performance, a bandwidth partitioning model that takes prefetching effects into account, and a derivation of the weighted speedup-optimum bandwidth partition sizes for different cores.
Through model-driven case studies, we find several interesting observations that can be valuable for future CMP system design and optimization. We also explore simulation-based empirical evaluation to validate the observations and show that maximum system performance can be achieved by selective prefetching, guided by the composite prefetching metric, coupled with dynamic bandwidth partitioning. (Abstract shortened by UMI.)

TABLE OF CONTENTS

List of Tables
List of Figures

Chapter 1 Introduction
  1.1 Behavior and Implications of Context Switch Misses
  1.2 Effects of Bandwidth Partitioning on CMP Performance
  1.3 Impact of Hardware Prefetching and Bandwidth Partitioning
  1.4 Organization of the Dissertation

Chapter 2 Behavior and Implications of Context Switch Misses
  2.1 Characterizing Context Switch Misses
    2.1.1 Types of Context Switch Misses
    2.1.2 Characterization Methodology
    2.1.3 Characterization Results
  2.2 The Analytical Cache Model
    2.2.1 Assumptions and Input of the Model
    2.2.2 Model Formulation
  2.3 Model Validation
    2.3.1 Validation Methodology
    2.3.2 Validation Results
    2.3.3 Why Applications Suffer from Context Switch Misses Differently
  2.4 Analytical Model Case Studies
    2.4.1 Prefetching
    2.4.2 Cache Sizes
  2.5 Related Work
  2.6 Conclusions

Chapter 3 Effects of Bandwidth Partitioning on CMP Performance
  3.1 Analytical Bandwidth Partitioning Model Formulation
    3.1.1 Assumptions
    3.1.2 Model Formulation
  3.2 Impact of Bandwidth Partitioning on System Performance
    3.2.1 Interaction Between Cache and Bandwidth Partitioning
  3.3 Model Validation and Empirical Evaluation
    3.3.1 Evaluation Environment and Methodology
    3.3.2 Experimental Results
  3.4 Related Work
  3.5 Conclusions

Chapter 4 Impact of Hardware Prefetching and Bandwidth Partitioning
  4.1 Analytical Modeling
    4.1.1 Assumptions and Model Parameters
    4.1.2 Composite Metric for Prefetching
    4.1.3 Memory Bandwidth Partitioning Model with Prefetching
  4.2 Model-Driven Study
    4.2.1 The Impact of Miss Frequency
    4.2.2 The Impact of Prefetching Coverage
    4.2.3 The Impact of Prefetching Accuracy
  4.3 Empirical Evaluation
    4.3.1 Environment and Methodology
    4.3.2 Experimental Results
  4.4 Related Work
  4.5 Conclusions

Chapter 5 Retrospection
  5.1 Summary
  5.2 Reflections

References

LIST OF TABLES

Table 2.1 SPEC2006 benchmarks. The unit for L2 miss frequency is the average number of cache misses in a single cache set that occur in 50 milliseconds. The L2 cache size is 512KB.
Table 3.1 Input and parameters used in our model.
Table 3.2 Comparing natural and optimum bandwidth sharing for any N and when N is large.
Table 3.3 Three representative cases of mixed workloads.
Table 3.4 Weighted speedup ratios over the no-partitioning configuration and interaction of cache and bandwidth partitioning for Mix-1 workloads.
Table 3.5 Weighted speedup ratios over the no-partitioning configuration and interaction of cache and bandwidth partitioning for Mix-2 workloads.
Table 3.6 Weighted speedup ratios over the no-partitioning configuration and interaction of cache and bandwidth partitioning for Mix-3 workloads.
Table 4.1 Input and parameters used in our model.
Table 4.2 Ranking of applications based on ΔCPI = CPI_p − CPI, our composite metric θ = MAK²B²f(c − 2 + (1−c)/a), and the conventional metric, accuracy a. The available bandwidth is 800MB/s.
Table 4.3 Ranking of applications based on ΔCPI = CPI_p − CPI, our composite metric θ = MAK²B²f(c − 2 + (1−c)/a), and the conventional metric, accuracy a. The available bandwidth is 1.6GB/s.
Table 4.4 Ranking of applications based on ΔCPI = CPI_p − CPI, our composite metric θ = MAK²B²f(c − 2 + (1−c)/a), and the conventional metric, accuracy a. The available bandwidth is 800MB/s.

LIST OF FIGURES

Figure 1.1 The average number of context switch misses for a single context switch suffered by SPEC2006 applications running on a processor with a 1MB L2 cache when they are co-scheduled with libquantum.
Figure 1.2 Two types of context switch misses suffered by SPEC2006 applications running on a processor with a 1-MB or 2-MB L2 cache.
Figure 1.3 Weighted speedup of four applications running on a 4-core CMP with a 2MB shared L2 cache, assuming various peak off-chip bandwidth.
Figure 1.4 Weighted speedup (a) and optimum bandwidth allocation (b) for co-schedules running on a dual-core CMP.
Figure 2.1 The content of a cache set right before a process is context switched (a), when it resumes execution (b), after an access to block D (c), and after an access to block C (d).
Figure 2.2 Breakdown of the types of L2 cache misses suffered by SPEC2006 applications with various cache sizes.
Figure 2.3 Normalized number of context switch misses for various time slices on an 8-way associative 1-MB L2 cache.
Figure 2.4 The composition of context switch misses for various cache sizes.
Figure 2.5 Increase in the L2 cache misses due to context switch misses for various shift amounts (2, 4, 6, and 8) on an 8-way associative 1-MB L2 cache. A shift amount of 8 means the entire cache state is replaced.
Figure 2.6 The state transition illustration for the case in which the right-most hole is not at the LRU position. Accessed blocks are marked by asterisks.
Figure 2.7 The state transition illustration for the case in which the right-most hole is at the LRU position. Accessed blocks are marked by asterisks.
Figure 2.8 The Markov model for what previous states can lead to the current state (c, h, n).
Figure 2.9 Mathematical expression for the probability to reach a state (c, h, n).
Figure 2.10 Total context switch misses collected by the simulation model versus estimated by our model with variable shift amounts.
Figure 2.11 Comparing the prediction accuracy of our model using global (entire cache) profile information vs. local (one set) profile information.
Figure 2.12 Stack distance probability function showing the probability of accessing different stack positions.
Figure 2.13 The fractions of the original natural and total L2 cache misses that remain after prefetching is applied.
Figure 2.14 Usefulness of a prefetched block (the probability of prefetched blocks to be used, U_prefetch) vs. usefulness of a demand-fetched block (the probability of demand-fetched blocks to be reused, U_demand).
Figure 2.15 Average stack distance profile of SPEC2006 benchmarks without prefetching vs. with prefetching applied.

Figure 2.16 Normalized context switch misses without prefetching vs. with prefetching vs. with prefetching plus a filter.
Figure 2.17 Natural and context switch cache misses for various cache sizes with a time quantum of 50ms, normalized to the total number of misses on a 512KB L2 cache for each benchmark.
Figure 2.18 Average natural and context switch cache misses of SPEC2006 benchmarks for various cache sizes and time quanta. (a) shows the split misses normalized to the 512KB L2 cache, 5ms case; (b) shows the fraction of context switch misses with regard to the total cache misses for each configuration.
Figure 3.1 The assumed base CMP configuration (a), and the configuration with token bucket bandwidth partitioning (b).
Figure 3.2 Fraction of bandwidth allocated to thread 2 as a function of the miss frequency ratio of thread 2 to thread 1.
Figure 3.3 The weighted speedup when the optimum bandwidth partition is made minus that when no bandwidth partitioning is used, as a function of the ratio of miss frequencies of thread 2 and thread 1 (a), or as a function of the peak available bandwidth B (b). For part (a), we assume M1A1 = 2 million/s and B = 1.6GB/s. For part (b), we assume M2A2/M1A1 = 5.
Figure 3.4 Benchmark sensitivity to L2 cache capacity and miss frequency.
Figure 3.5 Comparing weighted speedup of simulated vs. modeled for mixed workloads.
Figure 3.6 Comparing the weighted speedup for Mix-1-2, Mix-2-2 and Mix-3-2 using fair (equal) partitioning [66] vs. simplified optimum bandwidth partitioning vs. optimum partitioning with various adjustments.
Figure 3.7 Weighted speedup of four configurations with various available bandwidth for Mix-1 workloads.
Figure 3.8 Weighted speedup of four configurations with various available bandwidth for Mix-2 workloads.
Figure 3.9 Weighted speedup of four configurations with various available bandwidth for Mix-3 workloads.
Figure 4.1 Bandwidth share for Thread 2 (a), and the resulting weighted speedup (b), as the miss frequency of Thread 2 varies.
Figure 4.2 Bandwidth share for Thread 2 (a), and the resulting weighted speedup (b), as the prefetching coverage for Thread 2 varies.
Figure 4.3 Bandwidth share for Thread 2 (a), and the resulting weighted speedup (b), as Thread 2's prefetching accuracy varies.
Figure 4.4 Benchmark prefetch coverage and accuracy.
Figure 4.5 Weighted speedup for various partition schemes in a dual-core CMP system with prefetchers turned on (a) and prefetchers turned off (b).
Figure 4.6 Bandwidth shares for various co-schedules under different configurations in a dual-core CMP.
Figure 4.7 Weighted speedup for various co-schedules under different configurations in a dual-core CMP.

Figure 4.8 Weighted speedup of optimum bandwidth partitioning for no prefetching, all prefetching, vs. selectively turning on prefetchers using the composite prefetching metric on a dual-core CMP with 1.6GB/s bandwidth (a), and a quad-core CMP with 3.2GB/s bandwidth (b).

Chapter 1 Introduction

With the continuous development of integrated-circuit technology, Moore's law predicts that the number of transistors on a CMOS silicon chip doubles roughly every eighteen months. In the early stage of a modern processor design cycle, however, this trend poses growing challenges in modeling and estimating the performance of the processor system.

To understand and estimate the performance of processor systems, architects typically build low-level detailed simulation models and run numerous simulations to evaluate performance and obtain insights and design trade-offs. One of the most popular classes of simulation models is cycle-accurate simulators, such as SimpleScalar [15], Simics [60], and SESC [38]. The benefit of such simulation-based performance modeling is that it can be quickly adapted to new architectures and workloads. However, two major drawbacks of simulation models cannot be avoided. One is that creating and validating the simulators themselves is time consuming. Even worse, the semiconductor development trend tends to increase the complexity of cycle-accurate simulators, because a growing number of resources are integrated on ever larger chips. The other drawback is that running benchmarks in cycle-accurate simulators is extremely slow. It has been reported that simulators of out-of-order processors run programs thousands of times slower than the actual hardware [75]. With increasingly complicated architectures and workloads in the future, simulation time will keep growing [89].

An alternative performance modeling method is to deploy high-level analytical models [44]. An analytical model expresses the relationships among architecture parameters and application characteristics with mathematical formulae. While it may require collecting certain parameters in advance, this can usually be done via sampling- or profiling-based simulations, which take much less time to run than cycle-accurate simulations. In addition, the process of collecting parameters is a one-time cost that can be amortized over many cases. Once the model is built, it is very fast to tune the parameters and obtain the predicted performance results. Therefore, analytical performance modeling can significantly improve the efficiency of the architecture design process, because less simulation time is required to obtain the same observations and design trade-offs. Another inherent benefit of analytical models is that they can reveal non-obvious trends and insights that are difficult to obtain by using cycle-accurate simulations, since the relationships of all variables are expressed in the mathematical formulae. Therefore, architecture design and evaluation can be improved by analytical modeling in terms of quality, as deeper insights can be obtained.
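As a flavor of what such a formula looks like, consider a textbook-style first-order model (an illustrative sketch for this introduction, not one of the models developed in later chapters):

    CPI ≈ CPI_core + m_L2 × T_mem

where CPI_core is the cycles per instruction assuming all memory accesses hit in the cache hierarchy, m_L2 is the number of last level cache misses per instruction, and T_mem is the average miss penalty in processor cycles. Once m_L2 has been profiled for an application, the effect of a larger cache (lower m_L2) or a more contended memory system (higher T_mem) can be estimated by re-evaluating the expression rather than re-running a cycle-accurate simulation. The models developed in this dissertation follow the same spirit, but explicitly capture context switching, bandwidth sharing among cores, and prefetching.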

Among all performance aspects in modern processor design, memory hierarchy performance plays a significant role in overall performance, because it bridges the increasing gap between CPU speed and DRAM main memory speed. More importantly, the expensive resources in the memory hierarchy, such as the last level cache and the off-chip memory bandwidth, are temporally and/or spatially shared among processes/threads in a multiprogramming or multitasking environment. Such resource sharing can cause severe resource contention, manifesting as single-process performance volatility as well as overall system performance degradation. In this dissertation, we deploy analytical performance modeling to investigate the performance issues due to resource sharing in the memory hierarchy.

Temporal sharing of the memory hierarchy occurs due to context switching in modern computer systems, which allows multiple threads of execution to time-share a limited number of processors. While very useful, context switching can introduce high performance overheads, primarily because it incurs cache perturbation. Between the time a thread is switched out and when it resumes execution, parts of its working set in the cache may be perturbed by other interfering threads, leading to (context switch) cache misses to recover from the perturbation. The first goal of this dissertation is to understand how cache parameters and application behavior influence the number of context switch misses the application suffers from.

Spatial sharing of memory hierarchy resources is common in recent mainstream computing platforms, Chip Multi-Processor (CMP) architectures. Recent CMPs allow multiple threads to execute simultaneously, with each thread running on a single core, but multiple cores share the last level cache and off-chip pin bandwidth. It has been observed that such spatial sharing may lead to single-thread performance volatility as well as overall system performance degradation. To alleviate the performance issue, last level cache and off-chip bandwidth partitioning schemes have been proposed in prior studies. While the effects of cache partitioning on system performance are well understood, how bandwidth partitioning affects system performance, and how it interacts with cache partitioning, are not clear. The second goal of this dissertation is to analyze those unclear factors.

Modern high performance processors widely employ hardware prefetching techniques to hide long memory access latency. While very useful, hardware prefetching tends to aggravate the bandwidth wall due to its high bandwidth consumption, where system performance is increasingly limited by the availability of off-chip pin bandwidth in CMPs. Prior studies have proposed

to improve the prefetching policy or to partition bandwidth among cores to improve bandwidth usage. However, they either study the techniques in isolation, leaving the significant interaction unexplored, or perform them in an ad-hoc manner, missing important insights. The third goal of this dissertation is to investigate how hardware prefetching and memory bandwidth partitioning impact CMP system performance and how they interact.

1.1 Behavior and Implications of Context Switch Misses

One of the essential features in modern computer systems is context switching. It allows a system to provide the illusion of many threads of execution running simultaneously on a limited number of processors by time-sharing the processors. While very useful, context switching introduces high overheads both directly and indirectly [4,16,22,52,63,87,88]. Direct context switch overheads include saving and restoring processor registers, flushing the processor pipeline, and executing the Operating System (OS) scheduler. Indirect context switch overheads include the perturbation of the cache and TLB states. When a process/thread is switched out, another process/thread runs and brings its own working set into the cache. When the switched-out process/thread resumes execution, it has to refill the cache and TLB to restore the state lost due to the context switch. Prior research has shown that indirect context switch overheads, mainly the cache perturbation effect, are significantly larger than direct overheads [16,22,52,87]. For example, experimental results in [52] show that the indirect context switch cost can be up to 262 times larger than the average direct context switch cost on an IBM eServer with dual 2.0GHz Intel Pentium Xeon CPUs. We refer to the extra cache misses that occur as a result of the cache perturbation caused by context switching as context switch cache misses.

The number of context switch misses suffered by a thread is determined by the frequency of context switches as well as the number of context switch misses that occur after each context switch. The factors that affect the frequency of context switches, and how they affect it, are relatively easy to pinpoint and straightforward to understand. For example, the frequency of context switches is proportional to the level of processor sharing, i.e., the number of threads/processes that time-share a single processor. Factors that tend to increase the degree of processor sharing include thread-level parallelism and virtualization. Factors that tend to decrease the degree of processor sharing include multi-core design, which unfortunately increases the degree of memory hierarchy resource sharing.

It is not difficult to understand that the frequency of context switches occurring in the system is inversely proportional to the time quantum size allocated by the OS. However, the factors that affect the number of context switch misses after a single context switch are not well understood. For example, it seems reasonable to expect that the part of the working set of a thread displaced from the cache due to a context switch will need to be fully reloaded back into the cache by the

thread through cache misses, which we refer to as required reload misses. One may expect the number of context switch misses a thread suffers to be closely related to the number of required reload misses. For ease of discussion, we refer to this expectation as the required reload misses hypothesis. Unfortunately, the hypothesis is grossly inadequate at explaining the actual behavior of context switch misses. Figure 1.1 shows the average number of context switch misses that occur on a single context switch for eleven SPEC2006 benchmarks [82]. Each benchmark time-shares a single processor with libquantum using a 5ms time quantum. The context switch misses are collected on a full-system simulation that runs an unmodified Linux operating system (OS), with the processor having a 1MB L2 cache (refer to Section 2.1 for details on other processor and memory hierarchy parameters). The x-axis shows the applications, sorted by the L2 cache miss rate (shown in parentheses) of each application when it runs alone on a dedicated processor.

Figure 1.1: The average number of context switch misses for a single context switch suffered by SPEC2006 applications running on a processor with a 1MB L2 cache when they are co-scheduled with libquantum.

The required reload misses hypothesis cannot explain the variation of context switch misses across benchmarks, since the number of bytes in the working set displaced by libquantum on a single context switch is roughly the same across all benchmarks. The hypothesis ignores the fact that not all displaced bytes of the working set of a thread are going to be accessed again by the thread when it resumes execution. Hence, the temporal reuse pattern of an application

affects its context switch misses, but the exact relationship between them is still unclear. For example, ignoring cold misses, the cache miss rate of an application represents how likely a block that was used in the past and has been replaced from the cache is to be reused. However, the figure shows that there is no apparent correlation between an application's miss rate and how many context switch misses it suffers. For example, despite having similar miss rates, hmmer suffers from more than three times the context switch misses that sjeng suffers from.

Another reason why the required reload misses hypothesis is inadequate is that it assumes that context switch misses only arise due to the displaced working set. However, cache perturbation actually affects context switch misses through another channel. Specifically, it causes the non-displaced part of the working set to become less "fresh" in the cache, as it is shifted in its recency (or stack) order closer to the least recently used (LRU) position in the cache. This recency reordering gives the reordered cache blocks an elevated chance of being replaced by the thread itself when it resumes execution, before the thread has a chance to reuse them. Because these blocks are replaced by the thread itself (rather than by interfering threads from other applications), these misses have not been correctly reported as context switch misses in prior studies [1,4,48,86], causing the number of context switch misses to be systematically under-estimated. To illustrate the magnitude of the under-estimation, Figure 1.2 shows the fractions of the two types of context switch misses for twelve SPEC CPU2006 applications. Replaced misses are context switch misses due to the part of the working set that is displaced by the interfering thread, while reordered misses are ones due to the part of the working set that is merely reordered in the cache. The evaluation setup is the same as in Figure 1.1, except that the result shown for each application is the average context switch misses when the application is co-scheduled with eleven other applications. The figure shows that reordered misses account for between 10% and 28% of the total context switch misses for a 1-MB cache, and slightly more for a 2-MB cache. In other words, not counting reordered misses as context switch misses may under-estimate the number of context switch misses rather significantly.

Clearly, a simple hypothesis such as the required reload misses hypothesis is inadequate for explaining what factors affect the number of context switch misses a thread suffers from and how exactly they affect it. Hence, the goal of this study is to understand how cache parameters, a thread's temporal reuse patterns, and the amount of cache perturbation influence the number of context switch misses the thread experiences. We hope that a better understanding of the nature of context switch misses can help researchers come up with techniques to reduce them. Our first contribution is characterizing context switch misses, especially focusing on the reordered misses: their magnitude, how they affect different applications, and how they are affected by cache sizes and the amount of cache perturbation. The main findings are: (1) context switch misses can contribute to a significant increase in the total L2 cache misses; (2) they tend to increase along with the cache size up until the cache size is large enough to hold the

entire combined working sets; (3) reordered misses account for an increasing fraction of context switch misses as the cache size increases; and (4) the maximum number of reordered misses occurs when cache perturbation displaces roughly half of the total cache blocks.

Figure 1.2: Two types of context switch misses suffered by SPEC2006 applications running on a processor with a 1-MB or 2-MB L2 cache.

Our second contribution is an analytical model-based investigation to establish how the temporal reuse pattern of a thread, the amount of cache perturbation, and the number of context switches are mathematically related. Compared to empirical studies, the mathematical relationship allows us to gain clearer insights into the behavior of context switch misses. We validate our Markov-based model against the simulation results of SPEC2006 applications. We find that the model is sufficiently accurate in predicting the trends of context switch misses. The model allows us to derive insights into precisely why some applications are more vulnerable to context switch misses than others. Then, as one case study, we apply the model to analyze how prefetching and context switch misses interact with each other in various applications. The investigation leads us to a previously-unreported observation that prefetching can aggravate the number of context switch misses for an application. In addition, perversely, the more effective prefetching is for an application, the higher the number of context switch misses the application suffers from. We also investigate how cache sizes affect the number of context switch misses. Our study shows that under relatively heavy workloads in the system, the worst-case number of context switch misses for an application tends to increase with cache size, to an extent that may completely negate the reduction in other types of cache misses.
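To make the recency-reordering mechanism behind reordered misses concrete, the following minimal Python sketch replays the effect on a single 4-way LRU set. It is purely illustrative: the block names (A-D for the thread, X and Y for the interferer) and the 4-way associativity are made up for the example, and this is not the full-system simulation infrastructure used for the measurements above.

```python
from collections import deque

class LRUSet:
    """One set of a set-associative cache with true LRU replacement."""
    def __init__(self, assoc):
        self.assoc = assoc
        self.blocks = deque()              # index 0 = MRU, index -1 = LRU

    def access(self, tag):
        """Access a block; return True on a hit, False on a miss."""
        if tag in self.blocks:
            self.blocks.remove(tag)        # hit: promote to MRU
            self.blocks.appendleft(tag)
            return True
        if len(self.blocks) == self.assoc:
            self.blocks.pop()              # miss: evict the LRU block
        self.blocks.appendleft(tag)
        return False

s = LRUSet(assoc=4)

# Thread T warms up the set; recency order becomes D, C, B, A (MRU -> LRU).
for blk in "ABCD":
    s.access(blk)

# Context switch: the interfering thread displaces T's two LRU blocks (A, B)
# and shifts T's surviving blocks (D, C) toward the LRU position.
for blk in ("X", "Y"):
    s.access(blk)
print(list(s.blocks))                      # ['Y', 'X', 'D', 'C']

# T resumes and re-touches its working set.  A and B miss because they were
# displaced (replaced misses); refilling them evicts C and D, which the
# interferer never touched, so the later accesses to C and D also miss.
for blk in "ABCD":
    print(blk, "hit" if s.access(blk) else "miss")
```

In this toy example, the accesses to C and D miss only because the refills of A and B pushed them out after the recency reordering; without the context switch, all four accesses would have been hits. Those two self-inflicted misses are precisely the reordered misses quantified in Figure 1.2.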

1.2 Effects of Bandwidth Partitioning on CMP Performance

Chip Multi-Processors (CMPs) have recently become a mainstream computing platform. Recent CMP designs allow multiple processor cores to share expensive resources, such as the last level cache and off-chip pin bandwidth. Such sharing has produced an interesting contention-related performance phenomenon, in which the system throughput, as well as individual thread performance, is highly volatile, depending on the mix of applications that share the resources and on how the resources are partitioned among the applications. To improve system performance and/or reduce the performance volatility of individual threads, researchers have proposed to either partition the last level cache [13,14,26,30,36,41,54,68,70,71,77,83,84] or partition off-chip bandwidth usage [33,50,64,65,66,72] between cores, with partition sizes chosen to optimize a particular goal, such as maximizing throughput or fairness. The growing number of cores on a chip increases the degree of sharing of the last level cache and off-chip bandwidth, making it more important to reach an optimum partition for each resource.

The effects of bandwidth partitioning on system performance are not as well understood as the effects of cache partitioning. For example, it is well known that cache partitioning can reduce the total number of cache misses by reallocating cache capacity from threads that do not need much cache capacity to those that do, and the reduction in the total number of cache misses improves overall system performance. However, allocating fractions of off-chip bandwidth to different cores does not reduce the total number of cache misses. Instead, it only prioritizes the off-chip memory requests of one thread over others. Hence, while bandwidth partitioning can clearly affect an individual thread's performance, how off-chip bandwidth partitioning affects overall system performance is not yet well understood.

In addition, despite the wealth of studies in last level cache and off-chip bandwidth partitioning, the interactions between cache partitioning and bandwidth partitioning are not well understood. Most studies in cache partitioning ignore bandwidth partitioning [13,14,26,30,36,54,68,70,71,77,83,84], while most studies in bandwidth partitioning assume private caches or a statically-partitioned shared cache [50,64,65,66,72]. Only recently have coordinated cache partitioning and bandwidth partitioning been explored together, using an artificial neural network (ANN) optimizer, where the combination was shown to outperform the cases in which cache or bandwidth partitioning is applied individually [8]. However, since an ANN is a black-box optimizer, the nature of the interaction between cache and bandwidth partitioning remains unexplored.

One theory that has been mentioned in the literature is that bandwidth partitioning's impact on performance is secondary to that of cache partitioning [12,84]. The theory certainly makes sense, because cache partitioning can significantly reduce the total number of cache misses of applications that share the cache, resulting in a reduction in the total off-chip memory traffic; in contrast, bandwidth partitioning cannot reduce total off-chip memory traffic, since it
