
Effective performance analysis and optimizations for memory intensive programs on multicore

Dissertation
Author: Lixia Liu

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
1 INTRODUCTION
  1.1 Thesis
  1.2 Background on Multicore Processors
    1.2.1 Memory Hierarchy and Memory Bandwidth
    1.2.2 Data Prefetching
    1.2.3 Memory Bandwidth versus Memory Latency
  1.3 Contributions
  1.4 Organization
2 PERFORMANCE ANALYSIS METHODOLOGY
  2.1 A Three-step Methodology
    2.1.1 Step 1—Zero-latency Instruction Issue Rate
    2.1.2 Step 2—Memory Access Intensity
    2.1.3 Step 3—Performance Prediction Model
  2.2 Measurement Tools
    2.2.1 Hardware-based Performance Tools
    2.2.2 A Simulation Tool
  2.3 Illustrating Examples
    2.3.1 Non-blocking Matrix Multiplication
    2.3.2 Blocking Matrix Multiplication
3 COMPRESSION FOR BANDWIDTH REQUIREMENT REDUCTION
  3.1 Background
    3.1.1 Data Compression
    3.1.2 Challenges on Memory Bandwidth Requirement Reduction
  3.2 Encoding Methods
    3.2.1 Double Array Compression
    3.2.2 Delta Double Array Compression and the Special Version
  3.3 A Compiler-automated Adaptive Scheme
    3.3.1 Program Analysis and Transformation
    3.3.2 Benefit Model
    3.3.3 Compiler Implementation

4 ASYNCHRONOUS MODEL FOR LOCALITY AND PARALLELISM
  4.1 Challenges of Loop Tiling
  4.2 Asynchronous Model
  4.3 Enabling Loop Tiling with Asynchrony
  4.4 Chunk Size Selection Algorithm
    4.4.1 Impact of Chunk Size
    4.4.2 Adaptive Chunk Size Selection Scheme
5 EVALUATION OF ANALYSIS METHODOLOGY
  5.1 Evaluation on Numerical Solvers
    5.1.1 ScaLAPACK
    5.1.2 Spike
    5.1.3 A New Spike Algorithm
  5.2 Performance Prediction
    5.2.1 Verification with Synthetic Programs
    5.2.2 Numerical Solvers
6 EXPERIMENTAL RESULTS OF OPTIMIZATIONS
  6.1 Experimental Setup
  6.2 Evaluation of Compression
    6.2.1 Benchmarks
    6.2.2 Compression Impacts
    6.2.3 Performance Results
    6.2.4 Comparison to Manual Compression Methods
  6.3 Asynchronous Loop Tiling
    6.3.1 Tile Size Selection
    6.3.2 Performance on Numerical Kernels
    6.3.3 Evaluation on Adaptive Chunk Size Selection
7 RELATED WORKS
  7.1 Performance Prediction Model
  7.2 Data Compression
  7.3 Asynchronous Loop Tiling
8 CONCLUSIONS AND FUTURE WORK
LIST OF REFERENCES
VITA

LIST OF TABLES

2.1 Performance counters used for evaluation
2.2 Micro-benchmarks used in evaluation
2.3 Evaluation results of the simulation tool
2.4 Code examples for matrix multiplication
2.5 Memory access intensity for matrix multiplication
2.6 Performance prediction for matrix multiplication
3.1 Example of compressing integer array to unsigned short array
3.2 Example of transformation with DDAC
3.3 Branch misprediction ratio of DAC method
3.4 Code snippet of SpMV from CG
3.5 Open64 implementation details
4.1 Tiling with speculative execution
5.1 Zero-latency instruction issue rate on each step
5.2 ScaLAPACK: Memory access intensity for narrow banded system
5.3 ScaLAPACK: Memory access intensity for medium banded system
5.4 ScaLAPACK: Memory access intensity for wide banded system
5.5 Spike TA0: Memory access intensity for narrow banded system
5.6 Spike TA0: Memory access intensity for medium banded system
5.7 Spike TA0: Memory access intensity for wide banded system
5.8 Spike NEW: Memory access intensity for narrow banded system
5.9 Spike NEW: Memory access intensity for medium banded system
5.10 Spike NEW: Memory access intensity for wide banded system
5.11 Predicted speedup of numerical solvers on Intel Q6600
6.1 Memory configurations of test systems

6.2 Test inputs used in experiments
6.3 Benchmarks used in experiments
6.4 Memory usage increase
6.5 Memory bandwidth requirement with and without compression
6.6 Performance summary of Jacobi
6.7 Off-chip memory accesses and cache misses of Jacobi
6.8 Performance summary of GS
6.9 Off-chip memory accesses and cache misses of GS
6.10 Performance summary of SOR
6.11 Off-chip memory accesses and cache misses of SOR
6.12 Performance summary of adaptive chunk size on three machines

LIST OF FIGURES

1.1 Memory hierarchy on multicore chips
2.1 Performance improvements with different block sizes
2.2 Execution time and speedup of non-blocking and blocking versions of matrix multiplication
3.1 Example of array compression
3.2 Performance comparison among the encoding methods
3.3 Framework of the automated compression scheme
3.4 Compiler generated code for runtime selection among multiple versions
3.5 Impact of the input order on the performance of DAC
4.1 Tiling with speculative execution without recovery
4.2 Jacobi before and after tiling under synchronous model
4.3 Gauss-Seidel kernel
4.4 Gauss-Seidel data flow
4.5 Jacobi after tiling under asynchronous model
4.6 Fitting of the time model for the original Jacobi and the kernel of a tiled version
4.7 The optimum chunk size for tiling with recovery and its approximation
5.1 ScaLAPACK: Performance on Intel Q6600
5.2 Example of partitioning in Spike algorithm
5.3 Spike TA0: Performance on Intel Q6600
5.4 Performance speedup over ScaLAPACK
5.5 Spike NEW: Performance on Intel Q6600
5.6 Code example of a synthetic program
5.7 Prediction model evaluation on synthetic programs

5.8 ScaLAPACK: Performance prediction on Intel E5530
5.9 Spike TA0: Performance prediction on Intel E5530
6.1 Instruction overhead of compression
6.2 Performance of compression methods on Intel Q6600
6.3 Performance of compression methods on AMD 8350
6.4 Size of off-chip memory references
6.5 Slowdown of DDAC without using the benefit model
6.6 Speedup of DAC and SDDAC over CSR-DU
6.7 Performance impact of the tile size for Jacobi under the synchronous model
6.8 Correlation between chunk size and tile size
6.9 Best tile sizes on three machines
6.10 Performance evaluation of Jacobi
6.11 Overshooting overhead—more iterations
6.12 Evaluation of red-black Gauss-Seidel
6.13 Performance evaluation of Gauss-Seidel
6.14 Evaluation of red-black SOR
6.15 Performance evaluation of SOR
6.16 Performance of adaptive chunk size on AMD 8350
6.17 Performance of adaptive chunk size on Intel Q6600
6.18 Performance of adaptive chunk size on Intel E5530

ABSTRACT

Liu, Lixia. Ph.D., Purdue University, December 2010. Effective Performance Analysis and Optimizations for Memory Intensive Programs on Multicore. Major Professor: Zhiyuan Li.

Memory bandwidth has become the performance bottleneck for memory intensive programs on modern processors due to the increasing gap between on-chip CPU speed and off-chip memory bandwidth. The emergence of multicore chips has made the gap worse. Understanding the memory bandwidth requirement of a program and its performance impact is critical for performance tuning, performance prediction, and optimization strategies. This dissertation proposes a new methodology for analyzing the memory bandwidth requirement and the performance of memory intensive programs on multicore machines. The methodology is based on two notions. The notion of zero-latency instruction issue rate quantifies the ideal execution speed of a program, or different parts of a program, under no memory bandwidth constraint. The notion of memory access intensity quantifies the memory bandwidth requirement of a program or different parts of a program. The memory access intensity can then be used to identify the memory bottlenecks in a program. Under these two basic notions, an analytical prediction model is developed to estimate the program performance. Experiments show that our new methodology accurately identifies the memory bottlenecks and predicts the performance trend for a suite of test programs.

The memory bottlenecks, once identified, may be corrected through appropriate program transformations. Unfortunately, certain transformations may potentially introduce overheads such as additional instructions and reduced parallelism. In this dissertation, we study such complex interactions between instruction count, parallelism, and memory behavior in the context of two program transformations.

First, we discuss how to balance the trade-off between instruction count and memory access intensity for effective compression of integer and pointer arrays. Three encoding methods are developed for our compression scheme. We derive formulae to analyze the cost and benefit of each encoding method. We implement our compression scheme in a compiler to automatically transform the program into different versions corresponding to different encoding methods. Results show that our compiler-automated adaptive scheme improves the execution speed over the original version by an average of 9% for a set of benchmark programs. These results take compression overheads into account in order to make adaptive decisions.

Next, we discuss the use of carefully controlled asynchronous algorithms to attain data locality and parallelism simultaneously for a class of numerical programs. We show that carefully controlled asynchrony and loop tiling can significantly improve the performance of parallel iterative algorithms on multicore processors due to improved data locality and loop-level parallelism. Evidence from experiments with three well-known numerical kernels is presented.

1 INTRODUCTION

1.1 Thesis

The gap between CPU speed and memory speed has continued to increase since the 1980s [1]. To bridge this widening gap, sophisticated memory hierarchies have been employed in modern computer architectures. Today, such a memory hierarchy typically contains several levels of on-chip caches in addition to an off-chip main memory. The data transfer rate between the CPU and the main memory is called the memory bandwidth. Off-chip memory accesses include both read and write accesses to the main memory. The effective memory bandwidth of a program is measured by the number of off-chip memory accesses performed per unit time during the program execution. The effective memory bandwidth is bounded by the slowest physical bandwidth at the various levels between the CPU and the main memory. This can be the pin bandwidth, the front side bus bandwidth, or the physical bandwidth between the memory controller and the main memory. It is therefore important to understand how fast the data transfer rate needs to be in order not to slow down the program execution. Such a data transfer rate for a program is known as the memory bandwidth requirement of the program.

The design of mainstream microprocessors follows a trend of placing multiple processing cores on a single chip, giving rise to multicore processors, also known as multicore chips or simply multicore. This trend is driven by the need to balance performance against power consumption. The number of cores on a single chip is projected to increase as processor circuits continue to shrink in size [2]. While such abundant processing power offers tremendous potential for speeding up large scale computation, memory bandwidth has become a severe limitation on the performance of many realistic applications [1,3–11]. This is due to the following facts. First, historical data from the 1990s shows that the average annual increase in CPU speed was 55%, whereas the average annual increase in memory bandwidth was merely 28% [12,13]. Second, the emergence of multicore chips aggravates the gap because the memory bandwidth is shared by all the cores on a single chip. Statistically, each increase in the number of cores per chip comes at the cost of the memory bandwidth available to each core [14]. Third, to mitigate the limitation of memory bandwidth, all modern machines aim to serve most, if not all, data accesses from on-chip caches rather than straining the memory bandwidth. However, caches have been ineffective for applications with large data sets and irregular memory access patterns.

Understanding the memory bandwidth requirement and its performance impact is critical to performance tuning, performance prediction, and optimization strategies on multicore chips. The reasons are as follows. First, it helps programmers determine whether memory bandwidth is the performance bottleneck in a given program, which in turn directs the optimization effort. For example, data compression [15] can be applied to a program where memory bandwidth is the bottleneck. However, if memory bandwidth is not the bottleneck, this technique will have a negative impact on performance. Second, existing optimizations, such as prefetching and thread parallelization, can yield better performance if they are aware of the bottleneck. Current prefetching optimizations have reached a level of sophistication at which memory latencies can be well hidden when a single core accesses the memory in predictable strides. Unfortunately, the effectiveness of prefetching diminishes rapidly when multiple cores access the memory simultaneously, straining the shared memory bandwidth. In fact, aggressive data prefetching can increase the memory bandwidth requirement because it causes additional, yet unnecessary, memory transfers due to the inherent inaccuracy of prefetching and inter-core interference [16–18]. In the same light, the conventional wisdom that memory latency can be hidden by creating numerous concurrent threads no longer holds true, as the increased number of threads aggravates the bandwidth problem. Third, memory bandwidth requirement analysis provides crucial insight into application performance on future processors. Armed with a model of the memory bandwidth requirement and the hardware architecture, the behavior of a program executing on arbitrary processors can be simulated, even if those processors do not exist yet. This provides an early indication of the scalability and performance of the program, which can be used to guide the design of future hardware and software. Fourth, the understanding of all of the above can lead to a new class of optimizations. These optimizations may increase the number of instructions in a program, which requires more computation resources, but significantly reduce the number of off-chip memory accesses. While this type of optimization may degrade performance on uniprocessors, it may significantly improve performance on multicore chips, because the architectures of the two kinds of systems, and their limitations, are very different. Furthermore, such optimizations can be enabled dynamically based on the memory bandwidth requirement, avoiding the optimization overhead when memory bandwidth is not the performance bottleneck.

The basic principle in resolving the memory bandwidth issue in a memory intensive program is to reduce the memory bandwidth requirement of the program [12,15,19–22]. Unfortunately, such optimizations also introduce various kinds of overheads, such as additional operations and branch mispredictions, which require additional computation resources. These overheads can offset the benefits obtained from the optimization and may lead to performance degradation. Consequently, balancing the trade-off between the potential benefits and the overheads is important for an optimization to be effective, e.g., the optimization should be disabled when it carries a significant penalty. Among these optimizations, some methods specifically aim at enhancing the data locality of the program by reordering its data accesses [12,19,20,23]. These data locality enhancement methods can significantly reduce the memory bandwidth requirement and may even remove the memory bottleneck in the program. However, the reordering of data accesses can cause a loss of parallelism because it also changes the pattern of data dependencies. For example, the skewed loop tiling technique limits the parallelism in the program due to the data dependencies between tiles [20,24]. Thus, the challenge for such methods is how to improve data locality and parallelism simultaneously.

The thesis of this dissertation can be stated as follows: As memory bandwidth has become the performance bottleneck for memory intensive programs on multicore, an effective methodology that quantifies the memory bandwidth requirement of a program and its performance impact is critical for performance tuning, performance prediction, and optimization strategies. Moreover, (i) the trade-off between optimization benefits and optimization overheads, and (ii) improving both data locality and parallelism, are very important considerations in designing an effective optimization technique for memory intensive programs.

The rest of this chapter presents the background, the contributions, and the organization of this dissertation.

1.2 Background on Multicore Processors

This section describes the essential background on multicore. First, the memory hierarchy and the memory bandwidth are illustrated using two different multicore chips. Second, the data prefetching techniques employed on multicore chips are presented. Third, memory bandwidth is contrasted with memory latency.

1.2.1 Memory Hierarchy and Memory Bandwidth

Figure 1.1 shows the memory hierarchies of two multicore chips from major processor vendors, the Intel Quad-core processor Q6600 [25] and the AMD Quad-core processor 8350 [26]. The main memory in each memory hierarchy is not shown in the figure because it is located outside the chip. The memory controller of the Intel Quad-core processor Q6600 is likewise located outside the chip and is therefore also not shown. As the figure shows, the Intel Quad-core processor Q6600 consists of four 2.4 GHz cores, and each core has a private 32 KB L1 cache.

Figure 1.1. Memory hierarchy on multicore chips: (a) Intel Quad-core processor Q6600; (b) AMD Quad-core processor 8350.

Each pair of cores shares a 4 MB L2 cache. The L2 caches are connected to a 1066 MHz, 64-bit wide front side bus (FSB). As a result, the FSB is shared by all four cores. The memory bandwidth of the Intel Quad-core processor Q6600 is constrained by both the FSB bandwidth and the bandwidth between the memory controller and the main memory. The FSB bandwidth is 8.5 GB per second, whereas the other bandwidth is 12.8 GB per second. Thus, the memory bandwidth of the multicore is determined to be 8.5 GB per second. Meanwhile, the AMD Quad-core processor 8350 has four 2 GHz cores that have private L1 and L2 caches but share a 2 MB L3 cache. One major difference from the Intel Quad-core processor Q6600 is that the memory bandwidth of this AMD multicore is determined mainly by the bandwidth between the memory controller and the main memory, which is 10.7 GB per second.

Each core in both multicore processors above performs out-of-order execution and completes up to four full instructions per cycle. The main objective of out-of-order execution is to hide memory and functional unit latencies by allowing more instructions to be issued in one cycle. However, this technique increases the memory bandwidth pressure because of the larger number of simultaneous memory accesses.

Unlike the Intel Quad-core processor Q6600, recent Intel multicore chips based on the Nehalem architecture have an integrated on-chip memory controller, and the memory bandwidths of those chips are determined by the same factors as on the AMD multicore [27,28]. In other words, the bandwidth constraint on the front side bus has been removed in the latest Intel multicore chips.
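The peak figures quoted above follow directly from transfer rate and bus width. The short C program below reproduces the arithmetic as a sanity check; reading the 1066 MHz FSB as 1066 mega-transfers per second on an 8-byte data bus, and the AMD figure as dual-channel DDR2-667, are our interpretations of the quoted numbers, not details stated in the text.

    /* Back-of-the-envelope check of the peak-bandwidth figures quoted above.
     * Peak bandwidth = transfer rate (transfers/s) x bus width (bytes/transfer).
     * The dual-channel DDR2-667 reading of the AMD figure is an assumption. */
    #include <stdio.h>

    int main(void) {
        /* Intel Q6600: 1066 MT/s front side bus, 64 bits (8 bytes) wide. */
        double fsb_gbs = 1066e6 * 8 / 1e9;   /* ~8.5 GB/s  */
        /* AMD 8350 (assumed): dual-channel DDR2-667, 2 x 64 bits wide. */
        double mem_gbs = 667e6 * 16 / 1e9;   /* ~10.7 GB/s */

        printf("Q6600 FSB peak:       %.1f GB/s\n", fsb_gbs);
        printf("  per-core share (4): %.1f GB/s\n", fsb_gbs / 4);
        printf("AMD 8350 memory peak: %.1f GB/s\n", mem_gbs);
        return 0;
    }

Note how the per-core share of the shared FSB, roughly 2.1 GB/s with all four cores active, illustrates the point made above that each additional core comes at the cost of the bandwidth available to each core.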

In addition, the memory hierarchy of those Nehalem-based Intel multicore chips is very similar to that of the above AMD multicore. For example, the Intel Quad-Core Xeon E5530 processor has three levels of caches, including private L1 and L2 caches and a relatively large shared 8 MB L3 cache [27]. A larger shared cache may mitigate the memory bandwidth problem by reducing capacity misses and interference among cores. However, the bandwidth problem cannot be avoided, especially for programs with large data sets.

1.2.2 Data Prefetching

Data prefetching is a technique widely used in both hardware and software to tolerate or to hide memory latency [16,17,29,30]. It identifies the data locations of likely cache misses and fetches the data from memory earlier during instruction execution, so that the memory latencies are overlapped with other useful computation. In the current generation of multicore chips, sophisticated hardware prefetching techniques are employed in each core. Therefore, memory latency is well hidden when the memory is accessed in predictable strides, provided that the memory bandwidth is not saturated. As an illustration, consider the Intel Quad-core processor Q6600, which has four hardware prefetchers in each core. Two of these prefetchers fetch data from the main memory into the corresponding shared L2 cache, and the other two fetch data from the L2 cache into the private L1 cache. Of the L2 prefetchers, one may fetch an adjacent 64-byte cache line from the main memory into the L2 cache, and one may bring multiple data streams with detectable strides into the L2 cache. In this architecture, sixteen forward streams and four backward streams can be fetched simultaneously, depending on the memory reference patterns. These two prefetchers are triggered when successive cache misses occur in the L2 cache. Of the two prefetchers that fetch data into the L1 cache, one is a streaming prefetcher that fetches multiple data streams with detectable strides, and the other is intended to find striding patterns associated with instruction pointers [31]. The accuracy of these hardware prefetchers is continually improved in newer generations of Intel multicore chips [28].

Nonetheless, data prefetching does not reduce the number of memory accesses of a program. On the contrary, it can incur additional off-chip memory accesses, because the prefetchers may fetch incorrect data or fetch data into a cache prematurely, so that the data is flushed out of the cache before it is needed. Additionally, aggressive prefetchers in each core of a multicore chip can cause severe interference in the shared caches due to escalating contention for cache capacity [17,18]. This interference leads to extra memory accesses and hence increases the memory bandwidth requirement of a program.
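Where access patterns are irregular, a compiler or programmer can complement the hardware prefetchers described above with explicit software prefetches. The sketch below is ours, not code from the dissertation: it uses the GCC/Clang __builtin_prefetch intrinsic on an indirect (gather) access that stride-based hardware prefetchers cannot predict, with a hypothetical prefetch distance of 16 elements.

    /* Minimal sketch of explicit software prefetching for an irregular (gather)
     * access pattern. The prefetch distance of 16 is a hypothetical tuning
     * parameter, not a value from the dissertation. */
    #include <stddef.h>

    double gather_sum(const double *a, const size_t *idx, size_t n) {
        const size_t dist = 16;    /* how far ahead to prefetch */
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + dist < n)
                /* hints: read access (0), moderate temporal locality (1) */
                __builtin_prefetch(&a[idx[i + dist]], 0, 1);
            sum += a[idx[i]];
        }
        return sum;
    }

In practice the distance must be tuned: too small and little latency is hidden, too large and the prefetched line may be evicted before it is used, exhibiting exactly the premature-fetch overhead discussed above.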

1.2.3 Memory Bandwidth versus Memory Latency

Although memory latency was for years considered the most prominent performance bottleneck on uniprocessors, memory bandwidth has become the more important performance bottleneck on multicore processors, for the following reasons.

• In the past decades, many techniques have been proposed in both hardware and software to reduce or to hide memory latency. However, many of these techniques do not reduce, and may even increase, the memory bandwidth requirement, in two ways. First, many of them, such as data prefetching, can increase the total traffic between the CPU and the main memory. Second, techniques that reduce memory latency in fact raise the rate at which a program consumes its operands, because they reduce the execution time of the program. Examples include the out-of-order execution and wide-issue features of multicore chips.

• The increasing number of cores on a chip significantly aggravates the memory bandwidth requirement but does not increase memory latency. In particular, the memory bandwidth requirement can increase proportionally to the number of cores when all the cores access memory at the same rate simultaneously.

1.3 Contributions

This dissertation proposes a new methodology for analyzing the performance of memory intensive programs, and two techniques to optimize such programs. The methodology obtains the memory bandwidth requirement and predicts the performance speedup of a program with a three-step approach. The first optimization technique is an adaptive compression scheme that reduces the memory bandwidth requirements of these programs while taking the compression benefit and overheads into account. The second technique, referred to as asynchronous loop tiling, applies an asynchronous model together with loop tiling to attain data locality and parallelism simultaneously.

The main contributions for the methodology of performance analysis are as follows.

• We introduce the notion of zero-latency instruction issue rate and demonstrate its use in estimating the execution speed of a program under no memory bandwidth constraint.

• We introduce the notion of memory access intensity and demonstrate its application to quantifying memory bottlenecks in programs executed on today's multicore microprocessors.

• We formulate an analytical model to predict the program speedup based on the zero-latency instruction issue rate and the memory access intensity.

• We present an analysis of three numerical solvers for large sparse linear systems using memory access intensity and show the critical factors that affect the performance of parallel code on a multicore machine.

• We demonstrate the accuracy of our prediction model with synthetic programs and two numerical solvers on a multicore machine. We also show the effectiveness of our model on future multicore chips by projecting the speedups of two numerical solvers on a new multicore chip.

The main contributions for the adaptive compression scheme are as follows.

• We develop three encoding methods and show, by experiments, that these methods compare favorably with previous methods that are manually applied. We also show that the transformations for these methods are simple to automate.

• We formulate the benefit and cost of each encoding method based on parameters available at runtime. The decisions on whether to compress, and which encoding method to use, are then made at runtime.

• We implement an automated compression optimization in a compiler to exploit the redundancy of integer and pointer arrays adaptively. The compiler can automatically identify the compression candidates and transform the original program to use the different encoding methods.

• We apply our scheme to a large number of test cases and demonstrate that our compiler-automated adaptive scheme avoids the performance penalties suffered by previous methods in the cases where the compression overheads dominate. On the other hand, in the cases where the compression benefit dominates, the automated scheme delivers performance that is close to, or better than, the results previously obtained by applying the compression manually.

The major contributions for the asynchronous loop tiling technique are as follows.

• We present an asynchronous model that enables effective loop tiling such that both parallelism and locality can be attained simultaneously. We also demonstrate the effectiveness of using the asynchronous model to relax data and control dependencies.

• We show that the way the iterations of a tiled program are partitioned into chunks significantly impacts the performance of the program. In addition, we show that the optimum chunk size can be derived when the total iteration count is known.

• We present an adaptive algorithm to dynamically determine the chunk size when the total iteration count is unknown in advance.

• We evaluate three well-known numerical kernels on three different multicore systems and collect extensive statistics to examine the various factors that may have an impact on performance. We show that asynchronous loop tiling has a clear performance advantage over other tiling methods under the synchronous model.

1.4 Organization

This dissertation is organized as follows. In Chapter 2, a new methodology for analyzing memory intensive programs is presented. A compression technique to reduce the memory bandwidth requirement is then introduced in Chapter 3. The asynchronous model together with loop tiling for attaining data locality and parallelism is described in Chapter 4. Next, the evaluation of the methodology is presented in Chapter 5, and the experimental results of the two optimizations are presented in Chapter 6. Chapter 7 discusses related work. Last, Chapter 8 concludes and discusses future work.

2 PERFORMANCE ANALYSIS METHODOLOGY

This chapter presents a new three-step methodology for quantitatively analyzing the performance impact of memory bandwidth on multicore. First, in Section 2.1, the three-step methodology is presented. Second, the tools used for measurements are described in Section 2.2. Last, illustrating examples are presented in Section 2.3.

2.1 A Three-step Methodology

The execution time of a memory intensive program is affected by various factors such as instruction count, data dependencies, functional units, memory latency, and memory bandwidth. As we are focusing on the memory bandwidth problem, analyzing all these factors together would not provide valuable insight into the impact of memory bandwidth. Instead, we propose a three-step methodology that isolates the memory bandwidth factor from the others in our analysis. The methodology is applied to each user-defined program segment. This allows users to choose an appropriate granularity for the analysis, such as the function level or the loop level. As an example, a simple strategy for determining the granularity is to find the largest scope of code that exhibits a similar memory bandwidth requirement. The following shows the goal of each step in the proposed methodology; the details of each step are described in the subsections below.

1. The first step is to estimate the execution speed of a program segment under no memory bandwidth constraint. The notion of zero-latency instruction issue rate is introduced for this estimation, defined as the average number of instructions issued per cycle, assuming all instruction operands are always available on-chip. This estimation rests on the assumption that, when memory bandwidth is sufficient, sophisticated data prefetching techniques can fetch data into caches before the data is used, so that memory latency does not significantly affect the execution speed of the program.

2. The second step is to quantify the memory bandwidth requirement of a program segment. The notion of memory access intensity is introduced for this quantification, defined as the average size of the off-chip memory accesses issued per cycle, assuming there is no memory bandwidth constraint. The memory access intensity is then compared with the available memory bandwidth on a target machine to decide whether memory bandwidth is the performance bottleneck of the program segment.

3. The third step is to predict the performance and scalability of a program segment with a given number of cores. An analytical performance prediction model based on the two notions above is proposed.

2.1.1 Step 1—Zero-latency Instruction Issue Rate

The zero-latency instruction issue rate is the average number of instructions issued per cycle (IPC), assuming all operands are always available on-chip. It is computed to estimate the execution speed of a program segment under no memory bandwidth constraint, taking both instruction-level and thread-level parallelism into consideration. In a typical program execution, the zero-latency instruction issue rate is substantially lower than the nominal instruction issue rate of a multicore chip because many factors, such as instruction dependencies, resource conflicts, and branch misprediction penalties, stall the instruction pipelines and hence reduce the issue rate. For example, each core in the Intel Quad-core processor Q6600 [25] can complete up to four full instructions per cycle. Consequently, this multicore has a nominal instruction issue rate of 4n, where n denotes the number of cores executing the given program segment. However, our experiments in Section 5.1 show that the issue rates of most parallelized applications reach less than 2n, because the actual issue rate on a single core is less than two.
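As a concrete illustration of Steps 1 and 2 (a sketch of ours, not code from the dissertation), the fragment below computes the two notions from raw event counts of the kind a hardware performance counter or simulator would report; all counter values are invented for the example.

    /* Sketch of the quantities defined in Steps 1 and 2, computed from raw
     * event counts. All measurement values are hypothetical. */
    #include <stdio.h>

    int main(void) {
        /* Assumed measurements for one program segment on one core, with the
         * cycle count taken under conditions approximating zero latency. */
        double instructions  = 4.0e9;  /* instructions issued                  */
        double cycles        = 2.4e9;  /* cycles, operands assumed on-chip     */
        double offchip_lines = 3.0e7;  /* off-chip accesses, in cache lines    */
        double line_bytes    = 64.0;   /* cache line size in bytes             */
        double clock_ghz     = 2.4;    /* core clock (Q6600)                   */
        double machine_bw    = 8.5;    /* available memory bandwidth, GB/s     */

        /* Step 1: zero-latency instruction issue rate (IPC). */
        double zl_ipc = instructions / cycles;

        /* Step 2: memory access intensity in bytes per cycle, and the
         * off-chip bandwidth it implies at the given clock rate. */
        double intensity   = offchip_lines * line_bytes / cycles;
        double required_bw = intensity * clock_ghz;  /* GB/s */

        printf("zero-latency issue rate: %.2f IPC\n", zl_ipc);
        printf("memory access intensity: %.2f bytes/cycle (%.2f GB/s)\n",
               intensity, required_bw);
        printf("bandwidth bound: %s\n", required_bw > machine_bw ? "yes" : "no");
        return 0;
    }

With these invented counts the segment issues 0.80 bytes of off-chip traffic per cycle, about 1.9 GB/s at 2.4 GHz, well under the 8.5 GB/s available, so this hypothetical segment would not be bandwidth bound.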
