# Advanced algorithms for VLSI: Statistical circuit optimization and cyclic circuit analysis

Table of Contents Acknowledgments ii List of Tables vii List of Figures viii Chapter 1. Introduction 1 Chapter 2. Statistical Optimizations of Digital Circuits 8 2.1 Introduction 8 2.2 Literature Survey 9 2.2.1 Statistical Yield Optimization 10 2.2.2 Gate Sizing 11 2.2.3 SSTA: Statistical Static Timing Analysis 14 2.3 SSTA Overview 16 in

2.3.1 Introduction to SSTA 16 2.3.2 Challenges and Assumptions in SSTA 19 2.4 SSTA-based Circuit Optimization: Problem Overview 21 2.4.1 Problem Formulation 21 2.4.2 Overview of Research 26 2.5 Statistical Gate Sizing 28 2.5.1 Overview of Algorithm 28 2.5.2 FULLSSTA : Full Statistical Static Timing Analysis . . 29 2.5.3 FASSTA : Fast Statistical Static Timing Analysis .... 34 2.5.3.1 Statistical Critical Path Identification 39 2.5.3.2 Subcircuit extraction and ranking 43 2.5.4 Experimental results 44 2.5.5 Concluding Remarks 49 2.5.6 Benefits of Research 49 2.6 Summary 50 Chapter 3. An Efficient Algorithm for Analysis of Cyclic Circuits 52 3.1 Introduction 52 iv

3.2 Notation and Definitions 55 3.3 Literature Survey 57 3.3.1 Origins of Cyclic Circuits 57 3.3.2 Analysis of Cyclic Circuits 58 3.3.3 Synthesis of Cyclic Circuits 60 3.3.4 Most Recent Publications on Analysis of Cyclic Circuits 61 3.4 Types of Cycles 62 3.5 Our Circuit Model 63 3.6 Combinational Circuits 65 3.7 Finding a Combinational Cover for a Cyclic Circuit 68 3.7.1 Theoretical Background 68 3.7.2 Searching for combinational behavior 71 3.7.3 Merging partial assignments 73 3.7.4 Another Example 79 3.8 Experimental Results 83 3.9 Benefits of Proposed Research 83 3.10 Conclusions 83 v

3.11 Summary 86 Chapter 4. Conclusions 90 4.1 Statistical Optimization of Digital Circuits 90 4.2 An Efficient Algorithm for Analysis of Cyclic Circuits 92 Bibliography 94 VI

List of Tables 2.1 Experimental Results: A = 3 45 2.2 Experimental Results: A = 9, runtime is in minutes 46 3.1 Comparison with Edwards [22] 84 vn

List of Figures 1 Mapping from circuit to timing graph for timing analysis. ... 17 2 Examples of cumulative and probability distribution functions for a circuit's timing 19 3 Example of circuit output Delay PDFs 25 4 Example of circuit output Delay CDFs 26 5 Overview of Statistical Sizer Algorithm 30 6 Extracting Subcircuit Cost for Statistical Sizer 31 7 Probabilistic event representing delay at a given edge in an SSTA timing graph 31 8 Shift with scaling and grouping techniques to perform convolu tion of input and gate-delay PDFs to compute the output-delay PDF 32 Vl l l

2.9 Tracing worst negative statistical slack (WNSS) path. Num bers in parenthesis are (//, a) of arrival time. The shaded nodes indicate the WNSS using our method 40 2.10 Normalized Mean-Std Variation for C432 at different A. The x-axis shows the mean while the y-axis has the standard variation. 48 3.1 A trivial cyclic circuit and its truth table 53 3.2 Cyclic circuit for illustrating definitions 56 3.3 Rivest's Circuit 58 3.4 Cyclic circuit arising from resource sharing due to Stok [56] . 59 3.5 Examples of true and false cycles 63 3.6 The three-valued simulation algorithm, which takes a circuit (G, I, W), an input function x, and an infinite schedule of gates s. It evaluates gates until it reaches a fixed point using EVAL, which updates a single (NAND) gate 66 3.7 (a) A cyclic circuit, (b) Partial assignments and their induced frontiers—the boundary between defined and X-valued gates after applying inputs 67 IX

3.8 Our algorithm for finding a minimal set of PAs for a circuit (SCC) that together cover all its combinational behavior. ... 72 3.9 Illustration of merging PAs at a gate 77 3.10 Our PA merging algorithm: return a set of PAs that apply non- controlling values to every input of a gate 80 3.11 Small cyclic circuit for illustrating partial assignment extraction 81 3.12 PAs from applying controlling values to each input in isolation. All frontiers are either gate V or gate Z 88 3.13 Partial assignment extraction on a small cyclic circuit (a) POS and final ISOP for frontier gate V. (b) POS and ISOP for Z. (c) A minimal set of partial assignments that reproduce all combi national behavior. 89 x

Chapter 1 Introduction Advances in VLSI technology continue to present both challenging and exciting opportunities for advanced research in electrical and computer engi neering. Moore's law has continued to motivate designers to keep fulfilling its prediction by continually shrinking down device geometries and packing more devices per square micron while meeting numerous challenges brought on by the most recent technologies. These challenges include several param eters such as power density and dissipation, supply voltage droop, and relia bility under wide operating conditions. However, the most pressing difficulty facing designers today is the decreasing correlation between physical verifi cation (PV) models used in pre-silicon design and optimization and behavior of manufactured circuits on silicon. Manufacturing variations and its adverse effect on predictability on silicon behavior have spurred designers to seek new tools and methodologies to deal with these variations in order to better predict 1

performance of circuits during design cycle. This dissertation focuses on two emerging fields in VLSI. The first is use of statistical formulations to tackle one of the classical problems in VLSI design and analysis domains, namely gate sizing. The second is on analysis of non-traditional digital systems in the form of cyclic combinational circuits. Neither field is really new, early publications in both topics can be traced back to the 60 's and 70 's as our literature survey will show. However, both fields have received renewed interest recently, with statistical approaches in particu lar featured prominently at all major CAD conferences nowadays and getting increased coverage in journals. Usage of statistical approaches has been well-known in parametric yield analysis for post-manufacturing die sorting and analysis, but had not made its foray yet into pre-silicon analysis or optimization areas. It began to receive increased focus around the turn of the 21st century when Physical Verification (PV) models of pre-silicon behavior started to diverge substantially from ac tual silicon measurements. The field has exploded in the past 5 years, with almost every analysis or optimization problem in VLSI revisited from a sta tistical perspective. It remains to be seen whether this is merely an academic curiosity or whether statistical analysis and optimization techniques will nar row the widening gap between pre-silicon models and post-silicon behavior. 2

As of this writing, IBM is the only company which has publicly claimed to deploy statistical static timing analysis in pre-silicon PV models of industrial circuits [32,33]. It is not clear to what extent does IBM use this methodology, and whether it merely augmented or completely replaced standard static tim ing tools as the golden timing verification model. At the same time, the field cannot be neglected. At least for the time being, it provides a rich field for re search, though competition is stiff with numerous researchers both in academia and industry attacking a variety of CAD problems using statistical techniques very aggressively. Cyclic circuits appear to have been a black sheep of digital circuits. Convincing examples of cyclic circuits that had provably less gates than any acyclic equivalents have been around for decades. Nevertheless, cyclic circuits have not received much attention in industry as candidates for deployment in real-life ASICs or custom designs. While circuits that have registers (flops or latches) that depend on current state for future state and output calculation are commonplace in state machine design, purely combinational cyclic circuits are not intuitive to reason about. Despite this, the lure of area savings and po tential for other advantages has continued to spur researchers to study these circuits. More recently, a synthesis engine was proposed that produces cyclic implementations at an area saving compared to traditional synthesis. The re- 3

search was well-accepted, receiving best paper award at Design Automation Conference in 2003 [49]. An alternate motivation for tackling cyclic circuits arises during pro cessing of high level hardware modeling languages such as ESTEREL [8]. Some of the literature on cyclic circuit analysis was contributed by researchers who were trying to grapple with ESTEREL and other synchronous program ming languages such as LUSTRE [25] and ARGOS [36]. Synthesizing these languages into digital circuits often yields loops that are difficult for regular CAD tools to handle. Ability to handle cycles became imperative for compi lation of these languages, forcing researchers to find ways to take out these cycles as a post-processing step before handing off these circuits to other tools. This dissertation tackles both areas separately. The first part focuses on usage of statistical analysis in a digital circuit optimization setting. The main contribution is an adaptation of well-known gate sizing techniques to use a statistical timing model toward reducing the performance variation of a circuit at design time. The second part of this dissertation investigates more efficient methods for analysis of cyclic circuits. We note that the two topics are distinct with no overlap in our context. At a high level, there are similarities in the two research areas we pro- 4

pose here. Both problems involve netlist level analysis and optimization at the gate-level granularity. Gate sizing is NP-complete while cyclic circuit anal ysis is considered to be co-NP-complete. There are also distinct differences between the two areas. Gate sizing belongs to the class of electronic design automation problems, while cyclic circuit analysis is an enumeration problem as we will show. There are advantages to tackling such disparate problems as part of a single PhD research. The field of IC design is increasingly becoming more vertical, with design, analysis and verification becoming strongly coupled with an expectation that a designer can move between these area with ease. Another rationale here was that a practitioner in the field of VLSI design would do well to understand in depth both analysis and design optimization domains of CAD techniques from an algorithmic and practical perspective as they present uniquely different challenges. Our characterization below of the differences is rather subjective but reflects the author's combined industrial and academic experience with these fields. Automated circuit optimization techniques are much more heuristics based, with many decisions and tunings that can work for one design but not other designs. In addition, there is a possibility of oscillations or other unex pected problems where the algorithm seems to go astray. Given that most of the 5

problems EDA tools attempt to solve are NP-complete, a thorough understand ing of the challenges of overcoming local optima and explaining otherwise odd outputs is a daily struggle for any engineer attempting to use and steer EDA design tools. This has been a perennial component of this author's job at In tel as an automation design engineer covering automated synthesis, placement, sizing, and routing tools. The research that was done in this field has helped the author tremendously with understanding the difficult tradeoffs these tools are juggling especially as the given timing, area, and power constraints that are usually impossible to meet at once. The chief challenge with analysis techniques in CAD is reducing run time while keeping peak memory usage within reasonable bounds. Analysis algorithms have different requirements than optimization algorithms. A design optimization algorithm might be successful even if terminates due to exceed ing acceptable runtime or runs out of memory and stops earlier than it would otherwise having achieved a satisfactory result. On the other hand, an anal ysis algorithm must complete its execution; a partial result is of no value in practice. This makes data representation and programming methodology used critical to a successful implementation. Usage of existing technologies such as SAT, BDD manipulation, and ATPG techniques should be considered as much as possible by mapping the given problem into one of these formulations. This 6

enables designers to reuse efficient solvers that are publicly available for each of these formulations and improving the state of the art by focusing on the unique problem at hand. The rest of this dissertation is structured as follows. Chapter 2 starts with an overview of statistical analysis and optimization of digital circuits. It presents an extensive literature survey covering the topic and gives an introduc tion of the problem we addressed and motivation for it. It presents our proposed algorithm for statistical gate sizing and provides experimental results and de tailed analysis of the algorithm's performance on tested circuits. Chapter 3 presents a literature survey on cyclic circuits and presents motivation for the problem we tackle. We provide theoretical underpinnings for our circuit model and present our original algorithm for cyclic circuit analysis as well in-depth step-by-step review of how it works in practice with aid of examples. Finally chapter 4 gives concluding remarks about our contributions and directions for future research. 7

Chapter 2 Statistical Optimizations of Digital Circuits 2.1 Introduction Recent advances in VLSI have continued to shrink device geometries at a steady rate in accordance with Moore's Law. However, this advancement has also been accompanied by increasing variations in the performance of fabri cated circuits. Numerous factors have contributed to this trend including clock PLL jitter, noise, PV model inaccuracies, and manufacturing variations. Nev ertheless, it is often desirable to manufacture ASICs on advanced technology nodes due to substantial increase in available device count, reduction in power consumption, higher yields and lower costs due to the larger 300mm wafers. Researchers have recently focused on statistical analysis approaches in an attempt to grapple with these sources of performance variations. Statistical static timing analysis (SSTA) is a modification of static timing analysis (STA) 8

for determining delay across a circuit. SSTA models delay arcs across gates as random variables rather than discrete values which are used in regular STA. SSTA propagate timing constraints across a circuit using probability distribu tion functions (pdfs). A virtual sink is often used for all the circuits' outputs producing a single pdf that represents delay across the circuit. When this research was first conceived, a substantial focus had gone into the analysis aspect of this problem [1,28]. However, research into statis tical optimization of circuits had been surprisingly diminutive. Circuit opti mization was done in [29] by using LANCELOT [17] but had severe limitation on circuit size and used unrealistically simple gate delay models. A concept of criticality of gates was used in [27] but did not address the variance of the timing path delays. A transistor level approach was presented in [4]. Several yield-specific techniques were presented in [21]. 2.2 Literature Survey An extensive review of prior work on areas related to this research area was undertaken before research into this area was started. Below is a summary of contributions in this field. It should be noted that research into application of statistical techniques to mainstream EDA problems continues to advance 9

at a very rapid pace, with almost all major CAD conferences dedicating at least one or two sessions to statistical analysis and optimization approaches. For example, the entire 2004 ACM/IEEE TAU Workshop on Timing Issues in the Specification and Synthesis of Digital Systems was dedicated to statistical approaches to timing analysis. Many topics that had previously appeared to mature such as static timing analysis, power analysis, and gate-level design optimization are now being re-examined using statistical formulations. We survey publications in a number of research thrusts below. However, we stress that such a survey is only a sampling of what is rapidly becoming a vast body of literature covering all aspects of electronic design and analysis. 2.2.1 Statistical Yield Optimization A wide variety of methods for yield optimization has been developed over the last few decades. A comprehensive reference that covers an expo sition of representative techniques is [21]. Traditional statistical optimization methods define the yield as the probability of a random variable that represents a performance metric belonging to an acceptability region. This acceptability region can be expressed as a multi-dimensional integral which is typically eval uated by Monte-Carlo based methods or by relying on analytical expressions for the circuit performance parameters of interest. Monte-Carlo techniques are 10

far too expensive to deploy for digital circuit design due to the dimensionality of the statistical space. While Monte-Carlo techniques find many uses in analysis of circuits, they are rarely deployed in a circuit optimization context. Modeling of per formance metrics such delay along with possible variations using analytical expressions is also intractable especially in deep submicron technologies. In light of this, we found that traditional yield optimization techniques while be ing highly useful in parametric yield contexts are not directly usable for the problem at hand. 2.2.2 Gate Sizing Gate sizing has been studied extensively in the literature. Gate sizing is typically performed after technology mapping during logic synthesis and re peated several times during the physical design process. The aim of gate sizing is to assign sizes to all gates in a circuit such that some objective function is satisfied, possibly under some constraints. Typical formulations include min imizing area or power subject to a maximum delay constraint. Various gate delay models have been proposed in the literature such as Load-Independent Delay Model (LIDM) and Load Dependent Delay Model (LDDM). 11

The choice of which gate delay model to use has a direct impact on choice and efficacy of the gate sizing algorithm to be deployed. Since the output load of a gate has a great impact on delay across it, LIDM is of little value in real optimization contexts. Gate sizing has been shown to be NP- complete under LDDM which rules out finding globally optimal solutions for real-life circuits which consist of tens to hundreds of thousands of gates. Research in circuit sizing has been carried out both at the transistor as well as gate level. Transistor level sizing is more accurate but presumes ability to size and therefore adjust layout on a per transistor basis, which is becoming less common due to layout complexity of recent processes. It is also limited to smaller circuits compared to gate-level approaches. Gate sizing relies on standard cell libraries that can come from library vendors which are designed in discrete sizes, laid out, and pre-characterized for timing, area and power. A typical cell characterization produces lookup tables for every input-pin output- pin transition. Timing characterization tables represent input slope and output capacitance as inputs with output slopes and delay through gate as outputs. Gate sizing algorithms can be classified into one of two categories: global approaches and local approaches. Global approaches solve gate sizing in the continuous domain by relying on optimization techniques such as con vex programming with posynomials [23], linear programming [5], sequential 12

quadratic programming [37], or Lagrangian Relaxation [13]. While these ap proaches can claim a globally optimum solution, they have two drawbacks. The presumption of a convex problem where a single global optimum exists is not supported by practical evidence. More importantly, library gates tend to come in pre-determined discrete sizes and solving the problem in the continuous do main requires snapping back size assignments to closest available gate sizes. Since standard cell library gates tend to be sized in a geometric progression of drive strength, this discretization may assign drive strengths significantly different from the values obtained in the continuous domain. Advantages of global approaches include a global solution without oscillations and faster run time compared to local approaches. Local sizing approaches assign gate sizes using local gain-based or greedy heuristics. Examples of this approach are available in [14,19,40]. Most of these algorithms share several common elements. The critical path, some times referred to as the Worst Negative Slack (WNS) path, is usually targeted for optimization. We note that the WNS path can change as the optimization proceeds so the path being evaluated for resizing must be updated at specific in tervals in the optimization iteration. The algorithms can be run in a constrained mode where delay for example is optimized first then area is recovered as far as possible without violating a delay constraint. Other constraints can be similarly 13

satisfied either during optimization by not violating some cost/benefit ratio or in a recovery mode after unconstrained optimization. Coudert [19] argues that accurate delay models make gate sizing a non-linear, non-convex, constrained, discrete optimization problem. Our ex perience corroborates this assertion, especially for deep-submicron technolo gies which are the target domain for this research. Many of the commercial tools for logic and physical synthesis such as Design Compiler® and Physical Compiler® from Synopsys® also use local approaches for gate sizing as these approaches are more accurate despite being slower than global approaches. 2.2.3 SSTA: Statistical Static Timing Analysis The earliest paper that suggested a statistical approach to timing anal ysis known to the author is [41]. The author attempted to determine the dis tribution of delay from source to sink of an acyclic directed graph that had probability distributions associated with its elements. However, the focus on use of statistical approaches in timing analysis is relatively new. Pioneering works in this field appeared in [11,20,30]. While difficulties in deterministic timing analysis such as false path de tection carry into statistical approaches, the latter also introduce their own set 14

of challenges. In particular, deterministic timing analysis relies on two opera tions for propagating timing through a network, sum and max. The summing operation adds arrival times at inputs of gates to delays from those input pins to the output. The max operation decides which of these to propagate for max frequency analysis. Performing these calculations on pdfs is more expensive computationally than their counterparts in the deterministic case. Moreover, the degree of correlation between two pdfs arriving at a gate's inputs due to reconvergent fanouts needs be taken into account for accurate calculations. In the past few years statistical techniques for timing analysis of digi tal circuits have received tremendous focus with representative works includ ing [2,12,34,47]. A recent paper [9] reviews many of the the recent devel opments in SSTA. It discusses its underlying models and assumptions, then surveys the major approaches, and closes by discussing its remaining key chal lenges. It also has a large number of references which constitute a compendium of recent publications on statistical techniques in timing analysis and optimiza tion. 15

2.3 SSTA Overview This section provides an overview of statistical timing analysis based on [9]. For more in-depth treatments, the reader is referred to the recent overview paper [9] and the references that paper cites. 2.3.1 Introduction to SSTA Traditional timing analysis abstracts a timing graph from a combina tional circuit as folios. The nodes of the timing graph represent primary in puts/outputs of the circuit and gate input/ output pins. The edge of the tim ing graph represent the timing elements of the circuit, namely, the gate input- pinoutput-pin delay and wire delay from a driver to a receiver, as shown in Fig ure 2.1. The weight on these edges represents the delay of the corresponding timing element. For a combinational circuit, it is convenient to connect all primary inputs to a virtual source node with virtual edges having weight equal to the input arrival times. Similarly, all the primary outputs are connected to a virtual sink node through virtual edges with weights representing the required arrival times. The resulting timing graph, therefore, has a single source and sink node. 16

Figure 2.1: Mapping from circuit to timing graph for timing analysis. SSTA uses the same fundamental concept but uses random variables (RVs) to model gate delays. The random variables capture the uncertainty introduced by the manufacturing variations which are prevalent in deep submi- cron technologies. A formal definition of statistical timing analysis follows. Definition 1. A timing graph G = N,E,ns,rif is a directed graph having exactly one source node ns and one sink node nf, where N is a set of nodes, and E is a set of edges. The weight associated with an edge corresponds to either the gate delay or the interconnect delay. The timing graph is said to be a statistical timing graph ifith edge weight di is an RV. 17

In traditional DSTA, the most basic goal of the analysis is to find the maximum delay between the source node and the sink node of a timing graph, which is the delay of the longest path in the circuit. When modeling process- induced delay variations, the sample space is the set of all manufactured dies. In this case, the device parameters will have different values across this sample space, hence the critical path and its delay will change from one die to the next. Therefore, the delay of the circuit is also an RV, and the first task of SSTA is to compute the characteristics of this RV. This is performed by computing its probability-distribution function (PDF) or cumulative-distribution function (CDF) (see Figure 2.2). Alternatively, only specific statistical characteristics of the distribution, such as its mean and standard deviation, can be computed. Note that the CDF and the PDF can be derived from one another through differentiation and integration. Given the CDF of circuit delay of a design and the required performance constraint the anticipated yield can be determined from the CDF. Conversely, given the CDF of the circuit delay and the required yield, the maximum frequency at which the set of yielding chips can be oper ated at can be found. 18

kF(t) performance yield constraint JU Figure 2.2: Examples of cumulative and probability distribution functions for a circuit's timing. 2.3.2 Challenges and Assumptions in SSTA This section goes through various underlying assumptions and chal lenges pertaining to usage of SSTA in digital circuits • Gates versus wires: Most literature to date presumes that gates are far more susceptible to variations than interconnects. There is some evi dence for this in what little real silicon data has surfaced. As such, in this work, we will also stay with this assumption, modelling gate delays as random variables and ignoring wire delays as they do not impact our analysis or optimization 19