# The Detection of Reliability Prediction Cues in Manufacturing Data from Statistically Controlled Processes

TABLE OF CONTENTS

List of Tables
List of Figures

1 INTRODUCTION
2 BACKGROUND AND LITERATURE REVIEW
   2.1 Statistical Process Control
   2.2 Support Vector Machines
   2.3 Divergence Estimation Using Minimum Spanning Trees
   2.4 Order Statistics and L-moments
3 HYPERPLANE CLASSIFIERS AND THE SUPPORT VECTOR MACHINE
   3.1 Binary Classification Using Hyperplanes
   3.2 The Optimal Hyperplane
4 CUE DETECTION FOR STATISTICALLY CONTROLLED DATASETS
   4.1 Data-Dependent Anomalies of SVMs
       Colinearity Resolution
       Vector Element Scaling
   4.2 Statistical Normalization
5 CASE STUDIES
   5.1 Case A: Lot-Dependency of a Statistically Controlled Datastream
       Experiment A1
       Experiment A2
       Experiment A3
       Experiment A4
       Experiment A5 and Experiment A6
       Observations and Conclusions
   5.2 Case B: Latent Sensor Failure
       Experiment B1
       Experiment B2
       Observations and Conclusions
   5.3 Case C: Range Capability
       Experiment C1
       Experiment C2
       Experiment C3
       Observations and Conclusions
6 KERNELS AND THE HYPERPLANE CLASSIFIER
   6.1 Basic Definitions
       Hilbert Space
       Kernel Function
       Reproducing Kernel Hilbert Space (RKHS)
   6.2 Properties of Reproducing Kernel Hilbert Spaces
   6.3 The "Kernel Trick" and the Hyperplane Classifier
       Non-Linear Transformation and Linear Separability
       Deriving Kernels from Data
       Application of Non-linear Kernel Techniques to a Statistically Controlled Dataset
7 L-MOMENT KERNELS
   7.1 Definitions and Derivations
   7.2 Estimation of L-moments from Sample Data
   7.3 Applying L-moment Kernels to Data
   7.4 SVMs and L-moments
   7.5 Application of L-Moment Kernels to Case Studies
8 SVM IMPLEMENTATION OF WESTERN ELECTRIC COMPANY RULES
   8.1 Definitions
   8.2 Using the Modified SVM Construction to Utilize WECO Rules
   8.3 Effects of Extending the SVM Input Vectors with WECO Conditions
   8.4 Case Study D: Homogeneous Data Streams
       Experiment D1
       Experiment D2
       Observations and Conclusions
9 SUMMARY AND DIRECTIONS FOR FURTHER RESEARCH
   9.1 Summary
   9.2 Directions for Further Research
Bibliography

LIST OF TABLES

5.1 Summary of Results for Case Study A
5.2 Summary of Results for Case Study B
5.3 Summary of Results for Case Study C
6.1 Recap of Results for Case Study A
8.1 HP Estimates for Case Study D
8.2 Classification Results for Case Study D

LIST OF FIGURES

4.1 Colinearity Resolution Example
4.2 Vector Element Scaling Example
5.1 Training and Weight Vectors for Experiment A1
5.2 Training and Weight Vectors for Experiment A2
5.3 Weight Vector for Experiment A4
5.4 Training and Weight Vectors for Experiment B1
5.5 Training and Weight Vectors for Experiment B2
5.6 Training and Weight Vectors for Experiment C1
5.7 Weight Vector for Experiment C2
5.8 Training and Weight Vectors for Experiment C3
6.1 X (Input) Domain
6.2 V (Transform) Domain
7.1 Extended Weight Vector for Experiment B1
7.2 Extended Weight Vector for Experiment A1
7.3 Extended Weight Vector for Experiment C1
8.1 Extended Weight Vector for Experiment D1
8.2 Extended Weight Vector for Experiment D2

Chapter 1

INTRODUCTION

In the testing or quality-control phase of a manufacturing process, data is collected and analyzed in order to ensure that the manufactured products meet some acceptance criteria. This data may include selected process data and product subcomponent data as well as product performance data. Often this data is used not only to grade the product (or service) but also as a means of identifying the control state of the manufacturing process.

In statistical process control, the statistics of one or more parameters are used to develop a set of control limits. For example, the average measured value (X̄) of a parameter over a defined subgroup (or "lot") of assemblies might be tracked across an increasing set of such subgroups as a process statistic. Using historical data, a sample grand mean of X̄ and a sample standard deviation (σ) of X̄ are determined. In general, X̄ is assumed to be normally distributed.¹

1 Based on the Central Limit Theorem, this assumption is increasingly justified as the fixed number N of independent (or partially correlated) elements included in each subgroup or lot is increased. Typically, for ease of implementation, the number N is fixed across lots, but for particular applications N may be variable if associated adjustments are made for the calculation of the standard deviation of X̄ across multiple subgroups.

Upper and lower control limits are then typically determined as this sample mean ±3σ. Units which perform outside the control limits are considered deviations from the controlled process (outliers) or indications that the process has gone out of statistical control. In either case, in an SPC system, a process alarm (or signal) is set. Of course, if the distribution of a test parameter is indeed Gaussian, the probability of false alarm is non-zero. The classical "3-sigma" control limits assume a normal distribution of the process parameter or variable. In situations where the process variable is not normally distributed, control limits (or control regions) may be set based on a chosen probability of false alarm (Type I error)

or risk of false rejection of alarm (Type II error).

Even for production units which perform within the control limits, certain additional criteria (such as the "Western Electric Rules") may be imposed to identify possible process or measurement abnormalities known as runs [1][2, p. 25]. Runs consist of a series of successive readings whose low joint probability of occurrence can be used to signal a process problem. For example, Western Electric "Rule 4" indicates an alarm condition if fifteen consecutive points (readings) in a row all fall within a one-sigma region on one side of the mean.

Products whose performance is consistent with a controlled process, which exhibit no abnormalities, and which meet product specifications are deemed good. However, even good product may have latent defects or environmental susceptibilities that may impinge upon the product's life or reliability. Some of these reliability characteristics may be detectable but unidentified given the available process data. Other characteristics may have no corresponding cues contained in the implemented process data set. If reliability history is available, this history might be used in concert with historical process data to develop predictive criteria. Since some failures may be induced by environmental events apart from inherent product quality deviations, some means of specifying the confidence of a prediction must be provided. The question "could this failure have been predicted as a function of the process data?" might possibly have the answer "no."

Given a limited population of returns due to failure, it might be possible to develop a predictor function (or machine) on the prior process data that would be consistent with respect to that population. However, if, with high probability, the returned population could represent simply an unbiased random sampling of available fielded units, then the predictor machine may be overfit and not generalize well for other test samples. To empirically determine the generalization ability of the predictor, one would check the classification accuracy of the initial predictor operating on a representative population of both failed and survived units that were not included as

samples in the training or design of the predictor.²

2 Depending on the specific fault mode and the period of performance over which "survival" is defined, the current set of "survived" units may or may not contain potential future failures. For example, if survival is defined as not exhibiting a particular failure mode prior to some fixed number of years, then data (if available) from non-overlapping sets of failed units and survived units could be used to test the predictor. In other scenarios, the possibility of potential future failures among the current survivors should inform the interpretation of this classification accuracy testing.

During this research, a binary classification methodology was developed that can be used to design and implement predictors of end-item field failure/survival or downstream product test pass/fail performance based on upstream test data that may be composed of single-parameter, time-series, or multivariate real-valued data. Additionally, the methodology has proved useful as a forensic tool in failure analysis investigations, as it provides indicators of which of several upstream product parameters have the greater influence on the downstream failure outcomes. While the data analysis or design portion of this generalizable methodology requires several input data processing and transformation steps, the implementation form (synthesis) of the prediction machine is relatively simple: it requires only taking the inner product of a derived weight vector with the upstream input data for a particular component or end-item, adding a derived offset, and then basing the classification decision on the sign of the result. Once designed for a specific dataset, the prediction machine can enable effective screening out of suspect components or end-items, especially in cases where the methodology has identified high correlation between one or more parameter elements of the upstream data and the downstream failure mode. As an interim output, the methodology also provides a normalized weight vector whose elements are weighting values that indicate which of the elements of a parameter input (or time-series) vector is most important to the classification decision. This interim weight vector has proved useful as a forensic tool in determining likely contributing factors to low downstream test yields or failure modes. In real-world scenarios, the correlation between the downstream failure and cues in the upstream

manufacturing data may not, if such cues exist at all, be perfect with respect to the failure mode under review. If the correlation is only partial, the predictor generated by this methodology also has a non-zero probability of either screening out units that would not fail or allowing units that would fail to escape the screen. Hence, in practice, there tends to be a trade-off between the detection rate (the ability to positively identify "bad" units) and the false positive rate (the fraction of "good" units falsely rejected). There are cases where the upstream test data would not be expected to provide true cues, since the downstream failure mode is related to a latent defect that produces no performance change in the affected product until the defect (such as, for example, a broken structural support or a ruptured vapor barrier) actually occurs. In such cases, the prediction machine developed under this methodology would be expected not to perform well (as a predictor) on input test data not included in the design analysis, even if the training data were classified with little or no error. As will be demonstrated in one of the case studies explored in this dissertation, the resultant prediction may then have a false positive rate rivaling or even exceeding the detection rate.

In this dissertation, we explore the use of downstream factory test data or product field reliability data to infer data mining or pattern recognition criteria onto manufacturing process or test data by means of support vector machines (SVMs) in order to provide reliability prediction models. In concert with a risk/benefit analysis, these models can be utilized to drive reliability improvement of the product or, at least, through screening, to improve the reliability of the product delivered to the customer.
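The implementation form described earlier, an inner product with a derived weight vector followed by a derived offset and a sign decision, can be sketched as below. The weight vector and offset shown are placeholders for illustration only; real values would come from training on a specific dataset.

```python
def predict(weights, offset, x):
    """Classify one (statistically normalized) input vector by the sign of
    the inner product with the derived weight vector plus a derived offset.
    Returns +1 for one class and -1 for the other."""
    score = sum(w * xi for w, xi in zip(weights, x)) + offset
    return 1 if score >= 0 else -1

# Placeholder parameters, not derived from the case-study data.
w = [0.8, -0.3, 0.1]
b = -0.2
label = predict(w, b, [1.0, 0.5, 0.2])  # score = 0.8 - 0.15 + 0.02 - 0.2 = 0.47
```

Once the weight vector and offset are fixed, classifying a new unit costs only one inner product, which is why the synthesized predictor is simple to deploy in a test environment.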
Additionally, such models can be used to aid in reliability risk assessment based on detectable correlations between the product test performance and the sources of supply, test stands, or other factors related to product manufacture.

This work provides the following contributions:

• Algorithmic details of a modified SVM classifier that can be trained on labeled subsets of data from a statistically controlled process, along with performance analysis of the classifier on several sets of actual manufacturing test data. The classifier so trained could then be used as a predictor function on the members of the overall dataset with respect to inclusion in the classes represented by the training data.

• The use of L-moment vectors and/or L-moment extensions to the input data vectors as means of increasing the discrimination power of the SVM upon the data streams from a statistically controlled process or upon multi-parameter vectors that may have correlation between elements.

• Algorithmic details and performance analysis of a modified SVM classifier that uses specific functions of order statistics of input vectors in order to embed discriminant information into the classifier equivalent to that required in the implementation of the classical 3-sigma process limits and Western Electric Rules.

The general classifier design methodology involves variations of the following top-level plan:

1. Begin with real-valued data from a statistically controlled process, with all data falling within some defined sigma level (say, 3 to 6 standard deviations from the process mean).

2. Ensure the data are organized into a set of vectors that each have the same number of elements.

3. On an element-by-element basis, statistically normalize the data using the ensemble means and standard deviations calculated over the available dataset or subset of interest.

4. Extend or replace the input data vectors with elements representing problem-specific functions on, or transformations of, the input data vectors.

5. Depending on the specific dataset or problem, statistically normalize the extension or replacement elements (recommended if the classifier weight vector is to be later utilized to determine the relative influence of the input elements).

6. Use a portion of the dataset to train the binary classifier. (This assumes that samples from both classes are available.)

7. Review the resultant weight vector to determine which input data or extended input data vector elements are the most significant.

8. If desired, reduce the number of vector elements (or, alternately, set those elements to zero in the weight vector) and retrain the classifier.

9. Use the classification parameters to implement (synthesize) a classifier or predictor specific to the dataset.

10. If desired, transform the classifier parameters so that the classifier can be used directly on the input data in its native form.

11. Test the classifier on new data, or on a portion of the original input data (statistically normalized, of course) not used in the design (analysis) or training of the predictor.

In practice, it may be necessary to iteratively improve the classifier by varying the training set in order to enhance the performance of the classifier over the test set (i.e., a set of data not included in the training itself).

As part of this research, a modified SVM implementation has been applied to real-world product test data from several statistically controlled processes in an aerospace manufacturing environment. In each of three case studies, SVMs were

trained using measurement and/or error data vectors from two labeled classes. The generalization ability of the resultant SVMs was explored by using the SVMs to classify end items using transformed versions of the actual test data. These experiments are detailed in Chapter 5. Each sample vector for these experiments consists of sets of elements representing the values of several different measurement parameters. In Chapter 8, a fourth case study using SVMs is explored in which the sample vectors consist of sets of instantiations of the same measurement parameter.

Feature selection continues to be a viable area of research in the SVM field and is often dependent on the particular dataset and data usage under consideration. Along with completion of the intended contributions outlined above, research objectives accomplished as part of this effort include exploration of the application of the Structural Risk Minimization approach to normalization of feature vectors, reduction of feature vector length, and the effects of using varying numbers of training vectors for particular sets of measurement and measurement-error data³ derived from aerospace sensor manufacturing processes.

Following this introduction, Chapter 2 provides a review of the literature and background material in four areas: statistical process control, support vector machines, divergence estimation using minimum spanning trees, and order statistics (especially L-moments). These provide context for the subsequent discussion of the application of support vector machines to the analysis of data from statistically controlled processes. Chapter 3 provides a detailed development of the hyperplane classifier and its use in a modified Support Vector Machine (SVM) which relaxes the requirement to locate the optimal hyperplane. Chapter 4 examines several data-dependent limitations of this modified SVM and how these can be mitigated.
3 Due to the potentially confidential nature of the real-world data used for this research, affine transformations of the data are performed whenever the need arises to use specific data in providing application examples.
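Step 3 of the design plan above, element-by-element statistical normalization using ensemble means and standard deviations, can be sketched as follows. The data shown are illustrative, not drawn from the processes studied here.

```python
import statistics

def normalize_columns(vectors):
    """Z-score each vector element (column) using the ensemble mean and
    sample standard deviation computed across the available dataset."""
    cols = list(zip(*vectors))
    means = [statistics.mean(c) for c in cols]
    stds = [statistics.stdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(v, means, stds)]
            for v in vectors]

# Two measurement parameters with very different scales.
data = [[1.0, 200.0], [2.0, 220.0], [3.0, 240.0]]
normed = normalize_columns(data)  # both columns become [-1.0, 0.0, 1.0]
```

Normalizing per element keeps a large-valued parameter from dominating the inner products used in training, which matters later when the weight vector is read off to rank the relative influence of the input elements.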

Three case studies using statistically normalized versions of real-world data are used in Chapter 5 to demonstrate the application of the SVM to statistically controlled datasets. Chapter 6 provides a brief exposition of kernel theory and its application to the hyperplane classifier. Chapter 7 introduces L-moment kernels and their application in SVMs. Chapter 8 describes methods of adding discriminant information equivalent to the Western Electric Company (WECO) rules to the SVM and provides a fourth case study utilizing both L-moments and WECO information in various SVM implementations. Chapter 9 provides a summary of results and observations along with suggestions for further research.

Chapter 2

BACKGROUND AND LITERATURE REVIEW

2.1 Statistical Process Control

We begin with a brief background study and literature review of statistical process control as applied in the manufacturing arena. Mass-production assembly-line processes, as introduced by Henry Ford and others in the early 1900s, required that the form, fit, and function of assemblies (or subassemblies) made by different individuals or machines be identical within allowable tolerances. One means of monitoring the quality (i.e., uniformity) of manufacture is to inspect each individual assembly using some means of measurement to ensure that the product meets predetermined specifications. Products that fall outside of specification limits are rejected or reworked. The fallout or rejection rate may be used as an indicator of the need for a process to be improved or corrected. However, this 100% inspection of the outcome of each subprocess may be both unnecessary and uneconomical. In the 1920s, H. F. Dodge and H. G. Romig developed the use of statistical sampling as a means of reducing this inspection burden [3, p. 10]. Human errors, measurement system errors, unobservable defects, and random process variation all pose limitations to the benefit of relying primarily on inspection as a means of quality control.

In about 1924, Walter A. Shewhart and others at Western Electric's Bell Telephone Laboratories began work on the application of statistics to the control of production processes. This work formed the basis of what is now known as Statistical Process Control (SPC). A process is said to be in a state of statistical control with respect to a particular quality variable when the variation of that variable can be approximately described by a fixed probability distribution [4, pp. 30-31]. This quality variable may be a direct measurement variable, a derived variable (such as the mean or range of a subgroup), or a vector of variables. Shewhart introduced process control

charts ("Shewhart charts") for use in tracking the control state of manufacturing processes [4, p. 2][5, p. xiii]. If the control charts indicated that a process output exceeded control limits or had changed control states (such as a significant shift of the mean), some action might be taken to address the "special" cause of the variation or to stabilize the process. Other variations of process outputs within the control limits are considered to be "common"-cause variations resulting from the operation of a stable process. The use of the quality control concepts described by Shewhart in his book Economic Control of Quality of Manufactured Product (1931) [6] grew in the U.S.A. until World War II, but declined thereafter. However, during the 1950s, W. Edwards Deming and J. M. Juran successfully promoted the use of statistical process control in Japan. The success of the Total Quality Control (TQC) movement in Japan would later prove an important influence on the resurgence of interest in the use of statistical process control in the U.S. manufacturing sector.¹

By the mid-1950s, it had been recognized that the Shewhart-type chart was insensitive to some process abnormalities (small "shifts") that may occur with no points falling outside of the process control limits [7]. As a result, in 1956, the Western Electric Company introduced five rules (known as the "Western Electric Rules") for guidance in determining alarm conditions from the classical control chart. The first of these rules is simply a restatement of the rule already used with the Shewhart chart, namely, to signal an alarm when a point lies beyond 3-sigma of the mean of the estimation parameter. The remaining rules indicate alarm conditions for some unlikely (i.e., low-probability) runs of successive points within the control limits.
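The first rule and one run rule can be sketched as follows. The fifteen-point run condition follows the statement of "Rule 4" quoted in Chapter 1; the data and parameters here are illustrative only.

```python
def rule_beyond_3sigma(points, mean, sigma):
    """Rule 1: alarm if any point lies beyond 3 sigma of the mean."""
    return any(abs(p - mean) > 3 * sigma for p in points)

def rule_run_within_1sigma(points, mean, sigma, run_length=15):
    """Run rule as stated in the text: alarm if `run_length` consecutive
    points all fall within one sigma of the mean, on the same side of it."""
    run, side = 0, 0
    for p in points:
        s = 1 if p > mean else -1
        if 0 < abs(p - mean) < sigma and s == side:
            run += 1                     # run continues on the same side
        elif 0 < abs(p - mean) < sigma:
            side, run = s, 1             # new run starts on the other side
        else:
            side, run = 0, 0             # point breaks the run
        if run >= run_length:
            return True
    return False

# A 15-point run just above the mean triggers the run rule even though
# no individual point approaches the 3-sigma limit.
alarm = rule_run_within_1sigma([0.5] * 15, mean=0.0, sigma=1.0)
```

Note that the run rule fires on data that the 3-sigma rule alone would pass, which is exactly the insensitivity of the plain Shewhart chart that motivated the Western Electric rules.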
1 However, as noted in both [4, p. 6] and [5, pp. xix-xxi], there have been historical differences between Japan and the U.S.A. in emphases and philosophical approaches to the use of statistical methods with respect to quality control. A good discussion of these issues can be found in [5, pp. 1-6].

Other enhancements to the control chart have been developed to (1) detect smaller

process changes than the Shewhart chart, (2) account for autocorrelated data, and (3) provide for multivariate detection of changes. Among these are the cumulative sum (CUSUM) charts [8, p. 12][3, pp. 127-135] and the exponentially weighted moving average (EWMA) charts [3, pp. 135-136]. Multivariate control charts enable consideration of interactive effects among multiple process variables in establishing control limits or rules [8, p. 12]. In the 1940s, Hotelling developed the T-squared (T²) control charts for detection of shifts in a multivariate process [3, p. 22]. Principal component analysis (PCA) has been applied as a means of transforming correlated variables into a set of uncorrelated variables upon which the traditional univariate control chart methods can then be applied [9, p. 147].

2.2 Support Vector Machines

The systematic study of the problem of inferring statistical relations in data began in about the 1920s as extensions of the work of Fisher [10, p. 2] for parametric approaches (i.e., parameter estimation based on maximum likelihood) and of the work of Glivenko, Cantelli, and Kolmogorov for general or non-parametric (inductive) methods [10, pp. 2-3]. The development and utilization of parametric methods proceeded rapidly through the 1930s and into the 1960s. It was not until the expanded availability of computers to researchers in the 1950s and 1960s (which enabled extensive analysis of inference models on "real-life" datasets) that some practical shortcomings of classical parametric statistical methods were formally revealed to the statistical research community [10, pp. 2-7][11, pp. ix-x]. The classical methods, as framed at that time, demonstrated limited utility in cases where the real-world datasets

1. were multivariate (the so-called "curse of dimensionality"),

2. had densities that could not be approximated by classical closed-form parametric density functions,

3. were weighted sums of two or more normal distributions, or

4. had low cardinality (i.e., small sample sizes).

Research into the extension of classical methods to address these issues did continue, but awareness of the aforementioned issues served to motivate parallel research into methods that could be used to infer or "learn" patterns (relations or structure) directly from the data (i.e., inductively) rather than predetermining a parametric structure and using the data to determine the best fit or discriminator via maximum likelihood or expectation maximization. Frank Rosenblatt is credited with the introduction of the first supervised learning machine², the Perceptron [12, pp. 62-68][13, pp. 11-19]. The Perceptron essentially extends the McCulloch-Pitts neuron model (introduced in 1943 by Warren McCulloch and Walter Pitts) [12, pp. 62-63] by feeding back the comparison of the present neuron output with the correct output as a means of adjusting the values of the neuron's internal weighting factors that operate upon the input data. Constructed to address a two-class pattern recognition problem, the Perceptron was demonstrated to be able to determine a hyperplane that correctly segments the training data into two classes if the training data are linearly separable. Under the assumption that the training data are a representative sampling of the two fixed-distribution classes, the ability of this hyperplane classifier to generalize (i.e., to correctly classify subsequent test data) is related to the margin of separation between the two classes of training data.
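The Perceptron's feedback rule described above, adjusting the weights whenever the present output disagrees with the correct output, can be sketched as below; the toy dataset is linearly separable by construction.

```python
def train_perceptron(samples, labels, epochs=100, lr=1.0):
    """Rosenblatt-style perceptron: update the weights and bias whenever
    the predicted sign disagrees with the correct label (+1 / -1)."""
    n = len(samples[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(samples, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1
            if pred != y:                          # feedback of the comparison
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                errors += 1
        if errors == 0:                            # data correctly segmented
            break
    return w, b

X = [[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]]
y = [1, 1, -1, -1]
w, b = train_perceptron(X, y)
```

On linearly separable data like this, the loop terminates with a hyperplane that segments the training set without error, consistent with the convergence property noted in the text.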
2 While, as Vapnik points out [11, p. 1], Fisher had considered the separation of two sets of vectors using their set probability distributions, Fisher had not used data or "examples" directly to infer the classification relation of the two sets of vectors.

While the study of learning machines based on neural networks progressed, learning machines (including general adaptive filters) not necessarily based on neurobiological models also demonstrated the ability to learn patterns or generalize based on training data. A common general principle uniting various learning

approaches is the strategy of empirical risk minimization (ERM) [10, p. 7]. In this strategy one chooses, from a given set of decision rules or functions, the function that minimizes the risk of training error (empirical risk). In the 1960s, this induction principle from the statistical sciences was applied to the pattern recognition problem using indicator functions (i.e., functions whose range is the discrete set {0, 1}). By the end of the 1970s, ERM theory had been expanded to include real-valued functions in the solution of regression and density estimation problems [10, p. 8]. For any set of indicator functions with finite VC dimension³, the ERM induction process is a consistent method; that is, it converges in probability to a solution with minimum expected risk among the candidate functions as the number of training samples (or observations) increases⁴. However, if the set of functions is chosen such that, for any possible finite set of training vectors and classification assignments, training will be error-free, then generalization may not be possible due to overfitting. Stated another way, there exists an inherent trade-off between the classification power or capacity of a learning machine (family of functions) and its ability to generalize from the training data to new test samples.
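The capacity/generalization trade-off just described can be illustrated with a deliberately extreme toy comparison: a memorizing rule attains zero empirical risk on any labeling of any training set, but carries no information to unseen inputs, whereas a low-capacity threshold rule extends naturally to new points. Both rules and the data are invented for illustration.

```python
def memorizer(train_x, train_y):
    """Maximal-capacity rule: shatters any training set (zero empirical
    risk), but falls back to a fixed guess of -1 on unseen inputs."""
    table = dict(zip(train_x, train_y))
    return lambda x: table.get(x, -1)

def threshold_rule(t):
    """Low-capacity family: a single threshold on the real line."""
    return lambda x: 1 if x >= t else -1

train_x = (0.1, 0.9, 1.2, 2.3)
train_y = (-1, -1, 1, 1)
f_mem = memorizer(train_x, train_y)
f_thr = threshold_rule(1.0)

# The memorizer makes no training errors, by construction.
train_err_mem = sum(f_mem(x) != yv for x, yv in zip(train_x, train_y))
```

On an unseen point such as 2.0, deep inside the positive cluster, the threshold rule answers +1 while the memorizer can only guess; zero empirical risk says nothing here about expected risk.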
Capacity (or VC dimension) control is a key feature of the statistical learning theory from which support vector machines were eventually developed. Based on bounds for the non-asymptotic (i.e., limited sample set) rate of convergence of the ERM learning principle and related bounds on the probability of test error of a learning machine, an induction approach known as "Structural Risk Minimization" (SRM) was developed [10, pp. 55-57][10, p. 10].

3 The VC (Vapnik-Chervonenkis) dimension for this case is defined by Vapnik as the greatest number h of data vectors that can be "separated into two different classes in all 2^h possible ways using this set of functions (i.e. the VC dimension is the maximum number of vectors that can be shattered by the set of functions)" [10, p. 147]. Thus the VC dimension is a measure of the binary classification capacity of the learning machine (set of functions).

4 See proofs in [10, pp. 121-137].

Given a nested structure of admissible

machines⁵ and a predetermined confidence interval, the SRM induction principle recommends selection of the machine for which minimizing the training error (empirical or sample-based risk) yields the lowest bound on the probability of test error (actual or global risk) [11, pp. 93-96]. This bound is related to the VC dimension, the number of training errors, and the number of training samples [14, pp. 123-124]. It should be noted that the error convergence "bounds" on the learning machines that we have been discussing are not absolute bounds in the sense that, as the number of training samples increases, the generalization or expected test error cannot exceed some given δ > 0; rather, with probability 1 − η, where 0 < η < 1, the learning error will not exceed that δ. For this reason, this statistical approach or learning model is generally known in the computer science community as the "Probably Approximately Correct" (or PAC) model [13, pp. 52-54].

Application of SRM to high-dimensional linear learning problems proved to have accuracy and generalization results that rivaled those of neural networks, including multilayer perceptrons. In combination with the use of kernels, which can be used to map non-linear inputs into linear feature spaces, learning algorithms using the learning bias suggested by the SRM approach, together with well-known Lagrange multiplier optimization and duality theory, led to the development in the early 1990s of what are now known as Support Vector Machines (SVMs) [13, p. 7]. Support vector machines use the Gram matrix relationships between functions of the training input vectors to train a learning machine that, in many cases, turns out finally to be a function of only a subset of the input vectors. These vectors are therefore called the support vectors, since the trained machine that is to be used to test new inputs is independent of the other (non-supporting) input vectors.
5 A machine (a set of indexed functions Q_a(x), a ∈ Λ, where Λ is the set of indices) is admissible if it has finite VC dimension and the set is totally bounded or, at least, ||Q_a||_N / ||Q_a||_1 is bounded for all a ∈ Λ for some integer N > 2 [11, pp. 94-95].
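As a sketch of the support-vector idea (not the modified SVM construction developed later in this dissertation), a linear maximum-margin classifier can be approximated by subgradient descent on the regularized hinge loss; only training points that lie on or inside the margin contribute to the subgradient, which is why the final machine depends on a subset of the inputs. The optimizer, hyperparameters, and data below are illustrative assumptions.

```python
def train_linear_svm(samples, labels, lam=0.01, lr=0.01, epochs=2000):
    """Primal hinge-loss sketch of a linear SVM:
    minimize lam*||w||^2 + mean(max(0, 1 - y*(w.x + b)))."""
    n = len(samples[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        gw, gb = [2 * lam * wi for wi in w], 0.0   # regularizer gradient
        for x, yv in zip(samples, labels):
            margin = yv * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # point on/inside the margin: contributes
                gw = [gwi - yv * xi / len(samples) for gwi, xi in zip(gw, x)]
                gb -= yv / len(samples)
        w = [wi - lr * gwi for wi, gwi in zip(w, gw)]
        b -= lr * gb
    return w, b

X = [[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.0]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
```

Points with margin ≥ 1 drop out of the update entirely; in the dual formulation discussed in the text, these are exactly the non-support vectors whose multipliers vanish.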