# Unsupervised data mining methods for functional data analysis and feature selection

TABLE OF CONTENTS

ACKNOWLEDGEMENTS ................................................................................................ iv

ABSTRACT ....................................................................................................................... vi

LIST OF FIGURES ............................................................................................................ xi

LIST OF TABLES ........................................................................................................... xiii

Chapter Page

1. INTRODUCTION ............................................................................................ 1

1.1 Functional Data Analysis ....................................................................... 2

1.2 Data Mining ........................................................................................... 3

1.2.1 Supervised Learning ............................................................... 3

1.2.2 Unsupervised Learning ........................................................... 4

1.3 Applications ........................................................................................... 6

1.4 Organization of this Dissertation ........................................................... 7

2. AN EFFECTIVE CLUSTERING PROCEDURE OF NEURONAL RESPONSE PROFILES IN GRADED THERMAL STIMULATION .............. 8

2.1 Introduction ............................................................................................ 8

2.2 Analytical Methods .............................................................................. 11

2.2.1 Smoothing ............................................................................. 12

2.2.2 Transformation ..................................................................... 12

2.2.3 Clustering Analysis ............................................................... 13


2.2.4 Clustering Validity Measures ............................................... 15

2.3 Simulation Study ................................................................................. 16

2.4 Spinal Cord Dorsal Horn Responses to Graded Thermal Stimuli ....... 21

2.4.1 Data Description ................................................................... 21

2.4.2 Smoothing and Transformation ............................................ 22

2.4.3 Functional Clustering Results ............................................... 23

2.5 Conclusions .......................................................................................... 28

3. UNSUPERVISED FEATURE SELECTION USING WEIGHTED PRINCIPAL COMPONENTS .......................................................................... 29

3.1 Introduction .......................................................................................... 29

3.2 The Proposed Unsupervised Selection Approach ................................ 32

3.2.1 Weighted Principal Component ............................................ 32

3.2.2 Moving Range-Based Thresholding Algorithm ................... 34

3.2.3 Features Validity Measures .................................................. 37

3.3 Simulation Study ................................................................................. 37

3.3.1 Simulated Data ...................................................................... 37

3.3.2 Simulation Results ................................................................ 38

3.4 Experiments with Real Data ................................................................ 41

3.5 Conclusions .......................................................................................... 47

4. SUMMARY AND FUTURE WORK ................................................................... 49

REFERENCES .................................................................................................................. 51

BIOGRAPHICAL STATEMENT ..................................................................................... 55


LIST OF FIGURES

Figure Page

1.1 Illustration of functional data ......................................................................................... 2

1.2 Illustration of microarray gene expression data ............................................................. 7

2.1 Five True Clusters of Simulated Data ......................................................................... 17

2.2 Graphical Representation of (a) k-means clustering results on simulated data after smoothing and transformation and (b) mean response profiles of each of five clusters ........................................................... 20

2.3 144 Neuronal Response Profiles to Graded Thermal Stimulation ................................ 21

2.4 LOESS in a Mean Response Profile with Different Smoothing Parameters ............... 22

2.5 B-spline basis function plots with (a) order = 1, (b) order = 2, (c) order = 3, and (d) order = 4 .......................................................................................... 23

2.6 Graphical Representation of (a) k-means clustering results on the original dorsal horn neuron data and (b) mean response profiles of each of five clusters .......................................................................................... 24

2.7 Graphical Representation of (a) k-means clustering results on the experiment data after smoothing and transformation and (b) mean response profiles of each of five clusters ........................................................... 26

3.1 Weighted PC loading values of individual features. ................................................... 34

3.2 Dot plots of two significant features for Wisconsin diagnostic breast cancer data. (a) mean area of cell nucleus, (b) mean of the three largest areas feature. The features were identified by both the proposed method and the LSE method according to patient types (malignant, benign). .......................................... 44

3.3 Dot plots of two significant features for Wisconsin diagnostic breast cancer data. (a) mean of texture for cell nucleus, (b) standard error of perimeter feature. The features were identified by only the LSE method (not by the proposed method) according to patient types (malignant, benign). .................................................. 45

3.4 A dot plot of the proline feature by the type of wine. The feature was identified by the proposed method. .................................................................................................... 46

3.5 A dot plot of the significant feature (alkalinity of ash) identified by only the LSE method (not by the proposed method) according to wine types. ...................................... 46

3.6 Dot plots for leukemia data of (a) gene #1779 and (b) gene #1674 by type of leukemia .................................................................................. 47


LIST OF TABLES

Table Page

2.1 Clustering Validity Measures of Low-Noise Simulated Data ..................................... 18

2.2 Clustering Validity Measures of High-Noise Simulated Data .................................... 19

2.3 Clustering Validity Measures (Silhouette Value) for the Spinal Dorsal Horn Data from k-means Clustering and LCCA ............................................................... 27

3.1 Summary of the Simulated Data .................................................................................. 38

3.2 Number of identified features, sensitivity (Se), specificity (Sp), and CPU time on 10 scenarios ........................................................................................... 40

3.3 Summary of Real Datasets ........................................................................................... 41

3.4 Comparison of unsupervised feature selection methods on the Wisconsin diagnostic breast cancer, wine, and leukemia datasets ...................................................... 42

CHAPTER 1

INTRODUCTION

As computer and database technology grows rapidly, vast amounts of highly complex data are being generated. Problems arise when traditional data analysis methods are unable to handle these types of data. The data mining approach is employed to overcome these problems and can be viewed as a multidisciplinary joint effort from machine learning, information technology, and statistics. Recently, data mining has gained a great deal of attention in various disciplines, including manufacturing, telecommunications, economics, gene expression studies, and biomedical studies. In these disciplines, functional data are commonly found because data are usually collected over long periods of time. In bioinformatics, the data often involve a large number of genes, e.g., several thousand. These types of data often challenge analytical capabilities because of their high dimensionality and complexity. The major purpose of this dissertation is to present unsupervised data mining algorithms as tools for functional data analysis and feature selection.

1.1 Functional Data Analysis

Functional data analysis is the analysis of curves or functions of data. The basic concept of functional data analysis is to consider each observed function as a single object rather than as a sequence of individual observations (Ramsay and Silverman, 2005). Longitudinal data can be considered functional data because the observations are collected as a function of time or of another continuous variable. Functional data analysis differs from traditional analysis in that it uses the rates of change, or derivatives, of the curves to analyze and visualize data. Figure 1.1 illustrates an example of functional data: the y-axis represents the observations corresponding to a function of x.

Figure 1.1 Illustration of functional data.

According to Ramsay and Silverman (2005), the objectives of functional data analysis are as follows: (i) to discover patterns of variation among data; (ii) to assist further analysis; (iii) to visualize data and highlight their characteristics.

1.2 Data Mining

According to Wegman and Solka (2005), data mining can be defined as an extension of exploratory data analysis whose purpose is to discover unknown and unanticipated structure in the data. Data mining tools are commonly separated into two categories: supervised learning and unsupervised learning. Supervised approaches require both input (predictor) variables and an output (response) variable, while unsupervised approaches rely solely upon the input (explanatory) variables.

1.2.1 Supervised Learning

Supervised learning is a data mining technique that obtains a model from pairs of input data and desired outputs. The nature of the output variable divides supervised learning into classification and regression problems. In classification problems, the output variable is categorical, and the task is to assign existing class labels to unknown observations. In regression problems, the response is a continuous variable, so the task is to predict the value of the function for any valid predictors. Linear regression models have been the most widely used approach for regression problems because of their simplicity, and the models often provide an adequate and interpretable description of the relationship between response and predictors.
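As a small illustration of the regression setting (the data and coefficients below are made up for this sketch, not taken from the dissertation's analyses), ordinary least squares can be carried out directly with NumPy:

```python
import numpy as np

# Hypothetical data generated from y = 2 + 3*x1 - x2 plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2 + 3 * X[:, 0] - X[:, 1] + 0.01 * rng.normal(size=100)

# Ordinary least squares on the design matrix [1, x1, x2]
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
# beta recovers approximately [2, 3, -1]
```

With such a fitted model, predictions for new predictor values are simply `A_new @ beta`.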

Variable selection algorithms are used to obtain a model with a good set of predictors. Several variable selection approaches have been developed for linear regression models: (1) forward selection, (2) backward elimination, (3) stepwise selection, and (4) best subsets selection (Larose, 2006).

1.2.2 Unsupervised Learning

Unsupervised approaches attempt to extract information without using any response variables. Although visualization techniques elicit the natural groupings of the observations, the interpretation of graphical results is not necessarily straightforward. Clustering analysis is an unsupervised approach that systematically partitions the observations by minimizing within-group variation and maximizing between-group variation, then assigning a cluster label to each observation. Clustering analysis includes hierarchical and nonhierarchical methods. Nonhierarchical clustering algorithms aim to group observations into k clusters. The number k can be determined as part of the clustering procedure or can be specified in advance. The k-means clustering algorithm is one of the most popular nonhierarchical clustering methods (Kim et al., 2002). A brief summary of the k-means clustering algorithm is as follows. Starting with k seed points, each observation is assigned to the seed point closest to it, which creates k clusters. Then, the seed points are replaced with the means of the currently assigned clusters. This procedure is repeated with updated seed points until the assignments do not change. The k-means clustering algorithm depends on the distance metric and the number of clusters (k). A variety of distance metrics are available; the Euclidean distance is the most widely used metric for multivariate data.

Principal component analysis (PCA) is one of the most widely used methods for unsupervised learning. PCA has two major purposes: to reduce the data dimension and to aid the interpretation of high-dimensional data (Johnson and Wichern, 2001). The principal components are linear combinations of the original variables that transform them into uncorrelated new variables. The transformed variables, called principal components (PCs), are uncorrelated, and generally the first few PCs are sufficient to account for most of the variability of the entire dataset. Thus, plotting the observations using these first few PCs facilitates the visualization of high-dimensional data. A brief summary of the PCA algorithm is as follows. Let X = [X_1, X_2, …, X_p]′ have covariance matrix Σ, with eigenvectors e_i (i = 1, …, p) corresponding to the eigenvalues λ_i, where λ_1 ≥ λ_2 ≥ … ≥ λ_p (Johnson and Wichern, 2001). The principal components are the linear combinations

Y_1 = e_1′X = e_11 X_1 + e_12 X_2 + … + e_1p X_p
Y_2 = e_2′X = e_21 X_1 + e_22 X_2 + … + e_2p X_p
⋮
Y_p = e_p′X = e_p1 X_1 + e_p2 X_2 + … + e_pp X_p.     (1.1)

To reduce the dimensionality of the data, we retain Y = [Y_1, …, Y_d]^T with d < p, obtaining the new variables Y_j (j = 1, …, d).
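A minimal sketch of this eigendecomposition view of PCA (illustrative only; the toy data and the function name are ours, and in practice a library routine would normally be used):

```python
import numpy as np

def pca(X, d):
    """Project X onto the first d principal components Y_1, ..., Y_d."""
    Xc = X - X.mean(axis=0)                 # center the variables
    cov = np.cov(Xc, rowvar=False)          # sample covariance matrix (estimate of Sigma)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # reorder so lambda_1 >= ... >= lambda_p
    E = eigvecs[:, order[:d]]               # eigenvectors e_1, ..., e_d as columns
    return Xc @ E, eigvals[order]           # scores Y and sorted eigenvalues

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
X[:, 0] *= 10   # give one variable much larger variance than the rest
Y, lam = pca(X, d=2)
# lam[0] is near 100, and the first PC tracks the high-variance variable
```

Plotting the two columns of `Y` is exactly the low-dimensional visualization described above.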

1.3 Applications

As previously mentioned, data mining has a wide range of applications, including manufacturing, finance, telecommunications, bioinformatics, and neuronal studies. In manufacturing, data mining methods are widely implemented to predict the outcome of a manufacturing process, such as defective parts. In finance, credit analysis and prediction of loan payments are always critical to the business of financial firms; data mining methods help firms evaluate the risk of their customers, identify the important factors, and eliminate irrelevant ones. In telecommunications, data mining aids providers in their churn detection activities and in analyzing fraudulent patterns and unusual activities. In past decades, there has been explosive growth in bioinformatics. Data mining methods help researchers better understand biological processes, e.g., molecular patterns and DNA and protein sequences. Combined with microarray technology, data mining methods have been applied to discover gene expression patterns and to identify complex disease genes and biomarkers for disease diagnosis. In this dissertation, we explicitly focus on neurostimulation data and microarray gene expression data.

Microarray gene expression data contain thousands of genes reflecting the state of a cell with different protein and mRNA compositions (Ding, 2002). Of the thousands of genes, only a small number are significant features. Thus, selecting the most important and relevant genes is an important task. Figure 1.2 illustrates an example of microarray gene expression data. Unsupervised feature selection is applied in order to select a set of relevant genes.

Figure 1.2 Illustration of microarray gene expression data.

1.4 Organization of this Dissertation

This dissertation begins with a brief introduction to data mining and biomedical signals in Chapter 1. Chapter 2 proposes a clustering procedure for neuronal response profiles under graded thermal stimuli. Chapter 3 presents an unsupervised feature selection approach using weighted principal components combined with thresholding algorithms. Finally, Chapter 4 discusses future research directions.


CHAPTER 2

AN EFFECTIVE CLUSTERING PROCEDURE OF NEURONAL RESPONSE PROFILES IN GRADED THERMAL STIMULATION

2.1 Introduction

In recent years, functional data analysis has received considerable attention in a variety of fields of applied science in which the collected data take the form of functions (Ramsay and Silverman, 2005). In the present study we apply functional data analysis to study how the dorsal horn neurons of a rat respond to graded heat stimuli. Pain is one of the most extensively studied neural systems. Pain can be evoked in the periphery by mechanical, thermal, and chemical stimuli. In the spinal cord, dorsal horn neurons play a critical role in receiving not only excitatory input from peripheral tissues (e.g., skin) but also descending inhibitory input from supraspinal structures (e.g., the brain stem). Thus, the response properties of these spinal dorsal horn neurons determine the final output of neural signals to the higher centers. It is expected that the outcome of this analysis will enable better classification of spinal dorsal horn neurons, which will be used to correlate the efficacy of descending inhibition by electrical neurostimulation.

Neurostimulation has been used effectively to reduce or block pain signals (Burchiel et al., 1996). A neurostimulation device over the surface of the spinal cord or over the primary motor cortex, for example, can deliver low levels of electrical current or heat directly to nerve fibers or neurons. Therapeutic studies have shown that

when used on carefully selected chronic-pain patients, neurostimulation can significantly improve pain relief and reduce the need for narcotic medications (Burchiel et al., 1996). Neurostimulation has several significant advantages. First, it can be very effective, with few side effects, for certain conditions. Second, the implanted device can be controlled by patients or doctors with little risk of addiction or overdose. Third, the implant can be removed if it does not achieve the desired level of pain or symptomatic relief.

Several studies have been conducted to characterize the spinal cord dorsal horn neurons in a rat. Chung et al. (1986) and Owens et al. (1992) studied the response of spinal cord dorsal horn neurons to mechanical stimuli. Senapati et al. (2005) conducted electrical stimulation to induce inhibition of the responses of spinal cord dorsal horn neurons to noxious mechanical stimuli. Furthermore, a number of studies of spinal cord dorsal horn neurons were conducted to determine their response to graded heat stimuli (Kanui, 1985; Craig et al., 2001; Hayes and Rustioni, 1981). The main conclusion from these studies has been that the responses of dorsal horn neurons increase as heat stimuli increase. Despite the number of studies conducted in this direction, few efforts have been made to characterize the response pattern of spinal cord dorsal horn neurons in deeper laminae to heat stimuli (Borzan et al., 2005). Borzan et al. (2005) used a latent class cluster analysis (LCCA) to categorize the dorsal horn neurons in the deep laminae of the rat in response to graded heat stimulation. They found five distinct response patterns in these neurons. LCCA is one of the model-based clustering methods that have been used to elicit the natural groupings of multivariate data. However, the responses

of the dorsal horn neurons are functional observations in that the responses can be represented as a function of the temperature. Thus, the direct use of a clustering method for this functional dataset may lead to inefficient and unsatisfactory conclusions. Indeed, the data often involve a number of complex nonlinear profiles that challenge analytical capabilities. To address this problem, we propose an effective clustering procedure to group a number of nonlinear profiles.

In general bioinformatics research, a number of studies have used clustering methods, especially when gene expression data were the target, to discover patterns in functional data. Wakefield et al. (2002) proposed a model-based clustering method to efficiently cluster gene expression data. Schliep et al. (2003) suggested clustering genes based on hidden Markov models (HMM). They claim that HMM is robust to the noisy and frequently missing data and has an ability to handle cyclic and noncyclic biological time series. Luan and Li (2003) proposed a mixed-effects model with B-splines for clustering time-course gene expression data. Each of the aforementioned methods has its own advantages and disadvantages, and choosing among them depends on the purpose of the application. Recently, Serban and Wasserman (2005) developed CATS (clustering after transformation and smoothing) as a method to cluster a number of profiles and applied it to time-course gene expression data. The CATS method involves three required preprocessing steps before clustering analysis. These are transformation, smoothing, and screening, in that specific order. The purpose of the screening step in CATS is to filter out any constant profiles that may adversely affect the clustering. However, testing the constant profiles involves a statistical threshold that can be determined subjectively.

Further, although the authors claimed that the screening step in CATS is important, their simulation study found that the screening step brought no significant clustering improvement. In the present study we propose a modified version of CATS that does not require a screening step and that efficiently clusters a number of nonlinear profiles collected for the heat stimulation study.

The remainder of this chapter is organized as follows. Section 2.2 describes the set of analytical methods, including smoothing, transformation, clustering, and clustering validity measures, used in the proposed clustering procedure. Section 2.3 presents simulation studies to investigate the performance of the proposed procedure and to compare it with an existing clustering method. Section 2.4 presents a real application study to examine the response patterns of deep dorsal horn neurons to heat stimuli. Section 2.5 contains our concluding remarks.

2.2 Analytical Methods

The basic idea of our proposed clustering procedure is to conduct the clustering analysis after the data are smoothed and transformed, in that specific order. It should be noted that this order of smoothing and transformation differs from that of CATS. Detailed descriptions of our choices of smoothing and transformation methods are presented in the following subsections.

2.2.1 Smoothing

Generally, smoothing algorithms are used to remove the noise present in the data so as to highlight the real signal. A robust locally weighted regression (LOESS) method was used for smoothing a set of data points (x_i, y_i), i = 1, 2, …, n, in which the fitted value is a local polynomial fit to the data (Cleveland, 1979). LOESS starts with a local polynomial fit to a subset of the data and then applies a robust fitting method to obtain the final fit. One of the parameters of LOESS is a smoothing factor that controls the degree of smoothness of the regression function: the larger the smoothing parameter, the less sensitive the fit is to fluctuations in the data. The smoothing parameter can be determined by a cross-validation technique (Cleveland and Devlin, 1988). Another parameter in LOESS is the degree of the polynomial function. A degree of zero gives the simplest computation, which reduces LOESS to a weighted moving average. Typically, a polynomial degree of one is sufficient to smooth the data (Cleveland, 1979).

2.2.2 Transformation

Having smoothed the data, we converted them into a functional form. In other words, we fitted the profiles of the smoothed data using a set of basis functions:

x(t) = Σ_{k=1}^{K} c_k φ_k(t),     (2.1)

where K is the number of basis functions, the c_k are the coefficients, the φ_k are polynomials, and t is the argument (e.g., time).
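As an illustrative sketch of these two preprocessing steps (the toy profile is ours; a single-pass tricube-weighted local linear fit stands in for full robust LOESS, and SciPy's smoothing-spline routine supplies B-spline coefficients playing the role of the c_k — the actual parameter settings of the study are not reproduced here):

```python
import numpy as np
from scipy.interpolate import splrep, splev

def local_linear_smooth(x, y, frac=0.3):
    """One pass of tricube-weighted local linear fitting (LOESS without the
    robustness iterations). frac is the smoothing parameter: the fraction
    of points entering each local fit; larger frac gives a smoother curve."""
    n = len(x)
    r = max(2, int(np.ceil(frac * n)))       # neighbors used per local fit
    A = np.column_stack([np.ones(n), x])     # design for a local line a + b*x
    fitted = np.empty(n)
    for i in range(n):
        d = np.abs(x - x[i])
        h = np.sort(d)[r - 1]                # bandwidth: r-th nearest distance
        w = np.clip(1 - (d / max(h, 1e-12)) ** 3, 0.0, 1.0) ** 3  # tricube weights
        sw = np.sqrt(w)                      # weighted least squares via sqrt(w)
        beta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
        fitted[i] = beta[0] + beta[1] * x[i]
    return fitted

# Toy noisy profile
x = np.linspace(0, 1, 60)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.default_rng(0).normal(size=60)

smooth = local_linear_smooth(x, y, frac=0.3)

# Transformation step: fit a cubic B-spline; tck[1] holds the spline
# coefficients, which are what the clustering stage operates on
tck = splrep(x, smooth, k=3, s=1e-4)
curve = splev(x, tck)
```

The coefficient vector `tck[1]`, one per profile, is the low-dimensional representation passed to the clustering step described next.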

Spline functions are commonly used for approximating nonperiodic functional data (Ramsay and Silverman, 2005; Levitin et al., 2007). The B-spline basis is formed by joining polynomial functions of order m at fixed points called knots. Let the domain be divided into L subintervals separated by the knot values τ_l, l = 1, …, L−1, and let m be the order of the polynomial on each subinterval. The spline function S(t) can be represented as

S(t) = Σ_{k=1}^{m+L−1} c_k B_k(t, τ),     (2.2)

where B_k(t, τ) is the value at t of the B-spline basis function defined by the knot sequence τ, and c_k is the corresponding coefficient (Ramsay and Silverman, 2005).

Serban and Wasserman (2005) used the Fourier transform in their CATS method. The Fourier transform is efficient for periodic data. However, stimulation data from neuroscience often exhibit a nonperiodic pattern, as shown in the latter part of this chapter (see Figure 2.3). Consequently, we propose to use the B-spline function to transform the data.

2.2.3 Clustering Analysis

We performed clustering analysis to group the response profiles based on the coefficients of the B-spline functions. Clustering analysis systematically partitions the dataset by minimizing within-group variation and maximizing between-group variation and then assigning a cluster label to each observation (Xu and Wunsch II, 2005). The present study applied a k-means clustering algorithm to the set of coefficients from the B-spline transform. A brief summary of the k-means clustering algorithm is as follows. Given k seed points, each observation is assigned to the closest seed point, a step that creates k clusters. Then, the seed points are replaced with the means of the currently assigned clusters. This procedure is repeated with updated seed points until the assignments no longer change (Xu and Wunsch II, 2005). The k-means clustering algorithm depends on the distance metric and the number of clusters (k). A variety of distance metrics are available. The correlation distance is an appropriate choice for analyzing functional profile data because it measures the similarity in shape between two response profiles. As for determining the number of clusters, a number of methods are available, but no consensus exists about which one is most satisfactory (Hastie et al., 2001). In the current study, we determined the appropriate number k based on the opinions of domain experts.

An earlier study by Borzan et al. (2005) used an LCCA algorithm to cluster the same heat stimulation data used in the present study. LCCA is a model-based clustering approach based on finite mixture probability distributions. LCCA provides an estimate of the number of clusters in multivariate data with statistical confidence (Borzan et al., 2005). A brief explanation of LCCA is as follows. Let c be the number of clusters, n be the number of observations, and π_i be the fraction of observations that belong to cluster i (for i = 1, …, c). The probability density function can be denoted by f(x; μ_i, Σ), where μ_i is the mean vector and Σ is the covariance matrix. The likelihood function can be computed as follows:

L(θ_1, …, θ_c) = Π_{j=1}^{n} [ Σ_{i=1}^{c} π_i f(x_j; μ_i, Σ) ],     (2.3)

where θ_i = (π_i, μ_i, Σ_i). Likelihood ratio testing methodology was applied to acquire a good latent class model. We fit models starting with a one-cluster model and then added another cluster for each successive model. The difference between the log-likelihood of the model with c−1 clusters and that of the model with c clusters represents the fit improvement gained by the c-cluster model over the (c−1)-cluster model. In general, the difference in maximized log-likelihood can be tested with the standard chi-square distribution method for a nested model. However, if the model is not nested, a bootstrapping method can be used to obtain the test statistic (Borzan et al., 2005).

2.2.4 Clustering Validity Measures

To demonstrate the effectiveness of a clustering approach, the following misclustering rate φ can be defined for N profiles:

φ = (1/N) Σ_{j=1}^{N} I(C_j, Ĉ_j),     (2.4)
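The clustering and validation steps can be sketched on toy profiles. Here z-scoring each profile makes Euclidean k-means operate on shape, since for unit-norm centered profiles the squared Euclidean distance equals 2(1 − r), i.e., twice the correlation distance; and we assume that I(C_j, Ĉ_j) in Eq. (2.4) indicates disagreement between the true and estimated labels after the best matching of the arbitrary cluster labels — an assumption, since the indicator's definition falls outside this excerpt:

```python
import numpy as np
from itertools import permutations

def zscore_rows(X):
    """Center and unit-normalize each profile so distance reflects shape only."""
    Xc = X - X.mean(axis=1, keepdims=True)
    return Xc / np.linalg.norm(Xc, axis=1, keepdims=True)

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Euclidean k-means with random seed points drawn from the data."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                            else centers[i] for i in range(k)])
    return labels

def misclustering_rate(true, est, k):
    """phi of Eq. (2.4) under the disagreement-indicator assumption,
    minimized over relabelings of the estimated clusters."""
    true, est = np.asarray(true), np.asarray(est)
    return min(np.mean(np.array([perm[e] for e in est]) != true)
               for perm in permutations(range(k)))

# Two families of toy response profiles, linear ramps vs. single bumps,
# at several amplitudes (so magnitude alone cannot separate them)
t = np.linspace(0, 1, 20)
X = np.vstack([[a * t for a in (1, 2, 5)],
               [a * np.sin(np.pi * t) for a in (1, 3, 4)]])
true = [0, 0, 0, 1, 1, 1]

labels = kmeans(zscore_rows(X), k=2)
phi = misclustering_rate(true, labels, k=2)   # 0.0 here: the shapes separate cleanly
```

Relabeling by brute-force permutation is fine for small k; for many clusters one would use an assignment method (e.g., Hungarian matching) instead.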