# Kendall's tau and Spearman's rho for zero-inflated data

## Table of Contents

- ACKNOWLEDGEMENTS
- LIST OF TABLES
- LIST OF FIGURES
- CHAPTER 1: INTRODUCTION
  - 1.1 Motivation and Background
  - 1.2 Statement of the Problem
  - 1.3 Organization of the Dissertation
  - 1.4 Basic Definitions
    - 1.4.1 Delta Distribution
    - 1.4.2 Measures of Association
    - 1.4.3 Chi-plot
- CHAPTER 2: LITERATURE REVIEW
- CHAPTER 3: PROPOSED ESTIMATOR OF KENDALL'S TAU
  - 3.1 Adjustment of Kendall's Tau with Ties
  - 3.2 Proposed Estimator of Kendall's Tau, $\tau^*$
  - 3.3 Asymptotic Distribution of $\tau^*$
- CHAPTER 4: PROPOSED ESTIMATOR OF SPEARMAN'S RHO
  - 4.1 Adjustment of Spearman's Rho with Ties
  - 4.2 Proposed Estimator of Spearman's Rho, $\rho^*_s$
  - 4.3 Asymptotic Distribution of $\rho^*_s$
- CHAPTER 5: SIMULATION STUDY AND RESULTS
  - 5.1 Simulation and Results: Kendall's Tau
    - 5.1.1 Simulation Plan
    - 5.1.2 Results
  - 5.2 Graphical Illustration
  - 5.3 Simulation and Results: Spearman's Rho
    - 5.3.1 Simulation Plan
    - 5.3.2 Results
  - 5.4 An Example
- CHAPTER 6: FINAL COMMENTS AND FUTURE RESEARCH
- BIBLIOGRAPHY

## List of Tables

- 1.1 Example of 2 x 2 contingency table for two categorical variables
- 5.1 (2.5th, 97.6th) percentile intervals for $\tau^*$ based on the 1000 estimates
- 5.2 Normality test from the 1000 $\tau^*$ estimates
- 5.3 Normality test from the 100 randomly selected $\tau^*$ estimates
- 5.4 Summary statistics including the bias and MSE for $\tau^*$ and $\hat{\tau}$ based on the 1000 estimates with n=30 sample size
- 5.5 Summary statistics including the bias and MSE for $\tau^*$ and $\hat{\tau}$ based on the 1000 estimates with n=50 sample size
- 5.6 Summary statistics including the bias and MSE for $\tau^*$ and $\hat{\tau}$ based on the 1000 estimates with n=100 sample size
- 5.7 Sample variance from the 2000 $\tau^*$ estimates. The estimates are calculated from 2000 simulations and the Shapiro-Wilk statistic was calculated using a random sample of 100 estimates
- 5.8 Asymptotic variance of $\tau^*$. The estimates are calculated from 2000 simulations and the Shapiro-Wilk statistic was calculated using a random sample of 100 estimates
- 5.9 Additional results for the sample variance from the 2000 $\tau^*$ estimates. The estimates are calculated from 2000 simulations and the Shapiro-Wilk statistic was calculated using a random sample of 100 estimates
- 5.10 Additional results for the asymptotic variance of $\tau^*$. The estimates are calculated from 2000 simulations. The Shapiro-Wilk statistic was calculated using a random sample of 100 estimates
- 5.11 (2.5th, 97.6th) percentile intervals for $\rho^*_s$ based on the 1000 estimates
- 5.12 Normality test from the 1000 $\rho^*_s$ estimates
- 5.13 Normality test from the 100 randomly selected $\rho^*_s$ estimates
- 5.14 Summary statistics including the bias and MSE for $\rho^*_s$ and $\hat{\rho}_s$ based on the 1000 estimates with n=30 sample size
- 5.15 Summary statistics including the bias and MSE for $\rho^*_s$ and $\hat{\rho}_s$ based on the 1000 estimates with n=50 sample size
- 5.16 Summary statistics including the bias and MSE for $\rho^*_s$ and $\hat{\rho}_s$ based on the 1000 estimates with n=100 sample size
- 5.17 Sample variance from the 2000 $\rho^*_s$ estimates. The estimates are calculated from 2000 simulations and the Shapiro-Wilk statistic was calculated using a random sample of 100 estimates
- 5.18 Asymptotic variance of $\rho^*_s$. The estimates are calculated from 2000 simulations and the Shapiro-Wilk statistic was calculated using a random sample of 100 estimates
- 5.19 Additional results for the sample variance from the 2000 $\rho^*_s$ estimates. The estimates are calculated from 2000 simulations and the Shapiro-Wilk statistic was calculated using a random sample of 100 estimates
- 5.20 Additional results for the asymptotic variance of $\rho^*_s$. The estimates are calculated from 2000 simulations and the Shapiro-Wilk statistic was calculated using a random sample of 100 estimates
- 5.21 Summary of HIV data
- 5.22 Calculated value of the estimators and the corresponding variances of the HIV data

## List of Figures

- 1.1 Q-Q plot of measles antibody concentration versus the expected distribution
- 1.2 Kendall's tau as a function of Pearson's correlation coefficient in the bivariate normal model
- 1.3 Spearman's rho as a function of Pearson's correlation coefficient in the bivariate normal model
- 1.4 Sample chi-plot
- 1.5 Additional sample chi-plot
- 2.1 Scatter plot of plasma and semen viral loads from Wang (2007)
- 5.1 Behavior of the chi-plot on varying proportions of zero, $p_{00}$ = 0%, 30%, 60%, 80%; $\rho$ = 0.0
- 5.2 Behavior of the chi-plot on varying proportions of zero, $p_{00}$ = 0%, 30%, 60%, 80%; $\rho$ = 0.20
- 5.3 Behavior of the chi-plot on varying proportions of zero, $p_{00}$ = 0%, 30%, 60%, 80%; $\rho$ = 0.50
- 5.4 Behavior of the chi-plot on varying proportions of zero, $p_{00}$ = 0%, 30%, 60%, 80%; $\rho$ = 0.80
- 5.5 Scatter plot and corresponding chi-plot of plasma and semen viral loads from Wang (2007)

## Chapter 1: Introduction

### 1.1 Motivation and Background

Statistical concerns related to the analysis of zero-inflated data were identified as early as 1955, especially in relation to estimation of the location parameter (Aitchison 1955). The term "inflation" emphasizes that the probability mass at zero exceeds the value allowed by a parametric family of distributions. Such data are common in medical research and in finance, insurance, manufacturing, economics and engineering, to name a few fields. Statistical methodology for this type of data is still being developed in response to the needs of these areas. Some examples of zero-inflated data follow.

Example 1. Household expenditure in Aitchison (1955). If a certain commodity is targeted, some households might not purchase it. For example, in a study of household expenditure on children's clothing, a zero value will be reported for households without any children.

Example 2. Marine surveys in Pennington (1983). A particular species of fish or plankton usually occupies only part of the total area. In a survey of marine species, zero inflation is brought about by areas that are unoccupied or perhaps unsuitable for some species.

Example 3. Exposure measurements in Taylor et al. (2001). Depending on work schedules, some workers may be required to spend part of the data collection period in control rooms free of contamination. This yields zero exposure measurements for these workers.

Example 4. Antibody response to the measles vaccine in Moulton and Halsey (1995). Several factors are known to make the results of these assays zero-inflated. One is passively acquired maternal antibody, which interferes with the infant's response to the measles vaccine. A Q-Q plot of the partial data is presented in Figure 1.1.

Figure 1.1: Q-Q plot of measles antibody concentration versus the expected distribution.

As indicated in the examples above, the non-ignorable zeroes can be attributed to real zeroes, non-response or non-detects, i.e., values falling below some limit of detection. The presence of these zero observations creates problems for researchers, statisticians and data analysts. Because some existing statistical methods are inapplicable, a common, although not always appropriate, practice in the analysis of zero-inflated data is to exclude the zeroes and analyze only the nonzero pairs of observations in the bivariate case, or to use average ranks in nonparametric procedures.

Association of two or more variables is a very important research topic. Pearson's correlation coefficient, while the most commonly used, detects only linear association between two variables and also requires the normality assumption for each of the random variables. Since real data often violate normality, and relationships other than linear are often of interest, Kendall's tau and Spearman's rho are indices that can be used instead. Both are estimated as rank correlations, so the relations are between the rankings rather than the actual values of the observations. Several adjustments to these rank correlations in the literature try to account for tied observations, but none of them was designed for zero-inflated data. Calculating estimates of these measures of rank correlation using just the nonzero pairs of observations in zero-inflated data usually leads to inaccurate results.

### 1.2 Statement of the Problem

This research focuses on two well-known measures of association, Kendall's tau and Spearman's rho. Multiple zeroes in zero-inflated data can be seen as a special case of tied observations. The treatment of these measures in the presence of ties will be studied and compared with a proposed new approach to estimating them.

### 1.3 Organization of the Dissertation

Background information introduced in the remainder of this chapter includes the delta distribution and the classical indices of association, not only in the continuous case but also in the discrete and categorical cases. A graphical tool will also be presented. Chapter 2 gives a review of the current literature. Chapters 3 and 4 give the proposed estimators for Kendall's tau and Spearman's rho, respectively, together with their asymptotic distributions. Chapter 5 presents the simulation plan and the results. The dissertation ends with final comments in Chapter 6, which also outlines the future research plan.

### 1.4 Basic Definitions

We will define the basic distribution, named the delta distribution by Aitchison, which incorporates a probability mass at zero while the distribution of the positive values is lognormal. We will also look at the different indices of association for later comparison. A graphical tool called a chi-plot, which will be used for data evaluation alongside the scatter plot, will also be presented.

#### 1.4.1 Delta Distribution

For the univariate case, assume that a random variable $X$ has a continuous distribution for its positive values, with density $h_X(x)$, and a positive mass at 0, $P(X = 0) = p > 0$. Then the distribution of $X$ can be written as

$$f(x) = p^{d_x}\left[(1-p)\,h_X(x)\right]^{1-d_x}, \tag{1.1}$$

where $d_x = 0$ if $x > 0$ and $d_x = 1$ if $x = 0$. Consequently,

$$F_X(s) = \begin{cases} 0 & \text{if } s < 0, \\ p & \text{if } s = 0, \\ p + (1-p)\int_0^s h_X(x)\,dx & \text{if } s > 0. \end{cases}$$

If $h_X(x)$ is the density of a lognormal distribution, $X$ has the so-called delta distribution (Aitchison 1955). The mean and variance of this distribution are

$$E(X) = (1-p)\,\alpha \tag{1.2}$$

and

$$\mathrm{Var}(X) = (1-p)\,\beta + p(1-p)\,\alpha^2, \tag{1.3}$$

where $\alpha$ and $\beta$ are the mean and variance, respectively, of the $h_X(x)$ distribution.

#### 1.4.2 Measures of Association

There are several measures available to study the association of discrete or continuous data. The most common measure for a continuous pair of random variables is Pearson's correlation coefficient, $\rho$. Other measures, such as Kendall's tau, $\tau$, and Spearman's rho, $\rho_s$, are also used and will be the focus of this study.
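Before turning to the individual measures, the moment formulas (1.2) and (1.3) for the delta distribution can be checked numerically. The sketch below is only an illustration under stated assumptions: the function names `delta_moments` and `sample_delta`, the log-scale parameters `mu` and `sigma` of the lognormal part, the seed and the sample size are all hypothetical choices, not notation or settings from the dissertation.

```python
import math
import random

def delta_moments(p, mu, sigma):
    """Mean and variance of a delta distribution via (1.2) and (1.3).

    p is P(X = 0); mu and sigma are assumed log-scale parameters of the
    lognormal density h_X that governs the positive values.
    """
    alpha = math.exp(mu + sigma ** 2 / 2)            # lognormal mean
    beta = (math.exp(sigma ** 2) - 1) * alpha ** 2   # lognormal variance
    mean = (1 - p) * alpha                           # equation (1.2)
    var = (1 - p) * beta + p * (1 - p) * alpha ** 2  # equation (1.3)
    return mean, var

def sample_delta(n, p, mu, sigma, rng):
    """Draw n values: zero with probability p, lognormal otherwise."""
    return [0.0 if rng.random() < p else rng.lognormvariate(mu, sigma)
            for _ in range(n)]

rng = random.Random(12345)  # fixed seed so the run is repeatable
xs = sample_delta(100_000, p=0.3, mu=0.0, sigma=0.5, rng=rng)
sample_mean = sum(xs) / len(xs)
sample_var = sum((x - sample_mean) ** 2 for x in xs) / (len(xs) - 1)
theo_mean, theo_var = delta_moments(0.3, 0.0, 0.5)
print(sample_mean, theo_mean)
print(sample_var, theo_var)
```

With a sample this large, the simulated mean and variance should sit close to the values given by (1.2) and (1.3), although the exact digits depend on the seed.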

##### Pearson's Correlation Coefficient, $\rho$

Pearson's correlation coefficient is a measure of the linear relationship between two random variables. Suppose $X$ and $Y$ are two jointly distributed random variables. Pearson's correlation coefficient between $X$ and $Y$ is given by

$$\rho = \frac{\mathrm{Cov}(X, Y)}{\sqrt{V(X)\,V(Y)}}, \tag{1.4}$$

where $\mathrm{Cov}(X, Y)$ is the covariance between $X$ and $Y$, and $V(X)$ and $V(Y)$ denote the variances of $X$ and $Y$, respectively. From a sample of $n$ paired observations, $\rho$ is estimated by

$$r = \frac{\sum X_i Y_i - n\bar{X}\bar{Y}}{\sqrt{\left(\sum X_i^2 - n\bar{X}^2\right)\left(\sum Y_i^2 - n\bar{Y}^2\right)}}, \tag{1.5}$$

where $\bar{X}$ and $\bar{Y}$ are the sample means of the $X_i$'s and $Y_i$'s, respectively.

Some drawbacks of this measure are: (1) it is not invariant under strictly increasing nonlinear transformations and it is highly affected by extreme outliers, and (2) it is sensitive to departures from normality: $r$ tends to have large bias and large variance when calculated from a bivariate nonnormal distribution with skewed marginals and $\rho \neq 0$, especially for smaller sample sizes.
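The first drawback can be seen on a toy data set. The following sketch evaluates the computational formula (1.5) and then applies a strictly increasing nonlinear transformation to one variable; the function name `pearson_r` and the five data points are hypothetical and chosen only for illustration.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation via the computational formula (1.5)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    num = sum(x * y for x, y in zip(xs, ys)) - n * xbar * ybar
    den = math.sqrt((sum(x * x for x in xs) - n * xbar ** 2)
                    * (sum(y * y for y in ys) - n * ybar ** 2))
    return num / den

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]   # exactly linear in x
r_linear = pearson_r(x, y)
# exp() is strictly increasing, so the ordering of y is unchanged,
# yet the linear correlation with x drops well below 1.
r_transformed = pearson_r(x, [math.exp(v) for v in y])
print(r_linear, r_transformed)
```

The ranks of the transformed values are the same as before, which is why rank-based measures are unaffected by such a transformation while $r$ is not.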

##### Kendall's Tau, $\tau$

Kendall's tau was proposed by Maurice Kendall (1938) as a measure of association of two jointly distributed continuous random variables. It is defined as the difference between the probabilities of concordance and discordance of two random variables. A pair of observations is said to be concordant if the larger value of $X$ is associated with the larger value of $Y$. The pair is discordant if the larger value of $X$ is associated with the smaller value of $Y$. The population Kendall's $\tau$ is defined as

$$\tau = \underbrace{P[(X_1 - X_2)(Y_1 - Y_2) > 0]}_{P(\text{concordance})} - \underbrace{P[(X_1 - X_2)(Y_1 - Y_2) < 0]}_{P(\text{discordance})}, \tag{1.6}$$

where $(X_2, Y_2)$ is an independent replicate of $(X_1, Y_1)$. As a difference of two probabilities, $-1 \le \tau \le 1$, with a positive $\tau$ indicating positive association between the variables and a higher absolute value indicating stronger association.

For $(X, Y)$ following a bivariate normal distribution with correlation coefficient $\rho$, Kruskal (1958) presented the relationship between Pearson's correlation coefficient and Kendall's tau:

$$\tau = 4\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} H(x, y)\,dH(x, y) - 1 = \frac{2}{\pi}\arcsin(\rho). \tag{1.7}$$

The graphical illustration in Figure 1.2 suggests that $\tau$ is a nearly linear function of $\rho$.

Figure 1.2: Kendall's tau as a function of Pearson's correlation coefficient in the bivariate normal model.

To get the estimate of tau, let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be a random sample from the joint distribution of $(X, Y)$. The Kendall rank correlation statistic $K$ is calculated as

$$K = \sum_{i=1}^{n-1}\sum_{j=i+1}^{n} Q\big((X_i, Y_i), (X_j, Y_j)\big), \tag{1.8}$$

where

$$Q\big((a, b), (c, d)\big) = \begin{cases} 1, & \text{if } (d - b)(c - a) > 0, \\ -1, & \text{if } (d - b)(c - a) < 0. \end{cases} \tag{1.9}$$

As Kendall proposed, $K$ can be used to obtain a distribution-free test of $H_0$: $X$ and $Y$ are independent versus $H_1$: $\tau \neq 0$, where $\tau$ is defined as in (1.6). The estimate $\hat{\tau}$ is based on the statistic $K$ and is defined as

$$\hat{\tau} = \frac{K}{\binom{n}{2}} = \frac{2K}{n(n-1)}. \tag{1.10}$$
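The computation in (1.8) to (1.10) can be sketched directly as a sum over all unordered pairs. The names `q_kernel` and `kendall_tau` and the five data points below are hypothetical; returning 0 for a tied pair is an assumption of this sketch, since (1.9) only covers the untied case.

```python
from itertools import combinations

def q_kernel(p1, p2):
    """Q of (1.9): +1 for a concordant pair, -1 for a discordant pair.

    Returns 0 when the product is zero (a tie), which (1.9) does not
    cover; that choice is an assumption of this sketch.
    """
    (a, b), (c, d) = p1, p2
    s = (d - b) * (c - a)
    return (s > 0) - (s < 0)

def kendall_tau(pairs):
    """Return K of (1.8) and the estimate 2K / (n(n-1)) of (1.10)."""
    n = len(pairs)
    k = sum(q_kernel(p1, p2) for p1, p2 in combinations(pairs, 2))
    return k, 2 * k / (n * (n - 1))

data = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
k, tau_hat = kendall_tau(data)
print(k, tau_hat)  # 8 concordant and 2 discordant pairs: K = 6, tau-hat = 0.6
```

The double loop costs $O(n^2)$ comparisons, which is adequate for the sample sizes of 30 to 100 mentioned in the list of tables.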

It can be shown using standard U-statistic theory (see, e.g., Randles and Wolfe, 1979) that

$$E(\hat{\tau}) = \tau \tag{1.11}$$

and

$$\mathrm{Var}(\hat{\tau}) = \binom{n}{2}^{-1}\left[2(n-2)\,\zeta_1 + \zeta_2\right], \tag{1.12}$$

where $\zeta_1 = \mathrm{Cov}\big[Q((X_1, Y_1), (X_2, Y_2)),\, Q((X_1, Y_1), (X_3, Y_3))\big]$, $\zeta_1 > 0$, and $\zeta_2 = \mathrm{Var}\big[Q((X_1, Y_1), (X_2, Y_2))\big]$.

If there are ties among the observations $X_1, \ldots, X_n$ and/or separately among the observations $Y_1, \ldots, Y_n$, function (1.9) is replaced by

$$Q^*\big((a, b), (c, d)\big) = \begin{cases} 1, & \text{if } (d - b)(c - a) > 0, \\ 0, & \text{if } (d - b)(c - a) = 0, \\ -1, & \text{if } (d - b)(c - a) < 0, \end{cases} \tag{1.13}$$

and $K$ is now defined as

$$K = \sum_{i=1}^{n-1}\sum_{j=i+1}^{n} Q^*\big((X_i, Y_i), (X_j, Y_j)\big). \tag{1.14}$$

The estimate, $\hat{\tau}$, of the Kendall population coefficient $\tau$ in (1.10) is then redefined as

$$\hat{\tau} = \frac{2K}{\sqrt{(T_0 - T_x)(T_0 - T_y)}}, \tag{1.15}$$

where $T_0 = n(n-1)$, $T_x = \sum_l s_l(s_l - 1)$ and $T_y = \sum_m t_m(t_m - 1)$. Here, $l$ indexes the groups of tied observations in $X$ and $s_l$ is the size of the $l$th tied group in the $X$ observations; equivalently, $m$ indexes the groups of tied observations in $Y$ and $t_m$ is the size of the corresponding group. Consequently, the denominator of (1.15) is twice the geometric mean of the number of pairs untied on $X$ and the number of pairs untied on $Y$. It can easily be seen that (1.15) reduces to (1.10) if there are no tied observations.

##### Spearman's Rho, $\rho_s$

Another popular measure of association is Spearman's rho. Let $(X_1, Y_1)$, $(X_2, Y_2)$ and $(X_3, Y_3)$ be independent random vectors with the same distribution as $(X, Y)$. Then

$$\rho_s = 3P[(X_1 - X_2)(Y_1 - Y_3) > 0] - 3P[(X_1 - X_2)(Y_1 - Y_3) < 0]. \tag{1.16}$$

The coefficient $\rho_s$ is proportional to the difference between the probabilities of concordance and discordance of the random vectors $(X_1, Y_1)$ and $(X_2, Y_3)$, where $X_2$ and $Y_3$ are independent variables with the same marginal distributions as $X_1$ and $Y_1$, respectively.

For bivariate normal models with correlation coefficient $\rho$, Kruskal (1958) similarly showed that

$$\rho_s = 12\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} F(x)\,G(y)\,dH(x, y) - 3 = \frac{6}{\pi}\arcsin\left(\frac{\rho}{2}\right). \tag{1.17}$$

The rank-based estimator of this correlation parameter was introduced by Spearman in 1904 as

$$r_s = 1 - \frac{6\sum_{i=1}^{n} D_i^2}{n(n^2 - 1)}, \tag{1.18}$$

where $D_i$ is the difference between the ranks of $X_i$ and $Y_i$ in their separate rankings.
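As a small illustration of (1.18), the sketch below ranks each variable separately and applies the formula. It assumes there are no ties; the helper names `ranks` and `spearman_rs` and the data are hypothetical.

```python
def ranks(values):
    """Ranks 1..n of the values, assuming they are all distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rs(xs, ys):
    """r_s = 1 - 6 * sum(D_i^2) / (n (n^2 - 1)), equation (1.18)."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

x = [10.0, 20.0, 30.0, 40.0, 50.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
# Rank differences are -1, 1, -1, 1, 0, so sum(D_i^2) = 4 and
# r_s = 1 - 24/120 = 0.8.
print(spearman_rs(x, y))
```

Only the ranks enter the calculation, so any strictly increasing transformation of `x` or `y` leaves the result unchanged.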

Figure 1.3: Spearman's rho as a function of Pearson's correlation coefficient in the bivariate normal model.

In the presence of ties among the $n$ $X$ observations and/or separately among the $n$ $Y$ observations, the estimate in (1.18) can be redefined as

$$r_s = \frac{W_0 - 6\sum_{i=1}^{n} D_i^2 - \frac{1}{2}(T_x + T_y)}{\sqrt{(W_0 - T_x)(W_0 - T_y)}}, \tag{1.19}$$

where $W_0 = n(n^2 - 1)$, $T_x = \sum_l s_l(s_l^2 - 1)$ and $T_y = \sum_m t_m(t_m^2 - 1)$. As for Kendall's $\tau$, $l$ indexes the groups of tied observations in $X$ and $s_l$ is the size of the $l$th tied group in the $X$ observations; equivalently, $m$ indexes the groups of tied observations in $Y$ and $t_m$ is the size of the corresponding group.
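The tie-corrected estimate can be sketched as follows, reading (1.19) as $\{W_0 - 6\sum D_i^2 - \tfrac{1}{2}(T_x + T_y)\}/\sqrt{(W_0 - T_x)(W_0 - T_y)}$ with $D_i$ computed from average ranks. That reading, the use of average ranks, the function names and the zero-heavy toy data are assumptions of this sketch rather than details confirmed by the text.

```python
import math
from collections import Counter

def midranks(values):
    """Average ranks: tied observations share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1   # mean of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def tie_term(values):
    """Sum over tied groups of s(s^2 - 1); untied values contribute 0."""
    return sum(s * (s * s - 1) for s in Counter(values).values())

def spearman_ties(xs, ys):
    """Tie-corrected r_s in the form assumed for (1.19)."""
    n = len(xs)
    w0 = n * (n * n - 1)
    tx, ty = tie_term(xs), tie_term(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(midranks(xs), midranks(ys)))
    return (w0 - 6 * d2 - (tx + ty) / 2) / math.sqrt((w0 - tx) * (w0 - ty))

# Toy zero-inflated sample: three zeros in each variable.
x = [0.0, 0.0, 0.0, 1.5, 2.0, 3.5]
y = [0.0, 0.0, 2.0, 0.0, 1.0, 4.0]
print(spearman_ties(x, y))  # 102/186 = 17/31, about 0.548
```

For this sample the result coincides with the Pearson correlation of the two average-rank vectors, which is a convenient cross-check on the formula.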

##### Discrete Case

In the discrete case, ties can be viewed as a combination of three different scenarios (see, e.g., Liebetrau, 1983). Given $(X_1, Y_1)$ and $(X_2, Y_2)$, they can be tied only on $X$, i.e., $(X_1 = X_2, Y_1 \neq Y_2)$, with probability $\pi_t^X$; or tied only on $Y$, i.e., $(X_1 \neq X_2, Y_1 = Y_2)$, with probability $\pi_t^Y$; or tied on both $X$ and $Y$, i.e., $(X_1 = X_2, Y_1 = Y_2)$, with probability $\pi_t^{XY}$.

The range of $\tau$ depends on the probability of ties; therefore, (1.6) is not suitable for discrete data. In this case, multinomial sampling is more appropriate. If $p_{ij}$ is defined as $P(X = x_i, Y = y_j)$, then Kendall's tau, denoted by $\tau_b$ for the discrete case, can be defined as

$$\tau_b = \frac{\pi_c - \pi_d}{\left[\left(1 - \sum_i p_{i+}^2\right)\left(1 - \sum_j p_{+j}^2\right)\right]^{1/2}} \tag{1.20}$$

under the multinomial sampling model, where $\pi_c$ is the probability that two randomly selected members of the population are concordant and $\pi_d$ is the probability that they are discordant. Also, $1 - \sum_i p_{i+}^2 = 1 - \pi_t^X - \pi_t^{XY}$ is the probability that the observations are not tied on $X$ and, equivalently, $1 - \sum_j p_{+j}^2 = 1 - \pi_t^Y - \pi_t^{XY}$ is the probability that the observations are not tied on $Y$.

Given that the discrete variables $X$ and $Y$ are jointly sampled, (1.20) can be estimated by the formula

$$\hat{\tau}_b = \frac{2(C - D)}{\left[\left(n^2 - \sum_i n_{i+}^2\right)\left(n^2 - \sum_j n_{+j}^2\right)\right]^{1/2}}, \tag{1.21}$$

where the $n_{ij}$'s are the observed frequencies, $C$ is the number of concordant pairs and $D$ is the number of discordant pairs.
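A direct sketch of the sample formula (1.21), taken here as $2(C - D)$ divided by $[(n^2 - \sum_i n_{i+}^2)(n^2 - \sum_j n_{+j}^2)]^{1/2}$, is given below. The function name `tau_b`, the ordinal coding and the six observations are hypothetical and serve only to make the counts easy to verify by hand.

```python
import math
from collections import Counter
from itertools import combinations

def tau_b(pairs):
    """Sample tau_b in the form assumed for (1.21)."""
    n = len(pairs)
    c = d = 0
    for (x1, y1), (x2, y2) in combinations(pairs, 2):
        s = (x2 - x1) * (y2 - y1)
        if s > 0:
            c += 1   # concordant pair
        elif s < 0:
            d += 1   # discordant pair; tied pairs count toward neither
    row = Counter(x for x, _ in pairs)   # marginal counts n_{i+}
    col = Counter(y for _, y in pairs)   # marginal counts n_{+j}
    den = math.sqrt((n * n - sum(v * v for v in row.values()))
                    * (n * n - sum(v * v for v in col.values())))
    return 2 * (c - d) / den

# Ordinal codes (1 = low, 2 = medium, 3 = high) for six made-up subjects.
data = [(1, 1), (1, 2), (2, 2), (2, 3), (3, 3), (3, 1)]
# C = 6, D = 3, and each margin has counts (2, 2, 2), so the
# denominator is sqrt(24 * 24) = 24 and tau_b = 6/24 = 0.25.
print(tau_b(data))
```

On the same data, formula (1.15) gives the same value, since $T_0 - T_x = n^2 - \sum_i n_{i+}^2$ when the tied groups are the categories.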

Similarly, Spearman's rho can be estimated by the formula

$$\hat{\rho}_s = \frac{12\,\cdots}{\left[\left(n^3 - \sum_i n_{i+}^3\right)\left(n^3 - \sum_j n_{+j}^3\right)\right]^{1/2}}, \tag{1.22}$$

where $R(i) = \sum_k n_{k+} + \cdots$

symmetric contingency tables larger than a 2 x 2; and Cramer's V for asymmetrical tables.

Given a contingency table for variables measured in ordinal categories such as low/medium/high, with a large number of tied ranks, the gamma coefficient, $G$, is used as the appropriate measure of association, defined as

$$G = \frac{C - D}{C + D}. \tag{1.25}$$

The population version of gamma is

$$\gamma = \frac{\pi_c - \pi_d}{\pi_c + \pi_d}. \tag{1.26}$$

#### 1.4.3 Chi-plot

In addition to scatter plots of raw data and ranks, association between random variables will be graphically illustrated using chi-plots. These were originally proposed by Fisher and Switzer (1985) and later expanded in Fisher and Switzer (2001), where they showed how a single chi-plot can highlight different forms of dependence. To generate this plot, given a random sample of $n$ pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$ from a bivariate distribution, one should determine the following quantities:

$$\begin{aligned} H_i &= \frac{1}{n-1}\,\#\{j \neq i : X_j \le X_i,\ Y_j \le Y_i\}, \\ F_i &= \frac{1}{n-1}\,\#\{j \neq i : X_j \le X_i\}, \\ G_i &= \frac{1}{n-1}\,\#\{j \neq i : Y_j \le Y_i\}, \text{ and} \\ S_i &= \mathrm{sign}\{(F_i - 0.5)(G_i - 0.5)\}. \end{aligned} \tag{1.27}$$

It can be seen that these quantities depend entirely on the ranks of the observations. Fisher and Switzer proposed that the chi-plot be a scatter plot of the pairs $(\lambda_i, \chi_i)$, where $\lambda_i$ is the distance between the observation $(x_i, y_i)$ and the center of the dataset and $\chi_i$ is a function of the signed square root of the traditional chi-square test statistic for independence in a two-way table. These are defined as

$$\chi_i = \frac{H_i - F_i G_i}{\{F_i(1 - F_i)\,G_i(1 - G_i)\}^{1/2}}, \tag{1.28}$$

$$\lambda_i = 4 S_i \max\left\{(F_i - 0.5)^2,\ (G_i - 0.5)^2\right\}, \tag{1.29}$$

where $\chi_i \in [-1, 1]$.

To help with the interpretation of the chi-plot, Fisher and Switzer recommended that a pair of horizontal lines be displayed at $\pm c_p/\sqrt{n}$, where $c_p$ is selected such that approximately $(100 \times p)\%$ of the pairs $(\lambda_i, \chi_i)$ lie between these lines. They reported $c_p$ values of 1.54, 1.78 and 2.18, corresponding to $p$ = 0.90, 0.95 and 0.99, respectively, obtained through simulations.

Figures 1.4 and 1.5 illustrate the expected behavior of the chi-plot for two independent random variables and in the presence of increasing monotone association. Data were randomly generated from a bivariate standard normal distribution with $n = 100$ and correlation $\rho$ = 0.0, 0.20, 0.50, 0.95. The left portion of each figure shows the scatter plot for each case, while the corresponding graph on its right is the chi-plot. The horizontal lines represent the 95% control limits: 95% of the $\chi_i$ values should fall within these lines if there is no association between the variables. The points depart from this band as the association becomes more prominent. In Figure 1.4(b), the majority of the points are within the 95% band, which indicates the lack of association between the variables, as depicted in the corresponding scatter plot in Figure 1.4(a). As the correlation coefficient increases, the points depart from the band, leading to a picture similar to the one shown in Figure 1.5(h). In this figure, there is evidence of monotone dependence between the two variables.
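The chi-plot coordinates can be computed with a short routine. The sketch below follows the definitions of $H_i$, $F_i$, $G_i$, $S_i$, $\chi_i$ and $\lambda_i$ as read from (1.27) to (1.29); the function names, the toy data and the decision to skip points whose $F_i$ or $G_i$ equals 0 or 1 (where the denominator of $\chi_i$ vanishes) are assumptions of this sketch.

```python
import math

def chi_plot_points(xs, ys):
    """Return (lambda_i, chi_i) pairs following (1.27)-(1.29).

    Points with F_i or G_i equal to 0 or 1 are skipped because the
    denominator of chi_i is zero there (an assumption of this sketch).
    """
    n = len(xs)
    points = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        h = sum(xs[j] <= xs[i] and ys[j] <= ys[i] for j in others) / (n - 1)
        f = sum(xs[j] <= xs[i] for j in others) / (n - 1)
        g = sum(ys[j] <= ys[i] for j in others) / (n - 1)
        denom = f * (1 - f) * g * (1 - g)
        if denom <= 0:
            continue
        prod = (f - 0.5) * (g - 0.5)
        s = (prod > 0) - (prod < 0)                          # S_i
        chi = (h - f * g) / math.sqrt(denom)                 # (1.28)
        lam = 4 * s * max((f - 0.5) ** 2, (g - 0.5) ** 2)    # (1.29)
        points.append((lam, chi))
    return points

def control_limit(n, c_p=1.78):
    """Half-width c_p / sqrt(n) of the band; 1.78 corresponds to p = 0.95."""
    return c_p / math.sqrt(n)

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.5, 0.5, 3.5, 2.5, 5.5, 4.5]
pts = chi_plot_points(x, y)
print(pts, control_limit(len(x)))
```

Plotting `pts` together with horizontal lines at plus and minus `control_limit(n)` gives the display described above; with only six observations the band is far too wide to be informative, so the data here serve only to exercise the formulas.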

Figure 1.5: Additional sample chi-plot, panels (e) to (h). Left column shows the scatter plots and the right column their corresponding chi-plots, for simulated samples of size 100 from the bivariate normal distribution with correlation coefficients 0.50 and 0.95, respectively.

## Chapter 2: Literature Review

Several studies have been published regarding location parameters for single, paired or independent samples having a mass at zero; the earliest was Aitchison (1955). Examples were provided to illustrate the problem at hand, one of which is the analysis of household expenditure on a certain commodity. Some households may not use or buy the product, which results in a zero observation. The presence of these cases skews the distribution, which can then be approximated by a lognormal curve. Aitchison proposed efficient estimates of the mean and variance. He further applied his results to several distributions and then used real data as examples.

The concepts presented by Aitchison were used by Pennington (1983) in finding efficient estimators of abundance for fish and plankton surveys. He pointed out that inflation at zero can also be observed in marine surveys, where it is brought about by areas that are not occupied by, or are unsuitable for, some species. He applied Aitchison's estimators to an ichthyoplankton survey and concluded that the mean estimator based on the delta distribution is efficient, owing to the large variability of the log of the nonzero values. He also extended Aitchison's work and presented an estimate of the variance of the estimator of the mean.

Owen and DeRouen (1980) also studied mean estimation with zero-inflated data. In addition to zero observations, they considered left-censoring and a combination of both, and used the mean square error approach. They reported that the minimum variance unbiased estimator of the zero-inflated mean has lower MSE than the MLE with just the nonzero censored data.

Several other papers have dealt with zero-inflated data. One of the main motivations for this research was the study by Moulton and Halsey (1995). They presented measles vaccine data from an immunogenicity study on sera collected from children 12 months of age. The zero values in these data arise from values falling below a limit of detection. A mixture model approach using a lognormal distribution for the nonzero values was used.

An interesting point that further illustrates when zero-inflated data can occur was made by Taylor et al. (2001). In their paper, they presented a study of exposure measurements falling below a fixed limit of detection. In this type of data, at least 20% of the data are expected to fall below the set limit of detection, which gives rise to the zero-inflation problem. However, they pointed out that it is false to assume that all zero values are due to the observed value being below a limit of detection. Some are real zeroes, observed from personnel assigned to work in a controlled environment for a certain period of time.

Bascoul-Mollevi et al. (2005) presented several two-part statistics that can be used to analyze paired data from a mixed distribution. These statistics are the sum of a test of proportions (for the count of zero values) and a parametric or non-parametric statistic comparing the means from two paired samples.

Full document contains 91 pages

Abstract:
Zero-inflated continuous distributions have positive probability mass at zero in addition to a continuous distribution. Such data can be encountered, for example, in medical, environmental and financial research. The main focus of this research is to study the association of nonnegative random variables, both having a positive probability mass at zero. New estimators of the classical measures of association, Kendall's tau and Spearman's rho, appropriate for zero-inflated distributions, are proposed and their asymptotic distributions are derived. Performance of the estimators is assessed by a Monte Carlo simulation study. The new ideas are illustrated by a real data example.