# A comparison of kernel equating and traditional equipercentile equating methods and the parametric bootstrap methods for estimating standard errors in equipercentile equating

TABLE OF CONTENTS

- LIST OF FIGURES
- CHAPTER 1 INTRODUCTION
  - Equating Functions and Data Collection Designs
  - Discreteness and Irregularity Problems in Equipercentile Equating
  - Kernel Equating
  - Equating Error
  - Research Questions
- CHAPTER 2 LITERATURE REVIEW
  - Research on Smoothing Techniques in Equating
  - Research on Kernel Equating (KE)
  - Research on Derivations of Analytic Equating Errors
  - Research on the Bootstrap Method
  - Bias and Variance Trade-off
- CHAPTER 3 METHOD
  - Equating Under the EG Design
  - Equating Under the NEAT Design
  - Parametric Bootstrap Method
- CHAPTER 4 RESULTS
  - Comparisons Among Equating Methods Under the EG Design
  - Comparisons Among Equating Methods Under the NEAT Design
  - Comparisons Among Estimation Methods of Standard Errors in Equating
- CHAPTER 5 DISCUSSION
- REFERENCES
- APPENDIX A BIAS, STANDARD ERROR OF EQUATING (SEE), AND ROOT MEAN SQUARED DEVIATION (RMSD) FOR EQUATING UNDER THE EG DESIGN
- APPENDIX B BIAS, STANDARD ERROR OF EQUATING (SEE), AND ROOT MEAN SQUARED DEVIATION (RMSD) FOR EQUATING UNDER THE NEAT DESIGN
- APPENDIX C BIAS, STANDARD ERROR OF STANDARD ERRORS OF EQUATING (SEE), AND ROOT MEAN SQUARED DEVIATION (RMSD) IN ESTIMATING STANDARD ERRORS OF EQUATING
- AUTHOR'S BIOGRAPHY

LIST OF FIGURES

In all figures, n is the test length and N is the sample size.

- Figure 1: Population distributions for tests with n = 25, 50, and 75 under the EG design
- Figure 2: Population distributions for tests with n = 60, 80, and 125 under the NEAT design
- Figures A1 to A18: Bias, SEE, and RMSD for equating under the EG design
  - A1 to A6: n = 25, with N = 300, 500, 700, 1000, 3000, and 5000, respectively
  - A7 to A12: n = 50, with N = 300, 500, 700, 1000, 3000, and 5000, respectively
  - A13 to A18: n = 75, with N = 300, 500, 700, 1000, 3000, and 5000, respectively
- Figures B1 to B18: Bias, SEE, and RMSD for equating under the NEAT design
  - B1 to B6: n = 60, with N = 300, 500, 700, 1000, 3000, and 5000, respectively
  - B7 to B12: n = 80, with N = 300, 500, 700, 1000, 3000, and 5000, respectively
  - B13 to B18: n = 125, with N = 300, 500, 700, 1000, 3000, and 5000, respectively
- Figures C1 to C18: Bias, SEE, and RMSD in estimating standard errors of equating
  - C1 to C6: n = 25, with N = 300, 500, 700, 1000, 3000, and 5000, respectively
  - C7 to C12: n = 50, with N = 300, 500, 700, 1000, 3000, and 5000, respectively
  - C13 to C18: n = 75, with N = 300, 500, 700, 1000, 3000, and 5000, respectively

## CHAPTER 1 INTRODUCTION

To administer a test at different times and places, testing programs frequently must construct multiple forms of a test. In such situations, it is important that all of the test forms measure the same ability or trait, and all forms should be developed according to the same content and statistical specifications. However, even when several forms are constructed carefully, differences in difficulty among the forms might be large enough that scores from the forms are not interchangeable without adjustment. Equating is a statistical process intended to adjust for differences in difficulty among test forms built to be similar in difficulty and content, so that scores on different forms can be used interchangeably.

### Equating Functions and Data Collection Designs

Various equating procedures have been developed for this purpose. The most familiar equating function is the linear equating function. In linear equating, scores that are an equal distance from their means in standard deviation units are set equal. Define $\mu_X$ and $\mu_Y$ as the means, and $\sigma_X$ and $\sigma_Y$ as the standard deviations, of new Form X and old Form Y scores, respectively. The linear function is defined by setting standardized deviation scores on the two test forms to be equal, such that

$$\mathrm{Lin}_Y(x) = \mu_Y + \frac{\sigma_Y}{\sigma_X}\,(x - \mu_X). \qquad (1.1)$$
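As a minimal sketch of the linear equating rule described above (scores with equal standardized deviations are set equal), the following Python function maps Form X scores to the Form Y scale. The function name `linear_equate` and the two small score arrays are hypothetical illustrations, not material from the study.

```python
import numpy as np

def linear_equate(x, scores_x, scores_y):
    """Map Form X score(s) x to the Form Y scale by matching z-scores."""
    mu_x, mu_y = np.mean(scores_x), np.mean(scores_y)
    # Population-style standard deviations (ddof=0); only their ratio matters here.
    sd_x, sd_y = np.std(scores_x), np.std(scores_y)
    return mu_y + (sd_y / sd_x) * (np.asarray(x, dtype=float) - mu_x)

# Hypothetical observed scores on the two forms.
scores_x = np.array([10, 12, 14, 16, 18])
scores_y = np.array([11, 14, 17, 20, 23])

equated_16 = linear_equate(16, scores_x, scores_y)
print(equated_16)
```

In this toy data set Form Y has a mean three points higher and a standard deviation 1.5 times larger than Form X, so a Form X score one standard deviation above its mean is mapped to the Form Y score one standard deviation above the Form Y mean.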

However, in practice the most frequently used method in observed score equating is the equipercentile equating function. The equipercentile equating function, developed by Braun and Holland (1982), is constructed by identifying scores on Form X that have the same percentile ranks as scores on Form Y, so that the distribution of Form X scores converted to the Form Y scale is equal to the distribution of Form Y scores in the population. To be more precise, assuming both X and Y are continuous random variables, consider the following definitions developed by Braun and Holland (1982) and adapted by Kolen and Brennan (2004):

- X is a random variable representing a score on Form X.
- Y is a random variable representing a score on Form Y.
- F(x) is the cumulative distribution function (cdf) of X in the population.
- G(y) is the cumulative distribution function (cdf) of Y in the population.

According to Braun and Holland (1982), the following function is an equating function that converts scores on Form X to the Form Y scale:

$$e_Y(x) = G^{-1}[F(x)], \qquad (1.2)$$

where $G^{-1}$ is the inverse of the cumulative distribution function G.

Every test equating includes an explicit data collection design to separate the effects of examinee ability from the differences in difficulty between the two test forms. There are two ways to control for examinee ability in test equating. The first is the use of "common examinees", in which examinees take both tests. In the terminology adopted by Kolen and Brennan (2004), the "Single-Group Design", the "Single-Group Design with Counterbalancing", and the "Random-Groups Design" use this approach. The other approach is to use "common items" rather than common examinees. Kolen and Brennan (2004) named this design the "Common-Item Nonequivalent Groups Design", which is identical to the "Non-Equivalent groups with Anchor Test (NEAT) Design" of von Davier, Holland, and Thayer (2004). Among these data collection designs, only the NEAT design and the "Random-Groups Design", which von Davier et al. (2004) call the "Equivalent-Groups (EG) Design", were considered in the current study.

In the EG design, two independent random samples are drawn from a common population of examinees (P); test X is administered to one sample and test Y to the other. Since each examinee takes only one form of the test in this design, testing time is minimized. The design requires that all the test forms be available and administered at the same time, which might be difficult in some situations, especially when test form security matters (Kolen & Brennan, 2004). The EG design usually demands the largest sample sizes to achieve a given level of precision as measured by the standard error of equating (von Davier et al., 2004).

In the NEAT design, there are two different populations of examinees (P and Q) and a sample of test takers from each population. The sample from population P takes test X, the sample from population Q takes test Y, and both samples take a set of common items, the anchor test (A). The common-item set should be a mini-version of the total test form in terms of content and difficulty, and the items should behave similarly in both tests. It is usually advised that the anchor test be administered in the same order to both samples of examinees so that scores on the anchor test and on the other tests are affected in the same way. When the score on the set of common items contributes to the examinee's score on the test form, the set of common items is referred to as "internal".

When the score on the set of common items does not contribute to the total test score, the set is referred to as "external". In the NEAT design, the group of examinees taking Form X is not considered equivalent to the group of examinees taking Form Y. Thus, differences between score distributions on Form X and Form Y can result from a combination of examinee group differences and test form differences. The major task in equating with the NEAT design is to separate group differences from test form differences. Also, because the two populations differ, the target population (T) for this design must be explicitly considered. The target population for the NEAT design can be conceived of as a larger population that contains P and Q as two mutually exclusive and exhaustive strata. It can be denoted as the mixture of P and Q,

$$T = wP + (1 - w)Q, \qquad (1.3)$$

where $0 \le w \le 1$.

### Discreteness and Irregularity Problems in Equipercentile Equating

In defining the equating function (1.2), X and Y are assumed to be continuous random variables. If F in (1.2) is a continuous and strictly increasing cdf, then F(X) has the uniform distribution over (0, 1). Also, if G in (1.2) has a proper inverse function $G^{-1}(u)$ for u in (0, 1), then $G^{-1}[F(X)]$ has the distribution specified by G whenever F(X) has the uniform distribution on (0, 1). Thus, the composed transformation $G^{-1}[F(X)]$ has exactly the same distribution as Y over the target population.

However, almost all score distributions are discrete, and therefore their cdfs are not continuous and strictly increasing. Instead, they are step functions with jumps at the possible values of the discrete distribution. Thus, in order to use equation (1.2), every method of equipercentile equating must find some way to circumvent the discreteness of test score distributions. For example, in a widely used equipercentile equating method that von Davier et al. (2004) call the "percentile rank method" (PRM), the discrete cdfs F(x) and G(y) are replaced by piecewise linear continuous cdfs. Given discrete integer-valued random variables X and Y and a random variable U that is uniformly distributed on (-1/2, 1/2) independently of the scores, new random variables $X^* = X + U$ and $Y^* = Y + U$ are defined. These new random variables are continuous, and their cdfs $F^*$ and $G^*$ are strictly increasing on the score interval. Using these continuized cdfs, a new version of the equipercentile equating function, $G^{*-1}(F^*(x))$, can be defined and used.

However, the effects of this arbitrary piecewise linear interpolation on equating results have not yet been thoroughly investigated. For example, we know that $V(X^*) = V(X) + 1/12$, so the second moments of X and X* differ. A more elegant method of continuization might be needed.
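The percentile rank method described above can be sketched in Python as follows. This is an illustrative sketch, not the study's implementation: the function names and the two score-probability vectors are hypothetical, scores are assumed to be the integers 0, 1, ..., n, and every score is assumed to have positive probability so that the continuized cdfs are strictly increasing and can be inverted by linear interpolation. The last lines use a seeded simulation to illustrate that adding uniform noise inflates the variance by about 1/12.

```python
import numpy as np

def continuized_cdf(probs):
    """Knots and values of the piecewise-linear cdf of X* = X + U, U ~ Uniform(-1/2, 1/2)."""
    probs = np.asarray(probs, dtype=float)
    knots = np.arange(len(probs) + 1) - 0.5             # -0.5, 0.5, ..., n + 0.5
    values = np.concatenate(([0.0], np.cumsum(probs)))  # cdf at each knot
    return knots, values

def prm_equate(x, probs_x, probs_y):
    """Equipercentile equivalent G*^{-1}(F*(x)) based on the continuized cdfs."""
    knots_x, cdf_x = continuized_cdf(probs_x)
    knots_y, cdf_y = continuized_cdf(probs_y)
    u = np.interp(x, knots_x, cdf_x)       # F*(x)
    return np.interp(u, cdf_y, knots_y)    # G*^{-1}(u); needs cdf_y strictly increasing

# Hypothetical score probabilities for scores 0..4 on each form.
probs_x = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
probs_y = np.array([0.2, 0.3, 0.3, 0.1, 0.1])
scores = np.arange(5)
equated = prm_equate(scores, probs_x, probs_y)
print(equated)

# Simulation check of V(X*) = V(X) + 1/12 (seeded for reproducibility).
rng = np.random.default_rng(0)
x_draws = rng.choice(scores, size=200_000, p=probs_x)
x_star = x_draws + rng.uniform(-0.5, 0.5, size=x_draws.size)
variance_gap = np.var(x_star) - np.var(x_draws)
print(variance_gap, 1 / 12)
```

Because Form Y in this toy example is harder (its probability mass sits at lower scores), most Form X scores are mapped to lower Form Y equivalents.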

Another problem with conventional equipercentile equating is related to using sample statistics in place of the population parameters in (1.2). Kolen and Brennan (2004) indicated that for linear equating, the use of sample means and standard deviations instead of parameters usually yields adequate equating precision even with fairly small sample sizes. However, when sample percentiles and percentile ranks are used to estimate equipercentile relationships, equating often is not sufficiently precise for practical purposes even with large sample sizes. One indication is that score distributions and equipercentile equivalents appear quite irregular when they are graphed. If very large samples or the entire population were available, score distributions and equipercentile relationships would be smooth. Because score points with little or zero frequency are not uncommon in equating, some procedure must be found to cope with this problem.

Smoothing techniques have been used to produce smooth estimates of the empirical distributions and of the equipercentile relationship. Kolen and Brennan (2004) discussed two types of smoothing methods in equating: pre-smoothing and post-smoothing. In pre-smoothing, the score distributions are smoothed, whereas in post-smoothing the equipercentile equivalents are smoothed. One pre-smoothing method uses polynomial log-linear models with polynomial contrasts to smooth score distributions (Holland and Thayer, 1987, 2000). Another classical pre-smoothing method uses a strong true score model in which a distributional form for the true score and a conditional distribution given the true score are specified (Lord, 1965). For both methods, after the distributions are smoothed, Form X is equated to Form Y using the smoothed distributions.
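To make polynomial log-linear pre-smoothing concrete, the sketch below fits a degree-3 log-linear model to a hypothetical frequency distribution by Poisson maximum likelihood, using iteratively reweighted least squares written directly in NumPy. The function name `fit_loglinear`, the simulated frequencies, and the choice of degree 3 are all assumptions made for illustration. The scores are standardized inside the fit purely for numerical stability; because the model includes an intercept and all powers up to the chosen degree, the fitted distribution reproduces the first three raw moments of the observed scores.

```python
import math
import numpy as np

def fit_loglinear(freq, degree, max_iter=100, tol=1e-10):
    """Fit log m_j = b0 + b1*z_j + ... + bC*z_j^C by Poisson ML (IRLS).

    z is the standardized score; returns the smoothed (fitted) frequencies.
    """
    freq = np.asarray(freq, dtype=float)
    x = np.arange(len(freq), dtype=float)
    z = (x - x.mean()) / x.std()
    design = np.vander(z, degree + 1, increasing=True)   # columns z^0 .. z^C
    m = freq + 0.5                                        # starting fitted values
    beta = np.zeros(degree + 1)
    for _ in range(max_iter):
        working = np.log(m) + (freq - m) / m              # working response
        weighted = design * m[:, None]                    # weights are fitted means
        new_beta = np.linalg.solve(design.T @ weighted, weighted.T @ working)
        m = np.exp(design @ new_beta)
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return m

# Hypothetical data: 500 examinees on a 20-item test (scores 0..20).
n_items = 20
true_probs = np.array([math.comb(n_items, k) * 0.6**k * 0.4**(n_items - k)
                       for k in range(n_items + 1)])
rng = np.random.default_rng(0)
freq = rng.multinomial(500, true_probs)

fitted = fit_loglinear(freq, degree=3)
x = np.arange(n_items + 1)
for k in range(4):
    # Observed and fitted power sums agree up to the degree of the model.
    print(k, (freq * x**k).sum(), (fitted * x**k).sum())
```

The fitted frequencies are strictly positive and smooth even at score points where the observed frequency is zero, which is the practical reason for pre-smoothing sparse score distributions.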

In the post-smoothing method, the equipercentile equivalents obtained from equating are smoothed directly. In implementing this method, care must be taken that the smoothed equivalents do not depart too much from the unsmoothed ones. In equating, cubic smoothing splines have usually been used to smooth the equipercentile equivalents (Kolen, 1984). As will be discussed in the next chapter, these methods have been studied empirically through simulation and have been shown to improve estimation of test score distributions (Kolen, 1991). An annoying problem with smoothing methods in equating is that when there are very few score points, or when the score distribution is locally very sparse, the equating relationship can still appear irregular even after smoothing.

In equating, score discreteness and irregularity due to data sparseness have been dealt with separately. When a conventional equipercentile equating method such as PRM is used, only the discreteness problem is considered at the method level. When parametric smoothing methods are used, score discreteness is ignored. Until recently, no attempt had been made to tackle these two problems together in a systematic way. Kernel Equating, initiated by Holland and Thayer (1981) and fully developed by von Davier et al. (2004), is the first attempt to address the discreteness and irregularity problems in an integrated way. The following section briefly presents the major steps in Kernel Equating; they are discussed in detail in the next chapter.

### Kernel Equating

Kernel Equating (KE) is a unified approach to test equating. Basically, it is an equipercentile equating method that contains the linear equating function as a special case. The name "Kernel Equating" arises from its use of the well-known nonparametric density estimation technique based on a Gaussian kernel. The major difference between KE and traditional equating methods is the extensive use of polynomial log-linear pre-smoothing and of continuization of discrete scores with a Gaussian kernel, for all equipercentile equating methods and data collection designs. The developers view KE as having five distinct steps:

1. pre-smoothing;
2. estimation of the score probabilities on the target population;
3. continuization;
4. computing and diagnosing the equating function;
5. computing the standard error of equating and related accuracy measures.

Unlike traditional equating methods, KE brings these steps together into an organized whole rather than treating them as separate problems. Since book-length, comprehensive details of KE procedures are available in von Davier et al. (2004), and since comparisons between traditional equipercentile equating methods and KE in the EG and NEAT designs are the focus of the current study, the five steps of KE in the EG and NEAT designs are discussed only briefly below.

Step 1: Presmoothing. In this step, estimates of the relevant univariate or bivariate score probabilities are obtained by fitting appropriate log-linear models. In the univariate case, as in the EG design, the polynomial log-linear method fits a model of the following form to the distribution of test scores:

$$\log[p(x)] = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_C x^C, \qquad (1.4)$$

where p(x) is a sample density and the $\beta$'s are parameters of the model that can be estimated by the method of maximum likelihood. The resulting fitted distribution is known to have the moment matching property: the first C moments of the fitted distribution are the same as those of the sample distribution (Holland and Thayer, 2000). This moment preservation property plays an important guiding role in deciding on the proper degree of smoothness, where the goal is a fitted distribution that is close to the sample distribution with as few parameters in the model as possible.

In choosing C, because this method uses log-linear models, which form an exponential family of discrete distributions, the standard statistical methods for assessing the fit of such models can also be used. For example, likelihood ratio chi-square goodness-of-fit statistics can be calculated for each C and tested for significance. In addition, since the models are hierarchical, likelihood ratio difference chi-squares can be tested for significance. For example, the difference between the overall likelihood ratio chi-square statistics for C = 3 and C = 4 can be compared to a chi-square table with one degree of freedom. A significant difference would suggest that the model with more terms fits the data better than the model with fewer terms.

In the bivariate case, as in the NEAT design, there are two populations of test takers, P and Q, and a sample of examinees from each. The sample from P takes test X, the sample from Q takes test Y, and both samples take a set of common items, the anchor test A. Thus, under the NEAT design there are two bivariate score distributions, (X, A) and (Y, A). Each bivariate distribution of test scores can be represented by a doubly indexed set of frequencies, $f_{ij}$, whose value is the number of cases in the sample where the row score is $x_i$ and the column score is $a_j$. The various power moments of this bivariate distribution can be expressed as linear combinations of the frequencies, e.g.,

$$\sum_{i,j} x_i^s a_j^t \,(f_{ij}/N). \qquad (1.5)$$

When t = 0, this is the s-th moment of the distribution of the row scores, and when s = 0, it is the t-th moment of the distribution of the column scores. When s and t are both positive, these are the cross moments of the joint distribution of the row and column scores; e.g., s = t = 1 gives the cross moment related to the covariance and correlation between the two scores. Thus, log-linear models for the cell probabilities can be specified in a manner similar to (1.4) for the univariate case, for example,

$$\log(p_{ij}) = \beta_0 + \beta_{x1} x_i + \beta_{x2} x_i^2 + \beta_{a1} a_j + \beta_{a2} a_j^2 + \beta_{xa11}\, x_i a_j. \qquad (1.6)$$

As shown above, in the bivariate case three classes of parameters (moments) arise: those associated with the row score $x_i$ (i.e., $\beta_{x1}$ and $\beta_{x2}$), those associated with the column score $a_j$ (i.e., $\beta_{a1}$ and $\beta_{a2}$), and those associated with the product $x_i a_j$ (i.e., $\beta_{xa11}$). Because unusual features of the marginal distributions often propagate into the cells of the bivariate distribution, it is recommended that all three types of parameters be considered when examining the fit of a model. In practice, powers as high as five or six are recommended to adequately fit the univariate margins of bivariate distributions. However, it is known that for many problems the joint distribution can be adequately represented by models that include the cross moments of the form (1.5) with s, t < 2 (Holland and Thayer, 2000).

Selecting an appropriate model is more challenging in the bivariate case. For bivariate log-linear modeling in the NEAT design, the best model for each univariate distribution is selected first. Then, chosen cross-product moments are added to the model. Holland and Thayer (2000) recommend checking the first three moments of the observed and fitted conditional distributions. They also describe several tools for assessing the fit of log-linear models for score distributions, including the likelihood ratio chi-square statistic, the Pearson chi-square statistic, and the Freeman-Tukey chi-square statistic. For a more detailed description of univariate and bivariate log-linear modeling for discrete test score distributions, see Holland and Thayer (2000) and von Davier et al. (2004).

Step 2: Estimation of the Score Probabilities. In this step, the score probabilities on the target population (T) are obtained from the score distributions estimated in Step 1. Here, a crucial role is played by the Design Function (DF) that characterizes each data collection design. The DF is a linear or nonlinear transformation of the estimated score distributions from Step 1 into the estimated score probabilities for X and Y on the target population. In the EG design, the estimated score probabilities are obtained directly on the target population, and there is no need to further transform the smoothed estimates obtained in the first step.

Under the NEAT design, KE considers two different equating methods: Chain Equating (CE) and Post-Stratification Equating (PSE). CE uses a two-stage transformation of X scores into Y scores. First, it equates X to A on population P, and then it equates A to Y on population Q. For CE to qualify as an observed score equating method, a target population T of the form (1.3) must be identified, together with the assumptions needed to determine the score distributions of X and Y on T. Von Davier et al. (2004) showed that the target population T is irrelevant for CE: any T of the form (1.3) results in the same equating function. They also showed that if, given any target population T, the two links (from X to A and from A to Y) are population invariant, then CE is an observed score equating function.

Post-Stratification Equating (PSE) is the KE version of the so-called "Frequency Estimation" method, which is the most popular procedure for equating under the NEAT design. In this method, as explained by Kolen and Brennan (2004), the frequency distributions of Form X and Form Y for a common synthetic population are estimated as follows:

$$f_s(x) = w_1 f_1(x) + w_2 f_2(x), \qquad (1.7)$$

$$g_s(y) = w_1 g_1(y) + w_2 g_2(y), \qquad (1.8)$$

where f(x) and g(y) are the population distributions for X and Y, respectively. The subscripts s, 1, and 2 represent the synthetic population, Population 1, and Population 2, respectively. The weights $w_1$ and $w_2$ for Population 1 and Population 2 are used to define the synthetic population. Here, $f_2(x)$ and $g_1(y)$ are not directly observed. The frequency estimation method assumes that, for both Form X and Form Y, the conditional distribution of the total score given each common-item score is the same in both populations. This assumption can be stated as follows:

$$f_1(x \mid a) = f_2(x \mid a), \qquad (1.9)$$

$$g_1(y \mid a) = g_2(y \mid a). \qquad (1.10)$$

Then it follows that

$$f_s(x) = w_1 f_1(x) + w_2 f_2(x) = w_1 f_1(x) + w_2 \sum_a f_1(x \mid a)\, h_2(a), \qquad (1.11)$$

$$g_s(y) = w_1 g_1(y) + w_2 g_2(y) = w_1 \sum_a g_2(y \mid a)\, h_1(a) + w_2 g_2(y), \qquad (1.12)$$

where $h_1(a)$ and $h_2(a)$ are the marginal distributions of the common-item scores in Populations 1 and 2. Since all the quantities above are now observable in the NEAT design, equipercentile equating can be applied to $f_s(x)$ and $g_s(y)$. Unlike Chain Equating, in PSE the choice of w can affect the resulting equating function and must be specified.
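The synthetic-population computation in (1.11) and (1.12) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function name `synthetic_distributions`, the small joint probability tables, and the weight of 0.5 are hypothetical, the weights are assumed to sum to one, and every anchor score is assumed to have positive probability in both populations so that the conditional distributions are defined.

```python
import numpy as np

def synthetic_distributions(joint_xa_pop1, joint_ya_pop2, w1):
    """Frequency estimation of the synthetic-population distributions of X and Y.

    joint_xa_pop1[i, k] = P(X = x_i, A = a_k) in Population 1
    joint_ya_pop2[j, k] = P(Y = y_j, A = a_k) in Population 2
    """
    w2 = 1.0 - w1
    h1 = joint_xa_pop1.sum(axis=0)        # anchor marginal in Population 1
    h2 = joint_ya_pop2.sum(axis=0)        # anchor marginal in Population 2
    f1 = joint_xa_pop1.sum(axis=1)        # observed distribution of X in Population 1
    g2 = joint_ya_pop2.sum(axis=1)        # observed distribution of Y in Population 2
    f1_given_a = joint_xa_pop1 / h1       # column k holds f1(x | a_k)
    g2_given_a = joint_ya_pop2 / h2       # column k holds g2(y | a_k)
    f2 = f1_given_a @ h2                  # sum over a of f1(x | a) * h2(a)
    g1 = g2_given_a @ h1                  # sum over a of g2(y | a) * h1(a)
    fs = w1 * f1 + w2 * f2
    gs = w1 * g1 + w2 * g2
    return fs, gs

# Hypothetical joint distributions: three total-score points, two anchor-score points.
joint_xa_pop1 = np.array([[0.10, 0.05],
                          [0.20, 0.25],
                          [0.10, 0.30]])
joint_ya_pop2 = np.array([[0.20, 0.05],
                          [0.25, 0.20],
                          [0.15, 0.15]])

fs, gs = synthetic_distributions(joint_xa_pop1, joint_ya_pop2, w1=0.5)
print(fs, gs)
```

Setting `w1=1.0` returns the observed Population 1 distribution of X unchanged, which reflects the point made above that the choice of weights matters for this method.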