
Evaluating differential rater functioning in performance ratings: Using a goal-based approach

Dissertation
Author: Kevin B. Tamanini
Abstract:
Measuring performance in the workplace has been a central focus of many applied researchers and practitioners. Because objective data provide only limited information to decision makers, subjective data are often used to supplement performance ratings. Unfortunately, subjective ratings can be biased; indeed, rating errors frequently bias ratings and have plagued performance evaluations. Much of the performance appraisal (PA) research has focused on ways of eliminating, detecting, or controlling these rater errors. The results from these areas are mixed and insufficient for providing insight into how to deal with rater errors. This research extends and tests a technique called differential person functioning (DPF; Johanson & Alsmadi, 2002) for detecting rater bias (specifically leniency/severity) during a performance evaluation, and also tests a goal-based approach to performance evaluations. The DPF technique is used to identify individuals whose responses differ across different groups of items. The goal-based approach proposes that individuals' pursuit of different goals is what leads to different ratings. Two studies were conducted to examine these phenomena. The first study was a pilot study to refine the materials and manipulations to be used in the main study. Specifically, two different evaluation formats were compared, sex differences were examined, and the manipulation was tested. In the second (main) study, the sensitivity and consistency of the DPF technique were compared with two traditional methods for detecting leniency/severity. Participants completed an actual performance evaluation of a faculty member under one of three different response instructions. The results of the main study indicated that the DPF technique was not more sensitive than the traditional methods. Indeed, all methods examined were insensitive to the manipulation and thus ineffective at detecting rater bias. Although the DPF method was ineffective, the results provided support for the goal-based approach: raters responding under different instructions (i.e., goals) provided significantly different ratings. These findings suggest that there was a reasonable opportunity for differential ratings to occur across groups, yet none of the detection techniques were effective at detecting them. The discussion of these studies addresses the implications of the findings for the DPF technique, the goal-based approach, and other personnel decisions.

Table of Contents

Abstract
Acknowledgments
Introduction
Bias in Performance Appraisal
Rating Errors
Halo
Central Tendency
Leniency
Methods for Addressing Rating Errors
Rating Format
Graphic Rating Scales
Behaviorally Anchored Rating Scales (BARS)
Mixed Standard Scales (MSS)
Behavior Observation Scales (BOS)
Forced-Choice Ratings
Summary
Rater Training
Rater Error Training (RET)
Frame-of-Reference Training (FOR)
Summary
Cognitive Approach to PA
Rating Process Models
Goal-Based Perspective for PA
Summary
Methods for Detecting Leniency/Severity
Utilizing IRT for Performance Appraisal
Differential Item Functioning
Model-Fit Approach (e.g., Person-Fit)
Limitations of Current IRT Approaches
Differential Person Functioning
Current Research
Present Studies
Pilot Study
Method
Participants
Response Format
Response Instructions
Evaluation Form
Measures
Evaluation Process and Format Reactions
Goal Importance Questionnaire
Additional Items
Procedure
Results
Reliability Analysis
Manipulation Check
Tests for Sex Effects
Item-type Effects
Pilot Study Discussion
Main Study
Methods
Participants
Manipulations
Response Instructions
Measures
Evaluation Form
Evaluation Process and Format Reactions
Goal Importance Questionnaire
Additional Items
Psychology Department Rating Form
Procedure
Results
Test for Fatigue and Order Effects
Manipulation Check/Goal Manipulation
Comparison of Detection Methods
Classification with the Differential Person Functioning Method
Classification with the Mean Score Method
Classification with Skewness Ratings
Classification Consistency between Methods
Summary
Scale Differences
Proportion of Biased Raters
Evaluation Reactions
Goal Questionnaire Relationships
Discussion
Primary Findings
Additional DPF Findings
Additional Goal-Based Findings
Summary
Personnel Implications
Conclusions
References
Appendix A
Appendix B
Appendix C
Appendix D
Appendix E
Appendix F
Appendix G
Appendix H
Appendix I
Appendix J
Appendix K
Appendix L
Appendix M
Appendix N
Appendix O
Appendix P
Appendix Q

LIST OF TABLES

1. Descriptives and Sex Differences across Format Type and Instruction Condition for Overall Evaluation Ratings
2. Open-ended Manipulation Check Item Percentages
3. Means and Standard Deviations for the Overall Scores for Each Response Condition
4. Results of the Classification Consistency Analysis for the DPF, Mean Score, and Skewness Methods for Detecting Rater Bias
5. Results of the Classification Consistency Analysis for the DPF, Mean Score, and Skewness Methods for Each Type of Bias (i.e., Leniency and Severity)
6. Descriptive Statistics for Evaluating Reaction Items
7. Descriptive Statistics for the Goal Importance Questionnaire


LIST OF FIGURES

1. Generic Item Characteristics Curve
2. DPF Analysis Showing Two PCCs for Different Types of Items: Demonstrating DPF
3. Theoretical Decision Classification Table


Introduction

Defining, understanding, and evaluating performance within a work context is a central issue within industrial and organizational psychology (Arvey & Murphy, 1998; Landy & Farr, 1980). Realizing the importance of performance measurement and measuring it accurately are two distinct matters (Landy & Farr, 1980). Although it is a goal of an organizational decision maker to determine accurate assessments of employees' performances (Murphy & Balzer, 1989), doing so is easier said than done. Often, decision makers assume the most accurate measurement of performance is hard, objective criteria (e.g., absences, accidents, or tardiness). However, these are commonly deficient measures that do not adequately capture an individual's overall performance. Performance (as an ultimate criterion) is a complex construct that is difficult to capture completely. Deficiency occurs when the measurement of the performance criteria is incomplete. Indeed, Landy and Farr (1983) note several aspects that lead to this deficiency in objective data. First, objective indices tend to have low reliability. For example, in terms of absences, the observation period may not be stable across measures, or external factors (i.e., sick leave policies) may influence the reliability of absence measures. Second, objective measures are only available for a limited number of jobs. For example, it does not make sense to look at tardiness for those who do not have a predetermined workday or who frequently work from home (e.g., consultants, contractors). Finally, the changing nature of work often makes objective measures inappropriate for measuring work performance. Technological advances make outputs more dependent on
those technologies than on individual performance. Because the goal of a performance appraisal is to choose criteria that optimize the assessment of job success, keeping overall deficiency to a minimum is imperative (Riggio, 2003). To compensate for the deficiencies in objective data, most ratings of individual performance depend on subjective indices (Guion, 1965; Murphy & Cleveland, 1995). Unfortunately, subjective data often lead to contaminated or biased ratings (i.e., rating errors). Indeed, according to Holzbach (1978, p. 578), "Rater bias, in its various forms and manifestations, is perhaps the most serious common drawback to performance ratings." Because subjective ratings can be contaminated (biased), they lose the accuracy that decision makers desire (Borman, 1979; Landy & Farr, 1980). Hence the dilemma: if objective data are deficient and subjective data contaminated, how should performance be evaluated? Despite the realization that subjective indices may yield biased ratings, organizations have had no choice but to continue to use them because there is no viable alternative. Indeed, subjective appraisals are found in 90% of organizations (Bernthal, Sumlin, Davis, & Rogers, 1997) and influence decision-making processes (Bernardin & Villanova, 2005). This heavy use has led researchers and practitioners to seek a "cure" for dealing with biased ratings (Landy & Farr, 1980; Murphy & Cleveland, 1995; Saal, Downey, & Lahey, 1980). For example, research has examined how different rating formats (e.g., graphic rating scales vs. behaviorally anchored rating scales), different rater characteristics (e.g., peer vs. supervisor), different ratee characteristics (e.g., race, sex), different rater training programs (e.g., frame-of-reference training), and different
statistical controlling techniques influence the occurrence of various rating errors. Much of this research was based on the assumption that individuals unknowingly commit rating errors. In turn, errors are assumed to be the result of unconscious (i.e., automatic) information processing that might be overcome by "raising the consciousness" of raters through techniques such as error training. However, some researchers claim that there are instances in which individuals are aware of the biases in the ratings they give (Murphy & Cleveland, 1995). The evidence that biased ratings are due to the deliberate, "volitional" distortion of performance ratings has been growing (Bernardin & Beatty, 1984; Bernardin & Villanova, 1986; Murphy & Cleveland, 1995; Tziner, Murphy, & Cleveland, 2005). There have been speculations as to why individuals may intentionally distort their responses, including: 1) performance appraisal purpose, 2) organizational goals, and 3) rater goals (Murphy & Cleveland, 1995). However, few empirical studies have attempted to provide evidence of a goal-based (i.e., motivational) aspect behind the occurrence of rating errors. If individuals are intentionally distorting their responses, then the aforementioned approaches will remain insufficient for adequately understanding rating errors. Indeed, the goal-based approach, in which individuals provide ratings based on the goals they are pursuing, is the only current perspective that examines rater errors as an intentional process (Murphy & Cleveland, 1995). Because of this, rather than using techniques to control errors (e.g., format, training), it may be better to utilize methods to detect those who are committing errors and deal with their ratings accordingly. Unfortunately, the
limitations with the performance appraisal research are not isolated to interventions (e.g., format changes, training). Indeed, there has been an unrealized opportunity to utilize newer statistical techniques that could provide better insights and explanations about why rater errors occur. These techniques typically focus on identifying ratings that fit a certain response pattern. Once those ratings are identified, that information is utilized to make decisions regarding the usefulness (i.e., reliability and validity) of those ratings. Depending upon the type of error that one is attempting to detect, there are various statistical procedures that may be used (e.g., the mean correlation among performance dimensions across ratees [halo], or the mean rating across ratees and dimensions [leniency]). Even though these methods are consistently used, some have argued that there is still ambiguity concerning the detection of these rating errors, in part because the incorrect unit of analysis has been utilized (Murphy & Balzer, 1989). For example, a considerable amount of performance appraisal research has examined leniency by comparing differences between groups (e.g., peer ratings vs. subordinate ratings). However, this approach assumes that groups, not the individuals within the groups, are lenient or severe. Additionally, mean ratings across ratees, although predominantly used, may not provide the most accurate information on how or why rating biases arise. The problem is that the ratings are confounded with the raters. For example, it could be that raters are applying biases only to a subset of ratees, but without a fully crossed design (i.e., all raters rate all ratees), this cannot be determined. Because of this, a technique that does not confound the rater with the ratings would be preferable.
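For concreteness, the following is a minimal, hypothetical sketch of the two traditional indices mentioned above, assuming a fully crossed design in which every rater rates every ratee on every dimension; the array shape, function names, and simulated data are illustrative only and are not taken from this research.

    # A minimal, hypothetical sketch of the traditional rater-error indices described
    # above, assuming a fully crossed design stored as an array of shape
    # (raters, ratees, dimensions). This is not the analysis code used in this study.
    import numpy as np

    def halo_index(ratings):
        """Mean correlation among performance dimensions, computed over ratees,
        separately for each rater; higher values are conventionally read as more halo."""
        n_raters, n_ratees, n_dims = ratings.shape
        indices = []
        for r in range(n_raters):
            corr = np.corrcoef(ratings[r].T)            # dimensions x dimensions
            upper = corr[np.triu_indices(n_dims, k=1)]  # each dimension pair once
            indices.append(upper.mean())
        return np.array(indices)

    def leniency_index(ratings):
        """Mean rating over ratees and dimensions for each rater; means well above the
        scale midpoint are conventionally read as leniency, well below as severity."""
        return ratings.mean(axis=(1, 2))

    # Toy data: 5 raters x 20 ratees x 6 dimensions on a 1-7 scale.
    rng = np.random.default_rng(0)
    ratings = rng.integers(1, 8, size=(5, 20, 6)).astype(float)
    print(halo_index(ratings))
    print(leniency_index(ratings))

Note that both indices summarize a rater by averaging over ratees, which is exactly the confounding of raters with ratings criticized above.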


An alternative to the traditional methods of understanding and detecting rating errors is to examine person-fit models. Person-fit models use item response theory (IRT) and differential item functioning (DIF) to assess specific latent trait IRT models that represent rater effects (Wolfe, 2004). According to this person-fit approach, if an individual does not fit a model, then there is evidence that the individual is responding in a biased manner (e.g., his or her ratings are lenient or severe). Although these techniques have provided some useful information regarding the examination of rater effects, in that they demonstrate the usefulness of utilizing IRT models in conjunction with non-IRT-based functions (i.e., DIF) for detecting various errors, there are still limitations to consider. For example, the person-fit modeling approach is similar to the traditional methods discussed previously in that it assumes a typical response pattern. Although this person-fit approach may identify biased raters, it merely provides information about individuals and little, if any, information about groups of items. Information from the item level is not being captured; therefore, even if a rater is identified as giving biased ratings, there is no understanding as to why. Just as with test bias, we should not ignore the item level. Indeed, it is the interaction between the individual and the items that should be the focus of our attention. For example, if an individual is demonstrating differential functioning as a rater, the properties of the items could then be examined to determine whether the effects are due to the items themselves or possibly to some other factor (e.g., goals).
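To make the person-fit logic referenced above concrete, the following is a purely illustrative sketch (not the model-fit procedure used in this research) of one widely used person-fit statistic, the standardized log-likelihood index lz, written for dichotomous responses under a Rasch model in which the item difficulties and the person's ability estimate are assumed to be known.

    # An illustrative person-fit sketch (not this study's procedure): the standardized
    # log-likelihood statistic lz under a dichotomous Rasch model, assuming item
    # difficulties b and a person's ability estimate theta are given.
    import numpy as np

    def lz_statistic(responses, b, theta):
        """Large negative values flag response patterns that fit the model poorly."""
        x = np.asarray(responses, dtype=float)
        b = np.asarray(b, dtype=float)
        p = 1.0 / (1.0 + np.exp(-(theta - b)))               # model-implied probabilities
        log_lik = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
        expected = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))
        variance = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2)
        return (log_lik - expected) / np.sqrt(variance)

    b = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])                # easy to hard items
    print(lz_statistic([1, 1, 1, 0, 0], b, theta=0.0))       # model-consistent pattern
    print(lz_statistic([0, 0, 0, 1, 1], b, theta=0.0))       # aberrant-looking pattern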


Ideally, we would like to be able to cross our levels of analysis and obtain information about both individuals and items that would allow us to determine who is giving biased ratings and why they are doing so. Fortunately, there is a technique that may allow for a more sensitive examination of both individuals and item properties simultaneously. This technique is called differential person functioning (DPF; Johanson & Alsmadi, 2002). Rather than determining which items are "acting" differentially for different groups (e.g., peers vs. subordinates), DPF is a technique that can be used to determine whether the responses for given individuals are different across different groups of items (e.g., focal vs. referent). Because DPF takes both items and persons into account, it may be a more sensitive (and appropriate) technique than the traditional methods (i.e., mean differences, skewness) for identifying biased raters. Specifically, utilizing the DPF technique is not just a matter of specificity (i.e., one rater for one ratee as opposed to multiple ratings averaged across multiple ratees for a given rater); it is more sensitive in that the information about individuals allows for the detection of biased raters and the information about the items allows for an understanding of why he or she is giving those ratings.
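The following is a rough sketch of the core DPF idea, under assumptions not made in this research (dichotomous responses, a Rasch model with known item difficulties): estimate the same person's trait level separately from two groups of items and test whether the two estimates differ. Function names, bounds, and the toy data are hypothetical.

    # A rough, hypothetical sketch of the DPF idea (not this study's analysis):
    # estimate one person's trait level separately from two item groups (e.g.,
    # focal vs. referent items) under a Rasch model with known item difficulties,
    # then compare the two estimates with a simple z statistic.
    import numpy as np
    from scipy.optimize import minimize_scalar

    def rasch_theta(responses, b):
        """Maximum-likelihood trait estimate and its standard error for one person."""
        x = np.asarray(responses, dtype=float)
        b = np.asarray(b, dtype=float)

        def neg_log_lik(theta):
            p = 1.0 / (1.0 + np.exp(-(theta - b)))
            return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

        theta = minimize_scalar(neg_log_lik, bounds=(-6, 6), method="bounded").x
        p = 1.0 / (1.0 + np.exp(-(theta - b)))
        info = np.sum(p * (1 - p))            # Rasch test information at theta
        return theta, 1.0 / np.sqrt(info)

    def dpf_z(x_focal, b_focal, x_referent, b_referent):
        """Standardized difference between the two within-person estimates."""
        t_f, se_f = rasch_theta(x_focal, b_focal)
        t_r, se_r = rasch_theta(x_referent, b_referent)
        return (t_f - t_r) / np.sqrt(se_f ** 2 + se_r ** 2)

    # Toy example: one person "scores" noticeably higher on the focal item group.
    b_items = np.array([-1.0, 0.0, 1.0, 2.0])
    print(dpf_z([1, 1, 1, 0], b_items, [1, 0, 0, 0], b_items))

A large absolute z would be taken as evidence of differential person functioning across the two item groups; the same logic extends to polytomous rating-scale models.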


The DPF technique has yet to be used in performance appraisal research; hence, it is my intention to utilize this technique to detect rater errors in a performance appraisal situation. Specifically, I extend the DPF technique to detecting rater effects (specifically leniency/severity) from a performance evaluation measure. Unlike other research that has attempted to detect rater effects with simulated data (i.e., person-fit models, IRT-based approaches), I will use field data. By using the more sensitive DPF technique, I hope to demonstrate a larger effect when compared to the commonly used methods for detecting leniency. Additionally, I attempt to provide some empirical support for the goal-based, motivational approach that has been proposed (i.e., Cleveland & Murphy, 1992). Although some evidence exists for the goal-based conceptualization of rater errors, it is weak. Specifically, I will test the goal-based theory by manipulating rater goals (e.g., administrative decisions vs. feedback) as well as item properties. In addition to providing evidence for a goal-based perspective, this test serves as a validation of the DPF technique for detecting lenient raters. In the following sections of this paper, common rater errors, approaches to detecting and dealing with rater errors, the differential person functioning approach, and the methods utilized in the paper are discussed.

Bias in Performance Appraisal

Rating Errors

As long as organizations continue to rely on rating instruments to evaluate the performance of employees, the quality of ratings will continue to be of interest to both managers and researchers (Tsui & Barry, 1986). It is important to know whether performance ratings provide an accurate reflection of performance for those being rated. Performance appraisal (PA) has traditionally been viewed as a measurement problem, with a focus on various issues including the reduction of test and rater bias (Murphy & Cleveland, 1991). Indeed, rater bias is considered a substantial source of error within psychological research (Hoyt, 2000). Because of this, there is an inherent need for criteria that can be used to assess the quality of ratings, focusing much of the PA research
on the search for "better," more accurate techniques for measuring job performance (Murphy & Cleveland, 1991). The most common approach used to examine the quality of performance ratings is to examine the psychometric characteristics/properties of the ratings themselves (Borman, 1991; Cleveland & Murphy, 1995). According to Murphy and Cleveland (1995), these measures of the psychometric quality of ratings are classified into three broad groups: 1) traditional psychometric criteria (e.g., reliability, validity); 2) indices of rater errors that reflect response biases on the part of the raters; and 3) direct measures of accuracy. Of these, the rater error approach has been the most common. Rater error approaches assume accuracy is a function of the presence or absence of rating errors (Murphy & Cleveland, 1995). Likewise, many believe that rater errors tend to undermine the reliability and validity of the information obtained (Bannister, Kinicki, DeNisi, & Hom, 1987). Hence, the most common method for evaluating ratings involves the assessment of rater errors (Landy, 1986). Based on a comprehensive review of the literature, Saal and colleagues (1980) identified the major categories of rater errors: 1) halo, 2) central tendency, and 3) leniency (or severity). Research that has examined rater errors has taken many different perspectives. As such, there are numerous operational definitions of each type of error. To further complicate the matter, there are different statistical methods of detecting each type of error, according to the operational definition that is used. Below, I review each of the three errors as well as the typical definitions that are used for each.


Halo. Halo refers to a rater's tendency to give similar evaluations to separate aspects of an individual's performance, even though the dimensions are clearly distinct (Thorndike, 1920). Typically, halo is defined in one of two ways: 1) a rater's tendency to allow overall (global) perceptions to distort the ratings on specific aspects of a ratee's performance, or 2) a rater's unwillingness to discriminate among separate aspects of an individual's performance (Saal et al., 1980). The first definition tends to agree with the belief that raters commit halo unintentionally; therefore, there are statistical methods to control for such errors (see Ritti, 1964). However, several researchers have shown that this approach to controlling for halo tends to do more harm than good and should not be used (Harvey, 1982; Hulin, 1982; Murphy, 1982). According to Murphy and Cleveland (1995), the second definition suggests that individuals intentionally distort their ratings so that the correlations among dimensions correspond to the conceptual similarity among dimensions. This definition is more in line with the current notion behind rater errors. The issue with this definition is that one reason why ratings on separate dimensions may be correlated is that the behaviors being rated really are correlated (valid, or true, halo). It is invalid (illusory) halo, resulting from intentional distortion on the part of the rater, that constitutes the rating error. Central Tendency. Central tendency refers to a rater's unwillingness to assign extreme (i.e., high or low) ratings. This is an error in which a rater assesses a disproportionately large number of ratees as performing in the central part of a distribution of rated performance, in contrast to their true level of performance
(Muchinsky, 2006). The assumption is that the true distribution of performance is normal and that the true variability of performance is "substantial" (Murphy & Cleveland, 1995). When the variability of the ratings is small, there is range restriction. When the range restriction falls around the center of the scale, central tendency is believed to be occurring. If a rater is committing this error, one can infer that he or she views everyone as "average," because only the middle part of the evaluation scale is utilized. Many times central tendency occurs when raters are asked to rate aspects of an individual's performance with which they are unfamiliar. Leniency. Leniency typically refers to the tendency of raters to "rate well above the midpoint of the scales used" (Kneeland, 1929, p. 356), as indicated by average ratings over all ratees (Saal et al., 1980). The assumption in this case is that the true mean level of performance corresponds to the scale midpoint. The notion behind this error is that a rater may give ratings that are higher than warranted by actual performance (leniency) or ratings lower than warranted (severity). Leniency (as with central tendency) is a distributional error in that restriction of range in scores around the upper end of the scale (high mean ratings) implies leniency. There is much speculation (especially within performance appraisal research) as to why raters give lenient/severe ratings (e.g., inaccurate frames of reference or norms, PA purpose). Indeed, inflation is one of the most frequently cited problems associated with performance ratings (Bernardin & Orban, 1990; Ilgen & Feldman, 1983; Murphy & Cleveland, 1995). The appraisal processes of the military and civil service are examples of domains where the pervasiveness of leniency in ratings often renders an entire appraisal system worthless (Bernardin & Orban, 1990).
Similarly, Hide (1982) noted that there are often "vast quantities" of inflated reports that lead to severe consequences when using performance ratings. Lenient ratings can lead to a variety of outcomes that can severely influence decision making. Specifically, lenient ratings are a source of problems when an organization wants to terminate an employee because of poor performance (Bernardin & Cascio, 1988), as well as when personnel decisions are based on comparisons of individuals to some standard (Bernardin & Orban, 1990). Similarly, Murphy and Cleveland (1995) provide a detailed discussion of several consequences of inflated ratings. These include: 1) consequences for the ratee (e.g., pay, promotion); 2) consequences for the rater (a manager looks better with higher-performing employees); 3) avoidance of negative reactions (reduced confrontations with employees); and 4) maintenance of the organization's image. Just as with halo and central tendency, leniency could be the result of a rater's unwillingness to give accurate ratings. Because intentional distortion is a possibility, the traditional methods for dealing with leniency may not be appropriate in many situations. As such, the motivational (i.e., goal-based) perspective may be helpful in attempting to understand why raters give lenient (i.e., inaccurate) ratings. One purpose of this study is to provide empirical support for this perspective.
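For illustration only, the distributional definitions of leniency and severity given above are often operationalized along the following lines: a rater whose ratings cluster near the top of the scale (mean well above the midpoint, negative skew) looks lenient, and the mirror image looks severe. This is a hypothetical sketch; the scale range, cutoff, and data are arbitrary and are not the classification procedure used in the main study.

    # A hypothetical illustration of the distributional definitions above: flag a
    # rater as lenient or severe from the mean of his or her ratings relative to
    # the scale midpoint and the skew of the rating distribution. Cutoffs are arbitrary.
    import numpy as np
    from scipy.stats import skew

    def classify_rater(ratings, scale_min=1, scale_max=7, cutoff=0.5):
        midpoint = (scale_min + scale_max) / 2
        mean, g1 = np.mean(ratings), skew(np.asarray(ratings, dtype=float))
        if mean > midpoint + cutoff and g1 < 0:
            return "lenient"
        if mean < midpoint - cutoff and g1 > 0:
            return "severe"
        return "no distributional error flagged"

    print(classify_rater([6, 7, 6, 7, 5, 7, 6]))   # ratings pile up near the top
    print(classify_rater([2, 1, 2, 3, 1, 2, 2]))   # ratings pile up near the bottom
    print(classify_rater([2, 4, 5, 3, 6, 4, 4]))   # spread around the midpoint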


Whereas the majority of rater error research has focused on these three main errors, other errors have been discussed within the literature: logical error (Newcomb, 1931); contrast error (Murray, 1938); proximity error (Stockford & Bissell, 1949); the similar-to-me error (Latham, Wexley, & Pursell, 1975); the first-impression error (Latham et al., 1975); and systematic distortion (Kozlowski & Kirsch, 1987). Due to the lack of research surrounding these errors, they will not be the focus of the remainder of this paper. Within the performance appraisal literature, it has been noted that leniency (i.e., inflated ratings) is the most serious problem that needs to be dealt with, due to the implications lenient or severe ratings may have for personnel decisions (Ilgen & Feldman, 1983; Landy & Farr, 1980; Murphy & Cleveland, 1995). Interestingly, though, leniency may not be an "error" at all, but rather a behavior that allows a rater to obtain rewards and avoid punishments (Murphy & Cleveland, 1995). From this perspective, there are many understandable reasons for giving inaccurate (typically inflated) ratings and, more importantly, relatively few reasons for giving accurate ratings. As such, applied researchers have focused on finding ways to eliminate or reduce lenient ratings. Below, the techniques that were developed for this purpose are reviewed.

Methods for Addressing Rating Errors

Over the last 80 years, there have been many attempts to understand and deal with rater errors (Murphy & Cleveland, 1995). Over that time, researchers have taken several approaches. Much of the early work regarding the issue of rater errors focused on the development and comparison of different rating formats. Rater training focused on reducing rating errors and improving observation skills has also received substantial attention (Ilgen, Barnes-Farrell, & McKellin, 1993). Results from format research are somewhat mixed, and although there is evidence that training does reduce certain rating errors, there is a common theme to both perspectives: they perceive the rater as committing errors unknowingly; therefore, changes to the environment should alleviate
the occurrence of errors. Because of the consistently mixed results from both the format research and the training research, the focus began to shift away from structural changes toward process changes. In general, there was a belief that cognitive characteristics (e.g., rater characteristics, ratee characteristics) held the most promise for understanding the rating process (e.g., Feldman, 1986; Landy & Farr, 1980). More recently, a motivational approach has begun to make some headway because it addresses the issue of why individuals provide certain ratings (e.g., rating errors). Specifically, researchers believe that the goals of the rater, and/or the goals of the organization, will influence the types of ratings that individuals give (Murphy & Cleveland, 1995). Although research based on the cognitive and goal-based approaches has been more helpful in providing answers to issues regarding rater errors, success has been limited at best. Each of these areas of research and their results are reviewed below.

Rating Format

Much of the early work dealing with PA focused on the development of many different types of rating formats to be used for both research and practice. As noted by Borman (1991), it has been compelling for researchers to believe that there are characteristics of the rating formats themselves that play a role in determining the accuracy of ratings. Indeed, an enormous amount of research has been devoted to exploring the potential effects of rating formats on rating errors. According to Murphy and Cleveland (1995), if the number of studies devoted to rating scale format were counted, it would appear as if this were the most important issue in PA, dating back
to the pioneering work of Paterson (1922) and his development of the graphic rating scale. Most of the popular methods typically require raters to provide some judgment of performance based upon some absolute criterion (e.g., a goal) or the performance of others (Bernardin & Beatty, 1984). Either way, raters are being asked to make performance-based decisions based on human judgment. As such, there is potential for rating errors to occur. Because of this, much of the research on scale formats has attempted to determine which formats are superior (i.e., which ones result in the fewest rater errors) (e.g., Bernardin, 1977; Borman & Dunnette, 1975; Borman, 1979). For example, research has examined specific characteristics of the rating scales, such as the number of response categories (e.g., Bernardin, LaShells, Smith, & Alvares, 1976), types of anchors (e.g., Smith & Kendall, 1963), and the process of assigning values to anchors (e.g., Barnes & Landy, 1979; Silverman & Wexley, 1984), as well as the psychological processes involved when using different formats (Murphy & Constans, 1987, 1988). Borman (1991) lists 12 different types of rating formats that have been examined in both research and practice (e.g., forced choice, critical incidents, behaviorally anchored rating scales; for an extensive review of formats see Bernardin & Beatty, 1984; Whisler & Harper, 1962). Although research on formats has been extensive and long-lasting, much of the current PA research has paid little attention to the question of which format is best (Murphy & Cleveland, 1995). This drop-off in interest is mainly due to the results of a review by Landy and Farr (1980). Based on their search of the literature, they concluded that formats had only a minimal effect on the quality of ratings. Additionally,
