The quality of teacher-developed rubrics for assessing student performance in the classroom
TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGEMENTS
LIST OF TABLES
CHAPTER ONE: Introduction
    Statement of the Problem
    Hypotheses
    Definition of Variables
    Summary
CHAPTER TWO: Review of Literature
    Assessment and Accountability
    Shift from Traditional to Performance-Based Assessment
    Classroom Assessment
    Classroom Performance Assessment
    Rubrics for Assessing Performance and Promoting Learning
    Developing a “Meta-Rubric”
    Performance-Based Assessment in Missouri
CHAPTER THREE: Methods
    Participants
    Sampling Plan
    Instruments
        Survey Questionnaire
        Meta-Rubric
    Design and Analysis
CHAPTER FOUR: Results
    Participants
    Assessment Practices
    Quality of Teacher-Developed Rubrics
    Summary
CHAPTER FIVE: Discussion
    Summary of Findings
    Limitations
    Conclusion
REFERENCES
APPENDICES
LIST OF TABLES
1. Schools Participating Spring 2006
2. Surveys and Rubrics Received Spring 2006
3. Schools Participating Fall 2007
4. Surveys and Rubrics Received Fall 2007
5. Overall Representation of Schools Participating
6. Meta-Rubric Assessment Purpose Component
7. Meta-Rubric Performance Criteria Component
8. Meta-Rubric Scoring System Component
9. Meta-Rubric General Qualities Component
10. Inter-Rater Reliability Estimates
11. Respondents’ Gender by School Level
12. Years Teaching & Assessment Training by School Level
13. Teacher-Constructed Assessments
14. Use of Performance-Based Versus Traditional Assessments
15. Teacher-Reported Confidence in Assessment Development
16. Teacher-Reported Practices in Assessment Development
17. Meta-Rubric Scores by Criteria
18. Component Mean Scores by School Level
19. Meta-Rubric Scores and Use of Performance Assessment
20. Meta-Rubric Scores and Level of Confidence
21. Meta-Rubric Scores and Practices in Assessment
22. Meta-Rubric Scores and Professional Development
Problem Statement

The purpose of any assessment is to gain a better understanding of students’ current level of knowledge in a particular area. Assessments can take many different forms, ranging from the subjective judgments teachers make every day through classroom observation to nationally standardized measures of student proficiency (Dietel, Herman, & Knuth, 1991). While standards-based assessment has been a focus in education for years, the recent assessment and accountability system employed under the No Child Left Behind Act (NCLB) of 2001 (Pub. L. No. 107-110, 2002) has left many school districts facing issues surrounding the pressure to meet mandated proficiency targets (Linn, 2005). Efforts to “leave no child behind” focus primarily on measuring student success via large-scale assessments, with little interest in the quality of the assessments teachers develop to assess student success in the classroom. NCLB is not the first legislation to focus efforts on improving student achievement. During the 1980’s, the National Commission on Excellence in Education reported on the quality of American education in A Nation at Risk: The Imperative for Educational Reform (1983). This landmark report led to a flurry of reform in the educational arena and was followed in the early 1990’s by the Education Summit’s call for increased academic standards (1991) that went far
beyond traditional subject matter, demanding that students in American schools also possess skills such as creative thinking, decision making, problem solving, collaboration, and self-management. Reform efforts of the 1980’s and 1990’s required those in the field of education to begin developing alternative methods of assessing student performance that would more adequately measure the “complex nature of academic achievement” (Stiggins, 1995, p. 138). As a result, the performance-based assessment movement hit the American educational community with a force not seen since the inception of the traditional paper-pencil objective measure developed early in the 1900’s (Stiggins, 1995). With assessment focused on more subjective measures of student performance, many in the assessment community called for guidelines in the development of quality performance criteria (Quellmalz, 1991). The assessment community focused its efforts on providing practitioners with an articulation of the “rules of evidence” (Stiggins, 1995, p. 1) used to judge the quality of performance tasks as well as the scoring systems, or rubrics, used to assess student success (Moskal, 2000, 2003; Mertler, 2001). Much of the research and many of the development issues surrounding performance-based assessment have focused primarily on the development of large-scale performance-based measures, with little emphasis placed on the quality of the performance assessments teachers develop to judge students’ progress (Stiggins, 2001). Recently, school districts have been encouraged to cast a wider net in examining school success by also valuing the information provided from
both district tests and teacher-developed classroom measures (Mid-Continent Research for Education and Learning [McREL], 2003). While large-scale, high-stakes assessments have long been the standard by which school improvement is judged, critics argue that they often lack sensitivity to classroom instruction, remaining disconnected from the environment they are intended to measure (Popham, 2003). This is disturbing considering that teachers spend a significant portion of their professional time assessing student progress and that the majority of school assessment takes place in the classroom (Brualdi, 1998), where teachers and students can spend as much as 33% of their time in assessment-related activities (Stiggins, 1991). The classroom assessment literature documents the methods most commonly used to examine key issues: surveys of teachers’ attitudes toward testing in general, reviews of teacher-developed assessments, and tests of teachers’ understanding of the principles of measurement. While teachers report a high level of confidence in their ability to produce valid and reliable tests (Wise, Lukin, & Roos, 1991), they are generally not the best judges of their own abilities or knowledge in test construction (Boothroyd, McMorris, & Pruzek, 1991). On the other hand, many teachers believe they need strong measurement skills (Boothroyd et al., 1991) and report a level of discomfort with the quality of their own tests (Stiggins & Bridgeford, 1985). These studies find that teachers often lack understanding of basic measurement concepts, which should not be surprising given that by 1999 only 33% of states required some training in measurement as part of receiving a license to teach.
In addition, even when training is provided, it usually focuses on the administration and interpretation of large-scale assessments (Stiggins, 2001). Classroom assessment research has focused primarily on teachers’ abilities to construct traditional paper-pencil, objectively-scored tests, with very little in the research literature focused on teachers’ use and development of subjectively-scored measures of performance. Many offer guidelines and suggestions for implementing performance assessment in the classroom (Arter, 1999; Brualdi, 1998; Moskal, 2000), but very few have examined the quality of such teacher-developed tools. Those studies that have examined teacher-developed performance assessments note teachers’ difficulty in articulating the purpose of the assessment and in clearly defining the scoring system on which performance is to be judged (Haydel, Oescher, & Banbury, 1995). In addition, while many advocate that performance assessments have a positive impact on student learning (Arter, 1999; Luft, 1997; Goodrich, 1996), there appears to be only qualitative evidence of this relationship (Shepard, 1995). Most studies are limited in scope, focusing on a small population of teachers, usually confined to one school district (Haydel, Oescher, & Banbury, 1995). More comprehensive studies that utilize a larger sample of teacher-developed performance tasks and rubrics are needed. This study investigated the quality of teacher-designed rubrics. A meta-rubric was designed to rate the degree to which teachers follow the common guidelines most often associated with good performance-based assessment rubrics.
Relationships between teachers’ ability to develop quality rubrics and their beliefs and practices in assessment were also examined.
Hypotheses

This study tested seven hypotheses related to the quality of teacher-designed performance assessment rubrics, teacher confidence and practices in the development of performance assessments, and the relationship between assessment practices and rubric quality. Specifically, the hypotheses are:

1. The overall quality of teacher-designed rubrics used to judge student performance is high for elementary, middle, and high school teachers.
2. The scoring rubrics designed by teachers to measure student performance provide evidence that students utilize higher-order thinking abilities when engaged in the performance task being measured.
3. Differences in the quality of teacher-designed rubrics used to judge classroom performance tasks exist between elementary, middle, and high school teachers.
4. The percent of time teachers report using performance measures in the classroom is positively correlated with the quality of the rubrics they construct.
5. Teachers’ level of confidence when constructing performance tasks and rubrics is positively correlated with the quality of rubrics developed by elementary, middle, and high school teachers.
6. Teachers’ self-reported use of “best practices” when constructing performance tasks and rubrics is positively correlated with the quality of rubrics developed by elementary, middle, and high school teachers.
7. The number of professional development activities teachers have attended that focus on performance-based assessment is positively correlated with the quality of the rubrics they construct.
Definition of Variables

The primary independent variable for this study is school level (elementary, middle, and high school). As the state of Missouri provided the sample for this study, participants were Missouri teachers who teach in 3rd, 4th, 7th, 8th, 10th, and 11th grade classrooms. These grade levels were selected because they corresponded with the benchmark years for the Missouri Assessment Program (MAP) assessment of communication arts and mathematics from 1994-2005. While additional grade-level assessments are now conducted within school districts across the state, per NCLB, the primary focus on performance assessments for the past ten years has been at these grade levels. Missouri offers a unique population for examining the quality of teacher-developed rubrics used in the assessment of student performance in the classroom. Missouri teachers have been provided extensive training on the development of classroom performance assessments for nearly ten years, with the state spending in
excess of $53 million for the MAP. Therefore, Missouri teachers should have a foundation in the principles of quality performance tasks and the rubrics used to judge students’ performance. In addition to school level, demographic data such as years of teaching and the number of professional development trainings in assessment-related activities are also examined here. The dependent variables of interest in this study were (1) a measure of teachers’ confidence in their ability to produce reliable and valid performance assessments, (2) a measure of the degree to which teachers engage in the performance assessment “best practices” noted in the literature on classroom assessment, and (3) the quality of the scoring guides/rubrics they submit for analysis. In order to examine the quality of the rubrics teachers construct, a rating system was developed by the researcher. This rating system is based on research-based principles of classroom performance-based assessment (Mertler, 2001; Moskal & Leydens, 2000; Moskal, 2003a, 2003b; Tierney & Simon, 2004).
Summary

After almost two decades of research examining performance-based assessment issues, nearly all studies have focused on the development of quality large-scale performance measures. However, if the goal stated in the No Child Left Behind Act of 2001 truly is success for all students, then we must heed the call to examine success in school from a variety of sources, including, but not limited to, the measures that classroom teachers develop to assess student performance
(McREL, 2003). In the past, research on teacher-constructed tests has focused on traditional paper-pencil methods of assessment. However, with the emergence of performance assessment practices and the emphasis placed on more authentic means of assessing students, examining the quality of these teacher-developed measures is of utmost importance.
REVIEW OF LITERATURE
Assessment and Accountability

The No Child Left Behind Act (NCLB) of 2001 (Pub. L. No. 107-110, 2002) is often praised for the attention it has placed on closing the achievement gap and enhancing teacher quality. NCLB mandates rely on state assessments and accountability efforts as the primary mechanisms for improving student achievement. The legislation’s most ambitious goal may be the requirement that all students demonstrate proficiency in reading and math by the year 2014. However, NCLB has been a challenge for many, as variability in content and performance standards exists across states, leading to differences in what is considered proficient from state to state (Linn, Baker, & Betebenner, 2002). Adding to this discrepancy is the fact that the very term “proficient” is not yet clearly defined (Linn, 2005). These and other issues associated with performance standards were recently highlighted in the summer 2005 policy brief published by the National Center for Research on Evaluation, Standards, and Student Testing (CRESST). CRESST specifically compared the percent of students in Colorado and Missouri scoring proficient and above on the 2003 National Assessment of Educational Progress (NAEP) with the percent of students scoring proficient and above as determined by each state’s assessment program. While differences were noted in eighth grade math NAEP scores, with 34% of Colorado students scoring proficient or above
compared to 28% of Missouri students, these differences are modest compared to the differences in the percent of students deemed proficient by the state tests. In Colorado, 67% of students scored proficient or above on their state test, but in Missouri only 21% of students were categorized as proficient or above. Discrepancies such as these highlight the vast differences that exist in the level of rigor associated with state performance standards (Linn, 2005). In addition, when findings such as these are made public, reactionary approaches soon follow. The Missouri State Board of Education, for example, recently changed its definition of “proficient” from student performance that is above grade level to student performance that is at grade level. This change has led to reports chiding Missouri for “lowering the bar on the state assessment so that more students would be able to score higher” (http://www.kansascity.com/mld/kansascity/new). This is but one example among the litany across the country of the reality that many states face as a result of the pressure to meet mandated proficiency targets. While the standards-based assessment movement, which has now spread its roots far beyond NCLB, was intended to refocus school districts on the use of assessment results to monitor progress (Linn, 2001), some critics argue that large-scale assessments used as the primary criterion on which schools’ progress is determined are not sensitive to the instruction that takes place in the classroom (Popham, 2003), allowing “the practice of sound assessment to remain disconnected from the day-to-day practice of instruction” (Stiggins, 2001, p. 5). Arguments abound, with many contending that assessments should be the driving force for sound
instructional practices; however, others contend that the reverse may be true. Many low-achieving schools are disadvantaged by the policies surrounding Adequate Yearly Progress (AYP), because they must address progress immediately and face achievement targets that require them to make greater gains than higher-achieving schools (Linn, 2003). In such instances, many lower-achieving schools may resort to instructional practices that focus on knowledge-level skills and less on complex or higher-order tasks in an effort to meet the AYP requirement. Thus, high-stakes testing may be negatively impacting student learning (Kornhaber, 2004). Given these arguments, it would seem logical to expect that a state’s NCLB assessment should be instructionally sensitive by (1) providing clear descriptions of the assessment targets, (2) focusing on specific curricular objectives, and (3) providing results that can be used to inform instructional practice. Tests that are sensitive to improvements in classroom instruction provide a better understanding of student performance (Popham, 2003). Yet the problem remains that even the best large-scale assessment of student progress provides only a snapshot of performance, and the question has been posed as to whether or not “intimidation by assessment will lead to more effective schools” (Stiggins, 1999b, p. 191). Perhaps school districts should be encouraged to move beyond the examination of a single test score obtained on a yearly basis to gathering data on student performance from a variety of sources that provide a more in-depth understanding of learning, including district-wide assessments aligned to state standards as well as teacher-developed classroom assessments (McREL, 2003).
While NCLB is the nation’s most recent comprehensive mandate to place accountability at the fore of educational reform, it is not the first. Assessment and accountability issues have been focal points for decades, and the past two decades in particular have seen a flurry of reform efforts. During the 1980’s, Secretary of Education T.H. Bell created the National Commission on Excellence in Education, whose primary responsibility was to report on the quality of education in America. In its open letter to the American people, A Nation at Risk: The Imperative for Educational Reform (1983), the commission stated, “…the educational foundations of our society are presently being eroded by a rising tide of mediocrity that threatens our very future as a Nation and a people” (http://www.ed.gov/pubs/NatAtRisk/index.html). Findings from this landmark report led to a flurry of education reform that continues today. Nearly ten years after the Nation at Risk report, President George Bush, Sr. met with the nation’s governors for an Education Summit (1991) that led the way for what later became known as GOALS 2000 (1994). During this meeting, which focused on improving education in America, the need for academic standards to reach far beyond the traditional subject area domains was outlined. Summit participants determined that in order for students to be successful, they must possess skills such as creative thinking, decision making, problem solving, collaboration, and self-management. Students were no longer viewed as passive in their learning, but as active in their construction of knowledge. The 1991 Summit rhetoric evolved into the more current debate that maintains assessments must focus on the application of
student knowledge to performances that take place in “authentic” situations that are relevant and meaningful to the learner.
Shift from Traditional to Performance-Based Assessments

During the reform efforts of the 1980’s and 1990’s, a shift occurred in assessment practices that mirrored views from cognitive psychology suggesting that students learn best when they actively construct their own knowledge. Students were no longer viewed as passive recipients of knowledge but as active participants in the learning process. While traditional methods of assessment often separated process from product, alternative assessment methods were designed to focus on both process and product as components of understanding student learning. In addition, assessment was no longer viewed as a way to document discrete and isolated skills, but rather as a way to facilitate inquiry-based learning (Anderson, 1999). As such, educators became increasingly aware of the need to focus on more so-called “authentic” measures of student learning that would engage students in meaningful assessment tasks. Arguments driving performance-based assessment practices included the notion that performance tasks elicit complex thinking and deeper understanding of content (Baker, 1997). Supporters of authentic measures of performance posited that assessment’s primary purpose is to support the needs of learners (Wiggins, 1990). As a result, educators began moving away from the traditional paper-pencil method of assessing student performance that had dominated the field for over 60 years and, instead, began
relying more heavily on performance assessment techniques requiring observation and professional judgment to make decisions regarding student achievement (Stiggins, 1995). The argument for this shift from the more traditional paper-pencil tests to performance assessments stems from a desire within the education community to measure more adequately the “complex nature of academic achievement” (Stiggins, 1995, p. 138), which often includes complex learning targets such as reasoning and/or the demonstration of a specific skill or ability. Proponents of performance assessment believe their measures to be more “authentic when we directly examine student performance on worthy intellectual tasks” (Wiggins, 1990, p. 1). As such, performance assessments are intended to “represent a set of strategies for the …application of knowledge, skills, and work habits through the performance tasks that are meaningful and engaging to students” (Hibbard et al., 1996, p. 5). As the development of performance measures began to take hold, there were warnings from the assessment community regarding the reliability and validity of poorly constructed subjective measures (Dunbar, Koretz, & Hoover, 1991). These warnings prompted assessment specialists to begin developing “rules of evidence” (Stiggins, 1991) for educators in the field to follow in order to support the quality of the performance assessments implemented. Linn, Baker, and Dunbar (1991) outlined eight criteria to be addressed in order for performance assessments to meet standards of validity. These criteria include the (1) consequences of the assessment, (2) test fairness, (3) transfer of performance to other domains, (4) cognitive complexity of tasks, (5) quality of the task, (6) content
coverage, (7) meaningfulness to students, and (8) efficiency in scoring. Quellmalz (1991) added to these criteria five distinct guidelines for the development of high-quality performance measures, focused on examining the significance of the assessment, the generalizability of results, the developmental appropriateness of the task, the accessibility of the measure, and the utility of results. Most of these guidelines and most of the research on the development of sound performance measures have focused primarily on large-scale assessments. Little research has focused on the performance measures teachers construct for use in their own classrooms, even though researchers suggest that the classroom is “where most all school assessment takes place” (Stiggins, 1991, p. 538). Those in the field of classroom assessment continue to advocate that classroom assessment is perhaps the best way to understand and guide improvements in student learning (Guskey, 2003).
Classroom Assessment

Classroom assessment continues to struggle to find its place within the field of educational measurement. It has yet to be given the attention it needs for the study of relationships across variables, including but not limited to student achievement. The fact that classroom assessment has been missing from much of the measurement research “highlights the difference between the test specialist’s emphasis on scientific measurement and the teacher’s practical measurement needs” (Stiggins & Bridgeford, 1985, p. 272). The discrepancy that exists today between the focus on
large-scale assessment and the lack of focus on classroom assessment may have its roots in the emergence of standardized testing over sixty years ago. As Stiggins (2001) points out, “Professor Robert Scates of Duke University (1943) warned of dire consequences for school effectiveness if we permit the art of classroom assessment to be overwhelmed by the newly (at that time) emerging science of standardized testing” (p. 7). Differences do exist between large-scale, standardized measures and the measures typically employed by teachers in the classroom, including the number of observations, the achievement targets examined, the methods used, and the type of feedback provided (Stiggins, 2001). Unfortunately, the focus placed on standardized assessment has residually impacted classroom assessment by perpetuating the assumption that such assessments have little value in determining improvement in student learning. This narrow focus on the psychometrics of standardized test development has led many to ignore the remaining 99% of assessments a student will encounter in his/her educational career (Stiggins, 1999b).

As we look back over the last century, we see the evolution of schools in which we have permitted sound assessment to remain disconnected from the day-to-day practice of instruction. It is as if someone in the distant past decided that teachers would teach, and they would need to know nothing about accurate assessment. On the other hand, measurement experts would develop and conduct our assessments and would need to know little about the day-to-day life in classrooms or the connection of assessment and instruction (Stiggins, 2001, p. 5).
Adding to the divide between classroom assessment and large-scale, standardized measures is the fact that classroom teachers typically place a higher value on the tests they have constructed themselves as better measures of student performance (Boothroyd, McMorris, & Pruzek, 1991). Teachers often report a high level of confidence in their ability to produce valid and reliable measures (Wise, Lukin, & Roos, 1991), sometimes overestimating their understanding of the skills and abilities associated with quality test construction (Boothroyd et al., 1991). Teachers regularly construct assessments in an effort to measure student progress as well as to determine whether previously taught information has been learned (Brualdi, 1998). In fact, teachers spend approximately 33% of their professional time on the assessment of students in their classrooms (Stiggins, 1991) and construct an average of 54 tests of performance each school year (Marso & Pigge, 1988). However, teachers receive little training in assessment as part of their teacher preparation programs. The majority of states do not require training in assessment as part of the teacher certification process, and pre-service teacher education programs, in turn, often do not require courses in measurement for graduation (Boothroyd et al., 1992; Stiggins, 1991). In 1983, only 22% of states required training in assessment as part of teacher licensure, increasing to only 30% by 1999 (Stiggins, 2001). Even when training in assessment issues is available to teachers, “it typically fails to provide the kinds of knowledge and skills needed to produce assessment literates” (Stiggins, 1991, p. 535) and focuses primarily on large-scale test administration and interpretation of results rather than how to develop quality classroom measures
(Stiggins & Bridgeford, 1985). As Stiggins (2001) points out in his review of the status of classroom assessment, there is little evidence to suggest that much has changed. The impact of such inadequate training in assessment practices has no doubt placed the quality of teacher-made assessments in question. Studies examining the tests that teachers develop have yielded some alarming results. Stiggins (2001) reported on two such studies conducted during the early eighties which suggested that teachers have a difficult time writing test items that require higher-order thinking skills (Carter, 1984) and that teachers tend to favor multiple-choice formats (Fleming & Chambers, 1983), leading to “superficial recall of facts” (p. 8). In response to these concerns, the National Council on Measurement in Education (NCME), along with the National Education Association (NEA) and the American Federation of Teachers (AFT), developed the Standards for Teacher Competence in Educational Assessment of Students (1990). The Standards outline seven skill areas for examining teacher competence in assessment: (1) selecting an appropriate assessment method based on the instructional decisions to be made, (2) developing assessments that appropriately gather the information required to make such instructional decisions, (3) understanding how to administer, score, and interpret the results of both standardized and classroom assessments, (4) using assessment results to make decisions regarding students, the curriculum, and school improvement, (5) developing valid grading procedures, (6) communicating assessment results to students, parents, school personnel as well as those outside of the educational