A comparison of automated scoring engines and human raters on the assessment of English essay writing

Chan, Kin Yee (2012) A comparison of automated scoring engines and human raters on the assessment of English essay writing. PhD thesis, James Cook University.

View at Publisher Website: https://doi.org/10.25903/n4vp-4087


Essay scoring operates both in the classroom and in high-stakes testing, and the results of essay scoring in high-stakes assessment affect students' academic development. As a result, teachers, students and parents in the Hong Kong educational system are under considerable pressure.

This research investigates how effectively a new Automated Essay Scoring (AES) system, the Lexile Analyzer, measures essay writing ability as scored by trained, professional human raters. To do this, rater, genre, student and rubric effects must first be estimated. Then, the estimates from the recently developed Lexile Analyzer are modeled against those of the Intelligent Essay Assessor (IEA), an AES system widely used in high-stakes testing in the United States. Finally, a concurrent co-calibration of the scores of human raters, the Lexile Analyzer and IEA is undertaken.

Automated Essay Scoring (AES) employs computer technology to evaluate and score written prose in place of the usual human grading. It has become more widely accepted as an educational technology for both assessment and classroom instruction, but little research has been done on validating the performance of AES models in an applied educational context, and only some foundational work has been performed on their calibration against human ratings (Chung & Baker, 2003). This research applies the Rasch model to co-calibrate the scales of AES systems and human raters for grading essay writing. It also uses Many-facets Rasch Measurement (MFRM) to examine the fairness and consistency of raters and to adjust students' final essay scores for the measured differences in prompt difficulty and rater severity. Rasch measurement models can provide a framework for obtaining objective and fair measurements of writing ability which are statistically invariant over raters, writing tasks and other aspects of the writing assessment process.
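As an illustration of how such an adjustment can work, a common rating-scale formulation of the many-facet Rasch model sets the log-odds of a score in category k rather than k-1 equal to student ability minus prompt difficulty minus rater severity minus a category threshold. The Python sketch below assumes that formulation; the abstract does not state the exact model specification used in the thesis, and the function names and all numeric values are hypothetical.

```python
import math


def mfrm_category_probs(ability, prompt_difficulty, rater_severity, thresholds):
    """Probabilities of score categories 0..K under a rating-scale
    many-facet Rasch model (an assumed, textbook-style formulation).

    All parameters are in logits. `thresholds` holds the K category
    thresholds F_1..F_K; category 0 is the reference category.
    """
    # Combined location of this student/prompt/rater encounter.
    eta = ability - prompt_difficulty - rater_severity

    # Log-numerator of category k is the sum over j <= k of (eta - F_j);
    # category 0 has log-numerator 0.
    log_num = [0.0]
    for f in thresholds:
        log_num.append(log_num[-1] + eta - f)

    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_num)
    weights = [math.exp(v - m) for v in log_num]
    total = sum(weights)
    return [w / total for w in weights]


def expected_score(ability, prompt_difficulty, rater_severity, thresholds):
    """Model-expected rating for one student/prompt/rater combination."""
    probs = mfrm_category_probs(
        ability, prompt_difficulty, rater_severity, thresholds
    )
    return sum(k * p for k, p in enumerate(probs))
```

Under this sketch, the same student on the same prompt receives a lower expected rating from a more severe rater, which is the kind of difference an MFRM analysis estimates and then removes when reporting adjusted scores.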

In this research, there are two sets of essay writing data: one from the United States and a second from Hong Kong. All the students were administered writing prompts from the United States National Assessment of Educational Progress (NAEP), and the essays were scored by human raters using the NAEP holistic essay marking rubrics covering the narrative, informative and persuasive genres. For the United States data set, 3453 essays were collected from 589 students in Grades 4 to 12 in one school district in north-central Mississippi. These students were each administered four to six writing prompts. Every essay was rated by four of the nineteen independent, paid, trained raters from an established US testing company. Moreover, a sub-sample of the United States essays, selected to be the best match to the essay writing levels of the Hong Kong student sample, was scored by two of the four experienced Hong Kong teachers of English from one local primary school. For the Hong Kong data set, 408 essays were written by 137 Grade 6 Hong Kong students from the same local primary school as the Hong Kong raters. These students were each administered two to three writing prompts, and each essay was scored against the same rubric by two of the four Hong Kong English teachers. The US essays were scored using the Lexile Analyzer and IEA scoring engines; the HK essays were scored by the Lexile Analyzer alone.

The MFRM analyses revealed that all human raters, including the 19 raters from the United States and the 4 Hong Kong teachers of English, scored essays consistently, but the two groups of raters scored essays very differently from each other, according to the infit and outfit statistics. Moreover, the Lexile Analyzer and IEA scored consistently with the trained and professional human raters in this research. For the essay writing performance of students, most of the misfitting cases were students who performed erratically, produced off-topic essays or wrote essays that were scored by the most severe or lenient rater(s) in this research. For example, some students misinterpreted the instructions of a prompt; as a result, their ratings for that prompt were far worse than their average performance on the other prompts.
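For readers unfamiliar with the infit and outfit statistics mentioned above, the following Python sketch shows how the two mean-square fit statistics are conventionally computed from observed ratings, model-expected ratings and model variances. It is a minimal illustration of the standard definitions, not the thesis's own analysis code; the function name and the input lists are hypothetical, and the expected values and variances are assumed to come from an already fitted Rasch model.

```python
def fit_mean_squares(observed, expected, variance):
    """Return (infit, outfit) mean-square statistics for one rater,
    student or prompt.

    observed: ratings actually awarded
    expected: model-expected ratings for the same observations
    variance: model variance of each observation (must be positive)
    """
    if not (len(observed) == len(expected) == len(variance)):
        raise ValueError("all inputs must have the same length")

    # Squared residual between each observed and expected rating.
    sq_resid = [(x - e) ** 2 for x, e in zip(observed, expected)]

    # Outfit: unweighted mean of squared standardized residuals,
    # so it is sensitive to isolated, highly unexpected ratings.
    outfit = sum(r / w for r, w in zip(sq_resid, variance)) / len(observed)

    # Infit: information-weighted mean square, which gives more weight
    # to observations where the model variance is larger.
    infit = sum(sq_resid) / sum(variance)

    return infit, outfit
```

Both statistics have an expected value of 1.0 when the data fit the model. Values well above 1.0 indicate more erratic ratings than the model predicts, such as an off-topic essay or a misread prompt, while values well below 1.0 indicate ratings that are more predictable than expected. The cut-off values used to flag misfit vary between studies and are not stated in this abstract.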

This research was conducted to calibrate the reliability of a new AES system, so that teachers, students and parents are provided with insight into the nature of the essay scoring process. Most importantly, the research probes issues about the fairness and consistency of essay scoring when the Lexile Analyzer is used as an assessment tool. It also highlights the time- and stress-saving potential of this tool for Hong Kong language teachers, by overcoming problems of validity and reliability and by encouraging students' independence during their English language learning.

The results might also be significant for teachers, students and parents elsewhere. The limitations of this research relate to the need for a larger set of data on student essays and on the scores given by human raters and AES systems. If a standard scale to measure the relative validity and reliability of the various AES systems and human raters is established, future research might provide a reference guide for the stakeholders who are involved in English essay writing assessment.

Item ID: 23841
Item Type: Thesis (PhD)
Keywords: AES systems; Automated Essay Scoring systems; automated essay scoring; computer essay scoring; educational assessment tools; educational assessment; educational systems; English essay assessment; essay scoring; high-stakes assessment; high-stakes testing; high-stakes tests; IEA; Intelligent Essay Assessor (IEA); Lexile Analyzer; Many-facets Rasch Measurement; MFRM
Date Deposited: 14 Nov 2012 01:31
FoR Codes: 13 EDUCATION > 1302 Curriculum and Pedagogy > 130204 English and Literacy Curriculum and Pedagogy (excl LOTE, ESL and TESOL) @ 100%
SEO Codes: 93 EDUCATION AND TRAINING > 9302 Teaching and Instruction > 930203 Teaching and Instruction Technologies @ 100%