- Type: Single Paper
- Time: 09:00 - 10:30
- Room: SM O1.11 (Lecture Room)
Session Information
#Diff2Score - Identifying textual characteristics of "Difficult-to-Score texts"
Abstract
Difficult-to-score texts are texts that reduce inter-rater agreement (Wolfe et al., 2016) or show poor model-fit statistics at the essay level (Wind et al., 2017). In this study, we follow the second approach and ask: To what degree are textual characteristics of L1 German texts associated with poor rating quality?

To investigate textual characteristics, we measure, for example, text length and lexical diversity (Wolfe et al., 2016; Freunberger et al., 2018). To investigate rating quality, we use a variation of the many-facet Rasch model (MFRM) described by Eckes (2005), integrating raters, criteria, prompts, and text types as facets. The model-fit statistics are interpreted as indicators of rating quality and used in a correlational analysis with the measures of essay characteristics. All analyses are run in R.

Data stem from a nationwide Austrian writing assessment. As all fourth graders produced handwritten texts in their L1 (Austrian German), all texts had to be digitized. In this study, 186 student texts responding to eight prompts across four text types (e.g., descriptive texts) were scored by a panel of 161 trained raters. Each rater scored three texts with a text-type-specific rating scale covering criteria in four dimensions (e.g., structure).

To date, manual error correction has been completed and textual characteristics have been measured. Preliminary results indicate substantial variation in text length, with an average length of 105 words and a range of 41 to 336 words; our presentation will report further results. Findings may improve criteria-based feedback in schools and inform the design of future rater training programs in assessments.

Eckes, T. (2005). Evaluation von Beurteilungen. Psychometrische Qualitätssicherung mit dem Multifacetten-Rasch-Modell. Zeitschrift für Psychologie, 213(2), 77–96.
Freunberger, R., Breit, S. & Illetschko, M. (2018). Beurteilerübereinstimmung und schwer zu beurteilende Texte im Vergleich. In G. Sigott (Ed.), Language Testing in Austria: Taking Stock. Lang, 373–388.
Wind, S. A., Stager, C. & Patil, Y. J. (2017). Exploring the relationship between textual characteristics and rating quality in rater-mediated writing assessments. Assessing Writing, 34, 1–15.
Wolfe, E. W., Song, T. & Jiao, H. (2016). Features of difficult-to-score essays. Assessing Writing, 27, 1–10.
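The two textual characteristics named in the abstract above, text length and lexical diversity, can be illustrated with a minimal Python sketch. This is not the authors' code (the abstract states that their analyses are run in R), and the abstract does not say which lexical diversity measure is used; the simple type-token ratio below is an assumption chosen for illustration, and the function names and the sample sentence are hypothetical.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split a text into lowercase word tokens.

    In Python 3, \\w matches Unicode letters, so German umlauts and ß
    are kept inside tokens.
    """
    return re.findall(r"\w+", text.lower())


def text_length(text: str) -> int:
    """Text length measured as the number of word tokens."""
    return len(tokenize(text))


def type_token_ratio(text: str) -> float:
    """Distinct word forms (types) divided by all word tokens.

    Assumption: a plain type-token ratio stands in for 'lexical
    diversity'; the study may use a different measure.
    """
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical sample sentence, not taken from the assessment data.
sample = "Der Hund läuft und der Hund bellt"
length = text_length(sample)          # 7 tokens
diversity = type_token_ratio(sample)  # 5 types / 7 tokens
```

The plain type-token ratio tends to fall as texts get longer, so with texts ranging from 41 to 336 words a length-corrected measure would probably be preferable in practice.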
How context and purpose shape assessment: methodological considerations for measuring text quality
Abstract
This paper argues that methods for measuring text quality in writing research should be anchored in the specific context and intended purpose of the stakeholders participating in the respective project. Project context and purpose can lead to different priorities and weightings for aspects such as construct validity, efficiency, and the amount of pedagogical information gained (Knoch, 2021; Weigle, 2002). We will show how we designed assessments for three projects, discussing the advantages and disadvantages of the methods in relation to the context, the stakeholders’ goals, and the effect of the studies on writing practices.

In the first project, we combined human rating and corpus-based assessment to create writing ability profiles in vocational schools, providing teachers with data-informed pedagogical recommendations (Konstantinidou & Liste Lamas, 2023). In the second study, we conducted an intervention to measure the effectiveness of scenario-based reading and writing education in vocational schools. Text quality was assessed using human rating and consensus scoring (Konstantinidou et al., 2022). In the third project, we developed a diagnostic writing test for engineering students. Based on the results, students with weak written communication skills are advised to take additional communication courses. Assessment relied on machine-learning methods using linguistic features from corpora and AI applications that explain human ratings.

While the first study prioritised the quantity of information obtained, the second prioritised validity. The third project focused on efficiency, as more than 700 students are tested twice a year.

Reflecting on the assessment methods in their specific contexts should contribute to the design of text quality assessments that are informed by context and purpose, especially in research projects with implications for writing practice.

Konstantinidou, L. & Liste Lamas, E. (2023). Schreibkompetenz-Profile in der beruflichen Bildung: heterogen, individuell und schwer interpretierbar? Osnabrücker Beiträge zur Sprachtheorie, 101, 133–150.
Konstantinidou, L., Madlener-Charpentier, K., Opacic, A., Gautschi, C. & Hoefele, J. (2022). Literacy in vocational education and training: scenario-based reading and writing education. Reading and Writing, 36(4), 1025–1052.
Knoch, U. (2021). Assessing writing. In G. Fulcher & L. Harding (Eds.), The Routledge handbook of language testing (2nd ed., pp. 236–253). Routledge.
Weigle, S. C. (2002). Assessing Writing. Cambridge University Press.
Monitoring Rater-reliability in Decentralized Organizations
Abstract
Reliability relates to the fairness and consistency of assessment. With 158 Goethe Institutes in 98 countries and 390 exam partners administering exams, the question arises of how to design a suitable human resource development program for raters and how to manage the quality of rating and grading in the test section “Writing” within a decentralized system of approximately 5,000 trained raters worldwide. As each test taker’s performance is rated independently by two raters on site, inter-rater reliability, that is, the consistency between the two raters, needs to be ensured. Without training, the rating and grading of the same student performances varies widely (Weiss, 1965; Birkel & Birkel, 2002). Lumley (2005) even claims that rater training, rather than the rating criteria, is at the heart of correct assessment, because the rater is central to the rating process. Whether the rating scale or the criteria are adequate, or whether a fair grade was given, is not at issue here. Rather, the issue is: How reliably do raters apply a given rating scale? As a measure of agreement among different raters on the same sample, several concordance coefficients can be determined. To exemplify the methodology, the following null hypothesis can be deduced:

H0: The inter-rater reliability of two trained raters for each exam administration is insufficient if the respective value is equal to or smaller than a predetermined threshold value.

As the Goethe-Institut’s rating scales are criterion-based and either ordinal or interval scales, the null hypothesis is tested and checked for robustness by analyzing five concordance coefficients, with a view to generalizability theory. The study used the Goethe-Zertifikat B1 at selected test centres as an example.
The initial results are very satisfactory: inter-rater reliability was substantial, as evidenced by Krippendorff’s alpha (α = .848), the intraclass correlation (ICC(2) = .83), and Spearman’s rank correlation (ρ = .85). Cohen’s kappa indicated moderate agreement (κ = .527), whereas Gwet’s AC2 suggested almost perfect agreement (AC2 = .90). Further details will be provided in the full analysis.
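The gap between Cohen’s kappa and Gwet’s coefficient reported above can be illustrated with a small Python sketch. It is not the study’s analysis: the ratings are invented, the function name is hypothetical, and it computes unweighted Cohen’s kappa and Gwet’s AC1 (the unweighted relative of the AC2 named in the abstract). It only shows, under these assumptions, how the two coefficients correct for chance agreement differently.

```python
from collections import Counter


def kappa_and_ac1(rater1: list, rater2: list) -> tuple[float, float]:
    """Return (Cohen's kappa, Gwet's AC1) for two raters, unweighted."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(rater1)
    categories = sorted(set(rater1) | set(rater2))
    q = len(categories)
    if q < 2:
        raise ValueError("need at least two observed rating categories")

    # Observed agreement: share of performances with identical ratings.
    p_obs = sum(a == b for a, b in zip(rater1, rater2)) / n

    c1, c2 = Counter(rater1), Counter(rater2)

    # Kappa: chance agreement from the product of each rater's marginals.
    p_chance_kappa = sum((c1[k] / n) * (c2[k] / n) for k in categories)

    # AC1: chance agreement from the average marginal share per category.
    pi = {k: (c1[k] + c2[k]) / (2 * n) for k in categories}
    p_chance_ac1 = sum(pi[k] * (1 - pi[k]) for k in categories) / (q - 1)

    kappa = (p_obs - p_chance_kappa) / (1 - p_chance_kappa)
    ac1 = (p_obs - p_chance_ac1) / (1 - p_chance_ac1)
    return kappa, ac1


# Invented pass/fail ratings with a strongly skewed distribution:
# the raters agree on 8 of 10 performances, all of them "pass" (1).
rater_a = [1] * 8 + [1, 0]
rater_b = [1] * 8 + [0, 1]
kappa, ac1 = kappa_and_ac1(rater_a, rater_b)
```

With these invented ratings the raters agree on 80% of performances, yet kappa is slightly negative while AC1 is about 0.76. Skewed rating distributions inflate kappa’s chance-agreement term, which is one possible reason the two coefficients can diverge on the same data.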