- Type: Single Paper
- Date: Tuesday June 2, 2026
- Time: 11:00 - 12:30
- Room: SM O1.11 (Lecture Room)
Session Information
Evaluating Writing Quality of Engineering Student Reports using Natural Language Processing Tools
Abstract
Research topic, area of investigation and aim
In higher education, writing instructors evaluate the quality of student texts and provide formative feedback on their writing. This laborious work could be supported by automatic Natural Language Processing (NLP) tools. Much research on the relationship between the indices produced by NLP tools and writing quality has focused on essay writing. However, little research has explored report writing in science and engineering domains. To address this gap, this study investigates the association between NLP indices and holistic human ratings of academic reports written by English as a Second Language (ESL) students in a master's level computer science course.

Methodological design
The data consist of more than 100 academic reports (average length approx. 2800 words, excluding references), which were evaluated by writing instructors. Multiple regression analyses were conducted to identify NLP indices that predict the holistic instructor ratings of student reports.

Findings
The preliminary findings indicate that a regression model combining TAACO (Crossley et al., 2019), TAALED (Kyle et al., 2021), TAALES (Kyle et al., 2018) and TAASSC (Kyle, 2016) indices explains nearly 45% of the variance in holistic ratings.

Relevance to domain of writing
The findings of this study extend earlier writing research to a new context and genre, i.e., longer engineering texts, and offer insights into the usability of NLP tools in writing instruction.

References
Crossley, S. A., Kyle, K., & Dascalu, M. (2019). The Tool for the Automatic Analysis of Cohesion 2.0: Integrating Semantic Similarity and Text Overlap. Behavior Research Methods 51(1), pp. 14-27. https://doi.org/10.3758/s13428-018-1142-4
Kyle, K. (2016). Measuring Syntactic Development in L2 Writing: Fine Grained Indices of Syntactic Complexity and Usage-Based Indices of Syntactic Sophistication. Dissertation, Georgia State University. https://doi.org/10.57709/8501051
Kyle, K., Crossley, S. A., & Berger, C. (2018). The Tool for the Analysis of Lexical Sophistication (TAALES): Version 2.0. Behavior Research Methods 50(3), pp. 1030-1046. https://doi.org/10.3758/s13428-017-0924-4
Kyle, K., Crossley, S. A., & Jarvis, S. (2021). Assessing the Validity of Lexical Diversity using Direct Judgements. Language Assessment Quarterly 18(2), pp. 154-170. https://doi.org/10.1080/15434303.2020.1844205
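The multiple regression step described in the abstract above can be illustrated with a minimal sketch. It is not the authors' analysis: the two predictor columns (a cohesion index and a lexical sophistication index), the toy ratings, and the function names `fit_ols` and `r_squared` are all hypothetical, chosen only to show how a "proportion of variance explained" figure is obtained from an ordinary least squares fit.

```python
def fit_ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    X is a list of rows of predictor values; an intercept column is added here.
    Returns [intercept, coef_1, coef_2, ...].
    """
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def r_squared(X, y, coefs):
    """Share of the variance in y accounted for by the fitted model."""
    preds = [coefs[0] + sum(c * x for c, x in zip(coefs[1:], row)) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - p) ** 2 for yi, p in zip(y, preds))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy data: [cohesion_index, lexical_sophistication_index] per report.
indices = [[0.2, 1.0], [0.4, 2.0], [0.6, 1.5], [0.8, 3.0], [1.0, 2.5], [0.5, 2.2]]
ratings = [2.0, 3.5, 3.0, 5.0, 4.0, 3.0]  # hypothetical holistic instructor ratings

coefs = fit_ols(indices, ratings)
print("coefficients:", coefs)
print("R^2:", r_squared(indices, ratings, coefs))
```

An R^2 of 0.45 on real data would correspond to the "nearly 45% of the variance" reported in the abstract. With many candidate indices and only around 100 reports, an adjusted or cross-validated R^2 would usually be reported as well.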
RATE THE RATER - Rater Agreement in English and German Text Assessments
Abstract
Grades play a crucial role in shaping students’ academic paths, influencing their self-confidence, future educational opportunities, and career prospects. Given this significance, it is essential to ensure that marking practices are fair, consistent, and reliable (Grausam, 2018; McNamara, Knoch, & Fan, 2019; Kunnan, 2000; Xi, 2010). This article investigates rater behaviour in the context of standardized competence assessment conducted by the Federal Institute for Quality Assurance in the Austrian School System (IQS) in Austrian secondary schools, focusing on the evaluation of written texts in English and German collected as part of the 2025 IKMPLUS assessments. The analysis combines evaluations of percentage agreement on multiply rated texts with statistical indices such as Cohen’s Kappa and intraclass correlation to quantify consistency and detect systematic rater effects. Additionally, the study explores how demographic and professional characteristics relate to rating accuracy and rater effects. Preliminary findings reveal that rater agreement on assigned marks falls below 80% for some texts, even with structured training, detailed rating guides, and expert support. While this may appear concerning, it reflects a well-documented international challenge: writing tasks are inherently complex to assess, and inter-rater reliability often remains problematic despite analytic or holistic scoring systems (Schipolowski & Böhme, 2016; Bouwer et al., 2024). Many-facet Rasch analyses confirm persistent rater effects such as severity, leniency, and central tendency bias, which can compromise fairness (Wind & Guo, 2021; Li, 2022). Importantly, the IQS addresses these challenges proactively. The IKMPLUS framework incorporates rigorous quality assurance measures and applies statistical scaling to compensate for rater variability, ensuring that reported results remain fair and comparable across students. 
These high standards position Austria among systems that prioritize equity and validity in large-scale assessments. Nevertheless, the findings have implications for classroom practice. Teachers often rely on non-standardized criteria and come from diverse training backgrounds, which may lead to inconsistencies in everyday grading. In subjects like German and English, where written performance is central, this raises questions about the validity of marks used for high-stakes decisions. Aligning classroom assessment practices more closely with standardized approaches – through updated training, clearer rubrics, and collaborative moderation – could strengthen fairness and transparency.
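Two of the agreement measures named in the abstract above, percentage agreement and Cohen's Kappa, can be sketched for a pair of raters as follows. This is an illustrative sketch under stated assumptions, not the IQS procedure: the marks are invented, and the function names `percent_agreement` and `cohens_kappa` are hypothetical. Kappa is computed in its standard unweighted form, (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each rater's mark distribution.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Share of texts on which both raters assigned the same mark."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's Kappa for two raters marking the same texts.

    Undefined when chance agreement p_e equals 1 (both raters always use
    one and the same mark); this sketch does not handle that case.
    """
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    # Chance agreement: product of the two raters' marginal proportions.
    p_e = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical marks (1-5) given by two raters to the same ten texts.
marks_a = [1, 2, 3, 3, 2, 1, 4, 1, 2, 5]
marks_b = [1, 2, 3, 2, 2, 2, 4, 1, 3, 5]

print("percent agreement:", percent_agreement(marks_a, marks_b))
print("Cohen's kappa:", cohens_kappa(marks_a, marks_b))
```

Because Kappa discounts agreement that would occur by chance, it is typically lower than raw percentage agreement for the same pair of raters. For ordinal marks, a weighted Kappa or an intraclass correlation would additionally take into account how far apart disagreeing marks are.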