

Recent educational policies have set out that Flemish education will implement large-scale, government-led standardized testing, including Dutch reading comprehension and writing. This study focuses on the possibilities of Automated Essay Scoring (AES) to consistently, fairly, and practically assess a large number of writing products. AES systems automatically score a text using machine learning and by extracting linguistic characteristics from the text (Allen et al., 2016). While research has shown that AES can be used to assess writing, previous research has focused almost exclusively on one language (English) and one genre (essays), mostly written for higher education purposes (Strobl et al., 2019). This exploratory study investigates the possibilities of AES for obtaining reliable scores for Dutch-speaking learners in the first stage of secondary education. A corpus of 5,110 writing products by 2,613 pupils aged 13-14, based on six prompts, was holistically scored by 852 in-service and pre-service teachers using pairwise comparison. This assessed corpus was used to train machine learning models and thereby create a first AES system for assessing the Dutch writing products of learners in the first stage of secondary education. We experimented with two flavours of machine learning: a traditional feature-based approach and a deep learning one. For the first approach, all writing was processed with T-Scan (Pander Maat et al., 2014) to derive linguistic text characteristics, including lexical and syntactic measures. The second, deep learning approach relies on Dutch state-of-the-art pre-trained language models (Delobelle et al., 2020), which were fine-tuned on the task of AES. The predictions of the AES system will be discussed in relation to previous studies that have investigated the role of AES in assessing writing in education, as well as implications for the use of AES in the context of language testing, with particular attention to the challenges of working with data from young learners.
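As a rough illustration of what such a feature-based pipeline could look like, the sketch below assumes the T-Scan output has been exported to a CSV with one row per text and a holistic score column; the file name, column names, and the ridge regression model are placeholders for illustration, not the study's actual setup.

```python
# Minimal sketch of a feature-based AES pipeline (illustrative only).
# Assumes T-Scan features were exported to a CSV with one row per text,
# feature columns, and a holistic score column; all names are hypothetical.
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

df = pd.read_csv("tscan_features.csv")                      # hypothetical export
feature_cols = [c for c in df.columns if c not in ("text_id", "score")]
X, y = df[feature_cols], df["score"]                         # lexical/syntactic measures -> holistic score

# Standardize the features and fit a regularized linear model on top of them.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
print(cross_val_score(model, X, y, cv=5, scoring="r2"))
```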

The Problem

One of the frequent criticisms of automated essay scoring is that engines do not understand language and can therefore be 'tricked' into giving higher scores than they should. Engines have been found to be susceptible to such gaming responses, although the impact varies by item and engine design. Additionally, automated scoring of essays can be viewed negatively by the public, in part because of how engines identify and score unusual responses. As a result, almost every operational automated essay scoring engine uses filters to identify aberrant responses, either flagging them as such or routing them for human review and scoring. At the same time, the state of the art in machine learning scoring has evolved in recent years to achieve gains in accuracy on a number of predictive tasks. While older models used feature-based approaches, in which experts wrote algorithms to create features thought relevant to item scoring and scores were predicted from weights applied to those features, newer approaches learn features alongside the predictive model using very large, multi-layered neural networks (often called deep learning). Importantly, these models are designed to include sequence – i.e., word order – in the modelling process and are therefore thought to model language better than bag-of-words methods. One potential promise of deep learning models is that they are more robust to gaming behaviors because they consider word use in context, and therefore may require fewer filters, or none at all.
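The sketch below illustrates the general idea of such pre-scoring filters with a few hypothetical heuristics (minimum length, repetition, vocabulary overlap with the prompt); the rules and thresholds are invented for illustration and do not describe any particular engine.

```python
# Illustrative sketch of the kind of pre-scoring filters an AES engine might apply
# to flag aberrant responses; all rules and thresholds here are hypothetical.
from collections import Counter

def flag_aberrant(response: str, prompt_words: set[str],
                  min_words: int = 50, max_repeat_ratio: float = 0.30,
                  min_overlap: float = 0.05) -> list[str]:
    """Return a list of reasons the response might be routed to human review."""
    words = response.lower().split()
    flags = []
    if len(words) < min_words:
        flags.append("too_short")
    if words:
        most_common_count = Counter(words).most_common(1)[0][1]
        if most_common_count / len(words) > max_repeat_ratio:
            flags.append("excessive_repetition")
        overlap = len(set(words) & prompt_words) / len(set(words))
        if overlap < min_overlap:
            flags.append("possibly_off_topic")
    return flags

# Responses with any flag would be held back from fully automated scoring.
print(flag_aberrant("the the the the", {"essay", "school", "writing"}))
```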

Solution

Given these developments, this study sought to examine the robustness of one deep learning method, relative to a traditional automated scoring method, on a set of gaming responses. The deep learning method used a model called BERT (Bidirectional Encoder Representations from Transformers), which was initially trained to predict a masked word and to predict whether a sentence followed a prior sentence. BERT was then fine-tuned on sets of essays in order to classify, or predict, rubric-based scores.
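A minimal sketch of this kind of fine-tuning, using the Hugging Face transformers library, is shown below; the checkpoint name, label count, data, and hyperparameters are placeholders rather than the study's actual configuration.

```python
# Minimal sketch of fine-tuning BERT to classify essays into rubric score points.
# Checkpoint, label count, data, and hyperparameters are placeholders.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
from datasets import Dataset

essays = ["First example essay ...", "Second example essay ..."]
scores = [2, 4]                      # rubric-based scores treated as class labels
num_labels = 6                       # e.g., a 0-5 rubric (assumed)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=num_labels)

def tokenize(batch):
    # Truncate/pad essays to BERT's maximum input length.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=512)

ds = Dataset.from_dict({"text": essays, "labels": scores}).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="aes-bert", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=ds,
)
trainer.train()                      # predicted class = predicted rubric score
```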
