AAAL 2026: Invited Colloquium
Artificial Intelligence in Applied Linguistics: Applications, Promises, and Challenges
Conveners:
Andrea Révész, University College London
Shungo Suzuki, Lancaster University
Colloquium Abstract
Recent advancements in artificial intelligence (AI) technologies, such as generative AI, have sparked significant interest within the field of applied linguistics. Researchers across various subfields are exploring the potential applications of these technologies, carefully evaluating both the opportunities they offer and the challenges they present. In this colloquium, experts in computational sociolinguistics, corpus linguistics, language teaching, language assessment, and second language acquisition consider the ethical use and/or potential negative consequences of AI in their respective areas. Each paper draws upon the presenters’ contributions published in the latest issue of the Annual Review of Applied Linguistics (ARAL), focusing on relevant AI technologies and their intersections with applied linguistics. In that work, the authors explored the role of AI through theoretical analysis or empirical research, adopting a critical lens to identify constructive and effective pathways for applying AI in applied linguistics.
The colloquium will open with a brief introduction by the conveners, followed by five 15-minute presentations. Each presentation will conclude with the speakers reflecting on the potential applications, promises, and challenges of AI based on their research. The event will end with an open discussion between the audience and the panelists.
Developing and assessing second language listening and speaking: Does AI make it better?
Christine Goh, Nanyang Technological University, Singapore
Vahid Aryadoust, Nanyang Technological University, Singapore
This paper explores the transformative potential of artificial intelligence (AI), particularly generative AI (GenAI), in supporting the teaching, learning, and assessment of second language (L2) listening and speaking. It examines how AI technologies, such as spoken dialogue systems and intelligent personal assistants, can refine existing practices, offer innovative solutions, and address challenges related to spoken language competencies, while also considering the drawbacks these technologies present. The paper highlights the capabilities and limitations of GenAI and its evolving role in language education, and it offers actionable insights for educators and researchers, outlining practical considerations and future research directions for optimizing GenAI integration in the learning and assessment of listening and speaking.
NLP-powered quantitative verification of the English Grammar Profile’s structure level assignment
Daniela Verratti-Souto, University of Tübingen, Germany
Nelly Sagirov, University of Tübingen, Germany
Xiaobin Chen, University of Tübingen, Germany
Since its inception, the Common European Framework of Reference (CEFR) has become increasingly influential in the field of second language (L2) education. In an effort to define the grammatical structures that English learners acquire at each CEFR level, the English Grammar Profile (EGP) provides a list of 1,200 structure-level mappings derived from largely manual analysis of learner corpora. Though highly valuable for the design of didactic materials and examinations, the EGP lacks comprehensive quantitative verification of the acquisition levels it proposes for these grammatical structures. This paper presents an approach for revisiting the EGP structure-level mappings with empirical statistics, using automatic grammatical construction extraction, a large learner corpus, and statistical testing to determine the level of each structure empirically. The mappings resulting from our approach show limited agreement with the original EGP proposals, suggesting that frequency data alone likely do not provide sufficient evidence for the acquisition of the grammatical structures at the levels the EGP proposes.
The capacity of ChatGPT-4 for L2 writing assessment: A closer look at accuracy, specificity, and relevance
Aysel Saricaoglu, University of Ankara, Turkey
Zeynep Bilki, TED University, Turkey
This study examined the capacity of ChatGPT-4 to assess L2 writing in an accurate, specific, and relevant way. Based on 35 argumentative essays written by upper-intermediate L2 writers in higher education, we evaluated ChatGPT-4’s assessment capacity across four L2 writing dimensions: (1) Task Response, (2) Coherence and Cohesion, (3) Lexical Resource, and (4) Grammatical Range and Accuracy. The main findings were: (a) ChatGPT-4 was exceptionally accurate in identifying issues across the four dimensions; (b) ChatGPT-4 demonstrated variability in feedback specificity, providing more specific feedback in Grammatical Range and Accuracy and Lexical Resource but more general feedback in Task Response and Coherence and Cohesion; and (c) ChatGPT-4’s feedback was highly relevant to the criteria in the Task Response and Coherence and Cohesion dimensions, but it occasionally misclassified errors in the Grammatical Range and Accuracy and Lexical Resource dimensions. Our findings contribute to a better understanding of ChatGPT-4 as an assessment tool, informing future research and practical applications in L2 writing assessment.
Automatic scoring of a German written elicited imitation test
Anastasia Drackert, Ruhr-University Bochum, Germany
Ronja Laarmann-Quante, Ruhr-University Bochum, Germany
We present an approach to the automated scoring of a German Written Elicited Imitation Test, designed to assess literacy-dependent procedural knowledge in German as a foreign language. In this test, sentences are briefly displayed on a screen and, after a short pause, test-takers are asked to reproduce each sentence in writing as accurately as possible. Responses are rated on a 5-point ordinal scale, on which grammatical errors typically result in lower scores than lexical errors. We compare a rule-based model, which implements the categories of the scoring rubric through hand-crafted rules, with a deep learning model trained on pairs of stimulus sentences and written responses. Both models achieve promising performance, with quadratically weighted kappa (QWK) values around .87. However, their strengths differ: the rule-based model performs better on previously unseen stimulus sentences and at the extremes of the rating scale, while the deep learning model shows advantages in scoring mid-range responses, for which explicit rules are harder to define.
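For reference, quadratically weighted kappa is standardly defined as

\[
\kappa_w = 1 - \frac{\sum_{i,j} w_{ij}\, O_{ij}}{\sum_{i,j} w_{ij}\, E_{ij}}, \qquad w_{ij} = \frac{(i - j)^2}{(k - 1)^2},
\]

where O_{ij} is the observed proportion of responses scored i by one rater (here, the model) and j by the other (the human rating), E_{ij} is the proportion expected by chance from the marginal score distributions, and k is the number of scale points (here, k = 5). The quadratic weights penalize large disagreements more heavily than adjacent-score disagreements, so QWK values around .87 indicate strong model-human agreement on the 5-point scale.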
Ethical AI for language assessment: Principles, considerations and emerging tensions
Carla Pastorino Campos, Cambridge University Press and Assessment
Evelina Galaczi, Cambridge University Press and Assessment
Many language assessments – particularly those considered high-stakes – have the potential to significantly impact a person’s educational, employment, and social opportunities, and they should therefore be subject to ethical and regulatory considerations regarding their use of AI in test design, development, delivery, and scoring. It is timely and crucial that the community of language assessment practitioners develop a comprehensive set of principles to ensure ethical practices in their domain as part of a commitment to relational accountability.
In this paper, we contextualise the debate on ethical AI in L2 assessment within global policy documents and identify a comprehensive set of principles and considerations that pave the way for a shared discourse underpinning an ethical approach to the use of AI in language assessment. Critically, we advocate for an “ethical-by-design” approach in language assessment that advances core ethical values, balances inherent tensions, mitigates associated risks, and promotes ethical practices.