Let's automate: Natural language processing tools and their applications

Organizer: Scott Crossley, Georgia State University

Abstract

Natural Language Processing (NLP) focuses on computer programs that analyze large corpora of natural language data for linguistic features (e.g., lexical, syntactic, and cohesion features). The use of NLP in applied linguistics research is steadily increasing as larger corpora and more effective and robust NLP tools become available. This colloquium brings together specialists in the development and application of NLP tools that specifically tackle issues related to language acquisition, use, and education. The colloquium will provide an overview of current trends and themes in NLP from a variety of linguistic perspectives, introduce available NLP tools, and discuss their use in language research.

For instance, Scott Jarvis will discuss how a new NLP tool that measures lexical diversity (LD) can contribute to the construct definition of LD for both first and second language (L2) writers. Kristopher Kyle’s presentation will provide an overview of lexical sophistication features that can be used to predict L2 lexical proficiency, writing quality, and speaking proficiency, as well as introduce a new NLP tool to assess lexical sophistication. Lastly, Xiaofei Lu will discuss how new NLP approaches to syntactic complexity can incorporate functionally appropriate uses of linguistic features within writing contexts.

Overall, this colloquium will provide a synopsis of current NLP trends and tools in applied linguistics and will address how NLP tools can be used to assess language constructs, the functional effectiveness and limitations of these tools, their implications for language teaching, and the opportunities they offer for future applied linguistics research.

Presenter: Scott Jarvis, University of Utah
Title: Automated tools for investigating lexical diversity: Exploring what writers do differently when they try to increase their LD

Lexical diversity (LD) refers to the variety of words found in samples of speech and writing. LD is of interest to applied linguists because it has been found to serve as a useful proxy for constructs such as language ability (Yu, 2010) and language dominance (Treffers-Daller, 2009). However, existing measures of LD have important shortcomings, and recent research has concentrated on defining the construct (Jarvis, 2013a, 2017) and developing and validating measures that are consistent with the construct definition (Fergadiotis, Wright, & West, 2013; Jarvis, 2013b). The purpose of the present paper is twofold: (a) to contribute to the construct definition of LD by determining which lexical properties of a text change when writers intentionally try to increase its LD, and (b) to introduce a new, automated LD tool that is particularly suited to this purpose.

The writers of the texts analyzed in this study included 26 students (15 native and 11 nonnative) at a Midwestern American university. They were shown an eight-minute segment of a silent Chaplin film and asked to write a description of the film in English. Then, their essays were collected and they were asked to rewrite the same story while deliberately trying to increase its LD. The term lexical diversity was not defined for them; they were asked to write in accordance with however they understood this term. The texts were later lemmatized, tagged, and analyzed with the use of the new LD tool. The results show that only about half of the LD-enhanced texts have more tokens or types than their original counterparts, whereas nearly all of them use less frequent or more specific words than the original texts, and several also involve nominalizations and alternative grammatical constructions. Importantly, the study highlights the relevance of meaning and grammar to the construct of LD.
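
The token and type counts mentioned above can be illustrated with a minimal sketch. This is not the LD tool introduced in the paper, whose internals are not described here; it only shows how tokens, types, and a simple type-token ratio might be counted. The function name `type_token_counts`, the regex tokenizer, and both sample texts are assumptions made for illustration, and no lemmatization or tagging is performed.

```python
import re


def type_token_counts(text):
    """Return (token count, type count, type-token ratio) for a text.

    Assumption: tokens are lowercase runs of letters and apostrophes.
    The study lemmatized and tagged its texts; this sketch does not.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    types = set(tokens)
    ttr = len(types) / len(tokens) if tokens else 0.0
    return len(tokens), len(types), ttr


# Invented sample texts, not data from the study.
original = "The man walks. The man sees the dog. The dog runs."
revised = "A tramp strolls along. He notices a stray hound, which bolts."

for label, text in (("original", original), ("revised", revised)):
    n_tokens, n_types, ttr = type_token_counts(text)
    print(f"{label}: tokens={n_tokens}, types={n_types}, TTR={ttr:.2f}")
```

In this invented pair, both versions have the same number of tokens, but the revised version has more types. Counts like these would not register a shift toward less frequent or more specific words, which is the kind of change the study reports in nearly all of the rewritten texts.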

Presenter: Kristopher Kyle, University of Hawai’i at Manoa
Title: Automatically assessing multiple features of lexical sophistication with TAALES

Lexical sophistication is commonly understood as the use of “advanced” words. Sophistication has most often been defined with regard to the proportion of infrequent words in a text (e.g., Laufer & Nation, 1995; Read, 2000), under the generally accepted hypothesis that highly frequent words will be learned earlier and more easily than less frequent words (e.g., Ellis, 2002). While frequency is undoubtedly an important feature of sophistication, a number of recent studies have demonstrated that lexical sophistication is most accurately modeled when multiple complementary features are used (e.g., Kim, Crossley, & Kyle, 2018; Kyle & Crossley, 2015; Kyle, Crossley, & Berger, 2018). Automated text analysis tools such as the Tool for the Automatic Analysis of Lexical Sophistication (TAALES) have facilitated these multivariate approaches.
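
The frequency-based definition described above can be sketched as the proportion of tokens that fall outside a list of high-frequency words. This is a simplified illustration only and does not use or reproduce TAALES. The word list `FREQUENT_WORDS`, the function name `infrequent_proportion`, and the sample sentence are hypothetical; a real analysis would draw frequency information from a reference corpus.

```python
import re

# Hypothetical stand-in for a high-frequency word list; a real study
# would derive this from a reference corpus.
FREQUENT_WORDS = {"the", "a", "man", "dog", "sees", "walks", "and", "he"}


def infrequent_proportion(text, frequent_words=FREQUENT_WORDS):
    """Return the share of tokens that are not in the frequent-word list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    infrequent = [t for t in tokens if t not in frequent_words]
    return len(infrequent) / len(tokens)


# Invented sample sentence, not data from any study.
print(infrequent_proportion("The tramp notices the hound"))
```

A single index of this kind captures only frequency. The multivariate approaches cited above combine it with complementary features.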

In this presentation, a review of recent literature will highlight the importance of a number of features of lexical sophistication in predicting second language (L2) productive lexical proficiency, writing quality, and speaking proficiency. Particular focus will be given to highlighting the features of contextual distinctiveness, lexical access/entrenchment, psycholinguistic word features (e.g., concreteness), semantic networks (e.g., hypernymy), word use (e.g., n-gram association strength), and word neighborhoods.

Additionally, the most recent version of TAALES (2.8), which calculates the features above (among others), will be introduced. Key features of the tool will be described, and opportunities for future research in a variety of subfields of applied linguistics will be outlined. Potential limitations and pitfalls will also be discussed.

Presenter: Xiaofei Lu, Pennsylvania State University
Title: Towards a functional turn in L2 writing syntactic complexity research

Syntactic complexity (SC) is commonly construed as the range and degree of sophistication of the syntactic structures used in language production (Ortega, 2003). With the advent of multiple tools for automating syntactic complexity analysis using various coarse- and fine-grained measures (Biber et al., 1999; Kyle, 2016; Lu, 2010; McNamara et al., 2014), numerous quantitative studies have examined and generated valuable insights into SC features predictive of second language (L2) writing quality (e.g., Biber et al., 2016; Kyle & Crossley, 2018; Yang et al., 2015). The focus on linguistic features divorced from function, however, fails to capture the fact that quality writing rests on the functionally appropriate use of linguistic features, not on their presence and frequency alone. It may also negatively affect L2 writers' learning (e.g., learners trying to plug in desirable features in functionally inappropriate ways).

In this talk, I argue for the need for a functional approach to SC research that systematically examines the genre appropriateness and functional effectiveness of syntactically complex structures, and I illustrate the resources and insights functional SC research can generate using findings from a recent project. Using several commonly adopted operationalizations of SC (e.g., sentence length, subordination, left embeddedness, nominalizations) and a modified version of Swales’ (2004) Create A Research Space (CARS) model for rhetorical functional analysis, this study systematically aligns the SC features identified with the rhetorical functions they are deployed to realize in a corpus of social science research article introductions. I conclude with a discussion of the implications of the functional approach for L2 writing pedagogy and assessment and the possibility of drawing on recent successes in emerging research on automating rhetorical function annotation to move functional SC research forward.
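
One way to picture the alignment of SC features with rhetorical functions is to group a simple SC measure by function label. The sketch below is an assumption-laden illustration, not the procedure used in the project: it computes mean sentence length (in whitespace-separated words) per rhetorical move. The labels `move_1` and `move_2` are placeholders rather than the categories of the modified CARS model, the function name `mean_length_by_move` is hypothetical, and the sample sentences are invented.

```python
from collections import defaultdict


def mean_length_by_move(annotated_sentences):
    """Return mean sentence length in words for each rhetorical move label.

    `annotated_sentences` is an iterable of (move_label, sentence) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # move -> [word count, sentence count]
    for move, sentence in annotated_sentences:
        totals[move][0] += len(sentence.split())
        totals[move][1] += 1
    return {move: words / count for move, (words, count) in totals.items()}


# Invented sentences with placeholder move labels, not corpus data.
sample = [
    ("move_1", "Syntactic complexity has been widely studied in writing research."),
    ("move_1", "Many tools now automate its measurement."),
    ("move_2", "However, few studies consider rhetorical function."),
]

print(mean_length_by_move(sample))
```

The same grouping could be applied to other operationalizations, such as subordination or nominalization counts, provided each sentence carries both a feature value and a function label.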