1. Introduction
As interest in study abroad (SA) programs for students across the U.S. grows, it becomes increasingly important to understand the benefits of these programs for students in terms of second language (L2) development. The Open Doors 2023 report on U.S. study abroad data indicates that in 2021/22, the most common type of SA program was a short-term summer program, accounting for 49% of all students studying abroad. One-semester (3-4 month) programs came second, with 32.9% of students choosing this option (U.S. Department of State, 2023). Research has indicated that both short- and long-term immersion in a foreign language is beneficial for second language acquisition (SLA) (Marijuan & Sanz, 2018; Serrano et al., 2012). However, students often approach the SA experience with unrealistic expectations of their oral proficiency development. Students must actively immerse themselves in the target language in order to achieve the improvement they are aiming for, as SA itself is not a guarantee of substantial language development (Moore et al., 2021). When it comes to the extent of these oral proficiency gains, results are mixed. Many previous studies have focused on short-term programs, tracking language development across this short amount of time, and have found that three to four weeks of studying abroad can lead to language learning gains (e.g., Issa et al., 2020; Martinsen, 2008; Zalbidea et al., 2020). Yang (2016) conducted a systematic meta-analysis of SA research and concluded that short-term immersion provided more practical linguistic benefits. In contrast, other studies find support for long-term programs (1 full academic year) being more conducive to changes in language development (Dwyer, 2004; Serrano et al., 2012), calling on SA researchers to conduct more longitudinal studies to determine the deeper attributes of language development while abroad.
When discussing oral proficiency attributes, previous research has consistently investigated three units of analysis: fluency, accuracy, and complexity (e.g., Leonard & Shea, 2017; Mora & Valls-Ferrer, 2012; Serrano et al., 2012). The general consensus is that SA programs improve learners’ fluency, but studies often show mixed results in terms of grammatical accuracy. In order to address these disparities in findings, SA researchers have recently begun to turn away from traditional group analyses and toward individual factors (Coleman, 2013; Kinginger, 2015). These researchers have posited that individual differences between the SA participants are seemingly strong indicators of their proficiency development, calling on studies to examine results at the individual level rather than or in addition to conducting group analysis (e.g., Anderson, 2014; Iwasaki, 2019). The present study explores group and individual oral proficiency development in order to shed some light on the variability in outcomes that L2 learners experienced in terms of complexity, accuracy, and fluency, during a 4-month long SA program.
2. Literature Review
2.1. Previous CAF Studies
The combination of complexity, accuracy, and fluency (CAF) as markers of language proficiency has received a lot of attention since the nineties. SLA research supports SA opportunities for facilitating L2 learners’ oral fluency development, but conclusions vary more regarding the development of grammatical accuracy and complexity. Take Mora & Valls-Ferrer (2012), who compared an at-home formal instruction cohort to a SA cohort of 30 L2 learners of English participating in a 3-month SA program. Participants completed an oral guided interview at the start and end of the program, in which they were paired with another student and given seven fixed questions about their life to elicit relatively free speech while controlling the content. They found that fluency was the area that saw the greatest gains as the result of SA, although accuracy did also improve substantially. Similarly, Lara (2014), who explored changes in CAF for a group of 47 English L2 students studying abroad in an English-speaking country, also found a notable improvement in oral fluency paired with a positive move towards target-like use for lexical complexity. Nonetheless, participants’ syntactic complexity and accuracy displayed little to no change from pre- to posttest. Furthermore, in this study, the individual-level factor of learners’ initial proficiency (also of relevance to the present study) had a substantial positive effect on post-SA outcomes. Recently, Guo (2024) contributed to this body of CAF research by assessing the language development of English-speaking learners of Chinese during their 10-month stay abroad in China. Once again, L2 learners displayed strong gains in fluency as the result of studying abroad, which were accompanied by a positive increase in syntactic complexity (in terms of length and subordination) and in lexical sophistication. However, L2 learners’ accuracy decreased from pre- to posttest.
The findings from these studies contrast with those of Leonard and Shea (2017), who conducted a CAF analysis with 39 L2 learners of Spanish in a 3-month SA program in Argentina and found that not only did learners improve their accuracy over time, but their accuracy learning gains were larger than their gains in fluency.
The contrast in findings between these studies reflects different outcomes when it comes to the development of these oral proficiency variables. Jensen and Howard (2014) report that, in their study of gains in complexity and accuracy, results pointed to considerable variation within and between individuals and that this variation made it hard to capture oral proficiency learning gains in a neat linear pattern in group results over time. Case studies (Iwasaki, 2019; Kinginger, 2008, among others) have played a decisive role in helping SA researchers identify how individual factors can affect L2 learners’ linguistic development over the course of their stay abroad. One student might connect with local students and build a strong local social network, while another might move through different host families until they find a comfortable environment, and all of these factors can help explain why they display different oral proficiency outcomes (see Isabelli-García, 2006). Nonetheless, SA studies tend to opt for either group analysis or case studies, and few studies combine these two approaches. Take for example Anderson (2014), who investigated how several L2 learners’ individual differences (i.e., cognitive and affective aptitude) could elucidate the uneven oral proficiency outcomes L2 learners exhibited after one semester abroad. By triangulating these measures and using a combination of quantitative and qualitative methods, she was able to identify four learner profiles and speculate why some students displayed greater gains than others, providing a more precise depiction of students’ complex oral proficiency and fluency growth after four months abroad. More studies of this nature are needed to shed some light on the mixed CAF results found in previous SA studies. As Kinginger (2009) remarks, “individual differences are absorbed in the effort to document group differences, yet their presence must be noted” (p. 52).
The present study aims to fill this gap and examines how individual factors (i.e., initial proficiency and L2 contact hours) might modulate L2 learners’ oral development over time, adopting a group but also an individual analysis that combines quantitative and qualitative methods.
2.2. Individual Level Factors and their impact on L2 Development Abroad
Previous SA research has considered a multitude of individual level factors that have the potential to enhance or hinder L2 learners’ language development abroad (e.g., length of stay, type of accommodation, willingness to communicate, among others) (Iwasaki, 2019), but the present study will focus exclusively on two individual factors: initial global proficiency and contact with the L2.
Initial Proficiency Pre-study Abroad
Several SA studies point to initial proficiency level impacting language-specific outcomes (e.g., Davidson, 2010; Golonka, 2006; Mora & Valls-Ferrer, 2012). Overall, these studies present two divergent perspectives. Whereas some observe that proficient L2 learners exhibit more noticeable gains due to their superior ability to process form and meaning in communicative interactions (Golonka, 2006; Leonard & Shea, 2017), other studies find that learning gains are larger among those learners who start with a lower proficiency (Baker-Smemoe et al., 2014; Llanes & Muñoz, 2009). Lara (2014) also observed that initial proficiency level had a robust impact on posttest outcomes. In her study, initial proficiency was established using pre-test CAF scores. Correlation analyses revealed that some initial fluency, complexity, and accuracy measures positively correlated with higher scores in the posttest, suggesting that initial proficiency level may be a strong predictor of overall CAF gains during SA. More recently, Zalbidea and colleagues (2020) tackled this question directly with a well-established global oral proficiency measure to elucidate if and how initial oral proficiency can be a contributing factor in L2 grammar development abroad. Thirty-five L2 Spanish learners in a 5-week summer (short-term) SA program in Spain completed an Elicited Imitation Task (EIT), measuring initial oral proficiency through sentence repetition, and two oral production tasks in weeks 1 and 5. After 5 weeks, participants’ increases in complexity and accuracy were greater for learners with higher initial global L2 proficiency, providing further evidence that proficiency at the onset of their stay abroad was an important determining factor for grammar advancement.
In an early attempt to explore the relevance of initial proficiency on overall L2 oral proficiency development, Issa and Zalbidea (2018) uncovered that both lower and higher-level proficiency students show gains in oral fluency and accuracy, but higher proficiency learners may show greater linguistic development, albeit subtle.
Contact with the L2
The second individual-level factor examined in the present study is L2 learners’ contact hours with the L2 (also known as L2 engagement, Mitchell, 2021). This is usually gauged via a questionnaire commonly known as the Language Contact Profile (LCP), in which L2 learners log the number of hours they spend a week (or every two weeks) interacting with or in the L2 during their stay abroad. Overall, most studies have found that L2 language use does not seem to be a significant contributor for oral proficiency or L2 grammar development (e.g., Issa et al., 2020; Martinsen, 2008; Segalowitz & Freed, 2004; Zalbidea et al., 2020). Martinsen (2008) used a modified version of the LCP and uncovered that oral use of Spanish with another person abroad did not predict students’ oral proficiency. Segalowitz and Freed (2004) compared oral proficiency gains from a group of L2 Spanish speakers who studied abroad and a group that stayed at home (AH). They found that the SA group made greater gains than the AH group, but these gains were not a function of how many L2 contact (out-of-class) hours the SA group had, supporting previous findings that L2 contact hours cannot exclusively explain oral proficiency gains abroad. Recent studies that also employed a modified version of the LCP questionnaire (Faretta-Stutenberg & Morgan-Short, 2018; Issa et al., 2020; Issa & Zalbidea, 2018) noted that L2 contact cannot account for variability in morphosyntactic (competence and processing[1]) and lexical development during students’ stay abroad. Nonetheless, a few studies still find that contact with the L2 can sometimes have a significant impact on L2 learners’ language development abroad (Hernández, 2010), and that in some cases, learners’ proficiency might influence the type of language use learners engage in and this can subsequently modulate the oral proficiency learning gains L2 learners exhibit. 
For example, Freed (1990) initially found that the amount of time students spent on out-of-class language use was unrelated to gains in oral proficiency, but a closer analysis uncovered that lower proficiency learners engaged in more interactive language use and that this helped develop their language skills greatly, whereas more advanced learners benefitted more from non-interactive language use. To our knowledge, no study within the framework of CAF has closely investigated how L2 contact hours can modulate oral proficiency abroad, and given the mixed results found in the SA literature, more research in this area is needed.
The Present Study
The current study focuses on a semester-long SA program in Spain that has a strong focus on language immersion and has operated successfully for more than 30 years. Not only do we investigate the effects of two individual factors, self-reported L2 contact hours and initial proficiency level, but we also examine group and individual developmental patterns to obtain a more precise picture of how Spanish L2 learners’ complexity, accuracy, and fluency might evolve over the course of four months abroad. By employing a design that combines quantitative and qualitative methods, we hope to shed some light on why the variables of accuracy and complexity yield mixed results. We posit the following research questions:
- Does Spanish L2 learners’ oral proficiency improve during one semester abroad as it pertains to…
  a. Oral Fluency?
  b. Grammatical Accuracy?
  c. Language Complexity?
- Does initial Spanish proficiency (as measured by an EIT task) modulate this improvement?
- Does the average number of L2 contact hours that students report modulate this improvement?
3. Methods
3.1. Participants and the Study Abroad Program
A total of 20 (17 female, three male) Spanish L2 learner university students in the U.S. took part in this study during a one-semester study abroad program in Spain (Table 1). Of these 20 participants, 12 were Spanish majors and eight were Spanish minors. Thirty percent of participants had taken two lower-level courses (part of the university foreign language requirement), 90% had taken two bridge courses, and only 25% had taken two mid-level courses or one seminar-level course prior to the start of their SA semester in Spain. We asked participants to self-report their English and Spanish proficiency, both overall and by skill, because in addition to the interview used to elicit oral speech in this study, participants also had to complete a series of complementary language tasks that were part of a larger study that tapped into different language abilities such as reading and listening. The Spanish ability that received the lowest ratings was speaking, an ability that SA research has shown to benefit considerably from time abroad (e.g., Martinsen, 2008; Zalbidea et al., 2020).
Seventy percent of the participants had previously traveled abroad to a Spanish speaking country prior to this semester abroad. Unlike in this SA program, their stay was short (e.g., 1 week family trip) and had a non-academic focus. This data fits the student profile of this private liberal arts university, where students tend to be of a higher socioeconomic status and travel internationally with frequency.
During their 15-week stay abroad, participants lived with Spanish host families, took a variety of Spanish courses that focused primarily on Spanish culture, literature, grammar, and linguistics[2] with Spanish-speaking local professors or the resident professor coming from the US campus. Students were also given the chance to complete an internship[3] in an area of interest to them that often matched the students’ other major or minor. As part of the study abroad experience, and more specifically as part of a required course they all were enrolled in, they had to participate in several cultural activities[4] and take part in three five-day-long academic trips organized only for them, in which they explored Spanish architecture, culture, and history.
The overall goal of this study abroad program was to provide students with an immersive language and cultural experience and the opportunity to use Spanish in the classroom, at home, and during these academic trips. In fact, students were required to speak Spanish when they were at the university’s SA center in Spain and when traveling or participating in cultural activities together. To show their commitment towards their immersive experience, participants were asked to sign a document where they stated that they promised to do so upon completion of their onsite orientation meeting on day one. Failure to respect these norms was punished with a formal warning first and repeated violations of this norm could result in the student being expelled from the SA center (in Spain) or losing a significant amount of participation points in the Spanish course that had the trips and cultural activities as embedded modules with an assigned grade. The on-site staff team they engaged with on a regular basis was formed by the resident professor from the U.S. university (a Spain native), and two Spanish-speaking staff members that reside in this town of Spain permanently. They accompanied participants on trips and in the numerous cultural activities that took place during their sojourn to guide them through the learning experience and to ensure that students communicated in Spanish as much as possible.
3.2. Oral Interviews
Participants completed two oral proficiency interviews (OPI) as part of a semester-long Spanish intensive course that all students were required to take while in Spain: one during their second week and another one during the last two weeks of the semester. They were conducted by a certified ACTFL[5] OPI rater and took place in person at the SA center the university has in Spain. Unlike standard interviews with a set list of questions, OPI interviews are personalized interviews in which the interviewer uses the personal interests the interviewee mentions in the first 5 minutes to develop a series of elicitation prompts that will help the interviewer test the interviewee’s communicative abilities according to ACTFL oral proficiency guidelines.
Because the proficiency level of the participants varied greatly, between Intermediate Mid and Advanced Mid, the interviews lasted between 18 and 25 minutes. One of the advantages of using OPI interviews is that participants are asked to only produce speech that is adequate to their proficiency ability or slightly above – to test if they can fully perform certain communicative functions (e.g., narrating or hypothesizing) – and that is largely based on topics interviewees are familiar with and interested in. Nonetheless, this level of individualization makes comparison among participants a challenge.
Therefore, for the present study, we only included participants’ speech that answered the interviewer’s elicitation prompts targeting intermediate-level functions (e.g., description). The researcher who was OPI-certified identified the prompts that elicited description using the ACTFL Oral Proficiency Interview Assessment Workshop Participant Handbook (2018), and another researcher, who was a trained undergraduate research assistant majoring in education and Spanish, double-checked that the excerpts selected for transcription and further analysis did indeed prompt students to describe. As a result, the parts of the interview that were examined in the present study ranged from 4 to 6 minutes. Two sample interviewer questions used to elicit the communicative function of description with two different participants can be found in Table 2 below.
3.3. Global Proficiency
In addition to the OPI interviews, participants completed an oral Elicited Imitation Task (EIT) at the beginning and at the end of their sojourn. We opted for this type of proficiency task because it utilizes oral modality aligning with the oral data examined in the current study, and because previous SA studies have consistently employed this same task when exploring the role of initial proficiency in oral language development over time in an immersive context (Zalbidea et al., 2020). This type of repetition task has been argued to tap into basic language cognition (Bowden, 2016), and provides an objective assessment of global L2 proficiency that is valid, reliable, and well-researched (Solon & Park, 2024; Wu et al., 2022; Yan et al., 2016). Moreover, EITs, as a global measure of proficiency, have been recently used in SA research, and several studies point to L2 learners’ overall proficiency at the onset of the study abroad semester as a potential modulator of L2 learners’ grammatical development over the course of short-term study abroad (Issa & Zalbidea, 2018; Zalbidea et al., 2020).
Completing an EIT involves participants listening to, and then attempting to repeat sentences in Spanish. Our participants completed a practice module in English, then listened to a total of 30 Spanish sentences that increased gradually in both length and complexity. After each sentence had played, there was a 2-second pause followed by a 0.5-second tone sound. Following this tone sound, participants tried their best to repeat the sentences word-for-word. Their responses to each item were recorded, transcribed, and coded by two research assistants, utilizing the same transcription and coding guidelines outlined in previous work with these tasks (e.g., Issa et al., 2020; Ortega, 2000). Two different versions of this EIT were used to administer it at the beginning and the end of the semester. Its presentation was counterbalanced so that half of the participants completed version A at the beginning and version B at the end, and vice versa.
3.4. L2 contact survey
During their stay abroad, participants were asked to fill out a biweekly survey that aimed to collect data on participants’ self-reported number of contact hours with the L2 to examine L2 learners’ engagement with Spanish. Every two weeks, participants received an email with a Google Form that prompted them to reflect on and report the number of hours they had listened to, read, written, and spoken in the L2 in a domestic and academic environment in the last week. Completion of this biweekly survey was not obligatory, but participants were incentivized to complete it by receiving extra monetary compensation if they filled out at least five out of the seven surveys they were asked to complete. Ninety percent of the participants completed a minimum of five surveys. We calculated their average number of hours of L2 contact in general and by skill (See Table 3). This information was included in the statistical model to examine if the average number of contact hours (i.e., L2 contact) could be one of the factors affecting the longitudinal oral proficiency development of participants.
3.5. Data Annotation and Analysis
We created a longitudinal oral corpus containing a total of 18,451 words from all the data that was studied in the present study. The OPI certified researcher and another member of the research team listened to the audio files to identify the parts of the interviews where the interviewer prompted the participants to elicit descriptions. Once the time intervals were identified and confirmed by the OPI certified researcher, the audio files were clipped to the appropriate time intervals using Audacity (Audacity Team, 2024), and the research team proceeded to the transcription of the audio files following MacWhinney’s CHAT conventions (2000). Data annotation was conducted primarily by two members of the research team and the third member, who is an experienced Corpus data researcher, revised all transcriptions to ensure data annotation consistency across the board. The transcriptions and audio files were then entered into the annotation software EXMARaLDA (Schmidt & Wörner, 2014) to annotate the variables of fluency and grammatical accuracy.
Complexity
Linguistic complexity is a broad and multi-dimensional construct (Kuiken, 2023). In the present study, we chose to focus on two dimensions that assess different areas of linguistic complexity: syntactic and lexical complexity.
Syntactic complexity is defined as the L2 learner’s ability to use a variety of syntactic forms and structures, both basic and more sophisticated, when performing tasks that require language production (Ortega, 2003). Furthermore, the acquisition of syntactic complexity can be understood as the ability of a learner to make longer sentences as they learn the target language. To measure this, we calculated the mean length of utterances (MLU) using words as a unit (MacWhinney, 2000). Thus, the realization of longer utterances would denote greater linguistic complexity and by extension the mobilization of more cognitive resources.
In the present study, our utterance coding was guided by the CHILDES conventions commonly adopted as a community transcription standard (MacWhinney, 2000), but we also took account of the prosodic and temporal characteristics of oral production, following the research design of previous studies that examined oral data (Hilton et al., 2008; Rojas Madrazo, 2020). The observation of simple independent clauses, or of simple independent clauses followed by a complementizer or another type of subordinate clause (relative, circumstantial, causal, etc.), contributed to the coding of utterances. The most complicated cases to code were those in which several simple propositions were coordinated with the conjunction ‘and’. When this occurred, three conditions were used to determine if there was a change of utterance: (a) a pause of more than 450 ms before the conjunction, (b) a thematic break, and (c) a descent in prosody. In this way, MLU is at the intersection of complexity and fluency.
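As an illustration of the MLU measure described above, the following minimal Python sketch computes the mean number of words per utterance once a transcript has already been segmented. It is not the authors' pipeline: the function name, the whitespace-based word count, and the three sample utterances are assumptions made here for illustration, and CHAT-specific decisions (for example, how fillers or retracings are counted) are not modeled.

```python
def mean_length_of_utterance(utterances):
    """Return the mean number of words per utterance.

    `utterances` is a list of already-segmented utterance strings.
    Words are counted by whitespace splitting (an assumption; a CHAT-based
    count may exclude some tokens).
    """
    lengths = [len(u.split()) for u in utterances if u.split()]
    if not lengths:
        raise ValueError("no non-empty utterances to measure")
    return sum(lengths) / len(lengths)


# Hypothetical utterances, invented for this example only.
sample = [
    "vivo con mi familia en una casa grande",  # 8 words
    "y tengo dos hermanos",                    # 4 words
    "me gusta mucho",                          # 3 words
]
mlu = mean_length_of_utterance(sample)  # (8 + 4 + 3) / 3
```

Under this reading, a higher value simply reflects longer utterances on average, so the result depends directly on where utterance boundaries are placed.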
Regarding the lexicon, SLA studies describe lexical complexity as referring to productive (as opposed to receptive) vocabulary (Laufer & Nation, 1995). One dimension of lexical complexity that we chose to focus on is lexical diversity. McCarthy and Jarvis (2010) define lexical diversity as the quantity and variety of vocabulary used, considering, on the one hand, the number of different items (word types), i.e., diversity, and, on the other hand, the variation of these items in relation to certain properties of the vocabulary (referential vs. semantic content).
One of the earliest and best-known measures for analyzing diversity is the Type Token Ratio (TTR), which has been criticized because it is sensitive to text length: TTR tends to decrease as texts get longer, which makes it difficult to compare texts of different lengths (McCarthy & Jarvis, 2007; Richards & Malvern, 2000; Skehan, 2009). To address this problem, attempts have been made to correct the formula by calculating the TTR from a fixed number of words (consecutive or randomly selected), although much information is lost from the data (Laufer & Nation, 1995; McCarthy & Jarvis, 2010; Richards & Malvern, 2000). One of these TTR corrections is the Mean Segmental Type Token Ratio (MSTTR), which divides the speech into segments of a given length (50 or 100 words) and then calculates the average TTR of these segments. This version is the one we adopted in the present study. We calculated MSTTR as follows: 1,000 samples were taken, each with 100 words selected at random. Each of these 1,000 samples was independent of the others, which means that some words might have appeared in different samples. To calculate the lexical diversity, the number of different (unique) words in each sample was divided by the total number of words in the sample, which in this case was 100. The resulting ratios were then averaged across the 1,000 samples. This average represents the overall lexical diversity of the text, taking into account the variation between the different samples (De Haro, 2023).
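The sampling procedure described above can be sketched in Python as follows. This is a hedged illustration rather than the script used in the study: the function name, the fixed random seed, the lowercasing of tokens, and the choice to draw each 100-word sample without replacement are all assumptions made here, since the text does not specify them.

```python
import random


def resampled_ttr(tokens, n_samples=1000, sample_size=100, seed=0):
    """Average type-token ratio over random fixed-size samples.

    tokens      -- list of word tokens from one transcript
    n_samples   -- number of independent samples (1,000 in the text)
    sample_size -- words per sample (100 in the text)
    seed        -- assumption: fixed seed so the result is reproducible
    """
    if len(tokens) < sample_size:
        raise ValueError("transcript is shorter than one sample")
    rng = random.Random(seed)
    # Assumption: case is ignored when counting types.
    tokens = [t.lower() for t in tokens]
    ratios = []
    for _ in range(n_samples):
        # Assumption: words are drawn without replacement within a sample;
        # samples are independent, so a word can recur across samples.
        sample = rng.sample(tokens, sample_size)
        ratios.append(len(set(sample)) / sample_size)
    return sum(ratios) / n_samples


# Hypothetical token list in which every word is different.
all_unique = ["palabra%d" % i for i in range(200)]
score = resampled_ttr(all_unique)
```

Because every sample has the same size, the scores of a short and a long transcript are computed on the same footing, which is the motivation for this family of TTR corrections.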
Accuracy
Participants’ language accuracy was annotated in the transcription files and later in EXMARaLDA. We divided each participant’s speech into utterances (see the previous section for a definition of utterance). We measured the average number of errors per utterance and annotated and tallied three general error categories: morphological, syntactic, and lexical errors. This classification of errors was adapted from MacWhinney (2000) and has also been used in other studies (Hilton et al., 2008). Several rounds of coding led us to identify a series of valuable morphological agreement subcategories based on the type of agreement errors observed. These new subcategories were number agreement, gender agreement, and subject-verb agreement errors. Syntactic errors were identified as those in which Spanish word order parameters were violated (e.g., adjective preceding a noun), and lexical errors were identified when participants made the wrong word choice or invented new words to express something for which they did not have the vocabulary. An example for each category can be seen below in Table 4.
Fluency
Following previous research standards (e.g., Di Silvio et al., 2016; Hilton et al., 2008; Mora & Valls-Ferrer, 2012), we investigated a series of units of analysis that are believed to comprise overall utterance fluency. To start, we measured all pauses longer than 200 ms. A pause was defined as any silent or filled interval that lasted over 200 ms, and only pauses within the interviewee’s speech were included in this category; that is, the time between the interviewee’s response to a question and the following question formulated by the interviewer was not considered pause time. The average pause duration for all pauses and for each type of pause was annotated and gauged using EXMARaLDA. These data were later extracted from this software and transferred to an Excel file to examine individual and group patterns.
A total of four units of analysis were examined. Mean Length of Pause (MLP) was calculated for each participant’s initial and exit interview by dividing their total pause time by the number of pauses in their speech. In addition, we measured the average time for each type of pause examined in previous SA studies: Silent Pauses, Filled Pauses, and Hesitation Groups. Silent Pauses were pauses where no speech was produced. Filled Pauses were pauses in which the interviewee vocalized sounds to fill the silence (e.g., “uhm,” “um,” “ehm”). Finally, Hesitation Groups, pioneered by Hilton (2008), were an unbroken combination of both silent and filled pauses.
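The four pause measures can be illustrated with a short Python sketch. It assumes that each annotated pause has been exported as a pair of pause type and duration in milliseconds; the function name, the type labels, and the sample durations are hypothetical and do not come from the study's data or from the EXMARaLDA export format.

```python
from collections import defaultdict

PAUSE_THRESHOLD_MS = 200  # only pauses longer than 200 ms are counted


def mean_pause_lengths(pauses):
    """Return MLP and the mean duration per pause type, in milliseconds.

    `pauses` is a list of (pause_type, duration_ms) pairs taken from the
    interviewee's speech only.
    """
    kept = [(kind, ms) for kind, ms in pauses if ms > PAUSE_THRESHOLD_MS]
    if not kept:
        return {}
    by_type = defaultdict(list)
    for kind, ms in kept:
        by_type[kind].append(ms)
    # Mean duration for each pause type.
    result = {kind: sum(v) / len(v) for kind, v in by_type.items()}
    # MLP: total pause time divided by the number of pauses.
    result["MLP"] = sum(ms for _, ms in kept) / len(kept)
    return result


# Hypothetical annotations for one interview excerpt.
pauses = [
    ("silent", 400),
    ("filled", 300),
    ("silent", 600),
    ("hesitation_group", 1200),
    ("silent", 150),  # below threshold, ignored
]
summary = mean_pause_lengths(pauses)
```

In this toy excerpt the sub-threshold pause is dropped before averaging, so it affects neither MLP nor the silent-pause mean.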
4. Results
4.1. Group Results
After all participants’ (pre- and posttest) files were annotated for fluency and accuracy, we calculated participants’ mean length of pauses (MLP) as well as the average length of each type of pause in milliseconds (see Table 5). Additionally, we tallied the participants’ average number of grammatical errors per utterance, as well as per grammatical category (see Table 6). The majority of the errors participants made fell within the morphological category. Most errors in this category consisted of number and gender agreement mistakes, followed by subject-verb agreement mismatches. Lexical errors were the second most prominent type of error, followed by syntactic errors, which were the least common type, with a group average of 0.1 syntactic errors per utterance. Complexity was gauged using participants’ MLU and the MSTTR measure. Descriptive statistics for all three variables can be found in Table 5 below.
As a group, participants did not show drastic longitudinal changes in complexity, accuracy, and fluency (See Table 5), but we conducted an inferential statistical analysis to confirm that there were no significant differences in the patterns observed in Table 5, and to examine if the individual factors of initial proficiency and L2 contact hours could have modulated these changes from pre to posttest.
We fitted a total of seven mixed-effects linear regression models. For all primary analyses, the fixed effect was session (initial, exit interview), and participant was set as a random effect. Session 1 was used as the reference level. Individual initial proficiency scores (as measured by the EIT) and average self-reported number of L2 contact hours were added as covariates to explore if they modulated total pause time over time. Alpha was set at .05 for all analyses, and p = .05 was treated as significant. The statistical analyses were conducted using R (R Core Team, 2024) with the lme4 package (Bates et al., 2015), keeping the maximal random effect structure whenever possible (Barr, 2013).
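One way to write out a model of this kind is given below. This is a sketch under stated assumptions, not the exact specification fitted in the study: it assumes a by-participant random intercept only and covariates entered as main effects, and the symbols are introduced here for illustration. The text does not state whether interactions between session and the covariates were included.

```latex
% y_{ij}: outcome (e.g., MLP, errors per utterance, MLU, MSTTR)
%         for participant i at session j
% Session_j: 0 = initial interview (reference level), 1 = exit interview
% EIT_i: initial proficiency score; Contact_i: mean self-reported L2 contact hours
\begin{equation*}
  y_{ij} = \beta_0 + \beta_1\,\mathrm{Session}_j + \beta_2\,\mathrm{EIT}_i
         + \beta_3\,\mathrm{Contact}_i + u_{0i} + \varepsilon_{ij},
  \qquad
  u_{0i} \sim \mathcal{N}(0, \sigma_u^2),\quad
  \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
\end{equation*}
```

Under this reading, the coefficient on session estimates the average change from the initial to the exit interview, while the coefficients on the two covariates estimate their overall association with the outcome across both sessions.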
For the variable of fluency, the output of the mixed-effects linear regression model that examined MLP confirmed that there was no significant effect of session, estimate = 0.07, SE = 0.08, t = 0.85, p = .40, suggesting that participants’ MLP did not decrease significantly over the course of a semester. Nevertheless, this model yielded a significant main effect of initial proficiency, indicating that participants with lower proficiency displayed higher total pause time overall, estimate = -0.01, SE = 0.02, t = -2.37, p = .02. None of the models examining the remaining measures of fluency yielded a significant effect of session: average silent pause time, estimate = 0.02, SE = 0.6, t = 0.42, p = .77; average filled pause time, estimate = 0.04, SE = 0.09, t = 0.50, p = .61; average hesitation group pause time, estimate = -0.16, SE = 0.17, t = -0.94, p = .35. Nonetheless, initial proficiency approached significance in the models for average silent pause time, estimate = -0.00, SE = 0.00, t = -1.92, p = .07, and average hesitation group time, estimate = -0.00, SE = 0.00, t = -1.97, p = .06.
Similarly, for the variable of grammatical accuracy, the output of the mixed-effects linear model that examined the average number of errors per utterance did not yield a significant main effect of session, estimate = -0.00, SE = 0.12, t = -0.02, p = .98, confirming that participants’ average number of errors per utterance did not decrease significantly during four months in an immersion environment.
Finally, for the variable of language complexity, the output of the mixed-effects linear model that examined MLU yielded a significant main effect of session, estimate = -15.86, SE = 7.21, t = -2.19, p = .03, revealing that, in this case, participants significantly improved their overall grammatical complexity as measured by MLU from the initial to the exit interview. In addition, this model showed a significant main effect of initial proficiency, estimate = 0.30, SE = 0.11, t = 2.72, p = .01, and of L2 contact hours, estimate = -0.23, SE = 0.09, t = -2.43, p = .02, indicating that participants with a higher initial proficiency showed higher grammatical complexity overall and that those who reported fewer average L2 contact hours showed lower grammatical complexity. With regard to lexical complexity, the output of the mixed-effects linear model examining MSTTR did not yield a significant main effect of session, estimate = 1.38, SE = 1.58, t = 0.87, p = .39. Participants’ lexical complexity did not seem to increase significantly over the course of a semester abroad. Initial proficiency also approached significance for MSTTR, estimate = 0.05, SE = 0.02, t = 1.96, p = .06.
To explore whether the group results for fluency and accuracy might have been affected by individual variation (Jensen & Howard, 2014), leading to a non-significant linear pattern, we also examined individual results using a quantitative and qualitative approach.
4.2. Individual Results
Group results pointed to two individual factors as modulators of our L2 learners’ longitudinal changes in syntactic complexity (as measured by MLU) over the course of 15 weeks abroad. Syntactic complexity was the only variable that saw a significant positive change, and the best-fitting model, which included initial proficiency and L2 contact hours as covariates (one of the techniques recommended by Zalbidea et al., 2020), revealed that initial proficiency (i.e., initial EIT scores) was a significant positive predictor of complexity. Similarly, the average number of L2 contact hours was found to be a significant negative predictor of syntactic complexity. In other words, participants’ initial proficiency and self-reported average number of L2 contact hours modulated their development of syntactic complexity abroad. This was not the case for participants’ fluency and accuracy, which exhibited no significant changes from pre- to posttest.
After syntactic complexity (15 out of 20 participants, or 75%), accuracy (70%) and fluency (65%) were the variables with the second- and third-highest percentage of participants who improved[6]. The fourth and last variable, in which roughly half of the participants showed a modest increase over time, was lexical complexity. Table 7 below displays individual results for the 20 participants that comprised our participant pool.
Although the group analysis did not yield significant fluency and accuracy gains, a high number of participants showed a modest improvement in MLP and NEPU. We now turn to an individual analysis of fluency, accuracy, and lexical diversity to examine more closely why our Spanish L2 learners’ performance in these areas did not improve significantly over time.
A closer examination of individual patterns in fluency and accuracy gains allowed us to identify four different participant profiles. Out of the 20 participants in the present study, eight improved in both fluency and grammatical accuracy (profile 1), whereas four improved in neither (profile 2). The remaining eight participants showed two opposing patterns: four improved their fluency while their grammatical accuracy worsened over time (profile 3), and four displayed the opposite trend, improving their grammatical accuracy over time with a decrease in fluency (profile 4). An example of each participant profile is presented in Table 8 below.
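The four profiles amount to a simple two-by-two classification, which can be sketched in Python as follows. This is only an illustration: the function name `classify_profile` is hypothetical, and treating any strict decrease in MLP or NEPU as an "improvement" is an assumption of the sketch, not a criterion stated in the study.

```python
def classify_profile(mlp_pre, mlp_post, nepu_pre, nepu_post):
    """Return the profile number (1-4) for one participant.

    Lower values are better for both measures, so a strict decrease from
    pre- to posttest is counted as an improvement (an assumption here).
    """
    fluency_improved = mlp_post < mlp_pre
    accuracy_improved = nepu_post < nepu_pre
    if fluency_improved and accuracy_improved:
        return 1  # gains in both fluency and accuracy
    if not fluency_improved and not accuracy_improved:
        return 2  # gains in neither
    if fluency_improved:
        return 3  # more fluent, less accurate
    return 4      # more accurate, less fluent


# MLP and NEPU values reported for participant 21203 in this section.
profile_21203 = classify_profile(1.17, 0.71, 0.65, 1.58)
```

A threshold for a minimum meaningful change could replace the strict comparison if very small differences should not count as gains or losses.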
The average L2 contact hours self-reported by the four participants who showed no gains in accuracy and fluency tended to be low (37.3 h/week) compared to the group average (46.85 h/week). Although L2 contact hours are not very reliable because participants self-reported and often overestimated them, less contact with the L2 could be one of the reasons why these participants did not improve in these areas. Additional questionnaires inquiring about participants’ activities during their free time also revealed that these participants used their free weekends to travel to other countries within Europe with other members of the SA group, favoring the use of English and having less of a chance to practice their Spanish consistently throughout the week. The only exception was participant 21220, who reported a drastically higher average number of L2 contact hours, probably because they took part in an internship with a local doctor’s office for 12 weeks in addition to their regular Spanish courses, cultural activities, and academic trips. A closer look at this participant’s transcripts and gains per variable type (e.g., fluency) brought to light that this participant had one of the highest MSTTR scores (denoting lexical diversity). Overall, this participant tended to use a wide range of new words, but probably took their time retrieving them (causing total pause time to be rather high) and made morphological errors (mostly agreement ones) when implementing them in their speech. One way to interpret this is that this participant prioritized integrating newly learned vocabulary into their speech, and that this came at the cost of grammatical precision and fluency.
Something similar occurred with participants within profiles 3 and 4. Take participant 21203 as an example. From the beginning, this participant adopted a stream-of-consciousness approach to communicating that did not leave room for monitoring grammatical accuracy and that incorporated numerous filled pauses, making this participant’s mean length pause time the highest in the group. Four months abroad helped this participant reduce their mean length pause time (from 1.17 s to .71 s). Overall, they produced more words per minute and hesitated less when speaking. Nonetheless, accuracy did not improve. In fact, the opposite happened, and this participant displayed the highest increase in average number of errors per utterance in the whole group (.65 to 1.58). We cannot know what this participant’s communicative strategy was, but their speech sample suggests that participant 21203 (as well as the others who fit profile 3) tended to prioritize fluency over grammatical accuracy by saying more and hesitating less when communicating the message in interview 2 (similar to the approach adopted by some students in Walsh, 1994). On the other hand, participants within profile 4 appeared to do the opposite. For instance, participant 21217 showed a slight increase in mean length pause time (.91 s to 1.05 s) that was accompanied by a decrease in average number of errors per utterance. A look at their transcripts revealed that they used more self-correction in the second interview, which could easily hinder fluency. This pattern of increased self-correction in the second interview was shared by three out of the four participants within profile 4. Interestingly, these three participants were taking either a linguistics or an advanced grammar course that fostered focus on form, and this could easily have influenced which areas of their Spanish they were prioritizing at the time of the second interview or throughout the 15 weeks they spent abroad.
Finally, with regard to lexical complexity, which proved to be the variable with the lowest number of participants showing gains, we could only identify one individual pattern that could help explain why certain participants showed positive gains in lexical diversity while others did not. This factor was students’ participation in a Spanish course designed as a 12-week internship in an area of study relevant to the student’s major. Out of the 11 participants who saw an increase in MSTTR, seven were enrolled in an internship that, depending on the student, involved assisting an English teacher in a local elementary school, shadowing a doctor, or helping with various assignments at the local chamber of commerce. As part of this course, participants were asked to write a short narrative of their weekly experience, to keep a glossary of newly acquired terms, and to reflect on cultural practices. The focus on vocabulary development in this course, paired with the consistent exposure to Spanish in the workplace, might have contributed positively to students’ lexical diversity.
5. Discussion
The present study explored Spanish L2 learners’ longitudinal gains in oral proficiency development during one semester abroad in Spain. We focused on analyzing Spanish L2 learners’ longitudinal proficiency in the areas of complexity, accuracy, and fluency, and contributed to previous CAF research by adopting a group as well as an individual analysis to help explain the differences in outcomes that L2 learners exhibited in these three linguistic variables upon studying abroad.
Our first research question focused on the constructs of complexity, grammatical accuracy, and fluency, which pertain to the overarching concept of oral proficiency development. Based on previous studies, we anticipated an improvement in fluency. Findings for accuracy and complexity were more mixed in previous studies, so we did not expect to see positive changes over time. Our group analysis revealed that this group of Spanish L2 learners only displayed significant learning gains in syntactic complexity (assessed via MLU). Group averages pointed towards a positive trend in all three constructs[7], but only syntactic complexity reached significance.
This finding is inconsistent with Mora and Valls-Ferrer (2012), who observed that fluency was the variable that saw the highest gains, followed by accuracy, and with Leonard and Shea (2017), whose participants also displayed substantial gains in accuracy and fluency. Instead, our findings align with those from Guo (2024), Jensen and Howard (2014), Lara (2014), and Serrano et al. (2012), all of which found positive changes in syntactic complexity upon study abroad, even when some used a slightly different unit of analysis to gauge grammatical complexity (i.e., T-units). Overall, studies that find a positive effect for grammatical complexity tend to include participants who spent at least 6 months abroad, but participants in our study only spent 4 months in Spain. Perhaps the immersive nature of this particular SA program, which fostered culture and language immersion through host families, weekly cultural events, and academic trips, enhanced our participants’ experience, leading to gains in grammatical complexity in a shorter period of time. Nonetheless, this assumption does not seem to hold in light of Lara’s (2014) dissertation work, which suggests that length of stay did not interact with grammatical complexity development for learners who were immersed in the target language for 3 and 6 months. In fact, research on the effects of length of stay on fluency and accuracy development favors a shorter stay (i.e., 3 months) over a longer one (i.e., 6 months) (Lara et al., 2015), but to our knowledge no study has found length of stay to affect the development of grammatical complexity.
A great number of CAF studies suggest that fluency is the first construct to exhibit benefits from SA, even if the stay abroad is shorter (e.g., 3 months) (Lara, 2014; Leonard & Shea, 2017; Mora & Valls-Ferrer, 2012; among others). However, this was not the case for the learners in the present study. Although group results yielded a non-significant decrease in mean total length pause time, the average filled pause and hesitation group times increased slightly over time for participants as a group. Finally, our participants did not exhibit significant gains in accuracy, providing further evidence for the body of research which finds that short-term study abroad or a semester-long stay is not enough time for L2 learners to see a significant improvement in linguistic form (B. F. Freed, 1995; Rojas Madrazo, 2020; Serrano et al., 2012). In fact, Serrano and colleagues (2012) found that L2 learners’ grammatical accuracy only improved in the second semester of a full year abroad, which could explain why we do not see accuracy gains in our study.
Our individual analysis also allowed us to discern individual participants’ trajectories for accuracy and fluency development, and helped us identify four distinct profiles for these two constructs. For the eight participants who displayed an opposing pattern for fluency and accuracy development, showing improvement in one but not the other, a qualitative analysis of their entry and exit interview transcripts revealed that learners might have prioritized either fluency or accuracy over the other construct, and this difference in approach might have been what prevented a linear developmental trajectory from emerging in the group results. Additional information from administrative SA questionnaires played an important role in helping us hypothesize about the approaches participants might have adopted during their interviews or, more generally, during their stay abroad. From the kinds of activities participants took part in during the week and weekend to the courses they were enrolled in during their stay, obtaining as much information as possible about participants’ experience abroad is paramount for explaining the developmental trends they display over time.
Our second and third research questions asked whether the individual factors of initial proficiency before study abroad and the self-reported average number of L2 contact hours could modulate the gains our L2 learners experienced during a semester abroad. We examined this by adding initial proficiency scores and average L2 contact hours as covariates in the mixed-effects linear regression models we ran for each construct. The model fitted for mean length pause time yielded a main effect of initial proficiency, suggesting that, in general, learners with a higher proficiency displayed lower mean length pause time. However, it was only in the model fitted for the grammatical complexity data that we saw both factors playing a significant role in L2 learners’ MLU over time. Participants with a higher proficiency pre-SA exhibited higher complexity gains, and participants with lower average L2 contact hours tended to show smaller complexity gains. Our findings align with those from several studies (Golonka, 2006; Leonard & Shea, 2017; Zalbidea et al., 2020) noting that participants with a higher proficiency at the beginning of the sojourn are in a better position to develop their oral proficiency in terms of accuracy and sometimes grammatical complexity. Oftentimes, CAF studies have used initial grammatical complexity to establish participants’ proficiency (e.g., Lara, 2014), rather than an independent proficiency measure. Using non-independent measures can present challenges when they are regressed onto change scores of the same L2 measure, because the relationship between the baseline and change scores is expected to be negative (Taraday & Wieczorek-Taraday, 2018). We accounted for this and, similar to Zalbidea et al. (2020), employed an independent and well-established measure of global proficiency to obtain a more precise understanding of how initial proficiency can impact oral proficiency development in an immersion context.
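The expected negative relationship between baseline and change scores can be illustrated with a short derivation. It is a textbook-style sketch under simplifying assumptions introduced here, not an analysis from this study: $X_1$ and $X_2$ denote pre- and posttest scores on the same measure, both with variance $\sigma^2$ and test-retest correlation $\rho < 1$, and $D = X_2 - X_1$ is the change score.

```latex
\operatorname{Cov}(X_1, D) = \operatorname{Cov}(X_1, X_2) - \operatorname{Var}(X_1)
                           = \rho\sigma^2 - \sigma^2 = -(1-\rho)\,\sigma^2 < 0,
\qquad
\operatorname{Var}(D) = 2\sigma^2(1-\rho),
\qquad
\operatorname{Corr}(X_1, D) = -\sqrt{\frac{1-\rho}{2}}
```

Under these assumptions, a baseline score taken from the same measure correlates negatively with change even when no learner-level factor is at work, which is why an independent proficiency measure is preferable as a predictor of gains.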
Although our approach differed from the one used in Lara (2014), we also found that initial proficiency is a strong predictor of longitudinal grammatical complexity development abroad. Unlike in Zalbidea et al. (2020), initial proficiency only appeared to modulate grammatical complexity in the present study. This individual factor did not positively interact with accuracy, as it did in previous studies (Lara, 2014; Leonard & Shea, 2017).
The final individual factor explored in the present study was the average number of L2 contact hours. Many studies have employed a modified version of the Language Contact Profile (Mitchell, 2021) and have often observed that the number of L2 contact hours is not a significant predictor of overall L2 grammar development (Issa et al., 2020; Faretta-Stuttenberg & Morgan-Short, 2018, among others). Nonetheless, the few studies finding that contact with the L2 can have a significant impact on L2 learners’ language development in an SA context have also discovered that it might not only be a matter of the number of contact hours, but also of the type of interactions L2 learners engage in. The results from the present study support the findings from this latter body of research, suggesting that L2 contact hours can have an effect on L2 learners’ linguistic development. Participants in our study with a lower self-reported number of L2 contact hours tended to display lower longitudinal grammatical complexity gains. The descriptive statistics reported in Table 3, with fairly high standard deviation values, suggest that L2 learners varied greatly in their self-reported weekly number of L2 contact hours. Similarly, the average number of L2 contact hours reported by some participants indicates that they frequently overestimated their contact hours. This might have made this variable an unreliable one. Language contact surveys have, admittedly, been shown to be fraught with problems such as over- or underestimation of language use, memory problems, and differing definitions of activities by participants. Nonetheless, in our study, several modifications were made to avoid the pitfalls associated with the validity of this questionnaire. First, the questionnaire was administered every 15 days rather than once at the end of the semester, to prevent aspects such as level of attention, active engagement, and emotions from influencing participants’ memory of the events’ duration (Brunec et al., 2017).
Second, we were able to complement these self-reported L2 contact hours with other information from students’ interviews and short narratives at the beginning, middle, and end of their stay abroad, and with administrative surveys containing relevant information about students’ engagement with the L2 outside of class time (Kinginger, 2008; Taguchi, 2015). A possible explanation for the inconsistency in reported L2 contact hours in the present study could be that the survey was administered biweekly, but participants were asked to reflect on their contact with the L2 in the previous week. L2 contact hours surveys have their limitations, and the few studies that find a positive effect for L2 contact hours associate these gains with the type of interaction (e.g., interactive or context-dependent) rather than the amount of time learners are in touch with the L2. A potential alternative could therefore be to use social networks to estimate L2 contact by considering the individuals participants are in contact with regularly and the ways and contexts in which they communicate with them (McManus, 2019; Mitchell, 2021; Strawbridge, 2023). Take as an example the participants in the present study who took part in an internship. Unlike other students in the group, they had regular interaction with Spanish locals in a professional setting, and this seemed to have had a positive effect on their lexical diversity development over time.
In sum, the present study adopted a group and individual analysis approach to investigate whether, and which, individual-level factors can help explain the mixed results observed in L2 learners’ longitudinal gains in complexity, accuracy, and fluency upon studying abroad. Similar to previous studies, we found that initial proficiency and, surprisingly, average L2 contact hours can modulate gains in grammatical complexity as measured by MLU. A thorough exploration of participants’ entry and exit interview transcripts, in combination with information from SA administrative questionnaires, helped us identify possible reasons why L2 learners did not display significant changes in fluency and accuracy after four months abroad. Looking ahead, research on SLA, and more precisely on SA, should aim to triangulate data collection using a mix of quantitative and qualitative methods to better investigate the complex linguistic and personal journey that L2 learners experience during their stay abroad, with the ultimate goal of establishing which extralinguistic factors have a direct impact on L2 oral development. The SA program described in the present study provides a unique opportunity to adopt such an approach because it is a small-scale operation and one of the researchers who conducted the study was able to accompany the students during their sojourn, allowing them to gain a more comprehensive understanding of students’ linguistic and personal development over the course of four months abroad.
Faretta-Stuttenberg & Morgan-Short (2018) investigated behavioral changes using an offline and online (ERPs) paradigm. Contact with the L2 did not play a role in either participants’ behavioral or processing performance over time.
Students could choose a total of 4 courses in addition to an obligatory Spanish course that included the academic trips and was meant to be an intensive grammar review (during the first two weeks) and an introduction to different aspects such as Spanish history, society, economy, geography, etc. that students would further explore in their selected courses. The courses students could choose from could be divided into four different subareas: advanced grammar, literature, culture, medical Spanish and business Spanish. All of these areas were necessary for students to complete their Spanish major/minor and, on some occasions, their Spanish concentrations.
These internships consisted of four practical hours a week and the completion of weekly journals that required a short narrative summarizing their experience and the listing of specific terminology learned during this session, plus a final reflection paper. Out of the 20 participants in this study, nine participated in an internship. Four of them shadowed a doctor, three of them helped with English lessons in a public local school, and two of them were interns in the local Chamber of Commerce.
As part of this obligatory course, students had to complete a minimum of four cultural activities. The types of activities offered included a cooking course, a local graffiti tour, a trip to the movies to watch a Spanish film, and a local legends tour, among others.
American Council on the Teaching of Foreign Languages
With a difference of only one participant.
The only exception was fluency: despite a decrease in mean length of total pause time, the fluency subcategories of filled pauses and hesitation groups exhibited a slight increase, that is, a change in the unexpected direction.