The architecture of automatic scoring systems for non-native English spontaneous speech: A systematic literature review

Un I. Kuok

Article ID: 10078
Vol 9, Issue 2, 2025


Abstract


Given the heavy workload teachers face, automatic speech scoring systems offer essential support. This study consolidates the technological configurations of automatic scoring systems for spontaneous L2 English speech, drawing on literature published between 2014 and 2024. It focuses on the architecture of the automatic speech recognition model and of the scoring model, as well as on the features used to evaluate phonological competence, linguistic proficiency, and task completion. By synthesizing these elements, the study identifies open research areas and provides a foundation for future research and for practical applications in software engineering.
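
To make the two-stage architecture described above concrete, the Python sketch below illustrates one minimal configuration: an ASR front end is assumed to produce time-aligned words, a feature extractor derives simple fluency measures (speech rate, pause statistics), and a regression model maps them to a holistic score. Everything here is illustrative rather than drawn from any particular reviewed system: the AlignedWord structure, the 0.25 s pause threshold, the toy training data, and the choice of linear regression are all assumptions; the scoring models surveyed in the literature range from classical regressors and random forests to deep neural networks.

from dataclasses import dataclass
from typing import List

import numpy as np
from sklearn.linear_model import LinearRegression


@dataclass
class AlignedWord:
    # One ASR-recognized word with hypothetical start/end timestamps in seconds.
    text: str
    start: float
    end: float


def fluency_features(words: List[AlignedWord]) -> np.ndarray:
    # Derive simple fluency measures: speech rate, mean silent-pause length,
    # and pause count (the 0.25 s pause threshold is an illustrative assumption).
    duration = words[-1].end - words[0].start
    speech_rate = len(words) / duration
    gaps = [b.start - a.end for a, b in zip(words, words[1:])]
    pauses = [g for g in gaps if g > 0.25]
    mean_pause = float(np.mean(pauses)) if pauses else 0.0
    return np.array([speech_rate, mean_pause, float(len(pauses))])


# Toy training data: feature vectors from previously rated responses and the
# human holistic scores assigned to them (entirely made up for illustration).
X_train = np.array([[2.1, 0.40, 6.0],
                    [3.0, 0.20, 2.0],
                    [1.5, 0.90, 11.0]])
y_train = np.array([3.0, 4.5, 2.0])

scorer = LinearRegression().fit(X_train, y_train)

# Score a new spoken response from its (hypothetical) ASR word alignment.
response = [AlignedWord("well", 0.00, 0.30),
            AlignedWord("I", 0.80, 0.90),
            AlignedWord("think", 0.95, 1.30),
            AlignedWord("so", 1.35, 1.60)]
predicted_score = scorer.predict(fluency_features(response).reshape(1, -1))
print(round(float(predicted_score[0]), 2))

In a realistic system the hand-picked fluency measures would be replaced by the richer phonological, linguistic, and task-completion features the review catalogues, and the scorer would be trained on a human-rated learner speech corpus.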


Keywords


automatic scoring system; automatic speech recognition; L2 English speaking; spontaneous speech; assessment and evaluation





DOI: https://doi.org/10.24294/jipd10078



Copyright (c) 2025 Author(s)

License URL: https://creativecommons.org/licenses/by/4.0/
