Automated Software Testing through Large Language Models: Opportunities and Challenges
Abstract
The rapid evolution of software systems and the increasing complexity of modern applications have driven the need for innovative testing methodologies. Traditional techniques, often burdened by manual effort and limited scalability, struggle to meet demands for speed, flexibility, and comprehensive quality assurance. Large Language Models (LLMs) offer a promising avenue for transforming automated software testing. This research examines the opportunities and challenges of integrating LLMs into testing workflows, drawing on an extensive literature review, empirical surveys, and real-world case studies. Key opportunities include automated test-case generation, program repair, test-oracle creation, debugging assistance, and the development of self-healing testing systems. Surveys indicate that approximately 51% of studies focus on test-case generation, 26% on program repair, 11% on test-oracle creation, and 12% on debugging assistance. At the same time, several challenges hinder the seamless adoption of LLM-based testing, including hallucinated outputs, context-window limitations, security and privacy concerns, high computational costs, integration complexity, and potential biases in generated tests. This paper reviews the theoretical underpinnings and practical implementations of LLM-driven test automation and proposes pathways for future research and development. Ultimately, LLMs are positioned as transformative tools whose effective integration could greatly enhance testing efficiency and reliability; however, careful design and human oversight remain critical to overcoming their inherent challenges.
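To make the test-case-generation opportunity concrete, the following minimal sketch in Python shows one way an LLM could be asked to draft pytest unit tests for a function, with a simple parse check as a first guard against hallucinated output. It is an illustration only, not an implementation from this paper: the prompt wording, the call_llm hook, and the fake_llm stand-in are assumptions, and a real deployment would plug in an actual model client and still require human review of every generated test.

# Illustrative sketch: LLM-assisted unit-test generation with a basic
# syntactic guard against hallucinated output. The model call is left as a
# hook (call_llm) so the example is not tied to any specific provider.
import ast
from typing import Callable

# Hypothetical type for the model hook: prompt text in, generated code out.
LLMClient = Callable[[str], str]

PROMPT_TEMPLATE = (
    "You are a software testing assistant.\n"
    "Write pytest unit tests for the Python function below.\n"
    "Cover typical inputs, edge cases, and invalid input.\n"
    "Return only Python code.\n\n"
    "{source}\n"
)

def generate_tests(source: str, call_llm: LLMClient) -> str:
    """Ask the LLM for tests, then reject replies that do not even parse.

    ast.parse is only a cheap first filter against hallucinated or truncated
    code; it does not prove the tests are meaningful, so executing them and
    reviewing them manually remain necessary.
    """
    reply = call_llm(PROMPT_TEMPLATE.format(source=source))
    try:
        ast.parse(reply)  # syntactic sanity check only
    except SyntaxError as exc:
        raise ValueError(f"LLM returned unparseable test code: {exc}") from exc
    return reply

if __name__ == "__main__":
    # Stand-in "model" so the sketch runs end to end without network access.
    def fake_llm(prompt: str) -> str:
        return (
            "from mymodule import add\n\n"
            "def test_add_positive():\n"
            "    assert add(2, 3) == 5\n"
        )

    function_under_test = "def add(a, b):\n    return a + b\n"
    print(generate_tests(function_under_test, fake_llm))

Keeping the model call behind a plain callable keeps the sketch provider-agnostic, and the parse check can be extended, for example by actually running the generated tests in a sandbox before accepting them, which reflects the human-oversight caveat emphasized above.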
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.