Article Details
Scrape Timestamp (UTC): 2025-03-19 08:41:06.601
Source: https://www.theregister.com/2025/03/19/llms_buggy_code/
Original Article Text
Show top LLMs buggy code and they'll finish off the mistakes rather than fix them

One more time, with feeling ... Garbage in, garbage out, in training and inference.

Researchers have found that large language models (LLMs) tend to parrot buggy code when tasked with completing flawed snippets. That is to say, when shown a snippet of shoddy code and asked to fill in the blanks, AI models are just as likely to repeat the mistake as to fix it.

Nine scientists from institutions including Beijing University of Chemical Technology set out to test how LLMs handle buggy code, and found that the models often regurgitate known flaws rather than correct them. They describe their findings in a pre-print paper titled "LLMs are Bug Replicators: An Empirical Study on LLMs' Capability in Completing Bug-prone Code."

The boffins tested seven LLMs – OpenAI's GPT-4o, GPT-3.5, and GPT-4, Meta's CodeLlama-13B-hf, Google's Gemma-7B, BigCode's StarCoder2-15B, and Salesforce's CodeGEN-350M – by asking these models to complete snippets of code from the Defects4J dataset.

Here's an example from Defects4J (version 10b, org/jfree/chart/imagemap/StandardToolTipTagFragmentGenerator.java): OpenAI's GPT-3.5 was asked to complete the snippet consisting of lines 267-274. For line 275, it reproduced the error in the Defects4J dataset by assigning the return value of p1.getPathIterator(null) to iterator2 rather than using p2.

What stands out is that error rates for LLM code suggestions were significantly higher when the models were asked to complete buggy code – which is most code, at least to begin with.

"Specifically, in bug-prone tasks, LLMs exhibit nearly equal probabilities of generating correct and buggy code, with a substantially lower accuracy than in normal code completion scenarios (eg, 12.27 percent vs. 29.85 percent for GPT-4)," the paper explains. "On average, each model generates approximately 151 correct completions and 149 buggy completions, highlighting the increased difficulty of handling bug-prone contexts."

So with buggy code, these LLMs suggested more buggy code almost half the time. "This finding highlights a significant limitation of current models in handling complex code dependencies," the authors observe.

What's more, these LLMs showed a lot of echoing of errors rather than anything that might be described as intelligence. As the researchers put it, "To our surprise, on average, 44.44 percent of the bugs LLMs make are completely identical to the historical bugs. For GPT-4o, this number is as high as 82.61 percent."

The LLMs will thus frequently reproduce the errors in the Defects4J dataset without recognizing them or setting them right. They're essentially prone to spitting out memorized flaws.

The extent to which the tested models "memorize" the bugs encountered in training data varies, ranging from 15 percent to 83 percent. "OpenAI’s GPT-4o exhibits a ratio of 82.61 percent, and GPT-3.5 follows with 51.12 percent, implying that a significant portion of their buggy outputs are direct copies of known errors from the training data," the researchers observe. "In contrast, Gemma7b’s notably low ratio of 15.00 percent suggests that its buggy completions are more often merely token-wise similar to historical bugs rather than exact reproductions."

Models that more frequently reproduce bugs from training data are deemed less likely "to innovate and generate error-free code."
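To make the reported slip concrete, here is a minimal, hypothetical Java sketch of the pattern the article describes: an equals-style comparison of two paths in which the buggy continuation reuses p1 instead of p2. The class and method names are invented for illustration; this is not the actual Defects4J snippet.

// Illustrative sketch only: a hypothetical equals-style path comparison modelled
// on the p1/p2 mix-up described above, not the actual Defects4J source.
import java.awt.geom.GeneralPath;
import java.awt.geom.PathIterator;
import java.util.Arrays;

public class PathEqualityExample {

    // Returns true if the two paths trace the same sequence of segments.
    public static boolean pathsEqual(GeneralPath p1, GeneralPath p2) {
        if (p1 == null) {
            return (p2 == null);
        }
        if (p2 == null) {
            return false;
        }
        PathIterator iterator1 = p1.getPathIterator(null);
        // The buggy completion reported in the article reused p1 here:
        //     PathIterator iterator2 = p1.getPathIterator(null);
        // which compares p1 against itself, so the check always passes.
        // The correct continuation iterates over p2 instead:
        PathIterator iterator2 = p2.getPathIterator(null);

        double[] coords1 = new double[6];
        double[] coords2 = new double[6];
        while (!iterator1.isDone() && !iterator2.isDone()) {
            Arrays.fill(coords1, 0.0);   // clear stale coordinates from earlier segments
            Arrays.fill(coords2, 0.0);
            int type1 = iterator1.currentSegment(coords1);
            int type2 = iterator2.currentSegment(coords2);
            if (type1 != type2 || !Arrays.equals(coords1, coords2)) {
                return false;
            }
            iterator1.next();
            iterator2.next();
        }
        // Equal only if both iterators are exhausted together.
        return iterator1.isDone() && iterator2.isDone();
    }
}

The point of the sketch is that the buggy and correct continuations are both syntactically plausible single lines, which is exactly the kind of choice the study says models get wrong about half the time.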
The AI models had more trouble with method invocation and return statements than they did with more straightforward syntax like if statements and variable declarations.

The boffins also evaluated DeepSeek's R1 to see how a so-called reasoning model fared. It wasn't all that different from the others, exhibiting "a nearly balanced distribution of correct and buggy completions in bug-prone tasks."

The authors conclude that more work needs to be done: models need a better understanding of programming syntax and semantics, more robust error detection and handling, better post-processing algorithms that can catch inaccuracies in model outputs, and tighter integration with development tools such as Integrated Development Environments (IDEs) to help mitigate errors.

The "intelligence" portion of artificial intelligence still leaves a lot to be desired.

The research team included Liwei Guo, Sixiang Ye, Zeyu Sun, Xiang Chen, Yuxia Zhang, Bo Wang, Jie M. Zhang, Zheng Li, and Yong Liu, affiliated with Beijing University of Chemical Technology, the Chinese Academy of Sciences, Nantong University, Beijing Institute of Technology, Beijing Jiaotong University, and King's College London.
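On the paper's post-processing recommendation, here is a minimal sketch, not from the paper, of one such gate: the model's candidate completion is compiled with the JDK's built-in compiler and rejected if compilation fails. A compile check would not catch a semantic slip like the p1/p2 mix-up above, but it illustrates the kind of automated screening the authors have in mind; the class name and harness are assumptions for illustration.

// A minimal sketch of a post-processing gate: compile a model-generated completion
// before accepting it. Class name and harness are illustrative assumptions, not the
// paper's method. Requires a JDK (not a bare JRE) so a system compiler is available.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

public class CompletionGate {

    // Returns true only if the candidate source compiles cleanly.
    public static boolean compiles(String className, String candidateSource) throws IOException {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("no system Java compiler available");
        }
        Path dir = Files.createTempDirectory("llm-completion-check");
        Path source = dir.resolve(className + ".java");
        Files.writeString(source, candidateSource);
        // run() returns 0 on success; compiler diagnostics go to stderr by default.
        int status = compiler.run(null, null, null, "-d", dir.toString(), source.toString());
        return status == 0;
    }

    public static void main(String[] args) throws IOException {
        String candidate = "public class Snippet { int twice(int x) { return x * 2; } }";
        System.out.println(compiles("Snippet", candidate) ? "accept completion" : "reject completion");
    }
}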
Daily Brief Summary
Research conducted by scientists across multiple institutions shows that large language models (LLMs) often replicate bugs in the code they are completing.
During tests using code snippets from the Defects4J dataset, prominent AI models such as OpenAI’s GPT series and Google’s Gemma continued the errors instead of correcting them.
LLMs like GPT-4 generated error-ridden code nearly as often as correct code, underscoring the challenges in AI-driven code completion.
An alarming 44.44% of the bugs produced by these models were identical to historical bugs, with OpenAI’s GPT-4o hitting a reproduction ratio of 82.61%.
Some models demonstrated less dependency on historical errors, indicating variations in how different LLMs handle bug replication.
The models' accuracy dropped notably when handling method invocations and return statements, as opposed to simpler constructs such as if statements and variable declarations.
Further research is urged to improve AI’s understanding of code semantics and syntax and to refine error detection and handling in development tools.
The study highlights significant limitations in current AI capabilities concerning complex code dependencies and error memorization.