As large language models (LLMs) continue to improve in coding, the benchmarks used to evaluate their performance are steadily becoming less useful.
That’s because many LLMs now post similarly high scores on these benchmarks, which makes it difficult to tell which ones are best suited to specific software development projects and enterprises.
A new paper by researchers at Yale University and Tsinghua University presents a novel method to test the ability of models to tackle “self-invoking code generation” problems that require reasoning, generating code, and reusing existing code in problem-solving.
Self-invoking code generation is much more similar to realistic programming scenarios and provides a better understanding of current LLMs’ ability to solve real-world coding problems.
Self-invoking code generation
Two popular benchmarks used to evaluate the coding abilities of LLMs are HumanEval and MBPP (Mostly Basic Python Problems). These are datasets of handcrafted problems that require the model to write code for simple tasks.
However, these benchmarks only cover a subset of the challenges software developers face in the real world. In practical scenarios, software developers don’t just write new code—they must also understand and reuse existing code and create reusable components to solve complex problems.
“The ability to understand and subsequently leverage one’s own generated code, namely self-invoking code generation, plays an important role for LLMs to leverage their reasoning capabilities to code generation that current benchmarks fail to capture,” the researchers write.
To test the ability of LLMs in self-invoking code generation, the researchers created two new benchmarks, HumanEval Pro and MBPP Pro, which extend the existing datasets. Each problem in HumanEval Pro and MBPP Pro builds on top of an existing example in the original dataset and introduces additional elements that require the model to solve the base problem and invoke the solution to solve a more complex problem.
For example, the original problem can be something simple, like writing a function that replaces all occurrences of a given character in a string with a new character.
The extended problem would be to write a function that replaces occurrences of multiple characters in a string with their given replacements. This would require the model to write a new function that invokes the function it generated for the simple problem.
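The character replacement example above can be sketched in Python. This is a minimal illustration rather than an actual problem from HumanEval Pro or MBPP Pro; the function names `replace_char` and `replace_multiple_chars` are hypothetical and chosen only for this sketch.

```python
def replace_char(text: str, old: str, new: str) -> str:
    """Base problem: replace every occurrence of one character."""
    return "".join(new if ch == old else ch for ch in text)


def replace_multiple_chars(text: str, replacements: dict[str, str]) -> str:
    """Self-invoking problem: solve the harder task by calling the base solution."""
    for old, new in replacements.items():
        # Reuse the previously written function instead of re-implementing it.
        text = replace_char(text, old, new)
    return text


print(replace_char("banana", "a", "o"))                        # bonono
print(replace_multiple_chars("banana", {"a": "o", "n": "m"}))  # bomomo
```

Because the replacements are applied one after another, a mapping such as `{"a": "b", "b": "c"}` would cascade. Whether that is the intended behavior is exactly the kind of detail a model has to reason about when it builds on its own earlier code.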
“This evaluation of self-invoking code generation offers deeper insights into the programming capabilities of LLMs, extending beyond the scope of single-problem code generation,” the researchers write.
LLMs perform poorly at self-invoking code generation
The researchers tested more than 20 open and private models on HumanEval Pro and MBPP Pro, including GPT-4o, OpenAI o1-mini, and Claude 3.5 Sonnet, as well as models from the Qwen, DeepSeek, and Codestral series.
Their findings show a significant disparity between traditional coding benchmarks and self-invoking code generation tasks. “While frontier LLMs excel at generating individual code snippets, they often struggle to effectively utilizing their own generated code for solving more complex problems,” the researchers write.
For example, with a single generation (pass@1), o1-mini achieves 96.2% on HumanEval but only 76.2% on HumanEval Pro.
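To make the metric concrete: with one generation per problem, pass@1 is simply the fraction of problems whose single generated solution passes all of its tests. The sketch below assumes that reading; the helper name `pass_at_1` and the sample results are hypothetical, and the paper's own evaluation harness may differ in its details.

```python
def pass_at_1(results: list[bool]) -> float:
    """Fraction of problems solved by the single generated attempt."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(results) / len(results)


# Hypothetical outcomes for four problems: three solved, one failed.
print(pass_at_1([True, True, False, True]))  # 0.75

# Gap between the two o1-mini scores quoted above, in percentage points.
print(round(96.2 - 76.2, 1))  # 20.0
```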
Another interesting finding is that while instruction fine-tuning provides significant improvements on simple coding tasks, it shows diminishing returns on self-invoking code generation. The researchers note that “current instruction-based fine-tuning approaches are insufficiently effective for more complex self-invoking code generation tasks,” suggesting that we need to rethink how we train base models for coding and reasoning tasks.
To help advance research on self-invoking code generation, the researchers propose a technique to automatically repurpose existing coding benchmarks for self-invoking code generation. The approach uses frontier LLMs to generate self-invoking problems based on the original problems. They then generate candidate solutions and verify their correctness by executing the code and running test cases on them. The pipeline minimizes the need for manual code review to help generate more examples with less effort.
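The verification step of such a pipeline, executing a candidate solution and running test cases against it, could look roughly like the sketch below. It is an assumption-laden illustration, not the researchers' implementation: `passes_tests` and `keep_verified` are hypothetical helpers, and a real pipeline would run untrusted generated code in a sandbox with timeouts instead of calling `exec` directly.

```python
def passes_tests(candidate_source: str, test_source: str) -> bool:
    """Run a candidate solution, then its assert-based tests, in one namespace."""
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # defines the candidate's functions
        exec(test_source, namespace)       # raises AssertionError on a failed test
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as failures.
        return False
    return True


def keep_verified(candidates: list[str], test_source: str) -> list[str]:
    """Keep only the candidate solutions that pass every test case."""
    return [source for source in candidates if passes_tests(source, test_source)]


tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"

print(passes_tests(good, tests))               # True
print(passes_tests(bad, tests))                # False
print(len(keep_verified([good, bad], tests)))  # 1
```

Automating this check is what lets problems with no passing candidate be singled out for attention, so that human review is spent only where it is needed.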
A complex landscape
This new family of benchmarks comes at a time when old coding benchmarks are quickly being conquered by frontier models. Current frontier models such as GPT-4o, o1, and Claude 3.5 Sonnet already have very high scores on HumanEval and MBPP as well as their more advanced versions, HumanEval+ and MBPP+.
At the same time, there are more complex benchmarks such as SWE-Bench, which evaluate models’ capabilities in end-to-end software engineering tasks that require a wide range of skills, such as using external libraries and files and managing DevOps tools. SWE-Bench is a very difficult benchmark, and even the most advanced models show only modest performance on it. For example, OpenAI o1 is inconsistent on SWE-Bench Verified.
Self-invoking code generation sits somewhere between the simple benchmarks and SWE-Bench. It helps evaluate a very specific type of reasoning ability: using existing code within a module to tackle complex problems. Self-invoking code benchmarks can prove to be a very practical proxy for the usefulness of LLMs in real-world settings, where human programmers are in control and AI copilots help them accomplish specific coding tasks in the software development process.
“HumanEval Pro and MBPP Pro are positioned to serve as valuable benchmarks for code-related evaluations and to inspire future LLM development by shedding light on current model shortcomings and encouraging innovation in training methodologies,” the researchers write.