As soon as AI agents began showing promise, organizations had to grapple with whether a single agent was enough, or whether they should invest in building out a wider multi-agent network that touches more points across the organization.
Orchestration framework company LangChain sought to get closer to an answer. It put an AI agent through a series of experiments and found that single agents do have a limit on how much context and how many tools they can handle before performance begins to degrade. The experiments could lead to a better understanding of the architecture needed to maintain agents and multi-agent systems.
In a blog post, LangChain detailed a set of experiments it performed with a single ReAct agent and benchmarked its performance. The main question LangChain hoped to answer was, “At what point does a single ReAct agent become overloaded with instructions and tools, and subsequently sees performance drop?”
LangChain chose to use the ReAct agent framework because it is “one of the most basic agentic architectures.”
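For readers unfamiliar with the pattern, a ReAct agent simply loops between reasoning about what to do next, calling a tool, and reading the result. The sketch below is a generic illustration of that loop, not LangChain's implementation; `call_model` and the tool dictionary are hypothetical placeholders for any tool-calling LLM and any set of tools.

```python
# Minimal illustrative ReAct-style loop (not LangChain's code): the model
# alternates between reasoning, acting via a tool call and observing the result.
from typing import Callable

def react_agent(call_model: Callable[[list[dict]], dict],
                tools: dict[str, Callable[[str], str]],
                user_message: str,
                max_steps: int = 10) -> str:
    """`call_model` is a hypothetical wrapper around a tool-calling LLM; it
    returns either {"tool": name, "input": arg} or {"answer": text}."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        decision = call_model(messages)                 # reason: decide the next action
        if "answer" in decision:                        # the model is done
            return decision["answer"]
        tool_name, tool_input = decision["tool"], decision["input"]
        observation = tools[tool_name](tool_input)      # act: run the chosen tool
        messages.append({"role": "assistant", "content": f"Called {tool_name}"})
        messages.append({"role": "tool", "content": observation})  # observe the result
    return "Stopped after max_steps without a final answer."
```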
Because benchmarking agentic performance can often yield misleading results, LangChain limited the test to two easily quantifiable agent tasks: answering questions and scheduling meetings.
“There are many existing benchmarks for tool-use and tool-calling, but for the purposes of this experiment, we wanted to evaluate a practical agent that we actually use,” LangChain wrote. “This agent is our internal email assistant, which is responsible for two main domains of work — responding to and scheduling meeting requests and supporting customers with their questions.”
Parameters of LangChain’s experiment
LangChain mainly used pre-built ReAct agents through its LangGraph platform. These agents featured tool-calling large language models (LLMs) that became part of the benchmark: Anthropic’s Claude 3.5 Sonnet, Meta’s Llama-3.3-70B and a trio of OpenAI models, GPT-4o, o1 and o3-mini.
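In code, a prebuilt LangGraph ReAct agent is typically wired up along these lines. This is a hedged sketch rather than LangChain's actual email assistant: the `send_email` and `check_calendar` tools are hypothetical stand-ins, and exact arguments can vary across langgraph versions.

```python
# Hedged sketch: wiring a prebuilt LangGraph ReAct agent to a tool-calling LLM.
# The tools below are hypothetical stand-ins for the email assistant's real tools.
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI            # any tool-calling chat model works here
from langgraph.prebuilt import create_react_agent

@tool
def send_email(to: str, subject: str, body: str) -> str:
    """Send an email reply (stubbed for illustration)."""
    return f"Email sent to {to}."

@tool
def check_calendar(day: str) -> str:
    """Return free slots for a given day (stubbed for illustration)."""
    return f"{day}: 10:00-11:00 and 14:00-15:00 are free."

model = ChatOpenAI(model="gpt-4o")                 # swap in Claude, Llama, o1, o3-mini, etc.
agent = create_react_agent(model, [send_email, check_calendar])

result = agent.invoke(
    {"messages": [{"role": "user", "content": "Schedule a call with Sam on Tuesday."}]}
)
print(result["messages"][-1].content)
```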
The company broke testing down to better assess the performance of the email assistant on the two tasks, creating a list of steps for it to follow. It began with the email assistant’s customer support capabilities, which look at how the agent accepts an email from a client and responds with an answer.
LangChain first evaluated the tool calling trajectory, or the tools an agent taps. If the agent followed the correct order, it passed the test. Next, researchers asked the assistant to respond to an email and used an LLM to judge its performance.
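A trajectory check of this kind can be expressed as a simple comparison between the tools the agent actually called and the expected sequence. The helper below is illustrative, not LangChain's evaluation harness, and the expected trajectory in the usage comment is hypothetical.

```python
# Illustrative tool-calling trajectory check: the run passes only if the agent
# invoked the expected tools in the expected order. (Response quality was
# judged separately by an LLM; that step is omitted here.)
def tools_called_in_order(messages: list, expected: list[str]) -> bool:
    """`messages` is the agent's message history; tool calls are read off
    assistant messages that carry a `tool_calls` attribute (LangChain-style)."""
    actual = [
        call["name"]
        for msg in messages
        for call in (getattr(msg, "tool_calls", []) or [])
    ]
    return actual == expected

# Hypothetical example: a customer support run should look up docs, then reply.
# passed = tools_called_in_order(result["messages"], ["search_docs", "send_email"])
```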
![](https://i0.wp.com/venturebeat.com/wp-content/uploads/2025/02/langchain-benchmark-tooling-screenshot-1.png?resize=900%2C582&ssl=1)
![](https://i0.wp.com/venturebeat.com/wp-content/uploads/2025/02/Langchain-benchmark-tooling-screenshot-2.png?resize=900%2C687&ssl=1)
For the second work domain, calendar scheduling, LangChain focused on the agent’s ability to follow instructions.
“In other words, the agent needs to remember specific instructions provided, such as exactly when it should schedule meetings with different parties,” the researchers wrote.
Overloading the agent
Once the parameters were defined, LangChain set out to stress and overwhelm the email assistant agent.
It set 30 tasks each for calendar scheduling and customer support. These were run three times (for a total of 90 runs). The researchers created a calendar scheduling agent and a customer support agent to better evaluate the tasks.
“The calendar scheduling agent only has access to the calendar scheduling domain, and the customer support agent only has access to the customer support domain,” LangChain explained.
The researchers then added more domain tasks and tools to the agents to increase their responsibilities. These ranged from human resources to technical quality assurance to legal and compliance, along with a host of other areas.
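Conceptually, this overloading step amounts to folding more and more domains' tools and instructions into a single agent. The sketch below, which builds on the hypothetical tools and model from the earlier example, shows one way that could look; the domain names and contents are illustrative, not LangChain's actual test data.

```python
# Hedged sketch of the "domain overloading" setup: as domains are added, the
# single agent's tool list and system instructions both grow. Builds on the
# `model`, `send_email` and `check_calendar` stubs defined earlier.
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent

@tool
def search_docs(query: str) -> str:
    """Look up product documentation (stubbed for illustration)."""
    return f"Top doc snippet for: {query}"

@tool
def lookup_policy(topic: str) -> str:
    """Look up an internal HR or legal policy (stubbed for illustration)."""
    return f"Policy summary for: {topic}"

DOMAINS = {
    "calendar": {"tools": [check_calendar, send_email], "instructions": "Scheduling rules..."},
    "support":  {"tools": [search_docs, send_email],    "instructions": "Support policies..."},
    "hr":       {"tools": [lookup_policy],              "instructions": "HR guidelines..."},
    # ...plus technical QA, legal and compliance, and so on
}

def build_overloaded_agent(active_domains: list[str]):
    tools, prompt_parts = [], []
    for name in active_domains:
        tools.extend(DOMAINS[name]["tools"])
        prompt_parts.append(DOMAINS[name]["instructions"])
    # One ReAct agent now carries every active domain's tools and instructions.
    # (The keyword for system instructions may differ across langgraph versions.)
    return create_react_agent(model, tools, prompt="\n\n".join(prompt_parts))
```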
Single-agent instruction degradation
After running the evaluations, LangChain found that single agents often became overwhelmed when asked to do too many things. They began forgetting to call tools or failing to respond to tasks as more instructions and context were added.
LangChain found that calendar scheduling agents using GPT-4o “performed worse than Claude-3.5-sonnet, o1 and o3 across the various context sizes, and performance dropped off more sharply than the other models when larger context was provided.” The performance of GPT-4o calendar schedulers fell to 2% when the domains increased to at least seven.
Other models didn’t fare much better. Llama-3.3-70B forgot to call the send_email tool, “so it failed every test case.”
![](https://i0.wp.com/venturebeat.com/wp-content/uploads/2025/02/Screenshot-2025-02-11-at-4.42.09%25E2%2580%25AFPM.png?resize=900%2C535&ssl=1)
Claude-3.5-sonnet, o1 and o3-mini all remembered to call the tool, but Claude-3.5-sonnet performed worse than the two OpenAI models. However, o3-mini’s performance degraded once irrelevant domains were added to the scheduling instructions.
The customer support agent can call on more tools, but for this test, LangChain said Claude-3.5-sonnet performed just as well as o3-mini and o1. It also showed a shallower performance drop when more domains were added. As the context window extended, however, the Claude model performed worse.
GPT-4o again performed the worst among the models tested.
“We saw that as more context was provided, instruction following became worse. Some of our tasks were designed to follow niche specific instructions (e.g., do not perform a certain action for EU-based customers),” LangChain noted. “We found that these instructions would be successfully followed by agents with fewer domains, but as the number of domains increased, these instructions were more often forgotten, and the tasks subsequently failed.”
The company said it is exploring how to evaluate multi-agent architectures using the same domain overloading method.
LangChain is already invested in the performance of agents, as it introduced the concept of “ambient agents,” or agents that run in the background and are triggered by specific events. These experiments could make it easier to figure out how best to ensure agentic performance.