OctoTools, a new open-source agentic platform released by scientists at Stanford University, can turbocharge large language models (LLMs) for reasoning tasks by breaking down tasks into subunits and enhancing the models with tools. While tool use has already become an important application of LLMs, OctoTools makes these capabilities much more accessible by removing technical barriers and allowing developers and enterprises to extend the platform with their own tools and workflows.
Experiments show that OctoTools outperforms classic prompting methods and other LLM application frameworks, making it a promising tool for real-world uses of AI models.
LLMs often struggle with reasoning tasks that involve multiple steps, logical decomposition or specialized domain knowledge. One solution is to outsource specific steps of the solution to external tools such as calculators, code interpreters, search engines or image processing tools. In this scenario, the model focuses on higher-level planning while the actual calculation and reasoning are done through the tools.
However, tool use has its own challenges. For example, classic LLMs often require substantial training or few-shot learning with curated data to adapt to new tools, and once augmented, they remain limited to specific domains and tool types.
Tool selection also remains a pain point. LLMs can become good at using one or a few tools, but when a task requires using multiple tools, they can get confused and perform badly.

OctoTools addresses these pain points through a training-free agentic framework that can orchestrate multiple tools without the need to fine-tune or adjust the models. OctoTools uses a modular approach to tackle planning and reasoning tasks and can use any general-purpose LLM as its backbone.
Among the key components of OctoTools are “tool cards,” which act as wrappers to the tools the system can use, such as Python code interpreters and web-search APIs. Tool cards include metadata such as input-output formats, limitations and best practices for each tool. Developers can add their own tool cards to the framework to suit their applications.
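In practice, a tool card amounts to a thin wrapper that bundles a callable tool with the metadata the planner reads. The sketch below is a hypothetical Python illustration of that idea; the class name, fields and `execute` method are assumptions for illustration, not the actual OctoTools interface.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical sketch of a tool card: a wrapper pairing a callable tool with
# the metadata the planner reads. Field names are assumptions, not the
# actual OctoTools interface.
@dataclass
class ToolCard:
    name: str                 # e.g. "Web_Search"
    description: str          # what the tool does
    input_format: str         # expected inputs
    output_format: str        # what the tool returns
    limitations: str          # known failure modes
    best_practices: str       # usage guidance surfaced to the planner
    run: Callable[..., Any]   # the underlying tool function

    def execute(self, **kwargs) -> Any:
        # Invoke the wrapped tool with the supplied arguments.
        return self.run(**kwargs)

# Registering a custom tool is just creating another card:
def word_count(text: str) -> int:
    return len(text.split())

word_count_card = ToolCard(
    name="Word_Counter",
    description="Counts the number of words in a text string.",
    input_format="text: str",
    output_format="int: number of words",
    limitations="Splits on whitespace only.",
    best_practices="Use for plain text, not HTML.",
    run=word_count,
)
```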
When a new prompt is fed into OctoTools, a “planner” module uses the backbone LLM to generate a high-level plan that summarizes the objective, analyzes the required skills, identifies relevant tools and includes additional considerations for the task. The planner determines a set of sub-goals that the system needs to achieve to accomplish the task and describes them in a text-based action plan.
For each step in the plan, an “action predictor” module refines the sub-goal to specify the tool required to achieve it and make sure it is executable and verifiable.
Once the plan is ready to be executed, a “command generator” maps the text-based plan to Python code that invokes the specified tools for each sub-goal, then passes the command to the “command executor,” which runs the command in a Python environment. The results of each step are validated by a “context verifier” module and the final result is consolidated by a “solution summarizer.”
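Taken together, the modules form a plan-act-verify loop around the backbone LLM. The following sketch shows one plausible shape of that loop, assuming the backbone model is exposed as a simple text-in, text-out function and tool cards carry name and description fields as in the sketch above; the prompts and helper names are illustrative simplifications, not OctoTools' actual code.

```python
import subprocess
import sys
import tempfile
from typing import Callable

# Simplified sketch of a plan-act-verify loop in the spirit of OctoTools.
# Prompts, helper names and control flow are illustrative assumptions,
# not the actual OctoTools implementation.

def run_python(code: str, timeout: int = 30) -> str:
    """Run generated Python code in a separate process and capture its output."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run(
        [sys.executable, path], capture_output=True, text=True, timeout=timeout
    )
    return proc.stdout + proc.stderr

def solve(query: str, tool_cards: list, llm: Callable[[str], str], max_steps: int = 10) -> str:
    # `llm` is any text-in, text-out completion function (the backbone model);
    # `tool_cards` are objects with .name and .description, e.g. the ToolCard sketch above.
    tools_desc = "\n".join(f"- {t.name}: {t.description}" for t in tool_cards)

    # Planner: draft a high-level plan (objective, required skills, relevant tools, sub-goals).
    plan = llm(
        f"Query: {query}\nAvailable tools:\n{tools_desc}\n"
        "Write a high-level plan as a numbered list of sub-goals."
    )

    context = ""
    for _ in range(max_steps):
        # Action predictor: refine the next sub-goal into a concrete, tool-specific action.
        action = llm(
            f"Plan:\n{plan}\nResults so far:\n{context}\n"
            "Describe the next action and which tool to use, or reply DONE."
        )
        if "DONE" in action:
            break

        # Command generator: translate the text-based action into executable Python code.
        command = llm(f"Action: {action}\nWrite a self-contained Python script that performs it.")

        # Command executor: run the code and record the result.
        result = run_python(command)
        context += f"\nAction: {action}\nResult: {result}"

        # Context verifier: check whether the gathered results already answer the query.
        verdict = llm(f"Query: {query}\nResults:\n{context}\nIs this enough to answer? Reply yes or no.")
        if verdict.strip().lower().startswith("yes"):
            break

    # Solution summarizer: consolidate intermediate results into the final answer.
    return llm(f"Query: {query}\nResults:\n{context}\nWrite the final answer.")
```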

“By separating strategic planning from command generation, OctoTools reduces errors and increases transparency, making the system more reliable and easier to maintain,” the researchers write.
OctoTools also uses an optimization algorithm to select the best subset of tools for each task. This helps avoid overwhelming the model with irrelevant tools.
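The selection procedure itself isn't spelled out here, but the idea can be approximated with a greedy search that keeps a candidate tool only if it improves accuracy on a small validation set. The sketch below illustrates that approximation; the `evaluate` function, assumed to run the agent end to end with a given toolset, and the greedy strategy are simplifications rather than the paper's exact algorithm.

```python
from typing import Callable, Iterable

# Illustrative greedy toolset selection (an assumption, not OctoTools' exact
# optimizer): keep a candidate tool only if it improves validation accuracy.

def select_toolset(
    candidate_tools: Iterable[str],
    base_tools: set[str],
    evaluate: Callable[[set[str]], float],  # runs the agent with a toolset, returns accuracy
) -> set[str]:
    selected = set(base_tools)
    best_score = evaluate(selected)
    for tool in candidate_tools:
        trial = selected | {tool}
        score = evaluate(trial)
        if score > best_score:  # keep the tool only if it measurably helps
            selected, best_score = trial, score
    return selected
```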
Agentic frameworks
There are several frameworks for creating LLM applications and agentic systems, including Microsoft AutoGen, LangChain and OpenAI API “function calling.” OctoTools outperforms these platforms on tasks that require reasoning and tool use, according to its developers.

The researchers tested all frameworks on several benchmarks for visual, mathematical and scientific reasoning, as well as medical knowledge and agentic tasks. OctoTools achieved an average accuracy gain of 10.6% over AutoGen, 7.5% over GPT-Functions, and 7.3% over LangChain when using the same tools. According to the researchers, OctoTools' advantage comes from its more effective distribution of tool usage and its better decomposition of queries into sub-goals.
OctoTools offers enterprises a practical solution for using LLMs for complex tasks. Its extensible tool integration will help overcome existing barriers to creating advanced AI reasoning applications. The researchers have released the code for OctoTools on GitHub.