AWS now allows prompt caching with 90% cost reduction




The use of AI continues to expand, and as more enterprises integrate AI tools into their workflows, many are looking for ways to cut the costs associated with running AI models.

To answer customer demand, AWS announced two new capabilities on Bedrock, both already available on competitor platforms, to cut the cost of running AI models and applications.

During a keynote speech at AWS re:Invent, Swami Sivasubramanian, vice president for AI and Data at AWS, announced Intelligent Prompt Routing on Bedrock and the arrival of Prompt Caching.

Intelligent Prompt Routing would help customers direct each prompt to the best-sized model so a big model doesn’t answer a simple query.

“Developers need the right models for their applications, which is why we offer a wide set of models,” Sivasubramanian said. 

AWS said Intelligent Prompt Routing “can reduce costs by up to 30% without compromising on accuracy.” Users will have to choose a model family, and Bedrock’s Intelligent Prompt Routing will push prompts to the right-sized models within that family. 

Moving prompts through different models to optimize usage and cost has slowly gained prominence in the AI industry. Startup Not Diamond announced its smart routing feature in July. 

Voice agent company Argo Labs, an AWS customer, said it uses Intelligent Prompt Routing to ensure the correct-sized models handle the different customer inquiries. Simple yes-or-no questions like “Do you have a reservation?” are managed by a smaller model, but more complicated ones like “What vegan options are available?” would be routed to a bigger one. 
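
The routing idea in the Argo Labs example can be illustrated with a toy sketch. This is not how Bedrock’s Intelligent Prompt Routing works internally, which AWS has not detailed here; the model names and the yes-or-no heuristic below are hypothetical and chosen only to mirror the two sample questions.

```python
# Toy illustration of size-based prompt routing.
# Assumption: the model names are hypothetical placeholders, not Bedrock model IDs.
SMALL_MODEL = "family-small"
LARGE_MODEL = "family-large"

# Assumption: a crude heuristic. Words that commonly open a yes-or-no question.
YES_NO_OPENERS = {
    "do", "does", "did", "is", "are", "was", "were",
    "can", "could", "will", "would", "have", "has",
}


def route_prompt(prompt: str) -> str:
    """Return the name of the model that should handle the prompt."""
    words = prompt.strip().lower().split()
    # Yes-or-no questions go to the smaller model; everything else
    # (including empty input) falls through to the larger one.
    if words and words[0] in YES_NO_OPENERS:
        return SMALL_MODEL
    return LARGE_MODEL


if __name__ == "__main__":
    for question in ("Do you have a reservation?",
                     "What vegan options are available?"):
        print(question, "->", route_prompt(question))
```

A production router would presumably rely on a learned estimate of response quality rather than a keyword list; the sketch only shows the dispatch step.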

Caching prompts

AWS also announced Bedrock will now support prompt caching, where Bedrock can store common or repeated prompts so that it does not have to reprocess them on every request.

“Token generation costs can frequently rise particularly for repeat prompts,” Sivasubramanian said. “We wanted to give customers an easy way to dynamically cache prompts without sacrificing accuracy.”

AWS said prompt caching reduces costs “by up to 90% and latency by up to 85% for supported models.”
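
To see what an “up to 90%” figure could mean for a single request, here is a rough back-of-the-envelope sketch. It assumes the discount applies only to the cached portion of the input, uses a made-up per-token price, and ignores any charge for writing to the cache; none of these numbers come from AWS pricing.

```python
# Rough estimate of input cost with and without prompt caching.
# Assumptions (all hypothetical): the price, the token counts, and the idea
# that the discount applies only to cached input tokens.

def input_cost(cached_tokens: int, fresh_tokens: int,
               price_per_1k: float, cache_discount: float = 0.9) -> float:
    """Cost of one request's input when cached tokens are discounted."""
    cached_cost = cached_tokens / 1000 * price_per_1k * (1 - cache_discount)
    fresh_cost = fresh_tokens / 1000 * price_per_1k
    return cached_cost + fresh_cost


def savings_fraction(cached_tokens: int, fresh_tokens: int,
                     cache_discount: float = 0.9) -> float:
    """Fraction of input cost saved compared with no caching at all."""
    total = cached_tokens + fresh_tokens
    if total == 0:
        return 0.0
    return cache_discount * cached_tokens / total


if __name__ == "__main__":
    price = 0.003  # hypothetical dollars per 1,000 input tokens
    with_cache = input_cost(9000, 1000, price)
    without_cache = input_cost(0, 10000, price)
    print(f"with cache: ${with_cache:.4f}, without: ${without_cache:.4f}")
    print(f"saved: {savings_fraction(9000, 1000):.0%}")
```

Under these assumptions, the overall saving approaches the headline discount only when nearly all of the prompt is served from the cache.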

However, AWS is a little late to this trend. Prompt caching has been available on other platforms to help users cut costs when reusing prompts. Anthropic offers prompt caching for Claude 3.5 Sonnet and Haiku on its API. OpenAI also expanded prompt caching for its API.

Using AI models can be expensive

Running AI applications remains expensive, not just because of the cost of training models but also because of the cost of actually using them. Enterprises have said the costs of using AI are still one of the biggest barriers to broader deployment.

As enterprises move towards agentic use cases, there is still a cost each time a user pings the model and the agent starts its tasks. Methods like prompt caching and intelligent routing may help cut costs by limiting when a prompt pings a model API to answer a query.

Model developers, though, have said that as adoption grows, some model prices could fall. OpenAI has said it anticipates AI costs could come down soon.

More models

AWS, which hosts many models from Amazon (including its new Nova models) and leading open-source providers, will add new models to Bedrock. These include models from Poolside, Stability AI’s Stable Diffusion 3.5 and Luma’s Ray 2. The models are expected to launch on Bedrock soon.

Luma CEO and co-founder Amit Jain told VentureBeat that AWS is the first cloud provider partner of the company to host its models. Jain said the company used Amazon’s SageMaker HyperPod when building and training Luma models. 

“The AWS team had engineers who felt like part of our team because they were helping us figure out issues. It took us almost a week or two to bring our models to life,” Jain said. 


