Hugging Face has achieved a remarkable breakthrough in AI, introducing vision-language models that run on devices as small as smartphones while outperforming their predecessors that require massive data centers.
The company’s new SmolVLM-256M model, requiring less than one gigabyte of GPU memory, surpasses the performance of their Idefics 80B model from just 17 months ago — a system 300 times larger. This dramatic reduction in size and improvement in capability marks a watershed moment for practical AI deployment.
“When we released Idefics 80B in August 2023, we were the first company to open-source a vision language model,” Andrés Marafioti, machine learning research engineer at Hugging Face, said in an exclusive interview with VentureBeat. “By achieving a 300X size reduction while improving performance, SmolVLM marks a breakthrough in vision-language models.”
Smaller AI models that run on everyday devices
The advancement arrives at a crucial moment for enterprises struggling with the astronomical computing costs of implementing AI systems. The new SmolVLM models — available in 256M and 500M parameter sizes — process images and understand visual content at speeds previously unattainable at their size class.
The smallest version processes 16 examples per second while using only 15GB of RAM with a batch size of 64, making it particularly attractive for businesses looking to process large volumes of visual data. “For a mid-sized company processing 1 million images monthly, this translates to substantial annual savings in compute costs,” Marafioti told VentureBeat. “The reduced memory footprint means businesses can deploy on cheaper cloud instances, cutting infrastructure costs.”
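Those throughput figures make the cost claim easy to sanity-check with back-of-envelope arithmetic. The sketch below uses only the 16-images-per-second number quoted above; the hourly instance price is a placeholder assumption, not a figure from Hugging Face or VentureBeat.

```python
# Back-of-envelope: GPU time to process 1 million images per month
# at the quoted SmolVLM-256M throughput (16 images/sec at batch size 64).
IMAGES_PER_MONTH = 1_000_000
THROUGHPUT_PER_SEC = 16          # figure quoted in the article
INSTANCE_PRICE_PER_HOUR = 1.00   # placeholder assumption; substitute your cloud's rate

gpu_hours = IMAGES_PER_MONTH / THROUGHPUT_PER_SEC / 3600
print(f"GPU hours per month: {gpu_hours:.1f}")   # ~17.4 hours
print(f"Monthly compute: ${gpu_hours * INSTANCE_PRICE_PER_HOUR:.2f}")
```

At roughly 17 GPU-hours per million images, the workload fits comfortably on a single modest cloud instance rather than a dedicated cluster.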
The development has already caught the attention of major technology players. IBM has partnered with Hugging Face to integrate the 256M model into Docling, its document-processing software. “While IBM certainly has access to substantial compute resources, using smaller models like these allows them to efficiently process millions of documents at a fraction of the cost,” said Marafioti.
How Hugging Face reduced model size without compromising power
The efficiency gains come from technical innovations in both vision processing and language components. The team switched from a 400M parameter vision encoder to a 93M parameter version and implemented more aggressive token compression techniques. These changes maintain high performance while dramatically reducing computational requirements.
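To give a concrete sense of how little code deployment takes, here is a minimal inference sketch using the transformers library. It assumes the published SmolVLM-256M-Instruct checkpoint on the Hugging Face Hub and a local image file; treat it as an illustration of the pattern, not official sample code.

```python
# Minimal SmolVLM-256M inference sketch (assumes the HuggingFaceTB/SmolVLM-256M-Instruct
# checkpoint and the standard transformers vision-to-text API; verify against current docs).
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "HuggingFaceTB/SmolVLM-256M-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to(device)

# Build a chat-style prompt pairing one image with one question.
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "What is shown in this document?"},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

image = Image.open("page.png")  # placeholder path
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device)

generated = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```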
For startups and smaller enterprises, these developments could be transformative. “Startups can now launch sophisticated computer vision products in weeks instead of months, with infrastructure costs that were prohibitive mere months ago,” said Marafioti.
The impact extends beyond cost savings to enabling entirely new applications. The models are powering advanced document search capabilities through ColPali, an algorithm that creates searchable databases from document archives. “They obtain very close performances to those of models 10X the size while significantly increasing the speed at which the database is created and searched, making enterprise-wide visual search accessible to businesses of all types for the first time,” Marafioti explained.
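ColPali-style retrieval embeds each document page as a set of patch vectors and scores queries against them with a late-interaction (“MaxSim”) step, as in ColBERT. The toy torch sketch below illustrates only that scoring step, with made-up tensor shapes and random stand-in embeddings; it is not the ColPali library's API.

```python
# Toy illustration of late-interaction (MaxSim) scoring, the retrieval step
# used by ColPali-style systems. Embeddings here are random stand-ins for
# real query-token and page-patch embeddings.
import torch
import torch.nn.functional as F

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    """query_emb: (num_query_tokens, dim); page_emb: (num_patches, dim)."""
    sim = query_emb @ page_emb.T        # (tokens, patches) similarity matrix
    return sim.max(dim=1).values.sum()  # best-matching patch per token, summed

torch.manual_seed(0)
query = F.normalize(torch.randn(12, 128), dim=-1)            # 12 query tokens
pages = [F.normalize(torch.randn(256, 128), dim=-1)          # 256 patches per page
         for _ in range(3)]

scores = [maxsim_score(query, p).item() for p in pages]
best = max(range(len(pages)), key=lambda i: scores[i])
print(f"Scores: {scores}; best page: {best}")
```

Because each page's patch embeddings are computed once at indexing time, queries only pay for this cheap matrix comparison, which is why database creation and search speed up so sharply.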
Why smaller AI models are the future of AI development
The breakthrough challenges conventional wisdom about the relationship between model size and capability. While many researchers have assumed that larger models were necessary for advanced vision-language tasks, SmolVLM demonstrates that smaller, more efficient architectures can achieve similar results. The 500M parameter version achieves 90% of the performance of its 2.2B parameter sibling on key benchmarks.
Rather than suggesting an efficiency plateau, Marafioti sees these results as evidence of untapped potential: “Until today, the standard was to release VLMs starting at 2B parameters; we thought that smaller models were not useful. We are proving that, in fact, models at 1/10 of the size can be extremely useful for businesses.”
This development arrives amid growing concerns about AI’s environmental impact and computing costs. By dramatically reducing the resources required for vision-language AI, Hugging Face’s innovation could help address both issues while making advanced AI capabilities accessible to a broader range of organizations.
The models are available open-source, continuing Hugging Face’s tradition of increasing access to AI technology. This accessibility, combined with the models’ efficiency, could accelerate the adoption of vision-language AI across industries from healthcare to retail, where processing costs have previously been prohibitive.
In a field where bigger has long meant better, Hugging Face’s achievement suggests a new paradigm: The future of AI might not be found in ever-larger models running in distant data centers, but in nimble, efficient systems running right on our devices. As the industry grapples with questions of scale and sustainability, these smaller models might just represent the biggest breakthrough yet.