Model minimalism: The new AI strategy saving companies millions

This article is part of VentureBeat’s special issue, “The Real Cost of AI: Performance, Efficiency and ROI at Scale.” Read more from this special issue.

The advent of large language models (LLMs) has made it easier for enterprises to envision the kinds of projects they can undertake, leading to a surge in pilot programs now transitioning to deployment.

However, as these projects gained momentum, enterprises realized that the earlier LLMs they had used were unwieldy and, worse, expensive.

Enter small language models and distillation. Models like Google’s Gemma family, Microsoft’s Phi and Mistral’s Small 3.1 allowed businesses to choose fast, accurate models that work for specific tasks. Enterprises can opt for a smaller model for particular use cases, allowing them to lower the cost of running their AI applications and potentially achieve a better return on investment.

LinkedIn distinguished engineer Karthik Ramgopal told VentureBeat that companies opt for smaller models for a few reasons.

“Smaller models require less compute and memory and have faster inference times, which translates directly into lower infrastructure OPEX (operational expenditures) and CAPEX (capital expenditures) given GPU costs, availability and power requirements,” Ramgopal said. “Task-specific models have a narrower scope, making their behavior more aligned and maintainable over time without complex prompt engineering.”

Model developers price their small models accordingly. OpenAI’s o4-mini costs $1.10 per million input tokens and $4.40 per million output tokens, compared with the full o3 at $10 and $40, respectively.
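To see how those per-token rates compound at scale, here is a back-of-the-envelope sketch; the monthly token volumes are hypothetical and purely illustrative:

```python
# Back-of-the-envelope comparison using the published per-token rates.
# The monthly token volumes below are hypothetical, for illustration only.

PRICES = {  # USD per million tokens: (input, output)
    "o4-mini": (1.10, 4.40),
    "o3": (10.00, 40.00),
}

def monthly_cost(model: str, input_m: float, output_m: float) -> float:
    """Cost in USD for a workload measured in millions of tokens."""
    in_rate, out_rate = PRICES[model]
    return input_m * in_rate + output_m * out_rate

# Hypothetical workload: 500M input tokens, 100M output tokens per month.
for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 500, 100):,.2f}/month")
# o4-mini: $990.00/month vs. o3: $9,000.00/month -- roughly 9X apart.
```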

Enterprises today have a larger pool of small, task-specific and distilled models to choose from, and most flagship families now ship in a range of sizes. The Claude family from Anthropic, for example, comprises Claude Opus, the largest model; Claude Sonnet, the all-purpose model; and Claude Haiku, the smallest version. Many models at the small end of the spectrum are compact enough to run on portable devices such as laptops or mobile phones.

When discussing return on investment, though, the question is always: What does ROI look like? Should it be a return on the costs incurred, or the time savings that ultimately translate into dollars saved down the line? Experts VentureBeat spoke to said ROI can be difficult to judge: some companies consider time saved on a task a return in itself, while others wait for hard dollar savings, or new business brought in, before declaring an AI investment a success.

Normally, enterprises calculate ROI with a simple formula, as Cognizant chief technologist Ravi Naarla described in a post: ROI = (Benefits − Costs) / Costs. But with AI programs, the benefits are not immediately apparent. He suggests enterprises identify the benefits they expect to achieve, estimate them from historical data, be realistic about the overall cost of AI (including hiring, implementation and maintenance) and understand that they have to be in it for the long haul.
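As a worked example of that formula (the dollar figures below are hypothetical):

```python
# A worked example of Naarla's ROI formula; the dollar figures are hypothetical.
def roi(benefits: float, costs: float) -> float:
    """ROI = (benefits - costs) / costs, expressed as a fraction."""
    return (benefits - costs) / costs

# Hypothetical: $250K in estimated benefits against $180K in total AI costs
# (hiring, implementation and maintenance included).
print(f"ROI: {roi(250_000, 180_000):.0%}")  # ROI: 39%
```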

Experts argue that small models reduce implementation and maintenance costs, especially when they are fine-tuned to carry more enterprise-specific context.

Arijit Sengupta, founder and CEO of Aible, said that how people bring context to the models dictates how much cost savings they can get. For individuals who require additional context for prompts, such as lengthy and complex instructions, this can result in higher token costs.

“You have to give models context one way or the other; there is no free lunch. But with large models, that is usually done by putting it in the prompt,” he said. “Think of fine-tuning and post-training as an alternative way of giving models context. I might incur $100 of post-training costs, but it’s not astronomical.”
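A rough sketch of the tradeoff Sengupta describes, using the o4-mini input rate quoted earlier and his ballpark post-training figure; the prompt size and request volume are hypothetical:

```python
# Sketch of the tradeoff Sengupta describes: paying for context in every
# prompt versus a one-time post-training cost. All figures are hypothetical
# except the $1.10/M input rate and the ~$100 post-training ballpark above.
context_tokens = 3_000           # enterprise instructions repeated per prompt
price_per_m_input = 1.10         # USD per million input tokens (o4-mini rate)
requests_per_month = 1_000_000

prompt_context_cost = context_tokens * requests_per_month / 1e6 * price_per_m_input
one_time_post_training = 100.0   # Sengupta's ballpark figure

print(f"Context-in-prompt: ${prompt_context_cost:,.0f}/month")  # $3,300/month
print(f"One-time post-training: ${one_time_post_training:,.0f}")
```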

Sengupta said Aible has seen cost reductions of roughly 100X from post-training alone, often dropping the cost of using a model “from single-digit millions to something like $30,000.” He did point out that this figure includes software operating expenses and the ongoing cost of the model and vector databases.

“In terms of maintenance cost, if you do it manually with human experts, it can be expensive to maintain because small models need to be post-trained to produce results comparable to large models,” he said.

Experiments Aible conducted showed that a task-specific, fine-tuned model performs as well as large LLMs for some use cases, making the case that deploying several use-case-specific models is more cost-effective than one large model that does everything.

The company compared a post-trained version of Llama-3.3-70B-Instruct to a smaller 8B-parameter option from the same model family. The 70B model, post-trained for $11.30, was 84% accurate in automated evaluations and 92% in manual evaluations. Fine-tuned at a cost of $4.58, the 8B model achieved 82% accuracy in manual assessment, which would be suitable for smaller, more targeted use cases.

Right-sizing models does not have to come at the cost of performance. These days, organizations understand that model choice doesn’t just mean choosing between GPT-4o and Llama-3.1; it means knowing that some use cases, like summarization or code generation, are better served by a small model.
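That right-sizing decision can be encoded directly in application logic. A minimal sketch of task-based routing follows; the model names and task-to-model map are illustrative, not prescriptive:

```python
# Route each task type to the cheapest model known to handle it well.
# The task-to-model map and model names are illustrative only.
ROUTES = {
    "summarization": "small-8b",
    "code_generation": "small-8b",
    "multi_step_reasoning": "large-frontier",
}

def pick_model(task: str, default: str = "large-frontier") -> str:
    # Fall back to the large model for tasks that haven't been profiled yet.
    return ROUTES.get(task, default)

print(pick_model("summarization"))    # small-8b
print(pick_model("legal_analysis"))   # large-frontier (unprofiled task)
```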

Daniel Hoske, chief technology officer at contact center AI products provider Cresta, said that starting development with large models is the better way to gauge potential cost savings.

“You should start with the biggest model to see if what you’re envisioning even works at all, because if it doesn’t work with the biggest model, it won’t work with smaller models either,” he said.

Ramgopal said LinkedIn follows a similar pattern because prototyping is the only way these issues can start to emerge.

“Our typical approach for agentic use cases begins with general-purpose LLMs, as their broad generalization ability allows us to rapidly prototype, validate hypotheses and assess product-market fit,” Ramgopal said. “As the product matures and we encounter constraints around quality, cost or latency, we transition to more customized solutions.”

In the experimentation phase, organizations can determine what they value most from their AI applications. Figuring this out enables developers to plan better what they want to save on and select the model size that best suits their purpose and budget.

The experts cautioned that while it is important to build with the models that work best for what’s being developed, high-parameter LLMs will always be more expensive to run, because they will always require significant computing power.

However, overusing small and task-specific models also poses issues. Rahul Pathak, vice president of data and AI GTM at AWS, said in a blog post that cost optimization comes not just from using a model with low compute needs, but rather from matching the model to the task. Smaller models may not have a sufficiently large context window to understand complex instructions, leading to increased workload for human employees and higher costs.

Sengupta also cautioned that some distilled models could be brittle, so long-term use may not result in savings.

Regardless of model size, industry players emphasized the need for flexibility to address potential issues or new use cases as they arise. If organizations start with a large model and later find a smaller one that delivers similar or better performance at lower cost, they cannot be precious about their original choice.

Tessa Burg, CTO and head of innovation at brand marketing company Mod Op, told VentureBeat that organizations must understand that whatever they build now will always be superseded by a better version.

“We started with the mindset that the tech underneath the workflows that we’re creating, the processes that we’re making more efficient, are going to change. We knew that whatever model we use will be the worst version of a model.”

Burg said that smaller models helped save her company and its clients time researching and developing concepts. That saved time, she said, leads to budget savings over time. She added that it’s a good idea to break out high-cost, high-frequency use cases and serve them with lightweight models.

Sengupta noted that vendors are now making it easier to switch between models automatically, but cautioned users to find platforms that also facilitate fine-tuning, so they don’t incur additional costs.
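In practice, that flexibility often comes from a thin abstraction layer that keeps call sites model-agnostic. A minimal sketch, with hypothetical placeholder backends rather than real vendor APIs:

```python
# A minimal sketch of a model-agnostic wrapper that makes swapping models a
# config change rather than a code change. Both backends are hypothetical
# placeholders, not real vendor APIs.
from typing import Callable

def large_model_backend(prompt: str) -> str:
    return f"[large model] processed: {prompt[:30]}"  # placeholder

def small_model_backend(prompt: str) -> str:
    return f"[fine-tuned small model] processed: {prompt[:30]}"  # placeholder

BACKENDS: dict[str, Callable[[str], str]] = {
    "large": large_model_backend,
    "small-ft": small_model_backend,
}

ACTIVE_MODEL = "small-ft"  # switch models here without touching call sites

def complete(prompt: str) -> str:
    return BACKENDS[ACTIVE_MODEL](prompt)

print(complete("Summarize the quarterly sales report."))
```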
