The Enigma of AI Cloud Costs: Strategies for Effective Management

AI systems are in a period of rapid growth, with global spending on related software, hardware, and cloud services expected to surpass $300 billion in 2026, according to IDC Research. AI's growing use of cloud computing comes on top of an already massive enterprise shift from on-prem to public cloud computing. Gartner Research estimates that total cloud spend will reach $600 billion this year. By 2025, cloud spend will overtake on-prem, reaching $917 billion in projected expenditures. The economics of this market are leading every technology executive to look for ways to rein in rapidly growing AI cloud costs.

Watch the webinar: Managing AI Costs and Maximizing ROI →

Margins and Profitability Require Managing AI Costs Carefully

The “devil may care” attitude toward cost management in early AI development is now being replaced by an enterprise imperative to hit gross revenue targets and product margins. Until now, AI costs have been analyzed after the fact; now they must be predicted, monitored, and managed against forecasts. To understand the cloud costs of AI/ML applications, the black box of AI operations needs to be broken down and examined so that strategies for reducing and managing those costs can be developed. This begins with dividing the cloud costs of an AI/ML application into the two phases of its lifecycle: training and inference.

Breaking Down the Cloud Costs of AI

In the training phase, AI applications require a massive amount of cloud resources. This is a one-time, upfront cost for the “creation” of the AI model. Very large volumes of data are processed through many layers of artificial neurons, consuming huge amounts of expensive GPU processing, cloud storage, and distributed computing, all of which adds up to a hefty cloud bill as the model perfects itself.

In the inference phase, AI applications go into everyday production run-time. Cloud inference costs are lower on a per-transaction basis; however, there are many more transactions, leading to a much higher expense than the training phase over the lifetime of the AI model. Deploying a model for real-time processing is expensive. An AI analyst firm confirmed this recently by revealing that ChatGPT's costs are primarily in the compute dedicated to inference. Although individual inference steps are less expensive than training runs, the cumulative cost of inference over time becomes substantial.
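The arithmetic behind this point can be sketched in a few lines. The example below is a minimal illustration, not a pricing model: the training cost, the price per 1,000 requests, and the daily request volume are all hypothetical figures chosen to keep the math easy to follow.

```python
import math


def breakeven_requests(training_cost: float, cost_per_1k_requests: float) -> int:
    """Requests needed before cumulative inference spend reaches the training spend."""
    if cost_per_1k_requests <= 0:
        raise ValueError("cost_per_1k_requests must be positive")
    return math.ceil(training_cost / cost_per_1k_requests * 1000)


def cumulative_inference_cost(cost_per_1k_requests: float,
                              requests_per_day: int,
                              days: int) -> float:
    """Total inference spend over a period, assuming steady daily traffic."""
    return cost_per_1k_requests * (requests_per_day / 1000) * days


# Hypothetical figures, for illustration only.
TRAINING_COST = 500_000.0      # one-time training bill, in dollars
COST_PER_1K_REQUESTS = 2.0     # inference cost per 1,000 requests, in dollars
REQUESTS_PER_DAY = 1_000_000   # steady production traffic

requests_needed = breakeven_requests(TRAINING_COST, COST_PER_1K_REQUESTS)
days_needed = math.ceil(requests_needed / REQUESTS_PER_DAY)
first_year = cumulative_inference_cost(COST_PER_1K_REQUESTS, REQUESTS_PER_DAY, 365)

print(f"Inference spend matches training spend after {days_needed} days")
print(f"First-year inference spend: ${first_year:,.0f}")
```

Under these assumed numbers, each request costs a fraction of a cent, yet inference spend passes the one-time training bill within the first year. Different traffic or pricing assumptions will move the break-even point considerably.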

The Rub of Managing Cloud Costs for AI

Because these costs run across the life cycle of the AI application, both data science and machine learning engineering teams need to join their platform engineering and infrastructure teams in managing cloud costs. There are practical, well-known methodologies these teams can use to manage their AI cloud costs.

We’ll begin by examining the training phase, which starts with determining the business priority of the AI application in order to set cost limits and annual budgets. This must be supported by cost tracking using cloud resource tagging, so that costs are associated with the business or product cost center within your financial ledger. Real-time monitoring and AI-driven cost prediction help control expenses. Prioritizing workloads, leveraging lower-cost resources, or engaging in workload arbitrage among cloud vendors can optimize the cost of training.
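As a rough sketch of what tag-based cost tracking against a budget might look like, the example below rolls billing line items up by a cost-center tag and reports which centers are over their limit. The record layout, the `cost-center` tag key, and the budget figures are assumptions made for illustration; real billing exports differ by cloud provider.

```python
from collections import defaultdict


def costs_by_cost_center(line_items, tag_key="cost-center"):
    """Sum line-item costs per cost center; untagged spend is grouped separately."""
    totals = defaultdict(float)
    for item in line_items:
        center = item.get("tags", {}).get(tag_key, "untagged")
        totals[center] += item["cost"]
    return dict(totals)


def over_budget(totals, budgets):
    """Return the overspend amount for each cost center that exceeds its budget."""
    return {
        center: spent - budgets[center]
        for center, spent in totals.items()
        if center in budgets and spent > budgets[center]
    }


# Hypothetical billing records, for illustration only.
line_items = [
    {"resource": "gpu-cluster-a", "cost": 1200.0, "tags": {"cost-center": "ml-training"}},
    {"resource": "object-store", "cost": 300.0, "tags": {"cost-center": "ml-training"}},
    {"resource": "api-gateway", "cost": 450.0, "tags": {"cost-center": "inference"}},
    {"resource": "old-notebook-vm", "cost": 75.0, "tags": {}},
]
budgets = {"ml-training": 1000.0, "inference": 500.0}

totals = costs_by_cost_center(line_items)
print(totals)
print(over_budget(totals, budgets))
```

Grouping untagged spend under its own label is a deliberate choice here: it makes gaps in tagging visible instead of silently dropping those costs from the ledger.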

As the AI lifecycle shifts to the inference phase, forecasting becomes the most critical function. Inference is associated with in-production products and services, where profitability is scrutinized on a quarterly basis. Engineering departments can “right-size” cloud resource utilization to meet service level guarantees or profit margin targets. Accurate cost tagging and recording remain critical to enabling profit-and-loss analysis. What matters most now, though, is monitoring and managing cost anomalies. Unexpected cost spikes can come from forgotten services running in the background or from bugs in new releases. Cloud cost forecasting, based on usage data, helps identify anomalies and avoid unwelcome surprises. Proactive monitoring ensures that budgets and profit targets are met.
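One simple way to picture usage-based forecasting and anomaly detection is sketched below. It is a deliberately naive illustration, assuming a list of daily cost totals: a day is flagged when it exceeds the trailing-window average by a few standard deviations, and month-end spend is projected from the recent daily average. The window size, the threshold, and the sample costs are all hypothetical, and production tools typically use more sophisticated models.

```python
from statistics import mean, stdev


def find_cost_anomalies(daily_costs, window=7, threshold=3.0):
    """Flag days whose cost is far above the trailing-window baseline.

    Returns a list of (day_index, cost, baseline) tuples.
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    anomalies = []
    for i in range(window, len(daily_costs)):
        history = daily_costs[i - window:i]
        baseline = mean(history)
        # Assumption: floor the spread at 1% of baseline so that a perfectly
        # flat history does not turn tiny fluctuations into alerts.
        spread = max(stdev(history), 0.01 * baseline)
        if daily_costs[i] > baseline + threshold * spread:
            anomalies.append((i, daily_costs[i], baseline))
    return anomalies


def forecast_month_end(daily_costs, days_in_month, window=7):
    """Project month-end spend: spend to date plus recent daily average for remaining days."""
    remaining_days = days_in_month - len(daily_costs)
    recent_average = mean(daily_costs[-window:])
    return sum(daily_costs) + recent_average * remaining_days


# Hypothetical daily costs in dollars; day index 7 simulates a forgotten service.
daily_costs = [100, 102, 98, 101, 99, 103, 100, 250, 101, 100]

for day, cost, baseline in find_cost_anomalies(daily_costs):
    print(f"Day {day}: ${cost} against a trailing average of ${baseline:.2f}")
print(f"Projected month-end spend: ${forecast_month_end(daily_costs, 30):,.2f}")
```

Note that a spike inflates the trailing baseline for the days that follow it, so a second spike shortly after the first is harder to catch with this approach. That is one reason a sketch like this is a starting point and not a substitute for a dedicated monitoring tool.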

Cloud Cost Management for AI is a Business Requirement

In the end, understanding the cloud costs of AI is vital for effective management. By considering the training and inference stages separately, businesses can develop strategies to optimize compute, storage, and network resources. Prioritizing workloads by urgency, monitoring costs, and leveraging forecasting techniques enable proactive cost management, helping companies meet their budgets and profit targets while harnessing the power of AI.
