Case Study: How Flatiron Health Gained Visibility and Control Over Total Platform Costs

Managing AI/LLM Costs and Maximizing ROI

Asim Razzaq and William Howell

This webinar is intended for data scientists, platform engineers, and FinOps managers who are responsible for AI-based applications in production and need to manage their public cloud infrastructure costs closely.

Artificial Intelligence and Machine Learning are the new frontier of software development. Massive market demand for AI is pushing up cloud infrastructure costs. Profitable AI products require collaboration among data science, platform engineering, and, increasingly, FinOps teams. An 80% public cloud contribution to the cost of goods sold is no longer viable when launching new digital products.

Join industry leaders Raja Iqbal of Data Science Dojo and Asim Razzaq of Yotascale as they discuss how to future-proof your AI initiatives. Cloud cost management is an essential element of any modern data stack AI project. Raja and Asim will cover the challenges of tracking AI cloud expenses and how to move from merely managing to planning and forecasting your cloud costs for profitable AI-based products and services.

Length: 1hr 11min

Webinar Transcript

Introduction

Asim Razzaq:

Welcome, everybody. Good evening, good afternoon, and good morning, wherever you are in the world. I am Asim Razzaq, your host, and I’m joined by my good friend Raja Iqbal, the founder and CEO of Data Science Dojo. We’ll introduce ourselves shortly. As you all know, we are in the midst of an LLM and generative AI gold rush. You’ve heard a lot about it, and today we want to discuss the ROI and costs, which are often not considered upfront but will become the most important factors in differentiating yourself from competitors and ensuring the longevity of your AI/LLM programs.

Before we dive in, I’d love for everyone to type in which city and country they’re from. We love geographic diversity, so please take a couple of seconds to do that. Welcome to all of our live feeds, which are being broadcast in many places. Let’s see where everyone is joining from.

Raja Iqbal:

Asim, I can tell you on LinkedIn I see people from Sweden, Bogota, Colombia, Poland, Toronto, Jordan, Namibia, Ankara, Maharashtra, India. Quite a spread.

Asim Razzaq:

All right. We also have Tampa, FL. Excellent. Wow, that is quite a spread.

Raja Iqbal:

Yes, I see Maputo, Mozambique; Mauritius; and Belize. Quite diverse.

Asim Razzaq:

Well, thank you guys for taking the time to do that. Let’s dive into our intros here. So just by way of introduction, I’m Asim Razzaq, co-founder and CEO of Yotascale, and I call myself a recovering engineer because my entire career has been spent building and scaling platforms. Prior to starting Yotascale, I was the head of platform engineering at PayPal and eBay. That’s where my love-hate relationship with a lot of these economic challenges began. We all know about the public cloud—you have to spend money there and make sure you’re efficient. Now, this is the next phase of that, because a lot of the AI/LLM pieces do cost money. At Yotascale, our mission is to help enterprises and companies of a certain size really understand the economics of building their AI/LLM capabilities and cloud infrastructure. With that, I’ll pass it on to Raja.

Raja Iqbal:

OK, so I’m Raja Iqbal, chief data scientist at Data Science Dojo. I think people have been seeing me quite a lot on webinars for the last few weeks, but for those of you joining from other channels, let me introduce myself. I’ve been doing machine learning and AI—what we used to call pattern recognition—for a long time. I’ve spent more than half of my life in this space, so it’s been quite a while. I’m now running a tech and services company called Data Science Dojo. We’ve trained around 10,000 people globally in face-to-face settings. That number might not seem huge in some contexts, but it’s significant for us.

Lately, we’ve been heavily involved in generative AI, both on the solution-building side and the learning side. We’ve launched our first boot camp, perhaps the first in the industry, for building LLM applications. So, Asim, over to you.

Even the Most Successful Companies Will Be Challenged – Massive Supply-Demand Mismatch

Asim Razzaq:

All right. Thank you, Raja. So let’s talk about the big idea here. Right. I think we all know there’s a lot of activity going on in this space, specifically around generative AI and LLMs. There’s an initiative around artificial general intelligence, but that’s not our topic today. Our focus is on LLMs and generative AI.

One of the key insights and challenges that people are not quite realizing is this whole mismatch between supply and demand. By that, I mean the arms race to build models and pivot technology to leverage this new wave of innovation is creating a crunch. The graph you see here is about top and medium-grade GPUs—the supply side of it. If you draw a line through it, you’ll see that every two to three years, the price-performance ratio gets better: you get twice the performance in two to three years. It’s a very linear trajectory, with no immediate innovation in sight to change it.

The critical point is that while the supply improvement is linear, the demand is exponential. Given where we sit, we are already in a crunch phase. A lot of the demand is coming from larger companies. If you’re a cloud provider or a provider of large-scale models, you really need these capabilities. This means there will be price challenges and pressure. Even if you get access to the capacity, you might have to pay a decent amount. Therefore, the cost focus has to be front and center and cannot be an afterthought.

I understand that if you’re early in the experimentation phase, maybe cost isn’t a priority. But I can give you some anecdotes. For example, a good friend of mine, the founder of a startup, had to shut down GPUs because they couldn’t afford to run them around the clock. It cost them $2,500 when they forgot to switch them off one time. It’s not millions, but for some companies, it’s hundreds of thousands of dollars. When they went to get additional GPUs the next morning, the cloud provider had already put a request form and approval gates in place to ensure the supply didn’t evaporate. So there’s already some friction in the making.

Raja, I’d love to get your perspective. What are the ramifications of this supply-demand crunch, and how soon or not are we going to see this turn into more of a crisis rather than things being hunky-dory right now?

Raja Iqbal:

Yeah, so if I look at it from the AI and generative AI angle, companies are definitely beginning to bring in more AI workloads. Machine learning has been happening for quite some time, and we’ve seen quite a bit of development. Let’s actually step back for a moment. There are different kinds of workloads, and most machine learning workloads are now shifting from traditional CPU-based infrastructure to GPU-based infrastructure.

With generative AI, especially, a lot of it involves fine-tuning models and building models on enterprise data, which is definitely driving demand. I’m more on the demand side of things than the supply side, as a consumer of these technologies. But I can see the increasing demand.

Can I put a number on it like Moore’s Law or Huang’s Law? I don’t have a “Raja’s Law” for that. But clearly, the demand is going to increase because of the inherent nature of how these generative technologies are built. There are ways to harness this without necessarily doing more training.

Let’s step back again. There’s the compute side of it and the rest of the infrastructure—the storage side, where you need efficient storage, search, and vector databases. Some techniques are compute-heavy and will require more compute, while others provide workarounds to avoid heavy compute loads by leveraging efficient retrieval and storage mechanisms.

So, it’s one or the other. The demand is definitely going to increase because this whole thing has just started. Everyone is figuring out how to leverage this in their enterprise settings.

Asim Razzaq:

Early innings, and of course, we’ll go through some of the techniques that people can use given the situation. The question is, what’s the solution? Where do I go from here? We’ll be covering a decent amount of that. The rock and the hard place is that you cannot afford to ignore this technology because it’s too good. There’s a lot of hype about many things, right? We’ve all been through those hype cycles—everything claims to be the best thing ever. But here, there’s something genuinely innovative and game-changing. Some people have compared it to the advent of the Internet or the iPhone. So, it’s at least on that scale, if not more.

To stay competitive, you really need to make sure you’re experimenting and leveraging this technology within your domain and business context. But just for our audience, remember to keep a close eye on the supply and demand, and make sure you’re thinking through this in your strategy as you have more workloads to train and experiment with.

Cost Implications of Multiple Paths to ML Inference

Asim Razzaq:

Here’s a pretty interesting chart. We were just talking about different methods and the ecosystem. This is a snapshot from Felicis, one of the investors in Yotascale, discussing this evolving ecosystem as of August 7, 2023. Things are changing so fast that people are timestamping their information, noting it’s two weeks old, a month old, or three months old—time flies in this category.

Some of the key points are about GPU cloud providers and their trade-offs. For our audience, if you’re thinking about what you should use, here’s a bit of education. You have the GPU cloud, where you get raw capacity from a cloud provider. This is a step above buying your own GPUs for your data center, which large companies can do and which is essentially what cloud providers are doing. Then it goes up to serverless GPUs, where you don’t worry about infrastructure. This is similar to how serverless works, offering more efficient inference with some acceleration, so the costs aren’t as high. Then you have model APIs, like GPT-4, where you can access these models for a fee. Finally, there are decentralized GPUs, where providers do some arbitrage to bring low-cost GPU capacity to you.

So, Raja, the question would be, if someone is thinking through this and their head is spinning a bit, what do they do? How do they approach it? What are some scenarios and use cases that would be helpful?

For example, if I’m a pre-Series A company, my biggest challenge is reaching product-market fit. I cannot build my own data center and procure GPUs; I won’t have the CapEx for that. So, in the rush for product-market fit, I might work at the model API level. It would be great if you could run through some scenarios and provide advice on what kind of technology people should consider based on the business problem they’re solving.

Raja Iqbal:

Yeah, I think this is a tough one, Asim, and I’m assuming you’re talking about AI and generative AI specifically.

Right, so cost management in the cloud has been an issue for at least a decade. But specifically in the context of generative AI, knowing your use case is crucial. Let me address this first. You mentioned the need to timestamp information because things are changing so fast. I actually did a post this morning because OpenAI came out with the fine-tuning of GPT-3.5, and the pricing they mentioned seems quite reasonable. It looks impressive because building a model from scratch can cost tens of thousands of compute hours and millions of dollars. Even fine-tuning is quite hard.

Going back to your question, you have to know what you’re trying to do. You need to understand the use case. There are different tradeoffs: cost, latency, accuracy, and time to market.

Asim Razzaq:

And time to market, right.

Raja Iqbal:

Exactly. It’s a very interesting engineering and business problem at the same time. Let’s say you’re building a conversational agent or a recommendation system, and you need to provide real-time answers or recommendations. The kind of infrastructure you need for that is different from what you need for offline processing and gathering insights. A lot of it is still good old engineering. If you’re a good engineer or architect, that’s what is needed. You can be excellent in machine learning, but you also need a good view of the business and do engineering in a way that is efficient.

You can go with fine-tuned models, but there are other considerations too. Are there regulatory or compliance risks if you upload your data to a third-party cloud? Are there any IP issues? Can you do it in-house, and do you have the necessary skills? There are many factors to consider, and the options are there. You just need to know your options. It’s not purely a technology problem. You can’t just throw technology at it, no matter how fascinating it seems. Once you start building non-trivial applications, it becomes very complicated. You have to worry about latencies, accuracy, cost, the number of tokens you’re passing, and all of that.

Asim Razzaq:

Right, that’s maybe… Sorry, go ahead.

Raja Iqbal:

Yeah, I’m saying there are too many things to worry about actually.

Asim Razzaq:

There’s a lot to worry about, but maybe we can simplify it a bit. Let’s go down this thread: given where things are, many people will be comfortable with a model API solution like GPT-4, especially if they’re experimenting and building conversational agents. It’s absolutely true that you have to experiment to understand the guardrails, where things might go off the rails, and where they will stay on track. You can’t just start one day, switch it on, and expect everything to be fine.

On the model API piece alone, what are your thoughts on options like LLaMA 2 and GPT-4? There’s also an open-source component, which you have to manage. You can’t just give it a bunch of tokens and expect it to work perfectly. What’s the factor of complexity that people would run into when trying to bring an open-source model in-house and work with it, as opposed to something out-of-the-box that’s continuously being tuned? This isn’t to say one is better than the other, but what are the scenarios to consider in that particular genre?

Raja Iqbal:

I mean, I don’t see this as any different from having a self-managed SQL database versus a fully managed hosted SQL database. There might be some nuances, but at a high level, if you bring in your own open-source model, deploy it, and manage it, including fine-tuning it, you need to have some technical understanding of things. When you use something like GPT-4 or GPT-3.5 from OpenAI, which is fully managed, it’s simpler. We’ve used various options: you can directly use OpenAI, Azure OpenAI, or even Amazon Bedrock, especially in an e-commerce context. Cohere is another option. No matter what you’re using, it’s a trade-off.

Managed services tend to be more expensive, of course. So, if you’re here for a quick proof of concept—like you mentioned, pre-Series A and raising a Series A round—then go and build something using a managed service. Once you have the funding, invest in in-house solutions. The longer-term cost can add up very quickly when bigger workloads come in, even if managed services initially seem more cost-effective.
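To make the “quick proof of concept on a managed service” point concrete, here is a minimal sketch of calling a hosted model API. It assumes the pre-1.0 openai Python SDK and a placeholder API key, so treat the exact calls as illustrative rather than definitive—SDK interfaces in this space change quickly.

```python
# Minimal proof-of-concept against a fully managed model API.
# Assumes the pre-1.0 "openai" Python SDK; newer versions expose a different client interface.
import openai

openai.api_key = "sk-..."  # placeholder; load from an environment variable in practice

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You answer customer questions about our product."},
        {"role": "user", "content": "How do I reset my password?"},
    ],
    max_tokens=200,
)

print(response["choices"][0]["message"]["content"])
# The usage block (prompt_tokens, completion_tokens, total_tokens) is the raw
# material for the cost tracking discussed later in this session.
print(response["usage"])
```

Logging that usage field from the very first experiment makes the later FinOps conversation much easier.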

Asim Razzaq:

That’s right, that’s right. And I think Frank in the chat mentioned that it cost $100 million to train GPT-3. There’s an element of somebody having already done a bunch of training ahead of time. If a pre-trained model suits your use case well, why would you want to invest that kind of money to do it yourself? It’s not exactly an apples-to-apples comparison, but it speaks to the cost of getting some of these models ready. There’s definitely a benefit in using model APIs because a lot of CapEx and OpEx has already gone into them.

I like your approach of not building to scale from the start. You need to experiment and understand the technology and the use cases you’re trying to deliver before optimizing. And I love the analogy—again, we’re both engineers at heart—that this is 95% good old engineering. Just because the domain has changed doesn’t mean you’ll do something radically different. You need the same discipline in architecture, design, understanding the business use case, and knowing what problem you’re solving. All of those principles are still very relevant.

Raja Iqbal:

Yeah. So I would like to add something here. The foundation models from OpenAI, Cohere, Google, and Meta (LLaMA 2) are great, but they are primarily trained on publicly available data sources. If you have proprietary data—such as legal documents or other proprietary information—these models can still help with the language or semantic context.

There are different approaches you can take: you can fine-tune these models by adding your proprietary data, or you can use these models for the language component while bringing the knowledge component from your own private data repository. This way, you use them in conjunction with each other.

So, yes, there are cases where off-the-shelf models may not work perfectly. If your data is very specific to your company and not something the foundation model providers are aware of, then of course you won’t get what you want out of these models.

Asim Razzaq:

No, that’s fair. And I think the last point I’ll make on this slide is that it’s always about balancing time to value, speed of experimentation, and the cost trade-offs. At the bottom tier, you can go buy your own GPUs, put LLaMA on them, and build the whole stack. If a company is very technical, has the CapEx, and has a very unique use case not covered by existing software, it might need to modify LLaMA. You could start with that as a basis since it’s open source, and that’s one use case.

The other use case is when your competitor is already implementing conversational AI into their offering, and that’s what everyone expects. You have to start somewhere fast and iterate to ensure you don’t lose market share and maintain your competitive edge. These are just two ends of the spectrum of use cases.

Raja Iqbal:

Yeah, OK, good. Asim, there’s an interesting question in the Q&A. If you want to take a look, I can read it out.

Asim Razzaq:

Yes. Yeah. Go ahead.

Raja Iqbal:

So, do we have data on how much Microsoft, Apple, and Google are spending on this race merely from a consumption point of view? There was an article mentioning that Microsoft’s operational costs have gone up quite a bit. This actually reminds me of the company that has benefited the most from all of this, which is NVIDIA, right?

Yes, at the end of the day, many of these cloud service providers are in the business of renting infrastructure. Microsoft, Amazon (AWS), and others rent out this infrastructure, and they buy a lot of hardware from NVIDIA to do so.

Their operational costs are going up for two main reasons. First, they are adopting generative AI internally, leading to increased R&D costs. Second, to serve customers—Microsoft now offers Azure OpenAI and AWS offers many of these foundation models—they need that hardware in their data centers. Without a doubt, these companies are placing orders in the tens of billions for NVIDIA GPUs. That’s the mad rush unfolding right now as we speak.

Who Manages Costs as Organizations & Roles Converge

Asim Razzaq:

Yeah. And to address that question, we don’t know the exact numbers, but if I were to guess, we’re talking about at least hundreds of millions of dollars here. If you take what it cost OpenAI as an anchor, this is not going to be cheap, especially for cloud providers operating at such a massive scale.

There’s another question we’ll look at along the way: Is it fair to say that the phenomenal funding AI companies are getting is more like the initial thrust needed for lift-off and may not be an indication of the great applications of Gen AI that they are coming up with?

I think we need to use common sense here, similar to what Raja mentioned earlier about engineering being 95% of this. When there’s transformational technology, a lot of companies will get funding as people try to figure out new innovations. There will be a cottage industry around this whole stack. Companies will claim to do it better, make it more productive, help you deploy your models faster, and there will be tooling around it. In a gold rush, the picks and shovels sell for a lot, right?

However, companies solving real problems with real technology and depth will certainly disrupt industries. I have zero doubt about it. I believe every industry will get disrupted in one way or another. If you’re a laggard or a legacy company, you’ll be in trouble. Let’s take healthcare, for example. It hasn’t changed much over the years, despite many companies trying. I believe that the pace at which this technology is evolving, aside from regulatory and security issues, will lead to companies emerging that solve amazing use cases in healthcare—preventative care, data analysis, predictions on longevity, and so forth.

In any gold rush, many will rush in, and not everything will pan out, but that doesn’t mean most of it will be garbage. There will definitely be industry-changing use cases in various fields. We’re already seeing that. Keep the questions coming; they are great. I think there’s another one.

Raja Iqbal:

There are more questions if you’re ready. Let me know.

Asim Razzaq:

Yeah, I think let’s maybe get through a couple of other points and we can leave a decent amount of time for the Q&A. But please, please post your questions because we will go back and go through them. All right. Interesting questions. So I think this slide is simply talking about the emergence of this technology and who owns this infrastructure. Who is responsible for managing the costs?

Traditionally, platform engineering—also known as DevOps—has aimed to make engineering and application teams more productive. They provide the infrastructure and tooling, and they are measured by how productive they can make that middle layer. Now, we have MLOps, which has been around for a while, and a somewhat new entrant called LLMOps, which could candidly be called AIOps. We don’t need too many different frameworks. Then, there is the data science team.

The question is, how do these teams interact and who owns the infrastructure and models? Who owns the budget? Raja, are you seeing anything specific as you interact with organizations? I’m happy to provide my feedback on this: how is the organizational aspect of this evolving?

Raja Iqbal:

OK, so it is evolving, as you noted. There are actually a lot of interesting areas in this space. Let’s look at LLMOps, right? I agree with you that it is an extension of MLOps with some interesting additions. For instance, explainability, bias, and fairness were always part of MLOps, right? If you’re a big enough company that cares about PR, should you invest in LLMOps? I would say absolutely, you have to.

It may sound like an area to worry about later, but consider this: your language model could potentially be deceived into responding in ways you don’t intend. For example, if a billion-dollar company has a chatbot deployed using one of the LLM technologies and it gets deceived into revealing sensitive or politically incorrect information, that could cause significant damage.

Most companies, barring some of the top ones, are trying to figure out how to use this technology efficiently. Based on the customers we’ve interacted with on the services side of our business, there are two prominent cases. Some companies want to use LLMs and generative AI because they have specific business needs, while others don’t want to miss out on the trend. Within these specific use cases, as soon as you start building an application that is somewhat non-trivial, the complexity grows rapidly.

It’s easy to go to ChatGPT, construct a clever response, and post it on social media. But when you’re creating a database, querying your CRM system, or building a system that responds to customers in an automated fashion, the challenges are much greater. Integrating structured data sources and building a natural language querying interface, for instance, involves many moving parts and significant complexity.

Asim Razzaq:

Yeah, I mean, I think that raises the question again: platform engineering, and whether LLMOps or MLOps is an extension of platform engineering. At the end of the day, it’s a framework with its guardrails, guiding how you go from concept to deployment and then iterate. What does that look like?

The discussion is about whether this is a different group or something that platform engineering will eventually own. Then you have the data science group. To me, this is interesting because you can say data analytics groups will be leveraging this quite a bit. They are certainly customers of this.

It doesn’t make sense for this to fall into a data science group just given the skill set required. Now, that can be controversial, and I’m sure data scientists have their views. For rapid experimentation or Skunk Works projects, sure, but when you’re talking about scale and what the infrastructure needs to do, you need to consider latency, security, and reliability. These elements don’t go away, and they require a team that owns the framework and all its aspects.

What do you think?

Raja Iqbal:

You know, the prediction piece is there, but reproducibility is a major challenge, and accuracy is a significant challenge as well. When you’re dealing with language, maybe one word being substituted for another doesn’t matter much, but with numbers, 2.5 is different from 2.51. So, when you’re dealing with serious applications, you can’t just substitute one number for another.

Moreover, if you’re substituting one word for another, what if it isn’t a politically correct word? What if it’s offensive? This shows that it’s not purely a technology problem. You need a broader understanding of the business context as well.

So, yeah, there are many complexities to consider beyond just the technical aspects.

Asim Razzaq:

You have to have a framework, right? This ties back to the good old quality question—what is the quality and accuracy of what comes out on the other side? The fundamental problem might be more complex or different, but people have to deal with these issues in their applications too. They have to verify the data. If you’re a company involved in financial number-crunching as a service, your numbers can’t be wrong either.

So, is that more about the quality assurance piece, or is it something else? Again, you need frameworks, test-driven development, and things of that nature. Is that something completely different, or does it fall under the same umbrella?

Raja Iqbal:

I mean, quality assurance is actually a very difficult area, especially when evaluating models in LLMs. Reproducibility and experiment design are major challenges. In classic statistics, when we design experiments, we control some variables. But in this case, how do you control things? How do you control the responses?

A lot of it is probabilistic. With LLMs, your responses can vary for the same prompt from one instance to another. You might ask the same question twice and get different answers each time. This variability poses significant quality assurance issues.

These issues might be acceptable in some domains, but for other applications, this lack of reproducibility may not be acceptable. It really depends on the specific use case and the requirements of the application.

Asim Razzaq:

Yeah, for sure. And I think these will become disciplines and frameworks, right, that things have to go through, similar to how traditionally things go through checks and balances before the final product goes out. Again, it’s an emerging space, so you will have to think through that. At the end of the day, the point of this slide is: who’s responsible for the cost of this stuff, right? Who gets the bill? Who gets yelled at at the end of the day? That’s a little bit unclear at this point.

It reminds me of the initial Wild Wild West on the cloud side. We had this whole emergence of FinOps as a discipline, and the roles and responsibilities became clear. But in the early innings of FinOps in the cloud context, bill shock was happening—someone was just spinning up machines in some part of the company because they could, sometimes swiping a credit card. So I think it’s going to be déjà vu all over again with this until there are some rules around who owns it, who’s got the budget, and how we measure the ROI. We’ll get into some techniques there because there is always a point at which the CFO will have that bill shock moment and come down and ask, “What are we doing, and why is this happening?” With the lack of roles and responsibilities, accountability becomes hard. Who do you go after?

FinOps for LLMs

Asim Razzaq:

So, moving on, we can talk about this in the context of the domain we live and breathe, which is FinOps in the traditional cloud, but also now for LLMs. As we talk to some of our existing customers—the likes of Zoom, Okta, Aflac, and others—we see that people are in early experimentation. Because they’ve seen bill shock in the cloud, they are a little worried about what’s going to happen in this domain.

This goes back to the basics. This isn’t a 70-point slide or anything. It starts with visibility. The first thing you have to understand is where the money is going. We call it attribution. In this space, when you’re talking about LLMs, you have to figure out which group, which team, which business unit owns the model. The sooner you start thinking about this stuff, the better you’ll be in terms of managing and reining in the cost. Attribution can then be broken down into multiple models—how much they are costing, what the data pipelines are costing, what the inference is costing.

If you’re trying to do an ROI and you’re saying, “Well, I’m building this conversational AI,” let’s say it sits within a particular group. If your conversational AI chatbot’s bill ends up being $750K, you need to understand how you racked up that bill. Where did the money go? If you don’t have that level of understanding, there’ll be a world of hurt because then the organizational budgetary issues come down on you.

First, you need to understand the cost of what you are doing, even if you’re in experimentation. You want to have that visibility. It’s hard to budget, but at least have some sense of budget for experimentation. If you go over it and are tracking it, you can take measures and give a heads-up to the powers that be to say, “Look, we’re going beyond what we thought, and here are the reasons why.” That explainability is going to be very important. You can’t really optimize something you can’t measure. That’s the first discipline.
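To make the attribution idea concrete, here is a minimal sketch of pulling a month of spend grouped by a cost-allocation tag using AWS Cost Explorer via boto3. The llm-project tag key is a hypothetical example, and the same pattern applies to other clouds’ billing exports.

```python
# Sketch: one month of spend broken down by a cost-allocation tag, so each model,
# data pipeline, or team shows up as its own line item.
# The tag key "llm-project" is hypothetical; use whatever your organization standardizes on.
import boto3

ce = boto3.client("ce")  # AWS Cost Explorer

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2023-08-01", "End": "2023-09-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "llm-project"}],
)

for group in response["ResultsByTime"][0]["Groups"]:
    tag_value = group["Keys"][0]  # e.g. "llm-project$support-chatbot"
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{tag_value}: ${amount:,.2f}")
```

Untagged resources show up under an empty tag value, which is itself a useful signal of where attribution is breaking down.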

The second is, how do you optimize for value? What is the ROI you’re getting? You can’t just be experimenting for the sake of experimenting. If you’re in a company, you’re running a business, and you cannot be in a Wild Wild West because ultimately this will impact the bottom line and top line of your company. We are in a macroeconomic scenario where everything has shifted from growth at any cost to profitability and margins. Every investor and Wall Street analyst is focused on making sure you have value, margins, and are running the business efficiently. As this line item starts becoming a big part of your overall financial health, having a discipline within the team in the context of FinOps will be very important.

The question of value needs to be discussed strategically and proactively. What we’ve seen is that a lot of this stuff becomes reactive and after the fact, with people scrambling to fix things without setting the right expectations and strategy.

The final one is to operate with agility. There’s an absolute recognition that if you have too many guardrails, too much budgetary stuff, and a lot of bureaucracy, you’re not going to be able to experiment fast. You need to come up with a model where the data is transparent, everyone understands it, and you can do things quickly. You don’t want a FinOps group or finance team always in your way. The more you can make that information democratized and understandable, the better off you’ll be in continuing your experimentation and creating value for the company.

So that’s the larger FinOps concept. We don’t have enough time to go into every single detail because it’s a whole discipline that some folks are familiar with, but the connection I’m making is that the whole LLM generative AI piece is simply an extension of that discipline.

Strategies for Managing AI Costs

Asim Razzaq:

Alright, so with that, we will move on to some finer-grained strategies. If you are building these models, what do you do? Raja, I’ll hand the baton to you to talk a little bit about some of the basic things people can think about in this context.

Raja Iqbal:

So I think it is once again good old engineering or, I would say, good old cloud cost management. Nothing changes; actually, there is very little that is changing, maybe a few technical things. I would say the fundamentals remain the same. Forget about cloud engineering or cloud architecture—in life, what we don’t measure doesn’t get managed. Know what the costs are, set up alerts, and all of that, right? That’s very important, so you know right away if something goes south. Right-size resources: if you cannot afford a five-bedroom house, move to a condo. Don’t over-allocate resources. Those principles apply here.

There are things like auto-scaling and using containerized implementations; you can’t go wrong with those. Know the cloud pricing models; if you can get the same thing at Walmart, don’t go to Nordstrom. Use spot instances. Use serverless architecture. Know the pricing and your business case.

Some of the sneaky things in the context of clouds and generative AI include storage. Storage can be sneaky—just a few cents per GB, but it adds up. You might end up paying thousands of dollars a month for storing something you didn’t need, like an automated backup someone set up and forgot about. Data transfer costs—optimize your architecture to minimize these.

Cost allocation is key—tagging resources properly so you can hold people accountable. Accountability is important in generative AI. With models, know when to use a 4K token limit versus a 16K, 32K, or even 100K token limit. If you don’t need a bigger context window in the context of an LLM or a foundation model, don’t use it. Understand why you need something before you start using it.
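As a back-of-the-envelope illustration of the two “sneaky” drivers mentioned here—storage that accumulates a few cents at a time, and context windows larger than the job needs—here is a rough sketch; every rate below is a placeholder, not an actual provider price.

```python
# Illustrative arithmetic only: all rates are placeholders, not real provider pricing.

# 1) Storage: "a few cents per GB" on a petabyte is real money.
gb_stored = 1_000_000            # ~1 PB of datasets, checkpoints, embeddings, forgotten backups
price_per_gb_month = 0.02        # placeholder $/GB-month
print(f"storage: ${gb_stored * price_per_gb_month:,.0f} per month")

# 2) Context windows: paying for 32K tokens when ~2K would do.
price_per_1k_prompt = 0.003      # placeholder $ per 1K prompt tokens
price_per_1k_completion = 0.004  # placeholder $ per 1K completion tokens

def request_cost(prompt_tokens: int, completion_tokens: int) -> float:
    return (prompt_tokens / 1000) * price_per_1k_prompt \
        + (completion_tokens / 1000) * price_per_1k_completion

lean = request_cost(2_000, 500)      # send only the context the request needs
stuffed = request_cost(32_000, 500)  # pad everything into a large context window
print(f"per request: ${lean:.4f} lean vs ${stuffed:.4f} stuffed")
# The gap looks tiny per call but compounds across millions of requests.
```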

I can keep going, as you said. We could spend an entire day on this.

Asim Razzaq:

Yeah, yeah, that would be great. We could have a follow-up on this. To your point, you have to understand the price drivers because if you don’t, you’ll constantly be in a world of hurt. That’s important. You used a great example about storage, and it’s very similar in generative AI. It starts with cents per gig, and we’ve had customers with a petabyte plus, spending tens of thousands of dollars without knowing why that data was being stored.

Cloud and a lot of these models psychologically give people this sense of endless, bottomless, infinite resources, but that’s just not the case. It’s almost like we’re back in the days when RAM used to be expensive, and you had to think carefully about how much RAM to put in your computer because it could get expensive fast.

The additional point is leveraging products that can help manage this, as FinOps is becoming a key pillar of managing services. You also need people with the right mindset. Performance, reliability, security, and now cost—these are the four key pillars. If you can find people with experience and an understanding that systems need to be designed, tracked, and monitored with respect to cost, you’ll be better off.

It’s not easy to find many people like that today, but you can integrate this into your interview processes when hiring engineers or key people. Check if they have experience in keeping an eye on costs. Those are some things that will be useful. This is an ocean with a lot of things to consider, but you can’t go wrong if you understand the price drivers, do competitive shopping, and track costs. It starts with simple tracking, and then you can go finer-grained from there.

Raja Iqbal:

So, Asim, I would like to add one more thing. At Data Science Dojo, anyone who gets hired goes through a presentation or a series of tutorials on our culture. One key element of our culture is what we call freedom, responsibility, and accountability. As a company using LLMs, you want people to have the creative freedom to do whatever they want to do. But there has to be a sense of responsibility and swift accountability because cloud costs can spiral out of control.

We have experienced this firsthand. People must exercise their freedom very responsibly. Ensure that resources and costs are allocated properly, resources are tagged, resource groups are created, and people are held accountable for cloud usage. If the usage is for innovation, absolutely, by all means, go ahead if your company can afford it. But if it’s because someone is lazy and not cleaning up after themselves, you need to address that.

Asim Razzaq:

And that makes perfect sense. Our experience at Yotascale, working with a lot of customers, is that ultimately there can be no accountability without attribution at a certain scale. If you don’t know who owns it, you don’t know who to go after. That’s why shining a light on which team is using what, and having a transparent culture, is very, very important.

Where Do We Go from Here?

Asim Razzaq:

OK, with that, before we get into Q&A, let’s talk a little bit about where we go from here. Where can people learn more? What’s the next phase, the next step of this? In terms of: OK, I want to use these technologies, I’m already using them, or maybe I’m a little more advanced—what are some of the ways people can delve deeper into this topic? I’ll let you start, and then I have some thoughts. We can go to Q&A from there. Yeah, resources and strategy. Obviously, this cannot be an all-encompassing, comprehensive guide, and we can do a follow-up down the road. But what would be our parting advice for people? Where can they learn more?

Raja Iqbal:

OK, so I think there are tons of resources out there. For another project I was working on, I found that all the cost optimization strategies are well-documented. One of the skills that is very underestimated is actually doing a quick web search. There are so many wonderful resources available online, and now we can even include tools like ChatGPT in that search. I use it on a daily basis, and I’m not ashamed of that. It helps me put things in the right context as long as I use my common sense. The same goes for Stack Overflow. Use it frequently, but have enough skill to discern the right answer from the wrong one—don’t just copy the wrong answer.

Start with a basic search on cloud cost management. There are plenty of resources that can help you understand the fundamental principles. And Asim, I know I may have caught you off guard here—I should have brought in some specific sources—but there are indeed tons of resources out there. Even a first search on cloud cost management will yield helpful information.

The fundamental principles of cloud cost management are not changing. Maybe a few specific things are different, but the overarching principles of cost management remain the same. Accountability is important, attribution is important, and right-sizing is important. All of those principles apply—don’t do things that are obviously wrong.

There are also technical aspects like auto-scaling, spot instances, and reserved instances. The cloud offers different pricing tiers, so you need to compare different offerings and know what you want. Having technical people who can evaluate if using a specific model or infrastructure with a lower cost would work in your scenario is crucial. Some of these common-sense strategies are how I would approach it.

Asim, do you have anything to add?

Asim Razzaq:

Yeah, that’s helpful. That’s fair. And I think it’s interesting that a lot of these conversations are even happening on Reddit and Substack. So, those are additional places to check out. Since we live and breathe this domain—again, this isn’t a sales pitch—we have a resources section on Yotascale’s website at yotascale.com/resources. If you want to learn about the general principles, you can go there. You can also find similar resources in other places that provide a blueprint, like three to five key things you need to be thinking about.

I guarantee that if you take a little bit of time and get yourself more literate about how to manage these costs, you will be ahead of 95% of people. You will be the evangelist, the go-to person. This is already an existing problem and it’s only going to get worse. People who know how to address it, who have some war wounds and have done it experientially, will be invaluable. Sitting on the sideline and waiting to deal with it when it becomes a major problem isn’t the best approach. This knowledge will significantly enhance your contribution to your company, your team, and your overall vision.

Q and A

Asim Razzaq:

I know we’re right at about 11. We can dive right into questions. People, feel free to stick around. We can certainly stick around for a few minutes, Raja, if that’s OK with you.

OK, so maybe we can take some questions from the webinar and some from the live feed.

Raja Iqbal:

Yeah, I think the live feed is also being posted here. So let me see.

Asim Razzaq:

Oh, got it. Got it. OK, that’s great. That’s good.

Raja Iqbal:

Let me start here. We’ll pick up where we left off. So, Maheeda, there’s a question: What are some ways to compensate for bias that is present in the data itself when creating an LLM-based application?

Maheeda, I would say that this question may not be specific to this context, but let me give a quick 30-second answer: you cannot. There will always be bias, so you have to be watchful. Bias means different things in different contexts—what is acceptable in North America may not be acceptable in sub-Saharan Africa or South Asia. There are social, cultural, and familial norms to consider. You need to be mindful and come up with something that aligns with core human values, but even that can be subjective. You must adjust your data and ensure it is as fair as possible. There are tools that can help identify and address biases in your data.

Another question is about comparing the cost of fine-tuning LLMs on platforms like Hugging Face, AWS, Google, GitLab, and others. Which one would you suggest for a startup?

Honestly, I do not know. Unless I have explored all of them, I can’t give a definitive answer. You should do some benchmarking and reading to understand their pricing models. Fine-tuning may not always be necessary—sometimes pre-trained models are sufficient. So, consider whether you really need to fine-tune the model.

Let me see if there are other questions.

Asim Razzaq:

Yeah, there’s another question here that I can read off. Have you guys ever had the challenge of trying to marry generative AI and GreenOps, given the high compute consumption of training new models? I’m curious if you have any practical tips—for example, how GCP offers a region like us-central1, which is a very generous region in GCP and also one of the greenest ones this cloud provider has to offer.

At Yotascale, one of the things we focus on is the environmental aspect of things, such as carbon emissions. It’s amazing how much wastage there still is. To your point, there are cloud providers taking steps to address this. There are regions that are greener than others, and there might be some basic trade-offs. In some cases, you might have to pay a little more, but certainly, GCP is a leader in this area. Microsoft and AWS are also taking similar steps.

We’re trying to build capabilities to make it easier for people to understand the trade-off between the dollar amount and the environmental impact. This is an ongoing effort at Yotascale. In the absence of such tools, you made the great comment that cloud providers do provide information on the environmental footprint of different regions and availability zones. It’s similar to how airlines now provide carbon emissions data for flights. While it’s not at the level of granularity of every single resource, it is detailed enough for you to make environmentally conscious decisions.

I’m personally quite alarmed and concerned about this issue. This race for more data centers and GPUs is only going to expedite environmental degradation. The challenge is that regulations vary by country, and infrastructure can be set up in places with fewer regulations. Larger policy measures are needed. If you’re in a Fortune 100 or Fortune 500 company, ESG (Environmental, Social, Governance) is likely a topic of discussion. Environmental impact is a key part of that, and you can raise awareness within your company to incentivize action.

Hopefully, that answers the question on some of the things you can do to make the planet greener. Raja, your turn to pick a question.

Raja Iqbal:

There was actually one more question, from Shadra. This is directed at you, Asim, and maybe I can chime in if needed. Just like we saw DevOps becoming DevSecOps and MLOps, what sort of underlying changes do you see as necessary for retooling observability, telemetry, and so on, from both quality and stability as well as cost management perspectives? What will the new form of SLAs and usage-based billing look like? Very interesting question, Asim. Your thoughts?

Asim Razzaq:

Yeah, so I think predicting the exact acronym is always hard, but the underlying message is clear. DevOps evolved into DevSecOps, which essentially emphasizes that developers should treat security as a shift-left concern—not an afterthought. MLOps surfaces the need for a whole framework and set of tools to do machine learning effectively. It has become a discipline, a genre, and a domain.

We touched on some of this earlier. Raja mentioned the quality of the model, the output, and the need for security and governance, which is being termed LLMOps today. You could insert “Sec” there because security is crucial. There will most certainly be an emergence of frameworks and tooling that cover the operations of generative AI.

The question is, does this sit in platform engineering? Is it an extension of platform engineering? Does it belong in data science? Or the business analytics team? Or is it a completely different group? In my view, the skill sets required are different, but the basics are not going to change. Whether you call it LLMOps or MLOps, you ultimately have to track usage, look at telemetry, and ensure security.

At Yotascale, we’re actively working on building capabilities where you can look at things at a model level and a data pipeline level. SLAs are going to be important. Raja mentioned latency—if you’re dealing with a conversational chatbot, a response can’t wait for 5 minutes. If it’s offline data processing for insights using LLM, it could wait a day. The basics—reliability, performance, cost, and security—will always be there.

This is a fascinating and evolving space. My biggest piece of advice is to keep tabs on it if you’re really interested. When in doubt, bring the fundamentals into the equation. Don’t overcomplicate it.

Anyways, that’s my more than two cents on the matter.

Raja Iqbal:

Yeah. And I think it’s more about the end-to-end lifecycle. If you look at MLOps or DevOps originally, it’s about managing the entire lifecycle of your software or model. Now, with LLMOps, it’s about managing the end-to-end lifecycle of your large language model and any variables that emerge.

You’re still tracking your code, you still have a Git repository, and you still have software versioning. But in addition to that, in MLOps, you might be monitoring the version of your model, its accuracy, precision, recall, and so on. With LLMOps, there are new issues to consider. Think of these in terms of extensions of each other. For example, you’ll be monitoring the versions of the prompts you use in the system. There are also different needs in terms of model explainability, observability, and more.
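To illustrate the prompt-versioning point, here is a minimal sketch of a prompt registry; the names and fields are hypothetical, and a real LLMOps stack would persist this alongside model and dataset versions rather than keeping it in memory.

```python
# Minimal prompt-registry sketch: track prompt versions the way MLOps tracks model versions.
# Names and fields are hypothetical; a real system would persist this and attach it to request logs.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class PromptVersion:
    name: str      # e.g. "support-bot-system-prompt"
    version: str   # e.g. "v3"
    template: str
    created_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())


registry: dict[tuple[str, str], PromptVersion] = {}


def register(prompt: PromptVersion) -> None:
    registry[(prompt.name, prompt.version)] = prompt


register(PromptVersion(
    name="support-bot-system-prompt",
    version="v3",
    template="You are a support assistant. Answer only from the provided context.",
))

# Log (name, version) with every model call so any response can be traced back
# to the exact prompt, model, and parameters that produced it.
```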

These are natural progressions of each other, with new layers being added. In LLMOps, you’ll be dealing with unique aspects specific to large language models, but the underlying principles remain the same. The new forms of SLAs will be dictated by business needs and the kind of applications being built.

Certain things will remain constant, while others will vary depending on the specific application.

Asim Razzaq:

Yeah. I mean, you’re always going to have the usual cost versus SLA trade-off. Nothing is free, so you always have to consider that balance.

One of the questions towards the end here is about whether businesses that produce huge amounts of data, like telecoms, should use cloud services given the high costs, when on-premise solutions require a lot of expertise. What do you think?

This goes back to the use case. It’s hard to answer because it depends on what the telecom is using this for. You can have a large amount of data, but you can always start building your expertise with the cloud version of these model APIs. You don’t have to feed every single byte of data into it. You can take a small subset of the data, train on it, and learn the process.

When you really need to scale to large amounts of data, that’s when you expand. Hopefully, by that point, you are literate enough in the technology. I wouldn’t recommend going from zero to building everything on-prem from scratch because you have a large amount of data. Even large companies should experiment with a subset of their data that won’t break the bank. Use some of the latest and greatest technology out there before bringing it in-house.

You see this in some cases with repatriation. Companies with a huge cloud footprint found it very expensive. For instance, some storage-centric companies started with everything in the cloud but brought storage back on-prem. By that point, they had a good understanding of the architecture and other considerations.

So, my thought process is that a company should not bite off more than it can chew. Start small, learn, and then scale up when ready.

Raja Iqbal:

OK, I think we are way over time.

Asim Razzaq:

We are. I think with that, we’ll conclude. We have a recording, and some people asked if that will be available. We’ll certainly make that available to everybody. If you have other questions, reach out to us. You know Raja at Data Science Dojo and myself at Yotascale. I’m sure we have LinkedIn pages and Twitter accounts, so go seek us out. I enjoyed the conversation quite a bit, Raja, and thank you for your time.

Raja Iqbal:

Thank you for your time. Bye.

Asim Razzaq:

It was a great discussion. Take care, everybody.