Case Study: How Flatiron Health Gained Visibility and Control Over Total Platform Costs

The 6-Factor Framework for Managing Cloud Costs

Jeff Harris

The current economy and the AI/ML boom have created a macro-environment where companies are noticing increases in their provider bills. Managing cloud costs is now a make-or-break issue. While the imperative of managing cloud costs is clear, the road to achieving it is still a challenge, and putting your engineering team at the center is critical to success.

In this webinar Jeff Harris, Director of Strategy & Operations at Yotascale, will present a practical 6-factor framework for managing your cloud costs. Engineering leaders and executives who want to establish a cloud economics practice will learn about:

  • Developing your “Spidey-Sense” for pre-empting spiking provider costs
  • Establishing shared unit economic metrics between engineering & finance
  • Strategies for motivating engineering teams to manage costs
  • And codifying your business cost strategy in automation for long-term success

Length: 35 minutes

—

Webinar Transcript

Introduction

Jeff Harris:
Thanks for joining, everybody. We’ll get started pretty soon. We’ll wait a few minutes here for people to get into the webinar today before we get kicked off.

And welcome to those of you who just joined. We’re going to go ahead and get kicked off here in another minute or so and just give a few more seconds for others to join before we get started.

Excellent. Alright, everyone, we’re going to go ahead and get started this morning, afternoon, evening, depending on where you’re calling in from. We are here today to talk about a six-factor framework for cloud cost management.

I want to give a shout-out to a colleague of mine, Sasi Kanumuri. His work and his blog are what this framework is based on, so credit to him for doing the work and actually going through the process of implementing this in a company that was going through a change of implementing FinOps, a FinOps practice. So definitely check out his blog for more details around his experience. Thanks, and we’ll go through this at a high level today, but there’s certainly more detail worth looking into there.

For everybody who’s here now, my name is Jeff Harris. I am the Director of Customer Success at Yotascale.

About Yotascale: Our Customers Get Measurable Results

I’ll talk a little bit about Yotascale and where we fit in the world, and then we’ll get straight into the content. Not a whole bunch of time spent on Yotascale, but just to get a very quick intro: Yotascale is a B2B SaaS company. We provide software that helps other companies manage their cloud spend. We do that by focusing on three core pillars: visibility, prediction, and optimization. All of this is done with your business context in mind, really tying cost back to business entities, business hierarchies, applications, products, however you want to break it down.

We also focus on collaboration across these three pillars, making sure that the different stakeholders within an organization—finance, executive leadership, and engineering—have access to the same data and can agree on and work with the same language around cost. We also have adaptive automation that underlies all of this as well. It allows for users to come in and provide their feedback, and that feedback gets taken into account and is used to update our forecasting algorithm when you respond to an anomaly alert. So all of these things really combine across these pillars to provide a solution that allows companies that are growing to manage their costs in cloud.

Just as an example, for the types of companies that we’re working with today, they really range from some of the more traditional finance companies like Jefferies to companies like Zoom and Hulu that are more focused on technology.

Macroeconomics Driving Rapid Increases in Cloud Costs

Getting into the content of the subject we’re here to talk about today: what’s driving all of these companies to look for cost management solutions? Typically, we see companies start looking for cost management solutions when they get into very high spend. There are tools out there that the provider offers that can help you maybe up to $1,000,000 in spend or so. However, at a certain point, the business either grows or more people get involved, and you lose touch with the individual resources that are out there. For the companies that are in cloud today, some of the macroeconomic trends that we’re seeing, especially right now around LLMs, AI, and ML, have driven renewed interest in this subject. Personally, in products that I use on a day-to-day basis, I’m seeing generative AI thrown in all over the place. All of these experiments, all of these people trying to latch onto these trends, are costing money, and that is going to continue to increase over time.

We’ve seen Andreessen Horowitz and other industry leaders put out articles about the cost of running AI models, and this is also becoming an environmental concern. As more and more compute is used, we are also going to see more and more focus on the environmental impact that has. Additionally, just looking at the pocketbook, Microsoft recently announced a 10-15% price increase across its cloud services in the EU. Microsoft is hosting OpenAI, so all of these things are related, and we are seeing the cost of cloud computing increase at the same time as demand for cloud computing rises.

Platform Economics is the New KPI

Those trends are leading engineering’s focus to expand from just reliability, performance, and security to also having to look at cost. And that’s where FinOps, as a concept, comes into play. There is the traditional FinOps organization, and there are a lot of different ways to implement a financial operations practice for cloud. Reliability, performance, and security are disciplines that have been around for quite a long time, so there are a lot of tools, and that ecosystem is very large in terms of what you can find to get visibility into your performance and your security posture. Cost is now becoming a focus as well.

This industry has been around for maybe 10 years at this point, and Yotascale has been active in it for about 5 years, so we’re sort of a second-generation product in that sense. But the important thing here is that cost is now a requirement when you’re looking at an engineering organization, especially because we’ve moved from this CapEx to this OpEx world. And we’ve moved to a world where it’s more agile and you want your software engineers to have agility to go build things, to go experiment with ChatGPT or OpenAI or any of these other things that are coming out, but you also need to be sure that you’re keeping an eye on cost. One of the ways to do that is to ensure that the engineers who are working on the infrastructure have visibility into that cost.

Cloud Cost Management Continuous Cycle

This is going to lead us into this concept of the continuous cycle of cloud cost management within an organization. Really starting with these four main factors, one being visibility. How do I start by getting monitoring in there? How do I get a Spidey sense? It could be an anomaly detection capability, it could be just providing a dashboard in some sort of visualization tool or BI. All of these have some costs to implementing and maintaining, but the visibility into the cost ultimately will pay off as engineers are able to take that and factor it into their work that they’re doing. And keep in mind how the cost of the infrastructure changes as they make changes to the code or to the way the infrastructure is architected.

So that first piece there is visibility. Then it’s about getting some insights. Here we see the shared cost unit economics. What are the different things happening in the environment that you want to be able to track over time? Our costs may be growing in cloud, but are we bringing in more dollars from the services we’re providing? Be able to look at a unit metric: our cost per video hour watched, our cost per data store, our cost per client served, or cost per insurance claim adjusted, whatever it might be. There should be a business metric that you tie this back to so that you can track changes over time as they relate to the business, rather than just looking at cost and saying, “Oh my gosh, it’s increasing, what’s happening?” If cost is increasing but you’re bringing in more money, if you’re spending a dollar and getting ten, that’s a good deal. But if that ratio is changing over time, you want to be able to track it.
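A shared unit metric like the ones described above is simple to compute once the inputs are agreed on. Here is a minimal sketch; the dollar figures and video-hour counts below are made-up example data, not real numbers from any company:

```python
def unit_cost(total_cloud_cost: float, business_units: float) -> float:
    """Cost per unit of business value, e.g. per video hour watched."""
    if business_units <= 0:
        raise ValueError("business units must be positive")
    return total_cloud_cost / business_units

# Example: spend grew month over month, but so did usage,
# so the unit metric held flat -- a healthy trend to track.
last_month = unit_cost(120_000, 4_000_000)   # illustrative figures
this_month = unit_cost(150_000, 5_000_000)
print(f"cost per video hour: {last_month:.3f} -> {this_month:.3f}")
```

The point of the calculation is the trend, not the single number: finance and engineering agree on the denominator once, then watch the ratio over time.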

The third piece is really creating accountability around these metrics. Who owns each metric, and who owns making sure the goals are being adhered to? You can’t just throw it out there. You really have to integrate directly into the engineering processes that are there. So you need this information pushed out to the edges of the organization. Once you’re able to collect it and you agree as an organization that these are the right metrics, you want to deliver that information to people in line with their work, so that, in the flow of their day, the information is available and they don’t have to change their habits to integrate it into their current processes.

The way we see this happening, from my own experience at Yotascale, is by integrating into engineers’ Slack channels. We can push cost information for a particular team directly to the engineers responsible for it. We push anomaly alerts directly to those same channels so that when engineers get an alert that their cost has spiked in a way it hadn’t before, they’re able to click into it, go directly into the product, and get more details. So how do we integrate directly into those processes? Maybe that’s getting tickets into JIRA to the right team. Maybe it’s making sure that teams have visibility into their portion of costs specifically, so they don’t have to go sifting through all the company data.
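To make the Slack-push idea concrete, here is a generic sketch using a Slack incoming webhook. This is not Yotascale’s actual integration; the message fields, team names, and dollar amounts are assumptions for illustration:

```python
import json
import urllib.request

def format_anomaly_message(team, service, expected, actual):
    """Build a Slack-style payload for a cost anomaly alert."""
    pct = (actual - expected) / expected * 100
    return {
        "text": (f"Cost anomaly for {team} / {service}: "
                 f"${actual:,.0f} vs expected ${expected:,.0f} (+{pct:.0f}%)")
    }

def send_to_slack(webhook_url, message):
    """POST the payload to a Slack incoming webhook (fire-and-forget;
    add error handling and retries in a real pipeline)."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# Example payload for a NAT gateway spike like the one described later;
# the figures here are invented for illustration.
msg = format_anomaly_message("platform-team", "NAT Gateway", 1200, 4800)
print(msg["text"])
```

The key design point is routing: the alert lands in the channel of the team that owns the resource, not in a company-wide inbox.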

Then the final piece here is automating that. Usually, the first time you go through this, it might be a manual process with a lot of back and forth and maybe some iteration, but once you’ve got that process in place, you want to just press the button and let it go on autopilot. As your business changes, as teams move from one business unit to another, as a new application gets started, how are you tracking those changes and making sure you’re providing that same visibility for the new applications and new business units? How do new users get access to the information as you bring new engineers on? Then you continue through the cycle and check: are these still the right metrics? What are we tracking now? Are we integrated into the engineering processes as they are today? What’s working and what’s not? Continue to adjust, then come back again, ask what new processes have been added or changed, and make sure your automation stays up to date with the way the organization changes.

I’ll say that there’s a Q&A here. Please, if you have any questions, feel free to reach out. If they’re relevant, I’ll hit them while we’re talking here. Otherwise, we’ll save some for the end.

Factor 1: Cloud Cost Visibility

Alright, so getting directly into the factors. What did we come here for? We wanted to learn about a six-factor framework for cloud cost management. The first factor we’re going to talk about today, one we’ve already kind of hit on, is cloud cost visibility.

What we’re looking for here is visibility at the level of granularity that matters to the individuals, to the stakeholders within your organization. Typically, we see a few different groups of stakeholders. We see engineers, the people actually working on the infrastructure; they have a certain view that makes sense to them. We see finance; they’re not working on the infrastructure, but the cost is important to them, and they may not understand all the AWS terminology, so seeing that EC2 cost or m3.xlarge cost went up isn’t as relevant to them. We want to make sure we’re providing visibility to people in the context that matters to them, so they don’t have to spend extra time and do extra work to get to relevant data. If they do have to spend that extra time and work, it either takes away from other things they could be doing, or the data just doesn’t get looked at. And that’s the worst-case scenario, because without visibility, none of the other stuff can happen. We don’t know where we’re spending the money, we’re not attributing it back to the applications, teams, and business units we work with, and there’s no way for us to optimize or hold people accountable as we get into the other factors.

Other things to consider in this area: we’re talking about cloud visibility, and that could be AWS, Azure, GCP, or whatever you’re using. Some teams may be operating multi-cloud, or you may be at an organization with different business units or subsidiaries that merged together and had different cloud investments. The top-level organization needs to see cost across all of its clouds, and you want to see that all in one place. You want to be able to see how the trends are changing, or if you’re doing a migration from one provider to another, how the two compare. Having that on a single pane of glass is extremely helpful. The cloud providers have their own billing dashboards, and some even provide support for other cloud providers within them, but those tools will be limited. At a certain point, you’re going to outgrow them if your company continues to grow in cloud.

Another area where we see a lot of challenges is around Kubernetes and microservices. What we often see happening is that there’s a big shared platform, and all of the tags on it belong to the platform team, but the platform team can’t do much about managing the cost, because it’s the application teams that are deploying pods into the cluster. So how do we do showback? How do we show back the costs to the teams deploying their applications and services in there? If you can get the actual cost of the node the pods are running on, and you can get the actual utilization metrics of the pod itself, you can calculate the cost of a pod and aggregate that across pods by name, label, or namespace. This type of visibility is extremely important for platform teams.
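The pod-level showback calculation described above can be sketched as a simple proportional split. This is a rough illustration only: it allocates by CPU requests, whereas a real tool would also weigh memory, actual utilization, and amortized discounts, and the node rate and pod names below are invented:

```python
def pod_costs(node_hourly_cost, pod_cpu_requests):
    """Allocate a node's hourly cost to pods by share of CPU requested."""
    total_cpu = sum(pod_cpu_requests.values())
    return {pod: node_hourly_cost * cpu / total_cpu
            for pod, cpu in pod_cpu_requests.items()}

node_cost = 0.40  # illustrative on-demand hourly rate for the node
cpu_requests = {"checkout-api": 2.0, "search": 1.0, "batch-report": 1.0}
for pod, cost in sorted(pod_costs(node_cost, cpu_requests).items()):
    print(f"{pod}: ${cost:.2f}/hr")
```

Aggregating these per-pod figures by label or namespace is what turns one opaque platform bill into per-team showback.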

Getting into attribution and tagging: tagging is very important for fine-grained allocation. If you’ve got an account-based structure, if you’ve somehow managed to keep all applications, teams, or subsidiaries in separate accounts, some level of account-based allocation can be a good way to start. But to get down to the granularity you’re ultimately going to want, you’re going to need a good tagging system in place. This is why we have to start here. We have to first ask: what are the questions we want to answer? Then we have to make sure we have a tagging policy in place that supports answering those questions. And then we have to find a way to hold people accountable to that and to track how we’re doing against our tagging policy.
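Tracking compliance against a tagging policy, as suggested above, is straightforward once you have a resource inventory. A minimal sketch, where the required tag keys and the inventory records are assumptions for illustration (in practice you would pull resources from your cloud provider’s API):

```python
REQUIRED_TAGS = {"team", "application", "environment"}  # assumed policy keys

def untagged(resources):
    """Return IDs of resources missing any required tag key."""
    return [r["id"] for r in resources
            if not REQUIRED_TAGS <= set(r.get("tags", {}))]

inventory = [
    {"id": "i-0abc", "tags": {"team": "search", "application": "indexer",
                              "environment": "prod"}},
    {"id": "i-0def", "tags": {"team": "search"}},  # missing two required keys
]
print(untagged(inventory))  # resources to flag in the compliance report
```

Running a report like this on a schedule, and routing each non-compliant ID to its owning team, is what makes the policy enforceable rather than aspirational.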

Looking at trends and analysis, I think this is pretty self-explanatory, right? As part of visibility, you want this to be a regular, recurring thing. Pushing a report on a weekly or monthly basis can be extremely helpful, putting it in front of people so they see that cost is the same as last week, same as last week, and then, when it’s different: okay, now there’s something I can go look at. That regular cadence of having cost information served to you is a good way to do this, rather than relying on people to change their habits and start going in to look at costs on their own. That’s factor 1. Alright, on to factor 2.

Factor 2: Cloud Cost Insights

On to cost insights. So we’ve done the visibility, we’ve got that base view into our costs, and maybe we’ve even gotten to the point where we’re providing access to it for other people. But visibility alone doesn’t get you to action. How do we go from “okay, cost increased” to “what can I do about it?” What are the insights you can provide? That’s where you get into specific metrics for your organization. We can look at unit economics, coming back to that. How are you trying to get more users into your system? Is it sign-ups? Is it time spent in the product? Is it the number of, I don’t know, software units delivered? How do you think about what it is that your business is providing? Why are you in the cloud? And then how can you tie that back to the cost?

We also look at utilization metrics, and other insights can come from anomaly detection, pushing an alert out to say, hey, cost has spiked here. A great example we’ve seen recently, from my own experience with customers, was at Zoom, where a cost anomaly got delivered to the engineering team responsible for the NAT gateways. It turned out there was a configuration change that went from using an internal IP to an external IP, and because of the way AWS does network charges, this significantly spiked their cost. The challenge was that this type of spike, while significant, wasn’t significant enough to show up on the top-level radar for Zoom’s cost. They wouldn’t have had that level of granularity without the tagging policy in place, and without the Slack integration set up to push the alert to the engineer at the right time, it would have gotten lost in an inbox or sat in a web view where the insight couldn’t become action. So again, find a way to get these alerts to the right teams, and make them specific and granular so it’s easier to take action on them.

Looking at idle resources, there’s an insight there, right? So hey, these are resources you’re not using or could be right-sized. Here’s the action you can take. Here’s the instance type you can change to. That’s great. Then how do I get it to the people? And we’ll get into that here in just a minute as well.

Factor 3: Cost Governance

Cost governance is the third factor here. Governance has a lot of meanings. In the case of cost governance, it’s: how do I enable agility without losing control of the budget? How do I balance that? That’s the constant challenge for a FinOps team or for finance. They’re thinking about how we go into cloud, how we leverage cloud, how we get our ROI out of it. You do need to look at the levers that are out there. Is there a way to do chargeback and showback? That’s part of governance: making sure people are aware of what they’re spending. How do I work with my teams? How can you set up regular cadence meetings with leadership to make sure these topics are important to them as well? How do you drive alignment across finance and engineering so they can speak the same language? Finance may have their cost center codes; engineering has their applications. How do these come together, and how can engineering talk to finance about what are real opportunities to optimize on, versus what the cloud provider might say you can optimize on but doesn’t make sense in the context of your business or the way your services work?

A good example with this would be around a customer we work with today, NextRoll. They have their engineers respond to the anomaly alerts that come up and finance reviews those to understand are these changes that we’re seeing in cost, these cost anomalies, something that is going to be temporary or long term. So was it we deployed a new cluster, and our cost is now higher or was it that we ran a larger market query and our cost spiked for the day? So that is another way that we see tools like Yotascale used to help with this cost governance stuff. But budgeting, forecasting, making sure you’ve got the allocation right and that the organization all agrees with that are all really important. Then looking at those collaboration opportunities, where can engineering easily provide quick responses to finance so that they can take that information and use it for their own forecasting model.

Factor 4: Cloud Cost Optimization

The fourth factor here is optimization. So we’ve got the visibility, we’re allocating the cost; how do we go in and start making improvements to the services we already have out there? A lot of times we see companies migrate their infrastructure, either from on-prem or from another cloud, whatever it may be. When you’re doing that, the whole goal is just to get it over there and get it running, make sure it works. Sometimes you remember to come back after the fact and ask, okay, are we on the right instance type? Is the cost profile what we expected? But often you don’t; you just move on to the next fire, the next project. So it helps to have something in place that can automatically identify these opportunities and then let your teams evaluate them: identify the opportunity, route it to the right team, allow them to provide their feedback, and take the triage work off your plate. That’s where software products like Yotascale can help.

From the optimization perspective, you’re also trying to look at what’s the low hanging fruit, what’s going to get me the biggest ROI. So doing that sort of analysis can be extremely helpful as well, either by looking at the total dollar value or by what’s realistically something we can take and do right. So evaluating all the opportunities, ranking them, and then figuring out which ones you take action on.

There are other ways to optimize on cost beyond right-sizing and cleaning up resources that aren’t in use anymore. Also think about what type of workload you’re running. I’m sure most of the people on here are familiar with spot instances and the more ephemeral instance types the cloud providers offer. Can you move your workload to those? Because there can be a significant decrease in cost. Can you look at your R&D environment, your dev environment, and see if there’s a way to shut down instances when engineers are not working on them? Not all resources can be shut down; sometimes there are tests running 24/7. But really take the time to evaluate that, look at how much you’re spending on your dev environment, and see if you can get that down. That’s a place where we see a lot of opportunity to improve on cost.

We’ll get into some of this in the vendor management section as well, but there are reservation commitments you can make. All the cloud providers have some level of commitment where you can say, hey, I’m going to use this much compute or this much database capacity for the next year, two years, or three years; those are typically the terms. Having a strategy around how to leverage those discounts is important. One anecdote I’ll throw in here, though: you’ve got to be careful about committing to reservations before you’ve done a right-sizing and cleanup of resources. You don’t want to commit to resources that you’re going to end up deleting, so keep that in mind as well.

Factor 5: Vendor Management

The fifth factor is vendor management. What are we talking about here? If you’re using a cloud provider, if you’re working with AWS, or even with some of the more storage-specific vendors like Snowflake, there’s pricing to think about. You have to think about how much you’re actually going to use, because often the vendors will give you a discount if you commit to a certain amount of capacity or spend with them. So you still need to do some level of capacity planning, and you need to work collaboratively within the organization to figure out what you can safely commit to these vendors. You need to understand what the actual cost is going to be. The calculators that are provided can be helpful for a very rough ballpark, but the reality requires evaluating how much network traffic there’s going to be, which ends up being a huge cost, thinking about data storage and how long that data will be in place, and accounting for the overages and the penalties. You really have to have somebody think this through, especially in a large organization, and focus on the levers available for working with these vendors.

The key thing here is making sure there is agreement between engineering and finance, because this is one of those places where what’s being bought is potentially one of the most expensive things on the budget. People, office space, cloud: that happens to be the order of spend for a lot of tech companies today. Finance is going to be wary of overages, of spending too much, and of being responsible for that without having any control over it, because engineering needs access to be able to do what they need to do. There has to be collaboration, there has to be agreement, and people have to work together in these organizations to manage the relationship with the vendor. Engineering has to be good at forecasting so that finance can commit to the right amount with the vendors.

Factor 6: Automation Framework

The sixth factor is the automation framework. We’ve talked about going through the process of all these different things, and some of the five factors we just covered can be automated. Vendor management, maybe not yet at least; we’ll see what ChatGPT can do for us in a few years. But today, there are places where we can implement automation, and that’s around how we take the information we’re getting and make sure it’s being pushed out. How do we triage recommendations and triage anomalies? How do we do that automatically?

Another one is really, we’ve talked about visibility, but the reality of today is that organizations are changing quickly, so teams form and disperse every quarter. How do you make sure that the cost data you’re looking at is up to date with the definition of your organization and the way that you’re talking about cost internally? If a new application and new service is spun up, how do you make sure you’re tracking costs for that and sending reports to the right team?

That’s one piece of it. The other piece is there’s some automation you can even put in on the optimization side. Is there some policy you have in place where if a resource is showing up as unused for 30 days, it just automatically gets terminated? Do you have an escalation policy where you can send an email out to the owner of this resource every week and let them know, “Hey, we’re deleting this now. You haven’t responded to these emails, we’re just going to clean it up.” So that’s another area.
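The escalation policy described above (warn the owner, then clean up) can be codified so it runs the same way every week. Here is a hedged sketch; the 14-day and 30-day thresholds, the two-warning rule, and the dates are illustrative assumptions, not a recommended standard:

```python
from datetime import date

def next_action(last_used, today, warnings_sent):
    """Decide what the idle-resource cleanup policy should do next."""
    idle_days = (today - last_used).days
    if idle_days < 14:
        return "none"
    if idle_days < 30 or warnings_sent < 2:
        return "warn-owner"   # e.g. weekly email to the resource's tagged owner
    return "terminate"        # idle past the threshold and owner warned twice

today = date(2023, 6, 30)
print(next_action(date(2023, 6, 25), today, 0))  # recently used
print(next_action(date(2023, 6, 10), today, 1))  # idle 20 days: warn
print(next_action(date(2023, 5, 1), today, 3))   # idle 60 days: terminate
```

Keeping the decision in one pure function like this makes the policy easy to review with finance and easy to change when the organization’s risk tolerance changes.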

Then on the visibility side, one more would be around tagging. Automate the tagging, make it infrastructure as code, and make it a uniform policy across the organization. In some cases, we even see people using tools like Cloud Custodian to clean up resources that aren’t meeting the tag policy. So there are a lot of different options around automation, we’re continuing to see more and more of them show up, and there’s a lot of opportunity on the automation side.

FinOps Aligns Finance & Engineering

Being conscious of time here, that brings us to the end of our six factors. Looking back at them, we have cost visibility, cost insights, cost governance, optimization, vendor management, and the automation framework we just spoke about. I hope this is a good intro to FinOps, to thinking about how to implement something like this in your own organization, and to some of the pitfalls that might show up. I really appreciate everybody’s time today. I will stay on for some additional questions if anybody would like to ask them.

Otherwise, I do want to say thank you. If you want to learn more about Yotascale, you’re welcome to reach out. This content was again based on a blog by Sasi Kanumuri. If you’re interested in that blog and can’t find it, feel free to reach out to me and I’ll send you a link.

Again, thank you all for your time and I’ll hang out here for another few minutes to answer any questions.

Q & A

Question: How do we get started with Yotascale?

Jeff Harris: It’s super simple. Again, reach out to us; you can email me directly, and we can set up some time to go through the product and take a look at what you’re doing to make sure it’s a good fit. We look at the Cost and Usage Report, which is our main source of data from AWS; for Azure it’s the billing export, and GCP has a similar export. We can pull all that data in and show it to you across all the different clouds. An eval usually takes about two weeks. For more on that, just feel free to reach out.

Question: Is there a way to get a cost spike on a Grafana dashboard?

Jeff Harris: Yes, there certainly is. I don’t know if you’re asking specifically about Yotascale, but we have APIs available where you can pull data out and then display it wherever you’d like. If you’re looking at building your own solution, you can absolutely take the cost data and get it into Grafana. If you’re on AWS, the Cost and Usage Report is the thing to look at; Azure has the billing export, and GCP has a similar export. That basically exports a large set of data from the cloud provider, which can be millions of rows depending on how much you’re using. Then you can pull that data into some other store that you can point Grafana at and have it display the data. So it’s absolutely possible. Or if you wanted to connect to our product directly, you can use the APIs to pull data out of Yotascale and have it displayed wherever you want.
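The pipeline described in this answer, billing export in, queryable store out, mostly comes down to aggregation. A minimal sketch of the rollup step; the column names are assumptions loosely modeled on AWS Cost and Usage Report fields, and the rows are invented sample data:

```python
from collections import defaultdict

def daily_service_totals(rows):
    """Roll raw billing line items up into (date, service) daily totals,
    a shape a Grafana datasource can easily chart."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["date"], row["service"])] += row["unblended_cost"]
    return dict(totals)

rows = [
    {"date": "2023-06-01", "service": "AmazonEC2", "unblended_cost": 12.5},
    {"date": "2023-06-01", "service": "AmazonEC2", "unblended_cost": 7.5},
    {"date": "2023-06-01", "service": "AmazonS3", "unblended_cost": 3.0},
]
print(daily_service_totals(rows))
```

In practice the raw export can be millions of rows, so this rollup would run in a warehouse or batch job, with Grafana pointed at the aggregated table rather than the raw export.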

Question: Is there a way to show all cloud service providers’ dashboards in one place? If yes, what are the advantages of using it instead of native?

Jeff Harris: For Yotascale, we focus on GCP, Azure, and AWS today. We can pull all of the data from those cloud providers into one view. If you’re a business executive and you don’t care which cloud provider the cost is coming from, we can translate it all into which business unit the cost is being attributed to. Or if you do care, you can drill into just GCP, for example, and see cost for one cloud provider versus another, all very seamlessly within the product. The advantages of using third-party software over native tools, especially if you’re multi-cloud, are that you don’t have to go to three different cloud providers to get the information and the dashboards. We can provide a layer over the top that merges those three different worlds. If you want to look at just compute costs, or just database costs, and don’t care which provider it is, there’s a way to do that. It’s also a way to allocate the cost of all three providers and then send out notifications, alerts, dashboards, and so on. So there are certainly some advantages.

Question: What will be the difference in price for your product and Grafana?

Jeff Harris: Hard to say. I’m not super familiar with Grafana’s pricing, but we have an ROI calculator, so I’m happy to share that with you if you want to reach out. That can help you at least understand how we think about it, and you can always tweak it. You have to think about the data processing: pulling the data, storing the data, processing the data, then hosting the visualization side of it, and then the maintenance. The idea is that you’re paying for somebody else to do all of that and to keep up with the changes in the data sets the cloud providers provide. The way I would evaluate it is to ask: how much are y’all spending, and how big of a problem is this? If it’s less than $1,000,000, stick with the Grafana dashboard, stick with the internal tools. Honestly, until your organization is getting big, there are a lot of people involved, and you need tooling to automate things, you can get by fairly well with the native tools.

Awesome. I appreciate all the questions. Again, happy to answer more should you have them. Feel free to reach out. My email is jeff@yotascale.com, super simple. You can reach out to me directly. Appreciate everybody’s time today and thank you again. Hope to hear from you soon.