Case Study: How Hulu and NextRoll Gained Visibility and Control Over Total Platform Costs

How Hulu and NextRoll foster a culture of spend awareness

Jeff Harris

Many engineers developing cloud applications do not have a business-oriented understanding of the cloud costs of their applications. Hulu and NextRoll have tackled that challenge by empowering their engineering teams, fostering a culture of cloud spend awareness and improving the management of their cloud costs.

Length: 23 minutes

Webinar Transcript

Introduction

Jeff Harris:

So Yotascale is a B2B SaaS software provider. We have a single product today, and it’s multi-cloud cost management. We’re based out of Menlo Park, California. Actually, just this year we’ve been getting some good press; we recently received the Gartner Cool Vendor award. So we’re looking forward to sort of accelerating our position in the market here coming into 2022.

The Hefty Price of Cloud Waste

Jeff Harris:

I’m going to quickly talk about the problem. I think probably everybody here can relate to the extreme increase in cloud spend over the last decade, even the last five years. And related to that is also an increase in waste. Speaking to different customers and companies, there’s never somebody I talk to who says, “I’m spending the right amount on cloud.” There’s always this fear, concern, and worry that they’re spending too much, that they don’t know what they’re spending it on or how to attribute it back to the business.

And for all of you in the CPG world, you’ve got supply chains. You kind of know how much each of your little components costs, but when we go to the cloud, we lose that visibility. What Yotascale allows you to do is take all of these different data sets from different cloud providers, whether it’s AWS, Azure, or GCP, pull that information in, and let you define how you’re going to view it, whether you want to view it through…

So we’ll get into more detail about how we’re doing that with a couple of our companies, but the problem, I think, speaks for itself: we’re spending a lot of money on cloud. We’re spending too much. How do we get a handle on this?

The Challenge of Modern Cloud Cost Management

Jeff Harris:

And there are cost management products out there, including the cloud providers’ own tools. The cloud provider tools do enough to give you some information, but they sometimes aren’t so much there to help you understand costs in the context of your business as they are to help you pay the bill.

So some of the problems that we’ve seen with the cloud cost management tools out there today are driven by dynamic ownership. As you go into the cloud, you’re constantly developing new services and applications, trying new things out, creating new projects. How do you track the cost of those over time as things change, as your business definition changes, as new business units are created, change, and merge?

How do you keep up with all that and still maintain the ability to track and visualize costs based on that business hierarchy, that business lens? Not being able to track it leads to a lack of visibility: cloud spend is not linked to your business application context. Companies are trying to solve this by taking this information and putting it into Excel spreadsheets. Typically those spreadsheets make sense to one or two people in the company, but it’s really hard to disseminate that information when you do this manually, in spreadsheets.

This leads to an inability to collaborate. When you have a really centralized team that’s analyzing costs and they’re not able to get that information out in a timely, regular way, you can’t use cost information to help build systems. Engineers are data-driven: they look at performance, they look at security. Cost information just tends to be harder to get in the cloud world.

So what we are able to do is layer in that cost information to deliver that to the engineering team so that they can actually view the information and take action to build their applications and infrastructure in a thoughtful way when it comes to cost.

Trusted by the Best Engineering Teams in the World

Jeff Harris:

Some of the customers that we are helping today span not just consumer packaged goods and entertainment but also the financial world, the B2B SaaS world, and the medical genetics analysis world. Anybody who’s really spending a lot of money on cloud computing is looking for a solution to this.

And as these companies grow, as they get closer to becoming public, this becomes a bigger and bigger problem for them: really getting that financial house in order, being able to report on margins, being able to report on all of the different factors that feed into their costs, with cloud costs chief among them.

So today we’re going to talk specifically about a couple of these customers, starting with Hulu.

Hulu Case Study

Jeff Harris:

We’re all familiar with Hulu; we’ve probably all spent a lot of time watching Hulu content over the last year. What was really key here was understanding their cloud costs and gaining visibility into them. The challenge was that they had been using a sort of Gen 1 product, a cloud cost management tool that had been out there for several years. It really helped them understand costs at the resource level.

But Hulu made this big move into Kubernetes, into containerized environments, leveraging ECS and EKS from AWS. When they did that, they lost visibility into the cost of what was actually running inside of these very, very large shared clusters. From an infrastructure perspective, all the tags on these resources say: this is the platform team, this is our Kubernetes platform. But inside of that, you’re actually deploying many, many different applications at a smaller scale.

What we were able to do for them was pull out the utilization data and the requests to the cluster, and proportionally assign cost to each container based on the node it was running on, how long it was running for, and how much of that node’s capacity the container was taking up. Doing this for a single container is interesting, but we need to aggregate it. So we aggregate that and allow them to have this unified view of containerized and non-containerized cost.
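
To make that proportional assignment concrete, here is a minimal sketch in Python. The node shape, price, and pod requests are hypothetical, and the 50/50 CPU/memory averaging is an assumption for illustration, not Yotascale’s actual algorithm.

```python
# Minimal sketch: proportionally assigning a node's cost to the pods it hosts.
# All names and numbers below are hypothetical.

def pod_cost_share(node_hourly_cost, node_cpu, node_mem_gib, pods, hours):
    """Split a node's cost across pods by their share of requested capacity.

    CPU and memory shares are simply averaged here; a real allocator might
    weight the two dimensions by how the instance type prices them.
    """
    costs = {}
    for name, (cpu_req, mem_req) in pods.items():
        share = 0.5 * (cpu_req / node_cpu) + 0.5 * (mem_req / node_mem_gib)
        costs[name] = node_hourly_cost * share * hours
    return costs

# An m5.xlarge-style node (4 vCPU, 16 GiB, ~$0.192/hour) running for a day.
pods = {"checkout": (2.0, 4.0), "search": (1.0, 8.0)}
print(pod_cost_share(0.192, 4.0, 16.0, pods, hours=24))
# -> about $1.73 attributed to each pod over the day
```

Capacity that no pod requested stays unallocated in this sketch; a real system also has to decide who absorbs that idle headroom.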

So what does that mean? It’s looking at a dashboard of how much a particular application or service cost me. That service or application may be using some Kubernetes, may be using some S3 or storage, and it may even be reaching out to Azure for certain things, right? So how do we get all of that data, all of that cost information, into one single unified view?

Because this unified view wasn’t available, engineering was not aware of the costs; they were ignoring them. At certain times in a company’s life, that’s OK. You can kind of get by by just getting everything into the cloud. But ultimately, when you really want to run a business efficiently, you need to start looking at your costs and figuring out where the money is going, so you can decide where to optimize and make changes to your infrastructure.

Their biggest challenge was that they were in a time crunch. They needed to get to a point where they could do forecasting at the service level for their future growth projections. In order to do that, they needed a very granular view of cost across the different resources and services they were using from the cloud provider. The key benefit here is that they got a 6x ROI. So how did they achieve that?

Well, part of what we identified when doing this analysis was that the capacity the pods were requesting from the cluster was not being used very efficiently. It’s a similar situation to deploying an instance in the cloud and only using 10% of it: the pod requests a certain amount of capacity from the cluster, and then only uses ten, twenty, twenty-five percent of it. Without being able to highlight this problem and put it into dollars and say, “Hey, these are the applications where you’re spending millions of dollars that go wasted over the year,” it’s hard to bring attention to the problem and get people motivated to make a change.
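
As a back-of-the-envelope illustration of putting that waste into dollars (all numbers below are made up, not from Hulu):

```python
# Hypothetical waste estimate: what low utilization of requested capacity
# costs over a year. Figures are purely illustrative.

requested_monthly_cost = 50_000   # $ allocated to a team's pod requests per month
avg_utilization = 0.20            # pods use ~20% of what they request

wasted_per_month = requested_monthly_cost * (1 - avg_utilization)
print(f"~${wasted_per_month * 12:,.0f} per year in unused requested capacity")
# -> ~$480,000 per year
```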

So being able to actually put a dollar amount on how much you could save by making some improvements to how you’re deploying your infrastructure really drove this internal discussion and an entire change in the way the engineering team viewed their infrastructure. It went from performance and security focused to performance, security, and cost focused. You can still have performance and security, and you can pay less money to achieve them, but you need that information, and you need to get it out to the engineering teams in order for them to be able to act on it.

In fact, the person who had the vision to bring Yotascale in told me that he sees himself as sort of the DevOps enabler. He needs to enable the teams around him to make better decisions. He doesn’t want to be in the position of having to take a stick and chase people down.

Cloud Native Architectures

Jeff Harris:

So the way that we get to this (and I’m going to go into a little bit more of the technical aspect now) is that we deploy an agent into your Kubernetes cluster. We can work with any version of Kubernetes that exists out there today. That agent brings in the metrics about the pods: how much capacity they are requesting from the cluster and how much they are actually utilizing. From there, we take the information about the node each pod was deployed on, the cost of the node, the instance types that are running, savings plans, etc. We use that information to proportionally assign cost to the pod.

From there, we are able to aggregate that. The metrics that we look at today are CPU and memory; we also look at GPU and GPU memory. Storage and network are a couple of other metrics that we’re looking at adding over the next quarter, but for now, those are the key metrics.

So the cost of the instance is based on memory and CPU, and that can be weighted differently on the different instance types. We can pull all of that information, look at the utilization over time, and be able to accurately provide a cost allocation for a particular service or application that’s running in Kubernetes. That was the main part that was important to Hulu—getting down to a very specific cost analysis within the Kubernetes clusters.
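
Here is a small sketch of what weighting CPU and memory differently per instance type might look like. The per-dimension unit rates are invented for illustration; cloud providers don’t publish such a split, so any real implementation has to estimate it.

```python
# Sketch: derive CPU/memory weights for one instance type from assumed
# per-dimension unit rates. Both rates below are made-up numbers.

CPU_RATE = 0.03   # $/vCPU-hour, hypothetical
MEM_RATE = 0.004  # $/GiB-hour, hypothetical

def dimension_weights(vcpus, mem_gib):
    """Split an instance's hourly cost into CPU and memory weight fractions."""
    cpu_cost = vcpus * CPU_RATE
    mem_cost = mem_gib * MEM_RATE
    total = cpu_cost + mem_cost
    return cpu_cost / total, mem_cost / total

# m5.xlarge-style shape: 4 vCPU, 16 GiB
w_cpu, w_mem = dimension_weights(4, 16)
print(f"CPU weight {w_cpu:.2f}, memory weight {w_mem:.2f}")  # 0.65 / 0.35
# A pod's share then becomes w_cpu * cpu_share + w_mem * mem_share, rather
# than the 50/50 average used in the earlier sketch.
```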

Business Context Mapping Engine

Jeff Harris:

From there, it’s a little bit busy here, but I think it’s important to show some of the complexity that goes into actually defining what a service or an application is. All the resources in the cloud have metadata associated with them, and many of them can be tagged. What Yotascale does is look at all of the billing data: the Cost and Usage Report if you’re talking about AWS, the billing export if you’re talking about Azure. That data is very large, with many, many lines of resource data for each hour, each carrying a cost for a resource. Imagine one instance running for a day: that produces 24 data points. Expand that to thousands of instances, thousands of pods, etc., and this becomes a very, very large data problem.

Yotascale is able to take a look at all of that metadata and help make some sense of it. One of the problems we ran into when implementing this is that there are so many different spellings of tag keys out there: somebody spelled it all lowercase, somebody used one capital, somebody else put it in all caps. We still want to respect the information the engineers have put out there, so we normalize that data, combining all these different variants of the environment tag key into one tag category called environment. That allows us to reference that tag category and create a rule-based system that allocates cost to a particular business context, a hierarchy of business entities.
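
A minimal sketch of that normalization step, with made-up variant spellings:

```python
# Sketch: collapse variant tag-key spellings into one canonical category.
# The variant lists are illustrative, not Yotascale's actual mappings.

CANONICAL_KEYS = {
    "environment": {"environment", "Environment", "ENVIRONMENT", "env", "Env"},
    "team":        {"team", "Team", "TEAM", "owner-team"},
}

def normalize_tags(raw_tags):
    """Map each raw tag key onto its canonical category, keeping the value."""
    normalized = {}
    for key, value in raw_tags.items():
        for canonical, variants in CANONICAL_KEYS.items():
            if key in variants:
                normalized[canonical] = value
                break
    return normalized

print(normalize_tags({"ENVIRONMENT": "prod", "Team": "search"}))
# -> {'environment': 'prod', 'team': 'search'}
```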

In this case, we’re looking at a particular business unit, team, and service. We’re not only looking at the cloud provider tags; we can also look at things like region, account number, etc., and then layer in the Kubernetes world as well, pulling all this information together to provide that cost allocation in a very granular way across all these different resource types. What that results in is the ability to navigate these context hierarchies in the UI, get dashboards, and get notifications and alerts about anomalies, cost trends, and analysis, all delivered to individual engineering teams in Slack. You can see the cost in an almost real-time manner right inside this SaaS console.
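
And a sketch of how rule-based allocation over that normalized metadata might look; the rules, names, and account number here are hypothetical:

```python
# Sketch: map a billing line item's metadata onto a business hierarchy path
# (business unit / team / service) with ordered, first-match-wins rules.

RULES = [
    (lambda r: r.get("team") == "search" and r.get("environment") == "prod",
     ("Streaming", "Search", "search-api")),
    (lambda r: r.get("account") == "123456789012",   # account-wide fallback
     ("Streaming", "Platform", "shared-infra")),
]

def assign(resource):
    """Return the hierarchy path of the first matching rule."""
    for predicate, path in RULES:
        if predicate(resource):
            return path
    return ("Unallocated",)

line_item = {"team": "search", "environment": "prod", "account": "123456789012"}
print(assign(line_item))  # -> ('Streaming', 'Search', 'search-api')
```

Rule order matters here: the more specific team-level rule runs before the account-wide fallback, so a tagged resource lands on its team rather than in shared infrastructure.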

A lot of companies are also enabling their finance teams with this. Instead of finance having to get into the AWS or Azure console to look at the cost management dashboard, and having to be given IAM roles, security permissions, etc., we can provide a really simple-to-use SaaS product that lets these less technical teams get the cost information they need. We also have APIs: Hulu pulls this information out and provides deep links inside of its own system dashboards, so its engineers can connect directly to Yotascale from the dashboards they use today. We’re really acknowledging that adding another dashboard or another tool can be a challenge for engineers to adopt, so we want to push this information to where they live. Via APIs and integrations with things like Slack and Teams, we deliver very specific information to very specific individual teams, so that the information is relevant, useful, and actionable. That’s the important thing here.

One more quick screenshot here shows what this looks like in the dashboard. We can see a breakdown of the different business units and how much they’ve spent within the month. We get forecast data and are now able to do forecasting at any node within that hierarchy, whether at the business unit level or all the way down to the service level. We can also provide anomaly detection alerts, letting engineers at Hulu know when their cost has spiked. This information gets delivered to Slack; they can click it and be brought directly into the analysis module.

On the right-hand side, you can see some of the recommendations that are available as well. We go beyond visibility by providing recommendations to individual teams, as opposed to just saying, “Hey, here’s a list of all the recommendations that exist out there,” with nobody knowing who owns what or how to get them the information they need to make a change. With Yotascale, we’re able to get down to who owns what resources and what recommendations exist for those resources, so we can provide relevant information and actionable recommendations to the teams so they can make changes and optimize their infrastructure.
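
As a toy sketch of the forecasting idea in that dashboard, here is a naive month-end projection from the month-to-date run rate; real forecasting would also model growth and seasonality, and this is not how Yotascale’s forecasting actually works.

```python
# Naive month-end forecast: extrapolate linearly from spend so far this month.
# Purely illustrative numbers.

import calendar
from datetime import date

def forecast_month_end(spend_to_date, today):
    """Project month-end spend from the average daily run rate to date."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return (spend_to_date / today.day) * days_in_month

print(f"${forecast_month_end(42_000, date(2021, 11, 18)):,.0f}")
# $42k over 18 days -> ~$2,333/day -> ~$70,000 for the 30-day month
```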

NextRoll Case Study

Jeff Harris:

Next, I want to talk about NextRoll. NextRoll came to us looking for help in enabling the finance team, which is a little bit different from enabling the engineering team, the main focus with Hulu. At NextRoll, the challenge was that the finance team was struggling to keep up with changes happening to the infrastructure, and they needed to do forecasting and modeling. They also needed to make commitments with vendors. As you’re trying to decide how much you can commit to spending with AWS this year, or how much you can commit for your enterprise agreement with Azure this year, finance needs to know what’s going to change, what’s going to stay, and whether a spike is just temporary or something ongoing. Then they can have crisp financial models to make better commitments and make sure they’re not overcommitting, or undercommitting and not getting as good a deal as they could.

At NextRoll, they were spending a decent amount on AWS, and the focus here is on anomaly detection. What we first had to do for NextRoll was build out that hierarchy: understand the different business units and cost centers that finance thinks about, so we could apply that same hierarchy concept through the financial lens. With the hierarchy in place, we’re now able to do anomaly detection, sending anomaly alerts both to engineering and to finance within the Slack channel. Engineering is able to respond to those anomalies, letting finance know when something is going to be ongoing: not really an anomaly, but a new change to the environment, like a new cluster deployed in a new region. So it’s really facilitating communication and collaboration between finance and engineering. Finance knows what’s going on; engineering is aware of the changes that are happening and is able to annotate what those changes are and how long they expect them to last.
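
To show the shape of that kind of alerting, here is a toy detector that flags a day’s spend sitting well above a recent baseline. Yotascale’s actual anomaly detection is certainly more sophisticated than a fixed sigma threshold; this is only a sketch with invented numbers.

```python
# Toy cost anomaly check: flag spend more than k standard deviations above
# the mean of recent history.

from statistics import mean, stdev

def is_anomaly(history, today_spend, k=3.0):
    """Return True if today's spend exceeds baseline + k * sigma."""
    baseline, spread = mean(history), stdev(history)
    return today_spend > baseline + k * spread

last_two_weeks = [1200, 1180, 1250, 1210, 1190, 1230, 1205,
                  1220, 1215, 1240, 1195, 1225, 1210, 1235]
print(is_anomaly(last_two_weeks, 6200))  # -> True: worth a Slack alert
```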

Through this, we actually did catch some anomalies that were unintentional, and we alerted the engineering team before costs got out of control at the end of the month. When the engineers see these alerts, they’re able to go in and take action, shutting things down and avoiding costs they would have incurred had they not had Yotascale in place. Otherwise, they wouldn’t have found these things until the end of the month or year, when they’d be looking at the bill and trying to understand it, going through the exercise of chasing it down, digging back into the history and the logs from weeks ago, and trying to make sense of what happened. Yotascale highlights what those problems are.

Here’s a quick screenshot. Down in the bottom left, we can see what that alert looks like in Slack. We get a link in that Slack message that brings us directly into the anomaly detection module, where we can get more details about the anomaly. In this case, we see that a Redshift node was deployed and ended up costing approximately $5,000 more than normal. If you drill into the details, you can see the resource ID and any other tags associated with it, all in an effort to summarize the data we have and present it to the engineers so they can decide whether or not this is an anomaly.

Designed for Modern Architectures at Scale

Jeff Harris:

In summary, we discussed some specific customer use cases today, and I want to tie this all back to the three pillars we focus on here at Yotascale: observability, predictability, and efficiency.

From an observability perspective, what we’re really trying to do is provide that view through the lens that makes sense to your business, so that when your CEO or CFO asks questions in the context of the cost centers they speak about and the engineering teams they understand, you can provide really crisp, clean data that lets them understand what costs are, how costs are trending, and what’s changing. Some of our B2B customers are actually looking at a cost of goods sold for the software they provide their users and doing this sort of margin analysis. So that observability is incredibly important.

Once you have that observability, you can get to predictability. Once you’ve been able to section out which costs are going to which parts of the business, you can start doing forecasting. We see this with companies like Hulu, which is really trying to get down to the specific cost of particular services: look at the number of hours of video viewed that were driving a service, then make predictions based on the user base and how much they think they’ll end up spending, factoring in their usual growth estimates as well. That predictability is incredibly important for running a business efficiently.

And then it gets to the optimization and efficiency part. OK, I’ve deployed all this stuff into the cloud, but am I running efficiently? Are there changes I can make that would help recoup costs, that would let me take dollars back from the cloud provider and put them back into my business, investing in new employees or other products or opportunities that may be out there?

So those are the three pillars: observability, which we talked about with the anomaly detection piece; predictability, with our budget and forecasting capabilities; and efficiency, with the recommendations and even partly anomaly detection there as well.

Empowering Engineering To Make Smarter Decisions

Jeff Harris:

A few key things that I’d like to leave you with as you think about Yotascale and how we are really different. We are focused on automated cost attribution at scale. When we’re talking about companies like Hulu and NextRoll that have a ton of infrastructure running out there, we have to process a lot of data to get to that level. We also allow teams to call APIs to manage the context hierarchy, so as the business changes, the hierarchy within Yotascale changes as well, and that helps you maintain cost accuracy.
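
Purely as a hypothetical illustration of driving the hierarchy from an API as teams are created or reorganized (the endpoint, payload shape, and auth below are invented for this sketch and are not Yotascale’s documented API):

```python
# Hypothetical sketch: register a new hierarchy node over a REST API so that
# matching spend is routed to it. URL, fields, and token are placeholders.

import json
from urllib import request

payload = {
    "parent": "Streaming/Platform",
    "node": "edge-caching",                      # new team spun up this quarter
    "rules": [{"tag": "team", "equals": "edge"}],
}

req = request.Request(
    "https://api.example.com/v1/hierarchy/nodes",   # placeholder URL
    data=json.dumps(payload).encode(),
    headers={"Authorization": "Bearer <token>",
             "Content-Type": "application/json"},
    method="POST",
)
# request.urlopen(req)  # would create the node and route matching spend to it
```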

The other piece that helps with cost accuracy is our tagging module and ability to normalize the data that’s coming in from the tags so that we can make sense of it. It doesn’t have to be perfect. It doesn’t have to be all the same case, all the same spelling, all have the right number of dashes or underscores. We can make sense of all of that mess and help you get to a much more accurate cost attribution model.

Another differentiator for us is our deep technical support for things like Kubernetes and containers, and this is going to expand into other shared costs. Kubernetes was the first choice for tackling the shared cost problem because there’s a ton of data available for it. It’s a complex problem, but we can get the data pretty easily, which allows us to provide this very granular cost allocation all the way down to the microservice.

The final piece I’ll mention as a differentiator is our recommendation engine. We didn’t talk too much about recommendations today, but we provide recommendations around EC2, RDS, and savings plans or reservations as well. Being able to analyze all this data, surface cost-saving opportunities, get them to the right teams, and allow those individuals to provide their own feedback is crucial. We are running algorithms on this data, so we don’t have the full context, and we really need the engineering team to buy in. They buy in more when they have the ability to come and say, “No, this recommendation is not valid. We’re going to stay on these m5.4xlarges for X reason.” That information gets stored in Yotascale, and now the central teams and the DevOps team upstream are all able to review it and even push it back to the engineering team, saying, “This one actually looks like a good opportunity, and we’d like to take it.” So we’re facilitating collaboration around the organization about cost optimization and cost management.

We also can get started really, really fast. It takes about 30 minutes to get connected to the billing data, and then we process it and provide cost analysis as far back as you have billing data. Typically, we do a two-week evaluation if anybody is interested; I’d be happy to provide more information if you’d like to reach out.

Again, just highlighting some of the customers that we have today, we talked about Hulu and NextRoll, but we’ve got a lot of great stories. You can find more about these customers on our website. Thank you for your time, and we appreciate it. Happy to answer your questions now.