Case Study: How Okta Gained Visibility and Control Over Total Platform Costs

Beyond the Numbers: Strategic Cloud Cost Management the Okta Way

Managing cloud costs efficiently is a top priority for every company today. At a time when utilization of cloud infrastructure is increasingly strategic and central to a company’s operations, managing cloud cost, whether from direct infrastructure usage or passive consumption of foundational applications like cloud databases, is more important than ever. Okta software architect William Howell discusses his company’s journey in understanding its cloud usage patterns and optimizing its spend while delivering a world-class security service with Yotascale CEO Asim Razzaq.

Dive into Okta’s firsthand experience with cloud cost management, utilizing the power of Yotascale. Explore how Okta moves beyond just reducing costs to strategically allocating resources for maximum efficiency and impact. From balancing API tests to understanding the nuances of focusing on key areas, learn from Okta’s approach to making data-driven decisions that yield sustained savings. This session offers insights into the real-world challenges faced, solutions implemented, and the benefits reaped.

Join us to glean from Okta’s journey and learn how to spend smartly and effectively in the realm of cloud management.

Length: 58 mins

Webinar Transcript

Welcome

Asim Razzaq:

Alright, we’re at the 10:00 AM Pacific mark. Welcome, everybody. Really excited to welcome you to our webinar “Beyond the Numbers: Strategic Cloud Cost Management, the Okta Way.” I’m going to be joined by Will Howell, who is a lead software architect at Okta. Really looking forward to this discussion. Will, given that I’m a recovering engineer, I love talking to other engineers like you, and we’ll get into intros and the meat of the discussion shortly.

A couple of housekeeping items: You will see a Q&A access for you guys at the bottom or wherever your webinar settings are, so please feel free to have the questions come in along the way. We will definitely take time towards the end to answer as many of these questions as we possibly can.

With that out of the way, I’d like to do a quick icebreaker. Here at Yotascale and I’m sure at Okta, we love diversity. So, if folks can in the chat, put in which city they are joining from and if necessary, the country, that would be good as well. We can take a minute and see where folks are joining in from. I’m here in the San Francisco Bay Area and Will is in Seattle. Why don’t we take a minute and do that?

Alright. Any takers on which city? Yes, someone raised their hand. If you have a question, you can put it in the chat. Ah, OK. Looks like the chat is disabled. Sorry about the technical difficulty. I am seeing… OK.

William Howell:

I’d be more scared if this went off perfectly.

Asim Razzaq:

That’s right. Put it in the Q&A. It looks like the Q&A is working. Put it in there. I’m sure Will, every single build that you worked on always works perfectly.

William Howell:

Right, every time.

Asim Razzaq:

Zero errors. Alright, we got answers rolling in here: Princeton, NJ; Montreal, Canada; Seattle, WA; San Francisco Bay Area. Cool. And I think Kevin, you can pipe in from the live stream. We’ll call out some of these along the way.

Speaker Introductions

Asim Razzaq:

OK. I think with that, we will dive right into our discussion here with Will. Just by way of introductions, I’m Asim Razzaq. I’m the co-founder and CEO of Yotascale. As I was mentioning, I am a recovering engineer. Prior to starting Yotascale, which does cloud cost management, I was the head of platform engineering at PayPal and eBay. Most recently, when I was there, everything that had to do with login or payments in terms of APIs was under the purview of my teams. I joined PayPal a number of years ago to help with the third wave of growth, which was the developer platforms. As you guys know, PayPal went through consumer, merchant, and developer was the third wave of growth. I had the opportunity to build all the infrastructure from the ground up. We scaled it to multi-billion dollars in payments volume and I think that’s where I really appreciated how important developer productivity is. Enabling developers is key and critical. I had other roles at PayPal, including running big data, some of the AI stuff when it was all the data between PayPal, eBay, and Skype. My background has primarily been what is now known as platform engineering. It has had different names along the way, like any piece of technology. Yotascale is really an outcome of my love-hate relationship with the public cloud and increasingly SaaS infrastructure, which is all OpEx, subscription-based. We are quite lucky and grateful to have customers like Okta, Hulu, and others. They have some of the best engineering teams in the world doing amazing things. So, Will, I’ll pass it on to you to introduce yourself.

William Howell:

Sure. As Asim says, I’m Will Howell. My background is primarily in security and threat analysis. I got into build and deployment as part of my work around that. I’ve worked at a number of places, including HP, McAfee, Amazon, and Rackspace, among others. I primarily worked on threat intelligence and automated remediation of malware through various means. That grew into needing a pipeline to build this stuff, and looking at how source is built and the tools around that, as well as some of the egregious horrors that go into making modern software. Somebody thought I could do well in this space, so I moved into build and deploy systems. I ran AWS and Amazon.com’s internal build and deploy system for a while, working on a few projects like ECR and CodePipeline along the way. Eventually, I moved to Okta, where I’ve been helping the team build out common infrastructure for build, test, and package. That’s how we ended up here today.

Asim Razzaq:

Alright, well, it looks like you know a thing or two about scale and build and deploy here. Excellent. I think I’ll let you introduce Okta, just so we have the background and context. Of course, the nuts and bolts of this are going to be cloud cost management, which we’ll get into shortly.

Okta Background

William Howell:

Yeah, so hopefully most people are familiar with Okta. We’re an identity and access management company, traditionally defined as the broker for security software, including authentication and authorization, which rely heavily on identity. We’ve added access management onto that. Our rise to prominence was thanks to a tiny virus of the human variety, COVID-19. When COVID-19 was in full swing and people were being asked to stay home and reduce travel, a lot of companies were trying to figure out business continuity. Our product was very well-positioned to help out at that time. One thing I love about Okta is we focus on the good we can do for our communities, not just our customers. During COVID, Okta did a lot of work with small and medium businesses to help them get their workforces up and running without necessarily having to go through large-scale, costly changes. They felt it was more important to get people back to work and able to do their jobs. Okta continues to do a lot of work in this area now. Something that really resonated with me when I worked in the security space is helping people who are in a situation that’s obviously not the greatest place to be in. Nobody wants to be at the end of the phone call talking about how their hard drive just got encrypted or their money just got stolen. So I really love when companies give back.

We’re currently on the rise as far as valuation goes. We picked up another company called Auth0 and are working on integrating their products. We offer a wide range of ways to plug into your business and facilitate the right things happening at the right time.

Asim Razzaq:

That’s great context and background, Will. I think with that, maybe you can quickly go through the different capabilities that Okta provides. This is important because when you talk about cloud cost management, one big challenge is complexity. There are many different services, groups, business units, and product lines. This will segue pretty well into the rest of the discussion.

Okta Platform Capabilities

William Howell:

Yeah, so at Okta, we want to empower anyone to safely use technology at any time. A key component of that is being where your consumers are. For administrators, business owners, and those kinds of people, we offer single sign-on and universal identity plugins for all your favorite platforms, making that part of the process seamless. But that’s only helpful when setting up an existing business, not your own custom tooling and offering that out to your customers in a secure way. Beyond that, we manage the whole process of identity. We connect with your identity management systems, give your developers access to our platform for extending it, for getting data in and out of it, and making it something you centrally have everywhere. This goes to our mission of making you one common identity and approach to security, and then pushing that out as far as possible in terms of devices, applications, and languages.

Asim Razzaq:

Just wanted to give a shout out. We are users of Auth0 ourselves and we love the single sign-on aspects for our customers because it just makes it very easy. You have the out-of-the-box integration and don’t have to worry about proprietary systems. It’s effectively becoming a de facto standard for a lot of identity solutions. That’s great to see.

Journey to Predictable Cloud Cost Management

Asim Razzaq:

OK, so I think this is the meat and potatoes of the discussion. As we’ve been talking about leading up to the webinar, one interesting thing about Okta is this concept of the journey to predictable cloud cost management. You and I were discussing how everyone wants cloud cost management to be predictable and efficiently predictable, because you don’t want to burn a ton of money. Talk to us about what this means and then we can get into where you guys are, how you got here, and where you want to go. I think that will be very valuable to our audience.

William Howell:

Predictable cloud cost is an interesting conversation because we end up having this conversation around saving money. Like almost all budgetary discussions really come down to this core question of how do we save money. But if you start looking at cost data as historic data and you’re looking at it from the lens of trends, you start noticing that companies that do a really good job at this don’t shoot for the bottom dollar; they shoot for predictability. Do I know how much money I’m going to spend next month based on how much I spent last month?

I remember being early in my career, reading a story about why Alaska Airlines has done as well as it has. A few years ago, there was a huge crunch on kerosene, which is what they use for jet fuel, and a lot of companies struggled to find affordable jet fuel because the cost had skyrocketed. But Alaska had sat down and analyzed the data for their own business, and they had actually made a deal prior to this happening to keep the gas cost at a fixed amount. So, when gas went from $10 a gallon to $100 a gallon, they were still paying $11 a gallon. They could predict their growth and operations costs. I always took away from that story that it’s not so much about how much you spend because how much you spend is a factor of your growth and inefficiencies in your system. It’s about being able to predictably and reliably know how much you’re going to spend so you can make reasoned and informed decisions about making it more efficient.

That really is what sort of drives our journey around cost management at Okta. We’re looking for predictable and reliable spend so that we can make reasoned and informed decisions about reducing costs.

Asim Razzaq:

So, I mean, that’s an interesting mindset shift, right, because a lot of times in cloud cost management, the prevalent theory is like, well, let’s just save money every which way that we can. And I think what I’m hearing you say is that it’s not really about that. It is really about how do we predict the demand and how do we make sure the supply is matching that, right? It’s a supply-demand mismatch, which is the bigger problem there…

William Howell:

Yes. Microeconomics. It’s microeconomics.

Asim Razzaq:

That’s right. And I think that frame is important, right, because otherwise you could be, as they say, chasing your tail, trying to do all sorts of weird economic theory-type things, but at the end of the day, this is what matters. But the problem, the reason why that happens is because people aren’t able to justify, right? When there’s no justification, then there is a top-down mandate that comes in to say, well, just save across the board 10%, 15%, 30%, right? That’s kind of the problem. So, talk to us about what did it look like before leveraging Yotascale? What was the situation, what were you guys doing, and how has that changed over time where you are in the journey?

William Howell:

Yeah, so I think we did a lot of what anybody does, right? We took the provider tools and looked at what they offered for cost analysis. When it comes to AWS, we had the CUR, we had cost analysis through the console, billing reports, and a number of ways to represent the same piece of data. But we didn’t have a very common language about what that piece of data represented. For instance, I would look at our raw spend at the start and our raw spend at the end, and use the difference as how much money we saved, but finance would be using net amortized costs. They wouldn’t see any actual changes because of discount programs and all of that. So we ended up having these very frustrating conversations: I can show you that we’ve saved $10,000, and they’re like, you haven’t, because the bill didn’t change by $10,000.
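The discrepancy Will describes can be made concrete with a small sketch. This is purely illustrative, assuming a flat 30% discount applied to the amortized view; real CUR cost columns are more nuanced than a single multiplier.

```python
# Hypothetical illustration: the same usage change reported two ways.
# A flat 30% discount stands in for EDP/savings-plan effects; real
# AWS Cost and Usage Report (CUR) accounting is more complicated.

def raw_spend(usage_hours, on_demand_rate):
    """Undiscounted 'list price' view engineering often quotes."""
    return usage_hours * on_demand_rate

def net_amortized(usage_hours, on_demand_rate, discount=0.30):
    """Post-discount view finance sees on the actual bill."""
    return raw_spend(usage_hours, on_demand_rate) * (1 - discount)

# Engineering trims usage from 10,000 to 9,000 hours at $0.10/hour.
raw_delta = raw_spend(10_000, 0.10) - raw_spend(9_000, 0.10)
bill_delta = net_amortized(10_000, 0.10) - net_amortized(9_000, 0.10)
# Engineering claims $100 saved; the bill only moved by about $70.
```

Both parties are describing the same change; they are just reading different columns, which is why agreeing on one shared view of the data mattered so much.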

So, one of the things we were looking at when we started our journey of trying to pick a vendor around cost management was we needed someone who understood that there were people who had different views on this data, different ways they wanted to rationalize it, different places they were coming from as far as what this number meant to them. All the other stuff like APIs and the ability to categorize were important too, but really we wanted to start having a way of having a conversation with everyone involved about where our money was actually going. And I was very clear with leadership and the architecture team that we’re not bringing this tool in to save us money. On the whole, this tool will actually cost us more money, right, because we still have the problems we have today, and now we’re also going to be paying Yotascale. But what is important here is that this tool facilitates us being able to save a lot more money than we could without it.

And so, that gets into this sort of idea of predicting out our cost because I have to be able to answer the question of when will this start paying for itself or have paid itself off. That goes a lot into, I think, what Yotascale can do, and that’s how it summarizes the data and gives us the ability to break it out by stuff. But really, I think the most interesting part internally has been, there are a lot of teams now that are coming to us and saying, “Hey, we’ve heard you have Yotascale. Can you show it to us?” And we do that. And they’re like, “Oh, it’s pretty much our tool.” And then we show them that it’s not about the tool, it’s about how we set it up and what we can do with that.

We’re in the process now of finalizing the decision to move most of the teams inside of Okta to this tool or a very similar version of what we have set up. The idea of having a common language to discuss finance, cost savings and reduction plans is one that really resonates with people because there’s not a barrier to entry. We’re not talking about, well, m5.2xlarge instances have this CPU time and cost this much, but they’re only in this region and we run our… It’s just, this is how much we spend. If you want to drill in and see why we’re spending that amount of money, you certainly can. Or if you want to stay up at the high level, it’s the same discussion. Somebody can come in and drag the slider to the left by 10% and see how much money we’d have to save, and where, and what is going to be impacted the most by that change.

Asim Razzaq:

Do you have any specific examples of this common language and how you have been able to democratize the data? A couple of examples would be very helpful because we run into prospects and potential customers where one of the people in engineering translates between finance and engineering. That initial frustration you mentioned is pretty apparent because the person in the middle from engineering doesn’t want their teams to be just sitting there doing what we call a CSI crime scene investigation in the rearview mirror phenomenon.

William Howell:

I’ve got two examples. One of them is kind of a gimme. So as we were doing this and we were starting to put these budgets together, obviously we don’t necessarily know if the budget is realistic. That takes time, but it also takes involvement from other people. So we were sitting down and we were looking at the process and we were saying, “OK, it costs this much money.” We’ll just use $100 because it’s easy. It costs $100 to run a build-test-package cycle. Everyone’s like, “OK, but what does that actually mean?” So we started digging in further and we were like, well, as all engineers do, it’s not one number, it varies, it moves around, it does all of this stuff. But someone asks the question, they’re like, “Well, why does it vary? I mean, what is actually different?”

There’s some really key things like if it aborts early and stuff like that. But we started noticing that we had some tests that were just more expensive than other tests to run. We found those tests were picking a very old architecture to run on at Amazon and it was 20% more expensive to use that architecture. So we asked the teams, “Why are you running on this old architecture?” and they said, “Well, it works and we’ve never had to change it.” So we bumped those instance sizes up, the time goes down, compute cost goes down, and now it costs less to run the same test.

Asim Razzaq:

Can I? I hate to interrupt, but can we back up a little bit? Because what you just said is actually quite profound: you were able to get to cost at a test level, or a build or pipeline level. How did you achieve that? A lot of customers are not able to do that, so how did Yotascale help you get to that point? Because that is the crux of the issue: you don’t even know the unit economics of something like that.

William Howell:

We started where everybody starts: go out into your infrastructure, tag everything you can possibly tag, and start dumping those costs in. Then we sat down with the two key players involved: the platform engineering services and developer productivity group that owns the actual infrastructure, and the teams that do various build, test, and package cycles.

With the developer productivity team, we had them instrument everything with whether it’s shared or whether it’s custom for this one test or build. Then we pump all of that data into our own database, but we also pump that tag and hierarchy information into Yotascale to track costs. The CUR only comes out once a day, so we get back that our SQS queues cost us $100 to run yesterday, and then we go back into our own data and we say, “Well, these fifty jobs were in queue at that time, and because of the timestamps for check-in and the logs, we know how long each one of those jobs was in the queue,” and we break that percentage down and assign the cost out to it.
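The queue-time breakdown Will describes amounts to proportional allocation of one shared daily line item. Here is a minimal sketch of that idea; the job names, durations, and dollar figures are hypothetical stand-ins, not Okta's actual data.

```python
# Sketch of shared-cost allocation: split one daily CUR line item
# (e.g. the SQS queue's $100) across jobs in proportion to how long
# each job sat in the queue that day.

def allocate_shared_cost(total_cost, queue_minutes_by_job):
    """Return each job's share of the shared resource's daily cost."""
    total_minutes = sum(queue_minutes_by_job.values())
    return {
        job: total_cost * minutes / total_minutes
        for job, minutes in queue_minutes_by_job.items()
    }

costs = allocate_shared_cost(
    100.0, {"build-a": 30, "build-b": 60, "build-c": 10}
)
# build-b waited 60 of 100 total queue-minutes, so it carries $60.
```

The key property is that the per-job shares always sum back to the billed total, so the roll-up reconciles with the CUR.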

For individual tests, when we launched the test run, we tag and annotate it with the information that we need to track it down to an individual commit or actual test. One of the tags we push to AWS is the SHA of the change, so we can then backdate that and say, “OK, this instance had this SHA tag on it and was run for this long and therefore cost this much.” We’re actually in the process of working down to annotate and say, “OK, this actual resource or this container running on that workload costs this much.” But that’s kind of a lower-level, more for engineering thing. But what it allowed us to do is start doing statistical summarization of the data for finance. So as engineers, we would argue out, “Well, test X is better than test Y,” common engineering things. But finance was really just interested in, “Wow, you ran 100 yesterday and that cost us $100, and you ran 50 today and that still cost us $100, what’s going on there?” We were able to go and drill in and go, “You know, that’s not a really interesting question, but it led us to a bunch of really interesting questions.” And the net net is we now save 20% because we’re doing something different, right? And finance doesn’t care; they’re just like, “Great, 20% less is 20% less.”
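The SHA-tagging approach above can be sketched as a simple roll-up over tagged instance records. The field names, rates, and SHAs here are illustrative, not Okta's actual tagging schema.

```python
# Hedged sketch of tag-based attribution: each instance record carries
# the commit SHA it was launched for, so cost rolls up per commit.
from collections import defaultdict

def cost_per_commit(instance_records):
    """Sum (hours x hourly rate) for every record sharing a SHA tag."""
    totals = defaultdict(float)
    for rec in instance_records:
        totals[rec["sha"]] += rec["hours"] * rec["hourly_rate"]
    return dict(totals)

records = [
    {"sha": "ab12", "hours": 2.0, "hourly_rate": 0.40},
    {"sha": "ab12", "hours": 1.0, "hourly_rate": 0.40},
    {"sha": "cd34", "hours": 3.0, "hourly_rate": 0.10},
]
totals = cost_per_commit(records)
# ab12 cost roughly $1.20 across two instances; cd34 roughly $0.30.
```

Once every record is keyed by SHA, the same data supports both the engineering view (per-test, per-container) and the statistical summaries finance cares about.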

The second example is more complex. To give you an idea of scale here, a build and test cycle for our core product takes about 6,000 hosts and runs over a 2 to 4 hour period. In that time, we run around 200,000 tests. You can imagine this is a really massive landscape, and no one person can realistically rationalize it. So, the question started being asked: what is the contribution of those 200,000 tests? What are they actually doing? We didn’t have good insight into that.

One of the things we can do using Yotascale is look at how long and how expensive these tests are to run. We start categorizing tests by things like the instance size required to run the test or the amount of time it’s actually on the CPU. We give these as dimensions to sort the data for our users. We’ve started identifying tests that are more beneficial to run first because they are high yield in terms of the data we get out of them, but they’re also very cheap to run. Small instances run very fast, so we are now building on that to say these are the tests we will run first because they’re highly indicative and cheap. Then there are the more expensive ones.
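The "high yield, cheap to run first" ordering Will describes is essentially a sort by signal-per-dollar. A minimal sketch, with hypothetical yield scores standing in for how often each test catches real failures:

```python
# Order tests by yield/cost ratio so cheap, highly indicative tests
# run before the expensive ones. Yields and costs are illustrative.

def order_by_value(tests):
    """Sort tests by yield-per-dollar, highest first."""
    return sorted(tests, key=lambda t: t["yield"] / t["cost"], reverse=True)

tests = [
    {"name": "full-integration", "yield": 0.9, "cost": 12.00},
    {"name": "unit-core",        "yield": 0.8, "cost": 0.05},
    {"name": "smoke",            "yield": 0.5, "cost": 0.02},
]
plan = order_by_value(tests)
# The cheap smoke and unit tests run first; the integration suite,
# despite its higher absolute yield, runs last on value-per-dollar.
```

This is also why the dimensions matter: instance size and CPU time feed the cost denominator, so the ranking changes as the fleet changes.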

As you know, lots of people talk about how to prioritize in a queue. You get a bunch of work, you go through it, and you say this is critical to run. But what do you do with what you’ve decided not to run? We have an effort internally called peak shaving. We look at the cost peaks in AWS and try to find times when running workloads is justified by our cost savings plans, and run workloads then. We do this by taking capacity off these peaks and not forcing everything to run at once.

This leads to an effort we call economy mode. When you submit a test (and we’re working on automating this), you can say, “Hey, I’m about to leave to go to lunch; this is not critical to run right now,” and it’ll get scheduled at a lower priority. The calculation of that lower priority is based on cost and some other dimensions, not just on what modules you touched. It allows teams to tune this dial, saying, “I’ve been asked to hit a budget, and my budget is blocked out like this.”
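An "economy mode" priority can be sketched as a cost-aware priority queue. The scoring rule below (cost as the score, with a large negative offset for urgent work) is an illustrative assumption, not Okta's actual scheduler.

```python
# Sketch of economy-mode scheduling: lower score runs sooner, so
# economy jobs sink behind urgent ones, and cheap work runs before
# expensive work within a tier. Weights are illustrative.
import heapq

def submit(queue, job, urgent=False):
    """Push a job with a cost-derived priority score."""
    score = job["est_cost"] if not urgent else -1_000 + job["est_cost"]
    heapq.heappush(queue, (score, job["name"]))

queue = []
submit(queue, {"name": "pre-lunch-tests", "est_cost": 40.0})   # economy
submit(queue, {"name": "release-blocker", "est_cost": 90.0}, urgent=True)
submit(queue, {"name": "cheap-lint", "est_cost": 2.0})

order = [heapq.heappop(queue)[1] for _ in range(len(queue))]
# Urgent work jumps the queue even though it costs more; among the
# rest, cheaper jobs drain first.
```

A real implementation would fold in more dimensions (deadline, modules touched, budget remaining), but the trade-off Will describes is the same: urgency buys queue position at a visible cost.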

I always say this: when we can tie budgeting to common-sense things that humans do every day, it makes a lot more sense. For example, lots of places I have lived have on-peak and off-peak charges for power. You don’t want to charge your Tesla in the middle of the day because it’s hugely expensive, so you do it at night. On the flip side, you can’t run your air conditioner only at night to cool it off during the day. You can do a little bit of that, but you have to run your air conditioner during the day.

We need to know if our air conditioner is efficient. How much is it going to cost us to run it during the day? Then we can figure out, well, maybe we only run it for the very hottest part of the day and we sort of suffer during the other two-hour periods.

Asim Razzaq:

Right, it’s taking all the constraints into account, knowing that certain things need to be run at peak times and they are going to cost more. So, maybe you can make it somewhat efficient, and that conversation back to finance is very productive, right? Instead of saying, “Well, yeah, that’s what it costs, just leave me alone,” which isn’t an explanation finance will accept. It’s a much smarter and more intelligent conversation. It’s like saying, “Hey, we have nothing to hide here. Here’s all the data. We can make a decision.” As you and I were talking about, it’s all about trade-offs at the end of the day—there’s no purist approach here.

William Howell:

Yeah, I think that comes down to the crux of it. It’s taken me a while in my career to get to this point, but the conversation is where the magic happens. Yes, there’s lots of cool software being built out there, and I can slap the word “AI” in front of anything I build and make it even more magical. But it’s about sitting down and recognizing that the finance guy across the table from me is super frustrated by the answers I’m giving him and stepping back and saying, “Well, what would work for you as an answer?” and then looking at how we can get there.

Once we start doing that, these budget conversations go from grueling hours down to minutes, and in many cases, down to automated publishing of spreadsheets where everybody just knows. Now, there is still human endeavor needed to keep reducing costs, right, because we’re talking about predictability and cost. But you still need to sort of curve fit and say “we want to lower this.” I think having a more rational discussion about what we can do, instead of what we have to do, ends up making everybody happier.

Asim Razzaq:

Yeah, and the third bullet here is pretty key, right? Making sure that people understand that every engineering decision is a budget decision is crucial in the times we live in. So, if you have the data and the understanding, then you can make much smarter decisions.

Okta Platform Portfolio Management

Asim Razzaq:

This helps segue into the concept of portfolio management. I was in a conversation with somebody, and they basically said, “Well, if you can get the EDP discount from AWS, why do you need cost management?” Sometimes that’s a prevalent mindset, and people can say that about other economic things like spot instances or reserved instances. You’ve gotten the deal done, but the point is, if you’re still spending $10 million a year—or in some cases, a month, and for some companies, a week—you have to manage the portfolio, right? Who gets what money and based on what? You can’t answer that question until you answer the question of where the money is going.

So, talk to us about how you think about portfolio management and where you want to go from here.

William Howell:

One of the most valuable conversations we’ve had internally was sitting down with people and explaining how AWS costs actually work. Because, as you said right off the bat, somebody’s like, “Well, we negotiated an EDP discount, so we’re getting the cheapest amount of spend we can get, right?” And it’s sort of, you know, it’s always sort of. But when we actually showed them a graph with a red line on it that said, “This is what we pay. This is what we guarantee Amazon we will pay every month,” they saw that red line and focused on the peaks. They thought that where we were going over meant we needed to move the red line up. I said, “No, you actually need to look at the valleys because the valleys are where we’re giving AWS money and not doing anything. These instance minutes are just sitting there because we paid for them, but we’re not using them.”
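The red-line analysis Will walks through can be summarized in a few lines: usage below the commit line is money paid for idle capacity, and usage above it is overage. The hourly figures here are hypothetical.

```python
# Illustrative "red line" analysis: hourly usage vs. a fixed committed
# spend. Valleys (usage below the commit) are paid-for idle capacity;
# peaks (usage above it) are billed on top of the commitment.

def commit_analysis(hourly_usage, committed):
    """Return (idle commitment in valleys, overage above peaks)."""
    wasted = sum(max(committed - u, 0) for u in hourly_usage)
    overage = sum(max(u - committed, 0) for u in hourly_usage)
    return wasted, overage

usage = [40, 55, 120, 130, 35, 50]   # hypothetical $/hour of usage
wasted, overage = commit_analysis(usage, committed=80)
# wasted = 40 + 25 + 45 + 30 = $140 of idle commitment;
# overage = 40 + 50 = $90 above the line.
```

The intuition from the transcript falls out directly: with $140 idle in the valleys and only $90 of overage, shifting work off the peaks into the valleys is worth more than raising the commit.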

What we actually need to do is flatten these peaks down. When we started talking about that, we got into portfolio management. How in the world are we going to flatten these down? Because you kick off a build, start a test run, reserve instance minutes, and at a macro level, like at a budgeting level, that’s exactly what we do. But we have a lot of choice as to when and what we run, where, and how we conceptualize something being started. Again, we’re talking about something that’s running for 120 to 240 minutes. So do we need to do a bunch of upfront reservations, or can we let those go? The answer is not always obvious. Sometimes it’s actually better for us if we do upfront reservations rather than let it slide for a while.

Does it make sense to queue every single test into an SQS queue as a message, knowing that 10% of those are going to fail? Or does it make more sense to only send off messages when we can? When we get into priority, especially when we couple that with a portfolio, we start getting into, “Well, how much work could we do upfront to ensure that what we’re gonna run actually needs to run?” The easy answer is to cut out tests for things that don’t run, but what if we looked at this slightly differently and said, “I don’t need to run any of these tests for these kinds of builds.” When it’s just the developer kicking in commits and they just want a fast feedback loop, then if I knew how to prioritize the tests, I wouldn’t have to run any of that.

From there, we move into talking about, “What if we assigned a fleet?” What if we assigned a certain number of EC2 instances to that specific task? Even though they’re reserved out of the general fleet, they still start optimizing and controlling the cost around this and giving people an indicator of whether or not we’re over-scheduled for particular functions in the system. Instead of saying, “We only have 6,000 instances and they’re allocated like this,” we can say, “Hey, the queue for rapid tests is full. If you submit now, it’s going to take four hours to run. So maybe keep adding commits until this goes down.” Or if you really need it, add the priority task flag and know that it’s going to charge you more, but you’ll get bumped to the front of the queue.

I think people really begin to understand that when talking about trade-offs. In a system where it’s just running, you don’t know how to make trade-offs. You just submit your build, it runs, and it tells you what happened, even though you don’t necessarily need that. You only needed this one thing that happened. This drives us towards the conversation we’re now having about how to make the developer environment seamless with the CI/CD environment as far as resources go.

We’re talking about buying bigger and bigger laptops for developers as things get more complex. But what if I could host most of those resources for you and be cheaper across the board and also more secure, able to share those, and do other kinds of stuff? It’s driving these other conversations to start thinking about how we manage this like a portfolio. How do you manage this as a set of common resources that people need and that cost us but also bring us value?

Asim Razzaq:

Yeah, I mean, that trade-off part is pretty key, right? As you said, we kind of get this sometimes as well. People say, “Well, I’m pretty efficient because 90% of my whole fleet is reserved instances.” Well, OK, but is that the right thing to do? How much is it being utilized? Just because that’s a deal you struck doesn’t necessarily mean it is a high-yield deal when it comes to efficiencies. So, I think that makes a ton of sense, and I love the active approach of ensuring that you’re taking money from areas where you have capacity and putting it into areas that need capacity. I think it’s the ongoing piece of this.

Okta’s Results using Yotascale

Asim Razzaq:

So, with that, this was pretty staggering for us to look at, given the short amount of time you guys have been doing this. Walk us through the savings of 25% in spend, and then more importantly, I know you guys had early success in reducing build time and test time. I think that’s impressive. Because back to our original point, a lot of people think about cloud cost management as just saving money somehow, right? But they don’t think about it in terms of developer productivity, or efficiency in other ways that aren’t just a bottom-line “Hey, we saved a bunch of money.” So, talk to us about these.

William Howell:

Yeah, so when I took over as architect for the team, we were spending an amount of money that I was staggered by, based on what I was being told we were able to do with that money. So, we started looking into it with the help of the really excellent engineering services team that we have internally. We began breaking down where our money went and all of that, and it was kind of a long journey.

I think, as with everybody, we came out of the coronavirus crisis and walked right into a recession, right into this massive cost-cutting. We essentially had one of those very early meetings with finance where they were told, and we were told, “25% reduction in spend. Thank you very much,” and they walked out of it. No discussion—you got it, you got to do this.

That was made even more complicated by the fact that we were lumped together with a number of other teams. So, even if I individually saved 25%, it was about the group saving.

Asim Razzaq:

So your group might have to save even more.

William Howell:

Yes, and that was the conversation we had at many points. As we were going through this two-quarter effort, we would get a report and find out that a certain group was not going to be able to save money, so we were told we needed to save even more money. Myself and the rest of the team sat down and said, “OK, how are we actually going to get here?”

There were obviously straightforward, first-day kinds of tasks, like housekeeping. Let’s get rid of all this cruft that’s just sitting around costing us money and doing nothing for us—unattached drives, instances that are running but doing nothing. All of that. And that got us a good chunk of savings along the way, but it was a one-time thing. We weren’t going to get those savings again.

So, we started talking about the second pass: what is actually going to save us money in the long term? The first thing we hit on was doing less work whenever we have to do work. If we can cut out CPU cycles, then we can start saving money. One of the things we realized is that some of our instances were undersized, so they were OOMing, and that would just extend the amount of compute. They’d run 50% of their tests, then OOM, start back up, and run that same 50% again. So, we were spending 150% on what should have been 100%. By bumping those instance sizes up, the time goes down, and compute costs go down.
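A back-of-the-envelope sketch of the OOM-retry arithmetic Will describes might look like the following. The function and its numbers are hypothetical, purely to illustrate the 150%-for-100% effect:

```python
# Illustrative model of the OOM-retry cost described above.
# All numbers are hypothetical; the point is the arithmetic.

def effective_compute(run_fraction_before_oom: float, retries: int) -> float:
    """Total compute consumed, as a multiple of a clean run, when a job
    runs part of its tests, OOMs, restarts, and finally completes."""
    # Each failed attempt burns `run_fraction_before_oom` of a full run,
    # then the final attempt completes the whole run.
    return retries * run_fraction_before_oom + 1.0

# Undersized instance: runs 50% of the tests, OOMs once, restarts.
print(effective_compute(0.5, retries=1))   # 1.5 -> "spending 150% on what should be 100%"

# Right-sized instance: no retries.
print(effective_compute(0.5, retries=0))   # 1.0
```

In this toy model the bigger instance wins as long as its price premium is smaller than the retry overhead it eliminates.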

Asim Razzaq:

That sounds so counterintuitive to a lot of people, right? Like, bump up the size and you will save money? What?

William Howell:

Yeah, and people would ask me in meetings, “Well, why don’t we give more compute priority to this other thing?” And I would say, “Actually, Amazon does some really clever things in this realm to make it easy to figure out what instance sizes you need. For one, they do a 2:4 ratio of CPU to memory. So for every two cores, there’s 4 GB of memory.” So you look at your workload and ask, “What is the smallest amount of either compute or memory that I need?” and you optimize on that one.

We’d ask people, “For this test, is it memory intensive or CPU intensive?” They would say it’s memory intensive and it uses 11 GB of memory. So we’d say, “Then you need a 16 GB instance.” And they’d reply, “Yeah, but I want the CPU from a 32 GB instance.” And I’d explain, “Right, but you’re paying for that extra 21 GB of memory that you’re not using.” So we could run two tests in that same space by devoting that instance elsewhere.
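The sizing rule Will is describing (size on the binding resource) can be sketched as a small helper. The tiers and the 2-vCPUs-per-4-GB ratio below follow the numbers quoted in the conversation; the function and tier list are otherwise invented for illustration:

```python
# Hypothetical sizing helper: pick the smallest tier that satisfies the
# binding resource, assuming 2 vCPUs per 4 GB of memory (a 1:2 vCPU:GB
# ratio, as stated above). Tiers are illustrative, not a real price list.

SIZES_GB = [8, 16, 32, 64]  # available memory tiers (hypothetical)

def smallest_instance(mem_gb_needed: float, vcpus_needed: int) -> int:
    """Return the smallest memory tier meeting both requirements."""
    for mem in SIZES_GB:
        vcpus = mem // 2  # 2 GB per vCPU, per the stated ratio
        if mem >= mem_gb_needed and vcpus >= vcpus_needed:
            return mem
    raise ValueError("no tier fits this workload")

# The memory-intensive test from the conversation: 11 GB needed.
print(smallest_instance(11, vcpus_needed=4))   # 16 -> the 16 GB tier, not 32
```

Asking for the 32 GB tier here pays for 21 GB of memory the test never touches, which is exactly the waste the team was surfacing.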

This led us to looking at test packing and container packing to do some better stuff around there. But the net result was we started looking at how valuable the resource we’re allocating is. If you want a C5.2xlarge, and it comes with these attributes, are you actually using it? If you are, can you use less by refactoring the test or running it differently or timing it differently?

We went through and did a lot of that, and out of that effort came a number of smaller projects. These ongoing efforts help teams understand that when you build a test like this, it becomes memory intensive; when you build a test like that, it becomes CPU intensive. Here’s your budget: you have to fit within this many CPU cycles and this much memory; these are the limits you can’t exceed. When you bump up into the next class, you double the price of running this test, even if you only use 1% more than what the lower class gives you.

I think that actually helped a lot. There were a lot of people who were staggered when we started having that discussion. When we showed them the graph of all the reserved time we were just not using because they insisted on using these instance sizes to avoid refactoring their tests, they saw how much it was costing us. As we narrowed the funnel on that, it became increasingly easy to realize these cost savings.

It also became increasingly easy for me to put out graphs to leadership, saying, “You’re not going to get this again.” We just went through our budget discussion, and they said, “You saved us $1.1 million last year, $1.1 million again this year?” And I replied, “No, I’m going to save you this much, but more importantly, I’m not going to spend more than this amount. I can tell you right now, I’m not going to spend more than that.” There was a good amount of discussion about how I could guarantee that we weren’t going to spend more than that with growth and all.

Asim Razzaq:

Yeah, because you have to stay efficient, right? It’s not just about, “Well, we saved this, we’ll save it again.” You have to maintain that line you’ve put in place, and a lot of times the need for capacity only increases.

OK, well, I think in the interest of time, this has been great. I’m just floored by the fact that you guys are able to quantify this, and it drives our discussion and is backed up with a lot of data.

Q & A

Asim Razzaq:

So with that, let’s move into some of the questions. Folks who are attending, and even those on the live stream, please put in your questions. We’ll take the next few minutes to address some of these.

So, one of the questions was about your thoughts on building this capability—cloud cost management—in-house versus buying a third-party product?

William Howell:

Yeah, we had that discussion quite a bit. There’s actually a team in another group who was building their own implementation. The thing is, this is one of those kinds of problems that becomes deceptively complicated, right? When you sit down and look at it, you say, “Oh, I could get all that data. I could figure out how much things cost.” And I think you absolutely could. If you were doing this for a team of ten, I think that is absolutely the right way to go. Sorry, I’m not trying to cut out your small customers, but I think building it yourself is absolutely the right way to go for the first pass, even when you’re doing this for a hundred people or more.

Asim Razzaq:

To be clear, we don’t sell into any company that’s ten people for that reason.

William Howell:

You need to know what you’re doing and what your intention is with this data and how you’re going to use it. This kind of data becomes important for decision-making, optimizing processes, and really thinking about how you’re going to not just cut costs but also grow and target things. For instance, whether or not you’re going to put that new data center in Frankfurt, based on cost and other factors at the engineering level. We’re internally having discussions about moving to a new substrate. We were able to show that it is cheaper to run on that other substrate; instance time is cheaper, and we get better discounts. But here’s the cost of moving the data back and forth because that other substrate is missing certain features. So we have to ship this data back and forth, and when you factor that in, it’s ten times more expensive. That led to discussions about whether we wanted that 10X cost to get this other benefit.

When you move beyond the simple, just tell me how much I spent or how much things are costing, you really need the help of people who do this across multiple businesses. Whenever I’m looking at a new technology, I consider the industry it serves. When I think about finance, it’s a massive industry with many complexities you need to know and understand. For the simple case, sure, I think I could build it, but when we get into more complex stuff, I really want to partner with someone to offload some of that work.

Asim Razzaq:

I’m also assuming that a good partner will have innovation, right? They’re constantly staying ahead of the curve, whereas in-house you would often just be catching up.

So, there’s another question: how do you respond to people who say, “My engineers don’t care about cost?” How do you make them care?

William Howell:

Yeah, we were actually having this discussion yesterday, I think setting aside engineers for a second and talking about engineering managers and directors is important. The reason why they don’t care about cost is because it’s one more task on top of everything else they already have to do: make sure the website is up, the software works, customers are happy, features are delivered on time, and everyone who needs to know something knows it. Oh, and also reduce your budget by 10%, right? By the way, performance reviews are due next week, and the company’s going to be out. So, what happens is engineering managers and directors make priority decisions, and the easiest priority decision is to not do something.

OK, you want me to look at cutting costs. I can do that, but I don’t have any good tools to prioritize that. One of the things we’re trying to focus on is making common-sense budget controls that help people actually talk about their budget, not just what they need to do. How much money did you spend last month? That should be a really easy number to figure out, and it should be specific to their teams and product areas. Once you can do that, you can start having the conversation with the actual engineers. You can say, “Look, Steve, here are your commits. Here’s how much it cost us before your commits, and here’s how much it costs us after. Here’s the team average for how much commits cost us. I’m not trying to say it’s your fault, Steve. I’m just saying this data is interesting, don’t you think?”
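The per-commit conversation Will describes could be sketched as a simple cost-delta comparison. All of the data, names, and field names below are invented for illustration; only the idea of comparing each engineer’s cost delta to the team average comes from the transcript:

```python
# Hypothetical per-commit cost attribution: compare the cost delta around
# each engineer's commits to the team average. Data is invented.

commits = [
    {"author": "steve", "cost_before": 100.0, "cost_after": 130.0},
    {"author": "dana",  "cost_before": 100.0, "cost_after": 105.0},
    {"author": "ari",   "cost_before": 100.0, "cost_after":  95.0},
]

# Cost change attributed to each author's commits.
deltas = {c["author"]: c["cost_after"] - c["cost_before"] for c in commits}
team_avg = sum(deltas.values()) / len(deltas)

for author, delta in deltas.items():
    # Not assigning blame -- just surfacing the data, per the transcript.
    print(f"{author}: {delta:+.1f} vs team average {team_avg:+.1f}")
```

The output frames cost as data to discuss, not a verdict, which matches the “this data is interesting, don’t you think?” framing above.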

Engineers get into this mindset where cost is not a parameter to be optimized. Runtime, cyclomatic complexity, and how many machines it takes are things to be optimized. Cost is harder to optimize because it’s a secondary attribute. If we start making it a primary attribute, it becomes a fourth dimension on time complexity. Then engineers start saying, “I wrote something that’s very cheap, efficient to run, and very fast.” We can have that discussion and generate interest in optimizing around the cost dimension. It’s not straightforward, but it requires treating their time as important and listening to them about the tools that would make it easier for them to take this seriously.

Asim Razzaq:

That’s an important topic. One of the things we talk about is how to motivate people, especially those in the creative arts and engineering. It’s about extrinsic versus intrinsic motivation. The pillars are autonomy, purpose, and mastery—everyone wants those three things. The challenge is there’s a lot of extrinsic motivation, like, “Hey, you gotta reduce the budget by this amount or else.” But intrinsic motivation is about understanding the purpose of reducing this money.

I love your example at Okta from the beginning of our conversation: it means we can give more to nonprofit organizations.

William Howell:

Yeah, right.

Asim Razzaq:

Right. If you reduce costs, it’s not like it all has to go into our pockets. There are other larger causes we can support. In these macroeconomically challenging times, some of your coworkers might not be laid off because you’re saving money. Giving people autonomy is key. Instead of micromanaging, it’s like, “Here’s your budget, but we leave the how to you.”

William Howell:

Mm-hmm.

Asim Razzaq:

So, you’re not telling Steve who did the commit, “Here are seven things wrong with your commit.” You’re saying, “Here is a comparative thing. Why don’t you go back and think about it?” They might come back with some answers.

William Howell:

Which is, I mean, insights are insights. I’ve always been told people are motivated by three things: profit, privilege, and praise. You have to be open to the fact that people are motivated by different things. For instance, our senior leadership has said for every dollar you save in your spend, that dollar is prioritized for your team when purchasing something they need, like new software packages or new laptops.

So, if you’re saving the company $10,000 a quarter, a portion of that $10,000 is earmarked for your team and what you need. There’s a real incentive to say, “We need this tool or we want this tool to make our jobs better,” and if we save money and justify it, then that conversation is already had. Most of the engineers I talk to would rather discuss how to save money on what they’ve already built than have the conversation with finance about where to find 25% savings. So, flipping that conversation around to happen proactively instead of reactively.

Asim Razzaq:

Well, Will, this has been fascinating. It’s been great. Congratulations on all the success. These are not easy tasks, and making a cultural change and building a framework like you guys have for discussion can definitely be challenging. I appreciate you sharing your thoughts, your wisdom, and your insights.

William Howell:

Yeah, no problem.

Asim Razzaq:

Have a great rest of the day. To all our attendees, thank you for joining. We’ll make a recording of this webinar available shortly. Stay safe and have a wonderful rest of the day. Thank you.