A recent Andreessen Horowitz report identifies that, with increasing demand for AI/ML services, there is an approaching shortage in the supply of compute resources by a factor of 10! Already, Amazon customers are noticing rising costs across different instance types. This means that access to compute resources at the lowest total cost is becoming a determining factor for the success of every digital business.
This article presents a 6-factor framework for managing your cloud costs. It is written for engineering leaders and executives who want to establish a cloud cost management practice, sometimes referred to as a cloud economics or “FinOps” practice. The framework provides guidance on how to achieve cost efficiency and establish a culture of cost awareness within your organization, one that motivates your frontline engineers to do the work of managing costs. An OKR format pairs well with this framework: tracking progress against the 6 core factors helps you prioritize projects and manage your cloud costs systematically.
Factor 1: Cloud Cost Visibility
The first factor is Cloud Cost Visibility. This is the foundation of your cloud cost management strategy. You need cost reporting from each of your cloud resource providers, and a third-party cloud cost management tool will provide the most holistic visibility and the granularity you need, especially when viewing costs from multiple sources. A strategy for tagging and measuring your consumption of cloud resources is critical to ensure long-term visibility of your costs. This requires that you cultivate a culture of cost awareness that motivates your engineering teams to be proactive and take ownership of managing costs. Cost visibility should take into consideration:
- Service provider and account level breakdowns
- K8s and/or microservices granular cost reporting
- Cost attribution via taxonomy and/or tagging
- Cost trend analysis: day-over-day, week-over-week, month-over-month…
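As a rough illustration of tag-based attribution and day-over-day trend analysis, the sketch below groups billing line items by a team tag and computes the percentage change between two days. The record layout, the tag names, and the dollar amounts are all hypothetical; a real billing export from your provider or cost management tool will look different.

```python
from collections import defaultdict

# Hypothetical billing line items: (date, team tag or None, cost in USD).
LINE_ITEMS = [
    ("2024-01-01", "checkout", 120.0),
    ("2024-01-01", "search", 80.0),
    ("2024-01-01", None, 40.0),
    ("2024-01-02", "checkout", 150.0),
    ("2024-01-02", "search", 76.0),
    ("2024-01-02", None, 44.0),
]


def cost_by_tag(items):
    """Sum cost per (date, tag); untagged spend is grouped under 'untagged'."""
    totals = defaultdict(float)
    for date, tag, cost in items:
        totals[(date, tag or "untagged")] += cost
    return dict(totals)


def day_over_day(totals, tag, prev_date, curr_date):
    """Return the percentage change in cost for one tag between two dates."""
    prev = totals[(prev_date, tag)]
    curr = totals[(curr_date, tag)]
    return (curr - prev) / prev * 100.0


totals = cost_by_tag(LINE_ITEMS)
checkout_change = day_over_day(totals, "checkout", "2024-01-01", "2024-01-02")
print(f"checkout day-over-day: {checkout_change:+.1f}%")
```

Keeping an explicit "untagged" bucket makes gaps in the tagging strategy visible instead of hiding them in a total.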
Factor 2: Cloud Cost Insights
The second factor is Cloud Cost Insights. These go deeper into your cloud resources than cost visibility alone. You’ll need integrations with monitoring systems and SaaS vendor products to gain these insights. You’ll also need AI/ML capabilities to analyze your cloud costs and maintain a 24/7 vigil on the consumption of those resources. Collaboration between your Finance and Engineering teams is essential to reach consensus on which “unit economic metrics” to use for tracking progress, and to bring analytics and dashboards together for application owners. Some of the key insights that you should look to develop include: unit economics, utilization metrics, anomaly detection, idle resource information, and cost spike analysis.
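To make two of these insights concrete, here is a minimal sketch of a unit economics metric and a cost spike check. The spike check is a simple trailing-window z-score, used here as a stand-in for the AI/ML-based detection a commercial tool might provide. The function names, the 7-day window, the threshold of 3, and the sample figures are illustrative assumptions.

```python
from statistics import mean, stdev


def cost_per_unit(total_cost, units):
    """Unit economics: cost divided by a business metric such as orders served."""
    return total_cost / units


def find_spikes(daily_costs, window=7, threshold=3.0):
    """Flag days whose cost exceeds the trailing-window mean by more than
    `threshold` standard deviations. Returns the indexes of flagged days."""
    spikes = []
    for i in range(window, len(daily_costs)):
        history = daily_costs[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and (daily_costs[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


# Hypothetical daily spend in USD; day 7 contains an obvious spike.
daily = [100, 102, 98, 101, 99, 103, 100, 180, 101, 100]
spikes = find_spikes(daily)
unit_cost = cost_per_unit(5000.0, 20000)  # e.g. USD per order, made-up figures
print(spikes, unit_cost)
```

A fixed threshold like this tends to be noisy on workloads with weekly seasonality, which is one reason to agree on the metric and its tolerance with application owners first.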
Factor 3: Cost Governance
The third factor is Cost Governance, which establishes effective budgeting, forecasting, cost allocation, and cross-functional collaboration practices to achieve your financial and business goals. Alignment between Finance and Engineering teams is crucial for tracking costs and ensuring consensus on product priorities. Examples of cost governance strategies that have been used include:
- Cost savings-reinvestment framework between finance and engineering
- Chargeback/showback models
- Monthly cost reviews with engineering managers and leadership
- Aligning finance and engineering teams on budgets and product priorities
All of these strategies have the goal of motivating both Finance and Engineering to better track costs and reach team consensus on priorities.
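One common way to build a showback report is to add each team's proportional share of shared costs (for example, networking or support plans) to its direct spend. The sketch below assumes that allocation rule; the team names and amounts are hypothetical, and your organization may prefer a different allocation key such as headcount or request volume.

```python
def showback(direct_costs, shared_cost):
    """Split `shared_cost` across teams in proportion to their direct spend
    and return each team's total (direct + allocated share), in USD."""
    total_direct = sum(direct_costs.values())
    report = {}
    for team, direct in direct_costs.items():
        share = shared_cost * direct / total_direct
        report[team] = round(direct + share, 2)
    return report


# Hypothetical monthly figures in USD.
report = showback({"checkout": 6000.0, "search": 3000.0, "data": 1000.0}, 2000.0)
for team, total in report.items():
    print(f"{team}: {total:.2f}")
```

The same figures can back a chargeback model; the difference is whether the totals are only reported to teams or actually billed against their budgets.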
Factor 4: Cloud Cost Optimization
The fourth factor is Cloud Cost Optimization. This involves prioritizing the applications with the highest costs and identifying cost-saving opportunities for them. A runbook should track the progress of identified opportunities, and prioritization should favor the highest ROI for the lowest effort to drive efficiencies into your cloud infrastructure. Some opportunities, like migrating Amazon EBS volume types from gp2 to gp3, can be side projects that are worked in parallel across teams. Some optimization opportunities for your public cloud providers include:
- Reservations-commitment discounts
- Shutting down non-production resources during non-business hours
- Spot instance framework
- Dedicated infrastructure audit and optimizations
- Alternate architecture evaluation
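The prioritization rule described above can be sketched as a simple score: estimated monthly savings divided by estimated effort. The second helper shows the arithmetic behind off-hours shutdowns. All names, savings figures, and effort estimates below are hypothetical placeholders, not benchmarks.

```python
from dataclasses import dataclass


@dataclass
class Opportunity:
    name: str
    monthly_savings_usd: float  # estimated savings (hypothetical)
    effort_days: float          # estimated engineering effort (hypothetical)

    @property
    def score(self):
        """Savings per day of effort; higher means do it sooner."""
        return self.monthly_savings_usd / self.effort_days


def prioritize(opportunities):
    """Order opportunities from best to worst savings-per-effort score."""
    return sorted(opportunities, key=lambda o: o.score, reverse=True)


def off_hours_savings_fraction(hours_on_per_day, days_on_per_week):
    """Fraction of always-on hours avoided by running only on a schedule."""
    return 1 - (hours_on_per_day * days_on_per_week) / (24 * 7)


runbook = prioritize([
    Opportunity("gp2 to gp3 migration", 2000.0, 4.0),
    Opportunity("off-hours shutdown for non-production", 5000.0, 5.0),
    Opportunity("spot instance framework", 8000.0, 20.0),
])
ranked_names = [o.name for o in runbook]
fraction = off_hours_savings_fraction(12, 5)
print(ranked_names, round(fraction, 2))
```

With these made-up numbers, the spot instance framework has the largest absolute savings but ranks last because of its effort, which is the kind of trade-off the runbook should make explicit.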
Factor 5: Vendor Management
The fifth factor, Vendor Management, is often overlooked in cloud cost management but is critical. Whether it is an Engineering, Finance, or FinOps team leading the charge in negotiating with your vendors, it is important that you map your cloud resource costs to your business priorities. By linking the two, you will ensure agreement between Engineering and Finance on the most cost-effective decisions when selecting vendors and tools. You will also begin to build a well-defined vendor management process that drives business performance. To be successful with vendor management, the managing “economics” or FinOps team has to bridge the gap between Finance and Engineering. Vendor management practices include: evaluating vendor costs, defining deal strategies, evaluating alternate tooling, and avoiding overages.
Factor 6: Automation Framework
The sixth and final factor is your Automation Framework. Automation is essential to sustaining cloud cost management over the long haul. Building or configuring tooling tailored to your company’s specific needs can increase trust in the automation process and free up engineering time and resources for more strategic initiatives, while avoiding unnecessary costs and ensuring optimal performance. Your automation should adapt on its own to your changing cloud landscape and constant business reorganizations. You can’t shut down a 24/7 digital business operation just to reconfigure your cloud resource tagging and cost reporting. You can accelerate the development of your automation framework by:
- Partnering with Cloud infrastructure SRE teams
- Implementing guardrails for developer and test accounts
- Purging unused and/or unwanted Cloud resources automatically
- Implementing a post-mortem incident response framework for cost spike mitigation
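As one example of what such automation might look like, the sketch below selects purge candidates from a resource inventory while applying a simple guardrail: protected environments are never touched, and the run only reports what it would delete. The inventory records, the 14-day idle limit, and the environment names are hypothetical; a real implementation would read the inventory from your cloud provider's APIs and call their delete operations.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical inventory; a fixed "now" keeps the example reproducible.
NOW = datetime(2024, 3, 1, tzinfo=timezone.utc)
INVENTORY = [
    {"id": "vol-dev-1", "env": "dev", "last_used": NOW - timedelta(days=30)},
    {"id": "vol-dev-2", "env": "dev", "last_used": NOW - timedelta(days=2)},
    {"id": "vol-prod-1", "env": "prod", "last_used": NOW - timedelta(days=90)},
    {"id": "vm-test-1", "env": "test", "last_used": NOW - timedelta(days=15)},
]


def purge_candidates(resources, now, max_idle_days=14, protected_envs=("prod",)):
    """Return ids of resources idle longer than `max_idle_days`,
    skipping any environment listed in `protected_envs` (the guardrail)."""
    cutoff = now - timedelta(days=max_idle_days)
    return [
        r["id"]
        for r in resources
        if r["env"] not in protected_envs and r["last_used"] < cutoff
    ]


def report_purge_plan(resources, now):
    """Dry run: print what would be deleted without deleting anything."""
    candidates = purge_candidates(resources, now)
    for resource_id in candidates:
        print(f"would delete {resource_id}")
    return candidates


candidates = report_purge_plan(INVENTORY, NOW)
```

Starting with a dry-run report that teams can review is one way to build the trust in automation mentioned above before any deletion is enabled.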
With this 6-factor framework for effective cloud cost management, organizations can establish a culture of cost awareness and achieve efficiency, resulting in improved margins and optimized cloud resources based on a better understanding of the economic impact of engineering’s work and utilization of resources.