Autopilot - How hyperscalers solved the trillion dollar paradox

Hyperscalers have been down the road of cloud waste acceleration and there exists a significant body of knowledge leveraging AI and ML techniques to “Autopilot” your multiple cloud environments for efficiency.


Authors: Somik Behera

INTRODUCTION

Over the last few years, the Pandemic of 2020 accelerated enterprise spending on Cloud Services, which finally dwarfed spending on Data Centers. This should be no surprise to long-time cloud watchers. Cloud enables the ubiquitous paradigm of consuming compute as a utility to accelerate innovation, drive enterprise agility and leverage AI/ML to make sense of the ever-increasing pool of enterprise data. The need for a cloud operating model that stretches the enterprise beyond the physical borders of an office has never been more pronounced than in the age we now live in.

source: Synergy Research Group

As the spending on Cloud Services has accelerated, so has cloud computing waste. Cloud spending is already at a $100B base and still rapidly growing. With that growth, there is also an increase in cloud waste, which by some estimates stands at 30% of total cloud spend or $30B/year, and is accelerating.

The urgency of running cloud (including private cloud) and multi-cloud operations in a cost-efficient manner has never been greater: cloud cost reduction now tops enterprise priorities, together with expanded cloud adoption.

THE PARADOX

Well, you may be thinking that cloud usage is rightfully expanding, and that most enterprises recognize the expansion in cloud waste and are “on top of it.”

Industry analysts have even created a strategic framework that treats cloud cost reduction not just as a way to manage the bottom line but as a way to accelerate the top line.

So, what’s the problem?

The challenge was first described, and termed a “paradox,” in the article The Cost of Cloud, a Trillion Dollar Paradox, by Casado et al. of a16z, a leading Silicon Valley venture capital firm.

The paradox goes as follows: as enterprises adopt and expand their usage of cloud, their cloud spend expands and so does their ability to service their customers. At the same time, “wasted” spend on cloud environments slowly increases to the point where it starts to materially impact the market capitalization of companies in certain sectors. In the case of publicly listed SaaS providers, Casado et al. found that wasted cloud spend eats into margins, leaving up to a trillion dollars of market capitalization locked up in this paradox.

Since the publication of the a16z article, we have seen further confirmation of this trend in a recent filing by Pinterest, which has committed to pay Amazon Web Services $3.18B through 2029, even if Pinterest’s business collapses or Pinterest developers severely underutilize their compute capacity - a common occurrence in enterprises due to the complexity of intelligent cluster management.

We have also recently seen Google Cloud clients that are not always able to utilize their committed spend with cloud providers, and that’s just the tip of the iceberg. Even among enterprises that are able to consume their purchase commitments, meaning they deploy software applications on the purchased cloud capacity, most leave that capacity severely underutilized.

WHERE HAVE WE SEEN THIS BEFORE?

This should be no surprise, given that Google, a hyperscale pioneer, observed average utilization of around 10-50% in its private cloud environment in the early days, with roughly 30% of the capacity of Google websearch servers sitting idle.

More recently, Microsoft Research published work showing that the “idle” server capacity in Azure datacenters is significant and that Azure could potentially leverage emerging idle capacity harvesting techniques to “monetize” this idle capacity.

As you can see, the hyperscalers were and are acutely aware of the challenges and opportunities created by cloud waste and cloud underutilization.

HOW DID HYPERSCALERS SOLVE THIS CHALLENGE?

So, how did the hyperscalers, which include the three prominent cloud vendors (Amazon Web Services, Google Cloud Platform and Microsoft Azure), solve this cloud waste challenge internally for themselves? Or are they leaving a lot of margin on the table?

Containers, Planet-Scale Cluster Management and Autopilot

At Google, the invention of Linux containers by Rohit Seth back in 2006 was a key piece of the puzzle, but Google also invested heavily in running a cost-, capacity- and availability-optimized infrastructure that spans the planet. We can see evidence of Google leveraging Artificial Intelligence (AI) and Machine Learning (ML) techniques, dubbed “Autopilot,” to autonomously run planet-scale applications with high availability and cost/capacity efficiency. Autopilot continuously profiles an application's needs as well as its observed performance, measured via service level objectives (SLOs), and then “recommends” corrective actions to human operators. Once approved, the “Autopilot” engine automatically and continuously “right-sizes” containerized applications that span multiple servers, and finally “bin-packs” multiple workloads onto a single server to maximize that server’s utilization while minimizing application downtime as measured by the application’s SLOs.
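To make the two steps concrete, here is a minimal sketch of right-sizing followed by bin-packing. It is not Google's Autopilot algorithm: the percentile-plus-headroom rule, the first-fit-decreasing heuristic, the workload figures and every name in the code (`Workload`, `Server`, `right_size`, `bin_pack`) are assumptions chosen purely for illustration, and only CPU is considered.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Workload:
    name: str
    requested_mcpu: int      # CPU currently reserved, in millicores
    usage_mcpu: list[int]    # observed CPU usage samples, in millicores


@dataclass
class Server:
    capacity_mcpu: int
    placed: dict[str, int] = field(default_factory=dict)

    def free(self) -> int:
        return self.capacity_mcpu - sum(self.placed.values())


def right_size(w: Workload, pct: int = 95, headroom_pct: int = 15) -> int:
    """Recommend a CPU reservation: a high percentile of observed usage
    plus a safety margin (both values are illustrative assumptions)."""
    ordered = sorted(w.usage_mcpu)
    rank = max(1, math.ceil(pct * len(ordered) / 100))  # nearest-rank percentile
    peak = ordered[rank - 1]
    return math.ceil(peak * (100 + headroom_pct) / 100)


def bin_pack(sizes: dict[str, int], capacity_mcpu: int) -> list[Server]:
    """First-fit decreasing: place the largest workloads first, each on the
    first server with enough free CPU, opening a new server only if needed."""
    servers: list[Server] = []
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        if size > capacity_mcpu:
            raise ValueError(f"{name} does not fit on any server")
        target = next((s for s in servers if s.free() >= size), None)
        if target is None:
            target = Server(capacity_mcpu)
            servers.append(target)
        target.placed[name] = size
    return servers


# Hypothetical workloads: each reserves far more CPU than it actually uses.
workloads = [
    Workload("web", 8000, [1000, 1500, 2000, 2500, 3000]),
    Workload("api", 8000, [2000, 2000, 4000]),
    Workload("batch", 8000, [500, 1000]),
    Workload("cache", 4000, [2000, 3000]),
]

before = bin_pack({w.name: w.requested_mcpu for w in workloads}, 8000)
after = bin_pack({w.name: right_size(w) for w in workloads}, 8000)
print(len(before), "servers before,", len(after), "after")
```

With these made-up numbers, packing by the original reservations needs four 8-core servers, while packing by the right-sized reservations needs two. A real system would also account for memory, availability constraints and SLO feedback before acting on a recommendation.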

Idle capacity harvesting and monetizing unused capacity

Furthermore, Google doesn’t let each individual pool of capacity or datacenter run in a fragmented, inefficient manner. The most recent academic paper from Google shows how it uses a planet-scale cluster management approach to handle global resource management and optimization.

Similarly, Amazon Web Services (AWS) monetizes idle capacity left unused by enterprises running on AWS by creating new “product” offerings around lower-priority, lower-guarantee classes of compute capacity. Microsoft Azure likewise leverages cutting-edge AI/ML and novel resource management technologies to run a planet-scale “cluster of clusters” that intelligently harvests any idle capacity and puts it to use.
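The general idea behind idle capacity harvesting can be sketched in a few lines. This is an illustration only, not how AWS or Azure implement it: the safety buffer, the largest-first eviction order, the job names and the functions `harvestable` and `plan_evictions` are all hypothetical. The sketch lends out whatever the primary workload is not using, minus a buffer, to lower-guarantee jobs, and reclaims it when the primary workload needs it back.

```python
def harvestable(capacity_mcpu: int, primary_usage_mcpu: int, buffer_pct: int = 10) -> int:
    """CPU (millicores) that can be lent to lower-priority jobs while keeping
    a safety buffer free for the primary workload (buffer size is an assumption)."""
    buffer = capacity_mcpu * buffer_pct // 100
    return max(0, capacity_mcpu - primary_usage_mcpu - buffer)


def plan_evictions(capacity_mcpu: int, primary_usage_mcpu: int,
                   harvest_jobs: dict[str, int], buffer_pct: int = 10) -> list[str]:
    """Return the lower-priority jobs to evict, largest first, so that the
    remaining ones fit inside the currently harvestable capacity."""
    budget = harvestable(capacity_mcpu, primary_usage_mcpu, buffer_pct)
    in_use = sum(harvest_jobs.values())
    evicted: list[str] = []
    for name, size in sorted(harvest_jobs.items(), key=lambda kv: kv[1], reverse=True):
        if in_use <= budget:
            break
        evicted.append(name)
        in_use -= size
    return evicted


# Hypothetical 16-core server running two lower-guarantee jobs on spare CPU.
jobs = {"render": 6000, "index": 3000}
print(harvestable(16000, 4000))           # spare CPU while the primary workload is quiet
print(plan_evictions(16000, 4000, jobs))  # nothing needs to be evicted
print(plan_evictions(16000, 9000, jobs))  # primary workload spikes, capacity is reclaimed
```

The lower guarantee is the trade-off that makes this work: harvested jobs can be evicted at any time, which is why such capacity is typically sold as a separate, cheaper class of compute.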

CONCLUSION

Cloud spending is already at a $100B base and still rapidly growing. With that growth, there is also an increase in cloud waste, which by some estimates stands at 30% of total cloud spend or $30B/year, and is accelerating.

This trend today applies to publicly listed SaaS companies, but as technology eats every sector, we anticipate that every company, be it in retail, financial services or manufacturing, will be transformed into a technology SaaS enterprise. And every enterprise will face the consequences of the Cloud Paradox.

Hyperscalers have been down the road of cloud waste acceleration and there exists a significant body of knowledge leveraging AI and ML techniques to “Autopilot” your multiple cloud environments for efficiency. We anticipate every enterprise will need to augment their human IT DevOps and FinOps professionals with “Autopilot” techniques to usher their enterprises into the efficient cloud era!




Somik Behera is a Founding Member and Head of Products at CloudNatix, a company building an efficient Planet-Scale Cluster Manager. Previously, Somik was the Head of Products at D2iQ for the Mesosphere business, where he scaled the company from pre-product to more than 100k clusters and over $50M in revenue. Somik was an early Product Manager at Nicira (acq. by VMware for $1.3B) and an early engineer at VMware (IPO 2007), where he shipped NSX, vSphere, Capacity IQ and vCenter Operations 1.0, among other products. Somik holds a Bachelor's in Computer Science from The University of Texas at Austin and has completed graduate work in Management Sciences & Engineering at Stanford University. Somik holds 10+ Patents and Patents Pending in the field of Cloud Infrastructure and Software Defined Networking (SDN).