Bottleneck #04: Cost Efficiency

https://martinfowler.com/articles/bottlenecks-of-scaleups/04-costs.html#ReducePhase

Before engineers rush into optimizing cost individually within their own teams, it’s best to assemble a cross-functional team to perform analysis and lead execution of cost optimization efforts. Typically, cost efficiency at a startup falls under the responsibility of the platform engineering team, since they will be the first to notice the problem, but it will require involvement from many areas. We recommend getting a cost optimization team together, consisting of technologists with infrastructure skills and those who have context on the backend and data systems. They will need to coordinate efforts among impacted teams and create reports, so a technical program manager will be valuable.

Understand primary cost drivers

It is important to start with identifying the primary cost drivers. First, the cost optimization team should collect relevant invoices – these can be from cloud provider(s) and SaaS providers. It is useful to categorize the costs using analytical tools, whether a spreadsheet, a BI tool, or Jupyter notebooks. Analyzing the costs by aggregating across different dimensions can yield unique insights which can help identify and prioritize the work to achieve the greatest impact. For example:

Application/system: Some applications/systems may contribute to more costs than others. Tagging helps associate costs to different systems and helps identify which teams may be involved in the work effort.

Compute vs storage vs network: In general, compute costs tend to be higher than storage costs, while network transfer costs can sometimes be a surprisingly expensive item. This breakdown can help identify whether changes to hosting strategy or architecture may be helpful.

Pre-production vs production (environment): Pre-production environments’ cost should be quite a bit lower than production’s. However, pre-production environments tend to have more lax access control, so it is not uncommon for them to cost more than expected. This could indicate too much data accumulating in non-prod environments, or even a lack of cleanup for temporary or PoC infrastructure.

Operational vs analytical: While there is no rule of thumb for how much a company’s operational systems should cost as compared to its analytical ones, engineering leadership should have a sense of the size and value of the operational vs analytical landscape in the company that can be compared with actual spending to identify an appropriate ratio.

Service / capability provider: Across project management, product roadmapping, observability, incident management, and development tools, engineering leaders are often surprised by the number of tool subscriptions and licenses in use and how much they cost. This can help identify opportunities for consolidation, which may also lead to improved negotiating leverage and lower costs.

The resulting inventory of drivers and their associated costs should give the cost optimization team a much better idea of which types of cost are the highest and how the company’s architecture is affecting them. This exercise is even more effective at identifying root causes when historical data is considered, e.g. costs from the past 3-6 months, to correlate changes in costs with specific product or technical decisions.
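As a rough illustration of aggregating costs across dimensions, the sketch below sums a handful of tagged line items by system, by cost category, and by environment. The line items, field names, and the `total_by` helper are all hypothetical; in practice the data would come from exported invoices or a billing export, and a spreadsheet or BI tool would serve equally well.

```python
from collections import defaultdict

# Hypothetical line items, as they might look after exporting and tagging invoices.
LINE_ITEMS = [
    {"system": "checkout", "category": "compute", "environment": "prod", "cost": 12000.0},
    {"system": "checkout", "category": "network", "environment": "prod", "cost": 3000.0},
    {"system": "analytics", "category": "compute", "environment": "prod", "cost": 8000.0},
    {"system": "analytics", "category": "storage", "environment": "non-prod", "cost": 5000.0},
    {"system": "checkout", "category": "compute", "environment": "non-prod", "cost": 2000.0},
]


def total_by(items, dimension):
    """Sum cost per value of one dimension, largest total first."""
    totals = defaultdict(float)
    for item in items:
        totals[item[dimension]] += item["cost"]
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


# The same data, viewed along three different dimensions.
for dimension in ("system", "category", "environment"):
    print(dimension, total_by(LINE_ITEMS, dimension))
```

Running the same aggregation over several months of data, rather than a single invoice, is what makes it possible to line up cost changes with specific decisions.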

Identify cost-saving levers for the primary cost drivers

After identifying the costs, the trends, and what is driving them, the next question is: what levers can we employ to reduce costs? Some of the more common methods are covered below. Naturally, the list is far from exhaustive, and the right levers are often very situation-dependent.

Rightsizing: Rightsizing is the action of changing the resource configuration of a workload to be closer to its utilization.

Engineers often perform an estimation to see what resource configuration they need for a workload. As workloads evolve over time, that initial exercise is rarely followed up to check whether the original assumptions were correct or still apply, potentially leaving resources underutilized.

To rightsize VMs or containerized workloads, we compare utilization of CPU, memory, disk, etc. vs what was provisioned. At a higher level of abstraction, managed services such as Azure Synapse and DynamoDB have their own units for provisioned infrastructure and their own monitoring tools that would highlight any resource underutilization. Some tools go so far as to recommend optimal resource configuration for a given workload.
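The comparison of utilization against provisioned capacity can be sketched as follows. This is a minimal illustration, not a recommendation engine: the `Workload` fields, the sample numbers, and the 40% threshold are assumptions chosen for the example, and real utilization figures would come from your monitoring system.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    provisioned_vcpu: float
    peak_vcpu_used: float
    provisioned_mem_gib: float
    peak_mem_gib_used: float


def utilization(workload):
    """Peak usage as a fraction of what was provisioned, per resource."""
    return {
        "cpu": workload.peak_vcpu_used / workload.provisioned_vcpu,
        "memory": workload.peak_mem_gib_used / workload.provisioned_mem_gib,
    }


def rightsizing_candidates(workloads, threshold=0.4):
    """Names of workloads whose peak CPU and memory use are both below the threshold."""
    return [
        w.name
        for w in workloads
        if all(ratio < threshold for ratio in utilization(w).values())
    ]


# Hypothetical fleet: only "api" is low on both CPU and memory.
FLEET = [
    Workload("api", 8, 2, 32, 8),
    Workload("batch", 16, 14, 64, 20),
    Workload("cache", 4, 1, 16, 12),
]
print(rightsizing_candidates(FLEET))
```

Requiring every resource to be under the threshold is a deliberately conservative choice; a workload that is light on CPU but heavy on memory may be better served by a different instance family than by a smaller one.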

There are ways to save costs by changing resource configurations without strictly reducing resource allocation. Cloud providers have multiple instance types, and usually more than one instance type can satisfy any particular resource requirement, at different price points. In AWS, for example, newer generations are generally cheaper: t3.small is ~10% cheaper than t2.small. In Azure, even though the specs on paper appear higher, E-series is cheaper than D-series; we helped a client save 30% on VM costs by swapping to E-series.

As a final tip: while rightsizing particular workloads, the cost optimization team should keep any pre-purchase commitments on their radar. Some pre-purchase commitments like Reserved Instances are tied to specific instance types or families, so while changing instance types for a particular workload could save cost for that specific workload, it could lead to part of the Reserved Instance commitment going unused or wasted.

Using ephemeral infrastructure: Frequently, compute resources operate longer than they need to. For example, interactive data analytics clusters used by data scientists who work in a particular timezone may be up 24/7, even though they are not used outside of the data scientists’ working hours. Similarly, we have seen development environments stay up all day, every day, whereas the engineers working on them use them only within their working hours.

Many managed services offer auto-termination or serverless compute options that ensure you are only paying for the compute time you actually use – all useful levers to keep in mind. For other, more infrastructure-level resources such as VMs and disks, you could automate shutting down or cleaning up of resources based on your set criteria (e.g. X minutes of idle time).
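The idle-time criterion can be sketched as a small selection function. The resource IDs, timestamps, and 30-minute limit below are hypothetical, and the sketch only decides which resources qualify; actually stopping them would go through your cloud provider's API or scheduler, which is left out here.

```python
from datetime import datetime, timedelta


def resources_to_stop(last_activity, now, max_idle=timedelta(minutes=30)):
    """Return the IDs of resources idle for longer than max_idle, sorted by ID."""
    return sorted(
        resource_id
        for resource_id, last_seen in last_activity.items()
        if now - last_seen > max_idle
    )


# Hypothetical snapshot taken at 20:00, well after working hours.
NOW = datetime(2024, 1, 1, 20, 0)
LAST_ACTIVITY = {
    "notebook-cluster": datetime(2024, 1, 1, 18, 0),   # idle for 2 hours
    "dev-vm-1": datetime(2024, 1, 1, 19, 45),          # idle for 15 minutes
    "dev-vm-2": datetime(2024, 1, 1, 19, 30),          # idle for exactly 30 minutes
}
print(resources_to_stop(LAST_ACTIVITY, NOW))
```

Passing `now` in as an argument, rather than reading the clock inside the function, keeps the rule easy to test and to replay against historical activity data.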

Engineering teams may look at moving to FaaS as a way to further adopt ephemeral computing. This needs to be thought about carefully, as it is a serious undertaking requiring significant architecture changes and a mature developer experience platform. We have seen companies introduce a lot of unnecessary complexity jumping into FaaS (at the extreme: lambda pinball).

Incorporating spot instances: The unit cost of spot instances can be as much as ~70% lower than that of on-demand instances. The caveat, of course, is that the cloud provider can claim spot instances back at short notice, which risks disrupting the workloads running on them. Therefore, cloud providers generally recommend that spot instances be used for workloads that recover more easily from disruptions, such as stateless web services, CI/CD workloads, and ad-hoc analytics clusters.

Even for the above workload types, recovering from the disruption takes time. If a particular workload is time-sensitive, spot instances may not be the best choice. Conversely, spot instances could be an easy fit for pre-production environments, where time-sensitivity is less stringent.
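One way to reason about the trade-off is to discount the on-demand cost and then add back the compute spent redoing interrupted work. The model below is an assumption made for illustration, not a provider formula: `rework_fraction` is a hypothetical estimate of how much extra compute interruptions cause, and the dollar figures are invented.

```python
def effective_spot_cost(on_demand_cost, spot_discount, rework_fraction):
    """Estimate spot cost including compute that is redone after interruptions.

    spot_discount: fraction off the on-demand price (0.7 means 70% lower).
    rework_fraction: extra compute, as a fraction of the job, redone after interruptions.
    """
    if not 0 <= spot_discount < 1:
        raise ValueError("spot_discount must be in [0, 1)")
    if rework_fraction < 0:
        raise ValueError("rework_fraction must be non-negative")
    return on_demand_cost * (1 - spot_discount) * (1 + rework_fraction)


# Hypothetical CI workload costing $1,000 per month on demand.
estimate = effective_spot_cost(1000.0, spot_discount=0.7, rework_fraction=0.2)
print(f"estimated monthly spot cost: ${estimate:.2f}")
```

This only captures the dollar cost. The delay caused by recovery does not appear in the estimate, which is why time-sensitive workloads need separate consideration.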

Leveraging commitment-based pricing: When a startup reaches scale and has a clear idea of its usage pattern, we advise teams to incorporate commitment-based pricing into their contract. On-demand prices are typically higher than prices you can get with pre-purchase commitments. However, even for scale-ups, on-demand pricing could still be useful for more experimental products and services where usage patterns have not stabilized.

There are multiple types of commitment-based pricing. They all come at a discount compared to the on-demand price, but have different characteristics. For cloud infrastructure, Reserved Instances are generally a usage commitment tied to a specific instance type or family. Savings Plans are a usage commitment tied to the use of a specific resource (e.g. compute), measured in units per hour. Both offer commitment periods ranging from 1 to 3 years. Most managed services also have their own versions of commitment-based pricing.
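Whether a commitment pays off depends on how much of it actually gets used. The sketch below compares the two pricing modes under a simplified model: the committed amount is paid for in full at the discounted rate, and any usage beyond it is billed on demand. The rates, discount, and usage levels are hypothetical, and real pricing terms differ by provider and product.

```python
def cost_with_commitment(used_units, committed_units, on_demand_rate, discount):
    """Pay for the whole commitment at the discounted rate; overflow is billed on demand."""
    committed_rate = on_demand_rate * (1 - discount)
    overflow = max(0.0, used_units - committed_units)
    return committed_units * committed_rate + overflow * on_demand_rate


def cost_on_demand(used_units, on_demand_rate):
    """Pay only for what is used, at the full on-demand rate."""
    return used_units * on_demand_rate


# Hypothetical: commit to 1,000 units at a 30% discount off a $1.00 on-demand rate.
for used in (600, 900, 1200):
    committed = cost_with_commitment(used, 1000, 1.00, 0.30)
    flexible = cost_on_demand(used, 1.00)
    print(used, round(committed, 2), round(flexible, 2))
```

Under this model the commitment breaks even once usage reaches (1 - discount) of the committed amount, which is 70% in the example. Below that, the unused portion of the commitment costs more than the discount saves, which is why unstable usage patterns favor on-demand pricing.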

Architectural design: With the popularity of microservices, companies are creating finer-grained architecture approaches. It is not uncommon for us to encounter 60 services at a mid-stage digital native.

However, APIs that aren’t designed with the consumer in mind send large payloads to consumers, even though they need only a small subset of that data. In addition, some services, instead of being able to perform certain tasks independently, form a distributed monolith, requiring multiple calls to other services to get their tasks done. As these scenarios illustrate, improper domain boundaries or over-complicated architecture can show up as high network costs.

Refactoring your architecture or microservices design to improve the domain boundaries between systems will be a big project, but will have a large long-term impact in many ways beyond reducing cost. Organizations that are not ready to embark on such a journey, and are instead looking for a tactical approach to combat the cost impact of these architectural issues, can employ strategic caching to minimize chattiness.
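As a minimal sketch of caching used to cut down repeated cross-service calls, the class below wraps a fetch function and reuses each result until a time-to-live expires. The `TTLCache` class, the `fetch_customer` function, and the 60-second TTL are hypothetical; a production setup would more likely use an existing caching library or a shared cache, and would need an invalidation strategy suited to the data.

```python
import time


class TTLCache:
    """Cache results of a fetch function for a fixed time-to-live."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: no downstream call
        value = self._fetch(key)
        self._entries[key] = (now + self._ttl, value)
        return value


# Hypothetical downstream call; the list records how often it is actually made.
calls = []


def fetch_customer(customer_id):
    calls.append(customer_id)
    return {"id": customer_id}


cache = TTLCache(fetch_customer, ttl_seconds=60.0)
cache.get("c-1")
cache.get("c-1")  # served from the cache
print(len(calls))
```

The trade-off is staleness: a longer TTL saves more calls but serves older data, so the right value depends on how quickly the underlying data changes.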

Enforcing data archival and retention policy: The hot tier in any storage system is the most expensive tier for pure storage. For less frequently used data, consider moving it to a cool, cold, or archive tier to keep costs down.

It is important to review access patterns first. One of our teams came across a project that stored a lot of data in the cold tier and yet was facing increasing storage costs. The project team did not realize that the data they had put in the cold tier was frequently accessed. Colder tiers typically trade cheaper storage for higher access and retrieval charges, so the frequent reads led to the cost increase.
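The effect of access patterns on tier choice can be shown with simple arithmetic. The prices below are hypothetical placeholders, not any provider's actual rates, and the model ignores request fees and minimum storage durations that real tiers often carry.

```python
def monthly_storage_cost(stored_gb, read_gb, storage_price_per_gb, retrieval_price_per_gb):
    """Monthly cost as storage charges plus retrieval charges."""
    return stored_gb * storage_price_per_gb + read_gb * retrieval_price_per_gb


# Hypothetical prices per GB per month; real prices vary by provider and region.
HOT = {"storage_price_per_gb": 0.020, "retrieval_price_per_gb": 0.000}
COLD = {"storage_price_per_gb": 0.004, "retrieval_price_per_gb": 0.030}


def cheaper_tier(stored_gb, read_gb):
    """Return which of the two tiers costs less for this access pattern."""
    hot = monthly_storage_cost(stored_gb, read_gb, **HOT)
    cold = monthly_storage_cost(stored_gb, read_gb, **COLD)
    return "cold" if cold < hot else "hot"


# Same 10,000 GB of data, read lightly and then heavily.
print(cheaper_tier(10_000, 1_000))
print(cheaper_tier(10_000, 20_000))
```

With these assumed prices, the cold tier stops being cheaper once monthly reads pass roughly 5,300 GB, which mirrors the situation described above: data that looks archival but is read often can cost more in a cold tier than in a hot one.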

Consolidating duplicative tools: While enumerating the cost drivers in terms of service providers, the cost optimization team may realize the company is paying for multiple tools within the same category (e.g. observability), or even wonder if any team is really using a particular tool. Eliminating unused resources/tools and consolidating duplicative tools in a category is certainly another cost-saving lever.

Depending on the volume of usage after consolidation, there may be additional savings to be gained by qualifying for a better pricing tier, or even taking advantage of increased negotiation leverage.

Prioritize by effort and impact

Any potential cost-saving opportunity has two important characteristics: its potential impact (size of potential savings), and the level of effort needed to realize them.

If the company needs to save costs quickly, saving 10% out of a category that costs $50,000 naturally beats saving 10% out of a category that costs $5,000.

However, different cost-saving opportunities require different levels of effort to realize them. Some opportunities require changes in code or architecture which take more effort than configuration changes such as rightsizing or utilizing commitment-based pricing. To get a good understanding of the required effort, the cost optimization team will need to get input from relevant teams.

Figure 2: Example output from a prioritization exercise for a client (the same exercise done for a different company could yield different results)

At the end of this exercise, the cost optimization team should have a list of opportunities, with potential cost savings, the effort to realize them, and the cost of delay (low/high) associated with the lead time to implementation. For more complex opportunities, a proper financial analysis needs to be specified as covered later. The cost optimization team would then review with leaders sponsoring the initiative, prioritize which to act upon, and make any resource requests required for execution.
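A first-pass ordering of the opportunity list can be sketched by ranking on estimated savings per unit of effort. The `Opportunity` fields, the example entries, and the ranking rule are assumptions for illustration; the sketch leaves out cost of delay and other factors that leaders would weigh during the review.

```python
from dataclasses import dataclass


@dataclass
class Opportunity:
    name: str
    monthly_savings: float  # estimated, in dollars
    effort_weeks: float     # estimated with input from the relevant teams


def prioritize(opportunities):
    """Order opportunities by estimated monthly savings per week of effort, best first."""
    return sorted(
        opportunities,
        key=lambda o: o.monthly_savings / o.effort_weeks,
        reverse=True,
    )


# Hypothetical output of the analysis phase.
BACKLOG = [
    Opportunity("refactor service boundaries", 20000.0, 26.0),
    Opportunity("rightsize VMs", 5000.0, 1.0),
    Opportunity("consolidate observability tools", 1500.0, 4.0),
    Opportunity("adopt commitment-based pricing", 8000.0, 2.0),
]
for item in prioritize(BACKLOG):
    print(item.name)
```

A single ratio like this is only a starting point for discussion. The largest opportunity by absolute savings ranks third here, which is a reminder that a quick ranking should not crowd out high-impact work that simply takes longer.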

The cost optimization team should ideally work with the impacted product and platform teams for execution, after giving them enough context on the action needed and reasoning (potential impact and priority). However, the cost optimization team can help provide capacity or guidance if needed. As execution progresses, the team should re-prioritize based on learnings from realized vs projected savings and business priorities.