Cloud Optimisation
Cloud optimisation has been a hot topic of conversation, so let's dig into the strategies behind leveraging the cloud and how organisations can improve efficiency, scalability, and cost-effectiveness.
Mollis Group's take on cloud optimisation is built on six success pillars:
Migration; many teams lift and shift onto infrastructure as a service (IaaS) as an expensive but often necessary first step and then move to various platform as a service (PaaS) offerings. This should happen as part of a deliberate cloud migration strategy, but teams should not treat that strategy as set in stone; be ready to flex based on new offerings from vendors, licensing changes and even innovative start-ups disrupting the market. Typically, we see people move from IaaS towards serverless technology, as it shifts large chunks of responsibility to the vendor. A word of caution with serverless: always evaluate whether things can be taken back in-house if you need to and how open the cloud vendor is. Locking yourself into a highly proprietary serverless platform can have negative long-term repercussions, so always consider your exit strategy as well: how much would it cost to move this service if, for example, the vendor withdrew it? This is why we always recommend using infrastructure as code (IaC) as part of your migration strategy; you can version infrastructure in git and refactor it across cloud vendors far more quickly. Data is also critical to a successful strategy: how do you get what is sometimes petabytes of data into the cloud so you can use it effectively? Running expensive on-site data lakes is almost pointless given the low cost of cloud storage, so keeping the data you want to use close to the services that consume it is, in most cases, the obvious choice.
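As an illustration of how IaC keeps a migration portable, here is a minimal sketch using Pulumi's Python SDK (Terraform and other IaC tools work just as well); the project, bucket name and region are purely illustrative.

```python
"""Minimal IaC sketch (Pulumi Python SDK): a storage bucket defined in code.
Because the definition lives in git it can be reviewed, versioned and, if a
vendor or pricing change forces a move, refactored to another provider far
faster than hand-built infrastructure. Names and regions are illustrative."""
import pulumi
import pulumi_gcp as gcp

# A regional bucket for migrated data; keeping it in the same region as the
# services that consume it avoids unnecessary egress charges.
data_bucket = gcp.storage.Bucket(
    "migrated-data",          # logical resource name, tracked in Pulumi state
    location="EUROPE-WEST2",  # illustrative region
    uniform_bucket_level_access=True,
)

# Export the bucket URL so downstream stacks (or humans) can reference it.
pulumi.export("data_bucket_url", data_bucket.url)
```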
Shared Responsibility; at the end of the day, you pay a premium for hosting your cloud services with one or more of the prominent players. You might also integrate Software as a Service (SaaS) offerings within your own. All of these services have baked-in service levels that are already paid for, and these are broadly described in the shared responsibility model. The critical thing here is to ensure your cloud support teams are right-sized for the services you are consuming. You should roadmap the resourcing and skills needed as you move through your cloud migration strategy towards more serverless tooling, shifting responsibility onto the vendor and away from your team and leaving a smaller but more multi-skilled team in place. The key effect the cloud has over time is that specialists tend to end up sitting on the vendor side, while your own teams become more generalised.
Observability; there are many reasons why observability is essential, including application support, security, and right-sizing applications based on usage. You need to decide whether a single cloud vendor's observability offering will suffice or whether you need to move to a SaaS offering; you also need to ask whether you can combine observability domains, for example security operations and application operations under one vendor solution. You will often see the SOC running a domain-specific SIEM and the operations team running a very different solution, with 80% of the functionality overlapping but the cost duplicated, so it is worth considering whether these areas can be consolidated. From our perspective, every business should have good observability of its technology platforms in case the worst happens. The first thing regulators will ask for if you suffer a data breach is your logs; treat them like an insurance policy, and plan how they will be consumed across your business by security, finance and operations.
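One practical step towards logs that several teams can consume is emitting them in a structured, machine-readable form from day one. The snippet below is a minimal sketch using only Python's standard library; the field names and service name are illustrative, and most teams would use their observability vendor's agent or SDK instead.

```python
"""Minimal structured-logging sketch: JSON logs that security, finance and
operations tooling can all parse without bespoke regexes.
Standard library only; field names are illustrative."""
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "checkout-api",   # illustrative service name
            "message": record.getMessage(),
            # Extra fields (user, cost centre, etc.) passed via `extra=`
            **getattr(record, "fields", {}),
        }
        return json.dumps(payload)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# The same event is readable by a SIEM (security), a FinOps pipeline (cost
# centre) and the operations team, without re-parsing free text.
logger.info("order placed", extra={"fields": {"order_id": "A123", "cost_centre": "retail"}})
```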
Security; cloud is a double-edged sword. Done well, it's far more secure than anything you could build in your own data centres; done badly, it can leave you open to the world and the subject of ridicule. For us, DevSecOps is one of the keys to enhancing your cloud journey, beyond the obvious benefit of making changes quickly within a secure control framework such as CCM v4. You also gain the benefit of blending the security domain with the operational domain, which touches on the previous point about observability domains often ending up siloed. Take the example of an SRE responsible for both the application and its security: early visibility of application vulnerabilities within the SRE domain brings accelerated benefits and avoids the SOC having to pass that information across teams, freeing the SOC to focus on more serious threats and issues.
Secure Access Service Edge (SASE) & Cloud Access Security Broker (CASB) tooling can drastically cut complexity and improve security in a hybrid, multi-cloud environment. These tools are a topic in themselves, but think of them as secure proxies that broker access to SaaS and cloud tooling from end-user devices. They can also reduce costs by simplifying how you control and monitor access to cloud-based tools.
FinOps; there are many costs to keep track of in the cloud; we will touch on some of the main ones and how they can be optimised. First, and most importantly, you need to agree on how cost attribution will happen across your business, how you will enforce it and what controls you have to stop overspend. If you operate a single cloud, workload placement is relatively simple: pick a solution from those available and understand its benefits and costs. If you work across vendors, you need pre-defined patterns for different workload types so teams can quickly be directed to the right cloud platform; without them, you end up spending any cost savings on repeating the same analysis over and over.
- Compute Costs; this is the big one that everyone thinks of first, as it often forms a substantial chunk of the cost of a solution. The biggest saving is putting 24/7 workloads on committed use pricing; this alone can save thousands, yet we've seen many mature projects still running on on-demand pricing. Then there is ephemeral compute: if a job runs for a short period and is idempotent, it's ideal for ephemeral (spot or preemptible) capacity. This is the cloud provider's way of selling you excess capacity and is dramatically cheaper than on-demand pricing (see the worked comparison after this list). Finally, match the architecture to the workload: ARM and x86 architectures are now readily available with many cloud vendors, with different pricing models, capabilities and definitions of a vCPU. For example, x86 counts hyper-threads as vCPUs, whereas an ARM64 vCPU is always a physical core, so a compute-intensive, high-contention application could be better off on ARM64, where it will often deliver better price-performance.
- Networking Costs; depending on how hybrid your solution is, it's worth considering how you peer your cloud environments. For example, using a site-to-site VPN to peer AWS and GCP is an expensive choice for high volumes of data: not only do you pay for a VM with a large amount of compute to achieve the network throughput, you also pay a premium on egress. If you have a high network commitment, it's often better to peer the platforms directly through a common point of presence (PoP). There are also some simple rules around network design:
- Replicate data across the redundancy zones that consume it, so you're not paying unnecessary transit costs to reach it.
- Keep things close together; network pricing is tiered zonally, regionally and internationally. If you need redundancy, prefer regional redundancy over international.
- If you run a global platform, consider what you replicate globally and what you store locally, as moving data and network traffic internationally at scale is expensive.
- Network tiers: premium vs standard. Many systems don't need premium network tiers, so consider the features you actually need from each part of the network to optimise cost.
- Resilience; do you have an architectural pattern for non-critical, non-geo-redundant applications? Yes, it's a thing: not all cloud applications need to be scalable or redundant, and where that's the case you should allow them to fail in a controlled way. We often over-engineer fault tolerance into applications that don't need it; there are core business systems that should be geo-redundant and scalable, but they are fewer than we tend to build.
- IaC and Ephemeral Environments; if we have IaC, for example Terraform, we can deploy whole ecosystems for regression testing and then destroy them (see the sketch after this list). This saves enormous amounts of money over time. There are also several other key benefits to IaC:
- It is an enabler of core DevSecOps principles, for example making your infrastructure changes machine-inspectable before a potentially dangerous configuration is deployed.
- Policy as Code (PaC) can also be implemented in parallel with IaC; see projects like OPA (Open Policy Agent).
- Time; removing the repetition of manual tasks and accurately implementing corporate architectures saves valuable engineering time.
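To make the compute pricing point concrete, here is a small worked comparison. The hourly rates are purely illustrative placeholders, not any vendor's actual prices, and real committed-use and spot discounts vary by provider, region and machine type.

```python
"""Illustrative monthly cost comparison for a 24/7 workload.
The hourly rates below are placeholders, not real vendor prices;
committed-use and spot/preemptible discounts vary by provider and region."""

HOURS_PER_MONTH = 730  # average hours in a month

# Hypothetical hourly rates for the same machine shape
ON_DEMAND_RATE = 0.10        # pay-as-you-go
COMMITTED_USE_RATE = 0.06    # e.g. a 1- or 3-year commitment discount
SPOT_RATE = 0.03             # ephemeral/spot capacity, can be reclaimed


def monthly_cost(rate_per_hour: float, hours: float = HOURS_PER_MONTH) -> float:
    return rate_per_hour * hours


on_demand = monthly_cost(ON_DEMAND_RATE)
committed = monthly_cost(COMMITTED_USE_RATE)
spot_batch = monthly_cost(SPOT_RATE, hours=50)  # a short, idempotent batch job

print(f"24/7 on on-demand pricing:     ${on_demand:,.2f}/month")
print(f"24/7 on committed-use pricing: ${committed:,.2f}/month "
      f"({(1 - committed / on_demand):.0%} saving)")
print(f"50h idempotent job on spot:    ${spot_batch:,.2f}/month")
```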
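The ephemeral-environment point can be sketched just as simply. The script below is a minimal illustration: it shells out to the Terraform CLI to build a stack, runs a test suite, then destroys everything; the configuration directory and test command are illustrative assumptions.

```python
"""Minimal sketch of an ephemeral test environment driven by IaC.
Assumes a Terraform configuration in ./envs/regression and a pytest suite;
both paths are illustrative."""
import subprocess

ENV_DIR = "./envs/regression"   # illustrative Terraform configuration


def terraform(*args: str) -> None:
    subprocess.run(["terraform", f"-chdir={ENV_DIR}", *args], check=True)


try:
    terraform("init", "-input=false")
    terraform("apply", "-auto-approve", "-input=false")          # build the whole stack
    subprocess.run(["pytest", "tests/regression"], check=True)   # run the regression suite
finally:
    # Destroy everything whether the tests passed or not, so nothing is
    # left running (and billing) over the weekend.
    terraform("destroy", "-auto-approve", "-input=false")
```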
AI (Artificial Intelligence) & Automation; no, not ChatGPT, impressive and fun as it is. We're talking about anomaly detection, spotting the over-granting of IAM privileges, and cost optimisation. Automation can also be used to make suggestions; for example, GCP provides a Recommender API that lets you reduce costs through right-sizing programmatically, consuming data from your operations to make its recommendations. Most platforms will monitor the IAM privileges actually used by different cloud identities and report on this, which can feed into a DevSecOps process to remove excess privileges in your platform. If you're hiring a red team to perform platform reviews, their time is then spent auditing much more valuable areas rather than chasing excess privileges.
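As a sketch of consuming these recommendations programmatically, the snippet below uses the google-cloud-recommender Python client; the project ID, zone and recommender ID are illustrative, and the exact recommender names and response fields should be checked against the vendor's documentation.

```python
"""Sketch: listing right-sizing recommendations via the google-cloud-recommender
client library. Project, zone and recommender ID are illustrative; check the
current recommender names against Google's documentation before relying on them."""
from google.cloud import recommender_v1

client = recommender_v1.RecommenderClient()

# Machine-type right-sizing recommendations for one project/zone (illustrative).
parent = (
    "projects/my-project/locations/europe-west2-b/"
    "recommenders/google.compute.instance.MachineTypeRecommender"
)

for rec in client.list_recommendations(parent=parent):
    # Each recommendation carries a description and an estimated impact that
    # can feed a FinOps or DevSecOps review queue.
    print(rec.name)
    print(" ", rec.description)
```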
Compliance as Code, Policy as Code & Infrastructure as Code give teams the ways and means to operate securely while retaining velocity. They take time to get started, but they reduce errors and increase consistency and security across your operating platforms, and secure, consistent platforms drive cost savings. Conversely, mistakes drive expensive root cause analysis and take your engineering teams away from productive work. There is also a swathe of tools professing to optimise cloud deployment costs; consume these with caution, as most only deliver cost benefits if the original platform was poorly designed. If you're operating a mature environment with IaC, you can add cost-based policy controls to your infrastructure pipelines that block, for example, deployments outside certain regions without an approved exception. The same tooling stack can help you meet sustainability targets by skewing policy towards green objectives, for example only allowing deployments to low-CO2 regions.
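Policy as Code is usually written in a dedicated language such as OPA's Rego; purely to illustrate the idea of a cost or region guardrail in a pipeline, the sketch below applies a region allow-list to a Terraform plan exported as JSON. The allow-list and the plan attributes it inspects are illustrative and vary by provider.

```python
"""Illustrative policy-as-code check: fail a pipeline if a Terraform plan
creates resources outside an allowed-region list. Real implementations usually
use OPA/Rego, Sentinel or similar; the plan fields inspected here
(resource_changes -> change.after.region/location) vary by provider."""
import json
import sys

ALLOWED_REGIONS = {"europe-west1", "europe-west2"}   # illustrative allow-list

# Produced beforehand with: terraform show -json plan.out > plan.json
with open("plan.json") as f:
    plan = json.load(f)

violations = []
for change in plan.get("resource_changes", []):
    after = (change.get("change") or {}).get("after") or {}
    region = after.get("region") or after.get("location")
    if region and region not in ALLOWED_REGIONS:
        violations.append(f"{change['address']} -> {region}")

if violations:
    print("Blocked by region policy:")
    for v in violations:
        print("  ", v)
    sys.exit(1)  # non-zero exit fails the pipeline step
print("Region policy passed.")
```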
In conclusion, cloud optimisation can bring numerous benefits to businesses of all sizes: significant cost savings alongside becoming more agile, secure, sustainable and better prepared to meet the demands of a rapidly changing business environment. If you want to unlock the full potential of cloud optimisation, get in touch. #evolvewithus