Learnings From Our 8 Years Of Kubernetes In Production — Two Major Cluster Crashes, Ditching Self-Managed, Cutting Cluster Costs, Tooling, And More | by Anders Jönsson | Feb 2024

Early on at Urb-it, before I joined, we decided to use Kubernetes as a cornerstone of our cloud-native strategy. The thinking behind this choice was our anticipated rapid scaling, coupled with the desire to leverage container orchestration to get a more dynamic, resilient, and efficient environment for our applications. And with our microservice architecture, Kubernetes fit well.
The decision was made early, which, of course, should be questioned, since it represents a significant dependency and a substantial amount of knowledge for a startup (or any company) to carry. Also, did we even face the problems Kubernetes solves at that stage? One might argue that we could have started with a large monolith and relied on it until scaling and other issues became painful, and only then moved to Kubernetes (or something else). Also, Kubernetes was still in early development. But let's go deep on that another time.
Having run Kubernetes for over eight years in production (a separate cluster for each environment), we've made a mix of good and not-so-good decisions. Some mistakes were simply a result of "otur när vi tänkte" (bad luck in our decision-making), while others originated from us not fully (or even partly) understanding the underlying technology itself. Kubernetes is powerful, but it also has layers of complexity.
We went in head-on, without any previous experience of running it at scale.
For the first years, we ran a self-managed cluster on AWS. If my memory serves me well, we didn't initially have the option to use Azure Kubernetes Service (AKS), Google Kubernetes Engine (GKE), or Amazon Elastic Kubernetes Service (EKS), since they didn't provide an official managed solution yet. It was on that self-hosted Amazon Web Services (AWS) setup that we had our first and most horrible cluster crash in Urb-it's history, but more on that later.
Since we were a small team, it was challenging to keep up with all the new capabilities we needed. At the same time, managing a self-hosted cluster required constant attention and care, which added to our workload.
When managed solutions became generally available, we took some time to evaluate AKS, GKE, and EKS. All of them were several times better for us than managing the cluster ourselves, and we could easily see the quick ROI of moving.
Our platform back then was 50% .NET and 50% Python, and we were already using Azure Service Bus, Azure SQL Server, and other Azure services. Therefore, moving our cluster to Azure would not only make it easier to use them in an integrated fashion but also benefit us by using the Azure backbone networking infrastructure, avoiding the costs associated with leaving/entering external networks and VNETs, which we had with our mixed AWS and Azure setup. Also, many of our engineers were familiar with Azure and its ecosystem.
We should also mention that for our initial setup on AKS, we didn't have to pay for the control plane nodes (master nodes), which was an added bonus (saving money on nodes).
We migrated during the winter of 2018, and although we've encountered some issues with AKS over the years, we've never regretted the move.
During our self-managed time on AWS, we experienced a massive cluster crash that resulted in the majority of our systems and products going down. The root CA certificate, etcd certificates, and API server certificates had expired, which caused the cluster to stop working and prevented us from managing it. The support for resolving this in kube-aws was, at the time, limited. We brought in an expert, but in the end, we had to rebuild the entire cluster from scratch.
We thought we had all the values and Helm charts in each git repository, but, surprise, surprise, that wasn't the case for all services. On top of this, none of the configurations for creating the cluster were stored. It became a race against time to set up the cluster again and populate it with all the services and products we had. Some of them required reinventing the Helm charts to create the missing configurations. There were moments like Dev1 to Dev2: "Do you remember how much CPU or RAM this service should have, or what network and port access it should have?" Not to mention all the secrets that were gone with the wind.
It took us days to get everything up and running again. Not our proudest moment, to say the least.
And now you might say: the second crash couldn't have been due to a certificate, since you must have learned your lesson from the first crash, right? Yes and no. When recreating the cluster after crash #1, unfortunately, the specific version of kube-aws that we used had a problem. When it created the new clusters, it didn't set the expiration of the etcd certificates to the provided expiry date; it defaulted to one year. So exactly one year after the first cluster crash, the certificates expired, and we experienced another cluster crash. However, this one was easier to recover from; we didn't have to rebuild everything. But it was still a weekend from hell.
Side note 1: Other companies were affected by this bug the same way we were, not that it helped our customers…
Side note 2: Our plan was to update all the certificates after a year, but to give ourselves some margin, we set the expiration to two years (if I remember correctly). So we had plans to update the certificates, but the bug beat us to it.
Since 2018, we haven't had any more cluster crashes… Jinxing it? Yes.
- Kubernetes Is Complex
You need engineers who are interested in and want to work with the infrastructure and operations aspects of Kubernetes. In our case, we needed a couple of engineers who would dedicate their time to Kubernetes when needed. It was impossible to rotate and split the work across the entire team; the technology is too complex to "jump in and out of" every second week. Of course, everyone needs to know how to use it (deploy, debug, etc.), but to excel in the more challenging aspects, dedicated time is necessary. Additionally, it's important to have someone who leads with a vision and has a strategy for evolving the cluster.
- Kubernetes Certificates
Having experienced two cluster crashes, both due to expiring certificates, it's crucial to be well-versed in the details of internal Kubernetes certificates and their expiration dates (a small sketch of checking expiry dates follows after this list).
- Keep Kubernetes & Helm Up To Date
If you fall behind, it becomes expensive and tedious. We always waited a couple of months before jumping on the latest version, to make sure others would hit any new version issues first. But even while keeping things up to date, we faced many time-consuming rewrites of configuration files and charts due to new versions of Kubernetes and Helm (Kubernetes APIs going from alpha to beta, beta to 1.0, etc.). I know Simon and Martin loved all the Ingress changes.
- Centralized Helm Charts
When it came to the Helm charts, we grew tired of updating all 70+ charts for each version change, so we adopted a more generic "one chart to rule them all" approach. There are many pros and cons to a centralized Helm chart approach, but in the end, it suited our needs better.
- Disaster Recovery Plan
I can't emphasize this enough: make sure you have ways to recreate the cluster if needed. Yes, you can click around in a UI to create new clusters, but that approach will never work at scale or in a timely manner.
There are different ways to handle this, ranging from simple shell scripts to more advanced methods like Terraform (or similar). Crossplane can also be used to manage Infrastructure as Code (IaC) and more.
For us, due to limited team bandwidth, we settled on storing and using shell scripts.
Regardless of the method you select, make sure to test the flow from time to time so you know you can recreate the cluster if needed.
- Backup of Secrets
Have a strategy for backing up and storing secrets. If your cluster goes away, all your secrets will be gone. And trust me, we experienced this first-hand; it takes a lot of time to get everything right again when you have multiple microservices and external dependencies. (See the backup sketch after this list.)
- Vendor-Agnostic vs. "Go All In"
In the beginning, after moving to AKS, we tried to keep our cluster vendor-agnostic, meaning we would continue to use other services for the container registry, auth, key vaults, etc. The idea was that we could easily move to another managed solution one day. While being vendor-agnostic is a good idea, for us it came with a high opportunity cost. After a while, we decided to go all-in on AKS-related Azure products, like the container registry, security scanning, auth, etc. For us, this resulted in an improved developer experience, simplified security (centralized access management with Azure Entra ID), and more, which led to faster time-to-market and reduced costs (volume benefits).
- Custom Resource Definitions
Yes, we went all-in on the Azure products, but our guiding star was to have as few Custom Resource Definitions as possible and to instead use the built-in Kubernetes resources. However, we had some exceptions, like Traefik, since the Ingress API didn't fulfill all our needs.
- Security
See below.
- Observability
See below.
- Pre-Scaling During Known Peaks
Even with the auto-scaler, we sometimes scaled too slowly. By using traffic data and common knowledge (we're a logistics company and have peaks around holidays), we scaled up the cluster manually (ReplicaSet) a day before the peak arrived, then scaled it down the day after (slowly, to handle any second peak wave that might occur). A pre-scaling sketch also follows after this list.
- Drone Inside The Cluster
We kept the Drone build system in the stage cluster; it had some benefits but also some drawbacks. It was easy to scale and use since it was in the same cluster. However, when building too much at the same time, it consumed almost all the resources, leading to a rush in Kubernetes to spin up new nodes. The best solution would probably be to use it as a pure SaaS solution, without having to worry about hosting and maintaining the product itself.
- Select The Right Node Type
This is very context-specific, but depending on the node type, AKS reserves about 10–30% of the available memory (for internal AKS services). So for us, it was beneficial to use fewer but larger node types. Also, since we were running .NET on many of the services, we needed node types with efficient and sizable IO. (.NET frequently writes to disk for JIT and logging, and if this requires network access, it becomes slow. We also made sure the node disk/cache was at least as large as the total configured node disk size, again to prevent the need for network hops.)
- Reserved Instances
You can debate that this approach goes a bit against the flexibility of the cloud, but for us, reserving key instances for a year or two resulted in massive cost savings. In many cases, we saved 50–60% compared to the "pay as you go" approach. And yes, that's a lot of cake for the team.
- k9s
https://k9scli.io/ is a handy tool for anyone who wants one level of abstraction above pure kubectl.
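On the certificates point: the simplest safeguard is to watch the expiry dates continuously. Below is a minimal sketch (not our actual tooling) that reads the serving certificate of the API server endpoint and reports the days left. The host and port are placeholders, and it assumes the third-party cryptography package; on kubeadm-based clusters, `kubeadm certs check-expiration` covers the internal certificates as well.

```python
import ssl
from datetime import datetime

# Requires the third-party package: pip install cryptography
from cryptography import x509

# Placeholder endpoint; point this at your own API server (or etcd member).
API_SERVER = ("my-cluster.example.com", 6443)

def days_until_expiry(host: str, port: int) -> int:
    """Fetch the server certificate and return the number of days until it expires."""
    # The chain is not verified here; we only want to read the validity dates.
    pem = ssl.get_server_certificate((host, port))
    cert = x509.load_pem_x509_certificate(pem.encode())
    remaining = cert.not_valid_after - datetime.utcnow()
    return remaining.days

if __name__ == "__main__":
    days = days_until_expiry(*API_SERVER)
    print(f"{API_SERVER[0]}: certificate expires in {days} days")
    if days < 30:
        print("WARNING: renew the cluster certificates soon!")
```

Run it from a cron job or CI schedule and wire the warning into your alerting, and a silent one-year default expiry is far less likely to surprise you.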
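And on the secrets backup: a minimal sketch of dumping Secrets to files so that a lost cluster doesn't mean lost configuration. It assumes the official kubernetes Python client and a working kubeconfig; the namespaces and output directory are placeholders. Wherever the dump ends up, treat it as sensitive and encrypt it (or use a purpose-built tool such as an external secret store or Velero-style backups).

```python
import base64
import json
from pathlib import Path

# Requires the official client: pip install kubernetes
from kubernetes import client, config

BACKUP_DIR = Path("secret-backup")        # placeholder output directory
NAMESPACES = ["default", "production"]    # placeholder namespaces

def backup_secrets() -> None:
    """Write every Secret in the chosen namespaces to a JSON file on disk."""
    config.load_kube_config()  # uses the current kubeconfig context
    v1 = client.CoreV1Api()
    BACKUP_DIR.mkdir(exist_ok=True)
    for ns in NAMESPACES:
        for secret in v1.list_namespaced_secret(ns).items:
            # Secret values are base64-encoded in the API; decode for readability.
            data = {
                key: base64.b64decode(value).decode("utf-8", errors="replace")
                for key, value in (secret.data or {}).items()
            }
            out = BACKUP_DIR / f"{ns}-{secret.metadata.name}.json"
            out.write_text(json.dumps(data, indent=2))
            print(f"backed up {ns}/{secret.metadata.name} -> {out}")

if __name__ == "__main__":
    backup_secrets()
```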
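Finally, the pre-scaling itself is nothing fancy. Here is a sketch of what it can look like with the kubernetes Python client; the deployment names and replica counts are made up, and a plain `kubectl scale` in a scheduled job works just as well.

```python
# Requires the official client: pip install kubernetes
from kubernetes import client, config

# Hypothetical deployments and the replica counts we want before a known peak.
PEAK_REPLICAS = {
    "order-api": 12,
    "delivery-tracker": 8,
}
NAMESPACE = "production"  # placeholder

def pre_scale(replicas_by_deployment: dict, namespace: str) -> None:
    """Bump the replica count of each deployment ahead of an expected peak."""
    config.load_kube_config()
    apps = client.AppsV1Api()
    for name, replicas in replicas_by_deployment.items():
        apps.patch_namespaced_deployment_scale(
            name=name,
            namespace=namespace,
            body={"spec": {"replicas": replicas}},
        )
        print(f"scaled {namespace}/{name} to {replicas} replicas")

if __name__ == "__main__":
    pre_scale(PEAK_REPLICAS, NAMESPACE)
```

The day after the peak, the same script can be reused with lower counts to scale back down gradually.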
Monitoring
Make sure you track the usage of memory, CPU, etc., over time so you can see how your cluster is performing and determine whether new capabilities improve or worsen it. With this, it's easier to find and set the "correct" limits for different pods (finding the right balance is important, since a pod is killed if it runs out of memory).
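As a small illustration, here is a sketch of pulling the current per-container CPU and memory usage from the metrics API. It assumes metrics-server is installed in the cluster and uses the official kubernetes Python client; the namespace is a placeholder. In practice, a Prometheus/Grafana-style stack gives you this over time rather than as a snapshot.

```python
# Requires metrics-server in the cluster and: pip install kubernetes
from kubernetes import client, config

def print_pod_usage(namespace: str) -> None:
    """Print current CPU and memory usage per container, as reported by metrics-server."""
    config.load_kube_config()
    custom = client.CustomObjectsApi()
    metrics = custom.list_namespaced_custom_object(
        group="metrics.k8s.io",
        version="v1beta1",
        namespace=namespace,
        plural="pods",
    )
    for pod in metrics["items"]:
        for container in pod["containers"]:
            usage = container["usage"]
            print(f'{pod["metadata"]["name"]}/{container["name"]}: '
                  f'cpu={usage["cpu"]} memory={usage["memory"]}')

if __name__ == "__main__":
    print_pod_usage("production")  # placeholder namespace
```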
Alerting
Refining our alerting system was a process, but eventually, we directed all alerts to our Slack channels. This made it convenient to receive notifications whenever the cluster was not behaving as expected or any unforeseen issues arose.
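The Slack side of this is simple: whatever fires the alerts in your stack (Alertmanager, Azure Monitor, or similar) ultimately does the equivalent of posting to an incoming webhook. A minimal sketch, with a hypothetical webhook URL:

```python
import json
import urllib.request

# Hypothetical Slack incoming-webhook URL; create one per alert channel.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def send_alert(text: str) -> None:
    """Post a plain-text alert message to a Slack channel via an incoming webhook."""
    payload = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        response.read()  # Slack answers with "ok" on success

if __name__ == "__main__":
    send_alert(":warning: production cluster: node memory pressure detected")
```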
Logging
Having all logs consolidated in one place, together with a robust trace ID strategy (e.g., OpenTelemetry or similar), is crucial for any microservices architecture. It took us 2–3 years to get this right. If we had implemented it earlier, it would have saved us a considerable amount of time.
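To show what a trace ID strategy looks like in practice, here is a minimal sketch using the OpenTelemetry Python SDK (the opentelemetry-api and opentelemetry-sdk packages); the service and span names are made up. The point is that every log line carries the trace ID, so a single request can be followed across services once the logs land in Kibana (or wherever they are consolidated).

```python
import logging

# pip install opentelemetry-api opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("order-service")  # hypothetical service name

class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        span_context = trace.get_current_span().get_span_context()
        record.trace_id = format(span_context.trace_id, "032x")
        return True

logging.basicConfig(format="%(asctime)s trace=%(trace_id)s %(message)s", level=logging.INFO)
logger = logging.getLogger(__name__)
logger.addFilter(TraceIdFilter())

def handle_order(order_id: str) -> None:
    # Every log line inside this span shares the same trace ID.
    with tracer.start_as_current_span("handle-order"):
        logger.info("received order %s", order_id)
        logger.info("order %s dispatched", order_id)

if __name__ == "__main__":
    handle_order("A-1234")
```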
Security in Kubernetes is a vast topic, and I highly recommend researching it thoroughly to understand all the nuances (e.g., see the NSA and CISA Kubernetes Hardening Guidance). Below are some key points from our experience, but please note that this is by no means a complete picture of the challenges.
Access Control
In short, Kubernetes isn't overly restrictive by default. Therefore, we invested considerable time in tightening access and implementing least-privilege principles for pods and containers. Additionally, due to specific vulnerabilities, it was possible for an unprivileged attacker to escalate their privileges to root, circumventing Linux namespace restrictions, and, in some cases, even escape the container to gain root access on the host node. Not good, to say the least.
You should set a read-only root filesystem, disable service account token auto-mounting, disable privilege escalation, drop all unnecessary capabilities, and more. In our specific setup, we use Azure Policy and Gatekeeper to make sure we don't deploy insecure containers. A minimal example of such a container spec is sketched below.
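Here is a sketch of a locked-down container spec using the official kubernetes Python client; the names and image are placeholders, and the same fields can of course be set directly in YAML or in a Helm chart instead.

```python
# pip install kubernetes
from kubernetes import client

def hardened_pod(name: str = "example-app", image: str = "example:latest") -> client.V1Pod:
    """Build a pod spec with a least-privilege security context."""
    container = client.V1Container(
        name=name,
        image=image,
        security_context=client.V1SecurityContext(
            read_only_root_filesystem=True,     # no writes to the container filesystem
            allow_privilege_escalation=False,   # block setuid-style escalation
            run_as_non_root=True,
            capabilities=client.V1Capabilities(drop=["ALL"]),  # drop every Linux capability
        ),
        resources=client.V1ResourceRequirements(
            requests={"cpu": "100m", "memory": "128Mi"},
            limits={"cpu": "500m", "memory": "256Mi"},
        ),
    )
    spec = client.V1PodSpec(
        containers=[container],
        automount_service_account_token=False,  # no API token unless explicitly needed
    )
    return client.V1Pod(metadata=client.V1ObjectMeta(name=name), spec=spec)
```

Policies such as Azure Policy or Gatekeeper constraints can then reject any workload that doesn't set these fields.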
In our Kubernetes setup within AKS, we leveraged Role-Based Access Control (RBAC) to further strengthen security and access management.
Container Vulnerability
There are many good tools out there that can scan and validate containers and other parts of Kubernetes. We used Azure Defender and Azure Defender for Containers to cover some of our needs.
Note: Instead of getting stuck in "analysis paralysis" trying to find the perfect tool, the one with all the "bells and whistles", just pick something and let the learning begin.
- Deployments
As with many others, we use Helm to manage and streamline the deployment and packaging of our applications on Kubernetes. Since we started using Helm a long time ago and initially had a mix of .NET/Go/Java/Python/PHP, we've rewritten the Helm charts more times than I dare to remember.
- Observability
We started using Loggly together with FluentD for centralized logging, but after a couple of years, we moved over to Elastic and Kibana (the ELK stack). It was easier for us to work with Elastic and Kibana since they are more widely used, and, in our setup, it was also cheaper.
- Container Registries
We started with Quay, which was a good product. But with the migration to Azure, it became natural to use Azure Container Registry instead, since it was integrated and thus a more "native" solution for us. (We also then got our containers under the Azure Security Advisor.)
- Pipelines
From the start, we have been using Drone for building our containers. When we first began, there weren't many CI systems that supported containers and Docker, nor did they offer configuration as code. Drone has served us well over the years. It became a bit messy when Harness acquired it, but after we caved in and moved to the premium version, we had all the features we needed.
Over the past few years, Kubernetes has been a game-changer for us. It has unlocked capabilities that let us scale more efficiently (handling volatile traffic volumes), optimize our infrastructure costs, improve our developer experience, and test new ideas more easily, and thus significantly reduce the time-to-market/time-to-money for new products and services.
We started with Kubernetes a bit too early, before we really had the problems it would solve. But in the long run, and especially in recent years, it has proven to provide great value for us.
Reflecting on eight years of experience, there is an abundance of stories to share, many of them already fading into memory. I hope you enjoyed reading about our setup, the mistakes we made, and the lessons we learned along the way.
Thanks for reading.