(Almost) Every infrastructure decision I endorse or regret after 4 years running infrastructure at a startup · Jack’s home on the web

Image from Unsplash
I’ve led infrastructure at a startup for the past 4 years that has had
to scale quickly. From the beginning I made some core decisions that the
company has had to stick to, for better or worse, these past 4 years. This post
will list some of the major decisions made and whether I endorse them for your
startup, or whether I regret them and advise you to pick something else.
Choosing AWS over Google Cloud
🟩 Endorse
Early on, we were using both GCP and AWS. During that time, I had
no idea who my “account manager” was for Google Cloud, while at the same
time I had regular cadence meetings with our AWS account manager. There’s a feeling
that Google lives on robots and automation, while Amazon lives with a customer focus.
This support has helped us when evaluating new AWS services. Besides support, AWS has done a great job around stability
and minimizing backwards-incompatible API changes.
There was a time when Google Cloud was the choice for Kubernetes clusters, especially
when there was ambiguity around whether AWS would invest in EKS over ECS. Now though, with
all the extra Kubernetes integrations around AWS services (external-dns, external-secrets, etc.),
this isn’t much of an issue anymore.
EKS
🟩 Endorse
Unless you’re penny-pinching (and your time is free), there’s no reason to run your own
control plane rather than use EKS. The main advantage of using an alternative in AWS, like ECS,
is the deep integration into AWS services. Luckily, Kubernetes has caught up in many ways: for example,
using external-dns to integrate with Route53.
EKS managed addons
🟧 Regret
We started with EKS managed addons because I thought it was the “right” way to use EKS. Unfortunately, we always
ran into a situation where we needed to customize the installation itself. Maybe the CPU requests, the image tag,
or some configmap. We’ve since switched to using helm charts for what were add-ons and things are running much better,
with promotions that fit into our existing GitOps pipelines.
RDS
🟩 Endorse
Data is the most critical part of your infrastructure. You lose your network: that’s downtime. You
lose your data: that’s a company-ending event. The markup cost of using RDS (or any managed database)
is worth it.
Redis ElastiCache
🟩 Endorse
Redis has worked very well as a cache and general use product. It’s fast, the API is simple and
well documented, and the implementation is battle tested. Unlike other cache options, like
Memcached, Redis has a lot of features that make it useful for more than just caching. It’s a
great Swiss Army knife of “do fast data thing”.
Part of me is unsure what the state of Redis is for cloud providers, but I feel it’s so widely used by AWS customers
that AWS will continue to support it well.
ECR
🟩 Endorse
We originally hosted on quay.io. It was a hot mess of stability problems. Since moving to ECR,
things have been much more stable. The deeper permission integration with EKS nodes and dev servers has also been a
big plus.
AWS VPN
🟩 Endorse
There are Zero Trust VPN alternatives from companies like Cloudflare. I’m sure these products work
well, but a VPN is just so dead simple to set up and understand (“simplicity is preferable” is my mantra). We use
Okta to manage our VPN access and it’s been a great experience.
AWS premium support
🟧 Regret
It’s super expensive: almost as much as (if not more than) the cost of another engineer. I think if we had very little
in-house knowledge of AWS, it would be worth it.
Control Tower Account Factory for Terraform
🟩 Endorse
Before integrating AFT, using Control Tower was a pain, mostly because it was very difficult to automate. We’ve since
integrated AFT into our stack and spinning up accounts has worked well since. Another thing AFT makes easier is
standardizing tags for our accounts. For example, our production accounts have a tag that we can then use to make
peering decisions. Tags work better than organizations for us because the decision of “what properties describe
this account” isn’t always a tree structure.
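As a rough illustration of why tags can be more flexible than a tree, here is a minimal Python sketch of a tag-driven peering check. The account names, the tag keys (`environment`, `regulated`), and the `may_peer` rule itself are all hypothetical; they are not our actual tags or policy.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    name: str
    tags: dict = field(default_factory=dict)


def may_peer(a: Account, b: Account) -> bool:
    """Hypothetical rule: peer only accounts that share an environment tag,
    and never peer an account tagged as holding regulated data."""
    env_a = a.tags.get("environment")
    same_env = env_a is not None and env_a == b.tags.get("environment")
    regulated = "true" in (a.tags.get("regulated"), b.tags.get("regulated"))
    return same_env and not regulated


api = Account("prod-api", {"environment": "production"})
db = Account("prod-db", {"environment": "production"})
dev = Account("dev-sandbox", {"environment": "development"})
pci = Account("prod-pci", {"environment": "production", "regulated": "true"})
```

The two properties cut across each other: `prod-pci` is "production" and "regulated" at the same time, which is awkward to express if each account can only sit in one branch of a tree.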
Automating post-mortem process with a slack bot
🟩 Endorse
Everyone is busy. It can feel like you’re the “bad guy” reminding people to fill out the post-mortem. Making a robot
be the bad guy has been great. It streamlines the process by nudging people to follow the SEV and post-mortem procedure.
It doesn’t have to be too complex to start. Just the basics of “It’s been an hour of no messages. Somebody post an update” or
“It’s been a day with no calendar invite. Somebody schedule the post-mortem meeting” can go a long way.
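The two basic nudges could be decided by something as small as the sketch below. This is only an illustration: the function name, its parameters, and the idea of passing in timestamps are assumptions, and actually posting to Slack or checking a calendar is left out entirely.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def pending_nudges(
    now: datetime,
    last_message_at: datetime,
    incident_resolved_at: Optional[datetime],
    postmortem_scheduled: bool,
) -> List[str]:
    """Return the reminder messages a bot should post right now (hypothetical helper)."""
    nudges = []
    if incident_resolved_at is None:
        # Incident still open: nudge when the channel has been quiet for an hour.
        if now - last_message_at >= timedelta(hours=1):
            nudges.append("It's been an hour of no messages. Somebody post an update.")
    elif not postmortem_scheduled and now - incident_resolved_at >= timedelta(days=1):
        # Incident resolved over a day ago and nobody has booked the meeting.
        nudges.append(
            "It's been a day with no calendar invite. "
            "Somebody schedule the post-mortem meeting."
        )
    return nudges
```

A scheduled job could call this every few minutes per open incident and post whatever comes back.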
Using PagerDuty’s incident templates
🟩 Endorse
Why reinvent the wheel? PagerDuty publishes a template of what to do during an incident. We’ve customized it a bit,
which is where the flexibility of Notion comes in handy, but it’s been a great starting point.
Regularly reviewing PagerDuty alerts
🟩 Endorse
Alerting for a company goes like this:
- There are no alerts at all. We need alerts.
- We have alerts. There are too many alerts, so we ignore them.
- We’ve prioritized the alerts. Now only the critical ones wake me up.
- We ignore the non-critical alerts.
We have a two-tiered alerting setup: critical and non-critical. Critical alerts wake people up. Non-critical alerts
are expected to ping the on-call asynchronously (email). The problem is that non-critical alerts are often ignored. To resolve
this, we have regular (usually every 2 weeks) PagerDuty review meetings where we go over all our alerts. For the critical
alerts, we discuss whether each should stay critical. Then, we iterate over the non-critical alerts (usually picking a few each meeting)
and discuss what we can do to clear those out as well (usually tweaking the threshold or creating some automation).
Monthly cost tracking meetings
🟩 Endorse
Early on, I set up a monthly meeting to go over all of our SaaS costs (AWS, DataDog, etc.). Previously, this was just
something reviewed from a finance perspective, but it’s hard for them to answer general questions around “does this cost
number seem right”. During these meetings, usually attended by both finance and engineering, we go over every software-related
bill we get and do a gut check of “does this cost sound right”. We dive into the numbers of each of the high bills
and try to break them down.
For example, with AWS we group items by tag and separate them by account. These two dimensions, combined with the general
service name (EC2, RDS, etc.), give us a good idea of where the major cost drivers are. Some things we do with this data
are go deeper into spot instance usage or which accounts contribute to networking costs the most. But don’t stop at
just AWS: go into all the major spend sinks your company has.
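The account/tag/service breakdown can be sketched in a few lines of Python. The record shape here (`account`, `team_tag`, `service`, `cost_usd`) is an assumption for illustration, not the format of any real AWS billing export, and the numbers are made up.

```python
from collections import defaultdict


def summarize_costs(line_items):
    """Total cost per (account, tag, service), largest first.

    `line_items` is assumed to be an iterable of dicts; the field names are
    hypothetical and would need mapping from whatever export you use.
    """
    totals = defaultdict(float)
    for item in line_items:
        key = (item["account"], item.get("team_tag", "untagged"), item["service"])
        totals[key] += item["cost_usd"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Made-up sample data.
items = [
    {"account": "prod", "team_tag": "search", "service": "EC2", "cost_usd": 100.0},
    {"account": "prod", "team_tag": "search", "service": "EC2", "cost_usd": 50.0},
    {"account": "prod", "service": "RDS", "cost_usd": 75.0},
    {"account": "dev", "team_tag": "search", "service": "EC2", "cost_usd": 20.0},
]
report = summarize_costs(items)
```

Sorting by total puts the biggest cost drivers at the top, and the `untagged` bucket shows how much spend nobody has claimed yet.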
Managing post-mortems in DataDog or PagerDuty
🟥 Regret
Everyone should do post-mortems. Both DataDog and PagerDuty have integrations to manage writing post-mortems and we tried
each.
Unfortunately, they both make it hard to customize the post-mortem process. Given how powerful wiki-ish tools
like Notion are, I think it’s better to use a tool like that to manage post-mortems.
Not using Function as a Service (FaaS) more
🟥 Regret
There are no great FaaS options for running GPU workloads, which is why we could never go entirely FaaS. However,
many CPU workloads could be FaaS (Lambda, etc.). The biggest counter-point people bring up is the cost. They’ll
say something like “This EC2 instance type running 24/7 at full load is way cheaper than a Lambda running”.
This is true, but it’s also a false comparison. Nobody runs a service at 100% CPU utilization and
moves on with their life. It’s always on some scaler that says “Never reach 100%. At 70% scale up another”. And it’s
always unclear when to scale back down; instead it’s a heuristic of “If we’ve been at 10% for 10 minutes, scale down”.
Then, people assume spot instances when they aren’t always on the market.
Another hidden benefit of Lambda is that it’s very easy to track costs with high accuracy. When deploying services
in Kubernetes, cost can get hidden behind other per-node objects or other services running on the same node.
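To make the "false comparison" concrete, here is a small sketch of how average utilization changes what you effectively pay for the compute you actually use. The price and vCPU count are invented placeholders, not real AWS prices, and the model ignores memory, networking, and everything else on the bill.

```python
def cost_per_used_vcpu_hour(hourly_price: float, vcpus: int, avg_utilization: float) -> float:
    """Effective price of one fully used vCPU-hour on an always-on instance.

    avg_utilization is the long-run average CPU utilization as a fraction (0-1].
    """
    if not 0 < avg_utilization <= 1:
        raise ValueError("avg_utilization must be in (0, 1]")
    return hourly_price / (vcpus * avg_utilization)


# Placeholder numbers: a 4 vCPU instance at 1.00 per hour.
at_full_load = cost_per_used_vcpu_hour(1.00, 4, 1.0)   # the comparison people quote
at_half_load = cost_per_used_vcpu_hour(1.00, 4, 0.5)   # closer to life behind an autoscaler
```

With these made-up inputs the effective price doubles at 50% average utilization, which is the number to hold up against a pay-per-use option.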
GitOps
🟩 Endorse
GitOps has so far scaled pretty well and we use it for many parts of our infrastructure: services,
terraform, and config to name a few. The main downside is that pipeline-oriented workflows give
a clear picture of “here is the box that means you did a commit and here are arrows that go from
that box to the end of the pipeline”. With GitOps we’ve had to invest in tooling to help people answer
questions like “I did a commit: why isn’t it deployed yet”.
Even still, the flexibility of GitOps has been a huge win and I strongly recommend it for your company.
Prioritizing team efficiency over external demands
🟩 Endorse
Most likely, your company is not selling the infrastructure itself, but another product. This puts pressure on the
team to deliver features and not scale your own workload. But just like airplanes tell you to put your own mask on
first, you need to make sure your team is efficient. With rare exception, I’ve never regretted prioritizing
taking time to write some automation or documentation.
Multiple applications sharing a database
🟥 Regret
Like most tech debt, we didn’t make this decision, we just didn’t not make this decision. Eventually, someone
wants the product to do a new thing and makes a new table. This feels good because there are now foreign keys between
the two tables. But since everything is owned by someone and that someone is a row in a table, you’ve got
foreign keys between all objects in the entire stack.
Since the database is used by everyone, it becomes cared for by no one. Startups don’t have the luxury of a DBA,
and everything owned by no one is owned by infrastructure eventually.
The biggest problems with a shared database are:
- Crud accumulates in the database, and it’s unclear if it can be deleted.
- When there are performance issues, infrastructure (without deep product knowledge) has to debug the database and figure out who to redirect to.
- Database users can push bad code that does bad things to the database. These bad things may trigger a PagerDuty alert to the
infrastructure team (since they own the database). It feels bad to wake up one team for another team’s issue. With application-owned databases,
the application team is the first responder.
All that said, I’m not against stacks that want to share a single database either. Just be aware of the tradeoffs above
and have a good story for how you’ll manage them.
Not adopting an identity platform early on
🟥 Regret
I stuck with Google Workspace at the beginning, using it to create groups for employees as a way to assign permissions. It just isn’t flexible enough.
In retrospect, I wish we had picked up Okta much sooner. It’s worked very well, has integrations for almost everything,
and solves a lot of compliance/security issues. Just lean into an identity solution early on and only accept SaaS
vendors that integrate with it.
Notion
🟩 Endorse
Every company needs a place to put documentation. Notion has been a great choice and worked much more easily than things
I’ve used in the past (Wikis, Google Docs, Confluence, etc.). Their Database concept for page organization has also
allowed me to create pretty sophisticated organizations of pages.
Slack
🟩 Endorse
Thank god I don’t have to use HipChat anymore. Slack is great as a default communication tool, but to reduce stress
and noise I recommend:
- Using threads to condense communication
- Communicating expectations that people may not respond quickly to messages
- Discouraging private messages and encouraging public channels.
Moving off JIRA onto Linear
🟩 Endorse
Not even close. JIRA is so bloated that I’m worried running it in an AI company would just turn it fully sentient. When
I’m using Linear, I’ll often think “I wonder if I can do X” and then I’ll try and I can!
Not using Terraform Cloud
🟩 No Regrets
Early on, I tried to migrate our terraform to Terraform Cloud. The biggest downside was that I couldn’t justify the
cost. I’ve since moved us to Atlantis, and it has worked well enough. Where Atlantis
falls short, we’ve written a bit of automation in our CI/CD pipelines to make up for it.
GitHub actions for CI/CD
🟧 Endorse-ish
We, like most companies, host our code on GitHub. While originally using CircleCI, we’ve switched
to GitHub Actions for CI/CD. The marketplace of actions available to use for your workflows is
large and the syntax is easy to read. The main downside of GitHub Actions is that their support
for self-hosted workflows is very limited. We’re using actions-runner-controller for our self-hosted runners
hosted in EKS, but the integration is often buggy (though nothing we can’t work around).
I hope GitHub takes Kubernetes self-hosting more seriously in the future.
Datadog
🟥 Regret
Datadog is a great product, but it’s expensive. More than just expensive, I’m worried
their cost model is especially bad for Kubernetes clusters and for AI companies. Kubernetes
clusters are most cost-effective when you can rapidly spin up and down many nodes, as well
as use spot instances. Datadog’s pricing model is based on the number of instances you
have and that means even if we have no more than 10 instances up at once, if we spin up
and down 20 instances in that hour, we pay for 20 instances. Similarly, AI companies
tend to use GPUs heavily. While a CPU node could have dozens of services running at once,
spreading the per-node Datadog cost between many use cases, a GPU node is likely to have
only one service using it, making the per-service Datadog cost much higher.
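The gap between "instances up at once" and "instances seen in the hour" is easy to show with a toy calculation. This sketch only models the counting described above; it is not Datadog’s actual billing algorithm, and the host names and lifetimes are made up.

```python
def peak_concurrent(intervals):
    """Largest number of hosts alive at the same time.

    Each interval is (host_name, start_minute, end_minute) within one hour.
    """
    events = []
    for _host, start, end in intervals:
        events.append((start, 1))
        events.append((end, -1))
    # At the same minute, process shutdowns (-1) before startups (+1) so a
    # node replacing another is not counted as overlapping with it.
    events.sort(key=lambda e: (e[0], e[1]))
    peak = current = 0
    for _minute, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


def distinct_hosts(intervals):
    """Number of different hosts that existed at any point in the hour."""
    return len({host for host, _start, _end in intervals})


# 10 nodes for the first half hour, then 10 replacement nodes for the second half.
hour = [(f"a{i}", 0, 30) for i in range(10)] + [(f"b{i}", 30, 60) for i in range(10)]
# peak_concurrent(hour) -> 10, distinct_hosts(hour) -> 20
```

The cluster never ran more than 10 nodes, but a per-host count over the hour sees 20.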
PagerDuty
🟩 Endorse
PagerDuty is a great product and well priced. We’ve never regretted picking it.
Schema migration by Diff
🟧 Endorse-ish
Schema management is hard no matter how you do it, mostly because of how scary it is. Data is important and a bad
schema migration can delete data. Of all the scary ways to solve this hard problem, I’ve been very happy with the idea
of checking the entire schema into git and then using a tool to generate
the SQL to sync the database to the schema.
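To show the general idea of diff-based migration (not the specific tool we use), here is a toy Python sketch. It assumes schemas are represented as plain dicts of `table -> {column: type}`, which is a simplification invented for this example; real tools parse actual DDL and handle type changes, indexes, constraints, and ordering.

```python
def diff_schema(current, desired):
    """Generate SQL that moves `current` toward `desired`.

    Destructive changes are emitted as SQL comments for a human to review,
    since dropping a table or column is where data gets lost.
    """
    statements = []
    for table, columns in desired.items():
        if table not in current:
            cols = ", ".join(f"{name} {ctype}" for name, ctype in columns.items())
            statements.append(f"CREATE TABLE {table} ({cols});")
            continue
        for name, ctype in columns.items():
            if name not in current[table]:
                statements.append(f"ALTER TABLE {table} ADD COLUMN {name} {ctype};")
    for table, columns in current.items():
        if table not in desired:
            statements.append(f"-- REVIEW: DROP TABLE {table};")
            continue
        for name in columns:
            if name not in desired[table]:
                statements.append(f"-- REVIEW: ALTER TABLE {table} DROP COLUMN {name};")
    return statements


live = {"users": {"id": "bigint", "legacy": "text"}}
in_git = {"users": {"id": "bigint", "email": "text"}, "teams": {"id": "bigint"}}
plan = diff_schema(live, in_git)
```

The schema in git is the source of truth, and the generated plan is what gets reviewed before anything touches the database.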
Ubuntu for dev servers
🟩 Endorse
Originally I tried making the dev servers the same base OS that our Kubernetes nodes ran on, thinking this would make
the development environment closer to prod. In retrospect, the effort isn’t worth it. I’m happy we’re sticking
with Ubuntu for development servers. It’s a well-supported OS and has most of the packages we need.
AppSmith
🟩 Endorse
We frequently need to automate some process for an internal engineer: restart/promote/diagnose/etc. It’s easy enough
for us to make APIs to solve these problems, but it’s a bit annoying debugging someone’s specific installation of a
CLI/OS/dependencies/etc. Being able to make a simple UI for engineers to interact with our scripts is very useful.
We self-host our AppSmith. It works … well enough. Of course there are things we would change, but it’s enough
for the “free” price point. I originally explored deeper integration with Retool, but I couldn’t justify the price
point for what, at the time, was just a few integrations.
helm
🟩 Endorse
Helm v2 got a bad reputation (for good reason), but helm v3 has worked … well enough. There are still issues
with deploying CRDs and educating developers on why their helm chart didn’t deploy correctly. Overall, however,
helm works well enough as a way to package and deploy versioned Kubernetes objects, and the Go templating language
is difficult to debug, but powerful.
helm charts in ECR(oci)
🟩 Endorse
Originally our helm charts were hosted inside S3 and downloaded with a plugin. The main downsides were needing to install
a custom helm plugin and manually managing lifecycles. We’ve since switched to OCI-stored helm charts and haven’t
had any issues with this setup.
bazel
🟧 Unsure
To be fair, a lot of smart people like bazel, so I’m sure it’s not a bad choice to make.
When deploying Go services, bazel personally feels like overkill. I think Bazel is a great choice if your last company
used bazel, and you feel homesick. Otherwise, we have a build system that only a few engineers can dive deeply into,
compared to GitHub Actions, where it seems everyone knows how to get their hands dirty.
Not using open telemetry early
🟥 Regret
We started off sending metrics directly to DataDog using DataDog’s API. This has made it very hard to rip them out.
Open telemetry wasn’t as mature 4 years ago, but it’s gotten much better. I think the metrics telemetry is still
a bit immature, but the tracing is great. I recommend using it from the start for any company.
Choosing renovatebot over dependabot
🟩 Endorse
I honestly wish we had thought about “keep your dependencies up to date” sooner. When you wait on this too long,
you end up with versions so old that the upgrade process is long and inevitably buggy. Renovatebot has worked well, with
the flexibility to customize it to your needs. The biggest, and it’s pretty big, downside is that it’s VERY complicated
to set up and debug. I guess it’s the best of all the bad options.
Kubernetes
🟩 Endorse
You need something to host your long-running services. Kubernetes is a popular choice and it’s worked well for us.
The Kubernetes community has done a great job integrating AWS services (like load balancers, DNS, etc.) into the
Kubernetes ecosystem. The biggest downside with any flexible system is that there are a lot of ways to use it, and
any system with a lot of ways to use it has a lot of ways to use it wrong.
Buying our own IPs
🟩 Endorse
If you work with external partners, you’ll frequently need to publish a whitelist of your IPs for them. Unfortunately,
you may later develop more systems that need their own IPs. Buying your own IP block is a great way to avoid this by
giving the external partner a larger CIDR block to whitelist.
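A quick way to see the benefit: if the partner allowlists one CIDR block, any address you later assign from inside it is already covered. The sketch below uses Python’s standard `ipaddress` module; the addresses come from the reserved documentation ranges and the helper name is made up for illustration.

```python
import ipaddress


def covered_by_allowlist(ip: str, allowlist_cidrs) -> bool:
    """True if `ip` falls inside any CIDR block the partner has allowlisted."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowlist_cidrs)


# The partner allowlists the whole block once (documentation range, not a real block).
partner_allowlist = ["203.0.113.0/24"]

new_system_ip = "203.0.113.42"   # a system added later, inside the owned block
unrelated_ip = "198.51.100.7"    # an address outside the block
```

A new system that gets `203.0.113.42` works with the partner on day one, while an address outside the block would need another round of emails.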
Choosing Flux for k8s GitOps
🟩 No Regrets
An early GitOps choice for Kubernetes was to decide between ArgoCD and Flux: I went with Flux (v1
at the time). It’s worked very well. We’re currently using Flux 2. The one downside is we’ve had to make our
own tooling to help people understand the state of their deployments.
I hear great things about ArgoCD, so I’m sure if you picked that you’re also safe.
Karpenter for node management
🟩 Endorse
If you’re using EKS (and not entirely on Fargate), you should be using Karpenter. 100% full stop. We’ve used other autoscalers,
including the default Kubernetes autoscaler and SpotInst. Between all of them, Karpenter has
been the most reliable and the most cost-effective.
Using SealedSecrets to manage k8s secrets
🟥 Regret
My original thought was to push secret management to something GitOps-styled. The two main drawbacks of using
sealed-secrets were:
- It was more complicated for less infra-knowledgeable developers to create/update secrets
- We lost all the existing automation that AWS has around rotating secrets (for example)
Using ExternalSecrets to manage k8s secrets
🟩 Endorse
ExternalSecrets has worked very well to sync AWS -> Kubernetes secrets. The process is simple for developers to
understand and lets us take advantage of terraform as a way to easily create/update the secrets inside AWS, as well
as give users a UI to use to create/update the secrets.
Using ExternalDNS to manage DNS
🟩 Endorse
ExternalDNS is a great product. It syncs our Kubernetes -> Route53 DNS entries and has given us very few
problems in the past 4 years.
Using cert-manager to manage SSL certificates
🟩 Endorse
Very intuitive to configure and has worked well with no issues. Highly recommend using it to create your Let’s Encrypt
certificates for Kubernetes. The only downside is we sometimes have customers with ANCIENT tech stacks (SaaS problems, am I right?)
that don’t trust Let’s Encrypt, and you need to go get a paid cert for those.
Bottlerocket for EKS
🟥 Regret
Our EKS cluster used to run on Bottlerocket. The main downside was that we frequently ran into networking CSI issues
and debugging the Bottlerocket images was much harder than debugging the standard EKS AMIs. Using the
EKS-optimized AMIs for our nodes has given us no problems, and we still have a backdoor to debug the node itself when there are
strange networking issues.
Choosing Terraform over Cloudformation
🟩 Endorse
Using Infrastructure as Code is a must for any company. Being in AWS, the two
main choices are CloudFormation and Terraform. I’ve used both and don’t
regret sticking with Terraform. It’s been easy to extend for other SaaS providers
(like PagerDuty), the syntax is easier to read than CloudFormation, and it hasn’t been
a blocker or slowdown for us.
Not using more code-ish IaC solutions (Pulumi, CDK, etc.)
🟩 No Regrets
While Terraform and CloudFormation are data files (HCL and YAML/JSON) that describe
your infrastructure, solutions like Pulumi or CDK allow you to write code that
does the same. Code is of course powerful, but I’ve found the restrictive nature of
Terraform’s HCL to be a benefit with reduced complexity. It’s not that it’s
impossible to write complex Terraform: it’s just that it’s more obvious when it’s
happening.
Some of these solutions, like Pulumi, were invented many years ago while
Terraform lacked a lot of the features it has today. Newer versions of Terraform
have integrated a lot of the features that we can use to reduce complexity. We instead
use a middle ground that generates basic skeletons of our Terraform code for parts we want to abstract away.
Not using a network mesh (istio/linkerd/etc.)
🟩 No Regrets
Network meshes are really cool and a lot of smart people tend to endorse them, so I’m
convinced they are fine ideas. Unfortunately, I think companies underestimate the
complexity of things. My general infrastructure advice is “less is better”.
Nginx load balancer for EKS ingress
🟩 No Regrets
Nginx is old, it’s stable, and it’s battle tested.
Homebrew for company scripts
🟩 Endorse
Your company will likely need a way to distribute scripts and binaries for your engineers to use. Homebrew has worked
well enough for both Linux and Mac users as a way to distribute scripts and binaries.
Go for services
🟩 Endorse
Go has been easy for new engineers to pick up and is a great choice overall. For non-GPU services that are mostly network
IO bound, Go should be your default language choice.