fly.io replaces Nomad with NIH

The new-tech, self-hosted cloud IaaS provider fly.io released a blog post last month detailing the escapade.

I came upon this post while looking through the landscape of [I|P]aaS providers. I'm simple: when I see "replacing Nomad", I click. I wasn't disappointed by the quality or by the jab at HashiCorp's OSS-incompatible pricing.

Interestingly, the post starts with something I think would be amazing interview fodder, and great content for deeply learning a programming language in a distributed env or for a YouTube series. A good half of the post walks us through a well-done intro to orchestration and scheduling, including architecture, theory and (a little) code. Love it.

The first part of the second half, now, is a play-by-play shit smearfest on k8s and, transitively, Nomad, through the family tree to their shared ancestor: Borg. Much is laid out about why the Borg-alike schedulers are fundamentally broken in our glorious synchronous FaaS future.

Ostensibly this was all a setup to justify their decision to NIH both of them and Consul, as described in the final quarter, which effectively points up-post and describes how they actually implemented their mid-level dev whiteboard problem.

I ultimately landed on this post by way of a recent DevClass article about how fly is struggling to fix and scale their platform, partly at the hands of hordes of devs looking for a one-click PaaS to run their stricken free-tier Heroku apps. In retrospect (thinking back to the fly.io post), I was less surprised that they were having the problems the article describes.

"We’re in an awkward phase where the company isn’t quite mature enough to support the infrastructure we need to deliver a good developer UX"

No time for that when the engineers are half-solving the interesting orchestrator, scheduler, distributed discovery, network mesh, secret storage, cloud DB, diskdev, and executor problems from whole cloth.

"gossip-based consistency is a difficult problem"

"What do you do when your global service catalog becomes corrupted because of a bug?"

"We’ve pushed the platform past what it was originally built to do"

I look back at my previous assertion that implementing a stack like this from scratch is a very tough value prop to sell. It's a solved Lego problem, and building the fly.io UX would arguably have been a better place to spend limited startup resources, without also being harangued by an inoperative platform.

The Kubernetes scheduler, for example, can be customized with webhooks (scheduler extenders), making it an ideal playground for unorthodox scheduling strategies. With RuntimeClass, custom executors can be tried out quickly, since the runtime is selected at the pod level.
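
To give a feel for how small that playground can be, here is a minimal sketch of the filter step such a webhook might perform. It is not anything fly.io or Kubernetes ships. The annotation key, the region-prefixed node names, and the exact payload field names (Pod, NodeNames, FailedNodes, Error) are my assumptions, loosely modeled on the scheduler extender filter request and response, so check them against your cluster version before trusting any of it.

```python
from typing import Any

# Hypothetical annotation key, made up for this example; not a Kubernetes built-in.
REGION_ANNOTATION = "example.com/preferred-region"


def filter_nodes(extender_args: dict[str, Any]) -> dict[str, Any]:
    """Keep only candidate nodes whose name starts with the pod's requested region.

    Assumes node names look like "<region>-<anything>" (e.g. "ord-worker-1")
    and that the scheduler passes node names rather than full node objects.
    """
    pod = extender_args.get("Pod") or {}
    candidates = extender_args.get("NodeNames") or []
    annotations = (pod.get("metadata") or {}).get("annotations") or {}
    region = annotations.get(REGION_ANNOTATION)

    if region is None:
        # No preference expressed: pass every candidate through untouched.
        return {"NodeNames": list(candidates), "FailedNodes": {}, "Error": ""}

    kept: list[str] = []
    failed: dict[str, str] = {}
    for name in candidates:
        if name.startswith(region + "-"):
            kept.append(name)
        else:
            # The reason string is reported back for each rejected node.
            failed[name] = f"node is not in region {region}"
    return {"NodeNames": kept, "FailedNodes": failed, "Error": ""}


example_args = {
    "Pod": {"metadata": {"annotations": {REGION_ANNOTATION: "ord"}}},
    "NodeNames": ["ord-worker-1", "ams-worker-1"],
}
example_result = filter_nodes(example_args)
# example_result["NodeNames"] is ["ord-worker-1"]; "ams-worker-1" lands in FailedNodes.
```

Wrapping this in an HTTP handler and pointing the scheduler configuration's extender entry at it is left out on purpose. The point is that the strategy itself is a few lines you can iterate on without forking a scheduler.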

Now obviously k8s can't be the solution to everything.. Yes

Josh Perry
