In a recent post on their blog, Fly.io talks about how they finally hit a scale where they couldn't fit every user's WireGuard peer in their regional gateway nodes' kernels anymore, and then, in one fell paragraph centered on one deadly sentence, NATS finds itself HTMXized, ostensibly to the engineer's great relief.

"Our NATS cluster was losing too many messages to host a reliable API 🔫 ... Scaling back our use of NATS made WireGuard gateways better, but still not great."

Now, I've never been good at O notating things, but the orders of magnitude here don't line up in my mind. I've personally managed NATS clusters successfully delivering RPC on the order of millions of messages/s.

"Seriously, though: you could store every WireGuard peer everybody has ever used at Fly.io in a single SQLite database, easily. What you can’t do is store them all in the Linux kernel."

To deal with delivery guarantees, we added a TCP-like retry layer in our shared RPC client code (ack on message receipt) and retries at the API gateway; for us, replacing Ruby with Go for hot service paths was a much better win for reliability than replacing NATS would have been.
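For concreteness, here's a minimal sketch of the shape of that retry layer (not our actual client code): the responder's reply doubles as the ack, and the caller retransmits with backoff until it gets one or gives up. The attempt count and timeouts are made up.

```go
package rpc

import (
	"context"
	"errors"
	"time"

	"github.com/nats-io/nats.go"
)

// rpcRequest treats the responder's reply as the ack; if no reply arrives
// before the per-attempt timeout, it retransmits with exponential backoff,
// roughly the way TCP handles an unacknowledged segment.
func rpcRequest(ctx context.Context, nc *nats.Conn, subject string, payload []byte) (*nats.Msg, error) {
	backoff := 50 * time.Millisecond
	for attempt := 0; attempt < 5; attempt++ {
		reqCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
		reply, err := nc.RequestWithContext(reqCtx, subject, payload)
		cancel()
		if err == nil {
			return reply, nil // receipt acked by the responder's reply
		}
		if ctx.Err() != nil {
			return nil, ctx.Err() // caller gave up entirely
		}
		if !errors.Is(err, nats.ErrTimeout) &&
			!errors.Is(err, context.DeadlineExceeded) &&
			!errors.Is(err, nats.ErrNoResponders) {
			return nil, err // non-transient failure; don't retry
		}
		time.Sleep(backoff)
		backoff *= 2
	}
	return nil, errors.New("rpc: no ack after retries")
}
```

The point is that the reliability lives in the client, the same place TCP puts it, rather than in the broker.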

Now, I don't necessarily disagree with their ultimately embracing HTTP for RPC transport. If I were building an API layer like theirs, I would probably also not use NATS for RPC again. I had other issues with it, but none of those involved things that HTTP has necessarily solved either.

The rest of the article goes on to detail how they moved to using netlink for direct management of WireGuard peer configs, and BPF plus the server's private key to intercept and crack open the client's hello (handshake initiation) packet to extract its public key and 4-tuple.
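For the netlink half, here's a minimal sketch of what direct peer management looks like from Go, assuming the wgctrl package (golang.zx2c4.com/wireguard/wgctrl); this is not Fly's code, and the device name, key, and allowed IP below are placeholders:

```go
package main

import (
	"log"
	"net"

	"golang.zx2c4.com/wireguard/wgctrl"
	"golang.zx2c4.com/wireguard/wgctrl/wgtypes"
)

func main() {
	// wgctrl speaks WireGuard's netlink interface for us.
	client, err := wgctrl.New()
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// In reality this would be the client's public key pulled from the
	// gateway's store; generated here only so the sketch runs.
	priv, err := wgtypes.GeneratePrivateKey()
	if err != nil {
		log.Fatal(err)
	}
	peerKey := priv.PublicKey()

	_, allowedNet, err := net.ParseCIDR("10.64.0.2/32") // illustrative address
	if err != nil {
		log.Fatal(err)
	}

	// Install (or update) the peer on the in-kernel device "wg0",
	// no userspace WireGuard implementation involved.
	err = client.ConfigureDevice("wg0", wgtypes.Config{
		Peers: []wgtypes.PeerConfig{{
			PublicKey:  peerKey,
			AllowedIPs: []net.IPNet{*allowedNet},
		}},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```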

Even without NATS, and with a local SQLite store, the gateway still can't respond fast enough on a peer miss; they go on to detail an interesting mitigation for the unreliable transport they'd created: now that they have the client's identity and 4-tuple, instead of installing the peer and waiting for a retry, they install it as an initiator and take things into their own hands.

The fact that they were able to deliver this level of experience, which other WireGuard topologies handle by dropping to userspace implementations, is no small feat. Using the client's hello packet as something of a STUN probe is just pretty dang ingenious (and open source, no less) and something I will definitely be using.

NATS

Now, NATS was the topic of this post, and I'm mostly writing it because I feel like there's probably pain beneath Fly's decision to pull back on their deployment, and where there's pain there are lessons to be learned that shouldn't be dismissed in a sentence.

However, I'm also somewhat personally interested in the decision, as I'm playing with NATS JetStream as a mutation store to play the part of etcd for ephemeral k8s clusters, and the stack's got some rough spots. The kine->JetStream proxy for a single-master controlplane with no workers consumes 250 millicores.

In real $, the best per-core price you can scrape by with, using spot instances, is around $5/core/month by the time you get the memory to run with it. So just the kine code for an idling controlplane is costing $1.25/month in compute; the three-node NATS cluster behind it was using 150 millicores, total. As an aside, I have to say that running NATS on spot instances has been a dream.

Now, I don't think this issue can be laid at JetStream's feet; I suspect there is performance to be had in the kine code (it remarshals and compresses each mutation), and even replacing it with a low-level systems-language implementation could be a solution. Another area that could show promise is looking at the transactions that are actually flowing and seeing whether there are any over-chatty clients that could be fixed.

I need idle controlplanes to be basically free, only doing essential work when it needs doing. This isn't really an operating mode most controlplane providers spend much time thinking about, as they aren't really in the controlplane business; they're in the worker business.

It remains to be seen whether NATS JetStream can play a part in a topology like this at scale, but the power of being able to horizontally scale a multitenant, Kafkaesque system that can do things like generate point-in-time snapshots of a cluster's state and replicate them to other failure domains is too tempting not to build on an event store, and I've used other event stores.
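To make the replication piece concrete, a hedged sketch using nats.go: a JetStream stream can be declared as a mirror of another, so a per-cluster mutation log could be copied into a second failure domain with a handful of lines. The stream names, replica count, and placement tag are all assumptions of mine, not anything from Fly's post or my own production setup.

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// A read-only mirror of the (hypothetical) CLUSTER_MUTATIONS stream,
	// placed on servers tagged for a different failure domain. The mirror
	// keeps its own replicated copy of every message, which is what makes
	// point-in-time snapshots and cross-domain recovery possible.
	_, err = js.AddStream(&nats.StreamConfig{
		Name:      "CLUSTER_MUTATIONS_DR",
		Replicas:  3,
		Placement: &nats.Placement{Tags: []string{"region:backup"}},
		Mirror:    &nats.StreamSource{Name: "CLUSTER_MUTATIONS"},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```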

The way the JetStream KV store composes watches with NATS subjects is particularly interesting for controlplane management. The ability to set up targeted watches on the state changes of all clusters at this layer could scale much better than opening kube-apiserver connections to the tenant clusters themselves.
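A short sketch of what I mean with nats.go; the bucket name and key scheme (clusters.<id>.nodes.<name>) are purely illustrative: because KV keys are NATS subjects underneath, one wildcard watch can follow a precise slice of state across every tenant cluster.

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Hypothetical bucket holding controlplane state for all tenant clusters.
	kv, err := js.KeyValue("controlplanes")
	if err != nil {
		log.Fatal(err)
	}

	// KV keys are subjects under the hood, so one wildcard watch covers
	// the node records of every cluster: clusters.<id>.nodes.<name>.
	watcher, err := kv.Watch("clusters.*.nodes.*")
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Stop()

	for entry := range watcher.Updates() {
		if entry == nil {
			continue // nil marks the end of the initial replay
		}
		log.Printf("%s -> rev %d (%v)", entry.Key(), entry.Revision(), entry.Operation())
	}
}
```

The same API also exposes per-key history (kv.History), which is where the audit-timeline and time-machine ideas below would hang off.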

The store's ability to track changes over time also gives a powerful source of automation and UX data: intuitive audit timelines, ML defenses against bad actors/state, time-machine-style controlplane debugging/testing/fuzzing/research, etc.