I've spent the last year working at my current gig with a talented team to make a huge move from onprem kvm virtual machines and puppet, to containers and kubernetes deployed to an onprem cloud. Before that I spent the better part of a decade building a containers-to-the-edge IoT startup on arm and kubernetes. It's been my purpose for a long time now.
We've deployed interesting things like custom operators to sync istio services with consul, filesystem providers based on e2e encrypted ceph with s3 and snapshots, network plugins for transparent IP connectivity to VMs and metal, mesh networks to remove encryption and routing concerns from apps, and we deployed buttloads of proxies to bridge the old with the new.
I've learned a lot while being addicted. I'd like to share some of the ups and downs, pros and cons, alongside some tips and tricks we've learned up to this point.
TLDR: Yes it's worth the work.
There is some peripheral machinery that needs to be managed in-house in order to create an onprem cloud with a similar surface area to the big providers, while being efficient in cost of time and resources. However, the difficulty seems to be inversely proportional to the height of the house of cards.
One of our goals in this endeavor was to be cloud ready: able to run our workloads on clusters at the cloud providers for scalability and isolation. With GCP DCs in the same cities as ours, we've been seriously eyeing the low-latency connectivity and autoscaling clusters.
This axis of the requirements has been easy to pin down with the well-adopted abstractions and automatability provided by kubernetes and the projects of the extraordinary community around it.
Education, though, is the first thing I think needs addressing.
This has been the most difficult aspect of it for me; I need to become a better teacher. I've failed a lot here, and learning from the failures has been difficult. I wouldn't even say the project is wholly a success at this point, or that it won't yet fail under its own weight. I can just say that it was more than fit for the purpose we built it for.
Access to wide and deep training on all the pieces is one of the most exciting parts of the kubernetes ecosystem; there is abundant information not only in formal trainings, but in huge numbers of blogs and videos.
While automation can make clusters seem magical in the happy path, when shit goes wrong, why is someone trying to tcpdump a pod? Handling the exceptional aspects, as is so often the case, is a long-tailed dragon; having well-trained people is the only way to wrangle it.
In the ops back office, one of the most painful aspects is that complex systems are complex: when things break, you need people who understand the layers beneath the yaml. Even if there were service contracts, there's often little time to ring up a vendor to troubleshoot your env in realtime.
On the frontend, we can't expect a developer to write calico network policy and istio virtual services, or wire up their canary deploys to an automatic analysis stage and tie their saturation metric up to a horizontal pod autoscaler.
To do SDLC in the vein of CICD at scale, having full-stack aware team members is a really big ask, even as shifting everything from ops to security leftward is en vogue. I don't know the best method for scaling the mindshare of the stack to different sets of C-level, planning, platform, operator, QA, and dev kinds of people, but it's a knowledge space that's ripe for my experience-by-failure M.O.
There are problems here that can definitely be solved with more abstractions, but the abstractions always become Landru to our Beta III.
There are times I wonder if the premium cost of the cloud promise to scale the hands-on-keyboard to server ratio is worth it. Is leaving the layer beneath the yaml in the hands of the providers as big a win as they charge for? This same thought crosses the mind of many a tech executive looking to do more with less.
I am a strong proponent of onprem operations at scale, but whether the workloads at my current gig make a swift shift to the cloud remains to be seen. Unfortunately, our team has as yet done little in the way of moving our workloads into a cloud provider region. I look forward to seeing what the future holds here.
You hear an oft-cited refrain echoing down the halls of the colo datacenters about how the cloud is just someone else's computers. Can we take control of our computers back and bring on a new dawn of the internet PC? I think maybe we already have.
What do you do if you're addicted to your higher power?