<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Curiously Nerdy]]></title><description><![CDATA[the original]]></description><link>https://curiouslynerdy.com/</link><image><url>https://curiouslynerdy.com/favicon.png</url><title>Curiously Nerdy</title><link>https://curiouslynerdy.com/</link></image><generator>Ghost 5.75</generator><lastBuildDate>Sat, 21 Mar 2026 19:05:59 GMT</lastBuildDate><atom:link href="https://curiouslynerdy.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[TIL Coding with Agents: Never block the main context]]></title><description><![CDATA[<p>Never block, never stream.</p><p>Always filter, always search.</p><p>Always subsist on the substance of a subinstance summary.</p>]]></description><link>https://curiouslynerdy.com/til-coding-with-agents-there-is-no-context/</link><guid isPermaLink="false">69a37faa0733610001d30fe6</guid><dc:creator><![CDATA[Josh Perry]]></dc:creator><pubDate>Sat, 28 Feb 2026 23:57:07 GMT</pubDate><content:encoded><![CDATA[<p>Never block, never stream.</p><p>Always filter, always search.</p><p>Always subsist on the substance of a subinstance summary.</p>]]></content:encoded></item><item><title><![CDATA[TIL coding with agents: Patch, the language]]></title><description><![CDATA[<p>Claude just doesn&apos;t even bother regenerating a patch when it needs one, it just codes in <em>patch</em>&#x2013;as if it was purposely built as a language for mutating text by humans, and not as a file delta interchange format.</p><p>It doesn&apos;t care, so much of</p>]]></description><link>https://curiouslynerdy.com/til-coding-with-agents/</link><guid isPermaLink="false">69a37d4a0733610001d30fc8</guid><dc:creator><![CDATA[Josh Perry]]></dc:creator><pubDate>Sat, 28 Feb 2026 23:48:47 
GMT</pubDate><content:encoded><![CDATA[<p>Claude just doesn&apos;t even bother regenerating a patch when it needs one, it just codes in <em>patch</em>&#x2013;as if it were purposely built as a language for mutating text by humans, and not as a file delta interchange format.</p><p>It doesn&apos;t care, so much of programming is actually just&#x2013;translation.</p>]]></content:encoded></item><item><title><![CDATA[GitHub+GCP Identity Federation with Config Connector]]></title><description><![CDATA[<p>I recently needed to access GCP resources from a github actions workflow in order to run an end-to-end test of <a href="https://github.com/drzzlio/kms-issuer">gcp-kms-issuer</a>. The old way of doing this would be to create a GCP service account and then export a long-lived authentication keyfile that you&apos;d then stuff into</p>]]></description><link>https://curiouslynerdy.com/gitub-gcp-identity-federation-with-config-connector/</link><guid isPermaLink="false">6615fae1568faa0001e00a80</guid><category><![CDATA[devops]]></category><category><![CDATA[kubernetes]]></category><category><![CDATA[gitops]]></category><category><![CDATA[gcp]]></category><dc:creator><![CDATA[Josh Perry]]></dc:creator><pubDate>Wed, 10 Apr 2024 03:16:29 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1618262678184-774a53ba681a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDk2fHx0cnVzdHxlbnwwfHx8fDE3MTI3MTg5Njl8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1618262678184-774a53ba681a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDk2fHx0cnVzdHxlbnwwfHx8fDE3MTI3MTg5Njl8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="GitHub+GCP Identity Federation with Config Connector"><p>I recently needed to access GCP resources from a github actions workflow in order to run an end-to-end test of <a 
href="https://github.com/drzzlio/kms-issuer">gcp-kms-issuer</a>. The old way of doing this would be to create a GCP service account and then export a long-lived authentication keyfile that you&apos;d then stuff into an actions environment secret.</p><p>Today, however, the promise of OIDC federation is upon us, and when it works it is truly magical and much more secure. Both the google cloud blog and the github documentation sites already have excellent articles on how to set up this federation using clickops and cliops, but I&apos;m using <a href="https://cloud.google.com/config-connector/docs/overview">kubernetes config connector</a> and gitops, so I&apos;m going to show you how I translated these instructions.</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://cloud.google.com/blog/products/identity-security/enabling-keyless-authentication-from-github-actions"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Enabling keyless authentication from GitHub Actions | Google Cloud Blog</div><div class="kg-bookmark-description">Authenticate from GitHub Actions to create and manage Google Cloud resources using Workload Identity Federation.</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://ssl.gstatic.com/cloud/images/icons/favicon.ico" alt="GitHub+GCP Identity Federation with Config Connector"><span class="kg-bookmark-author">Google Cloud</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_GitHub_Actions.max-2600x2600.jpg" alt="GitHub+GCP Identity Federation with Config Connector"></div></a></figure><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://docs.github.com/en/actions/deployment/security-hardening-your-deployments/configuring-openid-connect-in-google-cloud-platform"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Configuring OpenID Connect in Google 
Cloud Platform - GitHub Docs</div><div class="kg-bookmark-description">Use OpenID Connect within your workflows to authenticate with Google Cloud Platform.</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://docs.github.com/assets/cb-345/images/site/favicon.png" alt="GitHub+GCP Identity Federation with Config Connector"><span class="kg-bookmark-author">GitHub Docs</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://github.githubassets.com/images/modules/open_graph/github-logo.png" alt="GitHub+GCP Identity Federation with Config Connector"></div></a></figure><p>The gist of the setup is that you need to create a <a href="https://cloud.google.com/iam/docs/workload-identity-federation">workload identity</a> pool in GCP to hold a provider which is configured to issue GCP workload identities in exchange for a github actions JWT; this also requires enabling permission to create this JWT on the github actions side.</p><p>Claim assertions piped from the JWT through the provider can then be bound via an IAM policy to assign roles to the workload identity of actions running in a specific github repo/branch/etc.</p><p>When the workflow runs, one of the earliest actions should be <a href="https://github.com/google-github-actions/auth">google-github-actions/auth</a>. This will get the signed github JWT and use the federation provider to exchange it for a GCP workload identity.</p><p>This identity can then be used by the other steps in the workflow to interact with GCP APIs in accordance with permissions granted via roles assigned in IAM.</p><h3 id="boss-the-code-the-code">Boss! The Code! 
The Code!</h3><p>All of this code is live and running in my <a href="https://github.com/drzzlio/gitops/tree/master">public gitops repo</a>, and I&apos;ll link to the resources and code directly in the text that describes them.</p><p>To create the pool and provider, config connector provides straightforward resources that we can drop <a href="https://github.com/drzzlio/gitops/blob/43bcd7b63575b0d4aae45c0bd374c4637ab9db0e/apps/drzzl.io/github-identity-federation.yaml" rel="noreferrer">in the gitops repo</a>:</p><pre><code class="language-yaml">apiVersion: iam.cnrm.cloud.google.com/v1beta1
kind: IAMWorkloadIdentityPoolProvider
metadata:
  name: github
spec:
  projectRef:
    external: projects/gptops-playground
  location: global
  workloadIdentityPoolRef:
    name: github
  displayName: github
  description: Github repo actions workload identity federation
  disabled: false
  attributeMapping:
    google.subject: assertion.sub
    attribute.repository: assertion.repository
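    # Illustrative, not in the original manifest: other claims github puts in
    # its JWT, such as ref, actor, workflow, environment, or sha, can be
    # mapped the same way, e.g.
    # attribute.ref: assertion.ref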
  attributeCondition: assertion.repository_owner == &apos;drzzlio&apos;
  oidc:
    issuerUri: https://token.actions.githubusercontent.com
---
apiVersion: iam.cnrm.cloud.google.com/v1beta1
kind: IAMWorkloadIdentityPool
metadata:
  name: github
spec:
  projectRef:
    external: projects/gptops-playground
  location: global
  displayName: github
  description: Identity pool for github actions workloads
  disabled: false</code></pre><p>In the provider here we&apos;re using <code>attributeMapping</code> to bind the custom attribute <code>repository</code> to the <code>repository</code> claim in github&apos;s JWT, which will be a value like <code>drzzlio/kms-issuer</code> for my example case.</p><p>The <code>attributeCondition</code> is a whitelist policy that only allows handling identities for repos under my github org. Without this, any github action could request an identity from our provider, which is not great even though it may not have any permissions in IAM.</p><p>The <code>oidc.issuerUri</code> property tells the provider where it can get the <a href="https://token.actions.githubusercontent.com/.well-known/openid-configuration">well-known openid JWKS metadata</a> necessary for authenticating the JWT created by github.</p><p>The pool is just a container and doesn&apos;t take much configuration itself.</p><p>On the <a href="https://github.com/drzzlio/kms-issuer/blob/e11abaecb9c397c7a92725a426fe98325d570e44/.github/workflows/e2e.yaml" rel="noreferrer">github actions side</a> we set <code>permissions</code> to create an <code>id-token</code> and add the google auth action:</p><pre><code class="language-yaml">jobs:
  e2e:
    permissions:
      id-token: write # This is required for requesting the JWT for GCP OIDC federation
      contents: read  # This is required for actions/checkout
    steps:
      - uses: actions/checkout@v4.1.1

      - name: Authenticate to Google Cloud
        id: gcpauth
        uses: google-github-actions/auth@v2.1.2
        with:
          workload_identity_provider: projects/726581429530/locations/global/workloadIdentityPools/github/providers/github
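          # Illustrative, commented-out addition: when a target GCP service
          # only supports access via service-account impersonation, the auth
          # action can impersonate a service account instead of using the
          # workload identity directly (the SA email below is hypothetical)
          # service_account: kms-issuer-e2e@gptops-playground.iam.gserviceaccount.com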

# -- snip creating kind cluster --

      - name: Install GCP auth secret
        run: kubectl create -n kms-issuer-system secret generic kmsissueradc --from-file=gadcreds.json=${GOOGLE_APPLICATION_CREDENTIALS}

# -- snip running tests --</code></pre><p>By default, the auth action will create a credentials file compatible with <a href="https://cloud.google.com/docs/authentication/provide-credentials-adc">google ADC</a> when provided a workload identity pool to retrieve an identity from. It also sets various environment variables in the workflow that ADC clients look at, like <code>GOOGLE_APPLICATION_CREDENTIALS</code>. </p><p>In my case I need the credentials to allow the kms-issuer controller running in a <a href="https://kind.sigs.k8s.io/">kind cluster</a> for e2e testing to reach out to my GCP project, so I inject a secret containing the contents of the credentials file created by the auth action, which it has conveniently pointed at using the above environment variable.</p><p>Finally, we need to bind the action&apos;s workload identity for some repo to roles in IAM. What permissions are necessary will be quite specific to the particular use-case, but in mine I needed to allow the controller access to a KMS key for signing.</p><p>We can do this simply with an <code>IAMPolicy</code> KCC resource, of course, <a href="https://github.com/drzzlio/gitops/blob/13a4cb97662354924985b93edac844b78ac2ebb3/apps/kmsissuer-test/kmskey.yaml#L23" rel="noreferrer">in the gitops repo</a>:</p><pre><code class="language-yaml">apiVersion: iam.cnrm.cloud.google.com/v1beta1
kind: IAMPolicy
metadata:
  name: kmsissuer-test-keypolicy
spec:
  resourceRef:
    apiVersion: kms.cnrm.cloud.google.com/v1beta1
    kind: KMSCryptoKey
    name: kmsissuer-test
  bindings:
    - role: roles/cloudkms.signerVerifier
      members:
        - principalSet://iam.googleapis.com/projects/726581429530/locations/global/workloadIdentityPools/github/attribute.repository/drzzlio/kms-issuer
    - role: roles/cloudkms.viewer
      members:
        - principalSet://iam.googleapis.com/projects/726581429530/locations/global/workloadIdentityPools/github/attribute.repository/drzzlio/kms-issuer</code></pre><p>Using <code>IAMPolicy</code> we can ensure this is the only policy that applies to the key (it will overwrite any bindings not in this list, which is very powerful in combating drift when combined with self-healing gitops tooling).</p><p>The <code>principalSet:</code> prefix used for the member entries lets us tell google that the role binding is referencing a workload identity issued by our federated pool, with the value after <code>attribute.repository/</code> binding against the custom attribute that we passed through in the identity provider attribute mapping.</p><p>In this case the member binding means that only a workload identity created from a cryptographically-verified JWT with a claim that github says it issued to an action in the <code>drzzlio/kms-issuer</code> repo will be able to make use of these roles.</p><p>For my needs binding on the repo is sufficient, but the same capability can bind against any claims that <a href="https://docs.github.com/en/actions/deployment/security-hardening-your-deployments/about-security-hardening-with-openid-connect#understanding-the-oidc-token">github includes in its JWT</a>, like environment, actor, workflow, branch, sha, etc.</p><h3 id="workload-identity-federation-via-service-account">Workload Identity Federation via Service Account</h3><p>What I&apos;ve described above uses the <a href="https://github.com/google-github-actions/auth#direct-wif">direct method</a> of identity federation, which is the preferred and simpler method for binding federated identities directly to roles in IAM. 
Unfortunately, some GCP services do not support this method of role binding and require impersonation of a GCP service account.</p><p>Another reason you might not be able to use direct identity federation is if your workflow logic needs access to OAuth access tokens or ID tokens from the auth plugin, as this is only supported when mediated by a GCP service account identity.</p><p>The main difference from the above setup is the addition of a GCP service account and a policy that grants the <code>workloadIdentityUser</code> role on it to the repo using the same <code>principalSet:</code> as above. The policies for roles that the actions need access to would then be granted to the service account instead of directly to the workload identity.</p><p>Adding the <code>service_account</code> property to the google auth action config will then cause it to use its workload identity to impersonate the specified service account instead of using the workload identity directly to access GCP services.</p><h3 id="conclusion">Conclusion</h3><p>I think one of the nicest things about KCC, besides having your IaC eventually consistent and all unified under k8s controllers, is the way it is architected to allow composing its resources in a way that elides much of the repetition and cross-referencing noise that is so prevalent in the definitions of other IaC tooling.</p><p>With a large portion of the GCP API surface covered by KCC, many tasks, even somewhat complicated ones like identity federation, become kind of a joy to work with.</p><p>Take your newfound knowledge: go forth and control all the things!</p>]]></content:encoded></item><item><title><![CDATA[People, not machines, drive the craft]]></title><description><![CDATA[<p>One of my favorite things in the world is a thought-provoking book, and there&apos;s not much I enjoy doing more on a lazy Saturday than wandering the aisles of books at the second-hand stores.</p><p>There are so many gems to be found, a gem like a 
coffee</p>]]></description><link>https://curiouslynerdy.com/people-not-machines-drive-the-craft/</link><guid isPermaLink="false">660f48a2568faa0001e00954</guid><dc:creator><![CDATA[Josh Perry]]></dc:creator><pubDate>Mon, 08 Apr 2024 00:37:32 GMT</pubDate><media:content url="https://curiouslynerdy.com/content/images/2024/04/IMG_7626.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://curiouslynerdy.com/content/images/2024/04/IMG_7626.jpg" alt="People, not machines, drive the craft"><p>One of my favorite things in the world is a thought-provoking book, and there&apos;s not much I enjoy doing more on a lazy Saturday than wandering the aisles of books at the second-hand stores.</p><p>There are so many gems to be found, a gem like a coffee table book about the effects house Digital Domain from 22 years ago, filled with stories about how they accomplished the visual effects for some of history&apos;s biggest movies like Apollo 13 and Titanic.</p><p>For the first time in the years since I found it, I was going through my collection to pick a couple fresh for the living room. As I was leafing through the amazing pictures of physical models and CGI rigging throughout, I found myself caught by the text of the introduction.</p><p>It addressed a concern I see circulating through many industries, centered around generative AI:</p><blockquote>One of the greatest misconceptions about modern movies is that visual effects are generated by computers. Nothing could be further from the truth. Human inventiveness is the most important ingredient and it always will be. Computers offer amazing new possibilities, but the underlying challenges of movie illusions are the same today as they were nearly a century ago when the industry was young. </blockquote><blockquote>People, not machines, drive the craft of visual effects.</blockquote><p>Prescient from so long ago, and I strongly believe it will be shown to be the case with generative AI as well. 
Some of the dissonance we see in dev experience &#x2013; through the fuzziness of hype &#x2013; around AI-assisted coding seems to be correlated with the code&apos;s target problem space.</p><p>Sure, AI can spit out specializations to code that&apos;s been written countless times, though even in these cases it&apos;s often arguably not that skilled when measured by an experienced dev.</p><p>The space where I&apos;ve had the most problems using AI to code, though, is in trying to create things that don&apos;t exist yet, where I&apos;ve found it counterproductive to even open the GPT interface.</p><p>Because I spend most of my time in this realm of de novo creation, I find reading code and docs is still exceptionally more productive than GPT when attempting to understand how a library or system works.</p><h3 id="augmenting-reality">Augmenting Reality</h3><p>To get a grip on how much genAI&apos;s issues can impact your ability to create, I did an experiment recently. I attempted to use chatGPT to help in comprehending the machinations that happen beneath and between kubernetes and the container runtime, and with the planning aspect of implementing a new rootfs handling feature in containerd.</p><p>I spent ~2 hours a day for 4 days doing research and dev planning with just GPT4 and my notes, then on the 5th day I began reading containerd source and docs.</p><p>Wow, after 2 hours in that source and architectural documentation I basically had to throw away the previous 8h of &quot;work&quot;.</p><p>The thing that scared me most about this experiment was that by the time I started feeding my mind from the source of actual truth, it was already so filled with GPT hallucinations that unraveling the lies and misconceptions carried a surprising cost in time and mental stamina.</p><p>The work I spent rooting GPT&apos;s lies out of my mind could instead have been spent actually learning, analyzing, and ultimately creating things.</p><h3 
id="gazing-back-upon-a-gpt-tinged-memory">Gazing back upon a GPT-Tinged Memory</h3><p>Now, the saturation aspect of this experiment was intentional and, I think, quite enlightening.</p><p>I wanted to produce as stark a juxtaposition as possible: a million hallucinated papercuts, inflicted without respite from truth, in a domain where I have significant experience but an area where I didn&apos;t have deep knowledge of the specifics.</p><p>I wanted to subsequently lean on my skill in the domain to then <em>become</em> an expert in the specifics of the area, and do a comparative analysis between what I&apos;d learned from a source of truth versus what I learned via reality filtered through the world&apos;s most advanced GPT.</p><p>Most of the kubernetes world is replete with golang code and is organized in certain patterns like control loops and eventual consistency. These are areas I spend a lot of time not just studying, but also implementing, so it lines up well with a stack I felt I could rapidly ingest.</p><p>In the end, I was actually pretty shocked by the magnitude of difference in what I thought I knew after working with GPT and how I learned the system worked in actuality.</p><p>But what I might have been even more shocked by is the difference in the richness of the real world compared to the world I perceived through GPT4; perhaps it&apos;s not a stretch to say that humans are many times more interesting than GPTs think we are.</p><p>Alan Turing proved the infinite complexity of even his simple machines via analysis of the halting problem. 
This is the infinite space in which we ask our genAI to operate, when the space in which it was taught is not one of understanding but only of concept mapping.</p><h3 id="my-ai-is-a-narcissist">My AI is a Narcissist</h3><p>It is pretty difficult for most people to hold a fact in two conflicting states in their mind at once, in a way that makes it easy to quickly swap something they learned for its correction, especially now that this is the rule more than the exception.</p><p>Even after you learn the real truth it can sometimes be difficult to collapse the dissonance wave, second-guessing yourself when going back to remember. In fact, this may even incentivize one to implicitly trust in order to avoid the dissonance altogether, praying that the follow-on consequences are minimal.</p><p>I guess one thing that&apos;s left to be seen is whether this is an additive debuff, but one thing is for sure: people attempting to use GPT to create things in domains where they lack practical experience are doomed to wander in a fog that they&apos;re confidently blind to.</p><p>Looking at the recent <a href="https://arxiv.org/pdf/2303.08774.pdf">GPT-4 Technical Report</a> from March &apos;23, it&apos;s interesting to note the things that GPT is &quot;good&quot; at. Sure, it passes the Bar Exam, but law is probably the set of strictures most thoroughly described in human-language prose. 
Look at where it struggles, though, and you can see it&apos;s in areas that require analysis, consideration, and inventiveness.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://miro.medium.com/v2/resize:fit:733/1*ns9IaN2lk9VMtAI82fJA7A.png" class="kg-image" alt="People, not machines, drive the craft" loading="lazy"><figcaption><span style="white-space: pre-wrap;">what GPT is &quot;good&quot; at</span></figcaption></figure><p>A confounder that we will shortly need to reckon with is <a href="https://en.wikipedia.org/wiki/Goodhart%27s_law">Goodhart&apos;s law</a>; it stands to reason that these measures will become less reliable once coverage of the AI&apos;s travails with them is published as analysis and data that it is then eventually trained upon.</p><h3 id="fin">Fin</h3><p>In a followup post I will talk about how I think genAI can never really root a creative work outside of human ingenuity. As perhaps a thought-provoking hint, that post&apos;s title will be &quot;You can&apos;t Factor Creation&quot;.</p>]]></content:encoded></item><item><title><![CDATA[Composing and Diffing Your GitOps Repo]]></title><description><![CDATA[<p>In a <a href="https://curiouslynerdy.com/how-to-delete-your-gcp-org-with-gitops/">recent post</a> I opined on a past gitops failure and contemplated how showing diffs of the changes you&apos;re going to make to the running repo is one check that can help mitigate certain types of mistakes. 
The automation around a gitops repo makes it a natural</p>]]></description><link>https://curiouslynerdy.com/composing-and-diffing-your-gitops-repo/</link><guid isPermaLink="false">65fb3c97568faa0001e00595</guid><dc:creator><![CDATA[Josh Perry]]></dc:creator><pubDate>Wed, 20 Mar 2024 21:02:34 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1495900158145-fa1e1786861b?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDI1fHxsaWdodGhvdXNlJTIwc3Rvcm18ZW58MHx8fHwxNzEwOTY4NDUxfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1495900158145-fa1e1786861b?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDI1fHxsaWdodGhvdXNlJTIwc3Rvcm18ZW58MHx8fHwxNzEwOTY4NDUxfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="Composing and Diffing Your GitOps Repo"><p>In a <a href="https://curiouslynerdy.com/how-to-delete-your-gcp-org-with-gitops/">recent post</a> I opined on a past gitops failure and contemplated how showing diffs of the changes you&apos;re going to make to the running repo is one check that can help mitigate certain types of mistakes. The automation around a gitops repo makes it a natural target for automation in the space of fault and regression detection.</p><p>So I spent a little time on improving my workflow of just running <code>kustomize build apps/myapp/overlay/primary | less</code> in a separate terminal to look through the output, and fix any issues, then iterate.</p><p>Sometimes the LOC of the resources (especially if there are CRDs involved) when installing some app or service can get excessive to parse manually, and even some related groups or single resources, like a complex Pod configuration, can be difficult to parse for errors by eye.</p><p>One interesting tool I ran into during my research was <a href="https://github.com/homeport/dyff">dyff</a>. 
It is a yaml diff tool that can give succinct and human-readable field- and list-specific difference reports. But a special feature (that&apos;s enabled by default) is its kubernetes detection: if it detects the documents it&apos;s diffing are k8s resources, it will pin its diffing engine on the document&apos;s <a href="https://book.kubebuilder.io/cronjob-tutorial/gvks.html">GVK</a>, namespace and name.</p><p>Similarly to argo&apos;s own reconciler, this gives dyff the ability to understand the difference between addition, removal, and change of specific kubernetes resource instances. With two sets of plain yaml documents to diff (i.e. no k8s identity), it has no way to tell if a document from the before set is the same as a document from the after set.</p><h3 id="diffing-with-document-identity">Diffing with Document Identity</h3><p>Let&apos;s look at how diffing with knowledge of document identity can be helpful: this is a relatively simple-looking change to my local working copy of the gitops repo:</p><pre><code class="language-diff">diff --git a/apps/argocd/overlays/primary/kustomization.yaml b/apps/argocd/overlays/primary/kustomization.yaml
index 934bf5c..32dc17f 100644
--- a/apps/argocd/overlays/primary/kustomization.yaml
+++ b/apps/argocd/overlays/primary/kustomization.yaml
@@ -3,6 +3,9 @@ resources:
 - ../../base
 - repo.yaml
 
+components:
+- ../../components/argonix
+
 patches:
 - path: argocd-cm.patch.yaml
 - path: custom-tools.patch.yaml</code></pre><p>The thing that&apos;s special about this change is that components are a primary composition tool in kustomize; a component is one of only a couple types of multi-resource generators in kustomize that have the power to also mutate other documents in the pipeline; to use a programming term: they&apos;re kustomize mixins.</p><p>If we run <code>kustomize build apps/argocd/overlays/primary</code> this will generate the final resources that would be injected into the cluster during reconciliation if I were to commit the change. If I were able to also run this build command against the master branch, I would then be able to dyff the two outputs and see a compact view of what&apos;s going to change from what&apos;s live.</p><p>Handwaving the details for the moment, this is what the dyff output looks like for that change:</p>
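<p>To un-handwave those details a little first: one way to get the two builds is a second checkout of master via a git worktree, feeding both rendered streams to dyff. A sketch, where the temp paths are illustrative:</p><pre><code class="language-bash"># Render the local working copy
kustomize build apps/argocd/overlays/primary &gt; /tmp/after.yaml

# Render the same overlay as it exists on master (worktree path is hypothetical)
git worktree add /tmp/gitops-master master
kustomize build /tmp/gitops-master/apps/argocd/overlays/primary &gt; /tmp/before.yaml
git worktree remove /tmp/gitops-master

# k8s-aware diff of the two rendered resource sets
dyff between /tmp/before.yaml /tmp/after.yaml</code></pre>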
<!--kg-card-begin: html-->
<pre>
  <code>
<span style="font-weight:bold;">(file level)</span>
    ---
    <span style="font-weight:bold;color:#008000;">apiVersion:</span> <span style="color:#32cd32;">v1</span>
    <span style="font-weight:bold;color:#008000;">data:</span>
    <span style="color:#003300;">&#x2502; </span><span style="font-weight:bold;color:#008000;">plugin.yaml:</span> <span style="color:#6b8e23;">|</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#003300;">&#x2502; </span><span style="color:#6b8e23;">apiVersion: argoproj.io/v1alpha1</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#003300;">&#x2502; </span><span style="color:#6b8e23;">kind: ConfigManagementPlugin</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#003300;">&#x2502; </span><span style="color:#6b8e23;">metadata:</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#003300;">&#x2502; </span><span style="color:#6b8e23;">  name: argonix-jobs</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#003300;">&#x2502; </span><span style="color:#6b8e23;">spec:</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#003300;">&#x2502; </span><span style="color:#6b8e23;">  version: v1.0</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#003300;">&#x2502; </span><span style="color:#6b8e23;">  # Make sure the reconciler is up-to-date</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#003300;">&#x2502; </span><span style="color:#6b8e23;">  init:</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#003300;">&#x2502; </span><span style="color:#6b8e23;">    command: [bash, -c]</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#003300;">&#x2502; </span><span style="color:#6b8e23;">    args:</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#003300;">&#x2502; </span><span style="color:#6b8e23;">    - &gt;-</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#003300;">&#x2502; </span><span style="color:#6b8e23;">        nix --extra-experimental-features &apos;nix-command flakes&apos; build</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#003300;">&#x2502; </span><span style="color:#6b8e23;">        --out-link /opt/reconciler github:drzzlio/argonix?dir=cmp#reconcile;</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#003300;">&#x2502; </span><span style="color:#6b8e23;">        nix-collect-garbage</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#003300;">&#x2502; </span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#003300;">&#x2502; </span><span style="color:#6b8e23;">  # Run the argonix job reconciler</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#003300;">&#x2502; </span><span style="color:#6b8e23;">  # Must always and only return k8s resources to stdout</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#003300;">&#x2502; </span><span style="color:#6b8e23;">  generate:</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#003300;">&#x2502; </span><span style="color:#6b8e23;">    command: [/opt/reconciler/bin/reconcile]</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#003300;">&#x2502; </span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#003300;">&#x2502; </span><span style="color:#6b8e23;">  # Run against repos with a flake in the root</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#003300;">&#x2502; </span><span style="color:#6b8e23;">  discover:</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#003300;">&#x2502; </span><span style="color:#6b8e23;">    fileName: &quot;./flake.nix&quot;</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#003300;">&#x2502; </span><span style="color:#6b8e23;">    preserveFileMode: false</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#003300;">&#x2502; </span>
    <span style="font-weight:bold;color:#008000;">kind:</span> <span style="color:#32cd32;">ConfigMap</span>
    <span style="font-weight:bold;color:#008000;">metadata:</span>
    <span style="color:#003300;">&#x2502; </span><span style="font-weight:bold;color:#008000;">name:</span> <span style="color:#32cd32;">argonix-cmp-plugin-54m5h9fgkt</span>
    <span style="color:#003300;">&#x2502; </span><span style="font-weight:bold;color:#008000;">namespace:</span> <span style="color:#32cd32;">argocd</span>



<span style="font-weight:bold;">spec</span>.<span style="font-weight:bold;">template</span>.<span style="font-weight:bold;">spec</span>.<span style="font-weight:bold;">containers</span>  <span style="color:#b0c4de;">(Deployment/argocd/argocd-server)</span>
  <span style="color:#c7c43f;">+ one list entry added:</span>
    <span style="color:#008000;">-</span> <span style="font-weight:bold;color:#008000;">name:</span> <span style="color:#32cd32;">argonix-cmp-plugin</span>
    <span style="color:#003300;">&#x2502; </span><span style="font-weight:bold;color:#008000;">image:</span> <span style="color:#32cd32;">nixos/nix</span>
    <span style="color:#003300;">&#x2502; </span><span style="font-weight:bold;color:#008000;">command:</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#008000;">-</span> <span style="color:#32cd32;">/var/run/argocd/argocd-cmp-server</span>
    <span style="color:#003300;">&#x2502; </span><span style="font-weight:bold;color:#008000;">volumeMounts:</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#008000;">-</span> <span style="font-weight:bold;color:#008000;">name:</span> <span style="color:#32cd32;">var-files</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#003300;">&#x2502; </span><span style="font-weight:bold;color:#008000;">mountPath:</span> <span style="color:#32cd32;">/var/run/argocd</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#008000;">-</span> <span style="font-weight:bold;color:#008000;">name:</span> <span style="color:#32cd32;">plugins</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#003300;">&#x2502; </span><span style="font-weight:bold;color:#008000;">mountPath:</span> <span style="color:#32cd32;">/home/argocd/cmp-server/plugins</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#008000;">-</span> <span style="font-weight:bold;color:#008000;">name:</span> <span style="color:#32cd32;">argonix-cmp-plugin</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#003300;">&#x2502; </span><span style="font-weight:bold;color:#008000;">mountPath:</span> <span style="color:#32cd32;">/home/argocd/cmp-server/config/plugin.yaml</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#003300;">&#x2502; </span><span style="font-weight:bold;color:#008000;">subPath:</span> <span style="color:#32cd32;">plugin.yaml</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#008000;">-</span> <span style="font-weight:bold;color:#008000;">name:</span> <span style="color:#32cd32;">cmp-tmp</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#003300;">&#x2502; </span><span style="font-weight:bold;color:#008000;">mountPath:</span> <span style="color:#32cd32;">/tmp</span>

<span style="font-weight:bold;">spec</span>.<span style="font-weight:bold;">template</span>.<span style="font-weight:bold;">spec</span>.<span style="font-weight:bold;">volumes</span>  <span style="color:#b0c4de;">(Deployment/argocd/argocd-server)</span>
  <span style="color:#c7c43f;">+ two list entries added:</span>
    <span style="color:#008000;">-</span> <span style="font-weight:bold;color:#008000;">name:</span> <span style="color:#32cd32;">argonix-cmp-plugin</span>
    <span style="color:#003300;">&#x2502; </span><span style="font-weight:bold;color:#008000;">configMap:</span>
    <span style="color:#003300;">&#x2502; </span><span style="color:#003300;">&#x2502; </span><span style="font-weight:bold;color:#008000;">name:</span> <span style="color:#32cd32;">argonix-cmp-plugin-54m5h9fgkt</span>
    <span style="color:#008000;">-</span> <span style="font-weight:bold;color:#008000;">name:</span> <span style="color:#32cd32;">argonix-cmp-tmp</span>
    <span style="color:#003300;">&#x2502; </span><span style="font-weight:bold;color:#008000;">emptyDir:</span> <span style="color:#556b2f;">{}</span>
    </code>
</pre>
<!--kg-card-end: html-->
<p>We can see pretty simply from this dyff, without wading through all of the argocd install resources, that this component adds a new configmap and modifies the argocd-server pod to add a container and two volumes it references. Dyff is quite smart about how it compactly presents the changes in your yaml.</p><h3 id="stop-impl-time">Stop, Impl Time</h3><p>So how do we do this? Well, for my repo setup it&apos;s fairly simple. The primary driver of how simple this will be is how homogeneous your yaml build pipelines are. For this repo, I only expose kustomizations; kustomize can generate yaml from helm charts directly, so for upstream cases needing helm, kustomize is better at composing charts than helm itself.</p><p>With that said, I&apos;ll share my code, but keep in mind that it may be more complex in your stack (e.g. if you have to detect which generation pipeline to run for an app). The cluster-level dyff function in particular will be difficult to implement if you don&apos;t have a way to easily discover all the roots that need to be generated for a cluster.</p><p>This is the <a href="https://github.com/drzzlio/gitops/blob/7fa9cd590d0d158895bea1cf30cea50b29fb2450/flake.nix#L40" rel="noreferrer">implementation in my gitops project&apos;s</a> <code>flake.nix</code>. As this is the <code>scripts</code> value for my devenv output, each of the <code>&apos;&apos;</code>-enclosed multiline strings is bash. The <code>${...}</code> entities are nix variables and get expanded into the bash script strings (in the case of <code>pkgs.*</code> expansions, nix will also automatically install that tool).</p><pre><code class="language-bash">          scripts = let
            newtree = &apos;&apos;
              set -e
              if git worktree list | grep gitopskdiffmaster &amp;&gt; /dev/null; then
                cd /tmp/gitopskdiffmaster
                git fetch
                git checkout origin/master &amp;&gt; /dev/null
                cd - &gt; /dev/null
              else
                git fetch
                git worktree add /tmp/gitopskdiffmaster origin/master &amp;&gt; /dev/null
              fi
            &apos;&apos;;
          in {
            # Useful for diffing an application&apos;s generated resources after
            # local changes.
            # Takes a relative path, checks out master in a temp folder, then
            # does `kustomize build` against the same relative path from master
            # and the current directory before dyffing the output.
            kdiff.exec = &apos;&apos;
              ${newtree}
              echo diffing `pwd`/$1 with master/$1
              ${pkgs.dyff}/bin/dyff between --ignore-order-changes --truecolor on --omit-header \
                &lt;(kustomize build --enable-helm /tmp/gitopskdiffmaster/$1) \
                &lt;(kustomize build --enable-helm `pwd`/$1)
            &apos;&apos;;
            # Automated kdiff on file changes.
            # Takes two directories, watches the first for any yaml file
            # changes, calls `kdiff` on the second any time a change
            # is detected.
            kdiffwatch.exec = &apos;&apos;
              ${pkgs.watchexec}/bin/watchexec -e yaml -w $1 kdiff $2
            &apos;&apos;;
            # Similar to kdiff, but for all the resources in cluster
            # Takes the name of a cluster, checks out master to a temp folder,
            # then generates and dyffs the resources for the cluster at master
            # and the cluster in the local directory.
            cdiff.exec = &apos;&apos;
              ${newtree}
              echo diffing `pwd`/clusters/$1 with master/clusters/$1
              ${pkgs.dyff}/bin/dyff between --ignore-order-changes --truecolor on --omit-header \
                &lt;(kustomize build /tmp/gitopskdiffmaster/clusters/$1 | yq &apos;.spec.source.path&apos; -r | tr &apos;\n&apos; &apos;\0&apos; | xargs -0i -n 1 bash -c &apos;kustomize build --enable-helm /tmp/gitopskdiffmaster/{} 2&gt;&amp;1; echo &quot;---&quot;&apos;) \
                &lt;(kustomize build clusters/$1 | yq &apos;.spec.source.path&apos; -r | tr &apos;\n&apos; &apos;\0&apos; | xargs -0i -n 1 bash -c &apos;kustomize build --enable-helm {} 2&gt;&amp;1; echo &quot;---&quot;&apos;)
            &apos;&apos;;
          };
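          # Example invocations from the devenv shell (the paths and cluster
          # name here are hypothetical):
          #   kdiff apps/myapp/overlays/test
          #   kdiffwatch apps/myapp apps/myapp/overlays/test
          #   cdiff mycluster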
</code></pre><p>At a high level, we&apos;re using <code>git worktree</code> to check out the <code>origin/master</code> commit in a temp directory. We can then run the build against both the local repo files and what&apos;s in master before running dyff against the two outputs.</p><p>The <code>kdiff</code> script is meant to render and diff just a specific kustomization (usually an overlay), and is useful when iterating on a particular app&apos;s configuration.</p><p>The <code>kdiffwatch</code> script adds <a href="https://github.com/watchexec/watchexec">watchexec</a> to the party to automatically run a <code>kdiff</code> any time some yaml files get changed.</p><p>The <code>cdiff</code> script is a bit different. As my project is using argo to render app kustomizations in-cluster (using the app-of-apps pattern), if I want to diff everything in the cluster I need to extract every app path that argo is rendering from.</p><p>This is pretty simple since this repo already has a kustomization at <code>clusters/&lt;clustername&gt;</code> which renders all of the <code>Application</code> resources that argo will target for rendering into the cluster (this is actually the &quot;app&quot; in app-of-apps).</p><p>So the <code>cdiff</code> script first renders this kustomization to extract the <code>spec.source.path</code> property for each app using <code>yq</code>, then runs <code>kustomize build</code> on each of the paths; this is repeated for the master branch copy, before finally generating a dyff of the outputs.</p><p>If you&apos;re not already using some diffing tool to see what&apos;s changing in the <strong>output</strong> of your gitops repo, you should set something up; it&apos;s simple and it will save you not just time, but pain.</p>]]></content:encoded></item><item><title><![CDATA[GPT4-created Kubernetes Policies]]></title><description><![CDATA[<p>Like I mentioned in <a 
href="https://curiouslynerdy.com/llm-powered-embodied-lifelong-learning-sysop/">my previous LLM+k8s post</a>, I&apos;ve been curious about how we can bring the power of LLMs to bear in generating k8s resources from their knowledge given some natural language instruction. As I&apos;m a big fan of javascript from waaaay back, I&</p>]]></description><link>https://curiouslynerdy.com/gpt4-created-kubernetes-policies/</link><guid isPermaLink="false">65fa0953568faa0001e00524</guid><dc:creator><![CDATA[Josh Perry]]></dc:creator><pubDate>Tue, 19 Mar 2024 22:17:37 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1617791160536-598cf32026fb?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDF8fE1hdHJpb3Noa2ElMjBicmFpbnxlbnwwfHx8fDE3MTA4ODg3ODB8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1617791160536-598cf32026fb?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDF8fE1hdHJpb3Noa2ElMjBicmFpbnxlbnwwfHx8fDE3MTA4ODg3ODB8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="GPT4-created Kubernetes Policies"><p>Like I mentioned in <a href="https://curiouslynerdy.com/llm-powered-embodied-lifelong-learning-sysop/">my previous LLM+k8s post</a>, I&apos;ve been curious about how we can bring the power of LLMs to bear in generating k8s resources from their knowledge given some natural language instruction. As I&apos;m a big fan of javascript from waaaay back, I&apos;m recently a big fan of the <a href="https://www.jspolicy.com/">JSPolicy engine</a> and of the <a href="https://voyager.minedojo.org/">Voyager paper</a> that used the mineflayer js-based minecraft API.</p><p>So can we similarly leverage an LLM to play the game of managing policy in a k8s cluster? 
I guess the basic UX we&apos;re looking for here is that the user presents a desire <a href="https://github.com/drzzlio/gitops/blob/1e2394b5c2495f740caf7404a0ff39f57c0bcbd1/apps/gptpolicy/iampolicyproject-policy.yaml">in human language</a>, like <strong>Deny any IAMPolicy config connector resources that references a Project</strong> (like I defined <a href="https://curiouslynerdy.com/how-to-delete-your-gcp-org-with-gitops/">in my previous post</a>), and the controller will use GPT4 to generate the jspolicy implementing that intent.</p><p>And this is an example of what it comes up with in response to that intent:</p><pre><code class="language-yaml">apiVersion: policy.jspolicy.com/v1beta1
kind: JsPolicy
metadata:
  creationTimestamp: &apos;2024-03-19T21:16:24Z&apos;
  generation: 1
  name: no-iampolicy-targeting-projects.drzzl.io
  ownerReferences:
    - apiVersion: gpt.drzzl.io/v1
      controller: true
      kind: GPTPolicy
      name: no-iampolicy-targeting-projects.drzzl.io
      uid: f5794403-f498-4ca8-a9a2-a02e838dd9a6
  resourceVersion: &apos;109696173&apos;
  uid: 385ae777-2ad8-4ed1-bbda-23c77645071d
spec:
  apiGroups:
    - configconnector.cnrm.cloud.google.com
  apiVersions:
    - &apos;*&apos;
  javascript: &gt;-
    if (request.object.spec.resourceRef.kind === &apos;Project&apos;) { deny(&apos;IAMPolicy
    cannot target a Project&apos;); } else { allow(); }
  operations:
    - CREATE
    - UPDATE
  resources:
    - iampolicies
  scope: Namespaced
  type: Validating</code></pre><p>The output is actually slightly flawed (like most LLM output): the <code>apiGroups</code> should be <code>iam.cnrm.cloud.google.com</code>.</p><p>This is, after all, a first cut; it still has issues, and has already taken my cluster down once. I&apos;ve made a number of tweaks to the context I&apos;m providing to the LLM as it actually doesn&apos;t have much reference material about what correct jspolicy resources look like.</p><p>While I think jspolicy is an amazing tool to simply enforce intent on your cluster in small bits of policy, mutation, and/or controller code, it&apos;s still not widely used and there are only a couple small repositories holding simple examples.</p><p>To that end, most of the context in my prompt is focused on shoring up the LLM&apos;s knowledge on jspolicy and telling it what information it will be given and how it should act with that information. Ostensibly I force the LLM to call a function, of which I only offer one, <code>create</code>, which creates a k8s resource of the LLM&apos;s choosing with some very liberal schema to guide it.</p><p>I don&apos;t have any kind of feedback, like the Voyager researchers used, but I have hand-tested running some errors (like the one above) through GPT4 and have had poor results so far in getting it to raise a flag on subtle issues.</p><p>I will most likely need to fall back to adding more information into the context, which has thankfully been growing with each release of GPT, though it adds to costs. 
There is also the possibility of fine-tuning a lower-end model on a lot of quality jspolicy content to see if we can make the code generation more accurate.</p><p>This could be a self-feeding data pool of good training or context content if we store effective policies that meet human review at some level.</p><p>With all that said, here is <a href="https://github.com/drzzlio/gitops/blob/7dd48fcf6e26fbdf04ef33e99ca53dcc751e3075/apps/gptpolicy/controller.yaml">the controller</a> so far, implemented&#x2013;can you imagine&#x2013;as a jspolicy controller:</p><pre><code class="language-js">const AIURL = &apos;https://api.openai.com/v1/chat/completions&apos;
const AIKEY = env(&apos;OPENAI_API_KEY&apos;)

// TODO: move state tracking to the status subresource
const LASTAPPLIED_ANNOT = &apos;gpt.drzzl.io/last-applied-description&apos;

print(`got event ${JSON.stringify(request)}`)

const log = msg =&gt; print(`${request.name}: ${msg}`)

if(request.operation === &apos;DELETE&apos;) {
  // The first delete with a finalizer has an object
  if(request.object) {
    //TODO: Delete the owned jspolicy instance
    log(`removing owned jspolicy instance`)
  } else {
    log(`not handling delete event`)
  }

  // Done with delete handling, jump out
  allow()
}

log(`handling creation event`)
// Check if the last applied annotation matches the current description,
// if not then we need to create or update the owned jspolicy
if(request.object.metadata.annotations?.[LASTAPPLIED_ANNOT] !== request.object.spec.description) {
  log(`generating jspolicy code`)

  const description = request.object.spec.description

  const payload = {
    model: &apos;gpt-4&apos;,
    temperature: 0.02,
  //  top_p: 0.4,
    messages: [
      { role: &apos;system&apos;, content: `You are an expert in kubernetes, javascript, and JsPolicy that is responsible for creating and updating jspolicy resources in the kubernetes cluster.

You will be provided an &apos;owner:&apos; and &apos;description:&apos; for the policy resource.
The &apos;description:&apos; provided is the description of what the policy code should accomplish and you should always use the owner&apos;s name for policy name and set its ownerReferences to the owner provided.

The policy code should be a string contained in the &apos;javascript&apos; property of the resource and the current JsPolicy version is &apos;policy.jspolicy.com/v1beta1&apos;.

In order to limit which resources a particular policy will be triggered for, use the following JsPolicy resource properties.

    operations:
        An array of strings to constrain the Kubernetes CRUD operations to trigger on (any combination of &apos;CREATE&apos;, &apos;UPDATE&apos;, &apos;DELETE&apos;).

    resources:
        An array of strings to constrain the Kubernetes resource plural names to trigger on (e.g. &apos;pods&apos;, &apos;deployments&apos;, &apos;services&apos; etc.

    scope:
        A string to constrain the Kubernetes resource scope to trigger on (&apos;Namespaced&apos;, &apos;Cluster&apos;, or &apos;*&apos; for both;  defaults to &apos;*&apos;).

    apiGroups:
        An array of strings to constrain the Kubernetes API groups to trigger on (default: &apos;*&apos; matches all API groups).

    apiVersions:
        An array of strings to constrain the Kubernetes API versions to trigger on (default: &apos;*&apos; matches all API versions).

The following is a description of jspolicy functions available to call in policy code:

    mutate():
        Only available when the policy&apos;s &apos;spec.type&apos; is set to &apos;Mutating&apos;, and tells jsPolicy to calculate a patch between the original request.object and the newly passed object. As soon as mutate(changedObj) is called, execution will be stopped. JsPolicy will remember the original request.object, which means you can freely change this object within the policy and call mutate(request.object) afterwards. If the passed object and the original object do not have any differences, jsPolicy will do nothing.

    allow():
        Allows a request and terminate execution immediately. This means that statements after allow() will not be executed anymore.

    deny():
        Denies a request immediately and halts execution. You can specify a message, reason and code via the parameters, which will printed to the client. In controller policies, deny() will only log the request to the violations log of a policy.`
      },
      { role: &apos;user&apos;, content: `owner:
      apiVersion: gpt.drzzl.io/v1
      kind: GPTPolicy
      name: ${request.name}
      uid: ${request.object.metadata.uid}
      controller: true
  description:
      ${description}
  `
      },
    ],
    function_call: { name: &apos;create&apos; },
    functions: [
      {
        name: &apos;create&apos;,
        description: &apos;Creates a new resource instance of any kind in the kubernetes cluster.&apos;,
        parameters: {
          type: &apos;object&apos;,
          description: &apos;The kubernetes API resource to create in the cluster.&apos;,
          properties: {
            apiVersion: { type: &apos;string&apos; },
            kind: { type: &apos;string&apos; },
            metadata: {
              type: &apos;object&apos;,
              properties: {
                namespace: { type: &apos;string&apos; },
                name: { type: &apos;string&apos; },
              },
              required: [&apos;name&apos;],
            },
            spec: { type: &apos;object&apos; },
          },
          required: [&apos;apiVersion&apos;, &apos;kind&apos;, &apos;metadata&apos;, &apos;spec&apos;],
        }
      },
    ]
  }

  try {
    log(&apos;calling GPT&apos;)
    const resp = fetchSync(AIURL, {
      method: &apos;POST&apos;,
      headers: {
        Authorization: `Bearer ${AIKEY}`,
        &apos;Content-Type&apos;: &apos;application/json&apos;,
      },
      body: JSON.stringify(payload)
    })
    const fval = resp.json()
    log(`got create call ${JSON.stringify(fval)}`)

    if(resp.ok) {
      const policy = JSON.parse(fval.choices[0].message.function_call.arguments)
      const cresp = create(policy)
      if(!cresp.ok) {
        throw new Error(`error calling create: ${cresp.message}`)
      }
    } else {
      throw new Error(`Error response: ${resp.status}`)
    }
  } catch(err) {
    //TODO: Should we execute a chain-of-thought flow here to resolve the
    // error? Perhaps preempt the issues like the voyager team.
    log(`error making policy: ${err}`)
    requeue(&apos;requeuing for error calling gpt&apos;)
  }
    
  // Set last applied annotation on the GPTPolicy instance
  log(&apos;updating last applied annotation&apos;)
  const meta = request.object.metadata
  meta.annotations = {...meta.annotations, [LASTAPPLIED_ANNOT]: description}
  const upobj = update(request.object)

  if(!upobj.ok) {
    log(`error updating last applied: ${upobj.message}`)
    allow()
  }

  // Done with creation handling, bye
  log(&apos;done with create reconciliation&apos;)
  allow()
}

log(`no action taken`)
</code></pre><p>I have a jspolicy <code>type: Controller</code> with the above code in its <code>spec.javascript</code> property:</p><pre><code class="language-yaml">apiVersion: policy.jspolicy.com/v1beta1
kind: JsPolicy
metadata:
  name: gptpolicy-controller.drzzl.io
spec:
  type: Controller
  resources:
  - gptpolicies
  scope: Cluster
  operations:
  - CREATE
  - DELETE</code></pre><p>I&apos;d like to get some feedback added to the generation loop, maybe a chain-of-thought problem-solving flow, to catch more obvious issues, and there&apos;s also the option of providing feedback from the platform itself in response to js runtime and k8s api errors. We <em>do</em> however get k8s-powered backoff logic for free.</p><p>I would also like to expand it to targeting small reproducible steps of building the logic for the policy, similar to how Voyager builds up a tool library, so that past learning is codified in a code library the LLM can use to solve similar problems at a different level than training. At a level that the LLM controls.</p><p>This was shown to be a huge benefit to the Voyager system when paired with a higher-level goal-seeking agent.</p><p>This was the policy that took my cluster down:</p><pre><code class="language-yaml">apiVersion: gpt.drzzl.io/v1
kind: GPTPolicy
metadata:
  name: no-prod-pods-in-test-ns.drzzl.io
spec:
  description: &quot;Deny pods with the label &apos;env=prod&apos; in the &apos;test&apos; namespace.&quot;</code></pre><p>Since I run my cluster on spot instances, when the instance running the webhook backing this policy went down, no pods could be started any longer (importantly, the replacement for the webhook pod) because the policy triggered on scheduling updates and, with its webhook unreachable, denied them by default.</p><p>If I hadn&apos;t caught it when I did, once my spot instances all cycled through their lifetimes, no pods would be running any longer. In a non-spot cluster, when the instance that went down came back up, the webhook would have been started back up by its kubelet.</p>]]></content:encoded></item><item><title><![CDATA[How to Delete Your GCP Org in One Easy Step With GitOps]]></title><description><![CDATA[<p>OK, not the whole org but all of its permissions, which will usually end up being almost the same.</p><p>At my last job I was helping a small company with k8s and app management and bringing everything up to SOC2 levels of security compliance. 
One task I spent some time</p>]]></description><link>https://curiouslynerdy.com/how-to-delete-your-gcp-org-with-gitops/</link><guid isPermaLink="false">65f93daf568faa0001e00459</guid><dc:creator><![CDATA[Josh Perry]]></dc:creator><pubDate>Tue, 19 Mar 2024 20:51:10 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1638127161619-a55af33e5d2c?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDM5fHxkaXNhcHBlYXJ8ZW58MHx8fHwxNzEwODg1MzQzfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1638127161619-a55af33e5d2c?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDM5fHxkaXNhcHBlYXJ8ZW58MHx8fHwxNzEwODg1MzQzfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="How to Delete Your GCP Org in One Easy Step With GitOps"><p>OK, not the whole org but all of its permissions, which will usually end up being almost the same.</p><p>At my last job I was helping a small company with k8s and app management and bringing everything up to SOC2 levels of security compliance. One task I spent some time on was getting the GCP IAM service accounts (using workload identity) for our apps under control of our gitops repo.</p><p>The system was configured with <a href="https://cloud.google.com/config-connector/docs/overview">config connector</a> (KCC), so making declarative changes to GCP resources boiled down to a relatively simple task of managing k8s resources. 
So I began searching for kcc documentation and noticed what looked like the perfect resource: <a href="https://cloud.google.com/config-connector/docs/reference/resource-docs/iam/iampolicy">IAMPolicy</a>.</p><p>One thing I frequently do with the kcc docs is to immediately scroll to the bottom to go over the resource samples as they often have enough data to figure out what I want to do at a high level, then I scroll back up to fill out knowledge on the resource&apos;s available fields.</p><p>What I failed to notice were all the giant red warnings at the top of the page.</p><figure class="kg-card kg-image-card"><img src="https://curiouslynerdy.com/content/images/2024/03/image-2.png" class="kg-image" alt="How to Delete Your GCP Org in One Easy Step With GitOps" loading="lazy" width="748" height="361" srcset="https://curiouslynerdy.com/content/images/size/w600/2024/03/image-2.png 600w, https://curiouslynerdy.com/content/images/2024/03/image-2.png 748w" sizes="(min-width: 720px) 720px"></figure><p>Wow, ok. So now I can see why creating an <code>IAMPolicy</code> targeting the GCP project was a bad idea (a large portion of the permissions were managed via clickops)...</p><p>I pushed the first permission change and then began working on the next. A few short seconds later the first slack message came in about things not working. A lot of the app made use of GCP resources, like functions, cloud sql, and pubsub, so having the permissions disappear made it difficult to get much useful work done.</p><p>As providence would have it, my boss had used the config-connector CLI to do a <a href="https://cloud.google.com/config-connector/docs/how-to/import-export/bulk-export#exporting_an_inventory_with_config-connector">bulk export</a> of all the resources in the project and had all of the permissions saved as kcc resources. He applied them to the cluster and kcc recreated all of the project&apos;s permissions. 
We later stopped the kcc controller and removed the resources to go back to our previous posture.</p><h3 id="wat-learan">Wat Learan?</h3><p>Why tell this story? To smear myself? Well, it can happen to anybody, and thinking about how we failed and thinking about mitigations is an important part of improving the future of operating large compute systems for everyone.</p><p>So what can we learn? What could we change in processes or automation to keep similar things from happening in the future?</p><p>The gitops methodology provides too many benefits to just call for its removal, but with the power to make changes to your infra with simple git commits comes the power to unintentionally delete or irreparably alter your infra with a simple git commit.</p><p>On this front there are a few things we can do:</p><p><strong>Staging project/cluster</strong></p><p>The first obvious thing we can do is to have a staging project and cluster that we target changes at before it gets deployed on production. In this particular environment we did not have a project or cluster like this to test changes on.</p><p>Yes it would be painful to rebuild your staging environment, though the gitops methodology should make that pretty painless. But at least the only outage is internally facing and not impacting your customers.</p><p><strong>GitOps Output Linting and Analysis</strong></p><p>Doing linting or static analysis on the output of your gitops pipeline is still not something that&apos;s common. 
There <em>are</em> tools like <a href="https://www.conftest.dev/">conftest</a> which can be used to write tests against your yaml config, though I don&apos;t think there are any industry-wide test packs that enforce best practices (though there are some in the policy engine space a la <a href="https://github.com/open-policy-agent/gatekeeper-library/blob/master/library/pod-security-policy/README.md">OPA gatekeeper</a>).</p><p>This issue could have been mitigated by a test or policy that failed when an <code>IAMPolicy</code> targeting a project was committed or applied.</p><p>In fact, I asked my <a href="https://github.com/drzzlio/gitops/blob/7dd48fcf6e26fbdf04ef33e99ca53dcc751e3075/apps/gptpolicy/controller.yaml" rel="noreferrer">GPT4-based policy engine</a> to create <a href="https://github.com/drzzlio/gitops/blob/6f5b0f06c744f4370bdc760d3dcaf8c01c2a1e89/apps/gptpolicy/iampolicyproject-policy.yaml">just such a policy</a>:</p><pre><code class="language-yaml">apiVersion: gpt.drzzl.io/v1
kind: GPTPolicy
metadata:
  name: no-iampolicy-targeting-projects.drzzl.io
spec:
  description: &quot;Deny any IAMPolicy config connector resources that targets a Project&quot;</code></pre><p>This resulted in a jspolicy implementing my intent like so:</p><pre><code class="language-yaml">apiVersion: policy.jspolicy.com/v1beta1
kind: JsPolicy
metadata:
  creationTimestamp: &apos;2024-03-19T21:16:24Z&apos;
  generation: 1
  name: no-iampolicy-targeting-projects.drzzl.io
  ownerReferences:
    - apiVersion: gpt.drzzl.io/v1
      controller: true
      kind: GPTPolicy
      name: no-iampolicy-targeting-projects.drzzl.io
      uid: f5794403-f498-4ca8-a9a2-a02e838dd9a6
  resourceVersion: &apos;109696173&apos;
  uid: 385ae777-2ad8-4ed1-bbda-23c77645071d
spec:
  apiGroups:
    - configconnector.cnrm.cloud.google.com
  apiVersions:
    - &apos;*&apos;
  javascript: &gt;-
    if (request.object.spec.resourceRef.kind === &apos;Project&apos;) { deny(&apos;IAMPolicy
    cannot target a Project&apos;); } else { allow(); }
  operations:
    - CREATE
    - UPDATE
  resources:
    - iampolicies
  scope: Namespaced
type: Validating</code></pre><p>Bummer, it got the API group wrong; it should be <code>iam.cnrm.cloud.google.com</code>. But you can see how simple it can be to create a policy to protect yourself from a failure like the one I caused.</p><p>This is one reason I&apos;m a big fan of jspolicy; it&apos;s so much simpler to create and understand your policies compared to solutions like gatekeeper. As <a href="https://www.openpolicyagent.org/docs/latest/policy-language/">rego</a> was intended to be kind of a multi-modal policy language, I view it as another notch on the do-everything-tool-failed pole.</p><p><strong>Changelog</strong></p><p>Showing the committer a diff of the changes they&apos;re going to make to cluster resources is also something I haven&apos;t seen commonly implemented in gitops pipelines.</p><p>For some failure cases, and especially during local iteration, seeing the resource changes that are going to be made is incredibly helpful. This can be kind of slow, depending on the size of your repo, but a common workmode for me is to just do something like <code>kustomize build apps/yoke/overlays/test/</code> in a terminal beneath my code.</p><p>At a previous gig I had created a tool that would generate the config for a particular path (e.g. kustomize build) in your local working set, pull master to a tmp directory, generate config from that same path there, and produce a diff of the two. I&apos;ll have to remember enough to recreate it OSS.</p><p>This is one reason that some gitops practitioners will use both a source and a generated repo. 
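</p><p>Back on the policy from earlier, for reference: a hand-corrected version would look something like this (a sketch; only the api group is changed per the fix noted above, and the server-populated metadata is trimmed):</p><pre><code class="language-yaml">apiVersion: policy.jspolicy.com/v1beta1
kind: JsPolicy
metadata:
  name: no-iampolicy-targeting-projects.drzzl.io
spec:
  apiGroups:
    # corrected: IAMPolicy lives in the iam group, not configconnector
    - iam.cnrm.cloud.google.com
  apiVersions:
    - &apos;*&apos;
  javascript: &gt;-
    if (request.object.spec.resourceRef.kind === &apos;Project&apos;) { deny(&apos;IAMPolicy
    cannot target a Project&apos;); } else { allow(); }
  operations:
    - CREATE
    - UPDATE
  resources:
    - iampolicies
  scope: Namespaced
  type: Validating</code></pre><p>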
In the source repo are the helm charts, kustomizations, raw yaml, etc., and commits there get config generated by automation and committed to the generated repo, where the cluster is actually pointed; this makes it incredibly easy to diff between deploys.</p><p>For this failure case, however, I thought using this resource was reasonable, so showing me what I was changing would have only made me more confident.</p><p><strong>Backups</strong></p><p>This particular failure was solved with a backup, but it was not part of any regularly occurring automation, it was just a manual command with the output on my boss&apos; workstation.</p><p>Having regular backups, saved to a different platform/project than the one they&apos;re protecting, could be critical in reversing gitops failures like this that might not be caught by an existing test or review.</p><p><strong>Abandonment</strong></p><p>The kcc controller supports an attribute that tells it to <em>Abandon</em> a GCP resource when a kcc resource is removed from k8s, meaning it&apos;s left in place rather than deleted in kind.</p><p>This wouldn&apos;t have helped this failure, but in cases where you are managing resources which represent data not reproducible from the k8s resources themselves (like persistent volumes, kms keys, or cloudsql instances), it is best to set this attribute so that the data backed by the GCP resource can be saved against <a href="https://news.ycombinator.com/item?id=37289349">accidental kcc resource removal</a>.</p><p>A test or policy to this effect could cover all known resources where it&apos;s best to set this, and mutate it in place or fail.</p><h3 id="future-of-gitops">Future of GitOps</h3><p>I love gitops; I think it&apos;s a no-brainer. 
I would include tooling like automated IaC pipelines that are triggered by a commit; tools like atlantis, spacelift, pulumi and kcc itself (implemented mostly in terraform under the hood).</p><p>What I&apos;d like to see in the future is the community coming together and building a pack of regression tests that can be applied to output yaml to catch known failures. I think there is also room here for ML to help in detecting target states that are undesirable and rejecting them with suggestions.</p><p>I think this all may present itself eventually as a &quot;high-level language&quot; for interacting with the platform that transpiles down to k8s resources. Dealing with low-level details in configuring and managing a k8s cluster, even if the controlplane is handled for you, is still a nontrivial task that involves knowledge in storage, compute, memory, networking, and firewalls.</p><p>This isn&apos;t a level that devs can work directly against in most cases, and most of the automated tooling I mentioned requires the creation of bespoke abstraction layers to make it dev-friendly. 
In fact, in a lot of larger orgs one or more FTEs are invested in DX tooling overlaying their inhouse PaaS (k8s or otherwise) exactly because of this.</p>]]></content:encoded></item><item><title><![CDATA[GCP KMS cert-manager Issuer]]></title><description><![CDATA[<p>In my current project, the PKI hierarchy I want to create needs to be dynamic as automated and agile certificate topologies bring a lot of value to the defensive security story.</p><p>With the security hat on, I want the root of the certificate hierarchy, the root CA, rooted with its</p>]]></description><link>https://curiouslynerdy.com/gcp-kms-cert-manager-issuer/</link><guid isPermaLink="false">65f9018e568faa0001e003f7</guid><dc:creator><![CDATA[Josh Perry]]></dc:creator><pubDate>Tue, 19 Mar 2024 03:32:20 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1635237393049-55046279ebb8?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDE0fHxzZWN1cmV8ZW58MHx8fHwxNzEwODIzODQ3fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1635237393049-55046279ebb8?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDE0fHxzZWN1cmV8ZW58MHx8fHwxNzEwODIzODQ3fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="GCP KMS cert-manager Issuer"><p>In my current project, the PKI hierarchy I want to create needs to be dynamic as automated and agile certificate topologies bring a lot of value to the defensive security story.</p><p>With the security hat on, I want the root of the certificate hierarchy, the root CA, rooted with its private key in an HSM. 
Since I&apos;m running this on GKE it makes sense to leverage GCP&apos;s services here, and indeed there is a cert-manager issuer controller for Google&apos;s Private CA Service.</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://github.com/jetstack/google-cas-issuer/blob/main/README.md"><div class="kg-bookmark-content"><div class="kg-bookmark-title">google-cas-issuer/README.md at main &#xB7; jetstack/google-cas-issuer</div><div class="kg-bookmark-description">cert-manager issuer for Google CA Service. Contribute to jetstack/google-cas-issuer development by creating an account on GitHub.</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://github.githubassets.com/assets/pinned-octocat-093da3e6fa40.svg" alt="GCP KMS cert-manager Issuer"><span class="kg-bookmark-author">GitHub</span><span class="kg-bookmark-publisher">jetstack</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://opengraph.githubassets.com/5fadd50a34be403b5117a6d0fff3af58f239265f6eff04f87a10c62b352b3e30/jetstack/google-cas-issuer" alt="GCP KMS cert-manager Issuer"></div></a></figure><figure class="kg-card kg-image-card"><img src="https://curiouslynerdy.com/content/images/2024/03/image-1.png" class="kg-image" alt="GCP KMS cert-manager Issuer" loading="lazy" width="505" height="386"></figure><p>Yikes, it&apos;s not cheap, and I couldn&apos;t figure out if you can sign intermediates that reside outside their CA product (external certs <em>can</em> be imported). 
Choosing KMS can offer us much better pricing to get an HSM-backed key if we manage the CA operations ourselves, and that&apos;s a big point of using cert-manager in the first place.</p><p>Searching for &quot;cert-manager kms&quot; lands you on Skyscanner&apos;s excellent opensource Issuer controller that gives cert-manager the power to sign certs with an AWS KMS key, but I&apos;m not on AWS.</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://github.com/Skyscanner/kms-issuer"><div class="kg-bookmark-content"><div class="kg-bookmark-title">GitHub - Skyscanner/kms-issuer: KMS issuer is a cert-manager Certificate Request controller that uses AWS KMS to sign the certificate request.</div><div class="kg-bookmark-description">KMS issuer is a cert-manager Certificate Request controller that uses AWS KMS to sign the certificate request. - Skyscanner/kms-issuer</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://github.githubassets.com/assets/pinned-octocat-093da3e6fa40.svg" alt="GCP KMS cert-manager Issuer"><span class="kg-bookmark-author">GitHub</span><span class="kg-bookmark-publisher">Skyscanner</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://opengraph.githubassets.com/486c3035b0cf3999b29f2be73d94b4efa96fe4347c06cfb1743dc8512478ec44/Skyscanner/kms-issuer" alt="GCP KMS cert-manager Issuer"></div></a></figure><p>This is where the opensource spirit comes in and I make a project based on Skyscanner&apos;s. 
I initially tried to make a PR to add GCP KMS capabilities to the project, but while the code was able to act as an excellent framework for implementing an external Issuer, it&apos;s a bit too tightly woven with AWS semantics to be extended.</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://github.com/drzzlio/kms-issuer"><div class="kg-bookmark-content"><div class="kg-bookmark-title">GitHub - drzzlio/kms-issuer: GCP KMS issuer is a cert-manager Certificate Request controller that uses GCP KMS to sign the certificate request.</div><div class="kg-bookmark-description">GCP KMS issuer is a cert-manager Certificate Request controller that uses GCP KMS to sign the certificate request. - drzzlio/kms-issuer</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://github.githubassets.com/assets/pinned-octocat-093da3e6fa40.svg" alt="GCP KMS cert-manager Issuer"><span class="kg-bookmark-author">GitHub</span><span class="kg-bookmark-publisher">drzzlio</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://opengraph.githubassets.com/76d70d004bc634a4db907bad77d99b146a8d63bbc977fcb23c3f1a6fd922dd85/drzzlio/kms-issuer" alt="GCP KMS cert-manager Issuer"></div></a></figure><p>With this I can keep my root certificate long-lived and secure in a Google HSM (and managed via config connector), while keeping my intermediate and leaf certificates agile and oft updated. 
Give it a try and lmk what you think.</p><p>Here is an example deployment with config connector used to create the <code>KMS(KeyRing|CryptoKey)</code> and set up workload identity and policy to give the issuer access to run the sign operation via the GCP KMS API:</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://github.com/drzzlio/gitops/tree/master/apps/yoke/base/issuer"><div class="kg-bookmark-content"><div class="kg-bookmark-title">gitops/apps/yoke/base/issuer at master &#xB7; drzzlio/gitops</div><div class="kg-bookmark-description">Smooth is fast. Contribute to drzzlio/gitops development by creating an account on GitHub.</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://github.githubassets.com/assets/pinned-octocat-093da3e6fa40.svg" alt="GCP KMS cert-manager Issuer"><span class="kg-bookmark-author">GitHub</span><span class="kg-bookmark-publisher">drzzlio</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://opengraph.githubassets.com/a59a0a0eb144a72ae4c0ef9ed380197ae6ab7cb6c671212302e3886711725660/drzzlio/gitops" alt="GCP KMS cert-manager Issuer"></div></a></figure><p>Basically it boils down to these resources:</p><pre><code class="language-yaml">apiVersion: kms.cnrm.cloud.google.com/v1beta1
kind: KMSKeyRing
metadata:
  name: yoke
  annotations:
    cnrm.cloud.google.com/deletion-policy: &quot;abandon&quot;
spec:
  location: us-central1
---
apiVersion: kms.cnrm.cloud.google.com/v1beta1
kind: KMSCryptoKey
metadata:
  name: yokeroot
spec:
  keyRingRef:
    name: yoke
  purpose: ASYMMETRIC_SIGN
  versionTemplate:
    algorithm: RSA_SIGN_PSS_2048_SHA256
    protectionLevel: HSM
  importOnly: false
---
apiVersion: iam.cnrm.cloud.google.com/v1beta1
kind: IAMServiceAccount
metadata:
  name: yoke-kms-issuer
spec:
  displayName: yoke-kms-issuer
---
apiVersion: iam.cnrm.cloud.google.com/v1beta1
kind: IAMPolicyMember
metadata:
  name: yoke-kms-issuer-key-signerverifier
spec:
  member: serviceAccount:yoke-kms-issuer@gptops-playground.iam.gserviceaccount.com
  role: roles/cloudkms.signerVerifier
  resourceRef:
    apiVersion: kms.cnrm.cloud.google.com/v1beta1
    kind: KMSCryptoKey
    name: yokeroot
---
apiVersion: iam.cnrm.cloud.google.com/v1beta1
kind: IAMPolicy
metadata:
  name: yoke-kms-issuer-workloadidentity
spec:
  resourceRef:
    apiVersion: iam.cnrm.cloud.google.com/v1beta1
    kind: IAMServiceAccount
    name: yoke-kms-issuer
  bindings:
    - role: roles/iam.workloadIdentityUser
      members:
        - serviceAccount:gptops-playground.svc.id.goog[yoke/kms-issuer-controller-manager]
---
apiVersion: iam.cnrm.cloud.google.com/v1beta1
kind: IAMPolicyMember
metadata:
  name: yoke-kms-issuer-key-viewer
spec:
  member: serviceAccount:yoke-kms-issuer@gptops-playground.iam.gserviceaccount.com
  role: roles/cloudkms.viewer
  resourceRef:
    apiVersion: kms.cnrm.cloud.google.com/v1beta1
    kind: KMSCryptoKey
    name: yokeroot  
---
apiVersion: cert-manager.drzzl.io/v1alpha1
kind: KMSIssuer
metadata:
  name: yokeroot
spec:
  keyRef: 
    name: yokeroot
  commonName: Yoke Ephemeral Cluster Root
  duration: 87600h # 10 years
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: kms-test-ca
spec:
  isCA: true
  commonName: Test Cluster KMS CA
  secretName: test-ca-pki
  privateKey:
    algorithm: RSA
    size: 2048
  issuerRef:
    name: yokeroot
    kind: KMSIssuer
    group: cert-manager.drzzl.io</code></pre>]]></content:encoded></item><item><title><![CDATA[fly puts the bus in reverse and backs over NATS]]></title><description><![CDATA[<p>In a recent <a href="https://fly.io/blog/jit-wireguard-peers/">post on their blog</a>, fly talks about how they finally hit a scale where they couldn&apos;t fit every user&apos;s wireguard peer in their regional gateway node&apos;s kernels anymore, and then in one fell paragraph, centered on one deadly sentence, NATS finds</p>]]></description><link>https://curiouslynerdy.com/fly-puts-the-bus-in-reverse-and-backs-over-nats/</link><guid isPermaLink="false">65f36650568faa0001e0021a</guid><dc:creator><![CDATA[Josh Perry]]></dc:creator><pubDate>Thu, 14 Mar 2024 22:27:46 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1452798991096-382940996d40?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDc4fHxidXN8ZW58MHx8fHwxNzEwNDU1MzE2fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1452798991096-382940996d40?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDc4fHxidXN8ZW58MHx8fHwxNzEwNDU1MzE2fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="fly puts the bus in reverse and backs over NATS"><p>In a recent <a href="https://fly.io/blog/jit-wireguard-peers/">post on their blog</a>, fly talks about how they finally hit a scale where they couldn&apos;t fit every user&apos;s wireguard peer in their regional gateway node&apos;s kernels anymore, and then in one fell paragraph, centered on one deadly sentence, NATS finds itself HTMXized, ostensibly to the engineer&apos;s great relief.</p><blockquote><strong>Our NATS cluster was losing too many messages to host a reliable API</strong>&#x1F52B;... 
Scaling back our use of NATS made WireGuard gateways better, but still not great.</blockquote><p>Now, I&apos;ve never been good at O notating things, but the order of scales here doesn&apos;t line up in my mind. I&apos;ve personally managed NATS clusters successfully doing RPC delivering on the order of millions of messages/s.</p><blockquote>Seriously, though: you could store every WireGuard peer everybody has ever used at Fly.io in a single SQLite database, easily. What you can&#x2019;t do is store them all in the Linux kernel.</blockquote><p>To deal with delivery guarantees we added a TCP-like retry layer in our shared RPC client code (ack on msg receipt), and retries at the API gateway; for us, replacing ruby with golang for hot service paths was a much better win for reliability than replacing NATS would have been.</p><figure class="kg-card kg-image-card"><img src="https://curiouslynerdy.com/content/images/2024/03/image.png" class="kg-image" alt="fly puts the bus in reverse and backs over NATS" loading="lazy" width="509" height="513"></figure><p>Now, I don&apos;t necessarily disagree with their ultimately embracing HTTP for RPC transport. If I was building an API layer like theirs, I would probably also not use NATS for RPC again. 
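</p><p>The ack-on-receipt retry layer mentioned above is simple to sketch transport-agnostically; here <code>publish</code> and <code>wait_ack</code> are stand-in callables for illustration, not the real NATS client API:</p><pre><code class="language-python">def send_with_retry(publish, wait_ack, msg, retries=3, timeout=1.0, backoff=2.0):
    """TCP-like reliability over an at-most-once transport: resend the
    message until the receiver acks receipt, backing off between tries."""
    delay = timeout
    for attempt in range(retries + 1):
        publish(msg)
        if wait_ack(timeout=delay):
            return attempt + 1  # number of sends it took to get through
        delay *= backoff
    raise TimeoutError("no ack after {} sends".format(retries + 1))</code></pre><p>On a lossy transport the message is simply resent until the receiver&apos;s ack comes back, which is most of the delivery guarantee an RPC caller needs.</p><p>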
I had other issues with it, but none of those involved things that HTTP has necessarily solved either.</p><p>The rest of the article goes on to detail how they moved to using netlink for direct management of wireguard peer configs, and bpf and the server&apos;s private key to intercept and crack open the client&apos;s hello packet to extract its public key and 4-tuple.</p><p>Even without NATS, and with a local sqlite store, it still can&apos;t hit fast enough to respond with a peer miss; they go on to detail an interesting mitigation to the unreliable transport they&apos;d created: now having the client&apos;s identity and 4-tuple, instead of installing the peer and waiting for a retry, they instead install it as an initiator and take things into their own hands.</p><p>The fact that they were able to deliver this level of experience that other wireguard topologies handle by dropping to userspace impls is no small feat. Using the client&apos;s hello packet as something of a STUN probe is just pretty dang ingenious&#x2013;and open source nonetheless&#x2013;and something I will definitely be using.</p><h3 id="nats">NATS</h3><p>Now, NATS was the topic of this post and I&apos;m mostly writing it because I feel like there&apos;s probably pain beneath fly&apos;s decision to pull back on their deployment, and where there&apos;s pain there are lessons to be learned that shouldn&apos;t be dismissed in a sentence.</p><p>However, I&apos;m also somewhat personally interested in the decision as I&apos;m playing with using NATS jetstream as a mutation store to play the part of etcd for ephemeral k8s clusters, and the stack&apos;s got some rough spots. The kine-&gt;jetstream proxy for a single master controlplane with no workers consumes 250 millicores.</p><p>In real $, the best per-core price you can scrape by with spot instances is around $5/core/month by the time you get memory to run with it. 
So just the kine code for an idling controlplane is costing $1.25/month in compute; the 3-node NATS cluster behind it was using 150 millicores, total. As an aside, I have to say that running NATS on spot instances has been a dream.</p><p>Now, I don&apos;t think this issue can be laid at jetstream&apos;s feet; I suspect there is performance to be had in the kine code (it remarshals and compresses each mutation), and even replacing it with a low-level systems-language implementation could be a solution. Another area that could show promise is looking at the transactions that are actually flowing and seeing if there are any over-chatty clients that could be fixed.</p><p>I need idle controlplanes to basically be free, only doing essential work when it needs doing. This isn&apos;t really an operating mode most controlplane providers spend much time thinking about as they aren&apos;t really in the controlplane business; they&apos;re in the worker business.</p><p>It&apos;s still to be seen if NATS jetstream can play a part in a topology like this at scale, but the power in being able to horizontally scale a multitenant kafkaesque system that can do things like generate point-in-time snapshots of a cluster&apos;s state, and replicate it to other failure domains, is too tempting a feature not to use an event store, and I&apos;ve used other event stores.</p><p>The way the jetstream kv-store composes watches with NATS subjects is particularly interesting for controlplane management. The ability to set up targeted watches into the state changes of all clusters at this layer could scale much better than making kubeapi connections to the tenant clusters themselves.</p><p>The ability of the store to track changes over time also gives a powerful source of automation and UX data. 
Intuitive audit timelines, ML defenses against bad actors/state, time-machine style controlplane debugging/testing/fuzzing/research, etc.</p>]]></content:encoded></item><item><title><![CDATA[LLM-Powered Embodied Lifelong Learning SysOp]]></title><description><![CDATA[<p>I&apos;m going to spend the next couple kilowords talking about how I view an LLM model&apos;s view of the world, and how we might map the LLM&apos;s conceptual-space to concept-first action spaces, like k8s-backed devops, to effect a higher-order persistent learning method driven by</p>]]></description><link>https://curiouslynerdy.com/llm-powered-embodied-lifelong-learning-sysop/</link><guid isPermaLink="false">64f26d4299cf150001ddc310</guid><dc:creator><![CDATA[Josh Perry]]></dc:creator><pubDate>Sun, 11 Feb 2024 03:57:51 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1667372459510-55b5e2087cd0?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8a3VizXJuZXRlc3xlbnwwfHx8fDE3MDc2MjI5MDF8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1667372459510-55b5e2087cd0?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDF8fGt1YmVybmV0ZXN8ZW58MHx8fHwxNzA3NjIyOTAxfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="LLM-Powered Embodied Lifelong Learning SysOp"><p>I&apos;m going to spend the next couple kilowords talking about how I view an LLM model&apos;s view of the world, and how we might map the LLM&apos;s conceptual-space to concept-first action spaces, like k8s-backed devops, to effect a higher-order persistent learning method driven by experimental success and failure.</p><p>Basically, I&apos;m curious if we can get a bunch of LLM-based agents working together to create a kubernetes automation framework, via non-gradient trial and error, that you can drive with natural language requests, and I&apos;m going to be running a series of 
experiments to that end.</p><h2 id="prior-artisans">Prior Artisans</h2><p>One of my favorite AI papers is this interactive one that the authors published as a website; its goal: to explore the space of multi-agent LLM interaction models used to drive higher-order persistent learning with iteration around and through experimentation, and a clever code-based memory system, all towards the setting and reaching of long-term goals in a ruled system.</p><p>They say it best in the introduction of the paper:</p><blockquote>We introduce Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention. Voyager consists of three key components:</blockquote><blockquote>1) an automatic curriculum that maximizes exploration, </blockquote><blockquote>2) an ever-growing skill library of executable code for storing and retrieving complex behaviors, and </blockquote><blockquote>3) a new iterative prompting mechanism that incorporates environment feedback, execution errors, and self-verification for program improvement.</blockquote><blockquote>Voyager interacts with GPT-4 via blackbox queries, which bypasses the need for model parameter fine-tuning. The skills developed by Voyager are temporally extended, interpretable, and compositional, which compounds the agent&apos;s abilities rapidly and alleviates catastrophic forgetting.</blockquote><blockquote>Empirically, Voyager shows strong in-context lifelong learning capability and exhibits exceptional proficiency in playing Minecraft. It obtains 3.3x more unique items, travels 2.3x longer distances, and unlocks key tech tree milestones up to 15.3x faster than prior SOTA. 
Voyager is able to utilize the learned skill library in a new Minecraft world to solve novel tasks from scratch, while other techniques struggle to generalize.</blockquote><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://voyager.minedojo.org/"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Voyager | An Open-Ended Embodied Agent with Large Language Models</div><div class="kg-bookmark-description"></div><div class="kg-bookmark-metadata"><span class="kg-bookmark-author">An Open-Ended Embodied Agent with Large Language Models</span><span class="kg-bookmark-publisher">Guanzhi&#xA0;Wang</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://voyager.minedojo.org/assets/images/exploration_performance.png" alt="LLM-Powered Embodied Lifelong Learning SysOp"></div></a></figure><p>I&apos;m not particularly skilled at the mathematical intuition equal to implementing the tech underlying modern AI implementations, like transformers; but when they&apos;re used to generate software, an area where I have significant experience, the LLMs and I now have a relatively complex, and strict, shared conceptual model in which to poke at boundaries.</p><p>Our interaction with the world is so much in our language, it seems peculiar to find such a golem that seems to be able to easily generalize multi-modal human communication, and particularly excels when driven with conceptual constraints set in freeform human language, with all its beautiful vagaries.</p><p>Many struggles here mirror those in communicating between humans: shared context and context-free vernacular in particular, but with the addition that interactions with a statistical model feel unanchored. </p><h3 id="higher-order-conceptual-modeling">Higher Order Conceptual Modeling</h3><p>I think one thing people don&apos;t explicitly talk much about with LLMs is their multi-order conceptual capabilities. 
While saying that they&apos;re &quot;next word guessers&quot; is a somewhat technically accurate&#x2013;though often pejorative&#x2013;way techies sometimes like to describe LLMs, I think it really undersells how many levels of conceptual context lie beneath each &quot;guess&quot;.</p><p>Each token the model operates on is represented as a vector in a space that often has many thousands of dimensions. These degrees of freedom allow every token to represent an incredibly complex conceptual representation where those concepts draw on the training and usage context. Ignoring things like homonyms, any instance of a word&#x2013;like &quot;apple&quot;&#x2013;can mean the same thing in every case while also being used in one of an infinite set of possible contexts.</p><p>The vectors don&apos;t necessarily encode the context directly, but the higher-order conceptual space that each context exists in. For example, when the word &quot;apple&quot; is found in a sentence, its usage context could map to concepts like computers, corporations and fruit in a way that a particular embedding might conceptually encode a context like &quot;a computer corporation that uses the apple fruit as its logo&quot; for that single word.</p><p>The magic in transformers layers on top as something of a derivative of this conceptual space; it&apos;s a higher-order space that allows encoding for concepts around the context of the word in relation to what you might consider &quot;time&quot; from the perspective of the words in a sentence. 
Maybe the word &quot;apple&quot; gets encoded to a vector that means something like &quot;the first part in a metaphor using the comparison with a dissimilar object to ultimately convey the futility of comparing two very dissimilar things, with a fruit specifically not used as the comparable for added hyperbole&quot;.</p><p>I think that what we might call an LLM&apos;s &quot;intelligence&quot; is an emergent property of operating in the space of this complex encoding of rich conceptual worlds that allows the model to &quot;learn&quot; the use of words in the source material from many levels and perspectives simultaneously.</p><p>This is the space in which it was discovered language translation could be solved as a 2-step mapping through such a model: from language A to concept-space, then from concept-space to language B.</p><h3 id="paradox">Paradox</h3><p>A seeming paradox then lies in the fact that the model can show a strong ability to converge on a large number of high-level conceptual models, while at the same time being able to show zero ability to converge on conceptual models derived purely from that same higher order set, but in ways that did not appear in the training set. The uneasiness that I called &quot;unanchored&quot; is often described by users in terms of common sense: &quot;it&apos;s really smart, but it seems to be lacking common sense&quot;.</p><p>I think instead that this particular uneasiness is rooted in interacting with something that seems to have intelligence, but lacks any capability to use deductive reasoning; this is a really foreign concept to us as capability in deductive reasoning seems to scale with intelligence everywhere in the natural world. 
This supposition seems supported by the fact that deductive logic tests, like &quot;<a href="https://news.ycombinator.com/item?id=39303642" rel="noreferrer">door with &apos;push&apos; in mirrored writing</a>&quot;, are a favorite when new models drop, precisely because people find it fascinating how poorly LLMs do with simple variations on puzzles well-represented in the training set.</p><p>So in a way, hallucinations or creativity are derived from noise in the model of conceptual encodings; most likely in concepts where there is little or no correlation of particular concepts represented in the training set; where even almost imperceptible input on the training side, or even just plain coincidence, can drive resolution of a particular signal from this noise; distracting from the fact that it can&apos;t actually reason. Because the hallucination is a probability-based choice across the concept space, it will fit the usage in so many of the ways we expect that it becomes extremely difficult to detect the lies in their subtlety, their multidimensional perfection.</p><h3 id="formalism">Formalism</h3><p>If I was a mathematician, this is probably where I would create a new penrose-style diagram that shows multidimensional token-concept convergence in a really interesting and meaningful way.</p><p>It&apos;s important to note that this view of LLMs is purely my headcanon, my theory of mind on the topic. 
It&apos;s most likely wildly different from what&apos;s actually happening under the covers, but I&apos;m translating this description from embeddings in my own mind&apos;s conceptual weightings (or lack thereof, as the case may be).</p><p>I&apos;ll even drop a conspiracy theory here: I think the only reason we&apos;ve seen models leak is that those dropping &quot;open&quot; models have gotten good enough to do conceptual elision with pretty tight accuracy and nuance; frighteningly, most call this ability to just wipe a concept from existence &quot;alignment&quot;.</p><h3 id="memories-and-language">Memories and Language</h3><p>With theorizing over the LLM&apos;s conceptual-level operation behind us, we can start making hypotheses, beyond hunchwork, about how we can best exploit and extend it.</p><p>In the Voyager paper, one point that the authors are particularly proud of is the code-based skills library that the LLMs have written for themselves, which generalizes between new worlds and can be persisted and easily audited by humans. The ineffectiveness of the Voyager model sans the ability to persist skills puts a particularly sharp point on its importance.</p><p>LLMs are notorious for forgetting things, and there are many strategies that those in the industry use to address this: growing context windows, persisting important context, and information databases like a vectordb over API docs. There is a constant tension (and a lot of thought and tech) around keeping important things in context while not blowing the size limit.</p><p>In Voyager, a few agents interact with each other to iterate on deciding what to do, writing code (javascript targeting the mineflayer library), and mutating it as feedback on effectiveness comes in from the language interpreter and/or the game. 
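</p><p>To make the shape of that loop concrete, here is a minimal sketch of the propose/write/refine cycle; this is purely my own illustration, not the paper&apos;s code, and every name in it (<code>propose_task</code>, <code>write_skill</code>, <code>run_in_env</code>) is a hypothetical stand-in for an LLM or environment call:</p>

```python
# Toy sketch of a Voyager-style propose/act/refine loop.
# The "agents" are plain functions standing in for LLM calls.

def propose_task(skills):
    # Curriculum agent: pick a task not yet in the skills library.
    for task in ["mine_log", "craft_table", "smelt_iron"]:
        if task not in skills:
            return task
    return None

def write_skill(task, feedback):
    # Coding agent: emit code for the task, revised per feedback.
    return f"def {task}(): ...  # revision {len(feedback)}"

def run_in_env(code):
    # Environment/interpreter: here, succeed only after one revision.
    return "revision 0" not in code

def voyager_loop():
    skills = {}
    while (task := propose_task(skills)) is not None:
        feedback = []
        while True:
            code = write_skill(task, feedback)
            if run_in_env(code):
                skills[task] = code  # persist the verified skill
                break
            feedback.append("execution failed")
    return skills
```

<p>The important property is the last step of the inner loop: a skill only enters the library once the environment has accepted it, which is what makes the library a memory of verified solutions rather than of guesses.</p>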
This feedback loop, together with persona instruction, is intended to drive the agents to converge towards parameterized task-level generalization (matching the function-based skills library well).</p><h3 id="freedom-of-conceptual-association">Freedom of Conceptual Association</h3><p>I think one important aspect of the Voyager architecture is that the LLM-based agents themselves get full control over coordinating the mapping between human language and code language, or between the concept and action spaces, if you will. By coordinating through human language, consuming human language infobases, and getting feedback in human language, they decide which functions to make, which to refactor, and then decide when and which of them to call based on the task at hand.</p><p>The conceptual mapping between human language and the code of the library is accomplished by storing embeddings (concept space) of human language descriptions of each function into a vector DB. Importantly, however, the concepts distilled from the descriptions were themselves communicated by an agent asked to write a human language description given only the function&apos;s code.</p><p>The code&apos;s description, and the conceptual representation stored in the vector DB, form a tight analogy to our own process of learning about a library&apos;s functions via documentation; it is the rare developer who can recall documentation for a library verbatim, but what <em>is</em> common is the ability to recollect its conceptual model. Even in cases where conceptual recollection itself is too weak to make direct use of, it <em>is</em> strong enough to effect a single-shot search of a human language infobase.</p><p>The first place I remember seeing something in this vein was an interesting article a while back about some research being done at &quot;the artificial intelligence nonprofit lab founded by Elon Musk and Y Combinator president Sam Altman&quot;. 
They had a number of AI agents given sometimes-conflicting goals, together with a channel over which to communicate; the language used over the channel was left up to the AIs.</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://www.vox.com/2017/3/23/14962182/ai-learning-language-open-ai-research"><div class="kg-bookmark-content"><div class="kg-bookmark-title">These AI bots created their own language to talk to each other</div><div class="kg-bookmark-description">A next step in the development of artificial intelligence.</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://cdn.vox-cdn.com/uploads/hub/sbnu_logo_minimal/441/touch_icon_iphone_retina_1000_yellow.755.png" alt="LLM-Powered Embodied Lifelong Learning SysOp"><span class="kg-bookmark-author">Vox</span><span class="kg-bookmark-publisher">April Glaser</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://cdn.vox-cdn.com/thumbor/JBflpeK14GZM2EbFa9qYaEzvqn4=/0x104:663x451/fit-in/1200x630/cdn3.vox-cdn.com/uploads/chorus_asset/file/8207155/Screen_Shot_2017_03_22_at_4.20.53_PM.png" alt="LLM-Powered Embodied Lifelong Learning SysOp"></div></a></figure><p>This stuck with me as an example of how interesting adaptive capabilities can emerge in multi-agent AI systems, particularly when <em>they</em> get to set the rules of conceptual mapping. The fact that the Voyager skills library is ultimately stored in a language which humans can grok is at least as interesting as seeing what Musk and Altman&apos;s AIs had to say to each other.</p><h3 id="play-with-my-cluster">Play with my Cluster</h3><p>If there&apos;s one thing we&apos;ve learned in ML, it&apos;s that strategies that look like experimentation, evolution, and play are extremely powerful, and this case is no different. 
Even though Voyager does not use gradient-based training or fine-tuning, we can see that the persistent, goal-driven, feedback-style learning loop in Voyager, paired with thoughtful context control and infobase references, mimics the effects of reinforcement learning in an important way.</p><p>In Voyager, the agents have what we could consider a higher-order persistent memory of solutions they have derived in the action space, indexed by conceptual needs they have encountered in the concept space. This is similar to how a neural net&apos;s weights are used to configure input-to-output correlations during the training process, with reference to a reward function, and are retained as its &quot;memory&quot; for use during inference.</p><p>So while you would definitely not want to release an unskilled agent on your production cluster, human or not, you probably wouldn&apos;t be so against it skilling up against a test environment with a high-quality curriculum and a rubric heavily weighted towards execution.</p><p>In the end, after a lot of trial and error, a lot of play, you may end up excitedly welcoming a highly-skilled (highly cloneable) agent to take intelligent control of your cluster.</p><h2 id="future-of-sysops">Future of SysOps</h2><p>In the world of operations, devops, SRE, whateveryoucallit, the job is exactly to create dynamic systems which can, at some level, operate themselves given conceptual-level expectations from an upstream human.</p><p>We spend a lot of time drawing from, and compromising between, a number of large and general conceptual domains to ultimately craft our own, more focused, Frankenstein&apos;s concept model of something like a PaaS. 
The goal for this constrained, collated, and aggregated model is to let our users communicate their intent more simply and expressively across the myriad complex higher-level domains, while only needing to speak in terms of our ostensibly simpler conceptual model.</p><p>For a long time, the industry stagnated on a set of stale tools which fail pretty badly at being dynamic, being autonomous, and bridging the gap to unknowledgeable users. While these tools gave those who understand the internals of distributed infra more productivity through concept modeling, most systems lacked much automation beyond that triggered by human action, relied too much on unreliable and nondeterministic state mutation, and did little to pierce the 4th wall into devland.</p><p>Kubernetes gave us a breath of fresh air, with its community and piles of code paired with support from most hosting providers; it dominates share of mind and workload, and brought with it a few attributes important to building both the conceptual and feedback-driven executive aspects of our ideal infra models.</p><p>While having this interface (the operator model in particular is brilliant) does give us much more capability to define models of what infrastructure means to us, it&apos;s still a lot like using assembly language to write a GUI app. 
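</p><p>To make &quot;the operator model&quot; concrete for anyone who hasn&apos;t used it: an operator pairs a custom resource type (a conceptual model of some piece of infrastructure) with an in-cluster controller that continuously reconciles reality toward each declared instance. The sketch below is entirely hypothetical; the group, kind, and fields are invented for illustration:</p>

```yaml
# A hypothetical custom resource: one team's conceptual model of "a web app".
# An operator watches for WebApp objects and reconciles Deployments,
# Services, Ingresses, etc. to match the declared intent.
apiVersion: platform.example.com/v1
kind: WebApp
metadata:
  name: storefront
spec:
  image: registry.example.com/storefront:1.4.2
  replicas: 3
  ingress:
    host: shop.example.com
```

<p>The user speaks only in terms of the WebApp concept; the controller translates that intent into the lower-level domains for them.</p>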
While experts benefit greatly from the addition of information-driven autonomy and deep platform abstractions, we persist in a world where the conceptual domains exposed by the platform are <em>still</em> too complex to directly interact with, sometimes even for those with the commensurate knowledge.</p><h3 id="the-beginning-of-the-future">The Beginning of the Future</h3><p>I have a number of ideas, and have even done some experimentation already, but before I share any of it, and to reify in my own mind what I&apos;m looking at, I needed to land this brain dump.</p><p>I think that one of the most powerful aspects of LLMs is their ability to shift between conceptual representations, and with conceptual computing already becoming popular, this pairing could be a harbinger of a major shift in HCI. Many areas of the industry are experimenting with HCI-through-LLM right now, but I think devops in particular is primed for the integration by virtue of its concept-focused interfaces.</p><p>Please stay tuned for experiments around how we can teach LLMs to do k8s sysoping for great good.</p>]]></content:encoded></item><item><title><![CDATA[NixOS + SteamVR + OpenXR + Godot]]></title><description><![CDATA[<p>I&apos;m in love with VR and with nix. I want to do my godot and other VR dev in Linux with the ability to easily share the output. 
SteamVR itself has a pretty simple usability story on nixos as of late and can be installed/configured basically by</p>]]></description><link>https://curiouslynerdy.com/nixos-steamvr-openxr/</link><guid isPermaLink="false">64517dee11c91c0001d6c2b4</guid><dc:creator><![CDATA[Josh Perry]]></dc:creator><pubDate>Tue, 02 May 2023 22:55:50 GMT</pubDate><media:content url="https://curiouslynerdy.com/content/images/2023/05/74478272-c95e-4557-a9fc-7240a18bc9d8.png" medium="image"/><content:encoded><![CDATA[<img src="https://curiouslynerdy.com/content/images/2023/05/74478272-c95e-4557-a9fc-7240a18bc9d8.png" alt="NixOS + SteamVR + OpenXR + Godot"><p>I&apos;m in love with VR and with nix. I want to do my godot and other VR dev in Linux with the ability to easily share the output. SteamVR itself has a pretty simple usability story on nixos as of late and can be installed/configured basically by adding <code>programs.steam.enable = true;</code> to your system derivation.</p><p>The system I&apos;m integrating has an nvidia 1080ti, and I&apos;m currently building nixos from the 22.11 branch, with godot_4 from unstable.</p><p>My primary use case here is supporting godot 4 VR dev, with the ability to simply hit the play button for quick iteration. This is working now; however, there are two changes that need to be made to the default install to get it working.</p><p>First, we need our SteamVR to support OpenXR, as this is the future of VR dev on godot and it allows us to easily target non-steam headsets without steam being installed on the user&apos;s system. SteamVR is interestingly implemented as a steam &quot;<a href="https://steamdb.info/app/250820/">app</a>&quot; itself and uses the standard steam installation and upgrade flows that games themselves use.</p><p>Currently, the only version of SteamVR that supports OpenXR (at least on Linux, AFAIK) is in the beta release channel. 
To install this version, find SteamVR in your library, right-click on it, select &quot;Properties&quot;, then &quot;Betas&quot; on the left, and choose &quot;beta&quot; from the dropdown list to install it onto your system.</p><figure class="kg-card kg-image-card"><img src="https://curiouslynerdy.com/content/images/2023/05/image.png" class="kg-image" alt="NixOS + SteamVR + OpenXR + Godot" loading="lazy" width="834" height="307" srcset="https://curiouslynerdy.com/content/images/size/w600/2023/05/image.png 600w, https://curiouslynerdy.com/content/images/2023/05/image.png 834w" sizes="(min-width: 720px) 720px"></figure><p>The integration with OpenXR is implemented at two important points. The OpenXR spec provides a <a href="https://github.com/KhronosGroup/OpenXR-SDK-Source/blob/main/specification/loader/runtime.adoc#linux-active-runtime-location">well-known location</a> for configuring OpenXR-compatible runtime libraries on a Linux system by specifying the config at <code>$XDG_CONFIG_HOME/openxr/1/active_runtime.json</code>. When initializing an OpenXR session, this file is read to determine which dynamic library to load into the process. This config is automatically dropped by the SteamVR install process.</p><p>On my machine this file looks like this:</p><figure class="kg-card kg-code-card"><pre><code class="language-json">{
  &quot;file_format_version&quot; : &quot;1.0.0&quot;,
  &quot;runtime&quot; : {
    &quot;VALVE_runtime_is_steamvr&quot; : true,
    &quot;library_path&quot; : &quot;/home/josh/.local/share/Steam/steamapps/common/SteamVR/bin/linux64/vrclient.so&quot;,
    &quot;name&quot; : &quot;SteamVR&quot;
  }
}</code></pre><figcaption><p><span style="white-space: pre-wrap;">~/.config/openxr/1/active_runtime.json</span></p></figcaption></figure><p>The library referenced is an implementation of the Khronos Group&apos;s OpenXR interface specification for some particular vendor/stack (e.g. <a href="https://monado.dev/">monado</a> being an OSS impl).</p><p>In godot 4, OpenXR functionality is included by default and does not need to be added as a <a href="https://steamcommunity.com/sharedfiles/filedetails/?id=2234933158">separate plugin</a> like in godot 3. However, because of how nixos handles dynamic dependencies, when loading this library we run into a number of failures in the godot logs (run it from the terminal to see them on stdout):</p><pre><code>Error [GENERAL | xrEnumerateInstanceExtensionProperties | OpenXR-Loader] : RuntimeInterface::LoadRuntime skipping manifest file /home/josh/.
config/openxr/1/active_runtime.json, failed to load with message &quot;libSDL2-2.0.so.0: cannot open shared object file: No such file or director
y&quot;</code></pre><p>Because this library is installed as a steam app, the nixos machinations needed to map dependencies are not applied automatically. We ultimately need to manually patch this library so that it can find the necessary dependencies.</p><p>We&apos;re assisted by a nix package called <a href="https://github.com/NixOS/nixpkgs/blob/nixos-22.11/pkgs/games/steam/fhsenv.nix#L300">steam-run</a>, a derivation that builds a virtual FHS filesystem with all the libraries and other deps necessary to run steam games linked in, effectively giving them the impression they&apos;re running on a standard Linux distro.</p><p>The <code>steam-run</code> command in this package is set up to run apps in this FHS environment so that they don&apos;t need to be patched themselves. Unfortunately, the nixos godot package is not able to run in this FHS env because it needs a number of deps not available there; it simply fails to start if you try to <code>steam-run godot</code>.</p><p>If you&apos;re familiar with nix, you&apos;re probably also familiar with the <code>runpath</code> setting in ELF files (the format used for executables and library files on Linux). This allows specifying path(s) where runtime dependencies should be searched for. For example, if <code>runpath</code> on the above library were set to <code>/usr/lib64</code>, the runtime linker would look for the SDL lib at <code>/usr/lib64/libSDL2-2.0.so.0</code>.</p><p>If we run <code>ldd</code> on the steam OpenXR library, we can see its dependencies, where the dynamic linker found them, and those it could <em>not</em> find:</p><pre><code class="language-sh">$ ldd $VRCLIENT
        linux-vdso.so.1 (0x00007ffdcef82000)
        libSDL2-2.0.so.0 =&gt; not found
        libGL.so.1 =&gt; not found
        librt.so.1 =&gt; /nix/store/vnwdak3n1w2jjil119j65k8mw1z23p84-glibc-2.35-224/lib/librt.so.1 (0x00007f16af3b4000)
        libdl.so.2 =&gt; /nix/store/vnwdak3n1w2jjil119j65k8mw1z23p84-glibc-2.35-224/lib/libdl.so.2 (0x00007f16af3af000)
        /nix/store/vnwdak3n1w2jjil119j65k8mw1z23p84-glibc-2.35-224/lib64/ld-linux-x86-64.so.2 (0x00007f16afa73000)
        libpthread.so.0 =&gt; /nix/store/vnwdak3n1w2jjil119j65k8mw1z23p84-glibc-2.35-224/lib/libpthread.so.0 (0x00007f16af3aa000)
        libstdc++.so.6 =&gt; not found
        libm.so.6 =&gt; /nix/store/vnwdak3n1w2jjil119j65k8mw1z23p84-glibc-2.35-224/lib/libm.so.6 (0x00007f16af2c8000)
        libc.so.6 =&gt; /nix/store/vnwdak3n1w2jjil119j65k8mw1z23p84-glibc-2.35-224/lib/libc.so.6 (0x00007f16af0bf000)
</code></pre><p>You can see there are a number of deps that cannot currently be found, though this is missing &quot;deps-of-deps&quot; (e.g. SDL2 depends on libX11). This could get really messy, as the ultimate tree of dependencies can be quite large, and patching every dependency would be a non-starter.</p><p>Luckily, all of these missing deps, already patched with the proper runpaths, are linked into the steam-run FHS env, so basically all we need to do is set the runpath for the OpenXR interface lib to load the libraries from this FHS. We can do this by using the nix tool called <code>patchelf</code> to modify the library itself.</p><p>I currently do this with the following script:</p><pre><code class="language-sh">#!/usr/bin/env bash
VRCLIENT=~/.local/share/Steam/steamapps/common/SteamVR/bin/linux64/vrclient.so
STOREPATH=$(nix-store -qR `which steam` | grep steam-fhs)/lib64
patchelf --set-rpath $STOREPATH $VRCLIENT</code></pre><p>This finds the <code>steam-fhs</code> path in the nix store that the current steam is run from. Once this rpath is set, we can run <code>ldd</code> again to take a look at the fully resolved dependency list, and at this point hitting the play button in godot should work properly (start SteamVR first, as it won&apos;t auto-start like on Windows).</p><pre><code class="language-sh">$ ldd $VRCLIENT
        linux-vdso.so.1 (0x00007ffcaaab0000)
        libSDL2-2.0.so.0 =&gt; /nix/store/6sdhss95xxd674hrlm2b6qvm5bbnrkz7-steam-fhs/lib64/libSDL2-2.0.so.0 (0x00007ff3d3a8c000)
        libGL.so.1 =&gt; /nix/store/6sdhss95xxd674hrlm2b6qvm5bbnrkz7-steam-fhs/lib64/libGL.so.1 (0x00007ff3d39fe000)
        librt.so.1 =&gt; /nix/store/6sdhss95xxd674hrlm2b6qvm5bbnrkz7-steam-fhs/lib64/librt.so.1 (0x00007ff3d39f9000)
        libdl.so.2 =&gt; /nix/store/6sdhss95xxd674hrlm2b6qvm5bbnrkz7-steam-fhs/lib64/libdl.so.2 (0x00007ff3d39f4000)
        /nix/store/vnwdak3n1w2jjil119j65k8mw1z23p84-glibc-2.35-224/lib64/ld-linux-x86-64.so.2 (0x00007ff3d4367000)
        libpthread.so.0 =&gt; /nix/store/6sdhss95xxd674hrlm2b6qvm5bbnrkz7-steam-fhs/lib64/libpthread.so.0 (0x00007ff3d39ef000)
        libstdc++.so.6 =&gt; /nix/store/6sdhss95xxd674hrlm2b6qvm5bbnrkz7-steam-fhs/lib64/libstdc++.so.6 (0x00007ff3d37d7000)
        libm.so.6 =&gt; /nix/store/6sdhss95xxd674hrlm2b6qvm5bbnrkz7-steam-fhs/lib64/libm.so.6 (0x00007ff3d36f7000)
        libc.so.6 =&gt; /nix/store/6sdhss95xxd674hrlm2b6qvm5bbnrkz7-steam-fhs/lib64/libc.so.6 (0x00007ff3d34ee000)
        libX11.so.6 =&gt; /nix/store/33qdhi8l6f4ixqzdc387w9gwdxrdsara-libX11-1.8.4/lib/libX11.so.6 (0x00007ff3d33a7000)
        libXext.so.6 =&gt; /nix/store/7wvl0fsdjf225qfkm55x8clwcbmx6mvn-libXext-1.3.4/lib/libXext.so.6 (0x00007ff3d3392000)
        libXcursor.so.1 =&gt; /nix/store/zp53b74y1a633ln8911306k72fr66da4-libXcursor-1.2.0/lib/libXcursor.so.1 (0x00007ff3d3383000)
        libXi.so.6 =&gt; /nix/store/4fadc9ggmmy1lm260lazra2b3is9ivfv-libXi-1.8/lib/libXi.so.6 (0x00007ff3d336f000)
        libXfixes.so.3 =&gt; /nix/store/fl75671jh474li11v36ar7305z1a4mzm-libXfixes-6.0.0/lib/libXfixes.so.3 (0x00007ff3d3367000)
        libXrandr.so.2 =&gt; /nix/store/c9c4xh22pfnyr2aravx7v4rvvznmcxcl-libXrandr-1.5.2/lib/libXrandr.so.2 (0x00007ff3d335a000)
        libXss.so.1 =&gt; /nix/store/1w8nic2428ppr4v3q5xhnqn0zqgp7i8f-libXScrnSaver-1.2.3/lib/libXss.so.1 (0x00007ff3d3355000)
        libGLX.so.0 =&gt; /nix/store/1yllc6r36zxgxmjmn2kqcs2vjqhlvyl9-libglvnd-1.5.0/lib/libGLX.so.0 (0x00007ff3d331f000)
        libGLdispatch.so.0 =&gt; /nix/store/1yllc6r36zxgxmjmn2kqcs2vjqhlvyl9-libglvnd-1.5.0/lib/libGLdispatch.so.0 (0x00007ff3d3267000)
        libgcc_s.so.1 =&gt; /nix/store/6plx60y4x4q2lfp6n7190kaihyxr7m1w-gcc-11.3.0-lib/lib/libgcc_s.so.1 (0x00007ff3d324d000)
        libxcb.so.1 =&gt; /nix/store/i9hwmlk4va2dxcca00cmy9vy38d3f5l1-libxcb-1.14/lib/libxcb.so.1 (0x00007ff3d3222000)
        libXrender.so.1 =&gt; /nix/store/rrs3p13wgrlp82i41cksgbcyfri8k6yg-libXrender-0.9.10/lib/libXrender.so.1 (0x00007ff3d3213000)
        libXau.so.6 =&gt; /nix/store/wmpqgysa9qmm6gr9smn3wrmnz2wr0pf5-libXau-1.0.9/lib/libXau.so.6 (0x00007ff3d320e000)
        libXdmcp.so.6 =&gt; /nix/store/89xgh7cmxmzclkpci1v4zbfr1idg0ha4-libXdmcp-1.1.3/lib/libXdmcp.so.6 (0x00007ff3d3206000)

</code></pre><p>I&apos;m not sure if there&apos;s a way to hook the SteamVR installation process to handle this automatically, so for now I need to run this script anytime SteamVR receives an update. One last caveat: this is on an X11 system, and I have no clue if it would work properly on Wayland.</p>]]></content:encoded></item><item><title><![CDATA[The Atlantic-Sized Hole between docker and k8s]]></title><description><![CDATA[<p>One place where I&apos;ve been really unhappy with the UX of modern container-based SDLC pipelines is the creation of the images themselves. Docker was a disruptor; they were first to bring a (mostly) effective UX to the concept of &quot;containers&quot;. But it hasn&apos;t kept</p>]]></description><link>https://curiouslynerdy.com/the-atlantic-sized-hole-docker-k8s/</link><guid isPermaLink="false">644388ea11c91c0001d6bd9a</guid><dc:creator><![CDATA[Josh Perry]]></dc:creator><pubDate>Sun, 23 Apr 2023 02:18:17 GMT</pubDate><content:encoded><![CDATA[<p>One place where I&apos;ve been really unhappy with the UX of modern container-based SDLC pipelines is the creation of the images themselves. Docker was a disruptor; they were first to bring a (mostly) effective UX to the concept of &quot;containers&quot;. But it hasn&apos;t kept up. 
The dockerfile methodology has always felt halfway between cache-efficient and delivery-efficient, ending up not particularly efficient at either.</p><p>One of the most glaring compromises I <a href="https://github.com/prodatakey/panamax/tree/master/experiments/singleLayerId">ran into</a> while testing how the content-based layer IDs work in docker is that just the act of building an image causes the mtimes on the files in the container image to be set to the current build time: this effectively makes every layer in the build impure, rendering the hash useless as a measurement.</p><p>At the end of the day it works, and to management, shipping is really one of the only reliable SDLC metrics most invest in understanding. But security shifts ever more left as upper management is forced, ostensibly by laws created in the face of actual breaches, to address issues like supply-chain attacks; with a dearth of software security professionals at their disposal, even more of the weight of provenance-style concerns lands on the developers.</p><p>I&apos;ve spent a lot of time thinking about how the complexity of the developer UX impacts their ability to build sound solutions in the kubernetes infra space and, empirically, it seems that most devs don&apos;t know how to build full-stack, and in most cases will probably never have the time to become experts in it (e.g. down to configuring netpol and HPA). This is where the ops part of devops usually starts to gel out as an individual specialization in a lot of orgs.</p><p>Similarly, a developer doing <code>yarn install</code> does not have the expertise to build secure infrastructure around ensuring the deps in their <code>yarn.lock</code> file, or the <code>FROM</code> in their dockerfile, have not been compromised. And these are just the security-critical aspects. 
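</p><p>The mtime impurity mentioned above is easy to reproduce without docker at all: a content-based layer ID is just a hash over the layer tarball&apos;s bytes, and tar records mtimes. A toy sketch (my own illustration, not docker&apos;s actual code path):</p>

```python
import hashlib
import io
import tarfile

def layer_digest(content: bytes, mtime: int) -> str:
    # Build an uncompressed tar "layer" in memory and hash its bytes,
    # roughly how a content-based layer ID is derived from a tarball.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(name="app/hello.txt")
        info.size = len(content)
        info.mtime = mtime  # the impurity: build time leaks in here
        tar.addfile(info, io.BytesIO(content))
    return hashlib.sha256(buf.getvalue()).hexdigest()

same_bytes = b"identical file content\n"
digest_a = layer_digest(same_bytes, mtime=1_000_000)
digest_b = layer_digest(same_bytes, mtime=2_000_000)  # a later "build"
# identical content, different digests: the hash no longer measures
# what is actually in the layer
```

<p>Two builds of byte-identical content yield two different digests purely because of the timestamps, which is exactly why the hash stops being useful as a measurement of content.</p>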
The dockerfile method of building images has many other landmines that are not security-critical but are huge productivity killers (cf. caching), and they are very difficult to handle properly for a single image, let alone a whole org.</p><h2 id="case-study">Case Study</h2><figure class="kg-card kg-code-card"><pre><code class="language-dockerfile"># escape=`

# Use the latest Windows Server Core image with .NET Framework 4.8.
FROM mcr.microsoft.com/dotnet/framework/sdk:4.8-windowsservercore-ltsc2019

# Restore the default Windows shell for correct batch processing.
SHELL [&quot;cmd&quot;, &quot;/S&quot;, &quot;/C&quot;]

# Use temp dir for environment setup
WORKDIR C:\TEMP

# Download the Build Tools bootstrapper.
ADD https://aka.ms/vs/16/release/vs_buildtools.exe vs_buildtools.exe

# Install Build Tools with msvc excluding workloads and components with known issues.
RUN vs_buildtools.exe --quiet --wait --norestart --nocache `
    --installPath C:\BuildTools `
    --add Microsoft.VisualStudio.Workload.VCTools --includeRecommended `
    --add Microsoft.VisualStudio.Component.VC.ATL `
    --remove Microsoft.VisualStudio.Component.Windows10SDK.10240 `
    --remove Microsoft.VisualStudio.Component.Windows10SDK.10586 `
    --remove Microsoft.VisualStudio.Component.Windows10SDK.14393 `
    --remove Microsoft.VisualStudio.Component.Windows81SDK `
 || IF &quot;%ERRORLEVEL%&quot;==&quot;3010&quot; EXIT 0

#ENV chocolateyVersion 0.10.3
ENV ChocolateyUseWindowsCompression false

# Set your PowerShell execution policy
RUN powershell Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Force

# Install Chocolatey
RUN powershell -NoProfile -ExecutionPolicy Bypass -Command &quot;iex ((New-Object System.Net.WebClient).DownloadString(&apos;https://chocolatey.org/install.ps1&apos;))&quot; &amp;&amp; SET &quot;PATH=%PATH%;%ALLUSERSPROFILE%\chocolatey\bin&quot;

# Install Chocolatey packages
RUN choco install git.install bzip2 -y &amp;&amp; choco install cmake --version 3.17.2 --installargs &apos;ADD_CMAKE_TO_PATH=System&apos; -y

# Copy dependency setup script into container
COPY .ci\depsetup.ps1 depsetup.ps1

# Downloaded dependency versions
ENV MATHFU_VERSION master
ENV CEF_VERSION cef_binary_81.3.10+gb223419+chromium-81.0.4044.138_windows64

# Install dependencies
RUN powershell.exe -NoLogo -ExecutionPolicy Bypass .\depsetup.ps1

# Change workdir to build workspace
WORKDIR C:\workspace

# Define the entry point for the docker container.
# This entry point starts the developer command prompt and launches the PowerShell shell.
ENTRYPOINT [&quot;C:\\BuildTools\\Common7\\Tools\\VsDevCmd.bat&quot;, &quot;&amp;&amp;&quot;, &quot;powershell.exe&quot;, &quot;-NoLogo&quot;, &quot;-ExecutionPolicy&quot;, &quot;Bypass&quot;]</code></pre><figcaption>Dockerfile</figcaption></figure><figure class="kg-card kg-code-card"><pre><code class="language-powershell">######
# Install project dependencies
######

echo &quot;MATHFU_VERSION: $Env:MATHFU_VERSION&quot;
echo &quot;CEF_VERSION: $Env:CEF_VERSION&quot;

Push-Location $Env:USERPROFILE

# Get mathfu
git clone --recursive https://github.com/joshperry/mathfu.git
Push-Location mathfu
git checkout &quot;$Env:MATHFU_VERSION&quot;
Pop-Location

setx MATHFU_ROOT &quot;$Env:USERPROFILE\mathfu&quot;

# Get cef
$CEF_VERSION_ENC=[uri]::EscapeDataString($Env:CEF_VERSION)
Invoke-FastWebRequest -URI &quot;http://opensource.spotify.com/cefbuilds/$CEF_VERSION_ENC.tar.bz2&quot; -OutFile &quot;$Env:USERPROFILE\$Env:CEF_VERSION.tar.bz2&quot;
bunzip2 -d &quot;$Env:CEF_VERSION.tar.bz2&quot;
tar xf &quot;$Env:CEF_VERSION.tar&quot;
Remove-Item &quot;$Env:CEF_VERSION.tar&quot; -Confirm:$false

setx CEF_ROOT &quot;$Env:USERPROFILE\$Env:CEF_VERSION&quot;

Pop-Location</code></pre><figcaption>depsetup.ps1 (minus Invoke-FastWebRequest for core compat)</figcaption></figure><figure class="kg-card kg-code-card"><pre><code class="language-psh"># Generate visual studio sln
.\gen_vs2019.bat cibuild

# Execute the build
Push-Location cibuild
cmake --build . --config Release
Pop-Location</code></pre><figcaption>build.ps1</figcaption></figure><p>That&apos;s a lot of ish to just drop on you, I know, but it&apos;s important to the narrative I&apos;m building, even if it isn&apos;t really important to understand in detail.</p><p>This is the containerized build setup for a VR project I&apos;ve been working on. I know it&apos;s a lot of windowese (heck, I don&apos;t even have powershell syntax highlighting on the site), but though the tooling differs, it mirrors mostly the same process on other OSs like Linux.</p><p>This is fresh in my mind because I&apos;ve been working on moving this project over to Linux at the same time that I happened to be doing a nix+nixos immersion. I <em>am</em> going to do a comparison; it feels to me as stark a shift as the move from VMs to containers+k8s. Keep in mind as well that the above leaves out the manual process of building the zmq libs ahead of time, as well as vendored deps that I used to check into the repo (I just never got around to automating it, it&apos;s a pain).</p><figure class="kg-card kg-code-card"><pre><code class="language-nix">{
  description = &quot;A flake for building ovrly&quot;;

  inputs.nixpkgs.url = &quot;github:nixos/nixpkgs/nixos-22.11&quot;;

  inputs.cef.url = &quot;https://cef-builds.spotifycdn.com/cef_binary_111.2.7+gebf5d6a+chromium-111.0.5563.148_linux64.tar.bz2&quot;;
  inputs.cef.flake = false;

  inputs.mathfu.url = &quot;git+https://github.com/google/mathfu.git?submodules=1&quot;;
  inputs.mathfu.flake = false;

  inputs.openvr.url = &quot;github:ValveSoftware/openvr&quot;;
  inputs.openvr.flake = false;

  inputs.flake-utils.url = &quot;github:numtide/flake-utils&quot;;

  inputs.nixgl.url = &quot;github:guibou/nixGL&quot;;
  inputs.nixgl.inputs.nixpkgs.follows = &quot;nixpkgs&quot;;

  outputs = { self, nixpkgs, cef, mathfu, openvr, flake-utils, nixgl }:
    flake-utils.lib.eachSystem [ &quot;x86_64-linux&quot; ] (system:
    let
      pkgs = import nixpkgs { inherit system; overlays=[ nixgl.overlay ]; };
      deps = with pkgs; [
  # cef/chromium deps
        alsa-lib atk cairo cups dbus expat glib libdrm libva libxkbcommon mesa nspr nss pango
        xorg.libX11 xorg.libxcb xorg.libXcomposite xorg.libXcursor xorg.libXdamage
        xorg.libXext xorg.libXfixes xorg.libXi xorg.libXinerama xorg.libXrandr
  # ovrly deps
        cppzmq fmt_8 glfw nlohmann_json spdlog zeromq
      ];

      ovrlyBuild = pkgs.stdenv.mkDerivation {
          name = &quot;ovrly&quot;;
          src = self;

          nativeBuildInputs = with pkgs; [
            cmake
            ninja
            autoPatchelfHook
          ];

          buildInputs = deps;

          FONTCONFIG_FILE = pkgs.makeFontsConf {
            fontDirectories = [ pkgs.freefont_ttf ];
          };

          cmakeFlags = [
            &quot;-DCEF_ROOT=${cef}&quot;
            &quot;-DMATHFU_DIR=${mathfu}&quot;
            &quot;-DOPENVR_DIR=${openvr}&quot;
            &quot;-DPROJECT_ARCH=x86_64&quot;
            &quot;-DCMAKE_CXX_STANDARD=20&quot;
          ];
        };
      dockerImage = pkgs.dockerTools.buildImage {
        name = &quot;ovrly&quot;;
        config =
          let
            FONTCONFIG_FILE = pkgs.makeFontsConf {
              fontDirectories = [ pkgs.freefont_ttf ];
            };
          in
          {
            Cmd = [ &quot;${ovrlyBuild}/bin/ovrly&quot; ];
            Env = [
              &quot;FONTCONFIG_FILE=${FONTCONFIG_FILE}&quot;
            ];
          };
      };
    in {
      packages = {
        ovrly = ovrlyBuild;
        docker = dockerImage;
      };
      defaultPackage = ovrlyBuild;

      devShell = pkgs.mkShell {
        nativeBuildInputs = [ pkgs.cmake ];
        buildInputs = deps;

        # Exports as env-vars so we can find these paths in-shell
        CEF_ROOT=cef;
        MATHFU_DIR=mathfu;
        OPENVR_DIR=openvr;

        cmakeFlags = [
          &quot;-DCMAKE_BUILD_TYPE=Debug&quot;
          &quot;-DCEF_ROOT=${cef}&quot;
          &quot;-DMATHFU_DIR=${mathfu}&quot;
          &quot;-DOPENVR_DIR=${openvr}&quot;
          &quot;-DPROJECT_ARCH=x86_64&quot;
          &quot;-DCMAKE_VERBOSE_MAKEFILE:BOOL=ON&quot;
          &quot;-DCMAKE_CXX_STANDARD=20&quot;
        ];
      };
    }
  );
</code></pre><figcaption>flake.nix</figcaption></figure><p>I notably also do not have syntax highlighting for nix (at time of writing).</p><p>I do want to talk about the nix language for a couple of paragraphs, but even without understanding the above, I want to call out some highlights in <em>my</em> mind.</p><p>Besides replacing all of the above scripts (which may have been Makefile+bash on Linux), this provides the following features:</p><ul><li>hermetic builds: flake.lock, custom canonicalizing archive format (nar), all output file times set to epoch+1, $HOME == &quot;/homeless-shelter&quot;, etc.</li><li>build-time deps: a number of deps are already in the nix store, some already compiled; for the rest, using flake inputs basically just puts them in a folder for you and gives you the path (amazing). I was able to move <strong>all</strong> dep handling to the flake in 3 or 4 iterations.</li><li>runtime deps: after the build, the build scripts will find all files referenced from the store by any of the outputs (false positives really can&apos;t happen because of the hashes in the paths). It considers these the runtime deps and automatically sets up the dependency tree for them!</li><li>container image: this can build a container image with <code>nix build .#docker</code>, including all the deps, WITHOUT DOCKER! It links a tar.gz of the layer stack (in docker save format) as <code>./result</code>; use <code>docker load &lt; ./result</code> to get it into your docker image store. I&apos;ll talk more about this, but this is close to the #1 wow.</li><li>nix develop: the <code>devShell</code> derivation is realized when you run <code>nix develop</code> in the same dir as the flake. This puts you into a nix shell with all the deps and tooling in the path. 
All of the tooling exposed by <code>mkDerivation</code> for its automatic build process is also available: run <code>cmakeConfigurePhase</code> and it will create <code>./build</code>, enter it, and use your <code>CMakeLists.txt</code> and cmake to fill it with either a makefile or ninja project ready to build. Running <code>buildPhase</code> in the shell will actually build the project. The ease of iteration and the amount of data and configurability available here is really outstanding.</li><li>nix build: Running this will build the flake and link <code>./result</code> to the completed build output folder in the nix store. In this case, <code>./result/bin/ovrly</code> will run the OpenGL VR application.</li><li>content-addressed: because of the hermetic builds we get support for creating a fully content-addressed provenance stack. We can prove where every single component came from in every build, down to the bytes of the source code. :googly-eyes:</li></ul><p>All of this together, the nix language itself notwithstanding, is honestly pretty mindbending for me. Even just typing this out for the first time makes it hit harder than the impact of actually using it to solve a problem.</p><h2 id="bridging-the-divide">Bridging the Divide</h2><p>The building of a docker image is no small thing. I have to say that it is something that we were struggling to solve at my last dayjob. We had a pretty great CI/CD system using gitlab+gitops+k8s to keep applications moving out to prod with pretty high frequency. In the shadow of supply-chain attacks like <a href="https://www.sans.org/blog/what-you-need-to-know-about-the-solarwinds-supply-chain-attack/">solarwinds</a> (our network guys are pros and they had our instance well isolated) we were very cognizant of our need to address our code provenance.</p><p>We had begun scanning all containers as they landed in our registry and used that tooling to generate software BOMs as well as threat scores. 
What we were still in the planning phases on was the nonrepudiation and triageability of the layers in our docker stacks.</p><p>All of our projects were using <code>FROM</code> to reference images directly upstream from dockerhub et al. This setup is ripe for zeroday supply-chain attacks on the registry that your scanner won&apos;t pop on, and is exactly why we were working to address it.</p><p>Our plan of attack at that point was to create blessed base images that projects could derive <code>FROM</code>. The devops/security team would be in charge of providing these images, which should most likely also share a common base, through one or more layers. This would give us a single point of touch to address issues in all downstream projects.</p><p>We could flag base image builds in our scanner (also affecting their ability to run in prod) and force the CD pipeline to do a new build with the fixed base immediately, including automated deployment to production.</p><p>We already had a number of package and project-level prophylactics for attacks on registries like ruby gems or pip, but these require complex deployments that understand applications syntactically at the deployment level. This is one place where containers excel in assisting in final package measurements.</p><p>This left one big elephant on the table: integration of our CI and devops systems. At the time we were still running all of our build machines as either VMs or, increasingly, bare metal. We already had a bunch of baremetal k8s nodes, we had a dictum to run prod builds in our secure prod environment, and we wanted to consolidate our runtime infrastructure management under k8s.</p><p>This is the direction we were heading, with docker image builds causing quite a roadblock. Because building docker images requires mounting filesystems to create and manipulate the overlays, it necessarily needs to run as root. 
This is ostensibly the job of the docker daemon, though giving a process access to the daemon also gives it easy escalation to run arbitrary code as root.</p><p>Kubernetes is moving towards a user-namespaced future, but we&apos;re not there yet. We couldn&apos;t allow build scripts that run on arbitrary commits to have access to the docker daemon on the hosts executing production workloads. There are other solutions, like a separate cluster, affinity, running microVMs, podman, buildah and the like, but none are great solutions, particularly in the face of the dockerfile method&apos;s poor approach to provenance.</p><p>If we&apos;re going to rewrite the build system to better... support provenance, I don&apos;t think it&apos;s out of the question to look at solutions on the fringe of the space to see what&apos;s available to salve what ails us.</p><h2 id="a-new-standard">A New Standard?</h2><p>Nix itself is in some ways unapproachable: the documentation lags and has holes, the shift to flakes muddies the future, and functional programming isn&apos;t well understood even in the ranks of the general dev pool. It&apos;s not a silver bullet by any means, but it has even more power than I let on here.</p><p>If you&apos;re familiar with nix then you&apos;ll know how the option system works, and how the different phases of making the derivation and then realizing it jibe. For those unfamiliar, I&apos;d like to put my own color on something that I see as a linchpin.</p><p>You have not only the ability to define a language for building, packaging, and deploying your application with strong provenance, but you also have the ability to define a language for configuring the application. 
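</p><p>As a minimal sketch of what that could look like (this uses the NixOS module option system; the service and option names here are illustrative, not from any real project):</p><figure class="kg-card kg-code-card"><pre><code>{ lib, ... }: {
  # Declare a small, typed configuration surface for a hypothetical app.
  options.services.myapp = {
    enable = lib.mkEnableOption &quot;the myapp service&quot;;
    logLevel = lib.mkOption {
      type = lib.types.enum [ &quot;debug&quot; &quot;info&quot; &quot;warn&quot; ];
      default = &quot;info&quot;;
      description = &quot;Runtime log verbosity.&quot;;
    };
  };
}</code></pre><figcaption>a hypothetical typed-config module</figcaption></figure><p>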
That configuration can then additionally play an important part in the provenance of the total application and its deployment.</p><p>Something that tumbles in my mind any time I&apos;m slogging through editing one of the millions of config files on a Linux system (whether directly or through a change management system like puppet or ansible) is &quot;man, we really need a standard for application configuration&quot;. And then in the next beat an image of the <a href="https://xkcd.com/927/">apropos</a> always-relevant xkcd pops into my head.</p><figure class="kg-card kg-image-card"><img src="https://curiouslynerdy.com/content/images/2023/04/image.png" class="kg-image" alt loading="lazy" width="500" height="283"></figure><p>While this may be yet another standard, another <a href="https://xkcd.com/676/">abstraction</a>, in the pursuit of the one to bind them all, I feel like the drip this capability puts on top of everything I&apos;ve laid out above really sells it.</p><p>This is already beginning to become quite long-winded, and without a proper primer on nix it would be difficult to go much deeper into how this shifts things. But I do want to end with a quick look at where we might go from here with an eye towards devX.</p><p>When I let my imagination run a bit, it immediately bears much fruit. I&apos;ll share one such line and leave the others for future discussion and your own rumination.</p><p>We can build binaries with their deps, package dynamic languages with theirs, and define derivations that instantly give us a well-defined dev environment; why can&apos;t we have a derivation that builds the k8s deployment manifests? 
Well, great minds usually think alike, and I&apos;m by no means an early adopter of the ways of nix: <a href="https://github.com/hall/kubenix">https://github.com/hall/kubenix</a>.</p><p>I know many who decry their newfound status as yaml jockeys; I could see any number of lispers or schemers jumping at the chance to throw functions at the problem rather than tabs (or was it spaces...)</p>]]></content:encoded></item><item><title><![CDATA[fly.io replaces Nomad with NIH]]></title><description><![CDATA[<p>The newtech, selfhosted, cloud, IaaS provider fly.io released a <a href="https://fly.io/blog/carving-the-scheduler-out-of-our-orchestrator">blog post</a> last month detailing the escapade.</p><p>I came upon this post as I&apos;ve been recently looking through the landscape of [I|P]aaS providers. I&apos;m simple, when I see &quot;replacing Nomad&quot;, I click;</p>]]></description><link>https://curiouslynerdy.com/fly-io-replaces-nomad-with-homegrown/</link><guid isPermaLink="false">640c28b711c91c0001d6bc32</guid><dc:creator><![CDATA[Josh Perry]]></dc:creator><pubDate>Sat, 11 Mar 2023 08:11:48 GMT</pubDate><content:encoded><![CDATA[<p>The newtech, selfhosted, cloud, IaaS provider fly.io released a <a href="https://fly.io/blog/carving-the-scheduler-out-of-our-orchestrator">blog post</a> last month detailing the escapade.</p><p>I came upon this post as I&apos;ve been recently looking through the landscape of [I|P]aaS providers. I&apos;m simple, when I see &quot;replacing Nomad&quot;, I click; wasn&apos;t disappointed with the quality or the jab at Hashicorp&apos;s OSS-incompatible pricing.</p><p>Interestingly the post starts with something I think would be amazing interview fodder, and great content for deeply learning a programming language in a distributed env or as a youtube series. 
A good half of the post starts us out with a well-done intro to orchestration and scheduling, including architecture, theory, and (a little) code; love it.</p><p>The first part of the second half, now, is a play-by-play shit smearfest on k8s and transitively Nomad through the family tree to their shared ancestor: borg. Much is laid out in the area of why the borg-a-like schedulers are fundamentally broken in our glorious synchronous FaaS future.</p><p>Ostensibly this was all a setup to justify their decision to NIH both of them <em>and</em> Consul, as described in the final quarter by effectively pointing up-post and describing how they actually implemented their mid-level dev whiteboard problem.</p><p>I landed on this post ultimately starting from a recent <a href="https://devclass.com/2023/03/07/fly-io-ceo-says-reliability-not-great-as-platform-suffers-scaling-issues/">DevClass article</a> about how fly is struggling to fix their platform and scale, partly at the hands of hordes of devs looking for a one-click PaaS to run their stricken free-tier heroku apps. I was less surprised, in retrospect (thinking back to the article), that they were having the problems they were describing.</p><blockquote>We&#x2019;re in an awkward phase where the company isn&#x2019;t quite mature enough to support the infrastructure we need to deliver a good developer UX</blockquote><p>No time for that when the engineers are half-solving the interesting orchestrator, scheduler, distributed discovery, network mesh, secret storage, cloud DB, diskdev, and executor problems whole cloth.</p><blockquote>gossip-based consistency is a difficult problem</blockquote><p>What do you do when your global service catalog becomes corrupted because of a bug?</p><blockquote>We&#x2019;ve pushed the platform past what it was originally built to do</blockquote><p>I look back at my previous assertion that implementing a stack like this from scratch is a very tough value prop to sell. 
It&apos;s a solved lego problem, and building the fly.io UX would arguably have been a better place to expend limited startup resources while also not being harangued by an inoperative platform.</p><p>The kubernetes scheduler, for example, can be customized with webhooks, making it an ideal strategy playground for unorthodox scheduling. With <a href="https://kubernetes.io/docs/concepts/containers/runtime-class/">RuntimeClass</a>, custom executors can be quickly experimented with while binding at the pod level.</p><p>Now obviously k8s can&apos;t be the solution to everything... Yes</p>]]></content:encoded></item><item><title><![CDATA[Why Kubernetes is the Linux of the Future]]></title><description><![CDATA[<p>I wrote a large diatribe attempting to extol the virtues of kubernetes in an internal document, trying to sell it off as the perfect abstraction layer. The Linux kernel itself is the OS&apos;s arbiter between the hardware and software worlds and in fact plays an integral part of</p>]]></description><link>https://curiouslynerdy.com/why-kubernetes/</link><guid isPermaLink="false">63dc3c78f16825000198f532</guid><dc:creator><![CDATA[Josh Perry]]></dc:creator><pubDate>Tue, 07 Mar 2023 23:12:28 GMT</pubDate><media:content url="https://curiouslynerdy.com/content/images/2023/02/8b42a0ca-image1.png" medium="image"/><content:encoded><![CDATA[<img src="https://curiouslynerdy.com/content/images/2023/02/8b42a0ca-image1.png" alt="Why Kubernetes is the Linux of the Future"><p>I wrote a large diatribe attempting to extol the virtues of kubernetes in an internal document, trying to sell it off as the perfect abstraction layer. The Linux kernel itself is the OS&apos;s arbiter between the hardware and software worlds and in fact plays an integral part of the container runtime. 
I view the job of k8s as similar to the kernel&apos;s, but targeting some hypothetical world computer.</p><p>Not necessarily a world-size number of computers, but a unit of compute scheduling above that of a physical system. As we&apos;ve (the computing public) moved our isolation capabilities to finer grains, from the machine, to the vm (many/machine), to the process (2-3x many/machine), there has also been significant movement on the opposite side of the spectrum.</p><p>Like the Linux kernel presents compute task, network interface, and block storage abstractions over physical hardware, kubernetes provides compute task, CNI, and CSI abstractions that are aware of the distributed system and expose the capabilities of the cluster to applications via a stable interface.</p><p>While our requirements may steer certain workloads into certain physical locations, the software itself should be able to be mostly agnostic to where it runs (think 12-factor). Things like multi-arch container images make this even more of a reality, as we see a coming proliferation in available target architectures and the already huge success of the armv7+arm64+x86_64 initiatives.</p><p>I have witnessed a small but vocal group pushing a &quot;kubernetes is complex&quot; view on the world. While it is true that kubernetes can be complex, that complexity comes necessarily from its attempt at abstracting over complexities. 
When the complexities beneath the abstraction have problems, you definitely want someone who understands the distributed systems.</p><p>But with a goal to make that abstraction more and more robust, there is a foreseeable future where the complexities no longer protrude; where the kubernetes kernel is as interface-stable as the Linux kernel, and the need to delve beyond the veil is seen to be as complex a thing as debugging a kernel driver, isolated to the world of those who relish such things.</p><p>The alternative, as I have seen, is either another abstraction a la openstack (expensive), or a small part of a disparate and lacking (and usually expensive) amalgamation of what&apos;s in kubernetes proper (in something like Hashicorp).</p><p>I don&apos;t want to love kubernetes (I&apos;m often something of a contrarian), but in my personal experience using k8s and containerization, I feel it is as big a shift in our industry as VMs were. But as I opined in my previous post, the k8s interface is far from the end of our journey; while I feel like investing in a greenfield project to compete with k8s is a bad bet, the place to invest resources is above the k8s abstractions.</p><p>There is so much inertia in the kubernetes project and those surrounding it that building on top of it in some way has been the only reasonable path to take, and is one that the cloud providers themselves have taken. There is <em>so</em> much value that is yet to be added on top that I&apos;d consider some additions on top of a bare cluster to be requirements (we&apos;ll talk about these in the future) a la GNU to Linux&apos;s kernel.</p><p>So why is k8s the Linux of the future?</p><p>It&apos;s stable under high dev velocity, performant, exposes stable resource abstractions, and the API and controller concept are brilliantly executed. 
All of this together gives us a highly extensible agent-driven execution environment with a unit of compute larger than a single host, one that <em>should</em> be capable of spanning the world. Above that is a target-rich environment for providing value in automation, processes, and UX that can potentially make it &quot;easy to use&quot;.</p><p>This is the year of desktop kubernetes</p>]]></content:encoded></item><item><title><![CDATA[Addicted to Kubernetes]]></title><description><![CDATA[<p>I&apos;ve spent the last year working at my current gig with a talented team to make a huge move from onprem kvm virtual machines and puppet, to containers and kubernetes deployed to an onprem cloud. Before that I spent the better part of a decade building a containers-to-the-edge</p>]]></description><link>https://curiouslynerdy.com/addicted-to-kubernetes/</link><guid isPermaLink="false">63ccfeb5f16825000198f2d2</guid><dc:creator><![CDATA[Josh Perry]]></dc:creator><pubDate>Sun, 22 Jan 2023 10:51:50 GMT</pubDate><content:encoded><![CDATA[<p>I&apos;ve spent the last year working at my current gig with a talented team to make a huge move from onprem kvm virtual machines and puppet, to containers and kubernetes deployed to an onprem cloud. Before that I spent the better part of a decade building a containers-to-the-edge IoT startup on arm and kubernetes. 
It&apos;s been my purpose for a long time now.</p><p>At the current gig, by this point we have a large portion of our most important apps deployed exclusively in containers, including some written in ruby, jruby, elixir, golang, rust, java, javascript, C++, and lots of yaml.</p><p>We&apos;ve deployed interesting things like custom operators to sync istio services with consul, filesystem providers based on e2e encrypted ceph with s3 and snapshots, network plugins for transparent IP connectivity to VMs and metal, mesh networks to remove encryption and routing concerns from apps, and we deployed buttloads of proxies to bridge the old with the new.</p><p>I&apos;ve learned a lot while being addicted. I&apos;d like to share some of the ups and downs, pros and cons, alongside some tips and tricks we&apos;ve learned up to this point.</p><blockquote class="kg-blockquote-alt">TLDR: Yes, it&apos;s worth the work.</blockquote><p>There is some peripheral machinery that needs to be managed in-house in order to create an onprem cloud with a similar surface area to the big providers, while being efficient in cost of time and resources. However, the difficulty seems to be inversely proportional to the height of the house of cards.</p><p>One of our goals in this endeavor was to be cloud ready, ready to run our workloads on clusters on the cloud providers for scalability and isolation. With GCP DCs in the same cities as ours, we&apos;ve been seriously eyeing the low-latency connectivity and autoscaling clusters.</p><p>This axis of the requirements has been easy to pin with the well-adopted abstractions and automatability provided by kubernetes and the projects of the extraordinary community around it.</p><p>Education, though, is the first thing that I think needs addressing.</p><p>This has been the most difficult aspect of it for me; I need to become a better teacher. 
I&apos;ve failed a lot here, and learning from the failures has been difficult; I wouldn&apos;t even say the project was wholly a success at this point, or that it won&apos;t yet fail under its own weight. I can just say that it was more than fit for the intent with which we built it.</p><p>Access to wide and deep training on all the pieces is one of the most exciting parts of the kubernetes ecosystem; there is abundant information not only in trainings, but in huge numbers of blogs and videos.</p><p>While automation can make clusters seem magical in the happy path, when shit goes wrong, why is someone trying to tcpdump a pod? Handling the exceptional aspects, as is so often the case, is a long-tailed dragon; having well-trained people is the only way to wrangle it.</p><p>In the ops back office, one of the most painful aspects is that complex systems are complex; when things break, people are needed who understand the layers beneath the yaml. Even if there were service contracts, there&apos;s often little time to ring up a vendor to troubleshoot your env in realtime.</p><p>On the frontend, we can&apos;t expect a developer to write calico network policy and istio virtual services, or wire up their canary deploys to an automatic analysis stage and tie their saturation metric up to a horizontal pod autoscaler.</p><p>To do SDLC in the vein of CICD at scale, having full-stack aware team members is a really big ask, even as shifting everything from ops to security leftward is in vogue. 
I don&apos;t know the best method for scaling the mindshare of the stack to the different sets of C-level, planning, platform, operator, QA, and dev kinds of people, but it&apos;s a knowledge space that&apos;s ripe for my experience-by-failure M.O.</p><p>There are problems here that can definitely be solved with more abstractions, but the abstractions always become Landru to our Beta III.</p><p>There are times I wonder if the premium cost of the cloud&apos;s promise to scale the hands-on-keyboard-to-server ratio is worth it. Is leaving the layer beneath the yaml in the hands of the providers as big a win as they charge for? This same thought crosses the mind of many a tech executive looking to do more with less.</p><p>I am a strong proponent of onprem operations at scale, but whether the workloads at my current gig make a swift shift to the cloud remains to be seen. Our team has unfortunately done little as of yet in the way of moving our workloads into a cloud provider region. I look forward to seeing what the future holds here.</p><p>You hear an oft-cited refrain echoing down the halls of the colo datacenters about how the cloud is just someone else&apos;s computers. Can we take control of our computers back and bring on a new dawn of the internet PC? I think we maybe already have.</p><p>What do you do if you&apos;re addicted to your higher power?</p>]]></content:encoded></item></channel></rss>