I am a fan of straightforward declarative infrastructure, and for smoothing the interface between "hardware" infra tools like Terraform and orchestrating workloads on top of that hardware, I have never used a tool I enjoy as much as kubernetes.

RabbitMQ is another project that I've enjoyed using in many systems. My goal here is to see how simply we can take the two of them and build an easily-scalable messaging cluster.

Unlike some clustered workloads, RabbitMQ turns out to be rather pleasant to cluster on k8s without much voodoo. Much of that pleasantness can be attributed to rmq's erlang underpinnings, which provide robust inter-machine process handling over IP.

Auto Cluster Formation

To build an rmq cluster, each node needs to know which other hosts are supposed to be in the cluster with it, and needs a way to resolve those peer hostnames to IPs (and the reverse).

For rmq discovery, the project already provides a great inbuilt plugin that sources peer information from the k8s API (which is backed by etcd).

For connectivity, the flat pod IP space in k8s is perfect, and its DNS service provides an almost purpose-made tool that keeps cluster member entries in sync on the fly.

Permissions

For security reasons, any kind of daemon should usually have as little access to its environment as practicable.

The peer discovery plugin requires that the pod it runs in has rights to query the k8s endpoints resource. Permissions for a pod are rooted in a ServiceAccount, which is then bound to one or more Roles to grant specific rights.

---

# For automatic cluster formation, members of the rmq cluster need access
# to query k8s for other pods in the cluster

# Attached to cluster member pods
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rabbitmq

---

# Constrained to read operations on the endpoints resource
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: rmq-endpoints-reader
rules:
- apiGroups: [""]
  resources: ["endpoints"]
  verbs: ["get"]
  
# NB: This permission cannot be constrained to enumerating only
# other rmq instances.

---

# Bind the role to the service account
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: rmq-member
subjects:
- kind: ServiceAccount
  name: rabbitmq
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: rmq-endpoints-reader

DNS Setup

Once the pods have the rights to discover which other hosts are expected to be in the cluster with them, the erlang runtime's connectivity requires that cluster nodes have DNS resolution properly configured. Clever use can be made of DNS, already the k8s-preferred discovery method, to provide this service.

Because the cluster pods will be managed by a k8s StatefulSet, each pod is guaranteed to be assigned a well-known sequential hostname (i.e. rabbitmq-0, -1, -2). This guarantee is leveraged to make erlang happy by exporting the hostname and IP of each cluster pod to kube-dns using the, somewhat obscure, headless service.

The name of this service needs to be provided to the discovery plugin via the $K8S_SERVICE_NAME pod environment variable. It uses the name to query the k8s API for information on all pods selected by the service.

A normal k8s service provides load balanced connectivity to a group of pods via a single cluster-wide IP and DNS name. A headless service creates DNS entries for the hostname of each selected pod and its IP.

---

kind: Service
apiVersion: v1
metadata:
  name: rabbitmq
  labels:
    app: rabbitmq
spec:
  clusterIP: None # We need a headless service to allow the pods to discover each
  selector:       # other during autodiscover phase for cluster creation.
    app: rabbitmq # A ClusterIP will prevent resolving dns requests for other pods
  ports:          # under the same service.
   - name: http
     protocol: TCP
     port: 15672
   - name: amqp
     protocol: TCP
     port: 5672
   - name: metrics
     protocol: TCP
     port: 9419

RMQ Node ConfigMap

A k8s ConfigMap is used to feed the rabbitmq daemon in each cluster pod a bootstrap configuration which mainly defines peer discovery and connectivity settings.

The config first tells the cluster formation system that it should discover peers using the rabbit_peer_discovery_k8s plugin, and that the kubernetes API is available at DNS name kubernetes.default.svc.cluster.local.

Because erlang cluster members need a stable identifier--something that k8s pod IPs are not--the configuration is set to use hostnames instead.

A number of other small timeout and limit tweaks are also included; they can be researched at the documentation URLs cited in the config comments.

---

apiVersion: v1
kind: ConfigMap
metadata:
  name: rabbitmq-config
data:
  enabled_plugins: |
      [rabbitmq_management,rabbitmq_peer_discovery_k8s].
  rabbitmq.conf: |
      ## Cluster formation. See http://www.rabbitmq.com/cluster-formation.html to learn more.
      cluster_formation.peer_discovery_backend  = rabbit_peer_discovery_k8s
      cluster_formation.k8s.host = kubernetes.default.svc.cluster.local
      ## Should RabbitMQ node name be computed from the pod's hostname or IP address?
      ## IP addresses are not stable, so using [stable] hostnames is recommended when possible.
      ## Set to "hostname" to use pod hostnames.
      ## When this value is changed, so should the variable used to set the RABBITMQ_NODENAME
      ## environment variable.
      cluster_formation.k8s.address_type = hostname
      ## How often should node cleanup checks run?
      cluster_formation.node_cleanup.interval = 30
      ## Set to false if automatic removal of unknown/absent nodes
      ## is desired. This can be dangerous, see
      ##  * http://www.rabbitmq.com/cluster-formation.html#node-health-checks-and-cleanup
      ##  * https://groups.google.com/forum/#!msg/rabbitmq-users/wuOfzEywHXo/k8z_HWIkBgAJ
      cluster_formation.node_cleanup.only_log_warning = true
      cluster_partition_handling = autoheal
      ## See http://www.rabbitmq.com/ha.html#master-migration-data-locality
      queue_master_locator=min-masters
      ## See http://www.rabbitmq.com/access-control.html#loopback-users
      loopback_users.guest = true
      ## Startup delays mostly unnecessary with statefulset
      cluster_formation.randomized_startup_delay_range.min = 0
      cluster_formation.randomized_startup_delay_range.max = 2

Stateful Set

A StatefulSet in k8s allows binding attributes, storage, and some actions to a group of pods in deterministic and dependent ways.

In the case of this cluster, sequential startup and shutdown ordering and sequentially assigned hostnames allow the cluster peer discovery and erlang connectivity to work out-of-the-box. The deterministically bound storage is also critical for the rmq cluster's data persistence.

The primary process in this pod is the rmq daemon, but because monitoring matters, the pod also runs an instance of rmq-exporter to expose performance stats for scraping.

Storage and Configuration

A persistent volume claim template causes the creation of a persistent volume for storing the mnesia database of each instance. Each of the volumes is bound to the particular pod instance--and deterministic hostname--it was created for, thanks to the stateful set.

The bootstrap config map files are mounted at /etc/rabbitmq/, TCP ports are exposed, and liveness and readiness probes are set up so k8s can monitor the runtime health of the rmq daemon.

Environment

Several pieces of instance-specific metadata are injected into the pod's environment; the most important are used to configure the discovery service and to derive the hostname and DNS domain of the pod. Take a look at the comments in the yaml for detail on these variables.
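As a sketch, this is how the node name comes together from those variables. The values below are assumptions matching the manifests in this post (pod rabbitmq-0 in the default namespace); at runtime k8s injects the real ones.

```shell
# Sketch of how RABBITMQ_NODENAME is derived from the pod metadata.
# Values assume the manifests in this post, deployed to "default".
MY_POD_NAME="rabbitmq-0"
MY_POD_NAMESPACE="default"
K8S_SERVICE_NAME="rabbitmq"

# Per-pod FQDN, courtesy of the headless service's DNS records
FQDN="${MY_POD_NAME}.${K8S_SERVICE_NAME}.${MY_POD_NAMESPACE}.svc.cluster.local"

# erlang node name used by the rmq daemon
RABBITMQ_NODENAME="rabbit@${FQDN}"
echo "${RABBITMQ_NODENAME}"
```

This is the same string the RABBITMQ_NODENAME entry in the StatefulSet below builds with `$(VAR)` substitution.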

Security

The serviceAccountName of the pod is set to the name of the service account created for peer discovery rights.

The cookie value in a k8s secret named rmq-secrets is exported to rmq via the pod's environment and is used by erlang as a shared secret to authenticate peers.

I like to use this command to generate a random erlang cookie on linux:

head -c127 </dev/urandom|xxd -p -u -c 255 | tr -d '\n' | base64 -w0

---

apiVersion: v1
kind: Secret
metadata:
  name: rmq-secrets
type: Opaque
data:
  # generated with head -c127 </dev/urandom|xxd -p -u -c 255 | tr -d '\n' | base64 -w0
  cookie: QkJEQTQ1RDYwRjQxMDYxNzE1RDU4RUQ0NEUyRDA1QkFDNzNEMTlDMEIyQUUzNkFCQ0RGQTJEM0Y0N0Y4MDA0QzQ1NEQwRDYxQkRGMzZBQUIxMjYyMEVDN0FCQkM2MjYyQjE5ODQyOTU4MkI0MjY3QjhGQ0E0RkRBN0IyMDREQzJEN0EyMDJGMzQ2ODhDRDQ4QkE4ODc5MjdFNTE4OEYxNDk4MTE3QjFBNURDNkI0RkVGM0VBNEQ0RDFGRDg4Mjc3QzdDOTI5M0QwNzY3NzBDOUIzNjY0RERGMkQxREJCQzYzODhEOEFBRjI0M0Q4NjY4MTExNEE5MzY5Q0Y5QTM=
---

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rabbitmq
spec:
  serviceName: rabbitmq
  # Number of rmq nodes in the cluster
  replicas: 3
  
  selector:
    matchLabels:
      app: rabbitmq

  # Give each node 5Gi of persistent storage for its mnesia database
  volumeClaimTemplates:
    - metadata:
        name: rabbitmq-data
      spec:
        resources:
          requests:
            storage: 5Gi
        accessModes:
          - ReadWriteOnce
        storageClassName: ssd

  template:
    metadata:
      labels:
        app: rabbitmq
    spec:
      # Service account to provide rights for peer discovery plugin
      serviceAccountName: rabbitmq
      terminationGracePeriodSeconds: 10
      containers:

      - name: rabbitmq
        image: rabbitmq:3.8
        volumeMounts:
          - name: config-volume
            mountPath: /etc/rabbitmq
          - name: rabbitmq-data
            mountPath: /var/lib/rabbitmq/mnesia
        ports:
          - name: http
            protocol: TCP
            containerPort: 15672
          - name: amqp
            protocol: TCP
            containerPort: 5672
        livenessProbe:
          exec:
            command: ["rabbitmqctl", "status"]
          initialDelaySeconds: 60
          # See https://www.rabbitmq.com/monitoring.html for monitoring frequency recommendations.
          periodSeconds: 60
          timeoutSeconds: 15
        readinessProbe:
          exec:
            command: ["rabbitmqctl", "status"]
          initialDelaySeconds: 20
          periodSeconds: 60
          timeoutSeconds: 10
        imagePullPolicy: Always
        env:
          # Pass pod name in for calculating FQDN
          - name: MY_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          # Pass the pod namespace in for calculating FQDN
          - name: MY_POD_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
          # Name of K8S headless service that selects the pods in this cluster
          - name: K8S_SERVICE_NAME
            value: "rabbitmq"
          # Using FQDN for host identifier
          - name: RABBITMQ_USE_LONGNAME
            value: "true"
          # Build the rabbitmq host FQDN
          - name: RABBITMQ_NODENAME
            value: "rabbit@$(MY_POD_NAME).$(K8S_SERVICE_NAME).$(MY_POD_NAMESPACE).svc.cluster.local"
          # Build the cluster DNS domain name
          - name: K8S_HOSTNAME_SUFFIX # Suffix to match fqdn of rmq instances in the k8s namespace
            value: ".$(K8S_SERVICE_NAME).$(MY_POD_NAMESPACE).svc.cluster.local"
          # Import the shared secret cookie for erlang authentication
          - name: RABBITMQ_ERLANG_COOKIE
            valueFrom:
              secretKeyRef:
                name: rmq-secrets
                key: cookie

      - name: rabbitmq-exporter
        image: kbudde/rabbitmq-exporter:v1.0.0-RC6.1
        env:
          - name: RABBIT_CAPABILITIES
            value: "no_sort,bert"
          - name: PUBLISH_PORT
            value: "9419"
        ports:
          - name: metrics
            protocol: TCP
            containerPort: 9419

      volumes:
        - name: config-volume
          configMap:
            name: rabbitmq-config
            items:
            - key: rabbitmq.conf
              path: rabbitmq.conf
            - key: enabled_plugins
              path: enabled_plugins

Monitoring

Infrastructure monitoring is priceless for critical deployments. Without it, you cannot know that something is broken before your customers do, figure out how to fix it before they notice, or tell when to scale (manually or automatically) so they always have a happy experience.

To skip monitoring, ignore this section and remove the rmq-exporter container from the pod spec.

Prometheus Operator makes it pretty simple to create k8s-native monitoring configurations that declaratively compose monitoring data with analysis and alerting tools.

---

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    prometheus: monitoring
  name: rabbitmq
  namespace: monitoring
spec:
  endpoints:
  - path: /metrics
    port: metrics
  jobLabel: rabbitmq
  namespaceSelector:
    any: true
  selector:
    matchLabels:
      app: rabbitmq

Fin

This yaml should be all that's needed to spin up a rabbitmq cluster on kubernetes. Auto cluster formation is configured, and modifying the replicas setting on the StatefulSet should work like a knob to dynamically scale to any size rmq cluster that your environment can support.
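As an illustration of that knob, growing the cluster to five nodes is just a change to the replicas field; a sketch of the patch (`kubectl scale statefulset rabbitmq --replicas=5` achieves the same thing):

```yaml
# Sketch: strategic-merge patch bumping the StatefulSet to five rmq nodes,
# e.g. kubectl patch statefulset rabbitmq -p "$(cat scale.yaml)"
spec:
  replicas: 5
```

New pods join the existing cluster via the same peer discovery path the original members used.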

Production additions/changes:

  • A normal k8s service to create a cluster IP and DNS name for load balancing client connections across the cluster nodes, for protocols that support it (e.g. MQTT, WebSockets).
  • In most cases, you will not want multiple rmq cluster pods running on the same k8s node, since that leaves the cluster weak against node failure. Anti-affinity constraints can be added to the podspec to specify the desired behavior here.
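The co-location concern above can be addressed with a podAntiAffinity stanza; a sketch, to be added under the StatefulSet's template.spec:

```yaml
# Sketch: spread rmq pods across k8s nodes.
# requiredDuringScheduling makes co-location a hard scheduling failure;
# swap in preferredDuringScheduling for a soft preference instead.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: rabbitmq
        topologyKey: kubernetes.io/hostname
```

The topologyKey determines the failure domain; kubernetes.io/hostname spreads per node, while a zone label would spread per zone.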

A future post may explore federating rmq clusters across other k8s boundaries like regions and zones, and look at using a configuration management tool to automatically configure the cluster members and the cluster itself.