I am a fan of straightforward declarative infrastructure, and for smoothing the interface between "hardware" infra tools like Terraform and the workloads orchestrated on top of that hardware, I have never used a tool I enjoy as much as kubernetes.
RabbitMQ is another project that I've enjoyed using in many systems. My goal here is to see how simply we can take the two of them and build an easily-scalable messaging cluster.
Unlike some clustered workloads, RabbitMQ actually turns out to be rather pleasant to cluster on k8s without much voodoo. Much of that pleasantness can be attributed to rmq's erlang underpinnings, which provide robust inter-machine process handling over IP.
Auto Cluster Formation
To build an rmq cluster, each node needs to know which other hosts are supposed to be in the cluster with it, and needs a way to resolve the peer hostnames to IPs (reverse resolution is also necessary).
For connectivity, the flat pod IP space in k8s is perfect and its DNS services provide an almost purpose-made tool to automatically sync cluster member entries on the fly.
For security reasons, any kind of daemon should usually have as little access to its environment as practicable.
The peer discovery plugin requires that the pod it runs in has rights to query the k8s API
endpoints resource. Permissions for a pod are rooted in a ServiceAccount which is then bound to role(s) to grant certain rights.
---
# For automatic cluster formation, members of the rmq cluster need access
# to query k8s for other pods in the cluster
# Attached to cluster member pods
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rabbitmq
---
# Constrained to read operations on the endpoints resource
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: rmq-endpoints-reader
rules:
  - apiGroups: [""]
    resources: ["endpoints"]
    verbs: ["get"]
    # NB: This permission cannot be constrained to only enumerating
    # other rmq instances.
---
# Bind the role to the service account
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: rmq-member
subjects:
  - kind: ServiceAccount
    name: rabbitmq
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: rmq-endpoints-reader
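Once these manifests are applied, the binding can be sanity-checked with kubectl's built-in authorization query (the default namespace is an assumption here):

# Should print "yes" if the role binding grants the service account read access
kubectl auth can-i get endpoints \
  --as=system:serviceaccount:default:rabbitmq -n default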
Once the pods have the rights to discover which other hosts are expected to be in the cluster with them, it is important for erlang runtime connectivity that cluster nodes have DNS resolution properly configured. Clever use can be made of DNS, already the k8s-preferred discovery method, to provide this service.
Because the cluster pods will be managed via a k8s StatefulSet, each pod is guaranteed to be assigned a well-known sequential hostname (i.e.
rabbitmq-0, rabbitmq-1, rabbitmq-2). This guarantee is leveraged to make erlang happy by exporting the hostname and IP of each cluster pod to kube-dns using the somewhat obscure headless service.
The name of this service needs to be provided to the discovery plugin via the
$K8S_SERVICE_NAME pod environment variable. It uses the name to query the k8s API for information on all pods selected by the service.
A normal k8s service provides load balanced connectivity to a group of pods via a single cluster-wide IP and DNS name. A headless service creates DNS entries for the hostname of each selected pod and its IP.
---
kind: Service
apiVersion: v1
metadata:
  name: rabbitmq
  labels:
    app: rabbitmq
spec:
  # We need a headless service to allow the pods to discover each
  # other during the autodiscover phase for cluster creation.
  # A ClusterIP will prevent resolving dns requests for other pods
  # under the same service.
  clusterIP: None
  selector:
    app: rabbitmq
  ports:
    - name: http
      protocol: TCP
      port: 15672
    - name: amqp
      protocol: TCP
      port: 5672
    - name: metrics
      protocol: TCP
      port: 9419
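Once the pods from the StatefulSet further below are running, the difference is easy to observe from any pod in the cluster (the default namespace is an assumption):

# The headless service name resolves to the IP of every member pod...
nslookup rabbitmq.default.svc.cluster.local
# ...and each pod also gets its own stable per-host record
nslookup rabbitmq-0.rabbitmq.default.svc.cluster.local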
RMQ Node ConfigMap
A k8s ConfigMap is used to feed the rabbitmq daemon in each cluster pod a bootstrap configuration which mainly defines peer discovery and connectivity settings.
The config first tells the cluster formation system that it should discover peers using the
rabbit_peer_discovery_k8s plugin, and that the kubernetes API is available at the DNS name kubernetes.default.svc.cluster.local.
Because erlang cluster members need a stable identifier--something that k8s pod IPs are not--the configuration is set to use hostnames instead.
A number of other small timeout and limit tweaks are also included; these can be researched at the documentation URLs cited in the config comments.
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: rabbitmq-config
data:
  enabled_plugins: |
    [rabbitmq_management,rabbitmq_peer_discovery_k8s].
  rabbitmq.conf: |
    ## Cluster formation. See http://www.rabbitmq.com/cluster-formation.html to learn more.
    cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s
    cluster_formation.k8s.host = kubernetes.default.svc.cluster.local
    ## Should RabbitMQ node name be computed from the pod's hostname or IP address?
    ## IP addresses are not stable, so using [stable] hostnames is recommended when possible.
    ## Set to "hostname" to use pod hostnames.
    ## When this value is changed, so should the variable used to set the RABBITMQ_NODENAME
    ## environment variable.
    cluster_formation.k8s.address_type = hostname
    ## How often should node cleanup checks run?
    cluster_formation.node_cleanup.interval = 30
    ## Set to false if automatic removal of unknown/absent nodes
    ## is desired. This can be dangerous, see
    ## * http://www.rabbitmq.com/cluster-formation.html#node-health-checks-and-cleanup
    ## * https://groups.google.com/forum/#!msg/rabbitmq-users/wuOfzEywHXo/k8z_HWIkBgAJ
    cluster_formation.node_cleanup.only_log_warning = true
    cluster_partition_handling = autoheal
    ## See http://www.rabbitmq.com/ha.html#master-migration-data-locality
    queue_master_locator = min-masters
    ## See http://www.rabbitmq.com/access-control.html#loopback-users
    loopback_users.guest = true
    ## Startup delays mostly unnecessary with statefulset
    cluster_formation.randomized_startup_delay_range.min = 0
    cluster_formation.randomized_startup_delay_range.max = 2
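Once the cluster is up, the effect of the enabled_plugins file can be sanity-checked from any member pod (the pod name here is illustrative):

# List the explicitly enabled plugins; expect rabbitmq_management
# and rabbitmq_peer_discovery_k8s
kubectl exec rabbitmq-0 -c rabbitmq -- rabbitmq-plugins list -e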
Cluster StatefulSet
A StatefulSet in k8s allows binding attributes, storage, and some actions to a group of pods in deterministic and dependent ways.
In the case of this cluster, sequential startup and shutdown ordering and sequentially assigned hostnames allow the cluster peer discovery and erlang connectivity to work out-of-the-box. The deterministically bound storage is also critical for the rmq cluster data persistence.
The primary process in this pod is an rmq daemon, but because of the importance of monitoring, this pod also runs an instance of rmq-exporter to expose perf stats for harvesting.
Storage and Configuration
A persistent volume claim template causes the creation of a persistent volume for storing the mnesia database of each instance. Each of the volumes is bound to the particular pod instance--and deterministic hostname--it was created for, thanks to the stateful set.
The bootstrap config map files are mounted at
/etc/rabbitmq/, TCP ports are exposed, and liveness and readiness probes are set up to have k8s monitor the runtime health of the rmq daemon.
Several pieces of instance-specific metadata are mounted into the pod's environment; the most important are used to configure the discovery service and to deduce the hostname and DNS domain of the pod. Take a look at the comments in the yaml for detail on these variables.
The serviceAccountName of the pod is set to the name of the service account created for peer discovery rights.
The cookie value in a k8s secret named
rmq-secrets is exported to rmq via the pod's environment and is used by erlang as a shared secret to authenticate peers.
I like to use this command to generate a random erlang cookie on linux:
head -c127 </dev/urandom|xxd -p -u -c 255 | tr -d '\n' | base64 -w0
---
apiVersion: v1
kind: Secret
metadata:
  name: rmq-secrets
type: Opaque
data:
  # generated with head -c127 </dev/urandom|xxd -p -u -c 255 | tr -d '\n' | base64 -w0
  cookie: QkJEQTQ1RDYwRjQxMDYxNzE1RDU4RUQ0NEUyRDA1QkFDNzNEMTlDMEIyQUUzNkFCQ0RGQTJEM0Y0N0Y4MDA0QzQ1NEQwRDYxQkRGMzZBQUIxMjYyMEVDN0FCQkM2MjYyQjE5ODQyOTU4MkI0MjY3QjhGQ0E0RkRBN0IyMDREQzJEN0EyMDJGMzQ2ODhDRDQ4QkE4ODc5MjdFNTE4OEYxNDk4MTE3QjFBNURDNkI0RkVGM0VBNEQ0RDFGRDg4Mjc3QzdDOTI5M0QwNzY3NzBDOUIzNjY0RERGMkQxREJCQzYzODhEOEFBRjI0M0Q4NjY4MTExNEE5MzY5Q0Y5QTM=
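Alternatively, rather than hand-assembling the base64 data field, the same secret can be created directly from the generator command; note the base64 step is dropped because kubectl encodes --from-literal values itself:

kubectl create secret generic rmq-secrets \
  --from-literal=cookie="$(head -c127 </dev/urandom | xxd -p -u -c 255 | tr -d '\n')"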
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rabbitmq
spec:
  serviceName: rabbitmq
  # Number of rmq nodes in the cluster
  replicas: 3
  selector:
    matchLabels:
      app: rabbitmq
  # Give each node 5Gi for local disk cache
  volumeClaimTemplates:
    - metadata:
        name: rabbitmq-data
      spec:
        resources:
          requests:
            storage: 5Gi
        accessModes:
          - ReadWriteOnce
        storageClassName: ssd
  template:
    metadata:
      labels:
        app: rabbitmq
    spec:
      # Service account to provide rights for peer discovery plugin
      serviceAccountName: rabbitmq
      terminationGracePeriodSeconds: 10
      containers:
        - name: rabbitmq
          image: rabbitmq:3.8
          volumeMounts:
            - name: config-volume
              mountPath: /etc/rabbitmq
            - name: rabbitmq-data
              mountPath: /var/lib/rabbitmq/mnesia
          ports:
            - name: http
              protocol: TCP
              containerPort: 15672
            - name: amqp
              protocol: TCP
              containerPort: 5672
          livenessProbe:
            exec:
              command: ["rabbitmqctl", "status"]
            initialDelaySeconds: 60
            # See https://www.rabbitmq.com/monitoring.html for monitoring frequency recommendations.
            periodSeconds: 60
            timeoutSeconds: 15
          readinessProbe:
            exec:
              command: ["rabbitmqctl", "status"]
            initialDelaySeconds: 20
            periodSeconds: 60
            timeoutSeconds: 10
          imagePullPolicy: Always
          env:
            # Pass pod name in for calculating FQDN
            - name: MY_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            # Pass the pod namespace in for calculating FQDN
            - name: MY_POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            # Name of K8S headless service that selects the pods in this cluster
            - name: K8S_SERVICE_NAME
              value: "rabbitmq"
            # Using FQDN for host identifier
            - name: RABBITMQ_USE_LONGNAME
              value: "true"
            # Build the rabbitmq host FQDN
            - name: RABBITMQ_NODENAME
              value: "rabbit@$(MY_POD_NAME).$(K8S_SERVICE_NAME).$(MY_POD_NAMESPACE).svc.cluster.local"
            # Build the cluster DNS domain name
            # Suffix to match fqdn of rmq instances in the k8s namespace
            - name: K8S_HOSTNAME_SUFFIX
              value: ".$(K8S_SERVICE_NAME).$(MY_POD_NAMESPACE).svc.cluster.local"
            # Import the shared secret cookie for erlang authentication
            - name: RABBITMQ_ERLANG_COOKIE
              valueFrom:
                secretKeyRef:
                  name: rmq-secrets
                  key: cookie
        - name: rabbitmq-exporter
          image: kbudde/rabbitmq-exporter:v1.0.0-RC6.1
          env:
            - name: RABBIT_CAPABILITIES
              value: "no_sort,bert"
            - name: PUBLISH_PORT
              value: "9419"
          ports:
            - name: metrics
              protocol: TCP
              containerPort: 9419
      volumes:
        - name: config-volume
          configMap:
            name: rabbitmq-config
            items:
              - key: rabbitmq.conf
                path: rabbitmq.conf
              - key: enabled_plugins
                path: enabled_plugins
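With all of the pieces defined, bringing the cluster up and confirming the members found each other looks something like this (the file name is hypothetical; concatenate the manifests above however you prefer):

kubectl apply -f rabbitmq-cluster.yaml
# Watch the pods come up in order: rabbitmq-0, then rabbitmq-1, ...
kubectl get pods -l app=rabbitmq -w
# Confirm all three nodes joined the cluster
kubectl exec rabbitmq-0 -c rabbitmq -- rabbitmqctl cluster_status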
Monitoring
Infrastructure monitoring is invaluable for critical deployments. Without monitoring it is impossible to know that something is broken before your customers do, to figure out how to fix it before they notice, or to know when to manually or automatically scale so they always have a happy experience.
If you would rather forgo monitoring, this section can be ignored and the rmq-exporter container can be removed from the pod.
Prometheus Operator makes it pretty simple to create k8s-native monitoring configurations that declaratively compose monitoring data with analysis and alerting tools.
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    prometheus: monitoring
  name: rabbitmq
  namespace: monitoring
spec:
  endpoints:
    - path: /metrics
      port: metrics
  jobLabel: rabbitmq
  namespaceSelector:
    any: true
  selector:
    matchLabels:
      app: rabbitmq
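As a sketch of what this enables, a PrometheusRule can alert when a member node drops out. The rabbitmq_up metric name is an assumption based on the exporter used here; verify it against your exporter version:

---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: rabbitmq-alerts
  namespace: monitoring
  labels:
    prometheus: monitoring
spec:
  groups:
    - name: rabbitmq
      rules:
        - alert: RabbitmqNodeDown
          # rabbitmq_up is assumed from rmq-exporter; the pod label depends
          # on your Prometheus relabeling config
          expr: rabbitmq_up == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "RabbitMQ node {{ $labels.pod }} is down"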
This yaml should be all that's needed to spin up a rabbitmq cluster on kubernetes. Auto cluster formation is configured, and modifying the
replicas setting on the StatefulSet should work like a knob to dynamically scale to any size rmq cluster that your environment can support.
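For example, scaling the cluster out from three nodes to five is a single command; the new pods discover the existing members and join on boot:

# Grow (or shrink) the cluster by adjusting the StatefulSet replica count
kubectl scale statefulset rabbitmq --replicas=5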
A couple of things this post does not cover:
- A normal k8s service to create a cluster IP and name for load balancing connections across the cluster nodes for protocols that support it (e.g. mqtt, websockets).
- In most cases, you will not want multiple rmq cluster pods running on the same k8s node (this makes the cluster weak against node failure). Constraints can be added to the podspec to specify the desired behavior here; see the sketch after this list.
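A minimal sketch of such a constraint, added under the StatefulSet's pod template spec; it reuses the app: rabbitmq label from above and makes co-location a hard scheduling failure (use preferredDuringSchedulingIgnoredDuringExecution for a soft preference instead):

      # Keep rabbitmq pods on separate k8s nodes
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: rabbitmq
              topologyKey: kubernetes.io/hostname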
A future post may explore federating rmq clusters across other k8s boundaries like regions and zones, and look at using a configuration management tool to automatically configure the cluster members and the cluster itself.