Orchestrating blockchain nodes on Kubernetes at Blockdaemon: the design case
Stateless assumptions don't survive contact with chain data
The mental model Kubernetes was built around is a stateless HTTP service: start it, stop it, replace it, it doesn't matter. The database holds the state; the process is interchangeable. Restart a pod and it picks up exactly where the user left off.
Blockchain nodes are the extreme opposite. Running them on Kubernetes is in the same category as running databases on Kubernetes: a discussion the industry has been having for years, with no settled consensus, because the answer genuinely depends on scale, team, and tolerance for operational complexity. A Bitcoin full node carries several hundred gigabytes of chain history. An Ethereum archive node exceeds 4 TiB (and grows perpetually as blocks are added). The chain data is the node; the process is just the thing reading and advancing it. Restart the process and it picks up from disk (which is fine). Delete the disk and you're resyncing from genesis, which on a busy mainnet can take weeks.
That's the first mismatch. Kubernetes was designed around ephemeral, stateless workloads, and blockchain nodes are some of the most stateful workloads in existence. Adding an orchestration layer on top of that doesn't make the statefulness go away (it just adds more abstraction between you and the disk).
What follows is a design proposal, developed at Blockdaemon in Q2/2022, that explores whether Kubernetes could serve as the orchestration layer for blockchain nodes at scale, and what it would take to make that work. The questions it raises are not all answered here.
One node, one signer, non-negotiable
Full nodes (those that sync and serve chain data for JSON-RPC calls) can run multiple replicas without issue. Validators cannot.
A validator's job is to sign blocks on behalf of a staker. In proof-of-stake networks, signing two conflicting blocks for the same height or slot (which is what happens when two validator instances run concurrently with the same key) triggers slashing: an on-chain penalty that permanently destroys part of the staked funds. There is no remediation. The standard Kubernetes answer to high availability (run N replicas, let the scheduler handle restarts) is actively dangerous for validators.
This is the singleton problem: the validator must be exactly one running process at any given moment. The usual orchestration primitives for availability and zero-downtime upgrades require deliberate adaptation to not violate it. And unlike a web service, where getting this wrong causes a temporary error spike, a misconfigured validator can cause a permanent, irreversible financial loss.
Q: Isn't the risk overstated? Surely Kubernetes won't spin up two replicas of a singleton at the same time.
A: It can and does, during rolling updates. A Deployment's default rolling update strategy starts the new pod before terminating the old one, even with a single replica. Without explicit configuration to prevent this, two signing processes can run simultaneously. The question is not whether Kubernetes is generally safe for stateful workloads. It is whether the default behavior, applied without modification to a validator, is safe. It is not.
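The overlap follows from how a Deployment resolves its default rolling-update limits. Below is a minimal sketch of that arithmetic, assuming the documented defaults (maxSurge and maxUnavailable both 25%, with surge rounded up and unavailable rounded down). The function name is illustrative and not part of any Kubernetes API.

```python
def resolve_rolling_update(replicas: int, max_surge_pct: int = 25,
                           max_unavailable_pct: int = 25) -> tuple[int, int]:
    """Resolve percentage limits as the Deployment docs describe:
    maxSurge rounds up, maxUnavailable rounds down."""
    surge = (replicas * max_surge_pct + 99) // 100      # integer ceiling
    unavailable = (replicas * max_unavailable_pct) // 100  # integer floor
    return surge, unavailable


replicas = 1  # a singleton validator
surge, unavailable = resolve_rolling_update(replicas)
max_pods_during_update = replicas + surge        # old pod plus the surged new pod
min_pods_during_update = replicas - unavailable  # old pod may not go away first
print(max_pods_during_update, min_pods_during_update)  # prints "2 1"
```

With one replica the defaults resolve to a surge of 1 and an unavailability of 0, so the old signer stays up while the new one starts.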
The case for orchestration (and the case against it)
If you're running two or three validator nodes, dedicated VMs with manual runbooks are probably the right answer. The operational overhead is bounded, the tooling is minimal, and the failure modes are well-understood. You don't need Kubernetes to manage three nodes. You need a good runbook and someone who reads it.
At scale (dozens of protocols, hundreds of nodes across multiple environments) the argument shifts. Every new protocol onboarded adds new runbooks, new on-call burden, and new failure modes to discover in production. Fleet upgrades across unorchestrated nodes require either rolling manual execution (slow, error-prone) or custom per-protocol automation (expensive to build and maintain independently for each chain).
But here is where the counter-argument becomes real: Kubernetes adds layers. Each layer is something that can fail, something that someone needs to understand deeply, and something that interacts with the layers below it in ways that are not always obvious. A platform engineering team that knows Kubernetes well may not know blockchain node operations. A team that knows blockchain operations may not know Kubernetes internals. The overlap between those two skill sets is genuinely narrow, and hiring for it is expensive.
The honest framing: orchestration potentially replaces manual per-node toil with platform-level complexity. Whether that trade is favorable depends entirely on the scale of the fleet, the composition of the team, and whether the platform can be built once and maintained cheaply, or whether it becomes another system requiring constant attention. There is no universal answer.
The problem nobody talks about until the PVC fills up
Storage is where most Kubernetes-for-blockchain designs encounter their first serious wall. The standard PVC model assumes storage is relatively small, fungible, and network-attached. Chain data is none of those things.
The constraints compound:
- Size grows without bound: chain data ranges from tens of gigabytes for newer protocols to multiple terabytes for archive nodes, and adds blocks indefinitely;
- Protocol-specific formats: some chains write data in formats that assume local disk access patterns; the data isn't always portable to a different node without a full resync;
- Sync time as recovery cost: if storage fails and can't be recovered, resyncing from genesis can take days to weeks. Storage reliability matters in a way it doesn't for typical web workloads;
- I/O performance: block validation and state transitions are I/O intensive; network-attached storage latency, acceptable for most applications, can measurably impact node performance on high-throughput chains.
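The unbounded-growth constraint can be turned into a simple capacity check. The sketch below assumes a roughly linear growth rate, which is a simplification, and all names and numbers are hypothetical rather than measurements from any real chain.

```python
def days_until_full(capacity_gib: float, used_gib: float,
                    growth_gib_per_day: float) -> float:
    """Estimate how many days remain before a volume fills, assuming linear growth."""
    if growth_gib_per_day <= 0:
        raise ValueError("growth rate must be positive")
    return max(capacity_gib - used_gib, 0.0) / growth_gib_per_day


def needs_expansion(capacity_gib: float, used_gib: float,
                    growth_gib_per_day: float, lead_time_days: float) -> bool:
    """True if the volume would fill within the time it takes to expand it."""
    return days_until_full(capacity_gib, used_gib, growth_gib_per_day) <= lead_time_days


# Hypothetical volume: 100 GiB capacity, 40 GiB used, growing 6 GiB per day.
remaining = days_until_full(100, 40, 6)
print(remaining, needs_expansion(100, 40, 6, lead_time_days=14))
```

The useful comparison is against the lead time for expansion: a volume that fills faster than it can be grown or replaced is already a problem, however much free space it shows today.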
Three storage approaches worth considering for this use case:
OpenEBS with Mayastor
OpenEBS implements Container Attached Storage (CAS): an abstraction layer between Kubernetes' Container Storage Interface and the underlying driver (EBS, NFS, local disk, or otherwise). The premise is that storage should be orchestrated by Kubernetes the same way compute is: as containers, with scheduling, affinity rules, and auto-scaling.
Mayastor is OpenEBS's distributed block storage engine, responsible for orchestrating disk placement across nodes. The design mirrors the Kubernetes control plane. Mayastor runs as containers, manages data node distribution to minimize latency, and would scale storage capacity independently of the workload pods that mount it.
Potential operational benefits for blockchain workloads:
- No cloud vendor lock-in: the same storage class could work across AWS, GCP, bare metal, or any mix;
- Storage auto-scaling decoupled from node auto-scaling: disk capacity could grow without resizing or restarting the workload;
- Consistent interface regardless of the underlying driver.
Known limitation (as of Q2/2022): no native snapshot support. Velero provides a partial workaround but not a complete solution. Snapshots matter for blockchain nodes because they enable bootstrapping new nodes from a recent chain state rather than syncing from genesis.
cStor
cStor is OpenEBS's ZFS-backed storage engine. Where Mayastor focuses on distributed block storage, cStor focuses on consistency and data management capabilities.
cStor implements copy-on-write semantics, RAID replication across nodes, PersistentVolume snapshots and cloning, and ZFS deduplication. The deduplication is specifically relevant for blockchain workloads: multiple nodes in the same cluster holding structurally similar chain data (e.g. multiple full nodes of the same protocol) could reduce actual disk usage significantly.
The trade-off is operational: cStor requires ZFS installed on every Kubernetes node. Adding or replacing cluster nodes requires ZFS provisioning as part of the process, adding friction compared to Mayastor's more self-contained model.
ZFS nodes with NFS provisioner
The longer road, but the most proven. A set of ZFS-backed nodes provisioned across availability zones, sharing volumes to the Kubernetes cluster via the NFS subdir provisioner. More moving parts, more manual configuration (and fewer unknowns in production). When the more automated options encounter edge cases, this approach falls back to primitives that have been running in production for decades.
Q: Why not just use cloud-managed block storage (EBS, Persistent Disk, etc.)?
A: Cloud-managed block storage works and is the path of least resistance at small scale. Latency characteristics are acceptable for most workloads, though they may not be for high-throughput validation. More practically: at scale across multiple cloud providers, storage vendor lock-in becomes a real operational constraint. A storage abstraction that works across providers is worth the added complexity if multi-cloud operation is the goal (but only if the team can actually maintain that abstraction).
The storage landscape for Kubernetes has continued to evolve since this design was written. Longhorn has since matured as a further alternative worth evaluating (particularly for its native snapshot support and simpler operational model compared to Mayastor and cStor).
The one rule Kubernetes wasn't designed for
By default, a Kubernetes Deployment's rolling update spins up new pods before terminating old ones. For stateless services, this is ideal: zero downtime, no dropped requests. For validators, starting the new pod before the old one terminates means two signing processes running simultaneously (double-signing territory).
Two mitigations could compose to close this window.
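Pod Disruption Budgets constrain voluntary disruptions: how many pods Kubernetes may evict at once, for example during a node drain. A PDB with maxUnavailable: 0 blocks such evictions of the validator entirely. The update sequence itself is governed by the workload's update strategy rather than the PDB. In a Deployment, maxSurge: 0 prevents the new pod from starting before the old one is terminated; it has to be paired with maxUnavailable: 1, because Kubernetes rejects a rolling update in which both values are 0. The Recreate strategy gives the same ordering. Together these could enforce a stop-then-start update sequence for the validator (no overlap, no double-signing window during planned upgrades).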
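One way to express a stop-then-start sequence is sketched below as manifest-building functions. It assumes a single-replica Deployment using the Recreate strategy (an alternative is a rolling update with maxSurge: 0 and maxUnavailable: 1, since the API server rejects a rolling update where both are 0), plus a PDB that covers voluntary evictions. The resource names and image are placeholders.

```python
import json


def validate_rolling_update(max_surge: int, max_unavailable: int) -> None:
    """Mirror the Deployment validation rule: both values may not be 0."""
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("maxUnavailable may not be 0 when maxSurge is 0")


def validator_deployment(name: str, image: str) -> dict:
    """Single-replica Deployment that stops the old pod before starting the new one."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": 1,
            "strategy": {"type": "Recreate"},  # terminate old pods first
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }


def validator_pdb(name: str) -> dict:
    """PDB that blocks voluntary evictions (such as node drains) of the validator."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": f"{name}-pdb"},
        "spec": {"maxUnavailable": 0, "selector": {"matchLabels": {"app": name}}},
    }


# Placeholder name and image; kubectl accepts JSON manifests as well as YAML.
deployment = validator_deployment("validator", "example.invalid/validator:1.0.0")
pdb = validator_pdb("validator")
print(json.dumps([deployment, pdb], indent=2))
```

A PDB with maxUnavailable: 0 also means node drains will wait until an operator intervenes, which is a deliberate trade of convenience for safety in this sketch.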
Blocks-behind self-termination addresses the unplanned case. A liveness probe monitors the validator's block height against the current network tip. If the lag exceeds a protocol-specific threshold, the probe fails and the kubelet restarts the container. The logic: a validator significantly behind the network tip is not signing correctly anyway; self-termination and restart is safer than continuing. This is sometimes called the harakiri pattern: the process detects it is in an invalid state and exits, rather than risking incorrect behavior that compounds the problem.
The threshold would need to be protocol-specific because block times vary widely across chains. The same "blocks behind" value means different things on different chains: ten blocks is ten seconds of lag on a 1-second block time chain (likely normal variance) but two minutes of lag on a 12-second block time chain. Getting this wrong in either direction causes problems: too sensitive and the validator restarts unnecessarily; too loose and it allows a degraded validator to keep running.
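A minimal sketch of such a check, assuming the threshold is configured as a time budget and converted to blocks per protocol. The function names, the 120-second budget, and the way heights are obtained are all assumptions; a real probe would query the node and a trusted reference for the two heights.

```python
import math


def max_blocks_behind(block_time_s: float, max_lag_s: float) -> int:
    """Convert a time budget into a per-protocol block threshold."""
    return max(1, math.ceil(max_lag_s / block_time_s))


def is_live(local_height: int, network_tip: int,
            block_time_s: float, max_lag_s: float) -> bool:
    """True while the node's lag stays within the protocol's threshold."""
    lag = max(network_tip - local_height, 0)
    return lag <= max_blocks_behind(block_time_s, max_lag_s)


def probe_exit_code(local_height: int, network_tip: int,
                    block_time_s: float, max_lag_s: float) -> int:
    """Exit code for an exec-style liveness probe: 0 is healthy, 1 is failed."""
    return 0 if is_live(local_height, network_tip, block_time_s, max_lag_s) else 1


# The same 120-second budget yields very different block thresholds.
print(max_blocks_behind(block_time_s=1, max_lag_s=120))   # prints 120
print(max_blocks_behind(block_time_s=12, max_lag_s=120))  # prints 10
```

Expressing the budget in seconds keeps one configuration value comparable across chains, while the block threshold is derived per protocol.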
Keeping keys out of the cluster
Validator keys are not like application credentials. A leaked database password gets rotated. A leaked validator private key can be used to sign blocks on behalf of the validator before the key is revoked (potentially triggering slashing in the window between leak and rotation). Slashing is irreversible.
Kubernetes Secrets are better than baked-in environment variables, but they are not the right primitive for keys at this sensitivity level. They are base64-encoded (not encrypted by default), readable by any pod in the same namespace that mounts them (and so by anyone allowed to create pods there), and stored in etcd with whatever encryption posture the cluster has configured.
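To make the base64 point concrete: encoding is reversible by anyone who can read the stored value, with no key involved. A short illustration using a placeholder string rather than real key material:

```python
import base64

# What a Secret's data field holds for a given value (placeholder, not a real key).
stored = base64.b64encode(b"validator-signing-key").decode()

# Anyone who can read the Secret object can reverse it without any credential.
recovered = base64.b64decode(stored)
print(recovered == b"validator-signing-key")  # prints True
```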
One approach: separate the key store from the cluster entirely. An external secrets manager holds the key material; the validator pod mounts it as a volume at startup via an operator that handles the fetch. The key is never at rest in the cluster. If the pod is deleted, the key disappears with it. Key rotation does not require a redeployment: update the secret in the store, let the pod restart on its next cycle, and it picks up the new material automatically.
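Established tools for this pattern include HashiCorp Vault (via the External Secrets Operator or the Vault Agent Injector), and cloud-native equivalents like AWS Secrets Manager or GCP Secret Manager (both of which integrate with the External Secrets Operator). One caveat: the External Secrets Operator syncs the value into a native Kubernetes Secret, so the key does land in etcd. The Vault Agent Injector, which writes into an in-memory volume inside the pod, is the variant that keeps key material out of cluster storage.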
The active/standby validator design adds a further constraint to this model: standby nodes would hold no validator key at all. A standby connects to the peer network and syncs blocks, but cannot sign. Promotion to active would involve writing the key to the secrets store, which the newly-promoted pod then mounts at startup. At any given moment, the key exists in exactly one place.
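The "no key, no signing" rule can be sketched as a startup check: a pod acts as a signer only if key material is actually mounted. The file name and the role names below are hypothetical, and the temporary directory stands in for a secrets volume mount.

```python
import tempfile
from enum import Enum
from pathlib import Path


class Role(Enum):
    ACTIVE = "active"    # key mounted: may sign
    STANDBY = "standby"  # no key: sync only


def resolve_role(key_path: Path) -> Role:
    """Decide the pod's role from the presence of non-empty key material."""
    try:
        has_key = key_path.is_file() and key_path.stat().st_size > 0
    except OSError:
        has_key = False
    return Role.ACTIVE if has_key else Role.STANDBY


# Demonstration with a temporary directory standing in for the mounted volume.
with tempfile.TemporaryDirectory() as tmp:
    key_path = Path(tmp) / "validator.key"
    standby_role = resolve_role(key_path)             # nothing mounted yet
    key_path.write_bytes(b"placeholder-key-material")
    active_role = resolve_role(key_path)              # key now present

print(standby_role.value, active_role.value)  # prints "standby active"
```

Deriving the role from the mount, rather than from a separate flag, means a standby cannot sign by misconfiguration alone: it has nothing to sign with.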
This model is conceptually clean but has operational implications that shouldn't be glossed over. The secrets store itself becomes a critical dependency: if it is unavailable at pod startup, the validator cannot start. That means the secrets store's availability SLA effectively becomes the validator's availability SLA. Designing for this dependency (caching, fallback, secrets store HA) is non-trivial and adds more surface area to maintain.
Protocol upgrades as a deployment problem
With storage, process exclusivity, and key management handled (or at least designed for), protocol upgrades could become a standard deployment problem. Argo CD (with its Image Updater add-on) could watch for new container image tags pushed by each protocol's release pipeline and trigger rolling updates automatically. Combined with rolling update configuration that limits the batch size (e.g. 10% of the fleet at a time) and health probes that gate each batch, the typical protocol upgrade (previously a manual, per-node operation) could become a version tag update in a repository.
The health probe is what would make this safe: a probe checking blocks-behind against the network reference provides an automated go/no-go signal. A bad protocol upgrade (one where the new version has a regression) could halt before reaching the full fleet, rather than being discovered through monitoring after the fact.
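The gating logic can be sketched independently of any deployment tool. The version below assumes caller-supplied `upgrade` and `healthy` callbacks (both hypothetical) and halts after the first batch that contains an unhealthy node.

```python
from typing import Callable, Sequence


def rollout(nodes: Sequence[str], batch_percent: int,
            upgrade: Callable[[str], None],
            healthy: Callable[[str], bool]) -> list[str]:
    """Upgrade nodes in batches and stop after the first unhealthy batch.

    Returns the nodes that were upgraded, including the failing batch."""
    batch_size = max(1, (len(nodes) * batch_percent + 99) // 100)  # integer ceiling
    upgraded: list[str] = []
    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]
        for node in batch:
            upgrade(node)
        upgraded.extend(batch)
        if not all(healthy(node) for node in batch):
            break  # go/no-go gate: do not touch the rest of the fleet
    return upgraded


# Hypothetical fleet of 20 nodes where node "n4" regresses after the upgrade.
fleet = [f"n{i}" for i in range(20)]
touched = rollout(fleet, batch_percent=10,
                  upgrade=lambda node: None,
                  healthy=lambda node: node != "n4")
print(len(touched), "of", len(fleet), "nodes upgraded before halting")
```

In this run the rollout stops after the third batch, so 6 of 20 nodes carry the bad version instead of the whole fleet.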
Whether this fully materializes depends on how well the health probes can actually characterize node health for each protocol. For some chains, block lag is a complete signal. For others, a node can be synced but serving incorrect data due to state corruption (a condition that block height alone won't catch). Protocol-specific health checks are more accurate but more expensive to build and maintain across a large number of chains.
Shared or exclusive?
In a multi-tenant context, node isolation becomes a design question. Two approaches with different trade-off profiles:
Kubernetes affinity and anti-affinity rules allow workloads to declare requirements about co-location. This is lightweight and built into the scheduler, adding no runtime overhead. The isolation it provides is container-level: processes on the same host share the kernel.
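As an illustration, a required anti-affinity rule that keeps two pods carrying the same label off the same host could be built like this. The label key and value are placeholders; the returned fragment belongs under the pod template's `spec.affinity`.

```python
import json


def exclusive_host_affinity(app_label: str) -> dict:
    """Affinity fragment: never schedule two pods with this label on one node."""
    return {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": {"app": app_label}},
                    "topologyKey": "kubernetes.io/hostname",  # one pod per host
                }
            ]
        }
    }


affinity = exclusive_host_affinity("tenant-a-node")
print(json.dumps(affinity, indent=2))
```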
Firecracker microVMs provide hardware-level boundaries between workloads by running each in its own lightweight virtual machine. Firecracker (via firecracker-containerd) and Kata Containers both offer OCI-compatible runtimes that replace the standard container execution model with VM-backed isolation. The trade-off is real: microVM startup time, additional memory overhead per workload, and a more complex container runtime to configure and operate.
The right choice depends on the threat model. Container-level isolation (with a shared kernel) is sufficient for most operational concerns. Hardware-level isolation is warranted when the threat model includes a compromised container breaking out to the host (a higher bar that most deployments don't require but some do).
Is it worth it?
This is the question the design intentionally leaves open.
The argument for: at sufficient scale, manual per-node operations don't compose. Every new protocol is an operational multiplier, and Kubernetes provides a shared foundation that could absorb that multiplier once. Rolling upgrades with health probes, automated key injection, standardized storage interfaces across protocols: these are real benefits that compound as the fleet grows.
The argument against: each of those benefits requires solving a hard problem first, and each solution adds a layer. Storage abstraction, singleton enforcement, external secrets integration, protocol-specific health probes: these are not simple configurations. They require people who understand both the Kubernetes primitives and the blockchain-specific constraints, and that combination is rare and expensive. There is a real risk that the orchestration layer becomes its own operational burden, requiring specialized platform engineering attention that outweighs the toil it was supposed to eliminate.
The break-even point (where the operational benefit of orchestration exceeds the cost of building and maintaining the platform) depends on fleet size, team structure, and how well the health check and upgrade automation can be built generically across protocols. At a handful of nodes, the overhead clearly isn't worth it. At hundreds of nodes, it probably is. The interesting question is where in between that line sits, and whether the complexity stays bounded as the platform scales.
Open questions
- At what fleet size does orchestration overhead become net positive?
- How generalizable are health probes across protocols, and what's the maintenance cost of per-protocol implementations?
- Can storage auto-scaling keep up with chain data growth rates without operator intervention?
- What does the secrets store availability dependency cost in practice, and how is it designed for?
- Does the team structure that can build and maintain this platform exist, or does it need to be built first?
The validator high availability design (the standby/active model that keeps a warm standby synced without double-signing risk) is a related problem that this proposal does not fully address.