[{"data":1,"prerenderedAt":367},["ShallowReactive",2],{"\u002Fen\u002Fpost\u002F2022\u002F04\u002Fkubernetes-blockchain-node-orchestration":3},{"id":4,"title":5,"author":6,"body":7,"createdAt":352,"description":353,"extension":354,"meta":355,"navigation":356,"path":357,"seo":358,"slug":359,"stem":360,"tags":361,"__hash__":366},"posts\u002Fposts\u002F2022\u002F04\u002Fkubernetes-blockchain-node-orchestration.md","Orchestrating blockchain nodes on Kubernetes at Blockdaemon: the design case",null,{"type":8,"value":9,"toc":344},"minimark",[10,15,19,22,25,28,32,35,38,41,44,48,51,54,57,60,64,67,70,86,89,94,104,112,115,126,129,133,140,143,146,150,159,162,175,179,182,185,201,207,212,216,219,222,225,240,243,249,253,262,265,268,272,275,281,299,302,306,309,312,315,318,322,339],[11,12,14],"h1",{"id":13},"stateless-assumptions-dont-survive-contact-with-chain-data","Stateless assumptions don't survive contact with chain data",[16,17,18],"p",{},"The mental model Kubernetes was built around is a stateless HTTP service: start\nit, stop it, replace it, it doesn't matter. The database holds the state; the\nprocess is interchangeable. Restart a pod and it picks up exactly where the\nuser left off.",[16,20,21],{},"Blockchain nodes are the extreme opposite. Running them on Kubernetes is in the\nsame category as running databases on Kubernetes: a discussion the industry has\nbeen having for years, with no settled consensus, because the answer genuinely\ndepends on scale, team, and tolerance for operational complexity. A Bitcoin full\nnode carries several hundred gigabytes of chain history. An Ethereum archive\nnode exceeds 4 TiB (and grows perpetually as blocks are added). The chain data\nis the node; the process is just the thing reading and advancing it. Restart the\nprocess and it picks up from disk (which is fine). Delete the disk and you're\nresyncing from genesis, which on a busy mainnet can take weeks.",[16,23,24],{},"That's the first mismatch. 
Kubernetes was designed around ephemeral, stateless\nworkloads, and blockchain nodes are some of the most stateful workloads in\nexistence. Adding an orchestration layer on top of that doesn't make the\nstatefulness go away (it just adds more abstraction between you and the disk).",[16,26,27],{},"What follows is a design proposal, developed at Blockdaemon in Q2\u002F2022, that\nexplores whether Kubernetes could serve as the orchestration layer for blockchain\nnodes at scale, and what it would take to make that work. The questions it\nraises are not all answered here.",[11,29,31],{"id":30},"one-node-one-signer-non-negotiable","One node, one signer, non-negotiable",[16,33,34],{},"Full nodes (those that sync and serve chain data for JSON-RPC calls) can run\nmultiple replicas without issue. Validators cannot.",[16,36,37],{},"A validator's job is to sign blocks on behalf of a staker. In proof-of-stake\nnetworks, signing two conflicting blocks at the same height (as two concurrently\nrunning validator instances can) triggers slashing: an on-chain penalty that\npermanently destroys part of the staked funds. There is no remediation. The\nstandard Kubernetes answer to high availability (run N replicas, let the\nscheduler handle restarts) is actively dangerous for validators.",[16,39,40],{},"This is the singleton problem: the validator must be exactly one running process\nat any given moment. The usual orchestration primitives for availability and\nzero-downtime upgrades must be deliberately adapted so they do not violate it. And\nunlike a web service, where getting this wrong causes a temporary error spike, a\nmisconfigured validator can cause a permanent, irreversible financial loss.",[16,42,43],{},"Q: Isn't the risk overstated? Surely Kubernetes won't spin up two replicas of a\nsingleton at the same time.\nA: It can and does, during rolling updates. For a single-replica Deployment, the\ndefault rolling update strategy starts the new pod before terminating the old\none. 
Without explicit configuration\nto prevent this, two signing processes can run simultaneously. The question is\nnot whether Kubernetes is generally safe for stateful workloads. It is whether\nthe default behavior, applied without modification to a validator, is safe. It\nis not.",[11,45,47],{"id":46},"the-case-for-orchestration-and-the-case-against-it","The case for orchestration (and the case against it)",[16,49,50],{},"If you're running two or three validator nodes, dedicated VMs with manual\nrunbooks are probably the right answer. The operational overhead is bounded, the\ntooling is minimal, and the failure modes are well understood. You don't need\nKubernetes to manage three nodes. You need a good runbook and someone who reads\nit.",[16,52,53],{},"At scale (dozens of protocols, hundreds of nodes across multiple environments)\nthe argument shifts. Every new protocol onboarded adds new runbooks, new on-call\nburden, and new failure modes to discover in production. Fleet upgrades across\nunorchestrated nodes require either rolling manual execution (slow, error-prone)\nor custom per-protocol automation (expensive to build and maintain independently\nfor each chain).",[16,55,56],{},"But here is where the counter-argument becomes real: Kubernetes adds layers.\nEach layer is something that can fail, something that someone needs to understand\ndeeply, and something that interacts with the layers below it in ways that are\nnot always obvious. A platform engineering team that knows Kubernetes well may\nnot know blockchain node operations. A team that knows blockchain operations may\nnot know Kubernetes internals. The overlap between those two skill sets is\ngenuinely narrow, and hiring for it is expensive.",[16,58,59],{},"The honest framing: orchestration potentially replaces manual per-node toil with\nplatform-level complexity. 
Whether that trade is favorable depends entirely on\nthe scale of the fleet, the composition of the team, and whether the platform\ncan be built once and maintained cheaply, or whether it becomes another system\nrequiring constant attention. There is no universal answer.",[11,61,63],{"id":62},"the-problem-nobody-talks-about-until-the-pvc-fills-up","The problem nobody talks about until the PVC fills up",[16,65,66],{},"Storage is where most Kubernetes-for-blockchain designs encounter their first\nserious wall. The standard PVC model assumes storage is relatively small,\nfungible, and network-attached. Chain data is none of those things.",[16,68,69],{},"The constraints compound:",[71,72,73,77,80,83],"ul",{},[74,75,76],"li",{},"Size grows without bound: chain data ranges from tens of gigabytes for newer\nprotocols to multiple terabytes for archive nodes, and adds blocks\nindefinitely;",[74,78,79],{},"Protocol-specific formats: some chains write data in formats that assume local\ndisk access patterns; the data isn't always portable to a different node\nwithout a full resync;",[74,81,82],{},"Sync time as recovery cost: if storage fails and can't be recovered,\nresyncing from genesis can take days to weeks. Storage reliability matters in\na way it doesn't for typical web workloads;",[74,84,85],{},"I\u002FO performance: block validation and state transitions are I\u002FO intensive;\nnetwork-attached storage latency, acceptable for most applications, can\nmeasurably impact node performance on high-throughput chains.",[16,87,88],{},"Three storage approaches worth considering for this use case:",[90,91,93],"h2",{"id":92},"openebs-with-mayastor","OpenEBS with Mayastor",[16,95,96,103],{},[97,98,102],"a",{"href":99,"rel":100},"https:\u002F\u002Fopenebs.io\u002F",[101],"nofollow","OpenEBS"," implements Container Attached Storage (CAS): an\nabstraction layer between Kubernetes' Container Storage Interface and the\nunderlying driver (EBS, NFS, local disk, or otherwise). 
The premise is that\nstorage should be orchestrated by Kubernetes the same way compute is: as\ncontainers, with scheduling, affinity rules, and auto-scaling.",[16,105,106,111],{},[97,107,110],{"href":108,"rel":109},"https:\u002F\u002Fgithub.com\u002Fopenebs\u002Fmayastor",[101],"Mayastor"," is OpenEBS's distributed block\nstorage engine, responsible for orchestrating disk placement across nodes. The\ndesign mirrors the Kubernetes control plane. Mayastor runs as containers,\nmanages data node distribution to minimize latency, and would scale storage\ncapacity independently of the workload pods that mount it.",[16,113,114],{},"Potential operational benefits for blockchain workloads:",[71,116,117,120,123],{},[74,118,119],{},"No cloud vendor lock-in: the same storage class could work across AWS, GCP,\nbare metal, or any mix;",[74,121,122],{},"Storage auto-scaling decoupled from node auto-scaling: disk capacity could\ngrow without resizing or restarting the workload;",[74,124,125],{},"Consistent interface regardless of the underlying driver.",[16,127,128],{},"Known limitation (as of Q2\u002F2022): no native snapshot support. Velero provides a\npartial workaround but not a complete solution. Snapshots matter for blockchain\nnodes because they enable bootstrapping new nodes from a recent chain state\nrather than syncing from genesis.",[90,130,132],{"id":131},"cstor","cStor",[16,134,135,139],{},[97,136,132],{"href":137,"rel":138},"https:\u002F\u002Fopenebs.io\u002Fdocs\u002Fconcepts\u002Fcstor",[101]," is OpenEBS's ZFS-backed storage\nengine. Where Mayastor focuses on distributed block storage, cStor focuses on\nconsistency and data management capabilities.",[16,141,142],{},"cStor implements copy-on-write semantics, RAID replication across nodes,\nPersistentVolume snapshots and cloning, and ZFS deduplication. The deduplication\nis specifically relevant for blockchain workloads: multiple nodes in the same\ncluster holding structurally similar chain data (e.g. 
multiple full nodes of the\nsame protocol) could reduce actual disk usage significantly.",[16,144,145],{},"The trade-off is operational: cStor requires ZFS installed on every Kubernetes\nnode. Adding or replacing cluster nodes requires ZFS provisioning as part of the\nprocess, adding friction compared to Mayastor's more self-contained model.",[90,147,149],{"id":148},"zfs-nodes-with-nfs-provisioner","ZFS nodes with NFS provisioner",[16,151,152,153,158],{},"The longer road, but the most proven. A set of ZFS-backed nodes provisioned\nacross availability zones, sharing volumes to the Kubernetes cluster via the\n",[97,154,157],{"href":155,"rel":156},"https:\u002F\u002Fgithub.com\u002Fkubernetes-sigs\u002Fnfs-subdir-external-provisioner",[101],"NFS subdir provisioner",".\nMore moving parts, more manual configuration (and fewer unknowns in production).\nWhen the more automated options encounter edge cases, this approach falls back to\nprimitives that have been running in production for decades.",[16,160,161],{},"Q: Why not just use cloud-managed block storage (EBS, Persistent Disk, etc.)?\nA: Cloud-managed block storage works and is the path of least resistance at\nsmall scale. Latency characteristics are acceptable for most workloads, though\nthey may not be for high-throughput validation. More practically: at scale\nacross multiple cloud providers, storage vendor lock-in becomes a real\noperational constraint. A storage abstraction that works across providers is\nworth the added complexity if multi-cloud operation is the goal (but only if\nthe team can actually maintain that abstraction).",[163,164,166],"callout",{"type":165},"note",[16,167,168,169,174],{},"The storage landscape for Kubernetes has continued to evolve since this\ndesign was written. 
",[97,170,173],{"href":171,"rel":172},"https:\u002F\u002Flonghorn.io\u002F",[101],"Longhorn"," has since matured as a\nfurther alternative worth evaluating (particularly for its native snapshot\nsupport and simpler operational model compared to Mayastor and cStor).",[11,176,178],{"id":177},"the-one-rule-kubernetes-wasnt-designed-for","The one rule Kubernetes wasn't designed for",[16,180,181],{},"Kubernetes rolling updates work by spinning up new pods before terminating old\nones. For stateless services, this is ideal: zero downtime, no dropped requests.\nFor validators, starting the new pod before the old one terminates means two\nsigning processes running simultaneously (double-signing territory).",[16,183,184],{},"Two mitigations could compose to close this window.",[16,186,187,191,192,196,197,200],{},[188,189,190],"strong",{},"Pod Disruption Budgets"," constrain how many pods a voluntary disruption (a node\ndrain, for example) may evict at once. Setting ",[193,194,195],"code",{},"maxUnavailable: 0"," on the\nvalidator's budget blocks those evictions until an operator intervenes. A budget\ndoes not govern rolling updates, though; that ordering belongs to the workload's\nupdate strategy. On a Deployment, ",[193,198,199],{},"maxSurge: 0"," stops the controller from creating\nan extra pod, but it cannot be combined with a zero maxUnavailable in the same\nstrategy, and a pod that is still terminating may overlap with its replacement.\nThe Recreate strategy, or a single-replica StatefulSet, waits for the old pod to\nbe removed before creating the new one. Together these could\nenforce a stop-then-start update sequence for the validator (no overlap, no\ndouble-signing window during planned upgrades).",[16,202,203,206],{},[188,204,205],{},"Blocks-behind self-termination"," addresses the unplanned case. A liveness\nprobe monitors the validator's block height against the current network tip. If\nthe lag exceeds a protocol-specific threshold, the probe fails and Kubernetes\nrestarts the container. The logic: a validator significantly behind the network tip\nis not signing correctly anyway; self-termination and restart is safer than\ncontinuing. 
This is sometimes called the harakiri pattern: the process detects\nit is in an invalid state and exits, rather than risking incorrect behavior that\ncompounds the problem.",[163,208,209],{"type":165},[16,210,211],{},"The threshold would need to be protocol-specific because block times\nvary widely across chains. A \"blocks behind\" value that is normal variance on a\n1-second block time chain can mean minutes of lag on a 12-second block time chain\n(ten blocks is ten seconds on the former and two minutes on the latter).\nGetting this wrong in either direction causes problems: too sensitive and the\nvalidator restarts unnecessarily; too loose and it allows a degraded validator\nto keep running.",[11,213,215],{"id":214},"keeping-keys-out-of-the-cluster","Keeping keys out of the cluster",[16,217,218],{},"Validator keys are not like application credentials. A leaked database password\ngets rotated. A leaked validator private key can be used to sign blocks on\nbehalf of the validator before the key is revoked (potentially triggering\nslashing in the window between leak and rotation). Slashing is irreversible.",[16,220,221],{},"Kubernetes Secrets are better than baked-in environment variables, but they are\nnot the right primitive for keys at this sensitivity level. They are\nbase64-encoded (not encrypted by default), accessible to any workload in the\nsame namespace with pod-level permissions, and stored in etcd with whatever\nencryption posture the cluster has configured.",[16,223,224],{},"One approach: separate the key store from the cluster entirely. An external\nsecrets manager holds the key material; the validator pod mounts it as a volume\nat startup via an operator that handles the fetch. The key is never at rest in\nthe cluster. If the pod is deleted, the key disappears with it. 
Key rotation\ndoes not require a redeployment: update the secret in the store, let the pod\nrestart on its next cycle, and it picks up the new material automatically.",[16,226,227,228,233,234,239],{},"Established tools for this pattern include ",[97,229,232],{"href":230,"rel":231},"https:\u002F\u002Fwww.vaultproject.io\u002F",[101],"HashiCorp Vault","\n(via the ",[97,235,238],{"href":236,"rel":237},"https:\u002F\u002Fexternal-secrets.io\u002F",[101],"External Secrets Operator"," or the Vault\nAgent Injector), and cloud-native equivalents like AWS Secrets Manager or GCP\nSecret Manager (both of which integrate with External Secrets Operator using the\nsame operator interface). Of these, External Secrets Operator syncs the material\ninto a Kubernetes Secret, so the never-at-rest property holds only for\ninjector-style delivery that writes to an in-memory volume.",[16,241,242],{},"The active\u002Fstandby validator design adds a further constraint to this model:\nstandby nodes would hold no validator key at all. A standby connects to the peer\nnetwork and syncs blocks, but cannot sign. Promotion to active would involve\nwriting the key to the secrets store, which the newly-promoted pod then mounts\nat startup. At any given moment, the key exists in exactly one place.",[163,244,246],{"type":245},"warning",[16,247,248],{},"This model is conceptually clean but has operational implications that\nshouldn't be glossed over. The secrets store itself becomes a critical dependency:\nif it is unavailable at pod startup, the validator cannot start. That means the\nsecrets store's availability SLA effectively becomes the validator's availability\nSLA. 
Designing for this dependency (caching, fallback, secrets store HA) is\nnon-trivial and adds more surface area to maintain.",[11,250,252],{"id":251},"protocol-upgrades-as-a-deployment-problem","Protocol upgrades as a deployment problem",[16,254,255,256,261],{},"With storage, process exclusivity, and key management handled (or at least\ndesigned for), protocol upgrades could become a standard deployment problem.\n",[97,257,260],{"href":258,"rel":259},"https:\u002F\u002Fargo-cd.readthedocs.io\u002F",[101],"ArgoCD",", paired with an image-tag watcher such as Argo CD Image Updater, could watch\nfor new container image tags pushed by each protocol's release pipeline and\ntrigger rolling updates\nautomatically. Combined with rolling update configuration that limits the batch\nsize (e.g. 10% of the fleet at a time) and health probes that gate each batch,\nthe typical protocol upgrade (previously a manual, per-node operation) could\nbecome a version tag update in a repository.",[16,263,264],{},"The health probe is what would make this safe: a probe checking blocks-behind\nagainst the network reference provides an automated go\u002Fno-go signal. A bad\nprotocol upgrade (one where the new version has a regression) could halt before\nreaching the full fleet, rather than being discovered through monitoring after\nthe fact.",[16,266,267],{},"Whether this fully materializes depends on how well the health probes can\nactually characterize node health for each protocol. For some chains, block lag\nis a complete signal. For others, a node can be synced but serving incorrect\ndata due to state corruption (a condition that block height alone won't catch).\nProtocol-specific health checks are more accurate but more expensive to build\nand maintain across a large number of chains.",[11,269,271],{"id":270},"shared-or-exclusive","Shared or exclusive?",[16,273,274],{},"In a multi-tenant context, node isolation becomes a design question. 
Two\napproaches with different trade-off profiles:",[16,276,277,280],{},[188,278,279],{},"Kubernetes affinity and anti-affinity rules"," allow workloads to declare\nrequirements about co-location. This is lightweight and built into the\nscheduler, adding no runtime overhead. Affinity only controls placement, though:\nthe isolation between co-located workloads stays container-level, since\nprocesses on the same host share the kernel.",[16,282,283,286,287,292,293,298],{},[188,284,285],{},"Firecracker microVMs"," provide hardware-level boundaries between workloads by\nrunning each in its own lightweight virtual machine.\n",[97,288,291],{"href":289,"rel":290},"https:\u002F\u002Ffirecracker-microvm.github.io\u002F",[101],"Firecracker"," (via the firecracker-containerd\nruntime) and ",[97,294,297],{"href":295,"rel":296},"https:\u002F\u002Fkatacontainers.io\u002F",[101],"Kata Containers"," both offer\nOCI-compatible runtimes that replace the standard container execution model with\nVM-backed isolation. The trade-off is real: microVM startup time, additional\nmemory overhead per workload, and a more complex container runtime to configure\nand operate.",[16,300,301],{},"The right choice depends on the threat model. Container-level isolation is\nsufficient for most operational concerns. Hardware-level isolation is warranted\nwhen the threat model includes a compromised container breaking out to the host\n(a higher bar that most deployments don't require but some do).",[11,303,305],{"id":304},"is-it-worth-it","Is it worth it?",[16,307,308],{},"This is the question the design intentionally leaves open.",[16,310,311],{},"The argument for: at sufficient scale, manual per-node operations don't compose.\nEvery new protocol is an operational multiplier, and Kubernetes provides a shared\nfoundation that could absorb that multiplier once. 
Rolling upgrades with health\nprobes, automated key injection, standardized storage interfaces across\nprotocols: these are real benefits that compound as the fleet grows.",[16,313,314],{},"The argument against: each of those benefits requires solving a hard problem\nfirst, and each solution adds a layer. Storage abstraction, singleton enforcement,\nexternal secrets integration, protocol-specific health probes: these are not\nsimple configurations. They require people who understand both the Kubernetes\nprimitives and the blockchain-specific constraints, and that combination is rare\nand expensive. There is a real risk that the orchestration layer becomes its own\noperational burden, requiring specialized platform engineering attention that\noutweighs the toil it was supposed to eliminate.",[16,316,317],{},"The break-even point (where the operational benefit of orchestration exceeds\nthe cost of building and maintaining the platform) depends on fleet size, team\nstructure, and how well the health check and upgrade automation can be built\ngenerically across protocols. At a handful of nodes, the overhead clearly isn't\nworth it. At hundreds of nodes, it probably is. 
The interesting question is\nwhere in between that line sits, and whether the complexity stays bounded as the\nplatform scales.",[90,319,321],{"id":320},"open-questions","Open questions",[71,323,324,327,330,333,336],{},[74,325,326],{},"At what fleet size does orchestration overhead become net positive?",[74,328,329],{},"How generalizable are health probes across protocols, and what's the\nmaintenance cost of per-protocol implementations?",[74,331,332],{},"Can storage auto-scaling keep up with chain data growth rates without\noperator intervention?",[74,334,335],{},"What does the secrets store availability dependency cost in practice, and\nhow is it designed for?",[74,337,338],{},"Does the team structure that can build and maintain this platform exist, or\ndoes it need to be built first?",[163,340,341],{"type":165},[16,342,343],{},"The validator high availability design (the standby\u002Factive model that\nkeeps a warm standby synced without double-signing risk) is a related problem\nthat this proposal does not fully address.",{"title":345,"searchDepth":346,"depth":346,"links":347},"",2,[348,349,350,351],{"id":92,"depth":346,"text":93},{"id":131,"depth":346,"text":132},{"id":148,"depth":346,"text":149},{"id":320,"depth":346,"text":321},"2022-04-09T00:00:00+01:00","Kubernetes was built for stateless workloads. Blockchain nodes are the extreme opposite. A design proposal exploring what it would actually take to run validators and full nodes on Kubernetes at scale: storage, process exclusivity, key injection, and the upgrade automation that might make it worth the investment.","md",{},true,"\u002Fposts\u002F2022\u002F04\u002Fkubernetes-blockchain-node-orchestration",{"title":5,"description":353},"kubernetes-blockchain-node-orchestration","posts\u002F2022\u002F04\u002Fkubernetes-blockchain-node-orchestration",[362,363,364,365],"blockchain","kubernetes","platform-engineering","infrastructure-as-code","rXBdWidYPbFs_YOulcEyxpqMsG7K1LWRXG7M6rKInUs",1778441743697]