[{"data":1,"prerenderedAt":1550},["ShallowReactive",2],{"home-posts":3,"home-projects":1293},[4,368,808],{"id":5,"title":6,"author":7,"body":8,"createdAt":353,"description":354,"extension":355,"meta":356,"navigation":357,"path":358,"seo":359,"slug":360,"stem":361,"tags":362,"__hash__":367},"posts\u002Fposts\u002F2022\u002F04\u002Fkubernetes-blockchain-node-orchestration.md","Orchestrating blockchain nodes on Kubernetes at Blockdaemon: the design case",null,{"type":9,"value":10,"toc":345},"minimark",[11,16,20,23,26,29,33,36,39,42,45,49,52,55,58,61,65,68,71,87,90,95,105,113,116,127,130,134,141,144,147,151,160,163,176,180,183,186,202,208,213,217,220,223,226,241,244,250,254,263,266,269,273,276,282,300,303,307,310,313,316,319,323,340],[12,13,15],"h1",{"id":14},"stateless-assumptions-dont-survive-contact-with-chain-data","Stateless assumptions don't survive contact with chain data",[17,18,19],"p",{},"The mental model Kubernetes was built around is a stateless HTTP service: start\nit, stop it, replace it, it doesn't matter. The database holds the state; the\nprocess is interchangeable. Restart a pod and it picks up exactly where the\nuser left off.",[17,21,22],{},"Blockchain nodes are the extreme opposite. Running them on Kubernetes is in the\nsame category as running databases on Kubernetes: a discussion the industry has\nbeen having for years, with no settled consensus, because the answer genuinely\ndepends on scale, team, and tolerance for operational complexity. A Bitcoin full\nnode carries several hundred gigabytes of chain history. An Ethereum archive\nnode exceeds 4 TiB (and grows perpetually as blocks are added). The chain data\nis the node; the process is just the thing reading and advancing it. Restart the\nprocess and it picks up from disk (which is fine). Delete the disk and you're\nresyncing from genesis, which on a busy mainnet can take weeks.",[17,24,25],{},"That's the first mismatch. 
Kubernetes was designed around ephemeral, stateless\nworkloads, and blockchain nodes are some of the most stateful workloads in\nexistence. Adding an orchestration layer on top of that doesn't make the\nstatefulness go away (it just adds more abstraction between you and the disk).",[17,27,28],{},"What follows is a design proposal, developed at Blockdaemon in Q2\u002F2022, that\nexplores whether Kubernetes could serve as the orchestration layer for blockchain\nnodes at scale, and what it would take to make that work. The questions it\nraises are not all answered here.",[12,30,32],{"id":31},"one-node-one-signer-non-negotiable","One node, one signer, non-negotiable",[17,34,35],{},"Full nodes (those that sync and serve chain data for JSON-RPC calls) can run\nmultiple replicas without issue. Validators cannot.",[17,37,38],{},"A validator's job is to sign blocks on behalf of a staker. In proof-of-stake\nnetworks, signing the same block twice (from two concurrently running validator\ninstances) triggers slashing: an on-chain penalty that permanently destroys part\nof the staked funds. There is no remediation. The standard Kubernetes answer to\nhigh availability (run N replicas, let the scheduler handle restarts) is\nactively dangerous for validators.",[17,40,41],{},"This is the singleton problem: the validator must be exactly one running process\nat any given moment. The usual orchestration primitives for availability and\nzero-downtime upgrades require deliberate adaptation to not violate it. And\nunlike a web service, where getting this wrong causes a temporary error spike, a\nmisconfigured validator can cause a permanent, irreversible financial loss.",[17,43,44],{},"Q: Isn't the risk overstated? Surely Kubernetes won't spin up two replicas of a\nsingleton at the same time.\nA: It can and does, during rolling updates. The default rolling update strategy\nstarts the new pod before terminating the old one. 
Without explicit configuration\nto prevent this, two signing processes can run simultaneously. The question is\nnot whether Kubernetes is generally safe for stateful workloads. It is whether\nthe default behavior, applied without modification to a validator, is safe. It\nis not.",[12,46,48],{"id":47},"the-case-for-orchestration-and-the-case-against-it","The case for orchestration (and the case against it)",[17,50,51],{},"If you're running two or three validator nodes, dedicated VMs with manual\nrunbooks is probably the right answer. The operational overhead is bounded, the\ntooling is minimal, and the failure modes are well-understood. You don't need\nKubernetes to manage three nodes. You need a good runbook and someone who reads\nit.",[17,53,54],{},"At scale (dozens of protocols, hundreds of nodes across multiple environments)\nthe argument shifts. Every new protocol onboarded adds new runbooks, new on-call\nburden, and new failure modes to discover in production. Fleet upgrades across\nunorchestrated nodes require either rolling manual execution (slow, error-prone)\nor custom per-protocol automation (expensive to build and maintain independently\nfor each chain).",[17,56,57],{},"But here is where the counter-argument becomes real: Kubernetes adds layers.\nEach layer is something that can fail, something that someone needs to understand\ndeeply, and something that interacts with the layers below it in ways that are\nnot always obvious. A platform engineering team that knows Kubernetes well may\nnot know blockchain node operations. A team that knows blockchain operations may\nnot know Kubernetes internals. The overlap between those two skill sets is\ngenuinely narrow, and hiring for it is expensive.",[17,59,60],{},"The honest framing: orchestration potentially replaces manual per-node toil with\nplatform-level complexity. 
Whether that trade is favorable depends entirely on\nthe scale of the fleet, the composition of the team, and whether the platform\ncan be built once and maintained cheaply, or whether it becomes another system\nrequiring constant attention. There is no universal answer.",[12,62,64],{"id":63},"the-problem-nobody-talks-about-until-the-pvc-fills-up","The problem nobody talks about until the PVC fills up",[17,66,67],{},"Storage is where most Kubernetes-for-blockchain designs encounter their first\nserious wall. The standard PVC model assumes storage is relatively small,\nfungible, and network-attached. Chain data is none of those things.",[17,69,70],{},"The constraints compound:",[72,73,74,78,81,84],"ul",{},[75,76,77],"li",{},"Size grows without bound: chain data ranges from tens of gigabytes for newer\nprotocols to multiple terabytes for archive nodes, and adds blocks\nindefinitely;",[75,79,80],{},"Protocol-specific formats: some chains write data in formats that assume local\ndisk access patterns; the data isn't always portable to a different node\nwithout a full resync;",[75,82,83],{},"Sync time as recovery cost: if storage fails and can't be recovered,\nresyncing from genesis can take days to weeks. Storage reliability matters in\na way it doesn't for typical web workloads;",[75,85,86],{},"I\u002FO performance: block validation and state transitions are I\u002FO intensive;\nnetwork-attached storage latency, acceptable for most applications, can\nmeasurably impact node performance on high-throughput chains.",[17,88,89],{},"Three storage approaches worth considering for this use case:",[91,92,94],"h2",{"id":93},"openebs-with-mayastor","OpenEBS with Mayastor",[17,96,97,104],{},[98,99,103],"a",{"href":100,"rel":101},"https:\u002F\u002Fopenebs.io\u002F",[102],"nofollow","OpenEBS"," implements Container Attached Storage (CAS): an\nabstraction layer between Kubernetes' Container Storage Interface and the\nunderlying driver (EBS, NFS, local disk, or otherwise). 
The premise is that\nstorage should be orchestrated by Kubernetes the same way compute is: as\ncontainers, with scheduling, affinity rules, and auto-scaling.",[17,106,107,112],{},[98,108,111],{"href":109,"rel":110},"https:\u002F\u002Fgithub.com\u002Fopenebs\u002Fmayastor",[102],"Mayastor"," is OpenEBS's distributed block\nstorage engine, responsible for orchestrating disk placement across nodes. The\ndesign mirrors the Kubernetes control plane. Mayastor runs as containers,\nmanages data node distribution to minimize latency, and would scale storage\ncapacity independently of the workload pods that mount it.",[17,114,115],{},"Potential operational benefits for blockchain workloads:",[72,117,118,121,124],{},[75,119,120],{},"No cloud vendor lock-in: the same storage class could work across AWS, GCP,\nbare metal, or any mix;",[75,122,123],{},"Storage auto-scaling decoupled from node auto-scaling: disk capacity could\ngrow without resizing or restarting the workload;",[75,125,126],{},"Consistent interface regardless of the underlying driver.",[17,128,129],{},"Known limitation (as of Q2\u002F2022): no native snapshot support. Velero provides a\npartial workaround but not a complete solution. Snapshots matter for blockchain\nnodes because they enable bootstrapping new nodes from a recent chain state\nrather than syncing from genesis.",[91,131,133],{"id":132},"cstor","cStor",[17,135,136,140],{},[98,137,133],{"href":138,"rel":139},"https:\u002F\u002Fopenebs.io\u002Fdocs\u002Fconcepts\u002Fcstor",[102]," is OpenEBS's ZFS-backed storage\nengine. Where Mayastor focuses on distributed block storage, cStor focuses on\nconsistency and data management capabilities.",[17,142,143],{},"cStor implements copy-on-write semantics, RAID replication across nodes,\nPersistentVolume snapshots and cloning, and ZFS deduplication. The deduplication\nis specifically relevant for blockchain workloads: multiple nodes in the same\ncluster holding structurally similar chain data (e.g. 
multiple full nodes of the\nsame protocol) could reduce actual disk usage significantly.",[17,145,146],{},"The trade-off is operational: cStor requires ZFS installed on every Kubernetes\nnode. Adding or replacing cluster nodes requires ZFS provisioning as part of the\nprocess, adding friction compared to Mayastor's more self-contained model.",[91,148,150],{"id":149},"zfs-nodes-with-nfs-provisioner","ZFS nodes with NFS provisioner",[17,152,153,154,159],{},"The longer road, but the most proven. A set of ZFS-backed nodes provisioned\nacross availability zones, sharing volumes to the Kubernetes cluster via the\n",[98,155,158],{"href":156,"rel":157},"https:\u002F\u002Fgithub.com\u002Fkubernetes-sigs\u002Fnfs-subdir-external-provisioner",[102],"NFS subdir provisioner",".\nMore moving parts, more manual configuration (and fewer unknowns in production).\nWhen the more automated options encounter edge cases, this approach falls back to\nprimitives that have been running in production for decades.",[17,161,162],{},"Q: Why not just use cloud-managed block storage (EBS, Persistent Disk, etc.)?\nA: Cloud-managed block storage works and is the path of least resistance at\nsmall scale. Latency characteristics are acceptable for most workloads, though\nthey may not be for high-throughput validation. More practically: at scale\nacross multiple cloud providers, storage vendor lock-in becomes a real\noperational constraint. A storage abstraction that works across providers is\nworth the added complexity if multi-cloud operation is the goal (but only if\nthe team can actually maintain that abstraction).",[164,165,167],"callout",{"type":166},"note",[17,168,169,170,175],{},"The storage landscape for Kubernetes has continued to evolve since this\ndesign was written. 
",[98,171,174],{"href":172,"rel":173},"https:\u002F\u002Flonghorn.io\u002F",[102],"Longhorn"," has since matured as a\nfurther alternative worth evaluating (particularly for its native snapshot\nsupport and simpler operational model compared to Mayastor and cStor).",[12,177,179],{"id":178},"the-one-rule-kubernetes-wasnt-designed-for","The one rule Kubernetes wasn't designed for",[17,181,182],{},"Kubernetes rolling updates work by spinning up new pods before terminating old\nones. For stateless services, this is ideal: zero downtime, no dropped requests.\nFor validators, starting the new pod before the old one terminates means two\nsigning processes running simultaneously (double-signing territory).",[17,184,185],{},"Two mitigations could compose to close this window.",[17,187,188,192,193,197,198,201],{},[189,190,191],"strong",{},"Pod Disruption Budgets"," constrain how many pods Kubernetes may terminate\nsimultaneously. Setting ",[194,195,196],"code",{},"maxUnavailable: 0"," prevents termination until a\nreplacement is ready. The complementary setting ",[194,199,200],{},"maxSurge: 0"," prevents the new\npod from starting before the old one is fully terminated. Together these could\nenforce a stop-then-start update sequence for the validator (no overlap, no\ndouble-signing window during planned upgrades).",[17,203,204,207],{},[189,205,206],{},"Blocks-behind self-termination"," addresses the unplanned case. A liveness\nprobe monitors the validator's block height against the current network tip. If\nthe lag exceeds a protocol-specific threshold, the probe fails and Kubernetes\nterminates the pod. The logic: a validator significantly behind the network tip\nis not signing correctly anyway; self-termination and restart is safer than\ncontinuing. 
This is sometimes called the harakiri pattern: the process detects\nit is in an invalid state and exits, rather than risking incorrect behavior that\ncompounds the problem.",[164,209,210],{"type":166},[17,211,212],{},"The threshold would need to be protocol-specific because block times\nvary widely across chains. A \"blocks behind\" value that signals a problem on a\n1-second block time chain is normal variance on a 12-second block time chain.\nGetting this wrong in either direction causes problems: too sensitive and the\nvalidator restarts unnecessarily; too loose and it allows a degraded validator\nto keep running.",[12,214,216],{"id":215},"keeping-keys-out-of-the-cluster","Keeping keys out of the cluster",[17,218,219],{},"Validator keys are not like application credentials. A leaked database password\ngets rotated. A leaked validator private key can be used to sign blocks on\nbehalf of the validator before the key is revoked (potentially triggering\nslashing in the window between leak and rotation). Slashing is irreversible.",[17,221,222],{},"Kubernetes Secrets are better than baked-in environment variables, but they are\nnot the right primitive for keys at this sensitivity level. They are\nbase64-encoded (not encrypted by default), accessible to any workload in the\nsame namespace with pod-level permissions, and stored in etcd with whatever\nencryption posture the cluster has configured.",[17,224,225],{},"One approach: separate the key store from the cluster entirely. An external\nsecrets manager holds the key material; the validator pod mounts it as a volume\nat startup via an operator that handles the fetch. The key is never at rest in\nthe cluster. If the pod is deleted, the key disappears with it. 
Key rotation\ndoes not require a redeployment: update the secret in the store, let the pod\nrestart on its next cycle, and it picks up the new material automatically.",[17,227,228,229,234,235,240],{},"Established tools for this pattern include ",[98,230,233],{"href":231,"rel":232},"https:\u002F\u002Fwww.vaultproject.io\u002F",[102],"HashiCorp Vault","\n(via the ",[98,236,239],{"href":237,"rel":238},"https:\u002F\u002Fexternal-secrets.io\u002F",[102],"External Secrets Operator"," or the Vault\nAgent Injector), and cloud-native equivalents like AWS Secrets Manager or GCP\nSecret Manager (both of which integrate with External Secrets Operator using the\nsame operator interface).",[17,242,243],{},"The active\u002Fstandby validator design adds a further constraint to this model:\nstandby nodes would hold no validator key at all. A standby connects to the peer\nnetwork and syncs blocks, but cannot sign. Promotion to active would involve\nwriting the key to the secrets store, which the newly-promoted pod then mounts\nat startup. At any given moment, the key exists in exactly one place.",[164,245,247],{"type":246},"warning",[17,248,249],{},"This model is conceptually clean but has operational implications that\nshouldn't be glossed over. The secrets store itself becomes a critical dependency:\nif it is unavailable at pod startup, the validator cannot start. That means the\nsecrets store's availability SLA effectively becomes the validator's availability\nSLA. 
Designing for this dependency (caching, fallback, secrets store HA) is\nnon-trivial and adds more surface area to maintain.",[12,251,253],{"id":252},"protocol-upgrades-as-a-deployment-problem","Protocol upgrades as a deployment problem",[17,255,256,257,262],{},"With storage, process exclusivity, and key management handled (or at least\ndesigned for), protocol upgrades could become a standard deployment problem.\n",[98,258,261],{"href":259,"rel":260},"https:\u002F\u002Fargo-cd.readthedocs.io\u002F",[102],"ArgoCD"," could watch for new container image\ntags pushed by each protocol's release pipeline and trigger rolling updates\nautomatically. Combined with rolling update configuration that limits the batch\nsize (e.g. 10% of the fleet at a time) and health probes that gate each batch,\nthe typical protocol upgrade (previously a manual, per-node operation) could\nbecome a version tag update in a repository.",[17,264,265],{},"The health probe is what would make this safe: a probe checking blocks-behind\nagainst the network reference provides an automated go\u002Fno-go signal. A bad\nprotocol upgrade (one where the new version has a regression) could halt before\nreaching the full fleet, rather than being discovered through monitoring after\nthe fact.",[17,267,268],{},"Whether this fully materializes depends on how well the health probes can\nactually characterize node health for each protocol. For some chains, block lag\nis a complete signal. For others, a node can be synced but serving incorrect\ndata due to state corruption (a condition that block height alone won't catch).\nProtocol-specific health checks are more accurate but more expensive to build\nand maintain across a large number of chains.",[12,270,272],{"id":271},"shared-or-exclusive","Shared or exclusive?",[17,274,275],{},"In a multi-tenant context, node isolation becomes a design question. 
Two\napproaches with different trade-off profiles:",[17,277,278,281],{},[189,279,280],{},"Kubernetes affinity and anti-affinity rules"," allow workloads to declare\nrequirements about co-location. This is lightweight and built into the\nscheduler, adding no runtime overhead. The isolation it provides is\ncontainer-level: processes on the same host share the kernel.",[17,283,284,287,288,293,294,299],{},[189,285,286],{},"Firecracker microVMs"," provide hardware-level boundaries between workloads by\nrunning each in its own lightweight virtual machine.\n",[98,289,292],{"href":290,"rel":291},"https:\u002F\u002Ffirecracker-microvm.github.io\u002F",[102],"Firecracker"," (via the containerd\nplugin) and ",[98,295,298],{"href":296,"rel":297},"https:\u002F\u002Fkatacontainers.io\u002F",[102],"Kata Containers"," both offer\nOCI-compatible runtimes that replace the standard container execution model with\nVM-backed isolation. The trade-off is real: microVM startup time, additional\nmemory overhead per workload, and a more complex container runtime to configure\nand operate.",[17,301,302],{},"The right choice depends on the threat model. Container-level isolation on a\nshared kernel is sufficient for most operational concerns. Hardware-level isolation is warranted\nwhen the threat model includes a compromised container breaking out to the host\n(a higher bar that most deployments don't require but some do).",[12,304,306],{"id":305},"is-it-worth-it","Is it worth it?",[17,308,309],{},"This is the question the design intentionally leaves open.",[17,311,312],{},"The argument for: at sufficient scale, manual per-node operations don't compose.\nEvery new protocol is an operational multiplier, and Kubernetes provides a shared\nfoundation that could absorb that multiplier once. 
Rolling upgrades with health\nprobes, automated key injection, standardized storage interfaces across\nprotocols: these are real benefits that compound as the fleet grows.",[17,314,315],{},"The argument against: each of those benefits requires solving a hard problem\nfirst, and each solution adds a layer. Storage abstraction, singleton enforcement,\nexternal secrets integration, protocol-specific health probes: these are not\nsimple configurations. They require people who understand both the Kubernetes\nprimitives and the blockchain-specific constraints, and that combination is rare\nand expensive. There is a real risk that the orchestration layer becomes its own\noperational burden, requiring specialized platform engineering attention that\noutweighs the toil it was supposed to eliminate.",[17,317,318],{},"The break-even point (where the operational benefit of orchestration exceeds\nthe cost of building and maintaining the platform) depends on fleet size, team\nstructure, and how well the health check and upgrade automation can be built\ngenerically across protocols. At a handful of nodes, the overhead clearly isn't\nworth it. At hundreds of nodes, it probably is. 
The interesting question is\nwhere in between that line sits, and whether the complexity stays bounded as the\nplatform scales.",[91,320,322],{"id":321},"open-questions","Open questions",[72,324,325,328,331,334,337],{},[75,326,327],{},"At what fleet size does orchestration overhead become net positive?",[75,329,330],{},"How generalizable are health probes across protocols, and what's the\nmaintenance cost of per-protocol implementations?",[75,332,333],{},"Can storage auto-scaling keep up with chain data growth rates without\noperator intervention?",[75,335,336],{},"What does the secrets store availability dependency cost in practice, and\nhow is it designed for?",[75,338,339],{},"Does the team structure that can build and maintain this platform exist, or\ndoes it need to be built first?",[164,341,342],{"type":166},[17,343,344],{},"The validator high availability design (the standby\u002Factive model that\nkeeps a warm standby synced without double-signing risk) is a related problem\nthat this proposal does not fully address.",{"title":346,"searchDepth":347,"depth":347,"links":348},"",2,[349,350,351,352],{"id":93,"depth":347,"text":94},{"id":132,"depth":347,"text":133},{"id":149,"depth":347,"text":150},{"id":321,"depth":347,"text":322},"2022-04-09T00:00:00+01:00","Kubernetes was built for stateless workloads. Blockchain nodes are the extreme opposite. 
A design proposal exploring what it would actually take to run validators and full nodes on Kubernetes at scale: storage, process exclusivity, key injection, and the upgrade automation that might make it worth the investment.","md",{},true,"\u002Fposts\u002F2022\u002F04\u002Fkubernetes-blockchain-node-orchestration",{"title":6,"description":354},"kubernetes-blockchain-node-orchestration","posts\u002F2022\u002F04\u002Fkubernetes-blockchain-node-orchestration",[363,364,365,366],"blockchain","kubernetes","platform-engineering","infrastructure-as-code","rXBdWidYPbFs_YOulcEyxpqMsG7K1LWRXG7M6rKInUs",{"id":369,"title":370,"author":7,"body":371,"createdAt":798,"description":799,"extension":355,"meta":800,"navigation":357,"path":801,"seo":802,"slug":803,"stem":804,"tags":805,"__hash__":807},"posts\u002Fposts\u002F2021\u002F08\u002Freal-life-terraform-refactoring-guide.md","Real-life Terraform Refactoring Guide",{"type":9,"value":372,"toc":788},[373,377,392,395,404,410,413,417,437,446,458,461,479,483,486,495,502,505,516,523,532,538,546,549,561,566,573,581,585,588,594,598,601,607,611,621,627,631,638,648,654,666,669,686,692,696,708,725],[91,374,376],{"id":375},"intro","Intro",[17,378,379,380,385,386,391],{},"As reality hits, the unavoidable fact of dealing with a hard-to-manage Terraform\n",[98,381,384],{"href":382,"rel":383},"https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FBig%5Fball%5Fof%5Fmud",[102],"Big ball of mud"," code base comes in. There is no way around natural growth and\nevolution of code bases and the design flaws that come with it. 
Our Agile\nmindset is to ",[98,387,390],{"href":388,"rel":389},"https:\u002F\u002Fwww.brainyquote.com\u002Fquotes\u002Fmark%5Fzuckerberg%5F453439",[102],"\"move fast and break things\"",", implement something as simple as\npossible and leave the design decisions for the next iterations (if any).",[17,393,394],{},"Refactoring Terraform code is as natural as developing it. Time and\ntime again you will face a situation where a better structure or\norganization can be achieved: maybe you want to upgrade from a home-made module\nto an open-source\u002Fcommunity alternative, or maybe you just want to segregate your\nresources into different states to speed up development. Regardless of the goal,\nonce you get into it, you will realize that Terraform code refactoring is\na basic step of the development process that no one told you about\nbefore.",[17,396,397,398,403],{},"As the ",[98,399,402],{"href":400,"rel":401},"http:\u002F\u002Fnathanmarz.com\u002Fblog\u002Fsuffering-oriented-programming.html",[102],"Suffering-Oriented Programming"," mantra dictates:",[405,406,407],"blockquote",{},[17,408,409],{},"\"First make it possible. Then make it beautiful. Then make it fast.\"",[17,411,412],{},"So, time to make the Terraform code beautiful!",[91,414,416],{"id":415},"how-to-break-a-big-ball-of-mud-strangle-it","How to break a big ball of mud? STRANGLE IT",[17,418,419,422,423,426,427,432,433,436],{},[194,420,421],{},"\u003Cjoke>"," Martin Fowler has already written everything there is to write about\n(early 2000s) DevOps, Agile, and Software Development. 
Therefore, we could\nreference Martin Fowler for virtually anything Software related ",[194,424,425],{},"\u003C\u002Fjoke>",", but\nreally, the ",[98,428,431],{"href":429,"rel":430},"https:\u002F\u002Fmartinfowler.com\u002Fbooks\u002Frefactoring.html",[102],"Refactoring book"," is ",[189,434,435],{},"THE"," reference on this subject.",[17,438,439,440,445],{},"Martin Fowler shared the ",[98,441,444],{"href":442,"rel":443},"https:\u002F\u002Fmartinfowler.com\u002Fbliki\u002FStranglerFigApplication.html",[102],"Strangler (Fig) Pattern",", which describes a strategy to\nrefactor a legacy code base by re-implementing the same features (sometimes even\nthe bugs) in another application.",[405,447,448,455],{},[17,449,450,454],{},[451,452,453],"span",{},"..."," the huge strangler figs. They seed in the upper branches of a tree and\ngradually work their way down the tree until they root in the soil. Over many\nyears they grow into fantastic and beautiful shapes, meanwhile strangling and\nkilling the tree that was their host.",[17,456,457],{},"This metaphor struck me as a way of describing a way of doing a rewrite of an\nimportant system.",[17,459,460],{},"In this document we are going to follow the same idea:",[462,463,464,473,476],"ol",{},[75,465,466,467,472],{},"implement the same feature in a different ",[98,468,471],{"href":469,"rel":470},"https:\u002F\u002Fwww.terraform-best-practices.com\u002Fkey-concepts#composition",[102],"Terraform composition",";",[75,474,475],{},"migrate the Terraform state;",[75,477,478],{},"delete (kill) the previous implementation.",[91,480,482],{"id":481},"the-mono-repository-monorepo-approach-to-legacy","The mono-repository (monorepo) approach to Legacy",[17,484,485],{},"Let's suppose that your Terraform code base is versioned in a single repository\n(a.k.a. 
monorepo), following the random structure displayed below (just to help\nillustrate)",[487,488,493],"pre",{"className":489,"code":491,"language":492},[490],"language-text",".\n├── modules\u002F    # Definition of TF modules used by underlying compositions\n├── global\u002F     # Resources that aren't restricted to one environment\n│   └── aws\u002F\n├── production\u002F # Production environment resources\n│   └── aws\u002F\n└── staging\u002F    # Staging environment resources\n    └── aws\u002F\n","text",[194,494,491],{"__ignoreMap":346},[17,496,497,498,501],{},"In this example each directory corresponds to a Terraform state. In order to\napply changes you have to go to a path and execute ",[194,499,500],{},"terraform",".",[17,503,504],{},"The structure of this example repository was created a few hypothetical years\nago, when the number of existing microservices and resources (DB, message queues,\netc.) was significantly smaller. At the time, it was feasible to keep Terraform\ndefinitions together because it was easier to maintain: Cloud resources were\nmanaged in one shot!",[17,506,507,508,511,512,515],{},"As time went by, the number of Products and the team grew, and engineers\nstarted facing concurrency issues: Terraform locks executions on shared storage\nwhen someone else is running ",[194,509,510],{},"terraform apply",", and there is a general slowness on\n",[189,513,514],{},"every execution"," since the number of data sources to sync is frightening.",[17,517,518,519,501],{},"A mono-repository approach is not necessarily bad; versioning is actually\nsimpler when performed in one single repository. 
Ideally, there won't be many\nchanges on the scale of GiB, meaning that it is safe to proceed with this approach ",[520,521,522],"em",{},"as\nlong as the Terraform remote states are divided",[524,525,527,528,531],"h3",{"id":526},"splitting-the-modules-sub-path-to-its-own-repository","Splitting the ",[194,529,530],{},"modules"," sub-path to its own repository",[17,533,534,535,537],{},"One thing to mention, though, is the ",[194,536,530],{}," sub-path: this one could be stored\nin a different git repository to leverage its own versioning. Since Terraform\nmodules and their implementations don't always evolve at the same pace, keeping\ntwo distinct version trees is beneficial. Additionally, a separate repository\nfor Terraform modules allows the specification of \"pinned versions\", e.g.:",[487,539,544],{"className":540,"code":542,"language":543,"meta":346},[541],"language-hcl","module \"aws_main_vpc\" {\n  source = \"git::https:\u002F\u002Fgithub.com\u002Fterraform-aws-modules\u002Fterraform-aws-vpc.git?ref=2ca733d\"\n  # Note the ref=${GIT_REVISION_DIGEST}\n}\n","hcl",[194,545,542],{"__ignoreMap":346},[17,547,548],{},"That reference for a module's version should always be specified, regardless of\nwhether it comes from an internal\u002Fprivate repository or a public one. 
When you specify the\nversion, you are ensuring reproducibility.",[17,550,551,552,554,555,560],{},"Therefore, let's move the ",[194,553,530],{}," sub-path to another git repository,\nfollowing instructions from ",[98,556,559],{"href":557,"rel":558},"https:\u002F\u002Fstackoverflow.com\u002Fquestions\u002F359424\u002Fdetach-move-subdirectory-into-separate-git-repository\u002F17864475#17864475",[102],"this StackOverflow answer"," so that the git commit\nhistory is preserved:",[562,563,565],"h4",{"id":564},"_0","0.",[17,567,568,569,572],{},"Go to the monorepo path and create a branch from the commits under the\n",[194,570,571],{},"monorepo\u002Fmodules"," path",[487,574,579],{"className":575,"code":577,"language":578,"meta":346},[576],"language-bash","MAIN_BIGGER_REPO=\u002Fpath\u002Fto\u002Fthe\u002Fmonorepo\ncd \"${MAIN_BIGGER_REPO}\"\ngit subtree split -P modules -b refact-modules\n","bash",[194,580,577],{"__ignoreMap":346},[562,582,584],{"id":583},"_1","1.",[17,586,587],{},"Create the new repository",[487,589,592],{"className":590,"code":591,"language":578,"meta":346},[576],"mkdir \u002Fpath\u002Fto\u002Fthe\u002Fterraform-modules && cd $_\ngit init\ngit pull \"${MAIN_BIGGER_REPO}\" refact-modules\n",[194,593,591],{"__ignoreMap":346},[562,595,597],{"id":596},"_2","2.",[17,599,600],{},"Link the new repository to your remote Git (server)",[487,602,605],{"className":603,"code":604,"language":578,"meta":346},[576],"git remote add origin \u003Cgit@git.com:user\u002Fterraform-modules.git>\ngit push -u origin master\n",[194,606,604],{"__ignoreMap":346},[562,608,610],{"id":609},"_3","3.",[17,612,613,616,617,620],{},[451,614,615],{},"OPTIONAL"," Cleanup inside ",[194,618,619],{},"$MAIN_BIGGER_REPO",", if desired",[487,622,625],{"className":623,"code":624,"language":578,"meta":346},[576],"cd \"${MAIN_BIGGER_REPO}\"\ngit rm -rf modules\ngit filter-branch --prune-empty \\\n    --tree-filter \"rm -rf modules\" -f 
HEAD\n",[194,626,624],{"__ignoreMap":346},[524,628,630],{"id":629},"lets-start-strangling-the-repository","Let's start strangling the repository",[17,632,633,634,637],{},"Now that a substantial piece of code has been moved somewhere else, it is time to\nput the ",[98,635,444],{"href":442,"rel":636},[102]," into practice.",[17,639,640,641,644,645,647],{},"Move all the existing content as-is to the ",[194,642,643],{},"legacy"," sub-path, keeping the same\nrepository and change history (commits). This also allows applying the ",[194,646,643],{},"\ncode as it used to be from one of those paths.",[487,649,652],{"className":650,"code":651,"language":492},[490],".\n└── legacy\n    ├── global\n    │   └── aws\n    ├── production\n    │   └── aws\n    └── staging\n        └── aws\n",[194,653,651],{"__ignoreMap":346},[17,655,656,657,662,663,665],{},"Once the content is moved to legacy, the idea is to follow the ",[98,658,661],{"href":659,"rel":660},"https:\u002F\u002Fwww.oreilly.com\u002Flibrary\u002Fview\u002F97-things-every\u002F9780596809515\u002Fch08.html",[102],"Boy Scout rule","\nin order to strangle the ",[194,664,643],{}," content little by little (unless you are\nreally committed to migrating it all at once, which is going to be exhausting).",[17,667,668],{},"The Boy Scout rule goes like this:",[462,670,671,678,681],{},[75,672,673,674,472],{},"every time a task that involves deprecated code appears, we implement it on\n",[98,675,677],{"href":676},"..\u002Fstdout\u002Fblog\u002F2021\u002F03\u002Fterraform-best-practices.org","the new structure",[75,679,680],{},"import the Terraform state to keep the Cloud resources that a given code\nrepresents\u002Fdescribes;",[75,682,683,684,501],{},"remove the state and the code from ",[194,685,643],{},[17,687,688,689,691],{},"Until there is nothing left inside ",[194,690,643],{}," (or there are only unused\nresources\u002Fleft-behinds that could be destroyed\u002Fgarbage collected either 
way).",[562,693,695],{"id":694},"import-state-remove-state-and-code-from-what-where","Import state? Remove state and code from what? Where?",[17,697,698,699,702,703,501],{},"That will depend on the kind of resource we are migrating from the remote state.\nAt the bottom of each ",[194,700,701],{},"resource"," page in Terraform's provider documentation you can\nfind a reference command to import existing resources into your Terraform code\nspecification, e.g.: ",[98,704,707],{"href":705,"rel":706},"https:\u002F\u002Fregistry.terraform.io\u002Fproviders\u002Fhashicorp\u002Faws\u002Flatest\u002Fdocs\u002Fresources\u002Fdb%5Finstance#import",[102],"AWS RDS DB instance",[17,709,710,711,714,715,720,721,724],{},"Suppose we want to replace the code of the AWS RDS Aurora cluster defined in\n",[194,712,713],{},"production\u002Faws"," and re-implement it using ",[98,716,719],{"href":717,"rel":718},"https:\u002F\u002Fgithub.com\u002Fterraform-aws-modules\u002Fterraform-aws-rds-aurora",[102],"the community module",".\nAfter creating the corresponding sub-path in the monorepo according to your\npreference, provisioning the bucket and initializing the Terraform ",[194,722,723],{},"backend",":",[462,726,727,741,764],{},[75,728,729,730,734,735],{},"implement the definition of the community module\n",[98,731,733],{"href":717,"rel":732},[102],"github.com\u002Fterraform-aws-modules\u002Fterraform-aws-rds-aurora"," with the closest\nparameters to the existing one; e.g.:",[487,736,739],{"className":737,"code":738,"language":543,"meta":346},[541],"module \"aws_aurora_main_cluster\" {\n  source  = \"terraform-aws-modules\u002Frds-aurora\u002Faws\"\n  version = \"~> 5.2\"\n\n  # ...\n}\n",[194,740,738],{"__ignoreMap":346},[75,742,743,744,750,753,754,757,758],{},"import the Terraform state of the previous (existing) cluster",[487,745,748],{"className":746,"code":747,"language":578,"meta":346},[576],"terraform import 'module.aws_aurora_main_cluster.aws_rds_cluster.this[0]' 
main-database-name\nterraform import 'module.aws_aurora_main_cluster.aws_rds_cluster_instance.this[0]' main-database-instance-name-01\nterraform import 'module.aws_aurora_main_cluster.aws_rds_cluster_instance.this[1]' main-database-instance-name-02\n\n# ...\n",[194,749,747],{"__ignoreMap":346},[751,752],"br",{},"then, if you haven't yet and would like to \"match reality\" between the\nexisting and the specified resource, run ",[194,755,756],{},"terraform plan"," a few times and\nadjust the parameters until Terraform reports:",[487,759,762],{"className":760,"code":761,"language":492,"meta":346},[490],"No changes. Your infrastructure matches the configuration.\n",[194,763,761],{"__ignoreMap":346},[75,765,766,767,769,770,776,778,779,781,782],{},"last but not least, remove the corresponding resources from the ",[194,768,643],{},"\nTerraform state so that it no longer tracks changes to them and doesn't\ntry to destroy them once the resource definition is no longer in that code\nbase:",[487,771,774],{"className":772,"code":773,"language":578,"meta":346},[576],"# Hypothetical name of the resource inside production\u002Faws\u002Fmain.tf\nterraform state rm aws_rds_cluster.default \\\n    'aws_rds_cluster_instance.default[0]' 'aws_rds_cluster_instance.default[1]'\n\n# ...\n",[194,775,773],{"__ignoreMap":346},[751,777],{},"once that is performed, feel free to remove the corresponding resource's\ndefinition from the ",[194,780,643],{}," code.",[487,783,786],{"className":784,"code":785,"language":543,"meta":346},[541],"resource \"aws_rds_cluster\" \"default\" {\n  # ...\n}\n\nresource \"aws_rds_cluster_instance\" \"default\" {\n  count = var.number_of_database_instances\n\n  # ...\n}\n",[194,787,785],{"__ignoreMap":346},{"title":346,"searchDepth":347,"depth":347,"links":789},[790,791,792],{"id":375,"depth":347,"text":376},{"id":415,"depth":347,"text":416},{"id":481,"depth":347,"text":482,"children":793},[794,797],{"id":526,"depth":795,"text":796},3,"Splitting the modules sub-path to 
its own repository",{"id":629,"depth":795,"text":630},"2021-08-11T00:00:00+02:00","Want to know how to better organize existing Terraform code? If you grasp these ideas, it could even serve for not-yet Infrastructure as Code resources. Jump in and take a look.",{},"\u002Fposts\u002F2021\u002F08\u002Freal-life-terraform-refactoring-guide",{"title":370,"description":799},"real-life-terraform-refactoring-guide","posts\u002F2021\u002F08\u002Freal-life-terraform-refactoring-guide",[366,806,500],"cloud","jPrVDAjqTerXgzb6UVNqa_9QMs0RvtE05oSR-ywTlgQ",{"id":809,"title":810,"author":7,"body":811,"createdAt":1284,"description":1285,"extension":355,"meta":1286,"navigation":357,"path":1287,"seo":1288,"slug":1289,"stem":1290,"tags":1291,"__hash__":1292},"posts\u002Fposts\u002F2021\u002F06\u002Fterraform-atomic-design.md","Terraform: Atomic Design",{"type":9,"value":812,"toc":1273},[813,815,824,832,835,838,843,855,858,867,870,873,884,887,890,894,897,914,917,921,931,940,943,949,956,989,991,997,1000,1004,1007,1010,1014,1017,1020,1026,1036,1048,1052,1058,1067,1070,1081,1088,1092,1106,1113,1119,1133,1136,1147,1158,1162,1171,1179,1190,1198,1204,1211,1214,1220,1224,1227,1235,1238,1246,1250],[91,814,376],{"id":375},[17,816,817,818,823],{},"Following ",[98,819,822],{"href":820,"rel":821},"https:\u002F\u002Fpragprog.com\u002Ftitles\u002Ftpp20\u002Fthe-pragmatic-programmer-20th-anniversary-edition\u002F",[102],"The Pragmatic Programmer"," mantra, I do my best to ...",[405,825,826],{},[17,827,828,831],{},[189,829,830],{},"Learn at least one new language every year."," Different languages solve the same\nproblems in different ways. 
By learning several different approaches, you can\nhelp broaden your thinking and avoid getting stuck in a rut.",[17,833,834],{},"Not necessarily to show it off or to be capable of talking about random\ntechnologies, but to expand and train my problem-solving skills, to get new\nperspectives when approaching a challenge.",[17,836,837],{},"We might not notice it but when we learn (or have learned) to code we aren't\njust learning to type some characters that a compiler\u002Finterpreter can\nunderstand, it is a new way of thinking, a new way of breaking down solutions\n(into sequential steps).",[405,839,840],{},[17,841,842],{},"It doesn't matter whether you ever use any of these technologies on a project,\nor even whether you put them on your resume. The process of learning will expand\nyour thinking, opening you to new possibilities and new ways of doing things.\nThe cross-pollination of ideas is important;",[17,844,845,846,849,850,501],{},"As someone who works intensively with infrastructure components (servers,\ndatabases, Kubernetes, CI\u002FCD, etc) I aimed for something completely different\nthis year. Something that stands on ",[520,847,848],{},"a whole different spectrum"," of the system,\nthis year I decided to learn ",[98,851,854],{"href":852,"rel":853},"https:\u002F\u002Fflutter.dev\u002F",[102],"Flutter",[17,856,857],{},"In-a-nutshell, Flutter is a better React Native. 
A framework that enables\nimplementation of GUI applications for multiple platforms with a single code\nbase.",[17,859,860,861,866],{},"Then it reminded me of a discussion I had with a friend in the past about React\ncomponents and the ",[98,862,865],{"href":863,"rel":864},"https:\u002F\u002Fbradfrost.com\u002Fblog\u002Fpost\u002Fatomic-web-design\u002F",[102],"Atomic Design"," methodology, which helps to structure web\ncomponents into modules.",[17,868,869],{},"In the Atomic Design methodology, the granularity of modules is distinguished by\nusing chemistry-inspired names: atoms, molecules and organisms.",[17,871,872],{},"Then the connection of the ideas from",[72,874,875,878,881],{},[75,876,877],{},"Pragmatic Programmer's cross-pollination to",[75,879,880],{},"Atomic Design (on Flutter components) to",[75,882,883],{},"Terraform modules",[17,885,886],{},"came almost like a thunderbolt, striking me with this insight when I was\nrefactoring a huge legacy Terraform code base with lots of code duplication\n(read: copy+paste, \"we fix it later\", then the author quits the company and\nnever fixes anything).",[17,888,889],{},"Although initially proposed as a Web UI methodology, Infrastructure as Code\ntools such as Terraform that make heavy use of modules can benefit from\nAtomic Design to improve their code reusability and massively reduce duplication.",[91,891,893],{"id":892},"details","Details",[17,895,896],{},"The Atomic Design methodology proposes five distinct levels, listed from the\nfinest to the coarsest granularity:",[462,898,899,902,905,908,911],{},[75,900,901],{},"Atoms;",[75,903,904],{},"Molecules;",[75,906,907],{},"Organisms;",[75,909,910],{},"Templates;",[75,912,913],{},"Pages.",[17,915,916],{},"However, to extract the gist, we'll only be focusing on Atoms, Molecules, and\nOrganisms (from 1. to 3.). 
Templates and Pages are too specialized for Web UI\ndevelopment.",[524,918,920],{"id":919},"atoms","Atoms",[17,922,923,924,926,927,930],{},"Atoms represent the finest grain in terms of granularity in the design. When\nreferring specifically to their implementation in Terraform, a ",[194,925,701],{}," and a\nsmall-scoped, single-purpose ",[194,928,929],{}," can be used interchangeably.",[17,932,933,934,936,937,939],{},"Sometimes the idea of turning a simple resource into a module makes sense to\nease parameterization and reusability, especially when it is necessary to parse\ninputs. Although, due to its extremely limited scope, it might not look attractive\nto convert the ",[194,935,701],{}," into a ",[194,938,929],{}," at first sight, in the long run it\npays off to do so in order to achieve scalability and reproducibility.",[17,941,942],{},"e.g.:",[487,944,947],{"className":945,"code":946,"language":543,"meta":346},[541],"data \"aws_route53_zone\" \"default\" {\n  zone_id = var.zone_id\n  name    = var.zone_name\n}\n\nresource \"aws_route53_record\" \"default\" {\n  zone_id = data.aws_route53_zone.default.zone_id\n  name    = var.name\n\n  ttl  = var.ttl\n  type = var.record_type\n\n  records = var.records\n\n  # Render the alias block only when an alias is given\n  # (assumes var.alias defaults to null)\n  dynamic \"alias\" {\n    for_each = var.alias == null ? 
[] : [var.alias]\n\n    content {\n      # Inside a dynamic block the iterator is named after the block label\n      name    = alias.value.name\n      zone_id = try(alias.value.zone_id, data.aws_route53_zone.default.zone_id)\n\n      evaluate_target_health = lookup(\n        alias.value,\n        \"evaluate_target_health\",\n        false,\n      )\n    }\n  }\n}\n",[194,948,946],{"__ignoreMap":346},[17,950,951,952,955],{},"In this case, even though ",[194,953,954],{},"aws_route53_record"," is a simple resource that might\nfeel too narrow in scope to warrant a module, the implementation of the module\nallows bundling the AWS Route53 Zone data source together with it, which helps to:",[462,957,958,965,978],{},[75,959,960,961,964],{},"provide a simpler contract by allowing the usage of 
",[194,962,963],{},"zone_name"," alone;",[75,966,967,968,970,971,973,974,977],{},"validate the ",[194,969,963],{}," input, ensuring that a given ",[194,972,963],{}," corresponds to an\nactual ",[189,975,976],{},"existing and valid"," AWS resource;",[75,979,980,981,984,985,988],{},"same goes to ",[194,982,983],{},"zone_id",", which will feel (and oftentimes, be) redundant,\n",[520,986,987],{},"when"," specified as an input Terraform will read the data from AWS API\nensuring consistency.",[17,990,942],{},[487,992,995],{"className":993,"code":994,"language":543,"meta":346},[541],"module \"awesome_dns_fqdn\" {\n  source = \"path\u002Fto\u002Fmodules\u002Fatoms\u002Faws_route53_record\"\n  version = \"~> 1.0\"\n\n  name      = \"record.example.com\"\n  zone_name = \"example.com.\"\n\n  record_type = \"CNAME\"\n  records     = [\"1.2.3.4\"]\n}\n",[194,996,994],{"__ignoreMap":346},[17,998,999],{},"Hence, resources and modules are sometimes interchangeable as they deliver the\nsame outcome for the finest resources' granularity.",[524,1001,1003],{"id":1002},"molecules","Molecules",[17,1005,1006],{},"When groups of atoms are bounded together, they create a molecule which is the\nsmallest fundamental unit of a compound.",[17,1008,1009],{},"Contrary to the original Atomic Design for Web UI, in Terraform, Atoms are\nuseful on their own. However, the usage of atoms comes with a high price on\nscalability: code duplication. 
Actually, duplication is an understatement, it is\nmore like code exponentiation (more on this later).",[562,1011,1013],{"id":1012},"implementation-example","Implementation example",[17,1015,1016],{},"Suppose we are creating a public facing API Gateway that needs a DNS record.",[17,1018,1019],{},"Let's compose it with the previous example:",[487,1021,1024],{"className":1022,"code":1023,"language":543,"meta":346},[541],"data \"aws_route53_zone\" \"default\" {\n  name = var.zone_name\n}\n\nmodule \"awesome_api_gateway_certificate\" {\n  source  = \"terraform-aws-modules\u002Facm\u002Faws\"\n  version = \"~> v3.0\"\n\n  domain_name = var.domain_name\n  zone_id     = data.aws_route53_zone.default.zone_id\n\n  wait_for_validation = true\n}\n\nmodule \"awesome_api_gateway\" {\n  source = \"terraform-aws-modules\u002Fapigateway-v2\u002Faws\"\n  version = \"~> 1.0\"\n\n  name          = var.api_gateway_name\n  description   = var.api_gateway_description\n  protocol_type = \"HTTP\"\n\n  cors_configuration = {\n    allow_headers = [\n      \"content-type\",\n      \"x-amz-date\",\n      \"authorization\",\n      \"x-api-key\",\n      \"x-amz-security-token\",\n      \"x-amz-user-agent\",\n    ]\n    allow_methods = [\"*\"]\n    allow_origins = [\"*\"]\n  }\n\n  # Custom domain\n  domain_name                 = var.domain_name\n  domain_name_certificate_arn = module.awesome_api_gateway_certificate.acm_certificate_arn\n\n  # Routes and integrations\n  integrations = var.api_gateway_integrations\n}\n\nmodule \"awesome_dns_fqdn\" {\n  source  = \"path\u002Fto\u002Fmodules\u002Fatoms\u002Faws_route53_record\"\n  version = \"~> 1.0\"\n\n  name    = var.domain_name\n  zone_id = data.aws_route53_zone.default.zone_id\n\n  record_type = \"CNAME\"\n  alias     = {\n    name    = module.awesome_api_gateway.apigatewayv2_domain_name_configuration[0].target_domain_name\n    zone_id = module.awesome_api_gateway.apigatewayv2_domain_name_configuration[0].hosted_zone_id\n  
}\n}\n",[194,1025,1023],{"__ignoreMap":346},[17,1027,1028,1029,1031,1032,1035],{},"This helps illustrating an example in which the ",[194,1030,954],{}," atom could\nbe easily replaced with its equivalent resource and it would still provide the\n",[189,1033,1034],{},"same"," outcome.",[17,1037,1038,1039,1041,1042,1044,1045,1047],{},"Commonly it is possible to use ",[194,1040,929],{}," and ",[194,1043,701],{}," interchangeably as Atoms,\nthe decision of whether or not to implement a ",[194,1046,929],{}," is ultimately defined by\nthe need of parsing and\u002For validating the inputs (variables).",[562,1049,1051],{"id":1050},"usage-example","Usage example",[487,1053,1056],{"className":1054,"code":1055,"language":543,"meta":346},[541],"module \"awesome_lambda\" {\n  source  = \"path\u002Fto\u002Fmodules\u002Fmolecules\u002Faws_lambda_function\"\n  version = \"~> 1.0\"\n\n  function_name = \"awesome\"\n  description   = \"An Awesome lambda function for the Awesome API Gateway\"\n  handler       = \"index.lambda_handler\"\n  runtime       = \"python3.8\"\n\n  # Incomplete implementation, don't use this on production\n}\n\nmodule \"another_awesome_lambda\" {\n  source  = \"path\u002Fto\u002Fmodules\u002Fmolecules\u002Faws_lambda_function\"\n  version = \"~> 1.0\"\n\n  function_name = \"awesome\"\n  description   = \"An Awesome lambda function for the Awesome API Gateway\"\n  handler       = \"index.lambda_handler\"\n  runtime       = \"python3.8\"\n\n  # Incomplete implementation, don't use this on production\n}\n\nmodule \"awesome_api_gateway\" {\n  source  = \"path\u002Fto\u002Fmodules\u002Fmolecules\u002Faws_api_gateway\"\n  version = \"~> 1.0\"\n\n  domain_name = \"record.example.com\"\n  zone_name   = \"example.com.\"\n\n  api_gateway_name        = \"awesome-api-gateway\"\n  api_gateway_description = \"An Awesome API Gateway\"\n\n  api_gateway_integrations = {\n    \"POST \u002F\" = {\n      lambda_arn             = module.awesome_lambda.function_arn\n      
payload_format_version = \"2.0\"\n    }\n\n    \"$default\" = {\n      lambda_arn = module.another_awesome_lambda.function_arn\n    }\n  }\n}\n",[194,1057,1055],{"__ignoreMap":346},[17,1059,1060,1061,1066],{},"As you probably have already realized, when the level of abstraction goes up\n(e.g. from atom to molecule) the module implementation is in itself a good\nimplementation example (i.e. as in ",[98,1062,1065],{"href":1063,"rel":1064},"https:\u002F\u002Fgithub.com\u002Fterraform-aws-modules\u002Fterraform-aws-lambda\u002Fblob\u002Fmaster\u002Fmain.tf",[102],"community modules examples",").",[17,1068,1069],{},"They help to self-document the usage and implementation of a given module and\nthrough generic implementations it allows us to have multiple molecules\nimplementing multiple distinct use-cases. e.g.:",[462,1071,1072,1075,1078],{},[75,1073,1074],{},"Public API Gateway with DNS record + TLS certificate;",[75,1076,1077],{},"Public API Gateway v1, no DNS record;",[75,1079,1080],{},"Private API Gateway.",[17,1082,1083,1084,1087],{},"Why would we chose to implement multiple times the Atom modules in order to\ncreate multiple distinct use-cases? We are getting closer to the ",[520,1085,1086],{},"code\nexponentiation"," problem and solution proposal. Can you feel it?",[524,1089,1091],{"id":1090},"organisms","Organisms",[17,1093,1094,1095,1099,1100,1105],{},"Going further, the ",[98,1096,1098],{"href":1097},"#usage-example","example of composition for molecules"," can have its hard-coded\nvalues turned into variables in order to compose an Organism, which can\nfacilitate the implementation of the same definition across different\nenvironments. 
This achieves reproducibility as well as the ",[98,1101,1104],{"href":1102,"rel":1103},"https:\u002F\u002F12factor.net\u002Fdev-prod-parity",[102],"Factor X."," of the\nTwelve Factor App.",[17,1107,1108,1109,1112],{},"However, it is important to note that the level of abstraction between Organisms\nand Molecules can be easily confused or misunderstood. As a\nrule of thumb, an Organism is a composition of Molecules that allows parameterization for\nbusiness or domain-specific logic (e.g. the actual ",[194,1110,1111],{},"awesome_api"," configuration).\nTherefore, in comparison with Molecules, Organisms (usually) have a lower\nlevel of generalization since they are business-specialized modules.",[17,1114,1115,1116,1118],{},"Iterating over our implementation example, the Organism would implement the\n",[194,1117,1111],{},", creating the following resources:",[72,1120,1121,1124,1127,1130],{},[75,1122,1123],{},"AWS Lambda function;",[75,1125,1126],{},"AWS API Gateway;",[75,1128,1129],{},"TLS Certificate on AWS ACM;",[75,1131,1132],{},"DNS record on AWS Route53.",[17,1134,1135],{},"By implementing the previous examples as organisms we:",[462,1137,1138,1141,1144],{},[75,1139,1140],{},"reduce the amount of boilerplate code;",[75,1142,1143],{},"foster reusability of modules;",[75,1145,1146],{},"provide a simple interface for non-operators to manage TF code.",[17,1148,1149,1150,1153,1154,1157],{},"When you sum it all up, you will notice that it is ",[189,1151,1152],{},"all about autonomy"," and\n\"DevOps\" through encouragement of self-service Ops. One wouldn't need to know a\nlot about Terraform to grab a module and pass some parameters to it. With a\ncode review process in place, Operators and Software Developers can manage the\nInfrastructure in harmony, ",[189,1155,1156],{},"together",". (:",[524,1159,1161],{"id":1160},"code-exponentiation-what","Code Exponentiation? 
What?",[17,1163,1164,1165,1170],{},"Read that as a dramatization of the ",[98,1166,1169],{"href":1167,"rel":1168},"https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDuplicate%5Fcode",[102],"\"code duplication\""," term.",[17,1172,1173,1174,501],{},"When it comes to Infrastructure as Code, there is no easy way around the jungle\nof resources that grows over time. Fast-paced tech companies are \"moving fast\nand breaking things\", and oftentimes the Operators are worried about a massive\nnumber of challenges at once: keeping the servers up and running, with a consistent\nresponse time, low error rate, and all that ",[98,1175,1178],{"href":1176,"rel":1177},"https:\u002F\u002Fsre.google\u002Fsre-book\u002Ftable-of-contents\u002F",[102],"playbook from Google's SRE wisdom",[17,1180,1181,1182,1185,1186,1189],{},"All things considered, a good Infrastructure as Code design is generally\na first-world problem. However, as time passes it evolves into a real issue\nthat slows down the implementation of resources as code. Either that or there\n
Either that or there\nwill be a ",[189,1183,1184],{},"huge ton"," of copy+paste to keep up with the pace, followed by a\nroutine of find+replace when changes are applied, ",[520,1187,1188],{},"then"," harder to track pull\nrequests and slower code reviews.",[17,1191,1192,1193,1195,1196,724],{},"Lets take our ",[194,1194,1111],{}," example and scale it up to multiple environments\nfollowed by a second ",[194,1197,1111],{},[487,1199,1202],{"className":1200,"code":1201,"language":492,"meta":346},[490],".\n├── development\n│   ├── an-awesome-api\n│   │   └── main.tf\n│   └── another-awesome-api\n│       └── main.tf\n├── staging\n│   ├── an-awesome-api\n│   │   └── main.tf\n│   └── another-awesome-api\n│       └── main.tf\n└── production\n    ├── an-awesome-api\n    │   └── main.tf\n    └── another-awesome-api\n        └── main.tf\n",[194,1203,1201],{"__ignoreMap":346},[17,1205,1206,1207,1066],{},"Note that this directory structure is inspired on the proposed ideas from the\n[Terraform best practices post](",[1208,1209],"binding",{"value":1210},"\u003C relref \"terraform-best-practices\" >",[17,1212,1213],{},"In order to replicate the configuration and ensure consistency, the following is\nway simpler to implement (and review) than copy+paste huge chunks of Terraform\ndefinitions",[487,1215,1218],{"className":1216,"code":1217,"language":543,"meta":346},[541],"module \"awesome_api\" {\n  source = \"path\u002Fto\u002Fmodules\u002Forganisms\u002Faws_lambda_with_api_gateway\"\n  version = \"~> 1.0\"\n\n  domain_name = \"record.example.com\"\n  zone_name   = \"example.com.\"\n\n  lambda_functions = [\n    # Index 0 -- An Awesome Lambda Function, used for POST\n    {\n      name        = \"an-awesome\"\n      description = \"An Awesome lambda function for the Awesome API Gateway\"\n      handler     = \"an_awesome.lambda_handler\"\n      runtime     = \"python3.8\"\n    },\n    # Index 1 -- Another Awesome Lambda Function, used as $default\n    {\n      name        = 
\"another-awesome\"\n      description = \"Another Awesome lambda function for the Awesome API Gateway\"\n      handler     = \"another_awesome.lambda_handler\"\n      runtime     = \"python3.8\"\n    },\n  ]\n\n  api_gateway_name = \"awesome-api-gateway\"\n  api_gateway_description = \"An Awesome API Gateway\"\n\n  api_gateway_integrations = {\n    \"POST \u002F\" = {\n      lambda_function_index  = 0\n      payload_format_version = \"2.0\"\n    }\n\n    \"$default\" = {\n      lambda_function_index = 1\n    }\n  }\n}\n",[194,1219,1217],{"__ignoreMap":346},[91,1221,1223],{"id":1222},"conclusion","Conclusion",[17,1225,1226],{},"At the end of the day we get an ugly Terraform state containing many",[487,1228,1233],{"className":1229,"code":1231,"language":1232,"meta":346},[1230],"language-ruby","module.something.module.something_else.module.yet_another_thing...\n","ruby",[194,1234,1231],{"__ignoreMap":346},[17,1236,1237],{},"But the productivity boost gained by merging modules based on context is a worth\ninvestment. 
Especially for huge Terraform repositories with multiple teams\ncollaborating and managing a lot of resources.",[17,1239,1240,1241,501],{},"Cross-team collaboration is fostered by applying the Atomic Design methodology\nfor Terraform modules, code reusability becomes an important factor over\ncopy+paste and the repository gravitates towards the ",[98,1242,1245],{"href":1243,"rel":1244},"https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDon%27t%5Frepeat%5Fyourself",[102],"DRY principle",[91,1247,1249],{"id":1248},"same-post-different-places","Same post, different places",[72,1251,1252,1259,1266],{},[75,1253,1254,472],{},[98,1255,1258],{"href":1256,"rel":1257},"https:\u002F\u002Fwww.reddit.com\u002Fr\u002FTerraform\u002Fcomments\u002Fpd708z\u002Fterraform%5Fmodules%5Fatomic%5Fdesign\u002F",[102],"reddit.com: Terraform Modules: Atomic Design - r\u002FTerraform",[75,1260,1261,472],{},[98,1262,1265],{"href":1263,"rel":1264},"https:\u002F\u002Fdev.to\u002Fmacunha\u002Fterraform-modules-atomic-design-3i7m",[102],"dev.to: Terraform Modules: Atomic Design - DEV Community",[75,1267,1268,472],{},[98,1269,1272],{"href":1270,"rel":1271},"https:\u002F\u002Fweekly.tf\u002Fissues\u002Fweekly-tf-issue-51-terraform-atomic-design-ec2-image-builder-736257",[102],"weekly.tf: #51 - Terraform Atomic Design, EC2 Image Builder",{"title":346,"searchDepth":347,"depth":347,"links":1274},[1275,1276,1282,1283],{"id":375,"depth":347,"text":376},{"id":892,"depth":347,"text":893,"children":1277},[1278,1279,1280,1281],{"id":919,"depth":795,"text":920},{"id":1002,"depth":795,"text":1003},{"id":1090,"depth":795,"text":1091},{"id":1160,"depth":795,"text":1161},{"id":1222,"depth":347,"text":1223},{"id":1248,"depth":347,"text":1249},"2021-06-29T00:00:00+02:00","Adapting the Atomic Design methodology to Infrastructure as Code components to help foster code reusability, ease of maintenance and agile development of the infrastructure. 
Creates standardization, validates inputs and brings the Terraform definitions closer to the developers (self-service Ops).",{},"\u002Fposts\u002F2021\u002F06\u002Fterraform-atomic-design",{"title":810,"description":1285},"terraform-atomic-design","posts\u002F2021\u002F06\u002Fterraform-atomic-design",[366,806,500],"51i47DRdlOrtRDt7XIlsZ2X7XP5klw9_dS9yyn5J_3A",[1294,1473],{"id":1295,"title":1296,"body":1297,"createdAt":1455,"description":1456,"extension":355,"meta":1457,"navigation":357,"path":1466,"seo":1467,"slug":1468,"stem":1469,"tags":1470,"website":7,"__hash__":1472},"projects\u002Fprojects\u002F2021\u002Ffreeletics-jenkins-redesign.md","Freeletics: Jenkins CI\u002FCD Redesign",{"type":9,"value":1298,"toc":1448},[1299,1303,1307,1310,1314,1317,1337,1341,1345,1348,1354,1372,1378,1384,1388,1418,1422,1425],[12,1300,1302],{"id":1301},"introduction","Introduction",[91,1304,1306],{"id":1305},"summary","Summary",[17,1308,1309],{},"Freeletics ran three fragmented CI\u002FCD systems in parallel: Jenkins for back-end\nand web, CircleCI for mobile, and Travis for tests. 
Jenkins itself was at least\nfive years out-of-date, built on a customized Jenkins Job Builder (JJB) fork\nthat didn't support Jenkins Pipelines, deployed through a Helm Chart that\nembedded secrets directly in its values file (coupling every configuration\nchange to a secrets release).",[91,1311,1313],{"id":1312},"problem","Problem",[17,1315,1316],{},"Three specific failure modes drove the redesign:",[72,1318,1319,1325,1331],{},[75,1320,1321,1324],{},[189,1322,1323],{},"Morning build storms:"," Dependabot merged PRs in batches at the start of\nthe day, triggering simultaneous Docker image builds that overwhelmed\nJenkins master-to-slave HTTP communication and caused widespread job hangs;",[75,1326,1327,1330],{},[189,1328,1329],{},"Modernization blocked:"," the outdated JJB fork rejected Pipeline definitions\nat deploy time, making it impossible to adopt any Jenkins feature released in\nthe past two years;",[75,1332,1333,1336],{},[189,1334,1335],{},"Untestable, untouchable Helm Chart:"," JJB YAML was rendered inside Go\ntemplates and executed during chart install. Any change carried the risk of a\nbroken Jenkins release with no rollback path that didn't also revert secrets.",[12,1338,1340],{"id":1339},"solution","Solution",[91,1342,1344],{"id":1343},"technical-implementation","Technical Implementation",[17,1346,1347],{},"Executed in four sequential phases:",[17,1349,1350,1353],{},[189,1351,1352],{},"Phase 1 - Tool evaluation (Feb-Mar\u002F2020):"," benchmarked Docker image build\ntimes across CircleCI, GitLab CI (shared and self-hosted runners with Kaniko),\nand Jenkins. Jenkins produced the fastest server-side builds due to lower\nlatency and full control over runner hardware sizing. 
Decision: invest in\nJenkins, redesign from scratch.",[17,1355,1356,1359,1360,1363,1364,1367,1368,1371],{},[189,1357,1358],{},"Phase 2 - Pipeline modernization (Aug\u002F2020):"," replaced JJB with Jenkins\nConfiguration as Code (JCasC) and Job DSL templates managed through Terraform,\nmaking every job definition a reviewable pull request. Migrated all Docker image\nbuilds from Docker-in-Docker to Kaniko (running as unprivileged ephemeral\nKubernetes pods). Redesigned the Jenkins Groovy Shared Library around a\ncomposable ",[194,1361,1362],{},"KanikoBuilder"," class, reducing per-repository Jenkinsfiles to\ndeclarative build specifications. Introduced image multi-tagging (",[194,1365,1366],{},"qa-\u003CSHA1>",",\n",[194,1369,1370],{},"qa-latest-master",") to support the QA stack's tag-based image resolution.",[17,1373,1374,1377],{},[189,1375,1376],{},"Phase 3 - Authorization (Sep\u002F2020):"," implemented GitHub OAuth, mapping\nJenkins RBAC roles directly to GitHub team membership. Replaced open admin\naccess (any G-Suite account) with a reviewable, auditable access model using\nthe same workflow as the rest of the infrastructure.",[17,1379,1380,1383],{},[189,1381,1382],{},"Phase 4 - Secrets decoupling (Sep\u002F2020):"," separated secrets management from\nthe Helm Chart release cycle. Static credentials (AWS IAM keys, API tokens)\nare Sops-encrypted in the repository and synced to Jenkins Credentials Store\nthrough JCasC. Runtime secrets (Kubeconfigs, Kubernetes credentials) are stored\nin AWS Secrets Manager and read on-the-fly by pipelines via the credentials\nprovider plugin. 
Jenkins Helm releases became configuration-only operations.",[91,1385,1387],{"id":1386},"impact-and-results","Impact and results",[72,1389,1390,1396,1406,1412],{},[75,1391,1392,1395],{},[189,1393,1394],{},"Fully reproducible deployments:"," the Jenkins Helm release can be deleted\nand recreated from Terraform + JCasC with complete fidelity (no manual state,\nno out-of-band configuration);",[75,1397,1398,1401,1402,1405],{},[189,1399,1400],{},"Build time advantage preserved:"," migrating from Docker-in-Docker to Kaniko\nmaintained Jenkins' benchmark advantage over alternatives (~2:00 for\n",[194,1403,1404],{},"fl-backend-rails"," vs ~10:01 on CircleCI, as of the Feb\u002F2020 evaluation);",[75,1407,1408,1411],{},[189,1409,1410],{},"Unified pipelines:"," a single Groovy Shared Library now covers back-end,\nweb, coach, and tracking applications (previously each had independent\nad-hoc Jenkinsfile implementations with duplicated logic);",[75,1413,1414,1417],{},[189,1415,1416],{},"Auditable secrets:"," Sops-encrypted catalog in version control provides full\nchange history for credentials, replacing opaque values embedded in a Helm\nrelease.",[91,1419,1421],{"id":1420},"write-up","Write-up",[17,1423,1424],{},"The full story is documented in a three-part series:",[72,1426,1427,1434,1441],{},[75,1428,1429,1433],{},[98,1430,1432],{"href":1431},"\u002Fposts\u002F2021\u002F01\u002Fjenkins-five-years-of-cicd-debt","Part 1: Freeletics CI\u002FCD: five years of debt (and why we kept Jenkins)","\n-- what we inherited, the benchmark data behind the decision to invest, and the\ndesign goals that shaped the rebuild.",[75,1435,1436,1440],{},[98,1437,1439],{"href":1438},"\u002Fposts\u002F2021\u002F01\u002Fjenkins-boring-security-by-design","Part 2: Boring security on Freeletics Jenkins, by design","\n-- authorization that doesn't require a spreadsheet, and secrets decoupled from\nthe configuration release 
cycle.",[75,1442,1443,1447],{},[98,1444,1446],{"href":1445},"\u002Fposts\u002F2021\u002F02\u002Fjenkins-rebuilding-it-phase-by-phase","Part 3: The Freeletics CI\u002FCD rebuild, phase by phase","\n-- the build system itself: Kaniko migration, Groovy Shared Library redesign,\nand the change that made Dependabot Monday mornings a non-event.",{"title":346,"searchDepth":347,"depth":347,"links":1449},[1450,1451,1452,1453,1454],{"id":1305,"depth":347,"text":1306},{"id":1312,"depth":347,"text":1313},{"id":1343,"depth":347,"text":1344},{"id":1386,"depth":347,"text":1387},{"id":1420,"depth":347,"text":1421},"2021-02-01T00:00:00","End-to-end redesign of a 5-year-old Jenkins CI\u002FCD platform: replaced Jenkins Job Builder with Pipelines as Code managed through Terraform, migrated all Docker builds to Kaniko on Kubernetes, decoupled secrets management from the Helm Chart, and unified back-end and web CI\u002FCD through a Groovy Shared Library.",{"duration":1458,"tools":1460},{"from":1459,"to":1455},"2020-08-01T00:00:00",[1461,364,1462,500,1463,1464,1465],"jenkins","kaniko","groovy","aws secrets manager","helm","\u002Fprojects\u002F2021\u002Ffreeletics-jenkins-redesign",{"title":1296,"description":1456},"freeletics-jenkins-cicd-redesign","projects\u002F2021\u002Ffreeletics-jenkins-redesign",[1471,365,366],"ci-cd","asnvLELu8UgsGnhCEGdHWKoLa26quHFLWuAgs_N6Mic",{"id":1474,"title":1475,"body":1476,"createdAt":1531,"description":1532,"extension":355,"meta":1533,"navigation":357,"path":1542,"seo":1543,"slug":1544,"stem":1545,"tags":1546,"website":7,"__hash__":1549},"projects\u002Fprojects\u002F2019\u002Freclameaqui-data-lake.md","ReclameAQUI Data Lake",{"type":9,"value":1477,"toc":1525},[1478,1480,1482,1485,1488,1490,1493,1496,1498,1502,1505,1508,1511,1514,1517,1519,1522],[12,1479,1302],{"id":1301},[91,1481,1306],{"id":1305},[17,1483,1484],{},"ReclameAQUI (Portuguese for \"complain here\") is an interesting and unique\nbusiness. 
They're a content aggregator for customers' experience sharing\n(especially bad experiences) about shopping (online and offline). However, it\ngoes further than a mere \"complaints website\", offering an interface for\ncompanies to answer complaints and help customers with their issues.",[17,1486,1487],{},"The service is the biggest of its kind worldwide, receiving 600K\nunique visitors each day who look up a company's reputation before closing a\ndeal\u002Fpurchase.",[91,1489,1313],{"id":1312},[17,1491,1492],{},"Even though they are already advanced in the digital approach to business,\nwith most services hosted in the cloud and an established analytical culture, their\ndata lake needed some upgrades. The most relevant motivator of this project was\nthe sky-high GCP bills, especially those related to BigQuery data consumption.",[17,1494,1495],{},"Apart from the cost-reduction tasks and data ingestion process optimization, we\ntook the opportunity to implement data encryption at rest, governance, and\nobfuscation during query executions against the data lake. By making data\naccessible to everyone in the company and controlling identity and access\nmanagement through LDAP (auditing each access, to be fully compliant with\nGDPR), we could offer a self-service data lake so different business actors\ncould satisfy their needs by \"drinking\" from the lake.",[12,1497,1340],{"id":1339},[91,1499,1501],{"id":1500},"tech-implementation","Tech implementation",[17,1503,1504],{},"Key objectives were cost-optimization of the existing Data Lake, improvement\n(and extension) of existing data ingestion pipelines, and security enhancements.",[17,1506,1507],{},"Starting from Data Lake's cost optimization, we redesigned the data ingestion,\nusing a \"landing\" area for raw data, making data transformations later to suit\nthe desired data models. 
The results are saved in other Data Lake layers to achieve\ngreater query performance.",[17,1509,1510],{},"We shifted away from the Streaming inserts in BigQuery by adding a step to load\ndata at the end of the ingestion pipeline. Apache NiFi was the main software\nresponsible for orchestrating and executing the pipeline, and it also covered the\nimprovements in data ingestion through process re-engineering.",[17,1512,1513],{},"Auditing in the Data Lake was managed through Apache Ranger. In order to have\nit fully supported we implemented a JDBC driver using a component from Apache\nCalcite called Avatica. Authentication for Apache Ranger went through a custom\nLDAP plugin (also developed during the project) that consumed user info from\nGoogle Cloud Identity, reflecting the organization's existing users and groups\nfrom Google Suite.",[17,1515,1516],{},"To make the game more interesting, we containerized the workflow and heavily\nused Kubernetes (GKE) to manage these components. Most of the Apache projects\ndidn't have Helm Charts at the time, so we developed our own and open-sourced\nsome of them.",[91,1518,1387],{"id":1386},[17,1520,1521],{},"During the project we measured an estimated gain of roughly 56% in Data Lake\ncost-optimization through reengineering of processes and resources, especially\nthe removal of streaming inserts to BigQuery.",[17,1523,1524],{},"We made relevant progress in security and governance during the project with the\nintroduction of Apache Ranger and Data Lake auditing for access and usage,\nproviding advanced security capabilities to ReclameAQUI and putting it ahead\nof GDPR and data privacy concerns.",{"title":346,"searchDepth":347,"depth":347,"links":1526},[1527,1528,1529,1530],{"id":1305,"depth":347,"text":1306},{"id":1312,"depth":347,"text":1313},{"id":1500,"depth":347,"text":1501},{"id":1386,"depth":347,"text":1387},"2019-10-02T00:00:00","Containerized Data Lake running on GCP, using Kubernetes (GKE) to orchestrate Apache 
ecosystem components, with GCS for data storage and BigQuery as the analytical interface.\nGovernance and security fully implemented using existing Google Suite groups and users through LDAP, giving stakeholders full autonomy to consume data from the Lake (with auditing).",{"duration":1534,"tools":1537},{"from":1535,"to":1536},"2019-05-01T00:00:00","2019-09-30T00:00:00",[1538,364,1539,1540,1541],"apache spark","python","google bigquery","apache nifi","\u002Fprojects\u002F2019\u002Freclameaqui-data-lake",{"title":1475,"description":1532},"reclameaqui-data-lake","projects\u002F2019\u002Freclameaqui-data-lake",[1547,1548],"cloud-native","data lake","QKyuci8jk1a_mXWiDZPW8IrSycgLC-_Ho1g0ydP1aL8",1778441743546]