[{"data":1,"prerenderedAt":3782},["ShallowReactive",2],{"all-posts":3},[4,368,808,1293,1624,1908,2072,2401,2939,3208,3568],{"id":5,"title":6,"author":7,"body":8,"createdAt":353,"description":354,"extension":355,"meta":356,"navigation":357,"path":358,"seo":359,"slug":360,"stem":361,"tags":362,"__hash__":367},"posts\u002Fposts\u002F2022\u002F04\u002Fkubernetes-blockchain-node-orchestration.md","Orchestrating blockchain nodes on Kubernetes at Blockdaemon: the design case",null,{"type":9,"value":10,"toc":345},"minimark",[11,16,20,23,26,29,33,36,39,42,45,49,52,55,58,61,65,68,71,87,90,95,105,113,116,127,130,134,141,144,147,151,160,163,176,180,183,186,202,208,213,217,220,223,226,241,244,250,254,263,266,269,273,276,282,300,303,307,310,313,316,319,323,340],[12,13,15],"h1",{"id":14},"stateless-assumptions-dont-survive-contact-with-chain-data","Stateless assumptions don't survive contact with chain data",[17,18,19],"p",{},"The mental model Kubernetes was built around is a stateless HTTP service: start\nit, stop it, replace it, it doesn't matter. The database holds the state; the\nprocess is interchangeable. Restart a pod and it picks up exactly where the\nuser left off.",[17,21,22],{},"Blockchain nodes are the extreme opposite. Running them on Kubernetes is in the\nsame category as running databases on Kubernetes: a discussion the industry has\nbeen having for years, with no settled consensus, because the answer genuinely\ndepends on scale, team, and tolerance for operational complexity. A Bitcoin full\nnode carries several hundred gigabytes of chain history. An Ethereum archive\nnode exceeds 4 TiB (and grows perpetually as blocks are added). The chain data\nis the node; the process is just the thing reading and advancing it. Restart the\nprocess and it picks up from disk (which is fine). Delete the disk and you're\nresyncing from genesis, which on a busy mainnet can take weeks.",[17,24,25],{},"That's the first mismatch. 
Kubernetes was designed around ephemeral, stateless\nworkloads, and blockchain nodes are some of the most stateful workloads in\nexistence. Adding an orchestration layer on top of that doesn't make the\nstatefulness go away (it just adds more abstraction between you and the disk).",[17,27,28],{},"What follows is a design proposal, developed at Blockdaemon in Q2\u002F2022, that\nexplores whether Kubernetes could serve as the orchestration layer for blockchain\nnodes at scale, and what it would take to make that work. The questions it\nraises are not all answered here.",[12,30,32],{"id":31},"one-node-one-signer-non-negotiable","One node, one signer, non-negotiable",[17,34,35],{},"Full nodes (those that sync and serve chain data for JSON-RPC calls) can run\nmultiple replicas without issue. Validators cannot.",[17,37,38],{},"A validator's job is to sign blocks on behalf of a staker. In proof-of-stake\nnetworks, signing the same block twice (from two concurrently running validator\ninstances) triggers slashing: an on-chain penalty that permanently destroys part\nof the staked funds. There is no remediation. The standard Kubernetes answer to\nhigh availability (run N replicas, let the scheduler handle restarts) is\nactively dangerous for validators.",[17,40,41],{},"This is the singleton problem: the validator must be exactly one running process\nat any given moment. The usual orchestration primitives for availability and\nzero-downtime upgrades require deliberate adaptation to not violate it. And\nunlike a web service, where getting this wrong causes a temporary error spike, a\nmisconfigured validator can cause a permanent, irreversible financial loss.",[17,43,44],{},"Q: Isn't the risk overstated? Surely Kubernetes won't spin up two replicas of a\nsingleton at the same time.\nA: It can and does, during rolling updates. The default rolling update strategy\nstarts the new pod before terminating the old one. 
Without explicit configuration\nto prevent this, two signing processes can run simultaneously. The question is\nnot whether Kubernetes is generally safe for stateful workloads. It is whether\nthe default behavior, applied without modification to a validator, is safe. It\nis not.",[12,46,48],{"id":47},"the-case-for-orchestration-and-the-case-against-it","The case for orchestration (and the case against it)",[17,50,51],{},"If you're running two or three validator nodes, dedicated VMs with manual\nrunbooks is probably the right answer. The operational overhead is bounded, the\ntooling is minimal, and the failure modes are well-understood. You don't need\nKubernetes to manage three nodes. You need a good runbook and someone who reads\nit.",[17,53,54],{},"At scale (dozens of protocols, hundreds of nodes across multiple environments)\nthe argument shifts. Every new protocol onboarded adds new runbooks, new on-call\nburden, and new failure modes to discover in production. Fleet upgrades across\nunorchestrated nodes require either rolling manual execution (slow, error-prone)\nor custom per-protocol automation (expensive to build and maintain independently\nfor each chain).",[17,56,57],{},"But here is where the counter-argument becomes real: Kubernetes adds layers.\nEach layer is something that can fail, something that someone needs to understand\ndeeply, and something that interacts with the layers below it in ways that are\nnot always obvious. A platform engineering team that knows Kubernetes well may\nnot know blockchain node operations. A team that knows blockchain operations may\nnot know Kubernetes internals. The overlap between those two skill sets is\ngenuinely narrow, and hiring for it is expensive.",[17,59,60],{},"The honest framing: orchestration potentially replaces manual per-node toil with\nplatform-level complexity. 
Whether that trade is favorable depends entirely on\nthe scale of the fleet, the composition of the team, and whether the platform\ncan be built once and maintained cheaply, or whether it becomes another system\nrequiring constant attention. There is no universal answer.",[12,62,64],{"id":63},"the-problem-nobody-talks-about-until-the-pvc-fills-up","The problem nobody talks about until the PVC fills up",[17,66,67],{},"Storage is where most Kubernetes-for-blockchain designs encounter their first\nserious wall. The standard PVC model assumes storage is relatively small,\nfungible, and network-attached. Chain data is none of those things.",[17,69,70],{},"The constraints compound:",[72,73,74,78,81,84],"ul",{},[75,76,77],"li",{},"Size grows without bound: chain data ranges from tens of gigabytes for newer\nprotocols to multiple terabytes for archive nodes, and adds blocks\nindefinitely;",[75,79,80],{},"Protocol-specific formats: some chains write data in formats that assume local\ndisk access patterns; the data isn't always portable to a different node\nwithout a full resync;",[75,82,83],{},"Sync time as recovery cost: if storage fails and can't be recovered,\nresyncing from genesis can take days to weeks. Storage reliability matters in\na way it doesn't for typical web workloads;",[75,85,86],{},"I\u002FO performance: block validation and state transitions are I\u002FO intensive;\nnetwork-attached storage latency, acceptable for most applications, can\nmeasurably impact node performance on high-throughput chains.",[17,88,89],{},"Three storage approaches worth considering for this use case:",[91,92,94],"h2",{"id":93},"openebs-with-mayastor","OpenEBS with Mayastor",[17,96,97,104],{},[98,99,103],"a",{"href":100,"rel":101},"https:\u002F\u002Fopenebs.io\u002F",[102],"nofollow","OpenEBS"," implements Container Attached Storage (CAS): an\nabstraction layer between Kubernetes' Container Storage Interface and the\nunderlying driver (EBS, NFS, local disk, or otherwise). 
The premise is that\nstorage should be orchestrated by Kubernetes the same way compute is: as\ncontainers, with scheduling, affinity rules, and auto-scaling.",[17,106,107,112],{},[98,108,111],{"href":109,"rel":110},"https:\u002F\u002Fgithub.com\u002Fopenebs\u002Fmayastor",[102],"Mayastor"," is OpenEBS's distributed block\nstorage engine, responsible for orchestrating disk placement across nodes. The\ndesign mirrors the Kubernetes control plane. Mayastor runs as containers,\nmanages data node distribution to minimize latency, and would scale storage\ncapacity independently of the workload pods that mount it.",[17,114,115],{},"Potential operational benefits for blockchain workloads:",[72,117,118,121,124],{},[75,119,120],{},"No cloud vendor lock-in: the same storage class could work across AWS, GCP,\nbare metal, or any mix;",[75,122,123],{},"Storage auto-scaling decoupled from node auto-scaling: disk capacity could\ngrow without resizing or restarting the workload;",[75,125,126],{},"Consistent interface regardless of the underlying driver.",[17,128,129],{},"Known limitation (as of Q2\u002F2022): no native snapshot support. Velero provides a\npartial workaround but not a complete solution. Snapshots matter for blockchain\nnodes because they enable bootstrapping new nodes from a recent chain state\nrather than syncing from genesis.",[91,131,133],{"id":132},"cstor","cStor",[17,135,136,140],{},[98,137,133],{"href":138,"rel":139},"https:\u002F\u002Fopenebs.io\u002Fdocs\u002Fconcepts\u002Fcstor",[102]," is OpenEBS's ZFS-backed storage\nengine. Where Mayastor focuses on distributed block storage, cStor focuses on\nconsistency and data management capabilities.",[17,142,143],{},"cStor implements copy-on-write semantics, RAID replication across nodes,\nPersistentVolume snapshots and cloning, and ZFS deduplication. The deduplication\nis specifically relevant for blockchain workloads: multiple nodes in the same\ncluster holding structurally similar chain data (e.g. 
multiple full nodes of the\nsame protocol) could reduce actual disk usage significantly.",[17,145,146],{},"The trade-off is operational: cStor requires ZFS installed on every Kubernetes\nnode. Adding or replacing cluster nodes requires ZFS provisioning as part of the\nprocess, adding friction compared to Mayastor's more self-contained model.",[91,148,150],{"id":149},"zfs-nodes-with-nfs-provisioner","ZFS nodes with NFS provisioner",[17,152,153,154,159],{},"The longer road, but the most proven. A set of ZFS-backed nodes provisioned\nacross availability zones, sharing volumes to the Kubernetes cluster via the\n",[98,155,158],{"href":156,"rel":157},"https:\u002F\u002Fgithub.com\u002Fkubernetes-sigs\u002Fnfs-subdir-external-provisioner",[102],"NFS subdir provisioner",".\nMore moving parts, more manual configuration (and fewer unknowns in production).\nWhen the more automated options encounter edge cases, this approach falls back to\nprimitives that have been running in production for decades.",[17,161,162],{},"Q: Why not just use cloud-managed block storage (EBS, Persistent Disk, etc.)?\nA: Cloud-managed block storage works and is the path of least resistance at\nsmall scale. Latency characteristics are acceptable for most workloads, though\nthey may not be for high-throughput validation. More practically: at scale\nacross multiple cloud providers, storage vendor lock-in becomes a real\noperational constraint. A storage abstraction that works across providers is\nworth the added complexity if multi-cloud operation is the goal (but only if\nthe team can actually maintain that abstraction).",[164,165,167],"callout",{"type":166},"note",[17,168,169,170,175],{},"The storage landscape for Kubernetes has continued to evolve since this\ndesign was written. 
",[98,171,174],{"href":172,"rel":173},"https:\u002F\u002Flonghorn.io\u002F",[102],"Longhorn"," has since matured as a\nfurther alternative worth evaluating (particularly for its native snapshot\nsupport and simpler operational model compared to Mayastor and cStor).",[12,177,179],{"id":178},"the-one-rule-kubernetes-wasnt-designed-for","The one rule Kubernetes wasn't designed for",[17,181,182],{},"Kubernetes rolling updates work by spinning up new pods before terminating old\nones. For stateless services, this is ideal: zero downtime, no dropped requests.\nFor validators, starting the new pod before the old one terminates means two\nsigning processes running simultaneously (double-signing territory).",[17,184,185],{},"Two mitigations could compose to close this window.",[17,187,188,192,193,197,198,201],{},[189,190,191],"strong",{},"Pod Disruption Budgets"," constrain how many pods Kubernetes may evict\nvoluntarily (during a node drain, for example). A budget with ",[194,195,196],"code",{},"maxUnavailable: 0"," blocks such evictions of the\nvalidator pod entirely. The complementary rollout setting ",[194,199,200],{},"maxSurge: 0"," (a field of the\nDeployment update strategy, not of the budget) prevents the new\npod from starting before the old one is terminated. It has to be paired with a\nnon-zero rollout maxUnavailable, because Kubernetes rejects an update strategy in\nwhich both values are zero. Together these could\nenforce a stop-then-start update sequence for the validator (no overlap, no\ndouble-signing window during planned upgrades).",[17,203,204,207],{},[189,205,206],{},"Blocks-behind self-termination"," addresses the unplanned case. A liveness\nprobe monitors the validator's block height against the current network tip. If\nthe lag exceeds a protocol-specific threshold, the probe fails and Kubernetes\nkills and restarts the container. The logic: a validator significantly behind the network tip\nis not signing correctly anyway; self-termination and restart is safer than\ncontinuing. 
This is sometimes called the harakiri pattern: the process detects\nit is in an invalid state and exits, rather than risking incorrect behavior that\ncompounds the problem.",[164,209,210],{"type":166},[17,211,212],{},"The threshold would need to be protocol-specific because block times\nvary widely across chains. A \"blocks behind\" value that signals a problem on a\n1-second block time chain is normal variance on a 12-second block time chain.\nGetting this wrong in either direction causes problems: too sensitive and the\nvalidator restarts unnecessarily; too loose and it allows a degraded validator\nto keep running.",[12,214,216],{"id":215},"keeping-keys-out-of-the-cluster","Keeping keys out of the cluster",[17,218,219],{},"Validator keys are not like application credentials. A leaked database password\ngets rotated. A leaked validator private key can be used to sign blocks on\nbehalf of the validator before the key is revoked (potentially triggering\nslashing in the window between leak and rotation). Slashing is irreversible.",[17,221,222],{},"Kubernetes Secrets are better than baked-in environment variables, but they are\nnot the right primitive for keys at this sensitivity level. They are\nbase64-encoded (not encrypted by default), accessible to any workload in the\nsame namespace with pod-level permissions, and stored in etcd with whatever\nencryption posture the cluster has configured.",[17,224,225],{},"One approach: separate the key store from the cluster entirely. An external\nsecrets manager holds the key material; the validator pod mounts it as a volume\nat startup via an operator that handles the fetch. The key is never at rest in\nthe cluster. If the pod is deleted, the key disappears with it. 
Key rotation\ndoes not require a redeployment: update the secret in the store, let the pod\nrestart on its next cycle, and it picks up the new material automatically.",[17,227,228,229,234,235,240],{},"Established tools for this pattern include ",[98,230,233],{"href":231,"rel":232},"https:\u002F\u002Fwww.vaultproject.io\u002F",[102],"HashiCorp Vault","\n(via the ",[98,236,239],{"href":237,"rel":238},"https:\u002F\u002Fexternal-secrets.io\u002F",[102],"External Secrets Operator"," or the Vault\nAgent Injector), and cloud-native equivalents like AWS Secrets Manager or GCP\nSecret Manager (both of which integrate with External Secrets Operator using the\nsame operator interface).",[17,242,243],{},"The active\u002Fstandby validator design adds a further constraint to this model:\nstandby nodes would hold no validator key at all. A standby connects to the peer\nnetwork and syncs blocks, but cannot sign. Promotion to active would involve\nwriting the key to the secrets store, which the newly-promoted pod then mounts\nat startup. At any given moment, the key exists in exactly one place.",[164,245,247],{"type":246},"warning",[17,248,249],{},"This model is conceptually clean but has operational implications that\nshouldn't be glossed over. The secrets store itself becomes a critical dependency:\nif it is unavailable at pod startup, the validator cannot start. That means the\nsecrets store's availability SLA effectively becomes the validator's availability\nSLA. 
Designing for this dependency (caching, fallback, secrets store HA) is\nnon-trivial and adds more surface area to maintain.",[12,251,253],{"id":252},"protocol-upgrades-as-a-deployment-problem","Protocol upgrades as a deployment problem",[17,255,256,257,262],{},"With storage, process exclusivity, and key management handled (or at least\ndesigned for), protocol upgrades could become a standard deployment problem.\n",[98,258,261],{"href":259,"rel":260},"https:\u002F\u002Fargo-cd.readthedocs.io\u002F",[102],"ArgoCD"," could watch for new container image\ntags pushed by each protocol's release pipeline and trigger rolling updates\nautomatically. Combined with rolling update configuration that limits the batch\nsize (e.g. 10% of the fleet at a time) and health probes that gate each batch,\nthe typical protocol upgrade (previously a manual, per-node operation) could\nbecome a version tag update in a repository.",[17,264,265],{},"The health probe is what would make this safe: a probe checking blocks-behind\nagainst the network reference provides an automated go\u002Fno-go signal. A bad\nprotocol upgrade (one where the new version has a regression) could halt before\nreaching the full fleet, rather than being discovered through monitoring after\nthe fact.",[17,267,268],{},"Whether this fully materializes depends on how well the health probes can\nactually characterize node health for each protocol. For some chains, block lag\nis a complete signal. For others, a node can be synced but serving incorrect\ndata due to state corruption (a condition that block height alone won't catch).\nProtocol-specific health checks are more accurate but more expensive to build\nand maintain across a large number of chains.",[12,270,272],{"id":271},"shared-or-exclusive","Shared or exclusive?",[17,274,275],{},"In a multi-tenant context, node isolation becomes a design question. 
Two\napproaches with different trade-off profiles:",[17,277,278,281],{},[189,279,280],{},"Kubernetes affinity and anti-affinity rules"," allow workloads to declare\nrequirements about co-location. This is lightweight and built into the\nscheduler, adding no runtime overhead. The isolation it provides is\ncontainer-level: processes on the same host share the kernel.",[17,283,284,287,288,293,294,299],{},[189,285,286],{},"Firecracker microVMs"," provide hardware-level boundaries between workloads by\nrunning each in its own lightweight virtual machine.\n",[98,289,292],{"href":290,"rel":291},"https:\u002F\u002Ffirecracker-microvm.github.io\u002F",[102],"Firecracker"," (via the Containerd\nplugin) and ",[98,295,298],{"href":296,"rel":297},"https:\u002F\u002Fkatacontainers.io\u002F",[102],"Kata Containers"," both offer\nOCI-compatible runtimes that replace the standard container execution model with\nVM-backed isolation. The trade-off is real: microVM startup time, additional\nmemory overhead per workload, and a more complex container runtime to configure\nand operate.",[17,301,302],{},"The right choice depends on the threat model. Kernel-level isolation is\nsufficient for most operational concerns. Hardware-level isolation is warranted\nwhen the threat model includes a compromised container breaking out to the host\n(a higher bar that most deployments don't require but some do).",[12,304,306],{"id":305},"is-it-worth-it","Is it worth it?",[17,308,309],{},"This is the question the design intentionally leaves open.",[17,311,312],{},"The argument for: at sufficient scale, manual per-node operations don't compose.\nEvery new protocol is an operational multiplier, and Kubernetes provides a shared\nfoundation that could absorb that multiplier once. 
Rolling upgrades with health\nprobes, automated key injection, standardized storage interfaces across\nprotocols: these are real benefits that compound as the fleet grows.",[17,314,315],{},"The argument against: each of those benefits requires solving a hard problem\nfirst, and each solution adds a layer. Storage abstraction, singleton enforcement,\nexternal secrets integration, protocol-specific health probes: these are not\nsimple configurations. They require people who understand both the Kubernetes\nprimitives and the blockchain-specific constraints, and that combination is rare\nand expensive. There is a real risk that the orchestration layer becomes its own\noperational burden, requiring specialized platform engineering attention that\noutweighs the toil it was supposed to eliminate.",[17,317,318],{},"The break-even point (where the operational benefit of orchestration exceeds\nthe cost of building and maintaining the platform) depends on fleet size, team\nstructure, and how well the health check and upgrade automation can be built\ngenerically across protocols. At a handful of nodes, the overhead clearly isn't\nworth it. At hundreds of nodes, it probably is. 
The interesting question is\nwhere in between that line sits, and whether the complexity stays bounded as the\nplatform scales.",[91,320,322],{"id":321},"open-questions","Open questions",[72,324,325,328,331,334,337],{},[75,326,327],{},"At what fleet size does orchestration overhead become net positive?",[75,329,330],{},"How generalizable are health probes across protocols, and what's the\nmaintenance cost of per-protocol implementations?",[75,332,333],{},"Can storage auto-scaling keep up with chain data growth rates without\noperator intervention?",[75,335,336],{},"What does the secrets store availability dependency cost in practice, and\nhow is it designed for?",[75,338,339],{},"Does the team structure that can build and maintain this platform exist, or\ndoes it need to be built first?",[164,341,342],{"type":166},[17,343,344],{},"The validator high availability design (the standby\u002Factive model that\nkeeps a warm standby synced without double-signing risk) is a related problem\nthat this proposal does not fully address.",{"title":346,"searchDepth":347,"depth":347,"links":348},"",2,[349,350,351,352],{"id":93,"depth":347,"text":94},{"id":132,"depth":347,"text":133},{"id":149,"depth":347,"text":150},{"id":321,"depth":347,"text":322},"2022-04-09T00:00:00+01:00","Kubernetes was built for stateless workloads. Blockchain nodes are the extreme opposite. 
A design proposal exploring what it would actually take to run validators and full nodes on Kubernetes at scale: storage, process exclusivity, key injection, and the upgrade automation that might make it worth the investment.","md",{},true,"\u002Fposts\u002F2022\u002F04\u002Fkubernetes-blockchain-node-orchestration",{"title":6,"description":354},"kubernetes-blockchain-node-orchestration","posts\u002F2022\u002F04\u002Fkubernetes-blockchain-node-orchestration",[363,364,365,366],"blockchain","kubernetes","platform-engineering","infrastructure-as-code","rXBdWidYPbFs_YOulcEyxpqMsG7K1LWRXG7M6rKInUs",{"id":369,"title":370,"author":7,"body":371,"createdAt":798,"description":799,"extension":355,"meta":800,"navigation":357,"path":801,"seo":802,"slug":803,"stem":804,"tags":805,"__hash__":807},"posts\u002Fposts\u002F2021\u002F08\u002Freal-life-terraform-refactoring-guide.md","Real-life Terraform Refactoring Guide",{"type":9,"value":372,"toc":788},[373,377,392,395,404,410,413,417,437,446,458,461,479,483,486,495,502,505,516,523,532,538,546,549,561,566,573,581,585,588,594,598,601,607,611,621,627,631,638,648,654,666,669,686,692,696,708,725],[91,374,376],{"id":375},"intro","Intro",[17,378,379,380,385,386,391],{},"As reality hits, the unavoidable fact of dealing with a hard-to-manage Terraform\n",[98,381,384],{"href":382,"rel":383},"https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FBig%5Fball%5Fof%5Fmud",[102],"Big ball of mud"," code base comes in. There is no way around natural growth and\nevolution of code bases and the design flaws that come with it. 
Our Agile\nmindset is to ",[98,387,390],{"href":388,"rel":389},"https:\u002F\u002Fwww.brainyquote.com\u002Fquotes\u002Fmark%5Fzuckerberg%5F453439",[102],"\"move fast and break things\"",", implement something as simple as\npossible and leave the design decisions for the next iterations (if any).",[17,393,394],{},"Refactoring Terraform code is actually as natural as developing it. Time and\ntime again you will be faced with a situation where a better structure or\norganization can be achieved: maybe you want to upgrade from a home-made module\nto an open-source\u002Fcommunity alternative, or maybe you just want to segregate your\nresources into different states to speed up development. Regardless of the goal,\nonce you get into it, you will realize that Terraform code refactoring is\nactually a basic missing step in the development process that no one told you\nabout.",[17,396,397,398,403],{},"As the ",[98,399,402],{"href":400,"rel":401},"http:\u002F\u002Fnathanmarz.com\u002Fblog\u002Fsuffering-oriented-programming.html",[102],"Suffering-Oriented Programming"," mantra dictates:",[405,406,407],"blockquote",{},[17,408,409],{},"\"First make it possible. Then make it beautiful. Then make it fast.\"",[17,411,412],{},"So, time to make the Terraform code beautiful!",[91,414,416],{"id":415},"how-to-break-a-big-ball-of-mud-strangle-it","How to break a big ball of mud? STRANGLE IT",[17,418,419,422,423,426,427,432,433,436],{},[194,420,421],{},"\u003Cjoke>"," Martin Fowler has already written everything there is to write about\n(early 2000s) DevOps, Agile, and Software Development. 
Therefore, we could\nreference Martin Fowler for virtually anything software-related ",[194,424,425],{},"\u003C\u002Fjoke>",", but\nreally, the ",[98,428,431],{"href":429,"rel":430},"https:\u002F\u002Fmartinfowler.com\u002Fbooks\u002Frefactoring.html",[102],"Refactoring book"," is ",[189,434,435],{},"THE"," reference on this subject.",[17,438,439,440,445],{},"Martin Fowler shared the ",[98,441,444],{"href":442,"rel":443},"https:\u002F\u002Fmartinfowler.com\u002Fbliki\u002FStranglerFigApplication.html",[102],"Strangler (Fig) Pattern",", which describes a strategy to\nrefactor a legacy code base by re-implementing the same features (sometimes even\nthe bugs) in another application.",[405,447,448,455],{},[17,449,450,454],{},[451,452,453],"span",{},"..."," the huge strangler figs. They seed in the upper branches of a tree and\ngradually work their way down the tree until they root in the soil. Over many\nyears they grow into fantastic and beautiful shapes, meanwhile strangling and\nkilling the tree that was their host.",[17,456,457],{},"This metaphor struck me as a way of describing a way of doing a rewrite of an\nimportant system.",[17,459,460],{},"In this document we are going to follow the same idea:",[462,463,464,473,476],"ol",{},[75,465,466,467,472],{},"implement the same feature in a different ",[98,468,471],{"href":469,"rel":470},"https:\u002F\u002Fwww.terraform-best-practices.com\u002Fkey-concepts#composition",[102],"Terraform composition",";",[75,474,475],{},"migrate the Terraform state;",[75,477,478],{},"delete (kill) the previous implementation.",[91,480,482],{"id":481},"the-mono-repository-monorepo-approach-to-legacy","The mono-repository (monorepo) approach to Legacy",[17,484,485],{},"Let's suppose that your Terraform code base is versioned in a single repository\n(a.k.a. 
monorepo), following the random structure displayed below (just to help\nillustrate):",[487,488,493],"pre",{"className":489,"code":491,"language":492},[490],"language-text",".\n├── modules\u002F    # Definition of TF modules used by underlying compositions\n├── global\u002F     # Resources that aren't restricted to one environment\n│   └── aws\u002F\n├── production\u002F # Production environment resources\n│   └── aws\u002F\n└── staging\u002F    # Staging environment resources\n    └── aws\u002F\n","text",[194,494,491],{"__ignoreMap":346},[17,496,497,498,501],{},"In this example each directory corresponds to a Terraform state. In order to\napply changes you have to change into a path and execute ",[194,499,500],{},"terraform",".",[17,503,504],{},"The structure of this example repository was created a few hypothetical years\nago when the number of existing microservices and resources (DB, message queues,\netc.) was significantly smaller. At the time, it was feasible to keep Terraform\ndefinitions together because it was easier to maintain: Cloud resources were\nmanaged in one shot!",[17,506,507,508,511,512,515],{},"As time went by, the number of Products and the team grew, and engineers\nstarted facing concurrency issues: Terraform locks executions on shared storage\nwhen someone else is running ",[194,509,510],{},"terraform apply"," as well as general slowness on\n",[189,513,514],{},"every execution"," since the number of data sources to sync is frightening.",[17,517,518,519,501],{},"A mono-repository approach is not necessarily bad; versioning is actually\nsimpler when performed in one single repository. 
Ideally, there won't be many\nchanges on the scale of GiB meaning that it is safe to proceed on this one ",[520,521,522],"em",{},"as\nlong as the Terraform remote states are divided",[524,525,527,528,531],"h3",{"id":526},"splitting-the-modules-sub-path-to-its-own-repository","Splitting the ",[194,529,530],{},"modules"," sub-path to its own repository",[17,533,534,535,537],{},"One thing to mention though is the ",[194,536,530],{}," sub-path, this one could be stored\nin a different git repository to leverage its own versioning. Since Terraform\nmodules and its implementations don't always evolve in the same pace, keeping\ntwo distinct version trees is beneficial. Additionally, a separated repository\nfor Terraform modules allows the specification of \"pinned versions\", e.g.:",[487,539,544],{"className":540,"code":542,"language":543,"meta":346},[541],"language-hcl","module \"aws_main_vpc\" {\n  source = \"git::https:\u002F\u002Fgithub.com\u002Fterraform-aws-modules\u002Fterraform-aws-vpc.git?ref=2ca733d\"\n  # Note the ref=${GIT_REVISION_DIGEST}\n}\n","hcl",[194,545,542],{"__ignoreMap":346},[17,547,548],{},"That reference for a module's version should always be specified, regardless if\nit comes from an internal\u002Fprivate repository or public. 
When you specify the\nversion, you are ensuring reproducibility.",[17,550,551,552,554,555,560],{},"Therefore, let's move the ",[194,553,530],{}," sub-path to another git repository,\nfollowing instructions from ",[98,556,559],{"href":557,"rel":558},"https:\u002F\u002Fstackoverflow.com\u002Fquestions\u002F359424\u002Fdetach-move-subdirectory-into-separate-git-repository\u002F17864475#17864475",[102],"this StackOverflow answer"," so that the git commit\nhistory is preserved:",[562,563,565],"h4",{"id":564},"_0","0.",[17,567,568,569,572],{},"Walk to the monorepo path and create a branch from the commits at\n",[194,570,571],{},"monorepo\u002Fmodules"," path",[487,574,579],{"className":575,"code":577,"language":578,"meta":346},[576],"language-bash","MAIN_BIGGER_REPO=\u002Fpath\u002Fto\u002Fthe\u002Fmonorepo\ncd \"${MAIN_BIGGER_REPO}\"\ngit subtree split -P modules -b refact-modules\n","bash",[194,580,577],{"__ignoreMap":346},[562,582,584],{"id":583},"_1","1.",[17,586,587],{},"Create the new repository",[487,589,592],{"className":590,"code":591,"language":578,"meta":346},[576],"mkdir \u002Fpath\u002Fto\u002Fthe\u002Fterraform-modules && cd $_\ngit init\ngit pull \"${MAIN_BIGGER_REPO}\" refact-modules\n",[194,593,591],{"__ignoreMap":346},[562,595,597],{"id":596},"_2","2.",[17,599,600],{},"Link the new repository to your remote Git (server)",[487,602,605],{"className":603,"code":604,"language":578,"meta":346},[576],"git remote add origin \u003Cgit@git.com:user\u002Fterraform-modules.git>\ngit push -u origin master\n",[194,606,604],{"__ignoreMap":346},[562,608,610],{"id":609},"_3","3.",[17,612,613,616,617,620],{},[451,614,615],{},"OPTIONAL"," Cleanup inside ",[194,618,619],{},"$MAIN_BIGGER_REPO",", if desired",[487,622,625],{"className":623,"code":624,"language":578,"meta":346},[576],"cd ${MAIN_BIGGER_REPO}\ngit rm -rf modules\ngit filter-branch --prune-empty \\\n    --tree-filter \"rm -rf modules\" -f 
HEAD\n",[194,626,624],{"__ignoreMap":346},[524,628,630],{"id":629},"lets-start-strangling-the-repository","Let's start strangling the repository",[17,632,633,634,637],{},"Now that a substantial piece of code was moved somewhere else, it is time to\nput the ",[98,635,444],{"href":442,"rel":636},[102]," into practice.",[17,639,640,641,644,645,647],{},"Move all the existing content as-is to the ",[194,642,643],{},"legacy"," sub-path, keeping the same\nrepository and change history (commits). This still allows applying the ",[194,646,643],{},"\ncode as before from any of those paths.",[487,649,652],{"className":650,"code":651,"language":492},[490],".\n└── legacy\n    ├── global\n    │   └── aws\n    ├── production\n    │   └── aws\n    └── staging\n        └── aws\n",[194,653,651],{"__ignoreMap":346},[17,655,656,657,662,663,665],{},"Once the content is moved to legacy, the idea is to follow the ",[98,658,661],{"href":659,"rel":660},"https:\u002F\u002Fwww.oreilly.com\u002Flibrary\u002Fview\u002F97-things-every\u002F9780596809515\u002Fch08.html",[102],"Boy Scout rule","\nin order to strangle the ",[194,664,643],{}," content little by little (unless you are\nreally committed to migrating it all at once, which is going to be exhausting).",[17,667,668],{},"The Boy Scout rule goes like this:",[462,670,671,678,681],{},[75,672,673,674,472],{},"every time a task that involves deprecated code appears, we implement it in\n",[98,675,677],{"href":676},"..\u002Fstdout\u002Fblog\u002F2021\u002F03\u002Fterraform-best-practices.org","the new structure",[75,679,680],{},"import the Terraform state to keep the Cloud resources that the given code\nrepresents\u002Fdescribes;",[75,682,683,684,501],{},"remove the state and the code from ",[194,685,643],{},[17,687,688,689,691],{},"Until there is nothing left inside ",[194,690,643],{}," (or there are only unused\nresources\u002Fleft-behinds that could be destroyed\u002Fgarbage collected either 
way).",[562,693,695],{"id":694},"import-state-remove-state-and-code-from-what-where","Import state? Remove state and code from what? Where?",[17,697,698,699,702,703,501],{},"That will depend on the kind of resource we are migrating from the remote state.\nAt the bottom of each ",[194,700,701],{},"resource"," page in Terraform's provider documentation you can\nfind a reference command to import existing resources into your Terraform code\nspecification, e.g.: ",[98,704,707],{"href":705,"rel":706},"https:\u002F\u002Fregistry.terraform.io\u002Fproviders\u002Fhashicorp\u002Faws\u002Flatest\u002Fdocs\u002Fresources\u002Fdb%5Finstance#import",[102],"AWS RDS DB instance",[17,709,710,711,714,715,720,721,724],{},"Suppose we want to replace the code of the AWS RDS Aurora cluster defined in\n",[194,712,713],{},"production\u002Faws"," and re-implement it using ",[98,716,719],{"href":717,"rel":718},"https:\u002F\u002Fgithub.com\u002Fterraform-aws-modules\u002Fterraform-aws-rds-aurora",[102],"the community module",".\nAfter creating the corresponding sub-path in the monorepo according to your\npreference, provisioning the bucket and initializing the Terraform ",[194,722,723],{},"backend",":",[462,726,727,741,764],{},[75,728,729,730,734,735],{},"implement the definition of the community module\n",[98,731,733],{"href":717,"rel":732},[102],"github.com\u002Fterraform-aws-modules\u002Fterraform-aws-rds-aurora"," with the closest\nparameters from the existing one; e.g.:",[487,736,739],{"className":737,"code":738,"language":543,"meta":346},[541],"module \"aws_aurora_main_cluster\" {\n  source  = \"terraform-aws-modules\u002Frds-aurora\u002Faws\"\n  version = \"~> 5.2\"\n\n  # ...\n}\n",[194,740,738],{"__ignoreMap":346},[75,742,743,744,750,753,754,757,758],{},"import the Terraform state of the previous (existing) cluster",[487,745,748],{"className":746,"code":747,"language":578,"meta":346},[576],"# Resources declared inside a module are addressed with the module. prefix\nterraform import 'module.aws_aurora_main_cluster.aws_rds_cluster.this[0]' 
main-database-name\nterraform import 'module.aws_aurora_main_cluster.aws_rds_cluster_instance.this[0]' main-database-instance-name-01\nterraform import 'module.aws_aurora_main_cluster.aws_rds_cluster_instance.this[1]' main-database-instance-name-02\n\n# ...\n",[194,749,747],{"__ignoreMap":346},[751,752],"br",{},"then, if you haven't yet and would like to \"match reality\" between the\nexisting and the specified resource, run ",[194,755,756],{},"terraform plan"," a few times and\nadjust the parameters until Terraform reports:",[487,759,762],{"className":760,"code":761,"language":492,"meta":346},[490],"No changes. Your infrastructure matches the configuration.\n",[194,763,761],{"__ignoreMap":346},[75,765,766,767,769,770,776,778,779,781,782],{},"last but not least, remove the corresponding resources from the ",[194,768,643],{},"\nTerraform state so that Terraform no longer tracks their changes and also\ndoesn't try to destroy them once the resource definition is no longer in that code\nbase:",[487,771,774],{"className":772,"code":773,"language":578,"meta":346},[576],"# Hypothetical name of the resource inside production\u002Faws\u002Fmain.tf\nterraform state rm aws_rds_cluster.default \\\n    'aws_rds_cluster_instance.default[0]' 'aws_rds_cluster_instance.default[1]'\n\n# ...\n",[194,775,773],{"__ignoreMap":346},[751,777],{},"once that is done, feel free to remove the corresponding resource's\ndefinition from the ",[194,780,643],{}," code.",[487,783,786],{"className":784,"code":785,"language":543,"meta":346},[541],"resource \"aws_rds_cluster\" \"default\" {\n  # ...\n}\n\nresource \"aws_rds_cluster_instance\" \"default\" {\n  count = var.number_of_database_instances\n\n  # ...\n}\n",[194,787,785],{"__ignoreMap":346},{"title":346,"searchDepth":347,"depth":347,"links":789},[790,791,792],{"id":375,"depth":347,"text":376},{"id":415,"depth":347,"text":416},{"id":481,"depth":347,"text":482,"children":793},[794,797],{"id":526,"depth":795,"text":796},3,"Splitting the modules sub-path to 
its own repository",{"id":629,"depth":795,"text":630},"2021-08-11T00:00:00+02:00","Want to know how to better organize existing Terraform code? If you grasp these ideas, it could even serve for not-yet Infrastructure as Code resources. Jump in and take a look.",{},"\u002Fposts\u002F2021\u002F08\u002Freal-life-terraform-refactoring-guide",{"title":370,"description":799},"real-life-terraform-refactoring-guide","posts\u002F2021\u002F08\u002Freal-life-terraform-refactoring-guide",[366,806,500],"cloud","jPrVDAjqTerXgzb6UVNqa_9QMs0RvtE05oSR-ywTlgQ",{"id":809,"title":810,"author":7,"body":811,"createdAt":1284,"description":1285,"extension":355,"meta":1286,"navigation":357,"path":1287,"seo":1288,"slug":1289,"stem":1290,"tags":1291,"__hash__":1292},"posts\u002Fposts\u002F2021\u002F06\u002Fterraform-atomic-design.md","Terraform: Atomic Design",{"type":9,"value":812,"toc":1273},[813,815,824,832,835,838,843,855,858,867,870,873,884,887,890,894,897,914,917,921,931,940,943,949,956,989,991,997,1000,1004,1007,1010,1014,1017,1020,1026,1036,1048,1052,1058,1067,1070,1081,1088,1092,1106,1113,1119,1133,1136,1147,1158,1162,1171,1179,1190,1198,1204,1211,1214,1220,1224,1227,1235,1238,1246,1250],[91,814,376],{"id":375},[17,816,817,818,823],{},"Following ",[98,819,822],{"href":820,"rel":821},"https:\u002F\u002Fpragprog.com\u002Ftitles\u002Ftpp20\u002Fthe-pragmatic-programmer-20th-anniversary-edition\u002F",[102],"The Pragmatic Programmer"," mantra, I do my best to ...",[405,825,826],{},[17,827,828,831],{},[189,829,830],{},"Learn at least one new language every year."," Different languages solve the same\nproblems in different ways. 
By learning several different approaches, you can\nhelp broaden your thinking and avoid getting stuck in a rut.",[17,833,834],{},"Not necessarily to show it off or to be capable of talking about random\ntechnologies, but to expand and train my problem-solving skills, to get new\nperspectives when approaching a challenge.",[17,836,837],{},"We might not notice it, but when we learn (or have learned) to code we aren't\njust learning to type some characters that a compiler\u002Finterpreter can\nunderstand; it is a new way of thinking, a new way of breaking down solutions\n(into sequential steps).",[405,839,840],{},[17,841,842],{},"It doesn't matter whether you ever use any of these technologies on a project,\nor even whether you put them on your resume. The process of learning will expand\nyour thinking, opening you to new possibilities and new ways of doing things.\nThe cross-pollination of ideas is important;",[17,844,845,846,849,850,501],{},"As someone who works intensively with infrastructure components (servers,\ndatabases, Kubernetes, CI\u002FCD, etc.), I aimed for something completely different\nthis year. Something that stands on ",[520,847,848],{},"a whole different spectrum"," of the system:\nthis year I decided to learn ",[98,851,854],{"href":852,"rel":853},"https:\u002F\u002Fflutter.dev\u002F",[102],"Flutter",[17,856,857],{},"In a nutshell, Flutter is a better React Native. 
A framework that enables\nimplementation of GUI applications for multiple platforms with a single code\nbase.",[17,859,860,861,866],{},"That reminded me of a discussion I had with a friend in the past about React\ncomponents and the ",[98,862,865],{"href":863,"rel":864},"https:\u002F\u002Fbradfrost.com\u002Fblog\u002Fpost\u002Fatomic-web-design\u002F",[102],"Atomic Design"," methodology, which helps to structure web\ncomponents into modules.",[17,868,869],{},"In the Atomic Design methodology, the granularity of modules is distinguished by\nusing chemistry-inspired names: atoms, molecules and organisms.",[17,871,872],{},"Then the connection of the ideas from",[72,874,875,878,881],{},[75,876,877],{},"Pragmatic Programmer's cross-pollination to",[75,879,880],{},"Atomic Design (on Flutter components) to",[75,882,883],{},"Terraform modules",[17,885,886],{},"came almost like a thunderbolt, striking me with this insight when I was working\non refactoring a huge legacy Terraform code base with lots of code duplication\n(read: copy+paste, \"we'll fix it later\", then the author quits the company and\nnever fixes anything).",[17,888,889],{},"Although Atomic Design was initially proposed as a Web UI methodology, Infrastructure as Code\ntools such as Terraform that make heavy use of modules can benefit from\nit to improve their code reusability and massively reduce duplication.",[91,891,893],{"id":892},"details","Details",[17,895,896],{},"The Atomic Design methodology proposes five distinct levels, listed from the\nfinest to the coarsest granularity:",[462,898,899,902,905,908,911],{},[75,900,901],{},"Atoms;",[75,903,904],{},"Molecules;",[75,906,907],{},"Organisms;",[75,909,910],{},"Templates;",[75,912,913],{},"Pages.",[17,915,916],{},"However, to extract the gist, we'll only be focusing on Atoms, Molecules, and\nOrganisms (from 1. to 3.). 
Templates and Pages are too specialized for Web UI\ndevelopment.",[524,918,920],{"id":919},"atoms","Atoms",[17,922,923,924,926,927,930],{},"Atoms represent the finest grain in terms of granularity in the design. When\nreferring specifically to its implementation in Terraform, a ",[194,925,701],{}," and a\nnarrowly scoped, single-purpose ",[194,928,929],{},"module"," could be used interchangeably.",[17,932,933,934,936,937,939],{},"Sometimes the idea of turning a simple resource into a module makes sense to\nease parameterization and reusability, especially when it is necessary to parse\ninputs. Although, due to its extremely limited scope, it might not look attractive\nto convert the ",[194,935,701],{}," into a ",[194,938,929],{}," at first sight, in the long run it\npays off to do so in order to achieve scalability and reproducibility.",[17,941,942],{},"e.g.:",[487,944,947],{"className":945,"code":946,"language":543,"meta":346},[541],"data \"aws_route53_zone\" \"default\" {\n  zone_id = var.zone_id\n  name    = var.zone_name\n}\n\nresource \"aws_route53_record\" \"default\" {\n  zone_id = data.aws_route53_zone.default.zone_id\n  name    = var.name\n\n  ttl  = var.ttl\n  type = var.record_type\n\n  records = var.records\n\n  dynamic \"alias\" {\n    # Render the block only when an alias is given (assumes var.alias defaults to null)\n    for_each = var.alias == null ? [] : [var.alias]\n\n    content {\n      # Inside a dynamic block the iterator is named after the block: alias.value\n      name    = alias.value.name\n      zone_id = try(alias.value.zone_id, data.aws_route53_zone.default.zone_id)\n\n      evaluate_target_health = lookup(\n        alias.value,\n        \"evaluate_target_health\",\n        false,\n      )\n    }\n  }\n}\n",[194,948,946],{"__ignoreMap":346},[17,950,951,952,955],{},"In this case, even though ",[194,953,954],{},"aws_route53_record"," is a simple resource that might\nfeel too narrow in scope to write a module, implementing the module\nallows bundling the AWS Route53 Zone data source with it, which helps to:",[462,957,958,965,978],{},[75,959,960,961,964],{},"provide a simpler contract by allowing the usage of 
",[194,962,963],{},"zone_name"," alone;",[75,966,967,968,970,971,973,974,977],{},"validate the ",[194,969,963],{}," input, ensuring that a given ",[194,972,963],{}," corresponds to an\nactual ",[189,975,976],{},"existing and valid"," AWS resource;",[75,979,980,981,984,985,988],{},"the same goes for ",[194,982,983],{},"zone_id",", which will feel (and oftentimes be) redundant:\n",[520,986,987],{},"when"," specified as an input, Terraform will read the data from the AWS API,\nensuring consistency.",[17,990,942],{},[487,992,995],{"className":993,"code":994,"language":543,"meta":346},[541],"module \"awesome_dns_fqdn\" {\n  source = \"path\u002Fto\u002Fmodules\u002Fatoms\u002Faws_route53_record\"\n  version = \"~> 1.0\"\n\n  name      = \"record.example.com\"\n  zone_name = \"example.com.\"\n\n  record_type = \"A\"\n  records     = [\"1.2.3.4\"]\n}\n",[194,996,994],{"__ignoreMap":346},[17,998,999],{},"Hence, resources and modules are sometimes interchangeable as they deliver the\nsame outcome at the finest level of granularity.",[524,1001,1003],{"id":1002},"molecules","Molecules",[17,1005,1006],{},"When groups of atoms are bonded together, they create a molecule, which is the\nsmallest fundamental unit of a compound.",[17,1008,1009],{},"Contrary to the original Atomic Design for Web UI, in Terraform, Atoms are\nuseful on their own. However, the usage of atoms comes with a high price in\nscalability: code duplication. 
Actually, duplication is an understatement, it is\nmore like code exponentiation (more on this later).",[562,1011,1013],{"id":1012},"implementation-example","Implementation example",[17,1015,1016],{},"Suppose we are creating a public facing API Gateway that needs a DNS record.",[17,1018,1019],{},"Let's compose it with the previous example:",[487,1021,1024],{"className":1022,"code":1023,"language":543,"meta":346},[541],"data \"aws_route53_zone\" \"default\" {\n  name = var.zone_name\n}\n\nmodule \"awesome_api_gateway_certificate\" {\n  source  = \"terraform-aws-modules\u002Facm\u002Faws\"\n  version = \"~> v3.0\"\n\n  domain_name = var.domain_name\n  zone_id     = data.aws_route53_zone.default.zone_id\n\n  wait_for_validation = true\n}\n\nmodule \"awesome_api_gateway\" {\n  source = \"terraform-aws-modules\u002Fapigateway-v2\u002Faws\"\n  version = \"~> 1.0\"\n\n  name          = var.api_gateway_name\n  description   = var.api_gateway_description\n  protocol_type = \"HTTP\"\n\n  cors_configuration = {\n    allow_headers = [\n      \"content-type\",\n      \"x-amz-date\",\n      \"authorization\",\n      \"x-api-key\",\n      \"x-amz-security-token\",\n      \"x-amz-user-agent\",\n    ]\n    allow_methods = [\"*\"]\n    allow_origins = [\"*\"]\n  }\n\n  # Custom domain\n  domain_name                 = var.domain_name\n  domain_name_certificate_arn = module.awesome_api_gateway_certificate.acm_certificate_arn\n\n  # Routes and integrations\n  integrations = var.api_gateway_integrations\n}\n\nmodule \"awesome_dns_fqdn\" {\n  source  = \"path\u002Fto\u002Fmodules\u002Fatoms\u002Faws_route53_record\"\n  version = \"~> 1.0\"\n\n  name    = var.domain_name\n  zone_id = data.aws_route53_zone.default.zone_id\n\n  record_type = \"CNAME\"\n  alias     = {\n    name    = module.awesome_api_gateway.apigatewayv2_domain_name_configuration[0].target_domain_name\n    zone_id = module.awesome_api_gateway.apigatewayv2_domain_name_configuration[0].hosted_zone_id\n  
}\n}\n",[194,1025,1023],{"__ignoreMap":346},[17,1027,1028,1029,1031,1032,1035],{},"This illustrates a case in which the ",[194,1030,954],{}," atom could\nbe easily replaced with its equivalent resource and it would still provide the\n",[189,1033,1034],{},"same"," outcome.",[17,1037,1038,1039,1041,1042,1044,1045,1047],{},"Commonly it is possible to use ",[194,1040,929],{}," and ",[194,1043,701],{}," interchangeably as Atoms;\nthe decision of whether or not to implement a ",[194,1046,929],{}," is ultimately defined by\nthe need to parse and\u002For validate the inputs (variables).",[562,1049,1051],{"id":1050},"usage-example","Usage example",[487,1053,1056],{"className":1054,"code":1055,"language":543,"meta":346},[541],"module \"awesome_lambda\" {\n  source  = \"path\u002Fto\u002Fmodules\u002Fmolecules\u002Faws_lambda_function\"\n  version = \"~> 1.0\"\n\n  function_name = \"awesome\"\n  description   = \"An Awesome lambda function for the Awesome API Gateway\"\n  handler       = \"index.lambda_handler\"\n  runtime       = \"python3.8\"\n\n  # Incomplete implementation, don't use this in production\n}\n\nmodule \"another_awesome_lambda\" {\n  source  = \"path\u002Fto\u002Fmodules\u002Fmolecules\u002Faws_lambda_function\"\n  version = \"~> 1.0\"\n\n  function_name = \"another-awesome\"\n  description   = \"Another Awesome lambda function for the Awesome API Gateway\"\n  handler       = \"index.lambda_handler\"\n  runtime       = \"python3.8\"\n\n  # Incomplete implementation, don't use this in production\n}\n\nmodule \"awesome_api_gateway\" {\n  source  = \"path\u002Fto\u002Fmodules\u002Fmolecules\u002Faws_api_gateway\"\n  version = \"~> 1.0\"\n\n  domain_name = \"record.example.com\"\n  zone_name   = \"example.com.\"\n\n  api_gateway_name        = \"awesome-api-gateway\"\n  api_gateway_description = \"An Awesome API Gateway\"\n\n  api_gateway_integrations = {\n    \"POST \u002F\" = {\n      lambda_arn             = module.awesome_lambda.function_arn\n      
payload_format_version = \"2.0\"\n    }\n\n    \"$default\" = {\n      lambda_arn = module.another_awesome_lambda.function_arn\n    }\n  }\n}\n",[194,1057,1055],{"__ignoreMap":346},[17,1059,1060,1061,1066],{},"As you have probably already realized, when the level of abstraction goes up\n(e.g. from atom to molecule) the module implementation is in itself a good\nimplementation example (i.e. as in ",[98,1062,1065],{"href":1063,"rel":1064},"https:\u002F\u002Fgithub.com\u002Fterraform-aws-modules\u002Fterraform-aws-lambda\u002Fblob\u002Fmaster\u002Fmain.tf",[102],"community modules examples",").",[17,1068,1069],{},"They help to self-document the usage and implementation of a given module, and\nthrough generic implementations they allow us to have multiple molecules\nimplementing multiple distinct use-cases, e.g.:",[462,1071,1072,1075,1078],{},[75,1073,1074],{},"Public API Gateway with DNS record + TLS certificate;",[75,1076,1077],{},"Public API Gateway v1, no DNS record;",[75,1079,1080],{},"Private API Gateway.",[17,1082,1083,1084,1087],{},"Why would we choose to implement the Atom modules multiple times in order to\ncreate multiple distinct use-cases? We are getting closer to the ",[520,1085,1086],{},"code\nexponentiation"," problem and its proposed solution. Can you feel it?",[524,1089,1091],{"id":1090},"organisms","Organisms",[17,1093,1094,1095,1099,1100,1105],{},"Going further, the ",[98,1096,1098],{"href":1097},"#usage-example","example of composition for molecules"," can have its hard-coded\nvalues turned into variables in order to compose an Organism, which can\nfacilitate the implementation of the same definition across different\nenvironments. 
This achieves reproducibility as well as the ",[98,1101,1104],{"href":1102,"rel":1103},"https:\u002F\u002F12factor.net\u002Fdev-prod-parity",[102],"Factor X."," of the\nTwelve Factor App.",[17,1107,1108,1109,1112],{},"However, it is important to note that the level of abstraction between Organisms\nand Molecules can be easily confused or misunderstood. Generally speaking, as a\nrule of thumb, an Organism is a composition of Molecules that allows parameterization of\nbusiness or domain-specific logic (e.g. the actual ",[194,1110,1111],{},"awesome_api"," configuration).\nTherefore, in comparison with Molecules, Organisms (usually) have a lower\nlevel of generalization since they are business-specialized modules.",[17,1114,1115,1116,1118],{},"Iterating over our implementation example, the Organism would implement the\n",[194,1117,1111],{},", creating the following resources:",[72,1120,1121,1124,1127,1130],{},[75,1122,1123],{},"AWS Lambda function;",[75,1125,1126],{},"AWS API Gateway;",[75,1128,1129],{},"TLS Certificate on AWS ACM;",[75,1131,1132],{},"DNS record on AWS Route53.",[17,1134,1135],{},"By implementing the previous examples as organisms we:",[462,1137,1138,1141,1144],{},[75,1139,1140],{},"reduce the amount of boilerplate code;",[75,1142,1143],{},"foster reusability of modules;",[75,1145,1146],{},"provide a simple interface for non-operators to manage TF code.",[17,1148,1149,1150,1153,1154,1157],{},"When you sum it all up, you will notice that it is ",[189,1151,1152],{},"all about autonomy"," and\n\"DevOps\" through encouragement of self-service Ops. One wouldn't need to know a\nlot about Terraform to grab a module and pass some parameters to it. With\na code review process in place, Operators and Software Developers can manage the\nInfrastructure in harmony, ",[189,1155,1156],{},"together",". (:",[524,1159,1161],{"id":1160},"code-exponentiation-what","Code Exponentiation? 
What?",[17,1163,1164,1165,1170],{},"Read that as a dramatization of the ",[98,1166,1169],{"href":1167,"rel":1168},"https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDuplicate%5Fcode",[102],"\"code duplication\""," term.",[17,1172,1173,1174,501],{},"When it comes to Infrastructure as Code, there is no easy way around the jungle\nof resources that grows over time. Fast-paced tech companies are \"moving fast\nand breaking things\", and oftentimes the Operators are worried about a massive\nnumber of challenges at once: keeping the servers up and running, with a consistent\nresponse time, low error rate, and all that ",[98,1175,1178],{"href":1176,"rel":1177},"https:\u002F\u002Fsre.google\u002Fsre-book\u002Ftable-of-contents\u002F",[102],"playbook from Google's SRE wisdom",[17,1180,1181,1182,1185,1186,1189],{},"All things considered, a good Infrastructure as Code design is generally\na first-world problem. However, as time passes it evolves into a real issue\nthat slows down the implementation of resources as code. 
Either that or there\nwill be a ",[189,1183,1184],{},"huge ton"," of copy+paste to keep up with the pace, followed by a\nroutine of find+replace when changes are applied, ",[520,1187,1188],{},"then"," harder-to-track pull\nrequests and slower code reviews.",[17,1191,1192,1193,1195,1196,724],{},"Let's take our ",[194,1194,1111],{}," example and scale it up to multiple environments,\nfollowed by a second ",[194,1197,1111],{},[487,1199,1202],{"className":1200,"code":1201,"language":492,"meta":346},[490],".\n├── development\n│   ├── an-awesome-api\n│   │   └── main.tf\n│   └── another-awesome-api\n│       └── main.tf\n├── staging\n│   ├── an-awesome-api\n│   │   └── main.tf\n│   └── another-awesome-api\n│       └── main.tf\n└── production\n    ├── an-awesome-api\n    │   └── main.tf\n    └── another-awesome-api\n        └── main.tf\n",[194,1203,1201],{"__ignoreMap":346},[17,1205,1206,1207,1066],{},"Note that this directory structure is inspired by the ideas proposed in the\n[Terraform best practices post](",[1208,1209],"binding",{"value":1210},"\u003C relref \"terraform-best-practices\" >",[17,1212,1213],{},"In order to replicate the configuration and ensure consistency, the following is\nway simpler to implement (and review) than copy+pasting huge chunks of Terraform\ndefinitions:",[487,1215,1218],{"className":1216,"code":1217,"language":543,"meta":346},[541],"module \"awesome_api\" {\n  source = \"path\u002Fto\u002Fmodules\u002Forganisms\u002Faws_lambda_with_api_gateway\"\n  version = \"~> 1.0\"\n\n  domain_name = \"record.example.com\"\n  zone_name   = \"example.com.\"\n\n  lambda_functions = [\n    # Index 0 -- An Awesome Lambda Function, used for POST\n    {\n      name        = \"an-awesome\"\n      description = \"An Awesome lambda function for the Awesome API Gateway\"\n      handler     = \"an_awesome.lambda_handler\"\n      runtime     = \"python3.8\"\n    },\n    # Index 1 -- Another Awesome Lambda Function, used as $default\n    {\n      name        = 
\"another-awesome\"\n      description = \"Another Awesome lambda function for the Awesome API Gateway\"\n      handler     = \"another_awesome.lambda_handler\"\n      runtime     = \"python3.8\"\n    },\n  ]\n\n  api_gateway_name = \"awesome-api-gateway\"\n  api_gateway_description = \"An Awesome API Gateway\"\n\n  api_gateway_integrations = {\n    \"POST \u002F\" = {\n      lambda_function_index  = 0\n      payload_format_version = \"2.0\"\n    }\n\n    \"$default\" = {\n      lambda_function_index = 1\n    }\n  }\n}\n",[194,1219,1217],{"__ignoreMap":346},[91,1221,1223],{"id":1222},"conclusion","Conclusion",[17,1225,1226],{},"At the end of the day we get an ugly Terraform state containing many addresses like:",[487,1228,1233],{"className":1229,"code":1231,"language":1232,"meta":346},[1230],"language-ruby","module.something.module.something_else.module.yet_another_thing...\n","ruby",[194,1234,1231],{"__ignoreMap":346},[17,1236,1237],{},"But the productivity boost gained by merging modules based on context is a worthwhile\ninvestment. 
Especially for huge Terraform repositories with multiple teams\ncollaborating and managing a lot of resources.",[17,1239,1240,1241,501],{},"Cross-team collaboration is fostered by applying the Atomic Design methodology\nfor Terraform modules, code reusability becomes an important factor over\ncopy+paste and the repository gravitates towards the ",[98,1242,1245],{"href":1243,"rel":1244},"https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDon%27t%5Frepeat%5Fyourself",[102],"DRY principle",[91,1247,1249],{"id":1248},"same-post-different-places","Same post, different places",[72,1251,1252,1259,1266],{},[75,1253,1254,472],{},[98,1255,1258],{"href":1256,"rel":1257},"https:\u002F\u002Fwww.reddit.com\u002Fr\u002FTerraform\u002Fcomments\u002Fpd708z\u002Fterraform%5Fmodules%5Fatomic%5Fdesign\u002F",[102],"reddit.com: Terraform Modules: Atomic Design - r\u002FTerraform",[75,1260,1261,472],{},[98,1262,1265],{"href":1263,"rel":1264},"https:\u002F\u002Fdev.to\u002Fmacunha\u002Fterraform-modules-atomic-design-3i7m",[102],"dev.to: Terraform Modules: Atomic Design - DEV Community",[75,1267,1268,472],{},[98,1269,1272],{"href":1270,"rel":1271},"https:\u002F\u002Fweekly.tf\u002Fissues\u002Fweekly-tf-issue-51-terraform-atomic-design-ec2-image-builder-736257",[102],"weekly.tf: #51 - Terraform Atomic Design, EC2 Image Builder",{"title":346,"searchDepth":347,"depth":347,"links":1274},[1275,1276,1282,1283],{"id":375,"depth":347,"text":376},{"id":892,"depth":347,"text":893,"children":1277},[1278,1279,1280,1281],{"id":919,"depth":795,"text":920},{"id":1002,"depth":795,"text":1003},{"id":1090,"depth":795,"text":1091},{"id":1160,"depth":795,"text":1161},{"id":1222,"depth":347,"text":1223},{"id":1248,"depth":347,"text":1249},"2021-06-29T00:00:00+02:00","Adapting the Atomic Design methodology to Infrastructure as Code components to help foster code reusability, ease of maintenance and agile development of the infrastructure. 
Creates standardization, validates inputs and brings the Terraform definitions closer to the developers (self-service Ops).",{},"\u002Fposts\u002F2021\u002F06\u002Fterraform-atomic-design",{"title":810,"description":1285},"terraform-atomic-design","posts\u002F2021\u002F06\u002Fterraform-atomic-design",[366,806,500],"51i47DRdlOrtRDt7XIlsZ2X7XP5klw9_dS9yyn5J_3A",{"id":1294,"title":1295,"author":7,"body":1296,"createdAt":1615,"description":1616,"extension":355,"meta":1617,"navigation":357,"path":1618,"seo":1619,"slug":1620,"stem":1621,"tags":1622,"__hash__":1623},"posts\u002Fposts\u002F2021\u002F03\u002Fterraform-design-best-practices.md","Terraform Design Best Practices",{"type":9,"value":1297,"toc":1603},[1298,1300,1303,1306,1326,1329,1338,1347,1352,1356,1363,1368,1373,1376,1380,1384,1393,1406,1409,1413,1427,1430,1433,1449,1452,1454,1460,1467,1475,1479,1482,1485,1491,1495,1514,1552,1556,1563,1566,1583,1590],[91,1299,376],{"id":375},[17,1301,1302],{},"As someone who believes in empowering people and distributing power in order to\nachieve better outcomes, I have always felt that the best existing best-practices\nproposals don't touch on some key aspects (IMHO) of code evolution and business\nstructures.",[17,1304,1305],{},"Therefore, this document builds on the previous ones and extends them by adding\nsome self-service Ops and micro-services spice to the mix.",[17,1307,1308,1309,1314,1315,1320,1321,501],{},"On ",[98,1310,1313],{"href":1311,"rel":1312},"https:\u002F\u002Fwww.terraform-best-practices.com",[102],"Terraform best practices",", great insights on how to write code inside a\nmodule are provided, e.g. 
",[98,1316,1319],{"href":1317,"rel":1318},"https:\u002F\u002Fwww.terraform-best-practices.com\u002Fnaming",[102],"naming conventions",", ",[98,1322,1325],{"href":1323,"rel":1324},"https:\u002F\u002Fwww.terraform-best-practices.com\u002Fcode-structure#getting-started-with-structuring-of-terraform-configurations",[102],"Terraform file naming",[17,1327,1328],{},"We can't leave Terragrunt's epic blog post unmentioned:",[72,1330,1331],{},[75,1332,1333,472],{},[98,1334,1337],{"href":1335,"rel":1336},"https:\u002F\u002Fblog.gruntwork.io\u002F5-lessons-learned-from-writing-over-300-000-lines-of-infrastructure-code-36ba7fadeac1",[102],"5 Lessons Learned From Writing Over 300,000 Lines of Infrastructure Code",[17,1339,1340,1341,1346],{},"As well as the ",[98,1342,1345],{"href":1343,"rel":1344},"https:\u002F\u002Fterragrunt.gruntwork.io\u002Fdocs\u002Fgetting-started\u002Fquick-start\u002F#promote-immutable-versioned-terraform-modules-across-environments",[102],"Terragrunt documentation",", which points out that \"one of the most important\nlessons\" is that:",[405,1348,1349],{},[17,1350,1351],{},"large modules should be considered harmful. That is, it is a Bad Idea to define\nall of your environments (dev, stage, prod, etc), or even a large amount of\ninfrastructure (servers, databases, load balancers, DNS, etc), in a single\nTerraform module. 
Large modules are slow, insecure, hard to update, hard to code\nreview, hard to test, and brittle (i.e., you have all your eggs in one basket).",[524,1353,1355],{"id":1354},"bad-idea-capitalized","\"Bad Idea\" capitalized!",[17,1357,1358,1359,501],{},"Which is totally true: this \"Bad Idea\" usually comes from a lack of care\ntowards Terraform code design and tends to be harmful in the long run, with a\ntendency towards turning the implementation into a ",[98,1360,1362],{"href":382,"rel":1361},[102],"big ball of mud",[405,1364,1365],{},[17,1366,1367],{},"A Big Ball of Mud is a haphazardly structured, sprawling, sloppy,\nduct-tape-and-baling-wire, spaghetti-code jungle. These systems show\nunmistakable signs of unregulated growth, and repeated, expedient repair.\nInformation is shared promiscuously among distant elements of the system,\noften to the point where nearly all the important information becomes global\nor duplicated.",[405,1369,1370],{},[17,1371,1372],{},"The overall structure of the system may never have been well defined.",[17,1374,1375],{},"Oftentimes, Terraform code implementations gravitate towards mono-repositories\n(a.k.a. monorepos) containing the whole specification in a single place. In order\nto tame the chaos, the Terraform state needs to be at least sub-divided into\nlogical sections.",[91,1377,1379],{"id":1378},"design","Design",[524,1381,1383],{"id":1382},"shallow-tree-of-shared-resources","Shallow \"tree\" of shared resources",[17,1385,1386,1387,1392],{},"Following the ",[98,1388,1391],{"href":1389,"rel":1390},"https:\u002F\u002Fwww.terraform-best-practices.com\u002Fcode-structure#common-recommendations-for-structuring-code",[102],"recommendations for structuring code",", one of the proposals is to\nkeep a shallow \"tree\" of resources and modules. This tree produces a small and\nclear distribution of Terraform code.",[17,1394,1395,1396,1401,1402,1405],{},"Why a shallow \"tree\" of resources? 
It helps keep the number of\nresources and modules low, which results in a small ",[98,1397,1400],{"href":1398,"rel":1399},"https:\u002F\u002Fwww.terraform.io\u002Fdocs\u002Flanguage\u002Fstate\u002Fremote.html",[102],"remote state"," file. With a small\nremote state we speed up the development process and reduce waste (",[520,1403,1404],{},"Muda"," in the\nToyota 3M model), as the shallow tree enables faster executions of Terraform\n(less data to sync and compare).",[17,1407,1408],{},"The granularity level will be defined for each specific case (no silver bullet),\naiming for the smallest feasible composition.",[524,1410,1412],{"id":1411},"product-areas-aka-business-capabilities-structure-and-ownership","Product areas (a.k.a. Business capabilities) structure and ownership",[17,1414,1415,1416,1420,1421,1426],{},"Ideally, the ",[98,1417,1419],{"href":469,"rel":1418},[102],"composition level"," would be organized around Product Areas (either\nsquads\u002Fcrews or guilds) with a fallback to shared technologies (e.g. vpc,\ndatabases). Therefore, Terraform compositions are designed around what Martin\nFowler ",[98,1422,1425],{"href":1423,"rel":1424},"https:\u002F\u002Fyoutu.be\u002FwgdBVIX9ifA?t=388",[102],"calls \"Business capabilities\""," in micro-services terminology; ideally, the\nTerraform composition will follow the organizational structure so that each team\n\"owns\" (in both senses: ownership and freedom) its own state.",[17,1428,1429],{},"The main goal here is to structure the Terraform code as a reflection of the\norganization so that it fosters self-service Ops. 
If the Infrastructure as Code\nis mature enough to the point of having well-described Terraform modules,\neveryone should be empowered to use these modules by setting the parameters\naccording to their needs, without centralizing power in an Operations team.",[17,1431,1432],{},"The resource composition must gravitate towards the following (ordered by\npriority from higher to lower):",[462,1434,1435,1446],{},[75,1436,1437,1438],{},"Product Areas (ownership) directory structure:",[462,1439,1440,1443],{},[75,1441,1442],{},"squad\u002Fcrew OR guild;",[75,1444,1445],{},"product.",[75,1447,1448],{},"Shared resources, around technologies.",[17,1450,1451],{},"Looking at the structure bottom-up, it starts from the product and then\nattributes the product to a crew through the directory tree.",[17,1453,942],{},[487,1455,1458],{"className":1456,"code":1457,"language":492,"meta":346},[490],"# Squad or Crew\nred-team\n└── payment     # Product (i.e. micro-service) name\n    └── main.tf # Any resource used by the payment product\n\n# Guild (organized around technology)\nback-end\n└── monolith    # Shared application in terms of ownership\n    └── main.tf # Cloud resources used by the monolith\n",[194,1459,1457],{"__ignoreMap":346},[17,1461,1462,1463,1466],{},"In the example above, we can't ignore that ",[194,1464,1465],{},"monolith"," is a product with shared\nownership among back-end developers and therefore it is organized to follow the\nbusiness structure.",[17,1468,1469,1470,1474],{},"The structure is inspired ",[98,1471,1473],{"href":1343,"rel":1472},[102],"by Terragrunt's best practices"," to some extent.\nHowever, it differs from Terragrunt's proposal in the way resources are divided,\nrather than organizing resources exclusively around technologies.",[524,1476,1478],{"id":1477},"shared-resources-organized-around-technologies","Shared resources, organized around technologies",[17,1480,1481],{},"Oftentimes in organizations we will face shared resources among products, 
there\nis no way around that reality (e.g. a shared VPC or SQL database).",[17,1483,1484],{},"However, these situations should be the exception and not the norm. They are dealt with\nsimilarly to the organization of Terraform compositions around\nguilds\u002Ftechnologies.",[487,1486,1489],{"className":1487,"code":1488,"language":492,"meta":346},[490],"platform # as in Platform Engineering\n└── vpc\n    └── main.tf\n\nback-end\n└── database\n    └── main.tf\n",[194,1490,1488],{"__ignoreMap":346},[524,1492,1494],{"id":1493},"files-inside-the-composition","Files inside the composition?",[17,1496,1497,1498,1502,1503,1320,1506,1509,1510,1513],{},"Ideally the files in the sub-directory (which specify the composition) are going\nto partially ",[98,1499,1501],{"href":1323,"rel":1500},[102],"follow this spec"," and include ",[194,1504,1505],{},"data.tf",[194,1507,1508],{},"terraform.tf"," and\n",[194,1511,1512],{},"providers.tf"," on top of that.",[72,1515,1516,1522,1528,1534,1540,1546],{},[75,1517,1518,1521],{},[189,1519,1520],{},"main.tf:"," contains locals, module and resource definitions;",[75,1523,1524,1527],{},[189,1525,1526],{},"variables.tf:"," contains declarations of variables (i.e. inputs\u002Fparameters)\nused in main.tf;",[75,1529,1530,1533],{},[189,1531,1532],{},"data.tf:"," contains data sources for input data used in main.tf;",[75,1535,1536,1539],{},[189,1537,1538],{},"outputs.tf:"," contains outputs from the resources created in main.tf;",[75,1541,1542,1545],{},[189,1543,1544],{},"providers.tf:"," contains provider and provider version definitions;",[75,1547,1548,1551],{},[189,1549,1550],{},"terraform.tf:"," contains the Terraform back-end (e.g. 
remote state)\ndefinition;",[524,1553,1555],{"id":1554},"what-about-terraform-modules","What about Terraform modules?",[17,1557,1558,1562],{},[98,1559,883],{"href":1560,"rel":1561},"https:\u002F\u002Fwww.terraform.io\u002Fdocs\u002Flanguage\u002Fmodules\u002Findex.html",[102]," are containers for multiple resources that are used together\nto achieve a shared goal. Modules can be used to create lightweight\nabstractions, facilitating reusability and distribution of Terraform code.",[17,1564,1565],{},"Therefore, we assume the following rules, each guarding against an anti-pattern that makes Terraform\nmodules hard to reuse:",[72,1567,1568,1571,1574,1577,1580],{},[75,1569,1570],{},"Terraform Providers must not be configured inside a module;",[75,1572,1573],{},"Business logic and\u002For hard-coded parameters must not be implemented in a\nmodule;",[75,1575,1576],{},"Default values must be specified in optional variables instead of\nhard-coding;",[75,1578,1579],{},"Modules should be self-contained and provide a clear contract.\nDependencies (pre-existing resources) must be specified through required\nvariables.",[75,1581,1582],{},"Modules must serve a single purpose. 
Multiple purposes must be\nachieved through composition of modules and not by \"monolithic\" modules.",[17,1584,1585,1586,501],{},"Modules are abstractions that should be used to reduce the amount of code\nduplication, implementing the ",[98,1587,1589],{"href":1243,"rel":1588},[102],"DRY (don't repeat yourself) principle",[17,1591,1592,1593,1598,1599,501],{},"On top of that, modules are an important factor to improve the parity among\nenvironments, which helps to better address the ",[98,1594,1597],{"href":1595,"rel":1596},"https:\u002F\u002F12factor.net\u002F",[102],"Twelve-Factor App model"," with\nregard to ",[98,1600,1602],{"href":1102,"rel":1601},[102],"Factor X (ten)",{"title":346,"searchDepth":347,"depth":347,"links":1604},[1605,1608],{"id":375,"depth":347,"text":376,"children":1606},[1607],{"id":1354,"depth":795,"text":1355},{"id":1378,"depth":347,"text":1379,"children":1609},[1610,1611,1612,1613,1614],{"id":1382,"depth":795,"text":1383},{"id":1411,"depth":795,"text":1412},{"id":1477,"depth":795,"text":1478},{"id":1493,"depth":795,"text":1494},{"id":1554,"depth":795,"text":1555},"2021-03-31T00:00:00+02:00","Building on the existing Terraform best-practices documents to empower developers and distribute the power of managing Infrastructure. 
In doing so, some self-service Ops and micro-services architecture were added to the mix.",{},"\u002Fposts\u002F2021\u002F03\u002Fterraform-design-best-practices",{"title":1295,"description":1616},"terraform-design-best-practices","posts\u002F2021\u002F03\u002Fterraform-design-best-practices",[366,806,500],"IqLbKyCdWg0VIZ70HZpIxE1-m-hZNH2LRJmifbapGDU",{"id":1625,"title":1626,"author":7,"body":1627,"createdAt":1897,"description":1898,"extension":355,"meta":1899,"navigation":357,"path":1900,"seo":1901,"slug":1902,"stem":1903,"tags":1904,"__hash__":1907},"posts\u002Fposts\u002F2021\u002F02\u002Fjenkins-rebuilding-it-phase-by-phase.md","The Freeletics CI\u002FCD rebuild, phase by phase",{"type":9,"value":1628,"toc":1889},[1629,1633,1645,1648,1652,1656,1659,1674,1682,1687,1691,1694,1702,1705,1716,1719,1723,1730,1736,1750,1758,1768,1772,1779,1802,1806,1809,1815,1818,1822,1835,1838,1877,1882,1886],[12,1630,1632],{"id":1631},"picking-up-where-we-left-off","Picking up where we left off",[17,1634,1635,1639,1640,1644],{},[98,1636,1638],{"href":1637},"\u002Fposts\u002F2021\u002F01\u002Fjenkins-five-years-of-cicd-debt","Part 1"," covered what we\ninherited and why we chose to rebuild Jenkins rather than replace it.\n",[98,1641,1643],{"href":1642},"\u002Fposts\u002F2021\u002F01\u002Fjenkins-boring-security-by-design","Part 2"," covered the\nsecurity foundation: authorization that stays accurate without manual upkeep,\nand secrets decoupled from the configuration release cycle. This post is about\nthe build system itself (the piece that made Dependabot Monday mornings a\nnon-event).",[17,1646,1647],{},"One constraint shaped everything: we couldn't afford a hard cutover. Freeletics'\nback-end and web deployment pipelines were running continuously. 
The redesign\nhad to happen alongside normal operations, phase by phase, with the old system\nstill running until each piece was ready to replace.",[12,1649,1651],{"id":1650},"phase-2-kill-the-job-builder-introduce-kaniko","Phase 2: Kill the Job Builder, introduce Kaniko",[91,1653,1655],{"id":1654},"first-things-first-reproducibility","First things first: reproducibility",[17,1657,1658],{},"The hardest requirement to state and the easiest to overlook: a Jenkins\ninstallation that can be deleted and recreated from scratch without losing\nanything. No manual configuration living only in someone's memory, no jobs\nthat were \"created by hand one afternoon\" and never written down.",[17,1660,1661,1662,1667,1668,1673],{},"To get there, JJB had to go. Its replacement was a combination of\n",[98,1663,1666],{"href":1664,"rel":1665},"https:\u002F\u002Fgithub.com\u002Fjenkinsci\u002Fconfiguration-as-code-plugin",[102],"Jenkins Configuration as Code (JCasC)","\nfor the Jenkins system-level settings and\n",[98,1669,1672],{"href":1670,"rel":1671},"https:\u002F\u002Fgithub.com\u002Fjenkinsci\u002Fjob-dsl-plugin",[102],"Job DSL"," for job definitions,\nboth managed through Terraform. Every Jenkins job became a pull request. Every\nsystem configuration change was reviewable, auditable, and rollback-able.",[17,1675,1676,1677,1681],{},"The practical impact was felt immediately. Previously, understanding what jobs\nJenkins was running required either opening the Jenkins UI or reading the\nbundled JJB YAML inside the Helm Chart. Now, the full picture lived in the\nTerraform repository alongside everything else. New team members could onboard\nto Jenkins by reading code, not by poking around a UI. 
This is\n",[98,1678,1680],{"href":1679},"\u002Fposts\u002F2019\u002F01\u002Fdevops-benefits#configuration-andor-infrastructure-as-code","Infrastructure as Code"," applied to CI\nconfiguration:",[405,1683,1684],{},[17,1685,1686],{},"\"Easy to recreate the infrastructure, if it is necessary to move everything\nto another place, this can happen with a few manual interactions; allows for\na code review of infrastructure and configurations, which consequently brings\na culture of collaboration in the development, sharing of knowledge.\"",[91,1688,1690],{"id":1689},"docker-in-docker-out-kaniko-in","Docker-in-Docker out, Kaniko in",[17,1692,1693],{},"Docker-in-Docker (DinD) was how Jenkins built container images: a Docker daemon\nrunning inside a container, building images for other containers. It works, but\nit requires running privileged Kubernetes pods (a security concern on shared\ninfrastructure) and carries well-documented instability issues under high build\nconcurrency (the same concurrency problem that made Dependabot mornings painful).",[17,1695,1696,1701],{},[98,1697,1700],{"href":1698,"rel":1699},"https:\u002F\u002Fgithub.com\u002FGoogleContainerTools\u002Fkaniko",[102],"Kaniko"," builds images from a\nDockerfile without needing a Docker daemon at all. It runs as a regular\nKubernetes pod with no elevated privileges, uses the container registry\ndirectly for cache, and handles concurrent builds gracefully because each build\nis fully isolated.",[17,1703,1704],{},"Migration covered every server-side application:",[72,1706,1707,1710,1713],{},[75,1708,1709],{},"All Rails back-end services;",[75,1711,1712],{},"Web applications;",[75,1714,1715],{},"Internal tools (AI and analytics).",[17,1717,1718],{},"Each application got its own Kaniko cache repository in ECR. 
Layer caching\nacross builds meant the build times stayed competitive with the DinD baseline\nwhile removing the concurrency issues that caused the Monday morning hangs.",[91,1720,1722],{"id":1721},"the-shared-library-was-doing-too-much","The Shared Library was doing too much",[17,1724,1725,1726,1729],{},"The Jenkins Groovy Shared Library had a ",[194,1727,1728],{},"run()"," method that orchestrated the\nentire pipeline from a single entry point: git checkout, build, tag, push to\nECR, Slack notification. One call to rule them all. The appeal was obvious:\none-liner Jenkinsfiles across every repository.",[17,1731,1732,1733,1735],{},"The problem showed up the moment someone needed to deviate from the happy path.\nWant to add a custom build arg? Override the image tag format? Skip the Slack\nnotification for a specific branch? None of that was possible without touching\nthe Shared Library and potentially breaking every other pipeline that depended\non the same ",[194,1734,1728],{}," method.",[17,1737,1738,1739,1742,1743,1746,1747,1749],{},"The refactoring introduced a ",[194,1740,1741],{},"KanikoBuilder"," class that pipelines compose\nexplicitly from their ",[194,1744,1745],{},"Jenkinsfile",". 
The Shared Library provides the building\nblocks; the per-repository ",[194,1748,1745],{}," decides how to assemble them:",[487,1751,1756],{"className":1752,"code":1754,"language":1755,"meta":346},[1753],"language-groovy","@Library('jenkins_shared_library@master') _\n\nimport com.example.KanikoBuilder\n\ndef kanikoBuilder = new KanikoBuilder(repositoryName: \"foobar\",\n                                      steps: this)\n\npodTemplate(yaml: kanikoBuilder.getPodTemplate()) {\n    node(POD_LABEL) {\n        stage('Git - Fetch code') {\n            env.GIT_COMMIT = gitCheckout(branchName: env.BRANCH_NAME)\n        }\n\n        stage(\"Build - Container Image\") {\n            kanikoBuilder.setImageTag(env.GIT_COMMIT)\n            kanikoBuilder.build()\n        }\n    }\n}\n","groovy",[194,1757,1754],{"__ignoreMap":346},[17,1759,1760,1761,1764,1765,1767],{},"The ",[194,1762,1763],{},"steps: this"," argument passes the pipeline context into ",[194,1766,1741],{},",\nletting the library add stages at runtime. Per-repository Jenkinsfiles became\nshort and readable. Cross-cutting changes (new tag format, new registry, Slack\nintegration update) could now be made once in the library and propagated across\nall pipelines on the next library release.",[91,1769,1771],{"id":1770},"image-multi-tagging","Image multi-tagging",[17,1773,1774,1775,1778],{},"The QA stack resolved images by tag convention, and it needed multiple tags\napplied in a single build pass. 
Kaniko accepts N ",[194,1776,1777],{},"--destination"," arguments, so\nthe library was extended to receive a tag list and generate the flag string:",[72,1780,1781,1787,1796],{},[75,1782,1783,1786],{},[194,1784,1785],{},"qa-\u003CSHA1>"," applied always;",[75,1788,1789,1792,1793,472],{},[194,1790,1791],{},"qa-latest-master"," applied when the branch is ",[194,1794,1795],{},"master",[75,1797,1798,1801],{},[194,1799,1800],{},"qa-\u003Cbranch-name>"," applied optionally, declared in the Jenkinsfile.",[91,1803,1805],{"id":1804},"human-readable-credential-ids","Human-readable credential IDs",[17,1807,1808],{},"A minor change with outsized quality-of-life impact: all Jenkins credential IDs\nwere UUIDs. Debugging a failed credential lookup meant grep-ing through the\nHelm values to cross-reference the UUID against a label. After the migration:",[487,1810,1813],{"className":1811,"code":1812,"language":492},[490],"slack_token\naws_credentials\nnpm_token\ngithub_api_token\n",[194,1814,1812],{"__ignoreMap":346},[17,1816,1817],{},"All Jenkinsfiles across repositories were updated to reference the new IDs.\nTime spent debugging credential issues dropped to near zero.",[12,1819,1821],{"id":1820},"the-result","The result",[17,1823,1824,1825,1829,1830,1834],{},"By Feb\u002F2021, the Jenkins installation was unrecognizable from what it had been\nsix months earlier. The Monday morning Dependabot problem was gone. The\n",[98,1826,1828],{"href":1827},"\u002Fposts\u002F2019\u002F01\u002Fdevops-genesis#mura-unevenness","Mura"," was eliminated at the\nsource rather than managed around. 
The ",[98,1831,1833],{"href":1832},"\u002Fposts\u002F2019\u002F01\u002Fdevops-benefits#conclusion","boring\nplatform"," goal (\"no unexpected\nbehavior that makes your heart pump faster, no surprises, it just works\") had a\nnew data point: 50 concurrent Kaniko builds triggered without a single hang.\nBuild concurrency that used to cause cascading failures now just worked, because\nKaniko pods are isolated by design and don't share a Docker daemon to fight\nover.",[17,1836,1837],{},"A summary of what changed:",[72,1839,1840,1846,1852,1858,1868],{},[75,1841,1842,1845],{},[189,1843,1844],{},"Kubernetes-native workers:"," ephemeral pods spawned per job, sized to the\nworkload type, torn down when the job finishes;",[75,1847,1848,1851],{},[189,1849,1850],{},"Fully reproducible:"," delete the Helm release, recreate it from Terraform\nand JCasC, get the same Jenkins back (zero manual state);",[75,1853,1854,1857],{},[189,1855,1856],{},"Single Groovy Shared Library:"," one codebase for back-end, web, coach,\nand tracking pipelines (previously each maintained their own ad-hoc\nimplementations with duplicated logic);",[75,1859,1860,1863,1864,1867],{},[189,1861,1862],{},"GitHub OAuth with team-based RBAC:"," access control that matches the\nexisting GitHub team structure (one place to manage, one place to audit;\nfull decision in ",[98,1865,1643],{"href":1866},"\u002Fposts\u002F2021\u002F01\u002Fjenkins-boring-security-by-design#authorization-as-an-engineering-problem",");",[75,1869,1870,1873,1874,1066],{},[189,1871,1872],{},"Decoupled secrets:"," Jenkins configuration releases and secrets changes\nhave independent cycles; both are fully auditable (see\n",[98,1875,1643],{"href":1876},"\u002Fposts\u002F2021\u002F01\u002Fjenkins-boring-security-by-design#splitting-the-concerns",[164,1878,1879],{"type":166},[17,1880,1881],{},"CircleCI stayed in place for iOS and Android builds. 
The redesign scope\nwas server-side applications only (closing the mobile\u002Fserver-side gap was\nexplicitly deferred).",[91,1883,1885],{"id":1884},"whats-next","What's next",[17,1887,1888],{},"ChatOps for deployment triggering was designed but didn't ship in this phase:\na Hubot script accepting deployment commands from a Slack channel and\ntranslating them to Jenkins pipeline triggers, mirroring the mobile deployment\nworkflow already in use on CircleCI. The interface design was drafted\n(including ACL for restricting deployment permissions to a named group), but\nthe implementation was deprioritized. The deployment flow still goes through\nthe Jenkins UI.",{"title":346,"searchDepth":347,"depth":347,"links":1890},[1891,1892,1893,1894,1895,1896],{"id":1654,"depth":347,"text":1655},{"id":1689,"depth":347,"text":1690},{"id":1721,"depth":347,"text":1722},{"id":1770,"depth":347,"text":1771},{"id":1804,"depth":347,"text":1805},{"id":1884,"depth":347,"text":1885},"2021-02-01T00:00:00+01:00","The build system: Kaniko replacing Docker-in-Docker, a Groovy Shared Library redesign that cut per-repository boilerplate to near zero, and the change that made Dependabot Monday mornings a non-event. 
What we shipped and what it changed.",{},"\u002Fposts\u002F2021\u002F02\u002Fjenkins-rebuilding-it-phase-by-phase",{"title":1626,"description":1898},"jenkins-rebuilding-it-phase-by-phase","posts\u002F2021\u002F02\u002Fjenkins-rebuilding-it-phase-by-phase",[1905,1906,364,366],"ci-cd","jenkins","n5nfhedVJ-EYbEBmFtVLCzTselrsryfyUXKnh8PMjeg",{"id":1909,"title":1910,"author":7,"body":1911,"createdAt":2063,"description":2064,"extension":355,"meta":2065,"navigation":357,"path":1642,"seo":2066,"slug":2067,"stem":2068,"tags":2069,"__hash__":2071},"posts\u002Fposts\u002F2021\u002F01\u002Fjenkins-boring-security-by-design.md","Boring security on Freeletics Jenkins, by design",{"type":9,"value":1912,"toc":2061},[1913,1917,1920,1923,1926,1933,1937,1944,1947,1953,1959,1962,1965,1972,1976,1979,1986,1989,1992,1999,2003,2006,2009,2021,2027,2032,2036,2039,2042,2045,2052,2055],[12,1914,1916],{"id":1915},"the-access-control-nobody-chose","The access control nobody chose",[17,1918,1919],{},"At some point during Jenkins' original setup at Freeletics, the Google OAuth\nplugin was configured in development mode. That mode grants every account in\nthe G-Suite domain admin access to Jenkins. It was left in production. By the\ntime I ran an access review in 2020, every Freeletics employee (regardless of\nrole) could open Jenkins, modify jobs, trigger deployments to any environment,\nand change system configuration.",[17,1921,1922],{},"Nobody intended this. It was a default.",[17,1924,1925],{},"That distinction matters: security debt rarely accumulates through deliberate\nchoices. It accumulates through configuration that worked well enough to ship\nand was never revisited. Each individual default is harmless; the pattern is\nnot. 
A CI system that any employee can reconfigure is an audit finding waiting\nto happen (and, more practically, a deployment pipeline that can be triggered\nor altered by someone who has no business doing so).",[17,1927,1928,1929,1932],{},"The fix wasn't just \"restrict access.\" That framing leads to the wrong design.\nThe real question was: ",[520,1930,1931],{},"how do you build access control that stays accurate over\ntime without requiring manual maintenance?"," Access lists that require human\nupkeep drift. They drift because people change roles, people leave, and nobody\nhas time to be the steward of a permissions spreadsheet.",[12,1934,1936],{"id":1935},"authorization-as-an-engineering-problem","Authorization as an engineering problem",[17,1938,1939,1943],{},[98,1940,1942],{"href":1941},"\u002Fposts\u002F2021\u002F01\u002Fjenkins-five-years-of-cicd-debt#the-evaluation-stay-or-leave","When we decided to rebuild Jenkins",",\nthe design goals included making every aspect of the system reproducible from\ncode. Authorization was no exception.",[17,1945,1946],{},"Two options were evaluated.",[17,1948,1949,1952],{},[189,1950,1951],{},"Google OAuth with groups"," would restrict access to specific G-Suite groups\nrather than the entire domain. The off-boarding story is clean: deactivate a\nG-Suite account and that person loses Jenkins access immediately. The problem is\nthe authorization model itself. G-Suite groups are designed for e-mail\ndistribution, not access control. 
A Jenkins permission group built on top of\nG-Suite groups would live outside any existing engineering workflow: group\nmembership changes happen in the G-Suite admin console, not through pull\nrequests; there's no audit trail a developer can query; and keeping groups\nsynchronized with actual team structure requires manual intervention every time\nsomeone changes teams.",[17,1954,1955,1958],{},[189,1956,1957],{},"GitHub OAuth"," maps Jenkins authorization directly to GitHub team membership.\nAt Freeletics, GitHub teams already reflected engineering structure (back-end,\nweb, ops, coach) and were already managed through pull requests. Using them as\nthe Jenkins authorization source meant zero new model to maintain: one source of\ntruth, one place to make changes, one place to audit them.",[17,1960,1961],{},"The off-boarding trade-off is real: deactivating a G-Suite account doesn't\nautomatically remove someone from a GitHub team, so there's a separate removal\nstep required. That gap was consciously accepted.",[17,1963,1964],{},"Q: Is a manual off-boarding step a deal-breaker?\nA: Only if the alternative is more reliable in practice. A model requiring\nconstant maintenance to stay accurate has a short mean-time-to-drift; a model\nrequiring one deliberate step at a specific moment has a narrow, bounded failure\nmode. GitHub OAuth's gap is known, documented, and owned. Google group drift is\nopen-ended and invisible until an audit surfaces it.",[17,1966,1967,1968,1971],{},"The outcome: GitHub OAuth with team-based RBAC. Access requests happen through\npull requests on the GitHub teams configuration. New engineers, role changes,\ncontractor access: all of it reviewable, auditable, and executed through a\nworkflow that already exists. 
Any access review is a ",[194,1969,1970],{},"git log"," away from being\ndone.",[12,1973,1975],{"id":1974},"the-helm-chart-had-a-dirty-secret","The Helm Chart had a dirty secret",[17,1977,1978],{},"The secrets problem was a separation of concerns violation wearing a Helm chart\ncostume.",[17,1980,1981,1982,1985],{},"The Jenkins Helm Chart stored secrets encrypted via ",[194,1983,1984],{},"helm-secrets"," inside the\nHelm values file. Encrypted at rest, available at deploy time (it worked). The\nproblem surfaced the moment you needed to change anything else.",[17,1987,1988],{},"Every Helm release (every plugin update, every executor count adjustment, every\nconfiguration change) carried the full secrets bundle along for the ride.\nRolling back a Helm release because a plugin update broke something also rolled\nback the secrets version. Two completely independent concerns (configuration and\nsecrets) shared one release train, which meant neither could change independently.",[17,1990,1991],{},"Q: What's the actual cost of that coupling?\nA: Imagine a plugin update that breaks the build pipeline on a Monday morning.\nYou roll back the Helm release. The rollback also silently reverts a credential\nrotation that happened the previous week. You now have live pipelines running\nagainst rotated credentials that your Jenkins hasn't been told about. The blast\nradius of a configuration change includes your secrets state, and there's no way\nto disentangle the two without a full re-deploy.",[17,1993,1994,1995,1998],{},"This is ",[98,1996,1404],{"href":1997},"\u002Fposts\u002F2019\u002F01\u002Fdevops-genesis#muda-waste"," at the infrastructure level:\nwork performed to fix a configuration change bleeds into your secrets state. Every\ndeployment carries overhead that has nothing to do with what you're actually\ntrying to change. 
That overhead is invisible until something breaks.",[12,2000,2002],{"id":2001},"splitting-the-concerns","Splitting the concerns",[17,2004,2005],{},"The redesign cut the coupling entirely. Secrets and configuration got independent\nrelease cycles and separate ownership.",[17,2007,2008],{},"Secrets fell into two categories, and the distinction drove the design:",[17,2010,2011,2014,2015,2020],{},[189,2012,2013],{},"Runtime secrets"," (Kubeconfigs, Kubernetes service account credentials,\nanything that rotates regularly or whose value pipelines read at execution time)\nwere moved to AWS Secrets Manager. The\n",[98,2016,2019],{"href":2017,"rel":2018},"https:\u002F\u002Fgithub.com\u002Fjenkinsci\u002Faws-secrets-manager-credentials-provider-plugin",[102],"AWS Secrets Manager Credentials Provider plugin","\nextends the Jenkins credentials API to read from Secrets Manager on the fly.\nRotation happens in Secrets Manager; pipelines pick up the new value on the next\nrun with no Helm release required.",[17,2022,2023,2026],{},[189,2024,2025],{},"Static credentials"," (AWS IAM keys, API tokens, service credentials that\nchange infrequently) stayed Sops-encrypted in the infrastructure repository\nand were synced to the Jenkins Credentials Store through JCasC on each Helm\nrelease. The repository is the audit log: every change goes through a pull\nrequest, every reviewer can see exactly what changed, and git history is the\nversioning.",[164,2028,2029],{"type":166},[17,2030,2031],{},"AWS SSM Parameter Store was ruled out early. The Jenkins plugin for SSM\ndoesn't obfuscate secret values from build log output, meaning credentials can\nappear in plaintext in job logs. HashiCorp Vault covers all three requirements\nand more, but running and operating a Vault cluster is itself a non-trivial\nplatform engineering task. 
The operational overhead outweighs the marginal\nbenefit over AWS Secrets Manager at this team size and scope.",[12,2033,2035],{"id":2034},"boring-security-by-design","Boring security, by design",[17,2037,2038],{},"The through-line across both changes is the same: the right security posture is\nthe one that stays correct without requiring ongoing effort to maintain.",[17,2040,2041],{},"Authorization that maps to existing GitHub teams doesn't drift because it\ninherits all the maintenance work that already happens around GitHub teams. When\na new engineer joins, they get added to a GitHub team (a step that was already\nhappening) and Jenkins access follows. When someone changes teams, their Jenkins\npermissions change with it. The security model is a side effect of normal\nengineering operations, not a parallel task stacked on top of them.",[17,2043,2044],{},"Secrets with independent release cycles don't get silently rolled back as\ncollateral damage from configuration changes. Rotation happens in one place and\npropagates without a deployment. The audit trail lives in the repository where\nanyone can query it.",[17,2046,2047,2048,2051],{},"Neither of these is \"hardening\" in the traditional sense. They're design choices\nthat make the system easier to operate correctly than incorrectly (which is\nwhat ",[98,2049,2050],{"href":1832},"a boring platform"," actually means). 
Not\nzero incidents, but no surprises: the kind of security that doesn't require\nemergency firefighting because the defaults are already aligned with the intent.",[2053,2054],"hr",{},[17,2056,2057,2058],{},"With authorization and secrets on solid footing, the next phase was the build\nsystem itself: replacing Docker-in-Docker with Kaniko, redesigning the Groovy\nShared Library, and making the Dependabot Monday mornings a non-event.\n",[98,2059,2060],{"href":1900},"Part 3: the execution.",{"title":346,"searchDepth":347,"depth":347,"links":2062},[],"2021-01-24T00:00:00+01:00","Every G-Suite account had Jenkins admin access. Nobody chose this; it was the default, and it was never revisited. How authorization and secrets management were rebuilt to make the right thing the path of least resistance.",{},{"title":1910,"description":2064},"jenkins-boring-security-by-design","posts\u002F2021\u002F01\u002Fjenkins-boring-security-by-design",[1905,1906,2070,365],"security","2DB0dXVXmHsa1V9kIcvNyq4p-N0cM0tOGZLklH4hAjk",{"id":2073,"title":2074,"author":7,"body":2075,"createdAt":2393,"description":2394,"extension":355,"meta":2395,"navigation":357,"path":1637,"seo":2396,"slug":2397,"stem":2398,"tags":2399,"__hash__":2400},"posts\u002Fposts\u002F2021\u002F01\u002Fjenkins-five-years-of-cicd-debt.md","Freeletics CI\u002FCD: five years of debt (and why we kept Jenkins)",{"type":9,"value":2076,"toc":2385},[2077,2081,2084,2090,2093,2097,2100,2120,2123,2130,2134,2138,2141,2144,2148,2153,2156,2159,2163,2166,2173,2177,2180,2187,2193,2339,2344,2347,2354,2360,2363,2377,2379],[12,2078,2080],{"id":2079},"the-cicd-system-nobody-wanted-to-touch","The CI\u002FCD system nobody wanted to touch",[17,2082,2083],{},"On any given Monday morning at Freeletics, Dependabot would have merged a\ndozen dependency update PRs overnight. Reasonable enough: automated dependency\nmanagement is one of those low-effort security hygiene practices that just\nmakes sense. 
Except that at Freeletics, those Monday morning merges triggered\na cascade of Docker image builds that Jenkins couldn't keep up with, flooding\nthe build queue and eventually causing master-to-slave HTTP communication to\nbreak down. Jobs hung. Developers refreshed their PR status pages waiting for\nCI feedback that never came. By the time someone from the ops team intervened,\nhalf the morning was gone.",[17,2085,2086,2087,2089],{},"Lean has a name for this: ",[98,2088,1828],{"href":1827}," (unevenness\nin operation caused by unpredictable, variable workloads). The CI system was\nsized for normal throughput, not for the burst that Dependabot produced every\nMonday. The result was exactly what Mura predicts: inconsistent outcomes,\ndeveloper exhaustion, and downstream waiting.",[17,2091,2092],{},"This wasn't a Jenkins problem per se. It was a design problem. And it had been\naccumulating for years.",[91,2094,2096],{"id":2095},"what-we-inherited","What we inherited",[17,2098,2099],{},"When I joined Freeletics in Aug\u002F2019, the CI\u002FCD landscape looked like this:",[72,2101,2102,2108,2114],{},[75,2103,2104,2107],{},[189,2105,2106],{},"Jenkins"," ran on the internal Kubernetes cluster and was responsible for\nbuilding all Docker images for back-end and web applications, then deploying\nthem to Production, QA, and Integration;",[75,2109,2110,2113],{},[189,2111,2112],{},"CircleCI"," handled mobile builds (iOS on macOS agents, Android on Linux)\nand some code quality checks;",[75,2115,2116,2119],{},[189,2117,2118],{},"Travis"," ran tests and reported code quality results on back-end and web\npull requests.",[17,2121,2122],{},"Three tools, three secrets stores, three mental models. A new engineer joining\nthe team had to context-switch between all three to get a full picture of a\nsingle deployment. Knowledge didn't accumulate in one place; it spread thin\nacross all three.",[17,2124,2125,2126,2129],{},"But the Jenkins situation was particularly bad. 
Jenkins itself was fine. The\n",[520,2127,2128],{},"way it had been deployed"," was the problem, and it had three layers stacked on\ntop of each other.",[91,2131,2133],{"id":2132},"three-layers-of-debt","Three layers of debt",[524,2135,2137],{"id":2136},"layer-1-jenkins-job-builder-jjb","Layer 1: Jenkins Job Builder (JJB)",[17,2139,2140],{},"Jenkins Job Builder is a tool that generates Jenkins job definitions from YAML\nfiles. Freeletics ran a customized fork of JJB that hadn't been updated in\nover two years. The consequence: Jenkins Pipelines (introduced in Jenkins 2.0)\nwere not supported. Any attempt to include a Pipeline definition in a JJB YAML\nfile would cause the deployment to fail.",[17,2142,2143],{},"This is the kind of blocker that silently freezes a platform. Every Jenkins\nfeature shipped in the past two years was unreachable. Developers kept writing\nworkarounds around the limitations of a job format that the upstream project\nhad effectively deprecated. The gap between what Jenkins could do and what the\nFreeletics setup could do kept widening, and nobody had time to fix it.",[524,2145,2147],{"id":2146},"layer-2-the-helm-chart-that-held-everything-hostage","Layer 2: The Helm Chart that held everything hostage",[17,2149,2150,2151,501],{},"Jenkins was deployed through a custom Helm Chart. So far so good. The problem\nwas how that Helm Chart was wired: it rendered JJB configuration YAML inside\nGo templates and executed JJB as part of the chart install process. Secrets\nwere encrypted directly into the Helm values file using ",[194,2152,1984],{},[17,2154,2155],{},"The consequence of bundling secrets into the chart: every Jenkins configuration\nchange (adding a job, tweaking an executor count, updating a plugin) required a\nfull Helm release that also included the secrets bundle. Rolling back a\nconfiguration change meant rolling back the secrets version too. 
Two completely\nseparate concerns sharing one release cycle.",[17,2157,2158],{},"More practically: making any change to Jenkins required understanding the full\nchart, including the templated JJB YAML, the secrets structure, and the\ninteractions between them. The blast radius for a failed release was total:\nJenkins would go down, and recovery was a manual exercise that blocked every\nteam's deployments until it completed.",[524,2160,2162],{"id":2161},"layer-3-the-fragmentation-tax","Layer 3: The fragmentation tax",[17,2164,2165],{},"Running three CI\u002FCD systems isn't inherently wrong. Mobile builds genuinely\nrequire macOS agents that CircleCI provides out-of-the-box, and that's a fair\nreason to keep CircleCI for iOS. But the fragmentation meant that every\nplatform-level decision had to be made three times: credentials rotation,\npipeline templates, deployment interfaces, monitoring. The cognitive overhead\nwas constant.",[17,2167,2168,2169,2172],{},"Q: How much does this actually cost?\nA: Consider that every time a developer debugs a CI failure, they have to know\nwhich system it lives in, how that system's logs are structured, what\ncredentials it uses, and how to re-trigger it. Multiply that by the number of\nincidents per week, across a team of ~20 engineers. The productivity drain is\ninvisible in any single incident and very visible in aggregate. One of the\n",[98,2170,2171],{"href":1832},"core promises of DevOps"," is reducing exactly\nthis kind of bottleneck: \"more free time to do work that really matters.\"\nThree CI systems running in parallel is the structural opposite of that.",[12,2174,2176],{"id":2175},"the-evaluation-stay-or-leave","The evaluation: stay or leave?",[17,2178,2179],{},"In Feb\u002F2020 (before COVID-19 pushed the project back to Aug\u002F2020), the team\nran a structured evaluation. 
The question: is Jenkins worth investing in, or do\nwe migrate to something else?",[17,2181,2182,2183,2186],{},"Tools considered: CircleCI (cloud), GitLab CI with shared runners, GitLab CI\nwith self-hosted runners using ",[98,2184,1700],{"href":1698,"rel":2185},[102],",\nJenkins X, and the current Jenkins.",[17,2188,2189,2190,2192],{},"The main axis of comparison was Docker image build time. Builds are the most\nfrequent CI operation at Freeletics (triggered by every commit on every\nrepository) and the most directly felt by developers. A build that takes 13\nminutes instead of 5 is 8 minutes of a developer either waiting or switching\ncontext. Across 50 builds per day (a realistic number for a team that size),\nthat's ~6 hours of developer time evaporating every day. ",[98,2191,1404],{"href":1997},"\nin the most direct sense: time consumed without adding value to anyone.",[2194,2195,2196,2216],"table",{},[2197,2198,2199],"thead",{},[2200,2201,2202,2206,2209,2212,2214],"tr",{},[2203,2204,2205],"th",{},"Application",[2203,2207,2208],{},"GitLab CI (shared)",[2203,2210,2211],{},"GitLab CI 
(self-hosted)",[2203,2213,2112],{},[2203,2215,2106],{},[2217,2218,2219,2237,2254,2271,2288,2305,2322],"tbody",{},[2200,2220,2221,2225,2228,2231,2234],{},[2222,2223,2224],"td",{},"web-api-service",[2222,2226,2227],{},"Timeout",[2222,2229,2230],{},"~9:49",[2222,2232,2233],{},"~13:51",[2222,2235,2236],{},"~5:44",[2200,2238,2239,2242,2245,2248,2251],{},[2222,2240,2241],{},"blog-service",[2222,2243,2244],{},"~51:21",[2222,2246,2247],{},"~13:50",[2222,2249,2250],{},"~17:29",[2222,2252,2253],{},"~11:35",[2200,2255,2256,2259,2262,2265,2268],{},[2222,2257,2258],{},"frontend-spa",[2222,2260,2261],{},"~13:38",[2222,2263,2264],{},"~2:13",[2222,2266,2267],{},"~10:32",[2222,2269,2270],{},"~3:54",[2200,2272,2273,2276,2279,2282,2285],{},[2222,2274,2275],{},"rails-base-image",[2222,2277,2278],{},"~4:06",[2222,2280,2281],{},"~3:24",[2222,2283,2284],{},"~2:44",[2222,2286,2287],{},"~0:11",[2200,2289,2290,2293,2296,2299,2302],{},[2222,2291,2292],{},"backend-api",[2222,2294,2295],{},"~8:23",[2222,2297,2298],{},"~4:54",[2222,2300,2301],{},"~10:01",[2222,2303,2304],{},"~2:00",[2200,2306,2307,2310,2313,2316,2319],{},[2222,2308,2309],{},"user-service",[2222,2311,2312],{},"~7:23",[2222,2314,2315],{},"~2:17",[2222,2317,2318],{},"~6:58",[2222,2320,2321],{},"~3:34",[2200,2323,2324,2327,2330,2333,2336],{},[2222,2325,2326],{},"marketing-service",[2222,2328,2329],{},"~6:02",[2222,2331,2332],{},"~4:44",[2222,2334,2335],{},"~7:19",[2222,2337,2338],{},"~5:18",[164,2340,2341],{"type":166},[17,2342,2343],{},"These are averages of Docker image build execution time, not\nincluding external influences like repository synchronization. Jenkins was\nrunning Docker-in-Docker (DinD) rather than Kaniko at the time, so its\nnumbers would change post-migration. The directional conclusion held regardless.",[17,2345,2346],{},"Jenkins won on almost every application. 
The underlying reason is control:\nJenkins runners live inside Freeletics' own Kubernetes cluster, meaning\nhardware sizing, network latency to AWS ECR, and layer cache availability are\nall tunable. GitLab shared runners and CircleCI are someone else's\ninfrastructure, and the benchmark numbers reflected that.",[17,2348,2349,2350,2353],{},"GitLab CI with self-hosted runners and Kaniko was a serious competitor on a\nfew applications (",[194,2351,2352],{},"frontend-spa"," at ~2:13 is impressive). But migrating\nto a new CI platform carries its own costs: new credentials model, new pipeline\nsyntax, retraining, migration of all existing job definitions. Those costs are\nupfront and real; the productivity benefits are speculative and compound slowly.",[17,2355,2356,2357,501],{},"The verdict: ",[189,2358,2359],{},"invest in Jenkins, redesign from scratch",[17,2361,2362],{},"The design goals were straightforward:",[72,2364,2365,2368,2371,2374],{},[75,2366,2367],{},"Everything reproducible from code (delete the Jenkins release, recreate it,\nget the exact same Jenkins back);",[75,2369,2370],{},"No more JJB (replace with JCasC and Job DSL, both managed through Terraform);",[75,2372,2373],{},"Secrets decoupled from the Helm Chart release cycle;",[75,2375,2376],{},"A single Groovy Shared Library covering all server-side pipelines.",[2053,2378],{},[17,2380,2381,2382],{},"Ready to see how the rebuild actually went?\n",[98,2383,2384],{"href":1642},"Part 2 covers the first thing we had to fix: authorization that any employee could bypass, and secrets that couldn't change independently of the rest of the system.",{"title":346,"searchDepth":347,"depth":347,"links":2386},[2387,2388],{"id":2095,"depth":347,"text":2096},{"id":2132,"depth":347,"text":2133,"children":2389},[2390,2391,2392],{"id":2136,"depth":795,"text":2137},{"id":2146,"depth":795,"text":2147},{"id":2161,"depth":795,"text":2162},"2021-01-10T00:00:00+01:00","An inherited Jenkins setup nobody dared to touch, three 
CI\u002FCD systems running in parallel, and a Monday morning ritual of watching builds hang. The case for rebuilding instead of replacing, and the benchmark data behind it.",{},{"title":2074,"description":2394},"jenkins-five-years-of-cicd-debt","posts\u002F2021\u002F01\u002Fjenkins-five-years-of-cicd-debt",[1905,1906,365,366],"Vslr6JSU36R4jx8_kSqTBVfenzcKSNROj4g81R31vvk",{"id":2402,"title":2403,"author":7,"body":2404,"createdAt":2926,"description":2927,"extension":355,"meta":2928,"navigation":357,"path":2931,"seo":2932,"slug":2933,"stem":2934,"tags":2935,"__hash__":2938},"posts\u002Fposts\u002F2020\u002F05\u002Fquickstart-apache-spark-on-kubernetes.md","Quickstart: Apache Spark on Kubernetes",{"type":9,"value":2405,"toc":2918},[2406,2410,2414,2455,2473,2482,2512,2521,2541,2549,2553,2574,2588,2615,2629,2638,2642,2646,2649,2711,2718,2724,2728,2745,2751,2755,2758,2764,2776,2782,2792,2800,2815,2819,2822,2828,2831,2858,2861,2867,2874,2880,2883,2889,2892,2898,2901,2907,2911],[12,2407,2409],{"id":2408},"introduction","Introduction",[91,2411,2413],{"id":2412},"the-apache-spark-operator-for-kubernetes","The Apache Spark Operator for Kubernetes",[17,2415,2416,2417,2420,2421,2424,2425,2428,2429,2434,2439,2444,2445,1509,2450,1066],{},"Since its launch in 2014 by Google, Kubernetes has gained a lot of\npopularity along with Docker itself and since 2016 has become the ",[520,2418,2419],{},"de\nfacto Container Orchestrator",", established as a market standard.\nHaving cloud-managed versions available in ",[189,2422,2423],{},"all"," the ",[520,2426,2427],{},"major Clouds",".\n",[98,2430,2433],{"href":2431,"rel":2432},"https:\u002F\u002Fcloud.google.com\u002Fkubernetes-engine\u002F",[102],"[1]",[98,2435,2438],{"href":2436,"rel":2437},"https:\u002F\u002Faws.amazon.com\u002Feks\u002F",[102],"[2]",[98,2440,2443],{"href":2441,"rel":2442},"https:\u002F\u002Fdocs.microsoft.com\u002Fen-us\u002Fazure\u002Faks\u002F",[102],"[3]"," 
(including\n",[98,2446,2449],{"href":2447,"rel":2448},"https:\u002F\u002Fwww.digitalocean.com\u002Fproducts\u002Fkubernetes\u002F",[102],"Digital Ocean",[98,2451,2454],{"href":2452,"rel":2453},"https:\u002F\u002Fwww.alibabacloud.com\u002Fproduct\u002Fkubernetes",[102],"Alibaba",[17,2456,2457,2458,2461,2462,2467,2468,501],{},"With this popularity came various implementations and ",[520,2459,2460],{},"use-cases"," of\nthe orchestrator, among them the execution of ",[98,2463,2466],{"href":2464,"rel":2465},"https:\u002F\u002Fkubernetes.io\u002Fdocs\u002Ftutorials\u002Fstateful-application\u002F",[102],"Stateful\napplications","\nincluding ",[98,2469,2472],{"href":2470,"rel":2471},"https:\u002F\u002Fvitess.io\u002Fzh\u002Fdocs\u002Fget-started\u002Fkubernetes\u002F",[102],"databases using containers",[17,2474,2475,2476,2481],{},"What would be the motivation to host an orchestrated database? That's\na great question. But let's focus on the ",[98,2477,2480],{"href":2478,"rel":2479},"https:\u002F\u002Fgithub.com\u002FGoogleCloudPlatform\u002Fspark-on-k8s-operator\u002Fblob\u002Fmaster\u002Fdocs\u002Fdesign.md",[102],"Spark Operator","\nrunning workloads on Kubernetes.",[17,2483,2484,2485,2490,2491,2494,2495,2500,2501,2506,2507,501],{},"A native Spark Operator ",[98,2486,2489],{"href":2487,"rel":2488},"https:\u002F\u002Fgithub.com\u002Fkubernetes\u002Fkubernetes\u002Fissues\u002F34377",[102],"idea came out","\nin 2016. Before that, you couldn't run Spark jobs natively, except through\nsome ",[520,2492,2493],{},"hacky alternatives",", like ",[98,2496,2499],{"href":2497,"rel":2498},"https:\u002F\u002Fkubernetes.io\u002Fblog\u002F2016\u002F03\u002Fusing-spark-and-zeppelin-to-process-big-data-on-kubernetes\u002F",[102],"running Apache Zeppelin","\ninside Kubernetes or creating your ",[98,2502,2505],{"href":2503,"rel":2504},"https:\u002F\u002Fgithub.com\u002Fkubernetes\u002Fexamples\u002Ftree\u002Fmaster\u002Fstaging\u002Fspark",[102],"Apache Spark cluster inside\nKubernetes (from the 
official Kubernetes organization on GitHub)","\nreferencing the ",[98,2508,2511],{"href":2509,"rel":2510},"http:\u002F\u002Fspark.apache.org\u002Fdocs\u002Flatest\u002Fspark-standalone.html",[102],"Spark workers in Stand-alone mode",[17,2513,2514,2515,2520],{},"However, the native execution would be far more interesting for taking\nadvantage of ",[98,2516,2519],{"href":2517,"rel":2518},"https:\u002F\u002Fkubernetes.io\u002Fdocs\u002Fconcepts\u002Fscheduling-eviction\u002Fkube-scheduler\u002F",[102],"Kubernetes Scheduler","\nwhich is responsible for allocating resources, providing\nelasticity and a simpler interface to manage Apache Spark workloads.",[17,2522,2523,2524,2529,2530,2535,2536,501],{},"Considering that, ",[98,2525,2528],{"href":2526,"rel":2527},"https:\u002F\u002Fissues.apache.org\u002Fjira\u002Fbrowse\u002FSPARK-18278",[102],"Apache Spark Operator development got attention",",\nwas merged, and was released in ",[98,2531,2534],{"href":2532,"rel":2533},"https:\u002F\u002Fspark.apache.org\u002Freleases\u002Fspark-release-2-3-0.html",[102],"Spark version 2.3.0",",\nlaunched in ",[98,2537,2540],{"href":2538,"rel":2539},"https:\u002F\u002Fspark.apache.org\u002Fnews\u002Findex.html",[102],"February, 2018",[17,2542,2543,2544],{},"If you're eager to read more about the Apache Spark proposal,\nyou can head to the ",[98,2545,2548],{"href":2546,"rel":2547},"https:\u002F\u002Fdocs.google.com\u002Fdocument\u002Fd\u002F1_bBzOZ8rKiOSjQg78DXOA3ZBIo_KkDJjqxVuq0yXdew\u002Fedit#heading=h.9bhogel14x0y",[102],"design document published in Google Docs.",[91,2550,2552],{"id":2551},"why-kubernetes","Why Kubernetes?",[17,2554,2555,2556,2561,2562,2566,2570,501],{},"As companies are currently seeking to ",[98,2557,2560],{"href":2558,"rel":2559},"https:\u002F\u002Fwww.cio.com\u002Farticle\u002F3211428\u002Fwhat-is-digital-transformation-a-necessary-disruption.html",[102],"reinvent themselves through the\nwidely spoken digital transformation","\nin order for them to be 
competitive and, above all, to survive in an\nincreasingly dynamic market, it is common to see approaches that\ninclude Big Data, Artificial Intelligence and Cloud Computing\n",[98,2563,2433],{"href":2564,"rel":2565},"https:\u002F\u002Fwww.zdnet.com\u002Farticle\u002Fhow-to-use-cloud-computing-and-big-data-to-support-digital-transformation\u002F",[102],[98,2567,2438],{"href":2568,"rel":2569},"https:\u002F\u002Fdigitalhealth.london\u002Fcloud-big-data-ai-lead-nhs-digital-transformation\u002F",[102],[98,2571,2443],{"href":2572,"rel":2573},"https:\u002F\u002Fwww.ibm.com\u002Fblogs\u002Fcloud-computing\u002F2018\u002F11\u002F05\u002Fguiding-framework-digital-transformation-garage\u002F",[102],[17,2575,2576,2577,2582,2583,501],{},"An interesting comparison between the benefits of using Cloud Computing in the\ncontext of Big Data instead of On-premises' servers can be read at ",[98,2578,2581],{"href":2579,"rel":2580},"https:\u002F\u002Fdatabricks.com\u002Fblog\u002F2017\u002F05\u002F31\u002Ftop-5-reasons-for-choosing-s3-over-hdfs.html",[102],"Databricks\nblog",",\nwhich is the company ",[98,2584,2587],{"href":2585,"rel":2586},"https:\u002F\u002Fwww.washingtonpost.com\u002Fnews\u002Fthe-switch\u002Fwp\u002F2016\u002F06\u002F09\u002Fthis-is-where-the-real-action-in-artificial-intelligence-takes-place\u002F",[102],"founded by the creators of Apache Spark",[17,2589,2590,2591,2596,2597,2602,2603,2608,2609,2614],{},"As we see a widespread adoption of Cloud Computing (even by companies\nthat would be able to afford the hardware and run on-premises), we\nnotice that most of these Cloud implementations don't have an ",[98,2592,2595],{"href":2593,"rel":2594},"https:\u002F\u002Fhadoop.apache.org\u002F",[102],"Apache\nHadoop"," since the Data Teams (BI\u002FData\nScience\u002FAnalytics) increasingly choose to use tools like ",[98,2598,2601],{"href":2599,"rel":2600},"https:\u002F\u002Fcloud.google.com\u002Fbigquery\u002F",[102],"Google\nBigQuery"," or 
",[98,2604,2607],{"href":2605,"rel":2606},"https:\u002F\u002Faws.amazon.com\u002Fredshift\u002F",[102],"AWS Redshift",".\nTherefore, it doesn't make sense to spin up a Hadoop cluster solely to\nuse ",[98,2610,2613],{"href":2611,"rel":2612},"https:\u002F\u002Fhortonworks.com\u002Fapache\u002Fyarn\u002F",[102],"YARN"," as the resource manager.",[17,2616,2617,2618,2602,2623,2628],{},"An alternative is the use of Hadoop cluster providers such as ",[98,2619,2622],{"href":2620,"rel":2621},"https:\u002F\u002Fcloud.google.com\u002Fdataproc",[102],"Google\nDataProc",[98,2624,2627],{"href":2625,"rel":2626},"https:\u002F\u002Faws.amazon.com\u002Femr\u002F",[102],"AWS EMR","\nfor the creation of ephemeral clusters, just to name a few options.",[17,2630,2631,2632,2637],{},"To better understand the design of Spark Operator, the doc from ",[98,2633,2636],{"href":2634,"rel":2635},"https:\u002F\u002Fgithub.com\u002FGoogleCloudPlatform\u002Fspark-on-k8s-operatoR\u002Fblob\u002Fmaster\u002Fdocs\u002Fdesign.md#the-crd-controller",[102],"GCP on GitHub","\nis a no-brainer.",[12,2639,2641],{"id":2640},"lets-get-hands-on","Let's get hands-on!",[91,2643,2645],{"id":2644},"warming-up-the-engine","Warming up the engine",[17,2647,2648],{},"Now that the word has been spread, let's get our hands on it to show\nthe engine running. 
For that, let's use:",[72,2650,2651,2664,2672,2684],{},[75,2652,2653,2658,2659,472],{},[98,2654,2657],{"href":2655,"rel":2656},"https:\u002F\u002Fwww.docker.com\u002F",[102],"Docker"," as the container engine for\nKubernetes ",[98,2660,2663],{"href":2661,"rel":2662},"https:\u002F\u002Fdocs.docker.com\u002Finstall\u002F",[102],"(installation guide)",[75,2665,2666,2667,2671],{},"Minikube ",[98,2668,2663],{"href":2669,"rel":2670},"https:\u002F\u002Fkubernetes.io\u002Fdocs\u002Ftasks\u002Ftools\u002Finstall-minikube\u002F",[102],"\nto facilitate the provisioning of Kubernetes (yes, it will be\na local execution);",[75,2673,2674,2675,2678,2679,501],{},"For interaction with the Kubernetes API it is necessary to have\n",[194,2676,2677],{},"kubectl"," installed, ",[98,2680,2683],{"href":2681,"rel":2682},"https:\u002F\u002Fkubernetes.io\u002Fdocs\u002Ftasks\u002Ftools\u002Finstall-kubectl\u002F",[102],"if you don't have it, follow instructions\nhere",[75,2685,2686,2687],{},"a compiled version of Apache Spark newer than 2.3.0.\n",[462,2688,2689,2702],{},[75,2690,2691,2692,2697,2698,2701],{},"you can either compile the ",[98,2693,2696],{"href":2694,"rel":2695},"https:\u002F\u002Fgithub.com\u002Fapache\u002Fspark",[102],"source code",",\nwhich will take ",[520,2699,2700],{},"some hours"," to finish, or",[75,2703,2704,2705,2710],{},"download a compiled version ",[98,2706,2709],{"href":2707,"rel":2708},"https:\u002F\u002Fspark.apache.org\u002Fdownloads.html",[102],"here","\n(recommended).",[17,2712,2713,2714,2717],{},"Once the necessary tools are installed, it's necessary to\ninclude the Apache Spark path in the ",[194,2715,2716],{},"PATH"," environment variable, to ease the\ninvocation of Apache Spark executables. 
Simply run:",[487,2719,2722],{"className":2720,"code":2721,"language":578,"meta":346},[576],"export PATH=${PATH}:\u002Fpath\u002Fto\u002Fapache-spark-X.Y.Z\u002Fbin\n",[194,2723,2721],{"__ignoreMap":346},[91,2725,2727],{"id":2726},"creating-the-minikube-cluster","Creating the Minikube \"cluster\"",[17,2729,2730,2731,2734,2735,2740,2741,2744],{},"At last, to have a Kubernetes \"cluster\" we will start a ",[194,2732,2733],{},"minikube","\nwith the intention of running an example from ",[98,2736,2739],{"href":2737,"rel":2738},"https:\u002F\u002Fgithub.com\u002Fapache\u002Fspark\u002Fblob\u002Fmaster\u002Fexamples\u002Fsrc\u002Fmain\u002Fscala\u002Forg\u002Fapache\u002Fspark\u002Fexamples\u002FSparkPi.scala",[102],"Spark\nrepository","\ncalled ",[194,2742,2743],{},"SparkPi"," just as a demonstration.",[487,2746,2749],{"className":2747,"code":2748,"language":578,"meta":346},[576],"minikube start --cpus=2 \\\n    --memory=4g\n",[194,2750,2748],{"__ignoreMap":346},[91,2752,2754],{"id":2753},"building-the-docker-image","Building the Docker image",[17,2756,2757],{},"Let's use the Minikube Docker daemon to not depend on an external registry (and\nonly generate Docker image layers on the VM, facilitating garbage disposal\nlater). Minikube has a wrapper that makes our life easier:",[487,2759,2762],{"className":2760,"code":2761,"language":578,"meta":346},[576],"eval $(minikube docker-env)\n",[194,2763,2761],{"__ignoreMap":346},[17,2765,2766,2767,2772,2773,2775],{},"After having the daemon environment variables configured, we need a\nDocker image to run the jobs. There is a ",[98,2768,2771],{"href":2769,"rel":2770},"https:\u002F\u002Fgithub.com\u002Fapache\u002Fspark\u002Fblob\u002Fmaster\u002Fbin\u002Fdocker-image-tool.sh",[102],"shell script in the Spark\nrepository","\nto help with this. 
Considering that our ",[194,2774,2716],{}," was properly\nconfigured, just run:",[487,2777,2780],{"className":2778,"code":2779,"language":578,"meta":346},[576],"docker-image-tool.sh -m -t latest build\n",[194,2781,2779],{"__ignoreMap":346},[17,2783,2784,2787,2788,2791],{},[520,2785,2786],{},"FYI:"," The ",[194,2789,2790],{},"-m"," parameter here indicates a minikube build.",[17,2793,2794,2795,501],{},"Let's take the highway to execute SparkPi, using the same command\nthat would be used for a Hadoop Spark cluster ",[98,2796,2799],{"href":2797,"rel":2798},"https:\u002F\u002Fspark.apache.org\u002Fdocs\u002Flatest\u002Fsubmitting-applications.html",[102],"spark-submit",[17,2801,2802,2803,2808,2809,2814],{},"However, Spark Operator supports defining jobs in the \"Kubernetes\ndialect\" using ",[98,2804,2807],{"href":2805,"rel":2806},"https:\u002F\u002Fkubernetes.io\u002Fdocs\u002Fconcepts\u002Fextend-kubernetes\u002Fapi-extension\u002Fcustom-resources\u002F",[102],"CRD",",\n",[98,2810,2813],{"href":2811,"rel":2812},"https:\u002F\u002Fgithub.com\u002FGoogleCloudPlatform\u002Fspark-on-k8s-operator\u002Ftree\u002Fmaster\u002Fexamples",[102],"here are some examples"," - for later.",[12,2816,2818],{"id":2817},"fire-in-the-hole","Fire in the hole!",[17,2820,2821],{},"Mind the Scala version and the Spark version in the .jar file name when you're\nparameterizing with your Apache Spark version:",[487,2823,2826],{"className":2824,"code":2825,"language":578,"meta":346},[576],"spark-submit --master k8s:\u002F\u002Fhttps:\u002F\u002F$(minikube ip):8443 \\\n    --deploy-mode cluster \\\n    --name spark-pi \\\n    --class org.apache.spark.examples.SparkPi \\\n    --conf spark.executor.instances=2 \\\n    --executor-memory 1024m \\\n    --conf spark.kubernetes.container.image=spark:latest \\\n    local:\u002F\u002F\u002Fopt\u002Fspark\u002Fexamples\u002Fjars\u002Fspark-examples_2.11-X.Y.Z.jar # here\n",[194,2827,2825],{"__ignoreMap":346},[17,2829,2830],{},"What's new 
is:",[72,2832,2833,2852],{},[75,2834,2835,2838,2839,2842,2843,2846,2847,472],{},[194,2836,2837],{},"--master",": Accepts a prefix ",[194,2840,2841],{},"k8s:\u002F\u002F"," in the URL, for the\nKubernetes master API endpoint, exposed by the command\n",[194,2844,2845],{},"https:\u002F\u002F$(minikube ip):8443",". BTW, in case you want to\nknow, it's a ",[98,2848,2851],{"href":2849,"rel":2850},"https:\u002F\u002Fwww.gnu.org\u002Fsoftware\u002Fbash\u002Fmanual\u002Fhtml_node\u002FCommand-Substitution.html",[102],"shell command substitution",[75,2853,2854,2857],{},[194,2855,2856],{},"--conf spark.kubernetes.container.image=",": Configures the Docker\nimage to run in Kubernetes.",[17,2859,2860],{},"Sample output:",[487,2862,2865],{"className":2863,"code":2864,"language":492},[490],"...\n\n19\u002F08\u002F22 11:59:09 INFO LoggingPodStatusWatcherImpl: State changed,\nnew state: pod name: spark-pi-1566485909677-driver namespace: default\nlabels: spark-app-selector -> spark-20477e803e7648a59e9bcd37394f7f60,\nspark-role -> driver pod uid: c789c4d2-27c4-45ce-ba10-539940cccb8d\ncreation time: 2019-08-22T14:58:30Z service account name: default\nvolumes: spark-local-dir-1, spark-conf-volume, default-token-tj7jn\nnode name: minikube start time: 2019-08-22T14:58:30Z container\nimages: spark:docker phase: Succeeded status:\n[ContainerStatus(containerID=docker:\u002F\u002Fe044d944d2ebee2855cd2b993c62025d\n6406258ef247648a5902bf6ac09801cc, image=spark:docker,\nimageID=docker:\u002F\u002Fsha256:86649110778a10aa5d6997d1e3d556b35454e9657978f3\na87de32c21787ff82f, lastState=ContainerState(running=null,\nterminated=null, waiting=null, additionalProperties={}),\nname=spark-kubernetes-driver, ready=false, restartCount=0,\nstate=ContainerState(running=null,\nterminated=ContainerStateTerminated(containerID=docker:\u002F\u002Fe044d944d2ebe\ne2855cd2b993c62025d6406258ef247648a5902bf6ac09801cc, exitCode=0,\nfinishedAt=2019-08-22T14:59:08Z, message=null, reason=Completed,\nsignal=null, 
startedAt=2019-08-22T14:58:32Z,\nadditionalProperties={}), waiting=null, additionalProperties={}),\nadditionalProperties={})]\n\n19\u002F08\u002F22 11:59:09 INFO LoggingPodStatusWatcherImpl: Container final\nstatuses: Container name: spark-kubernetes-driver Container image:\nspark:docker Container state: Terminated Exit code: 0\n",[194,2866,2864],{"__ignoreMap":346},[17,2868,2869,2870,2873],{},"To see the job result (and the whole execution) we can run\n",[194,2871,2872],{},"kubectl logs",", passing the name of the driver pod as a parameter:",[487,2875,2878],{"className":2876,"code":2877,"language":578,"meta":346},[576],"kubectl logs $(kubectl get pods -o name | grep 'spark-pi.*-driver')\n",[194,2879,2877],{"__ignoreMap":346},[17,2881,2882],{},"Which produces output similar to the following (some entries omitted):",[487,2884,2887],{"className":2885,"code":2886,"language":492},[490],"...\n19\u002F08\u002F22 14:59:08 INFO TaskSetManager: Finished task 1.0 in stage 0.0\n(TID 1) in 52 ms on 172.17.0.7 (executor 1) (2\u002F2)\n19\u002F08\u002F22 14:59:08 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose\ntasks have all completed, from pool19\u002F08\u002F22 14:59:08 INFO\nDAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in\n0.957 s\n19\u002F08\u002F22 14:59:08 INFO DAGScheduler: Job 0 finished: reduce at\nSparkPi.scala:38, took 1.040608 s Pi is roughly 3.138915694578473\n19\u002F08\u002F22 14:59:08 INFO SparkUI: Stopped Spark web UI at\nhttp:\u002F\u002Fspark-pi-1566485909677-driver-svc.default.svc:4040\n19\u002F08\u002F22 14:59:08 INFO KubernetesClusterSchedulerBackend: Shutting\ndown all executors\n19\u002F08\u002F22 14:59:08 INFO\nKubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking\neach executor to shut down\n19\u002F08\u002F22 14:59:08 WARN ExecutorPodsWatchSnapshotSource: Kubernetes\nclient has been closed (this is expected if the application is\nshutting down.)\n19\u002F08\u002F22 14:59:08 INFO 
MapOutputTrackerMasterEndpoint:\nMapOutputTrackerMasterEndpoint stopped!\n19\u002F08\u002F22 14:59:08 INFO MemoryStore: MemoryStore cleared\n19\u002F08\u002F22 14:59:08 INFO BlockManager: BlockManager stopped\n19\u002F08\u002F22 14:59:08 INFO BlockManagerMaster: BlockManagerMaster stopped\n19\u002F08\u002F22 14:59:08 INFO\nOutputCommitCoordinator$OutputCommitCoordinatorEndpoint:\nOutputCommitCoordinator stopped!\n19\u002F08\u002F22 14:59:08 INFO SparkContext: Successfully stopped SparkContext\n19\u002F08\u002F22 14:59:08 INFO ShutdownHookManager: Shutdown hook called\n19\u002F08\u002F22 14:59:08 INFO ShutdownHookManager: Deleting directory\n\u002Ftmp\u002Fspark-aeadc6ba-36aa-4b7e-8c74-53aa48c3c9b2\n19\u002F08\u002F22 14:59:08 INFO ShutdownHookManager: Deleting directory\n\u002Fvar\u002Fdata\u002Fspark-084e8326-c8ce-4042-a2ed-75c1eb80414a\u002Fspark-ef8117bf-90\nd0-4a0d-9cab-f36a7bb18910\n...\n",[194,2888,2886],{"__ignoreMap":346},[17,2890,2891],{},"The result appears in:",[487,2893,2896],{"className":2894,"code":2895,"language":492},[490],"19\u002F08\u002F22 14:59:08 INFO DAGScheduler: Job 0 finished: reduce at\nSparkPi.scala:38, took 1.040608 s Pi is roughly 3.138915694578473\n",[194,2897,2895],{"__ignoreMap":346},[17,2899,2900],{},"Finally, let's delete the VM that Minikube generates, to clean up the\nenvironment (unless you want to keep playing with it):",[487,2902,2905],{"className":2903,"code":2904,"language":578,"meta":346},[576],"minikube delete\n",[194,2906,2904],{"__ignoreMap":346},[91,2908,2910],{"id":2909},"last-words","Last words",[17,2912,2913,2914,2917],{},"I hope your curiosity got ",[520,2915,2916],{},"sparked"," and some ideas for further\ndevelopment have raised for your Big Data workloads. 
If you have any\nquestions or suggestions, don't hesitate to share them in the comment section.",{"title":346,"searchDepth":347,"depth":347,"links":2919},[2920,2921,2922,2923,2924,2925],{"id":2412,"depth":347,"text":2413},{"id":2551,"depth":347,"text":2552},{"id":2644,"depth":347,"text":2645},{"id":2726,"depth":347,"text":2727},{"id":2753,"depth":347,"text":2754},{"id":2909,"depth":347,"text":2910},"2020-05-21T23:00:57+02:00","Using Apache Spark Operator in Kubernetes to streamline your Big Data workflows with a cloud-native approach without relying on a Hadoop cluster.",{"subtitle":2929,"image":2930},"Running Apache Spark Operator on Kubernetes","\u002Fimages\u002Fcontent\u002F2019\u002Fbig-load-of-containers.png","\u002Fposts\u002F2020\u002F05\u002Fquickstart-apache-spark-on-kubernetes",{"title":2403,"description":2927},"quickstart-apache-spark-on-kubernetes","posts\u002F2020\u002F05\u002Fquickstart-apache-spark-on-kubernetes",[2936,364,2937],"data pipelines","tutorials","VXxVM6rUzN-01z99mq_SGPe_8MbD3d8sidyg7Trfa0E",{"id":2940,"title":2941,"author":7,"body":2942,"createdAt":3195,"description":3196,"extension":355,"meta":3197,"navigation":357,"path":3198,"seo":3199,"slug":3200,"stem":3201,"tags":3202,"__hash__":3207},"posts\u002Fposts\u002F2019\u002F01\u002Fdevops-benefits.md","DevOps: Benefits",{"type":9,"value":2943,"toc":3182},[2944,2946,2949,2953,2956,2960,2963,2967,2970,2974,2977,2988,2992,2996,2999,3003,3006,3010,3013,3017,3020,3024,3027,3041,3049,3053,3056,3063,3066,3070,3073,3093,3101,3104,3111,3116,3160,3162,3165],[12,2945,2409],{"id":2408},[17,2947,2948],{},"The main benefits that a company generally expects and finds in the adoption of\nDevOps culture:",[91,2950,2952],{"id":2951},"faster-and-cheaper-releases","Faster and Cheaper Releases",[17,2954,2955],{},"Since releases will be continuous and frequent, deliverables will turn into\nsmall changes with the benefit of increasing speed in the development cycle\n(delivering 
always).",[91,2957,2959],{"id":2958},"improved-operational-support-with-quick-fixed","Improved Operational support with quick fixes",[17,2961,2962],{},"If there is a failure during delivery, the impact is minimal because the amount\nof modifications is small, and the rollback is faster. Inspection and\ndebugging are simpler too.",[91,2964,2966],{"id":2965},"better-time-to-market-ttm","Better Time-to-market (TTM)",[17,2968,2969],{},"The software will be delivered much earlier when it's still an MVP. Customers\nwill be integrated as part of the development process, bringing insights and\nfeedback to the development team, thus allowing for a faster launch in the\nmarket.",[91,2971,2973],{"id":2972},"superior-quality-products","Superior quality products",[17,2975,2976],{},"As has been said before, early failures prevent defects from being delivered to\nproduction, because:",[72,2978,2979,2982,2985],{},[75,2980,2981],{},"Reduces the volume of defects in the product as a whole;",[75,2983,2984],{},"Increases frequency of new features and releases;",[75,2986,2987],{},"Appropriate development processes in teams, including automation.",[12,2989,2991],{"id":2990},"now-we-understood-why-lets-talk-about-how","Now that we understand WHY, let's talk about HOW",[91,2993,2995],{"id":2994},"continuous-releases-integration-delivery-deployment","Continuous releases (integration, delivery, deployment)",[17,2997,2998],{},"Usually follows a code versioning approach (through Git) using specific branches\nfor each environment (e.g.: feature branches with git flow).",[91,3000,3002],{"id":3001},"continuous-integration","Continuous integration",[17,3004,3005],{},"Automatic execution of unit tests, integration tests and code quality analysis\nagainst a git branch, to ensure that there was no disruption of the modified\npiece of code.",[91,3007,3009],{"id":3008},"continuous-delivery","Continuous delivery",[17,3011,3012],{},"Packaging the software that is tested and approved, to deliver it 
somewhere that\nit is possible to use in a deploy later. Examples are libs delivered in\nrepositories to be integrated into the code during the next update and code\ndeploy.",[91,3014,3016],{"id":3015},"continuous-deployment","Continuous deployment",[17,3018,3019],{},"Once you have completed all of the above steps, you can do automated deployments\nright in the environments, when the team is more confident about the tools they\nare testing, as well as the risk they're taking and also understanding that\nthere is a possibility of failure in a tests environment without worrying that\nit's going to be divergent from production.",[91,3021,3023],{"id":3022},"configuration-andor-infrastructure-as-code","Configuration (and\u002For Infrastructure) as code",[17,3025,3026],{},"To be able to test software with assertiveness, and to understand that it will\ntransit between environments without changing behavior, it is essential that the\nconfigurations are also expressed in code. This allows the settings to be also\nversioned, following the code. Also guaranteeing a uniformity among the\nenvironments, which enables:",[72,3028,3029,3032,3035,3038],{},[75,3030,3031],{},"Reduction in maintenance costs, having a single point to look at and\nunderstand the operation of the system;",[75,3033,3034],{},"Easy to recreate the infrastructure, if it is necessary to move everything to\nanother place, this can happen with a few manual interactions;",[75,3036,3037],{},"Allows for a code review of infrastructure and configurations, which\nconsequently brings a culture of collaboration in the development, sharing of\nknowledge and increases the democratization of the infra;",[75,3039,3040],{},"Documentation as code, helping new team members get a faster warm up.",[17,3042,3043,3044,3048],{},"These points were well-stressed by the Heroku team and gave rise to the famous\npaper: ",[98,3045,3047],{"href":1595,"rel":3046},[102],"The Twelve-Factor App",". 
It's an excellent reading\nfor the explanation of the benefits of configuration management.",[91,3050,3052],{"id":3051},"observability-monitoring-and-self-healing","Observability, Monitoring, and self-healing",[17,3054,3055],{},"At the end of the delivery process, the software must be monitored. Avoiding to\nwait for an external report of failures, ensuring that the actions are proactive\nrather than reactive.",[17,3057,3058,3059,3062],{},"With mature monitoring, it's possible to create trigger against alerts, creating\na self-healing system in which actions (scripts) are performed to ",[189,3060,3061],{},"fix known","\nfailures in the infrastructure so that everyone can sleep peacefully at night,\nwithout having to worry about the on-call schedule that makes you read some\ndocumentation at dawn. (If you have had experience with this, you know for sure\nhow bad it is).",[17,3064,3065],{},"Scaling up only those cases that are extreme exceptions (mistakes not\nknown\u002Fexpected) in the process for the employee to act, ensuring higher health\nin operation.",[91,3067,3069],{"id":3068},"processes-automation","Processes automation",[17,3071,3072],{},"All processes that cause Muda should be addressed with automation, allowing\npeople to work more quickly. 
Good examples of processes that are usually\nautomated are:",[72,3074,3075,3078,3081,3084,3087,3090],{},[75,3076,3077],{},"Deployment;",[75,3079,3080],{},"Self-healing (system resilience in response to anomalies);",[75,3082,3083],{},"Renewal of Certificates;",[75,3085,3086],{},"Execution of tests (unitary, integration, functional, etc.);",[75,3088,3089],{},"Monitoring (with auto-discovery);",[75,3091,3092],{},"User Governance;",[12,3094,3096],{"id":3095},"devops-toolchain",[98,3097,3100],{"href":3098,"rel":3099},"https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDevOps_toolchain",[102],"DevOps toolchain",[17,3102,3103],{},"A combination of tools to facilitate the maintenance and operation of the\nsystem, with the flow:",[17,3105,3106],{},[3107,3108],"img",{"alt":3109,"src":3110},"Development Cycle Using DevOps","\u002Fimg\u002Fcontent\u002Fdevops-lifecycle.png",[405,3112,3113],{},[17,3114,3115],{},"Note: Any similarity to the PDCA is pure certainty.",[72,3117,3118,3124,3130,3136,3142,3148,3154],{},[75,3119,3120,3123],{},[189,3121,3122],{},"Plan",": Project planning phase, in which feedbacks are collected for\nrequirements survey, and backlog creation;",[75,3125,3126,3129],{},[189,3127,3128],{},"Create",": Creation of a deliverable (to validate a hypothesis), such as an\nMVP;",[75,3131,3132,3135],{},[189,3133,3134],{},"Verify",": Pass the deliverable to the test phase;",[75,3137,3138,3141],{},[189,3139,3140],{},"Package:"," Package the build to be able to put it in some testing\nenvironment;",[75,3143,3144,3147],{},[189,3145,3146],{},"Release",": Deploy packaged deliverable;",[75,3149,3150,3153],{},[189,3151,3152],{},"Configure",": Perform the configuration of the deliverable in the testing\nenvironment, trying to get as close as possible to the twelve-factor app.",[75,3155,3156,3159],{},[189,3157,3158],{},"Monitor",": After deploying to the environment, track business metrics and\ninfrastructure to ensure everything is working as 
expected.",[12,3161,1223],{"id":1222},[17,3163,3164],{},"During the implementation of these techniques it is possible to observe\nimprovements in the development process, the most notable gains are:",[72,3166,3167,3170,3173,3176,3179],{},[75,3168,3169],{},"Increase in team engagement;",[75,3171,3172],{},"Knowledge sharing;",[75,3174,3175],{},"Reduction of bottlenecks;",[75,3177,3178],{},"More free time to do work that really matters (adds value to the user\nexperience or generates impact);",[75,3180,3181],{},"Greater confidence in delivering software.",{"title":346,"searchDepth":347,"depth":347,"links":3183},[3184,3185,3186,3187,3188,3189,3190,3191,3192,3193,3194],{"id":2951,"depth":347,"text":2952},{"id":2958,"depth":347,"text":2959},{"id":2965,"depth":347,"text":2966},{"id":2972,"depth":347,"text":2973},{"id":2994,"depth":347,"text":2995},{"id":3001,"depth":347,"text":3002},{"id":3008,"depth":347,"text":3009},{"id":3015,"depth":347,"text":3016},{"id":3022,"depth":347,"text":3023},{"id":3051,"depth":347,"text":3052},{"id":3068,"depth":347,"text":3069},"2019-01-11T22:00:00","Benefits of implementing DevOps culture in business, why this is a feasible option and the DevOps world big picture in-a-nutshell from a business point of view.",{},"\u002Fposts\u002F2019\u002F01\u002Fdevops-benefits",{"title":2941,"description":3196},"devops-benefits","posts\u002F2019\u002F01\u002Fdevops-benefits",[3203,3204,3205,3206],"devops","culture","agile","lean","aezJy2B7ysB2yGf6LD84oSRLU3IRqSAhMgGJVt50D5M",{"id":3209,"title":3210,"author":7,"body":3211,"createdAt":3559,"description":3560,"extension":355,"meta":3561,"navigation":357,"path":3562,"seo":3563,"slug":3564,"stem":3565,"tags":3566,"__hash__":3567},"posts\u002Fposts\u002F2019\u002F01\u002Fdevops-genesis.md","DevOps: The 
Genesis",{"type":9,"value":3212,"toc":3546},[3213,3215,3218,3221,3224,3228,3231,3234,3238,3241,3244,3247,3251,3254,3265,3268,3285,3292,3306,3310,3314,3317,3328,3331,3334,3338,3341,3345,3348,3352,3358,3361,3379,3382,3385,3390,3393,3396,3399,3412,3415,3420,3423,3426,3431,3434,3438,3441,3456,3459,3464,3467,3470,3473,3476,3481,3484,3487,3495,3505,3508,3511,3514,3519,3522,3525,3528,3532,3539],[12,3214,2409],{"id":2408},[17,3216,3217],{},"First of all, it's all about agile.",[17,3219,3220],{},"The DevOps methodology was created on top of agile methods, to deliver a higher\nvalue inside software releases, automating feature release through pipelines,\nthat can test hypothesis faster allowing higher adaptability using \"fail-fast\"\napproaches. Those changes are more cultural than technical, so it's normal to\nsee DevOps being called culture.",[17,3222,3223],{},"The implementation of DevOps happens through processes automation, having a\nstrong sense of processes re-engineering inside the company. Comparing to the\ncultural change, the technical is easy to implement. Therefore the role that a\n\"DevOps Engineer\u002FAnalyst\" performs is very confusing, which enables many\nSysAdmins and Infra Analysts assuming the role of \"DevOps.\"",[91,3225,3227],{"id":3226},"lean-is-the-basis-of-agile","Lean is the basis of Agile",[17,3229,3230],{},"Reality is not as happy as it sounds. After World War II, Japan was destroyed\nand under-resourced after losing the war. With a limited amount of resources,\nthe country needed to reinvent itself and survive after a time of severe\ndepression. 
During that time two guys gained attention inside a company that\nlater gave its name after the methodology.",[17,3232,3233],{},"Those guys were Eiji Toyoda and Taiichi Ohno, inside Toyota Motor Corporation.\nThey're the founders of the \"Toyota production model\" also known as Toyotism.",[91,3235,3237],{"id":3236},"toyota-gave-birth-to-lean","Toyota gave birth to Lean",[17,3239,3240],{},"Lean teaches how to optimize the end-to-end process, focusing on processes that\ncreate value for customers. Bottlenecks in the process must be removed, and\nwasteful activities need to be identified and avoided. Both explained and\ndefined by LEAN 3M: Muda, Mura, and Muri.",[17,3242,3243],{},"Also teaches to improve yourself day after day and always focus on quality\nthrough Kaizen (continuous improvement).",[17,3245,3246],{},"Japanese culture truly believes that quality is the main objective to deliver\nvalue to customers since quality is what brings your clients back.",[91,3248,3250],{"id":3249},"kaizen","Kaizen",[17,3252,3253],{},"A mindset that helps to look at each part of the process exclusively and think\nabout the improvements. 
Involving the people who are part of the process,\nencourage the inclusion of these people in the decisions of change, since:",[72,3255,3256,3259,3262],{},[75,3257,3258],{},"It is much easier to accept a change when it is not imposed (top-down);",[75,3260,3261],{},"There is a greater absorption of change by people when they're included in the\nplanning;",[75,3263,3264],{},"The people who are involved in the process bring their concerns and\nsuggestions, which contribute positively to the evolution of the change,\nmaking the idea more robust.",[17,3266,3267],{},"The process of defining improvements through Kaizen happens (usually) in the\nfollowing order:",[462,3269,3270,3273,3276,3279,3282],{},[75,3271,3272],{},"Define data-driven objectives;",[75,3274,3275],{},"Review the current state and develop an improvement plan;",[75,3277,3278],{},"Implement improvement;",[75,3280,3281],{},"Review the implementation and improve what does not work;",[75,3283,3284],{},"Report the results and determine the items to be monitored.",[17,3286,3287,3288,3291],{},"This process is also called ",[189,3289,3290],{},"PDCA: Plan-Do-Check-Act",", which is summarized\nin:",[72,3293,3294,3297,3300,3303],{},[75,3295,3296],{},"Plan (develop the hypothesis);",[75,3298,3299],{},"Do (experiment);",[75,3301,3302],{},"Check (validate results);",[75,3304,3305],{},"Act (refine the experiment and start over).",[12,3307,3309],{"id":3308},"_3m-muda-mura-muri","3M: Muda, Mura, Muri",[91,3311,3313],{"id":3312},"muda-waste","Muda (waste)",[17,3315,3316],{},"Any activity that consumes time without adding value to the final consumer. 
e.g.:",[72,3318,3319,3322,3325],{},[75,3320,3321],{},"over-production;",[75,3323,3324],{},"idle time in the process;",[75,3326,3327],{},"products with a defect.",[17,3329,3330],{},"It's important to remember that there are different levels of Muda that can be\nremoved quickly or not, and the classification depends on the time for removal.",[17,3332,3333],{},"An example of a more time-consuming Muda is the discontinuation of legacy\nsoftware that ends up with longer release cycles, causing teams to be idle,\nfollowed by an often long or manual test routine.",[91,3335,3337],{"id":3336},"mura-unevenness","Mura (unevenness)",[17,3339,3340],{},"Unevenness in operation, caused by activities that are very changeable and\nunpredictable, generating different results in all executions. e.g., the\nexecution of tasks that were not well planned and ended up arriving with strict\ndeadlines. The team runs in the rush, generating exhaustion, despair, and\nmoreover, when finished leaves the people who have performed these tasks waiting\n(for feedback, or confirmation that it is completed).",[91,3342,3344],{"id":3343},"muri-overload","Muri (overload)",[17,3346,3347],{},"Overburdening equipment or operators by requiring them to run at a higher or\nharder pace beyond the limit, to achieve some goal or expectation, causing\nfatigue and consequently failures during the process. These failures are usually\nhuman errors caused by fatigue during overwork.",[91,3349,3351],{"id":3350},"back-to-agile","Back to Agile",[17,3353,3354,3355,501],{},"In 2000 a group of 17 people met at a resort in Oregon to talk about ideas that\ncould improve the flow of software development. 
After a year of mature ideas,\nthese people met again and published the ideas, which we now know as ",[189,3356,3357],{},"Agile\nManifesto",[17,3359,3360],{},"Main points are:",[17,3362,3363,3366,3367,3370,3371,3374,3375,3378],{},[189,3364,3365],{},"Individuals and interactions"," over processes and tools; ",[189,3368,3369],{},"Working software","\nover comprehensive documentation; ",[189,3372,3373],{},"Customer collaboration"," over contract\nnegotiation; ",[189,3376,3377],{},"Responding to change"," over following a plan",[17,3380,3381],{},"I will restrict the explanation of these points to the DevOps point of view,\nkeeping on track (now).",[91,3383,3365],{"id":3384},"individuals-and-interactions",[17,3386,3387],{},[520,3388,3389],{},"over processes and tools",[17,3391,3392],{},"First come the individuals: they should receive the necessary tooling to work\nwith, and then be empowered to do their jobs. Interactions between people are\ngreatly encouraged, for sharing knowledge and also for facilitating creative\nflow within development teams.",[17,3394,3395],{},"An excellent example of interaction encouraged through DevOps is the code review\nhabit. Considering that small parts of the software will be iterated and\napproved in the pipeline passing through different environments, automatically,\nthe best way to prevent defects is through code review.",[17,3397,3398],{},"This habit brings benefits such as:",[72,3400,3401,3403,3406,3409],{},[75,3402,3172],{},[75,3404,3405],{},"Observation of the problem from a different point of view;",[75,3407,3408],{},"Team engagement;",[75,3410,3411],{},"Fewer bugs.",[91,3413,3369],{"id":3414},"working-software",[17,3416,3417],{},[520,3418,3419],{},"over comprehensive documentation",[17,3421,3422],{},"Here's a trick in \"working software,\" software that works is not code that\ncompiles. 
The software that works is what meets the requirements of the user;\ni.e., the software that solves the problem and the pains of the user.",[17,3424,3425],{},"As the market is very dynamic, and evolves with high speed, often during the\nsoftware development project the requirements change due to external factors.\nTherefore, knowing that it is not possible to predict all the elements, many\n\"workarounds\" are made during development and documented. Passing the\nresponsibility to the user to handle the faults, and perform the workarounds,\nexpending more effort than would be required to perform the tasks using the\nsoftware.",[405,3427,3428],{},[17,3429,3430],{},"Deliver a working software frequently, ranging from a few weeks to a few months, considering shorter time-scale. - Agile Manifesto",[17,3432,3433],{},"Encouraging as many deployments as possible, so that failures happen as early as\npossible, thus allowing their impact to be much less.",[12,3435,3437],{"id":3436},"fail-fast","Fail-fast!",[17,3439,3440],{},"Failures are understood and encouraged because it's part of the mindset. Because:",[72,3442,3443,3450,3453],{},[75,3444,3445,3446,3449],{},"Only those who ",[189,3447,3448],{},"do"," make mistakes;",[75,3451,3452],{},"Failures are the best opportunity for learning and evolving;",[75,3454,3455],{},"Shit happens.",[17,3457,3458],{},"Nothing like quoting Murphy's law to contextualize",[405,3460,3461],{},[17,3462,3463],{},"\"Anything that can possibly go wrong, does.\"",[17,3465,3466],{},"Therefore, it's best for failures to occur early, while the cost of correction\nis still low. Failing a controlled testing environment allows the fix to be much\nfaster (and cheaper) than it would if the fix were already in production.",[17,3468,3469],{},"For this approach to succeed, there is a premise that environments are\nproduction copies, or at least as close as possible. 
Otherwise, there will be\nbehavioral changes in the software between the environments, making the test\nenvironment unfeasible.",[17,3471,3472],{},"If the environments are divergent, the promotion of bugs for production will be\nvery frequent, causing late failures, which are expensive failures.",[91,3474,3373],{"id":3475},"customer-collaboration",[17,3477,3478],{},[520,3479,3480],{},"over contract negotiation",[17,3482,3483],{},"Know your client! Including it in the process is the best approach to have\nworking software. After iterating over deliverables, it's essential to create a\npositive feedback loop with your client, bringing it as close as possible to the\ndevelopment of the tools that he\u002Fshe is going to use.",[17,3485,3486],{},"We can describe this situation with:",[72,3488,3489,3492],{},[75,3490,3491],{},"From point A it is possible to see only point B;",[75,3493,3494],{},"From point B it is possible to see point C;",[17,3496,3497,3498,501],{},"Therefore there is a great incentive for the software to be delivered in parts,\ncontinuously. Thus gathering user feedback on the next steps, following the\nconcepts of evolutionary prototyping, which were widely publicized through ",[98,3499,3502],{"href":3500,"rel":3501},"http:\u002F\u002Ftheleanstartup.com\u002Fbook",[102],[520,3503,3504],{},"The\nLean Startup",[17,3506,3507],{},"This point contrasts sharply with the previous one about continuous release, so\nthat it is possible to present the prototype and evolve it throughout the\nproject.",[17,3509,3510],{},"Learn who your customer\u002Fconsumer\u002Fuser is, and whom you are making the software\nfor, as this is the only way you can deliver value to that customer. 
An\nessential part of the software development process is to empathize with user\nproblems, to truly understand what problem is to be solved, and the\nresult of the impact on software development (value creation for the user).",[91,3512,3377],{"id":3513},"responding-to-change",[17,3515,3516],{},[520,3517,3518],{},"over following a plan",[17,3520,3521],{},"Redesigning the requirements over time is part of the job, and a necessary step\nto success. If you want to build something useful that is going to grow and be\nadopted, it's key to include your client in the implementation\nprocess.",[17,3523,3524],{},"It will be the only way to bring all the problems of the user to the table and\ncreate the best solution for all these problems because the user is the only\nperson that knows the real challenges they face in their routine dealing with\nsoftware.",[17,3526,3527],{},"With continuous delivery of software along with monitoring results, the process\nof collecting feedback is much simpler and faster.",[12,3529,3531],{"id":3530},"devops-devops-devops","DevOps, DevOps, DevOps",[17,3533,3534,3535,3538],{},"With the popularization of DevOps came a lot of disagreement, followed\nby significant confusion about the subject. It is very common to come across\ndifferent interpretations of ",[189,3536,3537],{},"what is DevOps",". There is a lot of euphemism in\nthe area, and gourmetization on LinkedIn, with many SysAdmins calling themselves\nDevOps since they learned to code shell script inside Python.",[17,3540,3541,3542,3545],{},"Do you want to keep reading? 
[Here are the benefits of adopting DevOps techniques.](",[1208,3543],{"value":3544},"\u003C relref \"post\u002F2019\u002F01\u002Fdevops-benefits\u002Findex.md\" >",")",{"title":346,"searchDepth":347,"depth":347,"links":3547},[3548,3549,3550,3551,3552,3553,3554,3555,3556,3557,3558],{"id":3226,"depth":347,"text":3227},{"id":3236,"depth":347,"text":3237},{"id":3249,"depth":347,"text":3250},{"id":3312,"depth":347,"text":3313},{"id":3336,"depth":347,"text":3337},{"id":3343,"depth":347,"text":3344},{"id":3350,"depth":347,"text":3351},{"id":3384,"depth":347,"text":3365},{"id":3414,"depth":347,"text":3369},{"id":3475,"depth":347,"text":3373},{"id":3513,"depth":347,"text":3377},"2019-01-11T21:00:00","From where DevOps came and to where we go. DevOps isn't simply automation, but a whole culture around agile business",{},"\u002Fposts\u002F2019\u002F01\u002Fdevops-genesis",{"title":3210,"description":3560},"devops-genesis","posts\u002F2019\u002F01\u002Fdevops-genesis",[3203,3204,3205,3206],"C-N9pTpV2fyHJAGyPw6uDVVxN01f3LLaykpjg7IQQr8",{"id":3569,"title":3570,"author":7,"body":3571,"createdAt":3772,"description":3773,"extension":355,"meta":3774,"navigation":357,"path":3775,"seo":3776,"slug":7,"stem":3777,"tags":3778,"__hash__":3781},"posts\u002Fposts\u002F2018\u002F10\u002Finsights-from-a-perfectionist-about-over-engineering.md","Insights from a perfectionist about Over-Engineering",{"type":9,"value":3572,"toc":3765},[3573,3577,3582,3584,3587,3590,3593,3604,3612,3625,3635,3639,3642,3648,3680,3691,3694,3703,3706,3709,3712,3716,3722,3725,3738,3740,3743,3748,3751,3759,3762],[91,3574,3576],{"id":3575},"foreword","Foreword",[164,3578,3579],{"type":166},[17,3580,3581],{},"This article is an open letter for me to keep reminding myself about what\nto prioritize when developing software, I am as much of a sinner in this aspect\nas the next person.",[91,3583,376],{"id":375},[17,3585,3586],{},"We as software engineers are always trying to do our best when it comes to being\ninnovative, 
improving our systems to work better and faster, perhaps with a\nbetter design, or a more comprehensive codebase. We all have some preference\nwhen it comes to doing our best, which we try to achieve at all times.",[17,3588,3589],{},"The main drive of this motivation is our necessity as \"digital craftspeople\" to\nexpress ourselves through quality work, along with the personal realization we\nfeel by doing a great job, with great quality, that challenges us and takes\nus further. It's motivating, isn't it? Taking risks and getting out of\nthe comfort zone is incredibly fun; our brain's reward system goes crazy with\nunpredictability.",[17,3591,3592],{},"To help achieve that challenge, innovation, and quality in Software Engineering\nwe usually think that we need the best tools available, so we'll have fewer\nthings to worry about, and can concentrate our efforts in the process of\ncreating great products. On top of that, having the best tools could improve our\nquality of life (allowing us not to work under pressure, avoiding overwork, and\nalso helping us sleep better at night). Furthermore, \"the right set of tools\"\ncould even enhance our productivity through self-satisfaction with work;\neveryone has their preoccupations and is willing to create something to be proud\nof.",[17,3594,3595,3596,3599,3600,724],{},"Many times during the design and development of products we take unmeasured\nsolutions for a simple problem. After all, we want to have not just the right\nset of tools, but the ",[189,3597,3598],{},"best"," right? How can we be ground-breaking, innovative,\ndisruptive, and pick-your-buzzword-poison otherwise? Well, as Nathan Marz\n(creator of Apache Storm) puts it in his ",[98,3601,3603],{"href":400,"rel":3602},[102],"suffering-oriented programming",[405,3605,3606],{},[17,3607,3608,3611],{},[451,3609,3610],{},"…"," don't build technology unless you feel the pain of not having it. 
It applies\nto the big, architectural decisions as well as the smaller everyday programming\ndecisions. Suffering-oriented programming greatly reduces risk by ensuring that\nyou're always working on something important, and it ensures that you are\nwell-versed in a problem space before attempting a large investment.",[17,3613,3614,3615,3620,3621,3624],{},"This method describes a good way to think about LEAN and evolving products\nthrough ",[98,3616,3619],{"href":3617,"rel":3618},"https:\u002F\u002Fdzone.com\u002Farticles\u002Fwhat-is-minimum-viable-product-and-how-to-build-it",[102],"an MVP concept",", helping to keep track of what ",[189,3622,3623],{},"really"," matters when it\ncomes to a good balance between Product and Engineering efforts.",[17,3626,3627,3628,3631,3632,501],{},"As we're daily overfed with information, it's easy to make mistakes trying to\nchoose the right set of tools to work with. From picking Frameworks to Operating\nSystems, and even the cloud provider to host our systems and products. It's OK\nto make mistakes, we all have a great first impression about all choices we\ncould have done, if you read AWS or GCP documentation you'll be impressed with\ntheir magical solutions to your problems, where you can just throw everything in\n(including your credit card), and everything will be fine, right? The magic\ncloud will solve ",[189,3629,3630],{},"all of your"," problems. Yeah, ",[520,3633,3634],{},"maybe",[91,3636,3638],{"id":3637},"what-is-the-problem-i-am-trying-to-solvehere","What is the problem I am trying to solve here?",[17,3640,3641],{},"One good example of the current hype, when it comes to applications is Docker\ncontainers and Kubernetes. 
Kubernetes is the open-source descendant of Google's\nBorg, a Linux container orchestration tool developed to run\napplications in Google's data centers.",[17,3643,3644,3645,501],{},"Kubernetes is great, but the hype goes too far sometimes with companies running\neven Production transactional databases on it, as well as entire monoliths and\nStateful services. At this point, we have to look back and ask ourselves: \"What\nproblem am I trying to solve here?\". Because, if you take a second look, these\ndecisions are kind of a \"Hydra\" solution, \"for every head chopped off, the Hydra\nwould regrow two heads\", or even better: these solutions are creating more\nproblems, by trying to solve problems ",[189,3646,3647],{},"that may not even exist",[17,3649,3650,3651,3656,3657,3662,3663,3666,3667,3670,3671,3676,3677,724],{},"Yeah, Google orchestrated MySQL instance deployment using Borg. The first\n",[98,3652,3655],{"href":3653,"rel":3654},"https:\u002F\u002Fsre.google\u002Fsre-book\u002Fautomation-at-google\u002F",[102],"version (POC) was released in 2008 and finished by 2009","; at that time the revenue\nof the Ad service was ",[98,3658,3661],{"href":3659,"rel":3660},"https:\u002F\u002Fwww.statista.com\u002Fstatistics\u002F266249\u002Fadvertising-revenue-of-google\u002F",[102],"estimated at USD ~22.9 Bi",". Ask yourself, does your database\nserve a ",[189,3664,3665],{},"USD 22.9 BILLION service","? Do you ",[520,3668,3669],{},"really need"," orchestration there?\nChances are, and let's face it, ",[98,3672,3675],{"href":3673,"rel":3674},"https:\u002F\u002Fblog.bradfieldcs.com\u002Fyou-are-not-google-84912cf44afb",[102],"You Are Not Google",". 
This is an extreme example\nbut it serves to illustrate the main concept of ",[520,3678,3679],{},"suffering-oriented\nprogramming",[405,3681,3682],{},[17,3683,3684,3686,3687,3690],{},[451,3685,453],{}," don't build technology unless you ",[189,3688,3689],{},"feel the pain"," of not having it.",[17,3692,3693],{},"A nice quote from \"You Are Not Google\" to sink in:",[405,3695,3696],{},[17,3697,3698,3699,3702],{},"Don’t even start considering solutions until you ",[189,3700,3701],{},"understand"," the problem. Your\ngoal should be to “solve” the problem mostly within the problem domain, not the\nsolution domain.",[17,3704,3705],{},"Otherwise, in case we insist on the inappropriate (not necessarily wrong)\nsolution, we're going to spend some extra time dealing with the consequences\n(i.e. chopping additional Hydra heads). Worth noting that dealing with the\nconsequences is not something bad, as long as you have the resources (time and\nmoney) to invest into learning and rework, investing some resources into\ninappropriate software solutions could even be seen as a way of training with\nhigher outcomes (learnings) than conferences, courses, and books. There is a lot\nof lessons and knowledge to be extracted from these experiments.",[17,3707,3708],{},"Learning from our experiences is the only path to success, and failures teach\nbest. Failures were also the motivation for writing this article to keep\nreminding myself (:",[17,3710,3711],{},"As Software Engineers the problem space analysis oftentimes fail due to an\nunderrated aspect, mostly unnoticed: on the other side of the line is a user of\nthis software.",[91,3713,3715],{"id":3714},"and-guesswhat","And guess what?",[17,3717,3718,3719,501],{},"He doesn't care if you're running Elixir inside a container on Kubernetes, using\nContainer OS or Core OS, which you provisioned with your bare hands, and have\npolished bit by bit to be XYZ ms faster than the Vanilla version. 
As long as you\nrespond to their requests, and ",[189,3720,3721],{},"don't break things",[17,3723,3724],{},"Innovation has nothing to do with the fact that you want to use cutting-edge\ntechnology, and it's not about how fast you spend money on those solutions\neither. It's about delivering value to your customers, and enrich their\nexperience from the interactions with your product.",[17,3726,3727,3728,3730,3731,3734,3735,501],{},"If you're going through some orchestration problems, having 10+ micro-services\nwith some asynchronous task-based workers (e.g. Python's Celery). Then, ",[520,3729,3634],{},"\nit's time to use Kubernetes. But, as an engineer you should know that the best\npath is to put some solutions on the table, run some benchmarks and compare\nthem, so you'll have data to help in your decision, and choose what's the right\nsolution for your problem, ",[189,3732,3733],{},"at the right time",". We just have to keep asking\nourselves: ",[520,3736,3737],{},"\"What is the problem I'm trying to solve here?\"",[91,3739,1223],{"id":1222},[17,3741,3742],{},"There's a quote from a great investor called Benjamin Graham that says:",[405,3744,3745],{},[17,3746,3747],{},"If you are looking for investments*, choose them the way you would buy\ngroceries, not the way you would buy perfume.\n-- Graham, The Intelligent Investor (1973)",[17,3749,3750],{},"We should carefully look to where we're going with our choices. So we don't\noverspend and keep things going for more time, thus we go further.",[164,3752,3753],{"type":166},[17,3754,3755,3756,3758],{},"The original text is: \"If you are shopping for common stocks ",[451,3757,453],{},"\". But, as\na Software Engineer, I just switched the syntax so we could adapt it to more\nuse-cases (:",[17,3760,3761],{},"I learned from my own experience that over-engineered decisions end up bringing\nmore pain than solving problems, and it currently happens through early\nimprovements on the system, timing really matters. 
Many times we try to solve\nall problems at once (even those we don't have), and it brings more problems,\nlike high costs of maintenance and infrastructure, or under-utilization of the\nresources.",[17,3763,3764],{},"Sooner or later, the Over-Engineering bill will come as Hydra heads keep growing\nup accumulating technical debt. Be mindful when analyzing the problem space,\npick the right tool for the job that eases your real pain (not the imaginary\none).",{"title":346,"searchDepth":347,"depth":347,"links":3766},[3767,3768,3769,3770,3771],{"id":3575,"depth":347,"text":3576},{"id":375,"depth":347,"text":376},{"id":3637,"depth":347,"text":3638},{"id":3714,"depth":347,"text":3715},{"id":1222,"depth":347,"text":1223},"2018-10-28T00:00:00+02:00","Software engineers are always trying to do their best when it comes to being innovative and improving their systems. This article helps to put that willingness into perspective and drive it in the right direction.",{},"\u002Fposts\u002F2018\u002F10\u002Finsights-from-a-perfectionist-about-over-engineering",{"title":3570,"description":3773},"posts\u002F2018\u002F10\u002Finsights-from-a-perfectionist-about-over-engineering",[3779,3780],"product engineering","development","yTzIm_ItOjQ74WTmohaj0WtSJXvhgVpjsO_xMEaYALg",1778441743697]