[{"data":1,"prerenderedAt":368},["ShallowReactive",2],{"\u002Fen\u002Fpost\u002F2021\u002F01\u002Fjenkins-five-years-of-cicd-debt":3},{"id":4,"title":5,"author":6,"body":7,"createdAt":353,"description":354,"extension":355,"meta":356,"navigation":357,"path":358,"seo":359,"slug":360,"stem":361,"tags":362,"__hash__":367},"posts\u002Fposts\u002F2021\u002F01\u002Fjenkins-five-years-of-cicd-debt.md","Freeletics CI\u002FCD: five years of debt (and why we kept Jenkins)",null,{"type":8,"value":9,"toc":342},"minimark",[10,15,19,28,31,36,39,62,65,73,77,82,85,88,92,100,103,106,110,113,121,125,128,138,146,292,299,302,309,315,318,332,335],[11,12,14],"h1",{"id":13},"the-cicd-system-nobody-wanted-to-touch","The CI\u002FCD system nobody wanted to touch",[16,17,18],"p",{},"On any given Monday morning at Freeletics, Dependabot would have merged a\ndozen dependency update PRs overnight. Reasonable enough: automated dependency\nmanagement is one of those low-effort security hygiene practices that just\nmakes sense. Except that at Freeletics, those Monday morning merges triggered\na cascade of Docker image builds that Jenkins couldn't keep up with, flooding\nthe build queue and eventually causing master-to-slave HTTP communication to\nbreak down. Jobs hung. Developers refreshed their PR status pages waiting for\nCI feedback that never came. By the time someone from the ops team intervened,\nhalf the morning was gone.",[16,20,21,22,27],{},"Lean has a name for this: ",[23,24,26],"a",{"href":25},"\u002Fposts\u002F2019\u002F01\u002Fdevops-genesis#mura-unevenness","Mura"," (unevenness\nin operation caused by unpredictable, variable workloads). The CI system was\nsized for normal throughput, not for the burst that Dependabot produced every\nMonday. The result was exactly what Mura predicts: inconsistent outcomes,\ndeveloper exhaustion, and downstream waiting.",[16,29,30],{},"This wasn't a Jenkins problem per se. It was a design problem. 
And it had been\naccumulating for years.",[32,33,35],"h2",{"id":34},"what-we-inherited","What we inherited",[16,37,38],{},"When I joined Freeletics in Aug\u002F2019, the CI\u002FCD landscape looked like this:",[40,41,42,50,56],"ul",{},[43,44,45,49],"li",{},[46,47,48],"strong",{},"Jenkins"," ran on the internal Kubernetes cluster and was responsible for\nbuilding all Docker images for back-end and web applications, then deploying\nthem to Production, QA, and Integration;",[43,51,52,55],{},[46,53,54],{},"CircleCI"," handled mobile builds (iOS on macOS agents, Android on Linux)\nand some code quality checks;",[43,57,58,61],{},[46,59,60],{},"Travis"," ran tests and reported code quality results on back-end and web\npull requests.",[16,63,64],{},"Three tools, three secrets stores, three mental models. A new engineer joining\nthe team had to context-switch between all three to get a full picture of a\nsingle deployment. Knowledge didn't accumulate in one place; it spread thin\nacross all three.",[16,66,67,68,72],{},"But the Jenkins situation was particularly bad. Jenkins itself was fine. The\n",[69,70,71],"em",{},"way it had been deployed"," was the problem, and it had three layers stacked on\ntop of each other.",[32,74,76],{"id":75},"three-layers-of-debt","Three layers of debt",[78,79,81],"h3",{"id":80},"layer-1-jenkins-job-builder-jjb","Layer 1: Jenkins Job Builder (JJB)",[16,83,84],{},"Jenkins Job Builder is a tool that generates Jenkins job definitions from YAML\nfiles. Freeletics ran a customized fork of JJB that hadn't been updated in\nover two years. The consequence: Jenkins Pipelines (introduced in Jenkins 2.0)\nwere not supported. Any attempt to include a Pipeline definition in a JJB YAML\nfile would cause the deployment to fail.",[16,86,87],{},"This is the kind of blocker that silently freezes a platform. Every Jenkins\nfeature shipped in the past two years was unreachable. 
Developers kept writing\nworkarounds around the limitations of a job format that the upstream project\nhad effectively deprecated. The gap between what Jenkins could do and what the\nFreeletics setup could do kept widening, and nobody had time to fix it.",[78,89,91],{"id":90},"layer-2-the-helm-chart-that-held-everything-hostage","Layer 2: The Helm Chart that held everything hostage",[16,93,94,95,99],{},"Jenkins was deployed through a custom Helm Chart. So far so good. The problem\nwas how that Helm Chart was wired: it rendered JJB configuration YAML inside\nGo templates and executed JJB as part of the chart install process. Secrets\nwere encrypted directly into the Helm values file using ",[96,97,98],"code",{},"helm-secrets",".",[16,101,102],{},"The consequence of bundling secrets into the chart: every Jenkins configuration\nchange (adding a job, tweaking an executor count, updating a plugin) required a\nfull Helm release that also included the secrets bundle. Rolling back a\nconfiguration change meant rolling back the secrets version too. Two completely\nseparate concerns sharing one release cycle.",[16,104,105],{},"More practically: making any change to Jenkins required understanding the full\nchart, including the templated JJB YAML, the secrets structure, and the\ninteractions between them. The blast radius for a failed release was total:\nJenkins would go down, and recovery was a manual exercise that blocked every\nteam's deployments until it completed.",[78,107,109],{"id":108},"layer-3-the-fragmentation-tax","Layer 3: The fragmentation tax",[16,111,112],{},"Running three CI\u002FCD systems isn't inherently wrong. Mobile builds genuinely\nrequire macOS agents that CircleCI provides out-of-the-box, and that's a fair\nreason to keep CircleCI for iOS. But the fragmentation meant that every\nplatform-level decision had to be made three times: credentials rotation,\npipeline templates, deployment interfaces, monitoring. 
The cognitive overhead\nwas constant.",[16,114,115,116,120],{},"Q: How much does this actually cost?\nA: Consider that every time a developer debugs a CI failure, they have to know\nwhich system it lives in, how that system's logs are structured, what\ncredentials it uses, and how to re-trigger it. Multiply that by the number of\nincidents per week, across a team of ~20 engineers. The productivity drain is\ninvisible in any single incident and very visible in aggregate. One of the\n",[23,117,119],{"href":118},"\u002Fposts\u002F2019\u002F01\u002Fdevops-benefits#conclusion","core promises of DevOps"," is reducing exactly\nthis kind of bottleneck: \"more free time to do work that really matters.\"\nThree CI systems running in parallel is the structural opposite of that.",[11,122,124],{"id":123},"the-evaluation-stay-or-leave","The evaluation: stay or leave?",[16,126,127],{},"In Feb\u002F2020 (before COVID-19 pushed the project back to Aug\u002F2020), the team\nran a structured evaluation. The question: is Jenkins worth investing in, or do\nwe migrate to something else?",[16,129,130,131,137],{},"Tools considered: CircleCI (cloud), GitLab CI with shared runners, GitLab CI\nwith self-hosted runners using ",[23,132,136],{"href":133,"rel":134},"https:\u002F\u002Fgithub.com\u002FGoogleContainerTools\u002Fkaniko",[135],"nofollow","Kaniko",",\nJenkins X, and the current Jenkins.",[16,139,140,141,145],{},"The main axis of comparison was Docker image build time. Builds are the most\nfrequent CI operation at Freeletics (triggered by every commit on every\nrepository) and the most directly felt by developers. A build that takes 13\nminutes instead of 5 is 8 minutes of a developer either waiting or switching\ncontext. Across 50 builds per day (a realistic number for a team that size),\nthat's 400 minutes, nearly 7 hours, of developer time evaporating every day. 
",[23,142,144],{"href":143},"\u002Fposts\u002F2019\u002F01\u002Fdevops-genesis#muda-waste","Muda","\nin the most direct sense: time consumed without adding value to anyone.",[147,148,149,169],"table",{},[150,151,152],"thead",{},[153,154,155,159,162,165,167],"tr",{},[156,157,158],"th",{},"Application",[156,160,161],{},"GitLab CI (shared)",[156,163,164],{},"GitLab CI (self-hosted)",[156,166,54],{},[156,168,48],{},[170,171,172,190,207,224,241,258,275],"tbody",{},[153,173,174,178,181,184,187],{},[175,176,177],"td",{},"web-api-service",[175,179,180],{},"Timeout",[175,182,183],{},"~9:49",[175,185,186],{},"~13:51",[175,188,189],{},"~5:44",[153,191,192,195,198,201,204],{},[175,193,194],{},"blog-service",[175,196,197],{},"~51:21",[175,199,200],{},"~13:50",[175,202,203],{},"~17:29",[175,205,206],{},"~11:35",[153,208,209,212,215,218,221],{},[175,210,211],{},"frontend-spa",[175,213,214],{},"~13:38",[175,216,217],{},"~2:13",[175,219,220],{},"~10:32",[175,222,223],{},"~3:54",[153,225,226,229,232,235,238],{},[175,227,228],{},"rails-base-image",[175,230,231],{},"~4:06",[175,233,234],{},"~3:24",[175,236,237],{},"~2:44",[175,239,240],{},"~0:11",[153,242,243,246,249,252,255],{},[175,244,245],{},"backend-api",[175,247,248],{},"~8:23",[175,250,251],{},"~4:54",[175,253,254],{},"~10:01",[175,256,257],{},"~2:00",[153,259,260,263,266,269,272],{},[175,261,262],{},"user-service",[175,264,265],{},"~7:23",[175,267,268],{},"~2:17",[175,270,271],{},"~6:58",[175,273,274],{},"~3:34",[153,276,277,280,283,286,289],{},[175,278,279],{},"marketing-service",[175,281,282],{},"~6:02",[175,284,285],{},"~4:44",[175,287,288],{},"~7:19",[175,290,291],{},"~5:18",[293,294,296],"callout",{"type":295},"note",[16,297,298],{},"These are averages of Docker image build execution time, not\nincluding external influences like repository synchronization. Jenkins was\nrunning Docker-in-Docker (DinD) rather than Kaniko at the time, so its\nnumbers would change post-migration. 
The directional conclusion held regardless.",[16,300,301],{},"Jenkins was the fastest option on four of the seven applications and\nsecond-fastest on the other three. The underlying reason is control:\nJenkins runners live inside Freeletics' own Kubernetes cluster, meaning\nhardware sizing, network latency to AWS ECR, and layer cache availability are\nall tunable. GitLab shared runners and CircleCI are someone else's\ninfrastructure, and the benchmark numbers reflected that.",[16,303,304,305,308],{},"GitLab CI with self-hosted runners and Kaniko was a serious competitor on a\nfew applications (",[96,306,307],{},"frontend-spa"," at ~2:13 is impressive). But migrating\nto a new CI platform carries its own costs: new credentials model, new pipeline\nsyntax, retraining, migration of all existing job definitions. Those costs are\nupfront and real; the productivity benefits are speculative and compound slowly.",[16,310,311,312,99],{},"The verdict: ",[46,313,314],{},"invest in Jenkins, redesign from scratch",[16,316,317],{},"The design goals were straightforward:",[40,319,320,323,326,329],{},[43,321,322],{},"Everything reproducible from code (delete the Jenkins release, recreate it,\nget the exact same Jenkins back);",[43,324,325],{},"No more JJB (replace with JCasC and Job DSL, both managed through Terraform);",[43,327,328],{},"Secrets decoupled from the Helm Chart release cycle;",[43,330,331],{},"A single Groovy Shared Library covering all server-side pipelines.",[333,334],{},[16,336,337,338],{},"Ready to see how the rebuild actually went?\n",[23,339,341],{"href":340},"\u002Fposts\u002F2021\u002F01\u002Fjenkins-boring-security-by-design","Part 2 covers the first thing we had to fix: authorization that any employee could bypass, and secrets that couldn't change independently of the rest of the 
system.",{"title":343,"searchDepth":344,"depth":344,"links":345},"",2,[346,347],{"id":34,"depth":344,"text":35},{"id":75,"depth":344,"text":76,"children":348},[349,351,352],{"id":80,"depth":350,"text":81},3,{"id":90,"depth":350,"text":91},{"id":108,"depth":350,"text":109},"2021-01-10T00:00:00+01:00","An inherited Jenkins setup nobody dared to touch, three CI\u002FCD systems running in parallel, and a Monday morning ritual of watching builds hang. The case for rebuilding instead of replacing, and the benchmark data behind it.","md",{},true,"\u002Fposts\u002F2021\u002F01\u002Fjenkins-five-years-of-cicd-debt",{"title":5,"description":354},"jenkins-five-years-of-cicd-debt","posts\u002F2021\u002F01\u002Fjenkins-five-years-of-cicd-debt",[363,364,365,366],"ci-cd","jenkins","platform-engineering","infrastructure-as-code","Vslr6JSU36R4jx8_kSqTBVfenzcKSNROj4g81R31vvk",1778441744138]