Freeletics CI/CD: five years of debt (and why we kept Jenkins)
The CI/CD system nobody wanted to touch
On any given Monday morning at Freeletics, Dependabot would have merged a dozen dependency update PRs overnight. Reasonable enough: automated dependency management is one of those low-effort security hygiene practices that just makes sense. Except that at Freeletics, those Monday morning merges triggered a cascade of Docker image builds that Jenkins couldn't keep up with, flooding the build queue and eventually causing master-to-slave HTTP communication to break down. Jobs hung. Developers refreshed their PR status pages waiting for CI feedback that never came. By the time someone from the ops team intervened, half the morning was gone.
Lean has a name for this: Mura (unevenness in operation caused by unpredictable, variable workloads). The CI system was sized for normal throughput, not for the burst that Dependabot produced every Monday. The result was exactly what Mura predicts: inconsistent outcomes, developer exhaustion, and downstream waiting.
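To make the Mura effect concrete, here is a minimal sketch of a build queue with fixed capacity. Every number in it is hypothetical (the capacity, the steady arrival rate, and the size of the overnight burst are invented for illustration, not Freeletics measurements). It only shows how a system sized for average load stays backed up long after a single burst.

```python
def backlog_over_time(arrivals, capacity):
    """Return the queue length at the end of each interval."""
    backlog, history = 0, []
    for arriving in arrivals:
        # Builds that cannot be served in this interval carry over to the next.
        backlog = max(0, backlog + arriving - capacity)
        history.append(backlog)
    return history


CAPACITY = 4                 # builds finished per interval (hypothetical)
steady = [3] * 8             # a normal day: arrivals stay below capacity
monday = [40] + [3] * 7      # same load, plus one overnight burst (hypothetical)

steady_backlog = backlog_over_time(steady, CAPACITY)
monday_backlog = backlog_over_time(monday, CAPACITY)

print(steady_backlog)  # [0, 0, 0, 0, 0, 0, 0, 0]
print(monday_backlog)  # [36, 35, 34, 33, 32, 31, 30, 29]
```

With only one spare build slot per interval, the queue drains by one build at a time, which is why a burst that arrives overnight can still be felt hours later.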
This wasn't a Jenkins problem per se. It was a design problem. And it had been accumulating for years.
What we inherited
When I joined Freeletics in Aug/2019, the CI/CD landscape looked like this:
- Jenkins ran on the internal Kubernetes cluster and was responsible for building all Docker images for back-end and web applications, then deploying them to Production, QA, and Integration;
- CircleCI handled mobile builds (iOS on macOS agents, Android on Linux) and some code quality checks;
- Travis ran tests and reported code quality results on back-end and web pull requests.
Three tools, three secrets stores, three mental models. A new engineer joining the team had to context-switch between all three to get a full picture of a single deployment. Knowledge didn't accumulate in one place; it spread thin across all three.
But the Jenkins situation was particularly bad. Jenkins itself was fine. The way it had been deployed was the problem, and it had three layers stacked on top of each other.
Three layers of debt
Layer 1: Jenkins Job Builder (JJB)
Jenkins Job Builder is a tool that generates Jenkins job definitions from YAML files. Freeletics ran a customized fork of JJB that hadn't been updated in over two years. The consequence: Jenkins Pipelines (introduced in Jenkins 2.0) were not supported. Any attempt to include a Pipeline definition in a JJB YAML file would cause the deployment to fail.
This is the kind of blocker that silently freezes a platform. Every Jenkins feature shipped in the past two years was unreachable. Developers kept writing workarounds for the limitations of a job format that the upstream project had effectively deprecated. The gap between what Jenkins could do and what the Freeletics setup could do kept widening, and nobody had time to fix it.
Layer 2: The Helm Chart that held everything hostage
Jenkins was deployed through a custom Helm Chart. So far so good. The problem was how that Helm Chart was wired: it rendered JJB configuration YAML inside Go templates and executed JJB as part of the chart install process. Secrets were encrypted directly into the Helm values file using helm-secrets.
The consequence of bundling secrets into the chart: every Jenkins configuration change (adding a job, tweaking an executor count, updating a plugin) required a full Helm release that also included the secrets bundle. Rolling back a configuration change meant rolling back the secrets version too. Two completely separate concerns sharing one release cycle.
More practically: making any change to Jenkins required understanding the full chart, including the templated JJB YAML, the secrets structure, and the interactions between them. The blast radius for a failed release was total: Jenkins would go down, and recovery was a manual exercise that blocked every team's deployments until it completed.
Layer 3: The fragmentation tax
Running three CI/CD systems isn't inherently wrong. Mobile builds genuinely require macOS agents that CircleCI provides out of the box, and that's a fair reason to keep CircleCI for iOS. But the fragmentation meant that every platform-level decision had to be made three times: credential rotation, pipeline templates, deployment interfaces, monitoring. The cognitive overhead was constant.
How much does this actually cost? Consider that every time a developer debugs a CI failure, they have to know which system it lives in, how that system's logs are structured, what credentials it uses, and how to re-trigger it. Multiply that by the number of incidents per week, across a team of ~20 engineers. The productivity drain is invisible in any single incident and very visible in aggregate. One of the core promises of DevOps is reducing exactly this kind of bottleneck: "more free time to do work that really matters." Three CI systems running in parallel is the structural opposite of that.
The evaluation: stay or leave?
In Feb/2020 (before COVID-19 pushed the project back to Aug/2020), the team ran a structured evaluation. The question: is Jenkins worth investing in, or do we migrate to something else?
Tools considered: CircleCI (cloud), GitLab CI with shared runners, GitLab CI with self-hosted runners using Kaniko, Jenkins X, and the current Jenkins.
The main axis of comparison was Docker image build time. Builds are the most frequent CI operation at Freeletics (triggered by every commit on every repository) and the most directly felt by developers. A build that takes 13 minutes instead of 5 is 8 minutes of a developer either waiting or switching context. Across 50 builds per day (a realistic number for a team that size), that's 400 minutes, or nearly 7 hours of developer time evaporating every day. Muda in the most direct sense: time consumed without adding value to anyone.
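The estimate above is simple enough to check by hand, but spelling it out makes the assumptions visible. This sketch uses only the three figures quoted in the paragraph and assumes, pessimistically, that every build incurs the full 8-minute penalty.

```python
# Back-of-the-envelope cost of slow builds, using the figures from the text.
slow_build_min = 13   # slower build time, in minutes
fast_build_min = 5    # faster build time, in minutes
builds_per_day = 50   # daily build count quoted in the text

extra_min_per_build = slow_build_min - fast_build_min
lost_min_per_day = extra_min_per_build * builds_per_day
lost_hours_per_day = lost_min_per_day / 60

print(f"{extra_min_per_build} min/build x {builds_per_day} builds = "
      f"{lost_min_per_day} min/day (~{lost_hours_per_day:.1f} h)")
# prints: 8 min/build x 50 builds = 400 min/day (~6.7 h)
```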
| Application | GitLab CI (shared) | GitLab CI (self-hosted) | CircleCI | Jenkins |
|---|---|---|---|---|
| web-api-service | Timeout | ~9:49 | ~13:51 | ~5:44 |
| blog-service | ~51:21 | ~13:50 | ~17:29 | ~11:35 |
| frontend-spa | ~13:38 | ~2:13 | ~10:32 | ~3:54 |
| rails-base-image | ~4:06 | ~3:24 | ~2:44 | ~0:11 |
| backend-api | ~8:23 | ~4:54 | ~10:01 | ~2:00 |
| user-service | ~7:23 | ~2:17 | ~6:58 | ~3:34 |
| marketing-service | ~6:02 | ~4:44 | ~7:19 | ~5:18 |
These are averages of Docker image build execution time, not including external influences like repository synchronization. Jenkins was running Docker-in-Docker (DinD) rather than Kaniko at the time, so its numbers would change post-migration. The directional conclusion held regardless.
Jenkins was the fastest option on four of the seven applications, and GitLab CI with self-hosted runners took the other three. The underlying reason is control: Jenkins runners live inside Freeletics' own Kubernetes cluster, meaning hardware sizing, network latency to AWS ECR, and layer cache availability are all tunable. GitLab shared runners and CircleCI are someone else's infrastructure, and the benchmark numbers reflected that.
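For readers who want to verify the comparison rather than scan the table by eye, the sketch below copies the benchmark figures and picks the fastest tool per application. The helper names (`to_seconds`, `fastest`) are made up for this example, and the timeout is treated as "did not finish" rather than as a time.

```python
from collections import Counter

TOOLS = ["GitLab CI (shared)", "GitLab CI (self-hosted)", "CircleCI", "Jenkins"]

# Build times copied from the benchmark table (mm:ss); None marks the timeout.
TIMES = {
    "web-api-service":   [None,    "9:49",  "13:51", "5:44"],
    "blog-service":      ["51:21", "13:50", "17:29", "11:35"],
    "frontend-spa":      ["13:38", "2:13",  "10:32", "3:54"],
    "rails-base-image":  ["4:06",  "3:24",  "2:44",  "0:11"],
    "backend-api":       ["8:23",  "4:54",  "10:01", "2:00"],
    "user-service":      ["7:23",  "2:17",  "6:58",  "3:34"],
    "marketing-service": ["6:02",  "4:44",  "7:19",  "5:18"],
}


def to_seconds(mmss):
    """Convert an 'mm:ss' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def fastest(row):
    """Return the tool with the shortest finished build in one table row."""
    finished = {tool: to_seconds(t) for tool, t in zip(TOOLS, row) if t is not None}
    return min(finished, key=finished.get)


winners = {app: fastest(row) for app, row in TIMES.items()}
wins = Counter(winners.values())

for app, tool in winners.items():
    print(f"{app}: {tool}")
print(dict(wins))
```

On these figures Jenkins is fastest for web-api-service, blog-service, rails-base-image, and backend-api, while GitLab CI with self-hosted runners is fastest for frontend-spa, user-service, and marketing-service.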
GitLab CI with self-hosted runners and Kaniko was a serious competitor, and the fastest option on a few applications (frontend-spa at ~2:13 is impressive). But migrating to a new CI platform carries its own costs: new credentials model, new pipeline syntax, retraining, migration of all existing job definitions. Those costs are upfront and real; the productivity benefits are speculative and compound slowly.
The verdict: invest in Jenkins, redesign from scratch.
The design goals were straightforward:
- Everything reproducible from code (delete the Jenkins release, recreate it, get the exact same Jenkins back);
- No more JJB (replace with JCasC and Job DSL, both managed through Terraform);
- Secrets decoupled from the Helm Chart release cycle;
- A single Groovy Shared Library covering all server-side pipelines.
Ready to see how the rebuild actually went? Part 2 covers the first thing we had to fix: authorization that any employee could bypass, and secrets that couldn't change independently of the rest of the system.