Picking up where we left off

Part 1 covered what we inherited and why we chose to rebuild Jenkins rather than replace it. Part 2 covered the security foundation: authorization that stays accurate without manual upkeep, and secrets decoupled from the configuration release cycle. This post is about the build system itself (the piece that made Dependabot Monday mornings a non-event).

One constraint shaped everything: we couldn't afford a hard cutover. Freeletics' back-end and web deployment pipelines were running continuously. The redesign had to happen alongside normal operations, phase by phase, with the old system still running until each piece was ready to be replaced.

Phase 2: Kill the Job Builder, introduce Kaniko

First things first: reproducibility

The hardest requirement to state and the easiest to overlook: a Jenkins installation that can be deleted and recreated from scratch without losing anything. No manual configuration living only in someone's memory, no jobs that were "created by hand one afternoon" and never written down.

To get there, Jenkins Job Builder (JJB) had to go. Its replacement was a combination of Jenkins Configuration as Code (JCasC) for the Jenkins system-level settings and Job DSL for job definitions, both managed through Terraform. Every Jenkins job became a pull request. Every system configuration change could be reviewed, audited, and rolled back.

The practical impact was felt immediately. Previously, understanding what jobs Jenkins was running required either opening the Jenkins UI or reading the bundled JJB YAML inside the Helm chart. Now, the full picture lived in the Terraform repository alongside everything else. New team members could onboard to Jenkins by reading code, not by poking around a UI. This is Infrastructure as Code applied to CI configuration:

"Easy to recreate the infrastructure, if it is necessary to move everything to another place, this can happen with a few manual interactions; allows for a code review of infrastructure and configurations, which consequently brings a culture of collaboration in the development, sharing of knowledge."
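To make "every Jenkins job became a pull request" concrete, here is a minimal Job DSL sketch of a single job definition. It is an illustration, not the actual definition from the Terraform repository: the job name, repository URL, and credential ID are hypothetical placeholders, and it assumes the Job DSL plugin with multibranch pipeline support is installed.

```groovy
// Hypothetical Job DSL seed script: one multibranch pipeline for one repository.
// Job name, URL, and credential ID are placeholders, not real Freeletics values.
multibranchPipelineJob('backend-foobar') {
    displayName('foobar (back-end)')
    branchSources {
        git {
            id('backend-foobar-source')   // keep stable, or Jenkins re-indexes every branch
            remote('https://github.com/example/foobar.git')
            credentialsId('github_checkout_credentials')   // hypothetical credential ID
        }
    }
    orphanedItemStrategy {
        discardOldItems {
            numToKeep(20)                 // prune jobs whose branches were deleted
        }
    }
}
```

Because the definition is plain text in a repository, adding, renaming, or removing a job shows up as a reviewable diff rather than a click in the UI.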

Docker-in-Docker out, Kaniko in

Docker-in-Docker (DinD) was how Jenkins built container images: a Docker daemon running inside a container, building images for other containers. It works, but it requires running privileged Kubernetes pods (a security concern on shared infrastructure) and carries well-documented instability issues under high build concurrency (the same concurrency problem that made Dependabot mornings painful).

Kaniko builds images from a Dockerfile without needing a Docker daemon at all. It runs as a regular Kubernetes pod with no elevated privileges, uses the container registry directly for cache, and handles concurrent builds gracefully because each build is fully isolated.

The migration covered every server-side application:

All Rails back-end services;
Web applications;
Internal tools (AI and analytics).

Each application got its own Kaniko cache repository in ECR. Layer caching across builds kept build times competitive with the DinD baseline while removing the concurrency issues that caused the Monday morning hangs.
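As a rough sketch of what a cached Kaniko build looks like, the helper below assembles the arguments for the /kaniko/executor binary. The flags (--context, --dockerfile, --destination, --cache, --cache-repo) are standard Kaniko options; the registry host, the account ID, and the "<repository>/cache" naming scheme are assumptions made for the example.

```groovy
// Build the argument list for Kaniko's /kaniko/executor binary.
// The "<repo>/cache" naming scheme for the cache repository is an assumption.
List<String> kanikoArgs(String registry, String repo, String tag, String workspace) {
    return [
        "--context=dir://${workspace}",              // build context inside the pod
        "--dockerfile=${workspace}/Dockerfile",
        "--destination=${registry}/${repo}:${tag}",  // image to push
        "--cache=true",                              // reuse layers from earlier builds
        "--cache-repo=${registry}/${repo}/cache",    // per-application cache repository
    ].collect { it.toString() }
}

// Example with a made-up ECR registry host.
def exampleArgs = kanikoArgs('123456789012.dkr.ecr.eu-west-1.amazonaws.com',
                             'foobar', 'abc1234', '/workspace')
println "/kaniko/executor ${exampleArgs.join(' ')}"
```

Since the cache lives in the registry rather than on a shared daemon, two builds running at the same time never contend for local state.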

The Shared Library was doing too much

The Jenkins Groovy Shared Library had a run() method that orchestrated the entire pipeline from a single entry point: git checkout, build, tag, push to ECR, Slack notification. One call to rule them all. The appeal was obvious: one-liner Jenkinsfiles across every repository.

The problem showed up the moment someone needed to deviate from the happy path. Want to add a custom build arg? Override the image tag format? Skip the Slack notification for a specific branch? None of that was possible without touching the Shared Library and potentially breaking every other pipeline that depended on the same run() method.

The refactoring introduced a KanikoBuilder class that pipelines compose explicitly from their Jenkinsfile. The Shared Library provides the building blocks; the per-repository Jenkinsfile decides how to assemble them:

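// Load the Shared Library from its master branch.
@Library('jenkins_shared_library@master') _

import com.example.KanikoBuilder

// "steps: this" hands the pipeline script context to the builder.
def kanikoBuilder = new KanikoBuilder(repositoryName: "foobar",
                                      steps: this)

// The builder supplies the pod spec; the Jenkinsfile decides which stages run.
podTemplate(yaml: kanikoBuilder.getPodTemplate()) {
    node(POD_LABEL) {
        stage("Git - Fetch code") {
            // gitCheckout is a Shared Library step that returns the commit SHA.
            env.GIT_COMMIT = gitCheckout(branchName: env.BRANCH_NAME)
        }

        stage("Build - Container Image") {
            kanikoBuilder.setImageTag(env.GIT_COMMIT)
            kanikoBuilder.build()
        }
    }
}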

The steps: this argument passes the pipeline context into KanikoBuilder, letting the library call pipeline steps (and add stages) at runtime on behalf of the Jenkinsfile. Per-repository Jenkinsfiles became short and readable. Cross-cutting changes (new tag format, new registry, Slack integration update) could now be made once in the library and propagated across all pipelines on the next library release.
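The library side is not shown in this post, so the following is only a hypothetical sketch of how a class like KanikoBuilder could use the injected context. The field names, the default registry, and the pod spec are assumptions; FakeSteps is a stand-in that lets the sketch run outside Jenkins.

```groovy
// Hypothetical sketch; the real KanikoBuilder may be structured differently.
class KanikoBuilder implements Serializable {
    String repositoryName
    def steps                                   // pipeline context ("this" in the Jenkinsfile)
    String registry = 'registry.example.com'    // assumed default
    String imageTag = 'latest'                  // Groovy generates setImageTag() for this

    // Pod spec with one long-running Kaniko container that stages can exec into.
    String getPodTemplate() {
        return [
            'apiVersion: v1',
            'kind: Pod',
            'spec:',
            '  containers:',
            '    - name: kaniko',
            '      image: gcr.io/kaniko-project/executor:debug',
            '      command: ["sleep"]',
            '      args: ["9999999"]',
        ].join('\n')
    }

    // Every pipeline step goes through the injected context.
    void build() {
        String destination = "${registry}/${repositoryName}:${imageTag}"
        steps.container('kaniko') {
            steps.sh("/kaniko/executor --context=dir://\$(pwd) --destination=${destination}")
        }
    }
}

// Stand-in for the pipeline context: records shell commands instead of running them.
class FakeSteps {
    List<String> commands = []
    def container(String name, Closure body) { body() }
    def sh(String cmd) { commands << cmd }
}

def fake = new FakeSteps()
def builder = new KanikoBuilder(repositoryName: 'foobar', steps: fake)
builder.setImageTag('abc1234')
builder.build()
println fake.commands
```

Injecting the context instead of relying on global pipeline steps also makes the class testable without a running Jenkins, which is what the stand-in above exploits.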

Image multi-tagging

The QA stack resolved images by tag convention, and it needed multiple tags applied in a single build pass. Kaniko accepts N --destination arguments, so the library was extended to receive a tag list and generate the flag string:

qa-<SHA1> applied always;
qa-latest-master applied when the branch is master;
qa-<branch-name> applied optionally, declared in the Jenkinsfile.
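The three rules above can be sketched as two small helpers: one derives the tag list, the other turns it into repeated --destination flags. The function names and the tagBranch parameter are hypothetical, and replacing characters that are not valid in a Docker tag (such as the "/" in feature/login) is an assumption the post does not spell out.

```groovy
// Derive the tag list for one build, following the three rules above.
List<String> qaTags(String sha1, String branch, boolean tagBranch = false) {
    List<String> tags = ["qa-${sha1}".toString()]        // always applied
    if (branch == 'master') {
        tags << 'qa-latest-master'                       // master builds only
    }
    if (tagBranch) {
        // Assumption: normalise characters that are invalid in tags, e.g. "/".
        tags << "qa-${branch.replaceAll('[^A-Za-z0-9_.-]', '-')}".toString()
    }
    return tags
}

// One --destination flag per tag; Kaniko pushes all of them in a single build pass.
String destinationFlags(String image, List<String> tags) {
    return tags.collect { "--destination=${image}:${it}" }.join(' ')
}

println destinationFlags('registry.example.com/foobar', qaTags('abc1234', 'master'))
```

The image is built once and pushed under every tag, so adding a tag costs a registry push rather than another build.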

Human-readable credential IDs

A minor change with outsized quality-of-life impact: all Jenkins credential IDs were UUIDs. Debugging a failed credential lookup meant grepping through the Helm values to cross-reference the UUID against a label. After the migration:

slack_token
aws_credentials
npm_token
github_api_token

All Jenkinsfiles across repositories were updated to reference the new IDs. Time spent debugging credential issues dropped to near zero.
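For illustration, this is roughly how a readable ID appears at the call site. The fragment assumes the Credentials Binding plugin and that slack_token is stored as a "Secret text" credential; the environment variable name and the Slack auth.test call are only there to show the binding in use.

```groovy
// Jenkinsfile fragment (runs inside Jenkins, not as a standalone script).
node {
    // The ID now says what the credential is; no UUID lookup needed.
    withCredentials([string(credentialsId: 'slack_token', variable: 'SLACK_TOKEN')]) {
        // Single quotes: the shell expands $SLACK_TOKEN, so Groovy never
        // interpolates the secret into the command string.
        sh 'curl -sS -H "Authorization: Bearer $SLACK_TOKEN" https://slack.com/api/auth.test'
    }
}
```

When a lookup fails, the error message names slack_token directly instead of an opaque UUID.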

The result

By February 2021, the Jenkins installation was unrecognizable compared with what it had been six months earlier. The Monday morning Dependabot problem was gone. The Mura (unevenness, in Lean terms) was eliminated at the source rather than managed around. The boring platform goal ("no unexpected behavior that makes your heart pump faster, no surprises, it just works") had a new data point: 50 concurrent Kaniko builds triggered without a single hang. Build concurrency that used to cause cascading failures now just worked, because Kaniko pods are isolated by design and don't share a Docker daemon to fight over.

A summary of what changed:

Kubernetes-native workers: ephemeral pods spawned per job, sized to the workload type, torn down when the job finishes;
Fully reproducible: delete the Helm release, recreate it from Terraform and JCasC, get the same Jenkins back (zero manual state);
Single Groovy Shared Library: one codebase for back-end, web, coach, and tracking pipelines (previously each maintained their own ad-hoc implementations with duplicated logic);
GitHub OAuth with team-based RBAC: access control that matches the existing GitHub team structure (one place to manage, one place to audit; full decision in Part 2);
Decoupled secrets: Jenkins configuration releases and secrets changes have independent cycles; both are fully auditable (see Part 2).

CircleCI stayed in place for iOS and Android builds. The redesign scope was server-side applications only (closing the mobile/server-side gap was explicitly deferred).

What's next

ChatOps for deployment triggering was designed but didn't ship in this phase: a Hubot script accepting deployment commands from a Slack channel and translating them to Jenkins pipeline triggers, mirroring the mobile deployment workflow already in use on CircleCI. The interface design was drafted (including an ACL for restricting deployment permissions to a named group), but the implementation was deprioritized. The deployment flow still goes through the Jenkins UI.