Real-life Terraform Refactoring Guide


As reality hits, you eventually find yourself dealing with a hard-to-manage Terraform big ball of mud code base. There is no way around the natural growth and evolution of code bases and the design flaws that come with it. Our Agile mindset is to “move fast and break things”: implement something as simple as possible and leave the design decisions to the next iterations (if any).

Refactoring Terraform code is actually as natural as developing it. Time and time again you will face a situation where a better structure or organization can be achieved: maybe you want to upgrade from a home-made module to an open-source/community alternative, or maybe you just want to segregate your resources into different states to speed up development. Regardless of the goal, once you get into it you will realize that Terraform code refactoring is actually a basic step of the development process that no one told you about before.

As the Suffering-Oriented Programming mantra dictates:

“First make it possible. Then make it beautiful. Then make it fast.”

So, time to make the Terraform code beautiful!

How to break a big ball of mud? STRANGLE IT

<joke> Martin Fowler has already written everything there is to write about (early 2000s) DevOps, Agile, and Software Development. Therefore, we could reference Martin Fowler for virtually anything Software related </joke>, but really, the Refactoring book is THE reference on this subject.

Martin Fowler shared the Strangler (Fig) Pattern, which describes a strategy to refactor a legacy code base by re-implementing the same features (sometimes even the bugs) in another application.

[…] the huge strangler figs. They seed in the upper branches of a tree and gradually work their way down the tree until they root in the soil. Over many years they grow into fantastic and beautiful shapes, meanwhile strangling and killing the tree that was their host.

This metaphor struck me as a way of describing a way of doing a rewrite of an important system.

In this document we are going to follow the same idea:

  1. implement the same feature on a different Terraform composition;
  2. migrate the Terraform state;
  3. delete (kill) the previous implementation.

The mono-repository (monorepo) approach to Legacy

Let’s suppose that your Terraform code base is versioned in a single repository (a.k.a. monorepo), following the arbitrary structure displayed below (just to help illustrate):

├── modules/    # Definition of TF modules used by underlying compositions
├── global/     # Resources that aren't restricted to one environment
│   └── aws/
├── production/ # Production environment resources
│   └── aws/
└── staging/    # Staging environment resources
    └── aws/

In this example each directory corresponds to a Terraform state. In order to apply changes you change into the corresponding path and execute terraform.
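For instance, applying a change to production alone would look like the following (directory names taken from the tree above; illustrative commands only):

```shell
# Each directory is a separate state: change into it and run Terraform there.
cd production/aws
terraform init   # configure the backend for this state
terraform plan   # preview changes scoped to production
terraform apply
```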

The structure in this example repository was created a few hypothetical years ago, when the number of existing microservices and resources (DBs, message queues, etc.) was significantly smaller. At the time it was feasible to keep the Terraform definitions together because they were easier to maintain: Cloud resources were managed in one shot!

As time went by, the number of Products and the team grew, and engineers started facing concurrency issues: Terraform locks the shared state storage while someone else is running terraform apply, and every execution suffers a general slowness since the number of data sources to sync is frightening.

A mono-repository approach is not necessarily bad; versioning is actually simpler when performed in one single repository. Ideally there won’t be many changes on the scale of GiB, meaning that it is safe to stick with the monorepo as long as the Terraform remote states are divided.
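Dividing the remote states means each sub-path declares its own backend. A minimal sketch, assuming an S3 backend (bucket name and key are hypothetical):

```hcl
# production/aws/backend.tf -- this directory gets a state of its own
terraform {
  backend "s3" {
    bucket = "example-terraform-states"        # hypothetical bucket
    key    = "production/aws/terraform.tfstate"
    region = "us-east-1"
  }
}
```

Each directory pointing at a distinct key keeps locks and plan/apply runs isolated from one another.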

Splitting the modules sub-path to its own repository

One thing to mention, though, is the modules sub-path: it could be stored in a different git repository to leverage its own versioning. Since Terraform modules and their implementations don’t always evolve at the same pace, keeping two distinct version trees is beneficial. Additionally, a separate repository for Terraform modules allows the specification of “pinned versions”, e.g.:

module "aws_main_vpc" {
  source = "git::ssh://[email protected]/user/terraform-modules.git?ref=${GIT_REVISION_DIGEST}"
  # Note the ref=${GIT_REVISION_DIGEST}: it pins the module to a tag or commit
}
That reference to a module’s version should always be specified, regardless of whether the module comes from an internal/private repository or a public one. When you specify the version, you are ensuring reproducibility.

Therefore, let’s move the modules sub-path to another git repository, following instructions from this StackOverflow answer so that the git commit history is preserved:


Change into the monorepo path and create a branch from the commits at the monorepo/modules path

git subtree split -P modules -b refact-modules


Create the new repository

mkdir /path/to/the/terraform-modules && cd $_
git init
git pull "${MAIN_BIGGER_REPO}" refact-modules


Link the new repository to your remote Git (server)

git remote add origin [email protected]:user/terraform-modules.git
git push -u origin master


[OPTIONAL] Cleanup inside $MAIN_BIGGER_REPO, if desired

git rm -rf modules
git filter-branch --prune-empty \
    --tree-filter "rm -rf modules" -f HEAD
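The four steps above can be exercised end-to-end in a throwaway directory before touching the real monorepo. A minimal sketch, where every path and repository name is hypothetical:

```shell
#!/bin/sh
set -e
# Throwaway sandbox; all paths and names below are hypothetical.
TMP=$(mktemp -d)

# A stand-in for the monorepo, holding a modules/ sub-path.
git init -q "$TMP/monorepo"
cd "$TMP/monorepo"
mkdir modules
echo 'variable "name" {}' > modules/main.tf
git add modules
git -c user.email=demo@example.com -c user.name=demo commit -q -m "add modules"

# Step 1: a branch holding only the history of modules/
git subtree split -P modules -b refact-modules

# Step 2: the new repository pulls that branch.
git init -q "$TMP/terraform-modules"
cd "$TMP/terraform-modules"
git -c user.email=demo@example.com -c user.name=demo pull -q "$TMP/monorepo" refact-modules

ls  # main.tf now sits at the repository root, history preserved
```

Running the dry run first makes it easy to inspect `git log` on the split branch and confirm the history looks right before pushing anywhere.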

Let’s start strangling the repository

Now that a substantial piece of code has been moved somewhere else, it is time to put the Strangler (Fig) Pattern into practice.

Move all the existing content as-is to the legacy sub-path, keeping the same repository and change history (commits). This also allows applying the legacy code as it used to be from one of those paths.

└── legacy
    ├── global
    │   └── aws
    ├── production
    │   └── aws
    └── staging
        └── aws

Once the content is moved to legacy, the idea is to follow the Boy Scout rule in order to strangle the legacy content little by little (unless you are really committed to migrating it all at once, which is going to be exhausting).

The Boy Scout rule goes like:

  1. every time a task that involves deprecated code appears, implement it on the new structure;
  2. import the Terraform state to keep the Cloud resources that the given code represents/describes;
  3. remove the state and the code from legacy.

Repeat until there is nothing left inside legacy (or only unused resources/left-behinds that could be destroyed/garbage-collected either way).

Import state? Remove state and code from what? Where?

That will depend on the kind of resource being migrated from the remote state. At the bottom of each resource page in Terraform’s provider documentation you can find a reference command to import existing resources into your Terraform code specification, e.g.: AWS RDS DB instance.

Suppose we want to replace the code of the AWS RDS Aurora cluster defined in production/aws and re-implement the same using the community module. After creating the corresponding sub-path in the monorepo according to your preference, provisioning the bucket, and initializing the Terraform backend:

  1. implement the definition of the community module with the closest parameters from the existing one; e.g.:

    module "aws_aurora_main_cluster" {
      source  = "terraform-aws-modules/rds-aurora/aws"
      version = "~> 5.2"
      # ...
  2. import the Terraform states from the previous (existing) cluster

    terraform import 'module.aws_aurora_main_cluster.aws_rds_cluster.this[0]' main-database-name
    terraform import 'module.aws_aurora_main_cluster.aws_rds_cluster_instance.this[0]' main-database-instance-name-01
    terraform import 'module.aws_aurora_main_cluster.aws_rds_cluster_instance.this[1]' main-database-instance-name-02
    # ...

    then, to “match reality” between the existing and the specified resource, run terraform plan a few times and adjust the parameters until Terraform reports:

    No changes. Your infrastructure matches the configuration.
  3. last but not least, remove the corresponding resources from the legacy Terraform state so that it no longer keeps track of changes to them and doesn’t try to destroy them once the resource definition is no longer in that code base:

    # Hypothetical name of the resource inside production/aws/
    terraform state rm aws_rds_cluster.default \
        'aws_rds_cluster_instance.default[0]' 'aws_rds_cluster_instance.default[1]'
    # ...

    once that is performed, feel free to remove the corresponding resource’s definition from the legacy code.

    resource "aws_rds_cluster" "default" {
      # ...
    }

    resource "aws_rds_cluster_instance" "default" {
      count = var.number_of_database_instances
      # ...
    }
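As a final sanity check (illustrative commands only, using the hypothetical resource names above), the legacy state should no longer know about the cluster, and a plan there should not propose destroying it:

```shell
# Inside the legacy production/aws/ path
terraform state list | grep rds   # the cluster addresses should be gone
terraform plan                    # must NOT plan to destroy the database
```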
Matheus Cunha
Systems Engineer and Magician 🎩

Just a technology lover empowering business with high-tech computing to help innovation (: