Empowering Developers: How Terraform Together with Nx Redefined Our Development Process

Dec 7, 2023 · 8 min read

As senior developers, one of our key roles is to identify and mitigate bottlenecks in the development process and to find opportunities to increase velocity. A common yet often overlooked obstacle is infrastructure management. Developers might deploy their code without fully understanding the underlying infrastructure elements like containers, ECS, EKS, or VPC. This lack of awareness, or worse, apathy, can significantly hinder development velocity.
You might ask: isn’t that the purpose of DevOps? It is, but reality is different, which is why this velocity killer usually stays hidden.

This challenge became evident in a startup I worked for. Initially, when the project was small, everyone had admin access to our AWS account, and resources were managed manually. However, as the company grew, this approach became unsustainable and disorganized. Recognizing this, I took on the responsibility of overhauling our system, making it more maintainable and scalable as the developer team grew.

We were about to break our monolith into multiple microservices. That transition highlighted the need for a common configuration layer. Managing this manually was not only impractical but also risked losing vital knowledge trapped in people’s heads. We clearly needed to shift to code-managed infrastructure. This transition was not immediate, but once in place, it revolutionized our deployment process: cloud costs dropped significantly, and we could make infrastructure changes not only more safely but also much faster.

Deploying our entire system to new regions or even to a new AWS account (following our acquisition) became a matter of clicks and minutes, rather than days or weeks of manual work.

In this post, I’ll discuss our Terraform codebase design, the advantages of integrating application code and infrastructure in a single monorepo, and how custom Nx generators made our infrastructure deployment 10x faster.

Architect Your Terraform Codebase

At the beginning of this project I set two guidelines:

Every infrastructure-related resource, from AWS resources to GitHub repositories and Jenkins jobs, must be defined in code.
Developers must use this code for infrastructure changes, with only limited AWS console access, ensuring full auditability.

The process began with mapping our AWS resources and cleaning up the unused ones, followed by describing the resources in use in Terraform code and importing them into our Terraform state, so that everything was in sync.

However, our Terraform state quickly grew along with the number of managed resources, slowing down Terraform operations. Terragrunt came to the rescue. Terragrunt is a wrapper around Terraform that helps you manage your Terraform code better, including generating Terraform code and taking a modular approach to state management. It enabled us to break our Terraform state into smaller pieces that depend on each other, so an output of one state can be an input of another. This drastically improved efficiency, since only the relevant state is refreshed and evaluated when running terragrunt plan or terragrunt apply, instead of the entire TF state.
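A minimal sketch of how one state can consume an output of another with Terragrunt's dependency block. The directory layout, the dependency name, and the dns_zone_id output are hypothetical and only illustrate the mechanism; they are not taken from the actual codebase.

```hcl
# regions/us-east-1/terragrunt.hcl (hypothetical path)

# Declare that this state depends on the "global" state.
# Terragrunt reads the outputs of the unit found at config_path.
dependency "global" {
  config_path = "../../global"
}

# Values listed here are passed to this unit's Terraform code as input variables.
inputs = {
  # Assumes the global state declares an output named "dns_zone_id".
  dns_zone_id = dependency.global.outputs.dns_zone_id
}
```

Running terragrunt plan inside regions/us-east-1 then works only on that region's state, while still reading the values it needs from the global state's outputs.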

I broke our TF state into the following states:

Global defines common resources that are not region-specific, such as the Route53 DNS zone and a CloudFront distribution for each environment. Each distribution is an instance of our internal cloudfront Terraform module, created with Terraform’s for_each meta-argument, which creates one module instance per element of the collection it is given.
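As an illustration of the for_each pattern described above, here is a small sketch. The module path, its input names, and the distribution_id output are assumptions made for the example. Note that for_each accepts a map or a set of strings, so the environments are modeled as a map here.

```hcl
# Hypothetical input: one entry per environment.
variable "environments" {
  type = map(object({
    domain = string
  }))
}

# One instance of the cloudfront module is created per map entry
# (module-level for_each requires Terraform 0.13 or later).
module "cloudfront" {
  source   = "../modules/cloudfront" # assumed location of the internal module
  for_each = var.environments

  environment = each.key          # e.g. "rc", "sandbox", "production"
  domain_name = each.value.domain # assumed module input
}

# Expose the distributions so that higher-level states can consume them.
output "distribution_ids" {
  # Assumes the module declares an output named "distribution_id".
  value = { for env, m in module.cloudfront : env => m.distribution_id }
}
```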

On top of the global state we have the region-related resources, for each region our app is deployed to:

Base resources: the VPC and other networking-related resources, ECS clusters, etc. that must be defined in every region. This is done with an internal shared-infra module that defines these resources; each region passes different variables, so different resources are created.
Region-specific resources, such as our CI or Prometheus workspace, which are only deployed in us-east-1.
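The base-resources item above could look roughly like this in a region's Terraform code. The module path and every input name and value are hypothetical; the point is only that each region instantiates the same module with its own values.

```hcl
# regions/eu-central-1/main.tf (hypothetical path)
module "shared_infra" {
  source = "../../modules/shared-infra" # assumed location of the internal module

  # Region-specific values; another region would pass different ones.
  region       = "eu-central-1"
  vpc_cidr     = "10.20.0.0/16"
  environments = ["production"]
}
```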

In addition, I defined another state for DBs, to manage our MongoDB Atlas databases and their permissions, since they are neither region-specific nor part of our AWS account.

The state of the apps themselves depends on all of these states: we need the CloudFront distributions defined in the global state to deploy frontend applications, and the databases defined in the DBs state and the VPCs defined in the region states to deploy our backend microservices.

This architecture significantly reduced the size of each TF state and made each one more focused. If you made changes in us-east-1, why should you need to review the eu-central-1 state?

As a result, infrastructure changes became much faster and safer, because each run could touch fewer resources. Remember the second guideline? If I want to allow developers to make infrastructure changes related to their services by themselves, everything must be extra-safe, with as little risk as possible.

Deploying a Microservice

A service deployment is defined as a tuple of service name, region, and environment (e.g. the employees service in the RC environment in us-east-1).

Our application, spread across multiple regions (mainly due to regulatory and compliance concerns) and environments (RC, sandbox, production), faced potential inconsistencies with every infrastructure change. We reached dozens of deployments relatively quickly, and every infrastructure change had to be synced between regions and environments. How did we avoid the mess?

To manage this, I defined two separate Terraform modules for each service:

Service-related resources, which are region- and environment-agnostic. For example, the ECR repository, the deploy job in Jenkins, etc.
Deployment-related resources, such as the ECS service, service discovery registration, logging, etc.

A Terraform module is a reusable unit of resources. It lets you manage these resources as a group instead of individually, and reuse them by passing different arguments to each module instance.
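To make the idea of a module concrete, here is a minimal sketch of what a service module might contain. The file path and the single-variable interface are assumptions for illustration; a real service module would hold more (for example, the Jenkins deploy job).

```hcl
# app-defs/<service>/service/main.tf (hypothetical layout)

variable "app_name" {
  type        = string
  description = "Name of the service, used to name its resources."
}

# Region- and environment-agnostic resource: one ECR repository per service.
resource "aws_ecr_repository" "this" {
  name = var.app_name
}

# Exposed so that deployment modules can reference the image location.
output "ecr_repository_url" {
  value = aws_ecr_repository.this.repository_url
}
```

A caller then instantiates the module once per service, passing only the arguments that differ, for example source = "../app-defs/employees/service" and app_name = "employees" (both hypothetical values).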

In the apps state I had a single reference per service to the service module, and a reference to the deployment module for each region and environment, with different parameters, for example:

module "lidan-nothing-to-prod_us-east-1_development" {
  source = "../app-defs/lidan-nothing-to-prod/deployment"
  app_name = var.app_definition.app_name
  friendly_name = local.lidan-nothing-to-prod_us-east-1_development_friendly_name
  environment = var.app_definition.development.environment
  total_instances = var.app_definition.development.instances
  dispatcher = var.dispatchers.us-east-1.development_internal
  vpc_id = var.data_centers.us-east-1.network.vpc_id
  port = var.app_definition.port
  dns_main_domain = var.dns_main_domain
  private_subnet_ids = var.data_centers.us-east-1.network.subnet_ids.private.development
  ecs_cluster = var.data_centers.us-east-1.ecs_clusters.dev
  domain_certificate_arn = var.data_centers.us-east-1.main_domain_certificate
  service_discovery_namespace = var.data_centers.us-east-1.service_discovery_namespaces.development
  
  providers = {
    aws = aws.us-east-1
  }
}

As mentioned before, the parameters provided to each module instance are input variables coming from lower-level Terraform states (databases, regions). Terragrunt makes this wiring relatively easy.
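A sketch of how the apps state might receive those variables through Terragrunt. The paths and output names are hypothetical; only the variable names data_centers and dns_main_domain come from the module call above.

```hcl
# apps/terragrunt.hcl (hypothetical path)

dependency "global" {
  config_path = "../global"
}

dependency "us_east_1" {
  config_path = "../regions/us-east-1"
}

dependency "dbs" {
  config_path = "../dbs"
}

# Terragrunt passes these to Terraform as input variables,
# where they become var.dns_main_domain, var.data_centers, and so on.
inputs = {
  dns_main_domain = dependency.global.outputs.dns_main_domain

  # Keyed by region, matching lookups such as
  # var.data_centers.us-east-1.network.vpc_id in the module call above.
  data_centers = {
    "us-east-1" = dependency.us_east_1.outputs.data_center
  }

  # Assumed output of the DBs state.
  databases = dependency.dbs.outputs.databases
}
```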

With this approach, every change in the deployment module automatically propagates to all instances of the service, ensuring consistency and speed.

Consider this scenario: a developer needs to introduce a new resource to a service. This addition must be replicated across every region and environment where the service operates. The prospect of manually configuring this update is daunting, not to mention the complexities involved in troubleshooting any issues that might arise.

Our refined architecture streamlines this process. All the developer has to do is define the new resource within the service’s deployment module (thanks, ChatGPT). Running terragrunt apply within the apps directory then propagates the update automatically. Because the deployment module is already instantiated for each region and environment, the new resource is applied uniformly across all service instances, whether it’s RC in us-east-1, production in the same region, or production in eu-central-1.

Changes were executed 10x faster, if not more. But perhaps more significantly, this system empowers developers. They can independently add or remove resources within their familiar coding environment instead of navigating the AWS console. Plus, every change they make is tracked in our GitHub repository, ensuring transparency and control.
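For example, adding a queue to a service could be a single resource block in its deployment module. This is an illustrative sketch: the queue itself is hypothetical, and it assumes the module declares the app_name and environment variables that the module call above passes in.

```hcl
# app-defs/<service>/deployment/queue.tf (hypothetical file)

# Declared once in the deployment module, so it is created once per
# (region, environment) instance of that module.
resource "aws_sqs_queue" "events" {
  name = "${var.app_name}-${var.environment}-events"
}
```

Because the name is derived from the module's own variables, each deployment gets its own distinct queue without any per-region or per-environment code.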

Scaffolding a New Microservice

You might be wondering, in a system where everything is managed by code and accessible to developers, who is responsible for creating a new service? Is it me or the developers? And how is this flexible structure consistently maintained?

As part of our initiative, we consolidated all our code, including infrastructure code, into a single monorepo, treating it just like any other microservice or frontend application. While I won’t delve into the reasons for this decision here, its advantages become apparent during the creation of a new service.

For managing our monorepo, I chose to rely on Nx, which offers a powerful feature: code generation. This capability allows for the creation of custom generators tailored to specific needs, including microservice scaffolding. I developed a new service template leveraging our internal libraries and other resources. This generator was also responsible for creating the relevant Terraform code for the new service: defining the service module, defining the deployment module, and adding their instances to the apps Terraform folder, all while aligning with the predefined module variables.

Nx also provides the flexibility to add custom commands for each application within the monorepo. One such custom command is apply-infra, designed to make it simple for developers to apply changes specific to their service. By executing nx apply-infra lidan-nothing-to-prod, developers could apply changes relevant exclusively to the lidan-nothing-to-prod service, ensuring that no other service is inadvertently affected.

For those looking to expand their app’s reach across more regions and environments, I’ve also created a dedicated generator for this purpose. This tool ensures that such expansions are seamlessly integrated into our system.

The Results

We began with a monolithic system and a manually managed AWS account, where everyone had administrative privileges. This setup was coupled with a daunting backlog of infrastructure tasks that were largely avoided. The idea of creating and deploying a new service to production without direct involvement from a DevOps specialist seemed ludicrous to our developers.

However, by the conclusion of this project, the scenario had dramatically transformed. A developer could now roll out a basic microservice from zero to production in just 5 minutes. The majority of this time was actually consumed by AWS and Jenkins in setting up the necessary resources and executing the deployment.

This shift significantly accelerated our ability to prototype and implement new ideas. Developers gained the autonomy to modify the infrastructure independently (with a little help from ChatGPT). They could do so without the fear of inadvertently impacting aspects beyond their service’s scope. This autonomy shifted their focus back to enhancing the product and business, rather than getting bogged down by infrastructure complexities. As for my role, the time I spent assisting developers with these changes dropped dramatically.

The most significant outcome? Our workflow became 10x better. That, in essence, encapsulates the ultimate goal of a senior developer.

Should you need code examples or a template of our Terraform architecture, feel free to reach out, and I’d be happy to share.

Written by Lidan Hifi