DevOps Foundations

Fast, secure, reliable AWS environments using Terraform

Yevgeniy Brikman
Gruntwork
Aug 16, 2023


To go fast in a car, a powerful engine is not enough. For example, F1 race cars not only have ~1,000 hp engines that can take you from 0 to 186 mph in about 8 seconds, but just as importantly, they also have carbon-ceramic brakes that can take you from 186 mph back to 0 in under 4 seconds. What makes it possible for drivers to hit high speeds is the knowledge that they can do so safely. And that requires powerful brakes, as well as a strong frame, seat belts, bumpers, and helmets: these are the foundations that make it possible to go fast in the car racing world.

Every company wants to be fast & agile, like an F1 race car. This kind of speed requires rock-solid foundations.

So that leads to an interesting question: what are the foundations you need to go fast in the DevOps world?

In this blog post, I’m going to answer that question by introducing Gruntwork DevOps Foundations, which consists of the 4 ingredients you need to go fast in the DevOps world with AWS and Terraform:

  1. Account foundations: account baselines & account vending machine.
  2. Code foundations: infrastructure as code (IaC) & code standards.
  3. CI / CD foundations: automated pipelines & workflows.
  4. Maintenance foundations: automated updates & patching.

Each of these four ingredients is powerful on its own, but something magical happens when you put them together in just the right way. What’s so magical about it? Well, the best way to answer that is to show you.

In the rest of this blog post, I’ll walk you through the experience of using each of the 4 ingredients, explaining what each one is and sharing short video snippets that show what it looks like in the real world to have fast, secure, and reliable environments set up with AWS and Terraform. And at the very end, I’ll show how Gruntwork can get you set up with rock-solid DevOps Foundations in a matter of days.

Let’s get started.

Ingredient #1: Account Foundations

The first ingredient is all about setting up your AWS multi-account structure. The goal is to ensure that (a) every one of your AWS accounts is configured with baselines that enforce your company’s requirements around security, compliance, reliability, monitoring, etc., and (b) you have an easy, automated, self-service way to create new AWS accounts with those baselines in place.

Here’s how it works:

  1. Use an account factory to create new AWS accounts
  2. Use Control Tower as a single pane of glass for all of your AWS accounts
  3. Use Terraform to configure a custom baseline for every account
  4. Get access to the accounts via SSO

1.1 Use an account factory to create new AWS accounts

Using an account factory in GitHub Actions to create a new account called “sandbox” in the “Pre-Prod” OU.

The first step is to set up one or more AWS accounts. You do this via an account factory (AKA account vending machine) that allows any team in your company to automatically create new accounts via a self-service experience:

  1. Prompt the user for data: You might prompt the user for the name to use for the account, email address of the root user for the account, Organizational Unit (OU) to add the account to, team name (for tagging), who should get access to the account, and so on. The video above uses a web form in GitHub Actions to gather this data and kick off the workflow, but you could also use other UIs as well (e.g., you could set up a ServiceNow UI in front of GitHub Actions).
  2. Generate code: Under the hood, everything is managed as code and GitOps-driven (I’ll talk more about code later, in Ingredient #2: Code Foundations), so once the account factory has gathered the necessary data, it generates code to (1) trigger AWS Control Tower to create the new account and (2) use a Terraform module to apply a custom baseline to the new account. You’ll see more details on both of these items in sections 1.2 and 1.3.
  3. Create the account automatically: The account factory opens a pull request (PR) with the code. Once the PR is reviewed and merged (optionally, these PRs can be auto-merged), the CI / CD pipeline runs Terraform to apply the code changes and create the account automatically (I’ll talk more about CI / CD later, in Ingredient #3: CI / CD Foundations).
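
To make step 2 above a bit more concrete, here’s a rough sketch of the kind of code an account factory might generate; the module name, source path, and input names are purely illustrative, not the actual Gruntwork implementation:

```
# Hypothetical code generated by the account factory for the new "sandbox"
# account: under the hood, the module would trigger Control Tower to create
# the account (section 1.2) and apply the custom baseline (section 1.3).
module "account_sandbox" {
  source = "../../modules/account-factory"

  account_name        = "sandbox"
  root_user_email     = "aws-root+sandbox@example.com"
  organizational_unit = "Pre-Prod"

  tags = {
    Team = "platform"
  }
}
```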

1.2 Use Control Tower as a single pane of glass for all of your AWS accounts

Using Control Tower to see all of your AWS accounts, including the new “sandbox” account.

The account factory uses AWS Control Tower to create all your AWS accounts. This gives you the following:

  • Single pane of glass: Control Tower gives you a single, central place where you can see all your accounts and Organizational Units (OUs).
  • CloudTrail and AWS Config: Each account you create with Control Tower includes CloudTrail and AWS Config for audit logging and configuration monitoring.
  • Guard rails: In addition to the baseline, you can enable controls in Control Tower, including Service Control Policies (SCPs) and AWS Config Rules, to act as “guard rails” across all of your AWS accounts. For example, you can enable controls to enforce requirements from standards such as NIST 800–53 or PCI DSS. If any accounts or resources are out of compliance with the controls you enable, you’ll get an alert, and be able to see the noncompliant resources in the Control Tower UI.
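
If you want to manage these controls as code as well, here’s a minimal sketch using the AWS provider’s aws_controltower_control resource; the control identifier and OU ARN below are placeholders for the real values in your organization:

```
# Enable a Control Tower control ("guard rail") on an Organizational Unit.
resource "aws_controltower_control" "require_ebs_encryption" {
  control_identifier = "arn:aws:controltower:us-east-1::control/AWS-GR_ENCRYPTED_VOLUMES"
  target_identifier  = "arn:aws:organizations::111111111111:ou/o-exampleorg/ou-examplepreprod"
}
```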

1.3 Use Terraform to configure a custom baseline for every account

Configuring a Terraform module as a secure baseline to apply to every AWS account.

In addition to what Control Tower sets up in each account (i.e., CloudTrail, AWS Config), the account factory uses a Terraform module to apply a custom baseline (a Landing Zone) to each account to fill in everything Control Tower doesn’t include, such as:

  1. Additional guard rails: E.g., GuardDuty, Macie, IAM Access Analyzer.
  2. Encryption: E.g., default EBS encryption, multi-region KMS keys.
  3. Networking: E.g., VPCs, subnets, route tables, NAT Gateways, VPN, Direct Connect, etc. Note: Control Tower includes its own VPC, but the customization options are minimal, so you’re typically better off replacing it.
  4. Auth: SSO settings, IAM roles, OIDC providers. You’ll see more about this in the next section.
  5. Anything else your company needs: You can capture any other requirements your company has in this Terraform module: security, compliance, monitoring, networking, etc.
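
As a simplified sketch of what the inside of such a baseline module might look like (a real baseline would cover far more, and the names here are illustrative), you could imagine resources along these lines:

```
# A few representative pieces of a custom account baseline: GuardDuty for
# threat detection, encryption by default for new EBS volumes, and an IAM
# Access Analyzer for the account.
resource "aws_guardduty_detector" "baseline" {
  enable = true
}

resource "aws_ebs_encryption_by_default" "baseline" {
  enabled = true
}

resource "aws_accessanalyzer_analyzer" "baseline" {
  analyzer_name = "account-baseline"
}
```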

1.4 Get access to the accounts via SSO

Using AWS Identity Center (SSO) with Google to login to the newly-created “sandbox” account.

As part of creating the new AWS account, the account factory grants access to the account through Single Sign-On (SSO) via AWS IAM Identity Center, allowing you to log in to the AWS account using an existing identity provider, such as Google, Active Directory, Okta, etc. That means that all your users and permissions are managed in one place.
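
Since the SSO configuration is managed as code too, the relevant part of the baseline might look something like this minimal sketch, where the permission set, group ID, and account ID are placeholders:

```
# Grant a group from your identity provider access to the new account via
# IAM Identity Center (SSO).
data "aws_ssoadmin_instances" "this" {}

resource "aws_ssoadmin_permission_set" "developer" {
  name         = "Developer"
  instance_arn = tolist(data.aws_ssoadmin_instances.this.arns)[0]
}

resource "aws_ssoadmin_account_assignment" "developers_sandbox" {
  instance_arn       = tolist(data.aws_ssoadmin_instances.this.arns)[0]
  permission_set_arn = aws_ssoadmin_permission_set.developer.arn
  principal_id       = "<group-id-from-your-identity-provider>"
  principal_type     = "GROUP"
  target_id          = "123456789012" # the newly created "sandbox" account's ID
  target_type        = "AWS_ACCOUNT"
}
```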

Ingredient #2: Code Foundations

The second ingredient is about managing all of your infrastructure as code. This enables your infrastructure to be self-service, automated, standardized, tested, secure, and fast.

Here’s how it works:

  1. Use infrastructure from a catalog
  2. Use Terraform to manage everything as code
  3. Deploy new infrastructure via scaffolding and GitOps

2.1 Use infrastructure from a catalog

An example of picking the EKS Cluster module from the catalog you get out-of-the-box with the Gruntwork Library.

Now that you’ve set up the AWS accounts you need, the next step is to deploy whatever infrastructure you need into those accounts: e.g., EKS, ECS, EC2, RDS, ElastiCache, S3, etc. You want teams to be able to deploy the infrastructure they need for themselves (self-service) but not in a free-for-all, ad-hoc manner: otherwise, every EKS cluster, every EC2 instance, and every RDS database will be configured in a different way, with no guarantees they meet your company’s requirements for security, compliance, reliability, etc.

Instead, every team can deploy from a catalog (AKA library) of standardized, off-the-shelf infrastructure that has been built and tested to meet your company’s requirements. You can build out your own catalog of infrastructure from scratch, or you can leverage the Gruntwork Library, a collection of over 300K lines of infrastructure code that has been proven in production at hundreds of companies, complies with the CIS AWS Foundations Benchmark out of the box, and is commercially supported and maintained.
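
To make this concrete, consuming a module from such a catalog might look like the following sketch, where the Git URL, module path, version, and inputs are placeholders for your own catalog (or the Gruntwork Library):

```
# Deploy an EKS cluster using a standardized, pre-vetted module from the
# catalog, pinned to an exact version.
module "eks_cluster" {
  source = "git::https://github.com/your-org/infrastructure-catalog.git//modules/eks-cluster?ref=v1.2.0"

  cluster_name = "sandbox-eks"
  # ...plus whatever other inputs the module exposes, all of which have been
  # built and tested to meet your company's requirements
}
```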

Under the hood, all the infrastructure in your catalog should be managed as code, as described in the next section.

2.2 Use Terraform to manage everything as code

Browsing the code for the EKS Cluster module in the Gruntwork Library.

[Update, August 16th, 2023]: Curious about the impact of Terraform moving to a BSL license? See The future of Terraform must be open for our plan.

Under the hood, you manage everything as code, primarily using Terraform (see also Why we use Terraform and not Chef, Puppet, Ansible, Pulumi, or CloudFormation). If you’re a solo developer or doing a hobby project, you might be able to get away with managing everything by clicking around the AWS Console (“ClickOps”), but if you’re managing infrastructure for a company — with lots of customers, teams, environments, and revenue counting on you — the only solution that scales is to manage your infrastructure as code (IaC). Here are just a few of the key benefits of IaC:

  • Self-service: If you manage your infrastructure manually, you typically end up with just a handful of people (sometimes, just one) who are the only ones who know all the magic incantations to make the deployment work and are the only ones with access to production. This becomes a major bottleneck as the company grows. If your infrastructure is defined in code, the entire deployment process can be automated, and developers can kick off their own deployments whenever necessary.
  • Standardization: If you manage infrastructure manually, then every deployment will be its own unique snowflake, which quickly becomes a nightmare in terms of security, compliance, reliability, maintainability, etc. If you manage everything as code, you can create standardized, reusable Terraform modules for all your infrastructure (e.g., EKS, EC2, RDS, S3, etc.) that have been built and tested to meet your company requirements (security, compliance, reliability, etc.).
  • Speed and safety: If all of your infrastructure is defined as code, then deployments will be significantly faster, since a computer can carry out deployment steps far faster than a person, and safer, given that an automated process will be more consistent, more repeatable, and not prone to manual error.
  • Leverage: If you manage your infrastructure manually, then every deployment requires following the same tedious, repetitive steps: there’s no way to leverage previous work. But if your infrastructure is defined as code, all of that knowledge is captured in a way that’s reusable: if one team figures out how to deploy an EKS cluster effectively using a Terraform module, all other teams can immediately use that Terraform module, and get huge leverage from all that hard work. And it doesn’t even have to be your own team that does that hard work: you can leverage battle-tested modules from the Gruntwork Library.
  • Testing (“Shift Left”): If the state of your infrastructure is defined in code, then for every single change, you can perform code reviews and run a variety of automated tests, including security scans, policy checks, compliance checks, functional tests, and so on (I’ll talk more about testing later, in Ingredient #3: CI / CD Foundations).

2.3 Deploy new infrastructure via scaffolding and GitOps

Configuring the EKS Cluster module from the Gruntwork Library for deployment in the “sandbox” account.

If you want to manage everything as code, then to deploy new infrastructure, such as deploying an EKS cluster into the “sandbox” account (as shown in the video above), you need to write some code! Here’s how you make this a self-service experience that is accessible to your entire dev team:

  • Scaffold out the code. Instead of trying to figure out how to write the code from scratch, with every developer doing it slightly differently (which can become a nightmare from a maintenance standpoint), you use scaffolding tools to generate the basic code structure with all your company’s standardized coding patterns built in. These tools could be web-based (e.g., the video above shows the Gruntwork docs site’s “sample usages”) or CLI-based (e.g., Gruntwork provides a CLI tool called boilerplate that can scaffold out code based on Go templates), and they can even be interactive (e.g., boilerplate prompts the user for inputs and generates the code based on their responses).
  • Configure the infrastructure. Once the basic code structure is generated by the scaffolding tool, you follow the comments in the generated code to fill in parameters to configure the infrastructure to meet your needs.
  • Open a pull request (“GitOps”). Commit your code to a branch and open a PR. This is a GitOps-driven approach where all actions are driven via changes in your version control system. This way, your version control system becomes an audit log of all changes.
  • Let the CI / CD pipeline do the rest. Once you open a PR, the CI / CD pipeline will handle the rest, including enforcing code reviews, running tests, and deploying the EKS cluster into your new “sandbox” account. You’ll see how this works in the next section.
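
Putting those steps together, the scaffolded code might look something like this hypothetical example, where the standardized structure is already in place and comments tell you what to fill in before opening the PR (the module source, inputs, and defaults are all illustrative):

```
# Generated by your scaffolding tool. Fill in the TODOs, commit to a branch,
# and open a pull request; the CI / CD pipeline takes it from there.
module "eks_cluster" {
  source = "git::https://github.com/your-org/infrastructure-catalog.git//modules/eks-cluster?ref=v1.2.0"

  # TODO: pick a unique name for this cluster
  cluster_name = "sandbox-eks"

  # TODO: pick a supported Kubernetes version
  kubernetes_version = "1.27"

  # TODO: size the worker nodes for your workload
  instance_type = "t3.large"
  min_size      = 3
  max_size      = 6
}
```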

Ingredient #3: CI / CD Foundations

The third ingredient is about using CI / CD pipelines and workflows to do all deployments. This gives you a secure, automated, tested, audited, and reproducible way to make changes to your infrastructure.

Here’s how it works:

  1. Use a CI / CD pipeline to do all deployments
  2. Secure your CI / CD pipeline

3.1 Use a CI / CD pipeline to do all deployments

The CI / CD pipeline runs ‘plan’ and ‘apply’ on the EKS cluster code to deploy into the “sandbox” account.

All deployments are done via a CI / CD pipeline. This ensures that all changes must be done as code and that every change goes through a systematic, vetted, audited series of validations. In other words, your CI / CD pipeline captures your company’s workflows, processes, and policies as code, including:

  • GitOps-driven changes: The CI / CD pipeline is triggered by changes in version control: e.g., commits, pushes, PRs, etc.
  • Translate Git changes to infrastructure operations. The CI / CD pipeline automatically translates Git events to higher-level infrastructure operations. For example, if your Git commits include modifying Terraform code, the CI / CD pipeline translates this to a module update. If your Git commits include deleting a folder with Terraform code, the CI / CD pipeline translates this to a module delete. If your Git commits make changes in the sandbox folder, the CI / CD pipeline translates this to a change in the “sandbox” account, and will know to authenticate to that account to deploy the changes. In response to these infrastructure operations, you run various workflows, as described in the next bullet point and the section after that.
  • plan, apply, and destroy workflows: When you open a PR that includes module updates, the CI / CD pipeline automatically runs terraform plan and adds the plan output as a PR comment; when you merge the PR, the pipeline runs terraform apply. When you open a PR that includes module deletes, the CI / CD pipeline checks out the version of the code just before deletion, runs terraform plan -destroy, and adds the destroy plan output as a PR comment; when you merge the PR, it runs terraform destroy. If a PR contains multiple updates, the pipeline automatically processes them all concurrently.
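
One convention that makes this Git-to-infrastructure translation possible is to have each top-level folder in the repo correspond to one AWS account, with a small config file recording that account’s details; here’s a hypothetical sketch (the layout, file names, and attributes are illustrative):

```
# Hypothetical repo layout: each top-level folder maps to one AWS account, so
# a change under sandbox/ tells the pipeline to authenticate to the "sandbox"
# account before running plan or apply.
#
#   live/
#   ├── sandbox/
#   │   ├── account.hcl
#   │   └── eks-cluster/
#   └── prod/
#       ├── account.hcl
#       └── eks-cluster/

# live/sandbox/account.hcl
locals {
  account_name = "sandbox"
  account_id   = "123456789012"
}
```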

3.2 Secure your CI / CD pipeline

Approval workflow: PR is merged; notification in Slack; review the ‘plan’; if it looks good, approve the job; ‘apply’ runs.

Since the CI / CD pipeline is your gateway to production, it’s a popular target for attackers (and some of the real-world CI / CD exploits are scary and non-obvious), and therefore, it’s critical to think through how to keep your pipeline secure. The best approach is a defense-in-depth strategy, where you build multiple layers of protections into the pipeline, including:

  • OIDC for AWS access: To deploy changes into AWS, your CI / CD pipeline needs access to AWS credentials. The traditional approach is to manually create these credentials and copy them over to your CI / CD system, but permanent credentials and manually-managed credentials are tedious and not particularly secure. Here’s a better alternative: the account factory you use to create new accounts not only configures SSO to give access to human users, it also configures IAM roles and OpenID Connect (OIDC) identity providers to give access to machine users. This allows your CI / CD pipeline to securely authenticate to AWS with credentials that are automatically managed and rotated, which is simpler and more secure.
  • Isolated workers: Your CI / CD pipeline typically needs powerful credentials: e.g., to be able to deploy arbitrary infrastructure changes in AWS (e.g., with Terraform), your pipeline likely needs credentials with arbitrary permissions—that is, admin permissions. Even if you manage those credentials with OIDC, if your whole dev team can execute arbitrary code using that CI / CD pipeline, then you’ve effectively made every developer on your team an admin—not to mention created a huge, tempting surface area for attackers! Here’s a better approach: instead of giving your CI / CD pipelines direct access to powerful credentials, only give an isolated worker (e.g., an isolated ECS service or isolated CI / CD pipeline) those permissions, give only a small number of trusted admins access to modify the isolated worker, and give your CI / CD pipelines permissions to trigger the isolated worker, but only via a limited API: e.g., the API only allows you to run a list of approved commands (e.g., terraform plan and terraform apply) in approved repos / branches / folders. This significantly reduces your exposure to attacks, as developers (and attackers) no longer have direct access to powerful credentials: all they can do is request an isolated worker perform a limited set of actions.
  • Version control protection: Even if you protect your CI / CD credentials with OIDC and isolated workers, eventually, your CI / CD pipeline or an isolated worker will use those credentials to execute code in your version control system, so if that code is compromised, it’s game over. Therefore, it’s essential to protect your version control system with tools such as protected branches and AllStar to ensure that every single code change is code reviewed by at least 1 person who isn’t an author before it can be merged.
  • Approval workflows: In addition to protecting your version control system, you can require approval from specific teams or individuals for deployments to certain environments (e.g., prod) or changes to certain infrastructure (e.g., your database). This gives you one final chance to check what code is being executed and check the plan output before you run apply with sensitive credentials. The video above shows an example of an approval workflow.
  • Access controls: You can also define access control lists (ACLs) for which parts of your infrastructure can be modified by which teams, and automatically block any deployment that violates your ACLs.
  • Automated tests (“shift left”): Another key layer of protection is to run every code change through a variety of automated tests, including static analysis and security scans (e.g., checkov, tfsec, tflint), plan tests and policy checks (e.g., Terratest, OPA), and functional tests (e.g., unit, integration, and end-to-end tests with Terratest).
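
Here’s a minimal sketch of the OIDC layer described above, shown for GitHub Actions (the same idea applies to other CI systems); the repo, branch, and role name are placeholders, and in practice you’d attach tightly scoped permissions or restrict the role to triggering the isolated worker:

```
# Let workflows in one specific repo and branch assume an IAM role via OIDC,
# with no long-lived AWS credentials stored in the CI system.
resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["<current-github-oidc-thumbprint>"]
}

resource "aws_iam_role" "ci_cd_deploy" {
  name = "ci-cd-deploy"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRoleWithWebIdentity"
      Principal = { Federated = aws_iam_openid_connect_provider.github.arn }
      Condition = {
        StringEquals = {
          "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
        }
        StringLike = {
          # Only workflows from this repo and branch can assume the role
          "token.actions.githubusercontent.com:sub" = "repo:your-org/infrastructure-live:ref:refs/heads/main"
        }
      }
    }]
  })
}
```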

Ingredient #4: Maintenance Foundations

The fourth ingredient is about automatically keeping all your dependencies up to date. This allows you to keep your infrastructure working, even as everything under the hood is changing and evolving.

Here’s how it works:

  1. Use automated updates and patching to stay up to date
  2. Roll out updates with promotion workflows

4.1 Use automated updates and patching to stay up to date

An example of auto update and auto patching after new module version is released.

Everything in the DevOps world is changing all the time. For example, every quarter, a new version of Kubernetes and EKS comes out, and an older version is deprecated; several times per year, a new version of Terraform or a Terraform provider is released, sometimes with breaking changes; even compliance standards keep evolving, such as the CIS AWS Foundations Benchmark, which has gone through versions 1.2, 1.3, 1.4, 1.5, and 2.0, all in the span of just a few years. These constant changes mean that merely setting up the first three ingredients of DevOps Foundations isn’t enough; you also have to carve out time on an ongoing basis to keep everything up to date:

  • Release new module versions: Since all of your infrastructure is defined as code, the updates you do should be released as new versions of those modules. For example, if a new version of EKS comes out, you might update your eks-cluster module to work with that new version, and release a new version of your eks-cluster module, so all the teams at your company can pick it up. Note that Gruntwork can offload much of this work for you: we provide commercial maintenance for the Gruntwork Library, so each time there’s a new version of Kubernetes / EKS, or a new version of Terraform or Terraform providers, or a new version of a CIS benchmark, we update our modules, and release the changes as new versions.
  • Automated updates: Between changes in the DevOps world, changes to your own modules, and changes to your dependencies, you’ll see thousands of releases each year, so trying to stay up to date manually is a lost cause. Instead, you use an automated tool (e.g., Patcher) to regularly scan your code, find dependencies, look up if new versions of those dependencies are available, and if so, update your usages to the new versions, and open a PR with the changes, as shown in the video above. That way, the changes come to you automatically, the CI / CD pipeline runs tests against them automatically, and all your team has to do is review them, and decide if/when to merge them.
  • Automated patching: New releases sometimes include backward incompatible changes, which means that to update to the new version, you need to not only update the version number, but make other code changes as well. Today, this is mostly handled by the maintainer writing up a migration guide for you to follow manually, but a better approach is to use automated patching. The idea is that if you’re a maintainer of a module, and your new release of that module includes breaking changes, you can include a patch with that release to automatically transform the code of anyone using that module, and fix the backward incompatibility. For example, the video above shows a maintainer releasing a new version of an eks-cluster module, which includes a backward incompatible change: an input variable is renamed from fargate to enable_fargate. The maintainer of the module includes a patch in the release, so when the automated tooling runs and picks up the new release, it will apply the patch, and open a PR that includes not only a version number bump, but also the renamed variable. So now your team doesn’t have to spend time following migration guides manually: all you do is review the PR and test output, and decide if/when to merge it.
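
To illustrate, here’s a hypothetical before/after of the kind of PR that automated updates and patching might open for the eks-cluster example above (the module source and version numbers are made up):

```
# Before: pinned to the old release, using the old input variable name.
module "eks_cluster" {
  source  = "git::https://github.com/your-org/infrastructure-catalog.git//modules/eks-cluster?ref=v1.4.0"
  fargate = true
}

# After the automated PR: the version is bumped to the new release, and the
# patch has renamed `fargate` to `enable_fargate` for you.
module "eks_cluster" {
  source         = "git::https://github.com/your-org/infrastructure-catalog.git//modules/eks-cluster?ref=v2.0.0"
  enable_fargate = true
}
```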

4.2 Roll out updates with promotion workflows

An example of promoting an update to the eks-cluster module from dev to stage to prod.

A naive approach to automated updates is to update all usages across all environments (e.g., dev, stage, prod) and all teams, all at the same time; of course, in the real world, that’s not how we update software, as it means you might roll out a change to prod before you’ve had a chance to test it in dev or stage! The better approach is to update your environments one at a time, promoting the changes from “lower” environments to “higher” ones: e.g., dev → stage → prod.

For example, the video above shows the following workflow:

  1. Auto-update dev: first, a PR is opened that solely updates the dev folder (which corresponds to the dev environment). The CI / CD pipeline automatically runs validations on the PR, including terraform plan, and if the validations, the plan output, and the code changes all look good, you merge the PR, and the CI / CD pipeline automatically runs terraform apply to deploy it.
  2. Auto-update stage: Once the dev changes are successfully deployed, a PR is automatically opened in stage, promoting the same changes from dev. Again the CI / CD pipeline runs validations, and if everything looks good, you merge the PR, and the CI / CD pipeline deploys it to stage.
  3. Auto-update prod: Once the stage changes are successfully deployed, a PR is automatically opened in prod. You repeat the process in prod one more time, and then you’re done!
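
Under the hood, each environment folder pins its own version of the module, so mid-promotion the pins might look like this sketch (paths and version numbers are illustrative):

```
# dev/eks-cluster/main.tf -- already bumped and deployed
module "eks_cluster" {
  source = "git::https://github.com/your-org/infrastructure-catalog.git//modules/eks-cluster?ref=v2.0.0"
}

# stage/eks-cluster/main.tf -- PR open, awaiting review
module "eks_cluster" {
  source = "git::https://github.com/your-org/infrastructure-catalog.git//modules/eks-cluster?ref=v2.0.0"
}

# prod/eks-cluster/main.tf -- still on the previous release
module "eks_cluster" {
  source = "git::https://github.com/your-org/infrastructure-catalog.git//modules/eks-cluster?ref=v1.4.0"
}
```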

This is a great example of what happens when all four ingredients of DevOps Foundations are combined: the Code Foundations ensure all your infrastructure is managed as code; the Maintenance Foundations ensure that your code is kept up to date automatically, opening PRs each time there are new releases; the CI / CD Foundations ensure that each PR is thoroughly validated and automatically deployed; and the Account Foundations set up the secure account structure into which the CI / CD pipeline authenticates and deploys.

Try it out

Now that you’ve had a chance to see all four ingredients, and the experience you get when you put them together the right way, there’s one more thing you should know:

Gruntwork can set you up with rock-solid DevOps Foundations in days.

You don’t have to spend months or years setting this all up; you can have it in a matter of days, because for each ingredient of DevOps Foundations, we offer battle-tested, off-the-shelf products.

Each of these has been proven in production at hundreds of companies, and we know how to put them all together to create a magical experience—in a matter of days.

Interested in trying it out? Subscribe now!

Want to learn more? Set up a chat or demo with our sales engineers!


Co-founder of Gruntwork, Author of “Hello, Startup” and “Terraform: Up & Running”