Yak Shaving Series #1: All I need is a little bit of disk space

A horror movie featuring Auto Scaling Groups, EBS Volumes, Terraform, and Bash

Yevgeniy Brikman
Published in Gruntwork
9 min read · Jul 4, 2017


Image by dgrosso23

This is the first entry in The Yak Shaving Series, where we share stories of some of the unexpected, bizarre, painful, and time-consuming problems we’ve had to solve while working on DevOps and infrastructure.

The request looked straightforward: one of our customers sent us an email asking for more disk space on their servers. At Gruntwork, we set all of our customers up in AWS with infrastructure as code using Terraform, so I figured all I needed to do was make a simple tweak in their Terraform code, run terraform apply, and I’d be done.

“No problem,” I said, “it shouldn’t take more than an hour.”

You know that feeling when you’re home at night, completely alone, and you hear a strange noise in the basement, and think to yourself, “this is how every horror movie starts?” Well, next time you hear a programmer say, “it shouldn’t take more than an hour,” you should think to yourself, “this is how every yak shaving incident starts.”

Investigating the strange noise…

I opened up the customer’s Terraform code and immediately discovered the first complication: they were storing business-critical data on the servers’ “root volumes.” If one of those servers crashes, there is no easy way to migrate its root volume to a replacement server. Therefore, we would need to switch to using EBS Volumes, which live separately from the servers, are replicated, and can be attached/detached over the network.

Creating EBS Volumes in Terraform is straightforward:

resource "aws_ebs_volume" "volumes" {
  count             = 3
  size              = 200
  availability_zone = "${element(data.aws_availability_zones.all.names, count.index)}"
}

data "aws_availability_zones" "all" {}

The code above creates 3 EBS Volumes, each 200GB in size, and puts each one in a separate Availability Zone (i.e., isolated AWS data center). Now we needed to attach these Volumes to our servers.

This revealed the next complication: it turned out the customer was running the servers in an Auto Scaling Group (ASG). An ASG is a nice way to manage multiple servers, but unfortunately, if you want to reuse an EBS Volume across crashes or deployments, you can’t use the ebs_block_device settings in the ASG’s configuration (as this setting creates a brand new EBS Volume for each server) and you can’t use the aws_volume_attachment resource (as this setting doesn’t work with ASGs).

This meant I couldn’t do this work in a few lines of Terraform code. Instead, I’d have to write a Bash script to attach the Volumes manually. I emailed the customer again and said, “this may take a few hours, but I should have something for you by the end of the day.”

This is the scene in the horror movie where the actor slowly heads down to the basement to investigate the strange noise, peers into the darkness, and says, “Hello?”

The shrill music begins

As I started working on the Bash script, I stumbled across a new problem: the servers in an ASG can boot at any time and in any order (e.g., during the initial rollout, or when replacing a node after a crash). Since the set of servers keeps changing, there is no way to assign each EBS Volume to a specific server. Instead, the Bash script would have to iterate through all the EBS Volumes until it found one that wasn’t already attached. Here’s the rough structure of the Bash script:

function attach_ebs_volume {
  local readonly ebs_volume_ids=($@)
  for ebs_volume_id in "${ebs_volume_ids[@]}"; do
    try_to_attach_ebs_volume "$ebs_volume_id"
    if [[ $? -eq 0 ]]; then
      echo "Successfully attached EBS Volume $ebs_volume_id"
      exit 0
    fi
  done
  echo "ERROR: Unable to attach any of the volumes."
  exit 1
}

The attach_ebs_volume function loops over all the EBS Volume IDs and tries to attach each one by calling the try_to_attach_ebs_volume function. If it succeeds (exit code of zero), we’re done, and the script exits; otherwise, we try the next EBS Volume. The try_to_attach_ebs_volume function looks like this:

function try_to_attach_ebs_volume {
  local readonly ebs_volume_id="$1"
  aws ec2 attach-volume \
    --region "$(get_aws_region)" \
    --instance-id "$(get_instance_id)" \
    --device "$(get_available_device)" \
    --volume-id "$ebs_volume_id"
}

The code above uses the attach-volume command of the AWS CLI to try to attach the specified EBS Volume. There is a problem with this code: the attach-volume command could fail not only because the Volume was already attached, in which case it was OK to keep trying, but also for many other reasons, such as lack of IAM permissions to attach EBS Volumes, in which case it would be better to exit the script entirely. Checking the AWS CLI docs, I hit yet another problem: just about all errors return the same exit code, 255.
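To make the error-matching idea concrete, here’s a sketch of the pattern (this helper is mine, not from the actual script): a `classify_attach_result` function that captures a command’s combined output and exit code, plus a stand-in `fake_attach_volume` that mimics the CLI’s behavior, since the real attach-volume call needs AWS credentials:

```shell
# Stand-in for `aws ec2 attach-volume`: prints an AWS-style error message to
# stderr and exits with the CLI's catch-all error code, 255.
function fake_attach_volume {
  echo "An error occurred (VolumeInUse) when calling the AttachVolume operation" >&2
  return 255
}

# Run the given command, capture stdout AND stderr, and classify the result
# by matching on the error message, since the exit code alone is ambiguous.
function classify_attach_result {
  local output
  local exit_code
  output=$("$@" 2>&1)
  exit_code=$?
  if [[ $exit_code -eq 255 && "$output" == *"VolumeInUse"* ]]; then
    echo "already-attached"   # safe to try the next Volume
  elif [[ $exit_code -ne 0 ]]; then
    echo "fatal"              # e.g. missing IAM permissions: give up entirely
  else
    echo "attached"
  fi
}

classify_attach_result fake_attach_volume   # prints "already-attached"
```

Note that `2>&1` inside the command substitution is what folds the CLI’s stderr into the captured output so the string match can see it.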

Gah. Things were getting more and more complicated. I emailed the customer yet again and told them that I wouldn’t have the work done today, but surely, it would be ready tomorrow.

After slowly making his way through the dark basement, the hero in our horror movie comes upon a mysterious door. The music grows louder. You just know something horrible is about to happen.

The jump scare

To figure out what kind of error I was getting, I would need to capture stdout and stderr, and search within them for the VolumeInUse error message. I updated the try_to_attach_ebs_volume function to do just that:

function try_to_attach_ebs_volume {
  local readonly ebs_volume_id="$1"

  local readonly output=$(aws ec2 attach-volume \
    --region "$(get_aws_region)" \
    --instance-id "$(get_instance_id)" \
    --device "$(get_available_device)" \
    --volume-id "$ebs_volume_id" \
    2>&1)
  local readonly exit_code=$?

  if [[ $exit_code -eq 255 && $output == *"VolumeInUse"* ]]; then
    echo "Volume is already attached to another server"
  elif [[ $exit_code -ne 0 ]]; then
    echo "Got some other type of error: $output"
  else
    echo "Volume attached successfully!"
  fi
}

This is when things got really weird. On every server, whether or not the attach-volume command succeeded, every time I ran the code, I would get the exact same output:

Volume attached successfully!

I added an echo statement to the code to print the exit code:

function try_to_attach_ebs_volume {
  local readonly ebs_volume_id="$1"

  local readonly output=$(aws ec2 attach-volume \
    --region "$(get_aws_region)" \
    --instance-id "$(get_instance_id)" \
    --device "$(get_available_device)" \
    --volume-id "$ebs_volume_id" \
    2>&1)
  local readonly exit_code=$?
  echo "Exit code = $exit_code"

  if [[ $exit_code -eq 255 && $output == *"VolumeInUse"* ]]; then
    echo "Volume is already attached to another server"
  elif [[ $exit_code -ne 0 ]]; then
    echo "Got some other type of error: $output"
  else
    echo "Volume attached successfully!"
  fi
}

Even when the attach-volume command showed errors, I would always see:

Exit code = 0
Volume attached successfully!

At this point, I began suspecting that the AWS CLI had a bug where it always returned a 0 exit code, but from a quick Google search, it didn’t seem like anyone else was having this issue. I decided to try another experiment, replacing the attach-volume command with a call to false, which always exits with a code of 1:

function try_to_attach_ebs_volume {
  local readonly output=$(false)
  local readonly exit_code=$?
  echo "Exit code = $exit_code"

  if [[ $exit_code -eq 255 && $output == *"VolumeInUse"* ]]; then
    echo "Volume is already attached to another server"
  elif [[ $exit_code -ne 0 ]]; then
    echo "Got some other type of error: $output"
  else
    echo "Volume attached successfully!"
  fi
}

I ran the code:

Exit code = 0
Volume attached successfully!

At this point, I was genuinely nervous. Here I was, ostensibly setting up disk drives to persist a customer’s invaluable data, and I couldn’t get so much as an exit code to work correctly.

The character in our horror movie is now just a foot away from the mysterious door, slowly reaching for the door knob, hand shaking, eyes wide, music screeching.

I decided to strip away all the code and reduce it down to just this:

function try_to_attach_ebs_volume {
  local readonly output=$(false)
  echo "Exit code = $?"
}

Surely, this must return an exit code of 1???

Exit code = 0

Nope.

Our character turns the knob, pulls the door open, the violins shriek, and… nothing happens. Behind the door is just a dusty, empty closet. Phew.

I tried one last experiment:

function try_to_attach_ebs_volume {
  output=$(false)
  echo "Exit code = $?"
}

Finally, after all that:

Exit code = 1

The character breathes a sigh of relief, turns around, and WHAM! Something LEAPS out of the shadows, our character is thrown sideways, SCREAMING, “aaaaaaaaagggghhhhh!!!!”

That’s what it felt like when I realized what was happening in that code snippet. It turns out that, unlike most other programming languages, in Bash, local and readonly are not keywords. They are builtin commands. And since they are commands, they have their own exit codes! Therefore, the exit code we were looking at was not from calling attach-volume or false, but from calling local, which always returns zero!
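You can see the behavior for yourself with two toy functions (these are mine, not from the original script) — one that assigns on the same line as local, and one that declares and assigns separately:

```shell
# BUG: `local` executes *after* the command substitution, so $? is the exit
# code of the `local` builtin itself (always zero), not of `false`.
function with_local {
  local output=$(false)
  echo "$?"
}

# FIX: declare the variable first, then assign on a separate line; now $?
# reflects the exit code of `false`.
function without_local {
  local output
  output=$(false)
  echo "$?"
}

with_local     # prints 0
without_local  # prints 1
```

This is why the declare-then-assign pattern is the idiomatic way to capture a command’s exit code into a local variable.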

A gory death

At this point, I had to call it a day. I didn’t even email the customer this time. I was too upset. Too shaken.

But my suffering was not done yet.

The character in our movie is still alive. Just as suddenly as it had appeared, the mysterious attacker has vanished. Leg bleeding from the initial attack, our hero picks himself up off the floor, turns, and begins limping back towards the basement stairs.

I put all my code back to the way it was, this time being careful to declare the output variable with local on its own line and to do the assignment separately:

function try_to_attach_ebs_volume {
  local readonly ebs_volume_id="$1"

  local output
  output=$(aws ec2 attach-volume \
    --region "$(get_aws_region)" \
    --instance-id "$(get_instance_id)" \
    --device "$(get_available_device)" \
    --volume-id "$ebs_volume_id" \
    2>&1)
  local readonly exit_code=$?
  echo "Exit code = $exit_code"

  if [[ $exit_code -eq 255 && $output == *"VolumeInUse"* ]]; then
    return 256
  else
    return $exit_code
  fi
}

Notice how the try_to_attach_ebs_volume function now uses different return codes depending on what happened:

  • If the Volume is already attached, I return 256 (specifically picked to not overlap with the AWS CLI exit code) to indicate to the caller that they should try again.
  • In all other cases, I return the original exit code from the attach-volume command.

I updated the attach_ebs_volume function to handle these return codes accordingly:

function attach_ebs_volume {
  local readonly ebs_volume_ids=($@)
  for ebs_volume_id in "${ebs_volume_ids[@]}"; do
    try_to_attach_ebs_volume "$ebs_volume_id"
    local readonly exit_code=$?
    if [[ $exit_code -eq 0 ]]; then
      echo "Successfully attached EBS Volume $ebs_volume_id!"
      exit 0
    elif [[ $exit_code -eq 256 ]]; then
      echo "EBS Volume $ebs_volume_id is already attached."
    else
      echo "Got an unexpected error. Exiting."
      exit 1
    fi
  done
  echo "ERROR: Unable to attach any of the volumes."
  exit 1
}

I ran this code, expecting things to finally work, and found, to my horror, that on every server, the log file contained the same exact output:

Seven days...

Ugh, no, that’s not it. Sorry. Here’s the actual output:

Successfully attached EBS Volume <ID>!

Even though I could see error messages in the logs and could tell some of the servers had not attached their Volumes successfully, all the logs said the exact same thing.

I thought I had survived the exit code fiasco, but it turns out it was still there the entire time.

The actor in our horror movie, dragging a badly injured leg, is making his way up the basement stairs, one agonizing step at a time, leaving a trail of blood behind him.

I chop out all the extraneous code from the attach_ebs_volume function and reduce it down to this:

function attach_ebs_volume {
  try_to_attach_ebs_volume "$ebs_volume_id"
  echo "exit code = $?"
}

Again, and again, and again, the logs show the same thing:

exit code = 0

I go back to the try_to_attach_ebs_volume function and cut out everything. And I mean everything:

function try_to_attach_ebs_volume {
  return 256
}

And still, the logs, every time—as if they are mocking me — show the same thing:

exit code = 0

Our actor has made it to the door of the basement and he reaches for the knob with one blood soaked hand, panting, straining. The camera focuses on his face. Suddenly, his eyes go wide, he gasps, and is VIOLENTLY jerked back by an unseen force. He vanishes back down the stairs, into the darkness. We hear screams. We hear crashing sounds. We hear inhuman noises. And then, all is still.

That sinking feeling when it finally dawned on me what was happening is indescribable. It turns out that Bash exit codes have a maximum value of 255. By returning 256, I was causing an overflow, and the value was wrapping back around to 0. So whether or not the Volume attached successfully, I was always returning 0.
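You can verify the wraparound with a couple of throwaway functions (again, mine, not from the script): Bash masks the return value to its lowest 8 bits, i.e., takes it modulo 256:

```shell
function ret_255 { return 255; }  # the highest value that survives intact
function ret_256 { return 256; }  # 256 % 256 == 0: indistinguishable from success!
function ret_257 { return 257; }  # 257 % 256 == 1

ret_255; echo "$?"   # prints 255
ret_256; echo "$?"   # prints 0
ret_257; echo "$?"   # prints 1
```

The fix is to stay within 0–255 — for example, signal “already attached” with a small nonzero code the AWS CLI never uses, instead of 256.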

The investigation

The next day, as daylight streams through the basement windows, the police examine the scene and try to make sense of what happened. But things during the day don’t look the way they did at night. There are signs of commotion, blood stains, and a body, but no one can make sense of what happened.

It’s the same way with Yak shaving. It took me several more days to wrap up this work, and until I sat down here, and wrote out all the gory details, I honestly couldn’t tell you why. In fact, before I wrote this post, if you had asked me to add a little more disk space to a server, I probably would have said, “sure, it shouldn’t take more than an hour.”

Read on for part 2 of the Yak Shaving Series, A Tale of 12 Errors: Adventures with Terraform, AWS Lambda, and KMS. Do you have your own stories of Yak Shaving? If so, share them in the comments!

Your entire infrastructure. Defined as code. In about a day. Gruntwork.io.


Co-founder of Gruntwork, Author of “Hello, Startup” and “Terraform: Up & Running”