How to clean up custom AMIs in order to use them with AWS Elastic Beanstalk

The issue

This time I needed to make some modifications for an application managed by AWS Elastic Beanstalk. I had to modify something on the host system, which meant creating a new AMI for Elastic Beanstalk to use. At first I didn’t take care of cleaning up the EC2 instance before creating the AMI. As a result, newly launched instances already contained some application code and, above all, some old Elastic Beanstalk configuration. Unfortunately not all of that configuration was overridden during the initial deployment. In my case the name of the attached SQS queue wasn’t updated (for my observations on the SQS queue configuration, see the additional comments at the end of this post).

The solution

You have to delete certain directories before creating the AMI. I couldn’t find any official AWS tutorial or Stack Overflow post about which directories to delete, which is why I want to summarize it here. It’s difficult to give general instructions, as Elastic Beanstalk supports a large number of different setups (Web server environment vs. Worker environment, Docker vs. Multi-container Docker, Go vs. .NET vs. Java vs …). You can use the following commands as a starting point. If you want to add something, feel free to leave a comment!

So let’s start:

Delete the directory containing the application code:

rm -rf /opt/elasticbeanstalk/

Depending on which platform (Go, Java, Python,…) you are using, you should delete the directory containing the executables, too. In my case it was Python, which Elastic Beanstalk also installs under /opt:

rm -rf /opt/python/

Elastic Beanstalk uses different software as proxy servers for processing HTTP requests. For Python it’s Apache. Visit https://docs.aws.amazon.com/elasticbeanstalk/latest/platforms/platforms-supported.html#platforms-supported.python to see which platforms use Apache, nginx or IIS in the preconfigured AMIs.

So keep in mind the directories containing configuration files for Apache, nginx or IIS. For Apache you find them at:

/etc/httpd/

The most important files are probably:

  • /etc/httpd/conf/httpd.conf
  • /etc/httpd/conf.d/wsgi.conf
  • /etc/httpd/conf.d/wsgi_custom.conf (if you modified the wsgi settings)


Optional:

Delete the logfiles created and filled by Elastic Beanstalk (to avoid seeing old log entries in the Elastic Beanstalk GUI during the initial deployment):

rm /var/log/eb-activity.log /var/log/eb-cfn-init-call.log /var/log/eb-cfn-init.log /var/log/eb-commandprocessor.log /var/log/eb-publish-logs.log /var/log/eb-tools.log

If you are using Elastic Beanstalk as a worker environment and you have attached an SQS queue, you can delete the corresponding log directory, too:

rm -rf /var/log/aws-sqsd/
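The commands above could be bundled into one small function. This is only a sketch under a few assumptions: the function name cleanup_eb_leftovers and its optional root-prefix argument are made up for this example (the prefix exists so the function can be tried against a scratch directory first), and /opt/python only applies to the Python platform.

```shell
#!/bin/bash

# Sketch: remove the Elastic Beanstalk leftovers listed above.
# $1 is a hypothetical root prefix; leave it empty to work on the real filesystem.
cleanup_eb_leftovers() {
    local root="${1:-}"
    local path name

    # application code, platform executables (Python in my case), sqsd logs
    for path in /opt/elasticbeanstalk /opt/python /var/log/aws-sqsd; do
        rm -rf -- "${root}${path}"
    done

    # logfiles written by Elastic Beanstalk
    for name in eb-activity eb-cfn-init-call eb-cfn-init \
                eb-commandprocessor eb-publish-logs eb-tools; do
        rm -f -- "${root}/var/log/${name}.log"
    done
}
```

On the instance itself you would call cleanup_eb_leftovers without an argument, as root, right before creating the AMI.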

Additional comments

I was really surprised that I didn’t find anything about the configuration file for the SQS queue. The only more detailed information about Elastic Beanstalk and SQS queues I found was https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features-managing-env-tiers.html which wasn’t very helpful for me but still interesting to read (especially regarding the HTTP headers for processing SQS messages).

The configuration is saved at /etc/aws-sqsd.d/default.yaml and has the following format:

---
http_connections: 10
http_port: 80
verbose: false
inactivity_timeout: 9
healthcheck: TCP:80
environment_name: My-ElasticBeanstalk-Environment-Example
queue_url: https://sqs.us-east-1.amazonaws.com/123456789/my-sqs-queue
dynamodb_ssl: true
quiet: false
via_sns: false
retention_period: 60
sqs_ssl: true
threads: 50
mime_type: application/json
error_visibility_timeout: 2
debug: false
http_path: /
sqs_verify_checksums: true
connect_timeout: 2
visibility_timeout: 10
keepalive: true

During the initial deployment this file was not updated. Deleting the file before creating the AMI didn’t help either. I assume that it is generated from files in /opt/elasticbeanstalk, but using grep to find out which configuration files default.yaml is generated from didn’t yield anything. After a later deployment (manual or automatic) the file was updated with the correct SQS queue name. I assume this applies to the other settings, too.

If you know how this YAML file is generated, please leave a comment. I would be very interested to know the details.
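To check quickly whether a freshly launched instance still carries the old queue, one could read a single value out of that file. A minimal sketch: the function name sqsd_setting is hypothetical, and it assumes the file stays as flat as in the example above (one "key: value" pair per line), so it is not a real YAML parser.

```shell
#!/bin/bash

# Print the value of one top-level key from a flat "key: value" file.
# $1 = key, $2 = file (defaults to the sqsd configuration shown above)
sqsd_setting() {
    local key="$1"
    local file="${2:-/etc/aws-sqsd.d/default.yaml}"
    # split on ": " so values containing a colon (URLs, TCP:80) stay intact
    awk -v key="$key" -F': ' '$1 == key { print $2; exit }' "$file"
}
```

Running sqsd_setting queue_url on the new instance then shows at a glance which queue the daemon is configured for.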

Self-healing of AWS Elastic Beanstalk EC2 instances

The issue

This time the problem was that an application managed by AWS Elastic Beanstalk kept failing because of memory errors, and as a first step I needed a quick fix. Replacing the underlying EC2 instance was the way I wanted to go for now.

The solution

My first thought was to create an ALB/ELB just for health checks. This works when the autoscaling group is configured to use ELB as its health check type (see https://docs.aws.amazon.com/autoscaling/ec2/userguide/healthcheck.html). The disadvantage of this solution is that you have to pay for a load balancer you don’t otherwise need.

In the end I decided to write a shell script that runs on each instance, so the instance takes care of itself. The script checks a logfile. If there are too many errors, it executes an aws autoscaling CLI command to terminate the instance without decreasing the desired capacity. As a consequence the autoscaling group launches a new instance.

Elastic Beanstalk writes several logfiles by default. The page https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/health-enhanced-serverlogs.html describes the different logfiles and their paths. I decided to use the application logs at /var/log/nginx/healthd/ (when Apache is the proxy, as in the script below, the directory is /var/log/httpd/healthd/).

The log entries contain different values like the time of the request, the path of the request and so on. For me the HTTP status code of the result was important. It’s the third column; the delimiter is ". Here is an example:

1437609879.311"/"200"0.083"0.083"177.72.242.17

In this case the http status code is 200.
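Splitting such a line on the double quote puts the time in field 1, the path in field 2 and the status code in field 3. A small sketch of that, using the example line from above (the function name healthd_status is made up for illustration):

```shell
#!/bin/bash

# Print the HTTP status code of one healthd log line.
# Fields are separated by a double quote: time, path, status, ...
healthd_status() {
    printf '%s\n' "$1" | awk -F'"' '{ print $3 }'
}

healthd_status '1437609879.311"/"200"0.083"0.083"177.72.242.17'   # prints 200
```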

awk tool for parsing log lines

I used the awk tool to look at the third column of each logfile entry and to count the entries whose HTTP status code is 500:

awk 'BEGIN { FS="\"" ; count=0 } $3=="500" { ++count ; } END { print count ; } '

As there can be several application logs at /var/log/nginx/healthd/, I only wanted to consider the newest logfile. So I used this small snippet to get the newest logfile:

ls -t1 ./ |  head -n 1
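Wrapped in a function, the same idea could return the full path, so the caller doesn’t have to cd into the directory first. The function name newest_logfile is hypothetical, and like the one-liner above it relies on ls output, which is fine as long as the file names contain no newlines.

```shell
#!/bin/bash

# Print the full path of the most recently modified file in a directory.
# $1 = directory (defaults to the healthd log directory mentioned above)
newest_logfile() {
    local dir="${1:-/var/log/nginx/healthd}"
    local name
    # -t sorts by modification time, newest first; -1 prints one name per line
    name=$(ls -t1 -- "$dir" | head -n 1)
    # print nothing (and return non-zero) if the directory is empty
    [ -n "$name" ] && printf '%s/%s\n' "$dir" "$name"
}
```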

As the app fails regularly, I only wanted to replace the instance once a certain threshold of 500 status codes was exceeded. I saved the number of 500s in a variable, defined the allowed share of 500s and defined the number of logfile lines I want to consider. As I needed floating-point arithmetic to calculate the percentage, I used the tool “bc”.
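As a variation, awk can do the floating-point comparison itself, so the threshold check also works where bc is not installed. The following is a sketch of that alternative, not the approach I used: the function name too_many_500s is made up, and unlike my script it divides by the number of lines actually read, which matters when the logfile has fewer lines than requested.

```shell
#!/bin/bash

# Print 1 if the share of 500 responses among the last N lines of a
# healthd logfile exceeds the threshold, otherwise print 0.
# $1 = logfile, $2 = number of lines (default 30), $3 = threshold (default 0.60)
too_many_500s() {
    local file="$1"
    local lines="${2:-30}"
    local threshold="${3:-0.60}"
    tail -n "$lines" "$file" | awk -F'"' -v max="$threshold" '
        $3 == "500" { errors++ }
        END {
            # an empty logfile counts as healthy
            if (NR == 0) { print 0; exit }
            result = (errors / NR > max) ? 1 : 0
            print result
        }'
}
```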

The aws autoscaling CLI supports the command terminate-instance-in-auto-scaling-group (see https://docs.aws.amazon.com/cli/latest/reference/autoscaling/terminate-instance-in-auto-scaling-group.html). It’s important to use the flag --no-should-decrement-desired-capacity so that the failed instance is replaced and the capacity is not decreased.

Here is the final script now.

The final shell script

#!/bin/bash

# get the instance id
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
# number of logfile lines to be regarded, default value: 30
NUMBER_LOGFILE=${1:-30}
# allowed fraction of failed requests, default value: 0.60 (= 60%)
PERCENTAGE_OK=${2:-0.60}


# abort if the log directory is missing, so nothing below runs in the wrong place
cd /var/log/httpd/healthd/ || exit 1
# number of 500s in the newest logfile
COUNTER=$(tail -n "$NUMBER_LOGFILE" "$(ls -t1 ./ | head -n 1)" | awk 'BEGIN { FS="\"" ; count=0 } $3=="500" { ++count ; } END { print count ; } ')
# calculate fraction of 500s
RESULT=$(echo "$COUNTER / $NUMBER_LOGFILE" | bc -l)


# if there are more failed requests than allowed in variable $PERCENTAGE_OK --> terminate this instance and replace it with a new one
if [ "$(echo "$RESULT > $PERCENTAGE_OK" | bc)" = 1 ]
then
    echo "$(date) too many failed requests $(echo "scale=2; ($RESULT*100)/1" | bc -l)%, terminating instance..." && \
    aws autoscaling terminate-instance-in-auto-scaling-group --instance-id "$INSTANCE_ID" --no-should-decrement-desired-capacity
else
    echo "$(date) instance is healthy...only $(echo "scale=2; ($RESULT*100)/1" | bc -l)% requests failed."
fi

I created a cronjob which runs the script every minute. The script supports two parameters. The first one defines the number of logfile lines to consider for the health check. The second one defines the accepted threshold of failed requests as a fraction (0.6 means 60%). The cronjob could look like:

* * * * * root /bin/bash /root/healthcheck.sh 30 0.6 >> /var/log/healthcheck-cron.log  2>&1

Don’t forget to add a newline at the end of the cronjob file! Otherwise the cronjob is not executed. See https://manpages.debian.org/stretch/cron/crontab.5.en.html
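A quick way to verify the trailing newline before relying on the cronjob could look like this. It is only a sketch; the function name ends_with_newline and the path in the usage note are assumptions, not something cron provides.

```shell
#!/bin/bash

# Succeed if the file is non-empty and its last byte is a newline.
ends_with_newline() {
    local file="$1"
    # command substitution strips a trailing newline, so an empty result
    # means the last byte was a newline
    [ -s "$file" ] && [ -z "$(tail -c 1 "$file")" ]
}
```

For example, ends_with_newline /etc/cron.d/healthcheck || echo "missing newline" would print a warning for a file that cron would ignore (the file name here is hypothetical).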