Self-healing of AWS Elastic Beanstalk EC2 instances

The issue

This time I observed the problem that an application managed by AWS Elastic Beanstalk was often failing. The app often failed because of some memory errors and I needed to find a quick solution in the first step. Replacing the underlying EC2 instance was the way I wanted to go for now.

The solution

So my first thought was creating an ALB/ELB just for doing health checks. This would work when the autoscaling group is configured to use ELB/ALB as health check type (see https://docs.aws.amazon.com/autoscaling/ec2/userguide/healthcheck.html). The disadvantage of this solution is that you have to pay for the load balancer although you don’t need it actually.

At the end I decided to write a shell script running on each instance so the instance is taking care of itself. The script is checking a logfile. If there are (two many) errors it will execute an aws autoscaling cli command to terminate the instance without decreasing the desired capacity. As a consequence the autoscaling group will launch a new instance.

You should know that AWS Elastic Beanstalk supports several logfiles per default. On the page https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/health-enhanced-serverlogs.html the different logfiles and its paths are described. I decided to use the application logs at /var/log/nginx/healthd/.

The log entries contains different values like the time of the request, the path of the request and so on. For me the http status code of the result was important. It’s the second column, the delimiter is “. Here is an example:

1437609879.311"/"200"0.083"0.083"177.72.242.17

In this case the http status code is 200.

awk tool for parsing log lines

I used the awk tool to get the second column of the logfile entries which contains http status code 500 and counted them:

awk 'BEGIN { FS="\"" ; count=0 } $3=="500" { ++count ; } END { print count ; } '

As there can be several application logs at /var/log/nginx/healthd/ I wanted to only consider the newest logfile. So I used this small code to get the newest logfile:

ls -t1 ./ |  head -n 1

As the mentioned app fails regularly I wanted to only replace the instance when a certain threshold of 500 status codes was exceeded. I saved the number of 500s in a variable, defined the percentage of allowed 500s and defined the number of logfile entries/lines I want to respect. As I needed to do float calculations in order to calculate the percentage I used the tool “bc”.

The aws autoscaling group cli supports the method terminate-instance-in-auto-scaling-group (see https://docs.aws.amazon.com/cli/latest/reference/autoscaling/terminate-instance-in-auto-scaling-group.html). It’s important to use the flag –no-should-decrement-desired-capacity in order to replace the failed instance and to not decrease the capacity.

Here is the final script now.

The final shell script

#!/bin/bash

# get the instance id
INSTANCE_ID=`curl -s http://169.254.169.254/latest/meta-data/instance-id`
# number of logfile lines to be regarded, default value: 30
NUMBER_LOGFILE=${1:-30}
# percentage of allowed failed requests, default value: 0.60
PERCENTAGE_OK=${2:-0.60}


cd /var/log/httpd/healthd/
# number of 500s in the newest logfile
COUNTER=`tail -n $NUMBER_LOGFILE $(ls -t1 ./ |  head -n 1) | awk 'BEGIN { FS="\"" ; count=0 } $3=="500" { ++count ; } END { print count ; } '`
# calculate percentage of 500s
RESULT=$(echo "$COUNTER / $NUMBER_LOGFILE" | bc -l)


# if there are more failed requests then allowed in variable $PERCENTAGE_OK --> terminate this instance and replace it with a new one
if [ $(echo "$RESULT > $PERCENTAGE_OK" | bc) = 1 ] 
then
    echo "$(date) too many failed requests $(echo "scale=2; ($RESULT*100)/1" | bc -l)%, terminating instance...\n" && \
    aws autoscaling terminate-instance-in-auto-scaling-group --instance-id $INSTANCE_ID --no-should-decrement-desired-capacity
else
    echo "$(date) instance is healthy...only $(echo "scale=2; ($RESULT*100)/1" | bc -l)% requests failed.\n"
fi

I created a cronjob which runs the script every minute. The script supports two parameters. The first one defines the number of logfile lines to consider for the health check. The second parameter defines the accepted percentage threshold of failed requests. The cronjob could look like:

* * * * * root /bin/bash /root/healthcheck.sh 30 0.6 >> /var/log/healthcheck-cron.log  2>&1

Don’t forget to add a newline ad the end of the cronjob file! Otherwise the cronjob is not executed. See https://manpages.debian.org/stretch/cron/crontab.5.en.html