Self-healing of AWS Elastic Beanstalk EC2 instances

The issue

This time I observed that an application managed by AWS Elastic Beanstalk was failing regularly because of memory errors, and I needed a quick solution as a first step. Replacing the underlying EC2 instance was the way I wanted to go for now.

The solution

My first thought was to create an ALB/ELB just for doing health checks. This works when the Auto Scaling group is configured to use ELB/ALB as its health check type (see https://docs.aws.amazon.com/autoscaling/ec2/userguide/healthcheck.html). The disadvantage of this solution is that you have to pay for a load balancer although you don’t actually need it.
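
For reference, switching an Auto Scaling group to ELB health checks could be done with a CLI call like this (the group name is just a placeholder):

# switch the Auto Scaling group to ELB health checks (group name is a placeholder)
aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name my-asg \
    --health-check-type ELB \
    --health-check-grace-period 300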

In the end I decided to write a shell script that runs on each instance, so the instance takes care of itself. The script checks a logfile. If there are too many errors, it executes an aws autoscaling CLI command to terminate the instance without decreasing the desired capacity. As a consequence the Auto Scaling group will launch a new instance.

You should know that AWS Elastic Beanstalk provides several logfiles by default. The page https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/health-enhanced-serverlogs.html describes the different logfiles and their paths. I decided to use the application logs at /var/log/nginx/healthd/.

The log entries contain different values like the time of the request, the path of the request and so on. For me the HTTP status code of the result was important. It’s the third column, and the delimiter is the double quote character ("). Here is an example:

1437609879.311"/"200"0.083"0.083"177.72.242.17

In this case the HTTP status code is 200.

awk tool for parsing log lines

I used awk to extract the third column of the logfile entries and count the lines with HTTP status code 500:

awk 'BEGIN { FS="\"" ; count=0 } $3=="500" { ++count ; } END { print count ; } '
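
To see it in action you can pipe a sample line into the one-liner (the 500 line below is just a made-up example):

# feed a single made-up healthd log line with status 500 into the one-liner
echo '1437609879.311"/"500"0.083"0.083"177.72.242.17' | \
  awk 'BEGIN { FS="\"" ; count=0 } $3=="500" { ++count ; } END { print count ; }'
# prints: 1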

As there can be several application logs at /var/log/nginx/healthd/, I wanted to consider only the newest logfile. So I used this small snippet to get the newest logfile:

ls -t1 ./ |  head -n 1

As the mentioned app fails regularly, I wanted to replace the instance only when a certain threshold of 500 status codes was exceeded. I saved the number of 500s in a variable, defined the percentage of allowed 500s and defined the number of logfile entries/lines I want to consider. As I needed floating-point calculations to compute the percentage, I used the tool “bc”.
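
Just to illustrate the float handling, here is how the calculation works with example values (20 failed requests out of 30 lines, threshold 0.60):

# example values: 20 failed requests out of 30 considered lines, threshold 0.60
COUNTER=20
NUMBER_LOGFILE=30
PERCENTAGE_OK=0.60
RESULT=$(echo "$COUNTER / $NUMBER_LOGFILE" | bc -l)   # .66666666666666666666
echo "$RESULT > $PERCENTAGE_OK" | bc                  # prints 1, so the threshold is exceeded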

The aws autoscaling CLI supports the command terminate-instance-in-auto-scaling-group (see https://docs.aws.amazon.com/cli/latest/reference/autoscaling/terminate-instance-in-auto-scaling-group.html). It’s important to use the flag --no-should-decrement-desired-capacity so that the failed instance is replaced and the desired capacity is not decreased.
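
Called manually, the command looks like this (the instance id is just a placeholder):

# terminate this instance and let the Auto Scaling group replace it (instance id is a placeholder)
aws autoscaling terminate-instance-in-auto-scaling-group \
    --instance-id i-0123456789abcdef0 \
    --no-should-decrement-desired-capacity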

Here is the final script now.

The final shell script

#!/bin/bash

# get the instance id
INSTANCE_ID=`curl -s http://169.254.169.254/latest/meta-data/instance-id`
# number of logfile lines to be regarded, default value: 30
NUMBER_LOGFILE=${1:-30}
# percentage of allowed failed requests, default value: 0.60
PERCENTAGE_OK=${2:-0.60}


# change into the healthd log directory (use /var/log/nginx/healthd/ if the environment's proxy is nginx)
cd /var/log/httpd/healthd/
# number of 500s in the newest logfile
COUNTER=`tail -n $NUMBER_LOGFILE $(ls -t1 ./ |  head -n 1) | awk 'BEGIN { FS="\"" ; count=0 } $3=="500" { ++count ; } END { print count ; } '`
# calculate percentage of 500s
RESULT=$(echo "$COUNTER / $NUMBER_LOGFILE" | bc -l)


# if there are more failed requests than allowed by $PERCENTAGE_OK --> terminate this instance and replace it with a new one
if [ $(echo "$RESULT > $PERCENTAGE_OK" | bc) = 1 ]
then
    echo "$(date) too many failed requests $(echo "scale=2; ($RESULT*100)/1" | bc -l)%, terminating instance..." && \
    aws autoscaling terminate-instance-in-auto-scaling-group --instance-id $INSTANCE_ID --no-should-decrement-desired-capacity
else
    echo "$(date) instance is healthy...only $(echo "scale=2; ($RESULT*100)/1" | bc -l)% requests failed."
fi

I created a cronjob which runs the script every minute. The script supports two parameters: the first one defines the number of logfile lines to consider for the health check, and the second one defines the accepted percentage threshold of failed requests. The cronjob could look like this:

* * * * * root /bin/bash /root/healthcheck.sh 30 0.6 >> /var/log/healthcheck-cron.log  2>&1

Don’t forget to add a newline at the end of the cronjob file! Otherwise the cronjob is not executed. See https://manpages.debian.org/stretch/cron/crontab.5.en.html

[Errno 14] HTTP Error 403 – Forbidden – Pulling from AWS packages repo not possible

My very first post in this blog is about an issue I had using AWS Elastic Beanstalk to run an application that needs GPUs.

The issue

One day the app suddenly didn’t work anymore on some EC2 instances. The Elastic Beanstalk Environment Overview page already showed the broken environment in red. I logged in to one of the broken EC2 instances and checked some Elastic Beanstalk logs. The logfile at /var/log/eb-activity.log shed light on the issue. One of the platform hooks failed while the instance was bootstrapping. Below you can see the relevant extract of the logfile:

[2019-01-10T09:27:48.730Z] INFO  [3495]  - [Initialization/PreInitStage0/PreInitHook/02installpackages.sh] : Starting activity...
[2019-01-10T09:27:51.620Z] INFO  [3495]  - [Initialization/PreInitStage0/PreInitHook/02installpackages.sh] : Activity execution failed, because: ++ /opt/elasticbeanstalk/bin/get-config container -k python_version
  + PYTHON_VERSION=2.7
  + is_baked python_packages
  + [[ -f /etc/elasticbeanstalk/baking_manifest/python_packages ]]
  + false
  + yum install -y httpd24 gcc44 mysql-5.5 mysql-devel
  Loaded plugins: dkms-build-requires, priorities, update-motd, upgrade-helper
  http://packages.us-west-1.amazonaws.com/2016.09/graphics/efb479739386/x86_64/repodata/repomd.xml: [Errno 14] HTTP Error 403 - Forbidden
  Trying other mirror.
  To address this issue please refer to the below knowledge base article
  
  https://access.redhat.com/solutions/69319
  
  If above article doesn't help to resolve this issue please open a ticket with Red Hat Support.
  
  http://packages.us-east-1.amazonaws.com/2016.09/graphics/efb479739386/x86_64/repodata/repomd.xml: [Errno 14] HTTP Error 403 - Forbidden
  Trying other mirror.
  http://packages.us-west-2.amazonaws.com/2016.09/graphics/efb479739386/x86_64/repodata/repomd.xml: [Errno 14] HTTP Error 403 - Forbidden
  Trying other mirror.
  http://packages.eu-west-1.amazonaws.com/2016.09/graphics/efb479739386/x86_64/repodata/repomd.xml: [Errno 14] HTTP Error 403 - Forbidden
  Trying other mirror.
  http://packages.eu-central-1.amazonaws.com/2016.09/graphics/efb479739386/x86_64/repodata/repomd.xml: [Errno 14] HTTP Error 403 - Forbidden
  Trying other mirror.
  http://packages.ap-southeast-1.amazonaws.com/2016.09/graphics/efb479739386/x86_64/repodata/repomd.xml: [Errno 14] HTTP Error 403 - Forbidden
  Trying other mirror.
  http://packages.ap-northeast-1.amazonaws.com/2016.09/graphics/efb479739386/x86_64/repodata/repomd.xml: [Errno 14] HTTP Error 403 - Forbidden
  Trying other mirror.
  http://packages.ap-northeast-2.amazonaws.com/2016.09/graphics/efb479739386/x86_64/repodata/repomd.xml: [Errno 14] HTTP Error 403 - Forbidden
  Trying other mirror.
  http://packages.sa-east-1.amazonaws.com/2016.09/graphics/efb479739386/x86_64/repodata/repomd.xml: [Errno 14] HTTP Error 403 - Forbidden
  Trying other mirror.
  http://packages.ap-southeast-2.amazonaws.com/2016.09/graphics/efb479739386/x86_64/repodata/repomd.xml: [Errno 14] HTTP Error 403 - Forbidden
  Trying other mirror.
  
  
   One of the configured repositories failed (amzn-graphics-Base),
   and yum doesn't have enough cached data to continue. At this point the only
   safe thing yum can do is fail. There are a few ways to work "fix" this:
  
       1. Contact the upstream for the repository and get them to fix the problem.
  
       2. Reconfigure the baseurl/etc. for the repository, to point to a working
          upstream. This is most often useful if you are using a newer
          distribution release than is supported by the repository (and the
          packages for the previous distribution release still work).
  
       3. Disable the repository, so yum won't use it by default. Yum will then
          just ignore the repository until you permanently enable it again or use
          --enablerepo for temporary usage:
  
              yum-config-manager --disable amzn-graphics
  
       4. Configure the failing repository to be skipped, if it is unavailable.
          Note that yum will try to contact the repo. when it runs most commands,
          so will have to try and fail each time (and thus. yum will be be much
          slower). If it is a very temporary problem though, this is often a nice
          compromise:
  
              yum-config-manager --save --setopt=amzn-graphics.skip_if_unavailable=true
  
  failure: repodata/repomd.xml from amzn-graphics: [Errno 256] No more mirrors to try.

According to the error message the EC2 instance was no longer able to fetch data from the amzn-graphics repo (amzn-graphics-Base). To be exact, it couldn’t access the repomd.xml file. As you can also see from the URLs of the mirrors that were tried, e.g. http://packages.us-east-1.amazonaws.com/2016.09/graphics/efb479739386/x86_64/repodata/repomd.xml, the repo is managed by AWS.

After some research I found many suggestions like clearing the yum cache/metadata, as described at https://stackoverflow.com/questions/32483036/yum-install-fails-with-http-403-trying-to-access-repomd-xml.

Research in the official AWS forum suggested that a misconfigured subnet/ACL could be the problem (see https://forums.aws.amazon.com/thread.jspa?threadID=94506). As I said, the error showed up from one day to the other without any changes on my side, so I didn’t really believe in this. Additionally, on some EC2 instances the bootstrapping process worked properly, so pulling from the yum repo only failed sometimes. I tried to find out what the differences between the failed and the working instances were. They used the same AMI. To be really sure I started some instances in the same subnet: some of them worked, some didn’t.

The solution

As I knew that pulling from the AWS-managed yum repos is only allowed from within the AWS network (try the link I mentioned above, http://packages.us-east-1.amazonaws.com/2016.09/graphics/efb479739386/x86_64/repodata/repomd.xml, and you will get an access denied error in your browser), I looked at the public IPs of the EC2 instances and checked the app again. I recognized that pulling from the repo only failed for EC2 instances with a public IP address from the “3.x.x.x” range. Instances using IPs from “18.x.x.x” or “54.x.x.x” worked well. So I contacted the AWS support and in the end the customer service raised an internal ticket and AWS fixed the issue. They thanked me for pointing out this issue and answered: “We try our best to make sure that our products are up to date. However, we agree that some minor details may have been missed and deeply regret the same.”
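
If you want to check this on an instance yourself, the public IP can be read from the instance metadata service; a minimal sketch (the grep pattern simply tests for the 3.x.x.x range):

# read the instance's public IP from the metadata service
PUBLIC_IP=$(curl -s http://169.254.169.254/latest/meta-data/public-ipv4)
# check whether it falls into the 3.x.x.x range that was affected in my case
echo "$PUBLIC_IP" | grep -q '^3\.' && echo "$PUBLIC_IP is in the 3.x.x.x range"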

Below you can see an extract of the logfile after this issue was fixed:

[2019-01-21T16:49:40.297Z] INFO  [3394]  - [Initialization/PreInitStage0/PreInitHook/02installpackages.sh] : Completed activity. Result:
  ++ /opt/elasticbeanstalk/bin/get-config container -k python_version
  + PYTHON_VERSION=2.7
  + is_baked python_packages
  + [[ -f /etc/elasticbeanstalk/baking_manifest/python_packages ]]
  + false
  + yum install -y httpd24 gcc44 mysql-5.5 mysql-devel
  Loaded plugins: dkms-build-requires, priorities, update-motd, upgrade-helper
  2 packages excluded due to repository priority protections
  Package httpd24-2.4.25-1.68.amzn1.x86_64 already installed and latest version
  Package gcc44-4.4.6-4.81.amzn1.x86_64 already installed and latest version
  Package mysql-5.5-1.6.amzn1.noarch already installed and latest version
  Package mysql-devel-5.5-1.6.amzn1.noarch already installed and latest version

Although it was AWS’ fault and I spent some hours finding the root cause, I’m still glad to use AWS services 🙂

PS: you can see the IP ranges used by AWS at: https://ip-ranges.amazonaws.com/ip-ranges.json
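
If you have jq installed, the published ranges can be filtered easily, for example (region and service are just example filters):

# list the EC2 prefixes published for eu-central-1 (region/service are example filters)
curl -s https://ip-ranges.amazonaws.com/ip-ranges.json | \
  jq -r '.prefixes[] | select(.region=="eu-central-1" and .service=="EC2") | .ip_prefix'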