How to use environment variables in a training job on AWS Sagemaker

The issue

Recently I wanted to use environment variables in a Docker Container which is running a training job on AWS Sagemaker. Unfortunately using environment variables when starting the container is only possible for containers used for inference (which means the model has already been created). In the following diagram I visualized the main components of AWS Sagemaker and at which step environment variables are supported or not.

Environment variables are only supported for Docker containers running as Inference Containers

In this article I will describe how to use environment variables in a training job on AWS Sagemaker. As it’s not supported by AWS I only found a hackaround at stackoverflow (https://stackoverflow.com/questions/51215092/setup-env-variable-in-aws-sagemaker-container-bring-your-own-container). As it’s only described shortly at stackoverflow I created this article to add some comments to this solution. If you know a better way feel free to leave a comment 🙂

The solution

When you create a training job you can define hyperparameters for tuning the training. Inside the Docker Container it has to be placed at /opt/ml/input/config/hyperparameters.json

The idea of the workaround is to put the content of the environment variables to this file. An example is shown below:

{
    "batch_size": 100,
    "epochs": 10,
    "learning_rate": 0.1,
    "momentum": 0.9,
    "log_interval": 100,
    "aws_access_key_id": "ABCDEDF",
    "aws_secret_access_key": "123456"
  }

In this example I added two additional parameters called aws_access_key_id and aws_secret_access_key for passing IAM credentials to the container which for example later can be used by the AWS CLI or python script running in the container.

Now you can access the values using the tool jq:

jq -r ".aws_access_key_id" /opt/ml/input/config/hyperparameters.json
jq -r ".aws_secret_access_key" /opt/ml/input/config/hyperparameters.json

A complete example using this approach consists of these steps:

Create a bash script saved as exporting_vars.sh which exports the variables:

#!/bin/bash

export AWS_ACCESS_KEY_ID=$(jq -r ".aws_access_key_id" /opt/ml/input/config/hyperparameters.json)
export AWS_SECRET_ACCESS_KEY=$(jq -r ".aws_secret_access_key" /opt/ml/input/config/hyperparameters.json)

If the variables should be available not only inside the bash script exporting_vars.sh but also in the parent shell you have to source the script:

source exporting_vars.sh

You can do the sourcing e.g. in your main entrypoint script.

Now the variables are available the same way as passing same directly via Docker environment variables. Log in your container and try:

echo $AWS_SECRET_ACCESS_KEY
echo $AWS_ACCESS_KEY_ID

I would really appreciate it if AWS supports environment variables for training jobs out of the box. Let’s see when this will happen.

How to save google’s Lighthouse score at presigned S3 URLS and proceed/abort GitLab CI/CD Pipeline accordingly

The issue

I needed to create a job within a GitLab CI/CD Pipeline which calculates google’s Lighthouse score (https://developers.google.com/web/tools/lighthouse) for a specific website url containing the new code from this pipeline run. The calculated Lighthouse score results should be saved at a presigned AWS S3 url with a limited time to live (ttl). Only if the score is high enough the GitLab CI/CD pipeline should deploy the new code. With this additional step I wanted to avoid increased Lighthouse score on my website caused by the new deployment/release.

The solution

I’m running Lighthouse within a Docker container, so it’s easy to reuse it for other projects. As a starting point I needed chrome running in headless mode. On DockerHub I found this nice container:https://hub.docker.com/r/justinribeiro/chrome-headless/

Based on this container I added some environment variables to adjust Lighthouse and S3 configurations during runtime. This is my Dockerfile:

FROM justinribeiro/chrome-headless

# default values for environment variables
ENV URL=https://www.allaboutaws.com
ENV AWS_ACCESS_KEY_ID=EMPTY
ENV AWS_SECRET_ACCESS_KEY=EMPTY
ENV AWS_DEFAULT_REGION=EMPTY
ENV AWS_S3_LINK_TTL=EMPTY
ENV AWS_S3_BUCKET=EMPTY
ENV LIGHTHOUSE_SCORE_THRESHOLD=0.80

USER root
RUN apt-get update && \
    apt-get install -y bc curl gnupg2 sudo && \
    curl -sL https://deb.nodesource.com/setup_10.x | bash - && \
    apt-get install -y nodejs && \
    npm install -g lighthouse && \
    curl -O https://bootstrap.pypa.io/get-pip.py && \
    python get-pip.py && \
    pip install awscli && \
    apt-get purge --auto-remove -y python gnupg2 curl && \
    rm -rf /var/lib/apt/lists/*

RUN mkdir /tmp/lighthouse_score && chown chrome:chrome /tmp/lighthouse_score

ADD ./run.sh /tmp/run.sh

ENTRYPOINT /bin/bash /tmp/run.sh

The environment variables in the Dockerfile are:

  • ENV URL: the URL to run Lighthouse against
  • ENV AWS_ACCESS_KEY_ID: AWS Access Key for storing results in S3
  • ENV AWS_SECRET_ACCESS_KEY: AWS Secret Key for storing results in S3
  • ENV AWS_DEFAULT_REGION: AWS Region for S3
  • ENV AWS_S3_LINK_TTL: Time to live of presigned S3 URL
  • ENV AWS_S3_BUCKET: Name of S3 bucket containing Lighthouse result
  • ENV LIGHTHOUSE_SCORE_THRESHOLD: threshold for continuing or aborting GitLab CI/CD Pipeline regarding Lighthouse score

Run lighthouse and save results at presigned S3 URL

After installing all necessary packages the entrypoint bash script run.sh is started. The file is shown below:

#!/bin/bash

FILEPATH=/tmp/lighthouse_score
FILENAME=$(date "+%Y-%m-%d-%H-%M-%S").html
S3_PATH=s3://$AWS_S3_BUCKET/$FILENAME


echo "running lighthouse score against: " $URL
sudo -u chrome lighthouse --chrome-flags="--headless --disable-gpu --no-sandbox" --no-enable-error-reporting --output html --output-path $FILEPATH/$FILENAME $URL


if { [ ! -z "$AWS_ACCESS_KEY_ID" ] && [ "$AWS_ACCESS_KEY_ID" == "EMPTY" ]; } ||
  { [ ! -z "$AWS_SECRET_ACCESS_KEY" ] && [ "$AWS_SECRET_ACCESS_KEY" == "EMPTY" ]; } || 
  { [ ! -z "$AWS_DEFAULT_REGION" ] && [ "$AWS_DEFAULT_REGION" == "EMPTY" ]; } || 
  { [ ! -z "$S3_PATH" ] && [ "$S3_PATH" == "EMPTY" ]; } ;
then 
    printf "\nYou can find the lighthouse score result html file on your host machine in the mapped volume directory.\n" 
else
    echo "uploading lighthouse score result html file to S3 Bucket: $S3_PATH ..."
    aws s3 cp $FILEPATH/$FILENAME $S3_PATH
    if [ ! -z $AWS_S3_LINK_TTL ] && [ $AWS_S3_LINK_TTL == "EMPTY" ]; 
    then
        printf "\r\nSee the results of this run at (valid 24hrs (default) till the link expires):\n\n\r"
        aws s3 presign $S3_PATH --expires-in 86400 
        printf "\n"
    else
        printf "\n\rSee the results of this run at (valid $AWS_S3_LINK_TTL till the link expires):\n\n\r"
        aws s3 presign $S3_PATH --expires-in $AWS_S3_LINK_TTL
        printf "\n"
    fi
fi;


PERFORMANCE_SCORE=$(cat $FILEPATH/$FILENAME | grep -Po \"id\":\"performance\",\"score\":\(.*?\)} | sed 's/.*:\(.*\)}.*/\1/g')
if [ $(echo "$PERFORMANCE_SCORE > $LIGHTHOUSE_SCORE_THRESHOLD"|bc) -eq "1" ];
then
    echo "The Lighthouse Score is $PERFORMANCE_SCORE which is greater than $LIGHTHOUSE_SCORE_THRESHOLD, proceed with the CI/CD Pipeline..."
    exit 0
else
    echo "The Lighthouse Score is $PERFORMANCE_SCORE which is smaller than $LIGHTHOUSE_SCORE_THRESHOLD, DON'T proceed with the CI/CD Pipeline. Exiting now."
    exit 1
fi;

This script does the following:

  • run Lighthouse in chrome with headless mode
  • save the results as a html file in the container at /tmp/lighthouse_score using a file name containing the current date
  • if the environment variables are set, upload the html to the specified S3 bucket and presign the file using the cli command aws s3 presign
  • extract the performance score from the html file using grep and sed
  • output a message text to proceed or stop the GitLab Pipeline depending on whether $PERFORMANCE_SCORE > $LIGHTHOUSE_SCORE_THRESHOLD and return the value 0 (proceed with GitLab pipeline) or 1 (don’t proceed with GitLab pipeline)

Integrate it into Gitlab CI/CD Pipeline

The GitLab CI/CD Job in gitlab-ci.yml could look like the following YAML snippet:

calculate_lighthouse_score:
  stage: testing
  image: docker:latest
  only:
    - dev
  variables:
    URL: https://allaboutaws.com
    S3_REGION: us-east-1
    S3_LINK_TTL: 86400
    S3_BUCKET: MY-S3-BUCKET/lighthouse
    LIGHTHOUSE_SCORE_THRESHOLD: "0.50"

  script:
    - docker pull sh39sxn/lighthouse-signed
    - docker run -e URL=$URL \
      -e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY \
      -e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_KEY \
      -e AWS_DEFAULT_REGION=$S3_REGION \
      -e AWS_S3_LINK_TTL=$S3_LINK_TTL \
      -e AWS_S3_BUCKET=$S3_BUCKET \
      -e LIGHTHOUSE_SCORE_THRESHOLD=$LIGHTHOUSE_SCORE_THRESHOLD \
      sh39sxn/lighthouse-signed:latest

In GitLab logs you will see an output like:

Lighthouse Score containing presigned S3 URL in GitLab CI/CD Pipeline.

Save Lighthouse results On your machine

If you want you can run the Docker container on your machine and get the results from the container as they are stored at /tmp/lighthouse_score within the container. You have to mount a directory from your host machine to the container using docker volumes. The run statement would be:

docker run -it -v /tmp:/tmp/lighthouse_score -e URL=https://allaboutaws.com sh39sxn/lighthouse-signed-s3:latest

You find the lighthouse result on your host machine at /tmp.

External Links

I uploaded the files to my GitHub Repo at https://github.com/sh39sxn/lighthouse-signed-s3 and the prebuild Container is saved in my DockerHub Repo at https://hub.docker.com/r/sh39sxn/lighthouse-signed-s3.

How to auto-scale AWS ECS containers based on SQS queue metrics

The issue

There are a many tutorials describing how to auto-scale EC2 instances based on CPU Utilization or Memory Utilization of the host system. Similar approaches can be found to scale ECS Containers automatically based on the CPU/Memory metrics supported by default in ECS (see https://docs.aws.amazon.com/AmazonECS/latest/developerguide/cloudwatch-metrics.html). If your ECS tasks/containers process messages in a SQS queue (e.g. Laravel Queue Workers), you can still use CPU or Memory metrics as an indication for scaling in and out. In my opinion it’s much more reliable and significant if you scale in/out based on the number of messages waiting in the SQS queue to be processed. In this post I’m describing how to do Auto-scaling of ECS Containers based on SQS queue metrics.

The solution

As a starting point I used the tutorial from AWS at https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-using-sqs-queue.html AWS describes how to auto-scale EC2 instances based on SQS. In my case I’m scaling ECS tasks.

define IAM User Permissions

Af first I created an IAM User to access and modify the relevant AWS resources. You can use the following IAM policy for this user. Just replace the placeholder for the AWS Region, AWS Account ID, ECS Cluster Name, ECS Service Name and your SQS Queue Name:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ecs:UpdateService",
                "ecs:DescribeServices",
                "sqs:GetQueueAttributes"
            ],
            "Resource": [
                "arn:aws:ecs:eu-central-1:123456789:service/My-ECS-Cluster/My-ECS-Service",
                "arn:aws:sqs:eu-central-1:123456789:my-sqs-queue"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "ecs:ListServices",
                "cloudwatch:PutMetricData",
                "ecs:ListTasks",
                "ecs:DescribeServices",
                "cloudwatch:GetMetricStatistics",
                "ecs:DescribeTasks",
                "cloudwatch:ListMetrics",
                "ecs:DescribeClusters",
                "ecs:ListClusters"
            ],
            "Resource": "*"
        }
    ]
}

Publish SQS queue Metrics to CloudWatch

At first we need to track the number of messages waiting in the SQS queue to be processed. For this I coded the following bash script:

#!/bin/bash

AWS_ACCOUNT_ID=${1:-123456789}
SQS_QUEUE_NAME=${2:-My-SQS-Queue}
ECS_CLUSTER=${3:-My-ECS-Cluster}
ECS_SERVICE=${4:-My-ECS-Service}
CW_METRIC=${5:-BacklogPerECSTask}
CW_NAMESPACE=${6:-ECS-SQS-Autoscaling}
CW_DIMENSION_NAME=${7:-SQS-Queue}
CW_DIMENSION_VALUE=${8:-My-SQS-Queue}

ApproximateNumberOfMessages=$(aws sqs get-queue-attributes --queue-url https://sqs.$AWS_DEFAULT_REGION.amazonaws.com/$AWS_ACCOUNT_ID/$SQS_QUEUE_NAME --attribute-names All | jq -r '.[] | .ApproximateNumberOfMessages')
echo "ApproximateNumberOfMessages: " $ApproximateNumberOfMessages

NUMBER_TASKS=$(aws ecs list-tasks --cluster $ECS_CLUSTER --service-name $ECS_SERVICE | jq '.taskArns | length')
echo "NUMBER_TASKS: " $NUMBER_TASKS

MyBacklogPerWorker=$((($ApproximateNumberOfMessages / $NUMBER_TASKS) + ($ApproximateNumberOfMessages % $NUMBER_TASKS > 0)))
echo "MyBacklogPerWorker: " $MyBacklogPerWorker

# send average number of current backlogs of the workers as a custom Metric to CloudWatch
aws cloudwatch put-metric-data --metric-name $CW_METRIC --namespace $CW_NAMESPACE \
  --unit None --value $MyBacklogPerWorker --dimensions $CW_DIMENSION_NAME=$CW_DIMENSION_VALUE

In the beginning some variables are defined. You can pass the variables as arguments to the bash script and define default values in case you call the script without arguments. I won’t explain each of them in detail because they should be self-explanatory. The most important points are:

You get the number of messages available for retrieval from the SQS queue via the CLI command get-queue-attributes (see https://docs.aws.amazon.com/cli/latest/reference/sqs/get-queue-attributes.html) Using the json tool jq allows us to easily extract the needed value ApproximateNumberOfMessages from the json formatted result.

The metric ApproximateNumberOfMessages can not be found in the SQS metrics. You only get this value via CLI command. Unfortunately I didn’t find any information how AWS calculates this value. In my impression it’s somehow calculated using the metrics NumberOfMessagesSent and ApproximateNumberOfMessagesVisible which are available via the AWS Management Console, CLI and API.

In the next step we calculate the current backlog of our ECS tasks (in my case called workers as I coded this for Laravel queue workers). ApproximateNumberOfMessages is divided by the current number of running ECS tasks which we get via the command ecs list-tasks. The result is saved in the variable MyBacklogPerWorker and this value is pushed to the custom Cloudwatch Metric via the CLI Command put-metric-data (see https://docs.aws.amazon.com/cli/latest/reference/cloudwatch/put-metric-data.html).

I decided to run the bash script every 1 minute via a cronjob (see later section explaining the Docker container).

Calculate Backlogs and Scale in/OUT based on CloudWatch Metrics

I’m using the size of the backlog per task/worker as a threshold for scaling in or out. I defined the variable LATENCY and PROCESSING_TIME. LATENCY is the maximum allowed number of seconds till a message from the SQS queue should be processed (queue delay). PROCESSING_TIME is the average number of seconds an ECS task needs for processing a message from the SQS queue. Deviding those two values defines the allowed backlog per ECS task/worker. In the code snippet below 10 (20/2=10) messages is the maximum of messages one ECS task should need to process.

LATENCY=${1:-20} # maximum allowed latency (seconds) for processing a message
PROCESSING_TIME=${2:-2}  # average number of seconds to process an image
backlog_per_worker_allowed=$(($LATENCY / $PROCESSING_TIME)) # number of messages a worker can process within the allowed latency timeframe

I’m running the second bash script (shown below) every 5 minutes. It will get the custom CloudWatch metric send by the first bash script for the last 20 minutes from now and calculate the average backlog for all currently running ECS tasks. If the average value of the backlog for all currently running ECS tasks is higher than the defined threshold we will scale out.

The whole bash script looks like:

#!/bin/bash

LATENCY=${1:-20} # maximum allowed latency (seconds) for processing a message
PROCESSING_TIME=${2:-2}  # average number of seconds to process an image
ECS_CLUSTER=${3:-My-ECS-Cluster}
ECS_SERVICE=${4:-My-ECS-Service}
CW_METRIC=${5:-BacklogPerECSTask}
CW_NAMESPACE=${6:-ECS-SQS-Autoscaling}
CW_DIMENSION_NAME=${7:-SQS-Queue}
CW_DIMENSION_VALUE=${8:-My-SQS-Queue}
MAX_LIMIT_NUMBER_QUEUE_WORKERS=${9:-200}

ceil() {
  if [[ "$1" =~ ^[0-9]+$ ]]
    then
        echo $1;
        return 1;
  fi                                                                      
  echo "define ceil (x) 
        {if (x<0) {return x/1} \
          else {if (scale(x)==0) {return x} \
            else {return x/1 + 1 }}} ; ceil($1)" | bc
}

backlog_per_worker_allowed=$(($LATENCY / $PROCESSING_TIME)) # number of messages a worker can process within the allowed latency timeframe
echo "backlog_per_worker_allowed: " $backlog_per_worker_allowed

# get backlogs of the worker for the last 10 minutes
export LC_TIME=en_US.utf8
CF_JSON_RESULT=$(aws cloudwatch get-metric-statistics --namespace $CW_NAMESPACE --dimensions Name=$CW_DIMENSION_NAME,Value=$CW_DIMENSION_VALUE --metric-name $CW_METRIC \
 --start-time "$(date -u --date='5 minutes ago')" --end-time "$(date -u)" \
 --period 60 --statistics Average)
echo "CF_JSON_RESULT: " $CF_JSON_RESULT

# sum up the average values of the last 10 minutes
SUM_OF_AVERAGE_CW_VALUES=$(echo $CF_JSON_RESULT | jq '.Datapoints | .[].Average' | awk '{ sum += $1 } END { print sum }')

echo "SUM_OF_AVERAGE_VALUES: " $SUM_OF_AVERAGE_CW_VALUES

# count the number of average values the CW Cli command returned (varies between 4 and 5 values)
NUMBER_OF_CW_VALUES=$(echo $CF_JSON_RESULT | jq '.Datapoints' | jq length)

echo "NUMBER_OF_CW_VALUES: " $NUMBER_OF_CW_VALUES

# calculate average number of backlog for the workers in the last 10 minutes
AVERAGE_BACKLOG_PER_WORKER=$(echo "($SUM_OF_AVERAGE_CW_VALUES / $NUMBER_OF_CW_VALUES)" | bc -l )
echo "AVERAGE_BACKLOG_PER_WORKER: " $AVERAGE_BACKLOG_PER_WORKER

# calculator factor to scale in/out, then ceil up to next integer to be sure the scaling is sufficient
FACTOR_SCALING=$(ceil $(echo "($AVERAGE_BACKLOG_PER_WORKER / $backlog_per_worker_allowed)" | bc -l) )
echo "FACTOR_SCALING: " $FACTOR_SCALING

# get current number of ECS tasks
CURRENT_NUMBER_TASKS=$(aws ecs list-tasks --cluster $ECS_CLUSTER --service-name $ECS_SERVICE | jq '.taskArns | length')
echo "CURRENT_NUMBER_TASKS: " $CURRENT_NUMBER_TASKS

# calculate new number of ECS tasks, print leading 0 (0.43453 instead of .43453)
NEW_NUMBER_TASKS=$( echo "($FACTOR_SCALING * $CURRENT_NUMBER_TASKS)" | bc -l |  awk '{printf "%f", $0}')
echo "NEW_NUMBER_TASKS: " $NEW_NUMBER_TASKS


## we run more than enough workers currently, scale in slowly by 20 %
if [ $FACTOR_SCALING -le "1" ];
then
  NEW_NUMBER_TASKS=$( echo "(0.8 * $CURRENT_NUMBER_TASKS)" | bc -l)
fi;

echo "NEW_NUMBER_TASKS: " $NEW_NUMBER_TASKS

# round number of tasks to int
NEW_NUMBER_TASKS_INT=$( echo "($NEW_NUMBER_TASKS+0.5)/1" | bc )


if [ ! -z $NEW_NUMBER_TASKS_INT ];
    then
        if [ $NEW_NUMBER_TASKS_INT == "0" ];
          then
              NEW_NUMBER_TASKS_INT=1 # run at least one worker
        fi;
        if [ $NEW_NUMBER_TASKS_INT -gt $MAX_LIMIT_NUMBER_QUEUE_WORKERS ];
          then
              NEW_NUMBER_TASKS_INT=$MAX_LIMIT_NUMBER_QUEUE_WORKERS # run not more than the maximum limit of queue workers
        fi;
fi;

echo "NEW_NUMBER_TASKS_INT:" $NEW_NUMBER_TASKS_INT

# update ECS service to the calculated number of ECS tasks
aws ecs update-service --cluster $ECS_CLUSTER --service $ECS_SERVICE --desired-count $NEW_NUMBER_TASKS_INT 1>/dev/null

There have been some issues I want to mention:

  • I needed to set the environment variable LC_TIME to en_US.utf8 in order to get the right output from the unix commands date -u and date -u –date=’10 minutes ago’ when calling the CLI command aws cloudwatch get-metric-statistics for the last 10 minutes.
  • I used the tool bc (Basic Calculator) to do math operations like division from floating point numbers
  • the bash function ceil() at the beginning of the script rounds up floating point number to the next larger integer (if the argument is already an integer, it just returns the argument)
  • FACTOR_SCALING is calculated by dividing the currently calculated average backlog per ECS task by the allowed backlog per ECS task, it’s rounded up to the next larger integer using the function ceil():
FACTOR_SCALING=$(ceil $(echo "($AVERAGE_BACKLOG_PER_WORKER / $backlog_per_worker_allowed)" | bc -l) )
  • the new number of ECS tasks is calculated by the product of the FACTOR_SCALING and the currently running number of ECS tasks CURRENT_NUMBER_TASKS:
NEW_NUMBER_TASKS=$( echo "($FACTOR_SCALING * $CURRENT_NUMBER_TASKS)" | bc -l |  awk '{printf "%f", $0}')
  • this value is rounded to an integer
echo "($NEW_NUMBER_TASKS+0.5)/1" | bc
  • There is an edge case you have to take care: when FACTOR_SCALING is 1 it means the we run enough ECS tasks at the moment, so we should scale in. Otherwise we would keep running the same amount of ECS tasks forever and would never scale in as FACTOR_SCALING is always at least 1 (see above point, FACTOR_SCALING is rounded up to the next higher integer which means >= 1). In this case I defined to scale in by 20%:
## we run more than enough workers currently, scale in slowly by 20 %
if [ $FACTOR_SCALING -le "1" ];
then
  NEW_NUMBER_TASKS=$( echo "(0.8 * $CURRENT_NUMBER_TASKS)" | bc -l)
fi;
  • I added a variable MAX_LIMIT_NUMBER_QUEUE_WORKERS which is used as the maximum number of queue workers running at the same time. I’m using this as a security measure in case my script fails somehow and wants to start way to many workers (which could be expensive).
        if [ $NEW_NUMBER_TASKS_INT -gt $MAX_LIMIT_NUMBER_QUEUE_WORKERS ];
          then
              NEW_NUMBER_TASKS_INT=$MAX_LIMIT_NUMBER_QUEUE_WORKERS # run not more than the maximum limit of queue workers
        fi;
  • after all these calculations we call the AWS CLI command aws ecs update-service to update the ECS service to the new number of ECS tasks, only errors are printed to stdout to avoid the huge default output from this CLI command:
# update ECS service to the calculated number of ECS tasks
aws ecs update-service --cluster $ECS_CLUSTER --service $ECS_SERVICE --desired-count $NEW_NUMBER_TASKS_INT 1>/dev/null

Run the bash Scripts via Cron in a Docker Container

To run the first bash script called publish-Backlog-per-Worker.sh every 1 minute and the second bash script called scaling.sh every 10 minutes I created a Docker container for it (which itself is running as an ECS task). The Dockerfile looks like:

FROM alpine:latest

LABEL maintainer="https://allaboutaws.com"
ARG DEBIAN_FRONTEND=noninteractive

USER root

RUN apk add --update --no-cache \
    jq \
    py-pip \
    bc \
    coreutils \
    bash

# update pip
RUN pip install --upgrade pip

RUN pip install awscli --upgrade

# Configure cron
COPY ./docker/workers/scaling/crontab /etc/cron/crontab

# Init cron
RUN crontab /etc/cron/crontab

WORKDIR /code/
COPY ./docker/workers/scaling/scaling.sh /code/
COPY ./docker/workers/scaling/publish-Backlog-per-Worker.sh /code

COPY ./docker/workers/scaling/entrypoint.sh /etc/app/entrypoint
RUN chmod +x /etc/app/entrypoint
ENTRYPOINT /bin/sh /etc/app/entrypoint

EXPOSE 8080

It’s an alpine container in which the necessary tools jq, bc, coreutils (for command date), bash and aws cli are installed.

The entrypoint file starts the cron daemon:

#!/bin/sh
set -e

crond -f

The file crontab which is copied inside the container (Don’t forget to put a new line at the end of this file! Cronjob needs it!):

*/10 * * * * /bin/bash /code/scaling.sh $LATENCY $PROCESSING_TIME $ECS_CLUSTER $ECS_SERVICE $CW_METRIC $CW_NAMESPACE $CW_DIMENSION_NAME $CW_DIMENSION_VALUE $MAX_LIMIT_NUMBER_QUEUE_WORKERS
* * * * * /bin/bash /code/publish-Backlog-per-Worker.sh $AWS_ACCOUNT_ID $SQS_QUEUE_NAME $ECS_CLUSTER $ECS_SERVICE $CW_METRIC $CW_NAMESPACE $CW_DIMENSION_NAME $CW_DIMENSION_VALUE

As you can see the arguments for the bash scripts are environment variables. I set them when starting the container.

How to build the Docker container

docker build -t ecs-autoscaling-sqs-metrics:latest -f ./Cronjob.Dockerfile .

How to run the Docker Container

docker run -it -e AWS_DEFAULT_REGION=eu-central-1 -e AWS_ACCESS_KEY_ID=XXX -e AWS_SECRET_ACCESS_KEY=XXX -e AWS_ACCOUNT_ID=XXX -e LATENCY=20 -e PROCESSING_TIME=2 -e SQS_QUEUE_NAME=My-SQS-Queue -e ECS_CLUSTER=My-ECS-Cluster -e ECS_SERVICE=My-ECS-Service -e CW_METRIC=MyBacklogPerTask -e CW_NAMESPACE=ECS-SQS-Scaling -e CW_DIMENSION_NAME=SQS-Queue -e CW_DIMENSION_VALUE=My-SQS-Queue -e MAX_LIMIT_NUMBER_QUEUE_WORKERS=200 ecs-autoscaling-sqs-metrics:latest

If you want to run this Docker Container as an ECS task, too, you can use this task definition using the prebuild docker image from DockerHub:

{
    "family": "queue-worker-autoscaling",
    "networkMode": "bridge",
    "taskRoleArn": "arn:aws:iam::123456789:role/ecsTaskRole",
    "containerDefinitions": [
        {
            "name": "cronjob",
            "image": "sh39sxn/ecs-autoscaling-sqs-metrics:latest",
            "memoryReservation": 256,
            "cpu": 512,
            "essential": true,
            "portMappings": [{
                "hostPort": 0,
                "containerPort": 8080,
                "protocol": "tcp"
            }],
            "environment": [{
                "name": "AWS_DEFAULT_REGION",
                "value": "eu-central-1"
            },{
                "name": "AWS_ACCESS_KEY_ID",
                "value": "XXX"
            },{
                "name": "AWS_SECRET_ACCESS_KEY",
                "value": "XXX"
            },{
                "name": "AWS_ACCOUNT_ID",
                "value": "123456789"
            },{
                "name": "SQS_QUEUE_NAME",
                "value": "My-SQS-Queue"
            },{
                "name": "LATENCY",
                "value": "20"
            },{
                "name": "PROCESSING_TIME",
                "value": "2"
            },{
                "name": "ECS_CLUSTER",
                "value": "MY-ECS-CLUSTER"
            },{
                "name": "ECS_SERVICE",
                "value": "My-QUEUE-WORKER-SERVICE"
            },{
                "name": "CW_METRIC",
                "value": "BacklogPerECSTask"
            },{
                "name": "CW_NAMESPACE",
                "value": "ECS-SQS-Autoscaling"
            },{
                "name": "CW_DIMENSION_NAME",
                "value": "SQS-Queue"
            },{
                "name": "CW_DIMENSION_VALUE",
                "value": "My-SQS-Queue"
            },{
                "name": "MAX_LIMIT_NUMBER_QUEUE_WORKERS",
                "value": "200"
            }],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                  "awslogs-group": "/ecs/Queue-Worker-Autoscaling",
                  "awslogs-region": "eu-central-1",
                  "awslogs-stream-prefix": "ecs"
                }
              }
        }
    ]
}

I added a logging configuration for CloudWatch Logs. This makes it easier to track and debug the algorithm. Don’t forget to create the CloudWatch log group /ecs/Queue-Worker-Autoscaling before starting the ECS task. Otherwise it will fail because the log group has to exist before you start the ECS task which pushes log to it.

Results

Using an example timeframe I show you how the auto-scaling of ECS containers based on SQS metrics works at the end.

The following screenshot shows the metric ApproximateNumberOfMessagesVisible which is significant for the current workload.

Above you can see two peaks at around 23:50 and 2:20. The custom metric showing the current Backlog per ECS Task fits to it as you see here:

The CloudWatch Logs from the Cronjob ECS Tasks shows that the algorithm recognized that the average backlog per worker is too high and the number of workers is increased from 2 to 24.

Custom Metric BacklogPerECSTask

10 minutes later the script checks again the average backlog per ECS Task and again scales in as the it’s still too high:

Custom Metric BacklogPerECSTask

Here you can see a graphical representation of the number of ECS Tasks:

Number of ECS Tasks

You can see how the number of ECS Tasks increased and then descreased in steps by 20% as defined in the bash script.

External Links

I uploaded the files to my GitHub Repo at https://github.com/sh39sxn/ecs-autoscaling-sqs-metrics and the prebuild Container is saved in my DockerHub Repo at https://hub.docker.com/r/sh39sxn/ecs-autoscaling-sqs-metrics.