AWS Compute Blog
Monitor Cluster State with Amazon ECS Event Stream
Thanks to my colleague Jay Allen for this great blog on how to use the ECS Event stream for operational tasks.
In the past, in order to obtain updates on the state of a running Amazon ECS cluster, customers have had to rely on periodically polling the state of container instances and tasks using the AWS CLI or an SDK. With the new Amazon ECS event stream feature, it is now possible to retrieve near real-time, event-driven updates on the state of your Amazon ECS tasks and container instances. Events are delivered through Amazon CloudWatch Events, and can be routed to any valid CloudWatch Events target, such as an AWS Lambda function or an Amazon SNS topic.
In this post, I show you how to create a simple serverless architecture that captures, processes, and stores event stream updates. You first create a Lambda function that scans all incoming events to determine if there is an error related to any running tasks (for example, if a scheduled task failed to start); if so, the function immediately sends an SNS notification. Your function then stores the entire message as a document inside of an Elasticsearch cluster using Amazon Elasticsearch Service, where you and your development team can use the Kibana interface to monitor the state of your cluster and search for diagnostic information in response to issues reported by users.
Understanding the structure of event stream events
An ECS event stream sends two types of event notifications:
- Task state change notifications, which ECS fires when a task starts or stops
- Container instance state change notifications, which ECS fires when the resource utilization or reservation for an instance changes
A single event may result in ECS sending multiple notifications of both types. For example, if a new task starts, ECS first sends a task state change notification to signal that the task is starting, followed by a notification when the task has started (or has failed to start); additionally, ECS also fires container instance state change notifications when the utilization of the instance on which ECS launches the task changes.
Event stream events are sent using CloudWatch Events, which structures events as JSON messages divided into two sections: the envelope and the payload. The detail section of each event contains the payload data, and the structure of the payload is specific to the event being fired. In the example that follows, notice that the properties at the top level of the JSON document describe event properties, such as the event name and the time the event occurred, while the detail section contains the information about the task and container instance that triggered the event.
The following JSON depicts an ECS task state change event signifying that the essential container for a task running on an ECS cluster has exited, and thus the task has been stopped on the ECS cluster:
{
"version": "0",
"id": "8f07966c-b005-4a0f-9ee9-63d2c41448b3",
"detail-type": "ECS Task State Change",
"source": "aws.ecs",
"account": "244698725403",
"time": "2016-10-17T20:29:14Z",
"region": "us-east-1",
"resources": [
"arn:aws:ecs:us-east-1:123456789012:task/cdf83842-a918-482b-908b-857e667ce328"
],
"detail": {
"clusterArn": "arn:aws:ecs:us-east-1:123456789012:cluster/eventStreamTestCluster",
"containerInstanceArn": "arn:aws:ecs:us-east-1:123456789012:container-instance/f813de39-e42c-4a27-be3c-f32ebb79a5dd",
"containers": [
{
"containerArn": "arn:aws:ecs:us-east-1:123456789012:container/4b5f2b75-7d74-4625-8dc8-f14230a6ae7e",
"exitCode": 1,
"lastStatus": "STOPPED",
"name": "web",
"networkBindings": [
{
"bindIP": "0.0.0.0",
"containerPort": 80,
"hostPort": 80,
"protocol": "tcp"
}
],
"taskArn": "arn:aws:ecs:us-east-1:123456789012:task/cdf83842-a918-482b-908b-857e667ce328"
}
],
"createdAt": "2016-10-17T20:28:53.671Z",
"desiredStatus": "STOPPED",
"lastStatus": "STOPPED",
"overrides": {
"containerOverrides": [
{
"name": "web"
}
]
},
"startedAt": "2016-10-17T20:29:14.179Z",
"stoppedAt": "2016-10-17T20:29:14.332Z",
"stoppedReason": "Essential container in task exited",
"updatedAt": "2016-10-17T20:29:14.332Z",
"taskArn": "arn:aws:ecs:us-east-1:123456789012:task/cdf83842-a918-482b-908b-857e667ce328",
"taskDefinitionArn": "arn:aws:ecs:us-east-1:123456789012:task-definition/wpunconfiguredfail:1",
"version": 3
}
}
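To make the envelope and payload split concrete, here is a minimal Python sketch that separates the two parts of an event shaped like the sample above. The helper name split_event and the ENVELOPE_KEYS tuple are assumptions made for illustration; only the key names are taken from the sample event.

```python
import json

# Top-level keys that appear in the envelope of the sample event above.
ENVELOPE_KEYS = ("version", "id", "detail-type", "source",
                 "account", "time", "region", "resources")


def split_event(event):
    """Return (envelope, payload) for an event shaped like the sample above.

    split_event is a hypothetical helper, not part of any AWS SDK.
    """
    envelope = {key: event[key] for key in ENVELOPE_KEYS if key in event}
    payload = event.get("detail", {})
    return envelope, payload


# A trimmed-down copy of the sample task state change event.
sample = json.loads("""
{
  "detail-type": "ECS Task State Change",
  "source": "aws.ecs",
  "time": "2016-10-17T20:29:14Z",
  "detail": {
    "lastStatus": "STOPPED",
    "stoppedReason": "Essential container in task exited"
  }
}
""")

envelope, payload = split_event(sample)
print(envelope["detail-type"], "->", payload["lastStatus"])
```

The envelope tells you what kind of event arrived and when; everything specific to the task or container instance lives under detail.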
Setting up an Elasticsearch cluster
Before you dive into the code for handling events, set up your Elasticsearch cluster. On the console, choose Elasticsearch Service, Create a New Domain. In Elasticsearch domain name, type elasticsearch-ecs-events, then choose Next.

For Step 2: Configure cluster, accept all of the defaults by choosing Next.
For Step 3: Set up access policy, choose Next. This page lets you establish a resource-based policy for accessing your cluster; to allow access to the cluster’s actions, use an identity-based policy associated with your Lambda function.
Finally, on the Review page, choose Confirm and create. This starts spinning up your cluster.
While your cluster is being created, set up the SNS topic and Lambda function you need to start capturing and issuing notifications about events.
Create an SNS topic
Because your Lambda function emails you when a task fails unexpectedly due to an error condition, you need to set up an Amazon SNS topic to which your Lambda function can write.
In the console, choose SNS, Create Topic. For Topic name, type ECSTaskErrorNotification, and then choose Create topic.
When you’re done, copy the Topic ARN value, and save it to a text editor on your local desktop; you need it to configure permissions for your Lambda function in the next step. Finally, choose Create subscription to subscribe an email address to which you have access, so that you receive these event notifications. Remember to click the link in the confirmation email, or you won’t receive any events.
The eagle-eyed among you may realize that you haven’t given your future Lambda function permission to call your SNS topic. You grant this permission to the Lambda execution role when you create your Lambda function in the following steps.
Handling event stream events in a Lambda function
For the next step, create your Lambda function to capture events. Here’s the code for your function (written in Python 2.7):
import requests
import json
from requests_aws_sign import AWSV4Sign
from boto3 import session, client
from elasticsearch import Elasticsearch, RequestsHttpConnection

es_host = '<insert your own Amazon ElasticSearch endpoint here>'
sns_topic = '<insert your own SNS topic ARN here>'

def lambda_handler(event, context):
    # Establish credentials
    session_var = session.Session()
    credentials = session_var.get_credentials()
    region = session_var.region_name or 'us-east-1'

    # Check to see if this event is a task event and, if so, if it contains
    # information about an event failure. If so, send an SNS notification.
    if "detail-type" not in event:
        raise ValueError("ERROR: event object is not a valid CloudWatch Events event")
    else:
        if event["detail-type"] == "ECS Task State Change":
            detail = event["detail"]
            if detail["lastStatus"] == "STOPPED":
                if detail["stoppedReason"] == "Essential container in task exited":
                    # Send an error status message.
                    sns_client = client('sns')
                    sns_client.publish(
                        TopicArn=sns_topic,
                        Subject="ECS task failure detected for container",
                        Message=json.dumps(detail)
                    )

    # Elasticsearch connection. Note that you must sign your requests in order
    # to call the Amazon ES API. Use the requests_aws_sign package for this.
    service = 'es'
    auth = AWSV4Sign(credentials, region, service)
    es_client = Elasticsearch(host=es_host,
                              port=443,
                              connection_class=RequestsHttpConnection,
                              http_auth=auth,
                              use_ssl=True,
                              verify_ssl=True)
    es_client.index(index="ecs-index", doc_type="eventstream", body=event)
Break this down: First, the function inspects the event to see if it is a task change event. If so, it further looks to see if the event is reporting a stopped task, and whether that task stopped because one of its essential containers terminated. If these conditions are true, it sends a notification to the SNS topic that you created earlier.
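The failure check described here can be pulled out into a small function that is easy to test without SNS or Amazon ES. This is a sketch under one assumption: the name is_task_failure is hypothetical, and it uses dict.get so that events lacking a stoppedReason key return False instead of raising KeyError.

```python
FAILURE_REASON = "Essential container in task exited"


def is_task_failure(event):
    """Return True when the event reports a task stopped by an essential container exit."""
    if event.get("detail-type") != "ECS Task State Change":
        return False
    detail = event.get("detail", {})
    return (
        detail.get("lastStatus") == "STOPPED"
        and detail.get("stoppedReason") == FAILURE_REASON
    )


failed = {
    "detail-type": "ECS Task State Change",
    "detail": {"lastStatus": "STOPPED", "stoppedReason": FAILURE_REASON},
}
running = {
    "detail-type": "ECS Task State Change",
    "detail": {"lastStatus": "RUNNING"},
}
print(is_task_failure(failed), is_task_failure(running))
```

Keeping the decision separate from the side effects lets you exercise it locally with the sample event before deploying the function.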
Second, the function creates an Elasticsearch connection to your Amazon ES instance. The function uses the requests_aws_sign library to implement Signature Version 4 (SigV4) signing because, in order to call Amazon ES, you need to sign all requests with the SigV4 signing process. After the SigV4 signature is generated, the function calls Amazon ES and adds the event to an index for later retrieval and inspection.
To get this code to work, your Lambda function must have permission to perform HTTP POST requests against your Amazon ES instance, and to publish messages to your SNS topic. Configure this by setting up your Lambda function with an execution role that grants the appropriate permission to these resources in your account.
To get started, you need to prepare a ZIP file for the above code that contains both the code and its prerequisites. Create a directory named lambda_eventstream, and save the code above to a file named lambda_function.py. In your favorite text editor, replace the es_host and sns_topic variables with your own Amazon ES endpoint and SNS topic ARN, respectively.
Next, on the command line (Linux, Windows or Mac), change to the directory that you just created, and run the following command for pip (the de facto standard Python installation utility) to download all of the required prerequisites for this code into the directory. You need to ship these dependencies with your code, as they are not pre-installed on the instance that runs your Lambda function.
NOTE: You need to be on a machine with Python and pip already installed. If you are using Python 2.7.9 or greater, pip is installed as part of your standard Python installation. If you are not using Python 2.7.9 or greater, consult the pip page for installation instructions.
pip install requests_aws_sign elasticsearch -t .
Finally, zip all of the contents of this directory into a single zip file. Make sure that the lambda_function.py file is at the top of the file hierarchy within the zip file, and that it is not contained within another directory. From within the lambda_eventstream directory, you can use the following command on Linux and macOS systems (the -r flag is needed so that the dependency directories installed by pip are included):
zip -r lambda-eventstream.zip *
On Windows clients with the 7-Zip utility installed, you can run the following command from PowerShell or, if you’re really so inclined, a command prompt:
7z a -tzip lambda-eventstream.zip *
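If you would rather not depend on a platform-specific zip utility, the same packaging step can be sketched with Python's standard zipfile module. The function name build_package is an assumption for illustration; the directory and archive names in the usage comment are the ones used in this post.

```python
import os
import zipfile


def build_package(source_dir, zip_path):
    """Zip every file under source_dir, storing paths relative to source_dir."""
    zip_abs = os.path.abspath(zip_path)
    with zipfile.ZipFile(zip_abs, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(source_dir):
            for name in files:
                full_path = os.path.join(root, name)
                if os.path.abspath(full_path) == zip_abs:
                    continue  # do not add the archive to itself
                archive.write(full_path, os.path.relpath(full_path, source_dir))
    return zip_abs


# Example usage, run from the parent of the lambda_eventstream directory:
# build_package("lambda_eventstream", "lambda-eventstream.zip")
```

Because paths are stored relative to the source directory, lambda_function.py ends up at the top level of the archive, which is what Lambda expects.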
Now that your function and its dependencies are properly packaged, install and test it. Navigate to the Lambda console, choose Create a Lambda Function, and then on the Select Blueprint page, choose Blank function. Choose Next on the Configure triggers screen; you wire up your function to your ECS event stream in the next section.
On the Configure function page, for Name, enter lambda-eventstream. For Runtime, choose Python 2.7. Under Lambda function code, for Code entry type, choose Upload a .ZIP file, and choose Upload to select the ZIP file that you just created.

Under Lambda function handler and role, for Role, choose Create a custom role. This opens a new window for configuring your policy. For IAM Role, choose Create a New IAM Role, and type a name. Then choose View Policy Document, Edit. Paste in the IAM policy below, making sure to replace every instance of <AWSAccountID> with your own AWS account ID. If your function, Elasticsearch domain, or SNS topic names differ from the ones in the policy, update the corresponding ARNs to match.
{
"Version":"2012-10-17",
"Statement":[
{
"Effect":"Allow",
"Action":"lambda:InvokeFunction",
"Resource":"arn:aws:lambda:us-east-1:<AWSAccountID>:function:ecs-events",
"Principal":{
"Service":"events.amazonaws.com"
},
"Condition":{
"ArnLike":{
"AWS:SourceArn":"arn:aws:events:us-east-1:<AWSAccountID>:rule/eventstream-rule"
}
},
"Sid":"TrustCWEToInvokeMyLambdaFunction"
},
{
"Effect":"Allow",
"Action":"logs:CreateLogGroup",
"Resource":"arn:aws:logs:us-east-1:<AWSAccountID>:*"
},
{
"Effect":"Allow",
"Action":[
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Resource":[
"arn:aws:logs:us-east-1:<AWSAccountID>:log-group:/aws/lambda/ecs-events:*"
]
},
{
"Effect": "Allow",
"Action": [
"es:ESHttpPost"
],
"Resource": "arn:aws:es:us-east-1:<AWSAccountID>:domain/ecs-events-cluster/*"
},
{
"Effect": "Allow",
"Action": [
"sns:Publish"
],
"Resource": "arn:aws:sns:us-east-1:<AWSAccountID>:ECSTaskErrorNotification"
}
]
}
This policy establishes every permission that your Lambda function requires for execution, including permission to:
- Create a new CloudWatch Logs log group, and save all outputs from your Lambda function to this group
- Perform HTTP POST commands on your Elasticsearch cluster
- Publish messages to your SNS topic
When you’re done, you can test your configuration by scrolling up to the sample event stream message provided earlier in this post, and using it to test your Lambda function in the console. On the dashboard page for your new function, choose Test, and in the Input test event window, enter the JSON-formatted event from earlier.
Note that, if you haven’t correctly input your account ID in the correct places in your IAM policy file, you may receive a message along the lines of:
User: arn:aws:sts::123456789012:assumed-role/LambdaEventStreamTake2/awslambda_421_20161017203411268 is not authorized to perform: es:ESHttpPost on resource: ecs-events-cluster.
Edit the policy associated with your Lambda execution role in the IAM console and try again.
Send event stream events to your Lambda function
Almost there! Now with your SNS topic, Elasticsearch cluster, and Lambda function all in place, the only remaining element is to wire up your ECS event stream events and route them to your Lambda function. The CloudWatch Events console offers everything you need to set this up quickly and easily.
From the console, choose CloudWatch, Events. On Step 1: Create Rule, under Event selector, choose Amazon EC2 Container Service. CloudWatch Events enables you to filter by the type of message (task state change or container instance state change), as well as to select a specific cluster from which to receive events. For the purposes of this post, keep the default settings of Any detail type and Any cluster.
Under Targets, choose Lambda function. For Function, choose lambda-eventstream. Behind the scenes, this sends events from your ECS clusters to your Lambda function and also creates the service role required for CloudWatch Events to call your Lambda function.
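For reference, a rule that accepts any detail type from any cluster boils down to an event pattern on the source and detail-type fields. The sketch below writes that pattern as a Python dictionary and adds a deliberately simplified local matcher. The matches_pattern helper is hypothetical and only compares top-level string fields; the real matching is performed by CloudWatch Events, and the pattern the console generates for you may differ in detail.

```python
import json

# Event pattern covering both ECS event stream notification types.
EVENT_PATTERN = {
    "source": ["aws.ecs"],
    "detail-type": [
        "ECS Task State Change",
        "ECS Container Instance State Change",
    ],
}


def matches_pattern(event, pattern=EVENT_PATTERN):
    """Simplified local check: every pattern key must list the event's value."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())


print(json.dumps(EVENT_PATTERN, indent=2))
```

Narrowing the detail-type list to a single entry is how you would limit the rule to task events only.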
Verify your work
Now it’s time to verify that messages sent from your ECS cluster flow through your Lambda function, trigger an SNS message for failed tasks, and are stored in your Elasticsearch cluster for future retrieval. To test this workflow, you can use the following ECS task definition, which attempts to start the official WordPress image without configuring an SQL database for storage:
{
"taskDefinition": {
"status": "ACTIVE",
"family": "wpunconfiguredfail",
"volumes": [],
"taskDefinitionArn": "arn:aws:ecs:us-east-1:244698725403:task-definition/wpunconfiguredfail:1",
"containerDefinitions": [
{
"environment": [],
"name": "web",
"mountPoints": [],
"image": "wordpress",
"cpu": 99,
"portMappings": [
{
"protocol": "tcp",
"containerPort": 80,
"hostPort": 80
}
],
"memory": 100,
"essential": true,
"volumesFrom": []
}
],
"revision": 1
}
}
Create this task definition using either the AWS Management Console or the AWS CLI, and then start a task from this task definition. For more detailed instructions, see Launching a Container Instance.
A few minutes after launching a task from this task definition, you should receive an SNS message with the contents of the task state change JSON indicating that the task failed. You can also examine your Elasticsearch cluster in the console by selecting the name of your cluster and choosing Indices, ecs-index. For Count, you should see that you have multiple records stored.
You can also search the messages that have been stored by opening up access to your Kibana endpoint. Kibana provides a host of visualization and search capabilities for data stored in Amazon ES. To open up access to Kibana to your computer, find your computer’s IP address, and then choose Modify access policy for your Elasticsearch cluster. For Set the domain access policy to, choose Allow access to the domain from specific IP(s) and enter your IP address.
(A more robust and scalable solution for securing Kibana is to front it with a proxy. Details on this approach can be found in Karthi Thyagarajan’s post How to Control Access to Your Amazon Elasticsearch Service Domain.)
You should now be able to click the Kibana endpoint for your cluster, and search for messages stored in your cluster’s indexes.
Conclusion
After you have this basic, serverless architecture set up for consuming ECS cluster-related event notifications, the possibilities are limitless. For example, instead of storing the events in Amazon ES, you could store them in Amazon DynamoDB, and use the resulting tables to build a UI that materializes the current state of your clusters.
You could also use this information to drive container placement and scaling automatically, allowing you to “right-size” your clusters to a very granular level. By delivering cluster state information in near-real time using an event-driven model as opposed to a pull model, the new ECS event stream feature opens up a much wider array of possibilities for monitoring and scaling your container infrastructure.
If you have questions or suggestions, please comment below.
Introducing Simplified Serverless Application Deployment and Management

Orr Weinstein, Sr. Product Manager, AWS Lambda
Today, AWS Lambda launched the AWS Serverless Application Model (AWS SAM), a new specification that makes it easier than ever to manage and deploy serverless applications on AWS. With AWS SAM, customers can now express Lambda functions, Amazon API Gateway APIs, and Amazon DynamoDB tables using simplified syntax that is natively supported by AWS CloudFormation. After declaring their serverless app, customers can use new CloudFormation commands to easily create a CloudFormation stack and deploy the specified AWS resources.
AWS SAM
AWS SAM is a simplification of the CloudFormation template, which allows you to easily define AWS resources that are common in serverless applications.
You inform CloudFormation that your template defines a serverless app by adding a line under the template format version, like the following:
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Serverless resource types
Now, you can start declaring serverless resources….
AWS Lambda function (AWS::Serverless::Function)
Use this resource type to declare a Lambda function. When doing so, you need to specify the function’s handler, runtime, and a URI pointing to an Amazon S3 bucket that contains your Lambda deployment package.
If you want, you could add managed policies to the function’s execution role, or add environment variables to your function. The really cool thing about AWS SAM is that you can also use it to create an event source to trigger your Lambda function with just a few lines of text. Take a look at the example below:
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Resources:
  MySimpleFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: index.handler
      Runtime: nodejs4.3
      CodeUri: s3://<bucket>/MyCode.zip
      Events:
        MyUploadEvent:
          Type: S3
          Properties:
            Id: !Ref Bucket
            Events: Create
  Bucket:
    Type: AWS::S3::Bucket
In this example, you declare a Lambda function and provide the required properties (Handler, Runtime and CodeUri). Then, declare the event source—an S3 bucket to trigger your Lambda function on ‘Object Created’ events. Note that you are using the CloudFormation intrinsic ref function to set the bucket you created in the template as the event source.
Amazon API Gateway API (AWS::Serverless::Api)
You can use this resource type to declare a collection of Amazon API Gateway resources and methods that can be invoked through HTTPS endpoints. With AWS SAM, there are two ways to declare an API:
1) Implicitly
An API is created implicitly from the union of API events defined on AWS::Serverless::Function. An example of how this is done is shown later in this post.
2) Explicitly
If you require the ability to configure the underlying API Gateway resources, you can declare an API by providing a Swagger file, and the stage name:
MyAPI:
  Type: AWS::Serverless::Api
  Properties:
    StageName: prod
    DefinitionUri: swaggerFile.yml
Amazon DynamoDB table (AWS::Serverless::SimpleTable)
This resource creates a DynamoDB table with a single attribute primary key. You can specify the name and type of your primary key, and your provisioned throughput:
MyTable:
  Type: AWS::Serverless::SimpleTable
  Properties:
    PrimaryKey:
      Name: id
      Type: String
    ProvisionedThroughput:
      ReadCapacityUnits: 5
      WriteCapacityUnits: 5
In the event that you require more advanced functionality, you should declare the AWS::DynamoDB::Table resource instead.
App example
Now, let’s walk through an example that demonstrates how easy it is to build an app using AWS SAM, and then deploy it using a new CloudFormation command.
Building a serverless app
In this example, you build a simple CRUD web service. The “front door” to the web service is an API that exposes the “GET”, “PUT”, and “DELETE” methods. Each method triggers a corresponding Lambda function that performs an action on a DynamoDB table.
This is how your AWS SAM template should look:
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Simple CRUD web service. State is stored in a DynamoDB table.
Resources:
  GetFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: index.get
      Runtime: nodejs4.3
      Policies: AmazonDynamoDBReadOnlyAccess
      Environment:
        Variables:
          TABLE_NAME: !Ref Table
      Events:
        GetResource:
          Type: Api
          Properties:
            Path: /resource/{resourceId}
            Method: get
  PutFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: index.put
      Runtime: nodejs4.3
      Policies: AmazonDynamoDBFullAccess
      Environment:
        Variables:
          TABLE_NAME: !Ref Table
      Events:
        PutResource:
          Type: Api
          Properties:
            Path: /resource/{resourceId}
            Method: put
  DeleteFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: index.delete
      Runtime: nodejs4.3
      Policies: AmazonDynamoDBFullAccess
      Environment:
        Variables:
          TABLE_NAME: !Ref Table
      Events:
        DeleteResource:
          Type: Api
          Properties:
            Path: /resource/{resourceId}
            Method: delete
  Table:
    Type: AWS::Serverless::SimpleTable
There are a few things to note here:
- You start the template by specifying Transform: AWS::Serverless-2016-10-31. This informs CloudFormation that this template contains AWS SAM resources that need to be ‘transformed’ to full-blown CloudFormation resources when the stack is created.
- You declare three different Lambda functions (GetFunction, PutFunction, and DeleteFunction), and a simple DynamoDB table. In each of the functions, you declare an environment variable (TABLE_NAME) that leverages the CloudFormation intrinsic Ref function to set TABLE_NAME to the name of the DynamoDB table that you declare in your template.
- You do not use the CodeUri attribute to specify the location of your Lambda deployment package for any of your functions (more on this later).
- By declaring an API event (and not declaring the same API as a separate AWS::Serverless::Api resource), you are telling AWS SAM to generate that API for you. The API that is going to be generated from the three API events above looks like the following:
/resource/{resourceId}
GET
PUT
DELETE
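The idea that the implicit API is the union of the API events can be illustrated with a short Python sketch. This is not how AWS SAM is implemented; the FUNCTION_EVENTS dictionary simply restates the Path and Method values from the template above, and implicit_api is a hypothetical helper that groups methods by path.

```python
from collections import defaultdict

# Path and Method values copied from the three Api events in the template above.
FUNCTION_EVENTS = {
    "GetFunction": {"Path": "/resource/{resourceId}", "Method": "get"},
    "PutFunction": {"Path": "/resource/{resourceId}", "Method": "put"},
    "DeleteFunction": {"Path": "/resource/{resourceId}", "Method": "delete"},
}


def implicit_api(function_events):
    """Group HTTP methods by path, as a union of all declared API events."""
    routes = defaultdict(list)
    for event in function_events.values():
        routes[event["Path"]].append(event["Method"].upper())
    return dict(routes)


print(implicit_api(FUNCTION_EVENTS))
```

All three functions share one path, so the result is a single resource that exposes three methods.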
Next, take a look at the code:
'use strict';

console.log('Loading function');
let doc = require('dynamodb-doc');
let dynamo = new doc.DynamoDB();

const tableName = process.env.TABLE_NAME;

const createResponse = (statusCode, body) => {
    return {
        "statusCode": statusCode,
        "body": body || ""
    }
};

exports.get = (event, context, callback) => {
    var params = {
        "TableName": tableName,
        "Key": {
            id: event.pathParameters.resourceId
        }
    };
    dynamo.getItem(params, (err, data) => {
        var response;
        if (err)
            response = createResponse(500, err);
        else
            response = createResponse(200, data.Item ? data.Item.doc : null);
        callback(null, response);
    });
};

exports.put = (event, context, callback) => {
    var item = {
        "id": event.pathParameters.resourceId,
        "doc": event.body
    };
    var params = {
        "TableName": tableName,
        "Item": item
    };
    dynamo.putItem(params, (err, data) => {
        var response;
        if (err)
            response = createResponse(500, err);
        else
            response = createResponse(200, null);
        callback(null, response);
    });
};

exports.delete = (event, context, callback) => {
    var params = {
        "TableName": tableName,
        "Key": {
            "id": event.pathParameters.resourceId
        }
    };
    dynamo.deleteItem(params, (err, data) => {
        var response;
        if (err)
            response = createResponse(500, err);
        else
            response = createResponse(200, null);
        callback(null, response);
    });
};
Notice the following:
- There are three separate functions, which have handlers that correspond to the handlers you defined in the AWS SAM file (‘get’, ‘put’, and ‘delete’).
- You are using process.env.TABLE_NAME to pull the value of the environment variable that you declared in the AWS SAM file.
Deploying a serverless app
OK, now assume you’d like to deploy your Lambda functions, API, and DynamoDB table, and have your app up and ready to go. In addition, assume that your AWS SAM file and code files are in a local folder:
- MyProject
- app_spec.yml
- index.js
To create a CloudFormation stack to deploy your AWS resources, you need to:
1) Zip the index.js file.
2) Upload it to an S3 bucket.
3) Add a CodeUri property, specifying the location of the zip file in the bucket for each function in app_spec.yml.
4) Call the CloudFormation CreateChangeSet operation with app_spec.yml.
5) Call the CloudFormation ExecuteChangeSet operation with the name of the change set you created in step 4.
This seems like a long process, especially if you have multiple Lambda functions (you would have to go through steps 1 to 3 for each function). Luckily, the new ‘Package’ and ‘Deploy’ commands from CloudFormation take care of all five steps for you!
First, call Package to perform steps 1 to 3. You need to provide the command with the path to your AWS SAM file, an S3 bucket name, and a name for the new template that will be created (which will contain an updated CodeUri property):
aws cloudformation package --template-file app_spec.yml --output-template-file new_app_spec.yml --s3-bucket <your-bucket-name>
Next, call Deploy with the name of the newly generated SAM file, and with the name of the CloudFormation stack. In addition, you need to provide CloudFormation with the capability to create IAM roles:
aws cloudformation deploy --template-file new_app_spec.yml --stack-name <your-stack-name> --capabilities CAPABILITY_IAM
And voila! Your CloudFormation stack has been created, and your Lambda functions, API Gateway API, and DynamoDB table have all been deployed.
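If you want to script the two commands, for example in a build step, one option is to assemble them as argument lists in Python and hand them to subprocess. The sketch below only builds the lists; the function names are hypothetical, and the bucket and stack names in the usage lines are placeholders you would replace with your own.

```python
def package_command(template, output_template, bucket):
    """Argument list for the package step shown above."""
    return [
        "aws", "cloudformation", "package",
        "--template-file", template,
        "--output-template-file", output_template,
        "--s3-bucket", bucket,
    ]


def deploy_command(template, stack_name):
    """Argument list for the deploy step, including the IAM capability."""
    return [
        "aws", "cloudformation", "deploy",
        "--template-file", template,
        "--stack-name", stack_name,
        "--capabilities", "CAPABILITY_IAM",
    ]


# Placeholder names; run each list with subprocess.run(cmd, check=True).
commands = [
    package_command("app_spec.yml", "new_app_spec.yml", "my-example-bucket"),
    deploy_command("new_app_spec.yml", "my-example-stack"),
]
for cmd in commands:
    print(" ".join(cmd))
```

Passing a list rather than a single string avoids shell quoting problems if a path or name contains spaces.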
Conclusion
Creating, deploying and managing a serverless app has never been easier. To get started, visit our docs page or check out the serverless-application-model repo on GitHub.
If you have questions or suggestions, please comment below.
Simplify Serverless Applications with Environment Variables in AWS Lambda

Gene Ting, Solutions Architect
Lambda developers often want to configure their functions without changing any code. In this post, we show you how to use environment variables to pass settings to your Lambda function code and libraries.
Creating and updating a Lambda function
First, create a Lambda function that uses some environment variables. Here’s a simple but realistic example that allows you to control the log level of a Lambda function by setting an environment variable called LOG_LEVEL. After you have created the code, pass a value into LOG_LEVEL so that your code can read it.

'use strict';

const logLevels = {error: 4, warn: 3, info: 2, verbose: 1, debug: 0};

// get the current log level from the current environment if set, else set to info
const currLogLevel = process.env.LOG_LEVEL != null ? process.env.LOG_LEVEL : 'info';

// print the log statement, only if the requested log level is greater than or equal to the current log level
function log(logLevel, statement) {
    if (logLevels[logLevel] >= logLevels[currLogLevel]) {
        console.log(statement);
    }
}

// return the monthly payment, rounded to the cent. FOR DEMO PURPOSES ONLY
function monthlyPayment(principal, rate, years) {
    var fv = Math.pow(1 + rate / 1200, years * 12);
    return Number(Math.round(principal * (rate / 1200 * fv) / (fv - 1) + 'e2') + 'e-2');
}

exports.handler = (event, context, callback) => {
    log('debug', "principal: " + event.principal + " - rate: " + event.rate + " - years: " + event.years);
    callback(null, monthlyPayment(event.principal, event.rate, event.years));
};
Now, here’s how to get those values into the code. You can do this through the console, or programmatically with full API and CLI support. In the console, a new section below the Lambda function allows you to specify environment variables.

By default, Lambda chooses the default KMS service key for Lambda. Passing custom keys is supported, but not required.

If you use the default KMS service key for Lambda, then you don’t need additional IAM permissions in your execution role; your role works automatically without any changes.
If you supply your own, custom KMS key, then you need to allow “kms:Decrypt”, as shown below in a basic execution role.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"kms:Decrypt"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": "logs:CreateLogGroup",
"Resource": "arn:aws:logs:us-east-1:xxxxxxxxxxxx:*"
},
{
"Effect": "Allow",
"Action": [
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Resource": [
"arn:aws:logs:us-east-1:xxxxxxxxxxxx:log-group:/aws/lambda/mortgageCalc:*"
]
}
]
}
Test your function and see environment variables in action.
Testing your Lambda function
First, in the Lambda console, configure a test event for this function, using the following JSON snippet:
{
"principal": 100000,
"rate": 6,
"years": 30
}
Next, run the function by pressing “Test”:

You should verify the result of the calculator, but what’s really interesting is what you see in the logs. With the logging level set to ‘info’, no debug information should appear:

Change LOG_LEVEL to ‘debug’ and re-run the function:

Choose “Test” again, and examine the logs: you should see that the additional debugging logs appear. Console output can be found within the log stream for the function in Amazon CloudWatch Logs:
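To check the calculator result independently of Lambda, the same formula and log-level rule can be sketched locally in Python. This is a port written for illustration, not code from the function above: the names should_log and monthly_payment are assumptions, and LOG_LEVEL is read from the local environment rather than from the Lambda configuration.

```python
import os

LOG_LEVELS = {"error": 4, "warn": 3, "info": 2, "verbose": 1, "debug": 0}


def should_log(requested, current=None):
    """Log when the requested level is at least the current level."""
    if current is None:
        current = os.environ.get("LOG_LEVEL", "info")
    return LOG_LEVELS[requested] >= LOG_LEVELS[current]


def monthly_payment(principal, rate, years):
    """Monthly payment rounded to the cent; rate is an annual percentage."""
    growth = (1 + rate / 1200) ** (years * 12)
    return round(principal * (rate / 1200 * growth) / (growth - 1), 2)


# Same inputs as the test event above.
print(monthly_payment(100000, 6, 30))
```

With the level left at info, should_log("debug") is False, which matches the absence of debug lines in the first test run.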

Targeting a table based on a given phase of a deployment lifecycle
In the example above, you used environment variables to modify the behavior of a Lambda function without changing its code. Here’s another typical use of environment variables: Providing stage-specific settings for code as it moves through lifecycle stages from development to deployment. Environment variables can be used to provide settings for resources, such as a development database password versus the production database password.
The example code below shows how to point an Amazon DynamoDB client in Java to a table that varies across stages. It also demonstrates how environment variables can be read in a Java-based Lambda function, using System.getenv.
// Write to the appropriate table based on the current environment
AmazonDynamoDBClient client = new AmazonDynamoDBClient();
DynamoDB dynamoDB = new DynamoDB(client);
Table table = dynamoDB.getTable(System.getenv("TARGET_TABLE"));
Item item = new Item(); // populate the item's key and attributes before writing
table.putItem(item);
Based on the value of “TARGET_TABLE”, the function connects to different tables. Because you change only the configuration, not your code, you know that the behavior is unchanged as you promote from one stage to another; only the environment variable settings change.
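The same lookup in Python is a one-line read from os.environ. The sketch below wraps it in a hypothetical helper, target_table_name, that fails loudly when the variable is missing, which is one way to avoid silently writing to the wrong table; the table name in the usage line is a made-up example.

```python
import os


def target_table_name(environ=os.environ):
    """Return the table name for the current stage, read from TARGET_TABLE."""
    name = environ.get("TARGET_TABLE")
    if not name:
        raise RuntimeError("TARGET_TABLE environment variable is not set")
    return name


# Passing a dictionary makes the helper easy to exercise without touching the real environment.
print(target_table_name({"TARGET_TABLE": "orders-dev"}))
```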
Conclusion
Environment variables bring a new level of simplicity to working with Lambda functions. In this post, we described how to use environment variables to:
- Change Lambda function behavior, such as switching the logging level, without changing the code
- Configure access to stage-specific resources, such as a DynamoDB table name or a database password, as your code progresses from development to production.
For more information, see Environment Variables in the AWS Lambda Developer Guide. Happy coding everyone, and have fun creating awesome serverless applications!
Binary Support for API Integrations with Amazon API Gateway
A year ago, the Microservices without the Servers post showed how Lambda can be used for creating image thumbnails. This required the client to Base64 encode the binary image file before calling the image conversion API as well as Base64 decode the response before it could be rendered. With the recent Amazon API Gateway launch of Binary Support for API Integrations, you can now specify media types that you want API Gateway to treat as binary and send and receive binary data through API endpoints hosted on Amazon API Gateway.
After this feature is configured, you can specify if you would like API Gateway to either pass the Integration Request and Response bodies through, convert them to text (Base64 encoding), or convert them to binary (Base64 decoding). These options are available for HTTP, AWS Service, and HTTP Proxy integrations. In the case of Lambda Function and Lambda Function Proxy Integrations, which currently only support JSON, the request body is always converted to JSON.
In this post, I show how you can use the new binary support in API Gateway to turn this Lambda function into a public thumbnail service: you include a binary image file in a POST request and get back a thumbnail version of that same image.
Walkthrough
To get started, log in to the AWS Management Console to set up a Lambda integration, using the image-processing-service blueprint.
Create the Lambda function
In the Lambda console, choose Create a Lambda Function.

In the blueprint filter step, for Select runtime, type in 'image' and then choose image-processing-service.

Do not set up a trigger. Choose Next.

In the Configure function step, specify the function name, such as 'thumbnail'.

In the Lambda function handler and role step, for Role, choose Create new role from template(s), and specify the role name (e.g., 'myMicroserviceRole'). Finally, choose Next. For more details, see AWS Lambda Permissions Model.

Review your Lambda function configuration and choose Create Function.

You have now successfully created the Lambda function that will create a thumbnail.

Create an API and POST method
In this section, you set up an API Gateway thumbnail API to expose a publicly accessible RESTful endpoint.
In the API Gateway console, choose Create API.

For API name, enter 'Thumbnail', add a description, and choose Create API.

In the created API, choose Resources, Actions, and Create Method.

To create the method, choose POST and select the checkmark.

To set up the POST method, for Integration type, select Lambda Function, select the appropriate Lambda region, and enter 'thumbnail' for Lambda Function. Choose Save.

In the Add Permission to Lambda Function dialog box, choose OK to enable API Gateway to invoke the 'thumbnail' Lambda function.

Set up the integration
Now, you are ready to set up the integration. In the main page, open Integration Request.

On the Integration Request page, expand Body Mapping Templates.

For Request body passthrough, choose When there are no templates defined (recommended). For Content-Type, enter "image/png".

Choose Add mapping template and add the following template. The thumbnail Lambda function requires that you pass an operation to execute, in this case "thumbnail", and the image payload "base64Image" you are passing in, which is "$input.body". Review the following JSON and choose Save.
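To make the shape of that payload concrete, here is a minimal Python sketch of the event the Lambda function would receive once the template has run. The field names "operation" and "base64Image" come from the description above; the helper name build_thumbnail_event and the sample bytes are illustrative, and the sketch assumes that "$input.body" reaches the template as Base64 text for a media type registered as binary.

```python
import base64
import json


def build_thumbnail_event(image_bytes):
    """Mirror the JSON the mapping template is described as producing."""
    return {
        "operation": "thumbnail",
        # Assumption: $input.body arrives as Base64 text for binary media types.
        "base64Image": base64.b64encode(image_bytes).decode("ascii"),
    }


# The first eight bytes of a PNG file stand in for a real image here.
event = build_thumbnail_event(b"\x89PNG\r\n\x1a\n")
print(json.dumps(event))
```

Building the event locally like this can also be a convenient way to produce a test payload for the Lambda console.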

Specify which media types need to be handled as binary. Choose [API name], Binary Support.

Choose Edit, specify the media type (such as "image/png") to be handled as binary, and then choose Save.

Deployment
Now that the API is configured, you need to deploy it. On the thumbnail Resources page, choose Actions, Deploy API.

For Deployment stage, select [New Stage], specify a stage name, and then choose Deploy.

A stage has been created for you; you receive the Invoke URL value to be used for your thumbnail API.

Testing
Now, you are ready to test the newly created API. Download your favorite .png image (such as apigateway.png), and issue the following curl command. Update the .png image file name and the Invoke URL value accordingly.
$ curl --request POST -H "Accept: image/png" -H "Content-Type: image/png" --data-binary "@apigateway.png" https://XXXXX.execute-api.us-east-1.amazonaws.com/prod > apigateway-thumb.png
You should now be able to open the created images in your favorite image viewer to confirm that resizing has occurred.
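If you would rather call the API from code than from curl, the same request can be sketched with the Python 3 standard library. The function names are illustrative, and the URL is the placeholder from the curl command, so replace it with your own Invoke URL.

```python
import urllib.request

# Placeholder from the curl example; replace with your own Invoke URL.
INVOKE_URL = "https://XXXXX.execute-api.us-east-1.amazonaws.com/prod"


def build_thumbnail_request(image_bytes, url=INVOKE_URL):
    """Build (but do not send) a POST carrying raw PNG bytes."""
    return urllib.request.Request(
        url,
        data=image_bytes,
        method="POST",
        headers={"Accept": "image/png", "Content-Type": "image/png"},
    )


def fetch_thumbnail(src_path, dst_path, url=INVOKE_URL):
    """Send the image at src_path and write the response body to dst_path."""
    with open(src_path, "rb") as src:
        request = build_thumbnail_request(src.read(), url)
    with urllib.request.urlopen(request) as response, open(dst_path, "wb") as dst:
        dst.write(response.read())
```

Calling fetch_thumbnail("apigateway.png", "apigateway-thumb.png") would then play the role of the curl command above. Note that no Base64 step appears on the client side, because the bytes are sent and received as-is.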

Summary
This is just one example of how you can leverage the new binary support in API Gateway. For more examples, see the API Gateway Payload Encodings topic in the Amazon API Gateway Developer Guide.
If you have questions or suggestions, please comment below.
Building a Backup System for Scaled Instances using AWS Lambda and Amazon EC2 Run Command
Diego Natali, AWS Cloud Support Engineer
When an Auto Scaling group needs to scale in, replace an unhealthy instance, or re-balance Availability Zones, the instance is terminated, data on the instance is lost and any on-going tasks are interrupted. This is normal behavior but sometimes there are use cases when you might need to run some commands, wait for a task to complete, or execute some operations (for example, backing up logs) before the instance is terminated. So Auto Scaling introduced lifecycle hooks, which give you more control over timing after an instance is marked for termination.
In this post, I explore how you can leverage Auto Scaling lifecycle hooks, AWS Lambda, and Amazon EC2 Run Command to back up your data automatically before the instance is terminated. The solution illustrated allows you to back up your data to an S3 bucket; however, with minimal changes, it is possible to adapt this design to carry out any task that you prefer before the instance gets terminated, for example, waiting for a worker to complete a task before terminating the instance.
Using Auto Scaling lifecycle hooks, Lambda, and EC2 Run Command
You can configure your Auto Scaling group to add a lifecycle hook when an instance is selected for termination. The lifecycle hook enables you to perform custom actions as Auto Scaling launches or terminates instances. In order to perform these actions automatically, you can leverage Lambda and EC2 Run Command to allow you to avoid the use of additional software and to rely completely on AWS resources.
For example, when an instance is marked for termination, Amazon CloudWatch Events can execute an action based on that. This action can be a Lambda function to execute a remote command on the machine and upload your logs to your S3 bucket.
EC2 Run Command enables you to run remote scripts through the agent running within the instance. You use this feature to back up the instance logs and to complete the lifecycle hook so that the instance is terminated.
The example provided in this post works precisely this way. Lambda gathers the instance ID from CloudWatch Events and then triggers a remote command to back up the instance logs.
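As a rough illustration of the first half of that flow, the sketch below pulls the instance ID and hook details out of a lifecycle event. It is not the function from the GitHub repository: the helper name lifecycle_details is hypothetical, and the sample event is trimmed, with made-up IDs.

```python
def lifecycle_details(event):
    """Pull the fields needed to act on a terminating instance."""
    detail = event["detail"]
    return {
        "instance_id": detail["EC2InstanceId"],
        "group_name": detail["AutoScalingGroupName"],
        "hook_name": detail["LifecycleHookName"],
        "token": detail["LifecycleActionToken"],
    }


# Trimmed, illustrative event; the IDs are made up.
sample_event = {
    "source": "aws.autoscaling",
    "detail-type": "EC2 Instance-terminate Lifecycle Action",
    "detail": {
        "EC2InstanceId": "i-0123456789abcdef0",
        "AutoScalingGroupName": "ASGBackup",
        "LifecycleHookName": "ASGBackup",
        "LifecycleActionToken": "00000000-0000-0000-0000-000000000000",
        "LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING",
    },
}

print(lifecycle_details(sample_event)["instance_id"])
```

Keeping the parsing in its own small function makes it easy to test against a saved event before wiring the function to a live Auto Scaling group.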

Set up the environment
Make sure that you have the latest version of the AWS CLI installed locally. For more information, see Getting Set Up with the AWS Command Line Interface.
Step 1 – Create an SNS topic to receive the result of the backup
In this step, you create an Amazon SNS topic in the region in which you run your Auto Scaling group. This topic allows EC2 Run Command to send you the outcome of the backup. The output of the aws sns create-topic command includes the ARN. Save the ARN, as you need it for future steps.
aws sns create-topic --name backupoutcome
Now subscribe your email address as the endpoint for SNS to receive messages.
aws sns subscribe --topic-arn <enter-your-sns-arn-here> --protocol email --notification-endpoint <your_email>
Step 2 – Create an IAM role for your instances and your Lambda function
In this step, you use the AWS console to create the AWS Identity and Access Management (IAM) role for your instances and Lambda to enable them to run the SSM agent, upload your files to your S3 bucket, and complete the lifecycle hook.
First, you need to create a custom policy to allow your instances and Lambda function to complete lifecycle hooks and publish to the SNS topic set up in Step 1.
- Log into the IAM console.
- Choose Policies, Create Policy
- For Create Your Own Policy, choose Select.
- For Policy Name, type “ASGBackupPolicy”.
- For Policy Document, paste the following policy, which allows the lifecycle hook to be completed and messages to be published to SNS:
{
"Version": "2012-10-17",
"Statement": [
{
"Action": [
"autoscaling:CompleteLifecycleAction",
"sns:Publish"
],
"Effect": "Allow",
"Resource": "*"
}
]
}
Create the role for EC2.
- In the left navigation pane, choose Roles, Create New Role.
- For Role Name, type “instance-role” and choose Next Step.
- Choose Amazon EC2 and choose Next Step.
- Add the policies AmazonEC2RoleforSSM and ASGBackupPolicy.
- Choose Next Step, Create Role.
Create the role for the Lambda function.
- In the left navigation pane, choose Roles, Create New Role.
- For Role Name, type “lambda-role” and choose Next Step.
- Choose AWS Lambda and choose Next Step.
- Add the policies AmazonSSMFullAccess, ASGBackupPolicy, and AWSLambdaBasicExecutionRole.
- Choose Next Step, Create Role.
Step 3 – Create an Auto Scaling group and configure the lifecycle hook
In this step, you create the Auto Scaling group and configure the lifecycle hook.
- Log into the EC2 console.
- Choose Launch Configurations, Create launch configuration.
- Select the latest Amazon Linux AMI and whatever instance type you prefer, and choose Next: Configuration details.
- For Name, type “ASGBackupLaunchConfiguration”.
- For IAM role, choose “instance-role” and expand Advanced Details.
- For User data, add the following lines to install and launch the SSM agent at instance boot:
#!/bin/bash
sudo yum install amazon-ssm-agent -y
sudo /sbin/start amazon-ssm-agent
- Choose Skip to review, Create launch configuration, select your key pair, and then choose Create launch configuration.
- Choose Create an Auto Scaling group using this launch configuration.
- For Group name, type “ASGBackup”.
- Select your VPC and at least one subnet and then choose Next: Configure scaling policies, Review, and Create Auto Scaling group.
Your Auto Scaling group is now created and you need to add the lifecycle hook named “ASGBackup” by using the AWS CLI:
aws autoscaling put-lifecycle-hook --lifecycle-hook-name ASGBackup --auto-scaling-group-name ASGBackup --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING --heartbeat-timeout 3600
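If you manage your infrastructure from Python rather than the CLI, the same call can be sketched with boto3. The helper names below are illustrative, and the client is passed in as a parameter so that the sketch does not create one itself. The parameter names mirror the CLI flags above.

```python
def lifecycle_hook_params(group_name="ASGBackup", hook_name="ASGBackup",
                          heartbeat_timeout=3600):
    """Keyword arguments mirroring the CLI flags above."""
    return {
        "LifecycleHookName": hook_name,
        "AutoScalingGroupName": group_name,
        "LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING",
        "HeartbeatTimeout": heartbeat_timeout,
    }


def put_hook(autoscaling_client, **overrides):
    """Create or update the hook; pass a boto3 'autoscaling' client."""
    params = lifecycle_hook_params(**overrides)
    return autoscaling_client.put_lifecycle_hook(**params)
```

With boto3 installed, put_hook(boto3.client("autoscaling")) would be the equivalent of the CLI command. The heartbeat timeout of 3600 seconds is the upper bound on how long the instance waits for the backup to finish.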
Step 4 – Create an S3 bucket for files
Create an S3 bucket where your data will be saved, or use an existing one. To create a new one, you can use this AWS CLI command:
aws s3api create-bucket --bucket <your_bucket_name>
Step 5 – Create the SSM document
The following JSON document archives the files in “BACKUPDIRECTORY” and then copies them to your S3 bucket “S3BUCKET”. Each time the command finishes, it sends an SNS message to the topic specified by the “SNSTARGET” variable and then completes the lifecycle hook.
In your JSON document, you need to make a few changes according to your environment:
| Setting | Line | Value |
| --- | --- | --- |
| Auto Scaling group name | 12 | "ASGNAME='ASGBackup'", |
| Lifecycle hook name | 13 | "LIFECYCLEHOOKNAME='ASGBackup'", |
| Directory to back up | 14 | "BACKUPDIRECTORY='/var/log'", |
| S3 bucket | 15 | "S3BUCKET='<your_bucket_name>'", |
| SNS target | 16 | "SNSTARGET='arn:aws:sns:'${REGION}':<your_account_id>:<your_sns_backupoutcome_topic>'", |
Here is the document:
{
"schemaVersion": "1.2",
"description": "Backup logs to S3",
"parameters": {},
"runtimeConfig": {
"aws:runShellScript": {
"properties": [
{
"id": "0.aws:runShellScript",
"runCommand": [
"REGION=$(curl http://169.254.169.254/latest/meta-data/placement/availability-zone); REGION=${REGION::-1}",
"ASGNAME='ASGBackup'",
"LIFECYCLEHOOKNAME='ASGBackup'",
"BACKUPDIRECTORY='/var/log'",
"S3BUCKET='<your_bucket_name>'",
"SNSTARGET='arn:aws:sns:'${REGION}':<your_account_id>:<your_sns_backupoutcome_topic>'",
"INSTANCEID=$(curl http://169.254.169.254/latest/meta-data/instance-id)",
"HOOKRESULT='CONTINUE'",
"MESSAGE=''",
"",
"tar -cf /tmp/${INSTANCEID}.tar $BACKUPDIRECTORY &> /tmp/backup",
"if [ $? -ne 0 ]",
"then",
" MESSAGE=$(cat /tmp/backup)",
"else",
" aws s3 cp /tmp/${INSTANCEID}.tar s3://${S3BUCKET}/${INSTANCEID}/ &> /tmp/backup",
" MESSAGE=$(cat /tmp/backup)",
"fi",
"",
"aws sns publish --subject 'ASG Backup' --message \"$MESSAGE\" --target-arn ${SNSTARGET} --region ${REGION}",
"aws autoscaling complete-lifecycle-action --lifecycle-hook-name ${LIFECYCLEHOOKNAME} --auto-scaling-group-name ${ASGNAME} --lifecycle-action-result ${HOOKRESULT} --instance-id ${INSTANCEID} --region ${REGION}"
]
}
]
}
}
}
- Log into the EC2 console.
- Choose Command History, Documents, Create document.
- For Document name, enter “ASGLogBackup”.
- For Content, add the above JSON, modified for your environment.
- Choose Create document.
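The script derives the region from the instance's Availability Zone by dropping the trailing zone letter, and the SNS topic ARN can only be assembled once that region is known. The Python sketch below restates those two steps outside the shell. The function names and the account ID are illustrative, and the approach assumes standard zone names such as us-east-1a.

```python
def region_from_az(availability_zone):
    """Drop the trailing zone letter, as ${REGION::-1} does in the script."""
    return availability_zone[:-1]


def sns_topic_arn(region, account_id, topic_name):
    """Assemble a topic ARN in the form used by SNSTARGET."""
    return "arn:aws:sns:{}:{}:{}".format(region, account_id, topic_name)


region = region_from_az("us-east-1a")
# 123456789012 is a made-up account ID used only for illustration.
print(sns_topic_arn(region, "123456789012", "backupoutcome"))
```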
Step 6 – Create the Lambda function
The Lambda function uses modules included in the Python 2.7 Standard Library and the AWS SDK for Python module (boto3), which is preinstalled as part of Lambda. The function code performs the following:
- Checks to see whether the SSM document exists. This document is the script that your instance runs.
- Sends the command to the instance that is being terminated. It checks for the status of EC2 Run Command and if it fails, the Lambda function completes the lifecycle hook.
- Log in to the Lambda console.
- Choose Create Lambda function.
- For Select blueprint, choose Skip, Next.
- For Name, type “lambda_backup” and for Runtime, choose Python 2.7.
- For Lambda function code, paste the Lambda function from the GitHub repository.
- Choose Choose an existing role.
- For Role, choose lambda-role (previously created).
- In Advanced settings, configure Timeout for 5 minutes.
- Choose Next, Create function.
Your Lambda function is now created.
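The function in the GitHub repository is the reference implementation. As a rough sketch of the control flow it is described as having, the snippet below checks for the SSM document, sends the command, and completes the lifecycle hook if anything fails. The function name run_backup and the shape of the details dictionary are hypothetical. Using ABANDON as the failure result is also an assumption; for a terminating instance, either result lets termination proceed.

```python
def run_backup(ssm, autoscaling, details, document_name="ASGLogBackup"):
    """Send the backup document to the instance; free the hook on failure.

    `ssm` and `autoscaling` are boto3 clients (or stand-ins with the same
    methods); `details` holds instance_id, group_name, and hook_name.
    """
    try:
        ssm.describe_document(Name=document_name)  # raises if it is missing
        response = ssm.send_command(
            InstanceIds=[details["instance_id"]],
            DocumentName=document_name,
        )
        return response["Command"]["CommandId"]
    except Exception as error:  # broad on purpose: any failure frees the instance
        print("Run Command failed: {}".format(error))
        autoscaling.complete_lifecycle_action(
            LifecycleHookName=details["hook_name"],
            AutoScalingGroupName=details["group_name"],
            LifecycleActionResult="ABANDON",
            InstanceId=details["instance_id"],
        )
        return None
```

Passing the clients in as arguments keeps the logic testable with simple stand-in objects, without calling AWS.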
Step 7 – Configure CloudWatch Events to trigger the Lambda function
Create an event rule to trigger the Lambda function.
- Log in to the CloudWatch console.
- Choose Events, Create rule.
- For Select event source, choose Auto Scaling.
- For Specific instance event(s), choose EC2 Instance-terminate Lifecycle Action and for Specific group name(s), choose ASGBackup.
- For Targets, choose Lambda function and for Function, select the Lambda function that you previously created, “lambda_backup”.
- Choose Configure details.
- In Rule definition, type a name and choose Create rule.
Your event rule is now created; whenever your Auto Scaling group “ASGBackup” starts terminating an instance, your Lambda function will be triggered.
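The console steps above amount to an event pattern like the one sketched below in Python. The small matches helper is only an illustration of how such a pattern selects events. It handles just this pattern's shape and is not how CloudWatch Events is implemented.

```python
import json

EVENT_PATTERN = {
    "source": ["aws.autoscaling"],
    "detail-type": ["EC2 Instance-terminate Lifecycle Action"],
    "detail": {"AutoScalingGroupName": ["ASGBackup"]},
}


def matches(pattern, event):
    """Tiny matcher: lists hold allowed values, dicts describe nested fields."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


print(json.dumps(EVENT_PATTERN, indent=2))
```

Events from other Auto Scaling groups, or other event types from the same group, do not match, so the Lambda function runs only for termination lifecycle actions on ASGBackup.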
Step 8 – Test the environment
From the Auto Scaling console, you can change the desired capacity and the minimum for your Auto Scaling group to 0 so that the running instance starts being terminated. After the termination starts, you can see from the Instances tab that the instance lifecycle status changed to Terminating:Wait. While the instance is in this state, the Lambda function and the command are executed.
You can review your CloudWatch logs to see the Lambda output. In the CloudWatch console, choose Logs and /aws/lambda/lambda_backup to see the execution output.
You can go to your S3 bucket and check that the files were uploaded. You can also check Command History in the EC2 console to see if the command was executed correctly.
Conclusion
Now that you’ve seen an example of how you can combine various AWS services to automate the backup of your files by relying only on AWS services, I hope you are inspired to create your own solutions.
Auto Scaling lifecycle hooks, Lambda, and EC2 Run Command are powerful tools because they allow you to respond to Auto Scaling events automatically, such as when an instance is terminated. However, you can also use the same idea for other solutions like exiting processes gracefully before an instance is terminated, deregistering your instance from service managers, and scaling stateful services by moving state to other running instances. The possible use cases are only limited by your imagination.
I’ve open-sourced the code in this example in the awslabs GitHub repo; I can’t wait to see your feedback and your ideas about how to improve the solution.
Amazon ECS sessions at re:Invent
Come learn about containers—from the basics to production topics such as scaling and security—from customers and Amazon ECS subject matter experts at this year’s re:Invent conference. We’re excited to learn from you and hear what you think about our recently launched features. Containers are highlighted at Thursday’s Containers Mini Con at The Mirage:
- CON301 – Operations Management with Amazon ECS
- CON302 – Development Workflow with Docker and Amazon ECS
- CON303 – Introduction to Container Management on AWS
- CON308 – Service Integration Delivery and Automation Using Amazon ECS
- CON309 – Running Microservices on Amazon ECS
- CON310 – Running Batch Jobs on Amazon ECS
- CON311 – Operations Automation and Infrastructure Management with Amazon ECS
- CON312 – Deploying Scalable SAP Hybris Clusters using Docker
- CON313 – Netflix: Container Scheduling, Execution, and Integration with AWS
- CON316 – State of the Union: Containers
- CON401 – Amazon ECR Deep Dive on Image Optimization
- CON402 – Securing Container-Based Applications
There are also two hands-on workshops:
- CON314 – Workshop: Build a Recommendation Engine on Amazon ECS
- CON315 – Workshop: Deploy a Swift Web Application on Amazon ECS
There are other breakout sessions that talk about Amazon ECS; three that I’d like to highlight are:
- GAM401 – Riot Games: Standardizing Application Deployments Using Amazon ECS and Terraform
- NET203 – From EC2 to ECS: How Capital One uses Application Load Balancer Features to Serve Traffic at Scale
- DEV313 – Infrastructure Continuous Deployment Using AWS CloudFormation
You can also join us for an open Q&A session at the Dev Lounge, watch ECS demos at the Demo Pavilion, and ask us questions in the AWS Booth at re:Invent Central.
We look forward to meeting you at re:Invent 2016!
AWS Lambda sessions at re:Invent 2016
Vyom Nagrani, Manager of Product Management, AWS Lambda
AWS Lambda was announced on 13th Nov, 2014, which makes us 2 years old today! We have come a long way in these last 2 years, with many small and big customers using AWS Lambda in production, many new features, and a broader partner network.
Come talk to the AWS Lambda team at re:Invent 2016, where we plan to host a re:Source Mini Con at The Mirage focused on Serverless Computing. The AWS Lambda subject matter experts and existing AWS Lambda customers will be presenting on the following breakout sessions:
Breakout Sessions at Serverless Computing Mini Con
- SVR202 – What’s New with AWS Lambda
- SVR301 – Real-time Data Processing Using AWS Lambda
- SVR302 – Optimizing the Data Tier in Serverless Web Applications
- SVR303 – Coca-Cola: Running Serverless Applications with Enterprise Requirements
- SVR304 – bots + serverless = ❤
- SVR305 – ↑↑↓↓←→←→ BA Lambda Start
- SVR306 – Serverless Computing Patterns at Expedia
- SVR307 – Application Lifecycle Management in a Serverless World
- SVR308 – Content and Data Platforms at Vevo: Rebuilding and Scaling from Zero in One Year
- SVR311 – The State of Serverless Computing
- SVR401 – Using AWS Lambda to Build Control Systems for Your AWS Infrastructure
- SVR402 – Operating Your Production API
Many breakout sessions in other tracks also talk about AWS Lambda; a few I’d like to highlight are:
Additional Breakout Sessions related to AWS Lambda
- ALX302 – Build a Serverless Back End for Your Alexa-Based Voice Interactions
- ARC202 – Accenture Cloud Platform Serverless Journey
- ARC312 – Compliance Architecture: How Capital One Automates the Guard Rails for 6,000 Developers
- BDM303 – JustGiving: Serverless Data Pipelines, Event-Driven ETL, and Stream Processing
- CMP211 – Getting Started with Serverless Architectures
- DAT309 – How Fulfillment by Amazon (FBA) and Scopely Improved Results and Reduced Costs with a Serverless Architecture
- DEV205 – Monitoring, Hold the Infrastructure: Getting the Most from AWS Lambda
- DEV301 – Amazon CloudWatch Logs and AWS Lambda: A Match Made in Heaven
- DEV308 – Chalice: A Serverless Microframework for Python
Feel you know enough about AWS Lambda, and would rather build something on top of it? We will also be hosting the following Workshops:
Workshops related to AWS Lambda
- SPL02 – Spotlight Lab: Serverless Architectures Using Amazon CloudWatch Events and Scheduled Events with AWS Lambda
- ALX203 – Workshop: Creating Voice Experiences with Alexa Skills: From Idea to Testing in Two Hours
- DCS205 – Workshop: Building Serverless Bots on AWS – Botathon
- SVR309 – Wild Rydes Takes Off – The Dawn of a New Unicorn
- SVR310 – All Your Chats are Belong to Bots: Building a Serverless Customer Service Bot
- MBL305 – Developing Mobile Apps and Serverless Microservices for Enterprises using AWS
- MBL306 – Serverless Authentication and Authorization: Identity Management for Serverless Architectures
Still have questions? Join us for an open Q&A session at the Dev Lounge at the Venetian at 3:00 pm on Thursday 12/1. And of course, we will be at the AWS Booth at re:Invent Central in the Expo hall; come talk to us at the Compute table.
Looking forward to meeting you at re:Invent 2016!
Distributed Deep Learning Made Easy
This is a guest post from my colleagues Naveen Swamy and Joseph Spisak.
———————————
Machine learning is a field of computer science that enables computers to learn without being explicitly programmed. It focuses on algorithms that can learn from and make predictions on data.
Most recently, one branch of machine learning, called deep learning, has been deployed successfully in production with higher accuracy than traditional techniques, enabling capabilities such as speech recognition, image recognition, and video analytics. This higher accuracy comes, however, at the cost of significantly higher compute requirements for training these deep models.
One of the major reasons for this rebirth and rapid progress is the availability and democratization of cloud-scale computing. Training state-of-the-art deep neural networks can be time-consuming, with larger networks like ResidualNet taking several days to weeks to train, even on the latest GPU hardware. Because of this, a scale-out approach is required.
Accelerating training time has multiple benefits, including:
- Enabling faster iterative research, allowing scientists to push the state of the art faster in domains such as computer vision or speech recognition.
- Reducing the time-to-market for intelligent applications, allowing AI applications that consume trained, deep learning models to access newer models faster.
- Absorbing new data faster, helping to keep deep learning models current.
AWS CloudFormation, which creates and configures Amazon Web Services resources with a template, simplifies the process of setting up a distributed deep learning cluster. The CloudFormation Deep Learning template uses the Amazon Deep Learning AMI (supporting MXNet, TensorFlow, Caffe, Theano, Torch, and CNTK frameworks) to launch a cluster of Amazon EC2 instances and other AWS resources needed to perform distributed deep learning. CloudFormation creates all resources in the customer account.
EC2 Cluster Architecture

Resources created by the Deep Learning template
The Deep Learning template creates a stack that contains the following resources:
- A VPC in the customer account.
- The requested number of worker instances in an Auto Scaling group within the VPC. These worker instances are launched in a private subnet.
- A master instance in a separate Auto Scaling group that acts as a proxy to enable connectivity to the cluster via SSH. CloudFormation places this instance within the VPC and connects it to both the public and private subnets. This instance has both public IP addresses and DNS.
- A security group that allows external SSH access to the master instance.
- Two security groups that open ports on the private subnet for communication between the master and workers.
- An IAM role that allows users to access and query Auto Scaling groups and the private IP addresses of the EC2 instances.
- A NAT gateway used by the instances within the VPC to talk to the outside world.
The startup script enables SSH forwarding on all hosts. Enabling SSH is essential because frameworks such as MXNet make use of SSH for communication between master and worker instances during distributed training. The startup script queries the private IP addresses of all the hosts in the stack, appends the IP address and worker alias to /etc/hosts, and writes the list of worker aliases to /opt/deeplearning/workers.
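To picture what the startup script produces, here is a minimal Python sketch that turns a list of private IP addresses into /etc/hosts lines and a list of worker aliases. The alias prefix and the IP addresses are hypothetical; the template's actual naming scheme is not given in this post.

```python
def hosts_entries(private_ips, alias_prefix="deeplearning-worker"):
    """Return (hosts_lines, aliases) for the given private IP addresses."""
    aliases = ["{}{}".format(alias_prefix, i)
               for i in range(1, len(private_ips) + 1)]
    lines = ["{} {}".format(ip, alias)
             for ip, alias in zip(private_ips, aliases)]
    return lines, aliases


# Illustrative addresses only.
lines, aliases = hosts_entries(["10.0.1.10", "10.0.1.11"])
print("\n".join(lines))    # what would be appended to /etc/hosts
print("\n".join(aliases))  # what would be written to the workers file
```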
The startup script sets up the following environment variables:
- $DEEPLEARNING_WORKERS_PATH: The file path that contains the list of workers
- $DEEPLEARNING_WORKERS_COUNT: The total number of workers
- $DEEPLEARNING_WORKER_GPU_COUNT: The number of GPUs on the instance
Launch a CloudFormation Stack
Note: To scale to the desired number of instances beyond the default limit, file a support request.
- Download the Deep Learning template from the MXNet GitHub repo.
- Open the CloudFormation console, and then choose Create New Stack.
- Choose Choose File to upload the template, and then choose Next.
- For Stack name, enter a descriptive stack name.
- Choose a GPU InstanceType, such as a P2.16xlarge.
- For KeyName, choose an EC2 key pair.
- For SSHLocation, choose a valid CIDR IP address range to allow SSH access to the master instance and stack.
- For Worker Count, type a value. The stack provisions the worker count + 1, with the additional instance acting as the master. The master also participates in the training/evaluation. Choose Next.
- (Optional) Under Tags, type values for Key and Value. This allows you to assign metadata to your resources. (Optional) Under Permissions, you can choose the IAM role that CloudFormation uses to create the stack. Choose Next.
- Under Capabilities, select the checkbox to agree to allow CloudFormation to create an IAM role. An IAM role is required for correctly setting up a stack.
- To create the CloudFormation stack, choose Create.
- To see the status of your stack, choose Events. If stack creation fails, for example, because of an access issue or an unsupported number of workers, troubleshoot the issue. For information about troubleshooting the creation of stacks, see Troubleshooting AWS CloudFormation. The event log records the reason for failure.
Log in to the master instance.
SSH agent forwarding securely connects the instances within the VPC that is connected to the private subnet. The idea is based on Securely Connect to Linux Instances Running in a Private Amazon VPC.
1. Find the public DNS/IP of the master.
The CloudFormation stack output contains the Auto Scaling group in which the master instance is launched. Note the Auto Scaling group ID for MasterAutoScalingGroup.
- Open the Amazon EC2 console.
- In the navigation pane, under Auto Scaling, choose Auto Scaling Groups.
- On the Auto Scaling page, search for the group ID and select it.
- On the Instances tab, find the instance ID of the master instance.
- Choose the instance to find the public DNS/IP address used for login.
2. Enable SSH agent forwarding.
This enables communication with all instances in the private subnet. Using the DNS/IP address from Step 1, modify the SSH configuration to include these lines:
Host IP/DNS-from-above
ForwardAgent yes
3. Run MXNet distributed training.
The following example shows how to run MNIST with data parallelism. Note the use of the DEEPLEARNING_* environment variables:
#terminate all running Python processes across workers
while read -u 10 host; do ssh $host "pkill -f python" ; done 10<$DEEPLEARNING_WORKERS_PATH
#navigate to the mnist image-classification example directory
cd ~/src/mxnet/example/image-classification
#run the MNIST distributed training example
../../tools/launch.py -n $DEEPLEARNING_WORKERS_COUNT -H $DEEPLEARNING_WORKERS_PATH python train_mnist.py --gpus $(seq -s , 0 1 $(($DEEPLEARNING_WORKER_GPU_COUNT - 1))) --network lenet --kv-store dist_sync
These steps are only a subset. For more information about running distributed training, see Run MXNet on Multiple Devices.
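The launch command above packs several pieces together, in particular the seq expression that expands to a comma-separated list of GPU indices. The Python sketch below spells out how the arguments are derived from the DEEPLEARNING_* variables. The helper names are illustrative, and the sketch only builds the argument list; it does not start training.

```python
import os


def gpu_list(gpu_count):
    """Comma-separated GPU indices, matching `seq -s , 0 1 $((N - 1))`."""
    return ",".join(str(i) for i in range(gpu_count))


def launch_command(env=None):
    """Build the argument list for the distributed MNIST run shown above."""
    env = os.environ if env is None else env
    return [
        "../../tools/launch.py",
        "-n", env["DEEPLEARNING_WORKERS_COUNT"],
        "-H", env["DEEPLEARNING_WORKERS_PATH"],
        "python", "train_mnist.py",
        "--gpus", gpu_list(int(env["DEEPLEARNING_WORKER_GPU_COUNT"])),
        "--network", "lenet",
        "--kv-store", "dist_sync",
    ]
```

On an instance with four GPUs, for example, gpu_list(4) yields "0,1,2,3", which is the value passed to --gpus on every worker.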
FAQ
1. How do I change the IP addresses that are allowed to SSH to the master instance?
The CloudFormation stack output contains the security group that controls the inbound IP addresses for SSH access to the master instance. Use this security group to change your inbound IP addresses.
2. When an instance is replaced, are the IP addresses of the instances updated?
No. You must update IP addresses manually.
3. Does the master instance participate in training/validation?
Yes. Because most deep learning tasks involve GPUs, the master instance acts both as a proxy and as a distributed training/validation instance.
4. Why are the instances in an Auto Scaling group?
An Auto Scaling group maintains the number of desired instances by launching a new instance if an existing instance fails. There are two Auto Scaling groups: one for the master and one for the workers in the private subnet. Because only the master instance has a public endpoint to access the hosts in the stack, if the master instance becomes unavailable, you can terminate it and the associated Auto Scaling group automatically launches a new master instance with a new public endpoint.
5. When a new worker instance is added or an existing instance replaced, does CloudFormation update the IP addresses on the master instance?
No, this template does not have the capability to automatically update the IP address of the replacement instance.
Build Serverless Applications in AWS Mobile Hub with New Cloud Logic and User Sign-in Features
Last month, we showed you how to power a mobile back end using a serverless stack, with your business logic in AWS Lambda and the resulting cloud APIs exposed to your app through Amazon API Gateway. This pattern enables you to create and test mobile cloud APIs backed by business logic functions you develop, all without managing servers or paying for unused capacity. Further, you can share your business logic across your iOS and Android apps.
Today, AWS Mobile Hub is announcing a new Cloud Logic feature that makes it much easier for mobile app developers to implement this pattern, integrate their mobile apps with the resulting cloud APIs, and connect the business logic functions to a range of AWS services or on-premises enterprise resources. The feature automatically applies access control to the cloud APIs in API Gateway, making it easy to limit access to app users who have authenticated with any of the user sign-in options in Mobile Hub, including two new options that are also launching today:
- Fully managed email- and password-based app sign-in
- SAML-based app sign-in
In this post, we show how you can build a secure mobile back end in just a few minutes using a serverless stack.
Get started with AWS Mobile Hub
We launched Mobile Hub last year to simplify the process of building, testing, and monitoring mobile applications that use one or more AWS services. Use the integrated Mobile Hub console to choose the features you want to include in your app.
With Mobile Hub, you don’t have to be an AWS expert to begin using its powerful back-end features in your app. Mobile Hub then provisions and configures the necessary AWS services on your behalf and creates a working quickstart app for you. This includes IAM access control policies created to save you the effort of provisioning security policies for resources such as Amazon DynamoDB tables and associating those resources with Amazon Cognito.
Get started with Mobile Hub by navigating to it in the AWS console and choosing your features.

New user sign-in options
We are happy to announce that we now support two new user sign-in options that help you authenticate your app users and provide secure access control to AWS resources.
The Email and Password option lets you easily provision a fully managed user directory for your app in Amazon Cognito, with sign-in parameters that you configure. The SAML Federation option enables you to authenticate app users using existing credentials in your SAML-enabled identity provider, such as Active Directory Federation Services (AD FS). Mobile Hub also provides ready-to-use app flows for sign-up, sign-in, and password recovery codes that you can add to your own app.
Navigate to the User Sign-in tile in Mobile Hub to get started and choose your sign-in providers.

Read more about the user sign-in feature in this blog and in the Mobile Hub documentation.
Enhanced Cloud Logic
We have enhanced the Cloud Logic feature (the right-hand tile in the top row of the above Mobile Hub screenshot), and you can now easily spin up a serverless stack. This enables you to create and test mobile cloud APIs connected to business logic functions that you develop. Previously, you could use Mobile Hub to integrate existing Lambda functions with your mobile app. With the enhanced Cloud Logic feature, you can now easily create Lambda functions, as well as API Gateway endpoints that you invoke from your mobile apps.
The feature automatically applies access control to the resulting REST APIs in API Gateway, making it easy to limit access to users who have authenticated with any of the user sign-in capabilities in Mobile Hub. Mobile Hub also allows you to test your APIs within your project and set up the permissions that your Lambda function needs for connecting to software resources behind a VPC (e.g., business applications or databases), within AWS or on-premises. Finally, you can integrate your mobile app with your cloud APIs using either the quickstart app (as an example) or the mobile app SDK; both are custom-generated to match your APIs. Here’s how it comes together:

Create an API
After you have chosen a sign-in provider, choose Configure more features. Navigate to Cloud Logic in your project and choose Create a new API. You can choose to limit access to your Cloud Logic API to only signed-in app users:

Under the covers, this creates an IAM role for the API that limits access to authenticated, or signed-in, users.

Quickstart app
The resulting quickstart app generated by Mobile Hub allows you to test your APIs and learn how to develop a mobile UX that invokes your APIs:

Multi-stage rollouts
To make it easy to deploy and test your Lambda function quickly, Mobile Hub provisions both your API and the Lambda function in a Development stage, for instance, https://<yoururl>/Development. This is mapped to a Lambda alias of the same name, Development. Lambda functions are versioned, and this alias always points to the latest version of the Lambda function ($LATEST). This way, changes you make to your Lambda function are immediately reflected when you invoke the corresponding API in API Gateway.
When you are ready to deploy to production, you can create more stages in API Gateway, such as Production. This gives you an endpoint such as https://<yoururl>/Production. Then, create an alias of the same name in Lambda but point this alias to a specific version of your Lambda function (instead of $LATEST). This way, your Production endpoint always points to a known version of your Lambda function.
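The alias step described above can be sketched with the AWS SDK for Python (boto3). This is a minimal sketch under stated assumptions, not what Mobile Hub does on your behalf: the helper name `pin_alias_to_new_version` and the function name `my-mobile-backend` are hypothetical, and the client is passed in as a parameter so that you supply your own `boto3.client("lambda")`.

```python
def pin_alias_to_new_version(lambda_client, function_name, alias_name="Production"):
    """Publish the current $LATEST code as a numbered version and point an alias at it.

    lambda_client is expected to behave like boto3.client("lambda").
    Returns the version string that the alias now refers to.
    """
    # publish_version snapshots $LATEST and returns the new version number as a string.
    version = lambda_client.publish_version(FunctionName=function_name)["Version"]

    # create_alias fails if the alias already exists; update_alias moves an existing alias.
    lambda_client.create_alias(
        FunctionName=function_name,
        Name=alias_name,
        FunctionVersion=version,
    )
    return version


# Example call (needs AWS credentials; "my-mobile-backend" is a hypothetical name):
#   import boto3
#   pin_alias_to_new_version(boto3.client("lambda"), "my-mobile-backend")
```

Because the alias refers to a numbered version rather than $LATEST, later edits to the function do not reach the Production endpoint until you publish a new version and move the alias.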
Summary
In this post, we demonstrated how to use Mobile Hub to create a secure serverless back end for your mobile app in minutes using three new features – enhanced Cloud Logic, email and password-based app sign-in, and SAML-based app sign-in. While it was just a few steps for the developer, Mobile Hub performed several underlying steps automatically–provisioning back-end resources, generating a sample app, and configuring IAM roles and sign-in providers–so you can focus your time on the unique value in your app. Get started today with AWS Mobile Hub.
Real World AWS Scalability
This is a guest post from Linda Hedges, Principal SA, High Performance Computing.
—–
One question we often hear is, “How well will my application scale on AWS?” For high performance computing (HPC) workloads that cross multiple nodes, the cluster network is at the heart of scalability concerns.
AWS uses advanced Ethernet networking technology, which, like all things AWS, is designed for scale, security, high availability, and low cost. This network is exceptional and continues to benefit from Amazon’s rapid pace of development. Again and again, customers find that the most demanding applications run very well on AWS!
Many have speculated that highly coupled workloads require a name-brand network fabric to achieve good performance. For most applications, this is simply not the case. As with all clusters, the devil is in the details and some applications benefit from cluster tuning.
This post discusses the scalability of a representative, real-world application and provides a few performance tips for achieving excellent application performance using STAR-CCM+ as an example. For more HPC-specific information, see High Performance Computing.
Computational fluid dynamics at TLG Aerospace
TLG Aerospace, a Seattle-based aerospace engineering services company, runs most of their STAR-CCM+ computational fluid dynamics (CFD) cases on AWS. For a detailed case study describing TLG Aerospace’s experience and the results they achieved, see TLG Aerospace.
This post uses one of their CFD cases as an example to understand AWS scalability. By leveraging Amazon EC2 Spot Instances, which allow customers to purchase unused capacity at significantly reduced rates, TLG Aerospace consistently achieves an 80% cost savings compared to their previous cloud and on-premises HPC cluster options. TLG Aerospace experiences solid value, terrific scale-up, and nearly limitless case throughput—all with no queue wait!
Scale-up
HPC applications such as CFD depend heavily on the application’s ability to scale compute tasks efficiently in parallel across multiple compute resources. Parallel performance is often evaluated by determining an application’s scale-up. Scale-up is a function of the number of processors used and is defined as the time it takes to complete a run on one processor, divided by the time it takes to complete the same run on the number of processors used for the parallel run.

As an example, consider an application with a time to completion, or turn-around time, of 32 hours when run on one processor. If the same application runs in one hour on 32 processors, then the scale-up is 32 hours on 1 processor / 1 hour on 32 processors, which equals 32 for 32 processors. Scaling is considered to be excellent when the scale-up is close to or equal to the number of processors on which the application is run.
If the same application took 8 hours to complete on 32 processors, it would have a scale-up of only 4: 32 (time on one processor) / 8 (time to complete on 32 processors). A scale-up of 4 on 32 processors is considered to be poor.
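The two examples above can be reproduced in a few lines of Python. This is only a sketch of the definition given in this post; the function name `scale_up` is chosen for illustration.

```python
def scale_up(time_on_one_processor, time_on_n_processors):
    """Scale-up: run time on one processor divided by run time on N processors."""
    return time_on_one_processor / time_on_n_processors


# 32 hours on one processor, 1 hour on 32 processors: excellent scaling.
print(scale_up(32, 1))  # 32.0

# 32 hours on one processor, 8 hours on 32 processors: poor scaling.
print(scale_up(32, 8))  # 4.0
```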
Strong scaling vs. weak scaling
In addition to characterizing the scale-up of an application, scalability can be further characterized as “strong” or “weak”. Note that the term “weak”, as used here, does not mean inadequate or bad but is a technical term facilitating the description of the type of scaling that is sought.
Strong scaling offers a traditional view of application scaling, where a problem size is fixed and spread over an increasing number of processors. As more processors are added to the calculation, good strong scaling means that the time to complete the calculation decreases proportionally with increasing processor count.
In comparison, weak scaling does not fix the problem size used in the evaluation, but purposely increases the problem size as the number of processors also increases. The ratio of the problem size to the number of processors on which the case is run is held constant. For a CFD calculation, problem size most often refers to the size of the grid or mesh for a similar configuration.
An application demonstrates good weak scaling when the time to complete the calculation remains constant as the ratio of compute effort to the number of processors is held constant. Weak scaling offers insight into how an application behaves with varying case size.
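One conventional way to put numbers on the two views is sketched below. These efficiency formulas are the commonly used definitions rather than something stated in this post, and the run times in the example calls are hypothetical.

```python
def strong_scaling_efficiency(t1, tn, n):
    """Fixed problem size: the ideal run time on n processors is t1 / n."""
    return t1 / (n * tn)


def weak_scaling_efficiency(t1, tn):
    """Problem size grows in proportion to n: the ideal run time stays at t1."""
    return t1 / tn


# Strong scaling: same case, 32 hours on 1 processor and 1 hour on 32 processors.
print(strong_scaling_efficiency(32, 1, 32))  # 1.0

# Weak scaling: a case n times larger on n processors takes 12.5 hours
# instead of the 10 hours the base case takes on one processor.
print(weak_scaling_efficiency(10, 12.5))  # 0.8
```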
Scale-up as a function of increasing processor count is shown in Figure 1 for the STAR-CCM+ case data provided by TLG Aerospace. This is a demonstration of “strong” scalability. The blue line shows what ideal or perfect scalability looks like. The purple triangles show the actual scale-up for the case as a function of increasing processor count. Excellent scaling is seen to well over 400 processors for this modest-sized 16M cell case, as evidenced by the closeness of these two curves. This example was run on Amazon EC2 c3.8xlarge instances, each based on the Intel E5-2680 processor and providing either 16 cores or 32 threads using Intel Hyper-Threading Technology (HTT).
Figure 1: Strong Scaling Demonstrated for a 16M Cell STAR-CCM+ CFD Calculation
Threads vs. cores
AWS customers can choose to run their applications on either threads or cores. For an application like STAR-CCM+, excellent linear scaling can be seen when using either threads or cores, though we always recommend testing specific cases and applications.
For this example, threads were chosen as the processing basis. Running on threads offered a few percentage points in performance improvement when compared to running the same case on cores. Note that the number of available cores is equal to half of the number of available threads.
Processor counts
The scalability of real-world problems is directly related to the ratio of the compute effort per core to the time required to exchange data across the network. The number of grid cells or mesh size of a CFD case provides a strong indication of how much computational effort is required for a solution. Thus, larger cases scale to even greater processor counts than the modest-sized case discussed here.
STAR-CCM+ has been shown to demonstrate exceptional “weak” scaling on AWS. That’s not shown here, though weak scaling is reflected in Figure 2 by plotting the cells per processor on the horizontal axis. The purple line in Figure 2 shows scale-up as a function of grid cells per processor. The vertical axis for scale-up is on the left-hand side of the graph as indicated by the purple arrow. The green line in Figure 2 shows efficiency as a function of grid cells per processor. The vertical axis for efficiency is shown on the right side of the graph and is indicated with a green arrow. Efficiency is defined as the scale-up divided by the number of processors used in the calculation.
Figure 2: Scale-up and Efficiency as a Function of Cells per Processor
Weak scaling is evidenced by considering the number of grid cells per processor as a measure of compute effort. Holding the grid cells per processor constant while increasing total case size demonstrates weak scaling. Weak scaling is not shown here, because only one CFD case is used.
Efficiency
Fewer grid cells per processor means reduced computational effort per processor. Maintaining efficiency while reducing cells per processor demonstrates the excellent strong scalability of STAR-CCM+ on AWS.
Efficiency remains at about 100% between approximately 250,000 grid cells per thread (or processor) and 100,000 grid cells per thread. Efficiency starts to fall off at about 100,000 grid cells per thread. An efficiency of at least 80% is maintained until 25,000 grid cells per thread. Decreasing grid cells per processor leads to decreased efficiency because the total computational effort per processor is reduced relative to the time spent exchanging data across the network. Note that the perceived ability to achieve more than 100% efficiency (here, at about 150,000 cells per thread) is common in scaling studies, is case-specific, and is often related to smaller effects such as timing variation and memory caching.
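To relate these cells-per-thread figures to cluster size, the arithmetic can be sketched as follows. The sketch assumes the case has exactly 16,000,000 cells (the post says only "16M") and that each c3.8xlarge contributes 32 threads; the function names are illustrative.

```python
import math

TOTAL_CELLS = 16_000_000    # assumption: "16M cell case" taken as exactly 16,000,000
THREADS_PER_INSTANCE = 32   # c3.8xlarge with Hyper-Threading enabled


def threads_for(cells_per_thread, total_cells=TOTAL_CELLS):
    """Number of threads needed to reach a target number of cells per thread."""
    return math.ceil(total_cells / cells_per_thread)


def instances_for(threads, threads_per_instance=THREADS_PER_INSTANCE):
    """Number of whole instances needed to supply that many threads."""
    return math.ceil(threads / threads_per_instance)


def efficiency(scale_up, processors):
    """Efficiency as defined in this post: scale-up divided by processor count."""
    return scale_up / processors


for cells in (250_000, 100_000, 25_000):
    threads = threads_for(cells)
    print(cells, threads, instances_for(threads))
# 250000 64 2
# 100000 160 5
# 25000 640 20

# An efficiency of 80% on 640 threads corresponds to a scale-up of 512.
print(efficiency(512, 640))  # 0.8
```

Under these assumptions, the point where efficiency begins to fall off (100,000 cells per thread) corresponds to about 5 instances, and the 80% efficiency point (25,000 cells per thread) to about 20 instances.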
Turn-around time and cost
Plots of scale-up and efficiency offer an understanding of how a case or application scales. The bottom line, though, is that what really matters to most HPC users is case turn-around time and cost. A plot of turn-around time versus CPU cost for this case is shown in Figure 3. As the number of threads is increased, the total turn-around time decreases. But as the number of threads increases, the inefficiency also increases, which leads to increased costs. The cost shown is based on a typical Spot price for the c3.8xlarge and only includes the computational costs. Small costs are also incurred for data storage. Note that the Spot market price varies from day to day.
Figure 3: Cost per Run Based on Spot Pricing ($0.35 per hour for c3.8xlarge) as a Function of Turn-around Time
Minimum cost and turn-around time were achieved with approximately 100,000 cells per thread. Many users choose a cell count per thread to achieve the lowest possible cost. Others may choose a cell count per thread to achieve the fastest turn-around time.
If a run is desired in 1/3rd the time of the lowest price point, it can be achieved with approximately 25,000 cells per thread. (Note that many users run STAR-CCM+ with significantly fewer cells per thread than this.) While this increases the compute cost, other concerns—such as license costs or schedules—can be overriding factors. For this 16M cell case, the added inefficiency results in an increase in run price from $3 to $4 for computing. Many find the reduced turn-around time well worth the price of the additional instances.
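The increase from $3 to $4 can be sanity-checked with simple arithmetic. The sketch below assumes that compute cost is instances × hours × Spot price with no per-hour rounding, that the case has exactly 16,000,000 cells, and that each c3.8xlarge supplies 32 threads, so 100,000 cells per thread needs 5 instances and 25,000 cells per thread needs 20. The baseline run time is a hypothetical placeholder; only the cost ratio matters.

```python
SPOT_PRICE_PER_INSTANCE_HOUR = 0.35  # from the Figure 3 caption


def run_cost(instances, hours, price=SPOT_PRICE_PER_INSTANCE_HOUR):
    """Compute-only cost of one run, ignoring storage and per-hour billing rounding."""
    return instances * hours * price


baseline_hours = 1.7  # hypothetical run time at the lowest-cost point

lowest_cost_run = run_cost(instances=5, hours=baseline_hours)
fast_run = run_cost(instances=20, hours=baseline_hours / 3)

# Four times the instances for one third of the time costs 4/3 as much.
print(round(fast_run / lowest_cost_run, 2))  # 1.33
```

A ratio of 4/3 is consistent with the run price moving from about $3 to about $4.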
Cluster tuning tips
As with any cluster, good performance requires attention to the details of the cluster setup. While AWS allows for the quick set up and take down of clusters, performance is affected by many of the specifics in that setup. This post provides some examples.
Placement groups
On AWS, a placement group is a grouping of instances within a single Availability Zone that allows for low latency between the instances. Placement groups are recommended for all applications where low latency is a requirement. A placement group was used to achieve the best performance from STAR-CCM+. For more information, see Placement Groups in the Amazon EC2 User Guide for Linux Instances.
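As an illustration, a cluster placement group can be created and used at launch time with the AWS SDK for Python (boto3). This is a minimal sketch, not the setup used for the runs in this post: the helper name and all argument values are hypothetical, the client is passed in so that you supply your own `boto3.client("ec2")`, and a real launch would normally also specify a key pair, subnet, and security groups.

```python
def launch_in_cluster_placement_group(ec2_client, group_name, image_id,
                                      instance_type, count):
    """Create a cluster placement group and launch `count` instances into it.

    ec2_client is expected to behave like boto3.client("ec2").
    Returns the IDs of the launched instances.
    """
    # The "cluster" strategy packs instances close together for low network latency.
    ec2_client.create_placement_group(GroupName=group_name, Strategy="cluster")

    response = ec2_client.run_instances(
        ImageId=image_id,
        InstanceType=instance_type,
        MinCount=count,
        MaxCount=count,
        Placement={"GroupName": group_name},
    )
    return [instance["InstanceId"] for instance in response["Instances"]]


# Example call (needs AWS credentials; the group name and AMI ID are placeholders):
#   import boto3
#   launch_in_cluster_placement_group(
#       boto3.client("ec2"), "cfd-cluster", "ami-xxxxxxxx", "c3.8xlarge", 5)
```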
Amazon Linux OS
Amazon Linux is a version of Linux maintained by Amazon. The distribution is designed to provide a stable, secure, and highly performant environment. Amazon Linux is optimized to run on AWS and offers excellent performance for running HPC applications. For the case presented here, the operating system used was Amazon Linux. Other Linux distributions are also performant. However, we strongly recommend that for Linux HPC applications, you use a minimum of the version 3.10 Linux kernel, to be sure of using the latest Xen libraries. For more information, see Amazon Linux AMI.
Amazon EBS storage
Amazon Elastic Block Store (Amazon EBS) is a persistent, block-level storage device often used for cluster storage on AWS. EBS provides reliable block-level storage volumes that can be attached to (and detached from) an Amazon EC2 instance. A standard EBS General Purpose SSD (gp2) volume is all that is required to meet the needs of STAR-CCM+, and was used for this post. Other HPC applications may require faster I/O to prevent data writes from becoming a bottleneck to turn-around speed, while many others need only the less expensive Throughput Optimized EBS volumes. For these applications, other storage options exist. For more information, see Storage.
Intel Hyper-Threading Technology (HTT)
As mentioned previously, STAR-CCM+, like many other CFD solvers, runs well on both threads and cores. HTT can improve the performance of some MPI applications depending on the application, case, and size of the workload allocated to each thread; it may also slow performance. The one-size-fits-all nature of the static cluster compute environments means that most HPC clusters disable HTT.
Generally, computationally intensive workloads run best on cores, while those that are I/O bound run best on threads. Again, running with threads gave a performance increase of a few percentage points for this case. If there is no time to evaluate the effect of HTT on case performance, then we recommend that HTT be disabled. When HTT is disabled, it is important to bind each process to a designated core, a practice known as processor or CPU affinity. Pinning almost universally improves performance over unpinned processes for computationally intensive workloads.
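For readers unfamiliar with CPU affinity, the idea can be sketched in Python on Linux with `os.sched_setaffinity`. This only illustrates the mechanism for a single process; in practice, MPI launchers usually provide their own binding options, and the helper name `pin_current_process` is chosen for illustration.

```python
import os


def pin_current_process(cpu):
    """Restrict the calling process to one logical CPU (Linux only).

    Returns the affinity set after pinning so the caller can verify it.
    """
    os.sched_setaffinity(0, {cpu})  # pid 0 means "the calling process"
    return os.sched_getaffinity(0)


if hasattr(os, "sched_setaffinity"):
    # Pick a CPU the process is already allowed to run on.
    allowed = sorted(os.sched_getaffinity(0))
    print(pin_current_process(allowed[0]))
```

Once pinned, the scheduler no longer migrates the process between CPUs, which helps keep its working data in that core's caches.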
Time Stamp Counter
Occasionally, an application includes frequent time measurement in the code; perhaps this is done for performance tuning. Under these circumstances, performance can be improved by setting the clock source to the TSC (Time Stamp Counter). This tuning was not required for this application but is mentioned here for completeness.
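On Linux, the active clock source is exposed through sysfs, so you can check it before deciding whether a change is needed. The following is a read-only sketch; the helper names are illustrative, and the `base` parameter exists only so the functions can be pointed at another directory for testing.

```python
from pathlib import Path

# Standard sysfs location of the kernel clock source settings on Linux.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksource(base=CLOCKSOURCE_DIR):
    """Return (current clock source, list of available clock sources)."""
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


def tsc_status(base=CLOCKSOURCE_DIR):
    """Describe whether TSC is in use or could be selected."""
    current, available = read_clocksource(base)
    if current == "tsc":
        return "already using tsc"
    if "tsc" in available:
        return f"using {current}; tsc is available"
    return f"using {current}; tsc is not offered by this kernel"


if CLOCKSOURCE_DIR.exists():
    print(tsc_status())

# Switching is done as root by writing "tsc" to current_clocksource.
# The change does not persist across reboots.
```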
Summary
When you evaluate an application, we recommend using a meaningful, real world use case. A case that is too large or small won’t reflect the performance and scalability achievable in everyday operation. The only way you’ll know positively how an application will perform on AWS is to try it!
AWS offers solid strong scaling and exceptional weak scaling. Excellent performance can be achieved on AWS for most applications. In addition to low cost and quick turn-around time, important considerations for HPC also include throughput and availability. AWS offers nearly limitless throughput, security, cost savings, and high availability, making queues a “thing of the past”. A long queue wait makes for a long case turn-around time, regardless of the scale.
If you have questions or suggestions, please comment below.

