AWS Database Blog

Migrating Microsoft SQL Server Enterprise Workloads to Amazon RDS: Part 1

by Prudhvi Janga | on 07 NOV 2016 | in Migration, RDS SQL Server | Permalink | Comments

Prudhvi Janga is a software development engineer in Amazon Web Services

SQL Server Native Backups and Restores Using RDS and Amazon S3

If you’re thinking about moving your on-premises Microsoft SQL Server databases to Amazon RDS, one option to migrate your databases is to use native SQL Server backups. Backups can be uploaded to AWS using Amazon Simple Storage Service (Amazon S3). These backups can also be encrypted using the AWS Key Management Service (AWS KMS). Once the backup has been uploaded to Amazon S3, you can then restore your databases onto an RDS SQL Server instance. We recommend this option for bulk migration if having only a minimum downtime is not a critical requirement for you. You can also use this feature to:

Move databases between Amazon RDS SQL Server DB instances.
Back up and restore individual databases, as opposed to entire RDS instances.
Store and transfer backups in and out of RDS while preserving complete compatibility with native SQL Server installations.
Move database backups to RDS to use as a disaster recovery solution for on-site SQL Server installations.

Setting Up
RDS offers option groups to enable certain features on an RDS instance. You can associate an option group with multiple options. Each option typically represents one feature. Depending on the feature, the option might have additional settings to be configured.

To enable native backups on your RDS instances, you create an option group, add the SQLSERVER_BACKUP_RESTORE option, and assign it to an RDS instance. The steps following show how.

Choose Option Groups on the navigation pane in the RDS dashboard in the AWS Management Console.
Choose Create Group to create a new option group.

Note: You can also select one of your existing groups and add the option. However, the default option groups cannot be modified. The default groups start with the word default and can be easily identified. So, unless you already have an option group that you have manually created in the past, you will have to create a new one.
Type the option group name, description, and other details, and choose Create.

Note: Option groups are specific to particular RDS engines, engine editions, and major engine versions. If you have multiple RDS instances running on different editions and versions of SQL Server, you’ll have to create an option group for each edition and version combination.
Select the option group that you created in the previous step and choose Add Option.
In the Add Option section, fill in the details.
Typically in an on-premises setting, DBAs back up databases to disk and move the backup to external storage by using a file transfer mechanism. However, backups made by using this feature in RDS are directly uploaded to S3. So, RDS needs permissions to access your S3 bucket to upload and download database backups.To provide this access to RDS, you can do one of two things. You can provide your own AWS Identity and Access Management (IAM) role—this approach comes in handy when you already have an IAM role that contains a bucket where you want RDS to place backups or retrieve backups from. Or you can let RDS set up an IAM role for you in the background by providing the necessary information. If you let RDS set up the IAM role:
- You can specify an existing S3 bucket or create a new one.
- An S3 prefix is optional. We recommend it only in cases where you want to restrict access to a shared bucket to the files in folders with that specified prefix.
- You can enable encryption, to encrypt and decrypt the backups using KMS. If you enable encryption, you’ll need to specify a KMS key to encrypt or decrypt backups. We will cover this process in a separate blog post.
Once the option group creation succeeds, you can see the IAM role that was created in the option groups’ screen. You can also edit it by going to the IAM service dashboard.
You can assign the new option group to an existing or a new RDS instance.To assign the option group to an existing RDS instance, navigate to the RDS dashboard, choose Instances, and then select the instance that you want to associate the option group with. For Instance Actions, choose Modify. You can modify the option group associated with the instance under Database Options:Note: After you have associated the instance with the option group, it will take anywhere from 10 to 15 minutes for the modifications to complete. When the instance returns to available status, you can start using the feature related to that option group with your RDS instance.

Troubleshooting Errors during Setup
For possible errors during setup, some recommendations follow.

Error

This error occurs when RDS can’t generate temporary credentials to access an S3 bucket using the IAM role specified. In this case, it’s likely that the IAM role provided doesn’t give RDS access to the S3 bucket. To fix this issue, you can modify the role and provide RDS access (rds.amazonaws.com) under the trust relationship tab of the IAM role.

IAMRole

This error occurs when the AWS console is unable to retrieve existing IAM roles, which happens if IAM service is unavailable temporarily. To address this issue, try again later.

Backing Up or Restoring a Database
To perform backups and restores, RDS provides four stored procedures in the MSDB system database. The master user can run or grant access to others to run these stored procedures. Following are the stored procedures and their details.

Backup Task Stored Procedure

Backup
Restore Task Stored Procedure

Restore
Task Status Stored Procedure

TaskStatus
Task Cancellation Stored Procedure

TaskCancel
After you issue a backup or restore task by using one of these stored procedures, RDS outputs task information as shown following. You can also retrieve this information later by running the rds_task_status stored procedure.

TaskProc TaskProc1

The following table gives some details about these columns.

TaskImg

Troubleshooting Backup and Restore Errors
When you run the stored procedures, you might encounter some validation errors. These errors are displayed immediately, and a task will not be created if you encounter these errors. You can find common validation errors and suggested fixes following—here, assume your database is named AdventureWorks.

Error	Suggestion
Cannot find database AdventureWorks.	This error shows up if you try to back up a database that doesn’t exist on the RDS instance.
Database AdventureWorks is not online.	If the database you are trying to back up is not online, you’ll see this error. Bring the database online and try again.
Parameter @source_db_name cannot be empty, please provide database name to backup.	If the source database name supplied to backup is empty (only contains spaces), you will see this error. Correct the name and try again.
Database backups can only be performed by members of db_owner or db_backupoperator roles in source database.	If you are not the database owner or don’t have a backup operator role for the source database you’re trying to back up, you will see this error. Get yourself access as db_owner or db_backupoperator in the database and try again.
Database backup/restore option is not enabled or is in the process of being enabled. Please try again later.	If the initial option group setup encountered an error or is incomplete, you will see this error. To correct the issue and try again, check instructions in Setting Up earlier in this blog post.
S3 object ARN cannot be empty.	Because the storage used by RDS for backups is S3, you will need to specify a valid S3 Amazon Resource Name (ARN). Find more about resource ARNs in Amazon Resource Names (ARNs) and AWS Service Namespaces in the AWS General Reference. The S3 ARN should include the bucket name, an optional prefix, and either the name of the backup file you want created or the name of a file to restore from: `arn:aws:s3:region:account-id: bucket_name/optional_prefix/file_name.extension`
Parameter @overwrite_S3_backup_file must be either 0 or 1.	If you have an existing backup file in S3 and you don’t specify the @overwrite_S3_backup_file parameter, you will get an error. This parameter is used to specify whether you want an existing backup file to be overridden with the latest backup or not. Here, 0 indicates that you don’t want to overwrite the file present in S3 if one already exists, and 1 indicates otherwise.
Cannot backup system databases or rdsadmin.	Only customer databases can be backed up. Correct the source database name and try again.
A task has already been issued for database: AdventureWorks with task Id: 1, please try again later.	If a backup task is already in progress on the source database specified, you will see this error. You can try again after the current task is complete.
Database restores can only be performed by database logins with [CREATE ANY DATABASE] permissions.	Only logins with CREATE ANY DATABASE privileges on the RDS instance can perform restores. Correct the permissions and try again.
No database name specified to restore.	To restore a backup, you have to specify a non-empty database name. Correct the target database name and try again.
Database name cannot start or end with a space.	Remove the starting or trailing spaces from the database name and try again.
Database AdventureWorks already exists. Cannot restore database with the same name.	You cannot restore a backup to an existing database or restore differential backups or transaction logs using this feature.
Cannot find a CREATED or an IN_PROGRESS task: 1 to cancel.	Only tasks that are in CREATED or IN_PROGRESS lifecycles can be canceled. The same error will also show up if the task ID you are looking for doesn’t exist.
Could not find the specified task. Execute without any parameters to show all tasks.	This error appears when you run rds_task_status with an invalid database name or task ID. You can run the stored procedure without any parameters if you don’t have the task ID handy.
Error executing procedure. Please reconnect to the RDS endpoint and retry.	You’ll usually see this error on mirrored instances if you are connected to a mirror after a failover. Reconnect to the endpoint and try again.

Sometimes, the backup or restore task is created successfully but you encounter errors afterward and the task errors out. You can find the related error information in the task_info column that is displayed by running the rds_task_status stored procedure. Following, you can find some common errors and how you can correct them.

Error	Suggestion
Access Denied.	This error comes up when the IAM role setup in step 5 of Setting Up doesn’t provide sufficient access to RDS to access the bucket list or the bucket itself. Make sure that the IAM role is still associated with the option group and RDS instance and has sufficient privileges.
Task was initiated on server name: EC2AMAZ-ABCDEFG, current server name: EC2AMAZ-MNOPQRS. Cannot complete the task, please try again.	Usually you’ll see this error when a task is in progress on a mirrored instance and a failover happens. The new principal or primary does not know about the task in progress. This error can also appear when the task is in progress and the underlying EC2 machine has been replaced with a different one. In either case, reissue the task to solve the error.
Could not parse the amazon resource name string provided: `arn:aws:s3:::bucket_name`	This error means that the S3 ARN you have provided isn’t in the correct format. Correct the ARN and reissue the task. Find more about resource ARN format in Amazon Resource Names (ARNs) and AWS Service Namespaces in the AWS General Reference.
Unable to retrieve temporary credentials for: SQLSERVER_BACKUP_RESTORE	This error means that RDS couldn’t generate temporary credentials to access S3. Make sure the IAM role involved is still associated with the option group and the RDS instance and has sufficient privileges.
The S3 bucket location: US-EAST-1 does not match the RDS instance region: US-WEST-2. Please specify a bucket that is in the same region as RDS instance.	Currently, RDS only supports backups and restores to and from S3 in the same AWS Region as the RDS instance. Cross-region backup and restore is not available. However, you can use S3 cross-region replication to copy the backup from a bucket in one region to a bucket in another region. Find more on cross-region replication Cross-Region Replication in the Amazon S3 Developer Guide.
RESTORE_DB task is unable to locate the backup file in S3. Try again and specify an existing backup file.	A restore database task requires that a backup file exists in S3. If the task cannot find the backup file, this error is thrown. Correct the issue and try again.
BACKUP_DB task found the backup file already present in S3. Rename the existing file, or use the overwrite parameter to overwrite the existing file.	If a backup file already exists on S3 but the overwrite option was not specified for rds_backup_database, you will see this error. To overwrite the file on S3, set @overwrite_S3_backup_file to 1.
The KMS master key region: US-EAST-1 does not match the RDS instance region: US-WEST-2. Please specify a key that is in the same region as RDS instance.	If you see this error, the KMS key supplied to encrypt or decrypt the backup file is not in the same region as the RDS instance. Supply a KMS key that is present in the same region.
KMS master key is not enabled. Please correct the issue and try again.	If you see this error, the specified KMS key is not enabled to encrypt or decrypt the backup file. Enable the KMS key or use a different KMS key.
Please provide KMS ARN with a region name.	A KMS key requires a region name to be specified. Correct the key’s ARN and reissue the task. Find more about resource ARNs in Amazon Resource Names (ARNs) and AWS Service Namespaces in the AWS General Reference.
No encrypted envelope key present. Unable to generate S3 object metadata.	If the backup file does not contain the encryption metadata (the envelope key) required for decryption, you will get this error. Make sure you are using an encrypted backup to restore from.
No initialization vector present. Unable to generate S3 object metadata.	If the backup file does not contain the encryption metadata (the initialization vector) required for decryption, you will get this error. Make sure you are using an encrypted backup to restore from.
Object metadata (x-amz-key) incomplete, cannot decrypt the given S3 object.	If the backup file does not contain the encryption metadata (the envelope key) required for decryption, you will get this error. Make sure you are using an encrypted backup to restore from.
Object metadata (x-amz-iv) incomplete, cannot decrypt the given S3 object.	If the backup file does not contain the encryption metadata (the initialization vector) required for decryption, you will get this error. Make sure you are using an encrypted backup to restore from.
Cannot generate S3 chunks that are greater than 524288000 bytes to upload to S3.	If the task errors out with this message, the size of the backup for the specified database is greater than the currently supported maximum size of 1 TB.
There is not enough space on the disk to perform restore database operation.	Before a restore starts, RDS makes an estimate about how much disk space will be required by the database that is going to be restored. If that estimate exceeds the amount of free space left on the disk, you will see this error. You can restore the database to a different RDS DB instance that has enough disk space.
Database AdventureWorks cannot be restored because there is already an existing database with the same file_guids on the instance. Database AdventureWorks cannot be restored because there is already an existing database with the same family_guid on the instance.	RDS does not allow restoring a copy of the same database onto the same DB instance. A copy of the same database is defined as one for which the family_guid or file_guids values are the same. For example, if there are two full database backups in S3 for AdventureWorks called AdventureWorks_2015 and AdventureWorks_2016, you can only restore one of them to one RDS DB instance. If you run into this problem, you can fix it two ways: Restore the copies to two different RDS instances. On premises, make a new database and copy the schema and data from one of the database copies. For example, in SQL Server Management Studio (SSMS) you can use the import/export wizard. Then create a backup of the new database before you restore it to the same RDS instance.
THE IN-PROGRESS TASK COULD NOT BE RETRIEVED. RE-ISSUING THE TASK.	Sometimes, if the backup or restore task encounters an error it can’t recover from, the task is reissued on customer’s behalf by RDS automagically.
Aborted the task because of a task failure or an overlap with your preferred backup window for RDS automated backup.	This generic error message appears whenever a backup task encounters any of the error messages preceding. If this is the only error message in task_info, the task has been aborted because of a conflict with RDS automated backups. To avoid this conflict, you can temporarily turn off automated backups and either reissue your backup or issue your backup outside the backup window.
Unable to retrieve family guid for database name: AdventureWorks	You might see this error when the database you have restored contains a large log file and SQL Server is performing crash recovery. Shrink the log file on premises before creating a backup, uploading it to S3, and issuing the restore task.
Unable to retrieve reserved database size.	If RDS can’t retrieve the database size from the database on the RDS DB instance to estimate the backup size, you will see this error. This error can happen if the database is not online. Bring the database online and try again.
Empty restore file list result retrieved.	If a restore task cannot retrieve the file list information from the database backup, you will see this message. Make sure the backup file in S3 isn’t corrupt.
The specified bucket does not exist	This error means that the bucket name in the S3 ARN does not exist. Make sure the bucket exists and is associated with the IAM role, and try again.
Cannot find server certificate with thumbprint ‘0xE82F1BCC5F20DMNO24334331’.	Backups encrypted with Transparent Data Encryption (TDE) aren’t supported by backup and restore. As an alternative to TDE, you can encrypt the backup using a KMS key on-premises and restore the encrypted backup onto an RDS DB instance. Find more info about encrypting a file using KMS in Protecting Data Using Client-Side Encryption in the Amazon S3 Developer Guide. Another option is to just use server-side encryption. However, client-side encryption is the preferred approach.
The database was backed up on a server running version 13.00.1601. That version is incompatible with this server, which is running version 12.00.4422.	RDS does not support SQL Server 2016 yet, so if you try to restore a backup from SQL Server 2016 on-premises it fails to restore on RDS.

In addition to these errors, you might receive errors from SQL Server if there are problems performing a backup or restore. For example, you might receive errors when restoring from a backup file that is not backward compatible, restoring a database greater than 10 GB on a SQL Server express instance, or restoring when the backup is not correctly formed, to name a few.

Stay tuned for our next blog post on migrating enterprise workloads using AWS Data Migration Service (AWS DMS).

Implement an Early Warning System for Transaction ID Wraparound in Amazon RDS for PostgreSQL

by Shawn McCoy | on 31 OCT 2016 | in RDS PostgreSQL | Permalink | Comments

Shawn McCoy is a database engineer in Amazon Web Services.

As a PostgreSQL database owner, you know that critical maintenance operations are handled by the autovacuum process. The effectiveness of this background process is constantly being improved with each release. For most PostgreSQL users, the default settings will work just fine. However, some workloads and usage patterns require customized parameters. I’ll cover two areas in this post: First, what you can do to monitor your Amazon RDS for PostgreSQL database’s transaction ID health, and second, common things I’ve helped customers work through once a problem is identified.

What is “transaction ID” health?
A PostgreSQL database can have two billion “in-flight” unvacuumed transactions before PostgreSQL takes dramatic action to avoid data loss. If the number of unvacuumed transactions reaches (2^31 - 10,000,000), the log starts warning that vacuuming is needed. If the number of unvacuumed transactions reaches (2^31 - 1,000,000), PostgreSQL sets the database to read-only mode and requires an offline, single-user, standalone vacuum. This vacuum requires multiple hours or days of downtime (depending on database size). A very detailed explanation of transaction ID wraparound is found in the PostgreSQL documentation.

Early warning is the key
We now know that around 2.1 billion unvacuumed transactions is “where the world ends” to quote the source code comment. There are several reasons a transaction can become this old, but let’s first look at how you can be alerted of this pending problem. Although it can be monitored with manual queries, we have introduced an Amazon CloudWatch metric, MaximumUsedTransactionIDs, to make setting this kind of alert easier for you. To generate this metric, the Amazon RDS agent runs this query:

SELECT max(age(datfrozenxid)) FROM pg_database;

From my experience working with customers, I recommend an alarm when this metric reaches a value of 1 billion. For some situations, a low severity warning at 500 million might also be useful. Let’s take a look at configuring a CloudWatch alarm, as described in the CloudWatch documentation.

First, choose CloudWatch in AWS Management Console:

AWSConsole

Then choose Alarms on the navigation pane:

Alarms

Choose Create Alarm to create a new alarm:

CreateAlarm

Select the MaximumUsedTransactionIDs metric and press Return to bring up a list of your instances. For this example, I’m going to just select one of them. If you have several instances, you can group several together and get a single alarm.

CreateAlarmI

After selecting your target instance, choose Next to set up the threshold:

Then set up the threshold (1000000000) and notification email:

DefineAlarm

In my example, I’ve already exceeded the threshold, so I get an email with this wording:

AlarmDetail

This message will also show up in the CloudWatch dashboard:

Metric

Setting this threshold to 1 billion should give you plenty of time to investigate the problem. It really depends on the problem’s cause how much time it will take to fix. The default autovacuum_freeze_max_age value is 200 million. If the age of the oldest transaction is 1 billion, autovacuum is having a problem keeping this threshold at the target of 200 million. It’s a luxury to have extra time to fix these issues. There is nothing more frustrating than trying to fix this type of problem while keeping the system under normal workloads.

The monitor has sent an alarm—now what?
Let’s say that the monitor has sent the alarm email and you know that there is a problem. I’ll walk through the steps that I take in identifying solutions to the most common problems I’ve helped customers resolve. One thing I like to do after I have identified the problem is to execute a verbose vacuum command and see how long the action takes. This approach can help me get out of immediate trouble and plan my long-term fix.

The three things to identify are:

Where autovacuum is currently working.
Which database is aging.
Which tables are aging.

With this information, you can start to diagnose what is happening.

What is autovacuum doing?
To find out what autovacuum is doing, you can take the following action. This action requires that you are running RDS Postgres 9.3.12 or later, 9.4.7 or later, or 9.5.2 or later to have full visibility into the running rdsadmin processes. This query will show the information we want:

SELECT datname, usename, pid, waiting, current_timestamp - xact_start
AS xact_runtime, query
FROM pg_stat_activity WHERE upper(query) like '%VACUUM%' ORDER BY
xact_start;

Which databases are aging?
To find out which databases are aging, the following query will tell us where to start work:

SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC LIMIT 20;

Which tables are aging?
To find out which tables are aging, connect to the first database indicated in the result from the preceding query. This database will have the table containing the oldest transaction, and this query will show the top 20 oldest tables and their size. Most likely, these tables are the ones you can observe autovacuum working on in the first query.

SELECT c.oid::regclass as table_name,
greatest(age(c.relfrozenxid),age(t.relfrozenxid)) as age,
pg_size_pretty(pg_table_size(c.oid)) as table_size
FROM pg_class c
LEFT JOIN pg_class t ON c.reltoastrelid = t.oid
WHERE c.relkind = 'r'
ORDER BY 2 DESC LIMIT 20;

Let’s look at two different scenarios based on the output of these queries.

The case of long-running autovacuum sessions
In our first example, there are sessions that have been running for a long time. This situation usually indicates maintenance_work_mem is set too small for the size of the table.

Query

These sessions will most likely never complete. These are two common causes of this scenario:

The top problem I have seen is that maintenance_work_mem is too small for the size of the larger tables. This critical parameter determines how much tuple information the autovacuum worker can hold in memory while it’s processing tables. If this parameter is too small for a table, it will require multiple passes on the table to complete. For more information, see RDS recommendations for working with the maintenance_work_mem.
If you have adjusted your maintenance_work_mem parameter and the workers are still not keeping up, you might need to adjust the individual parameter setting for the larger tables. The benefit of this approach is that you can allow autovacuum to remain less aggressive on your entire system and just target larger tables. With the default parameters, autovacuum will take a nap of autovacuum_vacuum_cost_delay (20 milliseconds) every time the internal counter hits auto_vacuum_cost_limit. For a large table, it can be very beneficial to have autovacuum take no “breaks” at the expense of additional resource consumption. To make sure that’s the case, use this setting:
```
ALTER TABLE mytable SET (autovacuum_vacuum_cost_delay=0);
```
For more information, see RDS recommendations for working with the other autovacuum parameters and how you set them.

The case where autovacuum sessions seem to complete, but overall fall behind
The scenario where autovacuum sessions seem to complete but really fall behind can be a little trickier to fix. By default, autovacuum_max_workers is set to 3. This setting is usually fine. However, if you have an extremely high number of tables (30,000 or more), three workers can be overwhelmed. Recall the query above to check what autovacuum is doing. The results here seem to indicate things are okay:

DataName

This result actually looks normal. The autovacuum sessions haven’t been running long, and all three are working on tables. But we received an alarm indicating our database age has passed our threshold of 1 billion. So what is the problem?

If we look at the query that checks the age of databases and tables, we see that there is something preventing autovacuum from keeping up:

Select

Select1

At this point, I check exactly how many tables I have for autovacuum to work through, because these queries don’t indicate any problem other than that there are still enough tables aging to trigger my alarm.

SELECT count(*) FROM pg_tables;

Comamnd3
Schema design questions aside, there must be too many tables for our default three workers to traverse. The typical fix is to increase autovacuum_max_workers. Two things to note are that the cost parameter auto_vacuum_cost_limit is evenly distributed among active workers and that each worker can consume up to maintenance_work_mem. In this case, I recommend the following settings to see if progress is made and the CloudWatch graph shows a downward trend:

autovacuum_max_workers = 6
auto_vacuum_cost_limit = 1500
maintenance_work_mem = RDS default

I also recommend taking a look at the schema to see if improvements can be made in the design and lessen the overall number of tables.

Other scenarios
This post identifies a process to monitor this critical PostgreSQL metric, gives steps to start troubleshooting the problem, and describes a couple of common scenarios I have worked with RDS customers to fix. Other situations can arise, but I hope that this post helps you start the troubleshooting process.

In conjunction with your CloudWatch metric, it’s also useful to enable autovacuum logging so you can see where time is being spent. However, it might be beneficial to execute a manual “vacuum freeze” on a table that is approaching that 2 billion “age” that we talked about at the beginning of this post. It’s useful to see how long the operation takes. When using the verbose keyword, you get a better sense at what is happening—for example, if you don’t need those indexes that cause this operation to take additional time, it’s a good time to get rid of them! Here the RDS documentation outlines the steps to execute a manual verbose vacuum.

You can check out more information about autovacuum in Working with PostgreSQL Autovacuum on Amazon RDS in the RDS documentation. If you have any questions, please let us know in the comments following.

Introducing AWS Schema Conversion Tool Version 1.0.500

by Eran Schitzer | on 27 OCT 2016 | in Schema Conversion Tool (SCT) | Permalink | Comments

Eran Schitzer is a product manager in Amazon Web Services.

Today we are pleased to announce the release of version 1.0.500 of the AWS Schema Conversion Tool (SCT).

Since version 1.0.100 was released at re:Invent last year, the SCT has helped customers convert their schema and database objects from commercial engines to Amazon Relational Database Services (RDS) open-source, engine-based databases. Because customer obsession is one of the key leadership principles at Amazon, we’ve used feedback from our customers to add many new and exciting capabilities to SCT version 1.0.500.

Some of those capabilities include:

Copy your on-premises schema to an Amazon RDS engine: You can now use SCT to copy your existing on-premises database schema to an Amazon RDS database instance running the same engine. This feature can help you analyze potential cost savings of moving to the cloud and changing your license type.
An SCT assessment report provides statistics on the conversion of schema from the source to the target database. A sample SCT assessment report looks like this:

Figure 1: SCT assessment report showing conversion statistics for copying SQL Server to Amazon RDS for SQL Server
For more information, see Creating and Using the Assessment Report.
Create mapping rules: Before you use SCT to convert your schema, you can now set up rules that move objects from one schema to another and change the names of objects. For more information, see Creating Mapping Rules in the AWS Schema Conversion Tool.
Export a JSON file to AWS DMS: After converting your schema to your preferred database engine, you can now generate a JSON file, which will be used by AWS Data Migration Service. The JSON file contains the list of objects to be migrated by AWS DMS and any transformations on those objects.

For more information about version 1.0.500, see AWS Schema Conversion Tool User Guide. To download SCT, go to https://aws.amazon.com/dms/#sct.

Caching Hudl’s News Feed with Amazon ElastiCache for Redis

by Darin Briskman | on 24 OCT 2016 | in ElastiCache, Redis | Permalink | Comments

Darin Briskman (@briskmad) is a developer evangelist for Amazon Web Services.

We hope you enjoy this guest post from Joel Hensley, Engineering Director for the Community Tribe at Hudl, a leading software company that is revolutionizing the way coaches and athletes prepare for and stay ahead of the competition. You can reach Joel @fusbal.

Hudl offers coaches and athletes the tools to edit and share video, study associated play diagrams, and create quality highlight reels for entertainment and recruiting purposes. When football season starts in August, coaches and athletes from all over the country sign in to Hudl every day. After users sign in, they land on a tailored news feed like the one you see below. It includes team content and accounts the user is following. Because this page is our users’ first impression of Hudl, its performance is critical. That’s why we cache the news feed by using Amazon ElastiCache for Redis.

Hudl

Before we discuss the details, it’s important to understand our data model. There are six main collections used by the feed:

Let’s consider two users: Sally and Pete. Sally decides to follow Pete. At Hudl, this means Sally is a follower of Pete and Pete is a friend of Sally. When Pete posts something, that content is added to his user timeline and to Sally’s home timeline. When Sally signs in to view her feed, the content on her home timeline is displayed in reverse chronological order. If she clicks on Pete, she sees his user timeline and can view all of his posts.
Each time a user hits his/her feed, we grab a batch of post IDs from the user’s home timeline, fetch those posts, and load all of the users referenced in those posts. Since the feed’s creation in April 2015, the database has grown quickly. The total size is now up to 120 GB. We currently have 18 million follower relationships and 30 million pieces of content. So where does caching come into play?

Redis
In the world of caching, there are two primary options: Memcached and Redis. For a long time at Hudl, our default option was Memcached, but with the introduction of our news feed, we decided to explore the Redis data structures. We’re really excited by what we found:

Lists
This alone would have been reason enough to use Redis! Timelines are stored as lists, and being able to represent them that way in cache is amazing. As posts are added to timelines, we simply do an LPUSH (add to the front) followed by an LTRIM (used to cap the list at a max size). The best part is that we don’t have to invalidate the cache as posts are added because it’s kept in sync with the database.

Hashes
Displaying the number of followers and friends for a user is a critical component of any feed. By storing these as fields in a hash for each user, we can quickly call HINCRBY to keep the values in sync with the database. No need to invalidate the cache whenever a follow or unfollow occurs.

Sets
We love to use RabbitMQ to retry failed operations. Sets are the perfect way to guarantee we won’t accidentally insert the same post on a user’s timeline more than once. There’s no need for an extra database call. We use the post ID as the cache key, each user ID as the member, and then call SISMEMBER and SADD.

Amazon ElastiCache
We needed to spin up and configure a server so we could start testing Redis. Because we love AWS, we decided to try Amazon ElastiCache for Redis. Within minutes, our first test node was up and running Redis. We connected to it through the StackExchange.Redis C# driver.
By using ElastiCache, we were able to easily configure our Redis deployment, node size, and security groups. We were also able to use Amazon CloudWatch to monitor key metrics. We were able to create separate test and production clusters without having to wait for an infrastructure engineer to manually set up and configure the servers.
Here are the details of our production cluster:

Node type: cache.r3.4xlarge (118 GB)
Replication enabled
Multi-AZ
2 read replicas
Launched in VPC

Our final deployment step was to configure alerts through Stackdriver, which integrates seamlessly with ElastiCache.
We were most interested in these three metrics:

Current connections: If these drop to 0, our web servers can no longer access the cache and require immediate attention.
Average bytes used for cache percentage: If this reaches 95% or more, we might need to move to a larger node type or lower our expiration times.
Swap usage: If this reaches 1 GB or more, the Redis server requires immediate attention.

Results
Since its launch more than a year ago, we’ve been very happy with the performance of the news feed. This year, during the week of September 5, 1.2 million unique users accessed their feeds. The feed service averaged 300 requests per second, with a peak of 800.
Here are some quick ElastiCache stats from that week:

Total cached items: 21 million
Cache hits: 175K/min (average), 350K/min (peak)
Network in: 43 MB/min (average), 101 MB/min (peak)
Network out: 600 MB/min (average), 1.25 GB/min (peak)

CachedHits

Let’s take a look at two calls in the feed service:

Get Timeline
This operation just fetches the timeline list from Redis. There are no dependencies.

Hydrate Timeline
This call takes the post IDs in the timeline and loads all referenced users and posts. This includes time spent loading records from the MongoDB database if they are not cached and then caching them. This is the primary call used when loading the feed on the web and on our iOS and Android apps.

Timeline

Based on the success of the news feed, ElastiCache for Redis is quickly becoming our default option for caching. In the last year, five other Hudl services have made the switch from Memcached to Redis. It’s easy to set up, offers blazing-fast performance, and gives users all of the benefits that Redis has to offer. If you haven’t already, we strongly recommend trying Amazon ElastiCache for Redis. Please let us know how it works for you.

For more information, see these sources:
Hudl: http://www.hudl.com/
Hudl Bits blog: http://www.hudl.com/bits
What is Redis? https://aws.amazon.com/elasticache/what-is-redis/
Amazon ElastiCache for Redis: https://aws.amazon.com/elasticache/redis/

Managing IoT and Time Series Data with Amazon ElastiCache for Redis

by Michael Labib | on 17 OCT 2016 | in ElastiCache, IoT, Redis | Permalink | Comments

Michael Labib, Specialist Solutions Architect, In-Memory NoSQL

You can use a number of AWS services to address the challenges associated with ingesting and analyzing time series data. In this blog post, I focus on IoT sensor data, but the principles and techniques described here can be applied to other data sources.

What Is Time Series Data?
A time series is simply a series of observations captured over a period of time. These observations can be stock quotes, application logs, clickstream data, Twitter feeds, device sensor data, or any other time-stamped data source.

When collected over a period of time, the data points tell a story about the behaviors, patterns, and trends occurring in the observed systems; this data can be used to improve an experience or process. The more data points captured, the richer the information gathered from that data. Accelerating the data analysis and ingestion process allows you to react more quickly to the information.

For example, let’s say you are gathering information from social networks in order to data mine what people are discussing. The sooner you’re able to process that streaming data, the sooner you’ll be able to target potential customers with ads or campaigns and improve your product’s position in the market. The same is true with log data. The sooner you’re able to determine the health of your applications, the sooner you can proactively fix faults in your systems and prevent disruption to your customers.

Storing those data points over time allows you to observe historical trends so you can predict and plan for future events.

In this post, we’ll design an IoT solution that captures sensor information and persists it in Amazon ElastiCache for Redis.

Use Case Requirements
Our use case is to gather sensor information from a consumer thermostat in order to understand the environments in which our customers will use them. We also want to identify, through information gathered by the sensors, any potential issues related to extreme temperature conditions.

We’ll first build a proof of concept to process and analyze the sensor information. Then, we’ll change our architecture to store historical information for additional reporting.

These are the requirements for our solution:

Velocity: Our solution needs to be able to support hundreds of thousands of requests per second. We’ll be capturing device sensor data that includes information like temperature, humidity, device IP addresses, and other metadata.
Volume: Although we are primarily concerned with the most active information, in the future we’ll want to store historical data. For this reason, our solution should be extensible.
Variety: We want a flexible data model so our solution can adapt to changing requirements.
Order: For accurate reporting and trend analysis, the solution must be able to maintain the order of the data that is ingested.
Cost and agility: We want to go to market quickly while being cost-conscious. We don’t want to worry about patching, racking and stacking, and so on.
High availability and scaling: The solution must be highly available and able to scale to support additional devices and load. We also want to be able to apply real-time processing of data for analytics.
Security: We need to be able maintain end-to-end security to protect the data we are capturing.

Architecture

Let’s use AWS services to build an architecture:

Arch

These services offer layered security, ease of use, scalability, high availability, and zero administration.

Data ingestion: AWS IoT automatically scales to support billions of devices and trillions of messages. It also allows you to easily process and route messages to other AWS services.
Data processing: AWS Lambda automatically scales your application by running your function in response to a trigger. Just write code and upload to AWS Lambda. It’s completely serverless. When integrated with AWS IoT, it can also act as a continuous event data processor.
Data storage: Amazon ElastiCache for Redis is a lightning-fast, in-memory data store that supports extremely high requests rates and data ingestion. It provides flexible scaling and advanced data structures that can speed development and reduce computational complexity. There is also no cost for requests to the service or for amounts of throughput. You pay only for the instances you provision, which makes it ideal to query against.

Data Model

This is the data model we’ll be using to capture active sensor data.

DataModel

A Redis Sorted Set is used to capture and maintain the order of the events (in this case, sensor readings). The key for this data structure is SensorData. This set will maintain the order by using the timestamp value, which acts as the score. The date of the event is converted to a LONG value, which is sorted in reverse order to display the events starting from the newest. We are also storing the deviceId as the value for this set.

Alternatively, we could use a unique ID such as an event ID or request ID to capture the data for a historical view. (This true for the hash key, too.)

A Redis Hash is used to capture the temperature sensor data. An individual Hash is created for each deviceId, which is also the key for the Hash. The deviceId value is stored in the SensorData sorted set. When used together, the sorted set and deviceId provide the time the event occurred and the data associated with the device.

Redis Strings allow you to increment the interpreted integer of a string value for a given key. Our consumer thermostats will be used in various climates, so we’ll increment each individual counter when the temperature reading falls within a given range. These counters will provide analytics in the case we are interested in totals.

Now let’s build our solution!
Step 1: Create a Redis cluster and Redis client on EC2

For simplicity, we have created an AWS CloudFormation template for you to use.

Download the sensor-demo-redis.json template. Open the AWS CloudFormation console and run the template.

The sensor-demo-redis.json template will create a Redis cluster and an EC2 instance. The redis-cli and the Node.js application (temp-sensor.js) have already been deployed to the instance. The temp-sensor.js application will emulate a device sensor, which will be consumed by AWS IoT. The redis-cli will be used to query the Redis cluster. For your convenience, Node.js is also installed on the EC2 instance to run the sensor emulator application.

Step 2: Configure AWS IoT

IoT

Now we are ready to create a thing and connect our IoT device (which is an emulator application) to the AWS IoT device gateway. The device gateway allows secure, low latency, bi-directional communication between connected devices and AWS. AWS IoT also includes a rules engine, which we will use to enable continuous processing of data sent by our device. We will configure the AWS IoT rules engine to trigger an AWS Lambda function in a later step.

1. Sign in to the AWS Management Console and open the AWS IoT console at https://aws.amazon.com/iot.

2. Create a thing to represent your device. In Name, type SensorHub, and then choose Create.

IotHub

3. Choose your thing (SensorHub), and then choose Connect a device.

ddRule

4. For the device type, choose NodeJS, and then choose Generate certificate and policy.

AWSIot

5. Download the keys and certificate, and then choose Confirm & start connecting.

StartConnecting

6. Copy the connection information. You will need it for the next step.

Step 3: Run the Node.js emulation code

The sensor emulator code is already deployed to your EC2 instance and located in /home/ec2-user. The code is available here for your review.

The application includes the following structure:

node_modules are the required temp_sensor.js Node.js dependencies.
utils/iot-properties.file is the file where you will place the parameters required by AWS IoT. The utils directory is also where you will place the certificates provided by AWS IoT.
temp-sensor.js is the Node.js application that will emulate sensor data.

Let’s take a look at the code.

First, we will retrieve the properties stored in the iot-properties.file that we’ll be using to securely authenticate and connect to AWS IoT.

var properties = PropertiesReader ('./utils/iot-properties.file');

const device = awsIot.device ({
"host": properties.get('host'),
"port": properties.get('port'),
"clientId": properties.get('clientId'),
"thingName": properties.get('thingName'),
"caPath": properties.get('caPath'),
"certPath": properties.get('certPath'),
"keyPath": properties.get('keyPath'),
"region": properties.get('region')
});

Note: Be sure the property names in the code match exactly those in the iot-properties.file and place your AWS IoT keys in the utils directory.

In the following section of the code, we generate random data and publish it to the AWS IoT topic named ‘temp_readings’.

device.on('connect', function() {

console.log('\n===========Emulating Sensor Data=================\n');

setInterval(function () {

for (i=10; i>=0; i--) {
//Generate Random Sensor Data
var temperature = Math.floor((Math.random() * 110) + 1);
var deviceId = Math.floor((Math.random() * 5000) + 1);
var IP = Faker.Internet.ip();
var humidity = Math.floor((Math.random() * 100) + 1);

console.log('deviceId= ' + deviceId + ' temperature= ' + temperature + ' humidity=' + humidity + ' IP=' + IP );
device.publish('temp_readings', JSON.stringify (

{ "deviceId" : deviceId,
"temperature" : temperature,
"deviceIP" : IP,
"humidity" : humidity
}
));
}

}, 4000);

});

1. Connect to your EC2 instance by using SSH. Use the user name ec2-user and the SSH key that was created and downloaded earlier.
To find the hostname of the EC2 instance, look on the Outputs tab of the CloudFormation stack or go to the EC2 dashboard.
2. Place the AWS IoT certificate and private key you downloaded earlier in the./utils directory, and then update the ./utils/iot-properties.file with your connection information and the names of the key and certificate.
3. Run the Node.js application with node temp-sensor.js.
4. Go to the AWS IoT console.
5. Choose MQTT Client and create a test client to test whether temperature readings are being captured. Choose Connect.

MQTClientActions
6. Subscribe to the temp_readings topic, which is where the emulation code is publishing temperature readings. Choose Subscribe.

MQTClient
7. If everything was set up correctly, sensor data should start being published.

Task 4: Deploy a Lambda function and AWS IoT rule

This task will create the AWS Lambda function that will be used by the AWS IoT rules engine to persist the temperature readings into Amazon ElastiCache for Redis. The AWS IoT rule will trigger the execution of the Lambda function. The code will also create the Redis data model.

1. Download and unzip the PersistSensorData.zip package.
The application includes the following structure:

node_modules contains the required PersistSensorData.js Node.js dependencies.
PersistSensorData.js is the node application that will process the sensor events and persist the data into Amazon ElastiCache for Redis.

Let’s take a look at the code.

First, we will log the details of the event, making a connection to the Redis cluster using the node_redis client. We’ll use the Redis multi command for transaction support.

var redis = require("redis");

exports.handler = function(event, context) {

console.log("Request received:\n", JSON.stringify(event));
console.log("Context received:\n", JSON.stringify(context));

console.log(" deviceId:" + event.deviceId +
" temperature:" + event.temperature +
" deviceIP:" + event.deviceIP +
" humidity:" + event.humidity );

client = redis.createClient("redis://your-redis-endpoint:6379");
multi = client.multi();

2. Update the Redis connection endpoint in the Lambda function with the Redis cluster endpoint inside the PersistSensorData.js function.

To find the Redis cluster endpoint, open the Amazon ElastiCache console. In the Nodes column, click the 1 node link.

CacheCluster1

Your Redis cluster endpoint is the endpoint value.

NodeStatus

Now update the Redis endpoint in the Lambda function with this endpoint. Leave the port value in there.

For example, this: redis.createClient("redis://your-redis-endpoint:6379");

becomes this:
redis.createClient("redis:// sen-re-a4mbqq65e605.rhpopm.0001.usw2.cache.amazonaws.com:6379");

In the following section of the code, we increment the Redis counters to group similar temperature readings. Those groupings will appear on a web dashboard.

if (event.temperature > 85 ) {

multi.incr("climate:hot");
climate = "hot";
multi.publish("Warnings",
" deviceID:" + event.deviceId +
" temperature:" + event.temperature + "" +
" awsRequestId:" + context.awsRequestId + " timestamp:" + date );

} else if (event.temperature > 75 ) {

multi.incr("climate:warm");
climate = "warm";

} else if (event.temperature > 65 ) {

multi.incr("climate:cool");
climate = "cool";

} else {

multi.incr("climate:cold");
climate = "cold";
}

We also update our sorted set SensorData with the deviceId value and sort it by the date (stored as a LONG), which acts as our score.

multi.zadd("SensorData", date, event.deviceId);

We then store the sensor data in a Redis Hash for each deviceId. In addition to the data we captured from the sensor emulator code, we’ve added properties like awsRequestId, timestamp, and climate. Finally, we execute the commands within the transaction.

multi.hmset(event.deviceId, "temperature", event.temperature,
"deviceIP", event.deviceIP,
"humidity", event.humidity,
"climate", climate,
"timestamp", date,
"awsRequestId", context.awsRequestId);

multi.exec(function (err, replies) {

if (err) {

console.log('error updating event: ' + err);
context.fail('error updating event: ' + err);

} else {

console.log('updated event ' + replies);
context.succeed(replies);
client.quit();

}
});
}

3. After you have updated the Redis connection endpoint in the Lambda function, zip the function and node_modules directory. Name the zipped artifact PersistSensorData.zip.

4. Open the AWS Lambda console.

5. When prompted to select a blueprint, choose Skip.

6. When prompted to configure triggers, choose Next. We will complete the configuration of the trigger in the AWS IoT console.

7. Name your function PersistSensorData and choose Upload.

ConfigureFunction

8. For Handler, type PersistSensorData.handler. For Role, choose Create a custom role.

LambdaFunction

9. For IAM Role, choose a basic role that includes VPC execution, and then choose Allow.
Note: If you already have a role, you might need to open the IAM console and add the AWSLambdaVPCAccessExecutionRole managed policy to it.

HideDetails

10. Specify the VPC, subnets, and security groups this function should use. For VPC, we can use the default VPC and any of the subnets in it. For Security Group, choose the group created by the CloudFormation stack. It will start with SensorDemo-EC2InstanceSecurityGroup unless you gave your stack a different name. Choose Next.

VPC

11. Review the information on the page, and then choose Create Function.

Congrats

Close this tab. Now that the Lambda function has been created, we can finish the configuration of our AWS IoT rule.

12. Open the AWS IoT console, and confirm the correct region is selected.

13. Choose Create a resource.

14. Choose Create a rule.

CreateRule

15. For Name, type PersistToRedis.
16. For Attribute, type * (asterisk).
17. For Topic filter, type temp_readings.
18. Leave Condition blank.
19. For Choose an action, choose Lambda.

LambdaAction

20. Choose Add action.

LambdaFunctionName

21. Choose Create.

Task 4: Analyze sensor data by using redis-cli

1. Connect to your EC2 instance using SSH. Use the user name ec2-user and the the SSH key you created or downloaded earlier.
To find the hostname of the EC2 instance, look on the Outputs tab of the CloudFormation stack or navigate to the EC2 console. You should see the instance running in the console.

2. Navigate to the following directory and use the redis-cli to connect to Redis.

/usr/local/bin

3. Connect to your Redis cluster (for example,
redis-cli -h sen-re-a4mbqq65e605.rhpopm.0001.usw2.cache.amazonaws.com -p 6379).

4. The following are some commands you can use to analyze the sensor data.
Note: Make sure the node application, temp_sensor.js, is running. You might want to have two SSH sessions open.

What are the last n events?
zrevrangebyscore SensorData +inf -inf WITHSCORES LIMIT 0 5

Which events occurred during this time range?
zrangebyscore SensorData (1410000000000 14400000000000

Which devices match a particular pattern?
zscan SensorData 0 MATCH 2* COUNT 100

What is the temperature for a specific device?
hget 2942 temperature (This assumes there is a deviceId with a value of 2942.)

What are the totals for each climate type?
get climate:warm get climate:cool get climate:cold get climate:hot

Show me all of the data for a specific device.
hgetall 2942

For a complete list of Redis commands, see http://redis.io/commands.
Task 5: Deploy the sensor dashboard application

Arch1

Now we’ll deploy a web dashboard that will display the ingested sensor data so we can interact with it. The dashboard information includes what’s contained in the Sorted Set, Hash, and Counters. It also includes notifications from the Redis PUB/SUB channel.

The purpose of this dashboard is to illustrate how easily data can be retrieved from Redis. The .war application is available for download. The front end of the application has AJAX code that continuously queries the data in Redis through our Java service. The Java Redis client we’re using is Jedis.

We’ll use Amazon Elastic Beanstalk to deploy the dashboard.

Sorted set (SensorData)
We are using a set collection we named events and hydrating it with data from our SensorData data structure. For our web dashboard, we simply converted the data into a JSON object that is used in the dashboard UI. We’re using the ZREVRANGEBYSCORE command to return the reverse order of the values based on our timestamp (the score). This will give us our most recent events first. We are also retrieving the top 24 values each time we execute this command to display in the UI.

Gson gson = new Gson();
JsonObject sensors = new JsonObject();

Set events = jedis.zrevrangeByScoreWithScores("SensorData", Double.MAX_VALUE, Double.MIN_VALUE, 0, 24);

for (Tuple event : events)
sensors.add(event.getElement(), gson.toJsonTree(event.getScore()));

Hash (deviceId)
Each time we click a deviceId on our web dashboard, we query the Redis Hash for details of that deviceId. We simply create a Java map and hydrate it with results from the HGETALL Redis command, which returns all of the properties in a Redis Hash for a particular key (our deviceId). We also convert that Java map to a JSON object so it can be easily consumed by the UI.

Gson gson = new Gson();
Map<String, String> eventData = jedis.hgetAll(deviceId);
String json = gson.toJson(eventData);

Counters
For the chart on the dashboard that will show the temperature readings (the counts of the different climates), we simply make a GET request on each Redis key we want included. The counts are in a JSON object so they can be easily consumed by the UI.

Gson gson = new Gson();

JsonObject counters = new JsonObject();
counters.add("chartCold",gson.toJsonTree(jedis.get("climate:cold")));
counters.add("chartWarm",gson.toJsonTree(jedis.get("climate:warm")));
counters.add("chartHot",gson.toJsonTree(jedis.get("climate:hot")));

Pub/Sub
The PersistSensorData Lambda function processes the temperature readings and publishes a notification to the channel “Warnings.” The web dashboard is subscribed to this channel.

jedis.subscribe(new JedisPubSub() {

public void onMessage(String channel, String message) {

writer.println(message + "");
writer.flush();
asyncContext.complete();
}

public void onSubscribe(String channel, int subscribedChannels) {
}

}, "Warnings");

Let’s deploy this code!

1. Open the AWS Elastic Beanstalk console.

2. Choose Create New Application.

3. Provide a name and description for the application, and then choose Next.

AppInfo

4. Choose Create web server.

NewEnv

5. For Predefined configuration, choose Tomcat. Because this is just a demo, for Environment type, choose Single instance. Choose Next.

EnvType

6. Upload the sample dashboard application, and then choose Next.

AppSource

7. For Environment name, type a name for your application, and then choose Next.

EnvInfo
8. Select the Create this environment inside a VPC box, and then choose Next.

AddResources

9. On Configuration Details, , choose an instance type, and then select the EC2 key pair you created. You can also provide your email address to get notifications about changes to your environment. Choose Next.

ConfigDetails

10. Add tags to your environment, and then choose Next.

EnvTags

11. Add your VPC configuration information, including the subnets and security group, and then choose Next. The subnets and security group should match the values you used when you configured the Lambda function.

Note: Be sure to select the same subnet in which your ElastiCache cluster was created.

12. Choose the appropriate permissions for Elastic Beanstalk, and then choose Next.

Permissions

13. Review the configuration, and then launch the environment.

14. After the application has been launched, choose Configuration.

Config

15. Next, edit the software configuration.

SoftwareConfig

16. Scroll to the bottom of the page, add the following environment properties, and then choose Apply.

For Property Name, type REDIS_CONNECTION_STRING.
For Property Value, add your Redis endpoint.

PropertyNAme

Your Elastic Beanstalk environment will be updated.

17. Navigate to your Elastic Beanstalk environment. You should see sensor data being populated to the dashboard.

BeansTackEnv

If you don’t, be sure to run your Node.js application and verify:

The Elastic Beanstalk application security group has access to write to the ElastiCache for Redis cluster.
The Lambda trigger is configured. Open the AWS Lambda console and choose the Triggers tab.

Success

Conclusion

Amazon ElastiCache for Redis is an excellent storage solution for time series data. It supports extremely high request rates, advanced data structures, high availability, and persistence. And because it’s fully managed, it requires zero administration.

AWS IoT and AWS Lambda are also fully managed services that scale automatically to ingest and process device data.

To help you get started, a free tier for ElastiCache for Redis is available for your use. Use it to kick off your projects and please let us know about your results!

Scaling Your Amazon RDS Instance Vertically and Horizontally

by Marie Yap | on 10 OCT 2016 | in RDS MySQL | Permalink | Comments

Marie Yap is a solutions architect for Amazon Web Services.

As a managed service, Amazon RDS takes care of the scaling of your relational database so your database can keep up with the increasing demands of your application or applications.

In this blog post, we are going to take a look into how we can vertically and horizontally scale your RDS instance. You can scale vertically to address the growing demands of an application that uses a roughly equal number of reads and writes. Or you can scale horizontally for read-heavy applications.

Vertical Scaling
To handle a higher load in your database, you can vertically scale up your master database with a simple push of a button. There are currently over 18 instance sizes that you can choose from when resizing your RDS MySQL, PostgreSQL, MariaDB, Oracle, or Microsoft SQL Server instance. For Amazon Aurora, you have 5 memory-optimized instance sizes to choose from. The wide selection of instance types allows you to choose the best resource and cost for your database server.

Following are some things to consider when scaling up an RDS instance:

Before you scale, make sure you have the correct licensing in place for commercial engines (SQL Server, Oracle) especially if you Bring Your Own License (BYOL). One important thing to call out is that for commercial engines, you are restricted by the license, which is usually tied to the CPU sockets or cores.
Determine when you want to apply the change. You have an option to apply the change immediately or during the maintenance window specified for the instance.
Storage and instance type are decoupled. When you scale your database instance up or down, your storage size remains the same and is not affected by the change. You can separately modify your DB instance to increase the allocated storage space or improve the performance by changing the storage type (such as to General Purpose SSD to Provisioned IOPS SSD).
There is minimal downtime when you are scaling up on a Multi-AZ environment because the standby database gets upgraded first, then a failover will occur to the newly sized database. A Single-AZ instance will be unavailable during the scale operation.

To change the instance type, choose Modify from the Instance Actions menu on the RDS console.

InstanceActions

Then choose the new DB instance class.

InstanceType

Finally, determine if you want to apply the change immediately or not. To apply the change immediately, select the Apply Immediately check box at the bottom of the Modify page. If you don’t apply the change immediately, then the change will be scheduled to occur during the preferred maintenance window that you defined.

Horizontal Scaling
In addition to scaling your master database vertically, you can also improve the performance of a read-heavy database by using read replicas to horizontally scale your database. RDS MySQL, PostgreSQL, and MariaDB can have up to 5 read replicas, and Amazon Aurora can have up to 15 read replicas.

Read replicas allow you to create read-only copies that are synchronized with your master database. You can also place your read replica in a different AWS Region closer to your users for better performance. Also, you can use read replicas to increase the availability of your database by promoting a read replica to a master for faster recovery in the event of a disaster. However, read replicas are not a replacement for the high availability and automatic failover capabilities that Multi-AZ provides.

Currently, RDS read replicas support transparent load balancing of queries or connections. Each replica has a unique Domain Name Service (DNS) endpoint so that an application can implement load balancing by connecting to the replica endpoint. Let’s look at the options on how we can make applications aware of RDS read replicas.

If your application is using the native MySQL driver, there are MySQL Connectors that allow you to do read/write splitting and read-only endpoint load balancing without a major change to your application. For example, if you have a PHP application, you can use the MySQL native driver’s PHP Mysqlnd replication and load-balancing plugin.

In addition to using a MySQL Connector, you can add a load balancer between your application and database servers. You make this addition so that you have a single database endpoint presented to the application. This approach allows for a more dynamic environment where you can transparently add or remove read replicas behind the load balancer without constantly updating the database connection string of the application. You can also perform a custom health check by using scripts.

MySQLConnector

As shown in the diagram, you can use a transport or a layer 4 load balancer together with the MySQL Connector. Currently, the Elastic Load Balancing (ELB) load balancer does not support the routing of traffic to RDS instances. Therefore, you might want to consider other options such as HAProxy, which is a open-source software-based load balancer that many people use. In this solution, you can configure HAProxy to listen on one port for read queries and another port for write queries.

Another option is to use a layer 7 SQL-aware load balancer, which allows you to forward queries to your databases using complex rules. This type of load balancer has a more sophisticated capability of understanding how to properly perform the read/write splits on multi-statements than a MySQL Connector does. This solution handles the scaling issues in a distributed database environment, so you don’t have to handle scaling on the application layer, resulting in little or no change to the application itself. To accomplish this, there are several open-source solutions (such as MaxScale, ProxySQL, and MySQL Proxy) and also commercial solutions, some of which can be found in the AWS Marketplace.

Conclusion
In summary, you can scale your RDS configuration up or out to meet the growing needs of your applications. RDS takes care of the heavy lifting in scaling your database so you can focus more on your application or applications.

Configuring the AWS Schema Conversion Tool

by Wendy Neu | on 26 SEP 2016 | in Migration, Schema Conversion Tool (SCT) | Permalink | Comments

Wendy Neu is a big data architect at Amazon Web Services.

The AWS Schema Conversion Tool (AWS SCT) is a project-style Java application that can be installed on a local computer or an Amazon EC2 instance. The tool can be downloaded from AWS and is supported on MS Windows, Mac OSX, and Ubuntu and Fedora Linux.

You can use the AWS Schema Conversion Tool to convert your existing database schema from one database engine to another. You can convert relational OLTP schema, or data warehouse schema. Your converted schema is suitable for an Amazon RDS MySQL DB instance, an Amazon Aurora DB cluster, an Amazon RDS PostgreSQL database instance, or an Amazon Redshift cluster.

SCT is simple to use and takes only minutes to get started after you install the tool locally on your machine. It requires only connectivity to the source and target databases. SCT uses JDBC drivers to connect to the source and target databases. When you set up your first DB connection to a new source, the wizard interface will prompt you for a driver. All supported drivers are listed by database engine in the Installing and Updating section of the documentation.

When both the source and target database connections are configured, simply right-click on the schema you want to analyze and choose Create Report.

CreateReport

SCT will examine in detail all of the objects in that schema, including the schema itself, tables, constraints, types, sequences, synonyms, collection types, views, procedures, functions, and packages. It will convert as much as possible automatically and provide detailed information about items it couldn’t convert. The report can be saved as a .csv file or a .pdf file for review.

Navigate to the Action Items tab in the report to see the items that the tool couldn’t convert and understand how much work you have to do.

ActionItems

When you’ve looked at the report, choose Actions, Convert schema from the menu bar to generate the data definition language (DDL) statements for the target database.

ConvertSchema

DDL for the target database is displayed in the project console, and you can edit it before applying it with SCT or simply saving it as an .sql file for application later.

SQL

In development, it’s a good idea to apply the changes directly from within SCT to a sandbox version of your database to understand what objects will be automatically converted, how many objects are involved, and what level of development effort is ahead. Beyond the sandbox and early development phases, it’s best practice to adhere to normal software lifecycle procedures within your organization. Scripted conversions will facilitate key processes like code reviews, automated builds, and automated deployments.

Deployment

Working with Global Settings
The AWS Schema Conversion Tool is very easy to use and lets a DBA or developer move very quickly to analysis and deployment. However, sometimes the default settings might need to be modified. For example, such a change might be necessary if the wrong driver was entered through the wizard for a database engine, a single schema includes thousands of objects and places the Java virtual machine (JVM) under memory pressure, or verbose logging must be enabled. AWS Schema Conversion Tool exposes fine control over all these key areas through the Global settings dialog box. The sections following offer some suggestions on how to use global settings for fine control over these areas.

All sections following are accessed through the SCT console by choosing Settings and then Global Settings from the top menu bar.

GlobalSettings

Driver settings
You must have the required JDBC drivers installed for the AWS Schema Conversion Tool to connect to your source or target database. When a driver is downloaded, you can upload the driver into the tool through the wizard or through the Global settings dialog box. The advantage of uploading the drivers through the Global settings dialog box is that you can enter or update all of the drivers at one time.

Choose Drivers from the side navigation panel in the Global settings dialog box, and set the target database driver by choosing Browse, navigating to the local machine, and choosing the .jar file to associate with SCT for a particular database engine.

GlobalSEttings2

All projects within SCT for that database engine will use the selected driver.

Performance and memory settings

SCT has menu options that control how aggressively the tool will consume JVM memory:

Fast conversion, but large memory consumption – This option optimizes for a fast conversion but might require more memory for the object reference cache.
Slower conversion, lower memory consumption – This option minimizes the amount of memory used but results in a slower conversion. Use this option if your computer has a limited amount of memory.
Balance speed with memory consumption – This option optimizes provides a balance between memory use and conversion speed.

To set these options, choose Performance and Memory from the side navigation panel in the Global settings dialog box, and change Performance and Memory Management Policy to the desired setting.

GlobalSettings3

JVM size settings
The default amount of memory allocated to SCT is just less than a gigabyte. This memory size is usually sufficient for small databases of less than 1000 objects. For many databases, it is necessary to reconfigure the SCT JVM to use more memory even when the JVM is configured in the low memory consumption configuration.

The screenshot that follows shows an SCT JVM under duress, with the tasks allocated hitting the maximum boundary of JVM memory. The setting was Slower conversion, lower memory consumption and failed due to a too-small JVM memory setting.

JVM Settings

You can increase (or decrease) the maximum and minimum memory allocated to the SCT JVM through configuration files.

Configuration files for SCT are located in the AWS Schema Conversion Tool installation directory under a subdirectory called app. The name of the main configuration file is AWS Schema Conversion Tool.cfg. You can modify the file using any text editor—however, don’t use a word processing tool (like Word or WordPad) because it might change certain characters in the file and prevent the file from being read correctly by the JVM.

The image following shows an AWS Schema Conversion Tool installation directory under C:\Program Files as an example.

FileExplorer

The default file that is installed with SCT has a few parameters for the tool itself and a single JVM option. None of the existing parameters in the file should be changed without consulting AWS Support.

SCT

You can modify the memory settings of the JVM by adding two parameters under the [JVMUserOptions] section. However, use caution when configuring the JVM size. If your maximum size exceeds your system’s available memory, it might crash your machine.

The first of the two JVM memory parameters is the maximum JVM memory size, Xmx, which is recorded in megabytes and should be in standard recognized memory values. To set a maximum JVM size of 4 GB, set this value to –Xmx4096m with no spaces on a single line.

The second parameter is the minimum JVM memory size Xms, which is recorded in megabytes and should be in standard recognized memory values. To set a minimum JVM size of 2 GB, set this value to –Xms2048m with no spaces on a single line.

SCT2

When you’re done editing, save the file and restart AWS SCT. You should see the values in the status bar change to reflect your new settings.

The screenshot that follows shows an SCT JVM running smoothly because the tasks allocated are operating comfortably within the maximum boundary of the JVM. The setting was Slower conversion, lower memory consumption, and the tool succeeded in converting nearly 5000 objects in about 10 minutes.

SCT3

Logging
In addition to managing the memory settings of the AWS Schema Conversion Tool, it can be a good idea to increase the logging information produced by the tool when converting new projects. Although increasing logging information might slow conversion slightly, this change will give robust information to AWS Support if any unforeseen errors arise.

To change the logging settings, choose Settings and then Global Settings. In the left navigation pane, choose Logging. See the section above Memory Management for help.

In the logging pane, change Debug mode to true and choose a few key items to increase the logging information for. Setting the Parser, Type Mapping, and User Interface to TRACE can help with key problem areas during conversion.

GlobalSettings4

You can also configure the log location. If the information becomes too verbose for the current file system where the logs are streaming, change the value to a location with sufficient space to capture the logs.

GlobalSettings5

To transmit logs to AWS Support, navigate to the directory where the logs are stored, zip up all the files into a manageable single file, and upload the .zip file with your support case.
When the initial analysis is completed and ongoing development is taking place, simply return Debug mode to false to eliminate the verbose logging and increase speed.

Tip: To manage the log size and streamline reporting issues, remove the logs or move them to another location after a successful conversion. Doing this ensures that only the relevant errors and information are transmitted to AWS Support and keeps the log file system from filling.

All references in this document refer to Version 1 of the AWS SCT and were tested on build 401 successfully.

Using Amazon RDS for Oracle as the Oracle SOA Suite Database

by Jinyoung Jung | on 19 SEP 2016 | in RDS Oracle | Permalink | Comments

This post is written by Fabio Douek, an architect at AWS Partner Rubicon Red.

When we started to provision Oracle Fusion Middleware platforms with AWS, we took advantage of the ability to create AWS infrastructure within minutes. We can also use MyST to provision complex platforms that are EDG-compliant with Oracle Fusion Middleware in less than an hour.

One challenge that we faced was that we couldn’t use Amazon RDS for Oracle as the database for our Oracle Fusion Middleware installations. Our inability to do so was primarily because the RDS master user didn’t have the database privileges required to run the Oracle Repository Creation Utility (RCU). As a result, we implemented our own automation for provisioning the Oracle Database running on Amazon EC2 instances.

Although this approach works for running development and test workloads on AWS, when it comes to implementing production workloads, Amazon RDS for Oracle provides a number of additional benefits. These benefits include simplified administration tasks, including backups, software patching, monitoring, and hardware scaling.

In addition, Multi-AZ deployment with Amazon RDS for Oracle simplifies the implementation of a highly available architecture, because it contains built-in support for automated failover from your primary database to a synchronously replicated secondary database in an alternative Availability Zone in case of a failure.

Our approach to this challenge started to change late last year, with a number of our customers looking at running Oracle SOA workloads in production on AWS. Being an AWS Technology Partner, we provided this feedback to AWS, who in return invited us to collaborate with the RDS team.

We spent the last four months working with the Amazon RDS team to test the RCU capability within MyST. This process went extremely well, and the Amazon RDS team worked closely with us to support the go-live of our first customer on Oracle SOA 12.2.1 on AWS using Amazon RDS for Oracle —what we believe to be a world first!

The following diagram illustrates the type of architecture required.

More recently, Amazon has now announced that RCU is officially supported by RDS Oracle. That’s great news, because it means that we can provision an Oracle Fusion Middleware environment that is EDG-compliant and highly available within minutes. This way, we can take advantage of RDS to simplify ongoing operations.

To find out more detail on the steps required to enable Oracle SOA with RDS Oracle, see our blog post on the Rubicon Red website.

Migrate PostgreSQL Databases and Perform Ongoing Replication with the AWS Database Migration Service

by Jeff Levine | on 14 SEP 2016 | in DMS, Migration, RDS PostgreSQL | Permalink | Comments

Jeff Levine is a solutions architect for Amazon Web Services.

As a solutions architect for Amazon Web Services, I help our customers to migrate workloads to the AWS Cloud, a key part of which includes their databases. The AWS Database Migration Service (AWS DMS) enables organizations to migrate data to and from a variety of databases located in Amazon Relational Database Service (Amazon RDS), those running on Amazon EC2, and also those running on-premises. The service supports homogeneous migrations such as Oracle to Oracle, and also heterogeneous migrations such as Oracle to Amazon Aurora. You can also customize table mappings and perform translations. In this post, I am going to talk about two things, first about using an RDS PostgreSQL source database for DMS and then about using the recently announced continuous data replication feature, which can keep the target database up to date after the initial migration.

Using an RDS PostgreSQL Source
You might think that migration means just moving data from one location to another, but the process involves far more. The process must also keep up with transactions that take place during the migration, so that the database is left in a consistent state upon completion. Accordingly, when we look at migrating from an Amazon RDS PostgreSQL source database, we need to take into account both how data is moved and also environmental factors that can affect the ongoing processing of transactions while the migration is happening.

Let’s take the case of a PostgreSQL to MySQL migration. I’ll begin with a simple table definition.

CREATE TABLE subscribers (
subscriber_id SERIAL NOT NULL PRIMARY KEY, subscriber_info json NOT NULL
);

I’m going to use pgAdmin, a popular open source administration tool for PostgreSQL, to create the table and display the specification in the source database.

After I run the DMS migration task, here’s how the newly created subscribers table appears in the target database using the program DbVisualizer.

Notice how the serial column is now an integer and the subscriber_info column has been converted to longtext. To understand what happened, we need to consider how DMS operates. DMS functions as an intermediate broker to a variety of different database engines. To make this conversion happen, DMS first maps the source data type (JSON in this case) to an intermediate DMS type and then maps that type to the target data type (longtext). It’s very important to take into account these mappings when designing your migration strategy. Note that I used the default mapping. DMS also allows you to customize the mappings if you want.

We also need to look at source-related performance factors such as replication activity and the use of write-ahead logging (WAL) that can affect the migration. Using Amazon RDS, you can change database parameters using parameter groups. One adjustment we suggest is changing the wal_sender_timeout parameter to 0. To do this, create a parameter group for your RDS PostgreSQL database and adjust the parameter accordingly. Because this is a dynamic parameter, the adjustment will take effect immediately. You can read about all the nuances of using PostgreSQL DMS sources in the RDS documentation.

Using Continuous Data Replication
Additionally, AWS recently announced continuous data replication for DMS, which enables you to keep your database up to date after the initial migration. With this feature, you can now have high availability during the replication process by specifying Multi-AZ when you create your replication instances.

When Multi-AZ is selected, two replication hosts will be created, each of which has its own IP address. Remember to adjust the underlying source and target security groups (or their on-premises equivalents) to grant both instances the appropriate level of access.

The AWS Database Migration Service offers a powerful set of migration capabilities. It is important to consider how the service works behind the scenes as part of the development of an overall migration plan. With continuous data replication, you can now keep your databases up to date after the initial load. It’s easy to get started. Follow our Getting Started Guide to migrate your database using AWS Database Migration Service with just a few clicks.

Creating a Simple Autocompletion Service with Redis: Part One of Two

by Darin Briskman | on 12 SEP 2016 | in ElastiCache, Redis | Permalink | Comments

Darin Briskman (@briskmad) is a developer evangelist for Amazon Web Services.

So you’ve figured out how to use a cool framework to make a great-looking web form. You’ve made it super-easy for people to understand each field and how to fill it out. But there is still a lot of stuff to type. How can we make it even easier, especially for mobile devices where a lot of data entry is a challenge?

One easy thing to do is add an autocomplete function, which detects when characters are entered into a field and suggests which words or phrases are desired. All you need is a list of phrases, Amazon ElastiCache for Redis, and a little bit o’ code.

If you’re not a Redis expert, don’t worry: For the purposes of autocompletion, everything you’ll need to know about Redis is in this blog post. Because you’re using Amazon ElastiCache, all the rest of the work (installation, updates, data management, maintenance, and so forth) is automated and done for you—all you need to do is provide the data and fetch the data.

Using Redis Sorted Sets
What makes this fast and easy is the Redis Sorted Sets data type. This type is simply a nonrepeating collection of strings, with a score associated with each one. The strings are always stored in descending order of scores. When more than one string has the same score, the set is ordered lexicographically. Because the set is always kept in order, retrieving data from the beginning or the end or the middle is easy, using the Redis ZRANGE command.

If we put the names of a few countries into a Redis sorted set, we’ll get the following.
redis> zadd mylist 0 "United States" (integer) 1 redis> zadd mylist 0 "Sweden" (integer) 1 redis> zadd mylist 0 "Union of South Africa" (integer) 1 redis> zadd mylist 0 "United Kingdom" (integer) 1 redis> zrange mylist 0 -1 1. "Sweden" 2. "Union of South Africa" 3. "United Kingdom" 4. "United States"

The other Redis command we need is ZRANK, which can tell us where a string lies inside an ordered set.
redis> zrank mylist "United Kingdom" (integer) 2

(Indexes in Redis always start at zero, so 2 is the third item on the list.)

And now that we know that United Kingdom is item 2 in the sorted set, we can easily find what comes before and after it.
redis> zrange mylist 1 2 1) "Union of South Africa" 2) "United Kingdom" redis> zrange mylist 2 3 1) "United Kingdom" 2) "United States"

For autocompletion, we need a sorted set that includes the letters that someone would type. We also need a character to show that we’ve reached the end of the phrase. I’m going to use %.
redis> zadd mylist2 0 "United States%" (integer) 1 redis> zadd mylist2 0 "United State" (integer) 1 redis> zadd mylist2 0 "United Stat" (integer) 1 redis> zadd mylist2 0 "United Sta" (integer) 1 redis> zadd mylist2 0 "United St" (integer) 1 redis> zadd mylist2 0 "United S" (integer) 1 redis> zadd mylist2 0 "United " (integer) 1 redis> zadd mylist2 0 "United" (integer) 1 redis> zadd mylist2 0 "Unite" (integer) 1 redis> zadd mylist2 0 "Unit" (integer) 1 redis> zadd mylist2 0 "Uni" (integer) 1 redis> zadd mylist2 0 "Un" (integer) 1 redis> zadd mylist2 0 "U" (integer) 1 redis> zrange mylist2 0 -1 1. "U" 2. "Un" 3. "Uni" 4. "Unit" 5. "Unite" 6. "United 7. "United " 8. "United S" 9. "United St" 10. "United Sta" 11. "United Stat" 12. "United State" 13. "United States%"

Now we can use code to loop through the entries to find the complete entry (ending with %).

Putting the pieces together
Let’s put it all together. First, we’ll set up an ElastiCache Redis node. If there are no repeating prefixes, this algorithm will create (Number of Items) x (Average Length of Each Item + 1) x (Average Length of each item) ÷ 2 entries in our sorted set. Unless you have a very weird set of items, there will be repetition among the prefixes, so the actual amount of memory needed will be smaller. For this example, countries.txt has 195 items that are an average of 10.4 characters each, including spaces (and the file’s encoded in UTF-8, so it’s one byte per character). So we’ll need no more than 11,560 bytes to store all the possible prefixes for this list. For more information on how to size your Redis node, see Selecting Your Redis Node Size in the ElastiCache User Guide.

This isn’t a lot of data, so we can use the very smallest cache.t2.micro node, which can run using the AWS Free Tier.

Start by going to the Amazon ElastiCache for Redis page and choosing Get Started or by signing on to your AWS Management Console and choosing ElastiCache.

Console

Then create a Redis cluster. For simplicity, we’ll avoid replication (though for a production instance, replication is a good idea!). Uncheck Enable Replication, name the cluster redis-autocomplete, and choose cache.t2.micro for node type. You can leave all the other settings at their defaults, then launch the cluster.

Redis Param
It can be handy to have redis-cli (the Redis command line interface) where you can test and debug your redis commands. If you’re anything like me, writing code is easy… debugging it to make it actually work, not so easy (sigh).

For the CLI, create a new Amazon EC2 instance (t2.micro will be enough) and log in to it. If you haven’t done this before, see Getting Started with Amazon EC2 Linux Instances for a guide on creating your EC2 instance and the needed security groups.

In theory, all you need to install is redis-cli, but there are a lot of dependencies. Redis is pretty small, so it’s easiest to just install the whole package.
sudo yum --enablerepo=epel install redis
To actually use the CLI, you’ll need to know the name of your ElastiCache instance. On your AWS Management Console, choose ElastiCache from the Services menu, then choose Cache Clusters.

CacheCluster

Then choose your node.

node

Look for the DNS name of your cache instance under Endpoint.

endpoint

Now from your EC2 instance, you can use redis-cli.
$ redis-cli -h <your ElastiCache endpoint DNS> <Your ElastiCache endpoint DNS>.cache.amazonaws.com:6379>

Try using the INFO command at the redis> prompt to validate that all is working as it should.
When you’re done, use CTRL-D to end the interactive redis> prompt and return to the system shell.
In a future post, we’ll use AWS Lambda to make this completely serverless. For now, we’ll keep it simple, using Python 2.7 (the version included with most Linux distributions) on our EC2 node.

Testing our simple autocompletion service
Let’s do a test run. First, we need some code that reads our text file and parses it into prefixes. We will need to grab a file with a list of items for autocompletion—you can use your own or you can use the list of UN member countries I’ve posted online.
$ wget http://autocomplete.s3.amazonaws.com/Countries.txt $ cp Countries.txt autocomplete.txt

We also need to install the Redis SDK for Python.
$ sudo pip install redis

Now that we have all of the prerequisites, we can use some code to load and parse the file. Use your favorite text editor (I’m old school, so I use vi) to create a file called ac_setup.py.

import redis
r = redis.StrictRedis(host='<your ElastiCache endpoint DNS>', port=6379, db=0)
f = open(‘autocomplete.txt',"r")
  for line in f:
    n = line.strip()
    for l in range(1,len(n)):
      prefix = n[0:l]
      r.zadd('autocomplete',0,prefix)
    r.zadd('autocomplete',0,n+"%")
else:
  exit

This code should pull in countries.txt and parse it into an ordered set of every unique prefix, with the full names identified by ending with %. Let’s execute the code.
$ python ac_setup.py

To check that this code worked, we can go back into the redis> interface and see what we have.
$ redis-cli -h <your ElastiCache endpoint DNS> <Your ElastiCache endpoint DNS>.cache.amazonaws.com:6379> zrange autocomplete 0 -1 1) "A" 2) "Af" 3) "Afg" 4) "Afgh" 5) "Afgha" 6) "Afghan" 7) "Afghani" 8) "Afghanis" 9) "Afghanist" 10) "Afghanista" 11) "Afghanistan%" 12) "Al" 13) "Alb" 14) "Alba" 15) "Alban" ...

Now, we need the code to pull the potential autocompletion list from Redis. Again, use your favorite text editor (I’m still using vi!) to create a file named ac.py.

import sys
import redis
r = redis.StrictRedis(host=<your ElastiCache endpoint DNS>..cache.amazonaws.com', port=6379, db=0)
def complete(r,prefix,count):
  results = []
  grab = 42
  start = r.zrank('autocomplete',prefix)
  if not start:
    return []
  while (len(results) != count):
    range = r.zrange('autocomplete',start,start+grab-1)
    start += grab
    if not range or len(range) == 0:
      break
    for entry in range:
      minlen = min(len(entry),len(prefix))
      if entry[0:minlen] != prefix[0:minlen]:
        count = len(results)
        break
      if entry[-1] == "%" and len(results) != count:
        results.append(entry[0:-1])
  return results

def autoComplete():
  print complete(r,sys.argv[1],50)

if __name__ == "__main__":
  autoComplete()

This program needs one argument, which is the prefix we want to search. Try it out!
$ python ac.py U ['Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'Uruguay', 'Uzbekistan']

$ python ac.py Uni ['United Arab Emirates', 'United Kingdom', 'United States']

Congratulations! We’ve now built a working recommendation engine. If you want to try this with a larger dataset, let’s use the list of all of the cities and settlements in the US (provided by the US Census Bureau).

Flush the data out of Redis.
$ redis-cli -h <your ElastiCache endpoint DNS> <Your ElastiCache endpoint DNS>.cache.amazonaws.com:6379>FLUSHDB OK

Then double-check to make sure it’s gone.
<Your ElastiCache endpoint DNS>.cache.amazonaws.com:6379> zcard autocomplete (integer) 0

Use CTRL-D to exit the Redis shell.
re<your ElastiCache Endpoint DNS>.cache.amazonaws.com:6379>^D $

Then load the cities data.
$ wget http://autocomplete.s3.amazonaws.com/Cities.txt $ cp Cities.txt autocomplete.txt $ python ac_setup.py

This test will take a bit longer to run, because there are 15,637 items on the list to be parsed, which will take a few minutes on our micro family instance. Once done, we can now search for cities.
$ python ac.py Chi ['Chicago Heights, Illinois', 'Chicago Park, California', 'Chicago Ridge, Illinois', 'Chicago, Illinois', 'Chichester, New Hampshire', 'Chickamauga, Georgia', 'Chickasha, Oklahoma', 'Chico, California', 'Chicopee, Massachusetts', 'Chiefland, Florida', 'Childersburg, Alabama', 'Childress, Texas', 'Chilhowee, Missouri', 'Chillicothe, Illinois', 'Chillicothe, Missouri', 'Chillicothe, Ohio', 'Chilmark, Massachusetts', 'Chiloquin, Oregon', 'Chilton, Texas', 'Chilton, Wisconsin', 'Chimacum, Washington', 'China Spring, Texas', 'Chincoteague Island, Virginia', 'Chinese Camp, California', 'Chinle, Arizona', 'Chino Hills, California', 'Chino Valley, Arizona', 'Chino, California', 'Chinook, Montana', 'Chipley, Florida', 'Chippewa Falls, Wisconsin', 'Chisago City, Minnesota', 'Chisholm, Minnesota', 'Chittenango, New York', 'Chittenden, Vermont']

$ python ac.py Chic ['Chicago Heights, Illinois', 'Chicago Park, California', 'Chicago Ridge, Illinois', 'Chicago, Illinois', 'Chichester, New Hampshire', 'Chickamauga, Georgia', 'Chickasha, Oklahoma', 'Chico, California', 'Chicopee, Massachusetts']

In part two of this post, we’ll explore using AWS Lambda and Amazon S3 to automate loading the data and running the code, plus look at ways to integrate this autocompletion service into a webpage.

I hope you found this fun and useful! If you’d like more on this Redis pattern, this blog post was inspired by Salvatore “antirez” Sanfilippo, Redis’ creator, who wrote about Redis autocompletion on his blog. Amazon ElastiCache for Redis is a quickly evolving service, and you can always get the latest updates at the Amazon ElastiCache for Redis marketing website.

← Older posts

Oct	NOV	Dec
	12
2015	2016	2017