AWS Big Data Blog

Seven Tips for Using S3DistCp on Amazon EMR to Move Data Efficiently Between HDFS and Amazon S3

by Illya Yalovyy | on | | Comments

Have you ever needed to move a large amount of data between Amazon S3 and Hadoop Distributed File System (HDFS) but found that the data set was too large for a simple copy operation? EMR can help you with this. In addition to processing and analyzing petabytes of data, EMR can move large amounts of data.

In the Hadoop ecosystem, DistCp is often used to move data. DistCp provides a distributed copy capability built on top of a MapReduce framework. S3DistCp is an extension to DistCp that is optimized to work with S3 and that adds several useful features. In addition to moving data between HDFS and S3, S3DistCp is also a Swiss Army knife of file manipulations. In this post we’ll cover the following tips for using S3DistCp, starting with basic use cases and then moving to more advanced scenarios:

1. Copy or move files without transformation
2. Copy and change file compression on the fly
3. Copy files incrementally
4. Copy multiple folders in one job
5. Aggregate files based on a pattern
6. Upload files larger than 1 TB in size
7. Submit a S3DistCp step to an EMR cluster

1. Copy or move files without transformation

We’ve observed that customers often use S3DistCp to copy data from one storage location to another, whether S3 or HDFS. Syntax for this operation is simple and straightforward:

$ s3-dist-cp --src /data/incoming/hourly_table --dest s3://my-tables/incoming/hourly_table

The source location may contain extra files that we don’t necessarily want to copy. Here, we can use filters based on regular expressions to do things such as copying files with the .log extension only.

(more…)

Amazon QuickSight Adds Support for Amazon Redshift Spectrum

by Luis Wang | on | | Comments

In April, we announced Amazon Redshift Spectrum in the AWS Blog. Redshift Spectrum is a new feature of Amazon Redshift that allows you to run complex SQL queries against exabytes of data in Amazon without having to load and transform any data.

We’re happy to announce that Amazon QuickSight now supports Redshift Spectrum. Starting today, QuickSight customers can leverage Redshift Spectrum to visualize and analyze vast amounts of unstructured data in their Amazon S3 data lake. With QuickSight and Redshift Spectrum, customers can now visualize combined data sets that include frequently accessed data stored in Amazon Redshift and bulk data sets stored cost effectively in S3 using the familiar SQL syntax of Amazon Redshift.

With Redshift Spectrum, you can start querying your data in Amazon S3 immediately, with no loading or transformation required. You just need to register your Amazon Athena data catalog or Hive Metastore as an external schema. You can then use QuickSight to select the external schema and the Redshift Spectrum tables—just like any other Amazon Redshift tables in your cluster―and start visualizing your S3 data in seconds. You don’t have to worry about scaling your cluster. Redshift Spectrum lets you separate storage and compute, allowing you to scale each independently. You only pay for the queries that you run.

Redshift Spectrum support is now available in these QuickSight regions – US East (N. Virginia and Ohio), and US West (Oregon).

To learn more about these capabilities and start using them in your dashboards, check out the QuickSight User Guide.

If you have questions and suggestions, post them on the QuickSight Discussion Forum.

Not a QuickSight user? Get started for FREE on the QuickSight page.

 

New AWS Certification Specialty Exam for Big Data

by Sara Snedeker | on | | Comments

AWS Certifications validate technical knowledge with an industry-recognized credential. Today, the AWS Certification team released the AWS Certified Big Data – Specialty exam. This new exam validates technical skills and experience in designing and implementing AWS services to derive value from data. The exam requires a current Associate AWS Certification and is intended for individuals who perform complex big data analyses.

Individuals who are interested in sitting for this exam should know how to do the following:

  • Implement core AWS big data services according to basic architectural best practices
  • Design and maintain big data
  • Leverage tools to automate data analysis

To prepare for the exam, we recommend the Big Data on AWS course, plus AWS whitepapers and documentation that are focused on big data.

This credential can help you stand out from the crowd, get recognized, and provide more evidence of your unique technical skills.

The AWS Certification team also released an AWS Certified Advanced Networking – Specialty exam and new AWS Certification Benefits. You can read more about these new releases on the AWS Blog.

Have more questions about AWS Certification? See our AWS Certification FAQ.

Build a Serverless Architecture to Analyze Amazon CloudFront Access Logs Using AWS Lambda, Amazon Athena, and Amazon Kinesis Analytics

by Rajeev Srinivasan and Sai Sriparasa | on | | Comments

Nowadays, it’s common for a web server to be fronted by a global content delivery service, like Amazon CloudFront. This type of front end accelerates delivery of websites, APIs, media content, and other web assets to provide a better experience to users across the globe.

The insights gained by analysis of Amazon CloudFront access logs helps improve website availability through bot detection and mitigation, optimizing web content based on the devices and browser used to view your webpages, reducing perceived latency by caching of popular object closer to its viewer, and so on. This results in a significant improvement in the overall perceived experience for the user.

This blog post provides a way to build a serverless architecture to generate some of these insights. To do so, we analyze Amazon CloudFront access logs both at rest and in transit through the stream. This serverless architecture uses Amazon Athena to analyze large volumes of CloudFront access logs (on the scale of terabytes per day), and Amazon Kinesis Analytics for streaming analysis.

The analytic queries in this blog post focus on three common use cases:

  1. Detection of common bots using the user agent string
  2. Calculation of current bandwidth usage per Amazon CloudFront distribution per edge location
  3. Determination of the current top 50 viewers

However, you can easily extend the architecture described to power dashboards for monitoring, reporting, and trigger alarms based on deeper insights gained by processing and analyzing the logs. Some examples are dashboards for cache performance, usage and viewer patterns, and so on.

Following we show a diagram of this architecture.

(more…)

Amazon QuickSight Now Supports Federated Single Sign-On Using SAML 2.0

by Jose Kunnackal | on | | Comments

Since launch, Amazon QuickSight has enabled business users to quickly and easily analyze data from a wide variety of data sources with superfast visualization capabilities enabled by SPICE (Superfast, Parallel, In-memory Calculation Engine). When setting up Amazon QuickSight access for business users, administrators have a choice of authentication mechanisms. These include Amazon QuickSight–specific credentials, AWS credentials, or in the case of Amazon QuickSight Enterprise Edition, existing Microsoft Active Directory credentials. Although each of these mechanisms provides a reliable, secure authentication process, they all require end users to input their credentials every time users log in to Amazon QuickSight. In addition, the invitation model for user onboarding currently in place today requires administrators to add users to Amazon QuickSight accounts either via email invitations or via AD-group membership, which can contribute to delays in user provisioning.

Today, we are happy to announce two new features that will make user authentication and provisioning simpler – Federated Single-Sign-On (SSO) and just-in-time (JIT) user creation.

Federated Single Sign-On

Federated SSO authentication to web applications (including the AWS Management Console) and Software-as-a-Service products has become increasingly popular, because Federated SSO lets organizations consolidate end-user authentication to external applications.

Traditionally, SSO involves the use of a centralized identity store (such as Active Directory or LDAP) to authenticate the user against applications within a corporate network. The growing popularity of SaaS and web applications created the need to authenticate users outside corporate networks. Federated SSO makes this scenario possible. It provides a mechanism for external applications to direct authentication requests to the centralized identity store and receive an authentication token back with the response and validity. SAML is the most common protocol used as a basis for Federated SSO capabilities today.

With Federated SSO in place, business users sign in to their Identity Provider portals with existing credentials and access QuickSight with a single click, without having to enter any QuickSight-specific passwords or account names. This makes it simple for users to access Amazon QuickSight for data analysis needs.

(more…)

Build a Visualization and Monitoring Dashboard for IoT Data with Amazon Kinesis Analytics and Amazon QuickSight

by Karan Desai | on | | Comments

Customers across the world are increasingly building innovative Internet of Things (IoT) workloads on AWS. With AWS, they can handle the constant stream of data coming from millions of new, internet-connected devices. This data can be a valuable source of information if it can be processed, analyzed, and visualized quickly in a scalable, cost-efficient manner. Engineers and developers can monitor performance and troubleshoot issues while sales and marketing can track usage patterns and statistics to base business decisions.

In this post, I demonstrate a sample solution to build a quick and easy monitoring and visualization dashboard for your IoT data using AWS serverless and managed services. There’s no need for purchasing any additional software or hardware. If you are already using AWS IoT, you can build this dashboard to tap into your existing device data. If you are new to AWS IoT, you can be up and running in minutes using sample data. Later, you can customize it to your needs, as your business grows to millions of devices and messages.

Architecture

The following is a high-level architecture diagram showing the serverless setup to configure.

 

AWS service overview

AWS IoT is a managed cloud platform that lets connected devices interact easily and securely with cloud applications and other devices. AWS IoT can process and route billions of messages to AWS endpoints and to other devices reliably and securely.

Amazon Kinesis Firehose is the easiest way to capture, transform, and load streaming data continuously into AWS from thousands of data sources, such as IoT devices. It is a fully managed service that automatically scales to match the throughput of your data and requires no ongoing administration.

Amazon Kinesis Analytics allows you to process streaming data coming from IoT devices in real time with standard SQL, without having to learn new programming languages or processing frameworks, providing actionable insights promptly.

(more…)

Build a Healthcare Data Warehouse Using Amazon EMR, Amazon Redshift, AWS Lambda, and OMOP

by Ryan Hood | on | | Comments

In the healthcare field, data comes in all shapes and sizes. Despite efforts to standardize terminology, some concepts (e.g., blood glucose) are still often depicted in different ways. This post demonstrates how to convert an openly available dataset called MIMIC-III, which consists of de-identified medical data for about 40,000 patients, into an open source data model known as the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM). It describes the architecture and steps for analyzing data across various disconnected sources of health datasets so you can start applying Big Data methods to health research.

Before designing and deploying a healthcare application on AWS, make sure that you read through the AWS HIPAA Compliance whitepaper. This covers the information necessary for processing and storing patient health information (PHI).

Note: If you arrived at this page looking for more info on the movie Mimic 3: Sentinel, you might not enjoy this post.

OMOP overview

The OMOP CDM helps standardize healthcare data and makes it easier to analyze outcomes at a large scale. The CDM is gaining a lot of traction in the health research community, which is deeply involved in developing and adopting a common data model. Community resources are available for converting datasets, and there are software tools to help unlock your data after it’s in the OMOP format. The great advantage of converting data sources into a standard data model like OMOP is that it allows for streamlined, comprehensive analytics and helps remove the variability associated with analyzing health records from different sources.

OMOP ETL with Apache Spark

Observational Health Data Sciences and Informatics (OHDSI) provides the OMOP CDM in a variety of formats, including Apache Impala, Oracle, PostgreSQL, and SQL Server. (See the OHDSI Common Data Model repo in GitHub.) In this scenario, the data is moved to AWS to take advantage of the unbounded scale of Amazon EMR and serverless technologies, and the variety of AWS services that can help make sense of the data in a cost-effective way—including Amazon Machine Learning, Amazon QuickSight, and Amazon Redshift.

This example demonstrates an architecture that can be used to run SQL-based extract, transform, load (ETL) jobs to map any data source to the OMOP CDM. It uses MIMIC ETL code provided by Md. Shamsuzzoha Bayzid. The code was modified to run in Amazon Redshift.

Getting access to the MIMIC-III data

Before you can retrieve the MIMIC-III data, you must request access on the PhysioNet website, which is hosted on Amazon S3 as part of the Amazon Web Services (AWS) Public Dataset Program. However, you don’t need access to the MIMIC-III data to follow along with this post.

Solution architecture and loading process

The following diagram shows the architecture that is used to convert the MIMIC-III dataset to the OMOP CDM.

(more…)

Test Your Streaming Data Solution with the New Amazon Kinesis Data Generator

by Allan MacInnis | on | | Comments

When building a streaming data solution, most customers want to test it with data that is similar to their production data. Creating this data and streaming it to your solution can often be the most tedious task in testing the solution.

Amazon Kinesis Streams and Amazon Kinesis Firehose enable you to continuously capture and store terabytes of data per hour from hundreds of thousands of sources. Amazon Kinesis Analytics gives you the ability to use standard SQL to analyze and aggregate this data in real-time. It’s easy to create an Amazon Kinesis stream or Firehose delivery stream with just a few clicks in the AWS Management Console (or a few commands using the AWS CLI or Amazon Kinesis API). However, to generate a continuous stream of test data, you must write a custom process or script that runs continuously, using the AWS SDK or CLI to send test records to Amazon Kinesis. Although this task is necessary to adequately test your solution, it means more complexity and longer development and testing times.

Wouldn’t it be great if there were a user-friendly tool to generate test data and send it to Amazon Kinesis? Well, now there is—the Amazon Kinesis Data Generator (KDG).

KDG overview

The KDG simplifies the task of generating data and sending it to Amazon Kinesis. The tool provides a user-friendly UI that runs directly in your browser. With the KDG, you can do the following:

  • Create templates that represent records for your specific use cases
  • Populate the templates with fixed data or random data
  • Save the templates for future use
  • Continuously send thousands of records per second to your Amazon Kinesis stream or Firehose delivery stream

The KDG is open source, and you can find the source code on the Amazon Kinesis Data Generator repo in GitHub. Because the tool is a collection of static HTML and JavaScript files that run directly in your browser, you can start using it immediately without downloading or cloning the project. It is enabled as a static site in GitHub, and we created a short URL to access it.

To get started immediately, check it out at http://amzn.to/datagen.

Using the KDG

Getting started with the KDG requires only three short steps:

  1. Create an Amazon Cognito user in your AWS account (first-time only).
  2. Use this user’s credentials to log in to the KDG.
  3. Create a record template for your data.

When you’ve completed these steps, you can then send data to Streams or Firehose.

Create an Amazon Cognito user

The KDG is a great example of a mobile application that uses Amazon Cognito for a user repository and user authentication, and the AWS JavaScript SDK to communicate with AWS services directly from your browser. For information about how to build your own JavaScript application that uses Amazon Cognito, see Use Amazon Cognito in your website for simple AWS authentication on the AWS Mobile Blog.

Before you can start sending data to your Amazon Kinesis stream, you must create an Amazon Cognito user in your account who can write to Streams and Firehose. When you create the user, you create a username and password for that user. You use those credentials to sign in to the KDG. To simplify creating the Amazon Cognito user in your account, we created a Lambda function and a CloudFormation template. For more information about creating the Amazon Cognito user in your AWS account, see Configure Your AWS Account.

Note:  It’s important that you use the URL provided by the output of the CloudFormation stack the first time that you access the KDG. This URL contains parameters needed by the KDG. The KDG stores the values of these parameters locally, so you can then access the tool using the short URL, http://amzn.to/datagen.

Log in to the KDG

After you create an Amazon Cognito user in your account, the next step is to log in to the KDG. To do this, provide the username and password that you created earlier.

(more…)

AWS Big Data Blog Month in Review: April 2017

by Derek Young | on | | Comments

Another month of big data solutions on the Big Data Blog. Please take a look at our summaries below and learn, comment, and share. Thank you for reading!

NEW POSTS

Amazon QuickSight Spring Announcement: KPI Charts, Export to CSV, AD Connector, and More! 
In this blog post, we share a number of new features and enhancements in Amazon Quicksight. You can now create key performance indicator (KPI) charts, define custom ranges when importing Microsoft Excel spreadsheets, export data to comma separated value (CSV) format, and create aggregate filters for SPICE data sets. In the Enterprise Edition, we added an additional option to connect to your on-premises Active Directory using AD Connector. 

Securely Analyze Data from Another AWS Account with EMRFS
Sometimes, data to be analyzed is spread across buckets owned by different accounts. In order to ensure data security, appropriate credentials management needs to be in place. This is especially true for large enterprises storing data in different Amazon S3 buckets for different departments. This post shows how you can use a custom credentials provider to access S3 objects that cannot be accessed by the default credentials provider of EMRFS.

Querying OpenStreetMap with Amazon Athena
This post explains how anyone can use Amazon Athena to quickly query publicly available OSM data stored in Amazon S3 (updated weekly) as an AWS Public Dataset. Imagine that you work for an NGO interested in improving knowledge of and access to health centers in Africa. You might want to know what’s already been mapped, to facilitate the production of maps of surrounding villages, and to determine where infrastructure investments are likely to be most effective.

Build a Real-time Stream Processing Pipeline with Apache Flink on AWS
This post outlines a reference architecture for a consistent, scalable, and reliable stream processing pipeline that is based on Apache Flink using Amazon EMR, Amazon Kinesis, and Amazon Elasticsearch Service. An AWSLabs GitHub repository provides the artifacts that are required to explore the reference architecture in action. Resources include a producer application that ingests sample data into an Amazon Kinesis stream and a Flink program that analyses the data in real time and sends the result to Amazon ES for visualization.

(more…)

Tips for Migrating to Apache HBase on Amazon S3 from HDFS

by Bruno Faria | on | | Comments

Starting with Amazon EMR 5.2.0, you have the option to run Apache HBase on Amazon S3. Running HBase on S3 gives you several added benefits, including lower costs, data durability, and easier scalability.

HBase provides several options that you can use to migrate and back up HBase tables. The steps to migrate to HBase on S3 are similar to the steps for HBase on the Apache Hadoop Distributed File System (HDFS). However, the migration can be easier if you are aware of some minor differences and a few “gotchas.”

In this post, I describe how to use some of the common HBase migration options to get started with HBase on S3.

HBase migration options

Selecting the right migration method and tools is an important step in ensuring a successful HBase table migration. However, choosing the right ones is not always an easy task.

The following HBase helps you migrate to HBase on S3:

  • Snapshots
  • Export and Import
  • CopyTable

The following diagram summarizes the steps for each option.

(more…)