This section demonstrates how to use Essentia to load data from various sources into Redshift quickly and efficiently.
AWS provides a scalable, SQL-based data warehouse service called Amazon Redshift. It is commonly used for data warehousing and can scale to store petabytes of data. But going from raw data to a properly formatted table suitable for Redshift (or any other database, for that matter) is often problematic.
We added a module to Essentia to address these issues.
In order to link Essentia and Redshift, you need an IAM role whose policy grants access to both EC2 and Redshift.
Here is an example policy that allows access to EC2 instances as well as Redshift:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:*"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "redshift:*"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
Note
To learn how to create an IAM role, follow the instructions in IAM Roles.
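As an illustrative sketch (not taken from the Essentia documentation), a role carrying the policy above could be set up with the AWS CLI. The role name, policy name, and file names here are hypothetical, and trust-policy.json is assumed to contain a standard trust relationship allowing EC2 to assume the role:
# Create the role (trust-policy.json is assumed to allow ec2.amazonaws.com to assume it)
$ aws iam create-role --role-name EssentiaRedshiftRole --assume-role-policy-document file://trust-policy.json
# Attach the EC2/Redshift policy shown above, saved locally as policy.json
$ aws iam put-role-policy --role-name EssentiaRedshiftRole --policy-name EssentiaRedshiftAccess --policy-document file://policy.json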
Transferring data is straightforward. First, register the Redshift cluster with Essentia; then, optionally, generate a table using Essentia; and finally, run the ETL operation, whose format closely resembles that of the ‘stream’ command described in the
ETL tutorial
, to load the data into the table:
$ ess redshift register redshift_cluster_name redshift_database_name username password
$ ess redshift gentable table_name category_name --key "column_name = distkey"
$ ess redshift stream category_name start_date end_date "command" table_name --options TRUNCATECOLUMNS
Here, ‘command’ is typically aq_pp
, but it can be any program that reads text data from stdin
and writes its results to stdout.
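For illustration only, here is what those three steps might look like end to end. The cluster, database, credentials, category, table, and column names are all hypothetical, and the aq_pp invocation is a minimal sketch (skip one header row, define three string columns, write to stdout without a title row):
# Register the cluster (all names and credentials below are hypothetical)
$ ess redshift register mycluster mydb dbadmin MySecretPw
# Generate a 'weblogs' table from the 'weblog' category, distributing on the ip column
$ ess redshift gentable weblogs weblog --key "ip = distkey"
# Stream one month of the 'weblog' category through aq_pp into the table
$ ess redshift stream weblog 2016-01-01 2016-01-31 "aq_pp -f,+1 - -d s:ip s:datetime s:url -o,notitle -" weblogs --options TRUNCATECOLUMNS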
Once you have data loaded into Redshift, you can query it through Essentia using SQL statements. Simply run:
$ ess redshift sql 'SQL_COMMAND'
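For example, to count the rows loaded into the hypothetical ‘weblogs’ table from the sketch above:
$ ess redshift sql 'SELECT COUNT(*) FROM weblogs'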
Users are encouraged to work through the examples that follow in order.