How To Scale Up
In this tutorial, we concentrate on using Essentia within the AWS cloud service. AWS allows Essentia to handle the creation of and connection to worker nodes automatically, making analyses fully scalable and adaptable. This tutorial requires the AWS cloud-based version of Essentia.
To connect to existing computers as your worker nodes and manually construct an Essentia cluster, go through Setting up a Customized Cluster.
Note: The Local and Docker installations of Essentia are currently limited to one computer and do not currently make use of Essentia’s scalability.
Running on AWS
The Amazon cloud is a pay as you need infrastucture, which offers general and specific computing resources, as well as reliable data storage. Essentia primarily uses three key AWS components:
- Elastic Cloud Computing (EC2). These are ‘virtual machines’ running on AWS servers that users can provision for computing. Once launched, users can log into the machines via the command line (typically ssh) or via the Essentia Data Lake Manager GUI and begin their tasks. The operating system Essentia uses is a form of Linux that AWS maintains.
- Simple, Secure Storage (S3). Essentially this is a place to store your files. The ‘simple’ part of S3 is that for most users, you can think of this as basically a hard drive that is on the cloud. Data is secure because behind the scenes, any files uploaded are copied to multiple drives on site in order to prevent data loss due to any failure. It is this redundancy which also allows for scalability in reading data. For instance, on your desktop or laptop, if two files are trying to be read from disk at the same time, the drive has to go back and forth to where the data is stored. This slows down the read. But with multiple disks, this competition can be avoided.
- Redshift. This is one of a few types of databases that AWS offers. It is common in many applications, including data warehousing. Data stored in Redshift can be efficiently queried by standard SQL commands. Essentia can integrate with Redshift to clean massive amounts of raw, dirty data and insert it directly into SQL tables.
Getting started with AWS is typically free, and interested users can get more information by reading our AWS Account Creation document.
Scanning and categorizing your data does not require anything other than a single master node. But the rest of Essentia benefits greatly when worker nodes are added. This tutorial will walk you through how to launch worker nodes in the AWS Cloud to scale up your processing. If you are using the local install or simply wish to use just your master node for the the remainder of the tutorials, that is fine. Worker nodes are not required for any of the training tutorials here.
Launching an Essentia Cluster
If you setup IAM Roles in Authentication, then all you need to do in order to spin up workers and build your Essentia cluster is run the command:
ess cluster create [--number=NUMBER] [--type=TYPE]
If you have NOT setup IAM Roles yet, then you need to run the command:
ess cluster create [--number=NUMBER] [--type=TYPE] --credentials=~/your_credential_file.csv
Alternatively, the credentials flag can be replaced with aws_access_key and aws_secret_access_key to directly enter credentials:
ess cluster create [--number=NUMBER] [--type=TYPE] --aws_access_key=YOUR_ACCESS_KEY --aws_secret_access_key=YOUR_SECRET_ACCESS_KEY
However, we recommend the use of credential files if possible. To create a credential file, simply save your access and secret access keys in the following format to a csv file with a name of your choice:
User Name,Access Key Id,Secret Access Key your_user_name,your_access_key,your_secret_access_key
Parallelizing your Operations
Once the ess cluster create … command has been run, Essentia will launch NUMBER of the EC2 virtual machines of type TYPE. The data and operations will be split up and parallelized across all of these ‘worker’ virtual machines. In particular:
ess stream category start end command will take the files in ‘category’ between the dates ‘start’ and ‘end’, and split these files up across all of the worker virtual machines. The virtual machines will run ‘command’ on each file they receive.
ess exec command will execute ‘command’ on each of the worker virtual machines.
You can use ess stream with our Data Processing command aq_pp to import the data into our In-memory Database to distribute the data across the memory of all of these worker virtual machines. It is then easy to analyze or output the data using ess exec with our Data Processing command aq_udb.
Stopping or Terminating an Essentia Cluster
EC2 charges for each hour each virtual machine is used. Thus it is good practice to stop or terminate your virtural machines.
When you are done using your worker virtual machines, you can stop them by running:
ess cluster stop
If you need to use those machines again you can start them by running:
ess cluster start
However, if you no longer need your worker virtual machines and will not need to access those exact machines in the future, you should terminate them. To terminate an Essentia cluster run:
ess cluster terminate
Note: Once terminated, you will no longer have ANY access to the worker virtual machines. You will have to launch a new cluster to parallelize your operations.
Managing Your Essentia Clusters
Sometimes it’s advantageous to reuse clusters or resize your existing clusters to suit your current analysis. You can learn how to do this by going through the tutorial: Managing Essentia Clusters.