Getting Started

This guide walks you through the basic setup and use of the Essentia Data Lake Manager.

To learn more about how to create a category, see Category Rules.

Repository setup and management

Link to AWS S3

  1. Click on Connect in the top menu and then the AWS S3 tab.
  2. Click on the +Add icon to open the input form.
  3. Enter your AWS S3 credentials (bucket name, access key, secret access key) and a label if you prefer to call the bucket by another name.

Note: If you are running the AWS Marketplace version of Essentia 3.1.2, you do not need to enter your AWS credentials. Instead, set up an IAM role as described in IAM Roles.

  4. Click on the Add button to add your S3 repository.
  5. Your newly added repository will be displayed in the AWS S3 table.
../../_images/connect_aws_add.png

Link to Azure Blob

  1. Click on Connect in the top menu and then the Azure Blob tab.
  2. Click on the +Add icon to open the input form.
  3. Enter your Azure Blob credentials (container name, username, password) and a label if you prefer to call the container by another name.
  4. Click on the Add button to add your Blob repository.
  5. Your newly added repository will be displayed in the Azure Blob table.
../../_images/connect_azure_add.png

Delete Repository

  1. Click on Connect in the top menu.
  2. Choose the appropriate tab (AWS S3 or Azure Blob).
  3. Click the icon on the right of the table for the repository you want to remove.
  4. Select the delete (trash) icon.
  5. Confirm the deletion.
../../_images/connect_delete.png

Datastore category setup and management

Create category

  1. Click on Categorize in the top menu and select a Repository from the drop-down menu.
  2. Click on the +Add icon to open the input form.
  3. Define your Category by entering:
  • Category name - any name you choose (no spaces).
  • Pattern - glob pattern(s) describing which files to include in your category.
  4. Optionally define any number of the following options to speed up data scanning or make data management easier:
  • Comment - any free-form comment.
  • Delimiter - the delimiter (comma, space, tab, etc.) used in your data.
  • Exclude - glob pattern describing which files to leave out of your category. Note: this further restricts the files included by your Pattern.
  • Date Format - a regular expression that extracts the date from your file path or filename; see Date Regex.
../../_images/categorize_options.png
  5. Click on the Save button to create your category. This may take a few minutes while Essentia scans your data.
  6. After the scan is complete, the derived column specifications are displayed along with metadata about your files. You can now Define Additional Category Options or Directly Edit Column Specification; each is described in the section of that name below.
  7. Your newly added category will be displayed in the category table for the selected repository. From here you can edit, copy, scan, or delete a category, view a sample of the data, or see the list of files that make up your category.
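
To make the Pattern, Exclude, and Date Format options more concrete, here is a minimal Python sketch of how the three could interact. It is only an illustration: the file paths are hypothetical, Python's fnmatch stands in for Essentia's own pattern matching (which may behave differently), and the regular expression is an assumed example rather than Essentia's documented Date Format syntax (see Date Regex for that).

```python
import fnmatch
import re
from datetime import datetime

# Hypothetical file paths, as they might appear under Explore.
paths = [
    "logs/purchase_20140901.csv.gz",
    "logs/purchase_20140902.csv.gz",
    "logs/purchase_20140902_test.csv.gz",
    "logs/browse_20140901.csv.gz",
]

pattern = "logs/purchase_*.csv.gz"       # Pattern: files to include
exclude = "*_test*"                      # Exclude: files to drop from that set
date_regex = r"_(\d{4})(\d{2})(\d{2})"   # assumed regex, not Essentia's syntax


def select_files(paths, pattern, exclude=None):
    """Return the paths that match `pattern` and do not match `exclude`."""
    included = [p for p in paths if fnmatch.fnmatchcase(p, pattern)]
    if exclude:
        included = [p for p in included if not fnmatch.fnmatchcase(p, exclude)]
    return included


def extract_date(path, date_regex):
    """Return the date captured from `path`, or None if the regex does not match."""
    match = re.search(date_regex, path)
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    return datetime(year, month, day).date()


selected = select_files(paths, pattern, exclude)
for path in selected:
    print(path, extract_date(path, date_regex))
```

Trying a pattern against a handful of paths copied from Explore in this way is a quick check that Exclude only narrows what Pattern already selected.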

Define Additional Category Options

  1. Follow steps 1-5 of creating a category.
  2. Click on the Preprocess drop-down to Check or save a command to preprocess your data:
  • Preprocess - command to modify your raw data before it is scanned by Essentia.
  3. Or click on the Options drop-down arrow to display category options and define either of the following options:
  • Archive - matching pattern to describe filenames within a compressed file.
  • Use cached file list - reference the local file list for the current category instead of accessing the repository.

Directly Edit Column Specification

  1. Follow steps 1-5 of creating a category.
  2. Click on the Direct Edit checkbox to allow the current column spec to be edited.
  3. From here, you can change column headers (no spaces) and reassign data types if the scan inferred them incorrectly.
  4. Click on the Save button to save your changes.

Exploring Your Data Repository

  1. Click Explore.
  2. Click the + next to a directory to navigate through the directories on your Repository.
  3. Your current path is displayed at the top, under your repository name. This is useful when defining a pattern for the files you want to group into a category.
  4. You can click the icon next to any filename to Download or Delete that file from your Repository.
../../_images/categorize_explore_dwnld.png

You can click Upload to choose files to upload to the current path on your Repository.

You can click Size to calculate the total number of files and bytes in the current path on your Repository.

You can click Refresh to get the latest list of files on your Repository.

Note: If the Explorer tab does not open when you click Explore, you may need to enable pop-ups from the Essentia UI.

Query setup and management

Create a Query

  1. Click on Query in the top menu and select a Repository from the drop-down menu.
  2. Enter your SQL-like query in the Input your query here area. You can optionally enter a label for this query so you can reference it later.
  3. Click on the Run button to view your query results on your screen. You can then optionally save the results to a file on your computer by clicking Download and entering a filename.
  4. If you do not need the results of your query anymore, you can click Clear to delete those results.
  5. From this point you can access a saved query or run a new query. Running another query will clear the previous query’s results.
../../_images/query_run.png

Note: If you need to view available categories, click on the Categories drop down arrow to view a list of available categories.

../../_images/query_categories.png

Query Format

select [column_name] | [*] from [category_name]:[start_date | *]:[end_date | *] where ... order by ... limit ...

select count(distinct [column_name] | [*]) from [category_name]:[start_date | *]:[end_date | *]  where ...

select [column_name], count(*) from [category_name]:[start_date | *]:[end_date | *]  where ... group by [column_name]

Rules

The first query format above is a "select" query.
The second and third query formats above are "count" queries.

1. Group By is NOT supported for SELECT queries.
2. Order By is NOT supported for COUNT queries.
3. Limit is NOT supported for COUNT queries.
4. In COUNT queries, Group By can only be used when DISTINCT is not present.
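
The four rules above can be sketched as a small pre-flight check. This is a hypothetical helper, not part of Essentia: it uses naive keyword matching, so a string literal containing a word such as "limit" inside a where clause would produce a false positive, and the userID column is made up for the example.

```python
import re


def check_query_rules(query):
    """Return a list of rule violations found in an Essentia-style query string."""
    q = query.lower()
    is_count = re.search(r"\bcount\s*\(", q) is not None
    has_distinct = re.search(r"\bcount\s*\(\s*distinct\b", q) is not None
    has_group_by = re.search(r"\bgroup\s+by\b", q) is not None
    has_order_by = re.search(r"\border\s+by\b", q) is not None
    has_limit = re.search(r"\blimit\b", q) is not None

    problems = []
    if not is_count and has_group_by:
        problems.append("Group By is not supported for SELECT queries")
    if is_count and has_order_by:
        problems.append("Order By is not supported for COUNT queries")
    if is_count and has_limit:
        problems.append("Limit is not supported for COUNT queries")
    if has_distinct and has_group_by:
        problems.append("Group By cannot be combined with DISTINCT in COUNT queries")
    return problems


# A COUNT query that combines DISTINCT with Group By breaks rule 4.
print(check_query_rules(
    "select count(distinct userID) from purchase:*:* group by articleID"
))
```

An empty list means none of the four rules were tripped; it does not guarantee that Essentia will accept the query.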

Example

select * from myfavoritedata:*:* where payment >= 50
select * from purchase:2014-09-01:2014-09-15 where articleID>=46 limit 10
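
If you generate queries programmatically before pasting them into the query area, a small helper along these lines can keep the category:start_date:end_date part consistent. This is a minimal sketch; build_select_query is a hypothetical name, it does no escaping or validation, and it covers only the "select" format.

```python
def build_select_query(category, column="*", start="*", end="*",
                       where=None, order_by=None, limit=None):
    """Assemble a "select" query string in the format shown above."""
    parts = [f"select {column} from {category}:{start}:{end}"]
    if where:
        parts.append(f"where {where}")
    if order_by:
        parts.append(f"order by {order_by}")
    if limit is not None:
        parts.append(f"limit {int(limit)}")
    return " ".join(parts)


# Rebuild the two example queries from this section.
print(build_select_query("myfavoritedata", where="payment >= 50"))
print(build_select_query("purchase", start="2014-09-01", end="2014-09-15",
                         where="articleID>=46", limit=10))
```

Leaving start and end at their default of * selects every date in the category, as in the first example.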

For more examples of the queries we support, including sample queries you can run against our public data, see Query Examples.

Working with Saved Queries

  1. Select your Saved Query from the dropdown. The query should appear in the “Input your query here” area. If you labeled your query, the label should appear next to the saved query dropdown.
  2. Now you can click the Run button to view your query results on your screen and then optionally download your query results into a file on your computer by clicking Download and entering a filename.

You can search your saved queries by entering any part of the query text into the Search box.

Script setup and management

Run a Script

  1. Click on Analyze in the top menu.
  2. Select a Github Repository from the drop down menu or use the Default (DirectScipt - auriq).
  3. Enter your Essentia or unix shell commands in the Input your script here area. You can optionally select one of the files from your Github Repository to edit or run. To do this, click the file icon to the left of the filename.
  4. Click on the Run button to view your script’s results on your screen.
../../_images/analyze_script.png

Note: You can also Stop a running script or, when it has finished, Download the result onto your local machine or Clear the results so they are no longer stored. You must terminate any worker cluster before running Clear; otherwise you will have to terminate those nodes manually (without Essentia).

../../_images/analyze_script_run.png

Note: You can also view the status of your master computer and any other machines you are utilizing by clicking on Cluster Status. This will show you the connection information and resource usage of each connected machine.

Connect to a Github Repository

  1. Click on Analyze in the top menu.
  2. Click the Add button.
  3. Enter the Owner of your Github Repository, the name of your Repository, and your Personal Access Token. If you do not have a Personal Access Token, follow the instructions found here.
  4. Click on the Save button to finish adding your Github Repository.
  5. From this point you can view, edit, and run any of the scripts stored in the Github Repository.

Note: To view or switch between available Github Repositories or Branches, click on the Github Repository or Branch drop down menus.

Questions

Our tutorials are intended to guide you through the usage of the included tools, but you should feel free to contact us at essentia@auriq.com with any other questions.