The category name can be any name you want to use to group your files. However, it cannot contain spaces, and each category name must be unique.
This is a UNIX-style glob matching pattern that describes which files to include in your category. You can combine literal characters with wildcards to tell Essentia that any filename matching this pattern belongs in this category.
Example Patterns:
diy_woodworking/*purchase*
*store*sales*
*.dat
*.csv.gz
*pagecounts*2009*.gz
You can include multiple patterns separated by a space to group files that match multiple patterns into the same category.
Example of Multiple Patterns:
'diy_woodworking/*purchase* diy_woodworking/*browse*'
# This will include ALL of the tutorial browse and purchase files.
'filetypes/b.zip filetypes/"b test.zip"'
# You can include a literal space in your pattern by quoting that pattern.
Note: For a more detailed description of glob matching patterns, see Glob (programming)
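As a sketch of how these patterns select files, Python's fnmatch can stand in for Essentia's matcher (assuming the same wildcard semantics for '*' and literal characters; the filenames below are invented for illustration):

```python
from fnmatch import fnmatch

files = [
    "diy_woodworking/2014_purchase_log.csv",
    "diy_woodworking/2014_browse_log.csv",
    "stats/pagecounts-2009-10.gz",
]

# A category with multiple space-separated patterns keeps a file if it
# matches ANY one of the patterns.
patterns = ["diy_woodworking/*purchase*", "diy_woodworking/*browse*"]
matched = [f for f in files if any(fnmatch(f, p) for p in patterns)]
print(matched)
```

Note that fnmatch's '*' also crosses '/' separators; for patterns like the examples above that distinction does not matter.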
Note: Exclude further restricts the files included by your Pattern.
This is a UNIX-style glob matching pattern that describes which files to exclude from your category. You can combine literal characters with wildcards to tell Essentia that any filename matching this pattern does NOT belong in this category. This is a sub-pattern that further restricts the files included by your Pattern.
Note: For a more detailed description of glob matching patterns, see Glob (programming)
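A minimal sketch of Pattern plus Exclude, again with Python's fnmatch standing in for Essentia's matcher (the filenames and patterns here are invented for illustration):

```python
from fnmatch import fnmatch

files = ["logs/app-2014.dat", "logs/app-2014.dat.bak", "logs/app-2015.dat"]

include = "*.dat*"   # Pattern: candidate files
exclude = "*.bak"    # Exclude: sub-pattern removed from the candidates

selected = [f for f in files if fnmatch(f, include) and not fnmatch(f, exclude)]
print(selected)
```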
This field can contain any character and is just used to help you make notes about your categories or the files they contain.
The Date Regex option (--dateregex) matches a regular expression pattern against the file paths and names in a category in order to extract the corresponding date. This enables Essentia's highly versatile date and number extraction: regular expression patterns can pull dates out of even very complex or unusual file paths and names.
The key tokens you can provide to represent each date segment are:
You can also provide the following option to extract just a number from the file path/name as a unique identifier:
Note
If you don’t specify Date Regex or Date Format (see below), Essentia uses auto to extract the date from your filename. If no date can be extracted, Essentia falls back to the none behavior and defaults the date to 1970-01-01.
Example Date Format Patterns:
auto
none
.*file_2014-06-09_out.zip:.*
.*[:%Y:]-[:%m:]-[:%d:].*
.*account12345678_20140609.csv:.*
.*[:%Y:][:%m:][:%d:].csv
Note: Here '[:%Y:][:%m:][:%d:]' alone won't work since there is another 8-digit number in the filename, so we add the '.csv'.
.*account12345678_20140609.csv:.*
_[:%Y:][:%m:][:%d:]
Note: This is another way to extract the date from this filename. Here we added '_' before the date
to help identify the date in the filename.
.*account12345678_20140609.csv:.*
[:%f:{8,}]_
Note: This extracts the number from this filename. Here we added '_' after the number
to help identify the number in the filename.
.*account12345678_20140609.csv:.*
account[:%f:{8,}]
Note: We could also have specified 'account' before the number to identify the number in the filename.
.*account12345678_20140609.csv:.*
account[:%f:{4,4}]
Note: By specifying the min and max digits to 4, we use only the first four digits ('1234') as the identifying number in the filename.
Note: For a more detailed description of regex matching patterns, see Regular Expression
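As an illustration, the [:%Y:], [:%m:], and [:%d:] tokens behave like capturing groups in an ordinary regex. A rough Python equivalent of the '_[:%Y:][:%m:][:%d:]' example above (assuming the tokens match a 4-digit year and 2-digit month/day; this is a sketch, not Essentia's implementation):

```python
import re
from datetime import date

path = "data/account12345678_20140609.csv"

# '_[:%Y:][:%m:][:%d:]' with the leading '_' anchoring past the account number:
m = re.search(r"_(\d{4})(\d{2})(\d{2})\.csv$", path)
extracted = date(int(m.group(1)), int(m.group(2)), int(m.group(3)))
print(extracted)  # 2014-06-09
```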
Warning: Deprecated after Essentia version 3.1.0.7 and will be removed in a future release. Use Date Regex instead.
This is a UNIX-style glob matching pattern that describes how a date or number appears in each filename. You can combine literal characters with wildcards to tell Essentia how to extract the date or number from each filename, so it can determine the date range or number range your category covers. This allows you to later take a subset of those files by specifying a custom date range or number range to pull data from. The key tokens you can provide to represent each date/number segment are:
Example Date Format Patterns:
auto
none
*file_2014-06-09_out.zip:*
*Y-M-D*
*account12345678_20140609.csv:*
*YMD.csv
Note: Here 'YMD' alone won't work since there is another 8-digit number in the filename, so we add the '.csv'. In fact, just '.' would have been sufficient.
*account12345678_20140609.csv:*
_YMD
Note: This is another way to extract the date from this filename. Here we added '_' before the date
to help identify the date in the filename.
*account12345678_20140609.csv:*
#_
Note: This extracts the number from this filename. Here we added '_' after the number
to help identify the number in the filename.
*account12345678_20140609.csv:*
account#
Note: We could also have specified 'account' before the number to identify the number in the filename.
Note: For a more detailed description of glob matching patterns, see Glob (programming)
Note: It is recommended to use regular expression patterns to extract the date from your filename, see Date Regex.
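For illustration only (this option is deprecated), the glob-style tokens can be emulated by mapping them to digit groups. The token widths assumed below (Y = 4 digits, M and D = 2 digits) are an assumption for this sketch, not Essentia's specification:

```python
import re

def date_format_to_regex(fmt: str) -> str:
    # translate token by token; everything else is treated literally
    out = []
    for ch in fmt:
        if ch == "Y":
            out.append(r"(?P<Y>\d{4})")
        elif ch == "M":
            out.append(r"(?P<M>\d{2})")
        elif ch == "D":
            out.append(r"(?P<D>\d{2})")
        elif ch == "*":
            out.append(r".*")
        else:
            out.append(re.escape(ch))
    return "".join(out)

# '_YMD' against the example filename: the '_' anchors past the account number
m = re.search(date_format_to_regex("_YMD"), "account12345678_20140609.csv")
print(m.group("Y"), m.group("M"), m.group("D"))  # 2014 06 09
```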
The type of delimiter that your data uses. You can choose any single delimiter for your files.
Selecting noprobe will cause Essentia not to attempt to determine your files’ format. This allows you to select broad categories of files that may not be related or in a single format, or files that are in complicated formats either with many delimiters or no delimiter whatsoever. noprobe is a very useful option for simply exploring your datastore and discovering what files it contains.
This is a UNIX-style glob matching pattern that describes which files inside your archive files belong in this category. You can combine literal characters with wildcards to tell Essentia which filenames within a compressed or uncompressed archive belong in this category. This allows you to extract certain files from an archive while ignoring others.
Example Archive Pattern:
Note: The archive 'My_filename.zip' contains 'file_1_Site_12345' and 'file_1_Placement_12345'.
Match each of these files with the following Archive patterns:
*Site*
*Placement*
Note: For a more detailed description of glob matching patterns, see Glob (programming)
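A sketch of the same selection in Python, with zipfile and fnmatch standing in for Essentia's archive handling (the archive is built in memory to mirror the example above):

```python
import io
import zipfile
from fnmatch import fnmatch

# Build a small in-memory zip like 'My_filename.zip' from the example.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("file_1_Site_12345", "site data")
    z.writestr("file_1_Placement_12345", "placement data")

# Keep only the members matching the Archive pattern '*Site*'.
with zipfile.ZipFile(buf) as z:
    site_members = [n for n in z.namelist() if fnmatch(n, "*Site*")]
print(site_members)
```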
This option (--usecache) references a cached list of the files whenever the category is used.
This bypasses Essentia’s update process (where it looks at changes to the files on the repository that match the category’s pattern(s)) and is meant to save time on large repositories.
Therefore, this option can cause your category to be out of date if files in this category change or new files matching the file pattern are uploaded to your repository.
This is a very useful feature for large repositories that have categories containing
files spread across different directories or many undesired files in the same directory as the categorized files.
Note
Essentia’s cached file list will still be updated anytime a category without the --usecache option is used. This can cause your categories utilizing --usecache to have a different number of matching files than when you created the category, depending on what changes have been made to the repository.
To change this option for a single category, run ess category change name usecache [--usecache|--nocache].
This option allows you to apply a command to the data in your category before Essentia tries to automatically detect its structure. This can be very helpful when your data contains many different delimiters or data that isn’t simply delimited. You can view a sample of your raw data as well as enter a preprocessing command and check what the data will look like after that command is applied. Examples of where this is useful:
Data With Multiple Delimiters:
Data:
54.248.98.72 - - [05/Oct/2014:03:24:27 -0700] "GET / HTTP/1.0" 301 - "-" "Mozilla/5.0 (compatible; monitis - premium monitoring service; http://www.monitis.com)"
Preprocess Command:
aq_pp -f,eok,qui,div - -d ip:ip sep:" " s:rlog sep:" " s:rusr sep:" [" s:time_s sep:"] \"" s,clf:req_line1 sep:" " s,clf:req_line2 sep:" " s,clf:req_line3 sep:"\" " i:res_status sep:" " i:res_size sep:" \"" s,clf:referrer sep:"\" \"" s,clf:user_agent sep:"\"" -eval i:time "DateToTime(time_s, \"d.b.Y.H.M.S.z\")" -c ip rlog rusr time req_line1 req_line2 req_line3 res_status res_size referrer user_agent
Sample Output:
54.248.98.72,"-","-",1412504667,"GET","/","HTTP/1.0",301,0,"-","Mozilla/5.0 (compatible; monitis - premium monitoring service; http://www.monitis.com)"
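For comparison, a rough Python sketch of the same multi-delimiter parse (this is not Essentia's implementation; one regex encodes the several separators that the aq_pp spec lists one by one):

```python
import re

line = ('54.248.98.72 - - [05/Oct/2014:03:24:27 -0700] "GET / HTTP/1.0" '
        '301 - "-" "Mozilla/5.0 (compatible; monitis - premium monitoring '
        'service; http://www.monitis.com)"')

# Separators mirror the sep:" ", sep:" [", sep:"] \"", ... specs above.
log_re = re.compile(
    r'(?P<ip>\S+) (?P<rlog>\S+) (?P<rusr>\S+) \[(?P<time_s>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d+) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)
m = log_re.match(line)
print(m.group("ip"), m.group("status"), m.group("time_s"))
```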
Fixed Width Data:
Data:
STN--- WBAN YEARMODA TEMP DEWP SLP STP VISIB WDSP MXSPD GUST MAX MIN PRCP SNDP FRSHTT
030050 99999 19320101 43.2 6 40.3 4 993.9 6 9999.9 0 4.3 6 10.2 6 18.1 999.9 45.0* 35.1 0.20F 999.9 010000
Preprocess Command:
aq_pp -f,+1,eok,fix - -d s,n=7,trm:stn s,n=7,trm:wban s,n=12,trm:yearmoda s,n=6,trm:temp s,n=5,trm:unlabeled1 s,n=6,trm:dewp s,n=4,trm:unlabeled2 s,n=7,trm:slp s,n=3,trm:unlabeled3 s,n=1,trm:unlabeled4 s,n=7,trm:stp s,n=3,trm:unlabeled5 s,n=7,trm:visib s,n=4,trm:unlabeled6 s,n=6,trm:wdsp s,n=3,trm:unlabeled7 s,n=7,trm:mxspd s,n=1,trm:unlabeled8 s,n=8,trm:gust s,n=8,trm:max s,n=6,trm:min s,n=7,trm:prcp s,n=7,trm:sndp s,n=6,trm:frshtt
Sample Output:
"stn","wban","yearmoda","temp","unlabeled1","dewp","unlabeled2","slp","unlabeled3","unlabeled4","stp","unlabeled5","visib","unlabeled6","wdsp","unlabeled7","mxspd","unlabeled8","gust","max","min","prcp","sndp","frshtt"
"030050","99999","19320101","43.2","6","40.3","4","993.9","6","9","999.9","0","4.3","6","10.2","6","18.1","9","99.9","45.0*","35.1","0.20F","999.9","010000"
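The fixed-width parse can be sketched in Python by slicing on the field widths (the widths below are the first four n=... values from the aq_pp command above; the exact padding in the sample line is illustrative):

```python
# Fixed-width fields: slice each column by width, then trim (like ',trm').
line = "030050 99999  19320101      43.2"

widths = [7, 7, 12, 6]          # stn, wban, yearmoda, temp (first four fields)
fields, pos = [], 0
for w in widths:
    fields.append(line[pos:pos + w].strip())
    pos += w
print(fields)
```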
JSON Data:
Data:
{
"coordinates": null,
"created_at": "Thu Oct 21 16:02:46 +0000 2010",
"favorited": false,
"truncated": false,
"id_str": "28039652140",
"entities": {
"urls": [
{
"expanded_url": null,
"url": "http://gnip.com/success_stories",
"indices": [
69,
100
]
}
],
...
},
"in_reply_to_user_id_str": null,
"text": "what we've been up to at @gnip -- delivering data to happy customers http://gnip.com/success_stories",
...
"user": {
"profile_sidebar_border_color": "C0DEED",
"name": "Gnip, Inc.",
...
},
"in_reply_to_screen_name": null,
"source": "web",
"place": null,
"in_reply_to_status_id": null
}
Preprocess Command:
aq_pp -f,eok,jsn twitterex.json -d s:coordinates:coordinates s:created_at:created_at s:favorited:favorited s:truncated:truncated s:id_str:id_str s:in_reply_to_user_id_str:in_reply_to_user_id_str s:text:text s:profile_sidebar_border_color:user.profile_sidebar_border_color s:name:user.name s:in_reply_to_screen_name:in_reply_to_screen_name s:source:source s:place:place s:in_reply_to_status_id:in_reply_to_status_id
Sample Output:
"coordinates","created_at","favorited","truncated","id_str","expanded_url","url","in_reply_to_user_id_str","text","profile_sidebar_border_color","name","in_reply_to_screen_name","source","place","in_reply_to_status_id"
,"Thu Oct 21 16:02:46 +0000 2010","false","false","28039652140",,"what we've been up to at @gnip -- delivering data to happy customers http://gnip.com/success_stories","C0DEED","Gnip, Inc.",,"web",,
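A sketch of the same flattening in plain Python (a stand-in for aq_pp's JSON mode, not its implementation; only a few columns are shown, and the record is abbreviated from the example above):

```python
import json

record = json.loads("""
{
  "created_at": "Thu Oct 21 16:02:46 +0000 2010",
  "id_str": "28039652140",
  "user": {"name": "Gnip, Inc.", "profile_sidebar_border_color": "C0DEED"},
  "source": "web"
}
""")

def get_path(obj, dotted):
    # walk 'user.name'-style paths, as in the column specs above
    for key in dotted.split("."):
        obj = obj[key]
    return obj

row = [get_path(record, p) for p in ("created_at", "user.name", "source")]
print(row)
```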
These allow you to name your columns so you can reference them later. Column names cannot contain spaces or special characters, and they cannot start with a number. They can be used in your SQL statement in Query to select and perform operations on specific columns in your data.
The type of your data column. The options are String, Unsigned Integer, Float, IP, X, Unsigned Long, Integer, and Long.
X is used to ignore an unwanted column and is highly recommended when you don’t need a certain column or columns, as it will speed up your queries even further.