aq_pp [-h] Global_Opt Input_Spec Prep_Spec Process_Spec Output_Spec
Global_Opt:
[-test] [-verb] [-stat] [-bz ReadBufSiz]
Input_Spec:
[-f[,AtrLst] File [File ...]] [-d ColSpec [ColSpec ...]]
[-cat[,AtrLst] File [File ...] ColSpec [ColSpec ...]]
Prep_Spec:
[-ddef]
[-seed RandSeed]
[-rownum StartNum]
[-emod ModSpec]
[-var ColSpec Val]
[-alias ColName AltName]
[-renam ColName NewName]
Process_Spec:
[-eval ColSpec|ColName Expr]
[-mapf[,AtrLst] ColName MapFrom] [-mapc ColSpec|ColName MapTo]
[-kenc ColSpec|ColName ColName [ColName ...]]
[-kdec ColName ColSpec|ColName[+] [ColSpec|ColName[+] ...]]
[-filt FilterSpec]
[-map[,AtrLst] ColName MapFrom MapTo]
[-sub[,AtrLst] ColName File [File ...] [ColTag ...]]
[-grep[,AtrLst] ColName File [File ...] [ColTag ...]]
[-cmb[,AtrLst] File [File ...] ColSpec [ColSpec ...]]
[-pmod ModSpec]
Output_Spec:
[-o[,AtrLst] File] [-c ColName [ColName ...]]
[-udb [-spec UdbSpec | -db DbName] -imp [DbName:]TabName
[-seg N1[-N2]/N] [-nobnk] [-nonew] [-mod ModSpec]]
[-ovar[,AtrLst] File [-c ColName [ColName ...]]]
aq_pp
is a stream-based record processing tool.
It loads and processes records one at a time.
With its stream-based design, aq_pp
can process an unlimited amount of
data using a constant amount of memory.
For this reason, it is well suited for the pre-processing of large amounts of
raw data, where the extracted and transformed result is used to generate
higher level analytics.
-test
Test command line arguments and exit.
If specified twice (-test -test
), a more thorough test will be
attempted. For example, the program will try to load lookup files and
connect to Udb in test mode.
-verb
-stat
Print a record count summary line to stderr at the end of processing. The line has the form:
aq_pp:TagLab rec=Count err=Count out=Count
-bz ReadBufSiz
Set the input read buffer size. ReadBufSiz is a number in bytes.
-f[,AtrLst] File [File ...]
Set the input attributes and files.
If the data come from stdin, set File
to ‘-‘ (a single dash).
Optional AtrLst
is described under Input File Attributes.
If this option is not given, stdin is assumed.
Example:
$ aq_pp ... -f,+1l,eok file1 -f file2 ...
-d ColSpec [ColSpec ...]
Define the columns of the input records from all -f specs.
ColSpec
has the form Type[,AtrLst]:ColName
.
Up to 256 ColSpec
can be defined (excluding X
type columns).
Supported Types
are:
S - String.
F - Double precision floating point.
L - 64-bit unsigned integer.
LS - 64-bit signed integer.
I - 32-bit unsigned integer.
IS - 32-bit signed integer.
IP - IPv4/v6 address.
X[Type] - Marks an unwanted input column. Type is optional and can be one of
the above (default is S). ColName is also optional; if given, it is simply
discarded.
Optional AtrLst is a comma separated list containing:
esc - Denotes that the input field uses ‘\’ as an escape character. Data
exported from databases (e.g., MySQL) sometimes use this format. Be careful
when dealing with a multibyte character set because ‘\’ can be part of a
multibyte sequence.
noq - Denotes that the input field is not quoted. Any quotes in or around
the field are considered part of the field value.
hex - For numeric types. Denotes that the input field is in hexadecimal
notation. A leading 0x is optional. For example, 100 is converted to 256
instead of 100.
trm - Trims leading/trailing spaces from the input field value.
lo, up - For the S type. Converts the input field to lower/upper case.
ColName restrictions:
Note: Optional ColSpec
attributes only apply to input data.
They cannot be used on the dynamically created columns discussed later.
Example:
$ aq_pp ... -d s:Col1 s,lo:Col2 i,trm:Col3 ...
The trm attribute removes blanks around the value before it is converted
to an internal number.
-cat[,AtrLst] File [File ...] ColSpec [ColSpec ...]
Add rows from Files
to the current data set.
If the data come from stdin, set File
to ‘-‘ (a single dash).
Optional AtrLst
is described under Input File Attributes.
ColSpecs
define the columns in the files as with -d.
The columns may differ from those of the current data set.
The new data set will contain unique columns from both sets.
Columns that do not exist in a data set will be set to zero or blank when
that data set is loaded.
Example:
$ aq_pp ... -d s:Col1 s:Col2 i:Col3 s:Col4 ... -cat more.csv i:Col3 s:Col1 s:Col5 s:Col6 ...
-ddef
Turns on implicit column support for Udb import. If a column required by the target Udb table is not defined in the data set, its value will be set to 0 or blank during import.
-seed RandSeed
Set the seed for the $Random
evaluation builtin variable.
Default seed is 1.
-rownum StartNum
Set the starting value of the $RowNum
evaluation builtin variable.
StartNum is the index of the first row.
Default starting row index is 1.
-emod ModSpec
Deprecated as of Essentia version 3.1.0.3: emod modules have been eliminated and their functions are available by default.
Load a module that supplies custom evaluation functions. The supplied functions will be available for use in subsequent -eval specs.
ModSpec
has the form ModName[:argument]
where ModName
is the logical module name and argument
is an optional module specific
parameter string.
aq_pp
will look for “emod/ModName
.so” in the directory where it is
installed. For example, if it is installed as SomeDirectory/aq_pp
,
SomeDirectory/emod/ModName.so
will be loaded.
Multiple eval modules can be specified.
In case a function of the same name is supplied by multiple
modules, the one from the most recently loaded module will be used.
Each emod is individually documented. See the “aq_pp-emod-*” manual pages
for details.
-var ColSpec Val
Define a new variable and initialize its value to Val.
A variable stores a value that persists between rows over the entire run.
Recall that normal column values change from row to row.
ColSpec
is the variable’s spec in the form Type:ColName
where Type
is the data type and ColName is the variable’s name. See the -d option for
details.
Note that a string Val
must be quoted,
see String Constant spec for details.
Example:
$ aq_pp ... -d i:Col1 ... -var 'i:Sum' 0 ... -eval 'Sum' 'Sum + Col1' ...
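To make the running-sum behavior above concrete, here is a rough Python sketch of what the -var/-eval pair does across rows. The input values are hypothetical; this is an illustration, not aq_pp internals.

```python
# Rough analog of: -var 'i:Sum' 0 ... -eval 'Sum' 'Sum + Col1'
# A -var variable keeps its value from row to row; a column value is per-row.
rows = [{"Col1": 10}, {"Col1": 20}, {"Col1": 30}]  # hypothetical input records

Sum = 0                       # -var 'i:Sum' 0
for row in rows:
    Sum = Sum + row["Col1"]   # -eval 'Sum' 'Sum + Col1' applied to each row

print(Sum)                    # 60
```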
-alias ColName AltName
AltName
is the desired alias. An alias allows the same column to be
addressed using multiple names.
If the original column name is no longer needed, use -renam instead.
-renam ColName NewName
NewName
is the new name of the column/variable/alias.
-eval ColSpec|ColName Expr
Evaluate Expr
and save the result to a column. The column can be a new
column, an existing column/variable or null as explained below.
If - is given, the result will not be saved anywhere. This is
useful when calling a function that puts its result in designated columns
by itself.
If ColSpec is given, a new column will be created using the spec.
See -d for details. Note that the new column cannot participate in Expr.
If ColName is given, it must refer to a previously defined column/variable.
Expr
is the expression to evaluate.
Data type of the evaluated result must be compatible with the data type of
the target column. For example, string result for a string column and
numeric result for a numeric column (there is no automatic type conversion;
however, explicit conversion can be done using the To*()
functions
described below).
Operands in the expression can be the names of previously defined columns or
variables, constants, builtin variables and functions.
Explicit type conversion can be done with ToIP(), ToF(), ToI() and ToS().
Builtin variables:
$Random
$RowNum
Builtin functions:
ToIP(Val)
Convert Val to an IP value.
Val can be a string/IP column’s name, a string constant,
or an expression that evaluates to a string/IP.
ToF(Val)
Convert Val to a floating point value.
Val can be a string/numeric column’s name, a string/numeric constant,
or an expression that evaluates to a string/number.
ToI(Val)
Convert Val to an integer value.
Val can be a string/numeric column’s name, a string/numeric constant,
or an expression that evaluates to a string/number.
ToS(Val)
Convert Val to a string value.
Val can be a numeric column’s name, a string/numeric/IP constant,
or an expression that evaluates to a string/number/IP.
Min(Val1, Val2 [, Val3 ...])
Return the minimum of Val1, Val2 and so on.
Values can be numeric column names, numbers,
or expressions that evaluate to a number.
Max(Val1, Val2 [, Val3 ...])
Return the maximum of Val1, Val2 and so on.
Values can be numeric column names, numbers,
or expressions that evaluate to a number.
PatCmp(Val, Pattern [, AtrLst])
Perform a pattern comparison between string value and a pattern.
Returns 1 (True) if successful or 0 (False) otherwise.
Val
can be a string column’s name, a string constant,
or an expression that evaluates to a string.
Pattern
is a string constant specifying
the pattern to match.
AtrLst
is a comma separated string list containing:
ncas - Do case insensitive pattern match (default is case sensitive).
This has the same effect as the case insensitive operators below.
rx - Do Regular Expression matching.
rx_extended - Do Regular Expression matching.
In addition, enable POSIX Extended Regular Expression syntax.
rx_newline - Do Regular Expression matching.
In addition, apply certain newline matching restrictions.
Without any of the Regular Expression related attributes,
Pattern
must be a simple wildcard pattern containing just ‘*’
(matches any number of bytes) and ‘?’ (matches any 1 byte) only;
literal ‘*’, ‘?’ and ‘\’ in the pattern must be ‘\’ escaped.
If any of the Regular Expression related attributes is enabled, then the pattern must be a GNU RegEx.
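In simple wildcard mode (no rx attribute), PatCmp() behaves roughly like the Python sketch below; pat_cmp is a hypothetical helper written for illustration only.

```python
import re

def pat_cmp(val, pattern, ncas=False):
    """Rough analog of PatCmp() in simple wildcard mode: '*' matches any
    number of bytes, '?' matches any one byte; literal '*', '?' and '\\'
    in the pattern must be '\\'-escaped."""
    rx, i = [], 0
    while i < len(pattern):
        c = pattern[i]
        if c == "\\" and i + 1 < len(pattern):    # escaped literal character
            rx.append(re.escape(pattern[i + 1]))
            i += 2
        elif c == "*":
            rx.append(".*")
            i += 1
        elif c == "?":
            rx.append(".")
            i += 1
        else:
            rx.append(re.escape(c))
            i += 1
    flags = re.IGNORECASE if ncas else 0          # ncas attribute
    return 1 if re.fullmatch("".join(rx), val, flags) else 0
```

For example, pat_cmp("access.log", "*.log") returns 1, while the same value against "?.log" returns 0 because ‘?’ matches exactly one byte.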
SHash(Val)
Return a hash value computed from Val.
Val can be a string column’s name, a string constant,
or an expression that evaluates to a string.
SLeng(Val)
Return the length of Val.
Val can be a string column’s name, a string constant,
or an expression that evaluates to a string.
DateToTime(DateVal, DateFmt)
Returns the UNIX time in integral seconds corresponding to DateVal
.
DateVal
can be a string column’s name, a string constant,
or an expression that evaluates to a string.
DateFmt
is a string constant specifying
the format of DateVal
.
The format is a sequence of single-letter conversion codes:
. - Represents a single unwanted character (e.g., a separator).
Y - 1-4 digit year.
y - 1-2 digit year.
m - Month in 1-12.
b - Abbreviated English month name (“JAN” ... “DEC”, case insensitive).
d - Day of month in 1-31.
H - Hour in 0-23 or 1-12.
M - Minute in 0-59.
S - Second in 0-59.
p - AM/PM (case insensitive).
z - Timezone as HHMM offset from GMT.
This conversion is timezone dependent. If there is no timezone information
(z conversion) in the DateVal, set the timezone appropriately
(TZ environment) when running the program.
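As an illustration, a simplified UTC-only analog of DateToTime() covering a subset of the codes might look like the Python sketch below. date_to_time and CODE_RX are hypothetical names invented for this example; the real function also supports b, p and z and honors the TZ setting.

```python
import calendar
import re

# Capture patterns for a subset of the DateFmt codes;
# '.' stands for one unwanted separator byte.
CODE_RX = {"Y": r"(\d{1,4})", "m": r"(\d{1,2})", "d": r"(\d{1,2})",
           "H": r"(\d{1,2})", "M": r"(\d{1,2})", "S": r"(\d{1,2})",
           ".": r"."}

def date_to_time(val, fmt):
    """Rough UTC-only analog of DateToTime(DateVal, DateFmt)."""
    m = re.match("".join(CODE_RX[c] for c in fmt), val)
    parts = dict(zip([c for c in fmt if c != "."], m.groups()))
    return calendar.timegm((int(parts["Y"]), int(parts["m"]), int(parts["d"]),
                            int(parts.get("H", 0)), int(parts.get("M", 0)),
                            int(parts.get("S", 0)), 0, 0, 0))
```

For example, date_to_time("1970-01-02 00:00:00", "Y.m.d.H.M.S") yields 86400, one day past the epoch.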
TimeToDate(TimeVal, DateFmt)
Returns the date string corresponding to TimeVal
.
The string’s maximum length is 127.
TimeVal
can be a numeric column’s name, a numeric constant,
or an expression that evaluates to a number.
DateFmt
is a string constant specifying
the format of the output. See the strftime()
C function manual
page regarding the format of DateFmt
.
This conversion is timezone dependent. Set the timezone appropriately (TZ environment) when running the program.
QryParmExt(QryVal, ParmSpec)
Extract query parameters from QryVal
and place the results in columns.
Returns the number of parameters extracted. If the return value is not
needed, invoke the function as -eval - QryParmExt(...).
QryVal
can be a string column’s name, a string constant
or an expression that evaluates to a string.
ParmSpec
is a string constant specifying
the parameters to extract and the destination columns for the result.
It has the form:
[AtrLst]&Key[:ColName][,AtrLst][&Key[:ColName][,AtrLst]...]
It can start with a comma separated attribute list:
beg=c - Skip over the initial portion of QryVal up to and including
the first ‘c’ character (single byte). A common value for ‘c’ is ‘?’.
Without this attribute, the entire QryVal will be used.
zero - Zero out all destination columns before extraction.
dec=Num - Number of times to perform URL decode on the extracted
values. Num must be between 0 and 99. Default is 1.
trm=c - Trim one leading and/or trailing ‘c’ character (single byte)
from the decoded extracted values.
Keys are the names of the parameters to extract.
A Key should be URL encoded if it contains any special characters.
Note that each Key specification starts with an ‘&’.
The extracted value of Key is stored in a column given by ColName.
The column must be a previously defined column. If ColName is not
given, a column with the same name as Key is assumed.
Each Key can also have a comma separated attribute list:
zero - Zero out the destination column before extraction.
dec=Num - Number of times to perform URL decode on the extracted
value of this Key. Num must be between 0 and 99.
trm=c - Trim one leading and/or trailing ‘c’ character (single byte)
from the decoded extracted value.
Example:
$ aq_pp ... -d i:Col1 ... -eval l:Col_evl 'Col1 * 10' ...
$ aq_pp -rownum 101 ... -d i:Col1 ... -eval i:Seq '$RowNum' ...
$ aq_pp ... -d s:Col1 s:Col2 ... -eval is:Dt 'DateToTime(Col2, "Y.m.d.H.M.S.p") - DateToTime(Col1, "Y.m.d.H.M.S.p")' ...
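The extraction that QryParmExt() performs can be approximated in Python as follows. qry_parm_ext is a hypothetical helper written for illustration; it covers only the beg= and dec= attributes, not zero or trm=c.

```python
from urllib.parse import unquote

def qry_parm_ext(qry, keys, dec=1, beg=None):
    """Rough analog of QryParmExt(QryVal, ParmSpec)."""
    if beg is not None:              # beg=c: skip up to and including first 'c'
        i = qry.find(beg)
        if i >= 0:
            qry = qry[i + 1:]
    out = {}
    for field in qry.split("&"):
        k, _, v = field.partition("=")
        if k in keys:
            for _ in range(dec):     # dec=Num: URL-decode the value Num times
                v = unquote(v)
            out[k] = v
    return out
```

For example, qry_parm_ext("/path?a=1&b=hello%20world", ["a", "b"], beg="?") extracts {"a": "1", "b": "hello world"}.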
-mapf[,AtrLst] ColName MapFrom
Extract data from a string column. This option should be used in
conjunction with -mapc.
ColName
is a previously defined column/variable to extract data from.
MapFrom
defines the extraction rule.
Optional AtrLst
is a comma separated list containing:
ncas - Do case insensitive pattern match (default is case sensitive).
rx - Do Regular Expression matching.
rx_extended - Do Regular Expression matching.
In addition, enable POSIX Extended Regular Expression syntax.
rx_newline - Do Regular Expression matching.
In addition, apply certain newline matching restrictions.
If any of the Regular Expression related attributes are enabled, then
MapFrom must use the RegEx MapFrom Syntax.
Otherwise, it must use the RT MapFrom Syntax.
-mapc ColSpec|ColName MapTo
Render data extracted via previous -mapf into a new column or into an existing column/variable. The column must be of string type.
If ColSpec is given, a new column will be created using the spec.
See -d for details.
If ColName is given, it must refer to a previously defined
column/variable.
MapTo
is the rendering spec. See MapTo Syntax for details.
Example:
$ aq_pp ... -d s:Col1 s:Col2 s:Col3 ... -mapf Col1 '%%v1_beg%%.%%v1_end%%' -mapf,rx Col2 '\(.*\)-\(.*\)' -mapf,rx Col3 '\(.*\)_\(.*\)' -mapc s:Col_beg '%%v1_beg%%,%%1%%,%%4%%' -mapc s:Col_end '%%v1_end%%,%%2%%,%%5%%' ...
MapFrom expressions do not have named placeholders for the extracted data.
Placeholders are interpreted implicitly from the expressions in this way:
%%0%% - Represents the entire match in the first -mapf,rx (not used in example).
%%1%% - Represents the 1st subpattern match in the first -mapf,rx.
%%2%% - Represents the 2nd subpattern match in the first -mapf,rx.
%%3%% - Represents the entire match in the second -mapf,rx (not used in example).
%%4%% - Represents the 1st subpattern match in the second -mapf,rx.
%%5%% - Represents the 2nd subpattern match in the second -mapf,rx.
-kenc ColSpec|ColName ColName [ColName ...]
Encode a key column from the given ColNames.
The key column must be of string type.
The encoded value it stores contains binary data.
If ColSpec is given, a new column will be created using the spec.
See -d for details.
If ColName is given, it must refer to a previously defined
column/variable.
The source ColNames must be previously defined.
They can have any data type.
Example:
$ aq_pp ... -d s:Col1 i:Col2 ip:Col3 ... -kenc s:Key1 Col1 Col2 Col3 ...
-kdec ColName ColSpec|ColName[+] [ColSpec|ColName[+] ...]
Decode a key column given by ColName
into one or more columns
given by ColSpec
(new column) or ColName
(existing column/variable).
The key ColName
must be an existing string column/variable.
For the decode-to columns, possible specs are:
Type:ColName[+]
ColName[+]
Type:[+]
Note that the decode-to column types must match those used in the original -kenc spec.
Example:
$ aq_pp ... -d s:Key1 ... -kdec Key1 s:Col1 i:Col2 ip:Col3 ...
$ aq_pp ... -d s:Key1 ... -kdec Key1 s: i:Col2 ip: ...
$ aq_pp ... -d s:Key1 ... -kdec Key1 s: i:Col2+ ip:+ -kdec Key1 i: ip:Col3 ...
-filt FilterSpec
Filter (or select) records based on FilterSpec
.
FilterSpec
is a logical expression that evaluates to either true or false
for each record - if true, the record is selected; otherwise, it is
discarded.
It has the basic form [!] LHS [<compare> RHS]
where:
!
negates the result of the comparison.
It is recommended that !(...)
be used to clarify the intended
operation even though it is not required.
Supported comparison operators are:
==, >, <, >=, <= - LHS and RHS comparison.
~==, ~>, ~<, ~>=, ~<= - LHS and RHS case insensitive comparison; string type only.
!=, !~= - Negation of the above equal operators.
&= - Perform a “(LHS & RHS) == RHS” check; numeric types only.
!&= - Negation of the above.
& - Perform a “(LHS & RHS) != 0” check; numeric types only.
!& - Negation of the above.
More complex expressions can be constructed by using (...) (grouping),
! (negation), || (or) and && (and).
For example:
LHS_1 == RHS_1 && !(LHS_2 == RHS_2 || LHS_3 == RHS_3)
Example:
$ aq_pp ... -d s:Col1 s:Col2 i:Col3 s:Col4 ... -filt 'Col1 == Col4 && Col2 != "" && Col3 >= 100' ...
-map[,AtrLst] ColName MapFrom MapTo
Remap (a.k.a., rewrite) a string column’s value.
ColName
is a previously defined column/variable.
MapFrom
defines the extraction rule.
MapTo
is the rendering spec. See MapTo Syntax for details.
Optional AtrLst
is a comma separated list containing:
ncas - Do case insensitive pattern match (default is case sensitive).
rx - Do Regular Expression matching.
rx_extended - Do Regular Expression matching.
In addition, enable POSIX Extended Regular Expression syntax.
rx_newline - Do Regular Expression matching.
In addition, apply certain newline matching restrictions.
If any of the Regular Expression related attributes are enabled, then
MapFrom must use the RegEx MapFrom Syntax.
Otherwise, it must use the RT MapFrom Syntax.
Example:
$ aq_pp ... -d s:Col1 ... -map Col1 '%%v1_beg%%-%*' 'beg=%%v1_beg%%' ...
$ aq_pp ... -d s:Col1 ... -map,rx Col1 '\(.*\)-*' 'beg=%%1%%' ...
-sub[,AtrLst] ColName File [File ...] [ColTag ...]
Update the value of a string column/variable according to a lookup table.
ColName
is a previously defined column/variable.
Files
contain the lookup table.
If the input comes from stdin, set File
to ‘-‘ (a single dash).
Optional AtrLst
is a comma separated list containing:
ncas - Do case insensitive match (default is case sensitive).
pat - Support ‘?’ and ‘*’ wild cards in the “From” value. Literal ‘?’,
‘*’ and ‘\’ must be escaped by a ‘\’. Without this attribute, the
“From” value is assumed constant and no escape is necessary.
req - Discard records not matching any entry in the lookup table.
Normally, the column value will remain unchanged if there is no match.
ColTags are optional. They specify the columns in the files. Supported
tags (case insensitive) are:
FROM - Marks the column used to match the value of ColName.
TO - Marks the column used as the new value of ColName.
X - Marks an unused column.
If ColTags are used, both the FROM and TO tags must be given.
Any number of X can be specified.
If ColTags are not used, the files are assumed to contain
exactly 2 columns - the FROM and TO columns, in that order.
The FROM
value is generally a literal. Patterns can also be used,
see the pat
attribute description above.
The TO
value is always a literal.
Matches are carried out according to the order of the match value in the
files. Match stops when the first match is found. If the files contain both
exact value and pattern, then:
Note: If a file name happens to be one of FROM
, TO
or X
(case insensitive), prepend the name with a path (e.g., ”./X”)
to avoid misinterpretation.
Example:
$ aq_pp ... -d s:Col1 ... -sub Col1 lookup.csv ...
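The lookup behavior of -sub can be sketched in Python as follows. sub_lookup is a hypothetical helper for illustration; it covers the ncas and req attributes but not the pat wildcard matching.

```python
def sub_lookup(value, table, ncas=False, req=False):
    """Rough analog of -sub: rewrite value via a FROM->TO table.
    Matching follows table order; the first match wins.
    Returns (new_value, keep_record)."""
    key = value.lower() if ncas else value
    for frm, to in table:
        if key == (frm.lower() if ncas else frm):
            return to, True
    if req:                      # req attribute: discard unmatched records
        return None, False
    return value, True           # default: leave the value unchanged
```

For example, with table = [("cat", "feline"), ("dog", "canine")], sub_lookup("cat", table) yields ("feline", True) while sub_lookup("bird", table) leaves the value as ("bird", True).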
-grep[,AtrLst] ColName File [File ...] [ColTag ...]
Like filtering, but matches a single column/variable against a list of
values from a lookup table.
ColName
is a previously defined column/variable.
Files
contain the lookup table.
If the input comes from stdin, set File
to ‘-‘ (a single dash).
Optional AtrLst
is a comma separated list containing:
ncas - Do case insensitive match (default is case sensitive).
pat - Support ‘?’ and ‘*’ wild cards in the “From” value. Literal ‘?’,
‘*’ and ‘\’ must be escaped by a ‘\’. Without this attribute, the
match value is assumed constant and no escape is necessary.
ColTags are optional. They specify the columns in the files. Supported
tags (case insensitive) are:
FROM - Marks the column used to match the value of ColName.
X - Marks an unwanted column.
If ColTags are used, the FROM tag must be given.
Any number of X can be specified.
If ColTags are not used, the files are assumed to contain
exactly 1 column - the FROM column.
The FROM
value is generally a literal. Patterns can also be used,
see the pat
attribute description above.
Matches are carried out according to the order of the match value in the
files. Match stops when the first match is found. If the files contain both
exact value and pattern, then:
Note: If a file name happens to be one of FROM
or X
(case insensitive), prepend the name with a path (e.g., ”./X”)
to avoid misinterpretation.
Example:
$ aq_pp ... -d s:Col1 ... -grep,rev Col1 lookup.csv ...
-cmb[,AtrLst] File [File ...] ColSpec [ColSpec ...]
Combine data from lookup table into the current data set by joining rows
from both data sets based on common key column values.
The new data set will contain unique columns from both sets.
Files
contain the lookup table.
If the data come from stdin, set File
to ‘-‘ (a single dash).
Optional AtrLst
is a comma separated list containing:
ncas - Do case insensitive match (default is case sensitive).
req - Discard unmatched records.
ColSpecs define the columns in the files as with -d.
In addition to the standard -d column attributes,
the following are supported:
key - Mark a key column. This column must exist in the current
data set.
cmb - Mark a column to be combined into the current data set. If this
column does not exist, one will be added.
If a column has neither the key nor cmb attribute, it will be
implicitly used as a combine key if a column with the same name already
exists in the current data set.
Example:
$ aq_pp ... -d s:Col1 s:Col2 i:Col3 s:Col4 ... -cmb lookup.csv i:Col3 s:Col1 s:Col5 s:Col6 ...
$ aq_pp ... -d s:Col1 s:Col2 i:Col3 s:Col4 ... -cmb lookup.csv i:Col3 s:Col1 s:Col5 s:Col6 s,cmb:Col2 ...
$ aq_pp ... -d s:Col1 s:Col2 i:Col3 s:Col4 ... -cmb lookup.csv i,key:Col3 s,key:Col1 s,cmb:Col5 s,cmb:Col6 s,cmb:Col2 ...
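Conceptually, -cmb is a left join of the lookup table into the current data set on the key columns. A rough Python sketch of that idea (cmb is a hypothetical helper; unmatched rows get blank combine columns unless req discards them):

```python
def cmb(rows, lookup, key_cols, cmb_cols, req=False):
    """Rough analog of -cmb: join lookup rows into the data set on key columns."""
    index = {tuple(r[k] for k in key_cols): r for r in lookup}
    out = []
    for row in rows:
        hit = index.get(tuple(row[k] for k in key_cols))
        if hit is None:
            if req:                           # req: discard unmatched records
                continue
            hit = {c: "" for c in cmb_cols}   # unmatched: blank combine columns
        new = dict(row)
        for c in cmb_cols:
            new[c] = hit[c]
        out.append(new)
    return out
```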
-pmod ModSpec
Call the processing function in the module to process the current record. The function is typically used to implement custom logic.
ModSpec
has the form ModName[:argument]
where ModName
is the logical module name and argument
is a module specific
parameter string.
aq_pp
will look for “pmod/ModName
.so” in the directory where it is
installed. For example, if it is installed as /SomeDirectory/aq_pp
,
/SomeDirectory/pmod/ModName.so
will be loaded.
See the examples under “pmod/” in the source package regarding how this
type of module is implemented.
Standard modules:
unwrap_strv
Unwrap a delimiter separated string column into zero or more values. The row will be replicated for each of the unwrapped values. Module arguments are:
From_Col:From_Sep:To_Col[:AtrLst]
From_Col - Column containing the string value to unwrap.
It must have type S.
From_Sep - The single byte delimiter that separates individual
values. The delimiter must be given as-is; no escape is recognized.
To_Col - Column to save each unwrapped value to.
It must have type S. The To_Col can be the same as the
From_Col - the module will remember the original From_Col value.
Optional AtrLst is a comma separated list containing:
relax - No trailing delimiter. One is expected by default.
noblank - Skip blank values.
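The row replication performed by unwrap_strv can be pictured with the Python sketch below. The function here is a hypothetical stand-in for the module, not its actual implementation.

```python
def unwrap_strv(rows, from_col, sep, to_col, relax=False, noblank=False):
    """Rough analog of the unwrap_strv pmod: replicate each row once per value."""
    out = []
    for row in rows:
        s = row[from_col]
        if not relax and s.endswith(sep):   # default: a trailing delimiter is expected
            s = s[:-1]
        for val in s.split(sep):
            if noblank and val == "":       # noblank: skip blank values
                continue
            new = dict(row)
            new[to_col] = val               # To_Col may equal From_Col
            out.append(new)
    return out
```

For example, a row whose tags column holds "a;b;" becomes two rows, one per tag, with all other column values copied unchanged.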
[-o[,AtrLst] File] [-c ColName [ColName ...]]
Output data rows.
Optional “-o[,AtrLst] File
” sets the output attributes and file.
If File
is a ‘-‘ (a single dash), data will be written to stdout.
Optional AtrLst
is described under Output File Attributes.
Optional “-c ColName [ColName ...]
” selects the columns to output.
ColName
refers to a previously defined column/variable.
Without -c
, all columns are selected by default. Variables are not
automatically included though.
If -c
is specified without a previous -o
, output to stdout is
assumed.
In case a title line is desired but certain column names are not
appropriate, use -alias or -renam before the -o
to remap the
name of those columns manually.
With -alias, the alternate names must be explicitly selected with -c
.
With -renam, -c
is optional.
Multiple sets of “-o ... -c ...
” can be specified.
Example:
$ aq_pp ... -d s:Col1 s:Col2 s:Col3 ... -o,esc,noq - -c Col2 Col1
-udb [-spec UdbSpec|-db DbName] -imp [DbName:]TabName [-seg N1[-N2]/N] [-nobnk] [-nonew] [-mod ModSpec]
Output data directly to Udb (i.e., a Udb import).
-udb
marks the beginning of Udb import specific options.
Optional “-spec UdbSpec
” sets the Udb spec file for the import.
Alternatively, “-db DbName
” indirectly sets the spec file to
”.conf/DbName
.spec” in the current work directory.
If neither option is given, “udb.spec” in the current work directory
is assumed.
See the “udb.spec” manual page for details.
“-imp [DbName:]TabName
” specifies an import operation.
TabName sets the table in the spec to import data to.
TabName is case insensitive. It must not exceed 31 bytes in length.
DbName defines UdbSpec indirectly as in the -db option.
Data columns whose names match the columns of TabName
are automatically selected for import.
In case certain columns in the current data set are named
differently from the columns of TabName, use -alias or -renam
to remap those columns manually.
Optional “-seg N1[-N2]/N” applies sampling by selecting segment N1 or
” applies sampling by selecting segment N1 or
segment N1 to N2 (inclusive) out of N segments of unique users from the
input data to import. Users are segmented based on the hash value of the
user key. For example, “-seg 2-4/10
” will divide the user pool into 10
segments and import segments 2, 3 and 4; segments 1 and 5-10 are discarded.
Optional -nobnk
excludes records with a blank user key from the import.
Optional -nonew
tells the server not to create any new users during this
import. Records belonging to users not yet in the DB are discarded.
Optional “-mod ModSpec
” specifies a module to load on the server side.
ModSpec
has the form ModName[:argument]
where ModName
is the logical module name and argument
is a module specific
parameter string. Udb server will try to load “umod/ModName
.so”
in the directory where udbd
is installed.
Multiple sets of “-udb -spec ... -imp ...
” can be specified.
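The -seg selection works by hashing each user key into one of N numbered buckets. The exact hash is internal to Udb; the sketch below uses CRC-32 purely as a stand-in to show the bucketing idea (keep_record is a hypothetical helper).

```python
import zlib

def keep_record(user_key, n1, n2, n):
    """Rough analog of -seg N1[-N2]/N: keep a record if its user's segment
    (numbered 1..N) falls in N1..N2. The real hash function is Udb-internal;
    CRC-32 is used here only for illustration."""
    seg = zlib.crc32(user_key.encode()) % n + 1
    return n1 <= seg <= n2
```

With this model, “-seg 2-4/10” corresponds to keep_record(key, 2, 4, 10); across all keys, the ten segments partition the user pool, so each user lands in exactly one segment.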
-ovar[,AtrLst] File [-c ColName [ColName ...]]
Output the final variable values. Variables are those defined using the -var option. Only a single data row is output.
“-ovar[,AtrLst] File
” sets the output attributes and file.
If File
is a ‘-‘ (a single dash), data will be written to stdout.
Optional AtrLst
is described under Output File Attributes.
Optional “-c ColName [ColName ...]
” selects the variables to output.
ColName
refers to a previously defined variable.
Without -c
, all variables are selected by default.
In case a title line is desired but certain variable names are not
appropriate, use -alias or -renam before -ovar
to remap the
name of those variables manually.
With -alias, the alternate names must be explicitly selected with -c
.
With -renam, -c
is optional.
Multiple sets of “-ovar ... -c ...
” can be specified.
Example:
$ aq_pp ... -d i:Col1 i:Col2 ... -var i:Sum1 0 -var i:Sum2 0 ... -eval Sum1 'Sum1 + Col1' -eval Sum2 'Sum2 + (Col2 * Col2)' ... -ovar - -c Sum1 Sum2
If successful, the program exits with status 0. Otherwise, the program exits with a non-zero status code along with error messages printed to stderr. Applicable exit codes are:
Each input file can have these comma separated attributes:
eok - Make errors non-fatal. If there is an input error, the program will
try to skip over bad/broken records. If there is a record processing error,
the program will just discard the record.
qui - Quiet; i.e., do not print any input/processing error message.
tsv - Input is in TSV format (default is CSV).
sep=c - Use separator ‘c’ (single byte) as the column separator.
bin - Input is in binary format (default is CSV).
esc - ‘\’ is an escape character in input fields (CSV or TSV).
noq - No quotes around fields (CSV).
+Num[b|r|l] - Specifies the number of bytes (b suffix), records (r suffix)
or lines (no suffix or l suffix) to skip before processing.
By default, input files are assumed to be in formal CSV format. Use the
tsv, esc and noq attributes to set input characteristics as needed.
An output file can have these comma separated attributes:
app - Append to the file; otherwise, the file is overwritten by default.
bin - Output is in binary format (default is CSV).
esc - Use ‘\’ to escape ‘,’, ‘”’ and ‘\’ (CSV).
noq - Do not quote string fields (CSV).
fmt_g - Use “%g” as the print format for F type columns. Only use this
to aid data inspection (e.g., during integrity check or debugging).
notitle - Suppress the column name label row from the output.
A label row is normally included by default.
By default, output is in CSV format. Use the esc and noq attributes to
set output characteristics as needed.
A string constant must be quoted between double or single quotes. With double quotes, special character sequences can be used to represent special characters. With single quotes, no special sequence is recognized; in other words, a single quote cannot occur between single quotes.
Character sequences recognized between double quotes are:
\\ - represents a literal backslash character.
\" - represents a literal double quote character.
\b - represents a literal backspace character.
\f - represents a literal form feed character.
\n - represents a literal new line character.
\r - represents a literal carriage return character.
\t - represents a literal horizontal tab character.
\v - represents a literal vertical tab character.
\0 - represents a NULL character.
\xHH - represents a character whose HEX value is HH.
Beyond these, other special sequences may be recognized depending on where
the string is used. For example, in a simple wildcard pattern
(see PatCmp()), \? and \* represent literal ? and * respectively.
Sequences that are not recognized will be kept as-is. For example, in \a,
the backslash will not be removed.
Two or more quoted strings can be used back to back to form a single string. For example,
'a "b" c'" d 'e' f" => a "b" c d 'e' f
RT style MapFrom is used in both -mapf and -map options. The MapFrom spec is used to match and/or extract data from a string column’s value. It has this general syntax:
literal_1%*literal_2%?literal_3 -
%* matches any number of bytes and %? matches any 1 byte.
This is like a pattern comparison.
%%my_var%% -
Extract the value into a variable named my_var. my_var can later be
used in the MapTo spec.
literal_1%%my_var_1%%literal_2%%my_var_2%% -
A common way to extract specific data portions.
literal_1%=literal_2%=literal_3 -
%= is used to toggle case sensitive/insensitive match. In the above case,
if -mapf or -map does not have the ncas attribute, then
literal_1’s match will be case sensitive, literal_2’s will be
case insensitive, and literal_3’s will be case sensitive again.
\%\%not_var\%\%%%my_var%%a_backslash\\others -
If a ‘%’ is used in a way that resembles an unintended MapFrom spec,
the ‘%’ must be escaped. Literal ‘\’ must also be escaped.
On the other hand, ‘\’ has no special meaning within a variable spec
(described below).
Each %%var%% variable can have additional attributes. The general form of
a variable spec is:
%%VarName[:@class][:[chars]][:min[-max]][,brks]%%
where
VarName
is the variable name which can be used in MapTo. VarName can be a
‘*’; in this case, the extracted data is not stored, but the extraction
attributes are still honored.
Note: Do not use numbers as a RT mapping variable name.
:@class
restricts the extracted data to a class of characters.
class is a code with these values and meanings:
n - Characters 0-9.
a - Characters a-z.
b - Characters A-Z.
c - All printable ASCII characters.
x - The opposite of c above.
s - All whitespaces.
g - Characters in {}[]().
q - Single/double/back quotes.
Multiple classes can be used; e.g., %%my_var:@nab%% for all alphanumerics.
:[chars]
([] is part of the syntax) is similar to the character class
described above except that the allowed characters are set explicitly.
Note that ranges are not supported; all characters must be specified.
For example,
%%my_var:[0123456789abcdefABCDEF]%%
(same as
%%my_var:@n:[abcdefABCDEF]%%
) for hex digits. To include a ‘]’
as one of the characters, put it first, as in %%my_var:[]xyz]%%.
:min[-max]
is the min and optional max length (bytes, inclusive) to
extract. Without a max, the default is unlimited (actually ~64KB).
,brks
defines a list of characters at which extraction of the variable
should stop. For example, %%my_var,,;:%% will extract data into my_var
until one of ,;: or end-of-string is encountered. This usage is often
followed by a wild card, as in %%my_var,,;:%%%*.
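For simple specs, RT MapFrom can be pictured as a translation into an anchored regular expression. The Python sketch below (rt_mapfrom_to_regex, a hypothetical helper) handles only literals, %*, %? and plain %%name%% variables - none of the attribute forms:

```python
import re

def rt_mapfrom_to_regex(spec):
    """Translate a simple RT MapFrom spec into a Python regex, for illustration."""
    out, i = [], 0
    while i < len(spec):
        if spec.startswith("%%", i):                 # %%name%% variable
            j = spec.index("%%", i + 2)
            out.append("(?P<%s>.*?)" % spec[i + 2:j])
            i = j + 2
        elif spec.startswith("%*", i):               # any number of bytes
            out.append(".*")
            i += 2
        elif spec.startswith("%?", i):               # any one byte
            out.append(".")
            i += 2
        else:                                        # literal byte
            out.append(re.escape(spec[i]))
            i += 1
    return "".join(out) + "$"
```

For example, the spec %%v1_beg%%.%%v1_end%% becomes a regex with two named groups separated by a literal dot, so matching it against "10.42" captures v1_beg as "10" and v1_end as "42".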
Regular expression style MapFrom
can be used in both -mapf and -map
options. MapFrom
defines what to match and/or extract from a string
value of a column.
Differences between RegEx mapping and RT mapping:
The pattern is matched against the entire value, as if anchored by ^pattern$.
Extracted values are referenced by number - %%0%%, %%1%%, and so on - rather than by name.
See -mapc for a usage example.
Regular Expression is very powerful but also complex. Please consult the GNU RegEx manual for details.
MapTo is used in -mapc and -map. It renders the data extracted by MapFrom into a column. Both RT and RegEx MapTo share the same syntax:
%%my_var%% -
Substitute the value of my_var.
literal_1%%my_var_1%%literal_2%%my_var_2%% -
A common way to render extracted data.
\%\%not_var\%\%%%my_var%%a_backslash\\others -
If a ‘%’ is used in a way that resembles an unintended MapTo spec,
the ‘%’ must be escaped. Literal ‘\’ must also be escaped.
On the other hand, ‘\’ has no special meaning within a variable spec
(described below).
Each %%var%% variable can have additional attributes. The general form of
a variable spec is:
%%VarName[:cnv][:start[:length]][,brks]%%
where
VarName
is the variable to substitute in.
:cnv
sets a conversion method for the data in the variable. Note that the
data is first subjected to the length and break considerations before the
conversion. Supported conversions are:
b64 - Apply base64 decode.
url[Num] - Apply URL decode. Optional Num is a number between 1-99.
It is the number of times to apply URL decode.
Normally, only 1 conversion is used. If both are specified (in any order),
URL decode is always done before base64 decode.
:start
is the starting byte position of the extracted data to substitute.
The first byte has position 0. Default is 0.
:length
is the number of bytes (from start) to substitute. Default is
till the end.
,brks
defines a list of characters at which substitution of the variable’s
value should stop.
See -mapc for a usage example.
Some of the data processing options can be placed in conditional groups such that different processing rules can be applied depending on the logical result of another rule. The basic form of a conditional group is:
-if[not] RuleToCheck RuleToRun ... -elif[not] RuleToCheck RuleToRun ... -else RuleToRun ... -endif
Groups can be nested to form more complex conditions.
Supported RuleToCheck
and RuleToRun
are
-eval, -mapf, -mapc, -kenc, -kdec,
-filt, -map, -sub, -grep, -cmb, -pmod,
-o and -udb. Note that some of these rules may be responsible for the
initialization of dynamically created columns. If such rules get skipped
conditionally, numeric 0 or blank string will be assigned to the
uninitialized columns.
There are 2 special RuleToCheck:
-true - Evaluate to true.
-false - Evaluate to false.
In addition, there are 3 special RuleToRun for output record disposition
control (they do not change any data):
-skip - Do not output the current row.
-quit - Stop processing entirely.
-quitafter - Stop processing after the current input record.
Example:
$ aq_pp ... -d i:Col1 ... -if -filt 'Col1 == 1' -eval s:Col2 '"Is-1"' -elif -filt 'Col1 == 2' -false -else -eval Col2 '"Others"' -endif ...
$ aq_pp ... -d i:Col1 s:Col2 ... -if -filt 'Col1 == 1' -o Out1 -elif -filt 'Col1 == 2' -o Out2 -c Col2 -endif ...