Record preprocessor
aq_pp [-h] Global_Opt Input_Spec Prep_Spec Process_Spec Output_Spec
Global_Opt:
[-verb] [-stat] [-test]
Input_Spec:
[-f[,AtrLst] File [File ...]] [-d ColSpec [ColSpec ...]] |
[-exp[,AtrLst]|-cnt[,AtrLst] DbName[:TabName] [ExpOpts ...] --]
[-cat[,AtrLst] File [File ...] ColSpec [ColSpec ...]]
Prep_Spec:
[-seed RandSeed]
[-var ColSpec Val]
[-alias ColName AltName]
[-renam ColName NewName]
Process_Spec:
[-eval ColSpec|ColName Expr]
[-mapf[,AtrLst] ColName MapFrom] [-mapc ColSpec|ColName MapTo]
[-kenc ColSpec|ColName ColName [ColName ...]]
[-kdec ColName ColSpec|ColName[+] [ColSpec|ColName[+] ...]]
[-filt FilterSpec]
[-map[,AtrLst] ColName MapFrom MapTo]
[-sub[,AtrLst] ColName File [File ...] [ColSpec ...]]
[-grep[,AtrLst] ColName File [File ...] [ColSpec ...]]
[-cmb[,AtrLst] File [File ...] ColSpec [ColSpec ...]]
[-pmod ModSpec [ModSrc]]
Output_Spec:
[-o[,AtrLst] File] [-c ColName [ColName ...]]
[-ovar[,AtrLst] File [-c ColName [ColName ...]]]
[-imp[,AtrLst] DbName[:TabName] [-server AdrSpec [AdrSpec ...]] [-local]
[-mod ModSpec [ModSrc]]]
aq_pp is a stream-based record processing tool.
It loads and processes records one at a time through these simple steps:
Other characteristics of the tool include:
With its stream-based design, aq_pp can process an unlimited amount of
data using a constant amount of memory.
For this reason, it is well suited for the pre-processing of large amounts of
raw data, where the extracted and transformed result is used to generate
higher level analytics.
-test
Test command line arguments and exit.
If specified twice (-test -test), a more thorough test will be
attempted. For example, the program will try to load lookup files and
connect to Udb in test mode.
-verb
-stat
Print a record count summary line to stderr at the end of processing. The line has the form:
aq_pp: rec=Count err=Count out=Count
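A script that wraps aq_pp may want to read the counts back from this line. The following is a minimal Python sketch, assuming each Count is printed as a plain decimal integer (the manual does not state the number format); parse_stat_line is a hypothetical helper name, not part of aq_pp.

```python
import re

# Pattern for the summary line documented for -stat:
#   aq_pp: rec=Count err=Count out=Count
# Assumption: each Count is a plain decimal integer.
STAT_RE = re.compile(r"aq_pp: rec=(\d+) err=(\d+) out=(\d+)\s*")


def parse_stat_line(line):
    """Return the three counts as a dict, or None if the line does not match."""
    m = STAT_RE.fullmatch(line)
    if m is None:
        return None
    rec, err, out = (int(g) for g in m.groups())
    return {"rec": rec, "err": err, "out": out}


print(parse_stat_line("aq_pp: rec=1000 err=2 out=998"))
```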
-f[,AtrLst] File [File ...]
Set the input attributes and files. See the aq_tool input specifications manual for details.
Example:
$ aq_pp ... -f,+1l file1 file2 ...
-d ColSpec [ColSpec ...]
Define the input data columns.
See the aq_tool input specifications manual for details.
In general, ColSpec has the form Type[,AtrLst]:ColName.
Supported Types are:
S - String.
F - Double precision floating point.
L - 64-bit unsigned integer.
LS - 64-bit signed integer.
I - 32-bit unsigned integer.
IS - 32-bit signed integer.
IP - v4/v6 address.
Optional AtrLst is a comma separated list of column specific attributes.
ColName is the column name (case insensitive). It can contain up to
31 alphanumeric and ‘_’ characters. Its first character cannot be a digit.
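The ColSpec rules above can be checked mechanically. The sketch below is an illustration in Python, not aq_pp's own parser. It assumes that type codes are case insensitive (the examples in this manual write them in lower case) and that "alphanumeric" means ASCII letters and digits; parse_colspec is a hypothetical name.

```python
import re

SUPPORTED_TYPES = {"S", "F", "L", "LS", "I", "IS", "IP"}

# Up to 31 alphanumeric or '_' characters; the first cannot be a digit.
# Assumption: "alphanumeric" means ASCII letters and digits.
NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,30}")


def parse_colspec(spec):
    """Split 'Type[,AtrLst]:ColName' into (type, attributes, name)."""
    head, sep, name = spec.partition(":")
    if not sep:
        raise ValueError("missing ':' in %r" % spec)
    parts = head.split(",")
    col_type = parts[0].upper()  # assumption: type codes are case insensitive
    if col_type not in SUPPORTED_TYPES:
        raise ValueError("unsupported type %r" % parts[0])
    if NAME_RE.fullmatch(name) is None:
        raise ValueError("invalid column name %r" % name)
    return col_type, parts[1:], name


print(parse_colspec("s,lo:Col2"))
```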
Example:
$ aq_pp ... -d s:Col1 s,lo:Col2 i,trm:Col3 ...
The trm attribute removes blanks around the value before it is converted to
an internal number.
-exp[,AtrLst]|-cnt[,AtrLst] DbName[:TabName] [ExpOpts ...] --
Get the input data from an Udb export or count operation.
This will set the data source as well as the column definitions,
so -f and -d are not needed.
DbName is the database name (see Target Udb Database).
TabName is a table/vector name in the database to export.
If TabName is not given or if it is a “.” (a dot), the primary keys
will be exported/counted.
Optional AtrLst is a comma separated list containing:
spec=UdbSpec - Set the spec file directly (see Target Udb Database).
ExpOpts are the -exp or -cnt related options as described in
aq_udb (except -o which is not applicable here).
A -- must be specified following the last ExpOpts. Options given
after -- will be interpreted as aq_pp options.
Example:
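$ aq_pp ... -exp mydb:Test -filt 'Col3 > 123456789' -- ...
$ aq_pp ... -exp mydb:Test -- -filt 'Col3 > 123456789' ...
In the first command, -filt comes before -- and is therefore one of the
ExpOpts; in the second, it comes after -- and is interpreted by aq_pp.
-cat[,AtrLst] File [File ...] ColSpec [ColSpec ...]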
Add rows from Files to the -f data set.
The file and column specifications are the same as in the -f and -d
options.
See the aq_tool input specifications manual for details.
Note that the columns need not be the same as those from -d (by name).
If they differ, a super set is constructed.
Multiple -cat can be used such that the final data set will contain
unique columns from -d and all -cat.
Columns that do not exist in a data set will be set to zero or blank
when that data set is loaded.
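The superset rule can be pictured with a small Python sketch. This is only an illustration of the behavior described above, not how aq_pp is implemented; the function name, the (name, type) pairs and the dict-based rows are all assumptions made for the example, and only string ("s") and integer ("i") columns are modeled.

```python
def concatenate(data_sets):
    """Concatenate data sets whose columns may differ.

    data_sets is a list of (columns, rows) pairs. columns is a list of
    (name, type) pairs and rows is a list of dicts keyed by column name.
    """
    types = {}  # column name -> type, in order of first appearance
    for columns, _ in data_sets:
        for name, col_type in columns:
            types.setdefault(name, col_type)

    out = []
    for _, rows in data_sets:
        for row in rows:
            # A column missing from this data set becomes blank or zero.
            out.append({name: row.get(name, "" if t == "s" else 0)
                        for name, t in types.items()})
    return list(types), out


main = ([("Col1", "s"), ("Col2", "s"), ("Col3", "i"), ("Col4", "s")],
        [{"Col1": "a", "Col2": "b", "Col3": 1, "Col4": "d"}])
more = ([("Col3", "i"), ("Col1", "s"), ("Col5", "s"), ("Col6", "s")],
        [{"Col3": 2, "Col1": "e", "Col5": "f", "Col6": "g"}])
columns, rows = concatenate([main, more])
print(columns)
```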
Example:
$ aq_pp ... -d s:Col1 s:Col2 i:Col3 s:Col4 ... -cat more.csv i:Col3 s:Col1 s:Col5 s:Col6 ...
The -cat data set comes from “more.csv”. Columns Col3 and Col1 are common,
so the resulting data set will have Col1, Col2, Col3, Col4, Col5 and Col6.
Since the main data set does not have Col5 and Col6, they are set to
blank when it is loaded.
Similarly, since “more.csv” does not have Col2 and Col4,
they are set to blank when it is loaded.
-seed RandSeed
Set the seed for the $Random evaluation builtin variable.
Default seed is 1.
-var ColSpec Val
Define a new variable and initialize its value to Val.
A variable stores a value that persists between rows over the entire run.
Recall that normal column values change from row to row.
ColSpec is the variable’s spec in the form Type:ColName where Type
is the data type and ColName is the variable’s name, see -d for details.
Note that a string Val must be quoted,
see String Constant spec for details.
A variable can also be used in conjunction with -o,fvar VarName to
specify a dynamic output target (the variable must be a string in this case).
See the fvar description under -o for details.
Example:
$ aq_pp ... -d i:Col1 ... -var 'i:Sum' 0 ... -eval 'Sum' 'Sum + Col1' ...
-alias ColName AltName
AltName is the desired alias. An alias allows the same column to be
addressed using multiple names.
If the original column is no longer needed, use -renam instead.
-renam ColName NewName
NewName is the new name of the column/variable/alias.
-eval ColSpec|ColName Expr
Evaluate Expr and save the result to a column. The column can be a new
column, an existing column/variable or null as explained below.
If - is given, the result will not be saved anywhere. This is
useful when calling a function that puts its result in designated columns
by itself.
If ColSpec is given, a new column will be created using the spec.
See -d for details. Note that the new column cannot participate in Expr.
Expr is the expression to evaluate.
Data type of the evaluated result must be compatible with the data type of
the target column. For example, string result for a string column and
numeric result for a numeric column (there is no automatic type conversion;
however, explicit conversion can be done using the To*() functions
described below).
Operands in the expression can be the names of previously defined columns or
variables, constants, builtin variables and functions.
The conversion functions are ToIP(), ToF(), ToI() and ToS().
Builtin variables:
$Random
$RowNum
$CurSec
$CurUSec
Standard functions:
See aq-emod for a list of supported functions.
Example:
$ aq_pp ... -d i:Col1 ... -eval l:Col_evl 'Col1 * 10' ...
$ aq_pp ... -d s:Col1 s:Col2 ... -eval is:Dt 'DateToTime(Col2, "Y.m.d.H.M.S.p") - DateToTime(Col1, "Y.m.d.H.M.S.p")' ...
-mapf[,AtrLst] ColName MapFrom
Extract data from a string column. This option should be used in
conjunction with -mapc.
ColName is a previously defined column/variable to extract data from.
MapFrom defines the extraction rule.
Optional AtrLst is a comma separated list containing:
ncas - Do case insensitive match (default is case sensitive).
For ASCII data only.
If any of the regular expression related attributes are enabled, then
MapFrom must use the RegEx MapFrom Syntax.
Otherwise, it must use the RT MapFrom Syntax.
-mapc ColSpec|ColName MapTo
Render data extracted via previous -mapf into a new column or into an existing column/variable. The column must be of string type.
If ColSpec is given, a new column will be created using the spec.
See -d for details.
If ColName is given, it must refer to a previously defined
column/variable.
MapTo is the rendering spec. See MapTo Syntax for details.
Example:
$ aq_pp ... -d s:Col1 s:Col2 s:Col3 ... -mapf Col1 '%%v1_beg%%.%%v1_end%%' -mapf,rx Col2 '\(.*\)-\(.*\)' -mapf,rx Col3 '\(.*\)_\(.*\)' -mapc s:Col_beg '%%v1_beg%%,%%1%%,%%4%%' -mapc s:Col_end '%%v1_end%%,%%2%%,%%5%%' ...
The regular expression based MapFrom expressions do not have named
placeholders for the extracted data. Placeholders are interpreted
implicitly from the expressions in this way:
%%0%% - Represent the entire match in the first -mapf,rx (not used in example).
%%1%% - Represent the 1st subpattern match in the first -mapf,rx.
%%2%% - Represent the 2nd subpattern match in the first -mapf,rx.
%%3%% - Represent the entire match in the second -mapf,rx (not used in example).
%%4%% - Represent the 1st subpattern match in the second -mapf,rx.
%%5%% - Represent the 2nd subpattern match in the second -mapf,rx.
-kenc ColSpec|ColName ColName [ColName ...]
Encode a key column from the given ColNames.
The key column must be of string type.
The encoded value it stores contains binary data.
If ColSpec is given, a new column will be created using the spec.
See -d for details.
If ColName is given, it must refer to a previously defined
column/variable.
The source ColNames must be previously defined.
They can have any data type.
Example:
$ aq_pp ... -d s:Col1 i:Col2 ip:Col3 ... -kenc s:Key1 Col1 Col2 Col3 ...
-kdec ColName ColSpec|ColName[+] [ColSpec|ColName[+] ...]
Decode a key column given by ColName into one or more columns
given by ColSpec (new column) or ColName (existing column/variable).
The key ColName must be an existing string column/variable.
For the decode-to columns, possible specs are:
Type:ColName[+]
ColName[+]
Type:[+]
Note that the decode-to column types must match those used in the original -kenc spec.
Example:
$ aq_pp ... -d s:Key1 ... -kdec Key1 s:Col1 i:Col2 ip:Col3 ...
$ aq_pp ... -d s:Key1 ... -kdec Key1 s: i:Col2 ip: ...
$ aq_pp ... -d s:Key1 ... -kdec Key1 s: i:Col2+ ip:+ -kdec Key1 i: ip:Col3 ...
-filt FilterSpec
Filter (or select) records based on FilterSpec.
FilterSpec is a logical expression that evaluates to either true or false
for each record - if true, the record is selected; otherwise, it is
discarded.
It has the basic form [!] LHS [<compare> RHS] where:
! negates the result of the comparison.
It is recommended that !(...) be used to clarify the intended
operation even though it is not required.
==, >, <, >=, <= - LHS and RHS comparison.
~==, ~>, ~<, ~>=, ~<= - LHS and RHS case insensitive comparison; string type only.
!=, !~= - Negation of the above equal operators.
&= - Perform a “(LHS & RHS) == RHS” check; numeric types only.
!&= - Negation of the above.
& - Perform a “(LHS & RHS) != 0” check; numeric types only.
!& - Negation of the above.
More complex expressions can be constructed by using (...) (grouping),
! (negation), || (or) and && (and).
For example:
LHS_1 == RHS_1 && !(LHS_2 == RHS_2 || LHS_3 == RHS_3)
Example:
$ aq_pp ... -d s:Col1 s:Col2 i:Col3 s:Col4 ... -filt 'Col1 === Col4 && Col2 != "" && Col3 >= 100' ...
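The two bit-mask operators are easy to confuse. The Python sketch below simply restates the checks quoted in the operator list, “(LHS & RHS) == RHS” for &= and “(LHS & RHS) != 0” for &; the function names are made up for the illustration.

```python
def has_all_bits(lhs, rhs):
    """The '&=' check: every bit set in rhs is also set in lhs."""
    return (lhs & rhs) == rhs


def has_any_bit(lhs, rhs):
    """The '&' check: lhs and rhs share at least one set bit."""
    return (lhs & rhs) != 0


flags = 0b0110
print(has_all_bits(flags, 0b0010))  # bit 1 is set in flags
print(has_all_bits(flags, 0b0011))  # bit 0 is not set in flags
print(has_any_bit(flags, 0b0011))   # bit 1 is shared
print(has_any_bit(flags, 0b1001))   # no shared bit
```

The negated forms, !&= and !&, are the logical inverse of these two results.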
-map[,AtrLst] ColName MapFrom MapTo
Remap (a.k.a., rewrite) a string column’s value.
ColName is a previously defined column/variable.
MapFrom defines the extraction rule.
MapTo is the rendering spec. See MapTo Syntax for details.
Optional AtrLst is a comma separated list containing:
ncas - Do case insensitive match (default is case sensitive).
For ASCII data only.
If any of the regular expression related attributes are enabled, then
MapFrom must use the RegEx MapFrom Syntax.
Otherwise, it must use the RT MapFrom Syntax.
Example:
$ aq_pp ... -d s:Col1 ... -map Col1 '%%v1_beg%%-%*' 'beg=%%v1_beg%%' ...
$ aq_pp ... -d s:Col1 ... -map,rx Col1 '\(.*\)-*' 'beg=%%1%%' ...
-sub[,AtrLst] ColName File [File ...] [ColSpec ...]
Replace the values of ColName, a string column in the current data set,
with values from a lookup table loaded from Files.
Optional AtrLst is a comma separated list containing:
ncas - Do case insensitive match (default is case sensitive).
For ASCII data only.
pat - Support ‘?’ and ‘*’ wild cards in the “From” value. Literal ‘?’,
‘*’ and ‘\’ must be escaped by a ‘\’. Without this attribute,
“From” value is assumed constant and no escape is necessary.
req - Discard records not matching any entry in the lookup table.
Normally, column value will remain unchanged if there is no match.
all - Use all matches. Normally, only the first match is used.
With this attribute, one row is produced for each match.
ColSpecs define the input columns as
described in the aq_tool input specifications manual.
The spec is optional, default is “S:from S:to” (or just “from to”).
If a spec is defined, it must include these 2 columns (by name):
from - Marks the column used to match the value of ColName.
It must have a string type.
to - Marks the column used as the new value of ColName.
It must have a string type.
The from values are generally literals. Patterns can be used if
the pat attribute described above is set.
The to values are always literals.
Matches are carried out according to the order of the match value in the
files. Match stops when the first match is found. If the files contain both
exact value and pattern, then:
Example:
$ aq_pp ... -d s:Col1 ... -sub Col1 lookup.csv TO X FROM ...
The lookup file is not in the default “from to” format, so the column spec must be
given. The X in the spec marks an unneeded column.
-grep[,AtrLst] ColName File [File ...] [ColSpec ...]
Filter by matching the value of ColName, a string column in the current
data set, against the values loaded from Files.
Optional AtrLst is a comma separated list containing:
ncas - Do case insensitive match (default is case sensitive).
For ASCII data only.
pat - Support ‘?’ and ‘*’ wild cards in the “From” value. Literal ‘?’,
‘*’ and ‘\’ must be escaped by a ‘\’. Without this attribute,
match value is assumed constant and no escape is necessary.
ColSpecs define the input columns as
described in the aq_tool input specifications manual.
The spec is optional, default is “S:from” (or just “from”).
If a spec is defined, it must include 1 column (by name):
from - Marks the column used to match the value of ColName.
It must have a string type.
The from values are generally literals. Patterns can be used if
the pat attribute described above is set.
Matches are carried out according to the order of the match value in the
files. Match stops when the first match is found. If the files contain both
exact value and pattern, then:
Example:
$ aq_pp ... -d s:Col1 ... -grep,rev Col1 lookup.csv X X FROM ...
The X’s in the spec mark the unneeded columns.
-cmb[,AtrLst] File [File ...] ColSpec [ColSpec ...]
Combine data from Files into the current data set by joining rows
from both data sets. The new data set will contain unique columns from
both sets. Common columns are automatically used as the join keys
(see ColSpec description on how to customize join keys).
Optional AtrLst is a comma separated list containing:
ncas - Do case insensitive match (default is case sensitive).
For ASCII data only.
req - Discard unmatched records.
all - Use all matches. Normally, only the first match is used.
With this attribute, one row is produced for each match.
mrg - Use merge mode. Records in the current data set and
in the combine files must already be sorted according to the combine keys
in the same order (default is ascending unless dec is given).
Use this approach if the combine data is too large to fit into memory.
dec - Same as mrg except that all the data are sorted in descending
order.
ColSpecs define the input columns as
described in the aq_tool input specifications manual,
with these column attribute extensions:
key - Marks a column as being a join key. It must be a common column.
This is the default for a common column.
cmb - Marks a column to be combined into the current data set.
This is the default for a non-common column.
It is typically used to mark a common column as not a join key.
Example:
$ aq_pp ... -d s:Col1 s:Col2 i:Col3 s:Col4 ... -cmb lookup.csv i:Col3 s:Col1 s:Col5 s:Col6 ...
$ aq_pp ... -d s:Col1 s:Col2 i:Col3 s:Col4 ... -cmb lookup.csv i:Col3 s:Col1 s:Col5 s:Col6 s,cmb:Col2 ...
$ aq_pp ... -d s:Col1 s:Col2 i:Col3 s:Col4 ... -cmb lookup.csv i,key:Col3 s,key:Col1 s,cmb:Col5 s,cmb:Col6 s,cmb:Col2 ...
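As a rough mental model of the default (non-merge) mode, -cmb behaves like a keyed lookup join. The Python sketch below illustrates the req and all attributes only. It is not aq_pp's implementation: the function and parameter names are hypothetical, rows are plain dicts, and passing an unmatched row through unchanged is an assumption (the manual only says that req discards unmatched records).

```python
def combine(rows, lookup_rows, keys, req=False, all_matches=False):
    """Join rows against lookup_rows on the given key columns."""
    index = {}
    for lrow in lookup_rows:
        index.setdefault(tuple(lrow[k] for k in keys), []).append(lrow)

    out = []
    for row in rows:
        matches = index.get(tuple(row[k] for k in keys), [])
        if not matches:
            if not req:
                out.append(dict(row))  # assumption: passed through as-is
            continue
        # Only the first match is used unless 'all' is requested.
        for lrow in (matches if all_matches else matches[:1]):
            merged = dict(row)
            for col, val in lrow.items():
                if col not in keys:  # non-key columns are combined in
                    merged[col] = val
            out.append(merged)
    return out


rows = [{"Col1": "a", "Col2": "x", "Col3": 1, "Col4": "p"},
        {"Col1": "b", "Col2": "y", "Col3": 2, "Col4": "q"}]
lookup = [{"Col3": 1, "Col1": "a", "Col5": "m", "Col6": "n"},
          {"Col3": 1, "Col1": "a", "Col5": "m2", "Col6": "n2"}]
for r in combine(rows, lookup, ["Col1", "Col3"]):
    print(r)
```

With the sample data, the default call keeps both input rows, req=True keeps only the matched one, and all_matches=True emits one output row per lookup match.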
-pmod ModSpec [ModSrc]
Use the processing function in the given module to process the current record. The function is typically used to implement custom logic.
ModSpec has the form ModName or ModName("Arg1", "Arg2", ...)
where ModName is the module name and Arg* are module dependent
arguments. Note that the arguments must be string constants;
for this reason, they must be quoted according to the
string constant spec.
ModSrc is an optional module source file. It can be:
.so extension.
Without ModSrc, aq_pp will look for a preinstalled module matching
ModName. Standard modules:
unwrap_strv("From_Col", "From_Sep", "To_Col" [, "AtrLst"])
Unwrap a delimiter separated string column into zero or more values. The row will be replicated for each of the unwrapped values. This module requires 3 or 4 arguments:
From_Col - Column containing the string value to unwrap.
It must have type S.
From_Sep - The single byte delimiter that separates individual
values. The delimiter must be given as-is, no escape is recognized.
To_Col - Column to save each unwrapped value to.
It must have type S. The To_Col can be the same as the From_Col;
the module will remember the original From_Col value.
AtrLst - Optional. A comma separated attribute list containing:
relax - No trailing delimiter. One is expected by default.
noblank - Skip blank values. Blanks are kept by default.
[-o[,AtrLst] File] [-c ColName [ColName ...]]
Output data rows. Multiple sets of “-o ... -c ...” can be specified.
Optional “-o[,AtrLst] File” sets the output attributes and file.
See the aq_tool output specifications manual for details.
In addition, the following attribute is supported:
fvar - Output to a dynamically defined target. File is the name of
a previously defined string variable. The actual target
file is obtained from the value of the variable.
The initial value of the variable sets the initial file. Subsequently,
when the value of the variable changes, the old output will be closed
and the new one will be opened.
Optional “-c ColName [ColName ...]” selects the columns to output.
Normally, each selection is the name of a previously defined column/variable.
In addition, these special forms are supported:
* - An asterisk adds all columns (except variables) to the output.
ColName[:NewName][+NumPrintFormat] - Add ColName to the output.
If :NewName is given, it will be used as the output label.
The +NumPrintFormat spec is for numeric columns. It overrides the
print format of the column (be careful with this format - a wrong spec
can crash the program).
^ColName[:NewName][+NumPrintFormat] - Same as the above, but with a
leading ^ mark. It is used to modify the output label and/or format
of a previously selected output column called ColName.
If ^ColName[...] is the first selection after -c, then * will be
included automatically first.
~ColName - The leading ~ mark is used to exclude a previously
selected output column called ColName.
If ~ColName is the first selection after -c, then * will be
included automatically first.
If -o is given without a -c, then * is assumed.
If -c is given without a prior -o, the selected columns will
be output to stdout.
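The way the special forms of -c interact can be traced with a short Python sketch. It is a simplified illustration rather than aq_pp's logic: select_columns is a hypothetical name, the +NumPrintFormat part is not handled, and names are compared case sensitively for brevity.

```python
def select_columns(all_columns, selections):
    """Resolve a -c selection list into the ordered output labels."""
    chosen = []  # ordered [column, label] pairs

    def add_all():
        present = {col for col, _ in chosen}
        chosen.extend([c, c] for c in all_columns if c not in present)

    for pos, sel in enumerate(selections):
        if pos == 0 and sel[:1] in ("^", "~"):
            add_all()  # '*' is implied when the first selection is ^ or ~
        if sel == "*":
            add_all()
        elif sel.startswith("~"):
            chosen[:] = [e for e in chosen if e[0] != sel[1:]]
        elif sel.startswith("^"):
            name, _, label = sel[1:].partition(":")
            for entry in chosen:
                if entry[0] == name and label:
                    entry[1] = label
        else:
            name, _, label = sel.partition(":")
            chosen.append([name, label or name])
    return [label for _, label in chosen]


print(select_columns(["Col1", "Col2", "Col3"], ["^Col1:ColX", "~Col3"]))
```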
Example:
$ aq_pp ... -d s:Col1 s:Col2 s:Col3 ... -o - -c Col2 Col1
$ aq_pp ... -d s:Col1 s:Col2 s:Col3 ... -c ^Col1:ColX ~Col3
This selects * (Col1, Col2 and Col3) implicitly.
Then change Col1’s label to ColX. Then exclude Col3. The final output
columns are ColX and Col2.
$ aq_pp ... -d s:Col1 s:Col2 s:Col3 ... -var i:Col4 0 -c '*' Col4
This selects * (Col1, Col2 and Col3) explicitly.
Then add the variable Col4.
$ aq_pp ... -var s:out1 '"first.csv"' ... -if -filt '...' -eval out1 '...' -endif ... -o,fvar out1 ...
The output target is taken from the string variable out1, initially “first.csv”.
When the -eval changes the value of out1, the output target changes as well.
In some cases the initial value of out1 is not known until run time.
If so, set its value to /dev/null in the -var statement.
-ovar[,AtrLst] File [-c ColName [ColName ...]]
Output the final values of all variables defined via the -var option.
Multiple sets of “-ovar ... -c ...” can be specified.
Only a single data row is output from each spec.
“-ovar[,AtrLst] File” sets the output attributes and file.
See the aq_tool output specifications manual for details.
In addition, the following attribute is supported:
fvar - Output to a dynamically defined target. File is the name of
a previously defined string variable. The actual target
file is obtained from the value of the variable.
The initial value of the variable sets the initial file. Subsequently,
when the value of the variable changes, the old output will be closed
and the new one will be opened.
Optional “-c ColName [ColName ...]” selects the variables to output.
Normally, each selection is the name of a previously defined variable.
In addition, these special forms are supported:
* - An asterisk adds all variables to the output.
ColName[:NewName][+NumPrintFormat] - Add ColName to the output.
If :NewName is given, it will be used as the output label.
The +NumPrintFormat spec is for numeric variables. It overrides the
print format of the variable (be careful with this format - a wrong spec
can crash the program).
^ColName[:NewName][+NumPrintFormat] - Same as the above, but with a
leading ^ mark. It is used to modify the output label and/or format
of a previously selected output variable called ColName.
If ^ColName[...] is the first selection after -c, then * will be
included automatically first.
~ColName - The leading ~ mark is used to exclude a previously
selected output variable called ColName.
If ~ColName is the first selection after -c, then * will be
included automatically first.
If -ovar is given without a -c, then * is assumed.
Example:
$ aq_pp ... -d i:Col1 i:Col2 ... -var i:Sum1 0 -var i:Sum2 0 ... -eval Sum1 'Sum1 + Col1' -eval Sum2 'Sum2 + (Col2 * Col2)' ... -ovar - -c Sum1 Sum2
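What this example computes can be written out as a plain loop. The Python sketch below only mirrors the documented semantics (variables persist across rows, and -ovar emits a single row of final values); it does not model the 32-bit range of the i type, and the function name is made up.

```python
def run(rows):
    """Mirror the -var / -eval / -ovar example on (Col1, Col2) pairs."""
    sum1 = 0  # -var i:Sum1 0
    sum2 = 0  # -var i:Sum2 0
    for col1, col2 in rows:
        sum1 = sum1 + col1             # -eval Sum1 'Sum1 + Col1'
        sum2 = sum2 + (col2 * col2)    # -eval Sum2 'Sum2 + (Col2 * Col2)'
    return {"Sum1": sum1, "Sum2": sum2}  # the single row output by -ovar


print(run([(1, 2), (3, 4), (5, 6)]))
```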
-imp[,AtrLst] DbName[:TabName] [-server AdrSpec [AdrSpec ...]] [-local] [-mod ModSpec [ModSrc]]
Output data to Udb (i.e., perform an Udb import).
DbName is the database name (see Target Udb Database).
TabName is a table/vector name in the database.
If TabName is not given or if it is a “.” (a dot), a primary key-only
import will be performed.
Columns (including variables) from the current data set matching
the column names of TabName are automatically selected for import.
In case certain desired columns in the current data set are named
differently from the columns of TabName, use -alias or -renam
to remap their names manually.
Optional AtrLst is a comma separated list containing:
spec=UdbSpec - Set the spec file directly (see Target Udb Database).
ddef - Allow missing target columns. Normally, it is an error when
a target column is missing from the current data set. With this attribute,
0 or blank will be used as the missing columns’ value.
nodelay - Send records to Udb servers as soon as possible.
Otherwise, up to 16KB of data may be buffered before an output occurs.
seg=N1[-N2]/N[:V] - Only import a subset of the input data by selecting
segment N1 or segments N1 to N2 (inclusive) out of N segments of
unique keys based on their hash values.
For example, seg=2-4/10 will divide the keys into 10 segments and
import segments 2, 3 and 4; segments 1 and 5-10 are skipped.
Optional V is a number that can be used to vary the sample selection.
It is zero by default.
nobnk - Exclude records with a blank key from the import.
This only applies when the primary key is made up of a single string column.
nonew - Tell the server not to create any new key during the
import. In other words, records belonging to keys not yet in the DB are
discarded.
noold - The opposite of nonew.
Optional “-server AdrSpec [AdrSpec ...]” sets the target servers.
If given, server spec in the Udb spec file will be ignored.
AdrSpec has the form IP_or_Domain[|IP_or_Domain_Alt][:Port].
See Target Udb Database for details.
Optional “-local” tells the program to connect to the local servers
only. Local servers are those in the server spec (from the Udb spec file or
-server option) whose IP matches the local
IP of the machine the program is running on.
Optional “-mod ModSpec [ModSrc]” specifies a module to be
loaded on the server side.
ModSpec has the form ModName or ModName(Arg1, Arg2, ...)
where ModName is the module name and Arg* are module dependent
arguments. Note that the arguments must be literals -
string constants (quoted), numbers or IP addresses.
ModSrc is an optional module source file containing:
.so extension.
Without ModSrc, the server will look for a preinstalled module matching
ModName.
Multiple sets of Udb import options can be specified.
Example:
$ aq_pp ... -d s:Col1 s:Col2 i:Col3 s:Col4 ... -imp mydb:Test
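To make the seg=N1[-N2]/N[:V] attribute more concrete, here is a Python sketch of hash-based segment selection. The spec parsing follows the form given above, but the hash itself is a stand-in: the manual does not say which hash function aq_pp uses or how V enters the computation, so zlib.crc32 and the "+ v" step are assumptions made purely for illustration.

```python
import zlib


def parse_seg(spec):
    """Parse 'N1[-N2]/N[:V]' into (n1, n2, n, v)."""
    body, _, v = spec.partition(":")
    rng, _, total = body.partition("/")
    lo, _, hi = rng.partition("-")
    n1 = int(lo)
    n2 = int(hi) if hi else n1
    n = int(total)
    if not 1 <= n1 <= n2 <= n:
        raise ValueError("invalid segment spec %r" % spec)
    return n1, n2, n, (int(v) if v else 0)


def key_selected(key, spec):
    """Return True if the key falls in the selected segment range."""
    n1, n2, n, v = parse_seg(spec)
    # Stand-in hash (assumption): crc32 shifted by V. Segments are 1..N.
    segment = (zlib.crc32(key.encode("utf-8")) + v) % n + 1
    return n1 <= segment <= n2


print(parse_seg("2-4/10"))
print([k for k in ("k1", "k2", "k3", "k4") if key_selected(k, "1-5/10")])
```

Because each key hashes to exactly one segment, running N imports with seg=1/N through seg=N/N would cover every key exactly once under this model.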
If successful, the program exits with status 0. Otherwise, the program exits with a non-zero status code along with error messages printed to stderr. Applicable exit codes are:
A string constant must be quoted between double or single quotes. With double quotes, special character sequences can be used to represent special characters. With single quotes, no special sequence is recognized; in other words, a single quote cannot occur between single quotes.
Character sequences recognized between double quotes are:
\\ - represents a literal backslash character.
\" - represents a literal double quote character.
\b - represents a literal backspace character.
\f - represents a literal form feed character.
\n - represents a literal new line character.
\r - represents a literal carriage return character.
\t - represents a literal horizontal tab character.
\v - represents a literal vertical tab character.
\0 - represents a NULL character.
\xHH - represents a character whose HEX value is HH.
\<newline> - represents a line continuation sequence; both the backslash
and the newline will be removed.
Sequences that are not recognized will be kept as-is.
Two or more quoted strings can be used back to back to form a single string. For example,
'a "b" c'" d 'e' f" => a "b" c d 'e' f
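The quoting rules above can be exercised with a small decoder. The Python sketch below follows the rules as stated (escapes only inside double quotes, unrecognized sequences kept as-is, back to back strings joined); it is an illustration written for this manual, not code taken from aq_pp, and it assumes no characters appear between adjacent quoted strings.

```python
ESCAPES = {"\\": "\\", '"': '"', "b": "\b", "f": "\f", "n": "\n",
           "r": "\r", "t": "\t", "v": "\v", "0": "\0",
           "\n": ""}  # backslash-newline is a line continuation
HEX_DIGITS = "0123456789abcdefABCDEF"


def parse_string_constant(text):
    """Decode one or more back to back quoted strings into a single value."""
    out = []
    i = 0
    while i < len(text):
        quote = text[i]
        if quote not in ("'", '"'):
            raise ValueError("expected a quote at offset %d" % i)
        i += 1
        while True:
            if i >= len(text):
                raise ValueError("unterminated string constant")
            ch = text[i]
            if ch == quote:
                i += 1
                break
            if quote == '"' and ch == "\\" and i + 1 < len(text):
                nxt = text[i + 1]
                if nxt in ESCAPES:
                    out.append(ESCAPES[nxt])
                    i += 2
                    continue
                if (nxt == "x" and i + 3 < len(text)
                        and text[i + 2] in HEX_DIGITS
                        and text[i + 3] in HEX_DIGITS):
                    out.append(chr(int(text[i + 2:i + 4], 16)))
                    i += 4
                    continue
                out.append(ch + nxt)  # unrecognized sequence: keep as-is
                i += 2
                continue
            out.append(ch)
            i += 1
    return "".join(out)


print(parse_string_constant("'a \"b\" c'\" d 'e' f\""))
```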
RT MapFrom Syntax
RT style MapFrom is used in both -mapf and -map options. The MapFrom spec is used to match and/or extract data from a string column’s value. It has this general syntax:
literal_1%*literal_2%?literal_3 -
%* matches any number of bytes and %? matches any 1 byte.
This is like a pattern comparison.
%%my_var%% -
Extract the value into a variable named my_var. my_var can later be
used in the MapTo spec.
literal_1%%my_var_1%%literal_2%%my_var_2%% -
A common way to extract specific data portions.
literal_1%=literal_2%=literal_3 -
%= is used to toggle case sensitive/insensitive match. In the above case,
if -mapf or -map does not have the ncas attribute, then
literal_1’s match will be case sensitive, but literal_2’s will be
case insensitive, and literal_3’s will be case sensitive again.
\%\%not_var\%\%%%my_var%%a_backslash\\others -
If a ‘%’ is used in such a way that resembles an unintended MapFrom spec,
the ‘%’ must be escaped. Literal ‘\’ must also be escaped.
In summary, the following escape sequences are recognized:
\% - represents a literal percent character.
\\ - represents a literal backslash character.
\" - represents a literal double quote character.
\b - represents a literal backspace character.
\f - represents a literal form feed character.
\n - represents a literal new line character.
\r - represents a literal carriage return character.
\t - represents a literal horizontal tab character.
\v - represents a literal vertical tab character.
\0 - represents a NULL character.
\xHH - represents a character whose HEX value is HH.
\<newline> - represents a line continuation sequence; both the backslash
and the newline will be removed.
Each %%var%% variable can have additional attributes. The general form of
a variable spec is:
%%VarName[:@class][:[chars]][:min[-max]][,brks]%%
where
VarName is the variable name which can be used in MapTo. VarName can be a
‘*’; in this case, the extracted data is not stored, but the extraction
attributes are still honored.
Note: Do not use numbers as an RT mapping variable name.
:@class restricts the extracted data to belong to a class of characters.
class is a code with these values and meanings:
n - Characters 0-9.
a - Characters a-z.
b - Characters A-Z.
c - All printable ASCII characters.
x - The opposite of c above.
s - All whitespaces.
g - Characters in {}[]().
q - Single/double/back quotes.
Multiple classes can be used; e.g., %%my_var:@nab%% for all alphanumerics.
:[chars] ([] is part of the syntax) is similar to the character class
described above except that the allowed characters are set explicitly.
Note that ranges are not supported, all characters must be specified.
For example, %%my_var:[0123456789abcdefABCDEF]%% (same as
%%my_var:@n:[abcdefABCDEF]%%) for hex digits. To include a ‘]’
as one of the characters, put it first, as in %%my_var:[]xyz]%%.
:min[-max] is the min and optional max length (bytes, inclusive) to
extract. Without a max, the default is unlimited (actually ~64Kb).
,brks defines a list of characters at which extraction of the variable
should stop. For example, %%my_var,,;:%% will extract data into my_var
until one of ,;: or end-of-string is encountered. This usage is often
followed by a wild card, as in %%my_var,,;:%%%*.
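For readers more familiar with regular expressions, a small subset of the RT MapFrom syntax can be translated into a Python regex. The sketch below is an approximation for illustration only. It handles literals, %*, %?, plain %%name%% variables and the \% and \\ escapes; variable attributes, %= and the ‘*’ variable name are not supported. It also assumes that the whole value must match and that variables take the shortest possible text, neither of which is stated in this manual.

```python
import re


def rt_to_regex(map_from):
    """Translate a simple RT MapFrom spec into a compiled regex."""
    parts = []
    i = 0
    n = len(map_from)
    while i < n:
        if map_from.startswith("%%", i):
            end = map_from.index("%%", i + 2)
            name = map_from[i + 2:end]
            parts.append("(?P<%s>.*?)" % name)  # assumption: shortest match
            i = end + 2
        elif map_from.startswith("%*", i):
            parts.append(".*?")  # any number of characters
            i += 2
        elif map_from.startswith("%?", i):
            parts.append(".")    # exactly one character
            i += 2
        elif map_from[i] == "\\" and map_from[i + 1:i + 2] in ("%", "\\"):
            parts.append(re.escape(map_from[i + 1]))
            i += 2
        else:
            parts.append(re.escape(map_from[i]))
            i += 1
    return re.compile("".join(parts), re.DOTALL)


def rt_extract(map_from, value):
    """Return the extracted variables, or None if the value does not match."""
    m = rt_to_regex(map_from).fullmatch(value)
    return None if m is None else m.groupdict()


print(rt_extract("%%v1_beg%%.%%v1_end%%", "abc.def"))
```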
RegEx MapFrom Syntax
Regular expression style MapFrom can be used in both -mapf and -map
options. MapFrom defines what to match and/or extract from a string
value of a column.
Both the POSIX and PCRE (Perl Compatible regular expression) engines are
supported. Which one to use depends on the mapping option’s attributes.
See regular expression attributes for
the appropriate attributes.
Differences between RegEx mapping and RT mapping:
A regular expression can match a substring; to require a match of the
entire value, anchor the expression as in ^pattern$.
Extracted data is referenced through the implicit placeholders %%0%%, %%1%%,
and so on. See -mapc for a usage example. The PCRE engine can optionally
use named variables.
In addition to the standard regular expression escape sequences
(\\, \+, \*, etc), the following are also recognized:
\" - represents a literal double quote character.
\b - represents a literal backspace character.
\f - represents a literal form feed character.
\n - represents a literal new line character.
\r - represents a literal carriage return character.
\t - represents a literal horizontal tab character.
\v - represents a literal vertical tab character.
\0 - represents a NULL character.
\xHH - represents a character whose HEX value is HH.
\<newline> - represents a line continuation sequence; both the backslash
and the newline will be removed.
Regular expression is very powerful but also complex. Please consult the POSIX or PCRE2 regular expression manuals for details.
MapTo Syntax
MapTo is used in -mapc and -map. It renders the data extracted by MapFrom into a column. Both RT and RegEx MapTo share the same syntax:
%%my_var%% -
Substitute the value of my_var.
literal_1%%my_var_1%%literal_2%%my_var_2%% -
A common way to render extracted data.
\%\%not_var\%\%%%my_var%%a_backslash\\others -
If a ‘%’ is used in such a way that resembles an unintended MapTo spec,
the ‘%’ must be escaped. Literal ‘\’ must also be escaped.
See RT MapFrom Syntax for all supported escape sequences.
Each %%var%% variable can have additional attributes. The general form of
a variable spec is:
%%VarName[:cnv][[:start]:length][,brks]%%
where
VarName is the variable to substitute in.
:cnv sets a conversion method on the data in the variable. Note that the
data is first subjected to the length and break considerations before the
conversion. Supported conversions are:
b64 - Apply base64 decode.
url[Num] - Apply URL decode. Optional Num is a number between 1-99.
It is the number of times to apply URL decode.
Normally, only use 1 conversion. If both are specified (in any order), URL decode is always done before base64 decode.
:length (without a start position spec) is the number of bytes from the
beginning of the extracted data to substitute. Default is till the end.
:start:length is the starting byte position and subsequent length of the
extracted data to substitute. The first byte has position 0.
,brks defines a list of characters at which substitution of the variable’s
value should stop.
See -mapc for a usage example.
Target Udb Database
aq_pp obtains information about the target Udb database from a spec file.
The spec file contains server IPs (or domain names) and table/vector
definitions. See udb.spec for details.
aq_pp finds the relevant spec file in several ways:
From the spec=UdbSpec attribute of the -imp or -exp option.
From the DbName parameter of the -imp or -exp option. This method sets the
spec file to “.conf/DbName.spec” in the runtime directory of aq_pp.
Otherwise, the spec file is taken to be “udb.spec” in the runtime directory of aq_pp.
Some of the data processing options can be placed in conditional groups such that different processing rules can be applied depending on the logical result of another rule. The basic form of a conditional group is:
-if[not] RuleToCheck
  RuleToRun
  ...
-elif[not] RuleToCheck
  RuleToRun
  ...
-else
  RuleToRun
  ...
-endif
Groups can be nested to form more complex conditions.
Supported RuleToCheck and RuleToRun are
-eval, -mapf, -mapc, -kenc, -kdec,
-filt, -map, -sub, -grep, -cmb, -pmod,
-o and -imp. Note that some of these rules may be responsible for the
initialization of dynamically created columns. If such rules get skipped
conditionally, numeric 0 or blank string will be assigned to the
uninitialized columns.
There are 2 special RuleToCheck:
-true - Evaluate to true.
-false - Evaluate to false.
In addition, there are 3 special RuleToRun for output record disposition
control (they do not change any data):
-skip - Do not output current row.
-quit - Stop processing entirely.
-quitafter - Stop processing after the current input record.
Example:
$ aq_pp ... -d i:Col1 ... -if -filt 'Col1 == 1' -eval s:Col2 '"Is-1"' -elif -filt 'Col1 == 2' -false -else -eval Col2 '"Others"' -endif ...
$ aq_pp ... -d i:Col1 s:Col2 ... -if -filt 'Col1 == 1' -o Out1 -elif -filt 'Col1 == 2' -o Out2 -c Col2 -endif ...