aq-input: aq_tool input specifications

Synopsis

aq_command ...
  ...
  -f[,AtrLst] File [File ...]
  -d ColSpec [ColSpec ...] | -d [SepSpec] ColSpec [[SepSpec] ColSpec ...]
  ...

Description

Most aq_tool commands require input data to operate. Specification of the input data is generally done using two options:

  • A -f option that specifies the input data source. Its attributes define the input format as well as other data handling characteristics.
  • A -d option that specifies the input column specs. Column spec features are input data format dependent.

The syntax and usage of these options are the same in all commands that support them. They are described in detail below.

Note that certain aq_tools can take supplementary inputs. For example, aq_pp has a -cat option that takes the same input attributes and column specs as the -f and -d combination. The description below applies to those input specs as well.

Input File Option

-f[,AtrLst] File [File …]
The -f option sets the input attributes (AtrLst) and sources (Files). If no -f is given, data will be obtained from the standard input. Each File is a data source. It can be a regular file or a stream:

  • For a regular file, specify the file’s path as File.
  • For a stream from the standard input, specify a - (a single dash) as File.
  • For a stream from a named pipe, specify fifo@PipeName as File where PipeName is the named pipe’s path. The program will create the pipe if it does not exist or just use it if it does. If the named pipe is known to exist already, PipeName alone also works.
  • For a stream obtained from connecting to a listener, specify connect@DomainName:Port or connect@IP4:Port or connect@[IP6]:Port as File. DomainName/IP4/IP6 and Port are the address and port to connect to.
  • For a stream obtained from accepting a connection, specify listen@DomainName:Port or listen@IP4:Port or listen@[IP6]:Port or listen@Port as File. DomainName/IP4/IP6 and Port are the address and port to listen at.
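
The File forms above can be told apart by their prefix. The following Python sketch only illustrates that classification; classify_source is a hypothetical helper, not part of any aq_tool, and it does not open files, pipes or sockets.

```python
import re

def classify_source(file_arg):
    """Classify a -f File argument by the forms listed above (illustration only)."""
    if file_arg == "-":
        return ("stdin", None)
    kind, sep, rest = file_arg.partition("@")
    if sep and kind == "fifo":
        return ("fifo", rest)
    if sep and kind in ("connect", "listen"):
        # [IP6]:Port uses brackets so the colons in the address are unambiguous.
        m = re.fullmatch(r"\[([^\]]+)\]:(\d+)", rest)
        if m is None:
            # DomainName:Port or IP4:Port
            m = re.fullmatch(r"([^:\[\]]+):(\d+)", rest)
        if m is not None:
            return (kind, (m.group(1), int(m.group(2))))
        # Only listen@ accepts a bare port.
        if kind == "listen" and rest.isdigit():
            return (kind, (None, int(rest)))
        raise ValueError("malformed address: " + file_arg)
    # Anything else is treated as a regular file path (or an existing pipe).
    return ("file", file_arg)

print(classify_source("connect@[::1]:9000"))
```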

Optional AtrLst defines the input’s data format and handling characteristics. It is a list of comma separated attributes containing:

  • Input format selection:

These attributes are mutually exclusive, except for sep and csv, which can be used together. If no input format attribute is given, csv is assumed.

    • csv – Input is in CSV format. This is the default input format. Although CSV implies comma separated, sep=c can be used to select a different separator. This format uses the generic column specification.
    • sep=c or sep=\xHH – Input is in ‘c’ (single byte) separated value format. \xHH is a way to specify ‘c’ via its HEX value HH. This format uses the generic column specification.
    • fix – Input rows have the form “Column1Column2…” without any separator between column values. Instead, each column has a fixed byte length so that the columns can be extracted by byte positions. Individual column widths are defined as n=Len attribute in the generic column specification.
    • div – Input rows have the form “[Separator1]Column1[Separator2]Column2…” where the separators vary from field to field. In this format, the separators are defined along with the columns in the column specification for arbitrary separators.
    • tab – Input is in HTML table format. Each row has the form “…<td>Column1</td>…<td>Column2</td>…</tr>”. In other words, a row begins at the first “<td …>” tag and ends at a “</tr>” tag. This format uses the generic column specification.
    • jsn – Input is in JSON format. Each record must be an object or an array that contains objects. Columns are extracted from object members. The member specification is given as extended information in the column specification for key extraction.
    • xml – Input is in XML format. Each record comes from a repeated child (or subchild) under the document root. The child specification is given as extended information in the column specification for key extraction.
    • bin – Input is in aq_tool’s internal binary format. This format is designed to improve performance when the input data is also generated by an aq_tool. Note that the other aq_tool must output its data in bin format as well. This format uses the generic column specification.
    • aq – Input comes from another aq_tool outputting in aq format. This is a special format that contains an embedded column spec – no further column spec will be needed (nor accepted).
  • Column spec attributes that apply to all columns:
    • esc – Interpret ‘\’ as an escape character in all input fields. Only applicable to sep, csv, fix and div formats.
  • Positioning the start of input:
    • +Num[b|r|l] – Specifies the number of bytes (b suffix), records (r suffix) or lines (l suffix) to skip before processing. Line is the default.
  • Error handling:
    By default, all input related errors are fatal – the program will print an error message and exit.

    • nox – Reject records with more fields than the column spec. For sep, csv and tab formats only. By default, these formats silently ignore extra (trailing) fields in the input records.
    • eok[=Num[/Rows]] – Make recoverable input errors non-fatal. If there is an input parse error, the program will try to skip over the bad/broken data until the beginning of the next record. If there is an input data processing error, the program will just discard the offending record. Optional Num sets a finite number of errors per file to allow. Num/Rows allows Num errors every Rows rows.
    • qui[=Num] – Quiet. That is, suppress all input related error messages. Optional Num sets a non-zero number of error messages to print for each input file before becoming quiet. Typically used with eok.
  • Processing buffer:
    • bz=BufSize – Set the per-record buffer size to BufSize bytes. It must be big enough to hold the data of all the columns in a record. Default size is 64KB.
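
As a small illustration of the +Num[b|r|l] attribute described above, the Python sketch below splits such an attribute into a count and a unit, using lines when no suffix is given. parse_skip is a hypothetical name used only for this example; it says nothing about how aq_tools parse the attribute internally.

```python
import re

def parse_skip(attr):
    """Split '+Num[b|r|l]' into (count, unit); lines is the default unit."""
    m = re.fullmatch(r"\+(\d+)([brl]?)", attr)
    if m is None:
        raise ValueError("not a +Num[b|r|l] attribute: " + attr)
    units = {"b": "bytes", "r": "records", "l": "lines", "": "lines"}
    return int(m.group(1)), units[m.group(2)]

print(parse_skip("+1"))     # skip one line, e.g. a header line
print(parse_skip("+512b"))  # skip 512 bytes
```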

Generic Column Specification

-d ColSpec [ColSpec …]
Define the columns of an input in sep, csv, fix, tab or bin format. ColSpecs must be specified in the same order as the columns appear in the input. Up to 2048 non-X type ColSpecs can be defined. ColSpec has the form Type[,AtrLst]:ColName. Supported Types are:

  • S – String (65535 byte max).
  • F – Double precision floating point (±2.23×10^−308 to ±1.80×10^308).
  • L – 64-bit unsigned integer (0 to 18,446,744,073,709,551,615).
  • LS – 64-bit signed integer (−9,223,372,036,854,775,808 to 9,223,372,036,854,775,807).
  • I – 32-bit unsigned integer (0 to 4,294,967,295).
  • IS – 32-bit signed integer (−2,147,483,648 to 2,147,483,647).
  • IP – v4/v6 address.
  • X[Type] – Marks an unwanted input column. Type is required only for a bin input (optional otherwise). It can have one of the above values.
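
The integer ranges listed above follow directly from the bit widths. The Python sketch below restates them and checks whether a value fits a given type. INT_RANGES and fits are names made up for this illustration, and treating the type letter as case insensitive is an assumption based on the examples in this document using both s: and S:.

```python
# Ranges implied by the bit widths of the integer column types.
INT_RANGES = {
    "L": (0, 2**64 - 1),        # 64-bit unsigned
    "LS": (-2**63, 2**63 - 1),  # 64-bit signed
    "I": (0, 2**32 - 1),        # 32-bit unsigned
    "IS": (-2**31, 2**31 - 1),  # 32-bit signed
}

def fits(type_name, value):
    """Return True if value is representable in the given integer type."""
    low, high = INT_RANGES[type_name.upper()]
    return low <= value <= high

print(fits("I", 4294967295), fits("I", 4294967296))
```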

Optional AtrLst determines how column data are to be extracted from the input. It is a comma separated list containing:

  • n=Len – Extract exactly Len source bytes. Use this for a fixed length data column. Not applicable to tab and bin formats.
  • esc – Interpret ‘\’ as an escape character in the input data. Do not use this attribute if the data contain multibyte character sequences that use ‘\’ for encoding. Not applicable to tab and bin formats.
  • clf – Interpret common log format like encoding in the input data. Not applicable to tab and bin formats.
    • Some whitespaces encoded as ‘\r’, ‘\n’, ‘\t’, ‘\v’ and ‘\f’.
    • ‘"’ and ‘\’ encoded as ‘\"’ and ‘\\’ respectively.
    • Non-printable bytes encoded as \xHH where HH is the hex value of the byte.
  • hex – Interpret integers in hexadecimal notation. Default is 10-based. Starting 0x is optional. For example, 100 or 0x100 is converted to 256 instead of 100. Not applicable to bin format.
  • trm – Trim leading/trailing spaces from the field value.
  • lo, up – Convert a string field value to lower or upper case.
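
To make the clf encodings listed above concrete, here is a minimal Python sketch that decodes them in a byte string. clf_decode is a hypothetical helper written for this illustration. It covers only the three encodings named in the list, and the actual aq_tool decoder may differ in details such as how it treats unrecognized escapes.

```python
import re

# One escape: \xHH, or a backslash followed by r, n, t, v, f, " or \.
_CLF = re.compile(rb'\\(x[0-9A-Fa-f]{2}|[rntvf"\\])')
_SIMPLE = {
    b"r": b"\r", b"n": b"\n", b"t": b"\t", b"v": b"\v", b"f": b"\f",
    b'"': b'"', b"\\": b"\\",
}

def clf_decode(raw):
    """Decode clf-style escapes in a byte string; other bytes pass through."""
    def repl(match):
        token = match.group(1)
        if token[:1] == b"x":
            return bytes([int(token[1:], 16)])  # \xHH -> the byte with hex value HH
        return _SIMPLE[token]
    return _CLF.sub(repl, raw)

print(clf_decode(rb'GET \"/a\tb\" \x41'))
```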

ColName is the column name (case insensitive). It can contain up to 31 alphanumeric and ‘_’ characters. Its first character cannot be a digit. It is optional if the column has an X type.
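
The Type[,AtrLst]:ColName form and the naming rule above can be summarized in a short Python sketch. parse_colspec is a hypothetical helper for illustration only. It handles the generic form (no KeySpec), does not validate individual attributes, and upper-cases the type on the assumption that types are case insensitive, as the examples in this document suggest.

```python
import re

# Up to 31 alphanumeric or '_' characters, not starting with a digit.
_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,30}")
_TYPES = {"S", "F", "L", "LS", "I", "IS", "IP"}

def parse_colspec(spec):
    """Split 'Type[,AtrLst]:ColName' into its parts (illustration only)."""
    head, _, name = spec.partition(":")
    type_name, *attrs = head.split(",")
    type_name = type_name.upper()
    unwanted = type_name.startswith("X")
    base_type = type_name[1:] if unwanted else type_name
    if base_type not in _TYPES and not (unwanted and base_type == ""):
        raise ValueError("unknown type in " + spec)
    if name:
        if _NAME.fullmatch(name) is None:
            raise ValueError("invalid column name in " + spec)
    elif not unwanted:
        # Only X columns may omit the name.
        raise ValueError("column name required in " + spec)
    return {"type": type_name, "attrs": attrs, "name": name or None}

print(parse_colspec("i,n=12,trm:Col2"))
```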

Example:

$ aq_pp ... -d s:Col1 i,trm:Col2 ...
  • Generic column spec. Col1 is a string. Col2 is an unsigned integer; the trm attribute removes blanks around the value before it is converted to an integer.
$ aq_pp -f,fix ... -d s,n=5:Col1 i,n=12,trm:Col2 ...
  • Column spec for the fix format. An n=Len attribute is needed in all column specs.
$ aq_pp ... -d s:Col1 i,trm:Col2 ... -o,bin - | aq_pp -f,bin - -d s:C1 i:C2 ...
  • Column spec for the bin format. Note that the input column types must match those from the other command’s output columns.
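
For the fix format example above, the extraction by byte position can be pictured with the following Python sketch. The row content is made up and parse_fixed is a hypothetical helper; it only shows how n=Len widths cut a row into fields and why trm matters before an integer conversion.

```python
def parse_fixed(row, widths):
    """Cut a row into fields using byte widths (the n=Len values)."""
    fields, pos = [], 0
    for width in widths:
        fields.append(row[pos:pos + width])
        pos += width
    return fields

# Mirrors -d s,n=5:Col1 i,n=12,trm:Col2 on a made-up 17-byte row.
row = b"abcde" + b"42".rjust(12)
col1_raw, col2_raw = parse_fixed(row, [5, 12])
col1 = col1_raw.decode("ascii")
# trm: drop the padding around the digits before converting to an integer.
col2 = int(col2_raw.decode("ascii").strip())
print(col1, col2)
```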

Column Specification for Arbitrary Separators

-d [SepSpec] ColSpec [[SepSpec] ColSpec …]
Define the columns of an input in div format. The specification is identical to the Generic Column Specification except for the added SepSpec. The individual SepSpec in this specification is designed for input data that have multibyte separators and/or varying separators from field to field. ColSpecs and SepSpecs must be specified in the same order as the columns and separators appear in the input. Up to 2048 non-X type ColSpecs can be defined. ColSpec has the form Type[,AtrLst]:ColName. Supported Types are:

  • S – String (65535 byte max).
  • F – Double precision floating point (±2.23×10^−308 to ±1.80×10^308).
  • L – 64-bit unsigned integer (0 to 18,446,744,073,709,551,615).
  • LS – 64-bit signed integer (−9,223,372,036,854,775,808 to 9,223,372,036,854,775,807).
  • I – 32-bit unsigned integer (0 to 4,294,967,295).
  • IS – 32-bit signed integer (−2,147,483,648 to 2,147,483,647).
  • IP – v4/v6 address.
  • X[Type] – Marks an unwanted input column. Type is optional. It can have one of the above values.

Optional AtrLst determines how a column’s value is to be extracted from the input. It is a comma separated list containing:

  • n=Len – Extract exactly Len source bytes. Use this for a fixed length data column.
  • esc – Interpret ‘\’ as an escape character in the input data. Do not use this attribute if the data contain multibyte character sequences that use ‘\’ for encoding.
  • clf – Interpret common log format like encoding in the input data.
    • Some whitespaces encoded as ‘\r’, ‘\n’, ‘\t’, ‘\v’ and ‘\f’.
    • ‘"’ and ‘\’ encoded as ‘\"’ and ‘\\’ respectively.
    • Non-printable bytes encoded as \xHH where HH is the hex value of the byte.
  • hex – Interpret integers in hexadecimal notation. Default is 10-based. Starting 0x is optional. For example, 100 or 0x100 is converted to 256 instead of 100.
  • trm – Trim leading/trailing spaces from the field value.
  • lo, up – Convert a string field value to lower or upper case.
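
The effect of the hex attribute on the example values above can be checked with a one-line conversion. The Python sketch below is only an illustration of the arithmetic; parse_hex is a made-up name.

```python
def parse_hex(text):
    """Read an integer in hexadecimal notation; a leading 0x is optional."""
    return int(text, 16)

print(parse_hex("100"), parse_hex("0x100"))  # both are 1*16**2 = 256
print(int("100"))                            # default 10-based reading
```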

ColName is the column name (case insensitive). It can contain up to 31 alphanumeric and ‘_’ characters. Its first character cannot be a digit. It is optional if the column has an X type.

SepSpec has the form SEP:SepStr where SEP (case insensitive) is a keyword and SepStr is a literal separator of one or more bytes. Note that SepStr is taken as-is; there is no special interpretation. A SepSpec is generally needed between two adjacent ColSpecs unless the former column has an n=Len attribute.

Example:

$ aq_pp ... -d sep:' [' s:time_s sep:'] "' s,clf:url sep:'"' ...
  • Parse data of the form: [01/Apr/2016:01:02:03 +0900] "/index.html".
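
The way alternating separators and columns consume a row can be sketched in Python as follows. parse_div and the spec tuples are hypothetical constructs for this illustration. The sketch assumes each column runs up to the next literal separator, ignores the clf attribute, and does not handle n=Len columns.

```python
def parse_div(row, spec):
    """Walk a row using ("sep", literal) and ("col", name) items in order."""
    pos, columns = 0, {}
    for i, (kind, value) in enumerate(spec):
        if kind == "sep":
            # A separator must match literally at the current position.
            if not row.startswith(value, pos):
                raise ValueError("expected separator %r at byte %d" % (value, pos))
            pos += len(value)
        else:
            # A column extends to the next separator, or to the end of the row.
            following = spec[i + 1] if i + 1 < len(spec) else None
            if following is not None and following[0] == "sep":
                end = row.find(following[1], pos)
                if end < 0:
                    raise ValueError("separator %r not found" % (following[1],))
            else:
                end = len(row)
            columns[value] = row[pos:end]
            pos = end
    return columns

# Mirrors: -d sep:' [' s:time_s sep:'] "' s,clf:url sep:'"'
spec = [("sep", " ["), ("col", "time_s"), ("sep", '] "'), ("col", "url"), ("sep", '"')]
row = ' [01/Apr/2016:01:02:03 +0900] "/index.html"'
print(parse_div(row, spec))
```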

Column Specification for Key Extraction

-d ColSpec [ColSpec …]
Define the columns of an input in jsn or xml format. This spec differs from the other column specs in these ways:

  • Only the desired columns need to be specified. There is no need to specify all the columns in the input.
  • The columns need not be in the same order as they appear in the input. Columns are extracted according to their KeySpec and not their positions.

Up to 2048 non-X type ColSpecs can be defined. ColSpec has the form Type[,AtrLst]:ColName:KeySpec. Supported Types are:

  • S – String (65535 byte max).
  • F – Double precision floating point (±2.23×10^−308 to ±1.80×10^308).
  • L – 64-bit unsigned integer (0 to 18,446,744,073,709,551,615).
  • LS – 64-bit signed integer (−9,223,372,036,854,775,808 to 9,223,372,036,854,775,807).
  • I – 32-bit unsigned integer (0 to 4,294,967,295).
  • IS – 32-bit signed integer (−2,147,483,648 to 2,147,483,647).
  • IP – v4/v6 address.
  • X[Type] – Marks an unwanted input column. Type is optional. It can have one of the above values. Note that an X type is generally not necessary; instead, only specify the columns needed.

Optional AtrLst determines how column data are to be extracted from the input. It is a comma separated list containing:

  • hex – Interpret integers in hexadecimal notation. Default is 10-based. Starting 0x is optional. For example, 100 or 0x100 is converted to 256 instead of 100.
  • trm – Trim leading/trailing spaces from the field value.
  • lo, up – Convert a string field value to lower or upper case.
  • base=BaseSpec – Set an optional base for all the KeySpec. BaseSpec is a list of dot separated elements as in Element.Element[.Element …]. Each Element has one of these forms:
    • KeyName selects the value of an object member named KeyName (case insensitive).
    • [Num] selects the Num-th (zero-based) value in an array. If Num is *, all values will be selected (with certain key extraction limitations).
    • KeyName[Num] selects the Num-th (zero-based) value in the array belonging to an object member named KeyName (case insensitive). If Num is *, all values will be selected (with certain key extraction limitations).

ColName is the column name (case insensitive). It can contain up to 31 alphanumeric and ‘_’ characters. Its first character cannot be a digit.

KeySpec specifies which data field to extract for the column. It is a list of dot separated elements as in Element.Element[.Element …]. Each Element has one of these forms:

  • KeyName selects the value of an object member named KeyName (case insensitive).
  • [Num] selects the Num-th (zero-based) value in an array. If Num is *, all values will be selected (with certain key extraction limitations).
  • KeyName[Num] selects the Num-th (zero-based) value in the array belonging to an object member named KeyName (case insensitive). If Num is *, all values will be selected (with certain key extraction limitations).

If a BaseSpec attribute is given, KeySpec will be appended to BaseSpec (with a dot in between) to form the actual key.
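
The element forms above can be pictured as a path walked through the parsed data. The Python sketch below resolves a KeySpec (optionally prefixed by a BaseSpec) against an in-memory JSON-like value. parse_key, select and full_key are hypothetical helpers for this illustration. aq_tools process the input as a stream, so this sketch shows the addressing rules only, not the actual extraction behavior.

```python
import re

_ELEM = re.compile(r"([^.\[\]]*)(?:\[(\d+|\*)\])?")

def full_key(base, key):
    """Append KeySpec to BaseSpec with a dot in between; either may be blank."""
    return ".".join(part for part in (base, key) if part)

def parse_key(spec):
    """Split 'k1.ary[*]' into [('k1', None), ('ary', '*')]."""
    elems = []
    for part in spec.split("."):
        m = _ELEM.fullmatch(part)
        if m is None or (m.group(1) == "" and m.group(2) is None):
            raise ValueError("bad element %r in %r" % (part, spec))
        elems.append((m.group(1).lower(), m.group(2)))
    return elems

def select(value, elems):
    """Return every value addressed by elems (key names are case insensitive)."""
    if not elems:
        return [value]
    (name, index), rest = elems[0], elems[1:]
    if name:
        if not isinstance(value, dict):
            return []
        matches = [v for k, v in value.items() if k.lower() == name]
        if not matches:
            return []
        value = matches[0]
    if index is None:
        return select(value, rest)
    if not isinstance(value, list):
        return []
    items = value if index == "*" else value[int(index):int(index) + 1]
    found = []
    for item in items:
        found.extend(select(item, rest))
    return found

doc = {"Key1": "Val1", "Key2": {"Ary": [0, 1, 2]}}
print(select(doc, parse_key("key1")), select(doc, parse_key("key2.ary[*]")))
```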

Example:

{
  "Key1" : "Val1",
  "Key2" : { "Ary" : [ 0, 1, 2 ] }
}

$ aq_pp -f,jsn ... -d S:Col1:key1 I:Col2:key2.ary[*] ...
  • Extract 2 columns from the example JSON data – one from “key1”, the other from all values of “key2.ary”. The result will be “Val1,0”, “Val1,1” and “Val1,2”.

<root>
  <Key1>Val1</Key1>
  <Key2>
    <Ary>0</Ary>
    <Ary>1</Ary>
    <Ary>2</Ary>
  </Key2>
</root>

$ aq_pp -f,xml ... -d S:Col1:root.key1 I:Col2:root.key2.ary[*] ...
  • Extract 2 columns from the example XML data – one from “key1”, the other from all values of “key2.ary”. The result will be “Val1,0”, “Val1,1” and “Val1,2”.

{ "k1" : { "k2" : { "k3" : { "k4" : "14", "k5" : "15" } } } }
{ "k1" : { "k2" : { "k3" : { "k4" : "24", "k5" : "25" } } } }
{ "k1" : { "k2" : { "k3" : { "k4" : "34", "k5" : "35" } } } }

$ aq_pp -f,jsn ... -d I:Col1:k1.k2.k3.k4 I:Col2:k1.k2.k3.k5 ...
$ aq_pp -f,jsn,base=k1.k2.k3 ... -d I:Col1:k4 I:Col2:k5 ...
  • Extract 2 columns from the example JSON data. The two commands are equivalent, extracting 3 rows of output – “14,15”, “24,25” and “34,35”.

<k1><k2><k3><k4>14</k4><k5>15</k5></k3></k2></k1>
<k1><k2><k3><k4>24</k4><k5>25</k5></k3></k2></k1>
<k1><k2><k3><k4>34</k4><k5>35</k5></k3></k2></k1>

$ aq_pp -f,xml ... -d I:Col1:k1.k2.k3.k4 I:Col2:k1.k2.k3.k5 ...
$ aq_pp -f,xml,base=k1.k2.k3 ... -d I:Col1:k4 I:Col2:k5 ...
  • Extract 2 columns from the example XML data. The two commands are equivalent, extracting 3 rows of output – “14,15”, “24,25” and “34,35”.

[
{ "k1" : { "k2" : { "k3" : { "k4" : "14", "k5" : "15" } } } },
{ "k1" : { "k2" : { "k3" : { "k4" : "24", "k5" : "25" } } } },
{ "k1" : { "k2" : { "k3" : { "k4" : "34", "k5" : "35" } } } }
]

$ aq_pp -f,jsn,base=[*].k1.k2.k3 ... -d I:Col1:k4 I:Col2:k5 ...
  • Extract 2 columns from the example JSON data. Produces the same result as the previous example. Note the use of “[*]” in base to address all the objects in the top array.

<k0>
<k1><k2><k3><k4>14</k4><k5>15</k5></k3></k2></k1>
<k1><k2><k3><k4>24</k4><k5>25</k5></k3></k2></k1>
<k1><k2><k3><k4>34</k4><k5>35</k5></k3></k2></k1>
</k0>

$ aq_pp -f,xml,base=k0.k1[*].k2.k3 ... -d I:Col1:k4 I:Col2:k5 ...
  • Extract 2 columns from the example XML data. Produces the same result as the previous example. Note the use of “[*]” in base to address all the “k1” entries.

{ "k1" : { "k2" : { "k3" : [ { "k4" : "14", "k5" : "15" },
                             { "k4" : "24", "k5" : "25" } ] } } }
{ "k1" : { "k2" : { "k3" : [ { "k4" : "34", "k5" : "35" } ] } } }

$ aq_pp -f,jsn,base=k1.k2.k3[*] ... -d I:Col1:k4 I:Col2:k5 ...
  • Extract 2 columns from the example JSON data. Produces the same result as the previous example. Note the use of “[*]” in base to address all the objects in the “k3” array.

<k1><k2><k3><k4>14</k4><k5>15</k5></k3>
        <k3><k4>24</k4><k5>25</k5></k3></k2></k1>
<k1><k2><k3><k4>34</k4><k5>35</k5></k3></k2></k1>

$ aq_pp -f,xml,base=k1.k2.k3[*] ... -d I:Col1:k4 I:Col2:k5 ...
  • Extract 2 columns from the example XML data. Produces the same result as the previous example. Note the use of “[*]” in base to address all the “k3” elements.

[
{ "k1" : { "k2" : { "k3" : [ { "k4" : "14", "k5" : "15" },
                             { "k4" : "24", "k5" : "25" } ] } } },
{ "k1" : { "k2" : { "k3" : [ { "k4" : "34", "k5" : "35" } ] } } }
]

$ aq_pp -f,jsn,base=[*].k1.k2.k3[*] ... -d I:Col1:k4 I:Col2:k5 ...
  • Extract 2 columns from the example JSON data. Produces the same result as the previous example. Note the use of two “[*]” in base to address all the objects in the top array and all the objects in the “k3” array.

<k0>
<k1><k2><k3><k4>14</k4><k5>15</k5></k3>
        <k3><k4>24</k4><k5>25</k5></k3></k2></k1>
<k1><k2><k3><k4>34</k4><k5>35</k5></k3></k2></k1>
</k0>

$ aq_pp -f,xml,base=k0.k1[*].k2.k3[*] ... -d I:Col1:k4 I:Col2:k5 ...
  • Extract 2 columns from the example XML data. Produces the same result as the previous example. Note the use of two “[*]” in base to address all the “k1” entries and all the “k3” entries.

[ 1,2 ]
[ 3,4 ]

$ aq_pp -f,jsn,base=[*] ... -d I:Col1: ...

[ [ 1,2 ], [ 3,4 ] ]

$ aq_pp -f,jsn,base=[*].[*] ... -d I:Col1: ...

{ "k1" : [ 1,2 ] }
{ "k1" : [ 3,4 ] }

$ aq_pp -f,jsn,base=k1[*] ... -d I:Col1: ...

<k1>1</k1>
<k1>2</k1>
<k1>3</k1>
<k1>4</k1>

$ aq_pp -f,xml,base=k1 ... -d I:Col1: ...

The KeySpec in a ColSpec can be blank if base is given.

Key extraction limitations

The [*] extraction may not always give the expected result because of the stream based design of aq_tools; the outcome depends on the arrangement of the input data. To illustrate, consider:

{
  "Key1" : "Val1",
  "Key2" : { "Ary" : [ 0, 1, 2 ] }
}

$ aq_pp -f,jsn ... -d S:Col1:key1 I:Col2:key2.ary[*] ...

Extracting “key1” and “key2.ary” gives the expected result of “Val1,0”, “Val1,1” and “Val1,2”. However, if the input data is arranged differently, as in:

{
  "Key2" : { "Ary" : [ 0, 1, 2 ] },
  "Key1" : "Val1"
}

$ aq_pp -f,jsn ... -d S:Col1:key1 I:Col2:key2.ary[*] ...

The same command extracts only “,0”, “,1” and “,2” – i.e., the value of “key1” is missing. Due to its stream based design, aq_pp outputs one record for each value of the innermost array “key2.ary”. However, “key1” is not yet known when “key2.ary” is processed, so it is given an empty string value. To illustrate further, consider:

{
  "Key2" : { "Ary" : [ 0, 1, 2 ] },
  "Key1" : "Val1",
  "Key3" : { "Ary" : [ 10, 11, 12 ] }
}

$ aq_pp -f,jsn ... -d S:Col1:key1 I:Col2:key2.ary[*] I:Col3:key3.ary[*] ...

The result will be “,0,0”, “,1,0”, “,2,0”, “Val1,0,10”, “Val1,0,11” and “Val1,0,12”. There are two innermost arrays of interest in this case. The first 3 result rows come from “key2.ary”, where “key1” and “key3.ary” are not known. The other result rows come from “key3.ary”, where “key1” is known but “key2.ary” is no longer in context.
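
The behavior described in this section can be reproduced with a small model. The Python sketch below is not how aq_pp is implemented; stream_rows is a hypothetical function that assumes members are visited in input order, a row is emitted for each element of a [*] array, and an array column falls back to its default once the array has been passed. The single-row fallback for records without arrays is also an assumption.

```python
def stream_rows(record, cols):
    """Model of stream-order extraction.

    cols is a list of (lowercase key path, is_array, default) tuples.
    """
    current = [default for _, _, default in cols]
    rows = []

    def walk(value, path):
        if isinstance(value, dict):
            # Python dicts keep insertion order, which stands in for input order.
            for key, member in value.items():
                walk(member, path + [key.lower()])
            return
        dotted = ".".join(path)
        for i, (col_key, is_array, default) in enumerate(cols):
            if col_key != dotted:
                continue
            if is_array and isinstance(value, list):
                for item in value:
                    current[i] = item
                    rows.append(",".join(str(c) for c in current))
                current[i] = default  # the array is no longer in context
            elif not is_array:
                current[i] = value

    walk(record, [])
    if not rows:
        # Assumption: a record without any [*] array yields a single row.
        rows.append(",".join(str(c) for c in current))
    return rows

cols = [("key1", False, ""), ("key2.ary", True, 0), ("key3.ary", True, 0)]
record = {
    "Key2": {"Ary": [0, 1, 2]},
    "Key1": "Val1",
    "Key3": {"Ary": [10, 11, 12]},
}
for row in stream_rows(record, cols):
    print(row)
```

Running the sketch on the last example prints the six rows listed above, in the same order.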

See Also

  • aq_pp – Record preprocessor
  • aq_cnt – Data row/key count
  • aq_ord – In-memory record sort
  • aq_cat – Input multiplexer
  • aq-output – aq_tool output specifications