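It is not uncommon for data scientists to spend a significant amount of time determining the properties of a data set.
loginf was developed to assist in this process. It scans a file and reports, among other things, the number of rows and columns, the length of the values in each column, an estimate of the number of unique values per column, and the range of any numeric values.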
loginf uses an O(N) algorithm to determine these properties. In particular, the uniqueness estimate is based on an
algorithm that yields an answer accurate to within a few percent.
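The algorithm loginf uses for its uniqueness estimate is not named here, so the following is only a minimal sketch of how a single-pass estimate of this kind can work. It assumes a K-minimum-values (KMV) estimator as a stand-in; the function names and the choice of k are hypothetical and are not part of loginf.

```python
import hashlib
import heapq


def hash01(value: str) -> float:
    """Map a string to a pseudo-random float in [0, 1)."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimate_unique(values, k=256):
    """One-pass K-minimum-values estimate of the number of distinct values.

    Assumption: KMV is a stand-in technique, not loginf's actual algorithm.
    """
    heap = []     # negated hashes, so heap[0] is the largest hash kept
    kept = set()  # the same hashes, for O(1) duplicate checks
    for v in values:
        h = hash01(v)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # Replace the largest kept hash with the new, smaller one.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct values seen: count is exact
    # The k-th smallest hash indicates how densely values fill [0, 1).
    return int((k - 1) / -heap[0])
```

Each value is hashed once and costs at most O(log k) to process, so for a fixed k the pass is linear in the number of values. A larger k trades memory for a tighter estimate.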
We will look at a simple three-column file, donations.csv, that records the last name, city, and amount donated to a fictional charity. In its rawest form, one can execute the following to get the full output:
loginf -f donations.csv
Sources = 1
Process Time = 0.000
Original Bytes = 228
Analyzed Bytes = 228 (100.00% Original)
Original Lines = 11
Analyzed Lines = 11 (100.00% Original)
Rows = 11
Row Length Min|Max|Avg = 15|31|20.7273
Columns Min|Max = 3|3
Value Bytes = 72 (31.58% Analyzed)
Count = 11 (100.00% Rows)
Value Length Min|Max|Avg = 4|9|6.54545
Unique Estimate = 11
Occurrence = 11 (100.00% Count)
Sample.1 = [8]lastname
Sample.2 = [9]Henderson
Sample.3 = [4]Long
Sample.4 = [9]Alexander
Value Bytes = 95 (41.67% Analyzed)
Count = 11 (100.00% Rows)
Value Length Min|Max|Avg = 4|18|8.63636
Unique Estimate = 11
Occurrence = 11 (100.00% Count)
Sample.1 = [4]city
Sample.2 = [4]Ngou
Sample.3 = [15]Lendangara Satu
Sample.4 = [9]Carazinho
Sample.Has binary = [18]Oborniki Śląskie
Value Bytes = 28 (12.28% Analyzed)
Count = 11 (100.00% Rows)
Value Length Min|Max|Avg = 2|8|2.54545
Unique Estimate = 10
Occurrence = 1 (9.09% Count)
Sample.1 = [8]donation
Occurrence = 10 (90.91% Count)
Numeric Min|Max = 25|50
Sample.1 = [2]26
Sample.2 = [2]27
Sample.3 = [2]31
Sample.4 = [2]35
loginf breaks down each column. Note column three, which is the numeric column. Because loginf does not know at first
whether there is a header line, it reports that 1 of the 11 entries is a string, while the other 10 are integers.
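The string/integer split reported for column three can be illustrated with a short sketch. This is not how loginf classifies values internally (that is not described here). It simply assumes that a value counts as an integer when Python's `int()` accepts it. The `classify` function and the sample values are hypothetical stand-ins, not the contents of donations.csv.

```python
def classify(values):
    """Count how many values parse as integers and how many are strings."""
    counts = {"integer": 0, "string": 0}
    for v in values:
        try:
            int(v)
            counts["integer"] += 1
        except ValueError:
            counts["string"] += 1
    return counts


# Hypothetical column: a header cell followed by a few amounts.
column = ["donation", "26", "27", "31", "35"]
print(classify(column))  # {'integer': 4, 'string': 1}
```

The single header cell is what produces the lone string entry. Every remaining value is numeric.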
If you know in advance how many lines to skip at the start of a file, you can use the -f,+n
attribute to skip the first n lines.
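For readers who want to see the effect of skipping leading lines outside of loginf, here is a minimal Python sketch of the same idea. The helper name `rows_after_skip` and the sample rows are hypothetical. The sketch shows only that, once the header line is skipped, every value in the third column is numeric.

```python
import csv
import io
from itertools import islice


def rows_after_skip(stream, n):
    """Parse CSV rows from a text stream, skipping its first n lines."""
    return list(csv.reader(islice(stream, n, None)))


# Hypothetical data in the same three-column shape as the example file.
sample = io.StringIO(
    "lastname,city,donation\n"
    "Smith,Springfield,26\n"
    "Jones,Riverton,27\n"
)
rows = rows_after_skip(sample, 1)

# With the header gone, the third column contains only digits.
all_numeric = all(row[2].isdigit() for row in rows)
```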
loginf can also be used to determine the column specification used in other AQ commands, via the -o_pp_col option:
loginf -f donations.csv -o_pp_col -
It is extremely helpful when integrating new datasets with the AQ tools.
This utility can also store its output in a raw form that can be used to merge results from several
files. This is most useful when an estimate of uniqueness is needed for a column across a set of log files that span a
period of time. Refer to the loginf manual for the full syntax.
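The raw format that loginf stores is not described here, so the following sketch only illustrates why a mergeable summary is useful for uniqueness estimates across files. It assumes a linear-counting bitmap as the summary, which is a swapped-in technique rather than loginf's own. The names `summarize`, `merge`, and `estimate`, and the bitmap size `M`, are hypothetical.

```python
import hashlib
import math

M = 1 << 14  # bitmap size in bits; an arbitrary choice for this sketch


def summarize(values):
    """Build a linear-counting bitmap (stored as an int) for one file's column."""
    bitmap = 0
    for v in values:
        digest = hashlib.sha1(v.encode("utf-8")).digest()
        bitmap |= 1 << (int.from_bytes(digest[:8], "big") % M)
    return bitmap


def merge(*bitmaps):
    """Combine per-file summaries; OR-ing bitmaps equals summarizing all files."""
    merged = 0
    for b in bitmaps:
        merged |= b
    return merged


def estimate(bitmap):
    """Estimate the distinct count from the fraction of bits still unset."""
    zeros = M - bin(bitmap).count("1")
    if zeros == 0:
        raise ValueError("bitmap saturated; a larger M is needed")
    return round(-M * math.log(zeros / M))


# Two hypothetical daily logs whose user IDs partly overlap.
day1 = [f"user{i}" for i in range(0, 3000)]
day2 = [f"user{i}" for i in range(2000, 5000)]
combined = merge(summarize(day1), summarize(day2))
```

Adding the two per-file estimates would double count the overlapping IDs, whereas merging the summaries first gives one estimate for the whole period without rereading the files.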