[SOLVED] AWK vs MySQL for Data Aggregation

Issue

In trying to figure out whether AWK or MySQL is more efficient for processing log files and returning aggregate stats, I noticed the following behavior, which doesn't make sense to me:

To test this I used a file with 4 columns and approximately 9 million records. Both tests ran on the same server, a VPS with an SSD and 1 GB of RAM.

column1 has about 10 unique values, and the combination of all four columns has approximately 4,000 unique values.

In MySQL I use a table defined as log_test (column1, column2, column3, column4) with no indices.

Data Format:

    column1,column2,column3,column4
    column1,column2,column3,column4
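
A minimal sketch of an equivalent table setup and load; the column types and the file path here are illustrative assumptions, not the ones actually used:

    -- Column types are a guess; the post does not state them.
    CREATE TABLE log_test (
        column1 VARCHAR(32),
        column2 VARCHAR(32),
        column3 VARCHAR(32),
        column4 VARCHAR(32)
    );

    -- A typical bulk load for a comma-separated file (path is hypothetical).
    LOAD DATA LOCAL INFILE '/path/to/log.csv'
        INTO TABLE log_test
        FIELDS TERMINATED BY ',';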

AWK Script:

BEGIN {
    FS = ",";
    time = systime();
}
{
    array[$1]++;  # first test
    #array[$1 "," $2 "," $3 "," $4]++;  # second test
}
END {
    for (value in array) {
        print "array[" value "]=" array[value];
    }
}

MySQL Queries:

Query 1: SELECT column1, count(*) FROM log_test GROUP BY column1;

Query 2: SELECT column1, column2, column3, column4, count(*) 
FROM log_test GROUP BY column1, column2, column3, column4;

As expected, AWK is slower than MySQL. For the first test, which returns aggregate data with 10 output lines, MySQL takes around 7 seconds to finish and AWK takes around 22 seconds.

I understand that awk reads and processes the data line by line, so I would expect the second test, which produces 4k output lines, to take about the same time as the first: it still has the same number of lines to read and it isn't doing much more processing. However, AWK takes about 90 seconds while using only 0.1% MEM, whereas MySQL takes about 45 seconds and uses 3% MEM.

  1. Why does AWK take so much longer for test 2 than test 1 when it is essentially reading the same file?
  2. Why does AWK not use more memory? Is awk storing its values on the hard disk instead of in memory?
  3. Why is MySQL so much faster when it essentially has to read the table line by line as well?
  4. Are there more efficient alternatives to aggregating this data?

Solution

Awk has to store all of the tuples in the second case (and juggle a much larger associative map). To verify this, try the intermediate steps of 2- and 3-field counts.
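
A sketch of those intermediate variants (not from the original post; it follows the same structure as the script above, just with a different array key; uncomment one line at a time and time each run separately):

    # Hedged sketch, assuming the same comma-separated input as above.
    BEGIN {
        FS = ",";
    }
    {
        #array[$1]++;                          # 1-field key (first test)
        array[$1 "," $2]++;                    # 2-field key
        #array[$1 "," $2 "," $3]++;            # 3-field key
        #array[$1 "," $2 "," $3 "," $4]++;     # 4-field key (second test)
    }
    END {
        for (value in array)
            print "array[" value "]=" array[value];
    }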

As for memory usage, can you look at the exact number of bytes used by the process? Force awk to sleep at the end, measure the memory usage in both cases (in bytes), and you will see a difference.
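
For example (a sketch, not from the original answer), pausing in the END block keeps the process and its array alive so resident memory can be read from another terminal, e.g. with ps -o rss -p <pid> or /proc/<pid>/status:

    # Hedged sketch: same counting as the second test, but pause before
    # exiting so the process's memory can be inspected while the array is
    # still populated. The 60-second pause is an arbitrary choice.
    BEGIN { FS = ","; }
    { array[$1 "," $2 "," $3 "," $4]++; }
    END {
        print "done counting; sleeping so memory can be measured";
        system("sleep 60");
    }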

MySQL stores the numerical data in a more efficient form than plain text. More importantly, it is probably storing the data in a pre-parsed form, whereas awk has to perform an expensive record and field split on every line (you didn't show the MySQL setup, but if you used char(10) or other fixed-width fields, MySQL doesn't have to re-parse the data).
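
For illustration only (the actual column types were never shown), a fixed-width variant of the table would look like this; with CHAR columns every row has the same layout, so MySQL can locate each field without re-parsing:

    -- Hypothetical fixed-width definition; the widths are assumptions.
    CREATE TABLE log_test_fixed (
        column1 CHAR(10),
        column2 CHAR(10),
        column3 CHAR(10),
        column4 CHAR(10)
    );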

The most efficient approach would be to pre-sort the data or to maintain an index that is updated as you insert, although that comes at the cost of per-insert time. Alternatively, if the columns are small and have known widths, you can write your own C utility that takes advantage of those assumptions (the file would merely be a set of structs).
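
In MySQL terms, the indexing option is roughly the following (the index name is hypothetical):

    -- Hypothetical composite index over the GROUP BY columns; it speeds up
    -- the aggregation at the cost of extra work on every insert.
    ALTER TABLE log_test
        ADD INDEX idx_all_columns (column1, column2, column3, column4);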

Answered By – SheetJS

