Friday, October 9, 2009

Grouping files using awk in Bash shell


My directory contains a set of log files whose names follow this pattern:
debug.vendor-name.some-serial-number.epoch-time-stamp.device-class.log

where:
epoch-time-stamp
is the UNIX time stamp at which the log file was generated.

device-class
the first 4 characters of this field are the service-name of the device and the next 6 characters are the device class name
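
Since every command below refers to fields by number, here is a small sketch of how awk numbers the parts of one file name when "." is the field separator. The helper name show_fields is my own, made up for illustration only.

```shell
# Hypothetical helper: print each dot-separated field of a file name
# together with its awk field number.
show_fields() {
    printf '%s\n' "$1" |
        awk -F "." '{ for (i = 1; i <= NF; i++) print i ": " $i }'
}

show_fields debug.cisco.0001.1254059837.svc1class2.log
```

For this name, field 2 is the vendor-name (cisco), field 3 the serial-number (0001), field 4 the time stamp (1254059837) and field 5 the device-class (svc1class2).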

$ ls -1
debug.cisco.0001.1254059837.svc1class2.log
debug.cisco.0001.1255058827.svc1class3.log
debug.cisco.0001.1255058827.svc2class3.log
debug.cisco.0001.1255058837.svc1class2.log
debug.cisco.0001.1255059834.svc2class3.log
debug.cisco.0002.1255059819.svc1grade2.log
debug.cisco.0002.1255059849.svc1class1.log
debug.cisco.0002.1255059849.svc2class1.log
debug.juniper.0001.1255059831.svc1class2.log
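
If you want to follow along on the same data, the listing above can be recreated as empty files in a scratch directory. This is only a sketch: the function name make_sample_logs is hypothetical, and I am assuming your system has mktemp with the -d option.

```shell
# Hypothetical helper: create empty files named like the listing above
# inside a fresh temporary directory, then print that directory's path.
make_sample_logs() {
    dir=$(mktemp -d) || return 1
    for f in \
        debug.cisco.0001.1254059837.svc1class2.log \
        debug.cisco.0001.1255058827.svc1class3.log \
        debug.cisco.0001.1255058827.svc2class3.log \
        debug.cisco.0001.1255058837.svc1class2.log \
        debug.cisco.0001.1255059834.svc2class3.log \
        debug.cisco.0002.1255059819.svc1grade2.log \
        debug.cisco.0002.1255059849.svc1class1.log \
        debug.cisco.0002.1255059849.svc2class1.log \
        debug.juniper.0001.1255059831.svc1class2.log
    do
        : > "$dir/$f"    # create an empty file
    done
    printf '%s\n' "$dir"
}

# Usage: cd "$(make_sample_logs)" && ls -1
```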

Let's try to group similar files (under different conditions) and count the number of files in each group.

One: Group based on vendor-name (2nd field)

$ ls | awk -F "." '{count[$2]++}END{for(j in count) print j,"["count[j]"]"}'

Output:
cisco [8]
juniper [1]

Two: Group based on vendor-name (2nd field) and serial-number (3rd field)

$ ls | awk -F "." '{count[$2" "$3]++}END{for(j in count) print j,"["count[j]"]"}'

Output:
cisco 0002 [3]
juniper 0001 [1]
cisco 0001 [5]
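
Note that the groups above do not come out in any particular order: awk does not define the order in which "for (j in count)" visits the array, so it can differ between awk implementations. If a stable order matters, one option is to pipe the result through sort. A minimal sketch, with a wrapper name (count_by_vendor_serial) that I made up for this example:

```shell
# Hypothetical wrapper: reads file names on stdin, counts them per
# vendor-name (2nd field) and serial-number (3rd field), and sorts
# the result so the order no longer depends on the awk implementation.
count_by_vendor_serial() {
    awk -F "." '{ count[$2 " " $3]++ }
        END { for (j in count) print j, "[" count[j] "]" }' | sort
}

# Usage: ls | count_by_vendor_serial
```

On the files listed earlier this should print the cisco 0001 group first, then cisco 0002, then juniper 0001.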

Three: Group based on vendor-name (2nd field), serial-number (3rd field) and UNIX-time-stamp (4th field) in hour bucketing*

$ ls | awk -F "." '{count[$2" "$3" "($4-($4%3600))]++}
END{for(j in count) print j,"["count[j]"]"}'

Output:
juniper 0001 1255057200 [1]
cisco 0001 1254056400 [1]
cisco 0002 1255057200 [3]
cisco 0001 1255057200 [4]

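*Hour bucketing: each time stamp is rounded down to the start of its hour by subtracting its remainder modulo 3600 seconds.
e.g. 'Fri Oct 9 09:51:55 UTC 2009' and 'Fri Oct 9 09:01:55 UTC 2009' both fall into the same bucket, 'Fri Oct 9 09:00:00 UTC 2009'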
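
To make the bucketing arithmetic concrete, here is a small sketch that rounds a single epoch time stamp down to the start of its bucket. The function name bucket_start and its optional width argument are my own assumptions for illustration.

```shell
# Hypothetical helper: round an epoch time stamp ($1) down to the start
# of its bucket. The bucket width ($2) is in seconds and defaults to
# 3600, i.e. one hour.
bucket_start() {
    awk -v ts="$1" -v width="${2:-3600}" \
        'BEGIN { printf "%d\n", ts - (ts % width) }'
}

bucket_start 1255058827    # prints 1255057200
bucket_start 1255059834    # prints 1255057200 (same hour, same bucket)
```

If you have GNU date, something like date -u -d @1255057200 should show the bucket start in readable form; other date implementations use different options.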

Four: Group based on vendor-name (2nd field) and first 4 characters of device-class (5th field)

$ ls | awk -F "." '
# substr() positions start at 1; a start of 0 returns only 3 characters in some awks
{ $5 = substr($5, 1, 4) }
{count[$2" "$5]++}
END{for(j in count) print j,"["count[j]"]"}'

Output:
juniper svc1 [1]
cisco svc1 [5]
cisco svc2 [3]
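
The same substr() call can also pull out the class-name part of the device-class. A minimal sketch, assuming the layout described at the top (4 characters of service-name followed by the class name); the helper name split_device_class is hypothetical.

```shell
# Hypothetical helper: reads file names on stdin and prints the
# service-name and class-name parts of the device-class (5th field).
# awk string positions start at 1, so the class name starts at 5.
split_device_class() {
    awk -F "." '{ print substr($5, 1, 4), substr($5, 5) }'
}

# Usage: ls | split_device_class
```

For debug.cisco.0002.1255059819.svc1grade2.log this prints "svc1 grade2".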

Five: Group based on
vendor-name (2nd field),
serial-number (3rd field),
UNIX-time-stamp (4th field) in day bucketing (86400 seconds)
and first 4 characters of device-class (5th field)

$ ls | awk -F "." '
# substr() positions start at 1; a start of 0 returns only 3 characters in some awks
{ $5 = substr($5, 1, 4) }
{count[$2" "$3" "($4-($4%86400))" "$5]++}
END {for(j in count) print j,"["count[j]"]"}'

Output:
cisco 0002 1255046400 svc1 [2]
cisco 0002 1255046400 svc2 [1]
juniper 0001 1255046400 svc1 [1]
cisco 0001 1255046400 svc1 [2]
cisco 0001 1255046400 svc2 [2]
cisco 0001 1254009600 svc1 [1]
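
If you need this grouping with different bucket sizes, the width can be passed in as an awk variable instead of being hard-coded. This is just a sketch under my own assumptions: the function name group_logs and its optional width argument are made up, and the output is sorted only to make the order predictable.

```shell
# Hypothetical wrapper: reads file names on stdin and groups them by
# vendor-name, serial-number, time bucket and service-name.
# $1 is the bucket width in seconds (default 86400, i.e. one day).
group_logs() {
    awk -F "." -v width="${1:-86400}" '
        {
            bucket = $4 - ($4 % width)          # start of the bucket
            key = $2 " " $3 " " bucket " " substr($5, 1, 4)
            count[key]++
        }
        END { for (k in count) print k, "[" count[k] "]" }' | sort
}

# Usage: ls | group_logs          (day buckets)
#        ls | group_logs 3600     (hour buckets)
```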


Hope you find it useful.

Related posts:

- SQL Sum of and group by using awk
- Group by Clause functionality using awk
- Associative array in awk

© Jadu Saikia www.UNIXCL.com