Wednesday, March 24, 2010

Awk - count and sum on multiple columns


I have got the idea for this post from one of my reader "manimaran" who have posted a comment on one of my post related to group by using awk. Thanks "manimaran".

Input file:
      
$ cat details.txt
SA 0 2345
AF 0 123
SA 1 89
OC 0 890
EU 1 24
AF 1 90
NA 0 5678
AF 1 345
OC 1 90
OC 0 23
SA 1 1234
EU 0 90
AF 0 12
SA 0 909


Output required :

- Group by continent
- For each continent count the followings:

Count of 0
Count of 1
Sum of last fields where 2nd field is 0
Sum of last fields where 2nd field is 1

i.e. required output:

Continent count0 count1 sum0 sum1
NA 1 0 5678 0
OC 2 1 913 90
AF 2 2 135 435
SA 2 2 3254 1323
EU 1 1 90 24

The awk solution using awk associative array and awk if-else

$ awk '
BEGIN {print "Continent count0 count1 sum0 sum1"} {
n[$1]++;
C0_[$1] += ($2 == "0" ? 1 : 0); S0_[$1] += ($2 == "0" ? $NF : 0)
C1_[$1] += ($2 == "1" ? 1 : 0); S1_[$1] += ($2 == "1" ? $NF : 0)
}
END {
for (i in n) {
print i,C0_[i],C1_[i],S0_[i],S1_[i]
}
}' details.txt

Output:

Continent count0 count1 sum0 sum1
NA 1 0 5678 0
OC 2 1 913 90
AF 2 2 135 435
SA 2 2 3254 1323
EU 1 1 90 24

And if you need the output of the above awk script in a well formatted way, here is the one:

$ awk '
BEGIN {
FORMAT="%-10s%-8s%-8s%-8s%s\n"
{printf FORMAT,"Continent","count0","count1","sum0","sum1"}
}
{ n[$1]++
C0_[$1] += ($2 == "0" ? 1 : 0)
S0_[$1] += ($2 == "0" ? $NF : 0)
C1_[$1] += ($2 == "1" ? 1 : 0)
S1_[$1] += ($2 == "1" ? $NF : 0)
}
END {
for (i in n) {
printf FORMAT,i,C0_[i],C1_[i],S0_[i],S1_[i]
}
}' details.txt

Output:

Continent count0 count1 sum0 sum1
NA 1 0 5678 0
OC 2 1 913 90
AF 2 2 135 435
SA 2 2 3254 1323
EU 1 1 90 24

This post explains how we can align text using awk printf function.

Related posts:

- Group by clause functionality in awk
- Sum of and group by using awk in bash
- Grouping similar files using awk in bash

4 comments:

Jadu Saikia said...

To find the min and max of the available values of last column (for each continent), here is a way:

$ awk '
BEGIN {print "Continent count0 count1 sum0 sum1 min max"} {
n[$1]++;
C0_[$1] += ($2 == "0" ? 1 : 0); S0_[$1] += ($2 == "0" ? $NF : 0)
C1_[$1] += ($2 == "1" ? 1 : 0); S1_[$1] += ($2 == "1" ? $NF : 0)
min[$1]=(!($1 in min) || min[$1]> $3 )? $3 : min[$1];max[$1]=(max[$1]> $3)? max[$1] : $3
}
END {
for (i in n) {
print i,C0_[i],C1_[i],S0_[i],S1_[i],min[i], max[i]
}
}' details.txt

o/p:
Continent count0 count1 sum0 sum1 min max
NA 1 0 5678 0 5678 5678
OC 2 1 913 90 23 890
AF 2 2 135 435 12 345
SA 2 2 3254 1323 89 2345
EU 1 1 90 24 24 90

manimaran said...

Thanks a lot again...

cheers,
Manimaran

uaguide said...

hi Jadu,

can you please provide code for below output

IN|1|A|1|one|1
LK|1|C|1|one|1
US|2|A|1|two|1
US|2|B|1|six|1

2nd column is total count noticed of IN/LK/US
4th column is total count noticed of A/B/C
6th column is total count noticed of one/two/six

Jadu Saikia said...

@uaguide sure I can help here. Could you please clarify your input and expected o/p once more, that would help.

© Jadu Saikia www.UNIXCL.com