Wednesday, March 24, 2010

Awk - count and sum on multiple columns


I have got the idea for this post from one of my reader "manimaran" who have posted a comment on one of my post related to group by using awk. Thanks "manimaran".

Input file:
      
$ cat details.txt
SA 0 2345
AF 0 123
SA 1 89
OC 0 890
EU 1 24
AF 1 90
NA 0 5678
AF 1 345
OC 1 90
OC 0 23
SA 1 1234
EU 0 90
AF 0 12
SA 0 909


Output required :

- Group by continent
- For each continent count the followings:

Count of 0
Count of 1
Sum of last fields where 2nd field is 0
Sum of last fields where 2nd field is 1

i.e. required output:

Continent count0 count1 sum0 sum1
NA 1 0 5678 0
OC 2 1 913 90
AF 2 2 135 435
SA 2 2 3254 1323
EU 1 1 90 24

The awk solution using awk associative array and awk if-else

$ awk '
BEGIN {print "Continent count0 count1 sum0 sum1"} {
n[$1]++;
C0_[$1] += ($2 == "0" ? 1 : 0); S0_[$1] += ($2 == "0" ? $NF : 0)
C1_[$1] += ($2 == "1" ? 1 : 0); S1_[$1] += ($2 == "1" ? $NF : 0)
}
END {
for (i in n) {
print i,C0_[i],C1_[i],S0_[i],S1_[i]
}
}' details.txt

Output:

Continent count0 count1 sum0 sum1
NA 1 0 5678 0
OC 2 1 913 90
AF 2 2 135 435
SA 2 2 3254 1323
EU 1 1 90 24

And if you need the output of the above awk script in a well formatted way, here is the one:

$ awk '
BEGIN {
FORMAT="%-10s%-8s%-8s%-8s%s\n"
{printf FORMAT,"Continent","count0","count1","sum0","sum1"}
}
{ n[$1]++
C0_[$1] += ($2 == "0" ? 1 : 0)
S0_[$1] += ($2 == "0" ? $NF : 0)
C1_[$1] += ($2 == "1" ? 1 : 0)
S1_[$1] += ($2 == "1" ? $NF : 0)
}
END {
for (i in n) {
printf FORMAT,i,C0_[i],C1_[i],S0_[i],S1_[i]
}
}' details.txt

Output:

Continent count0 count1 sum0 sum1
NA 1 0 5678 0
OC 2 1 913 90
AF 2 2 135 435
SA 2 2 3254 1323
EU 1 1 90 24

This post explains how we can align text using awk printf function.

Related posts:

- Group by clause functionality in awk
- Sum of and group by using awk in bash
- Grouping similar files using awk in bash

6 comments:

Jadu Saikia said...

To find the min and max of the available values of last column (for each continent), here is a way:

$ awk '
BEGIN {print "Continent count0 count1 sum0 sum1 min max"} {
n[$1]++;
C0_[$1] += ($2 == "0" ? 1 : 0); S0_[$1] += ($2 == "0" ? $NF : 0)
C1_[$1] += ($2 == "1" ? 1 : 0); S1_[$1] += ($2 == "1" ? $NF : 0)
min[$1]=(!($1 in min) || min[$1]> $3 )? $3 : min[$1];max[$1]=(max[$1]> $3)? max[$1] : $3
}
END {
for (i in n) {
print i,C0_[i],C1_[i],S0_[i],S1_[i],min[i], max[i]
}
}' details.txt

o/p:
Continent count0 count1 sum0 sum1 min max
NA 1 0 5678 0 5678 5678
OC 2 1 913 90 23 890
AF 2 2 135 435 12 345
SA 2 2 3254 1323 89 2345
EU 1 1 90 24 24 90

manimaran said...

Thanks a lot again...

cheers,
Manimaran

uaguide said...

hi Jadu,

can you please provide code for below output

IN|1|A|1|one|1
LK|1|C|1|one|1
US|2|A|1|two|1
US|2|B|1|six|1

2nd column is total count noticed of IN/LK/US
4th column is total count noticed of A/B/C
6th column is total count noticed of one/two/six

Jadu Saikia said...

@uaguide sure I can help here. Could you please clarify your input and expected o/p once more, that would help.

Karan Bohra said...


if='details.txt'

< $if perl -lane '
$_ .= " " .
($h{$F[0]} ? $h{$F[0]} : ++($h{$F[0]}=scalar keys %h));
s/^\S+/[$&]/; push @A, $_;
$h{$F[0]} > $max and $max = $h{$F[0]};
}{print"@{[split/\n/,`seq $max|shuf`]}";print for @A;
' |
dc -e "
# each unique key (field#1) is mapped to a unique key code in increasing order.
# line number 1 data has key codes in shuffled order
# stack 'M': holds the shuffled key codes.
# reg 'N': holds the max shuffled key code
[SM z 0 k C S K
# key [0|1] sum key_code
# memory architecture
# x = [A]
# [6*i-5]=count0s C0
# [6*i-4]=count1s C1
# [6*i-3]=sum0s S0
# [6*i-2]=sum1s S1
# [6*i-1]=key k
# [6*i]=key_code K
# i -> list of shuffled key codes

[q]sq
[sK sS sC sk]sa
[lK 6* 5- lC+ d ;x 1+ r :x]sb
[lK 6* 5- lC+ 2+ d ;x lS+ r :x]sc
[lk lK 6* 1- :x lK d 6* :x]sd
[6* d 1- ;x s1 d 5- ;x s5 d 4- ;x s4 d 3- ;x s3 2- ;x s2]se
[lex l1n [ ]n l5n [ ]n l4n [ ]n l3n [ ]n l2n []p c]sf
[? z 0 =q lax lbx lcx ldx c l?x]s?
[LM lfx lN 1- d sN 0 <g]sg
l?x
lgx
" |
sed -e '
1i\
.TS\
center,allbox,tab( );\
c c c c c\
l n n n n.
y/:/@/
$a\
.TE
' | tbl - | nroff -Tascii -ms | grep '.'

Sasidhar Reddy said...

i have a similar query,
Have a text file with the content delimited using pipe symbol, in the below format.

code|Loan type|loan amount|interest|age|phone|email

195|loan1|10|10|age|ph1|email1
195|loan1|20|10|age|ph1|email1
195|loan1|30|10|age|ph1|email1
295|loan2|60|10|age|ph1|email1
295|loan2|70|10|age|ph1|email1
295|loan2|80|10|age|ph1|email1
395|loan3|90|10|age|ph1|email1
395|loan3|10|10|age|ph1|email1
395|loan3|20|10|age|ph1|email1

Trying to group the data based on the code or loan type and also sum certain column values as below:

Header should repeat at the start of each grouping

code|Loan type|loan amount|interest|age|phone|email
195|loan1|10|10|age|ph1|email1
195|loan1|20|10|age|ph1|email1
195|loan1|30|10|age|ph1|email1
Total 60|30
code|Loan type|loan amount|interest|age|phone|email
295|loan2|60|10|age|ph1|email1
295|loan2|70|10|age|ph1|email1
295|loan2|80|10|age|ph1|email1
Total 210|30
code|Loan type|loan amount|interest|age|phone|email
395|loan3|90|10|age|ph1|email1
395|loan3|10|10|age|ph1|email1
395|loan3|20|10|age|ph1|email1

Total 120|30

© Jadu Saikia www.UNIXCL.com