Wednesday, March 24, 2010

Awk - count and sum on multiple columns

I got the idea for this post from one of my readers, "manimaran", who posted a comment on one of my earlier posts about group by using awk. Thanks, "manimaran".

Input file:
```
$ cat details.txt
SA 0 2345
AF 0 123
SA 1 89
OC 0 890
EU 1 24
AF 1 90
NA 0 5678
AF 1 345
OC 1 90
OC 0 23
SA 1 1234
EU 0 90
AF 0 12
SA 0 909
```

Output required :

- Group by continent
- For each continent, count the following:

- Count of 0s
- Count of 1s
- Sum of the last field where the 2nd field is 0
- Sum of the last field where the 2nd field is 1

i.e. required output:
```
Continent count0 count1 sum0 sum1
NA 1 0 5678 0
OC 2 1 913 90
AF 2 2 135 435
SA 2 2 3254 1323
EU 1 1 90 24
```

The awk solution, using awk associative arrays and the ternary (if-else) operator:
```
$ awk 'BEGIN {print "Continent count0 count1 sum0 sum1"}
{
    n[$1]++
    C0_[$1] += ($2 == "0" ? 1 : 0); S0_[$1] += ($2 == "0" ? $NF : 0)
    C1_[$1] += ($2 == "1" ? 1 : 0); S1_[$1] += ($2 == "1" ? $NF : 0)
}
END {
    for (i in n) {
        print i, C0_[i], C1_[i], S0_[i], S1_[i]
    }
}' details.txt
```

Output:
```
Continent count0 count1 sum0 sum1
NA 1 0 5678 0
OC 2 1 913 90
AF 2 2 135 435
SA 2 2 3254 1323
EU 1 1 90 24
```

And if you need the output of the above awk script in a well formatted way, here is the one:
```
$ awk 'BEGIN {
    FORMAT = "%-10s%-8s%-8s%-8s%s\n"
    printf FORMAT, "Continent", "count0", "count1", "sum0", "sum1"
}
{
    n[$1]++
    C0_[$1] += ($2 == "0" ? 1 : 0)
    S0_[$1] += ($2 == "0" ? $NF : 0)
    C1_[$1] += ($2 == "1" ? 1 : 0)
    S1_[$1] += ($2 == "1" ? $NF : 0)
}
END {
    for (i in n) {
        printf FORMAT, i, C0_[i], C1_[i], S0_[i], S1_[i]
    }
}' details.txt
```

Output:
```
Continent count0  count1  sum0    sum1
NA        1       0       5678    0
OC        2       1       913     90
AF        2       2       135     435
SA        2       2       3254    1323
EU        1       1       90      24
```

This post also shows how text can be aligned using the awk printf function.
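As a minimal, standalone illustration (a sketch, not taken from the post) of how the printf width specifiers drive that alignment:

```shell
# %-10s left-justifies a string in a 10-character column;
# %8d right-justifies an integer in an 8-character column.
awk 'BEGIN {
    printf "%-10s%8s\n", "Continent", "sum"
    printf "%-10s%8d\n", "NA", 5678
}'
```

Each row is therefore exactly 18 characters wide regardless of the values, which is what keeps the columns lined up.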

Related posts:

- Group by clause functionality in awk
- Sum of and group by using awk in bash
- Grouping similar files using awk in bash

Unknown said...

To find the min and max of the available values of last column (for each continent), here is a way:

```
$ awk '
BEGIN {print "Continent count0 count1 sum0 sum1 min max"}
{
    n[$1]++
    C0_[$1] += ($2 == "0" ? 1 : 0); S0_[$1] += ($2 == "0" ? $NF : 0)
    C1_[$1] += ($2 == "1" ? 1 : 0); S1_[$1] += ($2 == "1" ? $NF : 0)
    min[$1] = (!($1 in min) || min[$1] > $3) ? $3 : min[$1]
    max[$1] = (max[$1] > $3) ? max[$1] : $3
}
END {
    for (i in n) {
        print i, C0_[i], C1_[i], S0_[i], S1_[i], min[i], max[i]
    }
}' details.txt
```

o/p:
Continent count0 count1 sum0 sum1 min max
NA 1 0 5678 0 5678 5678
OC 2 1 913 90 23 890
AF 2 2 135 435 12 345
SA 2 2 3254 1323 89 2345
EU 1 1 90 24 24 90
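The min/max ternaries above can also be written as a seed-on-first-sight idiom, which some readers may find easier to follow. A sketch (using $NF rather than $3, so it would also work on wider records):

```shell
# Create the sample input inline so the sketch is self-contained.
cat > details.txt <<'EOF'
SA 0 2345
AF 0 123
SA 1 89
OC 0 890
EU 1 24
AF 1 90
NA 0 5678
AF 1 345
OC 1 90
OC 0 23
SA 1 1234
EU 0 90
AF 0 12
SA 0 909
EOF

# Seed min/max the first time a key is seen; update on later lines.
awk '{
    if (!($1 in min) || $NF < min[$1]) min[$1] = $NF
    if (!($1 in max) || $NF > max[$1]) max[$1] = $NF
}
END { for (k in min) print k, min[k], max[k] }' details.txt
```

Note that the order of `for (k in min)` is unspecified in awk, so the groups may print in any order.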

Unknown said...

Thanks a lot again...

cheers,
Manimaran

Anonymous said...

Can you please provide code for the below output?

IN|1|A|1|one|1
LK|1|C|1|one|1
US|2|A|1|two|1
US|2|B|1|six|1

2nd column is total count noticed of IN/LK/US
4th column is total count noticed of A/B/C
6th column is total count noticed of one/two/six
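The request is ambiguous, but under one reading — the input has three pipe-separated columns, the 2nd output column is the global count of column 1, and the counts for columns 3 and 5 are scoped per column-1 group — a two-pass awk sketch reproduces the requested output. Both the input file and the scoping rule here are assumptions reverse-engineered from the sample output:

```shell
# Hypothetical 3-column input, reverse-engineered from the requested output.
cat > data.txt <<'EOF'
IN|A|one
LK|C|one
US|A|two
US|B|six
EOF

# Pass 1 (NR == FNR): count column 1 globally, columns 2 and 3 per column-1 group.
# Pass 2: print each row annotated with its counts.
awk -F'|' '
NR == FNR { n1[$1]++; n2[$1, $2]++; n3[$1, $3]++; next }
{ print $1 "|" n1[$1] "|" $2 "|" n2[$1, $2] "|" $3 "|" n3[$1, $3] }
' data.txt data.txt
```

Under these assumptions this prints exactly the four lines shown in the request.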

Unknown said...

@uaguide Sure, I can help here. Could you please clarify your input and expected o/p once more? That would help.

Unknown said...

if='details.txt'

< $if perl -lane '
$_ .= " " .
($h{$F[0]} ? $h{$F[0]} : ++($h{$F[0]}=scalar keys %h));
s/^\S+/[$&]/; push @A, $_;
$h{$F[0]} > $max and $max = $h{$F[0]};
}{print"@{[split/\n/,`seq $max|shuf`]}";print for @A;
' |
dc -e "
# each unique key (field#1) is mapped to a unique key code in increasing order.
# line number 1 data has key codes in shuffled order
# stack 'M': holds the shuffled key codes.
# reg 'N': holds the max shuffled key code
[SM z 0 k C S K
# key [0|1] sum key_code
# memory architecture
# x = [A]
# [6*i-5]=count0s C0
# [6*i-4]=count1s C1
# [6*i-3]=sum0s S0
# [6*i-2]=sum1s S1
# [6*i-1]=key k
# [6*i]=key_code K
# i -> list of shuffled key codes

[q]sq
[sK sS sC sk]sa
[lK 6* 5- lC+ d ;x 1+ r :x]sb
[lK 6* 5- lC+ 2+ d ;x lS+ r :x]sc
[lk lK 6* 1- :x lK d 6* :x]sd
[6* d 1- ;x s1 d 5- ;x s5 d 4- ;x s4 d 3- ;x s3 2- ;x s2]se
[lex l1n [ ]n l5n [ ]n l4n [ ]n l3n [ ]n l2n []p c]sf
[? z 0 =q lax lbx lcx ldx c l?x]s?
[LM lfx lN 1- d sN 0 <g]sg
l?x
lgx
" |
sed -e '
1i\
.TS\
center,allbox,tab( );\
c c c c c\
l n n n n.
y/:/@/
$a\
.TE
' | tbl - | nroff -Tascii -ms | grep '.'

Unknown said...

I have a similar query.
I have a text file with pipe-delimited content, in the below format.

code|Loan type|loan amount|interest|age|phone|email

195|loan1|10|10|age|ph1|email1
195|loan1|20|10|age|ph1|email1
195|loan1|30|10|age|ph1|email1
295|loan2|60|10|age|ph1|email1
295|loan2|70|10|age|ph1|email1
295|loan2|80|10|age|ph1|email1
395|loan3|90|10|age|ph1|email1
395|loan3|10|10|age|ph1|email1
395|loan3|20|10|age|ph1|email1

Trying to group the data based on the code or loan type and also sum certain column values as below:

Header should repeat at the start of each grouping

code|Loan type|loan amount|interest|age|phone|email
195|loan1|10|10|age|ph1|email1
195|loan1|20|10|age|ph1|email1
195|loan1|30|10|age|ph1|email1
Total 60|30
code|Loan type|loan amount|interest|age|phone|email
295|loan2|60|10|age|ph1|email1
295|loan2|70|10|age|ph1|email1
295|loan2|80|10|age|ph1|email1
Total 210|30
code|Loan type|loan amount|interest|age|phone|email
395|loan3|90|10|age|ph1|email1
395|loan3|10|10|age|ph1|email1
395|loan3|20|10|age|ph1|email1

Total 120|30
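One way to produce that layout (a sketch, assuming the input is already sorted by code, as in the sample): remember the header, and whenever the code in column 1 changes, flush the running totals of columns 3 and 4 and reprint the header.

```shell
# Sample input, taken from the question above.
cat > loans.txt <<'EOF'
code|Loan type|loan amount|interest|age|phone|email
195|loan1|10|10|age|ph1|email1
195|loan1|20|10|age|ph1|email1
195|loan1|30|10|age|ph1|email1
295|loan2|60|10|age|ph1|email1
295|loan2|70|10|age|ph1|email1
295|loan2|80|10|age|ph1|email1
395|loan3|90|10|age|ph1|email1
395|loan3|10|10|age|ph1|email1
395|loan3|20|10|age|ph1|email1
EOF

awk -F'|' '
NR == 1 { header = $0; next }              # remember the header line
$1 != prev {                               # new group: flush the old totals
    if (prev != "") print "Total " amt "|" intr
    print header
    amt = intr = 0
    prev = $1
}
{ print; amt += $3; intr += $4 }
END { if (prev != "") print "Total " amt "|" intr }
' loans.txt
```

If the input were not sorted by code, it would need a `sort -t'|' -k1,1` first (or a two-pass approach), since the script only compares each line with the previous one.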