Sunday, September 28, 2008

Sum and group by using awk


$ cat cont_details.txt
continent|mval|cval|kval
SA|2345|10|2.3
AF|123|12|4.5
SA|89|12.67
OC|890|10|2.3
EU|24|45|2.4
AF|90|10|10
NA|5678|12|89
AF|345|12|3.5
OC|90|78|5.6
OC|23|12|4.5
SA|1234|12|6.7
EU|90|12|10
AF|12|12|34
SA|909|12|56

Output required:

Select continent, sum(mval), sum(cval), sum(kval) group by continent

i.e. required output:

continent|mval|cval|kval
NA|5678|12|89
OC|1003|100|12.4
AF|570|46|52
SA|4577|46.67|77.67
EU|114|57|12.4

Awk solution:

$ awk -F"|" '
NR==1 {print}                                        # print the header as-is
NR!=1 {OFS="|"; a[$1]+=$2; b[$1]+=$3; c[$1]+=$NF}    # running sums keyed by continent
END{for (i in a){print i,a[i],b[i],c[i]}}
' cont_details.txt

Output:

continent|mval|cval|kval
NA|5678|12|89
OC|1003|100|12.4
AF|570|46|52
SA|4577|46.67|77.67
EU|114|57|12.4
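Note: for (i in a) visits the array keys in an unspecified order, so the summary rows above can come out in any order. If a stable order is needed, one option (just a sketch) is to print the header separately and sort the aggregated rows:

```shell
# Same aggregation as above; the header goes first and the
# data rows are sorted so runs are reproducible.
head -1 cont_details.txt
awk -F"|" '
NR!=1 {OFS="|"; a[$1]+=$2; b[$1]+=$3; c[$1]+=$NF}
END   {for (i in a) print i, a[i], b[i], c[i]}
' cont_details.txt | sort
```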

Related post:
Implement group by clause functionality using awk

27 comments:

Ruturaj Vartak said...

That is great! Thanks a lot.

yoosfg said...

Excellent.. Much needed help indeed..Thanks

awkseeker said...

I used two of your tricks and they work perfectly for me. I just wonder how your script works: you refer to $1 and then add $2 to it, which is a big question in my head. Could you please explain it more clearly for me? Thank you very much for your contribution here. Hope to hear from you, my professor.

Unknown said...

@awkseeker,
Thanks for reading this blog.
I am putting an example below that will help you understand awk associative arrays. Please let me know if you still have any doubts about this.

$ cat file1.txt
Alex,147,1
Brian,056,7
Nanda,559,2
Lisa,221,0
Nanda,268,2
Alex,200,5
Alex,408,8


i)
#a[$1] registers the first field of each line as a key in the associative array "a" (arrays are always associative in awk)

$ awk -F "," '{a[$1]} END {for (i in a) print i}' file1.txt

Nanda
Lisa
Alex
Brian

ii)
#Count of first field, sum of 2nd field and sum of 3rd field, grouped by the first field

$ awk -F "," '{a[$1]++; b[$1]+=$2; c[$1]+=$3 } END {for (i in a) print i,a[i],b[i],c[i]}' file1.txt

Nanda 2 827 4
Lisa 1 221 0
Alex 3 755 14
Brian 1 56 7

iii)
#Sum of 2nd field and sum of 3rd field, grouped by the first field

$ awk -F "," '{a[$1]+=$2; b[$1]+=$3 } END {for (i in a) print i,a[i],b[i]}' file1.txt

Nanda 827 4
Lisa 221 0
Alex 755 14
Brian 56 7

iv)
#Sum of 2nd field and sum of 3rd field, grouped by the first field; same output as iii), just iterating over array b instead of a

$ awk -F "," '{a[$1]+=$2; b[$1]+=$3 } END {for (i in b) print i,a[i],b[i]}' file1.txt

Nanda 827 4
Lisa 221 0
Alex 755 14
Brian 56 7

Hope this helps.
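The same array pattern extends to other aggregates too. For example, min, max and average of the 2nd field per group (a sketch, using the same file1.txt; output piped through sort since the traversal order is unspecified):

```shell
# min, max and average of the 2nd field, grouped by the 1st field
awk -F "," '{
    n[$1]++; s[$1] += $2
    if (!($1 in mn) || $2+0 < mn[$1]) mn[$1] = $2+0   # +0 forces numeric
    if ($2+0 > mx[$1])                mx[$1] = $2+0
}
END { for (i in n) print i, mn[i], mx[i], s[i]/n[i] }' file1.txt | sort
```

For the file above this prints, e.g., "Alex 147 408 251.667" and "Nanda 268 559 413.5".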

awkseeker said...

Thanks Jadu, your blog is so valuable and saves a huge amount of time in my work. I used your trick on tabular text files and it is much more efficient compared to traditional SQL.

I am new to awk, with some basics of Turbo C from earlier; your good explanation makes things much clearer, and I may bother you again if I have doubts.

Best regards,

Unknown said...

@Awkseeker: thanks so much for your comment, it is really inspiring. Feel free to ask any doubts on Unix one-liners, awk or sed and I will definitely try to help.

awkseeker said...
This comment has been removed by the author.
awkseeker said...
This comment has been removed by the author.
cef_luxan said...

Nice tutorial.
How about grouping by more than one column, i.e.:
Data:
a, b, c, d
Required:
select a, b, sum(c), sum(d) group by a, b;

Thanks!

Unknown said...

@cef_luxan, something like this?

$ cat file.txt
US|A|23
IN|A|12
US|B|12
LK|C|20
US|B|12

$ awk -F"|" '
{OFS="|";a[$1]+=$NF}
END{for (i in a){print i,a[i]}}
' file.txt

LK|20
IN|12
US|47

$ awk -F"|" '
{OFS="|";a[$1"|"$2]+=$NF}
END{for (i in a){print i,a[i]}}
' file.txt

IN|A|12
LK|C|20
US|A|23
US|B|24
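By the way, the compound key $1"|"$2 above is just a convenience. awk also has a built-in compound subscript (the parts are joined with SUBSEP), which can be split back apart. A sketch of the same two-column group by:

```shell
# group by field 1 and field 2 using awk's built-in compound subscript
awk -F"|" '
{ a[$1,$2] += $NF }                         # key is $1 SUBSEP $2
END {
    OFS = "|"
    for (k in a) { split(k, p, SUBSEP); print p[1], p[2], a[k] }
}' file.txt | sort
```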

cef_luxan said...

Hi Mr. Jadu,
Thank you for your excellent answer.

awkseeker said...

Hi Jadu, sorry to bother you again. I tried some SOAP calls and got the result as a big XML. I want to use awk to extract only the needed parameters. Could you help me?

Unknown said...

@awkseeker, sure. Could you please post your input and the expected output? Thanks.

Anonymous said...

This post is a start for me ... I am stumped at something that I am not sure how to start on ... I tried some of your scripts but honestly lost track ... please help.
COL1 COl2 COL3
12222 AUH FLUEH
12222 SSC OPERA
12222 SSC OPERA
(SSC shows 2 occurrences & AUH shows 1 occurrence)
12223 SSC OPERA
12223 AUH FLUEH
12223 SSC OPERA
12223 SSC OPERA
(SSC shows 3 occurrences & AUH shows 1 occurrence)

RESULT -
12222 SSC-2 AUH-1 OPERA-2 FLUEH-1
12223 SSC-3 AUH-1 OPERA-3 FLUEH-1

How should I do this?
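One possible sketch for the counts above (an editor's sketch, not from the original thread). It prints one id/tag pair per line, sorted, rather than the exact single-line layout asked for; count.txt is assumed to hold only the data rows, with the header and the parenthetical notes removed:

```shell
# count how often each COL2/COL3 value appears per COL1 key
awk '{ c[$1" "$2]++; c[$1" "$3]++ }
END  { for (k in c) { split(k, p, " "); print p[1], p[2]"-"c[k] } }
' count.txt | sort
```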

Juan C. Rivera said...

Hi Jadu,

I have an input like this

Nanda 827 4
Lisa 221 0
Alex 755 14
Brian 56 7

I need to sum the last column only, my output would be:
25

Unknown said...

@Juan, you can do something like this:

$ cat file.txt
Nanda 827 4
Lisa 221 0
Alex 755 14
Brian 56 7

$ awk '{sum+=$NF} END {print sum}' file.txt
25

Which is the same as:
$ awk '{sum+=$3} END {print sum}' file.txt
25

You can check this post for other alternatives for summing numbers in a file in the UNIX bash shell:

http://unstableme.blogspot.in/2009/11/sum-of-numbers-in-file-unix.html

Hope this helps.
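To sum an arbitrary column instead of just the last one, the column number can be passed in with -v (a sketch):

```shell
# sum column number "col" of file.txt; col is passed from the shell
awk -v col=3 '{ sum += $col } END { print sum }' file.txt
```

With col=3 this prints 25 for the file above; with col=2 it prints 1859.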

Anonymous said...

Hi Jadu,

Great explanation above of how to do group-bys in awk! However, suppose you need to sum up thousands of columns. Is there a better solution than the one above?

Terry

Anonymous said...

Sorry, I meant: what happens if you need to sum thousands of columns but group by only a few columns?

Terry

Anonymous said...

Hi Jadu,

OK, I came up with the following solution for cases where you have thousands of columns and want to do a group by:

gawk -F"|" 'NR==1 {print}
NR!=1 {OFS="|"
    for (i=1; i<=43; i++)
        a[i,$1] += $(i+1)
}
END {
    for (i in a)
        b[substr(i, index(i,"\034")+1)] = ""
    for (i in b) {
        output = ""
        for (j=1; j<=43; j++)
            output = output "," a[j,i]
        print i, output
    }
}' cont_details.txt

Feel free to post it on your blog! :)

Terry

Anonymous said...

Hi Jadu,

One mistake in the code I sent you earlier: all occurrences of '43' should be replaced by the number of value columns you have.

Terry
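A sketch along the same lines as Terry's idea that derives the column count from NF instead of hard-coding it. It assumes every row has the same number of fields; wide.txt is a stand-in name for any file with one key column followed by value columns:

```shell
awk -F"|" '
NR==1 { print; next }              # pass the header through
{
    keys[$1]
    n = NF - 1                     # number of value columns
    for (i = 1; i <= n; i++)
        a[i,$1] += $(i+1)
}
END {
    OFS = "|"
    for (k in keys) {
        line = k
        for (j = 1; j <= n; j++)
            line = line OFS a[j,k]
        print line
    }
}' wide.txt
```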

awkseeker said...

Hi Jadu, I tried to comment earlier but it seems it did not go through, so I will bother you again.
infile:
info1=1,2
info2=a
info3=a,s
info1=1
info2=a,d
info3=f
info1=1
info2=a
info3=s

output:
1,2|a|a,s
1|a,d|f
1|a|s

I tried to apply an array, but I have no idea how to start. Please advise.

Unknown said...

@awkseeker something like this ?

$ awk -F "=" '{print $NF}' infile.txt | awk 'NR%3{printf $0"|";next;}1'

1,2|a|a,s
1|a,d|f
1|a|s

awkseeker said...

How cool this simple code is!
I would like to bother you further about the code: please explain 'condition {dosomething;next;}1'.
Why is no if condition needed?

Sincerely

Unknown said...

@awkseeker, Thanks.

Actually it is acting like an if condition.

For example, the following two are the same:

$ awk 'NR==3' infile.txt

$ awk '{if(NR==3) {print $0}}' infile.txt
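And written out with an explicit if, the merge one-liner from above becomes:

```shell
# same as: awk -F "=" '{print $NF}' infile.txt | awk 'NR%3{printf $0"|";next;}1'
awk -F "=" '{ print $NF }' infile.txt |
awk '{
    if (NR % 3) {            # lines 1 and 2 of every group of 3
        printf "%s|", $0     # join with a | and do not print a newline
        next
    }
    print                    # every 3rd line: print it and end the row
}'
```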

I have made your query into a blog post today; you can have a look:

http://www.unixcl.com/2013/07/unix-merge-multiple-consecutive-lines.html

Also one post on awk if else that you might like:

http://www.unixcl.com/2009/09/if-else-examples-in-awk-bash.html

Thanks!

RajIyengar said...

Nice work.

Suppose I have data in 2 rows and want to show it in one row :)

Eg. Raw Data

date value flg asset
20131001 1234 B AAA
20131001 1423 S AAA
20131001 1324 B AAB
20131001 4321 S AAB

Expected output:

date b-value s-value asset
20131001 1234 1423 AAA
20131001 1324 4321 AAB

How would I do that?

Thanks.
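An awk sketch for that pivot (an editor's sketch, not from the original thread). It assumes flg only ever holds B or S, at most one of each per date/asset pair, and that the data sits in a file named trades.txt; the row order of the output is unspecified:

```shell
awk 'NR==1 { next }                       # skip the header
{
    key = $1 " " $4                       # group key: date + asset
    if ($3 == "B") b[key] = $2; else s[key] = $2
}
END {
    print "date b-value s-value asset"
    for (key in b) {
        split(key, p, " ")
        print p[1], b[key], s[key]+0, p[2]
    }
}' trades.txt
```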

Unknown said...

Here is a Perl alternative for the same group by:

< cont_details.txt \
perl -wMstrict -Mvars='@v,$i' -F'\|' -lane '
print,next if $. == 1;
$v[$_]->{$F[0]} += $F[$_+1] for 0..2}{$,="|";
$i=$_,print $i, map { $_->{$i} } @v for sort keys %{$v[0]}

'

© Jadu Saikia www.UNIXCL.com