Saturday, September 20, 2008

GROUP BY clause functionality in awk - bash


Q: For a sample two-column data below (cont_bd.txt), how to sum the second column and group by the first column:

Input file:

$ cat cont_bd.txt
continent:mval
SA:2345
AF:123
SA:89
OC:890
EU:24
AF:90
NA:5678
AF:345
OC:90
OC:23
SA:1234
EU:90
AF:12
SA:909

Awk solution for group by clause implementation:

$ awk 'BEGIN{FS=":"; print "continent count total avg"} NR!=1 {a[$1]++;b[$1]=b[$1]+$2}END{for (i in a) printf("%s %10.0f %10.0f %10.2f\n", i, a[i], b[i], b[i]/a[i])} ' cont_bd.txt

Output:

continent count total avg
NA 1 5678 5678.00
OC 3 1003 334.33
AF 4 570 142.50
SA 4 4577 1144.25
EU 2 114 57.00

20 comments:

ropata said...

Thanks mate, this is gold! I used your ideas in this funky script. The hardest part was grokking how awk parses through a file in one hit!

Jadu Saikia said...

@ropata, I just saw your script, great work. Thanks for mentioning me. keep in touch.

The Hat said...

Thanks for this fantastic little script - I've used it for some Twitter data crunching here.

Jadu Saikia said...

@The Hat, Welcome :-)

That's really a good analysis. Thanks for mentioning me. Keep in touch.

manimaran said...

Hi Jadu Saikia...
your little script..looks good.. i need some enhancement for my requirement. I have failed to that using ur script.. could you help me to do..

I have 3 columns and i need to group up ...

SA 0 2345
AF 0 123
SA 1 89
OC 0 890
EU 1 24
AF 1 90
NA 0 5678
AF 1 345
OC 1 90
OC 0 23
SA 1 1234
EU 0 90
AF 0 12
SA 0 909

I want the results like this....

continent count0 count1 total0 total1
NA 1 0 5678 0
OC 2 1 913 90
AF 2 2 135 435
SA
EU

Jadu Saikia said...

@manimaran, I have created a post of your query, please check today's post, hope this helps. Thanks for your comment. Keep in touch.

manimaran said...

Hi Jadu Saikia,

Man... u r gr8.... thanks a ton.. u saved my life :)... its working...

manimaran said...

Hi Jadu Saika,

I need to get the min and max of the available values.. help me if you can.

Expected output:

Continent count0 count1 sum0 sum1 min max
NA 1 0 5678 0 89 5678
OC 2 1 913 90 23 890
AF 2 2 135 435 12 345
SA 2 2 3254 1323 89 2345
EU 1 1 90 24 24 90

Jadu Saikia said...

@manimaran, please check the comment on this post:

http://unstableme.blogspot.com/2010/03/awk-count-and-sum-on-multiple-columns.html

manimaran said...

Hi Jadu Saikia,

is there any way to find the standard deviation of sum0 and sum1 for each row entry ?

Thanks,
-Manimaran

Cacasodo said...

Tasty AWKness, Jadu! Nice work..saved me a bunch of time.
'sodo

John said...

Hi,
I had a similar problem to solve:
I would like to count occurences of the string in a log file, and sort them groupped per hour or minute.

As a total newbie to the linux scripting, I had written something like:

grep 'string' file.log | cut -c 1-23

which returned the result similar to this:

2011-08-06@00:13:43,147
2011-08-06@07:47:30,056
2011-08-06@08:48:15,559
2011-08-06@10:19:07,221
2011-08-06@10:21:45,268

Now I would like to group the results by 2011-08-06@?? and count the lines. The result should look like below:
hour count
2011-08-06@07 1
2011-08-06@17 3
2011-08-06@08 1
2011-08-06@18 1
2011-08-06@10 4

I had modified the given script:
awk 'BEGIN{FS=":"; print " hour count"} NR!=1 {a[$1]++;b[$1]=b[$1]+$2}END{for (i in a) printf("%s %10.0f \n", i, a[i])} ' list.txt

but it omitted the first line starting with
2011-08-06@00

After that I would like to pipe the results of the grep command to the oneliner which does the needed grouping
(I would like to avoid putting scripts into the system, it's rather preferred to copy-paste commands)

Please help :)
John

Jadu Saikia said...

@John, thanks for the question.

The first line is omitted in your case as in the awk command, you had the statement

NR!=1

NR is nothing but the line number.

e.g.

$ cat file.txt
2011-08-06@00:13:43,147
2011-08-06@07:47:30,056
2011-08-06@08:48:15,559
2011-08-06@10:19:07,221
2011-08-06@10:21:45,268

$ awk '{print NR,$0}' file.txt
1 2011-08-06@00:13:43,147
2 2011-08-06@07:47:30,056
3 2011-08-06@08:48:15,559
4 2011-08-06@10:19:07,221
5 2011-08-06@10:21:45,268

$ cat file.txt | awk -F ":" '{a[$1]++} END {for (i in a) print i,a[i]}'

Output:
2011-08-06@07 1
2011-08-06@08 1
2011-08-06@00 1
2011-08-06@10 2

Hope this helps.

Regards,
Jadu

martin novoty said...

Great Hint.

Saurav Bhattacharyya said...

Hi Jadu,

I have a nearly same kind of problem which need to be solved by Unix shell-scripting(awk scripting is also allowed).

Input File:
===========
1,B.Sc,Calcutta
1,M.Sc,Stanford
2,M.A.,Pune
3,M.Tech,IIT Guwahati
4,B.Tech,Shibpur
4,M.Tech,Jadavpur
5,B.Lib.,Calcutta
6,B.Sc.,Bangalore

The Output File would be like this:-
==============
2,1,B.Sc,Calcutta,M.Sc,Stanford
1,2,M.A.,Pune
1,3,M.Tech,IIT Guwahati
2,4,B.Tech,Shibpur,M.Tech,Jadavpur
1,5,B.Lib.,Calcutta
1,6,B.Sc.,Bangalore

i.e. ,, in a single line for each student.
For e.g.,Student No 1 has 2 qualifications, B.Sc from Calcutta
and M.Sc from Stanford.

Please suggest a time-efficient(as this needs to be simulated for crores of records) & brilliant way of doing this.I will highly appreciate any help from you.

Waiting for your kind & prompt help...

Jadu Saikia said...

@Saurav, Thanks for asking the question. Hope this wiil help. Thanks.

$ cat in.txt
1,B.Sc,Calcutta
1,M.Sc,Stanford
2,M.A.,Pune
3,M.Tech,IIT Guwahati
4,B.Tech,Shibpur
4,M.Tech,Jadavpur
5,B.Lib.,Calcutta
6,B.Sc.,Bangalore

$ awk '
BEGIN {FS=OFS=","} {cnt[$1]++}
!A[$1] {A[$1] = $0; next}
{A[$1] = A[$1] "," $2 "," $3}
END {for(i in A) {print cnt[i],A[i]}
}' in.txt

2,1,B.Sc,Calcutta,M.Sc,Stanford
1,2,M.A.,Pune
1,3,M.Tech,IIT Guwahati
2,4,B.Tech,Shibpur,M.Tech,Jadavpur
1,5,B.Lib.,Calcutta
1,6,B.Sc.,Bangalore

Saurav Bhattacharyya said...

Dear Jadu,My heartiest thanks for this magnificient piece of script.

Jadu Saikia said...

@Saurav Bhattacharyya, my pleasure.

Deepak Shrivastava said...

Hi,

i have a text file like this:

1141666113 1141
1141666117 1141
1141666119 1141
1141666120 1141
1815062287 1815
1815062290 1815
1815062351 1815
1815062578 1815
1815062593 1815
2240523210 2240
2240523213 2240
2240523218 2240
2240523220 2240


i want output as 3 different files named 1141, 1815, 2240 and content of these files should be values in 1st column corresponding to these values. for example content of file 2240 should be as below:

2240523210
2240523213
2240523218
2240523220

as you can see, everything is already sorted and grouped together. i need to get the groups out in a separate file.

Please help me writing this script.

Thanks in advance.

ropata said...

@Deepak you can do it with a few one-line unix commands. If your original data is in 'datafile.txt' then you have 3 statements to run:

grep "^1141" datafile.txt | awk '{print $1}' > 1141.txt
grep "^1815" datafile.txt | awk '{print $1}' > 1815.txt
grep "^2240" datafile.txt | awk '{print $1}' > 2240.txt

© Jadu Saikia www.UNIXCL.com