Q: Given the two-column sample data below (cont_bd.txt), how can we sum the second column grouped by the first column?
Input file:
$ cat cont_bd.txt
continent:mval
SA:2345
AF:123
SA:89
OC:890
EU:24
AF:90
NA:5678
AF:345
OC:90
OC:23
SA:1234
EU:90
AF:12
SA:909
Awk solution implementing the group-by:
$ awk 'BEGIN{FS=":"; print "continent count total avg"} NR!=1 {a[$1]++;b[$1]=b[$1]+$2}END{for (i in a) printf("%s %10.0f %10.0f %10.2f\n", i, a[i], b[i], b[i]/a[i])} ' cont_bd.txt
Output:
continent count total avg
NA 1 5678 5678.00
OC 3 1003 334.33
AF 4 570 142.50
SA 4 4577 1144.25
EU 2 114 57.00
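For readers new to awk, the same logic can be written in an expanded, commented form. This is just the one-liner above spread over lines; nothing changes functionally:
$ awk 'BEGIN { FS = ":"; print "continent count total avg" }
NR != 1 {
    count[$1]++        # rows seen for this continent
    total[$1] += $2    # running sum of the mval column
}
END {
    for (c in count)
        printf("%s %10.0f %10.0f %10.2f\n", c, count[c], total[c], total[c]/count[c])
}' cont_bd.txt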
21 comments:
Thanks mate, this is gold! I used your ideas in this funky script. The hardest part was grokking how awk parses through a file in one hit!
@ropata, I just saw your script, great work. Thanks for mentioning me. Keep in touch.
Thanks for this fantastic little script - I've used it for some Twitter data crunching here.
@The Hat, Welcome :-)
That's really a good analysis. Thanks for mentioning me. Keep in touch.
Hi Jadu Saikia...
Your little script looks good, but I need an enhancement for my requirement and have not managed it with your script. Could you help me do it?
I have 3 columns and I need to group them up as follows:
SA 0 2345
AF 0 123
SA 1 89
OC 0 890
EU 1 24
AF 1 90
NA 0 5678
AF 1 345
OC 1 90
OC 0 23
SA 1 1234
EU 0 90
AF 0 12
SA 0 909
I want the results like this:
continent count0 count1 total0 total1
NA 1 0 5678 0
OC 2 1 913 90
AF 2 2 135 435
SA 2 2 3254 1323
EU 1 1 90 24
@manimaran, I have created a post for your query; please check today's post, hope this helps. Thanks for your comment. Keep in touch.
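For reference, a minimal sketch of one possible awk approach for that layout (not necessarily the one in the post Jadu mentions), assuming the three space-separated columns shown above and a hypothetical input file in.txt:
$ awk 'BEGIN { print "continent count0 count1 total0 total1" }
{
    cnt[$1" "$2]++        # occurrences per continent and flag (0/1)
    sum[$1" "$2] += $3    # running totals per continent and flag
    seen[$1] = 1          # remember every continent
}
END {
    for (c in seen)
        print c, cnt[c" 0"]+0, cnt[c" 1"]+0, sum[c" 0"]+0, sum[c" 1"]+0
}' in.txt
The "+0" forces combinations that never occur to print as 0 instead of an empty string.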
Hi Jadu Saikia,
Man... you are great... thanks a ton... you saved my life :) ... it's working...
Hi Jadu Saikia,
I also need to get the min and max of the available values. Help me if you can.
Expected output:
Continent count0 count1 sum0 sum1 min max
NA 1 0 5678 0 5678 5678
OC 2 1 913 90 23 890
AF 2 2 135 435 12 345
SA 2 2 3254 1323 89 2345
EU 1 1 90 24 24 90
@manimaran, please check the comment on this post:
http://unstableme.blogspot.com/2010/03/awk-count-and-sum-on-multiple-columns.html
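A minimal sketch extending the same idea to min and max (not necessarily the solution at that link), again assuming space-separated columns and a hypothetical input file in.txt:
$ awk 'BEGIN { print "Continent count0 count1 sum0 sum1 min max" }
{
    cnt[$1" "$2]++; sum[$1" "$2] += $3
    if (!($1 in min) || $3 < min[$1]) min[$1] = $3   # smallest value per continent
    if (!($1 in max) || $3 > max[$1]) max[$1] = $3   # largest value per continent
}
END {
    for (c in max)
        print c, cnt[c" 0"]+0, cnt[c" 1"]+0, sum[c" 0"]+0, sum[c" 1"]+0, min[c], max[c]
}' in.txt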
Hi Jadu Saikia,
Is there any way to find the standard deviation of sum0 and sum1 for each row entry?
Thanks,
-Manimaran
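If the intent is the standard deviation of the column-3 values within each continent group (one possible reading of the question), a minimal sketch using the population formula, assuming the same space-separated input file in.txt:
$ awk '{
    n[$1]++; s[$1] += $3; ss[$1] += $3*$3   # count, sum and sum of squares per continent
}
END {
    for (c in n) {
        mean = s[c]/n[c]
        printf("%s mean=%.2f stddev=%.2f\n", c, mean, sqrt(ss[c]/n[c] - mean*mean))
    }
}' in.txt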
Tasty AWKness, Jadu! Nice work..saved me a bunch of time.
'sodo
Hi,
I had a similar problem to solve:
I would like to count occurrences of a string in a log file and group the counts per hour or minute.
As a total newbie to Linux scripting, I had written something like:
grep 'string' file.log | cut -c 1-23
which returned results similar to this:
2011-08-06@00:13:43,147
2011-08-06@07:47:30,056
2011-08-06@08:48:15,559
2011-08-06@10:19:07,221
2011-08-06@10:21:45,268
Now I would like to group the results by 2011-08-06@?? and count the lines. The result should look like below:
hour count
2011-08-06@07 1
2011-08-06@17 3
2011-08-06@08 1
2011-08-06@18 1
2011-08-06@10 4
I had modified the given script:
awk 'BEGIN{FS=":"; print " hour count"} NR!=1 {a[$1]++;b[$1]=b[$1]+$2}END{for (i in a) printf("%s %10.0f \n", i, a[i])} ' list.txt
but it omitted the first line starting with
2011-08-06@00
After that I would like to pipe the results of the grep command into the one-liner that does the needed grouping
(I would like to avoid putting scripts on the system; copy-pasting commands is preferred).
Please help :)
John
@John, thanks for the question.
The first line is omitted in your case because the awk command contains the pattern
NR!=1
NR is nothing but the current line (record) number.
e.g.
$ cat file.txt
2011-08-06@00:13:43,147
2011-08-06@07:47:30,056
2011-08-06@08:48:15,559
2011-08-06@10:19:07,221
2011-08-06@10:21:45,268
$ awk '{print NR,$0}' file.txt
1 2011-08-06@00:13:43,147
2 2011-08-06@07:47:30,056
3 2011-08-06@08:48:15,559
4 2011-08-06@10:19:07,221
5 2011-08-06@10:21:45,268
$ cat file.txt | awk -F ":" '{a[$1]++} END {for (i in a) print i,a[i]}'
Output:
2011-08-06@07 1
2011-08-06@08 1
2011-08-06@00 1
2011-08-06@10 2
Hope this helps.
Regards,
Jadu
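For completeness, the grep and the grouping can be combined into a single copy-paste pipeline (assuming the log file name file.log and the search string from the comment above):
$ grep 'string' file.log | cut -c 1-23 | awk -F":" '{a[$1]++} END {for (i in a) print i, a[i]}'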
Great Hint.
Hi Jadu,
I have a nearly identical problem which needs to be solved with Unix shell scripting (awk is also allowed).
Input File:
===========
1,B.Sc,Calcutta
1,M.Sc,Stanford
2,M.A.,Pune
3,M.Tech,IIT Guwahati
4,B.Tech,Shibpur
4,M.Tech,Jadavpur
5,B.Lib.,Calcutta
6,B.Sc.,Bangalore
The Output File would be like this:-
==============
2,1,B.Sc,Calcutta,M.Sc,Stanford
1,2,M.A.,Pune
1,3,M.Tech,IIT Guwahati
2,4,B.Tech,Shibpur,M.Tech,Jadavpur
1,5,B.Lib.,Calcutta
1,6,B.Sc.,Bangalore
i.e. the qualification count followed by the original fields, all in a single line for each student.
For example, Student No 1 has 2 qualifications: B.Sc from Calcutta and M.Sc from Stanford.
Please suggest a time-efficient (this needs to run over crores of records) and elegant way of doing this. I will highly appreciate any help from you.
Waiting for your kind & prompt help...
@Saurav, thanks for asking the question. Hope this will help. Thanks.
$ cat in.txt
1,B.Sc,Calcutta
1,M.Sc,Stanford
2,M.A.,Pune
3,M.Tech,IIT Guwahati
4,B.Tech,Shibpur
4,M.Tech,Jadavpur
5,B.Lib.,Calcutta
6,B.Sc.,Bangalore
$ awk '
BEGIN {FS=OFS=","} {cnt[$1]++}        # count records per student number
!A[$1] {A[$1] = $0; next}             # first record for this student: keep the whole line
{A[$1] = A[$1] "," $2 "," $3}         # later records: append qualification and place
END {for(i in A) {print cnt[i],A[i]}
}' in.txt
2,1,B.Sc,Calcutta,M.Sc,Stanford
1,2,M.A.,Pune
1,3,M.Tech,IIT Guwahati
2,4,B.Tech,Shibpur,M.Tech,Jadavpur
1,5,B.Lib.,Calcutta
1,6,B.Sc.,Bangalore
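Note: "for (i in A)" iterates in an unspecified order, so the output lines may not come out sorted by student number on every awk implementation; appending | sort -t"," -k2,2n to the command above restores that order if needed.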
Dear Jadu, my heartiest thanks for this magnificent piece of script.
@Saurav Bhattacharyya, my pleasure.
Hi,
I have a text file like this:
1141666113 1141
1141666117 1141
1141666119 1141
1141666120 1141
1815062287 1815
1815062290 1815
1815062351 1815
1815062578 1815
1815062593 1815
2240523210 2240
2240523213 2240
2240523218 2240
2240523220 2240
I want the output as 3 different files named 1141, 1815 and 2240, where the content of each file is the first-column values belonging to that group. For example, the content of file 2240 should be as below:
2240523210
2240523213
2240523218
2240523220
As you can see, everything is already sorted and grouped together; I just need to get each group out into a separate file.
Please help me write this script.
Thanks in advance.
@Deepak, you can do it with a few one-line Unix commands. If your original data is in 'datafile.txt' then you have 3 statements to run:
grep "^1141" datafile.txt | awk '{print $1}' > 1141.txt
grep "^1815" datafile.txt | awk '{print $1}' > 1815.txt
grep "^2240" datafile.txt | awk '{print $1}' > 2240.txt
Dear Sir, I have a file like this,
with 4 comma-separated columns:
26001815,xxxxx,2017-06-10,4
86233020,yyyyy,2017-06-10,1
07703839,eeeee,2017-06-10,2
04535903,ccccc,2017-06-10,1
36650473,ddddd,2017-06-10,4
26001815,xxxxx,2017-06-11,4
86233020,yyyyy,2017-06-11,1
07703839,eeeee,2017-06-11,3
04535903,ccccc,2017-06-11,3
36650473,ddddd,2017-06-11,1
I need to total the number of occurrences of each last-column value for a given date.
The last column indicates the following:
1 = Transfer
2 = Suspend
3 = Resume
4 = Deactivated
The result should look like this:
Date Transfer Suspend Resume Deactivated
2017-06-10 2 1 0 2
2017-06-11 2 0 2 1
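A minimal sketch of one way to build that pivot in awk, assuming the comma-separated input above is saved as a hypothetical file.txt:
$ awk -F"," 'BEGIN { print "Date Transfer Suspend Resume Deactivated" }
{
    c[$3" "$4]++     # count per date and status code
    dates[$3] = 1    # remember every date
}
END {
    for (d in dates)
        print d, c[d" 1"]+0, c[d" 2"]+0, c[d" 3"]+0, c[d" 4"]+0
}' file.txt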