Monday, May 26, 2008

Count number of occurrences using awk


Input File: resVI.txt is a portion of the overall results of the class VI annual sports.


$ cat resVI.txt
AA:100m:Monday
DD:200m:Monday
AA:400m:Friday
AA:LOngJump:Tuesday
CC:HighJump:Wed
DD:1000m:Wed
BB:60kgarmrest:Mon


Now we have to calculate how many prizes each student (first field) won.
i.e.
Output Required:

AA (3 prizes)
BB (1 prizes)
CC (1 prizes)
DD (2 prizes)


We just have to count the occurrences of each first field in resVI.txt, as each line in resVI.txt corresponds to a prize in a particular sports category.

Awk code using array:


$ awk '{count[$1]++}END{for(j in count) print j,"("count[j]" prizes)"}' FS=: resVI.txt
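One thing to keep in mind: awk does not specify the order in which for(j in count) visits the keys, so the lines may not come out in the AA, BB, CC, DD order shown above. A minimal sketch, assuming a standard sort is available, that pipes the result through sort to get a predictable order (the heredoc only recreates resVI.txt so the example is self-contained):

```shell
# Recreate the sample input from this post.
cat > resVI.txt <<'EOF'
AA:100m:Monday
DD:200m:Monday
AA:400m:Friday
AA:LOngJump:Tuesday
CC:HighJump:Wed
DD:1000m:Wed
BB:60kgarmrest:Mon
EOF

# Count each first field, then sort so the output order is predictable.
awk -F: '{count[$1]++} END {for (j in count) print j, "(" count[j] " prizes)"}' resVI.txt | sort
# AA (3 prizes)
# BB (1 prizes)
# CC (1 prizes)
# DD (2 prizes)
```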


Individual steps would have been like this:


$ awk '$1 ~ /AA/ {++c} END {print c}' FS=: resVI.txt
3

$ awk '$1 ~ /BB/ {++c} END {print c}' FS=: resVI.txt
1

$ awk '$1 ~ /CC/ {++c} END {print c}' FS=: resVI.txt
1

$ awk '$1 ~ /DD/ {++c} END {print c}' FS=: resVI.txt
2
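
Note that $1 ~ /AA/ is a regular-expression match, so it would also count a first field that merely contains AA. A small sketch of the difference between the regex match and an exact string comparison; the AAB line is hypothetical data added only to show it, and c+0 is used so that 0 is printed when nothing matches:

```shell
# Hypothetical input: the last line has a first field that only contains "AA".
printf 'AA:100m:Monday\nAA:400m:Friday\nAAB:Relay:Tuesday\n' > sample.txt

# Regex match counts every first field containing "AA".
awk -F: '$1 ~ /AA/ {++c} END {print c+0}' sample.txt    # 3

# String comparison counts only the exact value "AA".
awk -F: '$1 == "AA" {++c} END {print c+0}' sample.txt   # 2
```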


Similar Post of mine: Number of files modified in each month using awk

27 comments:

BobbyG said...

great example, have you got anything else with awk and counting instances ...

contact@fir3net.com

anumolusuvarna said...

Thanks so much for this post.
I was working on a similar example and was worried about counting. This was so useful to me.

arun said...

AA 30
AA 30
AA 5
AA 33
BB 32
BC 30
BD 3
BC 38
AA 33
EE 34
BE 30

How do I compute the sum of all $2 values for each recurring $1?

Jadu Saikia said...

@Arun, thanks for the question.

Are you looking for something like this ?

$ awk '{a[$1]++; b[$1]+=$2} END {for (i in a) print i,a[i],b[i]}' ar.txt
AA 5 131
BB 1 32
BC 2 68
BD 1 3
BE 1 30
EE 1 34

Please let me know if I have misunderstood your requirement.

Marmot said...

I don't understand on what basis count[$1] and count[j] are related.

I always get hung up on stupid things. It's the story of my life.

Still any light you can cast would be gratefully received.

Jadu Saikia said...

@Marmot,

$ cat file.txt
AA 30
AA 30
AA 5
AA 33
BB 32
BC 30
BD 3
BC 38
AA 33
EE 34
BE 30

$ awk '{arr[$1]++} END {for(i in arr) print i,arr[i]}' file.txt
AA 5
BB 1
BC 2
BD 1
BE 1
EE 1

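Here arr[$1]++ records the number of occurrences of each unique $1 value of file.txt in the associative array 'arr'. In the END block, the for construct makes the loop variable i take each stored key in turn, so arr[i] is the same element that arr[$1] incremented while the file was being read. That is how the count for each $1 value is retrieved and printed. Please let me know if this helps.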
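
To make the relation between arr[$1] and arr[i] visible, here is a small tracing sketch (the print statements are added only for illustration) that shows the array being filled line by line and then read back by key:

```shell
# Three lines of the sample data are enough to see the array being built.
printf 'AA 30\nAA 30\nBB 32\n' |
awk '{
    arr[$1]++                                   # $1 is the key; its value goes up by one
    print "line " NR ": arr[" $1 "] is now " arr[$1]
}
END {
    # i takes each key stored above, so arr[i] is the final count for that key
    for (i in arr) print "key " i " -> " arr[i]
}'
# line 1: arr[AA] is now 1
# line 2: arr[AA] is now 2
# line 3: arr[BB] is now 1
# followed, in unspecified order, by: key AA -> 2 and key BB -> 1
```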

Tambrea Cosmin said...

What about sorting these results (let's say by the 2nd column)?

Jadu Saikia said...

@Tambrea Cosmin, something like this ?

$ cat file.txt
AA 30
AA 30
AA 5
AA 33
BB 32
BC 30
BD 3
BC 38
AA 33
EE 34
BE 30

$ awk '{arr[$1]++} END {for(i in arr) print i,arr[i]}' file.txt | sort -n -k2
BB 1
BD 1
BE 1
EE 1
BC 2
AA 5

$ awk '{arr[$1]++} END {for(i in arr) print i,arr[i]}' file.txt | sort -nr -k2
AA 5
BC 2
EE 1
BE 1
BD 1
BB 1
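
With -k2 alone, the order of lines that share the same count is left to sort's fallback comparison. If a predictable order among ties is wanted, one possible sketch is to restrict the numeric key to field 2 and add the name as a second key:

```shell
# Recreate file.txt from the comment above.
printf 'AA 30\nAA 30\nAA 5\nAA 33\nBB 32\nBC 30\nBD 3\nBC 38\nAA 33\nEE 34\nBE 30\n' > file.txt

# Highest count first; ties broken by name in ascending order.
awk '{arr[$1]++} END {for (i in arr) print i, arr[i]}' file.txt | sort -k2,2nr -k1,1
# AA 5
# BC 2
# BB 1
# BD 1
# BE 1
# EE 1
```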

Adri said...

Thanks for the examples!

What about computing the average number of occurrences?

For instance:
AA
AA
BB
BB
AA
CC
CC

I would like to get the value 2.33, as AA is repeated 3 times and BB,CC 2 times.
Thanks in advance.

Jadu Saikia said...

@Adri,

something like this ?

$ cat file.txt
AA
AA
BB
BB
AA
CC
CC

$ awk '{arr[$1]++} END {for(i in arr) print i,NR/arr[i]}' file.txt
AA 2.33333
BB 3.5
CC 3.5

Also if you want to calculate the percentage of each occurrence you can refer to http://unstableme.blogspot.com/2008/09/calculate-percentage-using-awk-in-bash.html

Hope this helps.
Thanks a lot for the question.

Adri said...

Thanks for your fast answer, Jadu!

Actually it is not exactly what I need.
I'm interested only in getting ONE value.

If I have the following file:
AA
AA
BB
BB
CC
CC

This value should be 2, as each variable is repeated twice

If I have the following file:
AA
AA
AA
BB
BB
CC
CC
The value should be 2.33, as AA is repeated 3 times, BB twice and CC twice.

I hope my explanation is clearer.

The point is that I can't use NR because there are other lines in my file that are not relevant. If there are 7 lines with AA, BB or CC (as in my second example), the value "7" should be obtained by summing these counts and not from NR, because NR might be 10 (if I have 3 other lines in my file).

Thanks for your help!
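
A possible sketch for this requirement, which does not use NR: count per key as before, then in the END block divide the sum of the counts by the number of distinct keys. The pattern /^(AA|BB|CC)$/ is only an assumption about what makes a line relevant, and the header, note and footer lines are hypothetical noise:

```shell
# Hypothetical input: the 7 relevant lines plus 3 unrelated ones.
printf 'AA\nAA\nAA\nheader\nBB\nBB\nnote\nCC\nCC\nfooter\n' |
awk '/^(AA|BB|CC)$/ {arr[$1]++}       # count only the relevant lines
END {
    for (i in arr) {keys++; total += arr[i]}
    if (keys) printf "%.2f\n", total / keys
}'
# 2.33
```

Here total is 7 and keys is 3, so the irrelevant lines have no effect on the result.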

abhinav0208 said...

How can we check how many instances of a C program are running at a particular time?

Bala J said...

BalaJ
@Jadu Saikia, your awk group count and sum example is useful to me. Thanks a lot.

Dj said...

I want to calculate the number of occurrences of each letter in each column.

A C T
A A T
B T A
A C B
B C C
C C B
T A T
C A A
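
One way this could be approached (a sketch only, assuming whitespace-separated columns) is to use the column number together with the letter as the array key, and to split the key again when printing. The col prefix in the output is just an illustrative label:

```shell
printf 'A C T\nA A T\nB T A\nA C B\nB C C\nC C B\nT A T\nC A A\n' |
awk '{
    for (c = 1; c <= NF; c++) count[c, $c]++   # key is (column number, letter)
}
END {
    for (k in count) {
        split(k, part, SUBSEP)                 # part[1] = column, part[2] = letter
        print "col" part[1], part[2], count[k]
    }
}' | sort
# col1 A 3
# col1 B 2
# col1 C 2
# col1 T 1
# col2 A 3
# col2 C 4
# col2 T 1
# col3 A 2
# col3 B 2
# col3 C 1
# col3 T 3
```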

sagar Utturkar said...

What if I want to print counts only above 2? or equal to 2?

Jadu Saikia said...

@Sagar, you can do a comparison before printing, something like:

$ awk '{arr[$1]++} END {for(i in arr) {if (arr[i] > 2) print i,arr[i]}}' file.txt
AA 5


$ awk '{arr[$1]++} END {for(i in arr) {if (arr[i] >= 2) print i,arr[i]}}' file.txt
AA 5
BC 2


$ awk '{arr[$1]++} END {for(i in arr) {if (arr[i] == 2) print i,arr[i]}}' file.txt
BC 2

Valerio Ciotti said...

Hi, I have a file with 2 columns like this:

A B
A C
A F
B C
B D
C A
C B
D B

I would like to count how many times a letter appears in the left and in the right column. I would like something like:

A 3 1
B 2 3
C 2 2
D 1 1
F 0 1

Jadu Saikia said...

@Valerio, something like this?

$ cat file.txt
A B
A C
A F
B C
B D
C A
C B
D B


$ awk 'BEGIN {OFS=","} {A[$1]++; B[$2]++} END {for(i in B) {print i,A[i],B[i]}}' file.txt


A,3,1
B,2,3
C,2,2
D,1,1
F,,1

Valerio Ciotti said...

Yes, thank you very much!!

Valerio Ciotti said...

Wait, there is just one little problem: I also want it to output a 0 when it can't find anything. Is that possible?

Jadu Saikia said...

@Valerio,

This should work:

$ awk '{A[$1]++; B[$2]++} END {for(i in B) {print i,A[i]=="" ? 0:A[i], B[i]=="" ? 0:B[i]}}' file.txt

A 3 1
B 2 3
C 2 2
D 1 1
F 0 1


You can also refer to my awk if-else post: http://unstableme.blogspot.in/2009/09/if-else-examples-in-awk-bash.html

Hope this helps. Thanks.
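
The loop above only walks the keys of B, so a letter that appears only in the left column would be skipped. A possible variation (a sketch, with a hypothetical helper array named seen and one hypothetical extra line, G A) that remembers every letter from either column and uses +0 to turn a missing count into 0:

```shell
# Valerio's data plus one hypothetical line (G A) whose left value never appears on the right.
printf 'A B\nA C\nA F\nB C\nB D\nC A\nC B\nD B\nG A\n' |
awk '{A[$1]++; B[$2]++; seen[$1]; seen[$2]}    # seen[] collects letters from both columns
END {for (i in seen) print i, A[i]+0, B[i]+0}' | sort
# A 3 2
# B 2 3
# C 2 2
# D 1 1
# F 0 1
# G 1 0
```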

Frozen_toes said...

How do I count the unique values in column 2 depending on the unique values of column 1 in a file like this :
$ cat file.txt
AA 30
AA 30
AA 5
AA 33
BB 32
BC 30
BD 3
BC 38
AA 33
EE 34
BE 30

where the output would be :

AA 3
BB 1
BC 2
BD 1
BE 1
EE 1

Thanks :) Great blog by the way.

Jadu Saikia said...

@Frozen_toes Thanks for the question.

A not so good way would be:

$ awk '{arr[$1" "$2]++} END {for(i in arr) print i,arr[i]}' file.txt
EE 34 1
BC 30 1
AA 30 2
BB 32 1
BE 30 1
AA 33 2
AA 5 1
BC 38 1
BD 3 1

$ awk '{arr[$1" "$2]++} END {for(i in arr) print i,arr[i]}' file.txt | awk '{arr[$1]++} END {for(i in arr) print i,arr[i]}'
AA 3
BB 1
BC 2
BD 1
BE 1
EE 1

Hope this helps. Thanks.
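
The same result can probably be had in a single awk pass. A sketch, assuming a helper array named seen (not part of the answer above): the pattern !seen[$1,$2]++ is true only the first time a ($1, $2) pair is met, so cnt[$1] ends up counting the distinct $2 values per $1:

```shell
# Recreate file.txt from the question.
printf 'AA 30\nAA 30\nAA 5\nAA 33\nBB 32\nBC 30\nBD 3\nBC 38\nAA 33\nEE 34\nBE 30\n' > file.txt

# Count a pair only on its first appearance, then report per first column.
awk '!seen[$1, $2]++ {cnt[$1]++} END {for (k in cnt) print k, cnt[k]}' file.txt | sort
# AA 3
# BB 1
# BC 2
# BD 1
# BE 1
# EE 1
```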

shabhonam said...

How would I go from this

AA 30
AA 30
BB 32
BB 30
CC 3
CC 33

to

AA 60
BB 62
CC 36

Thanks

Jadu Saikia said...

@shabhonam, you can achieve this as follows:


$ cat file.txt
AA 30
AA 30
BB 32
BB 30
CC 3
CC 33

$ awk '{arr[$1]+=$2} END{for (i in arr) print i,arr[i]}' file.txt

AA 60
BB 62
CC 36

shabhonam said...

Thanks a lot.

shanya said...

Good morning. I have a file with 8 columns and I want the following output: group by the second column, sum the fifth column, and count the occurrences of the third column.

© Jadu Saikia www.UNIXCL.com