Monday, May 18, 2009

Group similar items using awk array


Continuing the theme of a few of my recent posts, here is another one based on awk arrays.

Input file:

$ cat details.txt
Manager1|sw1
Manager3|sw5
Manager1|sw4
Manager2|sw9
Manager2|sw12
Manager1|sw2
Manager1|sw0

Required output:
Group the second fields ($2) that share the same first field ($1), i.e. list together all the engineers who report to the same manager:

Manager1|sw1,sw4,sw2,sw0
Manager2|sw9,sw12
Manager3|sw5

Awk solution:

$ awk '
BEGIN {FS=OFS="|"}
!A[$1] {A[$1] = $0; next}
{A[$1] = A[$1] "," $2}
END {for(i in A) {print A[i]}
}' details.txt
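
One caveat: awk does not define the order in which for (i in A) visits the keys, so the managers may not come out in the order shown above. Here is a minimal sketch that prints each group in the order its manager first appears, assuming the same two-field, "|"-separated input. The function name group_in_order and the helper array "order" are illustrative names, not part of the original one-liner.

```shell
# group_in_order [FILE...]: hypothetical helper, reads "manager|engineer" lines.
group_in_order() {
    awk '
    BEGIN { FS = OFS = "|" }
    # First line seen for this manager: remember its position and store the line.
    !($1 in A) { order[++n] = $1; A[$1] = $0; next }
    # Later lines for the same manager: append the engineer after a comma.
    { A[$1] = A[$1] "," $2 }
    # Walk the keys in first-appearance order instead of using for (i in A).
    END { for (j = 1; j <= n; j++) print A[order[j]] }
    ' "$@"
}
```

Running group_in_order details.txt on the sample file should print the Manager1 line first, then Manager3|sw5, then Manager2|sw9,sw12, which is the order in which the managers appear in the input.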


Let's add one more field, the "team" field. For example, "Manager1" manages engineer "sw1", who is from team1.


$ cat details1.txt
Manager1|team1|sw1
Manager3|team4|sw5
Manager1|team2|sw4
Manager2|team5|sw9
Manager2|team5|sw12
Manager1|team3|sw2
Manager1|team2|sw0

Now let's group the engineers who are on the same team and report to the same manager. The awk solution would be:

$ awk '
BEGIN {FS=OFS="|"}
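# Key on $1 FS $2 so that two different field pairs cannot concatenate to the same key.
!A[$1 FS $2] {A[$1 FS $2] = $0; next}
{A[$1 FS $2] = A[$1 FS $2] "," $3}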
END {for(i in A) {print A[i]}
}' details1.txt

Output (the line order may vary between awk implementations):

Manager1|team1|sw1
Manager1|team2|sw4,sw0
Manager1|team3|sw2
Manager3|team4|sw5
Manager2|team5|sw9,sw12
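
The one-field and two-field cases differ only in how many leading fields make up the key, so they can be folded into one helper. The following is a rough sketch under a few assumptions: the last field is always the value to collect, the number of key fields K is smaller than the number of fields, and the name group_by_key is made up for illustration. The key fields are joined with the separator so that different field combinations cannot run together into the same key, and the result is piped through sort so that the line order is deterministic.

```shell
# group_by_key K [FILE...]: hypothetical helper; K = number of leading key fields.
group_by_key() {
    k=$1
    shift
    awk -v k="$k" '
    BEGIN { FS = OFS = "|" }
    {
        # Build the key from the first k fields, keeping the separator between them.
        key = $1
        for (i = 2; i <= k; i++)
            key = key FS $i
        # Collect the last field of every line that shares this key.
        if (key in A)
            A[key] = A[key] "," $NF
        else
            A[key] = $NF
    }
    END { for (key in A) print key, A[key] }
    ' "$@" | LC_ALL=C sort
}
```

With the sample files, group_by_key 1 details.txt would cover the first problem and group_by_key 2 details1.txt the second one. Because of the sort, the lines come out ordered by manager and then by team, not in the order shown above.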

4 comments:

Nathan said...

#####################################################################################################
# Note: the complexity of the following solution is O(m*n), where n is the number of lines in the
# input file and m is the number of distinct managers, i.e. up to O(n^2). While this solution works,
# I wouldn't use it over the awk solution.
#####################################################################################################

#!/bin/bash

declare -a ARR_MNGRS
declare -i NUM_MNGRS

# Distinct manager names (field 1), sorted
ARR_MNGRS=($(cut -d'|' -f1 details.txt | sort -u))
NUM_MNGRS=${#ARR_MNGRS[*]}

for ((i=0; i<NUM_MNGRS; i++))
do
    printf "%s|" "${ARR_MNGRS[$i]}"
    SEP=""    # empty before the first engineer, "," before each later one

    # Rescan the whole file for every manager
    while read -r LINE
    do
        if [[ ${LINE%%|*} == "${ARR_MNGRS[$i]}" ]]
        then
            printf "%s%s" "$SEP" "${LINE##*|}"
            SEP=","
        fi
    done < details.txt

    echo ""    # needed for the newline
done
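
For comparison, the repeated scan of the file can be avoided in pure bash as well. This is only a sketch, assuming bash 4 or later (for associative arrays) and well-formed "manager|engineer" lines; the function name group_file and the variable names are hypothetical. It reads the file once and prints the managers in the order they first appear.

```shell
# Requires bash 4+ (associative arrays). group_file is a hypothetical name.
group_file() {
    local -A engineers=()    # manager -> comma-separated engineers
    local -a managers=()     # managers in first-appearance order
    local manager engineer

    # Single pass: split each line on "|" into manager and engineer.
    while IFS='|' read -r manager engineer
    do
        [[ -n $manager ]] || continue    # skip blank lines
        if [[ -z ${engineers[$manager]+set} ]]
        then
            managers+=("$manager")
            engineers[$manager]=$engineer
        else
            engineers[$manager]+=",$engineer"
        fi
    done < "$1"

    for manager in "${managers[@]}"
    do
        printf '%s|%s\n' "$manager" "${engineers[$manager]}"
    done
}
```

Called as group_file details.txt, this does one pass over the input plus one pass over the distinct managers, at the cost of holding all the groups in memory.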

Nathan said...

Complexity: up to O(n^2), so it's not so great.

I hope I didn't double post.

Unknown said...

@Nathan, thanks for the pure bash solution. Really useful.

Another alternative using awk:

$ awk 'BEGIN {FS=OFS="|"} {
arr[$1] = ($1 in arr) ? arr[$1] "," $2 : $2
}
END {
for (i in arr)
print i, arr[i]
}' details.txt

Manager1|sw1,sw4,sw2,sw0
Manager2|sw9,sw12
Manager3|sw5

Anirudh said...

Here's a pure sed-based solution to the managers problem.
Note: I am using the y/// command since POSIX sed doesn't support the [^\n] syntax.

cat - <<__DATA__ |
Manager1|team1|sw1
Manager3|team4|sw5
Manager1|team2|sw4
Manager2|team5|sw9
Manager2|team5|sw12
Manager1|team3|sw2
Manager1|team2|sw0
__DATA__
sed -e '
G

y/\n_/_\n/

s/^\([^|_]*\)|\([^|_]*\)|\([^|_]*\)_\1|\2|\([^|_]*\)/\1|\2|\4,\3/;ta
s/^\([^|_]*\)|\([^|_]*\)|\([^|_]*\)_\(.*_\)\1|\2|\([^|_]*\)/\4\1|\2|\5,\3/;ta

s/_$//
s/^\([^_]*\)_\(.*\)/\2_\1/

:a
y/\n_/_\n/
h;$!d
'

© Jadu Saikia www.UNIXCL.com