## Monday, May 18, 2009

### Group similar items using awk array

Continuing my recent run of posts on awk arrays, here is another similar example.

Input file:

    $ cat details.txt
    Manager1|sw1
    Manager3|sw5
    Manager1|sw4
    Manager2|sw9
    Manager2|sw12
    Manager1|sw2
    Manager1|sw0

Output required:
Group the second fields (`$2`) that share the same first field (`$1`), i.e. list together the engineers who report to the same manager. The required output is:

    Manager1|sw1,sw4,sw2,sw0
    Manager2|sw9,sw12
    Manager3|sw5

Awk solution:

    $ awk '
       BEGIN {FS=OFS="|"}
       !A[$1] {A[$1] = $0; next}
       {A[$1] = A[$1] "," $2}
       END {for(i in A) {print A[i]}}
    ' details.txt
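One caveat with the solution above: awk does not define the order in which `for (i in A)` visits the keys, so the managers may not come out in the order shown. Here is a minimal sketch of one way to keep first-seen order, assuming an extra array and counter (`order` and `n`, names that are not in the original solution) to remember each manager as it first appears:

```shell
#!/bin/sh
# Sample data from the post, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
Manager1|sw1
Manager3|sw5
Manager1|sw4
Manager2|sw9
Manager2|sw12
Manager1|sw2
Manager1|sw0
EOF

# 'order' and 'n' are illustrative names, not part of the original solution.
result=$(awk '
  BEGIN { FS = OFS = "|" }
  # First time this manager is seen: remember its position and start the list.
  !($1 in A) { order[++n] = $1; A[$1] = $2; next }
  # Manager already known: append the engineer to the existing list.
  { A[$1] = A[$1] "," $2 }
  # Print the groups in the order the managers first appeared.
  END { for (k = 1; k <= n; k++) print order[k], A[order[k]] }
' "$tmp")

rm -f "$tmp"
printf '%s\n' "$result"
```

With the sample data this prints Manager1, Manager3, Manager2, i.e. the order of first appearance in the file. If sorted output is what you want, piping the original command through `sort` is another option.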

Let's add one more field, the "team" field. For example, "Manager1" manages engineer "sw1", who is from team1.

    $ cat details1.txt
    Manager1|team1|sw1
    Manager3|team4|sw5
    Manager1|team2|sw4
    Manager2|team5|sw9
    Manager2|team5|sw12
    Manager1|team3|sw2
    Manager1|team2|sw0

Now let's group the engineers who are on the same team and report to the same manager. The awk solution would be:

    $ awk '
       BEGIN {FS=OFS="|"}
       !A[$1$2] {A[$1$2] = $0; next}
       {A[$1$2] = A[$1$2] "," $3}
       END {for(i in A) {print A[i]}}
    ' details1.txt

Output:

    Manager1|team1|sw1
    Manager1|team2|sw4,sw0
    Manager1|team3|sw2
    Manager3|team4|sw5
    Manager2|team5|sw9,sw12
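The key `$1$2` simply concatenates the two fields, so two different manager/team pairs could in principle collide (for example `Manager1|1team` and `Manager11|team` both give `Manager11team`). That does not happen with the sample data, but here is a hedged sketch of a variant that uses awk's comma subscript `A[$1,$2]`, which joins the parts with `SUBSEP`. The `sort` at the end is my addition, only there to make the output order predictable:

```shell
#!/bin/sh
# Sample data from the post, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
Manager1|team1|sw1
Manager3|team4|sw5
Manager1|team2|sw4
Manager2|team5|sw9
Manager2|team5|sw12
Manager1|team3|sw2
Manager1|team2|sw0
EOF

result=$(awk '
  BEGIN { FS = OFS = "|" }
  # ($1,$2) builds the key as $1 SUBSEP $2, so the two parts cannot run together.
  !(($1, $2) in A) { A[$1, $2] = $0; next }
  # Same manager and team seen before: append the engineer.
  { A[$1, $2] = A[$1, $2] "," $3 }
  END { for (k in A) print A[k] }
' "$tmp" | LC_ALL=C sort)   # sort only to get a deterministic order

rm -f "$tmp"
printf '%s\n' "$result"
```

The grouping logic is otherwise the same as in the solution above; only the way the array key is built differs.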

Nathan said...

    #!/bin/bash
    # Note: this rereads the input file once per manager, so the cost is about
    # O(m*n) for m managers and n input lines, i.e. up to O(n^2).
    # It works, but I wouldn't use it over the awk solution.

    declare -a ARR_MNGRS
    declare -i NUM_MNGRS
    declare -i COMMA=0

    # Distinct manager names, sorted
    ARR_MNGRS=($(cut -d'|' -f1 details.txt | sort -u))
    NUM_MNGRS=${#ARR_MNGRS[@]}

    for ((i=0; i<NUM_MNGRS; i++))
    do
        printf "%s|" "${ARR_MNGRS[$i]}"

        # Scan the whole file for this manager's engineers
        while IFS= read -r LINE
        do
            if [[ ${LINE%%|*} = "${ARR_MNGRS[$i]}" ]]
            then
                if [[ $COMMA -eq 0 ]]
                then
                    printf "%s" "${LINE##*|}"   # first engineer: no leading comma
                    COMMA=1
                else
                    printf ",%s" "${LINE##*|}"
                fi
            fi
        done < details.txt

        echo ""  # needed for the newline
        COMMA=0
    done

Unknown said...

@Nathan, thanks for the pure bash solution. Really useful.

Another alternative using awk:

    $ awk 'BEGIN {FS=OFS="|"} {
        arr[$1] = ($1 in arr) ? arr[$1] "," $2 : $2
    }
    END {
        for (i in arr)
            print i, arr[i]
    }' details.txt

    Manager1|sw1,sw4,sw2,sw0
    Manager2|sw9,sw12
    Manager3|sw5

Anirudh said...

Here's a pure sed solution to the managers problem.
Note: I am using the `y///` command since POSIX sed doesn't support the `[^\n]` syntax.

    cat - <<__DATA__ |
    Manager1|team1|sw1
    Manager3|team4|sw5
    Manager1|team2|sw4
    Manager2|team5|sw9
    Manager2|team5|sw12
    Manager1|team3|sw2
    Manager1|team2|sw0
    __DATA__
    sed -e '
    G

    y/\n_/_\n/

    s/^\([^|_]*\)|\([^|_]*\)|\([^|_]*\)_\1|\2|\([^|_]*\)/\1|\2|\4,\3/;ta
    s/^\([^|_]*\)|\([^|_]*\)|\([^|_]*\)_\(.*_\)\1|\2|\([^|_]*\)/\4\1|\2|\5,\3/;ta

    s/_$//
    s/^\([^_]*\)_\(.*\)/\2_\1/

    :a
    y/\n_/_\n/
    h;$!d
    '