Thursday, May 21, 2009

Remove duplicate consecutive fields or lines - awk


Input file:

$ cat details.txt
1|Manager1|sw1
2|Manager3|sw5
3|Manager1|sw4
4|Manager2|sw9
5|Manager2|sw12
6|Manager1|sw2
7|Manager1|sw0

Required: From each run of consecutive lines that share the same 2nd field, keep only the first line and remove the rest.

The following awk one-liner removes lines whose 2nd field repeats that of the previous line, keeping the first occurrence in each run:

$ awk '$2 != x; {x=$2}' FS=\| details.txt

o/p:

1|Manager1|sw1
2|Manager3|sw5
3|Manager1|sw4
4|Manager2|sw9
6|Manager1|sw2
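
The one-liner is compact, so a longer form may make the logic easier to follow. This is a minimal sketch, not the command from the post: the variable name prev and the explicit NR == 1 check are assumptions added for illustration, and the sample data is fed on standard input instead of being read from details.txt.

```shell
# Sample rows from details.txt, kept in a variable so the sketch runs on its own.
input='1|Manager1|sw1
2|Manager3|sw5
3|Manager1|sw4
4|Manager2|sw9
5|Manager2|sw12
6|Manager1|sw2
7|Manager1|sw0'

# "prev" is a hypothetical name for the 2nd field of the previous line.
result=$(printf '%s\n' "$input" | awk -F'|' '
    NR == 1 || $2 != prev { print }   # first line, or the 2nd field changed
    { prev = $2 }                     # remember the 2nd field for the next line
')

printf '%s\n' "$result"
```

Comparing with != treats the 2nd field as a plain string, so a value such as "Manager10" following "Manager1" still counts as a change.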


And to remove duplicates of the 2nd field even when they are not consecutive (keeping only the first occurrence of each value):

$ awk '!arr[$2]++' FS=\| details.txt

o/p:

1|Manager1|sw1
2|Manager3|sw5
4|Manager2|sw9

5 comments:

Nathan said...

And with bash, without awk:

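#!/bin/bash

declare DATA
declare -i I=1

# Strip the leading "id|" from every line, then let uniq drop consecutive
# lines whose first 10 characters (e.g. "Manager1|s") are identical.
DATA=$( (for LINE in $(cat details.txt)
do
    echo "${LINE#*|}"
done) | uniq --check-chars=10 )

# Renumber the surviving lines, starting from 1.
for LINE in $(echo "$DATA")
do
    echo "${I}|${LINE}"
    ((I++))
done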


Removing the duplicates is quite straightforward from here.

Jadu Saikia said...

@Nathan, the solution looks great.

If the size of the compared field is less consistent, we can pass a bigger value to "--check-chars", right?

Thanks for your comment/post.
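
When the field length varies, a fixed character count can become fragile. As one possible alternative, here is a minimal sketch in plain shell that compares the 2nd field itself. The variable names id, manager, rest and prev are assumptions chosen for illustration, the sample rows are piped in rather than read from details.txt, and unlike the script above it keeps the original first field instead of renumbering.

```shell
# Sample rows from details.txt, kept in a variable so the sketch runs on its own.
input='1|Manager1|sw1
2|Manager3|sw5
3|Manager1|sw4
4|Manager2|sw9
5|Manager2|sw12
6|Manager1|sw2
7|Manager1|sw0'

result=$(
    printf '%s\n' "$input" |
    {
        prev=
        # Split each line on "|"; "rest" receives everything after the 2nd field.
        while IFS='|' read -r id manager rest; do
            # Print only when the 2nd field differs from the previous line.
            if [ "$manager" != "$prev" ]; then
                printf '%s|%s|%s\n' "$id" "$manager" "$rest"
            fi
            prev=$manager
        done
    }
)

printf '%s\n' "$result"
```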

Tim said...

I'm trying to wrap my head around what the line "awk '!arr[$2]++' FS=\| details.txt" does.

I understand it is creating an array where the index is the 2nd field, delimited by the "|" character.

However, I don't understand the point of the bang, or the ++. I know what both do separately but can't figure out the logic of them in this line.

Any light shed on how this generates the output it does would be greatly appreciated.

Jadu Saikia said...

@Tim, thanks for the question. Even I had a tough time understanding it. Here is something from the notes I kept earlier on this:

It records each key it has seen (here, the 2nd field) in the array 'arr' (an associative array) and at the same time tests whether it has seen that key before.
If it has seen the key before, then arr[key] > 0 and !arr[key] == 0.
A pattern with no action prints the line only when the pattern evaluates to true:
false : no action, no print
true : {print}

In this example:

1|Manager1|sw1
2|Manager3|sw5
3|Manager1|sw4
4|Manager2|sw9
5|Manager2|sw12
6|Manager1|sw2
7|Manager1|sw0

awk '!arr[$2]++' FS=\| details.txt

When awk sees the first "Manager1", it evaluates "!arr["Manager1"]++".
arr["Manager1"] is not set yet, so it is 0 (false) and !arr["Manager1"] is true, so awk prints the whole line.
Then the ++ post-increment operator increments arr["Manager1"] by 1.
Array 'arr' now contains one value: arr["Manager1"] == 1

Then it sees "Manager3" for the first time, so the same happens as above.
Array 'arr' now contains two values: arr["Manager1"] == 1 and arr["Manager3"] == 1

Now awk sees the second "Manager1". This time arr["Manager1"] is 1 (true), so !arr["Manager1"] is false and the line is not printed.
Array 'arr' still contains two values: arr["Manager1"] == 2 and arr["Manager3"] == 1

The same applies to the other $2 values.
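
The walkthrough above can also be written out with an explicit if statement. This is only a sketch of what the one-liner is understood to do: it tests the counter first and increments it afterwards. The sample data is piped in on standard input here so that the snippet runs on its own.

```shell
# Sample rows from details.txt, kept in a variable so the sketch runs on its own.
input='1|Manager1|sw1
2|Manager3|sw5
3|Manager1|sw4
4|Manager2|sw9
5|Manager2|sw12
6|Manager1|sw2
7|Manager1|sw0'

result=$(printf '%s\n' "$input" | awk -F'|' '
{
    if (arr[$2] == 0) {   # an unset element compares equal to 0, so the key is new
        print             # print the whole line
    }
    arr[$2]++             # count this occurrence of the key
}')

printf '%s\n' "$result"
```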

Hope this helps. Keep in touch.

Bhargav L said...

Hi, thanks for the post.
Can you help me out with this? I have rows like:

50121 abc.com 28/1/2014-12:00:00
52111 xyz.com 27/1/2014-12:00:00
deusr abc.com 26/1/2014-12:00:00
50121 abc.com 26/1/2014-12:00:00
52111 abc.com 25/1/2014-12:00:00

Desired Output:
50121 abc.com 28/1/2014-12:00:00
52111 xyz.com 27/1/2014-12:00:00
deusr abc.com 26/1/2014-12:00:00
52111 abc.com 25/1/2014-12:00:00

I removed duplicates based on the first column and got this output:
50121 abc.com 28/1/2014-12:00:00
52111 xyz.com 27/1/2014-12:00:00
deusr abc.com 26/1/2014-12:00:00

But the issue is that I want to remove duplicates by comparing two columns for each row, i.e., the 1st and 2nd. I am trying to use the 'awk' command, but I am not getting it. Can you help me out with this, please?
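
One possible approach to the two-column case, offered as a sketch rather than a tested answer for the real data: use both fields together as the array key. The array name seen is an assumption chosen for illustration, the rows are assumed to be separated by whitespace, and the sample rows are the ones quoted above.

```shell
# Sample rows from the question, kept in a variable so the sketch runs on its own.
input='50121 abc.com 28/1/2014-12:00:00
52111 xyz.com 27/1/2014-12:00:00
deusr abc.com 26/1/2014-12:00:00
50121 abc.com 26/1/2014-12:00:00
52111 abc.com 25/1/2014-12:00:00'

# seen[$1, $2] joins the 1st and 2nd fields into one key, so a line is
# dropped only when both columns match an earlier line.
result=$(printf '%s\n' "$input" | awk '!seen[$1, $2]++')

printf '%s\n' "$result"
```

With the sample rows this keeps the first "50121 abc.com" line and drops the second, while "52111 xyz.com" and "52111 abc.com" both survive because their 2nd fields differ.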

© Jadu Saikia www.UNIXCL.com