Monday, March 17, 2008

Remove duplicates based on fields - AWK


$ cat file.txt
DD:12:A
AA:11:N
EE:13:B
AA:11:F
BB:09:K
DD:13:X

# Based on the first field. Duplicate keys are DD and AA; only the first line seen for each key is kept.
$ awk '!x[$1]++' FS=":" file.txt
DD:12:A
AA:11:N
EE:13:B
BB:09:K
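
How the one-liner works: x[$1] starts out empty (zero), so !x[$1]++ is true the first time a key appears and false afterwards, and a true pattern with no action prints the line. A spelled-out sketch of the same idea follows; the array name seen, the variable names and the temporary file are only illustrative, and -F: is used in place of FS=":" (equivalent here).

```shell
# Recreate the sample input in a temporary file (the path is arbitrary).
tmp=$(mktemp)
printf '%s\n' 'DD:12:A' 'AA:11:N' 'EE:13:B' 'AA:11:F' 'BB:09:K' 'DD:13:X' > "$tmp"

# Long form of: awk '!x[$1]++' FS=":" file
# seen[] counts how often each first field has appeared so far.
first_field_unique=$(awk -F: '
{
    if (seen[$1] == 0)   # first time this key shows up
        print            # keep the line
    seen[$1]++           # remember the key for later lines
}' "$tmp")

printf '%s\n' "$first_field_unique"
rm -f "$tmp"
```

Because the array is filled as the file is read, the original line order is preserved and no sort is needed.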

Again, with the same input:

$ cat file.txt
DD:12:A
AA:11:N
EE:13:B
AA:11:F
BB:09:K
DD:13:X


# This time, based on the first and second fields. The only duplicate combination is (AA:11).
$ awk '!x[$1,$2]++' FS=":" file.txt
DD:12:A
AA:11:N
EE:13:B
BB:09:K
DD:13:X
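
A note on the two-field key: awk stores x[$1,$2] under the single string $1 SUBSEP $2, so the pair behaves as one combined key. As a rough check, dropping the "!" inverts the test and lists the lines that the command above would discard. The variable names and temporary file below are only illustrative.

```shell
# Sample input from the post, written to a temporary file.
tmp=$(mktemp)
printf '%s\n' 'DD:12:A' 'AA:11:N' 'EE:13:B' 'AA:11:F' 'BB:09:K' 'DD:13:X' > "$tmp"

# Without the "!", the pattern is true only from the second occurrence
# of a (field 1, field 2) pair onward, so only the discarded lines print.
dropped=$(awk -F: 'x[$1,$2]++' "$tmp")

printf '%s\n' "$dropped"   # prints: AA:11:F
rm -f "$tmp"
```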

Related post:

Remove duplicates without sorting file.

4 comments:

Abdel Ra'ouf said...

Hello there,

Any idea how to replace duplicate lines with blank lines instead of deleting them?
e.g.

Input:
test1
test1
test2
test2
test2
test3

Output:
test1

test2


test3

Thanks in advance.

Jadu Saikia said...

@Abdel: you can do something like this:

$ awk 'x[$0]++ {$0=""} {print}' new.txt

Please let me know if that works. Thanks.
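
For anyone who wants to try it, here is a small sketch that runs an equivalent, more spelled-out form of that command on the sample input from the question. The array name seen, the variable names and the temporary file are only illustrative.

```shell
# Sample input from the question, written to a temporary file.
tmp=$(mktemp)
printf '%s\n' test1 test1 test2 test2 test2 test3 > "$tmp"

# seen[$0]++ is false for the first copy of a line and true for repeats.
# Repeats are replaced by an empty line; everything else prints unchanged.
blanked=$(awk 'seen[$0]++ { print ""; next } { print }' "$tmp")

printf '%s\n' "$blanked"
rm -f "$tmp"
```

With this input the result is test1, one blank line, test2, two blank lines, then test3, which matches the requested output.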

Thais said...

Hi,
Do you know how I can exclude duplicate cases based on another column?
Duplicates are located in column 4, and I would like to exclude the duplicate cases where column 2 and column 3 are not the same.

Input:
file 10 --- rs11511647 NA 62766
file 10 10 rs11511647 NA 62766
file 5 --- rs22334455 NA 63767
file 5 --- rs12354678 NA 63768
file 5 5 rs12354678 NA 63768

Desired output:

file 10 10 rs11511647 NA 62766
file 5 --- rs22334455 NA 63767
file 5 5 rs12354678 NA 63768

Thanks,

Thais

Thais said...

Hi there,

I would like to exclude duplicates based on the third column. Duplicates are located in column 3, and I would like to keep the duplicate cases that have column2=column3.

Input file
file 10 --- rs11511647 NA 62766
file 10 10 rs11511647 NA 62766
file 5 --- rs22334455 NA 63767
file 5 --- rs12354678 NA 63768
file 5 5 rs12354678 NA 63768

Desired output file

file 10 10 rs11511647 NA 62766
file 5 --- rs22334455 NA 63767
file 5 5 rs12354678 NA 63768

Thanks in advance,

Thais
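
One possible approach to the question above, offered as a sketch rather than a tested answer from the author: read the file twice, count each identifier on the first pass, and on the second pass keep a row if its identifier is unique or if its second and third fields match. This assumes whitespace-separated fields as awk counts them, so the repeated rs... identifier in the sample rows is field 4. It also assumes every duplicated group has a row where fields 2 and 3 agree; if several rows in a group agree, all of them are kept. The variable names and temporary file are only illustrative.

```shell
# Sample rows from the comment, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
file 10 --- rs11511647 NA 62766
file 10 10 rs11511647 NA 62766
file 5 --- rs22334455 NA 63767
file 5 --- rs12354678 NA 63768
file 5 5 rs12354678 NA 63768
EOF

# Pass 1 (NR == FNR): count how many rows share each field-4 value.
# Pass 2: print rows whose field 4 is unique, or whose fields 2 and 3 match.
filtered=$(awk '
NR == FNR { count[$4]++; next }
count[$4] == 1 || $2 == $3
' "$tmp" "$tmp")

printf '%s\n' "$filtered"
rm -f "$tmp"
```

On the sample rows this produces the three lines listed under the desired output.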

© Jadu Saikia www.UNIXCL.com