Friday, March 7, 2008

Remove duplicates without sorting file - BASH


Usually, when we have to remove duplicate entries from a file, we sort the entries and then eliminate the duplicates with the "uniq" command.


But if we have to remove the duplicates while preserving the original order of occurrence of the entries, here is the way:

Sample file:
$ cat file3
AAAA
FFFF
BBBB
BBBB
CCCC
AAAA
FFFF
DDDD

- uniq without sorting removes only consecutive duplicates, so not all of them are eliminated:
$ uniq file3
AAAA
FFFF
BBBB
CCCC
AAAA
FFFF
DDDD

- sort provides an option (-u) to sort and remove duplicates in one step, but we lose the original order of occurrence of the entries:
$ sort -u file3
AAAA
BBBB
CCCC
DDDD
FFFF

- sort followed by uniq also removes the duplicates, but again the original order is lost (same output as above):
$ sort file3 | uniq
AAAA
BBBB
CCCC
DDDD
FFFF

- here is the solution using awk, which keeps only the first occurrence of each line:
$ awk ' !x[$0]++' file3
AAAA
FFFF
BBBB
CCCC
DDDD
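
How it works: awk keeps an associative array x keyed by the whole line ($0). The first time a line appears, x[$0] is 0, so the pattern !x[$0] is true and awk's default action fires, printing the line; the post-increment then records the line as seen, so later occurrences are skipped. The same logic written out longhand (the array name x is arbitrary):

```shell
# Longhand equivalent of: awk '!x[$0]++' file3
awk '{
    if (x[$0] == 0) {   # first occurrence: counter is still zero
        print            # print the line (awk default action)
    }
    x[$0]++              # remember that we have seen this line
}' file3
```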

26 comments:

VeRTiTO said...

nice and simply explained post, keep up the nice linux blog posting!

Anonymous said...

Just what I was looking for. Worked perfect on the file I had to remove dupes from. I have no idea what the awk stuff means :p but it works Great.

Anonymous said...

what I was looking for.
Thanks :)

Anonymous said...

I like your solution!

It's very simple and minimal!

Unknown said...

@VeRTiTO, Jelloir, Sangram, Praveen, thank you all for commenting your views on this post. Keep visiting.

ijjjj said...

Just what I needed, huge thanks!

Ram said...

did not work with file where some lines had leading tabs

Unknown said...

@Ram, could you please post an example of what you tried? Thanks

David L said...

how would I export the result to a new file? I tried
$ awk' !x[$0]++' file3 > file4

but file4 remains empty?

David L said...

Hi Jadu,

on the file test.tex

the command
$ awk ' !x[$0]++' test.tex

removes both lines for some of the duplicates

do you know why?

Thanks,

David

Unknown said...

@David, thanks for the comment.

I tried the same awk line with your 'test.tex' and it works fine for me. Exporting the results to a file works as well.

Which OS are you on? Could you please try with nawk or /usr/xpg4/bin/awk.

David L said...

Hi Jaidu,

You are right - the command does what you say it will do - but not what I wanted it to do which was to only remove duplicate lines but not lines that appear > twice.

Is there a way to remove only lines that appear twice, but not lines that appear more than twice?

For some background, I have exported tables from phpmyadmin to LaTeX, and phpmyadmin exports duplicate tables. So I want to remove the duplicate information but not the duplicate formatting lines.

I don't want to do it by hand because I want to be able to automatically update the file. I'll either have figure out how to get phpmyadmin not to output duplicate tables when exporting to LaTeX, or export to a spreadsheet and then convert it to LaTeX.

In any case, thanks for your help!

fyi I am using Ubuntu 10.04 and nawk gives the same results as awk.

Unknown said...

@David, I have put a python script on my python starter blog which removes lines that appear exactly twice in a file, please find it here:

http://pythonstarter.blogspot.com/2010/07/python-remove-duplicate-lines-from-file.html
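
The same idea can also be sketched in awk itself with two passes over the file (the file name test.tex here stands in for your actual file; c and s are arbitrary array names):

```shell
# Pass 1 (NR==FNR) counts every line; pass 2 then prints:
#  - all occurrences of lines seen != 2 times (formatting lines stay intact)
#  - only the first occurrence of lines seen exactly twice
awk 'NR==FNR { c[$0]++; next }
     c[$0] != 2 || !s[$0]++' test.tex test.tex
```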

Sandeep Singh Bisht said...

awk trick is really awesome...thanks man

pavanDurga said...

I am getting following error while using this syntax

awk `'!x[$0]++'` trail.txt
!x[$0]++: not found
awk: syntax error near line 1
awk: bailing out near line 1

can you please help me on this

Unknown said...

@pavanDurga : Thanks for the question.

What OS are you on? Is it Solaris? Please try using nawk or /usr/xpg4/bin/awk.

senthil babu said...

Please use single quote(') instead of using `

Unknown said...

@senthil babu, hey thanks for catching this. I just missed that.

@pavanDurga, could you please try the same? Thanks.

lotto said...

This is exactly what i was looking for. Many thanks!

comrad said...

Was exactly what I was looking for...
But why it works??? :-O

Suresh said...

Elegant! In the spirit of "teach a man to fish", can you please explain how this awk one-liner works?

Thanks,

- Suresh

Unknown said...

@Suresh, Peteris has done a great job explaining most of the awk one-liners, and he has explained this one in detail here:

http://www.catonmat.net/blog/awk-one-liners-explained-part-two/

Section:
43. Remove duplicate, nonconsecutive lines.

Hope this helps.

Ioan (Nini) Indreias said...

And this is a small fix to ignore the "case":
awk '!x[toupper($0)]++'

Anonymous said...

Thanks for sharing!

© Jadu Saikia www.UNIXCL.com