Friday, March 7, 2008

Remove duplicates from a file without sorting - BASH


Usually, when we have to remove duplicate entries from a file, we sort the entries and then eliminate the duplicates using the "uniq" command.


But if we have to remove the duplicates while preserving the original order in which the entries occur, here is the way:

Sample file:
$ cat file3
AAAA
FFFF
BBBB
BBBB
CCCC
AAAA
FFFF
DDDD

- uniq without sorting will not remove all duplicates, because uniq only collapses adjacent duplicate lines:
$ uniq file3
AAAA
FFFF
BBBB
CCCC
AAAA
FFFF
DDDD

- sort gives an option (-u) to sort and remove duplicates in one step, but we lose the original order of occurrence of the entries:
$ sort -u file3
AAAA
BBBB
CCCC
DDDD
FFFF

- sort and then uniq also removes the duplicates, but again the original order is lost (same output as above):
$ sort file3 | uniq
AAAA
BBBB
CCCC
DDDD
FFFF

- here is the solution using AWK:
$ awk '!x[$0]++' file3
AAAA
FFFF
BBBB
CCCC
DDDD
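
How it works: x is an associative array keyed by the whole input line ($0). The first time a line appears, x[$0] is 0, so the pattern !x[$0] is true and awk performs its default action of printing the line; the post-increment then marks the line as seen, so every later copy evaluates to false and is skipped. A longhand equivalent of the same idea (just a sketch, using a hypothetical array name "seen"):

$ awk '{ if (!seen[$0]) { print; seen[$0] = 1 } }' file3

This produces the same five lines, in the same order, as the one-liner above.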

24 comments:

VeRTiTO said...

Nice and simply explained post, keep up the nice Linux blogging!

Jelloir said...

Just what I was looking for. It worked perfectly on the file I had to remove dupes from. I have no idea what the awk stuff means :p but it works great.

Sangram said...

Just what I was looking for.
Thanks :)

Praveen Puri said...

I like your solution!

It's very simple and minimal!

Jadu Kumar Saikia said...

@VeRTiTO, Jelloir, Sangram, Praveen, thank you all for commenting your views on this post. Keep visiting.

Andreas said...

Just what I needed, huge thanks!

Ram said...

It did not work on a file where some lines had leading tabs.

Jadu Saikia said...

@Ram, could you please post an example of what you tried? Thanks.

David said...

How would I export the result to a new file? I tried:
$ awk' !x[$0]++' file3 > file4

but file4 remains empty?

David said...

Hi Jadu,

On the file test.tex

the command
$ awk ' !x[$0]++' test.tex

removes both lines for some of the duplicates.

Do you know why?

Thanks,

David

Jadu Saikia said...

@David, thanks for the comment.

I tried the same awk line with your 'test.tex' and it works fine for me. Exporting the results to a file also works.

One thing I notice in the command you posted earlier: there is no space between awk and the opening quote (awk' !x[$0]++'), so the shell cannot find the awk command at all, while the redirection still creates an empty file4. Adding the space should fix that.

Which OS are you on? Could you please try nawk or /usr/xpg4/bin/awk.

David said...

Hi Jadu,

You are right - the command does what you say it will do - but not what I wanted it to do, which was to remove only the duplicate lines, not lines that appear more than twice.

Is there a way to remove only lines that appear twice, but not lines that appear more than twice?

For some background, I have exported tables from phpmyadmin to LaTeX, and phpmyadmin exports duplicate tables. So I want to remove the duplicate information but not the duplicate formatting lines.

I don't want to do it by hand because I want to be able to automatically update the file. I'll either have to figure out how to get phpmyadmin not to output duplicate tables when exporting to LaTeX, or export to a spreadsheet and then convert it to LaTeX.

In any case, thanks for your help!

FYI, I am using Ubuntu 10.04, and nawk gives the same results as awk.

Jadu Saikia said...

@David, I have put a python script on my python starter blog which removes lines that appear exactly twice in a file; please find it here:

http://pythonstarter.blogspot.com/2010/07/python-remove-duplicate-lines-from-file.html
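
For the awk-inclined, a rough two-pass sketch of the same idea (an assumption on my part: you want to keep one copy of each line that occurs exactly twice, while leaving lines that occur more often untouched; cnt and seen are hypothetical array names):

$ awk 'NR==FNR { cnt[$0]++; next } cnt[$0] != 2 || !seen[$0]++' test.tex test.tex

The file is named twice so awk reads it in two passes: NR==FNR holds only during the first pass, which counts every line; the second pass then prints a line if its total count is not two, or if it is the first occurrence of a line that occurs exactly twice.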

SANDEEP said...

The awk trick is really awesome... thanks man

pavanDurga said...

I am getting the following error while using this syntax:

awk `'!x[$0]++'` trail.txt
!x[$0]++: not found
awk: syntax error near line 1
awk: bailing out near line 1

Can you please help me with this?

Jadu Saikia said...

@pavanDurga : Thanks for the question.

Which OS are you on? Is it Solaris? Please try using nawk or /usr/xpg4/bin/awk.

senthil babu said...

Please use a single quote (') instead of a backtick (`).

Jadu Saikia said...

@senthil babu, hey thanks for catching this. I just missed that.

@pavanDurga, could you please try the same? Thanks.
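
For reference, the corrected command should look like this (on Solaris, substitute nawk or /usr/xpg4/bin/awk for awk):

$ awk '!x[$0]++' trail.txt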

lotto said...

This is exactly what i was looking for. Many thanks!

comrad said...

Exactly what I was looking for...
But why does it work??? :-O

Suresh said...

Elegant! In the spirit of "teach a man to fish", can you please explain how this awk one-liner works?

Thanks,

- Suresh

Jadu Saikia said...

@Suresh, Peteris has done a great job explaining most of the awk one-liners, and he has explained this one in detail here:

http://www.catonmat.net/blog/awk-one-liners-explained-part-two/

Section:
43. Remove duplicate, nonconsecutive lines.

Hope this helps.

Ioan (Nini) Indreias said...

And here is a small fix to make the comparison case-insensitive:
awk '!x[toupper($0)]++'
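
A quick illustration (the sample input here is made up):

$ printf 'aaaa\nAAAA\nbbbb\n' | awk '!x[toupper($0)]++'
aaaa
bbbb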

~christiank said...

Thanks for sharing!

© Jadu Saikia www.UNIXCL.com