Usually, when we have to remove duplicate entries from a file, we sort the entries and then eliminate the duplicates using the "uniq" command.
But if we have to remove the duplicates while preserving the original order of occurrence of the entries, here is the way:
Sample file:
$ cat file3
AAAA
FFFF
BBBB
BBBB
CCCC
AAAA
FFFF
DDDD
- uniq without sorting will not remove all duplicates, because uniq only removes adjacent duplicate lines:
$ uniq file3
AAAA
FFFF
BBBB
CCCC
AAAA
FFFF
DDDD
- sort provides an option (-u) to sort and remove duplicates in one step, but we lose the original order of occurrence of the entries:
$ sort -u file3
AAAA
BBBB
CCCC
DDDD
FFFF
- sort followed by uniq also removes the duplicates, but the order is lost, same as above:
$ sort file3 | uniq
AAAA
BBBB
CCCC
DDDD
FFFF
- here is the solution using awk:
$ awk '!x[$0]++' file3
AAAA
FFFF
BBBB
CCCC
DDDD
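- why it works: x is an associative array keyed by the whole line ($0). On a line's first appearance x[$0] is 0, so !x[$0] is true and awk performs its default action, printing the line; the ++ then increments the count, so every later copy of that line evaluates false and is skipped. A more verbose rewrite of the same logic, for illustration:
$ awk '{
    # print only the first time we see this exact line
    if (x[$0] == 0) print
    # remember the line so later copies are skipped
    x[$0]++
  }' file3
AAAA
FFFF
BBBB
CCCC
DDDD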
26 comments:
Nice and simply explained post, keep up the good Linux blogging!
Just what I was looking for. Worked perfectly on the file I had to remove dupes from. I have no idea what the awk stuff means :p but it works great.
What I was looking for.
Thanks :)
I like your solution!
It's very simple and minimal!
@VeRTiTO, Jelloir, Sangram, Praveen, thank you all for commenting your views on this post. Keep visiting.
Just what I needed, huge thanks!
Did not work with a file where some lines had leading tabs.
@Ram, could you please post an example that you have tried? Thanks.
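If the problem is that lines differing only in leading tabs or spaces are treated as distinct, that is expected: awk compares $0 byte for byte. A possible workaround (a sketch, assuming the intent is to treat such lines as duplicates) is to strip leading whitespace from the array key while still printing the original line:
$ awk '{ key = $0; sub(/^[ \t]+/, "", key); if (!x[key]++) print }' file3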
How would I export the result to a new file? I tried
$ awk' !x[$0]++' file3 > file4
but file4 remains empty?
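A likely culprit in the command above: there is no space between awk and the opening quote, so the shell looks for a command literally named awk !x[$0]++, fails, and leaves behind the empty file4 it had already created for the redirection. With the space restored, the redirection works as expected:
$ awk '!x[$0]++' file3 > file4
$ cat file4
AAAA
FFFF
BBBB
CCCC
DDDD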
Hi Jadu,
On the file test.tex, the command
$ awk '!x[$0]++' test.tex
removes both lines for some of the duplicates. Do you know why?
Thanks,
David
@David, thanks for the comment.
I tried the same awk line with your 'test.tex' and it works fine for me. Exporting the results to a file also works.
Which OS are you on? Could you please try with nawk or /usr/xpg4/bin/awk?
Hi Jadu,
You are right, the command does what you say it will do, but not what I wanted it to do, which was to remove duplicate lines but not lines that appear more than twice.
Is there a way to remove only lines that appear twice, but not lines that appear more than twice?
For some background, I have exported tables from phpmyadmin to LaTeX, and phpmyadmin exports duplicate tables. So I want to remove the duplicate information but not the duplicate formatting lines.
I don't want to do it by hand because I want to be able to automatically update the file. I'll either have to figure out how to get phpmyadmin not to output duplicate tables when exporting to LaTeX, or export to a spreadsheet and then convert it to LaTeX.
In any case, thanks for your help!
FYI, I am using Ubuntu 10.04, and nawk gives the same results as awk.
@David, I have put a Python script on my Python starter blog which removes duplicate lines that appear exactly twice in a file; please find it here:
http://pythonstarter.blogspot.com/2010/07/python-remove-duplicate-lines-from-file.html
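An awk-only alternative is also possible, reading the file twice (a sketch: the first pass counts occurrences, the second prints every line except the repeat copy of lines that occur exactly twice):
$ awk 'NR==FNR { c[$0]++; next } c[$0] != 2 || !seen[$0]++' test.tex test.tex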
awk trick is really awesome... thanks man :)
I am getting the following error while using this syntax:
awk `'!x[$0]++'` trail.txt
!x[$0]++: not found
awk: syntax error near line 1
awk: bailing out near line 1
Can you please help me with this?
@pavanDurga: Thanks for the question.
What OS are you on? Is it Solaris? Please try using nawk or /usr/xpg4/bin/awk.
Please use a single quote (') instead of a backtick (`).
@senthil babu, hey thanks for catching this. I just missed that.
@pavanDurga, could you please try the same? Thanks.
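For reference, the backticks make the shell attempt command substitution on the pattern before awk ever runs, which matches the errors shown above. The corrected invocation on the same file:
$ awk '!x[$0]++' trail.txt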
This is exactly what I was looking for. Many thanks!
Was exactly what I was looking for...
But why does it work??? :-O
Elegant! In the spirit of "teach a man to fish", can you please explain how this awk one-liner works?
Thanks,
- Suresh
@Suresh, Peteris has done a great job of explaining most of the awk one-liners, and he has explained this one in detail as well, here:
http://www.catonmat.net/blog/awk-one-liners-explained-part-two/
Section:
43. Remove duplicate, nonconsecutive lines.
Hope this helps.
And this is a small fix to make the comparison ignore case:
awk '!x[toupper($0)]++'
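For example, with some hypothetical mixed-case input (not from the original post), only the first occurrence of each line survives, compared case-insensitively:
$ printf 'aaaa\nAAAA\nbbbb\n' | awk '!x[toupper($0)]++'
aaaa
bbbb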
Thanks for sharing!