Sunday, April 12, 2009

Sort and then diff two files - bash


e.g.

$ cat file1.dat
342
123
123
789
120

$ cat file2.dat
342
120
900
789
123
123

Now normal diff:

$ diff file1.dat file2.dat
1a2,4
> 120
> 900
> 789
4,5d6
< 789
< 120

Requirement: Sort both the files and take diff of the sorted outputs.

Normal approach would have been:

Way1:
$ sort file1.dat -o file1.dat.srt
$ sort file2.dat -o file2.dat.srt

And then, diffing the sorted output files.
$ diff file1.dat.srt file2.dat.srt
5a6
> 900

Way2:
Sort only one file. say

$ sort file2.dat -o file2.dat.srt

And then
$ sort file1.dat | diff - file2.dat.srt

Way 3: Using bash process substitution technique*
Other way of doing this without creating those temporary sorted files.

$ diff <(sort file1.dat) <(sort file2.dat)
5a6
> 900

* Bash process substitution:
Using this technique output of process (or processes) can be feed to stdin of another process (piping technique only allows stdout of one single command to another command stdin)

e.g.

$ diff <(command1) <(command2)
will show difference of command1 and command2 output.

More about bash process substitution can be read from here

2 comments:

Custos said...

I'm stuck on a small project could you pls help me process large CSV's using Gnu sed, diff, uniq, paste, etc outside bash (i'm in a windows env, but if necessary i can install win bash)-

I have multiple large (sometimes 1Gb each CSV files that look like-

File1:

ESu1 Date Time,ESu1 Open,ESu1 Volume,ESu1 SynthVIX,ESu1 VWAP
7/22/2011 15:14,1340.75,5090,0.111752652,1339.303711
7/22/2011 15:13,1341,2387,0.111752652,1339.296875

File 2

HGU1 Date Time,HGU1 Open,HGU1 Volume,HGU1 SynthVIX,HGU1 VWAP
7/22/2011 15:14,4.406000137,6,0.124712922,4.40087986
7/22/2011 15:13,4.407999992,4,0.079364739,4.400878429

Each file's 1st column is "Date Time" column where the column data is like: 7/22/2011 15:10

The "Date Time" column is to be used as an index to match up all 13+ files and produce a "Master" File (if you will) with only 1 "Date Time" column (preferable 1st col)

The issue is all files have some "Date Time" data missing, so 10 might have "7/22/2011 15:03" but other wont. So that data needs to be ignored.

Only data that has the same "Date Time" cell values in All files must be used to produce the Master File.

It seems one could use sed to compare col1 of each file and then discard the non missing rows and join the rest using paste etc.

Could you pls help me?

Any and all help will be appreciated.

Jadu Saikia said...

@Custos, definitely I will try to help you here. A quick info that I would require:

Assuming these are the two files:

File1:

ESu1 Date Time,ESu1 Open,ESu1 Volume,ESu1 SynthVIX,ESu1 VWAP
7/22/2011 15:14,1340.75,5090,0.111752652,1339.303711
7/22/2011 15:13,1341,2387,0.111752652,1339.296875

File 2

HGU1 Date Time,HGU1 Open,HGU1 Volume,HGU1 SynthVIX,HGU1 VWAP
7/22/2011 15:14,4.406000137,6,0.124712922,4.40087986
7/22/2011 15:13,4.407999992,4,0.079364739,4.400878429

what's your expected output ?

© Jadu Saikia www.UNIXCL.com