Wednesday, February 10, 2010

Bash - find diff based on a field


Input files:

$ cat file1.txt
timestamp,records,attempts
1264723200,96,6117
1264809600,856,7298
1264896000,904,8628
1264982400,0,0
1265155200,644,1786

$ cat file2.txt
timestamp,records,attempts
1264723200,96,6117
1264809600,859,7298
1264896000,904,8628
1264982400,0,0
1265155200,641,1786

Required:

If file1.txt and file2.txt differ in the 2nd field (the 'records' field) on any line, print "BAD"; otherwise print "GOOD". Both files are sorted on the first field.

Since the two files are already sorted on the first field (the timestamp), we can use the join command to pair up the lines of the two files, and then use awk to compare the relevant fields of the joined output.

$ join -t, -1 1 -2 1 file1.txt file2.txt

Output:
timestamp,records,attempts,records,attempts
1264723200,96,6117,96,6117
1264809600,856,7298,859,7298
1264896000,904,8628,904,8628
1264982400,0,0,0,0
1265155200,644,1786,641,1786

To print the lines for which the 2nd and 4th fields of the joined output do not match:

$ join -t, -1 1 -2 1 file1.txt file2.txt | awk '$2 != $4' FS=\,
1264809600,856,7298,859,7298
1265155200,644,1786,641,1786
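One caveat: by default, join silently drops lines whose key appears in only one of the files. If the two files might not cover exactly the same set of timestamps, GNU join's -v option can be used as a sanity check, since it prints the unpairable lines from either file. A sketch (for the sample files above it prints nothing, because every timestamp appears in both):

```shell
# Recreate the sample inputs from the post
printf '%s\n' 'timestamp,records,attempts' '1264723200,96,6117' \
    '1264809600,856,7298' '1264896000,904,8628' '1264982400,0,0' \
    '1265155200,644,1786' > file1.txt
printf '%s\n' 'timestamp,records,attempts' '1264723200,96,6117' \
    '1264809600,859,7298' '1264896000,904,8628' '1264982400,0,0' \
    '1265155200,641,1786' > file2.txt

# -v FILENUM prints the lines of that file that have no match in the
# other file; empty output means both files cover the same timestamps.
join -t, -v1 -v2 file1.txt file2.txt
```

Non-empty output here would mean the mismatch counts below only cover the timestamps common to both files.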

How many such lines are there?

$ join -t, -1 1 -2 1 file1.txt file2.txt | awk '$2 != $4 {++c} END {print c}' FS=\,
2
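If you prefer to skip join altogether, the same count can be computed in a single awk pass: read file1.txt first, remember each timestamp's records value, then compare while reading file2.txt. A sketch, assuming (as above) that the files share the same set of timestamps:

```shell
# Recreate the sample inputs from the post
printf '%s\n' 'timestamp,records,attempts' '1264723200,96,6117' \
    '1264809600,856,7298' '1264896000,904,8628' '1264982400,0,0' \
    '1265155200,644,1786' > file1.txt
printf '%s\n' 'timestamp,records,attempts' '1264723200,96,6117' \
    '1264809600,859,7298' '1264896000,904,8628' '1264982400,0,0' \
    '1265155200,641,1786' > file2.txt

# NR==FNR is true only while reading the first file: store its 2nd
# field keyed by timestamp. While reading the second file, count the
# timestamps whose 2nd field differs. c+0 prints 0 (not an empty
# line) when there are no mismatches.
awk -F, 'NR == FNR { rec[$1] = $2; next }
         $1 in rec && rec[$1] != $2 { c++ }
         END { print c+0 }' file1.txt file2.txt
```

For the sample data this prints 2, matching the join-based count above.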

Alternatively, you can use 'wc -l' to get the count:

$ join -t, -1 1 -2 1 file1.txt file2.txt | awk '$2 != $4' FS=\, | wc -l
2

Now, based on the above count, you can decide whether the two files differ in the 2nd field (records):

$ unmatched_count=$(join -t, -1 1 -2 1 file1.txt file2.txt | awk '$2 != $4 {++c} END {print c}' FS=\,)
$ [ -z "$unmatched_count" ] && echo "GOOD" || echo "BAD"

Output:
BAD

or

$ unmatched_count=$(join -t, -1 1 -2 1 file1.txt file2.txt | awk '$2 != $4' FS=\, | wc -l)
$ [ "$unmatched_count" -eq 0 ] && echo "GOOD" || echo "BAD"

Output:
BAD
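Since both files are sorted on the first field and (for this data) contain the same timestamps, yet another way to get the GOOD/BAD verdict is to extract the 2nd column from each file and compare the two streams directly. This sketch uses bash process substitution, so it will not work in a plain POSIX sh:

```shell
# Recreate the sample inputs from the post
printf '%s\n' 'timestamp,records,attempts' '1264723200,96,6117' \
    '1264809600,856,7298' '1264896000,904,8628' '1264982400,0,0' \
    '1265155200,644,1786' > file1.txt
printf '%s\n' 'timestamp,records,attempts' '1264723200,96,6117' \
    '1264809600,859,7298' '1264896000,904,8628' '1264982400,0,0' \
    '1265155200,641,1786' > file2.txt

# cmp -s is silent and exits 0 only when the two extracted
# 'records' columns are byte-for-byte identical.
if cmp -s <(cut -d, -f2 file1.txt) <(cut -d, -f2 file2.txt); then
    echo "GOOD"
else
    echo "BAD"
fi
```

For the sample files this prints BAD. Note that this approach only works when the two files have identical timestamp sets in identical order.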

Related posts:

- Bash join command explained
- Perform outer join using awk
- Performing join using awk
- Search and print output using awk

7 comments:

Derek Evan Schrock said...

If you don't need to store the count.

join -t, -1 1 -2 1 1.txt 2.txt | awk '$2 != $4 { exit 1 }' FS=\, && echo GOOD || echo BAD

Jadu Saikia said...

@Derek, thanks for the solution.

KHUNGBO said...

Hi Jadu Saikia,
I have a problem with Checkpoint logs. Can you help me?
Example:
logA.csv: num;date;time;dst;src;service...
logB.csv: num;date;time;src;service;dst...
logC.csv: num;date;time;src;service;dst...
File log.csv contains:
num;date;time;src;dst;service
001;01march2010;00:01:11;1.1.1.1;2.2.2.2;137
...
...

I want to filter out the important fields from these csv files.
How can I do that?
Thanks for helping me!
Can you give me your email?

Jadu Saikia said...

@KHUNGBO, thanks for your comment. It would be very helpful if you put some examples of your expected output (as you mentioned : I want to filter some fields which is important from csv files.)... Please post your input and your expected output. I would definitely help you on this. Thanks.

KHUNGBO said...

Hi Jadu Saikia,
log1.csv
num;date;time;action;src;dst;proto;service;s_port;d_port;more_information
000;01032010;00:00:01;accept;tcp;1.1.1.1;2.2.2.2;tcp;http;33232;80;more_information
001;01032010;00:00:02;accept;tcp;1.1.1.1;3.3.3.3;tcp;http;33233;80;more_information
002;01032010;00:00:04;accept;tcp;1.1.1.1;2.2.2.2;tcp;http;33232;80;more_information
...

log2.csv
num;date;time;action;proto;service;src;dst;s_port;d_port;more_information
000;01032010;01:00:01;accept;tcp;http;1.1.1.2;3.3.3.3;33232;80;more_information
001;01032010;01:00:02;accept;tcp;http;1.1.1.2;4.4.4.4;33233;80;more_information
002;01032010;01:00:04;accept;tcp;http;1.1.1.2;5.4.3.2;33232;80;more_information

log3.csv
num;date;time;action;proto;src;dst;service;s_port;d_port;more_information
001;01032010;02:00:01;accept;tcp;1.1.1.2;3.3.3.3;http;33232;80;more_information
002;01032010;02:00:02;accept;tcp;1.1.1.2;4.4.4.4;http;33233;80;more_information
003;01032010;02:00:04;accept;tcp;1.1.1.2;5.4.3.2;http;33232;80;more_information

Notice that the field order is inconsistent across these files.
I want to create a csv file containing only the following fields from those csv files:
num;date;time;action;src;dst;service
The other fields are not needed.
The num field should be a new field, regenerated after sorting by date and time.

Thanks for your help!
My email: khungbo33@gmail.com

Jadu Saikia said...

@KHUNGBO,

Let me know if this helps:

$ awk -v colseq="num;date;time;action;src;dst;service" '
BEGIN {
    FS = OFS = ";"
    colsN = split(colseq, colsA, FS)
}
FNR == 1 {
    for (i = 1; i <= NF; i++)
        headA[$i] = i
    print colseq
    next
}
{
    for (i = 1; i <= colsN; i++)
        printf("%s%s", $headA[colsA[i]], (i == colsN) ? ORS : OFS)
}' log*.csv

Output:
num;date;time;action;src;dst;service
000;01032010;00:00:01;accept;tcp;1.1.1.1;tcp
001;01032010;00:00:02;accept;tcp;1.1.1.1;tcp
002;01032010;00:00:04;accept;tcp;1.1.1.1;tcp
num;date;time;action;src;dst;service
000;01032010;01:00:01;accept;1.1.1.2;3.3.3.3;http
001;01032010;01:00:02;accept;1.1.1.2;4.4.4.4;http
002;01032010;01:00:04;accept;1.1.1.2;5.4.3.2;http
num;date;time;action;src;dst;service
001;01032010;02:00:01;accept;1.1.1.2;3.3.3.3;http
002;01032010;02:00:02;accept;1.1.1.2;4.4.4.4;http
003;01032010;02:00:04;accept;1.1.1.2;5.4.3.2;http

You might also find these two post useful:

http://unstableme.blogspot.com/2008/10/find-column-number-based-on-field.html

http://unstableme.blogspot.com/2008/03/print-field-based-on-header-awk.html

Keep in touch. Thanks. // Jadu

KHUNGBO said...

Thanks a lot, Jadu Saikia.
I'll try it!

© Jadu Saikia www.UNIXCL.com