Tuesday, October 12, 2010

Awk - Print particular instances of a file


If you know a better alternative to this problem (in any scripting language), please share it. Thanks in advance.

My input file has the following format:
- An instance is a combination of an 'h' line, a 'v' line and one or more 'i' lines.
- All lines starting with 'h' are header lines, and the third field in such a line is the 'header number'.
- All lines starting with 'v' are version lines, and the second field in such a line is the 'version number'.

$ cat file.txt
h,1,100
v,1
i,rt,200
i,rt,210
i,rt,810
h,1,101
v,5
i,rt,500
i,rt,700
h,1,100
v,2
i,rt,100
i,rt,910
h,1,500
v,1
i,rt,190
h,1,100
v,1
i,rt,900
i,rt,210
h,1,300
v,1
i,rt,800
i,rt,210
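
To make the notion of an instance concrete, here is a minimal sketch that prints one summary line per instance: header number, version number and the number of 'i' lines. The function name list_instances is only illustrative, and the sketch assumes every instance starts with an 'h' line, as in the sample above.

```shell
# list_instances FILE
# Print "header version count" for each instance in FILE,
# where count is the number of 'i' lines in that instance.
list_instances() {
    awk -F, '
        # A new header line closes the previous instance, if there was one.
        $1 == "h" { if (seen) print h, v, n; h = $3; n = 0; seen = 1 }
        $1 == "v" { v = $2 }
        $1 == "i" { n++ }
        END       { if (seen) print h, v, n }
    ' "$1"
}
```

$ list_instances file.txt
100 1 3
101 5 2
100 2 2
500 1 1
100 1 2
300 1 2

Header number 100 with version number 1 occurs in two separate instances.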

Required:
- Print all the 'i' lines associated with header number '100' and version number '1'
i.e. required output:

i,rt,200
i,rt,210
i,rt,810
i,rt,900
i,rt,210

The quick solution I can think of is to tag every 'i' line with the 'header number' and 'version number' of its instance:

$ awk -F "," '$1=="h" {h_value=$NF}
$1=="v" {v_value=$NF}
$1=="i" {print h_value,v_value,$0}
' file.txt

Output:

100 1 i,rt,200
100 1 i,rt,210
100 1 i,rt,810
101 5 i,rt,500
101 5 i,rt,700
100 2 i,rt,100
100 2 i,rt,910
500 1 i,rt,190
100 1 i,rt,900
100 1 i,rt,210
300 1 i,rt,800
300 1 i,rt,210

Then pipe that into a second awk to print only the lines with header number 100 and version number 1:

$ awk -F "," '$1=="h" {h_value=$NF}
$1=="v" {v_value=$NF}
$1=="i" {print h_value,v_value,$0}
' file.txt | awk '$1==100 && $2==1 {print $NF}'

Output:

i,rt,200
i,rt,210
i,rt,810
i,rt,900
i,rt,210
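
The same filtering can also be done in a single awk pass, by remembering the current header and version numbers and testing them on each 'i' line. A minimal sketch, assuming the sample file.txt shown earlier; the function name print_instance and the awk variables hdr and ver are my own choices, passed in with -v so the numbers are not hard-coded:

```shell
# print_instance HDR VER FILE
# Print the 'i' lines of every instance in FILE whose header number
# is HDR and whose version number is VER.
print_instance() {
    awk -F, -v hdr="$1" -v ver="$2" '
        $1 == "h" { h = $3 }    # remember the current header number
        $1 == "v" { v = $2 }    # remember the current version number
        # A pattern without an action prints the matching line.
        $1 == "i" && h == hdr && v == ver
    ' "$3"
}
```

$ print_instance 100 1 file.txt

This avoids the intermediate tagged output, and it does not depend on the second awk splitting the tagged lines on whitespace.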

I am sure there is a better solution to this problem. Readers, please post your solutions in the comments section. Much appreciated.

11 comments:

Unknown said...

Here's the quick and dirty Perl version.

perl -lne 'my @d = split(/,/);
if( $d[0] eq "h" ){
$header = ( $d[2] == 100 );
}
if( $header && $d[0] eq "v" ){
$version = ($d[1] == 1 );
} elsif ( $header && $version && $d[0] eq "i" ){
print $_;   # print the original line; print @d would drop the commas
}
' < input.dat

Unknown said...

Or how about a full perl version with a configurable header and version filter?

Usage: ./prog.pl filename header version

#!/usr/bin/env perl

use strict;
use warnings;

my ($file,$header_seek,$version_seek) = @ARGV[0..2];

open(my $fh, '<', $file) or die("Can't open file $file: $!");

my ($header,$version);

print "Looking for header\t: $header_seek\n";
print "Looking for version\t: $version_seek\n";

while(<$fh>){
    chomp;    # drop the newline before splitting and comparing
    my @d = split(/,/);

    if( $d[0] eq "h" ){
        $header = ( $d[2] == $header_seek );
    }
    if( $header && $d[0] eq "v" ){
        $version = ($d[1] == $version_seek );
    }elsif( $header && $version && $d[0] eq "i" ){
        print "$_\n";    # print the original line; print @d would drop the commas
    }
}

close($fh);

exit 0;

Unknown said...

@Botty, thanks for both Perl solutions. I will use them and update you. Thanks again.

Ryan T. said...

Just add an 'if' to the 'i' line case: awk -F, '$1 == "h" {h=$3} $1 == "v" {v=$2} $1 == "i" { if( h == 100 && v == 1 ) print $0 }' file.txt

Unknown said...

@Ryan T: Thanks for the awk one, it's useful.

bruddah said...

I would use sed to put each instance on its own line, and pipe the result to | grep 'h,1,100 v,1'
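
A minimal sketch of this idea, with awk standing in for sed to do the joining (a swap made here only to keep the example short). The function name join_instances is illustrative, and the sketch assumes the input lines contain no spaces, since a space is used as the separator inside each joined line.

```shell
# join_instances FILE
# Print each instance of FILE on a single line, with the original
# lines separated by single spaces.
join_instances() {
    awk '
        /^h,/ && NR > 1 { printf "\n" }    # a header line starts a new output line
        { printf "%s%s", ($0 ~ /^h,/ ? "" : " "), $0 }
        END { if (NR) printf "\n" }        # terminate the last joined line
    ' "$1"
}
```

$ join_instances file.txt | grep '^h,1,100 v,1 ' | tr ' ' '\n' | grep '^i,'

The grep pattern is anchored at the start of the line and ends with a space, so that version 1 does not also match versions such as 10 or 11. The trailing tr and grep split the selected instances back into lines and keep only the 'i' lines.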

satish said...

Hi Jadu, please help me out...

file1.csv
GJR10101315511000145,22093165,96734353,250,GJR,UER
HPR10101318071200662,57298772,03289112,250,HPR,DLR
HPR10101317581100059,57651616,02894754,250,HPR,DLR
HRR10101318031000352,02190237,03963371,250,HRR,DLR
HRR10101322141200125,02191868,03426983,250,HRR,DLR
HRR10101318071100028,02192292,77203779,250,HRR,BHR
KOR10101317231000008,04090878,14635365,250,KOR,WBR
KOR10101315171200004,04798063,14367284,250,KOR,WBR
KOR10101316291200024,13996348,97000215,250,KOR,BHR

file2.csv
KOR10101315511000150,57457784,250
GJR10101315511000145,96734353,250
UER10101315511000150,66734353,200
HRR10101315511000151,96734353,206
UER10101315511000164,96734353,200
UER10101319451200169,96734353,250
GJR10101315511000145,96734353,206
........
.n numbers.....
.......
GHR10101315511000150,36646466,200
HPR10101318071200662,03289112,250
HPR10101318071200662,03289112,206
DLR10101318071200664,03289112,206

OUTPUT:
FILE3.CSV
UER10101315511000164,96734353,200
DLR10101318071200664,03289112,206

Let me explain.
For each line in file1.csv, I have to take column 1, remove its first three characters (e.g. 10101315511000145) and prefix it with the 6th column (e.g. UER10101315511000145). Then I have to increment the last three digits (145 here) by 1 at a time, up to 999, and compare each result with column 1 of file2.csv. Whichever number appears first should print that line from file2.csv. Here I first get UER10101315511000150, but its 2nd column does not match the 3rd column in file1.csv, so I have to go on to the next increment whose 2nd column matches as well.

thanks in advance

Mahesh Kharvi said...

awk -F "," '$1=="h" {h_value=$NF} $1=="v" {v_value=$NF} $1=="i" && v_value == "1" && h_value == "100" {print}' file.txt

Anirudh said...

perl -0777ne '
print for/(?<=^h,1,100\nv,1)(?:\ni,rt,\d+)+$/msg;print"\n";
'

Anirudh said...

sed -e '
/^h,1,100$/N
/^i,rt,[1-9][0-9]*\n/bc
/\nv,1$/!d
s/.*//
:a
${s/$/\n/;bb;}
N
/\ni,rt,[1-9][0-9]*$/ba
:b
s/^\n//
:c
P;D
'

Unknown said...

perl -0777ne 'print/(?<=^h,1,100\nv,1\n)(?:i.+\n)+/mg' file.txt

© Jadu Saikia www.UNIXCL.com