Thursday, December 11, 2008

Split file based on pattern - awk


Input file:

$ cat address.txt
Mr X
Koramangala Post
3rd Cross, 17th Main
PIN: 12345
Mr Y
NGV
PIN: 45678
Mr Z
5th Ave, #23
NHM Post
LKV
PIN: 32456

Required: Split the above file into sub-files, each containing one address (an address runs from the person's name to the PIN line).

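$ awk '
BEGIN { n = 1; fn = "add" n ".txt" }    # first output file
{
    print > fn                          # write the line to the current file
    if (substr($0, 1, 3) == "PIN") {    # a PIN line ends the address
        close(fn)
        n++
        fn = "add" n ".txt"             # switch to the next file
    }
}
' address.txt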

The resulting output files:

$ cat add1.txt
Mr X
Koramangala Post
3rd Cross, 17th Main
PIN: 12345

$ cat add2.txt
Mr Y
NGV
PIN: 45678

$ cat add3.txt
Mr Z
5th Ave, #23
NHM Post
LKV
PIN: 32456


7 comments:

Javier Montoya said...

Hi everybody,

I’ve a file that looks like:

0 0 0 0 0 0
87 62 90 180 1.40679 1.60570860e-01
0 0 0 0 0 0
88 62 89 179 1.39871 1.76044390e-01
0 0 0 0 0 0
88 64 86 172 1.34657 1.50280803e-01
0 0 0 0 0 0
87 63 88 176 1.38235 1.94590941e-01
0 0 0 0 0 0
116 45 64 129 1.01130 1.18465826e-01
88 63 87 175 1.36837 1.46118164e-01
0 0 0 0 0 0
87 61 93 187 1.46723 1.99260086e-01

The lines containing only 0s can be thought of as delimiters. Between each pair of delimiters, I need to find the line with the highest value in the last column and return it for further processing. Could anybody shed some light on this?

Best regards,

Jadu Saikia said...

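@Javier, a related one: this finds the overall minimum and maximum of the last column, with the all-zero delimiter lines already removed from the input.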

$ cat jm.txt
87 62 90 180 1.40679 1.60570860e-01
88 62 89 179 1.39871 1.76044390e-01
88 64 86 172 1.34657 1.50280803e-01
87 63 88 176 1.38235 1.94590941e-01
116 45 64 129 1.01130 1.18465826e-01
88 63 87 175 1.36837 1.46118164e-01
87 61 93 187 1.46723 1.99260086e-01

$ awk '
min=="" {
min=max=$NF
}
{
if ($NF > max) {max = $NF};
if ($NF < min) {min = $NF};
}
END {
print "minimum:" min;
print "maximum:" max;
}
' jm.txt

minimum:1.18465826e-01
maximum:1.99260086e-01
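
The command above reports the overall minimum and maximum. For the per-block question as originally asked, one possible sketch is below. It assumes that a delimiter is any line made up only of zeros and whitespace, and that the last block should be reported even if no delimiter follows it. The file names blocks.txt and block_max.txt are placeholders chosen for this example.

```shell
# Sample data from the question; all-zero lines act as delimiters.
cat > blocks.txt <<'EOF'
0 0 0 0 0 0
87 62 90 180 1.40679 1.60570860e-01
0 0 0 0 0 0
88 62 89 179 1.39871 1.76044390e-01
0 0 0 0 0 0
88 64 86 172 1.34657 1.50280803e-01
0 0 0 0 0 0
87 63 88 176 1.38235 1.94590941e-01
0 0 0 0 0 0
116 45 64 129 1.01130 1.18465826e-01
88 63 87 175 1.36837 1.46118164e-01
0 0 0 0 0 0
87 61 93 187 1.46723 1.99260086e-01
EOF

awk '
# Delimiter line: emit the best line of the finished block, then reset.
/^[0 \t]+$/ { if (have) print best; have = 0; next }
# Data line: keep it if it is the first of the block or beats the current max.
NF && (!have || $NF + 0 > max) { max = $NF + 0; best = $0; have = 1 }
# Assumption: the final block is reported even without a closing delimiter.
END { if (have) print best }
' blocks.txt > block_max.txt

cat block_max.txt
```

For the sample data this prints six lines, one per block. Only the fifth block has two candidates, and the line ending in 1.46118164e-01 is the one kept. Adding 0 to $NF forces a numeric comparison, while the original line is printed unchanged.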

Sandeep said...

Hi everyone..

I have a file that looks like this

VU0000078
some text1

VU0000078
some text2

VU0000078
some text3

VU0000088
some text4

VU0000088
some text5

VU0000145
some text6

VU0000145
some text7

.. and so on

So I want to split this file so that each subfile contains all the text under the same "VU number".

So in this case I will have one file with

VU0000078
some text1

VU0000078
some text2

VU0000078
some text3

Next file will be

VU0000088
some text4

VU0000088
some text5

third file will be


VU0000145
some text6

VU0000145
some text7


and so on....

So I have about 10000 such files to make.

Help!!!

Thanks

Jadu Saikia said...

@Sandeep,

Here is a quick solution I can think of; please let me know if it helps. Otherwise we can try a Python program for this.

$ cat file.txt
VU0000078
some text1
some more text1

VU0000078
some text2

VU0000078
some text3

VU0000088
some text4

VU0000088
some text5
some more text5

VU0000145
some text6

VU0000145
some text7
some text7

VU0000078
some text8

# This assumes that ~ does not appear anywhere in file.txt; otherwise use another delimiter, such as a control character.

$ awk '/^VU/ {bookmark = $0; next}      # remember the current VU header
$0 == "" {next}                        # drop blank separator lines
{print bookmark "~" $0}                # tag each text line with its header
' file.txt | awk -F "~" '{print $NF > ($1 ".out")}'

$ ls -1 VU0000*
VU0000078.out
VU0000088.out
VU0000145.out

$ cat VU0000078.out
some text1
some more text1
some text2
some text3
some text8

$ cat VU0000088.out
some text4
some text5
some more text5

$ cat VU0000145.out
some text6
some text7
some text7
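
With around 10000 output files, some awk implementations may run out of open file handles. A single-pass variant that keeps only one file open at a time could look like the sketch below. It is an alternative, not the pipeline shown above: it appends with ">>" so that an ID which reappears later (VU0000078 here) continues its existing file, and it needs no "~" delimiter. It assumes every header line starts with VU and that blank lines carry no meaning.

```shell
# Same sample input as above.
cat > file.txt <<'EOF'
VU0000078
some text1
some more text1

VU0000078
some text2

VU0000078
some text3

VU0000088
some text4

VU0000088
some text5
some more text5

VU0000145
some text6

VU0000145
some text7
some text7

VU0000078
some text8
EOF

# ">>" appends, so remove output left over from earlier runs first.
rm -f VU0000*.out

awk '
/^VU/ {
    new = $0 ".out"
    if (out != "" && out != new) close(out)   # keep only one file open
    out = new
    next
}
NF && out != "" { print >> out }              # non-blank text goes to the current file
' file.txt

cat VU0000078.out
```

To keep the header lines in the output as well, printing $0 to the new file just before the "next" statement should be enough.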

sandy said...

hey Jadu

Thanks, it works!!! But there is a problem.

I want to process the generated files, and I forgot to mention that I need those headers too, since headers also occur in the text and their location is important.

Basically, for each ID I want all the text between the first VU00000?? and the last VU00000??, headers included. Also, the headers don't precede the text; they follow it. My file looks more like this:

text1.1(contains VU0000078)
VU0000078
text1.2(contains VU0000078)
VU0000078
text1.3(contains VU0000078)
VU0000078
text2.1(contains VU0000088)
VU0000088
and so on..

Sorry I am sure I have confused you now!!

I have found a way to deal with this using csplit, but I guess awk will be much faster.

Looking forward to the awk solution. Thanks

Jadu Saikia said...

@Sandy, I will definitely try to assist you with this. Could you please post your input file and your expected output here? Thanks again for asking the question.

sandy said...

hey Jadu

so the file looks like this

23
MOE2009 3D

some 50 to 60 lines of coordinates
M END
>
1

>
51.051895

>
0.79439676

>
1

>
3.6425204

>
0.1773992

>
0.94735718

>
23

>
VU0000036

>
36

$$$$
23
MOE2009 3D

33 34 0 0 0 0 0 0 0 0999 V2000
again 60 to 70 lines of coordinates
M END
>
1

>
51.051895

>
0.79439658

>
1

>
3.6425591

>
0.17738996

>
0.94737715

>
23

>
VU0000036

>
36

$$$$
20
MOE2009 3D
again coordinates
M END
>
2

>
49.205296

>
0.90112072

>
1

>
3.9111233

>
0.13690513

>
0.89833623

>
20

>
VU0000038

>
38



and so on....

So you see, the CORP_ID keeps changing after 20 repetitions. Here I have given a short example of what my file looks like.

So the output for this example file will be:

1. First File - VU0000036.txt

23
MOE2009 3D

some 50 to 60 lines of coordinates
M END
>
1

>
51.051895

>
0.79439676

>
1

>
3.6425204

>
0.1773992

>
0.94735718

>
23

>
VU0000036

>
36

$$$$
23
MOE2009 3D

33 34 0 0 0 0 0 0 0 0999 V2000
again 60 to 70 lines of coordinates
M END
>
1

>
51.051895

>
0.79439658

>
1

>
3.6425591

>
0.17738996

>
0.94737715

>
23

>
VU0000036

And the second file will be VU0000038.txt, containing the last part:

$$$$
20
MOE2009 3D
again coordinates
M END
>
2

>
49.205296

>
0.90112072

>
1

>
3.9111233

>
0.13690513

>
0.89833623

>
20

>
VU0000038


I hope I have made it clear. I guess it is going to take a lot of space though, sorry.
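
One possible awk approach to the record layout described above, offered only as a sketch: buffer each record until its "$$$$" terminator, remember the VU line seen inside it, and append the whole record to a file named after that ID. The assumptions are that each record contains exactly one line consisting solely of a VU number, that records are kept whole (terminator included, which differs slightly from the hand-cut expected output above), and that the last record may lack a terminator. The input name records.txt and its shortened contents are made up for this example.

```shell
# Shortened stand-in for the real file: three records, the last one unterminated.
cat > records.txt <<'EOF'
23
MOE2009 3D
M END
>
VU0000036

>
36

$$$$
23
MOE2009 3D
M END
>
VU0000036

>
36

$$$$
20
MOE2009 3D
M END
>
VU0000038

>
38
EOF

# ">>" appends, so remove output left over from earlier runs first.
rm -f VU0000036.txt VU0000038.txt

awk '
{ rec = rec $0 "\n" }                  # buffer the current record
/^VU[0-9]+$/ { id = $0 }               # remember the ID line of this record
$0 == "$$$$" {                         # end of record: write it out
    if (id != "") {
        out = id ".txt"
        if (out != prev) { if (prev != "") close(prev); prev = out }
        printf "%s", rec >> out
    }
    rec = ""; id = ""
}
END {                                  # last record may lack the terminator
    if (rec != "" && id != "") printf "%s", rec >> (id ".txt")
}
' records.txt

ls VU0000036.txt VU0000038.txt
```

Because the ID changes only every so often, closing the previous file whenever the ID changes keeps a single output file open at a time, which matters when thousands of files are produced.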

© Jadu Saikia www.UNIXCL.com