Thursday, December 11, 2008
Split file based on pattern - awk
Input file:
$ cat address.txt
Mr X
Koramangala Post
3rd Cross, 17th Main
PIN: 12345
Mr Y
NGV
PIN: 45678
Mr Z
5th Ave, #23
NHM Post
LKV
PIN: 32456
Required: Divide the above file into sub-files, each containing one address (an address running from the person's name to the PIN line).
$ awk '
BEGIN { fn = "add1.txt"; n = 1 }
{
    print > fn                        # write the current line to the current sub-file
    if (substr($0,1,3) == "PIN") {    # a PIN line ends an address
        close(fn)                     # close the finished sub-file
        n++
        fn = "add" n ".txt"           # start the next sub-file
    }
}
' address.txt
So the output sub-files:
$ cat add1.txt
Mr X
Koramangala Post
3rd Cross, 17th Main
PIN: 12345
$ cat add2.txt
Mr Y
NGV
PIN: 45678
$ cat add3.txt
Mr Z
5th Ave, #23
NHM Post
LKV
PIN: 32456
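An alternative, assuming GNU csplit is available: split after every line that begins with PIN. The pieces get default names xx00, xx01, and so on; -z suppresses a trailing empty piece.

$ csplit -z address.txt '/^PIN/+1' '{*}'
$ ls xx*
xx00  xx01  xx02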
Similar posts:
- Subdividing large file into multiple files using awk
- Send alternate lines to separate files using awk
- Break lines into multiple lines using sed and awk
- Split file and add header to each file using awk
© Jadu Saikia www.UNIXCL.com
7 comments:
Hi everybody,
I have a file that looks like:
0 0 0 0 0 0
87 62 90 180 1.40679 1.60570860e-01
0 0 0 0 0 0
88 62 89 179 1.39871 1.76044390e-01
0 0 0 0 0 0
88 64 86 172 1.34657 1.50280803e-01
0 0 0 0 0 0
87 63 88 176 1.38235 1.94590941e-01
0 0 0 0 0 0
116 45 64 129 1.01130 1.18465826e-01
88 63 87 175 1.36837 1.46118164e-01
0 0 0 0 0 0
87 61 93 187 1.46723 1.99260086e-01
The lines containing 0's can be thought of as delimiters. Between each pair of delimiters, I need to find which line has the highest value in the last column and return it for further processing. Could anybody shed some light on this?
Best regards,
@Javier, a related one:
$ cat jm.txt
87 62 90 180 1.40679 1.60570860e-01
88 62 89 179 1.39871 1.76044390e-01
88 64 86 172 1.34657 1.50280803e-01
87 63 88 176 1.38235 1.94590941e-01
116 45 64 129 1.01130 1.18465826e-01
88 63 87 175 1.36837 1.46118164e-01
87 61 93 187 1.46723 1.99260086e-01
$ awk '
min == "" {                  # first line: initialise min and max
    min = max = $NF
}
{
    if ($NF > max) max = $NF
    if ($NF < min) min = $NF
}
END {
    print "minimum:" min
    print "maximum:" max
}
' jm.txt
minimum:1.18465826e-01
maximum:1.99260086e-01
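And for the per-block question itself (the line with the highest last column between each pair of all-zero delimiter lines), a minimal sketch; it assumes the delimiters are exactly the all-zero lines and that the input is saved as javier.txt (a name used here just for illustration):

$ awk '
/^[0 ]+$/ {                       # all-zero delimiter: report the best line of the block
    if (best != "") print best
    max = ""; best = ""
    next
}
max == "" || $NF + 0 > max + 0 {  # keep the line with the highest last column so far
    max = $NF
    best = $0
}
END { if (best != "") print best }
' javier.txt

This prints one line per block; for the block with two lines it picks the one ending in 1.46118164e-01.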
Hi everyone,
I have a file that looks like this:
VU0000078
some text1
VU0000078
some text2
VU0000078
some text3
VU0000088
some text4
VU0000088
some text5
VU0000145
some text6
VU0000145
some text7
.. and so on
So I want to split this file so that each sub-file contains all the text under the same VU number.
So in this case I will have one file with:
VU0000078
some text1
VU0000078
some text2
VU0000078
some text3
The next file will be:
VU0000088
some text4
VU0000088
some text5
The third file will be:
VU0000145
some text6
VU0000145
some text7
and so on....
So I have like 10000 such files to make..
Help!!!
Thanks
@Sandeep,
Quickly, I can think of a solution like this; please let me know if it helps. Otherwise we can try a Python program for this.
$ cat file.txt
VU0000078
some text1
some more text1
VU0000078
some text2
VU0000078
some text3
VU0000088
some text4
VU0000088
some text5
some more text5
VU0000145
some text6
VU0000145
some text7
some text7
VU0000078
some text8
# This assumes ~ is not part of any text in file.txt; otherwise we can use another delimiter, say some control character.
$ awk '/^VU/   { bookmark = $0; next }     # remember the current VU header
       NF == 0 { next }                    # skip blank lines
               { print bookmark "~" $0 }   # prefix each text line with its header
' file.txt | awk -F"~" '{ print $2 > ($1 ".out") }'
$ ls -1 VU0000*
VU0000078.out
VU0000088.out
VU0000145.out
$ cat VU0000078.out
some text1
some more text1
some text2
some text3
some text8
$ cat VU0000088.out
some text4
some text5
some more text5
$ cat VU0000145.out
some text6
some text7
some text7
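A single-pass alternative without the intermediate ~ delimiter, assuming the number of distinct VU headers stays within awk's open-file limit (gawk handles many open files transparently):

$ awk '/^VU/    { fn = $0 ".out"; next }   # a VU header selects the output file
       fn && NF { print > fn }             # every non-blank line goes to the current file
' file.txt

Here print > fn truncates each .out file on its first open during the run and appends afterwards, so the text under a repeated header (like the later VU0000078 block) still lands in the same file.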
Hey Jadu,
Thanks, it works!!! But there is a problem.
I want to process the generated files, and I forgot to mention that I want those headers too, since the headers occur in the text as well and their location is important.
Basically, for example, I want all the text between the first VU00000?? and the last VU00000??, not excluding the headers. Also, the headers don't precede the text; they follow it. My file looks more like this:
text1.1(contains VU0000078)
VU0000078
text1.2(contains VU0000078)
VU0000078
text1.3(contains VU0000078)
VU0000078
text2.1(contains VU0000088)
VU0000088
and so on..
Sorry, I am sure I have confused you now!!
I have found a way to deal with this using csplit, but I guess using awk will be much faster...
So, looking forward to the awk solution.. Thanks
@Sandy, I will definitely try to assist you on this. Could you please post your input file and your expected output here? Thanks again for asking the question.
hey Jadu
so the file looks like this
23
MOE2009 3D
some 50 to 60 lines of coordinates
M END
>
1
>
51.051895
>
0.79439676
>
1
>
3.6425204
>
0.1773992
>
0.94735718
>
23
>
VU0000036
>
36
$$$$
23
MOE2009 3D
33 34 0 0 0 0 0 0 0 0999 V2000
again 60 to 70 lines of coordinates
M END
>
1
>
51.051895
>
0.79439658
>
1
>
3.6425591
>
0.17738996
>
0.94737715
>
23
>
VU0000036
>
36
$$$$
20
MOE2009 3D
again coordinates
M END
>
2
>
49.205296
>
0.90112072
>
1
>
3.9111233
>
0.13690513
>
0.89833623
>
20
>
VU0000038
>
38
and so on....
So you see the CORP_ID will keep changing after 20 repetitions. Here I have given a short example of how my file looks.
So the output for this example file will be:
1. First File - VU0000036.txt
23
MOE2009 3D
some 50 to 60 lines of coordinates
M END
>
1
>
51.051895
>
0.79439676
>
1
>
3.6425204
>
0.1773992
>
0.94735718
>
23
>
VU0000036
>
36
$$$$
23
MOE2009 3D
33 34 0 0 0 0 0 0 0 0999 V2000
again 60 to 70 lines of coordinates
M END
>
1
>
51.051895
>
0.79439658
>
1
>
3.6425591
>
0.17738996
>
0.94737715
>
23
>
VU0000036
2. Second File - VU0000038.txt (the last part):
$$$$
20
MOE2009 3D
again coordinates
M END
>
2
>
49.205296
>
0.90112072
>
1
>
3.9111233
>
0.13690513
>
0.89833623
>
20
>
VU0000038
I hope I have made it clear... I guess it's going to take a lot of space though..
sorry
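A possible awk sketch for this record-oriented split, assuming each record ends with a $$$$ line and contains exactly one line that is just the VU id (the input name sandy.txt is hypothetical); unlike the sample output above, this keeps the $$$$ at the end of each record:

$ awk '
{ buf = buf $0 ORS }              # accumulate the current record
/^VU[0-9]+$/ { id = $0 }          # the bare VU line names the record
$0 == "$$$$" {                    # record delimiter: flush the record
    printf "%s", buf > (id ".txt")
    buf = ""
}
END { if (buf != "") printf "%s", buf > (id ".txt") }
' sandy.txt

As in the earlier solution, awk keeps each output file open for the whole run, so records sharing a VU id append to the same .txt file.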