Wednesday, September 9, 2009

Split file using awk - few examples


One of good use of awk is splitting files (based on different conditions) into sub-files.
Lets see some examples:

Example 1:

Input file log1.txt:
- Line starting with H is the main header line.
- Line starting with "h" and subsequent lines starting with "s" (till the next "h" line) are part of the same entry/section.

$ cat log1.txt
H,555,etho0
h,1,3
s,1233,456456,1212
s,4251,452456,7215
s,6283,851456,1219
h,9,2
s,2233,156456,1912
s,9233,256456,8212
h,2,4
s,4233,456456,1212
s,7251,252456,7215
s,1288,851456,9219
s,9183,851456,6219

Required:

- Split or subdivide the above file into sub files corresponding to each entry (one entry being the section starting with "h" and "s" lines till the next "h" line)
- Each sub-file should contain(start with) the main header line ("H" line).
- The required output is 3 sub files with the following contents and filename convention.

$ cat 555.1.1.log
H,555,etho0
h,1,3
s,1233,456456,1212
s,4251,452456,7215
s,6283,851456,1219

$ cat 555.9.2.log
H,555,etho0
h,9,2
s,2233,156456,1912
s,9233,256456,8212

$ cat 555.2.3.log
H,555,etho0
h,2,4
s,4233,456456,1212
s,7251,252456,7215
s,1288,851456,9219
s,9183,851456,6219

The awk program:

$ awk -F "," '
$1=="H" {mainH=$0;id=$2;next}
/^h/{
hid=$2;close(id"."hid"."f".log")
f++
print mainH > id"."hid"."f".log"
}
{print $0 > id"."hid"."f".log"}
' log1.txt

Example 2:

Input file:

$ cat log.txt
1252468812,yahoo,3.5
1252468812,hotmail,2.4
1252468819,yahoo,1.2
1252468812,msn,8.9
1252468923,gmail,12
1252468819,live,3.4
1252468929,yahoo,9.0
1252468929,msn,1.2

Required:

a) Split the above files based on the first field (i.e. lines with same first field should go to the same file)

The awk one liner:

$ awk -F "," '{close(f);f=$1}{print > f".txt"}' log.txt

Output:
Above file is splited into the following sub-files.

$ cat 1252468812.txt
1252468812,yahoo,3.5
1252468812,hotmail,2.4
1252468812,msn,8.9

$ cat 1252468819.txt
1252468819,yahoo,1.2
1252468819,live,3.4

$ cat 1252468923.txt
1252468923,gmail,12

$ cat 1252468929.txt
1252468929,yahoo,9.0
1252468929,msn,1.2

b) Send every 3 lines of above file into a sub file.

The awk code:

$ awk '{print >("log_" int((NR+2)/3))}' log.txt

Output:
The sub-files generated.

$ cat log_1
1252468812,yahoo,3.5
1252468812,hotmail,2.4
1252468819,yahoo,1.2

$ cat log_2
1252468812,msn,8.9
1252468923,gmail,12
1252468819,live,3.4

$ cat log_3
1252468929,yahoo,9.0
1252468929,msn,1.2

Related post:

28 comments:

Anand Subramanian said...

Jadu, enjoyed reading this entry.
I have a question. I want to split a file at a line number.
I have a file which has 10000 lines. I need first 5000 to go in file1.
The next 2000 to go in file2.
The next 2000 to go in file3.
The last 1000 to go in file4.

How to do this?

Jadu Saikia said...

@Anand, since your num-lines to split with is not constant, I can think of a bash script for the same.

Make sure you have a .bak of your parent main file.

e.g.

$ cat main.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

$ cat split.sh
#!/bin/sh

file='main.txt'
c=0

for i in 5 3 2 2
do
echo "Extracting $i lines from $file"
head -$i main.txt > file.$c
((c+=1))
sed "1,$i d" main.txt > main.txt.tmp
mv main.txt.tmp main.txt
done

$ ./split.sh
Extracting 5 lines from main.txt
Extracting 3 lines from main.txt
Extracting 2 lines from main.txt
Extracting 2 lines from main.txt

$ cat file.0
1
2
3
4
5

$ cat file.1
6
7
8

$ cat file.2
9
10

$ cat file.3
11
12

Thanks for your comment.

steve said...

Jadu,

While your example 2b works, it only works with short input files. A better solution for the problem "Send every 3 lines of above file into a sub file" is
awk '{fn = NR % 3; print > ("log_" fn)}' log.txt

steve said...

Jadu,

While your example 2b works, it only works with short input files. A better solution for the problem "Send every 3 lines of above file into a sub file" is
awk '{fn = NR % 3; print > ("log_" fn)}' log.txt

El principito said...

Hi Jadu,

You are very expert in awk I think. I only know things I saw in the web. Now I have a data which I need to separate ike subfiles. The have a description label of 5 lines

DESCRIPTION=Unknown Line Type
ID=0
CODIGO=51
TIPO=Falla inversa
NOMBRE=
CUADRANGUL=
ZONA_UTM=18
720196.346 8617311.085 -999999
720125.284 8617256.012 -999999
720022.148 8617175.128 -999999
719893.968 8617111.038 -999999
719826.108 8617088.418 -999999
719697.929 8617058.258 -999999
719471.729 8617016.788 -999999
719283.230 8616975.318 -999999
719083.420 8616933.849 -999999
718860.991 8616914.999 -999999
718740.351 8616892.379 -999999
718612.172 8616850.909 -999999
718457.602 8616775.509 -999999
718272.873 8616673.719 -999999
718103.223 8616620.940 -999999
718050.444 8616579.470 -999999
718008.974 8616522.920 -999999
717963.734 8616447.520 -999999
717869.484 8616304.261 -999999
717824.244 8616274.101 -999999
717779.004 8616259.021 -999999
717696.065 8616221.321 -999999
717624.435 8616157.231 -999999
717484.945 8615991.352 -999999
717383.156 8615900.872 -999999
717336.468 8615869.357 -999999
717273.553 8615845.603 -999999
DESCRIPTION=Unknown Line Type
ID=0
CODIGO=51
TIPO=Falla inversa
NOMBRE=
CUADRANGUL=
ZONA_UTM=18
724897.564 8617294.747 -999999
724689.529 8617175.871 -999999
724358.406 8617065.497 -999999
724118.462 8616969.519 -999999
724012.886 8616931.128 -999999
723926.506 8616916.732 -999999
723698.559 8616947.924 -999999
723482.610 8616957.522 -999999
723295.453 8616938.327 -999999
722973.928 8616861.545 -999999
722527.632 8616751.170 -999999
722016.551 8616662.391 -999999
721728.618 8616604.804 -999999
721493.473 8616532.821 -999999
721243.931 8616412.849 -999999
720850.422 8616201.698 -999999
720408.925 8615961.754 -999999
720248.163 8615882.573 -999999
720008.219 8615724.209 -999999
719840.258 8615618.634 -999999
719465.945 8615469.869 -999999
719240.397 8615383.489 -999999
719067.638 8615253.919 -999999
718933.269 8615109.952 -999999
718760.509 8614841.215 -999999
718582.950 8614481.299 -999999
718467.777 8614313.338 -999999
718381.397 8614241.355 -999999
718160.649 8614097.388 -999999
717843.922 8613867.042 -999999
717685.559 8613708.679 -999999
717426.420 8613387.154 -999999
717349.215 8613311.538 -999999
717235.899 8613217.464 -999999
DESCRIPTION=Unknown Line Type
ID=0
CODIGO=30
TIPO=Falla
NOMBRE=
CUADRANGUL=
ZONA_UTM=18
725457.135 8604909.351 -999999
725428.262 8605411.654 -999999
725407.602 8605676.481 -999999
725396.662 8605824.459 -999999
725383.185 8605913.134 -999999
DESCRIPTION=Unknown Line Type
ID=0
CODIGO=30
TIPO=Falla
NOMBRE=
CUADRANGUL=
ZONA_UTM=18
725313.881 8603696.004 -999999
725420.228 8603987.394 -999999
725443.624 8604070.344 -999999
725454.259 8604144.787 -999999
725458.513 8604499.985 -999999


I really appreciate your help

I congratulate to you because unix and this kinds of software are very useful and you know how to use

Best Regards

Ana

Jadu Saikia said...

@Ana, sorry could not reply to you last day. Assuming you want to split the master file into subfiles starting with "DESCRIPTION", here is a solution using awk:

$ awk '/^DESCRIPTION/{close("sub_"f);f++}{print $0 > "sub_"f}' file.txt

Please let me know if your requirement was different. Thanks for your comment.

sathish said...

Hi
this command is creating the files after splitting in the same directory where we are currently but i havea req where the files shld write in to seperate dir after splitting . Please help me on this

sathish said...

Hi i a requiremnt where the files after shld wrote in to another directory not in the directory where i am running this command - how to do this

Jadu Saikia said...

@Satish,

Thanks for the question.

This creates sub files under ./dir1

awk '{print >("dir1/log_" int((NR+2)/3))}' file.txt

Hope this helps.

zvtral said...

hi jadu
i have a question how about this
ex:
1252468812,yahoo,3.5
1252468812,hotmail,2.4
1252468819,yahoo,1.2
msn,1252468812,8.9
1252468923,gmail,12
1252468819,live,3.4
yahoo,1252468812,9.0
1252468929,msn,1.2

so output will like this
1252468812,yahoo,3.5
1252468812,hotmail,2.4
msn,1252468812,8.9
yahoo,1252468812,9.0

they will be get "1252468812" in $1 and $2

zvtral said...

hlow jadu thx for your tutor
i have question about awk
ex:
1252468812,yahoo,3.5
1252468812,hotmail,2.4
1252468819,yahoo,1.2
msn,1252468812,8.9
1252468923,gmail,12
live,1252468812,3.4
yahoo,1252468812,9.0
1252468929,msn,1.2

output
1252468812,yahoo,3.5
1252468812,hotmail,2.4
msn,1252468812,8.9
1252468923,gmail,12

they will get 1252468812 in $1 and $2 how will do that

thx before

saravanan said...

717484.945 8615991.352 -999999
717383.156 8615900.872 -999999
717336.468 8615869.357 -999999
717273.553 8615845.603 -999999
DESCRIPTION=Unknown Line Type
ID=0
CODIGO=51
TIPO=Falla inversa
NOMBRE=
CUADRANGUL=
ZONA_UTM=18
724897.564 8617294.747 -999999
hekkkkiii
724358.406 8617065.497 -999999
hekkkkiii
724012.886 8616931.128 -999999
723926.506 8616916.732 -999999
723698.559 8616947.924 -999999
723482.610 8616957.522 -999999
723295.453 8616938.327 -999999
hekkkkiii
722527.632 8616751.170 -999999
722016.551 8616662.391 -999999
721728.618 8616604.804 -999999
721493.473 8616532.821 -999999
hekkkkiii
720850.422 8616201.698 -999999
720408.925 8615961.754 -999999
720248.163 8615882.5
hekkkkiii

i need to split this into 3(depending on my requirement)..
output shuld like this...

file1


717484.945 8615991.352 -999999
717383.156 8615900.872 -999999
717336.468 8615869.357 -999999
717273.553 8615845.603 -999999
DESCRIPTION=Unknown Line Type
ID=0
CODIGO=51
TIPO=Falla inversa
NOMBRE=
CUADRANGUL=
ZONA_UTM=18
724897.564 8617294.747 -999999
hekkkkiii
724358.406 8617065.497 -999999
hekkkkiii
724012.886 8616931.128 -999999
723926.506 8616916.732 -999999
723698.559 8616947.924 -999999
723482.610 8616957.522 -999999
723295.453 8616938.327 -999999


file2

hekkkkiii
722527.632 8616751.170 -999999
722016.551 8616662.391 -999999
721728.618 8616604.804 -999999
721493.473 8616532.821 -999999
hekkkkiii


file3

720850.422 8616201.698 -999999
720408.925 8615961.754 -999999
720248.163 8615882.5
hekkkkiii
720850.422 8616201.698 -999999
720408.925 8615961.754 -999999
720248.163 8615882.5
hekkkkiii

it should not break the hekkkkiii....while spliting the file...

PP said...

awk -F "," '{close(f);f=$1}{print > f".txt"}' log.txt

Can anyone explain woking of above command?

To be specific I don't understand how above command is comparing records with same first field and pushing the record in the same file with filename=first field of current record(I got the part how files having filenames = distinct(first column) is created.)

PP said...

awk -F "," '{close(f);f=$1}{print > f".txt"}' log.txt

Can anyone explain woking of above command?

To be specific I don't understand how above command is comparing records with same first field and pushing the record in the same file with filename=first field of current record(I got the part how files having filenames = distinct(first column) is created.)

PP said...

awk -F "," '{close(f);f=$1}{print > f".txt"}' log.txt

Can anyone explain woking of above command?

To be specific I don't understand how above command is comparing records with same first field and pushing the record in the same file with filename=first field of current record(I got the part how files having filenames = distinct(first column) is created.)

aniruddh said...

What to do if i`m required to keep 1 header also,in your example 2(a)(the yahoo one).
Please provide similar awk one liner

Jadu Saikia said...

@Aniruddh, thanks for the query. Could you please put the input and expected output with one example.

tomj said...

I am pretty new to scripting & need to take a rather large file & break it up with the value from 4 separate columns. The total combinations will leave me with about 1080 files. First column company has 15 possible values. The second State has 3, Class has 2 & Month has 12. SO a file by month, company, state, class for each combination. Can you help me out? THanks.

Jadu Saikia said...

@tomj, something like this ?

e.g.

$ cat log.txt
ms,us,c3,jan,40
goog,us,c3,jan,50
ms,us,c3,jan,30
ms,us,c3,jan,10
ms,us,c3,feb,40


$ awk -F "," '{close(f);f=$1"."$2"."$3"."$4}{print > f".txt"}' log.txt

Output:
$ cat ms.us.c3.jan.txt
ms,us,c3,jan,40
ms,us,c3,jan,30
ms,us,c3,jan,10

$ cat ms.us.c3.feb.txt
ms,us,c3,feb,40

$ cat goog.us.c3.jan.txt
goog,us,c3,jan,50

Could you please try and let me know. Thanks.

tomj said...

Jadu, I guess I need more specifics as I have never wiritten any Unix scripts before. I have a file: PowerCenter/TgtFiles/dfel_dw/elig_archive/elig_list_comb_arch_1.out
I need to put the script commmand to break this file up by Company, Cust Class, State, & month in a Scripts Directory. The columns for Company, Cust Class, State, & month are 2, 5, 11, 79 respectively. I was jut going to create in NotePad then FTP to my Unix Server. Any additional help you can provide I appreciate. Here is what I tried:

/infile/PowerCenter/TgtFiles/dfel_dw/elig_archive/elig_list_comb_arch_1.out

$ awk -F "," '{close(f);f=$2"."$5"."$11"."$79}{print > f".txt"}' log.txt

SSH:

ELIG_ARCH_SPLIT.sh
ELIG_ARCH_SPLIT.sh: /infile/PowerCenter/TgtFiles/dfel_dw/elig_archive/elig_list_comb_arch_1.out^M: not found.
ELIG_ARCH_SPLIT.sh[2]: ^M: not found.
ELIG_ARCH_SPLIT.sh[3]: $: not found.

tomj said...

Jadu, I made an attempt at this & maybe its a permission setting but here is my script:

awk -F "," '{close(f);f=$2"."$5"."$11"."$79}{print > f".txt"}' elig_list_comb_arch_1.out


Here are my results:

rsetl01d:dfelp:/PowerCenter/TgtFiles/dfel_dw/elig_archive==>ELIG_ARCH_SPLIT.sh
.wk: 0602-533 Cannot find or open file elig_list_comb_arch_1.out
The source line number is 1.
rsetl01d:dfelp:/PowerCenter/TgtFiles/dfel_dw/elig_archive==>

tomj said...

I created a smaller test file to break up. I created the awk in a .sh file:
awk -F "," '{f=$2"."$5"."$11"."$79; print > $f".txt"}' ./shortoutput.txt

results:

awk: 0602-562 Field $(PECO Energy.Residential.PA.3) is not correct.
The input line number is 1. The file is ./shortoutput.txt.
The source line number is 1.
What should I try now?

Jadu Saikia said...

@tomj,

could you please try this:

awk -F "," '{close(f);f=$2"."$5"."$11"."$79}{print > f".txt"}' ./shortoutput.txt

tomj said...

Jadu, thanks. That appears to work both from the command line & creatinga .sh file. I think I just need to go back & remove spaces from the $2 so I don't have spaces in the file name.

tomj said...
This comment has been removed by the author.
tomj said...
This comment has been removed by the author.
Colecandoo said...

Hello there.

I just tried example 2 of this post and received the following error:

awk: syntax error at source line 1
context is
{close(f);f=$1}{print > >>> f".txt" <<<
awk: illegal statement at source line 1

what could I have done incorrectly?

Running terminal from Mac OSX 10.6.8

Colecandoo said...

Hello there

have just tried example 2 and had the following error:

awk: syntax error at source line 1
context is
{close(f);f=$1}{print > >>> f".txt" <<<
awk: illegal statement at source line 1

what could I have done incorrectly?

running terminal from Mac OSX 10.6.8

© Jadu Saikia www.UNIXCL.com