Tuesday, January 12, 2010

Replace values in XML using sed and awk


Even though sed or awk is not suggested to work with XMLs (generally a programing language like Perl or Python is more preferred), there can be simple cases where we can make use of sed or awk to perform certain search and replace in simple XML s. Here is an simple example.

Input XML file:

$ cat test.xml
<?xml version="1.0" encoding="UTF-8"?>
<StudentInfo Version="1">
<Student>
<StudentId>INS469</StudentId>
<ClassId>21</ClassId>
<Amount>90</Amount>
<Location>AA</Location>
</Student>
<Student>
<StudentId>CSI150</StudentId>
<ClassId>71</ClassId>
<Amount>82</Amount>
<Location>AX</Location>
</Student>
<Student>
<StudentId>E687</StudentId>
<ClassId>12</ClassId>
<Amount>30</Amount>
<Location>AR</Location>
</Student>
</StudentInfo>

Required: Add value 20 t0 the existing 'Amount' values for each student entry in the above XML, i.e. the required output will be something like this:

<?xml version="1.0" encoding="UTF-8"?>
<StudentInfo Version="1">
<Student>
<StudentId>INS469</StudentId>
<ClassId>21</ClassId>
<Amount>110</Amount>
<Location>AA</Location>
</Student>
<Student>
<StudentId>CSI150</StudentId>
<ClassId>71</ClassId>
<Amount>102</Amount>
<Location>AX</Location>
</Student>
<Student>
<StudentId>E687</StudentId>
<ClassId>12</ClassId>
<Amount>50</Amount>
<Location>AR</Location>
</Student>
</StudentInfo>

The awk solution:

$ awk '
BEGIN { FS = "[<|>]" }
{
if ($2 == "Amount") {
sub($3,$3+20)
}
print
}
' test.xml

And to replace the 'Amount' value in the input XML with a constant number say '100' here is a sed solution:

$ sed 's#\(<Amount>\)[0-9]*\(</Amount>\)#\1'100'\2#g' test.xml

Output:

<?xml version="1.0" encoding="UTF-8"?>
<StudentInfo Version="1">
<Student>
<StudentId>INS469</StudentId>
<ClassId>21</ClassId>
<Amount>100</Amount>
<Location>AA</Location>
</Student>
<Student>
<StudentId>CSI150</StudentId>
<ClassId>71</ClassId>
<Amount>100</Amount>
<Location>AX</Location>
</Student>
<Student>
<StudentId>E687</StudentId>
<ClassId>12</ClassId>
<Amount>100</Amount>
<Location>AR</Location>
</Student>
</StudentInfo>

The \1 and \2 (Specifying which occurrence) used above can be well understood from few of the related posts I am mentioning below. Also there is a useful description of the same on grymoire.com

Related posts:

- Printing single quote with awk in bash
- Printing column of file using sed
- Use sed to replace part of a file in bash
- Convert date format using sed and awk
- Format lines using sed

2 comments:

awkseeker said...

Hi Juda, I have asked you about xml in post about "sum group by with awk". But I found my answer here, so I deleted my comments and would ask more in this section.
Similar with this sample, suppose I want to get Amount of StudentId E687 or I want to list all StudentId and Amount of ClassId 12. I prefer oneliner in awk. Please advise.
Best regards,

Adithya Kiran said...

never use sed or awk to parse xml. Use xmlstarlet command

© Jadu Saikia www.UNIXCL.com