I have an XML file like this:
<?xml version="1.0" encoding="utf-8"?><!--Generated by Screaming Frog SEO Spider 16.3-->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://orinab.com/</loc>
<changefreq>daily</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>https://orinab.com/cabinet/%DA%A9%D8%A7%D8%A8%DB%8C%D9%86%D8%AA-%D8%A2%D8%B4%D9%BE%D8%B2%D8%AE%D8%A7%D9%86%D9%87-%D8%A2%D9%85%D8%A7%D8%AF%D9%87-%D9%81%D9%84%D8%B2%DB%8C-%D8%AF%D8%B1%D8%A8-%DA%86%D9%88%D8%A8%DB%8C</loc>
<changefreq>daily</changefreq>
<priority>0.9</priority>
</url>
<url>
<loc>https://orinab.com/sales-associates</loc>
<changefreq>daily</changefreq>
<priority>0.9</priority>
</url>
<url>
<loc>https://orinab.com/kitchen-cabinet/%DA%A9%D8%A7%D8%A8%DB%8C%D9%86%D8%AA-%D8%A2%D8%A8%DA%86%DA%A9%D8%A7%D9%86-%D9%81%D9%84%D8%B2%DB%8C-%D8%B1%D9%86%DA%AF-%DA%A9%D8%A7%D8%AC</loc>
<changefreq>daily</changefreq>
<priority>0.8</priority>
</url>
...
I want to collect every link that contains kitchen-cabinet into a list.
Any suggestions would be appreciated.
2 Answers
I am not that good with xml, but one thing you can use is regex:
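A minimal sketch of the regex idea, assuming the file contents have already been read into a string. The sample text, the shortened example-page URL, and the helper name find_kitchen_cabinet_links are made up for illustration; only the kitchen-cabinet rule comes from the question.

```python
import re

# Hypothetical, shortened stand-in for the sitemap contents; in practice this
# would come from something like open("sitemap.xml", encoding="utf-8").read().
xml_text = """<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://orinab.com/</loc>
</url>
<url>
<loc>https://orinab.com/sales-associates</loc>
</url>
<url>
<loc>https://orinab.com/kitchen-cabinet/example-page</loc>
</url>
</urlset>"""


def find_kitchen_cabinet_links(text):
    # Capture the text of each <loc> element that contains "kitchen-cabinet".
    # [^<]* keeps the match inside a single element.
    pattern = r"<loc>([^<]*kitchen-cabinet[^<]*)</loc>"
    return [match.strip() for match in re.findall(pattern, text)]


kitchencabinets = find_kitchen_cabinet_links(xml_text)
print(kitchencabinets)
```

This relies on the file being as regular as the sample (one plain loc element per URL, no attributes or CDATA), so it is a quick workaround and not a general XML parser.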
Edit:
The Python docs have a good introduction to using xml.etree.ElementTree for parsing XML files.
In general, you parse the file with xml.etree.ElementTree.parse() and call getroot() on the result to obtain a representation of your XML file.
In your case, root will be the urlset element, and the next level of nesting is the url elements. What complicates it a little is that your XML declares a namespace, but as the docs describe, this is solved by defining a dictionary containing the namespaces. So if you look up every url element using that namespace dictionary and keep the loc values that contain kitchen-cabinet in a list named kitchencabinets, then kitchencabinets will be the list you are after.
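The steps above could look roughly like this. It is a sketch under a few assumptions: the sample bytes and the example-page URL are shortened stand-ins for the real file, and the prefix sm in the namespace dictionary is an arbitrary name chosen here (only the namespace URI comes from the question's XML).

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the sitemap file.
xml_bytes = b"""<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://orinab.com/</loc>
<changefreq>daily</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>https://orinab.com/sales-associates</loc>
<changefreq>daily</changefreq>
<priority>0.9</priority>
</url>
<url>
<loc>https://orinab.com/kitchen-cabinet/example-page</loc>
<changefreq>daily</changefreq>
<priority>0.8</priority>
</url>
</urlset>"""

# Map a prefix of our choosing to the namespace declared on <urlset>.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# For a file on disk, use: root = ET.parse("sitemap.xml").getroot()
# ("sitemap.xml" is an assumed filename).
root = ET.fromstring(xml_bytes)

# Walk urlset -> url -> loc and keep the URLs under /kitchen-cabinet/.
kitchencabinets = [
    loc.text.strip()
    for loc in root.findall("sm:url/sm:loc", ns)
    if loc.text and "/kitchen-cabinet/" in loc.text
]
print(kitchencabinets)
```

Matching on "/kitchen-cabinet/" with the slashes keeps URLs such as https://orinab.com/cabinet/... out of the result; loosen the test if other URL shapes should count.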