Python XML parser with a specific rule - SEO

December 8, 2021
123 views
0 votes
2 Answers

I have an XML file like this:

<?xml version="1.0" encoding="utf-8"?><!--Generated by Screaming Frog SEO Spider 16.3-->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://orinab.com/</loc>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://orinab.com/cabinet/%DA%A9%D8%A7%D8%A8%DB%8C%D9%86%D8%AA-%D8%A2%D8%B4%D9%BE%D8%B2%D8%AE%D8%A7%D9%86%D9%87-%D8%A2%D9%85%D8%A7%D8%AF%D9%87-%D9%81%D9%84%D8%B2%DB%8C-%D8%AF%D8%B1%D8%A8-%DA%86%D9%88%D8%A8%DB%8C</loc>
    <changefreq>daily</changefreq>
    <priority>0.9</priority>
  </url>
  <url>
    <loc>https://orinab.com/sales-associates</loc>
    <changefreq>daily</changefreq>
    <priority>0.9</priority>
  </url>
  <url>
    <loc>https://orinab.com/kitchen-cabinet/%DA%A9%D8%A7%D8%A8%DB%8C%D9%86%D8%AA-%D8%A2%D8%A8%DA%86%DA%A9%D8%A7%D9%86-%D9%81%D9%84%D8%B2%DB%8C-%D8%B1%D9%86%DA%AF-%DA%A9%D8%A7%D8%AC</loc>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  ...

I want to append the links that contain kitchen-cabinet to a list.
Any suggestions would be appreciated.

Tags: python

Answers

I am not that experienced with XML, but one option is a regular expression:

import re
reg = re.compile(r'(https:.*kitchen-cabinet.*)(?=<)')
reg.findall(xml)

>> ['https://orinab.com/kitchen-cabinet/%DA%A9%D8%A7%D8%A8%DB%8C%D9%86%D8%AA-%D8%A2%D8%A8%DA%86%DA%A9%D8%A7%D9%86-%D9%81%D9%84%D8%B2%DB%8C-%D8%B1%D9%86%DA%AF-%DA%A9%D8%A7%D8%AC']

# xml variable:
xml = '''
  <url>
    <loc>https://orinab.com/kitchen-cabinet/%DA%A9%D8%A7%D8%A8%DB%8C%D9%86%D8%AA-%D8%A2%D8%A8%DA%86%DA%A9%D8%A7%D9%86-%D9%81%D9%84%D8%B2%DB%8C-%D8%B1%D9%86%DA%AF-%DA%A9%D8%A7%D8%AC</loc>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://orinab.com/kitchen-cabinet/%DA%A9%D8%A7%D8%A8%DB%8C%D9%86%D8%AA-%D8%A2%D8%A8%DA%86%DA%A9%D8%A7%D9%86-%D9%81%D9%84%D8%B2%DB%8C-%D8%B1%D9%86%DA%AF-%DA%A9%D8%A7%D8%AC</loc>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://orinab.com/kitchen-cabinet/%DA%A9%D8%A7%D8%A8%DB%8C%D9%86%D8%AA-%D8%A2%D8%A8%DA%86%DA%A9%D8%A7%D9%86-%D9%81%D9%84%D8%B2%DB%8C-%D8%B1%D9%86%DA%AF-%DA%A9%D8%A7%D8%AC</loc>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
'''
reg.findall(xml)
>>> ['https://orinab.com/kitchen-cabinet/%DA%A9%D8%A7%D8%A8%DB%8C%D9%86%D8%AA-%D8%A2%D8%A8%DA%86%DA%A9%D8%A7%D9%86-%D9%81%D9%84%D8%B2%DB%8C-%D8%B1%D9%86%DA%AF-%DA%A9%D8%A7%D8%AC',
 'https://orinab.com/kitchen-cabinet/%DA%A9%D8%A7%D8%A8%DB%8C%D9%86%D8%AA-%D8%A2%D8%A8%DA%86%DA%A9%D8%A7%D9%86-%D9%81%D9%84%D8%B2%DB%8C-%D8%B1%D9%86%DA%AF-%DA%A9%D8%A7%D8%AC',
 'https://orinab.com/kitchen-cabinet/%DA%A9%D8%A7%D8%A8%DB%8C%D9%86%D8%AA-%D8%A2%D8%A8%DA%86%DA%A9%D8%A7%D9%86-%D9%81%D9%84%D8%B2%DB%8C-%D8%B1%D9%86%DA%AF-%DA%A9%D8%A7%D8%AC']

Edit: to run the same pattern against a file instead of a string:

with open('file.xml', 'r') as f:
    trim = reg.findall(f.read())
    print(trim)
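
A slightly stricter variant of the same idea, offered only as a sketch: it matches the URL only when it sits inside a <loc> element, and it does not rely on each <loc> being on its own line. The function name find_kitchen_cabinet_links and the shortened sample string are made up for illustration; they are not part of the original file.

```python
import re

# Capture the text of a <loc> element whose URL contains /kitchen-cabinet/.
# [^<]* cannot cross a tag boundary, so a match never spans two elements.
LOC_PATTERN = re.compile(r'<loc>\s*([^<]*/kitchen-cabinet/[^<]*?)\s*</loc>')

def find_kitchen_cabinet_links(xml_text):
    """Return every <loc> URL in xml_text that contains /kitchen-cabinet/."""
    return LOC_PATTERN.findall(xml_text)

# Hypothetical single-line sample, shortened for readability
sample = (
    '<url><loc>https://orinab.com/sales-associates</loc></url>'
    '<url><loc>https://orinab.com/kitchen-cabinet/abc</loc></url>'
)
print(find_kitchen_cabinet_links(sample))
```

Regex is fine for a quick one-off on a file you control; for anything more general, a real XML parser is the safer choice.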

- eandklahn
- December 8, 2021 at 12:56 pm
- 0 votes
The Python docs have a good introduction to using xml.etree.ElementTree for parsing XML files.

In general, you can use
```
import xml.etree.ElementTree as ET

tree = ET.parse('your_file.xml')  # replace with the path to your XML file
root = tree.getroot()
```
to obtain a representation of your XML file.

In your case, root will be the urlset element, and the next level of nesting is the url elements. What complicates it a little is that your XML declares a namespace, but as described in the documentation, this is handled by defining a dictionary that maps a prefix to the namespace URI.

So if you do
```
ns = {'ns1': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
kitchencabinets = []
for child in root:
    url = child.find('ns1:loc', ns).text
    if 'kitchen-cabinet' in url:
        kitchencabinets.append(url)
```
then kitchencabinets will be the list you are after.
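
Put together, the approach can be sketched as one self-contained function. This is only an illustration under a few assumptions: the function name kitchen_cabinet_urls, the prefix sm, and the shortened sample sitemap are invented here, and the sample is parsed from a string with ET.fromstring rather than from a file.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI mapping; the prefix "sm" is arbitrary, only the URI must match the file
SITEMAP_NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def kitchen_cabinet_urls(xml_text):
    """Return the <loc> URLs that contain the /kitchen-cabinet/ path segment."""
    root = ET.fromstring(xml_text)
    urls = []
    # Select every <loc> that is a child of a <url> directly under the root
    for loc in root.findall('sm:url/sm:loc', SITEMAP_NS):
        text = (loc.text or '').strip()  # guard against empty <loc/> elements
        if '/kitchen-cabinet/' in text:
            urls.append(text)
    return urls

# Hypothetical, shortened sitemap used only for this example
sample = '''<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://orinab.com/cabinet/a</loc></url>
  <url><loc>https://orinab.com/kitchen-cabinet/b</loc></url>
</urlset>'''
print(kitchen_cabinet_urls(sample))
```

Checking for '/kitchen-cabinet/' with the surrounding slashes is a deliberate choice: it avoids matching URLs where the phrase merely appears somewhere else in the path.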