skip to Main Content

I have an xml file like this:

<?xml version="1.0" encoding="utf-8"?><!--Generated by Screaming Frog SEO Spider 16.3-->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://orinab.com/</loc>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://orinab.com/cabinet/%DA%A9%D8%A7%D8%A8%DB%8C%D9%86%D8%AA-%D8%A2%D8%B4%D9%BE%D8%B2%D8%AE%D8%A7%D9%86%D9%87-%D8%A2%D9%85%D8%A7%D8%AF%D9%87-%D9%81%D9%84%D8%B2%DB%8C-%D8%AF%D8%B1%D8%A8-%DA%86%D9%88%D8%A8%DB%8C</loc>
    <changefreq>daily</changefreq>
    <priority>0.9</priority>
  </url>
  <url>
    <loc>https://orinab.com/sales-associates</loc>
    <changefreq>daily</changefreq>
    <priority>0.9</priority>
  </url>
  <url>
    <loc>https://orinab.com/kitchen-cabinet/%DA%A9%D8%A7%D8%A8%DB%8C%D9%86%D8%AA-%D8%A2%D8%A8%DA%86%DA%A9%D8%A7%D9%86-%D9%81%D9%84%D8%B2%DB%8C-%D8%B1%D9%86%DA%AF-%DA%A9%D8%A7%D8%AC</loc>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  ...

and I want to append links with kitchen-cabinet rule to a list.
any suggestions would be appreciated.

2

Answers


  1. I am not that good with xml, but one thing you can use is regex:

    import re
    reg = re.compile(r'(https:.*kitchen-cabinet.*)(?=<)')
    reg.findall(xml)
    
    >> ['https://orinab.com/kitchen-cabinet/%DA%A9%D8%A7%D8%A8%DB%8C%D9%86%D8%AA-%D8%A2%D8%A8%DA%86%DA%A9%D8%A7%D9%86-%D9%81%D9%84%D8%B2%DB%8C-%D8%B1%D9%86%DA%AF-%DA%A9%D8%A7%D8%AC']
    
    # xml variable:
    xml = '''
      <url>
        <loc>https://orinab.com/kitchen-cabinet/%DA%A9%D8%A7%D8%A8%DB%8C%D9%86%D8%AA-%D8%A2%D8%A8%DA%86%DA%A9%D8%A7%D9%86-%D9%81%D9%84%D8%B2%DB%8C-%D8%B1%D9%86%DA%AF-%DA%A9%D8%A7%D8%AC</loc>
        <changefreq>daily</changefreq>
        <priority>0.8</priority>
      </url>
      <url>
        <loc>https://orinab.com/kitchen-cabinet/%DA%A9%D8%A7%D8%A8%DB%8C%D9%86%D8%AA-%D8%A2%D8%A8%DA%86%DA%A9%D8%A7%D9%86-%D9%81%D9%84%D8%B2%DB%8C-%D8%B1%D9%86%DA%AF-%DA%A9%D8%A7%D8%AC</loc>
        <changefreq>daily</changefreq>
        <priority>0.8</priority>
      </url>
      <url>
        <loc>https://orinab.com/kitchen-cabinet/%DA%A9%D8%A7%D8%A8%DB%8C%D9%86%D8%AA-%D8%A2%D8%A8%DA%86%DA%A9%D8%A7%D9%86-%D9%81%D9%84%D8%B2%DB%8C-%D8%B1%D9%86%DA%AF-%DA%A9%D8%A7%D8%AC</loc>
        <changefreq>daily</changefreq>
        <priority>0.8</priority>
      </url>
    '''
    reg.findall(xml)
    >>> ['https://orinab.com/kitchen-cabinet/%DA%A9%D8%A7%D8%A8%DB%8C%D9%86%D8%AA-%D8%A2%D8%A8%DA%86%DA%A9%D8%A7%D9%86-%D9%81%D9%84%D8%B2%DB%8C-%D8%B1%D9%86%DA%AF-%DA%A9%D8%A7%D8%AC',
     'https://orinab.com/kitchen-cabinet/%DA%A9%D8%A7%D8%A8%DB%8C%D9%86%D8%AA-%D8%A2%D8%A8%DA%86%DA%A9%D8%A7%D9%86-%D9%81%D9%84%D8%B2%DB%8C-%D8%B1%D9%86%DA%AF-%DA%A9%D8%A7%D8%AC',
     'https://orinab.com/kitchen-cabinet/%DA%A9%D8%A7%D8%A8%DB%8C%D9%86%D8%AA-%D8%A2%D8%A8%DA%86%DA%A9%D8%A7%D9%86-%D9%81%D9%84%D8%B2%DB%8C-%D8%B1%D9%86%DA%AF-%DA%A9%D8%A7%D8%AC']
    
    

    Edit:

    with open('file.xml', 'r') as f:
        trim = reg.findall(f.read())
        print(trim)
    
    
    Login or Signup to reply.
  2. The Python-docs have a good introduction to using xml.etree.ElementTree for parsing XML-files.

    In general, you must use

    import xml.etree.ElementTree as ET
    
    tree = ET.parse(<your_filename_here>)
    root = tree.getroot()
    

    to obtain a representation of your XML-file.

    In your case, root will be the element urlset, and the next level of nesting are the urls. What complicates it a little bit is the fact that your XML introduces a namespace, but as per the instruction this is solved with defining a dictionary containing namespaces.

    So if you do

    ns = {'ns1': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
    kitchencabinets = []
    for child in root:
        url = child.find('ns1:loc', ns).text
        if 'kitchen-kabinet' in url: kitchencabinets.append(url)
    

    then kitchencabinets be a list the list you are after.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search