
I am parsing a URL with feedparser and trying to get all columns, but the output is missing data for a few columns. If you run the code below, a few columns come back empty even though the data exists in the feed (you can verify it in a browser).

My code:

import feedparser
import pandas as pd 

xmldoc = feedparser.parse('http://www.ebay.com/rps/feed/v1.1/epnexcluded/EBAY-US')
df_cols = [
    "title", "url", "endsAt", "image225", "currency",
    "price", "orginalPrice", "discountPercentage", "quantity", "shippingCost", "dealUrl"
]
rows = []

for entry in xmldoc.entries:
    s_title = entry.get("title","")
    s_url = entry.get("url", "")
    s_endsAt = entry.get("endsAt", "")
    s_image225 = entry.get("image225", "")
    s_currency = entry.get("currency", "")
    s_price = entry.get("price","")
    s_orginalPrice = entry.get("orginalPrice","")
    s_discountPercentage = entry.get ("discountPercentage","")
    s_quantity = entry.get("quantity","")
    s_shippingCost = entry.get("shippingCost", "")
    s_dealUrl = entry.get("dealUrl", "")#.replace('YOURUSERIDHERE','2427312')
       
        
    rows.append({"title":s_title, "url": s_url, "endsAt": s_endsAt, 
                 "image225": s_image225,"currency": s_currency,"price":s_price,
                 "orginalPrice": s_orginalPrice,"discountPercentage": s_discountPercentage,"quantity": s_quantity,
                 "shippingCost": s_shippingCost,"dealUrl": s_dealUrl})

out_df = pd.DataFrame(rows, columns=df_cols)

out_df
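A pitfall worth ruling out with hand-written column lists such as df_cols: Python silently concatenates adjacent string literals, so a missing comma merges two column names into one, and that merged name matches no key in the row dicts, producing an empty column. A minimal demonstration:

```python
import pandas as pd

# Python silently concatenates adjacent string literals, so a missing comma
# turns two intended column names into one merged name that matches nothing
# in the row dicts, leaving that column entirely NaN.
rows = [{"title": "t", "currency": "USD", "price": "9.99"}]

buggy_cols = ["title", "currency" "price"]   # missing comma -> "currencyprice"
fixed_cols = ["title", "currency", "price"]

df_buggy = pd.DataFrame(rows, columns=buggy_cols)  # 'currencyprice' column is all NaN
df_fixed = pd.DataFrame(rows, columns=fixed_cols)
```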

I also tried this, but it doesn't give me any data, only a few columns (just the headers, I suppose):

import lxml.etree as ET
import urllib.request
import pandas as pd

response = urllib.request.urlopen('http://www.ebay.com/rps/feed/v1.1/epnexcluded/EBAY-US')
xml = response.read()

root = ET.fromstring(xml)
for item in root.findall('.*/item'):
       
    df = pd.DataFrame([{item.tag: item.text if item.text.strip() != "" else item.find("*").text
                       for item in lnk.findall("*") if item is not None} 
                       for lnk in root.findall('.//item')])
                       
df

Is it possible to iterate the URL offset within a list as below and collect the output into a pandas DataFrame? When I try this it works partially, but a few elements are missing, which raises AttributeError: object has no attribute 'price' (likewise shippingCost, etc.). How do I handle an element that is null?

My code:

import feedparser
import pandas as pd
#from simplified_scrapy import SimplifiedDoc, utils, req

getdeals = ['http://www.ebay.com/rps/feed/v1.1/epnexcluded/EBAY-US?limit=200',
            'http://www.ebay.com/rps/feed/v1.1/epnexcluded/EBAY-US?limit=200&offset=200',
            'http://www.ebay.com/rps/feed/v1.1/epnexcluded/EBAY-US?limit=200&offset=400']

posts = []
for url in getdeals:
    feed = feedparser.parse(url)
    for deals in feed.entries:
        print(deals)
        posts.append((deals.title, deals.endsat, deals.image225, deals.price,
                      deals.originalprice, deals.discountpercentage,
                      deals.shippingcost, deals.dealurl))

df = pd.DataFrame(posts, columns=['title', 'endsat', 'image225', 'price',
                                  'originalprice', 'discountpercentage',
                                  'shippingcost', 'dealurl'])
df.tail()
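A null-safe way to read the entries, sketched under the assumption that missing elements should fall back to None: feedparser entries support dict-style access, so entry.get(key, default) never raises AttributeError the way attribute access such as deals.price does.

```python
import pandas as pd

# Null-safe extraction (a sketch): dict-style entry.get(key, default) returns
# a default for missing elements instead of raising AttributeError. feedparser
# lower-cases element names, so the lower-cased key is tried as a fallback.
FIELDS = ["title", "endsAt", "image225", "price", "originalPrice",
          "discountPercentage", "shippingCost", "dealUrl"]

def entries_to_rows(entries, fields=FIELDS):
    rows = []
    for entry in entries:
        rows.append({f: entry.get(f, entry.get(f.lower())) for f in fields})
    return rows

# Stand-in for feed.entries; with feedparser you would loop the getdeals URLs:
#   for url in getdeals:
#       posts.extend(entries_to_rows(feedparser.parse(url).entries))
sample = [{"title": "Demo deal", "price": "9.99"}]
df = pd.DataFrame(entries_to_rows(sample), columns=FIELDS)
```

Missing fields simply end up as None/NaN in the resulting frame rather than aborting the whole loop.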

Also, similarly, how do I loop over multiple JSON responses?

import requests
import pandas as pd

url = ["https://merchants.apis.com/v4/publisher/159663/offers?country=US&limit=2000",
       "https://merchants.apis.com/v4/publisher/159663/offers?country=US&offset=2001&limit=2000"]

# headers and querystring defined elsewhere
response = requests.request("GET", url, headers=headers, params=querystring)
response = response.json()

name = []
logo = []
date_added = []
description = []
verticals = []
for i in range(len(response['offers'])):
    name.append(response['offers'][i]['merchant_details']['name'])
    logo.append(response['offers'][i]['merchant_details']['metadata']['logo'])
    date_added.append(response['offers'][i]['date_added'])
    description.append(response['offers'][i]['description'])
    try:
        verticals.append(response['offers'][i]['merchant_details']['verticals'][0])
    except IndexError:
        verticals.append('NA')

data1 = pd.DataFrame({'name': name, 'logo': logo, 'verticals': verticals,
                      'date_added': date_added, 'description': description})
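One defensive pattern for nested JSON like the above (a sketch; the field names follow the snippet, and the 'offers'/'merchant_details'/'metadata' structure is assumed from it): use dict.get with defaults so a missing nested field yields None or 'NA' instead of a KeyError or IndexError.

```python
import pandas as pd

# Defensive extraction (a sketch): .get with defaults turns any missing
# nested field into None (or 'NA' for the verticals list) instead of
# raising KeyError/IndexError mid-loop.
def offers_to_frame(payload):
    rows = []
    for offer in payload.get("offers", []):
        merchant = offer.get("merchant_details") or {}
        verticals = merchant.get("verticals") or []
        rows.append({
            "name": merchant.get("name"),
            "logo": (merchant.get("metadata") or {}).get("logo"),
            "verticals": verticals[0] if verticals else "NA",
            "date_added": offer.get("date_added"),
            "description": offer.get("description"),
        })
    return pd.DataFrame(rows)

# Looping several offset URLs would then be (requests, headers, querystring assumed):
#   frames = [offers_to_frame(requests.get(u, headers=headers, params=querystring).json())
#             for u in url]
#   data1 = pd.concat(frames, ignore_index=True)
sample = {"offers": [
    {"merchant_details": {"name": "Acme", "metadata": {"logo": "acme.png"},
                          "verticals": ["Retail"]},
     "date_added": "2020-07-01", "description": "demo offer"},
    {"merchant_details": {"name": "NoMeta"},
     "date_added": "2020-07-02", "description": "missing fields"},
]}
df = offers_to_frame(sample)
```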

2 Answers


  1. Because the eBay XML has a default namespace on the root, you need to map a prefix to this namespace URI in order to select nodes by name. See how the namespace dictionary is used as the second argument of findall, and note that the namespace must be stripped from each retrieved .tag. Also note the opening for loop is not required: the list/dict comprehension below replaces it.

    import lxml.etree as ET 
    import urllib.request
    import pandas as pd
    
    response = urllib.request.urlopen('http://www.ebay.com/rps/feed/v1.1/epnexcluded/EBAY-US')
    xml = response.read()
    
    root = ET.fromstring(xml)
    nmsp = {'doc': 'http://www.ebay.com/marketplace/rps/v1/feed'}
       
    df = pd.DataFrame([{item.tag.replace(f"{{{nmsp['doc']}}}", ''): item.text 
                               if item.text.strip() != "" else item.find("*").text
                       for item in lnk.findall("*") if item is not None} 
                       for lnk in root.findall('.//doc:item', nmsp)])
                           
    

    Output (running exact posted code above)

    df
    #         itemId                                              title  ... shippingCost                                dealUrl
    #0  372639986116  Samsung Galaxy BUDS SM-R170 (Bluetooth 5.0) He...  ...         0.00  https://www.ebay.com/deals/6052526231
    #1  153918933129  Lenovo ThinkPad X1 Carbon Gen7, 14" FHD IPS, i...  ...         0.00  https://www.ebay.com/deals/6052642213
    #2  283899231838  Ray Ban RB4278 628271 51 Black Matte Black Pla...  ...         0.00  https://www.ebay.com/deals/6051914268
    #3  283957227324                  Ghost of Tsushima - PlayStation 4  ...         0.00  https://www.ebay.com/deals/6052642134
    #4  202905303442  Samsung Galaxy S20+ Plus SM-G985F/DS 128GB 8GB...  ...         0.00  https://www.ebay.com/deals/6052752611
    #5  332946625819  DEWALT DCB6092 20V/60V MAX FLEXVOLT 9 Ah Li-Io...  ...         0.00  https://www.ebay.com/deals/6052523001
    #6  264175647395  Citizen Eco-Drive Men's Silver Dial Black Leat...  ...         0.00  https://www.ebay.com/deals/6051783829
    #7  303374676252  Champion Authentic Cotton 9-Inch Men's Shorts ...  ...         0.00  https://www.ebay.com/deals/6051880500
    #8  202940881433   Samsung QN65Q90TAFXZA 65" 4K QLED Smart UHD T...  ...         0.00  https://www.ebay.com/deals/6052527037
    #9  400789484589  Light Blue by Dolce & Gabbana D&G Perfume Wome...  ...         0.00  https://www.ebay.com/deals/6052122816
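    The namespace behavior this answer describes can be seen on a tiny self-contained document (illustrative XML, not the live feed):

    ```python
    import lxml.etree as ET
    import pandas as pd

    # Elements in a default namespace only match when the prefix from the
    # namespace map is used, and each .tag carries a '{uri}' prefix that is
    # stripped to get clean column names.
    xml = b"""<feed xmlns="http://www.ebay.com/marketplace/rps/v1/feed">
      <item><title>demo</title><price>1.00</price></item>
    </feed>"""

    root = ET.fromstring(xml)
    nmsp = {'doc': 'http://www.ebay.com/marketplace/rps/v1/feed'}

    assert root.findall('.//item') == []       # no match without the prefix
    items = root.findall('.//doc:item', nmsp)  # matches with the prefix

    df = pd.DataFrame([{c.tag.replace(f"{{{nmsp['doc']}}}", ''): c.text
                        for c in item} for item in items])
    ```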
    
  2. Another method.

    import pandas as pd
    from simplified_scrapy import SimplifiedDoc, utils, req
    
    getdeals = ['http://www.ebay.com/rps/feed/v1.1/epnexcluded/EBAY-US?limit=200',
                'http://www.ebay.com/rps/feed/v1.1/epnexcluded/EBAY-US?limit=200&offset=200',
                'http://www.ebay.com/rps/feed/v1.1/epnexcluded/EBAY-US?limit=200&offset=400']
        
    posts=[]
    header = ['title','endsAt','image225','price','originalPrice','discountPercentage','shippingCost','dealUrl']
    for url in getdeals:
        try: # It's a good habit to have try and exception in your code.
            feed = SimplifiedDoc(req.get(url))
            for deals in feed.selects('item'):
                row = []
                for h in header: row.append(deals.select(h+">text()")) # Returns None when the element does not exist
                posts.append(row)
        except Exception as e:
            print (e)
            
    df=pd.DataFrame(posts,columns=header)
    df.tail()
    