
I want to construct a DataFrame by taking data from each page of an API (100 rows per page limit). Currently the code below returns all the data, but it is structured incorrectly.

There are 17 headers, so I need the data in 17 columns. However, the code outputs a DataFrame of [100 rows x 1700 columns], where I need [10000 rows x 17 columns].

I’m unsure how to go about achieving this; any help would be greatly appreciated.

from ebaysdk.finding import Connection as finding
from bs4 import BeautifulSoup
import pandas as pd

x = []

for i in range(1,101):
    print(type(i))
    api = finding(siteid='EBAY-GB',appid='some_id',config_file=None)

    response = api.execute('findItemsByKeywords', {'keywords': 'phone', 'outputSelector' : 'SellerInfo',
    'paginationInput': {'entriesPerPage': '2','pageNumber': ' '+str(i)}})    

    soup = BeautifulSoup(response.content, 'lxml')

    items = soup.find_all('item')

    headers = ['itemid','title','categoryname','categoryid','postalcode','location','sellerusername','feedbackscore','positivefeedbackpercent','topratedseller','shippingservicecost','buyitnowavailable','currentprice','starttime','endtime','watchcount','conditionid']

    for object in headers:
        values = [element.text for element in soup.find_all(object)]
        x.append(values)
        df = pd.DataFrame(x)
        df = df.T
    print(x)
#[['152668959069', '252999725410'], ['Samsung GALAXY Ace GT-S5830i (Unlocked) Smartphone Android Phone- ALL COLOURS UK', '8GB 3G Unlocked Android 5.1 Quad Core Smartphone Mobile Phone 2 SIM GPS qHD'], ['Mobile & Smart Phones', 'Mobile & Smart Phones'], ['9355', '9355'], ['RM137PP'], ['Rainham,United Kingdom', 'United Kingdom'], ['deals4u_shop', 'smartlife2017'], ['15700', '456'], ['99.9', '98.5'], ['true', 'true'], ['0.0', '0.0'], ['false', 'false'], ['32.49', '48.9'], ['2017-08-18T18:36:28.000Z', '2017-06-19T09:04:40.000Z'], ['2017-12-16T18:36:28.000Z', '2017-12-16T09:04:40.000Z'], ['272', '134'], ['1000', '1000']]

    print(df)
             0                                                  1   
0  152668959069  Samsung GALAXY Ace GT-S5830i (Unlocked) Smartp...   
1  252999725410  8GB 3G Unlocked Android 5.1 Quad Core Smartpho...   

                      2     3        4                       5   
0  Mobile & Smart Phones  9355  RM137PP  Rainham,United Kingdom   
1  Mobile & Smart Phones  9355     None          United Kingdom   

              6      7     8     9   ...    24    25    26   27     28    29  
0   deals4u_shop  15700  99.9  true  ...   456  98.5  true  0.0  false  48.9   

1  smartlife2017    456  98.5  true  ...   456  98.5  true  0.0  false  48.9   

                         30                        31   32    33  
0  2017-06-19T09:04:40.000Z  2017-12-16T09:04:40.000Z  214  1000  
1  2017-06-19T09:04:40.000Z  2017-12-16T09:04:40.000Z  182  1000  

Edit: added more code, printed x for the first 2 entries from the first page, and df for the first 2 entries from 2 pages.
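To see why the shape comes out as [pages × (17·pages)] instead of [rows × 17], here is a toy reproduction with fake page data (no API or parsing needed, 3 headers standing in for the 17 real ones):

```python
import pandas as pd

headers = ['itemid', 'title', 'currentprice']
pages = [
    {'itemid': ['1', '2'], 'title': ['a', 'b'], 'currentprice': ['10', '20']},  # page 1
    {'itemid': ['3', '4'], 'title': ['c', 'd'], 'currentprice': ['30', '40']},  # page 2
]

# what the question's loop does: append one list per header *per page*
x = []
for page in pages:
    for h in headers:
        x.append(page[h])

wide = pd.DataFrame(x).T
print(wide.shape)  # rows stay at rows-per-page, columns multiply with each page

# what is wanted: one DataFrame per page, then row-wise concatenation
long = pd.concat([pd.DataFrame(page) for page in pages], ignore_index=True)
print(long.shape)  # rows accumulate across pages, columns stay fixed at len(headers)
```

With 2 pages of 2 rows and 3 headers this prints `(2, 6)` and then `(4, 3)`, mirroring the [100 x 1700] vs. [10000 x 17] shapes in the question.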


Answers


  1. Chosen as BEST ANSWER
    from ebaysdk.finding import Connection as finding
    from bs4 import BeautifulSoup
    import pandas as pd
    
    def flatten(lst):
       for x in lst:
          if isinstance(x, list):
             for y in flatten(x):
                yield y           
          else:
                yield x
    
    full_dict = {}
    result = {}
    
    for i in range(1,101):
        print(i)

        api = finding(siteid='EBAY-GB',appid='some key',config_file=None)
        response = api.execute('findItemsByKeywords', {'keywords': 'phone', 'outputSelector' : 'SellerInfo',
                               'paginationInput': {'entriesPerPage': '100','pageNumber': ' '+str(i)}})
    
        soup = BeautifulSoup(response.content, 'lxml')
    
        items = soup.find_all('item')
    
        headers_tuple = ('itemid','title','categoryname','categoryid','postalcode','location','sellerusername','feedbackscore','positivefeedbackpercent','topratedseller','shippingservicecost','buyitnowavailable','currentprice','starttime','endtime','watchcount','conditionid')
    
        data_dict = {}
    
        for obj in headers_tuple:
            x = [element.text for element in soup.find_all(obj)]
            data_dict[obj] = x
        for key in (data_dict.keys() | full_dict.keys()):
            if key in data_dict: result.setdefault(key, []).append(data_dict[key])
            if key in full_dict: result.setdefault(key, []).append(full_dict[key])
    
    final_dict = {k: list(flatten(v)) for k, v in result.items()}
    df = pd.DataFrame.from_dict(final_dict, orient='index')
    df = df.T
    

    This is the answer I came to if anyone's interested. It works but the column order gets changed for some reason and I'm not sure why. Thanks for all your help!
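The column shuffle most likely comes from `data_dict.keys() | full_dict.keys()`: the `|` operator produces a *set*, and set iteration order is arbitrary, so the keys of `result` (and hence the columns) come out in a scrambled order. One fix is to reindex the columns against `headers_tuple` at the end; a minimal sketch with a shortened header tuple and a hand-scrambled dict standing in for `final_dict`:

```python
import pandas as pd

headers_tuple = ('itemid', 'title', 'currentprice')  # shortened from the 17 real headers

# stand-in for final_dict after set iteration scrambled the key order
final_dict = {'currentprice': ['32.49'],
              'itemid': ['152668959069'],
              'title': ['Samsung GALAXY Ace']}

df = pd.DataFrame.from_dict(final_dict, orient='index').T

# reindex the columns back to the intended header order
df = df[list(headers_tuple)]
print(list(df.columns))  # ['itemid', 'title', 'currentprice']
```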


  2. This should work better.

    Dictionary comprehension version:

    data_dict = {obj: [element.text for element in soup.find_all(obj)] for obj in headers}    
    df = pd.DataFrame(data_dict)
    

    Loop version:

    data_dict = {}
    for obj in headers:
        data_dict[obj] = [element.text for element in soup.find_all(obj)]
    
    df = pd.DataFrame(data_dict)
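To see why the dict-of-lists orientation comes out right without a transpose, here is a self-contained run on a toy two-item response (the built-in `html.parser` stands in for `lxml`, and the tag names are lowercase as BeautifulSoup normalizes them):

```python
from bs4 import BeautifulSoup
import pandas as pd

# toy stand-in for one page of the API response
xml = """
<item><itemid>152668959069</itemid><title>phone a</title></item>
<item><itemid>252999725410</itemid><title>phone b</title></item>
"""
soup = BeautifulSoup(xml, 'html.parser')
headers = ['itemid', 'title']

# each key becomes one column; each item's text becomes one row in that column
data_dict = {obj: [element.text for element in soup.find_all(obj)] for obj in headers}
df = pd.DataFrame(data_dict)
print(df.shape)  # one row per <item>, one column per header
```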
    
  3. Consider iteratively appending to a list of dataframes with final concatenation:

    ...
    df_list = []
    api = finding(siteid='EBAY-GB',appid='some_id',config_file=None)
    
    for i in range(1,101):
        print(i)
        response = api.execute('findItemsByKeywords', 
                               {'keywords': 'phone',
                                'outputSelector' : 'SellerInfo',
                                'paginationInput': {'entriesPerPage': '2',
                                                    'pageNumber': ' '+str(i)}})    
    
        soup = BeautifulSoup(response.content, 'lxml')
    
        headers = ['itemid','title','categoryname','categoryid','postalcode','location',
                   'sellerusername','feedbackscore','positivefeedbackpercent','topratedseller',
                   'shippingservicecost','buyitnowavailable','currentprice','starttime',
                   'endtime','watchcount','conditionid']
    
        # NESTED LIST COMPREHENSION: ONE LIST OF ELEMENT TEXTS PER HEADER
        values = [[element.text for element in soup.find_all(obj)] for obj in headers]
    
        # DICT COMPREHENSION WITH ZIP TO DF THAT NAMES EACH COLUMN WITH VALUE & FILLS MISSING
        tmp = pd.DataFrame({h:v if len(v) > 1 else v+[None] for h,v in zip(headers, values)})
    
        # APPENDS TO LIST
        df_list.append(tmp)
    
    # ROW BINDS TO FINAL DF
    final_df = pd.concat(df_list, ignore_index=True)
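Note the `len(v) > 1` padding in the dict comprehension only covers the `entriesPerPage: '2'` case in this snippet (a single missing value gets one `None` appended). Isolated, the trick looks like this, using the `postalcode` gap visible in the question's output:

```python
import pandas as pd

headers = ['itemid', 'postalcode']
# parsed values for two items, but only the first item has a <postalcode>
values = [['152668959069', '252999725410'], ['RM137PP']]

# pad a short column with None so all columns have equal length
tmp = pd.DataFrame({h: v if len(v) > 1 else v + [None]
                    for h, v in zip(headers, values)})
print(tmp)
```

For larger page sizes, padding each list to the length of the longest one (e.g. `v + [None] * (max_len - len(v))`) generalizes the same idea.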
    