
I have a DataFrame with a list of URLs, and I want to extract a couple of values from each one. The returned key/value pairs should then be added to the original DataFrame, with the keys as new columns holding the respective values.

I thought this would happen automatically with result_type='expand', but it obviously doesn't. When I try

df5["data"] = df5.apply(lambda x: request_function(x['url']),axis=1, result_type='expand')

I end up with my results all in one data column:

[{'title': ['Python Notebooks: Connect to Google Search Console API and Extract Data - Adapt'], 'description': []}]

The result I am aiming for is a Dataframe with the following 3 columns:

| URL | Title | Description |

Here is my code:

import requests
from requests_html import HTMLSession
import pandas as pd
from urllib import parse

ex_dic = {'url': ['https://www.searchenginejournal.com/reorganizing-xml-sitemaps-python/295539/', 'https://searchengineland.com/check-urls-indexed-google-using-python-259773', 'https://adaptpartners.com/technical-seo/python-notebooks-connect-to-google-search-console-api-and-extract-data/']}

df5 = pd.DataFrame(ex_dic)
df5

# The session has to exist before request_function can use it
session = HTMLSession()

def request_function(url):
    try:
        found_results = []
        r = session.get(url)
        title = r.html.xpath('//title/text()')
        description = r.html.xpath("//meta[@name='description']/@content")
        found_results.append({'title': title, 'description': description})
        return found_results
    except requests.RequestException:
        print("Connectivity error")
    except KeyError:
        print("another error")

df5.apply(lambda x: request_function(x['url']),axis=1, result_type='expand')

2 Answers


  1. Each entry of ex_dic['url'] should be a dict, so that the function can update that dict with the extracted values and return it.

    import requests
    from requests_html import HTMLSession
    import pandas as pd
    from urllib import parse
    
    ex_dic = {'url': ['https://www.searchenginejournal.com/reorganizing-xml-sitemaps-python/295539/', 'https://searchengineland.com/check-urls-indexed-google-using-python-259773', 'https://adaptpartners.com/technical-seo/python-notebooks-connect-to-google-search-console-api-and-extract-data/']}
    
    ex_dic['url'] = [{'url': item} for item in ex_dic['url']]
    
    df5 = pd.DataFrame(ex_dic)
    session = HTMLSession()
    
    def request_function(url):
        try:
            print(url)
            r = session.get(url['url'])
            title = r.html.xpath('//title/text()')
            description = r.html.xpath("//meta[@name='description']/@content")
            url.update({ 'title': title, 'description': description})
            return url
    
    
        except requests.RequestException:
            print("Connectivity error")
        except KeyError:
            print("another error")
    
    df6 = df5.apply(lambda x: request_function(x['url']), axis=1, result_type='expand')
    print(df6)
    
  2. It works as you expect if your function returns a single dictionary rather than a list of dictionaries. In addition, each value in that dictionary should be a string, not a list. See my example code:

    import requests
    import pandas as pd
    from urllib import parse
    
    ex_dic = {'url': ['https://www.searchenginejournal.com/reorganizing-xml-sitemaps-python/295539/', 'https://searchengineland.com/check-urls-indexed-google-using-python-259773', 'https://adaptpartners.com/technical-seo/python-notebooks-connect-to-google-search-console-api-and-extract-data/']}
    
    df5 = pd.DataFrame(ex_dic)
    # print(df5)
    
    def request_function(url):
        return {'title': 'Python Notebooks: Connect to Google Search Console API and Extract Data - Adapt', 
                'description': ''}
    
    
    df6 = df5.apply(lambda x: request_function(x['url']), axis=1, result_type='expand')
    df7 = pd.concat([df5, df6], axis=1)
    
    
    df7
    

    This gives you a DataFrame with the columns url, title and description.

    You can also just adjust your lambda function:

    df6 = df5.apply(lambda x: request_function(x['url'])[0], axis=1, result_type='expand')
    

    But you still need to ensure that the values stored under each key are strings, not lists.
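
To make the point about strings versus lists concrete, here is a minimal sketch that runs without network access. The helper first_or_empty and the function extract_fields are hypothetical names, not part of the question's code, and the title and description lists are made-up stand-ins for what r.html.xpath(...) returns. The only idea shown is collapsing each list to a single string before returning one flat dict.

```python
import pandas as pd


def first_or_empty(values):
    """Hypothetical helper: reduce an XPath-style result list to one string."""
    return values[0] if values else ''


def extract_fields(url):
    # Stand-in for the real HTTP request. These lists only mimic the shape
    # of the XPath results from the question; the values are assumptions.
    title = ['Example title for ' + url]
    description = []
    # Return one flat dict of strings so result_type='expand' makes columns.
    return {'title': first_or_empty(title),
            'description': first_or_empty(description)}


df = pd.DataFrame({'url': ['https://example.com/a', 'https://example.com/b']})

# Each returned dict becomes one row; its keys become the column names.
expanded = df.apply(lambda row: extract_fields(row['url']),
                    axis=1, result_type='expand')

# Attach the new columns to the original frame by index.
result = df.join(expanded)
print(result.columns.tolist())  # ['url', 'title', 'description']
```

With real pages, the same helper could wrap the two xpath calls, so that a missing meta description ends up as an empty string instead of an empty list.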
