
Here is my code:

BLOCK 1

import requests
import pandas as pd

url = ('http://www.omdbapi.com/' '?apikey=ff21610b&t=social+network')
r = requests.get(url)
json_data = r.json()
# from app
print(json_data['Awards'])
json_dict = dict(json_data)
tab=""
# printing all data as Dictionary
print("JSON as Dictionary (all):\n")
for k,v in json_dict.items():
  if len(k) > 6:
    tab = "\t"
  else:
    tab = "\t\t"
  print(str(k) + ":" + tab + str(v))
df = pd.DataFrame(json_dict)
df.drop_duplicates(inplace=True)
# printing Pandas DataFrame of all data
print("JSON as DataFrame (all):\n{}".format(df))

I was just testing out an example question on DataCamp. Then I went off exploring different things. The question stops at print(json_data['Awards']). I went further and was testing converting the JSON file to a dictionary and creating a pandas DataFrame of it. Interestingly, my output is as follows:

Won 3 Oscars. Another 165 wins & 168 nominations.
JSON as Dictionary (all):

Title:      The Social Network
Year:       2010
Rated:      PG-13
Released:   01 Oct 2010
Runtime:    120 min
Genre:      Biography, Drama
Director:   David Fincher
Writer:     Aaron Sorkin (screenplay), Ben Mezrich (book)
Actors:     Jesse Eisenberg, Rooney Mara, Bryan Barter, Dustin Fitzsimons
Plot:       Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, but is later sued by two brothers who claimed he stole their idea, and the co-founder who was later squeezed out of the business.
Language:   English, French
Country:    USA
Awards:     Won 3 Oscars. Another 165 wins & 168 nominations.
Poster:     https://m.media-amazon.com/images/M/MV5BMTM2ODk0NDAwMF5BMl5BanBnXkFtZTcwNTM1MDc2Mw@@._V1_SX300.jpg
Ratings:    [{'Source': 'Internet Movie Database', 'Value': '7.7/10'}, {'Source': 'Rotten Tomatoes', 'Value': '96%'}, {'Source': 'Metacritic', 'Value': '95/100'}]
Metascore:  95
imdbRating: 7.7
imdbVotes:  542,658
imdbID:     tt1285016
Type:       movie
DVD:        11 Jan 2011
BoxOffice:  $96,400,000
Production: Columbia Pictures
Website:    http://www.thesocialnetwork-movie.com/
Response:   True
Traceback (most recent call last):
  File "C:\Users\rschosta\OneDrive - Incitec Pivot Limited\Documents\Data Science\omdb-api-test.py", line 20, in <module>
    df.drop_duplicates(inplace=True)
  File "C:\Users\rschosta\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 3535, in drop_duplicates
    duplicated = self.duplicated(subset, keep=keep)
  File "C:\Users\rschosta\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 3582, in duplicated
    labels, shape = map(list, zip(*map(f, vals)))
  File "C:\Users\rschosta\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 3570, in f
    vals, size_hint=min(len(self), _SIZE_HINT_LIMIT))
  File "C:\Users\rschosta\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\algorithms.py", line 471, in factorize
    labels = table.get_labels(values, uniques, 0, na_sentinel, check_nulls)
  File "pandas/_libs/hashtable_class_helper.pxi", line 1367, in pandas._libs.hashtable.PyObjectHashTable.get_labels
TypeError: unhashable type: 'dict'

I did some research on .drop_duplicates(), as I’ve used it before and it worked just fine. Here is an example where it works:

BLOCK 2

import pandas as pd
import numpy as np

#Create a DataFrame
d = {
    'Name':['Alisa','Bobby','jodha','jack','raghu','Cathrine',
            'Alisa','Bobby','kumar','Alisa','Alex','Cathrine'],
    'Age':[26,24,23,22,23,24,26,24,22,23,24,24],

    'Score':[85,63,55,74,31,77,85,63,42,62,89,77]}

df = pd.DataFrame(d,columns=['Name','Age','Score'])
print(df)
df.drop_duplicates(keep=False, inplace=True)
print(df)

Notice the two blocks of code have some differences. I also tried importing numpy as np in my first script, and it didn’t change the results.

Any ideas on how to make the drop_duplicates() method work on BLOCK 1?

OUTPUT BLOCK 1 – A

Per request of @Wen, here is the data as a dictionary:

{'Title': 'The Social Network', 'Year': '2010', 'Rated': 'PG-13', 'Released': '01 Oct 2010', 'Runtime': '120 min', 'Genre': 'Biography, Drama', 'Director': 'David Fincher', 'Writer': 'Aaron Sorkin (screenplay), Ben Mezrich (book)', 'Actors': 'Jesse Eisenberg, Rooney Mara, Bryan Barter, Dustin Fitzsimons', 'Plot': 'Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, but is later sued by two brothers who claimed he stole their idea, and the co-founder who was later squeezed out of the business.', 'Language': 'English, French', 'Country': 'USA', 'Awards': 'Won 3 Oscars. Another 165 wins & 168 nominations.', 'Poster': 'https://m.media-amazon.com/images/M/MV5BMTM2ODk0NDAwMF5BMl5BanBnXkFtZTcwNTM1MDc2Mw@@._V1_SX300.jpg', 'Ratings': [{'Source': 'Internet Movie Database', 'Value': '7.7/10'}, {'Source': 'Rotten Tomatoes', 'Value': '96%'}, {'Source': 'Metacritic', 'Value': '95/100'}], 'Metascore': '95', 'imdbRating': '7.7', 'imdbVotes': '542,658', 'imdbID': 'tt1285016', 'Type': 'movie', 'DVD': '11 Jan 2011', 'BoxOffice': '$96,400,000', 'Production': 'Columbia Pictures', 'Website': 'http://www.thesocialnetwork-movie.com/', 'Response': 'True'}

For now I am not calling the .drop_duplicates() method, because I am working on converting the Ratings dictionaries into columns before I remove duplicates. Here is the tabular list I printed of the dictionary, which is a bit easier to read:

Title:      The Social Network
Year:       2010
Rated:      PG-13
Released:   01 Oct 2010
Runtime:    120 min
Genre:      Biography, Drama
Director:   David Fincher
Writer:     Aaron Sorkin (screenplay), Ben Mezrich (book)
Actors:     Jesse Eisenberg, Rooney Mara, Bryan Barter, Dustin Fitzsimons
Plot:       Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, but is later sued by two brothers who claimed he stole their idea, and the co-founder who was later squeezed out of the business.
Language:   English, French
Country:    USA
Awards:     Won 3 Oscars. Another 165 wins & 168 nominations.
Poster:     https://m.media-amazon.com/images/M/MV5BMTM2ODk0NDAwMF5BMl5BanBnXkFtZTcwNTM1MDc2Mw@@._V1_SX300.jpg
Ratings:    [{'Source': 'Internet Movie Database', 'Value': '7.7/10'}, {'Source': 'Rotten Tomatoes', 'Value': '96%'}, {'Source': 'Metacritic', 'Value': '95/100'}]
Metascore:  95
imdbRating: 7.7
imdbVotes:  542,658
imdbID:     tt1285016
Type:       movie
DVD:        11 Jan 2011
BoxOffice:  $96,400,000
Production: Columbia Pictures
Website:    http://www.thesocialnetwork-movie.com/
Response:   True
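
One possible way to turn the Ratings dictionaries into columns is sketched below. It assumes a trimmed, hard-coded copy of the dictionary above (only a few keys kept) instead of a live API call, and the "Rating: ..." column names are my own invention for illustration. Once every cell is a plain string, drop_duplicates() has nothing unhashable to trip over.

```python
import pandas as pd

# Trimmed, hard-coded copy of the API response shown above (assumption:
# only a few keys are kept to keep the example short).
json_dict = {
    'Title': 'The Social Network',
    'Year': '2010',
    'Ratings': [
        {'Source': 'Internet Movie Database', 'Value': '7.7/10'},
        {'Source': 'Rotten Tomatoes', 'Value': '96%'},
        {'Source': 'Metacritic', 'Value': '95/100'},
    ],
}

# Copy every scalar field into a single row.
row = {k: v for k, v in json_dict.items() if k != 'Ratings'}

# Give each rating its own column, named after its Source
# (the 'Rating: ' prefix is a hypothetical naming choice).
for rating in json_dict['Ratings']:
    row['Rating: ' + rating['Source']] = rating['Value']

df = pd.DataFrame([row])

# All cells are now strings, so this no longer raises TypeError.
df = df.drop_duplicates()
print(df.T)
```

This gives one row per movie rather than one row per rating, which may or may not be the shape you want.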

2 Answers


  1. You have a Ratings column which is filled with dictionaries. So you can’t use drop_duplicates because dicts are mutable and not hashable.

    As a solution, you can transform these values to be a frozenset of the tuples, and then use drop_duplicates.

    df['Ratings'] = df.Ratings.transform(lambda k: frozenset(k.items()))
    df = df.drop_duplicates()
    

    Or choose only the columns you want to use as a reference. For example, if you want to remove duplicates based only on year and title, you can do something like

    ref_cols = ['Title', 'Year']
    df = df.loc[~df[ref_cols].duplicated()]
    
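A minimal, self-contained sketch of both suggestions, assuming a trimmed, hard-coded copy of the question's dictionary rather than a live API call. The names hashable, deduped_all and deduped_subset are only for illustration. Note that pd.DataFrame(json_dict) produces three rows here, one per entry in the Ratings list, with the scalar fields repeated on each row.

```python
import pandas as pd

# Trimmed copy of the dictionary from the question (assumption: fewer keys).
json_dict = {
    'Title': 'The Social Network',
    'Year': '2010',
    'Ratings': [
        {'Source': 'Internet Movie Database', 'Value': '7.7/10'},
        {'Source': 'Rotten Tomatoes', 'Value': '96%'},
        {'Source': 'Metacritic', 'Value': '95/100'},
    ],
}

# Scalars are broadcast across the rows; each Ratings cell holds one dict.
df = pd.DataFrame(json_dict)

# Option 1: swap each dict for a hashable frozenset of its (key, value) pairs.
hashable = df.copy()
hashable['Ratings'] = hashable['Ratings'].apply(lambda d: frozenset(d.items()))
deduped_all = hashable.drop_duplicates()

# Option 2: judge duplicates only on hashable reference columns.
ref_cols = ['Title', 'Year']
deduped_subset = df.loc[~df[ref_cols].duplicated()]

print(len(df), len(deduped_all), len(deduped_subset))
```

With option 1 all three rows survive, because their Ratings differ. With option 2 only the first row survives, and its Ratings cell is still the original dict.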
  2. Object columns that hold dicts or lists usually cause this problem. One way around it is to convert the dict or list to str:

    df['Ratings1'] = df.Ratings.astype(str)
    df = df.drop_duplicates(subset=df.columns.difference(['Ratings'])).drop(columns='Ratings1')
    
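A runnable sketch of the string-conversion idea, again assuming a trimmed, hard-coded copy of the question's dictionary. One duplicate row is appended on purpose so that the effect of drop_duplicates() is visible; that extra row is not part of the original data.

```python
import pandas as pd

# Trimmed copy of the dictionary from the question (assumption: fewer keys).
json_dict = {
    'Title': 'The Social Network',
    'Year': '2010',
    'Ratings': [
        {'Source': 'Internet Movie Database', 'Value': '7.7/10'},
        {'Source': 'Rotten Tomatoes', 'Value': '96%'},
        {'Source': 'Metacritic', 'Value': '95/100'},
    ],
}
df = pd.DataFrame(json_dict)

# Append a copy of the first row so there is something to remove
# (illustration only, not in the original data).
df = pd.concat([df, df.iloc[[0]]], ignore_index=True)

# Helper column: the string form of each dict is hashable.
df['Ratings1'] = df['Ratings'].astype(str)

# Compare rows on every column except the unhashable original,
# then discard the helper column.
subset = list(df.columns.difference(['Ratings']))
result = df.drop_duplicates(subset=subset).drop(columns='Ratings1')

print(result)
```

The Ratings column in result still holds the original dicts, since only the helper column was converted to strings.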