skip to Main Content

I have the following Python code to check if a MariaDB record exists already, and then inserting. However, I am having duplicates being inserted. Is there something wrong with the code, or is there a better way to do it? I’m new to using Python-MariaDB.

import mysql.connector as mariadb
from hashlib import sha1

mariadb_connection = mariadb.connect(user='root', password='', database='tweets_db')

# The values below are retrieved from Twitter API using Tweepy
# For simplicity, I've provided some sample values
id = '1a23bas'
tweet = 'Clear skies'
longitude = -84.361549
latitude = 34.022003
created_at = '2017-09-27'
collected_at = '2017-09-27'
collection_type = 'stream'
lang = 'us-en'
place_name = 'Roswell'
country_code = 'USA'
cronjob_tag = 'None'
user_id = '23abask'
user_name = 'tsoukalos'
user_geoenabled = 0
user_lang = 'us-en'
user_location = 'Roswell'
user_timezone = 'American/Eastern'
user_verified = 1
tweet_hash = sha1(tweet).hexdigest()

cursor = mariadb_connection.cursor(buffered=True)
cursor.execute("SELECT Count(id) FROM tweets WHERE tweet_hash = %s", (tweet_hash,))
if cursor.fetchone()[0] == 0:
    cursor.execute("INSERT INTO tweets(id,tweet,tweet_hash,longitude,latitude,created_at,collected_at,collection_type,lang,place_name,country_code,cronjob_tag,user_id,user_name,user_geoenabled,user_lang,user_location,user_timezone,user_verified) VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)", (id,tweet,tweet_hash,longitude,latitude,created_at,collected_at,collection_type,lang,place_name,country_code,cronjob_tag,user_id,user_name,user_geoenabled,user_lang,user_location,user_timezone,user_verified))
    mariadb_connection.commit()
    cursor.close()
else:
    cursor.close()
    return

Below is the code for the table.

CREATE TABLE tweets (
  id VARCHAR(255) NOT NULL,
  tweet VARCHAR(255) NOT NULL,
  tweet_hash VARCHAR(255) DEFAULT NULL,
  longitude FLOAT DEFAULT NULL,
  latitude FLOAT DEFAULT NULL,
  created_at DATETIME DEFAULT NULL,
  collected_at DATETIME DEFAULT NULL,
  collection_type enum('stream','search') DEFAULT NULL,
  lang VARCHAR(10) DEFAULT NULL,
  place_name VARCHAR(255) DEFAULT NULL,
  country_code VARCHAR(5) DEFAULT NULL,
  cronjob_tag VARCHAR(255) DEFAULT NULL,
  user_id VARCHAR(255) DEFAULT NULL,
  user_name VARCHAR(20) DEFAULT NULL,
  user_geoenabled TINYINT(1) DEFAULT NULL,
  user_lang VARCHAR(10) DEFAULT NULL,
  user_location VARCHAR(255) DEFAULT NULL,
  user_timezone VARCHAR(100) DEFAULT NULL,
  user_verified TINYINT(1) DEFAULT NULL
);

2

Answers


  1. Add unique constant to tweet_has filed.

    alter table tweets  modify tweet_hash varchar(255) UNIQUE ;
    
    Login or Signup to reply.
  2. Every table should have a PRIMARY KEY. Is id supposed to be that? (The CREATE TABLE is not saying so.) A PK is, by definition, UNIQUE, so that would cause an error on inserting a duplicate.

    Meanwhile:

    • Why have a tweet_hash? Simply index tweet.
    • Don’t say 255 when there are specific limits smaller than that.
    • user_id and user_name should be in another “lookup” table, not both in this table.
    • Does user_verified belong with the user? Or with each tweet?
    • If you are expecting millions of tweets, this table needs to be made smaller and indexed — else you will run into performance problems.
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search