I’m adding data to a MySQL 8 database on PythonAnywhere. I’m doing this via a Python 3.10 script run in VSCode, using an SSH tunnel. When I add the data it takes approximately 50 seconds to insert 275 rows; the data is 750 KB in JSON format, so the total inserted is probably a bit less than that. There are no error messages.
When I use the same script to add the identical data to a database on my local machine it takes 0.2 seconds. Playing with the batch size between commits reduced the total time from 75 seconds to 50 seconds, but that’s the best I’ve got. It’s for a web scraper; when the full thing is running it could have a few hundred to a few thousand rows to add/remove, so it’s not tenable to leave it like this. With performance this poor I must be doing something wrong, but I can’t see what it is. The timer is wrapped around the insert query; it doesn’t include the SSH connection time etc.
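Roughly, the measurement looks like this (illustrative sketch, not the exact script):

import time

start = time.perf_counter()     # started just before the insert loop
add_listings(cursor, listings)  # only the inserts and commits are inside the timer
elapsed = time.perf_counter() - start
print(f"{len(listings)} rows in {elapsed:.1f} s")  # ~50 s remote, ~0.2 s locally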
What’s going on here?
The table columns are as follows; so far no indexes are being used:
"listingID int PRIMARY KEY AUTO_INCREMENT, types VARCHAR(11), town VARCHAR(255), postcode CHAR(5), price INT UNSIGNED, agent VARCHAR(50), ref VARCHAR(30), bedrooms SMALLINT UNSIGNED, rooms SMALLINT UNSIGNED, plot MEDIUMINT UNSIGNED, size MEDIUMINT UNSIGNED, link_url VARCHAR(1024), description VARCHAR(14000), photos TEXT, photos_hosted TEXT, gps POINT, id VARCHAR(80), types_original VARCHAR(30)"
The functions adding the data are below. As I said, it completes in 0.2 seconds locally, so I don’t think I’m doing anything too bad there, even if it’s not perfect. listings is a list of dictionaries. The weird use of dictionaries for the inserts is because I was using named placeholders with mysql.connector, but I had to switch to MySQLdb due to another problem with accessing the database, and it doesn’t permit named placeholders.
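For illustration, the placeholder difference I mean is roughly this (hypothetical two-column example, not the real table):

# mysql.connector: named placeholders, parameters passed as a dict
cursor.execute(
    "INSERT INTO listings (town, price) VALUES (%(town)s, %(price)s)",
    {"town": "Nice", "price": 250000},
)

# MySQLdb: positional %s placeholders, parameters passed as a tuple
cursor.execute(
    "INSERT INTO listings (town, price) VALUES (%s, %s)",
    ("Nice", 250000),
)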
def insert_data_to_table(cursor, table_name, columns_list, values_dict, gps_string):
    # Creates a string of comma-separated %s placeholders for the values to be inserted
    placeholders = ", ".join("%s" for _ in values_dict)
    # Creates a string of columns to be inserted
    columns = ", ".join(columns_list)
    # If GPS data is provided, insert the ST_GeomFromText() call into the query.
    # Else build the query without the GPS substring.
    if gps_string:
        insert_query = f"INSERT INTO {table_name} (gps, {columns}) VALUES ({gps_string}, {placeholders})"
    else:
        insert_query = f"INSERT INTO {table_name} ({columns}) VALUES ({placeholders})"
    cursor.execute(insert_query, tuple(values_dict.values()))


def add_listings(cursor, listings):
    # db and table_name are defined elsewhere (module level)
    print("Adding data...")
    counter = 0
    for listing in listings:
        # Committing every 50 rows rather than after each row reduces time by approx 33%;
        # minimal difference between 20, 50, 100 etc.
        if counter > 50:
            print("Commit", counter)
            db.commit()
            counter = 0
        columns_list = []
        values_dict = {key: None for key in listing if key != "gps"}
        if listing.get("gps") is None:
            gps_string = None
        elif isinstance(listing.get("gps"), list):
            gps_string = f"ST_GeomFromText('POINT({round(listing['gps'][0], 6)} {round(listing['gps'][1], 6)})', 4326)"
        else:
            gps_string = None  # fallback so gps_string is always defined
        for key in values_dict:
            if isinstance(listing.get(key), list):
                values_dict[key] = ":;:".join(str(x) for x in listing[key])
            else:
                values_dict[key] = listing.get(key)
            columns_list.append(key)
        insert_data_to_table(cursor, table_name, columns_list, values_dict, gps_string)
        counter += 1
    db.commit()
2 Answers
You could expand your solution: upload the data file to the target server and run the migration there from crontab, if that is possible. The inserts would then run locally on the server rather than over the tunnel, so it will probably be faster.
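Something along these lines, for example (rough sketch; the connection details, file path, table and column names are placeholders, and on PythonAnywhere a scheduled task plays the role of crontab):

# run_import.py -- runs on the PythonAnywhere server itself,
# so every INSERT is a local round trip instead of one over the SSH tunnel
import json
import MySQLdb

db = MySQLdb.connect(
    host="yourname.mysql.pythonanywhere-services.com",
    user="yourname",
    passwd="...",
    db="yourname$default",
)
cursor = db.cursor()

# Read the uploaded JSON data file
with open("/home/yourname/listings.json") as f:
    listings = json.load(f)

# Example with three columns only; executemany sends all rows in one call
rows = [(l.get("town"), l.get("price"), l.get("ref")) for l in listings]
cursor.executemany(
    "INSERT INTO listings (town, price, ref) VALUES (%s, %s, %s)",
    rows,
)
db.commit()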
It could be because you are hosting the database with PythonAnywhere. If you are on the free beginner plan, it probably has insufficient CPU and bandwidth for your task, which would make it this slow.
My guess is that the bandwidth for the server on PythonAnywhere is not enough. If you are looking for a free hosting solution, you could try other companies that provide a free tier.
I personally use AWS, which has a free tier for 12 months. There are probably more options out there and you can google them.
Hope this helps!