We are in the middle of refactoring a custom-built middleware/API from Python 2 to 3. The software is responsible for creating and updating categories, products, listings, orders, stock, etc. on many different platforms (like eBay, Amazon, Shopware), based on data exported from our ERP software. It runs daily on a virtualised Ubuntu server and has its own MariaDB database. We have now reached the scripts responsible for processing the ERP software’s data exports and are debating the ‘best’ approach for dealing with certain associative tables of ours.
An example:
There are two tables for dealing with product-to-image relationships: ‘global_images’ (~140k rows) and ‘article_image_mapping’ (~250k rows). ‘global_images’ contains the image’s filename as well as its sort order and is referenced via its ID in ‘article_image_mapping’, along with the corresponding product ID.
Whenever a product gets a new image, for example, we have to make sure to keep all entries up to date: the old ‘article_image_mapping’ row needs to be deleted, the new one needs to be referenced, and we also have to update the sort order of all entries relating to that product ID.
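To make that concrete, here is a minimal SQL sketch of one such refresh; the column names and IDs are assumptions for illustration only, not our real schema:

    -- assumed columns: global_images(id, filename, sort_order),
    --                  article_image_mapping(article_id, image_id)

    -- product 4711 replaces image 99 with the newly imported image 123
    DELETE FROM article_image_mapping
     WHERE article_id = 4711 AND image_id = 99;

    INSERT INTO article_image_mapping (article_id, image_id)
    VALUES (4711, 123);

    -- renumber the product's remaining images; the new positions are
    -- computed in the Python layer, one UPDATE per image
    UPDATE global_images SET sort_order = 1 WHERE id = 123;
    UPDATE global_images SET sort_order = 2 WHERE id = 101;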
This procedure is of course no big deal (similar to “Updating an associative table in MySQL”), but we were thinking:
What keeps us from simply truncating both tables and recreating them every day? This would keep our code cleaner and simpler, and as long as the product ID stays the same, other references can change however they like. Also, we wouldn’t bloat the AUTO_INCREMENT index with hundreds of thousands of ON DUPLICATE KEY UPDATE queries, even if this is probably negligible.
Somehow this doesn’t feel as elegant though. We also have to make sure our exception handling is up to par, since further scripts cannot run without the image-product mapping.
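For comparison, the two alternatives we are weighing would look roughly like this sketch (same assumed table and column names as above, values made up):

    -- option A: rebuild both tables from the ERP export every night
    TRUNCATE TABLE article_image_mapping;
    TRUNCATE TABLE global_images;
    -- ...followed by bulk INSERTs / LOAD DATA INFILE from the export

    -- option B: upsert row by row and let the unique key resolve conflicts
    INSERT INTO global_images (id, filename, sort_order)
    VALUES (123, 'shirt_front.jpg', 1)
    ON DUPLICATE KEY UPDATE filename   = VALUES(filename),
                            sort_order = VALUES(sort_order);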
3 Answers
"Best practice" is very much an opinion-based question – but I can outline some of the trade-offs.
For maintainability, you probably want the technical design to reflect the lifecycle of the business entities as closely as possible. From this point of view, "truncate the tables and repopulate from scratch" is probably not what a new developer would expect – in the business domain, the photos don’t disappear every night and reappear the next morning.
For performance reasons, repopulating all 250K records is probably also not great – especially if the data set grows over time.
For bug resistance, refreshing the data every night might avoid bugs, because the images on the hard drive are effectively a foreign key relationship to an entity outside the database, and therefore not easily verified using standard relational logic.
On the other hand, that bug resistance may be problematic if you have to write code dedicated to figuring out whether the image population logic has completed before running other parts of the script.
If your focus is on the data being correct, then rebuilding the entire tables each day is a very reasonable solution. I’m not sure what your exact process is, but if the rebuild easily fits within your time constraints and your resources, then you know the data is what you want.
The main advantage is that the data is simple. With an update approach, you have to deal with insert/update/delete logic. And when handling edge cases, it might not be clear exactly what you need to be doing.
The main disadvantage is that you might end up rewriting history. If changes are made in the source data that affect history, then things could get confusing. This can be an issue with reporting.
I often design systems that are rebuilt every day... but there is a caveat. They are on cloud servers where storage is essentially free, and we can archive old copies to see "what really happened" in the past.
If what you are loading is a complete copy of all the information, including relationships, then do this:
Load all the new data into new tables.
Swap old and new in a single statement (virtually no downtime): RENAME TABLE t1 TO t1_old, t1_new TO t1, t2 TO t2_old, t2_new TO t2, …;
Then discard the old copies: DROP TABLE t1_old, t2_old, …;
No fuss, no muss, virtually no downtime.
(If your daily dump is incremental, then we need more details.)
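Applied to the two tables from the question, that swap might look like the following sketch; the *_new staging names are an assumption for illustration:

    -- 1. load the fresh export into empty staging tables
    CREATE TABLE global_images_new LIKE global_images;
    CREATE TABLE article_image_mapping_new LIKE article_image_mapping;
    -- ...bulk INSERT / LOAD DATA INFILE into the *_new tables here...

    -- 2. swap old and new atomically
    RENAME TABLE global_images             TO global_images_old,
                 global_images_new         TO global_images,
                 article_image_mapping     TO article_image_mapping_old,
                 article_image_mapping_new TO article_image_mapping;

    -- 3. discard yesterday's copies
    DROP TABLE global_images_old, article_image_mapping_old;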
FOREIGN KEYs may trip up such a swap; do you have any? If so, you probably need to disable foreign key checks around the RENAME and DROP.
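If foreign keys do reference these tables, one common approach is to disable the checks for the current session only while swapping; a sketch:

    -- session-scoped: other connections still have FK checks enabled
    SET SESSION foreign_key_checks = 0;

    RENAME TABLE article_image_mapping     TO article_image_mapping_old,
                 article_image_mapping_new TO article_image_mapping;
    DROP TABLE article_image_mapping_old;

    SET SESSION foreign_key_checks = 1;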