Heads up: I don’t have a technical background; I’ve just picked up bits from everywhere.
In a personal project, I’m building a star schema for a DWH on my local machine. One fact table collects fields from 8 other tables (3 OLTP tables and 5 dimension tables). The main source table used to feed the fact table has about 120K rows. There is very little transformation logic because I want to store everything at the granular level.
After a few days of struggling with a Python script that loads in batches (1,000 rows per insert), I finally got the hang of it. Now comes the next issue: it takes incredibly long to ingest ~100K records with 8 table joins. Can you share your experience with a similar situation, and what should I do to ingest data faster? This kind of speed is unacceptable.
What have I done so far? Indexing the OLTP tables, though I never tested its effectiveness. As for other techniques such as filtering with conditions, partitioning, etc., I don’t see how any of them apply to my case, but I could be wrong. My suspicion is that the problem lies in hardware defaults that I know very little about.
I hope I’ve provided enough information and that everything is clear. If not, please ask me to clarify.
FactOrderItem = """
INSERT INTO olist_olap.FactOrderItem( OrderItemID,
OrderID,
ProdKey,
SellerKey,
CustomerKey,
price,
FreightValue,
OrderStatusKey,
DateKey,
PayMethodKey,
PaymentValue,
PurchaseTimestamp,
ApprovedAt,
ShippingLimitDate,
DeliveredCarrierDate,
DeliveredCustomerDate,
EstimatedDeliverDate,
Vversion)
SELECT
CONCAT(oi.order_id, '-', oi.order_item_id) AS OrderItemID,
oi.order_id,
prod.ProdKey,
sell.SellerKey,
cust.CustomerKey,
oi.price,
oi.freight_value,
stat.OrderStatusKey,
IF(CHAR_LENGTH(order_purchase_timestamp)>5, DATE_FORMAT(order_purchase_timestamp, '%Y%m%d'), NULL),
pay.PayMethodKey,
paym.payment_value,
IF(CHAR_LENGTH(ord.order_purchase_timestamp) > 5, CAST(ord.order_purchase_timestamp AS DATETIME), NULL),
IF(CHAR_LENGTH(ord.order_approved_at) > 5, CAST(ord.order_approved_at AS DATETIME), NULL),
IF(CHAR_LENGTH(oi.shipping_limit_date) > 5, CAST(oi.shipping_limit_date AS DATETIME), NULL),
IF(CHAR_LENGTH(ord.order_delivered_carrier_date) > 5, CAST(ord.order_delivered_carrier_date AS DATETIME), NULL),
IF(CHAR_LENGTH(ord.order_delivered_customer_date) > 5, CAST(ord.order_delivered_customer_date AS DATETIME), NULL),
IF(CHAR_LENGTH(ord.order_estimated_delivery_date) > 5, CAST(ord.order_estimated_delivery_date AS DATETIME), NULL),
1
FROM olist_oltp.olist_order_items_dataset oi
LEFT JOIN olist_oltp.olist_orders_dataset ord
ON oi.order_id = ord.order_id
LEFT JOIN olist_oltp.olist_order_payments_dataset paym
ON ord.order_id = paym.order_id
LEFT JOIN olist_olap.DimCustomer cust
ON ord.customer_id = cust.CustomerID
LEFT JOIN olist_olap.DimOrderStatus stat
ON ord.order_status = stat.OrderStatus
LEFT JOIN olist_olap.DimProduct prod
ON oi.product_id = prod.ProdID
LEFT JOIN olist_olap.DimSeller sell
ON oi.seller_id = sell.SellerID
LEFT JOIN olist_olap.DimPayMethod pay
ON paym.payment_type = pay.PaymentType
LIMIT %s OFFSET %s;
"""
FactOrderItem_script = """SELECT COUNT(*) FROM olist_oltp.olist_order_items_dataset;"""
And this is the function I used to ingest data in batches:
dict_fact_tables = {
    FactOrderItem_script: FactOrderItem,
    FactOrderReview_script: FactOrderReview
}

def populate_olap_tables(dic, cursor, batch_size, offset):
    """
    Function to populate the OLAP tables
    """
    for key in dic:
        cursor.execute(key)
        count = cursor.fetchone()[0]
        print(count, "records found")
        # Loop through the entire table with batch size predefined
        # and offset updated
        while count > 0:
            if count < batch_size:
                batch_size = count
            try:
                cursor.execute(dic[key], (batch_size, offset))
                offset += batch_size
                count -= batch_size
                print(count, "records remaining")
                connection.commit()
            except Exception as error:
                print(f"The connection error details are: {error}")
    print("The tables have been successfully populated.")
import mysql.connector

# Establish a connection
try:
    connection = mysql.connector.connect(
        host=host,
        user=user,
        password=password
    )
    if connection.is_connected():
        print("Connected to the database")
        # Load the script
        batch_size = 1000
        offset = 0
        cursor = connection.cursor()
        # Count the number of records required to load in OLTP
        populate_olap_tables(dict_fact_tables, cursor, batch_size, offset)
finally:
    if connection.is_connected():
        connection.close()
        print("The connection is closed")
I would really appreciate any advice or sources to refer to, so I can dig deeper into this issue. I’m also very curious about how large businesses manage their DWH structures, given the large amount of concurrent reads and writes they experience.
2 Answers
Improper use of joins in SQL queries can significantly impact the performance of a database system, leading to slow query execution times, high CPU and memory usage, and overall inefficiency in data retrieval. Here’s how improper joins can cause performance problems:
1. Full Table Scans: joining on columns with no usable index forces the engine to read every row of the table.
2. Cartesian Products: a missing or incorrect join condition multiplies the row counts of the joined tables.
3. Inefficient Use of Indexes: wrapping a join or filter column in a function, or mismatching data types, stops the optimizer from using an existing index.
4. Nested Loops for Large Datasets: a nested-loop join that probes a large, unindexed table repeats that probe once per outer row.
5. Improper Join Order: joining the large tables first, instead of reducing the row count early, inflates the intermediate result sets.
6. Use of Non-Sargable Conditions: predicates the optimizer cannot turn into an index lookup (e.g. a function applied to the column) fall back to scanning.
Mitigation Strategies
Index the columns used in join conditions, keep join predicates sargable (compare bare columns rather than expressions around them), and check the chosen plan with EXPLAIN; a small sketch follows below. Understanding and addressing these issues can lead to significant improvements in database performance, making your applications faster and more responsive.
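For example, a hedged sketch of the indexing mitigation, using the table and column names from the question (index names are illustrative; skip any index that already exists or is already covered by a primary key):

-- Index the lookup columns that the fact-load JOINs probe.
CREATE INDEX idx_orders_order_id    ON olist_oltp.olist_orders_dataset (order_id);
CREATE INDEX idx_payments_order_id  ON olist_oltp.olist_order_payments_dataset (order_id);
CREATE INDEX idx_dimproduct_prodid  ON olist_olap.DimProduct (ProdID);
CREATE INDEX idx_dimseller_sellerid ON olist_olap.DimSeller (SellerID);
CREATE INDEX idx_dimcustomer_custid ON olist_olap.DimCustomer (CustomerID);
CREATE INDEX idx_dimstatus_status   ON olist_olap.DimOrderStatus (OrderStatus);
CREATE INDEX idx_dimpaymethod_type  ON olist_olap.DimPayMethod (PaymentType);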
You have two big problems ATM: your query is slow, and you don’t know which part(s) are slow. We’ll get to those. But first, let’s tackle the trivial.
DATE vs TEXT
You have a bunch of type CASTs like this:
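(quoting one of the expressions from your INSERT … SELECT)

IF(CHAR_LENGTH(ord.order_approved_at) > 5, CAST(ord.order_approved_at AS DATETIME), NULL)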
They’re just an annoying distraction; deal with them outside of the current ETL step. Consider fixing up the olist_orders_dataset table. There are many options. Use ALTER TABLE … MODIFY COLUMN to change the type in place, or ALTER TABLE … ADD COLUMN followed by dropping the old text column. Or create a new table, copy the rows over, then DROP the old table and RENAME the new one into place. Or don’t fix it up at all: just sweep the problem under the rug with CREATE VIEW olist_orders_dataset_v AS SELECT order_approved_at, … and bury the date expressions in there.
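A hedged sketch of the view route, reusing the conversion pattern and column names from your query (extend the SELECT list to every column you need):

CREATE VIEW olist_oltp.olist_orders_dataset_v AS
SELECT
    order_id,
    customer_id,
    order_status,
    IF(CHAR_LENGTH(order_purchase_timestamp) > 5, CAST(order_purchase_timestamp AS DATETIME), NULL) AS order_purchase_timestamp,
    IF(CHAR_LENGTH(order_approved_at) > 5, CAST(order_approved_at AS DATETIME), NULL) AS order_approved_at,
    IF(CHAR_LENGTH(order_delivered_carrier_date) > 5, CAST(order_delivered_carrier_date AS DATETIME), NULL) AS order_delivered_carrier_date,
    IF(CHAR_LENGTH(order_delivered_customer_date) > 5, CAST(order_delivered_customer_date AS DATETIME), NULL) AS order_delivered_customer_date,
    IF(CHAR_LENGTH(order_estimated_delivery_date) > 5, CAST(order_estimated_delivery_date AS DATETIME), NULL) AS order_estimated_delivery_date
FROM olist_oltp.olist_orders_dataset;

The fact-load query can then join the view and select those columns directly, with no date expressions of its own.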
The conventional _v suffix denotes "view".
query plans
Use EXPLAIN SELECT … to learn what access path the backend planner chose. Pay attention to the (estimated) number of rows examined and, with EXPLAIN ANALYZE, to the actual time spent. As you change the query, and/or DROP / CREATE indexes, pay attention to how the plan changes.
It’s not the easiest thing in the world to interpret a query plan, so start small. Use simple SELECTs, then build them up into JOINs, and see how the planner responds.
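For example, a hedged sketch of that build-up, using tables from the question:

-- Start with the driving table on its own.
EXPLAIN
SELECT oi.order_id, oi.price
FROM olist_oltp.olist_order_items_dataset oi;

-- Add one JOIN at a time and compare how the plan and row estimates change.
EXPLAIN
SELECT oi.order_id, oi.price, ord.order_status
FROM olist_oltp.olist_order_items_dataset oi
LEFT JOIN olist_oltp.olist_orders_dataset ord
    ON oi.order_id = ord.order_id;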
debugging
The LIMIT … OFFSET clause can be great when serving web pages, on the theory that a web user likely won’t scroll down to the millionth result row, so don’t bother retrieving it.
But client-side loops tend to be terrible for good planning. The client knows it will eventually retrieve all rows, but it hides that knowledge from the backend planner. If the planner knows it will fetch all rows from a table, then a tablescan is typically the best access path, but it can’t make that decision if it can’t predict future client requests.
One good thing about that LIMIT parameter is that it lets you go through a rapid edit-run-debug cycle as you change details of the SQL. Reap that same benefit by testing with small tables. For example, you could CREATE TABLE olist_orders_dataset_small AS SELECT …, or you could just truncate the existing olist_orders_dataset and fully re-populate it after you’re happy with the debugging.
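A hedged sketch of the small-table route (the schema prefix and row count are just illustrative):

-- A small copy of the orders table for quick edit-run-debug cycles.
CREATE TABLE olist_oltp.olist_orders_dataset_small AS
SELECT * FROM olist_oltp.olist_orders_dataset
LIMIT 5000;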
Again, the goal is better plans.
Ok, on to the meat of the problem.
The current app-level throughput of about one row per second is clearly just terrible. There must be some hidden repeated tablescans lurking within.
simplify
Your query runs slowly. It involves several JOINs. You don’t know which JOIN(s) are slow. Delete them all. Whittle down the query to the bare minimum, and see it run quickly. An INSERT … SELECT … should be able to move more than ten thousand rows per second; when you start asking it to do fancy things, it will slow down.
For now, just insert a default NULL in each column where an FK value belongs.
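A hedged sketch of that bare-minimum load, keeping only the columns that come straight from olist_order_items_dataset and inserting NULL everywhere else (this assumes those fact columns are nullable):

-- Bare-minimum INSERT ... SELECT: no JOINs, NULL where every dimension key belongs.
INSERT INTO olist_olap.FactOrderItem
    (OrderItemID, OrderID, ProdKey, SellerKey, CustomerKey,
     price, FreightValue, OrderStatusKey, DateKey, PayMethodKey, PaymentValue,
     PurchaseTimestamp, ApprovedAt, ShippingLimitDate,
     DeliveredCarrierDate, DeliveredCustomerDate, EstimatedDeliverDate, Vversion)
SELECT
    CONCAT(oi.order_id, '-', oi.order_item_id),
    oi.order_id,
    NULL, NULL, NULL,
    oi.price,
    oi.freight_value,
    NULL, NULL, NULL, NULL,
    NULL, NULL,
    IF(CHAR_LENGTH(oi.shipping_limit_date) > 5, CAST(oi.shipping_limit_date AS DATETIME), NULL),
    NULL, NULL, NULL,
    1
FROM olist_oltp.olist_order_items_dataset oi;

Time it as a baseline before adding anything back.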
Run EXPLAIN on that stripped-down SELECT and make sure you understand why its plan is fast.
Now start adding those JOINs back, one by one. Or, roughly equivalently, populate the key columns one by one with UPDATE … JOIN. Pay attention to the timings, and (for slow operations) to the plans.
Maybe you’re missing a PK index? Every table should have a sensible PK, and every FK relationship should be supported by an index on the referenced table, which will almost always be that table’s PK.
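For instance, a hedged sketch of one such step: fill ProdKey from DimProduct with an UPDATE … JOIN (names from the question; assumes DimProduct.ProdID and FactOrderItem.OrderItemID are indexed, ideally as PKs):

-- Populate one surrogate key at a time and time each statement.
UPDATE olist_olap.FactOrderItem f
JOIN olist_oltp.olist_order_items_dataset oi
    ON f.OrderItemID = CONCAT(oi.order_id, '-', oi.order_item_id)
JOIN olist_olap.DimProduct prod
    ON oi.product_id = prod.ProdID
SET f.ProdKey = prod.ProdKey;

A slow run here usually points to a missing index on one side of a JOIN.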
put it all together
By this point you have learned something about how to work with (not against) the planner, to rapidly access your favorite rows. Query times are now predictable, and faster than before. You may have changed how you index tables. Even with multiple JOINs, you should be able to query more than a thousand rows per second. Your git repo has seen more than a few edits of SQL commands.
In a new schema, issue some DROP TABLEs to tear it all down, then run your DDL which defines tables + indexes, then run a single query to populate FactOrderItem.
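A hedged sketch of that final flow (your DDL lives in your own scripts; the populate statement is the INSERT … SELECT from the question with the LIMIT %s OFFSET %s line removed):

-- Tear down, rebuild, then load in one shot.
DROP TABLE IF EXISTS olist_olap.FactOrderItem;
-- ...re-run your CREATE TABLE and CREATE INDEX DDL for FactOrderItem here...
-- ...then run the INSERT ... SELECT from the question exactly once, committing once at the end.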