
Heads up: I don't have a formal background in this field; I've just picked up bits from everywhere.

In a personal project, I'm building a star schema for a data warehouse on my local machine. One fact table pulls fields from 8 other tables (3 OLTP tables and 5 dimension tables). The main source table feeding the fact table has about 120K rows. There is very little transformation logic because I want to store everything at the granular level.

After a few days of struggling with a Python script that loads in batches (1,000 rows per insert), I finally got it working. Now comes the next issue: it takes incredibly long to ingest ~100K records with 8 table joins. Have you dealt with a similar situation? What should I do to ingest the data faster? The current speed is unacceptable.

What have I done so far? Indexing on the OLTP tables, though I never tested its effectiveness. Other techniques such as filtering with conditions, partitioning, etc. don't seem applicable to my case, but I could be wrong. My suspicion is that the problem lies in hardware defaults I know very little about.

I hope I've provided enough information and everything is clear. If not, please ask and I'll clarify.


FactOrderItem = """
                INSERT INTO olist_olap.FactOrderItem( OrderItemID,
                                                        OrderID,
                                                        ProdKey,
                                                        SellerKey,
                                                        CustomerKey,
                                                        price,
                                                        FreightValue,
                                                        OrderStatusKey,
                                                        DateKey,
                                                        PayMethodKey,
                                                        PaymentValue,
                                                        PurchaseTimestamp,
                                                        ApprovedAt,
                                                        ShippingLimitDate,
                                                        DeliveredCarrierDate,
                                                        DeliveredCustomerDate,
                                                        EstimatedDeliverDate,
                                                        Vversion)
                SELECT 
                        CONCAT(oi.order_id, '-', oi.order_item_id) AS OrderItemID,
                        oi.order_id,
                        prod.ProdKey,
                        sell.SellerKey,
                        cust.CustomerKey,
                        oi.price,
                        oi.freight_value,
                        stat.OrderStatusKey,
                        IF(CHAR_LENGTH(ord.order_purchase_timestamp)>5, DATE_FORMAT(ord.order_purchase_timestamp, '%Y%m%d'), NULL),
                        pay.PayMethodKey,
                        paym.payment_value,
                        IF(CHAR_LENGTH(ord.order_purchase_timestamp) > 5, CAST(ord.order_purchase_timestamp AS DATETIME), NULL),
                        IF(CHAR_LENGTH(ord.order_approved_at) > 5, CAST(ord.order_approved_at AS DATETIME), NULL),
                        IF(CHAR_LENGTH(oi.shipping_limit_date) > 5, CAST(oi.shipping_limit_date AS DATETIME), NULL),
                        IF(CHAR_LENGTH(ord.order_delivered_carrier_date) > 5, CAST(ord.order_delivered_carrier_date AS DATETIME), NULL),
                        IF(CHAR_LENGTH(ord.order_delivered_customer_date) > 5, CAST(ord.order_delivered_customer_date AS DATETIME), NULL),
                        IF(CHAR_LENGTH(ord.order_estimated_delivery_date) > 5, CAST(ord.order_estimated_delivery_date AS DATETIME), NULL),
                        1

                FROM olist_oltp.olist_order_items_dataset oi
                LEFT JOIN olist_oltp.olist_orders_dataset ord
                ON oi.order_id = ord.order_id
                LEFT JOIN olist_oltp.olist_order_payments_dataset paym
                ON ord.order_id = paym.order_id
                LEFT JOIN olist_olap.DimCustomer cust
                ON ord.customer_id = cust.CustomerID
                LEFT JOIN olist_olap.DimOrderStatus stat
                ON ord.order_status = stat.OrderStatus
                LEFT JOIN olist_olap.DimProduct prod
                ON oi.product_id = prod.ProdID
                LEFT JOIN olist_olap.DimSeller sell
                ON oi.seller_id = sell.SellerID
                LEFT JOIN olist_olap.DimPayMethod pay
                ON paym.payment_type = pay.PaymentType
                LIMIT %s OFFSET %s;
"""
FactOrderItem_script = """SELECT COUNT(*) FROM olist_oltp.olist_order_items_dataset;"""


And this is the function I used to ingest the data in batches:

dict_fact_tables = {
    FactOrderItem_script: FactOrderItem,
    FactOrderReview_script: FactOrderReview
}

def populate_olap_tables(dic, cursor, batch_size, offset):
    """
    Populate the OLAP fact tables in batches.
    dic maps a COUNT(*) script to its INSERT ... SELECT ... LIMIT %s OFFSET %s script.
    """
    for key in dic:
        # Count the number of source records to load for this fact table
        cursor.execute(key)
        count = cursor.fetchone()[0]
        print(count, "records found")

        # Loop through the entire source table with the predefined batch size,
        # advancing the offset after each committed batch
        remaining = count
        current_offset = offset  # restart the offset for every fact table
        while remaining > 0:
            current_batch = min(batch_size, remaining)
            try:
                cursor.execute(dic[key], (current_batch, current_offset))
                connection.commit()
                current_offset += current_batch
                remaining -= current_batch
                print(remaining, "records remaining")
            except Exception as error:
                print(f"The connection error details are: {error}")
                break  # stop instead of retrying the same batch forever

    print("The tables have been successfully populated.")


import mysql.connector

# Establish a connection
connection = None
try:
    connection = mysql.connector.connect(
        host=host,
        user=user,
        password=password
    )
    if connection.is_connected():
        print("Connected to the database")
        # Batch settings used by the loader
        batch_size = 1000
        offset = 0
        cursor = connection.cursor()
        # Count the records in each OLTP source table, then load in batches
        populate_olap_tables(dict_fact_tables, cursor, batch_size, offset)

finally:
    if connection is not None and connection.is_connected():
        connection.close()
        print("The connection is closed")

I would really appreciate any advice or sources to refer to, so I can dig deeper into this issue. I'm also very curious about how large businesses manage their DWH structures, given the large amount of concurrent reads and writes they experience.

2 Answers


  1. Improper use of joins in SQL queries can significantly impact the performance of a database system, leading to slow query execution times, high CPU and memory usage, and overall inefficiency in data retrieval. Here’s how improper joins can cause performance problems:

    1. Full Table Scans

    • Issue: If a join condition is not properly indexed or if the join logic is complex, the database might resort to a full table scan to find matching rows. This means every row in the table is examined to determine if it meets the join condition.
    • Impact: Full table scans are resource-intensive, especially on large tables, leading to slow query performance and increased load on the database server.

    2. Cartesian Products

    • Issue: A Cartesian product occurs when a join condition is omitted or incorrectly specified. This results in every row from the first table being joined to every row from the second table.
    • Impact: The result set can become extremely large, growing multiplicatively with the sizes of the involved tables, which can overwhelm the system resources and lead to severe performance degradation or even system crashes.

    3. Inefficient Use of Indexes

    • Issue: Indexes are designed to speed up data retrieval. However, if the join conditions do not match the indexed columns or if the query optimizer cannot utilize the indexes (due to complex expressions or functions applied to the join columns), the performance benefits of indexing are lost.
    • Impact: This leads to slower searches and data retrieval, as the database cannot efficiently locate the relevant rows in the joined tables.

    4. Nested Loops for Large Datasets

    • Issue: Nested loop joins, where the database engine iterates over each row of one table to find matching rows in another, can be efficient for small datasets. However, when applied to large datasets without proper indexing, they become highly inefficient.
    • Impact: The computational cost increases significantly with the size of the tables, leading to long execution times and high CPU usage.

    5. Improper Join Order

    • Issue: The order in which tables are joined can affect performance. If a large table is joined before filtering down a dataset through a smaller table or a where clause, it can result in unnecessary processing of a large number of rows.
    • Impact: This can cause increased memory usage and slower performance as the database works through more data than necessary.

    6. Use of Non-Sargable Conditions

    • Issue: Sargable conditions in joins (and where clauses) allow the database to take advantage of indexes. Non-sargable conditions (e.g., applying functions to columns before joining) prevent the use of indexes.
    • Impact: This forces the database to perform more labor-intensive searches rather than quick index lookups, slowing down query execution.
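
    For example, compare these two joins against the question's tables (the
    UPPER() call is invented purely to illustrate the point):

        -- Non-sargable: the function wrapped around the join column defeats
        -- any index on DimCustomer.CustomerID.
        SELECT ord.order_id, cust.CustomerKey
        FROM olist_oltp.olist_orders_dataset ord
        LEFT JOIN olist_olap.DimCustomer cust
               ON UPPER(cust.CustomerID) = UPPER(ord.customer_id);

        -- Sargable: compare the bare columns so the index can be used.
        SELECT ord.order_id, cust.CustomerKey
        FROM olist_oltp.olist_orders_dataset ord
        LEFT JOIN olist_olap.DimCustomer cust
               ON cust.CustomerID = ord.customer_id;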

    Mitigation Strategies

    • Proper Indexing: Ensure that columns used in join conditions are indexed appropriately (see the sketch at the end of this answer).
    • Optimize Query Logic: Write queries to avoid Cartesian products and ensure that join conditions are clear and correct.
    • Analyze and Optimize: Use query execution plans to understand how the database is executing joins and optimize based on recommendations.
    • Data Modeling: Sometimes, denormalizing your database or using materialized views can reduce the need for complex joins.

    Understanding and addressing these issues can lead to significant improvements in database performance, making your applications faster and more responsive.
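
    Applied to the question's schema, "proper indexing" of the join columns
    would look roughly like the following (whether these columns are already
    indexed, e.g. as primary keys, is unknown, so treat this as a sketch
    rather than a prescription):

        -- Index the columns used as join keys in the fact-load query.
        CREATE INDEX idx_orders_order_id   ON olist_oltp.olist_orders_dataset (order_id);
        CREATE INDEX idx_payments_order_id ON olist_oltp.olist_order_payments_dataset (order_id);
        CREATE INDEX idx_dimcustomer_id    ON olist_olap.DimCustomer (CustomerID);
        CREATE INDEX idx_dimproduct_id     ON olist_olap.DimProduct (ProdID);
        CREATE INDEX idx_dimseller_id      ON olist_olap.DimSeller (SellerID);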

  2. You have two big problems ATM:
    your query is slow, and you don’t know which part(s) are slow.
    We’ll get to those.
    But first, let’s tackle the trivial.

    DATE vs TEXT

    You have a bunch of type CASTs like this:

    IF(CHAR_LENGTH(ord.order_purchase_timestamp) > 5, CAST(ord.order_purchase_timestamp AS DATETIME), NULL),
    

    They’re just an annoying distraction.
    Deal with them outside of the current ETL situation.

    Consider fixing up the olist_orders_dataset table.
    There are many options.
    Use ALTER TABLE … MODIFY COLUMN, or ALTER TABLE … ADD COLUMN
    followed by dropping the old text column.
    Or create a new table, copy rows, do a DROP and a RENAME of tables.

    Or don’t fix it up, just sweep the problem under the rug
    with CREATE VIEW olist_orders_dataset_v AS SELECT order_approved_at, ...,
    and bury the date expressions in there.
    The conventional _v suffix denotes "view".
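
    A minimal sketch of such a view, assuming the date columns are currently
    stored as text (column list abbreviated to the ones used in the
    question's query):

        CREATE VIEW olist_oltp.olist_orders_dataset_v AS
        SELECT  order_id,
                customer_id,
                order_status,
                -- hide the text-to-DATETIME conversion inside the view
                IF(CHAR_LENGTH(order_purchase_timestamp) > 5,
                   CAST(order_purchase_timestamp AS DATETIME), NULL) AS order_purchase_timestamp,
                IF(CHAR_LENGTH(order_approved_at) > 5,
                   CAST(order_approved_at AS DATETIME), NULL) AS order_approved_at
                -- ... and so on for the remaining date columns
        FROM olist_oltp.olist_orders_dataset;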

    query plans

    Use EXPLAIN SELECT … to learn what access path was chosen
    by the backend planner.

    Pay attention to the (estimated) number of rows examined, and
    to the cost estimate (EXPLAIN ANALYZE, where available, also reports actual timings).
    As you change the query, and/or do DROP / CREATE INDEX,
    pay attention to how the plan changes.

    It’s not the easiest thing in the world to interpret a query plan.
    So start small.
    Use simple SELECTs, then build them up into JOINs,
    and see how the planner responds.
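
    For example (against the question's tables; the exact plan output depends
    on your MySQL version):

        -- Start with a single table...
        EXPLAIN SELECT oi.order_id, oi.price
        FROM olist_oltp.olist_order_items_dataset oi;

        -- ...then add one JOIN at a time and watch how the plan changes.
        EXPLAIN SELECT oi.order_id, oi.price, ord.order_status
        FROM olist_oltp.olist_order_items_dataset oi
        LEFT JOIN olist_oltp.olist_orders_dataset ord
               ON oi.order_id = ord.order_id;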

    debugging

                    LIMIT %s OFFSET %s;
    

    That can be a great clause when serving web pages,
    on the theory that a web user likely won’t scroll down
    to the millionth result row, so don’t bother retrieving it.

    But client-side loops tend to be terrible for good planning.
    The client knows it will eventually retrieve all rows,
    but it hides that knowledge from the backend planner.
    If the planner knows it will fetch all rows from a table,
    then a tablescan is typically the best access path,
    but it can’t make that decision if it can’t predict
    future client requests.

    One good thing about that LIMIT parameter is it lets
    you go through a rapid edit-run-debug cycle as you change
    details of the SQL.
    Reap that same benefit by testing with small tables.
    For example you
    could CREATE TABLE olist_orders_dataset_small AS SELECT … ,
    or you could just truncate the existing olist_orders_dataset
    and then fully re-populate after you’re happy with debugging.
    Again, the goal is better plans.
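
    For instance (the 5000-row cutoff is arbitrary):

        -- A small copy to iterate against while debugging the SQL.
        CREATE TABLE olist_oltp.olist_orders_dataset_small AS
        SELECT *
        FROM olist_oltp.olist_orders_dataset
        LIMIT 5000;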


    Ok, on to the meat of the problem.

    The current app-level throughput of about one row per second
    is clearly just terrible.
    There must be some hidden repeated tablescans lurking within.

    simplify

                LEFT JOIN olist_oltp.olist_order_payments_dataset paym ...
                LEFT JOIN olist_olap.DimCustomer cust ...
                LEFT JOIN olist_olap.DimOrderStatus stat ...
                LEFT JOIN olist_olap.DimProduct prod ...
                LEFT JOIN olist_olap.DimSeller sell ...
                LEFT JOIN olist_olap.DimPayMethod pay ...
    

    Your query runs slowly.
    It involves several JOINs.
    You don’t know which JOIN(s) are slow.

    Delete them all.

    Whittle down the query to the bare minimum,
    and see it run quickly.
    An INSERT … SELECT … should be able to
    move more than ten thousand rows per second.
    When you start asking it to do fancy things, it will slow down.
    For now, just insert NULL where an FK value belongs
    (see the sketch at the end of this section).

    Run EXPLAIN on that minimal query and make sure you understand why its plan is fast.

    Now start adding those JOINs back, one by one.
    Or, roughly equivalently, populate columns one by one
    with UPDATE … JOIN.
    Pay attention to the timings, and (for slow operations) to the plans.
    Maybe you’re missing a
    PK index?
    Every table should have a sensible PK.
    Every FK relationship should be backed by an index on the referenced table,
    which will almost always be that table's PK.
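
    As a sketch of that progression, reusing the question's names (it assumes
    the fact table's key and date columns are nullable while you experiment,
    and the index statement is only an example of what might be missing):

        -- 1. Bare-minimum load: no JOINs, NULL wherever a dimension key belongs.
        INSERT INTO olist_olap.FactOrderItem
                (OrderItemID, OrderID, ProdKey, SellerKey, CustomerKey,
                 price, FreightValue, OrderStatusKey, DateKey, PayMethodKey,
                 PaymentValue, Vversion)
        SELECT  CONCAT(oi.order_id, '-', oi.order_item_id),
                oi.order_id,
                NULL, NULL, NULL,
                oi.price, oi.freight_value,
                NULL, NULL, NULL, NULL,
                1
        FROM olist_oltp.olist_order_items_dataset oi;

        -- 2. Add the JOINs (or UPDATE ... JOIN steps) back one at a time,
        --    timing each one, and check that each lookup column is indexed, e.g.:
        ALTER TABLE olist_olap.DimProduct ADD INDEX idx_prodid (ProdID);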

    put it all together

    By this point you have learned something about how to work
    with (not against) the planner, to rapidly access your favorite rows.
    Query times are now predictable, and are faster than before.
    You may have changed how you index tables.
    Even with multiple JOINs, you should be able to
    query more than a thousand rows per second.
    Your git repo has seen more than a few edits of SQL commands.

    In a new schema, issue some DROP TABLEs to tear it all down,
    then run your DDL which defines tables + indexes,
    then run a single query to populate FactOrderItem.
