skip to Main Content

I am using tableau for a dashboard that pulls data in from a MySQL database. It works perfectly but the query takes too long to execute. The data is limited to the previous 2 years and the most recent Saturday that has passed from today.

Background info:
I have sales data on a daily cadence databased in a single table per account(client). There is a field called distributor_view which can have one of the following 2 values: Manufacturing and Sourcing. An item will sometimes disappear from the manufacturing view, but will always be on the sourcing. The revenue figures between the 2 are not equal, but for data completeness- if any items are missing from the manufacturing view (preferred) then the revenue is pulled from the sourcing.

Note: If all data is queried most of it will be duplicated on an itemId level- only about 5% of items are missing from the manufacturing view and need to be referenced by the sourcing view.

The logic of the query below is such that it will pull all items that are in the manufacturing view, and UNION ALL items that exist in the sourcing view and are missing from the manufacturing view.

Query:

SELECT * FROM 
(
    SELECT 'Client1' as Client, date as 'Date', distributor_view as 'Distributor View', item_id as 'ItemID', product_title as 'Title', ordered_revenue as 'Ordered Revenue'
    FROM table1 
    WHERE 
        distributor_view = 'Manufacturing' 
        AND 
        YEAR(date) >= YEAR(CURDATE()) - 2 
        AND 
        date <= DATE_ADD(CURDATE(), INTERVAL(-DAYOFWEEK(CURDATE())) DAY)
    UNION ALL
    SELECT 'Client1' as Client, date as 'Date', distributor_view as 'Distributor View', item_id as 'ItemID', product_title as 'Title', ordered_revenue as 'Ordered Revenue'
    FROM table1 
    WHERE 
        distributor_view = 'Sourcing' 
        AND 
        (date, item_id) NOT IN (
            SELECT date, item_id FROM table1 WHERE distributor_view = 'Manufacturing' AND YEAR(date) >= YEAR(CURDATE()) - 2  AND date <= DATE_ADD(CURDATE(), INTERVAL(-DAYOFWEEK(CURDATE())) DAY)
        ) 
        AND 
        YEAR(date) >= YEAR(CURDATE()) - 2 
        AND 
        date <= DATE_ADD(CURDATE(), INTERVAL(-DAYOFWEEK(CURDATE())) DAY)
    UNION ALL
    SELECT 'Client2' as Client, date as 'Date', distributor_view as 'Distributor View', item_id as 'ItemID', product_title as 'Title', ordered_revenue as 'Ordered Revenue'
    FROM table2
    WHERE 
        distributor_view = 'Manufacturing' 
        AND 
        YEAR(date) >= YEAR(CURDATE()) - 2 
        AND 
        date <= DATE_ADD(CURDATE(), INTERVAL(-DAYOFWEEK(CURDATE())) DAY)
    UNION ALL
    SELECT 'Client2' as Client, date as 'Date', distributor_view as 'Distributor View', item_id as 'ItemID', product_title as 'Title', ordered_revenue as 'Ordered Revenue'
    FROM table2
    WHERE 
        distributor_view = 'Sourcing' 
        AND 
        (date, item_id) NOT IN (
            SELECT date, item_id FROM table2 WHERE distributor_view = 'Manufacturing' AND YEAR(date) >= YEAR(CURDATE()) - 2  AND date <= DATE_ADD(CURDATE(), INTERVAL(-DAYOFWEEK(CURDATE())) DAY)
        ) 
        AND 
        YEAR(date) >= YEAR(CURDATE()) - 2 
        AND 
        date <= DATE_ADD(CURDATE(), INTERVAL(-DAYOFWEEK(CURDATE())) DAY)
) AS sales 
ORDER BY date DESC

In an earlier iteration of this I used Python to create a script that would delete out items from sourcing that existed in manufacturing. The query would then be much simpler- a select with no regard to distributor_view. I don’t like this method because it requires a script to be ran each time the data changes(daily).

Another option I’m brainstorming is to just pull all data and attempt to reconcile the two within tableau using some calculation.

2

Answers


  1. Chosen as BEST ANSWER

    I created a work-around using just Tableau. I imported all data without the nested UNIONs and no check for distributor_view:

    SELECT 'Client1' as Client, date as 'Date', distributor_view as 'Distributor View', item_id as 'ItemID', product_title as 'Title', ordered_revenue as 'Ordered Revenue'
        FROM table1 
        WHERE 
            YEAR(date) >= YEAR(CURDATE()) - 2 
            AND 
            date <= DATE_ADD(CURDATE(), INTERVAL(-DAYOFWEEK(CURDATE())) DAY)
    UNION ALL
    SELECT 'Client2' as Client, date as 'Date', distributor_view as 'Distributor View', item_id as 'ItemID', product_title as 'Title', ordered_revenue as 'Ordered Revenue'
        FROM table2
        WHERE 
            YEAR(date) >= YEAR(CURDATE()) - 2 
            AND 
            date <= DATE_ADD(CURDATE(), INTERVAL(-DAYOFWEEK(CURDATE())) DAY)
    

    In tableau I created a calculated field called sourcing_filter:

    IF {FIXED [itemID], [Date]: MIN([Distributor View])} = 'Manufacturing' AND [Distributor View] = 'Sourcing' THEN
        0
    ELSE
        1
    END
    

    I converted this calc field to a dimension, and then added it to the filter as 1 (or exclude 0). This accomplished the same end result with a much faster query execution time from ~70m to ~4m.


  2. (1) This is not sargable

    AND  YEAR(date) >= YEAR(CURDATE()) - 2
    

    Instead do

    AND  date >= CONCAT(YEAR(CURDATE(), '-01-01')
    

    (It is kludgy, but should allow for using an index involving the date column.)

    (2) I’m not sure about this syntax

    AND  date <= DATE_ADD(CURDATE(), INTERVAL(-DAYOFWEEK(CURDATE())) DAY)
    

    Let’s change to

    AND  date <= CURDATE - INTERVAL DAYOFWEEK(CURDATE()) DAY
    

    (3) Add this to both tables:

    INDEX(distributor_view, date, item_id)
    

    (4) Two problems here, but neither can be fully optimized:

    AND  (date, item_id) NOT IN ( SELECT ... )
    

    First, "NOT IN". Let’s see the EXPLAIN in order to see if the Optimizer did anything good with it.

    Second, "row constructors" — such as (date, item_id) should be avoided. Although such ‘works’, it is not likely to be efficient. There have been some improvements in very recent versions. What version of MySQL or MariaDB are you using?

    Change to one of these:

    AND  NOT EXISTS ( SELECT ... )
    

    or

    LEFT JOIN ( SELECT ... ) ... WHERE ... IS NULL
    

    (5) get rid of the outer SELECT *. (Be sure to add the extra parens I show here.)

    ( SELECT ... )
    UNION ALL
    ( SELECT ... )
    UNION ALL
    ( SELECT ... )
    UNION ALL
    ( SELECT ... )
    ORDER BY date DESC
    

    Without the extra parens, the ORDER BY will [I think] apply only to the last SELECT. In general, when using UNION, extra parens is safe, and may be necessary.

    (6) If these suggestions cannot be applied to the framework that generated the query, you have my condolences.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search