I am using tableau for a dashboard that pulls data in from a MySQL database. It works perfectly but the query takes too long to execute. The data is limited to the previous 2 years and the most recent Saturday that has passed from today.
Background info:
I have sales data on a daily cadence databased in a single table per account(client). There is a field called distributor_view which can have one of the following 2 values: Manufacturing and Sourcing. An item will sometimes disappear from the manufacturing view, but will always be on the sourcing. The revenue figures between the 2 are not equal, but for data completeness- if any items are missing from the manufacturing view (preferred) then the revenue is pulled from the sourcing.
Note: If all data is queried most of it will be duplicated on an itemId level- only about 5% of items are missing from the manufacturing view and need to be referenced by the sourcing view.
The logic of the query below is such that it will pull all items that are in the manufacturing view, and UNION ALL items that exist in the sourcing view and are missing from the manufacturing view.
Query:
SELECT * FROM
(
SELECT 'Client1' as Client, date as 'Date', distributor_view as 'Distributor View', item_id as 'ItemID', product_title as 'Title', ordered_revenue as 'Ordered Revenue'
FROM table1
WHERE
distributor_view = 'Manufacturing'
AND
YEAR(date) >= YEAR(CURDATE()) - 2
AND
date <= DATE_ADD(CURDATE(), INTERVAL(-DAYOFWEEK(CURDATE())) DAY)
UNION ALL
SELECT 'Client1' as Client, date as 'Date', distributor_view as 'Distributor View', item_id as 'ItemID', product_title as 'Title', ordered_revenue as 'Ordered Revenue'
FROM table1
WHERE
distributor_view = 'Sourcing'
AND
(date, item_id) NOT IN (
SELECT date, item_id FROM table1 WHERE distributor_view = 'Manufacturing' AND YEAR(date) >= YEAR(CURDATE()) - 2 AND date <= DATE_ADD(CURDATE(), INTERVAL(-DAYOFWEEK(CURDATE())) DAY)
)
AND
YEAR(date) >= YEAR(CURDATE()) - 2
AND
date <= DATE_ADD(CURDATE(), INTERVAL(-DAYOFWEEK(CURDATE())) DAY)
UNION ALL
SELECT 'Client2' as Client, date as 'Date', distributor_view as 'Distributor View', item_id as 'ItemID', product_title as 'Title', ordered_revenue as 'Ordered Revenue'
FROM table2
WHERE
distributor_view = 'Manufacturing'
AND
YEAR(date) >= YEAR(CURDATE()) - 2
AND
date <= DATE_ADD(CURDATE(), INTERVAL(-DAYOFWEEK(CURDATE())) DAY)
UNION ALL
SELECT 'Client2' as Client, date as 'Date', distributor_view as 'Distributor View', item_id as 'ItemID', product_title as 'Title', ordered_revenue as 'Ordered Revenue'
FROM table2
WHERE
distributor_view = 'Sourcing'
AND
(date, item_id) NOT IN (
SELECT date, item_id FROM table2 WHERE distributor_view = 'Manufacturing' AND YEAR(date) >= YEAR(CURDATE()) - 2 AND date <= DATE_ADD(CURDATE(), INTERVAL(-DAYOFWEEK(CURDATE())) DAY)
)
AND
YEAR(date) >= YEAR(CURDATE()) - 2
AND
date <= DATE_ADD(CURDATE(), INTERVAL(-DAYOFWEEK(CURDATE())) DAY)
) AS sales
ORDER BY date DESC
In an earlier iteration of this I used Python to create a script that would delete out items from sourcing that existed in manufacturing. The query would then be much simpler- a select with no regard to distributor_view. I don’t like this method because it requires a script to be ran each time the data changes(daily).
Another option I’m brainstorming is to just pull all data and attempt to reconcile the two within tableau using some calculation.
2
Answers
I created a work-around using just Tableau. I imported all data without the nested UNIONs and no check for distributor_view:
In tableau I created a calculated field called sourcing_filter:
I converted this calc field to a dimension, and then added it to the filter as 1 (or exclude 0). This accomplished the same end result with a much faster query execution time from ~70m to ~4m.
(1) This is not sargable
Instead do
(It is kludgy, but should allow for using an index involving the
date
column.)(2) I’m not sure about this syntax
Let’s change to
(3) Add this to both tables:
(4) Two problems here, but neither can be fully optimized:
First, "NOT IN". Let’s see the
EXPLAIN
in order to see if the Optimizer did anything good with it.Second, "row constructors" — such as
(date, item_id)
should be avoided. Although such ‘works’, it is not likely to be efficient. There have been some improvements in very recent versions. What version of MySQL or MariaDB are you using?Change to one of these:
or
(5) get rid of the outer
SELECT *
. (Be sure to add the extra parens I show here.)Without the extra parens, the
ORDER BY
will [I think] apply only to the lastSELECT
. In general, when usingUNION
, extra parens is safe, and may be necessary.(6) If these suggestions cannot be applied to the framework that generated the query, you have my condolences.