skip to Main Content

Here is the scenario:
we are using Azure Synapse analytics with 100 DTU, we have several tables.
All tables are round robin distributed and clustered columnstore because data is loaded once in a day and we want it to be quick.
All these tables are big having 50+ columns and millions of records (around 30m)
There is one load_table table which contains information about last load of all the tables.

SQL Query1:

SELECT load_time FROM load_table WHERE table = 't2';

when I execute this query on table, this gives me one timestamp as output for that table. format is like this "YYYY-MM-DD TT". This query normally runs fine within 5-10 secs.

SQL Query2:

SELECT top 100 * FROM t2;

t2 is big fact table containing millions of records having 50+ cols. When I run this query it takes 2-3 mins.

Problem occurs when I join these two tables and it takes 2-3 hrs to execute.
I want the result like loadtime as first column and all columns from second fact table.

joining is like below


SELECT t1.load_time, t2.* FROM
(SELECT load_time FROM load_table WHERE table = 't2') as t1 INNER JOIN t2
on 1 = 1; 

Output look like below.
enter image description here

How can I achieve this result another way?

I am expecting below results
enter image description here

I have checked statistics of fact table those are updated
I have checked skew and it is fine

2

Answers


  1. your join-condition is a bit…special.
    If your load_table only returns one value per table (‘t2’), then you can just try this:

    SELECT TOP 100 
    t1.load_time,
    t2.[fact.col1],[fact.col2],...,[fact.col8]
    FROM load_table as t1 
    CROSS APPLY t2 
    WHERE t1.table='t2'
    

    and if you use a stored procedure, you can first save the value to a variable:

    DECLARE @TableTimestamp datetime;
    SELECT TOP 1 @TableTimestamp = load_time from load_table
    ;
    SELECT TOP 100
    @TableTimeStamp as LoadTime
    ,t2.[fact.col1],[fact.col2],...,[fact.col8]
    FROM t2
    
    Login or Signup to reply.
  2. when I execute this query on table, this gives me one timestamp as output for that table

    when it return only one row and one column then why not take this in variable.
    Declare @load_time datetime

    SELECT @load_time=load_time FROM load_table WHERE table = ‘t2’;

    Now,

    SELECT @load_time as load_time, t2.* FROM t2 (NOLOCK)

    i) DO not use * and mention all the column name which you need in your output.

    ii) Use nolock only when there is no chance of READ UNCommitted or it do not matter.

    iii) Why you will use million of data,why not limit your resultset by using Pagination or top clause.

    iv) Of course , we need at a minimum: the table and index definition for further investigation.also share your real query

    if you simply use

    select * from t2
    

    which has millions of rows it will always take time.
    Also,
    select top 100 * from t2 will also take time because it is without order by clause,sql optimizer is not sure to order on which column to get those 100 rows.

    So you have to share table scheme along with data type.Also if there is any index on table or unique key define.

    Then accordingly we can do Paging to get only selected rows.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search