
My use case is as follows:

Our data source is an On-prem SQL Server, and it serves as our production database. Currently, we are building reports in Power BI and utilizing Snowflake as our data warehouse.

I aim to extract 10 to 15 tables into Snowflake for my Power BI reporting, specifically as an SCD Type 1 pipeline, since we do not need to retain historical data. To move the data from on-prem to Snowflake, we are using Azure Data Factory, with Blob Storage set up in Azure.

We already have a self-hosted integration runtime in place that connects to Data Factory. Currently, I use a ForEach activity to copy the 15 tables from on-prem to Snowflake. The pipeline is scheduled to run daily, and each time it executes, it truncates all 15 tables before reloading the data in full. However, this process is time-consuming due to the volume of data, especially since our on-prem SQL Server is a legacy database.
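For concreteness, this is roughly the pattern the daily pipeline follows today, sketched in Python; the table list, connection details, and the snowflake-connector-python usage are illustrative placeholders, not our actual objects.

    import snowflake.connector

    # Placeholder table list; the real pipeline iterates over 15 tables.
    TABLES = ["ORDERS", "CUSTOMERS", "PRODUCTS"]

    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="...",
        warehouse="WH", database="DW", schema="PUBLIC",
    )
    cur = conn.cursor()
    for table in TABLES:
        # Empty the target table; the ADF copy activity then reloads
        # every row from the on-prem source - a full load on each run.
        cur.execute(f"TRUNCATE TABLE {table}")
    cur.close()
    conn.close()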

My questions are as follows:

  1. Is there a more efficient approach within Data Factory for this task? What are the recommended best practices for this kind of scenario?

  2. Data flows do not work with the self-hosted integration runtime; how can I use a data flow against an on-prem database?

I would greatly appreciate any advice on the correct solution for this or pointers to relevant documentation or blogs.

2 Answers


    1. Copy only the changed data from the on-prem SQL Server. If your source tables have an update-date column, you can use a query to pull only the rows that changed since your last run (see the sketch after this list).
    2. If you need to use data flows, first copy the data to a staging table on your source or to a file in Azure Storage, and then run the data flow on that staging copy.
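
    A minimal sketch of point 1, assuming each source table has a ModifiedDate audit column and using pyodbc; the connection string, table name, and watermark value are placeholders, not your actual objects.

        import pyodbc

        SQL_SERVER_CONN = (
            "DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=onprem-sql;DATABASE=Prod;Trusted_Connection=yes;"
        )

        def extract_changed_rows(table, last_watermark):
            # Pull only the rows modified since the previous pipeline run,
            # instead of truncating and copying the whole table.
            query = f"SELECT * FROM {table} WHERE ModifiedDate > ?"
            with pyodbc.connect(SQL_SERVER_CONN) as conn:
                cursor = conn.cursor()
                cursor.execute(query, last_watermark)
                return cursor.fetchall()

        # Example: fetch only the changes since the last successful load.
        rows = extract_changed_rows("dbo.Orders", "2024-01-01T00:00:00")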
    For better performance, rather than doing a full load, follow an incremental/delta load by leveraging the watermark approach.

    With the watermark approach, you use a column as a benchmark, usually an audit column such as the last-updated or modified date for each row, and then load only the records from the source whose benchmark value (on-prem) is greater than the last-updated value stored in the sink (Snowflake); see the sketch below the reference link.

    For reference: https://learn.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-overview
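
    A minimal sketch of the watermark pattern on the Snowflake side, assuming a WATERMARK control table, a staging table STG_ORDERS landed by the ADF copy activity, and the snowflake-connector-python package; all of these names are illustrative, not the tutorial's exact objects.

        import snowflake.connector

        conn = snowflake.connector.connect(
            account="my_account", user="my_user", password="...",
            warehouse="WH", database="DW", schema="PUBLIC",
        )
        cur = conn.cursor()

        # 1. Read the watermark recorded for this table after the last run.
        cur.execute("SELECT LAST_UPDATED FROM WATERMARK WHERE TABLE_NAME = 'ORDERS'")
        last_updated = cur.fetchone()[0]
        # (The ADF copy activity uses this value in its source query so that
        #  only rows with ModifiedDate > last_updated land in STG_ORDERS.)

        # 2. SCD Type 1 upsert: overwrite matches, insert new rows, keep no history.
        cur.execute("""
            MERGE INTO ORDERS AS tgt
            USING STG_ORDERS AS src
              ON tgt.ORDER_ID = src.ORDER_ID
            WHEN MATCHED THEN UPDATE SET
              STATUS = src.STATUS, MODIFIED_DATE = src.MODIFIED_DATE
            WHEN NOT MATCHED THEN INSERT (ORDER_ID, STATUS, MODIFIED_DATE)
              VALUES (src.ORDER_ID, src.STATUS, src.MODIFIED_DATE)
        """)

        # 3. Advance the watermark to the newest change just loaded.
        cur.execute("""
            UPDATE WATERMARK
               SET LAST_UPDATED = (SELECT MAX(MODIFIED_DATE) FROM STG_ORDERS)
             WHERE TABLE_NAME = 'ORDERS'
        """)
        conn.commit()
        cur.close()
        conn.close()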

    There is another approach, change data capture (CDC), but it leverages data flows, which are not supported with a self-hosted integration runtime, so it will not work for your on-prem source.
