My use case is as follows:
Our data source is an On-prem SQL Server, and it serves as our production database. Currently, we are building reports in Power BI and utilizing Snowflake as our data warehouse.
I aim to load 10 to 15 tables into Snowflake for my Power BI reporting, specifically building an SCD Type 1 pipeline with no need to retain historical data. To move the data from on-prem to Snowflake, we are using Azure Data Factory, with blob storage set up in Azure.
We already have a self-hosted integration runtime in place that connects to Data Factory. Currently, I use a ForEach activity to copy the 15 tables from on-prem to Snowflake. The pipeline is scheduled to run daily, and each run truncates all 15 tables before reloading the data. However, this process is time-consuming due to the volume of data, especially since our on-prem SQL Server is a legacy database.
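For context, the per-table work each daily run does today amounts to a full reload, roughly as sketched below (the schema and table names ANALYTICS.STAGING.ORDERS and dbo.Orders are hypothetical placeholders for one of the 15 tables):

```sql
-- Pre-copy script on the Snowflake sink: wipe the target at the start of every daily run
TRUNCATE TABLE ANALYTICS.STAGING.ORDERS;

-- Source query on the on-prem SQL Server: re-read the entire table every run
SELECT *
FROM dbo.Orders;
```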
My questions are as follows:
- Is there a more efficient approach within Data Factory for this task? What are the recommended best practices for this kind of scenario?
- Data Flow is not working with the self-hosted integration runtime; how can I enable Data Flow for an on-prem database?
I would greatly appreciate any advice on the correct solution for this or pointers to relevant documentation or blogs.
Answer:
For better performance, rather than doing a full load on every run, use an incremental/delta load by leveraging watermark functionality.
With the watermark approach, you pick a column to serve as the benchmark, usually an audit column such as a last-updated/modified date on each row, and then load only those records from the source whose benchmark value (on-prem) is greater than the last watermark recorded for the sink (Snowflake).
For reference: https://learn.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-overview
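To make the watermark pattern concrete, here is a minimal SQL sketch of the three steps, assuming a control table ANALYTICS.ETL.ETL_WATERMARK, an audit column LastModifiedDate on the source table, and a staging table ANALYTICS.STAGING.ORDERS in Snowflake; all of these names are placeholders to adapt to your environment:

```sql
-- 1) Look up the last successful watermark stored in Snowflake
--    (in ADF this is typically a Lookup activity whose output feeds the source query)
SELECT LAST_UPDATED_DATE
FROM   ANALYTICS.ETL.ETL_WATERMARK
WHERE  TABLE_NAME = 'ORDERS';

-- 2) Source query on the on-prem SQL Server: pull only rows changed since that watermark
--    (@watermark would be supplied by the pipeline, e.g. via a dynamic-content expression)
SELECT *
FROM   dbo.Orders
WHERE  LastModifiedDate > @watermark;

-- 3) After the copy succeeds, advance the watermark to the newest value just loaded
UPDATE ANALYTICS.ETL.ETL_WATERMARK w
SET    w.LAST_UPDATED_DATE = s.MAX_MODIFIED
FROM   (SELECT MAX(LAST_MODIFIED_DATE) AS MAX_MODIFIED
        FROM   ANALYTICS.STAGING.ORDERS) s
WHERE  w.TABLE_NAME = 'ORDERS';
```

The same three steps can be driven from the ForEach loop you already have, passing the table name and audit column per iteration.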
There is another approach, Change Data Capture (CDC), but it leverages data flows, and in your case with an on-prem source that is not supported.
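And since the goal is SCD Type 1 (overwrite in place, no history), once the delta rows land in a Snowflake staging table the final step can be a MERGE into the reporting table. A rough sketch with hypothetical table, key, and column names:

```sql
-- Upsert the delta into the reporting table: update matching keys, insert new ones
MERGE INTO ANALYTICS.DW.ORDERS AS tgt
USING ANALYTICS.STAGING.ORDERS AS src
   ON tgt.ORDER_ID = src.ORDER_ID
WHEN MATCHED THEN UPDATE SET
     CUSTOMER_ID        = src.CUSTOMER_ID,
     ORDER_STATUS       = src.ORDER_STATUS,
     LAST_MODIFIED_DATE = src.LAST_MODIFIED_DATE
WHEN NOT MATCHED THEN INSERT
     (ORDER_ID, CUSTOMER_ID, ORDER_STATUS, LAST_MODIFIED_DATE)
VALUES
     (src.ORDER_ID, src.CUSTOMER_ID, src.ORDER_STATUS, src.LAST_MODIFIED_DATE);
```

In Data Factory this MERGE could be executed per table (for example via a Script activity) inside the same ForEach loop.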