A pipeline loads data from an API into ADLS Gen2. The file name is fixed, so every run of the pipeline overwrites the same file.
I need the number of rows in that file for some calculations in the next activity; in other words, I want the row count exposed as an activity output that the next activity can consume as an input parameter.
How can I achieve this? Please suggest an approach for both .csv and .parquet files, and indicate which is more convenient to implement.
2 Answers
If the file is small or has few rows, you can use a Lookup activity to get the count of rows in the file.
Lookup activity limitation: the Lookup activity returns a maximum of 5,000 rows and supports a maximum size of 4 MB.
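With "First row only" unchecked, the Lookup activity's output includes a count property alongside the rows, so (assuming the activity is named 'Lookup1', an illustrative name) the next activity can read the row count with:
@activity('Lookup1').output.count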
If the file is larger than that, you can copy the file's data into a SQL database; the Copy activity's output then reports the row counts.
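For example, assuming the copy step is named 'Copy data1' (an illustrative name), a subsequent activity can read the number of rows written with:
@activity('Copy data1').output.rowsCopied
The Copy activity's output also exposes rowsRead if you want the count of rows read from the source file instead.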
If you can't use SQL, then you would have to implement custom logic in Databricks, Azure Batch, Azure Functions, etc. to determine the count, along the lines of the sketch below.
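As a minimal sketch of such custom logic (assuming the file has been downloaded or mounted to a local path, and that pyarrow is installed; all names and paths here are illustrative), something like this could run in an Azure Function or a Databricks notebook:

```python
# Illustrative sketch: count rows in a CSV or Parquet file.
# Assumes the ADLS Gen2 file is accessible at a local/mounted path.
import csv
import pyarrow.parquet as pq

def count_csv_rows(path: str, has_header: bool = True) -> int:
    # Stream the file so large CSVs need not fit in memory;
    # csv.reader correctly handles quoted fields with embedded newlines.
    with open(path, newline="") as f:
        total = sum(1 for _ in csv.reader(f))
    return total - 1 if has_header else total

def count_parquet_rows(path: str) -> int:
    # Parquet stores the row count in its footer metadata,
    # so this reads no data pages at all.
    return pq.ParquetFile(path).metadata.num_rows
```

Note that the Parquet case is much cheaper, since the count comes from file metadata rather than a full scan; this also bears on which format is more convenient for this requirement.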
Since you want the number of rows in the file and the Lookup activity is limited to 5,000 rows and 4 MB, you can instead use a Data flow activity, built as follows:
1. Add a Surrogate Key transformation that generates a rownum column with start and step values of 1.
2. Add a constant column with value 1.
3. In an Aggregate transformation, compute count(rownum).
4. Write the aggregate to a cache sink (named sink1 here) with "Write to activity output" enabled, and select Logging = None in the Data flow activity's settings (cache sinks require this).
The Data flow activity's output will then contain the count under runStatus.output.sink1. You can access the value in the next activity with this expression (with the Data flow activity named 'Data flow2' and the cache sink named 'sink1'):
@activity('Data flow2').output.runStatus.output.sink1.value[0].count
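For example, to hand the count to later activities through a pipeline variable (pipeline variables are of type String, Boolean or Array, so the numeric count must be converted), a Set Variable activity on a String variable could use the illustrative expression:
@string(activity('Data flow2').output.runStatus.output.sink1.value[0].count)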