
A pipeline loads data from an API into ADLS Gen2. The file name is fixed, so every run of the pipeline overwrites the same file.
I need the number of rows in that file so I can use the value for some calculations in the next activity; ideally the count should come through as an activity output that the next activity can consume as an input parameter.

How can I get this? Please suggest an approach for both .csv and .parquet files, and which of the two is more convenient to implement.

2 Answers

  1. If the file is small (that is, has relatively few rows), you can use a Lookup activity to get the row count.
    Lookup activity limitation: the Lookup activity returns a maximum of 5,000 rows and a maximum size of 4 MB.
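
    For example, if the Lookup activity reads the file with the First row only option unchecked, its output exposes a count property that the next activity can reference as @activity('Lookup1').output.count (the activity name Lookup1 is just an assumed example).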

    If the file is larger than that, you can copy its data into a SQL database; the Copy activity output then reports the row counts.
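
    With this approach the Copy activity's own output already carries the counts: a follow-up activity can read, for example, @activity('Copy data1').output.rowsCopied (or rowsRead), where Copy data1 is an assumed activity name.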

    If you can't use SQL, you would have to use custom logic in Databricks, Azure Batch, or Azure Functions to determine the count.
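
    For the custom-logic route, a minimal sketch in Python (assuming something like an Azure Function or a Databricks notebook where the file is accessible locally; the paths data.csv and data.parquet are illustrative) could look like this:

    ```python
    import csv
    import pyarrow.parquet as pq

    def count_csv_rows(path: str) -> int:
        """Count data rows in a CSV file, excluding the header."""
        with open(path, newline="") as f:
            reader = csv.reader(f)
            next(reader, None)  # skip the header row
            return sum(1 for _ in reader)

    def count_parquet_rows(path: str) -> int:
        """Parquet keeps the row count in its footer metadata,
        so no data pages need to be scanned."""
        return pq.ParquetFile(path).metadata.num_rows

    print(count_csv_rows("data.csv"))          # illustrative local copy of the file
    print(count_parquet_rows("data.parquet"))
    ```

    Note that Parquet is the more convenient format for this: the row count comes straight from the file footer without scanning the data, whereas a CSV has to be read line by line.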

  2. Since the Lookup activity has a limit of 5,000 rows and 4 MB, you can instead get the row count with a Data flow activity, as below:

    • First add a Data flow activity and set its source dataset to the file into which the API data was copied.
    • Then add a Surrogate key transformation to create a rownum column with both the start value and the step value set to 1.
    • Then add a Derived column transformation to add a constant column with the value 1.
    • Then add an Aggregate transformation to get the row count: put the constant column in Group by, and in Aggregates create a new column count with the expression count(rownum).
    • Finally, add a cache sink and check the Write to activity output option; this makes the row count available in the data flow's output. Also set Logging level to None in the Data flow activity settings.

    The row count then appears in the data flow activity's output, under runStatus.output.sink1.

    You can then access the value in the next activity with the expression @activity('Data flow2').output.runStatus.output.sink1.value[0].count.
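
    To hand this value to a downstream activity, one option (just a sketch) is a Set variable activity whose value is @string(activity('Data flow2').output.runStatus.output.sink1.value[0].count); the string() wrapper is only needed if the target variable is of type String.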
