
How do you query the content of JSON files using Azure Data Factory? I want to filter for any JSON files that contain "code":"B" under markings > document > code and "date":"2020-01-01" under metadata > publication date > date. If there is a match, transfer the JSON files to another storage account.

sample json:

{
    "record": {
        "ns": "mra",
        "type": "REC",
        "record": {
            "ns": "rec",
            "id": 1001,
            "markings": {
                "document": {
                    "ns": "rec",
                    "code": "B"
                }
            }
        }
    },
    "metadata": {
        "an": "0002641",
        "markings": {
            "record": {
                "distribution": {}
            }
        },
        "publication date:": {
            "date": "2020-01-01"
        }
    }
}

2 Answers


  1. Chosen as BEST ANSWER


    I couldn't see the First row only option to uncheck under the Sink properties.


  2. As your source files are in a single folder in the Storage account, you can try the approach below to achieve your requirement.

    Use a Data flow to get the list of JSON file names that satisfy this condition.

    Create a JSON dataset, give only the path up to your folder, and leave the file name of the dataset empty.
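
    As a rough illustration, a JSON dataset that points only at the folder could look like the sketch below (the dataset, linked service, container and folder names are assumptions for this example, and the fileName property is deliberately left out):

    {
        "name": "SourceJsonFolder",
        "properties": {
            "type": "Json",
            "linkedServiceName": {
                "referenceName": "AzureBlobStorageLS",
                "type": "LinkedServiceReference"
            },
            "typeProperties": {
                "location": {
                    "type": "AzureBlobStorageLocation",
                    "container": "source-container",
                    "folderPath": "input"
                }
            }
        }
    }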

    In the source of the data flow, use this dataset, set the wildcard file path to *.json, and give a name for the Column to store file name option.

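    For reference, these source settings roughly correspond to a data flow script fragment like the sketch below (the stream name jsonSource and the column name filename are assumptions; documentForm: 'singleDocument' assumes each file holds one JSON document):

    source(allowSchemaDrift: true,
        validateSchema: false,
        wildcardPaths:['*.json'],
        documentForm: 'singleDocument',
        rowUrlColumn: 'filename') ~> jsonSource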

    Next, use a Filter transformation to set the filter condition. Use the dynamic expression below to keep only the required records.

    iif(hasPath('record.record.markings.document.code') && (hasPath('metadata.{publication date:}.date')),
        (record.record.markings.document.code=='B') && (metadata.{publication date:}.date=='2020-01-01'),
        toBoolean('false'))
    


    This will return only the matching records along with their file names.


    Now, use a Select transformation to remove all columns apart from the file name column.

    To send this column's data back to the pipeline, use a cache sink.

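    Roughly, the Select transformation and the cache sink correspond to a data flow script fragment like the sketch below (the stream names filterMatches, selectFileName and sinkCache and the column name filename are assumptions, and the property list is indicative rather than exact):

    filterMatches select(mapColumn(
            filename
        ),
        skipDuplicateMapInputs: true,
        skipDuplicateMapOutputs: true) ~> selectFileName
    selectFileName sink(validateSchema: false,
        skipDuplicateMapInputs: true,
        skipDuplicateMapOutputs: true,
        store: 'cache',
        format: 'inline',
        output: false,
        saveOrder: 1) ~> sinkCache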

    In the pipeline, add a Data flow activity, set its Logging level to None, and uncheck the First row only option under the Sink properties of the activity.
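
    A minimal sketch of that Data flow activity in the pipeline JSON, assuming the activity is named Data flow1 and the data flow is named FilterJsonFiles, could look like this (the First row only checkbox itself is unchecked on the activity's Sink properties tab in the UI):

    {
        "name": "Data flow1",
        "type": "ExecuteDataFlow",
        "typeProperties": {
            "dataFlow": {
                "referenceName": "FilterJsonFiles",
                "type": "DataFlowReference"
            },
            "traceLevel": "None"
        }
    }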

    Run the pipeline and the Data flow activity output will contain the required JSON file names.


    You need to pass this array to a ForEach loop and use a Copy activity with datasets parameterized by file name, copying each file to its target in each iteration.

    Add a ForEach activity after the Data flow activity and use the expression below to pass the above array to the loop.

    @activity('<Data flow activity name>').output.runStatus.output.<sink_name>.value
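
    As a rough sketch, the ForEach with a parameterized Copy activity inside could look like the fragment below (the names Data flow1, sinkCache, ForEachMatchedFile, CopyMatchedFile, SourceJsonFile, TargetJsonFolder and the dataset parameter fileName are assumptions, and @item().filename assumes the file name column in the Select transformation was named filename):

    {
        "name": "ForEachMatchedFile",
        "type": "ForEach",
        "dependsOn": [
            { "activity": "Data flow1", "dependencyConditions": [ "Succeeded" ] }
        ],
        "typeProperties": {
            "items": {
                "value": "@activity('Data flow1').output.runStatus.output.sinkCache.value",
                "type": "Expression"
            },
            "activities": [
                {
                    "name": "CopyMatchedFile",
                    "type": "Copy",
                    "inputs": [
                        {
                            "referenceName": "SourceJsonFile",
                            "type": "DatasetReference",
                            "parameters": {
                                "fileName": {
                                    "value": "@item().filename",
                                    "type": "Expression"
                                }
                            }
                        }
                    ],
                    "outputs": [
                        { "referenceName": "TargetJsonFolder", "type": "DatasetReference" }
                    ],
                    "typeProperties": {
                        "source": {
                            "type": "JsonSource",
                            "storeSettings": { "type": "AzureBlobStorageReadSettings" },
                            "formatSettings": { "type": "JsonReadSettings" }
                        },
                        "sink": {
                            "type": "JsonSink",
                            "storeSettings": { "type": "AzureBlobStorageWriteSettings" },
                            "formatSettings": { "type": "JsonWriteSettings" }
                        }
                    }
                }
            ]
        }
    }

    Here SourceJsonFile would be a dataset with a fileName parameter used in its file path, so each iteration copies one matched file to the target storage.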
    