Read "Integrated Dataset" into a Dataframe with Azure Synapse Notebook

Bama
January 22, 2024
186 views
0 votes
2 Answers

I know how to read a file in the Azure Data Lake Storage, but I don’t know how to read a file that is listed under the Integration Dataset section in Azure.

Answers

- DileeprajnarayanThumula
- January 19, 2024 at 8:36 am
- 0 votes
0
An integration dataset acts as a blueprint for your data, helping you specify its structure and whereabouts in a data store.
It is a guide that is used in your data pipelines and data flows how to interact with your stored information.

A CSV file with customer data in Azure Blob Storage.
You create an integration dataset to define details like the file’s structure (such as column names and data types) and its location in Blob Storage.
Later on, you can use this dataset within Synapse pipeline activities or data flows to read from or write to the CSV file.

An integration dataset does the heavy lifting of telling your systems "what" your data looks like and "where" to find it in a data store.

If you want to Read data from your Spark notebooks

I have tried the below approach:
```
data_path = spark.read.load(f"abfss://[email protected]/jan19.csv", format='csv', header=True)
data_path.show(10)
```
Results:
```
+---+------+--------+
| Id| Name |   Movie|
+---+------+--------+
|  1|Thomas|   peaky|
|  2|  Tony|Sopranos|
+---+------+--------+
```
Login or Signup to reply.

- kolb2055
- January 22, 2024 at 7:56 pm
- 0 votes
0
The integration dataset is not holding csv data. The delimited text dataset SFTP_Five9_CALL_SEGMENT is most likely pointing to a file in your data lake. In the connection blade, look at the File path’s [container]/[directory]/[file name] values. That file path is the information Spark needs to read the data.

In a notebook you can establish your filepath using abfss (Azure blob file system secure). To read this data into a dataframe set the filepath and read the data in using spark.read.csv.
```
Five9_FILEPATH = 'abfss:/<container>@<datalake_name>.dfs.core.windows.net/<rest_of_file_path>'
```
Then create the dataframe using spark.read.csv()
```
df = spark.read.csv(f'{Five9_FILEPATH}', header=True)
```
Make sure you format abfss properly

The container name is to the left of the @ symbol. **Do not include it to the right of the @ symbol. Only include the directories you need to descend through after the container.

abfss://[email protected]/< directories within container to the file >

In DileeprajnarayanThumula’s example, they are using a file path leading directly to the delimited file in the data lake:

abfss://[email protected]/jan19.csv
- synpgen is the data lake name
- folder02 is the container within the data lake
- *jan19th.csv is the file inside of that container.
You could also have a folder within the container say one for the month of January (jan). The path would then be ‘abfss://[email protected]/jan/jan19th.csv’

You could read all the files in a given directory into one dataframe:
```
jan_filepath = 'abfss://[email protected]/jan/*'
jan_df = spark.read.csv(f'{jan_filepath}', header=True)
```
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.