I uploaded parquet files to blob storage and created a data asset via the Azure ML GUI. The steps are precise and clear, and the outcome is as desired. For future use I would like to use the CLI to create the data asset and new versions of it.
The base command would be az ml data create -f <file-name>.yml. The docs provide a minimal example of an MLTable file, which should reside next to the parquet files.
# directory in blobstorage
├── data
│   ├── MLTable
│   ├── file_1.parquet
│   ├── ...
│   ├── file_n.parquet
I am still not sure how to properly specify those files in order to create a tabular dataset with column conversion.
Do I need to specify the full path or the pattern in the yml file?
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
type: mltable
name: Test data
description: Basic example for parquet files
path: azureml://datastores/workspaceblobstore/paths/*/*.parquet # pattern or path to dir?
I have more confusion about the MLTable file:
type: mltable
paths:
  - pattern: ./*.parquet
transformations:
  - read_parquet:
      # what comes here?
E.g. I have a column with dates in the format %Y-%m%d %H:%M:%S which should be converted to a timestamp. (I can provide this information at least in the GUI!)
Any help on this topic or hidden links to documentation would be great.
2 Answers
A working MLTable file to convert string columns from parquet files looks like this:
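As a hedged sketch of what such a file can contain, the snippet below builds the MLTable text in Python and writes it next to the parquet files. The keys (read_parquet, convert_column_types, datetime, formats) follow the MLTable YAML schema as I understand it. The column names column_a and column_b, the format string "%Y-%m-%d %H:%M:%S" and the helper names are assumptions for illustration only. Each column gets its own convert_column_types step.

```python
from pathlib import Path

# Fixed head of the MLTable file: select every parquet file in this folder.
MLTABLE_HEAD = (
    "type: mltable\n"
    "paths:\n"
    "  - pattern: ./*.parquet\n"
    "transformations:\n"
    "  - read_parquet:\n"
    "      include_path_column: false\n"
)


def datetime_step(column: str, fmt: str) -> str:
    """Return one convert_column_types step for a single string column."""
    return (
        "  - convert_column_types:\n"
        f"      - columns: {column}\n"
        "        column_type:\n"
        "          datetime:\n"
        "            formats:\n"
        f'              - "{fmt}"\n'
    )


def build_mltable(columns, fmt="%Y-%m-%d %H:%M:%S") -> str:
    """Assemble the MLTable text, one conversion step per column."""
    return MLTABLE_HEAD + "".join(datetime_step(c, fmt) for c in columns)


def write_mltable(data_dir, columns, fmt="%Y-%m-%d %H:%M:%S") -> Path:
    """Write the file as <data_dir>/MLTable (no extension) and return its path."""
    target = Path(data_dir) / "MLTable"
    target.write_text(build_mltable(columns, fmt), encoding="utf-8")
    return target


if __name__ == "__main__":
    # column_a and column_b are placeholder names (assumptions).
    print(build_mltable(["column_a", "column_b"]))
```

The resulting MLTable file has to be uploaded into the same blob folder as the parquet files so that the relative pattern ./*.parquet resolves.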
(By the way, at the time of writing, specifying multiple columns as an array did not work, e.g. columns: [column_a, column_b].)

To perform this operation, first check the installation and requirements for the experiment. A valid subscription and workspace are needed.
Install the required mltable library (pip install mltable).
Azure ML supports four kinds of path for the path parameter:
• Local computer path
• Path on a public server, such as HTTP/HTTPS
• Path on Azure storage (such as blob storage in this case)
• Path on a datastore
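To make the four kinds concrete, here is a small sketch that tells them apart by URI scheme. The scheme names (azureml, wasbs, abfss, http/https) are my reading of the URI forms used in the Azure ML docs, and classify_path is a hypothetical helper, not part of any Azure library.

```python
from urllib.parse import urlparse


def classify_path(path: str) -> str:
    """Guess which of the four supported path kinds a string belongs to."""
    scheme = urlparse(path).scheme.lower()
    if scheme == "azureml":
        # e.g. azureml://datastores/<datastore>/paths/<folder>/
        return "datastore"
    if scheme in ("wasbs", "abfss"):
        # e.g. wasbs://<container>@<account>.blob.core.windows.net/<folder>/
        return "azure storage"
    if scheme in ("http", "https"):
        # Note: an https URL may also point at a storage account.
        return "public server"
    # No recognised scheme: treat it as a path on the local computer.
    return "local"


if __name__ == "__main__":
    for example in (
        "./data/file_1.parquet",
        "https://example.com/data/file_1.parquet",
        "wasbs://container@account.blob.core.windows.net/data/",
        "azureml://datastores/workspaceblobstore/paths/data/",
    ):
        print(example, "->", classify_path(example))
```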
Create a YAML file in the folder that was created as the asset. The filename can be anything (for example filename.yml).
To create a specific file as the data asset, point the path at that file.
All paths need to be given according to your own workspace and its credentials.
To create an MLTable as the data asset, create a yml file with the data pattern, adapted to the data in your case.
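A hedged sketch of such a definition, generated from Python so the pieces are explicit. As far as I understand the data asset schema, for type mltable the path should point to the folder that contains the MLTable file rather than to a glob pattern, since the pattern already lives inside MLTable. The asset name test-data, the folder data and the helper build_data_asset_spec are assumptions for illustration.

```python
SCHEMA_URL = "https://azuremlschemas.azureedge.net/latest/data.schema.json"


def build_data_asset_spec(name, folder, datastore="workspaceblobstore",
                          description="", version=None) -> str:
    """Return the text of a data asset yml for an MLTable folder on a datastore."""
    lines = [f"$schema: {SCHEMA_URL}", "type: mltable", f"name: {name}"]
    if version is not None:
        # The version is written as a quoted string.
        lines.append(f'version: "{version}"')
    if description:
        lines.append(f"description: {description}")
    # Point at the folder holding MLTable and the parquet files, not at a pattern.
    lines.append(
        f"path: azureml://datastores/{datastore}/paths/{folder.strip('/')}/"
    )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_data_asset_spec(
        "test-data", "data",
        description="Basic example for parquet files",
        version=1,
    ))
```

Bumping the version argument and creating the asset again is one way to register a new version of the same asset.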
The MLTable can then be used from Python.
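A minimal sketch of using the table, assuming the mltable package is installed and that the given location (a local folder or a datastore URI) contains an MLTable file. The function name load_table_as_dataframe and the ./data folder are placeholders.

```python
def load_table_as_dataframe(uri: str):
    """Load an MLTable from a folder or URI and materialise it with pandas."""
    # Imported lazily so this module can be loaded without the package;
    # requires `pip install mltable` at call time.
    import mltable

    table = mltable.load(uri)
    # The transformations declared in the MLTable file are applied here.
    return table.to_pandas_dataframe()


if __name__ == "__main__":
    # "./data" is a placeholder for the folder that holds MLTable.
    frame = load_table_as_dataframe("./data")
    print(frame.dtypes)
```

Checking the dtypes is a quick way to confirm that the datetime conversion took effect.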
Then create the MLTable data asset from that definition.
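One way to do this is to call the CLI from a script. The sketch below only assembles the az ml data create command and runs it on request. It assumes the Azure CLI with the ml extension is installed and logged in; the file name, resource group and workspace values are placeholders.

```python
import shlex
import subprocess


def data_create_command(spec_file, resource_group=None, workspace=None):
    """Build the argument list for `az ml data create`."""
    cmd = ["az", "ml", "data", "create", "--file", spec_file]
    if resource_group:
        cmd += ["--resource-group", resource_group]
    if workspace:
        cmd += ["--workspace-name", workspace]
    return cmd


def create_data_asset(spec_file, resource_group=None, workspace=None):
    """Run the command; raises CalledProcessError if the CLI reports failure."""
    cmd = data_create_command(spec_file, resource_group, workspace)
    print("Running:", shlex.join(cmd))
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder values; replace with your own file, group and workspace.
    print(shlex.join(data_create_command("data-asset.yml", "my-rg", "my-ws")))
```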
Blob storage is the storage mechanism in this case.
The same procedure is used to create an MLTable data asset.