Azure ML CLI v2 create data asset with MLTable

KenJiiii
September 30, 2022

I uploaded parquet files to blob storage and created a data asset via the Azure ML GUI. The steps were precise and clear, and the outcome was as desired. In future I would like to use the CLI to create the data asset and new versions of it.

The base command would be az ml data create -f <file-name>.yml. The docs provide a minimal example of an MLTable file, which should reside next to the parquet files.

# directory in blobstorage
├── data
│   ├── MLTable
│   ├── file_1.parquet
.
.
.
│   ├── file_n.parquet

I am still not sure how to properly specify those files in order to create a tabular dataset with column conversion.

Do I need to specify the full path or the pattern in the yml file?

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json

type: mltable
name: Test data
description: Basic example for parquet files
path: azureml://datastores/workspaceblobstore/paths/*/*.parquet # pattern or path to dir?

I am even more confused about the MLTable file:

type: mltable

paths:
  - pattern: ./*.parquet
transformations:
  - read_parquet:
      # what comes here?

E.g. I have a column of dates in the format %Y-%m-%d %H:%M:%S that should be converted to a timestamp. (I can at least provide this information in the GUI!)

Any help on this topic or hidden links to documentation would be great.

Answers

Chosen as BEST ANSWER
- KenJiiii
- November 15, 2022 at 3:23 pm
A working MLTable file to convert string columns from parquet files looks like this:
```
---
type: mltable
paths:
  - pattern: ./*.parquet
transformations:
  - read_parquet:
      include_path_column: false
  - convert_column_types:
      - columns: column_a
        column_type:
          datetime:
            formats:
              - "%Y-%m-%d %H:%M:%S"
  - convert_column_types:
      - columns: column_b
        column_type:
          datetime:
            formats:
              - "%Y-%m-%d %H:%M:%S"
```
(By the way, at the time of writing, specifying multiple columns as an array, e.g. columns: [column_a, column_b], did not work.)
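Before putting a format string into the MLTable file, it can help to check it against a sample value locally. The sketch below uses Python's datetime.strptime, on the assumption that MLTable reads these % directives the same way for this pattern. The sample value and the helper name matches_format are made up for illustration.

```python
from datetime import datetime


def matches_format(value: str, fmt: str) -> bool:
    """Return True if `value` parses with the strptime-style format `fmt`."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


# Hypothetical sample; replace it with a real value from column_a or column_b.
sample = "2022-09-30 14:05:00"

# Format used in the MLTable file above.
print(matches_format(sample, "%Y-%m-%d %H:%M:%S"))
# Same format with the hyphen between month and day left out.
print(matches_format(sample, "%Y-%m%d %H:%M:%S"))
```

A format that fails this local check is unlikely to convert the column correctly, so it is a cheap test to run before creating a new asset version.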

(Edit)

Before you start, check that the required tooling is installed and that you have a valid Azure subscription and an Azure ML workspace.

Install the required mltable library (pip install mltable).

Azure ML supports four kinds of path for the path parameter:

• Path on the local computer

• Path on a public HTTP/HTTPS server

• Path on Azure storage (such as blob storage in this case)

• Path on a datastore
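To make the four kinds concrete, here is a small sketch that guesses the kind from the shape of a path string. The URI shapes are the ones shown in the YAML comments of this answer; the function name classify_path and the account and container names are hypothetical, and the host-name check for blob URLs is a heuristic, not something Azure ML itself exposes.

```python
from urllib.parse import urlparse


def classify_path(path: str) -> str:
    """Guess which of the four supported path kinds `path` looks like."""
    parsed = urlparse(path)
    scheme = parsed.scheme.lower()
    if scheme == "azureml":
        return "datastore"
    if scheme == "abfss":
        return "azure storage"
    if scheme in ("http", "https"):
        # Blob URLs also use https, so tell them apart by host name.
        if parsed.netloc.lower().endswith(".blob.core.windows.net"):
            return "azure storage"
        return "public http(s)"
    # Anything else (./data, /mnt/data, C:\data) is treated as local.
    return "local"


# Hypothetical examples of each kind.
for example in (
    "./data/file_1.parquet",
    "https://example.com/data/file_1.parquet",
    "https://myaccount.blob.core.windows.net/mycontainer/data",
    "azureml://datastores/workspaceblobstore/paths/data",
):
    print(example, "->", classify_path(example))
```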

Create a YAML file in the folder that was created as the asset. The filename can be anything (e.g. filename.yml).

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
type: uri_folder
name: <name_of_data>
description: <description goes here>
path: <path>
To create the data asset using the CLI:
az ml data create -f filename.yml

To register a specific file as the data asset:

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
# Supported paths include:
# local: ./<path>/<file>
# blob:  https://<account_name>.blob.core.windows.net/<container_name>/<path>/<file>
# ADLS gen2: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file>
# Datastore: azureml://datastores/<data_store_name>/paths/<path>/<file>
type: uri_file
name: <name>
description: <description>
path: <uri>

Replace the placeholders in each path with the values for your own workspace and storage.

To create the MLTable file for the data asset, create a YAML-formatted file named MLTable with a paths pattern like the one below, adapted to your data:

type: mltable
paths:
  - pattern: ./*.filetypeextension
transformations:
  - read_delimited:
      delimiter: ,
      encoding: ascii
      header: all_files_same_headers

Use the following Python code to load the MLTable:

import mltable
table1 = mltable.load(uri="./data")
df = table1.to_pandas_dataframe()

To create the MLTable data asset, use the YAML below:

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
# path must point to the **folder** containing the MLTable artifact (MLTable file + data)
# Supported paths include:
# blob:  https://<account_name>.blob.core.windows.net/<container_name>/<path>
type: mltable
name: <name_of_data>
description: <description goes here>
path: <path>
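Since the path has to be a folder and not a file pattern, a small generator for this YAML can guard against the mistake. The sketch below fills in the template and rejects glob characters; the function name render_mltable_asset and the example values are assumptions, and the values are inserted verbatim, so anything containing YAML special characters would need quoting.

```python
ASSET_TEMPLATE = """\
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
type: mltable
name: {name}
description: {description}
path: {path}
"""


def render_mltable_asset(name: str, description: str, folder_uri: str) -> str:
    """Render a data asset YAML whose path points at the folder holding the MLTable file."""
    if "*" in folder_uri:
        # The file pattern belongs in the MLTable file, not in the asset YAML.
        raise ValueError("path must be a folder, not a glob pattern")
    return ASSET_TEMPLATE.format(name=name, description=description, path=folder_uri)


# Hypothetical values; the datastore and folder names depend on your workspace.
print(render_mltable_asset(
    "test-data",
    "Basic example for parquet files",
    "azureml://datastores/workspaceblobstore/paths/data",
))
```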

In this case the data lives in blob storage, so use the blob form of the path.

The same command creates the MLTable data asset:

az ml data create -f <file-name>.yml
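If the command should be run from a script, for example to register new versions regularly, it can be assembled in Python first. This is only a sketch: it assumes the Azure CLI with the ml extension is installed and logged in, and the resource group and workspace names are placeholders. The function only builds the argument list and does not call Azure.

```python
import shlex
from typing import Optional


def build_data_create_command(
    yaml_file: str,
    resource_group: Optional[str] = None,
    workspace: Optional[str] = None,
) -> list[str]:
    """Build the argument list for `az ml data create` without running it."""
    cmd = ["az", "ml", "data", "create", "--file", yaml_file]
    if resource_group:
        cmd += ["--resource-group", resource_group]
    if workspace:
        cmd += ["--workspace-name", workspace]
    return cmd


# Placeholder names; replace them with your own resource group and workspace.
cmd = build_data_create_command("filename.yml", "my-resource-group", "my-workspace")
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```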
