
I uploaded parquet files to blob storage and created a data asset via the Azure ML GUI. The steps are precise and clear, and the outcome is as desired. In future I would like to use the CLI to create the data asset and new versions of it.

The base command would be az ml data create -f <file-name>.yml. The docs provide a minimal example of an MLTable file, which should reside next to the parquet files.

# directory in blob storage
├── data
│   ├── MLTable
│   ├── file_1.parquet
│   ├── ...
│   └── file_n.parquet

I am still not sure how to properly specify those files in order to create a tabular dataset with column conversion.

Do I need to specify the full path or the pattern in the yml file?

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json

type: mltable
name: Test data
description: Basic example for parquet files
path: azureml://datastores/workspaceblobstore/paths/*/*.parquet # pattern or path to dir?

I am even less sure about the MLTable file itself:

type: mltable

paths:
  - pattern: ./*.parquet
transformations:
  - read_parquet:
      # what comes here?

For example, I have a column of dates in the format %Y-%m-%d %H:%M:%S which should be converted to a timestamp. (In the GUI, at least, I can provide this information!)
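
A format string such as this one can be checked locally before it goes into any file. The sketch below uses Python's standard `datetime.strptime`; it assumes the MLTable datetime formats follow the same strptime-style directives, and the sample value is made up for illustration.

```python
from datetime import datetime

# Hypothetical sample value; the real column content is not shown here.
sample = "2022-11-03 14:25:07"
fmt = "%Y-%m-%d %H:%M:%S"

# strptime raises ValueError if the value does not match the format.
parsed = datetime.strptime(sample, fmt)
print(parsed.isoformat())
```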

Any help on this topic or hidden links to documentation would be great.

2 Answers


  1. Chosen as BEST ANSWER

    A working MLTable file to convert string columns from parquet files looks like this:

    ---
    type: mltable
    paths:
      - pattern: ./*.parquet
    transformations:
      - read_parquet:
          include_path_column: false
      - convert_column_types:
          - columns: column_a
            column_type:
              datetime:
                formats:
                  - "%Y-%m-%d %H:%M:%S"
      - convert_column_types:
          - columns: column_b
            column_type:
              datetime:
                formats:
                  - "%Y-%m-%d %H:%M:%S"
    

    (Note: at the time of writing, specifying multiple columns as an array, e.g. columns: [column_a, column_b], did not work, hence one convert_column_types step per column.)
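
With many datetime columns, writing one convert_column_types step per column by hand gets tedious. A minimal sketch of generating the file text with plain Python follows; the helper name `render_mltable` is hypothetical, it only produces text in the layout shown above, and it assumes column names that need no YAML quoting.

```python
def render_mltable(columns, fmt="%Y-%m-%d %H:%M:%S", pattern="./*.parquet"):
    """Return MLTable YAML text with one convert_column_types step per column."""
    lines = [
        "type: mltable",
        "paths:",
        f"  - pattern: {pattern}",
        "transformations:",
        "  - read_parquet:",
        "      include_path_column: false",
    ]
    # One separate step per column, mirroring the layout that worked above.
    for column in columns:
        lines += [
            "  - convert_column_types:",
            f"      - columns: {column}",
            "        column_type:",
            "          datetime:",
            "            formats:",
            f'              - "{fmt}"',
        ]
    return "\n".join(lines) + "\n"


text = render_mltable(["column_a", "column_b"])
print(text)
```

The result can be written to a file named MLTable next to the parquet files.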


  2. Before starting, check the prerequisites: you need a valid Azure subscription and an Azure ML workspace.

    Install the required mltable library.

    Azure ML supports four kinds of path for the path parameter:

    • Local computer path

    • Path on public server like HTTP/HTTPS

    • Path on azure storage (Like blob in this case)

    • Path on datastore
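
To make the four kinds concrete, here is a rough sketch that classifies a path string by its prefix. The function name `classify_path` and the category labels are made up for illustration; the prefixes are the ones listed in the YAML comments further down, and other URI schemes are not handled.

```python
def classify_path(path: str) -> str:
    """Guess which of the four supported path kinds a string belongs to."""
    if path.startswith("azureml://"):
        return "datastore"
    if path.startswith("abfss://"):
        return "azure storage"
    if path.startswith(("https://", "http://")):
        # Blob URLs are HTTPS URLs on the storage account's blob endpoint.
        if ".blob.core.windows.net/" in path:
            return "azure storage"
        return "public http(s)"
    # Anything without a recognized scheme is treated as a local path.
    return "local"


print(classify_path("azureml://datastores/workspaceblobstore/paths/data/"))
```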

    Create a YAML file in the folder that was registered as the asset.
    The file name can be anything (for example, filename.yml).

    $schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
    type: uri_folder
    name: <name_of_data>
    description: <description goes here>
    path: <path>

    Then create the data asset using the CLI:

    az ml data create -f filename.yml
    

    To register a single file as the data asset:

    $schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
    # Supported paths include:
    # local: ./<path>/<file>
    # blob:  https://<account_name>.blob.core.windows.net/<container_name>/<path>/<file>
    # ADLS gen2: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file>
    # Datastore: azureml://datastores/<data_store_name>/paths/<path>/<file>
    type: uri_file
    name: <name>
    description: <description>
    path: <uri>
    

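    Replace the placeholders in each path with the values for your own storage account and workspace.

    To create an MLTable as the data asset:

    Create a file named MLTable with a path pattern that matches your data. The example below reads delimited text files; for parquet files, use read_parquet instead of read_delimited.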

    type: mltable
    paths:
      - pattern: ./*.filetypeextension
    transformations:
      - read_delimited:
          delimiter: ","
          encoding: ascii
          header: all_files_same_headers
    

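    Use the following Python code to load the MLTable:

    import mltable

    # Load the table definition from the folder that contains the MLTable file
    table1 = mltable.load(uri="./data")
    # Materialize the table as a pandas DataFrame
    df = table1.to_pandas_dataframe()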
    

    To create the MLTable data asset, use the following YAML:

    $schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
    # path must point to **folder** containing MLTable artifact (MLTable file + data files)
    # Supported paths include:
    # blob:  https://<account_name>.blob.core.windows.net/<container_name>/<path>
    type: mltable
    name: <name_of_data>
    description: <description goes here>
    path: <path>
    

    In this case the data lives in blob storage, so a blob path is used.

    The same command creates the MLTable data asset:

    az ml data create -f <file-name>.yml
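
For scripted use, such as registering new versions of the asset, the command line can be assembled in Python. This is only a sketch: the helper name `build_data_create_command` is hypothetical, and it assumes the `az ml data create` options `--file`, `--version`, `--resource-group` and `--workspace-name` of the Azure ML CLI v2 extension.

```python
import shlex


def build_data_create_command(yaml_file, version=None,
                              resource_group=None, workspace=None):
    """Build the argument list for `az ml data create`."""
    cmd = ["az", "ml", "data", "create", "--file", yaml_file]
    if version is not None:
        # Overrides any version given in the YAML file.
        cmd += ["--version", str(version)]
    if resource_group:
        cmd += ["--resource-group", resource_group]
    if workspace:
        cmd += ["--workspace-name", workspace]
    return cmd


command = build_data_create_command("data.yml", version=2)
print(shlex.join(command))
```

The list can be passed to `subprocess.run(command, check=True)` once the CLI is installed and logged in.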
    