
I get a disk full error while running a model training job using the Azure ML SDK, launched from Azure DevOps. I created a custom environment inside the Azure ML workspace and used it for the job.

I am using Azure CLI tasks in Azure DevOps to launch these training jobs. How can I resolve the disk full issue?

Error message shown in the DevOps training task:

"error": {
        "code": "UserError",
        "message": "{"Compliant":"Disk full while running job. Please consider reducing amount of data accessed, or upgrading VM SKU. Total space: 14045 MB, available space: 1103 MB."}n{n  "code": "DiskFullError",n  "target": "",n  "category": "UserError",n  "error_details": []n}",
        "messageParameters": {},
        "details": []
    },

The .runconfig file for the training job:

 framework: Python
 script: cnn_training.py
 communicator: None
 autoPrepareEnvironment: true
 maxRunDurationSeconds:
 nodeCount: 1
 environment:
   name: cnn_training
   python:
     userManagedDependencies: true
     interpreterPath: python
   docker:
     enabled: true
     baseImage: 54646eeace594cf19143dad3c7f31661.azurecr.io/azureml/azureml_b17300b63a1c2abb86b2e774835153ee
     sharedVolumes: true
     gpuSupport: false
     shmSize: 2g
     arguments: []
 history:
   outputCollection: true
   snapshotProject: true
   directoriesToWatch:
   - logs
 dataReferences:
   workspaceblobstore:
     dataStoreName: workspaceblobstore
     pathOnDataStore: dataname
     mode: download
     overwrite: true
     pathOnCompute:

Is there additional configuration needed to resolve the disk full issue? Are there any changes to be made in the .runconfig file?

2 Answers


  1. According to your error message below, we suppose the issue results from a lack of storage space on your compute cluster or VM SKU.

    Disk full while running job. Please consider reducing amount of data accessed, or upgrading VM SKU. Total space: 14045 MB, available space: 1103 MB.

    I suggest you consider the three steps below, then test again.

    1. Clear the storage cache.

    2. Upgrade your cluster storage size, for example by moving to a VM SKU with a larger local disk (see the sketch after this list).

    3. Optimize your machine learning resource size.
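
    For step 2, and for the error message's own suggestion to reduce the amount of data accessed, a minimal sketch with the Azure ML Python SDK (v1) is below. It provisions a cluster on a larger-disk SKU and mounts the blob datastore instead of downloading it, so the dataset is streamed rather than copied onto the node's ~14 GB disk. The cluster name and VM SKU are placeholders, not values from your setup; check the Azure VM sizes documentation for actual disk capacities.

        # Sketch only: provision a cluster with a larger local disk and mount the
        # datastore instead of downloading it onto the node. Names and SKU below
        # are placeholders, not taken from the question.
        from azureml.core import Datastore, Experiment, ScriptRunConfig, Workspace
        from azureml.core.compute import AmlCompute, ComputeTarget

        ws = Workspace.from_config()  # assumes a config.json for the workspace

        # 1) Choose a VM SKU whose local/temp disk is larger than the ~14 GB reported
        #    in the error; verify sizes in the Azure VM documentation.
        compute_config = AmlCompute.provisioning_configuration(
            vm_size="STANDARD_DS12_V2",  # placeholder SKU
            max_nodes=1,
        )
        cluster = ComputeTarget.create(ws, "cnn-cluster-large", compute_config)
        cluster.wait_for_completion(show_output=True)

        # 2) Mount the blob datastore rather than downloading it (the runconfig above
        #    uses mode: download), so training data is read on demand.
        datastore = Datastore.get(ws, "workspaceblobstore")
        data_ref = datastore.path("dataname").as_mount()

        src = ScriptRunConfig(
            source_directory=".",
            script="cnn_training.py",
            arguments=[str(data_ref)],
            compute_target=cluster,
        )
        src.run_config.data_references[data_ref.data_reference_name] = data_ref.to_config()
        Experiment(ws, "cnn_training").submit(src)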

    =========================

    Updated 11/10

    Hi L_Jay,
    You could refer to Azure Machine Learning to upgrade your subscription to a better-performing instance.

  2. I have a suspicion your disk-full error is due to memory leaking into swap. Double-check that you are not creating extraneous objects in your code, and that you are not loading too much training data without clearing it out.

    I have made this mistake on a local machine: front-loading all my data into my ML script and maxing out my memory, as opposed to loading data piecewise and deleting it after each training iteration (see the sketch below).
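
    As a rough illustration of the piecewise idea (nothing here is taken from cnn_training.py; the .npy shard layout and the train_step callable are assumptions):

        # Sketch: load one shard at a time, train on it, then free it before the
        # next one, instead of front-loading the whole dataset.
        import gc
        import glob

        import numpy as np

        def train_on_shards(shard_dir, train_step, epochs=1):
            """Iterate over .npy shards, keeping only one shard in memory at a time."""
            shard_paths = sorted(glob.glob(f"{shard_dir}/*.npy"))
            for _ in range(epochs):
                for path in shard_paths:
                    shard = np.load(path)  # only this shard is resident
                    train_step(shard)      # caller-supplied training step
                    del shard              # drop the reference so it can be freed
                    gc.collect()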

    Also, this is a guess, but have you tried modifying your shmSize: 2g parameter? https://docs.docker.com/engine/reference/run/#runtime-constraints-on-resources
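
    If you want to try that without hand-editing the .runconfig, roughly the same change can be made through the SDK. A sketch, assuming azureml-core (SDK v1) and the cnn_training environment from the question:

        # Sketch only: raise the Docker shared-memory size via the SDK instead of
        # editing shmSize in the .runconfig by hand.
        from azureml.core import Environment, ScriptRunConfig, Workspace

        ws = Workspace.from_config()                              # assumes a config.json
        env = Environment.get(workspace=ws, name="cnn_training")  # custom environment from the question
        env.docker.shm_size = "8g"                                # runconfig above shows 2g

        src = ScriptRunConfig(source_directory=".", script="cnn_training.py")
        src.run_config.environment = env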
