I get a disk full error while running a model training job using the Azure ML SDK, launched from Azure DevOps. I created a custom environment inside the Azure ML workspace and used it for the job.
I am using Azure CLI tasks in Azure DevOps to launch these training jobs. How can I resolve the disk full issue?
Error Message shown in the DevOps Training Task:
"error": {
"code": "UserError",
"message": "{"Compliant":"Disk full while running job. Please consider reducing amount of data accessed, or upgrading VM SKU. Total space: 14045 MB, available space: 1103 MB."}n{n "code": "DiskFullError",n "target": "",n "category": "UserError",n "error_details": []n}",
"messageParameters": {},
"details": []
},
The .runconfig file for the training job:
framework: Python
script: cnn_training.py
communicator: None
autoPrepareEnvironment: true
maxRunDurationSeconds:
nodeCount: 1
environment:
  name: cnn_training
  python:
    userManagedDependencies: true
    interpreterPath: python
  docker:
    enabled: true
    baseImage: 54646eeace594cf19143dad3c7f31661.azurecr.io/azureml/azureml_b17300b63a1c2abb86b2e774835153ee
    sharedVolumes: true
    gpuSupport: false
    shmSize: 2g
    arguments: []
history:
  outputCollection: true
  snapshotProject: true
  directoriesToWatch:
  - logs
dataReferences:
  workspaceblobstore:
    dataStoreName: workspaceblobstore
    pathOnDataStore: dataname
    mode: download
    overwrite: true
    pathOnCompute:
Is there additional configuration needed to avoid the disk full issue? Are there any changes to be made in the .runconfig file?
2 Answers
Based on the error message below, we suspect your issue results from a lack of disk space on your compute cluster's VM SKU:
Disk full while running job. Please consider reducing amount of data accessed, or upgrading VM SKU. Total space: 14045 MB, available space: 1103 MB.
I suggest you consider the three steps below and then test again (a sketch of steps 2 and 3 follows the list):
1. Clear the storage cache.
2. Upgrade your cluster's storage size by picking a VM SKU with a larger local disk.
3. Optimize the size of your machine learning resources, for example by reducing the amount of data copied onto the node.
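Here is a minimal sketch of steps 2 and 3, assuming the Azure ML Python SDK v1 (which the question uses). The cluster name "cpu-cluster", the STANDARD_DS12_V2 SKU, and Workspace.from_config() are illustrative assumptions, not taken from your setup. Note that your .runconfig uses mode: download, which copies the whole dataset onto the node's ~14 GB local disk; switching the data reference to mount streams it from blob storage instead:

from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.runconfig import DataReferenceConfiguration, RunConfiguration

ws = Workspace.from_config()  # assumes a config.json for your workspace

# Step 2: provision (or recreate) the cluster with a VM SKU that has a
# larger local disk. Cluster name and SKU here are placeholders.
compute_config = AmlCompute.provisioning_configuration(
    vm_size="STANDARD_DS12_V2",
    max_nodes=1,
)
cluster = ComputeTarget.create(ws, "cpu-cluster", compute_config)
cluster.wait_for_completion(show_output=True)

# Step 3: mount the datastore instead of downloading it, so training data
# is streamed from blob storage rather than copied onto the node's disk.
run_config = RunConfiguration()
run_config.data_references = {
    "workspaceblobstore": DataReferenceConfiguration(
        datastore_name="workspaceblobstore",
        path_on_datastore="dataname",
        mode="mount",  # your posted .runconfig uses "download"
        overwrite=True,
    )
}

Equivalently, you can edit the .runconfig directly and change mode: download to mode: mount under dataReferences.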
=========================
Updated 11/10
Hi L_Jay,
You could refer to the Azure Machine Learning documentation on VM sizes and upgrade your subscription to a better-performing instance.
I suspect your disk fills up because memory is leaking into swap. Double-check that your code is not creating extraneous objects, and that you are not loading too much training data at once without clearing it out.
I have made this mistake on a local machine: loading all my data up front in my ML script and maxing out memory, as opposed to loading the data piecewise and deleting each piece after a training iteration.
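Here is a rough sketch of that piecewise pattern (train_data.npy and train_step are hypothetical stand-ins for your own data file and training code):

import gc

import numpy as np

def train_step(batch):
    # placeholder for one real training update on the batch
    pass

CHUNK_ROWS = 10_000
data = np.load("train_data.npy", mmap_mode="r")  # memory-map: no full copy in RAM
for start in range(0, len(data), CHUNK_ROWS):
    batch = np.asarray(data[start:start + CHUNK_ROWS])  # materialize one chunk
    train_step(batch)
    del batch     # drop the reference once the step is done...
    gc.collect()  # ...so memory is reclaimed instead of spilling to swap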
Also, this is a guess, but have you tried modifying your
shmSize: 2g
parameter? https://docs.docker.com/engine/reference/run/#runtime-constraints-on-resources
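If you want to try that from the SDK rather than editing the .runconfig, a minimal sketch assuming Azure ML SDK v1 (8g is just an example value):

from azureml.core.runconfig import RunConfiguration

run_config = RunConfiguration()
run_config.environment.docker.enabled = True
run_config.environment.docker.shm_size = "8g"  # raise the container's shared-memory size from 2g

The equivalent .runconfig change is shmSize: 8g under the docker section.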