I am following this documentation page to understand SageMaker’s distributed training feature.
It says here that:
The SageMaker distributed training libraries are available only through the AWS deep learning containers for the TensorFlow, PyTorch, and HuggingFace frameworks within the SageMaker training platform.
Does this mean that we cannot use SageMaker distributed training to train machine learning models with traditional machine learning algorithms such as linear regression, random forest or XGBoost?
I have a use case where the dataset is very large, and distributed training could help through data parallelism and model parallelism. What other options can be recommended to avoid loading large amounts of data into memory on a training instance?
2 Answers
SageMaker training offers various built-in algorithms for tabular data. The KNN, XGBoost, Linear Learner, and Factorization Machines algorithms are parallelizable (they can run on multiple instances) and support data streaming, so there is no hard limit on dataset size.
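A rough sketch of such a job, assuming the built-in Linear Learner algorithm, a CSV dataset in S3 (label in the first column), and placeholder bucket names and hyperparameters:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # assumes a SageMaker execution role is available

# Built-in algorithm container for Linear Learner in the current region
container = image_uris.retrieve("linear-learner", session.boto_region_name)

estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=2,              # parallelizable: train across two instances
    instance_type="ml.m5.xlarge",
    input_mode="Pipe",             # stream data from S3 instead of downloading it all
    output_path="s3://my-bucket/linear-learner/output",  # placeholder bucket
    sagemaker_session=session,
)

estimator.set_hyperparameters(
    predictor_type="regressor",
    feature_dim=50,                # placeholder: number of features in the dataset
    mini_batch_size=1000,
)

train_input = TrainingInput(
    "s3://my-bucket/linear-learner/train/",  # placeholder S3 prefix
    content_type="text/csv",
)
estimator.fit({"train": train_input})
```

With Pipe input mode the training data is streamed from S3 to the algorithm, so the instance does not need to hold the full dataset in memory or on disk.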
Beyond the built-in algorithms, SageMaker also supports bringing your own training script.
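For the bring-your-own-script route, here is a rough sketch of enabling the SageMaker distributed data parallel library on a PyTorch estimator. The script name, source directory, and S3 paths are placeholders, and the library requires supported GPU instance types such as ml.p3.16xlarge:

```python
import sagemaker
from sagemaker.pytorch import PyTorch

role = sagemaker.get_execution_role()  # assumes a SageMaker execution role

estimator = PyTorch(
    entry_point="train.py",            # placeholder: your own training script
    source_dir="src",                  # placeholder: directory containing train.py
    role=role,
    framework_version="1.13.1",
    py_version="py39",
    instance_count=2,
    instance_type="ml.p3.16xlarge",
    # Enable the SageMaker distributed data parallel library for this job
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

estimator.fit({"train": "s3://my-bucket/train/"})  # placeholder S3 prefix
```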
I guess the parallel implementation of the SageMaker XGBoost algorithm can be called a "data-parallel" approach, since the model is copied to multiple instances and the data is distributed across them (when the input is configured with distribution="ShardedByS3Key" in sagemaker.inputs.TrainingInput).
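A rough sketch of that setup, assuming the built-in XGBoost container, a CSV dataset split across many S3 objects, and placeholder bucket names and hyperparameters:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # assumes a SageMaker execution role

container = image_uris.retrieve("xgboost", session.boto_region_name, version="1.7-1")

estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=2,                  # the model is trained across both instances
    instance_type="ml.m5.2xlarge",
    output_path="s3://my-bucket/xgboost/output",  # placeholder bucket
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="reg:squarederror", num_round=100)

train_input = TrainingInput(
    "s3://my-bucket/xgboost/train/",   # placeholder: prefix containing many files
    content_type="text/csv",
    distribution="ShardedByS3Key",     # each instance receives a different subset of the files
)
estimator.fit({"train": train_input})
```

With ShardedByS3Key, each instance only downloads its own shard of the S3 objects, instead of every instance receiving a full copy of the dataset.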
The "model parallel" approach is probably more applicable to neural networks. The SageMaker model parallelism library (smdistributed.modelparallel.torch, imported as smp) provides a @smp.step decorator that wraps the training step so the library can split each batch into microbatches and pipeline them across the model partitions, while the layers of the network are partitioned across devices by the library.
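A rough sketch of how that older (v1.x-style) smdistributed.modelparallel.torch API looks inside the training script; the model, optimizer settings, and loss are placeholders, and this is not a complete training script:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import smdistributed.modelparallel.torch as smp

smp.init()  # initialize the model parallel library (configuration comes from the training job)

# Placeholder model; the library decides how to partition it across devices
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10))
model = smp.DistributedModel(model)
optimizer = smp.DistributedOptimizer(torch.optim.Adam(model.parameters(), lr=1e-3))


@smp.step
def train_step(model, data, target):
    # The library splits the batch into microbatches and pipelines them
    # across the model partitions; backward must go through model.backward.
    output = model(data)
    loss = F.cross_entropy(output, target)
    model.backward(loss)
    return loss


def train_batch(data, target):
    optimizer.zero_grad()
    loss_mb = train_step(model, data, target)  # returns per-microbatch results
    optimizer.step()
    return loss_mb.reduce_mean()               # average the loss over microbatches
```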