Why is Amazon SageMaker custom sklearn model upload & endpoint creation failing due to ModuleNotFoundError?

Sharhad
May 15, 2023

I trained a sklearn model and stored it as a .joblib file. It is a large model, about 13.5 GB. You can download it here

This is my script to train the model:

import os
import pickle
import pandas as pd
import joblib

from sklearn.utils import shuffle
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

class Train:
    def __init__(self, clean_filename = None, model_filename = None, data_path = None, model_path = None):

        self.clean_filename = clean_filename if clean_filename else 'podcasts_en_cleaned.csv'
        self.model_filename = model_filename if model_filename else 'model.joblib'
        self.data_path = data_path if data_path else '../data/'
        self.model_path = model_path if model_path else '../model/'

        if not os.path.isdir(self.model_path):
            os.makedirs(self.model_path)

        self.train(clean_filename = self.clean_filename, model_filename = self.model_filename)
        
    def get_data(self, clean_filename):
        print('Starting Training')
        df = shuffle(pd.read_csv(os.path.join(self.data_path, clean_filename)).dropna())
        X = df['name_title']
        y = df['target']
        return X, y
    
    def train(self, clean_filename , model_filename):
        X, y = self.get_data(clean_filename = clean_filename)
        clf = Pipeline([
            ('vect', CountVectorizer(stop_words = 'english')),
            ('tfidf', TfidfTransformer()),
            ('clf', RandomForestClassifier()),
        ])
        model = clf.fit(X, y)
        with open(os.path.join(self.model_path, model_filename), 'wb') as file:
            joblib.dump(model, file)
        print('Trained Model saved at {}'.format(os.path.join(self.model_path, model_filename)))

I want to upload this model to SageMaker and create an endpoint to access it. To do so, I have been following this tutorial, with a few changes.

My inference.py file is as follows:

import joblib
import os
import json

"""
Deserialize fitted model
"""
def model_fn(model_dir):
    model = joblib.load(os.path.join(model_dir, "model.joblib"))
    return model

"""
input_fn
    request_body: The body of the request sent to the model.
    request_content_type: (string) specifies the format/variable type of the request
"""
def input_fn(request_body, request_content_type):
    if request_content_type == 'application/json':
        request_body = json.loads(request_body)
        inpVar = request_body['Input']
        return inpVar
    else:
        raise ValueError("This model only supports str input")
"""
predict_fn
    input_data: returned array from input_fn above
    model (sklearn model) returned model loaded from model_fn above
"""
def predict_fn(input_data, model):
    return model.predict(input_data)

"""
output_fn
    prediction: the returned value from predict_fn above
    content_type: the content type the endpoint expects to be returned. Ex: JSON, string
"""

def output_fn(prediction, content_type):
    res = int(prediction[0])
    respJSON = {'Output': res}
    return respJSON

My main.py file is as follows. I had to change instance_type in image_uri and endpoint_config_response to ml.m5.2xlarge to accommodate the size of the model, and I updated the image_uri version to version="1.2-1".

import boto3
import json
import os
import joblib
import pickle
import tarfile
import sagemaker
from sagemaker.estimator import Estimator
import time
from time import gmtime, strftime
import subprocess


#Setup
client = boto3.client(service_name="sagemaker")
runtime = boto3.client(service_name="sagemaker-runtime")
boto_session = boto3.session.Session()
s3 = boto_session.resource('s3')
region = boto_session.region_name
print(region)
sagemaker_session = sagemaker.Session()
role = 'arn role'

#Build tar file with model data + inference code
bashCommand = "tar -cvpzf model.tar.gz model.joblib inference.py"
process = subprocess.Popen(bashCommand.split(), stdout=subprocess.PIPE)
output, error = process.communicate()

# retrieve sklearn image
image_uri = sagemaker.image_uris.retrieve(
    framework="sklearn",
    region=region,
    version="1.2-1",
    py_version="py3",
    instance_type='ml.m5.2xlarge',
)

#Bucket for model artifacts
default_bucket = 'bucketname'
print(default_bucket)

#Upload tar.gz to bucket
model_artifacts = f"s3://{default_bucket}/model.tar.gz"
response = s3.meta.client.upload_file('model.tar.gz', default_bucket, 'model.tar.gz')

#Step 1: Model Creation
model_name = "sklearn-test" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Model name: " + model_name)
create_model_response = client.create_model(
    ModelName=model_name,
    Containers=[
        {
            "Image": image_uri,
            "Mode": "SingleModel",
            "ModelDataUrl": model_artifacts,
            "Environment": {'SAGEMAKER_SUBMIT_DIRECTORY': model_artifacts,
                           'SAGEMAKER_PROGRAM': 'inference.py'} 
        }
    ],
    ExecutionRoleArn=role,
)
print("Model Arn: " + create_model_response["ModelArn"])


#Step 2: EPC Creation
sklearn_epc_name = "sklearn-epc" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
endpoint_config_response = client.create_endpoint_config(
    EndpointConfigName=sklearn_epc_name,
    ProductionVariants=[
        {
            "VariantName": "sklearnvariant",
            "ModelName": model_name,
            "InstanceType": 'ml.m5.2xlarge',
            "InitialInstanceCount": 1
        },
    ],
)
print("Endpoint Configuration Arn: " + endpoint_config_response["EndpointConfigArn"])


#Step 3: EP Creation
endpoint_name = "sklearn-local-ep" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print('endpoint name', endpoint_name)
create_endpoint_response = client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=sklearn_epc_name,
)
print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])


#Monitor creation
describe_endpoint_response = client.describe_endpoint(EndpointName=endpoint_name)
while describe_endpoint_response["EndpointStatus"] == "Creating":
    describe_endpoint_response = client.describe_endpoint(EndpointName=endpoint_name)
    print(describe_endpoint_response["EndpointStatus"])
    time.sleep(15)
print(describe_endpoint_response)

When I run the code, I get to the last step, where it tries to create the endpoint for 30 minutes and then fails. Looking at the CloudWatch logs, I see two errors:

[2023-05-07 11:16:56 +0000] [71] [ERROR] Error handling request /ping
Traceback (most recent call last):
  File "/miniconda3/lib/python3.8/site-packages/sagemaker_containers/_functions.py", line 93, in wrapper
    return fn(*args, **kwargs)
  File "/opt/ml/code/inference.py", line 9, in model_fn
    model = joblib.load(os.path.join(model_dir, "model.joblib"))
  File "/miniconda3/lib/python3.8/site-packages/joblib/numpy_pickle.py", line 658, in load
    obj = _unpickle(fobj, filename, mmap_mode)
  File "/miniconda3/lib/python3.8/site-packages/joblib/numpy_pickle.py", line 577, in _unpickle
    obj = unpickler.load()
  File "/miniconda3/lib/python3.8/pickle.py", line 1212, in load
    dispatch[key[0]](self)
  File "/miniconda3/lib/python3.8/pickle.py", line 1537, in load_stack_global
    self.append(self.find_class(module, name))
  File "/miniconda3/lib/python3.8/pickle.py", line 1579, in find_class
    __import__(module, level=0)
ModuleNotFoundError: No module named 'scipy.sparse._csr'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/miniconda3/lib/python3.8/site-packages/gunicorn/workers/base_async.py", line 55, in handle
    self.handle_request(listener_name, req, client, addr)
  File "/miniconda3/lib/python3.8/site-packages/gunicorn/workers/ggevent.py", line 143, in handle_request
    super().handle_request(listener_name, req, sock, addr)
  File "/miniconda3/lib/python3.8/site-packages/gunicorn/workers/base_async.py", line 106, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File "/miniconda3/lib/python3.8/site-packages/sagemaker_sklearn_container/serving.py", line 140, in main
    user_module_transformer, execution_parameters_fn = import_module(serving_env.module_name,
  File "/miniconda3/lib/python3.8/site-packages/sagemaker_sklearn_container/serving.py", line 126, in import_module
    user_module_transformer.initialize()
  File "/miniconda3/lib/python3.8/site-packages/sagemaker_containers/_transformer.py", line 185, in initialize
    self._model = self._model_fn(_env.model_dir)
  File "/miniconda3/lib/python3.8/site-packages/sagemaker_containers/_functions.py", line 95, in wrapper
    six.reraise(error_class, error_class(e), sys.exc_info()[2])

Why are these errors happening?

These errors are in CloudWatch, not in my terminal, so please tell me how to install these libraries in SageMaker.

And how do I fix it?

I trained the model and ran this deployment script with Python 3.9.16.

And if there's no fix, how do I upload my model, which is large and was trained outside of SageMaker, to SageMaker and create an endpoint so I can use it for my web apps?

Answers

Chosen as BEST ANSWER
- Sharhad
- May 15, 2023 at 7:40 am
- 0 votes
Fixed it myself

Step 1: I made sure the environment the model was trained in and the SageMaker environment both used the same Python version. For me, both were 3.8.16.
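
A quick way to compare the two environments is to run the same check locally and inside `model_fn`, where the output ends up in CloudWatch. This is a minimal sketch: the helper names `module_available` and `environment_check` are hypothetical, and the default module list is just the one named in the traceback.

```python
import importlib.util
import platform


def module_available(dotted_name):
    """Return True if `dotted_name` can be found by the import system."""
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ModuleNotFoundError:
        # A parent package (for example scipy itself) is missing.
        return False


def environment_check(modules=("scipy.sparse._csr",)):
    """Report the Python version and whether each module can be located."""
    report = {"python": platform.python_version()}
    for name in modules:
        report[name] = module_available(name)
    return report


if __name__ == "__main__":
    for key, value in environment_check().items():
        print(f"{key}: {value}")
```

A pickle stores the module path of every class it contains, so if the training environment can locate `scipy.sparse._csr` and the container cannot, `joblib.load` will fail in the way shown in the question.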

Step 2: I added a requirements.txt file and included it in the tar file by changing this line in main.py:
```
bashCommand = "tar -cvpzf model.tar.gz model.joblib inference.py requirements.txt"
```
Make sure to list only the libraries that are actually needed.
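
One way to keep the file small and aligned with the training environment is to generate it from the versions installed locally. The sketch below is illustrative only: `pinned_requirements` is a hypothetical helper, and the package list is an assumption based on what the pickled pipeline appears to depend on.

```python
from importlib import metadata


def pinned_requirements(packages):
    """Build 'name==version' lines for the packages installed locally."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Skip anything that is not installed rather than guessing a version.
            continue
    return lines


if __name__ == "__main__":
    # Assumed package list; adjust it to what the model really needs.
    needed = ["scikit-learn", "scipy", "numpy", "joblib"]
    with open("requirements.txt", "w") as fh:
        fh.write("\n".join(pinned_requirements(needed)) + "\n")
```

Run it in the environment that produced model.joblib, so the pinned versions match the ones the pickle was written with.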

I also added these lines in inference.py to force SageMaker to install the libraries:
```
if __name__ == '__main__':
    os.system('pip install -r requirements.txt')
```
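
A slightly more defensive variant of that call invokes pip through the interpreter that is running the script, so the packages are installed into the same environment and a failed install raises an error instead of passing silently. This is only a sketch: both helper names are hypothetical, and it assumes, as the snippet above does, that this code actually runs when the container starts.

```python
import subprocess
import sys


def pip_install_command(requirements_path="requirements.txt"):
    """Build the pip command for the interpreter running this script."""
    return [sys.executable, "-m", "pip", "install", "-r", requirements_path]


def install_requirements(requirements_path="requirements.txt"):
    # check_call raises CalledProcessError if pip exits with a non-zero status,
    # so a failed install is not silently ignored.
    subprocess.check_call(pip_install_command(requirements_path))
```

Calling `install_requirements()` would take the place of the `os.system(...)` line.
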
Finally, I had to use a larger instance type, ml.m5.24xlarge, to make it work.


- RamVegiraju
- May 8, 2023 at 6:04 pm
- 0 votes
How are you creating this joblib file locally? What modules are you using to create it? Can you share the script you use to serialize it? Does local inference work? It would help to validate that first. This is also a large model artifact at 13 GB, so it might be beneficial to use Triton Inference Server with the Python or FIL backend on SageMaker to handle it (the default for SageMaker single-model endpoints is MMS).
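
One way to validate local inference before deploying is to call the four handlers from inference.py in the order the serving container would: `model_fn` once, then `input_fn`, `predict_fn` and `output_fn` for a request. The sketch below assumes that order and uses a hypothetical helper name, `run_handlers`; the sample payload in the comment is made up.

```python
import json


def run_handlers(model_fn, input_fn, predict_fn, output_fn, model_dir, payload):
    """Call the four handlers in sequence on a JSON payload and return the result."""
    model = model_fn(model_dir)
    data = input_fn(json.dumps(payload), "application/json")
    prediction = predict_fn(data, model)
    return output_fn(prediction, "application/json")


# Example usage, run from the directory that contains inference.py:
#   from inference import model_fn, input_fn, predict_fn, output_fn
#   print(run_handlers(model_fn, input_fn, predict_fn, output_fn,
#                      "../model/", {"Input": ["some podcast title"]}))
```

If this fails locally with the same Python and library versions as the container, the problem is in the artifact or the handlers rather than in the endpoint setup.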
