I have a cluster of servers; each server receives real-time authentication events as requests and returns a risk score for the incoming event, based on AI models that sit in S3.
This cluster serves multiple customers. Each customer has its own AI model in S3.
Each AI model file in S3 is ~50MB in size.
The problem:
Let’s say this cluster consists of 10 servers and serves 20 customers, so there are 20 corresponding AI models in S3.
In a naive solution, each server in the cluster might end up loading all 20 models from S3 into its memory.
20 (models) * 50MB (model size in S3) = 1GB of model data per server.
It takes a long time to download a model and load it into memory, and the total amount of model data is bounded by the memory capacity of the server.
And of course – these problems get bigger with scale.
So what are my options?
I know that there are out-of-the-box products for model lifecycle management, such as MLflow, Kubeflow, …
Do these products have a solution to the problem I raised?
Maybe use Redis as a cache layer? (A rough sketch of the kind of per-server cache I have in mind is below, after the limitation.)
Maybe use Redis as a cache layer in combination with MLflow or Kubeflow?
Any other solution?
Limitation:
I can’t have sticky sessions between the servers in that cluster, so I can’t ensure that all the requests of the same customer will end up on the same server.
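To make the cache idea concrete, here is roughly what I mean by a cache layer on each server: models are downloaded from S3 lazily on the first request for a customer and evicted LRU-style when the cache is full. Redis could sit behind this as a shared second-level cache. The bucket name, key layout, capacity and the loader below are placeholders, not my real setup.

```python
import io
from collections import OrderedDict

import boto3

MAX_CACHED_MODELS = 4          # placeholder: how many ~50MB models a server can afford to keep
BUCKET = "my-models-bucket"    # placeholder bucket name

s3 = boto3.client("s3")
_cache: "OrderedDict[str, object]" = OrderedDict()   # customer_id -> loaded model

def get_model(customer_id: str):
    """Return the customer's model, downloading it from S3 on first use, with LRU eviction."""
    if customer_id in _cache:
        _cache.move_to_end(customer_id)               # mark as most recently used
        return _cache[customer_id]

    buf = io.BytesIO()
    s3.download_fileobj(BUCKET, f"models/{customer_id}.bin", buf)   # placeholder key layout
    model = deserialize_model(buf.getvalue())

    _cache[customer_id] = model
    if len(_cache) > MAX_CACHED_MODELS:
        _cache.popitem(last=False)                    # evict the least recently used model
    return model

def deserialize_model(raw: bytes):
    """Placeholder: replace with the real loader (pickle, ONNX, TensorFlow, ...)."""
    return raw
```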
2 Answers
As far as I understand your problem, I’d use a separate serving server for each model. As a result, you’ll have 20 model serving servers, each of which loads only its own 50MB of model data and serves exactly one model. You’ll also need one server which stores the model metadata and is responsible for sending each incoming request to the related model serving server. This metadata holds the mapping of "customer vs model serving server endpoint".
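A minimal sketch of that routing layer, assuming the model servers expose a simple HTTP scoring endpoint (the mapping, URLs and response shape are made up for illustration):

```python
import requests

# Placeholder metadata: customer -> model serving server endpoint.
# In practice this would live in a database or config service, not in code.
CUSTOMER_TO_ENDPOINT = {
    "customer-a": "http://model-server-a:8080/score",
    "customer-b": "http://model-server-b:8080/score",
}

def route_request(customer_id: str, event: dict) -> float:
    """Forward the authentication event to the customer's model server and return its risk score."""
    endpoint = CUSTOMER_TO_ENDPOINT[customer_id]        # raises KeyError for an unknown customer
    response = requests.post(endpoint, json=event, timeout=2)
    response.raise_for_status()
    return response.json()["risk_score"]                # assumed response field
```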
Essentially, Kubeflow offers the above solution as a package, and it’s highly scalable since it uses Kubernetes for orchestration. For example, if someday you want to add a new customer, you can trigger a Kubeflow pipeline which trains your model, saves it to S3, deploys a separate model server within the Kubeflow cluster and updates the metadata. Kubeflow offers both automation through its pipeline approach and scalability through Kubernetes.
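A rough sketch of what such an onboarding pipeline could look like with the kfp SDK; the component bodies are placeholders and the step names are mine, not Kubeflow's:

```python
from kfp import dsl

@dsl.component
def train_model(customer_id: str) -> str:
    # Placeholder: train the customer's model, upload it to S3 and return the S3 URI.
    return f"s3://models/{customer_id}.bin"

@dsl.component
def deploy_model_server(model_uri: str, customer_id: str):
    # Placeholder: create a serving deployment for this model (e.g. a KServe InferenceService)
    # and update the customer -> endpoint metadata.
    pass

@dsl.pipeline(name="onboard-new-customer")
def onboard_customer(customer_id: str):
    train_task = train_model(customer_id=customer_id)
    deploy_model_server(model_uri=train_task.output, customer_id=customer_id)
```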
The cons of Kubeflow at the moment, in my opinion, are that the community is not big yet and the product is still maturing.
I haven’t used MLflow before, so I cannot give details about it.
As far as I understand your problem, this cannot be solved by any model serving library/framework alone. The server instance that gets the request for the risk score must load the corresponding model.
To solve this, you should route requests to a specific server instance depending on the tenant.
The "Deployment stamps" pattern could help you in this case. See https://learn.microsoft.com/en-us/azure/architecture/patterns/deployment-stamp for more details.
As a front door (see the pattern), NGINX or Spring Cloud Gateway could be a good solution. Just look at the request header (the Authorization header) to get the tenant/user and determine the appropriate server instance.
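Not an NGINX or Spring Cloud Gateway configuration, but a small Python sketch of the routing decision the front door has to make; the token format and the tenant-to-stamp mapping are assumptions for illustration:

```python
# Placeholder mapping of tenant -> deployment stamp (server instance) base URL.
TENANT_TO_STAMP = {
    "tenant-a": "http://stamp-a.internal",
    "tenant-b": "http://stamp-b.internal",
}

def tenant_from_authorization(auth_header: str) -> str:
    """Placeholder: in reality, decode the bearer token (e.g. a JWT claim) to find the tenant."""
    return auth_header.removeprefix("Bearer ").split(":", 1)[0]   # assumes a "tenant:secret" style token

def choose_upstream(auth_header: str) -> str:
    """Pick the stamp that hosts this tenant's model; the gateway then proxies the request there."""
    tenant = tenant_from_authorization(auth_header)
    return TENANT_TO_STAMP[tenant]
```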