I can install and run the Ollama service with a GPU on an EC2 instance and make API calls to it from a web app in the following way:
First, I create a Docker network so that the Ollama service and my web app share the same network:
docker network create my-net
Then I run the official Ollama Docker container to start the service:
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama --net my-net ollama/ollama
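To sanity-check that the container actually sees the GPU, I usually just look at the container logs (Ollama normally reports the compute backend it detected at startup; the grep pattern below is only a convenience, not an official diagnostic):
docker logs ollama 2>&1 | grep -iE "gpu|cuda" # look for a line indicating a detected CUDA/NVIDIA device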
Then I need to serve the model (LLM) with Ollama:
docker exec ollama ollama run <model_name> # like llama2, mistral, etc
Then I need to find out the IP address of the Ollama container on this Docker network and export it as an API endpoint URL:
export OLLAMA_API_ENDPOINT=$(docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' ollama)
And finally, I pass this endpoint URL to my web app so it can make API calls to Ollama:
docker run -d -p 8080:8080 -e OLLAMA_API_ENDPOINT --rm --name my-web-app --net my-net app
With this, if you go to the following URL:
http://<PUBLIC_IP_OF_THE_EC2_INSTANCE>:8080
you can see the web app (chatbot) running and able to make API calls (chat) to the LLM.
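For example, from inside the web app container (which is on my-net), a call to Ollama's generate endpoint looks roughly like this (the model name below is just an illustration):
curl http://$OLLAMA_API_ENDPOINT:11434/api/generate -d '{"model": "llama2", "prompt": "Why is the sky blue?"}'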
Now I want to deploy this app in our AWS Kubernetes cluster (EKS). For that, I wrote the following inference.yaml manifest to run Ollama and serve the LLM:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ollama-charlie-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/ollama
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-charlie-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-charlie
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama-charlie
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: ollama-charlie
    spec:
      nodeSelector:
        ollama-charlie-key: ollama-charlie-value
      initContainers:
        - name: download-llm
          image: ollama/ollama
          command: ["ollama", "run", "kristada673/solar-10.7b-instruct-v1.0-uncensored"]
          volumeMounts:
            - name: data
              mountPath: /root/.ollama
      containers:
        - name: ollama-charlie
          image: ollama/ollama
          volumeMounts:
            - name: data
              mountPath: /root/.ollama
          livenessProbe:
            tcpSocket:
              port: 80
            initialDelaySeconds: 120 # Adjust based on your app's startup time
            periodSeconds: 30
            failureThreshold: 2 # Pod is restarted after 2 consecutive failures
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: ollama-charlie-pvc
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-charlie-service
spec:
  selector:
    app: ollama-charlie
  ports:
    - protocol: TCP
      port: 11434
      targetPort: 11434
Here, ollama-charlie-key: ollama-charlie-value comes from the node group I created with a GPU instance type (g4dn.xlarge); these are the label key and value I gave to that node group.
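For context, a labelled GPU node group like that can be created with eksctl along these lines (the cluster name, region, and node group name below are placeholders, and the exact flags may vary by eksctl version):
eksctl create nodegroup \
  --cluster <my-eks-cluster> \
  --region <aws-region> \
  --name ollama-charlie-gpu \
  --node-type g4dn.xlarge \
  --nodes 1 \
  --node-labels "ollama-charlie-key=ollama-charlie-value"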
But there's a problem: when I do kubectl apply -f inference.yaml, the pod shows as pending and I get the following error:
Back-off restarting failed container download-llm in pod ollama-charlie-7745b595ff-5ldxt_default(57c6bba9-7d92-4cf8-a4ef-3b19f19023e4)
To diagnose it, when I do kubectl logs <pod_name> -c download-llm, I get:
Error: could not connect to ollama app, is it running?
This means that the Ollama service is not getting started. Could anyone help me figure out why, and edit the inference.yaml accordingly?
P.S.: Earlier, I tried with the following spec in inference.yaml:
spec:
  initContainers:
    - name: download-llm
      image: ollama/ollama
      command: ["ollama", "run", "kristada673/solar-10.7b-instruct-v1.0-uncensored"]
      volumeMounts:
        - name: data
          mountPath: /root/.ollama
  containers:
    - name: ollama-charlie
      image: ollama/ollama
      volumeMounts:
        - name: data
          mountPath: /root/.ollama
      resources:
        limits:
          nvidia.com/gpu: 1
Here I do not specify the node group I created, and instead request a generic NVIDIA GPU. That gave me an error, which is why I moved to specifying the key-value pair for the node group I created specifically for this deployment and removed the generic NVIDIA GPU resource request.
2 Answers
I can't comment, otherwise I would make this a comment, since I believe it's an unsatisfying answer.
From the NVIDIA docs on EKS, it appears you need a component running inside Kubernetes to manage the NVIDIA GPU for it to be available.
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/amazon-eks.html
I don't believe that you need the full-blown operator, but the device plugin appears to be required. This seems similar to how EKS/EC2 works with the CNI and EBS configurations.
Typically, device plugins can be installed with a Kubernetes manifest; if you dig around the docs you should be able to find one.
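For example, the NVIDIA device plugin is usually deployed as a DaemonSet with something like the following; check the NVIDIA/k8s-device-plugin README for the current manifest URL, since the version tag (and possibly the path) below may be out of date:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml # path/tag indicative; verify against the project's README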
In EKS, if you deploy a cluster with NVIDIA GPU nodes, the device plugin will be enabled by default. I assume that in your case you are using CPU-only nodes, but you are requesting NVIDIA GPU resources.
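You can check what your nodes actually advertise with something like this; if the GPU column shows <none>, the node either has no GPU or the device plugin is not running on it:
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"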
So for your inference.yaml to work, you need a GPU-based node. For this, you can follow this guide to create a GPU node group. Additionally, if you want to scale your deployment, you have to use something called Cluster Autoscaler or Karpenter. It can be overwhelming to set all of this up in EKS, so I suggest you stick with EC2 if the traffic or usage is limited.
But if you are looking for a solution on EKS, you can use the ollama-helm chart to deploy Ollama.
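Assuming the community otwld/ollama-helm chart (the repository URL and chart values may differ, so check its README), the install is roughly:
helm repo add ollama-helm https://otwld.github.io/ollama-helm/
helm repo update
helm install ollama ollama-helm/ollama --namespace ollama --create-namespace
# GPU support and the model(s) to pull are configured through the chart's values; see its README.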