I can install and run the Ollama service with a GPU on an EC2 instance and make API calls to it from a web app in the following way:
First, I create a Docker network so that the Ollama service and my web app share the same network:
docker network create my-net
Then I run the official Ollama Docker container to start the service:
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama --net my-net ollama/ollama
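To sanity-check that the container actually sees the GPU, I usually just look at the container logs (Ollama normally reports the compute backend it detected at startup; the grep pattern below is only a convenience, not an official diagnostic):
docker logs ollama 2>&1 | grep -iE "gpu|cuda" # look for a line indicating a detected CUDA/NVIDIA device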
Then I need to serve the model (LLM) with Ollama:
docker exec ollama ollama run <model_name> # like llama2, mistral, etc
Then I need to find out the IP address of the Ollama container on this Docker network and export it as an API endpoint URL:
export OLLAMA_API_ENDPOINT=$(docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' ollama)
And finally, I pass this endpoint URL to my web app so it can make API calls to Ollama:
docker run -d -p 8080:8080 -e OLLAMA_API_ENDPOINT --rm --name my-web-app --net my-net app
With this, if you go to the following URL:
http://<PUBLIC_IP_OF_THE_EC2_INSTANCE>:8080
you can see the web app (chatbot) running and able to make API calls (chat) to the LLM.
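For example, from inside the web app container (which is on my-net), a call to Ollama's generate endpoint looks roughly like this (the model name below is just an illustration):
curl http://$OLLAMA_API_ENDPOINT:11434/api/generate -d '{"model": "llama2", "prompt": "Why is the sky blue?"}'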
Now I want to deploy this app in our AWS Kubernetes cluster (EKS). For that, I wrote the following inference.yaml manifest to run Ollama and serve the LLM:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ollama-charlie-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/ollama
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-charlie-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-charlie
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama-charlie
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: ollama-charlie
    spec:
      nodeSelector:
        ollama-charlie-key: ollama-charlie-value
      initContainers:
        - name: download-llm
          image: ollama/ollama
          command: ["ollama", "run", "kristada673/solar-10.7b-instruct-v1.0-uncensored"]
          volumeMounts:
            - name: data
              mountPath: /root/.ollama
      containers:
        - name: ollama-charlie
          image: ollama/ollama
          volumeMounts:
            - name: data
              mountPath: /root/.ollama
          livenessProbe:
            tcpSocket:
              port: 80
            initialDelaySeconds: 120 # Adjust based on your app's startup time
            periodSeconds: 30
            failureThreshold: 2 # Pod is restarted after 2 consecutive failures
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: ollama-charlie-pvc
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-charlie-service
spec:
  selector:
    app: ollama-charlie
  ports:
    - protocol: TCP
      port: 11434
      targetPort: 11434
Here, ollama-charlie-key: ollama-charlie-value comes from the node group I created with a GPU instance type (g4dn.xlarge); these are the label key and value I gave to that node group.
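For context, a labelled GPU node group like that can be created with eksctl along these lines (the cluster name, region, and node group name below are placeholders, and the exact flags may vary by eksctl version):
eksctl create nodegroup \
  --cluster <my-eks-cluster> \
  --region <aws-region> \
  --name ollama-charlie-gpu \
  --node-type g4dn.xlarge \
  --nodes 1 \
  --node-labels "ollama-charlie-key=ollama-charlie-value"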
But there's a problem: when I do kubectl apply -f inference.yaml, the pod shows as pending and I get the following error:
Back-off restarting failed container download-llm in pod ollama-charlie-7745b595ff-5ldxt_default(57c6bba9-7d92-4cf8-a4ef-3b19f19023e4)
To diagnose it, when I do kubectl logs <pod_name> -c download-llm, I get:
Error: could not connect to ollama app, is it running?
This means that the Ollama service is not getting started. Could anyone help me figure out why, and edit the inference.yaml accordingly?
P.S.: Earlier, I tried with the following spec in inference.yaml:
spec:
  initContainers:
    - name: download-llm
      image: ollama/ollama
      command: ["ollama", "run", "kristada673/solar-10.7b-instruct-v1.0-uncensored"]
      volumeMounts:
        - name: data
          mountPath: /root/.ollama
  containers:
    - name: ollama-charlie
      image: ollama/ollama
      volumeMounts:
        - name: data
          mountPath: /root/.ollama
      resources:
        limits:
          nvidia.com/gpu: 1
Here I do not specify the node group I created, and instead request a generic NVIDIA GPU. That gave me an error, which is why I moved to specifying the key-value pair for the node group I created specifically for this deployment and removed the generic NVIDIA GPU resource request.
2 Answers
I can't comment, otherwise I would make this a comment, since I believe it's an unsatisfying answer.
From the NVIDIA docs on EKS, it appears you need a component running inside Kubernetes to manage the NVIDIA GPU for it to be available.
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/amazon-eks.html
I don't believe that you need the full-blown operator, but the device plugin appears to be required. This seems similar to how EKS/EC2 works with the CNI and EBS configurations.
Typically, device plugins can be installed with a Kubernetes manifest; if you dig around the docs you should be able to find one.
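For example, the NVIDIA device plugin is usually deployed as a DaemonSet with something like the following; check the NVIDIA/k8s-device-plugin README for the current manifest URL, since the version tag (and possibly the path) below may be out of date:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml # path/tag indicative; verify against the project's README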
In EKS, if you deploy a cluster with NVIDIA GPU nodes, the device plugin will be enabled by default. I assume that in your case you are using CPU-only nodes, but you are requesting NVIDIA GPU resources.
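You can check what your nodes actually advertise with something like this; if the GPU column shows <none>, the node either has no GPU or the device plugin is not running on it:
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"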
So for your inference.yaml to work, you need a GPU-based node. For this, you can follow this guide to create a GPU node group. Additionally, if you want to scale your deployment, you have to use something called Cluster Autoscaler or Karpenter. It can be overwhelming to set all of this up in EKS, so I suggest you stick with EC2 if the traffic or usage is limited.
But if you are looking for a solution on EKS, you can use the ollama-helm chart to deploy Ollama.
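Assuming the community otwld/ollama-helm chart (the repository URL and chart values may differ, so check its README), the install is roughly:
helm repo add ollama-helm https://otwld.github.io/ollama-helm/
helm repo update
helm install ollama ollama-helm/ollama --namespace ollama --create-namespace
# GPU support and the model(s) to pull are configured through the chart's values; see its README.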