
I’m running PyTorch on an Ubuntu Server 18.04 LTS machine. I have an NVIDIA GPU with 8 GB of RAM. I’d like to experiment with the new Llama-2-70B-chat model. I’m trying to use PEFT and bitsandbytes to reduce the hardware requirements, as described in the video linked below:

https://www.youtube.com/watch?v=6iHVJyX2e50

Is it possible to work with the Llama-2-70B-chat model on a single GPU with 8 GB of RAM? I don’t care if it’s quick, I just want to experiment and see what kind of quality responses I can get out of it.

2 Answers


  1. Running the 70B model on 8 GB of VRAM would be difficult even with quantization. Maybe you can try running it on Hugging Face? You would have to get quota for a single large A100 instance. The smaller models are also fairly capable, so give them a shot. For quantization, look at the llama.cpp project (GGML); it works with Llama v2 too.

    With 8 GB of VRAM you can try running the newer Code Llama models and also the smaller Llama v2 models. Try the Oobabooga web UI (it’s on GitHub) as a generic frontend with a chat interface, though in my experience it is a bit slow at generating responses.
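    If you want to try the PEFT/bitsandbytes route from the question on one of the smaller models, a minimal sketch with Hugging Face transformers might look like the snippet below. The model ID, prompt, and generation settings are just assumptions for illustration, and the official Llama 2 repos are gated, so you need to request access first.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # Hypothetical choice: the 7B chat model fits in 8 GB VRAM when loaded in 4-bit.
    model_id = "meta-llama/Llama-2-7b-chat-hf"

    # 4-bit NF4 quantization via bitsandbytes to cut memory usage.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",  # place the quantized layers on the GPU automatically
    )

    prompt = "Explain what quantization does to a language model."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))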

    Good luck!

  2. Short answer: No

    There is no way to run a Llama-2-70B-chat model entirely on an 8 GB GPU alone,
    not even with quantization (see below for the file/memory sizes of the Q2 quantization).

    Long answer: combined with your system memory, maybe.

    Your best bet to run Llama-2-70B is to try Llama.cpp, or any of the projects based on it, using the .gguf quantizations.

    With Llama.cpp you can run models and offload parts of them to the GPU, with the rest running on the CPU (see the sketch at the end of this answer).
    Even then, with the most aggressive available quantization, Q2, which causes significant quality loss, you need a total of about 32 GB of memory, combined across GPU VRAM and system RAM; keep in mind that your operating system needs some RAM too.

    It may still run, with your system starting to swap, which will make the answers incredibly slow. And I mean incredibly, because swapping will occur on every single pass through the model’s neural network, resulting in something like several minutes per generated token at least (if not worse).

    Without swapping, and depending on the capabilities of your system, expect something around 0.5 tokens/s or slightly above, maybe worse.
    Here is the model card of the GGUF-quantized Llama-2-70B-chat model; it contains further information on how to run it with different software:
    TheBloke/Llama-2-70B-chat-GGUF
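
    As a minimal sketch of the GPU-offload approach, here is how one of those GGUF files could be loaded with the llama-cpp-python bindings. The file name and the number of offloaded layers are assumptions; you would tune n_gpu_layers to whatever fits in your 8 GB of VRAM.

    from llama_cpp import Llama  # pip install llama-cpp-python (built with GPU support)

    llm = Llama(
        model_path="llama-2-70b-chat.Q2_K.gguf",  # hypothetical local path to a Q2 GGUF file
        n_gpu_layers=20,   # assumed value: offload as many layers as fit in 8 GB VRAM
        n_ctx=2048,        # context window; larger values need more memory
    )

    output = llm("Q: What does quantization do to a model? A:", max_tokens=64)
    print(output["choices"][0]["text"])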
