I have been unable to get any output from the model, even after waiting 10 minutes. I am running on an Azure Notebook with a Standard_E4ds_v4 compute instance (4 cores, 32 GB RAM).
Any assistance is appreciated.

Code:

# Note: "!" runs the command in a subshell, so this does not change the notebook kernel's environment
!source activate llm_env

%pip install conda
import conda
%conda install cudatoolkit

%pip install torch
%pip install einops
%pip install accelerate
%pip install transformers==4.27.4
%pip install huggingface-hub
%pip install chardet
%pip install cchardet

from transformers import AutoTokenizer, AutoModelForCausalLM, TFAutoModelForCausalLM
import transformers
import torch

model = "tiiuae/falcon-7b"
rrmodel = AutoModelForCausalLM.from_pretrained(model, 
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",)
tokenizer = AutoTokenizer.from_pretrained(model)

input_text = "What is a giraffe?"
input_ids = tokenizer.encode(input_text, return_tensors='pt')

attention_mask = torch.ones(input_ids.shape)
output = rrmodel.generate(input_ids, 
            attention_mask=attention_mask, 
            max_length=2000,
            do_sample=True,
            pad_token_id=50256,
            top_k=10,
            num_return_sequences=1,
            eos_token_id=tokenizer.eos_token_id,)
# Execution never reaches this point
print(f"Got output: {output}")
output_text = tokenizer.decode(output[0], skip_special_tokens=True)

print(output_text)

2 Answers


  1. Chosen as BEST ANSWER

    I believe the problem was how I prompted the model. It is a text-generation model, so in my case I was feeding it a transcript and had to end the prompt with "Summary: ".

    So this DID NOT work: "Summarize this transcript. Transcript: ..."

    This WORKED: "Transcript: .... , Summary: "
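
    A minimal sketch of the two prompt styles (the transcript string and variable names here are just placeholders):

    transcript = "..."  # placeholder for the actual transcript text
    
    # Instruction-style prompt that did NOT produce a summary for me:
    bad_prompt = f"Summarize this transcript. Transcript: {transcript}"
    
    # Completion-style prompt that DID work: ending with "Summary: " lets the
    # model continue the text with the summary itself.
    good_prompt = f"Transcript: {transcript}\nSummary: "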

    Full working code below:

    from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
    import torch
    
    model_path = "tiiuae/falcon-7b"
    
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, load_in_4bit=True, device_map="auto")
    
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    input_text = "What is a giraffe?"
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
    attention_mask = torch.ones_like(input_ids)  # same shape and device as input_ids
    
    outputs = model.generate(input_ids,
                attention_mask=attention_mask,
                max_length=2000,
                do_sample=True,
                top_k=10,
                num_return_sequences=1,
                eos_token_id=tokenizer.eos_token_id,)
    
    for output in outputs:
      output_text = tokenizer.decode(output, skip_special_tokens=True)
      print("GENERATED TEXT: ------------")
      print(output_text)
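
    One caveat: load_in_4bit goes through the bitsandbytes library (and needs a fairly recent transformers release), so if the 4-bit load fails you may also need something like:

    %pip install -U bitsandbytes accelerate transformers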
    

  2. I tried your code and changed max_length to 100 to check its run time.

    I created a CPU-only VM with 140 GB of memory.

    I made small changes to your code as below,

    Code:

    import conda
    %conda install cudatoolkit
    
    %pip install torch
    %pip install einops
    %pip install accelerate
    %pip install transformers==4.27.4
    %pip install huggingface-hub
    %pip install chardet
    %pip install cchardet
    
    from transformers import AutoTokenizer, AutoModelForCausalLM, TFAutoModelForCausalLM
    import transformers
    import torch
    print("Done1")
    
    model = "tiiuae/falcon-7b"
    rrmodel = AutoModelForCausalLM.from_pretrained(model, 
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model)
    
    print("Done2")
    input_text = "What is a giraffe?"
    
    input_ids = tokenizer.encode(input_text, return_tensors='pt')
    
    attention_mask = torch.ones(input_ids.shape)
    output = rrmodel.generate(input_ids, 
                attention_mask=attention_mask, 
                max_length=100,
                do_sample=True,
                pad_token_id=50256,
                top_k=10,
                num_return_sequences=1,
                eos_token_id=tokenizer.eos_token_id)
    
    print(f"Got output: {output}")
    output_text = tokenizer.decode(output[0], skip_special_tokens=True)
    
    print(output_text)
    

    Then I started running the code in ML Studio.

    It took almost 3 hours 30 minutes to run.

    Output:

    It ran successfully.

    With max_length=100 it took almost 3.5 hours on CPU, so max_length=2000 will take far longer. Deploy the VM that runs your notebook with a GPU (or a larger GPU SKU) instead.
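
    Before kicking off a long run it is worth confirming that torch can actually see a GPU, and timing a short generation first; here is a rough sketch that reuses the variables from the code above:

    import time
    import torch
    
    # Should print True on a GPU compute instance; on a CPU-only VM it is False.
    print("CUDA available:", torch.cuda.is_available())
    
    # Time a short generation to estimate how long a larger max_length would take.
    start = time.perf_counter()
    output = rrmodel.generate(input_ids,
                attention_mask=attention_mask,
                max_length=100,
                do_sample=True,
                pad_token_id=50256,
                top_k=10,
                num_return_sequences=1,
                eos_token_id=tokenizer.eos_token_id)
    print(f"Generation took {time.perf_counter() - start:.1f} seconds")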
