How to Run a LLaMA Chatbot on the Cluster

In our previous tutorial, we covered how to set up the environment and run a classification model using sbatch.
This time, we will run a Meta LLaMA 3.1-8B-Instruct chatbot interactively via srun on the cluster.
We will start with a minimal terminal chatbot, then upgrade it to a web interface.

Environment Setup

  1. Create a conda environment. Make sure you are working under the /project directory (to avoid /home quota issues).
    Then create a dedicated conda environment:
cd /project/your_username/
mkdir -p llama-chatbot
cd llama-chatbot
conda create -n llama_chat python=3.10 -y
conda activate llama_chat
  2. Install the Hugging Face ecosystem and Gradio:
pip install -U "transformers>=4.44.0" "accelerate>=0.33.0" "tokenizers>=0.19.0" safetensors gradio
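
Note that PyTorch itself is not pulled in by the command above, but the scripts below import torch. Make sure it is available in the environment, for example via pip install torch or a site-provided module, depending on how your cluster exposes CUDA; check your cluster's documentation if you are unsure.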

Hugging Face Access

Meta’s LLaMA models require a license agreement and authentication.

Accept the license

Visit meta-llama/Llama-3.1-8B-Instruct on Hugging Face, fill in the form, and click Submit.

Create an access token

  1. Log in with your Hugging Face account.
  2. Go to the Hugging Face Access Tokens page.
  3. Create a read token (hf_…), make sure the Llama-3.1-8B-Instruct model is selected, and copy the token.
  4. Log in on the cluster:
huggingface-cli login
     Paste your token when prompted. You should see: “Login successful. The current active token is: XXX”.
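
If you want to confirm the login worked before pulling the model, one option is a quick check with the huggingface_hub client (installed as a dependency of transformers). A minimal sketch:

# Optional sanity check that the cached token is valid
from huggingface_hub import whoami

info = whoami()  # raises an error if no valid token is found
print("Authenticated as:", info.get("name"))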

Run a Chatbot in Your Terminal

  1. Prepare the script

Save the following script in your project directory, e.g. /project/your_username/llama-chatbot/llama_chat.py:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch, sys

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"

tok = AutoTokenizer.from_pretrained(MODEL_ID, token=True)  # use the token stored by huggingface-cli login
mdl = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
    attn_implementation="sdpa",
    token=True,
)

print("💬 LLaMA Lab Assistant — type 'exit' to quit")
history = []
while True:
    user = input("You: ").strip()
    if user.lower() == "exit":
        sys.exit(0)

    messages = history + [{"role": "user", "content": user}]
    prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tok(prompt, return_tensors="pt").to(mdl.device)

    with torch.no_grad():
        out = mdl.generate(
            **inputs,
            max_new_tokens=256,
            temperature=0.7,
            do_sample=True,
            eos_token_id=tok.eos_token_id,
        )
    # Decode only the newly generated tokens, skipping the prompt
    new_tokens = out[0][inputs["input_ids"].shape[-1]:]
    reply = tok.decode(new_tokens, skip_special_tokens=True).strip()
    print(f"Assistant: {reply}")
    history.append({"role": "user", "content": user})
    history.append({"role": "assistant", "content": reply})
  2. Run it interactively on a GPU node:
srun --gres=gpu:1 --cpus-per-task=4 --mem=64G --time=2:00:00 --partition=bigTiger \
     python /project/your_username/llama-chatbot/llama_chat.py
  3. Start a conversation:

Once the chatbot starts, you can interact in the terminal, for example:

💬 LLaMA Lab Assistant — type 'exit' to quit

You: Can you explain what a GPU is?
Assistant: A Graphics Processing Unit (GPU) is a specialized electronic circuit designed to quickly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device...

You: exit
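
If you want the assistant to follow lab-specific instructions, one option (not shown in the script above) is to seed history with a system message before the loop; Llama 3.1's chat template supports a system role, and apply_chat_template will fold it into the prompt. A minimal sketch, with example wording:

# Optional: seed the conversation with a system message (example wording)
history = [
    {"role": "system", "content": "You are a concise lab assistant for an HPC cluster. Prefer short, practical answers."}
]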

Upgrade to a Web Interface (Gradio)

For a more user-friendly interface, we can use Gradio to run the chatbot in a browser.

  1. Prepare the script

Save this script as /project/your_username/llama-chatbot/llama_gradio.py, and change the HF_HOME path near the top to your actual /project directory:

import os, socket, gradio as gr
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Put the HF cache under /project to avoid home quota issues and the deprecation warning
os.environ.setdefault("HF_HOME", "/project/your_username/.cache/huggingface")

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
PORT = 7861  # change if the port is busy

# You already did `huggingface-cli login`, no explicit token needed here
tok = AutoTokenizer.from_pretrained(MODEL_ID)
mdl = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
    attn_implementation="sdpa",
)

def chat_fn(message, history):
    """
    Gradio passes:
      - message: current user input (str)
      - history: list of (user, assistant) tuples
    Convert to LLaMA chat template messages.
    """
    msgs = []
    for u, b in (history or []):
        if u:
            msgs.append({"role": "user", "content": u})
        if b:
            msgs.append({"role": "assistant", "content": b})
    msgs.append({"role": "user", "content": message})

    prompt = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    inputs = tok(prompt, return_tensors="pt").to(mdl.device)

    with torch.no_grad():
        out = mdl.generate(
            **inputs,
            max_new_tokens=256,
            temperature=0.7,
            do_sample=True,
            eos_token_id=tok.eos_token_id,
        )

    # Decode only the newly generated tokens, skipping the prompt
    new_tokens = out[0][inputs["input_ids"].shape[-1]:]
    reply = tok.decode(new_tokens, skip_special_tokens=True).strip()
    return reply

print(f"[Gradio] compute node = {socket.gethostname()}, port = {PORT}", flush=True)

demo = gr.ChatInterface(fn=chat_fn, title="LLaMA Lab Assistant")
demo.launch(server_name="0.0.0.0", server_port=PORT, show_error=True)

  2. Run it on a GPU node:
srun --gres=gpu:1 --cpus-per-task=4 --mem=64G --time=2:00:00 --partition=bigTiger \
     python /project/your_username/llama-chatbot/llama_gradio.py

This will start the LLaMA chatbot on a GPU node. In the terminal you should see output like:

[Gradio] compute node = itiger01, port = 7861
Running on local URL:  http://0.0.0.0:7861
  3. Connect from your local machine

Open a new terminal on your local computer and create an SSH tunnel. Replace itiger01 with the actual compute node name printed above, and your_username with your actual username:

ssh -Nf -L 17861:itiger01:7861 your_username@itiger.memphis.edu

Now open a browser on your local machine and go to: http://localhost:17861

You should see the LLaMA Lab Assistant Gradio interface, where you can start chatting with the model interactively.
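
As an optional refinement, the interface can stream tokens as they are generated instead of waiting for the full reply. This is not part of the script above, but here is a sketch of how chat_fn could be turned into a generator using transformers' TextIteratorStreamer; gr.ChatInterface accepts generator functions and updates the reply as text is yielded. It assumes tok and mdl are defined as in llama_gradio.py:

# Sketch: streaming variant of chat_fn (assumes tok and mdl from llama_gradio.py)
from threading import Thread
from transformers import TextIteratorStreamer

def chat_fn_streaming(message, history):
    msgs = []
    for u, b in (history or []):
        if u:
            msgs.append({"role": "user", "content": u})
        if b:
            msgs.append({"role": "assistant", "content": b})
    msgs.append({"role": "user", "content": message})

    prompt = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    inputs = tok(prompt, return_tensors="pt").to(mdl.device)

    # The streamer yields decoded text pieces as generate() produces tokens
    streamer = TextIteratorStreamer(tok, skip_prompt=True, skip_special_tokens=True)
    kwargs = dict(**inputs, max_new_tokens=256, temperature=0.7, do_sample=True,
                  eos_token_id=tok.eos_token_id, streamer=streamer)
    Thread(target=mdl.generate, kwargs=kwargs).start()

    partial = ""
    for piece in streamer:
        partial += piece
        yield partial  # Gradio re-renders the reply with each yield

To try it, pass chat_fn_streaming to gr.ChatInterface in place of chat_fn.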



