Launch Llama70B Inference on Your Own Kubernetes Cluster

Llama 70B is one of the more powerful open-source models, yet it still fits comfortably on a single node of GPUs, even 8 x L4s. Inference will be faster if you switch to A100s or H100s on your cloud provider, but L4s are cost-effective for experimentation or for low-throughput, latency-tolerant workloads, and they are also fairly easy to get as spot instances, which means you can serve this model for as little as ~$4/hour. This is the full model, not a quantized version. In this example, we use HuggingFace Accelerate to shard the model across the GPUs.
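As a quick back-of-envelope check on why the full model fits, here is a sketch with assumed numbers (bfloat16 weights and 24 GB of memory per L4, ignoring KV-cache and activation overhead):

# Rough memory estimate: 70B bf16 weights vs. the aggregate memory of 8 L4s
num_params = 70e9
bytes_per_param = 2  # bfloat16
weight_gb = num_params * bytes_per_param / 1e9  # ~140 GB of weights
node_gb = 8 * 24  # 8 x L4 = 192 GB of GPU memory
print(f"weights ~{weight_gb:.0f} GB vs. node capacity {node_gb} GB")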

import kubetorch as kt
import torch
from transformers import pipeline

Define Inference Class

We define a regular class that uses HuggingFace Pipelines to run inference with the Llama 70B model. To send it to remote compute, we decorate the class to specify the compute requirements and autoscaling behavior.

img = (
    kt.images.pytorch()
    .pip_install(["transformers", "accelerate"])
    .sync_secrets(["huggingface"])
)


@kt.compute(gpus="L4:8", image=img, name="llama70b")
@kt.distribute("auto", num_replicas=(0, 4))  # autoscale between 0 and 4 replicas
class Llama70B:
    def __init__(self, model_id="meta-llama/Llama-3.3-70B-Instruct"):
        self.model_id = model_id
        self.pipeline = None

    def load_pipeline(self):
        # Load the model once and cache it; device_map="auto" lets Accelerate
        # shard the weights across all 8 GPUs.
        self.pipeline = pipeline(
            "text-generation",
            model=self.model_id,
            model_kwargs={"torch_dtype": torch.bfloat16},
            device_map="auto",
        )

    def generate(self, query, temperature=1, max_new_tokens=100, top_p=0.9):
        if self.pipeline is None:
            self.load_pipeline()

        messages = [
            {
                "role": "system",
                "content": "You are a pirate chatbot with a love of Shakespeare who always responds in Shakespearean pirate speak!",
            },
            {"role": "user", "content": query},
        ]

        prompt = self.pipeline.tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )

        # Stop generation at either the model's EOS token or Llama's end-of-turn token
        terminators = [
            self.pipeline.tokenizer.eos_token_id,
            self.pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
        ]

        outputs = self.pipeline(
            prompt,
            eos_token_id=terminators,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            max_new_tokens=max_new_tokens,
        )

        # Return only the newly generated text, stripping the prompt prefix
        return outputs[0]["generated_text"][len(prompt):]
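If you want faster inference, the GPU specification is presumably the only thing that needs to change, since device_map="auto" adapts to whatever GPUs are present. A rough sketch of the swap (the "A100:4" string mirrors the "L4:8" format above and is illustrative; the count you actually need depends on memory headroom):

# Illustrative alternative compute spec; everything else stays the same
@kt.compute(gpus="A100:4", image=img, name="llama70b")
@kt.distribute("auto", num_replicas=(0, 4))
class Llama70B:
    ...  # class body unchanged from above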

Call Remote Service

Once we have deployed the service with kubetorch deploy, we can also call it from Python directly by name.

if __name__ == "__main__":
    llama = Llama70B.from_name("llama70b")

    query = "What is the best type of bread in the world?"
    generated_text = llama.generate(query)  # Running on your remote GPUs
    print(generated_text)
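Since generate exposes its sampling parameters as keyword arguments, you can also tune them per call; the values below are purely illustrative:

# Illustrative follow-up call with explicit sampling parameters (reusing the
# llama handle from above; values chosen arbitrarily)
longer_text = llama.generate(query, temperature=0.7, top_p=0.95, max_new_tokens=250)
print(longer_text)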