
How to Deploy Llama 3.1 To Your Own Infrastructure (AWS Example, Released July 2024)

Llama 3.1 is the newest member of Meta's Llama family of large language models (LLMs). Its performance is comparable to best-in-class closed-source models, and this guide will show you how to run it on your own infrastructure.

Paul Yang

ML @ 🏃‍♀️Runhouse🏠

July 23, 2024

On Tuesday, July 23rd, Meta released the Llama 3.1 open-source large language model (LLM). This model represents the next half-generation of Meta's LLM family, and was released to the excitement of an AI community that continues to benefit from rapid improvements in foundation models.

Performance-wise, huge gains are not expected over the similarly sized models of the Llama 3 generation, though early reports suggest it performs better on math problems. Much of the news centered on the release of a very large 405B-parameter model, which has no open-source analog.

A more practical, yet notable, change versus previous generations is the greatly expanded context window. Llama 3 had only an 8,192-token context window, while first-class closed-source models such as OpenAI's GPT-4o had already expanded to much wider ones. Llama 3.1 now has a 128K-token context window, directly comparable to GPT-4o and many others.

In this post, we will show you how to deploy the 8B-parameter Llama 3.1 model, which can run on an AWS machine with a single A10 GPU. The only cost is the compute itself, at the price charged by the cloud provider; Runhouse does not sell hardware or charge for use. We bring up an on-demand cluster priced at roughly $1/hour on AWS, which makes it reasonable for experimentation. For a real deployment, size your hardware based on the output token speed you need and the model you ultimately choose.

Deployment Example

Credentials and Dependencies

  • As with most models, the easiest way to download the model is with a Hugging Face token. Make sure to accept the terms and conditions on the Hugging Face model page so that you can access it (an optional way to verify access is sketched just after this list).
  • Access to compute - we will use AWS as an example here, but you can check the Llama 3 deployment example on GCP instead if preferred.
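If you want to confirm that your token can actually see the gated repo before launching any compute, a quick optional check with huggingface_hub (which installs alongside transformers) looks roughly like the sketch below. The login call and token placeholder are illustrative; you can rely on the HF_TOKEN environment variable instead.

# Optional: verify the token has access to the gated Llama 3.1 repo
from huggingface_hub import login, model_info

login(token="hf_...")  # placeholder token; or set HF_TOKEN and skip this line
print(model_info("meta-llama/Meta-Llama-3.1-8B-Instruct").id)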

You will only need to install the Runhouse and PyTorch packages locally. You also need to configure your AWS credentials, let SkyPilot (which launches the instance on demand) check that they are valid, and save your Hugging Face token. The following can be run in your terminal.

$ pip install "runhouse[aws]" torch
$ aws configure
$ sky check
$ export HF_TOKEN=<your huggingface token>

Write and Run Python

Runhouse lets you write regular Python code, send it as a module to remote compute for execution, and then call it from your local environment as if it were local.
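Before the full example, here is a minimal sketch of that pattern using a toy function. The add function is purely illustrative; rh.function is Runhouse's wrapper for sending a plain function, and the cluster definition matches the one used in the full example below.

import runhouse as rh

def add(a, b):
    return a + b

if __name__ == "__main__":
    # Same on-demand A10G cluster definition as the full example below
    cluster = rh.ondemand_cluster(
        name="rh-a10x", instance_type="A10G:1", provider="aws"
    ).up_if_not()

    remote_add = rh.function(add).to(cluster)  # send the function to the cluster
    print(remote_add(1, 2))  # executes on the cluster and returns 3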

We first define an HFChatModel class, which we will later send to the cluster as a Runhouse module. The first call will take a few minutes, since the model must first be downloaded and loaded, but results should generate quickly after that. The class has two methods:

  • A method to load the model into a transformers text-generation pipeline
  • And a method to "predict", i.e. generate results based on user input.

Then, we define the code that will run locally. It will:

  • Bring up an on-demand cluster and define the environment (packages and secrets) for the worker that will run the model
  • Send the previously defined class to the cluster with rh.module() and .to()
  • Call the predict function in a loop until we type 'exit'

The code that runs "locally" can be run from any setting, and you can call remote_hf_chat_model.predict(prompt) as if it were a local function, except that execution happens entirely in the cloud.

import runhouse as rh
import torch


class HFChatModel:
    def __init__(self, model_id="meta-llama/Meta-Llama-3.1-8B-Instruct", **model_kwargs):
        super().__init__()
        self.model_id, self.model_kwargs = model_id, model_kwargs
        self.pipeline = None

    def load_model(self):
        # Imported here so transformers only needs to be installed on the remote cluster
        import transformers

        self.pipeline = transformers.pipeline(
            "text-generation",
            model=self.model_id,
            model_kwargs=self.model_kwargs,
            device="cuda",
        )

    def predict(self, prompt_text, **inf_kwargs):
        # Lazily load the model on the first call
        if not self.pipeline:
            self.load_model()

        messages = [
            {
                "role": "system",
                "content": "You are a pirate chatbot who always responds in pirate speak!",
            },
            {"role": "user", "content": prompt_text},
        ]

        prompt = self.pipeline.tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )

        terminators = [
            self.pipeline.tokenizer.eos_token_id,
            self.pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
        ]

        outputs = self.pipeline(
            prompt,
            max_new_tokens=256,
            eos_token_id=terminators,
            do_sample=True,
            temperature=0.6,
            top_p=0.9,
        )
        # Strip the prompt from the generated text and return only the completion
        return outputs[0]["generated_text"][len(prompt):]


if __name__ == "__main__":
    # Bring up an on-demand A10G cluster on AWS (a no-op if it is already running)
    cluster = rh.ondemand_cluster(
        name="rh-a10x", instance_type="A10G:1", provider="aws"
    ).up_if_not()

    # Define the remote environment: required packages plus the Hugging Face secret
    env = rh.env(
        name="test",
        secrets=["huggingface"],
        reqs=[
            "torch",
            "transformers",
            "accelerate",
            "bitsandbytes",
            "safetensors",
            "scipy",
        ],
        working_dir="./",
    )

    # Send the class to the cluster and instantiate it remotely
    RemoteChatModel = rh.module(HFChatModel).to(cluster, env=env, name="ChatModel")
    remote_hf_chat_model = RemoteChatModel(
        torch_dtype=torch.bfloat16, name="llama-8b-model"
    )

    # Chat loop: each .predict() call runs on the remote GPU
    while True:
        prompt = input(
            "\n\n... Enter a prompt to chat with the model, and 'exit' to exit ...\n"
        )
        if prompt.lower().strip() == "exit":
            break
        output = remote_hf_chat_model.predict(prompt)
        print("\n\n... Model Output ...\n")
        print(output)
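When you are done experimenting, bring the cluster down so you are not billed for idle compute. Since Runhouse launches the instance through SkyPilot, the cluster should appear under the same name in SkyPilot's CLI, and you can tear it down from your terminal:

$ sky status
$ sky down rh-a10x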
