Low-Rank Adaptation, or LoRA, is a fine-tuning method for large language models.

What is LoRA, or Low-Rank Adaptation? How is it applied to LLMs?

LoRA, or Low-Rank Adaptation, is a technique used to fine-tune large language models more efficiently. To understand LoRA at a CS 101 level, let's break down the concept first, before sharing a coding example that lets you get started with LoRA fine-tuning on any compute.

Paul Yang

ML @ 🏃‍♀️Runhouse🏠

Matt Kandler

Engineer @ 🏃‍♀️Runhouse🏠

July 19, 2024

Defining Large Language Models and Fine Tuning

We presume that you already know what Large Language Models are – models that take a text input and return a probability distribution over the next word in the sequence. Language models have millions or even billions of parameters that must be adjusted during training for the model to produce useful output. Adjusting all of these parameters requires a lot of computational power and data, and large language models are trained on trillions of input words.

(Illustration: the prompt "I am at McDonald's, I would like to order a..." followed by a list of candidate next words.)
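To make the "predict the next word" idea concrete, here is a minimal sketch that prints the most likely next tokens for a prompt. It is not part of the fine-tuning recipe later in this post; it assumes the small `gpt2` model from Hugging Face purely as a stand-in that runs on modest hardware.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small stand-in model (assumption for illustration), not the Llama 3 model used later in this post
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "I am at McDonald's, I would like to order a"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence_length, vocab_size)

# Turn the logits at the last position into a probability distribution over the vocabulary
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, k=5)
for token_id, p in zip(top.indices, top.values):
    print(f"{tokenizer.decode(token_id.item())!r}: {p.item():.3f}")
```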

Fine-tuning is taking a model that already has been pre-trained, and teaching it something more specific. For instance, common use cases include:

  • Changing the tone and voice of the outputs, to more consistently reflect a way of talking without having to prompt in detail.
  • Informing the output structure, helping the model return results in consistent ways like a JSON, or ensuring the response contains exactly the fields and specifications required.
  • Controlling model behavior, ensuring specific response qualities such as always being concise, or matching the language of the input instead of responding in English when the input was non-English.

Remember that besides fine-tuning, many other approaches exist to achieve the style and structure of response that you want.

  • One- and Few-Shot Learning, which just involves including one or a few examples in the input prompt itself. For instance, including a couple of sentences written by Shakespeare when you want the model to imitate that style (see the sketch after this list).
  • Retrieval Augmented Generation (RAG), which involves using a secondary system to retrieve relevant information or context and adding it to the prompt before the model is called.
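As a rough illustration of the few-shot approach, the "training examples" simply live in the prompt string itself; nothing about the model changes. The wording below is a hypothetical placeholder, not taken from this tutorial.

```python
# A hedged sketch of few-shot prompting: the examples are part of the prompt, and no weights are updated.
# The "Shakespearean" rewrites below are illustrative placeholders, not real quotes.
few_shot_prompt = (
    "Rewrite the sentence in the style of the examples.\n\n"
    "Example: 'The meeting is cancelled.' -> 'Alas, our gathering shall not come to pass.'\n"
    "Example: 'It is raining.' -> 'The heavens do weep upon us.'\n\n"
    "Sentence: 'The store is closed.' ->"
)
# This string is then sent to the model as-is, e.g. via a text-generation pipeline.
```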

Low-Rank Adaptation (LoRA)

LoRA is a clever shortcut for fine-tuning. Instead of adjusting all the parameters, LoRA proposes to only add and adjust a small set of new parameters.

Linear Algebra Reminder:

  • Multiplying an `m x k` matrix with a `k x n` matrix yields a matrix of size `m x n` - so you can multiply a 17 x 2 and a 2 x 23 matrix together to get a 17 x 23 matrix.
  • The rank of an `m x n` matrix is at most min(m, n), and the product of an `m x k` and a `k x n` matrix has rank at most `k`.

Imagine we have a pre-trained neural network layer that transforms input data `x` into output data `y` using a weight matrix `W` and a bias vector `b`. The transformation looks like this:

y = W * x + b

Here, `W` is a matrix of size `m x n`, where `m` is the number of output features and `n` is the number of input features, and `b` is a vector of size `m`.

Now, we want to fine-tune this layer for a new task using LoRA. Instead of updating the entire weight matrix `W`, we will introduce two smaller matrices `A` and `B` such that `A` is of size `m x k` and `B` is of size `k x n`, where `k` is much smaller than both `m` and `n` (this is the low-rank part).

The adaptation is then easy to apply: LoRA updates the weight matrix as

W' = W + A * B

The key idea is to introduce a low-rank update that captures the changes needed for the new task, reducing the number of trainable parameters from `m * n` to `m * k + k * n`.
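To make the savings concrete, here is a small NumPy sketch of the `W' = W + A * B` update with illustrative sizes (these shapes are assumptions for the example, not dimensions taken from Llama 3):

```python
import numpy as np

m, n, k = 4096, 4096, 8            # illustrative sizes; k is the low rank

W = np.random.randn(m, n)          # frozen pre-trained weights
A = np.random.randn(m, k) * 0.01   # small trainable matrix
B = np.zeros((k, n))               # typically initialized to zero so W' starts out equal to W

W_prime = W + A @ B                # the LoRA-adapted weights, same shape as W

full_params = m * n                # parameters updated by ordinary fine-tuning
lora_params = m * k + k * n        # parameters updated by LoRA
print(full_params, lora_params)    # 16,777,216 vs. 65,536 -- roughly 256x fewer
```

Because `A @ B` has the same shape as `W`, it can be merged back into the original matrix after training, so the adapted model is no larger or slower at inference time.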

Benefits of LoRA

- Efficiency: Since only a small number of parameters are trained, LoRA is much faster and far more memory-efficient than traditional full fine-tuning, which allows training on lower-powered machines.

- Flexibility: Using low-rank matrices allows selective adaptation of different parts of the model. For example, you may decide to adapt only certain layers or components, providing a modular approach to model personalization.
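For example, with the Hugging Face peft library (the same library used in the training script below), `LoraConfig` lets you restrict the adaptation to particular layers via `target_modules`. The module names here are a common choice for Llama-style attention layers, but they are an assumption for illustration and should be checked against your model's architecture:

```python
from peft import LoraConfig

# A hedged sketch: adapt only the attention query and value projections.
selective_lora = LoraConfig(
    r=8,                # the low rank k
    lora_alpha=16,      # scaling applied to the A @ B update
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],  # assumed names for Llama-style models
)
```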

Coding Example - How to Apply LoRA to Llama 3

Runhouse lets you write regular Python code, and send it to any cloud compute in just a few extra lines of Python, with no need for domain-specific languages (DSLs) or complex packaging. After you are done fine-tuning the model, you can use it immediately as a production endpoint for your own hosted LLM.

Set up credentials and dependencies

For this exercise, you will need:

  • A HuggingFace token to download the base Llama3 model. Make sure to accept terms and conditions on the Hugging Face model page so that you can access it.
  • Access to compute - we will use an AWS EC2 instance, but Runhouse can be used with any cloud compute you have access to, or even local compute if you have enough resources.

Run the following commands in your terminal to install Runhouse and configure AWS credentials. You can swap AWS for your cloud of choice.

$ pip install "runhouse[aws]"
$ aws configure
$ sky check
$ export HF_TOKEN=<your huggingface token>

Create a model class

We import runhouse, the only required library we need locally. Then, we define a rh.Module class that will hold the various methods needed to fine-tune the model.

These methods are just normal Python that encompass data loading and training. No custom coding or domain-specific language is necessary to use Runhouse to deploy LoRA training to the cloud machine.

Learn more in the Runhouse docs on functions and modules.

import runhouse as rh

DEFAULT_MAX_LENGTH = 200


class FineTuner(rh.Module):
    def __init__(
        self,
        dataset_name="Shekswess/medical_llama3_instruct_dataset_short",
        base_model_name="meta-llama/Meta-Llama-3-8B-Instruct",
        fine_tuned_model_name="llama-3-8b-medical",
    ):
        super().__init__()
        self.dataset_name = dataset_name
        self.base_model_name = base_model_name
        self.fine_tuned_model_name = fine_tuned_model_name
        self.tokenizer = None
        self.base_model = None
        self.fine_tuned_model = None
        self.pipeline = None

    def load_base_model(self):
        import torch
        from transformers import AutoModelForCausalLM, BitsAndBytesConfig

        # configure the model for efficient training
        quant_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_use_double_quant=False,
        )

        # load the base model with the quantization configuration
        self.base_model = AutoModelForCausalLM.from_pretrained(
            self.base_model_name, quantization_config=quant_config, device_map={"": 0}
        )
        self.base_model.config.use_cache = False
        self.base_model.config.pretraining_tp = 1

    def load_tokenizer(self):
        from transformers import AutoTokenizer

        self.tokenizer = AutoTokenizer.from_pretrained(
            self.base_model_name, trust_remote_code=True
        )
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.tokenizer.padding_side = "right"

    def load_pipeline(self, max_length: int):
        from transformers import pipeline

        # Use the new fine-tuned model for generating text
        self.pipeline = pipeline(
            task="text-generation",
            model=self.fine_tuned_model,
            tokenizer=self.tokenizer,
            max_length=max_length,
        )

    def load_dataset(self):
        from datasets import load_dataset

        return load_dataset(self.dataset_name, split="train")

    def load_fine_tuned_model(self):
        import torch
        from peft import AutoPeftModelForCausalLM

        if not self.new_model_exists():
            raise FileNotFoundError(
                "No fine tuned model found on the cluster. "
                "Call the `tune` method to run the fine tuning."
            )

        self.fine_tuned_model = AutoPeftModelForCausalLM.from_pretrained(
            self.fine_tuned_model_name,
            device_map={"": "cuda:0"},  # Loads model into GPU memory
            torch_dtype=torch.bfloat16,
        )
        self.fine_tuned_model = self.fine_tuned_model.merge_and_unload()

    def new_model_exists(self):
        from pathlib import Path

        return Path(f"~/{self.fine_tuned_model_name}").expanduser().exists()

    def training_params(self):
        from transformers import TrainingArguments

        return TrainingArguments(
            output_dir="./results_modified",
            num_train_epochs=1,
            per_device_train_batch_size=4,
            gradient_accumulation_steps=1,
            optim="paged_adamw_32bit",
            save_steps=25,
            logging_steps=25,
            learning_rate=2e-4,
            weight_decay=0.001,
            fp16=False,
            bf16=False,
            max_grad_norm=0.3,
            max_steps=-1,
            warmup_ratio=0.03,
            group_by_length=True,
            lr_scheduler_type="constant",
            report_to="tensorboard",
        )

    def sft_trainer(self, training_data, peft_parameters, train_params):
        from trl import SFTTrainer

        # Set up the SFTTrainer with the model, training data, and parameters to learn from the new dataset
        return SFTTrainer(
            model=self.base_model,
            train_dataset=training_data,
            peft_config=peft_parameters,
            dataset_text_field="prompt",  # Dependent on your dataset
            tokenizer=self.tokenizer,
            args=train_params,
        )

    def tune(self):
        import gc

        import torch
        from peft import LoraConfig

        if self.new_model_exists():
            return

        # Load the training data, tokenizer and model to be used by the trainer
        training_data = self.load_dataset()
        if self.tokenizer is None:
            self.load_tokenizer()
        if self.base_model is None:
            self.load_base_model()

        # Use LoRA to update a small subset of the model's parameters
        peft_parameters = LoraConfig(
            lora_alpha=16, lora_dropout=0.1, r=8, bias="none", task_type="CAUSAL_LM"
        )

        train_params = self.training_params()
        trainer = self.sft_trainer(training_data, peft_parameters, train_params)

        # Force clean the pytorch cache
        gc.collect()
        torch.cuda.empty_cache()

        trainer.train()

        # Save the fine-tuned model's weights and tokenizer files on the cluster
        trainer.model.save_pretrained(self.fine_tuned_model_name)
        trainer.tokenizer.save_pretrained(self.fine_tuned_model_name)

        # Clear VRAM from training
        del trainer
        del train_params
        del training_data
        self.base_model = None
        gc.collect()
        torch.cuda.empty_cache()

        print("Saved model weights and tokenizer on the cluster.")

    def generate(self, query: str, max_length: int = DEFAULT_MAX_LENGTH):
        if self.fine_tuned_model is None:
            # Load the fine-tuned model saved on the cluster
            self.load_fine_tuned_model()

        if self.tokenizer is None:
            self.load_tokenizer()

        if self.pipeline is None or max_length != DEFAULT_MAX_LENGTH:
            self.load_pipeline(max_length)

        # Format should reflect the format in the dataset_text_field in SFTTrainer
        output = self.pipeline(
            f"<|start_header_id|>system<|end_header_id|> Answer the question truthfully, you are a medical professional.<|eot_id|><|start_header_id|>user<|end_header_id|> This is the question: {query}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
        )
        return output[0]["generated_text"]

Send the Code to the Cluster and Run

Now, we define code that will run locally to set up our Runhouse module on a remote cluster and execute the training.

We write the code into an `if __name__ == "__main__":` block, as shown below. In this block, we will:

  1. Define the cluster we will run the fine-tuning job on - learn more in the Runhouse docs on clusters.
  2. Define the environment for that cluster - learn more in the Runhouse docs on envs.
  3. Send our LoRA class to the remote compute

First, we create a cluster with the desired instance type and provider using rh.cluster(). Our instance_type here is defined as A10G:1, which is the accelerator type and count that we need. We could alternatively specify a specific AWS instance type, such as p3.2xlarge or g4dn.xlarge.

Next, we define the environment for our module. This includes the required dependencies that need to be installed on the remote machine, as well as any secrets that need to be synced up from local to remote. Passing huggingface to the secrets parameter will load the Hugging Face token we set up earlier.

Then, we define our module and send it to the remote cluster: we construct it normally and then call `get_or_to`. Using `get_or_to` allows us to load the existing module by the name `llama3-medical-model` if it was already put on the cluster.

Finally, we can call the tune method on the model class instance as if it were running locally. This will run the function on the remote cluster and return the response to our local machine automatically. Further calls will also run on the remote machine, and maintain state that was updated between calls, like self.fine_tuned_model. Once the base model is fine-tuned, we save this new model on the cluster and use it to generate our text predictions.

if __name__ == "__main__":
    # Bring up the cluster - here we specify it should have 1 A10G with 32GB of memory or more, on AWS
    cluster = rh.cluster(
        name="rh-a10x",
        instance_type="A10G:1",
        memory="32+",
        provider="aws",
    ).up_if_not()

    # Define the requirements and pass the secrets to the remote cluster
    env = rh.env(
        name="ft_env",
        reqs=[
            "torch",
            "tensorboard",
            "scipy",
            "peft==0.4.0",
            "bitsandbytes==0.40.2",
            "transformers==4.31.0",
            "trl==0.4.7",
            "accelerate",
        ],
        secrets=["huggingface"],  # Needed to download Llama 3 from Hugging Face
    )

    # Send the FineTuner module to the remote machine, with the environment env we defined above.
    # Runhouse will make sure it works on any GPU enabled compute.
    fine_tuner_remote = FineTuner().get_or_to(
        cluster, env=env, name="llama3-medical-model"
    )

    # Execute the fine tuning job by calling tune() on fine_tuner_remote, which was returned in the step above
    fine_tuner_remote.tune()

Generate Text with Tuned Model

Now that we have fine-tuned our model, we can generate text by calling the generate method with our query:

query = "What's the best treatment for sunburn?"
generated_text = fine_tuner_remote.generate(query)
print(generated_text)

You have now instantly deployed a microservice. If you use Runhouse Den, you can save this resource and run it from anywhere.

Conclusion

In summary, LoRA is like a smart hack in the world of machine learning that allows us to update giant language models quickly and efficiently by focusing on a small, manageable set of changes, all of which can be understood through the lens of linear algebra and matrix operations.
