
Why Should I Train and Tune My AI Models?

We will demonstrate the necessity of building learning loops for your AI models to continually refine the components of your production systems.

Paul Yang

ML @ šŸƒā€ā™€ļøRunhousešŸ 

February 26, 2025

For a review of the component models that comprise a compound AI system, check out this post. Stay tuned for our next blog post, which will review training and fine-tuning methods you can employ.

Every AI/ML System Is Tuned with First-Party Data

ML systems have always relied on first-party data to perform and improve. Imagine walking into a meeting with a bank’s Chief Risk Officer and suggesting that your third-party risk or fraud model is so good they should fire their entire risk team and stop training models. In fact, you claim that your pre-trained fraud model is so good that it works equally well for a regional bank in Milwaukee primarily serving farmers and for Chase’s premium travel cards. This message would not be well received. Even if your model demoed well, no organization would surrender its ownership of, and ability to develop, mission-critical ML. The only correct reaction would be, ā€œSure, we’ll combine your model with our data to produce an even better internal model.ā€

In every system where AI/ML is employed to drive real business value, the system is continuously adapted with data. TikTok and Instagram retrain their recommender systems every few minutes, using likes, views, shares, and fast scrolls to find viral content. User interactions are fed directly back to improve the next round of predictions, leading to the highly addictive and personalized feeds you can scroll through. Netflix adds and reranks its content at least daily. Uber Eats customizes its recommended restaurants for each user and every city for every time of day.

[Figure: TikTok learning loop]

The same is true for AI systems. Especially if you build complex compound AI systems that mobilize multiple models, you must drive continuous improvement loops by feeding first-party data into each underlying model. A production AI agent might rely on six or more models in concert: 1) classify user intent, 2) route the query, 3) retrieve additional factual data with retrieval-augmented generation (RAG), 4) rerank the retrieved data to identify the most relevant context, 5) generate the result, and 6) check the result for factuality and toxicity. For successful production operation, you would obviously add observability to each of these steps and, for any case in which you imperfectly return a result, feed that data into your continuous learning loop, as sketched below.
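To make the shape of such a pipeline concrete, here is a minimal sketch. Every function and model here is a hypothetical placeholder, not a specific stack; the point is that each step is a separate model that can be logged, evaluated, and retrained independently.

```python
from typing import List

# Hypothetical placeholder models; in production each would wrap a real
# classifier, retriever, reranker, or LLM call.
def intent_classifier(q: str) -> str: return "support"
def router(q: str, intent: str) -> str: return "kb_search"
def retriever(q: str, top_k: int) -> List[str]: return ["doc1", "doc2"]
def reranker(q: str, docs: List[str], top_k: int) -> List[str]: return docs[:top_k]
def generator(q: str, ctx: List[str]) -> str: return f"Answer based on {ctx}"
def safety_checker(draft: str, ctx: List[str]) -> bool: return True
def log_trace(*args) -> None: pass  # feeds observability / the learning loop

def answer_query(query: str) -> str:
    intent = intent_classifier(query)         # 1) classify user intent
    route = router(query, intent)             # 2) route the query
    docs = retriever(query, top_k=20)         # 3) retrieve context (RAG)
    context = reranker(query, docs, top_k=5)  # 4) rerank retrieved data
    draft = generator(query, context)         # 5) generate the result
    if not safety_checker(draft, context):    # 6) check factuality/toxicity
        draft = "Sorry, I can't answer that reliably."
    log_trace(query, intent, route, context, draft)
    return draft
```

Each logged trace becomes a labeled example for retraining whichever component failed, which is exactly the learning loop described above.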

[Figure: pre-trained model learning loop]

Language Models Arenā€™t Magic

It is tempting to accept the marketing that large language models (LLMs) or the AI systems they power represent an entirely different class of machine learning, where pre-trained models are so good at tasks generically that you do not need to tune them any further. This is false.

As an example, let’s examine embedding model quality for RAG-type use cases. As noted above, embedding models are also language models, commonly used in compound AI systems to determine which pieces of information to retrieve when introducing knowledge into the system. Their out-of-the-box performance is quite poor: pre-trained embedding model quality ranges from about 75% correct on general QA tasks (benchmarked by MLQARetrieval) to below 40% on domain-specific tasks like legal context retrieval (benchmarked by AILAStatutes). This is far too low for reliable production use. Worse, no matter how advanced LLMs become, retrieval failures will render them ineffective; even GPT-7.5 won't help if the wrong proprietary factual information is retrieved from your vector database, because the final result will still be wrong.
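Before trusting an off-the-shelf embedding model in production, it is worth measuring retrieval quality on your own question/document pairs. Here is a minimal sketch using the sentence-transformers library; the model choice and the toy legal data are illustrative assumptions, not a recommendation.

```python
# Benchmark an off-the-shelf embedding model on your own data before
# trusting it in production RAG. Assumes `pip install sentence-transformers`.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # pre-trained, general-purpose

# Each query's correct document shares its index in the corpus list.
queries = ["What is the indemnification cap?", "When does the lease terminate?"]
corpus = ["Indemnification is capped at ...", "The lease terminates upon ..."]

q_emb = model.encode(queries, convert_to_tensor=True)
c_emb = model.encode(corpus, convert_to_tensor=True)
scores = util.cos_sim(q_emb, c_emb)  # (num_queries, num_docs) similarity matrix

k = 1
hits = sum(int(i in scores[i].topk(k).indices) for i in range(len(queries)))
print(f"recall@{k}: {hits / len(queries):.2f}")  # often far below 1.0 on domain data
```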

RAG has been so popular partly because it is so effective in demos. But that is because embedding models are trained and tested on web-crawl data, so correctly retrieving Wikipedia pages is easy. The same embedding model may perceive a law firm’s warehouse of legal and financial documents as all highly similar: two legal documents look nearly identical when compared against the model’s diverse pre-trained knowledge of medicine, sharks, Saturn, and so on. In this case, training the model on legal text will improve its domain and vocabulary understanding, as sketched below.
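A sketch of what that domain adaptation might look like with sentence-transformers: contrastive pairs mined from your own document warehouse, where other in-batch documents serve as negatives. The pairs shown are placeholders.

```python
# Adapt the same embedding model to legal vocabulary with contrastive pairs.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Query/relevant-passage pairs from your own first-party documents.
train_examples = [
    InputExample(texts=["What is the indemnification cap?",
                        "Indemnification is capped at ..."]),
    InputExample(texts=["When does the lease terminate?",
                        "The lease terminates upon ..."]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
model.save("legal-embedder-v1")  # re-run the retrieval benchmark afterwards
```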

This is a microcosm of a broader issue: AI labs are incentivized to maximize quality along every dimension and measure it based on popular benchmarks (see: Goodhart's Law). But despite every AI lab claiming their latest model is a landmark release, these improvements often have little correlation with the metrics that matter for your use case. Without your own training capabilities, youā€™re left waiting for AI labs to accidentally improve your application. For instance, DeepSeek R1ā€™s strength in coding and math, due to its reinforcement learning training, may do nothing to improve your customer service botā€™s response quality. Worse, it may introduce so much latency that you must distill its behavior into a smaller model for real-time interactions.

Are You Afraid of Training?

If training and tuning language models were roughly as easy as calling APIs on pre-trained models, there would be no broad ideological reason to stick with pre-trained models alone. Instead, the hope that pre-trained models will suffice often stems from an unspoken hesitation about training itself.

There is a marketing component to this. It seems as if only the AI labs at giants like Google, Meta, or OpenAI, or some uber-smart hedge fund traders, are able to deploy the techniques needed to train state-of-the-art models. Luckily, you don’t need to introduce a novel frontier model that outperforms OpenAI’s latest in every use case. Your definition of success is narrow and targetable, and you can incrementally improve different components of your overall system.

Your first-party data belongs exclusively to you; it is your moat. There are two broad approaches to using it. First, you can take all of your first-party documents and train that domain knowledge into your models. Second, if you have usage data, you can create clear success/failure labels that directly guide your system toward better and better behavior. You can further accelerate the learning process by employing techniques like LLM-as-judge to prepare and label your user traces into fine-tuning datasets, without wasting human effort manually labeling every trace; a sketch follows below.
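Here is a minimal sketch of that LLM-as-judge flow: grade logged traces, then keep only the judged successes as fine-tuning examples. The judge prompt, the model choice, and the trace format are all assumptions to adapt to your own system.

```python
# Grade logged user traces with an LLM judge and build a fine-tuning dataset.
import json
from openai import OpenAI

client = OpenAI()

def judge(query: str, response: str) -> bool:
    result = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative judge model
        messages=[{
            "role": "user",
            "content": f"Did this response correctly and safely answer the "
                       f"question? Reply PASS or FAIL.\n\nQ: {query}\nA: {response}",
        }],
    )
    return "PASS" in result.choices[0].message.content

traces = [{"query": "...", "response": "..."}]  # pulled from your observability logs
with open("finetune_data.jsonl", "w") as f:
    for t in traces:
        if judge(t["query"], t["response"]):  # keep only judged successes
            f.write(json.dumps({
                "messages": [
                    {"role": "user", "content": t["query"]},
                    {"role": "assistant", "content": t["response"]},
                ]
            }) + "\n")
```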

Another hesitation stems from wariness about the complexity of ā€œdoing research.ā€ Fortunately, it is now fairly easy for data scientists and engineers to set up training. Despite the heterogeneity of models available to AI engineers and their various roles in a compound system, almost all models used today are architecturally similar to traditional ML or Torch-based neural networks. This is critical because their shared foundational training methods mean there is a narrow yet rich ecosystem of libraries designed to abstract and simplify the tuning of these models. If architectures were more heterogeneous, it would create significant additional work, but in reality, the code required to continue training both an embedding model and a 70B-parameter Llama model can look almost identical.
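To illustrate, here is a hedged sketch of a single Hugging Face Trainer setup reused across model scales. The checkpoint names are examples only; as the model grows, the code stays the same and only the infrastructure requirements change.

```python
# One fine-tuning setup, parameterized by checkpoint name.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

def build_trainer(model_name: str, train_dataset):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    args = TrainingArguments(
        output_dir=f"./ft-{model_name.split('/')[-1]}",
        per_device_train_batch_size=1,
        num_train_epochs=1,
    )
    return Trainer(model=model, args=args,
                   train_dataset=train_dataset, tokenizer=tokenizer)

# Identical call sites; the infrastructure, not the code, is what differs:
# build_trainer("Qwen/Qwen2.5-0.5B", ds).train()        # small model, one GPU
# build_trainer("meta-llama/Llama-3.1-70B", ds).train() # needs many GPUs
```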

However, there is one catch: an infrastructure trap. Even if the code for tuning models is easy to write and leverages well-documented libraries, getting that code to run reproducibly and regularly still requires effort. For instance, when working with large models, you may need to experiment on, and then productionize to, distributed multi-GPU setups. You might also want to launch hyperparameter sweeps across multiple GPU nodes.

This is where Runhouse comes in. You can start experimenting with models within your existing data science stack, but the promise of Runhouse is to make the infrastructure required to run these trainings dead simple. With simple APIs, you can launch distributed workloads in your own cloud. You can also start with Runhouseā€™s pre-built automations for collecting and rerunning training for language models, further iterating from those methods as a starting point (book a meeting to learn more, or if you just want to chat about model fine-tuning).

As a minimal example, if you write a regular, undecorated FineTuner class with Hugging Face’s Trainer library, you can scale it to run over multiple nodes and GPUs with just a few lines of code.

```python
import runhouse as rh

# `img` is a Runhouse image (base environment) assumed to be defined earlier,
# e.g. with the training libraries installed.
num_nodes = 3

# Requires access to a cloud account with the necessary permissions to launch compute.
cluster = rh.compute(
    name=f"rh-L4x{num_nodes}",
    num_nodes=num_nodes,
    instance_type="L4:4",
    provider="aws",
    image=img,
    use_spot=False,
    autostop_mins=1000,
).up_if_not()

# Send the FineTuner class to the cluster, create a remote instance, and
# replicate it once per node for PyTorch distributed training.
fine_tuner_remote = rh.cls(FineTuner).to(cluster, name="ft_model")
fine_tuner = fine_tuner_remote(name="ft_model_instance").distribute(
    "pytorch", num_replicas=num_nodes, replicas_per_node=1
)
fine_tuner.tune()
```
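For completeness, the FineTuner class referenced above can be an ordinary Python class. Here is a minimal sketch wrapping Hugging Face’s Trainer; the model name and public dataset are illustrative stand-ins for your own checkpoint and first-party data, not Runhouse’s actual example code.

```python
# A plain, undecorated class that Runhouse can dispatch to remote compute.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

class FineTuner:
    def __init__(self, model_name: str = "Qwen/Qwen2.5-0.5B"):
        self.model_name = model_name

    def tune(self):
        tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        model = AutoModelForCausalLM.from_pretrained(self.model_name)

        # Illustrative public dataset; swap in your own first-party data.
        data = load_dataset("imdb", split="train[:1%]").map(
            lambda x: tokenizer(x["text"], truncation=True, max_length=512),
            batched=True,
        )
        collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

        args = TrainingArguments(
            output_dir="./ft",
            per_device_train_batch_size=2,
            num_train_epochs=1,
        )
        Trainer(model=model, args=args, train_dataset=data,
                data_collator=collator).train()
```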
