
Why Should I Train and Tune My AI Models?

We will demonstrate the necessity of building learning loops for your AI models to continually refine the components of your production systems.

Paul Yang

ML @ šŸƒā€ā™€ļøRunhousešŸ 

February 26, 2025

For a review of the component models that comprise a compound AI system, check out this post. Stay tuned for our next blog post, which will review training and fine-tuning methods you can employ.

Every AI/ML System Is Tuned with First-Party Data

ML systems have always relied on first-party data to perform and improve. Imagine walking into a meeting with a bank’s Chief Risk Officer and suggesting that your third-party risk or fraud model is so good they should fire their entire risk team and stop training models. In fact, you claim that your pre-trained fraud model is so good that it works equally well for a regional bank in Milwaukee primarily serving farmers and for Chase’s premium travel cards. This message would not be well received. Even if your model demoed well, no organization would surrender its ownership of, and ability to develop, mission-critical ML. The only correct reaction would be, ā€œSure, we’ll combine your model with our data to produce an even better internal model.ā€

In every system where AI/ML is employed to drive real business value, the system is continuously adapted with data. TikTok and Instagram retrain their recommender systems every few minutes, using likes, views, shares, and fast scrolls to find viral content. User interactions are fed directly back to improve the next round of predictions, leading to the highly addictive and personalized feeds you can scroll through. Netflix adds and reranks its content at least daily. Uber Eats customizes its recommended restaurants for each user and every city for every time of day.

[Figure: TikTok learning loop]

The same is true for AI systems. Especially if you build complex compound AI systems that mobilize multiple models, you must drive continuous improvement loops by feeding first-party data into each underlying model. A production AI agent might rely on six or more models in concert: 1) classify user intent, 2) route the query, 3) retrieve additional factual data with retrieval-augmented generation (RAG), 4) rerank the retrieved data to identify the most relevant context, 5) generate the result, and 6) check the result for factuality and toxicity. For successful production operation, you would obviously add observability to each of these steps and, for any case in which you imperfectly return a result, feed that data into your continuous learning loop, as sketched below.
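To make the shape of such a pipeline concrete, here is a minimal sketch. Every function and model here is a hypothetical placeholder, not a specific stack; the point is that each step is a separate model that can be logged, evaluated, and retrained independently.

```python
from typing import List

# Hypothetical placeholder models; in production each would wrap a real
# classifier, retriever, reranker, or LLM call.
def intent_classifier(q: str) -> str: return "support"
def router(q: str, intent: str) -> str: return "kb_search"
def retriever(q: str, top_k: int) -> List[str]: return ["doc1", "doc2"]
def reranker(q: str, docs: List[str], top_k: int) -> List[str]: return docs[:top_k]
def generator(q: str, ctx: List[str]) -> str: return f"Answer based on {ctx}"
def safety_checker(draft: str, ctx: List[str]) -> bool: return True
def log_trace(*args) -> None: pass  # feeds observability / the learning loop

def answer_query(query: str) -> str:
    intent = intent_classifier(query)         # 1) classify user intent
    route = router(query, intent)             # 2) route the query
    docs = retriever(query, top_k=20)         # 3) retrieve context (RAG)
    context = reranker(query, docs, top_k=5)  # 4) rerank retrieved data
    draft = generator(query, context)         # 5) generate the result
    if not safety_checker(draft, context):    # 6) check factuality/toxicity
        draft = "Sorry, I can't answer that reliably."
    log_trace(query, intent, route, context, draft)
    return draft
```

Each logged trace becomes a labeled example for retraining whichever component failed, which is exactly the learning loop described above.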

[Figure: pre-trained model learning loop]

Language Models Arenā€™t Magic

It is tempting to accept the marketing that large language models (LLMs) or the AI systems they power represent an entirely different class of machine learning, where pre-trained models are so good at tasks generically that you do not need to tune them any further. This is false.

As an example, let’s examine embedding model quality for RAG-type use cases. As noted above, embedding models are also language models, commonly used in compound AI systems to determine which pieces of information to retrieve when introducing knowledge into the system. Their out-of-the-box performance is quite poor: pre-trained embedding model quality ranges from about 75% correct on general QA tasks (benchmarked by MLQARetrieval) to below 40% on domain-specific tasks like legal context retrieval (benchmarked by AILAStatutes). This is far too low for reliable production use. Worse, no matter how advanced LLMs become, retrieval failures will render them ineffective; even GPT-7.5 won't help if the wrong proprietary factual information is retrieved from your vector database, because the final result will still be wrong.
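Before trusting an off-the-shelf embedding model in production, it is worth measuring retrieval quality on your own question/document pairs. Here is a minimal sketch using the sentence-transformers library; the model choice and the toy legal data are illustrative assumptions, not a recommendation.

```python
# Benchmark an off-the-shelf embedding model on your own data before
# trusting it in production RAG. Assumes `pip install sentence-transformers`.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # pre-trained, general-purpose

# Each query's correct document shares its index in the corpus list.
queries = ["What is the indemnification cap?", "When does the lease terminate?"]
corpus = ["Indemnification is capped at ...", "The lease terminates upon ..."]

q_emb = model.encode(queries, convert_to_tensor=True)
c_emb = model.encode(corpus, convert_to_tensor=True)
scores = util.cos_sim(q_emb, c_emb)  # (num_queries, num_docs) similarity matrix

k = 1
hits = sum(int(i in scores[i].topk(k).indices) for i in range(len(queries)))
print(f"recall@{k}: {hits / len(queries):.2f}")  # often far below 1.0 on domain data
```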

RAG has been so popular partly because it is so effective in demos. But that is because embedding models are trained and tested on web-crawl data, so correctly retrieving Wikipedia pages is easy. The same embedding model may perceive a law firm’s warehouse of legal and financial documents as all highly similar: two legal documents look nearly identical when compared against the model’s diverse pre-trained knowledge of medicine, sharks, Saturn, and so on. In this case, training the model on legal text will improve its domain and vocabulary understanding, as sketched below.
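A sketch of what that domain adaptation might look like with sentence-transformers: contrastive pairs mined from your own document warehouse, where other in-batch documents serve as negatives. The pairs shown are placeholders.

```python
# Adapt the same embedding model to legal vocabulary with contrastive pairs.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Query/relevant-passage pairs from your own first-party documents.
train_examples = [
    InputExample(texts=["What is the indemnification cap?",
                        "Indemnification is capped at ..."]),
    InputExample(texts=["When does the lease terminate?",
                        "The lease terminates upon ..."]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
model.save("legal-embedder-v1")  # re-run the retrieval benchmark afterwards
```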

This is a microcosm of a broader issue: AI labs are incentivized to maximize quality along every dimension and measure it based on popular benchmarks (see: Goodhart's Law). But despite every AI lab claiming their latest model is a landmark release, these improvements often have little correlation with the metrics that matter for your use case. Without your own training capabilities, youā€™re left waiting for AI labs to accidentally improve your application. For instance, DeepSeek R1ā€™s strength in coding and math, due to its reinforcement learning training, may do nothing to improve your customer service botā€™s response quality. Worse, it may introduce so much latency that you must distill its behavior into a smaller model for real-time interactions.

Are You Afraid of Training?

If training and tuning language models were roughly as easy as calling APIs on pre-trained models, there would be no broad ideological reason to stick with pre-trained models alone. Instead, the hope that pre-trained models will suffice often stems from an unspoken hesitation about training itself.

There is a marketing component to this. It seems as if only the AI labs at giants like Google, Meta, or OpenAI, or some uber-smart hedge fund traders, are able to deploy the techniques needed to train state-of-the-art models. Luckily, you don’t need to introduce a novel frontier model that outperforms OpenAI’s latest in every use case. Your definition of success is narrow and targetable, and you can incrementally improve different components of your overall system.

Your first-party data belongs exclusively to you; it is your moat. There are two broad approaches to using it. First, you can take all of your first-party documents and train that domain knowledge into your models. Second, if you have usage data, you can create clear success/failure labels that directly guide your system toward better and better behavior. You can further accelerate the learning process by employing techniques like LLM-as-judge to prepare and label your user traces into fine-tuning datasets, without wasting human effort manually labeling every trace; a sketch follows below.
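Here is a minimal sketch of that LLM-as-judge flow: grade logged traces, then keep only the judged successes as fine-tuning examples. The judge prompt, the model choice, and the trace format are all assumptions to adapt to your own system.

```python
# Grade logged user traces with an LLM judge and build a fine-tuning dataset.
import json
from openai import OpenAI

client = OpenAI()

def judge(query: str, response: str) -> bool:
    result = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative judge model
        messages=[{
            "role": "user",
            "content": f"Did this response correctly and safely answer the "
                       f"question? Reply PASS or FAIL.\n\nQ: {query}\nA: {response}",
        }],
    )
    return "PASS" in result.choices[0].message.content

traces = [{"query": "...", "response": "..."}]  # pulled from your observability logs
with open("finetune_data.jsonl", "w") as f:
    for t in traces:
        if judge(t["query"], t["response"]):  # keep only judged successes
            f.write(json.dumps({
                "messages": [
                    {"role": "user", "content": t["query"]},
                    {"role": "assistant", "content": t["response"]},
                ]
            }) + "\n")
```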

Another hesitation stems from wariness about the complexity of ā€œdoing research.ā€ Fortunately, it is now fairly easy for data scientists and engineers to set up training. Despite the heterogeneity of models available to AI engineers and their various roles in a compound system, almost all models used today are architecturally similar to traditional ML or Torch-based neural networks. This is critical because their shared foundational training methods mean there is a narrow yet rich ecosystem of libraries designed to abstract and simplify the tuning of these models. If architectures were more heterogeneous, it would create significant additional work, but in reality, the code required to continue training both an embedding model and a 70B-parameter Llama model can look almost identical.
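To illustrate, here is a hedged sketch of a single Hugging Face Trainer setup reused across model scales. The checkpoint names are examples only; as the model grows, the code stays the same and only the infrastructure requirements change.

```python
# One fine-tuning setup, parameterized by checkpoint name.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

def build_trainer(model_name: str, train_dataset):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    args = TrainingArguments(
        output_dir=f"./ft-{model_name.split('/')[-1]}",
        per_device_train_batch_size=1,
        num_train_epochs=1,
    )
    return Trainer(model=model, args=args,
                   train_dataset=train_dataset, tokenizer=tokenizer)

# Identical call sites; the infrastructure, not the code, is what differs:
# build_trainer("Qwen/Qwen2.5-0.5B", ds).train()        # small model, one GPU
# build_trainer("meta-llama/Llama-3.1-70B", ds).train() # needs many GPUs
```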

However, there is one catch: an infrastructure trap. Even if the code for tuning models is easy to write and leverages well-documented libraries, getting that code to run reproducibly and regularly still requires effort. For instance, when working with large models, you may need to experiment on, and then productionize to, distributed multi-GPU setups. You might also want to launch hyperparameter sweeps across multiple GPU nodes.

This is where Runhouse comes in. You can start experimenting with models within your existing data science stack, but the promise of Runhouse is to make the infrastructure required to run these trainings dead simple. With simple APIs, you can launch distributed workloads in your own cloud. You can also start with Runhouseā€™s pre-built automations for collecting and rerunning training for language models, further iterating from those methods as a starting point (book a meeting to learn more, or if you just want to chat about model fine-tuning).

As a minimal example, if you write a regular, undecorated FineTuner class with Hugging Face’s Trainer library, you can scale it to run over multiple nodes and GPUs with just a few lines of code.

```python
import runhouse as rh

# `img` is a Runhouse image (base environment) assumed to be defined earlier,
# e.g. with the training libraries installed.
num_nodes = 3

# Requires access to a cloud account with the necessary permissions to launch compute.
cluster = rh.compute(
    name=f"rh-L4x{num_nodes}",
    num_nodes=num_nodes,
    instance_type="L4:4",
    provider="aws",
    image=img,
    use_spot=False,
    autostop_mins=1000,
).up_if_not()

# Send the FineTuner class to the cluster, create a remote instance, and
# replicate it once per node for PyTorch distributed training.
fine_tuner_remote = rh.cls(FineTuner).to(cluster, name="ft_model")
fine_tuner = fine_tuner_remote(name="ft_model_instance").distribute(
    "pytorch", num_replicas=num_nodes, replicas_per_node=1
)
fine_tuner.tune()
```
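For completeness, the FineTuner class referenced above can be an ordinary Python class. Here is a minimal sketch wrapping Hugging Face’s Trainer; the model name and public dataset are illustrative stand-ins for your own checkpoint and first-party data, not Runhouse’s actual example code.

```python
# A plain, undecorated class that Runhouse can dispatch to remote compute.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

class FineTuner:
    def __init__(self, model_name: str = "Qwen/Qwen2.5-0.5B"):
        self.model_name = model_name

    def tune(self):
        tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        model = AutoModelForCausalLM.from_pretrained(self.model_name)

        # Illustrative public dataset; swap in your own first-party data.
        data = load_dataset("imdb", split="train[:1%]").map(
            lambda x: tokenizer(x["text"], truncation=True, max_length=512),
            batched=True,
        )
        collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

        args = TrainingArguments(
            output_dir="./ft",
            per_device_train_batch_size=2,
            num_train_epochs=1,
        )
        Trainer(model=model, args=args, train_dataset=data,
                data_collator=collator).train()
```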
