MLOps is Dead, Long Live ML Platform Engineering
Platform engineering teams prioritize self-service and grade themselves on platform velocity. In ML, we should transition away from MLOps and manual enablement toward a fully self-service, end-to-end system.

ML @ 🏃♀️Runhouse🏠

In software, the consensus has abandoned DevOps for platform engineering. Platform engineering enables developer self-service and productizes the production process. As with any trending concept, Reddit and Twitter users will point out that very little has changed other than the marketing; however, the shift in orientation matters: a platform engineering team has a clear way to measure its success and prioritizes new features based on developer velocity, developer experience, and time-to-production.
This transition has not yet occurred in machine learning. While the term MLOps can be somewhat overloaded, we use it here to describe the practice of applying DevOps-inspired systems and processes to move ML workloads from research to production.
In real-world practice, MLOps at most organizations still relies heavily on manual effort from dedicated teams to push models into production or support new use cases. Cloud platforms like SageMaker or solutions like Kubeflow offer a collection of tooling that, through DevOps-style enablement, can deploy new ML training or inference workloads. However, having a series of tools in place is not equivalent to having a platform. This tools-and-ops approach becomes a major drag on experimentation, on moving to production, and ultimately on the pace of innovation with ML. In many cases, the friction acts as a filter as well, blocking experiments and features altogether.
Data scientists and ML researchers deserve a simple, self-service paved path to take any idea and develop it to production.
What is the Difference between DevOps and Platform Engineering?
Briefly, DevOps and platform engineering are related but distinct approaches to software delivery. Platform engineering emerged as a modern evolution of DevOps and focuses on productizing what was formerly DevOps work to enable developer self-service. It’s worth being precise about the difference in philosophy.
DevOps exists to unsilo development and operations teams, emphasizing shared responsibility to ship code. Within the purview of DevOps is setting up CI/CD, infrastructure-as-code (IaC), alerts, logging, and security. Developers access these tools directly, but day-to-day, a DevOps engineer might be embedded with, or directly enable, a development team. As a collaboration example, a DevOps engineer helps a team set up Jenkins pipelines for their specific project, teaches the team how to modify the Jenkinsfile, and is available for support when issues arise. Some of these activities are even performed on demand to unblock development and deployment.
Platform engineering focuses on building and maintaining the correct abstractions over the underlying infrastructure. Development teams engage with these abstractions, like internal developer portals or defined golden paths, rather than the base infrastructure, to deploy and run their applications. Developers have self-service pathways to production; this might entail provisioning standardized environments, CI template generators, and unified alerting and reporting that work without configuration. Instead of enabling each development team individually, the platform team creates thoughtful standardization across the organization.
While some of the distinction is artificial, there’s a spiritual difference between the enablement in the DevOps case and the empowerment in the platform engineering case. In platform engineering, self-service is available to developers and engineers, which increases team velocity and improves developer experience. Infrastructure-owning engineers do not aim to spend more time with individual development teams to unblock them manually; instead, they spend their time in a highly leveraged way by shipping a product. Spotify’s Backstage is often viewed as the canonical example of a “complete platform.”
Data Engineering Teams Transitioned to Platforms
A decade ago, data teams generally did not have good data estates or a data platform that let developers access and transform data end-to-end independently. Instead, data engineering ran on fragmented infrastructure. There were single-node servers with some flavor of SQL installed, with finite disk space and CPU/memory resources. If you worked with “big data,” you relied on a team of a few people who “managed” a Hadoop cluster, and you would write different code (MapReduce) to run on those clusters. To be effective across all of these data systems, you had to a) know where your code was going to run, b) know what resources you would consume, and c) depend on a specialist team to manually unblock you if you crossed a hard boundary. There was no universal self-service for all data operations.
Enter systems like BigQuery, Snowflake, and Databricks. These were not initially platforms themselves, but a core infrastructure component: a universal data execution runtime around which platforms were built. The abstraction they provide is incredible: whether you’re working with 10 rows of data or 10 billion, you can write the same standard SQL (or Spark) workloads and have them magically execute. The scalability and universal execution of these systems subsequently abstracted away the physical storage location of data as well. Data was no longer siloed on specific hardware; instead, it could flexibly reside across cloud vendors, blob storage, or virtually anywhere, while still being accessed and processed through the same infrastructure. And whether you’re writing this code in a browser-based session, scheduling it with a pipeline orchestrator like Airflow, or calling it in another setting, it runs identically within the same observable, governed environment.
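To make the abstraction concrete, here is a minimal PySpark sketch (the table and column names are hypothetical): the transformation code is identical whether the table holds ten rows or ten billion, because the engine plans, distributes, and scales execution and abstracts away where the data physically lives.

```python
# A minimal sketch of the abstraction: the same PySpark transformation runs
# unchanged whether `analytics.events` holds ten rows or ten billion, because
# the engine handles planning, distribution, and scaling. (Table and column
# names here are hypothetical.)
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

daily_revenue = (
    spark.table("analytics.events")            # physical storage location is abstracted away
    .where(F.col("event_type") == "purchase")
    .groupBy("event_date")
    .agg(F.sum("amount").alias("revenue"))
)
daily_revenue.write.mode("overwrite").saveAsTable("analytics.daily_revenue")
```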
Now that data and execution were unified, it was finally possible to deliver the full platform. Observability, lineage, governance, and cataloging became easy to implement, while additional tools like dbt simply plug into the “platform.” Execution is fully self-serve, and the time to first query for a new engineer joining the team has dropped to essentially zero. Data engineers and analytics engineers write SQL to extract, transform, load, process, and report, but never have to think about the heterogeneity in compute and storage required to deliver the results of their queries.
ML Development Looks Different from Software Development
We see significant heterogeneity in the ML development process across organizations, but in general, ML teams do not enjoy the kind of modern platform already available in other domains. ML development diverged significantly from traditional software about a decade ago because of large datasets and accelerated (GPU) compute; it became impossible to develop locally and confirm that code works before containerizing and deploying it. This departure meant ML development and deployment were forked off to be done in an ML-specific way, and many anti-patterns followed.
An early standard for MLOps was offering notebooks for research, plus pipeline orchestrators for production training. To quickly unblock research work and enable fast development, developers are given hosted notebooks or devboxes directly on remote compute. But these systems produce irregular code in unreproducible environments. A multi-week “research-to-production” process must then follow, in which research notebooks are translated into containers or workflow orchestrator nodes to run properly. This is obviously not an ideal development lifecycle; it would be unthinkable to have a different team translate my frontend application or my Snowflake query to run in production.
Frustrated with slow productionization, teams invariably abandon this approach in favor of developing with regular code and containerizing it with Docker for execution. While this is theoretically a “platform” with self-service, it comes at a steep cost to developer productivity: it leads to “developing through deployment,” where iteration speed is sacrificed entirely. Now developers need to push code, wait for CI to build a new Docker image, push the image, redeploy, and finally wait for Kubernetes to pull and run the updated code. The iteration loop stretches to 30+ minutes just to add a print statement.
Alongside this transition, other tooling is often introduced into the mix to mediate access to compute, whether Kubeflow as a platform-in-a-box on Kubernetes, or KubeRay and other operators to enable distributed execution frameworks. Rather than being good platform abstractions, these place an additional burden on the end user: learning Docker or Kubernetes and working with YAML. The new challenges of scaling, configuring, and debugging applications through a stack designed for infrastructure and platform engineers pull the overall system back toward “human-based enablement” instead of fast developer velocity.
Defining a Platform for ML
To have a good platform, we must first define good metrics to optimize. For instance, Meta measures its ML training experience by “time to first batch”: the time from the end of coding to the first batch of a training run executing in production. Because ML is fundamentally a research discipline, the ability to launch training experiments quickly ties directly to the amount of value that ML research can create.
Your team can copy Meta’s definition of time to first batch (for either training or inference) and measure how long it takes to convert regular Python into a run over production-scale data and compute. Alternatively, you can aim to minimize the number of ideas that go untested because the effort of launching the experiment is higher than the perceived value. The better the platform, the lower the barrier to entry.
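As a rough illustration of measuring this, here is a hypothetical sketch of tracking time to first batch per experiment launch; the class, field names, and sample timestamps are assumptions for illustration, not part of any specific tool.

```python
# A hypothetical sketch of tracking "time to first batch": record when code is
# ready (e.g., the PR merges or the job is submitted) and when the first batch
# runs over production-scale data, then report the gap.
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class ExperimentLaunch:
    code_ready_at: datetime    # developer finished coding / submitted the job
    first_batch_at: datetime   # first batch processed on production data and compute

    @property
    def time_to_first_batch(self) -> timedelta:
        return self.first_batch_at - self.code_ready_at


# Illustrative sample points only.
launches = [
    ExperimentLaunch(datetime(2024, 5, 1, 14, 0), datetime(2024, 5, 1, 14, 9)),
    ExperimentLaunch(datetime(2024, 5, 2, 10, 30), datetime(2024, 5, 2, 16, 45)),
]
minutes = [launch.time_to_first_batch.total_seconds() / 60 for launch in launches]
print(f"Median time to first batch: {median(minutes):.0f} minutes")
```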
To be more tactical, we have a few opinionated perspectives on implementation. First, workloads should run on Kubernetes. Kubernetes is a near-default assumption for generic cloud workloads at scale and, in practice, is already widely adopted by mature ML teams. To name just a few of the benefits, Kubernetes offers portability, scalability, and a rich ecosystem for observability and management, and it is fundamentally familiar to core platform teams. For inference, autoscaling from zero to infinity is extremely useful. For training, Kubernetes is a good place to launch many replicas for distributed workloads or hyperparameter optimization. Significant fringe benefits include offloading cluster management to the core platform team and plugging into existing auth, management, and logging.
Second, everything on the developer’s end, from the program to its execution, should work entirely in regular Python. Generally, developers and data scientists know Python best and don’t know Kubernetes nearly as well. Debugging and developing in Python are significantly more ergonomic and eliminate onboarding trauma. This extends to more subtle features as well, such as propagating Python errors or streaming Python logs. Using Python to interact with the infrastructure that executes regular Python keeps data scientists and ML engineers happy and removes the need for manual enablement from a Kubernetes expert.
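To make this concrete, here is a toy, self-contained sketch of the pattern; the `run_on_platform` decorator and its `gpus` argument are hypothetical stand-ins, not Kubetorch’s or any other library’s actual API. In a real platform, the decorator would package the function and submit it to Kubernetes; here it simply runs locally so the developer-facing ergonomics are visible: plain Python calls, plain Python exceptions, plain logs.

```python
# A toy stand-in for the pattern (not any real library's API): the developer
# writes and calls a regular Python function, and a thin platform layer decides
# where it runs. A real implementation would build/reuse an image, submit a
# Kubernetes workload, stream its logs back, and re-raise remote exceptions.
import functools
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("platform")


def run_on_platform(gpus: int = 0):
    """Hypothetical dispatch decorator; the `gpus` argument is illustrative."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            log.info("Dispatching %s (gpus=%d) to the compute platform", fn.__name__, gpus)
            # Toy version: run locally. A real platform layer would submit to
            # Kubernetes here and propagate logs and errors back to the caller.
            return fn(*args, **kwargs)
        return wrapper
    return decorator


@run_on_platform(gpus=1)
def train_step(batch: list[float]) -> float:
    if not batch:
        raise ValueError("empty batch")  # surfaces to the caller as a normal Python error
    return sum(batch) / len(batch)


if __name__ == "__main__":
    print(train_step([0.2, 0.4, 0.9]))  # logs stream, a result returns, no YAML involved
```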
Finally, not only should code always be production-ready, but iteration speed should never be sacrificed. Code is written and pushed to a monorepo, and it should be immediately executable by anyone with permissions. As an example, on their first day of work, a new ML teammate is asked to make a small change or bugfix. They check out the code, iteratively make edits while interactively testing the changes in a production-like setting to confirm they work, open a pull request, and, once it is merged, have that code running in production that night. In this process, there is no “translation” to production, and iterative execution is fast. This is a reasonable bar because it is already the expectation in software.
Kubetorch Was Built to Be Your Platform’s Compute Foundation
Thus far, we’ve been fairly high-level in reasoning about ML platforms. There is no single tactical “right answer,” and there are many ways to slowly grow homegrown tooling into better abstractions and offer a good platform. However, it is not a simple task, and that was the motivation for building Kubetorch. Kubetorch is designed to be the foundational compute component of a great ML platform, providing an interface to compute that offers both flexibility and robustness:
- A pythonic API that is friendly to ML developers, but automatically scales, distributes, and hardens their code as production Kubernetes applications.
- Extremely fast iteration loops, with hot reloading in development to avoid wasted developer time.
- Fully production-ready execution of regular code, avoiding any reproducibility gap or additional effort to turn finalized research work into production execution.
- Installation on Kubernetes to plug into the robust ecosystem already managed by the core platform team.
If you’d like to chat more about the evolution of ML development and ML platforms, learn more about Kubetorch, or anything else, shoot us a quick note at hello@run.house.