AI Platforms: Past, Present, and Future

Machine learning development is stuck in slow, fragmented cycles, far removed from the fast-paced, iterative nature of traditional software development. This blog will explore how we got here, define the essential requirements for a modern ML platform, and conclude by introducing Runhouse as a solution that was built to accelerate ML development.

Paul Yang

ML @ 🏃‍♀️Runhouse🏠

Donny Greenberg

CEO @ 🏃‍♀️Runhouse🏠

Published October 10, 2024
Car with square wheels

The motivation for this article is a series of conversations we recently had with teams as they first committed to launching and deploying ML projects at scale. The general maturation of machine learning and deep learning, alongside the momentum from generative AI, has intensified the focus on implementing ML. But as teams enter the fray, they discover that developer enablement is challenging in such a fast-moving space.

In traditional software development, platform engineering has key goals: improving developer productivity, ensuring standardization, enabling self-service, and managing resources efficiently. An ML platform must do exactly the same. Cloud platforms offer dozens of buttons that unblock individual ML features, but it is important not to conflate “many tools” with a platform that genuinely unblocks, and even democratizes, development. The average team still operates with slow and fragmented development cycles.

In this article, we will trace why we arrived at the current state of ML tooling and how it has diverged from software development, before defining a series of requirements for an effective ML platform. Finally, we will make a case for Runhouse as the right choice for teams getting started with serious ML development.

We Forked the Development Lifecycle

Machine learning (ML) methods that require GPU acceleration and even distributed training have become significantly easier for teams to implement. Most frameworks are ergonomic and well-documented. As a result, it’s become clear that infrastructure, rather than expertise or frameworks, is the biggest impediment to the ML flywheel. Development and release cycles at most ML teams remain extremely long, on the order of weeks or months.

Early in enterprise ML’s lifecycle, before 2015, most organizations approached ML as “algorithms-as-a-service” rather than as actual development, with little customization or ownership of infrastructure. But as the business value became clear, sophisticated tech companies from Google (Sibyl) to Facebook (FBLearner Flow) turned their focus to pipelines, with an eye toward scalability, customization, and continuous integration.

Sibyl talk from 2014

Within this shift toward pipelines and the growth of ML as a functional area, the original sin of forking ML development from regular software development also occurred. Once dataset sizes and training methods exceeded the limits of local machines, local development became impossible. The obvious solution, which teams still reach for first, is simply to do “local development” remotely – for many teams, the vendor-supported pathway is launching a Jupyter notebook on a VM. This restores the local-like iterability and debuggability needed in the early stages of a project, or “research.” But researchers write code that is not intended for production, often against a static snapshot of the data due to the limitations of hosted notebooks.

Introducing Amazon SageMaker in 2017
What else could you need?

To move a pipeline to production, teams move out of notebooks to ensure reproducible execution and to layer in software best practices. Code is packaged into an orchestrator, whether Airflow, Kubeflow, or another flavor of flow. These are a class of tools already familiar to data teams, who can attest to how well they enable observable, scheduled execution of pipelines. However, rather than acting as a thin scheduler over regular code, orchestrator pipelines tightly bundle code, configs, and DSLs together.
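To make that concrete, here is a minimal, illustrative sketch of how a training step typically ends up wrapped in an orchestrator. The DAG name, schedule, and train() body are hypothetical, and exact arguments vary by Airflow version; the point is that scheduling config, retry policy, and the training code itself all end up bundled into one deployable artifact.

```python
# Illustrative Airflow-style DAG (names, schedule, and retry policy are hypothetical).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def train():
    # Stand-in for the team's real training code: pull a dataset snapshot,
    # fit a model, and write out artifacts.
    print("training model...")


with DAG(
    dag_id="nightly_training",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
    catchup=False,
) as dag:
    PythonOperator(task_id="train", python_callable=train)
```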

This packaging dramatically slows the path to production. Production orchestrator pipelines are finished via a process best described as “development through deployment,” where the only way to test an iterative change is to rebuild and resubmit the production pipeline. Iteration is glacial. Further improvements or debugging of errors in recurring production runs follow the same slow loop, since edits are made by changing the orchestrator pipeline rather than by simply editing code with an interactive debugger.

Research is notebooks; production is Airflow

To be clear, the notebooks-plus-DAGs approach is a "good" one; it's the modern standard. Though slow, it eventually enables recurrent, reproducible execution. It has replaced (or continues to battle against) worse practices. For example, a worse alternative is saving a pickled model in the last cell of a notebook, with no realistic plan for retraining or ensuring reproducibility in the future. Another poor practice is scheduling a chain of notebooks to run recurrently, without tests or portability. However, the notebooks-plus-DAGs approach has now become a significant drag on productivity and robustness for teams conducting large-scale or recurring training across heterogeneous hardware and methods.

In recent years, several disparate trends have emerged. First, many organizations were distracted by brief forays into "citizen data science," which led to the boom and bust of AutoML solutions. This was a full-scale attempt to break from ML-as-software development, but it was largely unsuccessful. Second, teams have been pulled away from focusing on ML enablement toward generative AI and large language models. While this shift adds value for some companies, it has, from a platform perspective, placed greater emphasis on inference and application building.

Within actual ML platform infrastructure, however, the major story has been the adoption of Ray by teams like Uber, Pinterest, and Shopify. Ray not only simplifies the execution of distributed programs but also goes a long way toward closing the research-to-production gap discussed earlier: teams can run the same Ray code on Ray clusters in both research and production environments. Instead of relying on notebooks-plus-DAGs, teams that switch to Ray run ML program code on ephemeral clusters. The willingness of sophisticated teams to adopt Ray, despite how disruptive it is to existing codebases and infrastructure, suggests we have vastly underestimated how much value the right approach to building an ML platform can unlock.
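As a rough illustration (the function and cluster here are hypothetical), the appeal is that the same few lines of Ray code run unchanged on a laptop or against a remote cluster, depending only on what ray.init() connects to:

```python
import ray

# Connect to whatever Ray cluster is available. With no arguments this starts
# a local cluster in-process; in production the same script can attach to a
# remote cluster, e.g. via an address like "ray://<head-node>:10001".
ray.init()


@ray.remote
def score_batch(batch_id: int) -> int:
    # Stand-in for real work: feature computation, scoring, or a training shard.
    return batch_id * batch_id


# Fan the work out across the cluster and gather the results.
results = ray.get([score_batch.remote(i) for i in range(8)])
print(results)
```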

Defining the Requirements of an ML Platform

Over the last decade, “ML platform” has meant very different things as teams have matured. If we pause and define the requirements of an ML-enabling platform from scratch today, these are the priorities:

Identical Execution in Research and Production: There are no distinctions or translations between research and production; code written by researchers can be identically executed in production. Version control and release procedures govern production, not packaging.

Fast, Iterable Development: Engineers can instantly execute and test local code changes without long delays or waiting for deployment. This is useful for both research and debugging production.

Regular Programs: Regular code is used without modifications, DSLs, or decorators to customize it for the ML platform. The platform imposes no lock-in and has no overhead for adoption.

Minimized Compute Management: Compute is abstracted away from developers, who simply request available resources in their code and execute on the returned compute object. This abstraction also enables platform teams to achieve significantly higher utilization of compute.

Well-Integrated Governance: The platform integrates with existing authentication, lineage, and logging systems, streamlining management for platform teams while keeping access self-service for developers. It should work within existing platform engineering governance rather than around it.

The Pitch for Runhouse

Runhouse was designed to be the lightest-weight and least opinionated platform for ML.

Identical, reproducible execution in research and production: Runhouse’s principal feature is dispatching regular code as-is to compute and calling it for remote execution. Runhouse also lets you define the compute requirements and the execution environment (any combination of Docker image, conda env, pip installs, and bash commands) in code rather than in additional config files. If the code is the same and the dispatch is the same, then execution will be the same across research and production. This allows for fast time to production, but it also enables the reverse, production-to-local, for further improvement or debugging.
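As a minimal sketch of that dispatch pattern (the cluster name, instance type, and training function are placeholders, and exact arguments may differ across Runhouse versions):

```python
import runhouse as rh


def train_model(epochs: int = 3):
    # Regular Python: no decorators or DSL, just the team's training code.
    print(f"training for {epochs} epochs")


# Request compute in code (placeholder name and instance type).
gpu = rh.cluster(name="rh-a10g", instance_type="A10G:1", provider="aws").up_if_not()

# Send the function and its environment to the cluster, then call it as if it
# were local. The same dispatch runs from a laptop, CI, or an orchestrator node.
remote_train = rh.function(train_model).to(gpu, env=["torch"])
remote_train(epochs=3)
```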

Fast, iterable development from local, while executing on powerful remote compute: Dispatching code for remote execution takes under two seconds, so local iteration feels fully interactive. Whether prototyping a training pipeline for the first time or debugging a production run, engineers get a fully iterable, debuggable development experience.

Regular code with no decorators or DSLs: Runhouse is fully agnostic to what is being run, and you do not need to decorate your applications. Dispatch everything as-is, whether a basic function, Ray code, or a PyTorch distributed training job. New teammates become productive more quickly, since there is nothing proprietary to learn and no complex configs to figure out. It also means Runhouse never interferes with the rest of your tooling – keep your existing decorators and type-check with MyPy as usual.

MLOps is eliminated: Engineers think about what compute they need rather than where it comes from, and get a “serverless” experience for training pipelines. Runhouse works over existing clusters, VMs, or elastic compute. Meanwhile, the platform team gets a single pane of observability over the entire compute footprint, including usage, logging, and telemetry.
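For example, here is a hedged sketch of what that can look like (the IP address, credentials, and instance type below are placeholders, and parameter names may vary by Runhouse version):

```python
import runhouse as rh

# Bring your own compute: point at an existing VM or on-prem box that the
# platform team already manages (IP and SSH credentials are placeholders).
static_box = rh.cluster(
    name="shared-gpu-box",
    ips=["203.0.113.10"],
    ssh_creds={"ssh_user": "ubuntu", "ssh_private_key": "~/.ssh/id_rsa"},
)

# Or request elastic compute by shape and let it be launched on demand.
elastic = rh.cluster(name="a10g-spot", instance_type="A10G:1", provider="aws")

# Either way, code is dispatched to the resulting cluster object the same way,
# so engineers reason about the compute they need, not where it lives.
```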

Well-integrated into your existing tooling: Runhouse gives you a platform built on an extremely lightweight, unopinionated framework that plugs into your existing VPCs, IAM, catalogs, and any other tooling you already run.

Code from research to production is identical

There’s a Difference between Enablement and Empowerment

Regardless of where your team is in the lifecycle, from launching your first ML projects to expanding toward multi-node training, each capability increase can be approached through either enablement or empowerment. We view developer "enablement" as the process where an ML platform adds a specific feature to unblock one additional use case. For instance, launching hosted notebooks is a simple way to start with ML, but it often creates a permanent fork in the research codebase. Notebooks can still be used, but ideally as an interactive shell for dispatching and executing regular code.

“Empowerment” takes a broader view, requiring a platform that supports regular software development practices. Systems are captured in code, reused, and shared, while heavy compute is made self-service and accessible. In this setting, the ML platform eliminates barriers that have long plagued ML teams – slow iteration cycles, complex pipelines, and fragmented environments. This is exactly what we built Runhouse to do: unify research and production while giving teams control over execution to accelerate the time from experimentation to impact.
