Kubetorch: Accelerated AI/ML on Kubernetes

Build, scale, and optimize your ML workloads on Kubernetes with 100x faster iteration and debugging and industry-leading fault tolerance.

Eager Pythonic APIs
Thousands of Nodes
Magical Instant Iteration
Comprehensive Observability

Fast iteration on scaled compute

Develop on Kubernetes interactively and redeploy code changes in seconds with hot syncing.

import kubetorch as kt

# Any training function
def train_ddp(args):
    ...

# Define distributed compute for training
gpus = kt.Compute(
    gpus=8,
).distribute("pytorch", workers=4)

# Hot sync local code changes and restart training in <2 seconds
remote_train = kt.fn(train_ddp).to(gpus)
results = remote_train(epochs=args.epochs, batch_size=args.batch_size)

Fault-tolerant

Stream logs and catch exceptions during code execution. Go ahead, resize your running training; we'll handle any failure.

while completed_epochs < args.epochs:
    try:
        trainer.train()
        completed_epochs += 1
    # Catch pod and node preemptions in code and resume
    except kt.WorkerMembershipChanged:
        new_worker_num = len(trainer.compute.pod_names())
        print(f"World size changed, continuing with {new_worker_num} workers")
        gpus.distribute(
            "pytorch",
            workers=new_worker_num,
        )

Flexible for any workflow

Run training, inference, and data processing (or mix all three as actors for RL).

import asyncio
import kubetorch as kt

# Create separate actors on Kubernetes for all my tasks
inference_coro = kt.cls(vLLM).to_async(inference_compute)
train_coro = kt.cls(GRPOTrainer).to_async(train_compute)
sandbox_coro = kt.cls(EvalSandbox).to_async(eval_compute)

# Deploy services in parallel - Kubetorch handles the orchestration
inf_service, train_service, sandbox_service = await asyncio.gather(
    inference_coro, train_coro, sandbox_coro
)

# Run the GRPO training loop calling into my services
await simple_async_grpo(
    dataset,
    train_service,
    inf_service,
    sandbox_service,
    num_epochs=15,
)

Use Cases

Training

PyTorch · Ray · Fine-tuning · HPO

PyTorch Multi-Node Distributed Training

Online Inference

Distributed · LLMs · Multi-step

Multi-GPU DeepSeek-R1 with vLLM
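
A rough sketch of how this inference use case could look, reusing only the kt.Compute and kt.cls APIs shown above. The DeepSeekVLLM wrapper class, its generate method, the model ID, and the GPU count are illustrative assumptions, not the published example.

import kubetorch as kt

# Hypothetical wrapper class around vLLM (names and sizing are illustrative)
class DeepSeekVLLM:
    def __init__(self, model_id="deepseek-ai/DeepSeek-R1"):
        from vllm import LLM
        # Shard the model across the visible GPUs with tensor parallelism
        self.llm = LLM(model=model_id, tensor_parallel_size=8)

    def generate(self, prompts):
        return [out.outputs[0].text for out in self.llm.generate(prompts)]

# Multi-GPU compute for the inference service
inference_compute = kt.Compute(gpus=8)

# Deploy the class as a service and call it like a local object
remote_llm = kt.cls(DeepSeekVLLM).to(inference_compute)
print(remote_llm.generate(["Explain KV caching in one paragraph."]))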

Reinforcement Learning

Pipelines · Multi-task · Evaluation

Async GRPO Training

Batch Processing

Ray · Dask · Spark · Data Apps

Offline Batch Embeddings
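
A minimal sketch of an offline batch-embedding job with the same API surface, assuming a hypothetical BatchEmbedder class, a sentence-transformers model, and single-GPU compute; the real example's details may differ.

import kubetorch as kt

# Hypothetical embedder class (model choice and batch size are illustrative)
class BatchEmbedder:
    def __init__(self, model_name="sentence-transformers/all-MiniLM-L6-v2"):
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer(model_name, device="cuda")

    def embed(self, texts, batch_size=256):
        # Returns one embedding vector per input text
        return self.model.encode(texts, batch_size=batch_size).tolist()

# GPU compute for the batch job
embed_compute = kt.Compute(gpus=1)

# Deploy the embedder and send document batches through it
embedder = kt.cls(BatchEmbedder).to(embed_compute)
embeddings = embedder.embed(["first document", "second document"])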

How to get started

Kubetorch Serverless

Try out Kubetorch on our managed, serverless platform. You'll get instant access to a Kubernetes cluster with Kubetorch installed and ready to use.

Request Access

Your Own Cluster

Visit our Kubernetes Installation Guide for a walkthrough on how to install Kubetorch on your own cluster.

Installation Guide

Kubetorch Managed

Available on any cloud or on-prem Kubernetes cluster
Inquire