Kubetorch: Accelerated AI/ML on Kubernetes
Build, scale, and optimize your ML workloads on Kubernetes with 100x faster iteration and debugging, plus industry-leading fault tolerance.
Eager Pythonic APIs
Thousands of Nodes
Magical Instant Iteration
Comprehensive Observability
Fast iteration on scaled compute
Develop on Kubernetes interactively and redeploy code changes in seconds with hot syncing.
import kubetorch as kt

# Any training function
def train_ddp(args):
    ...

# Define distributed compute for training
gpus = kt.Compute(
    gpus=8,
).distribute("pytorch", workers=4)

# Hot sync local code changes and restart training in <2 seconds
remote_train = kt.fn(train_ddp).to(gpus)
results = remote_train(epochs=args.epochs, batch_size=args.batch_size)
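The same two calls drive the iteration loop: edit the training function locally, then send it to the same compute again. This is a minimal sketch of that cycle, assuming re-running .to() against live compute is what triggers the hot sync (the function signature and argument values here are illustrative):

import kubetorch as kt

def train_ddp(epochs, batch_size):
    ...  # edit this body locally between runs

gpus = kt.Compute(gpus=8).distribute("pytorch", workers=4)

# First deploy: provisions the workers and syncs the code
remote_train = kt.fn(train_ddp).to(gpus)
remote_train(epochs=1, batch_size=32)

# After a local edit, re-sending the function hot-syncs the change onto the
# same running compute, so the next call runs the new code in seconds
remote_train = kt.fn(train_ddp).to(gpus)
remote_train(epochs=1, batch_size=32)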
Fault-tolerant
Stream logs and catch exceptions during code execution. Go ahead, resize your running training; we'll handle any failure.
while completed_epochs < args.epochs:
    try:
        trainer.train()
        completed_epochs += 1
    # Catch pod and node preemptions in code and resume
    except kt.WorkerMembershipChanged:
        new_worker_num = len(trainer.compute.pod_names())
        print(f"World size changed, continuing with {new_worker_num} workers")
        gpus.distribute(
            "pytorch",
            workers=new_worker_num,
        )
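For context, here is a minimal sketch of how the objects used in that loop (gpus, trainer, completed_epochs) might be created, reusing the kt.Compute and kt.cls patterns shown elsewhere on this page. The Trainer class and its train() method are assumptions, and calling .to() on kt.cls mirrors the kt.fn example above rather than a documented signature:

import kubetorch as kt

# Hypothetical user-defined trainer; only train() matters for the loop above
class Trainer:
    def __init__(self, epochs, batch_size):
        self.epochs = epochs
        self.batch_size = batch_size

    def train(self):
        ...  # run one epoch of DDP training

gpus = kt.Compute(gpus=8).distribute("pytorch", workers=4)
# Remote actor; the .compute / .pod_names() used in the loop above are
# presumably provided by the Kubetorch wrapper, not the Trainer class itself
trainer = kt.cls(Trainer).to(gpus)
completed_epochs = 0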
Flexible for any workflow
Run training, inference, and data processing (or mix all three as actors for RL).
# Create separate actors on Kubernetes for all my tasks
inference_coro = kt.cls(vLLM).to_async(inference_compute)
train_coro = kt.cls(GRPOTrainer).to_async(train_compute)
sandbox_coro = kt.cls(EvalSandbox).to_async(eval_compute)

# Deploy services in parallel - Kubetorch handles the orchestration
inf_service, train_service, sandbox_service = await asyncio.gather(
    inference_coro, train_coro, sandbox_coro
)

# Run the GRPO training loop calling into my services
await simple_async_grpo(
    dataset,
    train_service,
    inf_service,
    sandbox_service,
    num_epochs=15,
)
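simple_async_grpo above is user code rather than a Kubetorch API. A rough sketch of what it could look like follows; the generate(), score(), train_step(), and weight-sync methods on the services are illustrative assumptions, as is awaiting remote calls on services deployed with to_async:

async def simple_async_grpo(dataset, train_service, inf_service, sandbox_service, num_epochs=1):
    for epoch in range(num_epochs):
        for batch in dataset:
            # Sample completions from the inference actor
            completions = await inf_service.generate(batch["prompts"])
            # Score the completions in the evaluation sandbox
            rewards = await sandbox_service.score(batch["prompts"], completions)
            # Take a GRPO policy update step on the trainer actor
            await train_service.train_step(batch["prompts"], completions, rewards)
        # Refresh the inference actor with the latest policy weights
        await inf_service.load_weights(await train_service.get_weights())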
How to get started
Kubetorch Serverless
Try out Kubetorch on our managed, serverless platform. You'll get instant access to a Kubernetes cluster with Kubetorch installed and ready to use.
Request Access
Your Own Cluster
Visit our Kubernetes Installation Guide for a walkthrough on how to install Kubetorch on your own cluster.
Installation Guide