
TL;DR

  • xAI talk was not that useful.
  • KubeRay has a bunch of cool alpha/experimental features.
  • Ray Direct Transport is really useful and seems like an obvious drop-in improvement for training.
  • SkyRL is cool, but it’s only a month old, so it will be a while before it’s stable or has a reasonable feature set.

1:00: Scaling Image and Video Processing with Ray (xAI)

Grok imagine basics

  • xAI image+video generation: T2I, T2V with audio
  • Emphasis on dynamic motion data for video training
  • The two goals xAI has for their training infra:
    • 2x compute should equal 2x throughput.
    • minimize human time spent working with infra (auto-restart, etc.)

Fault tolerance

This is for data curation and pretraining mostly.

  • Uses a Redis queue as the single point of failure and the only stateful component
  • Ray actors should be idempotent
  • Redis Reliable Queue pattern (sketch after this list)
  • What happens if the RayCluster crashes?
    • Health checks: GCS ready, raylet, Ray dashboard, agent processes
  • observability
    • ray+kuberay metrics
    • k8s events
  • Make almost everything preemptible
    • they allegedly got backoffLimit to work somehow, so it is possible.
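My rough sketch of the reliable-queue pattern as they described it (they didn’t share code, and names like tasks:pending are mine): the worker atomically moves each task onto a processing list, so a crashed worker’s task can be re-queued, and idempotency makes the retry safe.

```python
import ray
import redis


def process(task: bytes) -> None:
    """Idempotent work: decode video, compute embeddings, etc."""
    ...


@ray.remote(max_restarts=-1)
class QueueWorker:
    def __init__(self, redis_host: str):
        self.r = redis.Redis(host=redis_host)

    def run(self) -> None:
        while True:
            # Atomically move one task from pending to a processing list;
            # if this worker dies mid-task, a janitor can push it back.
            task = self.r.brpoplpush("tasks:pending", "tasks:processing", timeout=5)
            if task is None:
                continue  # queue empty; poll again
            process(task)  # idempotent, so a retry after a crash is safe
            self.r.lrem("tasks:processing", 1, task)  # ack: done
```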

Ad break: xAI is hiring lol.

Q&A

  • How do you dedup video? how is this different from text?
    • Video is way more compute and storage heavy
    • non-answer.
  • What do you do if the RayCluster head dies?
    • don’t care if jobs fail. auto-restart them.
    • store state in shared queue. just resume.
  • How to make redis fault-tolerant?
    • lots of replicas
    • redis clusters
  • how to horizontally scale?
    • a single RayCluster can handle up to 20k actors
    • address the bottleneck (sounds like bandwidth)
  • what library do they use for video decoding?
    • ffmpeg
    • decode
    • torchcodec
  • why Ray actors instead of k8s Jobs?
    • better retry behavior
    • can specify fractional GPUs (see the sketch after this list)
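A sketch of what that buys you (my example, not theirs): retries and fractional GPUs are one-line annotations on a Ray actor, whereas a k8s Job requests whole devices and retries whole pods.

```python
import ray


# Four of these actors can share a single GPU, which a k8s Job can't
# express; max_restarts gives per-actor retry behavior for free.
@ray.remote(num_gpus=0.25, max_restarts=3)
class VideoDecoder:
    def decode(self, chunk: bytes) -> bytes:
        ...  # stand-in for the actual decode work


ray.init()
decoders = [VideoDecoder.remote() for _ in range(4)]
```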

takeaways

  • seems mostly like common sense. they’re obviously not sharing any secret sauce.

1:45: Advancing KubeRay

what is KubeRay?

upcoming enhancements

1.5

  • Lift resources and labels into structured k8s fields (could be useful, but seems like mostly refactoring?)
  • Atomic multi-host pods
    • TPU webhooks
    • Treat every replica group as one object to do atomic things like patch/delete
  • RayService incremental upgrade strategy (NewClusterWithIncrementalUpgrade)
  • RayJob sidecar submission mode (seems useful)
    • Launches the job submission container within the Ray head pod
    • Avoids cross-pod network communication
    • Avoids the case where the RayCluster is waiting for a k8s Job to be scheduled
  • RayJob DeletionPolicy (very, very useful)
    • fine-grained control over when to delete workers
    • need to enable the RayJobDeletionPolicy feature gate
    • keep RayClusters around at a much smaller scale for a long time, which allows for easier debugging
  • support for PodGroup from k8s-sigs

ecosystem

  • kubectl plugin
    • can look into for gkr but seems like more of a QoL thing than a feature
    • kubectl ray create and kubectl ray scale commands
  • APIServer
    • custom authentication+authorization (also very useful for gkr but not critical)
  • Dashboard
    • view and manage KubeRay resources.
    • we should get this up
  • KubeRay now emits Prometheus metrics
    • this is huge for the program

upcoming

  • in-place pod resizing
    • fast (~2s) vertical scaling of a single k8s pod’s resources
  • History server
    • access to ray logs, events, and dashboard after termination
    • super useful
    • basically a better version of gkr-ui

2:30: JIT Embedding

went to this talk since i don’t know how this works and am curious what the differences are between ours and Adobe’s.

  • highest meme-to-content ratio of any talk so far... also the speaker was talking at 2x speed. i did my best.

background

  • VAE: variational auto-encoder; maps data to latents
    • Generate embeddings on the fly for maximum flexibility (one training iteration = VAE encode + forward pass + backward pass)
    • Alternatively, precompute embeddings offline
    • VAEs are prone to breaking changes as they evolve
  • JIT embedding 2x’ed their training speed

architecture

  • VAEs are small models; running them on H200s is wasteful.
  • Main idea: a dedicated service for on-the-fly VAE computation w/ Ray Serve, deployed on A100s or other cheaper GPUs (rough sketch after this list)
  • Client-side PyTorch DataLoader integration
  • Rust-based serde
  • Batch prefetching
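The rough shape of such a service as i understood it (my reconstruction: load_vae, serialize, and deserialize are hypothetical stand-ins for their Rust serde layer):

```python
import torch
from ray import serve

from mymodels import load_vae                # hypothetical VAE loader
from myserde import serialize, deserialize   # stand-in for their Rust serde


@serve.deployment(
    num_replicas=8,
    ray_actor_options={"num_gpus": 1},  # lands on the cheaper-GPU pool
)
class VAEEncoder:
    def __init__(self):
        self.vae = load_vae().eval().cuda()

    async def __call__(self, request):
        batch = deserialize(await request.body())
        with torch.no_grad():
            latents = self.vae.encode(batch.cuda())
        return serialize(latents.cpu())


app = VAEEncoder.bind()
# serve.run(app)  # training clients then hit this over HTTP
```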

challenges

  • Multimodality
    • fast Rust serialization + lossy codecs for image/video + low-level optimizations (MJPEG / zero-copy deserialization / SIMD)
  • Latency
    • encoding requests are not CRUD-style and have very high latency; the model shouldn’t be waiting seconds for a response
    • use order determinism to allow for massive prefetching (sketch after this list)
      • start VAE computation far ahead of time and offload the scheduling problem to a separate load balancer
  • Imbalanced GPU utilization
    • how do we run on heterogeneous hardware?
    • same solution as latency: rely on LB
    • note: we don’t have this problem (yet). all of our training workloads are fairly homogeneous
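What the order-determinism trick looks like client-side (minimal sketch; encode_batch is a hypothetical async call into the VAE service that returns a concurrent.futures.Future):

```python
from collections import deque

PREFETCH_DEPTH = 64  # requests in flight; hides seconds of encode latency


def encoded_batches(dataset, encode_batch):
    """Yield encoded batches in order, keeping PREFETCH_DEPTH in flight.

    Because batch order is deterministic, we can kick off VAE encoding
    far ahead of the batch the trainer actually needs next.
    """
    inflight = deque()
    for batch in dataset:
        inflight.append(encode_batch(batch))  # fire-and-remember
        if len(inflight) >= PREFETCH_DEPTH:
            yield inflight.popleft().result()  # oldest request first
    while inflight:
        yield inflight.popleft().result()
```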

efficient serde

(note: this section will likely become irrelevant once Direct Transport becomes a thing.)

misc

  • Adobe is also planning on implementing Ray Direct Transport (see below)

3:15: Ray Direct Transport

The Netflix Mako talk I wanted to go to instead was crazy oversubscribed (I got there 10 minutes early and they were already turning people away due to capacity). They’re supposed to release a video of it soon, so we should watch that.

Alpha in Ray 2.51.1.

why

  • RL requires composing training with inference (use the model to generate training data and feed it back into itself).
    • Main challenge: transferring rollout data and weights between the training and inference engines
  • Ray’s object store is inefficient for GPU-GPU transfer
    • Previously, the Ray object store lived in CPU memory.
    • With RDT, we’d like to use RDMA directly for ~1000x faster transfers between GPU memory without needing to go through the CPU.
  • RDT is built for:
    • large objects (TB/s of data over network)
    • specialized data transport (RDMA over IB, NVLink)
  • Goal: use Ray Core API with a custom data transport so we’re no longer limited to the Ray Object Store.
  • Main benefits:
    • Keep data unserialized in GPU memory
    • Pick your own data transport
    • Let ray manage data transfer, GC, failure tolerance

how

  • Gloo, NCCL, and NIXL support
  • Supports Ray actors and torch.Tensors
  • RDT objects are mutable
    • RDT is copy-by-reference, not by-value. RDT automatically handles write concurrency.
  • Can do either collective (NCCL) transports or point-to-point (NIXL) via ray.put+ray.get (sketch after this list)
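A sketch of trainer-to-inference weight sync based on my reading of the alpha API (the tensor_transport hook on ray.method is what the docs describe, but since this is alpha the exact names may shift; check the RDT docs before relying on this):

```python
import ray
import torch


@ray.remote(num_gpus=1)
class Trainer:
    def __init__(self):
        self.weights = torch.randn(2**20, device="cuda")

    @ray.method(tensor_transport="nccl")  # result stays in GPU memory
    def get_weights(self) -> torch.Tensor:
        return self.weights  # copy-by-reference, not via the object store


@ray.remote(num_gpus=1)
class InferenceEngine:
    def load_weights(self, weights: torch.Tensor) -> None:
        # Resolving the ref here triggers a direct GPU-to-GPU transfer
        # (NCCL here; NIXL/RDMA for point-to-point) instead of a
        # round-trip through CPU memory.
        self.weights = weights


trainer = Trainer.remote()
engine = InferenceEngine.remote()
ref = trainer.get_weights.remote()
ray.get(engine.load_weights.remote(ref))
```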

use cases

  • RDMA-based weights sync between trainer and inference engine for RL

future

  • verl + SkyRL
  • asyncio support
  • CUDA IPC + BYO transport

Q&A

  • differences between single-node and multi-node transport?
    • decorators can be specified at runtime to choose a transport based on placement

4:00: SkyRL tx

needed to leave early to make it to my next meeting.

what

  • enables use of the Tinker API on our own hardware
  • Tinker API basics
    • run a simple loop on CPU, execute on GPUs (abstracts away the idea of compute acceleration)
    • Four composable core methods: forward_backward(), optim_step(), sample(), save_state() (loop sketch after this list)
    • Multiplex GPU resources for many individual users
  • “let AI researchers ignore infrastructure”
  • there should be a shared framework for training infra
  • really new project. (1ish month old)
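The four primitives compose into a training loop shaped roughly like this (my sketch of the shape of the API; the client class is a stand-in for illustration, not the real SDK):

```python
class TinkerLikeClient:
    """Stand-in with the four method names from the talk."""

    def forward_backward(self, batch) -> None: ...   # accumulate grads
    def optim_step(self) -> None: ...                # apply optimizer update
    def sample(self, prompts) -> list: ...           # rollouts on the GPU pool
    def save_state(self) -> None: ...                # checkpoint


def train(client: TinkerLikeClient, dataset, num_steps: int) -> None:
    # The user's loop runs on plain CPU; each call executes on
    # multiplexed GPUs behind the API.
    for step, prompts in zip(range(num_steps), dataset):
        rollouts = client.sample(prompts)
        client.forward_backward({"prompts": prompts, "rollouts": rollouts})
        client.optim_step()
        if step % 100 == 0:
            client.save_state()
```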

common engine for training and inference

  • built on top of JAX; PyTorch support coming soon.
  • “an inference engine that also happens to do backward passes”
  • unsure if this will be the ultimate form; maybe using existing training/inference frameworks will end up being more sustainable

architecture

  • FastAPI with relational DB
  • Batch scheduling (CPU-based engine)
  • Performance optimizations
    • Padding tensors for JIT (avoids frequent recompilation; sketch at the end of this section)
    • KV cache
    • Microbatches
    • Gradient checkpointing
  • Future
    • Paged attention
    • Prefix caching
    • Custom kernels
  • Sharding
    • Currently tensor parallelism via JAX primitives
    • Exploring FSDP
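The padding trick, since it’s easy to show: jax.jit recompiles for every new input shape, so pad batches up to a small set of bucket lengths and trade some wasted FLOPs for a bounded number of compiles. A minimal sketch (bucket sizes are mine):

```python
import jax
import jax.numpy as jnp

BUCKETS = (128, 256, 512, 1024, 2048)  # allowed padded sequence lengths


def pad_to_bucket(tokens: jnp.ndarray) -> jnp.ndarray:
    """Pad [batch, seq] up to the next bucket so jit sees at most
    len(BUCKETS) distinct shapes instead of one per sequence length."""
    seq_len = tokens.shape[1]
    target = next(b for b in BUCKETS if b >= seq_len)
    return jnp.pad(tokens, ((0, 0), (0, target - seq_len)))


@jax.jit  # compiled once per bucket shape, then served from cache
def forward(tokens: jnp.ndarray) -> jnp.ndarray:
    return tokens.sum(axis=-1)  # stand-in for the real model forward
```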