
TL;DR

  • xAI talk was not that useful.
  • KubeRay has a bunch of cool alpha/experimental features.
  • Ray Direct Transport is really useful and seems like an obvious drop-in improvement for training.
  • SkyRL is cool, but it’s only a month old, so it will be a while before it’s stable or has a reasonable feature set.

1:00: Scaling Image and Video Processing with Ray (xAI)

Grok imagine basics

  • xAI image+video generation: T2I, T2V with audio
  • Emphasis on dynamic motion data for video training
  • The two goals xAI has for their training infra:
    • 2x compute should equal 2x throughput.
    • minimize human time spent working with infra (auto-restart, etc.)

Fault tolerance

This is for data curation and pretraining mostly.

  • Uses a Redis queue as the single point of failure and the only stateful component
  • Ray actors should be idempotent
  • Redis Reliable Queue pattern (sketch after this list)
  • What happens if the RayCluster crashes?
    • Health checks: GCS ready, raylet, Ray dashboard, agent processes
  • observability
    • ray+kuberay metrics
    • k8s events
  • Make almost everything preemptible
    • they allegedly got backoffLimit to work somehow, so it is possible.
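My rough sketch of the reliable-queue pattern as they described it (they didn’t share code, and names like tasks:pending are mine): the worker atomically moves each task onto a processing list, so a crashed worker’s task can be re-queued, and idempotency makes the retry safe.

```python
import ray
import redis


def process(task: bytes) -> None:
    """Idempotent work: decode video, compute embeddings, etc."""
    ...


@ray.remote(max_restarts=-1)
class QueueWorker:
    def __init__(self, redis_host: str):
        self.r = redis.Redis(host=redis_host)

    def run(self) -> None:
        while True:
            # Atomically move one task from pending to a processing list;
            # if this worker dies mid-task, a janitor can push it back.
            task = self.r.brpoplpush("tasks:pending", "tasks:processing", timeout=5)
            if task is None:
                continue  # queue empty; poll again
            process(task)  # idempotent, so a retry after a crash is safe
            self.r.lrem("tasks:processing", 1, task)  # ack: done
```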

Ad break: xAI is hiring lol.

Q&A

  • How do you dedup video? how is this different from text?
    • Video is way more compute and storage heavy
    • non-answer.
  • What do you do if the RayCluster head dies?
    • don’t care if jobs fail. auto-restart them.
    • store state in shared queue. just resume.
  • How to make redis fault-tolerant?
    • lots of replicas
    • redis clusters
  • how to horizontally scale?
    • a single RayCluster can handle up to 20k actors
    • address the bottleneck (sounds like bandwidth)
  • what library do they use for video decoding?
    • ffmpeg
    • decode
    • torchcodec
  • why Ray actors instead of k8s Jobs?
    • better retry behavior
    • can specify fractional GPUs (see the sketch after this list)
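A sketch of what that buys you (my example, not theirs): retries and fractional GPUs are one-line annotations on a Ray actor, whereas a k8s Job requests whole devices and retries whole pods.

```python
import ray


# Four of these actors can share a single GPU, which a k8s Job can't
# express; max_restarts gives per-actor retry behavior for free.
@ray.remote(num_gpus=0.25, max_restarts=3)
class VideoDecoder:
    def decode(self, chunk: bytes) -> bytes:
        ...  # stand-in for the actual decode work


ray.init()
decoders = [VideoDecoder.remote() for _ in range(4)]
```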

takeaways

  • seems mostly like common sense. they’re obviously not sharing any secret sauce.

1:45: Advancing KubeRay

what is KubeRay?

upcoming enhancements

1.5

  • Lift resources and labels into structured k8s fields (could be useful, but seems like mostly refactoring?)
  • Atomic multi-host pods
    • TPU webhooks
    • Treat every replica group as one object to do atomic things like patch/delete
  • RayService incremental upgrade strategy (NewClusterWithIncrementalUpgrade)
  • RayJob sidecar submission mode (seems useful)
    • Launches the job submission container within the Ray head pod
    • Avoids cross-pod network communication
    • Avoids the case where the RayCluster is waiting for a k8s Job to be scheduled
  • RayJob DeletionPolicy (very, very useful)
    • fine-grained control over when to delete workers
    • need to enable the RayJobDeletionPolicy feature gate
    • keep RayClusters around at a much smaller scale for a long time, which allows for easier debugging
  • support for PodGroup from k8s-sigs

ecosystem

  • kubectl plugin
    • can look into for gkr but seems like more of a QoL thing than a feature
    • kubectl ray create and kubectl ray scale commands
  • APIServer
    • custom authentication+authorization (also very useful for gkr but not critical)
  • Dashboard
    • view and manage KubeRay resources.
    • we should get this up
  • KubeRay now emits Prometheus metrics
    • this is huge for the program

upcoming

  • in-place pod resizing
    • fast (~2s) vertical scaling of a single k8s pod’s resources
  • History server
    • access to ray logs, events, and dashboard after termination
    • super useful
    • basically a better version of gkr-ui

2:30: JIT Embedding

went to this talk since i don’t know how this works and am curious what the differences are between ours and Adobe’s.

  • highest meme-to-content ratio of any talk so far... also the speaker was talking at 2x speed. i did my best.

background

  • VAE: variational auto-encoder; maps data to latents
    • Generate embeddings on the fly for maximum flexibility (one training iteration = VAE encode + forward pass + backward pass)
    • Alternatively, precompute embeddings offline
    • VAEs are prone to breaking changes as they evolve
  • JIT embedding 2x’ed their training speed

architecture

  • VAEs are small models; running them on H200s is wasteful.
  • Main idea: a dedicated service for on-the-fly VAE computation w/ Ray Serve, deployed on A100s or other cheaper GPUs (rough sketch after this list)
  • Client-side PyTorch DataLoader integration
  • Rust-based serde
  • Batch prefetching
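The rough shape of such a service as i understood it (my reconstruction: load_vae, serialize, and deserialize are hypothetical stand-ins for their Rust serde layer):

```python
import torch
from ray import serve

from mymodels import load_vae                # hypothetical VAE loader
from myserde import serialize, deserialize   # stand-in for their Rust serde


@serve.deployment(
    num_replicas=8,
    ray_actor_options={"num_gpus": 1},  # lands on the cheaper-GPU pool
)
class VAEEncoder:
    def __init__(self):
        self.vae = load_vae().eval().cuda()

    async def __call__(self, request):
        batch = deserialize(await request.body())
        with torch.no_grad():
            latents = self.vae.encode(batch.cuda())
        return serialize(latents.cpu())


app = VAEEncoder.bind()
# serve.run(app)  # training clients then hit this over HTTP
```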

challenges

  • Multimodality
    • fast Rust serialization + lossy codecs for image/video + low-level optimizations (MJPEG / zero-copy deserialization / SIMD)
  • Latency
    • encoding requests are not CRUD-style and have very high latency; the model shouldn’t be waiting seconds for a response
    • use order determinism to allow for massive prefetching (sketch after this list)
      • start VAE computation far ahead of time and offload the scheduling problem to a separate load balancer
  • Imbalanced GPU utilization
    • how do we run on heterogeneous hardware?
    • same solution as latency: rely on LB
    • note: we don’t have this problem (yet). all of our training workloads are fairly homogeneous
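What the order-determinism trick looks like client-side (minimal sketch; encode_batch is a hypothetical async call into the VAE service that returns a concurrent.futures.Future):

```python
from collections import deque

PREFETCH_DEPTH = 64  # requests in flight; hides seconds of encode latency


def encoded_batches(dataset, encode_batch):
    """Yield encoded batches in order, keeping PREFETCH_DEPTH in flight.

    Because batch order is deterministic, we can kick off VAE encoding
    far ahead of the batch the trainer actually needs next.
    """
    inflight = deque()
    for batch in dataset:
        inflight.append(encode_batch(batch))  # fire-and-remember
        if len(inflight) >= PREFETCH_DEPTH:
            yield inflight.popleft().result()  # oldest request first
    while inflight:
        yield inflight.popleft().result()
```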

efficient serde

(note: this section will likely become irrelevant once Direct Transport becomes a thing.)

misc

  • Adobe is also planning on implementing Ray Direct Transport (see below)

3:15: Ray Direct Transport

The Netflix Mako talk I wanted to go to instead was crazy oversubscribed (I got there 10 minutes early and they were already turning people away due to capacity). They’re supposed to release a video of it soon, so we should watch that.

Alpha in Ray 2.51.1.

why

  • RL requires composing training with inference (use the model to generate training data and feed it back into itself).
    • Main challenge: transferring rollout data and weights between the training and inference engines
  • Ray’s object store is inefficient for GPU-GPU transfer
    • Previously, the Ray object store lived in CPU memory.
    • With RDT, we’d like to use RDMA directly for ~1000x faster transfers between GPU memory without needing to go through the CPU.
  • RDT is built for:
    • large objects (TB/s of data over network)
    • specialized data transport (RDMA over IB, NVLink)
  • Goal: use Ray Core API with a custom data transport so we’re no longer limited to the Ray Object Store.
  • Main benefits:
    • Keep data unserialized in GPU memory
    • Pick your own data transport
    • Let ray manage data transfer, GC, failure tolerance

how

  • Gloo, NCCL, and NIXL support
  • Supports Ray actors and torch.Tensors
  • RDT objects are mutable
    • RDT is copy-by-reference, not by-value. RDT automatically handles write concurrency.
  • Can do either collective (NCCL) transports or point-to-point (NIXL) via ray.put+ray.get (sketch after this list)
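A sketch of trainer-to-inference weight sync based on my reading of the alpha API (the tensor_transport hook on ray.method is what the docs describe, but since this is alpha the exact names may shift; check the RDT docs before relying on this):

```python
import ray
import torch


@ray.remote(num_gpus=1)
class Trainer:
    def __init__(self):
        self.weights = torch.randn(2**20, device="cuda")

    @ray.method(tensor_transport="nccl")  # result stays in GPU memory
    def get_weights(self) -> torch.Tensor:
        return self.weights  # copy-by-reference, not via the object store


@ray.remote(num_gpus=1)
class InferenceEngine:
    def load_weights(self, weights: torch.Tensor) -> None:
        # Resolving the ref here triggers a direct GPU-to-GPU transfer
        # (NCCL here; NIXL/RDMA for point-to-point) instead of a
        # round-trip through CPU memory.
        self.weights = weights


trainer = Trainer.remote()
engine = InferenceEngine.remote()
ref = trainer.get_weights.remote()
ray.get(engine.load_weights.remote(ref))
```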

use cases

  • RDMA-based weights sync between trainer and inference engine for RL

future

  • verl + SkyRL
  • asyncio support
  • CUDA IPC + BYO transport

Q&A

  • differences between single-node and multi-node transport?
    • decorators can be specified at runtime to choose a transport based on placement

4:00: SkyRL tx

needed to leave early to make it to my next meeting.

what

  • enables use of the Tinker API on our own hardware
  • Tinker API basics
    • run a simple loop on CPU, execute on GPUs (abstracts away the idea of compute acceleration)
    • Four composable core methods: forward_backward(), optim_step(), sample(), save_state() (loop sketch after this list)
    • Multiplex GPU resources for many individual users
  • “let AI researchers ignore infrastructure”
  • there should be a shared framework for training infra
  • really new project. (1ish month old)
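The four primitives compose into a training loop shaped roughly like this (my sketch of the shape of the API; the client class is a stand-in for illustration, not the real SDK):

```python
class TinkerLikeClient:
    """Stand-in with the four method names from the talk."""

    def forward_backward(self, batch) -> None: ...   # accumulate grads
    def optim_step(self) -> None: ...                # apply optimizer update
    def sample(self, prompts) -> list: ...           # rollouts on the GPU pool
    def save_state(self) -> None: ...                # checkpoint


def train(client: TinkerLikeClient, dataset, num_steps: int) -> None:
    # The user's loop runs on plain CPU; each call executes on
    # multiplexed GPUs behind the API.
    for step, prompts in zip(range(num_steps), dataset):
        rollouts = client.sample(prompts)
        client.forward_backward({"prompts": prompts, "rollouts": rollouts})
        client.optim_step()
        if step % 100 == 0:
            client.save_state()
```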

common engine for training and inference

  • built on top of JAX; PyTorch support coming soon.
  • “an inference engine that also happens to do backward passes”
  • unsure if this will be the ultimate form; maybe using existing training/inference frameworks will end up being more sustainable

architecture

  • FastAPI with relational DB
  • Batch scheduling (CPU-based engine)
  • Performance optimizations
    • Padding tensors for JIT (avoids frequent recompilation; sketch at the end of this section)
    • KV cache
    • Microbatches
    • Gradient checkpointing
  • Future
    • Paged attention
    • Prefix caching
    • Custom kernels
  • Sharding
    • Currently tensor parallelism via JAX primitives
    • Exploring FSDP
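The padding trick, since it’s easy to show: jax.jit recompiles for every new input shape, so pad batches up to a small set of bucket lengths and trade some wasted FLOPs for a bounded number of compiles. A minimal sketch (bucket sizes are mine):

```python
import jax
import jax.numpy as jnp

BUCKETS = (128, 256, 512, 1024, 2048)  # allowed padded sequence lengths


def pad_to_bucket(tokens: jnp.ndarray) -> jnp.ndarray:
    """Pad [batch, seq] up to the next bucket so jit sees at most
    len(BUCKETS) distinct shapes instead of one per sequence length."""
    seq_len = tokens.shape[1]
    target = next(b for b in BUCKETS if b >= seq_len)
    return jnp.pad(tokens, ((0, 0), (0, target - seq_len)))


@jax.jit  # compiled once per bucket shape, then served from cache
def forward(tokens: jnp.ndarray) -> jnp.ndarray:
    return tokens.sum(axis=-1)  # stand-in for the real model forward
```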