About
This document contains a collection of recommended readings for computer science students and systems engineers looking to work in the domain of Machine Learning Systems Engineering. Specifically, this knowledge will prepare you to:
- Build and discuss the industry-standard infrastructure behind platforms that train, serve, and evaluate AI models
- Intuitively understand the basic theoretical mechanisms behind model architectures (e.g., transformers, GEMMs, diffusion)
- Work with N-dimensional parallelism techniques for state-of-the-art data and compute scaling (billion-scale datasets, thousands of colocated GPU/TPU accelerators)
As a disclaimer, I haven’t (yet) read every single one of the resources on this list! I’m also going through these myself as I continue to learn how to become a better engineer. This syllabus is WIP and I will continue to add or remove resources in accordance with frontier advancements.
I’ll put a star next to my most highly recommended resources. If you’re short on time or just want the high-level overview of what ML systems work looks like, you can start with these.
There is no prescribed order. I’ve attempted to loosely categorize resources into higher-level categories, but the resources themselves will frequently reference the same concepts in varying situations and depths. I would recommend starting with whichever section seems most interesting or pertinent to you, and jumping around as you discover what knowledge gaps you’re most looking to fill.
Prerequisites
This document is built with the following assumptions about your background. These are by no means a hard requirement, but you may want to seek out learning opportunities from other resources before returning here if the following points don’t describe you well yet!
- You have prior education or experience in computer science fundamentals. (most of these resources assume you can understand arbitrary code snippets, reason through distributed systems design problems, and make connections between the hardware and software worlds.)
- Here are my notes from my CS undergrad to give you an idea of what that looks like in my mind.
- In Berkeley course terms as another frame of reference: CS61A, CS61B, and CS61C or equivalent are a non-negotiable requirement (read+write multiple languages, reason around abstraction + recursion + data structures, understand low-level computer architecture). CS162 (OS) is probably next most important, followed by the other big systems courses: CS168 (networking), CS186 (databases), and CS161 (security) in that order. CS184 (graphics) will be a very helpful foundation for computer vision problems / diffusion models. Finally, CS188 + CS189 (intro AI/ML theory) are helpful but not as useful as they may seem, given how much has changed since the traditional AI/ML intro curriculums of earlier years.
- You have read technical research papers before.
- Even so, most of the ML papers are quite hard to read and will probably take a few passes (and/or a Youtube explainer) to internalize.
- Here’s an Andrew Ng lecture on how to read research papers.
- You are comfortable with multivariable calculus and linear algebra.
- There will be a lot of both!
Out of Scope
This syllabus does NOT focus on resources pertaining to ML research.
- This syllabus will help you build the compute and data infrastructure to train existing models (or models that researchers give you to run), but you won’t be able to develop novel model architectures of your own without further learning.
- I conceptualize ML Systems as a branch-off from traditional computer systems engineering, whereas ML Research feels more like a branch-off from theoretical mathematics. There is a great deal of overlap between the two poles (broadly construed as the general field of “ML Engineering”), and the boundary between them is quite murky.
- Moving from one pole to another effectively means training yourself in a related, but functionally different, field (like going from court interpreter to language historian; or from practicing physician to neuroscientist). There are a handful of people in this world who are good at both, through some substantial combination of academic and industry experience.
AI capability discourse and ethics
Goals:
- Understand the current leading set of predictions on how AI capability will improve and impact us/society over the next few years.
- IMO it’s crucial to start formulating your own set of predictions and values based on the provided evidence. It’ll greatly influence what you’ll aim to prioritize in terms of leveraging LLMs to do work, choosing what to work on and learn, and planning for future career paths in a world where most present-day software engineering tasks may become fully automated.
METR: Task-Completion Time Horizons of Frontier AI Models (and the corresponding blog post, Measuring AI Ability to Complete Long Tasks)
- Summary: We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has been consistently exponentially increasing over the past 6 years, with a doubling time of around 7 months. Extrapolating this trend predicts that, in under a decade, we will see AI agents that can independently complete a large fraction of software tasks that currently take humans days or weeks.
- Takeaways: the METR time horizon is quickly becoming the most widely-accepted measure of AI progress. This introductory post will give you the context necessary to quantitatively track and understand AI progress in the coming months/years.
- Related: Some other benchmarks and performance evaluations to track alongside METR’s work include Humanity’s Last Exam, ARC-AGI, and SWE-Bench. There are also some fun ones like Vending-Bench.
AI 2027
- Summary: We predict that the impact of superhuman AI over the next decade will be enormous, exceeding that of the Industrial Revolution. We wrote a scenario that represents our best guess about what that might look like. It’s informed by trend extrapolations, wargames, expert feedback, experience at OpenAI, and previous forecasting successes.
- Takeaways: Probably healthiest to treat this as a compelling science-fiction essay. But it’s written by some of the leading AI safety experts, and has nailed the predicted timeline up to today. So it’s a good frame of reference w.r.t. understanding how researchers are extrapolating the AI trajectory past what we can definitively predict (i.e., in the years-to-decades time frame).
- Related: Dario Amodei’s (CEO of Anthropic) essay Machines of Loving Grace touches upon the same theme. And, if you enjoyed that, you can also read the sequel The Adolescence of Technology.
Claude Opus 4.5’s Soul Document
- Summary: As far as I understand, this is a document used for Claude’s character training, compressed into Claude’s weights. The full document can be found under the “Anthropic Guidelines” heading at the end.
- Takeaways: Some of the values that Anthropic instills into Claude.
- Related: A version of the Soul Document has since been officially published as Claude’s Constitution.
Introductory courses
Goals:
- Get a high-level overview of what ML Systems looks like at the depth of a standard undergraduate level survey course.
- Collect several definitions of what “ML Systems” entails to practicing educators, and start to formulate a definition of your own.
3Blue1Brown Neural Networks playlist
- Summary: Neural networks, gradient descent, backpropagation, large language models, transformers, attention, diffusion models.
- Takeaways: Gain a visual intuitive understanding of basic ML concepts.
- Related: need a linear algebra refresher? Watch the 3b1b linear algebra playlist.
Stanford CS329S: Machine Learning Systems Design
- Summary: This course aims to provide an iterative framework for developing real-world machine learning systems that are deployable, reliable, and scalable. It starts by considering all stakeholders of each machine learning project and their objectives. Different objectives require different design choices, and this course will discuss the tradeoffs of those choices. Students will learn about data management, data engineering, feature engineering, approaches to model selection, training, scaling, how to continually monitor and deploy changes to ML systems, as well as the human side of ML projects such as team structure and business metrics. In the process, students will learn about important issues including privacy, fairness, and security.
- Takeaways: Really great comprehensive notes, especially focused on data engineering and real-world examples/frameworks.
- Related: This class got turned into a textbook. It costs money to buy, which is maybe worth it? But the notes are probably good enough.
CMU 15-442: Machine Learning Systems
- Summary: The goal of this course is to provide students an understanding and overview of elements in modern machine learning systems. Throughout the course, the students will learn about the design rationale behind the state-of-the-art machine learning frameworks and advanced system techniques to scale, reduce memory, and offload heterogeneous compute resources. For this semester, we will also run case studies on modern large language model (LLM) training and serving systems used in practice today. This course offers the necessary background for students who would like to pursue research in the area of machine learning systems or continue to take a job in machine learning engineering.
- Takeaways: One more example of the breadth of topics that ML Systems entails.
- Related: The CS249r textbook has a more complete curriculum with labs and exercises. This resource seems extremely AI-generated so perhaps proceed with a small dose of skepticism?
Hardware
Goals:
- Get a basic understanding of how GPUs work and how they translate ML operations into basic matrix multiplications and vice versa.
Matrix Multiplication on Blackwell
- Summary: In Part 1 (this blog post) we cover what a Matrix Multiplication (matmul) is, its importance for LLMs, and why we need to optimize it. Then we explain what a GPU is, GPU history since Ampere, and finally how to write a simple (not super performant) implementation of matmul on a GPU in 4 lines of Mojo.
- Related: Chapter 2 of How to Scale Your Model covers the TPU version of this.
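The core idea behind the post can be sketched framework-free: a matmul is three nested loops, and nearly all GPU performance work amounts to reordering and tiling those loops for the memory hierarchy. A minimal Python sketch (function and variable names are mine, not from the post):

```python
def matmul(A, B):
    """Naive O(M*N*K) matrix multiply: C[m][n] = sum_k A[m][k] * B[k][n]."""
    M, K = len(A), len(A[0])
    K2, N = len(B), len(B[0])
    assert K == K2, "inner dimensions must match"
    C = [[0.0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            for k in range(K):
                C[m][n] += A[m][k] * B[k][n]
    return C

A = [[1.0, 2.0],
     [3.0, 4.0]]
B = [[5.0, 6.0],
     [7.0, 8.0]]
print(matmul(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```

A GPU kernel performs the same arithmetic but tiles these loops so each block of threads reuses sub-matrices held in fast shared memory; that data reuse, not the arithmetic itself, is what the rest of the series optimizes.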
Software
Goals:
- Get a basic understanding of the ML infra stack, from Kubernetes to PyTorch.
- See a real example of the infrastructure stack used by Cursor to train Composer 2. (They mostly post-train Kimi, so there’s a lot of emphasis on RL techniques here, which is useful to see.)
- Train a basic model with PyTorch.
- Related:
- Train a classifier instead: https://docs.pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html.
- If you have a LOT of free time, you can also look at JAX: https://docs.jax.dev/en/latest/
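If you want to see what a PyTorch training loop is doing under the hood, here is a dependency-free sketch of SGD on a scalar linear model, with hand-derived gradients standing in for autograd (all names and hyperparameters are mine, for illustration only):

```python
# Fit y = w*x + b by minimizing mean squared error with plain SGD.
# PyTorch's loss.backward() + optimizer.step() automate exactly these
# two stages: gradient computation and the parameter update.
def sgd_fit(xs, ys, lr=0.05, epochs=200):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # forward pass folded into the gradients of MSE w.r.t. w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # optimizer step
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated by y = 2x + 1
w, b = sgd_fit(xs, ys)
print(round(w, 2), round(b, 2))  # close to 2.0 and 1.0
```

Everything the PyTorch tutorials add on top (tensors, autograd, `nn.Module`, `DataLoader`) exists to scale this loop to millions of parameters and batched data.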
Getting Started with DeviceMesh
- Takeaways: Understand the details of DeviceMesh, ProcessGroup, FSDP+HSDP, and other wrappers for distributed communication.
Ray: A Distributed Framework for Emerging AI Applications
- Summary: The next generation of AI applications will continuously interact with the environment and learn from these interactions. These applications impose new and demanding systems requirements, both in terms of performance and flexibility. In this paper, we consider these requirements and present Ray—a distributed system to address them. Ray implements a unified interface that can express both task-parallel and actor-based computations, supported by a single dynamic execution engine. To meet the performance requirements, Ray employs a distributed scheduler and a distributed and fault-tolerant store to manage the system’s control state. In our experiments, we demonstrate scaling beyond 1.8 million tasks per second and better performance than existing specialized systems for several challenging reinforcement learning applications.
- Takeaways: Understand the heterogeneity problem + how Ray proposes to solve temporal + resource heterogeneity.
- Related: kuberay docs for how this is deployed IRL on a k8s cluster, or spelunk around the ray docs in general.
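Ray’s two primitives map onto familiar concepts: a task is a stateless remote function call, and an actor is a stateful remote object. A standard-library analogy (this is not Ray code; in Ray, `@ray.remote` and `.remote()`/`ray.get()` would replace the thread pool and plain class below):

```python
from concurrent.futures import ThreadPoolExecutor

# Task-parallel: stateless functions, so any worker can run any call.
def square(x):
    return x * x

# Actor-based: state lives with the object across calls.
class Counter:
    def __init__(self):
        self.total = 0

    def add(self, x):
        self.total += x
        return self.total

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(square, i) for i in range(5)]  # like square.remote(i)
    print([f.result() for f in futures])                  # like ray.get(futures)

counter = Counter()  # in Ray this would be Counter.remote()
for i in range(5):
    counter.add(i)
print(counter.total)
```

The paper’s contribution is making both primitives work across a cluster with a distributed scheduler and fault-tolerant state, rather than within one process as here.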
Training
Goals:
- Understand the problems and solutions in ML training, especially pre-training, where current methods require thousands of co-located GPUs with workload MTBFs measured in minutes to hours.
The Ultra-Scale Playbook: Training LLMs on GPU Clusters
- Summary: Thousands of GPUs humming in perfect harmony. That’s what it takes to train today’s most powerful AI models – a symphony of computing power that until recently was the exclusive domain of elite research labs. Open source has transformed this landscape, but not completely. Yes, you can download the latest Llama or DeepSeek models. Yes, you can read their technical and experiment reports. But the most challenging part – the training code, the knowledge and techniques necessary to coordinate GPUs to train these massive systems – remains shrouded in complexity and spread around in a series of disconnected papers and often private codebases. This open source book is here to change that. Starting from the basics, we’ll walk you through the knowledge necessary to scale the training of large language models (LLMs) from one GPU to tens, hundreds, and even thousands of GPUs, illustrating theory with practical code examples and reproducible benchmarks.
- Takeaways: The most comprehensive and actionable guide to ML training infrastructure I’m aware of so far. Will probably take a few days to read and internalize fully, but it’ll be well worth it. If there’s only one thing on this list to read, this one’s it. (The appendix is quite important imo and should not be skipped.)
- Related:
- How to Train Really Large Models on Many GPUs? is a more concise version.
- How to Scale Your Model is the TPU version. It also comes with a few architecture-agnostic chapters, plus some useful interactive exercises, so it’s worth a read if you have extra time.
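The playbook’s starting point, data parallelism, rests on one identity: the gradient of a mean loss over a batch equals the mean of the gradients over its shards, so each GPU can process a slice of the batch and the results can be averaged (via AllReduce). A toy check with a hand-derived gradient of L = mean((w*x − y)²) (names are mine):

```python
def grad_w(w, batch):
    """d/dw of mean squared error over a batch of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.5
batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 9.0)]

# One worker over the full batch...
full = grad_w(w, batch)

# ...equals the average of two workers over equal-sized shards.
shard_grads = [grad_w(w, batch[:2]), grad_w(w, batch[2:])]
averaged = sum(shard_grads) / len(shard_grads)

print(full, averaged)  # identical
```

Note the identity only holds exactly for equal-sized shards and a mean-reduced loss; that is why real frameworks are careful about uneven last batches.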
Networking and Communication
Goals:
- Understand how NCCL works + why we need it. (What is the software layer that enables GPUs to talk to each other in a way that’s useful for model training?)
Demystifying NCCL
- Summary: In this paper, we present a thorough and systematic exploration of NCCL’s internal architecture. Our analysis specifically targets four primary aspects of NCCL’s implementation:
- (1) a general overview, including API structure and communication channel management;
- (2) a detailed examination of communication protocols (Simple, LL, LL128);
- (3) an analysis of its data-transfer models; and
- (4) comprehensive analysis of its collective communication algorithms.
- Takeaways: A few things to understand—
- Simple vs LL vs LL128
- Ring (ReduceScatter and AllGather) for bandwidth-sensitive vs double tree for latency-sensitive operations
- AllReduce = ReduceScatter + AllGather
- Related: my notes, and a Medium article, for slightly more digestible versions of this paper.
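The AllReduce = ReduceScatter + AllGather identity can be checked in a few lines: ReduceScatter leaves each rank with one fully reduced chunk, and AllGather then shares the chunks so every rank holds the full sum. A single-process simulation with ranks as list indices (this ignores the ring scheduling NCCL uses to pipeline the transfers):

```python
def reduce_scatter(buffers):
    """Each rank i ends up with the elementwise sum of chunk i across all ranks."""
    n = len(buffers)
    return [sum(buf[i] for buf in buffers) for i in range(n)]

def all_gather(chunks):
    """Every rank receives the full list of reduced chunks."""
    return [list(chunks) for _ in chunks]

# 4 ranks, each holding a 4-element buffer (one chunk per rank).
buffers = [[r + i for i in range(4)] for r in range(4)]  # rank r holds [r, r+1, r+2, r+3]

reduced_chunks = reduce_scatter(buffers)  # chunk i = sum of element i across ranks
allreduced = all_gather(reduced_chunks)   # every rank now has the full reduced buffer

expected = [sum(buf[i] for buf in buffers) for i in range(4)]
print(reduced_chunks)             # [6, 10, 14, 18]
print(allreduced[0] == expected)  # True, and the same on every rank
```

On a real ring, each phase takes N−1 steps of chunk-sized sends, which is why the combined operation is bandwidth-optimal for large buffers.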
Transformers
In order of conceptual depth:
Watch the 3b1b series on attention + transformers (linked above, and also directly here for transformers and here for attention).
The Illustrated Transformer by Jay Alammar
The Transformer Family Version 2.0 by Lilian Weng
Attention is All You Need, the foundational paper of modern model architectures.
- Less talked about, but almost as important, is its predecessor, Neural Machine Translation by Jointly Learning to Align and Translate. (attention wasn’t yet all you needed; just something you probably wanted to use!)
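All of the resources above build to the same equation, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, which fits in a few lines of NumPy (single head, no masking or batching; shapes in the example are arbitrary):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in the paper."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (seq_q, seq_k) similarity logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # numerically stable softmax; rows sum to 1
    return weights @ V  # each output is a weighted average of the value vectors

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))  # 3 query positions, d_k = 4
K = rng.standard_normal((5, 4))  # 5 key positions
V = rng.standard_normal((5, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4): one mixed value vector per query position
```

Multi-head attention just runs this in parallel over h projected subspaces and concatenates; the √d_k scaling keeps the logits from saturating the softmax as dimensionality grows.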
Diffusion Models
Goals:
- Gain a basic intuitive understanding of diffusion models, plus a high-level appreciation for the math.
- Understand how the process of diffusion model training compares to LLM/autoregressive model training.
What are Diffusion Models? by Lilian Weng
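The central object in Weng’s post, the forward (noising) process, has a one-line closed form: x_t = √(ᾱ_t)·x_0 + √(1−ᾱ_t)·ε, and training just asks a network to predict ε from x_t and t. A sketch of a noise schedule and one noising step (the linear schedule constants here are illustrative choices, not prescribed by the post):

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]  # linear beta schedule

alpha_bar = []  # cumulative product of (1 - beta_t); decays from ~1 toward ~0
prod = 1.0
for beta in betas:
    prod *= 1.0 - beta
    alpha_bar.append(prod)

def q_sample(x0, t, eps):
    """Sample x_t ~ q(x_t | x_0) in closed form, without iterating t steps."""
    return math.sqrt(alpha_bar[t]) * x0 + math.sqrt(1 - alpha_bar[t]) * eps

random.seed(0)
x0 = 1.0
print(q_sample(x0, 0, random.gauss(0, 1)))      # near x0: barely noised
print(q_sample(x0, T - 1, random.gauss(0, 1)))  # near pure noise: alpha_bar ~ 0
```

This closed form is what makes diffusion training look so much like supervised learning: sample a random t, noise x_0 in one shot, and regress the network’s output against ε, with no sequential unrolling as in autoregressive decoding.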
Diffusion Models Beat GANs on Image Synthesis: https://arxiv.org/pdf/2105.05233
- The ‘Attention is all you need’ moment for diffusion.
Classifier-free diffusion guidance: https://arxiv.org/pdf/2207.12598
- The important part is that you know CFG exists, and that it works (so you’re not surprised when you see inference forking off a weird subprocess to generate the same image with masked prompt tokens…)
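What that forked subprocess is computing: the model runs twice per denoising step, once with the prompt and once with it masked out, and the two noise predictions are blended with a guidance weight. Below is one common parameterization (as used in many inference codebases), where s = 1 recovers the purely conditional prediction; the paper’s w differs from this s by an offset. The numbers are dummies:

```python
def cfg_combine(eps_uncond, eps_cond, s):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward (and past, for s > 1) the conditional one."""
    return [u + s * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.1, -0.2, 0.0]  # prediction with prompt tokens masked
eps_cond = [0.3, 0.1, -0.1]    # prediction with the real prompt

print(cfg_combine(eps_uncond, eps_cond, 1.0))  # s=1 recovers the conditional prediction
print(cfg_combine(eps_uncond, eps_cond, 7.5))  # s>1 over-emphasizes the prompt
```

The doubled forward pass is why CFG roughly doubles inference cost per step, and why serving stacks often batch the conditional and unconditional passes together.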