Samuel Williams, Andrew Waterman, David Patterson. Roofline: An Insightful Visual Performance Model for Multicore Architectures. Communications of the ACM, 2009.
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022.
Tri Dao. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. 2023.
5. Oct 24 (Week 6): Modern RNNs
Variants of Attention (GQA, MLA, SWA, CLA)
From Linear RNNs to Mamba
From Linear Attention to DeltaNet
How GPUs Perform Matrix Multiplication (Tensor Core & Warps)
Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. EMNLP 2023.
DeepSeek-AI. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. 2024.
Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis. Efficient Streaming Language Models with Attention Sinks. ICLR 2024.
OpenAI. gpt-oss-120b & gpt-oss-20b Model Card. 2025.
William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan-Kelley. Reducing Transformer Key-Value Cache Size with Cross-Layer Attention. NeurIPS 2024.
Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei. You Only Cache Once: Decoder-Decoder Architectures for Language Models. NeurIPS 2024.
Eric Martin, Chris Cundy. Parallelizing Linear Recurrent Neural Nets Over Sequence Length. ICLR 2018.
Albert Gu, Tri Dao. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. COLM 2024.
Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, Christopher Ré. Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers. NeurIPS 2021.
Albert Gu, Karan Goel, Christopher Ré. Efficiently Modeling Long Sequences with Structured State Spaces. ICLR 2022.
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. ICML 2020.
Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, Yoon Kim. Gated Linear Attention Transformers with Hardware-Efficient Training. ICML 2024.
Simran Arora, Sabri Eyuboglu, Aman Timalsina, Isys Johnson, Michael Poli, James Zou, Atri Rudra, Christopher Ré. Zoology: Measuring and Improving Recall in Efficient Language Models. ICLR 2024.
Imanol Schlag, Kazuki Irie, Jürgen Schmidhuber. Linear Transformers Are Secretly Fast Weight Programmers. ICML 2021.
Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, Yoon Kim. Parallelizing Linear Transformers with the Delta Rule over Sequence Length. NeurIPS 2024.
Songlin Yang, Jan Kautz, Ali Hatamizadeh. Gated Delta Networks: Improving Mamba2 with Delta Rule. ICLR 2025.
6. Oct 31 (Week 7): Optimization 1
Neural Scaling Laws
How to Understand the Learning Rate
Learning Rate Schedule
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei. Scaling Laws for Neural Language Models. 2020.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, et al. Training Compute-Optimal Large Language Models. NeurIPS 2022.
Tomer Porian, Mitchell Wortsman, Jenia Jitsev, Ludwig Schmidt, Yair Carmon. Resolving Discrepancies in Compute-Optimal Scaling of Language Models. NeurIPS 2024.
Tim Pearce, Jinyeop Song. Reconciling Kaplan and Chinchilla Scaling Laws. TMLR 2024.
Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, Xinrong Zhang, Zheng Leng Thai, Kaihuo Zhang, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, Dahai Li, Zhiyuan Liu, Maosong Sun. MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies. COLM 2024.
Kairong Luo, Haodong Wen, Shengding Hu, Zhenbo Sun, Zhiyuan Liu, Maosong Sun, Kaifeng Lyu, Wenguang Chen. A Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules. ICLR 2025.
7. Nov 7 (Week 8): Optimization 2
How Gradient Descent Escapes Saddle Points
River Valley Loss Landscape
Gradient Noise Scale and Large-Batch Training
SDE Approximation of the Training Process
Jason D. Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I. Jordan, Benjamin Recht. First-order Methods Almost Always Avoid Saddle Points. 2017.
Rong Ge, Furong Huang, Chi Jin, Yang Yuan. Escaping From Saddle Points: Online Stochastic Gradient for Tensor Decomposition. JMLR 2015.
Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry Vetrov, Andrew Gordon Wilson. Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs. NeurIPS 2018.
Rong Ge. Explaining Landscape Connectivity of Low-cost Solutions for Multilayer Nets. Talk at the Simons Institute, 2019.
Ivan Skorokhodov, Mikhail Burtsev. Loss Landscape Sightseeing with Multi-Point Optimization. NeurIPS 2019.
Kaiyue Wen, Zhiyuan Li, Jason Wang, David Hall, Percy Liang, Tengyu Ma. Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective. ICLR 2025.
Elad Hoffer, Itay Hubara, Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. NeurIPS 2017.
Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. 2017.
Samuel L. Smith, Erich Elsen, Soham De. On the Generalization Benefit of Noise in Stochastic Gradient Descent. ICML 2020.
Qianxiao Li, Cheng Tai, Weinan E. Stochastic Modified Equations and Dynamics of Stochastic Gradient Algorithms I: Mathematical Foundations. JMLR 2019.
Zhiyuan Li, Sadhika Malladi, Sanjeev Arora. On the Validity of Modeling SGD with Stochastic Differential Equations (SDEs). NeurIPS 2021.
Sam McCandlish, Jared Kaplan, Dario Amodei, OpenAI Dota Team. An Empirical Model of Large-Batch Training. 2018.
William Merrill, Shane Arora, Dirk Groeneveld, Hannaneh Hajishirzi. Critical Batch Size Revisited: A Simple Empirical Approach to Large-Batch Language Model Training. 2025.
Sadhika Malladi, Kaifeng Lyu, Abhishek Panigrahi, Sanjeev Arora. On the SDEs and Scaling Rules for Adaptive Gradient Algorithms. NeurIPS 2022.
8. Nov 14 (Week 9): Model Scaling 1
Standard Parameterization of Neural Networks
NTK and NTK Parameterization
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV 2015.
Arthur Jacot, Franck Gabriel, Clément Hongler. Neural Tangent Kernel: Convergence and Generalization in Neural Networks. NeurIPS 2018.
Simon S. Du, Xiyu Zhai, Barnabás Póczos, Aarti Singh. Gradient Descent Provably Optimizes Over-parameterized Neural Networks. ICLR 2019.
Zeyuan Allen-Zhu, Yuanzhi Li, Zhao Song. A Convergence Theory for Deep Learning via Over-Parameterization. ICML 2019.
Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, Ruosong Wang. On Exact Computation with an Infinitely Wide Neural Net. NeurIPS 2019.
Colin Wei, Jason D. Lee, Qiang Liu, Tengyu Ma. Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel. NeurIPS 2019.
9. Nov 21 (Week 10): Model Scaling 2
Maximal Update Parameterization
Tensor Program
Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, Jianfeng Gao. Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer. NeurIPS 2021.
Greg Yang, Edward J. Hu. Tensor Programs IV: Feature Learning in Infinite-Width Neural Networks. ICML 2021.
Lucas Lingle. An Empirical Study of μP Learning Rate Transfer. 2024.
Katie Everett, Lechao Xiao, Mitchell Wortsman, Alexander A. Alemi, Roman Novak, Peter J. Liu, Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Pack Kaelbling, Jaehoon Lee, Jeffrey Pennington. Scaling Exponents Across Parameterizations and Optimizers. ICML 2024.
Blake Bordelon, Lorenzo Noci, Mufan Bill Li, Boris Hanin, Cengiz Pehlevan. Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit. ICLR 2024.
Nolan Dey, Bin Claire Zhang, Lorenzo Noci, Mufan Li, Blake Bordelon, Shane Bergsma, Cengiz Pehlevan, Boris Hanin, Joel Hestness. Don't be lazy: CompleteP enables compute-efficient deep transformers. NeurIPS 2025.
10. Nov 28 (Week 11): Optimizers
AdaGrad, RMSProp, Adam and AdamW
Adam Can Fail to Converge
Why Transformers Need Adam
Muon Optimizer
Rosie Zhao, Depen Morwani, David Brandfonbrener, Nikhil Vyas, Sham Kakade. Deconstructing What Makes a Good Optimizer for Autoregressive Language Models. ICLR 2025.
John Duchi, Elad Hazan, Yoram Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMLR 2011.
Diederik P. Kingma, Jimmy Lei Ba. Adam: A Method for Stochastic Optimization. ICLR 2015.
Ilya Loshchilov, Frank Hutter. Decoupled Weight Decay Regularization. ICLR 2019.
Sashank J. Reddi, Satyen Kale, Sanjiv Kumar. On the Convergence of Adam and Beyond. ICLR 2018.
Yushun Zhang, Congliang Chen, Naichen Shi, Ruoyu Sun, Zhi-Quan Luo. Adam Can Converge Without Any Modification On Update Rules. NeurIPS 2022.
Tianyue H. Zhang, Lucas Maes, Alan Milligan, Alexia Jolicoeur-Martineau, Ioannis Mitliagkas, Damien Scieur, Simon Lacoste-Julien, Charles Guille-Escuret. Understanding Adam Requires Better Rotation Dependent Assumptions. NeurIPS 2025.
Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, Zhi-Quan Luo. Why Transformers Need Adam: A Hessian Perspective. NeurIPS 2024.
Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Diederik P. Kingma, Yinyu Ye, Zhi-Quan Luo, Ruoyu Sun. Adam-mini: Use Fewer Learning Rates To Gain More. ICLR 2025.
Kyle Lo, Akshita Bhagia, Nathan Lambert. Opening the Language Model Pipeline: A Tutorial on Data Preparation, Model Training, and Adaptation. NeurIPS 2024.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR 2020.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei. Language Models are Few-Shot Learners. 2020.
Guilherme Penedo, Quentin Malartic, Daniel Hesslow, et al. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only. NeurIPS 2023, Datasets and Benchmarks Track.
Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, Thomas Wolf. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. NeurIPS 2024, Datasets and Benchmarks Track.
Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Josh Gardner, Maciej Kilian, Hanlin Zhang, Rulin Shao, Sarah Pratt, Sunny Sanyal, Gabriel Ilharco, Giannis Daras, Kalyani Marathe, Aaron Gokaslan, Jieyu Zhang, Khyathi Chandu, Thao Nguyen, Igor Vasiljevic, Sham Kakade, Shuran Song, Sujay Sanghavi, Fartash Faghri, Sewoong Oh, Luke Zettlemoyer, Kyle Lo, Alaaeldin El-Nouby, Hadi Pouransari, Alexander Toshev, Stephanie Wang, Dirk Groeneveld, Luca Soldaini, Pang Wei Koh, Jenia Jitsev, Thomas Kollar, Alexandros G. Dimakis, Yair Carmon, Achal Dave, Ludwig Schmidt, Vaishaal Shankar. DataComp-LM: In search of the next generation of training sets for language models. NeurIPS 2024, Datasets and Benchmarks Track.
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. 2020.
Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, Min Lin. RegMix: Data Mixture as Regression for Language Model Pre-training. ICLR 2025.
Ce Ge, Zhijian Ma, Daoyuan Chen, Yaliang Li, Bolin Ding. BiMix: A Bivariate Data Mixing Law for Language Model Pretraining. 2025.
Mustafa Shukor, Louis Béthune, Dan Busbridge, David Grangier, Enrico Fini, Alaaeldin El-Nouby, Pierre Ablin. Scaling Laws for Optimal Data Mixtures. NeurIPS 2025.
Jiasheng Ye, Peiju Liu, Tianxiang Sun, Jun Zhan, Yunhua Zhou, Xipeng Qiu. Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance. ICLR 2025.
Xinran Gu, Kaifeng Lyu, Jiazheng Li, Jingzhao Zhang. Data Mixing Can Induce Phase Transitions in Knowledge Acquisition. NeurIPS 2025.
Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, Xinrong Zhang, Zheng Leng Thai, Kaihuo Zhang, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, Dahai Li, Zhiyuan Liu, Maosong Sun. MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies. COLM 2024.
We thank Kaiyue Wen, Songlin Yang, Mingyu Gao, Kaichao You, Yuhang Cai, Zihan Qiu, Xingyu Dang, Zhuorui Ye, Xinran Gu, and Sadhika Malladi for their suggestions and feedback on the course.
Stanford CS336 is our primary inspiration for this course. We are deeply grateful to the CS336 team for open-sourcing their course materials.