Kaifeng Lyu 吕凯风

I am a tenure-track assistant professor at the Institute for Interdisciplinary Information Sciences (IIIS) at Tsinghua University. Before that, I was a postdoctoral research fellow at the Simons Institute at UC Berkeley. I obtained my Ph.D. in Computer Science from Princeton University in 2024, where I was very fortunate to be advised by Prof. Sanjeev Arora. I completed my undergraduate studies at Tsinghua University, receiving a B.Eng. in Computer Science and Technology in 2019. During my time at Tsinghua, I was a student in the Yao Class headed by Prof. Andrew Chi-Chih Yao, and I was fortunate to be advised by Prof. Jian Li.

Research Interests

I am primarily interested in machine learning theory, AI safety/alignment, and optimization.

I believe that a good theory should come from practice and be able to advance practice: it should start from important real-world phenomena or problems, explain or solve them theoretically, and then guide practice in turn. Based on this philosophy, I conduct research combining theory and experiments, aiming to develop solid foundations for modern machine learning and to make AI in the era of large models more efficient, safe, and reliable.

The following are some of the topics I am actively thinking about/working on:

  1. Science of Large-Scale Training: Training large models is extremely complex, but are there universal laws governing it? How can we make the training process more predictable? Today's large language models will not be the last large models humans train; are there universal laws that carry over to the next generation of large-model training?
  2. Principles in Data-Centric ML: Progress in large-model capabilities comes from two sources: scaling up training and improving data quality. Distillation is easy, and stacking engineering tricks will reliably yield performance gains. But beyond that, can we identify basic principles that apply broadly to help us better select, mix, and even generate data?
  3. Foundations of AI Safety/Alignment: I am also interested in AI safety and alignment. Machine learning typically optimizes a model's performance in the "average case", but AI safety issues surface in extreme cases. What is the fundamental reason models fail in these extreme cases? What are the limitations of current AI alignment methods, and which safety issues cannot be completely avoided? In the long run, can we find a systematic approach, analogous to cryptography, that solves a large class of AI safety problems once and for all?

Please see the Chinese version for more details, especially if you are looking for PhD recruitment information! [Chinese]

Conference Papers

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining
  • Kairong Luo
  • Zhenbo Sun
  • Haodong Wen
  • Xinyu Shi
  • Jiarui Cui
  • Chenyi Dang
  • Kaifeng Lyu
  • Wenguang Chen
Oral Presentation (Top 1.2%).
Larger Datasets Can Be Repeated More: A Theoretical Analysis of Multi-Epoch Scaling in Linear Regression
  • Tingkai Yan*
  • Haodong Wen*
  • Binghui Li*
  • Kairong Luo
  • Wenguang Chen
  • Kaifeng Lyu
Also presented as an Oral Presentation at the OPT Workshop, NeurIPS 2025.
Can Small Training Runs Reliably Guide Data Curation? Rethinking Proxy-Model Practice
  • Jiachen T. Wang
  • Tong Wu
  • Kaifeng Lyu
  • James Zou
  • Dawn Song
  • Ruoxi Jia
  • Prateek Mittal
AISTATS 2025
Shift is Good: Mismatched Data Mixing Improves Test Performance
  • Marko Medvedev*
  • Kaifeng Lyu*
  • Zhiyuan Li
  • Nathan Srebro
Data Mixing Can Induce Phase Transitions in Knowledge Acquisition
  • Xinran Gu*
  • Kaifeng Lyu*
  • Jiazheng Li
  • Jingzhao Zhang
Spotlight Presentation (Top 3.5%). Oral Presentation at the DATA-FM Workshop, ICLR 2025.
Adam Reduces a Unique Form of Sharpness: Theoretical Insights Near the Minimizer Manifold
  • Xinghan Li*
  • Haodong Wen*
  • Kaifeng Lyu
How Far Are We from Optimal Reasoning Efficiency?
  • Jiaxuan Gao
  • Shu Yan
  • Qixin Tan
  • Lu Yang
  • Shusheng Xu
  • Wei Fu
  • Zhiyu Mei
  • Kaifeng Lyu
  • Yi Wu
Weak-to-Strong Generalization Even in Random Feature Networks, Provably
  • Marko Medvedev*
  • Kaifeng Lyu*
  • Dingli Yu
  • Sanjeev Arora
  • Zhiyuan Li
  • Nathan Srebro
A Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules
  • Kairong Luo
  • Haodong Wen
  • Shengding Hu
  • Zhenbo Sun
  • Zhiyuan Liu
  • Maosong Sun
  • Kaifeng Lyu
  • Wenguang Chen
RNNs are not Transformers (Yet): The Key Bottleneck on In-context Retrieval
  • Kaiyue Wen*
  • Xingyu Dang*
  • Kaifeng Lyu
Safety Alignment Should Be Made More Than Just a Few Tokens Deep
  • Xiangyu Qi
  • Ashwinee Panda
  • Kaifeng Lyu
  • Xiao Ma
  • Subhrajit Roy
  • Ahmad Beirami
  • Prateek Mittal
  • Peter Henderson
Oral Presentation (Top 1.8%). Outstanding Paper Award (Top 3/3827=0.08%).
Feature Averaging: An Implicit Bias of Gradient Descent Leading to Non-Robustness in Neural Networks
  • Binghui Li*
  • Zhixuan Pan*
  • Kaifeng Lyu
  • Jian Li
Efficient Stagewise Pretraining via Progressive Subnetworks
  • Abhishek Panigrahi*
  • Nikunj Saunshi*
  • Kaifeng Lyu
  • Sobhan Miryoosefi
  • Sashank Reddi
  • Satyen Kale
  • Sanjiv Kumar
Towards Understanding Text Hallucination of Diffusion Models via Local Generation Bias
  • Rui Lu*
  • Runzhe Wang*
  • Kaifeng Lyu
  • Xitai Jiang
  • Gao Huang
  • Mengdi Wang
Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates
  • Kaifeng Lyu*
  • Haoyu Zhao*
  • Xinran Gu*
  • Dingli Yu
  • Anirudh Goyal
  • Sanjeev Arora
A Quadratic Synchronization Rule for Distributed Deep Learning
  • Xinran Gu*
  • Kaifeng Lyu*
  • Sanjeev Arora
  • Jingzhao Zhang
  • Longbo Huang
Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking
  • Kaifeng Lyu*
  • Jikai Jin*
  • Zhiyuan Li
  • Simon S. Du
  • Jason D. Lee
  • Wei Hu
DistillSpec: Improving Speculative Decoding via Knowledge Distillation
  • Yongchao Zhou
  • Kaifeng Lyu
  • Ankit Singh Rawat
  • Aditya Krishna Menon
  • Afshin Rostamizadeh
  • Sanjiv Kumar
  • Jean-François Kagy
  • Rishabh Agarwal
The Marginal Value of Momentum for Small Learning Rate SGD
  • Runzhe Wang
  • Sadhika Malladi
  • Tianhao Wang
  • Kaifeng Lyu
  • Zhiyuan Li
Understanding incremental learning of gradient descent: A fine-grained analysis of matrix sensing
  • Jikai Jin
  • Zhiyuan Li
  • Kaifeng Lyu
  • Simon S. Du
  • Jason D. Lee
Why (and When) does Local SGD Generalize Better than SGD?
  • Xinran Gu*
  • Kaifeng Lyu*
  • Longbo Huang
  • Sanjeev Arora
Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction
  • Kaifeng Lyu
  • Zhiyuan Li
  • Sanjeev Arora
On the SDEs and Scaling Rules for Adaptive Gradient Algorithms
  • Sadhika Malladi*
  • Kaifeng Lyu*
  • Abhishek Panigrahi
  • Sanjeev Arora
New Definitions and Evaluations for Saliency Methods: Staying Intrinsic, Complete and Sound
  • Arushi Gupta*
  • Nikunj Saunshi*
  • Dingli Yu*
  • Kaifeng Lyu
  • Sanjeev Arora
Oral Presentation (Top 1.9%).
Gradient Descent on Two-layer Nets: Margin Maximization and Simplicity Bias
  • Kaifeng Lyu*
  • Zhiyuan Li*
  • Runzhe Wang*
  • Sanjeev Arora
Towards Resolving the Implicit Bias of Gradient Descent for Matrix Factorization: Greedy Low-Rank Learning
  • Zhiyuan Li
  • Yuping Luo
  • Kaifeng Lyu
(alphabetical order)
Reconciling Modern Deep Learning with Traditional Optimization Analyses: The Intrinsic Learning Rate
  • Zhiyuan Li*
  • Kaifeng Lyu*
  • Sanjeev Arora
Gradient Descent Maximizes the Margin of Homogeneous Neural Networks
  • Kaifeng Lyu
  • Jian Li
Oral Presentation (Top 1.9%).
Theoretical Analysis of Auto Rate-Tuning by Batch Normalization
  • Sanjeev Arora
  • Zhiyuan Li
  • Kaifeng Lyu
(alphabetical order)
Fine-grained complexity meets IP = PSPACE
  • Lijie Chen
  • Shafi Goldwasser
  • Kaifeng Lyu
  • Guy N Rothblum
  • Aviad Rubinstein
(alphabetical order)
Single-Source Bottleneck Path Algorithm Faster than Sorting for Sparse Graphs
  • Ran Duan
  • Kaifeng Lyu
  • Hongxun Wu
  • Yuanhang Xie
(alphabetical order)
Learning gradient descent: Better generalization and longer horizons
  • Kaifeng Lv*
  • Shunhua Jiang*
  • Jian Li
(Authors are listed in contribution order by default; * denotes equal contribution.)
Students

PhD Students:
  • Haodong Wen (2025–present)
  • Kexian Tang (2025–present)
  • Huaijie Wang (2023–present, joined our group in 2026)
  • Jinhan Li (incoming)
  • Tingkai Yan (incoming)
PhD Students in Close Research Collaboration:
  • Kairong Luo (2024–present, advised by Prof. Wenguang Chen)
  • Haofeng Huang (incoming, advised by Prof. Andrew Yao)
Master's Student:
  • Rui Chen (2025–present, co-advised by Prof. Shuran Zheng)
Alumni / Graduating Soon:
  • Xinghan Li (Undergraduate Class of 2026, joining the University of Washington as a PhD student)
  • Yiran Zhang (Undergraduate Class of 2026, joining UC Berkeley as a PhD student)
  • Xingyu Dang (Undergraduate Class of 2025, now PhD student at Princeton)
  • Kaiyue Wen (Undergraduate Class of 2024, now PhD student at Stanford)

Courses

  • Spring 2026 (upcoming). Mathematics for Computer Science and Artificial Intelligence, Tsinghua University.
  • Fall 2025. Large Language Models from Scratch: Theory and Practice, Tsinghua University (Top 5% of teaching evaluations).

Professional Services

  • Organizer, NeurIPS 2024 Workshop on Mathematics of Modern Machine Learning (M3L 2024).
  • Organizer, NeurIPS 2023 Workshop on Mathematics of Modern Machine Learning (M3L 2023).
  • Conference Area Chair: NeurIPS (2025), ICLR (2026).
  • Conference Reviewer: ICML (2020–2025), NeurIPS (2020–2023), ICLR (2022–2025), COLT (2020, 2025), AAAI (2020), KDD (2022).
  • Journal Reviewer: TMLR, JMLR, TPAMI, AIJ.
  • Organizer, Yao Class Seminar, Tsinghua University (Fall 2019, Fall 2020, Spring 2021).

Universal Online Judge

  • In 2014, I founded the Universal Online Judge (UOJ), a popular online judge system in China.
  • UOJ can test both traditional and non-traditional programming problems in OI (Olympiad in Informatics). A team of top OI competitors regularly hosts programming contests on UOJ.
  • [Link] [GitHub] [Docs]