
Kaifeng Lyu 吕凯风

I am a final-year Ph.D. student in the Department of Computer Science at Princeton University, where I am very fortunate to be advised by Prof. Sanjeev Arora. I will be joining the Institute for Interdisciplinary Information Sciences (IIIS) at Tsinghua University as a Tenure-Track Assistant Professor in Fall 2025.

I received my B.Eng. in Computer Science and Technology from Tsinghua University in 2019. At Tsinghua, I was a student in the Yao Class, headed by Prof. Andrew Chi-Chih Yao, and I was very fortunate to be advised by Prof. Jian Li.

Research Interests

I am primarily interested in machine learning theory, AI safety/alignment, and optimization.

I believe that a good theory should come from practice and advance practice: it should start from important real-world phenomena or problems, explain or solve them theoretically, and then be applied to guide practice. Based on this philosophy, I conduct research combining theory and experiments, aiming to develop solid foundations for modern machine learning and to make AI in the era of large models more efficient, safe, and reliable.

Below is a list of topics I am actively thinking about and working on:

  1. Training Dynamics of Neural Networks: How do neural networks learn? How do we make the training of large models more efficient? How do training algorithms, model architectures, and training data interact with each other and affect the model's performance?
  2. Modern Paradigms of Generalization in Large Foundation Models: Generalization is the ability of a model to perform well beyond its training set. Classic machine learning theory has focused mostly on generalization in supervised learning, but large foundation models integrate various unsupervised and supervised learning paradigms, giving rise to many new paradigms of generalization. How do we understand these new generalization paradigms? How do we improve algorithms, architectures, and data to enhance the model's capabilities, such as reasoning, retrieval, and in-context learning?
  3. AI Safety/Alignment: Machine learning typically optimizes a model's performance in the "average case". However, AI safety issues can manifest in extreme cases and can become more catastrophic as models grow more intelligent. What new algorithms and paradigms do we need to mitigate or even solve these safety issues?


We are recruiting Ph.D. students to start in Fall 2025, and will admit 1-2 students depending on circumstances.

We hope you have a spirit of inquiry and aspire to build solid theoretical foundations for modern machine learning methods: when you observe the many mysterious phenomena in deep learning, you are curious about them and willing to spend time studying and investigating them. We also hope you have an excellent academic record or have completed high-quality research projects.

This endeavor calls for versatile researchers. For complex systems like neural networks, doing good theory requires not only skillful use of mathematical tools but also "rolling up your sleeves" to run experiments and observe real phenomena, just as Kepler's discovery of his three laws required not only mathematics but also distilling regularities from Tycho's stellar observations. On the other hand, looking at experiments alone without distilling the underlying regularities makes it hard to reach the essence of machine learning.

We welcome you if you have one of the following backgrounds:

  1. A solid mathematical foundation and a love of mathematics, a basic understanding of deep learning, and a willingness to conduct theoretical research during your Ph.D., supplemented by experiments;
  2. Excellent programming skills, coursework or research experience in deep learning, and a willingness to uncover the essence of deep learning through experiments during your Ph.D., supplemented by theory.

Interested students are welcome to contact me by email! Please attach your CV and transcript. Also, please keep an eye on IIIS's information about the summer camp (last year's website).


Preprints

Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates
Kaifeng Lyu*, Haoyu Zhao*, Xinran Gu*, Dingli Yu, Anirudh Goyal, Sanjeev Arora

RNNs are not Transformers (Yet): The Key Bottleneck on In-context Retrieval
Kaiyue Wen*, Xingyu Dang*, Kaifeng Lyu

Efficient Stagewise Pretraining via Progressive Subnetworks
Abhishek Panigrahi*, Nikunj Saunshi*, Kaifeng Lyu, Sobhan Miryoosefi, Sashank Reddi, Satyen Kale, Sanjiv Kumar

Conference Papers

A Quadratic Synchronization Rule for Distributed Deep Learning
Xinran Gu*, Kaifeng Lyu*, Sanjeev Arora, Jingzhao Zhang, Longbo Huang

Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking
Kaifeng Lyu*, Jikai Jin*, Zhiyuan Li, Simon S. Du, Jason D. Lee, Wei Hu

DistillSpec: Improving Speculative Decoding via Knowledge Distillation
Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean-François Kagy, Rishabh Agarwal

The Marginal Value of Momentum for Small Learning Rate SGD
Runzhe Wang, Sadhika Malladi, Tianhao Wang, Kaifeng Lyu, Zhiyuan Li

Understanding Incremental Learning of Gradient Descent: A Fine-grained Analysis of Matrix Sensing
Jikai Jin, Zhiyuan Li, Kaifeng Lyu, Simon S. Du, Jason D. Lee

Why (and When) does Local SGD Generalize Better than SGD?
Xinran Gu*, Kaifeng Lyu*, Longbo Huang, Sanjeev Arora

Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction
Kaifeng Lyu, Zhiyuan Li, Sanjeev Arora

On the SDEs and Scaling Rules for Adaptive Gradient Algorithms
Sadhika Malladi*, Kaifeng Lyu*, Abhishek Panigrahi, Sanjeev Arora

New Definitions and Evaluations for Saliency Methods: Staying Intrinsic, Complete and Sound
Arushi Gupta*, Nikunj Saunshi*, Dingli Yu*, Kaifeng Lyu, Sanjeev Arora

Gradient Descent on Two-layer Nets: Margin Maximization and Simplicity Bias
Kaifeng Lyu*, Zhiyuan Li*, Runzhe Wang*, Sanjeev Arora

Towards Resolving the Implicit Bias of Gradient Descent for Matrix Factorization: Greedy Low-Rank Learning
Zhiyuan Li, Yuping Luo, Kaifeng Lyu (alphabetical order)

Reconciling Modern Deep Learning with Traditional Optimization Analyses: The Intrinsic Learning Rate
Zhiyuan Li*, Kaifeng Lyu*, Sanjeev Arora

Gradient Descent Maximizes the Margin of Homogeneous Neural Networks
Kaifeng Lyu, Jian Li

Theoretical Analysis of Auto Rate-Tuning by Batch Normalization
Sanjeev Arora, Zhiyuan Li, Kaifeng Lyu (alphabetical order)

Fine-grained Complexity Meets IP = PSPACE
Lijie Chen, Shafi Goldwasser, Kaifeng Lyu, Guy N. Rothblum, Aviad Rubinstein (alphabetical order)

Single-Source Bottleneck Path Algorithm Faster than Sorting for Sparse Graphs
Ran Duan, Kaifeng Lyu, Hongxun Wu, Yuanhang Xie (alphabetical order)

Learning Gradient Descent: Better Generalization and Longer Horizons
Kaifeng Lv*, Shunhua Jiang*, Jian Li

(Authors are listed in contribution order by default; * denotes equal contribution.)

Professional Services

  • Organizer, NeurIPS 2023 Workshop on Mathematics of Modern Machine Learning (M3L).
  • Conference Reviewer: ICML (2020-2023), NeurIPS (2020-2023), ICLR (2022-2024), COLT (2020), AAAI (2020), KDD (2022).
  • Journal Reviewer: TMLR, JMLR, TPAMI, AIJ.
  • Organizer, Yao Class Seminar, Tsinghua University (Fall 2019, Fall 2020, Spring 2021).

Universal Online Judge

  • In 2014, I founded the Universal Online Judge (UOJ), a popular online judge system in China.
  • UOJ is capable of testing both traditional and non-traditional programming problems in OI (Olympiad in Informatics). A team of top OI players regularly hosts programming contests on UOJ.
  • [Link] [GitHub] [Docs]