Papers of the day

Reconciling modern machine learning and the bias-variance trade-off


OpenAI: A surprising deep learning mystery: Contrary to conventional wisdom, performance of unregularized CNNs, ResNets, and transformers is non-monotonic: improves, then gets worse, then improves again with increasing model size, data size, or training time.

80 replies, 2035 likes

Ian Osband: Looking back over the year, the one paper that gave me the best "aha" moment was... Reconciling Modern Machine Learning and the Bias-Variance Tradeoff: The "bias-variance" you knew was just the first piece of the story!

17 replies, 1913 likes

Daniela Witten: The Bias-Variance Trade-Off & "DOUBLE DESCENT" 🧵 Remember the bias-variance trade-off? It says that models perform well for an "intermediate level of flexibility". You've seen the picture of the U-shape test error curve. We try to hit the "sweet spot" of flexibility. 1/🧵

19 replies, 1170 likes

Nando de Freitas: I agree. This was a phenomenal paper. I’m hoping it will inspire researchers to probe further.

1 reply, 478 likes

François Fleuret: One more toyish example in @pytorch: The double descent with polynomial regression. (thread)

3 replies, 338 likes
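The toy experiment Fleuret alludes to can be sketched without PyTorch. The following is an illustrative numpy version (data, basis choice, and degrees are my own assumptions, not his thread): fit polynomials of increasing degree to a handful of noisy points, using the minimum-norm least-squares solution so that past the interpolation threshold (degree + 1 > number of training points) the fit is the smallest-norm interpolant — the setting where the second descent tends to appear.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a few noisy samples of a smooth target on [-1, 1].
n_train = 15
x_train = np.sort(rng.uniform(-1, 1, n_train))
y_train = np.cos(3 * x_train) + 0.1 * rng.normal(size=n_train)
x_test = np.linspace(-1, 1, 200)
y_test = np.cos(3 * x_test)

def legendre_features(x, degree):
    # Legendre basis up to `degree` (better conditioned than raw powers);
    # returns a (len(x), degree + 1) design matrix.
    return np.polynomial.legendre.legvander(x, degree)

test_errors = {}
for degree in [2, 5, 14, 50, 300]:
    Phi = legendre_features(x_train, degree)
    # np.linalg.lstsq returns the minimum-norm solution when the system is
    # underdetermined (degree + 1 > n_train), i.e. past interpolation.
    coef, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)
    pred = legendre_features(x_test, degree) @ coef
    test_errors[degree] = np.mean((pred - y_test) ** 2)

# In this regime one typically sees the double-descent pattern: test error
# rises as the degree approaches the interpolation threshold (n_train - 1),
# then falls again as the model is overparameterized further.
for d, e in test_errors.items():
    print(f"degree {d:3d}: test MSE {e:.3f}")
```

The key design choice is the minimum-norm solver: plain gradient descent from zero initialization converges to the same minimum-norm interpolant for linear models, which is one proposed mechanism for the benign behavior past the threshold.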

Oriol Vinyals: The paper "Understanding deep learning requires rethinking generalization" mostly asked questions. Glad to see some answers / new theories since then!

1 reply, 168 likes

Greg Yang: @OpenAI Isn't this the "double descent" phenomenon studied in and subsequent works?

3 replies, 131 likes

Gilles Louppe: Do you know of anyone who reproduced the double-U generalization curve of over-parameterized networks? Looking for a friend :-)

8 replies, 127 likes

Olivia Guest | Ολίβια Γκεστ: I was in a workshop that warned against overfitting without mentioning that it's just not the case in practice that many deep network models are overfit, so I'm mentioning it here: Preprint: Talk:

1 reply, 34 likes

halvarflake: @zacharylipton Not sure there's a "single" paper to note, but the entire discussion about double-descent has been the most interesting thing I read this year: 1) - "Reconciling modern machine learning practice and the bias-variance trade-off"

2 replies, 24 likes

halvarflake: To my great surprise, I found a few minutes of downtime today to read it. If you are into ML or statistics, I highly recommend the paper; I will read the follow-ups, but the empirical results showing double-descent risk curves are really fascinating.

1 reply, 23 likes

Wojciech Czarnecki: @ilyasut By unnoticed you mean published for just a year?

1 reply, 16 likes

Daisuke Okanohara: The bias-variance tradeoff shows that a model with appropriate complexity can generalize. Recent "double descent" results indicate that a larger (than necessary) model can generalize better in some situations.

1 reply, 16 likes

Vince Buffalo: I quite like this figure from this paper, which I think explains why both machine learning and parameter-rich Bayesian models do well across a variety of tasks (Murphy also makes this point in Chapter 17 of his book).

1 reply, 15 likes

Narges Razavian: Adding the double descent paper to lecture 1 of my introductory ML course. Who would've thought? "Reconciling modern machine learning practice and the bias-variance trade-off" by Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal

0 replies, 14 likes

halvarflake: Stats/ML followers: This paper argues that the risk curve when overparametrizing models is "w"-shaped vs u-shaped for many models. They provide some evidence from DNN and RFs. A fascinating claim; will need to mull the paper a bit. Worth a read.

1 reply, 12 likes

Jigar Doshi: From classical statistics to modern machine learning: this attempts to explain why we don't overfit when we train for a very long time. Beautiful talk as well. Paper: Talk:

0 replies, 10 likes

François Fleuret: So is the idea in Belkin's paper simply that when the training error is zero and you increase your model space, you can reduce even more *whatever measure of capacity you defined initially*?

2 replies, 8 likes

𝚄𝚕𝚞𝚐𝚋𝚎𝚔 𝚂. 𝙺𝚊𝚖𝚒𝚕𝚘𝚟: This was the paper you mentioned to me at BASP @mariotelfig?

2 replies, 7 likes

Kameron Decker Harris: Check out this paper: "Reconciling modern machine-learning practice and the classical bias–variance trade-off" by Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal

1 reply, 7 likes

Nil Adell Mill: It's great to see more work on the double descent phenomenon. It comes as a good reminder for me to revisit Belkin et al.

0 replies, 6 likes

Andreas Mueller: @jeremyphoward @reachtarunhere @OpenAI a theoretical explanation is given by my colleague Daniel Hsu here:

1 reply, 6 likes

Orestis Tsinalis: Very interesting paper with empirical observations of "double descent"/two-regime behaviour in test performance of complex ML models as a function of (L2 norm-based) model complexity. "Reconciling Modern Machine Learning and the Bias-Variance Tradeoff"

0 replies, 4 likes

Karandeep Singh: The “double-descent” observed in this paper doesn’t make any sense to me intuitively. As model complexity increases (⬇️EPV), out-of-sample performance worsens then improves for neural nets and RFs? Why?

0 replies, 3 likes

Reconciling modern machine learning and the bias-variance trade-off: "...boosting with decision trees and Random Forests also show similar generalization behavior as neural nets, both before and after the interpolation threshold" #ArtificialIntelligence

0 replies, 2 likes

Joshua Loftus: Question about #MachineLearning #DeepLearning #AI What's the "surprise" or thing that needs to be "reconciled" about the "double descent" or "double U shape" test error curves? (1/2)

1 reply, 2 likes

Luigi Freda: A surprising new perspective: a "double descent" curve that subsumes the U-shaped bias-variance trade-off curve and shows how increasing model capacity beyond the point of interpolation results in improved performance.

1 reply, 1 like

Dave Harris: Wow, this is a weird approach that would never be useful for training real models, but it’s perfect for gaining insight about what exactly is happening with over-parameterized models that don’t overfit. I’m really impressed.

0 replies, 1 like

SHIMOMURA Takuji: We first consider a popular class of non-linear parametric models called Random Fourier Features (RFF) [30], which can be viewed as a class of two-layer neural networks with fixed weights in the first layer. #nextAI

0 replies, 1 like


Found on Dec 05 2019
