OpenAI: A surprising deep learning mystery:
Contrary to conventional wisdom, performance of unregularized CNNs, ResNets, and transformers is non-monotonic: improves, then gets worse, then improves again with increasing model size, data size, or training time.
80 replies, 2035 likes
Ian Osband: Looking back over the year, the one paper that gave me the best "aha" moment was...
Reconciling Modern Machine Learning Practice and the Bias-Variance Trade-off:
The "bias-variance" you knew was just the first piece of the story! https://t.co/J24b0W8LDR
17 replies, 1939 likes
Nando de Freitas: I agree. This was a phenomenal paper. I’m hoping it will inspire researchers to probe further.
1 replies, 478 likes
Oriol Vinyals: The paper "Understanding deep learning requires rethinking generalization" mostly asked questions. Glad to see some answers / new theories since then!
1 replies, 168 likes
Greg Yang: @OpenAI Isn't this the "double descent" phenomenon studied in https://arxiv.org/abs/1812.11118 and subsequent works?
3 replies, 131 likes
Gilles Louppe: Do you know of anyone who reproduced the double-U generalization curve of over-parameterized networks? https://arxiv.org/pdf/1812.11118.pdf Looking for a friend :-) https://t.co/6GgKsyAiSm
8 replies, 127 likes
Olivia Guest | Ολίβια Γκεστ: I was in a workshop that warned against overfitting without mentioning that, in practice, many deep network models simply aren't overfit, so I'm mentioning it here:
Talk: https://cbmm.mit.edu/video/fit-without-fear-over-fitting-perspective-modern-deep-and-shallow-learning https://t.co/vxWaDXlZ9N
1 replies, 34 likes
halvarflake: @zacharylipton Not sure there's a "single" paper to note, but the entire discussion about double-descent has been the most interesting thing I read this year:
1) https://arxiv.org/abs/1812.11118 - "Reconciling modern machine learning practice and the bias-variance trade-off"
2 replies, 24 likes
halvarflake: To my great surprise, I found a few minutes of downtime today to read https://arxiv.org/abs/1812.11118. If you are into ML or statistics, I greatly recommend the paper; I will read the follow-ups but the empirical results showing double-descent risk curves are really fascinating.
1 replies, 23 likes
Wojciech Czarnecki: @ilyasut By unnoticed you mean published for just a year? https://arxiv.org/abs/1812.11118
1 replies, 16 likes
Daisuke Okanohara: The bias-variance tradeoff shows that a model with appropriate complexity can generalize. Recent "double descent" indicates that a larger (than the necessary) model can generalize better in some situations. https://arxiv.org/abs/1812.11118 https://arxiv.org/abs/1903.07571 https://arxiv.org/abs/1909.11720
1 replies, 16 likes
Vince Buffalo: I quite like this figure (from this paper: https://arxiv.org/abs/1812.11118), which I think unites why both machine learning and parameter-rich Bayesian models are doing well across a variety of tasks (Murphy also makes this point in Chapter 17 of his book). https://t.co/Jkt4JYHBcG
1 replies, 15 likes
Narges Razavian: Adding the double descent paper to lecture 1 of my introductory ML course.. Who would've thought?
Reconciling modern machine learning practice and the bias-variance trade-off
Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal
(& https://openai.com/blog/deep-double-descent/) https://t.co/1TsrIatsBq
0 replies, 14 likes
halvarflake: Stats/ML followers: This paper https://arxiv.org/pdf/1812.11118.pdf argues that the risk curve when overparametrizing models is "w"-shaped vs u-shaped for many models. They provide some evidence from DNN and RFs. A fascinating claim; will need to mull the paper a bit. Worth a read.
1 replies, 12 likes
Jigar Doshi: From Classical Statistics to Modern Machine Learning. This attempts to explain why we don't overfit when we train for a very long time. Beautiful talk as well
Talk: https://www.youtube.com/watch?v=OBCciGnOJVs https://t.co/J93wvzuNJd
0 replies, 10 likes
François Fleuret: So is the idea in Belkin's paper simply that when the training error is zero and you increase your model space, you can reduce even more *whatever measure of capacity you defined initially*?
2 replies, 8 likes
Kameron Decker Harris: Check out this paper: "Reconciling modern machine-learning practice and the classical bias–variance trade-off" by Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal
1 replies, 7 likes
𝚄𝚕𝚞𝚐𝚋𝚎𝚔 𝚂. 𝙺𝚊𝚖𝚒𝚕𝚘𝚟: This was the paper you mentioned to me at BASP @mariotelfig?
2 replies, 7 likes
Andreas Mueller: @jeremyphoward @reachtarunhere @OpenAI a theoretical explanation is given by my colleague Daniel Hsu here: https://arxiv.org/abs/1812.11118
1 replies, 6 likes
Nil Adell Mill: It's great to see more work on the double descent phenomenon. It comes as a good reminder for me to re-visit Belkin et al. (https://arxiv.org/abs/1812.11118) https://t.co/sBCINqeKYc
0 replies, 6 likes
Orestis Tsinalis: Very interesting paper with empirical observations of "double descent"/two-regime behaviour in test performance of complex ML models as a function of (L2 norm-based) model complexity.
"Reconciling Modern Machine Learning Practice and the Bias-Variance Trade-off" https://arxiv.org/abs/1812.11118 https://t.co/A3kM6UKNPH
0 replies, 4 likes
Karandeep Singh: The “double-descent” observed in this paper doesn’t make any sense to me intuitively. As model complexity increases (⬇️EPV), out-of-sample performance worsens then improves for neural nets and RFs? Why? https://t.co/0LH1vMcIFc
0 replies, 3 likes
Joshua Loftus: Question about #MachineLearning #DeepLearning #AI
What's the "surprise" https://arxiv.org/abs/1903.08560 or thing that needs to be "reconciled" https://arxiv.org/abs/1812.11118 about the "double descent" or "double U shape" test error curves? (1/2)
1 replies, 2 likes
msb.ai: Reconciling modern machine learning and the bias-variance trade-off
"...boosting with decision trees and Random Forests also show similar generalization behavior as neural nets, both before and after the interpolation threshold"
0 replies, 2 likes
Luigi Freda: A new surprising perspective: a "double descent" curve that subsumes the U-shaped bias-variance trade-off curve and shows how increasing model capacity beyond the point of interpolation results in improved performance.
1 replies, 1 likes
SHIMOMURA Takuji: https://arxiv.org/pdf/1812.11118.pdf We first consider a popular class of non-linear parametric models called Random Fourier Features (RFF), which can be viewed as a class of two-layer neural networks with fixed weights in the first layer. #nextAI https://t.co/RC6vxJi4pb
0 replies, 1 likes
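The "two-layer network with a frozen first layer" view of Random Fourier Features quoted above can be made concrete with a few lines of NumPy. This is a hypothetical illustration, not the paper's implementation; the dimensions and the toy target are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

d, p, n = 3, 200, 100            # input dim, number of random features, samples
W = rng.standard_normal((d, p))  # first-layer weights: random and never trained
b = rng.uniform(0, 2 * np.pi, p)

def first_layer(X):
    # "Hidden layer" with fixed weights and cosine activation (the RFF map)
    return np.cos(X @ W + b)

X = rng.standard_normal((n, d))
y = np.sin(X[:, 0])              # toy regression target

# Training touches only the second, linear layer; with p > n, lstsq
# returns the minimum-norm interpolating solution.
second_layer = np.linalg.lstsq(first_layer(X), y, rcond=None)[0]
pred = first_layer(X) @ second_layer
```

Because only the linear output layer is fit, the model is convex in its trainable parameters, which is what makes RFF a tractable proxy for studying interpolation in neural networks.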
Dave Harris: Wow, this is a weird approach that would never be useful for training real models, but it’s perfect for gaining insight about what exactly is happening with over-parameterized models that don’t overfit. I’m really impressed.
0 replies, 1 likes
Found on Dec 05 2019 at https://arxiv.org/pdf/1812.11118.pdf