Greg Yang: Training a neural network (NN) can suffer from bad local minima. But as the NN gets wider, its optimization landscape in *function space* converges & becomes convex; when width=∞, this convex landscape is described by Neural Tangent Kernel. https://arxiv.org/abs/2006.14548 https://t.co/2fGBldH3Ci
9 replies, 996 likes
Greg Yang: 1/ Gradients improve weights, so they'd better depend on the weights, right? Somehow, for calculating e.g. the grad norm or NTK at init, grads might as well be backpropped through random weights, independent of those used in the forward pass. WTF? Let me explain (from https://arxiv.org/abs/2006.14548) https://t.co/NRa36a71ym
5 replies, 319 likes
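[Editorial sketch, not from the paper's code: a minimal numpy check of this "gradient independence" claim for a 3-layer tanh MLP at random init. The backward pass to the first layer goes through V; we compare the squared gradient norm when backpropping with the true V vs. a fresh iid copy. The architecture and tanh activation are my own choices for illustration.]

```python
import numpy as np

# Sketch: gradient independence at init. Backprop through the true V
# vs. a fresh iid copy; the squared grad norm agrees in the wide limit.
rng = np.random.default_rng(0)
n, d = 2000, 10
x = rng.standard_normal(d)

W = rng.standard_normal((n, d)) / np.sqrt(d)   # layer 1
V = rng.standard_normal((n, n))                # layer 2 (1/sqrt(n) applied below)
u = rng.standard_normal(n)                     # readout

phi = np.tanh
dphi = lambda h: 1 - np.tanh(h) ** 2

# Forward pass
h1 = W @ x
h2 = (V @ phi(h1)) / np.sqrt(n)

# Backward pass to h1, once with the forward weights, once with fresh ones
g2 = u * dphi(h2) / np.sqrt(n)
V_fresh = rng.standard_normal((n, n))

def grad_norm_sq(back_V):
    g1 = (back_V.T @ g2) / np.sqrt(n) * dphi(h1)
    return np.sum(g1 ** 2) * np.sum(x ** 2)    # ||df/dW||_F^2

true_norm = grad_norm_sq(V)
fresh_norm = grad_norm_sq(V_fresh)
rel_diff = abs(true_norm - fresh_norm) / true_norm
print(rel_diff)   # small; shrinks like O(1/sqrt(n)) as width grows
```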
Greg Yang: 1/ I reveal the evolution under gradient descent of neural networks of *any architecture*, by showing how to compute their tangent kernel (NTK). This includes RNNs, transformers, resnets, GANs, Faster RCNN, and more! Let's have theory catch up to practice!
4 replies, 237 likes
Greg Yang: 1/ A ∞-wide NN of *any architecture* is a Gaussian process (GP) at init. The NN in fact evolves linearly in function space under SGD, so it is a GP at *any time* during training. https://arxiv.org/abs/2006.14548 With Tensor Programs, we can calculate this time-evolving GP w/o training any NN https://t.co/PxmWDBA9Po
1 replies, 215 likes
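[Editorial sketch (my own toy setup, not the paper's code): the GP-at-init claim is easy to see empirically for a 2-layer ReLU net. Over random draws of the weights, the scalar output should be roughly Gaussian with mean 0 and variance given by the NNGP kernel; with x normalized so preactivations are N(0,1), that variance is E[relu(Z)^2] = 1/2.]

```python
import numpy as np

# Sketch: a wide 2-layer ReLU net at random init is approximately a GP.
# Sample the output over many inits and check mean ~ 0, variance ~ 1/2.
rng = np.random.default_rng(0)
n, d, trials = 500, 16, 4000
x = np.ones(d)                      # ||x||^2 = d, so each h_i ~ N(0,1)

outs = np.empty(trials)
for t in range(trials):
    W = rng.standard_normal((n, d)) / np.sqrt(d)
    v = rng.standard_normal(n)
    outs[t] = v @ np.maximum(W @ x, 0.0) / np.sqrt(n)

print(outs.mean(), outs.var())      # ≈ 0 and ≈ 0.5
```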
Greg Yang: Infinitely-wide recurrent networks (i.e. the RNN Neural Tangent Kernel) are good at time series prediction with low data, who'd have thought! https://arxiv.org/abs/2006.10246v1 Such calculations with infinite-width RNNs wouldn't have been possible without Tensor Programs! https://arxiv.org/abs/2006.14548 https://t.co/MqdV6AL4OS
0 replies, 207 likes
Greg Yang: 1/ How to construct ∞-width neural networks of *any architecture* (e.g. RNNs, transformers) with finite resources, i.e. compute their infinite-width Neural Tangent Kernels? https://arxiv.org/abs/2006.14548 explains it, but here's an overview thread https://t.co/20K7WmXP7I
1 replies, 163 likes
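[Editorial sketch (a minimal example of my own, not the paper's general recipe): for the simplest case, a 2-layer ReLU net in NTK parametrization, the empirical tangent kernel Θ_n(x,x) = ⟨∇f(x), ∇f(x)⟩ at init can be checked against its analytic infinite-width value. With ||x||^2 = d, the limit is E[relu(Z)^2] + E[relu'(Z)^2] = 1/2 + 1/2 = 1.]

```python
import numpy as np

# Sketch: empirical NTK of a wide 2-layer ReLU net vs. the analytic limit.
rng = np.random.default_rng(0)
n, d = 20000, 16
x = np.ones(d)                                  # ||x||^2 = d

W = rng.standard_normal((n, d))
v = rng.standard_normal(n)

h = W @ x / np.sqrt(d)                          # preactivations ~ N(0,1)
act = np.maximum(h, 0.0)
step = (h > 0).astype(float)                    # relu'(h)

grad_v = act / np.sqrt(n)                       # df/dv
# df/dW_ij = v_i relu'(h_i) x_j / sqrt(n d); its squared norm factorizes:
gw_sq = np.sum((v * step) ** 2) / n * (x @ x) / d

ntk = grad_v @ grad_v + gw_sq
print(ntk)   # ≈ 1.0 at large width
```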
Greg Yang: 1/ It's remarkable how much of our current understanding of wide networks (Gaussian process, Tangent Kernel, etc) can be derived from 1 key intuition:
In every (pre)activation x of a randomly initialized NN, the coordinates of x are roughly iid. https://arxiv.org/abs/2006.14548 https://t.co/4SqhuT3vlS
1 replies, 146 likes
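[Editorial sketch illustrating this intuition (setup is mine): conditioned on the previous layer, each coordinate of h2 = V relu(h1)/sqrt(n) is exactly N(0, ||relu(h1)||^2/n), and that variance concentrates on E[relu(Z)^2] = 1/2, so marginally the coordinates look roughly iid N(0, 1/2). Variance and kurtosis check this numerically.]

```python
import numpy as np

# Sketch: coordinates of a preactivation vector are roughly iid Gaussian.
rng = np.random.default_rng(0)
n, d = 3000, 16
x = np.ones(d)

W = rng.standard_normal((n, d)) / np.sqrt(d)
V = rng.standard_normal((n, n))

h1 = W @ x                                   # coords ~ N(0,1)
h2 = V @ np.maximum(h1, 0.0) / np.sqrt(n)    # next preactivation

var = h2.var()
kurt = np.mean(h2 ** 4) / var ** 2           # ≈ 3 for a Gaussian
print(var, kurt)                             # ≈ 0.5 and ≈ 3
```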
Greg Yang: @vincefort @PhilippMarkolin @_hylandSL well it's funny you mention this, since GPT3 is just a transformer and you can play with such a kernel on colab right here https://colab.research.google.com/github/thegregyang/GP4A/blob/master/colab/Transformer.ipynb :) The NTK version is at https://colab.research.google.com/github/thegregyang/NTK4A/blob/master/colab/Transformer-NTK.ipynb.
2 replies, 26 likes
Greg Yang: 6/ In fact, this "free independence phenomenon" implies the "gradient independence phenomenon" https://twitter.com/TheGregYang/status/1290290612588036096?s=20, where, during the first backpropagation, the gradient is the same whether you backprop using the forward propagation weights or a new set of iid weights.
0 replies, 4 likes
Greg Yang: 6/ The NTK for any architecture is calculated in https://arxiv.org/abs/2006.14548 and in this thread
0 replies, 4 likes
Greg Yang: @BrynElesedy @dohmatobelvis 3/ However, n^-1/2 dV dW = O(n^-1/2), so it vanishes as n → ∞. Therefore, only the first-order Taylor terms at init, i.e. dV W and V dW, matter.
Just in case you do wonder about the NTK computation at init, take a look at this paper https://arxiv.org/abs/2006.14548 :)
1 replies, 2 likes
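[Editorial sketch of the point above (the 2-layer tanh parametrization is my own toy choice, not the exact setup in the thread): after one gradient step from init, the first-order Taylor expansion in the weights predicts the new output, and the leftover higher-order error shrinks as the width grows.]

```python
import numpy as np

# Sketch: a wide net tracks its first-order (NTK) linearization after a step.
def lin_error(n, seed, d=16, lr=1.0, y=1.0):
    rng = np.random.default_rng(seed)
    x = np.ones(d)
    W = rng.standard_normal((n, d))
    v = rng.standard_normal(n)

    def f(W_, v_):
        return v_ @ np.tanh(W_ @ x / np.sqrt(d)) / np.sqrt(n)

    t = np.tanh(W @ x / np.sqrt(d))
    fv = t / np.sqrt(n)                                         # df/dv
    fW = np.outer(v * (1 - t ** 2) / np.sqrt(n), x / np.sqrt(d))  # df/dW
    theta = fv @ fv + np.sum(fW ** 2)                           # NTK at init

    err0 = f(W, v) - y                                          # dLoss/df
    f_new = f(W - lr * err0 * fW, v - lr * err0 * fv)           # true step
    f_lin = f(W, v) - lr * err0 * theta                         # linearized step
    return abs(f_new - f_lin)

narrow = np.mean([lin_error(100, s) for s in range(10)])
wide = np.mean([lin_error(6400, s) for s in range(10)])
print(narrow, wide)   # the wide net tracks its linearization much better
```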
Found on Sep 09 2020 at https://arxiv.org/pdf/2006.14548.pdf