
Tensor Programs II: Neural Tangent Kernel for Any Architecture


Greg Yang: Training a neural network (NN) can suffer from bad local minima. But as the NN gets wider, its optimization landscape in *function space* converges & becomes convex; when width=∞, this convex landscape is described by the Neural Tangent Kernel.

9 replies, 996 likes
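A minimal NumPy sketch of the object behind this claim (my own illustrative setup, not code from the paper): the empirical NTK of a toy one-hidden-layer ReLU net, K[i, j] = ⟨∇_θ f(x_i), ∇_θ f(x_j)⟩. As the width grows, this random kernel concentrates around the deterministic infinite-width NTK.

```python
import numpy as np

def empirical_ntk(xs, width, seed=0):
    """Empirical NTK of f(x) = v @ relu(W @ x) / sqrt(width):
    K[i, j] = <grad_theta f(x_i), grad_theta f(x_j)>."""
    rng = np.random.default_rng(seed)
    d = xs.shape[1]
    W = rng.standard_normal((width, d))
    v = rng.standard_normal(width)
    grads = []
    for x in xs:
        h = W @ x                                     # preactivations
        mask = (h > 0).astype(float)                  # relu'(h)
        dv = np.maximum(h, 0) / np.sqrt(width)        # df/dv
        dW = np.outer(v * mask, x) / np.sqrt(width)   # df/dW
        grads.append(np.concatenate([dv, dW.ravel()]))
    G = np.stack(grads)
    return G @ G.T

xs = np.array([[1.0, 0.0], [0.0, 1.0]])
K = empirical_ntk(xs, width=8192)   # concentrates around the infinite-width NTK
```

The resulting kernel matrix is symmetric positive semidefinite, which is what makes the infinite-width function-space landscape a convex (kernel-regression) problem.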

Greg Yang: 1/ Gradients improve weights, so they better depend on the weights, right? Somehow, for calculating e.g. grad norm or NTK at init, grads might as well be backpropped by random weights, independent from those used in the forward pass. WTF? Let me explain.

5 replies, 319 likes

Greg Yang: 1/ I reveal the evolution under gradient descent of a neural network of *any architecture*, by showing how to compute its tangent kernel (NTK). This includes RNNs, transformers, resnets, GANs, Faster RCNN, and more! Let's have theory catch up to practice!

4 replies, 237 likes

Greg Yang: 1/ A ∞-wide NN of *any architecture* is a Gaussian process (GP) at init. The NN in fact evolves linearly in function space under SGD, so is a GP at *any time* during training. With Tensor Programs, we can calculate this time-evolving GP w/o training any NN.

1 replies, 215 likes
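The "evolves linearly" claim has a standard closed form under gradient flow on the squared loss (a sketch in the usual NTK notation, which the tweet leaves implicit: Θ is the NTK, (X, y) the training set, η the learning rate):

```latex
f_t(x) \;=\; f_0(x) \;+\; \Theta(x, X)\,\Theta(X, X)^{-1}
\left(I - e^{-\eta\,\Theta(X, X)\,t}\right)\bigl(y - f_0(X)\bigr)
```

Since f_0 is a Gaussian process at init and the right-hand side is an affine function of f_0, the trained function f_t is a Gaussian process at every time t.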

Greg Yang: Infinitely-wide recurrent networks (i.e. the RNN Neural Tangent Kernel) are good at time series prediction with low data, who'd've thought! Such a calculation with an infinite-width RNN wouldn't have been possible without Tensor Programs!

0 replies, 207 likes

Greg Yang: 1/ How to construct ∞-width neural networks of *any architecture* (e.g. RNNs, transformers) with finite resources, i.e. compute their infinite-width Neural Tangent Kernels? The paper explains it, but here's an overview thread.

1 replies, 163 likes

Greg Yang: 1/ It's remarkable how much of our current understanding of wide networks (Gaussian process, Tangent Kernel, etc) can be derived from 1 key intuition: In every (pre)activation x of a randomly initialized NN, the coordinates of x are roughly iid.

1 replies, 146 likes
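A quick NumPy check of this intuition (a toy setup of my own, not from the paper): in a randomly initialized 2-layer ReLU MLP with 1/√(fan-in) scaling, the coordinates of the second-layer preactivation behave like iid Gaussians with mean ≈ 0 and variance ≈ E[relu(z)²] = 1/2 for z ~ N(0, 1).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 32                        # width and input dimension
x = rng.standard_normal(d)
x *= np.sqrt(d) / np.linalg.norm(x)    # normalize so ||x||^2 = d exactly
W1 = rng.standard_normal((n, d)) / np.sqrt(d)
W2 = rng.standard_normal((n, n)) / np.sqrt(n)
h1 = W1 @ x                            # first-layer preactivation: coords ~ N(0, 1)
h2 = W2 @ np.maximum(h1, 0)            # second-layer preactivation
# Coordinates of h2 look iid Gaussian: mean ~ 0, variance ~ 1/2.
print(h2.mean(), h2.var())
```

This "roughly iid coordinates" picture is exactly what lets one replace sums over coordinates by Gaussian expectations when deriving the GP and NTK limits.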

Greg Yang: @vincefort @PhilippMarkolin @_hylandSL well it's funny you mention this, since GPT-3 is just a transformer and you can play with such a kernel on Colab right here :) Papers: GP4A, NTK4A.

2 replies, 26 likes

Greg Yang: 6/ In fact, this "free independence phenomenon" implies the "gradient independence phenomenon", where, during the first backpropagation, the gradient is the same whether you backprop using the forward propagation weights or a new set of iid weights.

0 replies, 4 likes
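A toy NumPy check of the gradient independence phenomenon (my own setup, not the paper's code): the squared norm of the input gradient of f = v·relu(Wx) at init comes out the same, up to O(n^{-1/2}) fluctuations, whether we backprop through the forward weights W or a fresh iid copy.

```python
import numpy as np

def input_grad_norm(n, independent_backprop, seed=0):
    """||dF/dx||^2 at init for f = v . relu(W x), backpropping
    through W itself or through a fresh iid matrix B."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    W = rng.standard_normal((n, n)) / np.sqrt(n)
    v = rng.standard_normal(n) / np.sqrt(n)
    h = W @ x                          # preactivations, coords ~ N(0, 1)
    delta = v * (h > 0)                # gradient arriving at the preactivations
    B = rng.standard_normal((n, n)) / np.sqrt(n) if independent_backprop else W
    g = B.T @ delta                    # backprop to the input
    return float(np.linalg.norm(g) ** 2)

tied = input_grad_norm(2000, independent_backprop=False)
indep = input_grad_norm(2000, independent_backprop=True)
# Both concentrate around the same limit (1/2 in this setup) as n grows.
print(tied, indep)
```

The key hypothesis making this work is that the readout weights v are sampled independently of W with zero mean, which kills the cross terms that would otherwise couple the forward and backward passes.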

Greg Yang: 6/ The NTK for any architecture is calculated in the paper and in this thread.

0 replies, 4 likes

Greg Yang: @BrynElesedy @dohmatobelvis 3/ However, n^-1/2 dV dW = O(n^-1/2), so it vanishes as n grows. Therefore, only the first-order Taylor expansion at init, i.e. the dV W and V dW terms, matters. Just in case you do wonder about the NTK computation at init, take a look at this paper :)

1 replies, 2 likes
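A numerical illustration of this Taylor argument (a simplified setup I chose for the demo: a 2-layer linear net f(x) = V W x / √n, one gradient step with loss = f itself): the first-order contributions dV·W and V·dW stay O(1) as the width n grows, while the cross term dV·dW vanishes.

```python
import numpy as np

def taylor_terms(n, lr=1.0, seed=0):
    """One GD step on f(x) = V @ W @ x / sqrt(n), loss = f for simplicity.
    Returns the first-order contributions (dV)Wx and V(dW)x and the
    second-order cross term (dV)(dW)x, all with the 1/sqrt(n) prefactor."""
    rng = np.random.default_rng(seed)
    d = 16
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)                      # unit input, ||x|| = 1
    V = rng.standard_normal(n)
    W = rng.standard_normal((n, d))
    dV = -lr * (W @ x) / np.sqrt(n)             # -lr * df/dV
    dW = -lr * np.outer(V, x) / np.sqrt(n)      # -lr * df/dW
    first_V = dV @ W @ x / np.sqrt(n)
    first_W = V @ (dW @ x) / np.sqrt(n)
    cross = dV @ (dW @ x) / np.sqrt(n)
    return first_V, first_W, cross

# First-order terms stay O(1); the cross term shrinks with width.
print(taylor_terms(100))
print(taylor_terms(10000))
```

In this setup both first-order terms concentrate around -1 while the cross term is O(1/n), consistent with the claim that only the first-order expansion at init matters for wide networks.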


Found on Sep 09 2020
