Tensor Programs I: Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes


Greg Yang: 1/ Why do wide, random neural networks form Gaussian processes, *regardless of architecture*? Let me give an overview in case you are too lazy to check out the paper or the code. The proof has two parts…

Greg Yang: 1/ I can't teach you how to dougie but I can teach you how to compute the Gaussian Process corresponding to an infinite-width neural network of ANY architecture, feedforward or recurrent, e.g. resnet, GRU, transformers, etc ... RT plz💪

Greg Yang: RNNs and batchnorm will be coming soon, but you can already play with them here. The general theory for this is based on tensor programs. Give Neural Tangents a try and let us know what you think!

Greg Yang: 1/ Neural networks are Gaussian Processes --- the Poster Edition from #NeurIPS2019 last week. In case you missed it, here’s a twitter version of the poster presentation, following the format of @colinraffel; and here’s the previous tweet thread

Greg Yang: 1/2 A wide NN w/ rand weights is a GP, aka Neural network-Gaussian process (NNGP) correspondence @G_Naveh @HSompolinsky et al show it also occurs when *training the NN w/ weight decay & grad noise* Neat!

Microsoft Research: Explore the open-source implementations of the Gaussian Process kernels of simple RNN, GRU, transformer, and batchnorm+ReLU network on GitHub:

Greg Yang: Hit me up @NeurIPSConf if you wanna learn more about wide neural networks and come to my poster session on Wednesday 5pm to 7pm, east exhibition hall B+C, poster #242

Greg Yang: @vincefort @PhilippMarkolin @_hylandSL well it's funny you mention this, since GPT3 is just a transformer and you can play with such a kernel on colab right here :) The NTK is Papers GP4A: NTK4A:

Andrey Kurenkov 🤖 @ Neurips: This Twitter thread by @TheGregYang, as well as the associated poster (which I stopped by today, hope you don't mind the not so great pic 😅), is a great example of communicating tricky math stuff with both depth and accessible & concise clarity! We should all strive for this! :)

Greg Yang: 2/ This paper is 2nd in the *tensor programs* series, following that proves the architectural universality of NNGP correspondence. This series aims to systematically scale up theoretical insights in toy cases to neural networks in practice.

Greg Yang: Pairs best with the paper and previous discussion

Greg Yang: @andrewgwils 1/2 This prior for DNNs has been studied recently (extending Neal's work) in the limit of infinite width in particular shows this prior is a GP for *any* DNN architecture

Greg Yang: 5/ So it remains to calculate the NNGP kernel and NT kernel for any given architecture. The first is described in and in this thread
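
For readers who want a concrete picture of what "calculating the NNGP kernel" means in the simplest case, here is a minimal sketch for a fully connected ReLU network. It uses the standard closed-form arc-cosine expression for E[relu(u) relu(v)] under a bivariate Gaussian, not the general tensor-programs machinery the thread refers to. The function name `relu_nngp_kernel` and the parameters `depth`, `sigma_w2`, and `sigma_b2` are illustrative assumptions, as is the parametrization (weights with variance `sigma_w2 / fan_in`, biases with variance `sigma_b2`, nonzero inputs).

```python
import numpy as np

def relu_nngp_kernel(X, depth=3, sigma_w2=2.0, sigma_b2=0.0):
    """NNGP kernel of an infinitely wide ReLU MLP (illustrative sketch).

    Assumes weights ~ N(0, sigma_w2 / fan_in), biases ~ N(0, sigma_b2),
    and that every input row of X has nonzero norm.
    `depth` counts the number of ReLU layers.
    """
    X = np.asarray(X, dtype=float)
    d = X.shape[1]
    # Covariance of the first-layer preactivations.
    K = sigma_w2 * (X @ X.T) / d + sigma_b2
    for _ in range(depth):
        std = np.sqrt(np.diag(K))
        norm = np.outer(std, std)
        # Clip to guard against floating-point values just outside [-1, 1].
        cos_theta = np.clip(K / norm, -1.0, 1.0)
        theta = np.arccos(cos_theta)
        # Closed form for E[relu(u) relu(v)], (u, v) ~ N(0, K restricted to the pair).
        expect = norm * (np.sin(theta) + (np.pi - theta) * cos_theta) / (2 * np.pi)
        # Covariance of the next layer's preactivations.
        K = sigma_w2 * expect + sigma_b2
    return K

if __name__ == "__main__":
    X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    print(relu_nngp_kernel(X, depth=3))
```

With `sigma_w2=2.0` and `sigma_b2=0.0`, the diagonal of the kernel is preserved from layer to layer, which makes the recursion easy to sanity-check by hand. Architectures such as RNNs, transformers, or batchnorm networks need different recursions and are not covered by this sketch.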

Greg Yang: @sschoenholz @stormtroper1721 @alanyttian Thanks for ping, Sam! Here is, for example, a thread on why all NNs look like Gaussian Processes at initialization.

Nicole Radziwill: this is super cool. thanks @BruceTedesco for RTing it

Charles 🎉 Frye: @FeiziSoheil Strong recommendation for covering the work of @yasamanbb, @jaschasd, @TheGregYang, and others on the Gaussian processes approach to understanding DNNs. Tensor Programs: Extension to Attention:

Matios Berhe: I’m not skilled enough to know why this makes me nervous cc:@paulportesi

Kevin Yang 楊凱筌: Another poster I'm really excited to see. I'm basically a sucker for anything that has GPs and NNs together.

Hacker News: Wide Neural Networks of Any Architecture Are Gaussian Processes: Comments:

Sham Kakade: cool stuff from @TheGregYang: Tensors, Neural Nets, GPs, and kernels! looks like we can derive a corresponding kernel/GP in a fairly general sense. very curious on broader empirical comparisons to neural nets, which (potentially) draw strength from the non-linear regime!

PDF content of a computer science paper: Tensor Programs I: Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes