DeepMind: We show that batch normalisation biases deep residual networks towards shallow paths with well-behaved gradients. This dramatically increases the largest trainable depth. We can recover this benefit with a simple change to the initialisation scheme: https://arxiv.org/abs/2002.10444
6 replies, 738 likes
👩💻 DynamicWebPaige: "SkipInit: 1-line code change that can train deep residual networks without normalization, and also enhances the performance of shallow residual networks.
We therefore conclude that one no longer needs normalization layers to train deep residual networks with small batch sizes."
1 replies, 20 likes
Yee Whye Teh: Nice work by Sam Smith and Soham De.
0 replies, 13 likes
Daisuke Okanohara: Batch normalization makes residual networks use shallow paths by downscaling the residual blocks, which increases the trainable depth. We can obtain the same effect by simply introducing a scalar multiplier, initialized to 0, at the end of each residual branch. https://arxiv.org/abs/2002.10444
0 replies, 3 likes
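(A minimal sketch of the zero-initialised scalar multiplier described in the tweet above, not the authors' code: a hypothetical PyTorch residual block where the residual branch is scaled by a learnable scalar alpha initialised to 0, so each block starts as the identity and no normalization layer is used.)

import torch
import torch.nn as nn

class SkipInitResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)
        # The "1-line change": a scalar multiplier on the residual branch, initialised to zero.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        residual = self.conv2(self.relu(self.conv1(self.relu(x))))
        # At initialisation alpha = 0, so the block computes the identity and gradients stay well-behaved.
        return x + self.alpha * residual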
Jesper Dramsch: Batch normalization did not work on some of the problems I worked on; it basically lost all the information necessary for the regression of physical data.
I was wondering if better initializations would help. Seems they do!
#ml #machinelearning https://twitter.com/DeepMind/status/1232324838070669313
0 replies, 2 likes
Statistics Papers: Batch Normalization Biases Deep Residual Networks Towards Shallow Paths. http://arxiv.org/abs/2002.10444
0 replies, 2 likes
Greg Yang: @unsorsodicorda @KyleLLuther1 I started writing it but then other things ended up getting priority :( However, these guys https://arxiv.org/pdf/2002.10444.pdf essentially give you the right thing for a resnet with batchnorm
0 replies, 1 likes
Brundage Bot: Batch Normalization Biases Deep Residual Networks Towards Shallow Paths. Soham De and Samuel L. Smith http://arxiv.org/abs/2002.10444
1 replies, 1 likes
Found on Feb 25 2020 at https://arxiv.org/pdf/2002.10444.pdf