Ryan Adams: Automatic differentiation collapses the linearized computational graph to compute a Jacobian. We don’t need exact gradients for SGD, so let’s use cheap Monte Carlo estimators instead: https://arxiv.org/abs/2007.10412
@denizzokt @AlexBeatson @NMcgreivy @jaduol1
2 replies, 478 likes
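A minimal sketch of the core idea, assuming only what the tweet says: replace the intermediate Jacobians in the linearized graph with cheap sparse random matrices whose expectation is the true Jacobian. The toy chain, the sampling probability `p`, and the `sparse_unbiased` helper below are illustrative choices, not the paper's actual scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear chain y = B @ A @ x with loss = sum(y); the exact gradient
# wrt x is (B @ A).T @ ones.
A = rng.normal(size=(5, 4))
B = rng.normal(size=(3, 5))
g_exact = (B @ A).T @ np.ones(3)

def sparse_unbiased(J, p, rng):
    """Keep each entry with probability p, rescale by 1/p, so E[result] = J."""
    mask = rng.random(J.shape) < p
    return np.where(mask, J / p, 0.0)

# Independent sparse estimates multiply to an unbiased estimate of B @ A,
# so averaging many cheap draws recovers the exact gradient.
p, n = 0.5, 20000
est = np.zeros(4)
for _ in range(n):
    est += (sparse_unbiased(B, p, rng) @ sparse_unbiased(A, p, rng)).T @ np.ones(3)
est /= n
```

Because the two sparse draws are independent, E[S_B @ S_A] = B @ A, so each single draw is already an unbiased (if noisy) gradient estimate suitable for SGD.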
Deniz Oktay: Why spend computation and memory on exact gradients only to use them for stochastic optimization?
Introducing: Randomized Automatic Differentiation (RAD)
w/ @NMcgreivy @jaduol1 @AlexBeatson @ryan_p_adams
3 replies, 443 likes
Sam Greydanus: At the price of adding noise to gradients, we can save lots of memory. This makes backprop ~1 order of magnitude more memory-efficient (depends on model), without hurting optimization much. I'd like to see this on larger problems.
1 reply, 49 likes
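The memory saving Sam describes can be sketched in one layer, under illustrative assumptions (the layer, fraction `p`, and sizes are mine, not the paper's): backprop through y = W @ x must keep x in memory to form dL/dW, but storing only a random fraction of x trades activation memory for gradient noise while staying unbiased.

```python
import numpy as np

rng = np.random.default_rng(1)

# Backprop through y = W @ x needs x stored to form dL/dW = dy (outer) x.
# Keeping only a random fraction p of x (rescaled by 1/p) cuts activation
# memory by roughly a factor of 1/p; E[sparse x] = x, so the gradient
# estimate is unbiased, just noisier.
x = rng.normal(size=200)
dy = rng.normal(size=5)
dW_exact = np.outer(dy, x)

p, n = 0.1, 4000
dW_est = np.zeros((5, 200))
for _ in range(n):
    keep = rng.random(x.size) < p           # store ~p * len(x) activations
    dW_est += np.outer(dy, np.where(keep, x / p, 0.0))
dW_est /= n
```

Here a single draw stores about 10% of the activations; averaging many draws is only done to verify unbiasedness, while SGD would consume the single noisy draw directly.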
Alex Beatson: Super excited to share this new paper on Randomized Automatic Differentiation!
Minibatch SGD estimates grads by sampling *data* nodes in a computational graph. What if we sampled *all* nodes when doing AD?
We show this can reduce memory costs in ML and scientific computing.
0 replies, 32 likes
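"Sampling nodes in the graph" can also be read path-wise: the chain rule writes a derivative as a sum over paths through the computational graph, and sampling paths with reweighting gives an unbiased estimate. The tiny two-path graph below is my own illustration, not an example from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Tiny graph: y = a * b with a = x**2, b = sin(x).
# dy/dx = (dy/da)(da/dx) + (dy/db)(db/dx)  -- one term per path.
x = 0.7
a, b = x**2, np.sin(x)
paths = [b * 2 * x,        # path through a
         a * np.cos(x)]    # path through b
exact = sum(paths)

# Sample one path uniformly and reweight by the number of paths (2):
# each draw is unbiased, so the average converges to the exact derivative.
n = 100000
draws = rng.integers(0, 2, size=n)
est = np.mean([2 * paths[i] for i in draws])
```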
Sam Power: quite cool stuff. a bit reminiscent of some old techniques for solving linear systems with sampling methods, e.g. https://link.springer.com/article/10.1007/BF01578388 (n.b. at the time / to this author, 'sequential monte carlo' did not carry the same meaning it currently does)
0 replies, 4 likes
Found on Jul 24 2020 at https://arxiv.org/pdf/2007.10412.pdf