PAGE: A Simple and Optimal Probabilistic Gradient Estimator for Nonconvex Optimization
August 25, 2020 Β· Declared Dead Β· π International Conference on Machine Learning
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Zhize Li, Hongyan Bao, Xiangliang Zhang, Peter RichtΓ‘rik
arXiv ID
2008.10898
Category
cs.LG: Machine Learning
Cross-listed
cs.AI,
cs.DS,
math.OC,
stat.ML
Citations
158
Venue
International Conference on Machine Learning
Last Checked
3 months ago
Abstract
In this paper, we propose a novel stochastic gradient estimator -- ProbAbilistic Gradient Estimator (PAGE) -- for nonconvex optimization. PAGE is easy to implement as it is designed via a small adjustment to vanilla SGD: in each iteration, PAGE uses the vanilla minibatch SGD update with probability $p_t$ or reuses the previous gradient with a small adjustment, at a much lower computational cost, with probability $1-p_t$. We give a simple formula for the optimal choice of $p_t$. Moreover, we prove the first tight lower bound $Ξ©(n+\frac{\sqrt{n}}{Ξ΅^2})$ for nonconvex finite-sum problems, which also leads to a tight lower bound $Ξ©(b+\frac{\sqrt{b}}{Ξ΅^2})$ for nonconvex online problems, where $b:= \min\{\frac{Ο^2}{Ξ΅^2}, n\}$. Then, we show that PAGE obtains the optimal convergence results $O(n+\frac{\sqrt{n}}{Ξ΅^2})$ (finite-sum) and $O(b+\frac{\sqrt{b}}{Ξ΅^2})$ (online) matching our lower bounds for both nonconvex finite-sum and online problems. Besides, we also show that for nonconvex functions satisfying the Polyak-Εojasiewicz (PL) condition, PAGE can automatically switch to a faster linear convergence rate $O(\cdot\log \frac{1}Ξ΅)$. Finally, we conduct several deep learning experiments (e.g., LeNet, VGG, ResNet) on real datasets in PyTorch showing that PAGE not only converges much faster than SGD in training but also achieves the higher test accuracy, validating the optimal theoretical results and confirming the practical superiority of PAGE.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
π Similar Papers
In the same crypt β Machine Learning
R.I.P.
π»
Ghosted
R.I.P.
π»
Ghosted
XGBoost: A Scalable Tree Boosting System
R.I.P.
π»
Ghosted
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
R.I.P.
π»
Ghosted
Semi-Supervised Classification with Graph Convolutional Networks
R.I.P.
π»
Ghosted
Proximal Policy Optimization Algorithms
R.I.P.
π»
Ghosted
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Died the same way β π» Ghosted
R.I.P.
π»
Ghosted
Language Models are Few-Shot Learners
R.I.P.
π»
Ghosted
You Only Look Once: Unified, Real-Time Object Detection
R.I.P.
π»
Ghosted
A Unified Approach to Interpreting Model Predictions
R.I.P.
π»
Ghosted