Empirical Risk Minimization for Stochastic Convex Optimization: $O(1/n)$- and $O(1/n^2)$-type of Risk Bounds
February 07, 2017 Β· Declared Dead Β· π Annual Conference Computational Learning Theory
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Lijun Zhang, Tianbao Yang, Rong Jin
arXiv ID
1702.02030
Category
cs.LG: Machine Learning
Citations
51
Venue
Annual Conference Computational Learning Theory
Last Checked
3 months ago
Abstract
Although there exist plentiful theories of empirical risk minimization (ERM) for supervised learning, current theoretical understandings of ERM for a related problem---stochastic convex optimization (SCO), are limited. In this work, we strengthen the realm of ERM for SCO by exploiting smoothness and strong convexity conditions to improve the risk bounds. First, we establish an $\widetilde{O}(d/n + \sqrt{F_*/n})$ risk bound when the random function is nonnegative, convex and smooth, and the expected function is Lipschitz continuous, where $d$ is the dimensionality of the problem, $n$ is the number of samples, and $F_*$ is the minimal risk. Thus, when $F_*$ is small we obtain an $\widetilde{O}(d/n)$ risk bound, which is analogous to the $\widetilde{O}(1/n)$ optimistic rate of ERM for supervised learning. Second, if the objective function is also $Ξ»$-strongly convex, we prove an $\widetilde{O}(d/n + ΞΊF_*/n )$ risk bound where $ΞΊ$ is the condition number, and improve it to $O(1/[Ξ»n^2] + ΞΊF_*/n)$ when $n=\widetildeΞ©(ΞΊd)$. As a result, we obtain an $O(ΞΊ/n^2)$ risk bound under the condition that $n$ is large and $F_*$ is small, which to the best of our knowledge, is the first $O(1/n^2)$-type of risk bound of ERM. Third, we stress that the above results are established in a unified framework, which allows us to derive new risk bounds under weaker conditions, e.g., without convexity of the random function and Lipschitz continuity of the expected function. Finally, we demonstrate that to achieve an $O(1/[Ξ»n^2] + ΞΊF_*/n)$ risk bound for supervised learning, the $\widetildeΞ©(ΞΊd)$ requirement on $n$ can be replaced with $Ξ©(ΞΊ^2)$, which is dimensionality-independent.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
π Similar Papers
In the same crypt β Machine Learning
R.I.P.
π»
Ghosted
R.I.P.
π»
Ghosted
XGBoost: A Scalable Tree Boosting System
R.I.P.
π»
Ghosted
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
R.I.P.
π»
Ghosted
Semi-Supervised Classification with Graph Convolutional Networks
R.I.P.
π»
Ghosted
Proximal Policy Optimization Algorithms
R.I.P.
π»
Ghosted
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Died the same way β π» Ghosted
R.I.P.
π»
Ghosted
Language Models are Few-Shot Learners
R.I.P.
π»
Ghosted
You Only Look Once: Unified, Real-Time Object Detection
R.I.P.
π»
Ghosted
A Unified Approach to Interpreting Model Predictions
R.I.P.
π»
Ghosted