Learning a Single Neuron with Adversarial Label Noise via Gradient Descent
June 17, 2022 Β· Declared Dead Β· π Annual Conference Computational Learning Theory
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Ilias Diakonikolas, Vasilis Kontonis, Christos Tzamos, Nikos Zarifis
arXiv ID
2206.08918
Category
cs.LG: Machine Learning
Cross-listed
cs.DS,
math.ST,
stat.ML
Citations
24
Venue
Annual Conference Computational Learning Theory
Last Checked
3 months ago
Abstract
We study the fundamental problem of learning a single neuron, i.e., a function of the form $\mathbf{x}\mapstoΟ(\mathbf{w}\cdot\mathbf{x})$ for monotone activations $Ο:\mathbb{R}\mapsto\mathbb{R}$, with respect to the $L_2^2$-loss in the presence of adversarial label noise. Specifically, we are given labeled examples from a distribution $D$ on $(\mathbf{x}, y)\in\mathbb{R}^d \times \mathbb{R}$ such that there exists $\mathbf{w}^\ast\in\mathbb{R}^d$ achieving $F(\mathbf{w}^\ast)=Ξ΅$, where $F(\mathbf{w})=\mathbf{E}_{(\mathbf{x},y)\sim D}[(Ο(\mathbf{w}\cdot \mathbf{x})-y)^2]$. The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbb{w})=C\, Ξ΅$ with high probability, where $C>1$ is a universal constant. As our main contribution, we give efficient constant-factor approximate learners for a broad class of distributions (including log-concave distributions) and activation functions. Concretely, for the class of isotropic log-concave distributions, we obtain the following important corollaries: For the logistic activation, we obtain the first polynomial-time constant factor approximation (even under the Gaussian distribution). Our algorithm has sample complexity $\widetilde{O}(d/Ξ΅)$, which is tight within polylogarithmic factors. For the ReLU activation, we give an efficient algorithm with sample complexity $\tilde{O}(d\, \polylog(1/Ξ΅))$. Prior to our work, the best known constant-factor approximate learner had sample complexity $\tildeΞ©(d/Ξ΅)$. In both of these settings, our algorithms are simple, performing gradient-descent on the (regularized) $L_2^2$-loss. The correctness of our algorithms relies on novel structural results that we establish, showing that (essentially all) stationary points of the underlying non-convex loss are approximately optimal.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
π Similar Papers
In the same crypt β Machine Learning
R.I.P.
π»
Ghosted
R.I.P.
π»
Ghosted
XGBoost: A Scalable Tree Boosting System
R.I.P.
π»
Ghosted
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
R.I.P.
π»
Ghosted
Semi-Supervised Classification with Graph Convolutional Networks
R.I.P.
π»
Ghosted
Proximal Policy Optimization Algorithms
R.I.P.
π»
Ghosted
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Died the same way β π» Ghosted
R.I.P.
π»
Ghosted
Language Models are Few-Shot Learners
R.I.P.
π»
Ghosted
You Only Look Once: Unified, Real-Time Object Detection
R.I.P.
π»
Ghosted
A Unified Approach to Interpreting Model Predictions
R.I.P.
π»
Ghosted