Control-oriented Clustering of Visual Latent Representation
October 07, 2024 · Entered Twilight · International Conference on Learning Representations
"No code URL or promise found in abstract"
"Code repo scraped from project page (backfill)"
Evidence collected by the PWNC Scanner
Repo contents: .gitignore, CONTRIBUTING.md, LICENSE, README.md, configs, eval.py, nerfies, notebooks, requirements.txt, setup.py, third_party, train.py
Authors
Han Qi, Haocheng Yin, Heng Yang
arXiv ID
2410.05063
Category
cs.LG: Machine Learning
Cross-listed
cs.CV,
cs.RO
Citations
6
Venue
International Conference on Learning Representations
Repository
https://github.com/google/nerfies
⭐ 1940
Last Checked
11 days ago
Abstract
We initiate a study of the geometry of the visual representation space -- the information channel from the vision encoder to the action decoder -- in an image-based control pipeline learned from behavior cloning. Inspired by the phenomenon of neural collapse (NC) in image classification (arXiv:2008.08186), we empirically demonstrate the prevalent emergence of a similar law of clustering in the visual representation space. Specifically, in discrete image-based control (e.g., Lunar Lander), the visual representations cluster according to the natural discrete action labels; in continuous image-based control (e.g., Planar Pushing and Block Stacking), the clustering emerges according to "control-oriented" classes that are based on (a) the relative pose between the object and the target in the input or (b) the relative pose of the object induced by expert actions in the output. Each of the classes corresponds to one relative pose orthant (REPO). Beyond empirical observation, we show such a law of clustering can be leveraged as an algorithmic tool to improve test-time performance when training a policy with limited expert demonstrations. Particularly, we pretrain the vision encoder using NC as a regularization to encourage control-oriented clustering of the visual features. Surprisingly, such an NC-pretrained vision encoder, when finetuned end-to-end with the action decoder, boosts the test-time performance by 10% to 35%. Real-world vision-based planar pushing experiments confirmed the surprising advantage of control-oriented visual representation pretraining.
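The abstract describes two ingredients: (a) "control-oriented" class labels given by the orthant of a relative pose (one class per relative pose orthant, REPO), and (b) a neural-collapse-style regularizer used when pretraining the vision encoder so that visual features cluster by those labels. The sketch below is only an illustration of those two ideas, not the paper's implementation; the function names, the 2-D pose parameterization, and the choice of a within-class variance penalty are assumptions.

```python
# Illustrative sketch (assumed, not the paper's exact method): label each sample by
# the orthant of its relative pose (REPO), then penalize within-class feature scatter,
# a neural-collapse-style regularizer encouraging control-oriented clustering.
import torch

def repo_label(rel_pose: torch.Tensor) -> torch.Tensor:
    """Map a batch of relative poses (B, D) to orthant indices in [0, 2^D).

    Each sign pattern of the D pose components defines one relative pose
    orthant (REPO); the 2^D orthants serve as control-oriented class labels.
    """
    signs = (rel_pose > 0).long()                  # (B, D) sign pattern in {0, 1}
    powers = 2 ** torch.arange(rel_pose.shape[1])  # binary encoding of the pattern
    return (signs * powers).sum(dim=1)             # (B,) orthant index

def nc_within_class_penalty(features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Mean squared distance of features to their class mean.

    Driving this toward zero encourages the within-class variability collapse
    that characterizes neural collapse.
    """
    loss = features.new_zeros(())
    classes = labels.unique()
    for c in classes:
        feats_c = features[labels == c]
        loss = loss + ((feats_c - feats_c.mean(dim=0)) ** 2).sum(dim=1).mean()
    return loss / classes.numel()

if __name__ == "__main__":
    torch.manual_seed(0)
    rel_pose = torch.randn(32, 2)   # e.g. planar (dx, dy) object-to-target offset
    features = torch.randn(32, 64)  # stand-in for vision-encoder outputs
    labels = repo_label(rel_pose)   # 4 REPO classes for a 2-D pose
    print("REPO classes:", labels.unique().tolist())
    print("NC penalty:", nc_within_class_penalty(features, labels).item())
```

In the paper's setup this kind of penalty would be added to the pretraining objective of the vision encoder before end-to-end finetuning with the action decoder; the exact loss weighting and pose definition are not specified here.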
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
Similar Papers
In the same crypt – Machine Learning
XGBoost: A Scalable Tree Boosting System · R.I.P. · Ghosted
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift · R.I.P. · Ghosted
Semi-Supervised Classification with Graph Convolutional Networks · R.I.P. · Ghosted
Proximal Policy Optimization Algorithms · R.I.P. · Ghosted