Text Embeddings Reveal (Almost) As Much As Text

October 10, 2023 · Declared Dead · 🏛 Conference on Empirical Methods in Natural Language Processing

💀 CAUSE OF DEATH: 404 Not Found
The code link is broken.
Authors: John X. Morris, Volodymyr Kuleshov, Vitaly Shmatikov, Alexander M. Rush
arXiv ID: 2310.06816
Category: cs.CL: Computation & Language
Cross-listed: cs.LG
Citations: 180
Venue: Conference on Empirical Methods in Natural Language Processing
Repository: https://github.com/jxmorris12/vec2text
Last Checked: 1 month ago
Abstract
How much private information do text embeddings reveal about the original text? We investigate the problem of embedding inversion, reconstructing the full text represented in dense text embeddings. We frame the problem as controlled generation: generating text that, when re-embedded, is close to a fixed point in latent space. We find that although a naïve model conditioned on the embedding performs poorly, a multi-step method that iteratively corrects and re-embeds text is able to recover 92% of 32-token text inputs exactly. We train our model to decode text embeddings from two state-of-the-art embedding models, and also show that our model can recover important personal information (full names) from a dataset of clinical notes. Our code is available on GitHub: github.com/jxmorris12/vec2text.
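The abstract's central mechanism is an iterative correction loop: start from an initial hypothesis, re-embed it with the frozen embedding model, and feed the target embedding, the current hypothesis, and its embedding back into a correction model until the hypothesis embedding lands close to the target point in latent space. The sketch below is a minimal, self-contained toy illustration of that control flow only; `toy_embed` and `toy_corrector` are hypothetical stand-ins, not the vec2text models (the paper trains sequence-to-sequence correctors against real embedding models).

```python
"""
Toy sketch of iterative embedding inversion: re-embed a hypothesis text and
apply a correction step until its embedding is close to the target embedding.
`toy_embed` and `toy_corrector` are hypothetical stand-ins, NOT the vec2text
models described in the paper.
"""
import numpy as np


def toy_embed(text: str, dim: int = 16) -> np.ndarray:
    """Deterministic toy 'embedding': bucket characters into a unit vector."""
    vec = np.zeros(dim)
    for i, ch in enumerate(text):
        vec[(i + ord(ch)) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def toy_corrector(target: np.ndarray, hypothesis: str, hyp_emb: np.ndarray) -> str:
    """Stand-in correction model, conditioned (as in the paper's framing) on the
    target embedding, the current hypothesis, and the hypothesis embedding.
    Here: keep the single-character edit that moves closest to the target."""
    best_text = hypothesis
    best_sim = float(hyp_emb @ target)
    for pos in range(len(hypothesis)):
        for ch in "abcdefghijklmnopqrstuvwxyz ":
            cand = hypothesis[:pos] + ch + hypothesis[pos + 1:]
            sim = float(toy_embed(cand) @ target)
            if sim > best_sim:
                best_text, best_sim = cand, sim
    return best_text


def invert(target: np.ndarray, init: str, steps: int = 25) -> str:
    """Controlled-generation loop: re-embed, compare, correct, repeat."""
    text = init
    for _ in range(steps):
        hyp_emb = toy_embed(text)
        if float(hyp_emb @ target) > 0.999:  # close enough in latent space
            break
        text = toy_corrector(target, text, hyp_emb)
    return text


if __name__ == "__main__":
    secret = "hello world"
    target_embedding = toy_embed(secret)  # all the attacker observes
    print(invert(target_embedding, init="x" * len(secret)))
```

Note the property the toy preserves: the inversion loop needs only black-box query access to the embedder, to re-embed candidate texts and compare them against the fixed target embedding.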
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

📜 Similar Papers

In the same crypt · Computation & Language

🌅 Old Age

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, ... (+6 more)

cs.CL πŸ› NeurIPS πŸ“š 166.0K cites 8 years ago

Died the same way · 💀 404 Not Found