R.I.P. 👻 Ghosted
PrE-Text: Training Language Models on Private Federated Data in the Age of LLMs
June 05, 2024 · Entered Twilight · International Conference on Machine Learning
Repo contents: .gitignore, README.md, assets, custom_datasets.py, eval_distilgpt2.py, eval_llama2.py, llama_bootstrap.py, main.py, nn_histogram.py, requirements.txt, similarity.py, variation.py
Authors
Charlie Hou, Akshat Shrivastava, Hongyuan Zhan, Rylan Conway, Trang Le, Adithya Sagar, Giulia Fanti, Daniel Lazar
arXiv ID
2406.02958
Category
cs.LG: Machine Learning
Cross-listed
cs.AI,
cs.CL,
cs.CR,
cs.DC
Citations
26
Venue
International Conference on Machine Learning
Repository
https://github.com/houcharlie/PrE-Text
⭐ 24
Last Checked
1 month ago
Abstract
On-device training is currently the most common approach for training machine learning (ML) models on private, distributed user data. Despite this, on-device training has several drawbacks: (1) most user devices are too small to train large models on-device, (2) on-device training is communication- and computation-intensive, and (3) on-device training can be difficult to debug and deploy. To address these problems, we propose Private Evolution-Text (PrE-Text), a method for generating differentially private (DP) synthetic textual data. First, we show that across multiple datasets, training small models (models that fit on user devices) with PrE-Text synthetic data outperforms small models trained on-device under practical privacy regimes ($\epsilon=1.29$, $\epsilon=7.58$). We achieve these results while using 9$\times$ fewer rounds, 6$\times$ less client computation per round, and 100$\times$ less communication per round. Second, finetuning large models on PrE-Text's DP synthetic data improves large language model (LLM) performance on private data across the same range of privacy budgets. Altogether, these results suggest that training on DP synthetic data can be a better option than training a model on-device on private distributed data. Code is available at https://github.com/houcharlie/PrE-Text.
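At a high level, PrE-Text adapts the Private Evolution framework to text: the server maintains a population of synthetic candidates, each client's private samples vote for their nearest candidate in embedding space, the vote histogram is privatized with noise, and an LLM rewrites the top-voted candidates into the next generation. The sketch below illustrates one such round under heavy simplification; the `embed` and `vary` helpers and the noise scale `sigma` are hypothetical stand-ins (the repo's actual implementations live in `similarity.py`, `nn_histogram.py`, and `variation.py`), not the paper's calibrated DP mechanism.

```python
# Minimal, illustrative sketch of one Private Evolution round for text.
# All helpers here are toy placeholders for the real PrE-Text components.
import numpy as np

def embed(texts):
    # Hypothetical embedding stub: PrE-Text uses a sentence-embedding model
    # (cf. similarity.py). Here we just hash characters into a unit vector.
    vecs = np.zeros((len(texts), 64))
    for i, t in enumerate(texts):
        for j, ch in enumerate(t):
            vecs[i, (j * 31 + ord(ch)) % 64] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, 1e-8)

def dp_nn_histogram(private_texts, synthetic_texts, sigma):
    # Each private sample votes for its nearest synthetic candidate; the
    # vote histogram is then privatized with Gaussian noise (cf.
    # nn_histogram.py; sigma here is an uncalibrated placeholder).
    priv, syn = embed(private_texts), embed(synthetic_texts)
    nearest = np.argmax(priv @ syn.T, axis=1)  # cosine sim on unit vectors
    votes = np.zeros(len(synthetic_texts))
    for idx in nearest:
        votes[idx] += 1.0
    return votes + np.random.normal(0.0, sigma, size=votes.shape)

def pe_round(private_texts, synthetic_texts, sigma=1.0, k=4, n_var=2):
    # Keep the k candidates with the highest noisy vote counts, then expand
    # them back into a population via an LLM "variation" step (cf.
    # variation.py; `vary` below is a stand-in for an LLM rewrite).
    hist = dp_nn_histogram(private_texts, synthetic_texts, sigma)
    seeds = [synthetic_texts[i] for i in np.argsort(hist)[-k:]]
    def vary(text):
        return text + " (paraphrase)"  # placeholder for an LLM paraphrase
    return seeds + [vary(s) for s in seeds for _ in range(n_var)]

if __name__ == "__main__":
    pop = pe_round(["hello from a user", "another private note"],
                   ["seed text a", "seed text b", "seed text c"], sigma=0.5)
    print(pop)
```

The design choice mirrored here is that clients only ever send (noisy) votes over server-proposed candidates, never raw text or gradients, which is what keeps per-round communication and client computation small relative to on-device training.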
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
Similar Papers
In the same crypt · Machine Learning
XGBoost: A Scalable Tree Boosting System
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Semi-Supervised Classification with Graph Convolutional Networks
Proximal Policy Optimization Algorithms