Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning

March 20, 2017 · Entered Twilight · 🏛 IEEE International Conference on Computer Vision

🌅 TWILIGHT: Old Age
Predates the code-sharing era – a pioneer of its time

"No code URL or promise found in abstract"
"Code repo scraped from project page (backfill)"

Evidence collected by the PWNC Scanner

Repo contents: .gitignore, LICENSE.txt, README.md, convert_gpu_to_cpu.lua, data, dataloader.lua, decoders, encoders, evaluate.lua, generate.lua, model.lua, model_utils, opts.lua, scripts, train.lua, utils.lua, vis

Authors: Abhishek Das, Satwik Kottur, José M. F. Moura, Stefan Lee, Dhruv Batra
arXiv ID: 1703.06585
Category: cs.CV (Computer Vision)
Cross-listed: cs.AI, cs.CL, cs.LG
Citations: 430
Venue: IEEE International Conference on Computer Vision
Repository: https://github.com/batra-mlp-lab/visdial ⭐ 230
Last Checked: 6 days ago
Abstract
We introduce the first goal-driven training for visual question answering and dialog agents. Specifically, we pose a cooperative 'image guessing' game between two agents -- Qbot and Abot -- who communicate in natural language dialog so that Qbot can select an unseen image from a lineup of images. We use deep reinforcement learning (RL) to learn the policies of these agents end-to-end -- from pixels to multi-agent multi-round dialog to game reward. We demonstrate two experimental results. First, as a 'sanity check' demonstration of pure RL (from scratch), we show results on a synthetic world, where the agents communicate in ungrounded vocabulary, i.e., symbols with no pre-specified meanings (X, Y, Z). We find that two bots invent their own communication protocol and start using certain symbols to ask/answer about certain visual attributes (shape/color/style). Thus, we demonstrate the emergence of grounded language and communication among 'visual' dialog agents with no human supervision. Second, we conduct large-scale real-image experiments on the VisDial dataset, where we pretrain with supervised dialog data and show that the RL 'fine-tuned' agents significantly outperform SL agents. Interestingly, the RL Qbot learns to ask questions that Abot is good at, ultimately resulting in more informative dialog and a better team.
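The synthetic-world "sanity check" described in the abstract, in which two agents invent a communication protocol over ungrounded symbols using policy-gradient RL, can be illustrated with a minimal tabular REINFORCE sketch. Everything below (a single binary attribute, two-symbol vocabularies, and all variable names) is an illustrative assumption for exposition, not the authors' actual setup or code:

```python
# Minimal sketch: Qbot asks a query symbol, Abot (which sees a hidden
# attribute) replies with an answer symbol, and Qbot guesses the attribute.
# Reward is shared, so REINFORCE drives the pair toward a protocol in which
# the reply symbol encodes the attribute. Purely illustrative, not the
# paper's implementation.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Tabular policies (logits).
q_logits = np.zeros(2)          # Qbot: which query symbol to send
a_logits = np.zeros((2, 2, 2))  # Abot: reply | (query, hidden attribute)
g_logits = np.zeros((2, 2, 2))  # Qbot: guess | (query, reply)

lr = 0.5
baseline = 0.0  # running-average reward baseline to reduce variance
for episode in range(5000):
    attr = rng.integers(2)                       # hidden attribute Abot sees
    pq = softmax(q_logits);        q = rng.choice(2, p=pq)
    pa = softmax(a_logits[q, attr]); r = rng.choice(2, p=pa)
    pg = softmax(g_logits[q, r]);   g = rng.choice(2, p=pg)
    reward = 1.0 if g == attr else 0.0
    adv = reward - baseline
    baseline += 0.01 * (reward - baseline)
    # REINFORCE update: grad of log-softmax = onehot(action) - probs
    q_logits        += lr * adv * (np.eye(2)[q] - pq)
    a_logits[q, attr] += lr * adv * (np.eye(2)[r] - pa)
    g_logits[q, r]  += lr * adv * (np.eye(2)[g] - pg)

# Evaluate the greedy (argmax) protocol over both attribute values.
correct = 0
q = int(np.argmax(q_logits))
for attr in range(2):
    r = int(np.argmax(a_logits[q, attr]))
    g = int(np.argmax(g_logits[q, r]))
    correct += int(g == attr)
acc = correct / 2
print("greedy protocol accuracy:", acc)
```

In this toy run the greedy protocol typically becomes perfectly informative: Abot's reply symbol comes to deterministically encode the hidden attribute, mirroring (at a much smaller scale) the emergent grounding the paper reports for its X/Y/Z symbol world.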
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

📜 Similar Papers

In the same crypt – Computer Vision