Are Large Language Models Temporally Grounded?

November 14, 2023 · Entered Twilight · 🏛 North American Chapter of the Association for Computational Linguistics

💤 TWILIGHT: Eternal Rest
Repo abandoned since publication

Repo contents: LICENSE, README.md, caters-gpt.py, caters-llama.py, dataset, denoising_event_lm, eval-caters.py, eval-mctaco.sh, eval-tempeval-bi-cot.py, eval-tempeval-bi.py, evaluator.py, gpt-output, illustration.png, llama-hf_environment.yml, llama-output, mctaco-gpt.py, mctaco-llama.py, metrics, normalise-mctaco-gpt.py, run-caters-gpt.sh, run-caters-llama.sh, run-mctaco-gpt.sh, run-mctaco-llama.sh, run-tempeval-gpt.sh, run-tempeval-llama.sh, tempeval-gpt.py, tempeval-llama.py, temporal-qualitative-case.pdf

Authors: Yifu Qiu, Zheng Zhao, Yftah Ziser, Anna Korhonen, Edoardo M. Ponti, Shay B. Cohen
arXiv ID: 2311.08398
Category: cs.CL: Computation & Language
Cross-listed: cs.AI
Citations: 22
Venue: North American Chapter of the Association for Computational Linguistics
Repository: https://github.com/yfqiu-nlp/temporal-llms ⭐ 13
Last Checked: 1 month ago
Abstract
Are Large Language Models (LLMs) temporally grounded? Since LLMs cannot perceive and interact with the environment, it is impossible to answer this question directly. Instead, we provide LLMs with textual narratives and probe them with respect to their common-sense knowledge of the structure and duration of events, their ability to order events along a timeline, and self-consistency within their temporal model (e.g., temporal relations such as after and before are mutually exclusive for any pair of events). We evaluate state-of-the-art LLMs (such as LLaMA 2 and GPT-4) on three tasks reflecting these abilities. Generally, we find that LLMs lag significantly behind both human performance and small-scale, specialised LMs. In-context learning, instruction tuning, and chain-of-thought prompting reduce this gap only to a limited degree. Crucially, LLMs struggle the most with self-consistency, displaying incoherent behaviour in at least 27.23% of their predictions. Contrary to expectations, we also find that scaling the model size does not guarantee positive gains in performance. To explain these results, we study the sources from which LLMs may gather temporal information: we find that sentence ordering in unlabelled texts, available during pre-training, is only weakly correlated with event ordering. Moreover, public instruction tuning mixtures contain few temporal tasks. Hence, we conclude that current LLMs lack a consistent temporal model of textual narratives. Code, datasets, and LLM outputs are available at https://github.com/yfqiu-nlp/temporal-llms.
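To make the self-consistency criterion from the abstract concrete, here is a minimal, hypothetical sketch of a pairwise consistency probe: query a model with both orderings of an event pair and flag cases where mutually exclusive relations ("before" and "after") are both asserted. The `llm` callable, `ask_yes_no` helper, and prompt wording are illustrative assumptions, not the repository's actual interface or the paper's implementation.

```python
# Hypothetical sketch of a pairwise self-consistency probe, assuming `llm` is a
# callable that maps a prompt string to a generated answer string. This is NOT
# the interface used in the temporal-llms repository.

def ask_yes_no(llm, question: str) -> bool:
    """Placeholder: treat the model's reply as 'yes' if it starts with 'yes'."""
    answer = llm(question).strip().lower()
    return answer.startswith("yes")

def is_incoherent(llm, narrative: str, event_a: str, event_b: str) -> bool:
    """'before' and 'after' are mutually exclusive; asserting both is a contradiction."""
    before = ask_yes_no(llm, f"{narrative}\nDoes '{event_a}' happen before '{event_b}'?")
    after = ask_yes_no(llm, f"{narrative}\nDoes '{event_a}' happen after '{event_b}'?")
    return before and after

def incoherence_rate(llm, examples) -> float:
    """Fraction of (narrative, event_a, event_b) triples with contradictory answers."""
    flags = [is_incoherent(llm, n, a, b) for (n, a, b) in examples]
    return sum(flags) / len(flags) if flags else 0.0
```

A rate computed this way is the kind of quantity behind the paper's finding that at least 27.23% of predictions are incoherent, though the actual tasks and relation sets evaluated in the paper are richer than this two-question sketch.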
Community shame: Not yet rated

📜 Similar Papers

In the same crypt – Computation & Language

🌅 🌅 Old Age

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, ... (+6 more)

cs.CL ๐Ÿ› NeurIPS ๐Ÿ“š 166.0K cites 8 years ago