CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?

August 20, 2024 · Declared Dead · 🏛️ International Conference on Computational Linguistics

💀 CAUSE OF DEATH: 404 Not Found
Code link is broken/dead
Authors: Yuwei Zhao, Ziyang Luo, Yuchen Tian, Hongzhan Lin, Weixiang Yan, Annan Li, Jing Ma
arXiv ID: 2408.10718
Category: cs.SE (Software Engineering)
Cross-listed: cs.CL
Citations: 26
Venue: International Conference on Computational Linguistics
Repository: https://github.com/CodeLLM-Research/CodeJudge-Eval
Last Checked: 1 month ago
Abstract
Recent advancements in large language models (LLMs) have showcased impressive code generation capabilities, primarily evaluated through language-to-code benchmarks. However, these benchmarks may not fully capture a model's code understanding abilities. We introduce CodeJudge-Eval (CJ-Eval), a novel benchmark designed to assess LLMs' code understanding abilities from the perspective of code judging rather than code generation. CJ-Eval challenges models to determine the correctness of provided code solutions, encompassing various error types and compilation issues. By leveraging a diverse set of problems and a fine-grained judging system, CJ-Eval addresses the limitations of traditional benchmarks, including the potential memorization of solutions. Evaluation of 12 well-known LLMs on CJ-Eval reveals that even state-of-the-art models struggle, highlighting the benchmark's ability to probe deeper into models' code understanding abilities. Our codes and benchmark are available at https://github.com/CodeLLM-Research/CodeJudge-Eval.
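For a concrete picture of the task format the abstract describes, here is a minimal, hypothetical sketch of what a judging-style evaluation loop could look like. The verdict labels, prompt wording, and `query_llm` helper are illustrative assumptions, not the paper's released harness.

```python
# Hypothetical sketch of a code-judging evaluation in the spirit of CJ-Eval:
# the model classifies a submission's verdict instead of generating code.
# Verdict labels, prompt wording, and query_llm() are assumptions for illustration.

VERDICTS = [
    "Accepted",
    "Wrong Answer",
    "Runtime Error",
    "Compilation Error",
    "Time Limit Exceeded",
]

def build_prompt(problem: str, submission: str) -> str:
    """Frame the task as code judging: read a problem and a candidate solution,
    then answer with exactly one verdict label."""
    return (
        "You are an online judge. Read the problem and the submitted solution, "
        f"then reply with exactly one verdict from {VERDICTS}.\n\n"
        f"Problem:\n{problem}\n\nSubmission:\n{submission}\n\nVerdict:"
    )

def parse_verdict(reply: str):
    """Map a free-form model reply onto one of the known verdict labels."""
    reply = reply.lower()
    return next((v for v in VERDICTS if v.lower() in reply), None)

def judge_accuracy(samples, query_llm) -> float:
    """samples: iterable of dicts with 'problem', 'submission', and a ground-truth
    'verdict'; query_llm: a callable that sends a prompt to the model under test
    and returns its text reply."""
    correct = total = 0
    for ex in samples:
        reply = query_llm(build_prompt(ex["problem"], ex["submission"]))
        correct += int(parse_verdict(reply) == ex["verdict"])
        total += 1
    return correct / max(total, 1)
```

Because the model only has to classify an existing submission rather than reproduce a reference solution, memorized solutions from pretraining are of limited help, which is the memorization limitation the abstract highlights.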
Community shame: Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

📜 Similar Papers

In the same crypt – Software Engineering

Died the same way – 💀 404 Not Found