CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?

August 20, 2024 · Declared Dead · 🏛️ International Conference on Computational Linguistics

💀 CAUSE OF DEATH: 404 Not Found
Code link is broken/dead
Authors: Yuwei Zhao, Ziyang Luo, Yuchen Tian, Hongzhan Lin, Weixiang Yan, Annan Li, Jing Ma
arXiv ID: 2408.10718
Category: cs.SE (Software Engineering)
Cross-listed: cs.CL
Citations: 26
Venue: International Conference on Computational Linguistics
Repository: https://github.com/CodeLLM-Research/CodeJudge-Eval
Last Checked: 1 month ago
Abstract
Recent advancements in large language models (LLMs) have showcased impressive code generation capabilities, primarily evaluated through language-to-code benchmarks. However, these benchmarks may not fully capture a model's code understanding abilities. We introduce CodeJudge-Eval (CJ-Eval), a novel benchmark designed to assess LLMs' code understanding abilities from the perspective of code judging rather than code generation. CJ-Eval challenges models to determine the correctness of provided code solutions, encompassing various error types and compilation issues. By leveraging a diverse set of problems and a fine-grained judging system, CJ-Eval addresses the limitations of traditional benchmarks, including the potential memorization of solutions. Evaluation of 12 well-known LLMs on CJ-Eval reveals that even state-of-the-art models struggle, highlighting the benchmark's ability to probe deeper into models' code understanding abilities. Our codes and benchmark are available at https://github.com/CodeLLM-Research/CodeJudge-Eval.
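For a concrete picture of the task format the abstract describes, here is a minimal, hypothetical sketch of what a judging-style evaluation loop could look like. The verdict labels, prompt wording, and `query_llm` helper are illustrative assumptions, not the paper's released harness.

```python
# Hypothetical sketch of a code-judging evaluation in the spirit of CJ-Eval:
# the model classifies a submission's verdict instead of generating code.
# Verdict labels, prompt wording, and query_llm() are assumptions for illustration.

VERDICTS = [
    "Accepted",
    "Wrong Answer",
    "Runtime Error",
    "Compilation Error",
    "Time Limit Exceeded",
]

def build_prompt(problem: str, submission: str) -> str:
    """Frame the task as code judging: read a problem and a candidate solution,
    then answer with exactly one verdict label."""
    return (
        "You are an online judge. Read the problem and the submitted solution, "
        f"then reply with exactly one verdict from {VERDICTS}.\n\n"
        f"Problem:\n{problem}\n\nSubmission:\n{submission}\n\nVerdict:"
    )

def parse_verdict(reply: str):
    """Map a free-form model reply onto one of the known verdict labels."""
    reply = reply.lower()
    return next((v for v in VERDICTS if v.lower() in reply), None)

def judge_accuracy(samples, query_llm) -> float:
    """samples: iterable of dicts with 'problem', 'submission', and a ground-truth
    'verdict'; query_llm: a callable that sends a prompt to the model under test
    and returns its text reply."""
    correct = total = 0
    for ex in samples:
        reply = query_llm(build_prompt(ex["problem"], ex["submission"]))
        correct += int(parse_verdict(reply) == ex["verdict"])
        total += 1
    return correct / max(total, 1)
```

Because the model only has to classify an existing submission rather than reproduce a reference solution, memorized solutions from pretraining are of limited help, which is the memorization limitation the abstract highlights.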
Community shame: Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

📜 Similar Papers

In the same crypt – Software Engineering

Died the same way – 💀 404 Not Found