MGS3: A Multi-Granularity Self-Supervised Code Search Framework
May 30, 2025 Β· Declared Dead Β· π Knowledge Discovery and Data Mining
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Rui Li, Junfeng Kang, Qi Liu, Liyang He, Zheng Zhang, Yunhao Sha, Linbo Zhu, Zhenya Huang
arXiv ID
2505.24274
Category
cs.SE: Software Engineering
Cross-listed
cs.IR
Citations
2
Venue
Knowledge Discovery and Data Mining
Last Checked
4 months ago
Abstract
In the pursuit of enhancing software reusability and developer productivity, code search has emerged as a key area, aimed at retrieving code snippets relevant to functionalities based on natural language queries. Despite significant progress in self-supervised code pre-training utilizing the vast amount of code data in repositories, existing methods have primarily focused on leveraging contrastive learning to align natural language with function-level code snippets. These studies have overlooked the abundance of fine-grained (such as block-level and statement-level) code snippets prevalent within the function-level code snippets, which results in suboptimal performance across all levels of granularity. To address this problem, we first construct a multi-granularity code search dataset called MGCodeSearchNet, which contains 536K+ pairs of natural language and code snippets. Subsequently, we introduce a novel Multi-Granularity Self-Supervised contrastive learning code Search framework (MGS$^{3}$}). First, MGS$^{3}$ features a Hierarchical Multi-Granularity Representation module (HMGR), which leverages syntactic structural relationships for hierarchical representation and aggregates fine-grained information into coarser-grained representations. Then, during the contrastive learning phase, we endeavor to construct positive samples of the same granularity for fine-grained code, and introduce in-function negative samples for fine-grained code. Finally, we conduct extensive experiments on code search benchmarks across various granularities, demonstrating that the framework exhibits outstanding performance in code search tasks of multiple granularities. These experiments also showcase its model-agnostic nature and compatibility with existing pre-trained code representation models.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
π Similar Papers
In the same crypt β Software Engineering
R.I.P.
π»
Ghosted
R.I.P.
π»
Ghosted
Microservices: yesterday, today, and tomorrow
π
π
The Cartographer
A Survey of Machine Learning for Big Code and Naturalness
R.I.P.
π»
Ghosted
An Overview on Smart Contracts: Challenges, Advances and Platforms
R.I.P.
π»
Ghosted
Slither: A Static Analysis Framework For Smart Contracts
R.I.P.
π»
Ghosted
ContractFuzzer: Fuzzing Smart Contracts for Vulnerability Detection
Died the same way β π» Ghosted
R.I.P.
π»
Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P.
π»
Ghosted
In-Datacenter Performance Analysis of a Tensor Processing Unit
R.I.P.
π»
Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P.
π»
Ghosted