Code Search
48 papers with code • 5 benchmarks • 10 datasets
The goal of Code Search is to retrieve code fragments from a large code corpus that most closely match a developer’s intent, which is expressed in natural language.
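The task above can be sketched as nearest-neighbor retrieval: embed the natural-language query and every code fragment into a common vector space, then rank fragments by similarity. The sketch below is a minimal, illustrative version that substitutes a bag-of-words count vector for a learned neural encoder; the tokenization and corpus are invented for the example.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding (a stand-in for a learned neural encoder)."""
    return Counter(text.lower().replace("_", " ").split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query, corpus):
    """Rank code fragments by similarity to the natural-language query."""
    q = embed(query)
    return sorted(corpus, key=lambda code: cosine(q, embed(code)), reverse=True)

# Tiny illustrative corpus of code fragments.
corpus = [
    "def read file path: open path return contents",
    "def sort list items: return sorted items",
    "def parse json string: return json loads string",
]
print(search("sort a list of items", corpus)[0])
```

Real code-search systems replace `embed` with a pretrained code encoder (such as the models listed below), but the retrieval loop is the same shape.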
Latest papers
CodeT5+: Open Code Large Language Models for Code Understanding and Generation
To address the limitations of prior code LLMs, we propose "CodeT5+", a family of encoder-decoder LLMs for code whose component modules can be flexibly combined to suit a wide range of downstream code tasks.
The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation
We present The Vault, a dataset of high-quality code-text pairs in multiple programming languages for training large language models to understand and generate code.
Code Execution with Pre-trained Language Models
Code execution is a fundamental aspect of programming language semantics that reflects the exact behavior of the code.
REINFOREST: Reinforcing Semantic Code Similarity for Cross-Lingual Code Search Models
This paper introduces a novel code-to-code search technique that enhances the performance of Large Language Models (LLMs) by including both static and dynamic features as well as utilizing both similar and dissimilar examples during training.
One Adapter for All Programming Languages? Adapter Tuning for Code Search and Summarization
To alleviate the potential catastrophic-forgetting issue in multilingual models, we freeze all pre-trained model parameters, insert a parameter-efficient adapter structure, and fine-tune only the adapter.
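The freeze-and-adapt idea above is commonly realized as a bottleneck adapter: a small down-projection, a nonlinearity, an up-projection, and a residual connection, inserted into a frozen backbone. The sketch below is a generic illustration of that structure, not the paper's exact implementation; the dimensions and zero-initialization convention are assumptions.

```python
import random

random.seed(0)

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(x):
    return [max(0.0, v) for v in x]

class Adapter:
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual.
    Only these weights are updated during fine-tuning; the backbone stays frozen.
    Dimensions here are illustrative."""
    def __init__(self, d_model, d_bottleneck):
        self.down = [[random.gauss(0, 0.1) for _ in range(d_model)]
                     for _ in range(d_bottleneck)]
        # Zero-initialized up-projection: the adapter starts as the identity map,
        # so inserting it does not disturb the pretrained model's behavior.
        self.up = [[0.0] * d_bottleneck for _ in range(d_model)]

    def __call__(self, h):
        delta = matvec(self.up, relu(matvec(self.down, h)))
        return [hi + di for hi, di in zip(h, delta)]

adapter = Adapter(d_model=4, d_bottleneck=2)
h = [1.0, -2.0, 0.5, 3.0]
print(adapter(h))  # identity at initialization
```

Because only the adapter's parameters receive gradients, per-language adapters can be trained and swapped without touching the shared multilingual backbone.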
Global Contrastive Batch Sampling via Optimization on Sample Permutations
Contrastive Learning has recently achieved state-of-the-art performance in a wide range of tasks.
Exploring Representation-Level Augmentation for Code Search
In this paper, we explore augmentation methods that augment data (both code and query) at the representation level, which requires no additional data processing or training. Based on this, we propose a general format of representation-level augmentation that unifies existing methods.
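One representative representation-level augmentation is linear interpolation: a new "virtual" representation is formed as a convex combination of two real embedding vectors, with no extra data processing or retraining. The sketch below is a generic illustration of that idea (the vectors and mixing coefficient are invented for the example), not the paper's unified formulation.

```python
def interpolate(r1, r2, alpha=0.5):
    """Linear-interpolation augmentation at the representation level:
    returns alpha * r1 + (1 - alpha) * r2, a convex combination of two
    existing representations used as an extra training example."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(r1, r2)]

q = [1.0, 0.0, 2.0]   # illustrative query representation
c = [0.0, 2.0, 0.0]   # illustrative code representation
print(interpolate(q, c, alpha=0.25))  # -> [0.25, 1.5, 0.5]
```

Because the operation acts on embeddings rather than raw code or text, it composes with any encoder and adds negligible cost to a contrastive training loop.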
XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence
To the best of our knowledge, it is the largest parallel dataset for source code both in terms of size and the number of languages.
NS3: Neuro-Symbolic Semantic Code Search
We compare our model - NS3 (Neuro-Symbolic Semantic Search) - to a number of baselines, including state-of-the-art semantic code retrieval methods, and evaluate on two datasets - CodeSearchNet and Code Search and Question Answering.
UniXcoder: Unified Cross-Modal Pre-training for Code Representation
Furthermore, we propose to utilize multi-modal content to learn representations of code fragments with contrastive learning, and then align representations across programming languages using a cross-modal generation task.
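Contrastive objectives like the one above are typically instantiated as an InfoNCE loss: a softmax over similarities between an anchor and a batch of candidates, penalizing the negative log-probability of the true positive. The sketch below is a generic single-anchor version (temperature and similarity values are illustrative), not the exact UniXcoder training code.

```python
import math

def info_nce(sim_row, pos_index, temperature=0.05):
    """InfoNCE loss for one anchor: softmax over similarities to all
    candidates, then negative log-probability of the positive candidate."""
    logits = [s / temperature for s in sim_row]
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[pos_index] / sum(exps))

# The loss is low when the positive outscores the negatives, high otherwise.
easy = info_nce([0.9, 0.1, 0.0], pos_index=0)
hard = info_nce([0.3, 0.4, 0.5], pos_index=0)
print(easy, hard)
```

Minimizing this loss pulls matched (query, code) or (code, code) pairs together in embedding space while pushing in-batch negatives apart, which is exactly the property code search exploits at retrieval time.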