Benchmarking
1,524 papers with code • 1 benchmark • 5 datasets
Most implemented papers
Habitat: A Platform for Embodied AI Research
We present Habitat, a platform for research in embodied artificial intelligence (AI).
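A minimal interaction loop, adapted from the habitat-lab README; the config path is an assumption and varies between releases:

```python
import habitat

# Load a PointNav task configuration (path assumed; differs by habitat-lab version).
config = habitat.get_config("configs/tasks/pointnav.yaml")
env = habitat.Env(config=config)

observations = env.reset()  # dict of sensor readings (RGB, depth, GPS+compass, ...)
while not env.episode_over:
    # Random actions as a stand-in; a real agent maps observations to actions.
    observations = env.step(env.action_space.sample())
```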
MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
The size of the dataset and the fact that the questions are derived from real user search queries distinguishes MS MARCO from other well-known publicly available datasets for machine reading comprehension and question-answering.
Multitask learning and benchmarking with clinical time series data
Health care is one of the most exciting frontiers in data mining and machine learning.
A large annotated medical image dataset for the development and evaluation of segmentation algorithms
Semantic segmentation of medical images aims to associate each pixel with a label, without human initialization.
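Segmentation quality on such datasets is typically scored per label with the Dice coefficient; a generic NumPy sketch, not the challenge's official evaluation code:

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, target: np.ndarray, label: int) -> float:
    """Dice overlap between predicted and ground-truth masks for one label."""
    pred_mask = pred == label
    target_mask = target == label
    intersection = np.logical_and(pred_mask, target_mask).sum()
    denom = pred_mask.sum() + target_mask.sum()
    return 2.0 * intersection / denom if denom > 0 else 1.0

# Toy 2-D example; real medical volumes are 3-D arrays of per-voxel labels.
pred = np.array([[0, 1], [1, 1]])
gt = np.array([[0, 1], [0, 1]])
print(dice_coefficient(pred, gt, label=1))  # 0.8
```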
COCO: A Platform for Comparing Continuous Optimizers in a Black-Box Setting
We introduce COCO, an open source platform for Comparing Continuous Optimizers in a black-box setting.
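A minimal benchmarking loop with the `cocoex` Python module, sketched from the COCO documentation; suite and observer option strings are assumptions and may differ by version:

```python
import cocoex
import numpy as np

# Noiseless BBOB test suite and an observer that logs all evaluations.
suite = cocoex.Suite("bbob", "", "")
observer = cocoex.Observer("bbob", "result_folder: random_search_demo")

for problem in suite:
    problem.observe_with(observer)  # attach result logging to this problem
    # Stand-in optimizer: pure random search within the problem's box bounds.
    for _ in range(100):
        x = np.random.uniform(problem.lower_bounds, problem.upper_bounds)
        problem(x)  # each function evaluation is recorded by the observer
```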
On Evaluation of Embodied Navigation Agents
Skillful mobile operation in three-dimensional environments is a primary topic of study in Artificial Intelligence.
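Among its recommendations, the paper proposes SPL (Success weighted by Path Length) as a summary metric: SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i), where S_i marks episode success, l_i is the shortest-path distance to the goal, and p_i is the path length the agent actually traveled. A minimal sketch of that formula:

```python
def spl(successes, shortest_dists, path_lengths):
    """Success weighted by Path Length, averaged over N episodes.

    successes:      per-episode binary success indicators (0 or 1)
    shortest_dists: geodesic distance from start to goal per episode
    path_lengths:   length of the path the agent actually took
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_dists, path_lengths):
        total += s * l / max(p, l)
    return total / len(successes)

# One success along a near-optimal path, one failure.
print(spl([1, 0], [5.0, 4.0], [6.0, 9.0]))  # ~0.417
```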
Benchmarking Natural Language Understanding Services for building Conversational Agents
We have recently seen the emergence of several publicly available Natural Language Understanding (NLU) toolkits, which map user utterances to structured, but more abstract, Dialogue Act (DA) or Intent specifications, while making this process accessible to the lay developer.
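Concretely, such a toolkit maps a raw utterance to an intent label plus slot values; an illustrative, hypothetical output structure (field names are not any particular toolkit's schema):

```python
# Hypothetical NLU parse for one utterance.
utterance = "set an alarm for 7 am tomorrow"
parsed = {
    "intent": "set_alarm",
    "slots": {"time": "7 am", "date": "tomorrow"},
    "confidence": 0.93,
}
```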
Torchreid: A Library for Deep Learning Person Re-Identification in Pytorch
Person re-identification (re-ID), which aims to re-identify people across different camera views, has been significantly advanced by deep learning in recent years, particularly with convolutional neural networks (CNNs).
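A condensed training/evaluation pipeline, adapted from the Torchreid README; the dataset root and hyperparameters are placeholders:

```python
import torchreid

# Data manager: downloads/loads Market-1501 and builds train/test loaders.
datamanager = torchreid.data.ImageDataManager(
    root="reid-data",  # placeholder path
    sources="market1501",
    height=256,
    width=128,
    batch_size_train=32,
    batch_size_test=100,
)

# Standard CNN backbone with a softmax (identity classification) head.
model = torchreid.models.build_model(
    name="resnet50",
    num_classes=datamanager.num_train_pids,
    loss="softmax",
    pretrained=True,
)
model = model.cuda()  # assumes a GPU is available

optimizer = torchreid.optim.build_optimizer(model, optim="adam", lr=0.0003)
scheduler = torchreid.optim.build_lr_scheduler(
    optimizer, lr_scheduler="single_step", stepsize=20
)

# Engine ties data, model, and optimization together for train + eval.
engine = torchreid.engine.ImageSoftmaxEngine(
    datamanager, model, optimizer=optimizer, scheduler=scheduler
)
engine.run(save_dir="log/resnet50", max_epoch=60, eval_freq=10, test_only=False)
```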
Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks
Multi-agent deep reinforcement learning (MARL) suffers from a lack of commonly-used evaluation tasks and criteria, making comparisons between approaches difficult.
Data Splits and Metrics for Method Benchmarking on Surgical Action Triplet Datasets
We develop a metrics library, ivtmetrics, for evaluating models on surgical action triplets.
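A sketch of the library's recognition-AP usage, based on a reading of the ivtmetrics README; the class and method names here are assumptions and should be checked against the released package:

```python
import numpy as np
import ivtmetrics

# 100 triplet classes (as in CholecT50); predictions are per-frame probability
# vectors, targets are multi-hot binary vectors of the same shape.
metric = ivtmetrics.Recognition(num_class=100)

targets = np.random.randint(0, 2, size=(8, 100))  # placeholder labels
predictions = np.random.rand(8, 100)              # placeholder scores
metric.update(targets, predictions)               # accumulate one batch

# Component-wise mean AP: "ivt" scores full <instrument, verb, target>
# triplets; "i", "v", "t" score the individual components.
results = metric.compute_global_AP("ivt")
print(results["mAP"])
```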