FLIP -- AAV, Designed vs mutant (adeno-associated virus)

Introduced by Dallago et al. in FLIP: Benchmark tasks in fitness landscape inference for proteins

FLIP includes several benchmark datasets that contain a variety of protein sequences, each with a real-valued label indicating its "fitness" (how well the protein performs some particular function). The goal is to predict the fitness of a protein from its sequence alone. Different representations of protein sequences (e.g. learned embeddings from large language models) may prove helpful here.

This sub-dataset (AAV) contains 201,426 training sequences and 82,583 test sequences; the goal is to predict the fitness of mutants of the capsid protein from the adeno-associated virus (AAV). The training set proteins were designed, while the test set proteins are random mutants. The absolute value of the fitness is not important, but its ranking (relative value) is: protein designers would like to be able to pick a sequence with high fitness relative to those in the training set. Performance is therefore usually assessed using Spearman's rank correlation coefficient (Spearman's ρ).
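
To make the evaluation metric concrete, the following is a minimal pure-Python sketch of Spearman's rank correlation: rank both lists (tied values share the mean of their positions), then take the Pearson correlation of the ranks. The function names `average_ranks` and `spearman` and the toy numbers are illustrative assumptions, not part of FLIP or its tooling.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over every item tied with the one at `start`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the two rank lists."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two equal-length inputs with at least 2 items")
    rx, ry = average_ranks(predicted), average_ranks(measured)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Toy values (made up for illustration): the predictions are on a different
# scale from the measured fitness but order the sequences identically.
measured_fitness = [0.1, 2.3, -1.0, 0.7]
predicted_fitness = [10.0, 50.0, -3.0, 20.0]
score = spearman(predicted_fitness, measured_fitness)
```

Because only the ordering enters the computation, a model whose predictions are offset or rescaled relative to the measured fitness is not penalized, which matches the ranking-focused goal described above.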
