We show that Transformer encoder architectures can be massively sped up, with limited accuracy costs, by replacing the self-attention sublayers with simple linear transformations that "mix" input tokens.
Ranked #2 on Linguistic Acceptability on CoLA
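A minimal sketch of the token-mixing idea described above: the attention sublayer is replaced by a fixed linear map applied across the sequence dimension. The DFT-based mixer below is one concrete choice (FNet uses Fourier transforms); the function and shapes here are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def token_mixing_sublayer(x, mixing_matrix):
    """Replace self-attention with a simple linear 'mixing' of tokens.

    x: (seq_len, d_model) token embeddings.
    mixing_matrix: (seq_len, seq_len) linear map applied across the
    token (sequence) dimension in place of attention weights.
    """
    return mixing_matrix @ x

# Hypothetical toy setup: 4 tokens, 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))

# One simple linear mixer: the real part of the DFT matrix, i.e. a
# Fourier transform along the sequence dimension.
seq_len = x.shape[0]
dft = np.fft.fft(np.eye(seq_len)).real
mixed = token_mixing_sublayer(x, dft)
print(mixed.shape)  # (4, 8)
```

Because the mixer is a fixed matrix multiply rather than a learned, input-dependent attention map, it costs no parameters and runs in a single matmul (or an FFT), which is where the speedup comes from.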
Instead of directly predicting the latent code of a given real image in a single pass, the encoder is tasked with predicting a residual with respect to the current estimate of the inverted latent code, in a self-correcting manner.
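The self-correcting loop above can be sketched as follows. The linear "generator" and the damped least-squares "encoder" are toy stand-ins chosen so the example is runnable; they are assumptions for illustration, not the actual networks.

```python
import numpy as np

def invert_iteratively(encoder, generator, real_image, w_init, n_steps=20):
    """Self-correcting inversion: at each step the encoder predicts a
    residual update to the current latent estimate, rather than the
    full latent code in a single forward pass."""
    w = w_init
    for _ in range(n_steps):
        recon = generator(w)                   # decode current estimate
        w = w + encoder(real_image, recon, w)  # add predicted residual
    return w

# Hypothetical toy stand-ins: a linear generator and an encoder that
# predicts a damped least-squares step toward the target image.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
w_true = rng.standard_normal(3)
real_image = A @ w_true

generator = lambda w: A @ w
encoder = lambda real, recon, w: 0.5 * np.linalg.lstsq(A, real - recon, rcond=None)[0]

w_final = invert_iteratively(encoder, generator, real_image, np.zeros(3))
print(np.linalg.norm(w_final - w_true))  # error shrinks toward 0
```

Each iteration halves the latent error in this toy setup, illustrating why a few residual steps can outperform a single-pass prediction.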
We present a novel method for local image feature matching.
This work introduces Itihasa, a large-scale translation dataset containing 93,000 pairs of Sanskrit shlokas and their English translations.
As region-based visual features usually represent only parts of an image, it is challenging for existing vision-language models to fully understand the semantics of the paired natural language descriptions.
The rapid progress in 3D scene understanding has come with a growing demand for data; however, collecting and annotating 3D scenes (e.g., point clouds) is notoriously hard.
The proposed GAN prior embedded network (GPEN) is easy to implement, and it can generate visually photo-realistic results.
Ranked #1 on Blind Face Restoration on CelebA-HQ
Segmenting highly overlapping objects is challenging because typically no distinction is made between real object contours and occlusion boundaries.
Ranked #1 on Instance Segmentation on KINS
However, the application of sensitive entity detection for production systems in financial institutions has not been well explored due to the lack of publicly available, labeled datasets.