Compositional Embeddings: Joint Perception and Comparison of Class Label Sets
We explore the idea of compositional set embeddings that can be used to infer not just a single class, but the set of classes associated with the input data (e.g., image, video, audio signal). This can be useful, for example, in multi-object detection in images, or multi-speaker diarization (one-shot learning) in audio. In particular, we devise and implement two novel models consisting of (1) an embedding function f trained jointly with a “composite” function g that computes set union opera- tions between the classes encoded in two embedding vectors; and (2) embedding f trained jointly with a “query” function h that computes whether the classes en- coded in one embedding subsume the classes encoded in another embedding. In contrast to prior work, these models must both perceive the classes associated with the input examples, and also encode the relationships between different class label sets. In experiments conducted on simulated data, OmniGlot, and COCO datasets, the proposed composite embedding models outperform baselines based on traditional embedding approaches.
PDF Abstract