Scaling Hierarchical Coreference with Homomorphic Compression

Locality-sensitive hashing schemes such as SimHash provide compact representations of multisets from which similarity can be estimated. However, in certain applications, we need to estimate the similarity of dynamically changing sets. In this case, we need the representation to be a homomorphism so that the hashes of unions and differences of sets can be computed directly from the hashes of the operands. We propose two representations that have this property for cosine similarity (an extension of SimHash and angle-preserving random projections), and make substantial progress on a third representation for Jaccard similarity (an extension of MinHash). We employ these hashes to compress the sufficient statistics of a conditional random field (CRF) coreference model and study how this compression affects our ability to compute similarities as entities are split and merged during inference. We also provide a novel statistical analysis of SimHash to help justify it as an estimator inside a CRF, showing that the bias and variance reduce quickly with the number of bits. On a problem of author coreference, we find that our SimHash scheme allows scaling the hierarchical coreference algorithm by an order of magnitude without degrading its statistical performance or the model's coreference accuracy, as long as we employ at least 128 or 256 bits. Angle-preserving random projections further improve the coreference quality, potentially allowing even fewer dimensions to be used.
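The homomorphic property described above can be illustrated with a minimal sketch. It is not the paper's implementation; it assumes one common way to obtain the property, namely storing the real-valued random projection sums rather than only their sign bits, so that the sketch of a union is the elementwise sum of the operand sketches and the sketch of a difference is the elementwise difference. The names `feature_projection`, `sketch`, `merge`, `split`, and `estimate_cosine`, the width `BITS = 256`, and the toy bags of words are all hypothetical choices made for illustration.

```python
import hashlib
import math
import random
from collections import Counter

BITS = 256  # assumed sketch width; chosen only for this illustration


def feature_projection(feature, bits=BITS):
    """Deterministic pseudo-random Gaussian vector for one feature name."""
    digest = hashlib.sha256(feature.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.gauss(0.0, 1.0) for _ in range(bits)]


def sketch(bag, bits=BITS):
    """Linear sketch of a multiset: sum of count * projection over features."""
    acc = [0.0] * bits
    for feature, count in bag.items():
        proj = feature_projection(feature, bits)
        for i in range(bits):
            acc[i] += count * proj[i]
    return acc


def merge(a, b):
    """Sketch of the multiset union (sum), computed from the sketches alone."""
    return [x + y for x, y in zip(a, b)]


def split(a, b):
    """Sketch of the multiset difference, computed from the sketches alone."""
    return [x - y for x, y in zip(a, b)]


def estimate_cosine(a, b):
    """SimHash-style estimate: the fraction of disagreeing sign bits
    approximates angle / pi, so the cosine is cos(pi * fraction)."""
    disagree = sum((x >= 0) != (y >= 0) for x, y in zip(a, b))
    return math.cos(math.pi * disagree / len(a))


# Toy entities represented as bags of words (made-up data).
entity_a = Counter({"hashing": 3, "coreference": 2, "crf": 1})
entity_b = Counter({"hashing": 1, "inference": 2})

sk_a, sk_b = sketch(entity_a), sketch(entity_b)
sk_merged = merge(sk_a, sk_b)        # entity merge during inference
sk_restored = split(sk_merged, sk_b)  # undo the merge (entity split)
similarity = estimate_cosine(sk_a, sk_b)
```

Because the sketch is linear in the feature counts, merging or splitting entities costs time proportional to the sketch width rather than to the number of features in either entity. The sign bits are taken only when a similarity estimate is needed.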
