Domain-Targeted, High Precision Knowledge Extraction

TACL 2017 · Bhavana Dalvi Mishra, T, Niket on, Peter Clark ·

Our goal is to construct a domain-targeted, high precision knowledge base (KB), containing general (subject,predicate,object) statements about the world, in support of a downstream question-answering (QA) application. Despite recent advances in information extraction (IE) techniques, no suitable resource for our task already exists; existing resources are either too noisy, too named-entity centric, or too incomplete, and typically have not been constructed with a clear scope or purpose. To address these, we have created a domain-targeted, high precision knowledge extraction pipeline, leveraging Open IE, crowdsourcing, and a novel canonical schema learning algorithm (called CASI), that produces high precision knowledge targeted to a particular domain - in our case, elementary science. To measure the KB{'}s coverage of the target domain{'}s knowledge (its {``}comprehensiveness{''} with respect to science) we measure recall with respect to an independent corpus of domain text, and show that our pipeline produces output with over 80{\%} precision and 23{\%} recall with respect to that target, a substantially higher coverage of tuple-expressible science knowledge than other comparable resources. We have made the KB publicly available.

PDF Abstract