TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Factual probe	BEAR-probe	Meta-Llama-3-8B	Accuracy (%)	68.6 ± 2.2	# 1
Factual probe	BEAR-probe	Meta-Llama-3-8B	Size (Billions)	8	# 1
Factual probe	BEAR-probe	Llama-2-13b-hf	Accuracy (%)	66.9 ± 1.0	# 2
Factual probe	BEAR-probe	Llama-2-13b-hf	Size (Billions)	13	# 2

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/bear-a-unified-framework-for-evaluating/factual-probe-on-bear-probe)](https://paperswithcode.com/sota/factual-probe-on-bear-probe?p=bear-a-unified-framework-for-evaluating)`

BEAR: A Unified Framework for Evaluating Relational Knowledge in Causal and Masked Language Models

5 Apr 2024 · Jacek Wiland, Max Ploner, Alan Akbik

Knowledge probing assesses the degree to which a language model (LM) has successfully learned relational knowledge during pre-training. Probing is an inexpensive way to compare LMs of different sizes and training configurations. However, previous approaches rely on the objective function used in pre-training LMs and are thus applicable only to either masked or causal LMs, which makes it impossible to compare different types of LMs. To address this, we propose an approach that uses an LM's inherent ability to estimate the log-likelihood of any given textual statement. We carefully design an evaluation dataset of 7,731 instances (40,916 in a larger variant) from which we produce alternative statements for each relational fact, one of which is correct. We then evaluate whether an LM assigns the highest log-likelihood to the correct statement. Our experimental evaluation of 22 common LMs shows that our proposed framework, BEAR, can effectively probe for knowledge across different LM types. We release the BEAR datasets and an open-source framework that implements the probing approach to the research community to facilitate the evaluation and development of LMs.
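The ranking step described in the abstract can be illustrated with a minimal Python sketch. It is not the API of lm-pub-quiz and not the authors' implementation: the `[X]`/`[Y]` template placeholders, the function names, the two example instances, and the toy unigram scorer are all assumptions made for illustration. With a real LM, `log_likelihood` would be replaced by a function that returns the model's log-likelihood for the full statement.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical instance layout: (template, subject, candidate answers, index of the correct answer).
Instance = Tuple[str, str, Sequence[str], int]


def build_statements(template: str, subject: str, candidates: Sequence[str]) -> List[str]:
    """Create one full statement per candidate answer from a single template."""
    return [template.replace("[X]", subject).replace("[Y]", c) for c in candidates]


def pick_answer(statements: Sequence[str], log_likelihood: Callable[[str], float]) -> int:
    """Return the index of the statement with the highest log-likelihood."""
    scores = [log_likelihood(s) for s in statements]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(instances: Sequence[Instance], log_likelihood: Callable[[str], float]) -> float:
    """Fraction of instances where the correct statement is ranked first."""
    hits = 0
    for template, subject, candidates, correct in instances:
        statements = build_statements(template, subject, candidates)
        hits += pick_answer(statements, log_likelihood) == correct
    return hits / len(instances)


# Toy stand-in for an LM (assumption): a unigram table with a floor for unseen tokens.
TOY_LOGPROBS = {"paris": math.log(0.05), "berlin": math.log(0.02), "rome": math.log(0.01)}
FLOOR = math.log(1e-4)


def toy_log_likelihood(statement: str) -> float:
    """Sum of per-token log-probabilities under the toy unigram table."""
    tokens = statement.lower().rstrip(".").split()
    return sum(TOY_LOGPROBS.get(t, FLOOR) for t in tokens)


# Made-up example instances, not taken from the BEAR dataset.
INSTANCES: List[Instance] = [
    ("The capital of [X] is [Y].", "France", ["Paris", "Berlin", "Rome"], 0),
    ("The capital of [X] is [Y].", "Germany", ["Paris", "Berlin", "Rome"], 1),
]

if __name__ == "__main__":
    print(f"toy accuracy: {accuracy(INSTANCES, toy_log_likelihood):.2f}")
```

Because the scorer is just a function from a statement to a number, the same loop applies to causal and masked LMs alike, which is the property the abstract emphasizes. The toy scorer ignores the subject and therefore always prefers "Paris", so it gets only the first instance right.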


Code


lm-pub-quiz/lm-pub-quiz official

faceonlive/ai-research


Tasks


Factual probe

General Knowledge

Knowledge Probing

Language Modelling

World Knowledge

Datasets

Introduced in the Paper:

BEAR-probe

Used in the Paper:

LAMA

KAMEL

Results from the Paper


Ranked #1 on Factual probe on BEAR-probe


