Search Results for author: Sören Mindermann

Found 12 papers, 6 papers with code

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

1 code implementation • 26 Sep 2023 • Lorenzo Pacchiardi, Alex J. Chan, Sören Mindermann, Ilan Moscovitz, Alexa Y. Pan, Yarin Gal, Owain Evans, Jan Brauner

Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense.

Misinformation

The Alignment Problem from a Deep Learning Perspective

no code implementations • 30 Aug 2022 • Richard Ngo, Lawrence Chan, Sören Mindermann

In coming years or decades, artificial general intelligence (AGI) may surpass human capabilities at many critical tasks.

Quantifying Ignorance in Individual-Level Causal-Effect Estimates under Hidden Confounding

1 code implementation • 8 Mar 2021 • Andrew Jesson, Sören Mindermann, Yarin Gal, Uri Shalit

We study the problem of learning conditional average treatment effects (CATE) from high-dimensional, observational data with unobserved confounders.

Identifying Causal-Effect Inference Failure with Uncertainty-Aware Models

1 code implementation • NeurIPS 2020 • Andrew Jesson, Sören Mindermann, Uri Shalit, Yarin Gal

We show that our methods enable us to deal gracefully with situations of "no-overlap", common in high-dimensional data, where standard applications of causal effect approaches fail.

Active Inverse Reward Design

1 code implementation • 9 Sep 2018 • Sören Mindermann, Rohin Shah, Adam Gleave, Dylan Hadfield-Menell

We propose structuring this process as a series of queries asking the user to compare between different reward functions.

Informativeness
