Search Results for author: Fazl Barez

Found 18 papers, 10 papers with code

Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions

no code implementations • 23 Feb 2024 • Clement Neo, Shay B. Cohen, Fazl Barez

In this paper, we investigate the interplay between attention heads and specialized "next-token" neurons in the multilayer perceptron (MLP) layers that predict specific tokens.
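As a rough illustration of what such a "next-token" neuron looks like mechanically, the sketch below (our assumption, not the paper's method) scores MLP neurons by how sharply their output direction boosts a single vocabulary token; the shapes are GPT-2-small-like and the weight tensors are stand-ins.

```python
import torch

# Flag candidate "next-token" neurons: project each MLP neuron's output
# direction onto the unembedding and look for neurons whose effect on the
# logits is dominated by one token. W_out and W_U are stand-in tensors.
d_mlp, d_model, d_vocab = 3072, 768, 50257
W_out = torch.randn(d_mlp, d_model)   # MLP output weights for one layer
W_U = torch.randn(d_model, d_vocab)   # unembedding matrix

logit_effect = W_out @ W_U                    # (d_mlp, d_vocab)
top2 = logit_effect.topk(2, dim=-1).values    # top two logit boosts per neuron
margin = top2[:, 0] - top2[:, 1]              # large margin => single-token neuron
candidates = margin.topk(10).indices          # most "next-token"-like neurons
```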

Increasing Trust in Language Models through the Reuse of Verified Circuits

1 code implementation • 4 Feb 2024 • Philip Quirke, Clement Neo, Fazl Barez

To demonstrate the reusability of verified modules, we insert the trained integer addition model into an untrained model and train the combined model to perform both addition and subtraction.
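A minimal sketch of that insertion step, assuming both models expose their transformer layers as a ModuleList named `blocks` (an illustrative name, not the paper's code): copy the verified model's blocks into the untrained model, optionally freeze them, then train the combined model on both tasks.

```python
import torch.nn as nn

def insert_verified_blocks(trained: nn.Module, untrained: nn.Module,
                           n_blocks: int, freeze: bool = True) -> None:
    """Copy the first n_blocks of a trained model into an untrained one."""
    for i in range(n_blocks):
        untrained.blocks[i].load_state_dict(trained.blocks[i].state_dict())
        if freeze:
            # Keep the verified circuit fixed while the rest of the model trains.
            for p in untrained.blocks[i].parameters():
                p.requires_grad = False
```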

Large Language Models Relearn Removed Concepts

1 code implementation • 3 Jan 2024 • Michelle Lo, Shay B. Cohen, Fazl Barez

This demonstrates that models exhibit polysemantic capacities and can blend old and new concepts in individual neurons.

Model Editing
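A rough sketch, under assumptions, of the pruning side of such an experiment: silence the neurons most associated with a concept by zeroing their output weights, then fine-tune and re-probe to see where the concept re-emerges. `concept_neurons` would come from a saliency or probing method not shown here.

```python
import torch

def prune_concept_neurons(w_out: torch.Tensor, concept_neurons: list[int]) -> None:
    """w_out: (d_mlp, d_model) MLP output weight matrix of one layer."""
    with torch.no_grad():
        w_out[concept_neurons, :] = 0.0  # each pruned neuron now writes nothing
```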

Measuring Value Alignment

no code implementations • 23 Dec 2023 • Fazl Barez, Philip Torr

As artificial intelligence (AI) systems become increasingly integrated into various domains, ensuring that they align with human values becomes critical.

Autonomous Vehicles • Recommendation Systems

Interpreting Shared Circuits for Ordered Sequence Prediction in a Large Language Model

no code implementations • 7 Nov 2023 • Michael Lan, Fazl Barez

While transformer models exhibit strong capabilities on linguistic tasks, their complex architectures make them difficult to interpret.

Language Modelling • Large Language Model

Understanding Addition in Transformers

3 code implementations • 19 Oct 2023 • Philip Quirke, Fazl Barez

Understanding the inner workings of machine learning models like Transformers is vital for their safe and ethical use.

Beyond Training Objectives: Interpreting Reward Model Divergence in Large Language Models

no code implementations • 12 Oct 2023 • Luke Marks, Amir Abdullah, Clement Neo, Rauno Arike, Philip Torr, Fazl Barez

Large language models (LLMs) fine-tuned by reinforcement learning from human feedback (RLHF) are becoming more widely deployed.

AI Systems of Concern

no code implementations • 9 Oct 2023 • Kayla Matteucci, Shahar Avin, Fazl Barez, Seán Ó hÉigeartaigh

Concerns around future dangers from advanced AI often centre on systems hypothesised to have intrinsic characteristics such as agent-like behaviour, strategic awareness, and long-range planning.

DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models

1 code implementation • 3 Oct 2023 • Albert Garde, Esben Kran, Fazl Barez

By granting access to state-of-the-art interpretability methods, DeepDecipher makes LLMs more transparent, trustworthy, and safe.
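A minimal sketch of the kind of neuron-activation lookup such a tool serves, using a Hugging Face GPT-2 and a forward hook; this illustrates the concept and is not DeepDecipher's own API.

```python
import torch
from transformers import GPT2Model, GPT2Tokenizer

model = GPT2Model.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")

layer, neuron = 5, 123
acts = {}
# c_fc is the MLP's input projection; its output holds per-neuron pre-activations.
hook = model.h[layer].mlp.c_fc.register_forward_hook(
    lambda m, i, o: acts.update(vals=o[0, :, neuron].detach())
)
with torch.no_grad():
    model(**tok("The quick brown fox", return_tensors="pt"))
hook.remove()
print(acts["vals"])  # this neuron's activation at each token position
```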

Neuron to Graph: Interpreting Language Model Neurons at Scale

1 code implementation • 31 May 2023 • Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Shay Cohen, Fazl Barez

Conventional methods require examination of examples with strong neuron activation and manual identification of patterns to decipher the concepts a neuron responds to.

Language Modelling
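A sketch of the "conventional method" the snippet contrasts with: rank a corpus by how strongly it activates one neuron and inspect the top examples by hand. `max_activation` is a hypothetical helper (e.g. built on a forward hook as above) returning the neuron's peak activation over a text's tokens.

```python
import heapq

def top_activating_examples(texts, layer, neuron, max_activation, k=10):
    """Return the k (activation, text) pairs with the strongest activations."""
    return heapq.nlargest(k, ((max_activation(t, layer, neuron), t) for t in texts))
```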

Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark

1 code implementation • 27 May 2023 • Jason Hoelscher-Obermaier, Julia Persson, Esben Kran, Ioannis Konstas, Fazl Barez

We use this improved benchmark to evaluate recent model editing techniques and find that they suffer from low specificity.

Model Editing • Specificity
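A hedged sketch of what a specificity metric measures (not the paper's benchmark code): an edit is specific if the model's answers on control prompts unrelated to the edited fact stay the same before and after the edit.

```python
def specificity(before: dict[str, str], after: dict[str, str]) -> float:
    """before/after map unrelated control prompts to the model's answers.
    Returns the fraction of control answers left unchanged by the edit."""
    return sum(before[p] == after[p] for p in before) / len(before)
```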

The Larger They Are, the Harder They Fail: Language Models do not Recognize Identifier Swaps in Python

1 code implementation • 24 May 2023 • Antonio Valerio Miceli-Barone, Fazl Barez, Ioannis Konstas, Shay B. Cohen

Large Language Models (LLMs) have successfully been applied to code generation tasks, raising the question of how well these models understand programming.

Code Generation
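A sketch of the perturbation the paper studies: swapping two identifiers throughout a snippet keeps it syntactically valid Python while inverting its meaning, so a model that truly understands the code should change its answer accordingly.

```python
import re

def swap_identifiers(code: str, a: str, b: str) -> str:
    """Exchange every whole-word occurrence of identifiers a and b."""
    tmp = "__SWAP_TMP__"
    code = re.sub(rf"\b{re.escape(a)}\b", tmp, code)
    code = re.sub(rf"\b{re.escape(b)}\b", a, code)
    return code.replace(tmp, b)

print(swap_identifiers("def head(xs, n): return xs[:n]", "xs", "n"))
# -> def head(n, xs): return n[:xs]
```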

System III: Learning with Domain Knowledge for Safety Constraints

no code implementations • 23 Apr 2023 • Fazl Barez, Hosein Hasanbeig, Alessandro Abate

We evaluate the satisfaction of these constraints via p-norms in state vector space.

Safe Exploration
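A minimal sketch of the evaluation idea stated above, assuming the constraint can be phrased as "stay within radius r of a safe centre" (an illustrative representation, not the paper's): the violation is how far the state's p-norm distance exceeds that radius.

```python
import numpy as np

def constraint_violation(state, centre, radius, p=2):
    """p-norm distance of the state from the safe region; 0 if satisfied."""
    dist = np.linalg.norm(np.asarray(state) - np.asarray(centre), ord=p)
    return max(0.0, dist - radius)

constraint_violation([0.9, 0.2], centre=[0.0, 0.0], radius=1.0, p=2)  # ~0.0
```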

N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models

no code implementations • 22 Apr 2023 • Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Fazl Barez

Understanding the function of individual neurons within language models is essential for mechanistic interpretability research.

Fairness in AI and Its Long-Term Implications on Society

no code implementations • 16 Apr 2023 • Ondrej Bohdal, Timothy Hospedales, Philip H. S. Torr, Fazl Barez

Successful deployment of artificial intelligence (AI) in various settings has led to numerous positive outcomes for individuals and society.

Decision Making • Fairness

Exploring the Advantages of Transformers for High-Frequency Trading

1 code implementation • 20 Feb 2023 • Fazl Barez, Paul Bilokon, Arthur Gervais, Nikita Lisitsyn

This paper explores novel deep learning Transformer architectures for high-frequency Bitcoin-USDT log-return forecasting and compares them to traditional Long Short-Term Memory models.

Position • Time Series • +2
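A sketch of the forecasting setup described above: one-step log-returns r_t = ln(p_t / p_{t-1}) from a BTC-USDT price series, windowed into (lookback, target) pairs that either a Transformer or an LSTM could consume. The lookback length is an illustrative choice, not the paper's.

```python
import numpy as np

def log_returns(prices: np.ndarray) -> np.ndarray:
    """r_t = ln(p_t / p_{t-1}) for a 1-D price series."""
    return np.diff(np.log(prices))

def make_windows(r: np.ndarray, lookback: int = 64):
    """Predict the next log-return from the previous `lookback` returns."""
    X = np.stack([r[i:i + lookback] for i in range(len(r) - lookback)])
    y = r[lookback:]
    return X, y
```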

PMIC: Improving Multi-Agent Reinforcement Learning with Progressive Mutual Information Collaboration

1 code implementation • 16 Mar 2022 • Pengyi Li, Hongyao Tang, Tianpei Yang, Xiaotian Hao, Tong Sang, Yan Zheng, Jianye Hao, Matthew E. Taylor, Wenyuan Tao, Zhen Wang, Fazl Barez

However, we reveal that sub-optimal collaborative behaviors also emerge with strong correlations, and that simply maximizing the MI can, surprisingly, hinder learning towards better collaboration.

Multi-agent Reinforcement Learning • reinforcement-learning • +1
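A heavily hedged sketch of the sign structure the snippet implies, not PMIC's actual objective: raise mutual information (MI) between agents' behaviours on superior (high-return) trajectories and lower it on inferior ones. `mi_lower_bound` stands in for a learned neural MI estimator.

```python
def pmic_style_loss(superior_batch, inferior_batch, mi_lower_bound):
    """Encourage MI on superior trajectories, discourage it on inferior ones."""
    return -mi_lower_bound(superior_batch) + mi_lower_bound(inferior_batch)
```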
