Search Results for author: Daniel Paleka

Found 6 papers, 2 papers with code

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

1 code implementation • 15 Apr 2024 • Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, Benjamin L. Edelman, Zhaowei Zhang, Mario Günther, Anton Korinek, Jose Hernandez-Orallo, Lewis Hammond, Eric Bigelow, Alexander Pan, Lauro Langosco, Tomasz Korbak, Heidi Zhang, Ruiqi Zhong, Seán Ó hÉigeartaigh, Gabriel Recchia, Giulio Corsi, Alan Chan, Markus Anderljung, Lilian Edwards, Yoshua Bengio, Danqi Chen, Samuel Albanie, Tegan Maharaj, Jakob Foerster, Florian Tramèr, He He, Atoosa Kasirzadeh, Yejin Choi, David Krueger

This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs).

ARB: Advanced Reasoning Benchmark for Large Language Models

no code implementations • 25 Jul 2023 • Tomohiro Sawada, Daniel Paleka, Alexander Havrilla, Pranav Tadepalli, Paula Vidas, Alexander Kranias, John J. Nay, Kshitij Gupta, Aran Komatsuzaki

As a subset of ARB, we introduce a challenging set of math and physics problems which require advanced symbolic reasoning and domain knowledge.

Math

Evaluating Superhuman Models with Consistency Checks

2 code implementations • 16 Jun 2023 • Lukas Fluri, Daniel Paleka, Florian Tramèr

If machine learning models were to achieve superhuman abilities at various reasoning or decision-making tasks, how would we go about evaluating such models, given that humans would necessarily be poor proxies for ground truth?

Decision Making

Poisoning Web-Scale Training Datasets is Practical

no code implementations • 20 Feb 2023 • Nicholas Carlini, Matthew Jagielski, Christopher A. Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, Florian Tramèr

Deep learning models are often trained on distributed, web-scale datasets crawled from the internet.

Data Poisoning

Red-Teaming the Stable Diffusion Safety Filter

no code implementations • 3 Oct 2022 • Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, Florian Tramèr

We then reverse-engineer the filter and find that while it aims to prevent sexual content, it ignores violence, gore, and other similarly disturbing content.

Image Generation

A law of adversarial risk, interpolation, and label noise

no code implementations • 8 Jul 2022 • Daniel Paleka, Amartya Sanyal

In supervised learning, it has been shown that label noise in the data can be interpolated without penalties on test accuracy.

Inductive Bias
