no code implementations • 11 Feb 2024 • Francis Rhys Ward, Matt MacDermott, Francesco Belardinelli, Francesca Toni, Tom Everitt
In addition, we show how our definition relates to past concepts, including actual causality and the notion of instrumental goals, a core idea in the literature on safe AI agents.
no code implementations • NeurIPS 2023 • Francis Rhys Ward, Francesco Belardinelli, Francesca Toni, Tom Everitt
There are a number of existing definitions of deception in the literature on game theory and symbolic AI, but there is no overarching theory of deception for learning agents in games.
no code implementations • 26 Jun 2023 • Ismail Sahbane, Francis Rhys Ward, C Henrik Åslund
How to detect and mitigate deceptive AI systems is an open problem for the field of safe and trustworthy AI.
no code implementations • 28 Sep 2022 • Francis Rhys Ward, Francesco Belardinelli, Francesca Toni
We define a novel neuro-symbolic framework, argumentative reward learning, which combines preference-based argumentation with existing approaches to reinforcement learning from human feedback.