[Re] Reproducing Learning to Deceive With Attention-Based Explanations

Scope of Reproducibility
Based on the intuition that attention in neural networks indicates what the model focuses on, attention is now being used as an explanation for a modelʼs predictions (see Galassi, Lippi, and Torroni [1] for a survey). Pruthi et al. [2] challenge the use of attention-based explanations through a series of experiments with classification and sequence-to-sequence (seq2seq) models. They examine the modelʼs use of impermissible tokens, which are user-defined tokens that can introduce bias, e.g. gendered pronouns. Across multiple datasets, the authors show that model accuracy drops when the impermissible tokens are removed, implying that these tokens are used for prediction. Then, by penalising attention paid to the impermissible tokens while keeping them in the input, they train models that retain full accuracy (and hence must still be using the impermissible tokens) yet whose attention maps show little attention being paid to those tokens. As the paperʼs claims have such significant implications for the use of attention-based explanations, we seek to reproduce their results.
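To make the manipulation concrete, the following is a minimal sketch of the kind of attention penalty described above: the task loss is augmented with a term that punishes attention mass placed on impermissible tokens. The log-barrier form and the `penalty_weight` hyperparameter are illustrative assumptions for exposition, not the authorsʼ exact implementation.

```python
import torch

def penalised_loss(task_loss, attention, impermissible_mask, penalty_weight=0.1):
    """Augment a task loss with a penalty on attention paid to impermissible tokens.

    attention:          (batch, seq_len) attention weights summing to 1 per example
    impermissible_mask: (batch, seq_len) boolean mask, True for impermissible tokens
    """
    # Total attention mass assigned to impermissible tokens in each example.
    impermissible_mass = (attention * impermissible_mask.float()).sum(dim=-1)
    # Log-barrier style penalty: grows as the impermissible mass approaches 1,
    # pushing the model to route its visible attention elsewhere.
    penalty = -torch.log(1.0 - impermissible_mass.clamp(max=1.0 - 1e-6))
    return task_loss + penalty_weight * penalty.mean()
```

Because the penalty only constrains the attention weights, not the hidden states, the model can still exploit the impermissible tokens while its attention map suggests otherwise.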

Methodology
Using the authorsʼ code, for classifiers we attempt to reproduce their embedding, BiLSTM, and BERT results across the occupation prediction, gender identification, and SST + Wikipedia datasets. Further, we reimplemented BERT using Hugging Faceʼs Transformers library [3] with restricted self-attention (information cannot flow between permissible and impermissible tokens); a sketch of this restriction follows below. For seq2seq, we used the authorsʼ code to reproduce results on the Bigram Flip, Sequence Copy, Sequence Reverse, and English-German (En-De) machine translation datasets. We also refactored the authorsʼ code toward a more uniform, usable style and ported it to PyTorch Lightning. All experiments were run in approximately 130 GPU hours on a computing cluster with nodes containing Titan RTX GPUs.
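As an illustration of the restriction used in our reimplementation, the sketch below builds a pairwise attention mask that blocks information flow between permissible and impermissible tokens. The function and variable names are our own, and the snippet is a simplified stand-in for the full BERT integration, where the mask would be supplied to the self-attention layers in additive form.

```python
import torch

def restricted_attention_mask(impermissible):
    """Build a (seq_len, seq_len) boolean mask where True means attention is allowed.

    impermissible: (seq_len,) boolean tensor, True for impermissible tokens.
    Tokens may only attend to tokens in the same group, so no information
    flows between the permissible and impermissible sets.
    """
    return impermissible.unsqueeze(0) == impermissible.unsqueeze(1)

# Example: tokens 2 and 3 are impermissible (e.g. gendered pronouns).
impermissible = torch.tensor([False, False, True, True, False])
mask = restricted_attention_mask(impermissible)

# Additive form expected by masked softmax: 0 where allowed, large negative where blocked.
additive_mask = (~mask).float() * -1e9
```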

Results
We reproduced the authorsʼ results across all models and all available datasets, confirming their findings that attention-based explanations can be manipulated and that models can learn to deceive. We also replicated their BERT results using our reimplemented model. Only one result was not as strongly in the reported experimental direction, deviating by more than one standard deviation.

What Was Easy
The authorsʼ methods were largely well described and easy to follow, and we could quickly produce the first results, as their code worked with only minor adjustments. The authors were also extremely responsive and helpful via email.

What Was Difficult
Re-implementing the BERT-based classification model for the replicability study was difficult, as further details on the model architecture, penalty mechanism, and training procedure were needed. Porting the code to PyTorch Lightning was also challenging.

Communication With Original Authors
There was a continuous email exchange with the authors for several weeks during the reproducibility work. They made additional code and datasets available at our request, and provided detailed responses and clarifications to our emailed questions. They encouraged the work, and we wish to thank them for their time and support.
