Grey-box Extraction of Natural Language Models

1 Jan 2021 · Santiago Zanella-Beguelin, Shruti Tople, Andrew Paverd, Boris Köpf

Model extraction attacks attempt to replicate a target machine learning model from the predictions obtained by querying its inference API. Most existing attacks on Deep Neural Networks achieve this by supervised training of the copy on the victim's predictions. An emerging class of attacks exploits algebraic properties of DNNs to obtain high-fidelity copies using orders of magnitude fewer queries than the prior state of the art. So far, such powerful attacks have been limited to networks with few hidden layers and ReLU activations. In this paper, we present algebraic attacks on large-scale natural language models in a grey-box setting, targeting models with a pre-trained (public) encoder followed by a single (private) classification layer. Our key observation is that a small set of arbitrary embedding vectors, which a grey-box adversary can compute, is likely to form a basis of the classification layer's input space. We show how to use this information to solve an equation system that determines the classification layer from the corresponding probability outputs. We evaluate the effectiveness of our attacks on different sizes of transformer models and downstream tasks. Our key findings are that (i) with frozen base layers, high-fidelity extraction is possible with a number of queries as small as twice the input dimension of the last layer, even for queries that are entirely in-distribution, making extraction attacks indistinguishable from legitimate use; and (ii) with fine-tuned base layers, the effectiveness of algebraic attacks decreases with the learning rate, showing that fine-tuning is not only beneficial for accuracy but also indispensable for model confidentiality.
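To make the key observation concrete, the sketch below illustrates the core algebraic step: recovering a private softmax classification layer from probability outputs by solving a linear system. It is a minimal, self-contained simulation, not the paper's implementation: random vectors stand in for the embeddings a grey-box adversary would compute with the public encoder, `victim_probs` stands in for the victim's inference API, and the dimensions are arbitrary. Because softmax outputs determine logits only up to a per-query additive constant, the parameters are recovered relative to a reference class, which yields a functionally identical classifier.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 4          # embedding dimension, number of classes

# Hypothetical victim: a single softmax classification layer on top
# of a frozen (public) encoder. W and b are the private parameters.
W = rng.normal(size=(k, d))
b = rng.normal(size=k)

def victim_probs(E):
    """Softmax probabilities for a batch of embeddings E of shape (n, d)."""
    z = E @ W.T + b
    z -= z.max(axis=1, keepdims=True)        # numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

# Grey-box attacker: it knows the encoder, so it can compute the
# embeddings of its own queries. Here random vectors stand in for
# encoder outputs; n = d + 1 embeddings in general position suffice
# for this linear step.
n = d + 1
E = rng.normal(size=(n, d))
P = victim_probs(E)

# Softmax fixes logits only up to a per-query constant, so work with
# log-probability differences relative to class 0:
#   log p_j - log p_0 = (w_j - w_0) . e + (b_j - b_0)
L = np.log(P) - np.log(P[:, [0]])            # (n, k); column 0 is zero
A = np.hstack([E, np.ones((n, 1))])          # unknowns: w_j - w_0, b_j - b_0
X, *_ = np.linalg.lstsq(A, L, rcond=None)    # solve the equation system

W_hat = X[:d].T                              # row j: w_j - w_0
b_hat = X[d]                                 # entry j: b_j - b_0

# The shifted parameters define the same classifier: check fidelity
# on fresh embeddings.
E_test = rng.normal(size=(1000, d))
agree = np.mean((E_test @ W.T + b).argmax(1) ==
                (E_test @ W_hat.T + b_hat).argmax(1))
print(f"prediction agreement: {agree:.3f}")  # expect 1.000
```

Under these simplifying assumptions (full probability vectors, generic embeddings), d + 1 queries already determine the layer up to the softmax-invariant shift, and the extracted copy agrees with the victim on every input; the paper's query bound of twice the input dimension covers the full grey-box setting it studies.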
