TAPE: Assessing Few-shot Russian Language Understanding
Recent advances in zero-shot and few-shot learning have shown promise for a range of research and practical purposes. However, this fast-growing area lacks standardized evaluation suites for non-English languages, hindering progress outside the Anglo-centric paradigm. To address this gap, we propose TAPE (Text Attack and Perturbation Evaluation), a novel benchmark of six complex NLU tasks for Russian, covering multi-hop reasoning, ethical concepts, logic, and commonsense knowledge. TAPE's design focuses on systematic zero-shot and few-shot NLU evaluation through (i) linguistically oriented adversarial attacks and perturbations for analyzing robustness, and (ii) subpopulations for nuanced interpretation. A detailed analysis of the autoregressive baselines indicates that simple spelling-based perturbations degrade performance the most, while paraphrasing the input has a negligible effect. At the same time, the results demonstrate a significant gap between the neural and human baselines on most tasks. We publicly release TAPE (tape-benchmark.com) to foster research on robust LMs that can generalize to new tasks when little to no supervision is available.
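A spelling-based perturbation of the kind the analysis refers to can be sketched as a simple adjacent-character swap. This is a minimal illustration, not TAPE's actual attack implementation; the function name `butter_fingers`, the swap probability, and the seed are all illustrative assumptions.

```python
import random

def butter_fingers(text: str, swap_prob: float = 0.1, seed: int = 42) -> str:
    """Randomly swap adjacent letters to mimic typos (illustrative sketch,
    not TAPE's actual perturbation code)."""
    rng = random.Random(seed)  # fixed seed so the perturbation is reproducible
    chars = list(text)
    for i in range(len(chars) - 1):
        # only perturb inside words, leaving spaces and punctuation intact
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < swap_prob:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)
```

Running the perturbed inputs through a model and comparing scores against the clean inputs yields the kind of robustness gap the abstract describes.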
Datasets

Introduced in the Paper:
MultiQ, Ethics (per ethics), RuWorldTree, RuOpenBookQA, CheGeKa, Winograd Automatic

Used in the Paper:
GLUE, SuperGLUE, OpenBookQA, WSC, ETHICS, MuSeRC

Results

Task | Dataset | Model | Metric Name | Metric Value | Global Rank
---|---|---|---|---|---
Question Answering | CheGeKa | Human benchmark | Accuracy | 64.5 | # 1
Question Answering | CheGeKa | RuGPT-3 Small | Accuracy | 0.0 | # 2
Question Answering | CheGeKa | RuGPT-3 Medium | Accuracy | 0.0 | # 2
Question Answering | CheGeKa | RuGPT-3 Large | Accuracy | 0.0 | # 2
Ethics | Ethics | RuGPT-3 Large | Accuracy | 68.6 | # 1
Ethics | Ethics | RuGPT-3 Medium | Accuracy | 68.3 | # 2
Ethics | Ethics | RuGPT-3 Small | Accuracy | 55.5 | # 3
Ethics | Ethics | Human benchmark | Accuracy | 52.9 | # 4
Ethics | Ethics (per ethics) | Human benchmark | Accuracy | 67.6 | # 1
Ethics | Ethics (per ethics) | RuGPT-3 Small | Accuracy | 60.9 | # 2
Ethics | Ethics (per ethics) | RuGPT-3 Large | Accuracy | 44.9 | # 3
Ethics | Ethics (per ethics) | RuGPT-3 Medium | Accuracy | 44.1 | # 4
Question Answering | MultiQ | Human benchmark | Accuracy | 91.0 | # 1
Question Answering | MultiQ | RuGPT-3 Small | Accuracy | 0.0 | # 2
Question Answering | MultiQ | RuGPT-3 Medium | Accuracy | 0.0 | # 2
Question Answering | MultiQ | RuGPT-3 Large | Accuracy | 0.0 | # 2
Question Answering | RuOpenBookQA | Human benchmark | Accuracy | 86.5 | # 1
Question Answering | RuOpenBookQA | RuGPT-3 Small | Accuracy | 57.9 | # 2
Question Answering | RuOpenBookQA | RuGPT-3 Medium | Accuracy | 57.2 | # 3
Question Answering | RuOpenBookQA | RuGPT-3 Large | Accuracy | 55.5 | # 4
Logical Reasoning | RuWorldTree | Human benchmark | Accuracy | 83.7 | # 1
Logical Reasoning | RuWorldTree | RuGPT-3 Large | Accuracy | 40.7 | # 2
Logical Reasoning | RuWorldTree | RuGPT-3 Medium | Accuracy | 38.0 | # 3
Logical Reasoning | RuWorldTree | RuGPT-3 Small | Accuracy | 34.0 | # 4
Logical Reasoning | Winograd Automatic | Human benchmark | Accuracy | 87.0 | # 1
Logical Reasoning | Winograd Automatic | RuGPT-3 Small | Accuracy | 57.9 | # 2
Logical Reasoning | Winograd Automatic | RuGPT-3 Medium | Accuracy | 57.2 | # 3
Logical Reasoning | Winograd Automatic | RuGPT-3 Large | Accuracy | 55.5 | # 4