mBBC dataset (Multilingual BBC news)

Introduced by Nezhad et al. in Exploring the Maze of Multilingual Modeling

To construct our multilingual dataset, mBBC, we gathered news articles from the BBC news websites in 43 different languages. We chose these languages because they are the 43 in which the BBC broadcasts news; together they provide global coverage across continents and span a diverse range of language families, scripts, resource levels, and word orders. The data covers language families such as Indo-European, Sino-Tibetan, Niger-Congo, Austronesian, Dravidian, and more, and scripts such as Latin, Cyrillic, Arabic, Devanagari, Chinese characters, and others. This breadth supports a comprehensive evaluation of multilingual language models across different linguistic contexts. The dataset also includes both high-resource languages such as English, Spanish, and French, which benefit from extensive linguistic resources, and low-resource languages such as Somali, Burmese, and Nepali, which have limited resources or smaller speaker populations. Including languages at varying resource levels lets us assess the adaptability and effectiveness of multilingual language models across diverse linguistic settings. To ensure an unbiased and robust analysis, the dataset consists only of news articles with a minimum text length of 500 characters, published by a reputable source in 2023, so that most recent LLMs are unlikely to have seen the data during training.
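The two selection criteria stated above (at least 500 characters, published in 2023) can be sketched as a simple filter. This is only an illustration and not the authors' collection code: the `Article` record, its field names, and the `filter_articles` helper are hypothetical, and the sketch assumes that length is measured on the raw article text.

```python
from dataclasses import dataclass
from datetime import date

# Thresholds taken from the dataset description.
MIN_CHARS = 500
TARGET_YEAR = 2023


@dataclass
class Article:
    """Hypothetical record layout; the real mBBC schema may differ."""
    language: str
    text: str
    published: date


def keep(article: Article, min_chars: int = MIN_CHARS, year: int = TARGET_YEAR) -> bool:
    """Return True if the article meets both the length and the year criterion."""
    # Assumption: length is counted in characters of the raw text.
    return len(article.text) >= min_chars and article.published.year == year


def filter_articles(articles: list[Article]) -> list[Article]:
    """Keep only the articles that satisfy both criteria, preserving order."""
    return [a for a in articles if keep(a)]
```

Counting characters rather than tokens keeps the threshold independent of any particular tokenizer, although the same character count can correspond to very different amounts of content across scripts.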

License


  • Unknown
