mBBC dataset (Multilingual BBC news)

Introduced by Nezhad et al. in Exploring the Maze of Multilingual Modeling

To construct our multilingual dataset - mBBC - we gathered news articles from various BBC news websites in 43 different languages. This selection was based on the fact that BBC broadcasts news in these 43 languages, providing a global coverage across continents, and spanning a diverse range of language families, scripts, resource-levels, and word order ensuring a comprehensive representation of linguistic diversity. We collected data from various language families such as Indo-European, Sino-Tibetan, Niger-Congo, Austronesian, Dravidian, and more, encompassing several scripts like Latin, Cyrillic, Arabic, Devanagari, Chinese characters, and others. This extensive representation facilitates a comprehensive evaluation of multilingual language models across different linguistic contexts. Moreover, the dataset includes both high-resource languages like English, Spanish, and French, benefiting from extensive linguistic resources, as well as low-resource languages such as Somali, Burmese, and Nepali, with limited resources or smaller speaker populations. Including languages with varying resource levels enables us to assess the adaptability and effectiveness of multilingual language models across diverse linguistic settings. To ensure an unbiased and robust analysis, our dataset consists of news articles of minimum text length of 500 characters, sourced from reputable sources in 2023, ensuring the models studied have not seen the data during training in the most new LLMs.

Homepage

Benchmarks

Add a new result Link an existing benchmark

No benchmarks yet. Start a new benchmark or link an existing one.

Papers

Paper	Code	Results	Date	Stars

mBBC dataset (Multilingual BBC news)

Benchmarks

Add a new result Link an existing benchmark

Papers

Dataset Loaders

Add Remove

Tasks

Usage

License

Modalities

Languages

mBBC dataset (Multilingual BBC news)

Benchmarks Edit Add a new result Link an existing benchmark

Papers

Dataset Loaders Edit Add Remove

Tasks Edit

Usage

License Edit

Modalities Edit

Languages Edit

Benchmarks

Add a new result Link an existing benchmark

Dataset Loaders

Add Remove

Tasks

License

Modalities

Languages