The 100PoisonMpts dataset is an open-source Chinese dataset for large language model governance, developed jointly by Alibaba Tmall Genie and the Tongyi Large Model Team. It addresses safety concerns about large language models, which have grown since the release of ChatGPT. The project's purpose is to ensure that the information these models produce is safe, reliable, and aligned with human values.
Here are the key details about the 100PoisonMpts dataset:
It responds to concerns about AI-generated content being safe, healthy, and aligned with human values.
Data Collection:
The large model's answers were then annotated, creating a dynamic interplay between "poisoning" and "detoxification."
Significance:
It aligns with China's Interim Measures for the Management of Generative Artificial Intelligence Services, which emphasize preventing discrimination based on ethnicity, religion, nationality, gender, age, occupation, and health.
Expertise and Diversity:
Experts from fields such as environmental sociology, law, psychology, and child education contributed.
Data Format:
The data is in the train.json file. Each sample is in JSON format, containing the following fields:
prompt: Inductive questions proposed by domain experts.
answer: Expert-approved answers.
domain_en: Domain information (in English).
domain_zh: Domain information (in Chinese).
answer_source: Indicates whether the answer is from an expert or the large model.
Usage:
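As a rough illustration of working with the fields listed above, the following Python sketch loads the samples and tallies them by a chosen field. It makes several assumptions not stated in the source: that train.json has been downloaded locally, and that it is either a single JSON array or a JSON Lines file (the exact layout is not documented here, so the loader tries both). The helper names load_samples, missing_fields, and count_by are hypothetical and are not part of any official tooling.

```python
import json
from collections import Counter
from pathlib import Path

# Field names taken from the dataset description above.
EXPECTED_FIELDS = ("prompt", "answer", "domain_en", "domain_zh", "answer_source")


def load_samples(path):
    """Load samples from a JSON array file or a JSON Lines file.

    The on-disk layout of train.json is an assumption, so both are tried.
    """
    text = Path(path).read_text(encoding="utf-8")
    try:
        data = json.loads(text)
        # A single top-level object is treated as one sample.
        return data if isinstance(data, list) else [data]
    except json.JSONDecodeError:
        # Fall back to one JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]


def missing_fields(sample):
    """Return the expected field names that are absent from a sample."""
    return [name for name in EXPECTED_FIELDS if name not in sample]


def count_by(samples, field):
    """Count samples per distinct value of the given field."""
    return Counter(sample.get(field) for sample in samples)
```

For example, count_by(load_samples("train.json"), "domain_en") would tally samples per domain, and the same call with "answer_source" would show how many answers come from each source. The actual values stored in answer_source are not listed here, so the sketch avoids hard-coding them.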
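Source: Conversation with Bing, 3/18/2024
(1) 100PoisonMpts: Chinese Large Model Governance Dataset. https://www.modelscope.cn/datasets/damo/100PoisonMpts/summary.
(2) 100PoisonMpts: Chinese Large Model Governance Dataset. https://www.modelscope.cn/datasets/damo/100PoisonMpts/files.
(3) Alibaba's 100 Bottles of Poison Solve Musk's Problem? China's First Large Model Value-Alignment Dataset Open-Sourced, 150,000 Evaluation Questions Released! - Zhihu. https://zhuanlan.zhihu.com/p/643552287.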