AlignBench is a comprehensive benchmark for evaluating the alignment of Chinese large language models (LLMs). It assesses how well these models align with human intent across multiple dimensions.

Purpose and Importance:

  • For fine-tuned LLMs, alignment with human intent has become a critical factor in practical applications.
  • Existing evaluation benchmarks do not accurately reflect how models perform in real-world scenarios or how well they align with human intent.
  • AlignBench addresses this gap with a comprehensive, multi-dimensional evaluation benchmark built specifically for Chinese LLMs.

Data and Construction:

  • AlignBench uses a human-in-the-loop data curation process to keep the evaluation data dynamic and realistic.
  • The data comes mainly from real user queries to the ChatGLM online service, supplemented by some challenging questions constructed by researchers.
  • It covers eight categories: fundamental language ability, Chinese understanding, open-ended questions, writing ability, logical reasoning, mathematics, task-oriented role play, and professional knowledge.
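
As a rough illustration of how a dataset organized by these categories might be inspected, the sketch below tallies samples per category from JSON Lines text. The field name `category`, the label strings, and the toy records are assumptions made for this example; they are not taken from the released data files.

```python
import json
from collections import Counter


def count_by_category(jsonl_text: str) -> Counter:
    """Count samples per category in JSON Lines text (one JSON object per line)."""
    counts = Counter()
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # "category" is an assumed field name for this sketch.
        counts[record["category"]] += 1
    return counts


# Toy records invented for illustration only.
sample = "\n".join(
    json.dumps(record, ensure_ascii=False)
    for record in [
        {"category": "Mathematics", "question": "What is 12 * 12?"},
        {"category": "Mathematics", "question": "Solve x + 3 = 7."},
        {"category": "Writing Ability", "question": "Write a short toast."},
    ]
)

for category, n in count_by_category(sample).most_common():
    print(category, n)
```

A per-category tally like this is a quick check that an evaluation subset is not dominated by a single category before scores are averaged.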

Evaluation Methodology:

  • AlignBench employs a multi-dimensional, rule-calibrated LLM-as-Judge evaluation method.
  • The judge produces a chain-of-thought analysis alongside its ratings, which improves reliability and interpretability.
  • Each model response is compared against a high-quality reference answer and scored along multiple dimensions.
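
The reference-based, multi-dimensional judging step can be sketched as follows. This is a minimal illustration, not AlignBench's actual prompt or parser: the dimension names, the prompt wording, the 1-10 score range, and the convention that the judge ends its reply with a JSON object are all assumptions.

```python
import json
import re

# Hypothetical scoring dimensions for this sketch.
DIMENSIONS = ["factuality", "user_satisfaction", "clarity", "overall"]


def build_judge_prompt(question: str, reference: str, answer: str) -> str:
    """Assemble a judging prompt that asks for analysis first, then scores."""
    return (
        "You are grading an AI assistant's answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Assistant answer: {answer}\n"
        "First compare the answer with the reference step by step.\n"
        "Then end with a JSON object giving a 1-10 score for each of: "
        + ", ".join(DIMENSIONS)
    )


def parse_scores(judge_output: str) -> dict:
    """Extract the trailing JSON object of scores from the judge's reply."""
    match = re.search(r"\{[^{}]*\}\s*$", judge_output.strip())
    if match is None:
        raise ValueError("no trailing JSON object found in judge output")
    scores = json.loads(match.group(0))
    for name, value in scores.items():
        # Assumed score range for this sketch.
        if not isinstance(value, (int, float)) or not 1 <= value <= 10:
            raise ValueError(f"score for {name!r} is out of range: {value!r}")
    return scores


# Made-up judge reply used only to exercise the parser.
reply = 'Analysis: the answer matches the reference.\n{"factuality": 9, "overall": 8}'
print(parse_scores(reply))
```

Asking for the analysis before the scores mirrors the chain-of-thought idea described above: the written comparison is kept for interpretability, and only the final object is parsed into numbers.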

CritiqueLLM:

  • To make alignment assessment easier for Chinese researchers, AlignBench introduces a dedicated evaluation model called CritiqueLLM.
  • CritiqueLLM is reported to recover 95% of GPT-4's evaluation capability, and an accessible API for researchers is planned.

(1) THUDM/AlignBench: A Multi-dimensional Chinese Alignment Evaluation Benchmark - GitHub. https://github.com/THUDM/AlignBench.
(2) Zhipu AI Releases AlignBench, an Alignment Evaluation Benchmark for Chinese LLMs | 前途科技. https://accesspath.com/ai/5890084/.
(3) AlignBench: A Tailor-Made Alignment Evaluation for Chinese Large Language Models - CSDN Blog. https://blog.csdn.net/cenyk1230/article/details/135228409.
(4) AlignBench: An Alignment Evaluation Built for Chinese LLMs - Zhihu. https://zhuanlan.zhihu.com/p/671884106.
(5) AlignBench: Benchmarking Chinese Alignment of Large Language Models. https://arxiv.org/abs/2311.18743.

License


  • Unknown
