AlignBench Dataset | Papers With Code

**AlignBench** is a comprehensive benchmark designed to evaluate the alignment of Chinese large language models (LLMs). It assesses how well these models align with human intent across multiple dimensions:

1. **Purpose and Importance**:
   - For fine-tuned LLMs, alignment with human intent has become a critical factor in their practical applications.
   - Existing evaluation benchmarks do not accurately reflect model performance in real-world scenarios or their alignment with human intent.
   - AlignBench aims to address this challenge by providing a comprehensive, multi-dimensional evaluation benchmark specifically for Chinese LLMs.

2. **Data and Construction**:
   - AlignBench uses a human-in-the-loop data curation process to ensure dynamic and realistic evaluation data.
   - The data is drawn from real user queries to the ChatGLM online service, supplemented by challenging questions constructed by researchers.
   - It covers eight categories: fundamental language ability, Chinese understanding, open-ended questions, writing ability, logical reasoning, mathematics, task-oriented role play, and professional knowledge.

3. **Evaluation Methodology**:
   - AlignBench employs a multi-dimensional, rule-calibrated "LLM-as-Judge" evaluation method.
   - The judge model produces a chain-of-thought analysis before giving its ratings, which enhances reliability and interpretability.
   - The judge compares each model response with a high-quality reference answer and generates multi-dimensional scores.

4. **CritiqueLLM**:
   - To make alignment assessment more accessible to Chinese researchers, AlignBench introduces a dedicated evaluation model called **CritiqueLLM**.
   - CritiqueLLM reportedly recovers 95% of GPT-4's evaluation capability, and the authors plan to provide a public API for researchers.
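
The reference-based, multi-dimensional judging described under **Evaluation Methodology** can be illustrated with a minimal Python sketch. It is not AlignBench's actual implementation: the prompt wording, the dimension names in `DIMENSIONS`, the 1 to 10 scale, and the `name: score` output format are all assumptions made for illustration, and the call to the judge model itself is left out.

```python
import re
from statistics import mean

# Hypothetical scoring dimensions; the real benchmark defines its own per category.
DIMENSIONS = ["correctness", "helpfulness", "clarity"]


def build_judge_prompt(question, reference, answer, dimensions=DIMENSIONS):
    """Assemble a prompt that asks a judge model to reason first, then score."""
    dims = ", ".join(dimensions)
    return (
        "You are grading an AI assistant's answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Assistant answer: {answer}\n"
        "First compare the answer with the reference step by step, then rate "
        f"each dimension ({dims}) and an overall score from 1 to 10.\n"
        "End with one line per score in the form 'name: score'."
    )


# Matches lines such as "correctness: 8" (assumed output format).
SCORE_LINE = re.compile(r"^\s*([A-Za-z_ ]+?)\s*:\s*(\d{1,2})\s*$", re.MULTILINE)


def parse_scores(judgment, dimensions=DIMENSIONS):
    """Extract per-dimension and overall scores from the judge's text output."""
    wanted = set(dimensions) | {"overall"}
    scores = {}
    for name, value in SCORE_LINE.findall(judgment):
        key = name.strip().lower()
        if key in wanted and 1 <= int(value) <= 10:
            scores[key] = int(value)
    return scores


def average_by_category(results):
    """Average overall scores per category from (category, score) pairs."""
    buckets = {}
    for category, score in results:
        buckets.setdefault(category, []).append(score)
    return {category: mean(scores) for category, scores in buckets.items()}
```

In this sketch the free-text reasoning that precedes the score lines is simply ignored by `parse_scores`, so it can be stored alongside the scores for later inspection.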

(1) THUDM/AlignBench: A Multi-Dimensional Chinese Alignment Evaluation Benchmark - GitHub. https://github.com/THUDM/AlignBench.
(2) Zhipu AI Releases AlignBench, a Chinese LLM Alignment Evaluation Benchmark | 前途科技. https://accesspath.com/ai/5890084/.
(3) AlignBench: Alignment Evaluation Tailored for Chinese Large Language Models - CSDN Blog. https://blog.csdn.net/cenyk1230/article/details/135228409.
(4) AlignBench: Alignment Evaluation Built for Chinese LLMs - Zhihu. https://zhuanlan.zhihu.com/p/671884106.
(5) AlignBench: Benchmarking Chinese Alignment of Large Language Models. https://arxiv.org/abs/2311.18743.



Similar Datasets

MathBench

C3

MT-Bench

CCPM
