- Purpose
- Evaluation Framework
- Prerequisites
- Environment Setup
- Running the Accuracy Benchmark
- Understanding the Results
## Purpose

This guide explains how to evaluate the accuracy of a deployed CodeGen service using standardized code generation benchmarks. It helps quantify the model's ability to generate correct and functional code based on prompts.
## Evaluation Framework

We use the [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness), a framework designed specifically for evaluating code generation models. It supports standard benchmarks such as HumanEval and MBPP, among others.
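For context, each HumanEval task pairs a function signature and docstring with unit tests that the generated completion must pass. Below is a minimal sketch of inspecting one task with the Hugging Face `datasets` library; this is for illustration only, since the harness loads the benchmark data itself:

```python
from datasets import load_dataset

# HumanEval ships as a single "test" split of 164 problems.
problems = load_dataset("openai_humaneval", split="test")

task = problems[0]
print(task["task_id"])  # e.g. "HumanEval/0"
print(task["prompt"])   # signature + docstring the model must complete
print(task["test"])     # unit tests used to judge the completion
```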
## Prerequisites

- A running CodeGen service accessible via an HTTP endpoint. Refer to the main CodeGen README for deployment options.
- A Python 3.8+ environment.
- Git installed.
## Environment Setup

1. Clone the Evaluation Repository:

   ```bash
   git clone https://github.com/opea-project/GenAIEval
   cd GenAIEval
   ```
2. Install Dependencies:

   ```bash
   pip install -r requirements.txt
   pip install -e .
   ```
3. Set Environment Variables: Replace `{your_ip}` with the IP address of your deployed CodeGen service and `{your_model_identifier}` with the identifier of the model being tested (e.g., `Qwen/CodeQwen1.5-7B-Chat`):

   ```bash
   export CODEGEN_ENDPOINT="http://{your_ip}:7778/v1/codegen"
   export CODEGEN_MODEL="{your_model_identifier}"
   ```

   Note: Port `7778` is the default for the CodeGen gateway; adjust it if you customized your deployment.
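Before running the full benchmark, you can optionally confirm that the endpoint answers a simple request. Below is a minimal smoke-test sketch in Python; the `{"messages": ...}` payload follows the OPEA CodeGen gateway convention, so adjust the body if your deployment exposes a different request schema:

```python
import os

import requests

# Read the endpoint exported in the previous step.
endpoint = os.environ["CODEGEN_ENDPOINT"]

# The OPEA CodeGen gateway accepts a simple {"messages": "..."} body;
# the prompt here is an arbitrary illustration.
resp = requests.post(
    endpoint,
    json={"messages": "Write a Python function that reverses a string."},
    timeout=120,
)
resp.raise_for_status()
print(resp.text[:500])  # show the beginning of the (possibly streamed) reply
```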
## Running the Accuracy Benchmark

1. Execute the Benchmark Script: The script runs the evaluation tasks (HumanEval by default) against the specified endpoint.

   ```bash
   bash run_acc.sh $CODEGEN_MODEL $CODEGEN_ENDPOINT
   ```

   Note: The framework currently runs the full task set by default. Limiting the number of tasks (e.g., via the harness's `limit` option) can make results incomparable with full-set scores reported elsewhere.
## Understanding the Results

The results are printed to the console and saved in `evaluation_results.json`. A key metric is `pass@k`, which measures the fraction of problems for which at least one of `k` generated samples passes the unit tests (e.g., `pass@1` means the problem is solved on the first attempt).
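For reference, `pass@k` is typically computed with the unbiased estimator from the Codex paper (Chen et al., 2021), which averages over all size-`k` subsets of the `n` samples generated per problem. A minimal sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated for a problem
    c: number of those samples that pass the unit tests
    k: attempt budget being scored
    """
    if n - c < k:
        return 1.0  # every size-k subset contains a passing sample
    # 1 - C(n-c, k) / C(n, k), expanded as a numerically stable product
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 20 samples per problem, 5 of which pass -> pass@1 = 0.25
print(pass_at_k(n=20, c=5, k=1))
```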
Example output snippet:
```json
{
"humaneval": {
"pass@1": 0.7195121951219512
},
"config": {
"model": "Qwen/CodeQwen1.5-7B-Chat",
"tasks": "humaneval",
"instruction_tokens": null,
"batch_size": 1,
"max_length_generation": 2048,
"precision": "fp32",
"load_in_8bit": false,
"load_in_4bit": false,
"left_padding": false,
"limit": null,
"limit_start": 0,
"save_every_k_tasks": -1,
"postprocess": true,
"allow_code_execution": true,
"generation_only": false,
"load_generations_path": null,
"load_data_path": null,
"metric_output_path": "evaluation_results.json",
"save_generations": true,
"load_generations_intermediate_paths": null,
"save_generations_path": "generations.json",
"save_references": true,
"save_references_path": "references.json",
"prompt": "prompt",
"max_memory_per_gpu": null,
"check_references": false,
"codegen_url": "http://192.168.123.104:7778/v1/codegen"
}
}
```
This indicates a `pass@1` score of approximately 72% on the HumanEval benchmark for the specified model via the CodeGen service endpoint.
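To compare scores across runs, the saved JSON can be read back programmatically. A small sketch, assuming the file layout shown in the snippet above:

```python
import json

with open("evaluation_results.json") as f:
    results = json.load(f)

# Pull the headline metric and the model it belongs to.
score = results["humaneval"]["pass@1"]
model = results["config"]["model"]
print(f"{model}: pass@1 = {score:.1%}")
```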