- Purpose
- Evaluation Framework
- Prerequisites
- Environment Setup
- Running the Accuracy Benchmark
- Understanding the Results
## Purpose

This guide explains how to evaluate the accuracy of a deployed CodeGen service using standardized code generation benchmarks. It helps quantify the model's ability to generate correct and functional code based on prompts.
## Evaluation Framework

We use the [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness), a framework designed specifically for evaluating code generation models. It supports standard benchmarks such as HumanEval and MBPP, among others.
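For context, each HumanEval task pairs a function signature and docstring with unit tests that the generated completion must pass. Below is a minimal sketch of inspecting one task with the Hugging Face `datasets` library; this is for illustration only, since the harness loads the benchmark data itself:

```python
from datasets import load_dataset

# HumanEval ships as a single "test" split of 164 problems.
problems = load_dataset("openai_humaneval", split="test")

task = problems[0]
print(task["task_id"])  # e.g. "HumanEval/0"
print(task["prompt"])   # signature + docstring the model must complete
print(task["test"])     # unit tests used to judge the completion
```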
## Prerequisites

- A running CodeGen service accessible via an HTTP endpoint. Refer to the main CodeGen README for deployment options.
- A Python 3.8+ environment.
- Git installed.
## Environment Setup

1. Clone the Evaluation Repository:

   ```bash
   git clone https://github.com/opea-project/GenAIEval
   cd GenAIEval
   ```
2. Install Dependencies:

   ```bash
   pip install -r requirements.txt
   pip install -e .
   ```
3. Set Environment Variables: Replace `{your_ip}` with the IP address of your deployed CodeGen service and `{your_model_identifier}` with the identifier of the model being tested (e.g., `Qwen/CodeQwen1.5-7B-Chat`):

   ```bash
   export CODEGEN_ENDPOINT="http://{your_ip}:7778/v1/codegen"
   export CODEGEN_MODEL="{your_model_identifier}"
   ```

   Note: Port `7778` is the default for the CodeGen gateway; adjust it if you customized your deployment.
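Before running the full benchmark, you can optionally confirm that the endpoint answers a simple request. Below is a minimal smoke-test sketch in Python; the `{"messages": ...}` payload follows the OPEA CodeGen gateway convention, so adjust the body if your deployment exposes a different request schema:

```python
import os

import requests

# Read the endpoint exported in the previous step.
endpoint = os.environ["CODEGEN_ENDPOINT"]

# The OPEA CodeGen gateway accepts a simple {"messages": "..."} body;
# the prompt here is an arbitrary illustration.
resp = requests.post(
    endpoint,
    json={"messages": "Write a Python function that reverses a string."},
    timeout=120,
)
resp.raise_for_status()
print(resp.text[:500])  # show the beginning of the (possibly streamed) reply
```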
## Running the Accuracy Benchmark

1. Execute the Benchmark Script: The script runs the evaluation tasks (HumanEval by default) against the specified endpoint.

   ```bash
   bash run_acc.sh $CODEGEN_MODEL $CODEGEN_ENDPOINT
   ```

   Note: The framework currently runs the full task set by default. Limiting the number of tasks (e.g., via the harness's `limit` option) can make results incomparable with full-set scores reported elsewhere.
## Understanding the Results

The results are printed to the console and saved in `evaluation_results.json`. A key metric is `pass@k`, which measures the fraction of problems for which at least one of `k` generated samples passes the unit tests (e.g., `pass@1` means the problem is solved on the first attempt).
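For reference, `pass@k` is typically computed with the unbiased estimator from the Codex paper (Chen et al., 2021), which averages over all size-`k` subsets of the `n` samples generated per problem. A minimal sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated for a problem
    c: number of those samples that pass the unit tests
    k: attempt budget being scored
    """
    if n - c < k:
        return 1.0  # every size-k subset contains a passing sample
    # 1 - C(n-c, k) / C(n, k), expanded as a numerically stable product
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 20 samples per problem, 5 of which pass -> pass@1 = 0.25
print(pass_at_k(n=20, c=5, k=1))
```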
Example output snippet:
```json
{
"humaneval": {
"pass@1": 0.7195121951219512
},
"config": {
"model": "Qwen/CodeQwen1.5-7B-Chat",
"tasks": "humaneval",
"instruction_tokens": null,
"batch_size": 1,
"max_length_generation": 2048,
"precision": "fp32",
"load_in_8bit": false,
"load_in_4bit": false,
"left_padding": false,
"limit": null,
"limit_start": 0,
"save_every_k_tasks": -1,
"postprocess": true,
"allow_code_execution": true,
"generation_only": false,
"load_generations_path": null,
"load_data_path": null,
"metric_output_path": "evaluation_results.json",
"save_generations": true,
"load_generations_intermediate_paths": null,
"save_generations_path": "generations.json",
"save_references": true,
"save_references_path": "references.json",
"prompt": "prompt",
"max_memory_per_gpu": null,
"check_references": false,
"codegen_url": "http://192.168.123.104:7778/v1/codegen"
}
}
```
This indicates a `pass@1` score of approximately 72% on the HumanEval benchmark for the specified model via the CodeGen service endpoint.
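To compare scores across runs, the saved JSON can be read back programmatically. A small sketch, assuming the file layout shown in the snippet above:

```python
import json

with open("evaluation_results.json") as f:
    results = json.load(f)

# Pull the headline metric and the model it belongs to.
score = results["humaneval"]["pass@1"]
model = results["config"]["model"]
print(f"{model}: pass@1 = {score:.1%}")
```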