CodeGen Accuracy Benchmark

Table of Contents

  • Purpose
  • Evaluation Framework
  • Prerequisites
  • Environment Setup
  • Running the Accuracy Benchmark
  • Understanding the Results

Purpose

This guide explains how to evaluate the accuracy of a deployed CodeGen service using standardized code generation benchmarks. It helps quantify the model's ability to generate correct and functional code based on prompts.

Evaluation Framework

We use the bigcode-evaluation-harness, a framework designed specifically for evaluating code generation models. It supports standard benchmarks such as HumanEval and MBPP.

Prerequisites

  • A running CodeGen service accessible via an HTTP endpoint (a quick connectivity check is sketched after this list). Refer to the main CodeGen README for deployment options.
  • Python 3.8+ environment.
  • Git installed.
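
Before running the full benchmark, you can confirm that the service is reachable. The request below is a minimal sketch: it assumes the default CodeGen gateway port (7778) and the simple messages request body used by the OPEA CodeGen examples; adjust both if your deployment differs.

    curl http://{your_ip}:7778/v1/codegen \
      -H "Content-Type: application/json" \
      -d '{"messages": "Write a Python function that returns the sum of two integers."}'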

Environment Setup

  1. Clone the Evaluation Repository:

    git clone https://github.com/opea-project/GenAIEval
    cd GenAIEval
  2. Install Dependencies:

    pip install -r requirements.txt
    pip install -e .
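
If you prefer an isolated environment, create and activate a virtual environment before running the pip install commands above. This is a generic Python setup sketch, not a GenAIEval requirement:

    # optional: isolate the benchmark dependencies from the system Python
    python3 -m venv .venv
    source .venv/bin/activate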

Running the Accuracy Benchmark

  1. Set Environment Variables: Replace {your_ip} with the IP address of your deployed CodeGen service and {your_model_identifier} with the identifier of the model being tested (e.g., Qwen/CodeQwen1.5-7B-Chat).

    export CODEGEN_ENDPOINT="http://{your_ip}:7778/v1/codegen"
    export CODEGEN_MODEL="{your_model_identifier}"

    Note: Port 7778 is the default for the CodeGen gateway; adjust if you customized it.

  2. Execute the Benchmark Script: The script will run the evaluation tasks (e.g., HumanEval by default) against the specified endpoint.

    bash run_acc.sh $CODEGEN_MODEL $CODEGEN_ENDPOINT

    Note: The framework currently runs the full task set by default. Limiting the number of problems (e.g., with a limit parameter) speeds up a smoke test but makes results harder to compare with full-benchmark scores.
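
A full HumanEval run against a remote service can take a while, so it can be convenient to run the script in the background and follow its log. This is an optional shell pattern, not part of run_acc.sh itself:

    # run the benchmark in the background and capture its output
    nohup bash run_acc.sh "$CODEGEN_MODEL" "$CODEGEN_ENDPOINT" > acc_run.log 2>&1 &
    # follow progress
    tail -f acc_run.log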

Understanding the Results

The results are printed to the console and saved to evaluation_results.json. The key metric is pass@k: the fraction of problems for which at least one of k generated samples passes the benchmark's unit tests (so pass@1 means the problem is solved on the first attempt).
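
For reference, pass@k is conventionally estimated with the unbiased formula from the HumanEval paper, where n samples are generated per problem and c of them pass the tests; with a single sample per problem, pass@1 reduces to the fraction of problems solved:

    \text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right]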

Example output snippet:

{
  "humaneval": {
    "pass@1": 0.7195121951219512
  },
  "config": {
    "model": "Qwen/CodeQwen1.5-7B-Chat",
    "tasks": "humaneval",
    "instruction_tokens": null,
    "batch_size": 1,
    "max_length_generation": 2048,
    "precision": "fp32",
    "load_in_8bit": false,
    "load_in_4bit": false,
    "left_padding": false,
    "limit": null,
    "limit_start": 0,
    "save_every_k_tasks": -1,
    "postprocess": true,
    "allow_code_execution": true,
    "generation_only": false,
    "load_generations_path": null,
    "load_data_path": null,
    "metric_output_path": "evaluation_results.json",
    "save_generations": true,
    "load_generations_intermediate_paths": null,
    "save_generations_path": "generations.json",
    "save_references": true,
    "save_references_path": "references.json",
    "prompt": "prompt",
    "max_memory_per_gpu": null,
    "check_references": false,
    "codegen_url": "http://192.168.123.104:7778/v1/codegen"
  }
}

This indicates a pass@1 score of approximately 72% on the HumanEval benchmark for the specified model served through the CodeGen service endpoint.
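
To pull the headline metric out of evaluation_results.json programmatically (for example, in a CI check), a one-liner such as the following works, assuming the file keeps the structure shown above:

    # print the HumanEval pass@1 score from the results file
    python3 -c "import json; print(json.load(open('evaluation_results.json'))['humaneval']['pass@1'])"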