feat: expose language detection probabilities to server example #3044

Open
wants to merge 1 commit into
base: master
from

sachaarbonel
Contributor

Description:
This PR enhances the JSON API response by adding detailed language detection information when transcribing or translating audio. The changes include:

  1. The detection probability of the detected language
  2. A map of detection probabilities for every language with a non-negligible score (>0.001)
  3. Integration with Whisper's existing language detection capabilities

The new information is added under a language_detection field in the JSON response, containing:

  • probability: Confidence score for the detected language
  • language_probabilities: Map of language codes to their detection probabilities

This enhancement provides more transparency into the language detection process and can be valuable for applications requiring confidence scores in language identification.

The changes are non-breaking and only add additional information to the existing JSON response structure.
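As a rough illustration of the filtering described above, here is a minimal, self-contained sketch. It is not the code from this PR: `LanguageDetection` and `summarize_detection` are hypothetical names, and the language codes are passed in as a plain vector, whereas the server would presumably obtain them from `whisper_lang_str()`. The sketch also treats a negative language id as a failed detection, since `whisper_lang_auto_detect()` is documented to return a negative value on failure.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical result type mirroring the proposed "language_detection" field.
struct LanguageDetection {
    float probability = 0.0f;                            // score of the detected language
    std::map<std::string, float> language_probabilities; // code -> score, above threshold only
};

// probs[i] is the score for language id i, codes[i] its short code (assumed inputs).
// A negative or out-of-range detected_id models a failed detection and yields an empty result.
LanguageDetection summarize_detection(int detected_id,
                                      const std::vector<float> & probs,
                                      const std::vector<std::string> & codes,
                                      float threshold = 0.001f) {
    LanguageDetection out;
    if (detected_id < 0 || static_cast<std::size_t>(detected_id) >= probs.size()) {
        return out; // never index probs with an invalid id
    }
    out.probability = probs[static_cast<std::size_t>(detected_id)];

    const std::size_t n = probs.size() < codes.size() ? probs.size() : codes.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (probs[i] > threshold) { // keep only non-negligible scores
            out.language_probabilities[codes[i]] = probs[i];
        }
    }
    return out;
}
```

Keeping the threshold as a parameter makes it easy to experiment with values other than 0.001 without touching the filtering logic.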

Example Output:

{
  "task": "transcribe",
  "language": "english",
  "text": "This is the transcribed text of the audio file.",
  "language_detection": {
    "probability": 0.982,
    "language_probabilities": {
      "en": 0.982,
      "fr": 0.008,
      "es": 0.005,
      "de": 0.003
    }
  },
  "segments": [
    // ... segments array content ...
  ]
}

In this example:

  • The main detected language (English) has a 98.2% confidence score
  • Other languages with lower probabilities are also included
  • Only languages with probabilities > 0.001 (0.1%) are shown
  • The original JSON structure remains intact, with the new language_detection field added
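On the consuming side, an application might use these scores to decide whether to trust the detected language. The sketch below is only one possible policy, under the assumption that the client has already parsed `language_probabilities` into a `std::map`. The names `rank_languages` and `is_confident` and the default thresholds are hypothetical and not part of this PR.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, float> LangScore;

// Return (code, probability) pairs sorted by descending probability.
std::vector<LangScore> rank_languages(const std::map<std::string, float> & probs) {
    std::vector<LangScore> ranked(probs.begin(), probs.end());
    std::sort(ranked.begin(), ranked.end(),
              [](const LangScore & a, const LangScore & b) { return a.second > b.second; });
    return ranked;
}

// Hypothetical policy: accept the detection only if the top score is at least
// min_prob and leads the runner-up by at least min_margin.
bool is_confident(const std::map<std::string, float> & probs,
                  float min_prob = 0.9f,
                  float min_margin = 0.5f) {
    const std::vector<LangScore> ranked = rank_languages(probs);
    if (ranked.empty()) {
        return false; // nothing above the server-side threshold
    }
    const float top    = ranked[0].second;
    const float second = ranked.size() > 1 ? ranked[1].second : 0.0f;
    return top >= min_prob && (top - second) >= min_margin;
}
```

With the example response above, the top entry is `en` and the margin over `fr` is large, so this policy would accept the detection. A response split almost evenly between two languages would be rejected.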

@sachaarbonel changed the title from "feat: expose language detection probabilities to server.cpp" to "feat: expose language detection probabilities to server example" on Apr 14, 2025
@@ -919,13 +919,34 @@ int main(int argc, char ** argv) {
} else if (params.response_format == vjson_format) {
/* try to match openai/whisper's Python format */
std::string results = output_str(ctx, params, pcmf32s);

Collaborator

Nit: Remove empty spaces (you can't see them here, but they show up in local diffs as red bars, and it is nice not to have the extra "noise").

// Get language probabilities
std::vector<float> lang_probs(whisper_lang_max_id() + 1, 0.0f);
const auto detected_lang_id = whisper_lang_auto_detect(ctx, 0, params.n_threads, lang_probs.data());

Collaborator

Nit: Remove empty spaces.

json lang_info = json::object();
// Include the probability of the detected language
lang_info["probability"] = lang_probs[detected_lang_id];

Collaborator

Nit: Remove empty spaces.

}
}
lang_info["language_probabilities"] = all_lang_probs;
jres["language_detection"] = lang_info;
Collaborator

Perhaps this could be added to jres directly, so that all the returned attributes are easy to see in one place. For example:

            json jres = json{
                {"task", params.translate ? "translate" : "transcribe"},
                {"language", whisper_lang_str_full(whisper_full_lang_id(ctx))},
                {"duration", float(pcmf32.size())/WHISPER_SAMPLE_RATE},
                {"text", results},
                {"segments", json::array()},
                {"language_detection", lang_info},
            };

2 participants