Fix gemini 2.5 flash on Vertex AI #10189
base: main
Changes from all commits
58c2551
62ed5b2
f87f500
e546e5f
```diff
@@ -365,17 +365,14 @@ def _map_reasoning_effort_to_thinking_budget(
         if reasoning_effort == "low":
             return {
                 "thinkingBudget": DEFAULT_REASONING_EFFORT_LOW_THINKING_BUDGET,
-                "includeThoughts": True,
             }
         elif reasoning_effort == "medium":
             return {
                 "thinkingBudget": DEFAULT_REASONING_EFFORT_MEDIUM_THINKING_BUDGET,
-                "includeThoughts": True,
             }
         elif reasoning_effort == "high":
             return {
                 "thinkingBudget": DEFAULT_REASONING_EFFORT_HIGH_THINKING_BUDGET,
-                "includeThoughts": True,
             }
         else:
             raise ValueError(f"Invalid reasoning effort: {reasoning_effort}")
```
```diff
@@ -388,9 +385,9 @@ def _map_thinking_param(
         thinking_budget = thinking_param.get("budget_tokens")
 
         params: GeminiThinkingConfig = {}
-        if thinking_enabled:
-            params["includeThoughts"] = True
-        if thinking_budget:
+        if not thinking_enabled:
+            params["thinkingBudget"] = 0
+        elif thinking_budget is not None:
             params["thinkingBudget"] = thinking_budget
 
         return params
```
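The behavioral difference is easiest to see from the return values. Below is a minimal standalone sketch that mirrors the new mapping logic above; it is not the actual method signature, and `thinking_enabled` is assumed to be derived from the Anthropic-style `thinking` param earlier in the method.

```python
# Sketch only: mirrors the new mapping above, not LiteLLM's real method.
from typing import Optional


def map_thinking_param(thinking_enabled: bool, thinking_budget: Optional[int]) -> dict:
    params: dict = {}
    if not thinking_enabled:
        # Explicitly send a zero budget. The old code returned {} here, so the
        # model's default (thinking on) silently applied.
        params["thinkingBudget"] = 0
    elif thinking_budget is not None:
        params["thinkingBudget"] = thinking_budget
    return params


assert map_thinking_param(False, None) == {"thinkingBudget": 0}    # thinking disabled
assert map_thinking_param(True, 1024) == {"thinkingBudget": 1024}  # explicit budget
assert map_thinking_param(True, None) == {}                        # provider default
```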
```diff
@@ -743,6 +740,7 @@ def _handle_content_policy_violation(
     def _calculate_usage(
         self,
         completion_response: GenerateContentResponseBody,
+        thinking_enabled: bool | None,
     ) -> Usage:
         cached_tokens: Optional[int] = None
         audio_tokens: Optional[int] = None
```
```diff
@@ -768,17 +766,24 @@ def _calculate_usage(
             audio_tokens=audio_tokens,
             text_tokens=text_tokens,
         )
+        completion_tokens = completion_response["usageMetadata"].get(
+            "candidatesTokenCount", 0
+        )
+        if reasoning_tokens:
+            # Usage(...) constructor expects that completion_tokens includes the reasoning_tokens.
+            # However the Vertex AI usage metadata does not include reasoning tokens in candidatesTokenCount.
+            # Reportedly, this is different from the Gemini API.
+            completion_tokens += reasoning_tokens
         ## GET USAGE ##
         usage = Usage(
             prompt_tokens=completion_response["usageMetadata"].get(
                 "promptTokenCount", 0
             ),
-            completion_tokens=completion_response["usageMetadata"].get(
-                "candidatesTokenCount", 0
-            ),
+            completion_tokens=completion_tokens,
             total_tokens=completion_response["usageMetadata"].get("totalTokenCount", 0),
             prompt_tokens_details=prompt_tokens_details,
             reasoning_tokens=reasoning_tokens,
+            thinking_enabled=thinking_enabled,
         )
 
         return usage
```

Review thread on the new comment about candidatesTokenCount:

Reviewer: is there any documentation / reference for this?

Author: Not that I'm immediately aware of. I didn't even know this was a problem until it was mentioned here: #10141 (comment). Once I looked at my logs and did manual testing, I confirmed the behavior for Vertex AI. I have not tested the Gemini API myself.

Author: For example, with Gemini 2.5 Flash on Vertex AI:
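(The usage-metadata dump from the original comment is not preserved in this capture. The block below is an illustrative reconstruction with made-up numbers, using the real Vertex AI usageMetadata field names, to show the arithmetic the fix relies on.)

```python
# Illustrative only: invented numbers, real Vertex AI usageMetadata field names.
usage_metadata = {
    "promptTokenCount": 17,
    "candidatesTokenCount": 42,   # visible completion text only
    "thoughtsTokenCount": 265,    # reasoning ("thinking") tokens, reported separately
    "totalTokenCount": 324,       # 17 + 42 + 265
}

# Hence the fix above folds thoughtsTokenCount back into completion_tokens:
completion_tokens = (
    usage_metadata["candidatesTokenCount"] + usage_metadata["thoughtsTokenCount"]
)
assert usage_metadata["promptTokenCount"] + completion_tokens == usage_metadata["totalTokenCount"]
```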
As can be seen from the total token count, the candidates token count does not include the thoughts token count (total = candidates + thoughts + prompt).
```diff
@@ -910,6 +915,16 @@ def transform_response(
                 completion_response=completion_response,
             )
 
+        thinking_enabled = None
+        if "gemini-2.5-flash" in model:
+            # Only Gemini 2.5 Flash can have its thinking disabled by setting the thinking budget to zero
+            thinking_budget = (
+                request_data.get("generationConfig", {})
+                .get("thinkingConfig", {})
+                .get("thinkingBudget")
+            )
+            thinking_enabled = thinking_budget != 0
+
         model_response.choices = []
 
         try:
```

Review thread on the gemini-2.5-flash check:

Reviewer: What happens on gemini-2.5-pro if you send it a thinking budget of 0?

Author: When I compared the behavior of gemini-2.5-flash and gemini-2.5-pro, setting the thinking budget to 0 only had an effect on gemini-2.5-flash.

vinaynair: Maybe it's related. Can you kindly confirm? Using Gemini as per the docs at https://docs.litellm.ai/docs/tutorials/openai_codex I get the below error.

Author: @vinaynair this is an unrelated error message. It is because of this, which does look like invalid input.

Reviewer: Please file a separate ticket for a feature request, where we filter this scenario.
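For reference, a sketch of how the new check reads the outgoing request. The payload shape is assumed from the `request_data` lookups in the hunk above; a zero budget is what a caller sends to turn thinking off on Gemini 2.5 Flash.

```python
# Request body LiteLLM would send to Vertex AI when thinking is disabled
# (shape assumed from the request_data lookups in the diff above).
request_data = {
    "contents": [{"role": "user", "parts": [{"text": "hi"}]}],
    "generationConfig": {"thinkingConfig": {"thinkingBudget": 0}},
}

# The new transform_response logic derives:
thinking_budget = (
    request_data.get("generationConfig", {})
    .get("thinkingConfig", {})
    .get("thinkingBudget")
)
thinking_enabled = thinking_budget != 0  # False here; passed through to _calculate_usage
```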
```diff
@@ -923,7 +938,10 @@ def transform_response(
                 _candidates, model_response, litellm_params
             )
 
-            usage = self._calculate_usage(completion_response=completion_response)
+            usage = self._calculate_usage(
+                completion_response=completion_response,
+                thinking_enabled=thinking_enabled,
+            )
             setattr(model_response, "usage", usage)
 
             ## ADD METADATA TO RESPONSE ##
```
Reviewer: please add a unit test for this behaviour in here - https://github.com/BerriAI/litellm/blob/main/tests/litellm/llms/vertex_ai/gemini/test_vertex_and_google_ai_studio_gemini.py
Author: I wasn't really sure what you were looking for. Take a look at the test I added and let me know if you wanted something else.
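The test added in the PR is not shown in this capture. Below is a minimal sketch of the kind of test the reviewer appears to be asking for. It assumes the config class is `VertexGeminiConfig` in `litellm/llms/vertex_ai/gemini/vertex_and_google_ai_studio_gemini.py` and that `_calculate_usage` only reads the `usageMetadata` keys used in the diff; neither assumption is verified against the repo.

```python
# Sketch only: module path, class name, and response shape are assumptions.
from litellm.llms.vertex_ai.gemini.vertex_and_google_ai_studio_gemini import (
    VertexGeminiConfig,
)


def test_calculate_usage_adds_reasoning_tokens_to_completion_tokens():
    # Mimics a Vertex AI Gemini 2.5 Flash response with thinking enabled:
    # candidatesTokenCount excludes the thoughts tokens.
    completion_response = {
        "usageMetadata": {
            "promptTokenCount": 10,
            "candidatesTokenCount": 20,
            "thoughtsTokenCount": 50,
            "totalTokenCount": 80,
        }
    }

    usage = VertexGeminiConfig()._calculate_usage(
        completion_response=completion_response,
        thinking_enabled=True,
    )

    # Reasoning tokens are folded back into completion_tokens so that
    # prompt + completion == total.
    assert usage.completion_tokens == 70
    assert usage.prompt_tokens == 10
    assert usage.total_tokens == 80
```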