
Fix gemini 2.5 flash on Vertex AI #10189


Open

awesie wants to merge 4 commits into main

Conversation

@awesie commented Apr 21, 2025

Title

Fix cost calculations and thinking budget for Gemini 2.5 Flash on Vertex AI

Relevant issues

Fixes #10141
Fixes #10121

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

  • I have added testing in the tests/litellm/ directory. Adding at least 1 test is a hard requirement - see details
  • I have added a screenshot of my new test passing locally
  • My PR passes all unit tests with [make test-unit](https://docs.litellm.ai/docs/extras/contributing_code)
  • My PR's scope is as isolated as possible, it only solves 1 specific problem

Type

🐛 Bug Fix

Changes


    )
    if reasoning_tokens:
        # Usage(...) constructor expects that completion_tokens includes the reasoning_tokens.
        # However, the Vertex AI usage metadata does not include reasoning tokens in candidatesTokenCount.
Contributor:

is there any documentation / reference for this?

Author:

Not that I'm immediately aware of. I didn't even know this was a problem until it was mentioned here: #10141 (comment). Once I looked at my logs and did manual testing, I confirmed the behavior for Vertex AI.

I have not tested the Gemini API myself.

Author:

For example, with Gemini 2.5 Flash on Vertex AI:

  "usageMetadata": {
    "promptTokenCount": 10,
    "candidatesTokenCount": 2622,
    "totalTokenCount": 4434,
    "trafficType": "ON_DEMAND",
    "promptTokensDetails": [
      {
        "modality": "TEXT",
        "tokenCount": 10
      }
    ],
    "candidatesTokensDetails": [
      {
        "modality": "TEXT",
        "tokenCount": 2622
      }
    ],
    "thoughtsTokenCount": 1802
  },

As the total token count shows, candidatesTokenCount does not include thoughtsTokenCount: totalTokenCount = promptTokenCount + candidatesTokenCount + thoughtsTokenCount (4434 = 10 + 2622 + 1802).
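
To make the accounting concrete, here is a minimal sketch of the adjustment (adjust_vertex_usage and the return shape are illustrative, not LiteLLM's actual code), assuming the usageMetadata fields shown above:

    # Sketch: make completion_tokens include reasoning tokens, as the
    # Usage(...) constructor expects. Field names follow the usageMetadata
    # example above; adjust_vertex_usage itself is hypothetical.
    def adjust_vertex_usage(usage_metadata: dict) -> dict:
        prompt_tokens = usage_metadata.get("promptTokenCount", 0)
        candidate_tokens = usage_metadata.get("candidatesTokenCount", 0)
        reasoning_tokens = usage_metadata.get("thoughtsTokenCount", 0)
        # Vertex AI reports thoughts separately, so completion tokens must be
        # candidates + thoughts for prompt + completion to equal the total.
        return {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": candidate_tokens + reasoning_tokens,
            "reasoning_tokens": reasoning_tokens,
        }

    # With the metadata above: 10 + (2622 + 1802) == 4434 == totalTokenCount.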

if thinking_enabled:
    params["includeThoughts"] = True
if thinking_budget:
    if not thinking_enabled:
Contributor:

Author:

I wasn't really sure what you were looking for. Take a look at the test I added and let me know if you wanted something else.
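
As a rough illustration of what such a test might look like (a sketch against the mapping in the diff context above; map_thinking_params is a hypothetical stand-in, not the PR's actual code):

    # Hypothetical test sketch for the thinking-parameter mapping shown in
    # the diff context above; names are placeholders.
    from typing import Optional

    def map_thinking_params(thinking_enabled: bool, thinking_budget: Optional[int]) -> dict:
        params: dict = {}
        if thinking_enabled:
            params["includeThoughts"] = True
        if thinking_budget is not None:
            params["thinkingBudget"] = thinking_budget
        return params

    def test_thinking_budget_zero_does_not_include_thoughts():
        params = map_thinking_params(thinking_enabled=False, thinking_budget=0)
        assert "includeThoughts" not in params
        assert params["thinkingBudget"] == 0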

@@ -910,6 +915,16 @@ def transform_response(
            completion_response=completion_response,
        )

        thinking_enabled = None
        if "gemini-2.5-flash" in model:
            # Only Gemini 2.5 Flash can have its thinking disabled by setting the thinking budget to zero
Contributor:

What happens on gemini-2.5-pro if you send it thinking budget = 0?

  • What is the response from gemini-2.5-flash vs. gemini-2.5-pro?

Author:

When I compared the behavior of gemini-2.5-flash and gemini-2.5-pro, setting the thinking budget to 0 only had an effect on gemini-2.5-flash.
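
For context, the request under discussion looks roughly like this (a sketch of the Vertex AI generationConfig.thinkingConfig block; the prompt text is a placeholder):

    # Sketch of a Vertex AI request body that disables thinking. Per the
    # discussion above, only gemini-2.5-flash honors a zero budget.
    request_body = {
        "contents": [{"role": "user", "parts": [{"text": "Hello"}]}],
        "generationConfig": {
            "thinkingConfig": {
                "includeThoughts": False,
                "thinkingBudget": 0,
            }
        },
    }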

@vinaynair:

Maybe it's related. Can you kindly confirm?

Using Gemini as per the docs at https://docs.litellm.ai/docs/tutorials/openai_codex:

codex -m gemini-2.0-flash --full-auto

with a prompt that attempts to use a screenshot / image:

Generate a web app with the backing fastapi based backend that mimics the a.png in this folder

I get the error below:

litellm.exceptions.APIConnectionError: litellm.APIConnectionError: Invalid user message={'role': 'user', 'content': [{'type': 'text', 'text': 'Generate a web app with the backing fastapi based backend that mimics the  in this folder'}, {'type': 'image', 'text': None}]} at index 1. Please ensure all user messages are valid OpenAI chat completion messages.
Traceback (most recent call last):
  File "/Users/vichandrasekharan/code/temp/codex/.venv/lib/python3.10/site-packages/litellm/utils.py", line 6315, in validate_chat_completion_user_messages
    raise Exception("invalid content type")
Exception: invalid content type

Contributor:

@vinaynair this is an unrelated error message.

It is because of this:

{'type': 'image', 'text': None}

which does look like invalid input.
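
For comparison, a well-formed OpenAI-style user message carries images as image_url parts (the URL and text here are placeholders):

    # A valid OpenAI chat completion user message with an image part, unlike
    # the {'type': 'image', 'text': None} part produced above.
    valid_message = {
        "role": "user",
        "content": [
            {"type": "text", "text": "Generate a web app that mimics this screenshot"},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64_PNG>"}},
        ],
    }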

Contributor:

Please file a separate ticket for a feature request where we filter this scenario.


@Classic298 (Contributor):

@krrishdholakia I think if thinking is enabled, all output tokens are charged at $3.50.
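
Taking the usageMetadata example earlier in this thread, and assuming the quoted rate means $3.50 per 1M output tokens (the commenter's claim, not a confirmed rate card), the cost works out as:

    # Worked example of the claimed pricing, using the usage numbers from the
    # earlier Vertex AI response; the rate is the commenter's claim.
    output_tokens = 2622 + 1802              # candidatesTokenCount + thoughtsTokenCount
    cost = output_tokens * 3.50 / 1_000_000  # ~$0.0155 for output tokens alone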


Successfully merging this pull request may close these issues.

[Feature]: Gemini 2.5 Flash - Vertex AI to be added to LiteLLM