
Add a raw generate API to the vLLM server #3227


Open · wants to merge 1 commit into main

Conversation

@wilrop (Contributor) commented on Apr 3, 2025

What does this PR do?

This PR adds a generate_raw endpoint to the vLLM server. While this is not strictly necessary (yet) for any trainer provided in TRL, it does allow for easier experimentation. For example, it is now straightforward to send curl requests and inspect the results. Additionally, this endpoint allows other processes to communicate with the same vLLM server used by TRL, which may be useful for some applications.
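
To give a concrete picture, a hypothetical request to such an endpoint might look like the sketch below (shown with Python's requests for readability; a curl call would work the same way). The host, port, route name, and payload fields here are assumptions for illustration, not the PR's actual schema:

```python
import requests

# Hypothetical call to the raw generation endpoint added by this PR.
# The URL and the request/response fields are assumptions; check the
# PR diff for the actual schema exposed by the server.
response = requests.post(
    "http://localhost:8000/generate_raw/",
    json={
        "prompts": ["The capital of France is"],
        "max_tokens": 32,
        "temperature": 0.7,
    },
)
response.raise_for_status()

# The point of the endpoint is that the full vLLM output (not just
# completion token IDs) comes back and can be inspected directly.
print(response.json())
```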

Speaking from personal experience, I actually need my TRL trainer and another process to utilise the same vLLM server, and this was the easiest way to get that working. I imagine others might find it useful as well.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@qgallouedec (Member) commented

> it is now straightforward to send curl requests and inspect the results

With the current implementation, you can send curl requests. Or maybe you want to have access to everything returned by vLLM, not just the IDs?

@qgallouedec (Member) commented

> Additionally, this endpoint allows other processes to communicate with the same vLLM server used by TRL, which may be useful for some applications.

Can you elaborate on this point? It's not very clear. Do you mean that it allows trl vllm-serve and vllm serve to be more aligned?

@wilrop (Contributor, Author) commented on Apr 5, 2025

> it is now straightforward to send curl requests and inspect the results

> With the current implementation, you can send curl requests. Or maybe you want to have access to everything returned by vLLM, not just the IDs?

Yes, exactly: having the full RequestOutput is quite convenient for debugging, or when another process wants direct access to the vLLM generations (see the sketch at the end of this comment for the kind of information it carries).

> Additionally, this endpoint allows other processes to communicate with the same vLLM server used by TRL, which may be useful for some applications.

> Can you elaborate on this point? It's not very clear. Do you mean that it allows trl vllm-serve and vllm serve to be more aligned?

Yes, this is essentially what I mean. In my own use case, I have a TRL trainer that uses the vLLM client for generation, but I also have another process that makes frequent generation calls to the model being trained by TRL. Intuitively, this other process wants fast generations from the latest version of the model. Having it communicate directly through the same server, which gets regular updates from the TRL trainer, greatly simplifies the workflow.

I realise that this PR is maybe a bit niche, but I actually have a real need for it, and I suppose others may find it useful in the future as well.
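
For context, here is a minimal local sketch of what the full output exposes when generating with vLLM directly (field names follow vLLM's RequestOutput and CompletionOutput classes; the model name is just illustrative, and serialising this over HTTP is what the new endpoint would handle):

```python
from vllm import LLM, SamplingParams

# Minimal local sketch: the full RequestOutput carries much more than
# just the completion token IDs returned by the existing endpoint.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # illustrative model choice
params = SamplingParams(max_tokens=32, temperature=0.7)

for request_output in llm.generate(["The capital of France is"], params):
    print(request_output.prompt)            # the original prompt text
    for completion in request_output.outputs:
        print(completion.text)               # decoded completion
        print(completion.token_ids)           # completion token IDs
        print(completion.finish_reason)       # e.g. "stop" or "length"
```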

@wilrop (Contributor, Author) commented on Apr 17, 2025

Any news on this PR? @qgallouedec
