CLI parameter to enable warm-up #580

Open
vrdn-23 opened this issue Apr 11, 2025 · 0 comments

vrdn-23 commented Apr 11, 2025

Feature request

Would it be possible to have a CLI option that triggers warm-up requests before the server starts, for arbitrary models? From what I understand, the warm-up requests would need to cover embed, classify, and rerank. It would essentially send a small dummy request so that the first real inference request is not slow.
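
A built-in flag would be ideal, but to make the idea concrete, here is a minimal sketch of an external warm-up client in Rust (it needs reqwest with the blocking and json features, plus serde_json). The /health and /embed routes are TEI's documented HTTP endpoints; everything else in this snippet is illustrative, not TEI code:

```rust
use std::{thread, time::Duration};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumes the router setup from the logs below: hostname 0.0.0.0, port 8000.
    let base = "http://0.0.0.0:8000";
    let client = reqwest::blocking::Client::new();

    // Poll /health until the router reports ready.
    loop {
        let ready = client
            .get(format!("{base}/health"))
            .send()
            .map(|r| r.status().is_success())
            .unwrap_or(false);
        if ready {
            break;
        }
        thread::sleep(Duration::from_millis(100));
    }

    // Send one tiny dummy request so the first real call does not pay
    // the slow-first-inference cost shown in the logs below.
    client
        .post(format!("{base}/embed"))
        .json(&serde_json::json!({ "inputs": "warmup" }))
        .send()?
        .error_for_status()?;

    Ok(())
}
```

The drawback of doing this externally is that it only runs after the server logs "Ready", so a real request can still race the warm-up; a CLI flag could run the dummy batch before the HTTP server starts accepting traffic.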

Motivation

Currently, I think only the Flash implementation of certain models does an automatic warm-up, but it would be nice to have a CLI argument that can perform a warm-up call for the models served using TEI; a sketch of what such a flag could look like is included below, after the logs.
Right now most models have a very slow first request:

For sentence-transformers/all-MiniLM-L6-v2, the first request takes ~1.6s:

{"timestamp":"2025-04-11T17:02:47.952264Z","level":"INFO","message":"Args { model_id: \"/dat*/*****/***-******-*6-v2\", revision: None, tokenization_workers: Some(2), dtype: Some(Float32), pooling: Some(Mean), max_concurrent_requests: 512, max_batch_tokens: 65536, max_batch_requests: None, max_client_batch_size: 128, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hostname: \"0.0.0.0\", port: 8000, uds_path: \"/tmp/text-embeddings-inference-server\", huggingface_hub_cache: Some(\"/root/.cache\"), payload_limit: 2000000, api_key: None, json_output: true, otlp_endpoint: None, otlp_service_name: \"text-embeddings-inference.server\", cors_allow_origin: None }","target":"text_embeddings_router","filename":"router/src/main.rs","line_number":175}
{"timestamp":"2025-04-11T17:02:47.962852Z","level":"WARN","message":"Could not find a Sentence Transformers config","target":"text_embeddings_router","filename":"router/src/lib.rs","line_number":184}
{"timestamp":"2025-04-11T17:02:47.962879Z","level":"INFO","message":"Maximum number of tokens per request: 512","target":"text_embeddings_router","filename":"router/src/lib.rs","line_number":188}
{"timestamp":"2025-04-11T17:02:47.962896Z","level":"INFO","message":"Starting 2 tokenization workers","target":"text_embeddings_core::tokenization","filename":"core/src/tokenization.rs","line_number":28}
{"timestamp":"2025-04-11T17:02:47.975107Z","level":"INFO","message":"Starting model backend","target":"text_embeddings_router","filename":"router/src/lib.rs","line_number":230}
{"timestamp":"2025-04-11T17:02:48.353249Z","level":"INFO","message":"Starting Bert model on Cuda(CudaDevice(DeviceId(1)))","target":"text_embeddings_backend_candle","filename":"backends/candle/src/lib.rs","line_number":275}
{"timestamp":"2025-04-11T17:03:03.652186Z","level":"INFO","message":"Starting HTTP server: 0.0.0.0:8000","target":"text_embeddings_router::http::server","filename":"router/src/http/server.rs","line_number":1812}
{"timestamp":"2025-04-11T17:03:03.652213Z","level":"INFO","message":"Ready","target":"text_embeddings_router::http::server","filename":"router/src/http/server.rs","line_number":1813}
{"timestamp":"2025-04-11T17:03:07.979256Z","level":"INFO","message":"Success","target":"text_embeddings_router::http::server","filename":"router/src/http/server.rs","line_number":714,"span":{"inference_time":"1.660399408s","queue_time":"400.617µs","tokenization_time":"171.673µs","total_time":"1.661061439s","name":"embed"},"spans":

For BAAI/bge-reranker-base, the first request takes ~1.2s:

{"timestamp":"2025-04-11T16:17:27.084973Z","level":"INFO","message":"Args { model_id: \"/dat*/*****/***-********-*ase\", revision: None, tokenization_workers: Some(2), dtype: Some(Float32), pooling: None, max_concurrent_requests: 512, max_batch_tokens: 65536, max_batch_requests: None, max_client_batch_size: 128, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hostname: \"0.0.0.0\", port: 8000, uds_path: \"/tmp/text-embeddings-inference-server\", huggingface_hub_cache: Some(\"/root/.cache\"), payload_limit: 2000000, api_key: None, json_output: true, otlp_endpoint: None, otlp_service_name: \"text-embeddings-inference.server\", cors_allow_origin: None }","target":"text_embeddings_router","filename":"router/src/main.rs","line_number":175}
{"timestamp":"2025-04-11T16:17:27.569870Z","level":"WARN","message":"Could not find a Sentence Transformers config","target":"text_embeddings_router","filename":"router/src/lib.rs","line_number":184}
{"timestamp":"2025-04-11T16:17:27.569895Z","level":"INFO","message":"Maximum number of tokens per request: 512","target":"text_embeddings_router","filename":"router/src/lib.rs","line_number":188}
{"timestamp":"2025-04-11T16:17:27.569918Z","level":"INFO","message":"Starting 2 tokenization workers","target":"text_embeddings_core::tokenization","filename":"core/src/tokenization.rs","line_number":28}
{"timestamp":"2025-04-11T16:17:28.303617Z","level":"INFO","message":"Starting model backend","target":"text_embeddings_router","filename":"router/src/lib.rs","line_number":230}
{"timestamp":"2025-04-11T16:17:28.449523Z","level":"INFO","message":"Starting Bert model on Cuda(CudaDevice(DeviceId(1)))","target":"text_embeddings_backend_candle","filename":"backends/candle/src/lib.rs","line_number":297}
{"timestamp":"2025-04-11T16:17:37.199346Z","level":"INFO","message":"Starting HTTP server: 0.0.0.0:8000","target":"text_embeddings_router::http::server","filename":"router/src/http/server.rs","line_number":1812}
{"timestamp":"2025-04-11T16:17:37.199367Z","level":"INFO","message":"Ready","target":"text_embeddings_router::http::server","filename":"router/src/http/server.rs","line_number":1813}
{"timestamp":"2025-04-11T16:17:39.675045Z","level":"INFO","message":"Success","target":"text_embeddings_router::http::server","filename":"router/src/http/server.rs","line_number":459,"span":{"inference_time":"1.227423716s","queue_time":"243.605µs","tokenization_time":"195.464µs","total_time":"1.227960617s","name":"rerank"},"spans":[{"inference_time":"1.227423716s","queue_time":"243.605µs","tokenization_time":"195.464µs","total_time":"1.227960617s","name":"rerank"}]}

I can provide more examples for models if required.
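
To make the CLI side of the ask concrete, here is a purely hypothetical sketch of how such a flag could be declared with clap, which the router already uses for the Args struct shown in the logs above. The warmup field name and its wiring are placeholders of mine, not actual TEI code:

```rust
use clap::Parser;

// Hypothetical sketch only: `warmup` and its wiring are invented for
// illustration and do not exist in the TEI router.
#[derive(Parser, Debug)]
struct Args {
    /// Run one small dummy embed/classify/rerank batch through the model
    /// backend before the HTTP server starts accepting traffic.
    #[clap(long)]
    warmup: bool,
    // ... the existing TEI arguments (model_id, pooling, ...) elided ...
}

fn main() {
    let args = Args::parse();
    if args.warmup {
        // Here the router would push a small dummy batch through the
        // model backend before binding the HTTP server.
        println!("running warm-up before accepting traffic");
    }
}
```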

Your contribution

I can help with testing and verifying the fix if required!
cc @Narsil @alvarobartt
