Do you need to file an issue?

- [x] I have searched the existing issues and this bug is not already filed.
- [x] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
- [x] I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.
Describe the bug
The id series of the sources dataframe output of the DRIFT SearchResult object, and the source ids cited in its response attribute, appear to be consistently misnumbered by -1. For instance, if the human_readable_id value of a given text unit in text_units is '2', the corresponding id value in the sources dataframe is '1'.
The response attribute of the DRIFT SearchResult object also appears to occasionally hallucinate source ids. For instance, response may reference "[Data: Sources (1)]" when there is no corresponding '1' in the id series of the sources dataframe of the SearchResult object.
Steps to reproduce
1. Execute a DRIFT search query (a sketch follows this list).
2. For a given text unit, inspect the corresponding id value in the sources dataframe of the SearchResult.context_data dict.
3. For the same text unit, inspect the corresponding human_readable_id in text_units.parquet.
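A minimal sketch of these steps, assuming the graphrag.api.drift_search entry point and the default parquet layout under output/. The keyword arguments (community_level, response_type, etc.) are my best reading of the v2.x API surface and should be verified against the installed version, and the join on the text column assumes the context's sources frame carries the raw chunk text:

```python
import asyncio
from pathlib import Path

import pandas as pd

from graphrag.api import drift_search
from graphrag.config.load_config import load_config

root = Path(".")  # project root containing settings.yaml and output/
config = load_config(root)
out = root / "output"
text_units = pd.read_parquet(out / "text_units.parquet")

# Step 1: execute a DRIFT search query. NOTE: this argument list is an
# assumption based on the v2.x API; check it against your installed v2.1.0.
response, context_data = asyncio.run(
    drift_search(
        config=config,
        entities=pd.read_parquet(out / "entities.parquet"),
        communities=pd.read_parquet(out / "communities.parquet"),
        community_reports=pd.read_parquet(out / "community_reports.parquet"),
        text_units=text_units,
        relationships=pd.read_parquet(out / "relationships.parquet"),
        community_level=2,
        response_type="multiple paragraphs",
        query="<your query here>",
    )
)

# Steps 2-3: line up the ids in the sources context frame against the
# human_readable_id values in text_units.parquet, joining on the chunk text
# (assumed identical in both frames).
sources = context_data["sources"]
merged = sources.merge(text_units, on="text", how="left")
print(merged[["id", "human_readable_id"]])  # observed: id == human_readable_id - 1
```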
Expected Behavior
For a given text unit, the id in the resulting sources dataframe of the SearchResult.context_data dict should be the same as the human_readable_id in text_units.parquet.
Only source ids actually used for context should be referenced in the generated response; a hedged check for this is sketched below.
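Continuing the sketch above, one way to surface the second symptom: extract every "[Data: Sources (...)]" citation from the response and confirm each cited id actually appears in the sources context frame. Any leftovers are the hallucinated references described in this report (this assumes the response is a plain string and that context_data["sources"] is a single dataframe):

```python
import re

# Collect all ids cited as e.g. "[Data: Sources (1, 5, 12)]" in the response.
cited = {
    int(token)
    for group in re.findall(r"Sources \(([\d,\s]+)\)", str(response))
    for token in group.split(",")
    if token.strip()
}

# Ids actually present in the sources context dataframe.
available = set(context_data["sources"]["id"].astype(int))

hallucinated = sorted(cited - available)
print("source ids cited but absent from context:", hallucinated)  # expected: []
```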
GraphRAG Config Used
```yaml
### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

models:
  default_chat_model:
    type: openai_chat # or azure_openai_chat
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-05-01-preview
    auth_type: api_key # or azure_managed_identity
    api_key: ${GRAPHRAG_API_KEY} # set this in the generated .env file
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    model: gpt-4o-mini
    # deployment_name: <azure_model_deployment_name>
    # encoding_model: cl100k_base # automatically set by tiktoken if left undefined
    model_supports_json: true # recommended if this is available for your model.
    concurrent_requests: 25 # max number of simultaneous LLM requests allowed
    async_mode: threaded # or asyncio
    retry_strategy: native
    max_retries: -1 # set to -1 for dynamic retry logic (most optimal setting based on server response)
    tokens_per_minute: 0 # set to 0 to disable rate limiting
    requests_per_minute: 0 # set to 0 to disable rate limiting
  default_embedding_model:
    type: openai_embedding # or azure_openai_embedding
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-05-01-preview
    auth_type: api_key # or azure_managed_identity
    api_key: ${GRAPHRAG_API_KEY}
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    model: text-embedding-3-small
    # deployment_name: <azure_model_deployment_name>
    # encoding_model: cl100k_base # automatically set by tiktoken if left undefined
    model_supports_json: true # recommended if this is available for your model.
    concurrent_requests: 25 # max number of simultaneous LLM requests allowed
    async_mode: threaded # or asyncio
    retry_strategy: native
    max_retries: -1 # set to -1 for dynamic retry logic (most optimal setting based on server response)
    tokens_per_minute: 0 # set to 0 to disable rate limiting
    requests_per_minute: 0 # set to 0 to disable rate limiting

vector_store:
  default_vector_store:
    type: lancedb
    db_uri: output\lancedb
    container_name: default
    overwrite: True

embed_text:
  model_id: default_embedding_model
  vector_store_id: default_vector_store

### Input settings ###

input:
  type: file # or blob
  file_type: text # [csv, text, json]
  base_dir: "input"

chunks:
  size: 600
  overlap: 50
  group_by_columns: [id]

### Output settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

cache:
  type: file # [file, blob, cosmosdb]
  base_dir: "cache"

reporting:
  type: file # [file, blob, cosmosdb]
  base_dir: "logs"

output:
  type: file # [file, blob, cosmosdb]
  base_dir: "output"

### Workflow settings ###

extract_graph:
  model_id: default_chat_model
  prompt: "prompts/extract_graph.txt"
  entity_types: [school of thought, hypothesis, concept, contention, knowledge gap, policy, country, gender, capability, livelihood, social harm, ecological harm, technology, norm]
  max_gleanings: 1

summarize_descriptions:
  model_id: default_chat_model
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

extract_graph_nlp:
  text_analyzer:
    extractor_type: regex_english # [regex_english, syntactic_parser, cfg]

extract_claims:
  enabled: true
  model_id: default_chat_model
  prompt: "prompts/extract_claims.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  model_id: default_chat_model
  graph_prompt: "prompts/community_report_graph.txt"
  text_prompt: "prompts/community_report_text.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes (embed_graph must also be enabled)

snapshots:
  graphml: true
  embeddings: false

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/local_search_system_prompt.txt"

global_search:
  chat_model_id: default_chat_model
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/drift_search_system_prompt.txt"
  reduce_prompt: "prompts/drift_search_reduce_prompt.txt"

basic_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/basic_search_system_prompt.txt"
```
Logs and screenshots
No response
Additional Information
GraphRAG Version: v2.1.0
Operating System: Windows 11
Python Version: 3.12.9
Related Issues: