
[Bug]: 'sources' ids in DRIFT SearchResult.context_data are misnumbered or hallucinated #1871

Open
3 tasks done
DaraDCE opened this issue Apr 8, 2025 · 0 comments
Labels
bug (Something isn't working), triage (Default label assignment, indicates new issue needs to be reviewed by a maintainer)

DaraDCE commented Apr 8, 2025

Do you need to file an issue?

  • I have searched the existing issues and this bug is not already filed.
  • My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
  • I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the bug

  1. The id series in the sources dataframe of the DRIFT SearchResult.context_data, and the source ids cited in the response attribute, appear to be consistently offset by -1 from the text unit ids. For instance, if the human_readable_id of a given text unit in text_units is '2', the corresponding id value in the sources dataframe is '1'.

  2. The response attribute of the DRIFT SearchResult object also appears to occasionally hallucinate source ids. For instance, the response may reference "[Data: Sources (1)]" when there is no corresponding '1' in the id series of the sources dataframes in SearchResult.context_data.

Steps to reproduce

  1. Execute a DRIFT search query
  2. For a given text unit, inspect the corresponding id value of the sources dataframes of the SearchResult.context_data dict
  3. For the same text unit, inspect the corresponding human_readable_id in text_units.parquet (a comparison sketch follows this list)
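
A minimal inspection sketch for steps 2-3, assuming `result` is the SearchResult returned by a DRIFT query and that the context 'sources' table is a single pandas DataFrame with 'id' and 'text' columns; the column names follow the issue description and the default pipeline outputs, not a confirmed API:

import pandas as pd

def compare_source_ids(result, text_units_path="output/text_units.parquet"):
    """Compare the 'id' column of the DRIFT context 'sources' dataframe with
    the 'human_readable_id' column of text_units.parquet."""
    sources = result.context_data["sources"]          # context table built for the LLM
    text_units = pd.read_parquet(text_units_path)     # pipeline output on disk

    # Pair each context source with its text unit by matching on the chunk text,
    # then compute the offset between the two id series.
    merged = sources.merge(text_units, on="text", suffixes=("_ctx", "_tu"))
    offset = merged["human_readable_id"].astype(int) - merged["id_ctx"].astype(int)
    return offset.value_counts()

With the off-by-one behavior described above, the returned counts are concentrated at 1 rather than the expected 0.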

Expected Behavior

  1. For a given text unit, the id series of the resulting sources dataframes in the SearchResult.context_data dict should match the human_readable_id values in text_units.parquet.

  2. Only source ids actually present in the context should be cited in the generated response (a quick check is sketched below).
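
A quick way to check expectation 2, assuming the response cites sources with the default "[Data: Sources (...)]" pattern and the 'sources' table exposes an 'id' column (both assumptions taken from the issue text rather than a confirmed API):

import re

def find_uncited_sources(result):
    """Return source ids cited in the response that do not appear in the
    'sources' dataframe of result.context_data."""
    cited = set()
    # Collect every numeric id inside each "Sources (...)" citation group.
    for group in re.findall(r"Sources \(([^)]*)\)", str(result.response)):
        cited.update(int(tok) for tok in re.findall(r"\d+", group))
    available = set(result.context_data["sources"]["id"].astype(int))
    return sorted(cited - available)

An empty list is the expected outcome; the hallucinated "[Data: Sources (1)]" case described above would show up here as [1].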

GraphRAG Config Used

### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

models:
  default_chat_model:
    type: openai_chat # or azure_openai_chat
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-05-01-preview
    auth_type: api_key # or azure_managed_identity
    api_key: ${GRAPHRAG_API_KEY} # set this in the generated .env file
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    model: gpt-4o-mini
    # deployment_name: <azure_model_deployment_name>
    # encoding_model: cl100k_base # automatically set by tiktoken if left undefined
    model_supports_json: true # recommended if this is available for your model.
    concurrent_requests: 25 # max number of simultaneous LLM requests allowed
    async_mode: threaded # or asyncio
    retry_strategy: native
    max_retries: -1                   # set to -1 for dynamic retry logic (most optimal setting based on server response)
    tokens_per_minute: 0              # set to 0 to disable rate limiting
    requests_per_minute: 0            # set to 0 to disable rate limiting
  default_embedding_model:
    type: openai_embedding # or azure_openai_embedding
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-05-01-preview
    auth_type: api_key # or azure_managed_identity
    api_key: ${GRAPHRAG_API_KEY}
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    model: text-embedding-3-small
    # deployment_name: <azure_model_deployment_name>
    # encoding_model: cl100k_base # automatically set by tiktoken if left undefined
    model_supports_json: true # recommended if this is available for your model.
    concurrent_requests: 25 # max number of simultaneous LLM requests allowed
    async_mode: threaded # or asyncio
    retry_strategy: native
    max_retries: -1                   # set to -1 for dynamic retry logic (most optimal setting based on server response)
    tokens_per_minute: 0              # set to 0 to disable rate limiting
    requests_per_minute: 0            # set to 0 to disable rate limiting

vector_store:
  default_vector_store:
    type: lancedb
    db_uri: output\lancedb
    container_name: default
    overwrite: True

embed_text:
  model_id: default_embedding_model
  vector_store_id: default_vector_store

### Input settings ###

input:
  type: file # or blob
  file_type: text # [csv, text, json]
  base_dir: "input"

chunks:
  size: 600
  overlap: 50
  group_by_columns: [id]

### Output settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

cache:
  type: file # [file, blob, cosmosdb]
  base_dir: "cache"

reporting:
  type: file # [file, blob, cosmosdb]
  base_dir: "logs"

output:
  type: file # [file, blob, cosmosdb]
  base_dir: "output"

### Workflow settings ###

extract_graph:
  model_id: default_chat_model
  prompt: "prompts/extract_graph.txt"
  entity_types: [school of thought, hypothesis, concept, contention,
  knowledge gap, policy, country, gender, capability, livelihood, social harm,
  ecological harm, technology, norm]
  max_gleanings: 1

summarize_descriptions:
  model_id: default_chat_model
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

extract_graph_nlp:
  text_analyzer:
    extractor_type: regex_english # [regex_english, syntactic_parser, cfg]

extract_claims:
  enabled: true
  model_id: default_chat_model
  prompt: "prompts/extract_claims.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  model_id: default_chat_model
  graph_prompt: "prompts/community_report_graph.txt"
  text_prompt: "prompts/community_report_text.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes (embed_graph must also be enabled)

snapshots:
  graphml: true
  embeddings: false

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/local_search_system_prompt.txt"

global_search:
  chat_model_id: default_chat_model
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/drift_search_system_prompt.txt"
  reduce_prompt: "prompts/drift_search_reduce_prompt.txt"

basic_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/basic_search_system_prompt.txt"

Logs and screenshots

No response

Additional Information

  • GraphRAG Version: v2.1.0
  • Operating System: Windows 11
  • Python Version: 3.12.9
  • Related Issues:
DaraDCE added the bug and triage labels Apr 8, 2025