Evaluating Graphs in GraphRAG and the Impact of Graph Structure on LLM Reasoning Performance

Jul 3, 2024

Today, I’d like to discuss the evaluation methods for GraphRAG. When thinking about RAG evaluation, what common methods come to mind? You’d likely divide the evaluation into two categories: retrieval and generation. For retrieval, ranking metrics such as NDCG, MRR, and MAP are commonly used. For generation, text-overlap and similarity metrics like BLEU, ROUGE, BERTScore, and METEOR are typical. (Based on the AutoRAG documentation: https://github.com/Marker-Inc-Korea/AutoRAG)
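To make the retrieval side concrete, here is a minimal sketch of MRR and NDCG in plain Python. The function names, document ids, and relevance grades are made up for illustration; this is not AutoRAG's implementation.

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant item, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one pair per query."""
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


def ndcg_at_k(ranked_ids, gains, k):
    """gains: dict of doc id -> graded relevance; ids missing from it count as 0."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical results for two queries.
runs = [(["d3", "d1", "d7"], {"d1"}), (["d2", "d5"], {"d2"})]
mrr = mean_reciprocal_rank(runs)  # (1/2 + 1/1) / 2 = 0.75
ndcg = ndcg_at_k(["d1", "d3"], {"d1": 2, "d3": 1}, k=2)  # ideal order, so 1.0
```

MRR only looks at the first relevant hit, while NDCG rewards putting the most relevant items near the top, so the two can disagree on the same ranking.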

Additionally, there are evaluation methods where LLMs serve as the evaluators, including single-point, reference-based, and pairwise scoring. Beyond just using LLMs as judges, evaluation frameworks like ARES and RAGAS illustrate a growing trend of relying on LLMs. These frameworks separate the LLM that generates the answers (the RAG generation LLM) from the LLM that evaluates them, thus entrusting the entire process of answer generation and evaluation to LLMs.

So, how does GraphRAG evaluate its performance? The key lies in transforming the results obtained from subgraph extraction into token form and then applying token-based retrieval evaluators. Before delving into subgraph extraction, let’s briefly look at the main types of graphs used in GraphRAG. Generally, there are two primary types. The first type is the LPG (Labeled Property Graph) format, in which additional data can be stored on the graph as properties.

Figure 1. LPG vs. RDF for graph data representation

Figure 2. Graph modeling for a chatbot in a GraphRAG scenario

The second type is the RDF (Resource Description Framework) format, which is well-suited for logical reasoning graphs in the subject-predicate-object structure.
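As a rough illustration of the difference, the same small fact set can be written in both styles with plain Python structures. The identifiers, property names, and the `lpg_to_triples` helper are hypothetical and chosen only for this sketch.

```python
# LPG style: nodes and edges are records that can carry arbitrary properties.
lpg_nodes = {
    "p1": {"label": "Person", "name": "Alice", "age": 34},
    "c1": {"label": "Company", "name": "Acme"},
}
lpg_edges = [
    {"src": "p1", "dst": "c1", "type": "WORKS_AT", "since": 2020},
]

# RDF style: everything, including attribute values, is a
# (subject, predicate, object) triple.
rdf_triples = [
    ("ex:Alice", "rdf:type", "ex:Person"),
    ("ex:Alice", "ex:age", "34"),
    ("ex:Acme", "rdf:type", "ex:Company"),
    ("ex:Alice", "ex:worksAt", "ex:Acme"),
]


def lpg_to_triples(nodes, edges):
    """Flatten an LPG into triples.

    Edge properties such as 'since' are dropped in this sketch, because a
    plain triple has no slot for data attached to the relationship itself.
    """
    triples = []
    for node_id, props in nodes.items():
        for key, value in props.items():
            triples.append((node_id, key, str(value)))
    for edge in edges:
        triples.append((edge["src"], edge["type"], edge["dst"]))
    return triples


flattened = lpg_to_triples(lpg_nodes, lpg_edges)
```

The lost `since` property is the practical difference: LPG keeps data on the relationship, while the triple form needs extra modeling to say the same thing.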

Depending on which of these two forms, LPG or RDF, you use to manage your graph, the storage method and the way you apply it in GraphRAG will differ. So, how can we effectively extract subgraphs?

For the first type, LPG, since additional data can be stored in the graph as properties, we should consider extracting based on these properties. You can embed text values and build a vector index over them, designate one of the property values as metadata for metadata filtering, or use the text values themselves to extract relevant subgraphs with exact-match techniques like full-text search.
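The three property-based options can be sketched over an in-memory node list. Everything here is an assumption made for illustration: the node records, the three-dimensional embeddings, and the helper names. A real system would use the graph database's own vector and full-text indexes, and the substring match below is only a stand-in for full-text search.

```python
import math

# Hypothetical LPG nodes with a text property, a metadata property, and an embedding.
nodes = [
    {"id": "n1", "text": "Graph databases store nodes and edges",
     "topic": "graph", "embedding": [0.9, 0.1, 0.0]},
    {"id": "n2", "text": "Vector indexes speed up similarity search",
     "topic": "search", "embedding": [0.1, 0.9, 0.1]},
    {"id": "n3", "text": "Property graphs attach key-value data to edges",
     "topic": "graph", "embedding": [0.8, 0.2, 0.1]},
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_search(query_embedding, k=2):
    """Rank nodes by embedding similarity (brute force instead of a real index)."""
    ranked = sorted(nodes, key=lambda n: cosine(query_embedding, n["embedding"]),
                    reverse=True)
    return [n["id"] for n in ranked[:k]]


def metadata_filter(topic):
    """Keep only nodes whose 'topic' property matches exactly."""
    return [n["id"] for n in nodes if n["topic"] == topic]


def full_text_search(term):
    """Case-insensitive substring match over the text property."""
    return [n["id"] for n in nodes if term.lower() in n["text"].lower()]


by_vector = vector_search([1.0, 0.0, 0.0])
by_meta = metadata_filter("graph")
by_text = full_text_search("edges")
```

In practice these are often combined, for example filtering by metadata first and then ranking the survivors by vector similarity.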

For the second type, RDF, since the data mostly has a logical subject-predicate-object structure, you can design inference rules to perform inductive or deductive reasoning from the subject to the object. Alternatively, because RDF structures are diverse, you can use pathfinding algorithms and organize the graph structure so that it is clear which parts to retrieve from, and then perform graph search and retrieval over those parts.
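Both ideas, a deductive rule and a path search from subject to object, fit in a few lines over a set of triples. The triples, the `locatedIn` transitivity rule, and the function names are illustrative assumptions, not something the post or a specific RDF store prescribes.

```python
from collections import deque

# Hypothetical triples for the sketch.
triples = {
    ("Seoul", "locatedIn", "Korea"),
    ("Korea", "locatedIn", "Asia"),
    ("Alice", "livesIn", "Seoul"),
}


def infer_transitive(triples, predicate):
    """Deductive rule: (a p b) and (b p c) => (a p c), repeated until stable."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        pairs = [(s, o) for s, p, o in inferred if p == predicate]
        for a, b in pairs:
            for b2, c in pairs:
                if b == b2 and (a, predicate, c) not in inferred:
                    inferred.add((a, predicate, c))
                    changed = True
    return inferred


def find_path(triples, start, goal):
    """Breadth-first search; returns the triples along one shortest path, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for s, p, o in sorted(triples):  # sorted for a deterministic order
            if s == node and o not in seen:
                seen.add(o)
                queue.append((o, path + [(s, p, o)]))
    return None


closure = infer_transitive(triples, "locatedIn")
path = find_path(triples, "Alice", "Asia")
```

The path returned by `find_path` is itself a small subgraph, which is what gets handed to the LLM after being converted to text.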

With LPG, on the other hand, you can also create new connections between nodes whose embeddings are similar. Clustering algorithms such as community detection then group related knowledge into communities, and the hierarchy within and between communities is used to provide comprehensive answers. This is the approach Microsoft recently released as GraphRAG (Edge et al., 2024).
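A toy version of this idea: link nodes whose embeddings are similar, then group the resulting graph into clusters. Note the substitution: connected components are used here as a crude stand-in for a real community detection algorithm, and the embeddings, threshold, and names are invented for the sketch. It is not Microsoft's pipeline.

```python
import math
from itertools import combinations

# Hypothetical 2-d embeddings for four documents.
embeddings = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.9, 0.1],
    "doc_c": [0.0, 1.0],
    "doc_d": [0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_edges(embeddings, threshold=0.8):
    """Create a new edge for every pair of nodes at or above the threshold."""
    return [(u, v) for u, v in combinations(sorted(embeddings), 2)
            if cosine(embeddings[u], embeddings[v]) >= threshold]


def connected_components(nodes, edges):
    """Group nodes reachable from one another (stand-in for community detection)."""
    adjacency = {n: set() for n in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, clusters = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, cluster = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            cluster.add(node)
            stack.extend(adjacency[node] - seen)
        clusters.append(sorted(cluster))
    return clusters


edges = similarity_edges(embeddings)
clusters = connected_components(embeddings, edges)
```

Each cluster could then be summarized on its own, with summaries of summaries forming the higher levels of the hierarchy.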

Ultimately, subgraphs are extracted from knowledge graphs designed with these two perspectives. The resulting outputs are then converted into text (token) formats and evaluated using metrics like token F1 and recall.
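Here is a minimal sketch of that last step: linearize the retrieved and gold subgraphs into text, then compute token-level precision, recall, and F1. The whitespace tokenizer, the triple linearization, and the example triples are assumptions made for illustration.

```python
from collections import Counter


def linearize(triples):
    """Turn (subject, predicate, object) triples into one flat string."""
    return " ".join(f"{s} {p} {o}" for s, p, o in triples)


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately simple assumption)."""
    return text.lower().split()


def token_scores(predicted, reference):
    """Token-level precision, recall, and F1 using multiset overlap."""
    pred, ref = Counter(tokenize(predicted)), Counter(tokenize(reference))
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical retrieved subgraph vs. gold subgraph.
retrieved = [("Alice", "works_at", "Acme"), ("Acme", "located_in", "Seoul")]
gold = [("Alice", "works_at", "Acme"), ("Acme", "located_in", "Busan")]

scores = token_scores(linearize(retrieved), linearize(gold))
# 5 of 6 tokens overlap on both sides, so precision = recall = f1 = 5/6
```

Because the score is computed on tokens, the way a subgraph is linearized affects the number, so the same linearization should be used for both the retrieved and the gold side.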

If you want to focus more on graph structures in order to assess and improve them, I recommend the recent paper published by Google, “Talk like a graph” (Fatemi et al., 2023). Since a designed knowledge graph eventually takes on one of many types of graph structure, this paper can help you understand the characteristics of each structure and how they impact the results, allowing you to design knowledge graph searches accordingly.

Figure 3. Standard graph structures that we usually face in the real world

The seven graph structures depicted in Figure 3 are common types that we often encounter. When you construct graphs using RDF or LPG and then extract subgraphs, it is highly likely that these subgraphs will belong to one of the seven structures. Therefore, the key is how to extract meaningful nodes and edges within these structures.
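To get a feel for how different such structures are, the sketch below builds three simple shapes (path, star, and complete graphs) as edge lists and compares their degree sequences. That these three are among the seven in the figure is my assumption; the random-graph families are left out, and the function names are illustrative.

```python
def path_graph(n):
    """Nodes 0..n-1 connected in a line."""
    return [(i, i + 1) for i in range(n - 1)]


def star_graph(n):
    """Node 0 is the hub; every other node connects only to it."""
    return [(0, i) for i in range(1, n)]


def complete_graph(n):
    """Every pair of nodes is connected."""
    return [(i, j) for i in range(n) for j in range(i + 1, n)]


def degree_sequence(edges, n):
    """Sorted node degrees, a quick fingerprint of a graph's shape."""
    degrees = [0] * n
    for u, v in edges:
        degrees[u] += 1
        degrees[v] += 1
    return sorted(degrees, reverse=True)


star_degrees = degree_sequence(star_graph(5), 5)          # [4, 1, 1, 1, 1]
path_degrees = degree_sequence(path_graph(5), 5)          # [2, 2, 2, 1, 1]
complete_degrees = degree_sequence(complete_graph(4), 4)  # [3, 3, 3, 3]
```

Running the same check on an extracted subgraph gives a quick hint of which shape it is closest to before it is passed to the LLM.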

Figure 4. Graph structure has a significant impact on the LLM’s performance

In addition to Figure 4, the paper includes various experiments, such as how the graph encoding function, the prompt question, and model capacity each relate to graph reasoning, as well as how performance varies when edges are absent during reasoning. Those who are interested should take a look at the paper for more detailed insights.
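A graph encoding function is simply a rule for turning a graph into prompt text. The sketch below shows two possible encodings of the same graph, one listing edges and one listing each node's neighbors. The wording of the sentences is my own illustration and not the exact templates used in the paper.

```python
def encode_adjacency(nodes, edges):
    """Describe the graph as a node list followed by an edge list."""
    node_list = ", ".join(str(n) for n in nodes)
    edge_list = " ".join(f"({u}, {v})" for u, v in edges)
    return f"In an undirected graph, the nodes are {node_list}. The edges are: {edge_list}."


def encode_incident(nodes, edges):
    """Describe the graph node by node, listing each node's neighbors."""
    neighbors = {n: [] for n in nodes}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    lines = [
        f"Node {n} is connected to nodes {', '.join(map(str, sorted(neighbors[n])))}."
        for n in nodes if neighbors[n]
    ]
    return "\n".join(lines)


def build_prompt(encoded_graph, question):
    """Combine an encoded graph with a question for the LLM."""
    return f"{encoded_graph}\nQuestion: {question}\nAnswer:"


nodes = [0, 1, 2]
edges = [(0, 1), (1, 2)]
adjacency_text = encode_adjacency(nodes, edges)
incident_text = encode_incident(nodes, edges)
prompt = build_prompt(incident_text, "Is there an edge between node 0 and node 2?")
```

The same graph and question produce different prompts depending on the encoder, which is why the choice of encoding can change the model's answer.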

We have examined the different graph structures and how their performance varies through the figures and text in the paper. This provides a basis for understanding what type of graph structure might be formed when converting a company’s data into graph form and how this structure might significantly impact LLM performance. The references at the end of this post provide more detailed information.

Today, we discussed the factors to consider for evaluating GraphRAG and how to approach these considerations. Knowledge graphs designed from LPG and RDF perspectives differ, as do the graphs retrieved from them. It’s important to reflect on the purpose of using GraphRAG and design accordingly. The key takeaway is that the designed knowledge graphs are managed and evaluated in token form. Many are contemplating how to leverage GraphRAG to improve RAG performance, and I hope this post has helped alleviate some of that uncertainty.

Thank you for reading this long post. Have a great day.

LinkedIn: linkedin.com/in/yitaejeong

References

1. Edge, Darren, et al. “From local to global: A graph rag approach to query-focused summarization.” arXiv preprint arXiv:2404.16130 (2024).

2. Fatemi, Bahare, Jonathan Halcrow, and Bryan Perozzi. “Talk like a graph: Encoding graphs for large language models.” arXiv preprint arXiv:2310.04560 (2023).

Written by Jeong Yitae
