GraphRAG, Let’s check the rationality with the paper and a few checklists to see if it’s time to introduce it to the RAG we’re developing now.

Jeong Yitae
9 min readMar 3, 2024

--

Email : jeongiitae6@gmail.com

Linkedin : https://www.linkedin.com/in/ii-tae-jeong/

G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering

PaperLink : https://arxiv.org/pdf/2402.07630.pdf

CodeLink : https://github.com/XiaoxinHe/G-Retriever

Keywords : Graph RAG , Retrieval-augmented generation , Large Language Model, hallucination , Graph Neural Network

Before we discuss GraphRAG and review this paper, we would like to know the GraphRAG concept.

  • Utilizing graph embeddings from GNN (graph neural networks) for user query response inference, this approach adds graph embeddings to text embeddings. Termed soft-prompting, it is a form of prompt engineering.
  • Prompt engineering can be broadly categorized into Hard and Soft. Hard prompts are explicit, where context is manually added to the given user query. For example, if a user query is “I want to eat bread today,” a hard prompt might explicitly outline the task, context, persona, example, format, and tone, requiring input on six dimensions. This method is subjective, with the prompt creator’s bias heavily influencing its optimization. However, its simplicity is advantageous.
  • Conversely, Soft prompts are implicit, enhancing existing text embeddings with additional embedding information as the model infers answers similar to the query. This method ensures objectivity and optimizes weight values but requires more complex model design and implementation.

When to Use GraphRAG

  • GraphRAG isn’t a one-size-fits-all solution. If the existing RAG works well, switching to the more advanced GraphRAG without a compelling reason may not be well-received. Any system improvement requires justification to answer why it’s necessary.
  • GraphRAG is worth considering when there’s a mismatch between retrieved information and user intent, a fundamental limitation similar to vector search. Since retrieval is based on similarity rather than exact matches, it may yield inaccurate information.
  • Improvements might involve introducing BM25 for an exact search in a hybrid search approach, enhancing the ranking process with reranker functions, or fine-tuning to improve embedding quality. If these efforts result in minimal RAG performance improvements, considering GraphRAG is advisable.

How G-Retriever is works?

1.Indexing

  • This passage describes the process of refining and storing data in a format that is readily accessible for use beforehand. In GraphRAG, the information to be utilized in advance refers to the textual information contained within the properties of nodes and edges in the graph. To convert this information into quantifiable values, a language model is employed.

2.Retrieval

  • This passage discusses the process of measuring and retrieving data based on its relevance to a user’s query. To assess the relevance, the language model evaluates the similarity between the ‘query’ and the values of ‘nodes’ and ‘edges’ within the graph, utilizing the K-nearest neighbors algorithm for this purpose.

3.Subgraph Construction

  • Unlike other RAG (Retrieval-Augmented Generation) models that retrieve documents, GraphRAG needs to fetch graphs relevant to the user’s query. In the initial retrieval process, simply comparing the user’s text with the graph’s text to fetch information does not strictly utilize the semantics of the graph’s connections.
  • To leverage this, it’s necessary to assess how much semantic similarity each node and edge has with the user’s query. For this assessment, the PCST (Prize-Collecting Steiner Tree) approach is utilized.
  • Briefly explaining the PCST approach: both nodes and edges are assigned prizes. The value of these prizes is determined by using the ranking similarities of nodes and edges to the user’s query, identified in the earlier retrieval process. Higher prizes are awarded to nodes similar to the query, while dissimilar nodes may receive lower values or even zero.
  • The prizes are summed across connected nodes and edges, extracting those with high total values. This total represents the nodes and edges with the highest sums. To manage the size of the subgraph, a parameter called ‘cost’ is used to pre-determine how much penalty to assign to each edge, effectively controlling the subgraph size.
  • Ultimately, this process extracts subgraphs containing information similar to the user’s query, while also managing the subgraph size through the cost parameter.

4.Answer generation

  • This passage describes the process of generating answers to queries by combining text embedding values with graph embedding values. Here, text embeddings refer to the values from the self-attention layers of a pre-trained Language Learning Model (LLM) that are kept frozen, meaning their weights are not updated during training.
  • By utilizing graph embedding values for training, it takes advantage of the soft-prompt technique mentioned earlier, which involves extracting and updating optimized weight values to incorporate semantics into answer generation.
  • The method for deriving graph embedding values and combining them with text embedding values is straightforward:
    1. Generate node embeddings using a Graph Neural Network (GNN).
    2. Aggregate these values using a pooling layer.
    3. To align the dimensions of the pooled graph embedding values with the text embedding values, project them through a Multi-Layer Perceptron (MLP) layer.
  • This process underscores the synergy between text and graph embeddings in enhancing the semantic richness of the generated answers, leveraging the strengths of both pre-trained models and graph-based information.

G-Retrieval Insight

1.Efficiency Retrieval

  • I think that the criteria may vary. In this paper, we discuss efficiency in terms of the comparison before and after retrieval, based on how much the usage of tokens is saved.
  • One of the key aspects of RAG (Retrieval-Augmented Generation) is the emphasis on including the most optimal information within the given token capacity. When utilizing G-Retrieval, there is a remarkable effect observed, where the amount of tokens decreases significantly, ranging from 83% to 99%.

2.Architecture

  • To demonstrate the effectiveness of G-Retriever, we conduct comparative experiments across three different architectures: 1. Architecture utilizing only pretrained weights, 2. Architecture that employs both pretrained weights and prompt engineering, 3. Architecture leveraging finetuned weights along with prompt engineering. Each of these architectures has its own implications.
  • The first architecture aims to determine the significance of the textual graph. The second architecture is designed to explore the meaningfulness of soft-prompts through the use of a Graph Encoder and projection. Finally, the third architecture seeks to understand the importance of optimizing LLM (Language Learning Model) weights independently of the Graph.

3.Performance

  • The results of the ablation study are also interesting. In particular, it can be observed that performance decreases by nearly 13% in parts related to the graph, specifically in scenarios without Edge Retrieval. This suggests that Edge, or in other words, Semantic Retrieval, plays a critical role in the RAG (Retrieval-Augmented Generation) framework.

In the end, what we must keep in mind in GraphRAG

  • Retrieving a graph is important, but how the entire graph is designed is equally crucial. In this idea, we only showcased retrieval using a benchmark dataset for knowledge graphs, omitting the story behind the graph’s construction.
  • With this in mind, we recommend proceeding with the task while maintaining fundamental questions about how nodes were created, how edges were formed, and why semantics were set in a particular way.

Core concept code with explaination

def retrieval_via_pcst(graph, q_emb, textual_nodes, textual_edges, topk=3, topk_e=3, cost_e=0.5):
c = 0.01
if len(textual_nodes) == 0 or len(textual_edges) == 0:
desc = textual_nodes.to_csv(index=False) + '\n' + textual_edges.to_csv(index=False, columns=['src', 'edge_attr', 'dst'])
graph = Data(x=graph.x, edge_index=graph.edge_index, edge_attr=graph.edge_attr, num_nodes=graph.num_nodes)
return graph, desc

root = -1 # unrooted
num_clusters = 1
pruning = 'gw'
verbosity_level = 0
if topk > 0:
n_prizes = torch.nn.CosineSimilarity(dim=-1)(q_emb, graph.x)
topk = min(topk, graph.num_nodes)
_, topk_n_indices = torch.topk(n_prizes, topk, largest=True)

n_prizes = torch.zeros_like(n_prizes)
n_prizes[topk_n_indices] = torch.arange(topk, 0, -1).float()
else:
n_prizes = torch.zeros(graph.num_nodes)

if topk_e > 0:
e_prizes = torch.nn.CosineSimilarity(dim=-1)(q_emb, graph.edge_attr)
topk_e = min(topk_e, e_prizes.unique().size(0))

topk_e_values, _ = torch.topk(e_prizes.unique(), topk_e, largest=True)
e_prizes[e_prizes < topk_e_values[-1]] = 0.0
last_topk_e_value = topk_e
for k in range(topk_e):
indices = e_prizes == topk_e_values[k]
value = min((topk_e-k)/sum(indices), last_topk_e_value-c)
e_prizes[indices] = value
last_topk_e_value = value
# cost_e = max(min(cost_e, e_prizes.max().item()-c), 0)
else:
e_prizes = torch.zeros(graph.num_edges)

costs = []
edges = []
vritual_n_prizes = []
virtual_edges = []
virtual_costs = []
mapping_n = {}
mapping_e = {}
for i, (src, dst) in enumerate(graph.edge_index.T.numpy()):
prize_e = e_prizes[i]
if prize_e <= cost_e:
mapping_e[len(edges)] = i
edges.append((src, dst))
costs.append(cost_e - prize_e)
else:
virtual_node_id = graph.num_nodes + len(vritual_n_prizes)
mapping_n[virtual_node_id] = i
virtual_edges.append((src, virtual_node_id))
virtual_edges.append((virtual_node_id, dst))
virtual_costs.append(0)
virtual_costs.append(0)
vritual_n_prizes.append(prize_e - cost_e)

prizes = np.concatenate([n_prizes, np.array(vritual_n_prizes)])
num_edges = len(edges)
if len(virtual_costs) > 0:
costs = np.array(costs+virtual_costs)
edges = np.array(edges+virtual_edges)

vertices, edges = pcst_fast(edges, prizes, costs, root, num_clusters, pruning, verbosity_level)

selected_nodes = vertices[vertices < graph.num_nodes]
selected_edges = [mapping_e[e] for e in edges if e < num_edges]
virtual_vertices = vertices[vertices >= graph.num_nodes]
if len(virtual_vertices) > 0:
virtual_vertices = vertices[vertices >= graph.num_nodes]
virtual_edges = [mapping_n[i] for i in virtual_vertices]
selected_edges = np.array(selected_edges+virtual_edges)

edge_index = graph.edge_index[:, selected_edges]
selected_nodes = np.unique(np.concatenate([selected_nodes, edge_index[0].numpy(), edge_index[1].numpy()]))

n = textual_nodes.iloc[selected_nodes]
e = textual_edges.iloc[selected_edges]
desc = n.to_csv(index=False)+'\n'+e.to_csv(index=False, columns=['src', 'edge_attr', 'dst'])

mapping = {n: i for i, n in enumerate(selected_nodes.tolist())}

x = graph.x[selected_nodes]
edge_attr = graph.edge_attr[selected_edges]
src = [mapping[i] for i in edge_index[0].tolist()]
dst = [mapping[i] for i in edge_index[1].tolist()]
edge_index = torch.LongTensor([src, dst])
data = Data(x=x, edge_index=edge_index, edge_attr=edge_attr, num_nodes=len(selected_nodes))

return data, desc

** Original Code Resource : https://github.com/XiaoxinHe/G-Retriever

  • The provided code outlines a function designed to perform subgraph extraction based on the Prize-Collecting Steiner Tree (PCST) approach. The idea is to select a subset of nodes and edges from a given graph that are most relevant to a specific query embedding (`q_emb`). This method is particularly useful in scenarios where the graph represents textual data, and you’re interested in extracting a coherent and relevant subgraph based on semantic similarity to a query.

Let’s break down the key parts of the function for better understanding:

Function Parameters:

- `graph`: The original graph from which the subgraph is to be extracted. Expected to be a PyTorch Geometric `Data` object.
- `q_emb`: The query embedding vector representing the query’s semantic content.
- `textual_nodes`, `textual_edges`: Pandas DataFrames containing information about the nodes and edges of the `graph`.
- `topk`, `topk_e`: Parameters specifying the number of top nodes and edges to consider based on similarity to `q_emb`.
- `cost_e`: A threshold cost for including edges in the solution.

Key Steps Explained:

1. **Early Return for Empty Graph Components**: If there are no textual nodes or edges, it immediately returns the original graph and a description derived from `textual_nodes` and `textual_edges`.

2. **Initialization**: Sets up variables for PCST including the root (unrooted in this case), number of clusters, and pruning method.

3. **Node and Edge Prize Calculation**:
— Calculates similarity scores (`n_prizes` for nodes, `e_prizes` for edges) between the query embedding and graph components using cosine similarity.
— Adjusts these scores to determine the “prizes” for including each node or edge in the subgraph. For edges, it further filters them based on the `cost_e` threshold.

4. **Graph Transformation for PCST**:
— Transforms the original graph into a format suitable for PCST by potentially introducing “virtual” nodes and adjusting edges and their costs based on the computed prizes and costs.

5. **PCST Algorithm Execution**:
— Runs the PCST algorithm (`pcst_fast`) on the transformed graph to select a subset of nodes and edges that form the optimal subgraph based on the given prizes and costs.

6. **Subgraph Reconstruction**:
— Extracts the selected nodes and edges based on the output of the PCST algorithm.
— Reconstructs the subgraph using the selected components, ensuring that the resulting subgraph is connected and relevant to the query.

7. **Subgraph Description Generation**:
— Generates a textual description of the selected subgraph by converting the relevant parts of `textual_nodes` and `textual_edges` DataFrames to CSV format.

8. **Return**: The function returns the reconstructed subgraph as a PyTorch Geometric `Data` object along with its textual description.

### Annotations for Clarity:

- **Prize Calculation**: The prizes for nodes and edges are derived from their semantic relevance to the query. Higher similarity scores lead to higher prizes, indicating a stronger preference for including these components in the subgraph.

- **Virtual Nodes and Edges**: Introduced to facilitate the PCST algorithm. They represent potential modifications to the original graph’s structure to accommodate the prize and cost model. Virtual nodes act as intermediaries, adjusting the connectivity based on the optimization process.

- **PCST Algorithm**: The core of the function, `pcst_fast`, is an external algorithm that takes the transformed graph (with prizes and costs) and identifies the optimal subgraph. This step is where the actual optimization occurs.

- **Mapping and Reconstruction**: After identifying the optimal components, the function maps them back to the original graph’s context, ensuring that the resulting subgraph is accurately represented and relevant to the query.

This function encapsulates a complex process of graph optimization tailored to extracting semantically relevant subgraphs based on the PCST model, making it a powerful tool for tasks like document summarization, information retrieval, and knowledge graph exploration.

--

--

Jeong Yitae

Linkedin : jeongiitae / i'm the graph and network data enthusiast. I always consider how the graph data is useful in the real-world.