Is a knowledge graph really helpful for reasoning and for improving LLM performance?
As the LLM industry evolves, we have faced the challenge of improving model performance with respect to hallucination, data privacy, and gaps in knowledge.
Personally, I divide the approaches to this into three categories: prompting, fine-tuning, and retrieval-augmented generation. Each has its own pros and cons.
In this post, I take a deep dive into a paper on prompt engineering with knowledge graphs and answer my own question: is it really helpful to construct your own knowledge graph?
Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs.
Paper link — Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs
Intro
Is the prompt really that important? I once asked this question myself. I have seen many omakase chefs around me look skeptically at the notion that if a person carefully composes and inputs a command, the result will also turn out well. I can fully understand this perspective.
Even though there are established formulas, just as the Korean saying goes that ‘Ah’ and ‘Eo’ are different, varying the nuance of each prompt can bring about significant changes in the output. Moreover, the idea of giving written instructions directly to a machine was unfamiliar, so it makes sense that such skepticism would arise.
I suspect that frameworks like DSPy emerged precisely because such concerns exist not only around me but nationwide and globally. Automated prompt engineering is a concept that has only recently emerged, and manually writing prompts is still likely the prevalent method.
Today, I will talk about a paper that elegantly designs prompts in which CoT reasoning is applied from a knowledge graph perspective, allowing the model to access reliable data and reason over it.
Why?
Of course, the final answer delivered to the user is important, but deriving that answer involves several inference steps by the LLM, and the importance of these steps has been overlooked. The paper’s authors recognize the need for improvement here and propose combining a knowledge graph with CoT reasoning.
What is Chain-of-thought (CoT) Reasoning?
CoT is a type of reasoning framework: when the LLM receives a query from a user, it is instructed to reason in a chained, step-by-step manner to derive the answer.
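As a concrete illustration, here is a minimal sketch of what a few-shot CoT prompt might look like in practice; the example question, the worked exemplar, and the `call_llm` helper are hypothetical and not taken from the paper.

```python
# A minimal few-shot CoT prompt sketch (hypothetical example, not taken from the paper).
# `call_llm` stands in for whatever client function you use to query a language model.

COT_PROMPT = """Answer the question by reasoning step by step.

Q: Which country is the director of the film Parasite from?
A: The film Parasite was directed by Bong Joon-ho.
   Bong Joon-ho was born in South Korea.
   Therefore, the answer is South Korea.

Q: {question}
A:"""

def answer_with_cot(question: str, call_llm) -> str:
    """Format the CoT prompt for a new question and return the model's chained reasoning."""
    return call_llm(COT_PROMPT.format(question=question))
```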
Architecture
Since we have established that the way answers are inferred matters, we should discuss how to improve it, right? The authors describe the evaluation of the inference process in two parts.
The first is Discriminative Evaluation, in which the LLM is given reasoning paths and asked to judge whether they constitute faithful reasoning. The second is Generative Evaluation, in which the reasoning the LLM itself generates during the CoT process is evaluated for faithfulness.
This is similar to the GAN architecture encountered when first studying deep learning: if you think of the idea of generating and discriminating to enhance a model’s performance, it becomes easier to understand. The difference here is that the data used for generation and discrimination is based on a knowledge graph.
Specifically, what will be the criteria for evaluation?
As mentioned earlier, a measure is needed to evaluate good distinction and good generation. This section specifically discusses how and what will be evaluated.
Discriminative Evaluation, what will be the criteria for assessing the quality of distinction?
A criterion for good discrimination is necessary. In this paper, the existing ground-truth reasoning paths are randomly perturbed to create good and bad examples, which are then injected directly into the prompt. Bad examples are created in the following three ways:
1. Factual error reasoning path
Randomly swap entities in a valid reasoning path to create a factually incorrect path.
2. Incoherent reasoning path
Randomly shuffle the elements within the triples of a valid reasoning path to create an incoherent path.
3. Misguided reasoning path
Randomly substitute paths that arose from other questions in the KG to create paths with different facts and triple structures.
After these three types of bad examples are formed, the LLM is directly prompted to judge, with a yes or no answer, whether each example is a reasoning path that matches the user’s query (a rough sketch of this construction follows below).
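To make the three perturbation types more concrete, here is a rough sketch of how such negative reasoning paths could be built from ground-truth triples, along with an illustrative yes/no prompt; the function names and the prompt wording are my own assumptions, not the paper’s exact implementation.

```python
import random

# A reasoning path is a list of (head, relation, tail) triples.
Path = list[tuple[str, str, str]]

def factual_error_path(path: Path, all_entities: list[str]) -> Path:
    """Swap the tail entity of one triple for a random entity (factual error)."""
    i = random.randrange(len(path))
    h, r, _ = path[i]
    return path[:i] + [(h, r, random.choice(all_entities))] + path[i + 1:]

def incoherent_path(path: Path) -> Path:
    """Scramble the elements inside one triple so the chain no longer reads coherently."""
    i = random.randrange(len(path))
    elems = list(path[i])
    random.shuffle(elems)
    return path[:i] + [tuple(elems)] + path[i + 1:]

def misguided_path(path_from_other_question: Path) -> Path:
    """Reuse a path mined for a different question (misguided reasoning)."""
    return path_from_other_question

def discriminative_prompt(question: str, path: Path) -> str:
    """Ask the LLM for a yes/no judgment on the path (prompt wording is illustrative)."""
    steps = "\n".join(f"({h}, {r}, {t})" for h, r, t in path)
    return (
        f"Question: {question}\n"
        f"Reasoning path:\n{steps}\n"
        "Is this a valid reasoning path for answering the question? Answer yes or no."
    )
```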
Generative Evaluation, what will be the criteria for assessing the quality of generation?
This is a method of evaluating generation quality: the triples generated through CoT are compared against the existing knowledge graph to determine whether each one is meaningful. Since the generated triples should take a form similar to the paths in the knowledge graph, prompt design becomes even more crucial.
The key is to turn the triples produced through CoT into a meaningful path and compare it with the knowledge graph.
First, to do so, the paper embeds the triples into numerical vectors. The embedding model used here is Sentence-BERT.
Afterward, the embedded triples are compared against the existing knowledge graph using cosine similarity, retrieving the top-k most similar knowledge graph elements. All of these operations take place in a vector database (FAISS).
To prevent cases where a related triple is not retrieved because an entity (or relation) is missing or not directly connected, the head and tail values of the triple are additionally included in the evaluation.
The score produced this way is compared against what the paper calls a factual threshold, and reasoning is judged by whether it falls above or below that threshold.
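Below is a rough sketch of how this retrieval step could look using the sentence-transformers and FAISS libraries; the model name, the example triples, and the threshold value are illustrative assumptions rather than the paper’s exact setup.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any Sentence-BERT model works here

def triple_to_text(h: str, r: str, t: str) -> str:
    return f"{h} {r} {t}"

# Embed the knowledge-graph triples and index them with inner product
# (on L2-normalized vectors, inner product equals cosine similarity).
kg_triples = [("Bong Joon-ho", "directed", "Parasite"),
              ("Bong Joon-ho", "born in", "South Korea")]
kg_vecs = model.encode([triple_to_text(*t) for t in kg_triples], normalize_embeddings=True)
index = faiss.IndexFlatIP(int(kg_vecs.shape[1]))
index.add(np.asarray(kg_vecs, dtype="float32"))

FACTUAL_THRESHOLD = 0.8  # illustrative value; the paper sets its own threshold

def is_factual(generated_triple: tuple[str, str, str], k: int = 3) -> bool:
    """Retrieve the top-k most similar KG triples and check against the factual threshold."""
    query = model.encode([triple_to_text(*generated_triple)], normalize_embeddings=True)
    scores, _ = index.search(np.asarray(query, dtype="float32"), k)
    return float(scores[0].max()) >= FACTUAL_THRESHOLD
```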
In addition to this score, several perspectives are adopted to judge whether the reasoning is appropriate; a small aggregation sketch follows the list.
Factual correctness — judged by whether the retrieved triple scores above or below the factual threshold.
Coherence — judged by whether each conclusion is drawn from evidence established earlier in the reasoning.
Final answer correctness — judged by whether the answer derived from the factually correct and coherent path is accurate.
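As a toy illustration of how these three checks might be combined per question, here is a small aggregation sketch; the `StepVerdict` structure and the exact-match answer comparison are my own assumptions, not the paper’s implementation.

```python
from dataclasses import dataclass

@dataclass
class StepVerdict:
    factual: bool   # retrieved similarity was above the factual threshold
    coherent: bool  # the step follows from evidence established earlier

def evaluate_reasoning(steps: list[StepVerdict], predicted: str, gold: str) -> dict:
    """Aggregate the per-step checks together with final answer correctness."""
    return {
        "factual_correctness": all(s.factual for s in steps),
        "coherence": all(s.coherent for s in steps),
        "final_answer_correctness": predicted.strip().lower() == gold.strip().lower(),
    }
```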
Experiment: what were the results?
As mentioned in the introduction, how well the prompt is designed is crucial to the paper.
The paper utilizes three prompt strategies for experimental comparison:
Prompting strategies listed in this paper
- Few-shot CoT
A method that adds five worked examples to the CoT prompt.
- Few-shot CoT with planning (CoT-Plan)
A prompting strategy in which the LLM first plans and decomposes the question before verbalizing its CoT reasoning.
- Few-shot CoT with self-consistency (CoT-SC)
A prompting strategy that samples multiple reasoning chains and aggregates the final answers by majority vote, inducing the LLM to settle on a consistent answer (see the sketch below).
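Of the three strategies, self-consistency is the easiest to sketch in code: sample several CoT completions and take a majority vote over the final answers. The `call_llm` helper and the answer-extraction heuristic below are assumptions for illustration, not the paper’s code.

```python
from collections import Counter

def extract_final_answer(reasoning: str) -> str:
    """Naive extraction: take whatever follows the last 'the answer is' marker."""
    marker = "the answer is"
    lowered = reasoning.lower()
    if marker not in lowered:
        return reasoning.strip()
    return reasoning[lowered.rfind(marker) + len(marker):].strip(" .\n")

def cot_self_consistency(question: str, call_llm, n_samples: int = 5) -> str:
    """Sample several CoT completions and majority-vote the extracted final answers."""
    # call_llm is a hypothetical helper returning one sampled CoT completion
    # (temperature > 0, so different samples follow different reasoning paths).
    answers = [extract_final_answer(call_llm(question)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```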
Discriminative evaluation result
Finding 1. LLMs possess knowledge of valid reasoning
This experiment shows a clear distinction between zero-shot and few-shot results. While zero-shot performance is respectable, few-shot performance is generally poor.
This suggests that the examples in few-shot prompts may actually act as noise and hinder the LLM’s reasoning.
It also shows that LLMs still have a long way to go before they can faithfully reason over information received from a knowledge graph.
Generative evaluation result
Finding 2. The correct final answer may not necessarily result from faithful reasoning.
The key is to look at the gap between answer accuracy and reasoning accuracy. Whether answer performance actually improves as reasoning performance increases can be examined through this gap.
A smaller gap indicates a strong dependency between the two, and a larger gap indicates less dependency, which is why this gap is the critical quantity for assessing the impact of reasoning on the answer.
If I were to summarize the paper’s interpretation of the results in one sentence, it would be that faithful reasoning does not significantly influence the final answer, and so is not as essential as one might expect.
Finding 3. The reasoning gap worsens as the model size increases.
There is a clear difference in reasoning performance depending on model size: the larger the model, the better the reasoning performance. Smaller models, on the other hand, show less improvement than larger ones.
Finding 4. A better prompting strategy can improve both answer and reasoning accuracy.
Unlike strategies that rely on few-shot examples alone, those that add planning or self-consistency (CoT-Plan, CoT-SC) consistently show good performance. This demonstrates that prompting strategies are effective not only for answers but also for reasoning.
Outro
This paper discussed methods for discriminating and generating reasoning paths that would be helpful for LLMs. It was particularly interesting to me because it provides concrete prompt templates and design strategies, making it more accessible and practically approachable than many other papers.
However, although the experiments do examine the effects of reasoning, I did not find the paper entirely compelling: most of the experimental results do not provide convincing evidence that reasoning is as crucial as suggested. Nonetheless, I believe these kinds of reasoning studies need to be actively pursued to address issues like the black-box nature and hallucinations of models.