YOU MUST EAT THIS GRAPH NEWS, GRAPH OMAKASE. 2 Weeks of September

Jeong Yitae
6 min read · Sep 17, 2023


MESH: A Flexible Distributed Hypergraph Processing System

[https://ieeexplore.ieee.org/document/8790188]

“Think like a vertex.” This is the first sentence I came across in this paper, and it captures one of the core development philosophies of graph processing systems. It is similar to the idea behind object-oriented programming: think of everything in terms of vertices, and program around each vertex’s state and behavior (its interactions with its neighbors). MESH, the engine we are looking at today, builds a hypergraph processing engine around this same core idea.
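To make the “think like a vertex” idea concrete, here is a minimal, self-contained sketch (not MESH’s actual API) of a vertex-centric program: each vertex sees only its own state and the messages from its neighbors, and a tiny driver iterates supersteps until nothing changes. The names `vertex_program` and `run` are illustrative assumptions.

```python
# A minimal "think like a vertex" sketch (illustrative, not MESH's real API).
# Each vertex holds its own state and reacts only to messages from neighbors;
# the small driver below iterates supersteps until no vertex changes.

from collections import defaultdict

def vertex_program(vertex_id, state, messages):
    """Connected components by label propagation: keep the smallest id seen."""
    new_state = min([state] + messages) if messages else state
    return new_state, new_state != state

def run(edges, max_supersteps=20):
    # Build an undirected adjacency list.
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # Initially every vertex labels itself with its own id.
    state = {v: v for v in neighbors}

    for _ in range(max_supersteps):
        # Each vertex sends its current label to all of its neighbors.
        inbox = defaultdict(list)
        for v, label in state.items():
            for n in neighbors[v]:
                inbox[n].append(label)

        # Each vertex updates purely from its own state and incoming messages.
        any_changed = False
        for v in state:
            state[v], changed = vertex_program(v, state[v], inbox[v])
            any_changed = any_changed or changed
        if not any_changed:
            break
    return state

if __name__ == "__main__":
    print(run([(1, 2), (2, 3), (4, 5)]))  # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```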

MESH is claimed to outperform HyperX, an existing hypergraph processing engine, by focusing on three values: 1. expressiveness and ease of use, 2. scalability, 3. flexibility and ease of implementation. To summarize, unlike HyperX, MESH was developed from scratch around the hypergraph structure, and that is the key to how it delivers all three of the points above.

But where do we actually need these advantages? When you apply the hypergraph perspective to real-world data, you eventually end up with a set of hypotheses about the problems you are facing, and you simulate various situations to validate those hypotheses. Simulating “various” situations requires flexibility, and previous engines were not only difficult to use but also required long code. MESH, on the other hand, provides implementations that make this possible (for example, swapping between various partitioning strategies, as in the sketch below).
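As a rough illustration of what “swapping partitioning strategies” means in practice, here is a hedged sketch of two interchangeable ways of assigning hyperedges to workers: hashing by id versus greedily balancing by hyperedge size. The function names and the toy hypergraph are assumptions for illustration, not MESH’s implementation.

```python
# Hedged sketch: pluggable hyperedge-partitioning strategies (illustrative only;
# these are not MESH's actual strategies or interfaces).

def hash_partition(hyperedges, num_parts):
    """Assign each hyperedge to a partition by hashing its id."""
    return {he_id: hash(he_id) % num_parts for he_id in hyperedges}

def greedy_balance_partition(hyperedges, num_parts):
    """Assign large hyperedges first to whichever partition has the least load."""
    loads = [0] * num_parts
    assignment = {}
    for he_id, members in sorted(hyperedges.items(), key=lambda kv: -len(kv[1])):
        target = loads.index(min(loads))
        assignment[he_id] = target
        loads[target] += len(members)
    return assignment

# Toy hypergraph: hyperedge id -> set of member vertices.
H = {"e1": {1, 2, 3}, "e2": {3, 4}, "e3": {1, 5, 6, 7}, "e4": {2}}
print(hash_partition(H, 2))
print(greedy_balance_partition(H, 2))
```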

Despite the paper’s age, I put it on the omakase table this week for two reasons: 1. its logic and references for choosing a partitioning strategy when faced with large amounts of data, and 2. its awareness of Spark.
In the real world there are many concerns about introducing graphs. What matters most is contributing to business impact by using graphs, and that can turn into a negative if you design an overly fancy graph plan.

In other words, the key is to find a good compromise between the ideal and the reality, and I think Spark (GraphX) is a good middle ground between planning and implementation. Of course, how to allocate resources across CPU, cluster, and memory is another concern, so it requires a lot of study, but I’m sure the output will be worth it.
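If you want to try the Spark route from Python, a minimal sketch with the GraphFrames package might look like the following. GraphX itself is Scala-only, so GraphFrames is used here as a Python-facing stand-in; the data and column values are made up for illustration.

```python
# Minimal sketch: graph analytics on Spark from Python via GraphFrames.
# (GraphX itself is Scala-only; GraphFrames is a Python-facing stand-in.
#  The data below is made up for illustration.)

from pyspark.sql import SparkSession
from graphframes import GraphFrame  # add the graphframes package to your Spark job

spark = SparkSession.builder.appName("graph-omakase-sketch").getOrCreate()

# GraphFrames expects a vertex DataFrame with an "id" column
# and an edge DataFrame with "src" and "dst" columns.
vertices = spark.createDataFrame(
    [("a", "account"), ("b", "account"), ("c", "merchant")],
    ["id", "type"],
)
edges = spark.createDataFrame(
    [("a", "c", "payment"), ("b", "c", "payment")],
    ["src", "dst", "relationship"],
)

g = GraphFrame(vertices, edges)
g.inDegrees.show()                                             # simple degree statistics
g.pageRank(resetProbability=0.15, maxIter=5).vertices.show()   # PageRank scores
```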

Which should you choose between Hadoop and Spark for your circumstances? [https://www.geeksforgeeks.org/difference-between-hadoop-and-spark/]

LLM AS DBA

[https://arxiv.org/pdf/2308.05481.pdf]

A database administrator (DBA) has, literally, the job of managing databases. As the importance of data storage and processing grows, DBAs are becoming increasingly valuable. This paper’s thesis is that an LLM can take over that role. Data analysts and data scientists typically use SQL queries to get the data they need, which requires collaboration with other teams (DBAs and data engineers).

Obviously, one side manages data and the other side uses it, so they look at the same systems from different perspectives to fulfill different tasks; sometimes those perspectives are aligned, and sometimes they are not. SQL queries that analysts take for granted can be frustrating for DBAs and, if written incorrectly, can cause bottlenecks on the server, which hurts everyone involved. DBAs and data engineers are the ones who prevent this from happening.

In this paper, the authors share the process and results of developing an LLM bot to replace DBAs, which I believe is an important perspective. In the appendix, they even include the prompts and test cases that are the secret sauce of the LLM. You may be thinking, “How is this possible? Isn’t that part of DBAs’ tacit knowledge?” In fact, many situational solutions have already been distributed in the form of documents. Those documents were hard for people in other roles to understand, so the access barrier was high, and LLMs had no instructions for RCA (root cause analysis) either, so asking ChatGPT about it produced hallucinations.

The authors develop a bot that combines these two perspectives: it understands the documents well, and it is specialized in tree-of-thought reasoning for RCA. What is noteworthy is the collaboration and discussion among DBAs in charge of different parts to derive an answer: a Chief DBA, a Memory DBA, and a CPU DBA. It’s a fun idea, but also a scary one. As ChatGPT develops, industry experts have said that the first jobs to be replaced will be white-collar jobs such as lawyers, prosecutors, and judges. I didn’t relate to that before, but now I’m thinking, “Will I be one of them soon?”
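To give a feel for the “specialist DBAs discussing a root cause” idea, here is a heavily simplified, hedged sketch of the control flow. The `call_llm` placeholder, the agent roles, and the prompts are my own assumptions for illustration; the paper’s actual prompts and tooling are in its appendix.

```python
# Hedged sketch of a multi-agent RCA discussion (illustrative control flow only;
# the real prompts and tools are in the paper's appendix). `call_llm` is a
# placeholder for whatever chat-completion client you use.

def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Placeholder: send one prompt to an LLM and return its text reply."""
    raise NotImplementedError("plug in your own LLM client here")

SPECIALISTS = {
    "Memory DBA": "You diagnose database anomalies caused by memory (buffers, caches, OOM).",
    "CPU DBA": "You diagnose database anomalies caused by CPU (slow queries, lock contention).",
}

def diagnose(anomaly_report: str, retrieved_docs: str) -> str:
    # 1) Each specialist proposes a root-cause analysis from its own angle.
    opinions = {
        role: call_llm(
            system_prompt=f"{role}. {desc} Cite evidence from the monitoring report.",
            user_prompt=f"Report:\n{anomaly_report}\n\nRelevant maintenance docs:\n{retrieved_docs}",
        )
        for role, desc in SPECIALISTS.items()
    }
    # 2) A chief DBA reconciles the opinions into one diagnosis and action plan.
    summary = "\n\n".join(f"[{role}]\n{opinion}" for role, opinion in opinions.items())
    return call_llm(
        system_prompt="You are the chief DBA. Merge the specialists' analyses, "
                      "resolve conflicts, and output a root cause plus remediation steps.",
        user_prompt=summary,
    )
```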

In this era, we need to keep studying to prove and develop our value, right? In this paper, the DBAs wrote down the whole series of processes for analyzing and resolving root causes in order to build the DBA bot. These processes, and the wisdom about what to base decisions on, are very informative. I think it will be useful both for people interested in databases and for people who will use LLMs to build and operate bots. If you lack a database foundation, though, it will be difficult to read.

For those who want to build that foundation, here is a recommended resource for a database overview:

[https://cs145-fall22.github.io/]

GraphFC: Customs Fraud Detection with Label Scarcity

[https://arxiv.org/pdf/2305.11377.pdf]

This touches on something that comes up a lot in GNN use cases. If you have a fraud detection system (FDS) and a well-stocked, well-managed supply of ideal data, introducing graphs is not a big deal; graphs are very useful when those data conditions are already met. But what about the opposite? One of the more challenging situations is when you do not have much graph data or labeled data to use for machine learning. Can we really label terabytes of data arriving every second? In this paper, the authors discuss ideas that leverage the graph perspective to overcome this challenge.

They use topology to mitigate the label scarcity problem and demonstrate through experiments that it outperforms other (tabular) methodologies when applied to machine learning.
The pre-training and fine-tuning process is not much different from existing GNN papers, but the tree-based cross-feature embedding and the transaction graph modeling process are very interesting. The paper also tries different message passing models (RGCN, GraphSAGE, and GAT), and it is useful to see how the differences between these models play out on FDS and unlabeled data; a small sketch of swapping these layers follows below.
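For readers who want to see what “swapping message passing models” looks like in code, here is a small hedged sketch using PyTorch Geometric: the same two-layer node classifier with the convolution type (GraphSAGE, GAT, or RGCN) chosen by a flag. The feature dimensions and the toy transaction-graph data are placeholders, and this is not the authors’ released code.

```python
# Hedged sketch (not the authors' code): one node classifier whose message
# passing layer can be swapped between GraphSAGE, GAT, and RGCN.

import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv, GATConv, RGCNConv

class TransactionGNN(torch.nn.Module):
    def __init__(self, conv_type, in_dim, hidden_dim, num_classes, num_relations=1):
        super().__init__()
        if conv_type == "sage":
            self.conv1 = SAGEConv(in_dim, hidden_dim)
            self.conv2 = SAGEConv(hidden_dim, num_classes)
        elif conv_type == "gat":
            self.conv1 = GATConv(in_dim, hidden_dim, heads=2, concat=False)
            self.conv2 = GATConv(hidden_dim, num_classes, heads=1)
        elif conv_type == "rgcn":  # relation-aware: needs an edge_type tensor
            self.conv1 = RGCNConv(in_dim, hidden_dim, num_relations)
            self.conv2 = RGCNConv(hidden_dim, num_classes, num_relations)
        else:
            raise ValueError(conv_type)
        self.conv_type = conv_type

    def forward(self, x, edge_index, edge_type=None):
        if self.conv_type == "rgcn":
            h = F.relu(self.conv1(x, edge_index, edge_type))
            return self.conv2(h, edge_index, edge_type)
        h = F.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)

# Toy usage with placeholder transaction features (4 nodes, 8 features, 2 classes).
x = torch.randn(4, 8)
edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]])
model = TransactionGNN("sage", in_dim=8, hidden_dim=16, num_classes=2)
logits = model(x, edge_index)
```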

I often get inquiries from people who want to apply graph-based FDS: how to design the DBMS, how to derive rules from it, and how to apply ML on top. I can’t cover all of those areas, but I think this paper will be a ray of light from the ML perspective, because it focuses on label scarcity, a problem that comes up constantly in the field. I had been thinking about FDS only from the DB perspective, so I really enjoyed finding a good reference for how to augment data from unlabeled data (the key point from the ML perspective) and which models to use.

If you have a question or want to contact me, use the link below.

https://www.linkedin.com/in/ii-tae-jeong/


Written by Jeong Yitae

LinkedIn: jeongyitae. I'm a graph and network data enthusiast, from hardware to software (applications).
