GraphFDS — 01
Author (LinkedIn): https://www.linkedin.com/in/ii-tae-jeong/
tl;dr
- This post gives an overview of graph data, including the complexity trade-offs of each representation method (adjacency list and adjacency matrix).
- It covers which graph techniques are useful in a fraud detection system.
- It walks through the graph data modeling procedure, from raw tabular data to graph format.
1. What is graph data?
- Graph data is represented both as a network (node, link) and as a graph (vertex, edge). Complex-systems physics mainly uses the term network, while mathematics mainly uses the term graph.
- The origins differ, but the purpose is similar, so there is no problem using the terms interchangeably unless you are writing for academia, e.g. in papers. ex) Graph data consists of nodes and edges.
- For reference, in mathematical terms a graph is defined more precisely as: a structure consisting of a set of objects in which some pairs of objects are 'related'.
- To represent graphs, we mainly utilize the ‘adjacency matrix’ and ‘adjacency list’ forms. Each of these representations has its advantages and disadvantages.
- Let's compare the two and see where each is advantageous and disadvantageous in terms of time complexity and space complexity.
- First, the advantages of the adjacency matrix: since all connections are stored in a two-dimensional matrix, checking whether two nodes are related is a single lookup, so searching for related nodes is fast, an advantage in time complexity. On the other hand, the matrix must record every pair of nodes, whether or not an edge exists between them, so it takes O(V²) space to manage, a disadvantage in space complexity.
- Next, the adjacency list stores, for each node, only the edges that actually exist, so it requires little space to store and manage, an advantage in space complexity. On the other hand, checking whether two nodes are related means scanning a neighbor list, which is a disadvantage in time complexity.
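- The trade-off above can be sketched directly in Python (a minimal illustration; the 4-node graph is made up):

```python
# A 4-node undirected graph with edges (0,1), (0,2), (2,3).

# Adjacency matrix: O(1) edge lookup, but O(V^2) space, zeros included.
adj_matrix = [
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
]
print(adj_matrix[0][1] == 1)  # edge check is a single index: True

# Adjacency list: O(V + E) space, but an edge check scans a neighbor list.
adj_list = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
print(1 in adj_list[0])  # linear scan over node 0's neighbors: True
```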
- Therefore, before using the data, we need to think about how to represent the graph so it can be applied effectively. The choice of representation depends on the sparsity of the edges, in other words, the density of the graph.
- If the graph's density is high, an adjacency matrix is favorable, because high density implies many connected nodes per node. Conversely, if the density is low, an adjacency list is favorable, because low density implies few connected nodes per node.
- Since scipy.sparse, dgl.sparse, torch.sparse, and pyg.sparse all follow the same trade-off, it's important to check the distribution of your data and compare whether a sparse matrix or an edge list is advantageous.
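- A quick way to make that call is to compute the density directly (a sketch; the 0.5 threshold is an arbitrary rule of thumb, not a standard cutoff):

```python
def graph_density(num_nodes: int, num_edges: int) -> float:
    """Density of an undirected simple graph: 2E / (V * (V - 1))."""
    if num_nodes < 2:
        return 0.0
    return 2 * num_edges / (num_nodes * (num_nodes - 1))

# Example: 1,000 nodes with 5,000 edges is very sparse.
d = graph_density(1000, 5000)
print(d)  # roughly 0.01
print("matrix" if d > 0.5 else "list")  # very sparse, so choose the list
```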
- The advantage of graph data is that it speeds up data exploration. We search for connections between objects, and from the results we learn what value the data holds. There are two main strategies for exploring these connections: Depth-First Search (DFS) and Breadth-First Search (BFS).
# dfs implementation code
def iterative_dfs(start_v):
    discovered = []
    stack = [start_v]
    while stack:
        v = stack.pop()
        if v not in discovered:
            discovered.append(v)
            for w in graph[v]:
                stack.append(w)
    return discovered
# bfs implementation code
from collections import deque

def iterative_bfs(start_v):
    discovered = [start_v]
    queue = deque([start_v])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in discovered:
                discovered.append(w)
                queue.append(w)
    return discovered
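- To see the two strategies side by side, here is a self-contained sketch on a toy graph (the node labels are made up, and the functions are repeated in compact form so the snippet runs on its own):

```python
from collections import deque

graph = {
    'A': ['B', 'C'],
    'B': ['D'],
    'C': [],
    'D': [],
}

def iterative_dfs(start_v):
    discovered, stack = [], [start_v]
    while stack:
        v = stack.pop()
        if v not in discovered:
            discovered.append(v)
            stack.extend(graph[v])  # dive deeper before visiting siblings
    return discovered

def iterative_bfs(start_v):
    discovered, queue = [start_v], deque([start_v])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in discovered:
                discovered.append(w)  # visit level by level
                queue.append(w)
    return discovered

print(iterative_dfs('A'))  # ['A', 'C', 'B', 'D']
print(iterative_bfs('A'))  # ['A', 'B', 'C', 'D']
```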
- The difference between a graph data structure and a tree data structure is whether cycles can exist.
Summary
Depending on the application, it is important to choose how to represent the graph, since the adjacency matrix and the adjacency list each have clear advantages and disadvantages. The main criterion for this choice is the graph's density distribution. Whether the density distribution follows a power law or a normal distribution indirectly tells us whether the edges are sparse or dense, so it is worth visualizing this in chart form.
2. Why graph data in FDS?
- In the previous section we covered graph data at a high level. So why use graph data in a financial FDS? Let's start with the traditional methods, their limitations, and how graphs can compensate for them.
- In a nutshell: before graphs, correlation between entities is hard to exploit; after graphs, it becomes convenient. Before graphs are applied, a user's past transaction data is used on its own to decide whether a transaction is anomalous. Let's look at this historical data more closely: it is managed in two forms, historical transaction data and historical user account data.
- Account data is data that changes, while transaction data accumulates in an immutable form. From a financial perspective, my account balance is constantly changing, while a transaction I have already sent is immutable.
- In terms of data design, if we ask the fundamental question, "Would I put these in one table?", the answer is no. And this is exactly where relationship lookups (join costs) arise when checking against this historical data.
- Strictly speaking, even a single record indirectly implies that relationships in the historical data must be checked. The "historical data" just mentioned is divided into transaction history and user account history, referred to in the financial world as "ledger data" and "transaction data". Ledger is short for original ledger; when duplication occurs between data, integrity is judged against the ledger data.
- Therefore, since the criteria for judging data integrity are stored and managed in one master table, the number of elements to manage is vast, and extracting specific data inevitably requires many join operations.
- On the other hand, 'transaction data' contains the record of a transaction between media at the moment the transaction is finalized. So far we've been surveying the types of data in the financial sector.
- We've seen that many factors are involved in determining whether an individual has made a transaction, and that the filtering work required to check an individual's transaction against existing FDS rules is enormous.
- On top of this challenge of identifying the characteristics of an individual, we also have to scrutinize the targeted transactions within the transaction data for the three types of transactions: deposit, withdrawal, and remittance.
- Say I made a transaction, and a situation was captured where I sent an 'amount' different from 'usual' to a 'counterparty' named Yoo Jimin through a channel called 'A Bank'. At this point, we need to derive a baseline for 'usual' from past data and compare it with the current transaction.
- Even if there are no deviations from ‘usual’, further comparisons are made to see if they are similar to the FDS patterns designed by each banking region. As you can imagine, there are a lot of join operations involved in this process.
- This is where graphs have an advantage. From a database perspective, the efficiency gain of a graph is closely correlated with how much "normalization" has been done, in other words, how spread out the data tables are.
- For example, if the data is divided across many tables for data management, the computation is inefficient, because inferring information related to one record requires exploring all the tables and deriving the relationships.
- Conversely, the more denormalized the tables are, the less cross-table lookup is needed, and the weaker the appeal of the graph data form becomes.
- In terms of advantages, there are two main ones.
1) Schemaless
- This term often comes up when contrasting RDB with GDB, or SQL with NoSQL. Being schemaless means being able to manipulate data without being tied to a fixed structure. An example: nowadays, thanks to MyData (the Korean financial-information centralization platform), you can see all your financial records in one place, so when data of different types is brought in, the first task is checking data consistency.
- If I make a payment with card A on platform A, but platform B shows it as "card A +@", that is not the same data. When this happens, there is a lot of room for user churn, because mistrust of the transaction history leads to mistrust of the platform. So much effort goes into data consistency that there is a separate task just to mediate and standardize each data source. Now suppose I open a new card, B, make one transaction with it, and then go back to using A. The transaction history must be recorded, so in table format a new column is created for card B, all to reflect a single transaction and the card's information.
- This is a very reasonable choice in terms of consistency, but it is also inefficient because of the space wasted to prove a single transaction. If we used a graph (NoSQL) database in the same situation, we would simply add one node and one edge, wasting far less space. The advantage of graphs is that they are not bound by the traditional table format and can flexibly perform operations such as update and delete on new information.
- The figure above shows how an RDB and a GDB manage transactions involving card B. The RDB grows rows and columns to accommodate the additional transaction and history: to capture card B's extra attribute, credit availability, a new column must be added alongside the plain card attributes A, B, C. If you imagine multiple attributes across multiple cards, the space wasted to represent each single attribute is staggering. The graph representation, by contrast, is simple: put the transaction history on the edges and the card attributes on the nodes.
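- The node-plus-edge update can be mimicked with a plain dict-based property graph (a minimal sketch; the card names and attributes are invented for illustration):

```python
# A toy property graph: nodes and edges each carry arbitrary attributes,
# so adding card B needs no new column, just one node and one edge.
nodes = {
    'user':   {'type': 'account'},
    'card_A': {'type': 'card'},
}
edges = [
    ('user', 'card_A', {'action': 'pay', 'amount': 12000}),
]

# New card B arrives with an attribute card A never had (credit availability):
nodes['card_B'] = {'type': 'card', 'credit_available': True}
edges.append(('user', 'card_B', {'action': 'pay', 'amount': 5000}))

# Card A is untouched; no NULL-filled column was created for it.
print('credit_available' in nodes['card_A'])  # False
```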
- There's a lot of room for this in fintech, given that convenience is a core value in modern society. In a generation hooked on newness and speed, diverse data is bound to be generated, and whenever it is, all of its characteristics must be managed, so cases like the above occur often. In this respect, the schemaless property is gaining attention for its flexibility: data can be managed and stored without a fixed schema.
2) Easy & intuitive analysis of associations with other transactions & accounts
- In Figure 4, you can see three different Account nodes connected by the relationship "send money". How is the account my account sent money to related to the accounts that sent money to it? To find out, the graph simply looks up the paths connected by the "send money" relationship and analyzes the nodes along those paths.
- This is where graphs come into their own: they make it very easy, quick, and intuitive to correlate transactions like this.
- Let’s take a look at an illustration to compare the traditional and graph approaches.
- The advantage, and the difference, is that the graph method can interpret a transaction in a broader sense, taking into account the pattern of transactions between you and your counterparty.
- For example, if the account I transacted with also received suspicious remittances from the majority of the other accounts it deals with, we can judge there is a high probability of fraud, which is significant grounds for imposing sanctions.
- In the end, the advantage of graph-based FDS is that it expands the basis of judgment for sanctions, so additional context can be considered when deciding on fraud.
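- That counterparty-of-counterparty check can be sketched as a walk over a "send money" edge list (account names and the flagged set are hypothetical):

```python
# Toy transfer graph: who sent money to whom.
send_money = {
    'me':     ['acct_X'],
    'acct_X': ['acct_Y', 'acct_Z'],
    'acct_Y': [],
    'acct_Z': [],
}
flagged = {'acct_Y', 'acct_Z'}  # accounts already marked as suspicious

# Walk the "send money" paths from my counterparty and measure how many of
# its own counterparties are flagged: extra context a rule that only looks
# at my own transactions would never see.
counterparty = send_money['me'][0]
second_hop = send_money[counterparty]
suspicious_ratio = len([a for a in second_hop if a in flagged]) / len(second_hop)
print(suspicious_ratio)  # 1.0, so acct_X's outgoing transfers look fraudulent
```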
Summary
The advantages of graphs in anomaly detection can be categorized as 1. schemaless and 2. correlation. In FDS, graphs can increase the precision of fraud detection by expanding the existing perspective and reflecting additional context for transactions, and they can identify and deal with potential risks by discovering previously undetected patterns.
3. Five steps to transform your data from traditional tables to graphs
- So far, we've talked about graphs and FDS, but before you can utilize graphs, you need data in graph form. If you think about it, it's very rare for graph data to be waiting for you in the real world. Even when it is, you need to understand the intent of the data architects, the people who transformed it into a graph, before you can use it effectively.
- In this section, I’ll talk about this process because my goal is to walk you through the task of converting raw data, table data, into graph data. Graph data comes from five steps: 1. Define the problem, 2. Understand the existing data, 3. Create relationships, 4. Visualize the graph, and 5. Check the scenario. Let’s list how the process goes, based on the premise of applying graphs to FDS.
1) Define the problem
- The end goal of FDS is to detect fraud patterns. But what does detecting a pattern require? We need to think about this fundamental question. Unlike existing methods, the graph method uses correlation to detect media related to fraud patterns that have not yet been discovered, and furthermore applies machine learning to graphs that are related to known patterns and graphs that are not, in order to quantify the differences between them. The goals are to 'maximize correlation between data' and to 'quantify patterns'.
2) Understanding existing data
- In a financial system, we assume the data comes in two forms: ledger data and transaction data. Understanding the existing data well is a necessary preprocessing task for using graphs in FDS. This is the time to sift through the historical data and determine which information will help detect anomalous patterns. Traditional approaches would take all of this data into account, but on the graph side the key is to take full advantage of the schemaless property. Flexible modeling requires actively adopting a graph perspective on the data, which can help solve problems differently than traditional perspectives.
3) Graph design & modeling
- In the problem definition phase, we set the end goal of detecting media 'involved' in a fraud pattern. To achieve this, we create relationships between data based on our understanding of the existing data. Once the relationships are established, we need to think about which characteristics should be reflected in the nodes and relationships of the resulting graph. At this point, you might promote a property that characterizes a node into a node of its own, or vice versa.
- If you choose Graph Modeling 2, you can see patterns of users based on the cards connected to the store, and if you choose Graph Modeling 3, you can see patterns of users based on the transactions connected to the store. This is why it is important to always remember to define the problem and the purpose of the graph before modeling the graph, as the graph will be different depending on what you focus on.
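- Since the modeling figures are not reproduced here, a rough sketch of the two options as triple lists may help (entity names are illustrative, not from the original figures):

```python
# Option A (card-centric, like Graph Modeling 2): users reach stores
# through card nodes, so users sharing a card at a store cluster together.
card_centric = [
    ('user_1', 'OWNS', 'card_1'),
    ('card_1', 'USED_AT', 'store_1'),
]

# Option B (transaction-centric, like Graph Modeling 3): each payment is
# promoted from an edge property to a node of its own, so it can carry
# amount, time, status, and so on.
txn_centric = [
    ('user_1', 'MADE', 'txn_1'),
    ('txn_1', 'WITH', 'card_1'),
    ('txn_1', 'AT', 'store_1'),
]

# Same raw fact, different questions answered: Option A groups users by
# shared cards per store; Option B lets you filter paths by transaction
# attributes before grouping.
print(len(card_centric), len(txn_centric))
```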
4) Visualize the graph
- Load the graphs created in step 3) into a graph database. Then query the nodes related to 'fraud' with Cypher, a graph database query language. Visualize the results in terms of normal/abnormal transaction patterns and the connections between transactions.
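- In Cypher the lookup would be something like `MATCH (a:Account {fraud: true})-[:SEND_MONEY]-(b) RETURN a, b` (an illustrative query, not from the original post). The same one-hop neighborhood can be sketched in plain Python over an edge list (account names and flags are invented):

```python
# Toy account set and transfer edges; the 'fraud' flags are illustrative.
accounts = {'a1': {'fraud': True}, 'a2': {'fraud': False}, 'a3': {'fraud': False}}
transfers = [('a1', 'a2'), ('a2', 'a3')]

# Collect every account one hop from a fraud-flagged account, i.e. the
# neighborhood you would visualize for normal/abnormal pattern review.
fraud_nodes = {n for n, attrs in accounts.items() if attrs['fraud']}
neighborhood = ({v for u, v in transfers if u in fraud_nodes}
                | {u for u, v in transfers if v in fraud_nodes})
print(sorted(neighborhood))  # ['a2']
```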
5) Scenario check (communication for problem solving)
- So far we have discussed how to build the graph technically. This step, by contrast, is about discussing the progress so far with each team member. Since graph modeling is a highly subjective field, this task serves to restore objectivity. From the initial problem definition to why the graph was designed this way, the prototype, expected results, and business impact are all aligned through this process.