GraphFDS — 02
💡 tl;dr
- We will learn about NoSQL and talk about Neo4j, a database that specializes in graphs.
- We will briefly introduce the dataset we will be utilizing, the credit card dataset.
- To load the data into Neo4j for analysis, we will build the environment, from the Dockerfile to the py2neo library.
- We will visualize and explore the loaded data through simple queries.
1. NoSQL
In the previous post in this series, we talked about why ‘relationships’ matter in FDS and what can be derived from them. The advantage is that you can detect anomalous transactions by utilizing not only the perspective of ‘me’ but also the perspective of ‘people related to me’.
- When you look up and analyze the data of people related to you, you can of course do it with SQL, which is widely used in the market, but there are several limitations that push us toward NoSQL. Some of those limitations include:
- Data schema normalization — Because the schema is heavily normalized, it is quite tricky to work out which data lives where. You have to read the data specification, identify primary and foreign keys, and build up the joins needed to track down specific data. This is manageable with a well-maintained specification, but without one you will have a hard time reconstructing the schema.
- Not a relationship-oriented data structure — Relational databases were built to manage tables efficiently, which is a different goal from ours: analyzing data based on its ‘associations’.
- To manage and analyze data based on ‘connections’ like this, we judged SQL to be limited and NoSQL to be a better fit. Other database types can, of course, answer simple connection-related questions easily enough, but as soon as you need to derive insights from ‘connections’, you need analysis that is built around them.
- Doing that in a relational database means bolting something extra on for the ‘connections’, which leads to rather complicated code and asks for answers beyond what the database was designed to provide, resulting in slow query execution and significant resource consumption.
- GDB is often mentioned as a suitable tool for managing and analyzing “connections” because it is a database designed for connectivity queries to work well: its data structure allows queries to be based on connections between elements, and it is optimized for understanding the interaction of complex relationships between elements.
- An additional advantage is that graph data is quite flexible in terms of what you can analyze, as you can derive insights into the “connections” and then encode them back into rows in a relational database.
Summary
NoSQL is better suited than SQL for analyzing “connected” and “related” data for FDS purposes. SQL has limitations such as schema dependency, table normalization, and the lack of a relationship-oriented structure, which a NoSQL graph database (GDB) can compensate for.
2. NoSQL, GDB, and Neo4j
- In the previous section, I briefly introduced NoSQL and said that a graph database is efficient for analyzing and managing data based on connections. Just as there are many kinds of RDBs, there are also many kinds of GDBs, and of those I chose Neo4j.
<There are many graph databases, so why did we select Neo4j? The reasons are as follows …>
1. Efficient Storage for Graph Data
- When you query based on relationships, you may end up searching in reverse. For example, a question like “who is related to me, and who is not?” flips the direction of the traversal: you first look up the related people and then the unrelated ones.
- In that case you also have to reason about the direction of the search, which easily turns into lookups for the sake of lookups. To handle this situation cleanly, Neo4j’s native storage structure is designed with the direction of traversal in mind, and indexing can be added on top of it.
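- A minimal sketch of what this means in practice, using the py2neo connection set up later in this post (the Card/Merchant labels and the cardId key follow the data loaded in section 4 and are assumptions here): the same pattern can be matched with or without a direction, so “reverse” lookups do not require restructuring the data.
from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=None)

# Directed: merchants this card has transacted with.
directed = graph.run(
    "MATCH (c:Card {cardId: $id})-[t]->(m:Merchant) RETURN m LIMIT 5",
    id="card-001",
).data()

# Undirected: anything connected to this card, regardless of direction.
undirected = graph.run(
    "MATCH (c:Card {cardId: $id})-[t]-(other) RETURN other LIMIT 5",
    id="card-001",
).data()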
- Since the engine is written in Java, it is not as fast as some graph databases written in C++. Nevertheless, I chose Neo4j because the advantages of native graph storage seemed to outweigh the speed aspect.
- Property graph management — As a product built around the convenience of managing property graphs, Neo4j keeps extending its property types and is being developed to work efficiently even with large amounts of data. Since Neo4j supports property graphs at its core, storing properties efficiently has to be a top priority, and it is good to see that this is part of the product roadmap.
2. Product completeness
- It may depend on your release criteria, but the current community and enterprise editions are on the 5.x line, which I consider stable. Opinions on software life cycles differ, but I judge anything at version 4 or higher to be reasonably mature, so I rated the product as complete and selected it.
3. Product Scalability
- It is rare for data to be stored as a graph in the first place, so we had two questions: in terms of migration, are there convenient tools for tabular-to-graph conversion, and are there tools that let the system expand? In other words, we considered whether migration is convenient and whether the technology exists to distribute large amounts of data across databases when loading it, and we concluded that Neo4j covers both.
- It supports the Parquet format, a column-oriented storage type often used with Hadoop, the Arrow in-memory format, and a Spark connector that enables distributed analysis and storage management with PySpark.
- In addition, composite databases allow scaling out so that storage and retrieval across databases can be managed efficiently. When different databases hold similar kinds of data with different values, connecting them virtually under the same query interface makes it easy to discover hidden patterns across similar data.
- The three points above are my own subjective criteria; below, I use the considerations mentioned in the book [Fundamentals of Data Engineering] to check whether Neo4j meets the main ones.
First Consideration
- How does a database discover data and perform searches? Indexes help speed up lookups, but not all databases have them. Let’s start by determining if your database uses indexes. If it does, what are the best patterns for designing and maintaining an index? It’s also helpful to have a basic understanding of the main types of indexes, including B-trees and log-structured merge-trees (LSMs).
- => Neo4j has indexes and constraints for lookups. There are more than ten kinds of indexes and constraints, each developed with the characteristics of different data in mind. In addition, Apache Lucene is integrated for full-text indexing, which improves performance in an area where the built-in indexes fall short.
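- A hedged sketch of creating indexes through py2neo (introduced later in this post); the label and property names (Card.cardId, Merchant.name) are illustrative assumptions rather than anything fixed by the dataset.
from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=None)

# A plain range index to speed up exact lookups on a property.
graph.run("CREATE INDEX card_id_idx IF NOT EXISTS FOR (c:Card) ON (c.cardId)")

# A full-text index (backed by Apache Lucene) for keyword-style search.
graph.run(
    "CREATE FULLTEXT INDEX merchant_name_ft IF NOT EXISTS "
    "FOR (m:Merchant) ON EACH [m.name]"
)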
Second Consideration
- Does the database use a query optimizer and what are its characteristics?
- => The database uses a query optimizer. Cypher queries begin with a string containing information to match a specific graph pattern. This string is processed by the query parser and transformed into an internal representation, an abstract syntax tree (AST), which is checked and cleaned. This AST is received by a query optimizer, also known as a planner, which builds a plan for how the query should be executed.
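- A small sketch of watching the planner from Python (a sketch only; if your py2neo version does not expose the plan on the cursor, run the same EXPLAIN query in the Neo4j Browser instead): prefixing a Cypher query with EXPLAIN returns the execution plan without running the query.
from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=None)

# EXPLAIN asks the planner for its plan without executing the query.
cursor = graph.run("EXPLAIN MATCH (c:Card)-[t]->(m:Merchant) RETURN count(t)")
cursor.data()        # consume the (empty) result so the summary is available
print(cursor.plan)   # the operator tree chosen by the planner, if exposed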
Third Consideration
- Can the database scale with demand? What scaling strategy do you use? Horizontal scaling (increasing database nodes)? Or vertical scaling (increasing resources on a single machine)?
- => Both horizontal scaling (scale-out) and vertical scaling (scale-up) are possible. For horizontal scaling there are composite databases; for vertical scaling you can modify the conf, and there is a command that recommends settings so you can make efficient use of the machine’s resources.
Fourth Consideration
- How do I create, read, update, and delete data in the database? Different types of databases handle CRUD operations differently; the Cypher forms are shown below, followed by a short py2neo sketch after the list.
Create :
- Node Creation:
CREATE (:Label {property_name: property_value})
- Relationship Creation:
MATCH (a:LabelA {key: value_a}), (b:LabelB {key: value_b}) CREATE (a)-[:RELATIONSHIP_TYPE]->(b)
Read:
- Node Search:
MATCH (n:Label) RETURN n
- Relationship Search:
MATCH (a)-[:RELATIONSHIP_TYPE]->(b) RETURN a, b
Update:
- Node property update:
MATCH (n:Label {property_name: old_value}) SET n.property_name = new_value
- Relationship property update:
MATCH (a)-[r:RELATIONSHIP_TYPE]->(b) SET r.property_name = new_value
Delete:
- Node Delete:
MATCH (n:Label {property_name: property_value}) DELETE n
- Relationship Delete:
MATCH (a)-[r:RELATIONSHIP_TYPE]->(b) DELETE r
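- The same CRUD statements can be driven from Python once the py2neo connection from section 4 is in place; below is a minimal sketch with parameterized queries (the Card label and cardId key are example names). Note that a node that still has relationships attached must be removed with DETACH DELETE.
from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=None)

# Create
graph.run("CREATE (:Card {cardId: $id})", id="card-001")

# Read
rows = graph.run("MATCH (c:Card {cardId: $id}) RETURN c", id="card-001").data()

# Update
graph.run("MATCH (c:Card {cardId: $id}) SET c.cardBrand = $brand",
          id="card-001", brand="Visa")

# Delete -- use DETACH DELETE if the node may still have relationships attached
graph.run("MATCH (c:Card {cardId: $id}) DETACH DELETE c", id="card-001")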
Summary
- Among the many GDBs on the market, we chose Neo4j for its efficient storage structure, product maturity, and scalability.
- To determine whether Neo4j is suitable from a data engineering standpoint, we checked four considerations — lookups, the query optimizer, scaling and distribution, and CRUD — and found that all of them are covered.
- The book also lists consistency and data modeling patterns. Data modeling patterns were covered in Series 1, so they are omitted here; consistency needs additional data research, so it will be addressed after that research.
- There are many other things to explain about neo4j, but since the purpose of this post is FDS and Graph, I will reduce it here and create another section.
3. Data
- To analyze the data, we looked at many candidate datasets. From that long list, we decided to use [kaggle- Credit Card Transactions](https://www.kaggle.com/datasets/ealtman2019/credit-card-transactions) because it fits our two main goals: graphs and FDS.
The criteria are as follows
- 1. A distributional approach that approximates real-world data.
=> There is a published paper on how the data was generated, [Synthesizing Credit Card Transactions](https://arxiv.org/pdf/1910.03033.pdf), and after reading it we judged the distribution to be close to the real one.
- 2. Is it a large amount of data?
=> Although it is small compared to the real world, it contains about 30 million transactions, which we judged the best among publicly available datasets, so we selected it.
- 3. Relevance to the financial industry.
=> We selected this data because it combines multiple entities such as ‘card’, ‘channel’, and ‘transaction history’, and it also contains ‘user’ and ‘FICO’ creditworthiness information, so its characteristics are highly relevant to finance.
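- A hedged sketch of a first look at the downloaded files with pandas; the file name below is a placeholder, so substitute the CSV you actually downloaded from the Kaggle page.
import pandas as pd

# Placeholder file name -- use the transaction CSV from the Kaggle dataset.
tx = pd.read_csv("credit_card_transactions.csv")

print(tx.shape)                 # roughly 30 million rows in the full transaction file
print(tx.columns.tolist())      # column names used later when loading into Neo4j
if "Is Fraud?" in tx.columns:
    print(tx["Is Fraud?"].value_counts(normalize=True))  # rough fraud rate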
4. Environment setup and analysis
- To start analyzing the data in earnest, we need to build an environment. If your goal is simply to use a graph database, you can install and use it locally, but since the goal of this series is to build a pipeline from GNN to MLOps, we set it up with a Dockerfile.
Dockerfile script
# load the foundation image
FROM neo4j:latest
# Set environment variables
ENV NEO4J_AUTH=none \
NEO4J_PLUGINS='["graph-data-science", "apoc", "apoc-extended"]' \
NEO4J_dbms_connector_bolt_listen__address=0.0.0.0 \
NEO4J_dbms_security_procedures_allowlist='gds.*, apoc.*, apoc-extended.*' \
NEO4J_dbms_security_procedures_unrestricted='gds.*, apoc.*, apoc-extended.*'
# Expose necessary ports
EXPOSE 7474 7687 8491 8888
# Install Python and other dependencies
RUN apt-get update && \
apt-get install -y python3 python3-pip vim git && \
apt-get clean && \
pip3 install -q graphdatascience jupyter neomodel py2neo pyarrow pandas matplotlib ray[default] https://github.com/neo4j-field/checker/releases/download/0.4.0/checker-0.4.0.tar.gz neo4j_arrow@https://github.com/neo4j-field/neo4j_arrow/archive/refs/tags/0.6.1.tar.gz
# Make the workdir of analysis
RUN mkdir /var/lib/neo4j/graphwoody
# Mount volumes
VOLUME /var/lib/neo4j/graphwoody /data /logs /plugins /var/lib/neo4j/import /conf
# Set the working directory
WORKDIR /var/lib/neo4j
# Set up entrypoint
CMD ["neo4j"]
- After writing the Dockerfile as above, enter `docker build -t graphwoody:0.02 .` to build the docker image.
- The rules for writing Dockerfiles can be found at the following link, so check it if you need to make any changes: dockerfile_argument
- Once the docker image is created, we need to run docker, which requires a few more commands to share local space with docker.
docker run -itd \
-v /Users/jeong-itae/neo4j/data:/data \
-v /Users/jeong-itae/neo4j/logs:/logs \
-v /Users/jeong-itae/neo4j/plugins:/plugins \
-v /Users/jeong-itae/neo4j/import:/var/lib/neo4j/import \
-v /Users/jeong-itae/neo4j/conf:/conf \
-p 7474:7474 \
-p 7687:7687 \
-p 8491:8491 \
-p 8888:8888 \
--ipc=host \
--name graphwoody \
--restart unless-stopped \
graphwoody:0.02
Docker command description
- <data> is the directory where Neo4j stores its data: the nodes, edges, indexes, and constraints that have been loaded.
- <logs> is where the logs generated by the Neo4j database are written. It is very important because it records how transactions are executed when queries run, how the internal engine is behaving, and what caused errors.
- <plugins> is the directory for modules used by Neo4j. There are many modules; gds and apoc, the analytics modules we listed in the Dockerfile, end up here so they can be used. You can think of it as similar to the lib directory of a Python Anaconda installation.
- <import> is the directory for importing data into and exporting data out of the Neo4j database. Files placed here can be loaded into the database, and analysis results can be written out here.
- <conf> is a space that contains files to configure neo4j database elements.
- Ports 7474 and 7687 are forwarded so that we can talk to Neo4j from the local machine. The other ports are not used yet and will be explained in a later post.
- <ipc> (--ipc=host) lets the container share the host’s IPC namespace, i.e. its shared-memory resources.
Neo4j Engine Conf
- If you start the container and connect to it, you land in /var/lib/neo4j, which is the workdir we set in the Dockerfile.
- Since we’re sharing the resources of host via the ipc option, to make the most of it, we’ll enter the neo4j memory recommendation command we mentioned a moment ago.
- If you run `neo4j-admin server memory-recommendation` in the container’s shell, you will see the output shown below, where the red colored parts are the conf settings recommended by Neo4j.
- To briefly explain the configuration elements: the heap is the JVM memory used for query execution and transaction state, and it is managed by the garbage collector; the page cache caches the graph data and indexes read from disk so that repeated queries do not have to hit the disk.
- This explanation skips over some details, so I recommend referring to memory-configuration-neo4 for more.
- We’re going to use the red part of this to reset our conf file.
If we go into “/var/lib/neo4j/conf”, we’ll see a file called neo4j.conf. If we go into vim and look at it, we see the following elements.
The yellow box elements are the ones we declared in the dockerfile.
- They are reflected in the conf, so we can see that they are declared correctly. And the red box elements are the result of entering the recommended settings that we checked earlier.
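- A small sketch to double-check from Python that the recommended memory settings (typically server.memory.heap.initial_size, server.memory.heap.max_size, and server.memory.pagecache.size) were picked up; SHOW SETTINGS is available from Neo4j 5, while on 4.x CALL dbms.listConfig('memory') serves the same purpose.
from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=None)

settings = graph.run(
    "SHOW SETTINGS YIELD name, value "
    "WHERE name CONTAINS 'memory' RETURN name, value"
).data()
for s in settings:
    print(s["name"], "=", s["value"])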
- If you want to see more options for the neo4j conf, please check out neo4j_conf
constraint
- Before loading data, we need to set rules or conditions to maintain the integrity of nodes and edges and ensure data quality. For this purpose, Neo4j has constraints, which come in several flavors with different purposes and effects. Among them, we use the uniqueness constraint.
- This is because we looked at the ‘card’, ‘user’, and ‘store’ data and found a key that uniquely identifies each of them. You can find more details in neo4j-constraint, so we recommend choosing a constraint according to your purpose, situation, and data type.
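- A hedged sketch of the uniqueness constraints created before loading; the Card.cardId and Merchant.merchantId keys follow the loading code later in this post, while User.userId is an assumption for the user data.
from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=None)

graph.run("CREATE CONSTRAINT Card_cardId_unique IF NOT EXISTS "
          "FOR (c:Card) REQUIRE c.cardId IS UNIQUE")
graph.run("CREATE CONSTRAINT Merchant_merchantId_unique IF NOT EXISTS "
          "FOR (m:Merchant) REQUIRE m.merchantId IS UNIQUE")
graph.run("CREATE CONSTRAINT User_userId_unique IF NOT EXISTS "
          "FOR (u:User) REQUIRE u.userId IS UNIQUE")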
py2neo
- py2neo is a module for communicating with Neo4j from Python. There is a recent tendency to recommend neomodel instead, since py2neo is no longer actively maintained, but we will use py2neo because we believe the difference in functionality and performance between the two modules is not that large.
from py2neo import Graph, Node, Relationship
HOST = "bolt://localhost:7687"
graph = Graph(HOST, auth=None)
- Load the Graph, Node, and Relationship functions that will be utilized by the py2neo module.
- Declare HOST as a variable for the bolt connection on port 7687, and pass it to the Graph function, assigning the result to `graph`.
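- A quick sanity check (a minimal sketch): if the container is running and the bolt port is mapped, this returns a single row.
print(graph.run("RETURN 1 AS ok").data())   # expected: [{'ok': 1}]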
Python load data to Neo4j
There are two things to load: nodes and edges. Let’s start with node loading, which looks like the following code.
Node load code
#1
def create_Cardnode(row):
    node = Node("Card",
                cardId=row['combinedC'],
                cardBrand=row['Card Brand'],
                cardType=row['Card Type'])
    graph.create(node)
#2
graph.run("CREATE CONSTRAINT Card_cardId_unique IF NOT EXISTS FOR (x:Card) REQUIRE x.cardId IS UNIQUE")
#3
tmpcard[['combinedC','Card Brand','Card Type']].apply(create_Cardnode, axis=1)
- In #1, we set the label and properties of the cardnode.
- In #2, we declare a constraint before loading. It is written by specifying a label and a unique property.
- In #3, we apply the pandas.dataframe containing the data to be loaded to the node load function using the apply function.
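- Calling graph.create() row by row means one round trip to the server per node. A hedged, faster alternative is to send batches with UNWIND and MERGE; the column names match the dataframe above, and the batch size is an arbitrary example value.
def load_cards_in_batches(df, batch_size=1000):
    query = """
    UNWIND $rows AS row
    MERGE (c:Card {cardId: row.cardId})
    SET c.cardBrand = row.cardBrand,
        c.cardType  = row.cardType
    """
    rows = df.rename(columns={
        "combinedC": "cardId",
        "Card Brand": "cardBrand",
        "Card Type": "cardType",
    }).to_dict("records")
    for i in range(0, len(rows), batch_size):
        graph.run(query, rows=rows[i:i + batch_size])

load_cards_in_batches(tmpcard[['combinedC', 'Card Brand', 'Card Type']])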
Edge load code
#1
def create_cmedge(row):
    src = graph.nodes.match('Card', cardId=row['combinedC']).first()
    dst = graph.nodes.match('Merchant', merchantId=row['Pmer']).first()
    if src and dst:
        relationship = Relationship(src,
                                    row['relname'],
                                    dst,
                                    fraud=row['Is Fraud?'],
                                    amount=row['Amount'],
                                    date=row['Date'],
                                    time=row['Time'])
        graph.create(relationship)
#2
tmptx.apply(create_cmedge, axis=1)
- In #1, we look up the start and end nodes by their unique keys, and if both are found, we create a relationship between them and set its properties.
- In #2, we apply the function through apply to the data frame containing the edge relationship.
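- As with nodes, a hedged batched variant using UNWIND avoids one round trip per edge. The relationship type is fixed here (TRANSACTED is an assumption; the code above reads it from row['relname']) because plain Cypher cannot take a relationship type from a parameter.
def load_edges_in_batches(df, batch_size=1000):
    query = """
    UNWIND $rows AS row
    MATCH (c:Card {cardId: row.cardId})
    MATCH (m:Merchant {merchantId: row.merchantId})
    MERGE (c)-[t:TRANSACTED {date: row.date, time: row.time}]->(m)
    SET t.fraud = row.fraud, t.amount = row.amount
    """
    rows = df.rename(columns={
        "combinedC": "cardId", "Pmer": "merchantId",
        "Is Fraud?": "fraud", "Amount": "amount",
        "Date": "date", "Time": "time",
    }).to_dict("records")
    for i in range(0, len(rows), batch_size):
        graph.run(query, rows=rows[i:i + batch_size])

load_edges_in_batches(tmptx)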
- Connect to “localhost:7474/browser” to check the loaded result.
To check the graph schema, enter `CALL db.schema.visualization()` and you can see the loaded data.
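- A few simple exploration queries run through py2neo (a sketch; the values of the fraud property are whatever the ‘Is Fraud?’ column contained, so adjust the grouping to your data):
print(graph.run("MATCH (c:Card) RETURN count(c) AS cards").data())
print(graph.run("MATCH (m:Merchant) RETURN count(m) AS merchants").data())
print(graph.run(
    "MATCH ()-[t]->() RETURN t.fraud AS fraud, count(*) AS tx ORDER BY tx DESC"
).data())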
Summary
- We learned how to write a Dockerfile based on the Neo4j image, build and run the container, load data, and visualize it for analysis.
- We applied the conf settings recommended by the neo4j-admin command so that the internal engine uses memory efficiently.
- For convenience, we loaded the data with py2neo, a module that communicates with Neo4j from Python.