
Vector search on knowledge graph in Neo4j

13 min read
Feb 20, 2025
Contents
What is vector search?
Implementation: Vector search in Neo4j
Step 1: Loading the knowledge graph
Step 2: Generating graph embeddings
Node2Vec for structure and relationships
BERT for semantic depth
Concatenating embeddings for a comprehensive representation
Step 3: Storing embeddings in Neo4j
Cypher query for storing embeddings
Step 4: Creating a vector index
Step 5: Performing similarity searches
Generating query embedding
Performing the similarity search
Conclusion and next steps

Knowledge graphs (KGs) provide a powerful way to represent data by connecting entities (nodes) through relationships (edges). They help us uncover insights by revealing how various pieces of information interrelate—insights that are often hidden in unstructured text. But what happens when we need to find a specific entity in this intricate web of connections?

Traditional search methods rely on exact keyword matching: you enter a node’s name or label, and the system returns only results that match exactly. While this approach works for simple queries, it struggles with more complex scenarios. For example, searching for “William Shakespeare” in a knowledge graph where the node is labeled “Shakespeare” would fail unless the name matches precisely—even though both refer to the same person. Clearly, we need a more intelligent approach.

Knowledge graphs have gained attention in recent years for their role in enhancing the capabilities of large language models (LLMs). When paired with LLMs, knowledge graphs act as a structured context provider for answering questions. However, efficiently retrieving relevant nodes or entities from a knowledge graph to feed into an LLM often poses a challenge, especially in large and complex graphs.

This is where vector search comes in.

Vector search enables us to find similar entities by representing them as embeddings—numerical vectors that capture the semantic meaning of an entity based on its attributes and relationships. By comparing the embeddings of different entities, we can identify those that are most similar in meaning, even when their exact wording or structure differs.

Embeddings are generated using embedding models, which are specialized machine learning models designed to transform entities, texts, or other forms of data into vector representations.
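
For intuition, here is a minimal sketch of how similarity between two embeddings is typically measured, using cosine similarity. The vectors are made-up toy values for illustration only, not real embeddings:

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: close to 1 means the vectors point in the same direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 4-dimensional "embeddings" for illustration only
shakespeare = np.array([0.9, 0.1, 0.3, 0.7])
william_shakespeare = np.array([0.85, 0.15, 0.35, 0.65])
riverbank = np.array([0.1, 0.9, 0.8, 0.05])

print(cosine_similarity(shakespeare, william_shakespeare))  # ~0.99, very similar
print(cosine_similarity(shakespeare, riverbank))            # ~0.32, dissimilar
Toy example of comparing embeddings with cosine similarity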

Here’s how vector search works when used with LLMs for context retrieval:

Vector search on knowledge graph
  1. A user inputs a query into a Q/A system that integrates an LLM for response generation and a knowledge graph for context retrieval.

  2. The query is transformed into a numerical vector (embedding) using an embedding model.

  3. This query vector is matched against a precomputed vector index derived from the knowledge graph.

  4. The most similar vectors (and their corresponding entities) are retrieved, and the LLM converts the results into a natural language response.

Now, how do we implement vector search on knowledge graphs? Enter Neo4j, a leading graph database management system (GDBMS) designed for storing and querying data as a network of nodes and relationships.

In this blog, we’ll assume you already have a knowledge graph stored in Neo4j. We’ll guide you through the steps to integrate vector search, including:

  1. Loading a knowledge graph from Neo4j.

  2. Generating embeddings for nodes in the graph.

  3. Storing the embeddings as properties in Neo4j.

  4. Creating a vector index on these embeddings using Cypher, Neo4j’s query language.

  5. Performing similarity searches with vector embeddings.

By the end of this post, you’ll be ready to transform your knowledge graph search experience with the power of Neo4j and vector search.

Implementation: Vector search in Neo4j#

Before we dive into the implementation, ensure you have the following installed:

pip install neo4j pandas numpy networkx==2.5 node2vec scikit-learn

Additionally, ensure you have access to a Neo4j instance, that is, a running Neo4j database where your knowledge graph is built and stored.

We have the following knowledge graph built and stored in our Neo4j instance. It's a simple knowledge graph with 15 entities and 11 relationships.

My simple knowledge graph stored in Neo4j
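
If you don't yet have a similar graph to experiment with, the following sketch creates a few Entity nodes and RELATION relationships matching the schema used in this post. The toy triples are hypothetical, and the connection credentials are the ones introduced in Step 1:

from neo4j import GraphDatabase

# Hypothetical toy triples, just to have something to search over
toy_triples = [
    ("Abraham Lincoln", "PRESIDENT_OF", "United States"),
    ("Abraham Lincoln", "BORN_IN", "Kentucky"),
]

driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER_NAME, NEO4J_PASSWORD))
with driver.session() as session:
    for source, rel_type, target in toy_triples:
        session.run(
            """
            MERGE (s:Entity {name: $source})
            MERGE (t:Entity {name: $target})
            MERGE (s)-[:RELATION {type: $rel_type}]->(t)
            """,
            source=source, target=target, rel_type=rel_type,
        )
driver.close()
Optional: creating a small toy knowledge graph to follow along with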

Step 1: Loading the knowledge graph#

The first step involves connecting to your Neo4j database and loading the knowledge graph data into a NetworkX graph structure. NetworkX is a Python library used for the creation, manipulation, and analysis of graphs. It helps us represent the knowledge graph as a collection of nodes (entities) and edges (relationships), allowing us to iterate over the graph and generate node and relationship embeddings.

To do this, we first import GraphDatabase from the Neo4j Python driver. The Neo4j Python driver is a package that allows Python applications to interact with a Neo4j database, providing an interface to execute queries, manage transactions, and retrieve data from the Neo4j instance.

from neo4j import GraphDatabase

To connect to your Neo4j instance, you need to have the following credentials.

NEO4J_URI = ""
NEO4J_USER_NAME = ""
NEO4J_PASSWORD = ""
Required Neo4j credentials
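
Hard-coding credentials is fine for a quick experiment, but you may prefer to read them from environment variables. A minimal sketch, assuming you export NEO4J_URI, NEO4J_USER_NAME, and NEO4J_PASSWORD in your shell (the defaults shown are just common local values):

import os

NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://localhost:7687")
NEO4J_USER_NAME = os.environ.get("NEO4J_USER_NAME", "neo4j")
NEO4J_PASSWORD = os.environ["NEO4J_PASSWORD"]  # fail loudly if the password is missing
Optional: reading Neo4j credentials from environment variables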

Using these credentials, we establish a connection to a Neo4j database and open a session to execute queries like fetching nodes and relationships from the knowledge graph.

driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER_NAME, NEO4J_PASSWORD))
with driver.session() as session:
    ...  # queries against the knowledge graph run inside this block
Opening a Neo4j session to interact with the data in the Neo4j instance

Inside the session, we execute a Cypher query to retrieve all nodes. In our knowledge graph, we have all nodes labeled Entity. Each node has id and name properties, as shown in the following illustration.

Node properties in my knowledge graph in Neo4j

With the following Cypher query, we fetch all nodes with the label Entity, returning the id and name of each node.

nodes_query = """
MATCH (n:Entity)
RETURN id(n) AS id, n.name AS name
"""
Cypher query to fetch all nodes labeled "Entity"

To execute the Cypher query through code, we need to run it in the session with session.run().

nodes = session.run(nodes_query)

Similar to fetching nodes, we will now fetch relationship tuples (n, r, m) where n and m are two nodes and r represents the relationship between node n and node m. Each relationship is labeled RELATION. Each relationship has id and type properties.

Relationship properties in the knowledge graph

With the following Cypher query, we fetch all relationships with the label RELATION. We return each relationship as a tuple: the source node, the target node, and the relationship type between them.

relationships_query = """
MATCH (n:Entity)-[r:RELATION]->(m:Entity)
RETURN n.name AS source, m.name AS target, r.type AS relationship_type
"""
Cypher query to fetch relationships of type "RELATION"

We run the Cypher query to fetch relationships.

relationships = session.run(relationships_query)

Here's the complete code for loading the knowledge graph from Neo4j.

import pandas as pd
from neo4j import GraphDatabase

def load_graph():
    driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER_NAME, NEO4J_PASSWORD))
    with driver.session() as session:
        # Fetch all nodes with label 'Entity' and their properties
        nodes_query = """
        MATCH (n:Entity)
        RETURN id(n) AS id, n.name AS name
        """
        nodes = session.run(nodes_query)
        nodes_df = pd.DataFrame([record.data() for record in nodes])
        print(nodes_df)
        # Fetch all relationships of type 'RELATION'
        relationships_query = """
        MATCH (n:Entity)-[r:RELATION]->(m:Entity)
        RETURN n.name AS source, r.type AS relationship_type, m.name AS target
        """
        relationships = session.run(relationships_query)
        relationships_df = pd.DataFrame([record.data() for record in relationships])
        print(relationships_df)
    driver.close()
    return nodes_df, relationships_df

Step 2: Generating graph embeddings#

With the knowledge graph loaded, the next step is to generate node embeddings that capture the semantic, structural and relational nuances of the graph.

Node2Vec for structure and relationships#

Node2Vec is a widely used algorithm for generating embeddings that capture the structural and relational nuances of a graph. It leverages random walks to explore the graph's structure, capturing both homophily and structural equivalence.

  • Homophily: This is the principle that similar nodes tend to connect to each other. In a social network, for example, friends often share similar interests or backgrounds. In a knowledge graph, this means that nodes with similar characteristics (e.g., books with the same genre) are likely to have direct connections. Node2Vec captures these connections so that the embeddings of connected nodes are similar, reflecting their shared characteristics.

  • Structural equivalence: Structural equivalence describes nodes that play similar roles in the graph, even if they aren't directly connected. For instance, in a corporate network, two employees from different departments may not work together directly but may both report to managers, reflecting a similar "role" within the hierarchy. Node2Vec captures this by considering nodes with similar structural patterns, allowing embeddings to reflect role similarity rather than direct similarity.

To prepare graph input for the Node2Vec algorithm, let's first create the graph structure using the nodes and relationship tuples we retrieved from Neo4j using the NetworkX library.

import networkx as nx

def create_graph(nodes_df, relationships_df):
    G = nx.MultiDiGraph()
    # Add every entity as a node, keyed by its name so it matches the relationship tuples
    for _, row in nodes_df.iterrows():
        G.add_node(row['name'], id=str(row['id']))
    # Add every relationship as a directed edge between the source and target entities
    for _, row in relationships_df.iterrows():
        if row['source'] in nodes_df['name'].values and row['target'] in nodes_df['name'].values:
            G.add_edge(row['source'], row['target'], relationship=row['relationship_type'])
    return G
Creating the NetworkX graph

After creating the graph G, we can now apply the Node2Vec algorithm to generate embeddings for each node in the graph.

from node2vec import Node2Vec
def generate_node2vec_embeddings(G):
    node2vec = Node2Vec(G, dimensions=64, walk_length=10, num_walks=100, workers=4)
    node2vec_model = node2vec.fit()
    node2vec_embeddings = {node: node2vec_model.wv[node] for node in G.nodes()}
    return node2vec_embeddings, node2vec_model  # keep the fitted model; it is reused for query embeddings in Step 5
  • Line 3: We configure the Node2Vec model with the following parameters (a short usage sketch follows this list):

    • G: The input graph.

    • dimensions (default=128): It specifies the size of each embedding vector. Higher dimensions provide more information in the embeddings but require more computational power.

    • walk_length (default=80): It controls the number of steps in each random walk performed by Node2Vec. A longer walk length can capture more distant relationships in the graph but can also increase computational costs.

    • num_walks (default=10): It defines the number of random walks to start from each node. More walks help capture a broader variety of relationships, enhancing the quality of the embeddings.

    • workers (default=4): It is the number of CPU cores to use during training. Increasing the number of workers speeds up the computation by parallelizing tasks.

  • Line 4: We train our configured node2vec model on the graph G, learning embeddings that capture the graph's structure.

  • Line 5: For each node in the graph, we extract its corresponding embedding vector and store it in a dictionary for easy access.
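
Here is a quick usage sketch, assuming G is the NetworkX graph from the previous step. Note that the function returns both the embeddings dictionary and the fitted model; we hold on to the model because it is needed again in Step 5 to embed query entities:

node2vec_embeddings, node2vec_model = generate_node2vec_embeddings(G)

sample_node = next(iter(G.nodes()))
print(f"{len(node2vec_embeddings)} node embeddings generated")
print(f"Embedding for '{sample_node}' has shape {node2vec_embeddings[sample_node].shape}")  # (64,)
Usage sketch for the Node2Vec embeddings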

BERT for semantic depth#

While Node2Vec embeddings effectively capture structural and relational nuances within the graph, they do not account for the semantic depth in node labels or descriptions. This limitation can lead to ambiguities in scenarios where node semantics play a crucial role, such as resolving the polysemous meanings of words. For example, consider the polysemous word "bank." Using Node2Vec alone, the embeddings might capture that "bank" is connected to nodes like "loans" or "interest rates" but fail to differentiate between "bank" as a financial institution and "bank" as a riverbank.

To address this, we can use pre-trained transformer-based models, such as BERT, which are highly effective at capturing the nuanced meanings of words and phrases. Using transformer embeddings alongside Node2Vec enables us to embed nodes with vectors that reflect not only their positions and connections within the graph but also their inherent meanings, creating a dual perspective of structure and semantics. This approach results in a more comprehensive embedding space, where nodes are positioned based on both their contextual relevance and their roles within the graph, enhancing applications like search and recommendation with greater precision and interpretability. So let's see how to create node embeddings using BERT.

In our simple knowledge graph, each node has only an ID and a name. In the code below, however, we assume nodes also have a description property. A node can have any number of properties, and whichever properties we want reflected in the embedding can be concatenated into the input text for BERT or any other transformer model.

from transformers import BertTokenizer, BertModel
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def get_transformer_embeddings(nodes_df):
    embeddings = {}
    for _, row in nodes_df.iterrows():
        inputs = tokenizer(row['name'] + " " + row['description'], return_tensors='pt', padding=True, truncation=True)
        with torch.no_grad():
            outputs = model(**inputs)
        # Use the mean of the last hidden states as the node embedding
        embeddings[row['name']] = outputs.last_hidden_state.mean(dim=1).squeeze().numpy()
    return embeddings
Generating embeddings using the BERT transformer
  • Line 1: We import the BERT tokenizer and model from the Hugging Face Transformers library, which are used for generating embeddings from text.

  • Line 2: We import the PyTorch library for handling tensor operations, crucial for working with deep learning models.

  • Line 3: We load the pre-trained BERT tokenizer that converts input text into tokens suitable for the BERT model.

  • Line 4: We load the pre-trained BERT model, which generates embeddings based on the input tokens.

  • Lines 6–14: We define a function to generate embeddings for each node’s textual data stored in the DataFrame nodes_df.

    • Line 7: We initialize an empty dictionary to store the embeddings for each node, indexed by node name.

    • Line 8: We iterate over each row in the DataFrame, where row contains the data for each node.

    • Line 9: We tokenize the concatenated name and description for each node, preparing it for the BERT model.

    • Line 10: We disable gradient computation to speed up the process during inference.

    • Line 11: We pass the tokenized inputs through the BERT model to obtain the output embeddings.

    • Line 13: We calculate the mean of the last hidden states to create a single embedding for the node, then store it in the dictionary.

    • Line 14: We return the dictionary containing the generated embeddings for all nodes.

Now that we are done generating Node2Vec and BERT embeddings for all our nodes in the knowledge graph, we will concatenate these embeddings for each node.

Concatenating embeddings for a comprehensive representation#

It's a simple concatenation of Node2Vec and BERT embeddings for each node.

import numpy as np
nodes_df, relationships_df = load_graph()
G = create_graph(nodes_df, relationships_df)
node2vec_embeddings, node2vec_model = generate_node2vec_embeddings(G)
transformer_embeddings = get_transformer_embeddings(nodes_df)

combined_embeddings = {}
for node in G.nodes():
    combined_embeddings[node] = np.concatenate((node2vec_embeddings[node], transformer_embeddings[node]))

# Display the combined embeddings
for node, embedding in combined_embeddings.items():
    print(f"Node: {node}, Combined Embedding Shape: {embedding.shape}")
Aggregating Node2Vec and BERT embeddings
  • Lines 2–5: We call all the functions we defined above one by one, keeping the fitted Node2Vec model for later use when embedding query entities.

  • Line 9: We concatenate the two embeddings for each node using NumPy's concatenate function, giving an 832-dimensional vector (64 from Node2Vec plus 768 from BERT).

  • Lines 12–13: We display the combined embedding shapes for verification.

Step 3: Storing embeddings in Neo4j#

Now that we have the embeddings, it’s time to store them in Neo4j as properties of the nodes. This will allow us to later create a vector index and perform searches.

Cypher query for storing embeddings#

We execute a Cypher query for each node in the Neo4j knowledge graph to store its embedding as a node property.

def store_embeddings_to_Neo4j(uri, user, password, embeddings):
    driver = GraphDatabase.driver(uri, auth=(user, password))
    with driver.session() as session:
        for node_id, embedding in embeddings.items():
            update_query = """
            MATCH (n:Entity {name: $node_id})
            SET n.embedding = $embedding
            """
            session.run(update_query, node_id=node_id, embedding=embedding.tolist())  # convert the NumPy array to a plain list for storage
            print(f"Updated embedding for node '{node_id}'")
    driver.close()
    print(f"Stored embeddings for {len(embeddings)} nodes in Neo4j.")

store_embeddings_to_Neo4j(NEO4J_URI, NEO4J_USER_NAME, NEO4J_PASSWORD, combined_embeddings)
  • Lines 4–9: For each node and its corresponding embedding, we execute a Cypher MATCH query to locate the node by its name and then SET the embedding property with the generated vector.

Now each node in our knowledge graph in Neo4j contains an embedding property, as shown below, based on which we will create a vector index.

Node embedding stored as node property in Neo4j
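
To double-check from code that the embeddings actually landed in the graph, here is a small verification sketch (size() returns the length of the stored list, which should be 832 for every node):

driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER_NAME, NEO4J_PASSWORD))
with driver.session() as session:
    result = session.run(
        "MATCH (n:Entity) WHERE n.embedding IS NOT NULL "
        "RETURN n.name AS name, size(n.embedding) AS dims LIMIT 5"
    )
    for record in result:
        print(record["name"], record["dims"])  # dims should be 832
driver.close()
Verifying that embeddings were stored as node properties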

Step 4: Creating a vector index#

To facilitate efficient similarity searches, we need to create a vector index on the embedding property of the nodes. This index allows Neo4j to perform rapid vector-based queries.

CREATE VECTOR INDEX kgvectorindex IF NOT EXISTS
FOR (m:Entity)
ON m.embedding
OPTIONS {
  indexConfig: {
    `vector.dimensions`: 832,
    `vector.similarity_function`: 'cosine'
  }
}
  • Lines 1–3: We create a vector index named kgvectorindex for nodes labeled Entity, specifically targeting the embedding property.

  • Lines 5–8: Here we configure the following parameters:

    • Vector dimensions: It is set to 832, matching the dimensionality of our combined embeddings (64 from Node2Vec plus 768 from BERT).

    • Similarity function: It is configured to cosine similarity, a common metric for measuring the similarity between two vectors based on their orientation.

Following is the complete function that executes this Cypher query through the Python driver.

def create_vector_index(uri, user, password):
    create_index_query = """
    CREATE VECTOR INDEX kgvectorindex IF NOT EXISTS
    FOR (n:Entity)
    ON n.embedding
    OPTIONS {
      indexConfig: {
        `vector.dimensions`: 832,
        `vector.similarity_function`: 'cosine'
      }
    }
    """
    driver = GraphDatabase.driver(uri, auth=(user, password))
    with driver.session() as session:
        try:
            session.run(create_index_query)
            print("Vector index 'kgvectorindex' has been created or already exists.")
        except Exception as e:
            print(f"An error occurred while creating the vector index: {e}")
    driver.close()

create_vector_index(NEO4J_URI, NEO4J_USER_NAME, NEO4J_PASSWORD)

This index is crucial for enabling fast and accurate similarity searches within the knowledge graph.
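
Index creation is asynchronous, so it can take a moment for the index to come online. On Neo4j 5.x you can check its state with something along these lines (a sketch; the exact columns returned by SHOW INDEXES can vary slightly between versions):

driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER_NAME, NEO4J_PASSWORD))
with driver.session() as session:
    result = session.run(
        "SHOW INDEXES YIELD name, type, state "
        "WHERE name = 'kgvectorindex' "
        "RETURN name, type, state"
    )
    for record in result:
        print(record["name"], record["type"], record["state"])  # expect state 'ONLINE'
driver.close()
Checking that the vector index is online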

Step 5: Performing similarity searches#

With the vector index in place, we can now perform similarity searches. We need to ensure that query embeddings are generated with the same Node2Vec and BERT models, and concatenated in the same order as the node embeddings, so that query and node vectors live in the same vector space.

Generating query embedding#

Given a query, we can extract entities from it and generate embeddings for these entities using the Node2Vec model trained on our knowledge graph. In the code below, we assume the query contains a single entity and retrieve its embedding from the fitted Node2Vec model.

def get_query_entity_embedding(model, query_entity):
    if query_entity in model.wv:
        return model.wv.get_vector(query_entity).tolist()
    else:
        raise ValueError(f"Query entity '{query_entity}' not found in the Node2Vec model.")

# Example query entity
query_entity = "Abraham Lincoln"
query_entity_embedding = get_query_entity_embedding(node2vec_model, query_entity)
  • Lines 2–5: We retrieve the embedding vector for a specified query_entity from the Node2Vec model. If the entity exists in the model's vocabulary, it returns the vector as a list; otherwise, it raises a ValueError indicating that the entity is not found.

We can generate an embedding for the complete query using the same pre-trained BERT model we used for node embeddings. We then combine the Node2Vec embedding of each entity in the query with the BERT embedding, and run a vector search on each combined embedding to retrieve relevant context for the query.

If the query entity is not found in the Node2Vec model, we can use a placeholder zero vector of the same dimension when concatenating with the BERT embedding, so that the combined query embedding has the same dimension as the node embeddings in the knowledge graph. Matching dimensions are necessary for the similarity comparison, as shown in the sketch below.
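
Here is a sketch of how the combined query embedding could be built. It reuses the tokenizer and model (BERT) objects from Step 2, the fitted node2vec_model, and the get_query_entity_embedding helper defined above; the example query text is hypothetical:

import numpy as np
import torch

NODE2VEC_DIM = 64  # must match the 'dimensions' used when fitting Node2Vec

# Node2Vec part: the entity's vector if the graph has seen it, otherwise a zero placeholder
try:
    node2vec_part = np.array(get_query_entity_embedding(node2vec_model, query_entity))
except ValueError:
    node2vec_part = np.zeros(NODE2VEC_DIM)

# BERT part: embed the full query text, pooled the same way as the node embeddings
query_text = "Who was Abraham Lincoln?"
inputs = tokenizer(query_text, return_tensors='pt', padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
bert_part = outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

# Concatenate in the same order as the node embeddings: Node2Vec first, then BERT (64 + 768 = 832)
query_embedding = np.concatenate((node2vec_part, bert_part)).tolist()
Building the combined query embedding with a zero-vector fallback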

Based on the query embedding, we perform a vector search on the Neo4j database containing the knowledge graph and its vector index to find the top 5 similar nodes. The number of nodes to retrieve is configurable.

def find_similar_nodes(uri, user, password, query_embedding, top_k=5):
    driver = GraphDatabase.driver(uri, auth=(user, password))
    with driver.session() as session:
        search_query = """
        CALL db.index.vector.queryNodes('kgvectorindex', $top_k, $query_embedding)
        YIELD node, score
        RETURN node.name AS name, score
        ORDER BY score DESC
        """
        results = session.run(search_query, top_k=top_k, query_embedding=query_embedding)
        similar_nodes = [(record['name'], record['score']) for record in results]
    driver.close()
    return similar_nodes

# Find top 5 similar nodes
similar_nodes = find_similar_nodes(NEO4J_URI, NEO4J_USER_NAME, NEO4J_PASSWORD, query_embedding)
print("Top 5 similar nodes:")
for name, score in similar_nodes:
    print(f"{name} (Score: {score})")
  • Lines 4–9: We utilize the db.index.vector.queryNodes procedure to search for the top k similar nodes based on the query_embedding, using the kgvectorindex index.
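
Finally, the retrieved nodes can be handed to the LLM as context, closing the loop described at the start of the post. The prompt format below is only an illustrative sketch; how you actually call the LLM depends on the provider you use:

def build_llm_prompt(user_query, similar_nodes):
    # Turn the retrieved (name, score) pairs into a plain-text context block
    context_lines = [f"- {name} (similarity: {score:.3f})" for name, score in similar_nodes]
    context = "\n".join(context_lines)
    return (
        "Answer the question using the knowledge graph entities below as context.\n\n"
        f"Entities:\n{context}\n\n"
        f"Question: {user_query}\nAnswer:"
    )

prompt = build_llm_prompt("Who was Abraham Lincoln?", similar_nodes)
print(prompt)  # send this prompt to the LLM of your choice for response generation
Sketch: packaging retrieved nodes as LLM context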

Conclusion and next steps#

The integration of vector search with knowledge graphs in Neo4j presents a powerful approach to enhancing data retrieval and semantic understanding. By utilizing embeddings to capture the nuanced meanings of entities and their interconnections, we can overcome the limitations of traditional keyword searches that often miss relevant results due to exact matching constraints.

This combination not only improves the precision of searches but also opens doors to various real-world applications. For example:

  • In recommendation systems, businesses can leverage vector search to provide personalized content suggestions by analyzing user preferences in the context of a rich knowledge graph.

  • In customer support, the technology can enable more intuitive query handling, allowing support agents to access relevant knowledge quickly, regardless of the specific terms used by the customer.

  • In content retrieval, organizations can enhance their search capabilities, enabling users to find pertinent information in large datasets efficiently.

Now that you know how to perform vector search on a knowledge graph, experiment with different embedding models based on your application's needs (lightweight models for scalability, or deeper models for more nuanced semantic understanding), try creating embeddings for different data types, and explore how Neo4j can help you apply these techniques in real-world applications such as personalized recommendations.

If you're eager to dive deeper, here are some resources to help you take the next steps:

If you want to learn how to construct knowledge graphs from unstructured raw text, effectively store and query them using Neo4j, integrate knowledge graphs with LLMs for context retrieval, and generate context-aware responses with LLMs, consider taking the following course.


To learn how to generate embeddings for different types of data, you may consider taking the following course.



Frequently Asked Questions

What is vector search in knowledge graphs?

Vector search in knowledge graphs refers to finding similar items or entities using their numerical embeddings within the graph. Each entity or node is represented as a vector in a high-dimensional space, capturing its semantic meaning. By leveraging similarity measures like cosine similarity or Euclidean distance, vector search enables the retrieval of entities that are most semantically related to a query. This approach combines the structural relationships in knowledge graphs with the power of embeddings for enhanced search and recommendation.



Written By:
Asmat Batool