
Vector search on knowledge graph in Neo4j

13 min read
Feb 20, 2025
Contents
What is vector search?
Implementation: Vector search in Neo4j
Step 1: Loading the knowledge graph
Step 2: Generating graph embeddings
Node2Vec for structure and relationships
BERT for semantic depth
Concatenating embeddings for a comprehensive representation
Step 3: Storing embeddings in Neo4j
Cypher query for storing embeddings
Step 4: Creating a vector index
Step 5: Performing similarity searches
Generating query embedding
Performing the similarity search
Conclusion and next steps

Knowledge graphs (KGs) provide a powerful way to represent data by connecting entities (nodes) through relationships (edges). They help us uncover insights by revealing how various pieces of information interrelate—insights that are often hidden in unstructured text. But what happens when we need to find a specific entity in this intricate web of connections?

Traditional search methods rely on exact keyword matching: you enter a node’s name or label, and the system returns only results that match exactly. While this approach works for simple queries, it struggles with more complex scenarios. For example, searching for “William Shakespeare” in a knowledge graph where the node is labeled “Shakespeare” would fail unless the name matches precisely—even though both refer to the same person. Clearly, we need a more intelligent approach.

Knowledge graphs have gained attention in recent years for their role in enhancing the capabilities of large language models (LLMs). When paired with LLMs, knowledge graphs act as a structured context provider for answering questions. However, efficiently retrieving relevant nodes or entities from a knowledge graph to feed into an LLM often poses a challenge, especially in large and complex graphs.

This is where vector search comes in.

Vector search enables us to find similar entities by representing them as embeddings—numerical vectors that capture the semantic meaning of an entity based on its attributes and relationships. By comparing the embeddings of different entities, we can identify those that are most similar in meaning, even when their exact wording or structure differs.

Embeddings are generated using embedding models, which are specialized machine learning models designed to transform entities, texts, or other forms of data into vector representations.
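
For intuition, here is a minimal sketch of how similarity between two embeddings is typically measured, using cosine similarity. The vectors are made-up toy values for illustration only, not real embeddings:

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: close to 1 means the vectors point in the same direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 4-dimensional "embeddings" for illustration only
shakespeare = np.array([0.9, 0.1, 0.3, 0.7])
william_shakespeare = np.array([0.85, 0.15, 0.35, 0.65])
riverbank = np.array([0.1, 0.9, 0.8, 0.05])

print(cosine_similarity(shakespeare, william_shakespeare))  # ~0.99, very similar
print(cosine_similarity(shakespeare, riverbank))            # ~0.32, dissimilar
Toy example of comparing embeddings with cosine similarity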

Here’s how vector search works when used with LLMs for context retrieval:

Vector search on knowledge graph
  1. A user inputs a query into a Q/A system that integrates an LLM for response generation and a knowledge graph for context retrieval.

  2. The query is transformed into a numerical vector (embedding) using an embedding model.

  3. This query vector is matched against a precomputed vector index derived from the knowledge graph.

  4. The most similar vectors (and their corresponding entities) are retrieved, and the LLM converts the results into a natural language response.

Now, how do we implement vector search on knowledge graphs? Enter Neo4j, a leading graph database management system (GDBMS) designed for storing and querying data as a network of nodes and relationships.

In this blog, we’ll assume you already have a knowledge graph stored in Neo4j. We’ll guide you through the steps to integrate vector search, including:

  1. Loading a knowledge graph from Neo4j.

  2. Generating embeddings for nodes in the graph.

  3. Storing the embeddings as properties in Neo4j.

  4. Creating a vector index on these embeddings using Cypher, Neo4j’s query language.

  5. Performing similarity searches with vector embeddings.

By the end of this post, you’ll be ready to transform your knowledge graph search experience with the power of Neo4j and vector search.

Implementation: Vector search in Neo4j#

Before we dive into the implementation, ensure you have the following installed:

pip install neo4j pandas numpy networkx==2.5 node2vec scikit-learn

Additionally, ensure you have access to a Neo4j instance, that is, a running Neo4j database where your knowledge graph is built and stored.

We have the following knowledge graph built and stored in our Neo4j instance. It's a simple knowledge graph with 15 entities and 11 relationships.

My simple knowledge graph stored in Neo4j
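
If you don't yet have a similar graph to experiment with, the following sketch creates a few Entity nodes and RELATION relationships matching the schema used in this post. The toy triples are hypothetical, and the connection credentials are the ones introduced in Step 1:

from neo4j import GraphDatabase

# Hypothetical toy triples, just to have something to search over
toy_triples = [
    ("Abraham Lincoln", "PRESIDENT_OF", "United States"),
    ("Abraham Lincoln", "BORN_IN", "Kentucky"),
]

driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER_NAME, NEO4J_PASSWORD))
with driver.session() as session:
    for source, rel_type, target in toy_triples:
        session.run(
            """
            MERGE (s:Entity {name: $source})
            MERGE (t:Entity {name: $target})
            MERGE (s)-[:RELATION {type: $rel_type}]->(t)
            """,
            source=source, target=target, rel_type=rel_type,
        )
driver.close()
Optional: creating a small toy knowledge graph to follow along with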

Step 1: Loading the knowledge graph#

The first step involves connecting to your Neo4j database and loading the knowledge graph data into a NetworkX graph structure. NetworkX is a Python library used for the creation, manipulation, and analysis of graphs. It helps us represent the knowledge graph as a collection of nodes (entities) and edges (relationships), allowing us to iterate over the graph and generate node and relationship embeddings.

To do this, we first import GraphDatabase from the Neo4j Python driver. The Neo4j Python driver is a package that allows Python applications to interact with a Neo4j database, providing an interface to execute queries, manage transactions, and retrieve data from the Neo4j instance.

from neo4j import GraphDatabase

To connect to your Neo4j instance, you need to have the following credentials.

NEO4J_URI = ""
NEO4J_USER_NAME = ""
NEO4J_PASSWORD = ""
Required Neo4j credentials
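
Hard-coding credentials is fine for a quick experiment, but you may prefer to read them from environment variables. A minimal sketch, assuming you export NEO4J_URI, NEO4J_USER_NAME, and NEO4J_PASSWORD in your shell (the defaults shown are just common local values):

import os

NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://localhost:7687")
NEO4J_USER_NAME = os.environ.get("NEO4J_USER_NAME", "neo4j")
NEO4J_PASSWORD = os.environ["NEO4J_PASSWORD"]  # fail loudly if the password is missing
Optional: reading Neo4j credentials from environment variables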

Using these credentials, we establish a connection to a Neo4j database and open a session to execute queries like fetching nodes and relationships from the knowledge graph.

driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER_NAME, NEO4J_PASSWORD))
with driver.session() as session:
    ...  # queries against the knowledge graph run inside this block
Opening a Neo4j session to interact with the data in the Neo4j instance

Inside the session, we execute a Cypher query to retrieve all nodes. In our knowledge graph, we have all nodes labeled Entity. Each node has id and name properties, as shown in the following illustration.

Node properties in my knowledge graph in Neo4j

With the following Cypher query, we fetch all nodes with the label Entity, returning the id and name of each node.

nodes_query = """
MATCH (n:Entity)
RETURN id(n) AS id, n.name AS name
"""
Cypher query to fetch all nodes labeled "Entity"

To execute the Cypher query through code, we need to run it in the session with session.run().

nodes = session.run(nodes_query)

Similar to fetching nodes, we will now fetch relationship tuples (n, r, m) where n and m are two nodes and r represents the relationship between node n and node m. Each relationship is labeled RELATION. Each relationship has id and type properties.

Relationship properties in the knowledge graph

With the following Cypher query, we fetch all relationships with the label RELATION. We return each relationship as a tuple: the source node, the target node, and the relationship type between them.

relationships_query = """
MATCH (n:Entity)-[r:RELATION]->(m:Entity)
RETURN n.name AS source, m.name AS target, r.type AS relationship_type
"""
Cypher query to fetch relationships of type "RELATION"

We run the Cypher query to fetch relationships.

relationships = session.run(relationships_query)

Here's the complete code for loading the knowledge graph from Neo4j.

import pandas as pd
from neo4j import GraphDatabase

def load_graph():
    driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER_NAME, NEO4J_PASSWORD))
    with driver.session() as session:
        # Fetch all nodes with label 'Entity' and their properties
        nodes_query = """
        MATCH (n:Entity)
        RETURN id(n) AS id, n.name AS name
        """
        nodes = session.run(nodes_query)
        nodes_df = pd.DataFrame([record.data() for record in nodes])
        print(nodes_df)
        # Fetch all relationships of type 'RELATION'
        relationships_query = """
        MATCH (n:Entity)-[r:RELATION]->(m:Entity)
        RETURN n.name AS source, r.type AS relationship_type, m.name AS target
        """
        relationships = session.run(relationships_query)
        relationships_df = pd.DataFrame([record.data() for record in relationships])
        print(relationships_df)
    driver.close()
    return nodes_df, relationships_df

Step 2: Generating graph embeddings#

With the knowledge graph loaded, the next step is to generate node embeddings that capture the semantic, structural and relational nuances of the graph.

Node2Vec for structure and relationships#

Node2Vec is a widely used algorithm for generating embeddings that capture the structural and relational nuances of a graph. It leverages random walks to explore the graph's structure, capturing both homophily and structural equivalence.

  • Homophily: This is the principle that similar nodes tend to connect to each other. In a social network, for example, friends often share similar interests or backgrounds. In a knowledge graph, this means that nodes with similar characteristics (e.g., books with the same genre) are likely to have direct connections. Node2Vec captures these connections so that the embeddings of connected nodes are similar, reflecting their shared characteristics.

  • Structural equivalence: Structural equivalence describes nodes that play similar roles in the graph, even if they aren't directly connected. For instance, in a corporate network, two employees from different departments may not work together directly but may both report to managers, reflecting a similar "role" within the hierarchy. Node2Vec captures this by considering nodes with similar structural patterns, allowing embeddings to reflect role similarity rather than direct similarity.

To prepare graph input for the Node2Vec algorithm, let's first create the graph structure using the nodes and relationship tuples we retrieved from Neo4j using the NetworkX library.

import networkx as nx

def create_graph(nodes_df, relationships_df):
    G = nx.MultiDiGraph()
    # Add every entity as a node, keyed by its name so it matches the relationship tuples
    for _, row in nodes_df.iterrows():
        G.add_node(row['name'], id=str(row['id']))
    # Add every relationship as a directed edge between the source and target entities
    for _, row in relationships_df.iterrows():
        if row['source'] in nodes_df['name'].values and row['target'] in nodes_df['name'].values:
            G.add_edge(row['source'], row['target'], relationship=row['relationship_type'])
    return G
Creating the NetworkX graph

After creating the graph G, we can now apply the Node2Vec algorithm to generate embeddings for each node in the graph.

from node2vec import Node2Vec
def generate_node2vec_embeddings(G):
    node2vec = Node2Vec(G, dimensions=64, walk_length=10, num_walks=100, workers=4)
    node2vec_model = node2vec.fit()
    node2vec_embeddings = {node: node2vec_model.wv[node] for node in G.nodes()}
    return node2vec_embeddings, node2vec_model  # keep the fitted model; it is reused for query embeddings in Step 5
  • Line 3: We configure the Node2Vec model with the following parameters (a short usage sketch follows this list):

    • G: The input graph.

    • dimensions (default=128): It specifies the size of each embedding vector. Higher dimensions provide more information in the embeddings but require more computational power.

    • walk_length (default=80): It controls the number of steps in each random walk performed by Node2Vec. A longer walk length can capture more distant relationships in the graph but can also increase computational costs.

    • num_walks (default=10): It defines the number of random walks to start from each node. More walks help capture a broader variety of relationships, enhancing the quality of the embeddings.

    • workers (default=4): It is the number of CPU cores to use during training. Increasing the number of workers speeds up the computation by parallelizing tasks.

  • Line 4: We train our configured node2vec model on the graph G, learning embeddings that capture the graph's structure.

  • Line 5: For each node in the graph, we extract its corresponding embedding vector and store it in a dictionary for easy access.
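
Here is a quick usage sketch, assuming G is the NetworkX graph from the previous step. Note that the function returns both the embeddings dictionary and the fitted model; we hold on to the model because it is needed again in Step 5 to embed query entities:

node2vec_embeddings, node2vec_model = generate_node2vec_embeddings(G)

sample_node = next(iter(G.nodes()))
print(f"{len(node2vec_embeddings)} node embeddings generated")
print(f"Embedding for '{sample_node}' has shape {node2vec_embeddings[sample_node].shape}")  # (64,)
Usage sketch for the Node2Vec embeddings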

BERT for semantic depth#

While Node2Vec embeddings effectively capture structural and relational nuances within the graph, they do not account for the semantic depth in node labels or descriptions. This limitation can lead to ambiguities in scenarios where node semantics play a crucial role, such as resolving the polysemous meanings of words. For example, consider the polysemous word "bank." Using Node2Vec alone, the embeddings might capture that "bank" is connected to nodes like "loans" or "interest rates" but fail to differentiate between "bank" as a financial institution and "bank" as a riverbank.

To address this, we can use pre-trained transformer-based models, such as BERT, which are highly effective at capturing the nuanced meanings of words and phrases. Using transformer embeddings alongside Node2Vec enables us to embed nodes with vectors that reflect not only their positions and connections within the graph but also their inherent meanings, creating a dual perspective of structure and semantics. This approach results in a more comprehensive embedding space, where nodes are positioned based on both their contextual relevance and their roles within the graph, enhancing applications like search and recommendation with greater precision and interpretability. So let's see how to create node embeddings using BERT.

In our simple knowledge graph, each node has only an ID and a name. In the code below, however, we assume nodes also have a description property. A node can have any number of properties, and whichever properties we want reflected in the embedding can be concatenated into the input text for BERT or any other transformer model.

from transformers import BertTokenizer, BertModel
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def get_transformer_embeddings(nodes_df):
    embeddings = {}
    for _, row in nodes_df.iterrows():
        inputs = tokenizer(row['name'] + " " + row['description'], return_tensors='pt', padding=True, truncation=True)
        with torch.no_grad():
            outputs = model(**inputs)
        # Use the mean of the last hidden states as the node embedding
        embeddings[row['name']] = outputs.last_hidden_state.mean(dim=1).squeeze().numpy()
    return embeddings
Generating embeddings using the BERT transformer
  • Line 1: We import the BERT tokenizer and model from the Hugging Face Transformers library, which are used for generating embeddings from text.

  • Line 2: We import the PyTorch library for handling tensor operations, crucial for working with deep learning models.

  • Line 3: We load the pre-trained BERT tokenizer that converts input text into tokens suitable for the BERT model.

  • Line 4: We load the pre-trained BERT model, which generates embeddings based on the input tokens.

  • Lines 6–14: We define a function to generate embeddings for each node’s textual data stored in the DataFrame nodes_df.

    • Line 7: We initialize an empty dictionary to store the embeddings for each node, indexed by node name.

    • Line 8: We iterate over each row in the DataFrame, where row contains the data for each node.

    • Line 9: We tokenize the concatenated name and description for each node, preparing it for the BERT model.

    • Line 10: We disable gradient computation to speed up the process during inference.

    • Line 11: We pass the tokenized inputs through the BERT model to obtain the output embeddings.

    • Line 13: We calculate the mean of the last hidden states to create a single embedding for the node, then store it in the dictionary.

    • Line 14: We return the dictionary containing the generated embeddings for all nodes.

Now that we are done generating Node2Vec and BERT embeddings for all our nodes in the knowledge graph, we will concatenate these embeddings for each node.

Concatenating embeddings for a comprehensive representation#

It's a simple concatenation of Node2Vec and BERT embeddings for each node.

import numpy as np
nodes_df, relationships_df = load_graph()
G = create_graph(nodes_df, relationships_df)
node2vec_embeddings, node2vec_model = generate_node2vec_embeddings(G)
transformer_embeddings = get_transformer_embeddings(nodes_df)

combined_embeddings = {}
for node in G.nodes():
    combined_embeddings[node] = np.concatenate((node2vec_embeddings[node], transformer_embeddings[node]))

# Display the combined embeddings
for node, embedding in combined_embeddings.items():
    print(f"Node: {node}, Combined Embedding Shape: {embedding.shape}")
Aggregating Node2Vec and BERT embeddings
  • Lines 2–5: We call all the functions we defined above one by one, keeping the fitted Node2Vec model for later use when embedding query entities.

  • Line 9: We concatenate the two embeddings for each node using NumPy's concatenate function, giving an 832-dimensional vector (64 from Node2Vec plus 768 from BERT).

  • Lines 12–13: We display the combined embedding shapes for verification.

Step 3: Storing embeddings in Neo4j#

Now that we have the embeddings, it’s time to store them in Neo4j as properties of the nodes. This will allow us to later create a vector index and perform searches.

Cypher query for storing embeddings#

We execute a Cypher query for each node in the Neo4j knowledge graph to store its embedding as a node property.

def store_embeddings_to_Neo4j(uri, user, password, embeddings):
    driver = GraphDatabase.driver(uri, auth=(user, password))
    with driver.session() as session:
        for node_id, embedding in embeddings.items():
            update_query = """
            MATCH (n:Entity {name: $node_id})
            SET n.embedding = $embedding
            """
            session.run(update_query, node_id=node_id, embedding=embedding.tolist())  # convert the NumPy array to a plain list for storage
            print(f"Updated embedding for node '{node_id}'")
    driver.close()
    print(f"Stored embeddings for {len(embeddings)} nodes in Neo4j.")

store_embeddings_to_Neo4j(NEO4J_URI, NEO4J_USER_NAME, NEO4J_PASSWORD, combined_embeddings)
  • Lines 4–9: For each node and its corresponding embedding, we execute a Cypher MATCH query to locate the node by its name and then SET the embedding property with the generated vector.

Now each node in our knowledge graph in Neo4j contains an embedding property, as shown below, based on which we will create a vector index.

Node embedding stored as node property in Neo4j
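
To double-check from code that the embeddings actually landed in the graph, here is a small verification sketch (size() returns the length of the stored list, which should be 832 for every node):

driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER_NAME, NEO4J_PASSWORD))
with driver.session() as session:
    result = session.run(
        "MATCH (n:Entity) WHERE n.embedding IS NOT NULL "
        "RETURN n.name AS name, size(n.embedding) AS dims LIMIT 5"
    )
    for record in result:
        print(record["name"], record["dims"])  # dims should be 832
driver.close()
Verifying that embeddings were stored as node properties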

Step 4: Creating a vector index#

To facilitate efficient similarity searches, we need to create a vector index on the embedding property of the nodes. This index allows Neo4j to perform rapid vector-based queries.

CREATE VECTOR INDEX kgvectorindex IF NOT EXISTS
FOR (m:Entity)
ON m.embedding
OPTIONS {
  indexConfig: {
    `vector.dimensions`: 832,
    `vector.similarity_function`: 'cosine'
  }
}
  • Lines 1–3: We create a vector index named kgvectorindex for nodes labeled Entity, specifically targeting the embedding property.

  • Lines 5–8: Here we configure the following parameters:

    • Vector dimensions: It is set to 832, matching the dimensionality of our combined embeddings (64 from Node2Vec plus 768 from BERT).

    • Similarity function: It is configured to cosine similarity, a common metric for measuring the similarity between two vectors based on their orientation.

Following is the complete function that executes this Cypher query through the Python driver.

def create_vector_index(uri, user, password):
    create_index_query = """
    CREATE VECTOR INDEX kgvectorindex IF NOT EXISTS
    FOR (n:Entity)
    ON n.embedding
    OPTIONS {
      indexConfig: {
        `vector.dimensions`: 832,
        `vector.similarity_function`: 'cosine'
      }
    }
    """
    driver = GraphDatabase.driver(uri, auth=(user, password))
    with driver.session() as session:
        try:
            session.run(create_index_query)
            print("Vector index 'kgvectorindex' has been created or already exists.")
        except Exception as e:
            print(f"An error occurred while creating the vector index: {e}")
    driver.close()

create_vector_index(NEO4J_URI, NEO4J_USER_NAME, NEO4J_PASSWORD)

This index is crucial for enabling fast and accurate similarity searches within the knowledge graph.
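
Index creation is asynchronous, so it can take a moment for the index to come online. On Neo4j 5.x you can check its state with something along these lines (a sketch; the exact columns returned by SHOW INDEXES can vary slightly between versions):

driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER_NAME, NEO4J_PASSWORD))
with driver.session() as session:
    result = session.run(
        "SHOW INDEXES YIELD name, type, state "
        "WHERE name = 'kgvectorindex' "
        "RETURN name, type, state"
    )
    for record in result:
        print(record["name"], record["type"], record["state"])  # expect state 'ONLINE'
driver.close()
Checking that the vector index is online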

Step 5: Performing similarity searches#

With the vector index in place, we can now perform similarity searches. We need to ensure that query embeddings are generated with the same Node2Vec and BERT models, and concatenated in the same order as the node embeddings, so that query and node vectors live in the same vector space.

Generating query embedding#

Given a query, we can extract entities from it and generate embeddings for these entities using the Node2Vec model trained on our knowledge graph. In the code below, we assume the query contains a single entity and retrieve its embedding from the fitted Node2Vec model.

def get_query_entity_embedding(model, query_entity):
    if query_entity in model.wv:
        return model.wv.get_vector(query_entity).tolist()
    else:
        raise ValueError(f"Query entity '{query_entity}' not found in the Node2Vec model.")

# Example query entity
query_entity = "Abraham Lincoln"
query_entity_embedding = get_query_entity_embedding(node2vec_model, query_entity)
  • Lines 2–5: We retrieve the embedding vector for a specified query_entity from the Node2Vec model. If the entity exists in the model's vocabulary, it returns the vector as a list; otherwise, it raises a ValueError indicating that the entity is not found.

We can generate an embedding for the complete query using the same pre-trained BERT model we used for node embeddings. We then combine the Node2Vec embedding of each entity in the query with the BERT embedding, and run a vector search on each combined embedding to retrieve relevant context for the query.

If the query entity is not found in the Node2Vec model, we can use a placeholder zero vector of the same dimension when concatenating with the BERT embedding, so that the combined query embedding has the same dimension as the node embeddings in the knowledge graph. Matching dimensions are necessary for the similarity comparison, as shown in the sketch below.
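
Here is a sketch of how the combined query embedding could be built. It reuses the tokenizer and model (BERT) objects from Step 2, the fitted node2vec_model, and the get_query_entity_embedding helper defined above; the example query text is hypothetical:

import numpy as np
import torch

NODE2VEC_DIM = 64  # must match the 'dimensions' used when fitting Node2Vec

# Node2Vec part: the entity's vector if the graph has seen it, otherwise a zero placeholder
try:
    node2vec_part = np.array(get_query_entity_embedding(node2vec_model, query_entity))
except ValueError:
    node2vec_part = np.zeros(NODE2VEC_DIM)

# BERT part: embed the full query text, pooled the same way as the node embeddings
query_text = "Who was Abraham Lincoln?"
inputs = tokenizer(query_text, return_tensors='pt', padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
bert_part = outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

# Concatenate in the same order as the node embeddings: Node2Vec first, then BERT (64 + 768 = 832)
query_embedding = np.concatenate((node2vec_part, bert_part)).tolist()
Building the combined query embedding with a zero-vector fallback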

Based on the query embedding, we perform a vector search on the Neo4j database containing the knowledge graph and its vector index to find the top 5 similar nodes. The number of nodes to retrieve is configurable.

def find_similar_nodes(uri, user, password, query_embedding, top_k=5):
    driver = GraphDatabase.driver(uri, auth=(user, password))
    with driver.session() as session:
        search_query = """
        CALL db.index.vector.queryNodes('kgvectorindex', $top_k, $query_embedding)
        YIELD node, score
        RETURN node.name AS name, score
        ORDER BY score DESC
        """
        results = session.run(search_query, top_k=top_k, query_embedding=query_embedding)
        similar_nodes = [(record['name'], record['score']) for record in results]
    driver.close()
    return similar_nodes

# Find top 5 similar nodes
similar_nodes = find_similar_nodes(NEO4J_URI, NEO4J_USER_NAME, NEO4J_PASSWORD, query_embedding)
print("Top 5 similar nodes:")
for name, score in similar_nodes:
    print(f"{name} (Score: {score})")
  • Lines 4–9: We utilize the db.index.vector.queryNodes procedure to search for the top k similar nodes based on the query_embedding, using the kgvectorindex index.
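
Finally, the retrieved nodes can be handed to the LLM as context, closing the loop described at the start of the post. The prompt format below is only an illustrative sketch; how you actually call the LLM depends on the provider you use:

def build_llm_prompt(user_query, similar_nodes):
    # Turn the retrieved (name, score) pairs into a plain-text context block
    context_lines = [f"- {name} (similarity: {score:.3f})" for name, score in similar_nodes]
    context = "\n".join(context_lines)
    return (
        "Answer the question using the knowledge graph entities below as context.\n\n"
        f"Entities:\n{context}\n\n"
        f"Question: {user_query}\nAnswer:"
    )

prompt = build_llm_prompt("Who was Abraham Lincoln?", similar_nodes)
print(prompt)  # send this prompt to the LLM of your choice for response generation
Sketch: packaging retrieved nodes as LLM context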

Conclusion and next steps#

The integration of vector search with knowledge graphs in Neo4j presents a powerful approach to enhancing data retrieval and semantic understanding. By utilizing embeddings to capture the nuanced meanings of entities and their interconnections, we can overcome the limitations of traditional keyword searches that often miss relevant results due to exact matching constraints.

This combination not only improves the precision of searches but also opens doors to various real-world applications. For example:

  • In recommendation systems, businesses can leverage vector search to provide personalized content suggestions by analyzing user preferences in the context of a rich knowledge graph.

  • In customer support, the technology can enable more intuitive query handling, allowing support agents to access relevant knowledge quickly, regardless of the specific terms used by the customer.

  • In content retrieval, organizations can enhance their search capabilities, enabling users to find pertinent information in large datasets efficiently.

Now that you know how to perform vector search on a knowledge graph, experiment with different embedding models based on your application's needs (lightweight models for scalability, or deeper models for more nuanced semantic understanding), try creating embeddings for different data types, and explore how Neo4j can help you apply these techniques in real-world applications such as personalized recommendations.

If you're eager to dive deeper, here are some resources to help you take the next steps:

If you want to learn how to construct knowledge graphs from unstructured raw text, effectively store and query them using Neo4j, integrate knowledge graphs with LLMs for context retrieval, and generate context-aware responses with LLMs, consider taking the following course.


To learn how to generate embeddings for different types of data, you may consider taking the following course.



Frequently Asked Questions

What is vector search in knowledge graphs?

Vector search in knowledge graphs refers to finding similar items or entities using their numerical embeddings within the graph. Each entity or node is represented as a vector in a high-dimensional space, capturing its semantic meaning. By leveraging similarity measures like cosine similarity or Euclidean distance, vector search enables the retrieval of entities that are most semantically related to a query. This approach combines the structural relationships in knowledge graphs with the power of embeddings for enhanced search and recommendation.



Written By:
Asmat Batool