
From Bellman’s Shortest Path Algorithm to Reinforcement Learning

13 min read
Aug 01, 2023


The objective of the shortest path problem in a directed graph is to determine the minimum weight path between two vertices, where the edge weights correspond to the cost or distance between vertices. To achieve this, we are given a directed graph G=(V,E), where V is the set of vertices and E is the set of edges with associated weights, along with two vertices s (the source) and t (the target). This problem can be tackled by employing different algorithms, such as Dijkstra’s algorithm, the Bellman-Ford algorithm, or the Floyd-Warshall algorithm, based on the specific requirements and properties of the graph.

This blog post focuses on the Bellman-Ford algorithm as a means of solving the shortest path problem in a directed graph. The algorithm’s core is Bellman’s equation, which also serves as the foundation for reinforcement learning, a popular field in data science.

Bellman’s equation is a fundamental concept in dynamic programming, which is a technique for solving optimization problems that involve making decisions over time. It is named after Richard Bellman, who first formulated it in the 1950s. In its simplest form, Bellman’s equation is a recursive equation that expresses the optimal value of a problem in terms of the optimal values of its subproblems. It is written as follows:

V(x) = \max_{a}[f(x, a) + V(g(x, a))]

Here, V(x) is the optimal value of the problem at state x; f(x, a) is the reward or cost of taking action a in state x; g(x, a) is the next state that results from taking action a in state x; and \max denotes the maximum over all possible actions a.

Richard E. Bellman (1920-1984)

Note: Bellman’s equation for minimization problems has “min” in place of “max.”
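Before moving to graphs, here is a minimal sketch of how this recursion can be evaluated directly in Python with memoization. The toy reward and transition functions below are invented purely for illustration, and a finite horizon argument is added as an assumption so the recursion terminates.

import functools

# Hypothetical toy problem: states 0..3, actions 'stay' and 'advance'.
# reward(x, a) is the immediate payoff, step(x, a) is the next state.
def reward(x, a):
    return 1 if a == 'advance' else 0

def step(x, a):
    return min(x + 1, 3) if a == 'advance' else x

@functools.lru_cache(maxsize=None)
def V(x, horizon):
    # Bellman's equation: V(x) = max_a [ f(x, a) + V(g(x, a)) ]
    # The horizon argument bounds the recursion so it terminates.
    if horizon == 0:
        return 0
    return max(reward(x, a) + V(step(x, a), horizon - 1)
               for a in ('stay', 'advance'))

print(V(0, 3))  # best total reward over 3 decisions starting from state 0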

Shortest path#

Consider the directed graph given in the figure below. The problem involves finding the shortest path from all vertices to a fixed destination vertex, denoted as z. We can express the length of the shortest path from a given vertex s to z that consists of no more than k edges as V_k(s). Using the notation E(s_1,s_2) to represent the weight of the edge from vertex s_1 to s_2, we can conclude that V_1(y) = E(y,z) = 2. However, for V_2(y), we need to consider the sum of the weights of the edges from vertex y to an intermediate vertex w and from w to z. Therefore, we have V_2(y) = E(y,w) + E(w,z) = 3 + (-5) = -2. Finally, we have V_1(x) = \infty, meaning there is no path from vertex x to z that consists of one edge at most. Considering that a shortest path can always be found with no more than n-1 edges, where n is the total number of vertices, our goal is to find V_4(s) for all s \in \{x,y,z,w,p\}, given that V_0(s) = \infty for all s \in \{x,y,w,p\} and V_0(z) = 0. We can express Bellman’s equation for the shortest path as follows:

V_k(s_1) = \min_{s_2}[E(s_1,s_2) + V_{k-1}(s_2)]

Directed graph
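As a quick check of this recursion against the figure, the one-edge values are V_1(p) = 4, V_1(w) = -5, and V_1(z) = 0, so applying the equation to vertex y gives:

V_2(y) = \min\big(E(y,p)+V_1(p),\; E(y,w)+V_1(w),\; E(y,z)+V_1(z)\big) = \min(1+4,\; 3+(-5),\; 2+0) = -2

This matches the value of V_2(y) computed above.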

Implementation#

The implementation of the algorithm is straightforward. We first define the nodes and edges of the graph and then directly apply Bellman’s equation.

import numpy as np

# Define the nodes
nodes = list('xypwz')
x, y, p, w, z = 0, 1, 2, 3, 4
# Number of nodes
n = len(nodes)
# Create index list
idxs = list(np.arange(n))
# Create a mapping of indices to node labels
i2s = dict(zip(idxs, nodes))

# Initialize the cost matrix E and the value vector V
V = np.inf * np.ones(n)
V[z] = 0
E = np.inf * np.ones((n, n))
# Set the costs for the edges
E[x, y] = 4
E[x, p] = 2
E[y, p] = 1
E[y, w] = 3
E[y, z] = 2
E[p, y] = 3
E[p, w] = 5
E[p, z] = 4
E[w, z] = -5
E[z, x] = 3

# Bellman's algorithm to find shortest paths
def Bellman(E, V):
    valueChanged = True
    numIters = 0
    while valueChanged:
        numIters += 1
        valueChanged = False
        for i in range(V.size):
            minVi = V[i]
            for j in range(V.size):
                vij = E[i, j] + V[j]
                if vij < minVi:
                    minVi = vij
                    valueChanged = True
            V[i] = minVi
    return V, numIters

# Function to print shortest paths
def print_shortest_paths(E, V, V_star, i2s):
    for i in range(V.size):
        minVi = V[i]
        bestNext = i
        for j in range(V.size):
            vij = E[i, j] + V_star[j]
            if vij < minVi:
                minVi = vij
                bestNext = j
        print(i2s[i] + '-->' + i2s[bestNext])

# Run Bellman's algorithm and print the shortest paths
V_star, numIters = Bellman(E.copy(), V.copy())
print('Iterations:', numIters)
print_shortest_paths(E.copy(), V.copy(), V_star.copy(), i2s)

Code explanation#

The code above implements Bellman’s algorithm to find the shortest paths in a directed graph. The graph is represented by an adjacency matrix E, where E[i, j] represents the cost of the edge from node i to node j. The algorithm initializes the cost vector V with infinite values, except for the destination node z, which is set to 0.

The Bellman function performs iterations of the algorithm until no further improvements can be made to the cost vector V. In each pass, it updates the value of every node by taking the minimum cost of reaching the destination through any of its neighbors.

The print_shortest_paths function takes the original cost matrix E, the initial cost vector V, the final cost vector V_star, and a mapping of indices to node labels i2s. For each node, it prints the best next node on the shortest path to the destination, based on the final cost vector V_star.

Finally, the code runs Bellman’s algorithm and prints the number of iterations performed and the shortest paths for each node.

To achieve efficiency, the code can be vectorized. However, the benefits of vectorization extend beyond just speed. In particular, we redefine V_k(s) as the length of the shortest path that consists of exactly k edges. To ensure consistency, we allow self-loops on each node with a cost of 0 in case the shortest path has fewer than k edges.

Self-loop on each node/state

We consider the nodes as states in an environment where an agent takes actions of the form a_s, representing a jump to state s. The cost R(\hat{s},a_s) of an action a_s in state \hat{s} is defined as the cost of the edge from \hat{s} to s, that is, R(\hat{s},a_s) = E(\hat{s},s). Bellman’s equation can now be rewritten as follows:

V_k(\hat{s}) = \min_{a_s}[R(\hat{s},a_s) + V_{k-1}(s)]

We can interpret V_k(s) as the minimum cost of taking a sequence of k actions starting from state s. If an action does not lead to a state, its cost is considered infinite; for example, R(x,a_w) = \infty in our graph because there is no edge from x to w. Furthermore, V_k(s) is computed by estimating the costs over all actions in state s and selecting the action with the minimum cost. If we denote by Q_k(\hat{s},a_s) the minimum cost of a sequence of k actions whose first action in state \hat{s} is a_s, then Q_k(\hat{s},a_s) = R(\hat{s},a_s) + V_{k-1}(s). It is important to note that V_k(s) satisfies the following equation:

V_k(s) = \min_{a}[Q_k(s,a)]

The algorithm can now be decomposed to calculate Q_k(s,a) values for every state-action pair (s,a).

Additionally, we introduce a transition matrix denoted as T^a for an action a. T^a is a 5\times 5 matrix consisting of zeros and ones. Each entry T^a(\hat{s},s) equals 1 if action a from state \hat{s} leads to state s via an edge. In our example graph, when ordering the states as x, y, p, w, and z, the transition matrix for action a_y is as follows:

T^{a_y}=\begin{bmatrix}0&1&0&0&0\\0&1&0&0&0\\0&1&0&0&0\\0&0&0&0&0\\0&0&0&0&0\end{bmatrix}

This matrix represents the connectivity between states when taking action a_y. Each entry T^{a_y}(\hat{s}, s) indicates whether action a_y from state \hat{s} leads to state s via an edge.
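As a sanity check, here is a small sketch (using the node ordering above) that builds T^{a_y} from the edges pointing into y, including the zero-cost self-loop introduced earlier; printing T_ay reproduces the matrix shown.

import numpy as np

# Node ordering x, y, p, w, z as in the text
x, y, p, w, z = 0, 1, 2, 3, 4
n = 5

# Action a_y leads to y from any state that has an edge into y,
# plus the zero-cost self-loop on y itself.
T_ay = np.zeros((n, n))
for src in (x, p, y):      # edges x->y, p->y, and the self-loop y->y
    T_ay[src, y] = 1

print(T_ay)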

We are set for vectorization. In particular, if we denote R^a=\begin{bmatrix}R(x,a)\\R(y,a)\\R(p,a)\\R(w,a)\\R(z,a)\end{bmatrix}, \quad V_k=\begin{bmatrix}V_k(x)\\V_k(y)\\V_k(p)\\V_k(w)\\V_k(z)\end{bmatrix}, \quad Q_k^a=\begin{bmatrix}Q_k(x,a)\\Q_k(y,a)\\Q_k(p,a)\\Q_k(w,a)\\Q_k(z,a)\end{bmatrix}, then we get the following equation:

Q^a_k = R^a + T^a V_{k-1}

Furthermore, if Q_k=\begin{bmatrix}Q_k^{a_x}&Q_k^{a_y}&Q_k^{a_p}&Q_k^{a_w}&Q_k^{a_z}\end{bmatrix}, then the vector V_k can be found by minimizing the matrix Q_k along the horizontal axis.
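Before the fully vectorized version, here is a minimal sketch of one update written with an explicit loop over actions; the arrays R (states x actions) and T (states x states x actions) are assumed to be filled as in the code below, and the np.einsum call in the next section simply fuses this loop into a single operation.

import numpy as np

def bellman_update(R, T, V):
    # One application of Q_k^a = R^a + T^a V_{k-1}, followed by a
    # minimization over actions to obtain V_k and the best action per state.
    n_states, n_actions = R.shape
    Q = np.empty((n_states, n_actions))
    for a in range(n_actions):
        Q[:, a] = R[:, a] + T[:, :, a] @ V   # column a of Q_k
    return Q.min(axis=1), Q.argmin(axis=1)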

Vectorized implementation#

Here is a vectorized implementation that uses np.einsum to avoid an explicit loop over the actions.

import numpy as np

# Define the nodes and their corresponding indices
nodes = list('xypwz')
x, y, p, w, z = 0, 1, 2, 3, 4
# Define the action indices
ax, ay, ap, aw, az = 0, 1, 2, 3, 4
# Define the actions
actions = ['jump to x', 'jump to y', 'jump to p', 'jump to w', 'jump to z']
# Number of nodes
n = len(nodes)
# Create lists and dictionaries for node and action indices
idxs = list(np.arange(n))
i2s = dict(zip(idxs, nodes))
i2a = dict(zip(idxs, actions))

# Initialize the value function V and set the value of the goal state to 0
V = np.inf * np.ones(n)
V[z] = 0

# Initialize the cost/reward matrix R with infinity values
R = np.inf * np.ones((n, n))
# Define the costs/rewards for each state-action pair
R[x, ax] = 0
R[y, ay] = 0
R[p, ap] = 0
R[w, aw] = 0
R[z, az] = 0
R[x, ay] = 4
R[x, ap] = 2
R[y, ap] = 1
R[y, aw] = 3
R[y, az] = 2
R[p, ay] = 3
R[p, aw] = 5
R[p, az] = 4
R[w, az] = -5
R[z, ax] = 3

# Initialize the transition tensor T with zeros
T = np.zeros((n, n, len(actions)))
# Define the transitions for each state-action pair
T[x, x, ax] = 1
T[z, x, ax] = 1
T[y, y, ay] = 1
T[x, y, ay] = 1
T[p, y, ay] = 1
T[p, p, ap] = 1
T[x, p, ap] = 1
T[y, p, ap] = 1
T[w, w, aw] = 1
T[y, w, aw] = 1
T[p, w, aw] = 1
T[z, z, az] = 1
T[y, z, az] = 1
T[p, z, az] = 1
T[w, z, az] = 1

# Define the Bellman algorithm for finding the optimal value function
def Bellman_fast(T, R, V):
    # Set infinite values in V and R to the maximum representable float value
    V[V == np.inf] = np.finfo(np.float64).max
    R[R == np.inf] = np.finfo(np.float64).max
    # Number of iterations
    num_iterations = len(V) - 1
    # Initialize the policy (best action for each state)
    P = np.arange(len(V))
    for i in range(num_iterations):
        # Calculate the Q-values
        Q = R + np.einsum('ijk,j->ik', T, V)
        # Update the value function
        Vi = np.min(Q, axis=1).copy()
        Pi = np.argmin(Q, axis=1).copy()
        idx = V - Vi > 0
        P[idx] = Pi[idx]
        V = Vi.copy()
    return V, P

# Define a function to print the optimal path
def print_path(s, z, P, i2a):
    while P[s] != z:
        print('From ' + nodes[s] + ' ' + i2a[P[s]])
        s = P[s]
    print('From ' + nodes[s] + ' ' + i2a[P[s]])

# Find the optimal value function and policy using the Bellman algorithm
V_star, P = Bellman_fast(T.copy(), R.copy(), V.copy())
# Print the optimal path from state x to z
print_path(x, z, P, i2a)

Up to this point, our assumption has been that taking action a_y from state x will always result in transitioning to state y, with probability P(y|x,a_y) = 1.

Note: P(y|x,a_y) represents the probability of reaching state y from state x when taking action a_y.

However, in many real-world environments, there is uncertainty involved. What if there is a non-zero probability that action a_y in state x leads to a state other than y?

Consider the scenario of playing with a remote control toy car. Pressing the button to move the car right may not always guarantee that the car will move to the right. Factors such as a slippery floor or an unexpected earthquake can introduce uncertainty during the action. In reality, most environments are inherently uncertain, and it becomes necessary to incorporate this uncertainty into our algorithm. The Bellman equation can now be reformulated as follows:

Q_k(\hat{s},a) = R(\hat{s},a) + \sum_{s}P(s|\hat{s},a)\,V_{k-1}(s)

Fortunately, our vectorized algorithm can handle this uncertainty by incorporating transition probabilities into the corresponding transition matrices. We can assign the transition probability P(s|\hat{s},a) to the entry T^a(\hat{s},s).

By making this adjustment, no other changes are necessary in the algorithm except for the number of iterations. However, it is important to note that when uncertainty is present, the number of iterations required to reach the optimal solution is no longer bounded by n-1 and can, in the limit, become infinite.
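For example, a slippery-floor version of our graph could be encoded as follows. The 0.9/0.1 split below is an invented probability chosen purely for illustration, not something derived from the example graph.

import numpy as np

x, y, p, w, z = 0, 1, 2, 3, 4
n = 5
n_actions = 5
ay = 1

T = np.zeros((n, n, n_actions))

# Action a_y from state x now succeeds only 90% of the time;
# the remaining 10% of the time the agent slips and stays in x.
# (These probabilities are hypothetical, chosen only to illustrate the idea.)
T[x, y, ay] = 0.9
T[x, x, ay] = 0.1

# Each row of T[:, :, a] must sum to 1 (or 0 if the action is unavailable),
# since the entries are now probabilities P(s | s_hat, a).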

Markov decision process#

A Markov decision process (MDP) is a mathematical framework used to model decision-making problems in situations where outcomes are influenced by both stochastic (random) elements and the actions taken by an agent. It is named after the mathematician Andrey Markov.

The key components of an MDP are listed below:

  • State: A set of possible states that the environment can be in.
  • Action: A set of possible actions that the agent can take.
  • Transition probability: The probability of transitioning from one state to another state when an action is taken.
  • Reward: The immediate numerical value that the agent receives as feedback based on the chosen action and resulting state.
  • Policy: A strategy or rule that determines the agent’s action selection based on the current state.
  • Value function: A function that assigns a value to each state or state-action pair, representing the expected long-term return or desirability of being in that state or taking that action.

The goal in an MDP is to find an optimal policy that maximizes the expected cumulative reward over time. Various algorithms, such as dynamic programming, reinforcement learning, and Monte Carlo methods, can be used to solve MDPs and find the optimal policy.

Note: We have indeed written vectorized code that solves an MDP! Just replace “minimization” with “maximization,” “path” with “policy” and “cost” with “reward.”
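For reference, a textbook value-iteration step for a reward-maximizing MDP looks almost identical to the update in Bellman_fast. This is only a sketch: the discount factor gamma and the convergence tolerance below are extra ingredients (not used in the shortest path example) that keep the values finite and the loop terminating over long horizons.

import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-8):
    # T: (n_states, n_states, n_actions) transition probabilities P(s' | s, a)
    # R: (n_states, n_actions) expected immediate rewards
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) * V(s')
        Q = R + gamma * np.einsum('ijk,j->ik', T, V)
        V_new = Q.max(axis=1)                    # maximize over actions
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)       # optimal values and greedy policy
        V = V_new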

So far we’ve been using Bellman’s equation to solve a Markov decision process (MDP). However, when the transition probabilities of an MDP are unknown, the approach to solving such problems is commonly referred to as reinforcement learning. In reinforcement learning, the objective is to approximate the MDP using samples collected through the agent’s actions. From these samples, an explicit transition model can be constructed, or the algorithm can proceed without explicitly estimating the transition model.

Furthermore, an MDP assumes that the transition model of the environment is fixed, often referred to as a stationary process. Reinforcement learning relaxes this assumption as well, allowing for more complex and dynamic environments. This relaxation introduces additional challenges in learning an optimal policy, as the transition model may change over time. Moreover, continuous action and state spaces add yet another layer of complexity to the problem. Now that we’ve discussed the Bellman-Ford algorithm in this blog, you can learn more about creating AI solutions in the Grokking AI for Engineering & Product Managers or Mastering Machine Learning Theory and Practice course.

Reinforcement learning also involves addressing the exploration-exploitation trade-off, where the agent needs to balance exploiting its current knowledge to maximize rewards against exploring new actions to gather more information about the environment. Various algorithms and techniques, such as Q-learning and policy gradients, have been developed to tackle the complexity of reinforcement learning problems and learn optimal policies in uncertain and changing environments. When value functions and/or policy functions are represented using deep neural networks, it is referred to as deep reinforcement learning.
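To give a flavor of how learning works without a known transition model, here is a minimal sketch of the standard tabular Q-learning update. The env object with reset() and step(state, action) methods is a hypothetical placeholder (not something defined earlier in this post), and alpha, gamma, and epsilon are the usual learning-rate, discount, and exploration parameters.

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    # env is assumed to provide reset() -> state and
    # step(state, action) -> (next_state, reward, done); this interface
    # is a placeholder used only for illustration.
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy exploration
            a = np.random.randint(n_actions) if np.random.rand() < epsilon else Q[s].argmax()
            s_next, r, done = env.step(s, a)
            # Standard Q-learning update: move Q(s, a) toward the sampled target
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next
    return Q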

