
From Bellman’s Shortest Path Algorithm to Reinforcement Learning

13 min read
Aug 01, 2023


The objective of the shortest path problem in a directed graph is to determine the minimum weight path between two vertices, where the edge weights correspond to the cost or distance between vertices. To achieve this, we are given a directed graph G=(V,E), where V is the set of vertices and E is the set of edges with associated weights, along with two vertices s (the source) and t (the target). This problem can be tackled by employing different algorithms, such as Dijkstra’s algorithm, the Bellman-Ford algorithm, or the Floyd-Warshall algorithm, based on the specific requirements and properties of the graph.

This blog post focuses on the Bellman-Ford algorithm as a means of solving the shortest path problem in a directed graph. The algorithm’s core is Bellman’s equation, which also serves as the foundation for reinforcement learning, a popular field in data science.

Bellman’s equation is a fundamental concept in dynamic programming, which is a technique for solving optimization problems that involve making decisions over time. It is named after Richard Bellman, who first formulated it in the 1950s. In its simplest form, Bellman’s equation is a recursive equation that expresses the optimal value of a problem in terms of the optimal values of its subproblems. It is written as follows:

V(x) = \max_{a}[f(x, a) + V(g(x, a))]

Here, V(x) is the optimal value of the problem at state x; f(x, a) is the reward or cost of taking action a in state x; g(x, a) is the next state that results from taking action a in state x; and \max denotes the maximum over all possible actions a.

Richard E. Bellman (1920-1984)

Note: Bellman’s equation for minimization problems has “min” in place of “max.”
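Before moving to graphs, here is a minimal sketch of how this recursion can be evaluated directly in Python with memoization. The toy reward and transition functions below are invented purely for illustration, and a finite horizon argument is added as an assumption so the recursion terminates.

import functools

# Hypothetical toy problem: states 0..3, actions 'stay' and 'advance'.
# reward(x, a) is the immediate payoff, step(x, a) is the next state.
def reward(x, a):
    return 1 if a == 'advance' else 0

def step(x, a):
    return min(x + 1, 3) if a == 'advance' else x

@functools.lru_cache(maxsize=None)
def V(x, horizon):
    # Bellman's equation: V(x) = max_a [ f(x, a) + V(g(x, a)) ]
    # The horizon argument bounds the recursion so it terminates.
    if horizon == 0:
        return 0
    return max(reward(x, a) + V(step(x, a), horizon - 1)
               for a in ('stay', 'advance'))

print(V(0, 3))  # best total reward over 3 decisions starting from state 0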

Shortest path#

Consider the directed graph given in the figure below. The problem involves finding the shortest path from all vertices to a fixed destination vertex, denoted as z. We can express the length of the shortest path from a given vertex s to z that consists of no more than k edges as V_k(s). Using the notation E(s_1,s_2) to represent the weight of the edge from vertex s_1 to s_2, we can conclude that V_1(y) = E(y,z) = 2. However, for V_2(y), we need to consider the sum of the weights of the edges from vertex y to an intermediate vertex w and from w to z. Therefore, we have V_2(y) = E(y,w) + E(w,z) = 3 + (-5) = -2. Finally, we have V_1(x) = \infty, meaning there is no path from vertex x to z that consists of one edge at most. Considering that a shortest path can always be found with no more than n-1 edges, where n is the total number of vertices, our goal is to find V_4(s) for all s \in \{x,y,z,w,p\}, given that V_0(s) = \infty for all s \in \{x,y,w,p\} and V_0(z) = 0. We can express Bellman’s equation for the shortest path as follows:

V_k(s_1) = \min_{s_2}[E(s_1,s_2) + V_{k-1}(s_2)]

Directed graph
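As a quick check of this recursion against the figure, the one-edge values are V_1(p) = 4, V_1(w) = -5, and V_1(z) = 0, so applying the equation to vertex y gives:

V_2(y) = \min\big(E(y,p)+V_1(p),\; E(y,w)+V_1(w),\; E(y,z)+V_1(z)\big) = \min(1+4,\; 3+(-5),\; 2+0) = -2

This matches the value of V_2(y) computed above.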

Implementation#

The implementation of the algorithm is straightforward. We first define the nodes and edges of the graph and then directly apply Bellman’s equation.

import numpy as np

# Define the nodes
nodes = list('xypwz')
x, y, p, w, z = 0, 1, 2, 3, 4
# Number of nodes
n = len(nodes)
# Create index list
idxs = list(np.arange(n))
# Create a mapping of indices to node labels
i2s = dict(zip(idxs, nodes))

# Initialize the cost matrix E and the value vector V
V = np.inf * np.ones(n)
V[z] = 0
E = np.inf * np.ones((n, n))
# Set the costs for the edges
E[x, y] = 4
E[x, p] = 2
E[y, p] = 1
E[y, w] = 3
E[y, z] = 2
E[p, y] = 3
E[p, w] = 5
E[p, z] = 4
E[w, z] = -5
E[z, x] = 3

# Bellman's algorithm to find shortest paths
def Bellman(E, V):
    valueChanged = True
    numIters = 0
    while valueChanged:
        numIters += 1
        valueChanged = False
        for i in range(V.size):
            minVi = V[i]
            for j in range(V.size):
                vij = E[i, j] + V[j]
                if vij < minVi:
                    minVi = vij
                    valueChanged = True
            V[i] = minVi
    return V, numIters

# Function to print shortest paths
def print_shortest_paths(E, V, V_star, i2s):
    for i in range(V.size):
        minVi = V[i]
        bestNext = i
        for j in range(V.size):
            vij = E[i, j] + V_star[j]
            if vij < minVi:
                minVi = vij
                bestNext = j
        print(i2s[i] + '-->' + i2s[bestNext])

# Run Bellman's algorithm and print the shortest paths
V_star, numIters = Bellman(E.copy(), V.copy())
print('Iterations:', numIters)
print_shortest_paths(E.copy(), V.copy(), V_star.copy(), i2s)

Code explanation#

The code above implements Bellman’s algorithm to find the shortest paths in a directed graph. The graph is represented by an adjacency matrix E, where E[i, j] represents the cost of the edge from node i to node j. The algorithm initializes the cost vector V with infinite values, except for the destination node z, which is set to 0.

The Bellman function performs iterations of the algorithm until no further improvements can be made to the cost vector V. In each pass, it updates the value of every node by taking the minimum cost of reaching the destination through any of its neighbors.

The print_shortest_paths function takes the original cost matrix E, the initial cost vector V, the final cost vector V_star, and a mapping of indices to node labels i2s. For each node, it prints the best next node on the shortest path to the destination, based on the final cost vector V_star.

Finally, the code runs Bellman’s algorithm and prints the number of iterations performed and the shortest paths for each node.

To achieve efficiency, the code can be vectorized. However, the benefits of vectorization extend beyond just speed. In particular, we redefine V_k(s) as the length of the shortest path that consists of exactly k edges. To ensure consistency, we allow self-loops on each node with a cost of 0 in case the shortest path has fewer than k edges.

Self-loop on each node/state

We consider the nodes as states in an environment where an agent takes actions of the form a_s, representing a jump to state s. The cost R(\hat{s},a_s) of an action a_s in state \hat{s} is defined as the cost of the edge from \hat{s} to s, that is, R(\hat{s},a_s) = E(\hat{s},s). Bellman’s equation can now be rewritten as follows:

V_k(\hat{s}) = \min_{a_s}[R(\hat{s},a_s) + V_{k-1}(s)]

We can interpret V_k(s) as the minimum cost of taking a sequence of k actions starting from state s. If an action does not lead to a state, its cost is considered infinite; for example, R(x,a_w) = \infty in our graph because there is no edge from x to w. Furthermore, V_k(s) is computed by estimating the costs over all actions in state s and selecting the action with the minimum cost. If we denote by Q_k(\hat{s},a_s) the minimum cost of a sequence of k actions whose first action in state \hat{s} is a_s, then Q_k(\hat{s},a_s) = R(\hat{s},a_s) + V_{k-1}(s). It is important to note that V_k(s) satisfies the following equation:

V_k(s) = \min_{a}[Q_k(s,a)]

The algorithm can now be decomposed to calculate Q_k(s,a) values for every state-action pair (s,a).

Additionally, we introduce a transition matrix denoted as T^a for an action a. T^a is a 5\times 5 matrix consisting of zeros and ones. Each entry T^a(\hat{s},s) equals 1 if action a from state \hat{s} leads to state s via an edge. In our example graph, when ordering the states as x, y, p, w, and z, the transition matrix for action a_y is as follows:

T^{a_y}=\begin{bmatrix}0&1&0&0&0\\0&1&0&0&0\\0&1&0&0&0\\0&0&0&0&0\\0&0&0&0&0\end{bmatrix}

This matrix represents the connectivity between states when taking action a_y. Each entry T^{a_y}(\hat{s}, s) indicates whether action a_y from state \hat{s} leads to state s via an edge.
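As a sanity check, here is a small sketch (using the node ordering above) that builds T^{a_y} from the edges pointing into y, including the zero-cost self-loop introduced earlier; printing T_ay reproduces the matrix shown.

import numpy as np

# Node ordering x, y, p, w, z as in the text
x, y, p, w, z = 0, 1, 2, 3, 4
n = 5

# Action a_y leads to y from any state that has an edge into y,
# plus the zero-cost self-loop on y itself.
T_ay = np.zeros((n, n))
for src in (x, p, y):      # edges x->y, p->y, and the self-loop y->y
    T_ay[src, y] = 1

print(T_ay)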

We are set for vectorization. In particular, if we denote R^a=\begin{bmatrix}R(x,a)\\R(y,a)\\R(p,a)\\R(w,a)\\R(z,a)\end{bmatrix}, \quad V_k=\begin{bmatrix}V_k(x)\\V_k(y)\\V_k(p)\\V_k(w)\\V_k(z)\end{bmatrix}, \quad Q_k^a=\begin{bmatrix}Q_k(x,a)\\Q_k(y,a)\\Q_k(p,a)\\Q_k(w,a)\\Q_k(z,a)\end{bmatrix}, then we get the following equation:

Q^a_k = R^a + T^a V_{k-1}

Furthermore, if Q_k=\begin{bmatrix}Q_k^{a_x}&Q_k^{a_y}&Q_k^{a_p}&Q_k^{a_w}&Q_k^{a_z}\end{bmatrix}, then the vector V_k can be found by minimizing the matrix Q_k along the horizontal axis.
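Before the fully vectorized version, here is a minimal sketch of one update written with an explicit loop over actions; the arrays R (states x actions) and T (states x states x actions) are assumed to be filled as in the code below, and the np.einsum call in the next section simply fuses this loop into a single operation.

import numpy as np

def bellman_update(R, T, V):
    # One application of Q_k^a = R^a + T^a V_{k-1}, followed by a
    # minimization over actions to obtain V_k and the best action per state.
    n_states, n_actions = R.shape
    Q = np.empty((n_states, n_actions))
    for a in range(n_actions):
        Q[:, a] = R[:, a] + T[:, :, a] @ V   # column a of Q_k
    return Q.min(axis=1), Q.argmin(axis=1)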

Vectorized implementation#

Here is a vectorized implementation that uses np.einsum to avoid an explicit loop over the actions.

import numpy as np

# Define the nodes and their corresponding indices
nodes = list('xypwz')
x, y, p, w, z = 0, 1, 2, 3, 4
# Define the action indices
ax, ay, ap, aw, az = 0, 1, 2, 3, 4
# Define the actions
actions = ['jump to x', 'jump to y', 'jump to p', 'jump to w', 'jump to z']
# Number of nodes
n = len(nodes)
# Create lists and dictionaries for node and action indices
idxs = list(np.arange(n))
i2s = dict(zip(idxs, nodes))
i2a = dict(zip(idxs, actions))

# Initialize the value function V and set the value of the goal state to 0
V = np.inf * np.ones(n)
V[z] = 0

# Initialize the cost/reward matrix R with infinity values
R = np.inf * np.ones((n, n))
# Define the costs/rewards for each state-action pair
R[x, ax] = 0
R[y, ay] = 0
R[p, ap] = 0
R[w, aw] = 0
R[z, az] = 0
R[x, ay] = 4
R[x, ap] = 2
R[y, ap] = 1
R[y, aw] = 3
R[y, az] = 2
R[p, ay] = 3
R[p, aw] = 5
R[p, az] = 4
R[w, az] = -5
R[z, ax] = 3

# Initialize the transition tensor T with zeros
T = np.zeros((n, n, len(actions)))
# Define the transitions for each state-action pair
T[x, x, ax] = 1
T[z, x, ax] = 1
T[y, y, ay] = 1
T[x, y, ay] = 1
T[p, y, ay] = 1
T[p, p, ap] = 1
T[x, p, ap] = 1
T[y, p, ap] = 1
T[w, w, aw] = 1
T[y, w, aw] = 1
T[p, w, aw] = 1
T[z, z, az] = 1
T[y, z, az] = 1
T[p, z, az] = 1
T[w, z, az] = 1

# Define the Bellman algorithm for finding the optimal value function
def Bellman_fast(T, R, V):
    # Set infinite values in V and R to the maximum representable float value
    V[V == np.inf] = np.finfo(np.float64).max
    R[R == np.inf] = np.finfo(np.float64).max
    # Number of iterations
    num_iterations = len(V) - 1
    # Initialize the policy (best action for each state)
    P = np.arange(len(V))
    for i in range(num_iterations):
        # Calculate the Q-values
        Q = R + np.einsum('ijk,j->ik', T, V)
        # Update the value function
        Vi = np.min(Q, axis=1).copy()
        Pi = np.argmin(Q, axis=1).copy()
        idx = V - Vi > 0
        P[idx] = Pi[idx]
        V = Vi.copy()
    return V, P

# Define a function to print the optimal path
def print_path(s, z, P, i2a):
    while P[s] != z:
        print('From ' + nodes[s] + ' ' + i2a[P[s]])
        s = P[s]
    print('From ' + nodes[s] + ' ' + i2a[P[s]])

# Find the optimal value function and policy using the Bellman algorithm
V_star, P = Bellman_fast(T.copy(), R.copy(), V.copy())
# Print the optimal path from state x to z
print_path(x, z, P, i2a)

Up to this point, our assumption has been that taking action a_y from state x will always result in transitioning to state y, with probability P(y|x,a_y) = 1.

Note: P(y|x,a_y) represents the probability of reaching state y from state x when taking action a_y.

However, in many real-world environments, there is uncertainty involved. What if there is a non-zero probability that action a_y in state x leads to a state other than y?

Consider the scenario of playing with a remote control toy car. Pressing the button to move the car right may not always guarantee that the car will move to the right. Factors such as a slippery floor or an unexpected earthquake can introduce uncertainty during the action. In reality, most environments are inherently uncertain, and it becomes necessary to incorporate this uncertainty into our algorithm. The Bellman equation can now be reformulated as follows:

Q_k(\hat{s},a) = R(\hat{s},a) + \sum_{s}P(s|\hat{s},a)\,V_{k-1}(s)

Fortunately, our vectorized algorithm can handle this uncertainty by incorporating transition probabilities into the corresponding transition matrices. We can assign the transition probability P(s|\hat{s},a) to the entry T^a(\hat{s},s).

By making this adjustment, no other changes are necessary in the algorithm except for the number of iterations. However, it is important to note that when uncertainty is present, the number of iterations required to reach the optimal solution is no longer bounded by n-1 and can, in the limit, become infinite.
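For example, a slippery-floor version of our graph could be encoded as follows. The 0.9/0.1 split below is an invented probability chosen purely for illustration, not something derived from the example graph.

import numpy as np

x, y, p, w, z = 0, 1, 2, 3, 4
n = 5
n_actions = 5
ay = 1

T = np.zeros((n, n, n_actions))

# Action a_y from state x now succeeds only 90% of the time;
# the remaining 10% of the time the agent slips and stays in x.
# (These probabilities are hypothetical, chosen only to illustrate the idea.)
T[x, y, ay] = 0.9
T[x, x, ay] = 0.1

# Each row of T[:, :, a] must sum to 1 (or 0 if the action is unavailable),
# since the entries are now probabilities P(s | s_hat, a).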

Markov decision process#

A Markov decision process (MDP) is a mathematical framework used to model decision-making problems in situations where outcomes are influenced by both stochastic (random) elements and the actions taken by an agent. It is named after the mathematician Andrey Markov.

The key components of an MDP are listed below:

  • State: A set of possible states that the environment can be in.
  • Action: A set of possible actions that the agent can take.
  • Transition probability: The probability of transitioning from one state to another state when an action is taken.
  • Reward: The immediate numerical value that the agent receives as feedback based on the chosen action and resulting state.
  • Policy: A strategy or rule that determines the agent’s action selection based on the current state.
  • Value function: A function that assigns a value to each state or state-action pair, representing the expected long-term return or desirability of being in that state or taking that action.

The goal in an MDP is to find an optimal policy that maximizes the expected cumulative reward over time. Various algorithms, such as dynamic programming, reinforcement learning, and Monte Carlo methods, can be used to solve MDPs and find the optimal policy.

Note: We have indeed written vectorized code that solves an MDP! Just replace “minimization” with “maximization,” “path” with “policy” and “cost” with “reward.”
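For reference, a textbook value-iteration step for a reward-maximizing MDP looks almost identical to the update in Bellman_fast. This is only a sketch: the discount factor gamma and the convergence tolerance below are extra ingredients (not used in the shortest path example) that keep the values finite and the loop terminating over long horizons.

import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-8):
    # T: (n_states, n_states, n_actions) transition probabilities P(s' | s, a)
    # R: (n_states, n_actions) expected immediate rewards
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) * V(s')
        Q = R + gamma * np.einsum('ijk,j->ik', T, V)
        V_new = Q.max(axis=1)                    # maximize over actions
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)       # optimal values and greedy policy
        V = V_new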

So far we’ve been using Bellman’s equation to solve a Markov decision process (MDP). However, when the transition probabilities of an MDP are unknown, the approach to solving such problems is commonly referred to as reinforcement learning. In reinforcement learning, the objective is to approximate the MDP using samples collected through the agent’s actions. From these samples, an explicit transition model can be constructed, or the algorithm can proceed without explicitly estimating the transition model.

Furthermore, an MDP assumes that the transition model of the environment is fixed, often referred to as a stationary process. Reinforcement learning relaxes this assumption as well, allowing for more complex and dynamic environments. This relaxation introduces additional challenges in learning an optimal policy, as the transition model may change over time. Moreover, continuous action and state spaces add yet another layer of complexity to the problem. Now that we’ve discussed the Bellman-Ford algorithm in this blog, you can learn more about creating AI solutions in the Grokking AI for Engineering & Product Managers or Mastering Machine Learning Theory and Practice course.

Reinforcement learning also involves addressing the exploration-exploitation trade-off, where the agent needs to balance exploiting its current knowledge to maximize rewards against exploring new actions to gather more information about the environment. Various algorithms and techniques, such as Q-learning and policy gradients, have been developed to tackle the complexity of reinforcement learning problems and learn optimal policies in uncertain and changing environments. When value functions and/or policy functions are represented using deep neural networks, it is referred to as deep reinforcement learning.
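To give a flavor of how learning works without a known transition model, here is a minimal sketch of the standard tabular Q-learning update. The env object with reset() and step(state, action) methods is a hypothetical placeholder (not something defined earlier in this post), and alpha, gamma, and epsilon are the usual learning-rate, discount, and exploration parameters.

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    # env is assumed to provide reset() -> state and
    # step(state, action) -> (next_state, reward, done); this interface
    # is a placeholder used only for illustration.
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy exploration
            a = np.random.randint(n_actions) if np.random.rand() < epsilon else Q[s].argmax()
            s_next, r, done = env.step(s, a)
            # Standard Q-learning update: move Q(s, a) toward the sampled target
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next
    return Q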

