The Q-learning algorithm is model-free, meaning it requires no prior knowledge of how the environment works. It is also off-policy: the agent can follow an exploratory behavior policy while still learning the value of the optimal policy. The value function in Q-learning is represented as Q(s, a), where s is the current state and a is the action taken in that state.
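When both the state space and the action space are discrete, Q(s, a) is typically stored as a lookup table (a Q-table) with one row per state and one column per action. Below is a minimal sketch in Python with NumPy; the sizes of 25 states and 4 actions are illustrative assumptions, not values from the text.

```python
import numpy as np

# Illustrative sizes (assumed): 25 states, 4 actions (e.g. up, down, left, right)
n_states, n_actions = 25, 4

# Q(s, a) stored as a 2-D table; every entry starts at zero before learning
Q = np.zeros((n_states, n_actions))

state = 7                        # a hypothetical current state
print(Q[state, 2])               # value of taking action 2 in state 7 -> 0.0
print(int(np.argmax(Q[state])))  # index of the currently best-valued action
```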
Key terminology in Q-learning
Understanding the terms and parameters used in the Q-learning algorithm is essential before diving into the algorithm itself. Let's look at each of them:
Q-values or action-values: This represents the expected reward an agent can obtain by taking a specific action in a given state and then following the optimal policy afterwards.
Episode: An episode refers to a sequence of actions taken by the agent in the environment until it reaches a terminal state.
Starting state: This is the state from which the agent begins an episode.
Step: This is a single action taken by the agent in the environment.
Epsilon-greedy policy: This is the rule the agent uses to decide whether to explore new actions or exploit actions that have worked well in the past. With a probability of epsilon, the agent selects a random action regardless of the Q-values; this allows it to explore different actions and potentially discover better choices that may have been overlooked. With a probability of (1 - epsilon), the agent chooses the action with the highest Q-value, i.e., the action believed to have the maximum potential for reward based on its current knowledge. By balancing exploration and exploitation in this way, the agent can learn and adapt its behavior to achieve optimal long-term rewards in a reinforcement learning setting. (A minimal code sketch of this selection rule appears after this list.)
Chosen action: This is the action selected by the agent based on the epsilon-greedy policy.
Q-learning update rule: This mathematical formula updates the Q-value of a particular state-action pair. The update is based on the reward received and the maximum Q-value over all actions in the next state.
New state: This is the state the agent transitions to after taking an action in the current state.
Goal state: This is a terminal state in the environment where the agent receives the highest reward.
Alpha (α): This is a learning rate parameter that controls the degree of weight given to newly acquired information when updating the Q-values.
Gamma (γ): This is a discount factor parameter that controls the degree of weight given to future rewards when calculating the expected cumulative reward.
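To make the epsilon-greedy policy described above concrete, here is a minimal sketch of the selection rule in Python. The function name and the Q-table layout (rows = states, columns = actions) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy_action(Q, state, epsilon):
    """Choose an action for `state` from the Q-table `Q` using the epsilon-greedy rule."""
    if rng.random() < epsilon:                 # explore: random action with probability epsilon
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))            # exploit: best-known action with probability 1 - epsilon
```

With epsilon = 0.1, roughly one action in ten is chosen at random; gradually decreasing epsilon over training is a common variation that shifts the agent from exploration toward exploitation.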
Algorithm pseudocode
The pseudocode for the Q-learning algorithm is given below:
Initialize:
Set all state-action pairs' Q-values to zero.
Repeat for each episode:
Set the initial state.
Repeat for each step:
Select an action for the current state using the epsilon-greedy policy.
Take the chosen action and observe the reward and the new state.
Update the Q-value for the current state-action pair using the Q-learning update rule:
Q(s, a) ← Q(s, a) + α ⋅ (r + γ ⋅ max_a′ Q(s′, a′) − Q(s, a))
where:
- Q(s, a) is the Q-value for state s and action a,
- r is the reward received after taking action a,
- s′ is the next state,
- max_a′ Q(s′, a′) is the maximum Q-value over all possible actions a′ in the next state s′,
- α is the learning rate,
- γ is the discount factor.
Update the current state to the new state.
If the new state is the goal state (a terminal state), terminate the episode and start the next one.
End episode.
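The pseudocode above maps almost line for line onto a short training loop. The sketch below assumes the Gymnasium toolkit and its discrete FrozenLake-v1 environment; the hyperparameter values (alpha, gamma, epsilon, number of episodes) are illustrative choices, not prescribed by the text.

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=False)

alpha, gamma, epsilon = 0.1, 0.99, 0.1   # illustrative hyperparameters
n_episodes = 5000
rng = np.random.default_rng()

# Initialize: set all state-action pairs' Q-values to zero
Q = np.zeros((env.observation_space.n, env.action_space.n))

for episode in range(n_episodes):
    state, _ = env.reset()                # starting state
    done = False
    while not done:                       # repeat for each step
        # Select an action with the epsilon-greedy policy
        if rng.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))

        # Take the chosen action and observe the reward and the new state
        next_state, reward, terminated, truncated, _ = env.step(action)

        # Q-learning update rule
        Q[state, action] += alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )

        state = next_state                # update the current state to the new state
        done = terminated or truncated    # episode ends at a terminal state
```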
How does Q-learning work?
We will illustrate Q-learning using Tom and Jerry as an example, where Tom's goal is to catch Jerry while avoiding obstacles (dogs). The best strategy for Tom is to reach Jerry by the shortest possible path while steering clear of all the dogs.
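One simple way to set this scene up in code is as a small grid of per-cell rewards. The sketch below is only an assumption about the layout: the 5x5 size, the positions of Jerry and the dogs, and the reward values are all illustrative.

```python
import numpy as np

GRID_SIZE = 5                                     # assumed grid size for illustration
rewards = np.full((GRID_SIZE, GRID_SIZE), -1.0)   # each step costs -1, so shorter paths score higher
rewards[4, 4] = 100.0                             # Jerry's cell: catching him is the goal
rewards[1, 3] = -100.0                            # a dog: large penalty for running into it
rewards[3, 1] = -100.0                            # another dog

# Each cell is a state; with 4 moves (up, down, left, right) the Q-table is 25 x 4
Q = np.zeros((GRID_SIZE * GRID_SIZE, 4))
```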