PyTorch Reinforcement Learning Guide: From Supervised Learning to MDP, MRP, and Policy

This article focuses on the core framework of reinforcement learning: an agent takes actions in an environment, receives rewards, and aims to maximize long-term return. Reinforcement learning addresses sequential decision-making, delayed rewards, and the exploration-exploitation tradeoff—problems that static supervised learning cannot handle well. Keywords: Reinforcement Learning, Markov Decision Process, PyTorch.

Technical Specification Details
Primary Language: Python / PyTorch
Theoretical Framework: MDP, MRP, MP, Policy
Source Format: Reworked original technical blog post from CSDN
Core Dependencies: PyTorch, NumPy, mathematical modeling concepts

Reinforcement learning is a learning paradigm for sequential decision-making

Reinforcement Learning (RL) studies how an agent can act continuously in a dynamic environment and learn optimal behavior through reward signals. Unlike supervised learning, which depends on labeled data, RL does not provide a direct “correct answer.” Instead, it only tells the system whether an outcome is good or bad.

This mechanism is especially well suited for problems with a temporal dimension. A model’s current action affects not only the immediate reward, but also the future state distribution. That is why RL naturally fits continuous decision-making tasks such as gaming, robotics, autonomous driving, quantitative trading, and recommendation systems.

Reinforcement learning differs fundamentally from supervised and unsupervised learning

Supervised learning focuses on mapping inputs to outputs and is fundamentally about fitting labeled data. Unsupervised learning, by contrast, discovers structure in data through clustering, representation learning, or generative modeling. Reinforcement learning sits between them: it often uses deep neural networks for training, but it depends on environmental feedback rather than human-provided labels.

Its most important distinction is that the training data is not a fixed dataset. Instead, the agent continuously generates data through its own behavior. This means the data distribution changes as the policy changes, so the training process is inherently non-IID.

# Use pseudocode to describe the RL interaction loop
state = env.reset()
for step in range(max_steps):
    action = policy(state)  # Select an action based on the current state
    next_state, reward, done, info = env.step(action)  # Interact with the environment
    buffer.append((state, action, reward, next_state))  # Record experience as a tuple
    state = next_state
    if done:
        break

This code captures the minimal reinforcement learning workflow: observation, decision, feedback, and update.

Reinforcement learning is challenging because feedback is sparse and environments are unstable

RL is harder than classical machine learning mainly because rewards often appear only after a delay. A critical action may reveal its value many steps later, which makes credit assignment a central challenge.

Another difficulty is balancing exploration and exploitation. If the agent only exploits known high-reward actions, it may get stuck in a local optimum. If it explores too aggressively, it sacrifices current reward. This exploration-exploitation dilemma remains one of the core issues in RL.

Five basic elements form the theoretical backbone of reinforcement learning

Reinforcement learning is typically defined by five elements: reward, agent, environment, action, and observation. Reward is scalar feedback. The agent is the decision-making entity. The environment is the external system around the agent. Actions are the controls the agent can apply. Observations are the slices of information visible to the agent.

It is especially important to distinguish between state and observation. A state is the complete internal description of the environment, while an observation is usually only a partial, noisy, or limited view of that information. Most real-world tasks are partially observable.
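The distinction can be made concrete with a small sketch. The dictionary and the `observe` function below are hypothetical illustrations, not part of any RL library: the full state contains everything about the environment, while the observation exposes only a noisy slice of it.

```python
import random

# Hypothetical full environment state: includes facts the agent cannot see
full_state = {"agent_pos": (2, 3), "goal_pos": (7, 7), "wind": 0.4}

def observe(state, noise=0.5):
    """Return only the agent's own position, corrupted by sensor noise."""
    x, y = state["agent_pos"]
    return (x + random.uniform(-noise, noise),
            y + random.uniform(-noise, noise))

obs = observe(full_state)  # goal_pos and wind remain hidden from the agent
```

Because the agent sees `obs` rather than `full_state`, the task is partially observable: two different states can produce identical observations.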

Environment AI Visual Insight: This image shows a typical maze-style RL scenario: the agent moves through a spatial environment that contains both positive-reward goals and negative-reward regions, emphasizing the sequential decision-making nature of RL, where actions change future observations and returns.

Reinforcement Learning AI Visual Insight: This diagram uses bidirectional arrows to illustrate the closed loop between the agent and the environment: the agent outputs an action, and the environment returns an observation and reward. It shows that RL is not a one-time prediction task, but a continuous interactive control system.

Markov processes provide the minimal mathematical foundation for reinforcement learning

A Markov Process (MP) describes a stochastic system that can be observed but not controlled. It satisfies the Markov property: the future depends only on the current state, not on earlier history.

If the state space is finite, you can represent the dynamics with a state transition matrix. The entry in row i and column j gives the probability of transitioning from state i to state j. This creates a unified representation for later value computation and policy analysis.

import numpy as np

# Transition matrix for a two-state weather system
T = np.array([
    [0.8, 0.2],  # Sunny -> Sunny/Rainy
    [0.1, 0.9],  # Rainy -> Sunny/Rainy
])

next_prob = T[0]  # Next-step distribution when the current state is Sunny
print(next_prob)

This code demonstrates the most basic state transition representation of a Markov process.
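The Markov property also means you can roll the state distribution forward purely by repeated matrix multiplication: the distribution after n steps is the initial distribution times the n-th power of the transition matrix. A sketch using the same two-state weather chain:

```python
import numpy as np

T = np.array([
    [0.8, 0.2],  # Sunny -> Sunny/Rainy
    [0.1, 0.9],  # Rainy -> Sunny/Rainy
])

start = np.array([1.0, 0.0])  # Start in Sunny with certainty

# Distribution over states after 3 steps: start @ T^3
after_3 = start @ np.linalg.matrix_power(T, 3)

# After many steps the distribution approaches the stationary distribution
long_run = start @ np.linalg.matrix_power(T, 200)
print(after_3, long_run)
```

For this chain the long-run distribution is (1/3, 2/3) regardless of the starting state, which is why the transition matrix alone characterizes the system's long-term behavior.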

Markov reward processes extend state transitions with reward definitions

A Markov Reward Process (MRP) adds a reward function and discount factor γ on top of an MP. Now the system not only transitions to the next state, but also produces a reward during the transition. This makes it possible to define return and state value.

You can write return as the discounted sum of future rewards from the current time step onward. The closer γ is to 1, the more the model values long-term reward. The closer it is to 0, the more it prioritizes immediate reward. In infinite-horizon tasks, γ is also essential for preventing divergence in the value function.

def discounted_return(rewards, gamma):
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r  # Discount future rewards
    return total

print(discounted_return([1, 2, 3], gamma=0.9))

This code computes the discounted return of a trajectory and serves as the foundation of the value function.
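For a finite MRP, state values need not be estimated by sampling trajectories at all: the Bellman equation V = R + γPV can be solved in closed form as V = (I − γP)⁻¹R. A sketch reusing the earlier two-state transition matrix, with hypothetical per-state rewards:

```python
import numpy as np

P = np.array([[0.8, 0.2],
              [0.1, 0.9]])   # State transition matrix
R = np.array([1.0, -0.5])    # Hypothetical expected reward in each state
gamma = 0.9

# Bellman equation V = R + gamma * P @ V  =>  (I - gamma * P) V = R
V = np.linalg.solve(np.eye(2) - gamma * P, R)
print(V)
```

This direct solve is feasible only for small state spaces; it is the analytical baseline that iterative methods and sampling-based value estimation approximate.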

Markov decision processes formally introduce actions into the system

A Markov Decision Process (MDP) is the core formal model in reinforcement learning. Compared with an MRP, it adds an action set A and extends transition probabilities into action-conditioned distributions. In other words, taking different actions in the same state leads to different next-state probabilities.

As a result, an MDP is usually defined by a state set S, action set A, transition function P, reward function R, and discount factor γ. It completely describes a controllable stochastic process and serves as the shared abstraction behind algorithms such as Q-Learning, DQN, and Policy Gradient.
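The five-tuple (S, A, P, R, γ) translates directly into data. A minimal sketch with hypothetical numbers, keying the transition and reward functions on state-action pairs:

```python
# Hypothetical two-state, two-action MDP stored as plain dictionaries
states = ["s0", "s1"]
actions = ["stay", "move"]

# P[(s, a)] is the distribution over next states given a state-action pair
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.1, "s1": 0.9},
    ("s1", "move"): {"s0": 0.7, "s1": 0.3},
}

# R[(s, a)] is the expected immediate reward for taking a in s
R = {("s0", "stay"): 0.0, ("s0", "move"): 1.0,
     ("s1", "stay"): 0.5, ("s1", "move"): -0.2}

gamma = 0.9

# Sanity check: every action-conditioned distribution sums to 1
assert all(abs(sum(d.values()) - 1.0) < 1e-9 for d in P.values())
```

Note how the same state "s0" leads to different next-state distributions under "stay" and "move": this action-conditioned dynamics is exactly what separates an MDP from an MRP.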

State Transition Graph AI Visual Insight: This figure uses nodes and weighted edges to represent a finite-state Markov chain. The probabilities on the edges describe the state transition distribution, making it useful for explaining how a system evolves over time when no action control exists.

State Transition Graph with Rewards AI Visual Insight: This figure overlays reward values on the state transition edges, clearly showing the dual structure of an MRP: probabilistic dynamics plus reward feedback. It is a key step in moving from stochastic processes to value evaluation.

State Transition Matrix AI Visual Insight: This figure expands a two-dimensional transition matrix into a three-dimensional structure by adding an action dimension. It shows that in an MDP, transition probabilities are no longer determined only by the current state, but jointly by the state-action pair.

A policy defines how the agent acts in each state

A policy is a mapping from states to action distributions, written as π(a|s). A deterministic policy outputs exactly one action for each state, while a stochastic policy outputs a probability distribution over actions.

The essence of reinforcement learning training is to find a policy that maximizes expected cumulative return. When the policy is fixed, an MDP can reduce to an MRP, which makes policy the bridge between control and evaluation.
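The reduction from MDP to MRP can be made concrete: given a stochastic policy π(a|s), averaging the action-conditioned transition matrices yields the policy-induced dynamics Pπ(s'|s) = Σa π(a|s) P(s'|s, a). A sketch with hypothetical numbers:

```python
import numpy as np

# Hypothetical MDP: P_a[a] is the transition matrix under action a
P_a = np.array([
    [[0.9, 0.1], [0.1, 0.9]],  # dynamics under action 0
    [[0.2, 0.8], [0.7, 0.3]],  # dynamics under action 1
])

# Stochastic policy: pi[s, a] = probability of choosing action a in state s
pi = np.array([[0.5, 0.5],
               [0.3, 0.7]])

# Policy-averaged dynamics: P_pi[s, s'] = sum_a pi[s, a] * P_a[a, s, s']
P_pi = np.einsum("sa,ast->st", pi, P_a)
print(P_pi)
```

The resulting `P_pi` is an ordinary MRP transition matrix, so once the policy is fixed, every evaluation tool from the MRP section applies unchanged.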

import random

import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # Explore with a small probability
    return int(np.argmax(q_values))  # Otherwise choose the current best action

This code implements the classic ε-greedy policy for balancing exploration and exploitation.

Reinforcement learning draws long-term value from its interdisciplinary foundation

RL is not just a branch of machine learning. It is also deeply connected to control theory, psychology, neuroscience, economics, and operations research. Control theory provides a framework for optimal decision-making. Economics focuses on return maximization. Neuroscience offers biological inspiration for reward learning.

That is also why RL has such strong research value: it provides a unified description of the general problem of how to act in uncertain environments. From an engineering perspective, understanding MDPs, value, policy, and return matters more than simply applying algorithms mechanically.

The Relationship Between Reinforcement Learning and Other Disciplines AI Visual Insight: This image places reinforcement learning at the center and radiates outward into multiple disciplines, showing that RL not only draws on methods from machine learning, mathematics, and control theory, but also forms methodological connections with psychology, economics, and neuroscience.

FAQ

Q1: What is the fundamental difference between reinforcement learning and supervised learning?

Supervised learning learns a mapping from static labeled data. Reinforcement learning learns a policy through interaction with an environment. The former optimizes prediction error, while the latter optimizes long-term cumulative reward.

Q2: Why is the discount factor γ so important?

γ determines whether the model prioritizes long-term or short-term reward. If it is too small, the policy becomes myopic. If it is too large, value estimation variance may increase and can even destabilize training in infinite-horizon settings.

Q3: Why is the MDP considered the core modeling tool in reinforcement learning?

Because an MDP simultaneously describes states, actions, transitions, and rewards, it fully defines a decision-capable stochastic environment. Most RL algorithms ultimately solve an optimal policy problem on some class of MDPs.

Core Summary: This article restructures the original reinforcement learning notes into a high-density technical article. It systematically explains the differences between RL, supervised learning, and unsupervised learning; clarifies the five core elements of reward, agent, environment, action, and observation; and provides an in-depth explanation of the mathematical framework behind MP, MRP, MDP, discount factors, and policy.