Gymnasium Reinforcement Learning Guide: From Env API Fundamentals to a CartPole Random Agent

Gymnasium is the standard environment interface for reinforcement learning experiments. Its core value is to unify environment creation, action space and observation space definitions, and the reset/step interaction flow. It solves two common problems in RL projects: inconsistent environment APIs and repeated boilerplate code. Keywords: Gymnasium, CartPole, Reinforcement Learning.

Technical Specification Details
Language: Python
Protocol: compatible with the OpenAI Gym API lineage; original source context licensed under CC BY-SA
Core Dependencies: gymnasium, numpy, torch, opencv-python, pytorch-ignite

Gymnasium provides a unified abstraction layer for reinforcement learning environments.

In reinforcement learning, an agent meets an environment interface before it meets any algorithm details. Gymnasium matters because it standardizes different tasks behind the same calling pattern: create an environment, reset it, perform an action, and receive an observation and reward.

This lets researchers focus on policies, value functions, and training pipelines instead of rewriting interaction templates for every environment. For PyTorch users, Gymnasium is the de facto entry point for building RL experiment baselines.

A minimal environment reveals the essence of agent-environment interaction.

import random
from typing import List

class Environment:
    def __init__(self):
        self.steps_left = 10  # Limit an episode to at most 10 steps

    def get_observation(self) -> List[float]:
        return [0.0, 0.0, 0.0]  # Return a fixed observation to focus on the interface itself

    def get_actions(self) -> List[int]:
        return [0, 1]  # Define a discrete action set

    def is_done(self) -> bool:
        return self.steps_left == 0  # The episode ends when no steps remain

    def action(self, action: int) -> float:
        if self.is_done():
            raise RuntimeError("Game is over")
        self.steps_left -= 1  # Update internal state after each step
        return random.random()  # Return a random reward to simulate feedback

This code shows the minimum responsibilities of an environment: maintain state, provide observations, accept actions, return rewards, and expose an episode termination signal.

An agent’s core responsibility is to choose actions from observations and accumulate returns.

Even the simplest agent follows a fixed pattern: read the observation, choose an action, call the environment, and accumulate rewards. More advanced algorithms simply replace the action-selection logic with neural networks and optimizers.

class Agent:
    def __init__(self):
        self.total_reward = 0.0  # Accumulate rewards over the entire episode

    def step(self, env: Environment):
        obs = env.get_observation()  # Get the current observation
        actions = env.get_actions()  # Read the set of available actions
        action = random.choice(actions)  # Select one action at random
        reward = env.action(action)  # Submit the action and receive a reward
        self.total_reward += reward  # Update the cumulative return

This example illustrates the data flow between agent and environment: observations go into the policy, actions go back to the environment, and rewards flow back to the agent.
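
To make that data flow concrete, here is a minimal driver loop for the two classes above (a sketch; the printed total varies because the rewards are random):

if __name__ == "__main__":
    env = Environment()
    agent = Agent()

    while not env.is_done():
        agent.step(env)  # One observation -> action -> reward cycle

    print(f"Total reward: {agent.total_reward:.4f}")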

Gymnasium models actions and observations explicitly through spaces.

The two most important fields in Gymnasium are action_space and observation_space. The former constrains what the agent can do, while the latter describes what the environment will return. This explicit modeling is a prerequisite for implementing reusable algorithms.

There are three common space types you should learn first: Discrete represents a finite action set, Box represents a continuous tensor range, and Tuple represents a composite space. Most classic control tasks rely mainly on the first two.

Observation Space AI Visual Insight: This diagram shows the abstraction hierarchy of the Gymnasium space system. At the top is the unified Space abstraction, which branches into discrete, continuous, and composite spaces. The technical focus is how shape, sample(), contains(), and seed() describe data structure, sampling behavior, validity checks, and experiment reproducibility.
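
As a concrete illustration of those four methods, here is a minimal sketch (the sampled values vary from run to run):

from gymnasium.spaces import Box, Discrete
import numpy as np

space = Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
space.seed(42)                 # Fix the sampler for reproducible experiments
print(space.shape)             # (3,): the data structure of samples
sample = space.sample()        # Draw a random point inside the declared bounds
print(space.contains(sample))  # True: sampled values are always valid
print(Discrete(2).sample())    # 0 or 1: a finite action set

Composite spaces combine these primitives, as the Tuple example below shows.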

from gymnasium.spaces import Box, Discrete, Tuple
import numpy as np

action_space = Tuple((
    Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32),  # Continuous control values
    Discrete(3),  # For example, three turn-signal states
    Discrete(2)   # For example, horn on/off
))

This code shows that Gymnasium supports not only single-action setups but also composite action modeling for more complex control scenarios.
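
Sampling from this composite space returns one value per subspace; a quick illustrative check (actual numbers vary):

print(action_space.sample())  # e.g. (array([...], dtype=float32), 2, 0)
print(action_space.contains(action_space.sample()))  # True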

The standard Env API defines the boundaries of the reinforcement learning training loop.

Every environment implements reset() and step(). reset() returns the initial observation and auxiliary information. step(action) returns the next observation, reward, termination flag, truncation flag, and an info dictionary.

The most commonly overlooked field is truncated. By separating it from terminated, developers can distinguish a task that ends naturally from one that stops because of a time limit. This distinction is critical when computing target values and tracking metrics.
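
A minimal sketch of that bootstrapping logic (gamma and next_value stand in for quantities your own training code would supply): a one-step TD target drops the future value only when the task truly ended.

def td_target(reward: float, terminated: bool, next_value: float, gamma: float = 0.99) -> float:
    if terminated:
        return reward  # Natural end: there is no future value to bootstrap from
    return reward + gamma * next_value  # Ongoing or merely truncated: keep bootstrapping

The environment lifecycle itself starts with gym.make() and reset():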

import gymnasium as gym

env = gym.make("CartPole-v1")  # Create a classic control environment
obs, info = env.reset()  # You must reset a new environment before using it
print(obs)
print(info)

This code completes the first step in the Gymnasium environment lifecycle: instantiate the environment and get the initial observation.
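
As a quick sanity check (a sketch that relies only on the space API described above), the initial observation should lie inside the declared observation space:

assert env.observation_space.contains(obs)  # The observation respects the declared bounds
print(obs.shape, obs.dtype)                  # (4,) float32 for CartPole-v1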

CartPole is the most practical starter environment for understanding the Gymnasium API.

The goal of CartPole is to keep a pole upright by pushing the cart left or right. Its action space is Discrete(2), and its observation space is a 4-dimensional Box corresponding to cart position, cart velocity, pole angle, and pole angular velocity.

Its reward mechanism is extremely simple: the agent receives a reward of 1 for every step it survives. That makes CartPole a natural choice for validating training loops, sampling logic, and baseline algorithms.

Breakout AI Visual Insight: This image presents Atari Breakout as a pixel-level observation environment, highlighting that Gymnasium supports not only low-dimensional state vectors but also high-dimensional visual inputs. In practice, these environments are often paired with convolutional networks, frame stacking, and preprocessing pipelines.

CartPole-v1 AI Visual Insight: The figure shows the CartPole dynamics: a cart moving along a horizontal track with an inverted pendulum mounted on top, forming a classic control system. In reinforcement learning terms, the policy maps the 4-dimensional observation vector to one of two discrete actions, and the goal is to maximize how long the pole stays balanced under limited control.

print(env.action_space)       # Print the action space definition
print(env.observation_space)  # Print the observation space bounds and shape

next_obs, reward, terminated, truncated, info = env.step(0)  # Push the cart to the left (action 0)
print(next_obs, reward, terminated, truncated, info)

This code demonstrates the full five-value return signature of step(), which forms the shared skeleton of all Gymnasium training code.

A random policy is the fastest way to validate environment integration.

Before introducing DQN, PPO, or A2C, it is good engineering practice to run a random agent first. If the code can sample actions, accumulate rewards, and exit cleanly when the episode ends, then your environment integration, loop control, and API understanding are likely correct.

import gymnasium as gym

if __name__ == "__main__":
    env = gym.make("CartPole-v1")
    total_reward = 0.0
    total_steps = 0
    obs, _ = env.reset()  # Initialize the first observation

    while True:
        action = env.action_space.sample()  # Randomly sample from the action space
        obs, reward, terminated, truncated, _ = env.step(action)  # Execute one interaction step
        total_reward += reward  # Accumulate reward
        total_steps += 1  # Accumulate step count
        if terminated or truncated:  # Stop when either termination signal appears
            break

    print(f"Episode done in {total_steps} steps, total reward {total_reward:.2f}")

This code implements a minimal runnable CartPole random agent that you can use to verify both the training loop and the environment interface.

Gymnasium’s real value lies in unified experiment baselines and environment extensibility.

From classic control to Atari, Box2D, and MuJoCo, Gymnasium provides a unified API that lets algorithm code transfer across tasks. Developers can often reuse the same sampling, training, and evaluation framework by changing only the environment name.
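
As a hedged sketch of that portability, the same random-rollout function runs unchanged on any registered environment ID (the two IDs below assume the default classic-control install):

import gymnasium as gym

def run_random_episode(env_id: str) -> float:
    env = gym.make(env_id)  # Only the environment name changes across tasks
    obs, _ = env.reset()
    total_reward = 0.0
    while True:
        obs, reward, terminated, truncated, _ = env.step(env.action_space.sample())
        total_reward += reward
        if terminated or truncated:
            break
    env.close()  # Release environment resources
    return total_reward

for env_id in ("CartPole-v1", "MountainCar-v0"):
    print(env_id, run_random_episode(env_id))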

For practical PyTorch reinforcement learning, Gymnasium is not an optional helper library. It is the infrastructure that carries the full state-action-reward loop.

FAQ

Q1: What is the relationship between Gymnasium and OpenAI Gym?

A: Gymnasium is the maintained continuation of OpenAI Gym, developed by the Farama Foundation after Gym stopped active development. It preserves the original API while clarifying interface semantics, such as the split between terminated and truncated, which makes it a strong default choice for new projects.

Q2: Why must you call reset() before calling step()?

A: The environment must initialize its internal state and generate the first valid observation. Calling step() before reset() usually means the episode has not started, the state is incomplete, and many environments will either raise an error or behave unpredictably.

Q3: Why should you implement a random agent first?

A: A random agent is the smallest acceptance test. It verifies that the environment can be created, the actions are valid, the episode termination conditions work correctly, and the reward flow is normal. It is an essential baseline before starting DQN or PPO training.

AI Readability Summary: This article reconstructs the core concepts of Gymnasium in a practical, structured way: Env, action_space, observation_space, and the standard reset() and step() interfaces. Using CartPole, it demonstrates environment creation, space inspection, and a minimal random agent so developers can quickly build a reliable reinforcement learning experiment baseline.