This article shows how to build a GAN with PyTorch, collect Atari screenshots through Gymnasium, and generate synthetic game frames, turning what is usually a fragmented set of tools into a single training workflow for high-dimensional game images. Keywords: PyTorch, GAN, Atari.
The technical specification snapshot summarizes the stack
| Parameter | Description |
|---|---|
| Core language | Python |
| Deep learning framework | PyTorch |
| Environment protocol | Gym/Gymnasium API |
| Image processing | OpenCV, NumPy |
| Visualization | TensorBoard |
| Typical environments | Breakout-v4, AirRaid-v4, Pong-v4 |
| Data format | 64×64 RGB tensors |
| Core dependencies | torch, gymnasium/gym, opencv-python, numpy |
This approach connects GAN training with Atari image sampling into a trainable pipeline
The core of this hands-on project is not reinforcement learning policy optimization. Instead, it continuously collects Atari screen frames with multiple random agents and trains a GAN to learn the distribution of those images. The value of this design is that it turns environment interaction, image preprocessing, generative modeling, and training monitoring into a closed loop.
Unlike a conventional static image dataset, Atari data comes from live environments. Developers do not need to manually organize image files. They only need to keep the environments running, and the model can continuously receive training samples. This makes the setup especially useful for validating whether generative models can learn visual patterns in the game domain.
The input wrapper converts raw observations into tensors suitable for convolutional networks
Raw Atari frames are typically 210×160 color images. Feeding them directly into a network creates inconsistency and often hurts training stability. The wrapper handles resizing, channel reordering, and type conversion so that every downstream convolution layer receives data in a consistent format.
```python
import cv2
import gymnasium as gym
import numpy as np
from gymnasium import spaces


class InputWrapper(gym.ObservationWrapper):
    def __init__(self, *args):
        super().__init__(*args)
        old_space = self.observation_space
        assert isinstance(old_space, spaces.Box)
        self.observation_space = spaces.Box(
            self.observation(old_space.low),   # Rebuild the new observation space bounds in sync
            self.observation(old_space.high),
            dtype=np.float32,
        )

    def observation(self, observation):
        new_obs = cv2.resize(observation, (64, 64))  # Resize uniformly to 64×64
        new_obs = np.moveaxis(new_obs, 2, 0)         # Convert HWC to CHW for PyTorch
        return new_obs.astype(np.float32)            # Cast to float32 for network input
```
This code standardizes raw environment images into an input structure that works cleanly with PyTorch convolutional networks.
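As a quick sanity check, the wrapper can be applied directly. This sketch assumes an installation where the legacy `Breakout-v4` ID is registered (e.g. via the Gymnasium Atari extras):

```python
env = InputWrapper(gym.make("Breakout-v4"))
obs, info = env.reset()
print(obs.shape, obs.dtype)  # expected: (3, 64, 64) float32
```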
The batch sampling function continuously collects training frames from multiple environments
GAN training depends on a stable data stream. Instead of loading a dataset from disk, this implementation randomly selects one Atari environment, executes an action, and collects observation frames in real time. Once the batch is large enough, it normalizes the data into the range [-1, 1].
One detail is worth noting: the code uses `np.mean(obs) > 0.01` to filter nearly black frames (at this point the observations are still raw 0–255 pixel values). This is not a universal rule, but in Atari games with flickering rendering it reduces the number of low-information samples entering the training batch and lowers the chance that the discriminator learns noisy patterns.
```python
import random

import torch


def iterate_batches(envs, batch_size):
    batch = [e.reset()[0] for e in envs]
    env_gen = iter(lambda: random.choice(envs), None)
    while True:
        e = next(env_gen)
        action = e.action_space.sample()  # Use random actions to generate diverse screen data
        obs, reward, is_done, is_trunc, _ = e.step(action)
        if np.mean(obs) > 0.01:  # Filter flickering or near-black frames
            batch.append(obs)
        if len(batch) == batch_size:
            batch_np = np.array(batch, dtype=np.float32)
            yield torch.tensor(batch_np * 2.0 / 255.0 - 1.0)  # Normalize to [-1, 1]
            batch.clear()
        if is_done or is_trunc:
            e.reset()
```
This code turns live game observations into continuous, iterable GAN training batches.
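A short usage sketch, reusing the wrapper and imports above, pulls one batch and checks its shape and value range:

```python
envs = [InputWrapper(gym.make(name)) for name in ("Breakout-v4", "AirRaid-v4", "Pong-v4")]
batch_v = next(iterate_batches(envs, batch_size=16))
print(batch_v.shape)                                # expected: torch.Size([16, 3, 64, 64])
print(batch_v.min().item(), batch_v.max().item())   # values within [-1.0, 1.0]
```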
The generator and discriminator learn synthesis and discrimination separately
The discriminator takes a 64×64 color image and outputs a probability after several convolution layers, indicating whether the input comes from real samples. The generator starts from a latent vector and progressively upsamples through transposed convolutions until it produces an RGB image in an Atari-like style.
The key to this adversarial structure is not which model is stronger, but how they optimize in alternation. The discriminator learns to distinguish real from fake, while the generator learns to fool the discriminator. Training stability usually depends on learning rate, normalization strategy, and the balance between real and synthetic samples.
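The article does not print the network definitions, but a minimal DCGAN-style sketch consistent with its description (64×64 RGB input, sigmoid probability output, transposed-convolution upsampling to a tanh image) might look like the following. The layer widths and `LATENT_VECTOR_SIZE = 100` are assumptions, since the article does not list them:

```python
import torch.nn as nn

LATENT_VECTOR_SIZE = 100  # assumed latent size


class Discriminator(nn.Module):
    # 3x64x64 image -> single probability, via strided convolutions
    def __init__(self, input_shape):
        super().__init__()
        self.conv_pipe = nn.Sequential(
            nn.Conv2d(input_shape[0], 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(256), nn.ReLU(),
            nn.Conv2d(256, 512, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(512), nn.ReLU(),
            nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=0),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.conv_pipe(x).view(-1, 1).squeeze(dim=1)


class Generator(nn.Module):
    # latent vector (LATENT_VECTOR_SIZE x 1 x 1) -> 3x64x64 image
    def __init__(self, output_shape):
        super().__init__()
        self.pipe = nn.Sequential(
            nn.ConvTranspose2d(LATENT_VECTOR_SIZE, 512, kernel_size=4, stride=1, padding=0),
            nn.BatchNorm2d(512), nn.ReLU(),
            nn.ConvTranspose2d(512, 256, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(256), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.ConvTranspose2d(64, output_shape[0], kernel_size=4, stride=2, padding=1),
            nn.Tanh(),  # output in [-1, 1], matching the batch normalization
        )

    def forward(self, x):
        return self.pipe(x)
```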
AI Visual Insight: The figure shows the typical visual distribution of Atari training samples, including high-contrast backgrounds, sparse target objects, pixel-art color blocks, and clearly separated interface boundaries. This means the generator must learn not only local textures, but also game UI structure, target position patterns, and the global layout characteristics produced by mixing multiple game scenes.
The main training loop uses the standard two-stage GAN update strategy
The main program first builds an environment pool, then initializes the generator, discriminator, binary cross-entropy loss, and two Adam optimizers. The two optimizers serve two separate networks, which is one of the biggest differences between GAN training and standard single-model training.
```python
from torch import nn, optim

device = torch.device(args.dev)  # args comes from the script's argument parser
envs = [InputWrapper(gym.make(name)) for name in ("Breakout-v4", "AirRaid-v4", "Pong-v4")]
shape = envs[0].observation_space.shape

net_discr = Discriminator(input_shape=shape).to(device)
net_gener = Generator(output_shape=shape).to(device)

objective = nn.BCELoss()
gen_optimizer = optim.Adam(net_gener.parameters(), lr=1e-4, betas=(0.5, 0.999))
dis_optimizer = optim.Adam(net_discr.parameters(), lr=1e-4, betas=(0.5, 0.999))
```
This code initializes the environment pool, models, loss function, and optimizers.
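The training snippets below also reference `BATCH_SIZE`, `true_labels_v`, and `fake_labels_v` without showing their definitions. One plausible set, with the batch size as an assumption, is:

```python
BATCH_SIZE = 16  # assumed value; the article does not state it

# Label tensors for BCELoss: 1 for real samples, 0 for generated ones
true_labels_v = torch.ones(BATCH_SIZE, device=device)
fake_labels_v = torch.zeros(BATCH_SIZE, device=device)
```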
Discriminator training must block gradient backpropagation into the generator
Each training step first samples real images, then generates synthetic images from Gaussian noise. When training the discriminator, real samples use label 1 and fake samples use label 0. At this point, you must call detach() on generated outputs. Otherwise, the discriminator backward pass will incorrectly update the generator.
```python
gen_input_v = torch.randn(BATCH_SIZE, LATENT_VECTOR_SIZE, 1, 1, device=device)  # Sample latent vectors
batch_v = batch_v.to(device)
gen_output_v = net_gener(gen_input_v)

dis_optimizer.zero_grad()
dis_output_true_v = net_discr(batch_v)
dis_output_fake_v = net_discr(gen_output_v.detach())  # Block gradients to avoid updating the generator
dis_loss = objective(dis_output_true_v, true_labels_v) + \
           objective(dis_output_fake_v, fake_labels_v)
dis_loss.backward()
dis_optimizer.step()
```
This code updates only the discriminator so it becomes better at distinguishing real and fake images.
Generator training aims to make synthetic samples look real to the discriminator
In the second stage, the generator output goes through the discriminator again, but this time without detach(). The target labels are no longer 0, but 1. With this loss definition, the generator updates its parameters in the direction of making the discriminator believe the fake samples are real.
```python
gen_optimizer.zero_grad()
dis_output_v = net_discr(gen_output_v)  # No detach() here: gradients must flow into the generator
gen_loss_v = objective(dis_output_v, true_labels_v)  # Use real labels to train the generator to fool the discriminator
gen_loss_v.backward()
gen_optimizer.step()
```
This code updates the generator so it produces increasingly realistic Atari frames.
TensorBoard makes the training process interpretable
GAN training can easily suffer from mode collapse, oscillating losses, or degraded visual quality. Writing loss curves and image grids for real and fake samples to TensorBoard makes it much easier to judge whether the model is converging and whether generated images are starting to show recognizable game structure.
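The article does not include the logging code itself. A minimal sketch of such instrumentation, where the `iter_no` counter and the reporting cadence are assumptions, might look like this:

```python
from torch.utils.tensorboard import SummaryWriter
import torchvision.utils as vutils

writer = SummaryWriter()  # writes event files under ./runs by default

# Inside the training loop, e.g. every few hundred iterations:
writer.add_scalar("dis_loss", dis_loss.item(), iter_no)
writer.add_scalar("gen_loss", gen_loss_v.item(), iter_no)

# Map images from [-1, 1] back to [0, 1] before building the grids
writer.add_image("fake", vutils.make_grid(gen_output_v.data[:64] * 0.5 + 0.5), iter_no)
writer.add_image("real", vutils.make_grid(batch_v.data[:64] * 0.5 + 0.5), iter_no)
```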
After roughly 10,000 to 20,000 iterations, the initial random noise gradually evolves into geometric boundaries, color regions, and object contours that resemble Atari screenshots. This indicates that the model has partially learned the target image distribution.
AI Visual Insight: The generated results show classic pixel-art game characteristics: regular color blocks, horizontal and vertical boundaries, localized bright targets, and clearly segmented background layers. Although the images still contain blur and artifacts, the overall screen layout, object density, and color statistics have already been modeled successfully, which suggests that the generator has learned the dominant modes of the Atari image distribution.
This example is best suited to understanding PyTorch generative modeling workflows
From an engineering perspective, this example covers five important points: observation wrapping, environment batch sampling, coordinated dual-network training, gradient control, and visualization monitoring. It is not the most advanced image generation solution, but it is an excellent entry point for understanding how GANs train on data collected from interactive environments.
If you want to extend it further, you can try DCGAN-style normalization, WGAN-GP for more stable losses, a fixed replay buffer, or a generator trained on a single game to achieve clearer visual results.
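As a concrete pointer for the WGAN-GP direction, the standard gradient-penalty term looks roughly like the sketch below. It is not part of the article's code, and the discriminator would also need its final sigmoid removed to act as a critic:

```python
def gradient_penalty(critic, real_v, fake_v, device, gp_lambda=10.0):
    # Penalize deviations of the critic's gradient norm from 1
    # on random interpolations between real and generated batches.
    alpha = torch.rand(real_v.size(0), 1, 1, 1, device=device)
    interp = (alpha * real_v + (1.0 - alpha) * fake_v).requires_grad_(True)
    critic_out = critic(interp)
    grads = torch.autograd.grad(
        outputs=critic_out,
        inputs=interp,
        grad_outputs=torch.ones_like(critic_out),
        create_graph=True,
    )[0]
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return gp_lambda * ((grad_norm - 1.0) ** 2).mean()
```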
FAQ: structured Q&A
Q1: Why use random agents for sampling instead of training an RL agent first?
A1: The goal is to learn the distribution of game frames, not an optimal policy. Random actions can already cover a wide range of visual states, which is enough to provide GAN training samples while keeping the system simpler.
Q2: Why normalize images to [-1, 1]?
A2: This usually matches the activation range of the generator output layer, which improves training stability and avoids mismatched value ranges between real and synthetic samples.
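A two-line round trip makes the mapping concrete:

```python
raw = np.array([0.0, 127.5, 255.0], dtype=np.float32)
normed = raw * 2.0 / 255.0 - 1.0          # -> [-1.0, 0.0, 1.0], the tanh output range
restored = (normed + 1.0) * 255.0 / 2.0   # back to [0, 255] for display
```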
Q3: Why is detach() important in GAN training?
A3: If you do not cut off the generator computation graph when training the discriminator, discriminator loss will propagate gradients back into the generator, mixing the parameter updates of the two stages and breaking the intended optimization objective.
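A standalone demonstration with stand-in linear layers (not the article's networks) shows the difference:

```python
import torch
import torch.nn as nn

gen = nn.Linear(4, 4)   # stand-in "generator"
disc = nn.Linear(4, 1)  # stand-in "discriminator"
fake = gen(torch.randn(2, 4))

disc(fake.detach()).sum().backward()
print(gen.weight.grad is None)  # True: detach blocked gradients to the generator

disc(fake).sum().backward()
print(gen.weight.grad is None)  # False: without detach, the generator receives gradients
```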
Core Summary: This article reconstructs a practical Atari image generation workflow based on PyTorch, Gymnasium, and GANs. It covers input wrapping, batch sampling, discriminator and generator training, gradient isolation, and TensorBoard visualization to help developers quickly understand how to train a game frame generator with screenshots collected by random agents.