## Technical Specification Snapshot
| Parameter | Description |
|---|---|
| Primary Language | Python |
| Deep Learning Framework | TensorFlow / Keras |
| Data Processing | NumPy, Pandas |
| Visualization | Matplotlib |
| Evaluation Metrics | MSE, Classification Accuracy |
| Data Type | Univariate stock closing price time series |
| Modeling Approach | Sliding Window + Transformer Encoder |
| Core Dependencies | tensorflow, keras, numpy, pandas, matplotlib, scikit-learn |
## Transformer models handle time series modeling effectively
This article uses stock closing prices as an example to show how to apply a Transformer to a time series classification task, compare the limitations of a moving average baseline, and walk through data windowing, normalization, encoder construction, training, and evaluation. Keywords: Transformer, time series forecasting, stock classification.
Time series data contains explicit ordering and implicit dependencies. Common use cases include price forecasting, sensor monitoring, device alerting, and behavior classification. The core challenge is not fitting a single point, but extracting trends, volatility, and contextual relationships from historical segments.
Traditional methods such as Moving Average and ARIMA work well for short-term smoothing and linear modeling, but they are less effective for complex nonlinear patterns, long-range dependencies, and robustness to missing values. The key advantage of the Transformer is that it uses self-attention to model internal sequence relationships in parallel.
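As a minimal illustration of that mechanism (not part of the original article's code, and with learned query/key/value projections omitted), single-head scaled dot-product self-attention over one window can be sketched in NumPy:

```python
import numpy as np

def self_attention(x):
    """Single-head scaled dot-product self-attention, without learned projections."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)  # pairwise similarity between all time steps
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the window
    return weights @ x, weights  # each step becomes a weighted mix of all steps

window = np.random.randn(30, 8)  # 30 time steps, 8 features
out, weights = self_attention(window)
print(out.shape)  # (30, 8): every output step attends to the full window
```

Because every time step attends to every other step in one matrix product, the whole window is processed in parallel, with no sequential hidden-state updates.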
## Time series tasks must first be defined as regression or classification
The original example does not directly predict the exact price for the next day. Instead, it reframes the problem as classification: given the past 30 days of closing prices, predict whether the next day’s price will be above or below the mean of that window. This reformulation reduces modeling difficulty and makes the task a better introductory Transformer example.
```python
import pandas as pd
from matplotlib import pyplot as plt

# Load stock closing price data
data = pd.read_csv("stock_data.csv")

# Visualize the closing price series
plt.plot(data["Close"])  # Show the trend of the raw time series
plt.show()
```
This code loads CSV data and helps you inspect the overall trend and volatility structure of the closing price series.
AI Visual Insight: This image shows the first few rows of tabular stock data. It typically includes columns such as date, open, high, low, close, and volume, which indicates that the raw data is ready for direct use in Pandas-based time series preprocessing and feature extraction.
AI Visual Insight: This line chart highlights the non-stationary nature of closing prices over time. You can observe local trends, abrupt shifts, and volatility clustering, which makes this a strong example of the kind of input where Transformers can capture long-range dependencies and regime-level patterns.
## Moving Average is only a weak baseline, not a final solution
A Moving Average computes the mean over a fixed window and uses that average as a future estimate. Its strengths are simplicity, interpretability, and low implementation cost. Its weaknesses are equally clear: it only uses local history and cannot represent sudden events or long-range dependencies.
In a noisy domain such as stock prices, a short window can overfit fluctuations, while a long window can smooth away turning points. That is why it works best as a baseline for measuring whether a deep learning model actually delivers a meaningful gain.
```python
from sklearn.metrics import mean_squared_error

# Compute the 10-day moving average
data["moving_average"] = data["Close"].rolling(10).mean()

# Measure the error between the moving average and the ground truth,
# skipping the first 9 rows where the rolling mean is NaN
mse = mean_squared_error(data["Close"][9:], data["moving_average"][9:])
print(mse)  # The error is usually large, which shows limited expressive power
```
This code builds a Moving Average baseline and uses MSE to quantify its prediction error.
AI Visual Insight: The two curves show that the Moving Average result is smoother than the raw closing price, but it lags noticeably around turning points and sharp fluctuations. This demonstrates that the method can capture low-frequency trends, but it struggles to fit the rapid changes of real market dynamics.
## Splitting the time series into sample windows is the key preprocessing step
To let a neural network learn from the data, you need to split a long sequence into multiple fixed-length windows. This example uses a 31-day sliding window: the first 30 days serve as the input, and the 31st day is used to create the label. Each window is also standardized to prevent absolute price scale from interfering with training.
```python
import numpy as np

dataset_x, dataset_y = [], []
# Each sample spans 31 rows: 30 input days plus the label day,
# so the last valid start index is data.shape[0] - 31
for i in range(0, data.shape[0] - 30):
    window_close = data[i:i + 30].Close
    mean_ = window_close.mean()
    std_ = window_close.std()
    # Compare the mean of the first 30 days with day 31 to create a binary label
    label = 1 if mean_ < data.iloc[i + 30].Close else 0
    dataset_y.append(label)
    # Standardize the window to improve training stability
    dataset_x.append(list((window_close - mean_) / std_))
dataset_x = np.array(dataset_x).reshape(-1, 30, 1)
dataset_y = np.array(dataset_y)
```
This code converts a continuous price series into a 3D tensor that is suitable as Transformer input.
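The training code further below assumes `train_x`/`train_y` and `test_x`/`test_y`, which the original article does not show being created. A sketch of the missing step, using synthetic stand-in arrays shaped like `dataset_x`/`dataset_y` (the 80/20 ratio is an assumption):

```python
import numpy as np

# Synthetic stand-ins shaped like dataset_x / dataset_y above
dataset_x = np.random.randn(1000, 30, 1)
dataset_y = np.random.randint(0, 2, size=1000)

# Split chronologically: shuffling would leak future prices into training
cut = int(len(dataset_x) * 0.8)
train_x, train_y = dataset_x[:cut], dataset_y[:cut]
test_x, test_y = dataset_x[cut:], dataset_y[cut:]
print(train_x.shape, test_x.shape)  # (800, 30, 1) (200, 30, 1)
```

Keeping the split in time order is essential for financial series, since a random split would let the model train on windows that overlap the test period.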
## The Transformer encoder models global relationships within each window
This implementation uses a lightweight Transformer Encoder with multi-head attention, residual connections, Layer Normalization, and a 1×1 convolution-based feed-forward network. In essence, it allows every time step to interact directly with every other time step in the window.
Compared with the sequential hidden-state updates used by RNNs, the Transformer is better suited for parallel computation and more effective at preserving long-range dependency information in longer windows.
```python
from tensorflow import keras
from tensorflow.keras import layers

def encoder(inputs, head_size, num_heads, ff_dim, dropout=0.0):
    # Multi-head self-attention learns dependencies across time steps
    x = layers.MultiHeadAttention(
        key_dim=head_size,
        num_heads=num_heads,
        dropout=dropout,
    )(inputs, inputs)
    x = layers.Dropout(dropout)(x)
    x = layers.LayerNormalization(epsilon=1e-6)(x)
    x = x + inputs  # Residual connection for stable training
    # 1x1 convolution acts as the feed-forward network
    y = layers.Conv1D(ff_dim, kernel_size=1, activation="relu")(x)
    y = layers.Dropout(dropout)(y)
    y = layers.Conv1D(inputs.shape[-1], kernel_size=1)(y)
    y = layers.LayerNormalization(epsilon=1e-6)(y)
    return x + y  # Second residual connection
```
This code defines a single Transformer encoder block, which is the core of the time series classifier.
```python
def build_model(input_shape):
    inputs = keras.Input(shape=input_shape)
    x = inputs
    # Stack multiple encoder blocks to increase representational power
    for _ in range(4):
        x = encoder(x, head_size=256, num_heads=4, ff_dim=4, dropout=0.25)
    x = layers.GlobalAveragePooling1D()(x)  # Aggregate features across the full time window
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(0.4)(x)
    outputs = layers.Dense(2, activation="softmax")(x)  # Output class probabilities for up or down
    return keras.Model(inputs, outputs)
```
This code stacks encoder blocks into a complete classification model and outputs binary class probabilities.
AI Visual Insight: This architecture diagram shows the input sequence passing through multiple attention and feed-forward layers, followed by pooling and fully connected layers for classification. It reflects the standard time series Transformer pipeline: local time-step representation → global sequence aggregation → final decision.
## The training stage should use early stopping and an appropriate loss function
For classification, the model uses sparse_categorical_crossentropy. Adam is a suitable optimizer, and a small learning rate helps attention-based models converge more stably. To reduce overfitting, you can use EarlyStopping to stop training once validation performance stops improving.
```python
model = build_model(train_x.shape[1:])
model.compile(
    loss="sparse_categorical_crossentropy",
    optimizer=keras.optimizers.Adam(learning_rate=1e-4),
    metrics=["sparse_categorical_accuracy"],
)
# EarlyStopping monitors val_loss by default, so a validation split is required
callbacks = [
    keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
]
# Start training, then evaluate performance on the held-out test set
model.fit(train_x, train_y, validation_split=0.2,
          epochs=30, batch_size=32, callbacks=callbacks)
model.evaluate(test_x, test_y, verbose=1)
```
This code completes model compilation, training, and evaluation, and serves as the execution entry point for validating time series classification performance.
AI Visual Insight: The training log shows that the model reaches relatively stable accuracy and loss convergence within a small number of epochs. This suggests that the sliding-window setup and Transformer encoder are already able to extract distinguishable up/down pattern features.
## The same architecture can be smoothly adapted to regression forecasting
If the goal changes from up/down classification to predicting the next day's numeric value, you only need to adjust the final layer and the loss function. Specifically, replace the output layer with a single neuron, use a linear activation (or sigmoid only if the targets are scaled to [0, 1]), and switch the loss to MSE.
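A minimal sketch of that change (the encoder stack is collapsed here for brevity, and `build_regression_model` is a name introduced for illustration, not from the original article):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_regression_model(input_shape):
    # Same idea as build_model above, but with a regression head:
    # one linear neuron and an MSE loss instead of softmax + cross-entropy
    inputs = keras.Input(shape=input_shape)
    x = layers.GlobalAveragePooling1D()(inputs)  # encoder blocks would go before this
    x = layers.Dense(128, activation="relu")(x)
    outputs = layers.Dense(1, activation="linear")(x)  # single numeric prediction
    model = keras.Model(inputs, outputs)
    model.compile(loss="mse", optimizer=keras.optimizers.Adam(learning_rate=1e-4))
    return model

model = build_regression_model((30, 1))
```

The rest of the pipeline (windowing, standardization, chronological splitting) stays unchanged; only the head and loss differ.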
This shows that the value of the Transformer in time series is not limited to classification. It is fundamentally a general-purpose sequential representation learner that can extend to regression, anomaly detection, missing value imputation, and multivariate forecasting.
## Transformers are a strong backbone for medium- and long-range time series modeling
This experiment shows that a Moving Average can provide a minimal reference line, but it cannot capture complex market patterns. By contrast, the Transformer explicitly models relationships among all time steps inside a window through attention, which gives it significantly stronger expressive power and scalability.
As data volume grows and features become richer, you can further introduce positional encoding, multivariate inputs, exogenous variables, and more efficient time series variants such as Informer, Autoformer, and PatchTST.
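Of these refinements, positional encoding is the most immediate: the encoder above has no built-in notion of time-step order. A NumPy sketch of the standard sinusoidal encoding from "Attention Is All You Need", which would be added to each window before the first encoder block:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # Even feature indices get sine, odd indices get cosine,
    # with wavelengths spanning a geometric progression
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])
    pe[:, 1::2] = np.cos(angle[:, 1::2])
    return pe

pe = sinusoidal_positional_encoding(30, 8)  # one encoding per time step
```

For the univariate 30×1 windows used here, this assumes the input is first projected to a `d_model`-dimensional embedding so the encoding can be added element-wise.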
## FAQ
Q1: Why start with classification instead of predicting the exact price directly?
A1: Classification labels are more stable and the optimization target is clearer, which reduces modeling difficulty for beginners. It is usually safer to first verify that the model can recognize trend direction before switching to regression.
Q2: What is the root cause of the large Moving Average error?
A2: It only uses the mean of a fixed window and cannot represent nonlinear relationships, long-range dependencies, or sudden fluctuations. In noisy financial time series, it usually reflects only a smoothed trend.
Q3: How can this Transformer implementation be improved further?
A3: You can add positional encoding, multi-feature inputs, validation-set monitoring, learning rate scheduling, class imbalance handling, and longer-sequence architectures such as Informer or Autoformer.
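A hedged sketch of the first few of those refinements in Keras (the specific patience, factor, and class-weight values are illustrative assumptions, not tuned settings):

```python
from tensorflow import keras

# Validation monitoring, learning-rate scheduling, and class weighting
callbacks = [
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                  restore_best_weights=True),
    keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2),
]
class_weight = {0: 1.5, 1: 1.0}  # up-weight the rarer class; values are illustrative
# model.fit(train_x, train_y, validation_split=0.2,
#           epochs=30, batch_size=32,
#           callbacks=callbacks, class_weight=class_weight)
```

Class weighting matters here because up/down labels derived from a trending market can be heavily imbalanced, which plain accuracy would otherwise mask.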
[AI Readability Summary]
This article systematically reconstructs a practical Transformer workflow for time series modeling. It covers task definition, a Moving Average baseline, sliding-window construction, a Keras-based Transformer encoder implementation, and training and evaluation. Using stock closing prices as an example, it shows how to move from traditional methods to deep learning-based time series classification.