Build a Keras MLP for Financial Prediction: Feature Engineering, Training, Regularization, and Model Saving

This hands-on guide fully implements an MLP binary classification workflow in Keras for financial up/down prediction, covering feature engineering, Sequential and Functional API design, EarlyStopping, Dropout, and model persistence. The core goal is to upgrade a model from merely “running” to being evaluable, reproducible, and tunable. Keywords: Keras, MLP, Dropout.


Language: Python
Framework: TensorFlow 2.20.0 / Keras
Task Type: Binary classification for next-trading-day direction prediction
Dataset Size: 2,451 samples, 11 features
Train/Test Split: 1,715 / 736
Validation Set: 343
License: CC 4.0 BY-SA
Core Dependencies: tensorflow, pandas, numpy, scikit-learn, matplotlib, seaborn

The project goal is to master the minimal engineering loop for MLPs by the shortest path

The original project focuses on building a multilayer perceptron with Keras. Its goal is not just to define a few Dense layers, but to establish a complete pipeline from data processing and modeling to training, evaluation, and persistence. For deep learning beginners, this is one of the most valuable steps.

The business scenario is financial time-series direction prediction. The label is defined as whether the next day’s return is greater than 0. In essence, this is a standard binary classification problem, which fits naturally with a sigmoid output layer and the binary_crossentropy loss function.

Data processing determines whether the model has anything worth learning

The code constructs 11 features, including RSI, MACD, MACD Signal, moving-average ratio, volatility, volume ratio, and multi-order momentum. The value of this feature set is that it encodes trend, volatility, price-volume behavior, and short-term inertia at the same time.
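RSI and MACD themselves are computed outside the excerpt shown below. A minimal pandas sketch of the standard 14-day RSI and 12/26/9 MACD looks like the following; the window and span parameters are conventional defaults assumed here, not values confirmed by the original project:

```python
import pandas as pd

def rsi(close, window=14):
    # Relative Strength Index: ratio of average gains to average losses
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(window).mean()
    loss = (-delta.clip(upper=0)).rolling(window).mean()
    return 100 - 100 / (1 + gain / loss)

def macd(close, fast=12, slow=26, signal=9):
    # MACD line: fast EMA minus slow EMA; signal line: EMA of the MACD line
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    macd_line = ema_fast - ema_slow
    return macd_line, macd_line.ewm(span=signal, adjust=False).mean()
```

Both helpers return pandas Series aligned with the input index, so they slot directly into a feature-generation function like the one below.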

def generate_features(df):
    df['return'] = df['close'].pct_change()  # Compute returns
    df['ma5'] = df['close'].rolling(5).mean()  # Short-term moving average
    df['ma20'] = df['close'].rolling(20).mean()  # Medium-term moving average
    df['ma_ratio'] = df['ma5'] / df['ma20'] - 1  # Moving-average deviation ratio
    df['volatility'] = df['return'].rolling(20).std()  # Volatility
    df['target'] = (df['return'].shift(-1) > 0).astype(int)  # Next-day up/down label
    return df.dropna()  # Drop rows made NaN by rolling windows and the shifted label

This code transforms raw candlestick price series into supervised learning samples suitable for MLP input.
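Because the samples are time-ordered, the train/test split must also be chronological rather than shuffled, or future information leaks into training. A minimal sketch of such a split (the column names and `test_frac` value are illustrative assumptions, not the project's code):

```python
import numpy as np
import pandas as pd

def time_ordered_split(df, feature_cols, test_frac=0.3):
    # No shuffling: training samples must precede test samples in time
    split = int(len(df) * (1 - test_frac))
    X = df[feature_cols].to_numpy()
    y = df['target'].to_numpy()
    return X[:split], X[split:], y[:split], y[split:]
```

A validation slice can then be carved from the tail of the training portion in the same chronological fashion.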

The model architecture uses two Keras APIs to cover different complexity levels

The Sequential API is a good fit for linearly stacked networks. Its code is concise and makes it ideal as a teaching starting point. In this experiment, the baseline model uses three hidden layers with sizes 128-64-32 and a total of 11,905 parameters. It is small enough for fast iteration.

The Functional API offers more flexible topology. You can insert BatchNormalization, branching structures, or multi-input networks. For future scenarios that combine technical indicators with fundamental factors, it is more extensible than Sequential.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout

model = Sequential([
    Input(shape=(11,)),  # 11 engineered features
    Dense(128, activation='relu'),  # First hidden layer
    Dropout(0.3),  # Randomly drop units to reduce overfitting
    Dense(64, activation='relu'),  # Second hidden layer
    Dropout(0.3),
    Dense(32, activation='relu'),  # Third hidden layer
    Dense(1, activation='sigmoid')  # Output probability of an upward move
])

This code builds a three-layer MLP for binary classification and adds Dropout regularization.
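For comparison, the same network expressed through the Functional API might look like the sketch below, with a BatchNormalization layer slotted in to illustrate the extra flexibility. This exact topology is an assumption for illustration, not the project's code:

```python
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization

inputs = Input(shape=(11,))                 # 11 engineered features
x = Dense(128, activation='relu')(inputs)
x = BatchNormalization()(x)                 # easy to insert between any two layers
x = Dropout(0.3)(x)
x = Dense(64, activation='relu')(x)
x = Dropout(0.3)(x)
x = Dense(32, activation='relu')(x)
outputs = Dense(1, activation='sigmoid')(x) # probability of an upward move
model = Model(inputs, outputs)
```

The same graph-building style extends naturally to multi-input models, for example one branch for technical indicators and another for fundamental factors merged with a Concatenate layer.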

Training configuration determines whether the model can converge stably

The project uses the Adam optimizer, binary_crossentropy loss, and accuracy as the evaluation metric. That is the standard setup for binary classification. More importantly, it introduces three callbacks: EarlyStopping, ModelCheckpoint, and ReduceLROnPlateau.

EarlyStopping stops training early when validation loss no longer improves. ModelCheckpoint persists the best model. ReduceLROnPlateau automatically lowers the learning rate when the training process reaches a plateau. Together, these three callbacks form a gold-standard template for Keras training control.

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau

callbacks = [
    EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True),  # Stop when validation performance no longer improves
    ModelCheckpoint('mlp_model.keras', monitor='val_accuracy', save_best_only=True),  # Save the best model
    ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5, min_lr=1e-6)  # Reduce learning rate automatically
]

This code adds early stopping, best-model checkpointing, and adaptive learning-rate scheduling to the training process.
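Wiring these callbacks into compile and fit follows the standard Keras pattern. The sketch below uses a deliberately tiny model and synthetic stand-in data for illustration, and omits ModelCheckpoint to avoid writing files; the real project trains on its 1,715-sample training split:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

# Synthetic stand-in data: 200 samples, 11 features, binary labels
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 11))
y_train = rng.integers(0, 2, size=200)

model = Sequential([Input(shape=(11,)),
                    Dense(16, activation='relu'),
                    Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

callbacks = [
    EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True),
    ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5, min_lr=1e-6),
]
history = model.fit(X_train, y_train, validation_split=0.2,
                    epochs=3, batch_size=32, callbacks=callbacks, verbose=0)
```

The returned `history.history` dictionary holds the per-epoch loss and metric curves that the later plots are drawn from.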

The training results show clear generalization pressure in this task

The baseline model reaches a best validation accuracy of about 0.6093, but test accuracy is only 0.5136, with an AUC of 0.5538. This indicates that the model learns some patterns during training, but those patterns transfer poorly to unseen samples.

From the training logs, the model stops early at epoch 11, and the best weights roll back to around epoch 1. This suggests that later training does not improve generalization and may instead begin to overfit or memorize noise.

The training curves directly expose both overfitting and weak signal strength

Training Curves

AI Visual Insight: The left chart shows training loss continuing to decline while validation loss stays mostly flat and rises slightly. The right chart shows training accuracy steadily improving while validation accuracy remains stuck around 0.59 to 0.61. This is a classic pattern of mild to moderate overfitting, indicating that the dataset contains limited predictive signal while the model already has enough capacity to memorize the training set.

Dropout significantly improves generalization performance in this experiment

The comparison experiment shows that the model without Dropout reaches a training accuracy of 0.7726, but validation accuracy is only 0.5860, resulting in an overfitting gap of 0.1866. After adding Dropout, training accuracy falls to 0.6181, while validation accuracy rises to 0.6152, leaving a gap of only 0.0029.

This result is highly representative: in weak-signal financial tasks, higher training accuracy does not mean stronger predictive power. A constrained model is often more robust than a high-capacity one.

Confusion Matrix

AI Visual Insight: This heatmap shows the hit distribution of binary predictions across the “up” and “down” classes. It helps identify whether the model is biased toward one class. If the diagonal advantage is not obvious, the model usually lacks sufficient discriminative power and the class boundary remains weak.

ROC Curve

AI Visual Insight: The curve stays only slightly above the random diagonal. Combined with an AUC of about 0.55, this suggests that the model captures only weakly separable signal. It is better suited for further feature enhancement, threshold optimization, and ensemble modeling than for direct use in high-confidence trading decisions.
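One of the follow-ups suggested above, threshold optimization, can be sketched with scikit-learn's roc_curve by maximizing Youden's J statistic (tpr - fpr). The labels and probabilities below are synthetic stand-ins for the article's test-set outputs:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic stand-ins for the test labels and predicted probabilities
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
y_prob = np.clip(0.5 + 0.1 * (y_true - 0.5) + rng.normal(0, 0.15, size=500), 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
j = tpr - fpr                            # Youden's J statistic at each threshold
best = thresholds[1:][np.argmax(j[1:])]  # skip the sentinel first threshold
```

Instead of the default 0.5 cutoff, predictions would then use `y_prob >= best`, which can matter more than model capacity when the classes are nearly balanced but weakly separable.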

Comparative experiments further show that regularization beats blindly deepening the network

Dropout Comparison Curves

AI Visual Insight: This figure compares validation loss and validation accuracy trajectories with and without Dropout. After adding Dropout, the curves become smoother and validation metrics become more stable. This indicates that random unit dropping reduces neuron co-adaptation and improves robustness on unseen samples.

Hyperparameter search results show that returns depend more on constraint than pure scaling

In the hidden-layer experiment, the larger three-layer architecture (256, 128, 64) achieves the highest validation accuracy of 0.6181, but its AUC is only about 0.5657, so the gain is not substantial. This suggests that widening the network improves fit but does not fundamentally change task difficulty.

The Dropout-rate experiment is more revealing. When Dropout increases to 0.5, AUC reaches about 0.5735, the best among all candidates. This indicates that under the current feature system, controlling overfitting matters more than increasing parameter count.

Dropout Rate Experiment

AI Visual Insight: The left chart shows the relationship between Dropout rate and AUC, while the right chart shows the relationship between Dropout rate and the overfitting gap. The overall trend shows that higher Dropout significantly compresses the train-validation gap and achieves better AUC in the higher range, indicating that this dataset favors stronger regularization.
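A Dropout-rate sweep like the one behind this figure can be reproduced with a simple loop. The sketch below shrinks everything (synthetic data, one hidden layer, two epochs) so it runs quickly; the candidate rates and architecture are illustrative assumptions:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.metrics import AUC

# Synthetic stand-in data for a fast sweep
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 11))
y = rng.integers(0, 2, size=300)

results = {}
for rate in [0.2, 0.3, 0.5]:
    model = Sequential([Input(shape=(11,)),
                        Dense(32, activation='relu'),
                        Dropout(rate),
                        Dense(1, activation='sigmoid')])
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=[AUC(name='auc')])
    hist = model.fit(X, y, validation_split=0.2, epochs=2, verbose=0)
    results[rate] = hist.history['val_auc'][-1]  # validation AUC per rate
```

In a real sweep each candidate should also record the train-validation gap, since the article's conclusion rests on comparing both curves, not AUC alone.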

Model saving and loading provide the foundation for reproducibility and deployment

The project saves both the full model file best_mlp.model.keras and the weights file best_mlp.weights.h5. The reloaded model produces the same AUC as the original model, which confirms that the persistence workflow is valid. This step is a prerequisite for moving from notebook experimentation to service deployment.

from tensorflow.keras.models import load_model

best_model.save('best_mlp.model.keras')  # Save the full model
loaded_model = load_model('best_mlp.model.keras')  # Reload the model
pred = loaded_model.predict(X_test)  # Run inference with the reloaded model

This code persists a trained Keras model and verifies that loading it reproduces the same results.
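A quick way to confirm the equal-AUC claim is an element-wise comparison of predictions before and after reloading. The sketch below uses a small throwaway model and a temporary file rather than the project's artifacts:

```python
import os
import tempfile
import numpy as np
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Input, Dense

# Small throwaway model for the round-trip check
model = Sequential([Input(shape=(11,)),
                    Dense(8, activation='relu'),
                    Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy')

X = np.random.default_rng(0).normal(size=(16, 11))
path = os.path.join(tempfile.mkdtemp(), 'mlp_check.keras')
model.save(path)
reloaded = load_model(path)
same = bool(np.allclose(model.predict(X, verbose=0),
                        reloaded.predict(X, verbose=0)))
```

If `same` is True, any metric computed from the predictions, including AUC, is guaranteed to match as well.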

The core conclusion from this hands-on project is that weak-signal finance demands generalization control

If this entire project can be reduced to one sentence, it is this: building an MLP with Keras is easy, but the real challenge is keeping validation and test performance consistent. In this case, Dropout, early stopping, and disciplined data splitting are more effective than simply stacking more network layers.

For future extension, the best next steps are to add more time-series features, use rolling-window validation, optimize classification thresholds, and build ensembles that combine MLPs with tree-based models instead of relying only on deeper fully connected networks.
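The rolling-window validation suggested above can be prototyped with scikit-learn's TimeSeriesSplit, which always trains on the past and evaluates on the future. The sketch below substitutes LogisticRegression and synthetic data to keep it self-contained; the fold count is an assumption:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the 2,451-sample, 11-feature dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 11))
y = rng.integers(0, 2, size=500)

aucs = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    # Each fold trains strictly on earlier samples than it evaluates on
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    aucs.append(roc_auc_score(y[test_idx], clf.predict_proba(X[test_idx])[:, 1]))
```

Reporting the mean and spread of the per-fold AUCs gives a far more honest picture of a financial model than a single fixed split.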

FAQ

Why is this MLP’s test accuracy close to random?

Short-term financial direction prediction is inherently a weak-signal problem, and 11 technical indicators are not enough to characterize future return direction reliably. The model can fit the training set to some extent, but its generalization ability on the test set is limited, which is why the AUC stays only slightly above 0.5.

How should you choose between the Sequential API and the Functional API in practice?

If the network is a simple linear stack, use Sequential first because it gives you the shortest and clearest code. If you need multi-input design, skip connections, shared layers, pluggable BatchNormalization, or more complex topology, you should move directly to the Functional API.

Is a higher Dropout rate always better?

No. If Dropout is too low, regularization may be insufficient. If it is too high, the model’s representational power may degrade. In this experiment, 0.5 performs best, but that only means stronger regularization is more suitable for the current data, features, and model setup. You must revalidate this choice on a different dataset.

AI Readability Summary: This article rebuilds a practical MLP workflow for binary financial direction prediction with Keras and TensorFlow, covering data generation, Sequential and Functional API modeling, callback-driven training, Dropout-based overfitting control, hyperparameter experiments, and model persistence, while interpreting performance through real evaluation metrics.