PyTorch Loss Functions Explained: From Core Concepts to CrossEntropyLoss in Practice

Loss functions are the core metric in deep learning training. They quantify the gap between predictions and ground-truth labels and directly drive parameter updates. This article focuses on classification and regression loss selection, the computation pipeline behind cross-entropy, and practical PyTorch usage. Keywords: loss functions, CrossEntropyLoss, PyTorch.

Technical Specifications Snapshot

Parameter | Details
Domain | Deep learning training fundamentals
Primary Language | Python
Core Framework | PyTorch
Core Task | Classification and regression loss design
Focus Module | torch.nn.CrossEntropyLoss
Article Type | Conceptual explanation + engineering practice
Core Dependencies | torch, torch.nn

Loss functions provide the unified metric for model optimization

A loss function has a very focused responsibility: convert “how wrong the prediction is” into a numeric value that optimization can act on. The smaller this value, the closer the model’s predictions are to the targets. Its gradients drive backpropagation and the optimizer’s weight updates.

In engineering contexts, Loss Function, Cost Function, Objective Function, and Error Function are often used interchangeably. They differ slightly in emphasis, but all serve the same role in the training loop: defining the optimization target.

The position of the loss in the training loop is critical

import torch
import torch.nn as nn

model = nn.Linear(8, 3)
criterion = nn.CrossEntropyLoss()  # Common loss for multiclass tasks
x = torch.randn(4, 8)
y = torch.tensor([0, 2, 1, 1])     # Class index labels

logits = model(x)                   # Output raw scores without applying Softmax first
loss = criterion(logits, y)         # Compute the loss directly
print(loss.item())

This example shows that the loss function sits at the critical point between model output and parameter updates.

Loss functions must be selected explicitly by task type

Classification and regression tasks operate in different output spaces, so you cannot use their loss functions interchangeably. Classification focuses on class probability distributions, while regression focuses on continuous-value error. Their optimization objectives are fundamentally different.

Task Type | Loss Function | PyTorch Module | Typical Use Case | Core Characteristic
Binary Classification | Binary Cross-Entropy (BCE) | nn.BCELoss / nn.BCEWithLogitsLoss | Yes/no decisions | Works on a single output probability (nn.BCEWithLogitsLoss takes the raw logit)
Multiclass Classification | Cross-Entropy | nn.CrossEntropyLoss | Single-label multiclass tasks | Softmax handled internally
Regression | Mean Absolute Error (MAE) | nn.L1Loss | Robust regression | More robust to outliers
Regression | Mean Squared Error (MSE) | nn.MSELoss | Standard numerical fitting | Smooth gradients
Regression | Smooth L1 | nn.SmoothL1Loss | Detection / robust regression | Balances MAE and MSE

In practice, start by identifying the output structure

If each sample belongs to exactly one class, multiclass cross-entropy is usually the default choice. If the output is continuous, such as house price, temperature, or position offset, you should switch to MSE, MAE, or Smooth L1.

def choose_loss(task_type: str):
    if task_type == "multiclass":
        return "CrossEntropyLoss"  # Single-label multiclass
    if task_type == "binary":
        return "BCEWithLogitsLoss" # More common for binary classification
    if task_type == "regression":
        return "SmoothL1Loss"      # Common robust option for regression tasks
    return "Unknown"

print(choose_loss("multiclass"))

This pseudo-strategy illustrates the first principle of loss selection: task type matters more than personal preference.
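The regression entries in the table map directly onto PyTorch modules. As a minimal sketch, with illustrative tensor values, here is how the three regression losses are instantiated and compared on the same continuous targets:

import torch
import torch.nn as nn

pred = torch.tensor([2.5, 0.0, 2.0, 8.0])      # Continuous predictions (illustrative values)
target = torch.tensor([3.0, -0.5, 2.0, 7.0])   # Continuous ground truth

mse = nn.MSELoss()             # Mean Squared Error: smooth gradients, sensitive to outliers
mae = nn.L1Loss()              # Mean Absolute Error: more robust to outliers
smooth_l1 = nn.SmoothL1Loss()  # Quadratic near zero, linear for large errors

print("MSE:", mse(pred, target).item())
print("MAE:", mae(pred, target).item())
print("SmoothL1:", smooth_l1(pred, target).item())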

Multiclass cross-entropy is the most common classification training objective

The essence of cross-entropy loss is to measure the difference between the true distribution and the predicted distribution. For single-label multiclass tasks, the true label usually corresponds to exactly one correct class, so the model must push the probability of that class as high as possible.

The most commonly misunderstood point is this: CrossEntropyLoss expects logits, not probabilities that have already passed through Softmax. That is because the module internally combines LogSoftmax + NLLLoss in a numerically stable implementation.

You must remember that CrossEntropyLoss already includes Softmax

import torch
import torch.nn.functional as F

logits = torch.tensor([[0.12, 1.0, 0.3]])
probs = F.softmax(logits, dim=1)   # Demonstration of the probability distribution only
print(probs)
print(probs.sum(dim=1))            # The probabilities should sum to 1

This example is only meant to explain the probability distribution after Softmax. You should not apply it again before passing inputs to CrossEntropyLoss.
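A quick check, again with illustrative values, shows why the extra Softmax matters: feeding already-normalized probabilities into nn.CrossEntropyLoss produces a different, flatter loss than feeding raw logits, because the module normalizes again internally.

import torch
import torch.nn as nn
import torch.nn.functional as F

criterion = nn.CrossEntropyLoss()
logits = torch.tensor([[0.12, 1.0, 0.3]])
target = torch.tensor([1])

correct = criterion(logits, target)                    # Intended usage: raw logits in
double = criterion(F.softmax(logits, dim=1), target)   # Mistake: Softmax effectively applied twice

print("Logits in:", correct.item())
print("Softmax applied first:", double.item())         # Different value, weaker gradient signal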

The cross-entropy computation pipeline can be broken into three stable steps

First, the network outputs logits, which are just unnormalized scores. Second, Softmax maps these scores into probability space. Third, the model takes the log of the predicted probability for the correct class and applies a negative sign. The lower the probability, the larger the loss.

The formula can be written as: L = -Σ y_i log(p_i). With one-hot labels, only the position of the true class is 1, so in practice only that term contributes to the computation.
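These three steps can be reproduced by hand and checked against nn.CrossEntropyLoss. A minimal sketch with illustrative values:

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[0.12, 1.0, 0.3]])      # Step 1: unnormalized scores from the network
target = torch.tensor([1])

probs = F.softmax(logits, dim=1)               # Step 2: map scores into probability space
manual = -torch.log(probs[0, target.item()])   # Step 3: negative log of the true-class probability

builtin = nn.CrossEntropyLoss()(logits, target)
print(manual.item(), builtin.item())           # The two values agree

The agreement between the manual value and the module output is exactly what “LogSoftmax + NLLLoss built in” means in practice.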

Intuition explains why cross-entropy works so well

If the correct class probability is 0.9, the loss is small. If it is only 0.1, then -log(0.1) becomes large, and the model receives a much stronger penalty. This nonlinear penalty mechanism effectively pushes the decision boundary toward convergence.

import math

for p in [0.9, 0.5, 0.1]:
    loss = -math.log(p)            # The smaller the correct-class probability, the larger the penalty
    print(p, round(loss, 4))

This example numerically illustrates the inverse relationship between the correct-class probability and the loss value.

In PyTorch practice, label format and input shape deserve special attention

In PyTorch, the standard label format for multiclass classification is a class index, not a one-hot vector. The input tensor shape is typically [batch_size, num_classes], and the label shape is [batch_size].

The code pattern in the original material is correct in spirit, but directly passing floating-point one-hot labels into nn.CrossEntropyLoss() is not the recommended approach in most common scenarios. In production code, passing class indices is safer and more idiomatic.
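If labels do arrive as floating-point one-hot vectors, for example from an upstream data pipeline, a small conversion back to class indices keeps the call idiomatic. The tensors below are illustrative:

import torch
import torch.nn as nn

one_hot = torch.tensor([[0.0, 1.0, 0.0],
                        [1.0, 0.0, 0.0]])        # Floating-point one-hot labels
indices = one_hot.argmax(dim=1)                  # Convert to class indices: tensor([1, 0])

logits = torch.randn(2, 3)                       # Shape [batch_size, num_classes]
loss = nn.CrossEntropyLoss()(logits, indices)    # Label shape [batch_size]
print(indices, loss.item())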

Use class indices for multiclass training whenever possible

import torch
import torch.nn as nn

def demo_cross_entropy():
    y_true = torch.tensor([1])  # Correct class index, meaning the second class
    y_pred = torch.tensor([[0.12, 1.0, 0.3]], dtype=torch.float32, requires_grad=True)

    criterion = nn.CrossEntropyLoss()   # Softmax is handled internally
    loss = criterion(y_pred, y_true)    # Pass logits and index labels directly

    print("Multiclass cross-entropy loss:", loss.item())
    loss.backward()                     # Backpropagate to compute gradients
    print("Gradients:", y_pred.grad)

if __name__ == "__main__":
    demo_cross_entropy()

This example shows the standard, directly trainable pattern and demonstrates both loss computation and gradient backpropagation.

Diagrams help clarify how information flows through loss functions during training

[Figure: thematic illustration of the training loop, placing the loss function at the center of the path from network output to error computation to parameter updates.]

The most common engineering mistakes usually fall into three categories

First, developers apply Softmax at the output layer before a loss that already includes it, so the outputs are normalized twice and numerical stability suffers. Second, they use the wrong label format, mixing one-hot vectors, floating-point tensors, and class indices. Third, they mismatch the task and the loss function, such as using a classification loss for a regression problem.

During debugging, if the loss does not decrease, gradients are unexpectedly small, or accuracy plateaus for a long time, first verify that the final layer output, the loss definition, and the label dtype are all aligned.
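As a debugging aid, a rough heuristic can flag model outputs that look like they were already passed through Softmax. This is only a sketch; the helper name and tolerance are arbitrary choices:

import torch

def looks_like_probabilities(outputs: torch.Tensor, atol: float = 1e-4) -> bool:
    # Heuristic: non-negative rows that each sum to ~1 were probably already softmaxed
    rows_sum_to_one = torch.allclose(outputs.sum(dim=1),
                                     torch.ones(outputs.size(0)), atol=atol)
    return bool(rows_sum_to_one and (outputs >= 0).all())

logits = torch.randn(4, 3)
print(looks_like_probabilities(logits))                        # Usually False for raw logits
print(looks_like_probabilities(torch.softmax(logits, dim=1)))  # True after Softmax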

A minimal training snippet is enough to validate whether the configuration is correct

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 3)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(16, 10)
y = torch.randint(0, 3, (16,))       # Randomly generate class index labels

optimizer.zero_grad()
logits = model(x)
loss = criterion(logits, y)           # Compute loss for the current batch
loss.backward()                       # Backpropagate to compute gradients
optimizer.step()                      # Apply the parameter update

This snippet shows how the loss function fits into a standard training loop and helps you quickly diagnose configuration errors.

FAQ

Q1: Why should you not manually apply Softmax when using CrossEntropyLoss?

Because nn.CrossEntropyLoss already includes a numerically stable combination of LogSoftmax and NLLLoss. Applying Softmax beforehand normalizes the outputs twice, which compresses the logits, weakens the gradient signal, and can hurt training stability.

Q2: Should multiclass labels use one-hot encoding or class indices?

In standard PyTorch training, prefer class indices. They are simpler, more compatible, and align directly with the default input requirements of CrossEntropyLoss.

Q3: Why is cross-entropy not suitable for regression tasks?

Because regression targets are continuous values, not discrete class probabilities. Cross-entropy optimizes distribution differences, while regression requires minimizing numerical error, so MSE, MAE, or Smooth L1 are usually better choices.

Core Summary: This article systematically reconstructs the knowledge framework around loss functions, covering definitions, categories, cross-entropy principles, and PyTorch practice. It highlights that CrossEntropyLoss already includes Softmax, explains label format requirements, and calls out common pitfalls to help developers quickly choose the right loss function and tune classification training more effectively.