How BCE Loss Works in Binary Classification: Principles, Sigmoid, and PyTorch Examples

BCE (Binary Cross Entropy) is the core loss function for binary classification tasks. It measures the gap between predicted probabilities and ground-truth labels, helping the model learn the decision boundary between positive and negative samples. This article focuses on BCE fundamentals, its relationship with Sigmoid, and practical PyTorch usage. Keywords: BCE, Sigmoid, PyTorch.

Technical Specification Snapshot

| Parameter | Description |
| --- | --- |
| Core Topic | BCE loss function in binary classification |
| Primary Language | Python |
| Framework | PyTorch |
| Task Type | Binary Classification |
| Key Formula | Binary Cross Entropy |
| Activation Function | Sigmoid |
| Related License | Original article declared CC 4.0 BY-SA |
| Core Dependencies | torch, torch.nn |

Binary classification and multiclass classification require different loss designs

In deep learning, the loss function determines how the model updates its parameters. Binary classification and multiclass classification may look similar, but they differ in output space and probability interpretation. As a result, they require different combinations of loss functions and activation functions.

For binary classification, the model usually outputs a single score, which is then compressed into the range from 0 to 1 by Sigmoid to represent the probability of the positive class. In this setup, the most common loss function is BCE.

You can remember the core difference between binary and multiclass classification like this

| Task | Common Loss | Activation Handling | Practical Rule |
| --- | --- | --- | --- |
| Binary Classification | BCELoss | Requires manual Sigmoid | Input must be probabilities |
| Multiclass Classification | CrossEntropyLoss | Includes internal Softmax logic | Input should be raw logits |

This difference is one of the most common pitfalls for beginners: BCELoss does not apply Sigmoid for you, while CrossEntropyLoss already includes the corresponding probability normalization logic.
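
As a quick illustration of this pitfall (a minimal sketch with made-up scores; recent PyTorch versions reject out-of-range inputs to BCELoss with a runtime error), compare passing raw scores directly versus passing them through Sigmoid first:

import torch
import torch.nn as nn

raw_scores = torch.tensor([2.3, -1.7], dtype=torch.float32)  # logits, not probabilities
labels = torch.tensor([1.0, 0.0])

criterion = nn.BCELoss()

try:
    criterion(raw_scores, labels)  # invalid: values fall outside [0, 1]
except RuntimeError as err:
    print("BCELoss rejected raw logits:", err)

print("After Sigmoid:", criterion(torch.sigmoid(raw_scores), labels).item())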

BCE loss essentially compares ground-truth labels with predicted probabilities

BCE applies to scenarios where labels take values of 0 or 1. It measures how close the model's predicted probability is to the ground-truth label. In essence, minimizing BCE is equivalent to minimizing the negative log-likelihood under a Bernoulli distribution.

Let the ground-truth label be y and the predicted probability be p, where y∈{0,1} and p∈[0,1]. The standard BCE form is shown below.

loss = - y * log(p) - (1 - y) * log(1 - p)

The key to this formula is not memorization, but understanding how it automatically switches the penalty direction based on the label.

When the label is 1, the loss only cares whether the prediction is close enough to 1

If y = 1, the formula simplifies to:

loss = -log(p)

In this case, the closer p is to 1, the smaller the loss becomes. If p is very low, the loss rises rapidly. In other words, the model is strongly pushed to increase the predicted probability for positive samples.

When the label is 0, the loss only cares whether the prediction is close enough to 0

If y = 0, the formula simplifies to:

loss = -log(1 - p)

In this case, p should be as close to 0 as possible. If the model assigns a high positive-class probability to a negative sample, the loss increases significantly.
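
The switching behavior is easy to verify with a few made-up numbers (plain Python, no PyTorch needed):

import math

def bce(y, p):
    # Direct translation of: loss = -y * log(p) - (1 - y) * log(1 - p)
    return -y * math.log(p) - (1 - y) * math.log(1 - p)

# Label 1: a confident correct prediction is cheap, a confident wrong one is expensive
print(bce(1, 0.9))  # ~0.105
print(bce(1, 0.1))  # ~2.303

# Label 0: the penalty mirrors the label-1 case
print(bce(0, 0.1))  # ~0.105
print(bce(0, 0.9))  # ~2.303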

Sigmoid is a prerequisite for using BCELoss correctly

BCELoss expects probabilities as input, not arbitrary real values. If you pass the model's raw outputs directly into BCELoss, the values may fall outside [0, 1], which makes the loss mathematically undefined; recent PyTorch versions reject such inputs with a runtime error.

Therefore, the standard pipeline is: the model outputs logits, the logits are passed through Sigmoid, and the resulting probabilities are then fed into BCELoss. This process can be summarized as: binary classification = Sigmoid + BCELoss.

The computation pipeline can be summarized as follows

Input features -> Model outputs logits -> Sigmoid maps logits to probabilities -> Compute BCE Loss against ground-truth labels

This pipeline is short, but following it exactly determines whether your binary classification training code is correct.
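
As a minimal sketch of one training step (the single-layer model, feature dimension, and random data below are made up for illustration):

import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(4, 1)                        # toy model: 4 features -> 1 logit
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

features = torch.randn(8, 4)                   # batch of 8 samples
labels = torch.randint(0, 2, (8, 1)).float()   # 0/1 labels as float

logits = model(features)                       # model outputs raw logits
probs = torch.sigmoid(logits)                  # map logits to probabilities
loss = criterion(probs, labels)                # compare probabilities with labels

optimizer.zero_grad()
loss.backward()
optimizer.step()

print("Loss for this step:", loss.item())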

The most basic BCE implementation in PyTorch must satisfy type and range requirements

The following minimal runnable example shows how to use nn.BCELoss() to compute binary classification loss. Note that predicted values must be probabilities, and labels must use a floating-point type.

import torch
import torch.nn as nn

def demo_bce_loss():
    # Ground-truth labels: binary classification usually uses 0/1 and converts them to float
    y_true = torch.tensor([0, 1, 0], dtype=torch.float32)

    # Predicted probabilities: assume Sigmoid has already been applied, so values are in [0, 1]
    y_pred = torch.tensor([0.69, 0.54, 0.26], dtype=torch.float32)

    # Create the BCE loss object; by default it averages over the batch
    criterion = nn.BCELoss()

    # Compute the loss
    loss = criterion(y_pred, y_true)

    print("Ground-truth labels:", y_true)
    print("Predicted probabilities:", y_pred)
    print("BCE loss:", loss.item())

if __name__ == "__main__":
    demo_bce_loss()

This code demonstrates the minimal end-to-end BCE workflow: prepare labels, provide probabilities, instantiate the loss function, and compute the result.

If your model outputs logits, apply Sigmoid first

In many real training pipelines, the final network layer does not output probabilities directly. Instead, it outputs unnormalized scores, or logits. In that case, you need to call torch.sigmoid explicitly.

import torch
import torch.nn as nn

# Simulated raw logits from the model
logits = torch.tensor([0.8, 0.2, -1.1], dtype=torch.float32)
labels = torch.tensor([1, 1, 0], dtype=torch.float32)

# Convert logits to probabilities first
probs = torch.sigmoid(logits)  # Key step: compress values into [0, 1]

criterion = nn.BCELoss()
loss = criterion(probs, labels)

print("Probability values:", probs)
print("Loss value:", loss.item())

This example shows that the direct input to BCELoss is not logits, but probabilities after Sigmoid.

BCEWithLogitsLoss is usually the safer choice in production

Although this article focuses on BCELoss, BCEWithLogitsLoss is more commonly recommended in real-world engineering. The reason is that it combines Sigmoid and BCE into a single operator, which improves numerical stability and reduces errors caused by applying Sigmoid twice or forgetting it entirely.

import torch
import torch.nn as nn

logits = torch.tensor([0.8, 0.2, -1.1], dtype=torch.float32)
labels = torch.tensor([1, 1, 0], dtype=torch.float32)

# Feed logits directly; no need to apply Sigmoid manually
criterion = nn.BCEWithLogitsLoss()
loss = criterion(logits, labels)

print("BCEWithLogitsLoss:", loss.item())

This code combines activation and loss into one step, reducing numerical error and simplifying training logic.
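
To check the equivalence numerically (reusing the same illustrative tensors as above), the two paths can be compared directly:

import torch
import torch.nn as nn

logits = torch.tensor([0.8, 0.2, -1.1], dtype=torch.float32)
labels = torch.tensor([1, 1, 0], dtype=torch.float32)

# Path 1: manual Sigmoid followed by BCELoss
loss_two_step = nn.BCELoss()(torch.sigmoid(logits), labels)
# Path 2: fused BCEWithLogitsLoss on raw logits
loss_fused = nn.BCEWithLogitsLoss()(logits, labels)

# The results should match up to floating-point precision
print(loss_two_step.item(), loss_fused.item())
print(torch.allclose(loss_two_step, loss_fused))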

The original diagram highlights the BCE computation path

Diagram caption: "Core of Binary Classification: BCE Loss from Principles to PyTorch Practice". The diagram maps the full training pipeline: model output, Sigmoid-based probability mapping, and the label-dependent logarithmic penalty term.

Developers should remember three key takeaways first

First, BCELoss accepts probabilities only; it does not take logits directly. Second, binary classification labels usually need to be converted to float32. Third, if you want more stable training, prefer BCEWithLogitsLoss.

These rules look simple, but they often determine whether a binary classification model converges successfully.

FAQ

1. Why does my model fail to converge after I use BCELoss?

The most common reason is that logits were passed directly into BCELoss, or the label type is not floating point. Check whether you applied Sigmoid first and make sure the labels use float32.

2. How should I choose between BCELoss and BCEWithLogitsLoss?

For teaching and formula understanding, you can start with BCELoss plus Sigmoid. For real training, BCEWithLogitsLoss is usually the better choice because it is more numerically stable and less error-prone.

3. Is BCE only for single-output binary classification?

No. It can also extend to multilabel classification, where each label is treated as an independent binary decision. However, the core requirement remains the same: each output dimension must correspond to a 0/1 label and a probability interpretation.
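
As a sketch of the multilabel case (the batch size, label count, and values below are made up), each output dimension is scored as an independent binary decision:

import torch
import torch.nn as nn

# 2 samples, 3 independent binary labels per sample
logits = torch.tensor([[1.2, -0.4, 0.3],
                       [-2.0, 0.8, 1.5]])
targets = torch.tensor([[1.0, 0.0, 1.0],
                        [0.0, 1.0, 1.0]])

# BCEWithLogitsLoss treats every element as its own 0/1 decision
criterion = nn.BCEWithLogitsLoss()
print("Multilabel BCE loss:", criterion(logits, targets).item())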

Core Summary: This article explains BCE (Binary Cross Entropy) loss in binary classification. It systematically covers how BCE differs from multiclass loss, its mathematical formula, why Sigmoid is necessary, and its standard usage and common pitfalls in PyTorch. It is well suited for quickly building a practical understanding of binary classification training.