RoI Align Explained: How Bilinear Interpolation Eliminates RoI Pooling Misalignment in Mask R-CNN

[AI Readability Summary] RoI Align is the precise feature alignment method introduced in Mask R-CNN to replace RoI Pooling. Its core idea is to remove coordinate rounding and sample on continuous coordinates with bilinear interpolation, which reduces localization bias in object detection and instance segmentation. It addresses three major pain points: quantization error, boundary misalignment, and degraded accuracy in pixel-level tasks. Keywords: RoI Align, bilinear interpolation, Mask R-CNN.

Technical specifications provide a quick snapshot

Technical topic: RoI Align
Primary applications: Object detection, instance segmentation, keypoint detection
First introduced in: Mask R-CNN (2017)
Core idea: Remove rounding and use bilinear interpolation for continuous sampling
Input: RoI proposals + a feature map
Output: Fixed-size aligned feature blocks
Key operations: Continuous mapping, bin partitioning, regular sampling, pooled aggregation
Languages: Commonly implemented in Python / C++ / CUDA
Specification: Defined by the Mask R-CNN paper; no single standardized protocol
Core dependencies: A deep learning framework, tensor operators, a bilinear interpolation implementation

RoI Align was introduced to solve the quantization error in RoI Pooling

The problem with RoI Pooling is not pooling itself. The real issue comes from rounding during coordinate mapping and bin partitioning. Once a proposal box is projected onto an integer grid, its boundaries shift, and that error is amplified during later pooling.

This has limited impact on classification tasks, but it becomes significant for pixel-level tasks such as instance segmentation and keypoint localization. Mask R-CNN therefore introduced RoI Align to correct feature alignment without changing the main detection framework.

AI Visual Insight: This figure compares traditional RoI Pooling with RoI Align, highlighting how boundary rounding causes region misalignment and how continuous-coordinate sampling produces more accurate feature alignment. It is useful for understanding why this method improves pixel-level prediction quality.
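To make the rounding error concrete, here is a toy sketch of both projection styles. The helper and the stride-16 backbone are illustrative assumptions, and real RoI Pooling implementations typically floor or round the scaled coordinates:

```python
STRIDE = 16  # assumed stride from image space to the feature map

def project(roi, quantize):
    # Map image-space box coordinates onto the feature map grid
    coords = (c / STRIDE for c in roi)
    return tuple(round(c) if quantize else c for c in coords)

roi = (35.0, 80.0, 355.0, 280.0)
print(project(roi, quantize=True))   # (2, 5, 22, 18) -- boundaries shift
print(project(roi, quantize=False))  # (2.1875, 5.0, 22.1875, 17.5)
```

The quantized box has drifted on every edge; the floating-point box is exactly the original geometry at feature-map scale, which is what RoI Align preserves.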

Core conclusion

The essence of RoI Align is simple: keep floating-point coordinates, avoid discrete truncation, and estimate feature values at non-integer locations through interpolation.

def roi_align_core(roi, feature_map, out_size=2):
    # 1. Keep floating-point RoI coordinates without rounding
    x1, y1, x2, y2 = roi

    # 2. Uniformly divide bins based on the output size (2x2 by default)
    bin_w = (x2 - x1) / out_size
    bin_h = (y2 - y1) / out_size

    # 3. Perform regular sampling and bilinear interpolation inside each
    #    bin, then pool the samples (continuous coordinates throughout)
    return "aligned_features"  # placeholder for the pooled feature block

This snippet captures the minimal implementation logic behind RoI Align: no rounding, then resampling, then aggregation.

RoI Align preserves original spatial information through continuous coordinate mapping

At the proposal mapping stage, RoI Align removes the integer quantization step entirely. In other words, after a box is projected from the input image onto the feature map, its boundaries remain floating-point values.

This may look like a small change, but it is the prerequisite for the entire method. Once coordinates are quantized too early, no amount of fine-grained sampling can recover the spatial information that has already been lost.

AI Visual Insight: This figure shows the difference in coordinate representation when a proposal box is mapped from the original image to the feature map. It emphasizes that RoI Align keeps floating-point boundaries instead of rounding to the integer grid, which directly reduces spatial shift and forms the basis for precise downstream sampling.

Continuous bin partitioning avoids a second alignment loss

Assume the RoI is (0.8, 0, 2.4, 4.5) and the output size is 2×2. Its width and height are therefore 1.6 and 4.5. Each bin then has size 0.8 and 2.25.

This means the entire region is evenly split into four continuous subregions instead of being forced onto the pixel grid. Continuous partitioning preserves the spatial structure more faithfully, but it also introduces a new issue: sampling points often fall on non-integer coordinates.

AI Visual Insight: This figure illustrates how a continuous RoI region is evenly divided into multiple bins. The key technical point is that bin boundaries no longer have to align to the discrete pixel grid, which avoids the subregion size distortion common in RoI Pooling.

roi = (0.8, 0.0, 2.4, 4.5)
out_h, out_w = 2, 2

# Compute the width and height of the continuous RoI
w = roi[2] - roi[0]  # width = 1.6
h = roi[3] - roi[1]  # height = 4.5

# Compute the size of each bin
bin_w = w / out_w    # 0.8
bin_h = h / out_h    # 2.25

This snippet shows the core of RoI Align bin partitioning: equal division only, with no rounding.
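Extending the same numbers, the four continuous bins can be listed explicitly. This is a sketch; the (x1, y1, x2, y2) tuple layout and row-major iteration order are illustrative choices:

```python
roi = (0.8, 0.0, 2.4, 4.5)
out_h, out_w = 2, 2
bin_w = (roi[2] - roi[0]) / out_w  # 0.8
bin_h = (roi[3] - roi[1]) / out_h  # 2.25

# Every bin keeps floating-point boundaries; none snaps to the pixel grid
bins = [(roi[0] + ix * bin_w, roi[1] + iy * bin_h,
         roi[0] + (ix + 1) * bin_w, roi[1] + (iy + 1) * bin_h)
        for iy in range(out_h) for ix in range(out_w)]
# The top-left bin spans roughly [0.8, 1.6] x [0.0, 2.25]
```

Note that the top-left bin is exactly the region reused in the sampling example later in the article.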

Bilinear interpolation gives stable feature values for non-integer sampling points

When a sampling point lies at a floating-point location such as (1.3, 2.7), it does not coincide with the center of a single pixel. Instead, it falls inside the square formed by four neighboring grid points. In that case, the feature value must be estimated from the local neighborhood.

Bilinear interpolation does this by weighting the top-left, top-right, bottom-left, and bottom-right points according to distance. The closer a point is to the sampling location, the larger its weight. The four weights always sum to 1.

AI Visual Insight: This figure visualizes the spatial relationship between a floating-point sampling point and its four surrounding integer grid points, showing how bilinear interpolation assigns weights based on horizontal and vertical distances to construct a continuous feature response.

The interpolation formula is the mathematical core of RoI Align precision

Let the sampling point be (x, y), and let its top-left integer corner be (x1, y1). Then dx = x - x1 and dy = y - y1, and the four neighboring values combine as value = (1-dx)*(1-dy)*v11 + dx*(1-dy)*v21 + (1-dx)*dy*v12 + dx*dy*v22, where each weight is the area of the sub-rectangle opposite that corner and the four weights always sum to 1.

def bilinear_interpolate(v11, v21, v12, v22, dx, dy):
    # Weight the four corner values by distance
    top = (1 - dx) * v11 + dx * v21      # First interpolate along the top edge
    bottom = (1 - dx) * v12 + dx * v22   # Then interpolate along the bottom edge
    value = (1 - dy) * top + dy * bottom # Finally interpolate along the y direction
    return value

This code demonstrates the two-stage bilinear interpolation process: horizontal first, then vertical.
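Two quick numeric sanity checks make the behavior tangible (the function is restated so the snippet runs on its own): sampling exactly at the cell center weights every corner by 0.25 and returns the mean, while sampling on a corner collapses all weight onto that single grid point.

```python
def bilinear_interpolate(v11, v21, v12, v22, dx, dy):
    # Two-stage blend: horizontal first, then vertical
    top = (1 - dx) * v11 + dx * v21
    bottom = (1 - dx) * v12 + dx * v22
    return (1 - dy) * top + dy * bottom

# Center of the cell: all four weights are 0.25, so the result is the mean
print(bilinear_interpolate(1.0, 2.0, 3.0, 4.0, 0.5, 0.5))  # 2.5

# On the top-left corner: the weight collapses onto that grid point
print(bilinear_interpolate(1.0, 2.0, 3.0, 4.0, 0.0, 0.0))  # 1.0
```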

Regular sampling inside each bin determines the granularity of feature aggregation

RoI Align does not only define how to interpolate. It also defines where to sample. A common approach is to place regular sampling points inside each bin, either with a fixed count or adaptively based on the bin size.

For adaptive sampling, implementations often set rx = ceil(bin_w) and ry = ceil(bin_h). They then generate points through uniform subdivision with center offsets, which avoids duplicate or asymmetric samples on bin boundaries.

AI Visual Insight: This figure shows the regular sampling layout inside a single bin. The main point is that sampling points are typically chosen at the center of each subregion, which covers the bin interior while avoiding boundary overlap and improving statistical stability.
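The two strategies can be sketched side by side. The sampling_ratio-style parameter mirrors what frameworks such as torchvision expose, but this helper is purely illustrative:

```python
import math

def grid_size(bin_w, bin_h, sampling_ratio=0):
    # Positive ratio: fixed count per axis; otherwise adapt to the bin size
    if sampling_ratio > 0:
        return sampling_ratio, sampling_ratio
    return max(1, math.ceil(bin_w)), max(1, math.ceil(bin_h))

print(grid_size(0.8, 2.25))     # adaptive: (1, 3)
print(grid_size(0.8, 2.25, 2))  # fixed sampling ratio: (2, 2)
```

The adaptive branch reproduces the rx = ceil(bin_w), ry = ceil(bin_h) rule described above, with a floor of one sample per axis for very small bins.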

The sampling point generation process maps directly to engineering implementation

Take the top-left bin [0.8,1.6] × [0,2.25] as an example. If rx=1 and ry=3, the resulting sampling points are (1.2, 0.375), (1.2, 1.125), and (1.2, 1.875).

import math

x_start, y_start = 0.8, 0.0
bin_w, bin_h = 0.8, 2.25
rx, ry = math.ceil(bin_w), math.ceil(bin_h)

step_x = bin_w / rx  # Step size along x
step_y = bin_h / ry  # Step size along y

points = []
for ix in range(rx):
    for iy in range(ry):
        # Use the center of each subregion as the sampling point
        x = x_start + (ix + 0.5) * step_x
        y = y_start + (iy + 0.5) * step_y
        points.append((x, y))

This code generates regular sampling coordinates inside a bin, which are then used for interpolation and pooling.

Final pooled output compresses multi-point sampling into fixed-size features

After computing interpolated features for each sampling point, the method still needs bin-level aggregation. In mainstream RoI Align implementations the most common choice is average pooling, although some expose max pooling as an alternative.

The purpose of this stage is to compress multi-point information from continuous space into a fixed-size tensor that can be consumed by the detection head, classification head, or mask head. The final output is still a regular tensor, but it is now derived from precisely aligned samples.

AI Visual Insight: This figure describes the aggregation process from multiple interpolated sampling points inside a bin to a single output unit, emphasizing that RoI Align does not directly read grid values. Instead, it samples precisely first and pools afterward to produce fixed-shape region features.

AI Visual Insight: This figure further summarizes the complete RoI Align workflow, from continuous boundaries and bin partitioning to regular sampling, bilinear interpolation, and pooled output. It works well as a visual index of the overall algorithm pipeline.

A simplified pipeline looks like this

def roi_align_pipeline(roi, feature_map):
    # Divide the RoI continuously
    bins = split_into_bins(roi)          # Partition the output grid
    outputs = []
    for b in bins:
        pts = sample_points(b)           # Generate sampling points inside the bin
        vals = [interp(feature_map, p) for p in pts]  # Interpolate each point
        outputs.append(sum(vals) / len(vals))         # Aggregate with average pooling
    return outputs

This code connects the full execution chain of RoI Align from an engineering perspective; split_into_bins, sample_points, and interp are placeholders for the partitioning, sampling, and interpolation steps described above.
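For completeness, the whole chain can also be made runnable in a few dozen lines of pure Python. This is a minimal sketch under simplifying assumptions: a 2D single-channel feature map given as nested lists, average pooling, adaptive sampling, and one of several possible border-clamping conventions; all names are illustrative:

```python
import math

def bilinear(fm, x, y):
    # Blend the four grid values surrounding a fractional point
    h, w = len(fm), len(fm[0])
    x0 = min(max(int(x), 0), w - 2)
    y0 = min(max(int(y), 0), h - 2)
    dx, dy = x - x0, y - y0
    top = (1 - dx) * fm[y0][x0] + dx * fm[y0][x0 + 1]
    bottom = (1 - dx) * fm[y0 + 1][x0] + dx * fm[y0 + 1][x0 + 1]
    return (1 - dy) * top + dy * bottom

def roi_align(fm, roi, out_h=2, out_w=2):
    x1, y1, x2, y2 = roi                  # floating-point, never rounded
    bin_w = (x2 - x1) / out_w
    bin_h = (y2 - y1) / out_h
    rx = max(1, math.ceil(bin_w))         # adaptive sampling density
    ry = max(1, math.ceil(bin_h))
    out = []
    for oy in range(out_h):
        row = []
        for ox in range(out_w):
            # Regular center-offset samples inside the continuous bin
            vals = [bilinear(fm,
                             x1 + (ox + (ix + 0.5) / rx) * bin_w,
                             y1 + (oy + (iy + 0.5) / ry) * bin_h)
                    for ix in range(rx) for iy in range(ry)]
            row.append(sum(vals) / len(vals))  # average pooling per bin
        out.append(row)
    return out
```

A handy correctness check: on a feature map whose value at every cell equals its x coordinate, bilinear interpolation is exact, so each output entry equals the x center of its bin's sampling grid.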

RoI Align still leaves room for further optimization

RoI Align removes the alignment error caused by manual rounding, but that does not mean it eliminates all errors. Interpolation is still an approximation, and a fixed sampling rate implicitly assumes that the local structure is relatively smooth.

When object boundaries are complex, scale variation is large, or texture is highly non-uniform, uniform sampling may still be insufficient to represent local detail. For that reason, later research explored extensions such as learnable sampling, deformable RoI operations, and structure-aware aggregation.

FAQ provides structured answers to common questions

Q1: What is the fundamental difference between RoI Align and RoI Pooling?

A: The core difference is whether coordinates are rounded. RoI Pooling maps boundaries and subregions onto a discrete grid, which introduces quantization error. RoI Align keeps floating-point coordinates and uses bilinear interpolation to extract features at non-integer locations.

Q2: Why is RoI Align more important for instance segmentation?

A: Instance segmentation requires pixel-level alignment, so boundary offsets directly degrade mask quality. RoI Align preserves spatial relationships more accurately, which is why it is more critical than RoI Pooling in Mask R-CNN.

Q3: Does RoI Align always use exactly four sampling points?

A: Not necessarily. The paper commonly uses fixed regular sampling inside each bin, but engineering implementations often expose a sampling ratio parameter or choose the sampling density adaptively based on bin size.

Core takeaway

This article systematically breaks down the key mechanism behind RoI Align: why RoI Pooling introduces quantization error through rounding; how RoI Align improves detection and segmentation accuracy through continuous coordinates, regular sampling, and bilinear interpolation; and what its implementation pipeline, formulas, and future optimization directions look like.