This article focuses on building a high-performance GPU operator for the CELU activation function with Triton, addressing three common pain points: CUDA development complexity, the high barrier to optimizing pointwise operators, and adaptation challenges in quantized workloads. It covers the mathematical foundation, explicit memory access, autotuning, and accuracy validation. Keywords: Triton, CELU, GPU operator optimization.
Technical Specifications at a Glance
| Parameter | Details |
|---|---|
| Language | Python, Triton DSL |
| Runtime Backend | PyTorch + CUDA |
| Core Protocols / Interfaces | Triton JIT, AutoTune, PyTorch Unary Op |
| Test Environment | H20 / PyTorch 2.8.0+cu126 / Triton 3.4.0 |
| Core Dependencies | triton, torch, pytest, flag_gems |
Triton provides a lower-friction path to implementing high-performance CELU kernels
Traditional CUDA kernel optimization often requires manual handling of thread blocks, shared memory, memory coalescing, and register pressure. For pointwise operators such as activation functions, that engineering effort is often disproportionate to the performance payoff.
Triton matters because developers can describe block-level computation with a Python-like programming model, then let the compiler handle thread mapping, memory access optimization, and low-level scheduling. For AI infrastructure teams, this is a far more practical path to performance engineering.
CELU’s mathematical structure makes it a strong fit for vectorized fusion
CELU is linear on the positive half-axis and exponentially decays on the negative half-axis, which makes it naturally suitable for single-input elementwise parallelism. Its common formulation is shown below:
import torch
def celu_ref(x, alpha=1.0):
# Pass through positive values directly, and smooth negative values exponentially
return torch.where(x > 0, x, alpha * (torch.exp(x / alpha) - 1))
This code illustrates the core piecewise logic of CELU and serves as a good semantic reference for a Triton implementation.
CELU’s extended parameters support quantization and mixed-precision inference
In addition to alpha, the original implementation introduces scale and input_scale. These are not required by the mathematical definition, but they are common engineering extensions for inference systems.
input_scale maps quantized inputs back into the floating-point domain, while scale aligns the output with the quantization range expected by downstream layers. In full-precision training, both are typically 1.0, but they become critical in INT8 and mixed-precision inference.
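To make the semantics concrete, here is a minimal PyTorch sketch of the extended formulation (the helper name celu_ref_ext is ours; it simply mirrors the Triton kernels shown later in this article):
import torch

def celu_ref_ext(x, alpha=1.0, scale=1.0, input_scale=1.0):
    # input_scale rescales the input back into the floating-point domain;
    # scale rescales the output; both default to 1.0 in full precision.
    return torch.where(
        x > 0,
        scale * input_scale * x,
        scale * alpha * (torch.exp(x * input_scale / alpha) - 1),
    )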
AI Visual Insight: The figure shows the piecewise CELU curve across the positive and negative half-axes, highlighting how changes in alpha affect curvature on the negative side: a larger alpha produces a smoother negative region that is closer to linear, while a smaller alpha creates a steeper curve that behaves more like ReLU truncation.
The first implementation is ideal for rapid prototyping
The high-level abstraction uses pointwise_dynamic to quickly handle tensor type promotion and broadcasting, which makes it a good choice for validating semantic correctness first.
import triton
import triton.language as tl
from flag_gems.utils import pointwise_dynamic
@pointwise_dynamic(
is_tensor=[True, False, False, False],
promotion_methods=[(0, "DEFAULT")],
)
@triton.jit
def celu_forward_kernel(x, alpha, scale, input_scale):
# Express elementwise piecewise computation directly
return tl.where(
x > 0,
scale * input_scale * x,
scale * alpha * (tl.exp(x.to(tl.float32) * input_scale / alpha) - 1),
)
The value of this version is that it is short, clear, and easy to align with PyTorch behavior, although it offers limited low-level control.
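Assuming the usual flag_gems convention that a pointwise_dynamic-decorated kernel can be invoked like an ordinary function on tensors and scalars (an assumption about the wrapper, not an API shown in the source), the host-side entry point stays a one-liner:
import torch

def celu(A: torch.Tensor, alpha=1.0, scale=1.0, input_scale=1.0):
    # Assumption: the decorator handles output allocation, broadcasting,
    # type promotion, and the kernel launch, returning a new tensor.
    return celu_forward_kernel(A, alpha, scale, input_scale)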
Explicit memory management is the real performance turning point
Once the goal shifts from “works correctly” to “runs as fast as possible,” you need explicit control over memory reads and writes, boundary masks, and block granularity. For pointwise operators, the gains often come from reducing abstraction overhead and improving memory access patterns.
import triton
import triton.language as tl
@triton.jit
def celu_forward_kernel(x_ptr, y_ptr, alpha, scale, input_scale,
n_elements, BLOCK_SIZE: tl.constexpr):
pid = tl.program_id(0)
# Compute the data range handled by the current program
offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
mask = offsets < n_elements # Prevent out-of-bounds access in the tail block
x = tl.load(x_ptr + offsets, mask=mask)
y = tl.where(
x > 0,
scale * input_scale * x,
scale * alpha * (tl.exp(x.to(tl.float32) * input_scale / alpha) - 1),
)
tl.store(y_ptr + offsets, y, mask=mask) # Write back results with masking
This kernel shows the core structure of Triton pointwise optimization: block partitioning, masked loads, vectorized computation, and masked stores.
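A minimal host-side launcher for this explicit kernel could look like the sketch below (out-of-place; the fixed BLOCK_SIZE=1024 is only an illustrative choice, not a tuned value):
import torch
import triton

def celu(x, alpha=1.0, scale=1.0, input_scale=1.0):
    # Assumes a contiguous tensor so flat offsets index it correctly
    y = torch.empty_like(x)  # separate output buffer (out-of-place)
    n_elements = x.numel()
    # One program instance per BLOCK_SIZE-sized chunk of the flattened tensor
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
    celu_forward_kernel[grid](x, y, alpha, scale, input_scale, n_elements, BLOCK_SIZE=1024)
    return y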
AutoTune enables one operator to adapt across input sizes
BLOCK_SIZE does not have a universal optimum. Small tensors care more about scheduling overhead, while large tensors depend more on parallelism and throughput. In practice, you should let benchmarking choose the best configuration.
import triton
import triton.language as tl
def get_autotune_config():
return [
triton.Config({'BLOCK_SIZE': 128}),
triton.Config({'BLOCK_SIZE': 256}),
triton.Config({'BLOCK_SIZE': 512}),
triton.Config({'BLOCK_SIZE': 1024}),
triton.Config({'BLOCK_SIZE': 2048}),
]
@triton.autotune(configs=get_autotune_config(), key=["n_elements"])
@triton.jit
def celu_forward_kernel(x_ptr, y_ptr, alpha, scale, input_scale,
n_elements, BLOCK_SIZE: tl.constexpr):
    # Body identical to the explicit kernel above: block partitioning,
    # masked load, elementwise compute, masked store
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.where(
        x > 0,
        scale * input_scale * x,
        scale * alpha * (tl.exp(x.to(tl.float32) * input_scale / alpha) - 1),
    )
    tl.store(y_ptr + offsets, y, mask=mask)
This configuration demonstrates that Triton will automatically enumerate candidate block sizes on the first run, which typically delivers more stable performance than a fixed parameter across diverse workloads.
To observe the tuning process, use the following command:
export TRITON_PRINT_AUTOTUNING=1
This command prints Triton autotuning logs, making it easier to analyze the best configuration for different input sizes.
The in-place variant further reduces memory and bandwidth overhead
For activation functions and other operators with no cross-element dependencies, in-place execution is a highly effective engineering optimization. Input and output share the same memory region, which directly eliminates one full output allocation and write path.
import triton

def celu_(A, alpha=1.0, scale=1.0, input_scale=1.0):
n_elements = A.numel()
grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
# Reuse the same address for input and output to enable in-place updates
celu_forward_kernel[grid](A, A, alpha, scale, input_scale, n_elements)
return A
This code implements in-place CELU and is well suited for inference pipelines with dense activation usage and tight memory bandwidth constraints.
Performance data shows that the optimization path delivers consistent gains
The test environment uses H20, PyTorch 2.8.0+cu126, and Triton 3.4.0. The original results indicate that the explicit implementation typically delivers about 10% to 20% improvement over the high-level wrapper, while AutoTune continues to move performance closer to the optimum across different tensor sizes.
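The source does not show its benchmark harness; as a rough sketch, a comparison against native PyTorch can be measured with triton.testing.do_bench (assuming the out-of-place celu launcher sketched earlier):
import torch
import triton
import triton.testing

x = torch.randn(1 << 22, device="cuda", dtype=torch.float16)
# Each call is timed over repeated runs on the GPU and reported in milliseconds
ms_torch = triton.testing.do_bench(lambda: torch.nn.functional.celu(x, alpha=1.0))
ms_triton = triton.testing.do_bench(lambda: celu(x, alpha=1.0))
print(f"torch: {ms_torch:.3f} ms  triton: {ms_triton:.3f} ms")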
AI Visual Insight: This figure appears to show the baseline performance curve or bar chart, reflecting throughput or latency across different input sizes under the high-level pointwise_dynamic abstraction. It serves as a useful baseline for comparison against the explicit implementation and AutoTune variants.
AI Visual Insight: This figure shows the performance of the explicit memory management version. The key takeaway is usually that with a fixed BLOCK_SIZE=1024, throughput improves over the baseline, demonstrating that reducing abstraction overhead and strengthening contiguous memory access are effective for unary pointwise operators.
AI Visual Insight: This figure appears to show AutoTune logs or final performance results, illustrating that Triton selects different block configurations based on n_elements, allowing both small and large tensors to run closer to their respective optimum execution points.
Accuracy validation is a required bar for production use
Performance improvements cannot come at the cost of numerical consistency. The original tests cover float16, bfloat16, and float32, and validate tensors from 1D to 4D, using native PyTorch CELU as the reference implementation.
import pytest
import torch
@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16, torch.float32])
def test_accuracy_celu(dtype):
x = torch.randn((64, 64), dtype=dtype, device="cuda")
alpha = 1.0
    ref = torch.nn.functional.celu(x.float(), alpha).to(dtype)  # Reference: native PyTorch computed in float32
    out = celu_(x.clone(), alpha)  # Result under test: the Triton-backed in-place variant defined above
    # Tolerances set by data type: looser for the 16-bit formats
    atol, rtol = (1e-3, 1e-2) if dtype != torch.float32 else (1e-5, 1e-5)
    torch.testing.assert_close(out, ref, atol=atol, rtol=rtol)
This test code shows the basic method for accuracy validation: use a unified reference implementation, set tolerances by data type, and cover representative shapes.
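To cover the 1D to 4D tensors mentioned above, the same test can also be parametrized over shapes (illustrative shapes; celu_ is the in-place Triton variant defined earlier):
import pytest
import torch

SHAPES = [(1024,), (64, 64), (8, 16, 32), (2, 4, 8, 16)]

@pytest.mark.parametrize("shape", SHAPES)
@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16, torch.float32])
def test_accuracy_celu_shapes(shape, dtype):
    x = torch.randn(shape, dtype=dtype, device="cuda")
    ref = torch.nn.functional.celu(x.float(), 1.0).to(dtype)  # float32 reference
    out = celu_(x.clone(), 1.0)  # Triton in-place variant on a copy of the input
    atol, rtol = (1e-3, 1e-2) if dtype != torch.float32 else (1e-5, 1e-5)
    torch.testing.assert_close(out, ref, atol=atol, rtol=rtol)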
AI Visual Insight: This figure likely shows successful test results for the standard CELU forward path, typically in the form of a passing pytest summary, confirming that the Triton implementation stays within tolerance across multiple dtypes and shapes.
AI Visual Insight: This figure corresponds to the in-place version test results, showing that overwriting the input buffer still satisfies accuracy requirements, which is especially important when replacing activations in inference pipelines.
Engineering practice shows that numerical stability and tuning strategy matter just as much
Promoting x to tl.float32 before calling exp is not redundant. Because tl.where evaluates both branches for every element, exp also receives large positive arguments, and float16 overflows to inf once the argument exceeds roughly 11; a small alpha or a large input_scale makes this more likely. The upcast also limits rounding error on the negative half-axis.
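A quick illustration of the failure mode, independent of the kernel itself:
import torch

x = torch.tensor([12.0], device="cuda", dtype=torch.float16)
print(torch.exp(x))          # inf: exp(12) is about 162755, beyond the float16 maximum of 65504
print(torch.exp(x.float()))  # about 162754.79 when computed in float32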
AutoTune is not always the right choice either. For one-off scripts or strongly real-time services, the startup cost of first-run tuning may outweigh the benefits. But if the operator is reused over time, the cached gains can be substantial.
FAQ
1. What is the biggest advantage of a Triton CELU implementation over the native PyTorch version?
The biggest advantage is control. You can explicitly control memory access, block size, in-place updates, and autotuning strategy, which helps you extract higher throughput on specific hardware and model pipelines.
2. Why convert the input to float32 before applying exp in the CELU kernel?
Because exponential functions are highly sensitive to numerical precision. Half precision is more prone to overflow and accumulated error, while converting to float32 first significantly improves numerical stability, especially for mixed-precision training and inference.
3. When is AutoTune the best fit?
It is best suited for long-running services, variable input sizes, and scenarios where the same operator is called repeatedly. It is less suitable for one-off tasks, ultra-low-latency first-token requirements, or short-lived containers where the cache cannot be reused.
Core summary: This article walks through the full path of optimizing the CELU activation function with Triton, covering the mathematical definition, explicit memory management, AutoTune, the in-place variant, performance benchmarking, and accuracy validation, to help developers build production-ready, high-performance GPU pointwise operators.