This article explains how Swin Transformer extends relative position encoding (RPE) from 1D sequences to 2D images. Starting from T5's bias-based RPE with bucket partitioning, the author walks through the design choices for 2D spatial relationships, including how offsets along the height and width dimensions are handled separately. The post covers the mathematical formulation, implementation considerations, and why this approach keeps position encoding efficient for vision tasks. It is aimed at researchers and engineers working on vision transformers who want both the theory and the practical details.
A detailed explanation of how Swin Transformer extends 1D bias-based RPE to 2D images, with implementation insights.
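
To make the 2D extension concrete, here is a minimal sketch of how a pair of (height, width) offsets inside one attention window can be mapped to a single index into a learned bias table. It follows the indexing scheme commonly seen in Swin Transformer implementations, but the function name `relative_position_index`, the pure-Python lists, and the single-head bias table are illustrative assumptions, not the article's or the official code.

```python
def relative_position_index(window_h, window_w):
    """Map every (query, key) pair in a window to one bias-table index.

    Height and width offsets are computed separately, shifted to be
    non-negative, then combined into a single flat index.
    """
    # Flattened token positions inside the window, in row-major order.
    coords = [(h, w) for h in range(window_h) for w in range(window_w)]
    index = []
    for qh, qw in coords:
        row = []
        for kh, kw in coords:
            # Each offset lies in [-(size - 1), size - 1]; shift to start at 0.
            dh = qh - kh + (window_h - 1)
            dw = qw - kw + (window_w - 1)
            # The width offset has 2 * window_w - 1 possible values, so use
            # that as the stride for the height offset.
            row.append(dh * (2 * window_w - 1) + dw)
        index.append(row)
    return index


# Usage: a 2x2 window has 4 tokens and (2*2 - 1) * (2*2 - 1) = 9 distinct offsets.
index = relative_position_index(2, 2)

# Hypothetical single-head bias table; in a real model these are learned values.
bias_table = [0.0] * 9

# Gather the bias that would be added to the 4x4 attention logits.
bias = [[bias_table[k] for k in row] for row in index]
```

The table needs only (2·Wh − 1)·(2·Ww − 1) entries rather than one per token pair, because every pair with the same 2D offset shares a bias value.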