Transformer Architecture Explained: Word Embeddings, Attention Variants, MoE, and How to Choose the Right Model Design

This article reconstructs the core Transformer technology stack: how word embeddings evolved from static vectors to dynamic contextual representations, how MHA, MQA, and GQA balance model quality with KV cache efficiency, how MoE expands capacity through sparse activation, and how the three major Transformer architectures map to different task boundaries. Keywords: Transformer, attention mechanism, MoE.

Technical Specification Snapshot

| Parameter | Details |
| --- | --- |
| Core Topics | Word embedding evolution, attention variants, MoE, Transformer architecture selection |
| Primary Languages | Python / mathematical formulas / Markdown |
| Core Protocols | Self-attention, causal masking, Top-k routing |
| Core Dependencies | Transformer, Softmax, FFN, KV Cache, Router |
| Representative Models | BERT, GPT, LLaMA, Qwen, Mixtral |

Word embedding technology has evolved from static lookup tables to contextual representations

Word2Vec and GloVe represent the era of static word vectors. The former learns distributed representations from local context windows, while the latter models statistical relationships using a global co-occurrence matrix. Both are efficient, but each word maps to only one vector, which makes polysemy inherently difficult to handle.

Transformers changed that. They first map tokens into base embeddings, then use stacked self-attention layers to fuse context, allowing a token like “Apple” to produce different hidden states in the context of fruit versus the company. This shift marks a critical leap in modern NLP—from semantic proximity to semantic understanding.

You can understand the difference between static and contextual embeddings through a minimal implementation

# Core difference between the two representation paradigms: static lookup vs. dynamic encoding

def static_embedding(token, table):
    return table[token]  # Direct lookup; the same word always gets the same vector

def contextual_embedding(tokens, model):
    x = model.token_embed(tokens)      # Base token vectors
    x = model.position_encode(x)       # Inject positional information
    h = model.transformer_layers(x)    # Dynamically update representations through context
    return h                           # Each position gets a context-aware vector

This code captures the essential difference between static word vectors and Transformer-based dynamic embeddings.
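
To see the effect empirically, here is a minimal sketch using the Hugging Face transformers library. The model choice and sentences are illustrative; it assumes bert-base-uncased is available and that "apple" appears as a single token in its (lowercased) vocabulary:

from transformers import AutoModel, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, hidden)
    # Assumes the word maps to a single token in the vocabulary
    idx = inputs.input_ids[0].tolist().index(tok.convert_tokens_to_ids(word))
    return hidden[idx]

v_fruit = word_vector("I ate an apple for lunch.", "apple")
v_brand = word_vector("Apple announced a new phone.", "apple")
print(torch.cosine_similarity(v_fruit, v_brand, dim=0).item())  # noticeably below 1.0

The same surface word produces two different vectors, which is exactly what a static lookup table cannot do.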

Embedding selection maps directly to resource constraints and task goals

| Dimension | Word2Vec | GloVe | Transformer Dynamic Embeddings |
| --- | --- | --- | --- |
| Representation Type | Static | Static | Dynamic |
| Context Modeling | Local window | Global co-occurrence | Full-sequence attention |
| Polysemy Handling | Weak | Weak | Strong |
| Inference Cost | Low | Low | High |
| Typical Use Cases | Lightweight retrieval | Representation learning on small to mid-sized corpora | Understanding and generation tasks |

Multi-head attention has evolved from MHA to deployment-oriented GQA

MHA is the standard multi-head attention mechanism. Each head has independent Q, K, and V projections, which gives it the strongest expressive power. But during decoding, the K and V tensors of every past token must be cached for every head in every layer, so the KV cache grows very large.

MQA takes a different approach: all heads share a single K and V while retaining separate Q projections. This reduces the cache from h copies to 1, significantly improving throughput, but it also reduces representational freedom across heads. GQA sits between the two: it shares K and V by group, striking a more practical balance between quality and efficiency.
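
The structural difference is easy to see at the tensor level. Below is a minimal GQA sketch in PyTorch (head counts and sizes are illustrative): K and V carry far fewer heads than Q, and each K/V group is broadcast to the query heads that share it.

# A minimal GQA sketch; head counts and sizes are illustrative
import torch

batch, seq, head_dim = 2, 16, 64
num_heads, num_groups = 8, 2              # 8 query heads share 2 K/V groups

q = torch.randn(batch, seq, num_heads, head_dim)
k = torch.randn(batch, seq, num_groups, head_dim)   # only these small K/V
v = torch.randn(batch, seq, num_groups, head_dim)   # tensors need caching

# Broadcast each K/V group to the query heads that share it
k = k.repeat_interleave(num_heads // num_groups, dim=2)
v = v.repeat_interleave(num_heads // num_groups, dim=2)

scores = torch.einsum("bqhd,bkhd->bhqk", q, k) / head_dim ** 0.5
weights = scores.softmax(dim=-1)
out = torch.einsum("bhqk,bkhd->bqhd", weights, v).reshape(batch, seq, -1)
print(out.shape)  # torch.Size([2, 16, 512])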

[Figure: MHA attention formula] The computation path from standard scaled dot-product attention to concatenated multi-head output: Q, K, and V linear projections, softmax-based weight normalization, and parallel heads followed by an output projection.
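
The underlying math is the standard formulation from the original Transformer paper:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

$$\mathrm{head}_i = \mathrm{Attention}(XW_i^{Q},\, XW_i^{K},\, XW_i^{V}), \qquad \mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}$$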

KV cache differences determine large-model deployment cost

# Estimate the number of KV cache units under different attention mechanisms

def kv_cache_units(num_heads, num_groups=None, mode="MHA"):
    if mode == "MHA":
        return num_heads          # Each head caches its own K/V
    if mode == "MQA":
        return 1                  # All heads share one K/V
    if mode == "GQA":
        return num_groups         # Each group shares one K/V
    raise ValueError("unknown mode")

This code shows the fundamental difference in inference memory usage across the three attention variants.
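
To turn head counts into memory, multiply by layers, head dimension, and dtype width. A rough estimate below assumes LLaMA-2-70B-like dimensions (80 layers, head_dim 128, 64 query heads, 8 KV groups, fp16):

# Per-token KV cache: K and V, per layer, per KV head
def kv_cache_bytes_per_token(num_layers, kv_heads, head_dim, dtype_bytes=2):
    return 2 * num_layers * kv_heads * head_dim * dtype_bytes

print(kv_cache_bytes_per_token(80, 64, 128))  # MHA-style, 64 KV heads: ~2.6 MB/token
print(kv_cache_bytes_per_token(80, 8, 128))   # GQA, 8 groups: ~0.33 MB/token

Per token of context, GQA here needs roughly one eighth of the MHA cache, and that gap is exactly what dominates long-context, high-concurrency serving.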

Engineering choices for attention variants follow clear patterns

| Feature | MHA | MQA | GQA |
| --- | --- | --- | --- |
| K/V Independence | Per-head independent | Fully shared | Group-shared |
| KV Cache | Largest | Smallest | Medium |
| Inference Speed | Slower | Fastest | Faster |
| Expressive Power | Strongest | Weakest | Best practical trade-off |
| Common Scenarios | Training, smaller models | Extreme inference optimization | Mainstream LLM deployment |

Today’s mainstream large models such as LLaMA, Qwen, and DeepSeek generally favor GQA because it better matches the engineering realities of long-context and high-concurrency inference.

MoE expands model capacity through sparse activation without linearly increasing compute

The core idea behind MoE is that not all parameters need to participate in every computation. The FFN is expanded into multiple expert networks, and a router selects only the Top-k experts for each token. As a result, the total parameter count can grow dramatically while per-forward FLOPs remain under control.

A standard MoE layer has three parts: expert networks, a gating network, and weighted aggregation. The real challenge is not routing itself but keeping it balanced: left alone, a small subset of experts tends to be preferred over time while the others remain undertrained. That is why a load-balancing loss exists.

[Figure: MoE mathematical form] MoE output as a weighted sum of expert outputs under gating weights, with the router producing the sparse weight vector G(x): the core formula behind large parameter capacity with sparse activation per forward pass.

[Figure: Linear gating] The gating network maps the input to an expert score distribution, typically a linear transformation followed by softmax and Top-k selection; the router itself is a lightweight module, not the main compute cost.

[Figure: Noisy gating] Noise injected into the gating scores increases exploration and mitigates expert collapse and over-concentrated routing, which ties directly into load balancing and training stability.
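
In the notation of Shazeer et al.'s sparsely gated MoE, which these figures depict, the pieces combine as:

$$y = \sum_{i=1}^{n} G(x)_i \, E_i(x), \qquad G(x) = \mathrm{softmax}\big(\mathrm{KeepTopK}(H(x), k)\big)$$

$$H(x)_i = (xW_g)_i + \varepsilon \cdot \mathrm{softplus}\big((xW_{\text{noise}})_i\big), \quad \varepsilon \sim \mathcal{N}(0, 1)$$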

The core logic of sparse routing can be compressed into three steps

# Minimal implementation of MoE Top-k routing (numpy; router returns one logit per expert)
import numpy as np

def moe_forward(x, router, experts, k=2):
    scores = router(x)                        # One raw score (logit) per expert
    topk_idx = np.argsort(scores)[-k:]        # Indices of the k highest-scoring experts
    weights = np.exp(scores[topk_idx])
    weights /= weights.sum()                  # Softmax over the selected experts only
    output = 0
    for w, i in zip(weights, topk_idx):
        output += w * experts[i](x)           # Compute and aggregate only the active experts
    return output

This code shows that MoE saves computation by “computing only a small subset of experts,” not by reducing the total number of model parameters.
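
The load-balancing pressure described above can also be made explicit as an auxiliary loss. Here is a minimal numpy sketch in the style of the Switch Transformer objective (function name and shapes are illustrative):

import numpy as np

def load_balancing_loss(router_probs, expert_assignments, num_experts):
    # f[i]: fraction of tokens actually routed to expert i (top-1 assignments)
    f = np.bincount(expert_assignments, minlength=num_experts) / len(expert_assignments)
    # p[i]: mean router probability mass assigned to expert i across tokens
    p = router_probs.mean(axis=0)
    # Scaled dot product: equals 1 when both distributions are uniform and grows
    # as routing concentrates, so adding it to the training loss pushes back
    return num_experts * float(np.sum(f * p))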

MoE delivers value in both training and inference

| Dimension | Dense FFN | MoE |
| --- | --- | --- |
| Parameter Scale | Grows linearly with layer width | Can scale much further |
| Active Parameters per Forward Pass | All participate | Only Top-k experts |
| Training Difficulty | Standard stability issues | Routing and load balancing |
| Deployment Benefit | Stable and simple | Higher capacity density |
| Representative Models | GPT-style dense models | Switch, Mixtral, DeepSeek-MoE |

When a model needs larger capacity under a limited compute budget, MoE is often more effective than simply scaling width.
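
As a rough back-of-the-envelope, using the figures Mistral reported for Mixtral 8x7B (approximately 46.7B total parameters and 12.9B active per token with 2-of-8 routing):

# Capacity vs. per-token compute, using reported Mixtral 8x7B figures (approximate)
total_params  = 46.7e9    # every expert counted in the checkpoint
active_params = 12.9e9    # shared layers plus the 2 selected experts per token
print(f"active fraction: {active_params / total_params:.2f}")  # ~0.28

Per-token compute stays near that of a ~13B dense model while total capacity approaches a ~47B one.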

The three Transformer architectures target generation, understanding, and sequence transformation respectively

Decoder-Only uses causal masking, so each position can see only previous tokens. That makes it a natural fit for autoregressive generation, which is why GPT, LLaMA, and Qwen all follow this paradigm. It scales well and has become the de facto standard for general-purpose large language models.
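
Causal masking itself is a small operation; a minimal numpy sketch (sequence length illustrative):

import numpy as np

def causal_mask(seq_len):
    # True where attention is allowed: position i sees only positions <= i
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

scores = np.random.randn(4, 4)            # raw attention scores for a 4-token sequence
scores[~causal_mask(4)] = -np.inf         # block all future positions
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1 over its visible prefix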

Encoder-Only uses bidirectional attention and is better suited to understanding tasks such as classification, named entity recognition, and retrieval encoding. Encoder-Decoder combines bidirectional encoding with autoregressive decoding, making it a more natural fit for seq2seq tasks such as translation, summarization, and rewriting.

You can identify the right architecture quickly from the masking pattern

# Choose a Transformer architecture based on the task

def choose_arch(task_type):
    if task_type in ["chat", "generation", "codegen"]:
        return "Decoder-Only"     # Best for autoregressive generation
    if task_type in ["classification", "ner", "retrieval"]:
        return "Encoder-Only"     # Best for bidirectional understanding
    if task_type in ["translation", "summarization", "rewrite"]:
        return "Encoder-Decoder"  # Best for sequence transformation
    return "Depends"

This code establishes a direct mapping between mainstream Transformer architectures and task categories.

The key to architecture selection is not sophistication, but task fit

| Architecture | Attention Pattern | Strengths | Typical Tasks |
| --- | --- | --- | --- |
| Decoder-Only | Causal masking | General-purpose generation, strong scalability | Dialogue, code generation, completion |
| Encoder-Only | Bidirectional visibility | Strong understanding, efficient fine-tuning | Classification, retrieval, sequence labeling |
| Encoder-Decoder | Encoding + decoding | Strong sequence transformation | Translation, summarization, rewriting |

If your goal is to unify multiple tasks and maximize generation capability, choose Decoder-Only. If you want high-quality semantic encoding, choose Encoder-Only. If the input and output structures differ significantly, prefer Encoder-Decoder.

The main line of the modern large-model technology stack is now very clear

At the embedding layer, dynamic contextual representations have replaced static lookup tables. At the attention layer, GQA is becoming the inference standard. For capacity scaling, MoE makes hundred-billion-parameter systems more practical. At the system level, Decoder-Only dominates general-purpose generation, while the other two architectures still retain strong advantages in specialized tasks.

Understanding these four trends matters more than memorizing model names in isolation, because together they define the design boundaries of modern Transformers: expressive power, deployment efficiency, scalable capacity, and task fit.

FAQ

1. Why haven’t Word2Vec and GloVe become completely obsolete today?

Because they are cheap to train, extremely fast at inference, and simple to implement. They still work well in cost-sensitive scenarios such as lightweight retrieval, clustering, and recommendation. When you do not need deep contextual understanding, static vectors still have practical engineering value.

2. Why do mainstream large models prefer GQA instead of continuing to use standard MHA?

The core reason is that KV cache cost becomes too high during long-sequence inference. GQA offers a better balance among memory usage, bandwidth pressure, and model quality, making it better suited to production-grade deployment.

3. Does MoE always make inference faster?

Not necessarily. MoE’s advantage is that it preserves controllable compute at larger capacity, but routing, cross-device communication, and uneven expert utilization can all introduce extra overhead. With good design, it improves capacity efficiency, but that does not mean it is universally faster in every scenario.

Core Summary: This article systematically outlines four major trends in the modern Transformer stack: embeddings evolving from Word2Vec and GloVe to dynamic contextual representations, attention shifting from MHA toward GQA and MQA, MoE expanding capacity through sparse activation, and the application differences and selection logic among Decoder-Only, Encoder-Only, and Encoder-Decoder architectures.