Abstract

Since the original Transformer paper was published in 2017, most mainstream large language models, including GPT, Claude, Gemini, LLaMA, Mistral and Qwen, have adopted stacked Transformer decoder blocks as their core architectural foundation. Although these models differ in training data, parameter scale, alignment strategy and engineering optimization, their basic operating logic remains highly consistent.

This article explains the end-to-end workflow of modern decoder-only LLMs without heavy mathematical derivation. It covers ten essential modules: tokenization, embedding, positional encoding, self-attention, multi-head attention, Grouped-Query Attention, Feed-Forward Networks, residual connections, layer normalization and autoregressive next-token prediction. It also compares the original 2017 Transformer design with modern post-2023 optimizations, and briefly discusses emerging alternatives such as Mamba state-space models and hybrid Transformer-SSM architectures.

After reading this breakdown, AI beginners and software developers should be able to understand model whitepapers, technical documentation and research papers more confidently.

1. Preprocessing Stage: Tokenization and Embedding

Large language models cannot directly process raw human-readable text. Before entering the neural network, every prompt must first be converted into a sequence of integer IDs. This step is called tokenization.

Instead of splitting text strictly by complete English words or individual Chinese characters, modern tokenizers usually divide text into subword units. This design balances vocabulary size and semantic coverage. Mainstream LLM vocabularies typically range from tens of thousands to hundreds of thousands of tokens. Smaller 7B open-source models often use 32K–64K vocabularies, while frontier proprietary models such as GPT-4 and Claude Opus may use 100K+ vocabularies optimized for multilingual scenarios.

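Common tokenization algorithms include Byte-Pair Encoding (BPE), WordPiece and the Unigram language model. SentencePiece is a widely used library that implements BPE and Unigram rather than a separate algorithm. Each model family usually trains its own tokenizer. The tokenizer directly affects inference cost, multilingual performance and compatibility across different text formats.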
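To make the idea of subword splitting concrete, here is a minimal sketch of greedy longest-match segmentation. The vocabulary, the IDs and the tokenize function are all hypothetical, and the matching rule is a simplification: production tokenizers apply learned BPE merges or Unigram scoring rather than this lookup.

```python
# Hypothetical toy vocabulary: subword string -> integer ID.
VOCAB = {"un": 0, "break": 1, "able": 2, "token": 3, "ization": 4,
         "a": 5, "b": 6, "l": 7, "e": 8, "<unk>": 9}


def tokenize(text):
    """Split text into subword IDs by greedy longest-match lookup."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i = j
                break
        else:
            # No vocabulary entry starts here: emit the unknown token.
            ids.append(VOCAB["<unk>"])
            i += 1
    return ids


print(tokenize("unbreakable"))   # [0, 1, 2]
print(tokenize("tokenization"))  # [3, 4]
```

Neither word needs its own vocabulary entry, which is how a subword vocabulary covers rare words while staying a manageable size.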

A raw token ID, such as 1024, has no semantic meaning by itself. It is only an index pointing to a row inside the model’s embedding matrix. The embedding layer maps each token ID into a dense high-dimensional vector. These vectors store semantic information learned during pre-training.

For mainstream open-source 7B models, hidden dimensions often reach around 4,096. Much larger models use wider hidden states; GPT-3 175B, for example, uses 12,288. Semantically similar tokens usually produce nearby vector representations. For example, vectors for “king” and “queen” may be close in embedding space.
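The lookup and the notion of "nearby" vectors can be illustrated with a small sketch. The 4-dimensional matrix below is hand-written for illustration only; real embeddings are learned during pre-training and have thousands of dimensions. Cosine similarity is one common way to measure closeness, used here as an assumption.

```python
import math

# Hypothetical embedding matrix: row index = token ID.
EMBEDDINGS = [
    [0.9, 0.8, 0.1, 0.0],  # ID 0: "king"
    [0.9, 0.7, 0.2, 0.1],  # ID 1: "queen"
    [0.0, 0.1, 0.9, 0.8],  # ID 2: "banana"
]


def embed(token_ids):
    """Map each token ID to its row in the embedding matrix."""
    return [EMBEDDINGS[t] for t in token_ids]


def cosine(u, v):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


king, queen, banana = embed([0, 1, 2])
print(cosine(king, queen) > cosine(king, banana))  # True
```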

However, embeddings alone do not contain sequence order. The model still needs positional encoding to understand where each token appears in the sentence.

2. Positional Encoding: Adding Order to Token Representations

Self-attention on its own has no notion of word order: without position information, it treats the input as an unordered set of tokens. Yet word order is essential for language meaning. For example, “the dog chased the cat” and “the cat chased the dog” contain exactly the same tokens but have very different meanings. Positional encoding solves this problem.

The original 2017 Transformer used fixed sinusoidal absolute positional encoding. It added precomputed position values directly to token embeddings. This method is simple, but it struggles with long-context extrapolation. If input length exceeds the maximum context length seen during training, performance may degrade.
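As a rough illustration of the sinusoidal scheme, the sketch below computes the encoding for one position and adds it to a token embedding. It follows the sine/cosine formula from the 2017 paper; the function name, the example embedding and the assumption of an even dimension are choices made here.

```python
import math


def sinusoidal_encoding(position, d_model):
    """Fixed positional vector: sin on even indices, cos on odd indices."""
    enc = []
    for i in range(0, d_model, 2):
        # Each pair of dimensions uses a different wavelength.
        angle = position / (10000 ** (i / d_model))
        enc.append(math.sin(angle))
        enc.append(math.cos(angle))
    return enc


# Hypothetical 4-dimensional token embedding at position 3.
token_embedding = [0.5, -0.2, 0.1, 0.7]
x = [e + p for e, p in zip(token_embedding, sinusoidal_encoding(3, 4))]

print(sinusoidal_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
```

Because the values depend only on the position index, nothing is learned, and positions beyond the training length produce vectors the model has never seen combined with real text.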

Modern LLMs often use Rotary Positional Embedding, or RoPE, proposed by Su et al. in 2021. RoPE does not directly add position values to embeddings. Instead, it injects relative positional information by rotating Query and Key vectors during attention calculation.
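The rotation idea can be sketched in a few lines. This version rotates adjacent pairs of dimensions; real implementations differ in how they pair dimensions and in how they batch the computation, so treat the function name, the pairing and the default base of 10000 as assumptions. The final check shows the key property: the Query-Key dot product depends only on the distance between positions.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each adjacent pair of dimensions by a position-dependent angle."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q = [1.0, 0.5, -0.3, 0.8]   # hypothetical Query vector
k = [0.2, -0.7, 0.4, 0.1]   # hypothetical Key vector

# Same relative offset (2), very different absolute positions.
near = dot(rope(q, 5), rope(k, 3))
far = dot(rope(q, 105), rope(k, 103))
print(abs(near - far) < 1e-9)  # True
```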

RoPE has become the default choice for many open-source LLM families, including LLaMA, Mistral, Gemma and Qwen. With techniques such as YaRN and positional interpolation, RoPE-based models can extend context windows from 8K tokens to 128K tokens or even longer.

Still, long-context modeling remains difficult. Even current models can lose information in the middle of very long passages. This is one of the major research directions for future LLM architectures.

3. Self-Attention, Multi-Head Attention and GQA

Self-attention is the core mechanism of the Transformer. It allows each token to examine other tokens in the sequence and decide which ones are most relevant.

Inside each Transformer block, every token vector is projected into three vectors using three learned weight matrices:

  1. Query: what the current token is looking for
  2. Key: what each token can be matched against
  3. Value: the actual information to be aggregated

The model calculates relevance scores by taking the dot product of each Query with every Key. These scores are divided by the square root of the Key dimension, which keeps their magnitude stable so that softmax does not saturate, and are then passed through softmax to become weights that sum to 1. The final output is a weighted sum of Value vectors.

Decoder-only LLMs use causal masking. This prevents the model from seeing future tokens during training and inference. It ensures the model predicts the next token only from previous context.
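The score, scale, softmax and weighted-sum steps, together with the causal mask, can be sketched as follows. This is a single-head, pure-Python illustration that takes already-projected Query, Key and Value vectors as input; the function names and the toy numbers are assumptions, and real implementations use batched matrix operations.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """Scaled dot-product attention where token i sees only tokens 0..i."""
    d_k = len(K[0])
    outputs = []
    for i, q in enumerate(Q):
        # Causal mask: only score positions up to and including i.
        scores = [
            sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Weighted sum of the visible Value vectors.
        out = [
            sum(w * V[j][c] for j, w in enumerate(weights))
            for c in range(len(V[0]))
        ]
        outputs.append(out)
    return outputs


# Hypothetical projections for a 3-token sequence.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = causal_attention(Q, K, V)
print(out[0])  # [1.0, 2.0] because the first token can only attend to itself
```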

Self-attention is powerful, but expensive. Its computational complexity grows as O(n²) with sequence length. This means long-context inference becomes increasingly costly as input length grows.

Multi-Head Attention

A single attention head can only capture limited relationships. Multi-Head Attention solves this by running several attention heads in parallel. Each head learns different patterns. Some may focus on grammar, others on coreference, nearby tokens or repeated structures.

After all heads compute their outputs, the results are concatenated and passed through a final projection layer.

Grouped-Query Attention

To reduce memory usage during inference, most post-2023 decoder-only LLMs use Grouped-Query Attention, or GQA. In GQA, multiple Query heads share a smaller number of Key and Value heads. Under a typical 4:1 grouping ratio, KV cache memory can be reduced by about 75% with limited performance loss.
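The 75% figure follows from simple arithmetic on the KV cache size. The sketch below uses a hypothetical configuration (32 layers, 32 Query heads, head dimension 128, 8,192 tokens, 2 bytes per value) and hypothetical function names; it is not the configuration of any specific model.

```python
def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Index of the shared Key/Value head that a Query head reads from."""
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size


def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Cache size for one sequence: a Key and a Value tensor per layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=8192)
gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=8192)

print(mha / 2**30, gqa / 2**30)   # 4.0 1.0  (GiB)
print(1 - gqa / mha)              # 0.75
print(kv_head_for(5, 32, 8))      # 1  (Query heads 4..7 share KV head 1)
```

The saving scales with the number of concurrent sequences, which is why the optimization matters most for long-context and high-concurrency serving.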

This optimization has become common in modern open-source models, including LLaMA 2 70B, Mistral 7B and the LLaMA 3 family, especially for long-context and high-concurrency deployment.

4. Feed-Forward Network: Per-Token Nonlinear Transformation

After attention finishes cross-token information exchange, each Transformer block applies a Feed-Forward Network, or FFN. Unlike attention, FFN processes each token independently. It does not exchange information across positions.

A standard dense FFN usually follows three steps:

  1. Expand the hidden dimension to 3–4 times its original size
  2. Apply a nonlinear activation function such as GELU, or a gated variant such as SwiGLU
  3. Compress the dimension back to the original hidden size

The nonlinear activation is critical. Without it, the network would become a stack of linear transformations with limited expressive power.
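The expand, activate and compress steps can be sketched for a single token vector. This version uses plain GELU with a 4x expansion and no bias terms; the random weights, the dimensions and the function names are assumptions for illustration, and a SwiGLU variant would add a third gating matrix.

```python
import math
import random


def gelu(x):
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def ffn(x, W_up, W_down):
    hidden = [gelu(h) for h in matvec(W_up, x)]  # expand, then activate
    return matvec(W_down, hidden)                # compress back


rng = random.Random(0)
d_model, d_hidden = 4, 16  # 4x expansion
W_up = [[rng.uniform(-0.5, 0.5) for _ in range(d_model)] for _ in range(d_hidden)]
W_down = [[rng.uniform(-0.5, 0.5) for _ in range(d_hidden)] for _ in range(d_model)]

token = [0.2, -0.1, 0.4, 0.3]
y = ffn(token, W_up, W_down)
print(len(y))  # 4
```

The function takes one token vector at a time, which mirrors the point that the FFN never mixes information across positions.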

In dense Transformer LLMs, FFN layers often contain more than 60% of total model parameters. Interpretability studies suggest that certain FFN neurons may store factual knowledge or conceptual features. This is why partial FFN tuning can sometimes modify specific model behaviors without full retraining.

Modern frontier models increasingly adopt Mixture-of-Experts architectures to replace dense FFN blocks. In MoE models, each token is routed to only a small number of expert subnetworks. This allows the total parameter count to grow dramatically while keeping per-token inference cost relatively controlled.
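A minimal sketch of top-k expert routing is shown below. The experts are stand-in functions that simply scale their input, and the router logits are passed in directly; in a real MoE layer the experts are full FFNs and the logits come from a learned linear layer applied to the token. All names here are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their weights."""
    order = sorted(range(len(router_logits)),
                   key=lambda e: router_logits[e], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[e] for e in top])
    return list(zip(top, weights))


def moe_layer(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs."""
    out = [0.0] * len(x)
    for idx, w in route_top_k(router_logits, k):
        y = experts[idx](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out


# Four stand-in experts: expert e multiplies its input by (e + 1).
experts = [lambda x, s=e + 1: [s * v for v in x] for e in range(4)]

result = moe_layer([1.0, 2.0], experts, router_logits=[0.1, 2.0, -1.0, 2.0])
print(result)  # [3.0, 6.0]: experts 1 and 3 each contribute with weight 0.5
```

Only two of the four experts run for this token, which is the source of the gap between total and per-token active parameters.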

5. Residual Connections and Layer Normalization

Very deep neural networks are hard to train because gradients may vanish or explode. Residual connections help solve this problem.

In Transformer blocks, the output of an attention or FFN module is added back to the original input. This forms a residual stream. It allows information and gradients to flow more smoothly through many stacked layers.

Residual streams are also important for interpretability. Researchers often trace how information moves through the model by analyzing changes in the residual stream across layers.

Layer Normalization Variants

Layer normalization stabilizes feature distributions and improves training convergence. The original Transformer used post-norm, applying normalization after attention and FFN modules. Modern LLMs usually use pre-norm, applying normalization before each submodule. Pre-norm improves stability when training models with dozens or hundreds of layers.

Modern open-source models such as LLaMA, Mistral and Qwen commonly use RMSNorm instead of standard LayerNorm. RMSNorm removes mean subtraction and normalizes features based on root-mean-square statistics. It reduces computational overhead while preserving training stability.
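The following sketch shows RMSNorm and how it fits into a pre-norm residual block. The sublayers are passed in as plain functions, and the gain vectors, epsilon value and function names are assumptions for illustration.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Scale by the root-mean-square; no mean subtraction, unlike LayerNorm."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]


def pre_norm_block(x, attn, ffn, gain_attn, gain_ffn):
    """Normalize before each sublayer, then add its output to the residual stream."""
    x = [a + b for a, b in zip(x, attn(rms_norm(x, gain_attn)))]
    x = [a + b for a, b in zip(x, ffn(rms_norm(x, gain_ffn)))]
    return x


normed = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
print(normed)  # roughly [0.85, 1.13]; the RMS of the output is about 1


def zero_sublayer(v):
    """Stand-in sublayer that contributes nothing."""
    return [0.0] * len(v)


# With sublayers that output zeros, the residual path passes the input through unchanged.
same = pre_norm_block([3.0, 4.0], zero_sublayer, zero_sublayer, [1.0, 1.0], [1.0, 1.0])
print(same)  # [3.0, 4.0]
```

The second example illustrates why residual connections help deep stacks: the input always has an unobstructed path to the output, regardless of what the sublayer computes.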

6. Autoregressive Next-Token Prediction

After the input passes through all Transformer layers, the model takes the final hidden vector of the last token and maps it to the vocabulary space through a final linear layer. This produces raw scores, called logits, for every possible next token.

Softmax converts logits into a probability distribution. The model then selects the next token through a decoding strategy.

Greedy decoding always selects the highest-probability token. In practice, developers often adjust sampling parameters such as:

  • Temperature: controls randomness
  • Top-P: samples from the smallest token set whose cumulative probability reaches a threshold
  • Top-K: samples only from the top K candidate tokens

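Low temperature concentrates probability on the most likely tokens, producing more deterministic and repeatable text, though not necessarily more factual text. Higher temperature flattens the distribution and produces more diverse and creative output.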
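The three parameters can be combined in one small sampling function. This is a sketch under stated assumptions: the function name and argument names are hypothetical, filters are applied in the order temperature, Top-K, Top-P, and temperature must be greater than zero (greedy decoding is the limit as it approaches zero). Inference libraries differ in the exact ordering and edge-case handling.

```python
import math
import random


def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample one token ID from logits after temperature, Top-K and Top-P filtering."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Candidate token IDs, most probable first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept

    # random.choices renormalizes the remaining weights.
    weights = [probs[i] for i in order]
    return rng.choices(order, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for a 3-token vocabulary
rng = random.Random(0)
print(sample_next(logits, top_k=1, rng=rng))    # 0, equivalent to greedy decoding
print(sample_next(logits, top_p=0.5, rng=rng))  # 0, the top token alone covers 0.5
```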

Once a token is generated, it is appended to the input sequence. The model then repeats the process until it generates an end-of-sequence token or reaches the output length limit. This loop is called autoregressive generation.
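The loop itself is short. In the sketch below, next_token_fn stands in for a full forward pass plus decoding strategy, and the toy model is purely hypothetical; real systems also reuse a KV cache so that earlier tokens are not recomputed on each step.

```python
def generate(prompt_ids, next_token_fn, eos_id, max_new_tokens):
    """Append one predicted token at a time until EOS or the length limit."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = next_token_fn(ids)  # the model sees everything generated so far
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids


def toy_model(ids):
    """Hypothetical stand-in model: always predicts the last ID plus one."""
    return ids[-1] + 1


print(generate([1, 2], toy_model, eos_id=5, max_new_tokens=10))  # [1, 2, 3, 4, 5]
print(generate([1, 2], toy_model, eos_id=99, max_new_tokens=2))  # [1, 2, 3, 4]
```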

Next-token prediction is the main objective during pre-training. The model learns to predict the next token from massive unlabeled text corpora. Later stages such as supervised fine-tuning and RLHF adjust behavior and alignment, but they do not change the core generation mechanism.

Speculative decoding has recently become an important inference optimization. A small draft model first proposes a short run of candidate tokens. The larger target model then checks all of them in a single forward pass and keeps the longest prefix it accepts, so several tokens can be committed for the cost of one large-model step. This can significantly improve token generation speed in production systems. An API gateway such as Treerouter can serve as one routing layer for organizing multi-model inference traffic, though actual serving strategy still depends on the application architecture.
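The draft-and-verify idea behind speculative decoding can be sketched with greedy decoding and toy models. Two simplifications are assumed here: acceptance is an exact-match check rather than the probabilistic acceptance rule used with sampling, and the target model is called once per position for readability, whereas a real system scores all drafted positions in one parallel pass. All function names are hypothetical.

```python
def speculative_step(ids, draft_next, target_next, draft_len=4):
    """One round: draft several tokens, keep the prefix the target agrees with."""
    # 1. The cheap draft model proposes draft_len tokens.
    proposal, ctx = [], list(ids)
    for _ in range(draft_len):
        token = draft_next(ctx)
        proposal.append(token)
        ctx.append(token)

    # 2. The target model verifies the proposal position by position.
    accepted, ctx = [], list(ids)
    for token in proposal:
        expected = target_next(ctx)
        if token != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            break
        accepted.append(token)
        ctx.append(token)
    else:
        # Every draft token matched: verification also yields one extra token.
        accepted.append(target_next(ctx))
    return list(ids) + accepted


def target_model(ids):
    """Hypothetical target model: counts upward."""
    return ids[-1] + 1


def draft_model(ids):
    """Hypothetical draft model: agrees with the target except after token 3."""
    return 0 if ids[-1] == 3 else ids[-1] + 1


print(speculative_step([1], draft_model, target_model))   # [1, 2, 3, 4]
print(speculative_step([1], target_model, target_model))  # [1, 2, 3, 4, 5, 6]
```

In both cases the result matches what the target model would have produced on its own, which is the property that makes the technique a pure speed optimization under greedy decoding.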

7. Shared Architecture and Key Differences Across Mainstream LLMs

Most mainstream LLMs share the same basic decoder-only Transformer framework. This includes tokenization, embedding, positional encoding, attention, FFN or MoE layers, residual connections, normalization and autoregressive prediction.

The major differences between model families usually come from four areas:

  1. Training corpus: data source, language distribution, domain coverage and cutoff time determine knowledge scope and factual accuracy.

  2. Model scale: parameter count, hidden dimension, layer count and attention head configuration affect reasoning ability and generalization.

  3. Architectural hyperparameters: choices such as MHA vs GQA, RoPE variants, dense FFN vs MoE and LayerNorm vs RMSNorm shape efficiency and capability.

  4. Post-training strategy: instruction tuning, RLHF, safety tuning and proprietary alignment pipelines determine user-facing behavior.

This convergence around decoder-only Transformers has created a common industrial baseline. It allows developers to compare new models more easily, even when vendors use different branding or product positioning.

8. Future Architectural Diversification: Mamba and Hybrid Models

Although Transformers still dominate today’s LLM landscape, their quadratic attention cost motivates research into alternative architectures.

One of the most important alternatives is Mamba, based on State Space Models. Mamba avoids explicit self-attention and processes sequences with linear O(n) time complexity. This makes it attractive for ultra-long context scenarios.
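The linear-time behavior comes from replacing all-pairs attention with a recurrent state update. The sketch below is a deliberately simplified scalar state-space recurrence with fixed parameters a, b and c, all chosen here for illustration. Mamba itself makes these parameters depend on the input and uses a vector-valued state with a hardware-aware scan, none of which is shown.

```python
def ssm_scan(xs, a=0.5, b=1.0, c=1.0):
    """Linear recurrence: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t."""
    h = 0.0
    ys = []
    for x in xs:          # one constant-cost update per token, so O(n) overall
        h = a * h + b * x
        ys.append(c * h)
    return ys


# An impulse at the first step decays through the state over time.
print(ssm_scan([1.0, 0.0, 0.0]))  # [1.0, 0.5, 0.25]
```

The state h has a fixed size no matter how long the sequence is, in contrast to a KV cache that grows with every token.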

Pure Mamba models are not the only direction. Hybrid architectures that combine Transformer layers with Mamba-style SSM layers are also gaining attention. Examples include AI21’s Jamba and IBM’s Bamba. These models aim to combine the precise in-context recall of attention with the long-range efficiency of state-space modeling.

In the near term, Transformers are unlikely to disappear. Instead, the industry will likely move toward a mixed landscape. Sparse attention, KV cache compression, MoE routing, SSM layers and hybrid architectures will evolve together.

Conclusion

Modern LLMs may appear complex, but their core workflow is built from a clear sequence of modules. Tokenization converts text into IDs. Embedding maps IDs into semantic vectors. Positional encoding adds sequence order. Attention enables cross-token information exchange. FFN or MoE layers perform per-token feature transformation. Residual connections and normalization stabilize deep training. Autoregressive decoding produces the final output token by token.

Understanding these mechanisms removes much of the black-box mystery around LLMs. It also helps developers read model papers, compare architectures, design fine-tuning pipelines and optimize deployment more effectively.

Transformer-based models remain the mainstream foundation of current LLMs. At the same time, state-space models such as Mamba and hybrid Transformer-SSM architectures are opening new directions for more efficient long-context modeling. For AI engineers, mastering these fundamentals is the first step toward evaluating and applying new model technologies in a fast-changing generative AI industry.