🔮 Transformers & Attention Mechanism

The architecture that revolutionized AI

The Transformer Revolution

Introduced in the 2017 paper "Attention Is All You Need," Transformers have become the foundation of modern AI. They power GPT-4, BERT, Claude, and virtually every state-of-the-art language model.

🎯 Why Transformers Changed Everything

  • Process entire sequences in parallel (not sequential like RNNs)
  • Capture long-range dependencies effectively
  • Scale to billions of parameters
  • Enable transfer learning across tasks

🧠 The Core Idea: Attention

What is Attention?

Attention allows the model to focus on different parts of the input when processing each element, much like how you pay more attention to certain words when reading a sentence.

Example: "The animal didn't cross the street because it was too tired."

→ "it" refers to "animal" (not "street") - attention helps the model figure this out!

The Attention Formula

# Self-Attention in one equation
Attention(Q, K, V) = softmax(Q @ K.T / √d_k) @ V

Where:
- Q (Query): "What am I looking for?"
- K (Key): "What do I contain?"
- V (Value): "What do I actually output?"
- d_k: dimension of keys (for scaling)

# Step by step:
1. Compute attention scores: Q @ K.T
2. Scale down: divide by √d_k
3. Convert to probabilities: softmax
4. Weight the values: multiply by V
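
Here is the same equation as a small, self-contained function (the name scaled_dot_product_attention and the toy shapes are just for illustration):

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q @ K.T / √d_k) @ V"""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / (d_k ** 0.5)  # Steps 1-2: scores, then scale
    weights = F.softmax(scores, dim=-1)              # Step 3: probabilities over keys
    return weights @ V, weights                      # Step 4: weighted sum of values

# Toy usage: 6 tokens with 64-dimensional queries/keys/values
Q = K = V = torch.randn(6, 64)
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)  # torch.Size([6, 64]) torch.Size([6, 6])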

🏗️ Transformer Architecture

The original Transformer consists of two main parts:

Encoder (Understanding)

  • Reads and processes input
  • Creates rich representations
  • Used in BERT, T5
  • Example: Understanding a sentence

Decoder (Generation)

  • Generates output sequentially
  • Uses encoder representations
  • Used in GPT, ChatGPT
  • Example: Writing a response

Full Architecture Diagram

Input: "Hello World"
   ↓
[Token Embedding] 
   ↓
[+ Positional Encoding]  ← Adds position information
   ↓
┌─────────────────────────────┐
│  ENCODER LAYER 1            │
│  ┌──────────────────────┐   │
│  │  Multi-Head          │   │  ← Attention to all words
│  │  Self-Attention      │   │
│  └──────────────────────┘   │
│            ↓                │
│  ┌──────────────────────┐   │
│  │  Feed Forward        │   │  ← Process each position
│  │  Neural Network      │   │
│  └──────────────────────┘   │
└─────────────────────────────┘
   ↓
┌─────────────────────────────┐
│  ENCODER LAYER 2-N          │  ← Stacked N times (6 in the original paper; ~96 in the largest models)
└─────────────────────────────┘
   ↓
Rich Contextual Representations

🔍 Deep Dive: Self-Attention

Step-by-Step Example

Let's process: "The cat sat on the mat"

import torch
import torch.nn as nn

# 1. Input Embeddings
words = ["The", "cat", "sat", "on", "the", "mat"]
embedding_dim = 512

# Each word becomes a 512-dimensional vector
embeddings = torch.randn(len(words), embedding_dim)  # [seq_len, d_model]

# 2. Create Q, K, V matrices
d_k = 64  # Dimension of queries/keys

W_q = nn.Linear(embedding_dim, d_k)  # Query projection
W_k = nn.Linear(embedding_dim, d_k)  # Key projection
W_v = nn.Linear(embedding_dim, d_k)  # Value projection

Q = W_q(embeddings)  # [6, 64]
K = W_k(embeddings)  # [6, 64]
V = W_v(embeddings)  # [6, 64]

# 3. Compute attention scores
scores = Q @ K.transpose(-2, -1)  # [6, 6]
# scores[i][j] = how much word i should attend to word j

scores = scores / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))  # Scale by √d_k

# 4. Apply softmax (convert to probabilities)
attention_weights = torch.softmax(scores, dim=-1)  # [6, 6]

# Illustrative attention weights for "cat" (what a trained model might learn):
# cat → The: 0.05
# cat → cat: 0.40  ← pays most attention to itself
# cat → sat: 0.25  ← "cat" is related to "sat"
# cat → on:  0.10
# cat → the: 0.05
# cat → mat: 0.15  ← "cat" on the "mat"

# 5. Weighted sum of values
output = attention_weights @ V  # [6, 64]

print(f"Input shape: {embeddings.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attention_weights.shape}")

Visualizing Attention

import matplotlib.pyplot as plt
import seaborn as sns

# Visualize attention matrix
plt.figure(figsize=(8, 6))
sns.heatmap(attention_weights.detach().numpy(), 
            xticklabels=words,
            yticklabels=words,
            cmap='YlOrRd',
            annot=True,
            fmt='.2f')
plt.title('Self-Attention Weights')
plt.xlabel('Keys')
plt.ylabel('Queries')
plt.show()

# Darker cells = stronger attention

🎯 Multi-Head Attention

Why Multiple Heads?

Instead of one attention mechanism, use multiple "heads" that learn different relationships:

  • Head 1: Might focus on syntactic relationships (subject-verb)
  • Head 2: Might capture semantic meaning
  • Head 3: Might track co-references (pronouns)
  • Heads 4-8: Other patterns...

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # 512 / 8 = 64
        
        # Linear layers for Q, K, V
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
    
    def split_heads(self, x, batch_size):
        """Split into multiple heads"""
        # x: [batch, seq_len, d_model]
        x = x.view(batch_size, -1, self.num_heads, self.d_k)
        # [batch, seq_len, num_heads, d_k]
        return x.transpose(1, 2)  # [batch, num_heads, seq_len, d_k]
    
    def forward(self, x, mask=None):
        batch_size = x.size(0)
        
        # 1. Linear projections
        Q = self.W_q(x)  # [batch, seq_len, d_model]
        K = self.W_k(x)
        V = self.W_v(x)
        
        # 2. Split into multiple heads
        Q = self.split_heads(Q, batch_size)
        K = self.split_heads(K, batch_size)
        V = self.split_heads(V, batch_size)
        
        # 3. Scaled dot-product attention (for each head)
        scores = Q @ K.transpose(-2, -1) / torch.sqrt(torch.tensor(self.d_k, dtype=torch.float32))
        
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        
        attention_weights = torch.softmax(scores, dim=-1)
        attention_output = attention_weights @ V
        
        # 4. Concatenate heads
        attention_output = attention_output.transpose(1, 2).contiguous()
        attention_output = attention_output.view(batch_size, -1, self.num_heads * self.d_k)
        
        # 5. Final linear layer
        output = self.W_o(attention_output)
        
        return output, attention_weights

# Usage
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(32, 50, 512)  # [batch, seq_len, d_model]
output, weights = mha(x)
print(f"Output shape: {output.shape}")  # [32, 50, 512]

📍 Positional Encoding

The Problem

Transformers process all tokens in parallel → they don't know word order!

"Cat chases dog" vs "Dog chases cat" would look the same without position info.

The Solution: Add Position Information

import numpy as np

def positional_encoding(seq_len, d_model):
    """
    Create positional encodings using sine/cosine functions
    """
    position = np.arange(seq_len)[:, np.newaxis]  # [seq_len, 1]
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(position * div_term)  # Even dimensions
    pe[:, 1::2] = np.cos(position * div_term)  # Odd dimensions
    
    return torch.FloatTensor(pe)

# Add to embeddings
seq_len, d_model = 50, 512
embeddings = torch.randn(1, seq_len, d_model)
pos_encoding = positional_encoding(seq_len, d_model)

# Combine
input_with_position = embeddings + pos_encoding

print(f"Positional encoding shape: {pos_encoding.shape}")

Visualize Positional Encodings

pe = positional_encoding(100, 512)

plt.figure(figsize=(12, 6))
plt.pcolormesh(pe.numpy(), cmap='RdBu')
plt.xlabel('Embedding Dimension')
plt.ylabel('Position in Sequence')
plt.colorbar()
plt.title('Positional Encoding Pattern')
plt.show()

# Creates unique "fingerprint" for each position

🔄 Complete Transformer Block

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        
        # Multi-head attention
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        
        # Feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        # 1. Multi-head attention + residual connection
        attention_output, _ = self.attention(x, mask)
        x = x + self.dropout1(attention_output)  # Residual
        x = self.norm1(x)  # Layer norm
        
        # 2. Feed-forward + residual connection
        ffn_output = self.ffn(x)
        x = x + self.dropout2(ffn_output)  # Residual
        x = self.norm2(x)  # Layer norm
        
        return x

# Full Transformer
class Transformer(nn.Module):
    def __init__(self, vocab_size, d_model=512, num_layers=6, num_heads=8):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, num_heads) 
            for _ in range(num_layers)
        ])
        self.output_layer = nn.Linear(d_model, vocab_size)
    
    def forward(self, x):
        # Embed tokens
        x = self.embedding(x)
        
        # Add positional encoding
        x = x + positional_encoding(x.size(1), x.size(2))
        
        # Process through transformer blocks
        for block in self.blocks:
            x = block(x)
        
        # Output projection
        logits = self.output_layer(x)
        return logits

# Create model
model = Transformer(vocab_size=50000, d_model=512, num_layers=6)
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")

🎭 Encoder vs Decoder vs Encoder-Decoder

Encoder-Only

Examples: BERT, RoBERTa

Use Cases:

  • Text classification
  • Named entity recognition
  • Question answering
  • Sentiment analysis

Key Feature: Bidirectional context

Decoder-Only

Examples: GPT-3, GPT-4, LLaMA

Use Cases:

  • Text generation
  • Story writing
  • Code completion
  • Chat

Key Feature: Auto-regressive generation
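
Decoder-only models get this auto-regressive behaviour from a causal mask: each position may attend only to itself and earlier positions. A minimal sketch using the MultiHeadAttention class defined above (how the mask is built here is an assumption for illustration; it broadcasts over batch and heads):

# Lower-triangular causal mask: entry (i, j) = 1 if position i may attend to position j
seq_len = 10
causal_mask = torch.tril(torch.ones(seq_len, seq_len))  # [seq_len, seq_len]

mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, seq_len, 512)            # [batch, seq_len, d_model]
output, weights = mha(x, mask=causal_mask)  # masked scores get ~0 weight after softmax

print(weights[0, 0, 0])  # position 0 attends only to itself
print(weights[0, 0, 5])  # position 5 attends only to positions 0-5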

Encoder-Decoder

Examples: T5, BART

Use Cases:

  • Translation
  • Summarization
  • Paraphrasing
  • Question generation

Key Feature: Input → Output transformation

⚡ Why Transformers Are So Powerful

1. Parallelization

Unlike RNNs, all tokens are processed simultaneously

  • Much faster training
  • Can use modern GPU hardware efficiently
  • Scales to longer sequences

2. Long-Range Dependencies

Attention directly connects any two tokens

  • Avoids the vanishing gradients that plague RNNs over long distances
  • Understands context across an entire document
  • Handles long sequences better than RNNs

3. Interpretability

Attention weights show what the model focuses on

  • Visualize decision-making
  • Debug model behavior
  • Understand predictions

4. Transfer Learning

Pre-train once, fine-tune for many tasks

  • Save computation
  • Better performance
  • Work with less data
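
A common pattern, sketched here with the Hugging Face transformers library (the checkpoint name and label count are illustrative), is to load a pretrained encoder and fine-tune it with a fresh classification head:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Pretrained BERT encoder plus a randomly initialized 2-class head
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("This movie was fantastic!", return_tensors="pt")
logits = model(**inputs).logits  # [1, 2]: one score per class

# Fine-tuning from here typically needs far less labeled data than training from scratch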

📊 Transformer Variants

GPT (Generative Pre-trained Transformer)

Decoder-only, autoregressive

Size: 175B parameters (GPT-3)

BERT (Bidirectional Encoder)

Encoder-only, masked language modeling

Size: 110M - 340M parameters

T5 (Text-to-Text Transfer)

Encoder-decoder, unified framework

Size: 220M - 11B parameters

CLIP (Contrastive Language-Image)

Dual encoders for text + images

Use: Image-text understanding

🎯 Key Takeaways