🔮 Transformers & Attention Mechanism

The architecture that revolutionized AI

The Transformer Revolution

Introduced in the 2017 paper "Attention Is All You Need," Transformers have become the foundation of modern AI. They power GPT-4, BERT, Claude, and virtually every state-of-the-art language model.

🎯 Why Transformers Changed Everything

  • Process entire sequences in parallel (not sequential like RNNs)
  • Capture long-range dependencies effectively
  • Scale to billions of parameters
  • Enable transfer learning across tasks

🧠 The Core Idea: Attention

What is Attention?

Attention allows the model to focus on different parts of the input when processing each element, much like how you pay more attention to certain words when reading a sentence.

Example: "The animal didn't cross the street because it was too tired."

→ "it" refers to "animal" (not "street") - attention helps the model figure this out!

The Attention Formula

# Self-Attention in one equation
Attention(Q, K, V) = softmax(Q @ K.T / √d_k) @ V

Where:
- Q (Query): "What am I looking for?"
- K (Key): "What do I contain?"
- V (Value): "What do I actually output?"
- d_k: dimension of keys (for scaling)

# Step by step:
1. Compute attention scores: Q @ K.T
2. Scale down: divide by √d_k
3. Convert to probabilities: softmax
4. Weight the values: multiply by V
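
Here is the same equation as a small, self-contained function (the name scaled_dot_product_attention and the toy shapes are just for illustration):

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q @ K.T / √d_k) @ V"""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / (d_k ** 0.5)  # Steps 1-2: scores, then scale
    weights = F.softmax(scores, dim=-1)              # Step 3: probabilities over keys
    return weights @ V, weights                      # Step 4: weighted sum of values

# Toy usage: 6 tokens with 64-dimensional queries/keys/values
Q = K = V = torch.randn(6, 64)
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)  # torch.Size([6, 64]) torch.Size([6, 6])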

🏗️ Transformer Architecture

The original Transformer consists of two main parts:

Encoder (Understanding)

  • Reads and processes input
  • Creates rich representations
  • Used in BERT, T5
  • Example: Understanding a sentence

Decoder (Generation)

  • Generates output sequentially
  • Uses encoder representations
  • Used in GPT, ChatGPT
  • Example: Writing a response

Full Architecture Diagram

Input: "Hello World"
   ↓
[Token Embedding] 
   ↓
[+ Positional Encoding]  ← Adds position information
   ↓
┌─────────────────────────────┐
│  ENCODER LAYER 1            │
│  ┌──────────────────────┐   │
│  │  Multi-Head          │   │  ← Attention to all words
│  │  Self-Attention      │   │
│  └──────────────────────┘   │
│            ↓                │
│  ┌──────────────────────┐   │
│  │  Feed Forward        │   │  ← Process each position
│  │  Neural Network      │   │
│  └──────────────────────┘   │
└─────────────────────────────┘
   ↓
┌─────────────────────────────┐
│  ENCODER LAYER 2-N          │  ← Stacked N times (6 in the original paper; ~96 in the largest models)
└─────────────────────────────┘
   ↓
Rich Contextual Representations

🔍 Deep Dive: Self-Attention

Step-by-Step Example

Let's process: "The cat sat on the mat"

import torch
import torch.nn as nn

# 1. Input Embeddings
words = ["The", "cat", "sat", "on", "the", "mat"]
embedding_dim = 512

# Each word becomes a 512-dimensional vector
embeddings = torch.randn(len(words), embedding_dim)  # [seq_len, d_model]

# 2. Create Q, K, V matrices
d_k = 64  # Dimension of queries/keys

W_q = nn.Linear(embedding_dim, d_k)  # Query projection
W_k = nn.Linear(embedding_dim, d_k)  # Key projection
W_v = nn.Linear(embedding_dim, d_k)  # Value projection

Q = W_q(embeddings)  # [6, 64]
K = W_k(embeddings)  # [6, 64]
V = W_v(embeddings)  # [6, 64]

# 3. Compute attention scores
scores = Q @ K.transpose(-2, -1)  # [6, 6]
# scores[i][j] = how much word i should attend to word j

scores = scores / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))  # Scale by √d_k

# 4. Apply softmax (convert to probabilities)
attention_weights = torch.softmax(scores, dim=-1)  # [6, 6]

# Illustrative attention weights for "cat" (what a trained model might learn):
# cat → The: 0.05
# cat → cat: 0.40  ← pays most attention to itself
# cat → sat: 0.25  ← "cat" is related to "sat"
# cat → on:  0.10
# cat → the: 0.05
# cat → mat: 0.15  ← "cat" on the "mat"

# 5. Weighted sum of values
output = attention_weights @ V  # [6, 64]

print(f"Input shape: {embeddings.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attention_weights.shape}")

Visualizing Attention

import matplotlib.pyplot as plt
import seaborn as sns

# Visualize attention matrix
plt.figure(figsize=(8, 6))
sns.heatmap(attention_weights.detach().numpy(), 
            xticklabels=words,
            yticklabels=words,
            cmap='YlOrRd',
            annot=True,
            fmt='.2f')
plt.title('Self-Attention Weights')
plt.xlabel('Keys')
plt.ylabel('Queries')
plt.show()

# Darker cells = stronger attention

🎯 Multi-Head Attention

Why Multiple Heads?

Instead of one attention mechanism, use multiple "heads" that learn different relationships:

  • Head 1: Might focus on syntactic relationships (subject-verb)
  • Head 2: Might capture semantic meaning
  • Head 3: Might track co-references (pronouns)
  • Heads 4-8: Other patterns...

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # 512 / 8 = 64
        
        # Linear layers for Q, K, V
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
    
    def split_heads(self, x, batch_size):
        """Split into multiple heads"""
        # x: [batch, seq_len, d_model]
        x = x.view(batch_size, -1, self.num_heads, self.d_k)
        # [batch, seq_len, num_heads, d_k]
        return x.transpose(1, 2)  # [batch, num_heads, seq_len, d_k]
    
    def forward(self, x, mask=None):
        batch_size = x.size(0)
        
        # 1. Linear projections
        Q = self.W_q(x)  # [batch, seq_len, d_model]
        K = self.W_k(x)
        V = self.W_v(x)
        
        # 2. Split into multiple heads
        Q = self.split_heads(Q, batch_size)
        K = self.split_heads(K, batch_size)
        V = self.split_heads(V, batch_size)
        
        # 3. Scaled dot-product attention (for each head)
        scores = Q @ K.transpose(-2, -1) / torch.sqrt(torch.tensor(self.d_k, dtype=torch.float32))
        
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        
        attention_weights = torch.softmax(scores, dim=-1)
        attention_output = attention_weights @ V
        
        # 4. Concatenate heads
        attention_output = attention_output.transpose(1, 2).contiguous()
        attention_output = attention_output.view(batch_size, -1, self.num_heads * self.d_k)
        
        # 5. Final linear layer
        output = self.W_o(attention_output)
        
        return output, attention_weights

# Usage
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(32, 50, 512)  # [batch, seq_len, d_model]
output, weights = mha(x)
print(f"Output shape: {output.shape}")  # [32, 50, 512]

📍 Positional Encoding

The Problem

Transformers process all tokens in parallel → they don't know word order!

"Cat chases dog" vs "Dog chases cat" would look the same without position info.

The Solution: Add Position Information

import numpy as np

def positional_encoding(seq_len, d_model):
    """
    Create positional encodings using sine/cosine functions
    """
    position = np.arange(seq_len)[:, np.newaxis]  # [seq_len, 1]
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(position * div_term)  # Even dimensions
    pe[:, 1::2] = np.cos(position * div_term)  # Odd dimensions
    
    return torch.FloatTensor(pe)

# Add to embeddings
seq_len, d_model = 50, 512
embeddings = torch.randn(1, seq_len, d_model)
pos_encoding = positional_encoding(seq_len, d_model)

# Combine
input_with_position = embeddings + pos_encoding

print(f"Positional encoding shape: {pos_encoding.shape}")

Visualize Positional Encodings

pe = positional_encoding(100, 512)

plt.figure(figsize=(12, 6))
plt.pcolormesh(pe.numpy(), cmap='RdBu')
plt.xlabel('Embedding Dimension')
plt.ylabel('Position in Sequence')
plt.colorbar()
plt.title('Positional Encoding Pattern')
plt.show()

# Creates unique "fingerprint" for each position

🔄 Complete Transformer Block

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        
        # Multi-head attention
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        
        # Feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        # 1. Multi-head attention + residual connection
        attention_output, _ = self.attention(x, mask)
        x = x + self.dropout1(attention_output)  # Residual
        x = self.norm1(x)  # Layer norm
        
        # 2. Feed-forward + residual connection
        ffn_output = self.ffn(x)
        x = x + self.dropout2(ffn_output)  # Residual
        x = self.norm2(x)  # Layer norm
        
        return x

# Full Transformer
class Transformer(nn.Module):
    def __init__(self, vocab_size, d_model=512, num_layers=6, num_heads=8):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, num_heads) 
            for _ in range(num_layers)
        ])
        self.output_layer = nn.Linear(d_model, vocab_size)
    
    def forward(self, x):
        # Embed tokens
        x = self.embedding(x)
        
        # Add positional encoding
        x = x + positional_encoding(x.size(1), x.size(2))
        
        # Process through transformer blocks
        for block in self.blocks:
            x = block(x)
        
        # Output projection
        logits = self.output_layer(x)
        return logits

# Create model
model = Transformer(vocab_size=50000, d_model=512, num_layers=6)
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")

🎭 Encoder vs Decoder vs Encoder-Decoder

Encoder-Only

Examples: BERT, RoBERTa

Use Cases:

  • Text classification
  • Named entity recognition
  • Question answering
  • Sentiment analysis

Key Feature: Bidirectional context

Decoder-Only

Examples: GPT-3, GPT-4, LLaMA

Use Cases:

  • Text generation
  • Story writing
  • Code completion
  • Chat

Key Feature: Auto-regressive generation
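
Decoder-only models get this auto-regressive behaviour from a causal mask: each position may attend only to itself and earlier positions. A minimal sketch using the MultiHeadAttention class defined above (how the mask is built here is an assumption for illustration; it broadcasts over batch and heads):

# Lower-triangular causal mask: entry (i, j) = 1 if position i may attend to position j
seq_len = 10
causal_mask = torch.tril(torch.ones(seq_len, seq_len))  # [seq_len, seq_len]

mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, seq_len, 512)            # [batch, seq_len, d_model]
output, weights = mha(x, mask=causal_mask)  # masked scores get ~0 weight after softmax

print(weights[0, 0, 0])  # position 0 attends only to itself
print(weights[0, 0, 5])  # position 5 attends only to positions 0-5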

Encoder-Decoder

Examples: T5, BART

Use Cases:

  • Translation
  • Summarization
  • Paraphrasing
  • Question generation

Key Feature: Input → Output transformation

⚡ Why Transformers Are So Powerful

1. Parallelization

Unlike RNNs, all tokens are processed simultaneously

  • Much faster training
  • Can use modern GPU hardware efficiently
  • Scales to longer sequences

2. Long-Range Dependencies

Attention directly connects any two tokens

  • Avoids the vanishing gradients that plague RNNs over long distances
  • Understands context across an entire document
  • Handles long sequences better than RNNs

3. Interpretability

Attention weights show what the model focuses on

  • Visualize decision-making
  • Debug model behavior
  • Understand predictions

4. Transfer Learning

Pre-train once, fine-tune for many tasks

  • Save computation
  • Better performance
  • Work with less data
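
A common pattern, sketched here with the Hugging Face transformers library (the checkpoint name and label count are illustrative), is to load a pretrained encoder and fine-tune it with a fresh classification head:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Pretrained BERT encoder plus a randomly initialized 2-class head
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("This movie was fantastic!", return_tensors="pt")
logits = model(**inputs).logits  # [1, 2]: one score per class

# Fine-tuning from here typically needs far less labeled data than training from scratch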

📊 Transformer Variants

GPT (Generative Pre-trained Transformer)

Decoder-only, autoregressive

Size: 175B parameters (GPT-3)

BERT (Bidirectional Encoder)

Encoder-only, masked language modeling

Size: 110M - 340M parameters

T5 (Text-to-Text Transfer)

Encoder-decoder, unified framework

Size: 220M - 11B parameters

CLIP (Contrastive Language-Image)

Dual encoders for text + images

Use: Image-text understanding

🎯 Key Takeaways