What are RNNs?
Recurrent Neural Networks (RNNs) are designed for sequential data where order matters. They maintain a memory of previous inputs, which makes them well suited for time series, text, speech, and video.
Key Concepts:
- Hidden State: Memory that carries information across time steps
- Sequential Processing: Process one element at a time
- Parameter Sharing: Same weights used at each time step
- LSTM/GRU: Gated variants that mitigate the vanishing gradient problem
🧠 Basic RNN Architecture
At each time step, an RNN combines the current input with the previous hidden state to produce an output and a new hidden state:
h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b)
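To make the recurrence concrete, here is a minimal NumPy sketch of the forward loop; the hidden size, input size, and sequence length are illustrative values, not taken from the Keras example below.
import numpy as np
# Illustrative sizes for this sketch only
input_size, hidden_size, timesteps = 3, 4, 5
rng = np.random.default_rng(0)
W_xh = 0.1 * rng.standard_normal((hidden_size, input_size))   # input-to-hidden weights
W_hh = 0.1 * rng.standard_normal((hidden_size, hidden_size))  # hidden-to-hidden weights
b = np.zeros(hidden_size)
x = rng.standard_normal((timesteps, input_size))  # one input sequence
h = np.zeros(hidden_size)                         # initial hidden state
for t in range(timesteps):
    # The same weights are reused at every step (parameter sharing)
    h = np.tanh(W_hh @ h + W_xh @ x[t] + b)
    print(f"step {t}: h = {np.round(h, 3)}")
Keras' SimpleRNN layer runs this same loop internally; the example below uses it for sequence classification.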
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
# Simple RNN for sequence classification
model = keras.Sequential([
layers.SimpleRNN(
units=128, # Number of hidden units
activation='tanh', # Default activation
return_sequences=False, # Return only last output
input_shape=(None, 10) # (timesteps, features)
),
layers.Dense(1, activation='sigmoid')
])
model.summary()
⚡ Long Short-Term Memory (LSTM)
LSTM mitigates the vanishing gradient problem of simple RNNs by using gates to control the flow of information through a dedicated cell state.
LSTM Architecture
LSTM has three gates (formalized in the equations below):
- Forget Gate: Decides what to remove from cell state
- Input Gate: Decides what new information to add
- Output Gate: Decides what to output
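In the same notation as the simple RNN equation above (σ is the sigmoid function, ⊙ is element-wise multiplication, and [h_{t-1}, x_t] denotes concatenation), the standard LSTM update is:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)      (forget gate)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)      (input gate)
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)   (candidate cell state)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t          (cell state update)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)      (output gate)
h_t = o_t ⊙ tanh(c_t)                    (hidden state)
The additive cell-state update is what lets gradients flow across many time steps without vanishing as quickly as in a simple RNN.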
# Text sentiment classification with LSTM
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence
# Load IMDB dataset
max_features = 10000 # Top 10,000 words
maxlen = 200 # Max review length
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_features)
# Pad sequences to same length
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print(f"Training shape: {X_train.shape}") # (25000, 200)
# Build LSTM model
model = keras.Sequential([
layers.Embedding(max_features, 128), # Word embeddings
layers.LSTM(128, dropout=0.2, recurrent_dropout=0.2),
layers.Dense(1, activation='sigmoid')
])
model.compile(
optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy']
)
# Train
history = model.fit(
X_train, y_train,
batch_size=128,
epochs=10,
validation_split=0.2,
verbose=1
)
# Evaluate
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"\nTest accuracy: {test_acc:.4f}")
🚀 Gated Recurrent Unit (GRU)
GRU is a simpler alternative to LSTM with only two gates (update and reset, shown in the equations below); it is often faster to train while reaching similar accuracy.
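In the same notation, the GRU update is:
z_t = σ(W_z · [h_{t-1}, x_t] + b_z)            (update gate)
r_t = σ(W_r · [h_{t-1}, x_t] + b_r)            (reset gate)
h̃_t = tanh(W_h · [r_t ⊙ h_{t-1}, x_t] + b_h)   (candidate state)
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t          (hidden state update)
When z_t is close to 0, the previous state is carried forward almost unchanged, which plays the same gradient-preserving role as the LSTM cell state.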
# GRU for sequence classification
model = keras.Sequential([
layers.Embedding(max_features, 128),
layers.GRU(
units=128,
dropout=0.2,
recurrent_dropout=0.2,
return_sequences=False
),
layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=128, epochs=10, validation_split=0.2)
# LSTM vs GRU:
# - LSTM: More powerful, more parameters, slower
# - GRU: Simpler, faster, often similar performance
# - Try GRU first, use LSTM if needed
📊 Sequence-to-Sequence Tasks
Many-to-One: Sentiment Analysis
# Input: Sequence of words → Output: Single label
model = keras.Sequential([
layers.Embedding(vocab_size, 128),
layers.LSTM(64, return_sequences=False), # Only last output
layers.Dense(1, activation='sigmoid')
])
Many-to-Many: Time Series Forecasting
# Input: Sequence → Output: Sequence (same length)
model = keras.Sequential([
layers.LSTM(64, return_sequences=True, input_shape=(timesteps, features)),
layers.TimeDistributed(layers.Dense(1))
])
# Example: Stock price prediction
# Generate a sample time series (random walk standing in for prices)
data = np.cumsum(np.random.randn(1000))
# Create sequences
def create_sequences(data, seq_length):
    X, y = [], []
    for i in range(len(data) - seq_length):
        X.append(data[i:i+seq_length])
        y.append(data[i+seq_length])
    return np.array(X), np.array(y)
seq_length = 20
X, y = create_sequences(data, seq_length)
X = X.reshape(-1, seq_length, 1)
# Split
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
# Build model
model = keras.Sequential([
layers.LSTM(50, return_sequences=True, input_shape=(seq_length, 1)),
layers.LSTM(50),
layers.Dense(1)
])
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=20, batch_size=32, validation_split=0.1)
Sequence-to-Sequence: Machine Translation
# Encoder-Decoder architecture
# Example dimensions (placeholders; set these for your own data)
latent_dim = 256            # size of the encoder/decoder hidden state
num_encoder_features = 64   # e.g., one-hot size of source tokens
num_decoder_features = 64   # e.g., one-hot size of target tokens
num_decoder_tokens = 64     # target vocabulary size
# Encoder: process the input sequence → context vector (final hidden and cell states)
encoder_inputs = layers.Input(shape=(None, num_encoder_features))
encoder = layers.LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
encoder_states = [state_h, state_c]
# Decoder: Generate output sequence from context
decoder_inputs = layers.Input(shape=(None, num_decoder_features))
decoder_lstm = layers.LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = layers.Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
🔄 Bidirectional RNN
Bidirectional RNNs process the sequence in both the forward and backward direction, so each position has access to both past and future context.
# Bidirectional LSTM
model = keras.Sequential([
layers.Embedding(max_features, 128),
layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
layers.Bidirectional(layers.LSTM(32)),
layers.Dense(1, activation='sigmoid')
])
# Benefits:
# - Sees future and past context
# - Better for NLP tasks (e.g., named entity recognition)
# - 2x parameters (forward + backward)
# - Not suitable for real-time/streaming prediction (the full sequence must be available)
📝 Text Generation with RNN
# Character-level text generation
text = "Your training text here..."
# Create character vocabulary
chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for i, c in enumerate(chars)}
# Prepare sequences
seq_length = 40
step = 3
sequences = []
next_chars = []
for i in range(0, len(text) - seq_length, step):
    sequences.append(text[i:i+seq_length])
    next_chars.append(text[i+seq_length])
# Vectorize
X = np.zeros((len(sequences), seq_length, len(chars)), dtype=bool)
y = np.zeros((len(sequences), len(chars)), dtype=bool)
for i, seq in enumerate(sequences):
    for t, char in enumerate(seq):
        X[i, t, char_to_idx[char]] = 1
    y[i, char_to_idx[next_chars[i]]] = 1
# Build model
model = keras.Sequential([
layers.LSTM(128, input_shape=(seq_length, len(chars))),
layers.Dense(len(chars), activation='softmax')
])
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.fit(X, y, batch_size=128, epochs=30)
# Generate text
def generate_text(model, start_string, length=100, temperature=1.0):
    generated = start_string
    for _ in range(length):
        # Prepare input: one-hot encode the last seq_length characters
        x = np.zeros((1, seq_length, len(chars)))
        for t, char in enumerate(generated[-seq_length:]):
            x[0, t, char_to_idx[char]] = 1
        # Predict the next-character distribution
        preds = model.predict(x, verbose=0)[0]
        preds = np.asarray(preds, dtype='float64')
        preds = np.log(preds + 1e-8) / temperature  # epsilon guards against log(0)
        exp_preds = np.exp(preds)
        preds = exp_preds / np.sum(exp_preds)
        # Sample the next character
        next_idx = np.random.choice(len(chars), p=preds)
        next_char = idx_to_char[next_idx]
        generated += next_char
    return generated
print(generate_text(model, "The ", length=200, temperature=0.5))
⏰ Time Series Forecasting
# Multivariate time series prediction
# Example: Predict temperature from multiple weather features
# Generate sample data
n_samples = 1000
n_features = 5 # Temperature, humidity, pressure, wind, etc.
data = np.random.randn(n_samples, n_features)
# Create sliding windows
def create_dataset(data, window_size, horizon=1):
    X, y = [], []
    for i in range(len(data) - window_size - horizon + 1):
        X.append(data[i:i+window_size])
        y.append(data[i+window_size:i+window_size+horizon, 0])  # Predict temp (feature 0)
    return np.array(X), np.array(y)
window_size = 24 # 24 hours
horizon = 6 # Predict 6 hours ahead
X, y = create_dataset(data, window_size, horizon)
# Split
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
# Build model
model = keras.Sequential([
layers.LSTM(64, return_sequences=True, input_shape=(window_size, n_features)),
layers.Dropout(0.2),
layers.LSTM(32),
layers.Dropout(0.2),
layers.Dense(horizon) # Predict multiple future steps
])
model.compile(optimizer='adam', loss='mse', metrics=['mae'])
# Train
history = model.fit(
X_train, y_train,
epochs=50,
batch_size=32,
validation_split=0.2,
callbacks=[
keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True),
keras.callbacks.ReduceLROnPlateau(patience=5, factor=0.5)
]
)
# Predict
predictions = model.predict(X_test)
# Visualize
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 6))
plt.plot(y_test[:100, 0], label='Actual')
plt.plot(predictions[:100, 0], label='Predicted')
plt.legend()
plt.title('Time Series Forecast')
plt.show()
🎯 RNN vs LSTM vs GRU
| Feature | Simple RNN | LSTM | GRU |
|---|---|---|---|
| Parameters | Fewest | Most (3 gates + cell state) | In between (2 gates) |
| Training Speed | Fastest | Slowest | Fast |
| Long Dependencies | Poor | Excellent | Very Good |
| Vanishing Gradient | Severe | Largely mitigated | Largely mitigated |
| Memory Control | None | Cell state + gates | Hidden state + gates |
| Best For | Short sequences | Complex patterns | Most tasks |
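The parameter differences are easy to verify directly. A quick sketch, using 64 units and 32 input features purely as illustrative sizes:
# Compare parameter counts of the three recurrent layer types
units, features = 64, 32  # illustrative sizes
for layer_cls in [layers.SimpleRNN, layers.GRU, layers.LSTM]:
    m = keras.Sequential([layer_cls(units, input_shape=(None, features))])
    print(f"{layer_cls.__name__:10s} parameters: {m.count_params():,}")
# Expected ordering: SimpleRNN < GRU < LSTM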
💡 Best Practices
- Start with GRU: Good balance of performance and speed
- Use LSTM for: Very long sequences or complex patterns
- Dropout: Add dropout (0.2-0.5) and recurrent_dropout to prevent overfitting (note: recurrent_dropout disables the fast cuDNN kernel on GPU)
- Bidirectional: Use for NLP when full context available
- Normalize data: Scale time series to [0, 1] or standardize
- Batch size: 32-128 usually works well
- return_sequences=True: When stacking RNNs or for seq-to-seq
- Early stopping: Monitor validation loss, stop when overfitting
- Learning rate: Start with 0.001 (Adam default)
⚠️ Common Pitfalls
- Wrong input shape: Must be (batch, timesteps, features)
- Not shuffling: Shuffle time series windows, not the series itself
- Leakage: Don't compute normalization statistics on the entire dataset; use training-set statistics only (see the sketch after this list)
- Too deep: 2-3 RNN layers usually sufficient
- Vanishing gradients: Use LSTM/GRU instead of SimpleRNN
- Exploding gradients: Clip gradients (e.g., clipnorm in the optimizer) or use a smaller learning rate
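A minimal sketch of the leakage and exploding-gradient points above, assuming windowed arrays X_train / X_test shaped (samples, timesteps, features) and the model from the earlier forecasting example:
# Scale with statistics computed on the training split only (no leakage)
mean = X_train.mean(axis=(0, 1), keepdims=True)
std = X_train.std(axis=(0, 1), keepdims=True) + 1e-8
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std  # reuse the training statistics

# Clip gradients by global norm to guard against exploding gradients
optimizer = keras.optimizers.Adam(learning_rate=0.001, clipnorm=1.0)
model.compile(optimizer=optimizer, loss='mse')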
🎯 RNN Applications
Natural Language Processing
- Sentiment analysis
- Machine translation
- Text generation
- Named entity recognition
- Question answering
Time Series
- Stock price prediction
- Weather forecasting
- Energy consumption prediction
- Anomaly detection
Other Applications
- Speech recognition
- Music generation
- Video analysis
- Handwriting recognition
🔮 Modern Alternatives: Transformers
Note: For many NLP tasks, Transformer architectures (BERT, GPT) have largely replaced RNNs due to:
- Better parallelization (faster training)
- Better handling of long-range dependencies
- State-of-the-art performance
However, RNNs are still useful for:
- Real-time sequence processing
- Smaller models with limited resources
- Time series forecasting
- Streaming data applications
🎯 Key Takeaways
- RNNs process sequential data with memory of past inputs
- LSTM mitigates vanishing gradients with gates and a cell state
- GRU is simpler, faster alternative to LSTM
- Bidirectional RNNs see both past and future context
- return_sequences=True for seq-to-seq tasks
- Dropout prevents overfitting in recurrent layers
- Transformers often better for NLP, but RNNs still useful