What are RNNs?
Recurrent Neural Networks (RNNs) are designed for sequential data where order matters. They maintain a memory of previous inputs, which makes them well suited for time series, text, speech, and video.
Key Concepts:
- Hidden State: Memory that carries information across time steps
- Sequential Processing: Process one element at a time
- Parameter Sharing: Same weights used at each time step
- LSTM/GRU: Gated variants that mitigate the vanishing gradient problem
🧠 Basic RNN Architecture
At each time step, an RNN combines the current input with the previous hidden state to produce an output and a new hidden state:
h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b)
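To make the recurrence concrete, here is a minimal NumPy sketch of the forward loop; the hidden size, input size, and sequence length are illustrative values, not taken from the Keras example below.
import numpy as np
# Illustrative sizes for this sketch only
input_size, hidden_size, timesteps = 3, 4, 5
rng = np.random.default_rng(0)
W_xh = 0.1 * rng.standard_normal((hidden_size, input_size))   # input-to-hidden weights
W_hh = 0.1 * rng.standard_normal((hidden_size, hidden_size))  # hidden-to-hidden weights
b = np.zeros(hidden_size)
x = rng.standard_normal((timesteps, input_size))  # one input sequence
h = np.zeros(hidden_size)                         # initial hidden state
for t in range(timesteps):
    # The same weights are reused at every step (parameter sharing)
    h = np.tanh(W_hh @ h + W_xh @ x[t] + b)
    print(f"step {t}: h = {np.round(h, 3)}")
Keras' SimpleRNN layer runs this same loop internally; the example below uses it for sequence classification.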
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
# Simple RNN for sequence classification
model = keras.Sequential([
layers.SimpleRNN(
units=128, # Number of hidden units
activation='tanh', # Default activation
return_sequences=False, # Return only last output
input_shape=(None, 10) # (timesteps, features)
),
layers.Dense(1, activation='sigmoid')
])
model.summary()
⚡ Long Short-Term Memory (LSTM)
LSTM mitigates the vanishing gradient problem of simple RNNs by using gates to control the flow of information through a dedicated cell state.
LSTM Architecture
LSTM has three gates (formalized in the equations below):
- Forget Gate: Decides what to remove from cell state
- Input Gate: Decides what new information to add
- Output Gate: Decides what to output
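In the same notation as the simple RNN equation above (σ is the sigmoid function, ⊙ is element-wise multiplication, and [h_{t-1}, x_t] denotes concatenation), the standard LSTM update is:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)      (forget gate)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)      (input gate)
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)   (candidate cell state)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t          (cell state update)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)      (output gate)
h_t = o_t ⊙ tanh(c_t)                    (hidden state)
The additive cell-state update is what lets gradients flow across many time steps without vanishing as quickly as in a simple RNN.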
# Text sentiment classification with LSTM
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence
# Load IMDB dataset
max_features = 10000 # Top 10,000 words
maxlen = 200 # Max review length
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_features)
# Pad sequences to same length
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print(f"Training shape: {X_train.shape}") # (25000, 200)
# Build LSTM model
model = keras.Sequential([
layers.Embedding(max_features, 128), # Word embeddings
layers.LSTM(128, dropout=0.2, recurrent_dropout=0.2),
layers.Dense(1, activation='sigmoid')
])
model.compile(
optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy']
)
# Train
history = model.fit(
X_train, y_train,
batch_size=128,
epochs=10,
validation_split=0.2,
verbose=1
)
# Evaluate
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"\nTest accuracy: {test_acc:.4f}")
🚀 Gated Recurrent Unit (GRU)
GRU is a simpler alternative to LSTM with only two gates (update and reset, shown in the equations below); it is often faster to train while reaching similar accuracy.
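In the same notation, the GRU update is:
z_t = σ(W_z · [h_{t-1}, x_t] + b_z)            (update gate)
r_t = σ(W_r · [h_{t-1}, x_t] + b_r)            (reset gate)
h̃_t = tanh(W_h · [r_t ⊙ h_{t-1}, x_t] + b_h)   (candidate state)
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t          (hidden state update)
When z_t is close to 0, the previous state is carried forward almost unchanged, which plays the same gradient-preserving role as the LSTM cell state.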
# GRU for sequence classification
model = keras.Sequential([
layers.Embedding(max_features, 128),
layers.GRU(
units=128,
dropout=0.2,
recurrent_dropout=0.2,
return_sequences=False
),
layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=128, epochs=10, validation_split=0.2)
# LSTM vs GRU:
# - LSTM: More powerful, more parameters, slower
# - GRU: Simpler, faster, often similar performance
# - Try GRU first, use LSTM if needed
📊 Sequence-to-Sequence Tasks
Many-to-One: Sentiment Analysis
# Input: Sequence of words → Output: Single label
model = keras.Sequential([
layers.Embedding(vocab_size, 128),
layers.LSTM(64, return_sequences=False), # Only last output
layers.Dense(1, activation='sigmoid')
])
Many-to-Many: Time Series Forecasting
# Input: Sequence → Output: Sequence (same length)
model = keras.Sequential([
layers.LSTM(64, return_sequences=True, input_shape=(timesteps, features)),
layers.TimeDistributed(layers.Dense(1))
])
# Example: Stock price prediction
# Generate a sample time series (random walk standing in for prices)
data = np.cumsum(np.random.randn(1000))
# Create sequences
def create_sequences(data, seq_length):
    X, y = [], []
    for i in range(len(data) - seq_length):
        X.append(data[i:i+seq_length])
        y.append(data[i+seq_length])
    return np.array(X), np.array(y)
seq_length = 20
X, y = create_sequences(data, seq_length)
X = X.reshape(-1, seq_length, 1)
# Split
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
# Build model
model = keras.Sequential([
layers.LSTM(50, return_sequences=True, input_shape=(seq_length, 1)),
layers.LSTM(50),
layers.Dense(1)
])
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=20, batch_size=32, validation_split=0.1)
Sequence-to-Sequence: Machine Translation
# Encoder-Decoder architecture
# Example dimensions (placeholders; set these for your own data)
latent_dim = 256            # size of the encoder/decoder hidden state
num_encoder_features = 64   # e.g., one-hot size of source tokens
num_decoder_features = 64   # e.g., one-hot size of target tokens
num_decoder_tokens = 64     # target vocabulary size
# Encoder: process the input sequence → context vector (final hidden and cell states)
encoder_inputs = layers.Input(shape=(None, num_encoder_features))
encoder = layers.LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
encoder_states = [state_h, state_c]
# Decoder: Generate output sequence from context
decoder_inputs = layers.Input(shape=(None, num_decoder_features))
decoder_lstm = layers.LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = layers.Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
🔄 Bidirectional RNN
Bidirectional RNNs process the sequence in both the forward and backward direction, so each position has access to both past and future context.
# Bidirectional LSTM
model = keras.Sequential([
layers.Embedding(max_features, 128),
layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
layers.Bidirectional(layers.LSTM(32)),
layers.Dense(1, activation='sigmoid')
])
# Benefits:
# - Sees future and past context
# - Better for NLP tasks (e.g., named entity recognition)
# - 2x parameters (forward + backward)
# - Not suitable for real-time/streaming prediction (the full sequence must be available)
📝 Text Generation with RNN
# Character-level text generation
text = "Your training text here..."
# Create character vocabulary
chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for i, c in enumerate(chars)}
# Prepare sequences
seq_length = 40
step = 3
sequences = []
next_chars = []
for i in range(0, len(text) - seq_length, step):
    sequences.append(text[i:i+seq_length])
    next_chars.append(text[i+seq_length])
# Vectorize
X = np.zeros((len(sequences), seq_length, len(chars)), dtype=bool)
y = np.zeros((len(sequences), len(chars)), dtype=bool)
for i, seq in enumerate(sequences):
    for t, char in enumerate(seq):
        X[i, t, char_to_idx[char]] = 1
    y[i, char_to_idx[next_chars[i]]] = 1
# Build model
model = keras.Sequential([
layers.LSTM(128, input_shape=(seq_length, len(chars))),
layers.Dense(len(chars), activation='softmax')
])
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.fit(X, y, batch_size=128, epochs=30)
# Generate text
def generate_text(model, start_string, length=100, temperature=1.0):
    generated = start_string
    for _ in range(length):
        # Prepare input: one-hot encode the last seq_length characters
        x = np.zeros((1, seq_length, len(chars)))
        for t, char in enumerate(generated[-seq_length:]):
            x[0, t, char_to_idx[char]] = 1
        # Predict the next-character distribution
        preds = model.predict(x, verbose=0)[0]
        preds = np.asarray(preds, dtype='float64')
        preds = np.log(preds + 1e-8) / temperature  # epsilon guards against log(0)
        exp_preds = np.exp(preds)
        preds = exp_preds / np.sum(exp_preds)
        # Sample the next character
        next_idx = np.random.choice(len(chars), p=preds)
        next_char = idx_to_char[next_idx]
        generated += next_char
    return generated
print(generate_text(model, "The ", length=200, temperature=0.5))
⏰ Time Series Forecasting
# Multivariate time series prediction
# Example: Predict temperature from multiple weather features
# Generate sample data
n_samples = 1000
n_features = 5 # Temperature, humidity, pressure, wind, etc.
data = np.random.randn(n_samples, n_features)
# Create sliding windows
def create_dataset(data, window_size, horizon=1):
    X, y = [], []
    for i in range(len(data) - window_size - horizon + 1):
        X.append(data[i:i+window_size])
        y.append(data[i+window_size:i+window_size+horizon, 0])  # Predict temp (feature 0)
    return np.array(X), np.array(y)
window_size = 24 # 24 hours
horizon = 6 # Predict 6 hours ahead
X, y = create_dataset(data, window_size, horizon)
# Split
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
# Build model
model = keras.Sequential([
layers.LSTM(64, return_sequences=True, input_shape=(window_size, n_features)),
layers.Dropout(0.2),
layers.LSTM(32),
layers.Dropout(0.2),
layers.Dense(horizon) # Predict multiple future steps
])
model.compile(optimizer='adam', loss='mse', metrics=['mae'])
# Train
history = model.fit(
X_train, y_train,
epochs=50,
batch_size=32,
validation_split=0.2,
callbacks=[
keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True),
keras.callbacks.ReduceLROnPlateau(patience=5, factor=0.5)
]
)
# Predict
predictions = model.predict(X_test)
# Visualize
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 6))
plt.plot(y_test[:100, 0], label='Actual')
plt.plot(predictions[:100, 0], label='Predicted')
plt.legend()
plt.title('Time Series Forecast')
plt.show()
🎯 RNN vs LSTM vs GRU
| Feature | Simple RNN | LSTM | GRU |
|---|---|---|---|
| Parameters | Fewest | Most (3 gates + cell state) | In between (2 gates) |
| Training Speed | Fastest | Slowest | Fast |
| Long Dependencies | Poor | Excellent | Very Good |
| Vanishing Gradient | Severe | Largely mitigated | Largely mitigated |
| Memory Control | None | Cell state + gates | Hidden state + gates |
| Best For | Short sequences | Complex patterns | Most tasks |
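The parameter differences are easy to verify directly. A quick sketch, using 64 units and 32 input features purely as illustrative sizes:
# Compare parameter counts of the three recurrent layer types
units, features = 64, 32  # illustrative sizes
for layer_cls in [layers.SimpleRNN, layers.GRU, layers.LSTM]:
    m = keras.Sequential([layer_cls(units, input_shape=(None, features))])
    print(f"{layer_cls.__name__:10s} parameters: {m.count_params():,}")
# Expected ordering: SimpleRNN < GRU < LSTM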
💡 Best Practices
- Start with GRU: Good balance of performance and speed
- Use LSTM for: Very long sequences or complex patterns
- Dropout: Add dropout (0.2-0.5) and recurrent_dropout to prevent overfitting (note: recurrent_dropout disables the fast cuDNN kernel on GPU)
- Bidirectional: Use for NLP when full context available
- Normalize data: Scale time series to [0, 1] or standardize
- Batch size: 32-128 usually works well
- return_sequences=True: When stacking RNNs or for seq-to-seq
- Early stopping: Monitor validation loss, stop when overfitting
- Learning rate: Start with 0.001 (Adam default)
⚠️ Common Pitfalls
- Wrong input shape: Must be (batch, timesteps, features)
- Not shuffling: Shuffle time series windows, not the series itself
- Leakage: Don't compute normalization statistics on the entire dataset; use training-set statistics only (see the sketch after this list)
- Too deep: 2-3 RNN layers usually sufficient
- Vanishing gradients: Use LSTM/GRU instead of SimpleRNN
- Exploding gradients: Clip gradients (e.g., clipnorm in the optimizer) or use a smaller learning rate
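A minimal sketch of the leakage and exploding-gradient points above, assuming windowed arrays X_train / X_test shaped (samples, timesteps, features) and the model from the earlier forecasting example:
# Scale with statistics computed on the training split only (no leakage)
mean = X_train.mean(axis=(0, 1), keepdims=True)
std = X_train.std(axis=(0, 1), keepdims=True) + 1e-8
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std  # reuse the training statistics

# Clip gradients by global norm to guard against exploding gradients
optimizer = keras.optimizers.Adam(learning_rate=0.001, clipnorm=1.0)
model.compile(optimizer=optimizer, loss='mse')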
🎯 RNN Applications
Natural Language Processing
- Sentiment analysis
- Machine translation
- Text generation
- Named entity recognition
- Question answering
Time Series
- Stock price prediction
- Weather forecasting
- Energy consumption prediction
- Anomaly detection
Other Applications
- Speech recognition
- Music generation
- Video analysis
- Handwriting recognition
🔮 Modern Alternatives: Transformers
Note: For many NLP tasks, Transformer architectures (BERT, GPT) have largely replaced RNNs due to:
- Better parallelization (faster training)
- Better handling of long-range dependencies
- State-of-the-art performance
However, RNNs are still useful for:
- Real-time sequence processing
- Smaller models with limited resources
- Time series forecasting
- Streaming data applications
🎯 Key Takeaways
- RNNs process sequential data with memory of past inputs
- LSTM mitigates vanishing gradients with gates and a cell state
- GRU is simpler, faster alternative to LSTM
- Bidirectional RNNs see both past and future context
- return_sequences=True for seq-to-seq tasks
- Dropout prevents overfitting in recurrent layers
- Transformers often better for NLP, but RNNs still useful