What are Support Vector Machines?
SVMs find the optimal hyperplane that separates different classes with the maximum margin. They're powerful for both linear and non-linear classification problems.
Key Concepts:
- Hyperplane: Decision boundary that separates classes
- Support Vectors: Data points closest to the hyperplane
- Margin: Distance between hyperplane and support vectors
- Kernel Trick: Transform data for non-linear problems
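For a linear SVM with weights w and bias b, the hyperplane is w·x + b = 0 and the margin width is 2/||w||. Here is a minimal sketch (my own illustration, using hypothetical make_blobs toy data, not part of the walkthrough below) that recovers that width from a fitted linear SVC:
# Sketch: recover the margin width 2/||w|| from a fitted linear SVM
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs
# Hypothetical toy data: two well-separated blobs
X_demo, y_demo = make_blobs(n_samples=60, centers=2, cluster_std=1.0, random_state=0)
clf = SVC(kernel='linear', C=1.0).fit(X_demo, y_demo)
w = clf.coef_[0]                      # weight vector of the separating hyperplane
margin_width = 2 / np.linalg.norm(w)  # width of the gap between the two margin lines
print(f"Margin width: {margin_width:.3f}")
print(f"Support vectors per class: {clf.n_support_}")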
📊 Linear SVM
How It Works
SVM finds the hyperplane that maximizes the margin between classes:
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt
# Generate linearly separable data
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scale features (important for SVM!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train Linear SVM
svm = SVC(kernel='linear', C=1.0, random_state=42)
svm.fit(X_train_scaled, y_train)
# Evaluate
accuracy = svm.score(X_test_scaled, y_test)
print(f"Accuracy: {accuracy:.3f}")
# Support vectors
print(f"Number of support vectors: {len(svm.support_vectors_)}")
Visualizing Decision Boundary
def plot_decision_boundary(model, X, y):
    # Create mesh over the feature space
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                         np.arange(y_min, y_max, 0.02))
    # Predict over the grid
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot decision regions and data points
    plt.contourf(xx, yy, Z, alpha=0.4, cmap='RdYlBu')
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='RdYlBu', edgecolors='black')
    # Highlight support vectors
    plt.scatter(model.support_vectors_[:, 0],
                model.support_vectors_[:, 1],
                s=200, linewidth=1, facecolors='none', edgecolors='k')
    plt.title('SVM Decision Boundary')
    plt.show()
plot_decision_boundary(svm, X_train_scaled, y_train)
⚙️ The C Parameter
C controls the trade-off between smooth decision boundary and classifying training points correctly.
# Small C: Larger margin, more misclassifications (underfitting)
svm_soft = SVC(kernel='linear', C=0.01)
svm_soft.fit(X_train_scaled, y_train)
print(f"C=0.01 Accuracy: {svm_soft.score(X_test_scaled, y_test):.3f}")
# Large C: Smaller margin, fewer misclassifications (risk overfitting)
svm_hard = SVC(kernel='linear', C=100)
svm_hard.fit(X_train_scaled, y_train)
print(f"C=100 Accuracy: {svm_hard.score(X_test_scaled, y_test):.3f}")
# C interpretation:
# - Small C (0.01-0.1): More regularization, simpler model
# - Medium C (1): Balanced (default)
# - Large C (10-100): Less regularization, complex model
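One way to see this trade-off (a rough sketch of my own, reusing X_train_scaled and y_train from above) is to watch how the number of support vectors shrinks as C grows, since a small C tolerates more margin violations:
# Sketch: effect of C on the number of support vectors (assumes the scaled split from above)
for C in [0.01, 0.1, 1, 10, 100]:
    model = SVC(kernel='linear', C=C).fit(X_train_scaled, y_train)
    print(f"C={C:>6}: {len(model.support_vectors_)} support vectors, "
          f"test accuracy {model.score(X_test_scaled, y_test):.3f}")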
🌀 Non-Linear SVM with Kernels
The kernel trick implicitly maps the data into a higher-dimensional space where it becomes linearly separable, without ever computing that mapping explicitly: the algorithm only needs pairwise similarities (kernel values) between points.
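To make the trick concrete, here is a small sketch (mine, not part of the tutorial flow) showing that the RBF kernel value is just exp(-gamma * ||a - b||²), a similarity computed without ever building the high-dimensional features:
# Sketch: the RBF kernel is exp(-gamma * ||a - b||^2), no explicit feature map needed
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
a = np.array([[1.0, 2.0]])
b = np.array([[2.0, 0.5]])
gamma = 0.5
manual = np.exp(-gamma * np.sum((a - b) ** 2))  # direct formula
library = rbf_kernel(a, b, gamma=gamma)[0, 0]   # scikit-learn's pairwise helper
print(manual, library)                          # the two values match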
Polynomial Kernel
# Generate non-linear data
from sklearn.datasets import make_circles
X, y = make_circles(n_samples=200, noise=0.1, factor=0.5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Polynomial kernel
svm_poly = SVC(kernel='poly', degree=3, C=1.0)
svm_poly.fit(X_train, y_train)
print(f"Polynomial Kernel Accuracy: {svm_poly.score(X_test, y_test):.3f}")
RBF (Radial Basis Function) Kernel
# RBF kernel (most popular for non-linear problems)
svm_rbf = SVC(kernel='rbf', C=1.0, gamma='scale')
svm_rbf.fit(X_train, y_train)
print(f"RBF Kernel Accuracy: {svm_rbf.score(X_test, y_test):.3f}")
# Gamma parameter:
# - Small gamma: Far-reaching influence (smoother boundary)
# - Large gamma: Limited influence (complex boundary, risk overfitting)
# Try different gamma values
for gamma in [0.001, 0.01, 0.1, 1, 10]:
    svm = SVC(kernel='rbf', C=1.0, gamma=gamma)
    svm.fit(X_train, y_train)
    print(f"Gamma={gamma}: {svm.score(X_test, y_test):.3f}")
Sigmoid Kernel
# Sigmoid kernel (similar to neural network)
svm_sigmoid = SVC(kernel='sigmoid', C=1.0)
svm_sigmoid.fit(X_train, y_train)
print(f"Sigmoid Kernel Accuracy: {svm_sigmoid.score(X_test, y_test):.3f}")
📊 Kernel Comparison
| Kernel | Use Case | Parameters | Pros | Cons |
|---|---|---|---|---|
| Linear | Linearly separable data | C | Fast, interpretable | Limited to linear problems |
| Polynomial | Polynomial relationships | C, degree | Flexible | Many parameters, slow |
| RBF | Most non-linear problems | C, gamma | Very flexible, powerful | Risk overfitting, slower |
| Sigmoid | Similar to neural nets | C, gamma | Good for some problems | Less popular, unstable |
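As a rough hands-on companion to this table (my own sketch, reusing the make_circles split from above), you can loop over the kernels and compare test accuracy; on concentric circles the linear kernel should clearly lag behind rbf and poly:
# Sketch: compare kernels on the non-linear circles data (assumes X_train/X_test from make_circles)
for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    model = SVC(kernel=kernel, C=1.0, gamma='scale').fit(X_train, y_train)
    print(f"{kernel:>8}: test accuracy {model.score(X_test, y_test):.3f}")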
🎯 Multiclass Classification
from sklearn.datasets import load_iris
# Load multiclass dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# SVM handles multiclass automatically (one-vs-one strategy)
svm = SVC(kernel='rbf', C=1.0, gamma='scale')
svm.fit(X_train_scaled, y_train)
# Evaluate
accuracy = svm.score(X_test_scaled, y_test)
print(f"Multiclass Accuracy: {accuracy:.3f}")
# Predict
y_pred = svm.predict(X_test_scaled)
# Confusion matrix
from sklearn.metrics import confusion_matrix, classification_report
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
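Under the one-vs-one strategy mentioned above, SVC trains one binary classifier per pair of classes, so 3 iris classes give 3 pairwise classifiers. A small sketch (mine) to make that visible:
# Sketch: inspect the one-vs-one structure of the fitted multiclass SVM
n_classes = len(svm.classes_)
n_pairs = n_classes * (n_classes - 1) // 2
print(f"{n_classes} classes -> {n_pairs} one-vs-one classifiers")
# decision_function_shape='ovo' exposes one score column per class pair
svm_ovo = SVC(kernel='rbf', C=1.0, gamma='scale', decision_function_shape='ovo')
svm_ovo.fit(X_train_scaled, y_train)
print(svm_ovo.decision_function(X_test_scaled).shape)  # (n_test_samples, 3)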
⚡ SVM Regression (SVR)
SVM can also be used for regression problems!
from sklearn.svm import SVR
from sklearn.datasets import make_regression
# Generate regression data
X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Train SVR
svr = SVR(kernel='rbf', C=100, gamma=0.1, epsilon=0.1)
svr.fit(X_train, y_train)
# Predict
y_pred = svr.predict(X_test)
# Evaluate
from sklearn.metrics import mean_squared_error, r2_score
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.3f}")
print(f"R² Score: {r2:.3f}")
# Plot
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.scatter(X_test, y_pred, color='red', label='Predicted')
plt.legend()
plt.title('Support Vector Regression')
plt.show()
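The epsilon parameter sets the width of the tube around the prediction inside which errors are ignored; as a quick sketch (mine, reusing the regression split above), widening the tube generally leaves fewer points as support vectors:
# Sketch: epsilon controls the "no-penalty" tube width (assumes the regression split from above)
for eps in [0.01, 0.1, 1, 10]:
    model = SVR(kernel='rbf', C=100, gamma=0.1, epsilon=eps).fit(X_train, y_train)
    print(f"epsilon={eps:>5}: {len(model.support_)} support vectors, "
          f"R² = {model.score(X_test, y_test):.3f}")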
🔧 Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
# Define parameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1, 'scale', 'auto'],
    'kernel': ['rbf', 'poly', 'sigmoid']
}
# Grid search
svm = SVC()
grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
grid_search.fit(X_train_scaled, y_train)
# Best parameters
print("Best parameters:", grid_search.best_params_)
print(f"Best CV score: {grid_search.best_score_:.3f}")
# Use best model
best_svm = grid_search.best_estimator_
test_score = best_svm.score(X_test_scaled, y_test)
print(f"Test accuracy: {test_score:.3f}")
⚠️ Important Considerations
1. Feature Scaling is Critical
# SVM is sensitive to feature scales
# Always scale your features!
# (Re-using the iris classification split from the multiclass example above)
from sklearn.preprocessing import StandardScaler
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use the same fitted scaler!
# Without scaling: often noticeably worse
svm_no_scale = SVC().fit(X_train, y_train)
print(f"No scaling: {svm_no_scale.score(X_test, y_test):.3f}")
# With scaling: usually better
svm_scaled = SVC().fit(X_train_scaled, y_train)
print(f"With scaling: {svm_scaled.score(X_test_scaled, y_test):.3f}")
2. Computational Complexity
- Training time: O(n² to n³) - slow for large datasets
- Prediction time: O(n_sv) - depends on support vectors
- Not suitable for: Very large datasets (>10,000 samples)
- Better for: Small to medium datasets with complex boundaries
3. When to Use SVM
- ✅ Small to medium datasets (< 10,000 samples)
- ✅ High-dimensional data (e.g., text classification)
- ✅ Clear margin of separation exists
- ✅ Non-linear relationships (use RBF kernel)
- ❌ Very large datasets (use linear models or tree-based)
- ❌ Noisy data with overlapping classes
- ❌ Need well-calibrated probability estimates (SVC's probability=True is slow and approximate; see the sketch after this list, or use LogisticRegression)
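If you do need probabilities from an SVM anyway, SVC can produce them via Platt scaling at the cost of an extra internal cross-validation during fit; a minimal sketch (reusing the scaled iris split from above) follows:
# Sketch: probability estimates via Platt scaling (slower to fit, only approximately calibrated)
svm_proba = SVC(kernel='rbf', C=1.0, gamma='scale', probability=True, random_state=42)
svm_proba.fit(X_train_scaled, y_train)
print(svm_proba.predict_proba(X_test_scaled[:3]).round(3))  # class probabilities for 3 test samples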
💡 Practical Tips
- Start with Linear: Try linear kernel first, only use RBF if needed
- Scale features: Use StandardScaler before training
- Tune C and gamma: Use GridSearchCV with cross-validation
- RBF is default: Works well for most non-linear problems
- Use probability: Set probability=True to enable predict_proba()
- Large datasets: Consider LinearSVC or SGDClassifier instead (see the sketch after this list)
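For datasets where SVC's roughly quadratic training cost becomes painful, the usual alternatives are LinearSVC (linear kernel only) or SGDClassifier with hinge loss. A rough sketch on a hypothetical larger dataset (X_big and y_big are illustrative names, not data used earlier):
# Sketch: faster linear alternatives for large datasets
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Hypothetical larger dataset where kernel SVC would start to feel slow
X_big, y_big = make_classification(n_samples=50_000, n_features=20, random_state=42)
for name, model in [('LinearSVC', make_pipeline(StandardScaler(), LinearSVC(C=1.0))),
                    ('SGDClassifier', make_pipeline(StandardScaler(), SGDClassifier(loss='hinge', random_state=42)))]:
    model.fit(X_big, y_big)
    print(f"{name}: training accuracy {model.score(X_big, y_big):.3f}")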
🎯 Key Takeaways
- SVM finds optimal hyperplane with maximum margin
- Kernel trick enables non-linear classification
- RBF kernel is most popular for non-linear problems
- C parameter controls regularization strength
- Gamma parameter defines kernel width (RBF/poly)
- Always scale features before using SVM
- Best for small-medium datasets with complex boundaries