What are Support Vector Machines?
SVMs find the optimal hyperplane that separates different classes with the maximum margin. They're powerful for both linear and non-linear classification problems.
Key Concepts:
- Hyperplane: Decision boundary that separates classes
- Support Vectors: Data points closest to the hyperplane
- Margin: Distance between hyperplane and support vectors
- Kernel Trick: Transform data for non-linear problems
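For a linear SVM with weights w and bias b, the hyperplane is w·x + b = 0 and the margin width is 2/||w||. Here is a minimal sketch (my own illustration, using hypothetical make_blobs toy data, not part of the walkthrough below) that recovers that width from a fitted linear SVC:
# Sketch: recover the margin width 2/||w|| from a fitted linear SVM
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs
# Hypothetical toy data: two well-separated blobs
X_demo, y_demo = make_blobs(n_samples=60, centers=2, cluster_std=1.0, random_state=0)
clf = SVC(kernel='linear', C=1.0).fit(X_demo, y_demo)
w = clf.coef_[0]                      # weight vector of the separating hyperplane
margin_width = 2 / np.linalg.norm(w)  # width of the gap between the two margin lines
print(f"Margin width: {margin_width:.3f}")
print(f"Support vectors per class: {clf.n_support_}")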
📊 Linear SVM
How It Works
SVM finds the hyperplane that maximizes the margin between classes:
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt
# Generate linearly separable data
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scale features (important for SVM!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train Linear SVM
svm = SVC(kernel='linear', C=1.0, random_state=42)
svm.fit(X_train_scaled, y_train)
# Evaluate
accuracy = svm.score(X_test_scaled, y_test)
print(f"Accuracy: {accuracy:.3f}")
# Support vectors
print(f"Number of support vectors: {len(svm.support_vectors_)}")
Visualizing Decision Boundary
def plot_decision_boundary(model, X, y):
    # Create mesh over the feature space
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                         np.arange(y_min, y_max, 0.02))
    # Predict over the grid
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot decision regions and data points
    plt.contourf(xx, yy, Z, alpha=0.4, cmap='RdYlBu')
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='RdYlBu', edgecolors='black')
    # Highlight support vectors
    plt.scatter(model.support_vectors_[:, 0],
                model.support_vectors_[:, 1],
                s=200, linewidth=1, facecolors='none', edgecolors='k')
    plt.title('SVM Decision Boundary')
    plt.show()
plot_decision_boundary(svm, X_train_scaled, y_train)
⚙️ The C Parameter
C controls the trade-off between smooth decision boundary and classifying training points correctly.
# Small C: Larger margin, more misclassifications (underfitting)
svm_soft = SVC(kernel='linear', C=0.01)
svm_soft.fit(X_train_scaled, y_train)
print(f"C=0.01 Accuracy: {svm_soft.score(X_test_scaled, y_test):.3f}")
# Large C: Smaller margin, fewer misclassifications (risk overfitting)
svm_hard = SVC(kernel='linear', C=100)
svm_hard.fit(X_train_scaled, y_train)
print(f"C=100 Accuracy: {svm_hard.score(X_test_scaled, y_test):.3f}")
# C interpretation:
# - Small C (0.01-0.1): More regularization, simpler model
# - Medium C (1): Balanced (default)
# - Large C (10-100): Less regularization, complex model
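One way to see this trade-off (a rough sketch of my own, reusing X_train_scaled and y_train from above) is to watch how the number of support vectors shrinks as C grows, since a small C tolerates more margin violations:
# Sketch: effect of C on the number of support vectors (assumes the scaled split from above)
for C in [0.01, 0.1, 1, 10, 100]:
    model = SVC(kernel='linear', C=C).fit(X_train_scaled, y_train)
    print(f"C={C:>6}: {len(model.support_vectors_)} support vectors, "
          f"test accuracy {model.score(X_test_scaled, y_test):.3f}")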
🌀 Non-Linear SVM with Kernels
The kernel trick implicitly maps the data into a higher-dimensional space where it becomes linearly separable, without ever computing that mapping explicitly: the algorithm only needs pairwise similarities (kernel values) between points.
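To make the trick concrete, here is a small sketch (mine, not part of the tutorial flow) showing that the RBF kernel value is just exp(-gamma * ||a - b||²), a similarity computed without ever building the high-dimensional features:
# Sketch: the RBF kernel is exp(-gamma * ||a - b||^2), no explicit feature map needed
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
a = np.array([[1.0, 2.0]])
b = np.array([[2.0, 0.5]])
gamma = 0.5
manual = np.exp(-gamma * np.sum((a - b) ** 2))  # direct formula
library = rbf_kernel(a, b, gamma=gamma)[0, 0]   # scikit-learn's pairwise helper
print(manual, library)                          # the two values match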
Polynomial Kernel
# Generate non-linear data
from sklearn.datasets import make_circles
X, y = make_circles(n_samples=200, noise=0.1, factor=0.5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Polynomial kernel
svm_poly = SVC(kernel='poly', degree=3, C=1.0)
svm_poly.fit(X_train, y_train)
print(f"Polynomial Kernel Accuracy: {svm_poly.score(X_test, y_test):.3f}")
RBF (Radial Basis Function) Kernel
# RBF kernel (most popular for non-linear problems)
svm_rbf = SVC(kernel='rbf', C=1.0, gamma='scale')
svm_rbf.fit(X_train, y_train)
print(f"RBF Kernel Accuracy: {svm_rbf.score(X_test, y_test):.3f}")
# Gamma parameter:
# - Small gamma: Far-reaching influence (smoother boundary)
# - Large gamma: Limited influence (complex boundary, risk overfitting)
# Try different gamma values
for gamma in [0.001, 0.01, 0.1, 1, 10]:
    svm = SVC(kernel='rbf', C=1.0, gamma=gamma)
    svm.fit(X_train, y_train)
    print(f"Gamma={gamma}: {svm.score(X_test, y_test):.3f}")
Sigmoid Kernel
# Sigmoid kernel (similar to neural network)
svm_sigmoid = SVC(kernel='sigmoid', C=1.0)
svm_sigmoid.fit(X_train, y_train)
print(f"Sigmoid Kernel Accuracy: {svm_sigmoid.score(X_test, y_test):.3f}")
📊 Kernel Comparison
| Kernel | Use Case | Parameters | Pros | Cons |
|---|---|---|---|---|
| Linear | Linearly separable data | C | Fast, interpretable | Limited to linear problems |
| Polynomial | Polynomial relationships | C, degree | Flexible | Many parameters, slow |
| RBF | Most non-linear problems | C, gamma | Very flexible, powerful | Risk overfitting, slower |
| Sigmoid | Similar to neural nets | C, gamma | Good for some problems | Less popular, unstable |
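As a rough hands-on companion to this table (my own sketch, reusing the make_circles split from above), you can loop over the kernels and compare test accuracy; on concentric circles the linear kernel should clearly lag behind rbf and poly:
# Sketch: compare kernels on the non-linear circles data (assumes X_train/X_test from make_circles)
for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    model = SVC(kernel=kernel, C=1.0, gamma='scale').fit(X_train, y_train)
    print(f"{kernel:>8}: test accuracy {model.score(X_test, y_test):.3f}")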
🎯 Multiclass Classification
from sklearn.datasets import load_iris
# Load multiclass dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# SVM handles multiclass automatically (one-vs-one strategy)
svm = SVC(kernel='rbf', C=1.0, gamma='scale')
svm.fit(X_train_scaled, y_train)
# Evaluate
accuracy = svm.score(X_test_scaled, y_test)
print(f"Multiclass Accuracy: {accuracy:.3f}")
# Predict
y_pred = svm.predict(X_test_scaled)
# Confusion matrix
from sklearn.metrics import confusion_matrix, classification_report
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
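Under the one-vs-one strategy mentioned above, SVC trains one binary classifier per pair of classes, so 3 iris classes give 3 pairwise classifiers. A small sketch (mine) to make that visible:
# Sketch: inspect the one-vs-one structure of the fitted multiclass SVM
n_classes = len(svm.classes_)
n_pairs = n_classes * (n_classes - 1) // 2
print(f"{n_classes} classes -> {n_pairs} one-vs-one classifiers")
# decision_function_shape='ovo' exposes one score column per class pair
svm_ovo = SVC(kernel='rbf', C=1.0, gamma='scale', decision_function_shape='ovo')
svm_ovo.fit(X_train_scaled, y_train)
print(svm_ovo.decision_function(X_test_scaled).shape)  # (n_test_samples, 3)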
⚡ SVM Regression (SVR)
SVM can also be used for regression problems!
from sklearn.svm import SVR
from sklearn.datasets import make_regression
# Generate regression data
X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Train SVR
svr = SVR(kernel='rbf', C=100, gamma=0.1, epsilon=0.1)
svr.fit(X_train, y_train)
# Predict
y_pred = svr.predict(X_test)
# Evaluate
from sklearn.metrics import mean_squared_error, r2_score
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.3f}")
print(f"R² Score: {r2:.3f}")
# Plot
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.scatter(X_test, y_pred, color='red', label='Predicted')
plt.legend()
plt.title('Support Vector Regression')
plt.show()
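The epsilon parameter sets the width of the tube around the prediction inside which errors are ignored; as a quick sketch (mine, reusing the regression split above), widening the tube generally leaves fewer points as support vectors:
# Sketch: epsilon controls the "no-penalty" tube width (assumes the regression split from above)
for eps in [0.01, 0.1, 1, 10]:
    model = SVR(kernel='rbf', C=100, gamma=0.1, epsilon=eps).fit(X_train, y_train)
    print(f"epsilon={eps:>5}: {len(model.support_)} support vectors, "
          f"R² = {model.score(X_test, y_test):.3f}")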
🔧 Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
# Define parameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1, 'scale', 'auto'],
    'kernel': ['rbf', 'poly', 'sigmoid']
}
# Grid search
svm = SVC()
grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
grid_search.fit(X_train_scaled, y_train)
# Best parameters
print("Best parameters:", grid_search.best_params_)
print(f"Best CV score: {grid_search.best_score_:.3f}")
# Use best model
best_svm = grid_search.best_estimator_
test_score = best_svm.score(X_test_scaled, y_test)
print(f"Test accuracy: {test_score:.3f}")
⚠️ Important Considerations
1. Feature Scaling is Critical
# SVM is sensitive to feature scales
# Always scale your features!
# (Re-using the iris classification split from the multiclass example above)
from sklearn.preprocessing import StandardScaler
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use the same fitted scaler!
# Without scaling: often noticeably worse
svm_no_scale = SVC().fit(X_train, y_train)
print(f"No scaling: {svm_no_scale.score(X_test, y_test):.3f}")
# With scaling: usually better
svm_scaled = SVC().fit(X_train_scaled, y_train)
print(f"With scaling: {svm_scaled.score(X_test_scaled, y_test):.3f}")
2. Computational Complexity
- Training time: O(n² to n³) - slow for large datasets
- Prediction time: O(n_sv) - depends on support vectors
- Not suitable for: Very large datasets (>10,000 samples)
- Better for: Small to medium datasets with complex boundaries
3. When to Use SVM
- ✅ Small to medium datasets (< 10,000 samples)
- ✅ High-dimensional data (e.g., text classification)
- ✅ Clear margin of separation exists
- ✅ Non-linear relationships (use RBF kernel)
- ❌ Very large datasets (use linear models or tree-based)
- ❌ Noisy data with overlapping classes
- ❌ Need well-calibrated probability estimates (SVC's probability=True is slow and approximate; see the sketch after this list, or use LogisticRegression)
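If you do need probabilities from an SVM anyway, SVC can produce them via Platt scaling at the cost of an extra internal cross-validation during fit; a minimal sketch (reusing the scaled iris split from above) follows:
# Sketch: probability estimates via Platt scaling (slower to fit, only approximately calibrated)
svm_proba = SVC(kernel='rbf', C=1.0, gamma='scale', probability=True, random_state=42)
svm_proba.fit(X_train_scaled, y_train)
print(svm_proba.predict_proba(X_test_scaled[:3]).round(3))  # class probabilities for 3 test samples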
💡 Practical Tips
- Start with Linear: Try linear kernel first, only use RBF if needed
- Scale features: Use StandardScaler before training
- Tune C and gamma: Use GridSearchCV with cross-validation
- RBF is default: Works well for most non-linear problems
- Use probability: Set probability=True to enable predict_proba()
- Large datasets: Consider LinearSVC or SGDClassifier instead (see the sketch after this list)
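For datasets where SVC's roughly quadratic training cost becomes painful, the usual alternatives are LinearSVC (linear kernel only) or SGDClassifier with hinge loss. A rough sketch on a hypothetical larger dataset (X_big and y_big are illustrative names, not data used earlier):
# Sketch: faster linear alternatives for large datasets
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Hypothetical larger dataset where kernel SVC would start to feel slow
X_big, y_big = make_classification(n_samples=50_000, n_features=20, random_state=42)
for name, model in [('LinearSVC', make_pipeline(StandardScaler(), LinearSVC(C=1.0))),
                    ('SGDClassifier', make_pipeline(StandardScaler(), SGDClassifier(loss='hinge', random_state=42)))]:
    model.fit(X_big, y_big)
    print(f"{name}: training accuracy {model.score(X_big, y_big):.3f}")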
🎯 Key Takeaways
- SVM finds optimal hyperplane with maximum margin
- Kernel trick enables non-linear classification
- RBF kernel is most popular for non-linear problems
- C parameter controls regularization strength
- Gamma parameter defines kernel width (RBF/poly)
- Always scale features before using SVM
- Best for small-medium datasets with complex boundaries