📐 Mathematics for Machine Learning

Essential Math Concepts for ML Success

📖 Introduction

Mathematics is the foundation of machine learning. Understanding key mathematical concepts will help you grasp ML algorithms deeply, debug models effectively, and develop new techniques. This guide covers the essential math you need.

🎯 Why Math Matters in ML

🔢 Linear Algebra

Scalars, Vectors, and Matrices

Scalar: A single number (e.g., 5, 3.14)

Vector: A 1D array of numbers, e.g., [1, 2, 3]

Matrix: A 2D array of numbers

Tensor: An array with three or more dimensions

import numpy as np

# Scalar
scalar = 5

# Vector (1D array)
vector = np.array([1, 2, 3, 4])
print("Vector:", vector)
print("Shape:", vector.shape)  # (4,)

# Matrix (2D array)
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
print("\nMatrix:\n", matrix)
print("Shape:", matrix.shape)  # (3, 3)

# Tensor (3D+ array)
tensor = np.array([[[1, 2], [3, 4]],
                   [[5, 6], [7, 8]]])
print("\nTensor shape:", tensor.shape)  # (2, 2, 2)

Vector Operations

# Vector addition
v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])
print("Addition:", v1 + v2)  # [5, 7, 9]

# Scalar multiplication
print("Scalar multiply:", 3 * v1)  # [3, 6, 9]

# Dot product (inner product)
dot_product = np.dot(v1, v2)
print("Dot product:", dot_product)  # 1*4 + 2*5 + 3*6 = 32

# Vector magnitude (norm)
magnitude = np.linalg.norm(v1)
print("Magnitude:", magnitude)  # sqrt(1² + 2² + 3²) = 3.74

Matrix Operations

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Matrix addition
print("A + B:\n", A + B)

# Matrix multiplication
print("\nA @ B:\n", A @ B)
# [[1*5+2*7, 1*6+2*8],
#  [3*5+4*7, 3*6+4*8]]

# Element-wise multiplication
print("\nA * B (element-wise):\n", A * B)

# Transpose
print("\nA transpose:\n", A.T)

# Inverse
A_inv = np.linalg.inv(A)
print("\nA inverse:\n", A_inv)
print("A @ A_inv (should be identity):\n", A @ A_inv)

Eigenvalues and Eigenvectors

Important for PCA, dimensionality reduction, and understanding neural network dynamics.

A = np.array([[4, 2], [1, 3]])

# Calculate eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)

print("Eigenvalues:", eigenvalues)
print("Eigenvectors:\n", eigenvectors)

# Verify: A @ v = λ * v for each eigenpair (λ is a scalar)
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    lam = eigenvalues[i]
    print(f"\nA @ v{i} =", A @ v)
    print(f"λ{i} * v{i} =", lam * v)

📊 Calculus

Derivatives

Measures how a function changes as its input changes. Critical for optimization!

Derivative Definition:

f'(x) = lim(h→0) [f(x+h) - f(x)] / h

Common Derivatives:

  • f(x) = x² → f'(x) = 2x
  • f(x) = x³ → f'(x) = 3x²
  • f(x) = eˣ → f'(x) = eˣ
  • f(x) = ln(x) → f'(x) = 1/x
  • f(x) = sin(x) → f'(x) = cos(x)
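
Each of these rules can be sanity-checked against a central-difference approximation, f'(x) ≈ [f(x+h) - f(x-h)] / (2h) for small h. A minimal sketch (the test points are arbitrary):

import numpy as np

def numerical_derivative(f, x, h=1e-5):
    # Central-difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

print(numerical_derivative(lambda x: x**2, 3.0))  # ~6.0, matches f'(x) = 2x
print(numerical_derivative(np.exp, 1.0))          # ~2.718, matches f'(x) = eˣ
print(numerical_derivative(np.log, 2.0))          # ~0.5, matches f'(x) = 1/x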

Chain Rule

Essential for backpropagation in neural networks!

If h(x) = f(g(x)), then h'(x) = f'(g(x)) × g'(x)
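
For example, h(x) = sin(x²) is sin applied after squaring, so h'(x) = cos(x²) · 2x. A minimal sketch that applies the rule step by step, the way backpropagation does (the example function and evaluation point are arbitrary):

import numpy as np

# h(x) = sin(x**2): inner g(x) = x², outer f(u) = sin(u)
x0 = 1.5
u = x0**2               # forward pass through the inner function
du_dx = 2 * x0          # g'(x)
dh_du = np.cos(u)       # f'(u) evaluated at u = g(x)
dh_dx = dh_du * du_dx   # chain rule: multiply the local derivatives
print(dh_dx)            # same as cos(x0**2) * 2 * x0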

import matplotlib.pyplot as plt

# Example: f(x) = x² and its derivative f'(x) = 2x
x = np.linspace(-3, 3, 100)
y = x**2
dy_dx = 2*x

plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.plot(x, y, label='f(x) = x²')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('Function')
plt.legend()
plt.grid(True)

plt.subplot(1, 2, 2)
plt.plot(x, dy_dx, label="f'(x) = 2x", color='red')
plt.xlabel('x')
plt.ylabel("f'(x)")
plt.title('Derivative')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

Partial Derivatives

Derivatives of a function with several variables, each taken with respect to one variable while the others are held fixed. The vector of all partial derivatives is the gradient used in gradient descent!

For f(x, y) = x² + 2xy + y²:

∂f/∂x = 2x + 2y

∂f/∂y = 2x + 2y

# Gradient = vector of partial derivatives
def f(x, y):
    return x**2 + 2*x*y + y**2

def gradient(x, y):
    df_dx = 2*x + 2*y
    df_dy = 2*x + 2*y
    return np.array([df_dx, df_dy])

# Example
x, y = 1.0, 2.0
print(f"f({x}, {y}) = {f(x, y)}")
print(f"Gradient at ({x}, {y}): {gradient(x, y)}")

Gradient Descent

The optimization algorithm that powers machine learning: repeatedly step in the direction of the negative gradient to reduce the loss!

def gradient_descent_example():
    # Minimize f(x) = x²
    x = 10.0  # Starting point
    learning_rate = 0.1
    iterations = 20
    
    history = [x]
    
    for i in range(iterations):
        # Calculate gradient (derivative)
        gradient = 2 * x
        
        # Update x
        x = x - learning_rate * gradient
        history.append(x)
        
        print(f"Iteration {i+1}: x = {x:.4f}, f(x) = {x**2:.4f}")
    
    # Plot convergence
    plt.figure(figsize=(10, 5))
    plt.subplot(1, 2, 1)
    plt.plot(history, marker='o')
    plt.xlabel('Iteration')
    plt.ylabel('x value')
    plt.title('Parameter Convergence')
    plt.grid(True)
    
    plt.subplot(1, 2, 2)
    plt.plot([h**2 for h in history], marker='o', color='red')
    plt.xlabel('Iteration')
    plt.ylabel('f(x) = x²')
    plt.title('Loss Convergence')
    plt.grid(True)
    plt.tight_layout()
    plt.show()

gradient_descent_example()

🎲 Probability & Statistics

Probability Basics

  • Probability (equally likely outcomes): P(A) = number of favorable outcomes / total outcomes
  • Range: 0 ≤ P(A) ≤ 1
  • Sum Rule: P(A or B) = P(A) + P(B) - P(A and B)
  • Product Rule: P(A and B) = P(A) × P(B|A)
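
These rules can be checked with a quick simulation. A minimal sketch using two dice rolls, where A = "first die is even" and B = "the sum is at least 8" (the events are chosen only for illustration):

import numpy as np

rng = np.random.default_rng(42)
n = 100_000
die1 = rng.integers(1, 7, n)
die2 = rng.integers(1, 7, n)

A = die1 % 2 == 0        # first die is even
B = die1 + die2 >= 8     # sum is at least 8

p_a, p_b = A.mean(), B.mean()
p_a_and_b = (A & B).mean()
p_a_or_b = (A | B).mean()
p_b_given_a = (A & B).sum() / A.sum()

print("Sum rule:    ", p_a_or_b, "≈", p_a + p_b - p_a_and_b)
print("Product rule:", p_a_and_b, "≈", p_a * p_b_given_a)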

Probability Distributions

import scipy.stats as stats

# Normal (Gaussian) Distribution
mean, std = 0, 1
x = np.linspace(-4, 4, 100)
pdf = stats.norm.pdf(x, mean, std)

plt.figure(figsize=(12, 4))

plt.subplot(1, 3, 1)
plt.plot(x, pdf)
plt.title('Normal Distribution\nN(0, 1)')
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.grid(True)

# Binomial Distribution
n, p = 10, 0.5
x_binom = np.arange(0, n+1)
pmf = stats.binom.pmf(x_binom, n, p)

plt.subplot(1, 3, 2)
plt.bar(x_binom, pmf)
plt.title('Binomial Distribution\nn=10, p=0.5')
plt.xlabel('Number of Successes')
plt.ylabel('Probability')
plt.grid(True)

# Poisson Distribution
lambda_val = 3
x_poisson = np.arange(0, 15)
pmf_poisson = stats.poisson.pmf(x_poisson, lambda_val)

plt.subplot(1, 3, 3)
plt.bar(x_poisson, pmf_poisson)
plt.title('Poisson Distribution\nλ=3')
plt.xlabel('Number of Events')
plt.ylabel('Probability')
plt.grid(True)

plt.tight_layout()
plt.show()

Bayes' Theorem

Foundation of Bayesian ML and the Naive Bayes classifier!

P(A|B) = P(B|A) × P(A) / P(B)

  • P(A|B) = Posterior probability
  • P(B|A) = Likelihood
  • P(A) = Prior probability
  • P(B) = Evidence
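
A classic worked example is a diagnostic test: suppose a disease has 1% prevalence, the test detects it 95% of the time, and it also comes back positive for 5% of healthy people. These numbers are made up for illustration; the point is how prior, likelihood, and evidence combine:

# P(disease | positive) = P(positive | disease) × P(disease) / P(positive)
p_disease = 0.01                # prior P(A)
p_pos_given_disease = 0.95      # likelihood P(B|A)
p_pos_given_healthy = 0.05      # false-positive rate

# Evidence P(B) via the law of total probability
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

posterior = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive) = {posterior:.3f}")  # ≈ 0.161

Even with a fairly accurate test, the low prior keeps the posterior small, which is exactly the trade-off Bayes' theorem captures.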

Key Statistical Measures

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# Central Tendency
mean = np.mean(data)
median = np.median(data)
mode = stats.mode(data).mode

print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")

# Spread
variance = np.var(data)    # population variance (ddof=0); pass ddof=1 for sample variance
std_dev = np.std(data)     # population standard deviation
range_val = np.max(data) - np.min(data)

print(f"\nVariance: {variance}")
print(f"Standard Deviation: {std_dev}")
print(f"Range: {range_val}")

# Quartiles
q1 = np.percentile(data, 25)
q2 = np.percentile(data, 50)  # median
q3 = np.percentile(data, 75)

print(f"\nQ1 (25th percentile): {q1}")
print(f"Q2 (50th percentile): {q2}")
print(f"Q3 (75th percentile): {q3}")
print(f"IQR (Interquartile Range): {q3 - q1}")

Correlation

# Generate correlated data
np.random.seed(42)
x = np.random.randn(100)
y_positive = x + np.random.randn(100) * 0.5  # Positive correlation
y_negative = -x + np.random.randn(100) * 0.5  # Negative correlation
y_none = np.random.randn(100)  # No correlation

# Calculate correlation coefficients
corr_pos = np.corrcoef(x, y_positive)[0, 1]
corr_neg = np.corrcoef(x, y_negative)[0, 1]
corr_none = np.corrcoef(x, y_none)[0, 1]

# Plot
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].scatter(x, y_positive)
axes[0].set_title(f'Positive Correlation\nr = {corr_pos:.2f}')
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')

axes[1].scatter(x, y_negative)
axes[1].set_title(f'Negative Correlation\nr = {corr_neg:.2f}')
axes[1].set_xlabel('x')
axes[1].set_ylabel('y')

axes[2].scatter(x, y_none)
axes[2].set_title(f'No Correlation\nr = {corr_none:.2f}')
axes[2].set_xlabel('x')
axes[2].set_ylabel('y')

plt.tight_layout()
plt.show()

🎯 Applied Math in ML Algorithms

Linear Regression

Math: Matrix operations, least squares

Formula: β = (XᵀX)⁻¹Xᵀy
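
A minimal sketch of the normal equation on synthetic data (the true coefficients and noise level are made up; in practice np.linalg.lstsq or np.linalg.pinv is preferred for numerical stability):

import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])  # bias column + one feature
y = X @ np.array([2.0, -3.0]) + rng.normal(0, 0.1, n)     # true β = [2, -3] plus noise

beta = np.linalg.inv(X.T @ X) @ X.T @ y  # β = (XᵀX)⁻¹Xᵀy
print("Estimated β:", beta)              # close to [2, -3]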

Gradient Descent

Math: Calculus (derivatives, chain rule)

Formula: θ = θ - α∇J(θ)
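
The same kind of fit can be reached iteratively with the update above. A minimal sketch minimizing mean squared error for a simple linear model (the data, learning rate, and iteration count are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(-1, 1, 100)])
y = X @ np.array([2.0, -3.0]) + rng.normal(0, 0.1, 100)

theta = np.zeros(2)
alpha = 0.1
for _ in range(500):
    grad = 2 / len(y) * X.T @ (X @ theta - y)  # ∇J(θ) for mean squared error
    theta = theta - alpha * grad               # θ = θ - α∇J(θ)
print("Learned θ:", theta)                     # close to [2, -3]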

Logistic Regression

Math: Probability, sigmoid function

Formula: σ(z) = 1 / (1 + e⁻ᶻ)
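
A minimal sketch of the sigmoid and how logistic regression turns a linear score into a probability (the weights and input below are made up for illustration):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))  # σ(z) = 1 / (1 + e⁻ᶻ)

print(sigmoid(0.0))   # 0.5
print(sigmoid(4.0))   # ~0.982
print(sigmoid(-4.0))  # ~0.018

# Logistic regression: probability of the positive class for one input
w, b = np.array([1.5, -0.8]), 0.2
x = np.array([2.0, 1.0])
print("P(y=1 | x) =", sigmoid(w @ x + b))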

PCA

Math: Eigenvalues, eigenvectors

Use: Dimensionality reduction
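
PCA ties directly back to the eigenvalue section: the principal components are the eigenvectors of the data's covariance matrix, ordered by eigenvalue. A minimal sketch on synthetic 2D data (illustrative only; in practice sklearn.decomposition.PCA handles this):

import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[3.0, 1.5], [1.5, 1.0]], size=200)

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)           # 2x2 covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh is made for symmetric matrices

order = np.argsort(eigenvalues)[::-1]            # largest eigenvalue first
components = eigenvectors[:, order]
explained = eigenvalues[order] / eigenvalues.sum()

print("Explained variance ratio:", explained)
X_reduced = X_centered @ components[:, :1]       # keep only the first component
print("Reduced shape:", X_reduced.shape)         # (200, 1)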

Neural Networks

Math: Matrix multiplication, calculus

Use: Backpropagation, weight updates
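
A minimal sketch of a forward pass through a tiny two-layer network, showing where the matrix multiplications appear (layer sizes and random weights are made up; backpropagation then applies the chain rule to these same operations in reverse):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                    # batch of 4 inputs, 3 features each

W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)  # layer 1: 3 -> 5 units
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)  # layer 2: 5 -> 1 output

h = np.maximum(0, X @ W1 + b1)     # matrix multiply + ReLU nonlinearity
out = h @ W2 + b2                  # another matrix multiply
print("Hidden shape:", h.shape)    # (4, 5)
print("Output shape:", out.shape)  # (4, 1)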

Naive Bayes

Math: Bayes' theorem, probability

Use: Classification with probabilities
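
A minimal sketch of the Bayes'-theorem arithmetic behind a tiny Bernoulli-style Naive Bayes spam filter, with made-up word probabilities and priors for illustration:

import numpy as np

# Made-up per-class probabilities of each word appearing in a message
p_word_given_spam = np.array([0.8, 0.6, 0.1])  # words: "free", "win", "meeting"
p_word_given_ham = np.array([0.05, 0.1, 0.5])
p_spam, p_ham = 0.3, 0.7                       # class priors

message = np.array([1, 1, 0])  # contains "free" and "win", but not "meeting"

def likelihood(p_word, x):
    # "Naive" assumption: word occurrences are independent given the class
    return np.prod(np.where(x == 1, p_word, 1 - p_word))

score_spam = likelihood(p_word_given_spam, message) * p_spam
score_ham = likelihood(p_word_given_ham, message) * p_ham
print("P(spam | message) =", score_spam / (score_spam + score_ham))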

💡 Learning Tips

📚 Recommended Resources