📊 Naive Bayes Classifier

Probabilistic classification based on Bayes' Theorem

What is Naive Bayes?

Naive Bayes is a probabilistic classifier based on Bayes' Theorem with a "naive" assumption that features are independent. Despite this simplification, it works remarkably well, especially for text classification.

Bayes' Theorem:

P(Class|Features) = P(Features|Class) × P(Class) / P(Features)

  • P(Class|Features): Posterior probability (what we want)
  • P(Features|Class): Likelihood
  • P(Class): Prior probability
  • P(Features): Evidence (constant for all classes)
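
As a quick worked example, suppose (with invented numbers) that 40% of emails are spam, the word "free" appears in 70% of spam emails, and in 5% of non-spam ("ham") emails. Bayes' Theorem then gives the probability that an email containing "free" is spam:

# Toy one-feature spam example - all numbers are made up for illustration
p_spam = 0.4               # prior P(spam)
p_ham = 0.6                # prior P(ham)
p_free_given_spam = 0.7    # likelihood P("free" | spam)
p_free_given_ham = 0.05    # likelihood P("free" | ham)

# Evidence P("free") via the law of total probability
p_free = p_free_given_spam * p_spam + p_free_given_ham * p_ham

# Posterior P(spam | "free") from Bayes' Theorem
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(f"P(spam | 'free') = {p_spam_given_free:.3f}")  # ≈ 0.903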

🎯 Gaussian Naive Bayes

For continuous features; assumes each feature follows a Gaussian (normal) distribution within each class.

from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Train Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predict
y_pred = gnb.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

Understanding the Model

# Class priors (from training data)
print("Class Priors:")
print(gnb.class_prior_)  # roughly [0.33, 0.33, 0.33] - iris classes are balanced

# Mean of features for each class
print("\nFeature Means per Class:")
print(gnb.theta_)  # Shape: (n_classes, n_features)

# Variance of features for each class
print("\nFeature Variances per Class:")
print(gnb.var_)

# Predict probabilities
probabilities = gnb.predict_proba(X_test)
print("\nProbabilities for first sample:")
print(probabilities[0])  # the largest entry corresponds to the predicted class
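
To connect these attributes back to Bayes' Theorem, here is a minimal sketch (assuming scipy is available) that recomputes the posterior for the first test sample by hand from the learned priors, means, and variances; it should agree with predict_proba up to floating-point error:

import numpy as np
from scipy.stats import norm

x = X_test[0]  # first test sample (4 iris features)

# log P(class) plus the summed log-density of each feature under that class's Gaussian
log_joint = np.log(gnb.class_prior_) + np.array([
    norm.logpdf(x, loc=gnb.theta_[c], scale=np.sqrt(gnb.var_[c])).sum()
    for c in range(len(gnb.classes_))
])

# Normalize the joint log-likelihoods into posterior probabilities
manual_proba = np.exp(log_joint - log_joint.max())
manual_proba /= manual_proba.sum()
print(manual_proba)  # should match probabilities[0] above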

📝 Multinomial Naive Bayes

For discrete count features - perfect for text classification!

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups

# Load text data (20 newsgroups dataset)
categories = ['sci.space', 'comp.graphics', 'rec.sport.baseball']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, 
                                       remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories,
                                      remove=('headers', 'footers', 'quotes'))

# Convert text to word counts
vectorizer = CountVectorizer(max_features=1000, stop_words='english')
X_train = vectorizer.fit_transform(newsgroups_train.data)
X_test = vectorizer.transform(newsgroups_test.data)

# Train Multinomial Naive Bayes
mnb = MultinomialNB(alpha=1.0)  # alpha = smoothing parameter
mnb.fit(X_train, newsgroups_train.target)

# Predict
y_pred = mnb.predict(X_test)
accuracy = accuracy_score(newsgroups_test.target, y_pred)
print(f"Text Classification Accuracy: {accuracy:.3f}")

# Predict new text
new_text = ["NASA launched a new satellite"]
new_vec = vectorizer.transform(new_text)
prediction = mnb.predict(new_vec)
print(f"Prediction: {newsgroups_train.target_names[prediction[0]]}")

Smoothing Parameter (Alpha)

# Alpha prevents zero probabilities for unseen words
# alpha = 0: No smoothing (risky - zero probabilities)
# alpha = 1: Laplace smoothing (default, good choice)
# alpha > 1: Stronger smoothing

for alpha in [0.1, 1.0, 5.0, 10.0]:
    mnb = MultinomialNB(alpha=alpha)
    mnb.fit(X_train, newsgroups_train.target)
    score = mnb.score(X_test, newsgroups_test.target)
    print(f"Alpha={alpha}: {score:.3f}")

🔤 Text Classification Example

Spam Detection

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Sample spam/ham data
data = {
    'text': [
        'Win free money now!!!',
        'Meeting at 3pm today',
        'Get rich quick scheme',
        'Lunch tomorrow?',
        'Claim your prize now',
        'Project deadline reminder',
        'Hot singles in your area',
        'Team meeting notes'
    ],
    'label': ['spam', 'ham', 'spam', 'ham', 'spam', 'ham', 'spam', 'ham']
}
df = pd.DataFrame(data)

# Split data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['label'], test_size=0.25, random_state=42
)

# Create pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=100, stop_words='english')),
    ('classifier', MultinomialNB(alpha=1.0))
])

# Train
pipeline.fit(X_train, y_train)

# Test
print(f"Accuracy: {pipeline.score(X_test, y_test):.3f}")

# Predict new messages
new_messages = [
    "You won a million dollars",
    "Can we reschedule our meeting?"
]
predictions = pipeline.predict(new_messages)
for msg, pred in zip(new_messages, predictions):
    print(f"{msg} → {pred}")

Feature Importance in Text

# Get most important words for each class
import numpy as np

# Extract feature names and log probabilities
# (vectorizer and mnb are the newsgroups CountVectorizer and MultinomialNB from above;
#  after the alpha loop, mnb is the model fitted with the last alpha value)
feature_names = vectorizer.get_feature_names_out()
log_probs = mnb.feature_log_prob_

for i, class_name in enumerate(newsgroups_train.target_names):
    # Top 10 words for this class
    top_indices = np.argsort(log_probs[i])[-10:]
    top_words = [feature_names[j] for j in top_indices]
    print(f"\n{class_name}: {', '.join(top_words)}")

🎲 Bernoulli Naive Bayes

For binary features (0 or 1) - useful for document classification with binary word occurrence.

from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer

# Binary features (word present or not)
# X_train / X_test here are the raw spam/ham messages from the split above,
# so turn them into binary word-occurrence vectors first
binary_vectorizer = CountVectorizer(binary=True)
X_train_bin = binary_vectorizer.fit_transform(X_train)
X_test_bin = binary_vectorizer.transform(X_test)

bnb = BernoulliNB(alpha=1.0, binarize=None)
# binarize: threshold for turning features into 0/1 (None = features are already binary)

bnb.fit(X_train_bin, y_train)
y_pred = bnb.predict(X_test_bin)
print(f"Bernoulli NB Accuracy: {accuracy_score(y_test, y_pred):.3f}")

# Bernoulli is better when:
# - Features are binary
# - Small datasets
# - Documents are short

📊 Comparing Naive Bayes Types

| Type | Feature Type | Use Case | Example |
|---|---|---|---|
| Gaussian | Continuous (real values) | Numerical features | Iris classification, sensor data |
| Multinomial | Discrete counts | Text classification | Spam detection, sentiment analysis |
| Bernoulli | Binary (0/1) | Document classification | Word presence/absence |
| Complement | Discrete counts | Imbalanced text data | Imbalanced document classes |
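
Complement Naive Bayes (the last row above) is a variant of Multinomial NB designed for imbalanced text data. As a minimal sketch, it can be dropped in on the same newsgroups count features; the three categories used earlier are fairly balanced, so don't expect a big difference there:

from sklearn.naive_bayes import ComplementNB

# Complement NB on the newsgroups counts (re-vectorize, because X_train/X_test
# were reused by the spam example above)
cnb = ComplementNB(alpha=1.0)
cnb.fit(vectorizer.transform(newsgroups_train.data), newsgroups_train.target)
cnb_score = cnb.score(vectorizer.transform(newsgroups_test.data), newsgroups_test.target)
print(f"Complement NB Accuracy: {cnb_score:.3f}")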

🔧 Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV

# Tune on the newsgroups count features
# (rebuild them, since X_train/X_test were reused by the spam example)
X_train_counts = vectorizer.transform(newsgroups_train.data)
X_test_counts = vectorizer.transform(newsgroups_test.data)

# For Multinomial NB
param_grid = {
    'alpha': [0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
    'fit_prior': [True, False]  # Learn class priors from data or assume uniform
}

mnb = MultinomialNB()
grid_search = GridSearchCV(mnb, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_counts, newsgroups_train.target)

print("Best parameters:", grid_search.best_params_)
print(f"Best CV score: {grid_search.best_score_:.3f}")

# Test best model
best_mnb = grid_search.best_estimator_
test_score = best_mnb.score(X_test_counts, newsgroups_test.target)
print(f"Test accuracy: {test_score:.3f}")

✅ Advantages & Disadvantages

| Advantages ✅ | Disadvantages ❌ |
|---|---|
| Very fast training and prediction | Independence assumption rarely true |
| Works well with small datasets | Can't learn feature interactions |
| Handles high-dimensional data | Zero-frequency problem (needs smoothing) |
| Excellent for text classification | Probability estimates can be poor |
| Not sensitive to irrelevant features | Outperformed by complex models |
| Probabilistic predictions | Assumes feature distribution |

⚠️ Important Considerations

1. The "Naive" Assumption

Naive Bayes assumes the features are conditionally independent given the class. In spam detection, for example, it treats "free" and "money" as independent, even though they usually appear together in spam. Despite this simplification, Naive Bayes often works well in practice.
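
Formally, the independence assumption lets the likelihood factor into a product of per-feature terms, which is what makes training and prediction so cheap:

P(x1, x2, ..., xn | Class) = P(x1 | Class) × P(x2 | Class) × ... × P(xn | Class)

Each P(xi | Class) is estimated on its own, so correlated features effectively get counted twice, which is one reason the predicted probabilities can be overconfident even when the predicted class is correct.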

2. Zero Probability Problem

# Without smoothing: if a word never appears in training for a class,
# it gets zero probability, making entire prediction zero!

# Solution: Laplace smoothing (alpha parameter)
mnb = MultinomialNB(alpha=1.0)  # Add-one smoothing

# Alpha adds pseudo-counts to all features:
# P(word|class) = (count + alpha) / (total_count + alpha * vocab_size)
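
Plugging made-up numbers into that formula shows how smoothing rescues a word that never occurred for a class (all counts below are invented purely for illustration):

# Hypothetical counts for the word "prize" in the ham class
count = 0          # "prize" never appeared in ham training messages
total_count = 200  # total number of word occurrences in ham messages
vocab_size = 1000  # number of distinct words in the vocabulary
alpha = 1.0        # Laplace (add-one) smoothing

p_unsmoothed = count / total_count                                # 0.0 - wipes out the whole product
p_smoothed = (count + alpha) / (total_count + alpha * vocab_size) # 1/1200 ≈ 0.00083
print(p_unsmoothed, p_smoothed)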

3. Feature Scaling Not Needed

Unlike SVM or KNN, Naive Bayes doesn't require feature scaling because it works with probabilities, not distances.

🎯 When to Use Naive Bayes

✅ Use Naive Bayes When:

  • Text classification problems
  • Need fast training/prediction
  • Small training dataset
  • High-dimensional data
  • Need probabilistic outputs
  • Baseline model for comparison
  • Real-time predictions needed

❌ Avoid Naive Bayes When:

  • Features highly correlated
  • Need feature interactions
  • Need best possible accuracy
  • Features don't fit assumptions
  • Accurate probabilities needed
  • Complex decision boundaries

🎯 Key Takeaways