What is Naive Bayes?
Naive Bayes is a probabilistic classifier based on Bayes' Theorem with the "naive" assumption that features are conditionally independent given the class. Despite this simplification, it works remarkably well in practice, especially for text classification.
Bayes' Theorem:
P(Class|Features) = P(Features|Class) × P(Class) / P(Features)
- P(Class|Features): Posterior probability (what we want)
- P(Features|Class): Likelihood
- P(Class): Prior probability
- P(Features): Evidence (constant for all classes)
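A quick worked example with made-up numbers makes the formula concrete: suppose 20% of emails are spam, and the word "free" appears in 60% of spam but only 5% of ham. A minimal sketch of the calculation:
# Toy Bayes' Theorem calculation (made-up numbers)
p_spam = 0.20                 # P(spam) - prior
p_ham = 0.80                  # P(ham) - prior
p_free_given_spam = 0.60      # P("free"|spam) - likelihood
p_free_given_ham = 0.05       # P("free"|ham) - likelihood
# Evidence: P("free") summed over both classes
p_free = p_free_given_spam * p_spam + p_free_given_ham * p_ham
# Posterior: P(spam|"free")
print((p_free_given_spam * p_spam) / p_free)  # 0.75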
🎯 Gaussian Naive Bayes
For continuous features, assuming each feature follows a Gaussian (normal) distribution within each class.
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.2, random_state=42
)
# Train Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(X_train, y_train)
# Predict
y_pred = gnb.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
Understanding the Model
# Class priors (from training data)
print("Class Priors:")
print(gnb.class_prior_)  # roughly [0.33, 0.34, 0.33] - the iris classes are balanced
# Mean of features for each class
print("\nFeature Means per Class:")
print(gnb.theta_) # Shape: (n_classes, n_features)
# Variance of features for each class
print("\nFeature Variances per Class:")
print(gnb.var_)
# Predict probabilities
probabilities = gnb.predict_proba(X_test)
print("\nProbabilities for first sample:")
print(probabilities[0])  # one probability dominates - that index is the predicted class
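To see how GaussianNB combines these parameters, here is a minimal sketch (not sklearn's actual code) that recomputes the posterior for the first test sample from class_prior_, theta_, and var_; it should closely match predict_proba, up to the tiny variance-smoothing term sklearn adds during fitting.
import numpy as np
# Recompute the posterior for the first test sample by hand
x = X_test[0]
log_posteriors = []
for c in range(len(gnb.classes_)):
    log_prior = np.log(gnb.class_prior_[c])
    # Sum of per-feature Gaussian log-densities (the independence assumption)
    log_likelihood = np.sum(
        -0.5 * np.log(2 * np.pi * gnb.var_[c])
        - (x - gnb.theta_[c]) ** 2 / (2 * gnb.var_[c])
    )
    log_posteriors.append(log_prior + log_likelihood)
# Normalize to probabilities (softmax over the log scores)
log_posteriors = np.array(log_posteriors)
probs = np.exp(log_posteriors - log_posteriors.max())
print(probs / probs.sum())  # compare with probabilities[0]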
📝 Multinomial Naive Bayes
For discrete count features - perfect for text classification!
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
# Load text data (20 newsgroups dataset)
categories = ['sci.space', 'comp.graphics', 'rec.sport.baseball']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories,
remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories,
remove=('headers', 'footers', 'quotes'))
# Convert text to word counts
vectorizer = CountVectorizer(max_features=1000, stop_words='english')
X_train = vectorizer.fit_transform(newsgroups_train.data)
X_test = vectorizer.transform(newsgroups_test.data)
# Train Multinomial Naive Bayes
mnb = MultinomialNB(alpha=1.0) # alpha = smoothing parameter
mnb.fit(X_train, newsgroups_train.target)
# Predict
y_pred = mnb.predict(X_test)
accuracy = accuracy_score(newsgroups_test.target, y_pred)
print(f"Text Classification Accuracy: {accuracy:.3f}")
# Predict new text
new_text = ["NASA launched a new satellite"]
new_vec = vectorizer.transform(new_text)
prediction = mnb.predict(new_vec)
print(f"Prediction: {newsgroups_train.target_names[prediction[0]]}")
Smoothing Parameter (Alpha)
# Alpha prevents zero probabilities for unseen words
# alpha = 0: No smoothing (risky - zero probabilities)
# alpha = 1: Laplace smoothing (default, good choice)
# alpha > 1: Stronger smoothing
for alpha in [0.1, 1.0, 5.0, 10.0]:
mnb = MultinomialNB(alpha=alpha)
mnb.fit(X_train, newsgroups_train.target)
score = mnb.score(X_test, newsgroups_test.target)
print(f"Alpha={alpha}: {score:.3f}")
🔤 Text Classification Example
Spam Detection
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
# Sample spam/ham data
data = {
'text': [
'Win free money now!!!',
'Meeting at 3pm today',
'Get rich quick scheme',
'Lunch tomorrow?',
'Claim your prize now',
'Project deadline reminder',
'Hot singles in your area',
'Team meeting notes'
],
'label': ['spam', 'ham', 'spam', 'ham', 'spam', 'ham', 'spam', 'ham']
}
df = pd.DataFrame(data)
# Split data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
df['text'], df['label'], test_size=0.25, random_state=42
)
# Create pipeline
pipeline = Pipeline([
('tfidf', TfidfVectorizer(max_features=100, stop_words='english')),
('classifier', MultinomialNB(alpha=1.0))
])
# Train
pipeline.fit(X_train, y_train)
# Test
print(f"Accuracy: {pipeline.score(X_test, y_test):.3f}")
# Predict new messages
new_messages = [
"You won a million dollars",
"Can we reschedule our meeting?"
]
predictions = pipeline.predict(new_messages)
for msg, pred in zip(new_messages, predictions):
print(f"{msg} → {pred}")
Feature Importance in Text
# Get most important words for each class
import numpy as np
# Extract feature names and per-class log probabilities
# (vectorizer and mnb are the CountVectorizer and MultinomialNB fitted on the 20 newsgroups data above)
feature_names = vectorizer.get_feature_names_out()
log_probs = mnb.feature_log_prob_
for i, class_name in enumerate(newsgroups_train.target_names):
# Top 10 words for this class
top_indices = np.argsort(log_probs[i])[-10:]
top_words = [feature_names[j] for j in top_indices]
print(f"\n{class_name}: {', '.join(top_words)}")
🎲 Bernoulli Naive Bayes
For binary features (0/1) - useful for document classification based on word presence/absence.
from sklearn.naive_bayes import BernoulliNB
# Reuse the 20 newsgroups word counts (X_train/X_test were reassigned by the spam example,
# so re-vectorize the newsgroups text explicitly with the fitted CountVectorizer)
X_train_counts = vectorizer.transform(newsgroups_train.data)
X_test_counts = vectorizer.transform(newsgroups_test.data)
# binarize: threshold for converting counts to 0/1 (use None if features are already binary)
bnb = BernoulliNB(alpha=1.0, binarize=0.0)
bnb.fit(X_train_counts, newsgroups_train.target)
y_pred = bnb.predict(X_test_counts)
print(f"Bernoulli NB Accuracy: {accuracy_score(newsgroups_test.target, y_pred):.3f}")
# Bernoulli is better when:
# - Features are binary
# - Small datasets
# - Documents are short
📊 Comparing Naive Bayes Types
| Type | Feature Type | Use Case | Example |
|---|---|---|---|
| Gaussian | Continuous (real values) | Numerical features | Iris classification, sensor data |
| Multinomial | Discrete counts | Text classification | Spam detection, sentiment analysis |
| Bernoulli | Binary (0/1) | Document classification | Word presence/absence |
| Complement | Discrete counts | Imbalanced text data | Imbalanced document classes |
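ComplementNB appears in the table but has no example above. Here is a minimal sketch on the 20 newsgroups word counts, reusing the X_train_counts / X_test_counts matrices built in the Bernoulli section (this dataset is not strongly imbalanced, so treat it as an API illustration rather than a showcase):
from sklearn.naive_bayes import ComplementNB
# ComplementNB estimates feature statistics from the complement of each class,
# which tends to be more robust when class sizes are uneven
cnb = ComplementNB(alpha=1.0)
cnb.fit(X_train_counts, newsgroups_train.target)
print(f"Complement NB Accuracy: {cnb.score(X_test_counts, newsgroups_test.target):.3f}")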
🔧 Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
# For Multinomial NB (tuned on the X_train_counts word-count matrices built above)
param_grid = {
'alpha': [0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
'fit_prior': [True, False] # Learn class priors or assume uniform
}
mnb = MultinomialNB()
grid_search = GridSearchCV(mnb, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_counts, newsgroups_train.target)
print("Best parameters:", grid_search.best_params_)
print(f"Best CV score: {grid_search.best_score_:.3f}")
# Test best model
best_mnb = grid_search.best_estimator_
test_score = best_mnb.score(X_test_counts, newsgroups_test.target)
print(f"Test accuracy: {test_score:.3f}")
✅ Advantages & Disadvantages
| Advantages ✅ | Disadvantages ❌ |
|---|---|
| Very fast training and prediction | Independence assumption rarely true |
| Works well with small datasets | Can't learn feature interactions |
| Handles high-dimensional data | Zero-frequency problem (needs smoothing) |
| Excellent for text classification | Probability estimates can be poor |
| Not sensitive to irrelevant features | Often outperformed by more complex models |
| Probabilistic predictions | Assumes a specific feature distribution (e.g., Gaussian) |
⚠️ Important Considerations
1. The "Naive" Assumption
Naive Bayes assumes features are conditionally independent given the class. In spam detection, for example, it treats "free" and "money" as independent even though they often appear together. Despite this, Naive Bayes usually works well in practice!
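Written out, the assumption lets the likelihood factor into one term per feature (x1 ... xn are the individual features, e.g. word counts):
P(Features|Class) = P(x1|Class) × P(x2|Class) × ... × P(xn|Class)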
2. Zero Probability Problem
# Without smoothing: if a word never appears in the training data for a class,
# its likelihood is zero, which drives that class's whole probability product to zero!
# Solution: Laplace smoothing (alpha parameter)
mnb = MultinomialNB(alpha=1.0) # Add-one smoothing
# Alpha adds pseudo-counts to all features:
# P(word|class) = (count + alpha) / (total_count + alpha * vocab_size)
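A quick worked example with made-up counts shows the effect: suppose a class has 100 total word occurrences over a 1,000-word vocabulary and the word "prize" never appears in that class.
# Made-up counts to show Laplace smoothing in action
count, total_count, vocab_size, alpha = 0, 100, 1000, 1.0
p_unsmoothed = count / total_count                                 # 0.0 - wipes out the whole product
p_smoothed = (count + alpha) / (total_count + alpha * vocab_size)  # ≈ 0.00091 - small but non-zero
print(p_unsmoothed, p_smoothed)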
3. Feature Scaling Not Needed
Unlike SVM or KNN, Naive Bayes doesn't require feature scaling: it models each feature's distribution (or counts) per class rather than computing distances between samples.
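As a quick sanity check (a sketch, not a proof), standardizing the iris features and refitting GaussianNB should leave the predictions essentially unchanged; the split is redone here with fresh variable names since X_train was reused for the text examples above.
from sklearn.preprocessing import StandardScaler
import numpy as np
# Redo the iris split with unambiguous names
Xi_train, Xi_test, yi_train, yi_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
scaler = StandardScaler().fit(Xi_train)
gnb_raw = GaussianNB().fit(Xi_train, yi_train)
gnb_scaled = GaussianNB().fit(scaler.transform(Xi_train), yi_train)
agreement = np.mean(gnb_raw.predict(Xi_test) == gnb_scaled.predict(scaler.transform(Xi_test)))
print(f"Fraction of identical predictions: {agreement:.3f}")  # expect 1.0 (or very close)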
💡 Best Practices
- Text classification: Use MultinomialNB with TF-IDF or CountVectorizer
- Continuous features: Use GaussianNB
- Always use smoothing: Set alpha ≥ 1 to avoid zero probabilities
- Baseline model: Great first model - fast and often competitive
- High dimensions: Naive Bayes copes well, since it estimates each feature's distribution separately instead of modeling the joint feature space
- Imbalanced text: Try ComplementNB instead of MultinomialNB
- Pipeline: Combine with vectorizers in sklearn Pipeline
🎯 Real-World Applications
- Spam Detection: Email/SMS filtering
- Sentiment Analysis: Product reviews, social media
- Document Categorization: News articles, research papers
- Medical Diagnosis: Disease prediction from symptoms
- Recommendation Systems: Collaborative filtering
- Real-time Classification: Fast prediction required
🎯 When to Use Naive Bayes
✅ Use Naive Bayes When:
- Text classification problems
- Need fast training/prediction
- Small training dataset
- High-dimensional data
- Need probabilistic outputs
- Baseline model for comparison
- Real-time predictions needed
❌ Avoid Naive Bayes When:
- Features highly correlated
- Need feature interactions
- Need best possible accuracy
- Features don't fit assumptions
- Accurate probabilities needed
- Complex decision boundaries
🎯 Key Takeaways
- Naive Bayes uses Bayes' Theorem with independence assumption
- GaussianNB for continuous features
- MultinomialNB for count data (text classification)
- BernoulliNB for binary features
- Fast and efficient - excellent for text and baselines
- Alpha smoothing prevents zero probabilities
- Works well despite naive independence assumption