What is Naive Bayes?
Naive Bayes is a probabilistic classifier based on Bayes' Theorem with the "naive" assumption that features are conditionally independent given the class. Despite this simplification, it works remarkably well in practice, especially for text classification.
Bayes' Theorem:
P(Class|Features) = P(Features|Class) × P(Class) / P(Features)
- P(Class|Features): Posterior probability (what we want)
- P(Features|Class): Likelihood
- P(Class): Prior probability
- P(Features): Evidence (constant for all classes)
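A quick worked example with made-up numbers makes the formula concrete: suppose 20% of emails are spam, and the word "free" appears in 60% of spam but only 5% of ham. A minimal sketch of the calculation:
# Toy Bayes' Theorem calculation (made-up numbers)
p_spam = 0.20                 # P(spam) - prior
p_ham = 0.80                  # P(ham) - prior
p_free_given_spam = 0.60      # P("free"|spam) - likelihood
p_free_given_ham = 0.05       # P("free"|ham) - likelihood
# Evidence: P("free") summed over both classes
p_free = p_free_given_spam * p_spam + p_free_given_ham * p_ham
# Posterior: P(spam|"free")
print((p_free_given_spam * p_spam) / p_free)  # 0.75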
🎯 Gaussian Naive Bayes
For continuous features, assuming each feature follows a Gaussian (normal) distribution within each class.
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.2, random_state=42
)
# Train Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(X_train, y_train)
# Predict
y_pred = gnb.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
Understanding the Model
# Class priors (from training data)
print("Class Priors:")
print(gnb.class_prior_)  # roughly [0.33, 0.34, 0.33] - the iris classes are balanced
# Mean of features for each class
print("\nFeature Means per Class:")
print(gnb.theta_) # Shape: (n_classes, n_features)
# Variance of features for each class
print("\nFeature Variances per Class:")
print(gnb.var_)
# Predict probabilities
probabilities = gnb.predict_proba(X_test)
print("\nProbabilities for first sample:")
print(probabilities[0])  # one probability dominates - that index is the predicted class
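To see how GaussianNB combines these parameters, here is a minimal sketch (not sklearn's actual code) that recomputes the posterior for the first test sample from class_prior_, theta_, and var_; it should closely match predict_proba, up to the tiny variance-smoothing term sklearn adds during fitting.
import numpy as np
# Recompute the posterior for the first test sample by hand
x = X_test[0]
log_posteriors = []
for c in range(len(gnb.classes_)):
    log_prior = np.log(gnb.class_prior_[c])
    # Sum of per-feature Gaussian log-densities (the independence assumption)
    log_likelihood = np.sum(
        -0.5 * np.log(2 * np.pi * gnb.var_[c])
        - (x - gnb.theta_[c]) ** 2 / (2 * gnb.var_[c])
    )
    log_posteriors.append(log_prior + log_likelihood)
# Normalize to probabilities (softmax over the log scores)
log_posteriors = np.array(log_posteriors)
probs = np.exp(log_posteriors - log_posteriors.max())
print(probs / probs.sum())  # compare with probabilities[0]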
📝 Multinomial Naive Bayes
For discrete count features - perfect for text classification!
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
# Load text data (20 newsgroups dataset)
categories = ['sci.space', 'comp.graphics', 'rec.sport.baseball']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories,
remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories,
remove=('headers', 'footers', 'quotes'))
# Convert text to word counts
vectorizer = CountVectorizer(max_features=1000, stop_words='english')
X_train = vectorizer.fit_transform(newsgroups_train.data)
X_test = vectorizer.transform(newsgroups_test.data)
# Train Multinomial Naive Bayes
mnb = MultinomialNB(alpha=1.0) # alpha = smoothing parameter
mnb.fit(X_train, newsgroups_train.target)
# Predict
y_pred = mnb.predict(X_test)
accuracy = accuracy_score(newsgroups_test.target, y_pred)
print(f"Text Classification Accuracy: {accuracy:.3f}")
# Predict new text
new_text = ["NASA launched a new satellite"]
new_vec = vectorizer.transform(new_text)
prediction = mnb.predict(new_vec)
print(f"Prediction: {newsgroups_train.target_names[prediction[0]]}")
Smoothing Parameter (Alpha)
# Alpha prevents zero probabilities for unseen words
# alpha = 0: No smoothing (risky - zero probabilities)
# alpha = 1: Laplace smoothing (default, good choice)
# alpha > 1: Stronger smoothing
for alpha in [0.1, 1.0, 5.0, 10.0]:
mnb = MultinomialNB(alpha=alpha)
mnb.fit(X_train, newsgroups_train.target)
score = mnb.score(X_test, newsgroups_test.target)
print(f"Alpha={alpha}: {score:.3f}")
🔤 Text Classification Example
Spam Detection
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
# Sample spam/ham data
data = {
'text': [
'Win free money now!!!',
'Meeting at 3pm today',
'Get rich quick scheme',
'Lunch tomorrow?',
'Claim your prize now',
'Project deadline reminder',
'Hot singles in your area',
'Team meeting notes'
],
'label': ['spam', 'ham', 'spam', 'ham', 'spam', 'ham', 'spam', 'ham']
}
df = pd.DataFrame(data)
# Split data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
df['text'], df['label'], test_size=0.25, random_state=42
)
# Create pipeline
pipeline = Pipeline([
('tfidf', TfidfVectorizer(max_features=100, stop_words='english')),
('classifier', MultinomialNB(alpha=1.0))
])
# Train
pipeline.fit(X_train, y_train)
# Test
print(f"Accuracy: {pipeline.score(X_test, y_test):.3f}")
# Predict new messages
new_messages = [
"You won a million dollars",
"Can we reschedule our meeting?"
]
predictions = pipeline.predict(new_messages)
for msg, pred in zip(new_messages, predictions):
print(f"{msg} → {pred}")
Feature Importance in Text
# Get most important words for each class
import numpy as np
# Extract feature names and per-class log probabilities
# (vectorizer and mnb are the CountVectorizer and MultinomialNB fitted on the 20 newsgroups data above)
feature_names = vectorizer.get_feature_names_out()
log_probs = mnb.feature_log_prob_
for i, class_name in enumerate(newsgroups_train.target_names):
# Top 10 words for this class
top_indices = np.argsort(log_probs[i])[-10:]
top_words = [feature_names[j] for j in top_indices]
print(f"\n{class_name}: {', '.join(top_words)}")
🎲 Bernoulli Naive Bayes
For binary features (0/1) - useful for document classification based on word presence/absence.
from sklearn.naive_bayes import BernoulliNB
# Reuse the 20 newsgroups word counts (X_train/X_test were reassigned by the spam example,
# so re-vectorize the newsgroups text explicitly with the fitted CountVectorizer)
X_train_counts = vectorizer.transform(newsgroups_train.data)
X_test_counts = vectorizer.transform(newsgroups_test.data)
# binarize: threshold for converting counts to 0/1 (use None if features are already binary)
bnb = BernoulliNB(alpha=1.0, binarize=0.0)
bnb.fit(X_train_counts, newsgroups_train.target)
y_pred = bnb.predict(X_test_counts)
print(f"Bernoulli NB Accuracy: {accuracy_score(newsgroups_test.target, y_pred):.3f}")
# Bernoulli is better when:
# - Features are binary
# - Small datasets
# - Documents are short
📊 Comparing Naive Bayes Types
| Type | Feature Type | Use Case | Example |
|---|---|---|---|
| Gaussian | Continuous (real values) | Numerical features | Iris classification, sensor data |
| Multinomial | Discrete counts | Text classification | Spam detection, sentiment analysis |
| Bernoulli | Binary (0/1) | Document classification | Word presence/absence |
| Complement | Discrete counts | Imbalanced text data | Imbalanced document classes |
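ComplementNB appears in the table but has no example above. Here is a minimal sketch on the 20 newsgroups word counts, reusing the X_train_counts / X_test_counts matrices built in the Bernoulli section (this dataset is not strongly imbalanced, so treat it as an API illustration rather than a showcase):
from sklearn.naive_bayes import ComplementNB
# ComplementNB estimates feature statistics from the complement of each class,
# which tends to be more robust when class sizes are uneven
cnb = ComplementNB(alpha=1.0)
cnb.fit(X_train_counts, newsgroups_train.target)
print(f"Complement NB Accuracy: {cnb.score(X_test_counts, newsgroups_test.target):.3f}")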
🔧 Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
# For Multinomial NB (tuned on the X_train_counts word-count matrices built above)
param_grid = {
'alpha': [0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
'fit_prior': [True, False] # Learn class priors or assume uniform
}
mnb = MultinomialNB()
grid_search = GridSearchCV(mnb, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_counts, newsgroups_train.target)
print("Best parameters:", grid_search.best_params_)
print(f"Best CV score: {grid_search.best_score_:.3f}")
# Test best model
best_mnb = grid_search.best_estimator_
test_score = best_mnb.score(X_test_counts, newsgroups_test.target)
print(f"Test accuracy: {test_score:.3f}")
✅ Advantages & Disadvantages
| Advantages ✅ | Disadvantages ❌ |
|---|---|
| Very fast training and prediction | Independence assumption rarely true |
| Works well with small datasets | Can't learn feature interactions |
| Handles high-dimensional data | Zero-frequency problem (needs smoothing) |
| Excellent for text classification | Probability estimates can be poor |
| Not sensitive to irrelevant features | Often outperformed by more complex models |
| Probabilistic predictions | Assumes a specific feature distribution (e.g., Gaussian) |
⚠️ Important Considerations
1. The "Naive" Assumption
Naive Bayes assumes features are conditionally independent given the class. In spam detection, for example, it treats "free" and "money" as independent even though they often appear together. Despite this, Naive Bayes usually works well in practice!
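Written out, the assumption lets the likelihood factor into one term per feature (x1 ... xn are the individual features, e.g. word counts):
P(Features|Class) = P(x1|Class) × P(x2|Class) × ... × P(xn|Class)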
2. Zero Probability Problem
# Without smoothing: if a word never appears in the training data for a class,
# its likelihood is zero, which drives that class's whole probability product to zero!
# Solution: Laplace smoothing (alpha parameter)
mnb = MultinomialNB(alpha=1.0) # Add-one smoothing
# Alpha adds pseudo-counts to all features:
# P(word|class) = (count + alpha) / (total_count + alpha * vocab_size)
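A quick worked example with made-up counts shows the effect: suppose a class has 100 total word occurrences over a 1,000-word vocabulary and the word "prize" never appears in that class.
# Made-up counts to show Laplace smoothing in action
count, total_count, vocab_size, alpha = 0, 100, 1000, 1.0
p_unsmoothed = count / total_count                                 # 0.0 - wipes out the whole product
p_smoothed = (count + alpha) / (total_count + alpha * vocab_size)  # ≈ 0.00091 - small but non-zero
print(p_unsmoothed, p_smoothed)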
3. Feature Scaling Not Needed
Unlike SVM or KNN, Naive Bayes doesn't require feature scaling: it models each feature's distribution (or counts) per class rather than computing distances between samples.
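As a quick sanity check (a sketch, not a proof), standardizing the iris features and refitting GaussianNB should leave the predictions essentially unchanged; the split is redone here with fresh variable names since X_train was reused for the text examples above.
from sklearn.preprocessing import StandardScaler
import numpy as np
# Redo the iris split with unambiguous names
Xi_train, Xi_test, yi_train, yi_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
scaler = StandardScaler().fit(Xi_train)
gnb_raw = GaussianNB().fit(Xi_train, yi_train)
gnb_scaled = GaussianNB().fit(scaler.transform(Xi_train), yi_train)
agreement = np.mean(gnb_raw.predict(Xi_test) == gnb_scaled.predict(scaler.transform(Xi_test)))
print(f"Fraction of identical predictions: {agreement:.3f}")  # expect 1.0 (or very close)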
💡 Best Practices
- Text classification: Use MultinomialNB with TF-IDF or CountVectorizer
- Continuous features: Use GaussianNB
- Always use smoothing: Set alpha ≥ 1 to avoid zero probabilities
- Baseline model: Great first model - fast and often competitive
- High dimensions: Naive Bayes copes well, since it estimates each feature's distribution separately instead of modeling the joint feature space
- Imbalanced text: Try ComplementNB instead of MultinomialNB
- Pipeline: Combine with vectorizers in sklearn Pipeline
🎯 Real-World Applications
- Spam Detection: Email/SMS filtering
- Sentiment Analysis: Product reviews, social media
- Document Categorization: News articles, research papers
- Medical Diagnosis: Disease prediction from symptoms
- Recommendation Systems: Collaborative filtering
- Real-time Classification: Fast prediction required
🎯 When to Use Naive Bayes
✅ Use Naive Bayes When:
- Text classification problems
- Need fast training/prediction
- Small training dataset
- High-dimensional data
- Need probabilistic outputs
- Baseline model for comparison
- Real-time predictions needed
❌ Avoid Naive Bayes When:
- Features highly correlated
- Need feature interactions
- Need best possible accuracy
- Features don't fit assumptions
- Accurate probabilities needed
- Complex decision boundaries
🎯 Key Takeaways
- Naive Bayes uses Bayes' Theorem with independence assumption
- GaussianNB for continuous features
- MultinomialNB for count data (text classification)
- BernoulliNB for binary features
- Fast and efficient - excellent for text and baselines
- Alpha smoothing prevents zero probabilities
- Works well despite naive independence assumption