Why do naive Bayesian classifiers perform so well?
Naive Bayes classifiers are a popular choice for classification problems. There are many reasons for this, including:
- “Zeitgeist”: widespread awareness following the success of Bayesian spam filters about ten years ago
- Easy to implement
- The model is fast to train
- The model can be updated incrementally with new training data, without rebuilding it from scratch (see the sketch below)
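As a rough sketch of the last three points, here is a minimal count-based naive Bayes in Python. The class and method names are my own, not from any particular library (real implementations, e.g. scikit-learn's naive Bayes estimators with `partial_fit`, handle the same idea more robustly); the point is that the entire model state is a handful of counters, so "training" and "updating" are the same cheap operation:

```python
from collections import defaultdict
import math

class TinyNaiveBayes:
    """Count-based (multinomial-style) naive Bayes. All model state is a set of
    counters, so new training examples just increment counts -- no rebuild needed."""

    def __init__(self):
        self.class_counts = defaultdict(int)                          # N(class)
        self.feature_counts = defaultdict(lambda: defaultdict(int))   # N(feature, class)
        self.vocabulary = set()

    def update(self, features, label):
        # "Training" is only counting, which is why the model is fast to build
        # and trivially updatable with new data.
        self.class_counts[label] += 1
        for f in features:
            self.feature_counts[label][f] += 1
            self.vocabulary.add(f)

    def predict(self, features):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for c, n_c in self.class_counts.items():
            # log P(c) + sum_i log P(f_i | c), with add-one (Laplace) smoothing
            score = math.log(n_c / total_docs)
            denom = sum(self.feature_counts[c].values()) + len(self.vocabulary)
            for f in features:
                score += math.log((self.feature_counts[c][f] + 1) / denom)
            if score > best_score:
                best_label, best_score = c, score
        return best_label

# Toy spam filter, updated one message at a time
nb = TinyNaiveBayes()
nb.update(["cheap", "pills", "buy"], "spam")
nb.update(["meeting", "tomorrow", "agenda"], "ham")
nb.update(["buy", "now", "cheap"], "spam")
print(nb.predict(["cheap", "meeting"]))
```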
However, they are ‘naive’, i.e. they assume the features are conditionally independent given the class. This contrasts with other classifiers, such as Maximum Entropy classifiers, which make no such assumption but are slower to train.
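Written out, the assumption is that the joint likelihood of the features factorises per feature once you condition on the class:

$$
P(C \mid x_1, \dots, x_n) \;\propto\; P(C)\, P(x_1, \dots, x_n \mid C)
\;=\; P(C) \prod_{i=1}^{n} P(x_i \mid C)
$$

where the last equality holds only under the conditional-independence assumption.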
The independence assumption rarely holds in practice, and in many (most?) cases, including the spam filter example, it is simply wrong.
So why does the naive Bayes classifier still perform very well in such applications, even when the features are not independent of each other?
This paper seems to prove (I can’t follow the math) that naive Bayes is optimal not only when the features are independent, but also when the dependencies among features are distributed similarly across the classes, or cancel each other out:
> In this paper, we propose a novel explanation on the superb classification performance of naive Bayes. We show that, essentially, the dependence distribution; i.e., how the local dependence of a node distributes in each class, evenly or unevenly, and how the local dependencies of all nodes work together, consistently (supporting a certain classification) or inconsistently (canceling each other out), plays a crucial role. Therefore, no matter how strong the dependences among attributes are, naive Bayes can still be optimal if the dependences distribute evenly in classes, or if the dependences cancel each other out.
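To make that claim concrete, here is a small simulation sketch of my own (not from the paper, and Gaussian rather than the paper's discrete setting): two strongly correlated features whose correlation is identical in both classes, so the dependence is "distributed evenly". Naive Bayes, which ignores the correlation, ends up with essentially the same decision boundary, and the same accuracy, as LDA, which models the full covariance:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Both classes share the SAME strong feature correlation (rho = 0.9) and differ
# only in their means, so the dependence is distributed evenly across classes.
cov = [[1.0, 0.9], [0.9, 1.0]]
X = np.vstack([
    rng.multivariate_normal([0.0, 0.0], cov, size=5000),
    rng.multivariate_normal([1.0, 1.0], cov, size=5000),
])
y = np.array([0] * 5000 + [1] * 5000)

# GaussianNB assumes a diagonal covariance (features independent given the class);
# LDA estimates the full shared covariance, i.e. it "knows about" the correlation.
nb_acc = GaussianNB().fit(X, y).score(X, y)
lda_acc = LinearDiscriminantAnalysis().fit(X, y).score(X, y)
print(f"naive Bayes accuracy:         {nb_acc:.3f}")
print(f"full-covariance LDA accuracy: {lda_acc:.3f}")
```

If instead you flip the sign of the correlation in one of the two classes, so the dependence is no longer distributed evenly, a model that captures the per-class covariance (e.g. a quadratic discriminant) can start to pull ahead of naive Bayes.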