Ng, A. Y., & Jordan, M. I. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems, 14.

Summary

Ng & Jordan give a nice analysis of generative vs. discriminative classifiers, and show that there are actually two distinct regimes, in each of which a different type of model is preferred. They focus in particular on logistic regression vs. naive Bayes, but say that their analysis should extend to other generative-discriminative pairs of models.

There are two key results in their analysis:

  1. The asymptotic error of a generative linear classifier is at least as large as the asymptotic error of a discriminative linear classifier (which converges to the best possible linear classifier).
  2. The number of training examples a discriminative linear classifier needs to approach its asymptotic error is $O(n)$, where $n$ is the VC dimension. In contrast, a generative linear classifier needs only $O(\log{n})$ examples to approach its own (possibly higher) asymptotic error.

They support these propositions empirically with experiments on a number of datasets from the UCI repository: naive Bayes tends to do better when the training set is small, while logistic regression catches up, and often overtakes it, as more data becomes available.
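A minimal sketch of that kind of experiment (my own setup, not the paper's), using scikit-learn's `GaussianNB` and `LogisticRegression` as the generative/discriminative pair on one arbitrarily chosen dataset; the expectation is that naive Bayes wins at small $m$ and logistic regression overtakes it as $m$ grows:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Train both models on increasingly large prefixes of the training set
# and compare held-out error at each training-set size m.
for m in (20, 50, 150, len(X_tr)):
    nb = GaussianNB().fit(X_tr[:m], y_tr[:m])
    lr = LogisticRegression(max_iter=5000).fit(X_tr[:m], y_tr[:m])
    print(f"m={m:3d}  NB test err={1 - nb.score(X_te, y_te):.3f}  "
          f"LR test err={1 - lr.score(X_te, y_te):.3f}")
```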

Takeaways

The results from Ng & Jordan suggest that generative classifiers may be more useful when only a small amount of data is available. If there is a lot of data, a discriminative classifier may be the better choice because it will ultimately have lower error. The intuition behind this (I think?) is that the generative classifier has to make assumptions about the distributions of both the likelihood and the prior, while the discriminative model does not. If those assumptions happen to be correct, then the generative classifier should have the same asymptotic error as the discriminative classifier. If my intuition here is right, then the question is essentially: can you estimate how incorrect your assumptions are, and then use that estimate (combined with knowledge of how much data you have) to decide whether to train a generative or a discriminative classifier?
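One way to make the "same asymptotic error if the assumptions are correct" point concrete (a standard derivation, not the paper's notation): for binary $y$, Bayes' rule puts the posterior of any generative model into logistic form,

$$
p(y=1 \vert x) = \frac{p(x \vert y=1)\,p(y=1)}{p(x \vert y=1)\,p(y=1) + p(x \vert y=0)\,p(y=0)} = \frac{1}{1 + e^{-a(x)}}, \qquad a(x) = \log\frac{p(x \vert y=1)\,p(y=1)}{p(x \vert y=0)\,p(y=0)}.
$$

Under the naive Bayes assumptions (e.g. Gaussian class-conditionals with shared per-feature variances), $a(x)$ is linear in $x$, which is exactly the form logistic regression fits directly. So when the generative assumptions hold, both models are searching the same hypothesis class and converge to the same decision boundary.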

Of course, there are also other reasons to use a generative model besides faster convergence to its asymptotic error. In many cases, it may be necessary to query the model in both directions (i.e. sometimes you may need $p(x\vert y)$, and sometimes you may need $p(y\vert x)$). Intuitively, if you need to do this, it seems more efficient to train a single generative model (from which you can compute both $p(y\vert x)$ and $p(x\vert y)$) than to train multiple discriminative models.
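A minimal sketch of querying both directions, assuming scikit-learn's Gaussian naive Bayes (the attribute names `theta_` and `var_` are specific to recent scikit-learn versions): one fitted generative model yields $p(y\vert x)$ via Bayes' rule and $p(x\vert y)$ from the learned class-conditional densities.

```python
import numpy as np
from scipy.stats import norm
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
nb = GaussianNB().fit(X, y)
x = X[0]

# p(y|x): the discriminative direction (computed via Bayes' rule internally).
print("p(y|x):", nb.predict_proba(x.reshape(1, -1))[0])

# p(x|y): the generative direction, as a product of the per-feature
# Gaussians the model learned for each class.
for c in range(len(nb.classes_)):
    density = np.prod(norm.pdf(x, loc=nb.theta_[c], scale=np.sqrt(nb.var_[c])))
    print(f"p(x|y={c}) = {density:.3g}")
```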

One question I have is about training discriminative models with generative models: if you learn a generative model first, and then use samples from it to train a discriminative model, how does that affect the discriminative model's error? Can the discriminative model only do as well as the generative model in that case? I want to say the answer is yes, but perhaps if the samples are combined with true data from the world, the discriminative model could eventually achieve lower error.
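A sketch of how one might probe this empirically (my own construction; the dataset, models, and sample sizes are all arbitrary choices): sample a synthetic training set from a fitted naive Bayes model, then compare logistic regression trained on synthetic data alone against logistic regression trained on synthetic plus real data.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

nb = GaussianNB().fit(X_tr, y_tr)

# Sample from the fitted generative model: draw y from the class prior,
# then draw each feature of x from that class's learned Gaussian.
y_syn = rng.choice(nb.classes_, size=2000, p=nb.class_prior_)
X_syn = rng.normal(loc=nb.theta_[y_syn], scale=np.sqrt(nb.var_[y_syn]))

lr_syn = LogisticRegression(max_iter=5000).fit(X_syn, y_syn)
lr_mix = LogisticRegression(max_iter=5000).fit(
    np.vstack([X_syn, X_tr]), np.concatenate([y_syn, y_tr]))

for name, clf in [("naive Bayes", nb),
                  ("LR on synthetic only", lr_syn),
                  ("LR on synthetic + real", lr_mix)]:
    print(f"{name}: test error = {1 - clf.score(X_te, y_te):.3f}")
```

My expectation, per the question above, is that logistic regression trained only on synthetic samples can at best recover the naive Bayes decision boundary, while mixing in real data lets it correct for the generative model's incorrect assumptions.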