Halle, M., & Stevens, K. N. (1962). Speech recognition: a model and a program for research. IRE Transactions on Information Theory, 8(2), 155–159. doi:10.1109/TIT.1962.1057686

Summary

Halle & Stevens were the first to propose the technique of “analysis-by-synthesis”. In fact, they did so even earlier than this paper, in:

Halle, M., & Stevens, K. N. (1959). Analysis by synthesis. In W. Wathen-Dunn & L. E. Woods (Eds.), Proceedings of the Seminar on Speech Compression and Processing (Vol. 2).

Unfortunately, I have been unable to find that earlier paper anywhere, but I assume the present paper covers much of the same ground.

Halle & Stevens essentially make an argument for the “infinite use of finite means” (Chomsky, Humboldt), saying:

It is then possible to store in the “permanent memory” of the analyzer only the rules for speech production discussed in the previous section. In this model the dictionary is replaced by generative rules which can synthesize signals in response to instructions consisting of sequences of phonemes… The internally generated signal which provides the best match with the input signal then identifies the required phoneme sequence. (pg. 157)

They go on to discuss the other common features of analysis by synthesis: first, that a “preliminary analysis” can cut down the number of phoneme sequences that need to be tested, and second, that a “control” component can determine the order in which to test them. The data-driven MCMC approach of Yuille & Kersten has echoes of this: in their case, candidate hypotheses are generated from simple cues (the preliminary analysis), and the MCMC algorithm then performs a random walk through hypothesis space, visiting hypotheses in proportion to their posterior probability (the control).
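To make the structure concrete, here is a minimal Python sketch of the synthesize-and-match loop, with hooks for the preliminary analysis and control steps. This is my own toy reconstruction, not the authors’ program: `synthesize`, `preliminary_cues`, and the shortest-first ordering are all placeholder assumptions.

```python
import itertools
import numpy as np

def analysis_by_synthesis(input_signal, phonemes, synthesize, preliminary_cues, max_len=3):
    """Toy analysis-by-synthesis loop (my sketch, not Halle & Stevens' program).

    `synthesize` stands in for the generative rules: it maps a phoneme
    sequence to a signal. `preliminary_cues` stands in for the preliminary
    analysis: it prunes candidates before the synthesize-and-compare step.
    """
    # Preliminary analysis: cut down the space of phoneme sequences.
    candidates = [seq for n in range(1, max_len + 1)
                  for seq in itertools.product(phonemes, repeat=n)
                  if preliminary_cues(input_signal, seq)]

    # Control: order the surviving hypotheses before testing (here,
    # shortest first; the paper leaves the strategy open).
    candidates.sort(key=len)

    # Synthesize each candidate and keep the best match to the input.
    def mismatch(seq):
        return np.sum((synthesize(seq) - input_signal) ** 2)
    return min(candidates, key=mismatch)

# Toy usage: "signals" are 2-D vectors and synthesis is a fixed lookup.
templates = {"a": np.array([1.0, 0.0]), "b": np.array([0.0, 1.0])}
synth = lambda seq: sum(templates[p] for p in seq)
cues = lambda sig, seq: len(seq) <= sig.sum() + 1  # crude length cue
print(analysis_by_synthesis(np.array([1.0, 1.0]), "ab", synth, cues))  # ('a', 'b')
```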

Methods

n/a

Algorithm

n/a

Takeaways

This is the foundational work on analysis by synthesis and is remarkably forward-thinking. Even though it’s applied to speech perception here, it could easily be applied to many other areas of cognition (e.g., vision, as Yuille & Kersten have already shown). There are also a lot of similarities between this approach and the increasingly popular idea of training a deep network on low-level features and then using a structured Bayesian model on top of that network’s output. The similarity is that the deep network takes the place of a powerful learning system for determining relevant cues, while the Bayesian model does the high-level work of constraining and evaluating hypotheses.
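In that hybrid picture, the pipeline might look something like the following sketch. This is my own gloss, not a published model; `cue_network`, `log_prior`, and `log_likelihood` are hypothetical stand-ins for a trained network and a structured Bayesian model.

```python
def hybrid_inference(signal, hypotheses, cue_network, log_prior, log_likelihood, k=5):
    """Bottom-up cues propose, a Bayesian model disposes (toy sketch)."""
    # Bottom-up pass: a learned network scores how strongly low-level
    # cues in the signal support each hypothesis (the "preliminary
    # analysis" role from Halle & Stevens).
    cue_scores = {h: cue_network(signal, h) for h in hypotheses}

    # Keep only the k best cue-supported hypotheses for the expensive
    # top-down evaluation (the "control" role).
    shortlist = sorted(hypotheses, key=lambda h: cue_scores[h], reverse=True)[:k]

    # Top-down pass: score each survivor under the structured Bayesian
    # model and return the maximum a posteriori hypothesis.
    log_post = {h: log_prior(h) + log_likelihood(signal, h) for h in shortlist}
    return max(log_post, key=log_post.get)
```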