Bever, T. G., & Poeppel, D. (2010). Analysis by Synthesis: A (Re-)Emerging Program of Research for Language and Vision. Biolinguistics, 43(2), 174–200. Retrieved from http://www.psych.nyu.edu/clash/dp_papers/bever.poeppel.pdf

Summary

Bever & Poeppel describe the analysis by synthesis (AxS) approach and how it applies to language. AxS was apparently first proposed by Halle & Stevens as a hypothesis for how speech production works. It follows the following steps:

  1. Form a rough hypothesis about the input based on simple cues
  2. Synthesize a full simulation of the input based on that hypothesis
  3. Compare the simulated input to the real input
  4. If the two match, then the structure of 2 is taken to be the true structure of the input

If they don’t match, then some iterative process of error reduction is performed. This approach is motivated by the idea that “it is computationally intractable to go directly from the more concrete to the more abstract representation by way of filters or other kinds of ‘bottom-up’ triggering templates” (pg. 177).

Bever & Poepple go on to discuss how AxS is related to the motor theory of speech perception, how it is already being (implicitly) used in many automatic speech recognition systems, how it relates to AxS in vision, and the compatibility with Bayesian models.

Motor theory of speech perception

The motor theory of speech perception states that “listeners are reconstructing the articulatory gestures of the speaker, and using those as the trigger for the perception of the underlying intended sequence of phones as though they actually occurred acoustically” (pg. 179). To me, this sounds a bit like the “simulation theory” in theory of mind.

Automatic speech recognition

They note that automatic speech recognition systems often use generative models, for example of “words-to-waveforms”. This is different from the AxS approach proposed by Halle & Stevens, however, in that AxS uses a model of the articulatory system, while the ASR approaches store “vectors representing [Gaussian] mean and variance of spectral slices”. I suppose this difference essentially makes AxS closer to the motor theory of speech perception. At the computational level, I’m not sure how much this distinction matters, assuming the “fixed” model of speech production is detailed enough to capture the full range of effects that can actually occur when speaking a word.

Analysis by synthesis in vision

Bever & Poepple reference Yuille & Kersten, among others, and discuss how AxS has been fruitful in vision. They note that given the wide array of cross-modal effects, if such a AxS process exists in vision, then it is likely to occur for audition/speech perception as well, perhaps utilizing some sort of general-purpose mechanism.

Bayesian approach

Ultimately, they say that there is no conflict with the Bayesian approach to implementing AxS. I agree with this; there’s no fundamental difference, using a Bayesian model is just a different way of formalizing it.

Methods

n/a

Algorithm

n/a

Takeaways

AxS seems to be a promising approach in both vision and language. However, as I discussed a bit in my notes on Yuille & Kersten, I don’t necessarily see why the “synthesis” component is so crucial. If that’s what is necessary to compute the probability (e.g., if you are using some approach like Approximate Bayesian Computation), then sure, but I’m not convinced that it is. As long as you can compute the PDF at a particular point, then you can evaluate the likelihood of the data given the model without necessarily needing to created a synthesized version of the image or sound production.

So in general, my takeaway from the “analysis by synthesis” approach doesn’t so much have to do with synthesis itself, but of the combination of low-level cues to generate hypotheses, and top-down knowledge to constrain those hypotheses. The reason for having this particular approach (as opposed to just doing inference over all hypotheses) is that the space of hypotheses is large enough so as to make it intractable to consider all possible hypotheses.