Yuille, A. L., & Kersten, D. (2006). Vision as Bayesian inference: analysis by synthesis? Trends in Cognitive Sciences, 10(7), 301–308. doi:10.1016/j.tics.2006.05.002

Summary

Yuille & Kersten argue that vision research should not focus on simple, artificial stimuli (as is normally the case), but on real, naturalistic images. The reason naturalistic images haven’t traditionally been used is that it is unclear how to analyze visual processing on them, given their complexity. To address this issue, they suggest relying on generative Bayesian models as a way to formalize theories of visual processing, and they present an example of one such model.

Methods

n/a

Algorithm

Their model has two main components: one for bottom-up proposals, and one for top-down validation:

We propose an “analysis by synthesis” strategy where low-level cues, combined with spatial grouping rules (similar to Gestalt laws), make bottom-up proposals which activate hypotheses about objects and scene structures. These hypotheses are accepted, or rejected, by direct comparison to the image (or a filtered version of it) in a top-down process.

They use a model similar to a probabilistic context-free grammar (PCFG), in which each non-terminal node $i$ has attributes for the model type $\zeta_i$ (face, texture, shading, etc.), the region of the image $L_i$ that the model generates, and the model parameters $\theta_i$. The set of all non-terminal nodes is denoted by $W$, which has its own prior $p(W)$.
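
As far as I can tell from the underlying Tu et al. (2005) formulation, this prior factorizes over the nodes, roughly as:

$$p(W) = p(N) \prod_{i=1}^{N} p(\zeta_i, L_i, \theta_i)$$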

Note that the number of non-terminal nodes, $N$, is also a random variable – so the set of non-terminal nodes is itself part of the structure that inference is performed over.

Terminal nodes correspond to the actual image, and each has a likelihood of $p(I_{R(L)}\vert \zeta,L,\theta)$ for the particular region of the image it covers. Combining all the regions gives the overall likelihood for an image.
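
Given those per-region likelihoods, the overall likelihood should just be the product over the regions generated by the non-terminal nodes:

$$p(I\vert W) = \prod_{i=1}^{N} p(I_{R(L_i)}\vert \zeta_i, L_i, \theta_i)$$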

Now, given the prior $p(W)$ and the likelihood $p(I\vert W)$, inference can be performed to compute $\mathrm{arg}\max_W p(W\vert I)$, which corresponds to the most likely parse of the scene given the observed image. To actually perform inference, they use “data-driven MCMC”, which uses bottom-up discriminative cues to make proposals for a “transition kernel”.
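
If I’m reading the description correctly, the full transition kernel is a weighted mixture of sub-kernels (with weights $\rho_i$ controlling how often each type of move is attempted), something like:

$$K(W^\prime \vert W; I) = \sum_i \rho_i\, K_i(W^\prime \vert W; I)$$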

where each $K_i$ gives the probability for transitioning from $W$ to $W^\prime$ and corresponds to a particular operation on the parse tree. The $q_i$ function is the discriminative proposal function, and $a_i$ is the acceptance function.
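
The paper describes this only at a high level, but the propose-and-verify structure is easy to illustrate in a toy setting. Here is a minimal sketch (my own construction, not the authors’ algorithm or code): a 1D “image” with a single unknown region boundary, where a bottom-up edge cue plays the role of the discriminative proposal $q$, and a top-down comparison of a synthesized piecewise-constant prediction against the image drives the acceptance $a$, via an independence Metropolis-Hastings step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1D "image": two constant-intensity regions separated by one boundary, plus noise.
n, true_boundary, noise_sigma = 100, 70, 0.5
signal = np.concatenate([np.full(true_boundary, 1.0), np.full(n - true_boundary, 3.0)])
image = signal + rng.normal(0.0, noise_sigma, size=n)

def log_likelihood(boundary):
    """Top-down check: synthesize a piecewise-constant prediction for this
    hypothesized boundary and score the observed image under Gaussian noise."""
    left, right = image[:boundary], image[boundary:]
    prediction = np.concatenate([np.full(boundary, left.mean()),
                                 np.full(n - boundary, right.mean())])
    return -0.5 * np.sum((image - prediction) ** 2) / noise_sigma**2

# Bottom-up discriminative cue: local edge strength. Proposals are drawn in
# proportion to it, so hypotheses concentrate where the data suggest an edge.
candidates = np.arange(1, n)            # possible boundary positions
edge_strength = np.abs(np.diff(image))  # cue for boundary b lives at index b - 1
proposal_probs = edge_strength / edge_strength.sum()

def log_proposal(boundary):
    return np.log(proposal_probs[boundary - 1])

# Independence Metropolis-Hastings with a uniform prior over boundary positions
# (so the prior term cancels in the acceptance ratio).
boundary = int(rng.choice(candidates, p=proposal_probs))
samples = []
for _ in range(5000):
    proposed = int(rng.choice(candidates, p=proposal_probs))           # bottom-up proposal
    log_accept = (log_likelihood(proposed) - log_likelihood(boundary)  # top-down comparison
                  + log_proposal(boundary) - log_proposal(proposed))
    if np.log(rng.random()) < log_accept:
        boundary = proposed
    samples.append(boundary)

print("true boundary:", true_boundary)
print("estimated boundary (posterior mode):", int(np.bincount(samples).argmax()))
```

This is obviously much simpler than the parse-tree moves in the actual model (which create, delete, and modify non-terminal nodes), but it captures the division of labor: the discriminative cue determines where hypotheses come from, and the generative model decides whether to keep them.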

The model in this paper is actually described in more detail in another paper (which I have not yet read through, only the abstract):

Tu, Z., Chen, X., Yuille, A. L., & Zhu, S.-C. (2005). Image Parsing: Unifying Segmentation, Detection, and Recognition. International Journal of Computer Vision, 63(2), 113–140. doi:10.1007/s11263-005-6642-x

Takeaways

Generative Bayesian models let us build powerful, structured models that may better capture complex real-world phenomena such as naturalistic images. While Yuille & Kersten don’t explicitly turn their model of visual segmentation and processing into a theory of visual perception, it certainly provides a foundation that could be used in future work (and perhaps has been – I haven’t investigated this yet).

It’s unclear to me whether there is actual “synthesis” going on in this model. Computing a likelihood function doesn’t necessarily mean you have to sample from that likelihood. Do they actually synthesize image patches and compare them to the true image, or are they just evaluating this likelihood function without needing to synthesize anything? I do think it is important to have the capacity for simulation/synthesis (certainly it seems we have this, based on mental imagery), but I’m not sure it’s necessary for actually performing inference. The generative capacity is necessary, perhaps, but “generative” still does not mean that images are being synthesized.