Ullman, T. D., Goodman, N. D., & Tenenbaum, J. B. (2012). Theory learning as stochastic search in the language of thought. Cognitive Development, 27(4). doi:10.1016/j.cogdev.2012.07.005

Summary

In this paper, Ullman et al. propose an algorithmic-level account of theory learning. In contrast to Griffiths & Tenenbaum and Kemp et al., who proposed computational-level models of theory learning, Ullman et al. build on that previous work (i.e., they still use a hierarchical Bayesian model) and demonstrate that an MCMC-like search procedure can account for learning dynamics similar to those found in children.

Methods

n/a

Algorithm

Ullman et al. express theories using Horn clauses, which are logical expressions of the form $r\leftarrow (f\wedge g\wedge\ldots{}\wedge s\wedge t)$, where each term is a predicate such as $f(X)$ or $s(X, Y)$ and $X$ and $Y$ are entities in the domain.
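
To make the representation concrete, here is a minimal Python sketch of how such a Horn-clause law could be encoded as data; the class and field names are illustrative, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Atom:
    predicate: str   # e.g. "interacts" or "f"
    args: tuple      # variable names, e.g. ("X", "Y")

@dataclass(frozen=True)
class HornClause:
    head: Atom       # the consequent r
    body: tuple      # the conjunction of antecedent atoms

# interacts(X, Y) <- f(X) ∧ g(Y)
law = HornClause(
    head=Atom("interacts", ("X", "Y")),
    body=(Atom("f", ("X",)), Atom("g", ("Y",))),
)
```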

A theory $T$ is produced from a universal theory grammar $U$, which is a probabilistic context-free Horn clause grammar (PHCG). This grammar includes what they call law templates, which encode assumptions about what kinds of laws are likely to be good. For example, one template is $F(X,Y)\leftarrow F(X,Z)\wedge F(Z,Y)$, which captures transitivity. The $F$ could then be replaced with something like $\mathrm{interacts}(\cdot{})$.
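
As a rough illustration (reusing the Atom and HornClause classes from the sketch above), instantiating a template amounts to substituting the abstract predicate symbol $F$ with a concrete predicate; the instantiate function and its bindings argument are my own names, not the paper's.

```python
TRANSITIVITY_TEMPLATE = HornClause(
    head=Atom("F", ("X", "Y")),
    body=(Atom("F", ("X", "Z")), Atom("F", ("Z", "Y"))),
)

def instantiate(template: HornClause, bindings: dict) -> HornClause:
    """Replace abstract predicate symbols (e.g. "F") with concrete ones."""
    def sub(atom: Atom) -> Atom:
        return Atom(bindings.get(atom.predicate, atom.predicate), atom.args)
    return HornClause(head=sub(template.head),
                      body=tuple(sub(a) for a in template.body))

# F(X,Y) <- F(X,Z) ∧ F(Z,Y)  becomes  interacts(X,Y) <- interacts(X,Z) ∧ interacts(Z,Y)
transitive_law = instantiate(TRANSITIVITY_TEMPLATE, {"F": "interacts"})
```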

Together, the grammar and the templates specify the prior $p(T\vert U)$ over theories. A particular theory in the magnetism domain, for example, might have core predicates $f(X)$ and $g(X)$, a surface predicate $\mathrm{interacts}(X, Y)$, and laws such as:
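
  • $\mathrm{interacts}(X,Y)\leftarrow f(X)\wedge f(Y)$
  • $\mathrm{interacts}(X,Y)\leftarrow f(X)\wedge g(Y)$
  • $\mathrm{interacts}(X,Y)\leftarrow \mathrm{interacts}(Y,X)$

(roughly: magnets interact with magnets, magnets interact with magnetic objects, and interaction is symmetric).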

Given a theory, a model $M$ can be produced that fits the data. For example, in the magnetism domain, the model might assign magnets to $f(X)$ and magnetic objects to $g(X)$ (with everything else left unassigned). The prior on models $p(M\vert T)$ is a beta prior (per binary fact).

Finally, the data $D$ are actual observations, for example, “this object interacts with this other object”. The likelihood $p(D\vert M, T)$ is Bernoulli.
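
Here is a minimal sketch of what such a likelihood could look like, assuming the theory plus model deterministically entails a set of facts and each observed binary fact is flipped with a small noise probability; the exact parameterization is my guess, not necessarily the paper's.

```python
import math

def log_likelihood(observations, predicted, eps=0.05):
    """Bernoulli log-likelihood of observed binary facts.

    observations : dict mapping a fact, e.g. ("interacts", "a", "b"), to True/False
    predicted    : set of facts entailed by the theory T and model M
    eps          : assumed noise probability (an illustrative value)
    """
    ll = 0.0
    for fact, observed_true in observations.items():
        p_true = 1.0 - eps if fact in predicted else eps
        ll += math.log(p_true if observed_true else 1.0 - p_true)
    return ll

# Toy usage: the theory/model predicts a and b interact, and nothing involves c.
predicted = {("interacts", "a", "b"), ("interacts", "b", "a")}
observations = {("interacts", "a", "b"): True, ("interacts", "a", "c"): False}
print(log_likelihood(observations, predicted))
```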

Thus the full model is something like:
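
$$p(D, M, T \vert U) = p(D \vert M, T)\, p(M \vert T)\, p(T \vert U)$$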

and the posterior over theories, obtained by marginalizing over models, is:
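
$$p(T \vert D, U) \propto p(T \vert U) \sum_{M} p(D \vert M, T)\, p(M \vert T)$$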

To approximate the sum over models, they use Gibbs sampling. To search the space of theories, they use a modified version of MCMC in which the acceptance ratio is annealed, so that changes become less likely to be accepted as time goes on. At each step of the chain they propose a new theory using one of the following modifications (a sketch of this search loop follows the list):

  • Add or delete a predicate from a law
  • Change one predicate to an alternative of the same or different form within a law
  • Add or delete a whole law
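
Here is a minimal sketch of what such an annealed stochastic search over theories could look like; the propose and log_posterior functions are placeholders for the moves and scoring described above, the geometric cooling schedule is my own assumption, and the sketch ignores any proposal-asymmetry correction.

```python
import math
import random

def annealed_theory_search(initial_theory, log_posterior, propose,
                           n_steps=2000, start_temp=1.0, cooling=0.999):
    """Metropolis-style stochastic search over theories with an annealed
    acceptance rule: as the temperature falls, worse proposals are accepted
    less and less often.

    log_posterior(theory) -> unnormalized log p(T | D, U)
    propose(theory)       -> a new theory produced by one of the moves above
    """
    theory = initial_theory
    score = log_posterior(theory)
    temperature = start_temp
    for _ in range(n_steps):
        candidate = propose(theory)
        candidate_score = log_posterior(candidate)
        accept_prob = math.exp(min(0.0, (candidate_score - score) / temperature))
        if random.random() < accept_prob:
            theory, score = candidate, candidate_score
        temperature *= cooling  # assumed geometric cooling schedule
    return theory, score
```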

Takeaways

This approach of performing a stochastic search in theory space seems generally plausible as a mechanism for how children learn theories. However, I think there are a few things here that aren’t fully satisfying:

  1. The MCMC chain takes many iterations to find the target theory: for example, 1600 in the magnetism case. I realize you can’t exactly map inference (“thinking”) time directly onto MCMC steps, but it seems like a lot. If this is a mechanistic account, then it’s essentially saying that children propose 1600 different theories before they reach the final one.
  2. Children presumably don’t do inference over the entire dataset at once; I would expect there to be some sort of online learning going on. Ullman et al. do investigate the effect of varying amounts of data on the learning process, but they use separate datasets rather than looking at an online method.

Related to the last point, I suppose there are two options for what people do: either they use some memoryless inference procedure, or they do inference over a stored set of data, which I would expect not to include everything they’ve ever encountered. There will be some effect of memory (perhaps favoring data points that were surprising, as well as more recent ones).

I wonder if, rather than using a random-walk MCMC algorithm, an active search or Bayesian-optimization-style algorithm might work better here in terms of sample efficiency.