Kemp, C., Tenenbaum, J. B., Niyogi, S., & Griffiths, T. L. (2010). A probabilistic model of theory formation. Cognition, 114(2), 165–196. doi:10.1016/j.cognition.2009.09.003

Summary

In this paper, Kemp et al. describe a method for clustering entities based on relations, rather than features. They argue that this clustering algorithm can be thought of as a model of theory formation in people, reflecting how people learn and organize concepts. In particular, they focus on the idea of framework theories, which “specify the fundamental concepts that exist in a domain and the possible relationships between these concepts” (pg. 166). They ask three fundamental questions:

  1. What are theories? Framework theories are “represented as a probabilistic model which includes a set of categories and a matrix of parameters specifying relationships between those categories” (pg. 166)
  2. How are theories used to make inductive inferences? “Each of our theories specifies the relationships between categories that are possible or likely, and predictions about unobserved relationships between entities are guided by inductive inferences about their category assignments” (pg. 166)
  3. How are theories acquired? Kemp et al. consider the case of acquiring both the concepts and causal laws of a theory simultaneously. “Given a formal characterization of a theory, we can set up a space of possible theories and define a prior distribution over this space” (pg. 167)

They apply their model (described below) to several real-world datasets: clustering animals and their features, clustering medical terms and predicates, and recovering kinship categories. They also run two experiments showing that their model matches the way people learn causal theories of physical relationships.

Methods

In Experiment 1, Kemp et al. had people play with a set of blocks. Each block was a member of either category $A$ or category $B$, and depending on the condition, the relationship between $A$ and $B$ was one of the following: $A$ blocks caused $B$ blocks to light up; $A$ blocks caused $B$ blocks to light up and vice versa; $A$ blocks caused $B$ blocks to light up with $p=0.5$; or $A$ blocks caused $B$ blocks to light up with $p=0.5$ and vice versa. Note that specific pairs of blocks behaved deterministically; $p$ was used to determine which pairs of blocks would affect each other. Participants got experience playing with the blocks and were periodically asked to predict what would happen if a new block were touched to an old block. They then saw the new block touch a different old block and were asked the same question.

Experiment 2 was similar to Experiment 1, except that only the relationship where $A$ blocks caused $B$ blocks to light up and vice versa was used. After getting experience with the blocks, participants were shown three new blocks, $x$, $y$, and $z$, where either $x$ and $z$ were in the same category and $y$ was in a different category, or where $y$ and $z$ were in the same category and $x$ was in a different category. By seeing that $x$ interacts with $y$ and that $z$ interacts with $y$, participants should (and do) infer that $x$ and $z$ are in the same category—but they shouldn’t know which one. After seeing $z$ interact (or not interact) with an old block, participants should (and do) infer which categories $x$ and $y$ are in.

The model matches people’s behavior in both of these experiments. In contrast, a purely feature-based categorization model (which does not take relations into account) fails to generalize appropriately, particularly in Experiment 2.
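To make the Experiment 2 inference concrete, here is a toy enumeration over category assignments. This is my own sketch rather than the authors’ model; the block names, the noise rate `eps`, and the uniform prior over assignments are all assumptions:

```python
from itertools import product

# Toy illustration of the Experiment 2 inference (my own construction, not
# the authors' model). Categories A and B: cross-category block pairs light
# up with probability 1 - eps, same-category pairs with probability eps.
known = {'a1': 'A'}                      # an old block with a known category
new_blocks = ['x', 'y', 'z']             # new blocks with unknown categories
# Observations as (block, block, lit): x activates y, z activates y,
# and finally z activates the known A block.
obs = [('x', 'y', True), ('z', 'y', True), ('z', 'a1', True)]

def likelihood(assign, eps=0.01):
    # P(observations | category assignment)
    p = 1.0
    for u, v, lit in obs:
        cu = assign.get(u, known.get(u))
        cv = assign.get(v, known.get(v))
        p_lit = 1 - eps if cu != cv else eps
        p *= p_lit if lit else 1 - p_lit
    return p

# Enumerate all 2^3 assignments under a uniform prior and normalize.
posterior = {cats: likelihood(dict(zip(new_blocks, cats)))
             for cats in product('AB', repeat=len(new_blocks))}
total = sum(posterior.values())
for cats, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(dict(zip(new_blocks, cats)), round(p / total, 3))
```

With only the first two observations, the complementary assignments $(x{=}B, y{=}A, z{=}B)$ and $(x{=}A, y{=}B, z{=}A)$ tie; the third observation (seeing $z$ activate a known $A$ block) breaks the tie, which is exactly the pattern of inference described above.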

Algorithm

Let the observed data consist of one or more relations ($R$) over $n$ types of entities. A relational system is then characterized by a pair $(z, \eta)$, where $z$ is a partition of entities into categories and $\eta$ is a matrix of parameters specifying how the categories interact with each other under a relation $R$ (i.e., a category graph in which the edge from $A$ to $B$ has weight $\eta(A,B)$). The theory that best characterizes the data is given by the MAP estimate of:

$$p(z, \eta \vert R) \propto p(R \vert \eta, z)\, p(\eta \vert z)\, p(z)$$

where $p(R\vert \eta,z)$ is the probability of observing the relations given categories and parameters, $p(\eta\vert z)$ is the probability of the parameters given the partition, and $p(z)$ is the prior over category assignments. More formally, for a single binary relation:

$$z \sim \mathrm{CRP}(\gamma), \qquad \eta(a,b) \,\vert\, z \sim \mathrm{Beta}(\alpha, \beta), \qquad R(i,j) \,\vert\, z, \eta \sim \mathrm{Bernoulli}(\eta(z_i, z_j))$$
In this case, there is one relation and one type (though multiple entities within that type); in the general case, there are multiple relations over $n$ types, each relation spanning some number of dimensions. If $T^i$ is the $i$th type, then let $z^i$ be the partition of $T^i$ into categories. If $R^j$ is the $j$th relation, defined as $R^j:T^{d_1}\times T^{d_2}\times \ldots{}\times T^{d_m}$ (where $d_k$ is the type corresponding to the $k$th dimension), then let $\eta^j$ be a separate parameter matrix for $R^j$. The full general specification is then:

$$p\big(\{z^i\}, \{\eta^j\} \,\vert\, \{R^j\}\big) \propto \prod_j p\big(R^j \,\vert\, \eta^j, z\big)\, p\big(\eta^j \,\vert\, z\big)\, \prod_i p\big(z^i\big)$$

where $z$ collects the partitions of the types participating in each $R^j$.
To fit the model, they first integrate out $\eta$ and find the best category assignments $z$ using a hill-climbing procedure or MCMC. Given $z$, the MAP parameter assignments $\eta$ can be computed analytically. Similarly, unobserved relations between specific entities can be predicted given $z$.
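Integrating out $\eta$ under the conjugate Beta prior gives a closed-form marginal likelihood, $p(R \vert z) = \prod_{a,b} B(\alpha + n_1^{ab}, \beta + n_0^{ab}) / B(\alpha, \beta)$, where $n_1^{ab}$ and $n_0^{ab}$ count the observed 1s and 0s in the block of $R$ relating categories $a$ and $b$. Below is a minimal sketch of the single-relation, single-type case. It is my own code, not the authors’: it assumes a fully observed binary relation, a symmetric $\mathrm{Beta}(\alpha, \alpha)$ prior on $\eta$, and a CRP prior on $z$:

```python
import numpy as np
from math import lgamma, log
from scipy.special import betaln

def log_likelihood(R, z, alpha=0.5):
    # p(R | z) with eta integrated out: each block (a, b) of the category
    # graph contributes a Beta-Bernoulli marginal over its entries.
    ll = 0.0
    for a in np.unique(z):
        for b in np.unique(z):
            block = R[np.ix_(z == a, z == b)]
            ones, total = int(block.sum()), block.size
            ll += betaln(alpha + ones, alpha + total - ones) - betaln(alpha, alpha)
    return ll

def crp_log_prior(z, gamma=1.0):
    # Chinese restaurant process prior over partitions of the entities.
    sizes = np.bincount(z)
    sizes = sizes[sizes > 0]
    return (len(sizes) * log(gamma) + sum(lgamma(s) for s in sizes)
            + lgamma(gamma) - lgamma(gamma + len(z)))

def score(R, z):
    return log_likelihood(R, z) + crp_log_prior(z)

def hill_climb(R, n_restarts=5, seed=0):
    # Greedy MAP search: repeatedly move single entities to whichever
    # category (existing or brand new) most improves the posterior score.
    rng = np.random.default_rng(seed)
    n = R.shape[0]
    best_z, best_s = None, -np.inf
    for _ in range(n_restarts):
        z = rng.integers(0, 2, size=n)  # random initial two-way partition
        s, improved = score(R, z), True
        while improved:
            improved = False
            for i in range(n):
                best_c = z[i]
                for c in list(np.unique(z)) + [z.max() + 1]:
                    z[i] = c
                    s_new = score(R, z)
                    if s_new > s:
                        s, best_c, improved = s_new, c, True
                z[i] = best_c
        if s > best_s:
            best_z, best_s = z.copy(), s
    return best_z, best_s

def predict(R, z, i, j, alpha=0.5):
    # Posterior mean of eta for the block containing the pair (i, j); a
    # sketch only -- in practice the entry (i, j) itself would be held out.
    block = R[np.ix_(z == z[i], z == z[j])]
    return (alpha + block.sum()) / (2 * alpha + block.size)
```

Here `hill_climb` implements the greedy search over partitions, and `predict` returns the posterior mean of $\eta$ for the block containing a pair $(i, j)$, which is how an unobserved relation between two specific entities would be predicted given $z$.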

Takeaways

This framework seems to be a generalization of the one described by Griffiths & Tenenbaum; for example, Kemp et al.’s experiments here are very similar to the “blicket detector” paradigm. However, Griffiths & Tenenbaum only consider there to be blickets and non-blickets, whereas here the number of categories is left unspecified. On reflection, it seems not so much that this is a generalization of Griffiths & Tenenbaum as that it is a more abstract sort of learning: it clusters entities and relations into an ontology that could be used by the framework described by Griffiths & Tenenbaum, but not necessarily a causal ontology.

I am less clear on exactly how this framework for theories relates to the one discussed in the other paper by Kemp et al. In that paper, they discuss how to parse objects into different ontological kinds, which is essentially a type of feature-based clustering. I suppose that model is meant to be applied at a more domain-specific level, e.g. perhaps within a single category of the kind discovered by the present paper. It would be nice to see a tighter coupling between these different types of ontologies and ways of learning theories.