Grush, R. (2004). The emulation theory of representation: motor control, imagery, and perception. The Behavioral And Brain Sciences, 27(3), 377–96; discussion 396–442. doi:10.1017/S0140525X04000093

Summary

Wow, this was such a cool paper to read! Grush lays out the “emulation” theory of motor control/perception/imagery/etc. which is very similar to the theory that I have been thinking of (but that I haven’t been able to explain nearly as eloquently). It is what he calls a “pseudo-closed-loop control” scheme, inspired by the ideas behind Kalman filtering. The core ideas from the Kalman filter are:

There is some process that is created by a combination of a driving force as well as the internal process mechanism (and some noise in that mechanism). Measurements are made of the state of this process, but they are also corrupted by noise.
There is an internal model (emulator) of the process mechanisms. Using this internal model plus the driving force, the next state of the system is computed.
A measurement is made of the simulated state, and it is compared to the measurement that was made from the true process. The difference between these two measurements is combined with the inverse of the observation in order to compute a residual correction.
Some proportion of correction is applied to the estimated state in order to produce a corrected estimate of the measurement.

In terms of cognition, the process is something like the state of a limb, or a representation of a scene. The driving force is the motor commands that are issued to change the state of the system. So, the hypothesis is essentially that during motor control or perception, the mind uses an internal model (the emulator) to keep track of a better estimate of the state of the system. But, by having an emulator, it can also be used for other things (such as mental imagery). In the case of imagery, the emulator itself can be used to make predictions about the future state of the system (without applying corrections).

Motor imagery

Grush makes the distinction between emulation theory and simulation theory, where simulation theory is the idea that motor imagery comes from simulating motor commands, but not actually executing those controls. Emulation theory includes that aspect, but also includes the idea of having a separate model that emulates the state of the system. To quote from the paper:

The differenceis that the simulation theory does not posit anything corresponding to an emulator; as far as I can tell, the simulator theory is conceived against the backdrop of closed-loop control, and motor imagery hypothesized to be the free-spinning of the controller (motor centers) when disengaged from the plant (body). In the emulation theory, by contrast, imagery is not produced by the mere free-spinning opertion of the efferent motor areas, but by the efferent motor areas driving an emulator of the musculoskeletal system. (pg. 382)

Visual imagery

In discussing how emulation theory applies to perception and visual imagery, Grush makes the distinction between modal emulators and amodal emulators. A modal emulator is an emulator which mimics one of the senses (e.g., by representing the state as pixels). An amodal emulator, by contrast, represents the state using a more structured abstract or symbolic representation (e.g., the underlying geometry, pose, and kinematics of objects and where the agent is in relation to them). The amodal emulator is clearly more flexible and encodes much more information than modal emulators; however, the modal emulators may be better at predicting information that is specific to a particular sensory domain. Thus, Grush proposes that the mind probably makes use of an amodal emulator as well as multiple modal emulators for vision, motor control, audition, etc.

Sensation and perception

Grush goes as far as to say that modal emulators are really used in sensation (i.e. receiving and processing sensory signals) while a particular amodal emulator, the environment emulator, is what really produces our perception (the interpretation of our sensations). The environment emulator is object-centric, and presumably is what allows us to do things like predict physical events.

There are two points that Grush makes here that are, in my opinion, very important. The first is:

… the current scheme, exactly because it treats perception as one aspect of an integrated information processing strategy, sheds light on the nature of perception itself. In the first place, the scheme highlights the extent to which the outcome of the perceptual process, the state estimate embodied in the emulator, is tuned to sensorimotor requirements. The emulator represents objects and the environment as things engaged with in certain ways as opposed to how they are considered apart from their role in the organism’s environmental engagements. (pg. 393)

I think this is an important point because I think the way we reason about the world—and the objects in it—is precisely because we have to interact in the world. The point of having an ability for mental simulation (emulation) isn’t because it’s useful for higher level problem solving or creativity, but because it’s essential for planning and predicting what is going to happen.

The second important point is:

Another shift in emphasis suggested by this account is that perception is shown to be not a matter of starting with materials provided in sensation and filling in blanks until a completed percept is available. Rather, completed percepts of the environment are the starting point, in that the emulator always has a potentially self-contained environment emulator estimate up and running… The role played by sensation is to constrain the configuration and evolution of this representation. In motto form, perception is a controlled hallucination process. (pg. 393)

I love the characterization of perception as a controlled hallucination process. This is also exactly in line with a Bayesian interpretation: the prior is the internal model of how the world works (the hallucination), and the likelihood constrains the prior based on actual observations of the real world.

Takeaways

I think the emulation theory proposed by Grush is exactly the right way to think about perception and action. However, it is also a very high-level schematic of how to approach the problem: there are many questions that aren’t answered. For example:

How are the internal models learned? Here, perspectives from robotics might be particularly useful. For example, something like the “distal teaching” approach discussed by Nguyen-Tuong & Peters for training a mixed model (forward plus inverse models) could be used to learn new forward models and their inverses.
The emulation theory as discussed here just takes as input to the controller some high level goal, which may or may not be produced by using the emulator to make further predictions. What is the tradeoff between using the emulator for tracking the current state of the system vs. doing planning? Or are there multiple emulators used?
How does the controller decide what control signal to produce? Again, this is something that could potentially engage the emulator, which means there is also potentially a tradeoff for how the emulator gets used. For both this point and the previous point, work in robotics may provide some insight.
The Kalman filtering approach described by Grush assumes that the forward dynamics are the same for both the real and imagined process. What if they aren’t? What sorts of implications does this model mismatch have?
If such emulators are used for imagery, how is the initial state for them constructed in the first place?