Challenges for theories of vision
(This is a hastily written first draft, and liable to change)

Aaron Sloman
School of Computer Science, University of Birmingham

Some challenges for vision theorists and philosophers of mathematics

Three short videos
While watching each video, try to identify changes in spatial relationships, whether they are changes in image contents, or changes in the objects seen (and their relationships), or both.

I suspect you will find that although you cannot perceive absolute spatial relationships, e.g. exact distances, directions, slopes, degrees of curvature of surfaces, trajectories of motion, you can nevertheless with high certainty perceive many partial orderings and changes in partial orderings.

For example, for many pairs of visible surface features on different objects in the scene, you can tell which of the two is closer to the viewer, and whether the perceived 2D (projected) distance between them is increasing or decreasing as the video progresses, e.g. the perceived size of the gap between two bars in the back of the chair, or the perceived projected distance between part of the top edge of the chair and an edge of a floor tile or the bottom edge of one of the cupboards.
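A minimal sketch of what detecting such a change might involve, assuming that some feature tracker (not shown) already supplies image positions for two features in each frame. The function name gap_trend, the tolerance value and the pixel coordinates are illustrative assumptions, not part of any system described here. The point is only that the output is a qualitative label, not a metric estimate.

```python
from math import hypot

def gap_trend(track_a, track_b, tolerance=0.5):
    """Classify how the 2D image distance between two tracked features changes.

    track_a, track_b: lists of (x, y) image positions, one entry per frame.
    Returns one label per consecutive pair of frames:
    'increasing', 'decreasing' or 'steady'.
    """
    gaps = [hypot(ax - bx, ay - by)
            for (ax, ay), (bx, by) in zip(track_a, track_b)]
    labels = []
    for before, after in zip(gaps, gaps[1:]):
        change = after - before
        if change > tolerance:
            labels.append("increasing")
        elif change < -tolerance:
            labels.append("decreasing")
        else:
            labels.append("steady")
    return labels

# Hypothetical pixel tracks for two bars in the chair back over four frames.
left_bar = [(100, 50), (102, 50), (105, 50), (109, 50)]
right_bar = [(140, 50), (140, 50), (140, 50), (144, 50)]
trend = gap_trend(left_bar, right_bar)
print(trend)
```

Nothing in this sketch needs absolute distances in the scene: only the sign of the change in a projected gap is used.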

I suggest that both the richness of human visual experience and the enormously skilful use of visual information by fast moving animals in complex environments (e.g. squirrels moving through tree-branches to get to food or to another tree, and birds moving rapidly towards or away from a nest during nest construction or when feeding fledglings) indicate that a great deal of spatial information about structures and changes in the visual field, and about their static and changing relationships, is used (with the help of multiple constraint-propagation processes) to assemble a mass of detailed information required for a variety of control decisions on different time scales.

What are the likely causes of the changes in sensory/sensed information? Insofar as there are specific changes, e.g. in relative size, relative distance, or relative orientation of objects or object parts, do those visible changes provide information about what exists or is happening in the "world"?

What difference would it make if, in addition to the visual sensory information available in the videos, you also had information (proprioceptive information and efferent feedback) about motor signals a brain might generate to limbs, fingers, etc. to produce the changes in viewpoint? What might you then be able to perceive (infer, learn) from the visual information that you cannot perceive without the motor information? Compare Berthoz (2000).

E.g. are there changes in relationships between objects in 3D space (e.g. contact, direction, distance, relative orientation, obstructing line of sight) that help to provide useful information about what objects there are, what parts they are composed of, and what the spatial relationships between various parts and various objects are?

What sorts of AI reasoning mechanisms could make use of all those sources of sensory information, possibly combined with records of motor signals ("efference copies" is the misleading jargon), in order to acquire information that would be relevant to control decisions related to possible actions on the perceived objects, e.g. biting, grasping, pushing, moving, lifting, stacking, avoiding, etc.?

Readers are invited to look for examples of these points in the videos, and to consider what difference it would make if in addition to the visual input signals you also had information about motor output signals.

Seeing a single chair, plus background, from a moving viewpoint (MP4) (1 min 43 secs)
Think about how what you see in the background changes as the viewpoint changes, and how visual relationships between parts of the chair and parts of the background change. Many details change in coordinated ways that give strong clues as to the 3D spatial layout of the perceived scene, as well as providing information about affordances as suggested by Gibson. A simple example noted by Gibson is that when moving towards a surface, the centre of expansion of visible texture in the visual image and the rate of expansion of texture give information about the likely point of contact with the surface, and the time to contact. We can generalise this point to include different directions, speeds, and centres of optical flow happening simultaneously when viewing a scene with multiple surfaces. Moreover, relationships between patterns of optical flow can give evidence as to whether two visible surface patches are parts of the same flat surface (e.g. floor, wall, table-top, cupboard door, etc.). Gibson's ideas can be combined with those of Trehub (1991), according to which the primary visual cortex is mainly an optical feature capture device whose results are immediately copied to other brain centres for processing according to different needs. (This would explain why the blind spot is not perceived: there is nothing there to be copied to other cognitive sub-systems.)
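Gibson's simple case can be illustrated numerically. The sketch below assumes pure motion towards a flat surface facing the viewer, with no rotation and no noise, so that a feature at image position (x, y) moves with image velocity ((x - x0)/tau, (y - y0)/tau), where (x0, y0) is the centre of expansion and tau is the time to contact. The function names and the sample numbers are hypothetical, and no claim is made that brains or existing vision systems compute these quantities in this way.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def expansion_centre_and_time_to_contact(points, flows):
    """Estimate the centre of expansion and the time to contact.

    points: (x, y) image positions of tracked texture features.
    flows:  (u, v) image velocities of those features, per second.
    Assumes u = (x - x0) / tau and v = (y - y0) / tau (pure approach).
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    us = [f[0] for f in flows]
    vs = [f[1] for f in flows]
    slope_u, intercept_u = fit_line(xs, us)   # slope estimates 1 / tau
    slope_v, intercept_v = fit_line(ys, vs)
    x0 = -intercept_u / slope_u               # where horizontal flow vanishes
    y0 = -intercept_v / slope_v               # where vertical flow vanishes
    expansion_rate = (slope_u + slope_v) / 2
    return (x0, y0), 1.0 / expansion_rate

# Synthetic flow generated with centre (320, 240) and tau = 2 seconds.
points = [(100, 100), (500, 120), (300, 400), (420, 300)]
flows = [((x - 320) / 2, (y - 240) / 2) for x, y in points]
centre, tau = expansion_centre_and_time_to_contact(points, flows)
print(centre, tau)
```

With several surfaces in view, as in the videos, a single fit like this would no longer be appropriate: different image regions would have different centres and rates of expansion, which is the generalisation suggested above.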

Seeing two chairs, plus background, from a moving viewpoint (MP4) (1min 17 secs)
Think about how the presence of a second chair affects the patterns of change in visual input related to motion of the viewer. What additional problems, and additional opportunities, does the added 3D complexity bring to the visual task?

Seeing a chair and a pot plant, plus background, from a moving viewpoint (MP4) (47 secs)
The added complexity produced by the second chair and by the pot plant differs in kind, in numerosity (of changing features), and in the variety of changes of visibility of surfaces, edges, textures, etc. Is it possible that the extra spatial complexity in the structures and processes enriches what an intelligent perceiver can perceive, without the perceiver having to be trained on different objects, configurations of objects and processes?

Some questions about the videos:
  1. How can we decide whether a machine sees these videos as humans do?
    (Some humans? All humans? I have not investigated different human responses.)

  2. What mechanisms would enable a robot to see these in the same ways as you can?

  3. How could evolution have
    -- (a) produced
    -- (b) implemented
    such mechanisms? (E.g. how does a brain represent straightness, planarity, possible shapes of 3D surface fragments, such as bumps, dents, grooves, etc.?)

  4. How many different sorts of process are perceived in each video, including
    -- motion of viewpoint?
    -- absolute or relative motions of perceived surfaces?
    -- changes of occlusion/visibility of surface fragments?
    -- changes of orientation (i.e. rotation) of visible surface fragments?
    -- variations in optical flow patterns (Gibson)?

  5. How can brains represent such processes?

  6. Can the functionality of the biological mechanisms be replicated using digital information processing technology?

  7. If not, what sorts of information-processing mechanisms could replicate the biological functionality?

  8. How would the visual challenges presented by these three videos compare with the visual challenges of mobile animals, e.g. animals walking or running through a forest, birds flying through branches to get to or from their nests, or to food, squirrels or monkeys moving through treetops without being able to fly?

  9. Compare the problem for a squirrel and the problem for an insect getting pollen from flowers among the branches.

  10. What has all this to do with evolution of mathematical intelligence?

Some questions relevant to specifying the requirements to be met:
Which surfaces and parts of surfaces are visible at any time in these videos?

How are they seen? As collections of 3D or 2D points? As planar surface fragments stitched together? As curved 3D surfaces? As moving surfaces? Moving in which directions relative to what? As processes in which there are surface-like features?

Which answers should be regarded as parts of specifications to be met by designs for intelligent robots with human-like perception and action capabilities?

How are the contents of the perception processes implemented?
(a) spatially,
     e.g. absolute positions, orientations, etc. or merely partial orderings of distance, size, curvature, etc.?
(b) in terms of occlusion or partial occlusion,
(c) in terms of function (e.g. supporting, leaning on, constraining, ...)?

What movements are made by the camera (eyes) during the video? How would the eye movements be represented in the control subsystems?
     Changes of location, changes of orientation, changes of "fixation centre"?
     Which of the changes should a robot visual system be able to detect and represent?

How do visibility changes depend on relative locations and relative motions of objects seen and the camera?

At any stage, select a portion of a visible surface and consider whether and how its visibility would be changed by various changes of viewpoint (camera translation -- forwards, backwards, sideways, at some other angle relative to the line of sight, ...?).

How should an intelligent perceiver control viewing direction (rotation of camera)? When watching videos like these you cannot retrospectively change the viewing direction of the camera, though you can
-- change the viewing direction of your eyes, fixating different image fragments,
-- change portions and aspects of the image and the scene attended to
     (how would change of attention be implemented in a robot?)
-- notice and attend to changes of the camera's viewing direction (camera rotation)

Does an intelligent machine, or animal (or normal human) need to be able to estimate absolute distances, sizes, speeds, orientations, rates of rotation?

What alternatives to absolute metrics would be useful for an intelligent agent, and how?

What sorts of internal data-structure could enable a robot (or a brain) to represent
-- what is visible at any time (structures and processes)?
-- the changes that occur?
-- how various movements alter what it is possible, or impossible, to see?

Try the following experiments at various stages in the videos:
Stop the video and select two locations in the scene where a surface is partly visible and partly occluded (because an occluding edge is present). Then consider
-- which motions of the chair, or of the camera/eye location, if any, will simultaneously make MORE of
   both surface fragments visible
-- which motions will simultaneously make LESS of both surface fragments visible.
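The experiment can be mimicked in a deliberately simplified plan view (looking down on the scene), as in the sketch below. It assumes straight occluding edges, surface fragments parallel to the x axis, and a viewer that only translates. The function names, the 'left'/'right' convention and the coordinates are hypothetical choices made for illustration. The answers it gives are qualitative (MORE, LESS or MIXED), although this toy version obtains them by computing with coordinates.

```python
def visible_length(viewer, edge, covers, fragment):
    """Plan-view estimate of how much of a surface fragment is visible.

    viewer:   (x, z) position of the eye or camera.
    edge:     (x, z) position of the occluding edge, nearer than the fragment.
    covers:   'left' if the occluder fills x < edge x at the edge's depth,
              otherwise 'right'.
    fragment: (x_min, x_max, z) strip of surface parallel to the x axis.
    """
    vx, vz = viewer
    ex, ez = edge
    x_min, x_max, fz = fragment
    # Where the line of sight grazing the edge meets the fragment's depth.
    boundary = vx + (ex - vx) * (fz - vz) / (ez - vz)
    if covers == "left":
        seen = x_max - max(x_min, boundary)
    else:
        seen = min(x_max, boundary) - x_min
    return max(0.0, min(seen, x_max - x_min))

def effect_of_motion(viewer, step, scenes):
    """Say whether a viewer displacement reveals MORE or LESS of every fragment."""
    moved = (viewer[0] + step[0], viewer[1] + step[1])
    deltas = [visible_length(moved, edge, covers, fragment)
              - visible_length(viewer, edge, covers, fragment)
              for edge, covers, fragment in scenes]
    if all(d > 0 for d in deltas):
        return "MORE"
    if all(d < 0 for d in deltas):
        return "LESS"
    return "MIXED"

# Two hypothetical partly occluded fragments, both with occluders on the left,
# and a third whose occluder is on the right.
scene_a = ((0.0, 2.0), "left", (-2.0, 2.0, 4.0))
scene_b = ((1.0, 3.0), "left", (0.0, 4.0, 6.0))
scene_c = ((-1.0, 3.0), "right", (-4.0, 0.0, 6.0))
eye = (0.0, 0.0)
move_right = effect_of_motion(eye, (0.5, 0.0), [scene_a, scene_b])
move_left = effect_of_motion(eye, (-0.5, 0.0), [scene_a, scene_b])
opposed = effect_of_motion(eye, (0.5, 0.0), [scene_a, scene_c])
print(move_right, move_left, opposed)
```

When the two occluders are on opposite sides, a sideways step helps with one fragment and hurts with the other, so other motions (e.g. forwards or backwards) have to be considered.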

What can you conclude about differences in spatial consciousness between a normally sighted person and a blind person?
(Note: humans blind from birth may have access to brain mechanisms that could only have evolved in ancestors with visual capabilities.)

Are there important questions relating to the nature and functions of spatial perception that I've left out?

I suspect answering questions like that would be more useful for an intelligent machine than computing actual locations in a coordinate space. One reason is that qualitative changes in those features can be useful for controlling movements, e.g. steering towards a doorway, without having to compute 3D locations, directions, or distances. (Compare G

Perceptual mechanisms of humans and other intelligent animals (including many mammals and birds) include abilities to reason about how physical changes will affect availability of information.

Such abilities could be labelled abilities to detect (and use) cognitive affordances.

In reflex responses to triggering stimuli the processing happens without any consideration of alternative options. In more intelligent responses, animals or robots may use varying levels of sophistication and self-awareness in choosing how to process the new information.

Origins of ancient mathematics
The ability to perform a variety of processes of reasoning about possible and impossible changes in visibility (epistemic affordances), including reasoning about possibilities and impossibilities in novel configurations of objects, was an important evolutionary and developmental precursor to abilities to make ancient mathematical discoveries in geometry and topology.

I suspect there is no mechanism known to neuroscientists that explains such abilities.

The polyflap domain was proposed in 2005 as a domain in which robotic and psychological experiments could be performed to investigate these mechanisms.

Humans do not always have to be trained on particular configurations of information in order to reason about them. Presumably some of the mechanisms are specified in the genome, though they may not all be expressed at birth.
(For further discussion of mechanisms required in the genome in order to support human-like mathematical capabilities, see The Meta-Configured Genome.)

There are many unanswered questions:
-- How do brains do such reasoning?
-- What brain mechanisms make that possible?
-- How are these abilities related to abilities to make discoveries in topology and geometry?
-- Is it possible to implement mechanisms with those powers using digital computing machinery?
-- What alternatives are there?

See this brief discussion of Turing's distinction between the roles of intuition and ingenuity in mathematical cognition:
What kinds of geometric/topological insights are used in perceiving and understanding the above videos?

Is it possible that replicating those abilities in robots will require use of new kinds of computing machinery, combining discrete and continuous changes? For an incomplete discussion see:

Video Camera vs Kinect?
These movies were obviously taken with a standard (mobile phone) video camera. Many roboticists have tried to "improve" on cameras as visual sensors by using arrays of range-finders, or by using a Kinect-like device that, instead of simply providing a digitised planar projection of a 3D scene, uses range-finder technology to produce clouds of 3D points, represented by their coordinates.
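The difference between the two kinds of sensor output can be made concrete with an idealised pinhole camera, as in the sketch below. The function names, the unit focal length and the sample points are assumptions made for illustration. Nothing here models a real Kinect or a real phone camera.

```python
def project(point, camera_x=0.0, focal_length=1.0):
    """Idealised pinhole projection of a 3D point (x, y, z), with z > 0.

    camera_x is the sideways position of the camera. The result is the
    2D image position, from which the depth z cannot be read off.
    """
    x, y, z = point
    return (focal_length * (x - camera_x) / z, focal_length * y / z)

def image_shift(point, camera_step):
    """Horizontal image displacement caused by a sideways camera step."""
    before = project(point)[0]
    after = project(point, camera_x=camera_step)[0]
    return abs(after - before)

# A range sensor would report both 3D points; a single image cannot
# tell them apart, because they lie on the same line of sight.
near = (1.0, 2.0, 4.0)
far = (2.0, 4.0, 8.0)
same_image_point = project(near) == project(far)

# After a sideways step the nearer point shifts more in the image, so the
# depth ORDER is recoverable from 2D changes without any range data.
nearer_first = sorted([near, far],
                      key=lambda p: image_shift(p, 1.0), reverse=True)
print(same_image_point, nearer_first)
```

This only shows that a moving viewpoint can recover a partial ordering of depths from planar projections. Whether that is a better basis for robot vision than clouds of 3D points is the question being raised.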

How would use of Kinect alter a machine's ability to answer the questions above?

Is it possible that biological evolution made far more use of structures and processes in 2D projections of 3D scenes than of 3D information available using stereo mechanisms, which require visual fields to overlap?

What are the relative advantages of each for a mobile viewer perceiving and acting in a 3D environment?


Alain Berthoz, 2000, The Brain's Sense of Movement, Harvard University Press, Perspectives in Cognitive Science, London, UK.

James J. Gibson, 1979, The Ecological Approach to Visual Perception, Houghton Mifflin, Boston, MA.

Aaron Sloman, 2007-2018, Predicting Affordance Changes (Alternative ways to deal with uncertainty), unpublished technical report, School of Computer Science, University of Birmingham.

Aaron Sloman, 2012 ff, The Meta-Morphogenesis project.

Arnold Trehub, 1991, The Cognitive Brain, MIT Press, Cambridge, MA.


This work, and everything else on my website, is licensed under a Creative Commons Attribution 4.0 License.
If you use or comment on my ideas please include a URL if possible.

Installed: 9 Oct 2018
Last updated: 12 Oct 2018; 15 Oct 2018; 21 Oct 2018
This is part of the CogAff (Cognition and Affect) and Meta-Morphogenesis (M-M) projects.

Maintained by Aaron Sloman
School of Computer Science
The University of Birmingham