School of Computer Science

A new theory of vision

(Departmental (old) Series)

Aaron Sloman, The University of Birmingham

Date and time: Thursday 13th October 2005 at 16:00
Location: UG40, School of Computer Science
Host: Volker Sorge

For many years I have been working on a collection of related problems that are usually studied separately. In the last few weeks I have come across a way of thinking about them that:

o seems to be new, though it combines several old ideas

o seems to have a lot of explanatory power

o opens up a collection of new research issues in psychology (including animal psychology), neuroscience, AI, biological evolution, linguistics and philosophy.

The issues I have been concerned with include the following:

o what are the functions of vision in humans, other animals and intelligent robots -- and what mechanisms, forms of representation and architectures make it possible for those requirements to be met?

o how do we do spatial/visual reasoning, both about spatial problems and about non-spatial problems (e.g. reasoning about search strategies, transfinite ordinals, or family relationships)?

o what is the role of spatial reasoning capabilities in mathematics?

o what are affordances, both positive and negative, how do we see them, and how do we use them? I.e. how do we see which actions are possible and what the constraints are, before we perform them?

o what is causation, how do we find out what causes what, and how do we reason about causal connections?

o how much of all this do we share with other animals?

o what are the relationships between being able to understand and reason about what we see, and being able to perform actions on things we see?

o how do all these visual and other abilities develop within an individual?

o how did the abilities evolve in various species?

Examples of things I wrote about these topics nearly 30 years ago can be found in my 1978 book, e.g. these chapters (of which the first was originally a paper at IJCAI 1971 attacking logicist AI):

http://www.cs.bham.ac.uk/research/cogaff/crp/chap7.html
http://www.cs.bham.ac.uk/research/cogaff/crp/chap9.html

The first presents a theory of diagrammatical reasoning as 'formal' and rigorous in its own way, and the second reports a theory of vision as involving perception of structure at different levels of abstraction, using different ontologies, with information flowing both bottom up (driven by the data) and top down (driven by problems, expectations, and prior knowledge).
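As a purely illustrative toy sketch of that second idea (the Python names, the feature labels and the pruning rule below are invented for this abstract, not taken from the book), the bottom-up and top-down flows might look something like this:

    # Purely illustrative sketch: descriptions are proposed bottom up from the
    # data and then pruned top down by expectations or prior knowledge.

    def bottom_up(lower_descriptions, grouping):
        """Data-driven step: build higher-level descriptions from lower ones."""
        return [grouping(d) for d in lower_descriptions]

    def top_down(descriptions, expectation):
        """Expectation-driven step: keep only descriptions compatible with what
        problems, expectations or prior knowledge suggest should be there."""
        return [d for d in descriptions if expectation(d)]

    # Toy run: image features -> edges -> surfaces (different ontologies).
    features = ["f1", "f2", "f3"]
    edges    = bottom_up(features, lambda f: ("edge", f))
    surfaces = bottom_up(edges,    lambda e: ("surface", e))

    # Prior knowledge: we expect a surface built from feature "f2".
    surfaces = top_down(surfaces, lambda s: s[1][1] == "f2")
    edges    = top_down(edges,    lambda e: any(e == s[1] for s in surfaces))

    print(edges)     # [('edge', 'f2')]
    print(surfaces)  # [('surface', ('edge', 'f2'))]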

But most of what I wrote over many years was very vague and did not specify mechanisms. Many other people have speculated about mechanisms, but I don't think the mechanisms proposed have the right capabilities.

In particular, AI work on vision over the last 30 years has mostly ignored the task of perceiving and understanding structure, and has instead focused on classification, tracking, and prediction, which are largely statistical, not visual, processes.

Recently I was thinking about the requirements for vision in a robot with 3-D manipulation capabilities, as needed for the 'PlayMate' scenario in the CoSy project:

http://www.cs.bham.ac.uk/research/projects/cosy/PlayMate-start.html

Thinking about relations between 3-D structured objects made it plain that, besides obvious things like 'the pyramid is on the block', the robot will have to perceive less obvious things, such as that the pyramid and the block each have many parts (including vertices, faces, edges, centres of faces, interiors, exteriors, etc.) that stand in different relations to one another and to the parts of the other object. I.e. the robot (like us) needs to be able to perceive 'multi-strand relationships'.

Moreover, when things move, whether as part of an action performed by the robot or for some other reason, many of these relations change *concurrently*.

E.g. one corner of the pyramid might move off the face of the block while the other parts of the pyramid change their relationships to other parts of the block. If the moving object is flexible, internal relationships can change as well. So the robot needs to perceive 'multi-strand processes', in which multi-strand relationships change.
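To make this a little more concrete, here is a toy sketch in Python (all the part names and relations are invented for this abstract; this is not code from the CoSy project) of one multi-strand relationship, and of a single movement step that changes several of its strands at once:

    # A 'multi-strand relationship': many parts of two objects standing in many
    # relations at the same time; a 'multi-strand process': one movement that
    # changes several of those strands concurrently.

    pyramid_parts = {"apex", "base-face", "base-edge-1", "base-corner-1"}
    block_parts   = {"top-face", "front-face", "top-edge-1"}

    # One multi-strand relationship: (relation, pyramid-part, block-part)
    # triples all holding at the same time.
    scene = {
        ("touching", "base-face",     "top-face"),
        ("above",    "apex",          "top-face"),
        ("over",     "base-corner-1", "top-face"),
        ("parallel", "base-edge-1",   "top-edge-1"),
    }
    assert all(a in pyramid_parts and b in block_parts for _, a, b in scene)

    def slide_pyramid(relations):
        """One step of a multi-strand process: sliding the pyramid towards the
        edge of the block changes several strands of the relationship at once."""
        changed = set(relations)
        changed.discard(("over", "base-corner-1", "top-face"))
        changed.add(("beyond", "base-corner-1", "top-face"))     # corner overhangs
        changed.discard(("touching", "base-face", "top-face"))
        changed.add(("partly-touching", "base-face", "top-face"))
        return changed

    next_scene = slide_pyramid(scene)
    print(sorted(scene - next_scene))   # strands that ceased to hold
    print(sorted(next_scene - scene))   # strands that began to hold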

Thinking about this, and linking it up with a collection of older ideas (e.g. Max Clowes's slogan that 'vision is controlled hallucination'), led to the following hypothesis:

Visual perception involves:

- creation and running of a collection of process *simulations*

- at different levels of abstraction

- some discrete, some continuous (at different resolutions)

- in (partial) registration with one another and with sensory data (where available), and with motor output signals in some cases

- using mechanisms capable of running with more or less sensory input (e.g. as part of an object moves out of sight behind a wall)

- selecting only subsets of possible simulations at each level depending on what current interests and motivations are (e.g. allowing zooming in and out)

- with the possibility of saving re-startable 'check-points' for use when searching for a solution to a problem, e.g. a planning problem.

So, paradoxically, perceiving a static scene involves running simulations in which nothing happens.
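As a purely illustrative toy sketch of the hypothesis (the class, the update rule and the numbers below are all invented for this abstract; nothing here is claimed about brains or about the CoSy implementation), perception-as-simulation might be caricatured like this:

    # Perception as a set of running process simulations, kept in partial
    # registration with sensory data when it is available, free-running when
    # it is not, and checkpointable for use in planning.

    import copy

    class ProcessSimulation:
        def __init__(self, name, state, dynamics):
            self.name = name
            self.state = state          # e.g. estimated position of a part
            self.dynamics = dynamics    # how the state evolves per step

        def step(self, observation=None):
            # Predict the next state from the simulation's own dynamics...
            self.state = self.dynamics(self.state)
            # ...and, where sensory data is available, pull the simulation
            # back into (partial) registration with it.
            if observation is not None:
                self.state = (self.state + observation) / 2.0
            return self.state

        def checkpoint(self):
            # Re-startable check-point, e.g. for exploring branches of a plan.
            return copy.deepcopy(self)

    # Two simulations at different levels of abstraction, run concurrently.
    fine   = ProcessSimulation("corner-position", 0.0, lambda x: x + 1.0)
    coarse = ProcessSimulation("fraction-across", 0.0, lambda x: min(1.0, x + 0.1))

    saved = fine.checkpoint()                # save a re-startable check-point
    observations = [1.2, 2.1, None, None]    # None: object now hidden behind a wall
    for obs in observations:
        print(fine.step(obs), coarse.step()) # both simulations keep running

    fine = saved                             # re-start from the check-point,
    print(fine.step(1.2))                    # e.g. to try another plan branch

    # A static scene: the simulation runs, but nothing happens.
    static = ProcessSimulation("resting-block", 5.0, lambda x: x)
    print(static.step(), static.step())      # 5.0 5.0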

The ability to run these simulations during visual perception may be shared with many animals, but probably only a small subset have the ability to use these mechanisms for representing and reasoning about processes that are not currently being perceived, including very abstract processes that could never be perceived, e.g. processes involving transformations of infinite ordinals.

In the talk I shall attempt to explain all this in more detail and identify some of the many unanswered research questions arising out of the theory.

I would welcome criticisms, suggestions for improvement of the theory, and suggestions for implementation on computers and in brains.

"If a problem is too hard to solve, try a harder one". (I have not found out who said that. If you know, please tell me.)