Web Site for the EC-Funded CoSy Project
(Birmingham Component)

Aaron Sloman

This is an expanded version of the abstract for a talk I gave on Thursday 13th October 2005 at 4pm in the School of Computer Science (room UG40).

The actual presentation, which is more detailed, is available as COSY-PR-0505: A (Possibly) New Theory of Vision (PDF)


This is Birmingham CoSy Discussion Paper COSY-DP-0508


For many years I have been working on a collection of related problems that are usually studied separately. In the last few weeks I have come across a new way of thinking about them. Examples of things I wrote about these topics nearly 30 years ago can be found in my 1978 book (The Computer Revolution in Philosophy: Philosophy, Science and Models of Mind), e.g. Chapter 7 and Chapter 9.

Chapter 7 was based on a paper presented at IJCAI 1971 attacking logicist AI. It presented a theory of diagrammatic reasoning (a special case of reasoning using 'analogical representations') as 'formal' and rigorous in its own way, including, for instance, reasoning about the propagation of causation in physical mechanisms such as this (where the points of the triangles are hinge points and the pulley has a fixed axis):

This was contrasted with reasoning using Fregean representations, where relations are represented not by relations but by symbols. For further discussion of the role of vision in reasoning see this presentation on 'When is seeing (possibly in your mind's eye) better than deducing, for reasoning?'.
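The contrast can be made concrete with a small sketch. In a Fregean representation the relation 'above' is named by a symbol inside a sentence-like structure; in an analogical representation the relation is exhibited by a relation in the representing medium itself. The code below is purely illustrative (the names and the crude row-based 'diagram' are my assumptions, not anything from the original systems):

```python
# Two ways to represent 'A is above B' (illustrative sketch only).

# Fregean: the relation is named by a symbol in a sentence-like structure.
fregean = ("above", "A", "B")

# Analogical: the relation is represented by a relation in the medium
# itself -- here, vertical position in a list of rows (a crude 'diagram').
analogical = [
    [" ", "A", " "],
    [" ", "B", " "],
]

def above_in_diagram(diagram, x, y):
    """Read off 'x above y' from the medium's own spatial structure."""
    rows = {cell: i for i, row in enumerate(diagram)
            for cell in row if cell != " "}
    return rows[x] < rows[y]

# The relation is not stored as a symbol; it is exhibited by the layout.
print(above_in_diagram(analogical, "A", "B"))
```

Note that in the analogical case many further relations (e.g. 'B is below A') come for free from the same structure, whereas each Fregean fact must be stated separately.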

Chapter 9 reports a theory of vision as involving perception of structure at different levels of abstraction, using different ontologies, with information flowing both bottom up (driven by the data) and top down (driven by problems, expectations, and prior knowledge), as illustrated in this figure depicting processing in a system called POPEYE we implemented around that time, which could fairly quickly recognise words in messy pictures of overlapping capital letters.


But most of what I wrote over many years was very vague and did not specify mechanisms (apart from the POPEYE system). Many other people have speculated about mechanisms, but I don't think the mechanisms proposed have the right capabilities.

The current state of AI research on vision

AI work on vision over the last 30 years has mostly ignored the task of perceiving and understanding structure (apart from image structure), and has instead focused on classification, tracking, and prediction, which are largely statistical, not visual, processes.

For example, enquiries among internationally known vision researchers revealed that nobody at present has an image interpretation system able to see what we can see in images like these, which would enable us to plan a process of rearranging the items from the first configuration to the second, or vice versa, using only finger and thumb for grasping items to be moved:

[Images: CupOnSaucer, SaucerOnCup]

For some time I have been thinking about requirements for vision in a robot with 3-D manipulation capabilities, as required for the 'PlayMate' scenario in the CoSy project.

In particular, thinking about relations between 3-D structured objects made it plain that besides obvious things like 'the pyramid is on the block' the robot will have to perceive less obvious things, such as that the pyramid and block each have many parts (including vertices, faces, edges, centres of faces, interiors, exteriors, etc.) that stand in different relations to one another and to the parts of the other object. For example, if you look at the picture on the left you can probably (fairly) easily point (approximately) at the part of the spoon that is under the handle of the cup. If I point at a location on the rim of the cup you will easily see (approximately) how finger and thumb need to be oriented to pick the cup up at that point. The required orientation keeps varying around the rim of the cup. The same applies to seeing how to pick up the cup by grasping the handle at various indicated points, though now the orientation requirement changes in a different way.

I.e. a robot with manipulative capabilities (like us) needs to be able to perceive 'multi-strand relationships' involving not only whole objects (as in seeing that 'the cup is above the saucer') but also parts of the objects, portions of surfaces, etc. Moreover, we do not need to be able to describe in words what we see. Many animals that do not talk can manipulate things using their jaws, beaks, claws, and in some cases fingers. In particular, this ability to see multi-strand relationships seems to be a requirement for hunting animals that have to catch their prey and then tear it open in order to get at the flesh. Likewise, many young children who cannot yet talk can manipulate objects like blocks, spoons, cups, and simple jig-saw puzzles.

It is not merely the case that there are multiple relationships: they are also relationships of different kinds, e.g. metrical relationships, ordering relationships, topological relationships, functional relationships ('A supports B', 'A keeps B and C apart'), and others. So quite a rich ontology is required in perceiving those relationships, as discussed in COSY-DP-0501 ('Towards an ontology for factual information for a playful robot').
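One way to make the idea of a multi-strand relationship concrete is to represent a scene as a collection of typed relations holding between parts of objects as well as whole objects. The sketch below is purely hypothetical (the Part and Relation types, the relation kinds, and the cup-and-saucer scene are my illustrative assumptions, not part of any CoSy implementation):

```python
from dataclasses import dataclass

# Hypothetical sketch: parts of objects and typed relations between them.

@dataclass(frozen=True)
class Part:
    obj: str      # the whole object, e.g. "cup"
    name: str     # the part, e.g. "handle", "rim", or "whole"

@dataclass(frozen=True)
class Relation:
    kind: str     # "metrical", "ordering", "topological", "functional"
    label: str    # e.g. "above", "touches", "supports"
    args: tuple   # the related parts or whole objects

def relations_involving(scene, part):
    """All strands of the multi-strand relationship touching one part."""
    return [r for r in scene if part in r.args]

# A tiny scene: a cup standing on a saucer.
cup, saucer = Part("cup", "whole"), Part("saucer", "whole")
cup_base, saucer_centre = Part("cup", "base"), Part("saucer", "centre")

scene = [
    Relation("metrical",    "above",    (cup, saucer)),
    Relation("topological", "touches",  (cup_base, saucer_centre)),
    Relation("functional",  "supports", (saucer, cup)),
]

for r in relations_involving(scene, cup):
    print(r.kind, r.label, [f"{p.obj}.{p.name}" for p in r.args])
```

Even this toy example shows how a single pair of objects gives rise to several concurrent strands of different kinds, which is the point of calling the relationship 'multi-strand'.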

Seeing Motion

Moreover, when things move, whether as part of an action performed by the robot or for some other reason, many of these relations change concurrently.

E.g. one corner of the pyramid might move off the face of the block while the other parts of the pyramid change relationships to other parts of the block. If the object moving is flexible, internal relationships can change also. So the robot needs to perceive 'multi-strand processes', in which multi-strand relationships change.
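The notion of a multi-strand process can also be sketched in code: as one object moves, several relations between its parts and another object's parts change at once. The 1-D interval geometry and all the numbers below are deliberate simplifications of my own, chosen only to make the concurrent change of strands visible:

```python
# Hypothetical sketch of a 'multi-strand process': as the pyramid slides,
# several strands of its relationship to the block change concurrently.
# The 1-D geometry is a deliberate simplification.

def relations(pyramid_left, pyramid_right, block_left=0.0, block_right=4.0):
    """Recompute a few strands of the relationship for one time step."""
    return {
        "left_corner_over_block":  block_left <= pyramid_left  <= block_right,
        "right_corner_over_block": block_left <= pyramid_right <= block_right,
        "overlap_width": max(0.0, min(pyramid_right, block_right)
                                  - max(pyramid_left, block_left)),
    }

# Slide a 2-unit-wide pyramid rightwards off the block, one step at a time:
# first one corner leaves the block, then the other, while the overlap shrinks.
for t in range(5):
    left = 1.0 + t            # left corner position at time t
    print(t, relations(left, left + 2.0))
```

Tracking the printed states step by step shows several strands changing in parallel, which is what a perceiver of a multi-strand process must keep up with.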

The (possibly new) theory

Thinking about all the above (and much more, including looking at some of the things children can and cannot do, and watching videos of Betty the hook-making crow), talking to Jackie Chappell (now in Birmingham) about varieties of animal cognition and the Altricial-Precocial Spectrum for robots, and linking all this up with a collection of older ideas (e.g. 'vision is controlled hallucination' -- Max Clowes) led to the following hypothesis:

Visual perception (in humans and many but not all other animals) involves running simulations of processes that could occur in the perceived scene, at various levels of abstraction.

So, paradoxically, perceiving a static scene involves running simulations in which nothing happens.

The ability to run these simulations during visual perception may be shared with many animals, but probably only a small subset have the ability to use these mechanisms for representing and reasoning about processes that are not currently being perceived, including very abstract processes that could never be perceived, e.g. processes involving transformations of infinite ordinals.

I claim that the ability to understand causal relations in changing structures is deeply connected with one of our two concepts of causation, the Kantian concept, which is different from the much more popular Humean interpretation of causation. The difference is discussed in this draft paper.

In the talk I shall attempt to explain all this in more detail and identify some of the unanswered questions arising out of the theory. There are many research questions raised by all this, including questions about the stages of development of such a visual architecture --- e.g. the process of constructing new kinds of simulative capabilities.

I would welcome criticisms, suggestions for improvement of the theory, and suggestions for implementation on computers and in brains.

I shall continue working on these notes, including incorporating some of the ideas in the notes on the polyflap domain and the notes on dynamical systems vs other underpinnings of action.

"If a problem is too hard to solve, try a harder one"
(I have not found out who said that. If you know, please tell me.)

Here is an earlier 'New Theory of Vision'.

Comments to Aaron Sloman
Last updated 16 Oct 2005