http://www.cs.bham.ac.uk/research/projects/cosy/
REQUEST FOR COMMENTS ON A NEW(?) THEORY OF VISION
Aaron Sloman
This is an expanded version of the abstract for a talk I gave on
Thursday 13th October at 4pm
in the School of Computer Science (room UG40).
The more detailed actual presentation is available as
COSY-PR-0505: A (Possibly) New Theory of Vision (PDF)
Abstract
This is Birmingham CoSy Discussion Paper COSY-DP-0508.
Background
For many years I have been working on a collection of related problems
that are usually studied separately. In the last few weeks I have come
across a way of thinking about them that
- seems to be new, though it combines several old ideas
- seems to have a lot of explanatory power
- opens up a collection of new research issues in psychology
(including animal psychology), neuroscience, AI, biological
evolution, linguistics and philosophy.
The issues I have been concerned with include the following:
- what are the functions of vision in humans, other animals and
intelligent robots -- and what mechanisms, forms of representation
and architectures make it possible for those requirements to be
met?
- how do we do spatial/visual reasoning, both about spatial problems
and about non-spatial problems (e.g. reasoning about search
strategies, transfinite ordinals, or family relationships)?
- what is the role of spatial reasoning capabilities in mathematics?
- what are affordances, how do we see them, and how do we use them,
including both positive and negative affordances? I.e. how do we
see which actions are possible and what the constraints are,
before we perform them?
- what is causation, how do we find out what causes what, and how
do we reason about causal connections?
- how much of all this do we share with other animals?
- what are the relationships between being able to understand and
reason about what we see, and being able to perform actions on things
we see?
- how do all these visual and other abilities develop within an individual?
- how did the abilities evolve in various species?
Examples of things I wrote about these topics nearly 30 years ago can be
found in my 1978 book
(The Computer Revolution in Philosophy: Philosophy, Science and Models of
Mind), e.g. Chapter 7 and Chapter 9.
Chapter 7 was based on a paper presented at IJCAI 1971 attacking
logicist AI. It presented a theory of diagrammatic reasoning
(a special case of reasoning using 'analogical representations') as
'formal' and rigorous in its own way, including, for instance, reasoning
about the propagation of causation in physical mechanisms such as this
(where the points of the triangles are hinge points and the pulley has a
fixed axis):
This was contrasted with reasoning using Fregean representations, where
relations are represented not by relations but by symbols. For further
discussion of the role of vision in reasoning see this presentation on
'When is seeing (possibly in your mind's eye) better than deducing,
for reasoning?'.
Chapter 9 reports a theory of vision as involving perception of structure
at different levels of abstraction, using different ontologies, with
information flowing both bottom up (driven by the data) and top down
(driven by problems, expectations, and prior knowledge), as illustrated in
this figure depicting processing in a system called POPEYE that we
implemented around that time, which could fairly quickly recognise words
in messy pictures of overlapping capital letters.
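To make the idea of mixed bottom-up and top-down processing concrete, here is a tiny, purely illustrative Python sketch. It is not the POPEYE code; the function names, the toy 'lexicon' and the scoring are all my own assumptions. Evidence is grouped bottom up from the raw marks, while knowledge of possible words feeds back top down to favour letter hypotheses that are consistent with a known word.

# Illustrative sketch only (not POPEYE): bottom-up grouping of evidence
# combined with top-down, expectation-driven filtering.

KNOWN_WORDS = {"EXIT", "EXAM"}          # top-down lexical knowledge

def group_strokes(marks):
    """Bottom-up: keep marks that could be letter fragments."""
    return [m.upper() for m in marks if m.isalpha()]

def propose_letters(fragments):
    """Bottom-up: turn fragments into candidate letters with crude scores."""
    return [(f, 0.5) for f in fragments]

def top_down_filter(candidates):
    """Top-down: boost candidates when the emerging string is consistent
    with some known word."""
    prefix = "".join(c for c, _ in candidates)
    consistent = any(word.startswith(prefix) for word in KNOWN_WORDS)
    return [(letter, score + (0.4 if consistent else 0.0))
            for letter, score in candidates]

def interpret(marks):
    fragments = group_strokes(marks)                         # data-driven
    letters = top_down_filter(propose_letters(fragments))    # expectation-driven
    word = "".join(l for l, s in letters if s > 0.6)
    return word if word in KNOWN_WORDS else None

print(interpret(list("E*X/I.T")))   # -> EXIT (noise marks are discarded)

The real system of course dealt with 2-D structure, overlap and noise at several intermediate levels (dot clusters, bars, letters, words); the point of the sketch is only the two directions of information flow.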
But most of what I wrote over many years was very vague and did not
specify mechanisms (apart from the POPEYE system).
Many other people have speculated about mechanisms,
but I don't think the mechanisms proposed have the right capabilities.
The current state of AI research on vision
AI work on vision over the last 30 years has mostly ignored the task of
perceiving and understanding structure (apart from image structure), and
has instead focused on classification, tracking, and prediction, which
are largely statistical, not visual processes.
For example, enquiries among internationally known vision researchers
revealed that nobody at present has an image interpretation system able
to see what we can see in images like these, which would enable us to
plan a process of rearranging the items from the first configuration to
the second, or
vice versa,
using only finger and thumb for grasping items to be moved:
For some time I have been thinking about requirements for vision in a
robot with 3-D manipulation capabilities, as required
for the 'PlayMate' scenario in the CoSy project.
In particular, thinking about relations between 3-D structured objects
made it plain that, besides obvious things like 'the pyramid is on the
block', the robot will have to perceive less obvious things, such as that
the pyramid and block each have many parts (including vertices, faces,
edges, centres of faces, interiors, exteriors, etc.) that stand in
different relations to one another and to the parts of the other object.
For example, if you look at the picture on the left you can probably
(fairly) easily point (approximately) at the part of the spoon that is
under the handle of the cup. If I point at a location on the rim of the
cup you will easily see (approximately) how finger and thumb need to be
oriented to pick the cup up at that point. The required orientation
keeps varying around the rim of the cup. The same applies to seeing how
to pick up the cup by grasping the handle at various indicated points,
though now the orientation requirement changes in a different way.
NB: it is very important to note that I am not claiming that we see any
of this, or visualise the motions involved, with perfect precision. On the
contrary, part of the power of our ability to see, think and plan is
that we can do so at a low level of resolution that considerably reduces
the processing requirement whilst at the same time giving more re-usable
representations, because they are more abstract and general. This is
obviously related to a long history of AI work on qualitative
representation and reasoning, but I don't think simply talking about
ordering relations on unknown high precision real values is what
we need. We need something closer to what the 'fuzzy' community (e.g. in
fuzzy logic, fuzzy set theory, fuzzy control, fuzzy chunking) has been
talking about for the last few decades, but I suspect that is at most
only a small subset of what we need!
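As a concrete (and entirely invented) illustration of trading precision for re-usability, the sketch below throws away exact coordinates in favour of a few coarse qualitative relations. The Box type, the coordinate convention (y increasing upwards) and the relation labels are my own assumptions, not a proposal about how vision actually does this.

from dataclasses import dataclass

@dataclass
class Box:
    """Hypothetical axis-aligned stand-in for an object or object part."""
    x: float
    y: float
    w: float
    h: float

def qualitative_relations(a, b):
    """Return coarse relations between two parts, discarding exact metrics
    (assumes y increases upwards)."""
    rels = set()
    if a.y >= b.y + b.h:
        rels.add("above")
    if a.x + a.w <= b.x:
        rels.add("left-of")
    overlap_x = a.x < b.x + b.w and b.x < a.x + a.w
    overlap_y = a.y < b.y + b.h and b.y < a.y + a.h
    if overlap_x and overlap_y:
        rels.add("overlapping")
    return rels

cup = Box(0, 5, 3, 3)
spoon = Box(0, 0, 6, 1)
print(qualitative_relations(cup, spoon))    # -> {'above'}

The coarse labels are cheap to compute and re-usable across many scenes, which is the point of the paragraph above; but, as that paragraph also stresses, such simple relations over precise values are probably only a small part of what is needed.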
I.e. a robot with manipulative capabilities (like us) needs to be able
to perceive 'multi-strand relationships' involving not only whole
objects (as in seeing that 'the cup is above the saucer') but also parts
of the objects, portions of surfaces, etc. Moreover, we do not need to
be able to describe in words what we see. Many animals that do not talk
can manipulate things using their jaws, beaks, claws, and in some cases
fingers. In particular, this ability to see multi-strand relationships
seems to be a requirement for hunting animals that have to catch prey and
then tear it open in order to get at the flesh. Likewise, many
young children who cannot yet talk can manipulate objects like blocks,
spoons, cups, and simple jig-saw puzzles.
It is not merely the case that there are multiple relationships: they
are also relationships of different kinds, e.g. metrical relationships,
ordering relationships, topological relationships, functional
relations ('A supports B', 'A keeps B and C apart'),
and others. So quite a rich ontology is
required for perceiving those relationships, as discussed in
COSY-DP-0501
('Towards an ontology for factual information for a playful robot').
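Purely as an illustration of what a 'multi-strand relationship' record might look like, here is a small sketch of my own (the class, relation names and part names are invented, and this is not taken from COSY-DP-0501): relations of several kinds are held concurrently between whole objects and their named parts, and can be queried per entity.

from collections import defaultdict

class SceneRelations:
    """Toy container for relations of several kinds ('strands')."""
    def __init__(self):
        # kind -> set of (relation, subject, object) triples
        self.strands = defaultdict(set)

    def add(self, kind, relation, subj, obj):
        self.strands[kind].add((relation, subj, obj))

    def involving(self, entity):
        """All relations, of every kind, that mention a given object or part."""
        return {(kind, triple) for kind, triples in self.strands.items()
                for triple in triples if entity in triple[1:]}

scene = SceneRelations()
scene.add("topological", "touches",          "pyramid.base",  "block.top_face")
scene.add("ordering",    "above",            "pyramid.apex",  "block.top_face")
scene.add("metrical",    "roughly_3cm_from", "cup.rim_point", "spoon.bowl")
scene.add("functional",  "supports",         "block",         "pyramid")

print(scene.involving("block.top_face"))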
Seeing Motion
Moreover, when things move, whether as part of an action performed by
the robot or for some other reason, many of these relations change
concurrently.
E.g. one corner of the pyramid might move off the face of the block
while the other parts of the pyramid change their relationships to other
parts of the block. If the moving object is flexible, internal
relationships can also change. So the robot needs to perceive
'multi-strand processes', in which multi-strand relationships change.
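To make 'multi-strand process' slightly more concrete, here is another invented sketch (again my own, with made-up thresholds and relation names): a single parameter of a simulated motion is changed, and several relations of different kinds change together. The point is only that many strands change concurrently, not that a robot should compute them this way.

def relations(pyramid_x):
    """Recompute a few coarse relations from one continuous motion parameter."""
    rels = set()
    if 0.0 <= pyramid_x <= 1.0:
        rels.add(("pyramid.corner_A", "on", "block.top_face"))
    if pyramid_x <= 1.5:
        rels.add(("pyramid", "overlaps_vertically", "block"))
    side = "right_of_centre" if pyramid_x > 0.5 else "left_of_centre"
    rels.add(("pyramid", side, "block"))
    return rels

before = relations(0.4)      # pyramid resting on the block
after = relations(1.6)       # pyramid has slid off to the right
print("relations gained:", after - before)
print("relations lost:  ", before - after)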
The (possibly new) theory
Thinking about all the above (and much more, including looking at some
of the things children can and cannot do, and watching videos of
Betty the Hook-Making crow), talking to Jackie Chappell
(now in Birmingham) about varieties of animal cognition and the
Altricial Precocial Spectrum for robots,
and linking all this up with a collection of older ideas
(e.g. Max Clowes' dictum that 'vision is controlled hallucination') led
to the following hypothesis:
Visual perception (in humans and many but not all other
animals) involves:
- creation and running of a collection of concurrent process
simulations
- at different levels of abstraction
- some discrete (including symbolic specifications of structural
changes), some continuous (at different resolutions)
- in (partial) registration with one another and with sensory data
(where available), and with motor output signals in some cases,
- using mechanisms capable of running with more or less sensory
input (e.g. continuing as part of an object moves out of sight
behind a wall)
- selecting only subsets of possible simulations at each level,
depending on current interests and motivations (e.g.
allowing zooming in and out)
- with the possibility of saving re-startable 'check-points'
for use when searching for a solution to a problem, e.g. a planning
problem (without this a continuous simulative model of spatial
reasoning is useless for problem solving or planning, except in the
very simplest cases where the simulation always 'homes in' on a
solution).
So, paradoxically, perceiving a static scene involves
running simulations in which nothing happens.
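The following is a minimal sketch, entirely my own and resting on many simplifying assumptions, of two ingredients of the hypothesis: a simulation that can keep running whether or not fresh sensory data arrive, and re-startable checkpoints that allow the same simulation machinery to be used for searching over possible futures (e.g. planning) rather than only for tracking what is currently perceived. The class names, state encoding and search strategy are all invented for the illustration.

import copy

class ProcessSimulation:
    """One strand of simulation at one (coarse) level of abstraction."""
    def __init__(self, state):
        self.state = dict(state)            # e.g. coarse locations of parts

    def step(self, observation=None, action=None):
        # Prefer sensory data when available; otherwise extrapolate, e.g.
        # 'the object keeps moving even though it is now behind the wall'.
        if observation is not None:
            self.state.update(observation)
        elif action is not None:
            self.state[action["part"]] = action["target"]

    def checkpoint(self):
        return copy.deepcopy(self.state)

    def restore(self, saved):
        self.state = copy.deepcopy(saved)

def plan(sim, goal, actions, depth=3):
    """Depth-first search over simulated actions, using checkpoints to backtrack."""
    if sim.state.get(goal["part"]) == goal["target"]:
        return []
    if depth == 0:
        return None
    for act in actions:
        saved = sim.checkpoint()
        sim.step(action=act)
        rest = plan(sim, goal, actions, depth - 1)
        if rest is not None:
            return [act] + rest
        sim.restore(saved)                  # re-start from the checkpoint
    return None

sim = ProcessSimulation({"pyramid": "on_table"})
goal = {"part": "pyramid", "target": "on_block"}
actions = [{"part": "pyramid", "target": "on_block"},
           {"part": "pyramid", "target": "on_table"}]
print(plan(sim, goal, actions))   # -> [{'part': 'pyramid', 'target': 'on_block'}]

Without the checkpoint and restore operations the simulation could only run forward, which is why the last item in the hypothesis claims they are needed for anything beyond the very simplest problem solving or planning.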
The ability to run these simulations during visual perception
may be shared with many animals, but probably only a small subset has
the ability to use these mechanisms for representing and reasoning about
processes that are not currently being perceived, including very
abstract processes that could never be perceived, e.g. processes
involving transformations of infinite ordinals.
I claim that the ability to understand causal relations in changing
structures is deeply connected with one of our two concepts of
causation, the Kantian concept, which is different from the much more
popular Humean interpretation of causation. The difference is discussed
in this draft paper.
In the talk I shall attempt to explain all this in more detail and
identify some of the unanswered questions arising out of the theory.
There are many research questions raised by all this, including
questions about the stages of development of such a visual architecture
--- e.g. the process of constructing new kinds of simulative
capabilities.
I would welcome criticisms, suggestions for improvement of the theory,
and suggestions for implementation on computers and in brains.
I shall continue working on these notes, including incorporating some of
the ideas in the notes on the polyflap domain and the notes on dynamical
systems vs other underpinnings of action.
"If a problem is too hard to solve, try a harder one"
(I have not found out who said that. If you know, please tell me.)
Here is an earlier 'New Theory of Vision'.
Comments to Aaron Sloman
Last updated 16 Oct 2005