School of Computer Science THE UNIVERSITY OF BIRMINGHAM CoSy project CogX project

VISION AND ACTION
REQUIREMENTS FOR SEEING THE REAL WORLD
Aaron Sloman
Last updated: 2 Oct 2009
Installed: 7 Mar 2009

I have been trying to work out what I would do if I had a team of
outstanding vision researchers with whom I could work for the next
few years (three to five years). What follows is a partial, draft,
set of answers, which will be updated from time to time.

People who specialise in vision research do not regard me as a
vision researcher, and there is some justification for that, insofar
as I spread myself very thinly over many topics, and I do not read
most of the published vision research reports. Nevertheless I have
been thinking about, reading about and writing about vision for over
30 years, including chapter 9 of
    The Computer Revolution in Philosophy.
I list some more papers, presentations, and discussions on vision
at the end of this file.

This work has mostly been about requirements for human-like
or animal-like vision systems, rather than specific designs,
although the details of requirements do suggest constraints on
designs, and indicate some minimal architectural features, as shown
crudely here (2nd page).

Notice however that that gives merely one view of a complex
multi-level, multi-functional, dynamical system. A different view is
being developed within the CogAff project, based on the variety of
types of architecture that can be accommodated within the CogAff
schema.

[Figure: the CogAff schema]
Different ways of filling in the schema will put different
mechanisms in the boxes and different connections between the
mechanisms. The lowest layer is found only in the simplest
organisms. The middle layer evolved much later, under pressure to
represent and reason about what doesn't exist. The top layer
probably evolved in parallel with the other two (and makes use of
them). It is concerned with meta-semantic competences: abilities to
represent and reason about things that represent and reason, with
obvious implications for self-monitoring and self control.

  The mechanisms do not all exist at birth in humans: they grow in carefully
   controlled phases using delayed development, for reasons explained in:
    Natural and artificial meta-configured altricial information-processing systems (2007).

The vision and action columns are also layered because evolution
discovered the need for perceptual and motor subsystems concerned
with acquiring and using information about the environment at levels
of abstraction corresponding to the different ontologies and
functions in the different layers. So waving to someone is an action
that requires meta-semantic competences and would be at least partly
under the control of the top, meta-management layer.

Likewise, seeing happiness or sadness in a face, or seeing an
intention in an action requires meta-semantic competences.

These meta-semantic perceptual, thinking, and action competences are
complex, but not necessarily more complex than abilities to perceive
and think about complex 3-D structures and processes in the
environment. E.g. ask yourself why it is that when a bolt goes
through a fixed nut, if the bolt is rotated about its axis that
makes it translate along its axis. Some more examples are in this
short discussion note
http://www.cs.bham.ac.uk/research/projects/cogaff/challenge.pdf
    "Perception of structure: Anyone Interested?"

Some disagreements with prevalent views

My work on vision has mainly been concerned with identifying
requirements for human-like or animal-like vision, including
requirements that will need to be met by visual systems in
intelligent robots that are currently far beyond the state of the
art in machine vision.

This work has led me to disagree with three widely held assumptions
regarding functions of vision, (1)-(3) below, and one widely used
assumption about good means to achieve those functions, (4) below:

(1) I don't agree with the widely held assumption that the main function
    of a 3-D vision system in an animal or a robot is recognizing objects:
    recognition is a secondary function, which results from seeing.
    There are many situations in which we can see an object, and even do
    things like pick it up, jump on it, avoid touching it, break it
    apart, prod it, push it, etc., even though we do not recognise the
    whole object either as being an instance of a known category, or
    as being a previously encountered individual. So we need to make
    object-recognition occur as a by-product of seeing, not as the main
    or most basic function of seeing.

    It is also important to stress that perception is at least as much
    about processes as about objects. Biological visual systems
    did not evolve to cope with a series of snapshots.

    Animals exist in and interact with an environment in which many
    processes of different sorts occur, including processes in which
    objects change their properties (e.g. shape or colour), their spatial
    relationships and their causal relationships and interactions.
    Furthermore these changes may be metrical, or qualitative,
    geometrical or topological, and may preserve or change complexity
    (e.g. as objects are combined to form more complex objects, or
    disassembled to form a larger collection of simpler objects).
    Perceiving these processes should not be confused with recognition.

    There are several issues concerning visual perception of 3-D objects
    that are not being addressed in part because of the excessive focus
    on recognition. One way to appreciate those problems is to consider
    how humans perceive objects they do not recognize. The proposal to
    study perception of polyflaps grew out of this requirement.

(2) I don't think 3-D vision should be thought of as producing
    some sort of internal model replicating or representing all the
    details of the scene, in such a way as to enable images of the scene
    to be generated by projection to different viewpoints. (This is one
    of the standard tests for success of a 3-D stereo system, but I
    think it is a misguided test).

    My brain cannot do that, yet I see a great deal of 3-D structure,
    and a great many processes in which 3-D structures are created or
    changed. That seems to be true of most people and animals with good
    vision. A small subset of individuals can learn to draw or paint
    what they see, but that is relatively rare.

    Examining things humans can do with pictures of impossible objects
    helps to undermine this 'isomorphic model-construction' view of 3-D
    vision. Some examples can be found here (PDF)

(3) Most vision researchers, in AI, psychology, etc. assume that
    vision is concerned with detecting what exists in the environment.
    This ignores the very important collection of issues first
    identified by J. J. Gibson, which he described in terms of perception
    of affordances.
       J. J. Gibson, The Ecological Approach to Visual Perception,
       Houghton Mifflin, Boston, MA, 1979.

    Detailed examination of Gibson's examples, and further investigation
    of functions of vision, indicate that a great deal of human vision
    is concerned not with what actually exists in the environment but
    with processes and objects that do not exist, but could exist,
    including both processes that could occur or be prevented as a
    result of actions of the perceiver (these involve affordances) and
    processes that could occur or be prevented by other things, e.g.
    something blowing in the wind, or being moved by gravity, or by
    another agent (I call these "proto affordances").

    A paper investigating some of the logical and philosophical
     implications of this is online here:
        'Actual Possibilities', in Principles of Knowledge Representation
        and Reasoning: Proceedings of the Fifth International Conference (KR '96),
        Eds L. C. Aiello and S. C. Shapiro, pp. 627--638, 1996.


(4) (Added 2 Oct 2009) Most vision researchers, in AI, psychology, etc.
    appear to assume that spatial locations, distances and angles are
    represented within a single global coordinate system, where

   (a) distances between items in the scene use a common metric so that everything
       can, for example, be expressed in cm., or multiples of some other
       fixed unit of length,

   (b) positions have coordinates relative to some common origin, where the
       coordinates make use of the common distance metric
   and
   (c) orientations in space have measurable angles relative to axes of that
       global coordinate system.

    I suspect that instead of that a young human child or animal develops a web
    of semi-metrical spatial relationships in each scene where lengths or
    distances are measured relative to other things in the scene, using partial
    orderings, e.g. X is longer than Y, X is longer than Z, the distance from P
    to Q is more than twice the distance from R to S, etc.

    The precise details of how this works, how the form of representation is
    learnt, and how the information thus expressed is used are all topics
    for further research. (See the presentation on ontologies for baby robots
    below, for more information.)
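
    As a rough illustration only, the contrast with a global coordinate
    system can be sketched in Python. This is not a proposal about how
    children or animals actually represent space. The class name
    RelativeLengthWeb, its methods, and the item labels are all
    hypothetical, invented for this sketch. It stores only "X is longer
    than Y" facts for one scene, derives further facts by transitivity,
    and answers "unknown" when two items have not been related, which is
    what makes the ordering partial rather than metrical.

```python
from collections import defaultdict


class RelativeLengthWeb:
    """Hypothetical store of 'X is longer than Y' facts for one scene.

    No common unit of length and no coordinates are used: items are only
    ever compared with other items in the same scene.
    """

    def __init__(self):
        # Maps an item to the set of items directly known to be shorter.
        self._shorter = defaultdict(set)

    def add_longer(self, longer, shorter):
        """Record the observation that `longer` is longer than `shorter`."""
        self._shorter[longer].add(shorter)

    def is_longer(self, a, b):
        """True if a is known, directly or by transitivity, to be longer than b."""
        seen, stack = set(), [a]
        while stack:
            item = stack.pop()
            for candidate in self._shorter.get(item, ()):
                if candidate == b:
                    return True
                if candidate not in seen:
                    seen.add(candidate)
                    stack.append(candidate)
        return False

    def compare(self, a, b):
        """Return 'longer', 'shorter', or 'unknown' (a partial ordering)."""
        if self.is_longer(a, b):
            return "longer"
        if self.is_longer(b, a):
            return "shorter"
        return "unknown"


# Example scene: X is longer than Y and Z, and Y is longer than W.
web = RelativeLengthWeb()
web.add_longer("X", "Y")
web.add_longer("X", "Z")
web.add_longer("Y", "W")

print(web.compare("X", "W"))  # follows from X > Y and Y > W
print(web.compare("Y", "Z"))  # Y and Z have never been related
```

    In a global metric representation every pair of lengths would be
    comparable. Here Y and Z remain unrelated until some further
    observation links them. Relations such as "more than twice the
    distance" are not covered by this sketch.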

What are the functions of vision?

Exactly what the functions of vision in animals are, and what the
functions should be in intelligent robots, is a hard unsolved
research topic on which more work needs to be done so that we have
much richer sets of requirements against which to evaluate
proposed designs.

I have been working on collecting requirements for a long time, and
trying to organise them into different categories. But I think there
is still a long way to go.

My paper for the Dagstuhl workshop on vision in February 2008 is the
latest in a series of attempts to get clear about this, and I still
think I am missing important requirements.

    http://www.cs.bham.ac.uk/research/projects/cosy/papers/#tr0801a
    Architectural and representational requirements for seeing
    processes, proto-affordances and affordances.

In particular, I think there are three major functions of vision to
be distinguished, that are shared with other animals, and some
additional ones that are unique to humans.

Three major functions of vision

    1. Visual servoing -- online control of actions involving
       production or prevention or alteration of 3-D processes of
       various kinds. This uses transient, constantly changing
       information.

       This is sometimes mistakenly referred to as the "where"
       function of vision, assumed to be the role of the "dorsal"
       visual stream.

    2. Producing factual, descriptive, re-usable, information that
       endures for different time-scales, about processes and
       structures in the environment, with perception of processes
       as probably more important than perception of structures.

       This is often mistakenly referred to as the "what" function
       of vision, assumed to be the role of the "ventral" visual
       stream.

       (Milner and Goodale recommended switching from the what/where
       terminology to a perception/action distinction, which I think
       is a mistake. Visual servoing includes both action and
       vision.)

    3. Producing information about what is not occurring, or does
       not exist but could occur or exist in the environment, and
       seeing constraints on such possibilities.

       This can be subdivided into seeing proto-affordances, seeing
       action-affordances, seeing epistemic-affordances, and
       limitations of epistemic affordances (e.g. seeing that information
       is not available, or that it is imprecise, etc.)

       Often perceiving such affordances involves recognising what
       kind of stuff (material) things are made of -- e.g. rigid,
       flexible, elastic, impenetrable, fragile, squishy, heavy,
       hard, soft, liquid, powdery, etc. Many of these are not
       properties that can be directly sensed. They often need
       to be inferred from perceived results of actions (i.e.
       perceived processes).

       Examples of possible processes that are hard to see and easy to
       see (at least for adult humans) can be found here.

Additional functions of vision, that build on those

    4. Seeing causes and effects of things that happen or could
       happen.

    a. Seeing why something happens or happened involves reasoning
       about causes and finding explanations,
       e.g. seeing that something is being moved because something
       else is pushing it.

    b. This is related to but different from predicting what will
       happen, e.g. a moving object will hit an obstacle.

       It seems that such reasoning can use visual structures and
       visual mechanisms in some cases, and logical or other
       non-visual information in other cases.

       NB: these affordances are seen as directly related to perceived
       parts, features and relations, especially relations between
       surface fragments and to possible processes.

       So they should not be thought of as involving abstract
       inferences based on recognition of object categories, e.g.
       "That's a handle so it is graspable", "that's a door so it is
       openable", etc.

       Instead, seeing something as graspable involves seeing how two
       or more controllable surfaces can be moved so that the object
       comes to be between them, and if the surfaces are then moved
       towards each other the object will be gripped, so that
       thereafter it will move together with the controllable surfaces.
       How all that might be expressed in the mind of a child, a
       robot, or a chimpanzee is an open research question.

    5. Seeing other things in the environment as 'sentient' with
       abilities to have intentions, perform actions, and have
       responses to things happening in the environment.

       E.g. seeing in which direction someone is looking, seeing
       what someone is looking at, seeing what someone is doing,
       seeing what someone is trying to do, seeing that someone is
       failing to achieve a goal, etc. This includes something like
       adopting what Dennett calls "the intentional stance" or using
       what Newell called "the knowledge level". But it need not
       assume rationality, as they claim.

    6. Seeing and understanding communications. That can include
       reading written text, understanding gestures, reading music,
       reading mathematical notation or program code, reading maps,
       etc.

NOTE ADDED 10 Mar 2009 (Revised 10 Jul 2009):
    PDF slides presented at a number of workshops and seminars
    recently elaborate on some of these points:

    http://www.cs.bham.ac.uk/research/projects/cogaff/talks/#brown
    Ontologies for baby animals and robots
    From "baby stuff" to the world of adult science: Developmental AI
    from a Kantian viewpoint.


I don't expect any project to achieve all of those, or even to aim
for all of them. But I think it is important when researching on
subsets of the functions of vision to pay attention to what the full
range of functions is, so that work done on the subsets can be
informed by the requirement to be used later on as part of a more
general system. Otherwise, there is the risk that work done on
subsets will not 'scale out' to interface with other subsets, and
will therefore have to be discarded when more ambitious projects are
attempted.

It may be desirable to develop a research project specifically to
identify long term requirements for visual systems that could be the
basis of a partially ordered scenario-based roadmap for vision
research (which will also necessarily involve research on other
functions that interact with vision systems). Some ways of thinking
about such roadmaps are indicated in this diagram:

    [Figure: roadmap]

Taken from this presentation:
    What's a Research Roadmap For? Why do we need one?
    How can we produce one?
    euCognition Research Roadmap meeting, January 2007.

If anyone is interested in collaborating on trying to assemble more
complete requirements for future vision systems, to provide the
context for the work to be done in the near future, then I would be
very interested to hear suggestions, including suggestions for
collaboration. However, I do not intend to apply for funding for
research in this area. I shall go on doing it anyway, time-sharing
with other research activities.

Papers, presentations and discussion notes on vision

    Papers (including book chapters)
    Presentations on vision (PDF files)
    Discussion notes on vision (HTML, plain text and PDF)

See also the vision sections of my Doings file.


Maintained by Aaron Sloman
School of Computer Science
The University of Birmingham