URL:
http://www.cs.bham.ac.uk/research/projects/cosy/deliverables/matrix/general-input/axs-fido.html
Last changed: 10 Dec 2005
General Issues about Perception in Fido
And some comments on CoSy's requirements
Fido, the domestic robot of the long term future, described
here,
will have several forms of perception, including vision, hearing, touch,
proprioception, temperature, and probably a host of internal sensors
monitoring physical needs. It will probably not need taste for its own
purposes (except in some wildly futuristic scenarios), but it may be
useful for a domestic robot to do taste testing if it is preparing food
for someone who is disabled.
The end-of-project robots in CoSy (PlayMate and Explorer) will have at
most
-
vision (one or more cameras)
-
hearing (mainly for speech input)
-
various kinds of additional sensors on the mobile robot
including bumpers, whiskers(?), infrared, sonar, ...
-
some feedback in the arm, but probably very little, and probably no
touch or force feedback.
In contrast, Fido may have powerful new sensing devices, including
cameras with much higher resolution and framerates than now, and vast
amounts of computing power and processing memory all available in
compact, lightweight physical packaging.
The functions of the various perceptual systems in Fido will need to be
derived from careful analysis of requirements in diverse scenarios, and
many of the requirements will not be obvious in advance. For example,
humans can hear someone approaching a door from the corridor outside.
Will Fido's hearing need to include that sort of capability? Humans can
understand a speaker of their own language of almost any age, either
gender, in a variety of emotional states, including shouting in anger,
whispering in order not to wake the baby, or speaking while sobbing.
Will we require that sort of flexibility in a domestic robot, or will it
suffice to train it to cope with a limited range of speakers speaking
with some care?
In CoSy we probably cannot expect to have unrestricted speech input, and
for some tasks may need to use typing or a mixture of typing and
graphical pointing via the screen.
Vision
One of the hardest problems for the design of intelligent human-like
systems is to be clear what the functions of vision are. The vast
majority of vision researchers (if not all) seem to make very specific
assumptions about the role of vision.
-
Some people focus on segmentation, recognition and classification, where
that means attaching labels to images or portions of images as a result
of some sort of training, often without any attempt to interpret
anything in terms of 3-D structure. That is one of several applications
of statistical methods, which also include tracking or prediction of
what is likely to happen next.
We can certainly assume that Fido will have to be able to do those
things, but they are relatively minor (shallow) aspects of vision, and
far more will be required for a human-like or animal-like visual system,
especially capabilities related to spatial structure.
-
Others such as David Marr assumed that the function of vision was to
segment a scene into 3-D objects and find their locations, pose, shape,
motion, colour, texture and possibly other physical properties and
relationships. This included finding the orientations of visible
surfaces.
-
J.J. Gibson produced a theory that implied that the functions of vision
were much less concerned with information about objective,
observer-independent physical properties. Instead he emphasised
affordances, which are different for different kinds of animals, and
might be different at different times, even for the same animal. On this
view the function of vision in an animal (and presumably a robot) is
to provide information about
which of the actions that the animal is capable of performing, and that
might be relevant to its goals or preferences, are and are not possible,
or what needs to be done to make them possible. On this view the
information provided by a vision (or other perception) system is
not simply about the contents of the environment but about
relations between at least the following
-
things in the environment,
-
the perceiver's body,
-
physical capabilities,
-
current and possible future concerns (goals, preferences, needs, etc.)
-
Those who emphasise a 'dynamical systems' approach to the study of
intelligence emphasise the role of perception, and vision in particular,
in fine-grained or continuous control of actions, including catching
things, avoiding things, lifting things, throwing things, etc. Where
control is continuous, differential equations may be used to represent
relationships between sensory states and motor control states.
-
Philosophers have thought of vision and other senses as providing
information relevant to generating and testing beliefs about
regularities in the environment, e.g. unsupported objects fall, things
that look like apples are good to eat, the sun goes round the earth, ...
Less sophisticated versions focus only on correlations between different
sensory data. More sophisticated versions of this assume that
perception, possibly augmented with additional devices such as measuring
instruments, can be relevant to far more abstract theories, e.g. about
the relations between different physical forces, or the atomic structure
of different substances.
-
Another use that is part of our everyday life is communication: we read
many written statements, questions, stories, instructions, equations,
tables of numbers, etc. We also see many other things as representing
some meaning, including maps, flow-charts, diagrams in proofs, etc.
We also visually see intentional gestures as communications.
-
Closely related to the previous point is our ability to see states of
other minds by 'reading' involuntary expressions, including facial
expressions, postures and various forms of movement and eye gaze. It is
significant that we
see, rather than merely infer, happiness, sadness and
other mental states. That presupposes that the visual system has access
to representations of non-physical phenomena.
-
A quite different abstract function of perception is to provide
understanding of how things work. This happens for example when a clock
is opened up and we see how the various movements are causally linked.
This kind of function of perception is discussed in this presentation
on kinds of causality.
This seems to be closely related to human abilities to reason
mathematically, especially using diagrams and other visual aids, for
instance using maps to reason about routes. It is also relevant to uses
of vision in designing new machinery, designing new algorithms and many
other design activities.
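As a rough illustration of the 'dynamical systems' view listed above,
the following sketch couples a sensory state to a motor state through a
simple differential equation and integrates it in small time steps.
Everything in it is an assumption made for illustration: the
one-dimensional hand and target, the function name track, the gain and
the step size. It is not a model of any CoSy component.

```python
def track(hand, target, gain=4.0, dt=0.01, steps=300):
    """Euler-integrate d(hand)/dt = gain * (target - hand).

    All names and values are hypothetical; the point is only that the
    motor command is a continuous function of what is currently sensed.
    """
    trajectory = [hand]
    for _ in range(steps):
        error = target - hand      # sensory state: perceived offset to target
        velocity = gain * error    # motor state: commanded velocity
        hand += velocity * dt      # advance the hand by one small time step
        trajectory.append(hand)
    return trajectory


path = track(hand=0.0, target=1.0)
```

The hand approaches the target smoothly without any explicit plan or
symbolic description of the scene, which is the kind of fine-grained
control that this approach emphasises.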
At this stage it is not clear how many of these functions of vision will
be needed in a domestic robot like Fido. It is very likely that
eventually all of them will be found in robots. But for now we can
attempt to identify a subset of visual abilities that are not yet
achieved by robots and which will be of general use if ways can be found
to implement them.
Representations of objects, especially shape
One of the hard problems in specifying requirements is to be clear what
sort of information perception should provide about physical objects.
Location:
Some things are relatively clear such as the requirement to be able to
represent where an object is in space, though that leaves open whether
that means providing
-
an absolute location specification relative to some
global frame of reference,
or (more likely)
-
a collection of spatial relationships to other things in the same
general location,
or (even more likely)
-
a combination of representations of different types, including relatively
global coarse-grained 2-D and 3-D topographical 'maps' that have various
regions with both topological and rough metrical properties and
relations, along with a wide variety of more detailed task-specific
relations not only between whole objects but also parts of objects (as
discussed in connection with 'multi-strand' relations in
CoSy Report DR.2.1, also available here).
Some of the maps may actually be best thought of as interlinked networks
of routes (as many transport maps are). For animals that lack vision or
mostly live in underground tunnel networks, it may be that such 'route
maps' form the only representation of spatial relationships, apart from
the temporary relations involved in manipulating (e.g. eating)
small objects. Perhaps that is also true of a young child's
representation of a house, or the representations created by adults of
large buildings in which they live or work, e.g. a hotel, hospital or
office block, for which they have no usable global 3-D representation,
only a collection of routes between the places actually visited in the
building. For Fido that sort of partial representation may be all that
is available during learning about a new building, unless the robot has
access to a database of richer information about the building.
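A 'route map' of the kind just described can be sketched as a network
whose nodes are the places actually visited and whose links are the
routes travelled between them, with no global coordinates at all. The
place names and the function find_route below are hypothetical, chosen
only to illustrate the idea.

```python
from collections import deque

# Hypothetical places in a building. Only connections that have actually
# been travelled are stored; there is no global 2-D or 3-D layout.
ROUTES = {
    "lobby": ["corridor"],
    "corridor": ["lobby", "kitchen", "stairs"],
    "kitchen": ["corridor"],
    "stairs": ["corridor", "office"],
    "office": ["stairs"],
}


def find_route(routes, start, goal):
    """Breadth-first search over known routes; returns a list of places or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in routes.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # the goal has never been reached by any known route


print(find_route(ROUTES, "kitchen", "office"))
# → ['kitchen', 'corridor', 'stairs', 'office']
```

A place that has never been visited, such as a hypothetical "garden",
simply cannot be planned for: find_route(ROUTES, "kitchen", "garden")
returns None, which reflects the partial nature of such a
representation.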
Shape as mediator between perception and action:
Jeremy's document on requirements
points out the need for the PlayMate to understand relations between
actions and perceived shape and relationships: if a block is against a
wall, understanding its shape includes understanding how it will respond
to forces applied to different parts of the surface of the block in
different directions. This is a special case of the sensory-motor
contingencies studied by
the CNRS group.
This can be contrasted with attempts at constructing ontologies for
objects that assume that all spatial structure can be expressed as
part-whole hierarchies. The use of such hierarchies ignores the
fact that a typical natural object, such as a rock, or a tree-trunk, or
a human body has no unique decomposition into a tree of parts, and any such
decomposition will capture only a small subset of relationships between
parts, focusing mainly on permanent relationships, making it hard to
express, for example, the fact that the end of a person's left little
finger is in his left ear.
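The limitation can be made concrete with a small sketch. The part names,
the PART_OF table and the triple format are all hypothetical; the only
point is that a part-whole tree has no place for a transient relation
between parts on different branches, whereas a separate store of
relations does.

```python
# Hypothetical part-whole tree: each part maps to the whole it belongs to.
PART_OF = {
    "left little fingertip": "left little finger",
    "left little finger": "left hand",
    "left hand": "left arm",
    "left arm": "body",
    "left ear": "head",
    "head": "body",
}


def ancestors(part):
    """Follow part-of links upwards and return the chain of wholes."""
    chain = []
    while part in PART_OF:
        part = PART_OF[part]
        chain.append(part)
    return chain


# Transient relations are kept outside the tree as (relation, a, b) triples.
relations = set()
relations.add(("inside", "left little fingertip", "left ear"))

# The tree links the fingertip and the ear only via their common whole.
print(ancestors("left little fingertip"))
print(("inside", "left little fingertip", "left ear") in relations)
```

In the tree alone the fingertip and the ear are connected only through
"body"; the fact that one is currently inside the other has to be
recorded as an additional, changeable relation.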
Can we say what Fido will see in arbitrary objects in a domestic
context?
If we ask what is common to clothing, blankets, food and drink of
various kinds, containers, kitchen utensils and gadgets, furniture,
wall-fittings, doors, carpets, windows, curtains, animals, humans,
spaces, regions, routes and other things that Fido will perceive, it may
at first seem that the answer is very little. But perhaps we can find
the right sort of generality by moving to the right level of
abstraction. One way of thinking about that is to ask about the
dimensions in which information about objects can vary, and then
fit objects into appropriate categories within different dimensions.
The complication is that we can probably find no useful dimensions that
are completely independent of one another (orthogonal).
-
What sort of physical material is it made of?
This will include information about
-
whether the object is
rigid or compressible, stretchable, twistable, bendable
-
how light or heavy the object is, which influences both attempts to lift
it and also to push, pull or roll it
-
what sort of surface properties the object has, e.g. rough, smooth,
sticky, slippery, hard, soft, furry, fluffy
-
What sorts of actions can be applied to it and what results from those
actions?
This will capture many aspects of shape, but will interact with the
material the object is made of. Actions applied to the object include
-
touching it in various locations and applying slight pressure
E.g. what does the surface feel like and does it give or not?
This could be extended to things like tapping the surface with a hard
part of the body, like a fingernail, and noting both the haptic feedback
and the audible effects.
-
sliding a finger along various parts of the surface in various
directions
E.g. note how the existence of bumps, dents, grooves, ridges, convexity,
concavity, smooth changes of curvature, sharp changes of curvature,
affect possible motions along the surface, including
the kinds of tactile feedback to be expected, and the kind of control
program needed to maintain contact and maintain the motion.
-
applying forces at different locations in different directions
E.g. there could be a single force at one location, or several, as in
grasping, stretching, twisting, steering; different consequences will be
found if the force is applied with or without slippage, which may depend
on the properties of the surface.
-
changing viewpoint, including moving towards or away from the object,
moving left, right, up, down or in other directions, all of which will
produce changes in the 2-D appearance and the 3-D information available.
Some of the changes will be discontinuous e.g. when a new face of a cube
(along with some of its edges and corners) becomes visible or invisible
as a result of a move (discussed in Minsky's 1978 frames paper, and used
in 'aspect graphs').
Some will be continuous, e.g. as lateral movement changes the projective
relationships between edges of a face or the variation in texture on a
visible face, or as a larger area of a surface gradually becomes visible
or invisible, while the motion proceeds. (Can all this be expressed in a
learnable mapping (e.g. perhaps using a trainable neural net) from
surface shape + viewer location + type of motion
to
visible changes?)
Perhaps we can combine all the above ideas into the notion of perception
of an object as involving creation of a generalised aspect graph
which includes a great deal of information about possible actions and
the consequences that would follow, where many of the actions are
associated with particular parts of the object (e.g. indexed by parts in
the information structure) and probably only created on demand as
opposed to being automatically always generated by perceptual processes.
If this is correct then much of what happens during early learning could
involve extensions of abilities to create various kinds of aspect graphs
adding new kinds of actions, new kinds of object parts, new kinds of
materials of objects, new kinds of consequences of actions, and new
levels of abstraction to simplify and generalise the use of the
information in planning and reasoning.
-
What can the object be used for?
This will depend on all the other features in a host of different ways.
It will also, of course, depend on the kinds of needs, desires or goals
that the perceiver is capable of having and the kinds of actions it is
capable of performing. This is one of the ideas behind the notion of
perceiving affordances.
That can be generalised to include perceiving vicarious
affordances, namely affordances for other individuals, as required
when looking after a child, or when considering opportunities for
another agent such as a predator or malicious person to do something it
might wish to do. The ability to perceive vicarious affordances would be
important in a robot whose function includes looking after other
individuals.
A subtle generalisation of this is the ability to perceive
hypothetical affordances, i.e.
possible future affordances for oneself, e.g. if I moved to location X
could I see Y, or grasp Z, etc.? This is an essential part of the
ability to plan movements in space and is obviously linked to the
features of objects that are involved in perceiving currently available
affordances, but requires the ability also to represent what does not
now exist.
It could be that in animals with deliberative capabilities,
the ability to acquire and use information about hypothetical
affordances for oneself evolved before the ability to perceive and make
use of vicarious affordances in social interaction. In fact there is
some evidence that some animals (e.g. primates) have the former without
the latter, at least in relation to some manipulative tasks.
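One very small piece of this can be sketched in code: answering the
hypothetical question 'if I moved to location X could I see Y?' for the
faces of a cube, and recording the discontinuous changes of appearance
that aspect graphs are meant to capture. The sketch assumes an
axis-aligned unit cube centred on the origin and ignores everything
except viewpoint; the names FACES, visible_faces and aspect_transitions
are hypothetical, and nothing here addresses the action-related
information that a generalised aspect graph would also need.

```python
# Outward normals of an axis-aligned unit cube centred on the origin
# (an assumption made purely for illustration).
FACES = {
    "+x": (1, 0, 0), "-x": (-1, 0, 0),
    "+y": (0, 1, 0), "-y": (0, -1, 0),
    "+z": (0, 0, 1), "-z": (0, 0, -1),
}
HALF = 0.5  # distance from the cube centre to each face


def visible_faces(viewpoint):
    """Return the faces whose outward side is turned towards the viewpoint."""
    visible = set()
    for name, normal in FACES.items():
        # Signed distance of the viewpoint from the plane of this face.
        dist = sum(n * v for n, v in zip(normal, viewpoint)) - HALF
        if dist > 0:
            visible.add(name)
    return frozenset(visible)


def aspect_transitions(path):
    """List (before, after) pairs wherever the set of visible faces changes."""
    transitions = []
    previous = visible_faces(path[0])
    for point in path[1:]:
        current = visible_faces(point)
        if current != previous:
            transitions.append((previous, current))
            previous = current
    return transitions


# A viewer slides sideways in front of the cube at a fixed distance.
path = [(x / 10, 0.0, 3.0) for x in range(0, 21)]
changes = aspect_transitions(path)
```

Along this path the set of visible faces changes exactly once, when the
viewer passes the edge of the front face and the "+x" face comes into
view. The continuous changes in projected shape between such events are
not represented at all in this sketch.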
Implications for CoSy
The above discussion of long term requirements for domestic robots
provides an indication of many of the difficulties that lie ahead.
As far as CoSy is concerned we can expect only a tiny subset of these
capabilities to be provided by August 2008.
Some examples can be found
here.