School of Computer Science THE UNIVERSITY OF BIRMINGHAM CoSy project CogX project

Can a Robot Grasp Grasping?
How can a robot understand what's going on when it grasps something?
(WARNING: This is work in progress, and liable to change)

Aaron Sloman
School of Computer Science, University of Birmingham.
(Philosopher in a Computer Science department)

Installed: 6 Jun 2011
Last updated: 9 Jun 2011; 16 Jun 2011; 9 Jul 2012

A PDF version may be added later.



There are many projects aiming to give machines competences involving perceiving and
acting on objects in the environment, or exploring an environment to develop some
sort of map of the terrain, or of part of a building.

Insofar as these projects aim to contribute to our scientific understanding, as opposed
to being wholly justified by their practical usefulness (like the note-counting
machines in automatic cash dispensers, and many robots tailored to accurate and reliable
performance of a very specific task on a factory production line), there is a requirement
for the designs of the machines to have some well-defined kind of generality, so that
the researchers can explain in a principled way what the machines can and cannot do and
why, and preferably also show how this achievement contributes towards broader and deeper
longer term goals.

There are different ways of characterising the required generality. A common way is to
assemble a large and varied collection of test cases from some corpus, e.g. pictures or
sentences on the internet, or a set of behaviours generated by a sizeable sample of
naive subjects in a laboratory experiment.

I have always found those ways of specifying the scope of a theory unsatisfactory: there
should be a more principled way than merely collecting examples. It should be possible to
explain what those examples have in common and why it is of interest to find a general way
of handling them, and why other things should not be included in the scope of the theory
or model, nor used for testing it. For example, it is fairly easy to give good reasons for
not expecting Newton's theory of gravitational attraction and his mechanics to provide a
satisfactory explanation of the pattern of motion of a leaf falling from a tree (though
this may not have been easy before Newton's time: a good theory may teach us to
characterise its domain of applicability).

I feel that a very high proportion of research being done in AI and Robotics fails to
meet this criterion -- even if the research is interesting and potentially valuable
for other reasons (including being a step on the way to producing a theory or model
that does meet the criterion).

That leaves the problem of deciding how to select collections of cases that have the right
sort of generality. I have many examples in things I have been writing about child or
animal development, or challenges for AI (e.g. proposing the polyflap domain
as a potentially useful robotic challenge).

That domain is generative in the sense that there is a (fairly) precisely specified way
of producing more and more complex and varied examples that could be used as test cases.

In this document I'll attempt to characterise a domain that is generative in the
sense that its examples can be decomposed into features that can be combined in
systematically varied ways. I have not yet tried to produce a precise formal
specification of that generality, but I hope the examples will suffice for now,
making use of the powerful human ability to observe the structure common to a
collection of cases, currently lacking in computers. Later we need to characterise
the domain, and criteria for success, or at least progress, more precisely.

So far, my characterisation of the domain, below, is not complete. I'll investigate the
possibility of addressing that later. For now I want to indicate how a domain of processes
can be generated by systematically varying the geometric configurations, materials used,
and operations or forces applied to different parts of objects.

One kind of generality that is missing from the examples below is the recursive use of
abilities to rearrange physical matter in order to achieve a new state in which
possibilities and constraints are altered so as to allow (or help, or prevent) certain
additional rearrangements. See this discussion of varieties of deliberation for more on
the requirements for such competences:

Some of the proposals below are similar to points made by Karmiloff-Smith in Beyond Modularity,
about the transitions that can occur in a learner after "behavioural mastery" has been
achieved. See

How to Develop Scenarios for a Grasping Robot (and others)

A general principle for designing scenarios so as to avoid dead ends is that every
particular kind of process in the scenario is a special case of a well defined class
of processes. Finding out what that class should be is a non-trivial research problem.
(It is probably connected with what goes on during infant and toddler learning:
discovering good ways to generalise beyond examples already learnt -- by developing
a generative theory, where possible.)

Some early work in vision attempted to meet this criterion by considering types of image
that could be generated by a grammar (e.g. a web grammar) and then specifying an algorithm
or collection of algorithms able to cope with all instances of the grammar, e.g. by
producing a 3-D interpretation.

Instead of grammars, some researchers systematically studied classes of picture element and
ways of combining them to form larger pictures, and derived general modes of
interpretation of such pictures (e.g. the Huffman-Clowes line-labelling algorithm for
interpreting 2-D pictures of trihedral polyhedra, later expanded by Waltz to include a
wider range of scenes and pictures). More recent work aimed at extending that generality is
    Can Machines Interpret Line Drawings?
    P. A. C. Varley, R. R. Martin and H. Suzuki

The competences involved in the particular scenario should be particular cases of
general competences. The combinations of competences in the scenario should be
special cases of modes of composition of competences, in the sense discussed here.

So even if a practical project has narrowly specified goals, if it is to contribute
to scientific understanding it should have the sort of generality described here,
even if not all of the generality is required for practical goals. Not all practical
projects need have scientific goals. Many don't.

However, if a project is to produce results that are robust and extendable, then it
is important for the tests and designs chosen in the scenarios to include cases that
are not required for the specific practical goals. For example, some situations can
arise that are undesirable, but the fact that they are not desired does not mean that
they should not be understood and dealt with if they arise. This is a way to avoid
premature over-specialisation, which can easily hold up a field like AI (viewed as
science rather than engineering), including robotics.

This principle can be applied to:
    kinds of material,
    kinds of relationship,
    kinds of causal influence,
    kinds of shape,
    kinds of action,
    kinds of learning,
    kinds of reasoning,

addressed in the project. I have previously referred to this as the need for models
not just to scale up (e.g. cope with larger data-sets) but also to scale out (i.e.
cope with more varied types of challenge, and in combination with different parts of
a whole architecture, when required).

[NOTE: I think this requirement to "scale out" is related to what John McCarthy
called "Elaboration tolerance", though he presented that as a criterion for adequacy
of a formalism rather than a mechanism. I recently found that some computing
researchers use the same labels for a different distinction, sometimes also
contrasting "scaling vertically" with "scaling horizontally". I suspect there is some
loose connection with the contrast I am making.]

Intelligent robots need not only to do things, but also to know what they are
doing. Any type of action or process or state of affairs that an agent needs to
be able to produce should also be something the agent can perceive, think about,
reason about, etc., even when the process or state of affairs is not a product of its
own actions.

I think new-born human infants lack that kind of intelligence. It develops through
extensions to the architecture and to the forms of representation and types of
mechanism required. The ability is never fully developed even in adult humans:
they can go on learning indefinitely as they acquire new domains of expertise.

This apparently subsumes what Karmiloff-Smith calls "Representational Redescription"
(in her book Beyond Modularity 1992) as I've discussed in
    (Work in progress.)

The ontology needed for perception, planning, reasoning, action-control

Actions involving manipulation include not only processes involving changing spatial
relationships within and between objects, but also causal interactions of various
kinds. Causation is not perceived in the same way as shape, position, velocity,
shape-change, colour, etc. Humans, some animals, and future intelligent robots need
both Humean (associative) and Kantian (structure-based) conceptions of causation, as
discussed here (with Jackie Chappell):

So projects aimed at producing robots with (adult) human-like intelligence will have
to specify what it is for a robot to understand, and be able to reason about,
different sorts of causation. (That's very hard. Even good philosophers find it very
hard.)
That's not an exhaustive list, merely illustrative.

Here are some example test cases for a robot that is to be able to manipulate
non-rigid materials. Each case can be varied either by changing the material, or by
changing the initial situation, or by changing the final state or by varying the
process of going from initial to final state.

For each action type that the robot can perform it should also be able to
perceive that action, whether done by itself or by others, and from different
viewpoints. Examples follow:

Agent sees a square of some material on a table with a small portion sticking out
over the edge -- so that it can be grasped and moved by the robot, or someone else.

Variations: the material can be cloth (handkerchief), towelling, tissue paper,
cardboard, writing paper, clingfilm or other plastic, tinfoil, a slice of bread,
pastry, dough, flattened plasticine, ... (Some of these may be very difficult, and
best postponed. At what ages can young children deal with them?)

Variations: the shape can be rectangular, with different ratios of long and short
side, it can be triangular, or some other polygonal shape, or a curved shape.

Variations: the orientation of the shape with respect to the edge of the table can
vary (so that for the same shape the bit sticking out can have different appearances
and grasping requirements, and the same action after grasping can have different
effects).

    The motion after grasping (with a firm grasp that allows no slippage between the
    fingers) can be horizontal and unidirectional for a short distance. The motion
    can continue indefinitely. The motion of the grasped edge can oscillate at
    various speeds.

    The motion can be vertical (lifting the grasped edge), by varying amounts, at
    varying speeds, with the orientation of the grasped bit either kept horizontal or
    varied, e.g. so as to avoid a sharp bend beyond the grasp area. It can be
    unidirectional (just lifting) or lifting and lowering.

    The motion can be pulling: either pulling horizontally away from the edge of the
    table or pulling downwards below the edge of the table, and various directions of
    pull in between.

    The motion can be pushing: pushing the grasped edge along the surface of the
    table orthogonally to the edge of the table, and further varied by pushing in
    different directions.

    The motion can be folding: lifting the grasped edge and moving it over the table
    then down onto another part of the object. Variations include trajectory height,
    the orientation of the plane of the trajectory relative to the edge of the table,
    where the trajectory ends, and how the orientation of grasp varies during the
    motion.

    The folding motion may be followed by pressing down on parts of the material
    along the fold and in other places.

    Other variations can involve holding down portions of the material while the
    grasped portion is moved.

    (It would be good to have photographs or videos illustrating all the above.)

After a learning process many different tests are possible, with different materials,
different shapes, different kinds of motion.
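
The generative character of these test cases can be made concrete with a small
enumeration. The Python sketch below is only illustrative: the feature lists are
abbreviated from the variations described above, and the names (`MATERIALS`,
`SHAPES`, `MOTIONS`, `Scenario`, `generate_scenarios`) are hypothetical, not part of
any existing system.

```python
from dataclasses import dataclass
from itertools import product

# Abbreviated feature lists taken from the variations described in the text.
MATERIALS = ["cloth", "tissue paper", "cardboard", "clingfilm", "tinfoil", "dough"]
SHAPES = ["square", "rectangle", "triangle", "other polygon", "curved"]
MOTIONS = ["horizontal", "vertical lift", "pull", "push", "fold"]


@dataclass(frozen=True)
class Scenario:
    """One test case: a material, a shape and a type of motion after grasping."""
    material: str
    shape: str
    motion: str


def generate_scenarios(materials=MATERIALS, shapes=SHAPES, motions=MOTIONS):
    """Yield every combination of material, shape and motion type."""
    for material, shape, motion in product(materials, shapes, motions):
        yield Scenario(material, shape, motion)


if __name__ == "__main__":
    scenarios = list(generate_scenarios())
    print(len(scenarios), "scenarios, e.g.", scenarios[0])
```

Holding some combinations back during learning and presenting them afterwards is one
possible way to probe whether a competence generalises across the domain rather than
covering only the cases already encountered.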

Can the agent (at least roughly) predict what changes will occur if a pair of fingers
(one above the other) grasps the overlapped portion and lifts it straight up until
there's no more contact with the table, without altering the orientation of the
grasping point?

Can the robot predict what will happen if instead of moving up, the fingers move
horizontally, parallel to the edge of the table for a metre or more? What sorts of
obstacles could obstruct, or modify the motion?

Can the robot predict what will happen if the fingers gripping the corner rotate
until that corner is pointing upwards, and then they move to where the opposite
corner is?
Two cases:
(a) horizontal motion
(b) motion in an arc, going up then down.

Varieties of imprecision and uncertainty

What forms should the predictions take? I cannot predict precise changes, but I can
talk about how relationships will change during the predicted motion. I can make the
predictions at various levels of abstraction, with different kinds of certainty. E.g.
if the object moved is made of cloth and the grasped edge is lifted a distance
that is more than the maximum diameter of the cloth then the cloth will eventually no
longer be in contact with the table. I don't need to know what the maximum diameter
is for that prediction to hold.
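
One way to spell out the reasoning behind that prediction, under assumptions that are
not stated in the text: suppose the cloth is inextensible and its flat outline is
convex with maximum diameter $d$, the table top is the plane $z = 0$, and the grasped
point $g$ has been lifted to height $z_g = h$. Then for any point $p$ of the cloth:

```latex
\[
  \lVert p - g \rVert \le d
  \quad\Longrightarrow\quad
  z_p \;\ge\; z_g - \lVert p - g \rVert \;\ge\; h - d ,
\]
\[
  \text{so}\quad h > d \;\Longrightarrow\; z_p > 0
  \quad\text{for every point } p \text{ of the cloth.}
\]
```

The conclusion uses only the inequality $h > d$, not the value of $d$, which is why
the prediction can be made without measuring the cloth. For a stretchable material, or
a strongly non-convex outline such as a spiral strip, the bound $d$ would have to be
replaced by something larger.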

I can point to a height that I know will be sufficient to raise the cloth so that it
is no longer in contact.

I can make predictions about how the shape will change during the motions, using
notions like folding, curvature, increasing or decreasing curvature, flattening, etc.
without being able to specify numerical values for those processes or their results.

Some of the changes involve topological relations (e.g. loss of contact) and in that
sense are described precisely. Some of the changes can be given bounds that are
definite, though not precise, upper or lower bounds. E.g. I know that during vertical
movement of the corner of the cloth the cloth will lose contact with the table
before the grasping point has reached this height (indicated by pointing) even
though I don't know the exact height at which it will lose contact. I can also say
that there will still be contact when the grasped point has reached this
height (pointing at a lower height).
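
A minimal Python sketch of this kind of bounded but imprecise prediction follows. It
assumes the agent knows only two heights: a lower one at which contact certainly
remains and a higher one at which contact has certainly been lost. It also assumes
that contact, once lost during lifting, is not regained. The function name
`predict_contact` and the numbers in the example are hypothetical.

```python
def predict_contact(lift_height, still_touching_at, surely_clear_at):
    """Three-valued prediction of cloth/table contact after a vertical lift.

    still_touching_at: a height at which contact is known to remain.
    surely_clear_at:   a height at which contact is known to be lost.
    The exact height at which contact is lost lies somewhere in between
    and is never needed.
    """
    if still_touching_at >= surely_clear_at:
        raise ValueError("lower bound must be below upper bound")
    if lift_height <= still_touching_at:
        return "contact"
    if lift_height >= surely_clear_at:
        return "no contact"
    return "unknown"


if __name__ == "__main__":
    # Hypothetical bounds (in metres) for some particular cloth.
    for h in (0.05, 0.30, 0.80):
        print(h, predict_contact(h, still_touching_at=0.10, surely_clear_at=0.60))
```

The answer "unknown" is part of the point: between the two bounds the agent can still
say definitely that it does not know, without estimating a numerical value.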

These requirements merely scratch the surface of what is required in the
specification for a human-like robot.

There are lots of deep and difficult implications regarding
    the ontologies required
    the forms of representation
    the forms of reasoning
    the implementation mechanisms
    the architectural decomposition of functions
    the processes of learning
    the processes of development

(To be extended...)

This paper follows on (in various directions) from these:
Orthogonal Recombinable Competences Acquired by Altricial Species
    (Blankets, string, and plywood)

Introduction to the 'Polyflap' domain for robot manipulation.
    Discussion note on the polyflap domain (to be explored by an `altricial' robot)

Requirements for animals and robots to develop ontologies for "kinds of stuff"

Requirements for predicting affordance changes

Presentation on seeing processes

Comments on "The Emulating Interview... with Rick Grush"

And various discussions on requirements for abilities to perceive, understand and reason
about spatial structures, and processes involving changes of spatial structures, since
about 1971.


These ideas relate closely to Maria Petrou's entertaining discussion of
robot ironing. See:

See also the impressive laundry-manipulating robot at UC-Berkeley
    The video includes an intriguing but unexplained comment about the robot "simulating
    everything" before acting. There are very different ways of simulating:

    (a) simulations that provide very precise predictions about a single configuration and a
    single trajectory,

    (b) simulations that allow reasoning about classes of cases, as illustrated in this document.
    The latter is required for animal intelligence involving perception of affordances of
    various kinds -- including proto-affordances, action affordances for the robot, vicarious
    affordances (for someone else), epistemic affordances, deliberative affordances,
    communicative affordances...

    Simulations of type (a) can use variants of "game-engine" technology. They can be very
    useful for on-line control of actions using feed-forward mechanisms, e.g. to predict
    required adjustments to the current trajectory, etc.

    Simulations of type (b) have quite different functionality and can be used in
    answering questions about what would happen if, what might have caused something to
    happen, what options would be available if some action were performed, etc.
    Type (b) simulations require something very different from the precise modelling done
    in game-engines. For example, the sort of reasoning you do when working out how to get
    an arm-chair through a door that's too narrow for it to be pushed through upright,
    involves representing types of sub-process and types of intermediate situations, rather
    than the precise details required for controlling motion when the action is actually
    being performed.

    You can work out combinations of types of translations and rotations without having
    the kind of representational precision required to generate a video of the process.
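
    As a rough illustration of reasoning of type (b), the Python sketch below searches
    over types of situation and types of sub-process for the arm-chair example. All of
    it is hypothetical: the three locations, the two orientations, the constraint that
    an upright chair cannot be in the doorway, and the assumption that rotations need
    the space of a room are invented for the sketch. No coordinates, dimensions or
    trajectories appear anywhere in it.

```python
from collections import deque

LOCATIONS = ["room A", "doorway", "room B"]   # ordered along the path
ORIENTATIONS = ["upright", "on its side"]


def is_possible(state):
    """Assumed constraint: an upright chair is too wide to be in the doorway."""
    location, orientation = state
    return not (location == "doorway" and orientation == "upright")


def successors(state):
    """Yield (action type, resulting state) pairs for qualitative sub-processes."""
    location, orientation = state
    i = LOCATIONS.index(location)
    # Translations: move to an adjacent location, keeping the orientation.
    for j, name in ((i - 1, "translate back"), (i + 1, "translate forward")):
        if 0 <= j < len(LOCATIONS):
            yield name, (LOCATIONS[j], orientation)
    # Rotations: change orientation (assumed to need the space of a room).
    if location != "doorway":
        for other in ORIENTATIONS:
            if other != orientation:
                yield f"rotate to {other}", (location, other)


def plan(start, goal):
    """Breadth-first search over types of situation; returns a list of action types."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, actions = frontier.popleft()
        if state == goal:
            return actions
        for action, nxt in successors(state):
            if is_possible(nxt) and nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, actions + [action]))
    return None  # no sequence of action types reaches the goal


if __name__ == "__main__":
    print(plan(("room A", "upright"), ("room B", "upright")))
```

    The plan found is a sequence of types of rotation and translation. Turning it into
    an actual motion would still need the more precise, on-line control associated with
    simulations of type (a).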

    This document is about the types of representation of structure and process required
    for competences of type (b). But it is only a small start. (I have made other starts
    in related directions in the other documents referred to.)

    [Compare confusions about dorsal and ventral visual streams.[REF...]]

Maintained by Aaron Sloman
School of Computer Science
The University of Birmingham