School of Computer Science THE UNIVERSITY OF BIRMINGHAM CoSy project CogX project

Can a Robot Grasp Grasping?
How can a robot understand what's going on when it grasps something?
(WARNING: This is work in progress, and liable to change)

Aaron Sloman
School of Computer Science, University of Birmingham.

Installed: 6 Jun 2011
Last updated:
27 May 2015: Minor reformatting and additions.
30 Jan 2015: Homage to Maria Petrou;
1 Jan 2013: Added distinction between online and offline intelligence;
9 Jun 2011; 16 Jun 2011; 9 Jul 2012;

This paper is
A PDF version may be added later.

A partial index of discussion notes is in

Maria Petrou (1953-2012) (Added 30 Jan 2015)

This document is closely related to a subset of Maria Petrou's ideas presented in her cartoons and discussions of an ironing robot, linked below.

I first met and talked with her at the BMVA conference in Manchester in 2001, and thereafter met her intermittently at workshops and conferences. We never worked together, but I found our discussions and some of her online papers, including the semi-serious presentation of her robot ironing tutorial and her discussion of her great aunt's ideas on ironing, very stimulating, and closely related to my own investigations regarding the need to get machines to understand 'kinds of stuff', illustrated below, and in this presentation:

In 2011, during the proposal phase, Maria invited me to be a member of the advisory board of the CloPeMa project (see below) of which she was the leader, and I accepted, but I assume I was found unacceptable by the EU. The last exchange we had was in June 2011, when she thanked me for accepting, and for informing her of the Laundry project, below. She must have become ill soon after, as I heard no more from her before she died.

I was recently delighted to learn that the EU robot project that she had inspired and coordinated had made impressive progress, as reported on the project web site:

    Clothes Perception and Manipulation (CloPeMa)
    Select the 'Home' tab for background.
    Select the 'News' tab for videos.

Some of the questions arising from the "Laundry" project (discussed below) are very relevant to the CloPeMa project.


There are many projects aiming to give machines competences involving perceiving and acting on objects in the environment, or exploration of an environment to develop some sort of map of the terrain, or part of a building.

Insofar as these projects aim to contribute to our scientific understanding, as opposed to being wholly justified by their practical usefulness (like the note-counting machines in automatic cash dispensers, and many robots tailored to accurate and reliable performance of a very specific task on a factory production line), there is a requirement for the designs of the machines to have some well-defined kind of generality, so that the researchers can explain in a principled way what the machines can and cannot do and why, and preferably also show how this achievement contributes towards broader and deeper longer term goals.

There are different ways of characterising the required generality. A common way is to assemble a large and varied collection of test cases from some corpus, e.g. pictures or sentences on the internet, or a set of behaviours generated by a sizeable sample of naive subjects in a laboratory experiment.

I have always found those ways of specifying the scope of a theory unsatisfactory: there should be a more principled way than merely collecting examples. It should be possible to explain what those examples have in common and why it is of interest to find a general way of handling them, and why other things should not be included in the scope of the theory or model, nor used for testing it. For example, it is fairly easy to give good reasons for not expecting Newton's theory of gravitational attraction and his mechanics to provide a satisfactory explanation of the pattern of motion of a leaf falling from a tree (though this may not have been easy before Newton's time: a good theory may teach us to characterise its domain of applicability).

I feel that a very high proportion of research being done in AI and Robotics fails to meet this criterion -- even if the research is interesting and potentially valuable for other reasons (including being a step on the way to producing a theory or model that does meet the criterion).

That leaves the problem of deciding how to select collections of cases that have the right sort of generality. I have many examples in things I have been writing about child or animal development, or challenges for AI -- e.g. proposing the polyflap domain as a potentially useful robotic challenge:

That domain is generative in the sense that there is a (fairly) precisely specified way of producing more and more complex and varied examples that could be used as test cases.

In this document I'll attempt to characterise a domain that is generative in the sense that its examples can be decomposed into features that can be combined in systematically varied ways. I have not yet tried to produce a precise formal specification of that generality, but I hope the examples will suffice for now, making use of the powerful human ability to observe the structure common to a collection of cases, currently lacking in computers. Later we need to characterise the domain, and criteria for success, or at least progress, more precisely.

So far, my characterisation of the domain, below, is far from complete. I'll investigate the possibility of addressing that later. For now I want to indicate how a domain of processes can be generated by systematically varying the geometric configurations, materials used, and operations or forces applied to different parts of objects.

One kind of generality that is missing from the examples below is the recursive use of abilities to rearrange physical matter in order to achieve a new state in which possibilities and constraints are altered so as to allow (or help, or prevent) certain additional rearrangements. See this discussion of varieties of deliberation for more on the requirements for such competences:

Some of the proposals below are similar to points made by Annette Karmiloff-Smith in Beyond Modularity (1992), about the transitions that can occur in a learner after "behavioural mastery" has been achieved. A very personal (and incomplete) tutorial on some of her ideas is available here:

How to Develop Scenarios for a Grasping Robot (and others)

A general principle for designing scenarios so as to avoid dead ends is that every particular kind of process in the scenario is a special case of a well defined class of processes. Finding out what that class should be is a non-trivial research problem. (It is probably connected with what goes on during infant and toddler learning: discovering good ways to generalise beyond examples already learnt -- by developing a generative theory, where possible.)

Some early work in vision attempted to meet this criterion by considering types of image that could be generated by a grammar (e.g. a web grammar) and then specifying an algorithm or collection of algorithms able to cope with all instances of the grammar, e.g. by producing a 3-D interpretation.

Instead of grammars, some researchers systematically studied classes of picture element and ways of combining them to form larger pictures, and derived general modes of interpretation of such pictures (e.g. the Huffman-Clowes line-labelling algorithm for interpreting 2-D pictures of trihedral polyhedra, later expanded by Waltz to include a wider range of scenes and pictures). More recent work aimed at extending that generality is
Can Machines Interpret Line Drawings?
P. A. C. Varley, R. R. Martin and H. Suzuki

The competences involved in the particular scenario should be particular cases of general competences. The combinations of competences in the scenario should be special cases of modes of composition of competences, in the sense discussed in

So even if a practical project has narrowly specified goals, if it is to contribute to scientific understanding it should have the sort of generality described here, even if not all of the generality is required for practical goals. Not all practical projects need have scientific goals. Many don't.

However, if a project is to produce results that are robust and extendable, then it is important for the tests and designs chosen in the scenarios to include cases that are not required for the specific practical goals. For example, some situations can arise that are undesirable, but the fact that they are not desired does not mean that they should not be understood and dealt with if they arise. This is a way to avoid premature over-specialisation, which can easily hold up a field like AI (viewed as science rather than engineering), including robotics.

This principle can be applied to:

kinds of material,
kinds of relationship,
kinds of causal influence,
kinds of shape,
kinds of action,
kinds of learning,
kinds of reasoning,

addressed in the project. I have previously referred to this as the need for models not just to scale up (e.g. cope with larger data-sets) but also to scale out (i.e. cope with more varied types of challenge, and in combination with different parts of a whole architecture, when required).

[I think this requirement to "scale out" is related to what John McCarthy called "Elaboration tolerance", though he presented that as a criterion for adequacy of a formalism rather than a mechanism. I recently found that some computing researchers use the same labels for a different distinction, also sometimes contrasting "scaling vertically" with "scaling horizontally". I suspect there is some loose connection with the contrast I am making.]

Intelligent robots need not only to do things, but also to know what they are doing. Any type of action or process or state of affairs that an agent needs to be able to produce should also be something the agent can perceive, think about, reason about, etc., even when the process or state of affairs is not part of or a product of one of its own current or recent actions. The sort of "offline" reasoning that is applied to actions of others, or to observed physical processes, can also be applied prospectively or retrospectively to one's own actions (e.g. why did Y happen when I did X?).

Some illustrations regarding uses of "offline intelligence" rather than "online intelligence" can be found in these examples of reasoning about processes involving triangles:

I think new-born human infants lack that kind of intelligence. Offline intelligence seems to develop through extensions to the architecture and to the forms of representation and types of mechanism required. The ability is never fully developed even in adult humans: they can go on learning indefinitely as they acquire new domains of expertise.

This apparently subsumes what Karmiloff-Smith calls "Representational Redescription" (in her Beyond Modularity) as I've discussed in
(Work in progress.)

An open question is whether such offline intelligence exists in non-human animals: the ability of individuals to deal successfully with novel problems, or to produce novel solutions to old problems, without engaging in trial and error, may be evidence (Betty the hook-making New Caledonian crow studied in Oxford in 2001--2004 seems to be a clear example). Note that the question whether animals can use offline intelligence in using matter to manipulate matter is deeper and more precise than asking whether they can use tools, or make tools.

The ontology needed for perception, planning, reasoning, action-control

Actions involving manipulation include not only processes involving changing spatial relationships within and between objects, but also causal interactions of various kinds. Causation is not perceived in the same way as shape, position, velocity, shape-change, colour, etc. Humans, some animals, and future intelligent robots need both Humean (associative) and Kantian (structure-based) conceptions of causation, as discussed here (with Jackie Chappell):

So projects aimed at producing robots with (adult) human-like intelligence will have to specify what it is for a robot to understand, and be able to reason about, different sorts of causation. (That's very hard. Even good philosophers find it very difficult.)

That's not an exhaustive list, merely illustrative.

Here are some example test cases for a robot that is to be able to manipulate non-rigid materials. Each case can be varied either by changing the material, or by changing the initial situation, or by changing the final state or by varying the process of going from initial to final state.

For each action type that the robot can perform it should also be able to perceive that action, done by itself, done by others, perceived from different viewpoints. Examples follow:

Agent sees a square of some material on a table with a small portion sticking out over the edge -- so that it can be grasped and moved by the robot, or someone else.

Variations: the material can be cloth (handkerchief), towelling, tissue paper, cardboard, writing paper, clingfilm or other plastic, tinfoil, a slice of bread, pastry, dough, flattened plasticine, ... (Some of these may be very difficult, and best postponed. At what ages can young children deal with them?)

Variations: the shape can be rectangular, with different ratios of long and short side, it can be triangular, or some other polygonal shape, or a curved shape.

Variations: the orientation of the shape with respect to the edge of the table can vary (so that for the same shape the bit sticking out can have different appearances and grasping requirements, and the same action after grasping can have different consequences).

Variations: the motion after grasping (with a firm grasp that allows no slippage between the fingers) can be horizontal and unidirectional for a short distance. The motion can continue indefinitely. The motion of the grasped edge can oscillate at various speeds.

The motion can be vertical (lifting the grasped edge), by varying amounts, at varying speeds, with the orientation of the grasped bit either kept horizontal or varied, e.g. so as to avoid a sharp bend beyond the grasp area. It can be unidirectional (just lifting) or lifting and lowering.

The motion can be pulling: either pulling horizontally away from the edge of the table or pulling downwards below the edge of the table, and various directions of pull in between.

The motion can be pushing: pushing the grasped edge along the surface of the table orthogonally to the edge of the table, and further varied by pushing in different directions.

The motion can be folding: lifting the grasped edge and moving it over the table then down onto another part of the object. Variations include trajectory height, the orientation of the plane of the trajectory relative to the edge of the table, where the trajectory ends, and how the orientation of grasp varies during the motion.

The folding motion may be followed by pressing down on parts of the material along the fold and in other places.

Other variations can involve holding down portions of the material while the grasped portion is moved.

(It would be good to have photographs or videos illustrating all the above variations.)
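
The way the variations above combine can be made concrete with a small enumeration. What follows is only a minimal sketch, not a specification of the domain: the feature lists are short illustrative subsets of the variations listed above, and the names MATERIALS, SHAPES, MOTIONS and generate_scenarios are assumptions introduced here for illustration.

```python
from itertools import product

# Illustrative subsets of the variations listed above (assumed names).
MATERIALS = ["cloth", "towelling", "tissue paper", "cardboard", "tinfoil", "dough"]
SHAPES = ["square", "rectangle", "triangle", "curved"]
MOTIONS = ["slide", "lift", "pull", "push", "fold"]


def generate_scenarios(materials=MATERIALS, shapes=SHAPES, motions=MOTIONS):
    """Yield one scenario description per combination of feature values."""
    for material, shape, motion in product(materials, shapes, motions):
        yield {"material": material, "shape": shape, "motion": motion}


scenarios = list(generate_scenarios())
print(len(scenarios))  # 6 materials * 4 shapes * 5 motions = 120
print(scenarios[0])
```

Adding a further dimension of variation (for example orientation relative to the table edge) multiplies the number of cases, which is one sense in which such a domain is generative. The enumeration says nothing about which combinations are hard or interesting; that still has to come from analysis of the domain.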

After a learning process many different tests are possible, with different materials, different shapes, different kinds of motion.

Can the agent (at least roughly) predict what changes will occur if a pair of fingers (one above the other) grasps the overhanging portion and lifts it straight up until there's no more contact with the table, without altering the orientation of the grasping point?

Can the robot predict what will happen if, instead of moving up, the fingers move horizontally, parallel to the edge of the table, for a metre or more? What sorts of obstacles could obstruct or modify the motion?

Can the robot predict what will happen if the fingers gripping the corner rotate until that corner is pointing upwards, and then they move to where the opposite corner is? Two cases: (a) horizontal motion (b) motion in an arc, going up then down.

Added 4 Feb 2015: Online and offline intelligence

One of the important requirements is the ability not merely to act and produce desired changes (online intelligence) but also to think about and reason about what is or is not possible, and why (offline intelligence).

Some more examples involving clothing are here:
Shirt Mathematics
Illustrating topological and semi-metrical reasoning in everyday life.

Varieties of imprecision and uncertainty

What forms should the predictions take? I cannot predict precise changes, but I can talk about how relationships will change during the predicted motion. I can make the predictions at various levels of abstraction, with different kinds of certainty. E.g. if the object moved is made of cloth and the grasped edge is lifted a distance that is more than the maximum diameter of the cloth then the cloth will eventually no longer be in contact with the table. I don't need to know what the maximum diameter is for that prediction to hold.

I can point to a height that I know will be sufficient to raise the cloth so that it is no longer in contact.
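
For comparison, here is a numerical rendering of the conditional about lifting by more than the maximum diameter. It is a sketch under stated assumptions, not a claim about how a human or robot reasoner works: the cloth is assumed flat, convex and inextensible, the table flat, and the lift height measured from the table surface. The function names are hypothetical. The point made above is precisely that the qualitative prediction holds without these numbers; the code only shows that the rule is a sufficient condition, not a necessary one.

```python
from itertools import combinations
from math import dist


def max_diameter(vertices):
    """Largest distance between any two vertices of a convex polygon outline."""
    return max(dist(p, q) for p, q in combinations(vertices, 2))


def surely_clear_of_table(vertices, lift_height):
    """True if lifting the grasped point this high guarantees loss of contact.

    Sufficient but not necessary: False means 'no guarantee from this rule',
    not 'the cloth is still touching the table'.
    """
    return lift_height > max_diameter(vertices)


# A 30 cm square handkerchief, coordinates in metres (illustrative values).
square = [(0.0, 0.0), (0.3, 0.0), (0.3, 0.3), (0.0, 0.3)]
print(surely_clear_of_table(square, 0.5))  # True: 0.5 exceeds the diagonal
print(surely_clear_of_table(square, 0.2))  # False: the rule gives no guarantee
```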

I can make predictions about how the shape will change during the motions, using notions like folding, angle, curvature, increasing or decreasing curvature, flattening, etc. without being able to specify numerical values for those processes or their results.

Some of the changes involve topological relations (e.g. loss of contact) and in that sense are described precisely. Some of the changes can be given bounds that are definite, though not precise upper or lower bounds. E.g. I know that during vertical movement of the corner of the cloth the cloth will lose contact with the table before the grasping point has reached this height (indicated by pointing) even though I don't know the exact height at which it will lose contact. I can also say that there will still be contact when the grasped point has reached this height (pointing at a lower height).
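
The two pointing gestures amount to a pair of definite but imprecise bounds, with an honest "don't know" region between them. A minimal sketch of that three-valued prediction follows; the function name, parameter names and example heights are assumptions made for illustration, and nothing here is meant as a model of the underlying mechanism.

```python
def contact_verdict(grasp_height, surely_contact_below, surely_clear_above):
    """Classify a grasp height using two definite bounds.

    surely_contact_below: a height at which contact is known to remain.
    surely_clear_above:   a height by which contact is known to be lost.
    Heights between the two bounds are left undecided.
    """
    if surely_contact_below > surely_clear_above:
        raise ValueError("the lower bound must not exceed the upper bound")
    if grasp_height <= surely_contact_below:
        return "contact"
    if grasp_height >= surely_clear_above:
        return "no contact"
    return "unknown"


# Illustrative heights in metres.
for height in (0.1, 0.3, 0.6):
    print(height, contact_verdict(height, 0.15, 0.5))
```

The exact height at which contact is lost never appears; only the ordering of the grasp height relative to the two bounds is used.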

[Added 1 Jan 2013] There's an entertaining, but deep, video by Vi Hart illustrating some of the facts about folding and production of angles that could first be discovered empirically (using online intelligence) then later understood mathematically (using offline intelligence):

These requirements merely scratch the surface of what is required in the specification for a human-like robot.

There are lots of deep and difficult implications regarding

the ontologies required
the forms of representation
the forms of reasoning
the implementation mechanisms
the architectural decomposition of functions
the processes of development
the processes of learning
-- empirically, by finding out what happens when
-- non-empirically by reasoning about what must be the case,
which is presumably what first led to the development of Euclidean geometry
(To be extended...)

This paper follows on (in various directions) from these:
Orthogonal Recombinable Competences Acquired by Altricial Species
(Blankets, string, and plywood)

Introduction to the 'Polyflap' domain for robot manipulation.
Discussion note on the polyflap domain (to be explored by an 'altricial' robot)
Also here:

Requirements for animals and robots to develop ontologies for "kinds of stuff"

Requirements for predicting affordance changes

Presentation on seeing processes

Comments on "The Emulating Interview... with Rick Grush"

And various discussions on requirements for abilities to perceive, understand and reason about spatial structures, and processes involving changes of spatial structures, since about 1971.

These ideas relate closely to Maria Petrou's entertaining discussion of robot ironing. See:

Note added 1 Jan 2013:
     Alas, Maria died in October 2012

See also the impressive laundry-manipulating robot at UC-Berkeley

The video includes an intriguing but unexplained comment about the robot "simulating everything" before acting. There are very different ways of simulating:

(a) simulations that provide very precise predictions about a single configuration and a single trajectory,

(b) simulations that allow reasoning about classes of processes, as illustrated below.

The latter is required for animal intelligence involving perception of affordances of various kinds -- including proto-affordances, action affordances for the robot, vicarious affordances (for someone else), epistemic affordances, deliberative affordances, communicative affordances...

Simulations of type (a) can use variants of "game-engine" technology. They can be very useful for online control of actions using feed-forward mechanisms, e.g. to predict required adjustments to the current trajectory, etc.

Simulations of type (b) have quite different functionality and can be used in answering questions about what would happen if, what might have caused something to happen, what options would be available if some action were performed, etc. Type (b) simulations require something very different from the precise modelling done in game-engines. For example, the sort of reasoning you do when working out how to get an armchair through a door that's too narrow for it to be pushed through upright involves representing types of sub-process and types of intermediate situations, rather than the precise details required for controlling motion when the action is actually being performed. Some examples of perception of possibilities for processes to occur, and of constraints on such processes (the roots of ancient mathematical discoveries?), are discussed in:

You can work out combinations of types of translations and rotations without having the kind of representational precision required to generate a video of the process.
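
As a very crude illustration of working with types of sub-process and types of intermediate situation, the armchair example can be caricatured as a search over a handful of qualitative states. This is a toy sketch only: the two orientations, the three positions, the two constraints and all names are assumptions invented here, and a discrete search of this kind is far simpler than the forms of representation this document is asking about. What it shares with type (b) simulation is that no coordinates, angles or trajectories appear anywhere.

```python
from collections import deque

POSITIONS = ["inside", "doorway", "outside"]  # assumed qualitative positions


def successors(state):
    """Yield (process type, resulting state) pairs allowed from a state."""
    orientation, position = state
    # Assumed constraint: there is no room to rotate the chair in the doorway.
    if position != "doorway":
        flipped = "on_side" if orientation == "upright" else "upright"
        yield f"rotate to {flipped}", (flipped, position)
    # Assumed constraint: only a chair on its side fits through the opening.
    if orientation == "on_side":
        i = POSITIONS.index(position)
        for j in (i - 1, i + 1):
            if 0 <= j < len(POSITIONS):
                yield f"translate to {POSITIONS[j]}", ("on_side", POSITIONS[j])


def plan(start, goal):
    """Breadth-first search for a shortest sequence of process types."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, steps = frontier.popleft()
        if state == goal:
            return steps
        for label, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, steps + [label]))
    return None  # the goal cannot be reached under the stated constraints


print(plan(("upright", "inside"), ("upright", "outside")))
```

The same representation also answers a question about impossibility: asking for the goal ("upright", "doorway") returns None, because no allowed type of process leads there.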

This document is about the types of representation of structure and process required for competences of type (b). But it is only a small start. (I have made other starts in related directions in the other documents referred to.)

[Compare confusions about dorsal and ventral visual streams.[REF...]]

Maintained by Aaron Sloman
School of Computer Science
The University of Birmingham