9 Dec 2005
Q1. How can a robot control the movement of an object (by pushing and turning) so as to gather information about its appearance from different views?
Q2. How can a robot represent the motions of objects through space, and how can it learn these representations? In particular, can we learn that particular kinds of motion are possible because of certain visual features of the object?
Q3. How can a robot represent human actions so as to recognise sequences of actions, classify the subactions, and predict the remainder of a sequence from the beginning?
Q4. How can a robot learn in the tasks specified under Q1, Q2 and Q3 while doing so collaboratively with a human? Specifically, how can learning on the above tasks be guided by a human using language? Can the robot learn the description of a new action, e.g. "pick up", by finding the spatial changes that correspond to the human notion — e.g. that picking up always involves lifting the object off the supporting surface, while pushing involves a force through the object along the surface? How can the representations used in the communication system be linked to the representations used to recognise the actions from visual information?
Q5. How can we decide which information processing actions to carry out in the architecture?
Since I want to focus on vision requirements in this email, I'll stick to questions Q1, Q2 and Q3.
Bernt mentioned four aspects of the scenario, which I'll refer to here as B1-B4.
I'm going to assume that my Q1 is one of many possible things you can group under Bernt's point B3. My Q2 is one of the requirements for accomplishing Bernt's point B4, and the problem is also related to Bernt's B1. Let's deal with each in turn.
To focus things further, we at Birmingham have talked about the set of objects we wish to use. For some tasks (e.g. visual learning) I suggest we will need "everyday" objects such as household items (we agreed on a tea-set several months ago). These are fine for learning about the visual appearance of objects so that the robot can identify them, and could be the basis for tackling Q1/B3.

At dinner in Saarbrucken, Bernt and I briefly discussed how a robot can actively acquire a model that can be used to recognise an object from many viewpoints. Humans are rather adept at acquiring visual models of objects actively by reorienting them in their hands. We cannot be quite that sophisticated in our robot's manipulation of objects, but we could either push an object to rotate or translate it, or grasp it, turn it and put it down again (or even turn it over). These manipulations would achieve a similar aim. How should the robot move the object so as to acquire enough views to recognise it reliably in future? To answer this we need a notion of the information gain the robot makes as it moves around the object, or as it moves the object. I would really like to work on this question with TUD and UOL. It could build nicely on the work we have already done, while allowing us to do some simple pushing or grasping of the objects to reorient them and obtain new views. So that is my best answer to Bernt's question about "which aspects should be addressed, when and how and in cooperation with what other aspect" (I agree that is one of our key questions right now).

In your email, Bernt, you asked why we need to be able to recognise the robot's hand. My answer is simply that we need it for two purposes in this task: first, to know where the gripper is so that we can grasp the objects; and second, to be able to subtract it from the image we use for learning (such as removing our fingers if we are learning about the appearance of an object in the hand).
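To make the notion of information gain concrete, here is a minimal sketch (in Python; the object and view names are entirely hypothetical) of choosing the next reorientation as the one giving the greatest expected reduction in uncertainty about the object's identity. It assumes a discrete set of views and a known observation model — which a real system would itself have to learn.

```python
import math
from collections import defaultdict

def entropy(belief):
    """Shannon entropy (bits) of a belief over object hypotheses."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)

def expected_info_gain(belief, view, likelihood):
    """Expected entropy reduction from observing the object at `view`.

    likelihood[obj][view] maps each possible observation to
    P(observation | object, view).
    """
    h_prior = entropy(belief)
    p_obs = defaultdict(float)  # P(obs | view), marginalised over objects
    for obj, p in belief.items():
        for obs, p_o in likelihood[obj][view].items():
            p_obs[obs] += p_o * p
    h_post = 0.0
    for obs, p in p_obs.items():
        if p > 0:
            # Bayes: the posterior belief if this observation were made
            post = {o: likelihood[o][view].get(obs, 0.0) * belief[o] / p
                    for o in belief}
            h_post += p * entropy(post)
    return h_prior - h_post

def next_best_view(belief, views, likelihood):
    """Choose the reorientation whose view maximises expected information gain."""
    return max(views, key=lambda v: expected_info_gain(belief, v, likelihood))

# Hypothetical example: two tea-set items that look identical from the
# front ("round" silhouette) but differ from the side.
belief = {"cup": 0.5, "teapot": 0.5}
likelihood = {
    "cup":    {"front": {"round": 1.0}, "side": {"handle": 1.0}},
    "teapot": {"front": {"round": 1.0}, "side": {"spout": 1.0}},
}
```

In this toy case the front view yields zero expected gain (both hypotheses predict the same silhouette), while the side view yields a full bit, so the robot should reorient the object to see it from the side.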
For the question of object manipulation for affordances, we have carried out a year-long analysis (mainly by Marek) of the sorts of objects we think it is worth working with, and the sorts of interactions we can have with them. Essentially the set of objects needs to be very restricted: we suggest simple polyhedra, some objects with curved surfaces or hooks/handles, and finally a stick (to manipulate other objects). We also believe that manipulation must be construed broadly.
Without going into details, we have discussed whether, during the next twelve months, to focus on discovering affordances of objects through pushing, pulling, and turning on the plane. We can add upright surfaces for interactions such as sliding and stopping. In the first instance my Q2 is concerned with how we can predict what will happen when a force is applied to an object using the manipulator. What representation should we use for the motion, and for the resulting change in the spatial relationships between the objects? How is this a function of the force sequence applied and of the points at which the surfaces interact? How can we learn such a function?
To give an idea of a sample problem, imagine a block up against a wall (see figure). If I poke the block in different ways, different motions will be induced. A good question for vision is how I can learn, from visual and haptic information, what those motions are in a general way, and how I should represent them. We definitely want to go beyond this within twelve months, but I will stop here to save space.
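As a crude illustration of what "learning such a function" might mean, here is a sketch (all names and numbers hypothetical) of a nearest-neighbour model: it records each push — described by the contact point's offset from the object's centre line and the push direction — together with the motion observed by vision, and predicts the outcome of a new push from the most similar past one.

```python
import math

class PushOutcomeModel:
    """Nearest-neighbour predictor for the motion a push induces.

    A push is summarised by the contact point's offset from the object's
    centre line (metres) and the push direction (radians); the outcome is
    the visually tracked (translation, rotation) of the object. This is a
    placeholder for whatever learned function the project settles on.
    """

    def __init__(self):
        self.experiences = []  # [((offset, direction), (translation, rotation))]

    def record(self, offset, direction, translation, rotation):
        """Store one executed push together with its observed outcome."""
        self.experiences.append(((offset, direction), (translation, rotation)))

    def predict(self, offset, direction):
        """Return the outcome of the most similar past push."""
        def dist(push):
            off, ang = push
            # wrap the angular difference into (-pi, pi]
            d_ang = math.atan2(math.sin(ang - direction), math.cos(ang - direction))
            return (off - offset) ** 2 + d_ang ** 2
        nearest = min(self.experiences, key=lambda e: dist(e[0]))
        return nearest[1]

# Hypothetical experiences in the spirit of the block example:
model = PushOutcomeModel()
model.record(0.00, 0.0, 0.10, 0.0)   # push through the centre: pure translation
model.record(0.05, 0.0, 0.04, 0.6)   # push off-centre: the block mostly rotates
```

A new, nearly centred push is then predicted to translate the block rather than rotate it. Nearest neighbour is of course only a stand-in; the interesting question is what representation of the motion makes such generalisation work.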
This is really a problem that we need a vision group in the consortium to be willing to work on if we are to tackle it successfully. Is it something anyone wants to work on with us? At the moment I'm not sure whether it falls into the suggestions I've seen so far (but I may have misunderstood their scope).
Of course we also want to work with others on linking with language, planning, etc., but we have tough decisions to make about where we put our limited resources.