Metrics and Targets for a Grand Challenge Project
Aiming to produce a child-like robot

Aaron Sloman

These notes originally arose out of discussions on the DARPA Cognitive Systems project regarding ways of assessing intermediate achievements if the aim is to make significant progress towards building a child-like robot. The key idea presented here is that a very large number of carefully designed scenarios and meta-scenarios (partly inspired by research in developmental psychology) can provide a basis for milestones and metrics. Although the discussion is based on the example of a robotics project the ideas are far more general and could be applied to a wide range of AI/Cognitive Science projects.

The methodology proposed here is relevant to one of the proposals to the UK Grand Challenge initiative, namely GC-5 'Architecture of Brain and Mind', which aims to advance the study of Brain and Mind and their relationship, in part by designing a succession of increasingly human-like robots with a demanding subset of the capabilities of a young child aged somewhere between 2 and 5 years.

In this context, the aim is primarily science (understanding the design problems and possible solutions, and perhaps contributing to our understanding of how natural intelligence works) rather than engineering (making useful devices to perform specific practical tasks) -- the practical, engineering goals listed below are chosen because working on them is a good way to acquire a better understanding of the scientific problems.

Some of the ideas have been adopted in the EC-funded multi-site CoSy project due to start in September 2004 and also in Push Singh's Roboverse project within the MIT Media Lab's Commonsense Computing project.


It is not being claimed that we can actually build a convincing replica of a human child within the next few decades, nor even that it would be desirable to do so. Rather, if anyone wishes to build a complete, integrated, intelligent system, with many of the mental capabilities of a human being, there are scientific arguments for first aiming towards something with many of the abilities of a two to five year old child (this age range is illustrative rather than definitive) rather than, for instance, a new-born infant, or something like an adult.

Apart from any engineering goals that might be served by the knowledge so gained, this sort of project would enhance our understanding of what human minds are and how they work, by giving us new insights into young human minds with very impressive capabilities and great potential for many different kinds of further development, producing members of very diverse cultures, with very different languages, educational achievements, artistic, athletic, social and intellectual skills.

Moreover, instead of trying to understand what is going on within new-born infants by observing them, which is the preferred mode of empirical scientists, we may learn more by finding out what infants develop into and then attempting to reason backwards towards earlier stages, in which the mechanisms function to produce that kind of development. For instance if we can learn what sort of information-processing architecture is capable of combining all the features and capabilities of a typical 5-year old, then that tells us that whatever is going on in a new-born infant should be capable of producing that sort of architecture.

In contrast it is often assumed that a new-born human infant has nothing more than a collection of very general learning mechanisms coupled with biological needs and drives, and that all future development can be explained by the learning within such mechanisms. This ignores the likelihood that millions of years of evolution may have produced something more specific to the requirements of an infant capable of developing into a typical adult human.

Reasons for modelling a child rather than an infant or adult
The main reasons why we can learn more by trying to model a child that can talk, manipulate objects, ask for or give help or advice, etc. rather than either a neonate or expert adult are:

Modelling the capabilities of a two to five year old may be a crucial intermediate stage required for developing something like an integrated adult. Apart from any engineering goals, understanding the mind of a child may be a crucial requirement for understanding those aspects of mind and brain that are products of both biological evolution and development in a natural environment, and which underlie culturally very diverse adult minds: a child aged four or five years is still capable of absorbing and being integrated into almost any human culture.

This implies possession of an architecture, mechanisms, forms of representations, learning capabilities, and architectural bootstrapping mechanisms that have very great generality.

Features of a 2-5 year old mind
In particular, major features of the mind of such a child include

John McCarthy wrote, in 'The Well-designed Child':
"The innate mental structure that equips a child to interact successfully with the world includes more than universal grammar. The world itself has structures, and nature has evolved brains with ways of recognizing them and representing information about them. For example, objects continue to exist when not being perceived, and children (and dogs) are very likely designed to interpret sensory inputs in terms of such persistent objects. Moreover, objects usually move continuously, passing through intermediate points, and perceiving motion that way may also be innate. What a child learns about the world is based on its innate mental structure."
We are not trying to pre-judge what is and is not innate, but proposing a method for investigating the issue by investigating requirements for a 'Well-designed child', architectures and mechanisms that could satisfy those requirements, and then, later on, the mixture of innateness and learning that could produce those architectures and mechanisms.

The rest of this document is concerned with a way of specifying the requirements in terms of collections of scenarios and meta-scenarios.

"Quantitative" vs "precise" targets
At the DARPA cognitive systems meeting in November 2002, the DARPA director, Tony Tether, asked for metrics for progress to be specified. Normally "metric" implies use of numbers (e.g. something quantitative -- a measurement). However that is not always relevant for a complex scientific or engineering project which has many aspects that are better assessed using structural descriptions of what has and has not been achieved.

Nevertheless, it is important to be able to specify achievements (or targets) with some precision, so that assessment of progress can be objective, and in such a way that different achievements can be at least partially ordered in terms of difficulty and depth -- a requirement to support claims that progress is being made! This can be done at different levels of abstraction, suitable for different purposes, such as

(a) driving the project,
(b) informing other experts in the field of progress,
(c) informing scientists in other disciplines of the aims and achievements,
(d) reporting to funding agencies or the public on aims and progress,
(e) identifying possible usable products and spin-offs from the work.

I have started with context (a), driving the project, and attempted to produce some sample high level specifications as an indication of some of the work to be done in the course of the project in extending the specifications. I also indicate some ways in which these descriptions might be transformed into numerical measures of progress, even though any such measure will necessarily be misleading (like any one-dimensional projection of a complex evaluation).

A very important task, on which I'll try to add notes later, is (c): informing scientists from other disciplines about our goals and achievements. That will require a different sort of description. (It may be difficult to do because of the problem of "ontological blindness".)

I believe that the methodology of using carefully structured sequences of scenarios as targets to shape a project can also help with task (e) identifying possible usable products and spin-offs from the work. However this is not a topic discussed in this paper.

Sub-domains of competence in the robot-child
The generation of scenarios should not simply be done by groups of people doing brainstorming, though that can play an important role. Doing it well requires careful analysis of different varieties of competence in the target system: 'Sub-domains of competence'. For the different sub-domains different test scenarios need to be constructed, and for testing the integration of different kinds of competence in the whole system, test scenarios need to be devised that require different sub-systems to interact fruitfully, e.g. either by one helping another, or by one aborting another, or through detection of conflicts of goals or resources, and so on. (Further work needs to be done identifying different varieties of interaction that can generate different sorts of test scenarios.)
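
One way to make the search for integration scenarios less ad hoc is to enumerate combinations of sub-domains and interaction types and treat each combination as a prompt for whoever writes the scenario. The following is only a minimal sketch: the sub-domain names are placeholders invented for illustration (not a proposed taxonomy), and the interaction types are just the three mentioned above.

```python
from itertools import permutations

# Hypothetical sub-domain names, chosen only for illustration.
SUB_DOMAINS = ["perception", "manipulation", "planning", "communication"]

# Interaction types mentioned in the text above.
DIRECTED = ["helps", "aborts"]  # one sub-system acts on another
SYMMETRIC = ["conflicts over goals or resources with"]


def integration_prompts(domains):
    """Yield one scenario prompt per (sub-domain pair, interaction type)."""
    for a, b in permutations(domains, 2):
        for kind in DIRECTED:
            yield f"{a} {kind} {b}"
        if a < b:  # emit each symmetric pairing once only
            for kind in SYMMETRIC:
                yield f"{a} {kind} {b}"


prompts = list(integration_prompts(SUB_DOMAINS))
```

Each generated prompt (for example "planning aborts manipulation") still has to be turned into a concrete episode by hand; the enumeration only helps to show which combinations have not yet been considered.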

A typical human child of the sort we are aiming to model in a robot, e.g. one aged between two and five years, will have many different sorts of competence, some of which can be displayed directly in "everyday" tasks, others of which require probing of kinds that psychologists and clinicians use (e.g. investigations of "executive functions") and some of which cannot be directly observed or measured but can be inferred from a deep explanatory theory which can be assessed against rival theories on grounds of generality, simplicity, consistency with known facts and accepted theories of many kinds, explanatory power, precision, etc.

Likewise, if we discovered a robot made by aliens and sent to interact with us, we would have to use different modes of description and hypothesis construction and testing to understand what it can do and how it does it.

These notes address only relatively shallow observations of behavioural competence. They are shallow because infinitely many different sorts of mechanisms could explain them. Nevertheless at present the competences described go well beyond the state of the art in AI/robotics and the state of explanation in cognitive science and neuroscience, so coming up with any working model satisfying these descriptions will be a considerable achievement. If different models or explanatory theories come up they can be tested and compared, employing the usual scientific and engineering criteria for deciding which of two theories is better, e.g. as described by Imre Lakatos, summarised here and here.

Scenarios and meta-scenarios
The key suggestion is that there are very many domains of competence of a typical two to five year old human child. If we can come up with good characterisations of ways in which those competences can manifest themselves, including ways in which they interact, then we can use those to specify (partial) design requirements for an explanatory architecture. We propose to characterise such competences in scenario descriptions, i.e. mini-film-scripts for episodes in which the child or robot displays the competence. A particular class of competences includes the child's (or robot's) ability to understand what it is doing or thinking. For many of the scenarios we therefore require corresponding meta scenarios in which the child (or robot) displays some (not necessarily total) understanding of what it did in the scenario. This could include explaining decisions, answering hypothetical questions about what would have happened, and also talking about why it did not do something. The ability to produce a large set of such scenarios in a non-ad-hoc way, e.g. without each one being explicitly programmed or designed or trained into the robot will impose very strong constraints on its design.
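
To make the relationship between scenarios and meta-scenarios concrete, here is a minimal sketch of how a project might record them. All class, field and label names are hypothetical; the three kinds of meta-scenario are simply the ones listed above (explaining decisions, answering hypothetical questions, saying why something was not done).

```python
from dataclasses import dataclass, field

# Kinds of meta-scenario named in the text; the labels are illustrative.
REQUIRED_META = ("explain decision", "hypothetical", "why not")


@dataclass
class MetaScenario:
    """An episode in which the agent shows understanding of an earlier one."""
    kind: str   # one of REQUIRED_META
    probe: str  # the question put to the child or robot


@dataclass
class Scenario:
    """A mini-film-script displaying one or more competences."""
    name: str
    script: str
    competences: set = field(default_factory=set)
    meta: list = field(default_factory=list)

    def uncovered_meta_kinds(self, required=REQUIRED_META):
        """Return the kinds of meta-scenario still missing for this scenario."""
        return set(required) - {m.kind for m in self.meta}


fetch = Scenario(
    name="fetch box",
    script="Child climbs ladder, picks up box, climbs down.",
    competences={"climbing", "grasping"},
    meta=[MetaScenario("explain decision", "Why are you climbing the ladder?")],
)
missing = fetch.uncovered_meta_kinds()
```

A record like this says nothing about how the robot achieves the behaviour; it only helps to keep track of which scenarios still lack their corresponding meta-scenarios.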

Of course, additional requirements can come from other directions, e.g. physical constraints on possible information-processing mechanisms, facts already known about brain mechanisms and functions in humans and other animals, facts about evolution and the sorts of designs it can produce, and the prior training history of the individual.

A complication is that some kinds of competence cannot be demonstrated by a robot or child in isolation, but only in combination with others. For instance the ability to communicate requires some context which exercises other abilities, e.g. the ability to perceive, to plan, to act in the world, etc.

Likewise demonstrating the ability to understand spatial structure and geometric relations and to reason about them requires a context in which that general ability engages with, or is instantiated with, details of the context: its structure, the goals and tasks, what other agents do, etc.

Negative Scenarios
Positive descriptions of what a system does can mislead people into over-interpreting and expecting more than has been specified. For that reason it is useful in specifying scenarios describing what is to be done in a project, to add some possible related scenarios that will not be within the planned scope of the project. To some extent the restrictions can be specified in an abstract way without using scenario-level detail. However it will often be clearer to add notes to a scenario description explicitly excluding some possible extensions that some readers might expect to be included. This can also help to focus the attention of designers on the reasons why they are excluded, which will generally be technical, but in some cases may be ethical, or may arise out of concerns for the market place: the investment costs would not be repaid or the extensions might make the system too expensive, for example.

An example: ladders, physics, causation, communication
For the sake of illustration, consider the following sequence of descriptions of a robot-child. Each description illustrates some combination of competences. The fact that a child may satisfy some of the descriptions and not others indicates that there may be some modularity in the collection of competences, and that we may be able to break down the overall design tasks in a manner that enables us to achieve various sub-goals and then build on them.

These examples include such competences as

1. Child wants to get box from high shelf. Ladder is in place. Child climbs ladder, picks up box, and climbs down.

2. As for 1 except that the child climbs ladder, finds he can't reach the box because it's too far to one side, so he climbs down, moves the ladder sideways, then as 1.

3. The ladder is lying on the floor at the far end of the room. He drags it across the room, lifts it against the wall, then as 1.

4. As for 1, except that if asked while climbing the ladder why he is climbing it, the child answers something like "To get the box".
(Why would it be inappropriate to answer: "To get to the top of the ladder" or "To increase my height above the floor"? What is involved in understanding that these are inappropriate answers, even if in some sense they are accurate?)

5. As for 2 and 3, except that when moving the ladder the child can be asked: Why are you moving the ladder? And gives a sensible reply.
(What sorts of replies are sensible? How can changes in context affect what is a sensible reply to that question? E.g. consider the case where there is already a ladder closer to the box. Perhaps it looks unsafe. Perhaps it has just been painted.)

6. If asked: would it be safe to climb if the foot of the ladder is right up against the wall, the child answers No. If asked why, the child gives a sensible answer (what sorts of answers are sensible?)

7. Child can answer questions about what would happen if the foot of the ladder is further from the wall.

(a) e.g. top of ladder will be lower, and may be too low
(a1) why?

Child may not be able to give a sensible answer.
(How many adults could?)

(b) ladder may be unsafe
(b1) why?

(That's too much to expect a child to answer accurately without a deep understanding of friction and how to compute resultants of forces. But a more intuitive answer may be possible.)

8. Another child wants to use the ladder, but it is not long enough to reach the shelf if put against the wall at a safe angle for climbing.

The robot child suggests moving the bottom closer to the wall to raise the top, and offers to hold the bottom of the ladder to make it safe.

If asked why holding it will make it safe, the robot child gives a sensible answer about preventing rotation of the ladder (as opposed to slipping of the foot of the ladder).

NOTE: These interactions presuppose a significant degree both of linguistic competence and social competence in choosing relevant answers.

9. There is no ladder, but there are parts (wooden rungs, and rails with holes for rung-ends ) from which a ladder can be constructed, and a mallet. The child makes a ladder and then acts as in previous scenarios.

(This needs further unpacking, e.g. regarding sensible sequences of actions, things that can go wrong during the construction, and how to recover from them, how the mallet is used, sensible ways of holding something you are about to hit with a mallet, etc.)

10. As 9 but the rungs fit only loosely into the holes in the rails. Child assembles the ladder but refuses to climb up it.

11. As 10, but if asked why, child can explain why it's unsafe.

12. As 11, but if asked what could make it safe, the child comes up with some sensible answer (e.g. use nails, glue, or some material to wedge the rung ends firmly in the holes, etc.)

A variant of the last three would be a child-robot watching another who is about to climb up the ladder with loose rungs, appreciating that calamity could result, understanding that the other might be hurt, knowing that people don't like being hurt, and then giving a warning with appropriate advice and answers to questions.

13. As 1. or 9. but with practice the child becomes much more fluent and quick at achieving the task (climbing up the ladder, grasping the object on the shelf, climbing down the ladder, assembling the ladder, etc.)
Contrast fluency at answering questions and giving explanations.


Working out which generic capabilities are involved in each of the above examples is left as an exercise for the reader!

This is merely a simplified illustrative example of a possible scenario. It does not define a project. For that, far more detailed scenarios would be required, including many scenarios using various sub-actions deployed in the above examples, e.g. different ways of walking to a ladder, different ways of climbing up or down it, different ways of retaining one's balance while reaching for something, different ways of pushing a ladder, the ability to judge causal relations from different viewpoints, different ways of picking up a mallet, different ways of inserting a rung into a hole while assembling a ladder, different ways of explaining the same thing, and so on.

Every scenario involving the ability to act in the environment could include a 'meta-scenario' in which the child shows awareness of what it is doing, understanding of reasons for details of the action, the ability to explain to another how to perform it, the ability to think about how it might have been done differently, etc.

The sort of project under discussion would have very many scenarios, probably several hundred, or even thousands, each displaying some aspect of human/animal intelligence. The ladder example above leaves out a great deal. For example, in that scenario all the objects apart from the robot are rigid; there is no bargaining; there is no competition between the agents; there are no explicit punishments and rewards; at most two agents interact; there is no reference to remote objects or people or places; there is nothing invisible (e.g. in a box, under the carpet, in someone's pocket, behind a block); there are no actions involving sand, water, mud, plasticine, wire, paint, glue, or other materials; there are no processes involving bodily needs, e.g. eating, drinking, keeping warm; there is no perception of moving flexing objects; and there are no interactions with moving objects (including other agents), e.g. fighting, dancing, lifting and moving heavy objects, avoiding being hit by something, catching something, etc.

Another feature of human intelligence not captured in the examples is development of fluency, speed, accuracy (in movement, thinking, understanding, perceiving) through practice. Some scenarios would have to address the architectural issues relating to highly developed skills which can be exercised to some extent in parallel with thinking, talking, doing other things.

There are no examples of perception of mood, emotion, intention, puzzlement, interest in other agents, and no self-perception on such matters either. There is no reasoning about mental processes in another agent, e.g. trying to work out how the other would think, feel, solve a problem, view a situation, etc.

There are no problems generated by conflicting motives within an individual, the need to think or act under time pressure, distractions of attention coming from outside or from within the agent (e.g. anger, sorrow, excited anticipation), and nothing involving different modes of relating to a current task: wanting to do it, doing it reluctantly, enjoying it, finding it boring, finding it easy or difficult, wanting to do it but not for its own sake, wishing it could be done a different way, etc.

A full project specification would have to include scenarios including demonstrations of all these different kinds of competences (and others).

For a robot that had not previously met the questions or the tasks, coping with them would be a test of creativity.

More pointed questions and tasks could probe specific types of creativity including not just combinatorial creativity within a planning task, but the ability to extend an ontology, to find a new form of representation, to come up with conjectured explanation, to overcome a communication difficulty, etc.

Explaining clinical neuroscientific phenomena
It should be possible to extend the scenarios to include simulations of various kinds of malfunctions, that might arise out of brain damage, disease, or genetic brain disabilities.

Structure of a mini-scenario
The above can be thought of as a mini-scenario. It has several aspects.

The meta-competences would, as far as I know, defeat all existing robots.

Missing types of scenarios
The above examples do not do justice to the varieties of types of scenarios relevant to producing a child-like (or any kind of human-like) robot.

One reason for this is that in humans and many other animals there are different kinds of mechanism which run in parallel within the whole architecture with almost arbitrary interactions between them. For instance, in a human walking and talking with a friend there could be

The H-Cogaff architecture being investigated in Birmingham and the partly similar Emotion Machine Architecture proposed by Minsky at MIT are both intended to address multiple diverse interacting mechanisms within a single integrated architecture. Other sorts of architectures were surveyed at the Stanford Symposium on Cognitive Architectures, in March 2003.

Deriving metrics from multiple mini-scenarios
Suppose we developed 200 such little scenarios and arranged them (or their components) roughly in order of increasing sophistication, including a host of different contexts and tasks, e.g. dressing and undressing dolls, colouring in a picture book, taking a bath (or washing a dog), making toys out of meccano and other construction kits, making tea, eating a meal, choosing food to make a meal, feeding a baby, cleaning a mess made by spilling some powder or liquid on a smooth hard floor, or on a carpet, reading a story and answering questions about it, making up stories, discussing behaviour of a naughty child, discussing possible ways of treating the child and why they will work, being able to engage in debate with others (e.g. about where to go for a picnic, or what to do at a party), talking about the past, about the future, about distant places.

Some of the scenarios would include the child learning in various ways, e.g. getting better at physical skills, being able to do qualitatively new things, being able to give better explanations, etc.

We could justify the ordering of the scenario fragments on the basis of things like:

(A total ordering would be much harder to justify than a partial ordering.)
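
The difference between a partial and a total ordering can be illustrated with the ladder scenarios. The sketch below assumes, purely for illustration, that "As for N" means "at least as demanding as N"; the `builds_on` table is one reading of the numbered ladder descriptions, not a claim about real developmental order. It uses Python's standard `graphlib` module.

```python
from graphlib import TopologicalSorter

# builds_on[s] = scenarios that s explicitly extends ("As for ...").
# This is an assumed reading of the ladder examples, for illustration only.
builds_on = {
    1: set(),
    2: {1},
    3: {1},
    4: {1},
    5: {2, 3},
    9: {1},
    10: {9},
    11: {10},
    12: {11},
}


def easier_than(s, graph):
    """Return every scenario that s transitively builds on."""
    seen, stack = set(), list(graph[s])
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(graph[t])
    return seen


def comparable(a, b, graph):
    """True if the partial order ranks a and b relative to each other."""
    return a in easier_than(b, graph) or b in easier_than(a, graph)


# One of many total orders consistent with the partial order.
one_total_order = list(TopologicalSorter(builds_on).static_order())
```

Here `comparable(5, 1, builds_on)` holds, but scenarios 4 and 12 are not comparable: nothing in the descriptions says whether explaining a goal is harder or easier than proposing a repair. Any total order has to place one before the other arbitrarily, which is why it is harder to justify.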

Then we could find some way of allocating points to various subsets of the tasks and set up a time-table with points achieved. E.g. after year 1, 10 points would be achieved; after year 3, 40 points; after year 5, 60 points; after year 10, 200 points; etc.
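
A minimal sketch of such a points scheme follows. The time-table figures are the ones given above; the point values attached to individual scenario fragments are invented for illustration. The example also shows why a single number can mislead: two quite different sets of achievements can receive the same score.

```python
# Targets quoted in the text: year -> cumulative points.
TIMETABLE = {1: 10, 3: 40, 5: 60, 10: 200}

# Hypothetical point allocation for a few scenario fragments.
POINTS = {
    "climb ladder": 5,
    "move ladder": 5,
    "explain goal": 10,
    "assemble ladder": 10,
}


def score(achieved):
    """Total points for a set of achieved scenario fragments."""
    return sum(POINTS[s] for s in achieved)


def on_schedule(year, achieved):
    """True if the achieved fragments meet the target for that year."""
    return score(achieved) >= TIMETABLE[year]


profile_a = {"climb ladder", "move ladder"}  # physical skills only
profile_b = {"explain goal"}                 # a meta-competence only
```

Both profiles score 10 points and so meet the year 1 target, although one robot can act without explaining and the other can explain without acting. A structural description of what has been achieved keeps that distinction; the score does not.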

Provided that goals are clear and richly structured we can fairly easily devise metrics. Devising a good set of goals that are interesting and worthwhile, challenging, decomposable into subsets, etc. will be harder.

Most of what has been described so far is way beyond the state of the art in AI and robotics as far as I know, and would require many different sorts of theoretical and practical advances, including development of new (virtual machine) architectures, new forms of representation, new algorithms, new types of learning, new developments in making implicit human knowledge explicit, etc.

Surprising possible beneficiaries
After writing the above I discovered this web site about Vicki the android child in a TV series. The note includes this quote regarding reactions of elderly people to the idea of a robot-child.

"Beyond the fancy, the technical possibilities of a real-life android "granddaughter" utterly enchanted seniors, the majority being pretty unabashed in saying that they'd furnish whole wardrobes and even bedrooms for their own little Vicki, as though playing a full-sized dollhouse."

Note that this is not why we are proposing this project. It may be that some people would wish to market a robot-child for this sort of purpose, but that might just be one of many possible application areas.

Another might be a fairly large and strong version of a robot-child that could be a helper for disabled people who would rather depend on a robot than put out fellow human beings.

A particularly useful case might be a robot assistant for a person who has recently become blind and has not yet learnt to cope in the way that the long term blind do. Again, that sort of thing is not the purpose of this project, merely a possible application.

Other applications could include the study of developmental processes or learning processes that might be based on a working model of a young child. Others could be teaching tools for therapists or psychologists. Others might be applications in factories, rescue operations in situations too dangerous for human rescuers, robots performing tasks in a mine with dangerous gases, or at the base of an off-shore oil rig or in an un-manned space station, or other situations too unpleasant or too dangerous for humans. None of these is a specific goal of this project. However it is to be expected that the scientific achievements will support a very wide range of practical applications in the long term.

A more detailed justification for pursuing this sort of aim can be found here:

An EC-funded project, starting September 2004, partly based on these ideas is described here:

This document is part of a web site discussing and promoting a long term (15-20 year) 'Grand Challenge' project, entitled 'Architecture of Brain and Mind', here

See also The MIT Roboverse project.

The Birmingham Cognition and Affect Project has been investigating many of the issues discussed here for some time, including issues concerned with the architectural requirements for integrated human-like agents with multiple types of functionality that can be exercised (to some extent) concurrently. Here is a picture of the proposed H-CogAff architecture (which still has many gaps to be filled). A toolkit for developing such architectures is described here.


Last updated: 23 Nov 2004
Aaron Sloman