Installed: 21 Sep 2010
25-5 Jul 2014 (Further major revision and reorganisation)
18-23 Jul 2014 (fixed broken link, architectural development, reformatted, reorganised, clarified)
7 Jul 2011; 29 Jul 2011; 19 Sep 2011;
22 Sep 2010;23 Sep 2010; 26 Sep 2010; 12 Dec 2010; 21 Apr 2011; 5 Nov 2011
This paper is
From time to time a derived PDF version will be created automatically:
A partial index of discussion notes, including related work, is in
All those abilities are generic and may have sub-cases that have to be learnt
separately, and in some cases the learning can include increasing speed,
fluency, reliability and accuracy of performance.
Intelligent abilities require use of knowledge about types of things that can
exist or happen, i.e. knowledge of an ontology. A simple homeostatic controller,
e.g. a thermostat, may use a very simple ontology, perhaps including contents of
a form of sensory input (e.g. temperatures) and contents of a form of output
(e.g. 'raise' or 'lower' signals to a heater or cooler).
Organisms (and future robots) with multiple modes of sensing and acting on a
complex independently existing environment need ontologies that straddle modes
of perception and action, for instance the ability to express where something is
or how it is moving, irrespective of how the object's location is sensed or
If we had a better understanding of how various ontologies used for various
purposes in organisms evolved, and how they develop in individuals, we might be
better able to design machines with intelligence that matches that of animals,
including humans. Instead we can now only produce machines with very shallow and
restricted abilities, that often turn out to be very brittle when dealing with
Albert Einstein once wrote: "It can scarcely be denied that the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience."
In: On the Method of Theoretical Physics, Philosophy of Science, Vol. 1, No. 2 (Apr., 1934), pp. 163-169. http://www.jstor.org/stable/184387
Often, when people discuss the role of simplicity in science, they do not notice
The ontology used by a theory is determined in part by
(a) the syntax of the formalism that it uses, and
(b) the variety of 'atomic' components of the formalism that are not explicitly defined as abbreviations for something expressible using other components. (The notion of atomic component is expanded below.)
The atomic components may have some semantic content -- referring to possible
The atomic components may be aspects rather than parts of a form of
representation. For example, in a class of maps different colours or textures
may be used for otherwise similar parts to indicate some feature of whatever is
represented by whatever has the colour or texture. For example, colours might be
used to distinguish rivers, roads, footpaths, railway lines, and routes
travelled by various migrating herds. Or they could be used to distinguish
different connected subsets in a map with a complex topology, as in the London
Tube map. In our discussion of a machine or organism learning to interpret
visual sensory input we'll ignore most of these possibilities, though in actual
biological systems and in robots they can be important.
Another sort of atomic component is a type of relationship between parts. For
example in a map or visual sensory array, neighbourhood relationships in
different directions may be interpreted as depicting various spatial
relationships in the source of the visual information, possibly in a context-
sensitive way -- i.e. what a spatial relationship means depends on the context.
For example, when a 2-D image represents a 3-D structure, the local and global
context can determine whether a vertical line in the image represents a
horizontal crack in the floor, or a vertical edge of a far wall, or a cord
hanging vertically from the ceiling, or a horizontal crack in the ceiling.
(See Sloman(1971) for examples and an explanation of the difference between
Fregean and Analogical forms of representation, among others.)
The need for mechanism
Often, when people talk about how something is represented they are thinking of
things humans create outside themselves in order to store or manipulate
information, for instance in pictures, maps, sentences, equations, computer
programs, blueprints, tables, graphs, street signs, direction indicators at road
junctions, and many different types of display on machines humans interact with,
including the driver's displays on the fascia of a car (automobile), aeroplane
or a machine controlled in a factory.
Unfortunately, some researchers seem to think cognition and perception are about
objects, and ignore all the semantic contents and perceptual contents
that are not about objects, including aspects or fragments of objects (e.g.
surface fragments relevant to some manipulation task), relations, processes,
causal interactions, opportunities, constraints, goals, values, theories, or
explanations, for example. Relations can be static, e.g. proximity, alignment,
containment, obscuring, or dynamic, e.g. approaching, rotating around, obscuring
more of, and many more.
Some theories have components that use ontologies at different levels of
abstraction, as I shall illustrate below.
A theory can be extended by adding levels of abstraction that are not
definable in terms of previous descriptive mechanisms. This can
add to the complexity of a theory's basic ontology, while either
(a) simplifying the structure of the theory and its deployment in explaining
and predicting specific phenomena, or
(b) significantly extending the explanatory and predictive scope of the
theory.
(We'll later see an example where specifying a 3-D process and a projection
process -- producing a 2-D shadow -- enormously simplifies the description of
a 2-D process.)
In particular, adding new indefinables can alter search spaces relevant to
solving problems, finding explanations, making plans, or learning useful
Example from perception
If things perceived are not directly coupled with the sensors perceiving them
then all sorts of factors other than changes in a perceived object can produce
sensory changes, for example, a light going on or off or changing its intensity,
or another object blocking the view.
Such theories may be very useful in understanding ancient organisms able to
In such cases, ideas from signal detection theory, and standard analyses of
sensors as partly noisy or probabilistic measuring devices become irrelevant to
understanding some of the most important features of perceptual processes and
mechanisms. For example, this point seems to be missed in the otherwise very
interesting six lectures on information theory and perception by Prof. William
Bialek available on Youtube, starting here:
(Closely related to "Signal Detection Theory"
But the more
The importance for science of abilities to represent and explain how things
could or might exist, i.e. the ability of a theory to explain possibilities,
was discussed and illustrated in chapter 2 of Sloman(1978), where it was
argued that the ability to formulate predictive laws is of secondary importance.
In the case of chemistry, further power was gained by extending the ontology to
include sub-atomic particles, and their properties, interrelationships and
Below, I'll describe how a hypothetical organism with a discrete 2-D sensory array,
starting with an ontology of changing 2-D bit patterns, can gain explanatory
power by extending its ontology (a) so as to allow for continuous lines moving
continuously in 2-D, and (b) by adding a third dimension, allowing 3-D
continuous structures to move in 3-D trajectories. In both cases, there are two
stages, first enriching the ontology to allow for more possible structures and
entities to exist, which cannot be sensed and whose descriptions require
concepts that cannot be defined in terms of what can be sensed, and secondly
using the ontology to formulate conjectures about structures and processes that
cannot be sensed, but which allow predictions about sensory data to be made
because the theory allows continuous 2-D projections of continuous 3-D
structures and processes, and allows a discrete sensory array to sample 2-D
continuous projections. This amounts to a two-stage projection, from 3-D to 2-D,
and from continuous space to a discrete space.
Creating the more complex enriched ontology allows an organism or machine to use
a powerful and (relatively) simple theory to explain complex changing 2-D sensory data.
This idea goes back to Kant(1781). Demonstrating that this is possible constitutes
a refutation of "Symbol Grounding" theory (known to philosophers as "Concept
Empiricism"), which claims that all concepts must be definable in terms of
contents of experiences.
Kantian theory extension can be contrasted with many AI learning mechanisms that
are trained by a teacher to attach labels to collections of 2-D sensory data, or
which use statistical analysis of image data to identify and label recurring
clusters of sensory data at increasing levels of abstraction without ever
postulating the existence of entities that can exist without being sensed, as
proposed by Kant. It is not yet clear to me whether the non-reductive (Kantian)
ontology extension capabilities described below are found in any of the "Deep
Learning" systems described in Schmidhuber (2014).
A challenge for learning theories in psychology, neuroscience, AI and robotics,
will be presented, and implications discussed, starting with a visual learning
example, based on changing contents of a 2-D rectangular array of bits (e.g. 0s
and 1s, or any other pair of distinguishable items). This kind of learning can
use either an exhaustive search through a space of possible explanatory
theories, generated by an a priori theory about possible theories, or innate
mechanisms, produced by natural selection, or a human designer, to drive the
choice of proposed explanatory ontologies and theories on the basis of what has
been found to work in previous generations, or previous designs.
The exhaustive search is more general, but also more likely to be intractable,
and too slow for individuals that need to feed, grow, escape predators, and
reproduce. On the other hand, the use of (Kantian) innate meta-theories can lead
to a failure to find good explanations because of limitations of the meta-theories.
However, in organisms, natural selection can complement the use of innate
mechanisms by individuals with a richer, more thorough, and very much slower,
search across generations. And in some organisms that process of natural
selection can lead to a process of social/cultural evolution that works much
faster than natural selection can change genomes, though it requires individual
organisms to have novel teaching and learning capabilities, including use of
language and creative educational abilities (not found in all human teachers!).
NOTE: I am not claiming that animal visual systems use anything remotely
like the form of input in the examples below: 2-D arrays of bit patterns. The
structure of a typical biological retina is totally different, and that suggests
that biological retinas perform quite different functions from electronic
frame-grabbers. For now, those differences will be ignored. For more information
about the human retina, including numbers of foveal and non-foveal receptors see:
E.g.
Total number of cones in fovea: approximately 200,000.
Cone density: 17,500 cones/degree^2.
Approximately 17,500 cones in the central rod-free fovea.
Total number of cones in the retina: 6,400,000.
Total number of rods in the retina: 110,000,000 to 125,000,000.
According to the above web site, a human fovea has approximately 200,000 cones,
not far off 450x450 (i.e. 202,500), though retinal cells are not arranged in a square
array, so relationships in a video image derived from cartesian coordinates,
e.g. forming a horizontal, vertical, or diagonal line, may have no analogue for
retinal input. Moreover, whereas a computer can calculate the relationship of
direction and distance between an image point and the centre of the image, the
corresponding measure for an organism might be expressed in terms of which saccade,
or which head movements, or which limb movements could bring a point of interest
to the centre of a fovea, or both foveas simultaneously (using vergence), and
what positive or negative magnification effects those changes would have on
Figure Lines below is an example of a subset of a snapshot of such a digital
display (much smaller than 450x450):
Figure Lines
At the next time-step that portion of the image could look the same, could look
But if there is some structured mechanism, outside the perceiver, generating the
changing images, the task has a solution, even if it is hard to find.
The program may have to search for a form of description summarising how
the pixel values, their 2-D coordinates and the time are related. E.g. it may
attempt to construct some sort of law of the form:
C(x, y, t) = F(x, y, t)
where F may be a complex formula or algorithm possibly referring to the initial
state at a particular time, t0, as well as more recent times, e.g. t-1, t-2, etc.
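A law of this form can be pictured as a function that is checked against the recorded frames. The following is only a minimal sketch under stated assumptions: the names `law_fits`, `alternating`, `stripes_moving_right` and `make_frames` are hypothetical, frames are small lists of 0s and 1s, and the two candidate laws are invented for illustration rather than taken from any actual learner.

```python
from typing import Callable, List

Frame = List[List[int]]               # frame[y][x] is 0 (black) or 1 (white)
Law = Callable[[int, int, int], int]  # F(x, y, t) -> predicted colour


def law_fits(frames: List[Frame], law: Law) -> bool:
    """Return True if C(x, y, t) == F(x, y, t) for every recorded pixel."""
    return all(
        frame[y][x] == law(x, y, t)
        for t, frame in enumerate(frames)
        for y in range(len(frame))
        for x in range(len(frame[y]))
    )


def alternating(x: int, y: int, t: int) -> int:
    # Hypothetical law: the whole display flips colour at every time-step.
    return t % 2


def stripes_moving_right(x: int, y: int, t: int) -> int:
    # Hypothetical law: vertical stripes two pixels wide drifting one pixel per step.
    return ((x - t) // 2) % 2


def make_frames(law: Law, width: int, height: int, steps: int) -> List[Frame]:
    """Generate a frame sequence from a law (used here only to build test data)."""
    return [
        [[law(x, y, t) for x in range(width)] for y in range(height)]
        for t in range(steps)
    ]


frames = make_frames(stripes_moving_right, width=8, height=4, steps=6)
print(law_fits(frames, stripes_moving_right))
print(law_fits(frames, alternating))
```

Checking a given law is cheap; the hard part, as discussed in the text, is choosing which candidate laws to try.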
Alternatively, the program may attempt to find a continuous function or some
collection of differential equations, treating the discrete values as a sample
from a set of continuously varying black, white and grey patterns distributed
over the 2-D plane, with a thresholding operation determining whether each pixel
is recorded as black (0) or white (1).
Or it might search for a generative model or simulation that depicts both an
external structure and a projection process, where the model supports reasoning
about how the projected images will change.
I'll discuss some of the complexities of detecting and describing patterns of
change in a 2-D rectangular array, and later show how the complexity can be
reduced if the perceiver is able to generate a hypothesis about the changing
2-D image being a projection (shadow) of a rotating 3-D object, a wire frame
Notes on complexity of the task
If all the changes are completely random there will be no way of simplifying
If the pixel-states change 10 times a second, and information is
collected for an hour, then the program would have 36000x450x450
(over 7 billion) records of the form [colour, x, y, time]. Looking
for relationships between various subsets of the pixels at different
times, possibly including relationships between patterns separated
by several time steps, could require the machine to examine a very
large set of subsets of the pixels, if it starts with no prior knowledge about
which patterns are likely to exist.
The number of subsets of pixels in a 450x450 array is far larger than
the number of atoms in the universe, estimated as about 10^80 in
The number of subsets (2^202500) expressed as a decimal number has over 60,000 digits.
So there is no possibility of searching through the set of possible subsets
seeking patterns of association between images. Any such learning will have to
be based on selections of much smaller image fragments. How can a learning
system that knows nothing about the mechanism generating the image sequence decide
on good ways to select image fragments to compare across time-steps, or across
a single image if seeking regular 2-D patterns?
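The magnitudes involved can be checked with a few lines of arithmetic. This sketch assumes the 450x450 array, 10 changes per second and one hour of recording mentioned in the text; the constant and variable names are illustrative. The digit count of 2^(450x450) is obtained from logarithms, since the number itself is too large to be worth printing.

```python
import math

WIDTH = HEIGHT = 450
CHANGES_PER_SECOND = 10
SECONDS_RECORDED = 60 * 60

pixels = WIDTH * HEIGHT                              # pixels in one frame
time_steps = CHANGES_PER_SECOND * SECONDS_RECORDED   # frames in one hour
records = time_steps * pixels                        # [colour, x, y, time] records

# The number of subsets of the pixel set is 2**pixels.
# Its number of decimal digits is floor(pixels * log10(2)) + 1.
subset_digits = math.floor(pixels * math.log10(2)) + 1

print(f"records in one hour: {records:,}")
print(f"decimal digits in the number of pixel subsets: {subset_digits:,}")
```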
However general a learning mechanism is, it cannot check all possible hypotheses
about the causes of its sensory input in parallel. It will have to order the
hypotheses in some way. If the learner is a product of biological evolution on
this planet, it may be predisposed to try hypotheses in an order that the
ancestors of the learner found useful, if the order is encoded in the genome.
But what worked well for ancestors may not work well in a new context, e.g.
attempting to use a device connected to the internet to gain knowledge.
Obviously it will depend in part on what sort of pattern of changes is being
displayed and whether the available feature detectors find features that are
parts of important patterns of change.
Suppose that in the world of our hypothetical learner, all that happens is that
the pixels are either all black or all white and they alternate at regular
That pattern of behaviour may appear to be fairly easy to detect. There will be
only two states of the whole display, between which the display switches (all
black and all white), and that would be obvious to humans. But some
change-detectors may be incapable of recognising a global binary alternating
process, especially if they start by trying to detect static or moving edges and
building on the results, as many AI vision systems do!
But let's consider more complex cases, including, later on, images caused by
projecting the shadow of a rotating 3-D object, e.g. a wire frame cube, onto the
grid. My conjecture is that there are many types of image sequence (or video
stream) that include regularities whose discovery would require either strong
initial ("innate") hypotheses as to their contents or else "astronomically"
large searches through the space of possible explanatory mechanisms.
As the length L of the machine description increases, the number of descriptions
of length L increases explosively. For example, the number of possible descriptions
with 1000 ordered symbols, with at least K options for each symbol, is at least
K^1000 -- far larger than the number of atoms in the universe if K > 1.
(This web site suggests that the number of atoms is at most around 10^80:
This approach is intractable, even if steps are taken to reduce the explosion by
eliminating redundant options (e.g. equivalent descriptions using different
symbols). The approach may appear to work if tested on 'toy' problems, however.
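The growth in the number of candidate descriptions can be made concrete as follows. This is a sketch only: the function names are hypothetical, and 10^80 is used as the rough estimate of the number of atoms quoted in the text.

```python
ATOMS_IN_UNIVERSE = 10 ** 80   # rough estimate quoted in the text


def num_descriptions(k: int, length: int) -> int:
    """Number of ordered symbol sequences of the given length over k symbols."""
    return k ** length


def shortest_length_exceeding(k: int, bound: int) -> int:
    """Smallest length L for which k**L exceeds bound (requires k >= 2)."""
    if k < 2:
        raise ValueError("k must be at least 2")
    length, count = 0, 1
    while count <= bound:
        count *= k
        length += 1
    return length


# Even with only two symbols, 1000-symbol descriptions outnumber the atoms.
print(num_descriptions(2, 1000) > ATOMS_IN_UNIVERSE)
# Description length at which binary descriptions first outnumber the atoms.
print(shortest_length_exceeding(2, ATOMS_IN_UNIVERSE))
```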
Although I have not yet understood it fully, a more sophisticated approach is
proposed in the "Powerplay" system described by Juergen Schmidhuber, which not
only searches for a good explanatory model to generate the sensory data, but
simultaneously searches for good learning algorithms. I don't know if it
overcomes the combinatorial explosion outside of 'toy' test situations where
the only sensory input is a single bit stream. (Of course a single bit stream
could represent a sequence of 450x450 binary retinal images, but with no prior
information about the order in which the bits are fed into the stream, the
complexity of the search task will be significantly increased.)
POWERPLAY: Training an Increasingly General Problem Solver by Continually Searching for the Simplest Still Unsolvable Problem. Frontiers in Cognitive Science, 2013. ArXiv preprint (2011) http://arxiv.org/abs/1112.5309
John McCarthy, "The Well-Designed Child", unpublished 1996. Also published in the Artificial Intelligence Journal in 2008. http://www-formal.stanford.edu/jmc/child/
He states: "Evolution solved a different problem than that of starting a baby with no a priori assumptions."
Various kinds of non-randomness in the sensory input may be fairly easily
For example, if no pixel ever changes its colour, what would enable that
constancy to be detected? If the system starts with built in change-detectors
at every pixel, all feeding into another 2D array recording all the last recorded
changes at each pixel, then a third layer with a mechanism that scans the second
layer looking for any deviation from a constant value might repeatedly report
that no change is being detected anywhere.
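The three layers just described might be sketched as below. This is an illustration under assumptions, not a claim about any real retina or vision system: frames are small lists of 0s and 1s, and the names `change_layer`, `any_change` and `report_constancy` are hypothetical.

```python
from typing import List

Frame = List[List[int]]   # layer 1: frame[y][x] is the sensed value, 0 or 1


def change_layer(previous: Frame, current: Frame) -> Frame:
    """Layer 2: 1 wherever a pixel differs from its previous value, else 0."""
    return [
        [int(p != c) for p, c in zip(prev_row, cur_row)]
        for prev_row, cur_row in zip(previous, current)
    ]


def any_change(changes: Frame) -> bool:
    """Layer 3: scan the change array for any deviation from 'no change'."""
    return any(any(row) for row in changes)


def report_constancy(frames: List[Frame]) -> List[bool]:
    """For each time-step after the first, report whether nothing changed."""
    return [
        not any_change(change_layer(a, b))
        for a, b in zip(frames, frames[1:])
    ]


still = [[0, 1], [1, 0]]
flipped = [[0, 1], [1, 1]]
print(report_constancy([still, still, still]))
print(report_constancy([still, flipped, flipped]))
```

Here the layers are supplied by the designer, which is exactly the pre-built case; nothing in the sketch addresses how a learner could discover that these are the useful layers to build.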
But detecting image constancy quickly would require pre-built mechanisms. If the
system has to learn which algorithms operating on the input values produce
useful results, it may take some time to learn to detect total constancy! The
'no change' condition might be detected only after various tests for interesting
types of change have failed.
[NOTE: Many people are puzzled by the 'change blindness' demonstrations of Kevin
O'Regan and others, and ask for explanations of why change is not detected.
This is really silly: we need explanations of how change is detected, not
why it sometimes is not detected.
Having found an explanatory mechanism (which requires an understanding of the
computational problems), we can then ask under what conditions the mechanism
might fail to produce the right result.]
Using those pre-built detectors for similarity between left and right neighbours
across a time step, an additional mechanism could check whether all the
similarity detectors had a positive result at every time step. That would be a
relatively simple mechanism to hard-wire into an electronic retina, though if it
had to be discovered by searching through a space of algorithms that search
could take some time.
A slightly more complex challenge would be recognition that the pixel value at
location (1000,y) at time t is always the same as the pixel value at location
(1,y) at time t+1: i.e. the horizontal motion to the right "wraps around" to the
left, so that an unchanging pattern is constantly cycling through the system.
There are many other such patterns, e.g. horizontal flow in the opposite
direction, or vertical flow upward or downward, with or without 'wrapping', or
combined horizontal and vertical flow (i.e. diagonal flow) in any one of four
directions. More complex stepwise diagonal flows could have everything constantly
moving two steps right then one step up. All of those 'global' patterns could
easily be detected if the designer of the system had pre-installed suitable
checking algorithms (or hardware equivalents, in some cases, like the 'optical
flow' detectors that seem to be used in animal visual systems).
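A pre-installed checker for such global 'wrapped' flows could look something like the sketch below. The assumptions are that the same shift applies at every time-step, that y increases downwards, and that only one-pixel shifts are tried; the names `shift_wrapped` and `detect_global_flow` are hypothetical. Stepwise patterns such as "two right then one up" would need a richer candidate set.

```python
from typing import List, Optional, Sequence, Tuple

Frame = List[List[int]]


def shift_wrapped(frame: Frame, dx: int, dy: int) -> Frame:
    """Move every pixel dx columns right and dy rows down, wrapping at the edges."""
    height, width = len(frame), len(frame[0])
    return [
        [frame[(y - dy) % height][(x - dx) % width] for x in range(width)]
        for y in range(height)
    ]


def detect_global_flow(
    frames: Sequence[Frame],
    candidates: Sequence[Tuple[int, int]],
) -> Optional[Tuple[int, int]]:
    """Return the first candidate (dx, dy) that explains every time-step, if any."""
    for dx, dy in candidates:
        if all(shift_wrapped(a, dx, dy) == b for a, b in zip(frames, frames[1:])):
            return (dx, dy)
    return None


# Hypothetical test data: one white pixel drifting to the right with wrap-around.
first = [[0, 0, 0, 0, 0],
         [1, 0, 0, 0, 0],
         [0, 0, 0, 0, 0]]
frames = [first]
for _ in range(6):
    frames.append(shift_wrapped(frames[-1], 1, 0))

one_pixel_shifts = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
print(detect_global_flow(frames, one_pixel_shifts))
```

Note that the candidate list does the real work: the checker finds a flow only if the designer (or evolution) has already included it among the hypotheses to be tested.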
Your visual system would very quickly spot a simple movement across the display
from left to right, though it's not clear whether that's because all humans
learn to use such a motion detector or whether there is something in the genome
that ensures its presence in normal brains.
Detecting that the pattern's motion is 'wrapped' round the vertical edges of the
display, could be a result of noticing that some patterns keep repeating,
starting at the edge and moving to the right. That will be easier to do if the
dots do not form a random array but have some clearly visible large scale
structure, e.g. a 10x10 array of large squares moving across the screen. In that
case, detecting that as a square moves off the screen to the right, an exactly
similar square moves onto the screen on the left, should be feasible if the
search for process structures is designed to look for moving vertical edges,
instead of searching among all possible patterns of pixel combinations.
In such cases the formulation of a 'theory' that describes what is going on and
allows predictions to be made about what will happen next, can use the same set
of concepts as was required for the initial data (i.e. pixel locations and their
contents), plus some additional concepts defined in terms of the concepts used
to define the data (e.g. 50x50 array of pixels, 10x10 array of 50x50 pixel
I do not know whether Schmidhuber's 'Powerplay' system referenced above would
easily discover this sort of description in a 1000x1000 array.
(There are minor complications about defining the concept of a complete display
moving horizontally, that need not be discussed here.)
In that case the program would do better if it were able to extend its ontology to
include the concept of a continuous 2-D line projected onto the 2-D
discrete array. Each such line would then be represented approximately by a
nearly co-linear set of black pixels. The description of a line moving
continuously across the screen will be much simpler than the description of a
jagged collection of squares moving, and possibly slightly changing its shape
because of the different adjustments required to fit a portion of the image
exactly into a square.
A learning program that from "birth" includes the notion of a "line-segment" as
a movable entity that can be manifested or represented by a changing set of
pixels in a display, might be able to detect indicators of such lines and
discover that they move and how they move. Without a suitable set
of innate concepts, searching among all possible configurations of pixel
patterns for useful invariants across time intervals could be at least extremely
slow and possibly also completely intractable.
NOTE ADDED: 29 Jul 2011
Social evolution and cultural transmission could change this, if structures
found to be useful by members of a community are not encoded in the genome but
recorded in the culture and passed on to young learners to constrain their
searches for useful features. That form of guidance is one of the factors that
enables each generation to learn more than previous generations, as discussed
below in connection with the influence of a teacher.
With appropriate initial concepts available, the program might find "maximal
lines", i.e. lines that are not parts of larger lines, by scanning outwards from
linear-fragments and merging adjoining nearly collinear fragments, as is
typically done by computer vision programs.
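One simple way of growing maximal lines from fragments is sketched below. It is a hypothetical greedy procedure chosen for brevity, not a description of any particular vision program: fragments are assumed to be directed segments given as pairs of end points, and the tolerances `max_gap` and `max_angle` are illustrative values.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]   # (start, end)


def direction(seg: Segment) -> float:
    (x0, y0), (x1, y1) = seg
    return math.atan2(y1 - y0, x1 - x0)


def can_merge(a: Segment, b: Segment, max_gap: float, max_angle: float) -> bool:
    """True if b starts near the end of a and points in nearly the same direction."""
    gap = math.hypot(b[0][0] - a[1][0], b[0][1] - a[1][1])
    turn = abs(direction(a) - direction(b))
    turn = min(turn, 2 * math.pi - turn)
    return gap <= max_gap and turn <= max_angle


def merge_fragments(
    fragments: List[Segment], max_gap: float = 1.5, max_angle: float = 0.2
) -> List[Segment]:
    """Repeatedly join adjoining, nearly collinear fragments until none can be joined."""
    lines = list(fragments)
    merged_any = True
    while merged_any:
        merged_any = False
        for i, a in enumerate(lines):
            for j, b in enumerate(lines):
                if i != j and can_merge(a, b, max_gap, max_angle):
                    lines[i] = (a[0], b[1])   # extend a to the end of b
                    del lines[j]
                    merged_any = True
                    break
            if merged_any:
                break
    return lines


fragments = [((0, 0), (3, 0)), ((0, 5), (0, 8)), ((4, 0), (7, 0))]
print(merge_fragments(fragments))
```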
(The concept of a continuous line segment, with arbitrary orientation, moving
continuously in continuous space in an arbitrary direction, while producing a
projection in a discrete 2-D array is non-trivial but I shall not expand on
requirements for possession of such a concept here. That concept certainly
cannot be defined in terms of experiences in a changing discrete grid.
But from the concept in an appropriate theory it is possible to derive
criteria for detecting the projection of such a continuously moving line in a
2-D discrete grid, and human vision researchers have explored a variety of such
line detectors. Since they all work on rectangular grids of pixel values,
whereas animal retinas do not have that structure, quite different detectors will
be needed by anyone trying to model or understand natural vision systems.)
Could a totally general learning process, without relying on any "innate"
concepts for building explanatory theories, discover this way of explaining the
sensed patterns, e.g. using learning based only on mechanisms for information
compression, without any built-in biases in favour of particular ontologies or
forms of representation to use for compression -- and without any initial bias
towards using increased dimensionality to achieve reduction in complexity? Or
would some sort of innate disposition to use the concept of a continuous
straight line be required to make the learning feasible in the lifetime of an
Things get more complex if the lines can change their 2-D orientation.
If the display includes changing collections of black pixels that can be
taken as evidence for a small number of continuous lines projected onto the
display, including some lines that are neither horizontal nor vertical, each
moving linearly without rotating or bending, then describing the lines could
produce a considerable reduction in the complexity of the perceived process,
compared with a full description of the changing pixel values.
The extra complexity of the latter description would arise out of the need to
continually specify which pixels are black and which white (0 or 1). There is
no unique way to do this: it will depend on the assumed thickness of the line
relative to the pixel size, and the orientation of the line relative to the two
major axes of the pixel grid.
Pixel projections of four such continuous lines are shown in Figure Lines above.
Notice that the relationship between length of line and number of black pixels
required to draw it depends on orientation, as shown by the vertical and
diagonal lines of different lengths, but with the same number of pixels. The
concept of a line is not the same as the concept of a set of black points,
though the latter can be taken as providing information about (i.e.
representing) the former.
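The dependence of pixel count on orientation can be illustrated with a simple rasteriser. This is a sketch assuming one pixel per step along the longer axis and integer end points; `rasterize` and `segment_length` are hypothetical names, and other rasterisation rules (e.g. thicker lines) would give different counts.

```python
import math
from typing import List, Tuple


def rasterize(x0: int, y0: int, x1: int, y1: int) -> List[Tuple[int, int]]:
    """Pixels approximating a segment, one per step along its longer axis."""
    steps = max(abs(x1 - x0), abs(y1 - y0))
    if steps == 0:
        return [(x0, y0)]
    return [
        (round(x0 + (x1 - x0) * i / steps), round(y0 + (y1 - y0) * i / steps))
        for i in range(steps + 1)
    ]


def segment_length(x0: int, y0: int, x1: int, y1: int) -> float:
    return math.hypot(x1 - x0, y1 - y0)


vertical = (0, 0, 0, 9)
diagonal = (0, 0, 9, 9)
for name, seg in (("vertical", vertical), ("diagonal", diagonal)):
    print(name, len(rasterize(*seg)), "pixels, length", round(segment_length(*seg), 2))
```

Both segments are drawn with the same number of pixels although the diagonal one is longer by a factor of the square root of 2, so the pixel set alone does not fix the length of the line it represents.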
So, instead of having to predict behaviours of a million discrete pixels changing
colour in synchrony, such a program can use a richer ontology providing a way of
predicting behaviours of a relatively small number of continuous lines moving
continuously, but sampled discretely at discrete times. In this case the
concept of a continuous line and the concept of continuous motion are not
something given as part of the domain of the original sensory data, but creative
extensions of that original ontology.
For more complex examples, including multiple layers of representation, using
several different ontologies, see the description of POPEYE, the image
interpretation program in Chapter 9 of Sloman(1978)
The pictures interpreted by POPEYE depicted words made of cut-out capital
letters some overlapping others, with additional positive and negative noise, as
in the examples in the Figure POPEYE below and Figure Noise, below.
The program was able, in many cases, to recognise the word before all the
picture fragments had been found because the interpretation proceeded in
parallel at all the levels shown in Figure POPEYE. Gaps or noise at a particular
level could be compensated for by information acquired at other levels,
using bottom up or top down or middle-out inferences.
This program did not do any learning. Adding the ability to invent the
ontologies required for all the intermediate layers would have been a
challenging task, but the project was terminated for lack of funding. Doing the
learning would have required the program to discover the need for the
intermediate layers when presented with increasingly complex images. Then given
a new image, like Figure Noise, it should be able to generate a suitable
interpretation autonomously, by combining separately learnt ontologies.
When normal adult humans are presented with displays of moving linear
configurations of fixed lengths their visual systems naturally interpret the
display in terms of 2-D objects moving continuously in a plane surface, despite
the discreteness of the display.
However, Johansson and others (see below) have shown that under some conditions,
if the lengths of lines change while their orientation in the plane changes this
may be seen as 3-D motion of an object of fixed length. For example a 2-D line
segment rotating about an end that is fixed while the other end moves on an
ellipse with centre at the fixed end, will often be seen as a line of fixed
length rotating in a plane that is not parallel to the display plane. That
interpretation requires a 3-D ontology and the ability to interpret a sensed
2-D process as a projection of a 3-D process.
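The geometry behind this interpretation can be sketched as follows. The assumptions are mine: projection is orthographic (a shadow cast by parallel light perpendicular to the display), the fixed end is at the origin, and the plane of rotation is tilted about the x-axis; `free_end_3d` and `project` are hypothetical names.

```python
import math
from typing import Tuple


def free_end_3d(length: float, tilt: float, theta: float) -> Tuple[float, float, float]:
    """Free end of a fixed-length segment pivoting at the origin.

    The segment rotates through angle theta within a plane tilted by
    angle tilt (about the x-axis) away from the display plane z = 0.
    """
    return (
        length * math.cos(theta),
        length * math.sin(theta) * math.cos(tilt),
        length * math.sin(theta) * math.sin(tilt),
    )


def project(point: Tuple[float, float, float]) -> Tuple[float, float]:
    """Orthographic projection onto the display plane (an assumption)."""
    x, y, _z = point
    return (x, y)


LENGTH, TILT = 10.0, math.radians(60)
for degrees in range(0, 360, 45):
    x, y = project(free_end_3d(LENGTH, TILT, math.radians(degrees)))
    print(f"{degrees:3d} deg  projected length {math.hypot(x, y):5.2f}")
```

Under these assumptions the projected end point lies on an ellipse with semi-axes LENGTH and LENGTH*cos(TILT), so the 2-D line changes length as it turns, whereas the 3-D description needs only one fixed length, a tilt and a rotation angle.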
In Johansson's demonstrations, more complex moving patterns, with lines changing
their orientations, their lengths and the angles at which they meet, are often
interpreted as moving non-rigid 3-D objects, made of rigid fixed-length
components linked at joints, possibly with motions characteristic of living
organisms, e.g. walking.
A 2-D line-segment is a four dimensional entity insofar as four different items
of information are needed to specify each line. They could be two cartesian
co-ordinates for each end, or a pair of co-ordinates for one end plus a length
and a direction to the other end (polar coordinates for the second end), or a
pair of polar co-ordinates for the first end plus a length and direction (polar
co-ordinates) for the second end. It requires a substantial ontological
extension to switch to representing 3-D line segments, which need six items of
information to identify them. However the switch is much more than merely a
matter of increasing the size of a vector: the set of relations, structures and
processes that can occur in a 3-D space is very much richer, including
projections of structures and processes from 3-D to 2-D.
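Two of the four-number representations of a 2-D line segment mentioned above can be converted into each other as in this sketch. The function names are hypothetical and direction is measured in radians from the x-axis, which is an arbitrary choice.

```python
import math
from typing import Tuple

Endpoints = Tuple[float, float, float, float]   # x0, y0, x1, y1
StartPolar = Tuple[float, float, float, float]  # x0, y0, length, direction


def endpoints_to_start_polar(seg: Endpoints) -> StartPolar:
    """Cartesian end points -> one end plus length and direction to the other."""
    x0, y0, x1, y1 = seg
    return (x0, y0, math.hypot(x1 - x0, y1 - y0), math.atan2(y1 - y0, x1 - x0))


def start_polar_to_endpoints(seg: StartPolar) -> Endpoints:
    """One end plus length and direction -> cartesian end points."""
    x0, y0, length, direction = seg
    return (x0, y0, x0 + length * math.cos(direction), y0 + length * math.sin(direction))


segment = (1.0, 2.0, 4.0, 6.0)
as_polar = endpoints_to_start_polar(segment)
round_trip = start_polar_to_endpoints(as_polar)
print(as_polar)
print(round_trip)
```

Either way four numbers suffice in 2-D; a 3-D segment needs six (e.g. three coordinates per end), and, as the text stresses, the change involves far more than a longer parameter vector.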
A more algebraic form of representation for the line could take the form of an
algebraic expression involving some variables, representing a class of lines,
plus some numbers to select an instance from that class. Depending on the
algebraic expression used we might be dealing with more than four dimensions,
e.g. if not only straight lines are considered. The space of algebraic
expressions that could be used to characterise subsets of a 2-D space would not
have any well-defined dimensionality, since the structures of algebraic
expressions can vary infinitely in complexity. But let's ignore that for now.
The concept of a continuous (Euclidean) line moving continuously could
not be explicitly defined in terms of the appearance of its projection
into the discrete array. So in that sense the concept of continuity
cannot be grounded in the sensory-motor information available
to a machine of the sort described.
The notion of "grounding" is a source of confusion for cognitive scientists
and philosophers, as argued here:
It may find that a good way to do that is to break the trajectory of each line into
sections during which its motion is in roughly the same direction and then find
some algebraic formula that approximates the motion in each section. (Animal
brains would not use algebraic formulae, but more likely qualitative
descriptions of the forms of motion, e.g. going left, decelerating, changing
direction, going right, etc.)
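A rough sketch of the qualitative alternative: a sequence of sensed horizontal positions is split into maximal sections in which the motion keeps the same direction. The labels and the one-dimensional simplification are assumptions made for illustration.

```python
def qualitative_sections(positions):
    """Split a 1-D trajectory into runs with the same direction of motion.

    Returns a list of (label, start_index, end_index) tuples.
    """
    labels = {1: "going right", -1: "going left", 0: "stationary"}
    sections = []
    for i in range(1, len(positions)):
        step = positions[i] - positions[i - 1]
        sign = (step > 0) - (step < 0)
        if sections and sections[-1][0] == labels[sign]:
            # Same direction as before: extend the current section.
            label, start, _ = sections[-1]
            sections[-1] = (label, start, i)
        else:
            # Direction changed: start a new section at the previous sample.
            sections.append((labels[sign], i - 1, i))
    return sections

print(qualitative_sections([0, 1, 2, 3, 2, 1, 0]))
# [('going right', 0, 3), ('going left', 3, 6)]
```

Each section could then be approximated by a formula, or simply kept as a qualitative label.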
If the machine is able to look for patterns that are not in the original data,
but in its derived descriptions, it may then discover that there are groups of
lines that share the same motion patterns, allowing further simplification. For
example several of the lines might change their direction of motion
simultaneously -- moving from left to right then from right to left, with
similar accelerations and decelerations, while also altering their vertical
locations in the picture so that their ends move smoothly, in roughly elliptical
paths (depicted of course by jagged discrete sequences of black pixels in the
So, instead of considering a million pixels of which many but not all change
colour at each time-step, it can consider 12 lines each of which has a small
number of continuous trajectories, where the lines can be grouped perhaps into
three sets of lines with similar types of trajectory in each set.
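The grouping step can be sketched as follows, under the simplifying assumption that each line is summarised by a sequence of horizontal positions and that "similar trajectory" means "reverses direction at the same time-steps". Both the data layout and the similarity criterion are hypothetical.

```python
from collections import defaultdict

def direction_change_times(positions):
    """Time-steps at which a 1-D trajectory reverses direction."""
    times = []
    previous_sign = 0
    for t in range(1, len(positions)):
        step = positions[t] - positions[t - 1]
        sign = (step > 0) - (step < 0)
        if sign != 0:
            if previous_sign != 0 and sign != previous_sign:
                times.append(t - 1)  # the sample where the reversal happens
            previous_sign = sign
    return tuple(times)

def group_by_motion_pattern(trajectories):
    """Group line names whose trajectories reverse at the same times."""
    groups = defaultdict(list)
    for name, positions in trajectories.items():
        groups[direction_change_times(positions)].append(name)
    return sorted(sorted(names) for names in groups.values())
```

For example, two lines that both turn back at time-step 2 fall into one group, while a line that never reverses falls into another.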
By now, the ontology used by the machine has been enriched with continuous
trajectories of lines, directions and speeds of movement, and accelerations, of
lines and ends of lines. On that basis the machine may define a concept of
change of direction, and identify times at which such direction-changes occur
for different lines or line-ends. This will enable it to represent groups of
lines and groups of line-ends with similar patterns of movement and change of
The form of representation found could then be used, if suitable mechanisms are
available, to predict what will happen over various future time-intervals.
Depending on the ontology used the predictions may be precise and metrical or
imprecise and qualitative.
In principle, this form of representation could also explain some sensed pixel
patterns where a group of black dots shrinks in size as the group approaches an
edge of the display, then later starts expanding. An economical description of
such a process might be a line partly moving beyond the end of the pixel array,
and then moving back.
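A minimal sketch of that explanation, assuming a horizontal line of fixed length on a one-dimensional strip of pixels (the function names are hypothetical):

```python
def visible_length(start, length, display_width):
    """Pixels of a line occupying columns start..start+length-1 that lie
    inside a display with columns 0..display_width-1."""
    left = max(start, 0)
    right = min(start + length, display_width)
    return max(right - left, 0)

def sensed_lengths(starts, length, display_width):
    """Sensed size of the pixel group as the line's start column changes."""
    return [visible_length(s, length, display_width) for s in starts]

print(sensed_lengths([3, 5, 7, 9, 7, 5], length=5, display_width=10))
# [5, 5, 3, 1, 3, 5]
```

The sensed group shrinks and then grows again, although the hypothesised line never changes length.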
It would even be possible to find evidence for some lines moving in a circular
pattern, disappearing completely and then reappearing at a different part of the
edge of the display.
[It would be good to have little videos illustrating these possibilities. Offers
gratefully received. Other videos are referenced later.]
NOTE (Added 5 Nov 2011)
John McCarthy has a web page making a similar point -- except that he uses a
rather obscure puzzle that humans don't all find easy -- to make the point that
there's a difference between appearance and reality. See
A great deal of research on machine perception and machine learning has been concerned with techniques for reducing dimensionality of information, e.g. by projecting high-dimensional data into lower dimensional spaces and searching for patterns in the reduced data instead of the original. I have been discussing different techniques above, namely moving to a different space from the original data-space, where the different space may be richer (e.g. continuous instead of discrete) but easier to reason about.
A really clever learning system (unlike any so far produced in AI that I know of) might go even further and invent the notion of 3-D space containing rigid structures that can move and rotate in that space, as described above: but that would require something more than a completely general learning system. For example, the learner might start off with the knowledge that, instead of having only 2-D spatial coordinates, simple bits of stuff can have 3-D coordinates, and instead of motions involving changes in 2-D they can be 3-D changes, including changes of distance from the perceiver, and also changes of orientation and direction of motion in space, if the objects rotate. In that case, a learning system presented with the data described above may be able, in some cases, to achieve a further simplification of its description of what is going on by describing it as a rotating 3-D wire-frame cube (for example) projected onto the 2-D pixel display, like a shadow projected onto a translucent screen.
There are some examples of online demonstrations of 2-D projections of 3-D rotating cubes here, along with further discussion of requirements for being able to make this discovery: http://www.cs.bham.ac.uk/research/projects/cogaff/misc/nature-nurture-cube.html
If a 3-D wire-frame cube is rotating about a fixed axis passing through it then its twelve edges will project onto a pixel display as twelve moving groups of black pixels of the sorts described above.
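The projection of a rotating wire-frame cube can be sketched in a few lines. This is only an illustration under stated assumptions: the cube is centred on the origin, rotates about the vertical (y) axis at constant angular velocity, and is projected orthographically (like a shadow cast by parallel light) onto a continuous 2-D plane rather than onto discrete pixels. All names are hypothetical.

```python
import math
from itertools import combinations

def cube_vertices(side=2.0):
    h = side / 2
    return [(x, y, z) for x in (-h, h) for y in (-h, h) for z in (-h, h)]

def cube_edges(vertices, side=2.0):
    """Index pairs of vertices exactly one side apart: the 12 edges."""
    return [(i, j) for i, j in combinations(range(len(vertices)), 2)
            if math.isclose(math.dist(vertices[i], vertices[j]), side)]

def rotate_about_y(point, angle):
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)

def projected_edges(t, angular_velocity, initial_angle=0.0, side=2.0):
    """2-D projections of the 12 edges at time t (depth is discarded)."""
    base = cube_vertices(side)
    angle = initial_angle + angular_velocity * t
    flat = [rotate_about_y(v, angle)[:2] for v in base]
    return [(flat[i], flat[j]) for i, j in cube_edges(base, side)]
```

An initial state, an angular velocity and a time are enough to generate every projected edge at every moment, and no projected edge is ever longer than the cube's side.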
Each approximately linear group will move in a manner that depends on the size of the corresponding cube edge and its distance from and orientation relative to the axis of rotation of the cube. In terms of changing black and white pixel patterns the projection will be quite complex to describe and the behaviour of the pixels hard to predict. But if the sensed patterns are conjectured to be shadows (2-D projections) of a 3-D rotating wire-frame cube then a single changing angle of rotation can be used to explain/predict all the sensed projected data: a very great simplification based on considerable ontological sophistication.
All the sensed processes can be summarised by an initial state of the cube, and an angular velocity for the rotation, plus the current time. For each time the 3-D configuration can be computed (including the 3-D linear velocities of all components) and the 2-D projection derived. The encoding of that specification of an unending sequence of pixel displays could be much smaller even than the explicit encoding of a single state of the display.
Note that in this case, if part of the rotating shadow does not fall within the bounds of the pixel display, the theory that assumes the edges continue to exist, whether their shadows are sensed or not, will allow reappearance of the projections to be predicted and explained.
My conjecture is that humans and many other animals are innately provided with mechanisms that attempt to interpret visual, haptic and auditory percepts in terms of an ontology of 3-D mobile entities, some of which are other humans. How that process works and how it evolved are topics for further research, as is the problem of getting machines to learn and perceive in a similar way. Clearly the animals that walk, suckle and run with the herd almost immediately after birth don't have time to learn to see and respond to the complex 3-D structures and processes they cope with.
So evolution can provide very powerful biases (as McCarthy noted). I am not claiming that such highly specialised perceptual mechanisms are always present at birth: biological evolution has produced some species whose specific competences develop through interaction with the environment, though the development is constrained by and partly driven by genetic influences, as McCarthy suggested in The Well-Designed Child (mentioned above).
The best known arguments for innate knowledge are concerned with human language learning, which is not matched by any other species on earth. Here, it is not the particular language learnt that is innately specified but something more general that can learn a very wide variety of languages. I suggest there are far more examples of innate generic competences that can be instantiated in many ways as a result of interaction with a specific environment after birth, most of which have not yet been noticed. Similar ideas are in Karmiloff-Smith's outstanding survey of the issues, Beyond Modularity (1992). Some sketchy ideas about genetically influenced, staggered/layered, developmental processes are presented in Chappell & Sloman (2007).
The ideas I have been presenting can be taken as a development of Kant's idea that in addition to concepts of things as they are experienced, an individual perceiving and acting in a world that exists independently of that individual's percepts and actions would have to have a notion of a "thing-in-itself" ("ding an sich") whose existence has nothing to do with the existence of any perceiver. In more modern terminology we can express the conjecture that biological evolution produced some organisms that have innate dispositions to create concepts that are a-modal (not necessarily directly tied to any sensory or motor modality) and exosomatic (refer to things outside the skin of the organism).
A conjecture about the evolution of generalised languages required for internal purposes by pre-verbal humans and also by many non-human animals interacting intelligently with a complex world, which might have developed later into a language for communication, is presented here: http://www.cs.bham.ac.uk/research/projects/cogaff/talks/#glang
The example could be elaborated by postulating the existence of a teacher who somehow poses problems for the learner to solve and provides positive and negative rewards, or comments, on the basis of evaluating the learner's responses to the problems. In that case we would have a learning system that is a combination of teacher and learner and the prior knowledge of the teacher used in setting questions and providing answers would be part of the total learning mechanism. In the case of many animals, and also much of what goes on in very young humans there is a lot of learning that goes on without any teaching. In fact that must have happened in human evolution before there were adults with enough knowledge to do explicit teaching. So we need to explain what sorts of mechanisms, and what sorts of prior knowledge (including meta-knowledge of various kinds) are capable of generating different sorts of learning. It's a mistake to look for the one right learning system if we want scientific understanding as opposed to (or in addition to) mere engineering success. (We already know how to build wonderful learning systems -- human babies: it doesn't follow that we understand how they learn.)
The theory can include undefined symbols, expressing new concepts that
are part of the theory, and which get their meaning from their role in
the theory. The system is usually associated with methods and mechanisms
of observation, measurement, experiment, manipulation which play the
role of "theory tethering" as explained in
Getting Meaning Off The Ground:
Symbol Grounding Vs Symbol Attachment/Tethering
These additions do not define the terms of the theory: they can change while
the theory endures -- e.g. adopting a new more accurate method of measuring the
charge on an electron does not redefine 'electron' or 'charge', if most of the
structure of the theory is unchanged, including the roles of those two symbols
in the theory.
Often AI theories of learning do not take account of all of these applications
of theories that are results of learning.
Revision of a theory can be motivated by either dissatisfaction with the
structure/complexity of existing theories or problems of explaining or
otherwise accommodating or reconciling the theory with new empirical
information (or new theories).
The process of theory revision can include any or all of:
[Many of these changes have been experienced by the Artificial Intelligence
community as a whole. Even more complex changes are going on within biology.]
All this amounts to an unusual way of looking at the process often labelled
"Abduction", namely going from old information to new information by a process
that does not use deductive or statistical reasoning.
Moreover, the possibility of changing the undefined symbols of the
theory makes the search space for such abductive processes potentially
intractable. So heuristic constraints on the search will be required.
Those constraints may come from a culture, from individual preferences or
hunches, or, in the case of some processes of individual development, from
the genome, as implied in Chappell & Sloman (2007).
I.e. some of the abductions done by humans and other organisms use constraints
provided by evolution, often in very complex and unobvious ways. The constraints
are not all explicit at birth, but emerge at various stages through interactions
between the genome and the environment, as discussed in Chappell & Sloman (2007).
So far I have said nothing about how one might build a machine that has visual perception mechanisms that use retinal input as a basis for seeing the world. I shall offer some sketchy ideas which are close to ideas that others have proposed and some have implemented in the past, though nowadays it is not clear to me that such designs are being used.
The key idea is to abandon any notion that seeing happens either at the retina or in the lowest level processing mechanisms driven by retinal input (such as area V1 in the primate visual cortex). Instead the retina and processing elements that are retinotopically mapped should be thought of as together forming a peripheral sensor device for sampling what J.J.Gibson referred to as "the optic array": a cone of information streaming into the location where the eye is, from all the surfaces that are in view from the eye's location. Only a subset of that information will enter the eye at any time, depending on the direction of gaze. In animals the sampling of the optic array is non-uniform, with a small area of high resolution sampling (the fovea) surrounded by lower resolution areas. For now we can ignore the variation in resolution and just talk about a retina that can be directed at different subsets of the cone of incoming information, to pick up samples. Some of the sampled information may be used instantaneously, while other information will mainly be used to extend information structures built up over extended periods of time, of varying lengths.
This retina requires a large collection of processing units to find important information fragments in the optic array, including fragments of 'edges', texture fragments, optical flow fragments, evidence for highlights, and many others.
These fragments are automatically categorised, and where appropriate grouped into slightly larger fragments (where grouping decisions may be context sensitive), and the results of that processing are fed to various other subsystems that have different uses for the information, e.g. collision avoidance, posture control, detection of faces and other objects, detection of various processes in retinal patterns, description of various structures and processes in the environment, fine grained control of action (e.g. monitoring grasping processes), constructions of dynamically changing histograms that are useful for coarse-grained categorisation of the current scene, and also building up longer term records of what has been seen, where things are, what they are doing etc. The longer term information will include things that are temporarily out of sight because the sampling has been moved to a different part of the optic array, or out of sight because they have been temporarily occluded by something closer to the viewer. All the various information structures need to be kept available for use in various tasks (including controlling actions, avoiding bumping into things, answering questions, finding lost objects, following moving objects, catching things, making plans for future actions, etc.). Bringing items back into use will require mechanisms for re-instating links with the retinal array as needed, after such links have been removed because another region of the optic array is being sampled, and the information fed into another part of the more enduring information about the environment. 
Added 18 Jul 2014
An implication of the above discussion is that learning (under the influence of both the genome and the environment, as described in Chappell & Sloman (2007)) produces not just stored facts and new algorithms, but also a changed architecture, possibly including new subsystems with both acquired factual information and new control information, and new communication channels linking them. The new architectures in some cases, instead of requiring new components and new physical connections, may make use of new virtual machines composed of interacting sub-systems that are also virtual machines, as described in http://www.cs.bham.ac.uk/research/projects/cogaff/misc/vm-functionalism.html
NOTE: some of the ideas described here are closely related to the retinoid mechanism proposed in this book by Arnold Trehub, online here: The Cognitive Brain, MIT Press, 1991, http://www.people.umass.edu/trehub/
One feature that the author did not intend when he constructed his model was that it should explain why the retinal blind spot does not enter into consciousness as some sort of information gap. This follows from the fact that the blind spot is just an aspect of a sampling device feeding information into a more integrated and enduring information structure whose contents are more closely related to the contents of introspection. The retinoid will maintain records of information received, not records of information not received. However I do not believe the details of the retinoid model are adequate to meet the requirements of all aspects of human and animal intelligence. That is a topic for another time.
- The example of the rotating 3-D wire-frame cube projected onto a 2-D retina can be varied in a number of ways, including, for example, allowing the axis of rotation of the cube to rotate e.g. in the surface of a cone, allowing the cube to expand or contract, or pulsate in size, or change its location. All of these will complicate the patterns of 2-D motion of points in the projection into the discrete retina described above. If some of the changes, e.g. orientation of the axis of rotation, velocity of rotation, size change, are under the control of output devices managed by the learning machine, that may make it easier for the explanatory theory to be developed, by partitioning the learning task into various sub-tasks.
- P-geometry: Euclid discovered a feature common to all triangles: the interior angles sum to a straight line. The normal proof uses Euclid's parallel axiom, and axioms about equality of angles formed when a straight line (a transversal) crosses two parallel lines. A former Sussex student who became a mathematics teacher, Mary Pardoe, found a proof that was easier to remember, but very different. It involves aligning an arrow with one side then rotating the arrow around each of the vertices in turn, through the internal angles, aligning it with the next side. After being rotated through each angle the arrow ends up on the original side pointing in the opposite direction. Euclid's axioms, and very many proofs based on them, are products of human learning, which is clearly triggered by exploring structures and processes in space, but must make use of competences that are somehow products of the human genome, though they are not available at birth. I have been attempting to construct a new axiomatisation of what may be an extension of a subset of Euclidean geometry, which does not explicitly assume the parallel axiom, but does involve types of motion (translation and rotation) of line segments, initially just in a plane. I call this P-geometry (Pardoe-geometry) and have an incomplete discussion paper here: http://www.cs.bham.ac.uk/research/projects/cogaff/misc/p-geometry.html This was motivated by a desire to make explicit the assumptions of Pardoe's proof and to test a conjecture that its assumptions do not include the parallel axiom, though its ontology does include motion, unlike Euclid's axioms (though motion may be implicit in some of the theorems, e.g. about loci of sets of points satisfying a constraint).
The process of searching for such an axiomatisation, which would, like Euclid's axiomatisation, enormously compress a vast amount of information about spatial structures and also processes in the case of P-geometry, does not feel like a process using a totally general information-compression engine: rather it depends heavily on the specialised ability to represent, manipulate, and reason about spatial structures in an abstract way that does not depend on precise locations, sizes, orientations, etc. It would not be at all surprising to discover that there are evolved features of the human genome that support the development of such mathematical abilities, and without them humans might be unable to learn what they do learn in a normal lifetime. (Compare the features of the human genome that seem to be required to support language learning.)
- Another example, mentioned to me by Alexandre Borovik in conversation, is a small ball moving in a path the shape of a cylindrical spiral coil. Depending on the angle of view (or projection) the 2-D path displayed on the retina will have very different appearances, in some of which there are discontinuities not present in others, even though the motion of the ball is always continuous, with no discontinuous changes of direction. Invoking a 3-D structure producing these visible paths produces a simple uniform explanation of a lot of messy and complex 2-D trajectories. A similar comment can be made about a 3-D wire coil in the shape of such a cylindrical path. Its 2-D projections will be very different from different angles (including a 2-D circle as one of the special cases, and a zigzag linear shape as another) despite the common simple 3-D structure projected.
- Note on the History of Mathematics: It has often happened in the history of mathematics that puzzles arising in some domain (e.g. natural numbers [1,2,3,4...], integers, real numbers) can be dealt with more simply by embedding that domain in a richer, more complex domain. Examples include adding negative numbers and 0 to the natural numbers, adding fractions (rational numbers) to the line of positive and negative integers, adding so-called irrational and transcendental numbers to the rational numbers to produce the so-called real numbers, and adding imaginary numbers (square root of -1) to the real numbers. For an excellent discussion of this listen to the episode of 'In our time', chaired by Melvyn Bragg, broadcast on 23 Sep 2010 http://www.bbc.co.uk/programmes/b00tt6b2 http://www.bbc.co.uk/radio4/features/in-our-time/
- The work of Gunnar Johansson on perception of moving points of light provides many additional examples. He produced movies made by attaching lights to joints on humans filmed in the dark, walking, dancing, fighting, doing push-ups, climbing a ladder, and performing other actions. In each case, although a still snapshot was seen merely as an inexplicable collection of points, the movies were all instantly perceived as 3-D movements of one or more humans. Similar effects were produced with light points attached to simulated 3-D biological organisms of various morphologies moving. There were additional experiments involving just two points that could be seen either as ends of a rotating rod or as moving in various 2-D patterns. Generally the simplest 3-D interpretation was preferred. For more details see Perceiving Events and Objects by Gunnar Johansson, Sten Sture Bergström, William Epstein, Gunnar Jansson, Lawrence Erlbaum Associates, 1994: http://www.questia.com/library/book/perceiving-events-and-objects-by-gunnar-johansson-sten-sture-bergstrom-william-epstein-gunnar-jansson.jsp
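The rotating-arrow argument in the P-geometry note above can be illustrated numerically. The sketch below is not a proof: it computes the turns using ordinary Euclidean coordinates, so it already presupposes the geometry in question. The function names are hypothetical.

```python
import math

def signed_turn(p, q, r):
    """Signed rotation about q that carries the ray q->p onto the ray q->r."""
    ax, ay = p[0] - q[0], p[1] - q[1]
    bx, by = r[0] - q[0], r[1] - q[1]
    return math.atan2(ax * by - ay * bx, ax * bx + ay * by)

def rotate(direction, angle):
    c, s = math.cos(angle), math.sin(angle)
    return (c * direction[0] - s * direction[1],
            s * direction[0] + c * direction[1])

def pardoe_rotation(a, b, c):
    """Rotate an arrow lying on side ab through each internal angle in turn.

    Returns (total rotation in radians, initial direction, final direction).
    """
    length = math.dist(a, b)
    start = ((b[0] - a[0]) / length, (b[1] - a[1]) / length)
    direction, total = start, 0.0
    for p, q, r in ((a, b, c), (b, c, a), (c, a, b)):
        turn = signed_turn(p, q, r)  # internal angle at q, with a sign
        direction = rotate(direction, turn)
        total += turn
    return abs(total), start, direction
```

For any non-degenerate triangle tried, the three turns are all in the same sense, the arrow ends up reversed, and the total rotation is a half turn.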
It is often assumed that all motivation must be related to some sort of reward that can be experienced by the individual that has the motivation. This assumption underestimates the power of biological evolution, which is capable of producing many kinds of reflex response. Some of them are externally visible behavioural responses to situations that can cause damage -- e.g. blinking reflexes and withdrawal reflexes. Many such reflexes work without the individual having any anticipation of reward to be obtained or punishment to be avoided, even though the response may have been selected by evolution because it tends to enhance long term reproductive success. Individual animals do not need to know that having a damaged eye can be a serious disadvantage in order to have reflex behaviours that avoid damage. I have argued in this paper: http://www.cs.bham.ac.uk/research/projects/cogaff/09.html#907 Architecture-Based Motivation vs Reward-Based Motivation, that in addition to external behavioural reflexes there can be, and are, internal reflexes that produce not behaviours but powerful motives to achieve or avoid some state, and the mere existence of such a motive can, in many situations, trigger planning processes and action processes tending to fulfil the motive. These may be as biologically beneficial as external behavioural reflexes but far more flexible because they allow the precise behaviour to achieve the newly triggered motive to be a result of learning. I suspect the irresistible urge to find proofs in mathematics, to improve elegance or efficiency of computer programs, to find a unified explanation of a range of observed phenomena in science, and to produce works of art all depend primarily on architecture-based motivation. A learning system that has just one long term goal, namely to compress as much as possible of information received, might have only one architecture-based motive that drives all others.
The ideas proposed here are intended not to form a definitive explanatory theory, but to be part of a long term "progressive" research programme, of the type defined by Imre Lakatos, in
Falsification and the methodology of scientific research programmes, Philosophical papers, Vol I, Eds. J. Worrall and G. Currie, Cambridge University Press, 1980, pp. 8--101. See also his Open University Broadcast: Science and Pseudoscience (1973) http://www.lse.ac.uk/collections/lakatos/scienceAndPseudoscience.htm
School of Computer Science
The University of Birmingham