A recent presentation on these topics is http://www.cs.bham.ac.uk/research/projects/cogaff/talks/#gibson What's vision for, and how does it work? From Marr (and earlier) to Gibson and Beyond. Recent papers related to the Meta-Morphogenesis project include discussions of visual processes in mathematical reasoning, among other aspects of the evolution of visual functions and mechanisms: http://tinyurl.com/CogMisc/meta-morphogenesis.html http://tinyurl.com/CogMisc/evolution-info-transitions.html http://tinyurl.com/BhamCog/triangle-theorem.html
I have been trying to work out what I would do if I had a team of outstanding vision researchers with whom I could work for the next few years (three to five years). What follows is a partial, draft set of answers, which will be updated from time to time. People who specialise in vision research do not regard me as a vision researcher, and there is some justification for that, insofar as I spread myself very thinly over many topics, and I do not read most of the published vision research reports. Nevertheless I have been thinking about, reading about and writing about vision for over 30 years, including chapter 9 of The Computer Revolution in Philosophy. I list some more papers, presentations, and discussions on vision at the end of this file. This work has mostly been about requirements for human-like or animal-like vision systems, rather than specific designs, although the details of requirements do suggest constraints on designs, and indicate some minimal architectural features, as shown crudely here (2nd page).
Notice, however, that that gives merely one view of a complex multi-level, multi-functional, dynamical system. A different view is being developed within the CogAff project, based on the variety of types of architecture that can be accommodated within the CogAff schema.
Different ways of filling in the schema will put different mechanisms in the boxes and different connections between the mechanisms. The lowest layer is found only in the simplest organisms. The middle layer evolved much later, under pressure to represent and reason about what doesn't exist. The top layer probably evolved in parallel with the other two (and makes use of them). It is concerned with meta-semantic competences: abilities to represent and reason about things that represent and reason, with obvious implications for self-monitoring and self-control. The mechanisms do not all exist at birth in humans: they grow in carefully controlled phases using delayed development, for reasons explained in: Natural and artificial meta-configured altricial information-processing systems (2007). The vision and action columns are also layered, because evolution discovered the need for perceptual and motor subsystems concerned with acquiring and using information about the environment at levels of abstraction corresponding to the different ontologies and functions in the different layers. So waving to someone is an action that requires meta-semantic competences and would be at least partly under the control of the top, meta-management layer. Likewise, seeing happiness or sadness in a face, or seeing an intention in an action, requires meta-semantic competences. These meta-semantic perceptual, thinking, and action competences are complex, but not necessarily more complex than abilities to perceive and think about complex 3-D structures and processes in the environment. E.g. ask yourself why it is that when a bolt is screwed through a fixed nut, rotating the bolt about its axis makes it translate along its axis. Some more examples are in this short discussion note http://www.cs.bham.ac.uk/research/projects/cogaff/challenge.pdf "Perception of structure: Anyone Interested?"
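Incidentally, part of the answer to the bolt-and-nut question can be stated quantitatively, even though stating it is quite different from seeing it: for a thread of pitch p, a rotation through θ radians about the axis produces an axial translation of (θ/2π)·p. A minimal sketch (the function name and the pitch value are my own illustrative choices, not from any source):

```python
import math

def axial_travel(rotation_radians, pitch_mm):
    """Axial displacement of a bolt turned in a fixed nut.

    One full turn (2*pi radians) advances the bolt by exactly one
    thread pitch along its axis; partial turns scale linearly."""
    return (rotation_radians / (2 * math.pi)) * pitch_mm

# One full turn of a bolt with 1.25 mm pitch advances it one pitch;
# half a turn advances it half a pitch.
print(axial_travel(2 * math.pi, 1.25))
print(axial_travel(math.pi, 1.25))
```

Of course, the interesting question above is not this formula, but how a visual system grasps why the helical thread geometry forces the coupling at all.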
Some disagreements with prevalent views
My work on vision has mainly been concerned with identifying requirements for human-like or animal-like vision, including requirements that will need to be met by visual systems in intelligent robots and that are currently far beyond the state of the art in machine vision. This work has led me to disagree with three widely held assumptions regarding functions of vision, (1)-(3) below, and one widely used assumption about good means to achieve those functions, (4) below: (1) I don't agree with the widely held assumption that the main function of a 3-D vision system in an animal or a robot is recognizing objects: recognition is a secondary function, which results from seeing. There are many situations in which we can see an object, and even do things like pick it up, jump on it, avoid touching it, break it apart, prod it, push it, etc., even though we do not recognise the whole object either as being an instance of a known category, or as being a previously encountered individual. So we need to make object-recognition occur as a by-product of seeing, not as the main or most basic function of seeing. It is also important to stress that perception is at least as much about processes as about objects. Biological visual systems did not evolve to cope with a series of snapshots. Animals exist in and interact with an environment in which many processes of different sorts occur, including processes in which objects change their properties (e.g. shape or colour), their spatial relationships and their causal relationships and interactions. Furthermore these changes may be metrical or qualitative, geometrical or topological, and may preserve or change complexity (e.g. as objects are combined to form more complex objects, or disassembled to form a larger collection of simpler objects). Perceiving these processes should not be confused with recognition.
There are several issues concerning visual perception of 3-D objects that are not being addressed, in part because of the excessive focus on recognition. One way to appreciate those problems is to consider how humans perceive objects they do not recognise. The proposal to study perception of polyflaps grew out of this requirement. (2) I don't think 3-D vision should be thought of as producing some sort of internal model replicating or representing all the details of the scene, in such a way as to enable images of the scene to be generated by projection to different viewpoints. (This is one of the standard tests for success of a 3-D stereo system, but I think it is a misguided test.) My brain cannot do that, yet I see a great deal of 3-D structure, and a great many processes in which 3-D structures are created or changed. That seems to be true of most people and animals with good vision. A small subset of individuals can learn to draw or paint what they see, but that is relatively rare. Examining things humans can do with pictures of impossible objects helps to undermine this 'isomorphic model-construction' view of 3-D vision. Some examples can be found here (PDF). (3) Most vision researchers, in AI, psychology, etc. assume that vision is concerned with detecting what exists in the environment. This ignores the very important collection of issues first identified by J. J. Gibson, which he described in terms of perception of affordances: J. J. Gibson, The Ecological Approach to Visual Perception, Houghton Mifflin, Boston, MA, 1979. Detailed examination of Gibson's examples, and further investigation of functions of vision, indicates that a great deal of human vision is concerned not with what actually exists in the environment but with processes and objects that do not exist, but could exist, including both processes that could occur or be prevented as a result of actions of the perceiver (these involve affordances) and processes that could occur or be prevented by other things, e.g. something blowing in the wind, or being moved by gravity, or by another agent (I call these "proto-affordances"). A paper investigating some of the logical and philosophical implications of this is online here: 'Actual Possibilities', in Principles of Knowledge Representation and Reasoning: Proceedings of the Fifth International Conference (KR '96), Eds L.C. Aiello and S.C. Shapiro, pp. 627--638, 1996. (4) (Added 2 Oct 2009) Most vision researchers, in AI, psychology, etc. appear to assume that spatial locations, distances and angles are represented within a single global coordinate system, where (a) distances between items in the scene use a common metric so that everything can, for example, be expressed in cm, or multiples of some other fixed unit of length, (b) positions have coordinates relative to some common origin, where the coordinates make use of the common distance metric, and (c) orientations in space have measurable angles relative to axes of that global coordinate system. I suspect using a uniform, global system of metrics and a coordinate system based on cartesian or polar co-ordinates is only something done by humans with a mathematical, scientific or engineering education, and cannot be done by young children or other animals.
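The alternative I have in mind -- spatial information held as a web of relative comparisons rather than global coordinates -- can be pictured in a toy way. The sketch below is entirely my own illustration (the function and the facts are hypothetical): it stores only pairwise "longer than" facts and derives further comparisons by transitivity, with no unit of length anywhere.

```python
def longer_than(facts, a, b):
    """Decide whether a is longer than b, given only pairwise
    'longer than' facts (a partial ordering), by following
    chains of comparisons (transitive closure).
    Returns False when the facts do not settle the question."""
    frontier = {a}
    seen = set()
    while frontier:
        x = frontier.pop()
        if x == b:
            return True
        seen.add(x)
        frontier |= {y for (p, y) in facts if p == x} - seen
    return False

# X is longer than Y, and Y is longer than Z, so X is longer
# than Z by transitivity; but nothing relates X and W at all.
facts = {("X", "Y"), ("Y", "Z")}
print(longer_than(facts, "X", "Z"))  # derivable: True
print(longer_than(facts, "X", "W"))  # not settled: False
```

Note what the toy leaves out: the semi-metrical extension (estimating how many times one length fits into another), and the question of how such a web is acquired and used, which are exactly the open research problems.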
Instead, a young human child or animal develops a web of semi-metrical spatial relationships in each scene, where lengths or distances are measured relative to other things in the scene, using partial orderings, e.g. X is longer than Y, X is longer than Z, the distance from P to Q is more than twice and less than three times the distance from R to S, etc. (This ability to estimate the number of times one length, or a difference in length, fits into another length is what I refer to as using a semi-metrical extension to a partial ordering.) The precise details of how this works, how the form of representation is learnt, and how the information thus expressed is used, are all topics for further research. (See the presentation on ontologies for baby robots below, for more information.)

What are the functions of vision?
Exactly what the functions of vision in animals are, and what the functions should be in intelligent robots, is a hard unsolved research topic on which more work needs to be done, so that we have much richer sets of requirements against which to evaluate proposed designs. I have been working on collecting requirements for a long time, and trying to organise them into different categories. But I think there is still a long way to go. My paper for the Dagstuhl workshop on vision in February 2008 is one of several attempts to get clear about this, and I still think I am missing important requirements: http://www.cs.bham.ac.uk/research/projects/cosy/papers/#tr0801a Architectural and representational requirements for seeing processes, proto-affordances and affordances. An earlier paper was presented at a vision workshop in 1986: http://www.cs.bham.ac.uk/research/projects/cogaff/12.html#1207 What are the purposes of vision? In particular, I think there are three major functions of vision to be distinguished, that are shared with other animals, and some additional ones that are unique to humans.

Three major functions of vision
1.
Visual servoing -- online control of actions involving production or prevention or alteration of 3-D processes of various kinds. This uses transient, constantly changing information. This is sometimes mistakenly referred to as the "where" function of vision, assumed to be the role of the "dorsal" visual stream. 2. Producing factual, descriptive, re-usable information that endures for different time-scales, about processes and structures in the environment, with perception of processes probably more important than perception of structures. This is often mistakenly referred to as the "what" function of vision, assumed to be the role of the "ventral" visual stream. Since the factual information can include location, orientation and spatial relationships, it can be as much "where" as "what" information. The alleged distinction ignores the facts. (Milner and Goodale later recommended switching from the what/where terminology to a perception/action distinction, which I think is also a mistake: visual servoing includes both action and vision.) 3. Producing information about what is not occurring, or does not exist but could occur or exist in the environment, and seeing constraints on such possibilities. This can be subdivided into seeing proto-affordances, seeing action-affordances, seeing epistemic-affordances, and seeing limitations of epistemic affordances (e.g. seeing that information is not available, or that it is imprecise, etc.). In many cases, perceiving such affordances involves recognising what kind of stuff (material) things are made of -- e.g. rigid, flexible, elastic, impenetrable, fragile, squishy, heavy, hard, soft, liquid, powdery, etc. Many of these are not properties that can be directly sensed. They often need to be inferred from perceived results of actions (i.e. perceived processes). Examples of possible processes that are hard to see and easy to see (at least for adult humans) can be found here.

Additional functions of vision, that build on those
4.
Seeing causes and effects of things that happen or could happen. a. Seeing why something happens or happened involves reasoning about causes and finding explanations, e.g. seeing that something is being moved because something else is pushing it. b. This is related to but different from predicting what will happen, e.g. that a moving object will hit an obstacle. It seems that such reasoning can use visual structures and visual mechanisms in some cases, and logical or other non-visual information in other cases. NB: these affordances are seen as directly related to perceived parts, features and relations, especially relations between surface fragments, and to possible processes. So they should not be thought of as involving abstract inferences based on recognition of object categories, e.g. "That's a handle so it is graspable", "that's a door so it is openable", etc. Instead, seeing something as graspable involves seeing how two or more controllable surfaces can be moved so that the object comes to be between them, and so that if the two surfaces are then moved towards each other the object will be gripped, and will thereafter move together with the controllable surfaces. How all that might be expressed in the mind of a child, a robot, or a chimpanzee is an open research question. 5. Seeing other things in the environment as 'sentient', with abilities to have intentions, perform actions, and have responses to things happening in the environment. E.g. seeing in which direction someone is looking, seeing what someone is looking at, seeing what someone is doing, seeing what someone is trying to do, seeing that someone is failing to achieve a goal, etc. This includes something like adopting what Dennett calls "the intentional stance" or using what Newell called "the knowledge level". But it need not assume rationality, as they claim. 6. Seeing and understanding communications.
That can include reading written text, understanding gestures, reading music, reading mathematical notation or program code, reading maps, etc. NOTE ADDED 10 Mar 2009 (Revised 10 Jul 2009): PDF slides presented at a number of workshops and seminars recently elaborate on some of these points: http://www.cs.bham.ac.uk/research/projects/cogaff/talks/#brown Ontologies for baby animals and robots. From "baby stuff" to the world of adult science: Developmental AI from a Kantian viewpoint.
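The geometric reading of "graspable" in point 4 above can be reduced to a toy one-dimensional test -- everything below, names and numbers included, is my own illustration, not a proposed mechanism: the object's extent must lie strictly between two controllable surfaces, so that moving them towards each other grips it.

```python
def graspable_1d(finger_a, finger_b, obj_lo, obj_hi):
    """Toy 1-D version of the 'between two controllable surfaces'
    reading of graspability: the object's extent [obj_lo, obj_hi]
    must lie strictly between the two finger positions, so that
    moving the fingers towards each other will grip it."""
    lo, hi = sorted((finger_a, finger_b))
    return lo < obj_lo and obj_hi < hi

print(graspable_1d(0.0, 10.0, 3.0, 7.0))   # object flanked: graspable
print(graspable_1d(0.0, 10.0, 8.0, 12.0))  # object sticks out past a finger
```

The real competence is of course about movable 3-D surfaces, friction, and control during the closing process; the point of the toy is only that the condition is relational and geometric, not a matter of category labels.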
I don't expect any project to achieve all of those, or even to aim for all of them. But I think it is important, when doing research on subsets of the functions of vision, to pay attention to what the full range of functions is, so that work done on the subsets can be informed by the requirement that it later be usable as part of a more general system. Otherwise, there is the risk that work done on subsets will not 'scale out' to interface with other subsets, and will therefore have to be discarded when more ambitious projects are attempted. It may be desirable to develop a research project specifically to identify long-term requirements for visual systems that could be the basis of a partially ordered, scenario-based roadmap for vision research (which will also necessarily involve research on other functions that interact with vision systems). Some ways of thinking about such roadmaps are indicated in this diagram, taken from this presentation: What's a Research Roadmap For? Why do we need one? How can we produce one? euCognition Research Roadmap meeting, January 2007. If anyone is interested in collaborating on trying to assemble more complete requirements for future vision systems, to provide the context for the work to be done in the near future, then I would be very interested to hear suggestions, including suggestions for collaboration. However, I do not intend to apply for funding for research in this area. I shall go on doing it anyway, time-sharing with other research activities.
Papers (including book chapters)
Presentations on vision (PDF files)
- Chapter 9 of The Computer Revolution in Philosophy.
"Perception as a Computational Process"
(Including an overview of the Popeye program, developed with David Owen, Frank O'Gorman and Geoffrey Hinton.)
- Image Interpretation, The Way Ahead? (1982)
Invited talk at an international symposium organised by The Rank Prize Funds, London, Sept 1982.
The proceedings were published in Physical and Biological Processing of Images, Editors: Oliver J. Braddick and Andrew C. Sleigh. Pages 380--401, Springer-Verlag 1983
- On Designing a Visual System: Towards a Gibsonian computational model of vision.
In Journal of Experimental and Theoretical AI 1,4, 289-337 1989
- "How to design a visual system -- Gibson remembered"
(Not available online.)
Jointly written with David Vernon, in Computer vision: Craft, Engineering, and Science,
Ed. D. Vernon, Springer Verlag, 1994.
- Evolvable, Biologically Plausible Visual Architectures
In Proceedings British Machine Vision Conference 2001, pages 313-322.
Discussion notes on vision (HTML, plain text and PDF)
- When is seeing (possibly in your mind's eye) better than deducing, for reasoning?
Presented at CS & AI Theory seminar, Birmingham, Sept 2001
Also at BCS/SGAI meeting, City University London, March 2006
- Evolvable, Biologically Plausible Visual Architectures
Presented at BMVC01 (British Machine Vision Conference, Sept 2001).
- Human Vision --- A multi-layered multi-functional system
Presented at a symposium of the British Machine Vision Association (BMVA) http://www.bmva.ac.uk/ on Reverse Engineering the Human Vision System: Biologically Inspired Computer Vision Approaches, London, 29 January 2003.
- Requirements for Visual/Spatial Reasoning
Talk to Language and Cognition seminar, School of Psychology, Birmingham, Oct 2003
- A (Possibly) New Theory of Vision (PDF)
Presentation given in several places in October 2005 and following months. Emphasised importance of perception of processes.
Closely related to Two views of child as scientist: Humean and Kantian
- Architectural and representational requirements for seeing processes and affordances
Talk at: BBSRC-funded Workshop on Closing the gap between neurophysiology and behaviour: A computational modelling approach
University of Birmingham, United Kingdom
May 31st-June 2nd 2007
- Seeing Possibilities: A new view of Empty Space
Talk at: Intelligent Robotics Lab Seminar, Birmingham, 22nd Jan 2008
Talk at Dagstuhl workshop on vision, Feb 2008
Links between biological mechanisms required for vision in young animals exploring a complex 3-D environment and development of mathematical competences.
Ontologies for baby animals and robots. From "baby stuff" to the world of adult science: Developmental AI from a Kantian viewpoint.
What's vision for, and how does it work?
From Marr (and earlier) to Gibson and Beyond
Presented at Birmingham Vision Club (School of Psychology), 17th June 2011, and a Vision/Robotics workshop (Sheffield University Psychology dept.) 23rd June 2011
See also the vision sections of my Doings file.
- What the brain's mind tells the mind's eye.
Incomplete draft paper, begun in 2002.
"Perception of structure: Anyone Interested?"
Things you see you can do with a cup, saucer and spoon.
Some hard challenges for 3-D vision systems, easy for humans.
Prepared as part of requirements study for the CoSy Project.
Problems posed by pictures of a toy plastic meccano crane.
Problems posed by pictures of impossible objects.
- Perceiving polyflaps
The domain of polyflaps and other domains for acting and learning.
- Predicting Affordance Changes
(Alternative ways to deal with uncertainty) (2007)
Discussion of ways in which perception of action affordances and perception of epistemic affordances can be combined as an alternative to using probabilities to deal with uncertainty.
Informal experiment providing some architectural requirements for a human-like visual system (based on the speed at which you see things at many levels of abstraction as you turn a corner, or come up from a Metro station, in an unfamiliar city).
School of Computer Science
The University of Birmingham