Kitty's visual capabilities.
For Kitty, the main requirement for vision is the ability to support actions of various kinds, including actions through which the robot explores the environment and learns what sorts of objects there are, what it can do to them, what the consequences are, etc.
So the requirement for vision is NOT learning to associate verbal labels with categories.
Rather, the robot will need to see where the object is, and more specifically where the various parts of its surface are, and what it would have to do in order to touch, push, prod or pull the object in various ways.
It will also need to be able to see what happens when something (itself, or a person) causes things to move. This requires seeing relationships and also seeing changes in relationships -- some continuous (e.g. getting nearer), others discrete (e.g. touching something, going inside something).
However, we do not need exact surface locations, orientations and shapes to be perceived: all that is required is a form of representation that can be used to control coarse-grained actions, and then visual servoing can deal with the precise control of movements such as approaching an object in order to touch it.
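The division of labour between a coarse action and visual servoing can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not part of the Kitty design: the 2-D positions, the function names `coarse_move`, `observed_error` and `servo`, and the simple proportional correction rule. `observed_error` stands in for whatever vision would report on each frame.

```python
import math

def observed_error(hand, target):
    """Stand-in for vision: the offset from hand to target seen in this frame."""
    return (target[0] - hand[0], target[1] - hand[1])

def coarse_move(hand, rough_target):
    """Open-loop move to a coarse estimate of where the target is."""
    return rough_target

def servo(hand, target, gain=0.5, tolerance=0.01, max_steps=50):
    """Closed-loop correction: on each step, remove a fraction of the observed error."""
    for step in range(max_steps):
        ex, ey = observed_error(hand, target)
        if math.hypot(ex, ey) <= tolerance:
            return hand, step
        hand = (hand[0] + gain * ex, hand[1] + gain * ey)
    return hand, max_steps

# The coarse estimate (0.9, 1.2) is noticeably wrong; servoing closes the gap.
hand = coarse_move((0.0, 0.0), (0.9, 1.2))
final, steps = servo(hand, (1.0, 1.0))
```

The point of the sketch is that the coarse estimate only has to bring the hand close enough for the error to be observable; the precision comes from repeatedly looking and correcting, not from the initial representation.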
What this means is that there will be relatively little use initially for vision mechanisms whose main function is recognition of objects associated with names.
Techniques for getting the spatial and relational information would probably include finding edges, regions, and maybe also things like texture gradients, colour gradients and optical flow; generating a lot of relatively uncommitted local descriptions; and using constraint propagation mechanisms to settle down to a relatively specific shape representation, position representation, etc.
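As a rough illustration of how uncommitted local descriptions might settle down under constraint propagation, here is a minimal sketch of discrete label pruning (in the spirit of Waltz-style filtering). The labels, the neighbour structure and the compatibility rule are all hypothetical; a real system would need constraints derived from the chosen form of representation.

```python
def propagate(candidates, neighbours, compatible):
    """Repeatedly discard any label that has no compatible label at some
    neighbouring description, until nothing changes."""
    domains = {node: set(labels) for node, labels in candidates.items()}
    changed = True
    while changed:
        changed = False
        for node, nbrs in neighbours.items():
            for label in list(domains[node]):
                for other in nbrs:
                    if not any(compatible(label, o) for o in domains[other]):
                        domains[node].discard(label)
                        changed = True
                        break
    return domains

# Three edge fragments along one contour, each with tentative interpretations.
candidates = {
    "a": {"occluding"},
    "b": {"occluding", "convex"},
    "c": {"occluding", "convex", "concave"},
}
neighbours = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}

# Toy rule (an assumption): adjacent fragments of one contour share a label.
same_label = lambda x, y: x == y

result = propagate(candidates, neighbours, same_label)
```

Here the single committed fragment "a" removes the alternatives at "b", which in turn removes them at "c", so one firm local description disambiguates its neighbours without any global search.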
It's probably only after we have selected a suitable form of representation that we'll find out what algorithms are required to produce that representation.
We should not expect that use of stereo, colour vision, or even optical flow is an absolute requirement for doing this, since humans can see a lot even in monochrome, monocularly viewed pictures or real scenes. In any case stereo gives very little extra information beyond some small distance. However, the ability to move the head in order to generate an optical flow field may be very useful.
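The remark about stereo can be made concrete with the standard pinhole relation between depth and disparity, Z = f * B / d, whose first-order depth uncertainty for a given disparity error is about Z^2 / (f * B) times that error. The sketch below uses assumed values for illustration only (a focal length of 500 pixels, a 0.1 m baseline and a one-pixel disparity error); they are not measurements of any actual robot.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth of a point from its stereo disparity: Z = f * B / d."""
    return focal_px * baseline_m / disparity_px

def depth_error(focal_px, baseline_m, depth_m, disparity_error_px=1.0):
    """First-order depth uncertainty: dZ is roughly Z**2 / (f * B) * dd."""
    return depth_m ** 2 * disparity_error_px / (focal_px * baseline_m)

# Assumed camera parameters, chosen only to show the trend.
FOCAL_PX = 500.0
BASELINE_M = 0.1

near_error = depth_error(FOCAL_PX, BASELINE_M, 1.0)   # object 1 m away
far_error = depth_error(FOCAL_PX, BASELINE_M, 10.0)   # object 10 m away
```

Because the uncertainty grows with the square of the distance, a tenfold increase in distance makes the depth estimate a hundred times less precise under these assumptions, which is why stereo contributes little beyond the near field.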
TO BE EXTENDED
Meanwhile see the general overview of requirements for Kitty