CCNBook Objrec
Introduction
This simulation explores how a hierarchy of areas in the ventral stream of visual processing (up to inferotemporal (IT) cortex) can produce robust object recognition that is invariant to changes in the position, size, etc. of the retinal input image.
Network Structure
Figure 1: V1 filtering steps, simulating simple and complex cell firing properties, including length-sum and end-stop cells. Top shows organization of these filters in each 4x5 V1 hypercolumn.
We begin by looking at the network structure, which goes from V1 to V4 to IT and then to Output, where the name of the object is represented (area V2 is not represented in this model, because it is thought to be important for depth and figure-ground encoding, which is not relevant here). The V1 layer has a 10x10 large-scale grid structure, where each grid element represents one hypercolumn of units, capturing in a very compact and efficient manner the kinds of representations we observed developing in the previous v1rf simulation (Figure 1). Each hypercolumn contains a group of 20 (4x5) units, which process a localized patch of the input image. These units encode oriented edges at 4 angles (arranged along the X axis of the hypercolumn), and the rows represent simple and complex cells as follows: the last 2 rows are simple cells encoding the two polarities (bright below dark and vice-versa); the first row represents complex length-sum cells that integrate over polarity and over neighboring simple cells; and the middle 2 rows are end-stop units that are excited by a given length-sum orientation and inhibited by surrounding simple cells at one end.
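To make this 4x5 layout concrete, here is a minimal NumPy sketch of how the 5 rows of one hypercolumn could be assembled from precomputed simple-cell responses. This is an illustrative approximation of the length-sum and end-stop combinations described above, not the actual simulator code; the function name, array shapes, and the particular subtraction used for end-stopping are assumptions.

```python
import numpy as np

def v1_hypercolumn(simple_here, simple_prev, simple_next):
    """Assemble the 5 rows of one hypercolumn (5 rows x 4 orientations).

    simple_*: (4, 2) arrays of simple-cell responses (4 orientations x 2
    polarities) for this patch and its two neighbors along each orientation.
    """
    # Length-sum cells (row 0): integrate over polarity and neighboring positions.
    length_sum = (simple_here.sum(1) + simple_prev.sum(1) + simple_next.sum(1)) / 3.0
    # End-stop cells (rows 1-2): excited by the length-sum response, inhibited
    # by simple cells at one end (one row per end).
    end_stop_a = np.maximum(length_sum - simple_next.sum(1), 0.0)
    end_stop_b = np.maximum(length_sum - simple_prev.sum(1), 0.0)
    # Simple cells (rows 3-4): the two polarities for this patch.
    return np.stack([length_sum, end_stop_a, end_stop_b,
                     simple_here[:, 0], simple_here[:, 1]])
```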
Neighboring groups process half-overlapping regions of the image. In addition to organizing the connectivity, these groups organize the inhibition within the layer: there is inhibitory competition across the whole V1 layer, but a greater degree of competition within each hypercolumn, reflecting the fact that inhibitory neurons within a local region of cortex are more likely to receive input from neighboring excitatory neurons. This effect is approximated by having the FFFB inhibition operate at two scales at the same time: a stronger level of inhibition within the unit group (hypercolumn), and a lower level of inhibition across all units in the layer. This ensures that columns not receiving significantly strong input will not be active at all (because they are squashed by the layer-level inhibition generated by other columns with much more excitation), while there is also a higher level of competition to select the most appropriate features within each hypercolumn.
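The following is a simplified sketch of this two-scale scheme, assuming the basic feedforward-plus-feedback (FFFB) form of the inhibition. The gain values, and the way the two scales are combined via a max, are illustrative assumptions; the actual FFFB function in the simulator includes additional dynamics and parameters.

```python
import numpy as np

def fffb(avg_ge, avg_act, gi, ff=1.0, fb=1.0):
    """Feedforward (driven by average excitation) plus feedback (driven by
    average activity) inhibition, scaled by an overall gain gi."""
    return gi * (ff * avg_ge + fb * avg_act)

def two_scale_inhibition(ge, act, pools, gi_pool=1.8, gi_layer=1.0):
    """ge, act: per-unit excitatory conductance and activation for one layer.
    pools: list of index arrays, one per hypercolumn (unit group)."""
    gi_lay = fffb(ge.mean(), act.mean(), gi_layer)       # weaker, layer-wide inhibition
    inhib = np.full(ge.shape, gi_lay)
    for idx in pools:                                    # stronger, within-hypercolumn
        gi_p = fffb(ge[idx].mean(), act[idx].mean(), gi_pool)
        inhib[idx] = np.maximum(inhib[idx], gi_p)        # each unit gets the larger value
    return inhib
```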
The V4 layer is also organized into a grid of hypercolumns (pools), this time 5x5 in size, with each hypercolumn having 49 units (7x7). As with V1, inhibition operates at both the hypercolumn and entire layer scales here. Each hypercolumn of V4 units receives from 4x4 V1 hypercolumns, with neighboring columns again having half-overlapping receptive fields. Next, the IT layer represents just a single hypercolumn of units (10x10 or 100 units) within a single inhibitory group, and receives from the entire V4 layer. Finally, the Output layer has 20 units, one for each of the different objects.
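The half-overlapping receptive-field geometry can be summarized with a small sketch. The grid sizes come from the text above (10x10 V1 hypercolumns, 5x5 V4 pools, 4x4 fields, stride of 2 for half-overlap); the function name and the wrap-around handling at the edge are assumptions for illustration.

```python
def v4_pool_sources(v4_x, v4_y, v1_size=10, rf_size=4, stride=2):
    """Return the (x, y) V1 hypercolumn coordinates feeding one V4 pool."""
    x0, y0 = v4_x * stride, v4_y * stride
    return [((x0 + dx) % v1_size, (y0 + dy) % v1_size)
            for dy in range(rf_size) for dx in range(rf_size)]

# e.g. the V4 pool at (1, 0) sees V1 hypercolumns x = 2..5, y = 0..3,
# overlapping half of the 4x4 field of the pool at (0, 0).
print(v4_pool_sources(1, 0))
```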
- You can view the patterns of connectivity described above by clicking on `r.Wt`, and then on units in the various layers.
Training
Figure 2: Set of 20 objects composed from horizontal and vertical line elements used for the object recognition simulation. By using a restricted set of visual feature elements, we can more easily understand how the model works, and also test for generalization to novel objects (objects 18 and 19 are not trained initially, and are subsequently trained in only a relatively small number of locations -- learning there generalizes well to other locations).
Now, let's see how the network is trained.
- First, go back to viewing `Act` in the Network display. Then, do `Init` and `Step Trial` to see the first training trial. Use the VCR buttons at the bottom right of the NetView to replay the quarters of activation states.
You will see the 3 quarters of the minus phase of settling for the input image, which is one of the shapes shown in Figure 2, presented at a random location, size, and slight rotation in the plane, followed by the plus phase with the correct answer provided on the Output layer. The full bitmap image processed by the network can be found in the `Image` tab. The patterns on the V1 input layer are the result of processing this image with the V1 oriented edge detectors shown in Figure 1.
- You can continue to `Step Trial` to see a few more trials and get a sense of the kind of variation present in the inputs.
Because it takes a while for this network to be trained (though perhaps only about 2 minutes, depending on how new and powerful your computer is), we can just load the weights from a trained network. The network was trained for 50 epochs of 100 object inputs per epoch, or 5,000 object presentations. However, it took only roughly 25 epochs (2,500 object presentations) for performance to approach asymptote (you can see this by training it yourself, or by opening the `objrec_train1.epc.csv` file in the `TrnEpcPlot` if you are working from the original source). Given all of the variation in how a given input can be presented, this does not represent all that much sampling of the space of variability.
- Load the weights using `Open Trained Wts` in the toolbar. Then, `Step Trial` a couple of times to see the minus and plus phases of the trained network as it performs the object recognition task. You can click back and forth between `ActM` and `ActP` to see the difference between the network's answer and the correct one.
You should see that the plus and minus phase output states are usually the same, meaning that the network is correctly recognizing most of the objects being presented.
To provide a more comprehensive test of its performance, you can run the testing program, which runs through 500 presentations of the objects and records the overall level of error.
- Hit the `Test All` button, and select `TstEpcPlot` to speed up processing while waiting for the results. You can click on `TstTrlPlot` to see trial-by-trial results, but watching this will slow processing.
You will see that error rates are generally below 5% (and often zero), except for the two final objects, which the network was never trained on and therefore always gets wrong. Thus, the network shows quite good performance at this challenging task of recognizing objects in a location-invariant and size-invariant manner.
Receptive Field Analysis
Having seen that the network is solving this difficult problem, the obvious next question is, "how?" To answer this, we need to examine how input patterns are transformed over the successive layers of the network. We do this by computing the receptive fields of units in the V4 and IT layers. The receptive field essentially refers to the range of different stimuli that a given unit in the network responds to -- what it is tuned to detect. During the Test process, the system computes an activation-based receptive field.
In this procedure, we present all the input patterns to the network and record how units respond to them. To determine which patterns activate a given V4 unit, for example, we aggregate activity over another source layer every time that V4 unit is active, weighted by how active it is. If a given source pattern does not activate the V4 unit, then that source does not count toward that unit's overall receptive field. In the end, we divide the accumulated product of the unit activity times the source activity by the sum of the source activations, producing an activation-weighted average. This weighted-average computation ends up producing a useful aggregate picture of what tends to activate that unit. Of particular interest is activity in the `Image` input: a great feature of the activation-based technique is that activity anywhere in the network can be used as the source, because the target layer does not need to receive direct projections from it. We are also interested in the `Output` layer activity, which will tell us which LED objects a given unit participates in representing.
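Here is a minimal sketch of this activation-based receptive field computation as described above. It is illustrative only: the threshold value, normalization details, and function name are assumptions, and the actual simulator code differs in its specifics.

```python
import numpy as np

def actrf(unit_acts, source_acts, thr=0.1, eps=1e-8):
    """unit_acts: (trials, n_units); source_acts: (trials, n_source).
    Returns an (n_units, n_source) activation-based receptive field array."""
    rf = np.zeros((unit_acts.shape[1], source_acts.shape[1]))
    for i in range(unit_acts.shape[1]):
        active = unit_acts[:, i] > thr            # trials where this unit is active
        # accumulate source activity weighted by how active the unit is ...
        prod = (unit_acts[active, i:i + 1] * source_acts[active]).sum(axis=0)
        # ... and normalize by the summed source activity on those trials
        rf[i] = prod / (source_acts[active].sum(axis=0) + eps)
    return rf
```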
Figure 3: Sample of V4 activation-based receptive fields for the Image and the Output layer. The same V4 units are shown (in the outer grid) so you can compare each with its corresponding receptive field for Image and Output.
Figure 4: All IT activation-based receptive fields for the Image and the Output layer, with each unit shown at its corresponding location in the outer grid.
- After the `Test All` run is complete, the tabs starting with `V4:Image` contain the resulting activation-based receptive fields. Click on each of them in turn. `V4:Image` shows all of the V4 units in the outer large-scale grid (in their respective locations as they appear in the NetView), with each unit's weighted-average response to the Images shown in the corresponding inner grid. Because this display is so large that it requires scrolling, we captured a 10x10 sample of V4 units in Figure 3, also showing the same units for `V4:Output`. Likewise, the IT tabs show the corresponding IT receptive fields, also shown in Figure 4.
You should see that the V4 units are encoding simple conjunctions of line elements, in a relatively small range of locations within the retinal input. The fact that they respond across multiple locations makes the receptive field patterns seem somewhat smeared out -- that is a good indication that they are performing a critical invariance role, even though it makes them somewhat difficult to see. You can also see the correspondence between the Image and Output activations -- for example, the lower-left V4 unit in Figure 3 looks like it responds to a "rotated F" according to the Image pattern, and indeed the most active Output unit corresponds to LED Object 10, which is that rotated F. The fact that each V4 unit responds to multiple different objects is evident in the crowded, colorful patterns in the Output panel. This means that V4 is producing a coarse-coded distributed representation of objects, not a localist one (as discussed in Chapter 3).
Question 6.3: Explain the significance of the line-element combinations and spatial invariance observed in the V4 receptive fields, in terms of the overall computation performed by the network.
Question 6.4: Pick a different V4 unit from Figure 3, and explain how its strongest Output representation makes sense based on the features shown in its Image receptive field. (Hint: Pick a unit that is particularly selective for specific Image patterns and specific output units, because this makes things easier to see.)
You can now examine the IT patterns (Figure 4), and compare them with the V4 ones.
Question 6.5: Based on your examination of the IT units, do individual neurons appear to code for a single entire object, or rather parts of different objects (such that an individual neuron is active across multiple objects)? Explain.
If you focus specifically on the number of objects a given unit clearly does not participate in representing (based on the Output layer), you should see that the IT units are more selective than the V4 units, which substantiates the idea that the IT units are encoding more complex combinations of features that are shared by fewer objects (thus making them more selective to particular subsets of objects). Thus, we see evidence here of the hierarchical increase in featural complexity required to encode featural relationships while also producing spatial invariance. Also, the IT patterns are much blurrier, which means that they are integrating over more locations and distortions of the input images -- i.e., they are more invariant.
Summary and Discussion of Receptive Field Analyses
Using the activation-based receptive field technique, we have obtained some insight into how this network performs spatially invariant object recognition, building up invariance gradually over multiple levels of processing. Similarly, the complexity of the featural representations increases at each level of the hierarchy. By doing both of these simultaneously and in stages over multiple levels, the network is able to recognize objects in an environment where object identity depends critically on the detailed spatial arrangement of the constituent features, thereby apparently avoiding the binding problem described previously.
You may be wondering why the V4 and IT representations have their respective properties -- why did the network develop in this way? In terms of the degree of spatial invariance, it should be clear that the patterns of connectivity restrict the degree of invariance possible in V4, whereas the IT neurons receive from the entire visual field (in this small-scale model), and so are in a position to have fully invariant representations. Also, the IT representations can be more invariant and more complex because they build on the limited invariance and featural complexity of the V4 layer. This ability of subsequent layers to build on the transformations performed in earlier layers is a central general principle of cognition.
The representational properties you observed here can have important functional implications. For example, in the next section, we will see that the nature of the IT representations can play an important role in enabling the network to generalize effectively. To the extent that IT representations encode complex object features, and not objects themselves, these representations can be reused for novel objects. Because the network can already form relatively invariant versions of these IT representations, their reuse for novel objects will mean that the invariance transformation itself will generalize to novel objects.
Generalization Test
In addition to all of the above receptive field measures of the network's performance, we can perform a behavioral test of its ability to generalize in a spatially invariant manner, using the two objects (numbers 18 and 19 in Figure 2 above) that were not presented to the network during training. We can now train on these two objects in a restricted set of spatial locations and sizes, and assess the network's ability to respond to these items in novel locations and sizes. Presumably, the bulk of what the network needs to do is learn an association between the IT representations and the appropriate output units, and good generalization to all other spatial locations should result.
In addition to presenting the novel objects during training, we also need to present familiar objects; otherwise the network will suffer from catastrophic interference. The following procedure was used. On each trial, there was a 50% chance that a novel object would be presented (`PNovel` in the panel). If a novel object was presented, its location, scaling, and rotation parameters were chosen from 0.5 of the maximum range of these values used in the original training. Given that these 4 factors (translation in x, translation in y, size, and rotation) are combinatorial, roughly 0.5^4 or 0.0625 of the total combinatorial space was explored. If a familiar object was presented, then its size and position were chosen completely at random from all the possibilities. This procedure was repeated for just 10 epochs of 100 objects per epoch. Importantly, the learning rate everywhere except the IT and Output connections was set to zero, to ensure that the results were due to IT-level learning and not to learning in earlier pathways. In the brain, it is very likely that these earlier areas of the visual system experience less plasticity than higher areas as the system matures.
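A minimal sketch of this trial-generation scheme is shown below, assuming hypothetical maximum ranges for translation, scaling, and rotation (the actual parameter values and object numbering conventions in the simulator may differ).

```python
import random

def gen_trial(p_novel=0.5, max_trans=0.3, max_scale=0.2, max_rot=8.0):
    """Generate one generalization-training trial (illustrative only)."""
    novel = random.random() < p_novel               # 50% chance of a novel object
    frac = 0.5 if novel else 1.0                    # novel objects: half the parameter range
    obj = random.choice([18, 19]) if novel else random.randrange(18)
    return dict(
        obj=obj,
        dx=random.uniform(-frac * max_trans, frac * max_trans),
        dy=random.uniform(-frac * max_trans, frac * max_trans),
        scale=1.0 + random.uniform(-frac * max_scale, frac * max_scale),
        rot=random.uniform(-frac * max_rot, frac * max_rot),
    )
```

Because each of the 4 varied dimensions is restricted to half its range for the novel objects, only about 0.5^4 = 6.25% of their combinatorial space is sampled during this training.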
- To set up the system for this form of generalization training, click the `Train Novel` button in the toolbar. This loads the trained weights, sets the epoch counter to 40 so that it trains for 10 more epochs up to the 50-epoch stopping point, and configures the environment generation as described above. Once you do this, do `Step Run` to continue the training run for 10 epochs.
- After the network is trained, you can run the testing by doing `Test All`.
The results show that the network got around 80% correct (20% errors) on average for the novel objects 18 and 19. This is with training on only about 6% of the space of variation, suggesting that the network has learned generalized invariance transforms that can be applied to novel objects. Given the restriction of learning to the IT and Output pathways, we can be certain that no additional learning in lower pathways had to be done to encode these novel objects.
To summarize, these generalization results demonstrate that the hierarchical series of representations can operate effectively on novel stimuli, as long as these stimuli share structural features with other familiar objects. The network has learned to represent these features in terms of increasingly complex combinations that are also increasingly spatially invariant. In the present case, we have facilitated generalization by ensuring that the novel objects are built out of the same line features as the other objects. Although we expect that natural objects also share a vocabulary of complex features, and that learning would discover and exploit them to achieve a similarly generalizable invariance mapping, this remains to be demonstrated for more realistic kinds of objects. One prediction this model makes is that generalization of the invariance mapping will likely be a function of featural similarity with known objects, so one might expect a continuum of generalization performance in people (and in a more elaborate model).