6 Summary and Discussion

We see the world in scenes, where objects are embedded and often partially occluded in rich and complex surroundings containing additional objects. How does the brain extract and transform diagnostic low-level visual features into richer representations that facilitate recognition, when so many factors affect the appearance of natural object categories? In this thesis, I examined to what extent object and background information is represented and used for object recognition in human subjects and in deep convolutional neural networks. More specifically, I evaluated how different functional architectures or differences in information flow (feed-forward vs. recurrent) exhibit sensitivity to natural scene properties. My experiments focused on the role of natural scene complexity, as indexed by two biologically plausible image statistics, and the manipulation of ‘informative’ (congruent) information in visual scenes. Overall, results show that recognizing objects in simple scenes can occur in a feed-forward manner, on the basis of a first, coarse representation. For more complex scenes or more challenging situations, additional extensive processing (in the form of recurrent computations) is required. Additionally, results indicate that object recognition can be performed based on feature constellations, without any determination of boundary or segmentation. Finally, this work showcases a potential role for DCNNs as artificial animal models of human visual processing. In the following sections I will discuss the obtained results in more detail. Throughout, I will discuss their implications for our understanding of object recognition in natural scenes. Finally, I will go into the broader context of our research and discuss some outstanding questions.

6.1 Motivation and summary of the results

The initial motivation came from the findings in Groen, Jahfari, et al. (2018), where we found that object detection was more difficult for scenes with low spatial coherence (SC) and high contrast energy (CE), i.e., high SC/CE values. CE and SC are computed using a simple visual model that simulates neuronal responses in one of the earliest stages of visual processing. Specifically, they are derived by averaging the simulated population response of LGN-like contrast filters across the visual scene (Ghebreab et al., 2009; Scholte et al., 2009). In turn, they could serve as a complexity index that affects subsequent computations towards a task-relevant visual representation. Combined fMRI and EEG results from that study showed that for complex scenes only, early visual areas were selectively engaged by means of a feedback signal. These findings suggested that when the initial global scene impression signals the presence of a high SC/CE scene (indicating that it contains clutter), the visual system has to perform more effortful detailed analysis of the scene, which involves recruitment of information from early visual areas.

In chapter 2, we wondered whether these low-level, task-irrelevant properties would also influence perceptual decision-making. In addition, we attempted to dissociate the contributions of the two different axes describing the image complexity ‘space’ (CE and SC). We used regression analyses in which we included both linear terms and second-order polynomials to examine whether the relationship between SC/CE and two parameters from the Drift Diffusion Model (DDM; Ratcliff & McKoon, 2008; Wiecki et al., 2013) was linear or curvilinear (e.g. followed an inverted U-shape). Results indicated that scene complexity, as indexed by our two image statistics (SC, CE), modulated perceptual decisions through the speed of evidence accumulation. Specifically, the speed of evidence accumulation was related to both SC (linear) and SC\(^2\) (inverted U-shape). That is, low and high SC were associated with a decreased drift rate, as indicated by a negative shift in the posterior distribution. A second experiment refined these observations by showing how the isolated manipulation of SC alone resulted in weaker yet comparable effects, whereas the manipulation of CE had no effect. Overall, these results showed that very basic properties of our natural environment influence perceptual decision-making. Because SC and CE could be plausibly computed in early stages of visual processing, they could indicate the need for more cautious or elaborate processing by providing the system with a global measure of scene complexity.
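To make the distinction between a linear and a curvilinear relationship concrete, the following is a minimal sketch of a regression with a linear and a second-order term. It is not the analysis pipeline used in chapter 2: the data are simulated, and the names `fit_quadratic`, `sc` and `drift` as well as all numerical values are illustrative assumptions.

```python
import numpy as np

def fit_quadratic(x, y):
    """Ordinary least-squares fit of y = b0 + b1*x + b2*x**2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix with intercept, linear and second-order term.
    design = np.column_stack([np.ones_like(x), x, x**2])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs

# Simulated (hypothetical) data in which drift rate peaks at medium SC.
rng = np.random.default_rng(0)
sc = np.linspace(-1.0, 1.0, 200)  # standardized SC values (assumed)
drift = 1.0 + 0.2 * sc - 0.8 * sc**2 + rng.normal(0.0, 0.05, sc.size)

b0, b1, b2 = fit_quadratic(sc, drift)
# A negative second-order coefficient corresponds to an inverted U-shape;
# the vertex of the fitted parabola gives the SC value with the highest drift.
peak_sc = -b1 / (2.0 * b2)
```

In this sketch, a reliably negative `b2` is what an inverted U-shape would look like, whereas a purely linear relationship would leave `b2` close to zero.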

A question that arose from these findings was whether the effects were driven by SC and CE themselves (low-level regularities) or by other sources of information in the scene with which they covary. SC and CE clearly covary with interesting properties of natural scenes, but exactly because of this covariance it is difficult to isolate their impact on visual processing. Therefore, in chapter 3, we explored whether these effects could be based on the computation of SC and CE more directly, as a ‘general measure’ of complexity, or indirectly, as diagnostic information to estimate other task-relevant scene properties (e.g. naturalness). To this end, we manually segmented the objects from their real-world scene backgrounds and superimposed them on phase-scrambled versions of the real-world scenes. For each complexity condition, backgrounds were selected using the same cut-off values as in Groen, Jahfari, et al. (2018), and each object was presented in all conditions. This allowed us to evaluate the influence of SC and CE and the subsequent effect on segmentability, while removing any (ir)relevant object and context information. Additionally, in half the trials, we hindered recurrent processing with visual backward-masking (Fahrenfort et al., 2007). A convergence of results indicated that recurrent computations were increasingly important for recognition of objects in more complex environments (i.e. objects that were more difficult to segment from their background). First of all, behavioral results indicated poorer recognition performance for objects with more complex backgrounds, but only when feedback activity was disrupted by masking. Second, EEG measurements showed clear differences between complexity conditions in the ERPs around 200 ms, a time window beyond the first feed-forward visual sweep of activity (Lamme & Roelfsema, 2000).
Additionally, object category decoding based on the multivariate EEG patterns showed later decoding onsets for objects embedded in more complex backgrounds. This indicated that object representations for more complex backgrounds emerge later than those for objects in simpler backgrounds. Finally, Deep Convolutional Neural Network (DCNN) performance confirmed this interpretation; feed-forward network architectures showed a larger reduction in recognition performance for objects in more complex backgrounds than networks equipped with recurrent connections (Kubilius et al., 2018).
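As an illustration of what phase scrambling does to an image, the sketch below shows one common way to implement it: the amplitude spectrum is kept and the Fourier phases are randomized. This is an assumption about the general technique, not a description of the exact stimulus-generation procedure of chapter 3; the function name `phase_scramble` and the random stand-in image are hypothetical.

```python
import numpy as np

def phase_scramble(image, rng):
    """Randomize the Fourier phase of a 2-D grayscale image.

    The amplitude spectrum is preserved. The phase of a random noise image
    is added to the original phase, which keeps the spectrum conjugate-
    symmetric so that the inverse transform is real-valued.
    """
    image = np.asarray(image, dtype=float)
    spectrum = np.fft.fft2(image)
    noise_phase = np.angle(np.fft.fft2(rng.random(image.shape)))
    scrambled = np.abs(spectrum) * np.exp(1j * (np.angle(spectrum) + noise_phase))
    # Discard the numerically negligible imaginary part.
    return np.real(np.fft.ifft2(scrambled))

rng = np.random.default_rng(0)
scene = rng.random((64, 64))  # stand-in for a grayscale scene (assumed)
background = phase_scramble(scene, rng)
```

Because the amplitude spectrum is unchanged, global contrast statistics of the scene are largely retained, while recognizable object and context information is removed.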

A limitation of any experiment with artificially generated (or artificially embedded) images is that it is unclear whether the findings generalize to ‘real images’ that have not been manipulated in any way. Together with the previous findings, however, our results corroborate the idea that more extensive processing (in the form of recurrent computations) is required for object recognition in more complex, natural environments (Groen, Jahfari, et al., 2018; Kar et al., 2019; Rajaei et al., 2019; Tang et al., 2018). Nonetheless, the manipulation of SC/CE using artificial vs. naturalistic backgrounds led to slightly different patterns of results. With artificial backgrounds, scene complexity had a linear effect (chapter 3). With natural scenes, performance was enhanced for trials of medium complexity (inverted U-shape; chapter 2 and Groen, Jahfari, et al., 2018). While there are several other factors that could explain this discrepancy (described in the next section), we wondered to what degree real-world scene context influenced recognition performance. Furthermore, we wanted to compare the process of scene segmentation for object recognition between humans and artificial neural networks, in the hope that this could give insight into how scene segmentation might be implemented computationally.

Therefore, in chapter 4, we evaluated how object and context information is represented and used for object recognition in different DCNNs. More specifically, we investigated how the number of layers (depth) in a DCNN influences scene segmentation and how this compares to human behavior.

Experiment 1 showed both substantial overlap and differences in performance between human participants and DCNNs. Both humans and DCNNs were better at recognizing an object when it was placed on a congruent versus an incongruent background. However, whereas human participants performed best in the segmented condition (object on homogeneous background), DCNNs performed equally well (or better) for the congruent condition. Performance for the incongruent condition was lowest. This effect was particularly strong for more shallow networks. Notably, the shift in performance from the most shallow network to deeper networks (ResNets; He et al., 2016) showed the same pattern as the shift from a shallow feed-forward architecture to a recurrent architecture (CORnets; Kubilius et al., 2018), suggesting that there is a functional equivalence between additional nonlinear transformations and recurrence.
Further analyses, investigating which parts of the image were most important for recognition (Zeiler & Fergus, 2014), showed that the influence of background features on the response outcome was relatively strong for less deep networks and almost absent for deeper networks. These findings suggest that one of the ways in which network depth improves object classification is by learning how to select the features that belong to the object, thereby implicitly segregating the object features from the other parts of the scene. To complement these findings, we performed an additional experiment in which we tested how training was influenced by network depth. If shallow networks fail to correctly recognize objects merely because they do not learn to implicitly segment the object from the background (while deeper networks do), they should show a larger increase in performance when trained with segmented vs. unsegmented stimuli (as compared to deeper networks). Indeed, results indicated a benefit of training on segmented objects (as compared to unsegmented objects) for more shallow networks. For deeper networks, this benefit was much less prominent. Training on segmented images thus reduced the difference in performance between shallow and deeper networks.
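One way to quantify which image regions drive a classification is an occlusion-based sensitivity analysis in the spirit of Zeiler & Fergus (2014). The sketch below assumes that approach and is not the exact analysis of chapter 4: `occlusion_map` is a hypothetical helper, and `toy_score` is a stand-in for a network's class score that, by construction, only depends on the central ‘object’ region.

```python
import numpy as np

def occlusion_map(image, score_fn, patch=4, fill=0.0):
    """Drop in score when each non-overlapping patch is occluded in turn."""
    image = np.asarray(image, dtype=float)
    baseline = score_fn(image)
    rows, cols = image.shape[0] // patch, image.shape[1] // patch
    heatmap = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            occluded = image.copy()
            occluded[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = fill
            # Large values mark regions the score depends on.
            heatmap[r, c] = baseline - score_fn(occluded)
    return heatmap

def toy_score(img):
    """Toy 'class score' that only uses the central 8x8 region (assumed)."""
    return float(img[4:12, 4:12].sum())

image = np.ones((16, 16))
heatmap = occlusion_map(image, toy_score, patch=4)
```

For a model that relies on background features, occluding background patches would also lower the score; for the toy score used here, only the four central patches produce a non-zero drop.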

Deep convolutional neural networks thus seem to learn high-level concepts such as objects based on low-level visual input, without existing conceptual knowledge of these concepts. In chapter 5, we examined visual processing in a situation in which visual information can no longer be reliably mapped onto existing conceptual knowledge. To this end, we evaluated object and scene categorization in a brain-injured patient MS, with severe object agnosia and category-specific impairments. Our findings show a dissociation between two semantically associated tasks (categorization of manmade vs. natural scenes and animate vs. inanimate objects), with better performance for inanimate objects, compared to animate objects (as is usually the case), and better performance for naturalistic scenes compared to manmade scenes. Using Deep Convolutional Neural Networks as ‘artificial animal models’ (Scholte, 2018) we further explored the type of computations that might produce such behavior. Overall, DCNNs with ‘lesions’ in higher order areas showed similar response patterns, with decreased performance for manmade (experiment 1) and living (experiment 2) things. This indicates that behavioral category representations (and subsequent impairments) might be explained by a difference in low-level image statistics or physical properties of the stimuli, and thus by a difference in visual input that they provide to the visual system. Altogether, results from this study indicated that, at least in specific cases such as MS, category-specific impairments can be explained by perceptual aspects of exemplars within different categories, rather than semantic category-membership.
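The idea of a ‘lesion’ in a network can be sketched as the removal of a random fraction of connections in a higher layer. The example below is a minimal illustration under that assumption and does not reproduce the lesioning procedure of chapter 5; `lesion_weights`, the weight matrix `w_high` and the lesion fraction are hypothetical.

```python
import numpy as np

def lesion_weights(weights, fraction, rng):
    """Return a copy of `weights` with a random fraction of connections set to zero."""
    weights = np.asarray(weights, dtype=float)
    n_remove = int(round(fraction * weights.size))
    lesioned = weights.copy().ravel()
    # Pick connections to remove without replacement.
    idx = rng.choice(weights.size, size=n_remove, replace=False)
    lesioned[idx] = 0.0
    return lesioned.reshape(weights.shape)

rng = np.random.default_rng(0)
w_high = rng.normal(size=(8, 4))  # hypothetical higher-layer weights
w_lesioned = lesion_weights(w_high, fraction=0.5, rng=rng)
```

Comparing task performance with `w_high` and `w_lesioned` plugged into the same network would then indicate how strongly a given categorization depends on the lesioned stage.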

Taken together, our results suggest that recognizing objects in simple scenes, or categorizing very dissimilar target options (e.g. in terms of global properties), could occur on the basis of a first, coarse representation, often described as visual gist (Greene & Oliva, 2009b; Oliva & Torralba, 2006; Torralba & Oliva, 2003). Overall, our findings are in line with theories of visual processing proposing that a global impression of the scene accompanies (Rousselet et al., 2005; Wolfe et al., 2011) or precedes (Hochstein & Ahissar, 2002) detailed feature extraction (“coarse-to-fine” processing; Hegdé, 2008). The current results add to this view by showing how this complexity could arise, and what type of functional architecture might produce this behavior. Our results suggest that ‘core object recognition’ can occur in a feed-forward manner when the visual input is ‘simple’, but that recurrent computations aid object recognition performance in more challenging conditions. Additionally, our results show that for object recognition, an explicit segmentation step is potentially not necessary. This is in line with recent findings from Tang et al. (2018) and Rajaei et al. (2019), who showed that backward masking led to a large reduction in human object recognition performance for partially visible or occluded objects. Similarly, both studies found that (more shallow) feed-forward architectures were not robust to partial visibility or occlusion of objects, and that adding recurrent computations led to improvements. From a perspective of vision as subservient to action, this makes sense: if certain visual elements form an object in the first sweep of information, the aim of the brain is often to use this information to characterize or interact with the object, not to go back or zoom in on all possible details about its constituting elements. In other words: to recognize a cat, we do not necessarily need to know where its legs are (or whether it still has all four).
If the first sweep of information is insufficient, it might pay off to wait a little longer and implement recurrent computations to gather more evidence (chapter 2) and obtain a sufficiently detailed representation.

6.2 Manipulations of visual processing

This thesis was built around two different ways of evaluating visual processing: 1) increasing task difficulty, thereby enhancing the need for recurrent computations, and 2) decreasing the quality of visual processing, by interfering with recurrent processing or by investigating a patient with bilateral temporo-occipital damage. There are myriad ways to interfere with visual processing, and our design choices have undoubtedly affected our results. Here I will describe the most important variations in our experimental paradigms, and their implications for our interpretations.

To increase the need for recurrent computations, we mostly focused on the low-level complexity of the visual input (chapter 2-3) and the manipulation of congruent vs. incongruent context information (chapter 4). However, there were several other varying factors in our experimental paradigms.
For example, the number of response options varied between the different studies, potentially influencing the level of categorization required to accurately perform the task. Objects can be categorized at different levels of abstraction, from superordinate (e.g. animal vs. non-animal in chapter 2 and chapter 5), ordinate (or ‘basic’, e.g. dog or cat; chapter 3), to more subordinate (e.g. school bus or sports car; chapter 4). At the perceptual level, the features that distinguish object categories may have differed between the tasks, and decreasing the number of response options may have influenced the amount and/or type of information necessary to analyze the scene (Macé et al., 2009; Rosch et al., 1976). For example, to distinguish between an animate or inanimate object one might rely on more global features, whereas to identify a certain object out of twenty-seven options (chapter 4) more detailed (local) information is needed to accurately distinguish between them.

What is clear from all studies reported in this thesis is that object recognition can (almost) always be solved, given enough time, a priori knowledge or visual processing capacity. Therefore, in most experiments we additionally manipulated the quality of, or the opportunity for, recurrent processing. In the experiments with human participants, we shortened presentation times (ranging from 34 to 100 ms) or applied visual backward-masking. In chapter 5, visual processing of patient MS was severely impaired by lesions to most of the ventral temporal cortex of both hemispheres. For the DCNNs we manipulated network depth, the presence or absence of recurrent connections and the removal of certain connections to ‘mimic’ lesions.

Taken together, these (sometimes subtle) differences in experimental paradigms and procedures can explain some of the discrepancies in our current findings. For example, in chapter 2, using naturalistic scenes and a superordinate type of task, medium scene complexity was associated with an increased speed of evidence accumulation and enhanced behavioral performance. In that chapter, we discuss several explanations for why scenes with medium CE/SC values could be processed more efficiently, including higher daily frequency (as in, occurring more often in the ‘real world’) or the amount of contextual information. In chapter 3, using five ‘basic-level’ objects embedded in artificially generated backgrounds, higher scene complexity led to an incremental decrease in performance (with visual backward masking). This suggests that low-complexity naturalistic scenes might be processed differently than low-complexity artificial scenes. Whether this is because of task demands, expectations or context is unclear from the current results, and should emerge from future research.

6.3 Probing cognition with DCNNs

Classic models of object recognition focusing on grouping and segmentation presume an explicit process in which certain elements of an image are grouped, whilst others are segregated from each other, by a labelling process. Our results from behavioral experiments in DCNNs show that, when the task is object recognition, an explicit segmentation step might not be necessary. We interpret these findings as indicating that with an increase in network depth there is better selection of the features that belong to the output category (vs. the background), resulting in higher performance during recognition. Thus, more layers are associated with ‘more’ or better segmentation, by virtue of increased selectivity for relevant constellations of features. There is thus no discrete ‘moment’ at which segmentation is successful or ‘done’. This process is similar, at least in terms of its outcome, to figure-ground segmentation in humans and we speculate that it might be one of the ways in which scene segmentation is performed in the brain (using recurrent computations). What these results additionally show is that certain psychological concepts or classifications (e.g. ‘object’ and ‘background’) make distinctions that are not recognized by deep convolutional neural networks, and potentially fail to capture the computations of visual processing. For example, while edges and borders of objects were traditionally seen as very important for successful recognition, the current results suggest that we do not necessarily need to detect those in many everyday behaviors. For these networks, objects are not ‘things’ that ‘exist’ in a certain location with a clear boundary. Visual properties from all regions in the image are processed, and together result in a robust representation that the network can utilize for classification. Classification thus proceeds on the basis of low-level input features that map reliably enough onto high-level feature constellations.
The ‘invention’ of deep convolutional neural networks as computational models of the human visual system in this sense allows addressing questions that previously could not be answered (or had not been asked). Without the constraints of experimental set-ups, using DCNNs enables us to ask questions about the underlying mechanisms producing behavior. Instead of building an experiment on the building blocks of psychological concepts, we can explore the borders of experimental manipulations and start asking ‘when’ and ‘how’ questions. Crucially, this serves as hypothesis generation, and any new insight obtained from DCNNs will need to be verified and confirmed in human data. Of course, counterarguments can be made to this approach. First of all, DCNNs are far from being an ultimate model explaining all biological visual processing (Cichy & Kaiser, 2019; Kriegeskorte, 2015; Lindsay, 2020). They generally lack many types of biological properties that are known to be involved in neural processing, they make different types of errors compared to humans, they generalize poorly beyond the datasets on which they are trained, etc. Clearly, DCNNs are very different from a biological visual system. A second, and perhaps more important, counterargument is that the search space might be too large to explore. It is probably infeasible to explore the immense zoo of different architectures, combined with an infinite number of possibilities to investigate different visual diets, training regimes and tasks. One fruitful approach to help navigate or constrain the search space is to combine knowledge from biological vision with existing models. Over the last years there has been an increase in research aiming to augment or equip DCNNs with additional biologically inspired features and mechanisms.
Examples include the implementation of biological attention mechanisms (Lindsay & Miller, 2018), artificial spiking neural networks (Tavanaei et al., 2019), biological learning rules (Pozzi et al., 2018), and recurrent computations to capture the representational dynamics of the human visual system (Güçlü & Gerven, 2017; Kietzmann, McClure, et al., 2019). Overall, the combination of research in both human and artificial vision offers a promising framework for the investigation of human visual processing and the development of computational models.

References

Cichy, R. M., & Kaiser, D. (2019). Deep neural networks as scientific models. Trends in Cognitive Sciences, 23(4), 305–317.

Fahrenfort, J. J., Scholte, H. S., & Lamme, V. A. (2007). Masking disrupts reentrant processing in human visual cortex. J. Cogn. Neurosci., 19(9), 1488–1497.

Ghebreab, S., Scholte, S., Lamme, V., & Smeulders, A. (2009). A biologically plausible model for rapid natural scene identification. Adv. Neural Inf. Process. Syst., 629–637.

Greene, M. R., & Oliva, A. (2009b). The briefest of glances: The time course of natural scene understanding. Psychol. Sci., 20(4), 464–472.

Groen, I. I. A., Jahfari, S., Seijdel, N., Ghebreab, S., Lamme, V. A., & Scholte, H. S. (2018). Scene complexity modulates degree of feedback activity during object detection in natural scenes. PLoS Computational Biology, 14(12), e1006690.

Güçlü, U., & Gerven, M. A. J. van. (2017). Modeling the dynamics of human brain activity with recurrent neural networks. Front. Comput. Neurosci., 11, 7.

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

Hegdé, J. (2008). Time course of visual perception: Coarse-to-fine processing and beyond. Progress in Neurobiology, 84(4), 405–439.

Hochstein, S., & Ahissar, M. (2002). View from the top: Hierarchies and reverse hierarchies in the visual system. Neuron, 36(5), 791–804.

Kar, K., Kubilius, J., Schmidt, K., Issa, E. B., & DiCarlo, J. J. (2019). Evidence that recurrent circuits are critical to the ventral stream’s execution of core object recognition behavior. Nat. Neurosci., 22(6), 974–983.

Kietzmann, T. C., McClure, P., & Kriegeskorte, N. (2019). Deep neural networks in computational neuroscience. In Oxford research encyclopedia of neuroscience.

Kriegeskorte, N. (2015). Deep neural networks: A new framework for modelling biological vision and brain information processing. In bioRxiv (p. 029876).

Kubilius, J., Schrimpf, M., Nayebi, A., Bear, D., Yamins, D. L. K., & others. (2018). CORnet: Modeling the neural mechanisms of core object recognition. bioRxiv.

Lamme, V. A. F., & Roelfsema, P. R. (2000). The distinct modes of vision offered by feedforward and recurrent processing. Trends Neurosci., 23(11), 571–579.

Lindsay, G. (2020). Convolutional neural networks as a model of the visual system: Past, present, and future. J. Cogn. Neurosci., 1–15.

Lindsay, G. W., & Miller, K. D. (2018). How biological attention mechanisms improve task performance in a large-scale visual system model. eLife, 7, e38105.

Macé, M. J.-M., Joubert, O. R., Nespoulous, J.-L., & Fabre-Thorpe, M. (2009). The time-course of visual categorizations: You spot the animal faster than the bird. PLoS One, 4(6), e5927.

Oliva, A., & Torralba, A. (2006). Building the gist of a scene: The role of global image features in recognition. Progress in Brain Research, 155, 23–36.

Pozzi, I., Bohté, S., & Roelfsema, P. (2018). A biologically plausible learning rule for deep learning in the brain. arXiv Preprint arXiv:1811.01768.

Rajaei, K., Mohsenzadeh, Y., Ebrahimpour, R., & Khaligh-Razavi, S.-M. (2019). Beyond core object recognition: Recurrent processes account for object recognition under occlusion. PLoS Comput. Biol., 15(5), e1007001.

Ratcliff, R., & McKoon, G. (2008). The diffusion decision model: Theory and data for two-choice decision tasks. Neural Comput., 20(4), 873–922.

Rosch, E., Mervis, C. B., Gray, W. D., Johnson, D. M., & Boyes-Braem, P. (1976). Basic objects in natural categories. Cognitive Psychology, 8(3), 382–439.

Rousselet, G., Joubert, O., & Fabre-Thorpe, M. (2005). How long to get to the “gist” of real-world natural scenes? Visual Cognition, 12(6), 852–877.

Scholte, H. S. (2018). Fantastic DNimals and where to find them. In NeuroImage (Vol. 180, pp. 112–113).

Scholte, H. S., Ghebreab, S., Waldorp, L., Smeulders, A. W. M., & Lamme, V. A. F. (2009). Brain responses strongly correlate with Weibull image statistics when processing natural images. J. Vis., 9(4), 29–29.

Tang, H., Schrimpf, M., Lotter, W., Moerman, C., Paredes, A., Ortega Caro, J., Hardesty, W., Cox, D., & Kreiman, G. (2018). Recurrent computations for visual pattern completion. Proc. Natl. Acad. Sci. U. S. A., 115(35), 8835–8840.

Tavanaei, A., Ghodrati, M., Kheradpisheh, S. R., Masquelier, T., & Maida, A. (2019). Deep learning in spiking neural networks. Neural Networks, 111, 47–63.

Torralba, A., & Oliva, A. (2003). Statistics of natural image categories. Network: Computation in Neural Systems, 14(3), 391–412.

Wiecki, T. V., Sofer, I., & Frank, M. J. (2013). HDDM: Hierarchical Bayesian estimation of the drift-diffusion model in Python. Front. Neuroinform., 7, 14.

Wolfe, J. M., Võ, M. L.-H., Evans, K. K., & Greene, M. R. (2011). Visual search in scenes involves selective and nonselective pathways. Trends in Cognitive Sciences, 15(2), 77–84.

Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. European Conference on Computer Vision, 818–833.