Biologically Inspired Computer Vision (2015)
Bioinspired Vision Sensing
Nature has been a source of inspiration for engineers since the dawn of technological development. In fields as diverse as aerodynamics, the engineering of surfaces and nanostructures, materials science, and robotics, approaches developed by nature over a long evolutionary process provide stunning solutions to engineering problems. Largely synonymous terms such as bionics, biomimetics, and bioinspired engineering have been coined for this flow of concepts from biology to engineering.
The area of sensory data acquisition, processing, and computation is yet another example where nature usually achieves superior performance with respect to human-engineered approaches. Despite all the impressive progress made during the last decades in the fields of information technology, microelectronics, and computer science, artificial sensory and information processing systems are still much less effective in dealing with real-world tasks than their biological counterparts. Even small insects outperform the most powerful man-made computers in routine functions involving, for example, real-time sensory data processing, perception tasks, and motor control and are, most strikingly, orders of magnitude more energy efficient in completing these tasks. The reasons for the superior performance of biological systems are only partially understood, but it is apparent that the hardware architecture and the style of computation are fundamentally different from what is state of the art in human-engineered information processing. In a nutshell, biological neural systems rely on a large number of relatively simple, slow, and imprecise processing elements and obtain performance and robustness from a massively parallel principle of operation and a high level of adaptability and redundancy where the failure of single elements does not induce any observable system performance degradation.
2.1.1 Neuromorphic Engineering
The idea of applying computational principles of biological neural systems to artificial information processing has existed for decades. An early work from the 1940s by Warren McCulloch and Walter Pitts introduced a neuron model and showed that it was able to perform computation. Around the same time, Donald Hebb developed the first models for learning and adaptation. In 1952, Alan Hodgkin and Andrew Huxley linked biological signal processing to electrical engineering in their famous paper entitled A quantitative description of membrane current and its application to conduction and excitation in nerve, in which they describe a circuit model of electrical current flow across a nerve membrane. This and related work earned them the Nobel Prize in Physiology or Medicine in 1963.
In the 1980s, Carver Mead and colleagues at the California Institute of Technology (Caltech) developed the idea of engineering systems containing analog and asynchronous digital electronic circuits that mimic neural architectures present in biological nervous systems [4–6]. He introduced the term “neuromorphic” for these artificial systems that adopt the form of, or morph, neural systems. In a groundbreaking paper on neuromorphic electronic systems, published in 1990 in the Proceedings of the IEEE, Mead argues that the advantages (of biological information processing) can be attributed principally to the use of elementary physical phenomena as computational primitives, and to the representation of information by the relative values of analog signals, rather than by the absolute values of digital signals. He further argues that this approach requires adaptive techniques to correct for differences between nominally identical components and that this adaptive capability naturally leads to systems that learn about their environment. Experimental results suggest that adaptive analog systems are 100 times more efficient in their use of silicon area, consume 10 000 times less power than comparable digital systems, and are much more robust to component degradation and failure than conventional systems. The “neuromorphic” concept revolutionized the frontiers of (micro-)electronics, computing, and neurobiology to such an extent that a new engineering discipline emerged, whose goal is to map the brain's computational principles onto a physical substrate and, in doing so, to design and build artificial neural systems such as computing arrays of spiking silicon neurons, but also peripheral sensory transduction devices such as vision systems and auditory processors [7, 8]. The field is referred to as neuromorphic engineering.
Progressing further along these lines, Indiveri and Furber argue that the characteristics (of neuromorphic circuits) offer an attractive alternative to conventional computing strategies, especially if one considers the advantages and potential problems of future advanced Very Large Scale Integration (VLSI) fabrication processes [9, 10]. By using massively parallel arrays of computing elements, exploiting redundancy to achieve fault tolerance, and emulating the neural style of computation, neuromorphic VLSI architectures can exploit the features of advanced scaled VLSI processes and future emerging technologies to their fullest potential, naturally coping with the problems that characterize them, such as device inhomogeneities and imperfections.
2.1.2 Implementing Neuromorphic Systems
Neuromorphic electronic devices are usually implemented as VLSI integrated circuits or systems-on-chips (SoCs) on planar silicon, the mainstream technology used for fabricating the ubiquitous microchips that can be found in practically every modern electronically operated device. The primary silicon computational primitive is the transistor. Interestingly, when operated in the analog domain as required by the neuromorphic concept instead of being reduced to mere switches as in conventional digital processing, transistors share physical and functional characteristics with biological neurons. For example, in the weak-inversion region of operation, the current through an MOS transistor exponentially relates to the voltages applied to its terminals. A similar dependency is observed between the number of active ion channels and the membrane potential of a biological neuron. Exploiting such physical similarities allows, for example, constructing electronic circuits that implement models of voltage-controlled neurons and synapses and realize biological computational primitives such as excitation/inhibition, correlation, thresholding, multiplication, or winner-take-all selection [4, 5]. The light sensitivity of semiconductor structures allows the construction of phototransducers on silicon, enabling the implementation of vision devices that mimic the function of biological retinas. Silicon cochleas emulate the auditory portion of the human inner ear and represent another successful attempt of reproducing biological sensory signal acquisition and transduction using neuromorphic techniques.
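This exponential dependence can be illustrated with a short numerical sketch. The constants below (pre-factor, slope factor, thermal voltage) are generic illustrative values, not parameters of any particular fabrication process; the point is simply that in weak inversion the drain current grows by one decade for roughly every 86 mV of gate voltage.

```python
import math

# Illustrative weak-inversion (subthreshold) MOS model in saturation:
#   I_d = I_0 * exp(V_gs / (n * U_T))
# All constants here are assumptions chosen for illustration only.
I_0 = 1e-12   # leakage-scale pre-factor, in amperes
N = 1.5       # subthreshold slope factor
U_T = 0.025   # thermal voltage at room temperature, in volts

def subthreshold_current(v_gs):
    """Drain current of an MOS transistor operated in weak inversion."""
    return I_0 * math.exp(v_gs / (N * U_T))

# One decade of current per n * U_T * ln(10) (about 86 mV) of gate voltage:
decade_step = N * U_T * math.log(10)
ratio = subthreshold_current(0.3 + decade_step) / subthreshold_current(0.3)
```

It is this steep, exponential current–voltage relation, mirroring the exponential activation of ion channels with membrane potential, that makes subthreshold analog circuits a natural substrate for neuron and synapse models.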
2.2 Fundamentals and Motivation: Bioinspired Artificial Vision
Representing a new paradigm for the processing of sensor signals, one of the greatest success stories of neuromorphic systems to date has been the emulation of sensory signal acquisition and transduction, most notably in vision. In fact, one of the first working neuromorphic electronic devices was modeled after a part of the human neural system that has been the subject of extensive study for decades – the retina. The first so-called silicon retina made the cover of Scientific American in 1991 and showed that it is possible to generate in an artificial microelectronic device, in real time, output signals that correspond directly to signals observed in the corresponding levels of biological retinas. Before going into more detail about biological vision and bioinspired artificial vision, and in order to appreciate how biological approaches and neuromorphic engineering techniques can be beneficial for advancing artificial vision, it is instructive to look at the shortcomings of conventional image sensing techniques.
2.2.1 Limitations in Vision Engineering
State-of-the-art image sensors suffer from severe limitations imposed by their very principle of operation. The sensors acquire the visual information as a series of “snapshots” recorded at discrete points in time, that is, time-quantized at a predetermined rate, the “frame rate.” The biological retina has no notion of a frame, and the world, the source of the visual information we are interested in, works asynchronously and in continuous time. Depending on the timescale of changes in the observed scene, a problem arises that closely resembles undersampling as known from other engineering fields: things happen between frames and information gets lost. This may be tolerable for the recording of video data for a human observer (thanks to the adaptability of the human visual system), but artificial vision systems in demanding applications such as autonomous robot navigation, high-speed motor control, or visual feedback loops may fail as a consequence of this shortcoming.
Nature suggests a different approach: biological vision systems are driven and controlled by events happening within the scene in view, and not – like image sensors – by artificially created timing and control signals that have no relation whatsoever to the source of the visual information. Translating the frameless paradigm of biological vision to artificial imaging systems implies that control over the acquisition of visual information is no longer imposed externally on an array of pixels; instead, the decision making is transferred to each individual pixel, which handles its own information autonomously.
A second problem that is also a direct consequence of the frame-based acquisition of visual information is redundancy. Each recorded frame conveys the information from all pixels, regardless of whether or not this information – or part of it – has changed since the last frame was acquired. Depending on the dynamic contents of the scene, this method obviously leads to a greater or lesser degree of redundancy in the acquired image data. Acquisition and handling of these dispensable data consume valuable resources and translate into high transmission power dissipation, increased channel bandwidth requirements, increased memory size, and higher postprocessing power demands.
Devising an engineering solution that follows the biological pixel-individual, frame-free approach to vision can potentially solve both problems.
2.2.2 The Human Retina from an Engineering Viewpoint
The retina is a neural network lining the back hemisphere of the eyeball and can be considered an extended and exposed part of the brain. Starting from a few light-sensitive neural cells, the photoreceptors, evolving some 600 million years ago, and further developed during a long evolutionary process, the retina is where the acquisition and the first stage of processing of the visual information take place. The retina's output to the rest of the brain is in the form of patterns of spikes produced by retinal ganglion cells, whose axons form the fibers of the optic nerve. These spike patterns encode the acquired and preprocessed visual information to be transferred to the visual cortex. The nearly 1 million ganglion cells in the retina compare signals received from groups of a few to several hundred photoreceptors, with each group interpreting what is happening in a part of the visual field. Between photoreceptors and ganglion cells, a complex network of various neuronal cell types processes the visual information and produces the neural code of the retina (Figure 2.1). As various features, such as light intensity, change in a given segment of the retina, a ganglion cell transmits pulses of electricity along the optic nerve to the brain in proportion to the relative change over time or space – and not to the absolute input level. Regarding encoding, there is a wide range of possibilities by which retinal ganglion cell spiking could carry visual information: by spike rate, precise timing, relation to the spiking of other cells, or any combination of these. Through local gain control, spatial and temporal filtering, and redundancy suppression, the retina compresses about 36 Gbit/s of raw high dynamic range (DR) image data into a 20 Mbit/s spiking output to the brain.
The retina's most sensitive photoreceptors, called “rods,” are activated by a single photon, and the DR of processable light intensity exceeds the range of conventional artificial image sensors by several orders of magnitude.
Figure 2.1 Schematic of the retina network cells and layers. Photoreceptors initially receive light stimuli and transduce them into electrical signals. A feedforward pathway is formed from the photoreceptors via the bipolar cell layer to the ganglion cells, which form the output layer of the retina. Horizontal and amacrine cell layers provide additional processing with lateral inhibition and feedback. Finally, the visual information is encoded into spike patterns at the ganglion cell level. Thus encoded, the visual information is transmitted along the ganglion cell axons, which form the optic nerve, to the visual cortex in the brain. The schematic greatly simplifies the actual circuitry, which in reality includes various subtypes of each of the neuron types with different specific connection patterns. Also, numerous additional electrical couplings within the network are omitted for clarity.
(Adapted from Ref. .)
The retina has three primary layers: the photoreceptor layer, the outer plexiform layer (OPL), and the inner plexiform layer (IPL) [13, 14]. The photoreceptor layer consists of two types of cells, called cones and rods, which transform the incoming light into electrical signals, triggering neurotransmitter release at the photoreceptor output synapses and driving horizontal cells and bipolar cells in the OPL. The two major classes of bipolar cells, the ON bipolar cells and the OFF bipolar cells, separately encode positive and negative spatiotemporal contrast in the incoming light by comparing the photoreceptor signals to spatiotemporal averages computed by the laterally connected layer of horizontal cells. The horizontal cells are interconnected by conductive gap junctions and are connected to bipolar cells and photoreceptors in complex triad synapses. Together with the input current produced at the photoreceptor synapses, this network computes spatially and temporally low-pass filtered copies of the photoreceptor outputs. The horizontal cells also feed back onto the photoreceptors, helping to set their operating points. Effectively, the bipolar cells are driven by differences between the photoreceptor and horizontal cell outputs. In the yet more complex IPL, the ON and OFF bipolar cells synaptically connect to many types of amacrine cells and to different types of ON and OFF ganglion cells. As the horizontal cells do between the photoreceptors and the bipolar cells, the amacrine cells mediate the signal transmission between the bipolar cells and the ganglion cells, the spiking output cells of the retina network.
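The OPL computation described above, a surround average computed by horizontal cells that is subtracted from the photoreceptor signal and rectified into separate ON and OFF channels, can be sketched in a one-dimensional toy model. The function name and the surround radius below are illustrative choices, not taken from the text.

```python
def bipolar_responses(photo, surround_radius=1):
    """Toy 1D model of the OPL: for each position, a 'horizontal cell'
    computes a local spatial average (the surround), and a 'bipolar cell'
    signals the photoreceptor-minus-surround difference, rectified into
    separate ON (positive contrast) and OFF (negative contrast) channels.
    """
    n = len(photo)
    on, off = [], []
    for i in range(n):
        lo = max(0, i - surround_radius)
        hi = min(n, i + surround_radius + 1)
        surround = sum(photo[lo:hi]) / (hi - lo)   # horizontal cell output
        diff = photo[i] - surround                 # bipolar cell drive
        on.append(max(diff, 0.0))    # ON bipolar response
        off.append(max(-diff, 0.0))  # OFF bipolar response
    return on, off
```

On a uniform input, both channels stay silent (the surround cancels the center), while a luminance step excites the ON channel on its bright side and the OFF channel on its dark side, a crude picture of the redundancy suppression discussed earlier.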
Bipolar and ganglion cells can be further subdivided into two different groups: cells with primarily sustained and cells with primarily transient types of responses. These cells carry information along at least two parallel pathways in the retina: the magnocellular or transient pathway, whose cells are more sensitive to temporal features (motion, changes, onsets) in the scene, and the parvocellular or sustained pathway, whose cells are more sensitive to sustained features such as patterns and shapes. The relevance of modeling these two pathways in the construction of bioinspired vision devices will be discussed further in Section 2.3.2.
In the following, a simplified set of characteristics and functions of biological vision that are feasible for silicon integrated-circuit focal-plane implementation is summarized. As discussed previously, the retina converts the spatiotemporal information contained in the incident light from the visual scene into spike trains and patterns, conveyed to the visual cortex by retinal ganglion cells, whose axons form the fibers of the optic nerve. The information carried by these spikes is maximized by retinal processing, which encompasses highly evolved adaptive filtering and sampling mechanisms to improve coding efficiency, such as:
· Local automatic gain control at the photoreceptor and network level to eliminate the retina's dependency on absolute light levels
· Spatiotemporal bandpass filtering to limit spatial and temporal frequencies, reducing redundancy by suppressing low frequencies and noise by suppressing high frequencies
· Rectification in separate ON and OFF output cell types, perhaps to simplify encoding and locally reduce spike-firing rates
The varying distribution of different receptor types and corresponding pathways across the retina, combined with saccades (precise rapid eye movements), creates the illusion of high spatial and temporal resolution across the whole field of view. In reality, the retina acquires information in the retinal center, around the fovea, at relatively low temporal but high spatial resolution, whereas in the periphery, receptors are spaced at a wider pitch but respond with much higher temporal resolution. In comparison to the human retina, a conventional image sensor sampling at the Nyquist rate would need to transmit more than 20 Gbit/s to match the human eye's photopic range (exceeding 5 orders of magnitude), its spatial and temporal resolution, and its field of view. In contrast, by coding 2 bits of information per spike, the optic nerve transmits just about 20 Mbit/s to the visual cortex – a thousand times less.
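The bandwidth figures above can be checked with back-of-the-envelope arithmetic. The cell count and bits-per-spike value come from the text; the average firing rate of 10 Hz is an assumption chosen so that the product reproduces the quoted 20 Mbit/s.

```python
# Back-of-the-envelope optic nerve bandwidth estimate.
ganglion_cells = 1_000_000   # roughly 1 million optic nerve fibers (from the text)
bits_per_spike = 2           # information per spike, as quoted in the text
mean_rate_hz = 10            # assumed average firing rate (illustrative)

optic_nerve_bits_per_s = ganglion_cells * bits_per_spike * mean_rate_hz

# Ratio against the >20 Gbit/s a Nyquist-rate sensor of comparable
# capability would need to transmit:
compression_factor = 20e9 / optic_nerve_bits_per_s
```

Under these assumptions the optic nerve carries about 20 Mbit/s, roughly three orders of magnitude below the Nyquist-rate figure, which is the "thousand times less" stated above.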
To summarize, biological retinas have many desirable characteristics that are lacking in conventional image sensors – characteristics that inspire and drive the design of bioinspired, retinomorphic vision devices. As will be discussed further in the remainder of this chapter, many of these advantageous characteristics have already been modeled and implemented in silicon. As a result, bioinspired vision devices and systems already outperform conventional, frame-based devices in many respects, notably wide DR operation, device-level video compression, and high-speed/low-data-rate vision.
2.2.3 Modeling the Retina in Silicon
The construction of an artificial “silicon retina” has been a primary target of the neuromorphic engineering community from the very beginning. The first silicon retina of Mahowald and Mead models the OPL of the vertebrate retina and contains artificial cones, horizontal cells, and bipolar cells. A resistive network computes a spatiotemporal average that is used as a reference point for the system. By feedback to the photoreceptors, the network signal balances the photocurrent over several orders of magnitude. The silicon retina's response to spatially and temporally changing images captures much of the complex behavior observed in the OPL. Like its biological counterpart, the silicon retina reduces the bandwidth needed to communicate reliable information by subtracting average intensity levels from the image and reporting only spatial and temporal changes [11, 17].
A next-generation silicon retina chip by Zaghloul and Boahen modeled all five layers of the vertebrate retina, directly emulating the visual messages that the ganglion cells, the retina's output neurons, send to the brain. The design incorporates both sustained and transient types of cells with adaptive spatial and temporal filtering and captures several key adaptive features of biological retinas. Light sensed by electronic photoreceptors on the chip controls voltages in the chip's circuits in a way that is analogous to how the retina's voltage-activated ion channels cause ganglion cells to generate spikes, in this way replicating responses of the retina's four major types of ganglion cells [7, 18].
2.3 From Biological Models to Practical Vision Devices
Over the past two decades, a variety of neuromorphic vision devices have been developed, including temporal contrast vision sensors that are sensitive to relative light intensity change, gradient-based sensors sensitive to static edges, edge-orientation-sensitive devices, and optical-flow sensors. Many of the early inventors and developers of bioinspired vision devices came from the neurobiology community and saw their chips mainly as a means of proving neurobiological models and theories, without relating the devices to real-world applications. Very few of these sensors have so far been used in practical applications, let alone in industrial products. Many conceptually interesting pixel designs lack technical relevance because of, for example, circuit complexity, large silicon area, low fill factor, or high noise levels, preventing realistic application. Furthermore, many of the early designs suffered from technical shortcomings of VLSI implementation and fabrication, such as transistor mismatch, and did not yield practically usable devices. Recently, an increasing amount of effort has been put into the development of practicable and commercializable vision sensors based on biological principles. Today, bioinspired vision sensors are the most highly productized neuromorphic devices. Most of these sensors share the event-driven, frameless approach, capturing transients in visual stimuli. Their output is compressed at the sensor level, without the need for external processors, optimizing data transfer, storage, and processing and hence increasing the power efficiency and compactness of the vision system. The remainder of this section reviews some of the recent developments in bioinspired artificial vision (Figure 2.2).
Figure 2.2 Modeling the retina in silicon – from biology to a bioinspired camera: the ATIS “silicon retina” bioinspired vision sensor, showing the pixel cell CMOS layout (bottom left), a microscope photograph of part of the pixel array and the whole sensor (bottom middle), and a miniature bioinspired ATIS camera.
2.3.1 The Wiring Problem
As touched upon earlier, neuromorphic engineers observe striking parallels between the VLSI hardware used to implement bioinspired electronics and the “hardware” of nature, the neural wetware. Nevertheless, some approaches taken by nature cannot be adopted in a straightforward way. One prominent challenge is often referred to as the “wiring problem.” Mainstream VLSI technology does not allow for the dense three-dimensional wiring observed everywhere in biological neural systems. In vision, the optic nerve connecting the retina to the visual cortex in the brain is formed by the axons of the roughly 1 million retinal ganglion cells, the spiking output cells of the retina. Translating this situation to an artificial vision system would imply that each pixel of an image sensor has its own wire to convey its data out of the array. Given the restrictions posed by chip interconnect and packaging technologies, this is obviously not a feasible approach. However, VLSI technology does offer a workaround. Leveraging the five or more orders of magnitude of difference in bandwidth between a neuron (typically spiking at rates between 10 and 1000 Hz) and a digital bus enables engineers to replace thousands of dedicated point-to-point connections with a few metal wires and to time-multiplex the traffic over these wires using a packet-based or “event-based” data protocol called address event representation (AER) [15, 16]. In the AER protocol, each neuron (e.g., pixel in a vision sensor) is assigned an address, such as its x, y-coordinate within an array. Neurons that generate spikes put their address in the form of digital bits on an arbitrated asynchronous bus. The bus arbiter implements a time-multiplexing strategy that allows all neurons to share the same physical bus to transmit their spikes.
In this asynchronous protocol, temporal information is self-encoded in the timing of the spike events, whereas the location of the source of information is encoded in the form of digital bits as the “payload” of the event.
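The address-encoding and time-multiplexing idea can be sketched in software as follows. The array width, the event tuples, and the trivial sort-based arbiter are illustrative assumptions; real AER implementations use asynchronous handshaking and arbitration circuits, not a sort.

```python
def encode_address(x, y, width=128):
    """Flatten a pixel's (x, y) coordinate into the digital address word
    that forms the payload of its events."""
    return y * width + x

def decode_address(addr, width=128):
    """Recover the (x, y) coordinate from an address word."""
    return addr % width, addr // width

def arbitrate(spikes):
    """Toy bus arbiter: serialize spikes from many pixels onto one shared
    channel in time order. Each spike is a (time, address) pair; timing is
    self-encoding, so only the address travels as explicit payload."""
    return [addr for t, addr in sorted(spikes)]
```

For example, a spike from pixel (5, 7) travels as the single word 901 (for a 128-pixel-wide array), and spikes from thousands of pixels share one physical bus simply by being placed on it in the order they occur.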
2.3.2 Where and What
The abstraction of two major types of retinal ganglion cells and corresponding retina pathways appears increasingly relevant to the creation of useful bioinspired artificial vision devices. The magnocells are at the basis of the transient or magnocellular pathway. Magnocells are approximately evenly distributed over the retina; they have short latencies and use rapidly conducting axons. Magnocells have large receptive fields and respond in a transient way, that is, when changes – movements, onsets, and offsets – appear in their receptive field. The parvocells are at the basis of what is called the sustained or parvocellular pathway. Parvocells have longer latencies, and their axons conduct more slowly. They have smaller receptive fields and respond in a sustained way. Parvocells are more involved in the transport of detailed pattern, texture, and color information.
It appears that these two parallel pathways in the visual system are specialized for certain tasks in visual perception: the magnocellular system is more oriented toward general detection or alerting and is also referred to as the “where” system. It responds with high temporal resolution to changes and motion. Its biological role is seen in detecting, for example, dangers arising in the peripheral vision. Magnocells are relatively evenly spaced across the retina at a rather low spatial resolution and are the predominant cell type in the retinal periphery. Once an object is detected (often in combination with a saccadic eye movement), the detailed visual information – spatial detail, color, and so on – seems to be carried primarily by the parvo system, hence called the “what” system. The “what” system is relatively slow, exhibiting low temporal but high spatial resolution. Parvocells are concentrated in the fovea, the retinal center.
Practically all conventional frame-based image sensors completely neglect the dynamic information provided by a natural scene and perceived in nature by the magnocellular pathway, the “where” system. Attempts to implement the function of the magnocellular transient pathway in an artificial neuromorphic vision system have recently led to the development of the “dynamic vision sensor” (DVS). This type of visual sensor is sensitive to the dynamic information present in a natural scene and responds directly to changes, that is, temporal contrast, pixel-individually and in near real time. The gain in temporal resolution with respect to standard frame-based image sensors is dramatic, and other performance parameters, such as the DR, also profit greatly from the biological approach. This type of sensor is very well suited for a plethora of machine vision applications involving high-speed motion detection and analysis, object tracking, shape recognition, 3D scene reconstruction, and so on [22–25]; however, it neglects the sustained information perceived in nature by the parvocellular “what” system.
Further exploitation of the concepts of biological vision suggests combining the “where” and “what” system functionalities in a bioinspired, asynchronous, event-driven style. This notion drove the design of the asynchronous time-based image sensor (ATIS) [20, 26], an image and vision sensor that combines several functionalities of the biological “where” and “what” systems. Both the DVS and the ATIS will be discussed further in this chapter.
2.3.3 Temporal Contrast: The DVS
In an attempt to realize a practicable vision device based on the functioning of the magnocellular transient pathway, the “DVS” pixel circuit was developed. The DVS pixel models a simplified three-layer retina (Figure 2.3), implementing an abstraction of the photoreceptor–bipolar–ganglion cell information flow. Single pixels are spatially decoupled but take into account the temporal development of the local light intensity.
Figure 2.3 (a) Simplified three-layer retina model and (b) corresponding silicon retina pixel circuitry; in (c), typical signal waveforms of the pixel circuit are shown. The upper trace represents an arbitrary voltage waveform at the node Vp tracking the photocurrent through the photoreceptor. The bipolar cell circuit responds with spike events of different polarity to positive and negative gradients of the photocurrent while being monitored by the ganglion cell circuit that also transports the spikes to the next processing stage; the rate of change is encoded in interevent intervals; panel (d) shows the response of an array of pixels to a natural scene (person moving in the field of view of the sensor). Events have been collected for some tens of milliseconds and are displayed as an image with ON (going brighter) and OFF (going darker) events drawn as white and black dots.
The pixel autonomously responds to relative changes in intensity at microsecond temporal resolution over six decades of illumination. These characteristics are a direct consequence of abandoning the frame principle and modeling three key properties of biological vision: the sparse, event-based output, the representation of relative luminance change (thus directly encoding scene reflectance change), and the rectification of positive and negative signals into separate output channels (ON/OFF).
The major consequence of the bioinspired approach, and its most distinctive feature with respect to standard imaging, is that control over the acquisition of the visual information is no longer imposed on the sensor in the form of external timing signals such as a shutter or frame clock; instead, the decision making is transferred to the single pixel, which handles its own visual information individually and autonomously. The sensor is thus “event driven” instead of clock driven: like its biological model, it responds to visual events happening in the scene it observes. The sensor output is an asynchronous stream of pixel address events [15, 16] that directly encode scene reflectance changes. The output data volume of such a self-timed, event-driven sensor depends essentially on the dynamic contents of the target scene, as pixels that are not visually stimulated produce no output. Owing to the pixel-autonomous, asynchronous operation, the temporal resolution is not limited by an externally imposed frame rate. However, the asynchronous stream of events carries only change information and contains no absolute intensity information; there are no conventional image data in the sense of gray levels. This style of visual data acquisition and processing yields a pure dynamic vision device that closely follows its paradigm, the transient pathway of the human retina.
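A simplified software model of a single DVS pixel captures the behavior described above: logarithmic (relative) change detection, ON/OFF rectification, and complete silence for a static input. The contrast threshold value used here is an illustrative assumption, not a figure from any particular device.

```python
import math

def dvs_pixel(samples, theta=0.15):
    """Toy model of one DVS pixel. Emits an (index, polarity) event each
    time the log-intensity has moved by more than the contrast threshold
    theta since the last event; the reference level then steps by theta.
    A static input produces no events at all."""
    events = []
    log_ref = math.log(samples[0])
    for t, intensity in enumerate(samples[1:], start=1):
        diff = math.log(intensity) - log_ref
        # A large change may cross the threshold several times at once.
        while abs(diff) >= theta:
            polarity = "ON" if diff > 0 else "OFF"
            events.append((t, polarity))
            log_ref += theta if diff > 0 else -theta
            diff = math.log(intensity) - log_ref
    return events

static = dvs_pixel([1.0, 1.0, 1.0])   # unchanging input: no output
brighter = dvs_pixel([1.0, 2.0])      # brightening input: ON events only
```

Because the model works on the logarithm of intensity, a doubling of illuminance produces the same number of events at any absolute light level, which is the mechanism behind the wide DR quoted above.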
Relative change events and gray-level image frames are two largely orthogonal representations of a visual scene. The former contains the information on local relative changes and hence encodes all dynamic contents, yet carries no information about absolute light levels or static parts of the scene. The latter is a snapshot that carries an absolute intensity map at a given point in time but holds no information about motion; hence, if scene dynamics are to be captured, many such frames must be acquired. In principle, DVS change events cannot be recreated from image frames, nor can gray-level images be recreated from DVS events.
2.3.4 Event-Driven Time-Domain Imaging: The ATIS
Besides limited temporal resolution, data redundancy is another major drawback of conventional frame-based image sensors. Each acquired frame carries the information from all pixels, regardless of whether or not this information has changed since the last frame was acquired. Depending on the dynamic contents of the scene, this approach obviously results in a greater or lesser degree of redundancy in the recorded image data, as pixel values from unchanged parts of the scene get recorded and transmitted over and over even though they do not contain any (new) information. Acquisition, transmission, and processing of this unnecessarily inflated data volume waste power, bandwidth, and memory resources and eventually limit the performance of a vision system. The adverse effects of this data redundancy, common to all frame-based image acquisition techniques, would be tackled most effectively at the pixel level, by simply not recording the redundant data in the first place.
Again, biology is leading the way to a more efficient style of image acquisition. ATIS is an image sensor that combines several functionalities of the biological “where” and “what” systems with multiple bioinspired approaches such as event-based time-domain imaging, temporal contrast dynamic vision, and asynchronous, event-based information encoding and data communication [20, 26]. The sensor is based on an array of fully autonomous pixels, each combining a simplified magnocellular change detector circuit model with an exposure measurement device inspired by the sustained parvocellular pathway, whereby the parvocellular circuit is driven by the response of its corresponding magnocellular circuit to relative changes in local illuminance.
The magno change detector individually and asynchronously initiates the measurement of a new exposure/gray scale value only if – and immediately after – a brightness change of a certain magnitude has been detected in the field of view of the respective pixel. This biology-inspired way of organizing the acquisition of visual information leads to an image sensor that does not use a shutter or frame clock to control the sampling of the image data; instead, each pixel autonomously defines the timing of its own sampling points in response to its visual input. If things change quickly, the pixel samples at a high rate; if nothing changes, the pixel stops acquiring redundant data and goes idle until things start to happen again in its field of view. In contrast to a conventional image sensor, the entire image acquisition process is hence governed not by a fixed external timing signal but by the visual signal to be sampled itself, leading to near-ideal sampling of dynamic visual scenes.
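This pixel-autonomous sampling rule can be sketched as a toy simulation. The logarithmic change detector and the 15% contrast threshold below are simplifying assumptions, not the actual ATIS circuit behavior:

```python
import math

CONTRAST_THRESHOLD = 0.15   # assumed relative-change threshold (~15 %)

def sample_points(illuminance):
    """Return the indices at which an autonomous pixel would trigger a new
    exposure measurement: whenever the log-illuminance has changed by more
    than the threshold since the last sample (simplified model)."""
    samples = [0]                          # assume one initial measurement
    ref = math.log(illuminance[0])
    for i, lum in enumerate(illuminance[1:], start=1):
        if abs(math.log(lum) - ref) > CONTRAST_THRESHOLD:
            samples.append(i)              # change detected -> new gray value
            ref = math.log(lum)            # reset the reference level
    return samples

# A scene that brightens quickly, stays constant, then darkens:
signal = [100, 120, 150, 150, 150, 150, 150, 80]
print(sample_points(signal))   # dense samples during change, idle when static
```

The pixel samples at indices 0, 1, and 2 while the brightness ramps up, stays silent over the static stretch, and fires once more when the scene darkens, mirroring the “sample only when things happen” behavior described above.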
The parvo exposure measurement circuit in each pixel encodes the instantaneous absolute pixel illuminance into the timing of asynchronous spike pulses, more precisely into interspike intervals comparable to a simple rate coding scheme (Figure 2.4). As a result, the ATIS pixel does not rely on external timing signals and autonomously requests access to an asynchronous and arbitrated AER output channel only when it has a new gray scale value to communicate. At the AER readout periphery, the pixel events are arbitrated, furnished with the pixel's array address by an address encoder and sent out on an asynchronous bit-parallel AER bus [15, 16].
Figure 2.4 (a) Functional diagram of an ATIS pixel. (b) Arbitrary light stimulus and pixel response: two types of asynchronous “spike” events, encoding temporal change and sustained gray scale information, are generated and transmitted individually by each pixel in the imaging array. (c) Change events coded black (OFF) and white (ON) (top) and gray-level measurements at the respective pixel positions triggered by the change events (bottom).
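The time-domain gray-level encoding described above can be illustrated with a minimal sketch, assuming the exposure measurement integrates a fixed charge so that the interspike interval is inversely proportional to illuminance. The threshold constant and the linear photocurrent model are illustrative assumptions:

```python
Q_THRESHOLD = 1000.0   # assumed integration charge (arbitrary units)

def interspike_interval(illuminance):
    """Exposure measurement: the time to integrate a fixed charge is
    inversely proportional to the photocurrent, hence to illuminance
    (simplified linear model)."""
    return Q_THRESHOLD / illuminance

def decode_gray(interval):
    """Readout side: recover the gray value from the measured interval."""
    return Q_THRESHOLD / interval

for lux in (10.0, 100.0, 1000.0):
    dt = interspike_interval(lux)
    print(f"{lux:7.1f} -> interval {dt:8.2f} -> decoded {decode_gray(dt):7.1f}")
```

Bright pixels communicate their value quickly while dark pixels take longer; since the interval, not a fixed-range voltage, carries the signal, the encoding naturally spans several decades of illuminance, which is the origin of the wide dynamic range quoted below.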
The temporal redundancy suppression of the pixel-individual, data-driven sampling operation ideally yields lossless focal-plane video compression, with compression factors depending only on scene dynamics. Theoretically, the factor approaches infinity for static scenes; in practice, it is limited by change detector background activity and reaches about 1000 for bright static scenes. Typical dynamic scenes yield compression ratios between 50 and several hundred. Figure 2.5 shows a typical traffic surveillance scene generating a continuous-time video stream of 25–200 k events/s at 18 bit/event. The temporal resolution of the pixelwise recorded gray-level updates in this scene is about 1 ms, equivalent to video data at 1000 frames per second. In addition, the time-domain encoding of the gray-level information results in an exceptionally high DR beyond 120 dB and an improved signal-to-noise ratio (SNR) of typically greater than 50 dB.
Figure 2.5 Instance of a traffic scene observed by an ATIS. (a) ON/OFF changes (shown white/black) leading to instantaneous pixel-individual sampling. (b) Associated gray scale data recorded by the currently active pixels displayed on black background – all black pixels do not sample/send information at that moment. (c) Same data with full background acquired earlier, showing a low-data-rate, high-temporal-resolution video stream. The average video compression factor is around 100 for this example scene; that is, only 1% of the data (with respect to a standard 30 frames per s image sensor) are acquired and transmitted, yet allowing for a near lossless recording of a video stream. The temporal resolution of the video data is about 1000 frames per s equivalent.
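The data rates quoted above can be checked with simple arithmetic. A minimal sketch, assuming a QVGA (320 × 240) 8-bit frame sensor running at the equivalent 1000 frames per second as the comparison baseline (resolution and bit depth are illustrative assumptions):

```python
# Event-driven stream from the traffic scene (rates quoted in the text):
event_rate = 200_000        # events/s, upper end of the quoted 25-200 k range
bits_per_event = 18
event_bps = event_rate * bits_per_event     # 3.6 Mbit/s

# Hypothetical frame sensor delivering the same ~1 ms temporal resolution:
# QVGA (320 x 240), 8-bit gray, 1000 frames/s -- an assumed baseline
frame_bps = 320 * 240 * 8 * 1000

print(f"event stream : {event_bps / 1e6:.1f} Mbit/s")
print(f"frame stream : {frame_bps / 1e6:.1f} Mbit/s")
print(f"data reduction factor: {frame_bps / event_bps:.0f}x")
```

At the lower end of the quoted range (25 k events/s), the event stream shrinks to 0.45 Mbit/s and the reduction factor grows accordingly, consistent with the scene-dependent compression described in the text.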
2.4 Conclusions and Outlook
Bioinspired vision technology is starting to outperform conventional, frame-based vision systems in many application fields and to establish new benchmarks in terms of redundancy suppression and data compression, DR, temporal resolution, and power efficiency. Demanding vision tasks such as real-time 3D mapping, complex multiobject tracking, or fast visual feedback loops for sensory–motor action, tasks that often pose severe, sometimes insurmountable, challenges to conventional artificial vision systems, are within reach using bioinspired vision sensing and processing techniques.
Fast sensorimotor action through visual feedback loops, based on the frame-free, event-driven style of biological vision, supports, for example, autonomous robot navigation as well as micromanipulation and image-guided intervention in applications like scientific microscopy or robot-assisted surgery. Related developments are beginning to influence other fields such as human–machine systems involving, for example, air gesture recognition.
At the other end of the spectrum are biomedical applications like retina prosthetics and vision-assisted artificial limbs. For retinal implants, event-based and pulse-modulation vision chips that partially model human retina operation are naturally suitable to serve as the signal-generating front end. The sensors produce pulse streams that can be used directly to evoke cell potentials. Furthermore, they can operate on very low voltages without degrading the signal-to-noise ratio, an essential feature for implantable devices, which must keep power dissipation low to limit heat generation and extend battery lifetime. Finally, the intrinsic ultrahigh DR of this type of vision chip is very advantageous for the task of replacing biological photoreceptors. Recently, it has been shown that the parallel filtering computation occurring in the mammalian retina can be reproduced based on data delivered by a neuromorphic sensor. With a simple linear–nonlinear model, it was possible to reconstruct the responses of the majority of ganglion cell types in the mammalian retina, thus demonstrating the suitability of bioinspired vision sensors to serve as transducers for retinal prosthetics.
The direct modeling of retinal (dys)functions and operational principles in integrated electronic circuits allows reproducing and studying retinal defects and diseases, potentially helping to devise novel ways of medical diagnosis and treatment. Research toward physical models of retinas and retinal defects in VLSI silicon, thus realizing artificial “patient's eyes,” could facilitate large-scale experimental studies of particular retinal defects by partly replacing laborious and costly in vivo and in vitro studies, thereby supporting the design and construction of medical diagnosis and treatment devices and systems.
Finally, time-based imagers like ATIS deliver high DR, high-quality imaging, and video for scientific applications like astronomical imaging, fluorescence imaging, cell monitoring, or X-ray crystallography.
1. McCulloch, W. and Pitts, W. (1943) A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biol., 5, 115–133.
2. Hebb, D. (1949) The Organization of Behavior, Wiley-VCH Verlag GmbH, New York.
3. Hodgkin, A. and Huxley, A. (1952) A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol., 117, 500–544.
4. Mead, C. (1989) Analog VLSI and Neural Systems, Addison-Wesley.
5. Mead, C. (1990) Neuromorphic electronic systems. Proc. IEEE, 78(10), 1629–1636.
6. Maher, M.A.C., Deweerth, S.P., Mahowald, M.A., and Mead, C.A. (1989) Implementing neural architectures using analog VLSI circuits. Trans. Circuits Syst., 36(5), 643–652.
7. Boahen, K.A. (2005) Neuromorphic microchips. Sci. Am., 292, 56–63.
8. Sarpeshkar, R. (2006) Brain power – borrowing from biology makes for low power computing – bionic ear. IEEE Spectr., 43(5), 24–29.
9. Indiveri, G. (2007) Synaptic plasticity and spike-based computation in VLSI networks of integrate-and-fire neurons. Neural Inf. Process. - Lett. Rev., 11(4-6), 135–146.
10. Furber, S. and Temple, S. (2007) Neural systems engineering. J. R. Soc. Interface, 2007(4), 193–206.
11. Mead, C.A. and Mahowald, M.A. (1991) The silicon retina. Sci. Am., 264, 76–82.
12. Gollisch, T. (2009) Throwing a glance at the neural code: rapid information transmission in the visual system. HFSP J., 3(1), 36–46.
13. Masland, R. (2001) The fundamental plan of the retina. Nat. Neurosci., 4, 877–886.
14. Rodieck, R.W. (1998) The primate retina. Comp. Primate Biol., 4, 203–278.
15. Boahen, K. (2000) Point-to-point connectivity between neuromorphic chips using address events. IEEE Trans. Circuits Syst. II, 47(5), 416–434.
16. Boahen, K. (2004) A burst-mode word-serial address-event link-I: transmitter design. IEEE Trans. Circuits Syst. I, 51(7), 1269–1280.
17. Mahowald, M.A. (1992) VLSI analogs of neuronal visual processing: a synthesis of form and function. PhD Computation and Neural Systems, Caltech, Pasadena, CA.
18. Zaghloul, K.A. and Boahen, K. (2006) A silicon retina that reproduces signals in the optic nerve. J. Neural Eng., 3, 257–267.
19. Moini, A. (2000) Vision Chips, Kluwer Academic Publishers.
20. Posch, C., Matolin, D., and Wohlgenannt, R. (2011) A QVGA 143dB dynamic range frame-free PWM image sensor with lossless pixel-level video compression and time-domain CDS. J. Solid-State Circuits, 46(1), 259–275.
21. Van Der Heijden, A.H.C. (1992) Selective Attention in Vision, Routledge, New York. ISBN: 0415061059.
22. Belbachir, A.N. et al. (2007) Real-time vision using a smart sensor system. 2007 IEEE International Symposium on Industrial Electronics, June 4–7, 2007.
23. Perez-Carrasco, J.A. et al. (2010) Fast vision through frameless event-based sensing and convolutional processing: application to texture recognition. IEEE Trans. Neural Networks, 21(4), 609–620.
24. Serrano-Gotarredona, R. et al. (2009) CAVIAR: a 45k Neuron, 5M Synapse, 12G Connects/s AER hardware sensory–processing–learning–actuating system for high-speed visual object recognition and tracking. IEEE Trans. Neural Networks, 20(9), 1417–1438.
25. Carneiro, J., Ieng, S.H., Posch, C., and Benosman, R. (2013) Asynchronous event-based 3D reconstruction from neuromorphic retinas. Neural Networks, 45, 27–38.
26. Posch, C., Matolin, D., and Wohlgenannt, R. (2008) An asynchronous time-based image sensor. ISCAS 2008. IEEE International Symposium on Circuits and Systems, May 18–21 2008, pp. 2130–2133.
27. Lichtsteiner, P., Posch, C., and Delbruck, T. (2006) A 128 × 128 120dB 30mW asynchronous vision sensor that responds to relative intensity change. ISSCC, 2006, Digest of Technical Papers February 6–9, 2006, pp. 2060–2069.
28. Chen, D., Matolin, D., Bermak, A., and Posch, C. (2011) Pulse modulation imaging – review and performance analysis. IEEE Trans. Biomed. Circuits Syst., 5(1), 64–82.
29. Lorach, H. et al. (2012) Artificial retina: the multichannel processing of the mammalian retina achieved with a neuromorphic asynchronous light acquisition device. J. Neural Eng., 9, 066004.
1 In reality, this picture of a clean partition into sustained and transient pathways is of course an oversimplification; there are many parallel pathways computing many views (probably at least 50 in the mammalian retina) of the visual input.