Talk:Deep Learning on Event-Based Cameras - Revision history

MarcoCannici at 16:48, 19 May 2017

2017-05-19T16:48:36Z

MarcoCannici at 16:32, 19 May 2017

2017-05-19T16:32:24Z

MarcoCannici: Created page with "{{Project template |title=Deep Learning on Event-Based Cameras |image= |description=This project aims to study deep learning techniques on event-based cameras and develop algo..."

2017-05-19T11:11:38Z


New page

{{Project template
|title=Deep Learning on Event-Based Cameras
|image=
|description=This project aims to study deep learning techniques on event-based cameras and develop algorithms to perform object recognition on those devices.
|tutor=MatteoMatteucci
|start=April 2017
|cfu=20
}}

= Introduction =

[[File:EventCameraEventFlow.png|thumb|The image shows the stream of events generated by the sensor when looking at a rotating white dot.]]
Standard cameras suffer from performance limitations imposed by their operating principle. They acquire visual information by taking snapshots of the entire scene at a fixed rate. This introduces a lot of redundancy in the data, since most of the time only a small part of the scene has changed from the previous frame, and it limits the speed at which data can be sampled, so relevant information may be missed. Biologically inspired event-based cameras, instead, are driven by events happening inside the scene, without any notion of a frame. Each pixel of the sensor emits, independently of the other pixels, an event (spike) every time it detects that something has changed inside its field of view (a change of brightness or contrast). Each event is a tuple (x,y,p,t) containing the coordinates (x,y) of the pixel that generated the event, the polarity p of the event (whether it corresponds to an increase or a decrease in intensity) and the timestamp t at which it was created. The output of the sensor is therefore a continuous flow of events describing the scene, with only a small delay with respect to the instant at which the real events happened. Systems that process the stream of events directly can take advantage of this low latency and produce decisions as soon as enough relevant information has been collected. The low latency of event-based cameras, their small size and the fact that they do not require cooling make this type of sensor suitable for many applications, including robotics.
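
As an illustration of the event tuple described above, the following minimal Python sketch stores events as (x,y,p,t) records and sums their polarities per pixel over a time window. The <code>Event</code> and <code>accumulate</code> names, the +1/-1 polarity encoding, the timestamp unit (seconds) and the sample stream are all assumptions made for this example; they do not come from any camera API.

```python
from typing import List, NamedTuple


class Event(NamedTuple):
    """One sensor event; the fields mirror the (x, y, p, t) tuple in the text."""
    x: int    # pixel column
    y: int    # pixel row
    p: int    # polarity: +1 = intensity increase, -1 = decrease (assumed encoding)
    t: float  # timestamp in seconds (assumed unit)


def accumulate(events: List[Event], width: int, height: int,
               t_start: float, t_end: float) -> List[List[int]]:
    """Sum the polarities of all events with t_start <= t < t_end, per pixel."""
    frame = [[0] * width for _ in range(height)]
    for ev in events:
        if t_start <= ev.t < t_end:
            frame[ev.y][ev.x] += ev.p
    return frame


# Made-up stream: a bright dot moving right along row 1 of a 4x3 sensor.
stream = [
    Event(0, 1, +1, 0.001),
    Event(0, 1, -1, 0.002),
    Event(1, 1, +1, 0.002),
    Event(1, 1, -1, 0.003),
    Event(2, 1, +1, 0.003),
]

# Only the first three events fall inside the 2.5 ms window.
frame = accumulate(stream, width=4, height=3, t_start=0.0, t_end=0.0025)
```

With the window ending at 2.5 ms, the increase and decrease events at pixel (0,1) cancel out and only pixel (1,1) is left non-zero. A system that works on the stream directly would not need this windowing step at all; it is shown here only to make the data format concrete.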

= State of the art =

== Spiking Neural Networks ==

=== Leaky Integrate-and-Fire Neuron ===

=== HMAX Architecture ===

== Recurrent Neural Networks ==

=== LSTM Networks ===

= Tools and Datasets =

@@ Line 1: / Line 1: @@
 {{Project template
 |title=Deep Learning on Event-Based Cameras
-|image=
+|image=EventCamera.jpg
 |description=This project aims to study deep learning techniques on event-based cameras and develop algorithms to perform object recognition on those devices.
 |tutor=MatteoMatteucci
@@ Line 8: / Line 8: @@
 }}
 = Introduction =
 [[File:EventCameraEventFlow.png|thumb|The image shows the stream of events generated by the sensor when looking at a rotating white dot.]]
-Standard camera devices suffer from limitation in performances imposed by their principles of functioning. They acquire visual information by taking snapshots of the entire scene at a fixed rate. This introduces a lot of redundancy in data, in fact most of the time only a small part of the scene changes from the previous frame, and limits the speed at which data can be sampled, potentially missing relevant information. Biologically inspired event-based cameras, instead, are driven by events happening inside the scene without any notion of a frame. Each pixel of the sensor emits, independently from the other pixels, an event (spike) every time it detects that something has changed inside its field of view (change of brightness or contrast). Each event is a tuple ''(x,y,p,t)'' describing the coordinates ''(x,y)''  of the pixel from which the event has been generated, the polarity of the event ''p'' (if the event refers to an increasing or decreasing in intensity) and the timestamp ''t'' of creation.  The output of the sensor is therefore a continuous flow of events describing the scene, with a small delay in time with respect to the instant in which the real events happened. Systems that are able to process directly the stream of events can take advantage of the their low-latency and produce decisions as soon as enough relevant information has been collected. The low latency of event-based cameras, their small dimensions, their power consumption, make this type of sensor suitable for a lot of applications including Robotics.
+Standard cameras suffer from performance limitations imposed by their operating principle. They acquire visual information by taking snapshots of the entire scene at a fixed rate. This introduces a lot of redundancy in the data, since most of the time only a small part of the scene changes from the previous frame, and it limits the speed at which data can be sampled, so relevant information may be missed. Biologically inspired event-based cameras, instead, are driven by events happening inside the scene, without any notion of a frame. Each pixel of the sensor emits, independently of the other pixels, an event (spike) every time it detects that something has changed inside its field of view (a change of brightness or contrast). Each event is a tuple ''(x,y,p,t)'' containing the coordinates ''(x,y)'' of the pixel that generated the event, the polarity ''p'' of the event (whether it corresponds to an increase or a decrease in intensity) and the timestamp ''t'' at which it was created. The output of the sensor is therefore a continuous flow of events describing the scene, with only a small delay with respect to the instant at which the real events happened. Systems that process the stream of events directly can take advantage of this low latency and produce decisions as soon as enough relevant information has been collected. The low latency of event-based cameras, their small size and their low power consumption make this type of sensor suitable for many applications, including robotics.
 = State of the art =
 In recent years, there has been a growing interest in event-based vision and dynamic vision sensors (DVS) due to their advantages and the particular type of data representation they provide. In particular, because of the spiking nature of the data, research has focused on applying them together with biologically inspired systems. One example is Spiking Neural Networks, which aim, among other goals, to mimic how visual information is processed in the visual cortex and which are well suited to this type of sensor because of their ability to learn from spiking stimuli. Good results have also been obtained with recurrent architectures, such as LSTM models, which are able to learn spatio-temporal structures from sequences of information.
 == Spiking Neural Networks ==
-A spiking neural networks (SNN) is a biologically inspired model which consider temporal information related to the incoming spikes. The basic model of a neuron of this kind is the Leaky-Integrate and Fire neuron (LIF) which can be represented as a state ''x<sub>j</sub>'' which can be modified based on the received stimuli. Every time a new spike arrives, the state ''x<sub>j</sub>'' is incremented or decremented based on the corresponding weight ''w<sub>j</sub>''. When the state of the neuron reaches one of the two (negative and positive) thresholds ''+/- x<sub>th</sub>'' the neuron generates an output spike and reset to its resting value ''x<sub>rest</sub>''. When the neuron fires it is deactivated for a certain amount of time, called ''refactory time'' in which it cannot generate outputs; this state can also be imposed by lateral connection with neighbor neurons.  The state is also affected by a constant leak that increments or decrements the neuron’s state toward its resting value. <br><br>
-The main drawback of this type of network is that the model is not easily differentiable, so backpropagation cannot be applied directly. One way to overcome this issue is to train a frame-based model (with frames obtained by integrating the events that occurred in small temporal windows of a few milliseconds) and then convert the obtained weights by means of ad-hoc rules. This approach was used by Pérez-Carrasco et al., who proposed a method to convert a trained frame-based ConvNet into an event-based one. A similar approach was adopted by [[#References|O’Connor et al.]], who used a Deep Belief Network for classification. A completely different approach is that of [[#References|J. Lee et al.]], who used a differentiable approximation of the model to which backpropagation can be applied. Finally, learning in spiking neural networks can also be performed with biologically inspired rules for updating synaptic weights that make explicit use of the timing of the spikes, such as the STDP (spike-timing-dependent plasticity) learning rule, which updates the strength of each synapse based on the delay between pre-synaptic and post-synaptic spikes. <br><br>
+A spiking neural network (SNN) is a biologically inspired model that takes into account the temporal information carried by the incoming spikes. The basic neuron model of this kind is the Leaky Integrate-and-Fire (LIF) neuron, which can be represented as a state ''x<sub>j</sub>'' that is modified by the received stimuli. Every time a new spike arrives, the state ''x<sub>j</sub>'' is incremented or decremented according to the corresponding weight ''w<sub>j</sub>''. When the state of the neuron reaches one of the two (negative and positive) thresholds ''+/- x<sub>th</sub>'', the neuron generates an output spike and resets to its resting value ''x<sub>rest</sub>''. After firing, the neuron is deactivated for a certain amount of time, called the ''refractory period'', during which it cannot generate outputs; this state can also be imposed by lateral connections with nearby neurons. The state is also affected by a constant leak that moves the neuron’s state towards its resting value. <br><br>
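
The Leaky Integrate-and-Fire behaviour described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the model of any cited paper: the <code>LIFNeuron</code> class and its parameter names are hypothetical, the leak is applied linearly over the time elapsed between input spikes, input spikes arriving during the refractory period are simply ignored, and lateral connections are left out.

```python
class LIFNeuron:
    """Event-driven Leaky Integrate-and-Fire neuron (illustrative sketch)."""

    def __init__(self, weights, x_th=1.0, x_rest=0.0, leak=10.0, refractory=0.005):
        self.weights = weights        # w_j, one weight per input synapse
        self.x_th = x_th              # firing thresholds are +x_th and -x_th
        self.x_rest = x_rest          # resting value of the state
        self.leak = leak              # constant leak, in state units per second
        self.refractory = refractory  # refractory period in seconds
        self.x = x_rest               # current state
        self.last_t = 0.0             # time of the last processed input spike
        self.blocked_until = float("-inf")

    def receive(self, synapse, t):
        """Process an input spike at time t; return +1 or -1 if the neuron fires, else 0."""
        # Constant leak: move the state towards x_rest without overshooting it.
        decay = self.leak * (t - self.last_t)
        if self.x > self.x_rest:
            self.x = max(self.x_rest, self.x - decay)
        else:
            self.x = min(self.x_rest, self.x + decay)
        self.last_t = t

        # Assumption: spikes received during the refractory period are dropped.
        if t < self.blocked_until:
            return 0

        # Integrate the incoming spike through its synaptic weight.
        self.x += self.weights[synapse]

        # Fire when either threshold is reached, then reset and go refractory.
        if self.x >= self.x_th or self.x <= -self.x_th:
            polarity = 1 if self.x > 0 else -1
            self.x = self.x_rest
            self.blocked_until = t + self.refractory
            return polarity
        return 0


neuron = LIFNeuron(weights=[0.6, -0.6])
# (synapse index, timestamp in seconds) pairs of a made-up input spike train.
spikes = [(0, 0.000), (0, 0.010), (0, 0.012), (1, 0.020), (1, 0.021)]
outputs = [neuron.receive(j, t) for j, t in spikes]
```

In this made-up spike train the second spike pushes the state over the positive threshold, the third falls inside the refractory period and is dropped, and the last two spikes on the inhibitory synapse drive the state to the negative threshold.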
 Most of the proposed solutions to the object recognition problem with spiking neural networks ([[#References|B. Zhao et al.]], [[#References|G. Orchard et al.]], [[#References|T. Masquelier et al.]]) make use of the HMAX hierarchical model, a biologically plausible model of the computation in the primary visual cortex, and of Gabor filters, which are a good approximation of the responses of simple cells in the cortex. The main difference among these models is the way in which the features of the S2 layer are learned during training.
 == Recurrent Neural Networks ==
-Recurrent neural network architectures are also well suited to learning from this type of data, because of their ability to maintain an internal state and capture temporal relations between sequences of inputs. In particular, a model that excels at remembering values over either long or short durations is the Long Short-Term Memory network (LSTM). Good results have been obtained by [[#References|D. Neil et al.]] with Phased LSTM, a modification of the classical LSTM cell that can learn from sequences of inputs gathered at irregular time instants. The modification consists of the introduction of a time gate ''k<sub>t</sub>'' which regulates the inputs seen by the cell’s state ''c<sub>t</sub>'' and the output ''h<sub>t</sub>''. The opening of this gate is regulated by an oscillation whose parameters (the period, the ratio ''r<sub>on</sub>'' of the open phase to the period, and the shift ''s'') are learned during training.
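
The time gate described above can be sketched as a plain Python function of the timestamp. The parameters <code>period</code>, <code>shift</code> and <code>r_on</code> correspond to the quantities named in the text; the piecewise-linear opening shape, the small closed-phase leak <code>alpha</code> and the blending rule in <code>gated_update</code> are not stated in the text above and should be treated as assumptions of this sketch. In the actual model these parameters are learned, while here they are fixed by hand.

```python
def time_gate(t, period, shift, r_on, alpha=0.001):
    """Openness k_t in [0, 1] of the time gate at timestamp t (sketch)."""
    # Phase inside the current oscillation cycle, in [0, 1).
    phi = ((t - shift) % period) / period
    if phi < 0.5 * r_on:
        return 2.0 * phi / r_on        # first half of the open phase: gate rises
    if phi < r_on:
        return 2.0 - 2.0 * phi / r_on  # second half of the open phase: gate falls
    return alpha * phi                 # closed phase: only a small leak (assumed)


def gated_update(k_t, proposed, previous):
    """Blend a proposed new value with the previous one according to k_t (assumed rule)."""
    return k_t * proposed + (1.0 - k_t) * previous


# Gate with a 1 s period that is open for the first half of each cycle.
k_peak = time_gate(0.25, period=1.0, shift=0.0, r_on=0.5)    # fully open
k_closed = time_gate(0.75, period=1.0, shift=0.0, r_on=0.5)  # almost closed
c_new = gated_update(k_closed, proposed=5.0, previous=1.0)   # stays close to 1.0
```

Because the gate depends only on the timestamp, inputs sampled at irregular instants, such as camera events, can be fed to the cell without first resampling them onto a fixed time grid.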
 = Tools and Datasets =
@@ Line 40: / Line 49: @@
 * Pérez-Carrasco, J.A., Zhao, B., Serrano, C., Acha, B., Serrano-Gotarredona, T., Chen, S., et al. (2013). "Mapping from frame-driven to frame-free event-driven vision systems by low-rate coding and coincidence processing" [https://dx.doi.org/10.1109/TPAMI.2013.71]
-* Maximilian Riesenhuber and Tomaso Poggio "Hierarchical models of object recognition in cortex" [cbcl.mit.edu/publications/ps/nn99.pdf]
+* Maximilian Riesenhuber and Tomaso Poggio "Hierarchical models of object recognition in cortex" [https://www.ncbi.nlm.nih.gov/pubmed/10526343]
 * J.H. Lee, T. Delbrück and M. Pfeiffer. "Training deep spiking neural networks using backpropagation" [http://journal.frontiersin.org/article/10.3389/fnins.2016.00508/full]