News 'n' Updates | The Subgradient

Learning Errors (2015-12-26)

<p>I finally had the chance to give the ImageNet winners a read-through. It’s
Microsoft’s <a href="http://arxiv.org/abs/1512.03385">Deep Residual Learning</a> technique that allowed them to learn
a convnet of 150 layers. Sure,
Google matched them in classification error but their localization error
blew everyone else out of the water. The idea behind their technique was that
each layer learns the <em>residuals</em> instead of an entire mapping function.
That is, each layer of the network adds on to the previous layer instead
of finding a complete mapping function. Once you think about it, it seems
intuitively obvious that this is a better way to learn deep networks than
trying to do the entire thing in one shot.
<!--preview--></p>
<p class="center"><img src="/images/nips-2015-review/reception.png" alt="Sorry Google!" />
<em>Here’s the same picture from the last post</em></p>
<p>Then you think about it a little harder and realize this trend might be going
somewhere. The paper that inspired them is the <a href="http://arxiv.org/abs/1507.06228">Highway Networks</a> paper, except
Highway Networks implement a more complex technique where you learn how much of
the previous layer to pass through. In Residual Learning, the previous
layer is passed through in its entirety.</p>
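As a rough sketch (plain numpy, with made-up weight shapes), the three flavors of layer differ only in what gets added back to the transformation:

```python
import numpy as np

def plain_layer(x, W):
    # learns the full mapping H(x)
    return np.tanh(W @ x)

def residual_layer(x, W):
    # learns only the residual F(x); the layer's output is F(x) + x
    return np.tanh(W @ x) + x

def highway_layer(x, W, Wt):
    # learns a gate t deciding how much of x to carry through unchanged
    h = np.tanh(W @ x)
    t = 1.0 / (1.0 + np.exp(-(Wt @ x)))  # transform gate in (0, 1)
    return t * h + (1.0 - t) * x

x = np.ones(4)
W = np.zeros((4, 4))
# with zero weights a residual layer is exactly the identity --
# a very easy function for a deep stack to represent
assert np.allclose(residual_layer(x, W), x)
```

The point the comparison makes: a residual layer that learns nothing is still the identity, while a plain layer that learns nothing destroys its input.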
<p>The other place where this worked wonderfully is the
<a href="http://arxiv.org/abs/1506.05751">Deep Adversarial Network paper for generating natural images</a>.
It produces extremely realistic generated images, and works by starting
with a small image and iteratively enlarging it, generating the residuals at each step.
The training data is simply the reverse process, where you iteratively shrink
the image and feed the network the information lost during scaling.</p>
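The shrink-and-subtract step can be sketched in a few lines of numpy (using 2x2 average pooling and nearest-neighbor upsampling as crude stand-ins for the paper's blur-and-subsample):

```python
import numpy as np

def downsample(img):
    # 2x2 average pooling, a stand-in for the paper's blur-and-subsample
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img):
    # nearest-neighbor 2x upsampling
    return img.repeat(2, axis=0).repeat(2, axis=1)

img = np.random.rand(8, 8)
coarse = downsample(img)
residual = img - upsample(coarse)  # the information lost during scaling

# training pairs are (upsample(coarse), residual); generation adds a
# predicted residual back onto an enlarged coarse image
assert np.allclose(upsample(coarse) + residual, img)
```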
<p>This is far from a new technique; remember AdaBoost and gradient boosting?
Something about combining a bunch of weak learners, each one working with only the
errors of the previous learner at each iteration?</p>
<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<p>My long-stretch observation is that if you squint your eyes hard enough at <a href="http://arxiv.org/abs/1502.03167">Batch Normalization</a>,
it kind of looks like it’s predicting the errors of a Taylor expansion at each time step.
A Taylor polynomial kind of looks like</p>
<script type="math/tex; mode=display">f(x) = a_0 + a_1 x + a_2 x^2 + \ldots</script>
<p>where you could linearize the residuals by doing</p>
<script type="math/tex; mode=display">\frac{f(x) - a_0}{x}</script>
<p>and if you squint really hard it kind of looks like what Batch Normalization is
doing. Maybe something more to look at would be to force a network structure that
actually does this explicitly.</p>
<p>Residual learning seems like a really good direction to pursue deep learning models;
I bet there will be a few more models that look into this topic.</p>
NIPS 2015: Through the Looking Glass (2015-12-23)

<p>NIPS, the first conference I’ve ever attended, was somewhat of an information
overload. That said, it was probably one of the most educational weeks I’ve
had: catching up on the state of the art in deep learning, admiring how pretty
Bayesian techniques are and, simultaneously, how they’re never used for one reason
or another. Montreal probably had the best weather ever for this time of year
– it was so warm that I got by with a light jacket and flip-flops during the
middle of the conference. But enough of that, let’s talk about NIPS 2015.</p>
<p>A warning: my expertise is mostly in deep learning, so some of the impressions
I got of other subjects may be wildly incorrect.</p>
<h2 id="tutorials">Tutorials</h2>
<p>Overall the tutorials were as advertised: introductions to the topics at hand.
Jeff Dean and Oriol Vinyals talked about Tensorflow. Tensorflow pretty much
follows in the footsteps of Theano, creating computational graphs that you can
compile and run on the GPU. The upside is that it has millions of dollars thrown
at it, so it’s almost definitely going to be better supported. Though a couple people
tested Tensorflow and found it slow at first, it’s definitely getting faster
and better. At the same time, Jeff announced the state-of-the-art on ImageNet
with a different inception architecture, dubbed ReCeption, halving the error
from last year. It was a big deal until ImageNet announced results a couple
days later and Microsoft beat them in both classification and localization error.</p>
<p class="center"><img src="/images/nips-2015-review/reception.png" alt="Sorry Google!" />
<em>Sorry Google!</em></p>
<!--preview-->
<p>The other two tutorials were very good introductions, and set the stage for the
rest of the conference. Bayesian methods seem to have discovered their backprop
algorithm in variational inference and MCMC methods. My takeaway is that Bayesian
scientists think of a model, then throw variational inference and MCMC at it
until they get results. You get uncertainties, something most other learning algorithms
can’t provide, but none of it scales very well.</p>
<p>Reinforcement learning is all the rage after DeepMind’s <a href="https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf">deep Q-learning paper</a>
came out, and there’s a bunch of work on learning Q functions and getting robots
to hang coat hangers and pull out nails. The interesting thing is that there seem
to be very few papers <em>at</em> NIPS dedicated to RL. However, the tutorial and workshop
on deep RL both presented very good work, mostly conglomerations of several
research papers rather than any particular one.</p>
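For reference, the Q-learning update at the heart of all this is simple enough to sketch on a toy problem (a made-up 5-state chain, tabular rather than deep):

```python
import numpy as np

# Tabular Q-learning on a toy 5-state chain: action 1 moves right, action 0
# moves left, and reaching the last state ends the episode with reward 1.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.5, 0.9, 0.1
rng = np.random.default_rng(0)

for _ in range(500):
    s = 0
    while s != n_states - 1:
        # epsilon-greedy exploration
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s2 = s + 1 if a == 1 else max(s - 1, 0)
        r = 1.0 if s2 == n_states - 1 else 0.0
        # the Q-learning update: nudge Q(s, a) toward r + gamma * max Q(s', .)
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

# the greedy policy ends up always moving right
```

Deep Q-learning replaces the table with a convnet over raw pixels, but the update it regresses toward is the same one.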
<p>On to the papers!</p>
<h2 id="deep-learning">Deep Learning</h2>
<p>The big takeaway is that Deep Learning models are getting more and more complex.
Instead of working on the layer level, we’re already reaching the module level
where we pick out modules that do certain things and stick them together to test.
In a way, it now resembles a weird combination of software design and machine
learning. Still, tons of work went into interesting modules and novel applications.</p>
<h3 id="tools-of-the-trade">Tools of the trade</h3>
<p>Basically a bunch of techniques that your deep learning model should probably be
using if the problem is at all complicated. Batch Normalization reduces training
time by an order of magnitude on ImageNet, and is an easy drop-in for any model.
Spatial Transformer Networks look <em>really</em> cool: you regress a transformation
before running the image through the actual network. This looks very useful as a targeting
mechanism for a specific section of an image, and it gets good results on your favorite
handwritten digits dataset.</p>
<p><a href="http://arxiv.org/abs/1506.02025"><strong>Spatial Transformer Networks</strong></a></p>
<p><a href="http://arxiv.org/abs/1502.03167"><strong>Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift</strong></a></p>
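As a rough sketch of what the Batch Normalization forward pass does (inference-time running statistics and the backward pass omitted):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # normalize each feature over the batch, then apply a learned
    # scale (gamma) and shift (beta)
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 4) * 5.0 + 3.0       # a badly scaled batch
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
# each feature of y now has roughly zero mean and unit variance
```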
<p>Attention-based models partially solve the problem of memory. The idea is that
sequence models get an attention module that looks at parts of an image or text. Soft
attention is generally the way to go: the final result of looking at
the image/text is a weighted sum, where the weights are assigned by the attention
module. Attention has become one of those typical modules that you have to think
about when designing a Deep Learning system.</p>
<p><a href="http://arxiv.org/abs/1506.07503"><strong>Attention-Based Models for Speech Recognition</strong></a></p>
<p><a href="http://arxiv.org/abs/1509.06812"><strong>Learning Wake-Sleep Recurrent Attention Models</strong></a></p>
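The soft-attention weighted sum is only a few lines; here's a minimal sketch (dot-product scores and made-up dimensions, rather than any particular paper's scoring function):

```python
import numpy as np

def soft_attention(query, keys, values):
    scores = keys @ query                   # how relevant each location is
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over locations
    return weights @ values                 # weighted sum: fully differentiable

keys = np.random.randn(10, 8)    # 10 locations, 8-dim features each
values = np.random.randn(10, 8)
query = np.random.randn(8)
context = soft_attention(query, keys, values)  # a single 8-dim summary
```

Because the result is a differentiable blend rather than a hard pick of one location, the whole thing trains with plain backprop.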
<h3 id="memory">Memory</h3>
<p>Memory was one of the hottest topics at NIPS 2015, with an entire overflowing (I was standing for
most of it) workshop dedicated to it. Despite the major work, there’s still a lot more to be done.
Most of the models, while well received, still lack a useful write unit. I’m looking at you, Memory Networks.
When problems become too large for memory, there’s no way for the network to learn to do
memory management.</p>
<p>The two stack-based networks only allow one computation per time-step, which means
they do not really simulate a Turing-complete memory. Turing machines that use two stacks only
work because you can “turn” the second stack upside down on the first and imagine that
the point where they meet is the head of the Turing machine. However, without
multiple push-pops on every input, the Turing machine is unable to seek, making anything more
complex than counting impossible.</p>
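The two-stacks-as-a-tape construction is easy to make concrete; here's a toy sketch (the class and names are my own):

```python
# A Turing-machine tape simulated with two stacks: `left` holds the cells to
# the left of the head (top = nearest), `right` holds the current cell and
# everything to its right. Each head move costs one push-pop pair, which is
# why a network limited to one stack operation per input symbol cannot seek.
class TwoStackTape:
    def __init__(self, blank=0):
        self.left, self.right, self.blank = [], [], blank

    def read(self):
        return self.right[-1] if self.right else self.blank

    def write(self, symbol):
        if self.right:
            self.right[-1] = symbol
        else:
            self.right.append(symbol)

    def move_right(self):
        self.left.append(self.right.pop() if self.right else self.blank)

    def move_left(self):
        self.right.append(self.left.pop() if self.left else self.blank)

tape = TwoStackTape()
tape.write(1)
tape.move_right()
tape.write(2)
tape.move_left()       # seeking back requires extra push-pops
assert tape.read() == 1
```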
<p>The remaining class of algorithms seems to have all the required attributes, but its members are slow and complex.
Experiments for both Neural Turing Machines and Dynamic Memory Networks are restricted
to toy problems. I suspect that they will be too slow and hard to train on real scale problems
in their current form.</p>
<p><a href="http://arxiv.org/abs/1503.08895"><strong>End-To-End Memory Networks</strong></a></p>
<p><a href="http://arxiv.org/abs/1506.02516"><strong>Learning to Transduce with Unbounded Memory</strong></a></p>
<p><a href="http://arxiv.org/abs/1503.01007"><strong>Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets</strong></a></p>
<p><a href="http://arxiv.org/abs/1410.5401"><strong>Neural Turing Machines</strong></a></p>
<p><a href="http://arxiv.org/abs/1506.07285"><strong>Dynamic Memory Networks for Natural Language Processing</strong></a></p>
<h3 id="applications">Applications</h3>
<p>Computer Vision has turned into applied convnets, and most speech-related problems
have turned into applied LSTMs (with attention). That said, there’s still lots
of interesting work, like generating images almost completely from scratch and
teaching a computer to read. What I also took away were some really big QA datasets
for testing algorithms, especially memory-based ones, which we’re definitely going to see
in the upcoming year.</p>
<ul>
<li>Baidu <a href="http://idl.baidu.com/FM-IQA.html">released</a> an image question-answering dataset
with over 300k image-question-answer triplets.</li>
<li>DeepMind <a href="https://github.com/deepmind/rc-data">released</a> a QA dataset based off of
a paragraph of context, where the entities are all scrambled up so you can’t just
look on Wikipedia.</li>
<li>Facebook’s <a href="https://research.facebook.com/researchers/1543934539189348">QA dataset</a> has been around
forever, but I still think it’s a good one to mention.</li>
</ul>
<p><a href="http://arxiv.org/abs/1506.05751"><strong>Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks</strong></a></p>
<p><a href="http://www-personal.umich.edu/~reedscot/nips2015.pdf"><strong>Deep Visual Analogy-Making</strong></a></p>
<p><a href="http://arxiv.org/pdf/1506.03340"><strong>Teaching Machines to Read and Comprehend</strong></a></p>
<p><a href="http://arxiv.org/pdf/1505.05612v3.pdf"><strong>Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering</strong></a></p>
<h3 id="deeper-and-deeper">Deeper and Deeper</h3>
<p>Two papers stood out for training deeper and deeper networks. HF learning for RNNs trains
a 15-layer network, whereas Highway Nets get to over a hundred layers. The Highway Network even
learns not to use later layers, meaning it almost automatically gives some capacity control.</p>
<p><a href="http://arxiv.org/abs/1509.03475"><strong>Hessian-free Optimization for Learning Deep Multidimensional Recurrent Neural Networks</strong></a></p>
<p><a href="http://arxiv.org/abs/1507.06228"><strong>Training Very Deep Networks (Highway Networks)</strong></a></p>
<h3 id="misc">Misc</h3>
<p>Spectral pooling is an amazing idea: low-pass filtering the images instead of max-pooling.
Even better when you consider that convolutions in frequency space are just multiplications, dropping
a log N term from the complexity of our networks. However, the authors here don’t actually do
the whole thing in frequency space; they focus on how the parameterization helps gradient descent
be smoother. It’ll be interesting to see an entire network in frequency space, and maybe we go a step
further and do it in some kind of wavelet space.</p>
<p><a href="http://arxiv.org/abs/1506.03767"><strong>Spectral Representations for Convolutional Neural Networks</strong></a></p>
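A rough sketch of the pooling step itself: a plain FFT crop (the paper treats the conjugate-symmetry and scaling details much more carefully than this):

```python
import numpy as np

def spectral_pool(img, out_size):
    # transform to frequency space, keep only the central out_size x out_size
    # block of (shifted) frequencies -- i.e. low-pass filter -- then go back
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    k = out_size // 2
    cropped = F[h // 2 - k:h // 2 + k, w // 2 - k:w // 2 + k]
    # rescale so mean intensity survives the smaller inverse FFT
    scale = (out_size * out_size) / (h * w)
    return np.real(np.fft.ifft2(np.fft.ifftshift(cropped))) * scale

img = np.random.rand(16, 16)
small = spectral_pool(img, 8)       # an 8x8 low-passed version of img
```

Unlike max-pooling, this keeps the smooth low-frequency structure of the image and throws away only the high frequencies.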
<p>A super interesting “application” of a neural net: encode discrete optimization problems and make
the network learn to solve them. It works by a simple modification of the attention-based model. I’m
interested in seeing where further work in approximating NP-hard optimization problems with
Deep Learning goes…</p>
<p><a href="http://arxiv.org/abs/1506.03134"><strong>Pointer Networks</strong></a></p>
<p>Machine learning scientists are <em>already</em> trying to run themselves out of a job; we haven’t
even invented strong AI yet!</p>
<p><a href="http://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning.pdf"><strong>Efficient and Robust Automated Machine Learning</strong></a></p>
<p>Ladder Networks are a very elegant way to do semi-supervised learning when you don’t
have a lot of labels for your data. They just learn an autoencoder along with the classifier,
where most of the layers are shared.</p>
<p><a href="http://arxiv.org/abs/1507.02672"><strong>Semi-Supervised Learning with Ladder Networks</strong></a></p>
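As a loose sketch of the weight-sharing idea (forward pass only, made-up sizes; the actual Ladder Network also injects noise and adds per-layer reconstruction costs):

```python
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.standard_normal((16, 8)) * 0.1   # shared encoder
W_cls = rng.standard_normal((8, 3)) * 0.1    # classifier head
W_dec = rng.standard_normal((8, 16)) * 0.1   # decoder head

def forward(x):
    h = np.tanh(x @ W_enc)   # the representation both branches share
    logits = h @ W_cls       # supervised branch: needs labels
    recon = h @ W_dec        # unsupervised branch: trains on unlabeled data
    return logits, recon

x = rng.standard_normal((4, 16))
logits, recon = forward(x)
# total loss = cross_entropy(logits, y) on the labeled subset
#            + ||recon - x||^2 on everything
```

The unlabeled data shapes the shared encoder through the reconstruction term, which is where the semi-supervised win comes from.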
<p>Binary weights get pretty close to the state-of-the-art set with real numbers?
It’s very surprising how little precision you need to get amazing results.</p>
<p><a href="http://arxiv.org/abs/1511.00363"><strong>BinaryConnect: Training Deep Neural Networks with binary weights during propagations</strong></a></p>
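The core trick, sketched in my own toy form: binarize the real-valued weights for the propagation passes, but keep the full-precision copy around for the update.

```python
import numpy as np

def binarize(W):
    # deterministic BinaryConnect: propagate with sign(W) only
    return np.where(W >= 0.0, 1.0, -1.0)

W = np.random.randn(4, 4) * 0.1   # full-precision weights, kept for updates
x = np.random.randn(4)
y = binarize(W) @ x               # forward pass touches only +1/-1 weights
# gradients are computed with the binary weights but applied to W itself,
# so small updates can accumulate until a weight eventually flips sign
```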
<p>Learning language models efficiently, because nobody has time to backprop through
the 20k elements in your vocabulary.</p>
<p><a href="http://arxiv.org/abs/1412.7091"><strong>Efficient Exact Gradient Update for training Deep Networks with Very Large Sparse Targets</strong></a></p>
<h2 id="reinforcement-learning-and-control">Reinforcement Learning and Control</h2>
<p>I didn’t include many references because most of the deep reinforcement learning topics were actually published
at previous conferences. However, RL is very much in vogue, judging by the tutorials and workshops
that gave us great introductions to the topic. Basically all of Deep RL sprang from the
DeepMind Atari paper. Policy learning could very well be another angle on the long-term
dependency and memory problem that we have right now. It doesn’t hurt that reinforcement learning
sessions always <a href="https://sites.google.com/site/visuomotorpolicy/">show</a> you some
<a href="https://www.youtube.com/watch?v=NeFkrwagYfc">very</a> interesting <a href="https://www.youtube.com/watch?v=IxrnT0JOs4o">videos</a>.</p>
<p><a href="http://www.eecs.berkeley.edu/~igor.mordatch/policy/paper.pdf"><strong>Interactive Control of Diverse Complex Characters with Neural Networks</strong></a></p>
<p><a href="http://arxiv.org/abs/1507.08750"><strong>Action-Conditional Video Prediction using Deep Networks in Atari Games</strong></a></p>
<p><a href="http://arxiv.org/abs/1507.01273"><strong>Learning Deep Neural Network Policies with Continuous Memory States</strong></a></p>
<h2 id="bayesian">Bayesian</h2>
<p>Lots of the papers in this topic are ones I’d like to know more about but don’t. Again, it
looks like a lot of Bayesian learning is tending towards the <em>throw variational inference and MCMC
at problems</em> type of paper, so you’ll undoubtedly see a lot of that. However, the issue with this
is that variational inference and MCMC techniques are notoriously slow, so there’s another bundle of papers
that discuss how to make these sampling-based approaches faster. It seems to me that
Gaussian Processes are very good at solving small-scale problems without many data points right now, but neural
networks still scale harder. I’m hoping to see a lot more study of GPs in the future,
either in Bayesian Optimization or otherwise integrated into deep networks.</p>
<p><a href="http://arxiv.org/abs/1509.02866"><strong>Fast Second Order Stochastic Backpropagation for Variational Inference</strong></a></p>
<p><a href="http://papers.nips.cc/paper/5665-scalable-inference-for-gaussian-process-models-with-black-box-likelihoods.pdf"><strong>Scalable Inference for Gaussian Process Models with Black-Box Likelihoods</strong></a></p>
<p><a href="http://papers.nips.cc/paper/5772-learning-stationary-time-series-using-gaussian-processes-with-nonparametric-kernels.pdf"><strong>Learning Stationary Time Series using Gaussian Processes with Nonparametric Kernels</strong></a></p>
<p><a href="http://arxiv.org/abs/1511.00054"><strong>Gaussian Process Random Fields</strong></a></p>
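For the unfamiliar, GP regression really is this short; here's a sketch with an RBF covariance function and a made-up length scale:

```python
import numpy as np

def rbf(a, b, ell=1.0):
    # the covariance function: essentially the only modeling choice in a GP
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

X = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # noise-free training inputs
y = np.sin(X)
Xs = np.array([0.5])                         # where we want a prediction

K = rbf(X, X) + 1e-8 * np.eye(len(X))        # jitter for numerical stability
Ks = rbf(Xs, X)
mean = Ks @ np.linalg.solve(K, y)            # posterior mean at Xs
var = rbf(Xs, Xs) - Ks @ np.linalg.solve(K, Ks.T)  # posterior variance
# mean[0] lands close to sin(0.5); var shrinks near the training points
```

The catch, as mentioned above, is that the linear solves cost O(n^3) in the number of data points, which is exactly the scaling problem those papers attack.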
<p>There was also an entire session that I missed on probabilistic programming. Again, tons of using
Variational Inference to approximate your posterior and throwing it at problems until they stop being problems.
It will be interesting to see if it can be sped up even more.</p>
<p><a href="http://arxiv.org/abs/1506.03431"><strong>Automatic Variational Inference in Stan</strong></a></p>
<h2 id="misc-1">Misc</h2>
<p>Another GP based technique that extends line searches to stochastic optimization. Looks like
another very good drop-in replacement optimizer for our backpropagating needs.</p>
<p><a href="http://arxiv.org/abs/1502.02846"><strong>Probabilistic Line Searches for Stochastic Optimization</strong></a></p>
<p>There was a lot of analysis of asynchronous SGD-type algorithms, a big reason being that all the big
companies need a couple thousand machines to optimize their huge models. Most
of these papers just established bounds and guarantees. The takeaway message is that in the long run, async
SGD is no different from normal SGD.</p>
<p><a href="http://arxiv.org/abs/1506.08272"><strong>Asynchronous Parallel Stochastic Gradient for Nonconvex Optimization</strong></a></p>
<p><a href="http://arxiv.org/abs/1506.06438"><strong>Taming the Wild: A Unified Analysis of Hogwild!-Style Algorithms</strong></a></p>
<p><a href="http://papers.nips.cc/paper/6031-asynchronous-stochastic-convex-optimization-the-noise-is-in-the-noise-and-sgd-dont-care.pdf"><strong>Asynchronous stochastic convex optimization: the noise is in the noise and SGD don’t care</strong></a></p>
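A toy Hogwild!-style run to make the point concrete (Python threads on made-up linear-regression data; the workers update shared parameters with no locks at all):

```python
import threading
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 3.0, 0.5])
data = rng.standard_normal((1000, 4))
target = data @ true_w

w = np.zeros(4)                     # shared parameters, deliberately unlocked

def worker(rows):
    global w
    for i in rows:
        x = data[i]
        grad = (w @ x - target[i]) * x   # least-squares gradient, one sample
        w -= 0.01 * grad                 # racy, Hogwild!-style update

threads = [threading.Thread(target=worker, args=(range(k, 1000, 4),))
           for k in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# despite the races, w ends up close to true_w
```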
<h2 id="conclusion">Conclusion</h2>
<p>Overall the conference was a huge information overload. The takeaways for me are that:</p>
<ol>
<li>Memory is the next big thing we’re gonna have to invent. Some kind of framework that
makes storing information, and training models to read/write that information,
easy will be an essential next step.</li>
<li>Reinforcement learning is pretty hip right now. I expect to see it on more and more
robots in the next couple years. If the DARPA robotics challenge were held two years
from now we would see far fewer falling robots.</li>
<li>Computer Vision and NLP have been unified under Applied Deep Learning.
Robotics and planning AIs are slowly heading towards this as well.</li>
<li>Gaussian processes are really cool. You literally decide on a single parameter (the covariance
function) and tell your model to go learn things.</li>
</ol>