Andreas Argyriou, Miguel González-Fierro

Dec. 22, 2016

This year, the Thirtieth Annual Conference on Neural Information Processing Systems (NIPS) took place in Barcelona, Spain. More than 6000 people attended, which was a record for NIPS.

The reason for this success and visibility could be related to the impressive results that deep learning has achieved in the last decade. Another reason could be the growing interest from industry, especially big technology companies such as Microsoft, Google and Facebook, which are actively contributing to machine learning research. In a conference of this magnitude it is easy to miss interesting talks, but a great feature of NIPS is that the papers are accessible in advance in the pre-proceedings. There is also a thread on Reddit with a __compilation__ of some papers and code.

Among the talks, perhaps the most discussed one on social media was LeCun’s keynote, in which he presented The Cake. The cake represents the current state of knowledge in artificial intelligence (AI). In LeCun’s words: “if intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake. We know how to make the icing and the cherry, but not the cake”. So unsupervised learning is, as far as we currently know, the major obstacle to achieving strong AI.

Perhaps the most impressive demo of the conference was the presentation of SpotMini from Boston Dynamics. They showed how the robot can walk, run, jump and grasp objects. Here is the video:

__RocketAI__ was the Easter egg of the conference. Sandberg of Oxford University presented to the press the patent-pending “Temporally Recurrent Optimal Learning” (TROL) algorithm, while Goodfellow from OpenAI __tweeted__ that the company had the best “Jacobian-Optimized Kernel Expansion” (JOKE). Many people fell for the __joke__: apparently the pranksters received emails from __5 VC funds__ interested in investing in the fictitious company.

One way forward is to be able to model the world, and for that Generative Adversarial Networks (GANs) are becoming more and more popular. GANs, proposed in 2014 by __Goodfellow et al.__, are neural networks that generate data via adversarial training: a generator network learns to produce samples while a discriminator network learns to tell them apart from real data. GANs should therefore internally learn a good representation of what an image or a sentence is. On the first day of the conference, Goodfellow gave a nice tutorial explaining GANs in detail.

Regarding new ideas about GANs at NIPS, there was a workshop on __adversarial training__ and many papers using this methodology. Researchers from OpenAI, for example, implemented an adversarial algorithm for imitation learning (__paper__ and __code__). Other examples were multi-class classification (__paper__ and __code__) and text-to-image synthesis (__paper__). Xi Chen presented __InfoGAN__, a new type of GAN based on mutual information. In some experiments it generates more structured and interpretable image representations; for example, latent variables capture characteristics such as pose, lighting, rotation, width, etc. Soumith Chintala from Facebook Research talked about various practical tricks and heuristics for training GANs, e.g. dealing with stability issues, avoiding sparse gradients (such as those of ReLU), noisy labels, etc. Sebastian Nowozin from Microsoft Research talked about recent work on generalizing GANs using f-divergences, which he also explains in his blog __post__. This generalization allows the Jensen-Shannon divergence (used in the original GAN) to be replaced with any member of the family of f-divergences. He also showed a connection with convex duality and proposed a new gradient-based algorithm.
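To unpack the f-divergence idea: for discrete distributions, D_f(P‖Q) = Σ_x q(x) f(p(x)/q(x)), and different choices of the convex function f recover the KL divergence, the Jensen-Shannon divergence and others. A minimal sketch (our illustration, not Nowozin's code):

```python
import numpy as np

def f_divergence(p, q, f):
    # D_f(P || Q) = sum_x q(x) * f(p(x) / q(x)) for discrete distributions.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(q * f(p / q)))

# f(t) = t log t recovers the Kullback-Leibler divergence KL(P || Q).
kl_f = lambda t: t * np.log(t)
# This f recovers the Jensen-Shannon divergence used in the original GAN.
js_f = lambda t: 0.5 * (t * np.log(2 * t / (1 + t)) + np.log(2 / (1 + t)))

p, q = [0.5, 0.5], [0.9, 0.1]
print(f_divergence(p, q, kl_f))
print(f_divergence(p, q, js_f))
```

Both divergences are zero exactly when P = Q, and the Jensen-Shannon divergence is bounded above by log 2; choosing a different f changes which discrepancies between the data and model distributions the GAN penalizes most.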

Another especially interesting area is Deep Reinforcement Learning (DRL), a field in which __Pieter Abbeel__ and __Richard Sutton__ have made important contributions. Abbeel gave a very comprehensive tutorial about policy optimization in DRL.

Richard Sutton __presented__ “Learning representations by stochastic gradient descent in cross-validation error”. He proposed optimizing the cross-validation error instead of the training error by replacing backpropagation with a new algorithm called *crossprop*, within an online learning setting. Some experiments indicate that *crossprop* may have advantages in non-stationary settings.

Another interesting contribution came from Kilian Weinberger, who presented a new training method called Stochastic Depth (__paper__ and __code__) for convolutional networks. The idea is similar to dropout, but instead of dropping connections, they drop a random subset of layers in each iteration. This trick substantially reduces training time for networks with many layers, and in some cases it also avoids overfitting by reducing overparameterization. For example, they obtained an improved test error on CIFAR-10 with a 1200-layer network. On ImageNet, they achieved a similar error rate with the same architecture when comparing stochastic depth with constant depth, but in a shorter training time.
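A minimal sketch of the training rule (our illustration, assuming the paper's linear decay of survival probabilities; the `blocks` here are toy residual branches, not real convolutional layers):

```python
import numpy as np

def stochastic_depth_forward(x, blocks, p_L=0.5, rng=None, train=True):
    # Forward pass through residual blocks, dropping whole blocks at random.
    # Block l survives with probability p_l, linearly decayed from 1 (first
    # block) down to p_L (last block).
    rng = rng or np.random.default_rng()
    L = len(blocks)
    for l, block in enumerate(blocks, start=1):
        p_l = 1.0 - (l / L) * (1.0 - p_L)   # survival probability of block l
        if train:
            if rng.random() < p_l:
                x = x + block(x)            # block kept (residual connection)
            # else: block dropped, only the identity path remains
        else:
            x = x + p_l * block(x)          # test time: expected output
    return x

# Toy "residual branches": each just adds 1 to every component.
blocks = [lambda x: np.ones_like(x)] * 4
out = stochastic_depth_forward(np.zeros(3), blocks, p_L=0.5, train=False)
print(out)
```

At test time each block's output is scaled by its survival probability, so the toy four-block network above outputs 2.75 per component instead of 4; during training the expected depth is shorter than the full network, which is where the speedup comes from.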

There were also interesting ideas in recurrent neural nets, like the __Phased LSTM__. The authors added a new time gate to traditional LSTMs. With it, the computation time is reduced by an order of magnitude and the network is suitable for inputs with different sampling rates.
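The time gate works as a rhythmic openness signal controlled by a learned period, phase shift and open ratio; cell updates only happen while the gate is open. A sketch of the gate's openness function as described in the paper (parameter values below are made up for illustration):

```python
def time_gate(t, tau, s, r_on, alpha=1e-3):
    # Openness k_t of the Phased LSTM time gate at timestamp t.
    # tau: period, s: phase shift, r_on: open fraction of the period,
    # alpha: small leak that lets gradients flow while the gate is closed.
    phi = ((t - s) % tau) / tau        # position within the cycle, in [0, 1)
    if phi < 0.5 * r_on:               # first half of the open phase: rising
        return 2.0 * phi / r_on
    elif phi < r_on:                   # second half: falling
        return 2.0 - 2.0 * phi / r_on
    else:                              # closed phase: small leak
        return alpha * phi

# The gate is fully open in the middle of its open phase:
print(time_gate(t=0.05, tau=1.0, s=0.0, r_on=0.1))
```

Because each unit only updates during its open phase, most units are untouched at any given timestamp, which is what cuts the computation, and the continuous-time formulation is what handles irregular sampling rates.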

Gartner sees chatbots as one of the most important technologies of the coming years. They predict that by 2020, customers will manage 85% of their relationship with the enterprise without interacting with a human. Microsoft's CEO Satya Nadella is positioning the company to prevail in a future filled with bots: “As an industry, we are on the cusp of a new frontier that pairs the power of natural human language with advanced machine intelligence”. At NIPS, there was an interesting workshop about chatbots.

Helen Hastie talked about __social chatbots__, which are finding more and more applications in business scenarios, as assistants, customer representatives, etc. There are several challenges in building good bots, an important one being how to evaluate their quality. There are extrinsic and intrinsic measures; for example, fulfilment of a task or goal is extrinsic, while fluency is intrinsic. Some can be assessed automatically, others by experts, __Turkers__ (not very trustworthy) or customers. Qualities that need to be measured include engagement, flow and personalization, at the turn level, the dialogue level and the system level (continuation, trust, "personality"). Another important point is the need to communicate a clear mental model to the user at the start, about how to interact with the bot.

Julien Perez presented an __end-to-end dialog system__ for text understanding and reasoning. He proposed a network architecture (gated memory network) that uses elements from highway networks and is also related to residual networks. This network performs well on the Facebook bAbI tasks.

Iulian Serban presented work on __generative DNNs__ for dialogue systems that require state, policy and action. An example application is the Ubuntu task of helping users resolve technical problems. He described three generative models created for this problem: hierarchical recurrent encoder-decoders, multiresolution RNNs and latent variable recurrent encoder-decoders. Of these, the multiresolution RNNs performed best on the Ubuntu data.

Recently, there has been interest in the theoretical study of the properties of DNNs (such as the oral presentation "Deep Learning without Poor Local Minima", or the AISTATS paper connecting DNNs to spin glasses). Along these lines, in this __workshop__, Rene Vidal presented theoretical results applicable to some frameworks for __matrix/tensor factorization__ and deep learning (networks built from parallel subnetworks, with ReLU activation functions and max pooling). The implication is that if the number of subnetworks is large enough, then local descent will reach a global minimizer. This motivates a “meta-algorithm” that performs local descent and keeps adding subnetworks until the criterion is satisfied.

Guillaume Obozinski presented a broad view of many __regularization methods__ that are used to learn structured and sparse models. This view is based on a general form for combinatorial problems and the corresponding convex relaxations.

This __workshop__ also included theoretical contributions, some of them relevant to deep learning. Amnon Shashua presented recent work that uses tensor decompositions to study the properties of certain types of deep networks. In particular, he and his collaborators have shown that convolutional networks with ReLU activations are universal with max pooling but not with average pooling; that depth efficiency is better with linear than with rectified activation functions; and why contiguous pooling may be appropriate for learning natural images.

In another talk, Lek-Heng Lim introduced a new concept related to tensor network states, which are tensors associated with graphs (used in condensed matter physics). The tensor network rank or G-rank is a generalization of the notion of rank of matrices. One advantage is that, unlike the standard rank of a tensor, this new type of rank is computable in polynomial time for acyclic graphs. This concept could be used in current applications of tensor approximations in machine learning.
