This year, the Thirtieth Annual Conference on Neural Information Processing Systems (NIPS) took place in Barcelona, Spain. More than 6,000 people attended, a record for the conference.
One reason for this success and visibility could be the impressive results that deep learning has achieved in the last decade. Another could be the growing interest from industry, especially big technology companies such as Microsoft, Google and Facebook, which are actively contributing to machine learning research. In a conference of this magnitude it is very easy to miss interesting talks, but a great feature of NIPS is that the papers are accessible in advance in the pre-proceedings. There is also a Reddit thread with a compilation of some papers and code.
Among the talks, perhaps the most discussed one on social media was LeCun’s keynote, in which he presented The Cake, an analogy for the current state of knowledge in artificial intelligence (AI). In LeCun’s words: “if intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake. We know how to make the icing and the cherry, but not the cake”. Unsupervised learning, then, is the major obstacle, at least among those we currently know of, to achieving strong AI.
Perhaps the most impressive demo of the conference was the presentation of SpotMini from Boston Dynamics. They showed how the robot can walk, run, jump and grasp objects. Here is the video:
RocketAI was the Easter egg of the conference. Sandberg of Oxford University presented to the press the patent-pending “Temporally Recurrent Optimal Learning” (TROL) algorithm, while Goodfellow of OpenAI tweeted that the company had the best “Jacobian-Optimized Kernel Expansion” (JOKE). Many people fell for the prank: apparently the pranksters received emails from 5 VC funds showing interest in investing in the fictitious company.
One way forward is being able to model the world, and for that Generative Adversarial Networks (GANs) are becoming more and more popular. GANs, proposed in 2014 by Goodfellow et al., are neural networks that generate data via adversarial training: a generator tries to produce realistic samples, while a discriminator tries to tell them apart from real data. A GAN should therefore internally learn a good representation of what an image or a sentence is. On the first day of the conference, Goodfellow gave a nice tutorial explaining GANs in detail.
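In the original 2014 formulation, this adversarial training is a two-player minimax game between the generator G and the discriminator D:

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator is trained to assign high probability to real data, while the generator is trained to fool it by mapping noise z into realistic samples.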
Regarding new ideas about GANs at NIPS, there was a workshop on adversarial training and many papers using this methodology. Researchers from OpenAI, for instance, used adversarial training for imitation learning (paper and code). Other examples were multi-class classification (paper and code) and text-to-image synthesis (paper). Xi Chen presented InfoGAN, a new type of generative adversarial network based on mutual information. In some experiments it generates more structured and interpretable image representations; for example, latent variables capture characteristics such as pose, lighting, rotation and width.

Soumith Chintala from Facebook Research talked about practical tricks and heuristics for training GANs, e.g. dealing with stability issues, avoiding sparse gradients (such as those produced by ReLU), and using noisy labels. Sebastian Nowozin from Microsoft Research talked about recent work on generalizing GANs using f-divergences, which he also explains in a blog post. This generalization allows replacing the Jensen-Shannon divergence (used in the original GAN) with any member of the family of f-divergences. He also showed a connection with convex duality and proposed a new gradient-based algorithm.
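The key tool behind this generalization is the variational lower bound on an f-divergence, where f* denotes the convex conjugate of f:

```latex
D_f(P \,\|\, Q) \;\ge\; \sup_{T} \Big(
  \mathbb{E}_{x \sim P}\big[T(x)\big]
  - \mathbb{E}_{x \sim Q}\big[f^{*}(T(x))\big] \Big)
```

Parameterizing T with the discriminator network and Q with the generator yields a GAN-style objective; different choices of f recover the Jensen-Shannon divergence, the Kullback-Leibler divergence, and others.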
Another especially interesting area is Deep Reinforcement Learning (DRL), a field in which Pieter Abbeel and Richard Sutton have made important contributions. Abbeel gave a very comprehensive tutorial about policy optimization in DRL.
Richard Sutton presented "Learning representations by stochastic gradient descent in cross-validation error". He proposed to optimize the cross-validation error instead of the training error, replacing backpropagation with a new algorithm called crossprop, within an online learning setting. Some experiments indicate that crossprop may have advantages in non-stationary settings.
Another interesting contribution came from Kilian Weinberger, who presented a new training method for convolutional networks called Stochastic Depth (paper and code). The idea is similar to dropout, but instead of dropping connections, a random subset of layers is dropped in each iteration. This trick substantially reduces training time for networks with many layers, and in some cases it also avoids overfitting by reducing overparameterization. For example, they obtained an improved test error on CIFAR-10 with a 1200-layer network. On ImageNet, stochastic depth matched the error rate of a constant-depth network of the same architecture, but in a shorter training time.
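As a rough illustration (a minimal NumPy sketch, not the authors' implementation: the toy residual block and the single survival probability `p_survive` are simplifications, since the paper actually decays the survival probability linearly with depth), the forward pass looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_block(x, weight):
    # toy residual block: identity shortcut plus a ReLU transformation
    return x + np.maximum(weight @ x, 0.0)

def forward(x, weights, p_survive, training):
    """Stochastic-depth forward pass (sketch): during training each
    residual block is skipped entirely (identity) with probability
    1 - p_survive; at test time every block runs, with its residual
    branch scaled by the survival probability."""
    for w in weights:
        if training:
            if rng.random() < p_survive:
                x = residual_block(x, w)
            # else: the whole block is dropped, only the shortcut remains
        else:
            x = x + p_survive * np.maximum(w @ x, 0.0)
    return x
```

Because whole layers disappear from the computation during training, each iteration is cheaper on average, which is where the speed-up comes from.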
There were also interesting ideas for recurrent neural networks, like the Phased LSTM. The authors add a new time gate to the traditional LSTM. With it, the computation time is reduced by an order of magnitude, and the network is suitable for inputs with different sampling rates.
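The time gate follows a piecewise-linear openness schedule. The sketch below (a hypothetical helper in NumPy, following the schedule described in the paper) computes the gate value from a timestamp t, an oscillation period tau, a phase shift s and an open ratio r_on; outside the open phase only a small leak alpha lets gradients through:

```python
import numpy as np

def time_gate(t, tau, s, r_on, alpha=1e-3):
    """Phased LSTM openness: ramps from 0 to 1 during the first half of
    the open phase, back down to 0 during the second half, and stays at
    a small leak value alpha * phi the rest of the period."""
    phi = ((t - s) % tau) / tau                      # phase in [0, 1)
    k = np.where(phi < 0.5 * r_on, 2.0 * phi / r_on,
        np.where(phi < r_on, 2.0 - 2.0 * phi / r_on,
                 alpha * phi))                       # leaky closed phase
    return k
```

Since cell updates only happen while the gate is open, most timesteps require no update at all, which is what cuts the computation, and each unit can pick its own tau, so inputs arriving at different rates can be handled naturally.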
The popularity of chatbots has grown in the past years, as can be seen in the following plot obtained from Google Trends (100 means maximum popularity).
Gartner sees chatbots as one of the most important technologies for the coming years, predicting that by 2020 customers will manage 85% of their relationship with the enterprise without interacting with a human. Microsoft's CEO Satya Nadella is positioning the company to prevail in a future filled with bots: “As an industry, we are on the cusp of a new frontier that pairs the power of natural human language with advanced machine intelligence”. At NIPS, there was an interesting workshop about chatbots.
Helen Hastie talked about social chatbots, which are finding more and more applications in business scenarios, as assistants, customer representatives, etc. There are several challenges in building good bots, an important one being how to evaluate their quality. Evaluation measures can be extrinsic or intrinsic: fulfillment of a task or goal is extrinsic, while fluency is intrinsic. Some can be assessed automatically, others by experts, Turkers (not very trustworthy) or customers. Qualities to measure include engagement, flow and personalization, at the turn level, the dialogue level and the system level (continuation, trust, "personality"). Another important point is the need to communicate a clear mental model to the user at the start, about how to interact with the bot.
Julien Perez presented an end-to-end dialog system for text understanding and reasoning. He proposed a network architecture, the gated memory network, that borrows elements from highway networks and is also related to residual networks. This network performs well on the Facebook bAbI tasks.
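To give an idea of the kind of gating involved (a minimal NumPy sketch of a highway-style layer, not Perez's actual architecture; the function name and weight shapes are purely illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_gate(x, W_h, W_t):
    """Highway-style gated update: a transform gate T in (0, 1) mixes a
    candidate transformation H(x) with the untouched input x, so
    information can flow through a layer unchanged when T is near 0."""
    H = np.tanh(W_h @ x)   # candidate transformation of the input
    T = sigmoid(W_t @ x)   # transform gate, elementwise in (0, 1)
    return T * H + (1.0 - T) * x
```

When the gate saturates near zero the layer acts as an identity, which is the same intuition behind residual connections: deep stacks stay trainable because gradients have a short path back through the unmodified input.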
Iulian Serban presented work on generative deep neural networks for dialog systems that require state, policy and action. An example application is the Ubuntu task of helping users resolve technical problems. He described three generative models created for this problem: hierarchical recurrent encoder-decoders, multiresolution RNNs, and latent-variable recurrent encoder-decoders. Of these, the multiresolution RNNs performed best on the Ubuntu data.
Recently, there has been interest in the theoretical study of the properties of DNNs (such as the oral presentation "Deep Learning without Poor Local Minima", or the AISTATS paper connecting DNNs to spin glasses). Along these lines, in this workshop Rene Vidal presented theoretical results applicable to some frameworks for matrix/tensor factorization and deep learning (networks constructed from parallel subnetworks, with ReLU activation functions and max pooling). The implication is that if the number of subnetworks is large enough, then local descent will reach a global minimizer. This motivates a “meta-algorithm” that performs local descent and keeps adding subnetworks until the stopping criterion is satisfied.
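A toy version of that meta-algorithm might look as follows (a hypothetical NumPy sketch, not Vidal's method: here each "subnetwork" is a single random ReLU unit, and the local-descent step is reduced to a least-squares fit of the new unit's output weight against the current residual):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def grow_and_descend(x, y, max_units=100, tol=1e-3, seed=0):
    """Meta-algorithm sketch: keep adding parallel subnetworks and run a
    local-descent step after each addition, stopping once the loss falls
    below tol or the unit budget is exhausted."""
    rng = np.random.default_rng(seed)
    pred = np.zeros_like(y)
    losses = [float(np.mean((y - pred) ** 2))]
    while losses[-1] > tol and len(losses) <= max_units:
        w, b = rng.normal(), rng.normal()        # new random ReLU subnetwork
        feat = relu(w * x + b)
        denom = feat @ feat
        # least-squares output weight for the new unit on the residual
        a = (feat @ (y - pred)) / denom if denom > 0 else 0.0
        pred += a * feat
        losses.append(float(np.mean((y - pred) ** 2)))
    return pred, losses
```

Each addition can only reduce the squared loss (the weight is fit optimally to the residual), so the loss sequence is monotonically non-increasing; the theoretical result says that with enough subnetworks this kind of local procedure does not get stuck at a poor local minimum.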
Guillaume Obozinski presented a broad view of many regularization methods that are used to learn structured and sparse models. This view is based on a general form for combinatorial problems, and the corresponding convex relaxations.
This workshop also included theoretical contributions, some of them relevant to deep learning. Amnon Shashua presented recent work that uses tensor decompositions to study the properties of certain types of deep networks. In particular, he and his collaborators have shown that convolutional networks with ReLU activations are universal with max pooling but not with average pooling; that depth efficiency is better with linear than with rectified activation functions; and why contiguous pooling may be appropriate for learning natural images.
In another talk, Lek-Heng Lim introduced a new concept related to tensor network states, which are tensors associated with graphs (used in condensed matter physics). The tensor network rank, or G-rank, is a generalization of the notion of matrix rank. One advantage is that, unlike the standard rank of a tensor, this new type of rank is computable in polynomial time for acyclic graphs. This concept could be used in current applications of tensor approximations in machine learning.