Pretrained methods have revolutionized NLP, in some cases surpassing human-level accuracy on language understanding tasks, as can be seen in the GLUE benchmark. These methods typically train large neural networks on massive corpora of unlabeled text, a process called pretraining, and then finetune the model on downstream tasks such as text classification, entailment, question answering, and named entity recognition.
BERT is clearly one of the most exciting advances in NLP. BERT belongs to the family of denoising autoencoders (AE): it corrupts the input by replacing a small percentage of the tokens with symbols such as [MASK], a process called masking, and during pretraining the model tries to reconstruct the original tokens from the corrupted data. Intrinsically, BERT is performing data augmentation, which is why LeCun groups this method under self-supervised learning. Another important feature of BERT is that, due to its autoencoding nature, it learns a bidirectional context, which yields higher performance.
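To make the corruption step concrete, here is a minimal sketch of random masking. The 15% rate and the [MASK] placeholder follow BERT's published setup, but the whitespace tokenization and the function itself are just an illustration, not BERT's actual preprocessing (which also sometimes keeps or swaps the selected tokens):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_symbol="[MASK]"):
    """Randomly replace a fraction of tokens with a mask symbol.

    Returns the corrupted sequence and the masked positions, which the
    model must reconstruct during pretraining.
    """
    corrupted, masked_positions = [], []
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            corrupted.append(mask_symbol)
            masked_positions.append(i)
        else:
            corrupted.append(tok)
    return corrupted, masked_positions

tokens = "the quick brown fox jumps over the lazy dog".split()
corrupted, positions = mask_tokens(tokens)
print(corrupted, positions)
```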
The other group of successful pretrained methods, such as GPT-2, Transformer-XL, and XLNet, are autoregressive (AR). They use deep neural networks to model the conditional probability distribution of each token given the preceding tokens in the sequence.
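For reference, the standard autoregressive factorization these models maximize is (written here for the forward direction; the backward case conditions on the tokens that follow instead):

$$\log p_\theta(\mathbf{x}) = \sum_{t=1}^{T} \log p_\theta\left(x_t \mid \mathbf{x}_{<t}\right)$$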
A disadvantage of AR models is that they are trained with a unidirectional context (either forward or backward), so they do not model bidirectional relationships well. On the other hand, AE methods like BERT suffer from a pretrain-finetune discrepancy, because the [MASK] tokens seen during pretraining never appear during finetuning. In addition, BERT predicts each masked token independently of the others given the corrupted input, which oversimplifies the long-range dependencies that normally appear in text.
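The independence assumption can be made explicit. With notation loosely following the XLNet paper, where $\hat{\mathbf{x}}$ is the corrupted sequence, $\bar{\mathbf{x}}$ the set of masked tokens, and $m_t = 1$ when $x_t$ is masked, BERT approximates the joint probability of the masked tokens as a product of per-token terms:

$$\log p_\theta\left(\bar{\mathbf{x}} \mid \hat{\mathbf{x}}\right) \approx \sum_{t=1}^{T} m_t \log p_\theta\left(x_t \mid \hat{\mathbf{x}}\right)$$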
XLNet seeks to combine the advantages of AR and AE models by introducing permutation language modeling. The model maximizes the expected log-likelihood of the sequence over all possible permutations of the factorization order; for a sequence of length T, there are T! such orders. Since the model parameters are shared across all factorization orders, the model learns from both sides of every position, capturing a bidirectional context similar to AE models while keeping the autoregressive formulation.
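Formally, the permutation language modeling objective can be written as follows (notation follows the XLNet paper, where $\mathcal{Z}_T$ is the set of all permutations of $[1, \dots, T]$ and $z_t$, $\mathbf{z}_{<t}$ denote the $t$-th element and the first $t-1$ elements of a permutation $\mathbf{z}$):

$$\max_{\theta}\; \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}\left[\sum_{t=1}^{T} \log p_\theta\left(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}\right)\right]$$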
In this Jupyter notebook, an example of text classification with XLNet is shown.
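As a preview, here is a minimal sketch of this kind of setup, assuming the Hugging Face transformers library and the pretrained xlnet-base-cased checkpoint; the notebook's actual dataset, labels, and training loop may differ:

```python
import torch
from transformers import XLNetTokenizer, XLNetForSequenceClassification

# Load the pretrained XLNet encoder with a fresh classification head (2 labels here).
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetForSequenceClassification.from_pretrained("xlnet-base-cased", num_labels=2)

# Tokenize a toy batch; padding/truncation keep the tensors rectangular.
texts = ["I loved this movie!", "A dull, predictable plot."]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Forward pass; during finetuning these logits would feed a cross-entropy loss.
with torch.no_grad():
    logits = model(**inputs).logits
predictions = logits.argmax(dim=-1)
print(predictions)
```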