Pretrained methods have revolutionized NLP, in some cases surpassing human-level accuracy on language understanding tasks, as can be seen in the GLUE benchmark. These methods typically train large neural networks on massive corpora of unlabeled text, a process called pretraining, and then finetune the model on downstream tasks such as text classification, entailment, question answering, and named entity recognition.
BERT is clearly one of the most exciting advances in NLP. BERT belongs to the family of denoising autoencoders (AE): it corrupts the input by replacing a small percentage of the tokens with symbols such as [MASK], a process called masking, and during pretraining the model tries to reconstruct the original tokens from the corrupted data. Intrinsically, BERT is performing data augmentation, which is why LeCun groups this method under self-supervised learning. Another important feature of BERT is that, due to its autoencoding nature, it learns a bidirectional context, which yields higher performance.
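To make the corruption step concrete, here is a minimal sketch of random masking. The 15% rate and the [MASK] placeholder follow BERT's published setup, but the whitespace tokenization and the function itself are just an illustration, not BERT's actual preprocessing (which also sometimes keeps or swaps the selected tokens):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_symbol="[MASK]"):
    """Randomly replace a fraction of tokens with a mask symbol.

    Returns the corrupted sequence and the masked positions, which the
    model must reconstruct during pretraining.
    """
    corrupted, masked_positions = [], []
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            corrupted.append(mask_symbol)
            masked_positions.append(i)
        else:
            corrupted.append(tok)
    return corrupted, masked_positions

tokens = "the quick brown fox jumps over the lazy dog".split()
corrupted, positions = mask_tokens(tokens)
print(corrupted, positions)
```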
The other group of successful pretrained methods, such as GPT-2, Transformer-XL, and XLNet, are autoregressive (AR). They use deep neural networks to model the conditional probability distribution of each token given the preceding tokens in the sequence.
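For reference, the standard autoregressive factorization these models maximize is (written here for the forward direction; the backward case conditions on the tokens that follow instead):

$$\log p_\theta(\mathbf{x}) = \sum_{t=1}^{T} \log p_\theta\left(x_t \mid \mathbf{x}_{<t}\right)$$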
A disadvantage of AR models is that they are trained with a unidirectional context (either forward or backward), so they do not model bidirectional relationships well. On the other hand, AE methods like BERT suffer from a pretrain-finetune discrepancy, because the [MASK] tokens seen during pretraining never appear during finetuning. In addition, BERT predicts each masked token independently of the others given the corrupted input, which oversimplifies the long-range dependencies that normally appear in text.
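The independence assumption can be made explicit. With notation loosely following the XLNet paper, where $\hat{\mathbf{x}}$ is the corrupted sequence, $\bar{\mathbf{x}}$ the set of masked tokens, and $m_t = 1$ when $x_t$ is masked, BERT approximates the joint probability of the masked tokens as a product of per-token terms:

$$\log p_\theta\left(\bar{\mathbf{x}} \mid \hat{\mathbf{x}}\right) \approx \sum_{t=1}^{T} m_t \log p_\theta\left(x_t \mid \hat{\mathbf{x}}\right)$$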
XLNet seeks to combine the advantages of AR and AE models by introducing permutation language modeling. The model maximizes the expected log-likelihood of the sequence over all possible permutations of the factorization order; for a sequence of length T, there are T! such orders. Since the model parameters are shared across all factorization orders, the model learns from both sides of every position, capturing a bidirectional context similar to AE models while keeping the autoregressive formulation.
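Formally, the permutation language modeling objective can be written as follows (notation follows the XLNet paper, where $\mathcal{Z}_T$ is the set of all permutations of $[1, \dots, T]$ and $z_t$, $\mathbf{z}_{<t}$ denote the $t$-th element and the first $t-1$ elements of a permutation $\mathbf{z}$):

$$\max_{\theta}\; \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}\left[\sum_{t=1}^{T} \log p_\theta\left(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}\right)\right]$$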
In this Jupyter notebook, an example of text classification with XLNet is shown.
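As a preview, here is a minimal sketch of this kind of setup, assuming the Hugging Face transformers library and the pretrained xlnet-base-cased checkpoint; the notebook's actual dataset, labels, and training loop may differ:

```python
import torch
from transformers import XLNetTokenizer, XLNetForSequenceClassification

# Load the pretrained XLNet encoder with a fresh classification head (2 labels here).
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetForSequenceClassification.from_pretrained("xlnet-base-cased", num_labels=2)

# Tokenize a toy batch; padding/truncation keep the tensors rectangular.
texts = ["I loved this movie!", "A dull, predictable plot."]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Forward pass; during finetuning these logits would feed a cross-entropy loss.
with torch.no_grad():
    logits = model(**inputs).logits
predictions = logits.argmax(dim=-1)
print(predictions)
```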