I have always wanted to join the club of artistic titles such as Learning to Learn by Gradient Descent by Gradient Descent, Attention Is All You Need, You May Not Need Attention (a response to the previous article), CNN Is All You Need (not written by LeCun) or LSTM Is All You Need (not written by Schmidhuber and Schmidhuber). Hence this post's title.
In 2009, Alon Halevy, Peter Norvig, and Fernando Pereira from Google published a paper entitled The Unreasonable Effectiveness of Data, discussing the impact that large amounts of data have had on NLP.
Later, in 2017, a team at Google published a follow-up entitled Revisiting the Unreasonable Effectiveness of Data, where they study the effect of data on deep learning. In their paper, they validate the following hypotheses:
(1) Large-scale data helps in representation learning. (2) Performance increases logarithmically with the volume of training data. (3) Model capacity is crucial.
Point (1) is in line with the two latest SOTA results for image classification on the ImageNet dataset. In 2018, a transfer learning approach was used with billions of tagged images from Instagram. More recently, researchers used a billion images and a clever image-resolution trick to beat the SOTA again.
Point (2) is in line with what can be seen in almost any SOTA benchmark, though I wonder whether the reason is simply that the metrics we use are bounded between 0 and 1.
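To make point (2) concrete, here is a tiny sketch of what "logarithmic growth" means in practice: if accuracy follows acc ≈ a + b·log10(n), the relationship is a straight line in log10(n) and can be recovered with a simple linear fit. The numbers below are made up for illustration, not taken from any real benchmark.

```python
import numpy as np

# Hypothetical (accuracy, training-set size) pairs, invented for the
# sketch: each 10x increase in data adds a constant accuracy bump.
train_sizes = np.array([1e3, 1e4, 1e5, 1e6, 1e7])
accuracies = np.array([0.55, 0.63, 0.71, 0.79, 0.87])

# Fit accuracy ~ a + b * log10(n): a log-linear trend is a straight
# line in log10(n), so an ordinary degree-1 polyfit suffices.
b, a = np.polyfit(np.log10(train_sizes), accuracies, 1)
print(f"accuracy ~ {a:.2f} + {b:.2f} * log10(n)")
```

Note the practical consequence: each additional point of accuracy costs an order of magnitude more data, which is why only groups with billions of images keep pushing these benchmarks.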
Point (3) can also be observed in how network capacity has grown in recent years.
In my opinion, the authors miss an important point: (4) Optimization tricks help.
Over the years, researchers have proposed methods such as dropout, residual connections, and the attention mechanism. All of these are tricks that help the network optimize better and reduce overfitting.
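Dropout, the first trick mentioned above, is simple enough to sketch in a few lines. This is a minimal "inverted dropout" illustration in NumPy (real frameworks provide their own layers): at training time each activation is zeroed with probability p and the survivors are scaled by 1/(1-p), so nothing needs to change at test time.

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=np.random.default_rng(0)):
    """Inverted dropout: zero each entry with probability p during
    training and rescale survivors by 1/(1-p); identity at test time."""
    if not training or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p   # keep each entry with probability 1 - p
    return x * mask / (1.0 - p)

activations = np.ones((2, 4))
print(dropout(activations, p=0.5))           # surviving entries become 2.0, the rest 0.0
print(dropout(activations, training=False))  # unchanged at test time
```

The rescaling keeps the expected activation constant between training and inference, which is the detail that makes dropout a pure training-time regularizer.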
The two latest SOTA results on COCO illustrate this last point. In 2017, MegDet used an optimization trick to win the COCO 2017 challenge. This month, the SOTA was beaten by a small margin using AutoAugment, a learned data augmentation method.
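The core idea behind data augmentation is easy to sketch: generate extra training samples by applying random, label-preserving transforms. The toy function below applies a random horizontal flip and a random crop to a NumPy image; this is only an illustration of the idea, whereas AutoAugment goes further and learns which transforms to apply.

```python
import numpy as np

def augment(image, rng, crop=2):
    """Toy augmentation: random horizontal flip plus a random crop
    that trims `crop` pixels in total from each spatial dimension."""
    # Random horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        image = image[:, ::-1]
    # Random crop: pick a random offset, keeping the output size
    # fixed at (H - crop, W - crop).
    h, w = image.shape[:2]
    top = rng.integers(0, crop + 1)
    left = rng.integers(0, crop + 1)
    return image[top:top + h - crop, left:left + w - crop]

rng = np.random.default_rng(42)
img = np.arange(64).reshape(8, 8)   # stand-in for an 8x8 grayscale image
print(augment(img, rng).shape)       # prints (6, 6)
```

Each call yields a slightly different view of the same image, which is effectively free extra training data.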
It will be exciting to see which will win the battle of optimization tricks versus data augmentation (see some code examples).