Data Science
A Data Science project can be divided into three big phases: data management, model training and analysis. Data management covers cleaning the data, engineering features and maintaining databases. Model training uses Machine Learning to build a model of the data. Finally, the results have to be analysed and the model's predictions presented to the customer in an understandable way, usually through reports or dashboards.
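To make the three phases concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the toy records, the engineered feature, the one-threshold "model" and the function names `manage_data`, `train_model` and `analyse` are hypothetical and not taken from any particular project or library.

```python
from statistics import mean

# Hypothetical raw records: (hours_of_use, temperature, failed).
# None marks a missing value.
raw = [
    (10.0, 60.0, 0),
    (12.0, None, 0),
    (30.0, 85.0, 1),
    (28.0, 90.0, 1),
    (None, 62.0, 0),
    (35.0, 88.0, 1),
]

def manage_data(records):
    """Phase 1: drop incomplete rows and engineer one feature."""
    clean = [r for r in records if None not in r]
    # Engineered feature (an illustrative choice): hours times temperature.
    return [(h * t, label) for h, t, label in clean]

def train_model(rows):
    """Phase 2: fit a one-threshold classifier, midway between class means.

    Assumes the positive class has the larger feature values.
    """
    mean_neg = mean(x for x, y in rows if y == 0)
    mean_pos = mean(x for x, y in rows if y == 1)
    threshold = (mean_neg + mean_pos) / 2
    return lambda x: int(x > threshold)

def analyse(model, rows):
    """Phase 3: summarise the results in a form suitable for a report."""
    correct = sum(model(x) == y for x, y in rows)
    return {"rows": len(rows), "accuracy": correct / len(rows)}

rows = manage_data(raw)
model = train_model(rows)
report = analyse(model, rows)
print(report)
```

In a real project each phase would be far larger, and the accuracy would be measured on held-out data rather than on the training rows, but the shape of the pipeline stays the same.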
In Machine Learning, the choice of algorithm depends on the problem. In industry, Decision Trees are probably the most widely used algorithm in Data Science. Curiously, until 2006 the most popular Machine Learning algorithm in the scientific community was the Support Vector Machine. Then 2012 brought one of the most successful breakthroughs in Artificial Intelligence: Deep Learning, which improved on previous results by more than 20%. Since then, many researchers have turned to Deep Neural Networks. Industry, however, has not followed suit: Data Scientists keep using Decision Trees or Logistic Regression for their day-to-day problems.
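Part of the appeal of Decision Trees is how simple their building block is. The sketch below fits a "decision stump", that is, a tree with a single split on a single feature. It is a simplified illustration under stated assumptions, not how production libraries implement trees: the function name `fit_stump` and the toy data are hypothetical, and the stump only considers the rule "predict 1 when x is above the threshold".

```python
def fit_stump(xs, ys):
    """Find the threshold on one feature that minimises misclassifications.

    The stump predicts 1 when x > threshold, else 0.
    """
    values = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_threshold, best_errors = None, len(xs) + 1
    for t in candidates:
        errors = sum((x > t) != bool(y) for x, y in zip(xs, ys))
        if errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold, best_errors

# Hypothetical toy data: machine age in years and whether it failed.
ages = [1, 2, 3, 6, 7, 8]
failed = [0, 0, 0, 1, 1, 1]

threshold, errors = fit_stump(ages, failed)
print(threshold, errors)
```

A full Decision Tree repeats this kind of split recursively on each resulting subset, usually with an impurity measure instead of a raw error count. The resulting rule ("failed if older than the threshold") is easy to explain to a customer, which is one likely reason these models remain popular in industry.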
When to use Deep Learning
Here are some of the key use cases:
- Computer vision: many difficult object identification problems can be solved using Convolutional Neural Networks.
- Natural Language Processing: Recurrent Neural Networks, specifically LSTM, are the current state of the art in this area.
- Time series forecasting: problems such as predictive maintenance and financial forecasting can be solved with Deep Learning.
- Problems with big datasets: Deep Learning is especially strong when the training set is on the order of millions of examples or more.
- When you have access to GPUs: the same algorithm can run up to 1000 times faster on a GPU.
- When you don’t need to look into the intermediate process: Neural Networks are black boxes, so only the input and the output are understandable.
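As an illustration of the computer vision case in the list above, the core operation of a Convolutional Neural Network layer can be sketched in plain Python. This is a minimal sketch under stated assumptions: the function name `conv2d`, the toy image and the hand-picked kernel are hypothetical, and a real network learns its kernel values from data and runs this operation through an optimised library, typically on a GPU.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2D cross-correlation, the operation CNN layers apply."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Hypothetical 4x4 image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked kernel that responds to left-to-right increases in intensity.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d(image, kernel)
print(feature_map)
```

The output is non-zero only where the kernel sits across the edge, which is how a convolutional layer turns raw pixels into a map of detected features. The nested loops also hint at why GPUs help: every output cell can be computed independently of the others.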
There is a big business opportunity around Deep Learning and it will come to Data Science very soon. Furthermore, there is a wide suite of open-source libraries: CNTK from Microsoft, TensorFlow from Google, Keras, Chainer, Theano, and MXNet.