
Don't use deep learning your data isn't that big

The other day Brian was at a National Academies meeting and he gave one of his usual classic quotes:

Best quote from NAS DS Round Table: “I mean, do we need deep learning to analyze 30 subjects?” – B Caffo @simplystats #datascienceinreallife

When I saw that quote I was reminded of the blog post Don’t use hadoop – your data isn’t that big. Just as with Hadoop at the time that post was written, deep learning has achieved a level of mania and hype that means people are trying it for every problem.

The issue is that only a very few places actually have the data to do deep learning. Sure, if you are Google and have everyone’s emails over the last decade, or if you are Facebook and have billions of tagged images, then deep learning makes sense. But I’ve always thought that the major advantage of using deep learning over simpler models is that if you have a massive amount of data you can fit a massive number of parameters.

When your dataset isn’t that big, doing something simpler is often more interpretable, and it works just as well because a model with a massive number of parameters can easily overfit a small training set. To test this idea I’m going to run an experiment on the digits data. I’m going to build a model to predict one versus zero, once using logistic regression and once using a deep neural network following this post. First, let’s load the packages we need:
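(The post’s original code isn’t reproduced here, so the sketches below are in Python with scikit-learn as an assumed stand-in; the library choice, model settings, and variable names are my assumptions, not the post’s.)

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
```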

Then load the data:
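A sketch of the loading step, assuming scikit-learn’s built-in digits dataset as a stand-in for the data used in the post:

```python
# Load the digits data and keep only the 0s and 1s for the binary problem
digits = load_digits()
mask = (digits.target == 0) | (digits.target == 1)
X, y = digits.data[mask], digits.target[mask]
```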

Now what I’m going to do is break the data into a training set and a testing set, leaving 20% for testing.
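Continuing the sketch, a stratified split with 20% held out as a fixed test set:

```python
# Hold out 20% of the examples as a fixed test set, stratified by class
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
```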

Using these data we can now try our experiment. I’m going to compare two methods:

1. Logistic regression using only the top 10 features most associated with the outcome (the simple, interpretable model).
2. A deep neural network using all of the features.

I’m going to create training sets of size 10 to 80, increasing by 5 each time. I’m going to do this 5 times at each size so I can try to average out some of the noise.
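Here is a sketch of that loop under the assumptions above; the class-balanced subsampling is my addition, to guard against drawing a training set with only one class at the smallest sizes:

```python
# For each training-set size, fit both models on 5 random subsamples
# and record their test-set accuracy.
sizes = list(range(10, 85, 5))
n_reps = 5
results = {"top10": {}, "deepnet": {}}
rng = np.random.default_rng(0)

for n in sizes:
    results["top10"][n], results["deepnet"][n] = [], []
    for _ in range(n_reps):
        # Draw a class-balanced training subsample of size n
        idx0 = rng.choice(np.where(y_train_full == 0)[0], n // 2, replace=False)
        idx1 = rng.choice(np.where(y_train_full == 1)[0], n - n // 2, replace=False)
        idx = np.concatenate([idx0, idx1])
        X_tr, y_tr = X_train_full[idx], y_train_full[idx]

        # Method 1: keep the 10 features most associated with the outcome,
        # then fit a logistic regression to just those features
        sel = SelectKBest(f_classif, k=10).fit(X_tr, y_tr)
        lr = LogisticRegression(max_iter=1000).fit(sel.transform(X_tr), y_tr)
        results["top10"][n].append(
            accuracy_score(y_test, lr.predict(sel.transform(X_test))))

        # Method 2: a small "deep" neural network fit to all the features
        nn = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=2000,
                           random_state=0).fit(X_tr, y_tr)
        results["deepnet"][n].append(accuracy_score(y_test, nn.predict(X_test)))
```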

Now we plot the accuracy of each method versus sample size, with vertical bars showing the 10th and 90th percentiles of the accuracy.
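A sketch of the plot; I use the median across the 5 repetitions as the point estimate, which is an assumption since the post doesn’t say which summary it plots:

```python
# Plot test-set accuracy vs. training-set size, with vertical bars
# spanning the 10th to 90th percentile across the 5 repetitions
for name, marker in [("top10", "o"), ("deepnet", "s")]:
    acc = np.array([results[name][n] for n in sizes])  # (n_sizes, n_reps)
    med = np.percentile(acc, 50, axis=1)
    lo, hi = np.percentile(acc, 10, axis=1), np.percentile(acc, 90, axis=1)
    plt.errorbar(sizes, med, yerr=[med - lo, hi - med],
                 marker=marker, capsize=3, label=name)
plt.xlabel("Training set size")
plt.ylabel("Test set accuracy")
plt.legend()
plt.show()
```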

For low training set sample sizes it looks like the simpler method (just picking the top 10 features and using a logistic regression) slightly outperforms the more complicated method. As the sample size increases, the more complicated method catches up and reaches comparable test set accuracy.

This is an extremely simple example but illustrates the larger point that Brian was making above. The sample size matters. If you are Google, Amazon, or Facebook and have near infinite data it makes sense to deep learn. But if you have a more modest sample size you may not be buying any accuracy – just losing interpretability. Although I guess you still get to keep the hype :).
