Why? Because of the fallacies of distributed computing, and because BuzzFeed didn't have a list for #BigData.
The Fallacies of Data Science
Adnan Masood & David Lazar
- Correlation = Causation, and Big Data = Information and Insights because Data Context Doesn't Matter.
- The random nature of the event drives the distribution, therefore the likely distribution also drives the events.
- Base Rate Fallacy only applies to small data-sets.
- Data dredging is negatively correlated with data size, i.e. the number of spurious correlations decreases with the number of dimensions of a data-set.
- In Data Science, past performance implies Future Results! Modeling assumptions can be held as absolute truths after experiments, and variables are normally distributed unless otherwise specified.
- Random sampling in experiment design and hypothesis testing is optional. Of course real-world data-sets don’t have cross-validation "leakage".
- Extrapolating beyond the range of training data, especially in the case of time series data, is fine provided the data-set is large enough.
- Strong Evidence is the same as a Proof! Prediction intervals and confidence intervals are the same thing, just like statistical significance and practical significance.
- Measurement Doesn't Change the System. Increasing the number of features increases the model's significance and accuracy.
- Over/under-fitting of a model can be addressed irrespective of the bias-variance trade-off.
- Bonus: Renaming your Analytics dept. to Data Science dept. gives you a data science discipline & specialty overnight.
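To see why the data-dredging item is a fallacy, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not something from the original list: the function name `count_spurious_pairs`, the 0.3 correlation threshold, and the sample sizes are all hypothetical choices. It fills a table with pure noise, so every column is independent of every other, and counts how many column pairs still look "correlated".

```python
import numpy as np


def count_spurious_pairs(n_rows, n_cols, threshold=0.3, seed=0):
    """Count pairs of independent noise columns whose |Pearson r| exceeds threshold.

    All names and defaults here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    # Every column is independent standard normal noise: no real relationships exist.
    data = rng.standard_normal((n_rows, n_cols))
    # Pairwise Pearson correlations between columns.
    corr = np.corrcoef(data, rowvar=False)
    # Look at each unordered pair once (strict upper triangle).
    upper = corr[np.triu_indices(n_cols, k=1)]
    return int(np.sum(np.abs(upper) > threshold))


if __name__ == "__main__":
    # Same number of rows, more and more dimensions.
    for n_cols in (10, 50, 200):
        print(n_cols, count_spurious_pairs(n_rows=50, n_cols=n_cols))
```

With the row count held fixed, the number of pairs grows roughly with the square of the number of columns, so the count of chance "findings" goes up with dimensionality, not down.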
Thanks to Dr. Jim Java for reading the earlier draft and providing comments.
Download Fallacies of Big Data
References & Further Reading