The Fallacies of Data Science

Why? Because of the fallacies of distributed computing, and because BuzzFeed didn't have a list for #BigData.

Adnan Masood & David Lazar

  1. Correlation = Causation, and Big Data = Information and Insights because Data Context Doesn't Matter.
  2. The random nature of the event drives the distribution, so the likely distribution also drives the events.
  3. Base Rate Fallacy only applies to small data-sets.
  4. Data dredging is negatively correlated with data size, i.e. the number of spurious correlations decreases with the number of dimensions of a data-set.
  5. In Data Science, past performance implies Future Results! Modeling assumptions can be held as absolute truths after experiments, and variables are normally distributed unless otherwise specified.
  6. Random sampling in experiment design and hypothesis testing is optional. Of course, real-world data sets don’t have cross-validation "leakage".
  7. Extrapolating beyond the range of training data, especially in the case of time series data, is fine provided the data-set is large enough.
  8. Strong Evidence is the same as Proof! Prediction intervals and confidence intervals are the same thing, just like statistical significance and practical significance.
  9. Measurement Doesn't Change the System. Increasing the number of features increases the model's significance and accuracy.
  10. Over/under-fitting of a model can be addressed irrespective of the bias-variance trade-off.
  11. Bonus: Renaming your Analytics dept. to Data Science dept. gives you a data science discipline & specialty overnight.
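
To see why fallacy 4 is a fallacy, here is a minimal sketch (not part of the original list) that counts "strong" correlations in data that is pure noise. The function names, the 30-row sample size, the 0.4 threshold, and the fixed seed are all assumptions picked for illustration. Because every column is independent random noise, any pair that crosses the threshold is spurious by construction.

```python
import random
from itertools import combinations


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def count_spurious_pairs(n_features, n_rows=30, threshold=0.4, seed=0):
    """Count feature pairs whose |r| exceeds `threshold` in pure noise.

    All parameter defaults are illustrative assumptions, not recommendations.
    """
    rng = random.Random(seed)
    # Every column is independent Gaussian noise, so no real relationship exists.
    columns = [[rng.gauss(0.0, 1.0) for _ in range(n_rows)]
               for _ in range(n_features)]
    return sum(1 for a, b in combinations(columns, 2)
               if abs(pearson(a, b)) > threshold)


if __name__ == "__main__":
    for n_features in (5, 20, 80):
        n_pairs = n_features * (n_features - 1) // 2
        hits = count_spurious_pairs(n_features)
        print(f"{n_features:3d} features, {n_pairs:5d} pairs, "
              f"{hits:4d} with |r| > 0.4")
```

The number of candidate pairs grows quadratically with the number of features, so the count of noise-only "findings" should grow along with it rather than shrink. Dredging a wider data-set gives you more spurious correlations to trip over.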

Thanks to Dr. Jim Java for reading an earlier draft and providing comments.

Download Fallacies of Big Data

References & Further Reading

Observer-expectancy effect 



4 thoughts on “The Fallacies of Data Science”

  1. Saying that "correlation equals causation" is a fallacy is itself a fallacy. After all, the way we establish causation is by showing correlation through multiple data sets, variables, etc.

    The fallacy is accepting a single correlation as being dispositive of causation, which is the danger when crossing the analyst / business boundary, which seems to act like the blood/brain barrier in stopping dangerous things like "caveats" or "considerations" from reaching the executive level.

    The problem we face is that simple platitudes like "correlation does not equal causation" can also cross the analyst / business boundary without any context or understanding, and become something analysts have to fight uphill against unnecessarily.
