On Verifiable, Reproducible Research in Computational Sciences

Recently I have been reading a few research papers by Dr. Szymon Jaroszewicz, co-author of "Scalable pattern mining with Bayesian networks as background knowledge", "Fast discovery of unexpected patterns in data, relative to a bayesian network" and "Interestingness of frequent itemsets using Bayesian networks as background knowledge". The papers stated that "A copy of the source code is available from the authors for research purpose", so I asked Dr. Jaroszewicz to send me a copy of the source code so that I could reproduce some of his research work for comparison purposes. To my surprise, he responded in less than two days and provided the entire library of Python-based code used in the papers. Unfortunately, this is not the norm in the research community; I have sent various similar emails to other researchers in the past that never received a reply.

At the 2011 SIAM Conference on Computational Science & Engineering, there was a mini-symposium on verifiable, reproducible research and computational science, where Randall J. LeVeque, Professor of Applied Mathematics and Adjunct Professor of Mathematics at the University of Washington, Seattle, gave a talk titled "Top 10 Reasons to NOT Share your Code and Why you Should Anyway". The presentation starts with the following hypothetical.

Imagine a world in which mathematics papers contain:


Lemmas, Theorems, Corollaries

No proofs 

Nobody expects to see a proof in a publication, or to ever have to submit one.

He continued with some hypothetical excuses, which ironically apply quite well to the absence of code in current research work.

1. The proof is too ugly to show anyone else.
2. I didn’t work out all the details.
3. The proof is valuable intellectual property.

Telling it like it is, Dr. LeVeque goes on to say that algorithms are often incompletely described, with graphs or tables demonstrating the claimed properties and lots of pretty pictures, but no actual code with all the details.

Any academic or researcher knows that some of these excuses are nonetheless legitimate: claims of intellectual property, the difficulty others may have in running the code (it may require proprietary software, run only on supercomputers, or need too many dependencies), and lack of credit, to name a few. However, the author claims that with a change of culture, these issues won't be as problematic as they appear. Some inspiring stories of reproducible research can be found here.

Sometimes researchers claim that it's not software, it's research code that isn't worth cleaning up for others to see. The author suggests that they will find the answer in this article: "Publish your computer code: it is good enough".

Sharing scientific software, data and knowledge is necessary for reproducible research. Unrestricted access to research outcomes and educational tools is an important driver of meaningful scientific discoveries. One good example is open source software developed by collaborative, meritocratic communities, which is openly tested, validated and documented as the basis for reliable scientific outcomes.

I take great pleasure in building and compiling a new algorithm in Weka while testing out the existing implementations. If you enjoy building it, you will definitely enjoy sharing it with others. If it's not share-worthy, was it really worth building?

Victoria Stodden - Reproducible Research Workshop 2011


References & Further Readings

Reproducible Research: Tools and Strategies for Scientific Computing

The Need for Reproducibility in Academic Research by Elizabeth Iorns

Scientists, Share Secrets or Lose Funding: Stodden and Arbesman

Another rant about academia and open source