Machine Learning
LA Machine Learning event on Mining Time Series Data w/ Sylvia Halasz
Last night's LA Machine Learning event on Mining Time Series Data w/ Sylvia Halasz of YP at OpenX Pasadena was quite interesting and well attended. Dr. Halasz spoke about the Adaptive Ensemble Kalman Filter and her work on correlating n-grams with flu outbreaks. Some of the associated papers follow.
- The ngram chief complaint classifier: A novel method of automatically creating chief complaint classifiers based on international classification of diseases groupings
- Detecting the start of the flu season
- Syndrome Surveillance - CDC
Causality, Probability, and Time - A Temporo-Philosophical Primer to Causal Inference with Case Studies
Causality, Probability, and Time by Dr. Samantha Kleinberg is a whirlwind yet original journey through the interdisciplinary study of probabilistic temporal logic and causal inference. Probabilistic causation is a fairly demanding field that examines the relationship between cause and effect using the tools of probability theory. Judea Pearl, in his seminal text "Causality: Models, Reasoning, and Inference", refers to this quandary by stating that
(causality) connotes lawlike necessity, whereas probabilities connote exceptionality, doubt, and lack of regularity.
Dr. Kleinberg's work provides a balanced introduction to background work on this topic while breaking new ground with a well-positioned approach to causality based on temporal logic. The envisioning problem is the problem of deducing the set of facts, possibly as the result of our actions leading to the decision problem. This is compounded by the need to find a timely and useful way to represent our knowledge about time, change, and chance.
In this ~260 page book, Dr. Kleinberg begins with a brief history of causality, leading to probability, logic and probabilistic temporal logic. The author then defines causality from several different facets, proceeding to causal inference, token causality and finally the case studies. With practical examples and algorithms, the author devises simple mathematical tools for analyzing the relationships between causal connections, inference, causal significance, model complexity, statistical associations, actions and observations.
Exploiting the temporal nature of probabilistic events, Dr. Kleinberg's research is a thought-provoking and valuable addition for the scientific community interested in learning causal effects and inference with respect to time. Built upon the works of the likes of Heckerman, Breese, Santos and Young, this book will shape the way probabilistic reasoning researchers think about temporal effects on causality for years to come.
David Hume believed that causes are invariably followed by their effects: "We may define a cause to be an object, followed by another, and where all the objects similar to the first, are followed by objects similar to the second." So, would you like a well-written, margin-annotation-laden text that provides a formal yet practical, case-study-based approach to this somewhat abstract concept of causality? Then look no further!
Bayesian Network Repositories Collections
A #NoteToSelf style post regarding a collection of Bayesian network repositories, including but not limited to bnet, net, bif, dsc and rda files.
- GeNIe and SMILE Network Repository
http://genie.sis.pitt.edu/networks.html
- BNLearn
http://www.bnlearn.com/bnrepository/
- Hebrew University Bayesian Network Repository
http://www.cs.huji.ac.il/~galel/Repository/
- DSL lab Network Repository
http://genie.sis.pitt.edu/networks.html
- Aalborg University Repository
http://www.cs.auc.dk/research/DSS/Misc/networks.html
- Norsys Bayes Net Library
http://www.norsys.com/networklibrary.html
- Encog Project - Example Bayesian Networks
http://www.heatonresearch.com/wiki/Example_Bayesian_Networks
The Theory That Would Not Die - An Engaging History of Bayesian Philosophy
As statistician Dennis Lindley famously said, "Inside every nonBayesian there is a Bayesian struggling to get out". It is safe to say that Sharon McGrayne's interesting tale of the trials and triumphs of Bayes' rule, or more accurately the Bayes-Laplace-Price rule, is an excellent historical journey that may help get your Bayesian out of the closet.
The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy makes for an interesting and captivating read, especially considering that writing about the history of mathematics and statistics for a general audience is a daunting task compared with relatively popular topics like astronomy or physics. In this easy read for a popular-science audience, the author covers over three hundred years of the history behind Bayes' rule, with its applications and engrossing stories of mathematical luminaries, some of whom thought it was a brilliant way to model real-life scenarios while others considered it unscientific and an exercise in futility, and vehemently fought against the idea of incorporating prior beliefs. Aside from providing thorough research on the subject matter, this text also delves into significant details about the lives and works of important scientists, mathematicians and statisticians including but not limited to Turing, Von Neumann, Price, Shannon, Bailey, Laplace, Fisher and Feynman. Regarding modern times, I was delighted to see Daphne Koller's and Heckerman's work mentioned, as well as the role Bayesian techniques have played in the contemporary discipline of machine learning.
Starting with the compelling statement
When the facts change, I change my opinion. What do you do, sir?
—John Maynard Keynes
the ups and downs of the adoption of Bayes' rule are presented as different eras, each forming a separate part of the book. The 17 chapters are divided into five parts, namely Enlightenment and the anti-Bayesian reaction, Second World War era, the glorious revival, to prove its worth and, finally, victory. Whether the author did a good job explaining Bayes' rule is a point of contention among earlier reviews. I agree that a few more concrete examples with algebraic expressions might have better explained how Bayesian priors, and their mathematical formulation by early luminaries in the field, make it easy to work without complex integrals. However, it should be noted that this book is not a course in causality and inference but rather a study of Bayesian thought through the centuries and its profound impact on science and technology. The book covers very well the advances of the 'Bayesian revolution' in a variety of fields including medical diagnosis, ecology, geology, computer science, artificial intelligence, machine learning, genetics, astrophysics, archaeology, education performance, sports modeling, and more.

Sharon McGrayne has picked a very relevant topic for a contemporary audience interested in mathematical and computational sciences, making this ~350 page book very informative, absorbing and pleasurable reading. Although light on technical details, proofs, mathematical equations and problems, this book delivers what it sets out to accomplish: to tell the story of Bayes' theory. "The theory that would not die" tells the story of a robust idea which is simple, intuitive, unsettling to the establishment and yet so resilient that, despite all the criticism from mainstream frequentists, it stayed alive and well. To quote from the book
"Bayes is still young. Probability did not have any mathematics in it until 1700. Bayes grew up in data-poor and computationally poor circumstances. It hasn't settled down yet. We can give it time. We're just starting."
A Deep Dive into Causality with Judea Pearl
For most researchers in the ever-growing fields of probabilistic graphical models, belief networks, causal influence and probabilistic inference, ACM Turing Award winner Dr. Judea Pearl and his seminal papers on causality are well known and acknowledged. Representing and determining causality, the relationship between an event (the cause) and a second event (the effect), where the second event is understood as a consequence of the first, is a challenging problem. Over the years, Dr. Pearl has written extensively on both the art and the science of cause and effect. In this book, “Causality: Models, Reasoning and Inference”, the inventor of Bayesian belief networks discusses and elaborates on his earlier work, including but not limited to Reasoning with Cause and Effect, Causal inference in statistics, Simpson's paradox, Causal Diagrams for Empirical Research, Robustness of Causal Claims, Causes and explanations, and Probabilities of causation: Bounds and identification.
In these eleven chapters followed by an epilogue, Dr. Pearl’s manuscript lays out a representational and computational foundation for the processing of information under uncertainty. It commences with an introduction to simpler concepts in Bayesian inference and causality, along with the corresponding proofs. However, as the text progresses into causal vs. statistical concepts and the theory of inferred causation, the theorems get arduous and somewhat counter-intuitive, and the text becomes demanding to keep up with. Chapter 3 is an interesting read where causality is discussed in the context of philosophy and history. As Dr. Liu states, Judea Pearl’s thesis regarding statistics is that it deals with quantitative constructs like mean, variance, correlation, regression, dependence, conditional independence, association, likelihood, collapsibility, risk ratio, odds ratio, marginalization, conditionalization, etc. Causal analysis, meanwhile, deals with the topics of randomization, influence, effect, confounding, disturbance, spurious correlation, intervention, explanation and attribution. One of the challenges in following Dr. Pearl’s work is that it treats causation abstractly, discussing it in a mathematical and philosophical manner without providing a concrete mathematical and computational model for applied research. I believe the book provides a great foundation for the formal representation of causal analysis and its components, such as do(x) to represent intervention.
The Automated Reasoning Group at UCLA has made some strides in this area; however, the applied research aspects of this formalism still need to be ‘tightly bound’, given the scarcity of empirical evidence for the algorithms in practice.
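To make the do(x) notation mentioned above a little more concrete, here is a minimal sketch of the difference between observing X = x and intervening to set X = x. It assumes a hypothetical three-variable model with a confounder (Z influences both X and Y, and X influences Y); the variable names and every probability below are made up for illustration and are not taken from the book. The interventional quantity is computed with the standard back-door adjustment over Z.

```python
from itertools import product

# Hypothetical binary model with a confounder: Z -> X, Z -> Y, X -> Y.
# All numbers are illustrative assumptions, not values from the book.
P_Z = {0: 0.7, 1: 0.3}                       # P(Z = z)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.9}              # P(X = 1 | Z = z)
P_Y1_GIVEN_XZ = {(0, 0): 0.1, (0, 1): 0.5,   # P(Y = 1 | X = x, Z = z)
                 (1, 0): 0.3, (1, 1): 0.7}


def joint(x, y, z):
    """P(X=x, Y=y, Z=z) from the factorization P(z) P(x|z) P(y|x,z)."""
    px = P_X1_GIVEN_Z[z] if x == 1 else 1 - P_X1_GIVEN_Z[z]
    py = P_Y1_GIVEN_XZ[(x, z)] if y == 1 else 1 - P_Y1_GIVEN_XZ[(x, z)]
    return P_Z[z] * px * py


def p_y1_given_x(x):
    """Observational P(Y=1 | X=x): we merely see that X took the value x."""
    numerator = sum(joint(x, 1, z) for z in (0, 1))
    denominator = sum(joint(x, y, z) for y, z in product((0, 1), repeat=2))
    return numerator / denominator


def p_y1_do_x(x):
    """Interventional P(Y=1 | do(X=x)) via back-door adjustment on Z."""
    return sum(P_Y1_GIVEN_XZ[(x, z)] * P_Z[z] for z in (0, 1))


if __name__ == "__main__":
    print("P(Y=1 | X=1)     =", round(p_y1_given_x(1), 3))  # ≈ 0.563
    print("P(Y=1 | do(X=1)) =", round(p_y1_do_x(1), 3))     # ≈ 0.42
```

In this toy model the observational probability is higher than the interventional one because seeing X = 1 also makes Z = 1 more likely, whereas setting X = 1 by intervention leaves the distribution of Z untouched.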
Caltech Entrepreneurs Forum Event – Big Data, Big Opportunities: Slides & Pictures
Recently attended the Big Data Event @ Caltech. The topic was Big Data, Big Opportunities: Predicting the Future One Byte at a Time, and the panel and speakers didn't disappoint. Following are the slide deck and pictures from the event.
A Truly Modern discourse in Bayesian Reasoning and Machine Learning
If you are searching for an exploratory text on probabilistic reasoning, basic graph concepts, belief networks, graphical models, statistics for machine learning, learning as inference, naïve Bayes, Markov models and machine learning concepts, look no further. Dr. Barber has done a praiseworthy job of describing key concepts in probabilistic modeling and the probabilistic aspects of machine learning. Don’t let the size of this 700 page, 28 chapter book intimidate you; it is surprisingly easy to follow and well formatted for the modern-day reader.
With excellent follow-ups in the form of summaries, code and exercises, Dr. David Barber, a reader at University College London, provides a thorough and contemporary primer in machine learning with Bayesian reasoning. Starting with probabilistic reasoning, the author provides a refresher showing that the standard rules of probability are a consistent, logical way to reason with uncertainty. He proceeds to discuss basic graph concepts and belief networks, explaining how we can reason with certain or uncertain evidence using repeated application of Bayes' rule. Since a belief network, a factorization of a distribution into conditional probabilities of variables dependent on parental variables, is a specific case of a graphical model, the book leads us into the discipline of representing probability models graphically. Following with efficient inference in trees and the junction tree algorithm, the text elucidates the key stages of moralization, triangulation, potential assignment, and message-passing.
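As a rough illustration of the belief network idea described above, the following sketch factorizes a joint distribution into conditional probabilities of each variable given its parents and answers a query by brute-force enumeration. The three-node network (Rain and Sprinkler as parents of WetGrass) and all of its probabilities are assumptions chosen for this example; they are not taken from the book, and enumeration is used here only because it is the simplest correct method for a tiny network, not because it is the junction tree algorithm.

```python
from itertools import product

# Hypothetical belief network: Rain -> WetGrass <- Sprinkler.
# Every probability below is an illustrative assumption.
VARIABLES = ("rain", "sprinkler", "wet")
P_RAIN = 0.2                                  # P(rain = 1)
P_SPRINKLER = 0.1                             # P(sprinkler = 1)
P_WET = {(0, 0): 0.0, (0, 1): 0.9,            # P(wet = 1 | rain, sprinkler)
         (1, 0): 0.8, (1, 1): 0.99}


def bernoulli(p_one, value):
    """Probability that a binary variable with P(1) = p_one takes `value`."""
    return p_one if value == 1 else 1.0 - p_one


def joint(assignment):
    """Joint probability as a product of each node's table given its parents."""
    rain, sprinkler, wet = (assignment[name] for name in VARIABLES)
    return (bernoulli(P_RAIN, rain)
            * bernoulli(P_SPRINKLER, sprinkler)
            * bernoulli(P_WET[(rain, sprinkler)], wet))


def posterior(target, evidence):
    """P(target = 1 | evidence), summing the joint over consistent assignments."""
    numerator = denominator = 0.0
    for values in product((0, 1), repeat=len(VARIABLES)):
        assignment = dict(zip(VARIABLES, values))
        if any(assignment[name] != value for name, value in evidence.items()):
            continue  # inconsistent with the observed evidence
        p = joint(assignment)
        denominator += p
        if assignment[target] == 1:
            numerator += p
    return numerator / denominator


if __name__ == "__main__":
    print(round(posterior("rain", {"wet": 1}), 3))                  # ≈ 0.695
    print(round(posterior("rain", {"wet": 1, "sprinkler": 1}), 3))  # ≈ 0.216
```

The second query shows "explaining away": once the sprinkler is known to be on, wet grass is much weaker evidence for rain.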
I particularly enjoyed the follow-up chapter called statistics for machine learning, which discusses the classical univariate distributions including the exponential, Gamma, Beta, Gaussian and Poisson. It summarizes a measure of the difference between distributions, the Kullback-Leibler divergence, and states that Bayes' rule enables us to achieve parameter learning by translating a prior parameter belief into a posterior parameter belief based on observed data. Learning as inference, naïve Bayes, learning with hidden variables and Bayesian model selection are followed by machine learning concepts. I found the sequence of chapters to be a bit off (shouldn’t graphical models be discussed before a specific case?) but since the book is rooted more in practice than in theorem proving, the order ultimately makes sense.
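Both ideas in the paragraph above fit in a few lines. This is a minimal sketch, assuming a coin-flip (Bernoulli) model with a Beta prior for the parameter-learning part and two small discrete distributions for the Kullback-Leibler part; the function names and the example numbers are my own and do not come from the book's code.

```python
import math


def beta_posterior(alpha, beta, successes, failures):
    """Conjugate update: Beta(alpha, beta) prior + Bernoulli data -> Beta posterior."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions given as equal-length lists.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute nothing.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    # A uniform Beta(1, 1) prior belief, then 7 successes and 3 failures observed.
    alpha, beta = beta_posterior(1, 1, successes=7, failures=3)
    print("posterior:", (alpha, beta), "mean:", beta_mean(alpha, beta))

    p, q = [0.5, 0.5], [0.9, 0.1]
    print("KL(p || q) =", kl_divergence(p, q))
    print("KL(q || p) =", kl_divergence(q, p))  # differs: KL is not symmetric
```

The prior mean of 0.5 moves to 8/12 after the data, which is the "prior belief into posterior belief" translation in its simplest form.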
Since nearest neighbor methods are general classification methods, the book continues from conditional mixtures of Gaussians into unsupervised linear dimension reduction, supervised linear dimension reduction, kernel extensions, Gaussian processes, mixture models, latent linear models, latent ability models, and discrete and continuous state Markov models, eventually arriving at distributed computation of models, sampling and the holy grail of deterministic approximate inference.
One of the many great things about this book is its practical and code-oriented approach; tips with applied insight like “Consistency methods such as loopy belief propagation can work extremely well when the structure of the distribution is close to a tree. These methods have been spectacularly successful in information theory and error correction.” make this text distinguished and indispensable.
Selected Papers on Interestingness Measures, Knowledge Discovery and Outlier Mining
- S. Abe and T. Inoue. Fuzzy support vector machines for multiclass problems. In ESANN 2002 Proceedings, pages 113-118, 2002.
- E.L. Allwein, R.E. Schapire, and Y. Singer. Reducing multiclass to binary: a unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113-141, 2000.
- P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53-58, 1989.
- Marc Benioff. Data, data everywhere: A special report on managing information. The Economist, February 2010.
- J.C. Bezdek, J.M. Keller, R. Krishnapuram, L.I. Kuncheva, and N.R. Pal. Will the real iris data please stand up? IEEE Transactions on Fuzzy Systems, 7(3), 1999.
- C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2:121-167, 1998.
- C. Chen, A. Liaw, and L. Breiman. Using random forest to learn imbalanced data. Technical report, Department of Statistics, UC Berkeley, 2004.
- V. Cherkassky and F. Mulier. Learning from Data: Concepts, Theory and Methods. John Wiley & Sons, Inc., 1998.
- R. Cilibrasi and P. Vitanyi. Clustering by compression. IEEE Transactions on Information Theory, 51(4):1523-1545, 2005.
- R. Cilibrasi and P. Vitanyi. Normalized web distance and word similarity. CoRR, abs/0905.4039, 2009.
- R. Cilibrasi, P. Vitanyi, and R. de Wolf. Algorithmic clustering of music. In WEDELMUSIC, pages 110-117, 2004.
- T. Downs, I. Wood, and M. Gallagher. Empirical evidence for ultrametric structure in multi-layer perceptron error surfaces. Neural Processing Letters, 16(2):177-186, 2002.
- A.A. Freitas. Are we really discovering "interesting" knowledge from data? Expert Update (the BCS-SGAI Magazine), 9(1):41-47, October 2006.
- L. Geng and H. J. Hamilton. Interestingness measures for data mining: A survey. ACM Comput. Surv., 38(3), 2006.
- M. Gori and F. Scarselli. Are multilayer perceptrons adequate for pattern recognition and verification? IEEE Trans. Pattern Anal. Mach. Intell., 20(11):1121-1132, 1998.
- P.M. Granitto, P.F. Verdes, and H.A. Cecatto. Neural network ensembles: evaluation of aggregation algorithms. arXiv, arXiv:cs.AI/0502006v1, 2005.
- S. Hashemi and T.P. Trappenberg. Using SVM for classification in data sets with ambiguous data. In International Conference on Information Systems, Analysis and Synthesis (SCI 2002), 2002.
- M. Hassoun. Fundamentals of Artificial Neural Networks. Massachusetts Institute of Technology, 1995.
- S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice-Hall Inc., second edition, 1999.
- Z. He, X. Xu, and S. Deng. Discovering cluster-based local outliers. Pattern Recognition Letters, 24(9-10):1641-1650, 2003.
- S. Hettich and S. D. Bay. KDD Cup 1999 data. UCI KDD Archive [http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html], 1999.
- L. Itti and P. Baldi. Bayesian surprise attracts human attention. In Proceedings of Neural Information Processing Systems, 2005.
- B. Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data (Data-Centric Systems and Applications). Springer, January 2007.
- M. Markou and S. Singh. Novelty detection: a review part 1: statistical approaches. Signal Processing, 83(12):2481-2497, December 2003.
- K. McGarry. A survey of interestingness measures for knowledge discovery. The Knowledge Engineering Review, 00:0:1-24, 2005.
- P. M. Murphy and M. J. Pazzani. Exploring the decision forest: an empirical investigation of Occam's razor in decision tree induction. J. Artif. Int. Res., 1(1):257-275, 1993.
- A. Orriols-Puig, J. Casillas, and E. Bernado-Mansilla. First approach toward online evolution of association rules with learning classifier systems. In GECCO '08: Proceedings of the 2008 GECCO conference companion on Genetic and evolutionary computation, pages 2031-2038, New York, NY, USA, 2008. ACM.
- Y.H. Pao and C.-Y. Shen. Visualization of pattern data through learning of non-linear variance-conserving dimension-reduction mapping. Pattern Recognition, 30(10):1705-1717, 1997.
- J.M. Puche, J.M. Benitez, and J.L. Mantas. Fuzzy pairwise multiclass support vector machines. In A. Gelbukh and C.A. Reyes-Garcia, editors, Mexican International Conference on Artificial Intelligence (MICAI), volume LNAI, pages 562-571. Springer-Verlag, 2006.
- M. Robnik-Sikonja. Improving random forests. In J.F. Boulicaut et al., editor, Machine Learning, ECML 2004, 2004.
- J. Schmidhuber. Driven by compression progress: A simple principle explains essential aspects of subjective beauty, novelty, surprise, interestingness, attention, curiosity, creativity, art, science, music, jokes. CoRR, abs/0812.4360, 2009.
- C. Shirky. It's not information overload. It's filter failure. Keynote Speech, September 2008.
- E. Suzuki. Data mining methods for discovering interesting exceptions from an unsupervised table. Journal of Universal Computer Science, 12(6):627-653, 2006. http://www.jucs.org/jucs_12_6/data_mining_methods_for.
- E. Suzuki. Lecture Notes in Computer Science, volume 5579/2009, chapter Compression-Based Measures for Mining Interesting Rules, pages 741-746. Springer Berlin / Heidelberg, 2009.
- P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining (First Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2005.
- D. Tax and R. Duin. Experiments with classifier combining rules. In Lecture Notes in Computer Science, volume 1857, pages 16-29, Berlin, 2000. Springer-Verlag.
- D. Tax and R.P.W. Duin. Using two-class classifiers for multiclass classification. In C. Suen, R. Kasturi, and D. Laurendeau, editors, Proceedings 16th International Conference on Pattern Recognition, volume II, pages 124-127, Quebec City, Canada, Aug. 11-15 2002. IEEE Computer Society Press.
- I. Tsang, J. Kwok, P. Cheung, and N. Cristianini. Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 6:363-392, 2005.
- L. H. Tsoukalas and R.E. Uhrig. Fuzzy and Neural Approaches in Engineering. John Wiley & Sons, Inc., New York, NY, USA, 1996.
- C. S. Wallace and D. M. Boulton. An information measure for classification. Computer Journal, 11(2):185-194, 1968.
- J.-S. Wang and J.-C. Chiang. An efficient data preprocessing procedure for support vector clustering. Journal of Universal Computer Science, 15(4):705-721, 2009. http://www.jucs.org/jucs_15_4/an_efficient_data_preprocessing.
- J. D. Williams. The Compleat Strategyst: Being a Primer on the Theory of Games of Strategy. Dover Publications, 1986.
Hilary Mason - Machine Learning for Hackers
An interesting beginner's talk for machine learning enthusiasts.
Ever tried to use a regular expression to parse an unstructured street address? This talk is an introduction to a few machine learning algorithms and some tips for integrating them where they make the most sense and will save you the most headaches.
Hilary Mason - Machine Learning for Hackers from BACON: things developers love on Vimeo.
On Bayesian Sensitivity Analysis in Digital Forensics
The idea of using Bayesian Belief Networks in digital forensics to quantify evidence has been around for a while now. Providing qualitative approaches to Bayesian evidential reasoning in digital meta-forensics is, however, relatively new in decision support systems research. For law enforcement, decision support and the application of data mining techniques to “soft” forensic evidence form a large area of Bayesian forensic statistics, which has shown how Bayesian networks can be used to infer the probability of defense and prosecution statements based on forensic evidence. Kevin B. Korb and Ann E. Nicholson's study on Sally Clark is Wrongly Convicted of Murdering Her Children, together with Linguistic Bayesian Networks for reasoning with subjective probabilities in forensic statistics, gives an insight into an important development which helps to quantify what forensic expert testimony means by "strong support".
The IEEE paper on Sensitivity Analysis of a Bayesian Network for Reasoning about Digital Forensic Evidence, published at the 3rd International Conference on Human-Centric Computing (HumanCom), 2010, is of particular interest since it has a comprehensive real-world list of evidence items and hypotheses.
A Bayesian network representing an actual prosecuted case of illegal file sharing over a peer-to-peer network has been subjected to a systematic and rigorous sensitivity analysis. Our results demonstrate that such networks are usefully insensitive both to the occurrence of missing evidential traces and to the choice of conditional evidential probabilities.
One of the co-authors, Dr. Overill, has also covered related ground in A Complexity Based Forensic Analysis of the Trojan Horse Defence.
The evidence nodes are as follows.
- Modification time of the destination file equals that of the source file
- Creation time of the destination file is after its own modification time
- Hash value of the destination file matches that of the source file
- BitTorrent client software is installed on the seized computer
- File link for the shared file is created
- Shared file exists on the hard disk
- Torrent file creation record is found
- Torrent file exists on the hard disk
- Peer connection information is found
- Tracker server login record is found
- Torrent file activation time is corroborated by its MAC time and link file
- Internet history record about the publishing website is found
- Internet connection is available
- Cookie of the publishing website is found
- URL of the publishing website is stored in the web browser
- Web browser software is available
- Internet cache record about the publishing of the torrent file is found
- Internet history record about the tracker server connection is found
- The seized computer was used as the initial seeder to share the pirated file on a BitTorrent network
while the following hypotheses stand.
- The pirated file was copied from the seized optical disk to the seized computer
- A torrent file was created from the copied file
- The torrent file was sent to newsgroups for publishing
- The torrent file was activated, which caused the seized computer to connect to the tracker server
- The connection between the seized computer and the tracker server was maintained

The authors conclude, vindicating the use of sparse evidence, that
The sensitivity analysis reported in this paper demonstrates that the BT BBN used in is insensitive to the occurrence of missing evidence and also to the choice of evidential likelihoods to an unexpected degree.
Our overall finding is gratifying because it implies that the exact choice of values for the inherently subjective evidential likelihoods is not as critical as might have been expected. Values falling within the consensus of experienced expert investigators are sufficiently reliable to be used in the BBN model. Furthermore, our results imply that the inability to recover one or more evidential traces in a digital forensic investigation is not generally critical for the probability of the investigatory hypothesis under consideration.
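To get a feel for why such a network can be insensitive, here is a deliberately simplified sketch. It is not the paper's BT BBN: it assumes a single hypothesis with a 0.5 prior, five evidence traces that are conditionally independent given the hypothesis (a naive Bayes structure), and invented likelihoods of 0.85 and 0.15. The trace names are shortened from the evidence list above. A missing trace is treated as unobserved and marginalized out, which is also an assumption of this sketch.

```python
def posterior(prior, likelihoods, observed):
    """P(H | observed traces) for a naive Bayes model.

    likelihoods: name -> (P(trace found | H), P(trace found | not H))
    observed:    name -> 1 (found) or 0 (searched for, not found);
                 traces absent from `observed` are marginalized out.
    """
    p_h, p_not_h = prior, 1.0 - prior
    for name, found in observed.items():
        p_if_h, p_if_not_h = likelihoods[name]
        p_h *= p_if_h if found else 1.0 - p_if_h
        p_not_h *= p_if_not_h if found else 1.0 - p_if_not_h
    return p_h / (p_h + p_not_h)


# Hypothetical likelihoods; the real model's values are not reproduced here.
TRACES = ["hash_matches", "client_installed", "torrent_file_exists",
          "tracker_login_found", "peer_info_found"]
LIKELIHOODS = {name: (0.85, 0.15) for name in TRACES}
ALL_FOUND = {name: 1 for name in TRACES}
PRIOR = 0.5

baseline = posterior(PRIOR, LIKELIHOODS, ALL_FOUND)


def missing_one_trace():
    """Posterior when each trace in turn could not be recovered."""
    return {name: posterior(PRIOR, LIKELIHOODS,
                            {k: v for k, v in ALL_FOUND.items() if k != name})
            for name in TRACES}


def perturbed(delta):
    """Posterior when P(trace found | H) is shifted by `delta`, one trace at a time."""
    results = {}
    for name, (p_if_h, p_if_not_h) in LIKELIHOODS.items():
        changed = dict(LIKELIHOODS)
        changed[name] = (p_if_h + delta, p_if_not_h)
        results[name] = posterior(PRIOR, changed, ALL_FOUND)
    return results


if __name__ == "__main__":
    print("baseline:", baseline)
    print("one trace missing:", missing_one_trace())
    print("one likelihood lowered by 0.2:", perturbed(-0.2))
```

With several corroborating traces, dropping one or shifting one subjective likelihood moves the posterior by well under one percentage point in this toy setup, which is the qualitative behaviour the authors report for their much richer model.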
For some reason, this reminded me of a recent read, SuperFreakonomics, in which the authors describe a terrorist-detection algorithm with the following black-box variable.
“What finally made it work was one last metric that dramatically sharpened the algorithm. In the interest of national security, we have been asked to not disclose the particulars; we’ll call it Variable X.
What makes Variable X so special?
For one, it is a behavioral metric, not a demographic one. The dream of anti-terrorist authorities everywhere is to somehow become a fly on the wall in a room full of terrorists. In one small important way, Variable X accomplishes that. Unlike most other metrics in the algorithm, which produce a yes or no answer, Variable X measures the intensity of a particular banking activity. While not unusual in low intensities among the general population, this behavior occurs in high intensities much more frequently among those who have other terrorist markers.
This ultimately gave the algorithm great predictive power. Starting with a database of millions of bank customers, Horsley was able to generate a list of about 30 highly suspicious individuals. According to his rather conservative estimate, at least 5 of those 30 are almost certainly involved in terrorist activities. Five out of 30 isn’t perfect—the algorithm misses many terrorists and still falsely identifies some innocents—but it sure beats 495 out of 500,495.”
Bayesian Belief Networks could well serve as a better probabilistic graphical model for such network-related algorithms, offering improved visibility into the prior and posterior probabilities involved.
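The comparison at the end of the quote is simple arithmetic on the share of flagged people who are genuine positives. A minimal sketch using only the figures quoted above (treating "at least 5 of 30" as exactly 5, which is an assumption on the conservative side; the function name is my own):

```python
def precision(true_positives, false_positives):
    """Share of flagged individuals who are genuine positives."""
    return true_positives / (true_positives + false_positives)


# Figures from the quoted passage.
with_variable_x = precision(true_positives=5, false_positives=30 - 5)
without_variable_x = precision(true_positives=495, false_positives=500_495 - 495)

if __name__ == "__main__":
    print("5 of 30:        ", with_variable_x)      # about 1 in 6
    print("495 of 500,495: ", without_variable_x)   # about 1 in 1,000
    print("improvement:    ", with_variable_x / without_variable_x)
```

On these numbers the short list is more than a hundred times richer in true positives than the larger one, which is the base-rate point the authors are making.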