# Machine Learning

## Machine Learning - On the Art and Science of Algorithms with Peter Flach

Over a decade ago, Peter Flach of Bristol University wrote a paper titled "On the state of the art in machine learning: A personal review" in which he reviewed several then-recent books on developments in machine learning. These included Pat Langley’s Elements of Machine Learning (Morgan Kaufmann), Tom Mitchell’s Machine Learning (McGraw-Hill), and Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations by Ian Witten and Eibe Frank (Morgan Kaufmann), among many others. Dr. Flach singled out Michael Berry and Gordon Linoff’s Data Mining Techniques for Marketing, Sales, and Customer Support (John Wiley) for its excellent writing style, citing the passage below and commenting, "I wish that all computer science textbooks were written like this."

“People often find it hard to understand why the training set and test set are “tainted” once they have been used to build a model. An analogy may help: Imagine yourself back in the 5th grade. The class is taking a spelling test. Suppose that, at the end of the test period, the teacher asks you to estimate your own grade on the quiz by marking the words you got wrong. You will give yourself a very good grade, but your spelling will not improve. If, at the beginning of the period, you thought there should be an ‘e’ at the end of “tomato”, nothing will have happened to change your mind when you grade your paper. No new data has entered the system. You need a test set!

Now, imagine that at the end of the test the teacher allows you to look at the papers of several neighbors before grading your own. If they all agree that “tomato” has no final ‘e’, you may decide to mark your own answer wrong. If the teacher gives the same quiz tomorrow, you will do better. But how much better? If you use the papers of the very same neighbors to evaluate your performance tomorrow, you may still be fooling yourself. If they all agree that “potatoes” has no more need of an ‘e’ than “tomato”, and you have changed your own guess to agree with theirs, then you will overestimate your actual grade on the second quiz as well. That is why the evaluation set should be different from the test set.” [3, pp. 76–77]
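The analogy in the passage above maps onto a three-way data split. What follows is only a minimal sketch using the Python standard library; the function name `three_way_split`, the default fractions and the toy word list are illustrative assumptions, not anything taken from Berry and Linoff's book.

```python
import random


def three_way_split(records, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle records and cut them into training, test, and evaluation sets.

    The names follow the quoted passage: the training set builds the model,
    the test set is used to adjust it, and the evaluation set is kept aside
    for the final, untainted estimate. Fractions are illustrative defaults.
    """
    if train_frac <= 0 or test_frac <= 0 or train_frac + test_frac >= 1:
        raise ValueError("fractions must be positive and leave room for evaluation")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    training = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    evaluation = shuffled[n_train + n_test:]  # everything that is left over
    return training, test, evaluation


# Hypothetical toy data: 100 spelling words.
words = [f"word_{i}" for i in range(100)]
training, test, evaluation = three_way_split(words)
print(len(training), len(test), len(evaluation))
```

The point of the sketch is that no record appears in more than one of the three sets, so the final score on the evaluation set is not inflated by anything the model or its tuning has already seen.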

With that standard of clarity in mind, when I recently came across *Machine Learning: The Art and Science of Algorithms that Make Sense of Data*, I decided to check it out and wasn't disappointed. Dr. Flach is Professor of Artificial Intelligence at the University of Bristol, and in this "future classic" he leaves no stone unturned when it comes to clarity and explainability. The book starts with a machine learning sampler and introduces the ingredients of machine learning before progressing quickly to binary classification and beyond. Written as a textbook and packed with examples, footnotes and figures, it covers concept learning, tree models, rule models, linear models, distance-based models, probabilistic models, features and ensembles, and concludes with machine learning experiments. I really enjoyed the "Important points to remember" section of the book as a quick refresher on the machine learning commandments.

The concept learning section seems to have been influenced by the author's own research interests, and the topic is not discussed in as much detail in other contemporary machine learning texts. I also found the frequent summaries of concepts quite helpful. Contrary to its subtitle and compared to its counterparts, however, the book is light on algorithms and code, possibly on purpose. While it explains the concepts with examples, the number of formal algorithms is kept to a minimum. This may aid clarity and help avoid recipe-book syndrome, but it also makes the book potentially less accessible to practitioners. Great at the basics, the text falls short on intermediate to advanced topics such as LDA, kernel methods, PCA, RKHS, and convex optimization. For instance, the "Matrix transformations and decompositions" material in chapter 10 could have been made an appendix, leaving room to expand on meaningful topics like LSA and use cases of sparse matrices (pg 327). This is definitely not the book's fault, but rather that of this reader expecting too much from an introductory text just because the author explains everything so well!

As a textbook on the art and science of algorithms, Peter Flach's book definitely delivers on the promise of clarity, with well-chosen illustrations and an example-based approach. Highly recommended reading for all who would like to understand the principles behind machine learning techniques.

The companion materials available for download generously include excerpts with background material and literature references, plus the full set of 540 lecture slides in PDF, including all figures in the book, along with the LaTeX beamer source of the slides.

## Demystification of Demystifying Machine Learning using nuML w/ Seth Juarez

Going for a little Benoit B. Mandelbrot recursion joke here with the title.

Seth Juarez (github) recently spoke to the Pasadena .NET user group on the topic of Practical Machine Learning using nuML. Seth is a wonderful speaker and educator, and nuML is an excellent library for getting started with machine learning in .NET. His explanations are very intuitive, even for people who have been working in the field for a while. During the talk and follow-up discussions, various technical references were made that went beyond the scope of the talk. To be fair to Seth, he covered a lot of material in an hour and a half, probably a couple of weeks' worth in a traditional ML course.

Therefore I decided to provide links to these underlying topics for the benefit of attendees in case anyone is interested in knowing more about them.

- No free lunch in search and optimization
- Probably approximately correct learning
- Kernelized Sorting for NLP Presentation - Paper by Seth
- QP Solver
- NP-Complete Problems
- Intuitive Explanation of Expectation Maximization
- Multi-class classification
- REPL
- Roslyn and Roslyn CTP Introduces Interactive Code for C#
- Expando Objects
- Cardinality vs Selectivity
- Microsoft Automatic Graph Layout Library
- Positive Definite Matrix
- Kernel Perceptron in Python
- Perceptrons and Kernels
- math.net numerics
- Matrix Slicing
- Vectors and Matrices
- CodeMash 2013 Repo and readme
- What is EM algorithm?
- k-means clustering
- Clustering Algorithms
- Bag of Words Model
- Cosine similarity vs Hamming distance
- Time series regression and generalized least squares
- Machine Learning Techniques for Stock Prediction
- Causality, Correlation and Brownian Motion

Happy Machine Learning!

## LA Machine Learning event on Mining Time Series Data w/ Sylvia Halasz

Last night's LA Machine Learning event on Mining Time Series Data w/ Sylvia Halasz of YP at OpenX Pasadena was quite interesting and well attended. Dr. Halasz spoke about the Adaptive Ensemble Kalman Filter and her work on correlating n-grams with flu outbreaks. Some of the associated papers follow.

- The ngram chief complaint classifier: A novel method of automatically creating chief complaint classifiers based on international classification of diseases groupings
- Detecting the start of the flu season
- Syndrome Surveillance - CDC

## Causality, Probability, and Time - A Temporo-Philosophical Primer to Causal Inference with Case Studies

Causality, Probability and Time by Dr. Samantha Kleinberg is a whirlwind yet original journey through the interdisciplinary study of probabilistic temporal logic and causal inference. Probabilistic causation is a fairly demanding area that examines the relationship between cause and effect using the tools of probability theory. Judea Pearl, in his seminal text "Causality: Models, Reasoning, and Inference", refers to this quandary by stating that

(causality) connotes lawlike necessity, whereas probabilities connote exceptionality, doubt, and lack of regularity.

Dr. Kleinberg's work provides a balanced introduction to the background work on this topic while breaking new ground with a well-positioned approach to causality based on temporal logic. The envisioning problem is the problem of deducing the set of facts, possibly as the result of our actions, leading to the decision problem. This is compounded by the need to find a timely and useful way to represent our knowledge about time, change, and chance.

In this ~260 page book, Dr. Kleinberg begins with a brief history of causality, leading to probability, logic and probabilistic temporal logic. The author then defines causality from several facets, proceeding to causal inference, token causality and finally the case studies. With practical examples and algorithms, the author devises simple mathematical tools for analyzing the relationships between causal connections, inference, causal significance, model complexity, statistical associations, actions and observations.

Exploiting the temporal nature of probabilistic events, Dr. Kleinberg's research is a thought-provoking and valuable addition for the scientific community interested in learning causal effects and inference with respect to time. Built upon the works of the likes of Heckerman, Breese, Santos and Young, this book will shape the way probabilistic reasoning researchers think about temporal effects on causality for years to come.

David Hume believed that causes are invariably followed by their effects: "We may define a cause to be an object, followed by another, and where all the objects similar to the first, are followed by objects similar to the second." So, would you like a well-written, margin-annotation-laden text which provides a formal and practical case-study-based approach to this somewhat abstract concept of causality? Then look no further!

## Bayesian Network Repositories Collections

A #NoteToSelf style post collecting Bayesian network repositories, including but not limited to bnet, net, bif, dsc and rda files.

- GeNIe and SMILE Network Repository
http://genie.sis.pitt.edu/networks.html

- BNLearn
http://www.bnlearn.com/bnrepository/

- Hebrew University Bayesian Network Repository
http://www.cs.huji.ac.il/~galel/Repository/

- DSL lab Network Repository
- Aalborg University Repository
http://www.cs.auc.dk/research/DSS/Misc/networks.html

- Norsys Bayes Net Library
http://www.norsys.com/networklibrary.html

- Encog Project - Example Bayesian Networks
http://www.heatonresearch.com/wiki/Example_Bayesian_Networks

## The Theory That Would Not Die - An Engaging History of Bayesian Philosophy

As statistician Dennis Lindley famously said, "Inside every nonBayesian there is a Bayesian struggling to get out." It would be safe to extrapolate that Sharon McGrayne's interesting tale of the trials and triumph of Bayes' rule, or more accurately the Bayes-Laplace-Price rule, is an excellent historical journey which may help get your Bayesian out of the closet.

**The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy** makes for an interesting and captivating read, especially considering that writing about the history of mathematics and statistics for a general audience is a daunting task compared with relatively popular topics like astronomy or physics. In this easy read for a popular-science audience, the author covers more than two centuries of the history behind Bayes' rule, with its applications and engrossing stories of mathematical luminaries, some of whom thought it was a brilliant way to model real-life scenarios while others considered it unscientific, an exercise in futility, and vehemently fought against the idea of incorporating prior beliefs. Aside from providing thorough research on the subject matter, the text also delves into significant detail about the lives and works of important scientists, mathematicians and statisticians including but not limited to Turing, Von Neumann, Price, Shannon, Bailey, Laplace, Fisher and Feynman. Regarding modern times, I was delighted to see Daphne Koller's and Heckerman's work mentioned, as well as the role Bayesian techniques have played in the contemporary discipline of machine learning.

Starting with the compelling statement

When the facts change, I change my opinion. What do you do, sir?

—John Maynard Keynes

the ups and downs of the adoption of Bayes' rule are presented as different eras, each forming a separate part of the book. The 17 chapters are divided into five parts, namely Enlightenment and the anti-Bayesian reaction, Second World War era, the glorious revival, to prove its worth and, finally, victory. Whether the author did a good job of explaining Bayes' rule is a point of contention among earlier reviews. I agree that a few more concrete examples with algebraic expressions might have helped explain how Bayesian priors, and their mathematical formulation by early luminaries in the field, make it easy to work without complex integrals. However, it should be noted that this book is not a course in the antiquity of causality and inference but rather a study of Bayesian thought through the centuries and its profound impact on science and technology. The book covers very well the advances of the 'Bayesian revolution' in a variety of fields including medical diagnosis, ecology, geology, computer science, artificial intelligence, machine learning, genetics, astrophysics, archaeology, education performance, sports modeling, and more.

Sharon McGrayne has picked a very relevant topic for a contemporary audience interested in mathematical and computational sciences, making this ~350 page book very informative, absorbing and pleasurable reading. Although light on technical details, proofs, mathematical equations and problems, the book delivers what it sets out to accomplish: to tell the story of Bayes' theory. "The theory that would not die" tells the story of a robust idea which is simple, intuitive, unsettling to the establishment and yet so resilient that, despite all the criticism from mainstream frequentists, it stayed alive and well. To quote from the book

"Bayes is still young. Probability did not have any mathematics in it until 1700. Bayes grew up in data-poor and computationally poor circumstances. It hasn't settled down yet. We can give it time. We're just starting."

## A Deep Dive into Causality with Judea Pearl

For most researchers in the ever-growing fields of probabilistic graphical models, belief networks, causal influence and probabilistic inference, ACM Turing Award winner Dr. Judea Pearl and his seminal papers on causality are well known and acknowledged. Representing and determining causality, the relationship between an event (the cause) and a second event (the effect), where the second event is understood as a consequence of the first, is a challenging problem. Over the years, Dr. Pearl has written extensively on both the art and science of cause and effect. In this book, “Causality: Models, Reasoning and Inference”, the inventor of Bayesian belief networks discusses and elaborates on his earlier work, including but not limited to Reasoning with Cause and Effect, Causal inference in statistics, Simpson's paradox, Causal Diagrams for Empirical Research, Robustness of Causal Claims, Causes and explanations, and Probabilities of causation: Bounds and identification.

In these eleven chapters followed by an epilogue, Dr. Pearl’s manuscript lays out a representational and computational foundation for the processing of information under uncertainty. It commences with an introduction of simpler concepts in Bayesian inference and causality, along with the corresponding proofs. However, as the text progresses into causal vs. statistical concepts and the theory of inferred causation, the theorems get arduous and somewhat counter-intuitive, and the text becomes demanding to keep up with. Chapter 3 is an interesting read where causality is discussed in the context of philosophy and history. As Dr. Liu states, Judea Pearl’s thesis is that statistics deals with quantitative constructs like mean, variance, correlation, regression, dependence, conditional independence, association, likelihood, collapsibility, risk ratio, odds ratio, marginalization, conditionalization, etc., whereas causal analysis deals with the topics of randomization, influence, effect, confounding, disturbance, correlation, intervention, explanation and attribution. One of the challenges in following Dr. Pearl’s work is that it abstracts causation, discussing it in a mathematical and philosophical manner without providing a concrete mathematical and computational model for applied research. I believe the book provides a great foundation for the formal representation of causal analysis and its components, such as do(x) to represent intervention.
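To make the do(x) notation a little more concrete, here is a minimal sketch that contrasts conditioning on X with intervening on X in a tiny model with one confounder. The graph (Z influences both X and Y, and X influences Y) and every probability in it are made-up assumptions for illustration, not an example from the book; the interventional quantity uses the standard back-door adjustment, P(y | do(x)) = Σ_z P(y | x, z) P(z).

```python
# Hypothetical binary model: Z -> X, Z -> Y and X -> Y. All numbers are invented.
p_z = {0: 0.5, 1: 0.5}                       # P(Z = z)
p_x1_given_z = {0: 0.2, 1: 0.8}              # P(X = 1 | Z = z)
p_y1_given_xz = {(0, 0): 0.1, (0, 1): 0.5,   # P(Y = 1 | X = x, Z = z)
                 (1, 0): 0.3, (1, 1): 0.7}


def p_x_given_z(x, z):
    """P(X = x | Z = z) for binary x."""
    return p_x1_given_z[z] if x == 1 else 1.0 - p_x1_given_z[z]


def observational(x):
    """P(Y = 1 | X = x): seeing X = x also changes our belief about Z."""
    numerator = sum(p_y1_given_xz[(x, z)] * p_x_given_z(x, z) * p_z[z] for z in p_z)
    denominator = sum(p_x_given_z(x, z) * p_z[z] for z in p_z)
    return numerator / denominator


def interventional(x):
    """P(Y = 1 | do(X = x)) via back-door adjustment over the confounder Z."""
    return sum(p_y1_given_xz[(x, z)] * p_z[z] for z in p_z)


observed_gap = observational(1) - observational(0)
causal_effect = interventional(1) - interventional(0)
print(f"observed difference: {observed_gap:.2f}, effect under do(x): {causal_effect:.2f}")
```

With these invented numbers the observed difference (about 0.44) is more than twice the effect under intervention (about 0.20), because conditioning on X also carries information about the confounder Z, while do(x) cuts that dependence.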

The Automated Reasoning Group at UCLA has made some strides in this area; however, the applied research aspects of this formalism still need to be ‘tightly bound’ by reason of the scarcity of empirical evidence for the algorithms in practice.

## Caltech Entrepreneurs Forum Event – Big Data, Big Opportunities: Slides & Pictures

I recently attended the Big Data event at Caltech. The topic was Big Data, Big Opportunities: Predicting the Future One Byte at a Time, and the panel and speakers didn't disappoint. The slide deck and pictures from the event follow.

## A Truly Modern discourse in Bayesian Reasoning and Machine Learning

If you are looking for an exploratory text in probabilistic reasoning, basic graph concepts, belief networks, graphical models, statistics for machine learning, learning inference, naïve Bayes, Markov models and machine learning concepts, look no further. Dr. Barber has done a praiseworthy job of describing key concepts in probabilistic modeling and the probabilistic aspects of machine learning. Don’t let the size of this 700-page, 28-chapter book intimidate you; it is surprisingly easy to follow and well formatted for the modern-day reader.

With excellent follow-ups in the form of summaries, code and exercises, Dr. David Barber, a Reader at University College London, provides a thorough and contemporary primer in machine learning with Bayesian reasoning. Starting with probabilistic reasoning, the author provides a refresher showing that the standard rules of probability are a consistent, logical way to reason with uncertainty. He proceeds to discuss basic graph concepts and belief networks, explaining how we can reason with certain or uncertain evidence using repeated application of Bayes' rule. Since a belief network, a factorization of a distribution into conditional probabilities of variables dependent on parental variables, is a specific case of a graphical model, the book leads us into the discipline of representing probability models graphically. Followed by efficient inference in trees and the junction tree algorithm, the text elucidates the key stages of moralization, triangulation, potential assignment, and message-passing.
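As a rough illustration of that factorization idea, the sketch below builds a three-variable belief network and answers queries by brute-force enumeration of the joint. The burglary/earthquake/alarm structure and all probabilities are assumptions invented for this example rather than figures from the book, and enumeration is used instead of the junction tree algorithm purely to keep the code short.

```python
from itertools import product

# Hypothetical network: Burglary -> Alarm <- Earthquake. All numbers are invented.
p_burglary = {1: 0.01, 0: 0.99}              # P(B = b)
p_earthquake = {1: 0.02, 0: 0.98}            # P(E = e)
p_alarm_on = {(0, 0): 0.001, (0, 1): 0.3,    # P(A = 1 | B = b, E = e)
              (1, 0): 0.9, (1, 1): 0.95}


def joint(b, e, a):
    """Belief-network factorization: P(b, e, a) = P(b) * P(e) * P(a | b, e)."""
    p_a = p_alarm_on[(b, e)] if a == 1 else 1.0 - p_alarm_on[(b, e)]
    return p_burglary[b] * p_earthquake[e] * p_a


def posterior_burglary(**evidence):
    """P(B = 1 | evidence), with evidence given as keywords b, e or a."""
    numerator = 0.0
    denominator = 0.0
    for b, e, a in product((0, 1), repeat=3):
        state = {"b": b, "e": e, "a": a}
        if any(state[name] != value for name, value in evidence.items()):
            continue  # this joint state contradicts the evidence
        p = joint(b, e, a)
        denominator += p
        if b == 1:
            numerator += p
    return numerator / denominator


print(posterior_burglary())            # prior belief in a burglary
print(posterior_burglary(a=1))         # belief after hearing the alarm
print(posterior_burglary(a=1, e=1))    # belief after also learning of an earthquake
```

With these made-up numbers, hearing the alarm raises the belief in a burglary from 1% to roughly 57%, and learning that an earthquake occurred drops it back to about 3%, since the earthquake already explains the alarm.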

I particularly enjoyed the follow-up chapter called statistics for machine learning, which discusses the classical univariate distributions including the exponential, Gamma, Beta, Gaussian and Poisson. It summarizes the Kullback-Leibler divergence as a measure of the difference between distributions, and states that Bayes' rule enables us to achieve parameter learning by translating a prior parameter belief into a posterior parameter belief based on observed data. Learning as inference, naïve Bayes, learning with hidden variables and Bayesian model selection are followed by machine learning concepts. I found the sequence of chapters to be a bit off (shouldn’t graphical models be discussed before a specific case?) but since the book is more rooted in practice than in theorem proving, the order ultimately makes sense.
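Both ideas in that paragraph fit in a few lines. The following is a minimal sketch, not code from the book: `kl_divergence` computes the Kullback-Leibler divergence between two discrete distributions, and `beta_update` applies the standard conjugate update that turns a Beta prior over a coin's bias into a Beta posterior after observing successes and failures. The function names and the example numbers are my own assumptions.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions (in nats)."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue          # terms with p_i = 0 contribute nothing
        if q_i == 0:
            return math.inf   # p puts mass where q has none
        total += p_i * math.log(p_i / q_i)
    return total


def beta_update(prior, successes, failures):
    """Bayes' rule for a Bernoulli parameter with a Beta(alpha, beta) prior.

    The posterior is Beta(alpha + successes, beta + failures).
    """
    alpha, beta = prior
    return alpha + successes, beta + failures


def beta_mean(params):
    """Mean of a Beta(alpha, beta) distribution."""
    alpha, beta = params
    return alpha / (alpha + beta)


# Hypothetical data: a mild prior Beta(2, 2), then 7 successes and 3 failures.
prior = (2, 2)
posterior = beta_update(prior, successes=7, failures=3)
print(posterior, beta_mean(prior), beta_mean(posterior))
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))
```

The KL divergence is zero only when the two distributions agree and is not symmetric in its arguments, which is why the book describes it as a measure of difference rather than a distance.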

Starting from nearest neighbor methods as general classification methods, the book continues with conditional mixtures of Gaussians and moves on to unsupervised linear dimension reduction, supervised linear dimension reduction, kernel extensions, Gaussian processes, mixture models, latent linear models, latent ability models, and discrete and continuous state Markov models, culminating in distributed computation of models, sampling and the holy grail of deterministic approximate inference.

One of the many great things about this book is its practical and code-oriented approach; tips with applied insight like “Consistency methods such as loopy belief propagation can work extremely well when the structure of the distribution is close to a tree. These methods have been spectacularly successful in information theory and error correction.” make this text distinguished and indispensable.

## Selected Papers on Interestingness Measures, Knowledge Discovery and Outlier Mining

- S. Abe and T. Inoue. Fuzzy support vector machines for multiclass problems. In ESANN 2002 Proceedings, pages 113-118, 2002.
- E.L. Allwein, R.E. Schapire, and Y. Singer. Reducing multiclass to binary: a unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113-141, 2000.
- P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53-58, 1989.
- Marc Benioff. Data, data everywhere: A special report on managing information. The Economist, February 2010.
- J.C. Bezdek, J.M. Keller, R. Krishnapuram, L.I. Kuncheva, and N.R. Pal. Will the real iris data please stand up? IEEE Transactions on Fuzzy Systems, 7(3), 1999.
- C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2:121-167, 1998.
- C. Chen, A. Liaw, and L. Breiman. Using random forest to learn imbalanced data. Technical report, Department of Statistics, UC Berkeley, 2004.
- V. Cherkassky and F. Mulier. Learning from Data: Concepts, Theory and Methods. John Wiley & Sons, Inc., 1998.
- R. Cilibrasi and P. Vitanyi. Clustering by compression. IEEE Transactions on Information Theory, 51(4):1523-1545, 2005.
- R. Cilibrasi and P. Vitanyi. Normalized web distance and word similarity. CoRR, abs/0905.4039, 2009.
- R. Cilibrasi, P. Vitanyi, and R. de Wolf. Algorithmic clustering of music. In WEDELMUSIC, pages 110-117, 2004.
- T. Downs, I. Wood, and M. Gallagher. Empirical evidence for ultrametric structure in multi layer perceptron error surfaces. Neural Processing Letters, 16(2):177-186, 2002.
- A.A. Freitas. Are we really discovering "interesting" knowledge from data? Expert Update (the BCS-SGAI Magazine), 9(1):41-47, October 2006.
- L. Geng and H. J. Hamilton. Interestingness measures for data mining: A survey. ACM Comput. Surv., 38(3), 2006.
- M. Gori and F. Scarselli. Are multilayer perceptrons adequate for pattern recognition and verification? IEEE Trans. Pattern Anal. Mach. Intell., 20(11):1121-1132, 1998.
- P.M. Granitto, P.F. Verdes, and H.A. Cecatto. Neural network ensembles: evaluation of aggregation algorithms. arXiv, arXiv:cs.AI/0502006v1, 2005.
- S. Hashemi and T.P. Trappenberg. Using svm for classification in data sets with ambiguous data. In International Conference on Information Systems, Analysis and Synthesis (SCI 2002), 2002.
- M. Hassoun. Fundamentals of Artificial Neural Networks. Massachusetts Institute of Technology, 1995.
- S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice-Hall Inc., second edition, 1999.
- Z. He, X. Xu, and S. Deng. Discovering cluster-based local outliers. Pattern Recognition Letters, 24(9-10):1641-1650, 2003.
- S. Hettich and S. D. Bay. Kdd cup 1999 data. UCI KDD Archive [http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html], 1999.
- L. Itti and P. Baldi. Bayesian surprise attracts human attention. In Proceedings Neural Information Processing Systems, 2005.
- B. Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data (Data-Centric Systems and Applications). Springer, January 2007.
- M. Markou and S. Singh. Novelty detection: a review part 1: statistical approaches. Signal Processing, 83(12):2481-2497, December 2003.
- K. McGarry. A survey of interestingness measures for knowledge discovery. The Knowledge Engineering Review, 00:0:1-24, 2005.
- P. M. Murphy and M. J. Pazzani. Exploring the decision forest: an empirical investigation of occam's razor in decision tree induction. J. Artif. Int. Res., 1(1):257-275, 1993.
- A. Orriols-Puig, J. Casillas, and E. Bernado-Mansilla. First approach toward online evolution of association rules with learning classifier systems. In GECCO '08: Proceedings of the 2008 GECCO conference companion on Genetic and evolutionary computation, pages 2031-2038, New York, NY, USA, 2008. ACM.
- Y.H. Pao and C.-Y. Shen. Visualization of pattern data through learning of non-linear variance-conserving dimension-reduction mapping. Pattern Recognition, 30(10):1705-1717, 1997.
- J.M. Puche, J.M. Benitez, and J.L. Mantas. Fuzzy pairwise multiclass support vector machines. In A. Gelbukh and C.A. Reyes-Garcia, editors, Mexican International Conference on Artificial Intelligence (MICAI), volume LNAI, pages 562-571. Springer-Verlag, 2006.
- M. Robnik-Sikonja. Improving random forests. In J.F. Boulicaut et al., editor, Machine Learning, ECML 2004, 2004.
- J. Schmidhuber. Driven by compression progress: A simple principle explains essential aspects of subjective beauty, novelty, surprise, interestingness, attention, curiosity, creativity, art, science, music, jokes. CoRR, abs/0812.4360, 2009.
- C. Shirky. It's not information overload. It's filter failure. Keynote Speech, September 2008.
- E. Suzuki. Data mining methods for discovering interesting exceptions from an unsupervised table. Journal of Universal Computer Science, 12(6):627-653, 2006. http://w ..... jucs. org/jucs_12_6/data_mining_methods_for.
- E. Suzuki. Lecture Notes in Computer Science, volume 5579/2009, chapter Compression-Based Measures for Mining Interesting Rules, pages 741-746. Springer Berlin / Heidelberg, 2009.
- P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining (First Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2005.
- D. Tax and R. Duin. Experiments with classifier combining rules. In Lecture Notes in Computer Science, volume 1857, pages 16-29, Berlin, 2000. Springer-Verlag.
- D. Tax and R.P.W. Duin. Using two-class classifiers for multi class classification. In C. Suen, R. Kasturi, and D. Laurendeau, editors, Proceedings 16th International Conference on Pattern Recognition, volume II, pages 124-127, Quebec City, Canada, Aug. 11-15 2002. IEEE Computer Society Press.
- I. Tsang, J. Kwok, P. Cheung, and N. Cristianini. Core vector machines: Fast svm training on very large data sets. Journal of Machine Learning Research, 6:363-392, 2005.
- L. H. Tsoukalas and R. E. Uhrig. Fuzzy and Neural Approaches in Engineering. John Wiley & Sons, Inc., New York, NY, USA, 1996.
- C. S. Wallace and D. M. Boulton. An information measure for classification. Computer Journal, 11(2):185-194, 1968.
- J.-S. Wang and J.-C. Chiang. An efficient data preprocessing procedure for support vector clustering. Journal of Universal Computer Science, 15(4):705-721, 2009. http://www.jucs.org/jucs_15_4/an_efficient_data_preprocessing.
- J. D. Williams. The Compleat Strategyst: Being a Primer on the Theory of Games of Strategy. Dover Publications, 1986.