# Algorithms

## Notes from Data Science with Azure Talk @ TAP

I recently spoke at the Tampa Analytics Group, a Microsoft-recognized data science group run by Joe Blankenship, on the topic of Data Science with Azure. The talk focused on Azure offerings, with a demo on how to write a map-reduce job in Azure using C#. Following are the slides.

## The Five Tribes of Machine Learning, and Other Algorithmic Tales

Pedro Domingos' **The Master Algorithm - How the Quest for the Ultimate Learning Machine Will Remake Our World** is an interesting and thought-provoking book about the state of machine learning, data science, and artificial intelligence.

Categorizing, classifying, and clearly representing the ideas around any rapidly evolving field is a hard job. Machine learning, with its multi-faceted approaches and ubiquitous implementations, is an especially challenging topic. To write about it in a comprehensive yet easily understandable (that is, non-jargon-ridden, non-hand-waving) way is a great accomplishment.

One thing I really enjoyed about this book is its taxonomy and classification of machine learning; even for people who have been in the industry for a while, it is hard to create such meaningful distinctions and clusters of ideas.

“Each of the five tribes of machine learning has its own master algorithm, a general-purpose learner that you can in principle use to discover knowledge from data in any domain. The symbolists’ master algorithm is inverse deduction, the connectionists’ is backpropagation, the evolutionaries’ is genetic programming, the Bayesians’ is Bayesian inference, and the analogizers’ is the support vector machine. In practice, however, each of these algorithms is good for some things but not others. What we really want is a single algorithm combining the key features of all of them: the ultimate master algorithm. For some this is an unattainable dream, but for many of us in machine learning, it’s what puts a twinkle in our eye and keeps us working late into the night.”

Starting with the question of whether you are a rationalist or an empiricist, and extending this analogy to the five tribes of machine learning, the author also challenges the notion of "intelligence" in a very direct manner. He argues that the skeptical knowledge engineer's dogma that AI cannot "beat" humans is based on an archaic Minsky/Chomsky school of thought, and that variants of the "poverty of the stimulus" argument are irrelevant for all practical intents and purposes; the outstanding success of deep learning is proof to the contrary. The author answers most of the usual argumentum ad logicam in chapter 2, where he essentially states that the proof is in the pudding: from autonomous vehicles to sentiment analysis, machine learning / statistical learners work, while hand-engineered expert systems built with human experts don't scale.

...learning-based methods have swept the field, to the point where it’s hard to find a paper devoid of learning. Statistical parsers analyze language with accuracy close to that of humans, where hand-coded ones lagged far behind. Machine translation, spelling correction, part-of-speech tagging, word sense disambiguation, question answering, dialogue, summarization: the best systems in these areas all use learning. Watson, the Jeopardy! computer champion, would not have been possible without it.

The book further elaborates by stating what the author intuitively knows (pun intended) to be a frequently heard objection:

...“Data can’t replace human intuition.” In fact, it’s the other way around: human intuition can’t replace data. Intuition is what you use when you don’t know the facts, and since you often don’t, intuition is precious. But when the evidence is before you, why would you deny it? Statistical analysis beats talent scouts in baseball (as Michael Lewis memorably documented in Moneyball), it beats connoisseurs at tasting, and every day we see new examples of what it can do. Because of the influx of data, the boundary between evidence and intuition is shifting rapidly, and as with any revolution, entrenched ways have to be overcome. If I’m the expert on X at company Y, I don’t like to be overridden by some guy with data. There’s a saying in industry: “Listen to your customers, not to the HiPPO,” HiPPO being short for “highest paid person’s opinion.” If you want to be tomorrow’s authority, ride the data, don’t fight it.

And of course the eureka! argument doesn't escape his criticism:

And some may say, machine learning can find statistical regularities in data, but it will never discover anything deep, like Newton’s laws. It arguably hasn’t yet, but I bet it will. Stories of falling apples notwithstanding, deep scientific truths are not low-hanging fruit. Science goes through three phases, which we can call the Brahe, Kepler, and Newton phases. In the Brahe phase, we gather lots of data, like Tycho Brahe patiently recording the positions of the planets night after night, year after year. In the Kepler phase, we fit empirical laws to the data, like Kepler did to the planets’ motions. In the Newton phase, we discover the deeper truths. Most science consists of Brahe- and Kepler-like work; Newton moments are rare. Today, big data does the work of billions of Brahes, and machine learning the work of millions of Keplers. If—let’s hope so—there are more Newton moments to be had, they are as likely to come from tomorrow’s learning algorithms as from tomorrow’s even more overwhelmed scientists, or at least from a combination of the two.

Whether you agree with the author's point of view or not, this is one of the best "big picture" readings on the state of machine learning and AI, and it will help you understand how things may (or may not) shape up in the next computing revolution.

## On Explainability of Deep Neural Networks

During a discussion yesterday with software architect extraordinaire David Lazar about how everything old is new again, the topic of deep neural networks and their amazing success came up. Unless one has been living under a rock for the past five years, the advancements in artificial neural networks (ANNs) have been quite significant and noteworthy. Since the thaw of the AI winter, the once frowned-upon technique has come a long way to become successful and relied upon in multiple problem spaces. From an apocryphal story which sums up the state of ANNs back in the day to the current state of ConvNets, with Google Translate squeezing deep learning onto a phone, significant progress has been made. We have all seen the dreamy images of Inceptionism: Going Deeper into Neural Networks, with great results in image classification and speech recognition while fine-tuning network parameters. Beyond the classical feats of Reading Digits in Natural Images with Unsupervised Feature Learning, Deep Neural Networks (DNNs) have shown outstanding performance on image classification tasks. We now have excellent results on MNIST, ImageNet classification with deep convolutional neural networks, and effective use of Deep Neural Networks for Object Detection.

Otavio Good of Google puts it quite well,

Five years ago, if you gave a computer an image of a cat or a dog, it had trouble telling which was which. Thanks to convolutional neural networks, not only can computers tell the difference between cats and dogs, they can even recognize different breeds of dogs.

Geoffrey Hinton et al. noted that

The best system in the 2010 competition got 47% error for its first choice and 25% error for its top 5 choices. A very deep neural net (Krizhevsky et al., 2012) gets less than 40% error for its first choice and less than 20% for its top 5 choices.

Courtesy: XKCD and http://pekalicious.com/blog/training/

So with all this fanfare, what could possibly go wrong?

In deep learning systems, where both the classifiers and the features are learned automatically, neural networks possess a grey side: the explainability problem.

Explainability and determinism in ML systems is a larger discussion, but limiting the scope to neural nets: when you see the Unreasonable Effectiveness of Recurrent Neural Networks, it is important to pause and ponder why it works. Is it good enough that I can peek into this black box by getting strategic heuristics out of the network, or infer the concept of a cat from a trained neural network by Building High-level Features Using Large Scale Unsupervised Learning? Does it make it a ‘grey box’ if we can extract word embeddings from the network in high-dimensional space, and thereby exploit similarities among languages for machine translation? The very non-deterministic nature of the process is problematic, in the sense that the choice of initial parameters, such as the starting point for gradient descent when training with back-propagation, is of key importance. How about retrainability? The imperviousness makes troubleshooting harder, to say the least.
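
The sensitivity to the gradient descent starting point mentioned above is easy to demonstrate even on a toy non-convex function. This is a minimal sketch of my own (not from any of the linked material): plain gradient descent on a quartic with two local minima lands in a different minimum depending purely on where it starts, which is the same issue neural network training faces in a much higher-dimensional loss landscape.

```python
# Gradient descent on the non-convex function f(x) = x^4 - 3x^2 + x,
# which has two local minima. The starting point alone decides which
# minimum we converge to.

def grad(x):
    return 4 * x**3 - 6 * x + 1  # f'(x)

def descend(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

left = descend(-2.0)    # converges near x ~ -1.30
right = descend(2.0)    # converges near x ~ +1.13
print(left, right)
```

Two runs of the exact same deterministic procedure, two different answers; with millions of parameters and random initialization, the effect only compounds.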

If you haven’t noticed, I am trying hard not to make this a pop-science alarmist post, but here is the leap I am going to take: the relative lack of explainability and transparency inherent in neural networks (and the research community’s relative complacency toward the approach ‘because it just works’), this idea of black-boxed intelligence, is probably what may lead to the larger issues identified by Gates, Hawking, and Musk. I would be the first one to state that this argument might be a stretch, an over-generalization of the shortcomings of a specific technique into a doomsday scenario, and we might be able to ‘decrypt’ the sigmoid so that all these fears go away. However, my fundamental argument stays: if the technique isn’t explainable, then with the ML proliferation we have today, the unintended consequences might be too real to ignore.

As weak AI ensembles edge toward strong AI, the concern about explainability grows. There is no denying that it can be challenging to understand what a neural network is really doing under those layers of approximated functions. In the happy-path scenario, when a network is trained well, we have seen repeatedly that it does achieve high-quality results. However, it is still perplexing to comprehend the underpinnings of how it does so. Even more alarmingly, if the network fails, it is hard to understand what went wrong. Can we really shrug off the skeptics fearful about the dangers that seemingly sentient artificial intelligence (AI) poses? As Bill Gates articulately said (practically refuting Eric Horvitz's position),

I am in the camp that is concerned about super intelligence. First the machines will do a lot of jobs for us and not be super intelligent. That should be positive if we manage it well. A few decades after that though the intelligence is strong enough to be a concern. I agree with Elon Musk and some others on this and don’t understand why some people are not concerned.

The non-deterministic nature of a technique like neural networks poses a larger concern in terms of understanding the confidence of the classifier. The convergence of a neural network isn’t really clear, while for an SVM, by contrast, it is fairly trivial to validate. Depicting the approximation of an ‘undocumented’ function as a black box is most probably a fundamentally flawed idea in itself. If we equate this with the biological thought process, the signals, and the corresponding trained behavior, we have an expected output based on the training set as an observer. However, in a non-identifiable model, the approximation provided by the neural network is fairly impenetrable for all intents and purposes.

I don’t think anyone with a deep understanding of AI and machine learning is really worried about Skynet at this point. As Andrew Ng said,

“Fearing a rise of killer robots is like worrying about overpopulation on Mars.”

The concern is more about adhering to the “but it works!” aka if-I-fits-I-sits approach (the mandatory cat meme goes here).

The sociological challenges associated with self-driving trucks, taxis, and delivery people and their effect on employment are real, but those are regulatory issues. The key issue lies at the heart of the technology and our understanding of its internals. Stanford's Katie Malone put it quite well in the Linear Digressions episode on neural nets.

Even though it sounds like common sense that we would like to have controls in place so that automation is not allowed to engage targets without human intervention, and luminaries like Hawking, Musk, and Wozniak would like to ban autonomous weapons, urging AI experts to act, our default reliance on black-box approaches may make this nothing more than wishful thinking. As Stephen Hawking said,

“The primitive forms of artificial intelligence we already have, have proved very useful. But I think the development of full artificial intelligence could spell the end of the human race. Once humans develop artificial intelligence it would take off on its own and redesign itself at an ever-increasing rate. Humans, who are limited by slow biological evolution, couldn’t compete and would be superseded.”

It might be fair to say that since we don’t completely understand a new technique, it makes us afraid (of change), and it will be adopted as the research moves forward. As great as the results are, for non-black-box or interpretable models such as regression (closed-form approximation) and decision trees / belief nets (graphical representations of deterministic and probabilistic beliefs), there is the comfort of determinism and understanding. We know today that small changes to a neural network's input can lead to significant changes in its output, one of the “intriguing” properties of neural networks. In their paper, the authors demonstrated that small changes can cause larger issues:

We find that deep neural networks learn input-output mappings that are fairly discontinuous to a significant extent. We can cause the network to misclassify an image by applying a certain hardly perceptible perturbation, which is found by maximizing the network’s prediction error….

We demonstrated that deep neural networks have counter-intuitive properties both with respect to the semantic meaning of individual units and with respect to their discontinuities.

The existence of the adversarial negatives appears to be in contradiction with the network’s ability to achieve high generalization performance. Indeed, if the network can generalize well, how can it be confused by these adversarial negatives, which are indistinguishable from the regular examples? Possible explanation is that the set of adversarial negatives is of extremely low probability….. However, we don’t have a deep understanding of how often adversarial negatives appears…
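
The perturbation idea in the quoted paper can be sketched in a few lines. The original work finds adversarial perturbations for deep networks with box-constrained L-BFGS; the toy version below is my own illustration of the same principle (not the paper's method) applied to a plain linear classifier: nudge every feature slightly against the gradient of the score, and the label flips even though no single feature moved much.

```python
import numpy as np

# A fixed linear classifier: predict +1 if w.x > 0, else -1.
w = np.array([0.5, -1.0, 2.0, 0.25])
x = np.array([1.0, 0.5, 1.0, 1.0])   # w.x = 2.25 -> class +1

def predict(v):
    return 1 if np.dot(w, v) > 0 else -1

# FGSM-style perturbation: move each feature by eps against the
# gradient of the score (for a linear model the gradient is just w).
eps = 0.7
x_adv = x - eps * np.sign(w)

print(predict(x), predict(x_adv))     # the label flips...
print(np.max(np.abs(x_adv - x)))      # ...yet no feature moved more than eps
```

For deep networks the gradient is no longer constant, but the same "small step against the gradient" recipe produces the hardly perceptible perturbations the paper describes.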

Let’s be clear that when we discuss the black-box nature of ANNs, we are not talking about a single-unit perceptron only being capable of learning linearly separable patterns (Minsky et al., ’69). It is well established that the XOR function's unlearnability in single-layer networks does not extend to multi-layer perceptrons (MLPs). Convolutional Neural Networks (CNNs) are a working proof to the contrary: biologically inspired variants of MLPs with the explicit assumption that the input comprises images, so certain properties can be embedded into the architecture. The point here is against the rapid adoption of a technique which is black-box in nature, with greater computational burden, inherent non-determinism, and proneness to over-fitting compared with its “better” counterparts. To paraphrase Jitendra Malik, without being an NN skeptic, there is no reason that multi-layer random forests or SVMs cannot achieve the same results. During the AI winter we made ANNs a pariah; aren’t we repeating the same mistake with other techniques now?
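
The XOR point is easy to make concrete. No single-layer perceptron can compute XOR, but a two-layer network can; the sketch below (hand-constructed weights of my choosing, not a trained network) encodes XOR as (x1 OR x2) AND NOT (x1 AND x2) using one hidden layer of threshold units.

```python
def step(z):
    # Heaviside threshold activation.
    return 1 if z > 0 else 0

def xor_mlp(x1, x2):
    # Hidden layer: h1 fires on OR, h2 fires on AND.
    h1 = step(x1 + x2 - 0.5)    # x1 OR x2
    h2 = step(x1 + x2 - 1.5)    # x1 AND x2
    # Output unit: OR but not AND.
    return step(h1 - h2 - 0.5)

print([xor_mlp(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # -> [0, 1, 1, 0]
```

One hidden layer is all it takes to escape the linear-separability limitation, which is exactly why the Minsky-era critique does not apply to MLPs.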

Recently, Elon Musk tweeted:

Worth reading Superintelligence by Bostrom. We need to be super careful with AI. Potentially more dangerous than nukes.

And even though things might not be so bad right now, let’s conclude with the following quote from Michael Jordan in IEEE Spectrum.

Sometimes those go beyond where the achievements actually are. Specifically on the topic of deep learning, it’s largely a rebranding of neural networks, which go back to the 1980s. … In the current wave, the main success story is the convolutional neural network, but that idea was already present in the previous wave. And one of the problems … is that people continue to infer that something involving neuroscience is behind it, and that deep learning is taking advantage of an understanding of how the brain processes information, learns, makes decisions, or copes with large amounts of data. And that is just patently false.

This also raises another fundamental question: is the pseudo-mimicry of biological neural nets actually a good approach to emulating intelligence? Or maybe Noam Chomsky has a point on Where Artificial Intelligence Went Wrong?

That we will talk about some other day.

**References**

- Neural Networks, Manifolds, and Topology
- The Future of AI: A Non-Alarmist Viewpoint
- Stephen Hawking warns artificial intelligence could end mankind
- A shallow introduction to the deep machine learning
- Computer science: The learning machines
- DARPA SyNAPSE Program artificialbrains.com
- On explainability in Machine Learning
- Killed by AI Much? A Rise of Non-deterministic Security!

## Learning F# Functional Data Structures and Algorithms is Out!

All praise be to Allah, Lord of the Worlds.

Wondering what to do on 4th of July long weekend? Learn Functional Programming in F# with my book!

I am glad to announce that my book, Learning F# Functional Data Structures and Algorithms, has been published and is now available via Amazon and other retailers. F# is a multi-paradigm programming language that encompasses object-oriented, imperative, and functional programming properties. The F# functional programming language enables developers to write simple code to solve complex problems.

Starting with the fundamental concepts of F# and functional programming, this book will walk you through basic problems, helping you to write functional and maintainable code. Using easy-to-understand examples, you will learn how to design data structures and algorithms in F# and apply these concepts in real-life projects. The book will cover built-in data structures and take you through enumerations and sequences. You will gain knowledge about stacks, graph-related algorithms, and implementations of binary trees. Next, you will understand the custom functional implementation of a queue, review sets and maps, and explore the implementation of a vector. Finally, you will find resources and references that will give you a comprehensive overview of the F# ecosystem, helping you to go beyond the fundamentals.

If you have just started your adventure with F#, then this book will help you take the right steps to become a successful F# coder. An intermediate knowledge of imperative programming concepts and a basic understanding of algorithms and data structures in .NET environments, using the C# language and the BCL (Base Class Library), would be helpful.

With detailed technical and editorial reviews, writing a technology book is a long process, but it is an equally rewarding and unique learning experience. I am thankful to my technical reviewer and the Packt editorial team for providing excellent support to make this a better book. Nothing is perfect, and to err is human; if you find any issues in the code or text, please let me know.

Learning F# Functional Data Structures and Algorithms - Get it via Amazon

Learning F# Functional Data Structures and Algorithms - Get it via Google Books

Learning F# Functional Data Structures and Algorithms - Get it via Packt

Happy Functional Programming!

The source code for the book can be downloaded from here.

## Visualizing Decision Boundaries for Deep Learning

A decision boundary is the region of a problem space in which the output label of a classifier is ambiguous. In this concise yet informative article, Dr. Takashi J. Ozaki outlines decision boundaries for deep learning and other machine learning classifiers and emphasizes parameter tuning for deep learning.
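
The usual recipe for visualizing a decision boundary is to classify every point of a dense grid and look for where the predicted label changes. Here is a minimal sketch of mine (plain NumPy with a nearest-centroid classifier; not the article's H2O code) of that grid-evaluation step:

```python
import numpy as np

# Two class centroids; classification = nearest centroid.
centroids = np.array([[0.0, 0.0], [4.0, 0.0]])

def predict(points):
    # Label each point with the index of its nearest centroid.
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

# Classify a dense grid covering the problem space; plotting this label
# field (e.g., with a contour plot) reveals the decision boundary.
xs, ys = np.meshgrid(np.linspace(-2, 6, 81), np.linspace(-2, 2, 41))
grid = np.c_[xs.ravel(), ys.ravel()]
labels = predict(grid).reshape(xs.shape)

print(labels.min(), labels.max())
```

For these two centroids the boundary is the perpendicular bisector, the vertical line x = 2; every grid point to its left gets label 0 and every point to its right gets label 1. The same grid trick works unchanged for any classifier with a `predict` function.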

The source code for this article is on GitHub. He uses H2O, one of the leading deep learning frameworks in Python, which is now also available in R.

Code: https://github.com/ozt-ca/tjo.hatenablog.samples/tree/master/r_samples/public_lib/jp

Deep Learning – Getting Started - important resources for learning and understanding

## MIT Machine Learning for Big Data and Text Processing Class Notes Day 5

On the final day (day 5) the agenda for the MIT Machine learning course was as follows:

- Generative models, mixtures, EM algorithm
- Semi-supervised and active learning
- Tagging, information extraction

The day started with Dr. Jaakkola's discussion on parameter selection, generative learning algorithms, Learning Generative Models via Discriminative Approaches, and Generative and Discriminative Models. This led to questions such as: What are some benefits and drawbacks of discriminative and generative models? What is the difference between a generative and a discriminative algorithm? And how to learn from little data - Comparison of Classifiers Given Little Training.

One of the reasons I truly enjoyed the course is that you always learn something new working with practitioners and participating academics. One such topic was Machine Learning Based Missing Value Imputation Method for Clinical Dataset. We revisited Missing Data Problems in Machine Learning, Imputation of Missing Data Using Machine Learning Techniques, the class imbalance problem, Machine Learning from Imbalanced Data Sets 101, and Data Mining Challenges for Imbalanced Datasets.

Feature reduction and data transformation are really helpful for model building. Dr. Jaakkola talked about how to architect feature structures. We reviewed ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches, and Feature Set Embedding for Incomplete Data, before moving on to Naive Bayes.

Since Naive Bayes classifiers (excellent tutorial) perform so well, we had to find some problems with them! Hence Naive Bayes with unbalanced classes, Naive Bayes classifier for unequal groups, Tackling the Poor Assumptions of Naive Bayes Text Classifiers, Naive Bayes for Text Classification with Unbalanced Classes, the Class Imbalance Problem, and Techniques for Improving the Performance of Naive Bayes for Text Classification. The elephant-in-the-room question was: when does Naive Bayes perform better than SVM?

Transfer learning was also something I needed a refresher on. As stated

When the distribution changes, most statistical models need to be rebuilt from scratch using newly collected training data. In many real-world applications, it is expensive or impossible to recollect the needed training data and rebuild the models. It would be nice to reduce the need and effort to recollect the training data. In such cases, knowledge transfer or transfer learning between task domains would be desirable.

From: http://www1.i2r.a-star.edu.sg/~jspan/publications/TLsurvey_0822.pdf

Dr. Jaakkola then did a final review of supervised techniques (classification, rating, ranking), unsupervised techniques (clustering, mixture models), and semi-supervised techniques (making use of labelled and unlabelled data by means of clustering), along with generative techniques (Naive Bayes) and discriminative techniques (Perceptron/PA, SVM, boosting, neural networks, random forests), and emphasized the fact that discriminative analysis is performance driven.

Then someone asked: which is the best off-the-shelf classifier, i.e., when I have to do the least work myself? SVM is one of the obvious choices, but here is some food for thought as well.

If we bag or boost trees, we can get the best off-the-shelf prediction available. Bagging and Boosting are ensemble methods that combine the fit from many (hundreds, thousands) of tree models to get an overall predictor.

http://faculty.mccombs.utexas.edu/carlos.carvalho/teaching/Sec4_Trees.pdf

Here is a good tutorial on the ensemble learning approach (committee-based methods; bagging for increasing stability; boosting, the ”best off-the-shelf classification technique”; random forests, etc.), and this one explaining the Success of AdaBoost and Random Forests as Interpolating Classifiers. We discussed this a couple of days ago, but it is still a good reference: CNN Features off-the-shelf: an Astounding Baseline for Recognition. Dr. Jaakkola emphasized that neural networks are highly flexible and efficient at test time, but difficult and time-consuming to train. For Scaling Machine Learning, this is a great resource.
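
To make the "best off-the-shelf" claim a bit more concrete, here is a small self-contained AdaBoost-over-decision-stumps sketch (a toy example of my own, not lecture code): a single threshold stump cannot separate an interval-shaped class, but a weighted vote of a few boosted stumps can.

```python
import numpy as np

# Toy 1-D data: class +1 inside the interval [4, 7], -1 outside.
# No single threshold stump can fit this, but boosting can.
X = np.arange(1, 11, dtype=float)
y = np.where((X >= 4) & (X <= 7), 1, -1)

def stump_predict(x, thresh, polarity):
    return np.where(polarity * (x - thresh) > 0, 1, -1)

def best_stump(w):
    # Exhaustively pick the stump with the lowest weighted error.
    best = None
    for thresh in np.arange(0.5, 11.0):
        for polarity in (1, -1):
            err = w[stump_predict(X, thresh, polarity) != y].sum()
            if best is None or err < best[0]:
                best = (err, thresh, polarity)
    return best

def adaboost(T=10):
    w = np.full(len(X), 1.0 / len(X))     # uniform sample weights
    ensemble = []                          # list of (alpha, thresh, polarity)
    for _ in range(T):
        err, thresh, polarity = best_stump(w)
        err = max(err, 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        pred = stump_predict(X, thresh, polarity)
        w *= np.exp(-alpha * y * pred)     # up-weight the mistakes
        w /= w.sum()
        ensemble.append((alpha, thresh, polarity))
    return ensemble

def ensemble_predict(ensemble, x):
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return np.where(score > 0, 1, -1)

model = adaboost()
acc = (ensemble_predict(model, X) == y).mean()
print(acc)
```

The best single stump here gets only 70% training accuracy, while the boosted committee fits the interval perfectly within a handful of rounds, which is the "committee beats any member" effect the tutorial describes.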

Stochastic gradient descent is a gradient descent optimization method for minimizing an objective function that is written as a sum of differentiable functions. After the mandatory gradient descent/ascent talk, we spoke again about scalability and Scalable Machine Learning. Here are a couple of good references: Scaling Up Machine Learning: Parallel and Distributed Approaches, and Scalable Machine Learning from a practitioner's perspective. Apache Mahout helps build an environment for quickly creating scalable, performant machine learning applications. Here is a good reference on Scalable Machine Learning - or - what to do with all that Big Data infrastructure.
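
The one-line definition above can be shown in miniature (a generic sketch of mine, not tied to any of the linked material): minimize f(w) = (1/n) Σᵢ (w − aᵢ)², whose exact minimizer is the sample mean, using only one randomly chosen term's gradient per step instead of the full sum.

```python
import random

random.seed(0)
data = [random.uniform(0, 10) for _ in range(200)]   # the a_i terms
true_mean = sum(data) / len(data)                    # the exact minimizer

# f(w) = (1/n) * sum_i (w - a_i)^2 ; the gradient of one term is 2*(w - a_i).
w = 0.0
lr = 0.005
for _ in range(20000):
    a = random.choice(data)          # sample one term of the sum...
    w -= lr * 2 * (w - a)            # ...and step along its gradient only

print(w, true_mean)                  # w ends up close to the mean
```

With a constant learning rate, SGD hovers in a small noise ball around the minimizer; a decaying learning rate schedule would make it converge exactly. That per-sample update is also precisely what makes it scale: one data point per step, no full pass over the sum.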

The dropout training technique is something else I learned about during this class. Preventing feature co-adaptation by encouraging independent contributions from different features often improves classification and regression performance. Dropout training (Hinton et al., 2012) does this by randomly dropping out (zeroing) hidden units and input features during the training of neural networks. Dropout: A Simple Way to Prevent Neural Networks from Overfitting and ICML Highlight: Fast Dropout Training are good resources. Dr. Jaakkola knows these techniques so well that he also discussed spectral embedding (*Laplacian Eigenmaps*) as one method to calculate a non-linear embedding. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation, Manifold Learning, and Nonlinear Methods are good follow-up places.
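
Mechanically, dropout is just a random zeroing mask applied at training time. The sketch below (a generic "inverted dropout" illustration of mine, not the exact formulation from Hinton et al., 2012) scales the surviving units so the expected activation is unchanged, which is why nothing special is needed at test time.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, keep_prob=0.5, training=True):
    if not training:
        return activations                    # no-op at test time
    mask = rng.random(activations.shape) < keep_prob
    # Scale by 1/keep_prob so E[output] == input ("inverted dropout").
    return activations * mask / keep_prob

h = np.ones(100_000)                          # a layer of unit activations
dropped = dropout(h)

print(dropped.mean())                         # close to 1.0 in expectation
```

Each unit is either silenced or boosted to 2x, so no unit can rely on a specific partner being present; that randomness is what breaks the feature co-adaptation mentioned above.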

We surveyed diverse topics such as dimensionality reduction (Dimensionality Reduction Tutorial) and the covariance matrix, and answered questions like What is the usage of eigenvectors and eigenvalues in machine learning?, What are eigenvectors and eigenvalues?, and PCA.
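
A quick answer to the eigenvector question, in code: PCA is the eigendecomposition of the covariance matrix, and the eigenvector with the largest eigenvalue points along the direction of greatest variance. A generic sketch (synthetic data of my choosing):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D data stretched along the direction (1, 1)/sqrt(2).
direction = np.array([1.0, 1.0]) / np.sqrt(2)
t = rng.normal(0, 5, size=500)                 # large variance along `direction`
noise = rng.normal(0, 0.3, size=(500, 2))      # small isotropic noise
X = t[:, None] * direction + noise

cov = np.cov(X, rowvar=False)                  # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
top = eigvecs[:, -1]                           # principal component

# The principal component aligns with `direction` (up to sign).
print(abs(np.dot(top, direction)))
```

Projecting onto the top few eigenvectors is exactly the dimensionality reduction step: keep the directions with large eigenvalues, drop the rest.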

Image Courtesy AMPCAMP Berkeley

The afternoon session started with Dr. Barzilay's recommender systems talk. Various topics were discussed, including Slope One, the cold start problem, robust collaborative filtering, and "I like what you like". Dr. Barzilay talked about cases where recommendations may be similar in some ways and different in others: global vs. local comparisons, matrix factorization techniques for recommender systems, using SVD, and finally the quintessential Netflix challenge.

Some key resources included The “Netflix” Challenge - Predicting User Preferences using average prediction, item-based collaborative filtering, and Maximum Margin Matrix Factorization; GroupLens: An Open Architecture for Collaborative Filtering of Netnews; Explaining Collaborative Filtering Recommendations; Collaborative Filtering Recommender Systems; Item-to-Item Collaborative Filtering and Matrix Factorization; and The BellKor Solution to the Netflix Prize.
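
The matrix factorization idea behind several of these references fits in a screenful: learn low-rank user and item factors by stochastic gradient descent on the observed ratings only. This is a toy illustration of mine (Funk-style SGD with a made-up ratings matrix), not the BellKor solution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny ratings matrix; 0 marks a missing (unobserved) rating.
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
observed = R > 0

k = 2                                              # latent factor dimension
U = rng.normal(scale=0.1, size=(R.shape[0], k))    # user factors
V = rng.normal(scale=0.1, size=(R.shape[1], k))    # item factors

def rmse():
    err = (R - U @ V.T)[observed]
    return np.sqrt((err ** 2).mean())

before = rmse()
lr, reg = 0.01, 0.02
for _ in range(2000):                              # SGD over observed cells only
    for i, j in zip(*np.nonzero(observed)):
        e = R[i, j] - U[i] @ V[j]                  # prediction error
        U[i] += lr * (e * V[j] - reg * U[i])       # regularized gradient steps
        V[j] += lr * (e * U[i] - reg * V[j])

print(before, rmse())                              # training error drops sharply
```

The filled-in entries of `U @ V.T` are then the predicted ratings for the unobserved cells, which is the recommendation step itself.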

Last but not least, Dr. Barzilay talked about why Netflix never implemented the winning algorithm, and why simple but scalable approaches are better choices than their complex and resource-intensive counterparts.

It was a brilliant class with lots of networking and learning. Thanks to our instructors for their hard work and diligence in getting the knowledge and concepts across. The class ended with the certificate distribution and a Q&A session.

I plan to implement these ideas in practice. Happy Machine Learning!

**Miscellaneous Links**

- Recognizing hand-written digits- scikit-learn.org/0.11/auto_examples/plot_digits_classification.html
- An Introduction to Machine Learning Theory and Its Applications: A Visual Tutorial with Examples www.toptal.com/machine-learning/machine-learning-theory-an-introductory-primer
- Machine Learning 101: General Concepts www.astroml.org/sklearn_tutorial/general_concepts.html
- Linear Algebra for Machine Learning machinelearningmastery.com/linear-algebra-machine-learning/
- Develop new models to accurately predict the market response to large trades. https://www.kaggle.com/c/AlgorithmicTradingChallenge
- ML exams
- http://www.cs.cmu.edu/~guestrin/Class/10701/additional/
- http://www.cs.cmu.edu/~guestrin/Class/10701/additional/final-s2006.pdf
- http://www.cs.cmu.edu/~guestrin/Class/10701/additional/final-s2006-handsol.pdf

## MIT Machine Learning for Big Data and Text Processing Class Notes Day 4

On day 4 of the machine learning course, the agenda was as follows:

- Unsupervised learning, clustering
- Dimensionality reduction, matrix factorization, and
- Collaborative filtering, recommender problems

The day started with **Regina Barzilay**'s (Bio) (Personal Webpage) talk on determining the number of clusters in a data set and approaches to determine the correct number of clusters. The core idea addressed was the difference between supervised, unsupervised, and semi-supervised feature selection algorithms, and Supervised/Unsupervised/Semi-supervised Feature Selection for Multi-Cluster/Class Data. Dr. Barzilay discussed Voronoi diagrams and Voronoi-diagram-based clustering algorithms, leading to Lloyd's algorithm (Voronoi iteration). The lecture also included the simple yet effective k-means clustering and k-medoids; the **k-medoids algorithm** is a clustering algorithm related to the k-means algorithm and the medoidshift algorithm. The elbow method was also briefly discussed.

**Lloyd's Algorithm**
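
A minimal sketch of Lloyd's algorithm (generic k-means of my own, not lecture code): alternate between assigning every point to its nearest center and moving each center to the mean of its assigned points, until the assignments stop changing.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated 1-D blobs around 0 and 10.
data = np.concatenate([rng.normal(0, 1, 100), rng.normal(10, 1, 100)])

def lloyd(points, k=2, iters=20):
    # Deterministic init: spread centers from min to max of the data.
    centers = np.linspace(points.min(), points.max(), k)
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        labels = np.abs(points[:, None] - centers[None, :]).argmin(axis=1)
        # Update step: each center moves to the mean of its points.
        centers = np.array([points[labels == c].mean() for c in range(k)])
    return np.sort(centers)

centers = lloyd(data)
print(centers)    # close to the true blob means, 0 and 10
```

The assignment step is exactly a Voronoi partition of the space around the current centers, which is why the algorithm is also called Voronoi iteration.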

Choosing the number of clusters, as well as what to 'cluster around', are both quite interesting problems. Google News relies on clustering, and how to measure the appeal of a product and determine how Google News clusters stories is a topic of immense interest. The instructor was asked about Under The Hood: Google News & Ranking Stories, and this link provides some insight: Patent Application Reveals Key Factors In Google News Algorithm.

**High Performance Text Processing in Machine Learning by Daniel Krasner**

The second part of the class was with Dr. **Tommi Jaakkola** (Bio) (Personal Webpage), who focused mainly on examples of mixture models, revised the k-means algorithm for finding clusters in a data set, reviewed the latent-variable view of mixture distributions, discussed how to assign data points to specific components of a mixture, and covered general techniques for finding maximum likelihood estimators in latent variable models.

The Expectation Maximization (EM) algorithm, and its explanation in the context of Gaussian mixture models (which motivate EM), took the majority of the time. Dr. Jaakkola talked about a framework for building complex probability distributions, a method for clustering data, and using social media to reinforce learning. The topic then moved on to sequence learning and a brief introduction to decision trees and random forests, Mixtures, EM, Non-parametric Models, as well as Machine Learning from Data: Gaussian Mixture Models.

Expectation Maximization (EM) is an iterative procedure that is very sensitive to initial conditions. The principle of garbage in, garbage out applies here, and therefore we need a good and fast initialization procedure. The Expectation Maximization Algorithm: A Short Tutorial explains a few techniques, including K-Means, hierarchical K-Means, Gaussian splitting, etc.
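
For a 1-D mixture of two Gaussians, the E and M steps fit in a screenful. This is a generic sketch of my own (synthetic data, a crude min/max initialization), not material from the tutorials linked here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Data drawn from two Gaussians: N(0, 1) and N(10, 1).
data = np.concatenate([rng.normal(0, 1, 300), rng.normal(10, 1, 300)])

def pdf(x, mu, var):
    # Gaussian density.
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Initialization (EM is sensitive to this, as noted above).
mu = np.array([data.min(), data.max()])
var = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])            # mixing weights

for _ in range(50):
    # E-step: responsibilities (posterior component probabilities).
    dens = np.stack([pi[k] * pdf(data, mu[k], var[k]) for k in range(2)], axis=1)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the weighted data.
    nk = resp.sum(axis=0)
    mu = (resp * data[:, None]).sum(axis=0) / nk
    var = (resp * (data[:, None] - mu) ** 2).sum(axis=0) / nk
    pi = nk / len(data)

print(np.sort(mu))    # close to the true means, 0 and 10
```

With well-separated components this converges almost immediately; with overlapping components or a poor initialization it can take many iterations or settle on a bad local optimum, which is exactly why the initialization techniques above matter.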

Here is a great tutorial by MathematicalMonk on (ML 16.3) Expectation-Maximization (EM) algorithm.

Mixture Models and EM Tutorial by Sargur Srihari

(ML 16.6) Gaussian mixture model (Mixture of Gaussians): an introduction to the mixture of Gaussians, a.k.a. the Gaussian mixture model (GMM), which is often used for density estimation and clustering.

In response to Henry Tan's query regarding how tensor analysis is applied to machine learning, Dr. Barzilay pointed to one of her papers as a resource.

The rest of the class continued with ML topics and practical advice on things like the use of log-likelihood, why clustering in high dimensions is extremely tricky, dimensionality reduction for supervised learning, random projections, feature selection, and, last but not least, the BIC - Model Selection Lecture V: The Bayesian Information Criterion.

Looking forward to tomorrow's final class on generative models, mixtures, the EM algorithm, semi-supervised and active learning, as well as tagging and information extraction.

**Misc**

- Machine Learning Mastery - Getting Started
- Data Scientist In a Can
- Atlas of Knowledge
- Machine Learning Math
- Math for machine learning
- What if I’m Not Good at Mathematics

Machine Learning with Scikit-Learn (I) - PyCon 2015

## MIT Machine Learning for Big Data and Text Processing Class Notes Day 3

**Regina Barzilay** (Bio) (Personal Webpage) gave an overview of the following.

- Cascades, boosting
- Neural networks, deep learning
- Back-propagation
- Image/text annotation, translation

Dr. **Barzilay** introduced BoosTexter to the class with a demo on a Twitter feed. BoosTexter is a general-purpose machine-learning program based on boosting for building a classifier from text and/or attribute-value data. It can be downloaded here, while step-by-step instructions can be found here as part of Assignment 2: Extracting Meaning from Text. The paper outlining BoosTexter is BoosTexter: A Boosting-based System for Text Categorization. An open-source implementation of BoosTexter (an AdaBoost-based classifier) can be found here.

Reference: http://www.ais.uni-bonn.de/deep_learning/

A question was brought up regarding where to get data; following are a few sources.

- ACL Anthology A Digital Archive of Research Papers in Computational Linguistics
- LDC Catalog https://catalog.ldc.upenn.edu

After the intro to BoosTexter, we jumped into ensemble learning and boosting. Here are some of the pertinent resources to the lecture.

- Boosting Simple Model Selection Cross Validation Regularization
- Introductory lecture - CSE 151 Machine Learning Instructor: Kamalika Chaudhuri
- Fighting the bias-variance tradeoff

Questions like Does ensembling (boosting) cause overfitting? came up, and we talked about how Machines read Twitter and Yelp so you don't have to. I think the most relevant resource can be summed up in the Foundations of Machine Learning lecture by Mehryar Mohri, Courant Institute and Google Research.

At this point, a detailed discussion of the loss function was in order. A loss function indicates the penalty for an incorrect prediction, and there are different kinds of loss (or cost) functions: zero-one loss (the standard loss function in classification), non-symmetric losses (e.g., ham vs. spam email classification, where the two mistakes are not equally costly), and squared loss (the standard loss function in regression). MSR's paper On Boosting and the exponential loss is a good starting point to follow this topic.
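
To make the distinctions concrete, here is a small numpy-only illustration (my own sketch, not from the lecture) of the losses mentioned above, with the classification losses written as functions of the margin m = y·f(x):

```python
import numpy as np

def zero_one_loss(margin):
    """1 if misclassified (margin <= 0), else 0: the standard classification loss."""
    return (margin <= 0).astype(float)

def squared_loss(y_true, y_pred):
    """(y - f(x))^2: the standard regression loss."""
    return (y_true - y_pred) ** 2

def exponential_loss(margin):
    """exp(-margin): the smooth surrogate that AdaBoost minimizes."""
    return np.exp(-margin)

margins = np.array([-1.0, 0.5, 2.0])
print(zero_one_loss(margins))     # [1. 0. 0.]
print(exponential_loss(margins))  # always an upper bound on zero-one loss
```

The exponential loss upper-bounds zero-one loss everywhere, which is why driving it down also drives down training error, the link the MSR paper explores.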

Speaking of boosting, we reviewed BOOSTING (ADABOOST ALGORITHM) by Eric Emer, Explaining AdaBoost by Robert E. Schapire, and Ensemble Learning. Some of the questions that came up: In boosting, why are the learners "weak"?, What is a weak learner?, How to boost without overfitting, Does ensembling (boosting) cause overfitting?, and Is AdaBoost less or more prone to overfitting?
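
A hedged sketch of the "weak learner" idea using scikit-learn rather than BoosTexter (which is what the class actually demoed): a single decision stump is only modestly accurate, and AdaBoost lifts it by reweighting the examples earlier stumps got wrong. The synthetic data stands in for real text features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the attribute-value features BoosTexter would use.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# A single decision stump: the canonical weak learner.
stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
weak_acc = stump.score(X, y)

# AdaBoost's default base learner is also a depth-1 stump; each round
# upweights the examples the previous rounds misclassified.
boosted = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)
print(weak_acc, boosted.score(X, y))
```

The boosted ensemble's training accuracy is well above the lone stump's, which is exactly the guarantee boosting provides for any better-than-chance learner.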

Misc topics included: so what happens when you don't make mistakes? Here comes Perceptron Mistake Bounds by Mehryar Mohri and Afshin Rostamizadeh, which talks about why the error rate doesn't become zero. See also Extracting Meaning from Millions of Pages, the Natural Language Toolkit NLTK - Extracting Information from Text, Parsing Meaning from Text, and of course scikit-learn for Ensemble methods.

**AdaBoost Demo**

After lunch, Dr. Tommi **Jaakkola** (Bio) (Personal Webpage) started with ANNs - neural networks. There was, of course, the mandatory mention of the AI Winter and how neural networks fell out of favor. Dr. Jaakkola spoke about Support Vector Machines vs. Artificial Neural Networks, What are the advantages of Artificial Neural Networks over Support Vector Machines?, Neural networks vs. support vector machines: are the second definitely superior?, etc. A good overview lecture on neural networks can be found here.

As Minsky said

[The perceptron] has many features to attract attention: its linearity; its intriguing learning theorem; its clear paradigmatic simplicity as a kind of parallel computation. There is no reason to suppose that any of these virtues carry over to the many-layered version. Nevertheless, we consider it to be an important research problem to elucidate (or reject) our intuitive judgment that the extension is sterile.

- Quote from Minsky and Papert's book, Perceptrons (1969)

The topic quickly converged to learning in multi-layer perceptrons - back-propagation and forward propagation. To cover stochastic gradient descent, non-linear classification, and neural networks (the multi-layer perceptron), this will give you a good overview. And of course, props to Minsky.

The rationale, essentially, is that perceptrons can only be highly accurate for linearly separable problems, whereas multi-layer networks (often called multi-layer perceptrons, or MLPs) can represent any complex non-linear target function. The challenge with multi-layer networks is that they provide no guarantee of convergence to the minimal-error weight vector. To hammer in these ideas, Exercise: Supervised Neural Networks is quite helpful. A few more relevant resources:
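
The textbook demonstration of this rationale is XOR. Here is a scikit-learn sketch (my own, with illustrative hyperparameters): a linear perceptron tops out at 3/4 accuracy on XOR, while a small MLP can fit it, and since the MLP has no convergence guarantee we restart it from a few random seeds and keep the best.

```python
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.neural_network import MLPClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])                    # XOR: not linearly separable

linear = Perceptron(max_iter=1000).fit(X, y)  # at most 3 of 4 correct

# Restart the MLP from several random initializations (no guarantee of
# reaching the minimal-error weights) and keep the best fit.
best = max(
    (MLPClassifier(hidden_layer_sizes=(8,), solver='lbfgs',
                   max_iter=5000, random_state=s).fit(X, y)
     for s in range(5)),
    key=lambda m: m.score(X, y))
print(linear.score(X, y), best.score(X, y))
```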

- Single Layer Neural Networks Hiroshi Shimodaira
- What is the difference between back-propagation and forward-propagation?
- Neural Networks Tutorial
- The Back-propagation Algorithm
- Vector Calculus: Understanding the Gradient (or derivative)
- Lecture 11: Feed-Forward Neural Networks
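
The forward- and back-propagation mechanics in the resources above can be sketched from scratch in numpy for a one-hidden-layer regression network (all names here are illustrative), with the analytic gradient checked against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))            # 5 examples, 3 features
y = rng.normal(size=(5, 1))            # regression targets
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(4, 1))

def forward(W1, W2):
    h = np.tanh(X @ W1)                # forward propagation through tanh layer
    return h, h @ W2

def loss(W1, W2):
    _, out = forward(W1, W2)
    return 0.5 * np.mean((out - y) ** 2)

# Back-propagation: the chain rule applied layer by layer, output to input.
h, out = forward(W1, W2)
d_out = (out - y) / len(X)             # dL/d_out
gW2 = h.T @ d_out                      # dL/dW2
d_h = (d_out @ W2.T) * (1 - h ** 2)    # back through tanh (tanh' = 1 - tanh^2)
gW1 = X.T @ d_h                        # dL/dW1

# Numerical check on one entry of W1 via forward difference.
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
num = (loss(W1p, W2) - loss(W1, W2)) / eps
print(abs(num - gW1[0, 0]))            # near zero: backprop matches the check
```

Gradient checking like this is a standard sanity test when implementing back-propagation by hand.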

The next topic was feature scaling; What is feature scaling? A question posed was whether applying feature scaling to the data before it is input into an artificial neural network makes the network converge faster. This is well explained in Coursera's Gradient Descent in Practice - What is Feature Scaling. This brought up How to "undo" feature scaling/normalization for output?, and How and why do normalization and feature scaling work?
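
A small sketch of both points, scaling and "undoing" it, using scikit-learn's `StandardScaler` (my choice of tool; the class did not prescribe one):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on wildly different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)                # zero mean, unit variance per column
X_back = scaler.inverse_transform(X_scaled)   # "undo" the scaling exactly
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

Because the scaler stores the per-feature mean and standard deviation, `inverse_transform` recovers the original units, which is how predictions made in scaled space are mapped back.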

The concluding topic of the day was convolutional neural networks. Convolutional Neural Networks (CNNs) are biologically-inspired variants of MLPs. Amazing things have been done with them; see CS231n Convolutional Neural Networks for Visual Recognition.
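
The core operation that distinguishes a CNN layer from an MLP layer can be written in a few lines of numpy (a minimal sketch; real CNNs add channels, strides, padding, and learned kernels):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image` ("valid" mode), summing elementwise products.
    Strictly this is cross-correlation, as in most deep learning libraries."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A tiny image with a vertical edge down the middle.
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
edge = np.array([[-1.0, 1.0]])   # responds where intensity jumps left-to-right
print(conv2d(image, edge))       # fires only at the edge column
```

The same small kernel is reused at every position, which is the weight sharing that makes CNNs so much more parameter-efficient than fully connected MLPs on images.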

There has been immense interest in topics like Large-scale Learning with SVM and Convolutional Nets for Generic Object Categorization, ImageNet Classification with Deep Convolutional Neural Networks, Convolutional Kernel Networks, and how convolutional networks can help generate text for images. Here are some of the relevant papers.

- CNN Features off-the-shelf: an Astounding Baseline for Recognition
- Text Understanding from Scratch
- END-TO-END TEXT RECOGNITION WITH CONVOLUTIONAL NEURAL NETWORKS
- Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts
- Convolutional Neural Networks for Sentence Classification
- Visualizing and Understanding Convolutional Network
- Caffe Deep learning framework by the BVLC
- A picture is worth a thousand (coherent) words: building a natural description of images
- Show and Tell: A Neural Image Caption Generator
- Generating Text with Recurrent Neural Networks
- The Unreasonable Effectiveness of Recurrent Neural Networks
- Building Fast High-Performance Recognition Systems with Recurrent Neural Networks and LSTM
- LONG SHORT-TERM MEMORY
- Convolutional Neural Net Image Processor Link

Looking forward to tomorrow's class!

## MIT Machine Learning for Big Data and Text Processing Class Notes - Day 2

So after having an awesome Day 1 @ MIT, I was in CSAIL library and met Pedro Ortega, NIPS 2015 Program Manager @adaptiveagents. Celebrity sighting!

Today on Day 2, Dr. **Jaakkola** (Bio) (Personal Webpage), professor of Electrical Engineering and Computer Science at the Computer Science and Artificial Intelligence Laboratory (CSAIL), went over the following.

- Non-linear classification and regression, kernels
- Passive aggressive algorithm
- Overfitting, regularization, generalization
- Content recommendation

Dr. Jaakkola's Socratic method of asking common-sense questions ingrains the core concepts in people's minds. The class started with a follow-up on the perceptron from yesterday and quickly turned into a session on when NOT to use a perceptron, such as for problems that are not linearly separable. Today's lecture was derived from 6.867 Machine Learning Lecture 8. The discussion extended to the Support Vector Machine (and Statistical Learning Theory) Tutorial, which is also well explained in An Idiot's guide to Support vector machines (SVMs) by R. Berwick, Village Idiot.
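
For reference, the mistake-driven perceptron update at the heart of that discussion fits in a few lines of numpy (my own sketch; it converges on linearly separable data and cycles forever otherwise, which is exactly when NOT to use it):

```python
import numpy as np

def perceptron(X, y, epochs=100):
    """On each misclassified point, update w += y_i * x_i and b += y_i."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:   # wrong side of (or on) the boundary
                w += yi * xi
                b += yi
                mistakes += 1
        if mistakes == 0:                # a full clean pass: converged
            break
    return w, b

# Linearly separable toy data: label is the sign of the first coordinate.
X = np.array([[2.0, 1.0], [1.0, -1.0], [-2.0, 1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
print(np.sign(X @ w + b))   # matches y once the data is separated
```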

Speaking of SVMs and dimensionality, Dr. Jaakkola posed the question of whether ranking can also be cast as a classification problem. Learning to rank, or machine-learned ranking (MLR), is a fascinating topic where common intuitions (the number of items displayed, error functions between a user's preference and the display order, sparseness) fall flat. Microsoft Research has some excellent reference papers and tutorials on learning to rank which are definitely worth poring over if you are interested in this topic. Label ranking by learning pairwise preferences was another topic discussed in detail during the class. Some reference papers follow:

- A Short Introduction to Learning to Rank
- Reviewing Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales
- LETOR: Learning to Rank for Information Retrieval Tutorials on Learning to Rank
- Ranking Methods in Machine Learning A Tutorial Introduction
- Yahoo! Learning to Rank Challenge Datasets
- Large Scale Learning to Rank
- Yahoo! Learning to Rank Challenge Overview
- Multiclass Classification: One-vs-all
- Zipf, Power-laws, and Pareto - a ranking tutorial Lada A. Adamic
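
The pairwise reduction that connects ranking back to classification can be sketched directly (my own toy construction with synthetic data, not from the class): ask "should item a outrank item b?" as a binary question on the feature difference, and a linear classifier on those differences recovers a scoring function.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 1.0])
scores = X @ true_w                       # hidden relevance scores

# Build pairwise examples: label 1 if item i should rank above item j.
i = rng.integers(0, n, 1000)
j = rng.integers(0, n, 1000)
keep = i != j
diff = X[i[keep]] - X[j[keep]]            # feature differences
pref = (scores[i[keep]] > scores[j[keep]]).astype(int)

ranker = LogisticRegression(max_iter=1000).fit(diff, pref)
w_hat = ranker.coef_.ravel()              # points along the true scoring direction
print(ranker.score(diff, pref))
```

Sorting items by `X @ w_hat` then reproduces the preference order, which is the essence of pairwise methods like RankSVM.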

Indeed, with SVMs the natural progression led to the 'k' word: kernel functions. A brief introduction to kernel classifiers by Mark Johnson, Brown University, is a good starting point, while The difference of kernels in SVM? and How to select a kernel for SVM provide good background material for understanding the practical aspects of kernels. See also Kernels and the Kernel Trick by Martin Hofmann, Reading Club "Support Vector Machines".
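
A quick illustration of why the kernel trick matters (scikit-learn assumed, data synthetic): concentric circles defeat any linear boundary, but an RBF-kernel SVM separates them by working in an implicit feature space it never constructs explicitly.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: no straight line can separate them.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel='linear').fit(X, y)
rbf = SVC(kernel='rbf', gamma='scale').fit(X, y)  # implicit non-linear map
print(linear.score(X, y), rbf.score(X, y))
```

The linear SVM hovers near chance while the RBF one fits the rings almost perfectly, all without ever computing the high-dimensional features explicitly.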

The afternoon topic was anomaly detection; use cases included aberrant behavior in financial transactions, insurance fraud, bot detection, manufacturing quality control, etc. One of the most comprehensive presentations on Anomaly Detection Data Mining Techniques is by Francesco Tamberi, which is great for background. Several problems worked on during the class were from 6.867 Machine Learning, which shows how the instructors carefully catered the program to practitioners, with the right content from graduate-level courses as well as industry use cases. Other topics discussed included linear versus nonlinear classifiers, and we learned how the decision boundary is the region of a problem space in which the output label of a classifier is ambiguous. Class discussions and Q&A touched on a wide variety of subjects, including but not limited to How to increase accuracy of classifiers? and Recommendation Systems: A Comparative Study of Collaborative Filtering Algorithms, which eventually led to Deep Learning Tutorial: From Perceptrons to Deep Networks, which performed really well on the MNIST database of handwritten digits.

- Caltech 101
- THE MNIST DATABASE of handwritten digits
- Why do naive Bayesian classifiers perform so well?

Linear vs. non-linear classifiers followed, where Dr. Jaakkola spoke about why logistic regression is a linear classifier, with more in Linear classifier, Kernel Methods for General Pattern Analysis, Kernel methods in machine learning, How do we determine the linearity or nonlinearity of a classification problem?, and a review of Kernel Methods in Machine Learning.

Misc. discussions of Kernel Methods, So you think you have a power law, the radial basis function kernel, and Kernel Perceptron in Python surfaced, some of which are briefly reviewed in Machine Learning: Perceptrons - Kernel Perceptron Learning Part 3/4, Shape Fitting with Outliers, and the SIGIR 2003 tutorial Support Vector and Kernel Methods with radial basis functions. Other topics included kernel-based anomaly detection with the Multiple Kernel Anomaly Detection (MKAD) algorithm, Support Vector Machines: Model Selection Using Cross-Validation and Grid-Search, LIBSVM -- A Library for Support Vector Machines, A Practical Guide to Support Vector Classification, Outlier Detection with Kernel Density Functions, and A Classification Framework for Anomaly Detection as relevant readings.
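
As a hedged sketch of kernel-based anomaly detection (using scikit-learn's `OneClassSVM`, not the MKAD implementation mentioned above, and with illustrative hyperparameters): fit on "normal" data only, then flag points that fall outside the learned support.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # "normal" behavior
outliers = rng.uniform(low=6, high=8, size=(10, 2))      # obvious anomalies

# nu bounds the fraction of training points treated as outliers.
detector = OneClassSVM(kernel='rbf', gamma=0.1, nu=0.05).fit(normal)
print(detector.predict(outliers))   # -1 flags anomalies, +1 is normal
```

This is the one-class framing behind many of the fraud and quality-control use cases from the afternoon session: no labeled anomalies are needed, only a sample of normal operation.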

For a linear algebra refresher, Dr. Barzilay recommended Prof. Gilbert Strang's MIT OpenCourseWare course 18.06, Gilbert Strang lectures on Linear Algebra, via video lectures.

Looking forward to deep learning and boosting tomorrow! Dr. Barzilay said it's going to be pretty cool.

**Misc:**