# Machine Learning

## Visualizing Decision Boundaries for Deep Learning

A decision boundary is the region of a problem space in which the output label of a classifier is ambiguous. In this concise yet informative article, Dr. Takashi J. Ozaki outlines decision boundaries for deep learning and other machine learning classifiers and emphasizes parameter tuning for deep learning.

The source code for this article is on GitHub. He uses H2O, one of the leading deep learning frameworks, which is available in both Python and R.

Code: https://github.com/ozt-ca/tjo.hatenablog.samples/tree/master/r_samples/public_lib/jp
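The idea of mapping out where a classifier's label flips can be sketched without H2O at all. Below is a minimal NumPy illustration (my own toy example, not from the article's code): a nearest-centroid classifier on two 2-D blobs, evaluated on a grid — the decision boundary is wherever the predicted label changes.

```python
import numpy as np

rng = np.random.default_rng(0)
# two toy 2-D classes
a = rng.normal(loc=[-2, 0], scale=0.5, size=(50, 2))
b = rng.normal(loc=[2, 0], scale=0.5, size=(50, 2))
centroids = np.vstack([a.mean(axis=0), b.mean(axis=0)])

def predict(points):
    # nearest-centroid rule: label = index of the closest class centroid
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

# evaluate on a grid; the decision boundary is where the label flips
xs, ys = np.meshgrid(np.linspace(-4, 4, 200), np.linspace(-2, 2, 100))
grid = np.column_stack([xs.ravel(), ys.ravel()])
labels = predict(grid).reshape(xs.shape)
# with symmetric centroids the boundary sits near x = 0
print(labels[:, :80].mean(), labels[:, 120:].mean())
```

Plotting `labels` as a filled contour over the grid gives exactly the kind of boundary picture the article shows.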

Deep Learning – Getting Started - important resources for learning and understanding

## Dissertation Defense - Novel Frameworks for Auctions and Optimization

Last week I attended Zeyuan Allen-Zhu's dissertation defense on the topic of Novel Frameworks for Auctions and Optimization.

The abstract of the talk follows.

Abstract: This thesis introduces novel frameworks for modeling uncertainty in auctions, and for understanding first-order methods in optimization. The former provides robust analysis to alternative specifications of preferences and information structures in Vickrey auctions, and the latter enables us to break 20-year barriers on the running time used for solving positive linear programs.

Zeyuan Allen-Zhu is a Ph.D. candidate in Computer Science (supervised by Prof. Jon Kelner and Prof. Silvio Micali) with an amazing record of publications. His defense talk on the topic of **Novel Frameworks for Auctions and Optimization** was quite comprehensive and easy to follow.

Prof. Jon Kelner and Prof. Silvio Micali promised a cake after the defense, and the audience had to leave for the committee's private session with the candidate. I later got an update from Zeyuan that he passed!

Looking forward to Zeyuan's published dissertation.

## MIT Machine Learning for Big Data and Text Processing Class Notes Day 5

On the final day (day 5) the agenda for the MIT Machine learning course was as follows:

- Generative models, mixtures, EM algorithm
- Semi-supervised and active learning
- Tagging, information extraction

The day started with Dr. Jaakkola's discussion of parameter selection, generative learning algorithms, Learning Generative Models via Discriminative Approaches, and Generative and Discriminative Models. This led to questions such as: What are some benefits and drawbacks of discriminative and generative models? What is the difference between a generative and a discriminative algorithm? And how does one learn from little data (Comparison of Classifiers Given Little Training)?

One of the reasons I truly enjoyed the course is that you always learn something new working with practitioners and participating academics. One such topic was Machine Learning Based Missing Value Imputation Method for Clinical Dataset. We revisited Missing Data Problems in Machine Learning, Imputation of Missing Data Using Machine Learning Techniques, the class imbalance problem, Machine Learning from Imbalanced Data Sets 101, and Data Mining Challenges for Imbalanced Datasets.

Feature reduction and data transformation are really helpful for model building. Dr. Jaakkola talked about how to architect feature structures. We reviewed Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches and Feature Set Embedding for Incomplete Data before moving on to Naive Bayes.

Since Naive Bayes classifiers (excellent tutorial) perform so well, we had to find some problems with them! Hence Naive Bayes with unbalanced classes, Naive Bayes classifier for unequal groups, Tackling the Poor Assumptions of Naive Bayes Text Classifiers, Naive Bayes for Text Classification with Unbalanced Classes, the Class Imbalance Problem, and Techniques for Improving the Performance of Naive Bayes for Text Classification. The elephant-in-the-room question was: when does Naive Bayes perform better than SVM?
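To make the Naive Bayes discussion concrete, here is a tiny multinomial Naive Bayes text classifier with Laplace smoothing, written from scratch on a made-up spam/ham corpus (a sketch of the standard algorithm, not course code — the documents and labels are hypothetical):

```python
import math
from collections import Counter, defaultdict

# toy labelled corpus (hypothetical examples)
docs = [("win money now", "spam"), ("limited offer win", "spam"),
        ("meeting at noon", "ham"), ("project meeting notes", "ham")]

# count word frequencies per class and class priors
word_counts = defaultdict(Counter)
class_counts = Counter()
for text, label in docs:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for c in word_counts.values() for w in c}

def predict(text):
    # log P(class) + sum of log P(word | class), with add-one (Laplace) smoothing
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / len(docs))
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("win a free offer"))   # spam-like words dominate
print(predict("notes from the meeting"))
```

With heavily unbalanced classes, the `log P(class)` prior term is exactly where the trouble starts: it can drown out the word evidence, which is what the unbalanced-class references above dig into.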

Transfer learning was also something I needed a refresher on. As stated:

When the distribution changes, most statistical models need to be rebuilt from scratch using newly collected training data. In many real-world applications, it is expensive or impossible to recollect the needed training data and rebuild the models. It would be nice to reduce the need and effort to recollect the training data. In such cases, knowledge transfer or transfer learning between task domains would be desirable.

From: http://www1.i2r.a-star.edu.sg/~jspan/publications/TLsurvey_0822.pdf

Dr. Jaakkola then gave a final review of supervised techniques (classification, rating, ranking), unsupervised techniques (clustering, mixture models), and semi-supervised techniques (making use of labelled and unlabelled data by means of clustering), along with generative techniques (Naive Bayes) and discriminative techniques (Perceptron/PA, SVM, boosting, neural networks, random forests), and emphasized the fact that discriminative analysis is performance driven.

Then someone asked: which is the best off-the-shelf classifier, i.e. the one requiring the least work from me? SVM is one of the obvious choices, but here is some food for thought as well.

If we bag or boost trees, we can get the best off-the-shelf prediction available. Bagging and Boosting are ensemble methods that combine the fit from many (hundreds, thousands) of tree models to get an overall predictor.

http://faculty.mccombs.utexas.edu/carlos.carvalho/teaching/Sec4_Trees.pdf
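To see what "combining the fit from many tree models" actually means, here is a minimal bagging sketch in pure Python (my own toy example, not from the slides): decision stumps fit on bootstrap resamples of noisy 1-D data, combined by majority vote.

```python
import random

random.seed(0)
# toy 1-D data: label 1 if x > 5, with one mislabelled (noisy) point
data = [(x, int(x > 5)) for x in range(11)]
data[3] = (3, 1)  # noise

def fit_stump(sample):
    # choose the threshold minimizing training error on the bootstrap sample
    best = None
    for t in range(12):
        err = sum((x > t) != y for x, y in sample)
        if best is None or err < best[1]:
            best = (t, err)
    return best[0]

def bagged_predict(x, stumps):
    # majority vote over the ensemble
    votes = sum(x > t for t in stumps)
    return int(votes * 2 > len(stumps))

# bagging: fit each stump on a bootstrap resample of the data
stumps = [fit_stump(random.choices(data, k=len(data))) for _ in range(25)]
print([bagged_predict(x, stumps) for x in [0, 4, 8, 10]])
```

Each individual stump is unstable under resampling (the noisy point drags some thresholds around), but the vote averages that instability away — which is exactly the "increasing stability" argument for bagging.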

Here is a good tutorial on the ensemble learning approach, covering committee-based methods, bagging (increasing stability), boosting (the "best off-the-shelf classification technique"), random forests, etc., and this one explaining the Success of AdaBoost and Random Forests as Interpolating Classifiers. We discussed this a couple of days ago, but it is still a good reference: CNN Features off-the-shelf: an Astounding Baseline for Recognition. Dr. Jaakkola emphasized that neural networks are highly flexible and efficient at test time, but difficult and time-consuming to train. For scaling machine learning, this is a great resource.

Stochastic gradient descent is a gradient descent optimization method for minimizing an objective function that is written as a sum of differentiable functions. After a mandatory gradient descent/ascent talk, we spoke again about scalability and scalable machine learning. A couple of good references are Scaling Up Machine Learning: Parallel and Distributed Approaches and Scalable Machine Learning. From a practitioner's perspective, Apache Mahout helps build an environment for quickly creating scalable, performant machine learning applications. Here is a good reference on Scalable Machine Learning - or - what to do with all that Big Data infrastructure.
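Since the objective is a sum over examples, SGD updates the parameters using the gradient of one term at a time. A minimal NumPy sketch on least squares (toy data of my own, not course code):

```python
import numpy as np

rng = np.random.default_rng(1)
# objective: sum over i of (x_i . w - y_i)^2, each term differentiable
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w + 0.01 * rng.normal(size=200)

w = np.zeros(2)
lr = 0.01
for _ in range(50):                       # epochs
    for i in rng.permutation(len(X)):     # one randomly chosen example at a time
        grad = 2 * (X[i] @ w - y[i]) * X[i]  # gradient of the i-th term only
        w -= lr * grad
print(w)  # should approach [2, -1]
```

Each step is O(1) in the dataset size, which is the whole point for scalability: the full gradient is never computed.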

The dropout training technique is something else I learned about during this class. Preventing feature co-adaptation by encouraging independent contributions from different features often improves classification and regression performance. Dropout training (Hinton et al., 2012) does this by randomly dropping out (zeroing) hidden units and input features during training of neural networks. Dropout: A Simple Way to Prevent Neural Networks from Overfitting and ICML Highlight: Fast Dropout Training are good resources. Dr. Jaakkola knows these techniques so well that he also discussed spectral embedding (*Laplacian eigenmaps*) as one method for calculating a non-linear embedding. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation, Manifold Learning, and Nonlinear Methods are good follow-up places.
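Mechanically, dropout is only a couple of lines. Here is a sketch of the common "inverted dropout" variant (my own illustration, not from the lecture), where survivors are rescaled at training time so nothing needs to change at test time:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p, train=True):
    # randomly zero units with probability p; scale survivors by 1/(1-p)
    # ("inverted dropout") so the expected activation is unchanged
    if not train:
        return activations
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1 - p)

h = np.ones((1000, 100))   # hypothetical hidden-layer activations
out = dropout(h, p=0.5)
print(out.mean())          # close to 1.0 in expectation; about half the units zeroed
```

Because each unit must survive without its usual neighbors, units cannot co-adapt — which is the intuition quoted above.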

We surveyed diverse topics such as dimensionality reduction (Dimensionality Reduction Tutorial), the covariance matrix, and PCA, and answered questions like: What is the usage of eigenvectors and eigenvalues in machine learning? and What are eigenvectors and eigenvalues?

Image Courtesy AMPCAMP Berkeley

The afternoon session started with Dr. Barzilay's recommender systems talk. Various topics were discussed, including Slope One, the cold-start problem, robust collaborative filtering, and I Like What You Like. Dr. Barzilay talked about cases when recommendations may be similar in one way and different in others: global vs. local comparisons, matrix factorization techniques for recommender systems, using SVD, and finally the quintessential Netflix challenge.

Some key resources included The "Netflix" Challenge: Predicting User Preferences Using Average Prediction, Item-Based Collaborative Filtering, and Maximum Margin Matrix Factorization; GroupLens: An Open Architecture for Collaborative Filtering of Netnews; Explaining Collaborative Filtering Recommendations; Collaborative Filtering Recommender Systems; Item-to-Item Collaborative Filtering and Matrix Factorization; and The BellKor Solution to the Netflix Prize.
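The SVD-based matrix factorization idea is easy to demo on a toy ratings matrix (hypothetical numbers of my own, not from the lecture): a truncated SVD represents users and items as low-dimensional latent vectors whose inner products approximate the ratings.

```python
import numpy as np

# toy user x movie rating matrix (hypothetical values)
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

# rank-2 truncated SVD: users and items as points in a 2-D latent "taste" space
U, s, Vt = np.linalg.svd(R)
R_hat = U[:, :2] @ np.diag(s[:2]) @ Vt[:2]
print(R_hat.round(1))  # the two taste groups are captured by just two factors
```

Most of the matrix's energy sits in the top two singular values here, which is why a tiny latent space reconstructs the block structure of the ratings so well.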

Last but not least, Dr. Barzilay talked about why Netflix never implemented the winning algorithm and why simple but scalable approaches are better choices than their complex and resource-intensive counterparts.

It was a brilliant class, with lots of networking and learning. Thanks to our instructors for their hard work and diligence in getting the knowledge and concepts across. The class ended with the certificate distribution and a Q&A session.

I plan to implement these ideas in practice. Happy Machine Learning!

**Miscellaneous Links**

- Recognizing hand-written digits- scikit-learn.org/0.11/auto_examples/plot_digits_classification.html
- An Introduction to Machine Learning Theory and Its Applications: A Visual Tutorial with Examples www.toptal.com/machine-learning/machine-learning-theory-an-introductory-primer
- Machine Learning 101: General Concepts www.astroml.org/sklearn_tutorial/general_concepts.html
- Linear Algebra for Machine Learning machinelearningmastery.com/linear-algebra-machine-learning/
- Develop new models to accurately predict the market response to large trades. https://www.kaggle.com/c/AlgorithmicTradingChallenge
- ML exams
- http://www.cs.cmu.edu/~guestrin/Class/10701/additional/
- http://www.cs.cmu.edu/~guestrin/Class/10701/additional/final-s2006.pdf
- http://www.cs.cmu.edu/~guestrin/Class/10701/additional/final-s2006-handsol.pdf

## MIT Machine Learning for Big Data and Text Processing Class Notes Day 4

On day 4 of the machine learning course, the agenda was as follows:

- Unsupervised learning, clustering
- Dimensionality reduction, matrix factorization, and
- Collaborative filtering, recommender problems

The day started with Regina **Barzilay**'s (Bio) (Personal Webpage) talk on determining the number of clusters in a data set and approaches to choosing the correct number of clusters. The core ideas addressed were the differences between supervised, unsupervised, and semi-supervised feature selection algorithms, and Supervised/Unsupervised/Semi-supervised Feature Selection for Multi-Cluster/Class Data. Dr. Barzilay discussed Voronoi diagrams and Voronoi-diagram-based clustering algorithms, leading to Lloyd's algorithm, or Voronoi iteration. The lecture also included the simple yet effective k-means clustering and k-medoids; the **k-medoids algorithm** is a clustering algorithm related to the k-means algorithm and the medoidshift algorithm. The elbow method was also briefly discussed.

**Lloyd's Algorithm**
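Lloyd's algorithm alternates two steps: assign every point to its nearest center (a Voronoi partition), then move each center to the mean of its points. A minimal NumPy sketch on toy data (my own example, not course code):

```python
import numpy as np

rng = np.random.default_rng(0)
# two well-separated toy clusters in 2-D
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])

k = 2
centers = X[[0, 50]]  # init from data points (here one per blob, for determinism)
for _ in range(10):
    # assignment step: attach each point to its nearest center (Voronoi cells)
    labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
    # update step: move each center to the mean of its cell
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
print(centers)  # one center near (0, 0), one near (3, 3)
```

In practice the initialization is randomized (and rerun several times), because Lloyd's algorithm only finds a local optimum.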

Choosing the number of clusters, as well as what to 'cluster around', are both quite interesting problems. How Google News clusters stories, and how to measure the appeal of a product, are topics of immense interest. The instructor was asked about Under The Hood: Google News & Ranking Stories, and this link provides some insight: Patent Application Reveals Key Factors In Google News Algorithm.

**High Performance Text Processing in Machine Learning by Daniel Krasner**

The second part of the class was with Dr. Tommi **Jaakkola** (Bio) (Personal Webpage), who focused mainly on examples of mixture models, revisited the K-means algorithm for finding clusters in a data set, reviewed the latent-variable view of mixture distributions, discussed how to assign data points to specific components of a mixture, and covered general techniques for finding maximum-likelihood estimators in latent variable models.

The Expectation-Maximization (EM) algorithm, and its explanation in the context of Gaussian mixture models (which motivate EM), took the majority of the time. Dr. Jaakkola talked about a framework for building complex probability distributions, a method for clustering data, and using social media to reinforce learning. The topic then moved on to sequence learning and a brief introduction to decision trees and random forests, Mixtures, EM, Non-parametric Models, as well as Machine Learning from Data: Gaussian Mixture Models.

Expectation-Maximization (EM) is an iterative procedure that is very sensitive to initial conditions. The principle of garbage in, garbage out applies here, and therefore we need a good and fast initialization procedure. The Expectation Maximization Algorithm: A Short Tutorial explains a few techniques, including K-means, hierarchical K-means, Gaussian splitting, etc.
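The two EM steps for a Gaussian mixture fit in a few lines of NumPy. The following is a sketch on 1-D toy data (my own example): the E-step computes responsibilities, the M-step re-estimates means, variances, and mixing weights from the weighted data.

```python
import numpy as np

rng = np.random.default_rng(0)
# 1-D data from two Gaussians
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 0.5, 300)])

# initial guesses (EM is sensitive to these)
mu = np.array([-1.0, 1.0]); sigma = np.array([1.0, 1.0]); pi = np.array([0.5, 0.5])

for _ in range(50):
    # E-step: responsibilities r[i, k] = P(component k | x_i)
    dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2)) / sigma
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the responsibility-weighted data
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print(np.sort(mu))  # should approach [-2, 2]
```

Starting the means at, say, (-1, 1) versus two nearby points changes which local optimum EM lands in — exactly the initialization sensitivity discussed above.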

Here is a great tutorial by MathematicalMonk on (ML 16.3) Expectation-Maximization (EM) algorithm.

Mixture Models and EM Tutorial by Sargur Srihari

(ML 16.6) Gaussian mixture model (Mixture of Gaussians): an introduction to the mixture of Gaussians, a.k.a. the Gaussian mixture model (GMM). This is often used for density estimation and clustering.

In response to Henry Tan's query regarding how tensor analysis is applied to machine learning, Dr. Barzilay pointed to one of her papers as a resource.

The rest of the class continued with ML topics and practical advice on things like log-likelihood use, why clustering in high dimensions is extremely tricky, dimensionality reduction for supervised learning, random projections, feature selection for dimensionality reduction, and, last but not least, the BIC: Model Selection Lecture V: The Bayesian Information Criterion.

Looking forward to tomorrow's final class on generative models, mixtures, and the EM algorithm; semi-supervised and active learning; and tagging and information extraction.

**Misc**

- Machine Learning Mastery - Getting Started
- Data Scientist In a Can
- Atlas of Knowledge
- Machine Learning Math
- Math for machine learning
- What if I’m Not Good at Mathematics

Machine Learning with Scikit-Learn (I) - PyCon 2015

## MIT Machine Learning for Big Data and Text Processing Class Notes Day 3

**Regina Barzilay** (Bio) (Personal Webpage) gave an overview of the following:

- Cascades, boosting
- Neural networks, deep learning
- Back-propagation
- Image/text annotation, translation

Dr. **Barzilay** introduced BoosTexter to the class with a demo on a Twitter feed. BoosTexter is a general-purpose machine-learning program based on boosting for building a classifier from text and/or attribute-value data. It can be downloaded here, while step-by-step instructions can be found here as part of Assignment 2: Extracting Meaning from Text. The paper outlining BoosTexter is BoosTexter: A Boosting-based System for Text Categorization. An open-source implementation of BoosTexter (an AdaBoost-based classifier) can be found here.

Reference: http://www.ais.uni-bonn.de/deep_learning/

A question was brought up regarding where to get data; the following are a few sources.

- ACL Anthology A Digital Archive of Research Papers in Computational Linguistics
- LDC Catalog https://catalog.ldc.upenn.edu

After the intro to BoosTexter, we jumped into ensemble learning and boosting. Here are some of the pertinent resources to the lecture.

- Boosting Simple Model Selection Cross Validation Regularization
- Introductory lecture - CSE 151 Machine Learning Instructor: Kamalika Chaudhuri
- Fighting the bias-variance tradeoff

Questions like "Does ensembling (boosting) cause overfitting?" came up, and we talked about how machines read Twitter and Yelp so you don't have to. I think the most relevant resource is the Foundations of Machine Learning lecture by Mehryar Mohri (Courant Institute and Google Research).

At this point, a detailed discussion of loss functions was in order. A loss function indicates the penalty for an incorrect prediction, and there are different kinds of loss (or cost) functions: the zero-one loss (the standard loss function in classification), non-symmetric losses (e.g. classifying good vs. spam emails, where the two error types cost differently), and the squared loss (the standard loss function in regression). MSR's paper On Boosting and the Exponential Loss is a good starting point to follow this topic.
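The losses mentioned are each one line of NumPy. A small worked example of my own (toy labels and scores, not from the lecture):

```python
import numpy as np

y = np.array([1, -1, 1, 1])               # true labels
score = np.array([2.0, 0.5, -0.3, 1.2])   # classifier scores f(x)

zero_one = (np.sign(score) != y).mean()   # standard classification loss: 0.5
squared = np.mean((score - y) ** 2)       # standard regression loss
exponential = np.mean(np.exp(-y * score)) # the loss AdaBoost effectively minimizes

# a non-symmetric loss: false positives (spam sent to inbox) cost 5x more
costs = np.where((y == -1) & (score > 0), 5.0, 1.0)
asym = np.mean(costs * (np.sign(score) != y))
print(zero_one, squared, exponential, asym)
```

Note how the one false positive dominates the asymmetric loss (1.5 vs. 0.5) even though the zero-one loss treats both mistakes identically.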

Speaking of boosting, we reviewed Boosting (AdaBoost Algorithm) by Eric Emer, Explaining AdaBoost by Robert E. Schapire, and Ensemble Learning. Questions came up like: In boosting, why are the learners "weak"? What is a weak learner? How does one boost without overfitting, and does ensembling (boosting) cause overfitting? Is AdaBoost less or more prone to overfitting?
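To ground the "weak learner" terminology, here is a compact AdaBoost implementation over decision stumps on a 1-D toy problem of my own (a sketch of the textbook algorithm, not course code). No single stump can represent the target interval, but the weighted vote of a few can.

```python
import numpy as np

# toy 1-D problem: label +1 inside [3, 7], -1 outside — needs more than one stump
x = np.arange(10.0)
y = np.where((x >= 3) & (x <= 7), 1, -1)

w = np.ones(len(x)) / len(x)  # example weights
stumps = []
for _ in range(5):
    # weak learner: best threshold/direction stump under the current weights
    best = None
    for t in np.arange(-0.5, 10.5):
        for s in (1, -1):
            pred = np.where(x > t, s, -s)
            err = w[pred != y].sum()
            if best is None or err < best[0]:
                best = (err, t, s)
    err, t, s = best
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
    pred = np.where(x > t, s, -s)
    # AdaBoost reweighting: misclassified examples gain weight for the next round
    w = w * np.exp(-alpha * y * pred)
    w /= w.sum()
    stumps.append((alpha, t, s))

def predict(x):
    return np.sign(sum(a * np.where(x > t, s, -s) for a, t, s in stumps))

print((predict(x) == y).mean())
```

Each learner only needs to beat chance on the reweighted data ("weak"); the exponential reweighting forces later stumps to focus on the examples earlier ones got wrong.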

Miscellaneous topics included: what happens when you don't have mistakes? Here comes Perceptron Mistake Bounds by Mehryar Mohri and Afshin Rostamizadeh, which talks about why the error rate doesn't become zero. See also Extracting Meaning from Millions of Pages, the Natural Language Toolkit (NLTK) chapter on Extracting Information from Text, Parsing Meaning from Text, and of course scikit-learn for ensemble methods.

**Adaboost Demo**

After lunch, Dr. Tommi **Jaakkola** (Bio) (Personal Webpage) started with ANNs, artificial neural networks. There was of course the mandatory mention of the AI Winter and how neural networks fell out of favor. Dr. Jaakkola spoke about Support Vector Machines vs. Artificial Neural Networks, What are the advantages of Artificial Neural Networks over Support Vector Machines?, Neural networks vs. support vector machines: are the second definitely superior?, etc. A good overview lecture on neural networks can be found here.

As Minsky and Papert wrote:

[The perceptron] has many features to attract attention: its linearity; its intriguing learning theorem; its clear paradigmatic simplicity as a kind of parallel computation. There is no reason to suppose that any of these virtues carry over to the many-layered version. Nevertheless, we consider it to be an important research problem to elucidate (or reject) our intuitive judgment that the extension is sterile.

- Quote from Minsky and Papert's book, Perceptrons (1969)

The topic quickly converged to learning in multi-layer perceptrons: back-propagation and forward propagation. For stochastic gradient descent, non-linear classification, neural networks, and the multi-layer perceptron, this will give you a good overview. And of course, props to Minsky.

The rationale, essentially, is that perceptrons can only be highly accurate for linearly separable problems, whereas multi-layer networks (often called multi-layer perceptrons, or MLPs) can handle any complex non-linear target function. The challenge with multi-layer networks is that training provides no guarantee of convergence to the minimal-error weight vector. To hammer in these ideas, Exercise: Supervised Neural Networks is quite helpful. A few more relevant resources:

- Single Layer Neural Networks Hiroshi Shimodaira
- What is the difference between back-propagation and forward-propagation?
- Neural Networks Tutorial
- The Back-propagation Algorithm
- Vector Calculus: Understanding the Gradient (or derivative)
- Lecture 11: Feed-Forward Neural Networks
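The forward/backward passes above can be sketched end to end on XOR, the classic problem a single perceptron cannot solve. This is my own minimal NumPy illustration (one hidden layer, cross-entropy loss), not code from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
# XOR: not linearly separable, so a single-layer perceptron fails on it
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros(1)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for _ in range(5000):
    # forward propagation
    h = np.tanh(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # back-propagation: apply the chain rule, output layer first
    d_out = out - y                       # gradient of cross-entropy w.r.t. logits
    d_h = (d_out @ W2.T) * (1 - h ** 2)   # tanh'(z) = 1 - tanh(z)^2
    W2 -= 0.5 * h.T @ d_out; b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h;   b1 -= 0.5 * d_h.sum(axis=0)

print(out.round().ravel())  # should recover [0, 1, 1, 0]
```

Note there is no convergence guarantee here — a different random seed or learning rate can land in a poor local solution, which is exactly the caveat mentioned above.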

The next topic was feature scaling. What is feature scaling? A point raised was that applying feature scaling to the data before input into an artificial neural network makes the network converge faster; this is well explained in Coursera's Gradient Descent in Practice: Feature Scaling. This brought up the questions of how to "undo" feature scaling/normalization for output, and how and why normalization and feature scaling work.
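Both the scaling and the "undoing" are simple affine transforms. A quick sketch (hypothetical housing-style numbers of my own):

```python
import numpy as np

X = np.array([[1500.0, 3], [2000.0, 4], [900.0, 2]])  # e.g. sq. ft. and bedrooms

# standardization: zero mean, unit variance per feature, so the large-magnitude
# feature does not dominate the gradient and descent converges faster
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mu) / sigma
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))  # ~[0, 0] and [1, 1]

# "undoing" the scaling to report results in original units
X_back = X_scaled * sigma + mu
```

The key practical point is to keep `mu` and `sigma` from the training set and reuse them for any new data and for inverting predictions.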

The concluding topic of the day was convolutional neural networks. Convolutional Neural Networks (CNNs) are biologically inspired variants of MLPs. CS231n: Convolutional Neural Networks for Visual Recognition showcases some of the amazing things that have been done with them.

There has been immense interest in topics like Large-scale Learning with SVM and Convolutional Nets for Generic Object Categorization, ImageNet Classification with Deep Convolutional Neural Networks, Convolutional Kernel Networks, and how convolutional networks can help generate text for images. Here are some of the relevant papers:

- CNN Features off-the-shelf: an Astounding Baseline for Recognition
- Text Understanding from Scratch
- END-TO-END TEXT RECOGNITION WITH CONVOLUTIONAL NEURAL NETWORKS
- Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts
- Convolutional Neural Networks for Sentence Classification
- Visualizing and Understanding Convolutional Network
- Caffe Deep learning framework by the BVLC
- A picture is worth a thousand (coherent) words: building a natural description of images
- Show and Tell: A Neural Image Caption Generator
- Generating Text with Recurrent Neural Networks
- The Unreasonable Effectiveness of Recurrent Neural Networks
- Building Fast High-Performance Recognition Systems with Recurrent Neural Networks and LSTM
- LONG SHORT-TERM MEMORY
- Convolutional Neural Net Image Processor Link

Looking forward to tomorrow's class!

## MIT Machine Learning for Big Data and Text Processing Class Notes - Day 2

So after an awesome Day 1 at MIT, I was in the CSAIL library and met Pedro Ortega, NIPS 2015 Program Manager (@adaptiveagents). Celebrity sighting!

Today on Day 2, Dr. **Jaakkola** (Bio) (Personal Webpage), professor of Electrical Engineering and Computer Science at the Computer Science and Artificial Intelligence Laboratory (CSAIL), went over the following:

- Non-linear classification and regression, kernels
- Passive aggressive algorithm
- Overfitting, regularization, generalization
- Content recommendation

Dr. Jaakkola's Socratic method of asking common-sense questions ingrains the core concepts in people's minds. The class started with a follow-up on the perceptron from yesterday and quickly turned into a session on when NOT to use a perceptron, such as for problems that are not linearly separable. Today's lecture was derived from 6.867 Machine Learning Lecture 8. The discussion extended to the Support Vector Machine (and Statistical Learning Theory) Tutorial, which is also well explained in An Idiot's Guide to Support Vector Machines (SVMs) by R. Berwick, Village Idiot.

Speaking of SVMs and dimensionality, Dr. Jaakkola posed the question of whether ranking can also be cast as a classification problem. Learning to rank, or machine-learned ranking (MLR), is a fascinating topic where common intuitions (about the number of items displayed, error functions between the user's preference and the display order, sparseness) fall flat. Microsoft Research has some excellent reference papers and tutorials on learning to rank which are definitely worth poring over if you are interested in this topic. Label ranking by learning pairwise preferences was another topic discussed in detail during the class. Some reference papers follow:

- A Short Introduction to Learning to Rank
- Reviewing Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales
- LETOR: Learning to Rank for Information Retrieval Tutorials on Learning to Rank
- Ranking Methods in Machine Learning A Tutorial Introduction
- Yahoo! Learning to Rank Challenge Datasets
- Large Scale Learning to Rank
- Yahoo! Learning to Rank Challenge Overview
- Multiclass Classification: One-vs-all
- Zipf, Power-laws, and Pareto - a ranking tutorial Lada A. Adamic

Indeed, with SVMs the natural progression led to the 'k' word: kernel functions. A Brief Introduction to Kernel Classifiers by Mark Johnson (Brown University) is a good starting point, and The difference of kernels in SVM? and How to select a kernel for SVM provide good background material for understanding the practical aspects of kernels, as does Kernels and the Kernel Trick by Martin Hofmann (Reading Club "Support Vector Machines").
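The kernel trick is easiest to see in a dual algorithm, where data only ever appears inside inner products. Here is a sketch of a kernel perceptron with an RBF kernel on circularly separable toy data (my own example, not from the class), where the "weight vector" is replaced by per-example mistake counts:

```python
import numpy as np

rng = np.random.default_rng(0)
# circular data: +1 inside the unit circle, -1 outside — not linearly separable
X = rng.uniform(-2, 2, (200, 2))
y = np.where((X ** 2).sum(axis=1) < 1, 1, -1)

def rbf(A, B, gamma=1.0):
    # K(a, b) = exp(-gamma * ||a - b||^2): an inner product in an implicit,
    # infinite-dimensional feature space — computed without ever visiting it
    d2 = ((A[:, None] - B[None]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

K = rbf(X, X)
alpha = np.zeros(len(X))  # dual coefficients: mistake counts per example
for _ in range(20):       # perceptron epochs
    for i in range(len(X)):
        if y[i] * ((alpha * y) @ K[:, i]) <= 0:
            alpha[i] += 1

pred = np.sign((alpha * y) @ K)
print((pred == y).mean())  # high training accuracy despite the nonlinearity
```

Swapping `rbf` for a plain dot product recovers the ordinary perceptron, which fails badly on this data — that contrast is the whole point of the trick.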

The afternoon topic was anomaly detection; use cases included aberrant behavior in financial transactions, insurance fraud, bot detection, manufacturing quality control, etc. One of the most comprehensive presentations on Anomaly Detection Data Mining Techniques is by Francesco Tamberi, which is great for background. Several problems worked on during the class were from 6.867 Machine Learning, which shows how the instructors carefully catered the program to practitioners, with the right content from graduate-level courses as well as industry use cases. Other topics discussed included linear versus non-linear classifiers, and we learned how the decision boundary is the region of a problem space in which the output label of a classifier is ambiguous. Class discussions and Q&A touched on a wide variety of subjects, including but not limited to How to increase the accuracy of classifiers?, recommendation systems, and A Comparative Study of Collaborative Filtering Algorithms, which eventually led to Deep Learning Tutorial: From Perceptrons to Deep Networks, which performed really well on the MNIST database of handwritten digits.

- Caltech 101
- THE MNIST DATABASE of handwritten digits
- Why do naive Bayesian classifiers perform so well?
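One of the readings below concerns outlier detection with kernel density functions; the core idea fits in a few lines. A sketch with made-up "transaction amounts" (my own toy data, not from the class): estimate the density at each point with a Gaussian kernel, and flag the lowest-density point as anomalous.

```python
import numpy as np

rng = np.random.default_rng(0)
# 500 ordinary transaction amounts plus one aberrant transaction
x = np.concatenate([rng.normal(100, 5, 500), [160.0]])

def kde_score(point, data, h=5.0):
    # Gaussian kernel density estimate at `point`; low density = anomalous
    return np.mean(np.exp(-((point - data) ** 2) / (2 * h * h)))

scores = np.array([kde_score(p, x) for p in x])
print(x[scores.argmin()])  # the 160.0 transaction stands out
```

The bandwidth `h` plays the same role as a kernel width in SVMs: too small and everything looks anomalous, too large and nothing does.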

Linear vs. non-linear classifiers followed, where Dr. Jaakkola spoke about why logistic regression is a linear classifier, with more on Linear Classifiers, Kernel Methods for General Pattern Analysis, Kernel Methods in Machine Learning, How do we determine the linearity or nonlinearity of a classification problem?, and a review of Kernel Methods in Machine Learning.

Miscellaneous discussions of kernel methods surfaced, including So You Think You Have a Power Law, the radial basis function kernel, and Kernel Perceptron in Python, some of which are briefly reviewed in Machine Learning: Perceptrons - Kernel Perceptron Learning (Part 3/4), Shape Fitting with Outliers, and the SIGIR 2003 tutorial Support Vector and Kernel Methods, with radial basis functions. Other topics included kernel-based anomaly detection with the Multiple Kernel Anomaly Detection (MKAD) algorithm, Support Vector Machines: Model Selection Using Cross-Validation and Grid-Search, LIBSVM - A Library for Support Vector Machines, A Practical Guide to Support Vector Classification, Outlier Detection with Kernel Density Functions, and A Classification Framework for Anomaly Detection as relevant readings.

For a linear algebra refresher, Dr. Barzilay recommended Prof. Gilbert Strang's MIT OpenCourseWare course 18.06, available as video lectures on linear algebra.

Looking forward to the deep learning and boosting tomorrow! Dr. Barzilay said it's going to be pretty cool.


## MIT Machine Learning for Big Data and Text Processing Class Notes - Day 1

As a follow-up to MIT's Tackling the Challenges of Big Data, I am currently in Boston attending Machine Learning for Big Data and Text Processing: Classification (and therefore blogging about it for posterity, based on public-domain data/papers — nothing posted here is MIT proprietary info that would violate any T&C). MIT Professional Education courses are tailored toward professionals, and it is always a great opportunity to learn what other practitioners are up to, especially in a relatively new field like data science.

Today's lecture #1 was outlined as

- machine learning primer
- features, feature vectors, linear classifiers
- On-line learning, the perceptron algorithm and
- application to sentiment analysis

Instructors Tommi **Jaakkola** (Bio) (Personal Webpage) and **Regina Barzilay** (Bio) (Personal Webpage) started the discussion with a brief overview of the course. Dr. Barzilay is a great teacher who explains concepts in amazing detail. An early adopter and practitioner, she was one of Technology Review's innovators under 35.

The course notes are fairly comprehensive; following are the links to the publicly available material.

- Youtube: http://www.youtube.com/MITProfessionalEd
- FB: https://www.facebook.com/MITProfessionalEducation
- twitter: https://twitter.com/MITProfessional
- LinkedIn - https://www.linkedin.com/grp/home?gid=2352439

In collaboration with CSAIL, the MIT Computer Science and AI Lab (www.csail.mit.edu), today's lecture was a firehose version of Ullman's large-scale machine learning. Dr. Barzilay walked through the derivation of the perceptron algorithm, covering Perceptrons for Dummies and Single Layer Perceptron as Linear Classifier. For a practical implementation, Seth Juarez's numl implementation of the perceptron is a good read. A few relevant publications can be found here:

- NLP Programming Tutorial 3 - The Perceptron Algorithm
- Machine Learning: Exercise Sheet 4
- Perceptron Find Weight
- ML LAb Solutions
- Classification Exercise
- Perceptron Learning
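The perceptron algorithm itself is only a handful of lines. A minimal NumPy sketch on a linearly separable toy set (my own example, not from the lecture): on every mistake, nudge the boundary toward the misclassified example.

```python
import numpy as np

# linearly separable toy data
X = np.array([[2, 1], [1, 3], [-1, -1], [-2, 1.]])
y = np.array([1, 1, -1, -1])

w = np.zeros(2); b = 0.0
converged = False
while not converged:
    converged = True
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:   # mistake (or on the boundary)
            w += yi * xi             # perceptron update rule
            b += yi
            converged = False
print(w, b)
```

The convergence theorem guarantees this loop terminates whenever the data is linearly separable — and it is precisely the non-separable case, where it loops forever, that motivates the SVM and kernel material later in the week.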

The discussion progressed into Opinion Mining and Sentiment Analysis with related techniques. Some of the pertinent data sets can be found here:

- Huge n-grams dataset from Google: storage.googleapis.com/books/ngrams/books/datasetsv2.html
- http://www.sananalytics.com/lab/twitter-sentiment/
- http://inclass.kaggle.com/c/si650winter11/data
- http://nlp.stanford.edu/sentiment/treebank.html
- Global ML dataset repository: https://archive.ics.uci.edu/ml
- Sentiment 140 Dataset
- Cornell Movie Review Dataset

Dr. Barzilay briefly mentioned Online Passive-Aggressive Algorithms, and details from Lillian Lee's AAAI 2008 invited talk (a "mitosis" encoding / min-cost cut) came up while talking about domain adaptation, which is quite an interesting topic on its own. Domain Adaptation with Structural Correspondence Learning by John Blitzer, Introduction to Domain Adaptation (guest lecture by Ming-Wei Chang, CS 546), and Word Segmentation of Informal Arabic with Domain Adaptation are fairly interesting readings; the lecture slides were heavily inspired by the Ming-Wei Chang lecture.

With sentiment analysis and opinion mining, we went over the seminal Latent Semantic Analysis (LSA), a Clustering Algorithm Based on Singular Value Decomposition, Latent Semantic Indexing (LSI) (Deerwester et al., 1990), and Latent Dirichlet Allocation (LDA) (Blei et al., 2003). The class had an interesting discussion around The Hathaway Effect: How Anne Gives Warren Buffett a Rise, with a potentially NSFW graphic. The lecture can be summed up in A Comprehensive Review of Opinion Summarization by Kim, Hyun Duk; Ganesan, Kavita; Sondhi, Parikshit; Zhai, ChengXiang (PDF version).
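LSA/LSI boils down to an SVD of a term-document matrix. A toy sketch of my own (made-up four-document corpus, not from the lecture): documents that share vocabulary end up close in the low-rank "topic" space even when compared via simple cosine similarity.

```python
import numpy as np

docs = ["cat sat mat", "cat mat", "dog bone", "bone chewed"]
vocab = sorted({w for d in docs for w in d.split()})
# term-document count matrix
A = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

# LSI: a rank-2 SVD projects documents into a latent "topic" space
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vecs = (np.diag(s[:2]) @ Vt[:2]).T  # one 2-D vector per document

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# cat-documents are near each other, and far from the dog-documents
print(cos(doc_vecs[0], doc_vecs[1]), cos(doc_vecs[0], doc_vecs[2]))
```

Real LSI pipelines weight the counts (e.g. tf-idf) and use far more than two dimensions, but the mechanics are exactly these.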

Few other papers / research work and demos discussed during the lecture included Get out the vote: Determining support or opposition from Congressional floor-debate transcripts, Multiple Aspect Ranking using the Good Grief Algorithm, Distributional Footprints of Deceptive Product Reviews, Recursive Neural Tensor Network - Deeply Moving: Deep Learning for Sentiment Analysis, Code for Deeply Moving: Deep Learning for Sentiment Analysis, and Sentiment Analysis - The Stanford NLP Demo, Stanford Sentiment Treebank.

Among several class discussions and exercises/quizzes, The Distributional Footprints of Deceptive Product Reviews was of primary importance. Starting with Amazon Glitch Unmasks War of Reviewers, darts were thrown at Opinion Spam Detection: Detecting Fake Reviews and Reviewers and Fake Review Detection: Classification and Analysis of Real and Pseudo Reviews.

With all this sentiment analysis talk, I asked fellow attendee Mohammed Al-Hamdan (Data Analyst at Al-Elm Information Security Company) about publishing a paper by the end of this course on sentiment analysis of Arabic-language Twitter feeds for potential political dissent. It would be a cool project/publication.

Looking forward to the session tomorrow!

As a bonus, here is Dr. Regina Barzilay's Information Extraction for Social Media video, publicly available on YouTube.

## Deep Learning with Neural Networks

Deep learning architectures are built from multiple levels of non-linear operations, for instance neural nets with many hidden layers. In this introductory talk, Will Stanton discusses the motivations and principles behind learning algorithms for deep architectures. He provides a primer on neural networks and deep learning, and explains how deep learning gives some of the best-ever solutions to problems in computer vision, speech recognition, and natural language processing.
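The role of the non-linearity is worth spelling out: without it, stacking layers buys nothing, because a composition of linear maps is itself a single linear map. A small hand-rolled sketch (weights chosen purely for illustration, not from the talk):

```python
import numpy as np

# Two stacked linear layers with no activation collapse to one linear map.
W1 = np.array([[1., -1.],
               [-1., 1.]])
W2 = np.array([[1., 1.]])
x = np.array([2., 1.])

stacked = W2 @ (W1 @ x)      # two "layers", no non-linearity
collapsed = (W2 @ W1) @ x    # the identical single linear map
print(stacked, collapsed)    # both [0.]

# Inserting a non-linearity (ReLU here) between layers breaks the collapse;
# this is what lets many hidden layers represent genuinely deeper functions.
relu = lambda z: np.maximum(z, 0.0)
deep = W2 @ relu(W1 @ x)
print(deep)                  # [1.], no longer the same map
```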

He also explains why Google is investing in deep learning.

## Gradient Boosting Machine Learning by Prof. Hastie

Here is Prof. Hastie's recent talk from the H2O World conference. In this talk, Prof. Hastie takes us through ensemble learners, building from decision trees up to random forests and boosting, for classification problems.
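The core idea of gradient boosting is easy to sketch: for squared loss, each round fits a small tree to the current residuals and adds a shrunken copy of it to the model. A minimal stump-based version in Python (toy data and hyperparameters of my own choosing; this is not Prof. Hastie's or H2O's implementation):

```python
import numpy as np

def fit_stump(x, r):
    """Depth-1 regression tree: pick the split on x that best fits residuals r."""
    best = None
    for t in x:
        left, right = r[x <= t], r[x > t]
        if len(right) == 0:
            continue
        pl, pr = left.mean(), right.mean()
        sse = ((left - pl) ** 2).sum() + ((right - pr) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, pl, pr)
    _, t, pl, pr = best
    return lambda z: np.where(z <= t, pl, pr)

def gradient_boost(x, y, rounds=50, lr=0.1):
    f0 = y.mean()                        # start from the constant model
    pred = np.full_like(y, f0)
    stumps = []
    for _ in range(rounds):
        stump = fit_stump(x, y - pred)   # fit the current residuals
        pred = pred + lr * stump(x)      # shrunken additive update
        stumps.append(stump)
    return lambda z: f0 + lr * sum(s(z) for s in stumps)

x = np.array([0., 1., 2., 3., 4., 5.])
y = np.array([0., 0., 0., 1., 1., 1.])   # a step function
model = gradient_boost(x, y)
print(model(x))                          # close to y after 50 rounds
```

The learning rate plays the shrinkage role Prof. Hastie emphasizes in his work on boosting: each stump corrects only a fraction of the remaining error, so the residuals shrink geometrically over the rounds.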

Other excellent talks from the conference include the following.

- Michael Marks - Values and Art of Scale in Business
- Nachum Shacham of PayPal - R and ROI for Big Data
- Hassan Namarvar, ShareThis - Conversion Estimation in Display Advertising
- Ofer Mendelevitch, Hortonworks - Bayesian Networks with R and Hadoop
- Sandy Ryza, Cloudera - MLlib and Apache Spark
- Josh Bloch, Lord of the APIs - A Brief, Opinionated History of the API
- Macro and Micro Trends in Big Data, Hadoop and Open Source
- Competitive Data Science Panel: Kaggle, KDD and data sports
- Practical Data Science Panel

The complete playlist can be found here.

## Machine Learning - On the Art and Science of Algorithms with Peter Flach

Over a decade ago, Peter Flach of Bristol University wrote a paper titled "On the state of the art in machine learning: A personal review", in which he reviewed several then-recent books on developments in machine learning. These included Pat Langley's Elements of Machine Learning (Morgan Kaufmann), Tom Mitchell's Machine Learning (McGraw-Hill), and Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations by Ian Witten and Eibe Frank (Morgan Kaufmann), among many others. Dr. Flach singled out Michael Berry and Gordon Linoff's Data Mining Techniques for Marketing, Sales, and Customer Support (John Wiley) for its excellent writing style, citing the paragraph below and commending: "I wish that all computer science textbooks were written like this."

“People often find it hard to understand why the training set and test set are “tainted” once they have been used to build a model. An analogy may help: Imagine yourself back in the 5th grade. The class is taking a spelling test. Suppose that, at the end of the test period, the teacher asks you to estimate your own grade on the quiz by marking the words you got wrong. You will give yourself a very good grade, but your spelling will not improve. If, at the beginning of the period, you thought there should be an ‘e’ at the end of “tomato”, nothing will have happened to change your mind when you grade your paper. No new data has entered the system. You need a test set!

Now, imagine that at the end of the test the teacher allows you to look at the papers of several neighbors before grading your own. If they all agree that "tomato" has no final 'e', you may decide to mark your own answer wrong. If the teacher gives the same quiz tomorrow, you will do better. But how much better? If you use the papers of the very same neighbors to evaluate your performance tomorrow, you may still be fooling yourself. If they all agree that "potatoes" has no more need of an 'e' than "tomato", and you have changed your own guess to agree with theirs, then you will overestimate your actual grade on the second quiz as well. That is why the evaluation set should be different from the test set." [3, pp. 76–77]
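The analogy translates directly into code: a model that "grades its own paper", i.e. is evaluated on the very data it memorized, looks perfect while having learned nothing general. A deliberately extreme sketch using a 1-nearest-neighbour memorizer (toy data of my own invention, not from the book):

```python
import numpy as np

def nn_predict(train_x, train_y, query):
    """1-nearest-neighbour: memorize the training set, echo the closest label."""
    dists = np.abs(train_x[:, None] - query[None, :])
    return train_y[np.argmin(dists, axis=0)]

train_x = np.array([0., 1., 2., 3.])
train_y = np.array([0, 1, 0, 1])
test_x = np.array([0.4, 1.4, 2.4])   # held-out points
test_y = np.array([0, 1, 1])         # their true labels

# Grading your own paper: every training point is its own nearest neighbour.
train_acc = (nn_predict(train_x, train_y, train_x) == train_y).mean()
# A genuinely held-out set reveals the model's actual performance.
test_acc = (nn_predict(train_x, train_y, test_x) == test_y).mean()
print(train_acc, test_acc)           # perfect on seen data, lower on unseen
```

No new data entered the system during self-grading, so the training accuracy of 1.0 tells us nothing; only the held-out points expose the mistakes.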

That is why, when I recently came across "Machine Learning: The Art and Science of Algorithms that Make Sense of Data", I decided to check it out, and I wasn't disappointed. Dr. Flach is Professor of Artificial Intelligence at the University of Bristol, and in this "future classic" he left no stone unturned when it comes to clarity and explainability. The book starts with a machine learning sampler, introduces the ingredients of machine learning, and progresses quickly to binary classification and beyond. Written as a textbook, riddled with examples, footnotes, and figures, the text elaborates on concept learning, tree models, rule models, linear models, distance-based models, and probabilistic models, then on features and ensembles, concluding with machine learning experiments. I really enjoyed the "Important points to remember" section of the book as a quick refresher on machine learning commandments.

The concept learning section seems to have been influenced by the author's own research interests and is not discussed in as much detail in contemporary machine learning texts. I also found the frequent summarization of concepts quite helpful. Contrary to its subtitle, and compared to its counterparts, the book is light on algorithms and code, possibly on purpose. While it explains the concepts with examples, the number of formal algorithms is kept to a minimum. This may aid clarity and help avoid recipe-book syndrome, while making it potentially inaccessible to practitioners. Great on the basics, the text also falls short on elaboration of intermediate to advanced topics such as LDA, kernel methods, PCA, RKHS, and convex optimization. For instance, the "Matrix transformations and decompositions" material in Chapter 10 could have been made an appendix, expanding instead upon meaningful topics like LSA and use cases of sparse matrices (p. 327). That is not really the book's fault, but rather this reader expecting too much from an introductory text just because the author explains everything so well!

As a textbook on the art and science of algorithms, Peter Flach definitely delivers on the promise of clarity, with well-chosen illustrations and an example-based approach. Highly recommended reading for anyone who would like to understand the principles behind machine learning techniques.

Materials can be downloaded from here; they generously include excerpts with background material and literature references, plus the full set of 540 lecture slides in PDF, including all figures in the book, with the LaTeX beamer source of the above.