As a follow-up to MIT's Tackling the Challenges of Big Data, I am currently in Boston attending Machine Learning for Big Data and Text Processing Classification, and blogging about it for posterity. Everything posted here is based on public domain data and papers; none of it is MIT proprietary information that would violate any T&C. MIT Professional Education courses are tailored towards professionals, and they are always a great opportunity to learn what other practitioners are up to, especially in a relatively new field like data science.
Today's lecture #1 was outlined as
- machine learning primer
- features, feature vectors, linear classifiers
- on-line learning, the perceptron algorithm
- application to sentiment analysis
Instructors Tommi Jaakkola (Bio) (Personal Webpage) and Regina Barzilay (Bio) (Personal Webpage) started the discussion with a brief overview of the course. Dr. Barzilay is a great teacher who explains the concepts in amazing detail. An early adopter and practitioner, she was named one of the Technology Review Innovators Under 35.
The course notes are fairly comprehensive; following are the links to the publicly available material.
- YouTube: http://www.youtube.com/MITProfessionalEd
- Facebook: https://www.facebook.com/MITProfessionalEducation
- Twitter: https://twitter.com/MITProfessional
- LinkedIn: https://www.linkedin.com/grp/home?gid=2352439
In collaboration with CSAIL (MIT Computer Science and AI Lab, www.csail.mit.edu), today's lecture was a firehose version of Ullman's large-scale machine learning. Dr. Barzilay walked through the derivation of the Perceptron Algorithm, covering Perceptrons for Dummies and Single Layer Perceptron as Linear Classifier. For a practical implementation, Seth Juarez's NUML implementation of the perceptron is good reading. A few relevant publications can be found here.
- NLP Programming Tutorial 3 - The Perceptron Algorithm
- Machine Learning: Exercise Sheet 4
- Perceptron Find Weight
- ML Lab Solutions
- Classification Exercise
- Perceptron Learning
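To make the perceptron discussion concrete, here is a minimal sketch of the mistake-driven update applied to a toy sentiment task. This is not code from the lecture or from any of the links above; the bag-of-words featurizer, the `<bias>` feature name, the function names, and the four made-up reviews are all my own assumptions for illustration.

```python
def featurize(text):
    """Bag-of-words feature vector stored as a sparse dict: word -> count."""
    features = {}
    for word in text.lower().split():
        features[word] = features.get(word, 0.0) + 1.0
    features["<bias>"] = 1.0  # constant feature that plays the role of the offset
    return features


def score(weights, features):
    """Dot product between the weight vector and a sparse feature vector."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def train_perceptron(examples, epochs=10):
    """On-line perceptron: update the weights only when an example is misclassified."""
    weights = {}
    for _ in range(epochs):
        mistakes = 0
        for text, label in examples:  # label is +1 or -1
            features = featurize(text)
            if label * score(weights, features) <= 0:  # wrong side, or on the boundary
                for name, value in features.items():
                    weights[name] = weights.get(name, 0.0) + label * value
                mistakes += 1
        if mistakes == 0:  # a full pass with no mistakes: stop early
            break
    return weights


def predict(weights, text):
    """Classify as +1 (positive) if the score is positive, otherwise -1."""
    return 1 if score(weights, featurize(text)) > 0 else -1


# Hypothetical toy data, only to exercise the code.
toy_reviews = [
    ("great movie loved it", 1),
    ("terrible plot hated it", -1),
    ("loved the acting", 1),
    ("hated the ending terrible", -1),
]
weights = train_perceptron(toy_reviews)
```

Calling `predict(weights, "loved it")` then classifies a new snippet with the learned linear classifier. A real sentiment model would need far more data and better features than raw word counts; the point here is only the shape of the update rule.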
The discussion progressed into Opinion Mining and Sentiment Analysis with related techniques. Some of the pertinent data sets can be found here:
- Huge n-grams dataset from Google: storage.googleapis.com/books/ngrams/books/datasetsv2.html
- Global ML dataset repository: https://archive.ics.uci.edu/ml
- Sentiment 140 Dataset
- Cornell Movie Review Dataset
Dr. Barzilay briefly mentioned Online Passive-Aggressive Algorithms, along with details from Lillian Lee's AAAI 2008 invited talk (A “mitosis” encoding / min-cost cut), while talking about domain adaptation, which is quite an interesting topic on its own. Domain Adaptation with Structural Correspondence Learning by John Blitzer, Introduction to Domain Adaptation (guest lecturer: Ming-Wei Chang, CS 546), and Word Segmentation of Informal Arabic with Domain Adaptation are fairly interesting readings. The lecture slides are heavily inspired by Ming-Wei Chang's guest lecture.
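For readers who have not met the passive-aggressive family before, the following is a rough sketch of a single update step in the PA-I style: do nothing when the example already has a margin of at least 1, otherwise move the weights just far enough to fix it, capped by an aggressiveness parameter. The function names, the sparse-dict representation, and the example features are my own assumptions, not material from the lecture.

```python
def dot(weights, features):
    """Dot product between a sparse weight dict and a sparse feature dict."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def pa_update(weights, features, label, aggressiveness=1.0):
    """One passive-aggressive (PA-I style) step; returns the step size used.

    label is +1 or -1. The weights dict is modified in place.
    """
    loss = max(0.0, 1.0 - label * dot(weights, features))  # hinge loss
    squared_norm = sum(value * value for value in features.values())
    if loss == 0.0 or squared_norm == 0.0:
        return 0.0  # passive: margin already satisfied (or nothing to update)
    step = min(aggressiveness, loss / squared_norm)  # aggressive, but capped
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) + step * label * value
    return step


# Hypothetical example: one positive snippet with two active features.
weights = {}
example = {"loved": 1.0, "it": 1.0}
first_step = pa_update(weights, example, label=1)
second_step = pa_update(weights, example, label=1)
```

The contrast with the perceptron is the step size: the perceptron always adds the full feature vector on a mistake, while this update scales the correction by how badly the margin is violated.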
On sentiment analysis and opinion mining, we went over the seminal Latent Semantic Analysis - LSI, Clustering Algorithm Based on Singular Value Decomposition, Latent Semantic Indexing (LSI) (Deerwester et al. 1990), and Latent Dirichlet Allocation (LDA) (Blei et al. 2003). The class had an interesting discussion around The Hathaway Effect: How Anne Gives Warren Buffett a Rise, with a potentially NSFW graphic. The lecture can be summed up in Comprehensive Review of Opinion Summarization by Kim, Hyun Duk; Ganesan, Kavita; Sondhi, Parikshit; Zhai, ChengXiang (PDF version).
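As a small illustration of the SVD idea behind latent semantic analysis, here is a sketch that builds a term-document count matrix and keeps only the top two singular directions. It assumes NumPy is installed; the five-document corpus, the choice of raw counts (rather than tf-idf weighting), and `k = 2` are all assumptions I made to keep the example tiny.

```python
import numpy as np

# Toy corpus (made up): three finance snippets followed by two movie snippets.
docs = [
    "stock market trading",
    "market stock prices",
    "stock prices fall",
    "movie actor film",
    "film movie review",
]

vocab = sorted({word for doc in docs for word in doc.split()})
row = {word: i for i, word in enumerate(vocab)}

# Term-document matrix of raw counts: one row per term, one column per document.
counts = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for word in doc.split():
        counts[row[word], j] += 1.0

# Thin SVD: counts = U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)

k = 2  # number of latent dimensions to keep
doc_vectors = (s[:k, None] * Vt[:k]).T  # one k-dimensional row per document


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Comparing `cosine(doc_vectors[0], doc_vectors[1])` with `cosine(doc_vectors[0], doc_vectors[3])` shows the intent: documents that share vocabulary should end up close together in the reduced space, while documents on unrelated topics should not.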
A few other papers, research work, and demos discussed during the lecture included Get out the vote: Determining support or opposition from Congressional floor-debate transcripts, Multiple Aspect Ranking using the Good Grief Algorithm, Distributional Footprints of Deceptive Product Reviews, Recursive Neural Tensor Network - Deeply Moving: Deep Learning for Sentiment Analysis, Code for Deeply Moving: Deep Learning for Sentiment Analysis, and Sentiment Analysis - The Stanford NLP Demo, Stanford Sentiment Treebank.
Among the several class discussions and exercises/quizzes, The Distributional Footprints of Deceptive Product Reviews was of primary importance. Starting with Amazon Glitch Unmasks War Of Reviewers, darts were thrown around Opinion Spam Detection: Detecting Fake Reviews and Reviewers, and Fake Review Detection: Classification and Analysis of Real and Pseudo Reviews.
With all this talk of sentiment analysis, I asked fellow attendee Mohammed Al-Hamdan (Data Analyst at Al-Elm Information Security Company) about publishing a paper by the end of this course on sentiment analysis of Arabic-language Twitter feeds for potential political dissent. It would be a cool project / publication.
Looking forward to the session tomorrow!
Bonus: here is Dr. Regina Barzilay's Information Extraction for Social Media video, publicly available on YouTube.