Getting Started Guide for Python, Data Science, and Machine Learning for Aspiring Practitioners & Hobbyists
A friend recently asked about a good book on data science with Python. I have been surveying the current landscape of books for a course I am teaching this summer, so here is my recommendation. There are several other good ones out there, but this one fits the newbie bill quite well. So, without further ado:
Are you a beginner who would like to learn Python in the context of a specific area, and are you tired of syntax-focused books sans practical examples?
Are you exploring the data science landscape and want to see practical examples of how to actually use machine learning algorithms in a data science context?
If you answer in the affirmative to either of the questions above, "Python for Data Science For Dummies" is the perfect book for you. Luca Massaron is a practicing data scientist and a prolific author of several books, including Regression Analysis with Python, Machine Learning For Dummies, Python Data Science Essentials, and Large Scale Machine Learning with Python. He is also a leading Kaggle enthusiast, and you can see his 'practitioner fingerprints' all over this book, especially in the later chapters about data processing, ETL, cleanup, data sources, and challenges.
This book starts with the fundamentals of Python data analysis programming and explains how to set up a Python development environment using Anaconda with IPython (Jupyter notebooks). The authors begin by considering the emergence of data science, outline the core competencies of a data scientist, and describe the Data Science Pipeline before taking a plunge into Python's Role in Data Science and introducing Python's Capabilities and Wonders.
Once you get your bearings with the IDE setup, chapter 4 focuses on basic Python before you get your hands dirty with data. What I like about this manuscript is that the writing keeps it real. Instead of giving made-up examples, the authors talk about things like knowing when to use NumPy or pandas, and real-world scenarios like removing duplicates, creating a data map and data plan, dealing with dates in your data, handling missing data, and parsing; problems which practicing data scientists encounter on a daily basis.
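Those everyday chores map directly onto short pandas recipes. As a minimal sketch (the order records below are hypothetical, invented purely for illustration), de-duplication, date parsing, and missing-value handling might look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical order records with a duplicate row, a bad date, and a missing amount
df = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "order_date": ["2015-01-05", "2015-01-06", "2015-01-06", "not a date"],
    "amount": [100.0, 250.0, 250.0, np.nan],
})

df = df.drop_duplicates()                                             # remove exact duplicate rows
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")  # unparseable dates become NaT
df["amount"] = df["amount"].fillna(df["amount"].mean())               # impute missing amounts with the mean

print(df)
```

`errors="coerce"` turns unparseable dates into NaT instead of raising, which keeps the pipeline moving; whether mean imputation is the right call depends entirely on your data.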
Contemporary topics like text mining are also addressed in the book, with enough detail on topics such as working with raw text, stemming and removing stop words, the bag-of-words model and beyond, working with n-grams, implementing TF-IDF transformations, and adjacency matrix handling. This is also where you start getting a basic understanding of how machine learning algorithms work in practice.
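To make the TF-IDF and n-gram discussion concrete, here is a minimal scikit-learn sketch (the three toy documents are my own, not examples from the book):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs make good pets",
]

# Unigrams and bigrams, with English stop words removed, weighted by TF-IDF
vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = vec.fit_transform(docs)

print(X.shape)  # 3 documents by however many unique terms and bigrams survive
```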
Practical aspects of evaluating a data science problem are addressed later, with techniques defined for researching solutions, formulating a hypothesis, data preparation, feature creation, and binning and discretization, leading up to vector and matrix manipulation and visualization with Matplotlib. Even though the book does not discuss Theano, DL4J, Torch, Caffe, or TensorFlow, it still provides an introduction to the key Python ML library scikit-learn. This 400-page book also covers key topics like SVD, PCA, NMF, recommendation systems, clustering, detecting outliers, logistic regression, Naive Bayes, fitting a model, bias and variance, Support Vector Machines, and Random Forest classifiers, to name a few. The resources provided at the end are definitely worth subscribing to for every self-respecting data science enthusiast.
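As a flavor of how a few of those pieces fit together, here is a short scikit-learn sketch combining PCA with logistic regression on the library's bundled digits data; the parameter choices (30 components, a 25% test split) are arbitrary illustrations, not the book's:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# Reduce the 64 pixel features to 30 principal components, then fit a classifier
model = make_pipeline(PCA(n_components=30), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

print(model.score(X_test, y_test))
```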
I highly recommend this book for beginners who are interested in data science and want to learn and leverage Python skills in this rapidly emerging field.
Last Saturday I attended the Global Azure Bootcamp, held at the Microsoft Tampa office and hosted by Blain Barton of Microsoft. With presentations from Dan Patrick of Opsgility, Alex Melching, and Blain himself, the bootcamp was a great primer/refresher on the Azure offerings. I especially enjoyed the post-Build-conference offerings, announcements, and demos pertaining to the new Azure-AWS parity. Great stuff!
Here are some random notes/links from the talk:
- Azure Quickstart Templates (must see!)
- What is Azure Resource Manager
- Microsoft OMS - https://www.microsoft.com/en-us/server-cloud/operations-management-suite/overview.aspx
- Windows PowerShell Desired State Configuration Overview - https://msdn.microsoft.com/en-us/powershell/dsc/overview
- VorlonJS – A Journey to DevOps: Infrastructure as Code with Microsoft Azure and Resource Manager https://blogs.technet.microsoft.com/devops/2016/01/27/vorlonjs-a-journey-to-devops-infrastructure-as-code-with-microsoft-azure-and-resource-manager/
- Azure Readiness - https://github.com/Azure-Readiness/HOL-Intro-to-Azure
- AWS Direct Connect - https://aws.amazon.com/directconnect/
- Secure Cloud Interconnect - http://www.verizonenterprise.com/products/networking/secure-cloud-interconnect/
- Peak 10 http://www.peak10.com/
- Cloud Providers - http://callibt.com/cloud-providers/
- Creating and deploying Azure resource groups through Visual Studio - https://azure.microsoft.com/en-us/documentation/articles/vs-azure-tools-resource-groups-deployment-projects-create-deploy/
- Docker Swarm - https://github.com/Azure/azure-quickstart-templates/tree/master/docker-swarm-cluster
- ExpressRoute: Connecting Private and Public Clouds - video.ch9.ms/sessions/teched/na/2014/DCIM-B423.pptx
Troubleshooting Tip - Service cannot be started. System.TypeLoadException: Could not load type 'type name' from assembly
Recently I ran into this bizarre issue while developing a Windows service, and I thought it would be great to share the remedy on the interwebs to save others some time and pain.
The problem usually starts when you have a Windows service project, possibly with other associated class library projects as part of the solution. Everything worked fine: your unit tests still run, and your console tester (recommended to have alongside a Windows service for stepping through / debugging) also still works. However, your service stops working. InstallUtil runs OK, but upon net start, the service starts and then immediately stops. The Event Log says the following:
Service cannot be started. System.TypeLoadException: Could not load type '' from assembly '', Version=1.0.0.0, Culture=neutral, PublicKeyToken=null'. at nova.edu.MyService.OnStart(String args) at System.ServiceProcess.ServiceBase.ServiceQueuedMainCallback(Object state)
You check for diffs, run it through tests and the console app, and wonder what could possibly be wrong. It's because of namespaces.
Somewhere in your solution, you have a class, outside of your service project, which uses the same namespace as the service.
So how do you resolve this issue?
Step 1. Locate the assembly where the "duplicate" reference lives (for instance, your type library), find the matching namespace, and be careful NOT to do a replace-all via search and replace.
Step 2. Rename the namespace appropriately.
Ta-da! The "Service cannot be started. System.TypeLoadException: Could not load type '' from assembly" error should be gone.
Tampa //rebuild/ Event via Microsoft Mondays
Tuesday, Aug 18, 2015, 6:30 PM
5426 Bay Center Dr tampa, FL
Catch up on some of what you missed at the Microsoft //build/ Conference! Join Randy Patterson with Catapult Systems, Donald Bickel with Mercury New Media, and others as we take a deep dive into topics covered at the conference.
The Tampa //rebuild/ event at the Microsoft office consisted of three lightning talks on the Microsoft Edge browser, ASP.NET 5, and IoT with Raspberry Pi and Windows 10. Some pictures and links from the talks follow.
MS Edge / UI/UX Talk
- David Walsh's Blog
- Code Pen - http://codepen.io/
- Can I use - http://caniuse.com/
- wufoo - http://www.wufoo.com/
- flight arcade - http://flightarcade.com/missions/tin
- HTML5 hub - http://html5hub.com/
- JetStream Benchmark Suite - https://www.webkit.org/blog/3418/introducing-the-jetstream-benchmark-suite/
- Octane benchmark suite - https://developers.google.com/octane/?hl=en
- Octopus Deployment - https://octopusdeploy.com/
- Modern IE - http://dev.modern.ie/
- Windows IoT https://dev.windows.com/en-us/iot
- Test Drive sites and demos dev.modern.ie/testdrive/
- Download virtual machines dev.modern.ie/tools/vms/mac/
IoT Talk (great session by Randy Patterson)
- Getting Started - http://ms-iot.github.io/content/en-US/GetStarted.htm
- Microsoft is holding a contest! Join Windows 10 IoT Core - Home Automation Contest on Hackster.io
- Hackster.io https://microsoft.hackster.io/en-US
- Become a part of our early adopter community https://www.windowsondevices.com/signup.aspx
- Randy Patterson Github repo https://github.com/RandyPatterson
- and if you like to troll Randy Patterson 🙂 http://rrpiot.azurewebsites.net/%20rrpiot.azurewebsites.net/SensorData?what%27s%20up!
And last but not least, an honorable mention to Team Duct Tape, who are fundraising for their upcoming robotics / tech challenge. All the best, guys & gals!
Praise be to God, Lord of the Worlds.
Wondering what to do on 4th of July long weekend? Learn Functional Programming in F# with my book!
I am glad to announce that my book, Learning F# Functional Data Structures and Algorithms, has been published and is now available via Amazon and other retailers. F# is a multi-paradigm programming language that encompasses object-oriented, imperative, and functional programming properties. The F# functional programming language enables developers to write simple code to solve complex problems.
Starting with the fundamental concepts of F# and functional programming, this book will walk you through basic problems, helping you to write functional and maintainable code. Using easy-to-understand examples, you will learn how to design data structures and algorithms in F# and apply these concepts in real-life projects. The book covers built-in data structures and takes you through enumerations and sequences. You will gain knowledge about stacks, graph-related algorithms, and implementations of binary trees. Next, you will understand the custom functional implementation of a queue, review sets and maps, and explore the implementation of a vector. Finally, you will find resources and references that will give you a comprehensive overview of the F# ecosystem, helping you to go beyond the fundamentals.
If you have just started your adventure with F#, then this book will help you take the right steps to become a successful F# coder. An intermediate knowledge of imperative programming concepts, and a basic understanding of the algorithms and data structures in .NET environments using the C# language and BCL (Base Class Library), would be helpful.
Writing a technology book, with its detailed technical and editorial reviews, is a long process, but it is an equally rewarding and unique learning experience. I am thankful to my technical reviewer and the Packt editorial team for providing excellent support to make this a better book. Nothing is perfect and to err is human; if you find any issues in the code or text, please let me know.
Happy Functional Programming!
The source code for the book can be downloaded from here.
The fun thing about spending time at MIT is that you always run into interesting things. A couple of days ago, I encountered the MIT bot submission for the NASA Sample Return Robot Challenge.
NASA and the Worcester Polytechnic Institute (WPI) in Worcester teamed up for the Sample Return Robot Challenge, which calls for demonstrating a robot that can locate and retrieve geologic samples from a wide and varied terrain without human control.
The Sample Return Robot Challenge is part of NASA's Centennial Challenges: build a robot with the autonomous capability to locate and retrieve specific sample types from various locations over a wide and varied terrain, and to return those samples to a designated zone in a reasonable amount of time with limited mapping data.
The challenge description follows:
The Sample Return Robot Challenge is scheduled for June 14-17, 2012 in Worcester, MA. The Challenge requires demonstration of an autonomous robotic system to locate and collect a set of specific sample types from a large planetary analog area and then return the samples to the starting zone. The roving area will include open rolling terrain, granular medium, soft soils, and a variety of rocks, and immovable obstacles (trees, large rocks, water hazards, etc.) A pre-cached sample and several other samples will be located in smaller sampling zones within the larger roving area. Teams will be given aerial/geological/topographic maps with appropriate orbital resolution, including the location of the starting position and a pre-cached sample.
MIT Robotics Team 2015 Promo Video
The bot is powered with the following technologies:
ROS: The Robot Operating System (ROS) is a set of software libraries and tools that help you build robot applications. From drivers to state-of-the-art algorithms, and with powerful developer tools, ROS has what you need for your next robotics project. And it's all open source. www.ros.org
Arduino: Arduino is an open-source electronics platform based on easy-to-use hardware and software. It's intended for anyone making interactive projects.
RabbitMQ for Async messaging: RabbitMQ is a messaging broker - an intermediary for messaging. It gives your applications a common platform to send and receive messages, and your messages a safe place to live until received.
The MIT team couldn't make it to the challenge due to some technical issues. NASA awarded $100,000 in prize money to the Mountaineers, a team from West Virginia University, Morgantown, for successfully completing Level 2 of the Sample Return Robot Challenge, part of the agency's Centennial Challenges prize program.
- Robots Face Off in $1.5 Million NASA Sample Return Challenge
- NASA's robot event challenges, robots, engineers
- NASA Awards $100,000 to Winning Team of Robot Challenge
- MIT CrowdFunding Page
- MIT RoboTeam
Day 2's lecture was outlined as

- Non-linear classification and regression, kernels
- Passive aggressive algorithm
- Overfitting, regularization, generalization
- Content recommendation
Dr. Jaakkola's Socratic method of posing common-sense questions ingrains the core concepts in people's minds. The class started with a follow-up on the perceptron from yesterday and quickly turned into a session on when NOT to use the perceptron, such as for non-linearly separable problems. Today's lecture was derived from 6.867 Machine Learning Lecture 8. The discussion extended to the Support Vector Machine (and Statistical Learning Theory) Tutorial, which is also well explained in An Idiot's Guide to Support Vector Machines (SVMs) by R. Berwick, Village Idiot.
Speaking of SVMs and dimensionality, Dr. Jaakkola posed the question of whether ranking can also be cast as a classification problem. Learning to rank, or machine-learned ranking (MLR), is a fascinating topic where common intuitions (the number of items displayed, error functions between a user's preference and the display order, sparseness) fall flat. Microsoft Research has some excellent reference papers and tutorials on learning to rank which are definitely worth poring over if you are interested in this topic. Label ranking by learning pairwise preferences is another topic discussed in detail during the class. Some reference papers follow:
- A Short Introduction to Learning to Rank
- Reviewing Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales
- LETOR: Learning to Rank for Information Retrieval Tutorials on Learning to Rank
- Ranking Methods in Machine Learning A Tutorial Introduction
- Yahoo! Learning to Rank Challenge Datasets
- Large Scale Learning to Rank
- Yahoo! Learning to Rank Challenge Overview
- Multiclass Classification: One-vs-all
- Zipf, Power-laws, and Pareto - a ranking tutorial Lada A. Adamic
Indeed, with SVMs the natural progression led to the 'k' word: kernel functions. A Brief Introduction to Kernel Classifiers by Mark Johnson (Brown University) is a good starting point, while The difference of kernels in SVM? and How to select a kernel for SVM provide good background material for understanding the practical aspects of kernels. Kernels and the Kernel Trick by Martin Hofmann (Reading Club "Support Vector Machines") is also worth a read.
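A quick way to see the kernel trick at work is a toy two-class problem that no straight line can separate. This sketch (my own, using scikit-learn's make_circles) compares a linear and an RBF-kernel SVM on the same data:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# One class forms a ring around the other: linearly inseparable in 2D
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)

# The RBF kernel implicitly maps the data into a space where a separating
# hyperplane exists, so it fits what a linear boundary cannot
print(linear_svm.score(X, y), rbf_svm.score(X, y))
```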
The afternoon topic was anomaly detection; use cases included aberrant behavior in financial transactions, insurance fraud, bot detection, manufacturing quality control, etc. One of the most comprehensive presentations on anomaly detection data mining techniques is by Francesco Tamberi, which is great for background. Several problems worked on during the class were from 6.867 Machine Learning, which shows how the instructors carefully catered the program to practitioners, with the right contents from graduate-level courses as well as industry use cases. Other topics discussed included linear versus nonlinear classifiers, and we learned how the decision boundary is the region of a problem space in which the output label of a classifier is ambiguous. Class discussions and Q&A touched on a wide variety of subjects, including but not limited to how to increase the accuracy of classifiers, recommendation systems, and A Comparative Study of Collaborative Filtering Algorithms, which eventually led to Deep Learning Tutorial: From Perceptrons to Deep Networks, which performed really well on the MNIST database of handwritten digits.
- Caltech 101
- THE MNIST DATABASE of handwritten digits
- Why do naive Bayesian classifiers perform so well?
Linear vs. non-linear classifiers followed, where Dr. Jaakkola spoke about why logistic regression is a linear classifier, with more on Linear classifier, Kernel Methods for General Pattern Analysis, Kernel methods in Machine Learning, How do we determine the linearity or nonlinearity of a classification problem?, and a review of Kernel Methods in Machine Learning.
Miscellaneous discussions of Kernel Methods, So you think you have a power law, the Radial basis function kernel, and Kernel Perceptron in Python surfaced, some of which are briefly reviewed in Machine Learning: Perceptrons - Kernel Perceptron Learning Part 3/4, Shape Fitting with Outliers, and the SIGIR 2003 tutorial Support Vector and Kernel Methods, with radial basis functions. Other topics included kernel-based anomaly detection with the Multiple Kernel Anomaly Detection (MKAD) algorithm, Support Vector Machines: Model Selection Using Cross-Validation and Grid-Search, LIBSVM -- A Library for Support Vector Machines, A Practical Guide to Support Vector Classification, Outlier Detection with Kernel Density Functions, and a Classification Framework for Anomaly Detection as relevant readings.
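In the spirit of the kernel-based anomaly detection readings above, here is a minimal one-class SVM sketch with scikit-learn; the Gaussian "normal" data and the two planted outliers are synthetic, and the nu/gamma values are illustrative guesses rather than tuned settings:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # "normal" behaviour, e.g. typical transactions
outliers = np.array([[6.0, 6.0], [-7.0, 5.0]])          # planted anomalies far from the bulk

# Train only on normal data; nu bounds the fraction of training points flagged as outliers
detector = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.1).fit(normal)

print(detector.predict(outliers))  # -1 marks an anomaly, +1 an inlier
```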
Looking forward to Deep Learning and Boosting tomorrow! Dr. Barzilay said it's going to be pretty cool.
As a follow-up on MIT's Tackling the Challenges of Big Data, I am currently in Boston attending Machine Learning for Big Data and Text Processing Classification (and therefore blogging about it for posterity, based on public-domain data / papers - nothing posted here is MIT proprietary info that would violate any T&C). MIT Professional Education courses are tailored towards professionals, and it is always a great opportunity to learn what other practitioners are up to, especially in a relatively new field like data science.
Today's lecture #1 was outlined as
- machine learning primer
- features, feature vectors, linear classifiers
- On-line learning, the perceptron algorithm and
- application to sentiment analysis
Instructors Tommi Jaakkola (Bio) (Personal Webpage) and Regina Barzilay (Bio) (Personal Webpage) started the discussion with a brief overview of the course. Dr. Barzilay is a great teacher who explains the concepts in amazing detail. An early adopter and practitioner, she was one of the MIT Technology Review Innovators Under 35.
The course notes are fairly comprehensive; the following are links to the publicly available material.
- Youtube: http://www.youtube.com/MITProfessionalEd
- FB: https://www.facebook.com/MITProfessionalEducation
- twitter: https://twitter.com/MITProfessional
- LinkedIn - https://www.linkedin.com/grp/home?gid=2352439
In collaboration with CSAIL - the MIT Computer Science and AI Lab - www.csail.mit.edu, today's lecture was a firehose version of Ullman's large-scale machine learning. Dr. Barzilay walked through the derivation of the perceptron algorithm, covering Perceptrons for Dummies and Single Layer Perceptron as Linear Classifier. For a practical implementation, Seth Juarez's numl implementation of the perceptron is good reading. A few relevant publications can be found here.
- NLP Programming Tutorial 3 - The Perceptron Algorithm
- Machine Learning: Exercise Sheet 4
- Perceptron Find Weight
- ML LAb Solutions
- Classification Exercise
- Perceptron Learning
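The mistake-driven perceptron update derived in the lecture can be sketched in a few lines of NumPy; the toy linearly separable points below are my own invention, not course material:

```python
import numpy as np

def perceptron(X, y, epochs=20):
    """Online perceptron for labels y in {-1, +1}; returns weights and bias."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:  # mistake: nudge the boundary towards this example
                w += yi * xi
                b += yi
    return w, b

# Toy linearly separable data
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

w, b = perceptron(X, y)
print(np.sign(X @ w + b))  # recovers the labels on separable data
```

On linearly separable data the algorithm is guaranteed to converge; on the non-separable cases discussed in the day-2 lecture it never settles, which is precisely the motivation for kernels and SVMs.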
The discussion progressed into opinion mining and sentiment analysis with related techniques. Some of the pertinent data sets can be found here:
- Huge n-grams dataset from Google - storage.googleapis.com/books/ngrams/books/datasetsv2.html
- Global ML dataset repository: https://archive.ics.uci.edu/ml
- Sentiment 140 Dataset
- Cornell Movie Review Dataset
Dr. Barzilay briefly mentioned Online Passive-Aggressive Algorithms and details from Lillian Lee's AAAI 2008 invited talk - A "mitosis" encoding / min-cost cut - while talking about domain adaptation, which is quite an interesting topic on its own. Domain Adaptation with Structural Correspondence Learning by John Blitzer, Introduction to Domain Adaptation (guest lecturer: Ming-Wei Chang, CS 546), and Word Segmentation of Informal Arabic with Domain Adaptation are fairly interesting readings. The lecture slides are heavily inspired by the CS 546 guest lecture.
With sentiment analysis and opinion mining, we went over the seminal Latent Semantic Analysis / Latent Semantic Indexing (LSI) (Deerwester et al., 1990), a Clustering Algorithm Based on Singular Value Decomposition, and Latent Dirichlet Allocation (LDA) (Blei et al., 2003). The class had an interesting discussion around The Hathaway Effect: How Anne Gives Warren Buffett a Rise, with a potentially NSFW graphic. The lecture can be summed up in Comprehensive Review of Opinion Summarization by Kim, Hyun Duk; Ganesan, Kavita; Sondhi, Parikshit; Zhai, ChengXiang (PDF version).
A few other papers, research works, and demos discussed during the lecture included Get out the vote: Determining support or opposition from Congressional floor-debate transcripts, Multiple Aspect Ranking using the Good Grief Algorithm, Distributional Footprints of Deceptive Product Reviews, the Recursive Neural Tensor Network - Deeply Moving: Deep Learning for Sentiment Analysis, the code for Deeply Moving: Deep Learning for Sentiment Analysis, the Stanford NLP sentiment analysis demo, and the Stanford Sentiment Treebank.
Among the several class discussions and exercises/quizzes, The Distributional Footprints of Deceptive Product Reviews was of primary importance. Starting with Amazon Glitch Unmasks War Of Reviewers, darts were thrown around Opinion Spam Detection: Detecting Fake Reviews and Reviewers and Fake Review Detection: Classification and Analysis of Real and Pseudo Reviews.
With all this sentiment analysis talk, I asked fellow attendee Mohammed Al-Hamdan (Data Analyst at Al-Elm Information Security Company) about publishing a paper by the end of this course on sentiment analysis of Arabic-language Twitter feeds for potential political dissent. It would be a cool project / publication.
Looking forward to the session tomorrow!
As a bonus, here is Dr. Regina Barzilay's Information Extraction for Social Media video - publicly available on YouTube.