Yet Another Data Blog

Showing posts with label NLP. Show all posts

Sunday, February 23, 2014

Week 5 : Zipfian Academy - Graphs and Community Detection

The update for last week will be short and quick. Doing these blog posts is getting much harder.

We started the week looking at unsupervised learning techniques like k-means and hierarchical clustering. We also visited dimension reduction techniques like SVD and NMF. By mid-week, we switched gears to graph analysis and covered in no particular order BFS, DFS, A*, Dijkstra and community detection in graph networks

Take aways from the week:

We had several guest lectures this week. @kanjoya is working on the cutting edge of Natural Language Processing. They help their clients derive actionable intelligence from emotions and intuition. The speaker discussed the general NLP landscape : tools and techniques. I found it interesting that some of their training data comes from The Experience Project
@geli gave an interesting talk. They've basically built an OS for energy systems and hope to revolutionize the energy management space
@thomaslevine talk was on open data initiatives around the country. Open Data is one of those things cities like to talk about but very few of them are doing it well
Things were switched around this week. We ended the week working on a dataset from one of the partner companies. The dataset recorded mobile ads served to user at various locations, we were supposed to do some exploration and find out the best locations to serve ads to users. The dataset had a couple million records. Trying to wrangle giga-byte sized data on just 4gb of RAM is definitely not fun. I ordered a 16gb RAM kit, should get it by this weekend. If you are thinking of enrolling for the course, you should shoot for at least 8gb of RAM.

Saturday, February 15, 2014

Week 4 : Zipfian Academy - Oh SQL, Oh SQL... MySQL and some NLP too

So things were totally ramped up this week. We started out by scrapping, parsing and cleaning data from the NYTimes API and then jsonified and stored the data in MongoDB. The next day we used the same dataset ported to a few SQL tables and implemented the Naive Bayes algorithm in SQL to classify which labels an article would fall under. We continued with some diagnostics like confusion matrix , confusion tables, false alarm rate, hit rate, precision, recall, ROC curves, etc. Other topics covered include NLTK, tokenization, TF-IDF, n-grams, regular expressions, feature selection using Chi-Squared and Mutual Information. We ended the week by working on another past Kaggle Competition - StumbleUpon Evergreen Classification Challenge

We are at the half way point for the structured part of the class. Just in case you're thinking of doing this, my schedule these days is about 12 - 15 hrs / day during the week doing daily sprints (data scrubbing , transformation / machine learning challenges), reading data science materials and lectures. Over the weekend, I'd say about 10 hrs / day closing the loop on a few of the sprints from the current week and doing more data science / readings for the following week. You basically live and breathe data science... all day long..all week long

Highlights from the week :

We had two guest lectures this week. They were on Naive Bayes and feature extraction in NLP. Zipfian also added a guest lecturer to their roster. The new instructor is a Deep Learning expert and I'm really excited to explore working on new datasets with Neural Networks.
Implementing things from first principles gives you a better understanding of how some of these algorithms work and what may be going on under the hood when they fail.
My team also took the top spot in the Kaggle competition for the second week. The problem we worked on was a classification problem using AUC (Area Under the Curve) as the evaluation metric. We achieved an AUC of $\approx 0.8895$ which is about 0.008 off the leading Kaggle submission on the public leaderboard
Cross-validating on your training set is always a good idea

Tuesday, October 22, 2013

Real Time Speech Translation

Now this is pretty cool, real time language transalation...

Language translation has come a long way from Hidden Markov Models, Pattern Recognition and now Deep Learning Neural Networks. These deep neural networks are also behind the new Google 'Hummingbird' Algorithm

Wednesday, October 31, 2012

Slides from “Survey of Text Mining / NLP in R” talk #rstats

Here are my slides from Survey of Text Mining / NLP in R talk delivered at the October Meetup of the Austin R User Group