Yet Another Data Blog


Sunday, December 10, 2017

Week 1 : fast.ai - 1 v2 - CNNs, Kernels, Optimal Learning Rate

Deep Learning is essentially a particular way of doing Machine Learning: you give your system a bunch of examples and it learns the rules and representations, rather than having you program the rules manually.

We have interesting applications today like Cancer diagnosis, Language Translation, Inbox by Google, Style Transfer, Data Center Cost optimization and Playing the Game of Go, among others, that are powered by Deep Learning. Jeremy also emphasizes the negatives that come with the growth of Deep Learning, like algorithmic bias, societal implications, job automation, etc.

With Deep Learning we need an infinitely flexible mathematical function that can solve any problem. Such a function would be quite large with lots of parameters and we have to fit those parameters in a fast and scalable way using GPUs.

The fast.ai philosophy is closely modeled after some of the concerns Paul Lockhart voiced in his essay A Mathematician's Lament: you start by doing right away and then gradually peel back the layers, modify and look under the hood. The general feeling out there is that there is a survivorship bias problem in the Deep Learning space, which is typified by this Hacker News post. The only currency that should matter is how well you're able to use these tools to solve problems and generate value.

Convolutions

CNNs are arguably the most important architecture for Deep Neural Networks. They're the state of the art for solving problems in many areas of Image Processing, NLP, Speech Processing, etc.

The basic building block of a CNN is the convolution, which on its own is a relatively straightforward process. A convolution is a linear operation that finds interesting features in an image. Performing one instance of the convolution operation (element-wise multiplication and addition) requires the following steps:

Identify kernel matrix - this is typically a 3 x 3 matrix or in some cases a 1 x 1 matrix
Pass kernel matrix over image (see figure below)
Perform element-wise multiplication between kernel and overlaying image pixels (see red box in image below)
Sum all the elements in the resulting matrix (in the figure below, the sum is 297)
Assign the sum as the new pixel value for the center pixel of the overlaid image crop in the activation map
This operation is repeated until you've completed passes over the entire image
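
The steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, assuming no padding and a stride of 1; the function name `convolve2d` and the 5 x 5 toy image are made up for the example, and the kernel values are the commonly used sharpen kernel. Like most deep learning libraries, the sketch does not flip the kernel, which makes no difference for a symmetric kernel like this one.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the activation map."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image[r:r + kh, c:c + kw]   # image pixels currently under the kernel
            out[r, c] = np.sum(patch * kernel)  # element-wise multiply, then sum
    return out

# Commonly used sharpen kernel (its weights sum to 1).
sharpen = np.array([[0, -1, 0],
                    [-1, 5, -1],
                    [0, -1, 0]])

# Hypothetical toy "image": pixel values 0..24 laid out on a 5 x 5 grid.
image = np.arange(25, dtype=float).reshape(5, 5)

activation_map = convolve2d(image, sharpen)
print(activation_map.shape)
print(activation_map)
```

Because there is no padding, the 5 x 5 input shrinks to a 3 x 3 activation map: the kernel can only be centered on pixels that have a full 3 x 3 neighbourhood.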

There are other parameters like kernel stride and padding that determine the dimensions of the activation maps. I'll be doing a more in-depth post on Convolutional Neural Networks to discuss these and the full CNN pipeline.
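
As a rough guide to how stride and padding affect the activation map, the usual output-size formula for one spatial dimension is floor((n + 2p - k) / s) + 1. The helper below is a small sketch of that formula; the name `conv_output_size` is just for illustration.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one dimension: floor((n + 2p - k) / s) + 1."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(5, 3))             # 3: a 3 x 3 kernel shrinks a 5-pixel side to 3
print(conv_output_size(5, 3, padding=1))  # 5: one pixel of padding keeps the size
print(conv_output_size(7, 3, stride=2))   # 3: a stride of 2 roughly halves the size
```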

Courtesy of setosa.io/ev/image-kernels/

In the figure above, we used the sharpen kernel. There are also a few other predefined kernels like sobel, emboss, etc. In a typical CNN pipeline, we start with randomly initialized convolution filters, apply a non-linear ReLU activation (which replaces negative values with zero) and then use SGD + backpropagation to find the best convolution filters. If we do this with enough filters and data we end up with a state-of-the-art image recognizer. CNNs are able to learn filters that detect edges and other simple image characteristics in the lower layers and then use those to detect more complex image features and objects in the deeper layers.
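
For reference, the ReLU step mentioned above is tiny when written out. This is a sketch with made-up activation values, not code from the course.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: negative values become 0, positive values pass through."""
    return np.maximum(x, 0)

# Hypothetical activations coming out of a convolution.
activations = np.array([[-2.0, 0.5],
                        [3.0, -0.1]])
print(relu(activations))
```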

To train an image classifier using the fast.ai library you need to

Select a starting architecture: resnet34 is a good option to start with.
Determine the number of epochs: start with 1, observe results and then run multiple epochs if needed
Determine the learning rate: using the strategy in the Cyclical Learning Rate paper, we keep increasing the learning rate until the loss stops decreasing. This will probably take less than one epoch if you have a small batch size. From the figure below, we want to pick the largest learning rate at which the loss is still clearly decreasing (in this case learning_rate = 0.01)
Train the learn object

Courtesy of fast.ai
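
To make the learning rate search more concrete, here is a rough sketch of the idea on a toy linear regression problem. It is not the fast.ai implementation: the function name `lr_range_test`, the synthetic data and the "stop when the loss is 4x the best seen" rule are all assumptions made for this illustration. The learning rate grows exponentially after every mini-batch and we record the loss at each step.

```python
import numpy as np

def lr_range_test(X, y, lr_start=1e-5, lr_end=10.0, num_steps=100,
                  batch_size=16, seed=0):
    """Increase the learning rate every mini-batch and record the loss at each step."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    lrs = np.geomspace(lr_start, lr_end, num_steps)  # exponentially spaced rates
    losses = []
    best = np.inf
    for lr in lrs:
        idx = rng.choice(len(X), size=batch_size, replace=False)
        xb, yb = X[idx], y[idx]
        err = xb @ w - yb
        loss = np.mean(err ** 2)
        # Stop once the loss blows up (assumed rule: 4x the best loss seen so far).
        if not np.isfinite(loss) or loss > 4 * best:
            break
        best = min(best, loss)
        losses.append(loss)
        grad = 2 * xb.T @ err / batch_size  # gradient of the mini-batch MSE
        w -= lr * grad                      # plain SGD update with the current rate
    return lrs[:len(losses)], np.array(losses)

# Synthetic data, made up for the example: y = 3*x1 - 2*x2 + noise.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(256, 2))
y = X @ np.array([3.0, -2.0]) + data_rng.normal(scale=0.1, size=256)

lrs, losses = lr_range_test(X, y)
print(len(lrs), "steps recorded before the loss blew up or the schedule ended")
```

Plotting `losses` against `lrs` on a log-scaled x axis gives a curve of the same kind as the figure above, and you would read the learning rate off the part where the loss is still falling.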

Highlights

We used a pre-trained ResNet34 to train a CNN on data from the Cats vs Dogs Kaggle competition and obtained > 99% accuracy.
We used a new method (from Cyclical Learning Rates for Training Neural Networks) to determine the optimal learning rate. The learning rate determines how quickly we update our weights.


Sunday, February 23, 2014

Week 5 : Zipfian Academy - Graphs and Community Detection

The update for last week will be short and quick. Doing these blog posts is getting much harder.

We started the week looking at unsupervised learning techniques like k-means and hierarchical clustering. We also visited dimensionality reduction techniques like SVD and NMF. By mid-week, we switched gears to graph analysis and covered, in no particular order, BFS, DFS, A*, Dijkstra and community detection in graph networks.

Takeaways from the week:

We had several guest lectures this week. @kanjoya is working on the cutting edge of Natural Language Processing. They help their clients derive actionable intelligence from emotions and intuition. The speaker discussed the general NLP landscape: tools and techniques. I found it interesting that some of their training data comes from The Experience Project
@geli gave an interesting talk. They've basically built an OS for energy systems and hope to revolutionize the energy management space
@thomaslevine's talk was on open data initiatives around the country. Open Data is one of those things cities like to talk about, but very few of them are doing it well
Things were switched around this week. We ended the week working on a dataset from one of the partner companies. The dataset recorded mobile ads served to users at various locations, and we were supposed to do some exploration and find the best locations to serve ads to users. The dataset had a couple million records. Trying to wrangle gigabyte-sized data on just 4 GB of RAM is definitely not fun. I ordered a 16 GB RAM kit and should get it by this weekend. If you are thinking of enrolling in the course, you should shoot for at least 8 GB of RAM.

Saturday, February 15, 2014

Week 4 : Zipfian Academy - Oh SQL, Oh SQL... MySQL and some NLP too

So things were totally ramped up this week. We started out by scraping, parsing and cleaning data from the NYTimes API and then jsonified and stored the data in MongoDB. The next day we used the same dataset ported to a few SQL tables and implemented the Naive Bayes algorithm in SQL to classify which labels an article would fall under. We continued with some diagnostics like confusion matrices, confusion tables, false alarm rate, hit rate, precision, recall, ROC curves, etc. Other topics covered include NLTK, tokenization, TF-IDF, n-grams, regular expressions, and feature selection using Chi-Squared and Mutual Information. We ended the week by working on another past Kaggle Competition - StumbleUpon Evergreen Classification Challenge.

We are at the halfway point for the structured part of the class. Just in case you're thinking of doing this, my schedule these days is about 12 - 15 hrs / day during the week doing daily sprints (data scrubbing, transformation / machine learning challenges), reading data science materials and attending lectures. Over the weekend, I'd say about 10 hrs / day closing the loop on a few of the sprints from the current week and doing more data science readings for the following week. You basically live and breathe data science... all day long... all week long.
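
As a quick reference for the diagnostics mentioned above, here is a small sketch that derives precision, recall (the hit rate) and the false alarm rate from the four cells of a binary confusion matrix. The function name and the labels are made up for the example; this is not the code from class.

```python
def classification_metrics(y_true, y_pred):
    """Compute confusion-matrix counts and a few rates for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # hits
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # misses
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correct rejections
    return {
        "confusion": {"tp": tp, "fn": fn, "fp": fp, "tn": tn},
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,             # also called hit rate
        "false_alarm_rate": fp / (fp + tn) if fp + tn else 0.0,   # false positive rate
    }

# Hypothetical labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

An ROC curve is what you get by sweeping the classifier's decision threshold and plotting the hit rate against the false alarm rate at each setting.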

Highlights from the week :

We had two guest lectures this week. They were on Naive Bayes and feature extraction in NLP. Zipfian also added a guest lecturer to their roster. The new instructor is a Deep Learning expert and I'm really excited to explore working on new datasets with Neural Networks.
Implementing things from first principles gives you a better understanding of how some of these algorithms work and what may be going on under the hood when they fail.
My team also took the top spot in the Kaggle competition for the second week. The problem we worked on was a classification problem using AUC (Area Under the Curve) as the evaluation metric. We achieved an AUC of $\approx 0.8895$, which is about 0.008 off the leading Kaggle submission on the public leaderboard
Cross-validating on your training set is always a good idea

Saturday, February 8, 2014

Week 3 : Zipfian Academy - Multi-armed bandits and some Machine Learning

We started the week by finishing off the session on Bayesian Statistics with the study of Bayesian A/B Testing techniques. Some of the strategies covered are solutions to the Multi-armed Bandit problem: epsilon greedy, Bayesian Bandits and UCB1. These algorithms typically outperform traditional A/B testing. We officially started machine learning this week with the treatment of linear regression, multiple linear regression, hetero/homoscedasticity and multicollinearity. Other topics we covered include Lasso / Ridge regression, cross-validation, overfitting, bias / variance and Gradient Descent. We capped off the week by working on data from one of the past Kaggle competitions - Blue Book for Bulldozers.

A few takeaways from this week:
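
To give a flavour of the bandit strategies mentioned above, here is a minimal epsilon-greedy sketch. The conversion rates, the function name and the parameter values are made up for illustration. With probability epsilon we explore a random arm; otherwise we exploit the arm with the best observed conversion rate so far.

```python
import random

def epsilon_greedy(true_rates, epsilon=0.1, num_rounds=5000, seed=0):
    """Simulate an epsilon-greedy bandit over arms with Bernoulli rewards."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    pulls = [0] * n_arms  # how many times each arm was shown
    wins = [0] * n_arms   # how many times each arm converted
    for _ in range(num_rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick a random arm
        else:
            # exploit: pick the arm with the best observed rate (unseen arms count as 0)
            estimates = [wins[i] / pulls[i] if pulls[i] else 0.0 for i in range(n_arms)]
            arm = max(range(n_arms), key=lambda i: estimates[i])
        reward = 1 if rng.random() < true_rates[arm] else 0
        pulls[arm] += 1
        wins[arm] += reward
    return pulls, wins

# Hypothetical conversion rates for three page variants.
pulls, wins = epsilon_greedy([0.05, 0.10, 0.15])
print("pulls per arm:", pulls)
print("wins per arm:", wins)
```

Unlike a fixed-split A/B test, most of the traffic ends up on whichever variant looks best so far, while the epsilon fraction keeps checking the others.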

There were a few algorithms I had always sort of understood. Some of these algorithms become very clear once you implement them from first principles and then apply them to a dataset. We implemented a Gradient Descent function and then used it to minimize the cost function of both linear and logistic regression problems (I'll probably have a more detailed blog post on this). Working on some regularization with Lasso and Ridge also gave me a better understanding of how they both work
We had a visit from @StreetLightData. Very cool problem they're working on. They essentially model mini migration patterns in cities / across the country. They feed data from cell signals, GPS, Census Data (Demographics / Geo) and Traffic data into their systems to extract insights used for marketing and planning
Always remember 80-20. Data scientists spend 80% of their time cleaning datasets and extracting features (or at least more than half their time) and about 20% of their time doing modeling and parameter tuning. Forget those datasets you used in Stats class, real world data can be real messy
$k$-fold Cross-validation helps you prevent overfitting, gives you an estimate of your prediction error and helps you understand how stable / robust your model is

$$CV_{(k)} = \frac{1}{k}\sum_{i=1}^k MSE_{i}$$

where MSE is Mean Squared Error

My team took the top spot in the Kaggle competition we worked on. We had an RMSLE (Root Mean Squared Log Error) of $\approx 0.43$, which is about $0.2$ off the winning Kaggle submission. Decent for a few hours of work. It does look like working on Kaggle competitions may become a mainstay / regular end of week exercise.
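
The gradient descent and $k$-fold cross-validation points above fit together nicely in one small sketch. This is not the code we wrote in class: the function names, the learning rate, the iteration count and the synthetic data are all assumptions made for illustration. It fits a linear regression by batch gradient descent on the MSE and then averages the per-fold MSEs, as in the $CV_{(k)}$ formula above.

```python
import numpy as np

def fit_linear_gd(X, y, lr=0.1, num_iters=500):
    """Fit linear regression weights (with intercept) by batch gradient descent on the MSE."""
    Xb = np.column_stack([np.ones(len(X)), X])  # prepend a column of 1s for the intercept
    w = np.zeros(Xb.shape[1])
    for _ in range(num_iters):
        grad = 2 * Xb.T @ (Xb @ w - y) / len(y)  # gradient of mean((Xb @ w - y)^2)
        w -= lr * grad
    return w

def predict(w, X):
    return np.column_stack([np.ones(len(X)), X]) @ w

def k_fold_cv_mse(X, y, k=5, seed=0):
    """Return CV_(k) = (1/k) * sum of the per-fold MSEs, plus the individual fold MSEs."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)  # shuffle, then split into k folds
    fold_mses = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        w = fit_linear_gd(X[train_idx], y[train_idx])
        residuals = predict(w, X[test_idx]) - y[test_idx]
        fold_mses.append(float(np.mean(residuals ** 2)))
    return float(np.mean(fold_mses)), fold_mses

# Synthetic data, made up for the example: y = 1 + 3*x1 - 2*x2 + noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

cv_score, fold_mses = k_fold_cv_mse(X, y, k=5)
print("CV MSE:", cv_score)
print("per-fold MSEs:", fold_mses)
```

The spread of the per-fold MSEs is a rough indicator of how stable the model is: if they vary a lot from fold to fold, the fit depends heavily on which rows it happened to see.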