Wednesday, December 14, 2016

MLE, MAP and Naive Bayes.... so similar looking yet so different

So some of you might know, I was following this CMU course on machine learning by Tom Mitchell and Nina Balcan and was going through a few of the initial lectures. Although everything was going okay-ish, there was this one basic doubt about how MLE and MAP essentially differ from the probabilistic approaches to machine learning that use Bayes' theorem. Now some of those who know the answer and have it clear in their head might be thinking what a foolish thing this is to get confused about. Well, this post is not for them. This post might be very basic for many of you, but still it was indeed the aha moment for me. So here I am, writing about how MLE is different from Naive Bayes (NB).

Note: the following was jotted down as "notes" in the default "Notes" app and hence lacks proper formatting (subscripts/superscripts) and refinement in terms of language. But the ideas are probably clear and are worth sharing. I promise I will come back and properly format it, but I am too tired after a full day of work to edit it right now. I just hope I made sense here. Good night!


Maximum Likelihood estimate
  • You are given data. 
  • You need to determine a probability of something. (In some examples you may see the final answer matching your intuition and wonder why you are doing this lengthy process, but there will be cases where your intuition isn't that strong, and then using this MLE principle helps a lot.)
  • That probability is expressed as some parameters. These parameters are called probability parameters. 
  • If we can express the probability of our (already observed) data in terms of these parameters, then the logic is: take the value of the parameters that maximizes the probability of the data. That is just argmax over θ of P(D|θ). That's the whole logic/hope of this. 
  • We express our data in terms of these parameters. The probability of the data, viewed as a function of the parameters, is called the likelihood function. 
  • So finally, just maximize this function. How do you do that? Differentiate and set the derivative to 0. That surely works in the case where we have only one parameter. For more than one parameter, we use gradient descent, which I think is kind of a generalization of the first case. 
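The bullet points above can be sketched in code for the classic coin-flip example. This is a minimal sketch with made-up flips; the closed-form answer θ = heads/total comes from differentiating the log-likelihood and setting it to 0, and we double-check it by brute force over a grid of θ values:

```python
# MLE for a biased coin (Bernoulli parameter theta), on made-up data.
# The likelihood is P(D|theta) = theta^heads * (1 - theta)^tails.

data = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # made-up flips: 1 = heads, 0 = tails

heads = sum(data)
tails = len(data) - heads

# Closed form: set the derivative of the log-likelihood to 0
theta_closed_form = heads / len(data)

# Brute force: evaluate the likelihood on a grid and take the argmax
def likelihood(theta):
    return (theta ** heads) * ((1 - theta) ** tails)

grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=likelihood)

print(theta_closed_form)  # 0.7
print(theta_grid)         # 0.7
```

Both routes land on the same answer, which is exactly the "matches your intuition" situation mentioned above: the MLE of a coin's bias is just the observed fraction of heads.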

Now, MLE is just to determine a probability (via probability parameters) from given data. For now, let's forget about MAP. And notice that there is no meddling with Bayes' rule here. This should help you understand NB without any confusion. 

My confusion regarding NB and MLE. 
  • I just used to think that MAP and NB are just so similar. Both are computing a posterior via Bayes' rule: MAP does P(θ|D) from P(D|θ)·P(θ)/P(D), and NB does P(Y|X) from P(X|Y)·P(Y)/P(X). So where exactly is the difference? Both are using Bayes to find their probabilities. 
  • For NB (and actually all generative models), we are actually trying to determine P(Y|X) using Bayes' rule in terms of P(X|Y)·P(Y); we take P(X) lightly, since it is just a normalizing constant. This method wouldn't actually need MLE or MAP or any other probability-estimation technique if we were given the joint probability distribution. If that were the case, we could simply determine the value of P(Xi|Y) just by counting the examples in the training data. Same for P(Y). But alas, as we saw, the JPD is not that easy to come by. Thus we have to estimate P(X|Y) in other ways. One way to go is to make no assumptions and try to learn all the probabilities (and thus probability parameters) that we may need. 
  • Now if Y can take k classes and X is a vector of features <X1, X2, …, Xn>, each of which is discrete and can take j values, then we would need to estimate on the order of (j^n)·k probabilities. We can use MLE (or MAP) to model these probabilities as one or more parameters each and then estimate them. But the number of parameters (probabilities to estimate) is huge.
  • What if we assume that all Xi are independent given Y? Then P(X|Y) decomposes into P(X1|Y)·P(X2|Y)·P(X3|Y)·…·P(Xn|Y). We can estimate each individual little probability and compose the required ones to find P(X|Y). With the same assumptions for X and Y, the number of parameters we need to estimate is now n·j·k. So we save a lot here. 
  • But again we would use MLE or MAP to estimate these little probability parameters. 
  • So Bayes' theorem may help us structure the classifier, but to find those individual little probabilities you are still using MLE (or MAP). 
  • And MLE and MAP are still valid for estimating the parameters of discriminative classifiers like logistic regression. In LR you won't get confused with Bayes' rule; you just use MLE or MAP to determine those parameters directly. Same for linear regression. 
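To make the counting idea concrete, here is a toy sketch with made-up weather-style data (the feature values and labels are purely illustrative). Every P(Y) and P(Xi|Y) is estimated by counting, which is exactly the MLE for these discrete distributions, and prediction ignores the normalizer P(X):

```python
# Toy Naive Bayes on made-up data: estimate P(Y) and P(Xi|Y) by counting,
# then predict argmax_y P(y) * prod_i P(xi|y), ignoring P(x).
# (No smoothing here; a real implementation would add Laplace smoothing
# so unseen feature values don't zero out the whole product.)

from collections import Counter, defaultdict

# made-up training data: (features, label); features are discrete
train = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "cool"), "yes"),
    (("sunny", "cool"), "yes"),
]

label_counts = Counter(y for _, y in train)
# feature_counts[i][y][v] = how often feature i took value v among label-y examples
feature_counts = defaultdict(lambda: defaultdict(Counter))
for x, y in train:
    for i, v in enumerate(x):
        feature_counts[i][y][v] += 1

def predict(x):
    def score(y):
        p = label_counts[y] / len(train)                    # P(y) by counting
        for i, v in enumerate(x):
            p *= feature_counts[i][y][v] / label_counts[y]  # P(xi|y) by counting
        return p
    return max(label_counts, key=score)

print(predict(("rainy", "mild")))  # yes
```

With n = 2 features, j = 3 values each and k = 2 classes, we only had to fill n·j·k = 12 little conditional probabilities instead of (j^n)·k = 18 joint ones; for realistic n the gap becomes astronomical.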


To be honest, I thought I would discuss MAP as well. But I guess I have covered the important differences between what MLE and MAP try to do vs what is present in NB. So I will leave the explicit explanation of MAP for another time. 

Saturday, November 19, 2016

The good old gradient descent

Machine learning is all about finding that perfect function approximation, as Tom Mitchell mentioned in his very first lecture. Hence, more often than not, we end up minimizing something (an error) and thus with an optimization problem. An extremely popular method for optimization is gradient descent. 

Basic discussion
It is (at least as far as I know): 
  • An algorithm to find optimum. 
  • Works with convex functions (with convex functions and constraints on step size, one can guarantee convergence)
The main intuition is to go downhill in the direction of steepest descent, i.e., along the negative "slope". Let's take a single-variable function f(x) = x^2. Say we start with some value of x. What do we need to do to go down to the minimum? Calculate the slope and move in the opposite direction. Just move "a little bit" so as to not overshoot the minimum. With a small enough "jump" (or step size), we finally reach a point where the slope is 0. You have hit the minimum. 
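That single-variable walk-through fits in a few lines of code (a minimal sketch; the starting point, step size and iteration count are arbitrary choices):

```python
# Gradient descent on f(x) = x^2, whose derivative is f'(x) = 2x.
# Starting from x = 5 with a small step size, repeatedly moving against
# the slope takes us toward the minimum at x = 0.

def gradient(x):
    return 2 * x  # derivative of x^2

x = 5.0
step_size = 0.1
for _ in range(100):
    x = x - step_size * gradient(x)  # move opposite to the slope

print(x)  # very close to 0
```

Each iteration multiplies x by (1 - 2·step_size) = 0.8 here, which is why a step size that is too large (making that factor bigger than 1 in magnitude) would overshoot and diverge instead.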

For functions of more than one variable, the graph of the function can be thought of as a terrain in a higher-dimensional space. For a function of 2 variables we can actually visualize it, since the whole thing sits in 3D. The Khan Academy video here explains a bit about the gradient in general. Now we again do a similar thing: start with an arbitrary point and move in the direction of steepest descent, i.e., against the gradient. For our 2-variable function, you can imagine a marble rolling on the surface: it rolls off in the direction of steepest slope (which is what the gradient generalizes). The same holds for a function of N-1 variables in N-dimensional space. 

Deeper:
The important thing is for the step size to be small. Otherwise it may happen that the iterates completely overshoot the optimum again and again. Keep the step size small and, as you guessed, it might take extremely long to converge. Thus one possible solution is not to keep the step size constant but to vary it as well.  
A few cool aspects of step sizes: line search. 

Still deeper:
Guarantees:
We can provide guarantee of convergence if the following happens:
  1. the function is convex. 
  2. The step sizes are small enough or are varied (Have to confirm and update)
Limitations or constraints:

  1. The function has to be differentiable throughout (?? Have to confirm and update)
  2. Have to find other limitations. Probably some info about practical nuances and limitation of using GD. 
What modification or variant of it is used in industry?
This is my favorite section, since I can actually provide some more input than just restating the information present on the net. I'll keep editing the post to enhance this section with the best knowledge I can provide. 
Generally I have seen stochastic gradient descent being applied in industry. SGD, as it is called, has an implementation in Apache Spark too. What is cool is that Spark's SGD is distributed. Another implementation I have seen being used is L-BFGS. I will soon come up with some in-depth discussion about both of these and update this blog post.
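For a rough idea of what makes SGD "stochastic" (this is just a toy sketch on made-up data, not how Spark implements it): each step uses the gradient of a single randomly chosen example instead of the gradient over the whole dataset, which makes each step much cheaper on large data.

```python
# Stochastic gradient descent fitting a one-parameter model y = w*x
# to made-up data, minimizing squared error one random example at a time.

import random

random.seed(0)
data = [(x, 3.0 * x) for x in range(1, 11)]  # made-up data with true w = 3

w = 0.0
step_size = 0.01
for _ in range(1000):
    x, y = random.choice(data)       # one random example per step
    grad = 2 * (w * x - y) * x       # gradient of (w*x - y)^2 w.r.t. w
    w -= step_size * grad

print(w)  # very close to 3
```

The trade-off is that individual steps are noisy, which is also why step-size schedules matter even more for SGD than for batch gradient descent.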

If you find anything erroneous please comment and let me know. By that time I'll also keep on improving my knowledge and revising this post. Thanks for reading.  

Friday, November 4, 2016

More about trees, boosted ones!

Hey,

Sorry for being a bit dormant; a lot of office-related work cropped up. But then it also means a lot of learnings which I am itching to pen down. So, continuing with what was being studied, decision trees, I will elaborate a bit more and probably hint at what is generally done in industry regarding them. Rest assured, no NDAs are being broken and I am not giving you highly confidential information. 

So let the learning begin:

So along with decision trees there are following terms which one hears a lot:
  1. Random Forests
  2. Gradient Boosted trees
So I decided to jump into these, with insights that I have from working in industry. Rest assured, no secrets of my current (or former) company are being spilled. It's just the exposure which has made me aware of the various algos/frameworks I will be discussing. 

So decision trees were covered in the last post, which probably was written in another lifetime. Here is something which sounds similar to decision trees but may in fact be very different in terms of concept: gradient boosted trees. Here is my current understanding of it, what I see in industry about it in general, and other comments. Of course, one can find more rigorous work/papers if one wants to dive deep into the complexities. 

Gradient boosted trees:
Hoping that the name is not too far off, one can see that there are three distinct words and associated concepts. Since life is a learning experience and I will keep learning about these various concepts in more detail, what follows is according to my current understanding and will be updated as and when I learn more. So the words being:
Gradient, boosted and trees. 
Gradient: We can guess that it has something to do with gradient we deal with optimization. Hell, it is machine learning we are talking about, of course this would be the same gradient. 
Boosted: The word reminds me of a technique I had studied as part of my undergrad data mining course. Boosting of classifiers. More rigorous details below. 
Trees: Probably decision trees are being used somewhere. Let’s see where and how!

Boosting: a bit of dive deep
Idea: a question was asked: if we can indeed get a classifier which is weak but better than random (say, one whose accuracy is better than 0.5 on a binary classification problem), can we somehow use a few such weak classifiers to make a better, "stronger" one? An intuitive question to ask. And the answer? Yes! Well, before you start celebrating that we have found a philosopher's stone… there are a few ifs, and they are quite big IFs. 

The basic idea: combine different weak classifiers, each of which might be working well thanks to some limited set of dimensions (features). Combine them on the basis of how well they work. Weighted average, anyone?

What if you had decision trees as your base models? Then you get "boosted trees". But where does the word gradient come from?
The problem of boosting, i.e., combining weaker classifiers to form a strong one, can be thought of as an optimization problem. What are you trying to optimize? The error rate of the final classifier. What is your search space? Well, if you view modeling as finding an approximate function, then the problem of boosting becomes that of finding many approximate functions and then combining them together. Thus your search space is the space of these approximate functions, and you want to select a good combination from it.  

Your final model is a combination of weak classifiers h_m(x), where the impact of each classifier is weighted by a coefficient gamma_m: F(x) = gamma_1·h_1(x) + gamma_2·h_2(x) + … + gamma_M·h_M(x). 
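A tiny sketch of this weighted combination (the weak classifiers and the gamma weights here are entirely hand-picked for illustration; a real boosting algorithm like AdaBoost or gradient boosting learns them from the data):

```python
# Weighted vote of weak classifiers: F(x) = sum_m gamma_m * h_m(x),
# final prediction = sign of F(x). Each made-up h_m returns +1 or -1.

weak_classifiers = [
    lambda x: 1 if x > 2 else -1,
    lambda x: 1 if x > 5 else -1,
    lambda x: 1 if x < 8 else -1,
]
gammas = [0.5, 1.0, 0.3]  # hypothetical weights, e.g. based on each one's accuracy

def F(x):
    return sum(g * h(x) for g, h in zip(gammas, weak_classifiers))

def predict(x):
    return 1 if F(x) > 0 else -1

print(predict(6))  # 1: all three weak classifiers agree
print(predict(3))  # -1: the heaviest-weighted classifier wins the vote
```

Note how at x = 3 two weak classifiers vote +1 but the classifier with the largest gamma drags the sum negative: "combine them on the basis of how well they work" in action.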

With the intentions and bigger picture set, it’s time to dive into some sweet equations to formalize things a bit more. Updates on it soon. 
(To be continued…)


P.S: I started writing this post in the hope of talking about gradient boosted trees, their implementation in Spark, and XGBoost. Looks like it is going to take a lot more than a single post. Thus the rest of the topics will be dealt with in the next post. 

Wednesday, February 3, 2016

Decision trees and overfitting

Hey!
So I am about to watch the second video in CMU's course for ML. I hope that, unlike earlier, I am able to find the time and complete the full course this time. The title of the video is "Decision trees, overfitting and probability". So let's see what I remember from before:


  1. Overfitting: A phenomenon which occurs when your classifier performs very well on your training data but is not able to perform well on unseen data (testing data).
  2. Why does it happen? You are very limited in your training set?? uhh…. not able to explain this… let's see what Tom has to say about it.
  3. How to mitigate it? Using regularization wherever applicable; L1 and L2 regularization can be used in some linear classifiers like logistic regression. A great line from a great man is: the amount of training data should never be a problem (when there is more of it!!). All in all, there is a lot that I need to learn about overfitting. 
So let's begin watching the video and taking down notes; hopefully I will be a bit clearer about the concept afterwards. 

Comments and question:

  • Everything in machine learning depends on the assumption that the testing data is random and a good approximation of the unseen data. 
  • Class imbalance hasn't been touched yet! Though to think of it more, does doing upsampling even help in the case of decision tree?
  • Short answer for why a short tree is preferred: Occam's razor. We hope that it means less chance of overfitting.  
  • Have to watch/read again about the guarantees she was talking about in supervised learning…. feeling extremely sleepy right now…… have to do it sometime!!
  • The prof says that post pruning is better than pre pruning because here we evaluate more trees. Yup. It may be more computationally intensive and hence might not be practical on large datasets if you are constrained with limited time. Should see some literature/practical open source software about how it is done. Might even see Amazon's random forest implementation. A thing which I won't (and I can't) share with you guys ;)
  • Reasons for overfitting: noise and statistical coincidences. 
  • I remember someone saying the statement: decision tree is polynomial in terms of input attributes.... what is the meaning and significance of this statement? 

Saturday, January 30, 2016

Decision trees and intro to ML from CMU's course

Hey! Decision trees as a classifier for machine learning are, I guess, one of the best starting points. Probably because they are so close to our intuition. I'm going to watch this course video from CMU's ML class about decision trees. Let's embark on our journey to learn something new, disabuse ourselves of our wrong notions, and learn from a great teacher.

So here is what I already know about decision trees (yeah, I had done an ML course in college as an undergrad; haven't touched decision trees since then though).

What I already know:

  1. Supervised method to classify records/tuples/examples/points.
  2. Main aim is to map each record to a leaf node; the label corresponding to that leaf node is assigned to the record.
  3. Construction of tree is based on attribute values. from the root, branches come out based on attribute values.
  4. The main aim is splitting the space. By intuition of common sense, we would like to split based on an attribute which splits the input space best. By input space I mean feature space. 
  5. Since there can be many leaf nodes, decision trees are well suited for multi class problems unlike SVM or Logistic Regression. 
  6. Decision trees are inherently non-linear classifiers.
  7. They too suffer from over fitting/under fitting. Pruning is one way to deal with it.


Questions I have in mind, already!

  1. How is it different from rule-based classifiers? Because in a way, each node is just a rule, and executing these rules one after another is like traversing the decision tree.
  2. Despite having such great advantages (inherently being non-linear and suitable for multi-class), why isn't it used much? Random forests are used, but I don't know much about them. 

Time to watch the lecture guys. This is the course website which I am following. 

And here is the video which I am about to watch:

So the video has been watched. Here are some initial comments and questions. Would come back later to improve upon them/neatly classify them and dive deep into them:

  1. The prof talks about algos which work with some labeled data and a lot of unlabelled data. What was he hinting at? Have to see….
  2. Structured vs unstructured data….. got to know the terms…. I guess seeing programs learn from unstructured data is what enthralls novices!! Where would some blob of text fit? It is not exactly structured….. and marking or tagging a few parts of the sentences would make it different from unstructured data as well. Though labeling it as structured or unstructured is not that important, just for fun.
  3. The part of self adapting algorithms…. haven’t explored them much. Should see the part about optimizing matrix multiplication paper.
  4. Can any function be represented using a decision tree if the attributes take boolean values? The fact is that we can easily represent NAND and NOR gates using decision trees, and since NAND and NOR gates can be used to model any boolean function, we can say yes, any boolean function can be represented. The cool thing is that the guy in the video gives a more general and awesome answer: his construction of N levels corresponding to N attributes, with the number of branches at each level depending on the cardinality of values the attribute can take, is elegant and more general than what I thought. The prof says that attributes take "discrete" values. For the numeric case, we can deal with them by binning. But alas, what about "text" features? We can't use an attribute like "title of the book" for making a decision tree. There should be some way…. would need to check it out. 
  5. In the real example of decision tree which the prof showed (of some hospital) the leaf nodes were not pure and certain probabilities attached to them were shown. Hmm…. this is somewhat like pruning isn’t it? (I know the basics of machine learning…. so you can say I am cheating… but learning from such a great prof is a chance I cannot pass ;) )
  6. I had earlier mentioned that you would pick an attribute based on its efficacy at splitting your feature space well. I should rather have said that you would pick an attribute based on its efficacy at splitting the classes (to predict) well. As in, the splits should be as homogeneous as possible. This is the kind of thing entropy measures. 
  7. So while choosing the next attribute to split on, the information-gain calculations for all the attributes can be done in parallel. Thus there is an element of parallelization involved which can make the algorithm a bit faster.
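The entropy and information-gain computation hinted at above fits in a few lines (a sketch on made-up labels; the parent is perfectly mixed and the split perfectly pure, so the numbers come out exact):

```python
# Entropy H(S) = -sum_c p_c * log2(p_c) measures how mixed a set of class
# labels is; information gain of a split is the parent's entropy minus the
# weighted entropy of the children. The gain of each candidate attribute is
# independent work, which is where the parallelism comes from.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, children):
    n = len(parent_labels)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent_labels) - weighted

parent = ["yes", "yes", "no", "no"]          # perfectly mixed: entropy = 1 bit
pure_split = [["yes", "yes"], ["no", "no"]]  # each child is homogeneous

print(entropy(parent))                       # 1.0
print(information_gain(parent, pure_split))  # 1.0
```

A split that left the children just as mixed as the parent would score a gain of 0, which is exactly why the tree-growing algorithm prefers the attribute with the highest gain.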