Understanding Machine Learning

For a human, reading a CV/resume is an easy task. Resumes are usually divided into separate sections and laid out in ways that make it easy to quickly identify the important information.

In contrast, it is difficult for a computer to understand the context of a resume. It needs to be continuously trained and adapted to deal with different kinds of resumes.

While working on aircto Filters, we had to deal with the same kind of problem. Writing a regular expression every time we needed to pull some important information out of a resume was a complex and ambiguous process. That is when Machine Learning came to the rescue. By applying Machine Learning algorithms to Filters, we were able to successfully extract relevant information no matter how the resume was structured.

This led us to dig deeper into Machine Learning.

Spam or Ham?

Have you ever received a lot of emails from people who are either trying to sell you stuff or get you to sign up for things you are not really interested in? This can blow up really quickly and start cluttering your inbox. These emails are called “Spam Emails”, and they obscure the real emails that you actually want to see.

The relevant emails are therefore called “Ham Emails” and the irrelevant ones are called “Spam Emails”.

Wouldn’t it be awesome if your email service could automatically detect or filter out the irrelevant emails, the “Spam Emails”, so that you never have to see them? That’s exactly what Gmail, Outlook and other email services do. They filter out the spam emails even before they reach your inbox.

Now imagine you work at a very large email service, let’s say Gmail at Google, and you are given the task of solving the problem of Spam Detection.

The first thing you need to do is figure out a way to test whether the emails coming into the inboxes are SPAM or HAM. The basic idea is that you can monitor the attributes of the different incoming emails, and some of these attributes will tell you whether an email is a Spam email or a Ham email.

One way of doing this is to use a Rule-Based approach.

Rule-Based Approach

In the Rule-Based approach we define a set of rules. These are the checks we run before we allow an email into an inbox or mark it as spam.

These rules are based on intuition and logic. Let’s say that for our problem of spam detection we defined the following set of rules to identify whether an email is Spam or Ham:

  • Any email from certain IP addresses or email addresses is spam: a blacklist of known spam IP and email addresses.
  • Any email from a contact, or from a contact of a contact, is not spam: a whitelist of our contacts and their contacts.

These rules are based on the sender of the email. We can also check what exactly an email says from the Subject or the Content of the email.

  • Any email containing a certain set of words is spam: words like Sell, Offer, Lottery, etc.

These are just examples of the rules we can come up with; we can have far more complex rules as well.
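
As a rough illustration, a rule-based filter built from such rules might look like the sketch below (the addresses and keywords are made-up examples, not real data):

```python
# A minimal rule-based spam filter sketch. The blacklist, whitelist and
# keyword list below are hypothetical examples.
BLACKLIST = {"offers@cheap-deals.example", "10.0.0.99"}        # known spam senders / IPs
WHITELIST = {"friend@example.com", "colleague@example.com"}    # contacts and their contacts
SPAM_KEYWORDS = {"sell", "offer", "lottery"}

def is_spam(sender, subject, body):
    if sender in BLACKLIST:      # Rule 1: sender is on the blacklist
        return True
    if sender in WHITELIST:      # Rule 2: sender is a trusted contact
        return False
    words = (subject + " " + body).lower().split()
    return any(word in SPAM_KEYWORDS for word in words)  # Rule 3: spammy words

print(is_spam("stranger@example.net", "Big Offer", "win the lottery today"))  # True
```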

Now, the problem with the Rule-Based approach to a problem like spam detection is that the rules are rather static and change really slowly. Maintaining and updating blacklists and whitelists of emails and IP addresses is a very cumbersome process, while the behaviour patterns of spammers are dynamic: they change as soon as spammers figure out these rules.

The Rule-Based approach is mainly a static algorithmic approach which is missing out on the opportunity to improve itself based on the feedback that user actions provide.

An alternative to a Rule-Based approach might be to figure out patterns in the kind of emails that are explicitly marked as spam by the users. Now, when a new email comes in, we check whether the email matches with those patterns. If the patterns match, the email is a Spam email else it is a Ham email.

This is known as the Machine Learning based approach.

The main difference between the two approaches is the use of data, i.e. the emails that are explicitly marked by users.

Spammers mask their IP addresses, change email IDs or change what they are trying to sell; against these tactics, a Rule-Based approach will slowly but inevitably fall behind.

The Machine-Learning Based Approach

A Machine Learning based approach varies its algorithm based on the data.

Machine Learning can be used when the problem is dynamic in nature. Further, it changes its algorithm based on the feedback that it gets from the data.

A Corpus is a large collection of data. In our case, it is a large collection of emails explicitly marked by users as spam or ham.

Let’s solve our problem with a Machine Learning based approach. Spam detection is a classic classification problem, and, as usual with a Machine Learning based approach, we start with a large corpus of spam and ham emails.

From this corpus, we count, for each word, the number of times it appears in spam emails and in ham emails. From these counts we calculate a spamminess measure for each word.

S(T), the spamminess measure, is calculated for every word present in every email of the corpus before we apply the algorithm to a new input.

For example, let’s say a new message comes in consisting of words T1, T2, T3, ..., TN.

First, we will calculate the spamminess S(T) of each word T1, T2, ..., TN.

Second, we will find the total spamminess S[M] of the message, which is calculated by multiplying together the spamminess of each word.

Lastly, we will find the total hamminess H[M] of the message by multiplying together (1 - spamminess) of each word.

If S[M] > H[M], then the message is SPAM, else it is HAM.
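
Here is a minimal sketch of these steps in Python. The training emails are made up, and it assumes a common definition of S(T): the fraction of training emails containing the word T that are spam.

```python
# Tiny made-up corpus of labelled emails.
spam_emails = ["win lottery now", "cheap offer sell"]
ham_emails  = ["meeting at noon", "lunch offer today"]

def spamminess(word):
    # Assumed definition: S(T) = (spam emails containing T) / (all emails containing T).
    spam_count = sum(word in email.split() for email in spam_emails)
    ham_count  = sum(word in email.split() for email in ham_emails)
    total = spam_count + ham_count
    return spam_count / total if total else 0.5  # treat unseen words as neutral

def classify(message):
    spam_score, ham_score = 1.0, 1.0
    for word in message.split():
        s = spamminess(word)
        spam_score *= s          # total spamminess S[M]
        ham_score  *= (1 - s)    # total hamminess  H[M]
    return "SPAM" if spam_score > ham_score else "HAM"

print(classify("cheap lottery offer"))  # SPAM
print(classify("meeting at lunch"))     # HAM
```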

This type of machine learning technique, which explicitly relies on a corpus of labelled data, is called Supervised Learning. Also, the problem of deciding how some entity should be classified is the classic Machine Learning classification problem.

The method which we just saw is known as Naive Bayes Classifier.

The entities that we are seeking to classify are called problem instances. In our case, each incoming email is a problem instance. Problem instances are usually represented by a set of attributes.

The observable attributes of each problem instance are called Features (e.g. the words in an email, its Subject, and its Sender). So each problem instance is a vector of feature values (e.g. the list of words in an email).

The categories that we seek to classify the emails into are called Labels (Spam or Ham in our case).

Let’s go back to the Naive Bayes classifier. The word “naive” in the name is misleading; it simply means that this method assumes the feature values are independent of each other.

Mathematically speaking, it means that the probability distributions of the words are independent. For example, we didn’t look at combinations of words; we didn’t check whether the combination of any two words would increase or decrease the spamminess.

In Short, Naive Bayes Classifier is a supervised probabilistic Machine Learning based approach.

An Implementation of Naive Bayes Classifier in Python.

Link to GitHub repository: Spam or Ham ? implementation.

The repository contains all the code in a .ipynb file, along with some labelled spam and ham training emails. Open the "Naive Bayes Classifier.ipynb" file to see the code in action.

Now, let’s look at other ways of solving the spam detection problem.

We know that any point on a line can be represented by a single number, its distance from the origin. Likewise, any point on a plane can be represented by two numbers, its coordinates (x, y), which are its distances from the axes, and any point in three-dimensional space can be represented by three coordinates (x, y, z).

Similarly, a point in n-dimensional space can be represented with n coordinates. Just as a 2-dimensional region can be pictured as a square and a 3-dimensional region as a cube, an n-dimensional space can be pictured as an n-dimensional hyper cube; each point in the hyper cube is simply a vector of n numbers.

In the same way, any email message can be represented as a point in an n-dimensional hyper cube.

Suppose, we have a list that represents the entire universe of words that can appear in an email.

Any email would then contain some subset of this word list, so each message can be represented as a tuple of 1s and 0s: a 1 for every word in the list that the message contains, and a 0 for every word it does not.

For example,

Message 1: hello, this is a test

Message 2: how are you

Message 3: Hey, goodbye
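
Using a small illustrative vocabulary (the word list below is made up just for these three messages), the messages can be turned into 0/1 tuples like this:

```python
# Turn each message into a tuple of 1s and 0s over a small made-up vocabulary.
vocabulary = ["hello", "this", "is", "a", "test", "how", "are", "you", "hey", "goodbye"]

messages = ["hello, this is a test", "how are you", "Hey, goodbye"]

def to_tuple(message):
    words = set(message.lower().replace(",", "").split())
    return tuple(1 if word in words else 0 for word in vocabulary)

for message in messages:
    print(to_tuple(message))
# (1, 1, 1, 1, 1, 0, 0, 0, 0, 0)
# (0, 0, 0, 0, 0, 1, 1, 1, 0, 0)
# (0, 0, 0, 0, 0, 0, 0, 0, 1, 1)
```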

These tuples can now be represented as points in an n-dimensional hyper cube. We will do this for all the emails that we want to classify (the problem instances) as well as for the emails we already have information about (the training data, i.e. emails from the corpus).

Since these points are in an n-dimensional space, we can find the distance between them. The simplest way is to use the Euclidean distance formula.
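
For reference, this is the standard formula, written here for two points p and q in two dimensions and then in general:

$$ d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2} \qquad\text{and, in general,}\qquad d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} $$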

As shown, the formula extends easily to an n-dimensional space. Roughly speaking, the closer a problem instance is to the spam emails, the more likely it is to be a spam email; otherwise it is a ham email.

This method of finding the distances and checking the closest points is known as K-Nearest Neighbours.

As shown in the diagram above, the points inside the circle are the k nearest neighbours of our problem instance. The right value of k has to be decided. Since the diagram shows that our problem instance is much closer to ham emails, we can classify this problem instance as a ham email.
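
A minimal k-nearest-neighbours sketch (with made-up 2-dimensional training points and k = 3) might look like this:

```python
import math
from collections import Counter

# Made-up training points: (feature vector, label).
training_data = [
    ((1.0, 1.0), "spam"), ((1.2, 0.9), "spam"), ((0.9, 1.1), "spam"),
    ((0.1, 0.2), "ham"),  ((0.0, 0.1), "ham"),  ((0.2, 0.0), "ham"),
]

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(point, k=3):
    # Take the k training points closest to the new point and hold a majority vote.
    neighbours = sorted(training_data, key=lambda item: euclidean(point, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

print(knn_classify((0.15, 0.1)))  # ham
print(knn_classify((1.1, 1.0)))   # spam
```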

The method we used above is slightly simpler than what real-life systems use. Remember, we turned each message into a tuple of 1s and 0s. In real life, the feature vector will almost never be a plain tuple of 1s and 0s; instead, there would be a smarter algorithm that “hashes” parts of the email.

The reason to do this is to avoid the curse of dimensionality. For example, we considered a list of all words in the universe, which in practice means each vector would be impossibly long. The distance will likely not be a straight Euclidean distance either, but some more sophisticated measure that is found to work well with the training data.

The following are the complications we face when we use the K-Nearest Neighbours algorithm:

Dimensionality Reduction

Dimensionality is basically the length of the feature vector. So, if we are representing our problem instances in an n-dimensional hyper cube, then we need n coordinates to express each point.

As the dimensionality increases, both the efficiency and the efficacy of k-nearest neighbours suffer. Efficiency suffers because finding the point-wise distance between very high-dimensional vectors becomes computationally expensive.

Efficacy suffers because, in such a high-dimensional space, it can be hard to differentiate between points - they might all seem to be approximately equidistant, i.e. the chosen distance formula might fail to distinguish between neighbours.

For these reasons, dimensionality reduction is key to a k-nearest neighbours implementation, and some form of it is usually applied to the feature vectors before k-nearest neighbours is used.

Dimensionality reduction in this case is done using FEATURE REDUCTION. This is basically the process of creating a new, smaller set of features using techniques like PRINCIPAL COMPONENT ANALYSIS (PCA).
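
As a small sketch (assuming scikit-learn is available, and using random numbers in place of real feature vectors), PCA reduces the dimensionality like this:

```python
import numpy as np
from sklearn.decomposition import PCA

# 100 made-up feature vectors of dimension 1000, standing in for email features.
X = np.random.rand(100, 1000)

# Project them down onto their 10 principal components.
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)  # (100, 1000) -> (100, 10)
```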

Another smart dimensionality reduction trick is hashing the features using a hash function that maps similar items to similar buckets, for example Locality Sensitive Hashing, which treats points that are near each other as similar.
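
A related, simpler idea is the “hashing trick”: map every word to one of a fixed number of buckets so the feature vector has a bounded length. (Full locality-sensitive hashing is more involved; the sketch below only illustrates hashing features into buckets.)

```python
import zlib

NUM_BUCKETS = 16  # fixed feature-vector length, no matter how many distinct words exist

def hashed_features(message):
    # Each word is hashed (with a stable hash) into one of NUM_BUCKETS slots.
    vector = [0] * NUM_BUCKETS
    for word in message.lower().split():
        vector[zlib.crc32(word.encode("utf-8")) % NUM_BUCKETS] += 1
    return vector

print(hashed_features("win the lottery now and claim your free offer"))
```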

Definition of Distance

Euclidean distance is the most common definition of distance used in geometry. In two and three dimensional spaces this is the physical distance between the two points. This can also be extended to n-dimensions.

But this is not always a suitable definition of distance for k-nearest neighbours implementation.

Usually, the k-nearest neighbours algorithm uses weights: it is often modified to down-weight the importance of far points relative to nearby points.

Actually, Euclidean distance only works when the features are continuous variables (rather than discrete ones).

In the case of discrete variables, we can use a distance measure like Edit Distance or Hamming Distance.

Edit Distance: the number of edits required to go from one string to the other, i.e. the minimum number of changes we need to make to turn one string into the other.
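
As a small sketch, Hamming distance (for sequences of equal length) and edit distance can be computed like this:

```python
def hamming_distance(a, b):
    # Number of positions at which two equal-length sequences differ.
    return sum(x != y for x, y in zip(a, b))

def edit_distance(a, b):
    # Classic dynamic-programming (Levenshtein) edit distance: the minimum number
    # of insertions, deletions and substitutions needed to turn a into b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                  # deletion
                               current[j - 1] + 1,               # insertion
                               previous[j - 1] + (ca != cb)))    # substitution
        previous = current
    return previous[-1]

print(hamming_distance((1, 0, 1, 1), (1, 1, 1, 0)))  # 2
print(edit_distance("kitten", "sitting"))            # 3
```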

The Choice of K

The process of choosing the right value of k is called parameter selection. An important special case is k = 1, called the Nearest Neighbour algorithm.

Choosing the value of k is sometimes tricky. If we choose a large value of k, we reduce the effect of outliers in the data.

Points that lie rather far away from their cluster are more likely to be misclassified than to be correctly classified.

But a large value of k also increases the chance that we pull in a large number of instances from the other category.
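
One common way to pick k (assuming scikit-learn, and using random stand-in data) is to try several values with cross-validation and keep the best one:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Random stand-in data: 200 feature vectors with 0/1 labels.
X = np.random.rand(200, 20)
y = np.random.randint(0, 2, size=200)

# Try several values of k and keep the one with the best cross-validated accuracy.
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7, 9]}, cv=5)
search.fit(X, y)
print(search.best_params_)  # e.g. {'n_neighbors': 7}
```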

Data Reduction

It involves trimming the number of data points wherever possible.

Getting rid of the right set of points will improve both the computational efficiency and the efficacy of k-NN. A common data reduction technique is to divide the training data into 3 categories:

  • Prototypes: represent the training data - they are selected by a prototype selection algorithm.

  • Class Outliers: are points in the training data that are not correctly classified using the prototypes.

  • Absorbed points: are the points in the training data that are correctly classified by the prototypes.

The data reduction process then retains all the prototypes, discards all the absorbed points, and selectively keeps the class outliers.

Use in Prediction

Techniques such as Support Vector Machines and Artificial Neural Networks build on the same core idea as K-Nearest Neighbours: representing instances as points in a feature space.

Our discussion so far has been focused on the use of k-nearest neighbours for classification.

Given a problem instance (the point to be classified), we check which category the majority of its neighbours belong to, and assign that category to the problem instance.

We can just as easily use the method to "predict" the value of any function for a problem instance.

Simply take the value of the function for each of the k nearest neighbours, and use their average as our prediction.
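
Reusing the distance function from the earlier sketch (and attaching a made-up numeric value to each training point), prediction simply averages the values of the k nearest neighbours:

```python
import math

# Made-up training points, each with a numeric value attached.
training_data = [
    ((1.0, 1.0), 9.0), ((1.2, 0.9), 8.5), ((0.9, 1.1), 9.5),
    ((0.1, 0.2), 1.0), ((0.0, 0.1), 0.5), ((0.2, 0.0), 1.5),
]

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(point, k=3):
    # Average the values of the k nearest training points.
    neighbours = sorted(training_data, key=lambda item: euclidean(point, item[0]))[:k]
    return sum(value for _, value in neighbours) / k

print(knn_predict((0.1, 0.1)))  # 1.0
```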

Like the Naive Bayes classifier, the K-Nearest Neighbours classifier is a supervised learning method, but it does not depend on the probability distribution of each word. For this reason, it is said to be a non-probabilistic and non-parametric classifier. In contrast, the Naive Bayes classifier does make assumptions about the probability distributions, and is therefore said to be a probabilistic classifier.

Another way of solving the spam detection problem is with the help of Support Vector Machines.

Support Vector Machines

Like K-Nearest Neighbours, Support Vector Machines (SVMs) are also classifiers, and they also represent emails as points in an n-dimensional hyper cube.

A Support Vector Machine is used to build binary classifiers. This means that, given a set of points, a support vector machine will classify those points into two categories (e.g. Spam or Ham).

In addition, Support Vector Machines make their classification decision on the basis of a “Linear Function” of the point’s co-ordinates.

If a point is X = (x1, x2, ..., xn), a linear function will be something like this:

f(X) = w1·x1 + w2·x2 + ... + wn·xn + b

The Support Vector Machine will run a test like:

if f(X) > 0, then the email is spam, else ham.

The set of points where this linear function equals zero forms a hyper-plane in the n-dimensional hyper cube space.

Also, the support vector machines do not involve explicit assumptions about the probability distributions of the points.

Lastly, SVM involves a Training Stage where the model learns from a set of training data.

A support vector machine is a supervised machine learning approach used to build linear non-probabilistic binary classifiers.

Supervised because it takes the help of training data. Linear because it involves a linear function. Non-probabilistic because it does not make assumptions about the probability distribution, and binary because it only works when there are two categories into which we can separate the data.

The support vector machine finds a hyper-plane that neatly separates points of the two categories. This hyper-plane can then be used to classify any new points that need classification.

The hyper-plane shown in the figure separates spam and ham emails, i.e. everything on one side is spam and everything on the other side is ham. Basically, the hyper-plane acts as a boundary. Now, when a new problem instance comes in, if it is on the spam side of the hyper-plane it is spam, otherwise it is ham.
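
As a small sketch with scikit-learn (using the same kind of made-up 2-dimensional points as before), a linear SVM learns the separating hyper-plane from the training points and classifies new points by which side of it they fall on:

```python
from sklearn.svm import SVC

# Made-up 2-dimensional feature vectors for spam and ham training emails.
X = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],   # spam
     [0.1, 0.2], [0.0, 0.1], [0.2, 0.0]]   # ham
y = ["spam", "spam", "spam", "ham", "ham", "ham"]

# A linear SVM finds the maximum margin hyper-plane (just a line in 2D).
model = SVC(kernel="linear")
model.fit(X, y)

print(model.predict([[1.1, 1.0], [0.1, 0.1]]))  # ['spam' 'ham']
```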

Let's think a bit about how the SVM actually constructs this hyper-plane.

Hyper-plane: in an n-dimensional vector space, a hyper-plane is a geometric shape, a set of points that spans n-1 dimensions and has zero thickness in the remaining dimension.

The equation which describes the set of points on a hyper plane is always "linear".

This is the equation of a hyper-plane in 3D space:

ax + by + cz + d = 0

All the points of the plane will satisfy this equation.

All points on one side of the plane will satisfy the condition ax + by + cz + d > 0, and all points on the other side of the plane will satisfy the condition ax + by + cz + d < 0.

This is the linear equation that the SVM uses to classify points, which is why the SVM is a linear classification technique.

The distance of a point (x0, y0, z0) from the plane can be calculated with the equation below:

|ax0 + by0 + cz0 + d| / √(a² + b² + c²)

Let's see how the SVM finds the "best" hyper-plane to separate the two sets of points.

The best hyper-plane is the one that maximises the sum of the distances of the nearest points on either side, while making sure that all points of one type are on one side of the plane and all points of the other type are on the other side.

This is clearly an optimisation problem. The constraint is to make sure that the two clusters are still separated by the hyper-plane.

The solution to this optimisation problem is the Maximum Margin Hyper-plane.

The Soft Margin method: when the points cannot be separated perfectly, it finds a hyper-plane that does "as clean a separation as possible."

This method also allows us to measure the degree of misclassification of the data.

Normally, SVMs are linear classifiers. But there is actually a way to use an SVM to perform non-linear classification, using something called THE KERNEL TRICK.

The kernel trick is a way for an SVM to work in high-dimensional spaces; it helps us get around the curse of dimensionality.

The problem with the linear classification setup is the need to calculate dot products of vectors with a huge number of elements. A kernel is a (generally non-linear) function that replaces the dot product of two vectors. It basically allows the classification to occur without having to explicitly do element-wise operations on the feature vectors.

The advantage of the Kernel function is that we can compute the dot product in that higher dimensional space without actually computing the coordinates of these points in the higher dimensional space.

Note that the kernel functions operate in a transformed feature space, which can be of far higher dimensionality than the original feature space. Even though the dimensionality in which they operate is higher, they will find the maximum margin hyper-plane more easily because of the kernel trick.

The maximum margin hyper-plane is in the transformed feature space and is linear in that space.

Thus the kernel trick also allows a way to solve problems where the data is not linearly separable - by projecting such data into higher dimensional space.
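
As a sketch, the widely used RBF kernel computes a similarity that corresponds to a dot product in an implicit, much higher-dimensional feature space without ever constructing that space, and scikit-learn's SVC can use it directly (the points below are made up so that they are not linearly separable):

```python
import math
from sklearn.svm import SVC

def rbf_kernel(x, y, gamma=1.0):
    # Equivalent to a dot product in an implicit higher-dimensional space,
    # computed using only the original coordinates of x and y.
    squared_distance = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * squared_distance)

print(rbf_kernel([1.0, 0.0], [0.0, 1.0]))  # similarity between two points

# An inner cluster surrounded by an outer ring is not linearly separable,
# but an RBF-kernel SVM separates the two groups in the implicit feature space.
X = [[0.0, 0.1], [0.1, 0.0], [-0.1, 0.0], [0.0, -0.1],   # inner cluster (label 0)
     [2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.0, -2.0]]   # outer ring    (label 1)
y = [0, 0, 0, 0, 1, 1, 1, 1]

model = SVC(kernel="rbf", gamma=1.0)
model.fit(X, y)
print(model.predict([[0.05, 0.05], [1.9, 0.1]]))  # [0 1]
```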

Combining the Naive Bayes Classifier, K-Nearest Neighbours and Support Vector Machines, we can design a classification system that can easily identify whether an email is spam or ham.