Once we have predicted the class, we can measure how accurately the model has classified the data. There are several ways to do this:
Confusion Matrix
For a binary classification problem the table has 2 rows and 2 columns. Across the top are the observed class labels and down the side are the predicted class labels. Each cell contains the number of predictions made by the classifier that fall into that cell.
| | Observed Survived | Observed Deceased |
| --- | --- | --- |
| Predicted Survived | 10 | 2 |
| Predicted Deceased | 4 | 15 |
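As a rough sketch of how the table relates to overall accuracy (assuming, for illustration, that Survived is the positive class), the cells map onto true/false positives and negatives and accuracy follows directly:

```python
# Counts taken from the example confusion matrix above,
# treating Survived as the positive class (an assumption for illustration).
tp = 10  # predicted Survived, observed Survived
fp = 2   # predicted Survived, observed Deceased
fn = 4   # predicted Deceased, observed Survived
tn = 15  # predicted Deceased, observed Deceased

# Accuracy is the proportion of predictions that land on the diagonal.
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(f"Accuracy: {accuracy:.3f}")  # 25 / 31 ≈ 0.806
```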
Accuracy Paradox
Sometimes it may be desirable to select a model with lower accuracy because it has greater predictive power for the problem.
For example, in a problem with a large class imbalance, a model can predict the majority class for every observation and still achieve a high classification accuracy; the problem is that such a model is not useful in the problem domain. Suppose fraud occurs in fewer than 3% of transactions. We could simply declare every transaction to be fine, and the overall accuracy would suggest our predictions were 97% accurate; look under the hood, however, and within the fraud class we would be 100% incorrect.
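A minimal sketch of the paradox, using made-up numbers (970 legitimate transactions and 30 fraudulent ones) and a naive model that always predicts the majority class:

```python
# Hypothetical imbalanced fraud data: 1 = fraud, 0 = ok.
actual = [0] * 970 + [1] * 30
predicted = [0] * 1000  # naive model: every transaction is predicted as ok

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
fraud_caught = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))

print(f"Overall accuracy: {accuracy:.0%}")    # 97% - looks impressive
print(f"Fraud cases caught: {fraud_caught}")  # 0 - useless in the problem domain
```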
Precision
Precision is the number of True Positives divided by the number of True Positives and False Positives. Put another way, it is the number of positive predictions divided by the total number of positive class values predicted. It is also called the Positive Predictive Value (PPV).
Precision can be thought of as a measure of a classifier's exactness. A low precision indicates a large number of False Positives.
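Using the counts from the example confusion matrix above (still assuming Survived is the positive class), precision works out as:

```python
tp, fp = 10, 2                  # from the confusion matrix above
precision = tp / (tp + fp)      # True Positives / (True Positives + False Positives)
print(f"Precision: {precision:.2f}")  # 10 / 12 ≈ 0.83
```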
Recall
Recall is the number of True Positives divided by the number of True Positives and the number of False Negatives. Put another way, it is the number of positive predictions divided by the number of positive class values in the test data. It is also called Sensitivity or the True Positive Rate.
Recall can be thought of as a measure of a classifier's completeness. A low recall indicates many False Negatives.
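And recall, for the same example matrix:

```python
tp, fn = 10, 4                  # from the confusion matrix above
recall = tp / (tp + fn)         # True Positives / (True Positives + False Negatives)
print(f"Recall: {recall:.2f}")  # 10 / 14 ≈ 0.71
```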
What is the AUC-ROC Curve?
The AUC (Area Under the Curve) - ROC (Receiver Operating Characteristic) curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability: it tells us how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. In our example, the higher the AUC, the better the model is at distinguishing between those who survived and those who didn't.
The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR), with TPR on the y-axis and FPR on the x-axis.
An excellent model has an AUC near 1, which means it has a good measure of separability. A poor model has an AUC near 0, which means it has the worst measure of separability; in fact, it is reversing the result, predicting 0s as 1s and 1s as 0s. When the AUC is 0.5, the model has no class separation capacity whatsoever.
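As a minimal sketch of how the curve and its area can be computed outside Oracle (assuming scikit-learn is available; the labels and predicted probabilities below are invented purely for illustration):

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical observed labels (1 = survived) and predicted survival probabilities.
observed = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]
probs    = [0.9, 0.8, 0.65, 0.6, 0.4, 0.7, 0.2, 0.3, 0.55, 0.45]

# fpr and tpr are the points that would be plotted for the ROC curve
# (FPR on the x-axis, TPR on the y-axis); each point corresponds to a threshold.
fpr, tpr, thresholds = roc_curve(observed, probs)
auc = roc_auc_score(observed, probs)

print(f"AUC: {auc:.2f}")  # closer to 1 means better separation of the classes
```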
Accuracy Built In
Thankfully, Oracle Advanced Analytics has us covered. When applying machine learning models to data, it provides an option to monitor the accuracy, precision and recall of your models.
Accuracy can be shown through the model performance, which lets us see the True-False Matrix (0 and 1 in this case) and the accuracy within each class.
It's possible to go one step further and investigate the AUC of the model as well.
If, as in my case, you've trained a number of models, don't panic: you can make an informed model selection at the click of a button, comparing the models' accuracy and AUC in just two clicks.