Tuesday, 29 January 2019

R vs Python vs SQL

In this exercise, I decided to test R against Python and Oracle Advanced Analytics (OAA).

This is a classification problem: these problems are about assigning an object to a group. In our case we have 2 classes, those who survived (1) and those who didn't (0). Classification algorithms assign a class based on a predicted probability.
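To make the "class from a probability" idea concrete, here is a minimal sketch. The helper name assign_class, the 0.5 threshold and the probabilities are all made up for illustration; it assumes Kaggle's coding, where 1 means survived and 0 means did not.

```python
# Hypothetical helper: turn a predicted survival probability into a class label.
def assign_class(p_survived, threshold=0.5):
    """Return 1 (survived) if the probability reaches the threshold, else 0."""
    return 1 if p_survived >= threshold else 0

# Made-up probabilities, for illustration only (not model output).
probabilities = [0.91, 0.35, 0.50, 0.08]
labels = [assign_class(p) for p in probabilities]
print(labels)
```

A decision tree does much the same thing at each leaf: it looks at the share of training passengers in that leaf who survived and predicts the majority class.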

On the Kaggle website there is a starter problem, Titanic (link). The objective of the challenge is to use a training data set to predict who would and who would not survive the Titanic. Kaggle provides a training data set and a test data set. Once you've made your predictions for the test data set, you upload them to Kaggle to obtain a test score.
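For reference, the file Kaggle expects for Titanic has two columns, PassengerId and Survived. The sketch below writes that layout with the standard csv module; the function name write_submission and the three predictions are hypothetical, used only to show the shape of the file.

```python
import csv
import io

def write_submission(predictions, out):
    """Write (PassengerId, Survived) pairs in the two-column submission layout."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    for passenger_id, survived in predictions:
        writer.writerow([passenger_id, int(survived)])

# Made-up predictions, for illustration only.
example = [(892, 0), (893, 1), (894, 0)]
buffer = io.StringIO()  # swap for open("submission.csv", "w", newline="") to write a real file
write_submission(example, buffer)
print(buffer.getvalue())
```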

So I thought, I wonder which would perform best: R, Python or Oracle Advanced Analytics? To make the test fair I decided to implement an out-of-the-box decision tree in all 3:

R Code

The R code took approximately 25 mins to write, from a little cleaning of the data through to producing the file for submission.

The final prediction score was 76.55

Python Code
In a similar style, using a Jupyter notebook, I did the same task: prediction with a decision tree, specifically the one in the scikit-learn (sklearn) package. In total, from reading the file in to producing the submission file, it took approximately 20 mins to complete.


The final prediction score was 71.29
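The notebook itself isn't reproduced here, so what follows is only a rough sketch of what an out-of-the-box scikit-learn decision tree for this task can look like, not the code that produced the score above. The tiny data frames are made up (they only borrow Kaggle's column names), and the prepare helper, the choice of features and the median fill for missing ages are my assumptions.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up sample using Kaggle's Titanic column names; not real passenger data.
# With the real files you would use pd.read_csv("train.csv") and pd.read_csv("test.csv").
train = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2, 3, 1, 2],
    "Sex": ["male", "female", "female", "female", "male", "male", "male", "female"],
    "Age": [22.0, 38.0, 26.0, 35.0, None, 54.0, 40.0, 14.0],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1],
})
test = pd.DataFrame({
    "PassengerId": [892, 893],
    "Pclass": [3, 1],
    "Sex": ["male", "female"],
    "Age": [34.5, None],
})

def prepare(df, age_fill):
    """Hypothetical cleaning step: encode Sex as 0/1 and fill missing ages."""
    return pd.DataFrame({
        "Pclass": df["Pclass"],
        "IsFemale": (df["Sex"] == "female").astype(int),
        "Age": df["Age"].fillna(age_fill),
    })

# Use the training median for both sets so the test data does not leak into cleaning.
age_fill = train["Age"].median()

# Default settings apart from a fixed seed, i.e. an "out of the box" tree.
model = DecisionTreeClassifier(random_state=0)
model.fit(prepare(train, age_fill), train["Survived"])

submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": model.predict(prepare(test, age_fill)),
})
print(submission.to_csv(index=False))
```

With the real data, submission.to_csv("submission.csv", index=False) would produce the file to upload.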

Finally Oracle Advanced Analytics
I used the Oracle prebuilt VM to give me access to Advanced Analytics, loaded the data into a database table and created an OAA model, again using the decision tree approach. Implementation time from loading the data to producing the file was approximately 5 mins.



The prediction score was 76.55

Based on this, it would seem that, without any major tuning, OAA performs as well as its open source counterparts. In addition, it was quicker and, some would say, easier. In my next blog we'll explore what accuracy is and how to investigate it with OAA.


