Small Lab, Big Data: September 2018

Friday, 28 September 2018

Data Lab vs Warehouse

https://en.wikipedia.org/wiki/Data_warehouse

According to Wikipedia a Data Warehouse is defined as:

“... are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise.”

“The main source of the data is cleansed, transformed, catalogued, and made available for use by managers and other business professionals for data mining, online analytical processing, market research and decision support. However, the means to retrieve and analyze data, to extract, transform, and load data, and to manage the data dictionary are also considered essential components of a data warehousing system.”

and a Data Lake as:

“...a system or repository of data stored in its natural format, usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).”

In a nutshell these differences are:

Data Warehouse	Data Lake
Only retains the data required, reducing noise from the users	Retains all Data
Arranges data into standardised structures	Can handle data of any variety / structure
Consolidates data from a variety of sources into ‘One Source of the Truth’	Supports a variety of users but they may require specific skills
Supports decision making and can be used by a range of user abilities	Ability to adapt quickly to new data types
Stringent row/column security can be put in place quickly.	Enables quicker data analysis

From a data science perspective which is better… well, that depends. A Data Warehouse has its advantages: the structure and reproducibility allow quick, consistent analysis to be performed. A Data Lake, on the other hand, may be more difficult to work with, and not always reproducible, but it gives you the more granular and wide-ranging data needed to find the really interesting analysis.
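To make the trade-off concrete, here is a minimal Python sketch. The event records, the `sales` table and its columns are all made up for illustration, and an in-memory SQLite database stands in for a warehouse. The lake keeps every raw record whatever its shape; the warehouse loads only the agreed columns into a fixed structure.

```python
import json
import sqlite3

# Hypothetical raw events as they might land in a lake: the shape differs per record.
raw_events = [
    '{"customer": "A", "amount": 10.5, "channel": "web"}',
    '{"customer": "B", "amount": 4.0}',
    '{"customer": "A", "note": "called support", "tags": ["complaint"]}',
]

# Lake style: retain everything and decide on the structure when reading.
lake = [json.loads(line) for line in raw_events]
notes = [event["note"] for event in lake if "note" in event]

# Warehouse style: a fixed schema, so only the required columns are loaded.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT NOT NULL, amount REAL NOT NULL)")
rows = [(e["customer"], e["amount"]) for e in lake if "amount" in e]
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# The same query against the same structure gives a consistent answer.
totals = dict(conn.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
).fetchall())

print(len(lake), "records in the lake;", len(rows), "rows in the warehouse")
print("notes only the lake kept:", notes)
print("warehouse totals:", totals)
```

The support note never reaches the warehouse table, which is the "reducing noise" side of the comparison. It is still there in the lake for anyone with the skills to dig it out.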

So my answer: a Data Warehouse where I can bring my own data. Maybe something like a Data Ware-Lake?

Tuesday, 25 September 2018

Oracle Pre-Tools and Coding Challenges

On a daily basis I'm working within an Oracle database environment. Much of that work includes Data Miner, R, Python and Oracle OBIEE.

Many will be wondering what they are and how you can use them. Thankfully there is a set of pre-built Virtual Machines (VMs) that enable you to simulate these different environments.

They are an excellent resource for investigating these tools. Personally, I think they're brilliant.

Oracle VMs: http://www.oracle.com/technetwork/community/developer-vm/index.html

I will be using these in the future to demonstrate some of the great features within Oracle that we've been using in the data lab.

For the demonstrations I’m going to be using open data, mainly so users can follow along if they want, or even share their own analysis. These pre-built environments are also a great way to try out new skills. Personally I’ve never got into APEX or Docker, but through VMs as well as other online resources I’ve started to expand my knowledge in these areas.

The other thing I try to encourage people to do is have a go at coding. Why not try freeCodeCamp (@freeCodeCamp) or the 100 Days of Code challenge (#100DaysOfCode)? Both are great ways to get into coding with a supportive community around you.

Personal favourite languages:

R
Python
SQL

All of which have their own pros and cons. From the R world I’m a huge fan of flexdashboard; in Python, the ability to do deep learning through Edward/TensorFlow. SQL is my go-to for large data sets and data engineering tasks.

Give it a go: learn a new programming language, have a date night that is programming orientated, develop a new website, whatever it is that you want to push yourself in.

Saturday, 22 September 2018

Prescriptive Analytics - the future?

Prescriptive analytics is the third and final phase of business analytics.

The first two stages of business analytics are:

Descriptive Analytics: Analytics that quantitatively describe or summarize features within a data set. The data set could be a combination of data sets, for example house sale prices combined with ONS statistics on employment rates. We would then be able to describe which areas of the country have high house prices versus their employment rate etc.

Predictive Analytics: Uses a variety of techniques from predictive modelling, machine learning, and data mining to analyse current states and historical facts to make predictions about future or otherwise unknown events. For example, using our descriptive analytics above on house prices and employment rate, we may be able to predict the sale price at which a house should be put on the market, based on what other houses sold for, the size of the house, income in the area, employment rates etc.
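As a rough sketch of the difference between these two stages, the Python below first describes a handful of house sales and then fits a simple straight line to predict a price from floor area. The figures are invented purely for illustration, and a single-variable least squares fit is my stand-in for whatever model you would really use.

```python
from statistics import mean, median

# Illustrative figures only: (floor area in square metres, sale price in GBP thousands).
sales = [(50, 150), (70, 200), (90, 250), (110, 300), (130, 350)]
areas = [area for area, _ in sales]
prices = [price for _, price in sales]

# Descriptive: summarise what has already happened.
summary = {
    "mean_price": mean(prices),
    "median_price": median(prices),
    "min_price": min(prices),
    "max_price": max(prices),
}

# Predictive: fit price = intercept + slope * area by ordinary least squares.
area_bar, price_bar = mean(areas), mean(prices)
slope = (sum((a - area_bar) * (p - price_bar) for a, p in sales)
         / sum((a - area_bar) ** 2 for a in areas))
intercept = price_bar - slope * area_bar


def predict_price(area):
    """Estimated sale price (GBP thousands) for a house of the given floor area."""
    return intercept + slope * area


print(summary)
print(round(predict_price(100), 1))
```

A real model would bring in the other factors mentioned above, such as income in the area and employment rates, rather than floor area alone.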

Prescriptive Analytics is sometimes referred to as the “final frontier of analytic capabilities”. It entails the application of mathematical and computational sciences, and suggests decision options to take advantage of the results of descriptive and predictive analytics.

Prescriptive Analytics goes beyond predictive analytics by showing potential actions to benefit from the predictions and the implications of each decision option. Often it will go one step further: it will not just predict what will happen or when it will happen, but also why it will happen.

An example of where prescriptive analytics is utilised is energy pricing. Prices depend on supply, demand, the economy, geopolitics, weather conditions etc. An energy company can use prescriptive analytics to take in external data sets alongside its own, predict prices by modelling the internal and external variables at the same time, and then provide decision options and show the consequences of each decision.
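The idea of "decision options and their consequences" can be sketched in a few lines of Python. Everything here is hypothetical: the linear demand model, the `weather_factor`, the unit cost and the candidate prices are assumptions made up for the example, not how any energy company actually prices.

```python
def predicted_demand(price_per_unit, weather_factor):
    """Hypothetical predictive model: demand falls linearly as the price rises."""
    base = 1000 - 40 * price_per_unit
    return max(base * weather_factor, 0)


def evaluate(price_per_unit, weather_factor, cost_per_unit=5):
    """Consequences of one decision option: predicted demand and profit."""
    demand = predicted_demand(price_per_unit, weather_factor)
    profit = (price_per_unit - cost_per_unit) * demand
    return {"price": price_per_unit, "demand": demand, "profit": profit}


def recommend(options, weather_factor):
    """Score every option and return them best-first, so the trade-offs stay visible."""
    outcomes = [evaluate(price, weather_factor) for price in options]
    return sorted(outcomes, key=lambda outcome: outcome["profit"], reverse=True)


# An external variable (a cold spell, say) scales demand up by 20%.
ranked = recommend(options=[8, 10, 12, 15, 18], weather_factor=1.2)
for outcome in ranked:
    print(outcome)
best = ranked[0]
```

The predictive part is the demand function; the prescriptive part is running each option through it and ranking the outcomes, so the decision maker sees what each choice would lead to rather than a single forecast.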

Wednesday, 19 September 2018

Operating Models for Teams

CRISP-DM is shorthand for Cross Industry Standard Process for Data Mining. The model is used by many different areas of business and describes an approach that can be used to tackle data problems.

In the day-to-day work that I'm involved with we are often approached with ideas or problems from within the business. This is key: we're placing the business at the centre of the problem, and thus placing Value at the centre of the process. As the problem is defined we're also gathering business understanding whilst understanding the data we have, assessing the Variety of data and any enrichment that could be applied. This understanding should then lead to preparing the data for modelling, starting to understand the Volume of data that will be used and the impact that it will have. The modelling allows us to start tackling the problem. At this stage it might be necessary to assess the Veracity of the data, which can be a key point in assessing the problem and the implementation of the solution in the wider business.

Finally we evaluate the potential solution to the problem and look to deploy it into the business, ultimately returning to the start: the Value of the solution.

The experience I have had with this modelling approach has been positive. It allows the business to remain central to the process and ultimately focused on the Value returned to the business by going through this process.

Once you’ve got your model, these days many teams work within the Agile framework. At my current workplace we’ve implemented Jira for tracking these stages. This is usually done through Kanban boards, with more defined tasks leading to the overall story.

IMHO this process will allow Big Data to yield great results for your business. Just ensure the business is central to the work you conduct!

Monday, 3 September 2018

Big Data, what’s all the Fuss

Big Data is often defined as data so large or complex that traditional methods are no longer acceptable or useful. There are many challenges to face, such as collecting, cleaning, supplementing or enriching data, as well as analysing and extracting the data. This is alongside the often-quoted figure that 90% of the data in the world today has been generated in the last two years! Currently it's estimated that we output 2.5 quintillion bytes of data per day, through electronic devices, phones etc.

When people describe big data they often define it through a number of V’s. The number of V’s is variable (no pun intended) but I usually define big data through the 5 V's: Volume, Velocity, Variety, Veracity, Value.

Volume: Often relates to the bytes of data that are collected and stored every day from many sources. In my day job we often have over 7 billion rows of data in a basic query, but sometimes it can go beyond that.

Velocity: The speed at which data changes or is generated is enormous. Think of how many videos or pictures you've seen on Twitter or Facebook that have gone viral in hours. Big data should be able to analyse the data as it's being produced.

Variety: Defines the different data sets that are now available each day: postcode data, census information, health data, finance data and on and on. The variety provides an opportunity to enrich and provide background but can also muddy the waters.

Veracity: Concerns how trustworthy and accurate the data is. Traditionally data came from databases as structured data. Today the world is built upon both structured and unstructured data, such as websites, Twitter etc., which makes its quality harder to assess.

Value: The one for me that should be top of the list! You may have lots of data and lots of cool algorithms but the main question is, what is the value in the data? Can you help a patient's experience? Can you model a system improvement that will save your company money? Here is the value of the data.

Personally I view big data as a process of knowing more about the area of interest than we did yesterday, be that through adding a V to what we know today.

The final piece of the puzzle is how you use that data within your business: is it going to inform a decision, or form a data product? What you are aiming to use your data for may help you recognise the V that you will be most reliant on going forward.
