Tuesday 9 October 2018

Journey to the cloud

So today we'll discuss our migration to the cloud. A journey that, on the surface or at least in the sales pitch, sounds picture perfect: a swift migration to the cloud and all your problems are solved. Well, maybe not....

Our migration started like most others: in our case an on-premise Exadata, working well but not offering the strategic future that we wanted. Cloud, on the other hand, offers the Exadata, the software as a service, as well as the possibility of bursting CPUs and/or storage when needed rather than always having them.

In our instance we were not only moving Exadata to the cloud but also moving from OBIEE on-premise to the Oracle BI Cloud Service (BICS) software as a service. Our first problem was that we were initially on OBIEE 11.7, a version that you can't just migrate straight to the cloud. So after many conversations we upgraded and, rightly or wrongly, went to the newest version available for us to download, OBIEE 12c. This was our preparation work for the cloud.

Now time to get our hands on that shiny new cloud.... well sort of....
Firstly, if you're migrating data to the cloud, you'll need to decide how you're going to transfer it. In our case we didn't have a VPN installed, so the corporate network stopped us from accessing the Oracle cloud directly. Next, our data wasn't big enough to qualify for the Oracle data service. Like all good problem solvers, we managed to use the tools at our disposal to get the end result.

We transferred the source data (transactional data) to the on-premise X4, a total size of 90GB. We then encrypted it and used HCC (Hybrid Columnar Compression) in archive low mode to compress the data down to an impressive 9GB. Now we're cooking on gas, as they say: the file was small enough to upload to the cloud over the internet (with appropriate security in place). Repeat a number of times and eventually you get all the data to the cloud - not ideal, but at least in our control.
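
For anyone curious what that compression step looks like, here is a minimal sketch (not our actual script) of rebuilding a staging table with HCC archive compression from Python, assuming the python-oracledb driver and a hypothetical SALES_STAGE table; the table name, credentials and chosen compression level are all placeholders.

    # Sketch only: compress a hypothetical staging table with HCC before export.
    import oracledb  # python-oracledb driver

    # Placeholder connection details - substitute your own.
    conn = oracledb.connect(user="etl_user", password="***", dsn="x4-scan/stagepdb")

    with conn.cursor() as cur:
        # Rebuild the table using Hybrid Columnar Compression (archive low).
        cur.execute("ALTER TABLE sales_stage MOVE COMPRESS FOR ARCHIVE LOW")
        # A MOVE leaves existing indexes unusable, so rebuild them afterwards.
        cur.execute("ALTER INDEX sales_stage_pk REBUILD")

    conn.close()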

Once the data was there, the next problem was moving the RPD and catalog from OBIEE to BICS. Having done the prep work, we genuinely thought this part would be easy. Oh, how wrong I was....

Oracle often maintains a cloud-first development approach (see this link for details), however this wasn't strictly true for BICS. It turned out that the OBIEE on-premise version was ahead of the BICS version. Unfortunately there isn't a way to "downgrade" an OBIEE VM; once you've upgraded, you're stuck. For the RPD, however, there is a way to downgrade, and there is a good guide to this here.

Anyway, on the 5th of June the BICS environment was finally upgraded, enabling the catalog migration from on-premise to cloud to happen; in our case this caused a four-week delay to the project.

All in all we've managed to migrate to the cloud, but it wasn't the easy journey you've heard about. Now that we're here there are still some snags, but more details on those in my next post.

Tuesday 2 October 2018

NLP – Just another acronym?

Data within a business is ever evolving and is being generated as we speak, as we tweet, as we send emails and in various other activities. Approximately 80% of corporate information is available in textual data formats [1]. Broadly, this data falls into two categories:

  • Structured: highly organised datasets which are easily searchable, such as a database or Excel sheet
  • Unstructured: data which doesn’t have a pre-defined model and doesn’t fit nicely into a database. Examples include PDFs, emails, phone conversations and Tweets

Natural Language Processing (NLP) helps us understand text and produce insights from text data. It is a branch of data science that consists of systematic processes for analysing, understanding and deriving information from text data in a smart and efficient manner.

Text Pre-processing

Unstructured data is typically very messy or “noisy”, so it needs to be cleaned. Good practice is to follow the steps below where appropriate:

  • Make all text the same case (upper or lowercase)
  • Remove numbers
  • Remove punctuation
  • Remove noise (words not required such as “re” & “fw” in an email context)

The end aim is to be left with only the data you’re interested in, removing what you don’t need; a simple sketch of these steps is shown below.
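
As an illustration, here is a minimal Python sketch of that cleaning pass, assuming a hypothetical list of raw email subject lines; the noise words you strip out will depend entirely on your own context.

    import re
    import string

    # Hypothetical raw text - in practice this would come from your own source.
    raw_emails = ["RE: Invoice 1234 overdue!!", "FW: Meeting notes (Q3)", "Team lunch Friday?"]

    NOISE_WORDS = {"re", "fw"}  # context-specific noise, e.g. email prefixes

    def clean(text):
        text = text.lower()                              # make all text the same case
        text = re.sub(r"\d+", "", text)                  # remove numbers
        text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
        return " ".join(w for w in text.split() if w not in NOISE_WORDS)  # remove noise words

    cleaned = [clean(e) for e in raw_emails]
    print(cleaned)  # ['invoice overdue', 'meeting notes q', 'team lunch friday']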

Normalization

Normalization is the process of making similar words the same, and some useful examples are:

Stemming: the removal of suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word.

Removing stop words:
The removal of common words in the English language (“the”, “is”, “at”, etc.). These form the structure of a sentence but not necessarily the context.
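
A small sketch of both steps, using the NLTK library purely as an example (other libraries such as spaCy would work equally well); the sample sentence is made up.

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    # One-off downloads of the tokeniser model and the stop word list.
    nltk.download("punkt", quiet=True)
    nltk.download("stopwords", quiet=True)

    sentence = "The engineers were happily testing the newest releases"

    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()

    tokens = word_tokenize(sentence.lower())
    kept = [w for w in tokens if w not in stop_words]   # drop stop words ("the", "were", ...)
    stems = [stemmer.stem(w) for w in kept]             # strip suffixes ("testing" -> "test")

    print(stems)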

Feature Engineering

Term Frequency: 

A count of how often each word appears; words that appear many times can point to common themes in a body of text.
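
For instance, a term-frequency count can be as simple as the sketch below, which carries over the hypothetical cleaned text from the earlier example.

    from collections import Counter

    cleaned = ["invoice overdue", "meeting notes q", "team lunch friday", "invoice paid"]

    # Count how often each word appears across all of the documents.
    term_freq = Counter(word for doc in cleaned for word in doc.split())
    print(term_freq.most_common(3))  # e.g. [('invoice', 2), ('overdue', 1), ('meeting', 1)]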

Wordcloud:

A visual representation of common words or themes appearing in a body of text.
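
A word cloud can be generated directly from those counts; this sketch assumes the third-party wordcloud package is installed and uses made-up term frequencies in the style of the previous sketch.

    from collections import Counter
    from wordcloud import WordCloud

    # Made-up term frequencies, as produced by the earlier counting step.
    term_freq = Counter({"invoice": 2, "overdue": 1, "meeting": 1, "notes": 1, "lunch": 1})

    # Build the cloud from the term frequencies.
    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(term_freq)
    cloud.to_file("themes.png")  # writes the visualisation to an image file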

Text Classification: a technique used to classify a word or sentence into a specified group. This could be applied to something like an email spam filter, as emails fit nicely into one of two groups: spam or not spam. Models can be trained using training/test data to learn what is and isn’t spam.
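
As a sketch of that spam-filter idea, the scikit-learn snippet below trains a naive Bayes classifier on a handful of made-up, hand-labelled emails; a real model would of course need far more training data and proper evaluation on a held-out test set.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Tiny, made-up training set: 1 = spam, 0 = not spam.
    emails = [
        "win a free prize now",
        "cheap loans click here",
        "meeting moved to friday",
        "please review the attached report",
    ]
    labels = [1, 1, 0, 0]

    # Bag-of-words features feeding a naive Bayes classifier.
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(emails, labels)

    print(model.predict(["free prize inside", "see you at the meeting"]))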

The above are just some examples of what can be used in Natural Language Processing, but it is by no means an exhaustive list.

  [1] Ur-Rahman, N. and Harding, J.A., 2012. Textual data mining for industrial knowledge management and text classification: A business oriented approach. Expert Systems with Applications, 39(5), pp. 4729–4739.

Friday 28 September 2018

Data Lab vs Warehouse

According to Wikipedia a Data Warehouse is defined as:

“... are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise.”

“The main source of the data is cleansed, transformed, catalogued, and made available for use by managers and other business professionals for data mining, online analytical processing, market research and decision support. However, the means to retrieve and analyze data, to extract, transform, and load data, and to manage the data dictionary are also considered essential components of a data warehousing system.”

and a Data Lake as:

“...a system or repository of data stored in its natural format, usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).”

In a nutshell these differences are:

Data Warehouse:
  • Only retains the data required, reducing noise from the users
  • Arranges data into standardised structures
  • Consolidates data from a variety of sources into ‘One Source of the Truth’
  • Supports decision making and can be used by a range of user abilities
  • Stringent Row/Column security can be put in place quickly

Data Lake:
  • Retains all Data
  • Can handle data of any variety / structure
  • Supports a variety of users but they may require specific skills
  • Ability to adapt quickly to new data types
  • Enables quicker data analysis

From a data science perspective, which is better? Well, that depends. There are advantages to a Data Warehouse: the structure and reproducibility allow quick, consistent analysis to be performed. A Data Lake, on the other hand, may be more difficult to work with, and not always reproducible, but it gives you the more granular and wide-ranging data needed to find the really interesting analysis.

So my answer: a Data Warehouse where I can bring my own data. Maybe something like a Data Ware-Lake?

Tuesday 25 September 2018

Oracle Pre-Tools and Coding Challenges

On a daily basis I'm working within an Oracle database environment. Much of that work involves Data Miner, R, Python and Oracle OBIEE.

Many will be wondering what they are and how you can use them. Thankfully there is a set of pre-built Virtual Machines (VMs) that enable you to simulate these different environments.

They are an excellent resource that can be used to investigate these tools; personally I think they're brilliant.

I will be using these in the future to demonstrate some of the great features within Oracle that we've been using in the data lab. 

For the demonstrations I’m going to be using open data, mainly so that readers can follow along if they want to, or even share their own analysis. These pre-built environments are also a great way to try out new skills. Personally I’d never got into APEX or Docker, but through the VMs, as well as other online resources, I’ve started to expand my knowledge in these areas.

The other thing I try to encourage people to do is have a go at coding. Why not try freeCodeCamp (@freeCodeCamp) or the 100 Days of Code challenge (#100DaysOfCode), both of which are great ways to get into coding while having a supportive community around you.

Personal favourite languages:
  • R
  • Python
  • SQL

All of which have their own pros and cons. From the R world I’m a huge fan of FlexDashboards; in Python, the ability to do deep learning through Edward/TensorFlow; and SQL is my go-to for large data sets and data engineering tasks.

Give it a go: learn a new programming language, have a programming-orientated date night, develop a new website, whatever it is that you want to push yourself in.

Saturday 22 September 2018

Prescriptive Analytics - the future?




Prescriptive analytics is the third and final phase of business analytics.
The first two stages of business analytics are:
Descriptive Analytics: analytics that quantitatively describe or summarise features within a data set. The data set could be a combination of data sets, for example house sale prices combined with ONS statistics on employment rates. We would then be able to describe areas of the country, such as where house prices are high relative to the employment rate.
Predictive Analytics: uses a variety of techniques from predictive modelling, machine learning and data mining to analyse current states and historical facts and make predictions about future or otherwise unknown events. For example, using our descriptive analysis of house prices and employment rates above, we may be able to predict the price at which a house should be put on the market, based on what other houses sold for, the size of the house, income in the area, employment rates and so on.
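
To make the predictive step concrete, here is a minimal sketch using scikit-learn and an entirely made-up set of house features (floor area, local employment rate, median income); real data and far more careful modelling would obviously be needed.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Made-up training data: [floor area m2, employment rate %, median income £k]
    X = np.array([
        [70,  74, 28],
        [95,  78, 31],
        [120, 81, 35],
        [150, 83, 40],
    ])
    y = np.array([180_000, 230_000, 295_000, 370_000])  # historical sale prices

    model = LinearRegression().fit(X, y)

    # Suggest a listing price for a new 110 m2 house in a similar area.
    print(model.predict(np.array([[110, 80, 33]])))
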
Prescriptive Analytics is sometimes referred to as the “final frontier of analytic capabilities”. It entails the application of mathematical and computational sciences and suggests decision options that take advantage of the results of descriptive and predictive analytics.
Prescriptive analytics goes beyond predictive analytics by showing potential actions that benefit from the predictions, along with the implications of each decision option. Often prescriptive analytics goes one step further still: it will not just predict what will happen, or when it will happen, but also why it will happen.
An example of where prescriptive analytics is utilised is energy pricing. Predicted prices depend on supply, demand, the economy, geopolitics, weather conditions and so on. An energy company can use prescriptive analytics to take in these external data sets alongside its own, predict prices by modelling the internal and external variables at the same time, and then provide decision options and show the consequences of each decision.

Wednesday 19 September 2018

Operating Models for Teams


CRISP-DM is shorthand for the Cross-Industry Standard Process for Data Mining. The model is used in many different areas of business and describes an approach that can be used to tackle data problems.

In the day-to-day work that I'm involved with, we are often approached with ideas or problems from within the business. This is key: we're placing the business at the centre of the problem, and therefore placing Value at the centre of the process. As the problem is defined we're also gathering business understanding while understanding the data we have, assessing the Variety of the data and any enrichment that could be applied. This understanding should then lead to preparing the data for the modelling, where we start to understand the Volume of data that will be used and the impact it will have. The modelling stage allows us to start tackling the problem, and at this stage we may need to assess the Veracity of the data; this can be a key point in assessing the problem and in implementing the solution across the wider business.
Finally we evaluate the potential solution to the problem and look to deploy it into the business, ultimately returning to the start: the Value of the solution.

The experience I have had with this approach has been positive: it allows the business to remain central to the process and keeps the focus on the Value returned to the business.

Once you’ve got your model, these days many teams work within an Agile framework. At my current workplace we’ve implemented Jira for tracking these stages, usually through Kanban boards, with more defined tasks rolling up into the overall story.

IMHO this process will allow Big Data to yield great results for your business; just ensure the business is central to the work you conduct!

Monday 3 September 2018

Big Data, what’s all the Fuss

Big Data is often defined as data so large or complex that traditional methods are no longer adequate or useful. There are many challenges to face, such as data collection, cleaning, supplementing or enriching the data, as well as analysing and extracting it. This is alongside the fact that 90% of the data in the world today has been generated in the last two years! Currently it's estimated that we output 2.5 quintillion bytes of data per day through electronic devices, phones and so on.

When people describe big data they often define it through a number of V's. The number of V's is variable (no pun intended), but I usually define big data through the 5 V's: Volume, Velocity, Variety, Veracity and Value:

Volume: often relates to the bytes of data that are collected and stored every day from many sources. In my day job we often have over 7 billion rows of data in a basic query, and sometimes it goes beyond that.

Velocity: the speed at which data changes or is generated is enormous. Think of how many videos or pictures you've seen on Twitter or Facebook that have gone viral in hours. Big data tooling should be able to analyse the data as it's being produced.

Variety: the different data sets that are now available each day, from postcode data, census information, health data, finance data and on and on. The variety provides an opportunity to enrich and provide background, but it can also muddy the waters.

Veracity: defines the type of data. Traditionally it's been about databases and structured data; today the world is built upon both structured and unstructured data, such as websites, Twitter, etc.

Value: the one that, for me, should be top of the list! You may have lots of data and lots of cool algorithms, but the main question is: what is the value in the data? Can you improve a patient's experience? Can you model a system improvement that will save your company money? That is where the value of the data lies.

Personally I view big data as a process of knowing more about the area of interest than we did yesterday, be that through adding a V to what we know today.

The final piece of the puzzle is how you use that data within your business: is it going to inform a decision, or form a data product? How you are aiming to use your data may help you recognise the V that you will be most reliant on going forward.