Tuesday 9 October 2018

Journey to the cloud

So today we'll discuss the migration to the cloud. A journey that, on the surface or at least in the sales pitch, sounds picture perfect: a swift migration to the cloud and all your problems are solved. Well, maybe not....

So our migration started like most others: in our case an on-premise Exadata, working well but not offering the strategic future we wanted. The cloud, on the other hand, offers the Exadata, the software as a service, and the possibility of bursting CPUs and/or storage when needed rather than always having them.

In our case we were not only moving Exadata to the cloud but also moving from on-premise OBIEE to Oracle's BI Cloud Service (BICS) software as a service. Our first problem: we were initially on version 11.7 of OBIEE, a version you can't migrate straight to the cloud. So after many conversations we upgraded and, rightly or wrongly, went to the newest version available for us to download, OBIEE 12c. This was our preparation work for the cloud.

Now time to get our hands on that shiny new cloud.... well, sort of....
Firstly, if you're migrating data to the cloud, you'll need to decide how you're going to transfer it. In our case we didn't have a VPN installed, so the corporate network stopped us accessing the Oracle cloud directly. Next, our data wasn't big enough to qualify for the Oracle data service. Like all good problem solvers, we managed to use the tools at our disposal to get the end result.

We transferred the source data (transactional data) to the on-premise X4, total size 90GB. We then encrypted it and used Hybrid Columnar Compression (HCC) in Archive Low mode to compress the data down to an impressive 9GB. Now we're cooking on gas, as they say: the file was small enough to upload to the cloud over the internet (with appropriate security in place). Repeat a number of times and eventually you manage to get all the data to the cloud - not ideal, but at least in our control.

Once the data was there, the next problem was moving the RPD and catalog from OBIEE to BICS. Having done the prep work, we genuinely thought this part would be easy. Oh, how I was wrong....

Oracle often maintains a cloud-first development structure (see this link for details); however, this wasn't strictly true for BICS. It turned out that the OBIEE on-premise version was ahead of the BICS version. Unfortunately there isn't a way to "downgrade" an OBIEE VM; once you've upgraded, you're stuck. For the RPD, however, there is a way to downgrade. There is a good guide to this here.

Anyway, on the 5th of June the BICS environment was finally upgraded, enabling the catalog migration from on-premise to cloud to happen - in our case causing a 4-week delay to the project.

All in all we've managed to migrate to the cloud, but it wasn't the easy journey you've heard about. Now that we're here, well, there are still some snags, but more details of that in my next post.

Tuesday 2 October 2018

NLP – Just another acronym?

Data within a business is ever evolving and is being generated as we speak, as we tweet, as we send emails, and through various other activities. Approximately 80% of corporate information is available in textual data formats¹. Broadly, data falls into two categories:

  • Structured: highly organised datasets which are easily searchable, such as a database or Excel sheet
  • Unstructured: data which doesn’t have a pre-defined model and doesn’t fit nicely into a database. Examples include PDFs, emails, phone conversations and Tweets

Natural Language Processing (NLP) helps us understand text and produce insights from text data. It is a branch of data science consisting of systematic processes for analysing, understanding, and deriving information from text data in a smart and efficient manner.

Text Pre-processing

Unstructured data is typically very messy or “noisy”, so it needs to be cleaned. Good practice is to follow the steps below where appropriate:

  • Make all text the same case (upper or lowercase)
  • Remove numbers
  • Remove punctuation
  • Remove noise (words not required such as “re” & “fw” in an email context)

The end aim is to be left with only the data you’re interested in, removing what you’re not.
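The cleaning steps above can be sketched in a few lines of Python using only the standard library. The noise-word set here is a hypothetical one for the email context mentioned above:

```python
import re
import string

# Hypothetical noise words for an email context ("re", "fw" etc.)
NOISE_WORDS = {"re", "fw", "fwd"}

def preprocess(text):
    """Apply the cleaning steps above: same case, then remove
    numbers, punctuation and noise words."""
    text = text.lower()                   # make all text lowercase
    text = re.sub(r"\d+", "", text)       # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    words = [w for w in text.split() if w not in NOISE_WORDS]         # remove noise
    return " ".join(words)

print(preprocess("RE: Invoice 2018 - please review!"))  # invoice please review
```

Real pipelines often chain many more steps (spell correction, tokenisation rules, etc.), but the shape is the same: each step strips something you don't need.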

Normalization

Normalization is the process of making similar words the same (normalizing), and some useful examples are:

Stemming: the removal of suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word, so that, for example, “barking” reduces to “bark”.

Removing stop words: 
the removal of common words in the English language (“the”, “is”, “at”, etc.). These form the structure of a sentence but not necessarily the context.

Feature Engineering

Term Frequency: 

If a word is identified as appearing many times, it can act as a way of identifying common themes in a body of text. 
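Counting term frequency over cleaned text is a one-liner with the standard library's `collections.Counter` (the sample sentence is made up for illustration):

```python
from collections import Counter

# Toy body of text, already cleaned/normalized
text = "cloud data cloud migration cloud"

# Count how often each term appears
counts = Counter(text.split())
print(counts.most_common(1))  # [('cloud', 3)]
```

The most frequent terms are a quick first look at the themes of a document, though schemes like TF-IDF are usually preferred because they discount words that are common everywhere.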

Wordcloud:

A visual representation of common words or themes appearing in a body of text.

Text Classification: a technique used to classify a word/sentence into a specified group. This could be applied to something like an Email Spam Filter, as emails fit nicely into one of two groups, either spam or not spam. Models can be trained using training/test data to define what spam is/isn’t. 
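A minimal sketch of the spam/not-spam idea is a Naive Bayes classifier trained on word counts. The four training emails below are invented toy data, and this from-scratch version stands in for what you would normally get from a library such as scikit-learn:

```python
import math
from collections import Counter

# Toy training set (invented, purely illustrative)
train = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached", "ham"),
]

# Word frequencies per class and class counts
word_counts = {"spam": Counter(), "ham": Counter()}
class_totals = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    class_totals[label] += 1

vocab = {w for c in word_counts.values() for w in c}

def classify(text):
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""
    scores = {}
    for label in word_counts:
        # log prior: how common the class is in training data
        score = math.log(class_totals[label] / sum(class_totals.values()))
        total = sum(word_counts[label].values())
        for word in text.split():
            # smoothed log likelihood of each word given the class
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("claim your free money"))  # spam
```

With a real training set of labelled emails (the train/test split mentioned above), the same structure scales up; the smoothing is what stops unseen words from zeroing out a class.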

The above are just some examples of what could be used in Natural Language Processing, but this is by no means an exhaustive list. 

  1. Ur-Rahman, N. and Harding, J.A., 2012. Textual data mining for industrial knowledge management and text classification: a business oriented approach. Expert Systems with Applications, 39(5), pp. 4729–4739