"Knowledge is power"
Francis Bacon
The starting point for most analytics projects is descriptive analytics: answering simple questions such as 'What happened?', 'Where are we?' or 'When did it happen?'. In the same way, we can get a feel for an unstructured document or text field.
For each sentence, we can start to understand the number of words, the length, the number of characters and so on.
Depending on your Oracle environment, you could use Python through the Oracle Data Science platform. Python has a number of language modules - we are going to get onto more of them shortly - but ultimately you could use something as simple as the built-in len function.
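As a rough illustration of how far len alone can take you, here is a minimal sketch. The describe function and the sample sentence are made up for this post, and splitting on whitespace is only a crude way of counting words.

```python
def describe(text):
    """Return simple descriptive counts for a piece of text."""
    words = text.split()  # split on any run of whitespace
    return {
        "characters": len(text),
        "characters_no_spaces": len(text.replace(" ", "")),
        "words": len(words),
        # guard against empty input so we never divide by zero
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
    }


# Hypothetical survey response, used only as an example
stats = describe("The service was quick and friendly")
print(stats)
```

Punctuation attached to a word is counted as part of that word here, which is usually good enough for a first look.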
If you are doing your analytics in SQL, perhaps on a text field inside the database such as survey responses, then there are a number of string functions you can use, such as LENGTH on its own, or LENGTH combined with a regular expression function such as REGEXP_REPLACE to remove special characters or spaces first.
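For comparison, the same idea of measuring length after stripping special characters and spaces can be sketched in Python with the standard re module. The cleaned_length function, the pattern and the sample response are assumptions for illustration; what counts as a "special character" depends on your data.

```python
import re


def cleaned_length(text):
    """Length of the text after removing everything except letters and digits."""
    # Assumption: only ASCII letters and digits count as "real" characters
    cleaned = re.sub(r"[^A-Za-z0-9]", "", text)
    return len(cleaned)


# Hypothetical survey response
response = "Great service!! Would recommend :)"
print(len(response), cleaned_length(response))
```

Comparing the raw and cleaned lengths gives a quick sense of how much of a field is punctuation or padding.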
Stop Words
Once we understand the sentence at its highest level, we can start to pull apart what the sentence is actually made up of. Stop words are the words that connect other words together. They are, in a sense, the fluffy stuff. They don't add much meaning to the sentence, but they make it nicer for us to read or hear.
In this sentence we have multiple stop words. Now there is a caveat, and it's the word "it". In lower case we read it as "it", however in upper case "IT" is usually a department in your organisation. So be careful about the order of the steps you take when handling stop words. If you do your cleaning/preparation first, such as converting to lower case or removing punctuation, there is a chance that "IT" becomes "it" and is therefore treated as a stop word.
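The ordering problem can be sketched as follows. The stop word list, the PROTECTED set and the sample sentence are all hypothetical and deliberately tiny; the point is only that checking for "IT" before lower-casing keeps it, while lower-casing first loses it.

```python
# Hypothetical, deliberately small stop word list
STOP_WORDS = {"it", "is", "the", "a", "to", "and", "in"}

# Hypothetical tokens to keep when they appear in exactly this case
PROTECTED = {"IT"}


def remove_stop_words(text, protect=True):
    """Return the tokens of text with stop words removed."""
    kept = []
    for token in text.split():
        word = token.strip(".,!?;:\"'")  # trim punctuation around the word
        if not word:
            continue
        # Check protected tokens BEFORE lower-casing
        if protect and word in PROTECTED:
            kept.append(word)
            continue
        if word.lower() in STOP_WORDS:
            continue
        kept.append(word.lower())
    return kept


sentence = "It is down again, so IT is looking into it."
with_protection = remove_stop_words(sentence, protect=True)
without_protection = remove_stop_words(sentence, protect=False)
print(with_protection)
print(without_protection)
```

This simple check cannot tell the two meanings apart when the whole text is in capitals, so treat it as a starting point, not a complete answer.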
If you are in the Oracle Data Science platform, then packages such as NLTK have a stop word list built in and can remove stop words very quickly.
If you are in Oracle SQL, you'll need to build up a list of stop words and then apply the function.
In the next blog we'll start to look at Sentiment analysis.