Small Lab, Big Data: February 2023

Tuesday, 21 February 2023

Natural what now - NLP

“A different language is a different vision of life.”

Federico Fellini

I realised that in my last blog I didn't really describe what NLP is or where it came from. So here is a quick blog to fill readers in :)

Natural language processing (NLP) is a field of artificial intelligence and computer science that focuses on enabling computers to process, understand, and generate human language. It can be rule-based or involve a number of complex algorithms. Tasks range from translation to extracting specific items or grouping themes together. In the real world there are many applications of NLP, from chatbots to virtual assistants.

Here is a brief overview of the history of NLP:

1950s and 1960s: The foundations of NLP are laid with early computer programs that can process and analyse natural language data, including the first machine translation experiments.
1970s and 1980s: NLP develops more advanced techniques, largely built on hand-written, rule-based systems.
1990s: The field of NLP sees significant growth and advances, including the development of statistical models and ML algorithms that can be used for NLP tasks.
2000s: NLP continues to advance as statistical and ML methods mature, and early neural network language models appear.
2010s and beyond: Deep learning becomes the dominant approach, and NLP is used in a wide range of applications, including virtual assistants and chatbots.

Overall, the history of NLP reflects the evolution of artificial intelligence and computer science, and it has led to significant advances in the field that have had a wide-ranging impact on society.

Monday, 20 February 2023

NLP: Where do I start

"Knowledge is power"

Francis Bacon

The starting point for most analytics projects is descriptive analytics: answering simple questions such as 'What happened?', 'Where are we?' or 'When did it happen?'. In the same way, we can get a feel for an unstructured document or text field.

For each sentence, we can start to understand the number of words, the length, the number of characters, etc.

Depending on your Oracle environment, you could use Python through the Oracle Data Science platform. Python has a number of language modules (we are going to get onto more of them shortly), but ultimately you could use something as simple as len.
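As a rough illustration of how far len alone gets you, here is a minimal sketch. The function name describe_sentence and the whitespace-based word split are my own assumptions for the example, not anything specific to the Oracle platform.

```python
def describe_sentence(sentence: str) -> dict:
    """Return simple descriptive counts for one sentence."""
    words = sentence.split()  # naive tokenisation: split on whitespace
    return {
        "characters": len(sentence),  # includes spaces and punctuation
        "words": len(words),
        "longest_word": max((len(w) for w in words), default=0),
    }


stats = describe_sentence("Knowledge is power")
print(stats)  # {'characters': 18, 'words': 3, 'longest_word': 9}
```

Splitting on whitespace is crude (punctuation stays attached to words), but it is usually enough for a first feel of the text.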

If you are doing your analytics in SQL, maybe on a text field inside the database (e.g. survey responses), then there are a number of string functions you can utilise, such as LENGTH on its own, or LENGTH combined with a regular expression function such as REGEXP_REPLACE to remove special characters or spaces first.
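To keep the examples in one language, here is the same idea sketched in Python rather than SQL: measure the length after stripping special characters and spaces with a regular expression. The helper name clean_length and the choice to keep only letters and digits are assumptions for illustration; adjust the pattern to whatever counts as "special" in your data.

```python
import re


def clean_length(text: str) -> int:
    """Length of text after removing anything that is not a letter or digit."""
    return len(re.sub(r"[^A-Za-z0-9]", "", text))


response = "Great service, thanks!"
print(len(response))           # 22: raw length, spaces and punctuation included
print(clean_length(response))  # 18: letters and digits only
```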

Stop Words

Once we understand the sentence at its highest level, we can start to pull apart what the sentence is actually made up of. Stop words are the words that connect other words. They are, in a sense, the fluffy stuff. They add little meaning to the sentence, but make it nicer for us to read or hear.

In this sentence we have multiple stop words. Now there is a caveat, and it's the word "it". In lower case we read it as "it", however in upper case "IT" is usually a department in your organisation. So be careful about the order of the steps you take when handling stop words. If you do your cleaning/preparation first, such as converting to lower case or removing punctuation, there is a chance that "IT" becomes "it" and is therefore treated as a stop word.

If you are in the Oracle Data Science platform, then packages such as NLTK have a stop word list built in and can remove stop words very quickly.

If you are in Oracle SQL, you'll need to build up your own list of stop words and then apply it to the text.
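To make the "IT" versus "it" caveat concrete, here is a small sketch using a hand-built list rather than NLTK. The STOP_WORDS and PROTECTED sets and the remove_stop_words function are hypothetical and deliberately tiny; a real stop word list would be much longer.

```python
# Hypothetical, deliberately tiny lists for illustration only.
STOP_WORDS = {"a", "an", "and", "is", "it", "of", "the", "to"}
PROTECTED = {"IT"}  # upper-case tokens that must not be lower-cased away


def remove_stop_words(text: str, protect_acronyms: bool = True) -> list[str]:
    """Drop stop words, optionally keeping protected upper-case tokens."""
    kept = []
    for token in text.split():
        if protect_acronyms and token in PROTECTED:
            kept.append(token)  # checked before any lower-casing happens
        elif token.lower() not in STOP_WORDS:
            kept.append(token.lower())
    return kept


sentence = "The IT team fixed it"
print(remove_stop_words(sentence))                          # ['IT', 'team', 'fixed']
print(remove_stop_words(sentence, protect_acronyms=False))  # ['team', 'fixed']
```

The only design point here is ordering: the check for protected tokens happens before lower-casing, so the department survives while the pronoun is still removed.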

In the next blog we'll start to look at sentiment analysis.
