Tuesday 2 October 2018

NLP – Just another acronym?

Data within a business is ever-evolving: it is generated as we speak, as we tweet, as we send emails and in countless other activities. Approximately 80% of corporate information is held in textual data formats [1]. Broadly, data falls into two categories:

  • Structured: highly organised datasets which are easily searchable, such as a database or an Excel sheet
  • Unstructured: data which doesn’t have a pre-defined model and doesn’t fit nicely into a database. Examples include PDFs, emails, phone conversations and Tweets

Natural Language Processing (NLP) helps us understand text and produce insights from text data. It is a branch of data science made up of systematic processes for analysing, understanding and deriving information from text in a smart and efficient manner.

Text Pre-processing

Unstructured data is typically very messy or “noisy”, so it needs to be cleaned. Good practice is to apply the following steps where appropriate:

  • Make all text the same case (upper or lowercase)
  • Remove numbers
  • Remove punctuation
  • Remove noise (words not required such as “re” & “fw” in an email context)

The aim is to be left with only the data you’re interested in, with everything else removed.
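The cleaning steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a definitive pipeline: the function name `preprocess` and the `NOISE_WORDS` set are assumptions made for this example, and the noise words are simply the “re” and “fw” mentioned above.

```python
import re
import string

# Hypothetical noise tokens for an email context ("re" and "fw" from the text above).
NOISE_WORDS = {"re", "fw"}

# Translation table that maps every ASCII punctuation character to a space.
# Note: string.punctuation does not include curly quotes or other non-ASCII marks.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text, noise_words=NOISE_WORDS):
    """Lowercase the text, strip numbers and punctuation, then drop noise words."""
    text = text.lower()                    # make all text the same case
    text = re.sub(r"\d+", " ", text)       # remove numbers
    text = text.translate(PUNCT_TO_SPACE)  # remove punctuation
    tokens = text.split()                  # split on whitespace
    return [t for t in tokens if t not in noise_words]  # remove noise


print(preprocess("RE: FW: Invoice 2018 is overdue!"))
# ['invoice', 'is', 'overdue']
```

Which steps to apply depends on the task: removing numbers, for instance, would be a poor choice if invoice amounts were the thing you wanted to analyse.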

Normalization

Normalization is the process of reducing similar words to a common form. Some useful examples are:

Stemming: the removal of suffixes (“ing”, “ly”, “es”, “s” etc.) from a word, so that, for example, “walking” and “walks” both become “walk”.
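To make the idea concrete, here is a deliberately naive suffix-stripping sketch. It is not a real stemming algorithm such as Porter’s; the function name `naive_stem`, the suffix list and the minimum stem length are all assumptions chosen for illustration.

```python
# Suffixes from the text above. Order matters: "es" must be tried before "s".
SUFFIXES = ("ing", "ly", "es", "s")


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, unless the remaining stem would be too short."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["walking", "quickly", "boxes", "emails", "is"]:
    print(w, "->", naive_stem(w))
# walking -> walk
# quickly -> quick
# boxes -> box
# emails -> email
# is -> is
```

A rule this crude will also mangle some words (“this” becomes “thi”), which is why established stemmers apply more careful rules.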

Removing stop words: the removal of common words in the English language (“the”, “is”, “at” etc.). These form the structure of a sentence but carry little of its meaning.
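Stop word removal is essentially a filter against a known list. The sketch below uses a tiny, hand-picked `STOP_WORDS` set purely as an assumption for the example; real stop word lists are much longer.

```python
# A small, hypothetical stop word list; real lists contain far more entries.
STOP_WORDS = {"the", "is", "at", "a", "an", "and", "of", "in", "on", "to"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not stop words (compared case-insensitively)."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = "The meeting is at the head office".split()
print(remove_stop_words(tokens))
# ['meeting', 'head', 'office']
```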

Feature Engineering

Term Frequency: 

Counting how often each word appears in a body of text helps to identify its common themes: a word that appears many times usually points to what the text is about.
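Counting terms needs nothing more than Python’s standard `collections.Counter`. The helper name `term_frequency` and the sample tokens are made up for this sketch, and the tokens are assumed to have been cleaned already.

```python
from collections import Counter


def term_frequency(tokens):
    """Count how many times each token appears."""
    return Counter(tokens)


tokens = "invoice overdue invoice payment overdue invoice".split()
tf = term_frequency(tokens)
print(tf.most_common(2))
# [('invoice', 3), ('overdue', 2)]
```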

Wordcloud:

A visual representation of the common words or themes appearing in a body of text, typically with more frequent words drawn larger.

Text Classification: a technique used to assign a piece of text (a word, a sentence or a whole document) to a specified group. An email spam filter is a good example, as emails fit nicely into one of two groups: spam or not spam. A model is trained on labelled training data to learn what spam does and doesn’t look like, and is then checked against separate test data.
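As a rough illustration of the spam filter idea, the sketch below trains a tiny naive Bayes classifier in plain Python. Naive Bayes is only one common choice and is an assumption here, as are the function names `train` and `predict` and the four made-up training emails; a real filter would need far more data and a proper held-out test set.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (tokens, label). Returns the counts needed for prediction."""
    label_counts = Counter()            # number of emails per label
    word_counts = defaultdict(Counter)  # word counts per label
    vocab = set()                       # every word seen in training
    for tokens, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def predict(tokens, model):
    """Return the label with the highest log-probability for the given tokens."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # prior for this label
        total_words = sum(word_counts[label].values())
        for t in tokens:
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            score += math.log((word_counts[label][t] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data, already lowercased and tokenised.
training = [
    ("win a free prize now".split(), "spam"),
    ("free cash offer claim now".split(), "spam"),
    ("meeting agenda for monday".split(), "not spam"),
    ("please review the attached report".split(), "not spam"),
]
model = train(training)

print(predict("claim your free prize".split(), model))          # spam
print(predict("agenda for the monday meeting".split(), model))  # not spam
```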

The above are just some examples of what can be used in Natural Language Processing; this is by no means an exhaustive list.

  1. UR-RAHMAN, N. and HARDING, J.A., 2012. Textual data mining for industrial knowledge management and text classification: A business oriented approach. Expert Systems with Applications, 39(5), pp. 4729-4739.
