NLP is a hot topic right now, but it has a barrier to entry that only the number of .ac.uk extensions in any Google search can truly describe. Let's try to demystify NLP and introduce the very basics.

NLP stands for 'Natural Language Processing'. 'Natural language' means it is concerned with languages that have evolved naturally (as opposed to, say, programming languages), and 'processing' means trying to uncover data that is not immediately visible.

Examples of this data could be:

  • keywords
  • related concepts
  • sentiment

Some example use cases include:

  • grammar checking
  • spell correction
  • translation

Let's talk about a typical NLP pipeline (sequence of events).

tokenisation

First, we need to break apart our text. This is called tokenisation, and a typical process would be to split text into words.
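To make this concrete, here is a minimal sketch of a word-level tokeniser using only Python's standard library. The function name `tokenise` and the regular expression are my own illustrative choices, not a standard; real tokenisers handle many more edge cases (URLs, abbreviations, hyphenation and so on).

```python
import re

def tokenise(text):
    """Split text into word and punctuation tokens."""
    # \w+(?:'\w+)? matches a word, optionally with an internal apostrophe (e.g. "Let's").
    # [^\w\s] matches any single punctuation character.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

tokens = tokenise("Let's demystify NLP, starting with the basics.")
print(tokens)
# ["Let's", 'demystify', 'NLP', ',', 'starting', 'with', 'the', 'basics', '.']
```

Note that punctuation becomes its own token here. Whether you keep or drop it depends on what the later stages of the pipeline need.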

lemmatisation / stemming

Next is lemmatisation, where we convert our tokens into their base form (lemma), e.g. running -> run. Stemming is a cruder alternative that simply strips inflectional endings, so it can produce stems that are not real words.
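As a rough sketch of the difference between the two, the snippet below strips a handful of endings (stemming) and falls back on a tiny lookup table for irregular forms (a stand-in for lemmatisation). The `SUFFIXES` list and the `LEMMAS` table are made up for illustration; a real lemmatiser relies on a full dictionary and the part of speech of each token.

```python
# Hypothetical lookup table of irregular forms; a real lemmatiser uses a full dictionary.
LEMMAS = {"ran": "run", "better": "good", "mice": "mouse"}
SUFFIXES = ("ing", "ed", "es", "s")

def stem(token):
    """Crude stemming: strip the first matching suffix."""
    word = token.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words like "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            # Undo consonant doubling after "ing"/"ed": "runn" -> "run".
            if (suffix in ("ing", "ed") and len(word) >= 2
                    and word[-1] == word[-2] and word[-1] not in "aeiouls"):
                word = word[:-1]
            return word
    return word

def lemmatise(token):
    """Dictionary lookup first, falling back to the stemmer."""
    word = token.lower()
    return LEMMAS.get(word, stem(word))

print(stem("Running"))    # run
print(stem("mice"))       # mice (no suffix to strip)
print(lemmatise("mice"))  # mouse
```

The stemmer is deliberately naive and will mangle plenty of words, which is exactly why lemmatisers lean on a vocabulary instead.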

part of speech tagging

Now we assign a tag to each token in the sentence. Examples include verbs, nouns, pronouns, etc. It can also be useful to have more 'fine-grained' tags, such as 'noun-plural'.
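Here is a toy illustration of tagging: look each token up in a small lexicon and guess from the word ending when it is unknown. The `LEXICON`, the tag names and the fallback rules are all assumptions made for this sketch; real taggers are trained on annotated corpora and use the surrounding context.

```python
# Hypothetical toy lexicon; real taggers learn this from annotated corpora.
LEXICON = {
    "the": "DET", "a": "DET",
    "she": "PRON", "he": "PRON", "it": "PRON",
    "dog": "NOUN", "cat": "NOUN",
    "chased": "VERB", "sees": "VERB",
}

def pos_tag(tokens):
    """Return (token, tag) pairs, guessing from the suffix for unknown words."""
    tagged = []
    for token in tokens:
        word = token.lower()
        if word in LEXICON:
            tag = LEXICON[word]
        elif word.endswith("s"):
            tag = "NOUN-PLURAL"  # a more fine-grained guess
        elif word.endswith(("ed", "ing")):
            tag = "VERB"
        else:
            tag = "NOUN"  # default guess for anything else
        tagged.append((token, tag))
    return tagged

print(pos_tag(["The", "dog", "chased", "cats"]))
# [('The', 'DET'), ('dog', 'NOUN'), ('chased', 'VERB'), ('cats', 'NOUN-PLURAL')]
```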

chunking

Chunking (also called shallow parsing) is the process of grouping tagged tokens into short, non-overlapping phrases, or 'chunks', such as noun phrases. For example, the tokens 'the', 'quick', 'brown' and 'fox' would be grouped into a single noun-phrase chunk, 'the quick brown fox'.
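In language-processing toolkits, chunking usually means grouping tagged tokens into phrases. Below is a minimal sketch of a noun-phrase chunker, assuming the tokens have already been tagged with the simple tags `DET`, `ADJ`, `NOUN` and so on. The pattern (an optional determiner, any adjectives, then one or more nouns) and the function name are illustrative choices of mine.

```python
def chunk_noun_phrases(tagged):
    """Group runs matching DET? ADJ* NOUN+ into noun-phrase (NP) chunks."""
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "DET":
            j += 1
        while j < len(tagged) and tagged[j][1] == "ADJ":
            j += 1
        nouns_start = j
        while j < len(tagged) and tagged[j][1] == "NOUN":
            j += 1
        if j > nouns_start:
            # At least one noun was found, so tokens i..j form a noun phrase.
            chunks.append(("NP", [word for word, _ in tagged[i:j]]))
            i = j
        else:
            # Not a noun phrase: emit the token on its own under its tag.
            chunks.append((tagged[i][1], [tagged[i][0]]))
            i += 1
    return chunks

tagged = [("the", "DET"), ("quick", "ADJ"), ("brown", "ADJ"), ("fox", "NOUN"),
          ("jumped", "VERB"), ("over", "PREP"),
          ("the", "DET"), ("lazy", "ADJ"), ("dog", "NOUN")]
for label, words in chunk_noun_phrases(tagged):
    print(label, words)
```

This should print two `NP` chunks ('the quick brown fox' and 'the lazy dog') with the verb and preposition left as single tokens in between.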

named entity recognition

Named entity recognition (NER) is the process of locating and classifying named entities (people, organisations, dates, quantities and so on) in a sentence. It is best illustrated by an example:

<ENAMEX TYPE="PERSON">Jim</ENAMEX> bought <NUMEX TYPE="QUANTITY">300</NUMEX> shares of <ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in <TIMEX TYPE="DATE">2006</TIMEX>.
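A very rough way to produce markup like this is to combine a list of known names (a gazetteer) with a couple of regular expressions. The sketch below does exactly that; the `GAZETTEER` contents and the rules for years and numbers are assumptions for illustration only. Real NER systems are typically statistical models trained on annotated text.

```python
import re

# Hypothetical gazetteer of known entities; real systems learn these from data.
GAZETTEER = {
    "Jim": ("ENAMEX", "PERSON"),
    "Acme Corp.": ("ENAMEX", "ORGANIZATION"),
}

def tag_entities(sentence):
    """Wrap known names, four-digit years and other numbers in entity markup."""
    # Longest names first so that longer entries win over shorter overlapping ones.
    names = "|".join(re.escape(name) for name in sorted(GAZETTEER, key=len, reverse=True))
    pattern = re.compile(
        rf"(?P<name>{names})|(?P<year>\b(?:19|20)\d{{2}}\b)|(?P<number>\b\d+\b)"
    )

    def wrap(match):
        text = match.group(0)
        if match.lastgroup == "name":
            element, entity_type = GAZETTEER[text]
        elif match.lastgroup == "year":
            element, entity_type = "TIMEX", "DATE"
        else:
            element, entity_type = "NUMEX", "QUANTITY"
        return f'<{element} TYPE="{entity_type}">{text}</{element}>'

    return pattern.sub(wrap, sentence)

print(tag_entities("Jim bought 300 shares of Acme Corp. in 2006."))
```

On this sentence the sketch reproduces the markup shown above. It will happily mislabel things on other input (a quantity like 2000 would be tagged as a date, for instance), which is why context-aware models are used in practice.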

parsing

This is the process of grammatical analysis. We know that sentences are composed of tokens, and that these tokens have grammatical tags (from POS tagging). By recursively grouping these tags we can form a parse tree, which tells us the grammatical structure of the sentence.

[Figure: parse tree]
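To show what recursively grouping tags can look like, here is a small sketch of a recursive-descent parser for a toy grammar: S -> NP VP, NP -> DET? ADJ* NOUN, VP -> VERB NP?. The grammar, the tag names and the nested-tuple tree format are all assumptions made for this example. Real grammars are far larger and have to deal with ambiguity.

```python
def parse_sentence(tagged):
    """Parse (word, tag) pairs into a nested-tuple tree using the toy grammar."""
    pos = 0

    def peek():
        # Tag of the next unread token, or None at the end of the sentence.
        return tagged[pos][1] if pos < len(tagged) else None

    def take(tag):
        nonlocal pos
        if peek() != tag:
            raise ValueError(f"expected {tag} at position {pos}")
        word = tagged[pos][0]
        pos += 1
        return (tag, word)

    def noun_phrase():
        children = []
        if peek() == "DET":
            children.append(take("DET"))
        while peek() == "ADJ":
            children.append(take("ADJ"))
        children.append(take("NOUN"))
        return ("NP", *children)

    def verb_phrase():
        children = [take("VERB")]
        if peek() in ("DET", "ADJ", "NOUN"):
            children.append(noun_phrase())
        return ("VP", *children)

    tree = ("S", noun_phrase(), verb_phrase())
    if pos != len(tagged):
        raise ValueError("unparsed tokens remain")
    return tree

def show(tree, depth=0):
    """Print the tree with one level of indentation per level of nesting."""
    label, *rest = tree
    if len(rest) == 1 and isinstance(rest[0], str):
        print("  " * depth + f"{label}: {rest[0]}")
    else:
        print("  " * depth + label)
        for child in rest:
            show(child, depth + 1)

tree = parse_sentence([("the", "DET"), ("dog", "NOUN"), ("chased", "VERB"),
                       ("a", "DET"), ("cat", "NOUN")])
show(tree)
```

The indented output has `S` at the root, with an `NP` ('the dog') and a `VP` beneath it, and the `VP` in turn contains the verb and a second `NP` ('a cat').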

information extraction

This is the overarching process of finding structure in unstructured text. NER is one such sub-task. Other sub-tasks include:

  • relationship extraction - finding relations between entities
  • keyword extraction - finding the most relevant tokens
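As a sketch of the simplest possible keyword extraction, the snippet below counts how often each word appears and ignores very common 'stop words'. The `STOP_WORDS` list and the `extract_keywords` function are illustrative assumptions; more serious approaches weigh words against how common they are across many documents.

```python
import re
from collections import Counter

# Small illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "to", "it"}

def extract_keywords(text, top_n=3):
    """Return the top_n most frequent words that are not stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]

print(extract_keywords("the cat and the dog and the cat", top_n=2))
# ['cat', 'dog']
```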

deep semantics

Semantic analysis is the process of extracting meaning from text (or speech). Deep semantics has evolved from recent advances in deep learning, which can be used to derive a more meaningful and relevant semantic analysis.

That's it (I did say it would be basic). In my next post, I'll be writing a basic lemmatiser and frequency analyser using Python and NLTK (the Natural Language Toolkit). Stay tuned!