NLP is a hot topic right now, but it has a barrier to entry that only the number of .ac.uk results in any Google search can truly describe. Let's try to demystify NLP and introduce the very basics.
NLP stands for 'Natural Language Processing': 'natural language' because it is concerned with languages that have evolved naturally, and 'processing' because it tries to uncover data that is not immediately visible.
Examples of this data include:
- related concepts
Some example use cases include:
- grammar checking
- spell correction
Let's walk through a typical NLP pipeline (a sequence of processing steps).
tokenisation
First, we need to break apart our text. This is called tokenisation; a typical approach is to split the text into words.
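As a toy sketch in plain Python (no NLP library; real tokenisers also handle contractions, hyphens and sentence boundaries), word-level tokenisation can be as simple as a regular expression:

```python
import re

def tokenize(text):
    # Keep runs of letters, digits and apostrophes; discard punctuation and whitespace.
    return re.findall(r"[A-Za-z0-9']+", text)

print(tokenize("Jim bought 300 shares of Acme Corp. in 2006."))
# → ['Jim', 'bought', '300', 'shares', 'of', 'Acme', 'Corp', 'in', '2006']
```

Notice that the full stop after "Corp" is lost; deciding whether a dot ends a sentence or an abbreviation is exactly the kind of detail real tokenisers must handle.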
lemmatisation / stemming
Next is lemmatisation. This is where we convert our tokens into their base form (lemma) by removing inflectional endings, e.g. Running -> Run. Stemming is a cruder cousin: it chops suffixes off by rule, which is faster but doesn't guarantee the result is a real word.
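A toy suffix-stripper gives the flavour (strictly speaking this is a stemmer, not a lemmatiser; real lemmatisers like NLTK's use a vocabulary and morphological analysis rather than a handful of rules):

```python
def stem(token):
    """Naive suffix stripping, e.g. Running -> Run, cats -> cat."""
    word = token
    for suffix in ("ing", "ed", "s"):
        if word.lower().endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Collapse a doubled final consonant left behind by -ing/-ed (Runn -> Run).
    if len(word) >= 2 and word[-1].lower() == word[-2].lower() and word[-1].lower() not in "aeiou":
        word = word[:-1]
    return word

print(stem("Running"))  # → Run
print(stem("cats"))     # → cat
```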
part of speech tagging
Now we assign a tag to each token in the sentence; examples include verbs, nouns, pronouns, etc. It can also be useful to have more 'fine-grained' tags such as 'noun-plural'.
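A minimal sketch, assuming a hand-made lookup lexicon (real taggers are statistical and use the surrounding context to disambiguate words like "shares", which can be a noun or a verb):

```python
# Tiny hand-written lexicon; unknown words default to NOUN, a common baseline guess.
LEXICON = {
    "jim": "NOUN", "bought": "VERB", "shares": "NOUN-PLURAL",
    "of": "PREP", "in": "PREP", "the": "DET",
}

def pos_tag(tokens):
    return [(tok, LEXICON.get(tok.lower(), "NOUN")) for tok in tokens]

print(pos_tag(["Jim", "bought", "the", "shares"]))
# → [('Jim', 'NOUN'), ('bought', 'VERB'), ('the', 'DET'), ('shares', 'NOUN-PLURAL')]
```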
chunking
Chunk size can be roughly described as the level of detail of a phrase. Chunking, therefore, is the process of either increasing (chunking down) or decreasing (chunking up) this level of detail.
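In NLP tooling, chunking usually means shallow parsing: grouping tagged tokens into phrases, i.e. moving from word-level to phrase-level detail. A minimal sketch, assuming tokens arrive as (word, tag) pairs from a POS tagger and using a made-up tag set:

```python
# Tags allowed inside a noun phrase in this toy grammar.
NP_TAGS = {"DET", "ADJ", "NOUN", "NOUN-PLURAL"}

def np_chunk(tagged):
    """Group maximal runs of NP-ish tags into 'NP' chunks; pass other tokens through."""
    chunks, current = [], []
    for token, tag in tagged:
        if tag in NP_TAGS:
            current.append(token)
        else:
            if current:
                chunks.append(("NP", current))
                current = []
            chunks.append((tag, [token]))
    if current:
        chunks.append(("NP", current))
    return chunks

print(np_chunk([("Jim", "NOUN"), ("bought", "VERB"),
                ("the", "DET"), ("shares", "NOUN-PLURAL")]))
# → [('NP', ['Jim']), ('VERB', ['bought']), ('NP', ['the', 'shares'])]
```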
named entity recognition
NER is the process of classifying tokens and phrases in a sentence into predefined categories. It is best illustrated by an example:
<ENAMEX TYPE="PERSON">Jim</ENAMEX> bought <NUMEX TYPE="QUANTITY">300</NUMEX> shares of <ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in <TIMEX TYPE="DATE">2006</TIMEX>.
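A toy, purely pattern-based tagger can reproduce that markup (the hand-written gazetteer and the digit rules below are illustrative assumptions; real NER systems learn entities statistically from annotated data):

```python
import re

# Hand-written entity list; real systems learn these from data.
GAZETTEER = {"Jim": "PERSON", "Acme Corp.": "ORGANIZATION"}

def tag_entities(text):
    for name, label in GAZETTEER.items():
        text = text.replace(name, f'<ENAMEX TYPE="{label}">{name}</ENAMEX>')
    # Crude rules: four-digit numbers starting 1/2 are years, smaller numbers are quantities.
    text = re.sub(r"\b[12]\d{3}\b", r'<TIMEX TYPE="DATE">\g<0></TIMEX>', text)
    text = re.sub(r"\b\d{1,3}\b", r'<NUMEX TYPE="QUANTITY">\g<0></NUMEX>', text)
    return text

print(tag_entities("Jim bought 300 shares of Acme Corp. in 2006."))
```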
parsing
This is the process of grammatical analysis. We know that sentences are composed of tokens, and that these tokens carry grammatical tags (from POS tagging). By recursively grouping these tags we can form a parse tree, which tells us the grammatical structure of the sentence.
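The recursive grouping can be sketched with a tiny recursive-descent parser over a made-up three-rule grammar (real parsers use far richer grammars, or are learned from treebanks):

```python
def parse_sentence(tagged):
    """Toy recursive-descent parser for the grammar:
       S -> NP VP,  NP -> DET? NOUN,  VP -> VERB NP."""
    pos = 0

    def peek():
        return tagged[pos][1] if pos < len(tagged) else None

    def np():
        nonlocal pos
        children = []
        if peek() == "DET":
            children.append(tagged[pos])
            pos += 1
        assert peek() == "NOUN", "expected a noun"
        children.append(tagged[pos])
        pos += 1
        return ("NP", children)

    def vp():
        nonlocal pos
        assert peek() == "VERB", "expected a verb"
        verb = tagged[pos]
        pos += 1
        return ("VP", [verb, np()])

    return ("S", [np(), vp()])

tree = parse_sentence([("Jim", "NOUN"), ("bought", "VERB"),
                       ("the", "DET"), ("shares", "NOUN")])
print(tree)
# → ('S', [('NP', [('Jim', 'NOUN')]),
#          ('VP', [('bought', 'VERB'), ('NP', [('the', 'DET'), ('shares', 'NOUN')])])])
```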
information extraction
This is the overarching process of finding structure in unstructured text. NER is one such sub-task; others include:
- relationship extraction - finding relations between entities
- keyword extraction - finding the most relevant tokens
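Keyword extraction has an especially simple baseline: count word frequencies after dropping common "stop words" (the tiny stop-word list below is an illustrative assumption; real systems use longer lists or weighting schemes such as TF-IDF):

```python
from collections import Counter
import re

# Tiny illustrative stop-word list.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "it", "on"}

def keywords(text, n=2):
    tokens = [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]
    return [word for word, _ in Counter(tokens).most_common(n)]

print(keywords("the cat sat on the mat and the cat slept", n=1))
# → ['cat']
```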
semantic analysis
Semantic analysis is the process of encoding meaning from text (or speech). 'Deep semantics' has grown out of recent advances in deep learning, which can be used to derive a more meaningful and relevant semantic analysis.
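One popular way to encode meaning is to represent words as vectors (embeddings) and compare them with cosine similarity. The three-dimensional vectors below are made up purely for illustration; real embeddings are learned by models such as word2vec and have hundreds of dimensions:

```python
import math

# Hand-made toy "embeddings" (assumption: real ones are learned from large corpora).
VEC = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.9, 0.7, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction (similar meaning), 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine(VEC["king"], VEC["queen"]) > cosine(VEC["king"], VEC["banana"]))
# → True: 'king' lies closer to 'queen' than to 'banana' in this toy space.
```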
That's it (I did say it would be basic). In my next post, I'll be writing a basic lemmatiser and frequency analyser using Python and NLTK (the Natural Language Toolkit). Stay tuned!