Introduction to NLTK and Text Analytics

May 20, 2022

Introduction to NLTK

NLTK stands for Natural Language Toolkit. It is a leading platform for building Python programs that work with human language data. It consists of several packages that help machines process human language and respond to it appropriately.

It gives practitioners easy-to-use interfaces to over 50 corpora and lexical resources, along with text processing libraries for classification, tokenization, tagging, stemming, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries. Data for text analytics comes from many applications on the internet. By analyzing tweets on Twitter, we can find trending news and people’s reactions to a particular event. Amazon can understand user feedback or reviews of a specific product. BookMyShow can discover people’s reviews of a movie, whether positive or negative. YouTube can also analyze and understand people’s viewpoints on a video.

Getting started with NLTK

Download NLTK - install the library itself with pip install nltk, then download the corpora and models it uses with nltk.download()

Import NLTK

You can import NLTK directly with import nltk.
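
The setup can be sketched as follows. This assumes the library is already installed with pip. The resource names in the list are assumptions based on recent NLTK releases (older releases use "punkt" and "averaged_perceptron_tagger", newer ones the "_tab" and "_eng" variants), so the sketch simply asks for both.

```python
# Install the library first (run in a shell, not in Python):
#   pip install nltk
import nltk

print("NLTK version:", nltk.__version__)

# nltk.download() fetches data packages (corpora and models), not the library.
# By default it returns False rather than raising when a package is unavailable.
# The names below are assumptions and vary between NLTK releases.
RESOURCES = [
    "punkt", "punkt_tab",
    "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng",
    "stopwords", "state_union",
]
status = {name: nltk.download(name, quiet=True) for name in RESOURCES}
print(status)
```

Calling nltk.download() with no arguments opens an interactive downloader instead, which is convenient when you are exploring what is available.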

Importing dataset from corpus

Some simple operations with NLTK

Tokenizing

Tokenizing means breaking a large body of text into smaller units, such as sentences or words, which helps to clean the data for processing and classification.

Tokenize words

To tokenize words means to break down a sentence into a list of words.
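
A small sketch of word tokenization. It uses TreebankWordTokenizer because that class is rule-based and needs no downloaded data; the more common nltk.word_tokenize() helper gives similar results but also requires the Punkt sentence model. The example sentence is made up for illustration.

```python
from nltk.tokenize import TreebankWordTokenizer

sentence = "NLTK breaks a sentence into words."

# Rule-based tokenizer: splits on whitespace and separates punctuation.
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(sentence)
print(tokens)
```

Note that the final full stop becomes a token of its own, which matters later when you filter or tag the tokens.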

Tokenize sentences

To tokenize sentences is to break down a long text into individual sentences at sentence boundaries such as full stops and question marks.
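
Sentence tokenization can be sketched with PunktSentenceTokenizer. Created without training text, it knows no abbreviations and needs no downloaded data, which is enough for simple input like the made-up paragraph below. The nltk.sent_tokenize() helper uses a pretrained Punkt model instead and copes better with abbreviations such as "Dr." or "e.g.".

```python
from nltk.tokenize import PunktSentenceTokenizer

paragraph = (
    "Tokenizing is the first step. It splits text into pieces. "
    "Is it hard? Not really."
)

# Untrained tokenizer: breaks after sentence-ending punctuation.
tokenizer = PunktSentenceTokenizer()
sentences = tokenizer.tokenize(paragraph)

for sentence in sentences:
    print(sentence)
```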

Part of speech (POS)

A part-of-speech tagger (POS tagger) processes a sequence of words and attaches a part-of-speech tag to each word. The text must be tokenized first. (Tokenization is the process of dividing a text into smaller pieces called tokens.)

Here’s a list of the tags, what they mean, and some examples:

CC coordinating conjunction

CD cardinal number

DT determiner

EX existential there (like: “there is”… think of it like “there exists”)

FW foreign word

IN preposition/subordinating conjunction

JJ adjective ‘big’

JJR adjective, comparative ‘bigger’

JJS adjective, superlative ‘biggest’

LS list marker 1)

MD modal could, will

NN noun, singular ‘desk’

NNS noun plural ‘desks’

NNP proper noun, singular ‘Harrison’

NNPS proper noun, plural ‘Americans’

PDT predeterminer ‘all the kids’

POS possessive ending parent’s

PRP personal pronoun I, he, she

PRP$ possessive pronoun my, his, hers

RB adverb very, silently

RBR adverb, comparative better

RBS adverb, superlative best

RP particle give up

TO to go ‘to’ the store

UH interjection errrrrrrrm

VB verb, base form take

VBD verb, past tense took

VBG verb, gerund/present participle taking

VBN verb, past participle taken

VBP verb, sing. present, non-3rd person take

VBZ verb, 3rd person sing. present takes

WDT wh-determiner which

WP wh-pronoun who, what

WP$ possessive wh-pronoun whose

WRB wh-adverb where, when

In this example, we use the state_union dataset that is available in the NLTK corpus collection.

You can download the dataset directly with NLTK after importing all essential libraries.

After downloading the dataset, we have to select a training text and a sample text.

After selecting the data, we have to train our model on the training text.

After we run our model on the sample text, we get output that pairs each word with its related tag.

Now we will learn how to use part-of-speech tagging on our own text.

For this, we need to import nltk and use stopwords to clean our data.

After importing essential libraries, we can embed our own data. We have written the data as “NLTK stands for Natural Language Toolkit. This is a suite of libraries and programs for symbolic and statistical NLP for English. It ships with graphical demonstrations and sample data. First getting to see the light in 2001, NLTK hopes to support research and teaching in NLP and other areas closely related. These include Artificial intelligence, empirical linguistics, cognitive science, information retrieval, and Machine Learning.”

We will get the output as a list in which each item pairs a word with its tag.

You can also find similar words, i.e., words that are used in similar contexts and therefore tend to share a part of speech. For example, “woman” is a noun. If we search for “woman,” we mostly get other nouns present in the data.

Similarly, “over” is a preposition. If we search for “over,” we mostly get other prepositions present in the data.
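
The idea behind this search can be sketched with ContextIndex from nltk.text, which groups words by the words that appear immediately before and after them. The three-sentence corpus below is a toy assumption chosen so the result is easy to follow; on a real corpus you would typically call the similar() method of an nltk.Text object, which prints its results instead of returning them.

```python
from nltk.text import ContextIndex

# Toy corpus (an assumption for illustration), already split into tokens.
tokens = (
    "the woman walked over the bridge . "
    "the man walked over the hill . "
    "the girl walked under the bridge ."
).split()

# By default, a word's context is the pair (previous word, next word).
index = ContextIndex(tokens)

# Words that share the context ("the", "walked") with "woman".
similar_to_woman = index.similar_words("woman")
print(similar_to_woman)

# Words that share the context ("walked", "the") with "over".
similar_to_over = index.similar_words("over")
print(similar_to_over)
```

The matches are found from shared contexts, not from the tags themselves, which is why the results are usually, but not always, the same part of speech as the query word.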
