Introduction to NLTK
NLTK stands for Natural Language Toolkit. It is a leading platform for building Python programs that work with human language data. It consists of several packages that help machines process human language and respond to it appropriately.
It provides easy-to-use interfaces to over 50 corpora and lexical resources, along with text processing libraries for classification, tokenization, tagging, stemming, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries. Data for text analytics can come from many applications on the internet. By analyzing tweets on Twitter, we can find trending news and people’s reactions to a particular event. Amazon can understand user feedback or reviews of a specific product. BookMyShow can discover people’s reviews of a movie, which can be either positive or negative. YouTube can likewise analyze and understand people’s viewpoints on a video.
Getting started with NLTK
Download NLTK - install the library itself with pip install nltk, then download its data packages (corpora and pretrained models) with nltk.download()
Import NLTK
You can import NLTK directly with import nltk.
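The setup steps can be sketched as follows. This is a minimal example, assuming NLTK was installed from the shell with pip install nltk. The RESOURCES list and the fetch_resources helper are illustrative names chosen for this article, not part of NLTK. Newer NLTK releases may ask for differently named packages, in which case the LookupError message names the one to download.

```python
# Install the library itself from a shell first (this is not a Python call):
#   pip install nltk
import nltk

print(nltk.__version__)  # confirms that the import worked

# Data packages assumed for the examples in this article (illustrative list).
RESOURCES = ["punkt", "averaged_perceptron_tagger", "stopwords", "state_union"]


def fetch_resources(names=RESOURCES):
    """Download each data package and report which downloads succeeded."""
    # nltk.download returns True on success and False on failure.
    return {name: nltk.download(name, quiet=True) for name in names}
```

Calling fetch_resources() once is enough; NLTK stores the downloaded data locally and reuses it in later sessions.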
Importing dataset from corpus
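One common way to load a bundled corpus is sketched here. It assumes the state_union corpus, used later in this article, has already been downloaded. The helper names list_speeches and load_speech are hypothetical.

```python
from nltk.corpus import state_union  # lazy loader: data is read on first use


def list_speeches():
    """Return the file identifiers available in the corpus."""
    return state_union.fileids()


def load_speech(file_id):
    """Return the raw text of one speech as a single string."""
    return state_union.raw(file_id)
```

For example, load_speech(list_speeches()[0])[:200] returns the first 200 characters of the first speech. If the corpus has not been downloaded, NLTK raises a LookupError that explains how to fetch it.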
Some simple operations with NLTK
Tokenizing
Tokenizing means breaking a large body of text into smaller units such as sentences or words, which makes the data easier to clean, process, and classify.
Tokenize words
To tokenize words means to break a sentence down into a list of words and punctuation marks.
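A minimal sketch of word tokenization follows. word_tokenize is NLTK's standard entry point and needs the Punkt data package. The fallback to wordpunct_tokenize, a regex-based tokenizer that needs no data, is an assumption added so the example runs even before any data is downloaded. The helper name tokenize_words is illustrative.

```python
from nltk.tokenize import word_tokenize, wordpunct_tokenize


def tokenize_words(text):
    """Split text into word and punctuation tokens."""
    try:
        # Standard tokenizer; needs the Punkt data package to be downloaded.
        return word_tokenize(text)
    except LookupError:
        # Assumed fallback for this sketch: a regex tokenizer that needs no data.
        return wordpunct_tokenize(text)


print(tokenize_words("NLTK makes tokenizing easy."))
# ['NLTK', 'makes', 'tokenizing', 'easy', '.']
```

The two tokenizers agree on simple input like this but can differ on contractions and hyphenated words.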
Tokenize sentences
To tokenize sentences is to break a longer text down into individual sentences, splitting at sentence boundaries such as periods and question marks.
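Sentence tokenization can be sketched in the same way. sent_tokenize uses a pretrained Punkt model and needs the Punkt data package. The fallback to an untrained PunktSentenceTokenizer is an assumption of this sketch, added so it runs without downloaded data. The helper name tokenize_sentences is illustrative.

```python
from nltk.tokenize import sent_tokenize
from nltk.tokenize.punkt import PunktSentenceTokenizer


def tokenize_sentences(text):
    """Split text into a list of sentences."""
    try:
        # Pretrained Punkt model; needs the Punkt data package.
        return sent_tokenize(text)
    except LookupError:
        # Assumed fallback: an untrained Punkt tokenizer with default parameters.
        return PunktSentenceTokenizer().tokenize(text)


print(tokenize_sentences("NLTK is a toolkit. It splits text into sentences."))
# ['NLTK is a toolkit.', 'It splits text into sentences.']
```

The pretrained model handles abbreviations such as "Dr." better than the untrained fallback, which treats nearly every period as a sentence boundary.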
Part of speech (POS)
A part-of-speech tagger (POS tagger) processes a sequence of words and attaches a part-of-speech tag to each word. The text must be tokenized first. (Tokenization is the process of dividing text into smaller pieces called tokens.)
Here’s a list of the tags, what they mean, and some examples:
CC coordinating conjunction
CD cardinal number
DT determiner
EX existential there (like: “there is”… think of it like “there exists”)
FW foreign word
IN preposition/subordinating conjunction
JJ adjective ‘big’
JJR adjective, comparative ‘bigger’
JJS adjective, superlative ‘biggest’
LS list marker 1)
MD modal could, will
NN noun, singular ‘desk’
NNS noun plural ‘desks’
NNP proper noun, singular ‘Harrison’
NNPS proper noun, plural ‘Americans’
PDT predeterminer ‘all the kids’
POS possessive ending parent’s
PRP personal pronoun I, he, she
PRP$ possessive pronoun my, his, hers
RB adverb very, silently
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
TO to go ‘to’ the store
UH interjection errrrrrrrm
VB verb, base form take
VBD verb, past tense took
VBG verb, gerund/present participle taking
VBN verb, past participle taken
VBP verb, sing. present, non-3rd person take
VBZ verb, 3rd person sing. present takes
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-adverb where, when
In this example, we use the state_union corpus (a collection of State of the Union addresses), which is available through NLTK.
After importing the required libraries, you can download the corpus directly with nltk.download().
After downloading the corpus, we select one text for training and another as the sample to tag.
After selecting the data, we train our model on the training text.
Running the model on the sample text produces pairs of each word and its tag.
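The workflow above could look like the following sketch. The article does not say which model is trained. This example assumes it is an unsupervised Punkt sentence tokenizer, a common choice for this kind of pipeline, and that the training and sample texts are the two file IDs shown. Both are assumptions, and the helper names are hypothetical. Running tag_speech requires the state_union corpus, the Punkt data, and the POS tagger model.

```python
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer


def train_sentence_splitter(train_text):
    """Train an unsupervised Punkt sentence tokenizer on raw text."""
    return PunktSentenceTokenizer(train_text)


def tag_speech(train_id="2005-GWBush.txt", sample_id="2006-GWBush.txt", limit=5):
    """Tag the first `limit` sentences of the sample speech."""
    train_text = state_union.raw(train_id)    # text the tokenizer learns from
    sample_text = state_union.raw(sample_id)  # text that actually gets tagged
    splitter = train_sentence_splitter(train_text)

    tagged = []
    for sentence in splitter.tokenize(sample_text)[:limit]:
        words = nltk.word_tokenize(sentence)
        tagged.append(nltk.pos_tag(words))  # list of (word, tag) tuples
    return tagged
```

Each element of the returned list corresponds to one sentence and holds (word, tag) tuples, with tags drawn from the list above.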
Now we will learn how to apply part-of-speech tagging to our own text.
For this, we need to import nltk and use stopwords to clean our data.
After importing the required libraries, we can supply our own text. We use the following passage: “NLTK stands for Natural Language Toolkit. This is a suite of libraries and programs for symbolic and statistical NLP for English. It ships with graphical demonstrations and sample data. First getting to see the light in 2001, NLTK hopes to support research and teaching in NLP and other areas closely related. These include Artificial intelligence, empirical linguistics, cognitive science, information retrieval, and Machine Learning.”
The output is a list in which each element pairs a word with its tag.
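A minimal sketch of tagging your own text is shown here, assuming the stopwords corpus, the Punkt data, and the POS tagger model have been downloaded. The helper names remove_stopwords and tag_text are illustrative. The stop-word filter compares lowercase forms, which is a design choice of this sketch.

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize


def remove_stopwords(tokens, stop_words):
    """Drop tokens whose lowercase form is in the stop-word set."""
    return [t for t in tokens if t.lower() not in stop_words]


def tag_text(text, language="english"):
    """Return one list of (word, tag) tuples per sentence."""
    stop_words = set(stopwords.words(language))
    tagged = []
    for sentence in sent_tokenize(text):
        words = remove_stopwords(word_tokenize(sentence), stop_words)
        tagged.append(nltk.pos_tag(words))
    return tagged
```

Calling tag_text("NLTK stands for Natural Language Toolkit.") returns a list with one inner list of (word, tag) tuples. The exact tags depend on the tagger model that is installed.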
You can also find similar words, i.e., words that appear in similar contexts and therefore tend to share a part of speech. For example, “woman” is a noun. If we search for “woman,” the results are mostly nouns.
Similarly, “over” is a preposition. If we search for “over,” the results are mostly prepositions.
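The search can be sketched with nltk.Text and its similar() method, which prints words that occur in the same contexts as the query word. The token list below is a tiny made-up example so that the code runs without any downloaded data. A large corpus gives far more meaningful results.

```python
import nltk

# Tiny illustrative token list (made up for this sketch).
tokens = (
    "the woman walked home . the man walked home . "
    "the dog ran over the hill . the cat ran under the hill ."
).split()

text = nltk.Text(tokens)

text.similar("woman")  # prints: man
text.similar("over")   # prints: under
```

Here "man" matches "woman" because both appear between "the" and "walked", and "under" matches "over" because both appear between "ran" and "the".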