In this article, you will explore various tokenization methods provided by the Natural Language Toolkit (NLTK) to process and analyse textual data. Tokenization is a crucial preprocessing step in Natural Language Processing (NLP), where text is broken down into smaller units, such as sentences or words.
Tasks
1. Open a Jupyter notebook.
2. Start by installing the NLTK library.
!pip install nltk
3. Create a corpus: a simple multi-line string containing two sentences.
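The original corpus is not reproduced in the text, so the string below is a hypothetical stand-in: two sentences on separate lines, with a possessive (run's) and some punctuation for the later steps to work on.

```python
# Hypothetical sample corpus (an assumption; the article's original text is not shown).
# The later examples assume a straight ASCII apostrophe in "run's".
corpus = """Hello everyone, welcome to the tokenization tutorial!
The first run's output looks good."""

print(corpus)
```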
4. Import the NLTK library along with the pre-trained models for tokenization.
Import the Natural Language Toolkit (NLTK) library and download the punkt package, which contains pre-trained models for tokenizing English text.
Then import the sent_tokenize function to see how to convert the corpus (a paragraph) into sentences.
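A minimal sketch of this step might look like the following. It assumes network access for the download; requesting both `punkt` and `punkt_tab` is an assumption made here because recent NLTK releases look for the latter resource name.

```python
import nltk

# Download the Punkt sentence-tokenizer data. Recent NLTK releases look for
# "punkt_tab" while older ones use "punkt", so this sketch requests both.
for resource in ("punkt", "punkt_tab"):
    nltk.download(resource, quiet=True)

# sent_tokenize splits a paragraph into a list of sentence strings.
from nltk.tokenize import sent_tokenize
```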
5. Sentence Tokenization
This tokenizer uses the pre-trained Punkt model to identify sentence boundaries at punctuation such as periods (.), exclamation marks (!), and question marks (?).
6. Word tokenization, where each word in the corpus is treated as a token.
This tokenizer splits the text based on spaces and punctuation and preserves punctuation as separate tokens.
7. Using the wordpunct_tokenize() method
Here, wordpunct_tokenize splits punctuation into separate tokens and does not treat possessives or contractions such as run's as a unit: it produces ['run', "'", 's']. Punctuation marks such as ', !, and . each become standalone tokens.
By contrast, word_tokenize:
- Handles possessives and contractions such as run's by splitting them into ['run', "'s"].
- Treats punctuation (!, ., ,) as separate tokens, but first segments the text into sentences so that sentence-final periods are split off correctly.
- Typically keeps the period attached to abbreviations it recognises (for example, Mr.) instead of splitting it off.
8. Using the TreebankWordTokenizer: It uses regular expressions to tokenize text, assuming that the text has already been segmented into sentences.
This tokenizer is particularly adept at handling English contractions and punctuation, ensuring that words like “don’t” are correctly split into “do” and “n’t”.
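The sketch below shows both points on hypothetical sample sentences: the contraction handling, and what happens when the input has not been split into sentences first.

```python
from nltk.tokenize import TreebankWordTokenizer

# The Treebank tokenizer is regex-based and needs no downloaded data.
tokenizer = TreebankWordTokenizer()

# Hypothetical single sentence containing a contraction and a possessive.
sentence = "Don't worry, the first run's output looks good."
tokens = tokenizer.tokenize(sentence)
print(tokens)
# ['Do', "n't", 'worry', ',', 'the', 'first', 'run', "'s", 'output', 'looks', 'good', '.']

# Unsegmented two-sentence input: only the final period is split off,
# while the period inside the text stays attached to "finished.".
unsegmented = tokenizer.tokenize("The run finished. The output looks good.")
print(unsegmented)
# ['The', 'run', 'finished.', 'The', 'output', 'looks', 'good', '.']
```

This is why the tokenizer is usually applied to one sentence at a time, for example to each item returned by sent_tokenize.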
You have now explored several tokenization techniques. In summary:
- For sentence splitting: Use sent_tokenize.
- For word tokenization with context awareness: Use word_tokenize.
- For simple word and punctuation separation: Use wordpunct_tokenize.
- For linguistically accurate tasks (e.g., parsing): Use TreebankWordTokenizer.