
NLTK Tokenize: How to Tokenize Words and Sentences with NLTK? - Holistic SEO (2023)

To tokenize sentences and words with NLTK, the “nltk.word_tokenize()” and “nltk.sent_tokenize()” functions are used. NLTK tokenization parses a large amount of textual data into parts in order to analyze the character of the text. Tokenization with NLTK can be used for training machine learning models and for Natural Language Processing text cleaning. The tokenized words and sentences can be turned into a data frame and vectorized. Natural Language Tool Kit (NLTK) tokenization involves punctuation cleaning, text cleaning, and vectorization of the parsed text data for better lemmatization and stemming, along with machine learning algorithm training. The Natural Language Tool Kit Python library has a tokenization package called “tokenize” that provides two types of tokenization functions.

- “word_tokenize” is for tokenizing words.
- “sent_tokenize” is for tokenizing sentences.







How to Tokenize Words with Natural Language Tool Kit (NLTK)?

Tokenization of words with NLTK means parsing a text into words via the Natural Language Tool Kit. To tokenize words with NLTK, follow the steps below.

1. Import “word_tokenize” from “nltk.tokenize”.
2. Load the text into a variable.
3. Use the “word_tokenize” function on the variable.
4. Read the tokenization result.

Below, you can see a tokenization example with NLTK for a text.

```python
from nltk.tokenize import word_tokenize

text = "Search engine optimization is the process of improving the quality and quantity of website traffic to a website or a web page from search engines. SEO targets unpaid traffic rather than direct traffic or paid traffic."
print(word_tokenize(text))
```

OUTPUT >>>

```
['Search', 'engine', 'optimization', 'is', 'the', 'process', 'of', 'improving', 'the', 'quality', 'and', 'quantity', 'of', 'website', 'traffic', 'to', 'a', 'website', 'or', 'a', 'web', 'page', 'from', 'search', 'engines', '.', 'SEO', 'targets', 'unpaid', 'traffic', 'rather', 'than', 'direct', 'traffic', 'or', 'paid', 'traffic', '.']
```

The explanation of the tokenization example above can be seen below.

- The first line imports the “word_tokenize” function.
- The second line provides the text data for tokenization.
- The third line prints the output of “word_tokenize”.

What are the advantages of word tokenization with NLTK?

Word tokenization with NLTK combines the benefits of White Space Tokenization, Dictionary-Based Tokenization, Rule-Based Tokenization, Regular Expression Tokenization, Penn Treebank Tokenization, spaCy Tokenization, Moses Tokenization, and Subword Tokenization. Every type of word tokenization is part of the text normalization process, and normalizing the text with stemming and lemmatization improves the accuracy of language understanding algorithms.
The benefits and advantages of word tokenization with NLTK can be found below.

- Removing the stop words easily from the corpora before the tokenization.
- Splitting words into sub-words for understanding the text better.
- Disambiguating the text is faster and requires less coding with NLTK.
- Besides White Space Tokenization, Dictionary-Based and Rule-Based Tokenization can be implemented easily.
- Performing Byte Pair Encoding, WordPiece Encoding, Unigram Language Model, and SentencePiece Encoding is easier with NLTK.
- NLTK has TweetTokenizer for tokenizing tweets that include emojis and other Twitter norms.
- NLTK has PunktSentenceTokenizer, which has a pre-trained model for tokenization in multiple European languages.
- NLTK has a Multi-Word Expression Tokenizer (MWETokenizer) for tokenizing compound expressions such as “in spite of”.
- NLTK has RegexpTokenizer to tokenize sentences based on regular expressions.

How to Tokenize Sentences with Natural Language Tool Kit (NLTK)?

To tokenize sentences with the Natural Language Tool Kit, the steps below should be followed.

1. Import “sent_tokenize” from “nltk.tokenize”.
2. Load the text for sentence tokenization into a variable.
3. Use “sent_tokenize” on the specific variable.
4. Print the output.

Below, you can see an example of NLTK tokenization for sentences.

```python
from nltk.tokenize import sent_tokenize

text = "God is Great! I won a lottery."
print(sent_tokenize(text))
```

Output:

```
['God is Great!', 'I won a lottery.']
```

In the code block above, the text is tokenized into sentences.
Taking all of the sentences into a list with NLTK sentence tokenization makes it possible to see which sentence is connected to which one, the average word count per sentence, and the unique sentence count.

What are the advantages of sentence tokenization with NLTK?

The advantages of sentence tokenization with NLTK are listed below.

- NLTK provides a chance to perform text data mining at the sentence level.
- NLTK sentence tokenization allows comparing different text corpora at the sentence level.
- Sentence tokenization with NLTK shows how many sentences are used in different sources of text, such as websites, books, and papers.
- Thanks to the NLTK “sent_tokenize” function, it is possible to see how sentences are connected to each other, and with what bridge words.
- Via the NLTK sentence tokenizer, performing an overall sentiment analysis of the sentences is possible.
- Performing Semantic Role Labeling on the sentences to understand how they are connected to each other is another benefit of NLTK sentence tokenization.

How to perform Regex Tokenization with NLTK?

Regex tokenization with NLTK performs tokenization based on regex rules, and it can be used for extracting certain phrase patterns from a corpus. To perform regex tokenization with NLTK, the “RegexpTokenizer” class from “nltk.tokenize” should be used. An example of regex tokenization with NLTK is below.

```python
from nltk.tokenize import RegexpTokenizer

regex_tokenizer = RegexpTokenizer(r'\?', gaps=True)
text = "How to perform Regex Tokenization with NLTK? To perform regex tokenization with NLTK, the regex pattern should be chosen."
regex_tokenization = regex_tokenizer.tokenize(text)
print(regex_tokenization)
```

OUTPUT >>>

```
['How to perform Regex Tokenization with NLTK', ' To perform regex tokenization with NLTK, the regex pattern should be chosen.']
```

The regex tokenization example with NLTK demonstrates how to take a question sentence and the sentence after it.
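The same class can also keep the matches themselves as the tokens: with the default “gaps=False”, the pattern describes the tokens rather than the separators. A small sketch, using an illustrative pattern and sentence:

```python
from nltk.tokenize import RegexpTokenizer

# With gaps=False (the default), the pattern matches the tokens themselves,
# so r"\w+" keeps runs of word characters and drops punctuation entirely.
word_pattern_tokenizer = RegexpTokenizer(r"\w+")
regex_word_tokens = word_pattern_tokenizer.tokenize("How to perform Regex Tokenization with NLTK?")
print(regex_word_tokens)
```

This variant is a quick way to get punctuation-free word tokens without a separate cleaning step.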
Taking the sentences that end with a question mark, along with the sentence after each one, makes it possible to match questions with answers or to extract the question formats from a corpus.

How to perform Rule-based Tokenization with NLTK?

Rule-based tokenization is tokenization based on certain rules that are generated for certain conditions. NLTK has three different rule-based tokenization algorithms: TweetTokenizer for Twitter tweets, MWETokenizer for multi-word tokenization, and TreebankWordTokenizer for English-language rules. Rule-based tokenization is helpful for performing tokenization under the best possible conditions for the nature of the textual data.

An example of rule-based tokenization with MWETokenizer for multi-word tokenization can be seen below.

```python
from nltk.tokenize import MWETokenizer, word_tokenize

sentence = "I have sent Steven Nissen to the new reserch center for the nutritional value of the coffee. This sentence will be tokenized while Mr. Steven Nissen is on the journey. The #truth will be learnt. And, it's will be well known thanks to this tokenization example."
tokenizer = MWETokenizer()
tokenizer.add_mwe(("Steven", "Nissen"))
result = tokenizer.tokenize(word_tokenize(sentence))
result
```

OUTPUT >>>

```
['I', 'have', 'sent', 'Steven_Nissen', 'to', 'the', 'new', 'reserch', 'center', 'for', 'the', 'nutritional', 'value', 'of', 'the', 'coffee', '.', 'This', 'sentence', 'will', 'be', 'tokenized', 'while', 'Mr.', 'Steven_Nissen', 'is', 'on', 'the', 'journey', '.', 'The', '#', 'truth', 'will', 'be', 'learnt', '.', 'And', ',', 'it', "'s", 'will', 'be', 'well', 'known', 'thanks', 'to', 'this', 'tokenization', 'example', '.']
```

MWETokenizer joins the registered multi-word expression with an underscore by default, so “Steven Nissen” becomes the single token “Steven_Nissen”.

An example of rule-based tokenization with TreebankWordTokenizer for English-language text can be seen below.

```python
from nltk.tokenize import TreebankWordTokenizer

sentence = "I have sent Steven Nissen to the new reserch center for the nutritional value of the coffee. This sentence will be tokenized while Mr. Steven Nissen is on the journey. The #truth will be learnt. And, it's will be well known thanks to this tokenization example."
tokenizer = TreebankWordTokenizer()
result = tokenizer.tokenize(sentence)
result
```

OUTPUT >>>

```
['I', 'have', 'sent', 'Steven', 'Nissen', 'to', 'the', 'new', 'reserch', 'center', 'for', 'the', 'nutritional', 'value', 'of', 'the', 'coffee.', 'This', 'sentence', 'will', 'be', 'tokenized', 'while', 'Mr.', 'Steven', 'Nissen', 'is', 'on', 'the', 'journey.', 'The', '#', 'truth', 'will', 'be', 'learnt.', 'And', ',', 'it', "'s", 'will', 'be', 'well', 'known', 'thanks', 'to', 'this', 'tokenization', 'example', '.']
```

An example of rule-based tokenization with TweetTokenizer for tokenizing Twitter tweets can be seen below.

```python
from nltk.tokenize import TweetTokenizer

sentence = "I have sent Steven Nissen to the new reserch center for the nutritional value of the coffee. This sentence will be tokenized while Mr. Steven Nissen is on the journey. The #truth will be learnt. And, it's will be well known thanks to this tokenization example."
tokenizer = TweetTokenizer()
result = tokenizer.tokenize(sentence)
result
```

OUTPUT >>>

```
['I', 'have', 'sent', 'Steven', 'Nissen', 'to', 'the', 'new', 'reserch', 'center', 'for', 'the', 'nutritional', 'value', 'of', 'the', 'coffee', '.', 'This', 'sentence', 'will', 'be', 'tokenized', 'while', 'Mr', '.', 'Steven', 'Nissen', 'is', 'on', 'the', 'journey', '.', 'The', '#truth', 'will', 'be', 'learnt', '.', 'And', ',', "it's", 'will', 'be', 'well', 'known', 'thanks', 'to', 'this', 'tokenization', 'example', '.']
```

Note how TweetTokenizer keeps the hashtag “#truth” and the contraction “it's” as single tokens.

The most standard rule-based type of word tokenization is white-space tokenization, which simply takes the spaces between words as token boundaries. White-space tokenization can be performed with the “split(" ")” method as below.

```python
sentence = "I have sent Steven Nissen to the new reserch center for the nutritional value of the coffee. This sentence will be tokenized while Mr. Steven Nissen is on the journey. The #truth will be learnt. And, it's will be well known thanks to this tokenization example."
result = sentence.split(" ")
result
```

OUTPUT >>>

```
['I', 'have', 'sent', 'Steven', 'Nissen', 'to', 'the', 'new', 'reserch', 'center', 'for', 'the', 'nutritional', 'value', 'of', 'the', 'coffee.', 'This', 'sentence', 'will', 'be', 'tokenized', 'while', 'Mr.', 'Steven', 'Nissen', 'is', 'on', 'the', 'journey.', 'The', '#truth', 'will', 'be', 'learnt.', 'And,', "it's", 'will', 'be', 'well', 'known', 'thanks', 'to', 'this', 'tokenization', 'example.']
```

Among the NLTK tokenization methods there are other methodologies as well, such as PunktSentenceTokenizer for detecting sentence boundaries and punctuation-based tokenization for tokenizing punctuation-related words and multi-word units properly.

How to use Lemmatization with NLTK Tokenization?

To use lemmatization with NLTK tokenization, the “nltk.stem.wordnet.WordNetLemmatizer” should be used.
WordNetLemmatizer from NLTK lemmatizes the words within the text. Lemmatization is the process of turning a word into its dictionary form. Unlike stemming, which simply chops off affixes, lemmatization resolves suffixes, prefixes, and morphological changes back to the word’s dictionary form. NLTK lemmatization is useful to see a word’s context and understand which words are actually the same during word tokenization. Below, you will see an example code block for word tokenization and lemmatization with NLTK.

```python
from nltk.stem.wordnet import WordNetLemmatizer

lemmatize = WordNetLemmatizer()
lemmatized_words = []
for w in tokens:
    rootWord = lemmatize.lemmatize(w)
    lemmatized_words.append(rootWord)

counts_lemmatized_words = Counter(lemmatized_words)
df_tokenized_lemmatized_words = pd.DataFrame.from_dict(counts_lemmatized_words, orient="index").reset_index()
df_tokenized_lemmatized_words.sort_values(by=0, ascending=False, inplace=True)
df_tokenized_lemmatized_words[:50]
```

The explanation of the NLTK tokenization and lemmatization example code block is below.

- “nltk.stem.wordnet” is called for importing WordNetLemmatizer.
- The lemmatizer is assigned to the “lemmatize” variable.
- An empty list is created as “lemmatized_words”.
- A for loop lemmatizes every word within the words tokenized with NLTK.
- The lemmatized and tokenized words are appended to the “lemmatized_words” list.
- A Counter object is used for counting them.
- A data frame is created from the lemmatized and tokenized word counts, then sorted and called.

The NLTK tokenization and lemmatization stats will be different from the NLTK tokenization and stemming stats.
These differences reflect their methodological differences in the statistical analysis of tokenized textual data with NLTK.

How to use Stemming with NLTK Tokenization?

To use stemming with NLTK tokenization, “PorterStemmer” should be imported from “nltk.stem”. Stemming reduces words to their stem forms. Stemming can be useful for a better NLTK word tokenization analysis, since many words occur with different suffixes. Via NLTK stemming, the words that come from the same root can be counted as the same. Being able to see which words are used without their suffixes creates a more comprehensive look at the statistical counts of the concepts and phrases within a text. An example of stemming with NLTK tokenization is below.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
stemmed_words = []
for w in tokens:
    rootWord = ps.stem(w)
    stemmed_words.append(rootWord)
```

OUTPUT >>>

```
['think', 'like', 'seo', ',', 'code', 'like', 'develop', 'python', 'seo', 'techseo', 'theoret', 'seo', 'on-pag', 'seo', 'pagespe', 'UX', 'market', 'think', 'like', 'seo', ',', 'code', 'like', 'develop', 'main', ... 'in', 'bulk', 'with', 'python', '.', ...]
```

The stemmed tokens can be counted and turned into a data frame as below.

```python
counts_stemmed_words = Counter(stemmed_words)
df_tokenized_stemmed_words = pd.DataFrame.from_dict(counts_stemmed_words, orient="index").reset_index()
df_tokenized_stemmed_words.sort_values(by=0, ascending=False)
df_tokenized_stemmed_words
```

```
         index      0
0        think    529
1         like   1059
2          seo   5389
3            ,  22564
4         code   1128
...        ...    ...
10342   pixel.      1
10343  success.     1
10344    pages.     1
10345     free…     1
10346   almost.     1

10347 rows × 2 columns
```

How to Tokenize Content of a Website via NLTK?

To tokenize the content of a website with NLTK on the word and sentence level, the steps below should be followed.

1. Crawl the website’s content.
2. Extract the website’s content from the crawl output.
3. Use “word_tokenize” of NLTK for word tokenization.
4. Use “sent_tokenize” of NLTK for sentence tokenization.

Interpreting and comparing the tokenization output of a website provides benefits for the overall evaluation of the website’s content. To perform NLTK tokenization with a website’s content, the Python libraries below should be used.

- Advertools
- Pandas
- NLTK
- Collections
- String

Below, you will see the imports of the necessary libraries and functions, and the full NLTK tokenization pipeline for a website.

```python
import advertools as adv
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from collections import Counter
from nltk.tokenize import RegexpTokenizer
import string

adv.crawl("https://www.holisticseo.digital", "output.jl",
          custom_settings={"LOG_FILE": "output.log", "DOWNLOAD_DELAY": 0.5},
          follow_links=True)
df = pd.read_json("output.jl", lines=True)
for i in df.columns:
    if i.__contains__("text"):
        print(i)
content_of_website = df["body_text"].str.split().explode().str.cat(sep=" ")
tokens = word_tokenize(content_of_website)
tokenized_counts = Counter(tokens)
df_tokenized = pd.DataFrame.from_dict(tokenized_counts, orient="index").reset_index()
df_tokenized.nunique()
df_tokenized
```

To crawl the website’s content for NLTK word and sentence tokenization, the Advertools “crawl” function takes all of the content of the website into a “jl” extension file. Below, you will see an example of crawling a website with Python.

```python
adv.crawl("https://www.holisticseo.digital", "output.jl",
          custom_settings={"LOG_FILE": "output.log", "DOWNLOAD_DELAY": 0.5},
          follow_links=True)
df = pd.read_json("output.jl", lines=True)
```

In the first line, we start the crawling process of the website, while in the second line we read the resulting “output.jl” file. In the third step, the column holding the website’s content, “body_text”, should be found within the data frame. To do that, a for loop filters the data frame columns for the word “text” with the “__contains__” method of Python.

```python
for i in df.columns:
    if i.__contains__("text"):
        print(i)
```

At the next step of NLTK tokenization for website content, the Pandas “str.cat” method unites all of the content pieces across the different web pages.

```python
content_of_website = df["body_text"].str.split().explode().str.cat(sep=" ")
```

Creating the “content_of_website” variable to hold the united content corpus of the website, with the “sep” parameter set to a space, decreases the computation needed. Instead of performing NLTK tokenization for every web page’s content separately and then uniting all of the tokenized outputs, uniting all of the content pieces first and then performing NLTK tokenization on the united corpus is better for time and energy saving.
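The uniting step can be sketched without the crawl, using a toy data frame in place of the real crawl output:

```python
import pandas as pd

# Toy stand-in for the crawl output's "body_text" column.
df = pd.DataFrame({"body_text": ["First page content.", "Second page content."]})

# Split each page into words, explode to one word per row, then join
# everything back into a single space-separated corpus string.
content_of_website = df["body_text"].str.split().explode().str.cat(sep=" ")
print(content_of_website)
```

The split-explode-cat chain also collapses any repeated whitespace inside the pages, so the final corpus is cleanly space-separated.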
At the next step, “word_tokenize” is performed and the output of the tokenization process is assigned to a variable.

```python
tokens = word_tokenize(content_of_website)
```

To see the tokenized words and their counts, the “Counter” from “collections” can be used as below.

```python
tokenized_counts = Counter(tokens)
df_tokenized = pd.DataFrame.from_dict(tokenized_counts, orient="index").reset_index()
```

“tokenized_counts = Counter(tokens)” counts every token occurrence. In the second line, the “from_dict” and “reset_index” methods of Pandas turn the counts into a data frame. Thanks to NLTK tokenization, the unique word count of a website can be found as below.

```python
df_tokenized.nunique()
```

OUTPUT >>>

```
index    18056
0          471
dtype: int64
```

The “holisticseo.digital” website has 18056 unique words within its content. These words can be seen below.

```python
df_tokenized.sort_values(by=0, ascending=False, inplace=True)
df_tokenized
```

The table of the word tokenization output is below.

```
                index      0
23                the  31354
26                  .  23012
3                   ,  22564
36                and  12812
22                 of  12349
...               ...    ...
14747             NEL      1
14748             CSE      1
14749  recipe-related      1
17753            Plan      1
18055         almost.      1

18056 rows × 2 columns
```

After the word tokenization with NLTK for website content, we see that the words from the header and footer appear more often, along with the stop words.
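The Counter-to-data-frame step can be reproduced on a toy token list, independent of the crawl:

```python
import pandas as pd
from collections import Counter

tokens = ["seo", "content", "seo", ".", "google", "seo"]  # toy token list
tokenized_counts = Counter(tokens)
df_tokenized = pd.DataFrame.from_dict(tokenized_counts, orient="index").reset_index()
df_tokenized.sort_values(by=0, ascending=False, inplace=True)
print(df_tokenized)
```

With orient="index", the Counter keys become the data frame index, and “reset_index” moves them into an “index” column next to the count column labeled 0; sorting by that column puts the most frequent token first.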
The counted word tokenization can be visualized as below.

```python
df_tokenized.sort_values(by=0, ascending=False, inplace=True)
df_tokenized[:30].plot(kind="bar", x="index", orientation="vertical", figsize=(15, 10),
                       xlabel="Tokens", ylabel="Count", colormap="viridis", table=False,
                       grid=True, fontsize=15, rot=35, position=1,
                       title="Token Counts from a Website Content with Punctuation",
                       legend=True).legend(["Tokens"], loc="lower left", prop={"size": 15})
```

To save the word tokenization output’s bar plot as a PNG, you can use the code block below.

```python
df_tokenized[:30].plot(kind="bar", x="index", orientation="vertical", figsize=(15, 10),
                       xlabel="Tokens", ylabel="Count", colormap="viridis", table=False,
                       grid=True, fontsize=15, rot=35, position=1,
                       title="Token Counts from a Website Content with Punctuation",
                       legend=True).legend(["Tokens"], loc="lower left",
                                           prop={"size": 15}).figure.savefig("word-tokenization-2.png")
```

The word “the” appears more than 30000 times, while some punctuation marks are also included within the word tokenization results. It appears that the words “content” and “Google” are the most frequent words besides the punctuation characters and the stop words. To gain more insight from word tokenization, a TF-IDF analysis with Python can help to understand a word’s weight within a corpus. To create better insight for SEO and content analysis via NLTK tokenization, the stop words should be removed.

How to Filter out the Stop Words for Tokenization with NLTK?

To remove the stop words from the NLTK tokenization output, a filtering process should be performed with a list comprehension or a normal for loop.
An example of NLTK tokenization with the stop words removed can be seen below.

```python
from nltk.corpus import stopwords

stop_words_english = set(stopwords.words("english"))
df_tokenized["filtered_tokens"] = pd.Series([w for w in df_tokenized["index"] if not w.lower() in stop_words_english])
```

To filter out the stop words during word tokenization, text cleaning methods should be used. To clean the stop words, the “stopwords.words("english")” list from NLTK can be used. In the code block above, the first line assigns the English stop words to the “stop_words_english” variable. In the second line, we create a new column within the “df_tokenized” data frame using a list comprehension wrapped in “pd.Series”. Basically, every tokenized word is checked against the stop word list, so the “filtered_tokens” column doesn’t include any of the stop words.

How to Count Tokenized Words with NLTK without Stop Words?

To count the tokenized words with NLTK without the stop words, a list comprehension subtracting the stop words should be used over the tokenized output.
To count tokenized words with NLTK after subtracting the English stop words, the “Counter” object is used over the “tokens_without_stop_words” list created with the list comprehension “[word for word in tokens if not word in stopwords.words("english")]”. Below, you can see a code block to count the tokenized words, and its output.

```python
tokenized_counts_without_stop_words = Counter(tokens_without_stop_words)
tokenized_counts_without_stop_words
```

OUTPUT >>>

```
Counter({'Think': 381, 'like': 989, 'SEO': 5172, ',': 22564, 'Code': 467, 'Developer': 405, 'Python': 1583, 'TechSEO': 733, 'Theoretical': 700, 'On-Page': 670, 'PageSpeed': 645, 'UX': 699, 'Marketing': 1086, 'Main': 231, 'Menu': 192, 'X-Default': 232, 'value': 370, 'hreflang': 219, 'attribute': 178, 'link': 509, 'tag': 314, '.': 23012, 'An': 198, 'specify': 42, 'alternate': 109, ... 'plan': 6, 'reward': 3, 'high-quality': 34, 'due': 82, 'back': 49, ...})
```

The next step is creating a data frame from the Counter object via the “from_dict” method of “pd.DataFrame”.

```python
df_tokenized_without_stopwords = pd.DataFrame.from_dict(tokenized_counts_without_stop_words, orient="index").reset_index()
df_tokenized_without_stopwords
```

The table output of the tokenization with NLTK without English stop words is below.

```
         index      0
0        Think    381
1         like    989
2          SEO   5172
3            ,  22564
4         Code    467
...        ...    ...
17910  success.     1
17911    pages.     1
17912  infinitely    1
17913     free…     1
17914   almost.     1

17915 rows × 2 columns
```

Even if the stop words are removed from the text, the punctuation marks still exist. To clean the textual data completely for a healthier word tokenization process with NLTK, the punctuation should be cleaned as well. Below, you will see the sorted version of the word tokenization with NLTK output without stop words.

```python
df_tokenized_without_stopwords.sort_values(by=0, ascending=False, inplace=True)
df_tokenized_without_stopwords
```

The table output of the word tokenization with NLTK without stop words and with sorted values is below.

```
               index      0
21                 .  23012
3                  ,  22564
298                ”   5639
296                “   5623
2                SEO   5172
...              ...    ...
12618           gzip      1
6968          exited      1
6969          seduce      1
6970   collaborating      1
17914        almost.      1

17915 rows × 2 columns
```

How to visualize the Word Tokenization with NLTK without the Stop Words?

To visualize NLTK word tokenization without the stop words, the “plot” method of the Pandas Python library should be used.
Below, you can see an example visualization of word tokenization with NLTK without stop words.

```python
df_tokenized_without_stopwords[:30].plot(kind="bar", x="index", orientation="vertical",
    figsize=(15, 10), xlabel="Tokens", ylabel="Count", colormap="viridis", table=False,
    grid=True, fontsize=15, rot=35, position=1,
    title="Token Counts from a Website Content with Punctuation",
    legend=True).legend(["Tokens"], loc="lower left", prop={"size": 15})
```

The effect of the punctuation marks is even more evident within the visualization of word tokenization without stop words.

How to Calculate the Effect of the Stop Words on the Length of the Corpora?

The corpora length represents the total size of the textual data. To calculate the effect of the stop words on the length of the corpora, the stop word count should be subtracted from the total count. Below, you can see an example calculation.

```python
tokens_without_stop_words = [word for word in tokens if not word in stopwords.words("english")]
print(len(content_of_website))
len(content_of_website) - len(tokens_without_stop_words)
```

OUTPUT >>>

```
307199
268147
```

The total length of the website’s content is 307199, and subtracting the count of the tokens that are not stop words leaves 268147. And 18056 of the words are unique. The unique word count can demonstrate a website’s potential query count, since every different word is a representational score for relevance to a topic or concept. Every unique word and n-gram gives a better chance to be relevant to a concept or phrase in the search bar.

How to Remove the Punctuation from Word Tokenization with NLTK?

To remove the punctuation from word tokenization with NLTK, the “isalnum()” method should be used with a list comprehension.
In this NLTK word tokenization tutorial, the “isalnum()” method is used to create the “content_of_website_removed_punct” variable as below.

```python
content_of_website_removed_punct = [word for word in tokens if word.isalnum()]
content_of_website_removed_punct
```

OUTPUT >>>

```
['Think', 'like', 'SEO', 'Code', 'like', 'Developer', 'Python', 'SEO', 'TechSEO', 'Theoretical', 'SEO', 'SEO', 'PageSpeed', 'UX', 'Marketing', 'Think', 'like', 'SEO', 'Code', 'like', 'Developer', 'Main', 'Menu', 'is', 'a', ... 'server', 'needs', 'to', 'return', '304', ...]
```

As you can see, all of the punctuation marks are removed from the tokenized words. The next step is using the Counter object to create a data frame, so that the tokenized output can be used for analysis and machine learning.

```python
content_of_website_removed_punct_counts = Counter(content_of_website_removed_punct)
content_of_website_removed_punct_counts
```

OUTPUT >>>

```
Counter({'Think': 381, 'like': 989, 'SEO': 5172, 'Code': 467, 'Developer': 405, 'Python': 1583, 'TechSEO': 733, 'Theoretical': 700, 'PageSpeed': 645, 'UX': 699, 'Marketing': 1086, 'Main': 231, 'Menu': 192, 'is': 8965, 'a': 11483, 'value': 370, 'for': 8377, 'hreflang': 219, 'attribute': 178, 'of': 12349, 'the': 31354, 'link': 509, 'tag': 314, 'An': 198, 'can': 5065, ... 'natural': 56, 'Due': 20, 'inaccuracies': 1, 'calculation': 63, 'always': 241, ...})
```

The Counter object has been created for the NLTK word tokenization output without punctuation.
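The “isalnum()” filter behaves the same on any token list; a self-contained sketch on toy tokens:

```python
# str.isalnum() is True only for tokens made entirely of letters and digits,
# so the standalone punctuation tokens produced by word_tokenize are dropped.
tokens = ["Think", "like", "SEO", ",", "Code", ".", "304"]
removed_punct = [w for w in tokens if w.isalnum()]
print(removed_punct)
```

Note that this also drops hyphenated tokens such as “On-Page”, because the hyphen is neither a letter nor a digit; that is why “On-Page” disappears between the two Counter outputs above.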
Below, you will see the creation of a data frame via “from_dict” for the result of NLTK word tokenization without punctuation.

```python
content_of_website_removed_punct_counts_df = pd.DataFrame.from_dict(content_of_website_removed_punct_counts, orient="index").reset_index()
content_of_website_removed_punct_counts_df.sort_values(by=0, ascending=False, inplace=True)
content_of_website_removed_punct_counts_df
```

The table output of the NLTK word tokenization with the punctuation removed is below.

```
            index      0
20            the  31354
32            and  12812
19             of  12349
14              a  11483
44             to  11063
...           ...    ...
11303      ground      1
6178       Zurich      1
6181         Visa      1
9184      omitted      1
14783  infinitely      1

14784 rows × 2 columns
```

How to visualize the NLTK Word Tokenization result without punctuation?

To visualize the NLTK word tokenization result without punctuation, the “plot” method of Pandas should be used. An example visualization of the NLTK word tokenization without punctuation can be seen below.

```python
content_of_website_removed_punct_counts_df[:30].plot(kind="bar", x="index",
    orientation="vertical", figsize=(15, 10), xlabel="Tokens", ylabel="Count",
    colormap="viridis", table=False, grid=True, fontsize=15, rot=35, position=1,
    title="Token Counts from a Website Content without Punctuation",
    legend=True).legend(["Tokens"], loc="lower left", prop={"size": 15})
```

How to Remove stop words and punctuation from text data for a better NLTK Word Tokenization?

To remove the stop words and punctuation marks in order to have a better NLTK word tokenization result, a list comprehension with “if” conditions should be used. Multiple conditions in a single list comprehension provide a faster text cleaning process for the removal of punctuation and stop words.
An example of the cleaning of punctuation and stop words can be seen below.

```python
content_of_website_filtered_stopwords_and_punctuation = [w for w in tokens if not w in set(stopwords.words("english")) if w.isalnum()]
content_of_website_filtered_stopwords_and_punctuation_counts = Counter(content_of_website_filtered_stopwords_and_punctuation)
content_of_website_filtered_stopwords_and_punctuation_counts_df = pd.DataFrame.from_dict(content_of_website_filtered_stopwords_and_punctuation_counts, orient="index").reset_index()
content_of_website_filtered_stopwords_and_punctuation_counts_df.sort_values(by=0, ascending=False, inplace=True)
content_of_website_filtered_stopwords_and_punctuation_counts_df[:30].plot(kind="bar", x="index",
    orientation="vertical", figsize=(15, 10), xlabel="Tokens", ylabel="Count",
    colormap="viridis", table=False, grid=True, fontsize=15, rot=35, position=1,
    title="Token Counts from a Website Content without Stop Words and Punctuation",
    legend=True).legend(["Tokens"], loc="lower left", prop={"size": 15})
```

The explanation of the removal of the punctuation and the English stop words from the tokenized words via the NLTK code block is below.

- Remove the stop words and the punctuation via “stopwords.words("english")” and “isalnum()”.
- Count the remaining words with the “Counter” class from the Collections built-in Python module.
- Create the data frame with the “from_dict” method of “pd.DataFrame”.
- Sort the data frame by the counts in descending order and plot the first 30 tokens with the “kind”, “x”, “figsize”, and related parameters of the “plot” method.
 


"NLTK Tokenize: How to Tokenize Words and Sentences with NLTK?" was written by Mary under the News category and was last updated on 30 January 2023.