Short note on Stop Word Removal
Stop-Word Removal
Stop-word removal is the process of removing stop words from the list of tokens or stemmed words. Stop words are grammatical words that are irrelevant to the content of a text, so removing them makes processing more efficient. The stop-word list is loaded from a file, and any word registered in the list is removed. Stemming and stop-word removal may be swapped, so that stop words are removed before the tokens are stemmed. In this subsection, we describe stop words and stop-word removal in detail.
A stop word is a word that functions only grammatically and is irrelevant to the content of the given text. Prepositions such as "in," "on," and "to" typically belong to the stop-word group. Conjunctions such as "and," "or," "but," and "however" also belong to the group. The definite article "the" and the indefinite articles "a" and "an" are also among the most frequent stop words. Stop words occur frequently in every text in a collection, so removing them greatly improves the efficiency of text processing.
Let us explain the process of removing stop words. The stop-word list is prepared as a file and loaded from it. Each word that is registered in the list is removed. The words that remain after stop-word removal are usually nouns, verbs, and adjectives. Instead of loading the stop-word list from a file, we may consider using a classifier that decides whether each word is a stop word.
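The list-lookup procedure described above can be sketched in a few lines of Python. This is only a minimal illustration: the function names, the sample stop-word list, and the sample tokens are assumptions made for the example, and lower-casing tokens before the lookup is a design choice the text does not prescribe.

```python
def load_stop_words(lines):
    """Build a stop-word set from an iterable of lines, one word per line.

    In practice `lines` could be an open file; an in-memory list is used
    below so that the sketch runs without any external file.
    """
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not registered in the stop-word set."""
    return [t for t in tokens if t.lower() not in stop_words]


# Hypothetical stop-word list built from the examples given in this note.
stop_word_lines = ["in", "on", "to", "and", "or", "but", "however",
                   "the", "a", "an"]
stop_words = load_stop_words(stop_word_lines)

# Hypothetical token list produced by an earlier tokenization step.
tokens = ["The", "cat", "sat", "on", "a", "mat", "and", "slept"]
filtered = remove_stop_words(tokens, stop_words)
print(filtered)
```

Storing the list as a set keeps each lookup at roughly constant time, which matters because the check runs once per token.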
Some of the remaining nouns, verbs, and adjectives may occur in other texts as well as in the current one. Such words are called common words; they are not useful for identifying the content of a text, so they need to be removed like stop words. The TF-IDF (Term Frequency-Inverse Document Frequency) weight is the criterion for deciding whether each word is a common word. The words that remain after stop-word removal are useful in a news article collection, but not all of them are useful in technical or medical text collections. We may consider further removal depending on text length; it is applied to long texts.
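As a rough illustration of using the TF-IDF weight to filter common words, the sketch below computes the weight as the raw term count multiplied by log(N / df), where N is the number of texts and df is the number of texts containing the word. The note does not specify the TF-IDF variant or the cut-off, so this formula, the `threshold` parameter, and the sample texts are all assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: weight} dict per text, using tf * log(N / df)."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each word once per text
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights


def remove_common_words(docs, threshold=0.0):
    """Drop words whose TF-IDF weight is not above the assumed threshold."""
    weights = tf_idf(docs)
    return [[w for w in doc if weights[i][w] > threshold]
            for i, doc in enumerate(docs)]


# Hypothetical medical collection, already cleared of stop words.
docs = [
    ["patient", "treatment", "heart", "surgery"],
    ["patient", "treatment", "lung", "infection"],
    ["patient", "treatment", "skin", "allergy"],
]
cleaned = remove_common_words(docs)
print(cleaned)
```

With this variant, a word that appears in every text receives a weight of zero, so "patient" and "treatment" are treated as common words in this small medical collection and removed.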