Short note on Text mining/Mining Text Data.
Mining Text Databases/Text mining
- Text mining is an interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics, and computational linguistics. It is an active research area whose main goal is to derive high-quality information from text. Text databases consist of large collections of documents gathered from sources such as news articles, books, digital libraries, e-mail messages, and web pages.
- High-quality information is typically derived by discovering patterns and trends through techniques such as statistical pattern learning, topic modeling, and statistical language modeling. Text mining usually requires structuring the input text (e.g., parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database). This is followed by deriving patterns within the structured data, and by evaluating and interpreting the output. "High quality" in text mining usually refers to a combination of relevance, novelty, and interestingness.
- Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity-relation modeling (i.e., learning relations between named entities).
- Other examples include multilingual data mining, multidimensional text analysis, contextual text mining, and trust and evolution analysis in text data, as well as text mining applications in security, biomedical literature analysis, online media analysis, and analytical customer relationship management.
- Various kinds of text mining and analysis software and tools are available in academic institutions, open-source forums, and industry. Text mining often also uses WordNet, Semantic Web, Wikipedia, and other information sources to enhance the understanding and mining of text data.
- As the amount of available information increases, text databases are growing rapidly. In many text databases, the data is semi-structured. For example, a document may contain a few structured fields, such as title, author, and publishing date. Along with this structured data, the document also contains unstructured text components, such as the abstract and contents. Without knowing what could be in the documents, it is difficult to formulate effective queries for analyzing and extracting useful information from the data. Users need tools to compare documents and rank their importance and relevance. Therefore, text mining has become a popular and essential theme in data mining.
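As a rough illustration of ranking semi-structured documents by relevance, the sketch below scores the unstructured abstract of each document against a query using a simple TF-IDF weighting. The sample documents, the field names, and the `tokenize` and `rank` helpers are assumptions made for this example, and TF-IDF is only one of many possible ranking schemes.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [t for t in tokens if t]

def rank(query, docs, field="abstract"):
    """Return document titles ordered by a simple TF-IDF score for the query."""
    texts = [tokenize(d[field]) for d in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in texts:
        doc_freq.update(set(tokens))

    scored = []
    for doc, tokens in zip(docs, texts):
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if tokens and doc_freq[term]:
                # Rare terms get a higher weight; +1.0 keeps common terms non-zero.
                idf = math.log(n_docs / doc_freq[term]) + 1.0
                score += counts[term] / len(tokens) * idf
        scored.append((score, doc["title"]))

    # Highest score first; ties keep their input order.
    scored.sort(key=lambda pair: -pair[0])
    return [title for _, title in scored]

# Hypothetical semi-structured documents: structured fields plus free text.
docs = [
    {"title": "Intro to text mining", "author": "A",
     "abstract": "Text mining derives patterns from text documents."},
    {"title": "Cricket report", "author": "B",
     "abstract": "The cricket match ended with a century."},
    {"title": "Clustering notes", "author": "C",
     "abstract": "Clustering groups similar text documents together."},
]

ranked = rank("text clustering", docs)
print(ranked)
```

With these sample documents, "Clustering notes" ranks first because it contains both query terms, and "Cricket report" ranks last because it contains neither.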
OR,
Text mining is the procedure of synthesizing information by analyzing relations, patterns, and rules in textual data. It includes text summarization, text categorization, and text clustering.
1. Text summarization is the procedure of automatically extracting the part of a text that reflects its whole content.
2. Text categorization is the procedure of assigning a text to one of the categories predefined by users.
3. Text clustering is the procedure of segmenting texts into several clusters, based on how closely their content is related.
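To make the first of these procedures concrete, here is a minimal sketch of extractive summarization that keeps the sentences whose words are most frequent in the whole text. The stopword list, the scoring rule, and the `summarize` helper are assumptions chosen for illustration, not a standard algorithm from any particular library.

```python
import re
from collections import Counter

# A tiny, hand-picked stopword list (an assumption for this example).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are",
             "it", "on", "from", "by", "with", "for"}

def content_words(text):
    """Return lowercase words with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def summarize(text, n_sentences=1):
    """Pick the n sentences with the highest average word frequency."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Return the chosen sentences in their original order.
    return [s for s in sentences if s in top]

text = ("Text mining finds patterns in text. "
        "The weather was pleasant. "
        "Mining text helps users rank documents.")

summary = summarize(text)
print(summary)
```

For this sample text the first sentence is selected, since "text" and "mining" are the most frequent content words. Categorization and clustering can be built on similar word-frequency features.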
Text Representation Issues
- Each word has a dictionary meaning or meanings
– Run – (1) The verb. (2) The noun, in cricket.
– Cricket – (1) The game. (2) The insect.
– Apple (the company) or apple (the fruit)
- Ambiguity and context-sensitivity - Each word is used in various “senses”
– Tendulkar made 100 runs
– Because of an injury, Tendulkar cannot run and will need a runner between the wickets
- Capturing the “meaning” of sentences is an important issue as well. Grammar, parts of speech, and time sense could be easy!
- Order of words in the query
– hot dog stand in the amusement park
– hot amusement stand in the dog park
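The word-order issue can be seen in a small sketch: under a bag-of-words representation the two queries above are identical, while a representation based on adjacent word pairs (bigrams) tells them apart. The helper names `bag_of_words` and `bigrams` are made up for this example.

```python
from collections import Counter

def bag_of_words(text):
    """Count each lowercase token; word order is discarded."""
    return Counter(text.lower().split())

def bigrams(text):
    """Return the set of adjacent word pairs, which keeps local order."""
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))

q1 = "hot dog stand in the amusement park"
q2 = "hot amusement stand in the dog park"

# Both queries contain exactly the same words.
same_bag = bag_of_words(q1) == bag_of_words(q2)

# Only the pairs around "stand in the" survive the reordering.
shared_bigrams = bigrams(q1) & bigrams(q2)

print(same_bag)        # True
print(shared_bigrams)  # the pairs ('stand', 'in') and ('in', 'the')
```

A retrieval system that relies only on word counts would treat the two queries as the same request, which is why phrase or n-gram features are often added.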