
Short note on Stop Word Removal

Stop-Word Removal

Stop-word removal refers to the process of removing stop words from the list of tokens or stemmed words. Stop words are grammatical words that are irrelevant to the text content, so removing them improves efficiency. The stop-word list is loaded from a file, and any token registered in the list is removed. The stemming and stop-word removal steps may be swapped, so that stop words are removed before stemming the tokens. Therefore, in this subsection, we provide a detailed description of stop words and stop-word removal. A stop word is a word that functions only grammatically and is irrelevant to the given text content. Prepositions such as "in," "on," and "to" typically belong to the stop-word group. Conjunctions such as "and," "or," "but," and "however" also belong to the group, as do the definite article, "the," and the indefinite articles, "a" and "an."
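The step described above can be sketched in a few lines of Python. The stop-word list here is a small illustrative sample (in practice it would be loaded from a file, as the note says), not any standard list:

```python
# Illustrative stop-word list; real systems load a larger list from a file.
STOP_WORDS = {"a", "an", "the", "in", "on", "to", "and", "or",
              "but", "however", "is", "are", "was", "at"}

def remove_stop_words(tokens):
    """Drop any token that is registered in the stop-word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["The", "stars", "are", "twinkling", "at", "night"]
print(remove_stop_words(tokens))  # grammatical words are filtered out
```

Only the content-bearing words survive the filter; the grammatical words "The", "are", and "at" are removed.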

What do you mean by spatial data warehouse, spatial data cube, and spatial data warehouse for data mining?

SPATIAL DATA WAREHOUSE

A spatial data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of both stable spatial and nonspatial data in support of spatial data mining and spatial-data-related decision-making processes. As a kind of data warehouse, it integrates metadata, thematic data, data engines, and data tools (extraction, transformation, load, storage, analysis, indexing, mining, etc.). When distinguishing spatial data warehouses from spatial databases, the spatial and temporal dimensions are fundamental. A spatial data warehouse is a tool to effectively manage spatial data and distribute it to the public, enhancing spatial-data-referenced decision-making.

Spatial data cube

A spatial data cube organizes multi-dimensional data from different fields (i.e., geospatial data and other thematic data from a variety of spheres) into an accessible data cube or hypercube according to dimensions, using three or more dimensions to describe an object.

Write short notes: Real World Application of Spatial Mining

  Real-world Applications Spatial data mining is a technique that supports decision-making with geo-referenced datasets. It can provide decision-makers with valuable knowledge by guided data interpretation, database reconstruction, information retrieval, relationship discovery, new entity discovery, and spatiotemporal optimization. 

Explain the challenges of mining the WWW.

Challenges in Web Mining

The web poses great challenges for resource and knowledge discovery, based on the following observations:

The web is too huge: The size of the web is very large and rapidly increasing. It seems that the web is too huge for data warehousing and data mining.

The complexity of web pages: Web pages do not have a unifying structure. They are very complex compared to traditional text documents. There is a huge number of documents in the digital library of the web, and these libraries are not arranged in any particular order.

The web is a dynamic information source: Information on the web is rapidly updated. Data such as news, stock markets, weather, sports, shopping, etc., are updated regularly.

Diversity of user communities: The user community on the web is rapidly expanding. These users have different backgrounds, interests, and usage purposes. More than 100 million workstations are connected to the internet, and the number is still rapidly growing.

What do you mean by web mining? Discuss the types of web mining.

WEB MINING

Web mining aims to discover useful information or knowledge from the Web hyperlink structure, page content, and usage data. Although Web mining uses many data mining techniques, it is not purely an application of traditional data mining, due to the heterogeneity and semi-structured or unstructured nature of Web data. Many new mining tasks and algorithms have been invented in the past decade. Based on the primary kinds of data used in the mining process, Web mining tasks can be categorized into three types: Web content mining, Web structure mining, and Web usage mining.

Web Content Mining: Web content mining extracts or mines useful information or knowledge from Web page content. It focuses on the content of the Web pages rather than the links. For example, we can automatically classify and cluster Web pages according to their topics. Web content is a very rich information resource consisting of many types of information, for example, unstructured text.

Explain the role of HITS in web mining.

The HITS Algorithm

The HITS algorithm is based on the idea that if the creator of page p provides a link to page q, then p confers some authority on page q. One must, however, realize that not all links confer authority, since some may be paid advertisements and others may be for navigational purposes, for example, links to the homepage of a large website. In searching, a very large number of items are often retrieved. One way to order these items may be by the number of in-links, but it has been found that this approach does not always work. We present some definitions before describing the HITS algorithm. As noted earlier, a graph may be defined as a set of vertices and edges (V, E). The vertices correspond to the pages and the edges correspond to the links. A directed edge (p, q) indicates a link from page p to page q. The HITS algorithm has two major steps:

1. Sampling step: Collect a set of relevant Web pages for a given topic.
2. Iterative step: Find hubs and authorities using the information collected in the sampling step.
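The iterative step can be sketched as follows. The graph is a tiny invented example; the update rules (a page's authority is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to) follow the description above:

```python
# Tiny directed graph: edge (p, q) means page p links to page q (invented example).
edges = [("p1", "p2"), ("p1", "p3"), ("p2", "p3"), ("p3", "p1")]
nodes = {"p1", "p2", "p3"}

hub = {n: 1.0 for n in nodes}
auth = {n: 1.0 for n in nodes}

for _ in range(50):  # iterate until the scores (approximately) converge
    # Authority of q: sum of hub scores of pages that link to q.
    auth = {n: sum(hub[p] for p, q in edges if q == n) for n in nodes}
    # Hub of p: sum of authority scores of pages that p links to.
    hub = {n: sum(auth[q] for p, q in edges if p == n) for n in nodes}
    # Normalize so the scores do not grow without bound.
    a_norm = sum(v * v for v in auth.values()) ** 0.5
    h_norm = sum(v * v for v in hub.values()) ** 0.5
    auth = {n: v / a_norm for n, v in auth.items()}
    hub = {n: v / h_norm for n, v in hub.items()}

print(max(auth, key=auth.get))  # the page with the highest authority score
```

In this example p3 is pointed to by both p1 and p2, so it ends up with the highest authority score, while p1, which links to the most good authorities, gets the highest hub score.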

Explain how tokenization and stemming are done in text mining.

Tokenization

Tokenization is defined as the process of segmenting a text into tokens by white space or punctuation marks. Tokenization can be applied to source code in C, C++, and Java, as well as to texts written in a natural language; however, the scope is restricted to text only in this study. Morphological analysis is required for tokenizing texts written in oriental languages such as Chinese, Japanese, and Korean, so here, omitting the morphological analysis, we explain the process of tokenizing texts written in English. The functional view of tokenization is illustrated in figure 9.3. A text is given as the input, and the list of tokens is generated as the output. The text is segmented into tokens by white space or punctuation marks, and the list of tokens is passed to the subsequent processes, stemming and stop-word removal.
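A minimal sketch of both steps in Python. The tokenizer splits on anything that is not a word character, and the stemmer is a deliberately crude suffix-stripper for illustration only (not the Porter algorithm a real system would use):

```python
import re

def tokenize(text):
    """Segment text into tokens by white space and punctuation marks."""
    return re.findall(r"[A-Za-z0-9']+", text)

def crude_stem(token):
    """Crude suffix stripping, for illustration; real systems use e.g. Porter stemming."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

text = "The stars are twinkling at night."
tokens = tokenize(text)
stems = [crude_stem(t.lower()) for t in tokens]
print(tokens)
print(stems)
```

Here "twinkling" is reduced to the stem "twinkl" and "stars" to "star", while short words such as "are" are left untouched by the length guard.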

What are the typical information extraction applications? Discuss.

Information extraction is the process of extracting information from unstructured textual sources to enable finding entities, as well as classifying and storing them in a database. Semantically enhanced information extraction (also known as semantic annotation) couples those entities with their semantic descriptions and connections from a knowledge graph. By adding metadata to the extracted concepts, this technology solves many challenges in enterprise content management and knowledge discovery.

Typical Information Extraction Applications

Information extraction can be applied to a wide range of textual sources: from emails and Web pages to reports, presentations, legal documents, and scientific papers. The technology successfully solves challenges related to content management and knowledge discovery in areas such as:

Business intelligence (for enabling analysts to gather structured information from multiple sources);
Financial investigation (for analysis and discovery of hidden relationships);

Advantages of NLP & Disadvantages of NLP

Advantages of NLP

A list of advantages of NLP is given below:

NLP helps users to ask questions about any subject and get a direct response within seconds.
NLP offers exact answers to the question; it does not give unnecessary or unwanted information.
NLP helps computers to communicate with humans in their own languages.
It is very time-efficient.
Most companies use NLP to improve the efficiency and accuracy of documentation processes and to identify information in large databases.

Disadvantages of NLP

A list of disadvantages of NLP is given below:

NLP may not show context.
NLP is unpredictable.
NLP may require more keystrokes.
NLP is unable to adapt to a new domain, and it has limited functionality; that is why an NLP system is built for a single, specific task only.

Explain the phases of NLP.

1. Lexical and Morphological Analysis: The first phase of NLP is lexical analysis. This phase scans the source text as a stream of characters and converts it into meaningful lexemes. It divides the whole text into paragraphs, sentences, and words.

2. Syntactic Analysis (Parsing): Syntactic analysis is used to check grammar and word arrangement, and it shows the relationships among the words. Example: "Arjun goes to the Radha." In the real world this sentence does not make any sense, so it is rejected by the syntactic analyzer.

3. Semantic Analysis: Semantic analysis is concerned with meaning representation. It mainly focuses on the literal meaning of words, phrases, and sentences.

4. Discourse Integration: The meaning of a sentence depends upon the sentences that precede it and may also invoke the meaning of the sentences that follow it.

5. Pragmatic Analysis: Pragmatic analysis is the fifth and last phase of NLP. It helps you to discover the intended effect by applying a set of rules that characterize cooperative dialogues.

List the steps to build an NLP pipeline.

NLP Pipeline

There are the following steps to build an NLP pipeline:

1. Segmentation: It breaks a paragraph into separate sentences. For example, consider the paragraph "The sky is clear; the stars are twinkling at night." The segments are: a) The sky is clear. b) The stars are twinkling at night.

2. Tokenization: It is used to break a sentence into separate words or tokens. For the second sentence, the tokenizer generates: "The", "stars", "are", "twinkling", "at", "night".

3. Stop-word Removal: Words such as "was", "in", "is", "and", "the" are called stop words and are removed in this stage. For example, the words remaining after stop-word removal are "stars", "twinkling", "night".

4. Stemming: It is the process of obtaining the word stem of a word. New words are formed by adding affixes to a stem. For example, "celebrates," "celebrated," and "celebrating" all originate from the single root word "celebrate."
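The four steps above can be chained into a minimal pipeline sketch. The stop-word list and the suffix-stripping stemmer are simplifications for illustration (a real pipeline would use a proper stemmer such as Porter's):

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "was", "in", "and", "at"}  # illustrative sample

def segment(paragraph):
    """1. Segmentation: split a paragraph into sentences."""
    return [s.strip() for s in re.split(r"[.;!?]", paragraph) if s.strip()]

def tokenize(sentence):
    """2. Tokenization: split a sentence into lowercase word tokens."""
    return re.findall(r"[A-Za-z]+", sentence.lower())

def remove_stop_words(tokens):
    """3. Stop-word removal: drop grammatical words."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """4. Stemming: crude suffix stripping, for illustration only."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

paragraph = "The sky is clear; the stars are twinkling at night."
for sentence in segment(paragraph):
    tokens = remove_stop_words(tokenize(sentence))
    print([stem(t) for t in tokens])
```

Running the sketch on the example paragraph yields ['sky', 'clear'] for the first sentence and ['star', 'twinkl', 'night'] for the second, matching the worked example in steps 1 through 4.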

Discuss the importance of NLP.

Importance of NLP

The importance of NLP is explained as follows:

Large volumes of textual data: Natural language processing helps computers communicate with humans in their own language and scales other language-related tasks. For example, NLP makes it possible for computers to read text, hear speech, interpret it, measure sentiment, and determine which parts are important.

Structuring a highly unstructured data source: Human language is astoundingly complex and diverse. We express ourselves in infinite ways, both verbally and in writing. Not only are there hundreds of languages and dialects, but within each language is a unique set of grammar and syntax rules, terms, and slang. When we write, we often misspell or abbreviate words, or omit punctuation. When we speak, we have regional accents, and we mumble, stutter, and borrow terms from other languages. While supervised and unsupervised learning, and specifically deep learning, are now widely used for modeling human language, there is also a need for syntactic and semantic understanding that these methods alone do not provide.

Discuss the major applications of NLP.

Applications of NLP

NLP is one of the ways that people have humanized machines and reduced the need for labor. It has led to the automation of speech-related tasks and human interaction. Some applications of NLP include:

Translation Tools: Tools such as Google Translate, Amazon Translate, etc. translate sentences from one language to another using NLP.

Chatbots: Chatbots can be found on most websites and are a way for companies to deal with common queries quickly.

Virtual Assistants: Virtual assistants like Siri, Cortana, Google Home, Alexa, etc. can not only talk to you but also understand commands given to them.

Targeted Advertising: Have you ever talked about a product or service, or just googled something, and then started seeing ads for it? This is called targeted advertising, and it helps generate tons of revenue for sellers, as they can reach niche audiences at the right time.

Autocorrect: Autocorrect automatically corrects any spelling mistakes you make; apart from this, grammar checkers correct grammatical errors.

Define NLP. Differentiate between NLU and NLG.

Natural language processing (NLP) is the intersection of computer science, linguistics, and machine learning. The field focuses on communication between computers and humans in natural language; NLP is all about making computers understand and generate human language. Applications of NLP techniques include voice assistants like Alexa, Siri, Cortana, and Google Assistant, but also things like machine translation and text filtering. NLP has heavily benefited from recent advances in machine learning, especially from deep learning techniques. The field is divided into three parts:

Speech Recognition: The translation of spoken language into text.
Natural Language Understanding (NLU): The computer's ability to understand what we say.
Natural Language Generation (NLG): The generation of natural language by a computer.

Explain some differences between mining association rules in multimedia databases and in transaction databases.

To mine associations among multimedia objects, we can treat each image as a transaction and find frequently occurring patterns among different images. There are some differences between mining association rules in multimedia databases and in transaction databases.

First, an image may contain multiple objects, each with many features such as color, shape, texture, keyword, and spatial location, so there could be many possible associations. In many cases, a feature may be considered the same in two images at a certain level of resolution but different at a finer resolution level. Therefore, it is essential to adopt a progressive resolution refinement approach.

Second, because a picture containing multiple recurrent objects is an important feature in image analysis, recurrence of the same objects should not be ignored in association analysis. For example, a picture containing two golden circles is treated quite differently from one containing only one.

Third, there often exist important spatial relationships among the objects in an image, such as above, beneath, between, and nearby, and these features are useful for exploring object associations.
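The "each image is a transaction" idea can be sketched with naive support counting over invented data. Note that this sketch uses sets, so it deliberately ignores object recurrence, which is exactly the simplification the second point above warns against in real multimedia mining:

```python
from itertools import combinations

# Each image is treated as a transaction of the objects it contains (invented data).
# Using sets means "two golden circles" collapses to one, a known simplification.
images = [
    {"blue_square", "red_circle", "sky"},
    {"blue_square", "red_circle"},
    {"blue_square", "sky"},
    {"red_circle", "sun"},
]

def frequent_pairs(transactions, min_support):
    """Return object pairs whose co-occurrence support meets the threshold."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    freq = {}
    for pair in combinations(items, 2):
        support = sum(1 for t in transactions if set(pair) <= t) / n
        if support >= min_support:
            freq[pair] = support
    return freq

print(frequent_pairs(images, min_support=0.5))
```

With a minimum support of 0.5, the pairs (blue_square, red_circle) and (blue_square, sky) each appear in half of the images and qualify as frequent patterns.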

Importance of Spatial Data Mining (SDM)

SDM plays an important role in everyday life as well, and its scope of application is ever-expanding under the remotely sensed aids of our Smart Planet. For example, Remote Sensing (RS) cannot substitute for human monitoring of the sustainable development of the earth; however, this is one area where RS can operate automatically, and its sensed images are valuable. SDM provides new access to automatic, knowledge-based object identification and classification from huge high-resolution RS images. In hyperspectral images of the earth's surface, there is a triple data gain of spatial, radiometric, and spectral information. By mining hyperspectral images, objects with fine spectral features may be intelligently analyzed, extracted, identified, matched, and classified with high accuracy, thereby shortening the application time of RS images. SDM may greatly improve the speed of knowledge-based information processing. From nighttime light images,

Explain TF-IDF (Term Frequency-Inverse Document Frequency) with an example.

Term Weighting (TF-IDF)

Term weighting refers to the process of calculating and assigning a weight to each word as its degree of importance. We may need this process to remove further words, beyond stop words, for additional efficiency. The term frequency and the TF-IDF weight are popular schemes for weighting words, so they are described formally with their mathematical equations. The term weights may be used as attribute values when encoding texts into numerical vectors.

We may use the term frequency, which is the occurrence count of each word in the given text, as the scheme for weighting words. It is assumed that the stop words, which occur most frequently in texts, have been completely removed in advance. In this scheme, words are weighted by counting their occurrences in the given text. There are two kinds of term frequency: the absolute term frequency, which is the raw occurrence count, and the relative term frequency, which is the ratio of a word's occurrences to the maximal occurrence count. The relative term frequency normalizes the weights across texts of different lengths.
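A worked sketch of TF-IDF over a tiny invented corpus. Here TF is the relative term frequency described above (occurrences divided by the maximal occurrence count in the document), and IDF is the usual log of total documents over documents containing the term:

```python
import math

# Tiny corpus of pre-tokenized documents, invented for illustration.
docs = [
    ["data", "mining", "finds", "patterns"],
    ["text", "mining", "handles", "text"],
    ["data", "warehousing", "stores", "data"],
]

def tf(term, doc):
    """Relative term frequency: occurrences divided by the maximal occurrence count."""
    counts = {w: doc.count(w) for w in doc}
    return doc.count(term) / max(counts.values())

def idf(term, docs):
    """Inverse document frequency: log of total docs over docs containing the term."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "mining" appears in 2 of 3 documents, so its IDF (and weight) is low;
# "warehousing" appears in only 1 of 3, so it is weighted more distinctively.
print(round(tf_idf("mining", docs[0], docs), 3))       # → 0.405
print(round(tf_idf("warehousing", docs[2], docs), 3))  # → 0.549
```

The common word "mining" receives a lower weight than the rarer "warehousing", which is the point of the IDF factor: words spread across many documents carry little discriminating power.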

Explain the text indexing techniques.

The three basic steps of text indexing are illustrated in figure 9.2. The first step is tokenization, which is the process of segmenting a text into tokens by white spaces or punctuation marks. The second step is stemming, which is the process of converting each token into its root form using grammar rules. The last step is stop-word removal, which is the process of removing grammatical words such as articles, conjunctions, and prepositions. Tokenization is the prerequisite for the next two steps, but stemming and stop-word removal may be swapped with each other, depending on the given situation.

1) Tokenization

Tokenization is defined as the process of segmenting a text into tokens by white space or punctuation marks. Tokenization can be applied to source code in C, C++, and Java, as well as to texts written in a natural language; however, the scope is restricted to text only in this study. Morphological analysis is required for tokenizing texts written in oriental languages such as Chinese, Japanese, and Korean.

Explain text indexing.

Text indexing is defined as the process of converting a text into a list of words. Since texts are essentially given in unstructured form, it is almost impossible to process a text in its raw form directly with a computer program. In other words, text indexing means the process of segmenting a text, which consists of sentences, into its constituent words. A list of words is the result of indexing a text, and it becomes the input to text representation.

Let us consider the necessity of text indexing. Text is essentially unstructured data, unlike numerical data, so computer programs cannot process it in its raw form. It is impossible to apply numerical operations to texts, and it is not easy to encode a text into its own numerical value. A text is a very long string, unlike a short string, so it is very difficult to give the text its own categorical value. Therefore, what is mentioned above becomes the motivation for text indexing.

Explain the three categories observed while mining association in multimedia data.

Mining Associations in Multimedia Data

Association rules involving multimedia objects can be mined in image and video databases. At least three categories can be observed:

1. Associations between image content and non-image content features: A rule like "If at least 50% of the upper part of the picture is blue, then it is likely to represent sky" belongs to this category, since it links the image content to the keyword sky.

2. Associations among image contents that are not related to spatial relationships: A rule like "If a picture contains two blue squares, then it is likely to contain one red circle as well" belongs to this category, since the associations all regard image contents.

3. Associations among image contents related to spatial relationships: A rule like "If a red triangle is between two yellow squares, then it is likely a big oval-shaped object is underneath" belongs to this category, since it associates objects in the image with spatial relationships.