
How do you generate correlation analysis from association analysis?

As we have seen so far, the support and confidence measures are insufficient for filtering out uninteresting association rules. To tackle this weakness, a correlation measure can be used to augment the support-confidence framework for association rules. This leads to correlation rules of the form A ⇒ B [support, confidence, correlation]. That is, a correlation rule is measured not only by its support and confidence but also by the correlation between itemsets A and B. There are many different correlation measures from which to choose; in this subsection, we study several correlation measures to determine which would be good for mining large data sets. Lift is a simple correlation measure, given as follows. The occurrence of itemset A is independent of the occurrence of itemset B if P(A ∪ B) = P(A)P(B); otherwise, itemsets A and B are dependent and correlated as events. This definition can easily be extended to more than two itemsets. The lift between the occurrence of A and B can be measured by computing lift(A, B) = P(A ∪ B) / (P(A)P(B)). If the resulting value is less than 1, the occurrence of A is negatively correlated with the occurrence of B; if it is greater than 1, A and B are positively correlated; if it equals 1, A and B are independent.
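As a minimal sketch of how these measures relate, the following Python snippet computes support, confidence, and lift over a hypothetical toy set of transactions (item names and data are illustrative only):

```python
# Toy transactions; each transaction is a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

A, B = {"bread"}, {"milk"}
conf = support(A | B) / support(A)                   # estimate of P(B | A)
lift = support(A | B) / (support(A) * support(B))    # >1 positive, <1 negative, =1 independent
print(f"support={support(A | B):.2f} confidence={conf:.2f} lift={lift:.2f}")
```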

How effective are Bayesian classifiers?

Various empirical studies of this classifier in comparison to the decision tree and neural network classifiers have found it to be comparable in some domains. In theory, Bayesian classifiers have the minimum error rate in comparison to all other classifiers. However, in practice this is not always the case, owing to inaccuracies in the assumptions made for its use, such as class-conditional independence, and the lack of available probability data. Bayesian classifiers are also useful in that they provide a theoretical justification for other classifiers that do not explicitly use Bayes' theorem. For example, under certain assumptions, it can be shown that many neural network and curve-fitting algorithms output the maximum posterior hypothesis, as does the naïve Bayesian classifier.
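To make the empirical comparison concrete, here is a minimal sketch, assuming scikit-learn is available, that cross-validates a naïve Bayesian classifier against a decision tree on a standard toy dataset; as the studies above note, which one wins varies by domain:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
for name, clf in [("naive Bayes", GaussianNB()),
                  ("decision tree", DecisionTreeClassifier(random_state=0))]:
    scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validated accuracy
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```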

What do you mean by knowledge discovery in databases (KDD)?

KNOWLEDGE DISCOVERY IN DATABASES (KDD) Knowledge discovery in databases (KDD) is the process of discovering useful knowledge from a collection of data. This widely used data mining technique is a process that includes data preparation and selection, data cleansing, incorporating prior knowledge on data sets, and interpreting accurate solutions from the observed results. Major KDD application areas include marketing, fraud detection, telecommunication, and manufacturing. Data mining is the core part of the knowledge discovery process. KDD finds knowledge in data by applying data mining methods (algorithms) to extract the desired knowledge from large amounts of data. Simply stated, data mining refers to extracting or "mining" knowledge from large amounts of data stored in databases, data warehouses, or other information repositories. Many people treat data mining as a synonym for another popularly used term, knowledge discovery from data, or KDD, while others view data mining as simply an essential step in the larger process of knowledge discovery.
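As an illustrative sketch of the KDD steps (selection, cleansing, transformation, mining, interpretation), the following Python/pandas fragment walks a hypothetical raw table through the process; the column names and rules are invented for the example:

```python
import pandas as pd

# Hypothetical raw table with duplicates, missing and invalid values.
raw = pd.DataFrame({
    "customer": ["a", "b", "c", "d", "d"],
    "age":      [34, None, 29, 41, 41],
    "spend":    [120.0, 80.0, -5.0, 200.0, 200.0],
    "notes":    ["", "vip", "", "", ""],
})

selected = raw[["customer", "age", "spend"]]          # data selection
cleaned = (selected
           .drop_duplicates()                         # cleansing: duplicates,
           .dropna()                                  # missing values,
           .query("spend >= 0"))                      # and invalid entries
transformed = cleaned.assign(                         # transformation: normalize spend
    spend_z=lambda d: (d["spend"] - d["spend"].mean()) / d["spend"].std())
# Mining step (trivially: above-average spenders), then interpretation.
print(transformed[transformed["spend_z"] > 0])
```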

Why is the FP-growth approach considered better than the Apriori approach? Explain. / How is the FP-tree better than the Apriori algorithm?

Apriori algorithm: It is a classic algorithm for learning association rules. It uses a bottom-up approach in which frequent subsets are extended one item at a time, and it uses a breadth-first search with a hash tree structure to count candidate itemsets efficiently. FP-growth: It allows frequent itemset discovery without candidate generation. It builds a compact data structure called the FP-tree with only two passes over the database, and then extracts frequent itemsets directly by traversing the FP-tree. FP-growth is considered better because it avoids Apriori's expensive candidate generation and its repeated database scans (one scan per itemset length), which makes it substantially faster on large or dense data sets.
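Assuming the mlxtend library is available, both algorithms can be run side by side; they return the same frequent itemsets but arrive at them differently, as sketched below (the transactions are hypothetical):

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth

transactions = [["bread", "milk"],
                ["bread", "diapers", "beer"],
                ["milk", "diapers", "beer"],
                ["bread", "milk", "diapers"]]
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions),
                  columns=te.columns_)
# Same frequent itemsets, different search strategies:
print(apriori(df, min_support=0.5, use_colnames=True))   # level-wise candidate generation
print(fpgrowth(df, min_support=0.5, use_colnames=True))  # FP-tree, no candidates
```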

Explain Data Mining Trends.

Data Mining Trends: The diversity of data, data mining tasks, and data mining approaches poses many challenging research issues in data mining. The development of efficient and effective data mining methods, systems and services, and interactive and integrated data mining environments is a key area of study. The use of data mining techniques to solve large or sophisticated application problems is an important task for data mining researchers and for data mining system and application developers. This section describes some of the trends in data mining that reflect the pursuit of these challenges. Application exploration: Early data mining applications put a lot of effort into helping businesses gain a competitive edge. The exploration of data mining for businesses continues to expand as e-commerce and e-marketing have become mainstream in the retail industry. Data mining is increasingly used for the exploration of applications in other areas, such as web and text analysis, financial analysis, telecommunications, biomedicine, and science.

Explain Privacy-Preserving Data Mining Methods.

Privacy-preserving data mining is an area of data mining research that responds to the need for privacy protection in data mining. It is also known as privacy-enhanced or privacy-sensitive data mining. It deals with obtaining valid data mining results without disclosing the underlying sensitive data values. Most privacy-preserving data mining methods apply some form of transformation to the data. Typically, such methods reduce the granularity of representation to preserve privacy; for example, they may generalize the data from individual customers to customer groups. This reduction in granularity causes loss of information and possibly of the usefulness of the data mining results, which is the natural trade-off between information loss and privacy. Privacy-preserving data mining methods can be classified into the following categories. Randomization methods: These methods add noise to the data to mask some attribute values of records. The noise added should be sufficiently large that individual record values cannot be recovered, yet small enough that the aggregate distributions can still be reconstructed well for mining.
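A minimal sketch of the randomization idea, using hypothetical age values and Laplace noise; individual values are masked while aggregates remain usable:

```python
import numpy as np

rng = np.random.default_rng(42)
ages = np.array([23, 35, 41, 29, 52, 67, 31], dtype=float)  # sensitive values
noise = rng.laplace(loc=0.0, scale=5.0, size=ages.shape)    # masking noise
masked = ages + noise
# Individual records are distorted, but the aggregate survives:
print(f"true mean={ages.mean():.1f}  masked mean={masked.mean():.1f}")
```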

Explain Privacy, Security, and Social Impacts of Data Mining.

Privacy, Security, and Social Impacts of Data Mining: With more and more information accessible in electronic forms and available on the Web, and with increasingly powerful data mining tools being developed and put into use, there are increasing concerns that data mining may pose a threat to our privacy and data security. However, it is important to note that many data mining applications do not even touch personal data. Prominent examples include applications involving natural resources, the prediction of floods and droughts, meteorology, astronomy, geography, geology, biology, and other scientific and engineering data. Furthermore, most studies in data mining research focus on the development of scalable algorithms and do not involve personal data. The focus of data mining technology is on the discovery of general or statistically significant patterns, not on specific information regarding individuals. In this sense, we believe that the real privacy concerns are with unconstrained access to individual records, especially access to privacy-sensitive information such as credit card transactions, health care records, and personal financial data.

What can we do to secure the privacy of individuals while collecting and mining data?

Many data security-enhancing techniques have been developed to help protect data. Databases can employ a multilevel security model to classify and restrict data according to various security levels, with users permitted access only to their authorized level. It has been shown, however, that users executing specific queries at their authorized security level can still infer more sensitive information, and that a similar possibility can occur through data mining. Encryption is another technique in which individual data items may be encoded. This may involve blind signatures (which build on public-key encryption), biometric encryption (e.g., where the image of a person's iris or fingerprint is used to encode his or her personal information), and anonymous databases (which permit the consolidation of various databases but limit access to personal information only to those who need to know; personal information is encrypted and stored at different locations). Intrusion detection is another active area of research that helps protect the privacy of personal data.

Explain Data Mining in Recommender Systems.

Data Mining in Recommender Systems: Today's consumers are faced with millions of goods and services when shopping online. Recommender systems help consumers by recommending products that are likely to be of interest to the user, such as books, CDs, movies, restaurants, online news articles, and other services. Recommender systems may use a content-based approach, a collaborative approach, or a hybrid approach that combines both. The content-based approach recommends items that are similar to items the user preferred or queried in the past; it relies on product features and textual item descriptions. The collaborative approach (or collaborative filtering approach) may consider a user's social environment: it recommends items based on the opinions of other customers who have tastes or preferences similar to the user's. Recommender systems use a broad range of techniques from information retrieval, statistics, machine learning, and data mining to search for similarities among items and among customer preferences.
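As a rough sketch of the collaborative approach, the following computes user-user cosine similarities over a hypothetical ratings matrix and recommends the unrated item with the highest similarity-weighted score:

```python
import numpy as np

# Rows = users, columns = items; 0 means "not rated" (hypothetical toy matrix).
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
], dtype=float)

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

target = 0  # recommend for user 0
sims = np.array([cosine_sim(ratings[target], ratings[u])
                 for u in range(len(ratings))])
sims[target] = 0.0  # exclude the user's own row
# Predict scores as a similarity-weighted average of other users' ratings.
scores = sims @ ratings / (sims.sum() + 1e-9)
unrated = ratings[target] == 0
print("recommend item:", int(np.argmax(np.where(unrated, scores, -np.inf))))
```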

Explain Data Mining for Intrusion Detection and Prevention.

Data Mining for Intrusion Detection and Prevention: The security of our computer systems and data is at continual risk. The extensive growth of the Internet and the increasing availability of tools and tricks for intruding on and attacking networks have made intrusion detection and prevention a critical component of networked systems. An intrusion can be defined as any set of actions that threaten the integrity, confidentiality, or availability of a network resource (e.g., user accounts, file systems, system kernels, and so on). Intrusion detection systems and intrusion prevention systems both monitor network traffic and/or system executions for malicious activities. However, the former produce reports, whereas the latter are placed in-line and are able to actively prevent or block intrusions that are detected. The main functions of an intrusion prevention system are to identify malicious activity, log information about that activity, attempt to block or stop it, and report it.
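One common data mining technique for anomaly-based intrusion detection is an isolation forest; the sketch below, assuming scikit-learn and entirely synthetic connection features, flags traffic that deviates from the learned normal profile:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Hypothetical features per connection: [duration_s, bytes_sent, failed_logins]
normal = rng.normal(loc=[10, 500, 0], scale=[3, 100, 0.2], size=(500, 3))
attacks = rng.normal(loc=[120, 50000, 8], scale=[30, 5000, 2], size=(10, 3))
traffic = np.vstack([normal, attacks])

model = IsolationForest(contamination=0.02, random_state=0).fit(normal)
flags = model.predict(traffic)  # -1 = anomaly, 1 = normal
print("flagged:", int((flags == -1).sum()), "suspicious connections")
```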

Short note on Data Mining in Science and Engineering.

Data Mining in Science and Engineering: Data mining in engineering shares many similarities with data mining in science. Both practices often collect massive amounts of data and require data preprocessing, data warehousing, and scalable mining of complex types of data. Both typically use visualization and make good use of graphs and networks. Moreover, many engineering processes need real-time responses, so mining data streams in real time often becomes a critical component. Massive amounts of human communication data also pour into our daily life. Such communication exists in many forms, including news, blogs, articles, web pages, online discussions, product reviews, tweets, messages, and advertisements, both on the Web and in various kinds of social networks. Hence, data mining in social science and social studies has become increasingly popular. Moreover, user or reader feedback regarding products, speeches, and articles can be analyzed to deduce general opinions and sentiments.

Explain challenges brought about by emerging scientific applications of data mining.

Some of the challenges brought about by emerging scientific applications of data mining are as follows. Data warehouses and data preprocessing: Data preprocessing and data warehouses are critical for information exchange and data mining. Creating a warehouse often requires finding means of resolving inconsistent or incompatible data collected in multiple environments and at different time periods. This requires reconciling semantics, referencing systems, geometry, measurements, accuracy, and precision. Methods are needed for integrating data from heterogeneous sources and for identifying events. For instance, consider climate and ecosystem data, which are spatial and temporal and require cross-referencing geospatial data. A major problem in analyzing such data is that there are too many events in the spatial domain but too few in the temporal domain. For example, El Niño events occur only every four to seven years, and previous data on them might not have been collected as systematically as they are today.

Explain Data Mining for the Retail and Telecommunication Industries.

Data Mining for the Retail and Telecommunication Industries: The retail industry is a well-fit application area for data mining, since it collects huge amounts of data on sales, customer shopping history, goods transportation, consumption, and service. The quantity of data collected continues to expand rapidly, especially due to the increasing availability, ease, and popularity of business conducted on the Web, or e-commerce. Today, most major chain stores also have websites where customers can make purchases online. Some businesses, such as Amazon.com (www.amazon.com), exist solely online, without any brick-and-mortar (i.e., physical) store locations. Retail data provide a rich source for data mining. Retail data mining can help identify customer buying behaviors, discover customer shopping patterns and trends, improve the quality of customer service, achieve better customer retention and satisfaction, enhance goods consumption ratios, design more effective goods transportation and distribution policies, and reduce the cost of business.

Explain Data Mining for Financial Data Analysis.

Data Mining for Financial Data Analysis: Most banks and financial institutions offer a wide variety of banking, investment, and credit services (the latter include business, mortgage, and automobile loans, and credit cards). Some also offer insurance and stock investment services. Financial data collected in the banking and financial industry are often relatively complete, reliable, and of high quality, which facilitates systematic data analysis and data mining. Here we present a few typical cases. Design and construction of data warehouses for multidimensional data analysis and data mining: As in many other applications, data warehouses need to be constructed for banking and financial data, and multidimensional data analysis methods should be used to analyze the general properties of such data. For example, a company's financial officer may want to view debt and revenue changes by month, region, sector, and other factors, along with maximum, minimum, total, average, trend, and other statistical measures.
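A minimal sketch of such multidimensional analysis, using pandas on a hypothetical ledger, views revenue along the month and region dimensions with several aggregates:

```python
import pandas as pd

# Hypothetical ledger of monthly revenue by region and sector.
ledger = pd.DataFrame({
    "month":   ["Jan", "Jan", "Feb", "Feb", "Feb", "Mar"],
    "region":  ["East", "West", "East", "West", "East", "West"],
    "sector":  ["retail", "loans", "loans", "retail", "retail", "loans"],
    "revenue": [120.0, 80.0, 95.0, 130.0, 60.0, 110.0],
})

# Slice the data along month x region with several aggregate measures.
cube = ledger.pivot_table(index="month", columns="region", values="revenue",
                          aggfunc=["sum", "mean", "max", "min"])
print(cube)
```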

Short note on Mining Spatiotemporal Data and Moving Objects.

Spatiotemporal data are data that relate to both space and time. Spatiotemporal data mining refers to the process of discovering patterns and knowledge from spatiotemporal data. Typical examples of spatiotemporal data mining include discovering the evolutionary history of cities and lands, uncovering weather patterns, predicting earthquakes and hurricanes, and determining global warming trends. Spatiotemporal data mining has become increasingly important and has far-reaching implications, given the popularity of mobile and map services and digital Earth, as well as satellite, RFID, sensor, wireless, and video technologies. Among the many kinds of spatiotemporal data, moving-object data (i.e., data about moving objects) are especially important. For example, animal scientists attach telemetry equipment to wildlife to analyze ecological behavior, mobility managers embed GPS in cars to better monitor and guide vehicles, and meteorologists use weather satellites and radars to observe hurricanes. Massive-scale moving-object data are becoming increasingly rich, complex, and ubiquitous.

Short note on Web Document Clustering.

Web Document Clustering: Web document clustering is another approach to finding relevant documents on a topic or for query keywords. As discussed earlier, popular search engines often return a huge, unmanageable list of documents that contain the keywords the user specified. Finding the most useful documents in such a large list is usually tedious and often impossible. The user could instead apply clustering to the set of documents returned by a search engine in response to a query, with the aim of finding semantically meaningful clusters, rather than a list of ranked documents, that are easier to interpret. It is not necessary to insist that a document belong to only one cluster, since in some cases it is justified to have a document belong to two or more clusters. Web clustering may be based on content alone, on both content and links, or on links alone.
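A minimal sketch of this idea, assuming scikit-learn: cluster a handful of hypothetical search-result snippets by TF-IDF content so that the two senses of the query separate:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical snippets returned by a search engine for the query "jaguar".
docs = [
    "jaguar is a large cat found in the americas",
    "the jaguar big cat hunts near rivers",
    "jaguar unveils a new luxury sports car model",
    "the new jaguar car offers hybrid engines",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g., animal pages vs. car pages land in separate clusters
```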

Short note on Fingerprinting.

Fingerprinting: An approach for comparing a large number of documents is based on the idea of fingerprinting documents. A document may be divided into all possible substrings of a fixed length w; these substrings are called shingles. Based on the shingles we can define the resemblance R(X, Y) and the containment C(X, Y) between two documents X and Y as follows, where S(X) and S(Y) are the sets of shingles for documents X and Y respectively: R(X, Y) = |S(X) ∩ S(Y)| / |S(X) ∪ S(Y)| and C(X, Y) = |S(X) ∩ S(Y)| / |S(X)|. An algorithm like the following may now be used to find similar documents: 1. Collect all the documents that one wishes to compare. 2. Choose a suitable shingle width and compute the shingles for each document. 3. Compare the shingles for each pair of documents. 4. Identify those documents that are similar.
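These definitions translate directly into code; the following sketch uses character shingles and hypothetical sentences:

```python
def shingles(text, w=4):
    """All contiguous substrings (shingles) of length w."""
    return {text[i:i + w] for i in range(len(text) - w + 1)}

def resemblance(x, y, w=4):
    sx, sy = shingles(x, w), shingles(y, w)
    return len(sx & sy) / len(sx | sy)   # |S(X) ∩ S(Y)| / |S(X) ∪ S(Y)|

def containment(x, y, w=4):
    sx, sy = shingles(x, w), shingles(y, w)
    return len(sx & sy) / len(sx)        # |S(X) ∩ S(Y)| / |S(X)|

a = "data mining extracts knowledge from data"
b = "data mining extracts patterns from data"
print(round(resemblance(a, b), 2), round(containment(a, b), 2))
```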

Short note on Web Structure Mining.

Web Structure Mining: Web structure mining studies the model underlying the link structures of the Web. It has been used for search engine result ranking and other Web applications. Web structure mining is the process of using graph and network mining theory and methods to analyze the nodes and connection structures of the Web. It extracts patterns from hyperlinks, where a hyperlink is a structural component that connects a web page to another location. It can also mine the document structure within a page (e.g., analyze the tree-like structure of a page to describe HTML or XML tag usage). Both kinds of web structure mining help us understand web content and may also help transform web content into relatively structured data sets. OR, Web Structure Mining: The aim of web structure mining is to discover the link structure or the model that is assumed to underlie the Web. The model may be based on the topology of the hyperlinks. This can help in discovering similarities between websites and in identifying authoritative pages for a given topic.
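A classic example of mining the hyperlink topology is PageRank; assuming the networkx library, the sketch below ranks the pages of a hypothetical miniature site purely by link structure:

```python
import networkx as nx

# Hypothetical miniature web graph: edges are hyperlinks between pages.
g = nx.DiGraph([
    ("home", "about"), ("home", "blog"),
    ("about", "home"), ("blog", "home"), ("blog", "about"),
])
ranks = nx.pagerank(g, alpha=0.85)  # stationary importance scores
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```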

Short note on Information Extraction.

Information Extraction: Information extraction is the process of extracting information from unstructured textual sources so that entities can be found, classified, and stored in a database. Semantically enhanced information extraction (also known as semantic annotation) couples those entities with their semantic descriptions and connections from a knowledge graph. By adding metadata to the extracted concepts, this technology solves many challenges in enterprise content management and knowledge discovery. Typical information extraction applications: Information extraction can be applied to a wide range of textual sources, from emails and web pages to reports, presentations, legal documents, and scientific papers. The technology successfully addresses challenges related to content management and knowledge discovery in areas such as business intelligence (enabling analysts to gather structured information from multiple sources) and financial investigation (analyzing large collections of documents for relevant facts and relationships).
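At its simplest, information extraction can be rule-based; the sketch below pulls hypothetical invoice fields out of free text with regular expressions, producing structured records ready for a database:

```python
import re

# Hypothetical unstructured source text.
text = "Invoice 4521 was issued to Acme Corp on 2024-03-15 for $1,250.00."

entities = {
    "invoice_id": re.findall(r"Invoice\s+(\d+)", text),
    "date":       re.findall(r"\d{4}-\d{2}-\d{2}", text),
    "amount":     re.findall(r"\$[\d,]+\.\d{2}", text),
}
print(entities)  # structured fields extracted from free text
```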

Short note on Term Weighting (TF-IDF).

Term Weighting (TF-IDF): Term weighting refers to the process of calculating and assigning a weight to each word according to its degree of importance. Besides removing stop words, the weights can be used to drop further uninformative words for efficiency. The term frequency and the TF-IDF weight are the popular word-weighting schemes, so they are described formally here. The term weights may be used as attribute values when encoding texts into numerical vectors. We may use the term frequency, which is the number of occurrences of each word in the given text, as the scheme for weighting words; it is assumed that stop words, which occur most frequently in texts, have been completely removed in advance. In this scheme, words are weighted by counting their occurrences in the given text. There are two kinds of term frequency: the absolute term frequency, which is the raw occurrence count of a word, and the relative term frequency, which is the ratio of a word's occurrences to that of the most frequent word, so its values fall between 0 and 1. The TF-IDF weight goes one step further and multiplies the term frequency by the inverse document frequency, idf(w) = log(N / df(w)), where N is the number of documents in the collection and df(w) is the number of documents containing w; words that occur in almost every document therefore receive weights near zero.
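The scheme can be written down directly; the following sketch computes relative term frequency and TF-IDF weights for a hypothetical three-document corpus (stop-word removal is omitted for brevity):

```python
import math
from collections import Counter

docs = [
    "data mining finds patterns in data",
    "text mining weights words by importance",
    "stop words carry little information",
]
corpus = [d.split() for d in docs]

def tf_idf(doc_tokens, corpus):
    tf = Counter(doc_tokens)
    max_tf = max(tf.values())
    n = len(corpus)
    weights = {}
    for word, count in tf.items():
        rel_tf = count / max_tf                    # relative term frequency in [0, 1]
        df = sum(1 for d in corpus if word in d)   # document frequency
        weights[word] = rel_tf * math.log(n / df)  # TF-IDF weight
    return weights

print(tf_idf(corpus[0], corpus))
```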