Posts

Showing posts with the label Datawarehouse and Datamining

Describe Multidimensional Data Cube

Multidimensional data cube
A multidimensional data cube stores large amounts of data in a multi-dimensional array. It improves efficiency by keeping an index on each dimension, so data can be retrieved quickly. Multidimensional arrays store the data in a way that provides a multidimensional view of it, and indexing every dimension of the cube improves accessing, retrieving, and storing data from the cube.
OR,
Most OLAP products are built around a structure in which the cube is modeled as a multidimensional array. Compared to other approaches, these multidimensional OLAP (MOLAP) products typically give better performance, primarily because they can index directly into the data cube structure to fetch subsets of the data. The cube becomes sparser as the number of dimensions grows. That ensures that
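As a rough illustration (my own sketch, not from the original post), a small MOLAP-style cube can be written in Java as a multi-dimensional array, where each dimension keeps its own index mapping member names to array positions; the dimension and member names below are invented for the example.

import java.util.*;

public class CubeSketch {
    public static void main(String[] args) {
        // Hypothetical dimensions: product, city, quarter
        List<String> products = Arrays.asList("TV", "Phone");
        List<String> cities   = Arrays.asList("Kathmandu", "Pokhara");
        List<String> quarters = Arrays.asList("Q1", "Q2");

        // The cube itself: a 3-dimensional array of sales measures
        double[][][] sales = new double[products.size()][cities.size()][quarters.size()];

        // Store a cell: sales of TV in Pokhara during Q2
        sales[products.indexOf("TV")][cities.indexOf("Pokhara")][quarters.indexOf("Q2")] = 1200.0;

        // Retrieve the same cell directly through the dimension indexes
        double v = sales[products.indexOf("TV")][cities.indexOf("Pokhara")][quarters.indexOf("Q2")];
        System.out.println("TV / Pokhara / Q2 = " + v);
    }
}

Because each dimension index points straight into the array, a cell is fetched without scanning the rest of the data, which is the performance advantage described above.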

What are the stages of knowledge discovery in databases (KDD)?

Knowledge Discovery in Databases (KDD)
• Dating back to 1989, the term Knowledge Discovery in Databases (KDD) represents the overall process of collecting data and methodically refining it.
• KDD is the automatic extraction of non-obvious, hidden knowledge from large volumes of data.
• Data mining, also known as Knowledge Discovery in Databases, refers to the nontrivial extraction of implicit, previously unknown, and potentially useful information from data stored in databases.
Stages of knowledge discovery in databases (KDD), in long form:
1. Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant data from the collection.
• Cleaning in case of missing values.
• Cleaning noisy data, where noise is a random or variance error.
• Cleaning with data discrepancy detection and data transformation tools.
Ideally, it is a process of filtering noisy content and redundancies to eliminate irrelevancy from the records. All in all, this method gives you the glasses to see inconsistencies
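As a minimal sketch of the data-cleaning stage (my own illustration, with an invented record type and made-up values), the snippet below drops incomplete records and filters out an obviously noisy, out-of-range age.

import java.util.*;
import java.util.stream.*;

public class CleaningSketch {
    // Hypothetical record structure used only for this example
    static class Customer {
        String name; Integer age;
        Customer(String name, Integer age) { this.name = name; this.age = age; }
        public String toString() { return name + "(" + age + ")"; }
    }

    public static void main(String[] args) {
        List<Customer> raw = Arrays.asList(
                new Customer("Anita", 34),
                new Customer(null, 28),        // missing name
                new Customer("Bikash", null),  // missing age
                new Customer("Chetan", 450));  // noisy / impossible age

        // Cleaning: drop incomplete records and out-of-range (noisy) values
        List<Customer> cleaned = raw.stream()
                .filter(c -> c.name != null && c.age != null)
                .filter(c -> c.age > 0 && c.age < 120)
                .collect(Collectors.toList());

        System.out.println(cleaned);   // only the complete, plausible record remains
    }
}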

Explain dynamic method dispatch with example.

Dynamic method dispatch is the mechanism by which a call to an overridden method is resolved at run time, rather than at compile time. It is therefore the mechanism for achieving run-time polymorphism. When an overridden method is called through a superclass reference, Java determines which version of that method to execute based on the type of the object being referred to at the time the call occurs; this determination is made at run time. Therefore, if a superclass contains a method that is overridden by a subclass, then when different types of objects are referred to by a superclass reference variable, different versions of the method are executed.

Here is an example that illustrates dynamic method dispatch:

class A {
    void callme() {
        System.out.println("Inside A's callme method");
    }
}

class B extends A {
    void callme() {
        System.out.println("Inside B's callme method");
    }
}

class C extends B {
    void callme() {
        System.out.println("Inside C's callme method");
    }
}
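The excerpt cuts off before the classes are exercised; a typical driver class (my own completion, following the usual textbook pattern) would look like this:

class Dispatch {
    public static void main(String[] args) {
        A a = new A();
        B b = new B();
        C c = new C();
        A r;            // superclass reference variable

        r = a;
        r.callme();     // calls A's version

        r = b;
        r.callme();     // calls B's version

        r = c;
        r.callme();     // calls C's version
    }
}

The same reference r invokes a different callme() each time, because the version executed is chosen from the actual object type at run time, not from the declared type of r.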

Explain the static keyword and its types (static variable, static method, static data).

The static keyword in Java is used mainly for memory management. We can apply the static keyword to variables, methods, blocks, and nested classes. A static member belongs to the class rather than to an instance of the class. The static keyword can be applied to:
1) Variables (also known as class variables)
2) Methods (also known as class methods)
3) Blocks
4) Nested classes
Static in Java
1) Java static variable
If you declare any variable as static, it is known as a static variable. A static variable can be used to refer to a property common to all objects (one that is not unique to each object), for example, the company name of employees or the college name of students. A static variable gets memory only once, in the class area, at the time of class loading.
Advantage of static variables: they make your program memory efficient (i.e., they save memory).
Example of a static variable:
//Java program to demonstrate the use of a static variable
class Student{
    int rollno;   //instance variable
    String name;
    static String college;   //static variable
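The code excerpt above is truncated; a complete minimal version of the same idea (my own reconstruction, with an invented college name and student data) could be:

class Student {
    int rollno;                      // instance variable: one copy per object
    String name;
    static String college = "ABC";   // static variable: one shared copy for the whole class

    Student(int r, String n) {
        rollno = r;
        name = n;
    }

    void display() {
        System.out.println(rollno + " " + name + " " + college);
    }
}

class TestStaticVariable {
    public static void main(String[] args) {
        Student s1 = new Student(111, "Karan");
        Student s2 = new Student(222, "Aryan");
        s1.display();
        s2.display();   // both rows print the same shared college value
    }
}

Because college is static, memory for it is allocated only once when the class is loaded, and every Student object reads the same copy.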

How are beam search and logic programming used to mine graphs? Explain.

Beam search
A beam search is a heuristic search technique that combines elements of breadth-first and best-first search. Like a breadth-first search, the beam search maintains a list of nodes that represent a frontier in the search space. Whereas breadth-first search adds all neighbors to the list, beam search orders the neighboring nodes according to some heuristic and keeps only the n best, where n is the beam size. This can significantly reduce the processing and storage requirements of the search. Some applications are speech recognition, vision, planning, machine learning, graph mining, and so on.
Beam Search Algorithm
1. Let beam width = n.
2. Create two empty lists: OPEN and CLOSED.
3. Start from the initial node (say N) and put it in the 'ordered' OPEN list.
4. Repeat steps 5 to 9 until the GOAL node is reached.
5. If the OPEN list is empty, then EXIT the loop returning 'False'.
6. Select the first/top node (N) in the OPEN list and move it to the CLOSED list.
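As a rough sketch of the listed steps (my own code, on a tiny hand-made graph with a made-up heuristic), the search below keeps only the n best nodes of the OPEN list at each step:

import java.util.*;

public class BeamSearchSketch {
    public static void main(String[] args) {
        // Hypothetical graph: node -> neighbors
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("S", Arrays.asList("A", "B", "C"));
        graph.put("A", Arrays.asList("D"));
        graph.put("B", Arrays.asList("D", "G"));
        graph.put("C", Arrays.asList("G"));
        graph.put("D", Arrays.asList("G"));
        graph.put("G", Collections.emptyList());

        // Made-up heuristic values (lower = estimated closer to the goal G)
        Map<String, Integer> h = Map.of("S", 5, "A", 3, "B", 2, "C", 4, "D", 1, "G", 0);

        System.out.println("Goal found: " + beamSearch(graph, h, "S", "G", 2));
    }

    static boolean beamSearch(Map<String, List<String>> graph, Map<String, Integer> h,
                              String start, String goal, int beamWidth) {
        Set<String> closed = new HashSet<>();                 // CLOSED list
        List<String> open = new ArrayList<>(List.of(start));  // ordered OPEN list

        while (!open.isEmpty()) {
            String node = open.remove(0);   // take the best node from OPEN
            if (node.equals(goal)) return true;
            closed.add(node);               // move it to CLOSED

            // Expand neighbors that have not been visited yet
            for (String nb : graph.getOrDefault(node, List.of()))
                if (!closed.contains(nb) && !open.contains(nb)) open.add(nb);

            // Order OPEN by the heuristic and keep only the beamWidth best
            open.sort(Comparator.comparingInt(h::get));
            if (open.size() > beamWidth) open = new ArrayList<>(open.subList(0, beamWidth));
        }
        return false;   // OPEN exhausted without reaching the goal
    }
}

Pruning OPEN to the beam width is what distinguishes this from plain breadth-first search: memory stays bounded, at the cost of possibly discarding the path that leads to the goal.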

What is the concept of mini-batch k-means? How does DBSCAN work?

MINI-BATCH K-MEANS
The mini-batch k-means clustering algorithm is a modified version of the k-means algorithm. It uses mini-batches to reduce the computation time on large datasets, while still attempting to optimize the clustering result. To achieve this, mini-batch k-means takes as input mini-batches, which are randomly sampled subsets of the whole dataset. Mini-batch k-means is considered faster than k-means and is normally used for large datasets. The main idea of the mini-batch k-means algorithm is to use small random batches of data of a fixed size, so that they can be stored in memory. In each iteration, a new random sample is drawn from the dataset and used to update the clusters, and this is repeated until convergence. Each mini-batch updates the clusters using a convex combination of the values of the prototypes and the data, applying a learning rate that decreases with the number of iterations. This learning rate is the inverse of the number of data points assigned to a cluster.
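The per-center update described above can be sketched roughly as follows (my own illustration for one mini-batch step, using 1-D points for brevity): each center moves toward an assigned point with a learning rate equal to 1 divided by how many points that center has received so far.

public class MiniBatchStepSketch {
    public static void main(String[] args) {
        double[] centers = {1.0, 10.0};     // current cluster centers (1-D for simplicity)
        int[] counts = {0, 0};              // per-center counts of points assigned so far
        double[] miniBatch = {2.0, 9.0, 1.5, 11.0};   // one random mini-batch

        for (double x : miniBatch) {
            // Assign the point to its nearest center
            int c = Math.abs(x - centers[0]) <= Math.abs(x - centers[1]) ? 0 : 1;
            counts[c]++;
            double eta = 1.0 / counts[c];   // per-center learning rate, decays over time
            // Convex combination of the old center and the new point
            centers[c] = (1 - eta) * centers[c] + eta * x;
        }
        System.out.println("Updated centers: " + centers[0] + ", " + centers[1]);
    }
}

In the full algorithm this step is repeated for many random mini-batches until the centers stop moving appreciably.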

Discuss overfitting and underfitting. How are precision and recall used to evaluate a classifier?

Overfitting
Overfitting means the model has a high accuracy score on training data but a low score on test data. An overfit model has essentially memorized the data set it has seen and is unable to generalize its learning to an unseen data set; that is why an overfit model results in very poor test accuracy. Poor test accuracy may occur when the model is highly complex, i.e., when the number of input feature combinations is large and makes the model overly flexible. Overfitting happens when the algorithm used to build the prediction model is very complex and has overlearned the underlying patterns in the training data. Overfitting is an error arising from sensitivity to small fluctuations in the training set; it can cause an algorithm to model the random noise in the training data rather than the intended result. In classification, overfitting happens when algorithms are strongly influenced by the specifics of the training data and try to learn patterns that are noisy and not generalized
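For the precision/recall part of the question, a tiny worked computation (my own example with made-up confusion-matrix counts) looks like this:

public class PrecisionRecallSketch {
    public static void main(String[] args) {
        // Hypothetical confusion-matrix counts for the positive class
        int tp = 40;   // true positives
        int fp = 10;   // false positives
        int fn = 20;   // false negatives

        double precision = (double) tp / (tp + fp);  // 40 / 50 = 0.80
        double recall    = (double) tp / (tp + fn);  // 40 / 60 = about 0.67
        double f1 = 2 * precision * recall / (precision + recall);

        System.out.printf("precision=%.2f recall=%.2f f1=%.2f%n", precision, recall, f1);
    }
}

Precision answers "of the items the classifier labeled positive, how many really were positive?", while recall answers "of the truly positive items, how many did the classifier find?"; the F1 score combines the two.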

Illustrate hierarchical clustering with an example.

The hierarchical clustering method produces a set of nested clusters organized as a hierarchical tree by performing a hierarchical decomposition (merge or split) of the data points based on a similarity or distance matrix. Depending on whether the hierarchical decomposition is formed in a bottom-up (merging) or top-down (splitting) fashion, a hierarchical clustering method can be classified into two main categories. While partitioning methods meet the basic clustering requirement of organizing a set of objects into a number of exclusive groups, in some situations we may want to partition our data into groups at different levels, as in a hierarchy. A hierarchical clustering method works by grouping data objects into a hierarchy or "tree" of clusters. The two categories are: agglomerative hierarchical clustering and divisive hierarchical clustering. An example of hierarchical clustering is given below.
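As a small worked illustration (my own example, not from the post), single-linkage agglomerative clustering of the 1-D points {1, 2, 9, 10, 25} starts with every point in its own cluster and repeatedly merges the two closest clusters, printing the tree-building order:

import java.util.*;

public class AgglomerativeSketch {
    public static void main(String[] args) {
        // Each point starts in its own cluster (bottom-up / agglomerative)
        List<List<Double>> clusters = new ArrayList<>();
        for (double p : new double[]{1, 2, 9, 10, 25}) clusters.add(new ArrayList<>(List.of(p)));

        // Repeatedly merge the two closest clusters until only two remain
        while (clusters.size() > 2) {
            int bi = 0, bj = 1;
            double best = Double.MAX_VALUE;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = singleLink(clusters.get(i), clusters.get(j));
                    if (d < best) { best = d; bi = i; bj = j; }
                }
            clusters.get(bi).addAll(clusters.remove(bj));
            System.out.println("After merge: " + clusters);
        }
    }

    // Single linkage: distance between the closest pair of points in two clusters
    static double singleLink(List<Double> a, List<Double> b) {
        double min = Double.MAX_VALUE;
        for (double x : a) for (double y : b) min = Math.min(min, Math.abs(x - y));
        return min;
    }
}

The printed merges ({1,2}, then {9,10}, then {1,2,9,10}) trace out the dendrogram from the leaves upward; a divisive method would produce the same tree by splitting from the top down instead.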

Describe any five types of OLAP operations.

An Online Analytical Processing (OLAP) server is based on the multidimensional data model. It allows managers and analysts to get an insight into the information through fast, consistent, and interactive access to it. Since OLAP servers are based on a multidimensional view of data, we will discuss OLAP operations on multidimensional data. Here is the list of OLAP operations: roll-up, drill-down, slice and dice, and pivot (rotate).
Roll-Up
The roll-up operation (also known as the drill-up or aggregation operation) performs aggregation on a data cube, either by climbing up a concept hierarchy or by dimension reduction. Roll-up is like zooming out on the data cube. The figure shows the result of a roll-up operation performed on the dimension location. The concept hierarchy for location is defined as the order street < city < province or state < country. The roll-up operation aggregates the data by ascending the location hierarchy from the level of city to the level of province. When a roll-up is performed
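A roll-up from the city level to the province level can be imitated in plain Java (my own sketch with made-up sales figures): each city is replaced by its parent in the concept hierarchy and the measure is re-aggregated.

import java.util.*;

public class RollUpSketch {
    public static void main(String[] args) {
        // City-level sales (made-up figures) and the city -> province concept hierarchy
        Map<String, Double> citySales = Map.of(
                "Kathmandu", 395.0, "Lalitpur", 560.0,
                "Pokhara", 1087.0, "Butwal", 854.0);
        Map<String, String> cityToProvince = Map.of(
                "Kathmandu", "Bagmati", "Lalitpur", "Bagmati",
                "Pokhara", "Gandaki", "Butwal", "Lumbini");

        // Roll-up: climb the location hierarchy and re-aggregate the measure
        Map<String, Double> provinceSales = new HashMap<>();
        citySales.forEach((city, amount) ->
                provinceSales.merge(cityToProvince.get(city), amount, Double::sum));

        System.out.println(provinceSales);   // e.g. {Bagmati=955.0, Gandaki=1087.0, Lumbini=854.0}
    }
}

Drill-down is simply the reverse movement (province back down to city), while slice, dice, and pivot select or reorient parts of the same cube rather than changing the aggregation level.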

Why is data preprocessing important?

Virtually any type of data analytics, data science, or AI development requires some type of data preprocessing to provide reliable, precise, and robust results for enterprise applications. Good preprocessing helps align the way data is fed into the various algorithms used for building machine learning or deep learning models. Real-world data is messy and is often created, processed, and stored by a variety of humans, business processes, and applications. While it may be suitable for the purpose at hand, a data set may be missing individual fields, contain manual input errors, or have duplicate data or different names describing the same thing. Although humans can often identify and rectify these problems in the line of business, this data needs to be automatically preprocessed when it is used to train machine learning or deep learning algorithms. Machine learning and deep learning algorithms work best when data is presented in a format that highlights the relevant aspects required

Why is data preprocessing mandatory? Justify.

Incomplete, noisy, and inconsistent data are commonplace properties of large real-world databases and data warehouses. Incomplete data can occur for a number of reasons. Attributes of interest may not always be available, such as customer information for sales transaction data, or they may not have been considered important at the time of entry. Relevant data may not be recorded due to a misunderstanding or because of equipment malfunctions. Data that were inconsistent with other recorded data may have been deleted. Furthermore, the recording of the history of modifications to the data may have been overlooked. Missing data, particularly for tuples with missing values for some attributes, may need to be inferred. Therefore, to improve the quality of the data and, consequently, of the mining results, data preprocessing is needed/mandatory.
OR,
Virtually any type of data analytics, data science, or AI development requires some type of data preprocessing to provide reliable, precise, and robust results for

Why You Need Data Preprocessing

Incomplete, noisy, and inconsistent data are commonplace properties of large real-world databases and data warehouses. Incomplete data can occur for a number of reasons. Attributes of interest may not always be available, such as customer information for sales transaction data, or they may not have been considered important at the time of entry. Relevant data may not be recorded due to a misunderstanding or because of equipment malfunctions. Data that were inconsistent with other recorded data may have been deleted. Furthermore, the recording of the history of modifications to the data may have been overlooked. Missing data, particularly for tuples with missing values for some attributes, may need to be inferred. Therefore, to improve the quality of the data and, consequently, of the mining results, data preprocessing is needed.
OR,
By now, you've surely realized why data preprocessing is so important. Since mistakes, redundancies, missing values, and inconsistencies all compromise the integrity of the data set, you need to

How do trust and distrust propagate in a social network? Explain.

Propagation of Trust and Distrust
The end goal is to produce a final matrix F from which we can read off the computed trust or distrust between any two users. In the remainder of this section, we first propose two techniques for computing F from C_B. Next, we complete the specification of how the original trust matrix T and distrust matrix D can be combined to give B. We then describe some details of how the iteration itself is performed, to capture two distinct views of how distrust should propagate. Finally, we describe some alternatives regarding how the final results should be interpreted.
Approaches to Trust Propagation
A natural approach to estimating the quality of a piece of information is to aggregate the opinions of many users. But this approach suffers from the same concerns about disinformation as the network at large: it is easy for a user or a coalition of users to adopt many personas and together express a large number of biased opinions. Instead, we wish to ground our conclusions in
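As a very rough sketch of the propagation idea (my own simplification with a made-up 3-user matrix, not the exact scheme in the excerpt, which combines T and D into B and iterates differently), beliefs can be spread by repeated multiplication of a combined belief matrix B, so that longer chains of trust contribute with geometrically decreasing weight:

public class TrustPropagationSketch {
    public static void main(String[] args) {
        // Hypothetical combined belief matrix B: B[i][j] > 0 means i trusts j, < 0 means distrust
        double[][] B = {
                {0.0, 1.0, 0.0},
                {0.0, 0.0, 1.0},
                {-1.0, 0.0, 0.0}};
        double gamma = 0.5;   // discount applied to longer propagation chains
        int K = 3;            // number of propagation steps to accumulate

        double[][] F = new double[3][3];   // accumulated propagated beliefs
        double[][] power = B;
        for (int k = 1; k <= K; k++) {
            double w = Math.pow(gamma, k);
            for (int i = 0; i < 3; i++)
                for (int j = 0; j < 3; j++)
                    F[i][j] += w * power[i][j];   // add the k-step propagation, discounted
            power = multiply(power, B);           // move to chains of length k+1
        }
        System.out.println("F[0][2] (user 0's propagated belief about user 2) = " + F[0][2]);
    }

    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;
        double[][] r = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    r[i][j] += a[i][k] * b[k][j];
        return r;
    }
}

Here user 0 never rated user 2 directly, yet F[0][2] becomes positive because user 0 trusts user 1, who trusts user 2; how the distrust entries should flow through such chains is exactly the design question the text goes on to discuss.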

Explain the four different multimedia mining models that have been used.

MODELS FOR MULTIMEDIA MINING
The models used to mine multimedia data are very important. Commonly, four different multimedia mining models have been used: classification, association rules, clustering, and statistical modeling.
1. Classification
Classification is a technique for multimedia data analysis that can learn from every property of a specified set of multimedia items. The data is divided into predefined class labels, so as to achieve the purpose of classification. Classification is the process of organizing data into categories for more effective and efficient use; it creates a function that maps a data item into one of several predefined classes, by taking a training data set as input and building a model of the class attribute based on the rest of the attributes. Decision tree classification has a perceptive nature that matches the user's conceptual model without loss of exactness. Hidden Markov Models are used for classifying multimedia data such as images and

What are the different kinds of applications of multimedia data mining?

APPLICATIONS OF MULTIMEDIA MINING
There are different kinds of applications of multimedia data mining, some of which are as follows:
Digital Library: The collection of digital data is stored and maintained in a digital library, where it is essential to convert the different formats of digital data, such as text, images, video, and audio.
Traffic Video Sequences: In order to determine important but previously unidentified knowledge from traffic video sequences, detailed analysis and mining are performed based on vehicle identification, traffic flow, and the queue temporal relations of vehicles at an intersection. This provides an economical approach to regular traffic monitoring processes.
Medical Analysis: Multimedia mining is primarily used in the medical field, particularly for analyzing medical images. Various data mining techniques are used for image classification, for example, automatic 3D delineation of highly aggressive brain tumors and automatic localization and identification

Explain the categories of multimedia data mining.

CATEGORIES OF MULTIMEDIA DATA MINING
Multimedia data mining is classified into two broad categories: static media and dynamic media. Static media contains text (digital libraries, SMS and MMS) and images (photos and medical images). Dynamic media contains audio (music and MP3 sounds) and video (movies). Multimedia mining refers to the analysis of a large amount of multimedia information in order to extract patterns based on their statistical relationships. Figure 1 shows the categories of multimedia data mining.
Text mining
Text mining, also referred to as text data mining, is used to find meaningful information in unstructured texts from various sources. Text is the most general medium for the exchange of information. Text mining evaluates huge amounts of natural-language text and detects specific patterns in order to extract useful information.
Image mining
Image mining systems can discover meaningful information or image patterns from a huge collection

List any two challenges/issues of multimedia mining. Differentiate between web usage mining and web content mining.

Challenges/issues in multimedia data mining include content-based retrieval and similarity search, generalization, and multidimensional analysis.
CHALLENGES/ISSUES IN MULTIMEDIA MINING
Major issues in multimedia data mining include content-based retrieval, similarity search, dimensional analysis, classification, prediction analysis, and mining associations in multimedia data.
1. Content-based retrieval and similarity search
Content-based retrieval in multimedia is a challenging problem, since multimedia data needs to be analyzed in detail from pixel values. We consider two main families of multimedia retrieval systems for similarity search in multimedia data. (1) A description-based retrieval system builds indices and performs object retrieval based on image descriptions, for example, keywords, captions, size, and time of creation. (2) A content-based retrieval system supports retrieval based on the image content, for example, color histograms, texture, shape, objects, and wavelet transforms

Write the limitations of the Apriori algorithm.

Limitations: The Apriori algorithm is easy to understand, and its join and prune steps are easy to implement on large itemsets in large databases. Along with these advantages, it has a number of limitations. These are:
1. Huge number of candidates: Candidate generation is the inherent cost of the Apriori algorithm, no matter what implementation technique is applied, and it is costly to handle a huge number of candidate sets. For example, if there are 10^4 frequent 1-itemsets, the Apriori algorithm will need to generate more than 10^7 candidate 2-itemsets. Moreover, for 100 items, it must generate more than 2^100, which is approximately 10^30 candidates in total.
2. Multiple scans of the transaction database are needed, so this algorithm is not a good choice for mining long patterns from large data sets.
3. When the database is scanned to check C_k in order to create F_k, a large number of transactions will be scanned even if they do not contain any k-itemset.
OR, Lim
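To make the first limitation concrete, a small calculation (my own, following the usual Apriori analysis) shows how quickly the candidate set grows: 10^4 frequent 1-itemsets already yield on the order of 5 x 10^7 candidate 2-itemsets from the self-join step.

public class AprioriCandidateCount {
    public static void main(String[] args) {
        long f1 = 10_000;                       // number of frequent 1-itemsets
        long candidate2 = f1 * (f1 - 1) / 2;    // pairs produced by the L1 x L1 self-join
        System.out.println("Candidate 2-itemsets: " + candidate2);  // 49,995,000

        // With 100 items, the total number of possible itemsets is 2^100 - 1
        double allItemsets = Math.pow(2, 100);
        System.out.println("Possible itemsets over 100 items: about " + allItemsets);
    }
}

Every one of those candidates must also be checked against the database, which is why the multiple-scan cost in limitation 2 compounds the candidate-generation cost.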

Explain the different components of a data warehouse. How is data cube precomputation performed? Describe.

Components of a Data Warehouse
A typical data warehouse has four main components: a central database, ETL (extract, transform, and load) tools, metadata, and access tools. All of these components are engineered for speed so that we can get results quickly and analyze data on the fly.
Figure: Components of a data warehouse
The figure shows the essential elements of a typical warehouse. We see the ETL component on the left. The data staging element serves as the next building block. In the middle, we see the data storage component that handles the data warehouse's data. This element not only stores and manages the data; it also keeps track of the data using the metadata repository. The information delivery component, shown on the right, consists of all the different ways of making the information from the data warehouse available to the users. The four major components of a data warehouse are listed below:
1. Central database
A database serves as the foundation

Short note on Mining Data Streams.

Mining Data Streams
Stream data refers to data that flows into a system in vast volumes, changes dynamically, is possibly infinite, and contains multidimensional features. Such data cannot be stored in traditional database systems; moreover, most systems may only be able to read the stream once, in sequential order. This poses great challenges for the effective mining of stream data. Substantial research has led to progress in the development of efficient methods for mining data streams in the areas of mining frequent and sequential patterns, multidimensional analysis (e.g., the construction of stream cubes), classification, clustering, outlier analysis, and the online detection of rare events in data streams. The general philosophy is to develop single-scan or few-scan algorithms using limited computing and storage capabilities. This includes collecting information about stream data in sliding windows or tilted time windows (where the most recent data are registered at the finest granularity and the more distant data at coarser granularity).
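A sliding window of the kind mentioned above can be sketched in a few lines (my own illustration with a made-up stream of readings): only the most recent w values are kept, and a running statistic is updated as old items fall out of the window.

import java.util.ArrayDeque;
import java.util.Deque;

public class SlidingWindowSketch {
    public static void main(String[] args) {
        int windowSize = 3;                       // keep only the 3 most recent readings
        Deque<Double> window = new ArrayDeque<>();
        double sum = 0.0;                         // running sum over the current window

        double[] stream = {5, 7, 6, 50, 8, 7};    // made-up sensor stream
        for (double x : stream) {
            window.addLast(x);
            sum += x;
            if (window.size() > windowSize) sum -= window.removeFirst(); // expire the oldest value
            System.out.printf("window=%s mean=%.2f%n", window, sum / window.size());
        }
    }
}

Each reading is processed exactly once and only a constant amount of state is retained, which is the single-scan, bounded-memory style of computation that stream mining algorithms rely on.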