What do you mean by knowledge discovery in databases (KDD)?

KNOWLEDGE DISCOVERY IN DATABASES (KDD)

Knowledge discovery in databases (KDD) is the process of discovering useful knowledge from a collection of data. This widely used process includes data preparation and selection, data cleansing, incorporating prior knowledge about the data sets, and interpreting accurate solutions from the observed results. Major KDD application areas include marketing, fraud detection, telecommunication, and manufacturing. Data mining is the core part of the knowledge discovery process (KDP): the process finds knowledge in data by applying data mining methods (algorithms) to extract the required knowledge from a large amount of data.

Simply stated, data mining refers to extracting or "mining" knowledge from large amounts of data stored in databases, data warehouses, or other information repositories. Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data, or KDD. Alternatively, others view data mining as simply an essential step in the process of knowledge discovery. Knowledge discovery consists of an iterative sequence of the following steps:



Data Cleaning: The first step in the knowledge discovery process is data cleaning, in which noise and inconsistent data are removed.

  • Cleaning in case of missing values.
  • Cleaning noisy data, where noise is random error or variance in a measured variable.
  • Cleaning with Data discrepancy detection and Data transformation tools.
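
The first two bullets can be illustrated with a minimal sketch. The readings, the function name `clean`, and the choice of a median-based rule (values farther than `k` median absolute deviations from the median count as noise) are assumptions made for this example; real cleaning rules depend on the data.

```python
from statistics import median

# Hypothetical sensor readings; None marks a missing value.
readings = [21.0, 22.5, None, 23.0, 250.0, 22.0, None, 21.5]

def clean(values, k=3.0):
    """Replace missing and noisy values with the median of the clean values.

    A value counts as noisy when it lies more than k median absolute
    deviations (MAD) from the median of the observed values.
    """
    observed = [v for v in values if v is not None]
    med = median(observed)
    mad = median([abs(v - med) for v in observed])

    def is_clean(v):
        return v is not None and abs(v - med) <= k * mad

    fill = median([v for v in observed if is_clean(v)])
    return [v if is_clean(v) else fill for v in values]

cleaned = clean(readings)
print(cleaned)
```

The median is used instead of the mean so that the single extreme reading does not distort the fill value.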

Data Integration: Data integration is defined as combining heterogeneous data from multiple sources into a common source (a data warehouse).

  • Data integration using Data Migration tools.
  • Data integration using Data Synchronization tools.
  • Data integration using the ETL (Extract-Transform-Load) process.
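
As a rough illustration of the ETL idea, the sketch below merges two sources with different schemas into one common record format. The source tables, field names, and the in-memory list standing in for the warehouse are all hypothetical.

```python
# Two hypothetical sources describing the same customers with different schemas.
crm_rows = [
    {"CustomerID": "17", "FullName": "Customer A", "Country": "uk"},
    {"CustomerID": "42", "FullName": "Customer B", "Country": "UK"},
]
billing_rows = [
    {"cust_id": 42, "total_spent": "120.50"},
    {"cust_id": 17, "total_spent": "80.00"},
    {"cust_id": 99, "total_spent": "15.25"},
]

def extract():
    """Extract: read the raw rows from each source."""
    return crm_rows, billing_rows

def transform(crm, billing):
    """Transform: bring both sources to one schema keyed by integer customer id."""
    merged = {}
    for row in crm:
        cid = int(row["CustomerID"])
        merged[cid] = {
            "customer_id": cid,
            "name": row["FullName"],
            "country": row["Country"].upper(),
            "total_spent": 0.0,
        }
    for row in billing:
        cid = row["cust_id"]
        empty = {"customer_id": cid, "name": None, "country": None, "total_spent": 0.0}
        record = merged.setdefault(cid, empty)
        record["total_spent"] += float(row["total_spent"])
    return [merged[cid] for cid in sorted(merged)]

def load(records, warehouse):
    """Load: append the unified records to the target store."""
    warehouse.extend(records)
    return warehouse

warehouse = load(transform(*extract()), [])
for record in warehouse:
    print(record)
```

Customer 99 appears only in the billing source, so its name and country stay `None`; deciding how to handle such unmatched records is part of the integration design.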


Data Selection: Data selection is defined as the process where data relevant to the analysis is identified and retrieved from the database.

  • Data selection using neural networks.
  • Data selection using Decision Trees.
  • Data selection using Naive Bayes.
  • Data selection using Clustering, Regression, etc. 
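
The model-based techniques listed above are beyond a short example, so the sketch below shows only the simplest form of data selection: keeping the rows and attributes relevant to one analysis task. The records, the `select` helper, and the selection criteria are hypothetical.

```python
# Hypothetical customer records; "notes" is irrelevant to the analysis.
records = [
    {"id": 1, "age": 34, "region": "north", "spend": 120.0, "notes": "call back"},
    {"id": 2, "age": 51, "region": "south", "spend": 40.0, "notes": ""},
    {"id": 3, "age": 29, "region": "north", "spend": 75.5, "notes": "vip"},
]

def select(rows, columns, predicate):
    """Keep rows that satisfy predicate and only the named columns."""
    return [{c: row[c] for c in columns} for row in rows if predicate(row)]

# Task-relevant data: age and spend of customers in the north region.
relevant = select(records, ["id", "age", "spend"], lambda r: r["region"] == "north")
print(relevant)
```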

Data Transformation: In Data Transformation, data are transformed into forms appropriate for mining by performing summary or aggregation operations. Data Transformation is a two-step process: 

  • Data Mapping: Assigning elements from the source to the destination to capture transformations.
  • Code generation: Creation of the actual transformation program. 
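
The two steps, followed by an aggregation, might be sketched as below. The mapping table `FIELD_MAP`, the field names, and the sample rows are assumptions made for illustration; here "code generation" is reduced to building a transformation function from the mapping.

```python
from collections import defaultdict

# Data mapping: source field -> (destination field, conversion function).
FIELD_MAP = {
    "cust": ("customer_id", int),
    "amt": ("amount", float),
    "dt": ("month", lambda s: s[:7]),  # keep only the "YYYY-MM" part of the date
}

def make_transformer(field_map):
    """Code generation: build the transformation function from the mapping."""
    def transform(row):
        return {dest: convert(row[src]) for src, (dest, convert) in field_map.items()}
    return transform

def aggregate(rows):
    """Summarize the transformed rows as total amount per customer and month."""
    totals = defaultdict(float)
    for r in rows:
        totals[(r["customer_id"], r["month"])] += r["amount"]
    return dict(totals)

source = [
    {"cust": "1", "amt": "10.50", "dt": "2024-03-02"},
    {"cust": "1", "amt": "4.50", "dt": "2024-03-20"},
    {"cust": "2", "amt": "7.00", "dt": "2024-04-01"},
]
transform = make_transformer(FIELD_MAP)
monthly_totals = aggregate(transform(r) for r in source)
print(monthly_totals)
```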

Data Mining: In Data Mining, data mining methods (algorithms) are applied in order to extract data patterns.

  • Transforms task-relevant data into patterns. 
  • Decides purpose of the model using classification or characterization.
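
One common kind of pattern is the frequent itemset. The sketch below counts every itemset of up to two items by brute force and keeps those that meet a minimum support; it is not the Apriori algorithm, which prunes candidates, and the transactions and threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
    {"bread", "butter", "milk"},
]

def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets whose support >= min_support."""
    counts = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            # Sorting makes each itemset a canonical, hashable tuple.
            for itemset in combinations(sorted(t), size):
                counts[itemset] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

patterns = frequent_itemsets(transactions, min_support=0.5)
for itemset, support in sorted(patterns.items()):
    print(itemset, support)
```

Itemsets containing "jam" occur in only one of five transactions, so they fall below the threshold and are dropped.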

Pattern Evaluation: In Pattern Evaluation, the truly interesting patterns are identified based on interestingness measures.

  • Finds an interestingness score for each pattern.
  • Uses summarization and visualization to make data understandable by the user.
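
Support, confidence, and lift are widely used interestingness measures for association rules. The sketch below computes them for one rule; the transactions and the function name `rule_metrics` are hypothetical.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
    {"bread", "butter", "milk"},
]

def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    n_ante = sum(1 for t in transactions if antecedent <= t)
    n_cons = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    if n_ante == 0 or n_cons == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    support = n_both / n            # how often the whole rule holds
    confidence = n_both / n_ante    # how often the consequent follows the antecedent
    lift = confidence / (n_cons / n)  # confidence relative to the consequent's base rate
    return support, confidence, lift

support, confidence, lift = rule_metrics(transactions, {"bread"}, {"milk"})
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift below 1 suggests that the antecedent makes the consequent slightly less likely than its base rate, so a rule can have high confidence and still be uninteresting.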

Knowledge Presentation: In Knowledge Presentation, the mined knowledge is presented to the user using visualization and knowledge representation techniques.

  • Generate reports. 
  • Generate tables.
  • Generate discriminant rules, classification rules, characterization rules, etc.
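
As a small illustration of presenting rules as a table, the sketch below renders a plain-text report. The rules and their scores are illustrative values, and `format_table` is a hypothetical helper.

```python
def format_table(headers, rows):
    """Render headers and rows as an aligned plain-text table."""
    cells = [list(headers)] + [[str(v) for v in row] for row in rows]
    widths = [max(len(r[i]) for r in cells) for i in range(len(headers))]
    lines = [" | ".join(c.ljust(w) for c, w in zip(r, widths)) for r in cells]
    # Separator line between the header and the body.
    lines.insert(1, "-+-".join("-" * w for w in widths))
    return "\n".join(lines)

# Illustrative rules with (support, confidence) scores.
rules = [
    ("bread -> milk", 0.60, 0.75),
    ("butter -> milk", 0.60, 0.75),
]
report = format_table(
    ["rule", "support", "confidence"],
    [(rule, f"{s:.2f}", f"{c:.2f}") for rule, s, c in rules],
)
print(report)
```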

OR,

The KDD process can also be described as the following sequence of steps:



1. Identify the goal of the KDD process from the customer's perspective.

2. Understand the application domains involved and the knowledge that's required.

3. Select a target data set or a subset of data samples on which discovery is to be performed.

4. Cleanse and preprocess data by deciding strategies to handle missing fields and alter the data as per the requirements.

5. Simplify the data sets by removing unwanted variables. Then, analyze useful features that can be used to represent the data, depending on the goal or task.

6. Match KDD goals with data mining methods to suggest hidden patterns.

7. Choose data mining algorithms to discover hidden patterns. This process includes deciding which models and parameters might be appropriate for the overall KDD process.

8. Search for patterns of interest in a particular representational form, such as classification rules or trees, regression, and clustering.

9. Interpret essential knowledge from the mined patterns.

10. Use the knowledge and incorporate it into another system for further action.

11. Document it and make reports for interested parties.
