ISSUES IN CLASSIFICATION / What are the major issues in preparing the data for Classification and Prediction?
Issues regarding classification and prediction can be divided into two categories:
1. Data Preparation Issues: The following preprocessing steps may be applied to the data to help improve the accuracy, efficiency, and scalability of the classification or prediction process.
a. Data cleaning: This refers to the preprocessing of data in order to remove or reduce noise and the treatment of missing values. Although most classification algorithms have some mechanisms for handling noisy or missing data, this step can help reduce confusion during learning.
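One common treatment of missing values, mentioned later in these notes, is to substitute the most frequently occurring value for that attribute. A minimal sketch (the `fill_missing` helper and the sample data are illustrative, not from any particular library):

```python
from collections import Counter

def fill_missing(values, missing=None):
    # Replace missing entries with the most frequent observed value (the mode).
    # Real pipelines may instead use the mean, the median, or a learned model.
    observed = [v for v in values if v is not missing]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is missing else v for v in values]

colors = ["red", "blue", None, "red", None]
print(fill_missing(colors))  # ['red', 'blue', 'red', 'red', 'red']
```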
b. Relevance analysis (feature selection): Many of the attributes in the data may be redundant. Correlation analysis can be used to identify whether any two given attributes are statistically related. For example, a strong correlation between attributes A1 and A2 would suggest that one of the two could be removed from further analysis. A database may also contain irrelevant attributes. Attribute subset selection can be used in these cases to find a reduced set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes. Hence, relevance analysis, in the form of correlation analysis and attribute subset selection, can be used to detect attributes that do not contribute to the classification or prediction task. Such analysis can help improve classification efficiency and scalability.
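The correlation analysis described above can be illustrated with the Pearson correlation coefficient between two numeric attributes; a coefficient near ±1 would flag one of the attributes as a candidate for removal. A self-contained sketch (function and threshold choices are illustrative):

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient between two numeric attributes.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

a1 = [1, 2, 3, 4, 5]
a2 = [2, 4, 6, 8, 10]   # a2 is a linear function of a1, so r = 1
print(round(pearson(a1, a2), 3))  # 1.0
```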
c. Data transformation: The data may be transformed by normalization, particularly when neural networks or methods involving distance measurements are used in the learning step. Normalization involves scaling all values for a given attribute so that they fall within a small specified range, such as -1 to +1 or 0 to 1. The data can also be transformed by generalizing it to higher-level concepts. Concept hierarchies may be used for this purpose. This is particularly useful for continuous-valued attributes. For example, numeric values for the attribute income can be generalized to discrete ranges, such as low, medium, and high. Similarly, categorical attributes, like street, can be generalized to higher-level concepts, like city. Data can also be reduced by applying many other methods, ranging from wavelet transformation and principal components analysis to discretization techniques, such as binning, histogram analysis, and clustering.
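The two transformations above can be sketched briefly: min-max normalization scales an attribute into a small range such as 0 to 1, and discretization generalizes numeric income into low/medium/high ranges (the cut-off values here are made up for illustration):

```python
def min_max(values, new_min=0.0, new_max=1.0):
    # Min-max normalization: scale all values into [new_min, new_max].
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def generalize_income(v):
    # Generalize numeric income to a higher-level concept.
    # The 30,000 / 80,000 boundaries are illustrative assumptions.
    if v < 30000:
        return "low"
    if v < 80000:
        return "medium"
    return "high"

incomes = [20000, 45000, 95000]
print(min_max(incomes))
print([generalize_income(v) for v in incomes])  # ['low', 'medium', 'high']
```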
2. Evaluation methods issues
a. Accuracy: The accuracy of a classifier refers to the ability of a given classifier to correctly predict the class label of new or previously unseen data (i.e., tuples without class label information). The accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new or previously unseen data.
b. Speed: Refers to the time taken to construct the model and the time taken to use it.
c. Robustness: Refers to the handling of noise and missing values.
d. Scalability: Refers to efficiency on disk-resident databases.
e. Interpretability: Refers to the level of understanding and insight provided by the model.
f. Goodness of rules (quality): Refers to the size and compactness of the classification rules.
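Classifier accuracy, the first criterion above, is simply the fraction of previously unseen test tuples whose predicted class label matches the true label. A minimal sketch (the labels are made-up sample data):

```python
def accuracy(true_labels, predicted):
    # Fraction of test tuples whose predicted class matches the true class.
    correct = sum(t == p for t, p in zip(true_labels, predicted))
    return correct / len(true_labels)

y_true = ["yes", "no", "yes", "yes"]
y_pred = ["yes", "no", "no", "yes"]
print(accuracy(y_true, y_pred))  # 0.75
```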
OR,
Classification and Prediction Issues
The major issue is preparing the data for Classification and Prediction. Preparing the data involves the following activities:
• Data Cleaning − Data cleaning involves removing noise and treating missing values. Noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.
• Relevance Analysis − A database may also contain irrelevant attributes. Correlation analysis is used to determine whether any two given attributes are statistically related.
• Data Transformation and Reduction − The data can be transformed by any of the following methods:
• Normalization − Normalization involves scaling all values for a given attribute so that they fall within a small specified range. It is used when neural networks or methods involving distance measurements are used in the learning step.
• Generalization − Data can also be transformed by generalizing it to higher-level concepts. Concept hierarchies can be used for this purpose.
OR,
Issues regarding classification and prediction
Issues (1): Data Preparation
- Data cleaning
- Preprocess data in order to reduce noise and handle missing values
- Relevance analysis (feature selection)
- Remove the irrelevant or redundant attributes
- Data transformation
- Generalize and/or normalize data
Issues (2): Evaluating Classification Methods
- Predictive accuracy
- Speed and scalability
- time to construct the model
- time to use the model
- Robustness
- handling noise and missing values
- Scalability
- efficiency in disk-resident databases
- Interpretability
- understanding and insight provided by the model
- Goodness of rules
- decision tree size
- compactness of classification rules