Explain the various metrics used to measure the performance of a classifier. / Describe various evaluation metrics for the predictive accuracy of a classifier.

 METHODS FOR ESTIMATING A CLASSIFIER'S ACCURACY (VALIDATION METHODS)

After data gathering, cleansing, exploration, and feature engineering, but before starting to train models, it is crucial to have a method for evaluating models that does not involve the data used to train them. Training a model and testing it on the same data is a methodological mistake: a model that simply repeated the labels of the samples it had just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, common methods include:

1. Holdout method: The available data set D is divided into two disjoint subsets: the training set, for learning a model, and the test set, for testing it. Normally, 2/3 of D is used as the training set and the remaining 1/3 as the test set. This method is mainly used when the data set D is large.
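A minimal sketch of the holdout split, assuming scikit-learn and its bundled iris data purely for illustration:

```python
# Minimal sketch of the holdout method (the dataset, classifier,
# and 1/3 test fraction are illustrative assumptions).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out 1/3 of D as the test set; train on the remaining 2/3.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42)

model = DecisionTreeClassifier().fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_test, y_test))
```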


The advantage of the holdout method is its simplicity. Its disadvantage is data loss: the test data cannot be used for learning, so part of the data never contributes to the model, and the results can vary a lot depending on how the data is divided. The solution to this problem is random sampling.


2. Random sampling (random subsampling): In this technique, the holdout split is repeated several times: in each iteration, data items are randomly chosen from the dataset to form a test set, and the remaining data forms the training set. The error rate of the model is the average of the error rates of the iterations.
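As a sketch, scikit-learn's ShuffleSplit performs exactly this kind of repeated random splitting (the 10 iterations and the iris data are assumptions):

```python
# Minimal sketch of random subsampling: repeat the holdout split
# several times and average the accuracies.
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 10 independent random 2/3 : 1/3 splits of the data.
splitter = ShuffleSplit(n_splits=10, test_size=1/3, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=splitter)
print("Mean accuracy over 10 random splits:", scores.mean())
```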

This method addresses the variance problem of the holdout method, but there is no limit on the number of times each sample can appear in the training or test set; some samples may be used for learning more often than others, and this imbalance can be reflected in the classifier.

When evaluating different hyperparameters for estimators, such as the C parameter that must be set manually for an SVM, there is still a risk of overfitting on the test set, because the parameters can be tweaked until the estimator performs optimally. In this way, knowledge about the test set can "leak" into the model, and evaluation metrics no longer report generalization performance. The solution to this problem is to use a validation set.


3. Validation set: In this method, the available data is divided into three subsets: a training set, a validation set, and a test set. Training proceeds on the training set, evaluation is done on the validation set, and when the experiment seems successful, a final evaluation is done on the test set. A validation set is frequently used for estimating hyperparameters of learning algorithms; the values that give the best accuracy on the validation set are used as the final parameter values.
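A minimal sketch of the three-way split, tuning the SVM C parameter mentioned above on the validation set (the 60/20/20 proportions and the candidate C values are assumptions):

```python
# Minimal sketch of a train / validation / test protocol.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# First carve off 20% as the final test set ...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
# ... then split the remainder into training and validation sets
# (0.25 of the remaining 80% gives a 60/20/20 split overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

def val_score(C):
    # Train on the training set, evaluate on the validation set.
    return SVC(C=C).fit(X_train, y_train).score(X_val, y_val)

# Pick the C value that scores best on the validation set.
best_C = max([0.01, 0.1, 1, 10, 100], key=val_score)

# Only now touch the test set, once, for the final estimate.
final = SVC(C=best_C).fit(X_train, y_train)
print("Best C:", best_C, "Test accuracy:", final.score(X_test, y_test))
```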

However, by partitioning the available data into three sets, we drastically reduce the number of samples that can be used for learning the model, and the results can depend on a particular random choice of (train, validation) split. A solution to this problem is a procedure called cross-validation.


4. Cross-validation: In this method, the available data is partitioned into k equal-sized disjoint subsets (folds). Each subset in turn is used as the test set, and the remaining k-1 subsets are combined into the training set to learn a classifier. The procedure is run k times, giving k accuracies; the final estimated accuracy is the average of the k accuracies. 10-fold and 5-fold cross-validation are commonly used. This method is used when the available data set is not large.
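A minimal 10-fold cross-validation sketch, again assuming scikit-learn and an illustrative dataset and classifier:

```python
# Minimal sketch of k-fold cross-validation with k = 10.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each fold is the test set once; the other 9 form the training set.
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)
print("Per-fold accuracies:", scores)
print("Estimated accuracy:", scores.mean())
```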

The main advantage of the cross-validation technique is that all of the data is used for both training and testing. However, the execution time and the amount of computation increase as the value of k increases.


5. Leave-one-out cross-validation: This is a special case of cross-validation in which each fold contains only a single test example and all the rest of the data is used for training. If the original data has N examples, this is N-fold cross-validation. This method is used when the data set is very small. Its main advantage is that as much data as possible is used for training, which increases the chance of building an accurate classifier; its drawback is that it requires a great deal of execution time and computation.
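A minimal sketch using scikit-learn's LeaveOneOut splitter (the iris data, with N = 150, is an assumption):

```python
# Minimal sketch of leave-one-out cross-validation: with N examples,
# this runs N train/test rounds, each testing on a single sample.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # N = 150, so 150 rounds

scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", scores.mean())
```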



6. Bootstrapping: In this technique, the training dataset is sampled randomly with replacement from the available data; the examples that were never selected for training (the "out-of-bag" examples) are used for testing. Unlike k-fold cross-validation, the size of the test set is likely to change from iteration to iteration. The error rate of the model is the average of the error rates of the iterations.
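A minimal single-iteration sketch of bootstrap validation with NumPy and scikit-learn (in practice the iteration is repeated and the error rates averaged; the dataset and classifier are assumptions):

```python
# Minimal sketch of one bootstrap iteration: draw N training indices
# with replacement, then test on the out-of-bag examples (never drawn).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

n = len(X)
boot = rng.integers(0, n, size=n)        # indices drawn with replacement
oob = np.setdiff1d(np.arange(n), boot)   # out-of-bag (unseen) indices

model = DecisionTreeClassifier().fit(X[boot], y[boot])
print("Out-of-bag accuracy:", model.score(X[oob], y[oob]))
```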


