Discuss overfitting and underfitting. How are precision and recall used to evaluate a classifier?

  Overfitting

  • Overfitting means the model has a high accuracy score on training data but a low score on test data. An overfit model has essentially memorized the data set it was trained on and is unable to generalize what it learned to an unseen data set, which is why it gives very poor test accuracy. This typically occurs when the model is highly complex, i.e., it uses a very large number of input feature combinations, which makes it excessively flexible.
  • Overfitting happens when the algorithm used to build a prediction model is very complex and has over-learned the patterns in the training data. Overfitting is an error that arises from sensitivity to small fluctuations in the training set: it can cause an algorithm to model the random noise in the training data rather than the intended relationship. In classification, overfitting happens when the algorithm is strongly influenced by the specifics of the training data and learns patterns that are noisy, do not generalize, and are limited to the training data set.

For example, as shown in the figure below, a model trained to classify the circles and crosses learns the training data too well: it even classifies the noise in the data by forming an excessively complex decision boundary (right).
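A minimal sketch of this behaviour, using a synthetic two-class data set and an unconstrained decision tree (both are illustrative assumptions, not the figure's exact setup): the tree is free to carve out regions around individual noisy points, so it scores far higher on the training set than on the held-out test set.

```python
# Sketch: an unconstrained decision tree memorizing noisy training data (overfitting).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data (a stand-in for the circles vs. crosses) with added noise.
X, y = make_moons(n_samples=300, noise=0.35, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# No depth limit, so the tree can fit the noise in the training set.
overfit_model = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_train, y_train)

print("train accuracy:", overfit_model.score(X_train, y_train))  # typically close to 1.0
print("test accuracy:", overfit_model.score(X_test, y_test))     # noticeably lower
```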

Underfitting

  • Underfitting means the model has a low accuracy score on both training data and test data. An underfit model fails to capture the relationship between the input values and the target variable.
  • Underfitting happens when the algorithm used to build the prediction model is too simple to learn the complex patterns in the training data. In that case, accuracy is low on the seen training data as well as on unseen test data. It commonly happens with linear algorithms.
  • An underfit model makes incorrect assumptions about the data set in order to make the target function easier to learn. If the training data distribution is nonlinear and you apply a linear algorithm to build the prediction model, it will not be able to learn the nonlinear relationship between the target value and the features, and model accuracy will suffer.
  • An underfit model is very simple because of the assumptions made about the data: it pays very little attention to the training data and oversimplifies the problem. Such a high-bias model always leads to high error on training as well as test data.
  • For example, as shown in the figure below, the model is trained to classify the circles and crosses, but it is unable to do so because its straight-line decision boundary fails to separate the two classes properly (see the code sketch after this list).
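A minimal sketch of underfitting under the same assumptions as above (synthetic data, scikit-learn): a logistic regression draws a single straight decision boundary, which cannot separate concentric ring-shaped classes, so accuracy is low on both training and test data.

```python
# Sketch: a linear model that is too simple for nonlinear data (underfitting).
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic nonlinear data: two concentric rings, one per class.
X, y = make_circles(n_samples=300, noise=0.1, factor=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A straight-line decision boundary cannot separate the inner ring from the outer one.
underfit_model = LogisticRegression().fit(X_train, y_train)

print("train accuracy:", underfit_model.score(X_train, y_train))  # near chance level (~0.5)
print("test accuracy:", underfit_model.score(X_test, y_test))     # similarly low
```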


Precision and recall are used to evaluate a classifier in the following ways:

Precision: Precision is a measure of exactness, i.e., what percentage of the test data tuples labeled as class 1 actually belong to class 1. From the counts in the confusion matrix it is calculated as:

Precision = TP / (TP + FP)

Recall: Recall is the true positive rate, i.e., the proportion of class 1 test data tuples that are correctly identified. From the counts in the confusion matrix it is calculated as:

Recall = TP / (TP + FN)
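A minimal sketch applying the two formulas above to hypothetical confusion-matrix counts (the TP, FP, and FN values are made up purely for illustration):

```python
# Precision and recall computed directly from confusion-matrix counts.
def precision(tp: int, fp: int) -> float:
    # Precision = TP / (TP + FP): of everything labeled positive, how much really is positive.
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    # Recall = TP / (TP + FN): of all actual positives, how many were found.
    return tp / (tp + fn) if (tp + fn) else 0.0

tp, fp, fn = 40, 10, 20  # hypothetical counts for illustration
print("precision:", precision(tp, fp))  # 40 / 50 = 0.8
print("recall:", recall(tp, fn))        # 40 / 60 ≈ 0.67
```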

Precision and recall give a better sense of how a classifier is actually doing, especially on a highly imbalanced dataset. If a classifier predicts class 2 all the time and still achieves 99.5% accuracy, its precision and recall are both 0 because there are no true positives, so it is not a good classifier. When both precision and recall are high, that indicates the classifier is doing very well.
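A minimal sketch of that imbalanced case using scikit-learn's metrics (the 99.5% split is an illustrative assumption, and class 2 is mapped to the label 0 here): a "classifier" that always predicts the majority class scores high on accuracy but gets zero precision and zero recall for the rare positive class.

```python
# Accuracy vs. precision/recall on a highly imbalanced dataset.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = np.array([1] * 5 + [0] * 995)  # 0.5% positives (class 1), 99.5% negatives (class 2 -> label 0)
y_pred = np.zeros_like(y_true)          # always predict the majority (negative) class

print("accuracy:", accuracy_score(y_true, y_pred))                     # 0.995
print("precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0 (no true positives)
print("recall:", recall_score(y_true, y_pred, zero_division=0))        # 0.0
```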

However, evaluating a model with recall and precision does not use all cells of the confusion matrix. Recall deals with true positives and false negatives, while precision deals with true positives and false positives, so true negatives are never taken into account by this pair of measures. Precision and recall should therefore only be used in situations where the correct identification of the negative class (class 2) does not play a role.
