Short note on Data cleaning.
Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. If users believe the data are dirty, they are unlikely to trust the results of any data mining that has been applied. Furthermore, dirty data can cause confusion for the mining procedure, resulting in unreliable output. Although most mining routines have some procedures for dealing with incomplete or noisy data, they are not always robust. Instead, they may concentrate on avoiding overfitting the data to the function being modeled. Therefore, a useful preprocessing step is to run your data through some data cleaning routines.
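As a rough illustration of the routines named above, the following sketch fills in missing values, removes outliers, and smooths the remaining noise on a small list of readings. The function names, the sample `readings`, the median fill, the 1.5 x IQR outlier rule, and the moving-average window are all assumptions chosen for the example; real routines would pick these per dataset.

```python
from statistics import median, quantiles

def fill_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    mid = median(observed)
    return [mid if v is None else v for v in values]

def remove_outliers(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR] (an assumed rule of thumb)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

def smooth(values, window=3):
    """Smooth noise with a trailing moving average over up to `window` points."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Hypothetical sensor readings: two missing values and one obvious outlier.
readings = [10.0, None, 11.0, 10.5, 95.0, 9.5, None, 10.2]

filled = fill_missing(readings)     # no None values remain
cleaned = remove_outliers(filled)   # the 95.0 reading is dropped
smoothed = smooth(cleaned)          # same length as cleaned, less jitter
```

The order matters in this sketch: missing values are filled first so that the quartiles are computed on a complete list, and smoothing runs last so that the outlier does not leak into its neighbours.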
- Data cleaning is a technique for removing noisy data and correcting inconsistencies in the data. It involves transformations that correct erroneous values.
- Data cleaning is performed as a preprocessing step when preparing data for a data warehouse. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled. If the data are incorrect, outcomes and algorithms are unreliable, even though they may look correct.
- There is no single way to prescribe the exact steps of the data cleaning process, because the steps vary from dataset to dataset.
- Even so, it is crucial to establish a template for your data cleaning process so you know you are doing it the right way every time.
- Raw data can have many irrelevant and missing parts. Data cleaning deals with these by handling missing data, noisy data, and so on.
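One way to picture such a template is a small, fixed sequence of steps applied the same way to every batch of records. The sketch below is only an assumed example: the function names, the field names (`name`, `city`, `note`), and the sample records are hypothetical, and the steps shown (normalizing labels, dropping irrelevant fields, removing duplicates) would be adapted to the dataset at hand.

```python
def normalize(record):
    """Trim whitespace and unify case so 'Pune ' and 'pune' compare equal."""
    return {k: v.strip().lower() if isinstance(v, str) else v
            for k, v in record.items()}

def deduplicate(records, key_fields):
    """Keep the first record seen for each combination of key fields."""
    seen = set()
    unique = []
    for r in records:
        key = tuple(r.get(f) for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

def clean(records, key_fields, keep_fields):
    """Template: normalize labels, drop irrelevant fields, remove duplicates."""
    normalized = [normalize(r) for r in records]
    trimmed = [{f: r.get(f) for f in keep_fields} for r in normalized]
    return deduplicate(trimmed, key_fields)

# Two hypothetical sources that describe the same person with different labels.
source_a = [{"name": "Ann Lee", "city": "Pune ", "note": "x"}]
source_b = [{"name": "ann lee", "city": "pune", "note": "y"},
            {"name": "Raj Rao", "city": "Delhi", "note": ""}]

result = clean(source_a + source_b,
               key_fields=("name", "city"),
               keep_fields=("name", "city"))
```

Because the steps live in one `clean` function, every combined dataset goes through the same process, which is the point of having a template.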