Short note on Data cleaning.

- December 25, 2021

Data cleaning

Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. If users believe the data are dirty, they are unlikely to trust the results of any data mining that has been applied. Furthermore, dirty data can cause confusion for the mining procedure, resulting in unreliable output. Although most mining routines have some procedures for dealing with incomplete or noisy data, they are not always robust. Instead, they may concentrate on avoiding overfitting the data to the function being modeled. Therefore, a useful preprocessing step is to run your data through some data cleaning routines.

OR,

Data cleaning is a technique that is applied to remove the noisy data and correct the inconsistencies in data. It involves transformations to correct the wrong data.

Data cleaning is performed as a data preprocessing step while preparing the data for a data warehouse. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled. If the data is incorrect, outcomes and algorithms are unreliable, even though they may look correct.

There is no one absolute way to prescribe the exact steps in the data cleaning process because the processes will vary from dataset to dataset.

But it is crucial to establish a template for your data cleaning process so you know you are doing it the right way every time. The data can have many irrelevant and missing parts.

Data cleaning is done to handle these parts. It involves handling missing data, noisy data, etc.

a) Missing Data

This situation arises when some values are missing in the data. It can be handled in various ways. Some of them are:

Ignore the tuples: This approach is suitable only when the dataset we have is quite large and multiple values are missing within a tuple.

Fill the Missing values: There are various ways to do this task. You can choose to fill the missing values manually, with the attribute mean, or with the most probable value.
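The two approaches above can be sketched in plain Python. This is only an illustration: the records, the `max_missing` threshold, and the helper names `drop_sparse` and `fill_missing` are assumptions made for the example, and the most frequent value is used as a simple stand-in for the "most probable value".

```python
from statistics import mean, mode

# Hypothetical records; None marks a missing value.
rows = [
    {"age": 25, "city": "Pokhara"},
    {"age": None, "city": "Kathmandu"},
    {"age": 35, "city": None},
    {"age": None, "city": None},
    {"age": 30, "city": "Kathmandu"},
]

def drop_sparse(rows, max_missing=1):
    """Ignore tuples that have more than max_missing missing values."""
    return [r for r in rows
            if sum(v is None for v in r.values()) <= max_missing]

def fill_missing(rows, column, strategy):
    """Fill missing values in one column with strategy(known values)."""
    known = [r[column] for r in rows if r[column] is not None]
    fill = strategy(known)
    return [{**r, column: fill if r[column] is None else r[column]}
            for r in rows]

kept = drop_sparse(rows)                    # drops the tuple missing both values
kept = fill_missing(kept, "age", mean)      # fill with the attribute mean
cleaned = fill_missing(kept, "city", mode)  # fill with the most frequent value
```

Here the fill values are computed only from the tuples that survive the first step, so the order of the two steps affects the result.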

b) Noisy Data

Noisy data is meaningless data that can't be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways:

Binning Method: This method works on sorted data in order to smooth it. The sorted data is divided into segments (bins) of equal size, and each segment is handled separately. All values in a segment can be replaced by the segment mean, or each value can be replaced by the nearest boundary value of its segment.
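A minimal sketch of binning with both smoothing variants follows. The price values and the function names are assumptions chosen for illustration, and ties in the boundary variant are resolved toward the lower boundary.

```python
from statistics import mean

def equal_size_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the nearer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # each bin is already sorted
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical data
bins = equal_size_bins(prices, 3)

by_means = smooth_by_means(bins)
# [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
by_boundaries = smooth_by_boundaries(bins)
# [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```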

Regression: Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
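As a rough illustration of the linear case, the sketch below fits a straight line by ordinary least squares and replaces each observed value with the value on the line. The sample points and the name `fit_line` are assumptions made for the example.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (one independent variable)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical noisy measurements that roughly follow a line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)
smoothed = [slope * x + intercept for x in xs]  # values on the fitted line
```

For this data the fitted slope is about 1.97 and the intercept about 0.09, so the smoothed values lie exactly on that line.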

Clustering: This approach groups similar data into clusters. Outliers may go undetected, or they will fall outside the clusters.
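One simple way to picture this is a gap-based grouping of one-dimensional values, sketched below. It is not a full clustering algorithm such as k-means; the `max_gap` and `min_size` parameters, the function names, and the readings are all assumptions made for illustration.

```python
def cluster_1d(values, max_gap):
    """Group sorted values; a gap larger than max_gap starts a new cluster."""
    ordered = sorted(values)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)   # close enough: same cluster
        else:
            clusters.append([v])     # too far: start a new cluster
    return clusters

def find_outliers(values, max_gap, min_size=2):
    """Report values whose cluster has fewer than min_size members."""
    return [v for c in cluster_1d(values, max_gap)
            if len(c) < min_size for v in c]

readings = [10, 11, 12, 13, 50, 51, 52, 200]  # hypothetical data
clusters = cluster_1d(readings, max_gap=5)
# [[10, 11, 12, 13], [50, 51, 52], [200]]
outliers = find_outliers(readings, max_gap=5)
# [200]
```

With a larger `max_gap` the value 200 could be absorbed into a neighbouring cluster and go undetected, which matches the caveat above.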
