Effective ways to Handle Missing Values in Data - Part 1

Introduction

The real world is not perfect, and neither is real-world data. It often contains many missing values, which can lead to biased and inaccurate results if not handled properly. In addition, many machine-learning algorithms do not accept missing values in their input. It is therefore important to handle missing values appropriately.

In this article, we discuss different methods of missing-value imputation. The next part of the series walks through these concepts in Python code. We group the methods into two categories.

1. Traditional Imputation Methods

2. Advanced Imputation Methods

Traditional Imputation Methods

1. Imputation with Mean, Median, or Mode 

The missing values are imputed with the mean or median of the non-missing values in the same column. If the data contains outliers, prefer the median over the mean, because the median is more robust to outliers. For categorical data, use the mode (the most frequent value) to fill in the missing values.
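As a minimal sketch of the three strategies, the snippet below uses only Python's standard `statistics` module. The `impute` helper and both sample columns are hypothetical and exist only for illustration; the salary column contains a deliberate outlier to show why the median can be the safer choice.

```python
from statistics import mean, median, mode

def impute(values, strategy):
    """Replace None entries with a statistic of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

# Hypothetical salary column with one outlier (250000) and one missing value.
salaries = [17000, 21000, None, 27000, 250000]
mean_filled = impute(salaries, "mean")      # fill value is pulled up by the outlier
median_filled = impute(salaries, "median")  # fill value stays near the typical salaries

# Hypothetical categorical column: the most frequent category fills the gap.
cities = ["Pune", "Delhi", None, "Pune"]
mode_filled = impute(cities, "mode")

print(mean_filled[2], median_filled[2], mode_filled[2])
```

Comparing the two salary fill values shows how a single outlier drags the mean far above every typical salary, while the median stays in the middle of the observed range.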

2. Hot Deck Imputation

Each missing value is filled with the observed value of a similar data point (a "donor") in the same dataset. The method suits small to medium datasets and should be avoided when a large share of the values is missing. To illustrate hot deck imputation, consider the dataset below.

The salary is missing for observations 3 and 5.

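| Observation | Age | Salary (in $) |
|-------------|-----|---------------|
| 1           | 25  | 17000         |
| 2           | 32  | 21000         |
| 3           | 27  | (missing)     |
| 4           | 29  | 27000         |
| 5           | 36  | (missing)     |
| 6           | 38  | 47000         |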


To fill the missing salaries using hot deck imputation, we look for the observation whose age is closest to that of the incomplete row. For age 27, ages 25 and 29 are equally close (each differs by 2), so the choice of donor is a tie; here we pick age 29 and copy its salary, 27000. Similarly, age 36 is closest to age 38, so we fill its missing salary with 47000, the salary for age 38.


| Observation | Age | Salary (in $) |
|-------------|-----|---------------|
| 1           | 25  | 17000         |
| 2           | 32  | 21000         |
| 3           | 27  | 27000         |
| 4           | 29  | 27000         |
| 5           | 36  | 47000         |
| 6           | 38  | 47000         |
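The nearest-age lookup used in this example can be sketched in a few lines of plain Python. The `hot_deck_impute` function is a hypothetical helper, not a library API, and its tie-breaking rule (prefer the older donor when two ages are equally close) is an assumption chosen only so that the output matches the table above.

```python
def hot_deck_impute(rows):
    """Fill each missing salary with the salary of the donor closest in age."""
    # Donors are the rows whose salary is observed.
    donors = [(age, salary) for age, salary in rows if salary is not None]
    filled = []
    for age, salary in rows:
        if salary is None:
            # Sort key: smallest age gap first; on a tie, the older donor wins
            # (an assumption made to reproduce the worked example).
            _, salary = min(donors, key=lambda d: (abs(d[0] - age), -d[0]))
        filled.append((age, salary))
    return filled

# (age, salary) pairs from the example; None marks a missing salary.
rows = [(25, 17000), (32, 21000), (27, None), (29, 27000), (36, None), (38, 47000)]
filled = hot_deck_impute(rows)
print(filled)
```

A different tie-breaking rule, such as picking the younger donor or choosing at random, would fill age 27 with 17000 instead, which is one reason hot deck results can vary between implementations.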

3. Regression Imputation

In this method, a regression model predicts the missing values from the other variables in the dataset. It is often more accurate than the previous methods, but it requires knowledge of regression and its assumptions.
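To make the idea concrete, here is a rough sketch that fits a straight line of salary against age on the complete rows of the small age and salary dataset and uses it to predict the missing salaries. The helper names `fit_line` and `regression_impute` are hypothetical, and the linear relationship is an assumption that four data points cannot really confirm.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def regression_impute(rows):
    """Predict each missing salary from age using a line fitted on complete rows."""
    observed = [(a, s) for a, s in rows if s is not None]
    intercept, slope = fit_line([a for a, _ in observed], [s for _, s in observed])
    return [(a, intercept + slope * a if s is None else s) for a, s in rows]

# (age, salary) pairs; None marks a missing salary.
rows = [(25, 17000), (32, 21000), (27, None), (29, 27000), (36, None), (38, 47000)]
filled = regression_impute(rows)
print(filled)
```

Unlike hot deck imputation, the filled values here are not copied from any existing row; they lie on the fitted line, so they depend on every complete observation.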

Advanced Imputation Methods

1. K-nearest neighbor (KNN imputation)

This is a widely used method for imputing missing values. The idea is to find the 'k' samples in the dataset that are most similar to the sample with the missing value. Suppose k=3: the algorithm finds the 3 most similar data points, and the missing value is replaced by the mean (for numerical data) or the mode (for categorical data) of those 3 neighbors.
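The following is a simplified sketch of KNN imputation with k=3, using age as the only feature and the absolute age difference as the distance. The `knn_impute` function and the choice of distance are assumptions made for illustration; practical implementations, such as scikit-learn's `KNNImputer`, work with several features at once.

```python
from statistics import mean

def knn_impute(rows, k=3):
    """Fill each missing salary with the mean salary of the k donors nearest in age."""
    donors = [(a, s) for a, s in rows if s is not None]
    filled = []
    for age, salary in rows:
        if salary is None:
            # Rank donors by age distance and keep the k closest.
            nearest = sorted(donors, key=lambda d: abs(d[0] - age))[:k]
            salary = mean(s for _, s in nearest)
        filled.append((age, salary))
    return filled

# (age, salary) pairs; None marks a missing salary.
rows = [(25, 17000), (32, 21000), (27, None), (29, 27000), (36, None), (38, 47000)]
filled = knn_impute(rows, k=3)
print(filled)
```

With k=1 this reduces to copying the single nearest neighbor's value; larger values of k smooth the result by averaging over more neighbors.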


2. Multiple Imputation

In this method, several copies of the dataset are generated, each with different imputed values for the missing data. Each copy is analyzed separately, and the results of these analyses are combined to produce a final result.

Consider the dataset below.

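| Observation | Age | Salary (in $) |
|-------------|-----|---------------|
| 1           | 25  | 17000         |
| 2           | 32  | 21000         |
| 3           | 27  | (missing)     |
| 4           | 29  | 27000         |
| 5           | 36  | (missing)     |
| 6           | 38  | 47000         |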


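We can use any simple model (linear regression, decision tree, etc.) to fill in the missing salary values. Suppose we use a decision tree and generate 5 different copies of the data, with some randomness in each fit (for example, training on a different random resample of the complete rows) so that the imputed values differ between copies. Below, Salary_27 denotes the imputed salary for age 27 and Salary_36 the imputed salary for age 36; each of the five items corresponds to one generated dataset.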

1. Salary_27 = 24000 & Salary_36 = 42000
2. Salary_27 = 23000 & Salary_36 = 41000
3. Salary_27 = 22500 & Salary_36 = 40000
4. Salary_27 = 23500 & Salary_36 = 40500
5. Salary_27 = 25700 & Salary_36 = 41700

All five datasets are then analyzed separately (for example, with a t-test, or by computing a simple mean and standard deviation), and the results are pooled to account for the uncertainty introduced by the imputation process.

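Multiple imputation is often regarded as the most reliable of these methods because it reflects the uncertainty of the imputed values, but it costs more computation time and storage.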
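As an illustration of the pooling step, the sketch below takes the five imputed pairs listed above, computes the mean salary of each completed dataset, and combines the five estimates. The combination follows Rubin's rules, one common pooling scheme that the article does not name, so treat that choice and the variable names as assumptions.

```python
from statistics import mean, variance

# Observed salaries for ages 25, 32, 29 and 38.
observed = [17000, 21000, 27000, 47000]
# (Salary_27, Salary_36) for each of the five generated datasets.
imputations = [(24000, 42000), (23000, 41000), (22500, 40000),
               (23500, 40500), (25700, 41700)]

estimates, within = [], []
for s27, s36 in imputations:
    completed = observed + [s27, s36]
    estimates.append(mean(completed))                     # analysis: mean salary
    within.append(variance(completed) / len(completed))   # squared standard error of that mean

m = len(imputations)
pooled_estimate = mean(estimates)      # average of the five analyses
within_var = mean(within)              # average sampling variance
between_var = variance(estimates)      # spread caused by the imputations
total_var = within_var + (1 + 1 / m) * between_var

print(pooled_estimate, total_var ** 0.5)
```

The between-imputation term is what a single imputed dataset cannot provide: it widens the pooled standard error to reflect that the filled-in salaries are estimates rather than observations.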



