Filling gaps with the mean or median, or dropping rows with missing values, does not always solve the problem of missing data. In this particular case, as we are going to see, it was solved by interpolation. We were dealing with Tamilnadu’s population. As you know, the census counts the population only once every ten years, so how do we fill in the years in between? The simple answer in this case is to interpolate between the known counts.

Now let’s import the necessary libraries to do this work.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

The data is in the file named tamilnadu_population_1981_to_2025.csv; let’s load it as a dataframe.

tamilnadu_population = pd.read_csv("tamilnadu_population_1981_to_2025.csv")

When we check its head, as shown below, the population column contains zeros for 1982 and the years that follow.

tamilnadu_population.head()
  Year Population (in millions)
0 1981 48.41
1 1982 0.00
2 1983 0.00
3 1984 0.00
4 1985 0.00

Let’s plot the data as shown below:

x_axis = tamilnadu_population["Year"]
y_axis = tamilnadu_population["Population (in millions)"]
# Set the figure size before plotting; calling plt.figure() afterwards opens a new, empty figure
plt.figure(figsize = (8,5))
plt.scatter(x_axis, y_axis, color = 'green')
plt.title("TamilNadu Population",color = "blue")
plt.xlabel("Year",color = "blue")
plt.ylabel("Population (in millions)", color = "blue")
plt.show()

You see a lot of data points lying on the x-axis. These are the years with no recorded count, stored as zero. It doesn’t mean that Tamilnadu’s population was wiped out from 1982 to 1990 and magically restored by God in 1991.

Now, to fill the missing values, we first replace the zeros with NaN so that pandas treats them as missing.

tamilnadu_population["Population (in millions)"] = tamilnadu_population["Population (in millions)"].replace(0.00, np.nan)
tamilnadu_population.head()
  Year Population (in millions)
0 1981 48.41
1 1982 NaN
2 1983 NaN
3 1984 NaN
4 1985 NaN
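To see why this replacement step matters, here is a minimal sketch on a toy series (the numbers are made up and are not taken from the CSV). It assumes only that interpolate() fills NaN entries and treats zeros as ordinary values.

```python
import numpy as np
import pandas as pd

# Toy series (made-up numbers): two known values with zero placeholders in between
raw = pd.Series([10.0, 0.0, 0.0, 40.0])

# Zeros are valid numbers, so interpolate() leaves them alone
untouched = raw.interpolate(method="linear")

# Once the placeholders are converted to NaN, the gaps get filled
filled = raw.replace(0.0, np.nan).interpolate(method="linear")

print(untouched.tolist())  # [10.0, 0.0, 0.0, 40.0]
print(filled.tolist())     # [10.0, 20.0, 30.0, 40.0]
```

Note that this trick is only safe when zero can never be a real value, which holds for a population count like ours.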

Now we use the interpolation method to fill in the missing values:

tamilnadu_population["Population (in millions)"] = tamilnadu_population["Population (in millions)"].interpolate(method='linear')
tamilnadu_population.head()
  Year Population (in millions)
0 1981 48.41
1 1982 49.155
2 1983 49.900
3 1984 50.645
4 1985 51.390
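The step of 0.745 between consecutive years can be reproduced by hand. The sketch below is only an illustration: the 1981 value comes from the table above, while the 1991 value of 55.86 is an assumption inferred from that step (48.41 + 10 × 0.745), not read from the CSV.

```python
import numpy as np
import pandas as pd

pop_1981 = 48.41  # from the head() output above
pop_1991 = 55.86  # assumption: inferred from the 0.745 yearly step, not read from the CSV

years = np.arange(1981, 1992)

# Linear interpolation by hand: a constant increment per year between the two counts
step = (pop_1991 - pop_1981) / (1991 - 1981)
by_hand = pop_1981 + step * (years - 1981)

# The same gap filled by pandas
series = pd.Series([pop_1981] + [np.nan] * 9 + [pop_1991], index=years)
by_pandas = series.interpolate(method="linear")

print(round(step, 3))                               # 0.745
print(np.allclose(by_hand, by_pandas.to_numpy()))   # True
```

Keep in mind that method='linear' treats the rows as equally spaced and ignores the index. That is fine here because every year has its own row.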

Now let’s plot the refined data as shown below:

x_axis = tamilnadu_population["Year"]
y_axis = tamilnadu_population["Population (in millions)"]
# Set the figure size before plotting; calling plt.figure() afterwards opens a new, empty figure
plt.figure(figsize = (8,5))
plt.scatter(x_axis, y_axis, color = 'green')
plt.title("TamilNadu Population",color = "blue")
plt.xlabel("Year",color = "blue")
plt.ylabel("Population (in millions)", color = "blue")
plt.show()

Ah! Much better!!
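One thing worth checking in the refined plot is the tail of the series. Linear interpolation in pandas only draws lines between known values; it does not extrapolate past the last one. The sketch below uses a toy series (made-up numbers) to show what happens to gaps at the edges. Whether this affects our file depends on what the CSV holds for the years after the last census, which this post does not show.

```python
import numpy as np
import pandas as pd

# Toy series (made-up numbers): gaps before, between, and after the known values
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan, np.nan])

# Default behaviour: interior gaps are interpolated, trailing gaps repeat the
# last known value, and leading gaps stay NaN
default_fill = s.interpolate(method="linear")

# limit_area="inside" fills only the gaps surrounded by known values
inside_only = s.interpolate(method="linear", limit_area="inside")

print(default_fill.tolist())
print(inside_only.tolist())
```

If the trailing years come out as a flat line, that is the last census value being carried forward, not a measured or projected population.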
