OVERFITTING

Higher order Linear Regression

In this blog I will explain the concept of overfitting using polynomial regression, how to detect it, and the different ways to prevent it.

Overview of Overfitting:

When a model fits the training data more closely than it needs to, it starts capturing the noisy and inaccurate values in the data. As a result, the efficiency and accuracy of the model decrease.


When we run the training algorithm on the data set, we allow the cost to decrease with each iteration. Running the algorithm for too long will keep reducing the cost, but it will also fit the noisy data points in the data set.

The result will look like the graph below.

[Figure: an overfitted curve that passes through every training point, noise included]

It looks efficient, but it is not. The main goal of an algorithm such as linear regression is to find the dominant trend and fit the data points accordingly. In this case, however, the line fits all of the data points, which is irrelevant to how well the model predicts outcomes for new data points.


Underfitting

To avoid overfitting, we could stop the training at an earlier stage. But this might also prevent the model from learning enough from the training data, so it may fail to capture the dominant trend. This is known as UNDERFITTING.

Underfitting means the model has high bias: it is too simple to capture the trend, so it performs poorly on new data.

The graph below displays underfitting.

[Figure: an underfitted line that misses the dominant trend in the data]

The result is the same as with overfitting, in that it causes inefficiency in predicting outcomes, but here the model has seen too little data (or training) to recognize the dominant trend inside the data set.

HOW TO DETECT OVERFITTING?

During training we have the training error and the validation error, i.e., the error on the training set and the error on the validation set, and we have to observe how these errors develop.

For example, the graph below contains a blue line for the training error and a red line for the validation error.

[Figure: training error (blue) and validation error (red) plotted against training iterations]

Initially, both errors decrease gradually, getting lower and lower, but after a while the validation error starts going up while the training error continues its downward trend.

The model here is learning what the training set looks like and keeps adapting itself to the training set, which is why the training error shows a continuous downward trend. The validation error increases because the model no longer cares about the validation data; it only cares about adapting itself to the training set.

If your training and validation errors look like this graph, your model is probably overfitting.

The main challenge with overfitting is estimating how accurately our model will perform on new data, since we cannot estimate that accuracy until we actually test it.

HOW TO AVOID OVERFITTING?

There are several ways in which we can avoid overfitting:
  • Cross-Validation
  • Training with more data
  • Regularization
  • Removing Features
  • Ensembling
  • Early Stopping

Let's start the implementation.

Overfitting with Higher order Linear Regression

Import the Libraries

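A minimal sketch of the imports, assuming NumPy is used for data generation, pandas for tables, and Matplotlib for plotting:

```python
import numpy as np               # random data generation and array math
import pandas as pd              # tabular display of weights and errors
import matplotlib.pyplot as plt  # plotting the fitted curves
```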

Generate 20 random values

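One plausible way to generate the data; the noisy sine target sin(2*pi*x), the noise level 0.3, and the fixed seed are assumptions, not the post's exact values:

```python
# 20 random x values in [0, 1]; targets are a sine curve plus Gaussian noise
rng = np.random.default_rng(42)   # fixed seed so the run is reproducible (assumption)
x = rng.uniform(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 20)   # assumed target and noise level
```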

Split the data into two equal parts, train and test, each with 10 pairs.

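A straightforward split of the 20 pairs into two halves of 10 (variable names are illustrative):

```python
# First 10 pairs for training, remaining 10 for testing
x_train, x_test = x[:10], x[10:]
y_train, y_test = y[:10], y[10:]
```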

Reshape the data (reshaping is used to turn each array into a column vector).

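Assuming scikit-learn is used for the models below, each 1-D array needs to become a 2-D column vector; a sketch:

```python
# reshape(-1, 1) turns a flat array of n values into an n x 1 column
x_train = x_train.reshape(-1, 1)
x_test = x_test.reshape(-1, 1)
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)
```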

Importing more dependencies

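The extra dependencies are the model-building and evaluation utilities; assuming scikit-learn, they might look like this:

```python
from sklearn.preprocessing import PolynomialFeatures  # x -> [1, x, x^2, ..., x^M]
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error
```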

Using root mean square error, find the weights of polynomial regression for orders 0, 1, 3, and 9.

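A sketch of fitting the four models and recording their root mean square errors; the pipeline-based approach and the dictionary variable names are assumptions:

```python
# Fit polynomial regression models of order 0, 1, 3, and 9
orders = [0, 1, 3, 9]
models, weights, rmse_train, rmse_test = {}, {}, {}, {}

for m in orders:
    model = make_pipeline(
        PolynomialFeatures(degree=m),            # bias column plus powers of x
        LinearRegression(fit_intercept=False),   # the bias is already in the features
    )
    model.fit(x_train, y_train)
    models[m] = model
    weights[m] = model.named_steps["linearregression"].coef_.ravel()  # w0 ... wm
    rmse_train[m] = np.sqrt(mean_squared_error(y_train, model.predict(x_train)))
    rmse_test[m] = np.sqrt(mean_squared_error(y_test, model.predict(x_test)))
```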

Display the weights in a table.

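The weights can be collected into a pandas table, one column per model order; the 9th-order column typically shows very large coefficients, which is a symptom of overfitting:

```python
# Lower-order models leave the missing higher-order weights as NaN
weight_table = pd.DataFrame({f"M={m}": pd.Series(weights[m]) for m in orders})
weight_table.index = [f"w{i}" for i in range(len(weight_table))]
print(weight_table.round(2))
```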

Draw a chart of the fitted data.

Sort the X and Y data simultaneously to get a proper line-graph output.
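The post sorts the points before plotting so the curve is drawn left to right; an equivalent sketch uses an already-sorted linspace grid for the curve and scatters the raw training points on top:

```python
# Smooth, already-sorted x grid so the fitted curve plots as a clean line
x_line = np.linspace(0.0, 1.0, 200).reshape(-1, 1)

fig, axes = plt.subplots(1, len(orders), figsize=(16, 4), sharey=True)
for ax, m in zip(axes, orders):
    ax.scatter(x_train, y_train, facecolors="none", edgecolors="b", label="train data")
    ax.plot(x_line, np.sin(2 * np.pi * x_line), "g", label="sin(2*pi*x)")
    ax.plot(x_line, models[m].predict(x_line), "r", label=f"M={m}")
    ax.set_ylim(-1.5, 1.5)
    ax.legend()
plt.show()
```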

The graphs will look like the following:

[Figures: fitted curves for polynomial orders 0, 1, 3, and 9]

Calculate the train error vs. the test error.

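A sketch of comparing the two errors, reusing the RMSE values collected above; the table and plot layout are assumptions:

```python
# Train vs. test RMSE for each model order
error_table = pd.DataFrame({"train RMSE": rmse_train, "test RMSE": rmse_test})
print(error_table.round(3))

plt.plot(orders, [rmse_train[m] for m in orders], "o-", label="train")
plt.plot(orders, [rmse_test[m] for m in orders], "o-", label="test")
plt.xlabel("polynomial order M")
plt.ylabel("RMSE")
plt.legend()
plt.show()
```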

Calculating the training and testing errors, we get:

[Table: training and test RMSE for each model order]

Now generate 100 more data points, fit a 9th-order model, and draw the fit.


Generate the 9th-order fit using the same method as before:

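A sketch of this larger experiment, reusing the assumed noisy sine process and the plotting grid x_line defined earlier:

```python
# More data tames the wiggles of the 9th-order fit
x_100 = rng.uniform(0.0, 1.0, 100).reshape(-1, 1)
y_100 = np.sin(2 * np.pi * x_100) + rng.normal(0.0, 0.3, x_100.shape)

model_100 = make_pipeline(PolynomialFeatures(degree=9),
                          LinearRegression(fit_intercept=False))
model_100.fit(x_100, y_100)

plt.scatter(x_100, y_100, facecolors="none", edgecolors="b", label="data (n=100)")
plt.plot(x_line, model_100.predict(x_line), "r", label="M=9 fit")
plt.plot(x_line, np.sin(2 * np.pi * x_line), "g", label="sin(2*pi*x)")
plt.legend()
plt.show()
```

With 100 points, the 9th-order curve stays close to the underlying trend instead of chasing individual noisy points, which is why training with more data is one of the remedies listed above.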

Select 25 of the data points, fit a 9th-order model, and draw the fit.

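The same fit on a 25-point subset; taking the first 25 of the 100 generated points is an assumption about how the subset was selected:

```python
x_25, y_25 = x_100[:25], y_100[:25]

model_25 = make_pipeline(PolynomialFeatures(degree=9),
                         LinearRegression(fit_intercept=False))
model_25.fit(x_25, y_25)

plt.scatter(x_25, y_25, facecolors="none", edgecolors="b", label="data (n=25)")
plt.plot(x_line, model_25.predict(x_line), "r", label="M=9 fit")
plt.plot(x_line, np.sin(2 * np.pi * x_line), "g", label="sin(2*pi*x)")
plt.legend()
plt.show()
```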

Now we will regularize using the sum of squared weights (an L2 penalty).

Using regularization, we can prevent overfitting.

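A sketch of the regularized fits, assuming ridge regression from scikit-learn (its alpha parameter plays the role of lambda) on the original 10-point training set with degree-9 features; the lambda values match the ones listed below:

```python
# Ridge adds an L2 penalty lambda * sum(w_j^2) to the least-squares cost
lambdas = [1, 1/10, 1/100, 1/1000, 1/10000, 1/100000]
ridge_train_rmse, ridge_test_rmse = {}, {}

for lam in lambdas:
    ridge = make_pipeline(
        PolynomialFeatures(degree=9, include_bias=False),
        Ridge(alpha=lam),          # scikit-learn names the lambda parameter "alpha"
    )
    ridge.fit(x_train, y_train)
    ridge_train_rmse[lam] = np.sqrt(mean_squared_error(y_train, ridge.predict(x_train)))
    ridge_test_rmse[lam] = np.sqrt(mean_squared_error(y_test, ridge.predict(x_test)))

    # Regularized 9th-order fit for this lambda
    plt.scatter(x_train, y_train, facecolors="none", edgecolors="b")
    plt.plot(x_line, ridge.predict(x_line), "r", label=f"lambda = {lam}")
    plt.plot(x_line, np.sin(2 * np.pi * x_line), "g")
    plt.legend()
    plt.show()
```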

The lambda graphs:

[Figures: regularized 9th-order fits for lambda = 1, 1/10, 1/100, 1/1000, 1/10000, and 1/100000]

The training and test errors for the lambda values are:

[Table: training and test RMSE for each lambda value]

Now, according to the lambda values, plot the training and test errors.
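A sketch of the final comparison, plotting the RMSE values collected in the ridge loop against log10(lambda):

```python
plt.plot(np.log10(lambdas), [ridge_train_rmse[l] for l in lambdas], "o-", label="train")
plt.plot(np.log10(lambdas), [ridge_test_rmse[l] for l in lambdas], "o-", label="test")
plt.xlabel("log10(lambda)")
plt.ylabel("RMSE")
plt.legend()
plt.show()
```

The lambda with the lowest test error gives the best trade-off between fitting the training data and keeping the weights small.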

Challenges and contributions:

In large data sets, some of the independent variables are correlated with other variables. I first tried LASSO, which uses L1 regularization and performs feature selection automatically, but the results were not accurate: LASSO does not handle correlated variables well, because it shrinks all but one of them to zero and retains only that one variable. This resulted in low accuracy for the model and a loss of information.

So I used ridge regression, as it reduces model complexity by shrinking the parameters, which in turn reduces the variance. In the penalized optimization function, lambda regularizes the coefficients so that they never take huge values. This was the more optimal method, and we can prevent overfitting using ridge regression.
