Create a machine learning model using linear regression and the Boston housing dataset while following the machine learning workflow.
In machine learning, we write computer programs that improve automatically with experience; these programs are called machine learning models. This saves us from explicitly writing code for complex real-world data.
In this project we are going to use supervised learning, a branch of machine learning in which we teach our model by example. We will first explore the different attributes of the Boston housing dataset. Then one part of the dataset will be used to train the linear regression algorithm, and the trained model will be used to make predictions on the remaining part of the dataset.
The project consists of the following stages:
In this section we will load a few libraries that we need to develop, visualize and test our model. We will also load our dataset from one of the imported libraries, Sklearn.
To start right away, search on Google for "Google colab", click on the first link and then click on new notebook. [Colab is a cloud-based environment which provides all the resources required for model development.]
Import the stated libraries:
Numpy (used below as np)
Pandas (used below as pd)
Sklearn
matplotlib.pyplot (used below as plt)
Seaborn (used below as sns)
Import the Boston housing dataset from Sklearn using the following commands. Note that load_boston() was removed in scikit-learn 1.2, so these commands need an older release.
from sklearn.datasets import load_boston
var = load_boston()
load_boston() returns a Bunch object, on which we will do our further work.
Run the command %matplotlib inline to display plots inside the notebook itself.
Running print(var.keys()) should list the keys of the dictionary-like Bunch object.
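To make the Bunch idea concrete, here is a minimal sketch. It builds a small hypothetical Bunch by hand (the values are made up and are not taken from the Boston data) to show that the same entry can be reached by key or by attribute.

```python
from sklearn.utils import Bunch

# Hypothetical stand-in for the object a dataset loader returns;
# the numbers here are invented for illustration only.
toy = Bunch(
    data=[[0.54, 6.6], [0.47, 6.4]],
    target=[24.0, 21.6],
    feature_names=["NOX", "RM"],
)

keys = list(toy.keys())          # a Bunch behaves like a dictionary
print(keys)

# Key access and attribute access return the very same object.
same_object = toy["target"] is toy.target
print(same_object)
```

This is why both var.keys() and var.DESCR style access work on the loaded dataset.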
In this section we will analyse our dataset using different methods and then create a dataframe from it. We will also preprocess the dataframe so that it can be used with the linear regression model.
Print each element of the Bunch listed by the statement above and read the results to understand the dataset, for example: print(var.DESCR)
Create a pandas dataframe that holds a copy of the dataset, on which we will carry out further preprocessing. Use the following command:
df = pd.DataFrame(var.data, columns=var.feature_names)
Check out the references to learn about the other arguments that the DataFrame() function accepts.
Use the head() and tail() functions to see the first and last five rows of the created dataframe.
Use describe() to get summary statistics for the created dataframe.
Add another column to the dataframe and store the values of the target attribute in it:
df['MEDV'] = var.target
Confirm the addition of the column using head().
Use df.dtypes or df.info() to see the data type of each feature in the dataset. If we find categorical data, we will need to apply an encoding method to it.
Use df.isnull().sum() to check for missing values in each column. If we find missing values, we can either fill them in or drop the affected rows or columns.
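The inspection and missing-value steps above can be sketched on a tiny hypothetical dataframe. The column names mirror the Boston features, but the values (and the deliberately missing entry) are invented for illustration; filling with the median is only one of several possible choices.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three rows, two features, one missing value.
data = np.array([[0.54, 6.6], [0.47, np.nan], [0.46, 7.2]])
toy_df = pd.DataFrame(data, columns=["NOX", "RM"])
toy_df["MEDV"] = [24.0, 21.6, 34.7]      # add the target as a new column

print(toy_df.dtypes)                      # data type of every column
missing_per_column = toy_df.isnull().sum()
print(missing_per_column)                 # count of missing values per column

# Option 1: fill each gap with the median of its column.
filled = toy_df.fillna(toy_df.median())

# Option 2: drop every row that contains a missing value.
dropped = toy_df.dropna()
```

Dropping rows is simpler but shrinks the data, which matters for a small dataset.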
Create a box plot using seaborn to see the outliers in the dataset. Rows containing outliers are generally removed, but for a small dataset like Boston housing this can discard a significant percentage of the data.
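A box plot typically marks as outliers the points lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule numerically to a small made-up series, assuming the 1.5 factor (the usual default); it is one way to count how many rows an outlier filter would remove before deciding whether to remove them.

```python
import pandas as pd

# Hypothetical values; the last one is deliberately extreme.
values = pd.Series([21.0, 22.5, 23.1, 24.0, 24.8, 25.5, 50.0], name="MEDV")

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1                      # interquartile range

lower = q1 - 1.5 * iqr             # points below this are flagged
upper = q3 + 1.5 * iqr             # points above this are flagged

outliers = values[(values < lower) | (values > upper)]
share_removed = len(outliers) / len(values)
print(outliers.tolist(), share_removed)
```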
Create a heatmap using seaborn to find the correlation between the different features and the label. For model creation we will use the features that have a high correlation with our target label.
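The numbers shown in such a heatmap come from a correlation matrix. As a rough sketch, the code below builds synthetic data (the column names STRONG, WEAK and TARGET are hypothetical) and ranks the features by the absolute value of their correlation with the target, since a strongly negative correlation is as useful as a strongly positive one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic data: TARGET depends on STRONG but not on WEAK.
strong = rng.normal(size=n)
weak = rng.normal(size=n)
target = -3.0 * strong + rng.normal(scale=0.5, size=n)
toy_df = pd.DataFrame({"STRONG": strong, "WEAK": weak, "TARGET": target})

corr = toy_df.corr()               # pairwise Pearson correlations

# Rank features by how strongly they correlate with the target.
ranking = corr["TARGET"].drop("TARGET").abs().sort_values(ascending=False)
print(ranking)

# In the notebook, the matrix could be drawn with: sns.heatmap(corr, annot=True)
```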
Create a KDE plot of the different variables using the seaborn library.
In this section we will import the linear regression model from Sklearn, use the features identified from the heatmap together with the label to create training and testing sets, and finally train our model on the training set.
Import the functions and classes needed for this section:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
Select the features for creating the training and test sets. Put your chosen features in x1 (for example NOX, which stands for nitric oxide content) and use MEDV (median value) as the label in y1.
x1 = df[['NOX', 'RM', 'DIS', 'PTRATIO', 'LSTAT']]
y1 = df['MEDV']
Use train_test_split() from Sklearn to create the train and test sets.
x_train, x_test, y_train, y_test = train_test_split(x1, y1, test_size=0.33, random_state=5)
Fit the regression model on the training data, predict on the test data, and compare the actual and predicted values side by side.
regressor = LinearRegression()
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test)
results = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
results
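For readers who cannot load the Boston data, the same split, fit and predict sequence can be tried end to end on synthetic data. This is only a sketch: the feature names A and B and the coefficients used to generate the label are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n = 300

# Synthetic features and a label that is linear in them plus small noise.
features = pd.DataFrame({"A": rng.normal(size=n), "B": rng.normal(size=n)})
label = 2.0 * features["A"] - 1.0 * features["B"] + 4.0 + rng.normal(scale=0.1, size=n)

# Hold out roughly a third of the rows for testing.
x_train, x_test, y_train, y_test = train_test_split(
    features, label, test_size=0.33, random_state=5
)

model = LinearRegression()
model.fit(x_train, y_train)          # learn coefficients from the training rows
y_pred = model.predict(x_test)       # predict on rows the model has not seen

comparison = pd.DataFrame({"Actual": y_test, "Predicted": y_pred})
print(model.coef_, model.intercept_)
print(comparison.head())
```

Because the label was generated from known coefficients, the fitted coef_ and intercept_ should land close to them, which is a quick sanity check that the workflow is wired correctly.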
In this section we will test our predictions against the testing data and calculate the R2 score to measure model accuracy. We will also plot the results of the linear regression model.
metrics.mean_absolute_error()
metrics.mean_squared_error()
np.sqrt(metrics.mean_squared_error())
metrics.r2_score()
The above code block is missing some arguments. Check out the references to get a clear idea of what is missing.
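To build intuition for what these four numbers mean, here is a small worked sketch on two made-up arrays (y_true and y_hat are hypothetical names, not the project's variables). The values are small enough to verify by hand: the errors are 0.5, 0, 0.5 and 0.

```python
import numpy as np
from sklearn import metrics

# Made-up true values and predictions, chosen so the arithmetic is easy to check.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 7.5, 9.0])

mae = metrics.mean_absolute_error(y_true, y_hat)   # mean of |error|
mse = metrics.mean_squared_error(y_true, y_hat)    # mean of error squared
rmse = np.sqrt(mse)                                # back in the label's units
r2 = metrics.r2_score(y_true, y_hat)               # 1 - SS_res / SS_tot

print(mae, mse, rmse, r2)
```

By hand: MAE = 1.0 / 4 = 0.25, MSE = 0.5 / 4 = 0.125, and R2 = 1 - 0.5 / 20 = 0.975, where 20 is the total squared deviation of y_true from its mean of 6.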
Use the array() function in NumPy to create arrays of the target label and any one of the features. Pass them to the polyfit() function, which returns the slope and intercept of the regression line. Store the returned values.
x1 = np.array(x_test['NOX'])
y1 = np.array(y_pred)
m, b = np.polyfit()
The above code snippet is missing the arguments for polyfit(). Check out the references for a solution.
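As a hedged illustration of how a degree-1 fit behaves, the sketch below runs polyfit() on made-up points that lie exactly on the line y = 2x + 1 (the arrays x and y are hypothetical, not the project's data). For degree 1 the coefficients come back highest power first, so the slope precedes the intercept.

```python
import numpy as np

# Made-up points lying exactly on y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Degree 1 asks for a straight line; coefficients are ordered slope, intercept.
m, b = np.polyfit(x, y, 1)

line = m * x + b                  # fitted values along the regression line
print(m, b)
```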
Use the plot() function to plot the regression line corresponding to the chosen feature.
plt.plot(x_test['NOX'], m*x_test['NOX'] + b)
plt.plot(x_test['NOX'],y_pred,'o')
plt.xlabel("Nitric Oxide Content")
plt.ylabel("Median Value")
Read about these functions and their arguments in the references section.
Use the seaborn library to create a pairplot of the selected features against MEDV. Pass the housing dataframe that contains the MEDV column (created as df earlier).
sns.pairplot(df, x_vars=['NOX','RM','DIS','PTRATIO','LSTAT'], y_vars='MEDV', height=9, aspect=0.6, kind='reg')
Read about the function and its arguments in the references to learn more about pairplot, and try playing around with various argument values.
Regression line from the polyfit() function for the Median value label and the Nitric oxide content attribute.
A pairplot of the attributes against the label, where attributes with a positive correlation with the label have plots with a positive slope and attributes with a negative correlation have plots with a negative slope.