In this article, we will see how to build a machine learning model using Python libraries. We prepare the data with pandas, then fit and evaluate the model with the scikit-learn library.

The steps covered in this article are:

1. Load a dataset

2. Data preprocessing and data analysis

3. Build a model using linear regression

4. Check and compare accuracy

Step 1: Load a Dataset

Plenty of data is available on platforms such as Kaggle and UCI, and you can also use the datasets bundled with the sklearn library. I chose the Big Mart Sales dataset from Kaggle. You can download it from this link. The dataset is in .csv format, and I import it in my code using the pandas CSV reader.

Note: I used a Jupyter notebook to write the code and build the model in Python.

Load CSV data using pandas


Item_Outlet_Sales is the target variable.
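As a minimal sketch of the loading step: the snippet below reads CSV text with `pd.read_csv`. The in-memory sample and its values are made up for illustration, and the file name `Train.csv` mentioned in the comment is an assumption, not something stated in this article.

```python
import io

import pandas as pd

# Tiny made-up sample standing in for the downloaded file; the values are
# illustrative only and do not come from the real dataset.
sample_csv = io.StringIO(
    "Item_Identifier,Item_MRP,Outlet_Type,Item_Outlet_Sales\n"
    "A01,249.8,Supermarket Type1,3735.1\n"
    "B02,48.3,Supermarket Type2,443.4\n"
    "C03,141.6,Grocery Store,732.4\n"
)

# With the real file, pass its path instead, e.g. pd.read_csv("Train.csv")
# (the file name is an assumption).
train = pd.read_csv(sample_csv)

print(train.shape)   # (number of rows, number of columns)
print(train.head())  # first few rows
```

The resulting `train` DataFrame is the object that the later steps inspect and clean.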

Step 2: Data Preprocessing and analysis

First, we explore the data.

The summary above gives information about the dataset. We can see that the dataset has seven categorical, four continuous, and one discrete variable.
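A summary like this can be produced with `DataFrame.info()`. The sketch below uses a small made-up frame (only the column names follow the article) and also splits the columns into numeric and non-numeric groups, which is one way to count the variable types.

```python
import pandas as pd

# Made-up rows; only the column names follow the article.
train = pd.DataFrame({
    "Item_Fat_Content": ["Low Fat", "Regular", "Low Fat"],
    "Item_MRP": [249.8, 48.3, 141.6],
    "Outlet_Establishment_Year": [1999, 2009, 1998],
})

# Column names, non-null counts, and dtypes
train.info()

# Numeric columns versus everything else (treated here as categorical)
numeric_cols = train.select_dtypes(include="number").columns.tolist()
categorical_cols = train.select_dtypes(exclude="number").columns.tolist()
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```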

Statistical description of the data

The output above shows the statistical description of the dataset.
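A description of this kind is what `DataFrame.describe()` returns. Here is a small sketch on made-up numbers; by default it reports count, mean, standard deviation, minimum, quartiles, and maximum for the numeric columns.

```python
import pandas as pd

# Made-up values; only the column names follow the article.
train = pd.DataFrame({
    "Item_MRP": [100.0, 200.0, 300.0],
    "Item_Outlet_Sales": [1000.0, 2000.0, 6000.0],
})

summary = train.describe()
print(summary)
```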

Data plotted as histograms

Correlation strengths

Correlation between features
Correlation between target and features

Here we can see that Item_MRP has the highest correlation with the target. That means the target (i.e. Item_Outlet_Sales) tends to rise and fall with Item_MRP, so an increase or decrease in MRP will affect Item_Outlet_Sales. The other columns show much weaker correlations and are less useful for predicting the target value.
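One way to obtain such a ranking is to compute the correlation matrix of the numeric columns and read off the target's column. The sketch below assumes Pearson correlation (the pandas default) and uses made-up values, so the printed numbers are not the article's results.

```python
import pandas as pd

# Made-up values; only the column names follow the article.
train = pd.DataFrame({
    "Item_MRP": [50.0, 100.0, 150.0, 200.0, 250.0],
    "Item_Visibility": [0.10, 0.02, 0.07, 0.00, 0.05],
    "Outlet_Type": ["Grocery Store", "Supermarket Type1", "Supermarket Type1",
                    "Supermarket Type2", "Supermarket Type3"],
    "Item_Outlet_Sales": [400.0, 1500.0, 2100.0, 3300.0, 4000.0],
})

# Correlation is only defined for numeric columns, so select those first
numeric = train.select_dtypes(include="number")

# Correlation of every numeric feature with the target,
# ordered by absolute strength (strongest first)
target_corr = (
    numeric.corr()["Item_Outlet_Sales"]
    .drop("Item_Outlet_Sales")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Note that categorical columns such as Outlet_Type do not appear in this ranking, so a low position here says nothing about them.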

Data preprocessing

a. First, we check the dataset for null values.

Checking for null values in the dataset

Here we can see that Item_Weight and Outlet_Size have null values. There are many different ways to handle missing values, depending on the dataset.
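The null check can be sketched with `isnull().sum()`. The second half of the snippet shows two common ways of filling gaps (median for a numeric column, most frequent value for a categorical one); these are illustrative assumptions, not the handling used in this article, and the data is made up.

```python
import numpy as np
import pandas as pd

# Made-up rows with one gap in each of the two affected columns
train = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 8.9],
    "Outlet_Size": ["Medium", None, "Medium", "Small"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1],
})

# Number of missing values per column
null_counts = train.isnull().sum()
print(null_counts)

# Possible handling (an assumption, not the article's method):
filled = train.copy()
# numeric column: fill with the median of the observed values
filled["Item_Weight"] = filled["Item_Weight"].fillna(filled["Item_Weight"].median())
# categorical column: fill with the most frequent category
filled["Outlet_Size"] = filled["Outlet_Size"].fillna(filled["Outlet_Size"].mode()[0])
print(filled.isnull().sum())
```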

b. Next, we count the unique values in each categorical column to analyse the data and get insights from it.

# Print the frequency of each unique value in every categorical (object) column
for cols in train.columns:
    if train[cols].dtype == 'object':
        print(train[cols].value_counts())
Code for counting each unique value of the categorical attributes

c. Now we go through the columns one by one.

  1. Item_Identifier: This column has no problems or errors, so it can be used as is.
  2. Item_Weight: Here we can see that it contains 1423 null values out of 8533 values, and its correlation with the target is also low, so we can drop the column, i.e. we do not use this attribute for training our model.
  3. Item_Fat_Content: Counting the unique values gives the following:

Here we can see that Low Fat, LF, and low fat are the same category, and that Regular and reg are the same category. So we replace LF, low fat, and reg with their proper category names. We therefore now have only two categories for item fat content.

# Merge the inconsistent spellings into the two real categories
train.Item_Fat_Content = train.Item_Fat_Content.replace('LF', 'Low Fat')
train.Item_Fat_Content = train.Item_Fat_Content.replace('low fat', 'Low Fat')
train.Item_Fat_Content = train.Item_Fat_Content.replace('reg', 'Regular')
train.Item_Fat_Content.value_counts()
Replacing the inconsistent category labels

  4. Item_Visibility: In the histogram above we can see that many items have a visibility of zero. The correlation with the target is also negative. We do not know whether a zero is the actual visibility ratio or an error, and since the correlation strength is also very low, we can drop the column.
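To put a number on the zero-visibility observation rather than reading it off the histogram, the zeros can be counted directly. This is a small sketch on made-up values; the counts it prints are not the article's figures.

```python
import pandas as pd

# Made-up visibility values, two of which are exactly zero
train = pd.DataFrame({"Item_Visibility": [0.016, 0.0, 0.019, 0.0, 0.054]})

# Boolean mask of zero entries; summing it counts the True values
zero_count = int((train["Item_Visibility"] == 0).sum())
zero_share = zero_count / len(train)
print(f"{zero_count} zero values ({zero_share:.0%} of rows)")
```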

So the final attributes with which we train our model are:

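attributes = ['Item_MRP', 'Outlet_Type', 'Outlet_Location_Type', 'Outlet_Size',
              'Outlet_Establishment_Year', 'Outlet_Identifier',
              'Item_Type', 'Item_Outlet_Sales']
# .copy() gives an independent frame, so later .loc assignments on `data`
# do not trigger pandas' SettingWithCopyWarning
data = train[attributes].copy()
Attributes used to train the model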

Handle categorical columns

Since this is a regression problem, the model needs numeric inputs, so we have to convert the categorical values. I used label encoding to convert categories to numbers. In the example below, the Outlet_Type values are converted to numerical values.

# Map each outlet type to an integer code
data.loc[data['Outlet_Type'] == 'Supermarket Type1', 'Outlet_Type'] = 1
data.loc[data['Outlet_Type'] == 'Supermarket Type2', 'Outlet_Type'] = 2
data.loc[data['Outlet_Type'] == 'Supermarket Type3', 'Outlet_Type'] = 3
data.loc[data['Outlet_Type'] == 'Grocery Store', 'Outlet_Type'] = 4

We do the same for the remaining categorical columns, i.e. Outlet_Location_Type, Item_Type, and Outlet_Identifier.
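Writing one line per category gets tedious for columns with many values. As a sketch of the same idea in a loop, the hypothetical helper `label_encode` below (not part of pandas or of this article's code) assigns integer codes starting at 1 to the sorted categories of each column. The code order therefore differs from the hand-written Outlet_Type mapping above, and the sample data is made up.

```python
import pandas as pd


def label_encode(frame, columns):
    """Return a copy of `frame` with `columns` replaced by integer codes,
    plus the category-to-code mapping used for each column."""
    encoded = frame.copy()
    mappings = {}
    for col in columns:
        # Sorted unique categories, ignoring missing values
        categories = sorted(encoded[col].dropna().unique())
        mapping = {cat: code for code, cat in enumerate(categories, start=1)}
        # Values not in the mapping (e.g. missing ones) stay missing
        encoded[col] = encoded[col].map(mapping)
        mappings[col] = mapping
    return encoded, mappings


# Made-up rows; only the column names follow the article.
data = pd.DataFrame({
    "Outlet_Location_Type": ["Tier 1", "Tier 3", "Tier 2", "Tier 1"],
    "Item_Type": ["Dairy", "Snack Foods", "Dairy", "Meat"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1],
})

encoded, mappings = label_encode(data, ["Outlet_Location_Type", "Item_Type"])
print(encoded)
print(mappings)
```

Keeping the returned mappings makes it possible to apply exactly the same codes to new data later.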

Step 3: Build Model

Train-test split: first, we split the training data into two parts.

from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing
train1, test1 = train_test_split(data, test_size=0.20, random_state=2019)
# train1.shape, test1.shape

# Separate the target from the features
train_label = train1['Item_Outlet_Sales']
test_label = test1['Item_Outlet_Sales']
del train1['Item_Outlet_Sales']
del test1['Item_Outlet_Sales']

Now we build our model using the linear regression implementation available in the sklearn library. We fit the model on the training data and predict on the test data.

from sklearn.linear_model import LinearRegression

lr = LinearRegression()
# fit the model using the training data
lr.fit(train1, train_label)
# predict the target for the test data
predict_lr = lr.predict(test1)

Step 4: Check and compare accuracy

Now we check the accuracy of our model using the root mean squared error (RMSE), i.e. the square root of the mean squared error.

import numpy as np
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(test_label, predict_lr)
lr_score = np.sqrt(mse)  # root mean squared error (RMSE)
print(lr_score)
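The score printed above is the square root of the mean squared error, so it is expressed in the same units as Item_Outlet_Sales. The sketch below reproduces that calculation by hand on made-up numbers and also computes the R² score with `r2_score`, which is an extra metric added here for illustration and is not used in this article.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values, for illustration only
actual = np.array([100.0, 200.0, 300.0, 400.0])
predicted = np.array([110.0, 190.0, 320.0, 380.0])

# RMSE via scikit-learn's MSE, as in the article
mse = mean_squared_error(actual, predicted)
rmse = np.sqrt(mse)

# The same quantity written out: square the errors, average, take the root
rmse_by_hand = np.sqrt(np.mean((actual - predicted) ** 2))

# R^2: share of the target's variance explained by the predictions
r2 = r2_score(actual, predicted)

print(rmse, rmse_by_hand, r2)
```

A lower RMSE is better, while an R² closer to 1 is better, so the two are read in opposite directions.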

Compare the actual and predicted data

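import matplotlib.pyplot as plt
import pandas as pd

# Side-by-side frame of actual and predicted sales for the test rows
# (named `comparison` so it does not overwrite the training frame `data`)
comparison = pd.DataFrame({'Actual': test_label, 'Predicted': predict_lr})
# Plot the first 25 rows as grouped bars
comparison.head(25).plot(kind='bar', figsize=(16, 8))
plt.show()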

Here we can visually compare the actual and predicted values.

My model does not reach a satisfactory accuracy, but that is okay, since the goal was to learn how to build a model. Focusing more on data preprocessing can yield higher accuracy; it is one of the most important steps in building a more accurate model.