Sunday 23 September 2018

Machine Learning - Data Processing


Chapter 2 - Data Processing: a crucial part of machine learning.

1.       Dataset : Get the dataset that we will use for pre-processing. Suppose you have a dataset like the one below.

It is stored in Data.csv.

For a machine learning model we need to identify the dependent variable and the independent variables.

In the example below, the first three columns are the independent variables and the Purchased column is the dependent variable.

Name,Age,Salary,Purchased
Bon,44,72000,No
Ram,27,48000,Yes
Sohan,30,54000,No
eric,38,61000,No
Mat,40,,Yes
Denie,35,58000,Yes
Andie,,52000,No
Rus,48,79000,Yes
Mak,50,83000,No
Mark,37,67000,Yes


2.       Importing Library :

Let's start with Python. We are using Spyder (Python), which you can get from Anaconda.

Now we import the three libraries below:

# import libraries
import numpy as np
import matplotlib.pyplot as py
import pandas as pd

NumPy, with np as its shortcut, is generally used for mathematical operations.

Matplotlib is the charting library. We use its sub-library pyplot, with py as the shortcut, to draw charts. (The more common alias for pyplot is plt.)

Pandas is used to import and manage the dataset. It is one of the most convenient libraries for importing datasets.

Now select the code and press Ctrl + Enter. You can see the output in the console.



3.       Importing Dataset :

Before importing the dataset you need to set the working directory in Spyder. In the right-side window there is a File Explorer tab. Select the folder that contains the dataset, save your script in that folder, and then run it with the Run File (F5) command.


Now we need to import the Data.csv file described in the 1st step.

# import the dataset
dataset = pd.read_csv('Data.csv')


Now run that line using Ctrl + Enter. In the Variable Explorer tab you can see the imported dataset.
In Python the index starts from 0, as you can see in the output.
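
As a quick check that the file was read as expected, the sketch below loads the same ten rows from an inline string instead of Data.csv. The inline string and the name csv_text are stand-ins for illustration and are not part of the original script. It then prints the shape, the zero-based index and the number of missing cells per column.

```python
import io
import pandas as pd

# Inline copy of the table from step 1 (assumption: Data.csv has the same content)
csv_text = """Name,Age,Salary,Purchased
Bon,44,72000,No
Ram,27,48000,Yes
Sohan,30,54000,No
eric,38,61000,No
Mat,40,,Yes
Denie,35,58000,Yes
Andie,,52000,No
Rus,48,79000,Yes
Mak,50,83000,No
Mark,37,67000,Yes
"""

dataset = pd.read_csv(io.StringIO(csv_text))

print(dataset.shape)            # (rows, columns)
print(dataset.index.tolist())   # index labels start at 0
print(dataset.isna().sum())     # missing cells per column
```

Empty cells in the CSV are read as NaN, which is what the missing-data step later relies on.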



Now we need to distinguish the matrix of features from the dependent variable.
The matrix of features is built from the three independent variables in the dataset.
We create the matrix of features as X and the dependent variable vector as Y: X holds the independent variables and Y holds the dependent variable.

X = dataset.iloc[:,:-1].values

Y = dataset.iloc[:,3].values


[:,:-1] : iloc takes [rows, columns]. The first : selects all the rows.
:-1 selects the columns from position 0 up to, but not including, the last one.
Run that line using Ctrl + Enter and look at the value of X. You can see all the independent variables.


[:,3] : the first : again takes all the rows, and 3 selects the column at index 3 (Purchased). [rows, columns]
Run it with Ctrl + Enter and check the variable Y.
It holds all the values of the dependent variable.
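
The slicing can be tried on a small frame first. This is a minimal sketch that types in only the first three rows of the table by hand, so it runs without Data.csv.

```python
import pandas as pd

# First three rows of the table from step 1, typed in by hand for illustration
dataset = pd.DataFrame({
    "Name": ["Bon", "Ram", "Sohan"],
    "Age": [44, 27, 30],
    "Salary": [72000, 48000, 54000],
    "Purchased": ["No", "Yes", "No"],
})

X = dataset.iloc[:, :-1].values   # all rows, every column except the last
Y = dataset.iloc[:, 3].values     # all rows, only the column at index 3

print(X.shape)  # (3, 3): three rows, three independent variables
print(Y)        # the Purchased values
```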


4.       Missing Data :
The first problem with real data is missing values, which happens quite often in real life, so how do we deal with it? If you look at the dataset given above, two values are missing: Mat's Salary and Andie's Age.
One option is to remove the rows with missing values, but that throws away data and is not a good option for a small dataset. Another idea is to take the mean of the column and use it to replace the missing value.
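
To make the two options concrete, here is a small sketch using only pandas on the Age and Salary columns of the table above. The names df, dropped and filled are chosen for illustration.

```python
import numpy as np
import pandas as pd

# Age and Salary columns from step 1; np.nan marks the two missing cells
df = pd.DataFrame({
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
})

dropped = df.dropna()            # option 1: remove rows with a missing value
filled = df.fillna(df.mean())    # option 2: replace with the column mean

print(len(df), len(dropped))     # 10 rows before, 8 after dropping
print(filled.loc[6, "Age"])      # mean of the nine known ages
print(filled.loc[4, "Salary"])   # mean of the nine known salaries
```

Dropping rows costs two of the ten samples here, while mean replacement keeps every row.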

We will use a library to fill in the missing data; we are not going to write our own mean method.
We will use the scikit-learn class SimpleImputer from the sklearn.impute sub-library to get it done. (Older scikit-learn versions provided this as Imputer in sklearn.preprocessing; it was removed in version 0.22.)

# Fill the missing data with the mean of each column
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])


Run that code using Ctrl + Enter and look at the value of X.


You will see that the missing values for Mat and Andie are filled with the column means.

5.       Test, Train and Split
In a machine learning model the machine learns from whatever data is provided to it. So we split the data into two parts: a training set, which the model learns from, and a test set, which is used to check how well it performs on data it has not seen.
Here we import train_test_split from sklearn.model_selection.
# split the dataset into train and test
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.30,random_state=42)


Here test_size = 0.30 means that 30% of the total data is used as test data.
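
The effect of test_size can be checked on stand-in data. The sketch below uses a made-up array of 10 samples in place of the real X and Y, and only illustrates the resulting sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 10 samples with 2 features each, plus 10 labels
X = np.arange(20).reshape(10, 2)
Y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=42
)

print(X_train.shape, X_test.shape)  # 7 training rows, 3 test rows
```

Fixing random_state makes the split reproducible, so the same rows land in the test set on every run.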



Full code, along with the encoder and feature scaling.

# -*- coding: utf-8 -*-
"""
Spyder Editor

This is a temporary script file.
"""

#import libraries
import numpy as np
import matplotlib.pyplot as py
import pandas as pd


# import the dataset
dataset = pd.read_csv('Data.csv')

X = dataset.iloc[:,:-1].values
Y = dataset.iloc[:,3].values


# Fill the missing data with the mean of each column
# (SimpleImputer replaces the Imputer class removed in scikit-learn 0.22)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

print(X)


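# Encode the categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Label encoder for the dependent variable (No/Yes -> 0/1)
labelencoder_y = LabelEncoder()
Y = labelencoder_y.fit_transform(Y)

# One hot encoder for the Name column (column 0); Age and Salary pass through.
# OneHotEncoder(categorical_features=[0]) was removed in scikit-learn 0.22,
# so ColumnTransformer is used instead; sparse_threshold=0 forces a dense array.
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0])],
                       remainder='passthrough', sparse_threshold=0)
X = ct.fit_transform(X).astype(float)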


#split the data into train and test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.30,random_state=42)



# feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
# reuse the scaling fitted on the training set; do not refit on the test set
X_test = sc_X.transform(X_test)



########################################
import quandl

df = quandl.get("NSE/SBIN")
print(df.tail())









