Chapter 2 - Data Pre-processing: Data pre-processing is a crucial part of machine learning.
1. Dataset :
Get the dataset that we will use for pre-processing. Suppose you have a dataset like the one below, stored in data.csv.
In a machine learning model we need to identify the dependent variable and the independent variables. In the example below, the first three columns are the independent variables and the Purchased column is the dependent variable.
Name  | Age | Salary | Purchased
Bon   | 44  | 72000  | No
Ram   | 27  | 48000  | Yes
Sohan | 30  | 54000  | No
eric  | 38  | 61000  | No
Mat   | 40  |        | Yes
Denie | 35  | 58000  | Yes
Andie |     | 52000  | No
Rus   | 48  | 79000  | Yes
Mak   | 50  | 83000  | No
Mark  | 37  | 67000  | Yes
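Written out as a raw CSV file, the same dataset looks like this; note that the two missing cells are simply left empty, which pandas will read as NaN:

Name,Age,Salary,Purchased
Bon,44,72000,No
Ram,27,48000,Yes
Sohan,30,54000,No
eric,38,61000,No
Mat,40,,Yes
Denie,35,58000,Yes
Andie,,52000,No
Rus,48,79000,Yes
Mak,50,83000,No
Mark,37,67000,Yes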
2. Importing Libraries :
From here let's start with Python. We are using Spyder (Python), which you can get from Anaconda. Now we import the three libraries below:
#import libraries
import numpy as np
import matplotlib.pyplot as py
import pandas as pd
NumPy (shortcut np) is generally used for mathematical operations.
Matplotlib is the library whose sub-library pyplot we import with the shortcut py; it is used to draw charts.
Pandas (shortcut pd) is used to import the dataset; it is the best library for loading datasets.
Now select the code and press Ctrl + Enter; you will see the output in the console as shown below.
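As a quick illustration of what each of these libraries is for, here is a minimal sketch; the values and the chart are only illustrative and are not part of the dataset:

# quick illustration of the three libraries (values are only illustrative)
import numpy as np
import matplotlib.pyplot as py
import pandas as pd

ages = np.array([44, 27, 30])          # numpy: numerical arrays and math
print(np.mean(ages))                   # mean of the array

py.plot(ages)                          # matplotlib.pyplot: draw a simple chart
py.show()

df = pd.DataFrame({'Age': ages})       # pandas: tabular data handling
print(df.head())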
3. Importing the Dataset :
Before importing the dataset you need to set the working directory in Spyder. In the right-side window there is a File Explorer tab; browse to the folder you want to work in, save your file in that folder, and then run it with the Run File (F5) command.
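If you prefer to set the working directory from code rather than through the File Explorer tab, a minimal sketch is shown below; the folder path is only a placeholder, replace it with the folder that contains your data file:

import os

# hypothetical path - replace with the folder that contains Data.csv
os.chdir('C:/path/to/your/project')
print(os.getcwd())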
Now we need to import the data.csv file described in step 1.
# import the dataset
dataset = pd.read_csv('Data.csv')
Now run that line using Ctrl + Enter. You can see the imported dataset in the Variable Explorer tab, as shown below. Note that in Python the index starts from 0, as you will see in the output.
Next we need to distinguish the matrix of features from the dependent variable. We create the matrix of features X from the three independent-variable columns, and the vector Y from the dependent variable.
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 3].values
In iloc[rows, columns] the first index selects rows and the second selects columns. The colon : takes all rows, and :-1 takes every column from position 0 up to, but not including, the last one, so X contains only the independent variables. Run that line using Ctrl + Enter and inspect the value of X: you will see all the independent variables.
[:, 3] takes all rows and only the column at index 3 (Purchased). Press Ctrl + Enter and check variable Y: it contains the dependent variable.
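To see how the [rows, columns] indexing of iloc behaves, here is a small self-contained sketch built from the first three rows of the table above:

import pandas as pd

dataset = pd.DataFrame({
    'Name': ['Bon', 'Ram', 'Sohan'],
    'Age': [44, 27, 30],
    'Salary': [72000, 48000, 54000],
    'Purchased': ['No', 'Yes', 'No'],
})

X = dataset.iloc[:, :-1].values   # all rows, every column except the last
Y = dataset.iloc[:, 3].values     # all rows, only column index 3 (Purchased)

print(X)   # rows of Name, Age, Salary for Bon, Ram and Sohan
print(Y)   # ['No' 'Yes' 'No']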
4. Missing Data :
The first problem with data is missing values, which happens quite often in real life. How do we deal with it? If you look at the dataset above, two values are missing: Mat's Salary and Andie's Age.
One option is to remove the rows containing missing data, but that is not a good option. A better idea is to replace each missing value with the mean of its column.
We will use a library for this rather than writing our own mean-imputation code: the Imputer class from scikit-learn's preprocessing sub-library.
# Fill the missing data with the mean of the column values
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
Run that code using Ctrl + Enter and look at the value of X: the missing values for Mat and Andie are now filled with the column means.
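Note that in newer versions of scikit-learn the Imputer class has been removed; its replacement is SimpleImputer in sklearn.impute. A roughly equivalent sketch, assuming a recent scikit-learn, would be:

import numpy as np
from sklearn.impute import SimpleImputer

# same idea as above: replace NaN in the Age and Salary columns with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])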
5. Train and Test Split
A machine learning model learns from whatever data it is given. So we split the data into two parts: a training set and a test set.
Here we import train_test_split from sklearn.model_selection.
# split the dataset into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=42)
Here test_size is 0.30, which means 30% of the total data is used as the test set.
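With the 10 rows in this dataset, a 30% test split gives 7 training rows and 3 test rows; you can confirm the sizes with a quick check like this:

# quick sanity check of the split sizes
print(X_train.shape, X_test.shape)   # expect 7 and 3 rows respectively
print(y_train.shape, y_test.shape)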
Full code, along with the encoding and scaling steps:
# -*- coding: utf-8 -*-
"""
Spyder Editor

This is a temporary script file.
"""
#import libraries
import numpy as np
import matplotlib.pyplot as py
import pandas as pd

# import the dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 3].values

# Fill the missing data with the mean of the column values
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
print(X)

# Encode the categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# label encoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
labelencoder_y = LabelEncoder()
Y = labelencoder_y.fit_transform(Y)
# one hot encoder
onehotencoder = OneHotEncoder(categorical_features=[0])
X = onehotencoder.fit_transform(X).toarray()

# split the data into train and test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=42)

# feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)  # reuse the scaler fitted on the training set

########################################
# separate example: fetch NSE/SBIN stock data with quandl
import quandl
df = quandl.get("NSE/SBIN")
print(df.tail())
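One caveat about the full code above: the categorical_features argument of OneHotEncoder, like the Imputer class, was removed in newer scikit-learn releases. If you are on a recent version, a sketch of the equivalent one-hot step using ColumnTransformer would look like this:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# one-hot encode column 0 and pass the remaining columns through unchanged;
# sparse_threshold=0 forces a dense array as output
ct = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(), [0])],
    remainder='passthrough',
    sparse_threshold=0,
)
X = ct.fit_transform(X)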