Asp.Net, C#.Net, Angularjs, Entity Framework, JQuery, MVC, Interview Question: Machine Learning

Chapter 2 - Data Processing: It is a crucial part of machine learning.

1. Dataset : Get the dataset that we will use for pre-processing. Suppose you have a dataset like the one below.

It is stored in Data.csv.

In a machine learning model we need to identify the dependent variable and the independent variables.

In the example below, the first three columns are independent variables and the Purchased column is the dependent variable.

Name	Age	Salary	Purchased
Bon	44	72000	No
Ram	27	48000	Yes
Sohan	30	54000	No
eric	38	61000	No
Mat	40		Yes
Denie	35	58000	Yes
Andie		52000	No
Rus	48	79000	Yes
Mak	50	83000	No
Mark	37	67000	Yes
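To follow along without the original file, the table above can be loaded from an inline string. This is only a minimal sketch: the name `csv_text` and the comma-separated layout are assumptions made for illustration, and the two empty fields stand for the missing values.

```python
import io

import pandas as pd

# Inline copy of the table above; the empty fields are the missing values.
csv_text = """Name,Age,Salary,Purchased
Bon,44,72000,No
Ram,27,48000,Yes
Sohan,30,54000,No
eric,38,61000,No
Mat,40,,Yes
Denie,35,58000,Yes
Andie,,52000,No
Rus,48,79000,Yes
Mak,50,83000,No
Mark,37,67000,Yes
"""

dataset = pd.read_csv(io.StringIO(csv_text))

# Independent variables: every column except the last one.
independent = list(dataset.columns[:-1])
# Dependent variable: the last column.
dependent = dataset.columns[-1]
# Number of missing values in each column.
missing_per_column = dataset.isna().sum()

print(independent, dependent)
print(missing_per_column)
```

Printing `missing_per_column` is a quick way to confirm which columns need attention before any model is trained.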

2. Importing Library :

Let's start with Python from here. We are using Spyder (Python), which you can get from Anaconda.

Now we import the three libraries below:

#import libraries

import numpy as np

import matplotlib.pyplot as py

import pandas as pd

NumPy, with np as its shortcut, is generally used for mathematical operations.

Matplotlib is the library whose pyplot sub-library we import with the shortcut py; it is used to draw charts.

Pandas is used to import the dataset; it is a convenient library for loading tabular data.
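As a rough illustration of what each of the three libraries does, here is a small sketch. The sample values are the first four rows of the table in step 1; the Agg backend and the in-memory buffer are assumptions made so that the sketch runs without opening a window (in Spyder you would normally leave the backend alone and let the chart appear in the plots pane).

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window is opened
import matplotlib.pyplot as py
import numpy as np
import pandas as pd

# NumPy: numerical operations on arrays.
ages = np.array([44, 27, 30, 38])
mean_age = ages.mean()

# pandas: tabular data with named columns.
frame = pd.DataFrame({"Age": ages, "Salary": [72000, 48000, 54000, 61000]})

# pyplot: draw a chart; here it is written to an in-memory PNG buffer.
figure, axes = py.subplots()
axes.scatter(frame["Age"], frame["Salary"])
axes.set_xlabel("Age")
axes.set_ylabel("Salary")
buffer = io.BytesIO()
figure.savefig(buffer, format="png")
py.close(figure)

print(mean_age)
```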

Now select the code and press Ctrl + Enter; you can see the output in the console.

3. Importing Dataset :

Before importing the dataset you need to set the working directory in Spyder. In the right-side window there is a File Explorer tab; select the folder you want to work in, save your file in that folder, and then run it with the Run File (F5) command.

Now we need to import the Data.csv file described in the 1st step.

# import the dataset

dataset = pd.read_csv('Data.csv')

Now run that line using Ctrl + Enter. You can then open the Variable Explorer tab and see the imported dataset.

In Python the index starts from 0, as you can see in the output.

Now we need to separate the matrix of features from the dependent variable.

The matrix of features is built from the three independent variables given there.

We create the matrix of features as X and the dependent variable vector as Y. X is for the independent variables and Y is for the dependent variable.

X = dataset.iloc[:,:-1].values

Y = dataset.iloc[:,3].values

[:,:-1] : the part before the comma selects rows and the part after it selects columns. With : we take all the rows.

:-1 takes the columns from position 0 up to, but not including, the last one.

Run that line using Ctrl + Enter and look at the value of X; you can see all the independent variables.

[:,3] : the first part takes all the rows and the second part takes the column at index 3. [row, column]

Now run it with Ctrl + Enter and check the variable Y.

It holds all the values of the dependent variable.
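The slicing can be tried on a few rows without the CSV file. The sketch below uses the first three rows of the table from step 1, typed in by hand for illustration.

```python
import pandas as pd

# Three rows taken from the table in step 1.
dataset = pd.DataFrame({
    "Name": ["Bon", "Ram", "Sohan"],
    "Age": [44, 27, 30],
    "Salary": [72000, 48000, 54000],
    "Purchased": ["No", "Yes", "No"],
})

# iloc[rows, columns]: ':' keeps every row, ':-1' keeps every column except the last.
X = dataset.iloc[:, :-1].values
# ':' keeps every row, 3 picks the column at index 3 (Purchased).
Y = dataset.iloc[:, 3].values

print(X.shape)
print(Y)
```

Writing `dataset.iloc[:, -1]` would select the same column as `dataset.iloc[:, 3]` here, and it keeps working if more feature columns are added later.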

4. Missing Data :

The first problem with data is missing values, so how do we deal with them? This happens quite often in real life. If you look at the dataset given above, two values are missing: the Salary for Mat and the Age for Andie.

One option is to remove the rows that have missing values, but that is not a good option because it throws away data. Another idea is to take the mean of the column and use it to replace the missing value.
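To make the two options concrete, here is a small sketch that does both by hand with pandas on the Age and Salary columns from step 1. It is only an illustration of the idea, with `np.nan` typed in where the values are missing.

```python
import numpy as np
import pandas as pd

# Age and Salary columns from step 1, with NaN marking the missing values.
data = pd.DataFrame({
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
})

# Option 1: drop every row that has a missing value (2 of the 10 rows are lost).
dropped = data.dropna()

# Option 2: replace each missing value with the mean of its column.
filled = data.fillna(data.mean())

print(len(dropped))
print(filled.loc[6, "Age"], filled.loc[4, "Salary"])
```

The mean is computed from the nine known values in each column, so the filled Age is 349 / 9 (about 38.78) and the filled Salary is 574000 / 9 (about 63777.78).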

We will use a library to fill in the missing data; we are not going to write our own mean method.

We will use SimpleImputer from the Scikit Learn impute sub-library to get it done. (The Imputer class from the pre-processing sub-library was removed in scikit-learn 0.22.)

# Fill the missing data with the mean of the column values

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')

imputer = imputer.fit(X[:, 1:3])

X[:, 1:3] = imputer.transform(X[:, 1:3])

Run that code using Ctrl + Enter, and look at the value of X.

You will see that the missing values for Mat and Andie are filled with the column means.

5. Test, Train and Split

In a machine learning model the machine learns from data; whatever data is provided to it, it learns from that. So we split the data into two parts: a training set, which the model learns from, and a test set, which is used to check how well it performs on data it has not seen.

Here we import train_test_split from sklearn.model_selection.

# split the dataset into train and test

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.30,random_state=42)

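Here test_size = 0.30 means that 30% of the total data is used as the test set. With the 10 rows of this dataset, that gives 3 test rows and 7 training rows. random_state = 42 fixes the shuffle so that the split is the same on every run.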
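The split sizes can be checked on placeholder data. In the sketch below the arrays are made-up stand-ins for the ten rows of the dataset, not the real values.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten placeholder samples standing in for the ten rows of the dataset.
X = np.arange(20).reshape(10, 2)
Y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=42
)

# Calling it again with the same random_state reproduces the same split.
X_train_again, X_test_again, _, _ = train_test_split(
    X, Y, test_size=0.30, random_state=42
)

print(X_train.shape, X_test.shape)
```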

Full code, along with the encoder and scaling.

# -*- coding: utf-8 -*-

"""

Spyder Editor

This is a temporary script file.

"""

#import libraries

import numpy as np

import matplotlib.pyplot as py

import pandas as pd

# import the dataset

dataset = pd.read_csv('Data.csv')

X = dataset.iloc[:,:-1].values

Y = dataset.iloc[:,3].values

# Fill the missing data with the mean of the column values

# (Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it)

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')

imputer = imputer.fit(X[:, 1:3])

X[:, 1:3] = imputer.transform(X[:, 1:3])

print(X)

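# Encode the categorical data

from sklearn.compose import ColumnTransformer

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# label encoder for the dependent variable (No -> 0, Yes -> 1)

labelencoder_y = LabelEncoder()

Y = labelencoder_y.fit_transform(Y)

# One hot encoder for column 0 (Name). The categorical_features argument was
# removed in scikit-learn 0.22, so ColumnTransformer selects the column instead.
# OneHotEncoder accepts strings directly, so X needs no label encoding first.
# sparse_threshold=0 forces a dense array as the result.

onehotencoder = ColumnTransformer([('name', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)

X = onehotencoder.fit_transform(X).astype(float)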

#split the data into train and test data

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.30,random_state=42)

#feature scaling

from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()

X_train = sc_X.fit_transform(X_train)

# reuse the mean and standard deviation learned from the training set

X_test = sc_X.transform(X_test)

########################################

import quandl

df = quandl.get("NSE/SBIN")

print(df.tail())
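The feature scaling step from the full code can also be checked in isolation. The sketch below assumes the usual convention of fitting the scaler on the training set only and reusing its statistics for the test set; the Age and Salary values and the hand-made split are placeholders for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder Age/Salary values, split by hand into train and test parts.
X_train = np.array([[44.0, 72000.0], [27.0, 48000.0],
                    [30.0, 54000.0], [38.0, 61000.0]])
X_test = np.array([[40.0, 65000.0], [35.0, 58000.0]])

sc_X = StandardScaler()
# fit_transform learns the mean and standard deviation from the training data.
X_train_scaled = sc_X.fit_transform(X_train)
# transform reuses those training statistics on the test data.
X_test_scaled = sc_X.transform(X_test)

# Equivalent manual computation: (value - training mean) / training std.
manual_test = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)

print(X_train_scaled.mean(axis=0))
print(X_train_scaled.std(axis=0))
```

After scaling, each training column has mean 0 and standard deviation 1, so Age and Salary end up on a comparable scale even though the raw salaries are about a thousand times larger than the ages.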

Sunday, 23 September 2018

Machine Learning - Data Processing
