Machine Learning
Train a machine learning model with scaled training data, predict with test data, and visualize predictions.
Let's create a machine learning model using the linear regression module from scikit-learn to predict the house price based on the selected features.
Get started
Let’s say we have cleaned our data, treated the missing values and categorical variables, removed outliers, and created required new features (if needed). Now, our data is ready to feed into the machine learning model. The very first thing to do now is to separate our data into the following:
X: Will contain the selected features, also called the independent variables.
y: Will contain the target values; in this case, the house price, also called the dependent variable.
```python
X = df[['CRIM', 'RM', 'DIS', 'NOX']]
y = df['price']  # target

# Might be a good idea to recheck what is in X and y
print(X.head(2))
print('=======')
print(y.head(2))
```
Note: Uppercase X and lowercase y are just conventions, and it is recommended to use these names for the features and the target, respectively.
Standardization: Feature scaling
Let's see what X (the original unscaled features) looks like.
```python
# This is what X (original unscaled features) looks like
print("Original unscaled features:")
print("CRIM mean:", round(X.CRIM.mean(), 3), "CRIM var:", round(np.var(X.CRIM), 3))
print(X.head(2))
```
Remember, the machine learning algorithms that employ gradient descent as an optimization strategy, such as linear regression, logistic regression, and neural networks, require data to be scaled. Let’s scale our features and check the difference.
```python
from sklearn.preprocessing import StandardScaler
import pickle  # needed to save/load the fitted scaler

scaler = StandardScaler()              # creating instance 'scaler'
scaler.fit(X)                          # fitting the features

pickle.dump(scaler, open('transformation.pkl', 'wb'))   # saving the transformation
scaler = pickle.load(open('transformation.pkl', 'rb'))  # loading the saved transformation
X_scaled = scaler.transform(X)         # transforming the features

# Check the difference!
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)  # DataFrame for the scaled features
print("Scaled features (0 mean, 1 variance):")
print("CRIM mean:", round(X_scaled.CRIM.mean(), 3), "CRIM var:", round(np.var(X_scaled.CRIM), 3))
print(X_scaled.head(2))
```
We have standardized all the features in the code above before splitting them into train and test datasets. It's important to know that a model trained on standardized features needs unseen data to be standardized in exactly the same way before it can make predictions. So it's recommended, and considered good practice, to serialize/save the transformation fitted on the training dataset. We can then load it and transform the unseen data before making predictions.
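The save-then-load workflow can be sketched end to end on a small synthetic dataset; the column names, file name, and sample values below are illustrative assumptions, not the course's actual data:

```python
import pickle
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative training data, standing in for the selected features.
X = pd.DataFrame({'CRIM': [0.1, 0.2, 0.3, 0.4], 'RM': [5.0, 6.0, 7.0, 8.0]})

scaler = StandardScaler().fit(X)             # learn mean/variance from the training data
with open('transformation.pkl', 'wb') as f:  # serialize the fitted transformation
    pickle.dump(scaler, f)

# Later, before predicting on unseen data, reload and reuse the SAME transformation.
X_new = pd.DataFrame({'CRIM': [0.25], 'RM': [6.5]})  # hypothetical unseen sample
with open('transformation.pkl', 'rb') as f:
    scaler = pickle.load(f)
X_new_scaled = scaler.transform(X_new)  # uses the training mean/variance, not X_new's
print(X_new_scaled)
```

The key point is that the unseen sample is centered and scaled using the statistics learned from the training data, which is why the fitted scaler, not a freshly fitted one, must be applied.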
Linear regression model training
Let's train our very first machine learning model.
Train test split
Now, we have the features in X and the target (price) in y. The next step is to split the data into:
A training set (X_train and y_train)
A testing set (X_test and y_test)
...
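This split is commonly done with scikit-learn's train_test_split; here is a minimal sketch on synthetic data, where the 80/20 ratio and the random_state value are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative data standing in for the scaled features and the price target.
X = pd.DataFrame({'CRIM': np.arange(10, dtype=float), 'RM': np.arange(10, dtype=float)})
y = pd.Series(np.arange(10, dtype=float), name='price')

# Hold out 20% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)  # 8 training rows, 2 testing rows
```

Because X and y are split together, the feature rows and their target values stay aligned across the train and test sets.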