Solution: Tips and Tricks

The solution below builds a preprocessing and classification pipeline, calculates permutation feature importance, and exports the fitted model with joblib.

Python 3.8

import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
import joblib
# Load the dataset
data = pd.read_csv('data.csv')
# Define the ID column and the numeric and categorical feature columns
id_col = 'customerID'
num_cols = ['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges']
cat_cols = [
    'gender', 'Partner', 'Dependents',
    'PhoneService', 'MultipleLines', 'InternetService',
    'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
    'TechSupport', 'StreamingTV', 'StreamingMovies',
    'Contract', 'PaperlessBilling', 'PaymentMethod'
]
# Define X (features) and y (target variable)
X = data.drop(['Churn', id_col], axis=1)
y = data['Churn']
# Convert labels to numeric form
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Define transformers
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='error', drop='first'))
])
# Create the column transformer
preprocessor = ColumnTransformer(transformers=[
    ('numeric', numeric_transformer, num_cols),
    ('categorical', categorical_transformer, cat_cols)
])
# Create the pipeline with the preprocessor and classifier
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])
# Fit the pipeline
pipeline.fit(X_train, y_train)
# Calculate feature importance using permutation
result = permutation_importance(
    pipeline, X_test, y_test, scoring='f1', n_repeats=10, random_state=42
)
importance_scores = result.importances_mean
# Importances are reported per raw input column, in the order of X_test.columns
for col, score in zip(X_test.columns, importance_scores):
    print(f"{col}: {score}")
# Export the pipeline using joblib
joblib.dump(pipeline, 'pipeline.joblib')
# Get a new observation for testing
new_data = pd.read_csv('new_data.csv')
new_data.drop(id_col, axis=1, inplace=True)
# Apply the pipeline to the new observation
prediction = pipeline.predict(new_data)
final_prediction = label_encoder.inverse_transform(prediction)
print("Final prediction:", final_prediction)
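
The solution above prints importances in column order and predicts with the in-memory pipeline. As a complement, here is a minimal sketch of two related steps: ranking permutation importances together with their spread, and checking that a pipeline reloaded with `joblib.load` predicts the same as the original. The toy DataFrame, the synthetic target, and the names `toy`, `toy_pipeline`, `ranking`, and `restored` are hypothetical stand-ins for illustration, not the course's `data.csv`.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for data.csv (not the course dataset)
rng = np.random.default_rng(42)
n = 300
toy = pd.DataFrame({
    'tenure': rng.integers(1, 72, size=n),
    'MonthlyCharges': rng.uniform(20, 120, size=n),
    'Contract': rng.choice(['Month-to-month', 'One year', 'Two year'], size=n),
})
# Synthetic target (assumption): short-tenure customers are labeled 1
target = (toy['tenure'] < 24).astype(int).to_numpy()

X_tr, X_te, y_tr, y_te = train_test_split(
    toy, target, test_size=0.2, random_state=42
)

toy_pipeline = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('numeric', StandardScaler(), ['tenure', 'MonthlyCharges']),
        ('categorical', OneHotEncoder(handle_unknown='ignore'), ['Contract']),
    ])),
    ('classifier', LogisticRegression()),
])
toy_pipeline.fit(X_tr, y_tr)

# Permutation importance shuffles one raw input column at a time,
# so the scores line up with X_te.columns
result = permutation_importance(
    toy_pipeline, X_te, y_te, scoring='f1', n_repeats=10, random_state=42
)
ranking = pd.DataFrame({
    'feature': X_te.columns,
    'mean': result.importances_mean,
    'std': result.importances_std,
}).sort_values('mean', ascending=False).reset_index(drop=True)
print(ranking)

# Save the fitted pipeline, load it back, and compare predictions
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_pipeline.joblib')
    joblib.dump(toy_pipeline, path)
    restored = joblib.load(path)

same_predictions = np.array_equal(
    toy_pipeline.predict(X_te), restored.predict(X_te)
)
print("Reloaded pipeline matches original:", same_predictions)
```

Sorting by the mean and keeping the standard deviation alongside it makes it easier to judge whether a small importance is distinguishable from noise. In a real deployment, the reloaded object would typically be the one used for prediction, in an environment with a compatible scikit-learn version.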
