Solution: Tips and Tricks

Follow the instructions for building pipelines, calculating feature importance, and exporting models.

Again, there are many possible solutions to this challenge, depending on the type of preprocessing you want to use.

Here’s a simple solution using only a few preprocessing tools:

main.py
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
import joblib

# Load the dataset
data = pd.read_csv('data.csv')

# Define column types
id_col = 'customerID'
num_cols = ['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges']
cat_cols = [
    'gender', 'Partner', 'Dependents',
    'PhoneService', 'MultipleLines', 'InternetService',
    'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
    'TechSupport', 'StreamingTV', 'StreamingMovies',
    'Contract', 'PaperlessBilling', 'PaymentMethod'
]

# Define X (features) and y (target variable)
X = data.drop(['Churn', id_col], axis=1)
y = data['Churn']

# Convert labels to numeric form
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define transformers
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='error', drop='first'))
])

# Create the column transformer
preprocessor = ColumnTransformer(transformers=[
    ('numeric', numeric_transformer, num_cols),
    ('categorical', categorical_transformer, cat_cols)
])

# Create the pipeline with the preprocessor and classifier
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

# Fit the pipeline
pipeline.fit(X_train, y_train)

# Calculate feature importance using permutation
# (one score per original input column, because the whole pipeline is permuted)
result = permutation_importance(
    pipeline, X_test, y_test, scoring='f1', n_repeats=10, random_state=42
)
importance_scores = result.importances_mean
for col, score in zip(X.columns, importance_scores):
    print(f"{col}: {score}")

# Export the pipeline using joblib
joblib.dump(pipeline, 'pipeline.joblib')

# Get a new observation for testing
new_data = pd.read_csv('new_data.csv')
new_data.drop(id_col, axis=1, inplace=True)

# Apply the pipeline to the new observation
prediction = pipeline.predict(new_data)
final_prediction = label_encoder.inverse_transform(prediction)
print("Final prediction:", final_prediction)
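Separately from the solution listing, the export step can be checked with a small round trip. The sketch below is only an illustration under stated assumptions: the toy data, the names `toy_pipeline` and `restored_pipeline`, and the temporary file path are made up for the example and are not part of the challenge files. It saves a fitted pipeline with `joblib.dump`, reloads it with `joblib.load`, and confirms that both objects give the same predictions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): one feature, two well-separated classes
X_toy = np.array([[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A small stand-in pipeline, not the churn pipeline from the solution
toy_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])
toy_pipeline.fit(X_toy, y_toy)

# Save the fitted pipeline to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_pipeline.joblib')
    joblib.dump(toy_pipeline, path)
    restored_pipeline = joblib.load(path)

# The restored pipeline should reproduce the original predictions
original_predictions = toy_pipeline.predict(X_toy)
restored_predictions = restored_pipeline.predict(X_toy)
print(np.array_equal(original_predictions, restored_predictions))
```

Because the preprocessing steps are stored inside the pipeline, the reloaded object can be applied to raw inputs directly. In the solution, the fitted `label_encoder` is a separate object, so it would presumably need to be saved as well if the original class labels must be recovered in another session.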
  • Lines 1–11: We import all the required libraries to implement the challenge.

  • Lines 41–43: We define numeric_transformer, which includes imputation using the mean strategy and standardization of numerical features using StandardScaler. ...
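To make the effect of `numeric_transformer` concrete, here is a minimal sketch on a made-up single column (the values in `toy_column` are assumptions for illustration, not taken from `data.csv`). The missing entry is first replaced by the mean of the observed values, and the column is then standardized to zero mean and unit variance.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same two steps as the numeric transformer in the solution
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Hypothetical column with one missing value
toy_column = np.array([[1.0], [np.nan], [3.0]])

# Step 1 fills NaN with the mean of the observed values: (1 + 3) / 2 = 2
# Step 2 subtracts the column mean (2) and divides by the standard deviation
transformed = numeric_transformer.fit_transform(toy_column)

# The imputed entry sits exactly at the mean, so it becomes 0 after scaling
print(transformed.ravel())
```

Since the imputed value equals the column mean, it carries no signal after standardization, which is one reason mean imputation is a common simple default for numeric features.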