Solution: Tips and Tricks
Follow the instructions for building pipelines, calculating feature importance, and exporting models.
Again, there are many possible solutions to this challenge, depending on the type of preprocessing you want to use.
Here’s a simple solution using only a few preprocessing tools:
main.py (the workspace also includes data.csv and new_data.csv):

```python
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
import joblib

# Load the dataset
data = pd.read_csv('data.csv')

# Defining column types
id_col = 'customerID'
num_cols = ['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges']
cat_cols = [
    'gender', 'Partner', 'Dependents',
    'PhoneService', 'MultipleLines', 'InternetService',
    'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
    'TechSupport', 'StreamingTV', 'StreamingMovies',
    'Contract', 'PaperlessBilling', 'PaymentMethod',
]

# Define X (features) and y (target variable)
X = data.drop(['Churn', id_col], axis=1)
y = data['Churn']

# Convert labels to numeric form
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define transformers
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='error', drop='first'))])

# Create the column transformer
preprocessor = ColumnTransformer(transformers=[
    ('numeric', numeric_transformer, num_cols),
    ('categorical', categorical_transformer, cat_cols)])

# Create the pipeline with the preprocessor and classifier
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())])

# Fit the pipeline
pipeline.fit(X_train, y_train)

# Calculate feature importance using permutation
result = permutation_importance(pipeline, X_test, y_test, scoring='f1',
                                n_repeats=10, random_state=42)
importance_scores = result.importances_mean
for col, score in zip(X.columns, importance_scores):
    print(f"{col}: {score}")

# Export the pipeline using joblib
joblib.dump(pipeline, 'pipeline.joblib')

# Get a new observation for testing
new_data = pd.read_csv('new_data.csv')
new_data.drop(id_col, axis=1, inplace=True)

# Apply the pipeline to the new observation
prediction = pipeline.predict(new_data)
final_prediction = label_encoder.inverse_transform(prediction)
print("Final prediction:", final_prediction)
```
Lines 1–11: We import all the required libraries to implement the challenge.
Lines 41–43: We define `numeric_transformer`, which includes imputation using the mean strategy and standardization of numerical features using `StandardScaler`.
...
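To make the effect of `numeric_transformer` concrete, here is a minimal sketch that runs the same two steps on a tiny made-up column. The array `X_toy` is a hypothetical stand-in, not data from data.csv. It shows that the missing value is first filled with the column mean and then lands at 0 after standardization.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy column with one missing value (not taken from data.csv)
X_toy = np.array([[1.0], [2.0], [np.nan], [3.0]])

# Same steps as in the solution: mean imputation, then standardization
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

X_out = numeric_transformer.fit_transform(X_toy)

# The NaN is replaced by the mean of the observed values (2.0).
# Because that value equals the column mean, the scaler maps it to 0.
print(X_out.ravel())
```

After the transform, the column has mean 0 and unit variance, which puts all numeric features on a comparable scale before they reach the classifier.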