A Simple Model
This lesson will introduce predictive modeling and will focus on how to construct a simple model with loss functions.
In the previous chapters of this course, we looked at data cleaning, data exploration, and statistical inference. Now it is time to move to the last stage of the data science lifecycle: making predictions that help us in decision making. So far, we have discovered interesting patterns in the data and checked that they are significant, but how do we use these patterns to predict future events? To do this, we build models and use them to make predictions.
Modeling
A model is a representation of a system. It tries to approximate real-world phenomena. For instance, Isaac Newton gave us a model of gravity. Using that model, we can predict how far or how high a ball will go if we throw it with a certain force. In the gravity model, certain factors affect the outcome, such as the force with which the ball is thrown and the mass of the ball. In the same way, we can build a model to predict whether a certain client will default on their credit card payment next month, where the client's payment history and some other factors may affect the outcome of our model.
There are many different ways of making models and measuring their effectiveness. But first, let’s start with a very simple model.
Predicting waiter tips
We have the data of customers that paid a tip at a restaurant. We will try to make a model that predicts the tip paid by the customer. Let’s load the dataset and look at it first.
# Tips Dataset
# total_bill: Total bill (cost of the meal), including tax, in US dollars
# tip: Tip (gratuity) in US dollars
# gender: Gender of person paying for the meal (male, female)
# smoker: Smoker in party? (0=No, 1=Yes)
# day: name of day of the visit
# time: time of visit (Lunch,Dinner)
# people: number of people of the party

import pandas as pd
df = pd.read_csv('tips.csv')

print(df.head())
print(df.describe())
Now a very simple model can be that the customer always pays a fixed fraction of the total bill as the tip. So mathematically:
$$\text{tip} = \text{fraction} \times \text{total\_bill}$$
Here the fixed fraction is called our model parameter. If we denote the model parameter with $\theta$, the total_bill value with $x$, and the predicted tip with $\hat{y}$, then the above equation becomes
$$\hat{y} = \theta x$$
So, our simple model becomes a mathematical function, $f_\theta(x) = \theta x$, that takes in an input $x$ and gives us the predicted tip $\hat{y}$ as output.
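As a minimal sketch of this idea, the model can be written as a one-line Python function. The function name predict_tip and the parameter value 0.15 are illustrative assumptions, not values taken from the lesson or the dataset.

```python
def predict_tip(theta, total_bill):
    """Fixed-fraction model: predicted tip = theta * total_bill."""
    return theta * total_bill


# Hypothetical parameter value, chosen only for illustration.
theta = 0.15

# Apply the model to a few made-up bill amounts.
for bill in [10.0, 20.0, 40.0]:
    tip = predict_tip(theta, bill)
    print(f"bill = {bill:.2f} -> predicted tip = {tip:.2f}")
```

Changing theta changes every prediction at once, which is why it is called the model parameter.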
Let’s first create a column to check the percent tips.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('tips.csv')

# Calculate percent tip
df['percent_tip'] = df['tip'] * 100.0 / df['total_bill']
df['percent_tip'] = df['percent_tip'].round()

# Plot value counts of percent tip
df['percent_tip'].value_counts().plot(kind = 'bar')
We calculate the percentage tips in line 6 by multiplying the tip by 100.0 and then dividing it by the total_bill. Afterward, in line 7, we round the values so that they are easy to distinguish. Then we plot the counts of each value of percent_tip in line 10. We first retrieve the counts of each value in percent_tip by using the value_counts function and then plot them using the plot function.
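To see what these steps produce without needing tips.csv, here is a small sketch on a hand-made DataFrame. The four rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up rows for illustration only; these are not from tips.csv.
sample = pd.DataFrame({
    "total_bill": [10.0, 20.0, 40.0, 50.0],
    "tip": [1.5, 4.0, 6.0, 10.0],
})

# Percent tip, rounded to the nearest whole number.
sample["percent_tip"] = (sample["tip"] * 100.0 / sample["total_bill"]).round()

# Count how many rows fall on each rounded percentage.
counts = sample["percent_tip"].value_counts()
print(sample)
print(counts)
```

In this toy sample, the rounded percentages fall into two groups, 15 and 20, so a bar plot of counts would show two bars.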
By looking at the plot, we find that the percent tip varies from customer to customer, so a model that always predicts the same fixed percentage of the bill as the tip does not predict accurately in all cases. In other words, our model is not performing well. At this point, we need some way of measuring how far off the predictions are from the actual values. This is where loss functions come in.
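To make "how far off" concrete, one simple check is to subtract each predicted tip from the actual tip. The sketch below uses made-up bills and tips and a hypothetical parameter value of 0.15; neither comes from the dataset.

```python
import pandas as pd

# Made-up rows for illustration only; these are not from tips.csv.
sample = pd.DataFrame({
    "total_bill": [10.0, 20.0, 40.0],
    "tip": [2.0, 3.0, 5.0],
})

# Hypothetical model parameter (fraction of the bill paid as tip).
theta = 0.15

# Prediction of the fixed-fraction model for each row.
sample["predicted_tip"] = theta * sample["total_bill"]

# Per-row error: actual tip minus predicted tip.
sample["error"] = sample["tip"] - sample["predicted_tip"]
print(sample)
```

A positive error means the model predicted too little for that customer, and a negative error means it predicted too much.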
Loss functions
A loss function is a mathematical function that takes in the values predicted using our model parameter and a set of actual values, and tells us how well our model performs on the entire input data (set of values: ...