Example: Train/Test Split with SQL
Learn to split a dataset for testing and training using SQL.
A common task when analyzing data is to split a dataset for training and testing. The training dataset is used to train the model, and the test dataset is used to evaluate the model.
Creating a transactions table
To practice what we’ve learned so far, let’s start by generating a transaction table with some random data:
CREATE TABLE transaction AS
SELECT
    id,
    '2021-01-01'::date + interval '1 day' * ceil(random() * 365) as billed_at,
    10 + ceil(90 * random()) as charged_amount,
    random() > 0.6 as reported_as_fraud
FROM
    generate_series(1, 10) AS id
ORDER BY 1;
The transaction table includes the date and amount of each transaction, and an indication of whether the transaction was reported as fraudulent.
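If you want to confirm the structure of the generated table, one optional way is to query information_schema (filtering on the schema as well may be necessary if the name transaction exists in more than one schema):

-- Optional: inspect the column names and types of the generated table.
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'transaction'
ORDER BY ordinal_position;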
To generate the table, use the techniques we learned in this chapter:
SELECT '2021-01-01'::date + interval '1 day' * ceil(random() * 365) as billed_at;
This produces a random billing date by adding a random number of days, between 1 and 365, to January 1st, 2021.
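As a quick, optional sanity check, we can evaluate the expression many times and look at the range of dates it produces. The sample size of 100,000 and the alias samples are arbitrary choices for this sketch; the minimum and maximum should land right around the edges of the one-year window:

SELECT min(d) AS earliest, max(d) AS latest
FROM (
    SELECT '2021-01-01'::date + interval '1 day' * ceil(random() * 365) AS d
    FROM generate_series(1, 100000)
) AS samples;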
Next, produce a random charged amount between 11 and 100:
SELECT 10 + ceil(90 * random()) as charged_amount;
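Again, sampling the expression a large number of times is a simple way to convince ourselves of the bounds; over enough samples, the minimum and maximum should come out as 11 and 100. This is just a verification sketch, with an arbitrary sample size:

SELECT min(amount) AS lowest, max(amount) AS highest
FROM (
    SELECT 10 + ceil(90 * random()) AS amount
    FROM generate_series(1, 100000)
) AS samples;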
Finally, produce the value we want to predict: whether the transaction will be reported as fraud:
SELECT random() > 0.6 as reported_as_fraud;
In our fake data, we want roughly 40% of the transactions to be fraudulent. Since random() returns a value between 0 and 1, the expression random() > 0.6 evaluates to true about 40% of the time.
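To see that the proportion really is close to 40%, we can evaluate the expression over a large series and average the results; casting the boolean to an integer turns true into 1 and false into 0. The sample size here is arbitrary, and this check is purely optional:

SELECT avg((random() > 0.6)::int) AS fraud_rate
FROM generate_series(1, 100000);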
This is what the data looks like:
SELECT * FROM transaction;
To test a model that classifies transactions as fraudulent, we want to create two tables—one for training and another for testing.
Splitting the data into training and test sets
To create a table similar to an existing table in PostgreSQL, we can use the following commands:
CREATE TABLE transaction_training AS TABLE transaction WITH NO DATA;
CREATE TABLE transaction_test AS TABLE transaction WITH NO DATA;
This syntax is really useful. We simply tell PostgreSQL to create a table similar to another table, but with no data. In this case, we created two tables, transaction_training and transaction_test, with the same structure as transaction, but with no data.
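As a side note, CREATE TABLE ... AS TABLE ... WITH NO DATA copies only the column definitions. If you also want the new tables to copy defaults, constraints, and indexes from the source table, PostgreSQL offers the LIKE clause as an alternative to the statements above (not in addition to them). It makes no difference for our simple transaction table, but it is worth knowing:

-- Alternative: LIKE copies the column definitions, and INCLUDING ALL also
-- copies defaults, constraints, and indexes from the source table.
CREATE TABLE transaction_training (LIKE transaction INCLUDING ALL);
CREATE TABLE transaction_test (LIKE transaction INCLUDING ALL);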
Next, we want to split the data in the transaction table between transaction_training and transaction_test. We want our training set to include 80% of the rows, which in this case means 8 rows:
WITH training_transaction_ids AS (
    INSERT INTO transaction_training
    SELECT * FROM transaction
    ORDER BY random()
    LIMIT 8
    RETURNING id
)
INSERT INTO transaction_test
SELECT * FROM transaction
WHERE id NOT IN (
    SELECT id
    FROM training_transaction_ids
);
To populate the training data, we select from the transaction table, use ORDER BY random() to shuffle the rows, and insert only the first 8 rows into transaction_training.
To insert only the remaining rows into the test table, we keep the IDs of the training rows by specifying RETURNING id in a common table expression. We then insert rows into transaction_test, excluding the rows whose IDs appear in training_transaction_ids.
This is the result:
SELECT * FROM transaction_training;
SELECT * FROM transaction_test;
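As a final, optional sanity check, we can verify that the two tables add up to the original 10 rows and that no transaction ended up in both. The column aliases below are just illustrative:

SELECT
    (SELECT count(*) FROM transaction_training) AS training_rows,
    (SELECT count(*) FROM transaction_test) AS test_rows,
    (SELECT count(*)
     FROM transaction_training tr
     JOIN transaction_test te ON te.id = tr.id) AS overlapping_rows;

With the split above, we expect 8 training rows, 2 test rows, and 0 overlapping rows.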
And there you have it, a training dataset and a test dataset with SQL, directly in the database!