Sampling
Learn to sample a subset of a table using SQL.
We'll cover the following...
Extracting a small subset of a table is often called sampling. There are various reasons to use sampling, for example:
- 
Performing estimations on large datasets: When working on large tables, we are sometimes willing to compromise accuracy in favor of speed. By sampling a portion of the table we can produce less accurate results more quickly. 
- 
Producing a training set: When doing data analysis using machine learning models, it is often necessary to train the model on a portion of the data. This portion is known as a training set. The training set can be produced by sampling the table. 
Sampling with LIMIT
A simple way to fetch a random portion of a table is combining random with LIMIT:
WITH sample AS (SELECT *FROM usersORDER BY random()LIMIT 10000)SELECT count(*) FROM sample;count───────10000(1 row)Time: 205.643 ms
To sample 10,000 random rows from the table users we do the following:
- Sort the table in a random order using random().
- Take the first 10,000 rows using LIMIT 10000.
This method of sampling forces the database to sort the entire dataset, and then pick the first N rows. This method is fine for small datasets, but for very large datasets it can be very inefficient and might result in high memory consumption and CPU usage.