Generating Synthetic Data

Learn to generate synthetic data to fit a line using linear regression.

In the exercise in the next lesson, you will walk through the model fitting process on your own. We’ll motivate this process using linear regression, one of the best-known mathematical models, which should be familiar from basic statistics. It’s also called a line of best fit. If you’re not familiar with it, you could consult a basic statistics resource, although the intent here is to illustrate the mechanics of model fitting in scikit-learn, as opposed to understanding the model in detail. We’ll work on that later in the course for other mathematical models that we’ll apply to the case study, such as logistic regression.

Synthetic data generation with NumPy and Matplotlib

In order to have data to work with, you will generate your own synthetic data. Synthetic data is a valuable learning tool for exploring models, illustrating mathematical concepts, and conducting thought experiments to test various ideas. To make synthetic data, we will again use NumPy’s random library to generate random numbers, as well as Matplotlib’s scatter and plot functions to create scatter and line plots. In the exercise, we’ll use scikit-learn for the linear regression part.

To get started, we use NumPy to make a one-dimensional array of feature values, X, consisting of 1,000 random real numbers (in other words, not just integers but decimals as well) between 0 and 10. We again use a seed for the random number generator, so the results are reproducible. Next, we use the uniform method of the generator object returned by default_rng, which draws from the uniform distribution: it’s equally likely to choose any number between low (inclusive) and high (exclusive), and will return an array of whatever size you specify. We create a one-dimensional array (that is, a vector) with 1,000 elements, then examine the first 10. All of this can be done using the following code:

from numpy.random import default_rng
rg = default_rng(12345)
X = rg.uniform(low=0.0, high=10.0, size=(1000,))
X[0:10]

The output should appear as follows:

# array([2.27336022, 3.1675834 , 7.97365457, 6.76254671, 3.91109551, 3.32813928, 5.98308754, 1.86734186, 6.72756044, 9.41802865])
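
As a quick sanity check on the claims above, the following sketch confirms that a fixed seed gives reproducible draws and that every value falls in the half-open interval from low to high. The names rg_a, rg_b, same_draws, and in_range are illustrative and not part of the lesson's code.

```python
import numpy as np
from numpy.random import default_rng

# Two generators built from the same seed produce identical draws.
rg_a = default_rng(12345)
rg_b = default_rng(12345)
X_a = rg_a.uniform(low=0.0, high=10.0, size=(1000,))
X_b = rg_b.uniform(low=0.0, high=10.0, size=(1000,))

same_draws = np.array_equal(X_a, X_b)
# low is inclusive and high is exclusive
in_range = bool((X_a >= 0.0).all() and (X_a < 10.0).all())

print(X_a.shape)    # shape of the one-dimensional feature array
print(same_draws)   # whether the two seeded arrays match
print(in_range)     # whether all values lie in [0, 10)
print(X_a.mean())   # sample mean, expected to be near the midpoint of 5
```

Note that each call to uniform advances the generator's state, so rerunning a cell without recreating the generator will produce different numbers.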

Data for linear regression

Now we need a response variable. For this example, we’ll generate data that follows the assumptions of linear regression: the data will exhibit a linear trend against the feature, but have normally distributed errors:

y = ax + b + N(μ, σ)

Here, a is the slope, b is the intercept, and the Gaussian noise has a mean of μ with a standard deviation of σ. In order to write code to implement this, we need to make a corresponding vector of responses, y, which are calculated as the slope times the feature array, X, plus some Gaussian noise (again using NumPy), and an intercept. The noise will be an array of 1,000 data points with the same shape (size) as the feature array, X, where the mean of the noise (loc) is 0 and the standard deviation (scale) is 1. This will add a little “spread” to our linear data:

slope = 0.25
intercept = -1.25
y = slope * X + rg.normal(loc=0.0, scale=1.0, size=(1000,)) + intercept
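
One way to check that the response behaves as described is to subtract the noise-free line from y and inspect what remains. The sketch below repeats the data generation so it runs on its own; the names noise and residuals are illustrative and not part of the lesson's code.

```python
import numpy as np
from numpy.random import default_rng

rg = default_rng(12345)
X = rg.uniform(low=0.0, high=10.0, size=(1000,))

slope = 0.25
intercept = -1.25
# Keep the noise in its own array so it can be compared later.
noise = rg.normal(loc=0.0, scale=1.0, size=X.shape)
y = slope * X + noise + intercept

# Subtracting the noise-free line recovers the noise term.
residuals = y - (slope * X + intercept)

print(y.shape == X.shape)   # response has the same shape as the feature
print(residuals.mean())     # sample mean of the noise, expected near loc
print(residuals.std())      # sample spread of the noise, expected near scale
```

With 1,000 samples, the sample mean and standard deviation should land close to 0 and 1, though not exactly on them.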

Now we’d like to visualize this data. We will use Matplotlib to plot y against the feature X as a scatter plot. First, we use rcParams to set the resolution (dpi = dots per inch) for a nice crisp image. Then we create the scatter plot with plt.scatter, where X and y are the first two arguments, respectively, and the s argument specifies a size for the dots.

This code can be used for plotting:

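import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.rcParams['figure.dpi'] = 400  # high resolution for a crisp image
plt.scatter(X, y, s=1)  # s sets the marker size
plt.xlabel('X')
plt.ylabel('y')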

After executing these cells, you should see a scatter plot of the 1,000 points, showing a noisy upward linear trend of y against X.
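
To see how the points relate to the line that generated them, the sketch below overlays the noise-free line on the scatter plot using plt.plot. This is the known generating line, not a fitted regression line, and the names x_line and y_line are illustrative and not part of the lesson's code.

```python
import numpy as np
import matplotlib.pyplot as plt
from numpy.random import default_rng

rg = default_rng(12345)
X = rg.uniform(low=0.0, high=10.0, size=(1000,))
slope = 0.25
intercept = -1.25
y = slope * X + rg.normal(loc=0.0, scale=1.0, size=(1000,)) + intercept

# Two endpoints are enough to draw a straight line.
x_line = np.array([0.0, 10.0])
y_line = slope * x_line + intercept

plt.scatter(X, y, s=1)
plt.plot(x_line, y_line, color='red')  # noise-free generating line
plt.xlabel('X')
plt.ylabel('y')
```

The points should scatter roughly symmetrically above and below the red line, since the noise has a mean of 0.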
