Generating Synthetic Data
Learn to generate synthetic data to fit a line using linear regression.
In the exercise in the next lesson, you will walk through the model fitting process on your own. We’ll motivate this process using linear regression, one of the best-known mathematical models, which should be familiar from basic statistics. It’s also called a line of best fit. If you don’t know what it is, you could consult a basic statistics resource, although the intent here is to illustrate the mechanics of model fitting in scikit-learn rather than to understand the model in detail. We’ll go into that level of detail later in the course for other mathematical models that we’ll apply to the case study, such as logistic regression.
Synthetic data generation with NumPy and Matplotlib
To have data to work with, you will generate your own synthetic data. Synthetic data is a valuable learning tool for exploring models, illustrating mathematical concepts, and conducting thought experiments to test various ideas. To make synthetic data, we will again illustrate how to use NumPy’s random library to generate random numbers, as well as Matplotlib’s scatter and plot functions to create scatter and line plots. In the exercise, we’ll use scikit-learn for the linear regression part.
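As a rough illustration of how these pieces fit together, the sketch below draws noisy points around a straight line and shows them with scatter and plot. The slope, intercept, noise level, number of points, and seed are all hypothetical values chosen for this example; they are not taken from the lesson.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Seeded generator so the noise is the same on every run (seed value is arbitrary)
rng = np.random.default_rng(seed=0)

# Hypothetical line parameters, chosen only for illustration
slope, intercept = 0.5, 1.0

# Evenly spaced x values, plus Gaussian noise added to the y values
x = np.linspace(0, 10, 50)
noise = rng.normal(loc=0.0, scale=1.0, size=x.shape)
y = slope * x + intercept + noise

fig, ax = plt.subplots()
ax.scatter(x, y, s=10, label="noisy points")  # scatter plot of the synthetic data
ax.plot(x, slope * x + intercept, color="red", label="underlying line")  # line plot
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```

In a notebook, you would typically drop the `matplotlib.use("Agg")` line and let the figure render inline.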
To get started, we use NumPy to make a one-dimensional array of feature values, X, consisting of 1,000 random real numbers (in other words, not just integers but decimals as well) between 0 and 10. We again use a seed for the random number generator. Next, we use the uniform method of default_rng (random ...
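The array described above could be created along the following lines. This is a minimal sketch: the seed value 12345 is an assumption made for this example, and any fixed integer would make the result reproducible in the same way.

```python
import numpy as np

# Create a seeded random number generator (the seed value here is an assumption)
rng = np.random.default_rng(seed=12345)

# Draw 1,000 real numbers uniformly from the half-open interval [0, 10)
X = rng.uniform(low=0.0, high=10.0, size=(1000,))

print(X.shape)                         # (1000,)
print(X.min() >= 0.0, X.max() < 10.0)  # True True

# Re-creating the generator with the same seed reproduces the same numbers
X_again = np.random.default_rng(seed=12345).uniform(low=0.0, high=10.0, size=(1000,))
print(np.array_equal(X, X_again))      # True
```

Seeding matters here because it lets anyone rerunning the code obtain exactly the same synthetic feature values.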