In Python machine learning, scikit-learn is a widely used library. It ships with several small datasets that are very easy to access, one of which is the breast cancer dataset, loaded with load_breast_cancer.
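This dataset is used with machine learning algorithms to classify cancer scans as benign or malignant.
return_X_y (boolean)
: The default value for this parameter is False. If it is set to True, the function returns a (data, target) tuple instead of a Bunch object.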
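As a minimal sketch of what the return_X_y parameter changes, the snippet below loads the dataset both ways. The variable names bunch, X, and y are chosen here for illustration only.

```python
from sklearn.datasets import load_breast_cancer

# Default (return_X_y=False): a Bunch object with dictionary-like access.
bunch = load_breast_cancer()

# return_X_y=True: a (data, target) tuple of NumPy arrays.
X, y = load_breast_cancer(return_X_y=True)

print(type(bunch).__name__)   # name of the container type returned by default
print(X.shape, y.shape)       # samples x features, and one label per sample
```

The tuple form is convenient when the arrays go straight into an estimator; the Bunch form is more useful when the feature names and description are also needed.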
from sklearn.datasets import load_breast_cancer
This is a binary classification dataset.
It has no missing attribute values or null values.
The class distribution is 212 malignant and 357 benign samples (569 in total).
This is a commonly used dataset. Machine learning papers have also used this dataset to address classification problems.
All the features are numeric.
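The properties listed above can be checked directly. The following is a rough sketch, not part of the original walkthrough; the variable names (classes, n_missing, counts, is_numeric) are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

classes = np.unique(y)                           # distinct labels in the target
n_missing = int(np.isnan(X).sum())               # number of NaN entries in the features
counts = np.bincount(y)                          # samples per class, indexed by label
is_numeric = np.issubdtype(X.dtype, np.number)   # True if the feature array is numeric

print("classes:", classes)
print("missing values:", n_missing)
for name, count in zip(data.target_names, counts):
    print(name, count)
print("numeric features:", is_numeric)
```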
Load the dataset:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
print(data)
print(data.keys())
After we execute the code, it prints the dataset object and its keys. The main keys are the following.
data
: The feature values that help classify a scan as benign or malignant. It can also be called feature data.
target
: The label for each scan, which is the value we want to predict. The scan is marked as benign or malignant by 1 or 0, respectively.
target_names
: The names of the target classes (malignant and benign).
feature_names
: The names of all the features available in this dataset:
radius, texture, perimeter, area, smoothness, compactness, concavity,
concave points, etc.
DESCR
: Data description.
filename
: The CSV file that holds the data.
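The keys described above can be inspected one at a time. The snippet below is an illustrative sketch; the slice sizes are arbitrary, and the exact value of filename may differ between scikit-learn versions.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

print(data.data.shape)                  # feature matrix: samples x features
print(data.target[:5])                  # first few labels (0 = malignant, 1 = benign)
print(data.target_names.tolist())       # class names, indexed by label
print(len(data.feature_names))          # number of feature columns
print(data.feature_names[:3].tolist())  # first few feature names
print(data.DESCR[:200])                 # beginning of the text description
print(data.filename)                    # CSV file the data was read from
```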