The Dataset and Exploratory Data Analysis

Proceed with the Cleveland database and perform exploratory data analysis.

Let's move on to another well-known dataset: the Cleveland heart disease data. The full dataset is part of the UCI Machine Learning Repository and contains four databases: Cleveland, Hungary, Switzerland, and VA Long Beach. It was donated to the public in 1988. The original database contains 76 attributes, but all published experiments by machine learning researchers use a subset of 14 of them.

The dataset

In particular, the Cleveland database is the only one widely used by machine learning researchers. In the original database, the goal field indicates whether heart disease is present in a patient. It’s an integer value from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on distinguishing presence (values 1, 2, 3, and 4) from absence (value 0). The 14 attributes that we’re going to use are described below:

  • age: Years

  • sex: 1 = male and 0 = female

  • cp: Chest pain type (1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, and 4 = asymptomatic)

  • trestbps: Resting blood pressure in mm Hg

  • chol: Serum cholesterol in mg/dl

  • fbs: Fasting blood sugar > 120 mg/dl (1 = true, and 0 = false)

  • restecg: Resting ECG (electrocardiographic) results

  • thalach: Maximum heart rate achieved in beats per minute (bpm)

  • exang: Exercise-induced angina (1 = yes and 0 = no)

  • oldpeak: ST depression induced by exercise relative to rest

  • slope: The slope of the peak exercise ST segment (1 = up-sloping, 2 = flat, and 3 = down-sloping)

  • ca: Number of major vessels (0-3) colored by fluoroscopy

  • thal: 3 = normal, 6 = fixed defect, and 7 = reversible defect

    • This represents β-thalassemia, an inherited hemoglobin disorder resulting in chronic hemolytic anemia that typically requires lifelong transfusion therapy

  • target: The predicted attribute. The original field takes the values 0, 1, 2, 3, and 4. In the processed dataset, it is added as a new column, target, with N for 0 and Y for 1, 2, 3, and 4.
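
The presence/absence encoding described for target can be sketched in pandas as follows. This is a minimal illustration, not necessarily how the processed file was produced. The column name num for the original 0 to 4 field and the helper name binarize_target are assumptions made for this example.

```python
import pandas as pd


def binarize_target(num: pd.Series) -> pd.Series:
    """Map the original 0-4 field to 'N' (value 0) or 'Y' (values 1-4)."""
    # num.gt(0) is True for any value above 0, i.e. disease present.
    return num.gt(0).map({True: "Y", False: "N"})


# One example of each possible value of the original field
# (the name "num" is an assumption for this sketch).
num = pd.Series([0, 1, 2, 3, 4], name="num")
target = binarize_target(num).rename("target")

print(target.tolist())  # ['N', 'Y', 'Y', 'Y', 'Y']
```

Collapsing the four non-zero levels into one class turns the problem into binary classification, which matches how experiments with the Cleveland database are usually framed.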

Now, let's import the required libraries and load the data file into a data frame, df.
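
As a rough sketch of this loading step, assuming pandas and a CSV file with a header row containing the 14 column names listed earlier: the helper load_heart and the two in-memory rows below are hypothetical stand-ins so the example runs on its own. The rows are synthetic values for illustration only, not records from the real dataset.

```python
import io

import pandas as pd

# The 14 column names described in the attribute list.
COLUMNS = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]


def load_heart(source) -> pd.DataFrame:
    """Read a CSV with a header row (path or file-like) and check its columns."""
    df = pd.read_csv(source)
    missing = [c for c in COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return df


# Synthetic stand-in for the data file (illustrative values only).
sample = io.StringIO(
    "age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target\n"
    "50,1,4,130,250,0,0,150,0,1.0,2,0,3,N\n"
    "60,0,2,140,260,1,2,120,1,2.5,1,2,7,Y\n"
)

df = load_heart(sample)
print(df.shape)   # (rows, columns)
print(df.dtypes)  # quick check of the inferred column types
```

With the actual file, the same helper would be called with the file's path in place of the in-memory sample.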
