Predict Cancer Using Machine Learning Models

In this project, we’ll perform multiclass classification to assign a given genetic mutation to one of nine categories. Along the way, we’ll learn to perform graphical data analysis, data preprocessing, feature encoding, hyperparameter tuning, and evaluation of the models trained on the data.

You will learn to:

Preprocess medical textual data for use with ML models.

Perform effective data analysis to check for class imbalances.

Evaluate the importance of individual features for prediction.

Interpret ambiguous evaluation metrics (e.g., the log-loss value).

Contrast one-hot encoding with response encoding for various models.

Understand the criteria for accepting an ML model in medical applications.
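The log-loss value mentioned above is hard to read in isolation because it has no fixed upper bound. One way to anchor it is to compare it against a model that assigns equal probability to all nine classes, which scores ln(9), roughly 2.197. The sketch below illustrates this with plain Python; the function name, the zero-based labels, and the sample data are assumptions made for illustration, not part of the project.

```python
import math

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        # Clip the probability so that log(0) can never occur.
        p = min(max(probs[label], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)

n_classes = 9
y_true = [0, 3, 8, 1, 6]  # hypothetical zero-based class labels

# A model that knows nothing: equal probability for every class.
uniform = [[1.0 / n_classes] * n_classes for _ in y_true]
baseline = multiclass_log_loss(y_true, uniform)
print(round(baseline, 3))  # 2.197, i.e. ln(9)
```

A trained model is only useful if its log-loss falls clearly below this baseline; a value near or above it means the predicted probabilities are no better than guessing.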

Skills

Machine Learning

Data Cleaning

Data Plotting

Data Science

Data Visualisation

Prerequisites

Familiarity with machine learning theory and fundamentals

Familiarity with programming in Python

Familiarity with the NumPy, pandas, Seaborn, Matplotlib, Scikit-learn, and Natural Language Toolkit (NLTK) libraries

Technologies

NumPy

Python

pandas

seaborn

Scikit-learn

Project Description

When a patient shows symptoms of cancer (often a conspicuous tumor in an internal organ), a sample of the tumor is removed and genetically sequenced. A single tumor can carry thousands of genetic mutations. Setting the biological details aside, each of these mutations has a unique ID consisting of two fields: gene and variation.

Using these two fields and the medical text associated with each mutation, we’ll classify genetic mutations into nine categories through multiclass classification. Some mutations are malignant (drivers that lead to tumor growth), and some are benign (passengers). The presence of any malignant mutation in the tumor puts the patient at significant risk of cancer.

In this project, we’ll also clean the textual data, check feature importance, and compare different machine learning (ML) models and feature encodings.
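To make the two encodings concrete, here is a minimal sketch of one-hot encoding next to response encoding for a categorical feature such as gene. It assumes one common formulation of response encoding: each category is replaced by the smoothed class probabilities observed for it in the training data, and unseen categories fall back to a uniform vector. The function names, the smoothing constant `alpha`, and the toy data are hypothetical; the project may define these differently.

```python
from collections import Counter, defaultdict

def one_hot_encode(values, vocabulary):
    """One column per known category; unseen categories become all zeros."""
    index = {v: i for i, v in enumerate(vocabulary)}
    rows = []
    for v in values:
        row = [0] * len(vocabulary)
        if v in index:
            row[index[v]] = 1
        rows.append(row)
    return rows

def fit_response_table(values, labels, n_classes, alpha=1.0):
    """Smoothed P(class | category), estimated from training data only."""
    counts = defaultdict(Counter)
    for v, y in zip(values, labels):
        counts[v][y] += 1
    table = {}
    for v, c in counts.items():
        total = sum(c.values())
        # Laplace smoothing (an assumption) keeps every probability nonzero.
        table[v] = [(c[k] + alpha) / (total + alpha * n_classes)
                    for k in range(n_classes)]
    return table

def response_encode(values, table, n_classes):
    """Map each category to its class-probability vector."""
    uniform = [1.0 / n_classes] * n_classes  # fallback for unseen categories
    return [table.get(v, uniform) for v in values]

# Toy training data; three classes here for brevity (the project has nine).
train_genes = ["BRCA1", "BRCA1", "TP53", "EGFR"]
train_labels = [0, 0, 1, 2]
n_classes = 3

table = fit_response_table(train_genes, train_labels, n_classes)
encoded = response_encode(["BRCA1", "KRAS"], table, n_classes)
one_hot = one_hot_encode(["BRCA1", "KRAS"], sorted(set(train_genes)))
print(encoded)
print(one_hot)
```

One-hot encoding yields one column per distinct category, so the width grows with the vocabulary, while response encoding always yields one column per class. The response table must be fitted on the training split only; fitting it on all the data would leak label information into the test set.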

Project Tasks

1

Getting Started

Task 0: Get Started

Task 1: Import Modules and Libraries

2

Loading Data

Task 2: Load and Explore the Genes and Variations Datasets

Task 3: Loading the Textual Genes Dataset

3

Text Pre-processing

Task 4: Define the Function for Preprocessing

Task 5: Preprocess the Data

Task 6: Merge Datasets, Clean, and Impute Values

4

Train-test Split

Task 7: Perform the Train-Test Split

Task 8: Check the Distributions of the Datasets

5

Measure Performance Using Random Model

Task 9: Define a Function to Plot Performance Matrices

Task 10: Measure Metrics from a Dummy Baseline Model

6

Encode the Features

Task 11: Define the Functions for Response Coding

Task 12: Run the Function on the Gene and Variation Features

Task 13: Count Words in the Text Field

Task 14: Define a Function for Response Coding

Task 15: Run the Function on the Text Field

Task 16: One-Hot Encode the Features

Task 17: Normalize the Text Feature

7

Check Feature Importances

Task 18: Train Single-Feature Models

8

Model Training

Task 19: Stack the Features

Task 20: Train a Logistic Regression Model

Task 21: Train a Random Forest Model
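Tasks 4 and 5 above call for a text-preprocessing function. As a rough illustration of what such a function might do, the sketch below replaces special characters with spaces, collapses whitespace, lowercases the text, and drops stop words. The function name, the tiny stop-word set, and the sample sentence are assumptions; the project lists NLTK as a prerequisite, and its English stop-word list would normally take the place of the hand-written set here.

```python
import re

# Tiny illustrative stop-word set (an assumption, not the project's list).
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "is", "are", "to", "with"}

def preprocess_text(text):
    """Return a lowercased, punctuation-free, stop-word-free version of text."""
    if not isinstance(text, str):
        return ""  # missing values (e.g., NaN) become empty strings
    text = re.sub(r"[^a-zA-Z0-9\n]", " ", text)  # special characters -> space
    text = re.sub(r"\s+", " ", text).lower().strip()  # collapse whitespace
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

sample = "Mutations in the BRCA1 gene (truncating variants) are associated with cancer."
cleaned = preprocess_text(sample)
print(cleaned)  # mutations brca1 gene truncating variants associated cancer
```

Returning an empty string for non-string input keeps the function safe to apply to a column with missing entries, which can then be imputed in a later step.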
