Predict Cancer Using Machine Learning Models

In this project, we’ll perform multiclass classification to assign a given genetic mutation to one of nine categories. Along the way, we’ll learn to perform graphical data analysis, data preprocessing, feature encoding, hyperparameter tuning, and model evaluation.

You will learn to:

Preprocess medical textual data for use with ML models.

Perform effective data analysis to check for class imbalances.

Evaluate the importance of individual features for prediction.

Interpret ambiguous evaluation metrics (e.g., the log-loss value).

Contrast one-hot encoding with response encoding for various models.

Understand the criteria for accepting an ML model in medical applications.
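To make the contrast between one-hot encoding and response encoding concrete, here is a minimal sketch on a made-up toy table. The helper name `response_encode`, the column names, the smoothing constant `alpha`, and the three-class setup are all assumptions for illustration, not the project's actual code. One-hot encoding yields one column per category value, while response encoding yields one column per class, holding a smoothed estimate of P(class | category value) computed from the training data only.

```python
import numpy as np
import pandas as pd

def response_encode(train, column, target, n_classes, alpha=1.0):
    """Map each category value to a Laplace-smoothed vector of P(class | value).

    Hypothetical helper: class labels are assumed to be integers 1..n_classes.
    """
    table = {}
    for value, group in train.groupby(column):
        # Count how often each class occurs for this category value.
        counts = np.bincount(group[target].to_numpy() - 1, minlength=n_classes)
        table[value] = (counts + alpha) / (len(group) + alpha * n_classes)
    return table

# Toy training data (made up); the real project uses nine classes.
train = pd.DataFrame({
    "Gene": ["BRCA1", "BRCA1", "TP53", "TP53", "TP53"],
    "Class": [1, 2, 1, 1, 3],
})
n_classes = 3

# Response encoding: one column per class.
table = response_encode(train, "Gene", "Class", n_classes)

# One-hot encoding: one column per distinct gene.
one_hot = pd.get_dummies(train["Gene"])

# Values never seen in training fall back to a uniform distribution.
fallback = np.full(n_classes, 1.0 / n_classes)
encoded = np.vstack([table.get(g, fallback) for g in ["TP53", "EGFR"]])

print(one_hot.shape)   # (rows, distinct genes)
print(encoded.shape)   # (rows, classes)
```

With many distinct genes and variations, the one-hot matrix grows wide and sparse, which tends to suit linear models, while the response-encoded matrix stays at one column per class, which tends to suit tree-based models.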

Skills

Machine Learning

Data Cleaning

Data Plotting

Data Science

Data Visualisation

Prerequisites

Familiarity with machine learning theory and fundamentals

Familiarity with programming in Python

Familiarity with the NumPy, pandas, Seaborn, Matplotlib, Scikit-learn, and Natural Language Toolkit (NLTK) libraries

Technologies

NumPy

Python

pandas

seaborn

Scikit-learn

Project Description

When a patient shows symptoms of cancer (often a conspicuous tumor in an internal organ), cells from the tumor are removed and genetically sequenced. A single tumor can contain thousands of genetic mutations. Setting aside the biological details, each mutation is identified by two fields: gene and variation.

Based on these two fields and the corresponding medical text, we’ll classify genetic mutations into nine categories through multiclass classification. Some mutations are malignant (drivers, which lead to tumor growth), and some are benign (passengers). The presence of any malignant mutation in the tumor cells puts the patient at significant risk of having cancer.

In this project, we’ll also clean the textual data, check feature importance, and compare different machine learning (ML) models and feature encodings.
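Comparing models by log-loss needs a reference point, because log-loss has no fixed upper bound. The following sketch, which uses synthetic labels rather than the project's data, shows one way to obtain such a reference: score a uniform guess and a random guess over nine classes with Scikit-learn's `log_loss`. The sample size, random seed, and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
n_samples, n_classes = 500, 9
labels = np.arange(1, n_classes + 1)

# Synthetic ground-truth classes in 1..9 (made up, not the project's data).
y_true = rng.integers(1, n_classes + 1, size=n_samples)

# Baseline 1: predict probability 1/9 for every class.
uniform = np.full((n_samples, n_classes), 1.0 / n_classes)

# Baseline 2: random scores, normalized so that each row sums to 1.
random_probs = rng.random((n_samples, n_classes))
random_probs /= random_probs.sum(axis=1, keepdims=True)

uniform_loss = log_loss(y_true, uniform, labels=labels)
random_loss = log_loss(y_true, random_probs, labels=labels)

print(f"uniform guess: {uniform_loss:.3f}")  # equals ln(9)
print(f"random guess:  {random_loss:.3f}")
```

A uniform guess over nine classes scores exactly ln(9), roughly 2.197, so a trained model is only useful if its log-loss falls clearly below that value.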

Project Tasks

Getting Started

Task 0: Get Started

Task 1: Import Modules and Libraries

Loading Data

Task 2: Load and Explore the Genes and Variations Datasets

Task 3: Load the Textual Genes Dataset

Text Pre-processing

Task 4: Define the Function for Preprocessing

Task 5: Preprocess the Data

Task 6: Merge Datasets, Clean, and Impute Values

Train-test Split

Task 7: Perform the Train-Test Split

Task 8: Check the Distributions of the Datasets

Measure Performance Using Random Model

Task 9: Define a Function to Plot Performance Matrices

Task 10: Measure Metrics from a Dummy Baseline Model

Encode the Features

Task 11: Define the Functions for Response Coding

Task 12: Run the Function on the Gene and Variation Features

Task 13: Count Words in the Text Field

Task 14: Define a Function for Response Coding

Task 15: Run the Function on the Text Field

Task 16: One-Hot Encode the Features

Task 17: Normalize the Text Feature

Check Feature Importances

Task 18: Train Single-Feature Models

Model Training

Task 19: Stack the Features

Task 20: Train a Logistic Regression Model

Task 21: Train a Random Forest Model

Congratulations!
