PROJECT
Predict Cancer Using Machine Learning Models
In this project, we’ll perform multiclass classification to assign a given genetic mutation to one of nine categories. In the process, we’ll learn to perform graphical data analysis, data preprocessing, feature encoding, hyperparameter tuning, and model evaluation.
You will learn to:
Preprocess medical textual data for use with ML models.
Perform effective data analysis to check for class imbalances.
Evaluate the importance of individual features for prediction.
Interpret ambiguous evaluation metrics (e.g., the log-loss value).
Contrast one-hot encoding with response encoding for various models.
Understand the criteria for accepting an ML model in medical applications.
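To make the log-loss outcome above more concrete, here is a minimal sketch, not the project's own code, of why a raw log-loss value is hard to read without a reference point. The function name `multiclass_log_loss` and the sample labels are hypothetical. With nine classes, a predictor that assigns equal probability to every class scores ln(9) ≈ 2.197, so a useful model should score below that.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        p = min(max(row[label], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)

n_classes = 9
y_true = [0, 3, 8, 3]  # hypothetical class indices in the range 0-8

# A predictor that spreads probability evenly over all nine classes.
uniform = [[1 / n_classes] * n_classes for _ in y_true]
baseline = multiclass_log_loss(y_true, uniform)
print(round(baseline, 4))  # ln(9), about 2.1972

# A predictor that puts 0.6 on the true class and shares the rest evenly.
confident = []
for label in y_true:
    row = [0.4 / (n_classes - 1)] * n_classes
    row[label] = 0.6
    confident.append(row)
better = multiclass_log_loss(y_true, confident)
print(round(better, 4))  # -ln(0.6), about 0.5108
```

Log loss has no upper bound, which is why comparing against a simple baseline like this is a common way to judge whether a score is good.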
Skills
Machine Learning
Data Cleaning
Data Plotting
Data Science
Data Visualisation
Prerequisites
Familiarity with machine learning theory and fundamentals
Familiarity with programming in Python
Familiarity with NumPy, pandas, Seaborn, Matplotlib, Scikit-learn, and Natural Language Toolkit (NLTK) libraries
Technologies
NumPy
Python
Pandas
seaborn
Scikit-learn
Project Description
When a patient shows symptoms of cancer (often a conspicuous tumor in an internal organ), tumor cells are removed and genetically sequenced. A single tumor can contain thousands of genetic mutations. Setting aside the biological technicalities, each genetic mutation has a unique ID consisting of two fields: gene and variation.
Based on these two fields and some corresponding medical text data, we’ll classify genetic mutations into nine categories through multiclass classification. Some are malignant (drivers that lead to tumor growth), and some are benign (passengers). The presence of any malignant mutation in the tumor cell puts the patient at significant risk of having cancer.
In this project, we’ll also clean the textual data, check feature importance, and compare different machine learning (ML) models and feature encodings.
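The two encodings compared in this project can be illustrated with a short sketch. This is an assumption-laden illustration rather than the project's implementation: the helper names, the toy gene list, the three-class setup, and the additive (Laplace) smoothing with `alpha` are all hypothetical choices. Response coding replaces each category with a vector of class probabilities estimated from the training data, while one-hot encoding gives each category its own binary column.

```python
from collections import Counter, defaultdict

def fit_response_coding(values, labels, n_classes, alpha=1.0):
    """Map each category value to smoothed P(class | value) from training data."""
    counts = defaultdict(Counter)
    for value, label in zip(values, labels):
        counts[value][label] += 1
    table = {}
    for value, per_class in counts.items():
        total = sum(per_class.values())
        # Additive smoothing keeps unseen (value, class) pairs above zero.
        table[value] = [
            (per_class[c] + alpha) / (total + alpha * n_classes)
            for c in range(n_classes)
        ]
    return table

def response_encode(values, table, n_classes):
    """Encode values; categories unseen in training get a uniform vector."""
    uniform = [1 / n_classes] * n_classes
    return [table.get(v, uniform) for v in values]

def one_hot_encode(values, vocabulary):
    """One column per training category; unseen values become all zeros."""
    index = {v: i for i, v in enumerate(vocabulary)}
    rows = []
    for v in values:
        row = [0] * len(vocabulary)
        if v in index:
            row[index[v]] = 1
        rows.append(row)
    return rows

# Hypothetical toy data: gene names and class indices (3 classes for brevity).
train_genes = ["BRCA1", "BRCA1", "TP53", "EGFR", "TP53", "BRCA1"]
train_labels = [0, 0, 1, 2, 1, 2]
n_classes = 3

table = fit_response_coding(train_genes, train_labels, n_classes)
encoded = response_encode(["BRCA1", "KRAS"], table, n_classes)
one_hot = one_hot_encode(["BRCA1", "KRAS"], sorted(set(train_genes)))
```

Response coding always produces as many columns as there are classes, whereas one-hot encoding produces one column per distinct category. Because the response-coding table depends on the labels, it should be fitted on the training split only.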
Project Tasks
1
Getting Started
Task 0: Get Started
Task 1: Import Modules and Libraries
2
Loading Data
Task 2: Load and Explore the Genes and Variations Datasets
Task 3: Load the Textual Genes Dataset
3
Text Pre-processing
Task 4: Define the Function for Preprocessing
Task 5: Preprocess the Data
Task 6: Merge Datasets, Clean, and Impute Values
4
Train-test Split
Task 7: Perform the Train-Test Split
Task 8: Check the Distributions of the Datasets
5
Measure Performance Using Random Model
Task 9: Define a Function to Plot Performance Matrices
Task 10: Measure Metrics from a Dummy Baseline Model
6
Encode the Features
Task 11: Define the Functions for Response Coding
Task 12: Run the Function on the Gene and Variation Features
Task 13: Count Words in the Text Field
Task 14: Define a Function for Response Coding
Task 15: Run the Function on the Text Field
Task 16: One-Hot Encode the Features
Task 17: Normalize the Text Feature
7
Check Feature Importances
Task 18: Train Single-Feature Models
8
Model Training
Task 19: Stack the Features
Task 20: Train a Logistic Regression Model
Task 21: Train a Random Forest Model
Congratulations!
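As a rough illustration of what a text preprocessing function like the one in Tasks 4 and 5 might look like, here is a minimal sketch. It is not the project's solution: the function name, the sample sentence, and the tiny stop-word set are hypothetical, and the stop-word set stands in for the fuller English list available in NLTK.

```python
import re

# Small illustrative stop-word set (an assumption); NLTK provides a fuller list.
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "is", "are", "to", "with"}

def preprocess_text(text):
    """Replace special characters, collapse whitespace, lowercase, drop stop words."""
    if not isinstance(text, str):
        return ""  # missing text (e.g., NaN or None) becomes an empty string
    text = re.sub(r"[^a-zA-Z0-9\n]", " ", text)  # special characters to spaces
    text = re.sub(r"\s+", " ", text).lower()     # collapse whitespace, lowercase
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

# Hypothetical sample sentence, not taken from the dataset.
sample = "The Tumor cell is sequenced; each mutation has a Gene and a Variation (ID #42)."
print(preprocess_text(sample))
# tumor cell sequenced each mutation has gene variation id 42
```

Returning an empty string for missing text is one possible choice here; the outline's imputation step (Task 6) may handle missing values differently.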
Relevant Courses
Use the following content to review prerequisites or explore specific concepts in detail.