In this Cloud Lab, you’ll learn how to deploy a machine learning model with Amazon SageMaker, provide access to it with a Lambda function, and trigger the Lambda function with API Gateway.
Key takeaways:
SageMaker allows us to train machine learning models using Jupyter Notebooks.
It seamlessly integrates with other storage services, including S3, RDS, Redshift, etc.
It allows us to create SageMaker endpoints for real-time inference.
SageMaker pipelines allow us to automate the process of model training, evaluation, and deployment.
SageMaker model monitor allows us to monitor a model’s accuracy. We can generate alarms and trigger pipelines if the accuracy falls below a certain threshold.
Machine learning transforms industries by enabling systems to learn from data and make intelligent decisions. From predictive analytics to AI-powered automation, businesses are leveraging ML to drive efficiency and innovation. However, building, training, and deploying ML models requires large computing and storage resources. The costs can quickly pile up if the model training infrastructure is not managed properly.
Companies often struggle with setting up the environment for their machine learning workflows. The head of data science at NatWest Group, one of the largest banks in the United Kingdom, describes the problem:
“If you want to launch an environment for data science work, it could take 2–4 weeks. On AWS, we can spin up that environment within a few hours. At most, it takes 1 day.”
Greig Cowan
Head of data science for data innovation, NatWest Group
NatWest Group serves almost 19 million customers. However, its legacy systems were slow and inconsistent.
Therefore, the bank decided to accelerate its time to business value using machine learning. In April 2022, it launched an enterprise-wide centralized ML workflow using SageMaker and S3. Switching to AWS allowed its data science team to train up to 30 ML models in the first four months.
In this blog, we’ll unravel SageMaker’s power and learn how it can be used to build and deploy a simple machine learning workflow in minutes. But first, let’s quickly overview machine learning.
Machine learning is about teaching computers to learn from data and make predictions without being explicitly programmed. By identifying patterns and relationships, ML models automate decision-making, powering applications like recommendation systems, fraud detection, and AI assistants. Machine learning can be broadly categorized into supervised, unsupervised, and reinforcement learning.
In supervised learning, we train models on labeled data: the inputs are mapped to known outputs. A classic example is spam detection, where the model learns the patterns of spam and non-spam emails and classifies new messages accordingly.
The target value can be discrete as well as continuous. For example, predicting house prices in an area would have a continuous target value. Similarly, spam detection would have a discrete output. The problems with discrete target values are called classification problems. There are two types of classification problems:
Binary classification: The model predicts one of two possible outcomes in binary classification. For example, an email spam filter classifies emails based on their content as spam or not spam.
Multiclass classification: Multiclass classification is a supervised learning task where a model predicts one of three or more possible categories. For example, a handwriting recognition system classifies images of digits into 0-9 based on their features.
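To make classification concrete, here is a minimal, hedged scikit-learn sketch that trains a binary classifier on synthetic labeled data (the dataset and model choice are illustrative, not part of this lab):

```python
# A minimal binary classification sketch on synthetic labeled data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Generate a toy labeled dataset: 1,000 samples, 2 classes (e.g., spam / not spam).
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a logistic regression model on the labeled training split.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on held-out data the model has never seen.
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```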
Supervised learning, though effective, has one key drawback: it requires labeled data, which is not always available. Unsupervised learning directly addresses this problem.
Unsupervised learning deals with unlabeled data, discovering hidden patterns and structures. A straightforward example is customer segmentation: we typically cannot assign labels to customer segments in advance, but analyzing customers’ buying patterns lets us group them into distinct segments.
Unsupervised machine learning algorithms analyze data to identify patterns ranging from obvious insights to groundbreaking discoveries.
Reinforcement learning is a type of machine learning where an agent learns to make decisions by interacting with an environment. The agent receives rewards or penalties based on its actions, and over time, it optimizes its strategy to maximize cumulative rewards.
SageMaker is designed to streamline the ML workflow by offering everything from data preprocessing to model training, tuning, and deployment—all in a single platform. Simply put, automated ML (AutoML), built-in Jupyter notebooks, and scalable computing make machine learning much faster and more accessible.
What makes SageMaker ideal for machine learning engineers’ workloads is that there is no need to worry about servers: SageMaker handles infrastructure management, scaling, and security so you can focus on ML. It also provides optimized built-in algorithms, reducing the time and effort needed to get models up and running.
Additionally, SageMaker seamlessly integrates with AWS S3, Lambda, Glue, Redshift, and more to build a complete ML pipeline. SageMaker integrates with AWS Step Functions for automation, CloudWatch for monitoring, and IAM for security, making it a powerhouse for AI development.
Let’s quickly discuss the key features of Amazon SageMaker.
Developers often struggle to set up and maintain Jupyter notebooks for machine learning projects. Local environments can become messy, lack version control, and require manual dependency management. Additionally, scaling notebooks to leverage powerful cloud-based GPUs/TPUs is complex and time-consuming.
SageMaker Studio provides a fully integrated development environment (IDE) for ML workflows, with fully managed Jupyter notebooks that automatically handle dependencies and configurations. It allows easy scaling of compute resources (CPUs/GPUs) without restarting the notebook and offers collaborative notebooks, enabling multiple developers to work on the same project seamlessly.
Preparing and cleaning large datasets for machine learning can be tedious, requiring multiple tools like pandas, SQL, or PySpark. Developers often face challenges in handling missing values, transforming data, and visualizing distributions. Writing custom scripts for these tasks increases development time and complexity.
Data Wrangler provides a no-code/low-code interface for data transformation, reducing manual scripting. It automatically detects data quality issues like missing values and outliers and supports automated feature engineering and built-in visualizations to speed up EDA.
Machine learning teams often recompute the same features across different models and projects, leading to inconsistencies, inefficiencies, and versioning issues. Manually managing features across training and inference pipelines increases the risk of data leakage and inconsistencies.
To manage features effectively, SageMaker offers Feature Store, a centralized repository to store, update, and share ML features for consistent training and inference.
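For illustration, here is a hedged sketch of creating and populating a feature group with the SageMaker Python SDK; the feature group name, columns, and S3 path are placeholders:

```python
# Sketch: registering features in SageMaker Feature Store.
import time

import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # works inside a SageMaker notebook

# Example feature table; every record needs an identifier and an event time.
df = pd.DataFrame({
    "record_id": [1, 2],
    "humidity": [0.89, 0.76],
    "temperature_c": [9.4, 12.1],
})
df["event_time"] = float(time.time())

feature_group = FeatureGroup(name="weather-features", sagemaker_session=session)
feature_group.load_feature_definitions(data_frame=df)  # infer types from the DataFrame
feature_group.create(
    s3_uri="s3://<BUCKET>/feature-store/",  # offline store location (placeholder)
    record_identifier_name="record_id",
    event_time_feature_name="event_time",
    role_arn=role,
    enable_online_store=True,  # low-latency reads at inference time
)

# In practice, wait until the feature group status is "Created" before ingesting.
feature_group.ingest(data_frame=df, max_workers=1, wait=True)
```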
Machine learning models often experience data and model drift after deployment: the distribution of incoming data shifts away from the training data, and prediction quality silently degrades.
Model Monitor automates drift detection by continuously tracking input data, feature distributions, and model predictions. It supports custom monitoring metrics and integrates with CloudWatch for real-time notifications.
Machine learning pipelines involve multiple steps, such as data preprocessing, feature engineering, model training, evaluation, and deployment. Manually orchestrating these steps can be time-consuming and difficult to scale.
SageMaker Pipelines orchestrate these workflows and can be triggered by events. Additionally, SageMaker Projects let us integrate SageMaker Pipelines with CodeCommit, CodeBuild, and CodePipeline to build complete CI/CD pipelines for ML workflows.
These are just a few of SageMaker’s features; the full list is much longer.
Amazon SageMaker provides a wide range of built-in and customizable machine learning algorithms, enabling developers to efficiently train and deploy models at scale. Its built-in algorithms, optimized for scalability and performance, include:
| Algorithm Name | Description |
| --- | --- |
| Linear Learner | Efficient for binary and multiclass classification tasks. |
| XGBoost | A powerful gradient-boosting algorithm for regression and classification. |
| K-Means | A clustering algorithm for segmenting unlabeled data. |
| Factorization Machines | Ideal for sparse data, commonly used in recommendation systems. |
| DeepAR | A forecasting algorithm using deep learning for time-series predictions. |
| Image Classification | A deep learning-based model for categorizing images. |
| Object Detection | Detects and localizes multiple objects within an image. |
| BlazingText | An efficient implementation of Word2Vec for natural language processing. |
| Seq2Seq | Used for sequence-to-sequence tasks like machine translation and text summarization. |
| Random Cut Forest (RCF) | Detects anomalies in time series and streaming data. |
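Each built-in algorithm ships as a managed container image that you retrieve by name from the SDK. A brief hedged sketch (the region and the XGBoost version string are assumptions):

```python
# Sketch: retrieving container images for SageMaker built-in algorithms.
from sagemaker import image_uris

region = "us-east-1"  # assumed region

# Linear Learner has a single managed version, so no version argument is needed.
linear_learner_image = image_uris.retrieve("linear-learner", region)

# XGBoost requires an explicit version; "1.7-1" is one published version.
xgboost_image = image_uris.retrieve("xgboost", region, version="1.7-1")

print(linear_learner_image)
print(xgboost_image)
```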
For more flexibility, SageMaker allows the use of custom machine learning models with popular frameworks such as:
| Framework | Description |
| --- | --- |
| TensorFlow | A deep learning framework widely used for neural networks. |
| PyTorch | A flexible deep learning framework preferred for research and production. |
| MXNet | An efficient deep learning library optimized for scalability. |
| Scikit-learn | A popular library for traditional machine learning models. |
| Keras | A high-level API for building deep learning models. |
Machine learning engineers select a model for their use case by comparing the performance of various algorithms. However, this process is computationally expensive and time-consuming.
Amazon SageMaker Autopilot simplifies this process. It is an AutoML (Automated machine learning) feature that automatically explores different machine learning models, selects the best one, and optimizes it for deployment. It allows users to build high-performing models with minimal manual effort while maintaining full visibility into the training process.
SageMaker Autopilot automates the machine learning process, allowing users to:
Automatically preprocess and transform raw data.
Select the best model and hyperparameters.
Deploy models with minimal manual intervention.
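As a rough sketch, launching an Autopilot job with the SageMaker Python SDK might look like the following; the S3 path and target column are assumptions borrowed from the weather example later in this post:

```python
# Sketch: launching an Autopilot (AutoML) job from the SageMaker Python SDK.
import sagemaker
from sagemaker.automl.automl import AutoML

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # works inside a SageMaker notebook

automl = AutoML(
    role=role,
    target_attribute_name="Summary",  # column to predict (assumed)
    max_candidates=10,                # limit the number of explored candidates
    sagemaker_session=session,
)

# Input is a CSV in S3 that includes the target column.
automl.fit(inputs="s3://<BUCKET>/weather/weather.csv", wait=False)
```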
But SageMaker is not limited to these features. The SageMaker management console’s sidebar lists many options, and navigating it can be difficult even when you only want to train a simple classification model for a university assignment.
Suppose we want to train a model to predict the weather given the temperature, humidity, wind speed, and other measurements. The dataset chosen for this model contains weather-related information recorded at different timestamps. It includes:
Formatted date: The timestamp of the record
Summary: A brief description of the weather conditions
Precip type: Type of precipitation (e.g., rain, snow)
Temperature (C) and apparent temperature (C): Actual and perceived temperatures in Celsius
Humidity: Moisture level in the air (0 to 1 scale)
Wind speed (km/h) and wind bearing (degrees): Wind speed and direction
Visibility (km): How far one can see in kilometers
Cloud cover: Fraction of the sky covered by clouds (0-1 scale)
Pressure (millibars): Atmospheric pressure
Daily summary: A summary of the day’s weather conditions
Each row represents an hourly weather observation. However, this is data in the raw format. To train a model using this data, we must first prepare it for a machine learning algorithm.
Quality data is the foundation of any successful machine learning model. Before training, data must be collected, stored, cleaned, and preprocessed to ensure accuracy and efficiency.
Amazon SageMaker supports multiple data sources, including Amazon S3, Amazon RDS, Amazon Redshift, and on-premises databases. Data can be stored in various formats such as CSV, JSON, Parquet, and RecordIO-Protobuf.
For model training, let’s upload our dataset to an S3 bucket: download the dataset to your local storage and then upload it to the bucket (a code sketch follows the list below). SageMaker workflows commonly handle several categories of data:
Structured data: Tabular data from databases and spreadsheets
Unstructured data: Text, images, and audio files used in NLP and computer vision models
Time-series data: Sequential data used in forecasting applications
Streaming data: Real-time data from IoT devices and logs
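Here is a minimal upload sketch with `boto3`; the bucket name, key, and local filename are placeholders:

```python
# Uploading the raw weather dataset to S3.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="weatherHistory.csv",       # local file (assumed name)
    Bucket="<YOUR-BUCKET>",              # placeholder bucket name
    Key="weather/weatherHistory.csv",    # destination key in the bucket
)
```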
Data cleaning and preprocessing involve transforming raw data into a structured, high-quality format suitable for machine learning. Here are some essential techniques:
| Technique | Description |
| --- | --- |
| Handling missing data | Drop rows or columns with too many missing values, or fill missing values with the mean, median, mode, or predictive modeling. |
| Handling outliers | Use the Z-score or IQR (interquartile range) to detect and remove outliers. |
| Encoding categorical variables | One-hot encoding converts categorical variables into binary columns; label encoding assigns a number to each category. |
| Data normalization and standardization | Normalization scales values between 0 and 1, while standardization transforms data so that the mean is 0 and the standard deviation is 1. |
| Tokenization and stopword removal | Split text into words (tokens) and remove unnecessary words (stopwords). |
Data scientists use languages such as R and Python for data preprocessing. SageMaker takes this a step ahead and offers features to simplify this process.
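For instance, a few of these techniques applied to the weather dataset with pandas and scikit-learn might look like the following; the column names are taken from the dataset description above and may differ in your copy:

```python
# A minimal preprocessing sketch for the weather dataset.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("weatherHistory.csv")  # assumed filename

# Handling missing data: fill missing precipitation type with the mode.
df["Precip Type"] = df["Precip Type"].fillna(df["Precip Type"].mode()[0])

# Handling outliers: drop rows where pressure lies beyond 3 standard deviations.
pressure = df["Pressure (millibars)"]
z = (pressure - pressure.mean()) / pressure.std()
df = df[z.abs() < 3]

# Encoding categorical variables: one-hot encode the precipitation type.
df = pd.get_dummies(df, columns=["Precip Type"])

# Standardization: zero mean, unit variance for the numeric features.
numeric_cols = ["Temperature (C)", "Humidity", "Wind Speed (km/h)"]
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])
```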
Amazon SageMaker provides several tools and capabilities to streamline data preparation before training machine learning models.
| Service Name | Description |
| --- | --- |
| SageMaker Data Wrangler | Automates data selection, cleaning, transformation, and visualization, offering over 300 built-in transformations (e.g., handling missing values, normalization, and encoding). |
| SageMaker Processing | Runs large-scale data transformations using Spark, scikit-learn, TensorFlow, and PyTorch; supports distributed processing by integrating with Amazon EMR and AWS Glue. |
| SageMaker Feature Store | Provides a centralized repository for storing, managing, and sharing features across ML models. |
| SageMaker Ground Truth | Assists in data labeling for supervised learning using human and machine-assisted labeling; supports image, text, and video annotation. |
These features help streamline the entire ML workflow, from raw data ingestion to preprocessing and feature engineering, making model training more efficient.
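As one hedged example, a preprocessing script could be run at scale with SageMaker Processing; the script name, S3 paths, and instance settings below are placeholders:

```python
# Sketch: running a preprocessing script with SageMaker Processing.
import sagemaker
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

role = sagemaker.get_execution_role()

processor = SKLearnProcessor(
    framework_version="1.2-1",   # an available scikit-learn container version
    role=role,
    instance_type="ml.m5.large",
    instance_count=1,
)

processor.run(
    code="preprocess.py",        # your preprocessing script (assumed name)
    inputs=[ProcessingInput(
        source="s3://<BUCKET>/weather/",
        destination="/opt/ml/processing/input",
    )],
    outputs=[ProcessingOutput(
        source="/opt/ml/processing/output",
        destination="s3://<BUCKET>/processed/",
    )],
)
```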
We need a Jupyter Notebook to preprocess the data using Python libraries. Follow the steps below to open the Jupyter Notebook on SageMaker:
Search for “SageMaker” in the AWS Management Console and navigate to the SageMaker dashboard.
Under “Applications and IDEs,” click “Notebooks” and select “Create notebook instance.”
Name the notebook instance based on your use case.
Choose the instance type for compute resources, considering performance and cost.
Select “Amazon Linux 2, Jupyter Lab 3” as the platform identifier.
Assign an IAM role for permissions to access other services (e.g., S3).
You have successfully created a Jupyter Notebook instance using SageMaker. Now select the notebook and click the “Actions” button. Select “Open Jupyter” from the actions drop-down menu. This will open the Jupyter Notebook console.
You can use this Jupyter Notebook like a regular notebook: paste your data preprocessing and model training code into cells and execute them. You can also import your existing `.ipynb` notebooks into the environment.
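The training code itself depends on your notebook, but as a hedged sketch, here is how the built-in Linear Learner could be trained on the preprocessed data; this produces the `linear` estimator deployed in the next step. The S3 paths, instance type, and hyperparameters are illustrative assumptions:

```python
# Sketch: training the built-in Linear Learner algorithm.
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = sagemaker.get_execution_role()
region = session.boto_region_name

linear = Estimator(
    image_uri=image_uris.retrieve("linear-learner", region),
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://<BUCKET>/model-artifacts/",  # placeholder bucket
    sagemaker_session=session,
)
linear.set_hyperparameters(
    predictor_type="multiclass_classifier",  # predicting a weather category
    num_classes=3,                           # assumed number of target classes
)

# Preprocessed training data in CSV format (placeholder path; label in first column).
train_input = TrainingInput("s3://<BUCKET>/processed/train.csv", content_type="text/csv")
linear.fit({"train": train_input})
```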
Once you create a SageMaker model, you must set up an endpoint configuration. This configuration defines the instance type and count needed to host your endpoint. You can create it using the Amazon SageMaker console or API, ensuring optimal deployment for your model.
You can create an endpoint configuration with the single line of code below; Amazon SageMaker will host the endpoint on a single `ml.t2.medium` instance.
```python
predictor = linear.deploy(
    initial_instance_count=1,
    instance_type="ml.t2.medium",
)
```
This can also be done via the Amazon SageMaker console or API.
We can create an API endpoint to allow users to invoke the model for inference. The architecture diagram shows the infrastructure required on AWS to create an API endpoint using SageMaker endpoint, Lambda functions, and API Gateway.
A Lambda function integrates directly with API Gateway; the function unpacks the incoming request and invokes the SageMaker endpoint. Create a Lambda function and copy the following Python code into it.
```python
import json

import boto3

runtime_client = boto3.client("sagemaker-runtime")


def lambda_handler(event, context):
    # Forward the request body to the SageMaker endpoint for inference.
    response = runtime_client.invoke_endpoint(
        EndpointName="<ENDPOINT-NAME>",
        ContentType="application/json",
        Body=json.dumps(event["body"]),
    )
    # Decode the model's prediction and return it to API Gateway.
    result = json.loads(response["Body"].read().decode())
    return {
        "statusCode": 200,
        "body": json.dumps(result),
    }
```
This AWS Lambda function invokes a deployed SageMaker endpoint using `boto3`, passing the input data as a JSON request and returning the model’s prediction as the response. It integrates with API Gateway to enable real-time inference from the SageMaker model.
Follow the steps given below to create an API Gateway:
Search for “API” in the AWS Management Console and select “API Gateway.”
Click “Build” in the REST API section and name the API.
Select the endpoint type and click the “Create API” button.
Follow the steps given below to create a `POST` method:
On the “Resources” page, click “Create resource” and name the resource.
Add a method by selecting “Create method” and “POST” as the method type.
Set “Lambda Function” as the integration type and select the ARN of your Lambda function.
Click the “Create method” button to finalize.
Now, you can invoke your trained model for real-time inference using the invoke URL of the API Gateway.
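For example, a quick test with Python’s `requests` library might look like this; the invoke URL is a placeholder, and the payload wraps the features in a `body` key to match the `event["body"]` access in the Lambda function above:

```python
# Testing the deployed model through the API Gateway invoke URL.
import requests

invoke_url = "https://<API-ID>.execute-api.<REGION>.amazonaws.com/<STAGE>/<RESOURCE>"

# Illustrative feature values; the exact shape depends on your trained model.
payload = {"body": {"features": [9.4, 0.89, 14.2]}}

response = requests.post(invoke_url, json=payload)
print(response.status_code, response.text)
```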
The most crucial aspect of any machine learning model is that it should stay current. We produce enormous amounts of data every second, and a model trained once on a static snapshot of data gradually becomes outdated as new data arrives, making it difficult to maintain its accuracy. To solve this issue, SageMaker offers SageMaker Pipelines.
Amazon SageMaker Pipelines is a fully managed CI/CD service for building, automating, and managing machine learning (ML) workflows. It helps streamline data preprocessing, model training, evaluation, and deployment while ensuring scalability and reproducibility.
To ensure that our model is updated whenever we update the dataset in the S3 bucket or push new code to our repository, we can create SageMaker pipelines. These pipelines orchestrate the data preprocessing, training, evaluation, and deployment of models and can be triggered when required. SageMaker pipelines save data scientists and machine learning engineers the hassle of running all the notebook cells to retrain a model.
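As a rough illustration, a two-step pipeline might look like the sketch below; it reuses the hypothetical `processor`, `linear`, and `role` objects from the earlier sketches, and the S3 path is a placeholder:

```python
# A minimal two-step pipeline: preprocess the data, then train the model.
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

process_step = ProcessingStep(
    name="PreprocessWeatherData",
    processor=processor,       # SKLearnProcessor from the earlier sketch
    code="preprocess.py",      # assumed preprocessing script
)

train_step = TrainingStep(
    name="TrainLinearLearner",
    estimator=linear,          # Estimator from the earlier sketch
    inputs={"train": TrainingInput(
        "s3://<BUCKET>/processed/train.csv", content_type="text/csv",
    )},
)

pipeline = Pipeline(name="WeatherModelPipeline", steps=[process_step, train_step])
pipeline.upsert(role_arn=role)  # create or update the pipeline definition
execution = pipeline.start()    # can also be triggered by EventBridge rules
```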
We can trigger the SageMaker pipelines to retrain a machine learning model. However, we must monitor its efficiency to determine the best time to retrain it.
Amazon SageMaker Model Monitor helps track, detect, and mitigate data quality and model drift issues in deployed machine learning models. It monitors predictions in real time and alerts when deviations occur, ensuring model accuracy and compliance over time.
Key features of the SageMaker model monitor include:
Detect data drift: Monitors changes in input data distribution over time.
Ensure model performance: Identifies discrepancies between training and inference data.
Automated alerts: Sends notifications via Amazon CloudWatch when anomalies occur.
Regulatory compliance: Maintains audit logs for governance and explainability.
You can set up SageMaker Model Monitor for your model endpoint through the following simple steps (a minimal code sketch follows the list):
Baseline dataset: Collect training or inference data samples to establish a baseline.
Baseline statistics and constraints: Generate statistics using SageMaker Processing and store constraints in S3.
Create a monitoring schedule: Set up a scheduled job to compare real-time data with the baseline.
Analyze reports and set alerts: Monitor detailed logs in CloudWatch for drift detection.
Take corrective action: Retrain or fine-tune models when deviations are detected.
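Here is a hedged sketch of the baseline and scheduling steps with the SageMaker Python SDK; the bucket paths, schedule name, and endpoint name are placeholders:

```python
# Sketch: baselining and scheduling with SageMaker Model Monitor.
import sagemaker
from sagemaker.model_monitor import CronExpressionGenerator, DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

role = sagemaker.get_execution_role()

monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Steps 1-2: compute baseline statistics and constraints from the training data.
monitor.suggest_baseline(
    baseline_dataset="s3://<BUCKET>/processed/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://<BUCKET>/monitor/baseline/",
)

# Step 3: compare live endpoint traffic against the baseline every hour.
monitor.create_monitoring_schedule(
    monitor_schedule_name="weather-model-monitor",
    endpoint_input="<ENDPOINT-NAME>",
    output_s3_uri="s3://<BUCKET>/monitor/reports/",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)
```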
Thus, we can use Model Monitor to track our model’s accuracy continuously or at regular intervals and trigger the model training pipeline as soon as it falls below a certain threshold.
Amazon SageMaker simplifies the entire machine learning life cycle, from data preparation to model deployment. By leveraging its fully managed infrastructure, built-in algorithms, and seamless integration with AWS services, you can efficiently train, evaluate, and deploy ML models without worrying about the underlying infrastructure.