Hands-on with Amazon SageMaker: Your first ML model in AWS

16 min read
Apr 10, 2025

Key takeaways:

  • SageMaker allows us to train machine learning models using Jupyter Notebooks.

  • It seamlessly integrates with other storage services, including S3, RDS, Redshift, etc.

  • It allows us to create SageMaker endpoints for real-time inference.

  • SageMaker pipelines allow us to automate the process of model training, evaluation, and deployment.

  • SageMaker Model Monitor allows us to monitor a model’s accuracy. We can generate alarms and trigger pipelines if the accuracy falls below a certain threshold.

Machine learning transforms industries by enabling systems to learn from data and make intelligent decisions. From predictive analytics to AI-powered automation, businesses are leveraging ML to drive efficiency and innovation. However, building, training, and deploying ML models requires large computing and storage resources. The costs can quickly pile up if the model training infrastructure is not managed properly.

Companies often struggle with setting up the environment for their machine learning workflows. The head of data science at NatWest Group, one of the largest banks in the United Kingdom, describes the problem:

“If you want to launch an environment for data science work, it could take 2–4 weeks. On AWS, we can spin up that environment within a few hours. At most, it takes 1 day.”

Greig Cowan
Head of data science for data innovation, NatWest Group

NatWest Group has a user base of almost 19 million. However, its legacy systems were slow due to inconsistencies.

Therefore, the bank decided to accelerate its time to business value using machine learning. In April 2022, it launched an enterprise-wide centralized ML workflow using SageMaker and S3. Switching to AWS allowed its data science team to train up to 30 ML models in the first 4 months.

In this blog, we’ll explore what SageMaker offers and learn how it can be used to build and deploy a simple machine learning workflow in minutes. But first, let’s take a quick look at machine learning.

Overview of machine learning#

Machine learning is about teaching computers to learn from data and make predictions without being explicitly programmed. By identifying patterns and relationships, ML models automate decision-making, powering applications like recommendation systems, fraud detection, and AI assistants. Machine learning can be broadly categorized into supervised, unsupervised, and reinforcement learning.

Supervised learning#

In supervised learning, we train models using labeled data, in which the inputs are mapped to known outputs. An example is spam detection, where the model learns the patterns of spam and non-spam emails and classifies new emails accordingly.

The target value can be discrete or continuous. For example, predicting house prices in an area has a continuous target value, whereas spam detection has a discrete one. Problems with continuous target values are called regression problems, and problems with discrete target values are called classification problems. There are two types of classification problems:

  • Binary classification: The model predicts one of two possible outcomes in binary classification. For example, an email spam filter classifies emails based on their content as spam or not spam.

  • Multiclass classification: Multiclass classification is a supervised learning task where a model predicts one of three or more possible categories. For example, a handwriting recognition system classifies images of digits into 0-9 based on their features.

Supervised learning, though effective, has one simple drawback. It requires labeled data, which might not be available at all times. To directly address this problem, we have unsupervised learning.

Unsupervised learning#

Unsupervised learning deals with unlabeled data, discovering hidden patterns and structures. A straightforward example of unsupervised learning is customer segmentation. We typically cannot assign labels to customer segments in advance. However, analyzing customers’ buying patterns lets us group them into distinct segments.

Unsupervised machine learning algorithms analyze data to identify patterns ranging from obvious insights to groundbreaking discoveries.
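As a rough illustration of how an algorithm can find segments without labels, here is a minimal one-dimensional k-means sketch in plain Python. The spend figures, the starting centroids, and the choice of two clusters are made-up assumptions for illustration, not data from this post.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Cluster 1-D values around the given starting centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical monthly spend per customer (no labels attached).
monthly_spend = [12, 15, 14, 10, 95, 102, 99, 110]
centroids, segments = kmeans_1d(monthly_spend, centroids=[0, 50])
print(centroids)  # [12.75, 101.5]
print(segments)   # [[12, 15, 14, 10], [95, 102, 99, 110]]
```

The algorithm never sees a label such as "low spender" or "high spender"; the two groups emerge from the distances between values alone.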

Reinforcement learning#

Reinforcement learning is a type of machine learning where an agent learns to make decisions by interacting with an environment. The agent receives rewards or penalties based on its actions, and over time, it optimizes its strategy to maximize cumulative rewards.

Amazon SageMaker: The battleground for AI innovation#

SageMaker is designed to streamline the ML workflow by offering everything from data preprocessing to model training, tuning, and deployment—all in a single platform. Simply put, automated ML (AutoML), built-in Jupyter notebooks, and scalable computing make machine learning much faster and more accessible.

What makes SageMaker a strong fit for machine learning engineers’ workloads is that there is no need to worry about servers: SageMaker handles infrastructure management, scaling, and security so you can focus on ML. It also provides optimized built-in algorithms, reducing the time and effort needed to get models up and running.

Additionally, SageMaker integrates with Amazon S3, AWS Lambda, AWS Glue, Amazon Redshift, and more to build a complete ML pipeline. It also works with AWS Step Functions for automation, Amazon CloudWatch for monitoring, and IAM for security, making it a powerhouse for AI development.

Deploying a Machine Learning Model with Amazon SageMaker

In this Cloud Lab, you’ll learn how to deploy a machine learning model with Amazon SageMaker, provide access to it with a Lambda function, and trigger the Lambda function with API Gateway.

Key features of SageMaker#

Let’s quickly discuss the key features of Amazon SageMaker.

SageMaker Studio#

Developers often struggle to set up and maintain Jupyter notebooks for machine learning projects. Local environments can become messy, lack version control, and require manual dependency management. Additionally, scaling notebooks to leverage powerful cloud-based GPUs/TPUs is complex and time-consuming.

SageMaker Studio provides a fully integrated development environment (IDE) for ML workflows. It provides fully managed Jupyter Notebooks that automatically handle dependencies and configurations. It also allows easy scaling of compute resources (CPUs/GPUs) without restarting the notebook. Furthermore, it offers collaborative notebooks, enabling multiple developers to work on the same project seamlessly.

Data Wrangler#

Preparing and cleaning large datasets for machine learning can be tedious, requiring multiple tools like pandas, SQL, or PySpark. Developers often face challenges in handling missing values, transforming data, and visualizing distributions. Writing custom scripts for these tasks increases development time and complexity.

Data Wrangler provides a no-code/low-code interface for data transformation, reducing manual scripting. It automatically detects data quality issues such as missing values and outliers, and it supports automated feature engineering and built-in visualizations to speed up exploratory data analysis (EDA).

Feature Store#

Machine learning teams often recompute the same features across different models and projects, leading to inconsistencies, inefficiencies, and versioning issues. Manually managing features across training and inference pipelines increases the risk of data leakage and inconsistencies.

To manage features effectively, SageMaker offers Feature Store, a centralized repository to store, update, and share ML features for consistent training and inference.

Model monitoring#

Machine learning models often experience data drift, concept drift, or performance degradation over time due to changing data distributions. Data drift occurs when the statistical properties of the input data change over time, making the model’s assumptions about the data no longer valid. Concept drift happens when the relationship between input features and the target variable changes over time, meaning the underlying rules that the model learned no longer hold. Without continuous monitoring, businesses risk deploying models that make inaccurate predictions.

Model Monitor watches deployed models for data drift, bias, and prediction quality. It automates drift detection by continuously tracking input data, feature distributions, and model predictions. It supports custom monitoring metrics and integrates with CloudWatch for real-time notifications.
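To make the idea of data drift concrete, here is a deliberately simplified sketch that compares the mean of recent inputs against a training-time baseline. This is not how Model Monitor computes drift internally; the humidity readings and the threshold of two standard deviations are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def mean_shift(baseline, recent):
    """Return how many baseline standard deviations the recent mean has moved."""
    return abs(mean(recent) - mean(baseline)) / stdev(baseline)


def has_drifted(baseline, recent, threshold=2.0):
    """Flag drift when the shift exceeds the (assumed) threshold."""
    return mean_shift(baseline, recent) > threshold


# Hypothetical humidity readings: training-time baseline vs. two live samples.
baseline_humidity = [0.50, 0.55, 0.60, 0.65, 0.70]
stable_sample = [0.58, 0.61, 0.63]
shifted_sample = [0.90, 0.92, 0.95]

print(has_drifted(baseline_humidity, stable_sample))   # False
print(has_drifted(baseline_humidity, shifted_sample))  # True
```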

SageMaker pipelines and projects#

Machine learning pipelines involve multiple steps, such as data preprocessing, feature engineering, model training, evaluation, and deployment. Manually orchestrating these steps can be time-consuming and difficult to scale.

SageMaker Pipelines orchestrates these workflows, which can be triggered by events. Additionally, SageMaker Projects allows us to integrate SageMaker Pipelines with CodeCommit, CodeBuild, and CodePipeline to build complete CI/CD pipelines for ML workflows.

These are just a few features to mention; the list is non-exhaustive.

Popular machine learning algorithms in SageMaker#

Amazon SageMaker provides a wide range of built-in and customizable machine learning algorithms, enabling developers to efficiently train and deploy models at scale.

Prebuilt algorithms#

Amazon SageMaker provides various built-in machine learning algorithms optimized for scalability and performance. These include:

| Algorithm Name | Description |
| --- | --- |
| Linear Learner | Efficient for regression and for binary and multiclass classification tasks. |
| XGBoost | A powerful gradient-boosting algorithm for regression and classification. |
| K-Means | A clustering algorithm for segmenting unlabeled data. |
| Factorization Machines | Ideal for sparse data, commonly used in recommendation systems. |
| DeepAR | A forecasting algorithm using deep learning for time-series predictions. |
| Image Classification | A deep learning-based model for categorizing images. |
| Object Detection | Detects and localizes multiple objects within an image. |
| BlazingText | An efficient implementation of Word2Vec for natural language processing. |
| Seq2Seq | Used for sequence-to-sequence tasks like machine translation and text summarization. |
| Random Cut Forest (RCF) | Detects anomalies in time series and streaming data. |

Custom algorithms#

For more flexibility, SageMaker allows the use of custom machine learning models with popular frameworks such as:

| Framework | Description |
| --- | --- |
| TensorFlow | A deep learning framework widely used for neural networks. |
| PyTorch | A flexible deep learning framework preferred for research and production. |
| MXNet | An efficient deep learning library optimized for scalability. |
| Scikit-learn | A popular library for traditional machine learning models. |
| Keras | A high-level API for building deep learning models. |

AutoML capabilities with SageMaker Autopilot#

Machine learning engineers usually select a model for their use case by comparing the performance of several candidate algorithms. However, this process is computationally expensive and time-consuming.

Amazon SageMaker Autopilot simplifies this process. It is an AutoML (automated machine learning) feature that automatically explores different machine learning models, selects the best one, and optimizes it for deployment. It allows users to build high-performing models with minimal manual effort while maintaining full visibility into the training process.

SageMaker Autopilot automates the machine learning process, allowing users to:

  • Automatically preprocess and transform raw data.

  • Select the best model and hyperparameters.

  • Deploy models with minimal manual intervention.

But SageMaker is not just limited to these features. The SageMaker management console’s sidebar menu lists several options. It can be difficult to navigate the management console even if you want to train a simple classification model for a university assignment.

Deploy your first model using SageMaker#

Suppose we want to train a model to predict the weather given the temperature, humidity, wind speed, etc. The dataset we use to train this model contains weather-related information recorded at different timestamps. It includes:

  • Formatted date: The timestamp of the record

  • Summary: A brief description of the weather conditions

  • Precip type: Type of precipitation (e.g., rain, snow)

  • Temperature (C) and apparent temperature (C): Actual and perceived temperatures in Celsius

  • Humidity: Moisture level in the air (0 to 1 scale)

  • Wind speed (km/h) and wind bearing (degrees): Wind speed and direction

  • Visibility (km): How far one can see in kilometers

  • Cloud cover: Fraction of the sky covered by clouds (0-1 scale)

  • Pressure (millibars): Atmospheric pressure

  • Daily summary: A summary of the day’s weather conditions

Each row represents an hourly weather observation. However, this is data in the raw format. To train a model using this data, we must first prepare it for a machine learning algorithm.

Step 1: Prepare the data#

Quality data is the foundation of any successful machine learning model. Before training, data must be collected, stored, cleaned, and preprocessed to ensure accuracy and efficiency.

Data sources and formats#

Amazon SageMaker supports multiple data sources, including Amazon S3, Amazon RDS, Amazon Redshift, and on-premises databases. Data can be stored in various formats such as CSV, JSON, Parquet, and RecordIO-Protobuf.

For model training, let’s upload our dataset to an S3 bucket. Download the dataset to your local storage and upload it to the S3 bucket.
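If you prefer to script the upload rather than use the console, a minimal sketch with boto3 could look like the following. The file name, bucket name, and key are placeholders, and the upload itself only succeeds with valid AWS credentials and an existing bucket.

```python
def s3_uri(bucket, key):
    """Build the s3:// URI that later steps use to locate the dataset."""
    return f"s3://{bucket}/{key}"


def upload_dataset(local_path, bucket, key):
    """Upload a local file to S3 and return its URI (needs AWS credentials)."""
    import boto3  # imported here so s3_uri() works even without boto3 installed

    boto3.client("s3").upload_file(local_path, bucket, key)
    return s3_uri(bucket, key)


# Example call (placeholder names; requires AWS credentials):
# upload_dataset("weather.csv", "<BUCKET-NAME>", "data/weather.csv")
print(s3_uri("<BUCKET-NAME>", "data/weather.csv"))  # s3://<BUCKET-NAME>/data/weather.csv
```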

Types of data you can use in SageMaker#

  1. Structured data: Tabular data from databases and spreadsheets

  2. Unstructured data: Text, images, and audio files used in NLP and computer vision models

  3. Time-series data: Sequential data used in forecasting applications

  4. Streaming data: Real-time data from IoT devices and logs

Techniques for data cleaning and preprocessing#

Data cleaning and preprocessing involve transforming raw data into a structured, high-quality format suitable for machine learning. Here are some essential techniques:

| Technique | Description |
| --- | --- |
| Handling Missing Data | We drop rows or columns with too many missing values. Another technique is to fill missing values with the mean, median, mode, or a value from predictive modeling. |
| Handling Outliers | We use the Z-score or IQR (interquartile range) to detect and remove outliers. |
| Encoding Categorical Variables | One-hot encoding converts categorical variables into binary columns, while label encoding assigns a number to each category. |
| Data Normalization and Standardization | Normalization scales the values between 0 and 1. Standardization transforms the data so that the mean is 0 and the standard deviation is 1. |
| Tokenization and Stopword Removal | It includes splitting text into words (tokens) and removing unnecessary words (stopwords). |
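The numeric and categorical techniques listed above can be sketched with Python’s standard library alone. The sample values are hypothetical and only loosely modeled on the weather columns; in practice you would more likely reach for pandas or scikit-learn.

```python
from statistics import mean, pstdev, quantiles


def fill_missing(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def iqr_outliers(values):
    """Return values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return [v for v in values if v < low or v > high]


def one_hot(labels):
    """Encode each label as a 0/1 vector over the sorted categories."""
    categories = sorted(set(labels))
    return [[int(label == c) for c in categories] for label in labels]


def normalize(values):
    """Min-max scale values into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift and scale values to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


print(fill_missing([10.0, None, 14.0]))           # [10.0, 12.0, 14.0]
print(iqr_outliers([9, 10, 11, 12, 13, 14, 90]))  # [90]
print(one_hot(["rain", "snow", "rain"]))          # [[1, 0], [0, 1], [1, 0]]
print(normalize([0.0, 5.0, 10.0]))                # [0.0, 0.5, 1.0]
```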

Data scientists use languages such as R and Python for data preprocessing. SageMaker goes a step further and offers features to simplify this process.

Amazon SageMaker features for data preparation#

Amazon SageMaker provides several tools and capabilities to streamline data preparation before training machine learning models.

| Service Name | Description |
| --- | --- |
| SageMaker Data Wrangler | It automates data selection, cleaning, transformation, and visualization and offers over 300 built-in transformations (e.g., handling missing values, normalization, and encoding). |
| SageMaker Processing | It runs large-scale data transformations using Spark, scikit-learn, TensorFlow, and PyTorch. It also supports distributed data processing by integrating Amazon EMR and AWS Glue. |
| SageMaker Feature Store | It provides a centralized repository for storing, managing, and sharing features across ML models. |
| SageMaker Ground Truth | It assists in data labeling for supervised learning using human and machine-assisted labeling. It supports image, text, and video annotation. |

These features help streamline the entire ML workflow, from raw data ingestion to preprocessing and feature engineering, making model training more efficient.

Step 2: Train the model#

We need a Jupyter Notebook to preprocess the data using Python libraries. Follow the steps below to open the Jupyter Notebook on SageMaker:

  • Search for “SageMaker” in the AWS Management Console and navigate to the SageMaker dashboard.

  • Under “Applications and IDEs,” click “Notebooks” and select “Create notebook instance.”

  • Name the notebook instance based on your use case.

  • Choose the instance type for compute resources, considering performance and cost.

  • Select “Amazon Linux 2, Jupyter Lab 3” as the platform identifier.

  • Assign an IAM role for permissions to access other services (e.g., S3).

You have successfully created a Jupyter Notebook instance using SageMaker. Now select the notebook and click the “Actions” button. Select “Open Jupyter” from the actions drop-down menu. This will open the Jupyter Notebook console.

Open Jupyter Notebook

You can use this Jupyter Notebook like a regular notebook. Paste your data preprocessing and model training code in cells and execute them. You can also import your existing .ipynb notebooks into the environment.
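The deployment step that follows calls deploy() on an object named linear, but the post does not show how that object is created. One plausible way to produce it, assuming the built-in Linear Learner algorithm, a continuous target such as temperature, and version 2 of the SageMaker Python SDK, is sketched below. The instance type and the sample rows are assumptions, and the training function only runs inside an AWS environment with a suitable IAM role.

```python
def split_rows(rows, train_fraction=0.8):
    """Split rows into train and test partitions, preserving order."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def train_linear_learner(features, labels, instance_type="ml.m5.large"):
    """Train a Linear Learner regressor; runs only inside AWS with a valid role."""
    import numpy as np
    import sagemaker

    role = sagemaker.get_execution_role()  # IAM role attached to the notebook
    linear = sagemaker.LinearLearner(
        role=role,
        instance_count=1,
        instance_type=instance_type,  # assumed training instance type
        predictor_type="regressor",   # assumes a continuous target
    )
    # Linear Learner expects float32 arrays wrapped in a RecordSet.
    records = linear.record_set(
        np.asarray(features, dtype="float32"),
        labels=np.asarray(labels, dtype="float32"),
    )
    linear.fit(records)
    return linear


# Hypothetical rows of (humidity, wind speed) used only to show the split.
rows = [(0.89, 14.1), (0.86, 14.3), (0.83, 3.9), (0.80, 11.2), (0.75, 9.7)]
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))  # 4 1
```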

Step 3: Create an endpoint configuration#

Once you create a SageMaker model, you must set up an endpoint configuration. This configuration defines the instance type and count needed to host your endpoint.

With the SageMaker Python SDK, a single call to deploy() on the trained estimator creates the model, the endpoint configuration, and the endpoint. Amazon SageMaker will host the endpoint on a single ml.t2.medium instance.

# linear is the estimator trained in the previous step.
predictor = linear.deploy(
    initial_instance_count=1,
    instance_type='ml.t2.medium'
)

This can also be done via the Amazon SageMaker console or API.
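For the API route, a sketch using boto3 might look like the following. It assumes a SageMaker model has already been registered under the name you pass in; the configuration, model, and variant names are placeholders chosen for illustration.

```python
def endpoint_config_request(config_name, model_name,
                            instance_type="ml.t2.medium", instance_count=1):
    """Build the arguments for SageMaker's CreateEndpointConfig API."""
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",  # placeholder variant name
                "ModelName": model_name,
                "InstanceType": instance_type,
                "InitialInstanceCount": instance_count,
            }
        ],
    }


def create_endpoint(endpoint_name, config_name, model_name):
    """Create the configuration, then the endpoint (needs AWS credentials)."""
    import boto3

    client = boto3.client("sagemaker")
    client.create_endpoint_config(**endpoint_config_request(config_name, model_name))
    client.create_endpoint(EndpointName=endpoint_name, EndpointConfigName=config_name)


request = endpoint_config_request("weather-config", "weather-model")
print(request["ProductionVariants"][0]["InstanceType"])  # ml.t2.medium
```

Keeping the request in its own function makes the instance type and count easy to review before anything is provisioned.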

Step 4: Create an API endpoint to invoke the model#

We can create an API endpoint to allow users to invoke the model for inference. The architecture diagram shows the infrastructure required on AWS to create an API endpoint using SageMaker endpoint, Lambda functions, and API Gateway.

Architecture diagram

Create a Lambda function#

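API Gateway integrates directly with a Lambda function, which unpacks the request and invokes the SageMaker endpoint. Create a Lambda function and copy the following Python code into it.

import json
import boto3

# Create the client once so warm invocations can reuse it.
runtime_client = boto3.client("sagemaker-runtime")

def lambda_handler(event, context):
    # With Lambda proxy integration, API Gateway delivers the body as a JSON string;
    # otherwise it may already be a parsed object that still needs serializing.
    body = event["body"]
    payload = body if isinstance(body, str) else json.dumps(body)

    response = runtime_client.invoke_endpoint(
        EndpointName="<ENDPOINT-NAME>",
        ContentType="application/json",
        Body=payload
    )
    # The endpoint returns a streaming body; read and decode it before parsing.
    result = json.loads(response["Body"].read().decode())
    return {
        "statusCode": 200,
        "body": json.dumps(result)
    }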

This AWS Lambda function invokes a deployed SageMaker endpoint using boto3, passing input data as a JSON request and returning the model’s prediction as a response. It integrates with API Gateway to enable real-time inference from the SageMaker model.
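Before wiring the function to API Gateway, it can help to exercise the handler logic locally. The sketch below is a variation that accepts the runtime client as a parameter and forwards a string body as-is, so a stub can replace the real SageMaker client. The stub class, the payload, and the response shape are all hypothetical.

```python
import json
from io import BytesIO


class FakeRuntimeClient:
    """Stands in for boto3's sagemaker-runtime client during local tests."""

    def invoke_endpoint(self, EndpointName, ContentType, Body):
        self.last_body = Body
        # Hypothetical response shape; a real model defines its own.
        return {"Body": BytesIO(json.dumps({"predictions": [{"score": 9.5}]}).encode())}


def handle(event, runtime_client, endpoint_name="<ENDPOINT-NAME>"):
    """Invoke the endpoint with the request body and wrap the result."""
    body = event["body"]
    # Forward a JSON string unchanged; serialize anything else.
    payload = body if isinstance(body, str) else json.dumps(body)
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name, ContentType="application/json", Body=payload
    )
    result = json.loads(response["Body"].read().decode())
    return {"statusCode": 200, "body": json.dumps(result)}


# API Gateway proxy integration delivers the request body as a JSON string.
fake = FakeRuntimeClient()
event = {"body": json.dumps({"instances": [{"features": [0.89, 14.1]}]})}
reply = handle(event, fake)
print(reply["statusCode"], reply["body"])
```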

Create an API Gateway#

Follow the steps given below to create an API Gateway:

  • Search for “API” in the AWS Management Console and select “API Gateway.”

  • Click “Build” in the REST API section and name the API.

  • Select the endpoint type and click the “Create API” button.

Create the POST method#

Follow the steps given below to create a POST method:

  • On the “Resources” page, click “Create resource” and name the resource.

  • Add a method by selecting “Create method” and “POST” as the method type.

  • Set “Lambda Function” as the integration type and select the ARN of your Lambda function.

  • Click the “Create method” button to finalize.

Methods page

Now, you can invoke your trained model for real-time inference using the invoke URL of the API Gateway.
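A client can call the invoke URL with nothing more than Python’s standard library. In the sketch below, the URL, stage, resource path, and payload shape are placeholders; substitute the values from your own API and the input format your model expects.

```python
import json
import urllib.request


def build_request(invoke_url, payload):
    """Prepare a JSON POST request for the API Gateway invoke URL."""
    return urllib.request.Request(
        invoke_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(invoke_url, payload):
    """Send the request and decode the JSON reply (needs a deployed API)."""
    with urllib.request.urlopen(build_request(invoke_url, payload)) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder URL and payload, shown only to illustrate the request shape.
request = build_request(
    "https://example.execute-api.us-east-1.amazonaws.com/prod/predict",
    {"instances": [{"features": [0.89, 14.1]}]},
)
print(request.get_method(), request.full_url)
```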

SageMaker pipelines#

One of the most crucial requirements for any machine learning model is that it stays current. We produce huge volumes of data every second, so a model can already be outdated by the time its training epochs are completed. This makes it difficult to maintain the model’s accuracy. To solve this issue, SageMaker offers SageMaker Pipelines.

Amazon SageMaker Pipelines is a fully managed CI/CD service for building, automating, and managing machine learning (ML) workflows. It helps streamline data preprocessing, model training, evaluation, and deployment while ensuring scalability and reproducibility.

To ensure that our model is updated whenever we update the dataset in the S3 bucket or push new code to our repository, we can create SageMaker pipelines. These pipelines orchestrate the data preprocessing, training, evaluation, and deployment of models and can be triggered when required. SageMaker pipelines save data scientists and machine learning engineers the hassle of running all the notebook cells to retrain a model.
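One way to trigger retraining when a new dataset lands in S3 is a small Lambda function that starts a pipeline execution. The sketch below assumes the pipeline already exists and defines a parameter called InputDataUri; both the pipeline name and that parameter name are hypothetical. The recording client is only there so the logic can be checked without AWS.

```python
from urllib.parse import unquote_plus


def start_retraining(event, sagemaker_client, pipeline_name="<PIPELINE-NAME>"):
    """Start a pipeline execution for the object named in an S3 event."""
    record = event["Records"][0]["s3"]
    key = unquote_plus(record["object"]["key"])  # keys arrive URL-encoded
    data_uri = f"s3://{record['bucket']['name']}/{key}"
    return sagemaker_client.start_pipeline_execution(
        PipelineName=pipeline_name,
        # "InputDataUri" is a hypothetical parameter the pipeline must define.
        PipelineParameters=[{"Name": "InputDataUri", "Value": data_uri}],
    )


def lambda_handler(event, context):
    """Lambda entry point wired to the bucket's object-created notifications."""
    import boto3

    response = start_retraining(event, boto3.client("sagemaker"))
    return response["PipelineExecutionArn"]


class RecordingClient:
    """Captures the call so the handler logic can be checked without AWS."""

    def start_pipeline_execution(self, **kwargs):
        self.kwargs = kwargs
        return {"PipelineExecutionArn": "arn:placeholder"}


sample_event = {"Records": [{"s3": {"bucket": {"name": "my-bucket"},
                                    "object": {"key": "data/weather.csv"}}}]}
client = RecordingClient()
start_retraining(sample_event, client)
print(client.kwargs["PipelineParameters"])
```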

Monitor a model using SageMaker Model Monitor#

We can trigger SageMaker Pipelines to retrain a machine learning model. However, we must monitor the model’s performance to determine the best time to retrain it.

Amazon SageMaker Model Monitor helps track, detect, and mitigate data quality and model drift issues in deployed machine learning models. It monitors predictions in real time and alerts when deviations occur, ensuring model accuracy and compliance over time.

Key features of SageMaker Model Monitor include:

  • Detect data drift: Monitors changes in input data distribution over time.

  • Ensure model performance: Identifies discrepancies between training and inference data.

  • Automated alerts: Sends notifications via Amazon CloudWatch when anomalies occur.

  • Regulatory compliance: Maintains audit logs for governance and explainability.

You can set up SageMaker Model Monitor for your model endpoint through the following steps:

  • Baseline dataset: Collect training or inference data samples to establish a baseline.

  • Baseline statistics and constraints: Generate statistics using SageMaker Processing and store constraints in S3.

  • Create a monitoring schedule: Set up a scheduled job to compare real-time data with the baseline.

  • Analyze reports and set alerts: Monitor detailed logs in CloudWatch for drift detection.

  • Take corrective action: Retrain or fine-tune models when deviations are detected.

Thus, we can use Model Monitor to check our model’s accuracy, either continuously or at regular intervals, and trigger the model training pipeline as soon as the accuracy falls below a certain threshold.
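The threshold check itself is simple to express. The sketch below is plain Python rather than the Model Monitor API: it compares predictions against observed labels and reports whether accuracy has fallen below an assumed threshold of 0.8. The labels are made up for illustration.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the ground-truth labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def should_retrain(predictions, labels, threshold=0.8):
    """Return True when accuracy drops below the (assumed) threshold."""
    return accuracy(predictions, labels) < threshold


# Hypothetical predicted vs. observed precipitation types.
predicted = ["rain", "rain", "snow", "rain", "snow"]
observed = ["rain", "snow", "snow", "rain", "rain"]

print(accuracy(predicted, observed))        # 0.6
print(should_retrain(predicted, observed))  # True
```

In a real setup, a True result would be the point at which an alarm fires or a retraining pipeline is started.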

Conclusion#

Amazon SageMaker simplifies the entire machine learning life cycle, from data preparation to model deployment. By leveraging its fully managed infrastructure, built-in algorithms, and seamless integration with AWS services, you can efficiently train, evaluate, and deploy ML models without worrying about the underlying infrastructure.

FAQs#

Frequently Asked Questions

Is SageMaker the same as Jupyter?

SageMaker is not the same as Jupyter; however, it enables us to create Jupyter Notebooks. We can also work in JupyterLab using SageMaker Studio. This is just one aspect of SageMaker, which offers many other services to simplify machine learning processes.

Does SageMaker use EC2?

Does SageMaker use S3?

What makes Bedrock different from SageMaker?

Can I use SageMaker for free?


Written By:
Zainab Mohsin