How to sanitize user input in Python

Sanitizing user input is a critical step in software development. Because input can come from anyone, anywhere in the world, we must treat it as untrusted. It's the developer's responsibility to ensure that malicious input cannot compromise the program so that the service can function properly.

We must take the following two precautions to ensure that the input is valid and secure for our system:

  • Input validation: Ensuring that the input is well-formed and in the expected structure.

  • Input sanitization: Cleaning or transforming the input (for example, by removing or escaping unsafe characters) so that it is safe to use in the system's workflow.
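
As a rough illustration of the difference, the sketch below validates a username against an expected shape and, separately, sanitizes a free-text comment before it is placed in HTML. The function names and the 3 to 20 character limit are assumptions made for this example, not rules from any particular framework.

```python
import html


def is_valid_username(username):
    # Validation: accept or reject the input; never modify it.
    # Hypothetical rule: 3 to 20 ASCII letters or digits.
    return username.isascii() and username.isalnum() and 3 <= len(username) <= 20


def sanitize_comment(comment):
    # Sanitization: transform the input so it is safe in an HTML context.
    return html.escape(comment.strip())


print(is_valid_username("alice42"))         # True
print(is_valid_username("<b>alice</b>"))    # False
print(sanitize_comment("  I <3 Python  "))  # I &lt;3 Python
```

Validation answers a yes/no question about the input, while sanitization returns a changed copy of it. In practice, the two are usually applied together.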

Note: The implementation of the above measures differs based on the technology and context of the service. For example, input sanitization for web applications may involve stripping HTML and JavaScript tags from user input, while IoT devices may need to sanitize input data from sensors or other sources.
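
For a non-web context such as the sensor case mentioned in the note, sanitization might look like the sketch below. The function name, the temperature range, and the choice to return None for unusable readings are all assumptions made for illustration.

```python
import math


def sanitize_temperature(raw_value, low=-40.0, high=125.0):
    # Hypothetical helper: parse a raw sensor reading and reject unusable values.
    try:
        value = float(raw_value)
    except (TypeError, ValueError):
        return None  # not a number at all
    if math.isnan(value) or math.isinf(value):
        return None  # float() accepts "nan" and "inf", which are not real readings
    if not low <= value <= high:
        return None  # outside the assumed operating range
    return value


print(sanitize_temperature(" 21.5 "))  # 21.5
print(sanitize_temperature("nan"))     # None
print(sanitize_temperature("9999"))    # None
```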

Let's see how to sanitize user input using different techniques available in Python.

Sanitizing user input in Python

We can remove unnecessary or malformed data from our input using different techniques, some of which are listed in the table below:

| Technique | Description |
| --- | --- |
| Escape characters | Escape special characters in the input using functions such as html.escape() so that they are treated as plain text rather than markup or code. String methods such as isalnum() can check that the input contains only expected characters. |
| Third-party libraries | Sanitize inputs using third-party libraries and frameworks, such as bleach and validators. |
| Regular expressions | Allow only expected data by blocklisting or allowlisting inputs, for example with Python's re module. |

Let's take a simple example from each of the above categories and implement it in code for better understanding.

Sanitizing using html.escape()

We use the html module of Python to escape characters that have special meanings in HTML.

import html

def sanitize_input(input_str):
    # Escaping special characters in HTML
    sanitized_str = html.escape(input_str)
    return sanitized_str

user_input = '<span>Hi!</span>'
sanitized_input = sanitize_input(user_input)
print(sanitized_input)

Explanation

  • Lines 3–6: We define the sanitize_input function, which returns the sanitized data after processing it.

  • Line 5: We call the html.escape() function to replace each character that has a special meaning in HTML with its corresponding HTML entity.

  • Line 8: We store the input data from the user in the user_input variable.

  • Line 9: We use the sanitize_input function to process the input data.

  • Line 10: We display the sanitized output. The output is &lt;span&gt;Hi!&lt;/span&gt; instead of <span>Hi!</span>, which means the special characters were replaced successfully.
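
html.escape() also takes an optional quote parameter, which defaults to True and controls whether double and single quotation marks are escaped as well. The short sketch below, using a made-up input string, shows the difference. Escaping quotes matters when the value ends up inside an HTML attribute.

```python
import html

user_input = '" onmouseover="alert(1)'

# Default (quote=True): <, >, & and both quote characters are escaped.
print(html.escape(user_input))
# &quot; onmouseover=&quot;alert(1)

# quote=False: only <, > and & are escaped, so the quotes pass through.
print(html.escape(user_input, quote=False))
# " onmouseover="alert(1)
```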

Sanitizing using bleach

We use the third-party bleach library to allow only allowlisted HTML tags in the input. Allowlisted HTML tags are the set of HTML tags that are permitted for use in a web application.

import bleach

# List of allowed HTML tags
allowed_tags = ['span', 'b']

def sanitize_input(input_str):
    # Allowing only allowlisted tags using bleach library
    sanitized_str = bleach.clean(input_str, tags=allowed_tags)
    return sanitized_str

user_input = '<span>Hi!</span> <style> span { color: #ff0000; } </style>'
sanitized_input = sanitize_input(user_input)
print(sanitized_input)

Explanation

  • Line 1: We import the bleach library into our code.

  • Line 4: We define the list of tags that are allowed in the user input.

  • Lines 6–9: We define the sanitize_input function, which returns the sanitized data after processing it.

  • Line 8: We call the bleach.clean() function to keep only the allowlisted tags and escape the rest.

  • Line 11: We store the input data from the user in the user_input variable.

  • Line 12: We use the sanitize_input function to process the input data.

  • Line 13: We display the sanitized output. The span tag is kept, while the style tag is escaped into plain text. When the output is rendered in a browser, the color of the text is not changed, which means only the allowlisted tags were interpreted as HTML.

Sanitizing using re

We use Python's re module to remove blocklisted script tags from the user input using a regular expression.

import re

def sanitize_input(input_str):
    # Regular expression to blocklist script tags
    sanitized_str = re.sub(r'<script\b[^>]*>(.*?)</script>', '', input_str, flags=re.IGNORECASE)
    return sanitized_str

user_input = '<span>Hi!</span> <script>alert("Hello from script!");</script>'
sanitized_input = sanitize_input(user_input)
print(sanitized_input)

Explanation

  • Line 1: We import the re module into our code.

  • Lines 3–6: We define the sanitize_input function, which returns the sanitized data after processing it.

  • Line 5: We call the re.sub() function to remove any text that matches the regular expression given as its first argument (the r prefix marks it as a raw string). The flags=re.IGNORECASE argument tells the re module to ignore the case of characters when matching the regular expression.

  • Line 8: We store the input data from the user in the user_input variable.

  • Line 9: We use the sanitize_input function to process the input data.

  • Line 10: We display the sanitized output. The output only contains <span>Hi!</span>, which means the script tag was removed successfully.
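
Blocklisting with a regular expression is easy to bypass, because it only removes the patterns it knows about. The sketch below reuses the pattern from the example above to show two inputs it misses, then shows the allowlisting alternative with re.fullmatch(). The is_allowed_username name and its pattern are assumptions chosen for this illustration.

```python
import re

SCRIPT_TAG = re.compile(r'<script\b[^>]*>(.*?)</script>', flags=re.IGNORECASE)

# The blocklist pattern does not touch an event-handler payload.
payload = '<img src="x" onerror="alert(1)">'
print(SCRIPT_TAG.sub('', payload) == payload)  # True: nothing was removed

# Without re.DOTALL, "." does not match newlines, so a multi-line script survives.
multi_line = '<script>\nalert(1)\n</script>'
print(SCRIPT_TAG.sub('', multi_line) == multi_line)  # True: nothing was removed

# Allowlisting: accept only input that matches the expected shape in full.
USERNAME = re.compile(r'[A-Za-z0-9_]{3,20}')


def is_allowed_username(value):
    # Hypothetical rule: 3 to 20 letters, digits, or underscores.
    return USERNAME.fullmatch(value) is not None


print(is_allowed_username('alice_42'))                   # True
print(is_allowed_username('<script>alert(1)</script>'))  # False
```

For HTML in particular, a dedicated sanitizer or escaping is generally a safer choice than a hand-written blocklist pattern.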

Copyright ©2024 Educative, Inc. All rights reserved