Sanitizing user input is a critical step in development. Because our clients come from all over the world, we cannot assume that every input is well-intentioned. It's the developer's responsibility to keep malicious input out of the program so that the service remains secure and functions properly.
We must take the following two precautions to ensure that the input is valid and secure for our system:
Input validation: Ensuring that the input is well-formed and in the expected structure.
Input sanitization: Cleaning the data, by removing or escaping unsafe content, so that it is logically correct and safe to use in the system's workflow.
Note: The implementation of the above measures differs based on the technology and context of the service. For example, input sanitization for web applications may involve stripping HTML and JavaScript tags from user input, while IoT devices may need to sanitize input data from sensors or other sources.
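The difference between the two precautions can be sketched in a few lines. This is a minimal illustration, not a complete solution: the function names validate_age and sanitize_name, and the 0 to 150 range, are assumptions made for this example only.

```python
import html

def validate_age(raw):
    """Validation: accept the input only if it is a whole number in a plausible range."""
    text = raw.strip()
    # Reject anything that is not made up purely of decimal digits
    if not text.isdecimal():
        return None
    age = int(text)
    # The 0-150 range is an assumption made for this sketch
    return age if 0 <= age <= 150 else None

def sanitize_name(raw):
    """Sanitization: clean the input so it is safe to display in an HTML page."""
    # Collapse runs of whitespace and trim both ends
    collapsed = " ".join(raw.split())
    # Escape characters that have a special meaning in HTML
    return html.escape(collapsed)

print(validate_age(" 42 "))
print(validate_age("abc"))
print(sanitize_name("  Ada   <b>Lovelace</b> "))
```

Validation rejects input that does not have the expected structure, while sanitization keeps the input but transforms it into a safe form.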
Let's see how to sanitize user input using different techniques available in Python.
We can remove unnecessary or malformed data from our input using different techniques, some of which are listed in the table below:
Technique | Description
Escape characters | Escape special characters in the input.
Third-party libraries | Sanitize inputs using third-party libraries and frameworks.
Regular expressions | Allow only expected data by blocklisting or allowlisting inputs with pattern matching.
Let's take a simple example from each of the above categories and implement it in code for better understanding.
html.escape()
We use Python's html module to escape characters that have special meanings in HTML.
import html

def sanitize_input(input_str):
    # Escaping special characters in HTML
    sanitized_str = html.escape(input_str)
    return sanitized_str

user_input = '<span>Hi!</span>'
sanitized_input = sanitize_input(user_input)
print(sanitized_input)
Lines 3–6: We define sanitize_input, which returns sanitized data after processing it.
Line 5: We call the html.escape() function to replace each character that has a special meaning in HTML with its escaped equivalent.
Line 8: We store input data from the user in the user_input variable.
Line 9: We use the sanitize_input function to process the input data.
Line 10: We display the sanitized output. The output contains &lt;span&gt;Hi!&lt;/span&gt; instead of <span>Hi!</span>, which means the special characters were replaced successfully.
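By default, html.escape() also escapes quotation marks, which matters when the escaped text is placed inside an HTML attribute value. The short sketch below compares the default behavior with quote=False and uses html.unescape() to recover the original text. The sample string is made up for illustration.

```python
import html

raw = 'Tom & "Jerry" <b>'

# Default (quote=True): &, <, >, and quotation marks are all escaped
escaped_all = html.escape(raw)
print(escaped_all)    # Tom &amp; &quot;Jerry&quot; &lt;b&gt;

# quote=False: quotation marks are left untouched
escaped_text = html.escape(raw, quote=False)
print(escaped_text)   # Tom &amp; "Jerry" &lt;b&gt;

# html.unescape() reverses the escaping
restored = html.unescape(escaped_all)
print(restored == raw)
```

Leaving quote at its default is the safer choice when the same value might end up inside an attribute.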
bleach
We use Python's bleach library to allow only allowlisted HTML tags in the user input.
import bleach

# List of allowed HTML tags
allowed_tags = ['span', 'b']

def sanitize_input(input_str):
    # Allowing only allowlisted tags using bleach library
    sanitized_str = bleach.clean(input_str, tags=allowed_tags)
    return sanitized_str

user_input = '<span>Hi!</span> <style> span { color: #ff0000; } </style>'
sanitized_input = sanitize_input(user_input)
print(sanitized_input)
Line 1: We import the bleach library into our code.
Line 4: We define the list of tags that are allowed in the user input.
Lines 6–9: We define sanitize_input, which returns sanitized data after processing it.
Line 8: We call the bleach.clean() function to keep only the allowlisted tags and replace the rest with their escaped equivalents.
Line 11: We store input data from the user in the user_input variable.
Line 12: We use the sanitize_input function to process the input data.
Line 13: We display the sanitized output. The style tag is not in the allowlist, so it is escaped rather than interpreted, and the color of the text does not change when a browser renders the output. Only the allowlisted tags are interpreted by the browser.
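To make the allowlisting idea more concrete, here is a rough standard-library sketch of the same principle: keep tags that are on the allowlist and escape everything else. It is not how bleach is implemented and is not a replacement for it. The class AllowlistSanitizer and the function sanitize_html are hypothetical names for this illustration, and the sketch deliberately drops all attributes from allowed tags.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"span", "b"}  # same allowlist as in the bleach example

class AllowlistSanitizer(HTMLParser):
    """Keeps allowlisted tags (without attributes) and escapes everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            # Rebuild the tag without attributes such as onclick
            self.parts.append(f"<{tag}>")
        else:
            # Escape the raw text of a disallowed start tag
            self.parts.append(escape(self.get_starttag_text()))

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.parts.append(f"</{tag}>")
        else:
            self.parts.append(escape(f"</{tag}>"))

    def handle_data(self, data):
        # Text content is always escaped
        self.parts.append(escape(data))

def sanitize_html(input_str):
    parser = AllowlistSanitizer()
    parser.feed(input_str)
    parser.close()
    return "".join(parser.parts)

print(sanitize_html('<b>Hi</b><script>alert(1)</script>'))
print(sanitize_html('<span onclick="x()">Hi</span>'))
```

A maintained library handles many more edge cases (attributes, URL schemes, malformed markup), which is why the article relies on bleach rather than hand-written code.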
re
We use Python's re module to blocklist script tags in the user input using regular expressions.
import re

def sanitize_input(input_str):
    # Regular expression to blocklist script tags
    sanitized_str = re.sub(r'<script\b[^>]*>(.*?)</script>', '', input_str, flags=re.IGNORECASE)
    return sanitized_str

user_input = '<span>Hi!</span> <script>alert("Hello from script!");</script>'
sanitized_input = sanitize_input(user_input)
print(sanitized_input)
Line 1: We import the re module into our code.
Lines 3–6: We define sanitize_input, which returns sanitized data after processing it.
Line 5: We call the re.sub() function to remove any data that matches the regular expression provided in the raw string (the string prefixed with r). The flags=re.IGNORECASE argument tells the re module to ignore the case of characters when matching the regular expression.
Line 8: We store input data from the user in the user_input variable.
Line 9: We use the sanitize_input function to process the input data.
Line 10: We display the sanitized output. The output only contains <span>Hi!</span>, which means the script tag was removed successfully.
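The table above also mentions allowlisting, which the blocklist example does not show. The sketch below first demonstrates one case the blocklist pattern misses (a script tag spread over several lines, because "." does not match newlines unless re.DOTALL is set), and then shows an allowlist check with fullmatch(). The username rule (3 to 20 letters, digits, or underscores) and the function name is_valid_username are assumptions made for this illustration.

```python
import re

# Blocklist pattern from the example above, unchanged
SCRIPT_TAG = re.compile(r'<script\b[^>]*>(.*?)</script>', flags=re.IGNORECASE)
# Same pattern with re.DOTALL so that "." also matches newlines
SCRIPT_TAG_MULTILINE = re.compile(
    r'<script\b[^>]*>(.*?)</script>', flags=re.IGNORECASE | re.DOTALL
)

multiline_input = '<script>\nalert("Hi");\n</script>'
missed = SCRIPT_TAG.sub('', multiline_input)             # script tag survives
removed = SCRIPT_TAG_MULTILINE.sub('', multiline_input)  # script tag removed
print(repr(missed))
print(repr(removed))

# Allowlist: accept only input that matches the expected shape in full
USERNAME_PATTERN = re.compile(r'[A-Za-z0-9_]{3,20}')  # hypothetical rule

def is_valid_username(input_str):
    return USERNAME_PATTERN.fullmatch(input_str) is not None

print(is_valid_username('alice_01'))
print(is_valid_username('<script>'))
```

An allowlist tends to be easier to reason about than a blocklist because it describes the small set of inputs that are accepted instead of trying to enumerate every harmful one.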