When storing raw HTML in databases or variables, we need to escape special characters that are not markup text but might be confused as such.
These characters include <, >, ", ', and &.
If not escaped, these characters may cause the browser to display a web page incorrectly. For example, the following text contains quotation marks around “Edpresso shots” that could be mistaken for the end of one string and the start of another.
I love reading "Edpresso shots".
HTML provides special entity names and entity numbers which are essentially escape sequences that replace these characters. Escape sequences in HTML always start with an ampersand and end with a semicolon.
The table below lists the special characters that HTML 4 recommends escaping, along with their entity names and entity numbers:
Character | Entity name | Entity number |
> | &gt; | &#62; |
< | &lt; | &#60; |
" | &quot; | &#34; |
& | &amp; | &#38; |
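As a quick check that an entity name and an entity number are two spellings of the same character, the sketch below decodes both forms with html.unescape() from Python's standard library (available from Python 3.4). The ENTITIES dictionary is a name chosen for this illustration only; it lists the four characters from the table.

```python
import html

# Entity name and entity number for the four characters in the table.
# ENTITIES is a hypothetical helper name used only for this illustration.
ENTITIES = {
    ">": ("&gt;", "&#62;"),
    "<": ("&lt;", "&#60;"),
    '"': ("&quot;", "&#34;"),
    "&": ("&amp;", "&#38;"),
}

for char, (name, number) in ENTITIES.items():
    # html.unescape() turns either form back into the original character.
    assert html.unescape(name) == char
    assert html.unescape(number) == char
    print(f"{name} and {number} both decode to {char!r}")
```

Either form can be used in a page; the named form is usually easier to read in source.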
To escape these characters, we can use Python's html.escape() function, which returns a copy of a string with &, <, and > replaced by HTML-safe sequences. html.escape() takes the string to escape as its argument, plus one optional argument, quote, which is set to True by default; when quote is true, the characters " and ' are escaped as well. To use html.escape(), we need to import the html module, which provides this function in Python 3.2 and above. Here is how to use it in code:
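import html
myHtml = """& < " ' >"""
# Default call: quote=True, so both quote characters are escaped too
encodedHtml = html.escape(myHtml)
print(encodedHtml)
# quote=False: only &, <, and > are escaped
encodedHtml = html.escape(myHtml, quote=False)
print(encodedHtml)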
First, import the html module. Pass your HTML string to the html.escape() function, and it returns the escaped version of that string. If you do not wish to escape the quotes, set the quote flag to False.
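To see why the quote flag matters, here is a minimal sketch that places the earlier example sentence inside a double-quoted attribute and inside element content. The variable names and the input and p tags are assumptions made for illustration, not part of the original example.

```python
import html

text = 'I love reading "Edpresso shots".'

# Default (quote=True): double quotes become &quot;, so the text cannot
# end the surrounding double-quoted attribute value early.
safe_for_attribute = html.escape(text)
tag = f'<input value="{safe_for_attribute}">'
print(tag)
# prints: <input value="I love reading &quot;Edpresso shots&quot;.">

# quote=False leaves quotes untouched, which is only appropriate for
# element content, where quotes have no special meaning.
safe_for_body = html.escape(text, quote=False)
print(f"<p>{safe_for_body}</p>")
# prints: <p>I love reading "Edpresso shots".</p>
```

Leaving quote at its default is the safer choice whenever the escaped text might end up inside an attribute value.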