...

Introducing The chardet Module

Before we set off porting the code, it would help if you understood how the code worked! This is a brief guide to navigating the code itself. The chardet library is too large to include inline here, but you can download it from chardet.feedparser.org.

Encoding detection is really language detection in drag.

The main entry point for the detection algorithm is universaldetector.py, which has one class, UniversalDetector. (You might think the main entry point is the detect function in chardet/__init__.py, but that’s really just a convenience function that creates a UniversalDetector object, calls it, and returns its result.)
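
Here is a rough sketch of what that convenience function might look like, assuming UniversalDetector exposes feed(), close(), and a result attribute, as in the chardet source; the exact details vary between chardet versions:

    # A sketch of the convenience function in chardet/__init__.py.
    # Assumes UniversalDetector exposes feed(), close(), and a result
    # attribute; exact details vary between chardet versions.
    from chardet.universaldetector import UniversalDetector

    def detect(byte_string):
        detector = UniversalDetector()
        detector.feed(byte_string)   # hand the raw bytes to the detector
        detector.close()             # finalize and compute the best guess
        return detector.result       # a dictionary with 'encoding' and 'confidence' keys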

There are 5 categories of encodings that UniversalDetector handles:

  1. UTF-N with a Byte Order Mark (BOM). This includes UTF-8, both Big-Endian and Little-Endian variants of UTF-16, and all 4 byte-order variants of UTF-32.
  2. Escaped encodings, which are entirely 7-bit ASCII compatible, where non-ASCII characters start with an escape sequence. Examples: ISO-2022-JP (Japanese) and HZ-GB-2312 (Chinese).
  3. Multi-byte encodings, where each character is represented by a variable number of bytes. Examples: BIG5 (Chinese), SHIFT_JIS (Japanese), EUC-KR (Korean), and UTF-8 without a BOM.
  4. Single-byte encodings, where each character is represented by one byte. Examples: KOI8-R (Russian), WINDOWS-1255 (Hebrew), and TIS-620 (Thai).
  5. WINDOWS-1252, which is used primarily on Microsoft Windows by middle managers who wouldn’t know a character encoding from a hole in the ground.
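
To make these categories a little more concrete, here is an illustrative call to the convenience detect() function on byte strings drawn from a few of them. The sample strings are my own, and the encoding names and confidence values chardet actually reports will depend on the library version and on how much input it sees.

    # Illustrative only: detection results vary with the chardet version
    # and with the amount of input available.
    import chardet

    samples = [
        b'\xef\xbb\xbfHello, world',                                               # category 1: UTF-8 with a BOM
        '\u05e9\u05dc\u05d5\u05dd \u05e2\u05d5\u05dc\u05dd'.encode('windows-1255'),  # category 4: single-byte Hebrew
        'Ol\u00e9, na\u00efve caf\u00e9'.encode('utf-8'),                          # category 3: UTF-8 without a BOM
    ]

    for sample in samples:
        result = chardet.detect(sample)
        print(result)   # a dictionary like {'encoding': ..., 'confidence': ...}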

UTF-N with a BOM

If the text starts with a BOM, we can reasonably assume ...