Parsing Broken XML

Explore how to parse XML documents that contain well-formedness errors using Python's lxml library. Understand XML's strict error rules and the challenges of broken XML such as undefined entities, then use a custom XMLParser with recovery options to handle and log such errors.
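As a quick preview of those strict error rules, here is a minimal sketch of a conforming parser rejecting broken input. It uses the standard library's `xml.etree.ElementTree` as a stand-in rather than lxml, so it runs without extra dependencies. The sample documents and the `check` helper are hypothetical and were made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents, invented for this sketch.
SAMPLES = {
    "well-formed": "<title>dive into mark</title>",
    "undefined entity": "<title>dive into &hellip;</title>",
    "mismatched tags": "<feed><title>dive into mark</feed>",
}


def check(xml_text):
    """Return None if xml_text parses, otherwise the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # A strict parser stops at the first well-formedness error.
        return str(err)
    return None


results = {name: check(text) for name, text in SAMPLES.items()}
for name, error in results.items():
    print(f"{name}: {'ok' if error is None else 'rejected: ' + error}")
```

Only the first sample should parse. The other two are rejected outright rather than partially parsed, which is the behavior a recovering parser is meant to relax.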

The XML specification mandates that all conforming XML parsers employ “draconian error handling.” That is, they must halt and catch fire as soon as they detect any sort of well-formedness error in the XML document. Well-formedness errors include mismatched start and end tags, undefined entities, illegal Unicode characters, and a number of other esoteric rule violations. This is in stark contrast to other common formats like HTML: your browser doesn’t stop rendering a web page if you forget to close an HTML tag or escape an ampersand in an attribute value. (It is a common misconception that HTML has no defined error handling. HTML error handling is actually quite well-defined, but ...