Understanding PDF
Learn the strengths and the weaknesses of the PDF.
Nowadays, PDFs run the world, with billions of PDF files available on the web for public viewing.
The Portable Document Format (PDF) was inspired by the ideas of Dr. John Warnock, co-founder of Adobe Systems, who said, “Every document that was ever printed, or ever would be printed, may be represented in a document.” The format was created by the Camelot team at Adobe Systems Incorporated, with the primary objective of presenting and exchanging documents reliably in electronic form, independent of the software, hardware, or operating system used to view them.
In 2008, Adobe relinquished control of PDF development to the International Organization for Standardization (ISO). By that time, PDF had become the de facto open standard for information interchange. The specification for PDF 2.0 is published as ISO 32000-2, and ISO is now in charge of updating and developing future versions of the format.
Derived from the PostScript page description language, PDF is built on an imaging model that describes text and graphics in a platform-agnostic and resolution-independent manner, with professional-grade precision. Unlike PostScript, which is a programming language that was already well known at the time, PDF is a structured binary file format optimized for high performance in interactive viewing.
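To make the idea of a structured file format a little more concrete, here is a minimal sketch that reads the version from a PDF header. It relies on the fact that a PDF file opens with a header comment such as `%PDF-1.7`. The function name `read_pdf_version`, the 1024-byte search window, and the in-memory sample bytes are assumptions made for illustration, not part of any PDF library.

```python
import re

# A PDF file starts with a header comment such as b"%PDF-1.7".
HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")


def read_pdf_version(data: bytes):
    """Return (major, minor) from the PDF header, or None if no header is found."""
    # Assumption: only the first 1024 bytes are searched for the header.
    match = HEADER_RE.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Hypothetical sample standing in for the first bytes of a real PDF file.
sample = b"%PDF-2.0\n%\xe2\xe3\xcf\xd3\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"

print(read_pdf_version(sample))        # version tuple for the sample
print(read_pdf_version(b"GIF89a..."))  # not a PDF, so None
```

To try this on a real file, pass the first bytes of the file, opened in binary mode, instead of the sample.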
PDF documents may be structured or simple. They may contain text, images, and other multimedia content, such as video and sound. There is also support for annotations, metadata, hypertext links, and bookmarks. Newer versions provide additional functionalities, for example, embedding geospatial information within documents that represent maps or other geospatial images, such as satellite photographs.
PDF is appealing for its visual fidelity, cross-platform portability, security, and compact storage, among other qualities. Manipulating PDFs with the Python programming language adds versatility across platforms, flexibility, a supportive community, and a rich selection of well-established libraries for PDF processing. But before we delve into the details of PDF manipulation, we will shed light on the inherent strengths and weaknesses of this powerful file format.
Advantages
PDF is distinguished by several features such as:
- Visual fidelity and graphic integrity
A PDF displays the same content and layout irrespective of the operating system, device, or software application it is viewed on. Once a document has been produced as a PDF, you can be confident that the reader sees the intended visual appearance, including layout, fonts, colors, and pictures. This is true whether the output is displayed on the screen of a PC or mobile device, or printed as a hard copy. Since a PDF file is internally divided into pages of output, each page has exactly the look and feel that the author wants to convey. This is one of the reasons why PDF is widely used for distributing publications in electronic form.
- Cross-platform portability
Adobe developed free software for viewing PDF files on several platforms and operating systems, including Microsoft Windows, Apple Macintosh, UNIX, and handheld personal digital assistants. The Adobe Reader program is designed to display a PDF file with the same visual fidelity on almost any type of computer. This cross-platform portability makes PDF a common means of disseminating knowledge. Moreover, almost all operating systems for mobile devices can open PDF files, which has contributed to the widespread use of the format.
- Ability to render a variety of media
The PDF format allows you to combine numerous types of content, such as text, images and vector graphics, videos, animations, audio files, 3D models, interactive fields, hyperlinks, and buttons. All of these elements can be integrated within the same PDF file and organized as a report, a presentation, or a portfolio.
- Ability to secure documents
PDF offers several ways to protect a document and its content, such as passwords that restrict access, digital signatures, and watermarks.
- Compact storage
Although a PDF can hold a very large amount of information, its content can be compressed to a file size that is practical to exchange, while the author keeps full control over the quality of any images it contains.
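As a rough illustration of the compact storage point, the sketch below compresses a repetitive block of page-drawing operators with Python's `zlib` module, which implements the same deflate algorithm that PDF's Flate compression filter uses. The sample content stream is hypothetical, and a real PDF writer would also wrap the result in a stream object. This kind of compression is lossless; control over image quality comes from separate, lossy image encodings such as JPEG.

```python
import zlib

# Hypothetical page content: the same text-drawing operators repeated many times.
content_stream = b"BT /F1 12 Tf 72 720 Td (Hello, PDF) Tj ET\n" * 500

# Compress at the highest level, then decompress to confirm nothing was lost.
compressed = zlib.compress(content_stream, 9)
restored = zlib.decompress(compressed)

print(len(content_stream), "bytes before compression")
print(len(compressed), "bytes after compression")
print("lossless round trip:", restored == content_stream)
```

Highly repetitive data like this compresses far better than typical page content, so the ratio shown here should be read as a best case.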
Limitations
Despite the aforementioned advantages, the PDF has the following limitations:
- Difficult to edit
PDF was conceived as an exchange format for finished documents. The original goal was to preserve and protect the content and layout of a document, irrespective of the platform or program it is viewed on. That’s why PDFs are difficult to edit, and why extracting information from them can be challenging.
- Hard to manipulate
How we manipulate a PDF depends on what type of PDF it is. Scanned PDFs and searchable PDFs, for example, call for different approaches when searching for or collecting information.
- Security issues and vulnerabilities
Unfortunately, modern PDF readers contain many security vulnerabilities. Among the most dangerous are code execution vulnerabilities, which let an attacker run arbitrary code on the target system. For additional details, refer to the graphs of Adobe Acrobat Reader DC vulnerabilities in “Acrobat Reader Vulnerability”.
The following graph exhibits the total number of discovered vulnerabilities across a range of years:
The following graph shows the total number of vulnerabilities reported since 2013 and subdivided by their respective types:
Let’s describe briefly the most common types of vulnerabilities:
- A denial-of-service (DoS) attack can occur in two ways:
  - An infinite loop caused by self-referencing objects and elements embedded in the PDF document. This drives up CPU usage and can crash the PDF reader.
  - A deflate bomb, whose inner streams expand to an enormous size in memory when they are decompressed before processing. This can cause the entire system to freeze.
- A code execution vulnerability allows remote attackers to execute arbitrary code on the victim's system via a malicious PDF document.
- A memory corruption vulnerability allows remote attackers to cause a denial of service (an application crash) via a crafted PDF document that triggers memory corruption.
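The two denial-of-service patterns described above can be sketched with a pair of defensive checks. This is an illustrative sketch under stated assumptions, not how any particular PDF reader works: the object graph is a toy dictionary mapping object numbers to the object numbers they reference, and the names `find_reference_cycle` and `bounded_inflate` and the size limits are hypothetical. The decompression cap relies on the `max_length` argument of `zlib` decompression objects.

```python
import zlib


def find_reference_cycle(references, start):
    """Return True if following references from `start` leads back onto the current path."""
    on_path, finished = set(), set()

    def visit(obj_id):
        if obj_id in on_path:      # reached an object we are still expanding: a cycle
            return True
        if obj_id in finished:     # already fully explored, no cycle through here
            return False
        on_path.add(obj_id)
        found = any(visit(child) for child in references.get(obj_id, ()))
        on_path.discard(obj_id)
        finished.add(obj_id)
        return found

    return visit(start)


def bounded_inflate(compressed, limit):
    """Decompress at most `limit` bytes; raise ValueError if the stream holds more."""
    inflater = zlib.decompressobj()
    # Ask for one byte beyond the limit so an oversized stream can be detected.
    output = inflater.decompress(compressed, limit + 1)
    if len(output) > limit:
        raise ValueError("decompressed stream exceeds the allowed size")
    return output


# Toy object graph (assumption): object 3 refers back to object 1.
print(find_reference_cycle({1: [2], 2: [3], 3: [1]}, 1))

# A small "bomb": ten million zero bytes shrink to a few kilobytes when compressed.
bomb = zlib.compress(b"\0" * 10_000_000)
print(len(bomb), "compressed bytes")
try:
    bounded_inflate(bomb, 1_000_000)
except ValueError as error:
    print("rejected:", error)
```

Real PDFs legitimately contain some circular references, such as a page pointing back to its parent, so a real parser would apply a check like this only to chains that are expected to be acyclic.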
To sum up, PDFs are undoubtedly more popular now than they have ever been, and this popularity shows no signs of waning.
Governments, private industries, and large organizations are increasingly relying on PDF for reliably sharing, managing, and maintaining their electronic records.