Types and Structure of PDF Files

Introduction

The PDF file format is simple by design. This encourages developers to create PDF documents using in-house solutions, without relying on external toolkits.

We will manipulate PDF files later in the course. Before that, we need to know the different types of PDF files, because each type supports a different set of processing functions.

This lesson explores the main types of PDF files and takes a look at their internal structure.

Types

PDF files are subdivided into three main types: image-only, searchable image, and formatted text and graphics.

  • Image-only PDF

An image-only PDF (also known as a Scanned PDF) includes a photographic image representing each page, and virtually no textual characters or vector graphics. Image-only PDF files are not editable and are generally created by scanning hard-copy documents.

Optical Character Recognition (OCR) software can often extract the content of an image-only PDF file. However, the extracted text usually contains recognition errors that require manual proofreading and correction. Scanning documents into image-only PDF files has been a standard way of archiving information because electronic files take up far less space and are less cumbersome than paper.

  • Searchable image

The searchable-image PDF (also known as OCRed PDF or made-searchable PDF) incorporates an image for each page, but this type also contains a text layer. The textual characters are produced by an OCR process, which analyzes each page image for characters. Wherever characters are detected in the image, the software places a layer of text under them. A reader sees only the surface image, as with an image-only PDF. The text layer lets a reader search the document for relevant phrases. It also lets PDF files be indexed by keyword in a collection of electronic documents, which improves both the searchability and the accessibility of the file.

  • Formatted text and graphics

The formatted text and graphics type, also known as True PDF, digitally created PDF, text-based PDF, or Real PDF, keeps the use of photographic images to a minimum. Textual characters and vector graphics are used wherever they can represent the content of a page. Photographic images are used only for pictures that cannot be built from textual characters and vector graphics. In general, this type of PDF is the result of a conversion from another electronic file format, such as MS Word. This type is the most compact, often about 10% of the size of an image-only file with the same content. It offers many features, such as accessibility, because of the quality of the underlying text. It also offers more flexibility: the PDF file can be converted into HTML for rendering as web pages, or into an MS Word document for editing as part of another document.

This subdivision depends on the originating source of the PDF, and indicates whether its content can be accessed, searched, copied, edited, or whether it is locked in an image of the page.

When it comes to purpose and functionality, and as we will find in the upcoming lessons, these three types fall into two main categories: searchable PDFs (searchable image, and formatted text and graphics) and non-searchable PDFs (image-only). The following table highlights the differences between the two categories:

Feature                                      | Searchable PDF | Non-searchable PDF
Contains actual text                         | Yes            | No
Contains graphics and images                 | Yes            | Yes
Contains links                               | Yes            | No
Text can be annotated                        | Yes            | No, except with OCR
Text can be redacted                         | Yes            | No, except with OCR
Content searchable                           | Yes            | No, except with OCR
Can be converted back to MS Excel or MS Word | Yes            | No
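
As a rough illustration of the "Contains actual text" row, the following Python sketch checks whether any stream in a file contains text-drawing operators (BT followed by Tj or TJ). It is a crude heuristic and not a reliable classifier: it assumes streams are either uncompressed or Flate-compressed, and the helper names (make_stream_object, looks_searchable) and the sample page content are hypothetical, made up for this example. A real project would more likely rely on a PDF library.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)
# A text-showing operator (Tj or TJ) appearing after a BT (begin text) operator.
TEXT_OP_RE = re.compile(rb"\bBT\b.*?\b(?:Tj|TJ)\b", re.DOTALL)


def looks_searchable(pdf_bytes: bytes) -> bool:
    """Heuristic: True if any stream appears to draw text."""
    for match in STREAM_RE.finditer(pdf_bytes):
        data = match.group(1)
        try:
            # Assumption: a compressed stream uses FlateDecode (zlib).
            data = zlib.decompress(data)
        except zlib.error:
            pass  # not zlib data; inspect the raw bytes instead
        if TEXT_OP_RE.search(data):
            return True
    return False


def make_stream_object(content: bytes) -> bytes:
    """Wrap content in a stream object (hypothetical sample data only)."""
    return b"4 0 obj\n<< /Length %d >>\nstream\n%s\nendstream\nendobj\n" % (
        len(content),
        content,
    )


# Page content that draws text, as in a formatted text and graphics PDF.
text_page = make_stream_object(b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET")
# Page content that only paints an image, as in an image-only PDF.
image_page = make_stream_object(b"q 612 0 0 792 0 0 cm /Im1 Do Q")

print(looks_searchable(text_page))   # True
print(looks_searchable(image_page))  # False
```

Note that a searchable-image PDF would also pass this check, because its OCR text layer is drawn with the same text operators even though the reader only sees the image.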

Structure

A PDF document is composed of four main parts:

  • The header

The first line of a PDF file is the header, which specifies the version of the PDF format that the file conforms to. For example, “%PDF-1.4” means that the file follows version 1.4 of the PDF format.

  • The body

The body consists of the objects that make up the content of the document. These objects include image data, fonts, annotations, text streams, and so on. The body can also hold objects that are not directly visible on the page; these carry the interactive features of a document, such as animation. An author can also impose a logical structure on the document. The logical structure provides a mechanism for incorporating structural information about a document’s content into a PDF file, and allows the writer to choose what structural information to include and how to represent it.

Moreover, we can make the content of a PDF document more secure by applying security features, such as protecting the file with a password to prevent unauthorized viewing or editing.

  • The cross-reference table (xref)

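The cross-reference table (xref) records the location of every indirect object in the file as a byte offset from the start of the file, so a reader can jump straight to any object without scanning the whole document. When a PDF file is updated, the cross-reference information is updated as well. With incremental updates, a new cross-reference section is appended to the file, which makes it possible to trace what has changed.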

  • The trailer

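The trailer tells a PDF reader where to find the cross-reference table: the “startxref” keyword is followed by the byte offset of the table. The trailer also holds a dictionary that points to the root object of the document, and it always ends with “%%EOF” to mark the end of the PDF file.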
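
The four parts can be seen in a tiny hand-built file. The Python sketch below assembles a minimal one-page PDF in memory (header, three body objects, a cross-reference table, and a trailer), then reads it back the way a PDF reader would start: version from the first line, then the "startxref" offset near the end of the file, then the object offsets listed in the table. The function names (build_minimal_pdf, inspect_pdf) are hypothetical, and the reader side assumes a classic cross-reference table; files written for PDF 1.5 or later may store this information in a cross-reference stream instead.

```python
import re


def build_minimal_pdf() -> bytes:
    """Assemble a tiny one-page PDF, recording each object's byte offset."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>",
    ]
    out = bytearray(b"%PDF-1.4\n")                      # 1. header
    offsets = []
    for number, body in enumerate(objects, start=1):    # 2. body
        offsets.append(len(out))
        out += b"%d 0 obj\n%s\nendobj\n" % (number, body)
    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)         # 3. cross-reference table
    out += b"0000000000 65535 f \n"                     # entry 0 is always free
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset             # one in-use entry per object
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)  # 4. trailer
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


def inspect_pdf(data: bytes) -> dict:
    """Read the version, the xref offset, and the in-use object offsets."""
    header = re.match(rb"%PDF-(\d+)\.(\d+)", data)
    if header is None:
        raise ValueError("missing %PDF header")
    # The startxref keyword sits near the end of the file, just before %%EOF.
    tail = re.search(rb"startxref\s+(\d+)\s+%%EOF\s*$", data[-1024:])
    if tail is None:
        raise ValueError("missing startxref / %%EOF")
    xref_offset = int(tail.group(1))
    # Assumption: a classic table, whose in-use entries look like
    # "0000000009 00000 n".
    entries = re.findall(rb"(\d{10}) \d{5} n", data[xref_offset:])
    return {
        "version": (int(header.group(1)), int(header.group(2))),
        "xref_offset": xref_offset,
        "xref_is_table": data[xref_offset:xref_offset + 4] == b"xref",
        "object_offsets": [int(entry) for entry in entries],
    }


pdf = build_minimal_pdf()
info = inspect_pdf(pdf)
print(info["version"])  # (1, 4)
for number, offset in enumerate(info["object_offsets"], start=1):
    # Each offset should land exactly on the matching "N 0 obj" line.
    print(number, pdf[offset:offset + 7])
```

Writing the offsets while the file is being built is what lets a reader jump directly to any object later, which is why the table comes after the body rather than before it.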
