Types and Structure of PDF Files
Explore the types of PDF files and their internal structure.
Introduction
The PDF file format is simple by design. This encourages developers to create PDF documents using in-house solutions, without relying on external toolkits.
We will manipulate PDF files later in the course, but first we need to know the different types of PDF files, because the type determines which processing functions apply.
This lesson explores the types of PDF files and peeks into their internals.
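To make the "simple by design" point concrete, here is a minimal sketch that assembles a one-page, blank PDF by hand using only the Python standard library. The function name `build_minimal_pdf` and the page size are assumptions made for illustration. Real documents add content streams, fonts, and compression, so treat this as a peek at the skeleton rather than a complete writer.

```python
def build_minimal_pdf() -> bytes:
    """Assemble a one-page, blank PDF by hand (illustrative sketch)."""
    # Three indirect objects: the catalog, the page tree, and a single page.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Resources << >> >>",
    ]

    out = bytearray(b"%PDF-1.4\n")  # header line with the format version
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object, recorded for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    # Cross-reference table: one fixed-width, 20-byte entry per object.
    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always the free-list head
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset

    # Trailer: points at the catalog and at the start of the xref table.
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)


if __name__ == "__main__":
    with open("minimal.pdf", "wb") as handle:
        handle.write(build_minimal_pdf())
```

Opening the resulting `minimal.pdf` in a text editor shows the four parts in order: header, objects, cross-reference table, and trailer.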
Types
PDF files are subdivided into three main types: image-only, searchable image, and formatted text and graphics.
-
Image-only PDF
An image-only PDF (also known as a scanned PDF) contains a photographic image of each page and virtually no text characters or vector graphics. Image-only PDF files are generally created by scanning hard-copy documents, and their text cannot be edited, searched, or selected.
Optical Character Recognition (OCR) software can often extract the content of an image-only PDF file. However, the extracted text usually contains recognition errors and needs manual proofreading and correction before it is accurate. Scanning documents into image-only PDF files has been a standard way of archiving information because electronic media take up far less space and are less cumbersome than paper storage.
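As a rough illustration of how an image-only file differs internally, the sketch below scans a PDF's raw bytes for image objects and font resources. The function name `looks_image_only` is hypothetical, and the check is only a heuristic. Files that store their objects in compressed object streams can hide these markers, so a dedicated PDF library is the more reliable choice in practice.

```python
import re


def looks_image_only(pdf_bytes: bytes) -> bool:
    """Heuristic guess: True if the raw bytes mention images but no fonts.

    Assumption: the object dictionaries are stored uncompressed, so the
    /Subtype /Image and /Font names are visible in the raw file.
    """
    has_image = re.search(rb"/Subtype\s*/Image\b", pdf_bytes) is not None
    # Text needs a font, so any /Font entry suggests the file is not image-only.
    has_font = re.search(rb"/Font\b", pdf_bytes) is not None
    return has_image and not has_font


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        print(looks_image_only(handle.read()))
```

A result of `True` suggests the file would need OCR before its text can be processed. A result of `False` is not conclusive either way.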
-
Searchable image