Metadata Treatment
Learn how to gather, modify, and delete the various types of metadata embedded within a PDF file.
Introduction
Metadata is typically populated by PDF conversion applications. It includes common fields such as the document version, the creation date, and the program that created the file, among others. Some often-overlooked attributes merit a closer look if you want to dive into PDF analysis.
Scope
The objective of this lesson is to show how to extract, update, and delete the metadata of a PDF file using the Python programming language.
Prerequisites
We need two libraries for metadata manipulation:
PyPDF4
It is a pure-Python PDF library best suited to splitting, merging, cropping, and transforming the pages of a PDF file. It can also retrieve text and metadata from PDFs.
Pikepdf
It is a library for developers who need to create, manipulate, and parse PDF files. It supports reading and writing PDFs, including creating them from scratch.
| Library | Version |
|---|---|
| PyPDF4 | 1.27.0 |
| Pikepdf | 3.0.0 |
Unlike PyPDF4, the Pikepdf library can edit a PDF's XMP metadata. Therefore, we will use it for the functions that modify and delete metadata in this lesson.
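To confirm that an environment matches the versions listed in the table above, a quick check along these lines may help. This is only a minimal sketch: the helper name installed_versions is hypothetical, and it assumes Python 3.8 or later and that the packages are installed under the distribution names PyPDF4 and pikepdf.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional


def installed_versions(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    # Versions used in this lesson: PyPDF4 1.27.0 and Pikepdf 3.0.0.
    for name, ver in installed_versions(["PyPDF4", "pikepdf"]).items():
        print(f"{name}: {ver or 'not installed'}")
```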
Let’s start coding
Using the PyPDF4 library, we will define the functions collect_did_metadata, update_did_metadata, and collect_xmp_metadata.
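As a rough illustration of what these three PyPDF4-based functions might look like, here is a minimal sketch. It is not the lesson's own implementation. It assumes that "did" refers to the PDF Document Information Dictionary, that PyPDF4 1.27.0 exposes the PyPDF2-style PdfFileReader and PdfFileWriter classes, and that the input is not encrypted. The helpers clean_did and with_slash, the parameter names, and the choice of XMP fields are all hypothetical.

```python
from typing import Dict, Optional


def clean_did(raw: Optional[dict]) -> Dict[str, str]:
    """Turn a raw Info dictionary ({'/Title': ...}) into {'Title': '...'}."""
    if not raw:
        return {}
    return {str(key).lstrip("/"): str(value) for key, value in raw.items()}


def with_slash(fields: Dict[str, str]) -> Dict[str, str]:
    """PDF names start with '/', so add the slash where it is missing."""
    return {"/" + key.lstrip("/"): value for key, value in fields.items()}


def collect_did_metadata(path: str) -> Dict[str, str]:
    """Read the Document Information Dictionary of the PDF at `path`."""
    # Third-party import kept local so the pure helpers work without PyPDF4.
    from PyPDF4 import PdfFileReader

    with open(path, "rb") as stream:
        reader = PdfFileReader(stream, strict=False)
        return clean_did(reader.getDocumentInfo())


def update_did_metadata(src: str, dst: str, changes: Dict[str, str]) -> None:
    """Copy `src` to `dst`, merging `changes` into the existing Info fields."""
    from PyPDF4 import PdfFileReader, PdfFileWriter

    with open(src, "rb") as stream:
        reader = PdfFileReader(stream, strict=False)
        writer = PdfFileWriter()
        writer.appendPagesFromReader(reader)
        # Existing fields first, so that `changes` wins on conflicts.
        merged = {**clean_did(reader.getDocumentInfo()), **clean_did(changes)}
        writer.addMetadata(with_slash(merged))
        with open(dst, "wb") as out:
            writer.write(out)


def collect_xmp_metadata(path: str) -> Dict[str, object]:
    """Read a few XMP fields; returns {} when the PDF has no XMP packet."""
    from PyPDF4 import PdfFileReader

    wanted = ("dc_title", "dc_creator", "pdf_producer", "xmp_creatorTool")
    with open(path, "rb") as stream:
        reader = PdfFileReader(stream, strict=False)
        xmp = reader.getXmpMetadata()
        if xmp is None:
            return {}
        return {name: getattr(xmp, name) for name in wanted}
```

A call such as update_did_metadata("in.pdf", "out.pdf", {"Title": "Quarterly report"}) would write a copy whose Title field is replaced while the other Info fields are carried over; the file names here are placeholders.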
Next, we will rely on the Pikepdf library to develop the functions modify_metadata and delete_metadata.
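The two Pikepdf-based functions could be sketched roughly as follows. Again, this is an assumption-laden illustration rather than the lesson's code: it assumes modify_metadata edits XMP fields through pikepdf's open_metadata context manager and that delete_metadata removes both the XMP packet and the Document Information Dictionary. The helper check_xmp_keys and the parameter names are hypothetical.

```python
from typing import Dict


def check_xmp_keys(changes: Dict[str, str]) -> None:
    """Reject keys without a namespace prefix such as 'dc:' or 'pdf:'."""
    bad = [key for key in changes if ":" not in key]
    if bad:
        raise ValueError(f"XMP keys need a namespace prefix: {bad}")


def modify_metadata(src: str, dst: str, changes: Dict[str, str]) -> None:
    """Write a copy of `src` to `dst` with the given XMP fields set."""
    # Third-party import kept local so check_xmp_keys works without pikepdf.
    import pikepdf

    check_xmp_keys(changes)
    with pikepdf.open(src) as pdf:
        # Edits are committed to the XMP packet when the inner block exits.
        with pdf.open_metadata() as meta:
            for key, value in changes.items():
                meta[key] = value
        pdf.save(dst)


def delete_metadata(src: str, dst: str) -> None:
    """Write a copy of `src` to `dst` without XMP or Info metadata."""
    import pikepdf

    with pikepdf.open(src) as pdf:
        # The XMP packet lives in the /Metadata entry of the document root.
        if "/Metadata" in pdf.Root:
            del pdf.Root.Metadata
        # Drop the Document Information Dictionary as well.
        del pdf.docinfo
        pdf.save(dst)
```

For example, modify_metadata("in.pdf", "out.pdf", {"dc:title": "Quarterly report"}) would set the Dublin Core title; the file names are placeholders. Both functions write to a separate output path because pikepdf does not overwrite its input file by default.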
Afterward, we will utilize these functions in different scenarios to manipulate the metadata of sample PDF files.