Computing the Checksum of a PDF File
Learn to compute the checksum of a PDF file using various hashing algorithms.
Introduction
Large volumes of data are sent over the Internet and local networks, where they can be lost or altered by network faults or malicious attacks.
A checksum is typically used to verify that the data received is intact and free of errors and losses.
What is a checksum?
A checksum is the result of running an algorithm, here a cryptographic hash function, on a block of data, typically a single file. Comparing the checksum you compute for your copy of a file with the one published by the file's original source confirms, when the two match, that your copy is genuine and has not been tampered with.
A checksum may have different names. It is commonly called a hash sum, while less common names include hash value, hash code, or simply a hash.
A checksum value is a string of letters and digits, usually written in hexadecimal, that acts as a fingerprint for a string, a file, or a set of files.
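As a minimal sketch of the fingerprint idea, the snippet below hashes a short byte string with SHA-256 using Python's built-in hashlib module. The sample message is an arbitrary placeholder chosen for illustration.

```python
import hashlib

# Arbitrary sample data (an assumption); any bytes object works.
message = b"hello world"

# hexdigest() returns the checksum as a string of hexadecimal characters.
checksum = hashlib.sha256(message).hexdigest()

print(checksum)
print(len(checksum))  # 64 hexadecimal characters for SHA-256
```

The same input always yields the same string, which is what makes it usable as a fingerprint.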
Common reasons for inconsistent checksums
Several factors can cause checksum values to mismatch, such as:
- Malicious tampering: A difference between the stored checksum of a file and the computed one shows that either the file or the stored checksum was modified, but it does not tell you which of them is legitimate. Identical files always have the same checksum, and changing anything other than the file name results in a different checksum. This is how the checksum technique helps detect integrity violations.
- Data corruption: A mismatch doesn't automatically mean that the data is malicious. Files might be inadvertently altered during a file transfer, for example if they include an unexpected type of encoding.
- Incompatibility in hashing algorithms: Computing the checksum with a hashing algorithm other than the one used for the published value produces a different checksum.
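These mismatch causes can be illustrated with a short sketch. The byte strings below are placeholders standing in for file content, not real PDF data.

```python
import hashlib

original = b"%PDF-1.7 sample content"  # placeholder bytes, not a real PDF
tampered = b"%PDF-1.7 sample c0ntent"  # one character changed
copy = bytes(original)                 # identical content

# Same algorithm, different content: the checksums differ.
content_changed = (
    hashlib.sha256(original).hexdigest() != hashlib.sha256(tampered).hexdigest()
)

# Same content, different algorithms: the checksums differ as well.
algorithm_changed = (
    hashlib.sha1(original).hexdigest() != hashlib.sha256(original).hexdigest()
)

# Same content, same algorithm: the checksums match.
unchanged = hashlib.sha256(original).hexdigest() == hashlib.sha256(copy).hexdigest()

print(content_changed, algorithm_changed, unchanged)  # True True True
```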
The checksum algorithms
At the core of a checksum is the algorithm used to create the checksum value. Its main purpose is to assign a fixed-length numerical value to a file. This value is derived from the content of the file, and therefore also reflects its size.
There is a wide array of checksum algorithms, which include:
Algorithm | Checksum Size (hexadecimal characters) |
---|---|
MD5 | 32 |
SHA1 | 40 |
SHA256 | 64 |
SHA384 | 96 |
SHA512 | 128 |
These algorithms are listed in order of security strength, from lowest to highest. Security strength refers to the amount of work, that is, the number of operations, required to break a cryptographic algorithm.
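The checksum sizes can be checked directly. The sketch below hashes an empty input with each algorithm through hashlib and reports the length of the resulting hexadecimal digest; each hexadecimal character encodes 4 bits. The variable names are illustrative.

```python
import hashlib

# Length of the hexadecimal digest for each algorithm.
lengths = {}
for name in ("md5", "sha1", "sha256", "sha384", "sha512"):
    digest = hashlib.new(name, b"").hexdigest()
    lengths[name] = len(digest)
    print(f"{name}: {len(digest)} hex characters ({len(digest) * 4} bits)")
```

The digest length depends only on the algorithm, never on the size of the input.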
Scope
This lesson shows how to compute a checksum value for a PDF file using a lightweight command-line utility written in Python.
Prerequisites
This lesson relies on a single component:
hashlib is a built-in Python module that provides a common interface to many hash algorithms, so they can all be used in the same way.
Let’s head into coding
Within the code snippet below, we will perform the following steps:
- Import the libraries (Line 1).
- Set a read buffer size (Line 4).
- Define a function called verify_algorithm to check that the algorithm specified by the user is in the list of algorithms defined in the module. If it is not, the list is displayed on the console (Lines 10-12).
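One way these steps might look is sketched below. This is an assumption-laden illustration rather than the lesson's own snippet, so the line numbers cited above may not match it. The buffer size, the use of hashlib.algorithms_guaranteed as the algorithm list, and the compute_checksum helper are all hypothetical choices.

```python
import hashlib

# Read buffer size in bytes; 64 KiB is an assumed value.
BUFFER_SIZE = 65536


def verify_algorithm(algorithm):
    """Return True if hashlib guarantees the algorithm; otherwise list the options."""
    if algorithm in hashlib.algorithms_guaranteed:
        return True
    print("Unsupported algorithm. Choose one of:")
    print(", ".join(sorted(hashlib.algorithms_guaranteed)))
    return False


def compute_checksum(path, algorithm="sha256"):
    """Hypothetical helper: hash a file in BUFFER_SIZE chunks.

    Reading in chunks avoids loading a large PDF into memory at once.
    The shake_* algorithms need a length for hexdigest() and are not handled.
    """
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as pdf_file:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: pdf_file.read(BUFFER_SIZE), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

A call such as compute_checksum("sample.pdf", "sha256"), where sample.pdf is a placeholder path, would return the hexadecimal checksum to compare against the published value.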