Python Tool For Hashing Large Files

Kyle December 31, 2022 0 Comments Python

Ever wanted to build your own tool that can generate MD5, SHA1, SHA256, or SHA512 hashes of files of any size, or just copy someone else's? In this post we will build a command line application that can hash any file on your system without consuming much memory, even when the file is large. The program is designed so that it can be dropped into a larger program and integrated without modifying the code.

If you would rather just have the code and don't want to be walked through the creation process, skip to the Full Program section at the end of this post.

Requirements

- Must be under 75 lines of code
- Must be able to be integrated into other programs without modifying the "backend" code
- Must be able to output the following hashes:
  - MD5
  - SHA1
  - SHA256
  - SHA512
- Must not load the whole file into memory when reading it to compute the hash
- Must be 1 file

Design

Because this needs to be usable in other programs, I think it is best to split the logic into 2 parts: the file reading and hashing component, and the command line application.

To handle the file reading and hashing logic, I think it is best to create an object that reads in the file and generates the hashes. Other programs will then have a simple interface to use when reading and hashing a file.

To handle the CLI application I can make a simple script using argparse and put that code at the bottom of our file, protected with if __name__ == '__main__' so that it only executes when the file is run directly and not when the file is imported. This allows all of the code to be in one file.

Hasher Object

The first thing we need to do is create our hasher object, whose responsibilities are to read in the file and generate the hash. The good news is that the underlying code for generating hashes already ships with Python in the hashlib library, so the only problem we really need to solve is reading the file in chunks and feeding those chunks to the hashing library.

First we will make our imports and define a hasher class.

import hashlib


class Hasher:

    def __init__(self, file_path, buffer_size=65560):
        self.file_path = file_path
        self.buffer_size = buffer_size

I defined the hasher class to accept 2 arguments: the path to the file to hash, and the buffer size to use when reading the file. The buffer size defaults to 65560 bytes, which is roughly 65KB.

Next we need to add a method which reads our file in chunks.

def _reader(self):
    """
    Reads a file in chunks as specified by self.buffer_size
    :return: generator yielding successive chunks of the file as bytes
    """
    with open(self.file_path, 'rb') as f:
        while True:
            data = f.read(self.buffer_size)
            if not data:
                break
            yield data

This method is a generator. It opens the file, reads the number of bytes specified by the buffer size, and yields that chunk to the caller. It repeats this until a read returns no data, which means the end of the file has been reached.

There are two key things to note. First, we open the file in binary mode ('rb'), so we read raw bytes. Second, we don't return the data from the method, we yield it, which is what makes this method a generator. When the method is used in a for loop, its internal variables keep their state between iterations, so you get a continuous read of the file. If this is not making sense now, it will make more sense when you see our hasher method.

Now we need a method that will hash the data received from _reader. So that we don't have to write the same function once for each of the 4 hashing algorithms, I will make this method generic so that it works with any of them. To understand this code you first need to understand how hashlib creates hashes. Take a look at this example.
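To make the generator behaviour concrete, here is a minimal standalone sketch of the same chunked read. The function name read_in_chunks, the temporary file, and the tiny 10-byte buffer are assumptions chosen for illustration, not part of the final program.

```python
import os
import tempfile


def read_in_chunks(path, buffer_size):
    """Yield successive blocks of at most buffer_size bytes from the file at path."""
    with open(path, 'rb') as f:
        while True:
            data = f.read(buffer_size)
            if not data:  # an empty bytes object means end of file
                break
            yield data


# Write 25 bytes to a throwaway file so there is something to read.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'a' * 25)
    sample_path = tmp.name

# Each pass of the loop resumes the generator right after its last yield.
chunk_sizes = [len(chunk) for chunk in read_in_chunks(sample_path, 10)]
print(chunk_sizes)  # [10, 10, 5]

os.remove(sample_path)
```

Only one chunk is held in memory at a time, which is why the same loop works for a file of any size.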

# Example md5 hashlib

hasher = hashlib.md5()
with open('file.txt', 'rb') as f:
    chunk1 = f.read(10)
    chunk2 = f.read()

hasher.update(chunk1)
hasher.update(chunk2)
print(hasher.hexdigest())

You can see that I read the file in 2 chunks, the first 10 bytes, the second the rest of the file. Then I update the hasher variable with the chunks and then print out the hexdigest. If I wanted to do a sha512 hash the only thing I would need to change would be the declaration of the hasher variable.

hasher = hashlib.sha512()
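The reason chunked reading works at all is that feeding a hashlib object in pieces gives the same digest as feeding it everything at once. A small sketch of that property, using an in-memory byte string (an assumption for illustration) instead of a file:

```python
import hashlib

data = b'The quick brown fox jumps over the lazy dog'

# Hash everything in a single call.
one_shot = hashlib.md5(data).hexdigest()

# Hash the same bytes in two pieces, the first 10 bytes and then the rest.
incremental = hashlib.md5()
incremental.update(data[:10])
incremental.update(data[10:])
incremental_digest = incremental.hexdigest()

print(one_shot == incremental_digest)  # True
```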

So with this knowledge I will write the hasher as a generic method which accepts a hashlib object such as hashlib.md5() or hashlib.sha1().

def _hasher(self, algo):
    """
    Uses the reader to read the file in chunks, adding each chunk to the hashlib object
    :param algo: hashlib object ex. hashlib.md5() or hashlib.sha1()
    :return: hex digest of the file as a string
    """
    for chunk in self._reader():
        algo.update(chunk)
    return algo.hexdigest()

This method accepts an algo argument, which needs to be a hashlib object such as hashlib.md5() or hashlib.sha1().

It then iterates through each chunk of data read from the file and updates the hashlib object with it. Once the file is exhausted, it returns the hex digest.

The last thing we need to do is create an interface on the object for generating each kind of hash. I decided to use the @property decorator to make them appear as attributes of the object.
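As an aside, hashlib can also build the object from an algorithm name with hashlib.new(), which is another way to keep the hashing code generic. The sketch below is an alternative to the method above, not the approach this post uses; the function name hash_file and the sample file are assumptions for illustration.

```python
import hashlib
import os
import tempfile


def hash_file(path, algo_name, buffer_size=65560):
    """Return the hex digest of the file at path using the named algorithm."""
    algo = hashlib.new(algo_name)  # e.g. 'md5', 'sha1', 'sha256', 'sha512'
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps calling f.read until it returns b''
        for chunk in iter(lambda: f.read(buffer_size), b''):
            algo.update(chunk)
    return algo.hexdigest()


# Throwaway file holding a known payload.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'hello world')
    sample_path = tmp.name

# A deliberately small buffer forces several reads.
digest = hash_file(sample_path, 'sha256', buffer_size=4)
print(digest == hashlib.sha256(b'hello world').hexdigest())  # True

os.remove(sample_path)
```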

@property
def md5(self):
    return self._hasher(hashlib.md5())

@property
def sha1(self):
    return self._hasher(hashlib.sha1())

@property
def sha256(self):
    return self._hasher(hashlib.sha256())

@property
def sha512(self):
    return self._hasher(hashlib.sha512())

At this point we have our completed hasher object. It reads the file in chunks so as not to consume memory unnecessarily, and it has 4 properties that form the object's interface, each returning the hex digest for its respective algorithm.

The harder part is now complete, and we could realistically use this in a larger program right now, but to finish this out I will create the script for the command line application.
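Here is a rough sketch of what using the object from a larger program could look like. In a real project you would import the class from the file built in this post; the class is repeated here in condensed form (only the sha256 property) so the sketch runs on its own. The helper file_matches and the sample payload are hypothetical and exist only for this illustration.

```python
import hashlib
import os
import tempfile


class Hasher:
    """Condensed copy of the class built above, repeated so this sketch runs alone."""

    def __init__(self, file_path, buffer_size=65560):
        self.file_path = file_path
        self.buffer_size = buffer_size

    def _reader(self):
        with open(self.file_path, 'rb') as f:
            while True:
                data = f.read(self.buffer_size)
                if not data:
                    break
                yield data

    def _hasher(self, algo):
        for chunk in self._reader():
            algo.update(chunk)
        return algo.hexdigest()

    @property
    def sha256(self):
        return self._hasher(hashlib.sha256())


def file_matches(path, expected_sha256):
    """Hypothetical helper: True if the file's SHA256 equals the expected digest."""
    return Hasher(path).sha256 == expected_sha256.lower()


# Throwaway file standing in for a download we want to verify.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'example payload')
    sample_path = tmp.name

expected = hashlib.sha256(b'example payload').hexdigest()
ok = file_matches(sample_path, expected)
bad = file_matches(sample_path, '0' * 64)
print(ok, bad)  # True False

os.remove(sample_path)
```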

Command Line Application

Now to use this object in a command line application we will make a script which uses the argparse library to parse our command line arguments for us and create help documentation, and then use that to instantiate our hasher object and generate our hash.

First we need to protect our code from executing when this Python file is imported, and import ArgumentParser from the argparse library.

if __name__ == '__main__':

    from argparse import ArgumentParser

If you do not understand what if __name__ == '__main__': does I have an article that can bring you up to speed.

Next I will create a parser and add a required mutually exclusive group. Because the group is both required and mutually exclusive, the user must specify exactly 1 of its arguments. This is how we will let the user select the hashing algorithm.

parser = ArgumentParser()
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument('-md5', action='store_true')
group.add_argument('-sha1', action='store_true')
group.add_argument('-sha256', action='store_true')
group.add_argument('-sha512', action='store_true')

Next I will add a positional argument to the parser that will be the file path.

parser.add_argument('file_path', type=str)

And lastly I will add an optional argument to specify the buffer size, defaulting to the same 65560 bytes (roughly 65KB) as the class, and then parse the arguments.

parser.add_argument('-buffer_size', type=int, default=65560)
args = parser.parse_args()
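One way to see how the parser behaves without touching a real command line is to pass parse_args an explicit list. The sketch below rebuilds the same parser inside a function; the name build_parser, the prog value, and the file name some_file.bin are assumptions for illustration.

```python
from argparse import ArgumentParser


def build_parser():
    """Build the same parser as above so it can be exercised in isolation."""
    parser = ArgumentParser(prog='hasher.py')
    group = parser.add_mutually_exclusive_group(required=True)
    for name in ('md5', 'sha1', 'sha256', 'sha512'):
        group.add_argument(f'-{name}', action='store_true')
    parser.add_argument('file_path', type=str)
    parser.add_argument('-buffer_size', type=int, default=65560)
    return parser


parser = build_parser()

# Passing a list to parse_args stands in for the real command line.
args = parser.parse_args(['-sha256', 'some_file.bin'])
print(args.sha256, args.md5, args.file_path, args.buffer_size)
# True False some_file.bin 65560

# Two algorithm flags at once violate the group, so argparse prints an
# error to stderr and exits with status 2.
exit_code = None
try:
    parser.parse_args(['-md5', '-sha1', 'some_file.bin'])
except SystemExit as exc:
    exit_code = exc.code
print(exit_code)  # 2
```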

Now I will instantiate our hasher object passing in the file path and buffer size from our parsed arguments.

hasher = Hasher(file_path=args.file_path, buffer_size=args.buffer_size)

And to complete this I will make a chain of if/elif statements that checks which hashing algorithm was selected and prints out the hash for it.

if args.md5:
    print(f'MD5: {hasher.md5}')
elif args.sha1:
    print(f'SHA1: {hasher.sha1}')
elif args.sha256:
    print(f'SHA256: {hasher.sha256}')
elif args.sha512:
    print(f'SHA512: {hasher.sha512}')

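It is important to know that, because of the way we designed the hasher object, nothing is read until a property is accessed. The file is read and hashed each time one of the properties in the if statement is evaluated, so accessing several properties reads the file several times.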
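If a larger program needed several hashes of the same big file, one option would be to read the file once and update every hashlib object with each chunk. This is a sketch of that idea under stated assumptions, not part of the program in this post; the function name hash_many and the sample data are made up for illustration.

```python
import hashlib
import os
import tempfile


def hash_many(path, algo_names, buffer_size=65560):
    """Read the file once and return {algorithm name: hex digest}."""
    hashers = {name: hashlib.new(name) for name in algo_names}
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(buffer_size)
            if not chunk:
                break
            # Every algorithm sees the same chunk before the next read.
            for h in hashers.values():
                h.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}


payload = b'0123456789' * 5
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    sample_path = tmp.name

digests = hash_many(sample_path, ['md5', 'sha256'], buffer_size=8)
print(digests['md5'] == hashlib.md5(payload).hexdigest())        # True
print(digests['sha256'] == hashlib.sha256(payload).hexdigest())  # True

os.remove(sample_path)
```

The trade-off is a slightly larger interface in exchange for a single pass over the file.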

Running the Program

Alright, now that our program is complete let's run it against a file using a few of the algorithms to see it work.

$ python hasher.py /path/to/myfile.large -md5
MD5: 6a51177cf0178ee79f71ca39907231f4

$ python hasher.py /path/to/myfile.large -sha1
SHA1: 45e92be2fd2aae1b2400747c68f0a7fc5e155fef

$ python hasher.py /path/to/myfile.large -sha256
SHA256: 99fc3b48e2f41a27cc49c4185f55183f42684065530cebc25f2ba131008a847b

$ python hasher.py /path/to/myfile.large -sha512
SHA512: 11aed85f8a0e9ca67f7d7a7ce217b2cb589f728afa65d1ea4756bbaae1c1766dba83289490c233bfdc8aaf1f9004bb5f268faaaa5a08d282b9dabfc19b796e19

$

And if we forget to specify an algorithm option, argparse prints the usage line and an error message.

usage: hasher.py [-h] (-md5 | -sha1 | -sha256 | -sha512) [-buffer_size BUFFER_SIZE] file_path
hasher.py: error: one of the arguments -md5 -sha1 -sha256 -sha512 is required

Full Program

Below I have the entire program in one segment so you can copy and paste if you wish.

import hashlib


class Hasher:

    def __init__(self, file_path, buffer_size=65560):
        self.file_path = file_path
        self.buffer_size = buffer_size

    def _reader(self):
        """
        Reads a file in chunks as specified by self.buffer_size
        :return: generator yielding successive chunks of the file as bytes
        """
        with open(self.file_path, 'rb') as f:
            while True:
                data = f.read(self.buffer_size)
                if not data:
                    break
                yield data

    def _hasher(self, algo):
        """
        Uses the reader to read the file in chunks, adding each chunk to the hashlib object
        :param algo: hashlib object ex. hashlib.md5() or hashlib.sha1()
        :return: hex digest of the file as a string
        """
        for chunk in self._reader():
            algo.update(chunk)
        return algo.hexdigest()

    @property
    def md5(self):
        return self._hasher(hashlib.md5())

    @property
    def sha1(self):
        return self._hasher(hashlib.sha1())

    @property
    def sha256(self):
        return self._hasher(hashlib.sha256())

    @property
    def sha512(self):
        return self._hasher(hashlib.sha512())


if __name__ == '__main__':

    from argparse import ArgumentParser

    parser = ArgumentParser()
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument('-md5', action='store_true')
    group.add_argument('-sha1', action='store_true')
    group.add_argument('-sha256', action='store_true')
    group.add_argument('-sha512', action='store_true')
    parser.add_argument('file_path', type=str)
    parser.add_argument('-buffer_size', type=int, default=65560)
    args = parser.parse_args()

    hasher = Hasher(file_path=args.file_path, buffer_size=args.buffer_size)

    if args.md5:
        print(f'MD5: {hasher.md5}')
    elif args.sha1:
        print(f'SHA1: {hasher.sha1}')
    elif args.sha256:
        print(f'SHA256: {hasher.sha256}')
    elif args.sha512:
        print(f'SHA512: {hasher.sha512}')
