Today you're going to learn how to use Python programming in a way that can ultimately save a lot of space on your drive by removing all the duplicate files.

## Intro

In many situations you may find yourself with duplicate files on your disk, and tracking and checking them manually can be tedious.

Here's a solution: instead of combing through your disk to check for duplicates yourself, you can automate the process by writing a program that recursively walks through the disk and removes every duplicate it finds. That's what this article is about.

## But how do we do it?

If we were to read each whole file and then compare it against every other file recursively through the given directory, it would take a very long time. So how do we do it?

The answer is hashing: with hashing we can generate a fixed string of letters and numbers that acts as the identity of a given file, and if we find any other file with the same identity, we delete it.

There's a variety of hashing algorithms out there, such as md5, sha1, sha224, sha256, sha384 and sha512.

## Hashing in Python

Hashing in Python is pretty straightforward: we are going to use the hashlib module, which comes with the Python standard library.

Below is an example of how we hash a string in Python using the md5 algorithm.

Example of usage (hashlib):

```python
>>> import hashlib
>>> example_text = "Duplython is amazing".encode('utf-8')
>>> hashlib.md5(example_text).hexdigest()
'73a14f46eadcc04f4e04bec8eb66f2ab'
```

It's straightforward: you just import hashlib, use the md5 method to create a hash, and finally call hexdigest to render the hash as a string.

The above example shows how to hash a string, but the project we are about to build deals with files rather than strings, which raises another question: how do we hash files?

## Hashing a file

Hashing a file is similar to hashing a string, with one minor difference: we first need to open the file in binary mode and then generate a hash of the file's binary content.

Let's say you have a simple text document in your project directory named learn.txt. This is how we would do it:

```python
>>> import hashlib
>>> file = open('learn.txt', 'rb').read()
>>> hashlib.md5(file).hexdigest()
'0534cf6d5816c4f1ace48fff75f616c9'
```

As you can see above, the value of the generated hash doesn't change, even if you generate it a second time, as long as it's the same file.

The challenge arises when we try to read a very large file: loading the whole thing into memory takes a while. So instead of waiting for the entire file to reach memory, we can compute the hash as we read the file.

Computing the hash while reading the file requires us to read it in blocks of a given size and keep updating the hash with each block until the whole file has been hashed. Doing it this way saves a lot of the time we would otherwise spend waiting for the whole file to be ready.

Example of usage (hashlib):

```python
>>> import hashlib
>>> block_size = 1024
>>> hash = hashlib.md5()
>>> with open('learn.txt', 'rb') as file:
...     block = file.read(block_size)
...     while len(block) > 0:
...         hash.update(block)
...         block = file.read(block_size)
...     print(hash.hexdigest())
...
0534cf6d5816c4f1ace48fff75f616c9
```

As you can see, the hash has not changed; it's still the same. Therefore, we are ready to start building our Python tool to do the job.
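Before we do, here's a minimal sketch of how those block-computed hashes identify a duplicate in practice. The helper name `file_md5` and the file names `copy1.jpg` and `copy2.jpg` are hypothetical, purely for illustration:

```python
import hashlib


def file_md5(path: str, block_size: int = 1024) -> str:
    """Compute a file's md5 digest by reading it in blocks."""
    file_hash = hashlib.md5()
    with open(path, 'rb') as f:
        block = f.read(block_size)
        while len(block) > 0:
            file_hash.update(block)
            block = f.read(block_size)
    return file_hash.hexdigest()


# Hypothetical names: any two files you suspect hold the same content.
if file_md5('copy1.jpg') == file_md5('copy2.jpg'):
    print('Identical content: one of these is a duplicate')
```

If the two digests match, the contents are identical byte for byte (barring an astronomically unlikely collision), and that equality test is exactly what our tool will automate.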
But wait: hashing is just one step. We also need a way to actually remove the duplicates, and for that we'll use Python's built-in os module. Its remove() method deletes a file from the drive.

Let's try deleting learn.txt with the os module.

Example of usage (os module):

```python
>>> import os
>>> os.listdir()
['Desktop-File-cleaner', '.git', 'learn.txt', 'app.py', 'README.md']
>>> os.remove('learn.txt')
>>> os.listdir()
['Desktop-File-cleaner', '.git', 'app.py', 'README.md']
```

Well, that's simple: you just call remove() with the name of the file you want to remove, and it's done. Now let's go build our application.

## Building our cleaning tool

Importing the necessary libraries:

```python
import time
import os
from hashlib import sha256
```

I'm a huge fan of object-oriented programming, so in this article we'll build our tool as a single class. Below is the skeleton of our code:

```python
import time
import os
from hashlib import sha256


class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd()
        self.File_hashes = []
        self.Cleaned_dirs = []
        self.Total_bytes_saved = 0
        self.block_size = 65536
        self.count_cleaned = 0

    def welcome(self) -> None:
        print('******************************************************************')
        print('****************         DUPLYTHON         ****************************')
        print('********************************************************************\n\n')
        print('----------------         WELCOME           ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')

    def main(self) -> None:
        self.welcome()


if __name__ == '__main__':
    App = Duplython()
    App.main()
```

That's just the initial shell of our Python program; when we run it, it simply prints the welcome message to the screen:

```
$ python3 app.py
******************************************************************
****************         DUPLYTHON         ****************************
********************************************************************

----------------         WELCOME           ----------------------------

Cleaning .................
```

We now have to create a simple method that generates the hash of a file at a given path, using the hashing knowledge we covered above:

```python
import time
import os
from hashlib import sha256


class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd()
        self.File_hashes = []
        self.Cleaned_dirs = []
        self.Total_bytes_saved = 0
        self.block_size = 65536
        self.count_cleaned = 0

    def welcome(self) -> None:
        print('******************************************************************')
        print('****************         DUPLYTHON         ****************************')
        print('********************************************************************\n\n')
        print('----------------         WELCOME           ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')

    def generate_hash(self, Filename: str) -> str:
        # Hash the file in blocks so large files don't have to fit in memory
        Filehash = sha256()
        try:
            with open(Filename, 'rb') as File:
                fileblock = File.read(self.block_size)
                while len(fileblock) > 0:
                    Filehash.update(fileblock)
                    fileblock = File.read(self.block_size)
                Filehash = Filehash.hexdigest()
            return Filehash
        except OSError:
            # Skip files that can't be read instead of crashing the run
            return False

    def main(self) -> None:
        self.welcome()


if __name__ == '__main__':
    App = Duplython()
    App.main()
```
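Before wiring generate_hash into the cleaning logic, it's worth a quick sanity check. A minimal sketch, assuming the class above is saved as app.py and that original.txt and duplicate.txt are hypothetical copies of one another, each containing exactly the text `test`:

```python
>>> from app import Duplython
>>> cleaner = Duplython()
>>> cleaner.generate_hash('original.txt')
'9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08'
>>> cleaner.generate_hash('duplicate.txt')  # same content, same sha256 digest
'9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08'
```

Note that the `if __name__ == '__main__':` guard keeps the welcome banner from printing when the class is imported like this.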
## Implementing the program logic

Now that we have a method that generates a hash for a given file path, let's implement the part where we compare those hashes and remove any duplicate we find. I have made a simple method called clean() to do just that, as shown below:

```python
import time
import os
from hashlib import sha256


class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd()
        self.File_hashes = []
        self.Cleaned_dirs = []
        self.Total_bytes_saved = 0
        self.block_size = 65536
        self.count_cleaned = 0

    def welcome(self) -> None:
        print('******************************************************************')
        print('****************         DUPLYTHON         ****************************')
        print('********************************************************************\n\n')
        print('----------------         WELCOME           ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')

    def generate_hash(self, Filename: str) -> str:
        # Hash the file in blocks so large files don't have to fit in memory
        Filehash = sha256()
        try:
            with open(Filename, 'rb') as File:
                fileblock = File.read(self.block_size)
                while len(fileblock) > 0:
                    Filehash.update(fileblock)
                    fileblock = File.read(self.block_size)
                Filehash = Filehash.hexdigest()
            return Filehash
        except OSError:
            # Skip files that can't be read instead of crashing the run
            return False

    def clean(self) -> None:
        all_dirs = [path[0] for path in os.walk('.')]
        for path in all_dirs:
            os.chdir(path)
            All_Files = [file for file in os.listdir() if os.path.isfile(file)]
            for file in All_Files:
                filehash = self.generate_hash(file)
                if filehash not in self.File_hashes:
                    if filehash:
                        self.File_hashes.append(filehash)
                        # print(file)
                else:
                    # This hash was seen before: the file is a duplicate
                    byte_saved = os.path.getsize(file)
                    self.count_cleaned += 1
                    self.Total_bytes_saved += byte_saved
                    os.remove(file)
                    filename = file.split('/')[-1]
                    print(filename, '.. cleaned ')
            os.chdir(self.home_dir)

    def main(self) -> None:
        self.welcome()
        self.clean()


if __name__ == '__main__':
    App = Duplython()
    App.main()
```

Now that our program is nearly complete, we have to add a simple method that prints a summary of the cleaning process to the screen. I have implemented cleaning_summary() to do just that: it prints the number of files removed and the total space saved, which completes our Python tool, as shown below:

```python
import time
import os
from hashlib import sha256


class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd()
        self.File_hashes = []
        self.Cleaned_dirs = []
        self.Total_bytes_saved = 0
        self.block_size = 65536
        self.count_cleaned = 0

    def welcome(self) -> None:
        print('******************************************************************')
        print('****************         DUPLYTHON         ****************************')
        print('********************************************************************\n\n')
        print('----------------         WELCOME           ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')

    def generate_hash(self, Filename: str) -> str:
        # Hash the file in blocks so large files don't have to fit in memory
        Filehash = sha256()
        try:
            with open(Filename, 'rb') as File:
                fileblock = File.read(self.block_size)
                while len(fileblock) > 0:
                    Filehash.update(fileblock)
                    fileblock = File.read(self.block_size)
                Filehash = Filehash.hexdigest()
            return Filehash
        except OSError:
            # Skip files that can't be read instead of crashing the run
            return False

    def clean(self) -> None:
        all_dirs = [path[0] for path in os.walk('.')]
        for path in all_dirs:
            os.chdir(path)
            All_Files = [file for file in os.listdir() if os.path.isfile(file)]
            for file in All_Files:
                filehash = self.generate_hash(file)
                if filehash not in self.File_hashes:
                    if filehash:
                        self.File_hashes.append(filehash)
                        # print(file)
                else:
                    # This hash was seen before: the file is a duplicate
                    byte_saved = os.path.getsize(file)
                    self.count_cleaned += 1
                    self.Total_bytes_saved += byte_saved
                    os.remove(file)
                    filename = file.split('/')[-1]
                    print(filename, '.. cleaned ')
            os.chdir(self.home_dir)

    def cleaning_summary(self) -> None:
        mb_saved = self.Total_bytes_saved / 1048576
        mb_saved = round(mb_saved, 2)
        print('\n\n--------------FINISHED CLEANING ------------')
        print('Files cleaned : ', self.count_cleaned)
        print('Total Space saved : ', mb_saved, 'MB')
        print('-----------------------------------------------')

    def main(self) -> None:
        self.welcome()
        self.clean()
        self.cleaning_summary()


if __name__ == '__main__':
    App = Duplython()
    App.main()
```
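One detail in clean() worth unpacking is the comprehension `[path[0] for path in os.walk('.')]`. os.walk yields a `(dirpath, dirnames, filenames)` tuple for every directory it visits, so `path[0]` is each directory's path. A quick illustration, assuming a hypothetical layout with a photos folder containing a 2020 subfolder:

```python
>>> import os
>>> # Hypothetical layout: ./photos and ./photos/2020 under the current directory
>>> [path[0] for path in os.walk('.')]
['.', './photos', './photos/2020']
```

Because every yielded path is relative to the starting directory, clean() changes back to self.home_dir after finishing each directory so the next relative path still resolves correctly.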
Our app is now complete. To run it, navigate to the folder you want to clean and launch it from there; it will iterate recursively over that folder, find all the files, and remove the duplicated ones.

Example output:

```
$ python3 app.py
******************************************************************
****************         DUPLYTHON         ****************************
********************************************************************

----------------         WELCOME           ----------------------------

Cleaning .................
0(copy).jpeg .. cleaned
0 (1)(copy).jpeg .. cleaned
0 (2)(copy).jpeg .. cleaned


--------------FINISHED CLEANING ------------
Files cleaned :  3
Total Space saved :  0.38 MB
-----------------------------------------------
```

I hope you found this post interesting. Now it's time to share it with your friends on Twitter and in other dev communities.

This article is also published on kalebujordan.com.