Solution: Find Duplicate File in System

Let’s solve the Find Duplicate File in System problem using the Hash Map pattern.

Statement

Given a list of directory information named paths, where each directory contains file paths and their respective contents, identify all the duplicate files in the system based on their contents. Each duplicate group should contain at least two files with identical content, and the output should list these groups with the full paths of the duplicate files. Each entry in the input list is formatted as follows:

"root_directory/dir_1/dir_2/…/dir_m file_1.txt(file_1_content) file_2.txt(file_2_content) … file_n.txt(file_n_content)"

This indicates there are n files (file_1.txt, file_2.txt, …, file_n.txt) with respective contents (file_1_content, file_2_content, …, file_n_content) in the directory "root_directory/dir_1/dir_2/…/dir_m". The output should be a list of groups containing the paths of files sharing the same content. Each file path should follow the format given below:

"directory_path/file_name.txt".

The order of the groups or the paths within them does not matter.

Constraints:

  • 11 \leq paths.length \leq 2×1042 \times 10^4

  • 11 \leq paths[i].length \leq 30003000

  • 11 \leq sum(paths[i].length) \leq 5×1055 \times 10^5

  • paths[i] consist of English letters, digits, '/''.''('')', and ' '.

  • You may assume that no files or directories share the same name in the same directory.

  • You may assume that each given directory info represents a unique directory. A single blank space separates the directory path and file info.

Solution

To find the duplicate files in the system, we will start by parsing each directory path to split the directory, its files, and the contents. Let’s say we have the following list of files along the directory as an input:

  • root/x 1.txt(abc) 2.txt(def)

We will first split the input string by spaces; doing so, we will get the following:

  • root/x

  • 1.txt(abc)

  • 2.txt(def)

We have separated the directory path from its files and contents at this stage. Next, we will further split the file names from their respective contents. We can use the delimiter "(" to achieve this. For example, splitting a file by "(" will result in the following:

  • 1.txt

  • abc)

The only task remaining in the parsing process is to remove the closing bracket from the end of the file contents, such as "abc)" in this case. We can simply replace the ")" with an empty string to make it "abc".  After fully parsing these two file names and their contents, they will look as follows:

  • 1

    • 1.txt

    • abc

  • 2

    • 2.txt

    • def

To check the duplicate files, we will store every file, its content, and its directory path in the hash map. The key in the hash map will be the file’s contents, and the value will be a list containing the complete path of every file with that content. For every file content, we will check if the hash map already has a key with that content. If a key exists, we will put the file name along with its directory path in the list associated with that key. If the key does not exist, we will create a new entry in the hash map with the file content as the key. The value will be a list containing the file name and its directory paths.

Once the paths list has been fully traversed, the hash map will be populated with the required information. If the list size for any hash map entry is greater than 11, it indicates duplicate files with the same content. Finally, we will create a list of lists called result and include every list from the hash map larger than 11. We will then return the result.

The algorithm to solve this problem is as follows:

  • Initialize a hash map with the key as a string and value as a vector of string.

  • Iterate over each path. Split every path by space. The first part is the directory, and the rest contains the file name and its contents.

    • For each file, split the file by opening parenthesis ( to separate the file name and its content. Remove the closing parenthesis ) from the file content.

      • Store the file path in the hash map using the file content as the key and the corresponding list, which contains the directory path and the file name, as a value.

  • Initialize a list result to store the list of duplicate files.

  • Loop through the keys of the hash map. If the list of the paths for a particular content (key of the hash map) has more than one file (indicates duplicate file), add this list to the result.

  • Return the result.

Let’s look at the following illustration to get a better understanding of the solution:

Level up your interview prep. Join Educative to access 70+ hands-on prep courses.