DIY: Find Duplicate Files in System

Solve the interview question "Find Duplicate Files in a System" in this lesson.

Problem statement

Suppose you are given a list, paths, of directory information, including the directory path and all the files with contents in this directory. Your task is to return a list where each element is the path of all the files in the file system that have the same content.

Note: You can return the final answer in any order. At least two files that have the same content are considered to be a group of duplicate files.

A single directory information string in paths has the following format:

  • "root/dir1/dir2/.../dirm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)"

The above entry represents n files (f1.txt, f2.txt ... fn.txt) with content (f1_content, f2_content ... fn_content) respectively in the directory root/dir1/dir2/.../dirm. Note that n >= 1 and m >= 0. If m = 0, it means the directory is just the root directory.

The output is a list of groups of duplicate file paths. Each group contains all the file paths of the files that have the same content. A file path is a string that has the following format:

  • directory_path/file_name.txt

Constraints

  • paths[i] consist of English letters, digits, '/', '.', '(', ')', and ' '.
  • We can assume that no files or directories share the same name in the same directory.
  • You can assume each given directory information represents a unique directory. A single blank space separates the directory path and file information.

Input

The input will be an array of integers and two integers. The following are examples of input to the function:

# Sample Input 1
paths = ["root/a 4.txt(xyz) 1.txt(algorithms)","root/c 3.txt(educative)","root/c/d 2.txt(algorithms)","root 4.txt(educative) 5.txt(abcd)"]

# Sample Input 2
paths = ["root 1.txt(abcd) 2.txt(algo)","root/a 2.txt(abcd)","root/c/d 4.txt(algo)"]

# Sample Input 3
paths = ["root 1.txt(abcd) 2.txt(algo)","root/a 2.txt(xyzc)","root/c/d 4.txt(educative)"]

Output

The output will be a list of integers containing k closest integers to x. The following are examples of the outputs:

# Sample Output 1
[["root/a/1.txt","root/c/d/2.txt"],["root/c/3.txt","root/4.txt"]]

# Sample Output 2
[["root/2.txt","root/c/d/4.txt"],["root/1.txt","root/a/2.txt"]]

# Sample Output 3
[]

Level up your interview prep. Join Educative to access 70+ hands-on prep courses.