Feature #1: Possible Matches

Implementing the "Possible Matches" feature for our "Plagiarism Checker" project.

Description

We are given a set of documents, each submitted by a different individual. We suspect that some individuals copied from others and then inserted dummy statements into their documents to avoid detection. Given a plagiarized document, we want to count the documents with which it has a potential match.

We have converted each document into a string of tokens based on its content. Because students may have added dummy statements between the copied content, we have to compare the tokens of two students' documents while allowing for dummy tokens that do not match. A potential match occurs if one string of tokens is a subsequence of another, that is, if its tokens all appear in the other string in the same order, though not necessarily next to each other. Not every match is guaranteed to be plagiarized content, so in this scenario we’ll discard matches whose token strings have a length of less than two.
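The subsequence test described above can be sketched as a single pass with two positions. This is a minimal illustration, not the course's reference solution: the helper name is_subsequence is hypothetical, and it assumes each character of a string stands for one token.

```python
def is_subsequence(candidate: str, text: str) -> bool:
    """Return True if the tokens of `candidate` appear in `text` in the same order.

    Assumption: each character represents one token.
    """
    matched = 0  # number of candidate tokens matched so far
    for token in text:
        # Advance only when the current token is the next one we are looking for;
        # any other token in `text` is treated as a dummy and skipped.
        if matched < len(candidate) and token == candidate[matched]:
            matched += 1
    return matched == len(candidate)


# "xyz" is still found inside "xaybz" even though dummy tokens "a" and "b" were inserted.
print(is_subsequence("xyz", "xaybz"))  # True
print(is_subsequence("xzy", "xaybz"))  # False: the order is not preserved
```

Skipping non-matching tokens is what makes the check tolerant of inserted dummy statements, while the fixed left-to-right order keeps unrelated reorderings from counting as matches.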

We’ll be provided with a string, plagiarized, and a list, students. The plagiarized string contains the tokens against which we’ll match each sample in the students list. We have to return the number of students in the class from whom the plagiarized content may have been copied.
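One way to put the task together is sketched below. It rests on assumptions that the description does not spell out: each character is one token, every entry in students is tested as a subsequence of plagiarized (since the plagiarized document is the one that may contain extra dummy tokens), and entries shorter than two tokens are skipped. The function name count_possible_matches is hypothetical.

```python
from typing import List


def count_possible_matches(plagiarized: str, students: List[str]) -> int:
    """Count the entries of `students` that are subsequences of `plagiarized`.

    Assumptions: one character per token; entries with fewer than two
    tokens are discarded before matching.
    """
    count = 0
    for sample in students:
        if len(sample) < 2:
            continue  # too short to count as a meaningful match
        remaining = iter(plagiarized)
        # `token in remaining` consumes the iterator up to the first hit,
        # so each token must be found after the previous one.
        if all(token in remaining for token in sample):
            count += 1
    return count


print(count_possible_matches("abcde", ["a", "bb", "acd", "ace"]))  # 2
```

In this example, "a" is discarded for being too short, "bb" fails because plagiarized contains only one "b", and "acd" and "ace" both match. This straightforward version rescans plagiarized once per student, which is fine for small inputs.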
