Feature #6: Most Common Token

Implementing the "Most Common Token" feature for our "Language Compiler" project.

Description

For this feature of the language compiler, the program statements are given to us as a string. We want to determine the variable or function that is most commonly referred to in a program. In this process, we want to ignore the language keywords. For example, a keyword may be used more frequently than any variable or function. The language keywords are given as an array of strings.

Let’s say you are given the following program as input:

Press + to interact
int main() {
int value = getValue();
int sum = value + getRandom();
int subs = value - getRandom();
return 0;
}

The list of keywords given to you is ["int", "main", "return"]. In this example, your function will return "value". Note that, your functions should ignore syntax, such as parentheses, operators, semicolons, etc.

Solution

We can solve this problem by normalizing the code string and processing it step by step. The complete algorithm is as follows:

  • First, we replace all the syntax, including parentheses, operators, semicolons, etc., with spaces. Now, the string only contains alphanumeric tokens.

  • Then, we split the code obtained from the previous step into tokens.

  • We iterate through the tokens to count the appearance of each unique token as keys, excluding the keywords.

  • We create a HashMap count, which has tokens as keys and occurrences as values.

  • In the end, we check the tokens in count to find the token with the highest frequency.

std::string mostCommonToken(std::string& code, std::vector<std::string>& keywords){
// Replacing the syntax with spaces
std::string normalizedCode;
// Convert all the characters other than Alphanumeric into spaces
// and insert them into normalizedCode variable
std::regex_replace(std::back_inserter(normalizedCode), code.begin(), code.end(), std::regex("[^a-zA-Z0-9]"), " ");
// Split based on the spaces
std::vector<std::string> tokens;
std::string delim = "\\s+";
split(normalizedCode, delim, tokens);
std::unordered_map<std::string, int> count;
std::unordered_set<std::string> bannedWords(keywords.begin(), keywords.end());
// Count occurence of each token, excluding the keywords
for(auto token : tokens){
if (bannedWords.find(token) == bannedWords.end()){
if(count.find(token) == count.end()){
count[token] = 0;
}
count[token] += 1;
}
}
// return the maximum valued key
// Find the maximum from the map using the lambda function
std::pair<std::string, int> pair = *std::max_element(count.begin(), count.end(), [] (const std::pair<std::string, int>&p1, const std::pair<std::string, int>&p2){
return p1.second < p2.second;
});
return pair.first;
}
int main() {
// Driver code
std::string code = "int main() {\n"
"int value = getValue();\n"
"int sum = value + getRandom();\n"
"int subs = value - getRandom();\n"
"return 0;\n"
"}";
std::vector<std::string> keywords = {"int", "main", "return"};
std::cout << mostCommonToken(code, keywords) << std::endl;
}
Most common token

Complexity measures

Time complexity Space complexity
...