...

/

Text Segmentation (Interpunctio Verborum)

Text Segmentation (Interpunctio Verborum)

Learn about the text segmentation problem and its solution using backtracking.

Problem statement

Suppose we are given a string of letters representing text in some foreign language but without any spaces or punctuation, and we want to break this string into its individual constituent words. For example, we might be given the following passage from Cicero’s famous oration in defense of Lucius Licinius Murena in 62 BCE, in the standard scriptio continua of classical Latin:

Press + to interact

Segmenting sequences of characters into words

A fluent Latin reader would parse this string (in modern orthography) as Primus dignitas in tam tenui scientia non potest esse; res enim sunt parvae, prope in singulis litteris atque interpunctionibus verborum occupatae. Text segmentation is not only a problem in classical Latin and Greek but in several modern languages and scripts, including Balinese, Burmese, Chinese, Japanese, Javanese, Khmer, Lao, Thai, Tibetan, and Vietnamese. Similar problems arise in segmenting unpunctuated English text into sentences, segmenting text into lines for typesetting, speech and handwriting recognition, curve simplification, and several types of time-series analysis. For purposes of illustration, we’ll stick to segmenting sequences of letters in the modern English alphabet into modern English words.

Of course, some strings can be segmented in several different ways; for example, BOTHEARTHANDSATURNSPIN can be decomposed into English words as either BOTH·EARTH·AND·SATURN·SPIN or BOT·HEART·HANDS·AT·URNS·PIN , among many other possibilities. For now, let’s consider an extremely simple segmentation problem: Given a string of characters, can it be segmented into English words at all?

To make the problem concrete (and language-agnostic), let’s assume we have access to a subroutine IsWord(w)IsWord(w) that takes a string w as input and returns TrueTrue if ww is a “word” or FalseFalse if ww is not a “word.” For example, if we are trying to decompose the input string into palindromes, then a “word” is a synonym for “palindrome,” and therefore IsWord(ROTATOR) = True but IsWord(PALINDROME) = False.

Just like the subset sum problem, the input structure is a sequence, this time containing letters instead of numbers, so it is natural to consider a decision process that consumes the input characters in order from left to right. Similarly, the output structure is a sequence of words, so it is natural to consider a process that produces the output words in order from left to right. Thus, jumping into the middle of the segmentation process, we might imagine the following picture:

Press + to interact

Here the black bar separates our past decisions—splitting the first 17 letters into four words—from the portion of the input string that we have not yet processed.

The next stage in our imagined process is to decide where the next word in the output sequence ends. For this specific example, there are four possibilities for the next output word—HE, HEAR, HEART, and HEARTH. We have no idea which of these choices, if any, is consistent with a complete segmentation of the input string. We could be “smart” at this point and try to figure out which choices are good, but that would require thinking! Instead, let’s “stupidly” try every possibility by brute force and let the Recursion Fairy do all the real work.

  • First, tentatively accept HE as the next word, and let the Recursion Fairy make the rest of the decisions.
Press + to interact
  • Then tentatively accept HEAR as the next word, and let the Recursion Fairy make the rest of the decisions.
Press + to interact
  • Then tentatively accept HEART as the next word, and let the Recursion Fairy make the rest of the decisions.
Press + to interact
  • Finally, tentatively accept HEARTH as the next word, and let the Recursion Fairy makes the rest of the decisions.
Press + to interact

As long as the Recursion Fairy reports success at least once, we report success. On the other hand, if the Recursion Fairy never reports success—in particular, if the set of possible next words is empty—then we report failure.

None of our past decisions affect which choices are available now; all that matters is the suffix of characters that we have not yet processed. In particular, several different sequences of past decisions could lead us to the same suffix, but they all leave us with exactly the same set of choices for that suffix.

Press + to interact

Thus, we can simplify our picture of the recursive process by discarding everything left of the black bar:

Press + to interact

Recursive algorithm

We are now left with a simple and natural backtracking strategy: Select the first output word and recursively segment the rest of the input string.

To get a complete recursive algorithm, we need a base case. Our recursive strategy breaks down when we reach the end of the input string because there is no next word. Fortunately, the empty string has a unique segmentation into zero words. Putting all the pieces together, we arrive at the following simple recursive algorithm:

Algorithm


Splittable(A[1..n]):if n=0return Truefor i1 to nif IsWord(A[1..i])if Splittable(A[i+1..n])return Truereturn False\underline{Splittable(A[1 .. n]):} \\ \hspace{0.4cm} if\space n = 0 \\ \hspace{1cm} return\space \text{True} \\ \hspace{0.4cm} for\space i ← 1\space to\space n \\ \hspace{1cm} if\space IsWord(A[1 .. i]) \\ \hspace{1.7cm} if\space Splittable(A[i + 1 .. n]) \\ \hspace{2.3cm} return\space \text{True} \\ \hspace{0.4cm} return\space \text{False}

Implementation

Press + to interact
import java.util.HashSet;
class Main {
// Create a HashSet to store dictionary words
static HashSet<String> dictionary = new HashSet<String>();
// Check if a given string is a word in the dictionary
static boolean isWord(String str) {
return dictionary.contains(str);
}
// Recursively check if a string is splittable into words in the dictionary
static boolean splittable(String A) {
int n = A.length();
// Base case: an empty string is always splittable
if (n == 0) {
return true;
}
// Try splitting the string at every possible position and check if both parts are splittable
for (int i = 1; i <= n; i++) {
// Check if the first part of the string is a word in the dictionary
if (isWord(A.substring(0, i))) {
// If it is, recursively check if the second part of the string is also splittable
if (splittable(A.substring(i, n))) {
return true; // If both parts are splittable, return true
}
}
}
// If none of the possible splits result in two splittable strings, return false
return false;
}
public static void main(String[] args) {
// Add some words to the dictionary
dictionary.add("HE");
dictionary.add("ART");
dictionary.add("HAND");
dictionary.add("SATURN");
dictionary.add("SPIN");
// Check if the given string is splittable into words in the dictionary
String A = "HEARTHANDSATURNSPIN";
System.out.println(splittable(A));
}
}

Explanation:

  • Line 5: This line initializes a new static HashSet object named dictionary that can hold strings.
  • Line 8: In this line, isWord is a method in the Main class with a string parameter named str. This method returns a boolean value, indicating whether the given string is present in the dictionary set.
  • Line 13: The splittable method is a static method that takes a single String argument A and returns a
...