In SpanBERT, we mask a contiguous span of tokens in the sentence. Let $x_s$ and $x_e$ be the first and last masked tokens, at positions $s$ and $e$, respectively. We feed the tokens to SpanBERT, and it returns a representation for every token; the representation of token $i$ is denoted $R_i$. The representations of the two span-boundary tokens, the token just before the span and the token just after it, are denoted $R_{s-1}$ and $R_{e+1}$.
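To make the indexing concrete, here is a minimal sketch, assuming 0-based positions and an inclusive span `[s, e]`. It masks a contiguous span and returns the positions of the two boundary tokens, $s-1$ and $e+1$. The helper name `mask_span`, the example sentence, and the toy vectors in `R` are hypothetical illustrations; in SpanBERT the representations would come from the encoder, not from this stand-in list.

```python
from typing import List, Tuple

MASK = "[MASK]"

def mask_span(tokens: List[str], s: int, e: int) -> Tuple[List[str], int, int]:
    """Mask tokens s..e (inclusive, 0-based) and return the masked sequence
    plus the positions of the left (s - 1) and right (e + 1) boundary tokens."""
    # Require one unmasked token on each side so both boundary tokens exist.
    if not (0 < s <= e < len(tokens) - 1):
        raise ValueError("span needs a boundary token on each side")
    masked = tokens[:s] + [MASK] * (e - s + 1) + tokens[e + 1:]
    return masked, s - 1, e + 1

tokens = ["the", "cat", "sat", "on", "the", "mat"]
masked, left, right = mask_span(tokens, 2, 3)  # mask "sat on"
# masked -> ['the', 'cat', '[MASK]', '[MASK]', 'the', 'mat']
# left, right -> 1, 4

# Hypothetical stand-in for the encoder output: one vector R[i] per token.
R = [[float(i)] * 4 for i in range(len(tokens))]
R_left, R_right = R[left], R[right]  # play the roles of R_{s-1} and R_{e+1}
```

The boundary tokens stay unmasked, so their representations can still carry information about the surrounding context when the model predicts the tokens inside the span.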
The span boundary objective
Let's first look at the span boundary objective (SBO). To predict a masked token $x_i$, we use three values: the representations of the span-boundary tokens, $R_{s-1}$ and $R_{e+1}$, and the position embedding of the masked token, $p_{i-s+1}$, which encodes where $x_i$ sits relative to the start of the span. How exactly do we predict the masked token from these three values? First, we create a new representation, $z_i$, by applying a function $f(\cdot)$ to them, as shown: