DIY: UTF-8 Validation
Solve the interview question "UTF-8 Validation" in this lesson.
Problem description
Given an integer array data, return whether it is a valid UTF-8 encoding.
A character in UTF-8 can be from 1 to 4 bytes long, subject to the following rules:
- For a 1-byte character, the first bit of the byte is 0, followed by its Unicode code.
- For an n-byte character, the first n bits are all 1s, the (n + 1)th bit is 0, followed by n - 1 bytes, each with its most significant 2 bits being 10.
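The two rules above can be expressed as bit-mask tests on a single byte. The following is a minimal sketch, not a required part of the exercise; the helper names leadByteLength and isContinuation are hypothetical and chosen only for illustration.

```cpp
// Hypothetical helper: returns the total length in bytes (1 to 4) of the
// character that starts with `byte`, or 0 if `byte` cannot start a character.
int leadByteLength(int byte) {
    if ((byte & 0x80) == 0x00) return 1;  // 0xxxxxxx
    if ((byte & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((byte & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((byte & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // 10xxxxxx or 11111xxx
}

// Hypothetical helper: true if `byte` has the form 10xxxxxx, which is the
// pattern required for the n - 1 bytes that follow the first byte.
bool isContinuation(int byte) {
    return (byte & 0xC0) == 0x80;
}
```

Each mask keeps only the leading bits that the rule constrains, so the comparison ignores the x bits that carry the character's code.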
This is how the UTF-8 encoding represents characters in specific ranges:
Char. number range (hexadecimal) | UTF-8 octet sequence (binary) |
---|---|
0000 0000 - 0000 007F | 0xxxxxxx |
0000 0080 - 0000 07FF | 110xxxxx 10xxxxxx |
0000 0800 - 0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
0001 0000 - 0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
Note: The input is an array of integers. Only the least significant 8 bits of each integer are used to store the data. This means each integer represents only 1 byte of data.
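As a small illustration of the note above, one way to keep only the least significant 8 bits of an integer is to mask it with 0xFF. The helper name lowByte is hypothetical.

```cpp
// Hypothetical helper: returns the least significant 8 bits of `value`
// as a number in the range 0..255. Converting to unsigned first keeps the
// masking well defined for negative inputs too.
int lowByte(int value) {
    return static_cast<int>(static_cast<unsigned>(value) & 0xFFu);
}
```

For instance, lowByte(454) yields 198, because 454 is 256 + 198 and the bit worth 256 lies outside the low 8 bits.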
Input
The input will be a vector of integers data. The following two are example inputs to the function:
// Example - 1
data = [198, 150, 9, 8]
// Example - 2
data = [255, 129, 129, 129, 129, 129, 129, 129]
Output
For the above inputs, the outputs will be:
// Example - 1
true
// Example - 2
false
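To see why the two examples produce these results, it helps to look at each value in binary. The sketch below uses a hypothetical helper, toBits, built on the standard std::bitset class.

```cpp
#include <bitset>
#include <string>

// Hypothetical helper: the low 8 bits of `value` as a string of '0' and '1',
// most significant bit first.
std::string toBits(int value) {
    return std::bitset<8>(static_cast<unsigned long long>(value)).to_string();
}
```

In Example 1, 198 is 11000110 (the first byte of a 2-byte character), 150 is 10010110 (a byte starting with 10), and 9 and 8 are 00001001 and 00001000 (two 1-byte characters), so the whole array is valid. In Example 2, 255 is 11111111, which matches none of the first-byte patterns in the table, so the array is invalid.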
Coding exercise
For this coding exercise, you have to implement the validUtf8(data) function, where data represents a vector of integers. The function will return true or false depending on whether the given vector data is a valid UTF-8 encoding.
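If you want something to compare your own attempt against, here is one possible approach, offered as a sketch rather than a reference solution. It assumes the signature takes a const std::vector<int>& (the lesson only says "a vector of integers"), and it checks only the bit-pattern rules listed in the problem description; it does not check that the decoded character falls inside the ranges in the table.

```cpp
#include <vector>

// One possible single-pass approach (a sketch, not the official solution).
bool validUtf8(const std::vector<int>& data) {
    int pending = 0;  // bytes of the form 10xxxxxx still expected
    for (int value : data) {
        int b = value & 0xFF;  // only the least significant 8 bits are data
        if (pending > 0) {
            if ((b & 0xC0) != 0x80) return false;  // must be 10xxxxxx
            --pending;
        } else if ((b & 0x80) == 0x00) {
            pending = 0;  // 0xxxxxxx: a complete 1-byte character
        } else if ((b & 0xE0) == 0xC0) {
            pending = 1;  // 110xxxxx: one more byte expected
        } else if ((b & 0xF0) == 0xE0) {
            pending = 2;  // 1110xxxx: two more bytes expected
        } else if ((b & 0xF8) == 0xF0) {
            pending = 3;  // 11110xxx: three more bytes expected
        } else {
            return false;  // 10xxxxxx or 11111xxx cannot start a character
        }
    }
    return pending == 0;  // the last character must not be cut short
}
```

The counter means each byte is inspected once and no extra storage is needed. The final check rejects inputs such as [198], where a 2-byte character begins but its second byte is missing.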