DIY: UTF-8 Validation
Solve the interview question "UTF-8 Validation" in this lesson.
Problem description
Given an integer array `data`, return whether it is a valid UTF-8 encoding.
A character in UTF-8 can be from `1` to `4` bytes long, subject to the following rules:
- For a `1`-byte character, the first bit is `0`, followed by its Unicode code.
- For an `n`-byte character, the first `n` bits are all `1`s and bit `n + 1` is `0`, followed by `n - 1` bytes whose most significant `2` bits are `10`.
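The two rules above can be turned into a small classifier that looks only at the leading bits of one byte. The following is a minimal sketch, not a full validator; the helper names `leading_ones` and `classify` and the returned labels are hypothetical and chosen only for illustration.

```python
def leading_ones(byte: int) -> int:
    """Count consecutive 1 bits, starting at the most significant of 8 bits."""
    count = 0
    mask = 0b10000000
    while mask and byte & mask:
        count += 1
        mask >>= 1
    return count


def classify(byte: int) -> str:
    """Label a byte by the role its leading bits allow it to play."""
    n = leading_ones(byte)
    if n == 0:
        return "single"        # 0xxxxxxx: a complete 1-byte character
    if n == 1:
        return "continuation"  # 10xxxxxx: may only follow a lead byte
    if n <= 4:
        return f"lead-{n}"     # 110xxxxx, 1110xxxx, 11110xxx
    return "invalid"           # five or more leading 1s match no rule


print(classify(198))  # lead-2
print(classify(150))  # continuation
print(classify(9))    # single
print(classify(255))  # invalid
```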
This is how the UTF-8 encoding represents characters in specific ranges:
Char. number range (hexadecimal) | UTF-8 octet sequence (binary) |
---|---|
0000 0000 - 0000 007F | 0xxxxxxx |
0000 0080 - 0000 07FF | 110xxxxx 10xxxxxx |
0000 0800 - 0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
0001 0000 - 0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
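To see the table's patterns on real characters, one option is to let Python's built-in `str.encode` produce the bytes and print them in binary. This is only an illustration; the helper name `utf8_bits` and the four sample code points (one per row of the table) are assumptions made for the example.

```python
def utf8_bits(code_point: int) -> str:
    """Return the UTF-8 encoding of a code point as space-separated binary octets."""
    encoded = chr(code_point).encode("utf-8")
    return " ".join(f"{byte:08b}" for byte in encoded)


# One sample code point from each range in the table.
for cp in (0x41, 0x196, 0x20AC, 0x1F600):
    print(f"U+{cp:04X} -> {utf8_bits(cp)}")

# U+0041 -> 01000001
# U+0196 -> 11000110 10010110
# U+20AC -> 11100010 10000010 10101100
# U+1F600 -> 11110000 10011111 10011000 10000000
```

Note that U+0196 encodes to the bytes 198 and 150, the first two values of Example 1 below.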
Note: The input is an array of integers. Only the least significant 8 bits of each integer are used to store the data. This means each integer represents only 1 byte of data.
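A simple way to honor this note is to mask each integer with `0xFF` before inspecting it. The sketch below assumes a hypothetical helper named `low_byte`; the value 454 is an invented example of an integer with bits set above the lowest 8.

```python
def low_byte(value: int) -> int:
    """Keep only the least significant 8 bits of an integer."""
    return value & 0xFF


print(low_byte(198))  # 198: already fits in one byte
print(low_byte(454))  # 198: 454 = 256 + 198, so the extra bit is dropped
```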
Input
The input will be a vector of integers, `data`. The following two are example inputs to the function:
// Example - 1
data = [198, 150, 9, 8]
// Example - 2
data = [255, 129, 129, 129, 129, 129, 129, 129]
Output
For the above inputs, the outputs will be:
// Example - 1
true
// Example - 2
false
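Writing the example inputs in binary makes these results easier to follow. The snippet below is a small illustration; the helper name `to_bits` is hypothetical.

```python
from typing import List


def to_bits(data: List[int]) -> List[str]:
    """Render the low 8 bits of each integer as an 8-character binary string."""
    return [format(value & 0xFF, "08b") for value in data]


print(to_bits([198, 150, 9, 8]))
# ['11000110', '10010110', '00001001', '00001000']
print(to_bits([255, 129, 129, 129, 129, 129, 129, 129])[:2])
# ['11111111', '10000001']
```

In Example 1, `11000110` starts with `110`, so it opens a 2-byte character and is correctly followed by `10010110`, which starts with `10`. The remaining two bytes start with `0` and are 1-byte characters. In Example 2, the first byte `11111111` matches none of the allowed leading patterns, so the sequence is invalid.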
Coding exercise
For this coding exercise, you have to implement the `valid_utf8(data)` function, where `data` represents a vector of integers. The function will return `true` or `false` depending on whether the given vector `data` is a valid UTF-8 encoding.
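If you want something to compare your attempt against, here is one possible sketch in Python. It is not presented as the lesson's reference solution. It checks only the bit-pattern rules stated in the problem description and, as an assumption, does not reject overlong encodings or code points above `0010 FFFF`. The variable name `remaining` is illustrative.

```python
from typing import List


def valid_utf8(data: List[int]) -> bool:
    """Check that data follows the lead/continuation byte rules of UTF-8."""
    remaining = 0  # continuation bytes still expected for the current character
    for value in data:
        byte = value & 0xFF  # only the least significant 8 bits carry data
        if remaining:
            if byte >> 6 != 0b10:  # continuation bytes must look like 10xxxxxx
                return False
            remaining -= 1
        elif byte >> 7 == 0:       # 0xxxxxxx: 1-byte character
            continue
        elif byte >> 5 == 0b110:   # 110xxxxx: 2-byte character
            remaining = 1
        elif byte >> 4 == 0b1110:  # 1110xxxx: 3-byte character
            remaining = 2
        elif byte >> 3 == 0b11110: # 11110xxx: 4-byte character
            remaining = 3
        else:                      # stray continuation byte or 5+ leading 1s
            return False
    return remaining == 0          # a truncated final character is invalid


print(valid_utf8([198, 150, 9, 8]))                            # True
print(valid_utf8([255, 129, 129, 129, 129, 129, 129, 129]))    # False
```

The sketch makes a single pass over the input and keeps one counter, so it runs in linear time with constant extra space.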