UTF-8 Format Basics
This article is part of a series, Decoding UTF-8.
I am currently working on several UTF-8 features for YSH (these will also benefit OSH.) Unfortunately, I cannot find any concise and correct sources on how to decode UTF-8. As great as the encoding is, there are a few foot-guns around validation. I've noted them all in this document.
For a quick history on UTF-8 and why it's important, see Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".
This document aims to cover just the rules around decoding (and encoding) UTF-8 string. I have also included an interactive decoder at the end so you can play around with a decoder to better understand the rules.
Decoding UTF-8
Without further ado, how do I decode UTF-8? Luckily, the Unicode Consortium has
written a helpful 24
chapter technical document novel on every miniscule
detail of the Unicode Standard.
Now, you could read through most of it, but I'll save you the trouble. Unicode defines 5 important entities:
- Unicode Code point — This is a 21-bit number assigned to a character
in the Unicode Database. Not all 21-bit numbers are valid code points; they
must be no larger than
U+10FFFF
(theU+XXXX
notation refers to the hexadecimal number0xXXXX
referring to a Unicode code point.) - Surrogate — A code point in the range
U+D800
andU+DFFF
. These are used for encoding in UTF-16. - Unicode Scalar Value — A code point which is not a surrogate value.
- Code unit — For UTF-8, this is a single byte in a UTF-8 stream.
- Code unit sequence — For UTF-8, this is a sequence of bytes / code units.
UTF-8 allows us to encode a sequence of Unicode scalar values (text) into an equivalent code unit sequence, and vice versa. (These terms become useful when comparing to other encodings like UTF-16. UTF-16 chooses to encode a sequence of Unicode scalar values using a sequence of 16-bit code units.)
In UTF-8, each Unicode scalar value can be encoded in 1, 2, 3 or 4 bytes. The first byte determines how long that scalar value will be. The next 1 to 3 bytes are called "continuation" bytes. The first few bits of each byte determine if this is a 1-, 2-, 3- or 4-byte starting byte or a continuation byte. The other bits contain data related to the encoded scalar value. Table 3-6 describes the exact encoding.
This encoding is not entirely arbitrary. It has a few important properties such as "self-synchronization" where it is possible to decode UTF-8 even in a truncated byte stream. UTF-8: Bits, Bytes, and Benefits lists more observations which make operating over UTF-8 strings convenient.
+----------------------------+----------+----------+----------+----------+ | Scalar Value | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte | +----------------------------+----------+----------+----------+----------+ | 00000000 0xxxxxxx | 0xxxxxxx | | | | | 00000yyy yyxxxxxx | 110yyyyy | 10xxxxxx | | | | zzzzyyyy yyxxxxxx | 1110zzzz | 10yyyyyy | 10xxxxxx | | | 000uuuuu zzzzyyyy yyxxxxxx | 11110uuu | 10uuzzzz | 10yyyyyy | 10xxxxxx | +----------------------------+----------+----------+----------+----------+
Table 3-6 from Unicode Standard 15.0.0 Ch3. UTF-8 bit distribution.
Many sources stop at Table 3-6, however it has three problems. First, we need
to make sure that we only accept the shortest encoding form of a UTF-8 code
unit sequence. For example, <0x41>
and <0xC1 0x81>
both resolve the Unicode
scalar value U+41
('A'). However, the second sequence is longer, and thus a
"non-shortest form" error. (I have also heard the term "overlong" for this
error.)
Why are "overlong" (or "non-shortest form") encodings bad? One reason is that enforcing a single encoding of each Unicode scalar value in UTF-8 means that substring search doesn't need to decode the code unit sequence. If overlong encoding was allowed, then searching for every instance of 'A' (
U+41
) would have to match both<0x41>
and<0xC1 0x81>
. However, this is only true of ASCII characters. You need normalization to have this in general. See Wikipedia: Unicode Equivalence.
The second problem is surrogates. Because UTF-16 cannot encode every 21-bit
Unicode scalar value in 16-bit code units, a special range of Unicode code
points were allocated to store "surrogates" which can be combined to store a
single 21-bit value in two 16-bit numbers. Any Unicode code point between
U+D800
and U+DFFF
is a surrogate code point. Even though UTF-8 can
unambiguously encode surrogates, it is an encoding error to do so.
Finally, the Unicode standard only defines code points between U+0
and
U+10FFFF
. However, UTF-8 can encode any 21 bit number. For example <0xF5 0x80 0x80 0x80>
would result in 0x140000
per Table 3-6. However, that is not
a Unicode scalar value, so it should result in a decoding error.
These three rules make writing a correct UTF-8 encoder and decoder a little tricky. Table 3-7 lists all well-formed UTF-8 byte sequences taking into account these restrictions.
+--------------------+------------+-------------+------------+-------------+ | Code Points | First Byte | Second Byte | Third Byte | Fourth Byte | +--------------------+------------+-------------+------------+-------------+ | U+0000..U+007F | 00..7F | | | | | U+0080..U+07FF | C2..DF | 80..BF | | | | U+0800..U+0FFF | E0 | A0..BF | 80..BF | | | U+1000..U+CFFF | E1..EC | 80..BF | 80..BF | | | U+D000..U+D7FF | ED | 80..9F | 80..BF | | | U+E000..U+FFFF | EE..EF | 80..BF | 80..BF | | | U+10000..U+3FFFF | F0 | 90..BF | 80..BF | 80..BF | | U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF | | U+100000..U+10FFFF | F4 | 80..8F | 80..BF | 80..BF | +--------------------+------------+-------------+------------+-------------+
Table 3-7 from Unicode Standard 15.0.0 Ch3. Well formed UTF-8 Byte Sequences.
The connection between Table 3-7 and the rules above is not obvious. It may help to observe that:
- For a 2 byte sequence, if the first byte equals
1100_000x
(x
means don't care) then we have an overlong. - For a 3 byte sequence:
- we have a surrogate pair if the first byte equals
1110_1101
and the second byte is101x_xxxx
. - we have an overlong if the first byte equals
1110_0000
and the second byte is100x_xxxx
.
- we have a surrogate pair if the first byte equals
- For a 4 byte sequence:
- we have an overlong if the first byte is
1111_0000
and the second byte is1000_xxxx
. - we end up with a number larger than
0x10FFFF
if the first byte is greater than or equal to1111_0101
.
- we have an overlong if the first byte is
Phew, that's it on the encoding. Here's what we've covered:
- A UTF-8 string is a sequence of bytes.
- In UTF-8, a single Unicode scalar value is encoded in 1 to 4 bytes.
- Table 3-6 describes how we pack a scalar value into a sequence of bytes.
- Surrogates and overlong / non-shortest form encodings are an error.
- Valid Unicode scalar values can be at most
U+10FFFF
. UTF-8 can technically pack larger values in the 4 bytes, but doing so is an encoding error. - Table 3-7 gives a list of the only valid UTF-8 code unit sequences which account for surrogates, overlong / non-shortest form and not-a-scalar-value encodings.