UTF-8 Format Basics

This article is part of a series, Decoding UTF-8.

I am currently working on several UTF-8 features for YSH (these will also benefit OSH.) Unfortunately, I cannot find any concise and correct sources on how to decode UTF-8. As great as the encoding is, there are a few foot-guns around validation. I've noted them all in this document.

For a quick history on UTF-8 and why it's important, see Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

This document aims to cover just the rules around decoding (and encoding) UTF-8 string. I have also included an interactive decoder at the end so you can play around with a decoder to better understand the rules.

Decoding UTF-8

Without further ado, how do I decode UTF-8? Luckily, the Unicode Consortium has written a helpful 24 chapter technical document novel on every miniscule detail of the Unicode Standard.

Now, you could read through most of it, but I'll save you the trouble. Unicode defines 5 important entities:

  1. Unicode Code point — This is a 21-bit number assigned to a character in the Unicode Database. Not all 21-bit numbers are valid code points; they must be no larger than U+10FFFF (the U+XXXX notation refers to the hexadecimal number 0xXXXX referring to a Unicode code point.)
  2. Surrogate — A code point in the range U+D800 and U+DFFF. These are used for encoding in UTF-16.
  3. Unicode Scalar Value — A code point which is not a surrogate value.
  4. Code unit — For UTF-8, this is a single byte in a UTF-8 stream.
  5. Code unit sequence — For UTF-8, this is a sequence of bytes / code units.

UTF-8 allows us to encode a sequence of Unicode scalar values (text) into an equivalent code unit sequence, and vice versa. (These terms become useful when comparing to other encodings like UTF-16. UTF-16 chooses to encode a sequence of Unicode scalar values using a sequence of 16-bit code units.)

In UTF-8, each Unicode scalar value can be encoded in 1, 2, 3 or 4 bytes. The first byte determines how long that scalar value will be. The next 1 to 3 bytes are called "continuation" bytes. The first few bits of each byte determine if this is a 1-, 2-, 3- or 4-byte starting byte or a continuation byte. The other bits contain data related to the encoded scalar value. Table 3-6 describes the exact encoding.

This encoding is not entirely arbitrary. It has a few important properties such as "self-synchronization" where it is possible to decode UTF-8 even in a truncated byte stream. UTF-8: Bits, Bytes, and Benefits lists more observations which make operating over UTF-8 strings convenient.

+----------------------------+----------+----------+----------+----------+
| Scalar Value               | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
+----------------------------+----------+----------+----------+----------+
| 00000000 0xxxxxxx          | 0xxxxxxx |          |          |          |
| 00000yyy yyxxxxxx          | 110yyyyy | 10xxxxxx |          |          |
| zzzzyyyy yyxxxxxx          | 1110zzzz | 10yyyyyy | 10xxxxxx |          |
| 000uuuuu zzzzyyyy yyxxxxxx | 11110uuu | 10uuzzzz | 10yyyyyy | 10xxxxxx |
+----------------------------+----------+----------+----------+----------+

Table 3-6 from Unicode Standard 15.0.0 Ch3. UTF-8 bit distribution.

Many sources stop at Table 3-6, however it has three problems. First, we need to make sure that we only accept the shortest encoding form of a UTF-8 code unit sequence. For example, <0x41> and <0xC1 0x81> both resolve the Unicode scalar value U+41 ('A'). However, the second sequence is longer, and thus a "non-shortest form" error. (I have also heard the term "overlong" for this error.)

Why are "overlong" (or "non-shortest form") encodings bad? One reason is that enforcing a single encoding of each Unicode scalar value in UTF-8 means that substring search doesn't need to decode the code unit sequence. If overlong encoding was allowed, then searching for every instance of 'A' (U+41) would have to match both <0x41> and <0xC1 0x81>. However, this is only true of ASCII characters. You need normalization to have this in general. See Wikipedia: Unicode Equivalence.

The second problem is surrogates. Because UTF-16 cannot encode every 21-bit Unicode scalar value in 16-bit code units, a special range of Unicode code points were allocated to store "surrogates" which can be combined to store a single 21-bit value in two 16-bit numbers. Any Unicode code point between U+D800 and U+DFFF is a surrogate code point. Even though UTF-8 can unambiguously encode surrogates, it is an encoding error to do so.

Finally, the Unicode standard only defines code points between U+0 and U+10FFFF. However, UTF-8 can encode any 21 bit number. For example <0xF5 0x80 0x80 0x80> would result in 0x140000 per Table 3-6. However, that is not a Unicode scalar value, so it should result in a decoding error.

These three rules make writing a correct UTF-8 encoder and decoder a little tricky. Table 3-7 lists all well-formed UTF-8 byte sequences taking into account these restrictions.

+--------------------+------------+-------------+------------+-------------+
| Code Points        | First Byte | Second Byte | Third Byte | Fourth Byte |
+--------------------+------------+-------------+------------+-------------+
| U+0000..U+007F     | 00..7F     |             |            |             |
| U+0080..U+07FF     | C2..DF     | 80..BF      |            |             |
| U+0800..U+0FFF     | E0         | A0..BF      | 80..BF     |             |
| U+1000..U+CFFF     | E1..EC     | 80..BF      | 80..BF     |             |
| U+D000..U+D7FF     | ED         | 80..9F      | 80..BF     |             |
| U+E000..U+FFFF     | EE..EF     | 80..BF      | 80..BF     |             |
| U+10000..U+3FFFF   | F0         | 90..BF      | 80..BF     | 80..BF      |
| U+40000..U+FFFFF   | F1..F3     | 80..BF      | 80..BF     | 80..BF      |
| U+100000..U+10FFFF | F4         | 80..8F      | 80..BF     | 80..BF      |
+--------------------+------------+-------------+------------+-------------+

Table 3-7 from Unicode Standard 15.0.0 Ch3. Well formed UTF-8 Byte Sequences.

The connection between Table 3-7 and the rules above is not obvious. It may help to observe that:

Phew, that's it on the encoding. Here's what we've covered: