All you need to know about UTF-8 decoding

This article has been broken up into a series, Decoding UTF-8.

I am currently working on several UTF-8 features for YSH (these will also benefit OSH.) Unfortunately, I cannot find any concise and correct sources on how to decode UTF-8. As great as the encoding is, there are a few foot-guns around validation. I've noted them all in this document.

For a quick history on UTF-8 and why it's important, see Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

This document aims to cover just the rules around decoding (and encoding) UTF-8 string. I have also included an interactive decoder at the end so you can play around with a decoder to better understand the rules.

Decoding UTF-8
Interactive Decoder Demo
- Decoder Demo Source

Decoding UTF-8

Without further ado, how do I decode UTF-8? Luckily, the Unicode Consortium has written a helpful 24 chapter technical ~~document~~ novel on every miniscule detail of the Unicode Standard.

Now, you could read through most of it, but I'll save you the trouble. Unicode defines 5 important entities:

Unicode Code point — This is a 21-bit number assigned to a character in the Unicode Database. Not all 21-bit numbers are valid code points; they must be no larger than U+10FFFF (the U+XXXX notation refers to the hexadecimal number 0xXXXX referring to a Unicode code point.)
Surrogate — A code point in the range U+D800 and U+DFFF. These are used for encoding in UTF-16.
Unicode Scalar Value — A code point which is not a surrogate value.
Code unit — For UTF-8, this is a single byte in a UTF-8 stream.
Code unit sequence — For UTF-8, this is a sequence of bytes / code units.

UTF-8 allows us to encode a sequence of Unicode scalar values (text) into an equivalent code unit sequence, and vice versa. (These terms become useful when comparing to other encodings like UTF-16. UTF-16 chooses to encode a sequence of Unicode scalar values using a sequence of 16-bit code units.)

In UTF-8, each Unicode scalar value can be encoded in 1, 2, 3 or 4 bytes. The first byte determines how long that scalar value will be. The next 1 to 3 bytes are called "continuation" bytes. The first few bits of each byte determine if this is a 1-, 2-, 3- or 4-byte starting byte or a continuation byte. The other bits contain data related to the encoded scalar value. Table 3-6 describes the exact encoding.

This encoding is not entirely arbitrary. It has a few important properties such as "self-synchronization" where it is possible to decode UTF-8 even in a truncated byte stream. UTF-8: Bits, Bytes, and Benefits lists more observations which make operating over UTF-8 strings convenient.

+----------------------------+----------+----------+----------+----------+
| Scalar Value               | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
+----------------------------+----------+----------+----------+----------+
| 00000000 0xxxxxxx          | 0xxxxxxx |          |          |          |
| 00000yyy yyxxxxxx          | 110yyyyy | 10xxxxxx |          |          |
| zzzzyyyy yyxxxxxx          | 1110zzzz | 10yyyyyy | 10xxxxxx |          |
| 000uuuuu zzzzyyyy yyxxxxxx | 11110uuu | 10uuzzzz | 10yyyyyy | 10xxxxxx |
+----------------------------+----------+----------+----------+----------+

Table 3-6 from Unicode Standard 15.0.0 Ch3. UTF-8 bit distribution.

Many sources stop at Table 3-6, however it has three problems. First, we need to make sure that we only accept the shortest encoding form of a UTF-8 code unit sequence. For example, <0x41> and <0xC1 0x81> both resolve the Unicode scalar value U+41 ('A'). However, the second sequence is longer, and thus a "non-shortest form" error. (I have also heard the term "overlong" for this error.)

Why are "overlong" (or "non-shortest form") encodings bad? One reason is that enforcing a single encoding of each Unicode scalar value in UTF-8 means that substring search doesn't need to decode the code unit sequence. If overlong encoding was allowed, then searching for every instance of 'A' (U+41) would have to match both <0x41> and <0xC1 0x81>. However, this is only true of ASCII characters. You need normalization to have this in general. See Wikipedia: Unicode Equivalence.

The second problem is surrogates. Because UTF-16 cannot encode every 21-bit Unicode scalar value in 16-bit code units, a special range of Unicode code points were allocated to store "surrogates" which can be combined to store a single 21-bit value in two 16-bit numbers. Any Unicode code point between U+D800 and U+DFFF is a surrogate code point. Even though UTF-8 can unambiguously encode surrogates, it is an encoding error to do so.

Finally, the Unicode standard only defines code points between U+0 and U+10FFFF. However, UTF-8 can encode any 21 bit number. For example <0xF5 0x80 0x80 0x80> would result in 0x140000 per Table 3-6. However, that is not a Unicode scalar value, so it should result in a decoding error.

These three rules make writing a correct UTF-8 encoder and decoder a little tricky. Table 3-7 lists all well-formed UTF-8 byte sequences taking into account these restrictions.

+--------------------+------------+-------------+------------+-------------+
| Code Points        | First Byte | Second Byte | Third Byte | Fourth Byte |
+--------------------+------------+-------------+------------+-------------+
| U+0000..U+007F     | 00..7F     |             |            |             |
| U+0080..U+07FF     | C2..DF     | 80..BF      |            |             |
| U+0800..U+0FFF     | E0         | A0..BF      | 80..BF     |             |
| U+1000..U+CFFF     | E1..EC     | 80..BF      | 80..BF     |             |
| U+D000..U+D7FF     | ED         | 80..9F      | 80..BF     |             |
| U+E000..U+FFFF     | EE..EF     | 80..BF      | 80..BF     |             |
| U+10000..U+3FFFF   | F0         | 90..BF      | 80..BF     | 80..BF      |
| U+40000..U+FFFFF   | F1..F3     | 80..BF      | 80..BF     | 80..BF      |
| U+100000..U+10FFFF | F4         | 80..8F      | 80..BF     | 80..BF      |
+--------------------+------------+-------------+------------+-------------+

Table 3-7 from Unicode Standard 15.0.0 Ch3. Well formed UTF-8 Byte Sequences.

The connection between Table 3-7 and the rules above is not obvious. It may help to observe that:

For a 2 byte sequence, if the first byte equals 1100_000x (x means don't care) then we have an overlong.
For a 3 byte sequence:
- we have a surrogate pair if the first byte equals 1110_1101 and the second byte is 101x_xxxx.
- we have an overlong if the first byte equals 1110_0000 and the second byte is 100x_xxxx.
For a 4 byte sequence:
- we have an overlong if the first byte is 1111_0000 and the second byte is 1000_xxxx.
- we end up with a number larger than 0x10FFFF if the first byte is greater than or equal to 1111_0101.

Phew, that's it on the encoding. Here's what we've covered:

A UTF-8 string is a sequence of bytes.
In UTF-8, a single Unicode scalar value is encoded in 1 to 4 bytes.
Table 3-6 describes how we pack a scalar value into a sequence of bytes.
Surrogates and overlong / non-shortest form encodings are an error.
Valid Unicode scalar values can be at most U+10FFFF. UTF-8 can technically pack larger values in the 4 bytes, but doing so is an encoding error.
Table 3-7 gives a list of the only valid UTF-8 code unit sequences which account for surrogates, overlong / non-shortest form and not-a-scalar-value encodings.

Interactive Decoder Demo

To help illustrate my explanation, I've written an interactive decoder below. First, enter in a series of bytes in hexadecimal. Then, press the "Decode" button. The resulting output (or error) will be displayed in the output box. For example, entering "F2 80 9F A2" will result in the byte sequence <0xF2 0x80 0x9F 0xA2>. When decoded, it should result in U+807E2.

Output:
Press "Decode" to view output

Some Example Inputs and Outputs

41 → U+41
ED 9F 80 → U+D7C0
F4 80 80 8F → U+10000F
ED BF 80 → UTF8 Decode Error: Decoded surrogate near byte 4
C1 81 → UTF8 Decode Error: Overlong near byte 3
41 42 80 → UTF8 Decode Error: Invalid code unit at byte 4

Decoder Demo Source

// Adapted from Douglas Crockford's C UTF-8 decoder [0].
// [0]: https://github.com/douglascrockford/JSON-c/blob/master/utf8_decode.c

/**
 * Create a UTF8Decoder over a Uint8Array (or other integer array like
 * object), `buf`. Use `next()` to decode each character.
 */
function UTF8Decoder(buf) {
  this.input = buf;
  this.index = 0;
}

/** [PRIVATE] Get the next byte in the sequence */
UTF8Decoder.prototype._get = function() {
  if (this.done) {
    throw new Error('UTF8 Decode Error: truncated byte stream');
  }
  var c = this.input[this.index] & 0xFF;
  this.index += 1;
  return c;
};

/** [PRIVATE] get() expecting that the next byte is a continuation byte */
UTF8Decoder.prototype._cont = function() {
  var c = this._get();
  if ((c & 0xC0) == 0x80)
    return (c & 0x3F);

  throw new Error(`UTF8 Decode Error: bad continuation byte near ${this.index + 1} '${c.toString(16)}'`);
};

/** [PUBLIC] Returns true if the decoder has consumed the entire string */
Object.defineProperty(UTF8Decoder.prototype, 'done', {
  get: function() {
    return this.index >= this.input.length;
  }
});

/** [PUBLIC] Extract the next Unicode scalar value or throw a decode error */
UTF8Decoder.prototype.next = function() {
  var c0 = this._get();

  // One byte case
  if ((c0 & 0x80) == 0) {
      return c0;
  }

  // Two byte case
  if ((c0 & 0xE0) == 0xC0) {
    var c1 = this._cont();
    var r = ((c0 & 0x1F) << 6) | c1;

    if (r < 0x80) {
      throw new Error(`UTF8 Decode Error: Overlong near byte ${this.index + 1}`);
    }

    return r;
  }

  // Three byte case
  if ((c0 & 0xF0) == 0xE0) {
    var c1 = this._cont();
    var c2 = this._cont();
    var r = ((c0 & 0x0F) << 12) | (c1 << 6) | c2;

    if (r < 0x800) {
      throw new Error(`UTF8 Decode Error: Overlong near byte ${this.index + 1}`);
    }
    if (r >= 0xD800 && r <= 0xDFFF) {
      // Only shortest-form 3 byte UTF-8 sequences can encode surrogates.
      throw new Error(`UTF8 Decode Error: Decoded surrogate near byte ${this.index + 1}`);
    }

    return r;
  }

  // Four byte case
  if ((c0 & 0xF8) == 0xF0) {
    var c1 = this._cont();
    var c2 = this._cont();
    var c3 = this._cont();
    var r = ((c0 & 0x07) << 18) | (c1 << 12) | (c2 << 6) | c3;

    if (r < 0x10000) {
      throw new Error(`UTF8 Decode Error: Overlong near byte ${this.index + 1}`);
    }
    if (r > 0x10FFFF) {
      // Unicode scalar values can be at most U+10FFFF
      throw new Error(`UTF8 Decode Error: Decoded too large of a codepoint at byte ${this.index + 1}`);
    }

    return r;
  }

  // A value like 0x80 is never valid in a UTF-8 string
  throw new Error(`UTF8 Decode Error: Invalid code unit at byte ${this.index + 1}`);
};

All you need to know about UTF-8 decoding

Table of Contents

Decoding UTF-8

Interactive Decoder Demo

Decoder Demo Source