UTF-8 Decoder Checklist
This article is part of a series, Decoding UTF-8.
Using the information in UTF-8 Format Basics, you should be able to write a correct decoder. However, it is very easy to miss small details. Aside from extracting codepoints from the multi-byte format, it is important to check that your decoder rejects:
- Overlong encodings -- codepoints encoded in more bytes than necessary
- Codepoints in the surrogate range (U+D800 to U+DFFF, inclusive)
- Codepoints greater than U+10FFFF
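
The three checks in the list above can be sketched in a small decoder. This is a minimal illustration in Python, not a reference implementation: the function name `decode_utf8` and the choice to raise `ValueError` on bad input are assumptions made for this example. It also rejects stray continuation bytes and truncated sequences, which any decoder has to handle anyway.

```python
def decode_utf8(data: bytes) -> list[int]:
    """Decode UTF-8 bytes into a list of codepoints (hypothetical helper).

    Raises ValueError on any malformed or disallowed sequence.
    """
    codepoints = []
    i = 0
    n = len(data)
    while i < n:
        b0 = data[i]
        # Pick the sequence length, the payload bits of the lead byte, and
        # the smallest codepoint that genuinely needs that many bytes.
        if b0 < 0x80:
            length, cp, minimum = 1, b0, 0x0
        elif 0xC0 <= b0 < 0xE0:
            length, cp, minimum = 2, b0 & 0x1F, 0x80
        elif 0xE0 <= b0 < 0xF0:
            length, cp, minimum = 3, b0 & 0x0F, 0x800
        elif 0xF0 <= b0 < 0xF8:
            length, cp, minimum = 4, b0 & 0x07, 0x10000
        else:
            # 0x80..0xBF is a stray continuation byte; 0xF8..0xFF is never valid.
            raise ValueError(f"invalid lead byte 0x{b0:02X} at offset {i}")

        if i + length > n:
            raise ValueError(f"truncated sequence at offset {i}")

        # Each continuation byte must look like 10xxxxxx and adds 6 bits.
        for j in range(1, length):
            b = data[i + j]
            if b & 0xC0 != 0x80:
                raise ValueError(f"invalid continuation byte at offset {i + j}")
            cp = (cp << 6) | (b & 0x3F)

        # Check 1: overlong encodings (value fits in fewer bytes).
        if cp < minimum:
            raise ValueError(f"overlong encoding at offset {i}")
        # Check 2: surrogate range U+D800 to U+DFFF, inclusive.
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"surrogate codepoint at offset {i}")
        # Check 3: codepoints greater than U+10FFFF.
        if cp > 0x10FFFF:
            raise ValueError(f"codepoint out of range at offset {i}")

        codepoints.append(cp)
        i += length
    return codepoints
```

Useful test inputs for each rule are `b"\xC0\x80"` (an overlong encoding of U+0000), `b"\xED\xA0\x80"` (U+D800, a surrogate), and `b"\xF4\x90\x80\x80"` (U+110000, one past the maximum). All three should be rejected, while `b"\xF4\x8F\xBF\xBF"` (U+10FFFF) should decode successfully.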