The Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8) is a variant of UTF-8 that is described in Unicode Technical Report #26 [1]. A Unicode code point from the Basic Multilingual Plane (BMP), i.e. a code point in the range U+0000 to U+FFFF, is encoded in the same way as in UTF-8. A Unicode supplementary character, i.e. a code point in the range U+10000 to U+10FFFF, is first represented as a surrogate pair, as in UTF-16, and then each surrogate code point is encoded in UTF-8. Therefore, CESU-8 needs six bytes (three bytes per surrogate) for each Unicode supplementary character, while UTF-8 needs only four. Each CESU-8 byte sequence (1, 2, or 3 bytes) therefore corresponds to exactly one 16-bit UTF-16 code unit.
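The two-step rule for supplementary characters (split into a surrogate pair, then encode each surrogate as a three-byte UTF-8 sequence) can be sketched in Python. This is a minimal illustration rather than a reference implementation: the function names `cesu8_encode` and `cesu8_decode` are made up for this example, and the decoder is lenient in that it does not reject four-byte UTF-8 sequences, which strict CESU-8 does not allow.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode a string as CESU-8 (illustrative helper, not a standard API)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP code points are encoded exactly as in UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Supplementary code points: build the UTF-16 surrogate pair...
            offset = cp - 0x10000
            high = 0xD800 | (offset >> 10)
            low = 0xDC00 | (offset & 0x3FF)
            # ...then encode each surrogate as a 3-byte UTF-8 sequence.
            for unit in (high, low):
                out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


def cesu8_decode(data: bytes) -> str:
    """Decode CESU-8 bytes (lenient: also accepts 4-byte UTF-8 sequences)."""
    # Step 1: decode as UTF-8 while letting surrogate code points through.
    with_surrogates = data.decode("utf-8", "surrogatepass")
    # Step 2: a round trip through UTF-16 merges each surrogate pair
    # back into a single supplementary code point.
    utf16 = with_surrogates.encode("utf-16-le", "surrogatepass")
    return utf16.decode("utf-16-le")


sample = "E\u0205\U00010400"
encoded = cesu8_encode(sample)
print(encoded.hex())                  # 45c885eda081edb080
print(len(sample.encode("utf-8")))    # 7 bytes in UTF-8, versus 9 in CESU-8
```

The sample string uses the three code points U+0045, U+0205, and U+10400; only the last one is encoded differently from UTF-8.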

CESU-8 is not an official part of the Unicode Standard, because Unicode Technical Reports are informative documents only. It should be used exclusively for internal processing and never for external data exchange.

CESU-8 is similar to Java's Modified UTF-8 but does not have the special encoding of the NUL character (U+0000).
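The difference can be made concrete with a small sketch. Modified UTF-8 encodes U+0000 as the two-byte sequence C0 80, whereas CESU-8 keeps the single byte 00. Because a zero byte in CESU-8 data can only ever be the encoding of U+0000, a plain byte replacement is enough to convert between the two; the function names below are hypothetical and chosen only for illustration.

```python
def cesu8_to_modified_utf8(data: bytes) -> bytes:
    """Convert CESU-8 bytes to Modified UTF-8 (illustrative helper)."""
    # Every 0x00 byte in CESU-8 is a NUL character; Modified UTF-8
    # writes it as the overlong two-byte form C0 80 instead.
    return data.replace(b"\x00", b"\xc0\x80")


def modified_utf8_to_cesu8(data: bytes) -> bytes:
    """Convert Modified UTF-8 bytes back to CESU-8 (illustrative helper)."""
    # C0 never occurs as a lead byte in CESU-8, so this pair of bytes
    # can only be the special NUL encoding.
    return data.replace(b"\xc0\x80", b"\x00")


print(cesu8_to_modified_utf8(b"A\x00B").hex())  # 41c08042
```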

The CESU-8 encoding form is used in the Oracle database software. Oracle's UTF8 character set (unfortunately, a misnomer), available since version 8.0 of the database, is actually CESU-8. The character set AL32UTF8, introduced in version 9.0, is UTF-8 compliant.

The encoding of a Unicode supplementary character works out to 11101101 1010yyyy 10xxxxxx 11101101 1011xxxx 10xxxxxx, where yyyy represents the top five bits of the code point minus one (U+10**** becomes 1111, U+01**** becomes 0000) and the x bits represent the remaining 16 bits of the code point. The first three bytes encode the high surrogate (U+D800 to U+DBFF) and the last three encode the low surrogate (U+DC00 to U+DFFF).
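As a worked illustration, the six bytes can be assembled directly from the bits of the code point, without forming the surrogate pair explicitly. In this sketch the second byte of the high-surrogate half starts with 1010 and the second byte of the low-surrogate half starts with 1011, which matches the byte sequence ED A0 81 ED B0 80 for U+10400. The function name `cesu8_supplementary` is an assumption made for this example.

```python
def cesu8_supplementary(cp: int) -> bytes:
    """Build the 6-byte CESU-8 form of a supplementary code point from its bits."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    yyyy = (cp >> 16) - 1      # top five bits minus one, always fits in 4 bits
    rest = cp & 0xFFFF         # remaining 16 bits of the code point
    return bytes([
        0xED,                           # 11101101
        0xA0 | yyyy,                    # 1010yyyy
        0x80 | (rest >> 10),            # 10xxxxxx: top 6 bits of the rest
        0xED,                           # 11101101
        0xB0 | ((rest >> 6) & 0x0F),    # 1011xxxx: next 4 bits
        0x80 | (rest & 0x3F),           # 10xxxxxx: last 6 bits
    ])


print(cesu8_supplementary(0x10400).hex())    # eda081edb080
print(cesu8_supplementary(0x10FFFF).hex())   # edafbfedbfbf
```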

Examples

| Encoding | U+0045 | U+0205 | U+10400 |
|---|---|---|---|
| Character | E | ȅ | 𐐀 |
| UTF-8 | 45 | C8 85 | F0 90 90 80 |
| UTF-16 | 0045 | 0205 | D801 DC00 |
| CESU-8 | 45 | C8 85 | ED A0 81 ED B0 80 |

External links

- Unicode Technical Report #26
- Modified UTF-8 overview
- Graphical View of CESU-8 in ICU's Converter Explorer