In computing, Punycode is an instance of a general encoding syntax (Bootstring) by which a string of Unicode characters is transformed uniquely and reversibly into a string drawn from a smaller, restricted character set.

Punycode is intended for the encoding of labels in the Internationalized Domain Names in Applications (IDNA) framework, such that these domain names may be represented in the ASCII character set allowed in the Domain Name System of the Internet. The encoding syntax is defined in IETF document RFC 3492.[1]

The IDNA methodology encodes only select label components of domain names with a procedure called ToASCII. The procedure ToUnicode decodes the DNS label into Unicode representation.
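As an illustration of label-wise processing, the following minimal sketch uses Python's built-in "idna" codec, which implements the older IDNA 2003 rules. The domain name "bücher.example" is a made-up example, and the "xn--" prefix seen in the output comes from IDNA rather than from Punycode itself.

```python
# Sketch: ToASCII / ToUnicode behaviour as exposed by Python's "idna" codec.
# "bücher.example" is a hypothetical name chosen for illustration.
name = "bücher.example"

ascii_name = name.encode("idna")        # only the non-ASCII label is rewritten
round_trip = ascii_name.decode("idna")  # decodes the encoded label back to Unicode

print(ascii_name)
print(round_trip)
```

The all-ASCII label "example" passes through unchanged; only "bücher" is encoded.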

Encoding procedure

This section demonstrates the procedure for Punycode encoding, using the example of the string "bücher" (German for books), which is translated into the label "bcher-kva".
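The example can be reproduced with Python's built-in "punycode" codec, which operates on a single label and adds no prefix. This is only a quick check of the end result, not a view of the intermediate steps.

```python
# Sketch: encode and decode one label with Python's "punycode" codec.
label = "bücher"

encoded = label.encode("punycode")    # bytes containing only ASCII
decoded = encoded.decode("punycode")  # back to the original string

print(encoded, decoded)
```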

Separation of ASCII characters

First, all basic (ASCII) characters in the string are copied directly from input to output, skipping over other characters (e.g., "bücher" → "bcher"). If one or more basic characters were copied, an ASCII hyphen is added to the output next (e.g., "bücher" → "bcher-"). Since the rest of the encoding does not use "-", the last "-" (if any) in the encoded label signifies the end of the basic characters.
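A minimal sketch of this first step and of its inverse on the decoding side is given below. The function names copy_basic and split_encoded are illustrative and do not come from RFC 3492.

```python
def copy_basic(label):
    """Copy the ASCII characters of `label`; append "-" if any were copied."""
    basic = "".join(ch for ch in label if ord(ch) < 128)
    return basic + "-" if basic else basic


def split_encoded(encoded):
    """Split an encoded label at its last "-" into (basic part, encoded numbers)."""
    basic, _sep, extended = encoded.rpartition("-")
    return basic, extended


print(copy_basic("bücher"))        # basic characters followed by the delimiter
print(split_encoded("bcher-kva"))  # the two halves seen by the decoder
```

When the encoded label contains no hyphen at all, rpartition leaves the basic part empty, which matches the rule that the delimiter is only written if basic characters were copied.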

Encoding of non-ASCII character insertions as code numbers

The next part of the encoding process first requires an understanding of the decoder, which is a finite-state machine with two state variables i and n. i is an index into the string ranging from zero (representing a potential insertion at the start) to the current length of the extended string (representing a potential insertion at the end).

i starts at zero while n starts at 128 (the first non-ASCII code point). The state progression is a monotonic function. A state change either increments i or if i is at its maximum resets i to zero and increments n. At each state change either the code point denoted by n is inserted or it is not inserted.
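The state progression can be sketched as a single step function that skips a given number of candidate insertions and then performs one. The name apply_skip is hypothetical, and the skip count 745 is the value worked out for "bücher" later in this section.

```python
def apply_skip(chars, n, i, skip):
    """From state (n, i), pass over `skip` candidate insertions, then insert chr(n)."""
    slots = len(chars) + 1   # insertion positions 0 .. len(chars)
    i += skip
    n += i // slots          # every full pass over the slots moves on to the next code point
    i %= slots
    chars.insert(i, chr(n))
    return n, i + 1          # the next candidate is the slot just after the new character


chars = list("bcher")
n, i = apply_skip(chars, 128, 0, 745)
print("".join(chars), n, i)
```

Rather than stepping through the states one at a time, the sketch uses integer division and remainder, which gives the same final state.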

The code numbers generated by the encoder represent how many possibilities the decoder should skip before an insertion is made. "ü" has code point 252. Before reaching the possibility of inserting "ü" at position one, the decoder must skip over six potential insertions (the five characters of "bcher" give six insertion positions) for each of the 124 preceding non-ASCII code points (252 - 128, where 128 is the first non-ASCII code point), plus one possible insertion (at position zero) of code point 252 itself. The decoder must therefore be told to skip a total of (6 × 124) + 1 = 745 possible insertions before getting to the one required.
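The same count can be computed mechanically. The sketch below follows the structure of the encoder loop in RFC 3492 but stops before the numbers are turned into ASCII; the function name code_numbers is illustrative.

```python
def code_numbers(label):
    """Return the skip counts an encoder would emit for `label`, in order."""
    points = [ord(ch) for ch in label]
    handled = sum(1 for cp in points if cp < 128)  # basic characters are already output
    n, delta, out = 128, 0, []
    while handled < len(points):
        m = min(cp for cp in points if cp >= n)    # next code point that needs inserting
        delta += (m - n) * (handled + 1)           # all slots of every skipped code point
        n = m
        for cp in points:
            if cp < n:
                delta += 1                         # a slot passed over for code point n
            elif cp == n:
                out.append(delta)                  # insertion happens here
                delta = 0
                handled += 1
        delta += 1
        n += 1
    return out


print(code_numbers("bücher"))
```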

Re-encoding of code numbers as ASCII sequences

Punycode uses generalized variable-length integers to represent these values. For example, this is how "kva" is used to represent the code number 745:

A little-endian number system is used, which allows variable-length codes without separate delimiters: a digit lower than the threshold value for its position marks itself as the most significant digit, and hence the end of the number. The threshold value depends on the position in the number and also on previous insertions, to increase efficiency. Correspondingly, the weights of the digits vary. In this case a number system with 36 digits is used, with the case-insensitive 'a' through 'z' equal to the numbers 0 through 25, and '0' through '9' equal to 26 through 35. Thus "kva" corresponds to the digits "10 21 0".
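The digit assignment can be written down directly. In this small sketch the names DIGITS and digit_value are illustrative.

```python
import string

# 'a'..'z' stand for 0..25 and '0'..'9' stand for 26..35.
DIGITS = string.ascii_lowercase + string.digits


def digit_value(ch):
    """Return the base-36 digit value of `ch`, ignoring letter case."""
    return DIGITS.index(ch.lower())


print([digit_value(ch) for ch in "kva"])
```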

To decode this string of digits, the threshold starts out as 1 and the weight is 1. The first digit is the units digit; 10 with a weight of 1 equals 10. Since 10 is not less than the threshold, more digits follow. After each digit the weight is multiplied by 36 minus the threshold that applied to that digit, and the threshold is recomputed for the next position. With the initial parameters of RFC 3492, the thresholds at the first three positions are 1, 1 and 26. The second digit therefore has a weight of 1 × (36 - 1) = 35, and the sum of the first two digits is 10 × 1 + 21 × 35. Since the second digit, 21, is not less than its threshold of 1, there is more to come. The weight of the third digit is 35 × (36 - 1) = 1225. The third digit in this example is 0, which is less than its threshold of 26, meaning that it is the last (most significant) part of the number. Therefore "kva" represents the number 10 × 1 + 21 × 35 + 0 × 1225 = 745.
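The decoding of a single number can be sketched as follows, using the thresholds that RFC 3492 prescribes for the first number in a label (an initial bias of 72, which yields thresholds 1, 1 and 26 at the first three positions). The names threshold and decode_number are illustrative, and the bias is held fixed here rather than adapted.

```python
import string

DIGITS = string.ascii_lowercase + string.digits
BASE, TMIN, TMAX = 36, 1, 26


def threshold(position, bias):
    """Threshold for the digit at `position` (0-based), clamped to TMIN..TMAX."""
    k = BASE * (position + 1)
    return max(TMIN, min(TMAX, k - bias))


def decode_number(text, bias=72):
    """Decode one variable-length integer; return (value, digits consumed)."""
    value, weight = 0, 1
    for position, ch in enumerate(text.lower()):
        digit = DIGITS.index(ch)
        t = threshold(position, bias)
        value += digit * weight
        if digit < t:              # a digit below the threshold ends the number
            return value, position + 1
        weight *= BASE - t         # weight of the next, more significant digit
    raise ValueError("input ended before a terminating digit was found")


print(decode_number("kva"))
```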

The threshold itself is determined by an algorithm keeping it between 1 and 26 inclusive, meaning the last character of an encoding will always be alphabetic. The case can then be used to provide information about the original case of the string.
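The consequence for the final character can be checked with a small encoder for a single number. This is a sketch with a fixed bias passed in as a parameter; in the full algorithm the bias is adapted after every insertion. The function name encode_number is illustrative.

```python
import string

DIGITS = string.ascii_lowercase + string.digits
BASE, TMIN, TMAX = 36, 1, 26


def encode_number(value, bias=72):
    """Encode one non-negative integer as a variable-length digit string."""
    out = []
    k = BASE
    while True:
        t = max(TMIN, min(TMAX, k - bias))   # threshold stays within 1..26
        if value < t:
            out.append(DIGITS[value])        # final digit is below 26, so it is a letter
            return "".join(out)
        out.append(DIGITS[t + (value - t) % (BASE - t)])
        value = (value - t) // (BASE - t)
        k += BASE


print(encode_number(745))
```

Because the terminating digit is always smaller than a threshold of at most 26, it always falls in the 'a' to 'z' range, whatever the bias.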

For the insertion of a second special character in "bücher", the first possibility is "büücher" with code "bcher-kvaa", the second "bücüher" with code "bcher-kvab", etc. After "bücherü" with code "bcher-kvae" comes "ýbücher" with code "bcher-kvaf", etc.
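These successive codes can be reproduced with a compact decoder that combines the steps described above. The sketch below follows the decoding procedure and parameter values of RFC 3492 (skew 38, damp 700, initial bias 72), but omits overflow checks and input validation, so it is not a substitute for a library implementation. The name punycode_decode is illustrative.

```python
import string

DIGITS = string.ascii_lowercase + string.digits
BASE, TMIN, TMAX, SKEW, DAMP = 36, 1, 26, 38, 700


def adapt(delta, numpoints, first):
    """Bias adaptation as specified in RFC 3492."""
    delta = delta // DAMP if first else delta // 2
    delta += delta // numpoints
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + (BASE - TMIN + 1) * delta // (delta + SKEW)


def punycode_decode(encoded):
    """Decode one Punycode label (no error handling for malformed input)."""
    basic, _sep, extended = encoded.rpartition("-")
    chars = list(basic)
    n, i, bias = 128, 0, 72
    pos = 0
    while pos < len(extended):
        old_i, weight, k = i, 1, BASE
        while True:                              # read one variable-length number
            digit = DIGITS.index(extended[pos].lower())
            pos += 1
            i += digit * weight
            t = max(TMIN, min(TMAX, k - bias))
            if digit < t:
                break
            weight *= BASE - t
            k += BASE
        slots = len(chars) + 1
        bias = adapt(i - old_i, slots, old_i == 0)
        n += i // slots                          # wrap-around advances the code point
        i %= slots
        chars.insert(i, chr(n))
        i += 1
    return "".join(chars)


for code in ("bcher-kva", "bcher-kvaa", "bcher-kvab", "bcher-kvae", "bcher-kvaf"):
    print(code, punycode_decode(code))
```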

To keep the encoding and decoding algorithms simple, no attempt has been made to prevent some encoded values from encoding inadmissible Unicode values; however, these should be checked for and detected during decoding.
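One possible form of such a check is sketched below. Which values count as inadmissible is an assumption made here for illustration: the sketch rejects code points above U+10FFFF and the surrogate range U+D800 to U+DFFF, and a real decoder may apply further restrictions. The function name is_admissible is hypothetical.

```python
def is_admissible(code_point):
    """True if `code_point` is a Unicode scalar value (in range, not a surrogate)."""
    in_range = 0 <= code_point <= 0x10FFFF
    is_surrogate = 0xD800 <= code_point <= 0xDFFF
    return in_range and not is_surrogate


print(is_admissible(252), is_admissible(0xD800), is_admissible(0x110000))
```

A decoder would call such a check on n before each insertion and reject the label if it fails.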

Punycode is designed to work across all scripts, and to be self-optimizing by attempting to adapt to the character set ranges within the string as it operates. It is optimized for the case where the string is composed of zero or more ASCII characters and in addition characters from only one other script system, but will cope with any arbitrary Unicode string. Note that for DNS use, the domain name string is assumed to have been normalized using Nameprep and (for top-level domains) filtered against an officially registered language table before being punycoded, and that the DNS protocol sets limits on the acceptable lengths of the output Punycode string.
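The length limit can be illustrated with a short sketch. It assumes the 63-octet limit on a DNS label (RFC 1035) and the "xn--" prefix that IDNA places before a Punycode-encoded label; neither value is part of Punycode itself. The Nameprep and language-table steps are deliberately omitted, and the function name to_ace_label is hypothetical.

```python
MAX_LABEL_OCTETS = 63   # assumed DNS limit for a single label (RFC 1035)
ACE_PREFIX = "xn--"     # assumed IDNA prefix; not defined by Punycode itself


def to_ace_label(label):
    """Encode one already-normalized label and enforce the DNS length limit."""
    if label.isascii():
        ace = label                                   # ASCII labels are left alone
    else:
        ace = ACE_PREFIX + label.encode("punycode").decode("ascii")
    if len(ace) > MAX_LABEL_OCTETS:
        raise ValueError("encoded label exceeds 63 octets")
    return ace


print(to_ace_label("bücher"))
```

Note that the limit applies to the encoded form, so a label of modest length in Unicode can still be rejected once encoded.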

See also

Hostname

References

1. RFC 3492, Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA), A. Costello, The Internet Society (March 2003)