Unicode Reserved Codepoints

agisaak's picture

While there are many unallocated codepoints in the Unicode Standard, some codepoints are specifically listed as "reserved" in the code charts.

Since AFAIK all unallocated codepoints are effectively reserved, I was wondering if anyone knows why some codepoints are explicitly marked as such. I can't find anything about this in the Unicode documentation (The Unicode Standard v6.2).

Just curious,

André

Theunis de Jong's picture

There are 2 different types of "Reserved". From "2.13 Special Characters and Noncharacters":

The Unicode Standard contains a number of code points that are intentionally not used to represent assigned characters. These code points are known as noncharacters. They are permanently reserved for internal use and should never be used for open interchange of Unicode text.

(Note the permanently reserved here.) This is used for 'internal use only', for codes that would *never* indicate a displayable glyph, such as the BOM and the codes for switching RTL/LTR.

The other kind of "reserved" is simply 'not (yet) in use'. Most blocks contain some reserved -- unassigned -- codes at the end, which is probably just to align the start of the next code block on a round hexadecimal number. Also, this free space can be used to add one or two useful characters to an existing block.
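For reference, the noncharacters described in the quoted passage form a fixed set of 66 code points: the run U+FDD0..U+FDEF plus the last two code points of each of the 17 planes. A minimal Python sketch of that membership test follows; the function name `is_noncharacter` is made up for illustration and is not part of any library.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode noncharacters."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # U+FDD0..U+FDEF: a contiguous run of 32 noncharacters in the BMP.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of every plane (U+nFFFE and U+nFFFF).
    return (cp & 0xFFFE) == 0xFFFE


# Count them across the whole code space (planes 0 through 16).
noncharacter_count = sum(is_noncharacter(cp) for cp in range(0x110000))
print(noncharacter_count)

# U+FEFF (the byte order mark) is an ordinary assigned character;
# its byte-swapped counterpart U+FFFE is the noncharacter.
print(is_noncharacter(0xFEFF), is_noncharacter(0xFFFE))
```

Every other unassigned code point belongs to the second kind of "reserved": it is simply not in use yet.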

agisaak's picture

Hi Theunis,

Thanks for the response.

I wasn't actually thinking of noncharacters here. As an example, within the Greek block, there are some characters listed as "reserved" (e.g. U+03A2) and other characters which are not so listed, but which are still unassigned (e.g. U+0378).

André

Theunis de Jong's picture

It seems the committee had been contemplating a possible future use for "U+03A2":

From: Michael Everson (xxx@xxx.com)
Date: Fri Aug 02 2002 - 19:12:46 EDT
At 11:13 +0200 2002-08-01, Otto Stolz wrote:

>I have selected U+03A2 with care: this code point covers the place
>of a non-existing "Greek capital letter final sigma". I think that
>this code-point -- while, admittedly, unsafe as any other unassigned
>one -- is rather unlikely to get assigned a character, in the fore-
>seeable future.
>
>Please do not promote an assignation to U+03A2 just do make a point :-)

Do not tempt us. Oh, do not tempt us. If ever GREEK CAPITAL LUNATE
SIGMA needed a place to hang its curvy hat, it is surely U+03A2.

There is no rhyme or reason to individual code point assignments. In this case, someone thought it might be possible for a valid character to appear -- an uppercase equivalent of the lowercase pair "final sigma/regular sigma". In other cases, code points may simply have been removed from the specification.

Theunis de Jong's picture

Ah, wait: you are wondering when a code point is "officially declared reserved"!

Well ... in the example of U+03A2 above, the codes clearly run parallel from uppercase to lowercase, per design. So the code point U+03A2 would logically be 'uppercase terminal sigma'; but, because there is no such beast (as yet), the code point is "reserved for (possible) future use". Whereas other code points are simply 'left over' and could be used for anything.
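The parallel layout can be checked directly. In the basic Greek alphabet each capital sits exactly 0x20 below its lowercase letter, and the only slot where that pattern fails is U+03A2, which lines up with the lowercase final sigma U+03C2. A small Python sketch, assuming the range U+0391..U+03A9 for the capitals (the names `OFFSET` and `gaps` are just for illustration):

```python
OFFSET = 0x20  # distance from a Greek capital to its lowercase partner

# Walk the capital row and note every slot whose would-be lowercase
# partner does not map back to it when uppercased.
gaps = [
    cp
    for cp in range(0x0391, 0x03AA)
    if chr(cp + OFFSET).upper() != chr(cp)
]

for cp in gaps:
    partner = chr(cp + OFFSET)
    print(
        f"U+{cp:04X} is the odd one out: "
        f"U+{cp + OFFSET:04X} uppercases to U+{ord(partner.upper()):04X}"
    )
```

Final sigma uppercases to the ordinary capital sigma at U+03A3, so the slot directly "above" it has nothing to hold.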

agisaak's picture

OK, that makes sense.

Thanks,
André

ilyaz's picture

> (Note the permanently reserved here.) This is used for 'internal use only', for codes that would *never* indicate a displayable glyph, such as the BOM and the codes for switching RTL/LTR.

A minor correction: Unicode “reservation” has nothing to do with whether the glyph (I would say character) is “displayable”. The codes for switching RTL/LTR are just “normal” Unicode characters.
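This is easy to confirm with Python's standard `unicodedata` module: the directional controls and the BOM are assigned format characters (general category Cf) with proper names, while a real noncharacter such as U+FFFE reports Cn. A short sketch; the helper `category_of` is a made-up name, and the output reflects the Unicode version shipped with the interpreter.

```python
import unicodedata


def category_of(cp: int) -> str:
    """Return the general category of a code point."""
    return unicodedata.category(chr(cp))


# Directional marks/overrides, the byte order mark, and one noncharacter.
for cp in (0x200E, 0x200F, 0x202E, 0xFEFF, 0xFFFE):
    name = unicodedata.name(chr(cp), "<no name>")
    print(f"U+{cp:04X}  {category_of(cp)}  {name}")
```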
