Why multiple Unicode points for what appears to be the same character ?

Belloc's picture

Given the table below, one can see that the only noticeable difference in the glyphs, is the one between the character U+FBE5 and the characters U+FBE4 and U+0620. For the other characters the glyphs from the Arial font are identical. One other thing that called my attention was the fact that indeed the Arial font considers the character pairs (FBE4, 06D0), (FBE2, 06C9) and (FBD7, 06C7) equivalent, as they have the same GlyphID's.

Khaled Hosny's picture

The U+FB50-FDFD and U+FE70-FEFD code points are compatability character encoded for round trip conversion with pre Unicode encodings and their use for encoding text is discouraged.

Belloc's picture

Khaled Hosny

Thanks for you reply. Reading the Unicode docs about the characters in the range U+FB50-FDFD, I can see the reference "Preferred characters are found in the Arabic 0600-06FF". But I couldn't find in this last block, the characters for the final forms.

Belloc's picture

Thanks for the link. Very informative indeed. But I'm a newbie on fonts, and more so with Arabic characters. The answer to the very first question in the link you provided states the following "Arabic font designers should do whatever is necessary to add the full range of glyph support to the fonts they develop". Can I interpret this as the answer to my last question ?

Khaled Hosny's picture

Sorry, but I’m not sure what the question is.

John Hudson's picture

Belloc, the background here is that in order to encode Arabic text you only need one character code per letter, but that in order to display Arabic text you need multiple glyphs per letter. Before smart glyph processing technologies like ACE and OpenType, some styles of Arabic text could be adequately displayed by performing character-level substitutions of presentation form characters (initial, medial and/or final, plus ligature presentation forms), so Unicode encoded a set of these for compatibility with such systems. Some fonts contain mappings for presentation forms even though they also contain e.g. OpenType Layout tables and are not expected to used the presentation form characters during display. This is done for backwards compatibility only, so that older documents can be displayed in those fonts.

The other phenomenon that you will likely notice is that some of the characters in the Arabic range (0600) appear to be identical if you only look at the default encoded glyphs or at the Unicode glyph charts. This is because there are letters in various Arabic-script orthographies that take identical isolated forms but that have different joining behaviour and/or take different forms in initial, medial or final positions. Unicode made the decision to separately encode such characters, rather than relying on language-specific glyph processing to ensure correct display. [This topic can be especially confusing for newbies, because the differential behaviour of some letters is derived from different styles of the Arabic script, so in some fonts the difference will disappear. For example, the character ھ (do chashmī he) is a letter in the Urdu alphabet that is differentiated in Unicode from the Arabic hā’ because in some common styles of type it takes different forms in different positions. But the do chashmī he originated as the standard form of hā’ in the Persian nastaliq style (which is the normative style for Urdu), and in a nastaliq font one would expect both the Arabic and Urdu letters to behave identically. Confusing?]

Belloc's picture

John

I have to say something important here. I'm always amazed by your generosity in sharing with me (and the others who read theses posts) your deepest knowledge about fonts in general and Arabic characters in particular. I'm very grateful to your help, for you always give me more than I ask for, and this not only strengthens the respect I already have for you as a professional, but also as a person.

As always, your explanation was superb. The last part wasn't totally clear to me. But that really doesn't matter, for I know I'll be able to return here later, if necessary.

Best regards,

Syndicate content Syndicate content