Unicode Questions

Igor Freiberger's picture

I was analyzing the Unicode tables and some pro fonts to understand how all this works. Even after navigating through the huge Unicode documentation, some doubts remain:

1. The Unicode tables do not include variations for small caps, petite caps, swashes, beginnings, endings and alternates. So all these glyphs will have no Unicode value set during font development. Correct?

2. When the font is generated, these glyphs without Unicode values are recorded in the Private Use Area and receive a codepoint assigned by the font-generation program. Correct?

3. Glyphs without a Unicode definition work correctly but are identified as NULL in the InDesign Glyphs palette. Is there a way to replace this NULL name with a descriptive one?

4. How can I control the order InDesign uses in the Glyphs palette? Illustrator shows exactly the same order I set in the Index view in FontLab, but InDesign ignores it for the first 256 glyphs.

5. Can a glyph that does not exist in the Unicode tables – such as g with tilde, used in some Native American languages – be added to the font without problems? I remember reading a recommendation not to put language variants in the PUA. What is the best practice in these situations?

Thanks in advance!

Arno Enslin's picture

Adobe no longer assigns PUA codepoints to characters that don't have an official Unicode codepoint. It's your decision whether to map characters without an official codepoint to the PUA. Also have a look at the FontLab name table "Standard with PUA": “In addition, a number of semi-standardized Private Use Area mappings have been included (in the range 0xE000-0xF8FF).” So if you do want to use PUA codepoints, you should use the ones that are semi-standardized.

And in my opinion you should generate Unicode values before you generate the font (menu Glyphs / Glyph Names / Generate Unicode). You can deactivate the FontLab option that generates Unicode values when the font is built. (You should also deactivate all the hinting options that only apply when the font is generated, because you lose control with those automatic functions.)

With regard to 5: if you were not allowed to give g.tilde (no official Unicode codepoint?) a PUA codepoint, what other codepoint could you give it? Certainly not an official codepoint that belongs to another character. I assume you can only choose between a PUA codepoint and no codepoint at all. It's the same as with the typographic variants, isn't it?

Note that I am an amateur.

Khaled Hosny's picture

For 5, you have to resort to combining marks; it is Unicode's policy now not to encode new precomposed accented characters. Instead you use base character + combining accent, i.e. g + combining tilde (U+0303), which results in g̃ (how this looks depends on the level of support in your system/font). Proper positioning is then achieved with OpenType mark anchors.
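
In Adobe feature-file syntax, the anchors might look roughly like this (just a sketch; the glyph names and coordinates here are placeholders, not from any particular font):

markClass tildecomb <anchor 250 730> @TOP_MARKS;

feature mark {
    # attach any top mark to the anchor above the g
    pos base g <anchor 250 510> mark @TOP_MARKS;
} mark;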

Arno Enslin's picture

@ Khaled Hosny

Thanks! This seems to be explained here, right?: http://www.adobe.com/devnet/opentype/afdko/topic_feature_file_syntax.htm...

Theunis de Jong's picture

You could also do it in ccmp. The glyph itself doesn't have to have a Unicode value, but OTF-aware programs (such as InDesign) will always replace a 'g' followed by a combining tilde with your g.tilde glyph.
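
Something along these lines in the feature file (a sketch; the names tildecomb and g.tilde are assumed):

feature ccmp {
    # compose g + combining tilde into the precomposed glyph
    sub g tildecomb by g.tilde;
} ccmp;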

Khaled Hosny's picture

@ Arno Enslin

Yes, sort of. I usually use FontForge for my OpenType work and thus rarely use Adobe's feature files; it's a matter of which tool you use, but the concept is the same.

And as Theunis said, one can also use ccmp to compose g + tilde into a g.tilde glyph, but I find this inflexible, as you then need a separate glyph for each possible combination, which can be a huge number. It does have the benefit of more straightforward kerning, though, as the interaction between mark positioning and kerning can be tricky.

Igor Freiberger's picture

Thanks, everyone! I understand now that gtilde (or other combinations) can be achieved through combining diacritics and the ccmp feature, and also how to build them using marks, although I believe ccmp would be easier to handle because of the kerning issues Khaled pointed out.

I'd like to have the proper glyph already in the font because this enables direct access. For example, people who study Native American languages could use a keyboard layout where a simple shortcut (such as AltGr + g) produces the desired glyph. This is quite simple to do in Windows, as Microsoft has a free tool to create keyboard layouts. But this solution is not possible when the desired glyph results from a combination.

These languages also use some digraphs which may be treated as ligatures. Once more, Unicode has no room for them.

Objectively, what is the problem with adding accented glyphs for languages with no Unicode support? Besides the lack of standardization this causes, is there any functional issue?

It's a pity to see that not one South American native language is considered in Unicode. Even Samaritan, spoken by fewer than 1,000 people, has Unicode support, while American native languages with more than one million active speakers don't. (Of course, I know this involves cultural questions, as Samaritan has historical relevance.)

agisaak's picture

I don't know if Windows allows you to map a key to a sequence of two Unicode characters -- if it does, you can map a key to U+0067 U+0303. If not, you won't be able to map a key directly to g + tilde (or any other accented character not present in Unicode).

The way around this would be to have a key which maps to tildecmb (U+0303) and a rule in your ccmp feature (which you'd need in any case) along the lines of:

sub g tildecmb by g_tildecmb

(N.B. I would not recommend using g.tilde as the name for your glyph, as others have done above, since applications which depend on glyph names will strip everything after the period when determining Unicode values -- g_tildecmb will be treated as a ligature.)

It's a pity to see that not one South American native language is considered in Unicode.

The issue here revolves around the underlying principles of Unicode -- I think precomposed accented characters were initially included in Unicode primarily to ensure round-trip compatibility with existing encodings. I suspect the Unicode consortium had hoped that these characters would eventually fall out of use in favour of decomposed sequences (e.g. using e (U+0065) + acutecmb (U+0301) to represent the character é rather than the precomposed character), but this doesn't appear to be happening.

André

riccard0's picture

Even Samaritan, spoken by fewer than 1,000 people, has Unicode support, while American native languages with more than one million active speakers don't.

But how many writers?

John Hudson's picture

I'd like to have the proper glyph already in the font because this enables direct access. For example, people who study Native American languages could use a keyboard layout where a simple shortcut (such as AltGr + g) produces the desired glyph. This is quite simple to do in Windows, as Microsoft has a free tool to create keyboard layouts. But this solution is not possible when the desired glyph results from a combination.

Note that a key in a keyboard layout may be assigned to input more than one character, so to input g + tilde you could either have separate keys for the g and the combining tilde, or you could assign a sequence of two Unicode characters to a single key: U+0067 U+0303.

Remember, the relationship of number of keys to number of characters to number of glyphs is n-to-n-to-n. A single key may input multiple characters that are displayed as a single glyph. Multiple keys may input multiple characters that are displayed as a single glyph. Multiple keys may input a single character (a deadkey mechanism) that is displayed as multiple glyphs (ccmp decomposition). Etc.
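
The last case, for instance, might be handled in the font along these lines (a sketch; the glyph names are assumed):

feature ccmp {
    # one encoded character, aacute, decomposed to two glyphs
    sub aacute by a acutecomb;
} ccmp;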
___

There is a lot of misunderstanding regarding Unicode encoding of diacritic characters and a persistent idea that Unicode somehow ignores or does not support many languages because it does not encode their diacritics as precomposed characters. Unicode's encoding principles have always favoured decomposed encodings; even when text is encoded using precomposed diacritics Unicode normalisation provides for decomposition as the cleanest way to handle search, sort and spellcheck operations. The only reason why any precomposed diacritic characters exist in Unicode is because of a secondary principle that provides for one-to-one roundtrip character compatibility with pre-existing standards. This is why most European languages have precomposed diacritics included in Unicode: because they existed as precomposed characters in earlier standards; note that all of them have canonical decompositions in Unicode to base plus combining mark sequences. Having multiple ways to encode the same diacritic is not a benefit, it is a software development cost and a potential security risk. This is why stability agreements with other standards prevent Unicode from adding any more characters with canonical decompositions.

Rather than thinking that Unicode somehow ‘doesn't support’ e.g. native American languages, consider that Unicode supports these languages in the way that Unicode is fundamentally designed to support all languages involving combining diacritics. It is the backwards compatibility encoding of precomposed European diacritics that is the anomaly.

John Hudson's picture

In direct answer to the original questions:

1. Glyphs that are display variants of other glyphs do not need to be encoded. So long as there is a path through the OpenType Layout tables from an encoded glyph (i.e. one that is directly mapped in the font's cmap table) to the variant glyph, the latter can be accessed via layout features (see the sketch at the end of this post).

2. PUA codepoints are best avoided. They should be reserved for things like logos and other idiosyncratic pseudo characters. Any glyph that represents a standard encoded character, even a variant glyph, should only be associated with that character, not with a PUA codepoint.

3. No. In some situations the InDesign glyph palette tries to associate an unencoded glyph variant with an underlying character plus layout feature, but I can't figure out how or why this sometimes works and sometimes doesn't.

4. InDesign has the option to sort the glyph palette by either Unicode order or Glyph ID order. I think the latter is preferable, but it is up to the user to set this.

5. As others have suggested, either ccmp composition to a precomposed diacritic glyph or GPOS dynamic mark positioning. The former is best if you are targeting specific languages and you know exactly what diacritics are involved; the latter is best if you are making a font with widespread language support and need a flexible solution. As you realised, the mark positioning approach introduces kerning complexities that you may wish to avoid.
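
To illustrate point 1, an unencoded swash or alternate is reached from the encoded base glyph by a simple substitution (the glyph names here are only an example):

feature salt {
    # the unencoded variant g.swash is reached from the encoded g
    sub g by g.swash;
} salt;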

twardoch's picture

> Rather than thinking that Unicode somehow ‘doesn't support’ e.g. native
> American languages, consider that Unicode supports these languages in
> the way that Unicode is fundamentally designed to support all languages
> involving combining diacritics.

It's even better than that: fonts do not need glyphs targeted at those specific languages in order to support them. A well-designed "linguistic" font with extensive mark-to-base and mark-to-mark positioning (such as the Arial or Times New Roman that ship with Windows Vista) supports arbitrary mark combinations, which makes it support those languages even without "explicitly knowing about" them.
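
The mark-to-mark part is what lets marks stack (a rough sketch; glyph names and coordinates are placeholders):

markClass acutecomb <anchor 150 680> @TOP;

feature mkmk {
    # position a second top mark above an acute that is already placed
    pos mark acutecomb <anchor 150 820> mark @TOP;
} mkmk;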

In fact, the situation of some languages that use less-common precomposed characters is worse because fonts need to be specifically developed to support each of these characters, and this rarely happens.

charles ellertson's picture

And more: beyond the Combining Diacritical Marks, the Spacing Modifier Letters need to be filled out in fonts. The raised comma used to signal a glottal stop should come from the Spacing Modifier Letters, not from General Punctuation, even if it is the same glyph. And the acute as a spacing character should come from the spacing modifiers, not from the legacy acute.

etc.

Thank God for EULAs that allow the end user to modify the font as needed.

Igor Freiberger's picture

@ John Hudson, Agisaak, Twardoch and Charles: Thank you very much! This thread is a concise and excellent lesson on Unicode and OT building.

I see that most good fonts released recently do not include combining diacritics or spacing modifiers. Vesper Pro is one of the few with this support, which otherwise appears largely in fonts bundled with Windows.

BTW, my font project now includes these combining diacritics and modifiers.

twardoch's picture

Freiberger,

I believe that one reason behind the lack of mark attachment support in the majority of shipping fonts is that the tools are not quite up to it. Microsoft VOLT is pretty much the only tool that has been used for this by professional foundries, and it is somewhat cumbersome to use, and only works on Windows. Also, InDesign has only supported mark attachment since version CS3, which is still relatively recent.

I would speculate that if FontLab Studio gets mark attachment support, the number of such fonts will increase significantly.

Adam

Jongseong's picture

I would speculate that if FontLab Studio gets mark attachment support, the number of such fonts will increase significantly.

Please tell me we won't have to wait much longer. :)

Igor Freiberger's picture

I second Jongseong's request! :-)

Off-topic: another improvement I want to see in FL 6 concerns hinting.

For now, I'm testing with PS autohinting, manual PS hinting and no hints at all. Curiously, some glyphs work much better without any hinting (especially the complex ones, like /g/ß/W/).

But I don't know if this is because I'm using alignment zones and default stems in a consistent way, or because almost all nodes/BCPs are aligned to a 4x4 grid, or whether it's simply a FontLab limitation when applying hints – and later I may discover that leaving some glyphs without hints is not a good idea.

dezcom's picture

Remember that unencoded glyphs acquire their encoding through substitutions in the feature code and classes. A small-cap N, for example, is substituted for either a cap or a lowercase n, depending on which feature comes into play. A variant does not require its own codepoint, and when text is read back out of a PDF, it comes back as the referent's Unicode codepoint. In the case of our small-cap N, this could be either a cap N or a lowercase n. A PUA codepoint would just confuse the issue. An N by any other name is still an N.
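
In feature code, the two routes to the same small cap would look something like this (glyph names illustrative):

feature smcp {
    # lowercase n to small cap
    sub n by n.sc;
} smcp;

feature c2sc {
    # cap N to the same small cap
    sub N by n.sc;
} c2sc;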

blokland's picture

Brian: Please tell me we won't have to wait much longer.

I can’t speak for Adam, but in any case DTL and URW++ will present a new tool, which amongst other things will support mark attachment via a GUI, at the coming ATypI Conference in Dublin (and that is all I will reveal here).

Arno Enslin's picture

@ Freiberger

Use the AFDKO autohint macro instead of the autohint function that belongs to FontLab. And note that the quality of the hinting depends on the standard stem values.

@ Frank E. Blokland

But the new tool is obviously not OTMaster. But for sure you can get it for free. Cool!

Khaled Hosny's picture

And FontForge has had an excellent GUI for mark attachment since forever; oh well, "professionals" don't use free crap.

Sindre's picture

[tracking]

Andreas Stötzner's picture

… most good fonts released recently do not include combining diacritics or spacing modifiers

Text fonts that are to be labelled “good” should include them.

The problem is rather that there are no advanced encoding tables publicly available which specify what scope of characters or glyphs a font should contain if it is meant to serve a particular field of text composition. Who really knows what is actually required for comprehensive coverage of South American native languages?! Who really knows what is actually required for comprehensive coverage of medieval Latin? These things are more like best-kept secrets …

For my own part, I go with self-made custom encoding schemes for nearly all of my font work, since neither the UCS tables nor the default encodings tell you what is *actually* necessary to cover. To take one step forward: could you supply us with a concise list of the characters needed in your field? Maybe I would eventually volunteer to provide an encoding to fit.

twardoch's picture

> And FontForge has had an excellent GUI for mark
> attachment since forever; oh well, "professionals" don't use free crap.

I did not mention FontForge for a simple reason: I've tried installing it repeatedly, over time, on three different computers -- and never got it working properly. So I never learned much about it, even though I tried.

Currently, the FontForge that is installed on my Mac OS X 10.5.8 machine crashes immediately after launch. I've reinstalled, uninstalled, installed the released version as well as a fresh CVS copy, and tried compiling it in various modes on several different systems. Something always fails.

Of course I contacted the developer numerous times, but he was never able to help me. And I'm talking about _installation_. At some point (about a year ago), I did have a working version of FontForge running, but only until I installed some updates to my system.

Which is a pity. I personally would love to use FontForge for some specific technical tasks that it supposedly supports, while FontLab Studio doesn't (like generating AAT tables). But after hours and hours spent trying to install it, I've given up.

While FontLab Studio is far from being perfectly stable, it usually does the job somehow. From my personal experience -- and I'm not saying it gladly -- FontForge is a tool that can be called "experimental", at most. I'm certain some users will be successful in using it, but they need to bring some "readiness to be challenged" in their spirits. :) Which I think is a bit of a pity. I think there is a good spot for a free font editor that works well, simply because some people just cannot afford commercial software.

Khaled,

I personally use lots of free and opensource software packages, and I have contributed my own development to some. So when it comes to your sentence, "professionals don't use free crap", I can agree with it only if you remove the word "free" from it.

Frank,

I'm very excited about your announcement, and certainly look forward to seeing your new application. It's always very good to hear that someone's working on new font development software, and, as you know, I have great respect for both DTL and URW++ in that domain. DTL OTMaster rocks, and I've enjoyed using some of your other tools as well.

A small list of font editing software is available on my page: http://www.font.org/software/

Best,
Adam

JanekZ's picture

(Adam: a $200 laptop and 5 minutes of work...)
Installing FF in 3 easy steps:
1. Go to http://www.geocities.jp/meir000/fontforge/ and download fontforge-mingw_2009_10_28.zip (ca. 18 MB).
2. Unzip it somewhere.
3. Double-click fontforge.bat.
FontForge has access only to the disk where "somewhere" is. It can even be on a memory card; no install process is necessary!
Edit: tested on Win2000 and WinXP.

Theunis de Jong's picture

Janek, I've used FontForge through MinGW and can't but agree with Adam. I got it to work, but boy! every two or three minutes it crashed, sometimes requiring not just a restart from MinGW (which on occasion worked) but an entirely fresh restart of Windows. "Unstable", to say the least.

It may be FontForge itself, it may be MinGW, and it also may be my ill-fated Windows PC. I'm too wary to try it on my Mac, since trying to run it on Snow Leopard has been called "a brave enterprise" a few times :-P

twardoch's picture

I work on Mac OS X, and I *did* get it to work: on a completely clean system, about 1.5 years ago. The moment I installed a dozen applications and some libraries, it stopped working, and never resumed. This happened on three computers, one with Mac OS X 10.4, one with 10.5, and one with 10.6.

Which is a pity, because, as I said, I would have some use for it.

Igor Freiberger's picture

Firstly, let me say thanks for all the info posted here. Very useful.

About the hinting issue, a special thanks to Arno. I'm now using AFDKO autohint with excellent results.

Back to the original questions: I saw that some fonts include a few characters with no Unicode correspondence. For example, Greta Text has Ering, and Newzald has Eacutedotbelow and Egravedotbelow. In Newzald these glyphs were even given codepoints (in the E1xx range).

I understand this is not recommended, but as Greta and Newzald are widely acclaimed fonts, I'd like to know whether there is some reason for these specific glyphs to be included even without a Unicode value, and whether Newzald's procedure (assigning a codepoint from the PUA range) is acceptable in any sense.

Theunis de Jong's picture

Giving a Unicode codepoint to each glyph

Pro: you can insert these characters in any program. Not every program can handle (all) feature sets or non-Unicode characters.
Con: Your font *will* be incompatible with other fonts that have other glyphs on those same positions. Changing from your font to another will change the glyphs but not their UC.

Inserting "odd" glyphs without a Unicode
Pro: they don't interfere with the general idea of Unicode.
Con: Non-feature enabled apps will not have access to these characters.

It has been discussed before (with strong arguments both pro and con each method), and I think the general idea was to put "real" glyphs which just don't happen to have a Unicode codepoint assigned (yet) in the PUA -- there are even a few "informal" ranges in the PUA with characters that ought to have a codepoint assigned -- and to leave unencoded those glyphs that are used exclusively by features, such as ligatures and ccmp-composited characters that don't have a codepoint of their own. In the latter case, changing from your font to another would not preserve your specific ligature combos *anyway*.

I can't find the relevant discussion right away; perhaps someone else remembers what it was called.

Igor Freiberger's picture

Thanks, Theunis. You put it in a very clear way.

For now, I'm including some glyphs which aren't in the Unicode tables. My criterion is to include only those diacritic characters that are also built into good fonts such as Minion Pro, Newzald or Greta Text. Actually, these are just a few.

Although I found various fonts that assign codepoints to glyphs outside the Unicode tables (in the PUA range, E1xx and above), I chose not to give codepoints to them: firstly, because of the incompatibility you pointed out; secondly, because it is a lot of work without a very tangible advantage.

As I have included many OT features, the font is huge (2300+ glyphs) even without Greek and Cyrillic support. I'll probably go crazy making the italics.
