Cannot force substitution feature to work

PhysicallyReal's picture

The problem:
I want the sequence U+0435 (CYRILLIC SMALL LETTER IE, glyph name "uni0435") + U+00A8 (glyph name "dieresis") to be substituted by U+0451 (CYRILLIC SMALL LETTER IO, glyph name "uni0451").
I want the sequence U+F0987 (Supplementary Private Use Area-A, glyph name "uF0987") + U+100987 (Supplementary Private Use Area-B, glyph name "u100987") to be substituted by the glyph "myglyph1" ("myglyph1" has no Unicode position).
I want the sequence U+F0986 (Supplementary Private Use Area-A, glyph name "uF0986") + U+100986 (Supplementary Private Use Area-B, glyph name "u100986") to be substituted by the glyph "myglyph2" ("myglyph2" has no Unicode position).
I want the glyph sequence "uni0451" + "myglyph1" + "myglyph2" to be substituted by the glyph "myglyph3" ("myglyph3" has no Unicode position).

So if I have the sequence uni0435 + dieresis + uF0987 + u100987 + uF0986 + u100986, I want all of it to be substituted by the single glyph "myglyph3".
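A side note on the glyph names: names such as uni0435 and uF0987 follow the common convention of "uni" plus four hex digits for Basic Multilingual Plane code points and "u" plus up to six hex digits for supplementary ones. A minimal sketch of that convention follows; the helper name `glyph_name` is hypothetical, and a font may equally use a descriptive name such as "dieresis" for U+00A8.

```python
def glyph_name(codepoint: int) -> str:
    """Return a 'uniXXXX' (BMP) or 'uXXXXX' (supplementary) glyph name."""
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if codepoint <= 0xFFFF:
        # BMP code points: 'uni' followed by exactly four uppercase hex digits.
        return f"uni{codepoint:04X}"
    # Supplementary code points: 'u' followed by five or six hex digits.
    return f"u{codepoint:X}"


if __name__ == "__main__":
    for cp in (0x0435, 0x00A8, 0x0451, 0xF0987, 0x100987):
        print(f"U+{cp:04X} -> {glyph_name(cp)}")
```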

I came up with this OpenType feature code:


languagesystem DFLT dflt;
languagesystem cyrl dflt;
languagesystem latn dflt;

feature ccmp { # Glyph Composition / Decomposition
  # DEFAULT
  lookup ccmp0 {
    sub uni0435 dieresis by uni0451;
    sub uF0987 u100987 by myglyph1;
    sub uF0986 u100986 by myglyph2;
    sub uni0451 myglyph1 myglyph2 by myglyph3;
  } ccmp0;

  script cyrl; # Cyrillic
  lookup ccmp0;

  script latn; # Latin
  lookup ccmp0;
} ccmp;

(or feature "dlig")

and MS VOLT lookup source:


DEF_LOOKUP "l002" PROCESS_BASE PROCESS_MARKS ALL DIRECTION LTR
IN_CONTEXT
END_CONTEXT
AS_SUBSTITUTION
SUB GLYPH "uni0435" GLYPH "dieresis"
WITH GLYPH "uni0451"
END_SUB
SUB GLYPH "uF0987" GLYPH "u100987"
WITH GLYPH "myglyph1"
END_SUB
SUB GLYPH "uF0986" GLYPH "u100986"
WITH GLYPH "myglyph2"
END_SUB
SUB GLYPH "uni0451" GLYPH "myglyph1" GLYPH "myglyph2"
WITH GLYPH "myglyph3"
END_SUB
END_SUBSTITUTION
END

I then tried to ship the font built with MS VOLT, but the compiled feature is not applied in Chrome (Firefox is fine; I didn't test IE10). How can I solve this problem with Microsoft VOLT or FontLab?
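One point worth checking independently of the browser: all four rules sit in a single lookup. Under the usual OpenType processing model, a lookup is applied in one left-to-right pass, and after a match the scan resumes after the matched input, so the glyphs a lookup produces are not fed back into that same lookup. The sketch below is a simplified simulation of that model only (it ignores lookup flags and shaper-specific behavior, and `apply_ligature_lookup` is a hypothetical helper, not any engine's API). It suggests the three-glyph rule may need its own, later lookup.

```python
def apply_ligature_lookup(glyphs, rules):
    """Apply one ligature lookup in a single left-to-right pass.

    `rules` maps a tuple of input glyph names to one output glyph name.
    After a match, scanning resumes after the matched input, so the
    output glyph is never re-examined by this same lookup.
    """
    out = []
    i = 0
    while i < len(glyphs):
        for seq, lig in rules.items():
            if tuple(glyphs[i:i + len(seq)]) == seq:
                out.append(lig)
                i += len(seq)
                break
        else:
            # No rule matched at this position: copy the glyph unchanged.
            out.append(glyphs[i])
            i += 1
    return out


text = ["uni0435", "dieresis", "uF0987", "u100987", "uF0986", "u100986"]
pairs = {
    ("uni0435", "dieresis"): "uni0451",
    ("uF0987", "u100987"): "myglyph1",
    ("uF0986", "u100986"): "myglyph2",
}
triple = {("uni0451", "myglyph1", "myglyph2"): "myglyph3"}

# All four rules in one lookup: the triple rule never sees its inputs.
single_lookup = apply_ligature_lookup(text, {**pairs, **triple})
# Pairs in a first lookup, the triple in a second lookup applied afterwards.
two_lookups = apply_ligature_lookup(apply_ligature_lookup(text, pairs), triple)

if __name__ == "__main__":
    print(single_lookup)
    print(two_lookups)
```

In this model the single lookup stops at uni0451 + myglyph1 + myglyph2, while two lookups in sequence reach myglyph3.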

PhysicallyReal's picture

Actually, I don't even know what software I should use to add a substitution feature to the font.
For example, say I just want to substitute uni0435 + dieresis by uni0451, and nothing else.
If I use MS VOLT or FontForge, what exactly should I do so that my browser can access this feature (using the CSS "font-feature-settings" property)?

Thomas Phinney's picture

I don't see any reason that your code and approach shouldn't work. If it works without applying any explicit features in Firefox, but not in Chrome, then it sure sounds like Firefox is behaving correctly, and Chrome has a significant bug in not applying required features.

(On the side, "substitute uni0435+dieresis by uni0451" doesn't sound quite right. The dieresis encoded at U+00A8 is not the combining dieresis, that would be U+0308.)
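The difference between the two dieresis characters can be checked with Python's standard `unicodedata` module. This is only an illustration of the Unicode data, not of anything a font or browser does: U+0435 followed by the combining U+0308 composes to U+0451 under NFC normalization, whereas U+0435 followed by the spacing U+00A8 stays as two characters.

```python
import unicodedata

IE = "\u0435"                   # CYRILLIC SMALL LETTER IE
IO = "\u0451"                   # CYRILLIC SMALL LETTER IO
SPACING_DIERESIS = "\u00A8"     # DIAERESIS (spacing)
COMBINING_DIAERESIS = "\u0308"  # COMBINING DIAERESIS

# NFC composes a base letter with a following combining mark when a
# precomposed character exists; it leaves the spacing dieresis alone.
with_combining = unicodedata.normalize("NFC", IE + COMBINING_DIAERESIS)
with_spacing = unicodedata.normalize("NFC", IE + SPACING_DIERESIS)

if __name__ == "__main__":
    print(with_combining == IO)   # True
    print(len(with_spacing))      # 2
    # General categories: Sk = modifier symbol, Mn = nonspacing mark.
    print(unicodedata.category(SPACING_DIERESIS),
          unicodedata.category(COMBINING_DIAERESIS))
```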

charles ellertson's picture

I believe the dieresis at U+00A8 is what most of us think of as a "legacy" dieresis. It dates back to when only the first 128 characters (the 7-bit ASCII range) were fixed. In many 8-bit extensions of ASCII, there was a dieresis at A8.

It should live on only for the proper reading of older texts, before Unicode. (Unless there is any need for a spacing modifier dieresis, as I don't believe Unicode has one.) Any new use as a combining diacritical should be U+0308.

Having said that, most font designers don't bother with including any of the combining diacritics, which makes their fonts not proper Unicode.

Thomas Phinney's picture

> which makes their fonts not proper Unicode

I was in 100% agreement with you until that phrase. I might say instead "which means they have a pretty stupid character set."

Michel Boyer's picture

> If it works without applying any explicit features in Firefox, but not in Chrome, then it sure sounds like Firefox is behaving correctly, and Chrome has a significant bug in not applying required features. [Thomas Phinney]

Thomas, concerning the characters in the supplementary private use areas: at the end of the Unicode Consortium file http://www.unicode.org/Public/UNIDATA/NamesList.txt it is explicitly written

@+ These codes are intended for process-internal uses, but are not permitted for interchange.

If those codes are provided as input to Chrome and the consortium says they are not permitted for interchange, why would Chrome not be allowed to consider them as garbage?

Michel Boyer's picture

Concerning the original question, I suggest having a look at "Opentype features in web browsers" by Gustavo Ferreira on Typotheque. There is a series of tests, in particular for liga (standard ligatures). Unfortunately, the syntax varies from browser to browser. I managed to get the ligatures working with Chrome on Mac OS X 10.8 (with a webkit font) using the following settings (a list copied from the site so as to cover most browsers and situations)

font-feature-settings: "liga" on;   
/* vendor-prefixes */
-moz-font-feature-settings: 'liga=1';
-ms-font-feature-settings: "liga" on;
-webkit-font-feature-settings: "liga" on;
-o-font-feature-settings: "liga" on;

and Chrome even accepted those weird ligatures I imagined as some kind of test after reading the initial post:

lookup ligaStandardLigaturesinLatinloo {
  lookupflag 0;
    sub \A \B  by \C;
    sub \uF0987 \u100987  by \Z;
} ligaStandardLigaturesinLatinloo;

feature liga {
  script DFLT;
     language dflt ;
      lookup ligaStandardLigaturesinLatinloo;
  script latn;
     language dflt ;
      lookup ligaStandardLigaturesinLatinloo;
} liga;

The font was made with FontForge (with the graphic interface) and the above feature file is the output from FontForge (and such a feature file can also be used in FontForge with Merge feature file; I prefer using feature files for contextual ligatures and alternates but for liga, the graphic interface works comfortably). As a conclusion, ligatures are not automatic, and also Chrome does not reject disallowed input (I am not sure I like that).

All that being said, I must confess I am also puzzled by the original post. The letter uni0451 is normally obtained from a keyboard by first typing a key corresponding to dieresis and then the key corresponding to uni0435; the net effect is that a single character, uni0451, is entered into the text. There is no need for a feature file for that: the keyboard layout sends the right character, uni0451, to the text editor. Everything works fine as long as those three characters are in the font (and you have the appropriate keyboard).

charles ellertson's picture

@Thomas,

I agree it's "stupid" ("not the sharpest knife in the drawer" is gentler, maybe?), but I'll stick with it not being proper Unicode, for the following reason. Remember, the consortium put their foot down on assigning any more codepoints to accented characters. IIRC, at that time, they allowed that they'd never intended to give so many accented characters codepoints anyway; such characters were supposed to be constructed using base characters and combining diacritics.

Thomas Phinney's picture

Michel: Look at the codepoints that comment is applied to more carefully. That comment is applied to explicitly unassigned blocks which are never supposed to be used. It is NOT applied to the three Private Use Area code blocks. At least, that's how I read it. As the characters under discussion are in PUA, and not in any of these reserved non-character blocks, I don't see how this is relevant.

Examples of reserved non-characters: FDD0, 2FF80-2FFFF, EFF80-EFFFF....

The three PUA blocks:

U+E000..U+F8FF (though some of these have been used for Variation Selectors and other oddities)
U+F0000..U+FFFFD (Supplementary Private Use Area-A)
U+100000..U+10FFFD (Supplementary Private Use Area-B)
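The distinction can be made concrete with a small sketch. It hard-codes the three private-use ranges listed above and the Unicode Standard's definition of the 66 noncharacters (U+FDD0..U+FDEF plus the last two code points of every plane); the function names are hypothetical.

```python
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),       # Private Use Area (BMP)
    (0xF0000, 0xFFFFD),     # Supplementary Private Use Area-A
    (0x100000, 0x10FFFD),   # Supplementary Private Use Area-B
]


def is_private_use(cp: int) -> bool:
    """True if cp lies in one of the three private-use ranges."""
    return any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES)


def is_noncharacter(cp: int) -> bool:
    """True if cp is one of the 66 Unicode noncharacters."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF at the end of each plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


if __name__ == "__main__":
    for cp in (0xF0987, 0x100987, 0xFFFFE, 0x10FFFF, 0xFDD0):
        print(f"U+{cp:04X}: private use={is_private_use(cp)}, "
              f"noncharacter={is_noncharacter(cp)}")
```

By this check the code points in the original post (U+F0986, U+F0987, U+100986, U+100987) are private-use characters and not noncharacters.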

Michel Boyer's picture

The characters used were indeed not touched by the "ban" which concerned only a limited portion of the areas namely (from the last lines of NamesList.txt)

@@	FFF80	Supplementary Private Use Area-A	FFFFF
@@	10FF80	Supplementary Private Use Area-B	10FFFF

I am afraid I was also misled by the meaning I was giving to the word "private" (that has a completely different meaning in cryptography).

Thomas Phinney's picture

Ah! I see.

Yes, in Unicode the "Private Use Areas" are special zones for people to use the character codes for basically whatever they want. Apps are generally expected to be able to handle and process PUA characters. They are not unusual in the field—particularly the first Private Use Area in the Basic Multilingual Plane has been used pretty extensively....

Michel Boyer's picture

I would rather call those areas "Public playground" than "Private Use Area". I wonder where those who chose the terminology are from.
