FontLab - accents

br.david's picture

Greetings! I'm a real newbie here.
I've got a homegrown font (which I didn't make) that I need to modify. It was developed before Unicode and UTF-8 opened up the millions of character possibilities for foreign languages. Specifically, Lithuanian. The Lithuanian letters were placed in unneeded slots, like certain mathematical symbols, currency symbols, etc.

It's easy enough to take the letters and put them in their correct Unicode places. HOWEVER, there are no Unicode-defined characters for the Lithuanian letters with accents. How does one go about making this happen? I know it's possible because there are other fonts that can do it. If someone can help, I can explain why using those other fonts isn't a good option.

If anyone's confused, here's an example of something we need to do. There is the "normal" ligature OE for French and Latin. It's a great example, because there is no Unicode definition for an accented OE. For the AE ligature, there is!! How do you go about adding an accent mark such as ´ to an oe ligature? Getting the keyboard to display it is another matter. But that part I can handle. I just don't know how to make a font produce the needed glyph.

Any chance of getting some help? Would be MOST MOST appreciated.
Thanks!

hrant's picture

Unicode defines things like Wookie transliteration (I'm kidding, but you get the picture) so I'm sure Lithuanian is in there.

hhp

George Thomas's picture

Lithuanian uses AE, but not OE according to the several references I have.

This PDF from Michael Everson might be of help:
http://www.evertype.com/alphabets/lithuanian.pdf

For even more information, visit:
http://en.wikipedia.org/wiki/Lithuanian_alphabet

charles ellertson's picture

Probably, since Lithuanian uses free accents, if one were starting with a fresh sheet, the best way to handle the accents would be to use the mark and mkmk features of OpenType. You can read the OpenType spec or do a Google search for the specifics of the technique. This strategy implements a feature of OpenType; the only aspect affected by Unicode is that the accents need to be in the font, properly encoded as combining diacritics (most of which are found in the 0300 to 036F range).

see http://www.unicode.org/charts/PDF/U0300.pdf

The positioning of the accents is left to mark and mkmk.

But mark and mkmk do not take advantage of precomposed characters, that is, a glyph that already exists in a font. What you're describing in the old font seems to be exactly that, one where the accented characters are already "built." With this approach, you use the ccmp feature of OpenType. Once again, the accents need to exist in the font as proper combining diacritics, but they are "positioned," if you want, by having a character already composed. This character will have a *name*, but no Unicode number. However, its *name* will specify the Unicode-character components.

For example: an m with a tilde will have, as its name,
uni006D0303

The components have meaning: "uni" indicates that a Unicode number is being used for the name, 006D is the Unicode number for the Latin lowercase m, and 0303 is the Unicode number for the tilde combining diacritic. When encountered in a text stream, any Unicode-savvy program or application knows exactly what is going on, namely, there is an m with a tilde accent over it.
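Building such names is mechanical; here is a minimal Python sketch (my own helper, not a FontLab function) that assembles a "uniXXXXYYYY" production name from a character sequence:

```python
import unicodedata

def uni_name(chars):
    """Build a 'uniXXXXYYYY' production glyph name from a character sequence
    (works for BMP characters, which use four hex digits each)."""
    return "uni" + "".join(f"{ord(c):04X}" for c in chars)

# m + combining tilde -> the precomposed glyph's production name
print(uni_name("m\u0303"))         # uni006D0303
print(unicodedata.name("\u0303"))  # COMBINING TILDE
```

Any Unicode-savvy tool can then decompose the name back into its component code points.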

To point to the characters, you need a ccmp feature -- again, you can look this up in the OpenType spec. The particular ccmp statement that will cause this precomposed character to be used with any application program that supports OpenType would be

sub m tildecomb by uni006D0303;

and with the feature statement

feature ccmp { # Glyph Composition/Decomposition
# DEFAULT
...;
...;
sub m tildecomb by uni006D0303;
...;
...;
} ccmp;

It all sounds harder than it actually is, primarily, I suppose, because one has to get into OpenType and OT features.

To use such characters from a keyboard, however, is dirt simple. In either case (whether a mark or a ccmp strategy has been used), you simply type the characters in sequence, and the font does the rest. And Unicode preserves syntactic meaning for any future use of the text file.

Do bear in mind that the combining diacritic tildecomb at U+0303 is a different character from the (now) spacing modifier tilde found in so many old PostScript Type 1 fonts. That one, which has width, maps in Unicode to U+02DC -- a different character.
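The distinction is easy to check from Python's standard unicodedata module; a quick sketch:

```python
import unicodedata

# U+0303 is a combining mark: canonical combining class 230 ("above")
print(unicodedata.name("\u0303"))       # COMBINING TILDE
print(unicodedata.combining("\u0303"))  # 230

# U+02DC is an ordinary spacing character: combining class 0
print(unicodedata.name("\u02DC"))       # SMALL TILDE
print(unicodedata.combining("\u02DC"))  # 0
```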

Hope this helps...

Michel Boyer's picture

Here are the required code points for Lithuanian as listed in

http://github.com/Extensis/lang/blob/master/languages.xml

from Thomas Phinney's link http://blog.webink.com/custom-font-subsetting-for-faster-websites/

U+0104  Ą   latin capital letter a with ogonek
U+0105  ą   latin small letter a with ogonek
U+010C  Č   latin capital letter c with caron
U+010D  č   latin small letter c with caron
U+0116  Ė   latin capital letter e with dot above
U+0117  ė   latin small letter e with dot above
U+0118  Ę   latin capital letter e with ogonek
U+0119  ę   latin small letter e with ogonek
U+012E  Į   latin capital letter i with ogonek
U+012F  į   latin small letter i with ogonek
U+0160  Š   latin capital letter s with caron
U+0161  š   latin small letter s with caron
U+016A  Ū   latin capital letter u with macron
U+016B  ū   latin small letter u with macron
U+0172  Ų   latin capital letter u with ogonek
U+0173  ų   latin small letter u with ogonek
U+017D  Ž   latin capital letter z with caron
U+017E  ž   latin small letter z with caron
U+201E  „   double low-9 quotation mark
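If it helps, the list above can be cross-checked against Python's unicodedata module; a small sketch:

```python
import unicodedata

# The Lithuanian code points listed above (Latin Extended-A plus one quote mark)
lithuanian = [0x0104, 0x0105, 0x010C, 0x010D, 0x0116, 0x0117,
              0x0118, 0x0119, 0x012E, 0x012F, 0x0160, 0x0161,
              0x016A, 0x016B, 0x0172, 0x0173, 0x017D, 0x017E, 0x201E]

for cp in lithuanian:
    print(f"U+{cp:04X}  {chr(cp)}  {unicodedata.name(chr(cp)).lower()}")
```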

From the Lithuanian keyboard I have on my Mac, I guess there are other useful code points (there is for instance a dead key for a grave accent).

Hmm. I just checked, and the dead key simply seems to serve to enter digits that don't appear where we would expect them.

Michel Boyer's picture

According to this proposal to the Unicode consortium: http://std.dkuug.dk/JTC1/SC2/WG2/docs/n4191.pdf
there are still 35 accented Lithuanian characters that are waiting to be added to the charts.

That file suggests a decomposition. For instance the latin capital letter u with macron and tilde would have decomposition U+016A, U+0303. I guess you can rely on that for your choices and do as Charles suggested.
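A quick check with Python's unicodedata module (my addition) confirms that canonical composition has nothing to compose such a sequence into:

```python
import unicodedata

s = "\u016A\u0303"   # Ū (U with macron) + combining tilde
print(unicodedata.normalize("NFC", s) == s)   # True: no precomposed form exists

# Full canonical decomposition expands the Ū itself:
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", s)])
# ['U+0055', 'U+0304', 'U+0303']
```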

If you already have text files, you then recode them using those decompositions.

hrant's picture

Wow. There's so much junk in Unicode – how come a real language is poorly supported? Are those characters extremely rare? Or is it that they're composable hence low-priority?

Also, how could foundries, like here
http://www.underware.nl/support/language_support/Which_languages_do_your...
claim to support it?

hhp

Michel Boyer's picture

Well, if I rely on the decomposition that I found in the file above (and if there is no bug in my script) here are those "characters" rendered with your browser and Georgia

Ą́ ą́ Ą̃ ą̃ Ę́ ę́ Ę̃ ę̃ Ė́ ė́ Ė̃ ė̃ i̇̀ i̇́ i̇̃ Į́ į̇́ Į̃ į̇̃ J̃ j̇̃ L̃ l̃ M̃ m̃ R̃ r̃ Ų́ ų́ Ų̃ ų̃ Ū́ ū́ Ū̃ ū̃

I guess the foundries may just add precomposed characters for better rendering.

charles ellertson's picture

Michel, the real question would be the need to encode for the diphthongs. Unless every occurrence of a letter pair is the diphthong in a language, there's a problem -- as far as I can tell, OpenType doesn't provide a way to substitute "just some."

http://en.wikipedia.org/wiki/Lithuanian_language#Diphthongs

As far as new accents over Latin letters go, I believe Unicode is on record as saying "no more." That the font publishers do not populate the combining diacritics is not Unicode's fault; they have provided a mechanism within the Standard by which syntactic meaning can easily and always be preserved. The "accenting" mechanism itself can be done in several ways -- in the font, or à la TeX, etc. Not Unicode's responsibility.

In fact, I now think Unicode has gone too far. People have come to expect new codepoints just to provide for accented Latin letters. I used to be one of them, until I began working with Native American languages. One begins to realize just how many new characters would have to be added if each accented or multiply accented vowel were given its own codepoint.

For example, I've been working on Kiowa lately. There would need to be over 100 new codepoints if all the accented vowels for just Kiowa (McKenzie orthography) were added as single characters...

If you think on it, fully half of Latin Extended Additional isn't needed. Pretty much all of Latin Extended A, except for "legacy" reasons.

hrant's picture

Indeed, they should have realized from the get-go that non-intersecting compound characters should never have been encoded.

hhp

Michel Boyer's picture

By the way, the proposal is buggy; they propose these letters with these glyphs


and here is the decomposition for "latin small letter i with dot above and grave" they propose:

U+0069  i   latin small letter i
U+0307  ̇   combining dot above
U+0300  ̀   combining grave accent

More precisely, here is a copy paste from their pdf:

U+HH0C;LATIN SMALL LETTER I WITH DOT ABOVE AND GRAVE;Ll;0;L;0069 0307 0300;;;;N;;;00CC;;00CC

Am I dreaming? Misinterpreting? When you see two dots in my post above, it is not your browser's fault, nor that of Georgia.

Michel Boyer's picture

Here are decompositions that appear to give the desired output, but are they robust against all acceptable decompositions and recompositions?

U+0131 U+0307 U+0300      ı̇̀
U+0131 U+0307 U+0301      ı̇́
U+0131 U+0307 U+0303      ı̇̃
U+012F U+0301             į́
U+012F U+0303             į̃
U+006A U+0303             j̃
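As a partial answer to the robustness question, a Python sketch (my addition, using the stdlib unicodedata module) shows that canonical recomposition (NFC) leaves all six sequences unchanged, though NFD is not stable for the į-based ones, since į itself decomposes further:

```python
import unicodedata

seqs = ["\u0131\u0307\u0300",  # dotless i + dot above + grave
        "\u0131\u0307\u0301",  # dotless i + dot above + acute
        "\u0131\u0307\u0303",  # dotless i + dot above + tilde
        "\u012F\u0301",        # i with ogonek + acute
        "\u012F\u0303",        # i with ogonek + tilde
        "\u006A\u0303"]        # j + tilde

for s in seqs:
    assert unicodedata.normalize("NFC", s) == s  # recomposition changes nothing

# NFD is not always stable: į expands to i + combining ogonek
print(unicodedata.normalize("NFD", "\u012F\u0301") == "i\u0328\u0301")  # True
```
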
Michel Boyer's picture

As for new accents over Latin letters goes, I believe Unicode is on record as saying "no more." [Charles Ellertson]

Charles, the problem is that there are probably still systems or applications that do not handle combining diacritics properly.

Edit: And, by the way, the other characters are not rendered properly on my Mac with OS X 10.6.8 and Safari 5.1.10. With the typewriter font, they look OK:

Ą́ ą́ Ą̃ ą̃ Ę́ ę́ Ę̃ ę̃ Ė́ ė́ Ė̃ ė̃ Į́ Į̃ J̃  
L̃ l̃ M̃ m̃ R̃ r̃ Ų́ ų́ Ų̃ ų̃ Ū́ ū́ Ū̃ ū̃

Here is a grab of what I see:

As for the previous post, here is what it gives.

charles ellertson's picture

Yes but...
Michel, this isn't Unicode's problem. That there are application programs that will not correctly render characters is the problem of those programs, not Unicode. It would be different if NO program could properly render them. Unicode, in the end, is a standard, just like ASCII was.

I'll allow there may be some confusion with the i. Maybe the answer is there, & I've missed it. Suppose you have an "i" with a macron below, a macron above, and an acute above that. Should the tittle (dot) show? (And isn't that a function of the language, not the Standard?) But given that, should the name used to deconstruct the glyph use, for its base character, the dotless i or the basic i? Or should it vary, depending on what you want the deconstructed text stream to represent? (I've put in spaces to make reading clearer. No spaces in the name when actually used, of course.)

That is, be named
uni0069 0331 0304 0301, or
uni0131 0331 0304 0301

Or if the tittle is wanted, should it need to be explicit, i.e.,
uni 0069 0331 0307 0304 0301 or maybe?
uni 0131 0331 0307 0304 0301, where uni 0131 (whatever) 0307 is somehow further deconstructed to uni0069?

Anyway, that's the kind of thing the Standard is responsible for, and for all I know, it is there & I just missed it.

Nor, for that matter, does Unicode prohibit preconstructed glyphs -- it just declines to give them a code point.

Finally, even if they were given a codepoint, wouldn't the browser folk then complain about a different problem -- all the glyphs they now have to support, and the attending file size of each font?

hrant's picture

the problem is that there are probably still systems or applications that do not handle properly the combining diacritics.

As Charles says (or at least implies) let those applications suffer for that. Separate the wheat from the chaff. Natural selection. Right now we're on crutches, which means when we fall it's worse for being unexpected.

hhp

Michel Boyer's picture

Separate the wheat from the chaff. Natural selection.

I kind of like that. For me that would first scrap all font editors that can't handle mark and mkmk features.

Michel Boyer's picture

And concerning the character oe with an acute, which does not exist in Unicode: that glyph cannot be found in the font Gentium either. That does not prevent Gentium from rendering decently the glyph oe followed by the character acutecomb (the combining diacritic). Here is how it works; the grabs are of Gentium Basic Bold seen through the eyes of FontForge; the text behind, with the 35 "missing" characters, is simply from TextEdit with the regular font:


On top of the character oe there is a mark, here denoted Anchor-0. For the glyph acutecomb, there is also a corresponding anchor. When the acutecomb follows the oe, the acutecomb is positioned on the oe so that the anchors sit on top of each other.

Those anchors can be moved, copied and pasted in FontForge so as to take care of diacritics for which you do not intend to make precomposed characters.

charles ellertson's picture

BTW & kind of off-topic: does anyone know how to remove a "mark" anchor in FontLab? (Michel, it wasn't until FontLab 5.2 that mark & mkmk were supported. And 5.2 has other bugs; it is still officially in beta, I believe. One reason I've stayed with ccmp...)

Which raises a second question: anyone know of a FontForge build that doesn't crash with Windows 7? With Windows 8? Found one that works with Windows 8.1, but Lord, are FontForge builds fussy!

Michel Boyer's picture

but Lord, are FontForge builds fussy!

Yes, it seems lots of people want to make sure there is no free lunch.

hrant's picture

You're saying somebody is sabotaging FontForge? Who? And how?

hhp

Michel Boyer's picture

I don't think anyone from the FontForge community has ever intended to sabotage FontForge. The problem is that in order to have a fully working version of FontForge, you need to be able to compile it. If you just want to fix anchors, there is a binary for the Mac that can be installed quite rapidly. I don't know about Windows. I personally need more than that: I need the FontForge Python class and Python scripting.

hrant's picture

I mean FontForge rivals. But how?
Anyway now I don't get what you meant.

hhp

charles ellertson's picture

Michel, very interesting. As it has the "under construction" banner, do you know if this is a new work-in-progress site or an older, abandoned site?

Michel Boyer's picture

Charles, the site containing what seems to be the most up to date sources gives http://fontforge.github.io/en-US/ as the "new site" and if you then click on "Download" you get the page above. That is all I know.

Michel Boyer's picture

I just checked on Mavericks (my grab above had been made on OS X 10.6) and the dot disappears on the j and the iogonek in most fonts (but not with Georgia, which appears to be an exception).

The decompositions that work fine with Brill, Charis SIL, and Gentium Plus (I checked with TextEdit and XeLaTeX on Mavericks, and InDesign on OS X 10.6) are

0131 0328 0307 0301   ı̨̇́ idotogonekacute
0131 0328 0307 0303   ı̨̇̃ idotogonektilde
0237 0307 0303        ȷ̇̃ jdottilde

Where the components are:
0131    latin small letter dotless i
0237    latin small letter dotless j
0301    combining acute accent
0303    combining tilde
0307    combining dot above
0328    combining ogonek

The other decompositions (for idotacute, idotgrave, idottilde) still work fine.

I also checked that those are NFD and NFKD decompositions. That implies, I think, that they should pose no problem when transmitted on the internet. If I copy the "composite characters" above from the typophile page and paste them in MorxTester, I get back the string of characters. Making a keyboard for the Macintosh that outputs those strings of characters when you hit a key can easily be done with Ukelele.

I am far from being convinced (that is a euphemism) that giving "0069 0301", i.e. "i acutecomb", as the decomposition for iacute (etc.) in UnicodeData.txt was a good idea; in terms of components (and that is what is relevant with decompositions), iacute should have been given the decomposition "0131 0301", i.e. "dotless i acutecomb", etc.
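Both points above can be verified with a few lines of Python (a sketch using the stdlib unicodedata module): the three sequences are stable under all four normalization forms, and iacute's canonical decomposition does indeed use the dotted i.

```python
import unicodedata

seqs = {
    "idotogonekacute": "\u0131\u0328\u0307\u0301",
    "idotogonektilde": "\u0131\u0328\u0307\u0303",
    "jdottilde":       "\u0237\u0307\u0303",
}
for name, s in seqs.items():
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        assert unicodedata.normalize(form, s) == s, (name, form)

# UnicodeData.txt decomposes iacute to dotted i + acutecomb, not dotless i
print(unicodedata.decomposition("\u00ED"))   # 0069 0301
```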

Birdseeding's picture

As Charles says (or at least implies) let those applications suffer for that. Separate the wheat from the chaff.

I'm sure the Mozilla Corporation will be crying into their beards over the fact that a few dozen Lithuanian dictionary writers have an unsatisfactory experience with their free product. Or whatever. -_-

The "market" can't solve everything. The Unicode standard explicitly protects small-language and minority-application users, and that's one of its strengths.

hrant's picture

It's not about getting blind behemoths to change, it's about "guiding" a given niche of users to better solutions.

The "market" can't solve everything.

Not the financial market by itself, but the cultural "market" can help.

hhp

Thomas Phinney's picture

> (Michel, it wasn't until FontLab 5.2 that mark & mkmk were supported. And 5.2 has other bugs, it is still officially in beta, I believe.

No, the shipping and supported version of FontLab Studio for Windows is 5.2.1. So if you are still using 5.2.0 you should upgrade (it’s free).

We are planning for one more dot release of the Mac and possibly the Windows version of FontLab Studio 5.x before the next full versions come out. If you know of bugs in 5.2.1, please report them if you have not already done so: http://www.fontlab.com/contact-and-support/product-support/problem-report/

charles ellertson's picture

Thomas, is there a list of already-reported bugs somewhere, or should we just report what we've found regardless? I imagine the few I've found in Windows 5.2.1 have already been reported...

charles ellertson's picture

And back to the topic of the thread... or maybe a secondary topic:

A Standard should preserve the syntactic integrity of a file for all users. It doesn't matter if it is a kid on the African plains using a free, basic computer (or in China, or anywhere, though they don't use the Latin alphabet), or the most sophisticated New York advertising agency with high-dollar hardware and software.

And practical things get in the way. The dotless i is in Latin Extended-A, and in ASCII it was surely not in the fixed portion 0-127. What was in the 8th bit was always up for grabs.

So the point is, an "i" with a tittle is better than no i at all. Whether or not the Unicode Consortium should assume the dotless i exists in all operating systems and type formats, I don't know. AFAIK, the only language that uses it is Turkish; for all the others, it is just assumed that should any accent appear over an i, the tittle is not used. So, getting back to the kid in Africa with the hand-out computer, what's important, in terms of the standard, is that he/she knows it is an "i" being used, not that it is rendered beautifully.

The beautiful rendering, as has been suggested, is the work of the various application programs and browsers. When there is a fault, it is there. However, before the people who write those programs can do their work, they do need to know what the standard is, and I'm not sure the Unicode Consortium has made that clear. Anyone know that information?

Michel Boyer's picture

Charles

Is there anyone on this planet of ours who types an i followed by an acutecomb to get the character iacute? So far as I know, there is always a keyboard layout available such that, after keying the right combination, the character iacute is input and displayed; the input is a single character, not two, even if there were two strokes on the keyboard to get it. The decomposition has nothing to do with it. Kids need only learn to use the keyboard. Why should the decomposition in the file UnicodeData.txt be of any concern for kids in Africa? That decomposition is used, so far as I know, only by software people and font designers, and I think it is giving them trouble; to prove me wrong, please don't use an argument from authority.

By the way in TeX (and consequently LaTeX etc) \i is dotless i and to get an idieresis (i tréma, ï) we (used to) type \"\i which asks for a dieresis on a dotless i. Similarly for i circumflex. That was one of the first things we had to learn so as to be able to type text in French with TeX (and I know at least one colleague that still codes her input latex files that way, without using the package inputenc that allows for latin1 or utf-8 input).

Birdseeding's picture

(sorry, misunderstood)

Thomas Phinney's picture

Charles: “how to remove a "mark" anchor in FontLab?”
Right-click or control-click on it, and select "delete" from the pop-up menu.

Michel (& Charles): Perhaps Michel is right that Unicode could or should have made the canonical comp/decomp use the dotless forms of the letters i and j. But they didn’t, and there is no changing it now. In which case, it is incumbent upon Unicode-processing systems (fonts and system-level support) to deal with the problem. Basically, for combining marks on i and j that do not have precomposed forms, some part of the machinery needs to strip the dot.

I would not be surprised if the responsibility was already understood to lie in one court or the other between the fonts and the engines, but I don’t happen to know which.

Té Rowan's picture

Michel Boyer wondered:

Is there anyone on this planet of ours that types an i followed by an acutecomb to get the character iacute?

Yes.

http://www.sunnlenska.is/menning/14779.html

Michel Boyer's picture

Wow! Here is my count in that small text:

    1   A acutecomb
    7   u acutecomb
   17   a acutecomb
   22   i acutecomb
    3   y acutecomb
    9   o acutecomb
    1   e acutecomb

for a total of 60 occurrences of acutecomb. Since acutecomb takes two bytes in UTF-8, each of these accented characters requires three bytes instead of two (the Latin-1 Supplement characters are encoded in two bytes in UTF-8). Is that standard for Icelandic?
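The byte arithmetic is easy to confirm in Python; a quick sketch:

```python
precomposed = "\u00ED"    # í as a single code point (Latin-1 Supplement)
decomposed  = "i\u0301"   # i followed by combining acute accent

print(len(precomposed.encode("utf-8")))  # 2 bytes
print(len(decomposed.encode("utf-8")))   # 3 bytes: 1 for i, 2 for U+0301
```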

Té Rowan's picture

Nope. I have only seen that in Sunnlenska, and then only in some articles. Possibly tied to writer.

Michel Boyer's picture

Basically, for combining marks on i and j that do not have precomposed forms, some part of the machinery needs to strip the dot.  [Thomas]

I just had a look at how Gentium Plus processes the "i combining-dot-above" sequence: it simply puts the combining dot on top of the existing dot which then hides behind the combining dot and is thus saved from being stripped! Same thing with iogonek. Nice and simple.

Thomas Phinney's picture

Sure, that's fine for those, but does not help with adding an acute, grave, circumflex, caron, or other accents.

Also, I *think* that solution will only be reliably correct with TrueType. In some sizes on some rasterizers overlapping paths in PostScript outlines will cause reversed color. So in your example, the dot would disappear.

Michel Boyer's picture

but does not help with adding an acute, grave,

The subsequent accent is positioned with the mkmk feature.


Is that solution not safe for ttf fonts?

Michel Boyer's picture

on some rasterizers overlapping paths in PostScript outlines will cause reversed color. [Thomas]

After the post http://www.typophile.com/node/115342 I had indeed realized that with Inkscape (on a Mac, but that probably makes no difference), non-merged overlapping contours in .otf characters were not rendered correctly. After your comment, I thought that Inkscape might indeed erase the dots; I just modified the Heuristica .otf to check, and with different glyphs there appears to be no problem. What follows is a grab from Inkscape 0.48.2 (on OS X 10.9.3)


Anyway, should it not be the case that a rendering that erases the intersection of two different characters is faulty?

Thomas Phinney's picture

Sorry, my misunderstanding on one point. You wanted to keep the dot on the i, and add the additional accent. I thought those other accents were supposed to replace the dot. So you were trying to use an "i" and an "acute" to make "iacute"... seemed doomed to failure. Which it would have been, had that actually been, you know, what you were attempting. :)

Michel Boyer's picture

Keeping the dot is the whole point... Here is a grab from page 206 of http://www.unicode.org/versions/Unicode6.0.0/ch07.pdf


I personally find all that a bit crazy; I think TeX had it the right way. But you said that we are stuck with the decisions of the Unicode Wise Men...

Thomas Phinney's picture

Well, with two of those four examples, you are not in fact keeping the dot! That's why it is complicated. Oh well, these sorts of things keep us font folks in business, so I guess I shouldn't complain.

Michel Boyer's picture

The j with an arrow is used as the unit vector in the y direction, which means it is a mathematical symbol, and there is a dtls math feature to handle mathematical dotless forms; cf. http://blog.fontlab.com/font-tech/opentype-layout/opentype-layout-featur.... I wonder why it was given as an example and why a dotlessj with an arrow would not be good enough for text.

Michel Boyer's picture

these sorts of things keep us font folks in business

I guess you can save yourself lots of work with a contextual substitution that puts in the context all characters in your font that have canonical combining class 230 (distinct marks directly above); if you have ffPython installed (the Python that comes with the latest Mac binaries for FontForge), you can find them and output the class in the format needed for FontLab with just a few lines of Python. With that solution, however, you remove the dot of these four, for which Gentium Plus does not:


I see no property in the unicode data that distinguishes them from the others.
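For what it's worth, plain Python can compute that class-230 set without ffPython; a sketch using only the stdlib (you would still intersect the result with the glyphs your font actually encodes):

```python
import sys
import unicodedata

# All code points whose canonical combining class is 230, i.e. marks
# placed directly above the base; these form the context class for the
# dot-stripping substitution.
above = [cp for cp in range(sys.maxunicode + 1)
         if unicodedata.combining(chr(cp)) == 230]

print(0x0301 in above)   # True: combining acute attaches above
print(0x0328 in above)   # False: ogonek attaches below (class 202)
```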

Birdseeding's picture

I can vaguely see the logic in i + acutecomb producing iacute, because that's a sort of semantically logical way to think about it in some languages, e.g. Hungarian: the acute accent is the long-vowel mark, so semantically an i with an accent is thought of in the same way as an o or a with an accent. No-one thinks "o-with-an- accent, a-with-an-accent, i-without-a-tittle-but-with-an-accent...".

(Of course, then ő should just be ö + acutecomb as well. For Hungarian. And it all breaks down when other character systems are introduced... :D)
