building characters with multiple accents

charles ellertson's picture

Perhaps this quandary has no real answer, but discussion's still helpful.

A publisher recently asked me about a character they needed: an "a" with a macron above, and a tilde above that. Their inclination was to start with an amacron (U+0101) and add a combining tilde, naming the result uni01010303. My inclination was that since we are dealing with a character for which there is no Unicode index, it should be expressed completely decomposed -- that is, named uni006103040303.

Why does it matter? Well, prosaically, since we use FontLab, ccmp is the only OT feature available to us for forming such characters. Listing all possible combinations in the ccmp feature is exceedingly time-consuming, since one should also include all the characters in Latin Extended-A, Extended-B, and Latin Extended Additional. Why? Once you start providing for character strings in canonically correct Unicode, you're sort of obligated to finish. And we have had manuscripts come in where authors expressed, say, aacute as an *a* plus an *acutecomb* combining diacritical. What's wrong with that?
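Nothing, as far as Unicode is concerned -- which is exactly the problem: each accepted spelling needs its own line. A minimal sketch of what providing for just two of them looks like in feature syntax, with the combining marks shown under uniXXXX names (substitute whatever your font actually calls them) and the macron-plus-tilde target being a glyph added to the font with no codepoint:

    feature ccmp {
        # an author's a + combining acute becomes the ordinary precomposed aacute
        sub a uni0301 by aacute;
        # a + combining macron + combining tilde becomes the custom glyph,
        # named for its fully decomposed sequence
        sub a uni0304 uni0303 by uni006103040303;
    } ccmp;

Every further spelling you decide to accept -- amacron plus the combining tilde, the marks in the other order, and so on -- is another line, for every base letter.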

Order is just as bad. Within a plane, Unicode requires inside-out ordering. But Unicode doesn't specify the order of the planes. Following a suggestion of 3.0, I prefer bottom to top. Once again, with ccmp, order counts. If you provide for any order, the number of lines grows uncomfortably large.
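To put a number on it, take a single hypothetical target: an a with dot below (U+0323) and macron above (U+0304), named for its decomposed, bottom-first sequence. Accepting both plane orders already doubles the lines for that one character -- again a sketch, with uniXXXX mark names:

    feature ccmp {
        # below plane first, then above -- the order I prefer
        sub a uni0323 uni0304 by uni006103230304;
        # the same character with the planes entered in the other order
        sub a uni0304 uni0323 by uni006103230304;
    } ccmp;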

I haven't used it, but I suspect mark and mkmk have at least some of these issues.

Moving beyond OT fonts, how do people search for such characters in a text file? Is a reader just supposed to search for all possible combinations?

Has anyone else grappled with this?

oldnick's picture

Not to disrespect you in any way, but aren't you overthinking this problem a bit? If the publisher wants a single exception character, why not simply create it in a seldom-used slot -- such as florin or currency -- and use FontLab's Get Component function to build it? And, no disrespect intended to authors either, but since when do they dictate how accented characters are to be built? It seems to me that such decisions should lie with typographical professionals.

Khaled Hosny's picture

I've no experience with Latin accents, but stacked vowel marks are common in Arabic and the order is always bottom to top (some fonts try to account for the reversed order being the same, but IMO this is wrong; if one for some odd reason wants marks stacked backwards, one should be able to do that). In my fonts I usually build stacked marks using the 'mkmk' feature so I don't have to foresee every possible combination.
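In outline the approach looks something like this -- a rough sketch in Latin terms since that is what's under discussion, with made-up anchor coordinates and uniXXXX mark names:

    markClass [uni0303 uni0304] <anchor 0 0> @TOP_MARKS;

    feature mark {
        # attach the first above-mark to an anchor on the base letter
        pos base a <anchor 250 680> mark @TOP_MARKS;
    } mark;

    feature mkmk {
        # attach any further above-mark to the mark beneath it,
        # so the stack grows without enumerating combinations
        pos mark [uni0303 uni0304] <anchor 0 220> mark @TOP_MARKS;
    } mkmk;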

charles ellertson's picture

@Nick:

If the only task of a font is to put ink on paper, you might have a point -- except for that nonsense about leaving things to the "typographic professionals." But I prejudge. If by typographic professionals you mean the people who use type, I'll have more sympathy. Not what you meant, right?

Text in the 21st century is not only about ink on paper. A "repurposed" text can't even assume a common font these days.

As to authors, yeah, right. "I'm sorry Mr. (Stephen) King, we can't accept your latest manuscript, since your way of entering characters doesn't agree with our typographic professionals. You'd better find another publisher. Please return the $500,000 advance."

@Khaled:

As I read section 2.11 (Unicode 5.0), esp. fig 2.21 the order is inside out, not top down.

See also 3.11, Canonical Ordering Behavior, particularly P2. Since they say [guideline], I suppose it is not a *requirement*. But sure as there are little green apples, somebody having to search a text for a word is going to appreciate a consistent ordering -- for Latin-based texts anyway.

oldnick's picture

A "repurposed" text can't even assume a common font these days.

Then your publisher is on a fool's errand, unless all possible fonts contain the features you propose...

Khaled Hosny's picture

As I read section 2.11 (Unicode 5.0), esp. fig 2.21 the order is inside out, not top down.

Right, I was only thinking of stacked marks above the base, so inside out should be more generalised.

Khaled Hosny's picture

Then your publisher is on a fool's errand, unless all possible fonts contain the features you propose...

Even if they don't have a proper font, the text still carries the intended semantics, but no one can ever tell that a ₣ in the middle of a word was supposed to be "a with macron above and tilde".

Theunis de Jong's picture

.. since when do they dictate how accented characters are to be built? It seems to me that such decisions should lie with typographical professionals.

I don't think the authors dictate how the characters ought to be built. I'm in the same branch as Charles is (linguistics), and it's certainly not uncommon to have some text discussing a language containing multiple stacked accents -- his a-macron-tilde sure sounds familiar. The authors don't really care how the characters are built; typically, they use Word, which (it seems) allows stacking, or they use a mark/mkmk-enabled font, or ... they use some self-made font! (And typically, they'd use Nick's hack of replacing any random character with theirs; even more typically, they forget to send that adjusted font together with their text...)

Charles, I've had bad experiences with mark/mkmk, so I settled for this: I created a non-spacing macron/tilde and put it in the PUA. It being non-spacing means it'll almost never center nicely on every character, but a well-targeted manual kern takes care of that. For a few frequently occurring combos I made 'special' versions -- i.e., the e, o, and a are nearly the same width in my base font, so I made one specially for these and have ccmp code replace the 'generic' macron/tilde with this one.
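The ccmp part of that amounts to a single contextual rule -- a sketch with hypothetical glyph names (uniE000 for the generic PUA macron/tilde, uniE000.narrow for the version fitted to a, e and o):

    feature ccmp {
        # after a, e or o, swap in the macron/tilde stack drawn for those widths
        sub [a e o] uniE000' by uniE000.narrow;
    } ccmp;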

I can't recall, off the top of my head, if I set the proper Unicode for macron/tilde -- if there is one, I might have. If there isn't one, however, I should really have done the simple concatenation of the Unicode names. So far, I see no need to actually create glyphs from a_macron_tilde up to z_macron_tilde (and schwa_macron_tilde and opena_macron_tilde and more specialized phonetic characters than almost any type designer would care to support).

charles ellertson's picture

Right, I was only thinking of stacked marks above the base, so inside out should be more generalised.

It may just be that I'm misunderstanding you, but Unicode organizes combining marks into three planes: one above, one below, and one through the middle. Within each plane, you work inside out. Which plane is specified first is not a Unicode requirement, though it apparently is with some applications. Since "those that care" tend to specify plane 1 (bottom) first, I go with them.

Have I misunderstood what you're saying?

* * *

Nick, a lot of text editors -- I believe MS Word is one -- will render a character with combining diacriticals properly as long as the font has the components: the base character and the combining diacriticals. The positioning may lack a certain nicety, but it will be quite clear. I believe MS Word -- some text editors anyway -- will even go and get the needed characters from a system font if the specified font lacks them. Let's suppose that some people are reading primarily for content.

Theunis -- With whatever small clout I may have, I try to steer customers away from fonts whose licenses forbid modification. So with permissible fonts, as each new character comes along, I add it to the font, properly named and with no Unicode index. The ccmp feature will sub the character just fine, so I just add it to the ccmp list.

I suppose if there were only one or two instances of a character, we could do it in InDesign, but another rule is no private use characters unless they have a meaning for a special group. I have to assume everything we set will wind up as an ebook, or at least an XML file.

oldnick's picture

Nick, a lot of text editors -- I believe MS Word is one -- will render a character with combining diacriticals properly as long as the font has the components: the base character and the combining diacriticals. The positioning may lack a certain nicety, but it will be quite clear.

So...what's the problem?

Khaled Hosny's picture

Again, I didn't think about which group comes first; I was only thinking about individual marks in each group (looks like I didn't understand your original question). I hadn't thought about this before, but naturally I'd do , but marks in the middle are a bit trickier -- the first or the last both seem as logical, IMO.

charles ellertson's picture

Guys, I'm not particularly interested in defending *why* I want to do this. Given that the character should be well presented in the typesetting (printing) font and correctly encoded in Unicode in the text file, does anyone have opinions on the original questions?

Igor Freiberger's picture

Charles, considering (1) the need to properly kern and position these multi-diacritical glyphs; (2) the possibility of different text inputs; and (3) the need to avoid unnecessary diacritical combinations, this is the solution I adopted for my own font:

1. I searched for all the accented letters used in the range of languages I focused on.
2. I added all these glyphs as precomposed ones, even the non-coded ones.
3. I did not use the mark and mkmk features.
4. I'm slowly adding all these glyphs to the ccmp feature, using bottom-to-top order (for diacritics above), top-to-bottom (for diacritics below), and first above, then below (for diacritics in both positions).

I guess one must include in the ccmp feature the different ways to name a glyph -- at least the descriptive name (amacroncombtildecomb) and the Unicode method (uni006103040303). This task would drive me crazy, but it seems to be needed.

A good resource about these non-coded accented glyphs is the Adobe Latin 5 table, where you can find more info about this.

charles ellertson's picture

Igor (if I may),

Perhaps I can save you some time. The *named version* is not needed. Two possibilities: (1) you are importing text; in that case, the characters will be in Unicode. (2) Direct entry; again, whatever you type on the keyboard, the characters in the file will be in Unicode. I suppose if you're using an old Type 1 font and a non-OT layout program that wouldn't be true, but 99-44/100 of them won't support the needed accents anyway.

Where *names* come in is with FontLab. For good or ill, FL substitution routines are written using glyph names. So, for example, the acute combining diacritical is U+0301. But in the Adobe Glyph List it is named *acutecomb*. Unless you rename the character uni0301, "acutecomb" -- and only that -- is what must appear in your ccmp feature. There are about five or so combining diacriticals that have a name (other than uniXXXX) in the AGL.
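So, concretely, a line in the feature has to use that name -- for example:

    feature ccmp {
        # U+0301 is "acutecomb" in the AGL, so that -- not uni0301 -- is what the rule must say
        sub a acutecomb by aacute;
    } ccmp;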

What I'm whining about is the lack of a standard for, what, correctly specifying characters without a codepoint and with multiple accents. Since an author has choices, unless you can limit those choices by saying "This is the right way," you have some obligation to provide for all those choices.

For example, if it is fine for author A to use a combining diacritical with an already-accented character, and for author B to use the base character plus two combining diacriticals, you have to provide for both. If it's not fine, you then have at least some excuse to tell the author to fix it, and more importantly, over time, authors will learn (not in my lifetime, of course).

Worse, for characters with accents above and below, which plane comes first is also a choice. That increases the number of ccmp substitution lines quite a bit.
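A sketch of the arithmetic, one character's worth: an s with a dot below and an acute, which as far as I know has no codepoint of its own. Cover the two fully decomposed plane orders plus the two precomposed starting points, sacute (U+015B) and s with dot below (U+1E63), and you are at four lines before you reach the next character (uniXXXX mark names again; substitute your font's):

    feature ccmp {
        sub s uni0323 uni0301 by uni007303230301;   # decomposed, below plane first
        sub s uni0301 uni0323 by uni007303230301;   # decomposed, above plane first
        sub sacute uni0323 by uni007303230301;      # precomposed sacute plus dot below
        sub uni1E63 uni0301 by uni007303230301;     # precomposed s-dot-below plus acute
    } ccmp;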

None of this is too bad if you're limiting yourself to European languages. But for people like Theunis in linguistics, or me who has to work with linguistics plus the Native American languages, Asian languages, etc., the number of lines in the ccmp feature goes up dramatically.

What I was hoping for was that someone like John Hudson or Adam or Karsten etc. would post, indicating either any emerging conventions, or a correct procedure, if there actually is one.

I do think you can insist on the inside-out ordering, as that is in Unicode.

Igor Freiberger's picture

Thanks Charles! You surely saved a lot of time.

Let me add two questions:

1. If you set ccmp substitutions in ascending order, wouldn't the issue with author A/B be avoided? I mean, if you already have the substitution a + macron = amacron in the code, won't a later amacron + tilde = amacrontilde in the same feature also act as a + macron + tilde = amacrontilde? As far as I understand, descending order in OT features is used exactly to avoid this kind of chained substitution, but in this case it's desired. If this is possible, it may save many lines of code (a sketch of what I mean follows these two questions).

2. There are the combining diacritics and also the spacing modifier ones (the five or so named accents you mentioned). I believe the proper way to handle ccmp is to always use the combining diacritics, ignoring the spacing modifiers. But for some keyboard layouts the input may use the spacing modifiers, so I'll need both substitutions (a + acute and a + acutecomb). Is this true?
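Something like this is what I have in mind for question 1 -- just a sketch with placeholder glyph names; I don't know whether the two rules can sit in one lookup or whether each needs its own for the second to see the output of the first, so I've split them here:

    feature ccmp {
        lookup COMPOSE_FIRST {
            sub a uni0304 by amacron;
        } COMPOSE_FIRST;
        lookup COMPOSE_SECOND {
            # if this sees the amacron produced above, it also covers a + macron + tilde
            sub amacron uni0303 by amacrontilde;
        } COMPOSE_SECOND;
    } ccmp;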

Now I see that building a font with support for all languages that use the Latin script is a kind of trap I set for myself. But now I'm committed to it. I hope the font will be useful for professionals who work with linguistics.

Khaled Hosny's picture

For example, if it is fine for author A to use a combining diacritical with an already-accented character, and for author B to use the base character plus two combining diacriticals, you have to provide for both. If it's not fine, you then have at least some excuse to tell the author to fix it, and more importantly, over time, authors will learn (not in my lifetime, of course).

Some fonts, like Gentium, have a 'ccmp' feature to decompose accented characters, and then all mark positioning is done by the 'mark' and 'mkmk' features; so in addition to not having to worry about all possible combinations, a sequence like amacron + tilde is equal to a + macron + tilde, since the former will be decomposed to the latter anyway.
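The decomposition side is just a handful of one-to-many substitutions -- a sketch of the idea, not Gentium's actual code:

    feature ccmp {
        # break precomposed letters into base + combining mark,
        # then leave all positioning to mark/mkmk
        sub amacron by a uni0304;
        sub aacute by a uni0301;
        # ...and so on, one line per precomposed letter in the font
    } ccmp;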

charles ellertson's picture

Khaled -- thanks for that. Best I remember, older SIL fonts (Charis, maybe older versions of Gentium) used ccmp to combine all base + accent sequences into precomposed characters in Latin Extended-A, Extended-B, and Latin Extended Additional. I forget how they handled characters without a codepoint. In an ideal world, all fonts would first deconstruct characters, then remap them.

This gets into expected workflows. One of the things so many font designers seem not to know is the workflows of the people who use type. This even occurs with EULAs -- I've had several interesting discussions with people at Adobe who seem unaware of typical production workflows in publishing. But that's a different issue.

Consider: a manuscript is submitted by an author. It then goes to an editor. Next stop is the designer, and on to the typesetter. The result is "first proof." There will be a second proof, usually a third, sometimes more. The result is printed. At this point, the only file that is correct, that has been maintained through all the changes, is the final typeset file. Moreover, the printed book is no longer the end product of all this work. If there is an end product, it is that final (typesetting) file. Further uses will be made of that file, perhaps an ebook (perhaps even several forms of ebooks). Perhaps an html file intended to be read in a browser, perhaps a searchable xhtml file for research, etc. etc.

Couple of points: In modern times, it is rare for the designer to even be aware of all the characters used in the text. Rarer still for him or her to check and see if the fonts selected have those characters. At some point, the compositor has to put everything together, resolving the decisions the editor made and the decisions the designer made.

Point two is that the typesetter's decisions now become an issue for whoever is putting together the xhtml file for an ebook or archive/research purposes.

That basically is the workflow. The opportunity to introduce error with characters -- either the wrong one, or the character dropping out -- is uncomfortably high. While no single standard exists, the more we can standardize, the more we can both lessen the incidence of error and make life simpler for the people who have to resolve what remains.

For font designers to assume they can resolve such things is at best naive.

I like the notion that all composites are first broken into their components. But since not all fonts work that way, editors can't assume that. In fact, the vast majority of fonts don't even have the combining diacriticals. At the time the editor does his/her work, they don't know what font will be used -- not for the printed book, not for the ebooks, not for the browsers.

As a jack-of-all-trades (compositor), I don't know the Unicode standard as well as I should. I was hoping there was something in it -- if not a requirement, at least a recommendation -- that would address the original questions and could become a basis for standardizing both editorial and OT feature practices.

* * *

@Igor -- I don't imagine you or anyone else is still reading this post -- I seem to have been caught up in defending what it's about again -- but if so:

The spacing modifiers have a different meaning. They should not be used to form accented characters. You cannot assume an author isn't using the acute, say, as a spacing letter, as does happen in linguistics. It is the old glyph/character distinction.

I'd add that there are other traps here. For example, the raised comma may be the same glyph as the apostrophe, but it *is* a different character. A raised comma is one way to signal a glottal stop in some languages. If you use an apostrophe, then someone else who later uses the files and wants to change the way glottal stops are signaled has one hell of a job. If, on the other hand, you use the raised comma in the spacing modifiers, their task is far simpler.

Practically, the fact that so few fonts cover the spacing modifiers is an issue. Semantically correct encoding is sacrificed to get some black stuff on the white stuff. The result, just now, is that even if you include it in your font, no editor will encode the text file that way. In time, maybe things will change.

k.l.'s picture

"In an ideal world, all fonts would first deconstruct characters, then remap them."

In an ideal world, normalization/decomposition would be done by the OS or application, not by fonts. It is a matter of Unicode (character level, text) rather than OpenType (glyph level, font). The line between them is pretty thin, though.

Igor Freiberger's picture

Charles, thanks for the detailed and useful post (yes, I'm reading it).

But there is a point I do not understand: in most fonts, there aren't combining diacritics. So just spacing modifiers are available to input (for example: e+circumflex instead of e+circumflexcomb). Don't you have to handle these possible text inputs?

Probably I'm missing some aspect about OT processing here.

charles ellertson's picture

Karsten -- Thank you. Of course, you're right. In our shop, we normalize the characters in a text as a part of preparing it for composition. I was thinking of publishers who are setting type in house (sadly, all too common now). They often lack knowledge of, what, the "technical internals" of a text -- especially preparing it to accommodate all the uses a text is put to these days.

They will just have to learn, and come up with yet another house style for their press. Life is easier when there aren't a bunch of conflicting house styles -- but there you are.

Igor: But there is a point I do not understand: in most fonts, there aren't combining diacritics.

True. It's like the old joke:

Patient: "Doctor, I broke my arm in two places."
Doctor: "So, stay out of those kind of places."

Khaled Hosny's picture

I have not read it fully, but I think that Understanding Unicode page might give some directions.

charles ellertson's picture

I'm sure it will. My questions, I'm afraid, concern sections that begin

[Yet to be written. . . .
