Diacritic line below (U+0332)

Igor Freiberger's picture

The "line below" diacritic is used in African and native North American languages. Its codepoint is U+0332. Also called underline, it must not be confused with the typographical underline feature.

Questions this diacritic raises:

1. Must this diacritic match the base letter width? I found three approaches so far: (a) fonts where the line below matches the base letter width; (b) fonts where the line below always has the same width; and (c) fonts where the line below matches thin letters (such as i and l) but has a fixed width for the others.

2. If the line below must follow the base width, how can one distinguish an i or l with line below from an i or l with macron below?

3. When pairing two letters with line below, must the diacritics touch each other in order to produce a continuous line?

Similar issues could be raised about the line above (overline) diacritic, U+0305.

Samples: [sample images not reproduced]

John Hudson's picture

Are you sure the diacritic used in African and Amerindian orthographies isn't U+0331, the combining macron below? That is a much more common diacritic mark than the underline.

With regard to your other question, my interpretation of the combining underline, and the combining overline, is that they should connect to form a continuous line.

Igor Freiberger's picture

John, thanks again.

I'm sure it is U+0332 only for the Pacific Northwest languages, since I have kept contact with scholars who work with them (basically, languages of Yukon and Alaska). For other Amerindian and African languages, I have much less information.

Anyway, the samples I found (again, from Huronia, various Linguist fonts, and Aboriginal Serif from LanguageGeek) seem to use U+0332 and not U+0331.

charles ellertson's picture

A story. When I had to set a book with a number of Kiowa words -- which use a "line below" for some vowels -- I asked the author which was proper: a macron below, or a line below? And if a line, should it be the width of the character it sat below, or have a constant width? I pointed out to him that if a line below with constant width were used, some vowel sequences would, in effect, appear like several characters simply underlined.

His reply was that it should be a macron below. I'd note that in his typescript, he had used an underline.

Point is, you have to probe even scholars. They use what's both easy and available. Then old wives' tales get established. I used to draw characters for the hamza and ayn (for romanized setting of Arabic) that were in keeping with the font. Then I kept getting told by junior-level proofreaders that the proper character was monoline (well, they didn't know that word, but that's what they meant), as in the phonetic half-ring. What they didn't know was that prior to PostScript, that was about as close as we could easily come to a character that represented what was needed. Its use was an unfortunate compromise.

Question: At what point does an unfortunate compromise become accepted orthography? Many of these languages never had a written form until scholars came up with an orthography for a romanized setting. They chose what was easy. And that choice could be affected by whether or not they were using a pencil or a typewriter, as with the differences between Ella Deloria and Buchtel in Lakota.

If we're voting, I vote with John on this one for the proper character, but disagree about touching. Just a standard-deviation decision, based on author responses.

Edit

Actually, if you want to be thorough, you should have both. It's hard to tell an expert in the language that he/she is wrong, even when the scholarly "authorities" disagree on what's correct. IIRC, the only real decision rests on a couple of precomposed characters in Latin Extended Additional. These two could be handled by a stylistic set. Or maybe I don't RC.

Khaled Hosny's picture

The Unicode chart has this comment for U+0305 and U+0332:

connects on left and right

charles ellertson's picture

Khaled, that's a real problem. Unicode doesn't have precomposed characters with U+0332, does it? Yet you can't have a single U+0332 glyph that covers both an "i" and a wide letter, such as a "W", and still connects. If you make it wide enough for the "W", then when used with an "i" it will "overextend" enough to underline other characters. What to do?

You shouldn't use the layout program to underline, as that is not a character but an attribute, and will not preserve syntactical meaning. What you would have to do in the font is to use *ccmp*, not *mark*. And BTW, a number of languages have diacriticals on the character besides the "underline," so you're stuck with ccmp throughout.
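
For illustration, here is a minimal sketch of the ccmp approach, compiled with fontTools' feature-file builder. Everything here is an assumption for the example's sake: the font file, the glyph names, and the precomposed glyphs (which must already exist in the font).

    from fontTools.ttLib import TTFont
    from fontTools.feaLib.builder import addOpenTypeFeaturesFromString

    # ccmp: replace base letter + U+0332 with a precomposed glyph whose
    # line matches the base letter's width. Glyph names are hypothetical.
    FEA = """
    feature ccmp {
        sub i uni0332 by i_uni0332;
        sub l uni0332 by l_uni0332;
        sub w uni0332 by w_uni0332;
    } ccmp;
    """

    font = TTFont("Hypothetical.ttf")
    addOpenTypeFeaturesFromString(font, FEA)
    font.save("Hypothetical-ccmp.ttf")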

Now if most languages use the macron below, the problem is lessened.

Or write a EULA that passes the problem along to the comp, who has to deal with the author(s) in any case. I know, I'm a broken record.

k.l.'s picture

What you write is interesting because it suggests a solution which some applications already use when it comes to space characters (thin space, en-space, em-space, figure space, etc):

Technically, there does not need to be a glyph for Unicode 0332 in fonts. Instead, when an application finds this diacritic, it knows that it needs to underline the preceding base character's glyph(s) and generate an underline that is slightly wider than glyph(s) width. As a bonus, if a glyph is present for Unicode 0332 in a font, the application would measure its yMin and yMax to determine the generative underline's positioning and thickness – but ignore the glyph's actual shape.
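
As an illustration of the measuring step described here (a sketch of hypothetical application behavior, not any existing program's): read the U+0332 glyph's vertical extent with fontTools and derive the generated rule's position and thickness from it, ignoring the outline otherwise.

    from fontTools.ttLib import TTFont
    from fontTools.pens.boundsPen import BoundsPen

    def generated_underline_metrics(font_path):
        """Return (position, thickness) in font units for a generative
        underline, derived from the font's U+0332 glyph if present."""
        font = TTFont(font_path)
        glyph_name = font.getBestCmap().get(0x0332)
        if glyph_name is None:
            return None  # no U+0332 glyph; the application needs a fallback
        glyph_set = font.getGlyphSet()
        pen = BoundsPen(glyph_set)
        glyph_set[glyph_name].draw(pen)
        if pen.bounds is None:
            return None  # empty glyph
        _xmin, y_min, _xmax, y_max = pen.bounds
        # Position at the glyph's vertical center, thickness from its
        # height; the application then draws a rule slightly wider than
        # the base glyph's advance width at that position.
        return (y_min + y_max) / 2, y_max - y_min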

I do not think it is a good idea for fonts to provide glyphs for each and every Unicode character. The expectation also reflects a misunderstanding on both the type designers' and the application developers' side. Unicode also encodes operators and higher-level delimiters which may or may not be used by layout engines, and may or may not be present in a character string, yet they either do not need any visual representation or their appearance is better generated on the application side.
E.g. the above-mentioned spaces are more consistent if applications produce them generatively and independently of the fonts in use, which makes 'space behavior' more predictable for the typographer. Put a bit less favorably as regards ourselves: type designers easily get such things wrong and come up with idiosyncratic solutions. Their job, above all, is to design the visual elements for text production. They do not need to know whether this or that special character exists, and they should not be bothered with certain technicalities of text layout. This is not their business. Switching from the type designers' to the fonts' perspective: it is not up to fonts to care for all aspects of text layout. They provide the visual parts needed for text layout.
In this respect, it would be interesting to have some Unicode Companion that not only offers a brief description per character but, for special characters, provides clear recommendations whether and, if so, how to visualize them. Application and/or layout engine developers who claim to produce Unicode-compatible software had better follow it.

Khaled Hosny's picture

I think this issue is similar to the Syriac abbreviation mark, which is an overline that spans a whole word and thus needs to be handled in a special way by the layout engine (AFAIK only MS's Uniscribe handles it correctly). I think the layout engine could handle the combining overline/underline in a similar way, e.g. stretching or shrinking it to fit the width of the underlying glyph, or even replacing it entirely by a horizontal rule (in TeX speak) of thickness and position similar to the diacritic in the font.

Another brute-force approach is to have multiple glyphs, each fitting the width of a group of base glyphs, and to use contextual substitution (I'm trying this for my Arabic font, where I might use the overline for Quranic annotation; it seems I'll need 25 to 50 overline glyphs, depending on how aggressively I round base glyph widths for grouping).
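
The grouping step might be sketched like this (using fontTools here rather than the actual FontForge code mentioned later; the step value is an arbitrary assumption): round every base glyph's advance width up to a bucket, and build one overline variant per bucket.

    from fontTools.ttLib import TTFont

    def overline_width_buckets(font_path, step=50):
        """Round advance widths up to multiples of `step`; each bucket
        would get one overline variant glyph."""
        font = TTFont(font_path)
        buckets = set()
        for name, (advance, _lsb) in font["hmtx"].metrics.items():
            if advance > 0:  # skip zero-width marks
                buckets.add(-(-advance // step) * step)  # ceil to bucket
        return sorted(buckets)

A coarser step yields fewer variants but a looser fit, which is in line with the 25-to-50-glyph estimate above.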

charles ellertson's picture

Karsten, in theory I like that approach. If I understand you, syntactical meaning would be preserved, but the presentation would be handled by the layout program.

The problem is I trust type designers more than I trust the software engineers who write layout programs. Just one small example: with InDesign, the "figure space" is one value. It does not change with the figure character set selected.

One larger example: now that we have EPUB3, you would think the readers would comply. But they're too caught up with device wars. So Apple changes the available fonts, but does not, as best I can tell, allow useful, always-going-to-work font embedding. If you can take a bit of rough language, you might find this enlightening:

http://www.the-digital-reader.com/2011/02/20/an-analysis-of-epub3-and-uh...

Back to layout programs: it's not quite enough if the layout program supports this only for those who can write a script -- too small an audience. There are very few people who have all the needed expertise with languages, typesetting, programming, etc.

(my turn to apologize for a bunch of edits)

Igor Freiberger's picture

First, I thank you all very much for the detailed information and ideas.

You have to probe even scholars. They use what's both easy and available.

Charles, you are right that they use whatever tool is easy and available, and this can cause misinterpretation. But this situation is a bit different.

My attention was drawn to this issue by a comment about my font project. Its author, James Crippen, studies Pacific Northwest languages and knows typography. Since then, I have kept contact with other researchers and got information about both combinations, with macron below and with line below.

The macron below is easy to manage with mark positioning. The problem arises with U+0332, for which I found the three different approaches cited above.

...when an application finds this diacritic, it knows that it needs to underline the preceding base character's glyph(s) and generate an underline

Karsten, this would be nice – if available.

The underline effect cannot emulate the line-below diacritic when there are descenders. Combinations of g, p, q, or j with U+0332 must have the diacritic below the descender, not crossing it. But this is not how the underline effect is applied in word processors.

The only alternative I know of is InDesign's underline styling (the same may exist in Quark). But, again, this reduces the audience.

Although there is no clear solution yet, this is a rich discussion. Thanks.

Sidenote about the type designer's role
IMHO, since a font embraces base characters and diacritics, it should offer a decent result when combining these elements through state-of-the-art technologies. We have precomposed glyphs and mark positioning to produce this, but in some cases these seem not enough – as in the U+0332 case.

If layout applications can supply additional improvements, so much the better. But if a font claims support for a given language, it needs to offer the user a solution that works in at least the current versions of the main word processors.

charles ellertson's picture

Well, TeX will do what Karsten suggests. I'm not sure whether or not InDesign has the resources; I'll have to think on that. Maybe Karsten already knows, and a simple script could be written. I'm not sure you, as a font designer, have to support the language regardless of the text editor chosen -- in mathematics, for example, few text editors can handle multi-level equations without plugins, even if all the pieces are available in the font.

But you're right on at least two counts, it is an interesting problem, and differing orthographies use differing treatments of basic characters. And I can certainly feel the emotional tug to provide the solution inside the font, so any neophyte can have success.

Khaled Hosny's picture

Trying out my "brute force" solution, I wrote a bit of Python code (using FontForge) to automate building the mark glyphs, the variants, and the contextual substitutions (the code is unnecessarily complex because I want to handle glyphs that have other combining marks).

It doesn't seem to work with all OpenType engines, though (2 of the 3 I tested work; I didn't test MS's or Adobe's). An interesting exercise nevertheless.
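
The script itself is not shown; a minimal sketch of just the variant-building step in FontForge's Python might look like the following (the file name, glyph names, and widths are hypothetical, and the contextual lookups are omitted).

    import fontforge
    import psMat

    font = fontforge.open("Hypothetical.sfd")

    # Build unencoded width variants of the combining low line by copying
    # the default mark and stretching it horizontally.
    for target_width in (300, 500, 700):
        name = "uni0332.w%d" % target_width
        font.createChar(-1, name)        # -1 = unencoded glyph
        font.selection.select("uni0332")
        font.copy()
        font.selection.select(name)
        font.paste()
        glyph = font[name]
        xmin, _ymin, xmax, _ymax = glyph.boundingBox()
        factor = target_width / max(xmax - xmin, 1)
        glyph.transform(psMat.scale(factor, 1))  # stretch x only

    # Wiring up the contextual substitution lookups (font.addLookup etc.)
    # and handling bases that carry other marks is left out of this sketch.
    font.save("Hypothetical-variants.sfd")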

John Hudson's picture

Igor, when I'm making a font in which I know the overline or underline combining marks need to work, I include a few different variant widths based on typical letter widths, triggered by context. I also vary the height over the whole run of glyphs depending on whether there are ascenders/descenders or not. And because some wide glyphs may not have a line that covers the full width, I contextually insert connecting strokes between the bases.
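
A sketch of this kind of contextual rule, written as an OpenType calt feature and compiled with fontTools (the class contents and variant glyph names are hypothetical, and the connecting-stroke insertion mentioned above would need additional rules):

    from fontTools.ttLib import TTFont
    from fontTools.feaLib.builder import addOpenTypeFeaturesFromString

    # calt: pick a mark width by the preceding base letter's class, and a
    # lowered variant after descenders. All names are hypothetical and
    # must already exist in the font.
    FEA = """
    @NARROW = [i l];
    @WIDE   = [m w M W];
    @DESC   = [g j p q y];

    feature calt {
        sub @NARROW uni0332' by uni0332.narrow;
        sub @WIDE   uni0332' by uni0332.wide;
        sub @DESC   uni0332' by uni0332.low;
    } calt;
    """

    font = TTFont("Hypothetical.ttf")
    addOpenTypeFeaturesFromString(font, FEA)
    font.save("Hypothetical-calt.ttf")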

Karsten, application over/under lining for diacritic purposes is tricky. You need information about diacritic weight and alignment to do it properly.

jcrippen's picture

I suppose I ought to chime in here about Pacific Northwest orthographies again. The original implementation of underscore diacritics in the Pacific Northwest languages was on (American) typewriters. An underscore was considered to be very easy to use, you simply pressed backspace and then pressed the underscore key (Shift + -). It was also a symbol that wasn’t going to be confused with English, and it had the added advantage of being mnemonic as a ‘lower’ symbol to represent uvular sounds that are articulated further back in the mouth. Since typewriters were designed for English, the underscore was obviously originally meant to be a punctuation or annotation symbol and hence was meant to connect between glyphs so one could type a continuous underline. This would replicate the traditional manuscript underlining meant to indicate italics to the printer.

When the underscore was repurposed as a diacritic there was absolutely no thought given to typography since after all people were using typewriters where typographic nuance was already terribly impoverished. But if you look at the very few publications that had input from professional typographers (e.g. Constance Naish & Gillian Story 1973, “Tlingit verb dictionary”) you find that they replaced the underscores with carefully crafted diacritics which did not extend the full width of the glyph, and more importantly didn’t connect between glyphs. In addition, rather than crossing the descenders like the typewriter underscore would, the typographically sensitive solution shortened the descenders slightly and put the diacritics below them.

It’s less likely that you’d get a chance to see handwriting, but I can unequivocally say that handwritten underscore diacritics are always separate from each other when two of them occur in sequence. (Due to the vagaries of handwriting, the width of the diacritics is wildly variable, ranging from a slightly extended dot to a straight line wider than the actual letter with curls at both ends.) In handwriting it is more common for the diacritic to cross the descender, but they sometimes do appear below.

Now we come to Unicode, which actually makes a distinction between the two kinds of underscore diacritic. This is the first time in history that the two kinds are available at the same time in the same system. U+0332 is meant to connect between glyphs, as the Unicode standard explicitly says. Consequently U+0331 must be meant to act as a diacritic appropriate to a single glyph, and isn’t meant to connect between glyphs. But the scholars and community members who are working to computerize their languages are not Unicode experts, and they go with whatever a “good looking” font happens to implement (or more often they use whatever underlining markup is provided by e.g. Word, HTML, Flash, or InDesign). They also choose something familiar, and if all you’ve ever seen is typewriter underscores then that’s all you know. Unicode was to some extent trying to capture the distinction between textually meaningful underlining (as opposed to markup) and the underscore as a single-glyph diacritic. That’s a distinction that was never made within the same document, but now all document preparers are forced to make the decision between the two. Most don’t even know that there was ever a decision to be made.

The point I’m getting at is that this is a situation where the font designers have to take charge, rather than rely on some community consensus that will probably never arise. If you make fonts where U+0332 doesn’t do what people expect but also make U+0331 available, then people will switch to it. If you make a compromise between them then people won’t switch and we’ll be left with both poor typography and incompatible encoding of fundamentally the same orthographic symbols. In reality this depends a bit more on keyboard layout designers than on font designers, since the layout is what makes the diacritics easily available for people. But it’s rather useless if you have a symbol you can type but which won’t print, so the fonts need to have good support too.

There are actually a few orthographies where the connected underscore is the proper one to use. Dakelh, a.k.a. Carrier, uses underscore diacritics on *digraphs*. Because of this, the underscores are meant to connect together since the two glyphs in a digraph are orthographically part of a single letter. For this use, even if U+0332 doesn’t connect between glyphs it should get nearly to the point of connection so that the spread of the diacritic across digraphs is clear.

The overline was probably initially thought of by Unicode members for things like Roman inscriptions and for manuscript transcriptions. In manuscripts it is often used to indicate elision or abbreviation, so that Χ̅Ρ̅ is an abbreviation for ‘ΧΡΙΣΤΟΣ’, and l̅ for ‘vel’. If one is trying to make an accurate transcription of a manuscript then this cannot be treated like some kind of markup, but is instead an integral part of the text as much as the letters or punctuation. I imagine that an underline has seen similar uses, hence Unicode’s inclusion.

charles ellertson's picture

For the line below a digraph, there is U+035F, the combining double macron below. Used in romanized Indic scripts, too -- kh & gh, isn't it?

As for the general character to be used, this cannot be entirely left up to typographers. I'm in sympathy with the macron below (U+0331), but if a publisher/editor/author insisted, would use the touching line below.

There is a sort of parallel here with the glottal stop in romanized Polynesian. Why should it (the okina) be signaled with the turned comma ("single open quote"), whereas so many other romanized languages use the raised comma ("single closed quote")? And why does one of the South American native languages have to use yet another character to signal a glottal stop? (It is late, I am tired, forgive me for not remembering.) Orthography is custom, in the final analysis.

I'm still of the opinion that the type designer must provide for both; the typesetter can recommend the macron below, but in the end the author/editor (a.k.a. the customers) have the final choice.

FWIW
