Catalan middot (periodcentered) Unicode

peter bilak's picture

I've asked this already on the FL list, but now think that this might be a better platform for my question.

I have a question concerning the Catalan middot in PC TTF. I generated fonts in FL with a standard Latin 1, 1252 encoding, and the middot (aka periodcentered) had the Unicode value 00B7, yet it seems that it doesn't function properly on a PC. Now I am looking around and see that other values, e.g. 22C5 and 2219, are also used. Does anyone know where these values come from? Is it necessary to encode the middot to all three of them?

Is there a known issue with using the middot in various PC applications? Does anyone have experience with this?

Thank you, Peter

hrant's picture

> it doesn't function properly on a PC

What do you mean by "it doesn't function"?
Do you mean that a Catalan keyboard isn't picking it up?

hhp

peter bilak's picture

Correct. The Catalan keyboard cannot access it, and when trying to type it I get only the missing-glyph symbol.

John Hudson's picture

I believe the correct codepoint for the Catalan middot is U+2219. I usually double-encode the /periodcentered/ glyph as U+00B7 and U+2219.

peter bilak's picture

I figured that it is safer to double/triple encode the periodcentered, but is there any particular reason why it is done? John, do you know of any application that needs a different codepoint than the standard U+2219? I am just wondering why the standard Latin 1 encoding would include the 'wrong' codepoint. Or is periodcentered also used in a language other than Catalan?

p

twardoch's picture

Peter,

1. Instead of using arbitrary names for your glyphs ("middot"), I suggest that you use standardized names listed in the Adobe Glyph List For New Fonts 1.1 (in your case, "periodcentered").

You will find the complete Adobe Glyph List For New Fonts 1.1 at http://partners.adobe.com/asn/tech/type/aglfn13.txt

2. For glyph names that are not in the Adobe Glyph List For New Fonts, check whether the glyph is a basic (default) form of a character encoded in the Unicode Standard. If so, use the Unicode-based glyph name:
a) for BMP Unicode codepoints, use the name "uniXXXX" where XXXX are the 4 hexadecimal digits representing the codepoint.
b) for non-BMP Unicode codepoints, use the name "uXXXXX" where XXXXX are the 5 hexadecimal digits representing the codepoint.

3. If a glyph is an alternate form of a character that is encoded in the Unicode Standard or is listed in the Adobe Glyph List For New Fonts 1.1, use the glyph name of the basic form followed by a period, followed by a suffix. For the suffix, use the name of the OpenType Layout feature that you would most likely access that glyph through.

For example, for a small-caps A, use "A.smcp", for a stylistic alternate R use "R.salt", for a swash Q use "Q.swsh", for a superior m use "m.sups", for a tabular 5 use "five.tnum" etc. If there are multiple OpenType Layout features that can be used to access a glyph, pick one of your liking.

4. If a glyph is a ligature that is not found in the Adobe Glyph List For New Fonts, use the glyph names of the glyphs that form the ligature, concatenated using underscore. For example, for a ct ligature, use "c_t", for an ffi ligature use "f_f_i", for a ligature of long s and i use "longs_i".

5. If a glyph is an ornament, a non-textual symbol etc., use a glyph name of your liking.

6. Assign proper Unicode indexes to the glyphs discussed in (1) and (2). If the same glyph represents more than one Unicode character:
a) create multiple glyphs with identical content but different names, and assign one Unicode codepoint per glyph; for example, create a "periodcentered" glyph and encode it as U+00B7, and create a "uni2219" glyph and encode it as U+2219.
b) alternatively, assign multiple Unicode codepoints to your glyph; for example, for "periodcentered", assign U+00B7 and U+2219.

7. For glyphs discussed in (3), (4), (5), you may but don't have to assign custom codepoints from the Unicode Private Use Area (PUA): from U+E000 to U+F8FF. I recommend against assigning these codepoints, but for some applications (e.g. Microsoft Word 2003 for Windows), assigning PUA codepoints may be the only way to display such glyphs in your font.
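
The naming rules in (2), (3), (4) and the double-encoding option in (6b) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the plain dictionary standing in for a font's character map are hypothetical and are not part of FontLab or of any font library.

```python
# Hypothetical helpers illustrating the glyph naming rules above.
# None of these names come from FontLab or any real font library.

def unicode_glyph_name(codepoint: int) -> str:
    """Rule 2: Unicode-based name for the basic form of an encoded character."""
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError("not a Unicode codepoint")
    if codepoint <= 0xFFFF:
        # 2a) BMP codepoints: "uni" + 4 uppercase hexadecimal digits
        return "uni%04X" % codepoint
    # 2b) non-BMP codepoints: "u" + at least 5 uppercase hexadecimal digits
    return "u%05X" % codepoint


def alternate_glyph_name(base: str, feature: str) -> str:
    """Rule 3: base glyph name, a period, then an OpenType feature tag."""
    return base + "." + feature


def ligature_glyph_name(*components: str) -> str:
    """Rule 4: component glyph names joined with underscores."""
    return "_".join(components)


def double_encode(cmap: dict, glyph_name: str, *codepoints: int) -> dict:
    """Rule 6b: point several codepoints at one glyph (cmap is a plain dict here)."""
    for cp in codepoints:
        cmap[cp] = glyph_name
    return cmap


if __name__ == "__main__":
    print(unicode_glyph_name(0x2219))
    print(alternate_glyph_name("A", "smcp"))
    print(ligature_glyph_name("f", "f", "i"))
    print(double_encode({}, "periodcentered", 0x00B7, 0x2219))
```

Option (6a) would instead keep two entries with different glyph names, for example "periodcentered" at U+00B7 and "uni2219" at U+2219, each pointing at its own copy of the outline.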

Regards,
Adam Twardoch
Fontlab Ltd.

hrant's picture

Another "Golden Post" from Adam. Thank you!

BTW, I'm thinking Typophile needs some kind of repository for extremely valuable posts. Not only would it help existing users, but it would certainly attract new ones as well.

hhp

Thomas Phinney's picture

Nice post from Adam. Just a couple of comments.

3) Although it's good to use consistent suffixes, people should be aware that from a technical perspective the suffix is irrelevant.

5) Some symbols have Unicode codepoints, and if so one should use the appropriate uniXXXX/uXXXXX name. Otherwise, as Adam says, one can name them whatever one likes (as long as it doesn't conflict with names that have defined meanings).

Cheers,

T

twardoch's picture

I have added a slightly extended version of my posting to:
http://groups.msn.com/fontlab/tipsandtricks.msnw?action=get_message&mview=0&ID_Message=3065

Other than that, the front page of the FontLab forum:
http://groups.msn.com/fontlab/
just got a brief "recommended reading" list showing the most important postings that deal with FontLab.

Regards,
Adam

John Hudson's picture

> I figured that it is safer to double/triple encode the periodcentered, but is there any particular reason why it is done? John, do you know of any application that needs a different codepoint than the standard U+2219? I am just wondering why the standard Latin 1 encoding would include the 'wrong' codepoint. Or is periodcentered also used in a language other than Catalan?

I'm not sure that the Latin 1 encoding claims support for Catalan. I've looked into this some more, and I think I may be mistaken in my identification of 2219 as the correct codepoint for the Catalan dot. I thought this was noted on the MST website somewhere, but I cannot find the reference now and instead note this, from the Diacritics Design Standards:

L or l Catalan (L with mid dot)
This character is actually a compound character made from a base character and an additional punctuation character. The mid dot is used in the Catalan language to separate two lowercase l or two uppercase L characters that are not part of the same syllable in a word.

The mid dot is commonly made from the overdot diacritic U+02D9 or a character made specifically for this purpose. Often the period U+002E, period centered U+2219 or mid dot U+00B7 are not an appropriate size for this character. The dot in the L or l Catalan character should be positioned to center vertically on the uppercase height and center horizontally when followed by another L or l.


This seems to me very unsatisfactory, because although the U+02D9 glyph is an appropriate size for the Catalan dot, it is not necessarily the appropriate character code to use.

Peter, try this: in WordPad on Windows, enter the Catalan dot from your Catalan keyboard -- resulting in a .notdef glyph, I presume -- and then press Alt+X. This will convert the preceding character to its hexadecimal character code, so you can confirm exactly what Unicode value is input from the Catalan keyboard.

John Hudson's picture

I just checked the Unicode standard, which includes a precomposed L-with-middot for Catalan (based on some old 7- or 8-bit Spanish standard), and the compatibility decomposition for this indicates that the dot is, in fact, U+00B7. So please ignore my previous advice re. U+2219.
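
The decomposition can be checked with Python's standard unicodedata module. A minimal sketch follows; the helper name is only for illustration, and the precomposed characters it looks at are U+013F and U+0140 (capital and small L with middle dot).

```python
import unicodedata

# Precomposed Catalan characters: capital and small L with middle dot.
L_MIDDOT_CAPITAL = "\u013F"
L_MIDDOT_SMALL = "\u0140"


def decomposition_of(ch: str) -> str:
    """Return the raw decomposition field from the Unicode Character Database."""
    return unicodedata.decomposition(ch)


if __name__ == "__main__":
    for ch in (L_MIDDOT_CAPITAL, L_MIDDOT_SMALL):
        print("U+%04X" % ord(ch), unicodedata.name(ch), "->", decomposition_of(ch))
    # Compatibility normalization (NFKD) applies the same mapping to text.
    decomposed = unicodedata.normalize("NFKD", L_MIDDOT_SMALL)
    print(["U+%04X" % ord(c) for c in decomposed])
```

The second codepoint in each decomposition is the dot that Unicode itself associates with the Catalan L, which is the value a font would need to cover.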

What Catalan keyboard driver are you using? My Windows XP does not ship with a Catalan keyboard.

peter bilak's picture

Adam, John, thanks for your answers. I figured out how it could be solved, but I was wondering why there would be different codepoints for the same glyph, and whether it is the same glyph or it is also used in different languages or situations. Some foundries, for example, add U+2219 as a substitute for the space character when 'show invisibles' is turned on in some applications.

I cannot answer in detail right now what didn't work about 00B7, because I can't test it myself, and rely on my Catalan contact. I will check and get back.

Thomas Phinney's picture

There are tons of cases in Unicode where you have what could be the same glyph at different codepoints. Some of these are purely for historical encoding reasons (capital letter A with ring above versus Angstrom symbol). Some of them have linguistic differences (English, Greek and Cyrillic have different characters, but in each case one of those language-specific characters might look an awful lot like an "A").
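
Both kinds of duplication can be seen from Python's standard unicodedata module. This is a small sketch, not anything specific to font tools; the function name is hypothetical.

```python
import unicodedata


def describe(text: str) -> list:
    """List each character as 'U+XXXX NAME' so lookalikes can be told apart."""
    return ["U+%04X %s" % (ord(c), unicodedata.name(c)) for c in text]


# Historical duplicate: the Angstrom sign normalizes to A with ring above.
A_RING = "\u00C5"
ANGSTROM = "\u212B"

# Linguistic lookalikes: Latin A, Greek Alpha, Cyrillic A stay distinct.
LOOKALIKES = "A\u0391\u0410"

if __name__ == "__main__":
    print(describe(A_RING + ANGSTROM))
    print(unicodedata.normalize("NFC", ANGSTROM) == A_RING)
    print(describe(LOOKALIKES))
    print([unicodedata.normalize("NFC", c) for c in LOOKALIKES] == list(LOOKALIKES))
```

Normalization folds the Angstrom sign into the letter, while the three A-shaped letters remain separate characters even though one glyph outline could serve all of them.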

Cheers,

T

vincent_connare's picture

I just found this..

The reference to the Catalan dot in the MS document, which mentions x02D9 (overdot), is saying that the Catalan dot should be a punctuation character, similar to the way the overdot x02D9 was designed, and NOT a MATHEMATICAL operator, which is what both x00B7 and x2219 are.

x2219 is in 'Mathematical Operators' in Unicode, and xB7 is part of the ASCII set that comes from the earliest technical keyboards; all the characters in there are math characters, not publishing characters, i.e. ASCII tilde (used in programming), ASCII circumflex (used in Pascal for pointers), middot (used as a mathematical operator for multiplication).

I did a test back then and again just now, and xB7 is mapped on the Catalan keyboard.

[Image: Catalan keyboard, Shift+3]

I believed technically x2027 'Hyphenation point' in the x2000..x206F 'General Punctuation' section of Unicode is more correct as the Catalan dot than either the historical middot or math operator periodcentered...

But they pull x00B7.
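
For anyone who wants to compare the candidate dots mentioned in this thread, their names and general categories can be read from the Unicode Character Database through Python's standard unicodedata module. A minimal sketch; the list of candidates and the function name are just for illustration.

```python
import unicodedata

# Codepoints mentioned in this thread as possible Catalan dots.
CANDIDATES = (0x00B7, 0x2219, 0x22C5, 0x2027, 0x02D9)


def dot_info(codepoint: int) -> tuple:
    """Return (U+XXXX label, character name, general category) for a codepoint."""
    ch = chr(codepoint)
    return ("U+%04X" % codepoint, unicodedata.name(ch), unicodedata.category(ch))


if __name__ == "__main__":
    for cp in CANDIDATES:
        print(*dot_info(cp))
```

In the general category column, Po means other punctuation, Sm means math symbol and Sk means modifier symbol. Note that the category is a property of the character, separate from the name of the block it sits in.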

Back in the Ikarus days, for laser imagesetters, we used to make a capital Catalan dot and a small Catalan dot that were on different widths and spaced differently, so that when they were composed with the capital L or small l they would center correctly. The capital L dot would need a negative left sidebearing for it to center between L's.


vincent_connare's picture

I should mention I tested this on Windows 2000.

In 1999 I talked to the person who wrote the keyboard table that maps the character codes to Unicode. I had found it coming out as x2219, and told her that this is a 'math operator', sits in that section of Unicode, and is probably wrong.

Back then I was testing it on Windows 95 and I have it recorded in the MS document that x2219 came up in Win95 and xB7 in Win2000.

xB7 is more correct so I hope that is what changed.

I'm sure this adds to the confusion but never trust a programmer so always test it against what they do...

Stephen Coles's picture

Agreed. Another job for the impending intern.
