Unicode hex codes for the Cambria Math font

Belloc's picture

I need a complete set of unicode hexadecimal codes for the Cambria Math font used in Microsoft Equation Editor. The symbol dialog box in MS Word is not complete. Some of the stretched (larger) glyph variants for brackets, braces, parentheses and other character glyphs are not shown on that table.

I have looked at Cambria Math Ascender Fonts, but they don't include the hex codes in the Characters Set section on that page.

Thanks in advance for your help

riccard0's picture

Alternates don't have Unicode codepoints.

Belloc's picture

@riccard0 I didn't understand your comment. Maybe because we speak a different dialect. I'm not a font designer, but a programmer. If a character glyph doesn't have a UNICODE number, like ('A' 0x0041), how would it be referenced in a program ?

Sindre's picture

When writing OpenType code, names are used for reference rather than codepoints. For instance:
sub f j by f_j;

Sometimes font designers assign non-Unicode glyphs to Private use area codepoints. I usually do, but that's just to keep track of things.

Belloc's picture

@Sindre First of all, I'd like to thank you for your help. But I'm still confused.

You'll find the character glyphs that I'm concerned
here,
by selecting "Character Set" for the "Cambria Math" font and go to the last page (19). The glyphs are the ones at the end of this page. My question is : how to reference those characters in a C++ program ?

Are you saying those characters don't have a UNICODE number ?

Sindre's picture

No, those look like Unicode glyphs to me, those last ones are u+0363 to u+368, combining Latin small letters. I used Charmap to find those numbers. The Unicode consortium also has very useful code charts online.

Belloc's picture

@Sindre I think you got the wrong characters. Go to this page. Then select "Character Set" and the "Cambria Math" font. Finally select page 19 (last page). The characters that I need are on this page, after the greek letters.

Sindre's picture

Ah, I must've looked at Cambria Regular instead. Those you're looking for are not known to me nor to Charmap as Unicode symbols, and they're not mapped to PUA-codepoints. You have to find software that can look closer at TTC-files (TrueType collection) to find their names.

JanekZ's picture

This one?


BTW it is Cambria winXP v.1.02

Theunis de Jong's picture

I'm not a font designer, but a programmer. If a character glyph doesn't have a UNICODE number, like ('A' 0x0041), how would it be referenced in a program ?

Not all character glyphs in a font need to have a Unicode number. Consider, for example, an alternate for your example "A". What Unicode number should it have, other than U+0041? (Since assigning another Unicode number to it would change it from an 'uppercase a' to something else.)

The common way of referencing non-Unicode characters in some proprietary font is by Glyph Index. That's a one-way street: it's only valid for one release of the font (since the glyph indexes may vary from release to release), and it's only valid for that single font (since other fonts may have more, less, or different glyphs at the same index).

(I'm not sure on how Cambria Math is currently used by Word. It's certainly not above Microsoft to invent yet another "standard", and next tell no-one what it actually implies.)

Here is a snapshot of a glyph-dumping program; you can see Unicode values for characters that have one, raw glyph indices (indicated by hash marks) for those that don't.

Belloc's picture

@JanekZ No. The UNICODE numbers for the characteres shown in your snapshot are available. You'll find the characters I'm referring to, if you follow the directions I gave on my prior post.

@Theunis Very interesting points you mentioned. I didn't know about these glyph indexes.

(I'm not sure on how Cambria Math is currently used by Word. It's certainly not above Microsoft to invent yet another "standard", and next tell no-one what it actually implies.)

Does that mean that MS will not tell me what those indexes are ? Or in other words, they won't tell how to use the characters that I'm interested in ?

That's possible, but very odd, since I'm designing a system using the Windows API (application programmming interface).

I've already asked MS for the information I'm looking for, but so far had no answers.

John Hudson's picture

The Cambria Math font works in conjunction with Microsoft's math layout engine, and includes a MATH table in addition to the OpenType Layout GSUB table. The MATH table includes a variety of math-specific sizing and positioning information, which is interpreted by the math layout engine. This includes a list of sizing variants of some characters, e.g. braces, as well as component sequences and alignment information for building arbitrarily sized symbols larger than the biggest variant glyph.

From the Microsoft Mathematical Typesetting booklet:

Unfortunately, the MATH table specification does not appear to have been published, although I'm aware of at least one independent implementation.

This is a useful article: OpenType Math Illuminated.

And this is a good overview of MS math handling by Murray Sargent, who was responsible for it. If you are seeking to use MS APIs for math, I recommend contacting Murray.

Belloc's picture

This is the first really good answer I received in the last few days about the subject. Thanks very much for your input. As a matter of fact, I have already left a message to Murray Sargent on his blog, which is the exact page you indicated above.

Definitely I will need the mathfont.dll, and for this purpose I have already asked for Mr. Sargent's help, as you can see here.

Basically, what I'm looking for, are the UNICODE hex codes for the parenthesis on the left side of your drawing !! Since, the one on the right, by composing the 3 glyphs I'll be able to extend them to any pixel height I need. That's perfectly understood.

Could you give me additional information on the independent use of the MATH table, you mentioned on your post ?

In the mean time I'll read the paper "OpenType Math Illuminated".

Again, thanks very much for your help.

John Hudson's picture

Basically, what I'm looking for, are the UNICODE hex codes for the parenthesis on the left side of your drawing !! Since, the one on the right, by composing the 3 glyphs I'll be able to extend them to any pixel height I need. That's perfectly understood.

All the parentheses, regardless of size, are encoded as the single Unicode character (, i.e. U+0028. It is the math layout engine, working with the font MATH table, that determines which size of parenthesis to use and at what size to break the parenthesis into an assembly. The MATH table contains a mapping from encoded characters to variants and assemblies, as well as information about allowable minimum and maximum overlap for assembly components, enabling the size to be adjusted and letting the math layout engine decide at what size another medial section might need to be inserted.

I don't think the independent implementation of math layout that I mentioned has been published yet, so I can't provide any details.

Belloc's picture

Cristal clear !!

I'll try to get further information from Murray Sargent at Microsoft.

Many thanks.

Ross Mills's picture

The base character for the parenthesis is the only one of the set listed that is encoded. The others are mapped in the MATH table, firstly enumerated as vertical glyph variants, secondly mapped as specific variants of the base character. The same procedure is used for horizontal variants, again with only the base form having Unicode character-to-glyph mapping.

Some assemblies (the split components in the right example above) are mapped in Unicode, but for MATH table support these are accessed as vertical or horizontal assemblies. The assemblies are adjunct to the size variants, so you may have a case such as:

parenleft->[Variants] parenleft.display1, parenleft.display2...parenleft.display7 ->[Assembly Components]parenleftbottom, parenleftextend, parenlefttop

The Math Handler (be it the sort in Word or otherwise) measures the expression the parens are enclosing, compares it first against the [Variants] and if none are large enough uses [Assembly Components]. Within the MATH table additional rules for how those components overlap are also listed, in this way the assembly is treated as a whole, but can be minutely adjusted to best match the height (or width) of the expression, or part thereof.
The same goes for other fences, although not all may have the same number of variants (generally the more common have more variants). There are also arrow assemblies.

This is of course an over-simplification, which may be useful for those typesetting maths, but if you have a notion to make a math handler its well outside the scope of this forum and you're best talking directly with Murray. Same goes for access via C++.

For a basic overview of how the OpenType Math table works in conjunction with Microsoft's implementation, you can order the Mathematical Typesetting book from us. Note that this is an overview, not a specification.

I can't say anything about the other, non-Microsoft implementation but can say that there will be another compatible typeface publicly available as an alternative to Cambria Math.

Belloc's picture

@Ross I thank you for your detailed input. Now it is clear to me that the glyph variants are not mapped into Unicode character codes.

But I'm still confused with this concept. I'll explain why : in C++, there are certain functions to exhibit characters, or words, on the screen, e.g., TextOut() or ExtTextOut(). These functions basically receive as input the address in memory containing the character codes. How will I be able to exhibit those glyph variants on the computer screen, without a code ? This would only be possible if other functions are provided for displaying those variants. Is that what you mean by parentleft.display1, parentleft.display2, etc... ?

By the way, could you elaborate a little bit more on this :

"The assemblies are adjunct to the size variants, so you may have a case such as:

parenleft->[Variants] parenleft.display1, parenleft.display2...parenleft.display7 ->[Assembly Components]parenleftbottom, parenleftextend, parenlefttop"

Again, thanks for your attention. This has been a great help.

@John Hudson I've just received an email from Murray Sargent with the math table documentation.

John Hudson's picture

Yes, unencoded glyphs* need secondary mechanisms to be displayed. This is the entire concept behind most recent digital typography, based in turn on the Unicode character/glyph distinction. Unicode encodes characters, not glyphs; hence the need for secondary layout and display mechanisms. OpenType provides a good model for how this works, and in this respect my ancient essay on Windows Glyph Processing is still worth reading (note that some aspects of that essay are not out-of-date; it has not been updated since 2000).

Microsoft math layout has some similarities to OpenType, and also employs a character/glyph distinction, but does things that are too complex for the general OpenType Layout architecture to handle; hence the need for the MATH table.

___

*A glyph is a visual representation of a character, combination of characters or part of a character. A single character may be represented by one-of-many glyphs, which we refer to as variant glyphs; or it may be represented by multiple glyphs, e.g. in the case of a diacritic character such as é decomposed to the letter glyph plus a combining mark glyph.

Ross was using some glyph naming conventions to indicate glyph variations. So, for example, parenleft is the name of the default glyph representation of the encoded character ( U+0028. parenleft.display1 and parenleft.display2 are two unencoded variant glyphs. The names by themselves, of course, are of no help to you; you need to get to grips with the math handler capabilities and the APIs to call the layout functionality, query the font MATH table, etc.. Murray is the best person to assist you with this.

Ross Mills's picture

I don't think that there is a direct way to access them without calling secondary processes/functions. The assembly bits are encoded and you could display their constituents, but not the built-up form (which comprises 3 or more glyph components). Its probably not a good idea to use the codepoints for the assembly bits.

By the way, could you elaborate a little bit more on this :

Its basically just a textual version of the illustration John posted. There are a given number of elements (constituents of the expression) within any set of fences (or delimiters), the 'handler' measures these elements to determine whether or not a larger variant is needed. It then looks in the font's MATH table to see what variants are available and uses the one whose size is closest (within certain parameters). If the contained expressions exceed the height of the tallest available display variant, then it uses the growing assemblies. The handler has to constantly look at the contents of the expression and dynamically adjust the size of the delimiters until the equation is completed. Note that both the variants and the assemblies map back to the base character. This is more an issue for 'display' equations then it is for plain text. The names themselves, as John noted, are arbitrary and just illustrate the process of transforming, or mapping a base character to a number of glyph variants.

twardoch's picture

Ayrosa,

with OpenType fonts, in order to convert a series of Unicode codepoints into a series of correct glyph variants with *correct positioning*, you need an OpenType Layout processor/engine. An OpenType font contains a "cmap" table, which maps Unicode codepoints to default glyphs, and also includes a number of tables which then map the default glyphs to variants, and contain precise positioning information of those glyphs in relation to each other. Those tables are "GSUB", "GPOS", "BASE, and "JSTF". Those tables work in connection with OpenType Layout "features", which you could see as character-level formatting commands. Those features can be requested by the user, or applied by default by the engine in certain situations.

Conceptually, this is a little bit like font switching in HTML+CSS: if you specify that the default font family should be "Verdana", the default style will be "Verdana Regular". However, when the HTML markup includes the "strong" or "b" element, the web browser will automatically switch to "Verdana Bold". In addition, the author can request a font switch explicitly, by specifying a CSS rule (such as "style='font-weight:bold'"). In OpenType, the same principle applies to glyph switching. If you specify the letter "A", by default, you'll get the default "A" glyph variant which is indicated by the "cmap" table, but the OT Layout engine can in certain situations automatically perform a switch to a different variant (such as "A.initial"), or the user can request such a switch by explicitly calling a certain layout feature (such as "init").

The one thing to remember is that as a user, you don't specify the glyphs directly. You specify a codepoint, and optionally a feature, and the OT Layout engine parses the appropriate font tables ("cmap", "GSUB", "GPOS" etc.), and returns a series of glyph IDs and their relative positioning to each other. This is, again, similar to how the web browser does font switching. If you say "font-family: Verdana; font-weight: Bold", the web browser will actually analyze the OS's list of installed fonts, find the appropriate font file which needs to be used to render the text, and apply it.

The OpenType Layout processin is the "layout step" of text rendering. Once completed, the "rasterization step" happens: the OT Layout engine passes the series of glyph IDs and positions to the font rasterizer, which then parses a different set of font tables ("glyf" for .ttf fonts, "CFF " for .otf fonts, and a few others), looks up the outline definition for a glyph using its glyph ID, and performs a rasterization of the glyph outline at a requested point size.

In Microsoft Windows, Uniscribe is such an OpenType Layout processor, which ships with the OS. In addition, there is a series of 3rd party OpenType Layout engines that can perform this work: Monotype WorldType, Bitstream Panorama (both commercial) as well as HarfBuzz and ICU Layout (both opensource, ICU Layout is part of a large library called ICU, but can be used separately).

In case of Windows API, GDI functions such as TextOut() perform those steps behind the scenes, but don't allow you to specify the features you want, since GDI TextOut() is ancient, and was created way before OpenType Layout was developed. TextOut() uses Uniscribe to "know" about the OT Layout tables, but when Uniscribe is used by TextOut(), it only applies the features it should apply by default (these are different features depending on the writing system/alphabet of the text, i.e. different features are applied by default for text written in the Latin script, the Arabic script, or the Devanagari script.)

So, using my HTML analogy, TextOut() works like HTML without CSS. If you use the HTML "strong" or "b" element, the web browser will automatically switch the Verdana Regular font to Verdana Bold, but you have no additional access to the font variants as you would have in CSS.

Newer, alternative Windows APIs such as DirectWrite or WPF (Windows Presentation Foundation, which is based on XAML) allow you to additionally specify the features you want, so you can get the glyph variants you want. So those APIs are more like HTML+CSS.

But ALL of the above is limited to "normal" (non-mathematical) OpenType fonts. As John has written above, the "MATH" font table is not part of the official OpenType standard, but is a custom extension by Microsoft.

So in order to perform mathematical typesetting using Cambria Math, the OpenType processor would need to be extended to process the "MATH" table.

AFAIR, the Windows 7 RichEdit control provides such functionality, i.e. it allows you to switch to the math context and then perform mathematical typesetting. But again, the RichEdit control has a simple API and does not expose all the low-level details to you. So if you're not happy with the results, you'd need to do your own OpenType Layout and MATH table processing.

Fortunately, there are a few opensource projects that do this, i.e. include a MATH-extended OpenType Layout engine. XeTeX (and extension of the TeX layout system) is one such project. XeTeX includes a modified version of the ICU Layout library, and adds code to process the MATH table. Depending on the licensing, you could use that code, or at least study it in order to understand in more detail how all this needs to be done. One approach that you might consider, if you're really into it, would be to develop a "MATH" table parser as a "plugin" for the HarfBuzz layout library. HarfBuzz is licensed under a very liberal license (Old MIT, I think) so you could use its OpenType Layout processing capabilities (which don't cover MATH but cover everything else), and on top of that, you could write your own MATH table handler. If you do this in a HarfBuzz-compatible manner, I believe the opensource community would be delighted if you donated that code to the HarfBuzz project.

Best regards,
Adam Twardoch

Belloc's picture

I have been very busy for the last few days delving into the MS OpenType specification. I can tell you, I've learned quite a lot since I started this discussion 8 days ago. I must thank you all for your professional and courteous guidance. That was outstanding !

@Adam

Thank you for your detailed and impressive input above. I'm glad that I was able to follow almost everything you said, given that I was completely illiterate on this subject a few days ago.

But I'm still struggling to understand some very fine details on this
MS specification, about cmap, specially regarding this pharagrah :

Each segment is described by a startCode and endCode, along with an idDelta and an idRangeOffset, which are used for mapping the character codes in the segment. The segments are sorted in order of increasing endCode values, and the segment values are specified in four parallel arrays. You search for the first endCode that is greater than or equal to the character code you want to map. If the corresponding startCode is less than or equal to the character code, then you use the corresponding idDelta and idRangeOffset to map the character code to a glyph index (otherwise, the missingGlyph is returned). For the search to terminate, the final endCode value must be 0xFFFF. This segment need not contain any valid mappings. (It can just map the single character code 0xFFFF to missingGlyph). However, the segment must be present.
If the idRangeOffset value for the segment is not 0, the mapping of character codes relies on glyphIdArray. The character code offset from startCode is added to the idRangeOffset value. This sum is used as an offset from the current location within idRangeOffset itself to index out the correct glyphIdArray value. This obscure indexing trick works because glyphIdArray immediately follows idRangeOffset in the font file. The C expression that yields the glyph index is:

*(idRangeOffset[i]/2 + (c - startCount[i]) + &idRangeOffset[i])

For me, this formula does not make sense. Instead, it should be :

12 * sizeof(unsigned short) + (c - startCount[i]) + idRangeOffset[i], where

12 * sizeof(unsigned short) is the offset for the glyphIDArray[], on a Format 4, cmap subtable.

I understand that my question may be out of topic for this forum, but the response I've got so far, has been so inspiring, that I decided to give it a try.

I thank you in advance, for any feedback on this.

twardoch's picture

A good place for these kind of questions are:
* the OpenType list, subscribe by sending an email to:
opentype-migration-sub@indx.co.uk
* the mpeg-OT list, join by visiting:
http://tech.groups.yahoo.com/group/mpeg-OTspec/

Rather than parsing font tables yourself, it's better to use well-developed libraries. Adobe Tin is such library (opensource), while the FreeType library is another popular alternative. Among others, FreeType does "cmap" parsing, and lots more.

Regards,
Adam

Belloc's picture

Adam:

Thanks again for your advice.

Sincerely,

Ayrosa

Syndicate content Syndicate content