double-encoding for unicase?

eliason's picture

I have a unicase font that has a bunch of OpenType scripting (swsh, dlig, etc.).

I have designed the letters in the A-Z slots, and had presumed that I would just put one-component copies of them into the a-z slots.

Would it be a mistake to simply double-encode, say, the /A/ glyph with the Unicodes for /A/ and /a/ instead of making lowercase copies? That way I wouldn't have to work all the lowercase permutations into all the scripting, kerning classes, etc.

cuttlefish's picture

It's generally not recommended (in the FontForge manual at least) to double-encode glyphs, but it certainly can be done and may indeed be the practical solution for a unicase or caps-only display font.

Ray Larabie's picture

Has anyone tried this before? What are the consequences?

cerulean's picture

I was told this would be a bad thing, but I can't quite remember why. Something to do with glyph names remaining relevant when the font is used in some applications.

I can imagine a scenario wherein you select some text, apply your double-encoded font, and suddenly "a" is now "A" and you've lost case information in that text that doesn't come back if you then decide to change it to another font. But this is pure supposition.

Nick Shinn's picture

Great idea Craig.
I'd be interested to know if it's practical.

oldnick's picture

On more or less the same subject, is there an advantage to handling unicase kerning one or the other of these two ways?

1. Simply expand uppercase classes to include lowercase instances; e.g., A' Aacute Acircumflex Adieresis Agrave Aring Atilde Abreve Aogonek (plus) a aacute acircumflex adieresis agrave aring atilde abreve aogonek.

2. In the past, I have...
(a) exported uppercase only kerning, opened the .afm file and cut and pasted the kerning data into a simple text file;
(b) opened the text file in Excel, specifying "delimited" with "space" as the delimiter;
(c) made two copies of the kerning info;
(d) run a macro to change the second column of uppercase letters into lowercase letters in the first copy, then done the same for both columns in the second copy;
(e) pasted the altered info beneath the original, then saved the results as a .prn file;
(f) pasted the .prn info into the original .afm file; then, finally
(g) imported the metrics back into FontLab.

gargoyle's picture

I was told this would be a bad thing, but I can't quite remember why. Something to do with glyph names remaining relevant when the font is used in some applications.

The rationale against double-encoding seems to involve embedded fonts potentially losing their Unicode information, in which case the glyph names could be used for proper glyph-to-character translation ... there's a more extended discussion in the FontLab manual under "Advanced Naming and Encoding" (p145).

Working with the copied glyphs isn't as big of an inconvenience if you're using classes for scripting as well as kerning (pretty much like Nick's suggestion #1 above), and using components for the copies. Rather than double-encoding the "A", make a named "a" glyph that's a component of "A" and create appropriate classes for every pair or series of similar glyphs. Then in your code, just refer to the classes.

eliason's picture

Thanks to all. It's working in experiments on my computer, but after reading that manual section (thanks for pointing it to me), I'm leaning towards not being the guinea pig on this one.

John Hudson's picture

There is nothing wrong with assigning multiple Unicode codepoints to single glyphs. There are lots of situations in which this is sensible and efficient. The TT/OT Unicode cmap subtables are designed to be able to handle this sort of thing.

The only caveats to be observed are:

1. If a PDF is distilled from a print stream, it will not be possible for Acrobat to reconstruct the character string (for searching, copying/pasting text) unless there is a one-to-one relationship between correctly named glyphs and individual Unicode characters.

2. It is not generally adviseable to encode the same glyph as characters from different writing systems, e.g. unifying the encoding of Latin a and Cyrillic а, but only because it is easier to manage things like kerning and OpenType Layout if glyph sets for different scripts are kept separate.

agisaak's picture

@John Hudson,

With regards to your first caveat, this brings up a related issue which I find myself going back and forth on:

The fact that the unicode stream might be lost in PDFs seems to be the major argument against having OT features map from one character onto another (rather than onto a glyph variant). If one adheres rigorously to this, it can often result in a large amount of glyph duplication (e.g. many fonts will include separate sets of small caps for the 'smcp' and 'c2sc' features so that case is preserved in PDFs). While such duplication might not drastically increase the size of the font file, it can lead to an extremely cluttered font, and ins some cases greatly increases the complexity of coding (and thus the possibility of uncaught bugs) - for example, in cases where there are several, cumulative variants of the small capitals which must all be duplicated.

Are there other reasons besides PDF-related issues which argue against having two different characters map to the same glyph (e.g. mapping both A and a to a single a.smcp glyph)? If not, how high is the percentage of PDFs created which fail to preserve the original character stream?

How rigorous are people out there when it comes to assuring that no two characters map onto a single glyph? For those who are not, how often do you run into problems as a result of this?

While I use small capitals as my example above, I'm more interested in cases where more than case-sensitivity would be sacrificed. For example, mapping zero to U2070 (superscript zero) or s to U017F (long s) rather than creating a separate s.hist glyph.

Apart from the potential for bugs, lots of glyph duplication also results in an overly long glyph-palette which can frustrate the user looking for a particular glyph.


John Hudson's picture

My practice depends on what my clients want. I always explain the PDF issue to them, and then they decide whether this is important to them.

If not, how high is the percentage of PDFs created which fail to preserve the original character stream?

As I understand it, it's any situation in which Acrobat is used to distil from a print stream, as opposed to an application using some kind of write-to-PDF function that makes use of PDF character string storage. ‘Printing to PDF’ sends only the glyph IDs for the embedded font(s) which then need to be mapped to glyph names in the font(s) in order for Acrobat to try to parse these and reconstruct the original text.

eliason's picture

I can report that the double-encoded version of the font, when installed with Font Explorer X on my Mac, produced all sorts of problems (other apps hanging, lots of beach ball cursors, screwy renderings of system fonts (like Dock contextual menus), etc.). Guess I'll start building those classes.

eliason's picture

Just for the record, the errors I was finding I think were due to caching problems. In the end I did use this double-encoding solution for my Ambicase fonts and it seems to have worked well.

Syndicate content Syndicate content