How does Font Book know the supported languages of a font?

Tim Ahrens's picture

When I do Cmd-I on a font in Apple's Font Book, it shows me a list of languages supported by the font. How is that list generated? I assume there is some internal database? Is there any way to extract it?

Synthview's picture

Hello,
Good question! I'm interested in an answer too.
It's surely an internal algorithm.
But I have the impression the languages listed are fewer than the ones actually supported. Doing it by hand following ISO 8859-x, I've listed many more languages than Font Book does.

clauses's picture

Perhaps you could get a similar database from Georg Seifert. Alternatively, you could compile one from Michael Everson's PDF on European languages. I also remember an Excel file with a lot of languages and their glyphs in a matrix. I think Miguel Sousa made it, or just posted it on an old Adobe blog; I can't seem to find it now.

Santiago Orozco's picture

it should be an algorithm

Theunis de Jong's picture

.. it should be an algorithm

It is an algorithm. It could be backed by a database of all possible languages, each with the set of characters that language needs, and all the algorithm has to do is check, for each language, that

∀ char ∈ Language : char ∈ font

-- of course it depends on the database how many different "languages" (dialects? idiolects?) it contains, and how strict the check is (as perhaps not all possible characters are required for each language). And, of course, "language" itself is a fluid concept.
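
Something like this rough Python sketch (the language table here is just a hand-rolled stand-in for whatever database the OS really uses, and the font's character repertoire is read from its cmap with fontTools):

```python
# Sketch: report which languages a font "supports" by checking that every
# character the language requires is present in the font's cmap.
from fontTools.ttLib import TTFont

# Stand-in database: language -> set of required characters.
# A real database would be far larger and more nuanced.
LANGUAGE_CHARS = {
    "Danish":  set("abcdefghijklmnopqrstuvwxyzæøå"),
    "Turkish": set("abcçdefgğhıijklmnoöprsştuüvyz"),
}

def supported_languages(font_path):
    font = TTFont(font_path)
    codepoints = set(font["cmap"].getBestCmap().keys())  # Unicode values mapped to glyphs
    font.close()
    supported = []
    for language, chars in LANGUAGE_CHARS.items():
        # ∀ char ∈ Language : char ∈ font
        if all(ord(c) in codepoints for c in chars):
            supported.append(language)
    return supported

if __name__ == "__main__":
    print(supported_languages("MyFont.ttf"))  # hypothetical file name
```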

Igor Freiberger's picture

I thought this kind of information came from the supported codepages, which are declared in the font and interpreted by the OS algorithm.

If this is verified on the fly by Mac OS X, taking into account the actual glyphs of the font, its database would be extremely useful. I've been searching for months to get all Latin-script languages mapped and it's an almost insane job.

But if this is an algorithm, how does it evaluate the support given through combining diacritics? Proper support would need the base glyph, the combining diacritic and the ccmp declaration (and also a mark definition for most stacking diacritics). Does Mac OS X verify all this?

In Windows, I believe the OS simply takes language support from the font's codepages.

blank's picture

If you generate a font containing only Latin characters but enable the Arabic and Hebrew codepages, Font Book will report that the font supports Arabic and Hebrew. It’s definitely not one of Apple’s better applications.
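
You can see what a font declares, independently of its glyphs, by reading the OS/2 codepage bits, e.g. with fontTools (a sketch; only a few of the ulCodePageRange1 bits from the OpenType spec are listed):

```python
# Sketch: inspect the Windows codepage bits a font declares in its OS/2 table.
# These are set by the font editor and need not match the actual glyph coverage,
# which is why a tool that trusts them can be fooled.
from fontTools.ttLib import TTFont

CODEPAGE_BITS = {  # a few of the ulCodePageRange1 bits from the OpenType spec
    0: "Latin 1 (1252)",
    5: "Hebrew (1255)",
    6: "Arabic (1256)",
}

font = TTFont("MyFont.ttf")  # hypothetical file name
bits = font["OS/2"].ulCodePageRange1
for bit, name in CODEPAGE_BITS.items():
    if bits & (1 << bit):
        print("declares", name)
```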

Santiago Orozco's picture

Of course. First you would map every glyph for all the languages, then implement a tree search with recursion.

Tim Ahrens's picture

Thanks for your thoughts, everyone.

I was interested in the database more than in the lookup implementation (which is surely very simple, so no fuss necessary). So, is it not possible to get hold of the database? I looked into the package contents of Font Book.app but couldn't find it there. Maybe it is stored somewhere else in the system?

Khaled Hosny's picture

I usually use fontconfig's orthography database, either using their command line utilities (like fc-query) or parsing the database on my own. Pretty good coverage, even for languages that can be written in different scripts.
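
For example, a small wrapper around fc-query (a sketch; this assumes fc-query's --format option with the %{lang} element, which prints the orthographies fontconfig considers covered):

```python
# Sketch: ask fontconfig which orthographies it thinks a font covers.
# Roughly equivalent to running: fc-query --format='%{lang}\n' MyFont.ttf
import subprocess

out = subprocess.run(
    ["fc-query", "--format", "%{lang}\n", "MyFont.ttf"],  # hypothetical file name
    capture_output=True, text=True, check=True,
)
languages = out.stdout.strip().split("|")  # fontconfig separates languages with '|'
print(languages)
```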

twardoch's picture

Tim,

my assumption is that the data comes from the ICU library which is bundled with Mac OS X.

A.

Igor Freiberger's picture

Tim, as it seems Font Book does not evaluate the actual font contents but just the codepage declaration (which is arbitrary and may be wrong), the database may simply be a table with the languages supported by each codepage. If this is what you need, a good starting point is here.

If you need more information about languages and alphabets, this thread has some good tips, including a link to the Excel table referred to above. There are also the Omniglot and LanguageGeek sites. For Latin script, a good general table is published on Wikipedia.

Andreas Stötzner's picture

For Andron Mega, Font Book reports 57 supported languages. My own counting resulted in about 280 languages.

I based my count on composing samples using text strings from the UDHR project, which I found very useful for the task. (See the full listing here.)
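
A rough version of that check in Python (a sketch; the file names are hypothetical, and it only proves coverage of the characters that happen to occur in the downloaded sample text):

```python
# Sketch: check whether a font covers every character occurring in a UDHR sample text.
# This only proves coverage of the characters that actually appear in the sample.
from fontTools.ttLib import TTFont

def missing_chars(font_path, sample_path):
    font = TTFont(font_path)
    codepoints = set(font["cmap"].getBestCmap().keys())
    font.close()
    with open(sample_path, encoding="utf-8") as f:
        text = f.read()
    needed = {c for c in text if not c.isspace()}
    return sorted(c for c in needed if ord(c) not in codepoints)

print(missing_chars("AndronMega.otf", "udhr_dan.txt"))  # hypothetical file names
```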

Si_Daniels's picture

Has anyone thought to ask Apple?

clauses's picture

Has anyone thought to ask Apple?

Pointless as they don't answer.

clauses's picture

Andreas, I would take those lists with a big spoonful of salt. Just checking the Danish list http://www.unicode.org/udhr/d/udhr_dan.charcount, I can see that Q, W, X and Z are missing. A better list for Danish is http://www.evertype.com/alphabets/danish.pdf from Michael Everson's Alphabets of Europe, but that list is extremely inclusive; hence the characters in brackets are for loan words, transliterations &c.

Si_Daniels's picture

>Pointless as they don't answer.

:-) Well here's the answer Apple sent me...

We compare the cmap to the ICU exemplar strings for each language.
We just pick up the latest copy of the open source ICU database for each system release.

 CLDR Survey Tool http://unicode.org/cldr/apps/survey (jump to other items pop-up/characters)
 e.g. http://unicode.org/cldr/apps/survey?_=be&x=characters
 Unicode Set definition http://icu.sourceforge.net/userguide/unicodeSet.html
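
In other words, something along these lines (a rough sketch, not Apple's actual code; the CLDR UnicodeSet parsing is deliberately naive and the file paths are hypothetical):

```python
# Sketch of the approach Apple describes: pull the exemplar characters for a
# language out of a CLDR locale file and compare them to the font's cmap.
# The UnicodeSet parsing below is deliberately naive (it ignores ranges,
# escapes and multi-character elements); a real tool would use ICU.
import xml.etree.ElementTree as ET
from fontTools.ttLib import TTFont

def exemplar_chars(cldr_locale_xml):
    """Read the first <exemplarCharacters> from a CLDR common/main/*.xml file (naively)."""
    root = ET.parse(cldr_locale_xml).getroot()
    elem = root.find("./characters/exemplarCharacters")
    text = elem.text.strip().strip("[]")          # e.g. "[a á b c ... å]"
    return {c for c in text if not c.isspace() and c not in "{}-\\"}

def supports(font_path, cldr_locale_xml):
    cmap = set(TTFont(font_path)["cmap"].getBestCmap().keys())
    return all(ord(c) in cmap for c in exemplar_chars(cldr_locale_xml))

print(supports("MyFont.ttf", "common/main/da.xml"))  # hypothetical paths
```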

Andreas Stötzner's picture

Claus, I didn’t mean the character listings you have linked to (they’re new to me, and look rather alien). Yes, there are things missing – not useful. The Everson language records (which I know and appreciate, of course) are, as you say, over-inclusive – which I find less helpful as well.
I was referring to the bits of actual text (the Declaration of Human Rights) at the UDHR project site. This gives me some real-world proof for the scope of my fonts, yet it is still not entirely reliable, since not every piece of text necessarily contains all the characters belonging to a language.

Igor Freiberger's picture

Simon, thanks for the information. CLDR is a great tool, although a bit confusing. I was not aware of it and will surely take advantage of this tip.

Anyway, the use of these data may explain the problems with Font Book.

First: the list of languages covered by CLDR is still limited. For example, under Z it lists just Zulu, so any support for or reference to Záparo, Zapotec, Zarma, Zazaki, Zhuang or Zuni is missed by evaluations based on CLDR.

Second: CLDR classifies characters as "approved" and "proposed". If Font Book (or any other tool) evaluates language support based only on "approved" characters (the safer way), it may miss some essential characters. And if it includes "proposed" ones (aiming at a more complete analysis), it will return characters that are not actually needed. For example: it's known that the Latin script for Azerbaijani uses Schwa, but this character has not been approved in CLDR so far. Thus, an evaluation of Azerbaijani support based on CLDR would be inconsistent.

I'm not saying CLDR is a bad tool – far from it, it's really useful. But it is still under development, and automated evaluations based on its data (such as Font Book's) need to be understood as partial.

Mel N. Collie's picture

> I'm not saying CLDR is a bad tool – far from it, it's really useful... automated evaluations based on its data (such as Font Book's) need to be understood as partial.

So, if I might ask, is there anything better than Font Book at doing this? And how much of the "partial" is up to the user regardless, versus the part that comes from the still-incomplete language/nation mapping? Etc. Thanks.

Igor Freiberger's picture

Is there anything better than Font Book at doing this?

AFAIK, no. While I was searching for languages and their alphabets, I did surveys in various places – Unicode charts, SIL, Ethnologue, LanguageGeek, Omniglot, Wikipedia, eki.ee, Evertype, Signographie, dozens of NGOs related to minorities, and also linguistics departments at several universities. I asked for a place where this information is compiled, but no one knew of such a site or book.

CLDR is a nice addition to the list, but still not the one-does-all resource.

riccard0's picture

And if someone were able/willing to create/compile such a resource, which characteristics/tools would be needed?

JanekZ's picture

Such a .txt could be very useful.

I changed the TAHOMA to my font (under construction: http://typophile.com/node/73413 ), and everything is clear, no question marks!

Disclaimer: these pictures are distributed on an AS IS basis, without warranty.

John Hudson's picture

Suddenly, it's all very 1997. [PDF]

Igor Freiberger's picture

And if someone were able/willing to create/compile such a resource, which characteristics/tools would be needed?

Riccardo, I think the best approach is to build a relational database. This could be used in queries through the web or in some automated tool. As any relational db can have its data exported in several ways, the information could feed anything from Python macros to standalone applications.

The db structure would include a table for languages, a table for scripts, a table for glyphs, a table for countries, and tables with the associations between all of these. The separation into various tables is needed to handle the n-to-n relationships (glyphs–alphabets–languages–scripts–countries). The glyph table needs a field to store the glyph image.

As for tools, I think any db manager could do this – phpMyAdmin for MySQL, say, or even MS Access.
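
A minimal sketch of that schema using SQLite from Python (table and column names are illustrative only):

```python
# Sketch of the relational structure described above, using SQLite.
# Entity tables plus association tables for the n-to-n relationships;
# names and columns are illustrative only.
import sqlite3

conn = sqlite3.connect("writing_systems.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS script   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE IF NOT EXISTS language (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE IF NOT EXISTS country  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE IF NOT EXISTS glyph    (id INTEGER PRIMARY KEY, codepoint INTEGER,
                                     name TEXT, image BLOB);  -- field for the glyph image

-- association tables for the many-to-many relationships
CREATE TABLE IF NOT EXISTS language_script  (language_id INTEGER, script_id INTEGER);
CREATE TABLE IF NOT EXISTS language_country (language_id INTEGER, country_id INTEGER);
CREATE TABLE IF NOT EXISTS language_glyph   (language_id INTEGER, glyph_id INTEGER);
""")
conn.commit()
```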

k.l.'s picture

Instead of setting up one more of the same, why not help Unicode improve CLDR – now that they are asking?

My impression is that what is missing is a reliable set of characters per language. And I like the idea of having, e.g., CLDR as a single source for a variety of locale information. CLDR's approach of classifying character sets is an advantage and indicates that they are honest enough to point out "not sure". Applications can choose either approved or proposed sets and ideally would tell users about their choice.

Tim Ahrens's picture

...reliable set of characters per language

The first question is: what are these databases used for, what is the objective?

In the case of Font Book it is obviously to determine the set of supported languages for a given font.

As a font maker, it would be the other way round: Which characters do I have to provide in order to support a certain language? It all boils down to minimizing the number of fail cases where a document requires a character that is not in the font. Or, to achieve a certain coverage (say, 99.95%) while minimizing the costs. What are the costs? Design effort is a cost, which is not the same for each character – some accented characters can be generated with two clicks, whereas other rarely used characters need to be designed from scratch. What if the rationale is to minimize the file size, e.g. for web fonts? Then the cost is in kB, not work. But we have a similar scenario: some characters cost more, others less (accented characters have very low data volume).

So, we want to cover as many cases as possible with our font. In that sense, I think a linguistic approach does not help much. I believe trying to find out which characters are “required” is the wrong paradigm, and 100% coverage is impossible to achieve. Also, “cases” is very difficult to define in the first place. Do we distinguish between geographic locations as well? Are we talking about the web? Print? The past, present or future? How about foreign words or place names? We can find “coöperation” in English texts, and “Café” with a French accent is standard in German. On the other hand, in German the »angular« quotes are very rarely used on websites, although they are the classic form in novels. An “official” character set is very difficult to determine for most languages, so in my opinion a real-world basis (scanning large amounts of text) would make more sense than an academic one.

Another difficult question is what constitutes a language – only letters and letter-like symbols, or also punctuation? Are figures part of a language? For example, in most Middle Eastern countries they use “real” Arabic figures, whereas in Northern Africa the “Latin Arabic” numerals we use seem to be preferred. How about mathematical punctuation – is that still language-specific, and how much of it is “required”? I think this whole thing should not be restricted to letters. What does it help to know which letters I need to satisfy the needs of most of my users, but not the other characters?

What we really need is the frequency of occurrence for each character in each language, found by a real-world survey. Then we can set a threshold and determine our set of required characters. Or, we can assign a nominal cost to each character, and then work out a tradeoff.
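
As a toy illustration of that last idea (the frequencies and per-character costs below are invented numbers, not survey data):

```python
# Toy sketch of the frequency/cost tradeoff: include a character if its
# real-world frequency of occurrence justifies its cost. All numbers invented.
FREQUENCY = {"é": 0.0021, "ö": 0.0018, "ǃ": 0.0000004}   # share of characters in sampled text
COST      = {"é": 1.0,    "ö": 1.0,    "ǃ": 8.0}          # design effort or file-size units

def character_set(threshold):
    """Simple threshold rule: keep characters whose frequency/cost ratio is high enough."""
    return sorted(c for c in FREQUENCY if FREQUENCY[c] / COST[c] >= threshold)

print(character_set(1e-4))   # -> ['é', 'ö']
```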

gaultney's picture

I asked for a place where this information is compiled, but no one knew of such a site or book.

We're working to change that. ScriptSource is a site dedicated to gathering info on scripts and writing systems. It supports 'tabular' data - lists of scripts, languages, characters, etc. - and the associations between them as well as text+graphics information. It's built on a database designed specifically for this type of linguistic data, and pulls in data from Unicode, the Ethnologue, various international standards (15924, 639-3) and CLDR. Text descriptions, articles, graphics, bibliographic links and even software can be connected to this skeleton of data.

We're working closely with the CLDR committee, and working on strategies to dramatically expand CLDR data for minority languages. CLDR already contains a huge volume of data in a useful but rather complex system that supports important concepts such as inheritance, but as a result can be tricky to navigate. We hope that ScriptSource can present some of that data in an easier to read format, and serve as a place to collect documentation and data toward further CLDR submissions. We see CLDR and international standards - not ScriptSource - as the long-term repositories of writing system data, but use our site to pull them together and enhance them.

ScriptSource is currently in testing and being refined for a public release later in the spring. If you're interested in finding out more, go to www.scriptsource.org. If you want to look at what we have so far and give us feedback, drop me a line and I can give you an invitation, especially if you think that you might want to contribute to the site.

Tim Ahrens's picture

Victor, ScriptSource looks very promising! I'm looking forward to seeing the results of your efforts. Will it help me decide on the character set (not only the letter set) to put in my fonts?

gaultney's picture

I hope so. The main 'objects' are script, language, writing system (what you get when you connect a script and a language, and a bit like a CLDR 'locale') and character. So you will be able to see what characters are used for a particular writing system - the main characters as well as auxiliary ones that are used for loan words, etc. We've focused more on characters than glyphs, although you'll be able to post details of glyph variants, cultural preferences, style differences, etc.

You'll also be able to see what writing systems use a particular character, although the completeness of that answer depends on the completeness of the underlying data, and there is still a lot of that data that doesn't exist yet.
