Letter frequency data incl. uppercase

John Hudson's picture

There are some good resources online for finding letter frequency data for English and other major languages, including both corpus and dictionary keyword derived data. But none of the corpus data I've looked at so far maintains a distinction between upper- and lowercase letters. Has anyone seen such data? I want to know what the average frequency of individual uppercase letters is.

hrant's picture

I'm pretty sure I've never seen that.

But here's a way to extract such data: take a list of word frequencies, and use a thesaurus (I mean a digital one) to eliminate all the entries that are not proper nouns. The first letters of the words you're left with coupled with the frequencies of the respective words should constitute a decent estimate.
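A minimal sketch of that idea in Python, with loud assumptions: `word_freqs` is a hypothetical word-to-count mapping, `common_words` is a hypothetical stand-in for the digital thesaurus lookup, and all sample data is invented. It only covers proper nouns; sentence-initial capitals would need separate handling.

```python
from collections import Counter

def initial_cap_estimate(word_freqs, common_words):
    """Estimate capital-letter frequencies from the initials of likely proper nouns.

    word_freqs:   mapping of word -> occurrence count (hypothetical input format)
    common_words: lowercase words found in the thesaurus, i.e. assumed NOT proper nouns
    """
    counts = Counter()
    for word, freq in word_freqs.items():
        if not word or word.lower() in common_words:
            continue  # empty entry, or an ordinary word: skip it
        first = word[0].upper()
        if first.isalpha():
            counts[first] += freq  # weight the initial by the word's frequency
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.items()} if total else {}

# Invented sample data, for illustration only.
word_freqs = {"the": 500, "london": 30, "letter": 40, "john": 20, "lisbon": 10}
common_words = {"the", "letter"}
estimate = initial_cap_estimate(word_freqs, common_words)
print(estimate)
```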

hhp

paul d hunt's picture

John, this SHOULD be fairly easy to find out with Unicode encoded text, yeah? If you have a corpus, I have a small app developed by typophile Aric Bills that should crunch the numbers for you. Let me get in contact with him and see if he will let me pass the app along to you, or perhaps he may contact you directly as he did me.
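For anyone who wants to try this before an app turns up, here is a rough Python sketch of the counting step. It is not Aric Bills's app, just an illustration; the function name and the sample sentence are made up, and it relies on Python's built-in Unicode-aware `str.isalpha` and `str.isupper`.

```python
from collections import Counter

def uppercase_frequencies(text):
    """Return each uppercase letter's share of ALL letters in `text`.

    Digits, punctuation and whitespace are ignored, so the values are
    fractions of the letter count, not of the character count.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return {}
    upper = Counter(ch for ch in letters if ch.isupper())
    return {ch: n / len(letters) for ch, n in upper.most_common()}

# Made-up sample; a real run would read a whole corpus from a UTF-8 file.
sample = "John Hudson asked about Letter Frequency data for English."
freqs = uppercase_frequencies(sample)
print(freqs)
print("share of caps:", sum(freqs.values()))
```

Summing the returned values gives the overall proportion of capitals in the text.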

oprion's picture

This one does the trick. Both upper and lower cases in multiple languages.

http://www.characterfrequencyanalyzer.com/english/index.php

Even counts obscure letters like Yat or long s.
_____________________________________________
Personal Art and Design Portal of Ivan Gulkov
www.ivangdesign.com

John Hudson's picture

I don't have a corpus, Paul, but could probably find one. For my purposes, a corpus of scholarly works in English would be best. In case you're wondering why I'm after this, I'm trying to quantify the amount of space saved by using one typeface instead of another; since the capitals in the new type are quite a lot narrower than in the old one, they may have a significant impact.

hrant's picture

There's more than one way to skin this cat. A lot of it I think depends on what the client wants to see in terms of analysis.

For one thing, this is all non-deterministic; there can be no strict quantification in any case. Letter frequencies are always estimates, in some ways better but in other ways worse than simply setting a longish typical text supplied by the client to get an actual reading of comparative economy.

What you might do instead is simply figure out what proportion of text is caps (I've heard it's ~5%*), calculate the average set width of each font's caps, and use those three numbers to arrive at the economic difference. A refinement of this would be to weight the averaging based on general (non-cap) letter frequencies; a further refinement would be to use frequencies of initial letters of words only (whereby for example "Y" would have a very low weight).

* 15% for German?
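The three-number estimate above might be sketched like this in Python. Everything here is an assumption for illustration: the function names, the glyph widths (in font units) and the weights are invented, and the ~5% cap share is the hearsay figure quoted above, not a measurement.

```python
def weighted_average_width(widths, weights=None):
    """Average advance width of the glyphs in `widths`.

    weights: optional glyph -> weight mapping (e.g. letter or initial-letter
    frequencies); with no weights, every glyph counts equally.
    """
    if weights is None:
        weights = {glyph: 1.0 for glyph in widths}
    total_weight = sum(weights.get(g, 0.0) for g in widths)
    if total_weight == 0:
        raise ValueError("weights sum to zero for the given glyphs")
    return sum(w * weights.get(g, 0.0) for g, w in widths.items()) / total_weight

def cap_saving_per_char(cap_share, old_avg, new_avg):
    """Expected width saved per character of running text, in font units."""
    return cap_share * (old_avg - new_avg)

# Invented advance widths for three caps in each typeface.
old_caps = {"A": 700, "B": 650, "Y": 680}
new_caps = {"A": 640, "B": 600, "Y": 620}
# Invented weights; "Y" gets a low weight, as in the initial-letter refinement.
weights = {"A": 5.0, "B": 4.0, "Y": 1.0}

saving = cap_saving_per_char(
    0.05,
    weighted_average_width(old_caps, weights),
    weighted_average_width(new_caps, weights),
)
print(saving)
```

To turn this into a percentage of line length you would also need the average advance width of the whole text, which is a fourth number beyond the three mentioned.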

Another twist is that letter frequencies are not enough when it comes to figuring out the effect of letter widths on economy. The proportion of paragraph breaks is just as important if not more so. A font that saves a certain degree of horizontal space can have all its savings trashed by a paragraph break! The greater the proportion of paragraph breaks, the more moot narrowness of glyphs becomes.
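A toy Python illustration of the paragraph-break effect, under a deliberately crude assumption: each paragraph occupies its total set width divided by the measure, rounded up to whole lines, ignoring where words actually break. The widths, the measure and the 3% saving are all invented.

```python
import math

def line_count(paragraph_widths, measure, scale=1.0):
    """Total lines needed when every paragraph starts on a new line.

    paragraph_widths: total set width of each paragraph, in arbitrary units
    measure:          line length in the same units
    scale:            width multiplier, e.g. 0.97 for a face that sets 3% narrower
    """
    return sum(math.ceil(w * scale / measure) for w in paragraph_widths)

measure = 100

# Same amount of text, set as three paragraphs or as one.
three_paragraphs = [250, 120, 340]
one_paragraph = [710]

for label, paras in (("three paragraphs", three_paragraphs),
                     ("one paragraph", one_paragraph)):
    before = line_count(paras, measure)
    after = line_count(paras, measure, scale=0.97)
    print(label, before, "->", after)
```

In this invented case the narrower face saves a line only when the text runs as a single paragraph; with three paragraphs each saving is absorbed by the short last line.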

All this makes me think that using a sample text from the client to compare the fonts would make more sense than relying on frequency stats. But frankly in any case I think the savings will be so small (except maybe for German) that the awkwardness of very narrow caps will kill it.

hhp

John Hudson's picture

"A font that saves a certain degree of horizontal space can have all its savings trashed by a paragraph break! The greater the proportion of paragraph breaks, the more moot narrowness of glyphs becomes."

Very true.

"All this makes me think that using a sample text from the client to compare the fonts would make more sense than relying on frequency stats."

Oh, I'm doing that too. The frequency stats provide a generic baseline comparison between different typefaces, but real documents provide a better indication of actual benefits/costs.
