Letter combination frequencies

nina

How would one go about finding out the relative frequencies of two-letter combinations in different languages* – ideally even distinguishing between uppercase and lowercase letters? I'm essentially looking for something like a table that would say that an "r" is (I'm making these numbers up for the sake of illustration) 13% likely to be followed by an "e" in English, but only 2% likely to be followed by a "k".
I'm aware there's no Final Verdict on this to be found, but there must be some approximate data to work with?
* I'm most interested in German and English for the moment.

A quick-ish web search has brought up charts of relative single-letter frequencies in different languages, as well as word frequency lists, but not a lot about pairs of letters. Wikipedia has a list of bigram frequencies in the English language, but it is single-case – and it also only lists the most common combinations, while I'm looking for something more in the vein of a full "pairing" frequency table.

Any good, reasonably reliable references/literature you guys can recommend?
I'm done looking at wonky charts that don't say what data they're based on…
Of course, please re-direct me if this has been discussed before. I couldn't find anything, but then the search has been hiccuping.

Frode Bo Helland

Here's another one counting single letters: http://www.typotheque.com/type_utilities/lettermeter
I'm not sure about pairs, though – someone who knows their way around Python should be able to modify this in a snap.
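
Maybe something like this, for the counting part at least – a rough sketch off the top of my head, not lettermeter's actual code, and "sample.txt" just stands in for whatever text you want to count:

    from collections import Counter

    def pair_counts(text):
        # Count case-sensitive letter pairs, skipping anything
        # that isn't a letter (spaces, digits, punctuation).
        pairs = Counter()
        for a, b in zip(text, text[1:]):
            if a.isalpha() and b.isalpha():
                pairs[a + b] += 1
        return pairs

    text = open("sample.txt", encoding="utf-8").read()
    print(pair_counts(text).most_common(10))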

Frode Bo Helland

Actually, judging from the illustration it appears to count pairs as well. Not sure.

nina

Frode, thanks. I've found a number of such applets written in different programming languages, but they're always based on the pretty narrow scope of whatever text you feed into them. I'd be more interested in some scientific analyses based on, I don't know, the quantified vocabulary of a given language?

BTW, it just occurred to me that since this sort of research would likely be undertaken by linguists and/or cryptographers, there might not be much data that distinguishes between UC and lc?

apankrat

> I’d be more interested in some scientific analyses based on, I don’t know, the quantified vocabulary of a given language?

The obvious idea is to feed a vocabulary into one of those "applets" you saw. Though you probably want to feed in a book rather than a vocabulary: the latter would have exactly one entry for "the", while the former will have it aplenty, which severely affects the counts for "th" and "he" ... so it really depends on what exactly you want to measure.

A program that builds a histogram of letter-combination probabilities (even a case-aware one) is really five minutes of work for any decent programmer in virtually any language. Find one, feed one cold beer, collect one "lettermeter" program.
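
Something along these lines, say – a minimal sketch in Python, where "some_book.txt" is a placeholder for whatever text you feed it:

    from collections import Counter, defaultdict

    def pair_probabilities(text):
        # Case-aware tally of which letter follows which.
        counts = defaultdict(Counter)
        for a, b in zip(text, text[1:]):
            if a.isalpha() and b.isalpha():
                counts[a][b] += 1
        # Normalize each row into percentages.
        probs = {}
        for a, followers in counts.items():
            total = sum(followers.values())
            probs[a] = {b: 100.0 * n / total for b, n in followers.items()}
        return probs

    text = open("some_book.txt", encoding="utf-8").read()
    probs = pair_probabilities(text)
    print(probs["r"].get("e", 0.0))   # % of the time "r" is followed by "e"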

nina

"Though you probably want to feed a book rather than a vocabulary."

Ah, I wasn't talking about dictionaries, but about the collective, uh, "vocabulary" in roughly the sense in which Wikipedia says "A person's vocabulary is the set of words they are familiar with in a language" – only I meant it in a non-subjective way, so a better word would likely be lexis ("the total bank of words and phrases of a particular language, the artifact of which is known as a lexicon"). Sorry about the confusing wording.
Note I did say "quantified", to account for frequencies of words.

I suspect that defining and extracting these masses of data is trickier than analysing them, and that it must have been done before by researchers who knew what they were doing.

cschroeppel

From a programmer's perspective, it's not that difficult, as long as all the texts being supplied have the same encoding and as long as things like line breaks can be ignored. Remember that ck becomes k-k in (old-orthography) German hyphenation, so simply discarding the line break will not work. On the other hand, you might want to know how often a certain letter is followed or preceded by a line break, i.e. a hyphen. Different kinds of spaces are also rather tricky if you are interested in characters at the beginning and end of words.

A rough estimate, however, should not be too difficult to obtain in C or even TeX. As for the sources, I'd try to find large chunks of text on the internet. (The quantified lexis of books might be more important than the overall quantified lexis of the language, as I assume the purpose is related to typography.) Ghostview can also extract plain text from PDFs. -- cs
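
P.S. A rough sketch of that pre-processing, in Python for brevity – the ck heuristic is my own guess and genuinely ambiguous (an "Ak-/ku" break would wrongly become "Acku"):

    import re
    from collections import Counter

    def rejoin_linebreaks(text):
        # Old-orthography German ck: "Zuk-/ker" back to "Zucker".
        # Ambiguous, so treat it as a heuristic, not a rule.
        text = re.sub(r"k-\s*\n\s*k", "ck", text)
        # Plain hyphenated line breaks: just join the two halves.
        return re.sub(r"-\s*\n\s*", "", text)

    def letters_before_break(text):
        # How often each letter immediately precedes a line-breaking
        # hyphen (run this before rejoining, of course).
        return Counter(m.group(1) for m in re.finditer(r"(\w)-\s*\n", text))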

nina

By "quantified vocabulary" (which obviously is impossible to define exactly – apart from being worded awkwardly) I guess I meant a sampling large and broad enough that the results gleaned from evaluating it would be substantial enough to work from on a somewhat general level. That means the data basis must be *much* broader than a single text you can fill into those convenient boxes in typical online tools… plus selected/compiled in a non-arbitrary way.

The Brown corpus appears to provide such a basis to work from: http://en.wikipedia.org/wiki/Brown_Corpus

See the related discussion on this thread:
http://typophile.com/node/54942
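
For anyone who wants to experiment: the Brown corpus is apparently bundled with the NLTK package for Python. A minimal sketch, assuming NLTK is installed – note it counts pairs within words only, and keeps the original capitalization:

    import nltk
    from collections import Counter

    nltk.download("brown")          # one-time fetch of the corpus data
    from nltk.corpus import brown

    pairs = Counter()
    for word in brown.words():      # word tokens, repeated per actual usage
        for a, b in zip(word, word[1:]):
            if a.isalpha() and b.isalpha():
                pairs[a + b] += 1

    print(pairs.most_common(20))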

Dan Gayle

Large amounts of text can be found in the Gutenberg project, free for use.

I'm assuming that you are using this to single out kerning pairs? You might want to do a search for "most common kerning pairs" as I know I've seen something like that somewhere.
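
Pulling one of their plain-text files down for counting is easy enough, too. A sketch – the URL is just an example book, and the Project Gutenberg header/footer gets stripped so the license boilerplate doesn't skew the counts:

    from urllib.request import urlopen

    # Example book only; any of their plain-text files will do.
    url = "https://www.gutenberg.org/cache/epub/1342/pg1342.txt"
    raw = urlopen(url).read().decode("utf-8")

    # Cut everything outside the "*** START OF" / "*** END OF" markers.
    start = raw.find("*** START OF")
    end = raw.find("*** END OF")
    if start != -1 and end != -1:
        raw = raw[raw.index("\n", start) + 1 : end]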

nina

Oh, right. But I think it's not only the amount of text; it's the way it's selected and collated. I guess I've spent too much time studying this for collating "something, just a lot of text" myself to feel like anything but a hack job (and not enough time to actually do it, satisfactorily). Maybe I put too much trust in research, but I do feel this needs a basis that's been thought through in some way.
Like, what's in the Gutenberg project? It won't have daily newspapers. Word usage is highly context-specific, which is one reason why I think the Brown corpus does (or did?) a pretty good job of collating different language usage contexts.

"I’m assuming that you are using this to single out kerning pairs?"
Um, not yet. It was something that crossed my mind when I started thinking about spacing more seriously, which is also premature for me, but highly interesting. (Like wanting to know how often on average an "r" would be followed by a "round glyph" vs. a straight.)
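
To make that concrete, I imagine something like grouping the followers into rough shape classes and tallying those. A sketch – the class assignments are pure guesswork on my part, not any kind of standard, and "corpus.txt" is whatever text one feeds it:

    from collections import Counter

    # Rough shape classes for the left side of a following lowercase
    # letter -- debatable, and only a starting point.
    ROUND    = set("acdegoqs")
    STRAIGHT = set("bhiklmnpru")

    def follower_shapes(text, letter="r"):
        shapes = Counter()
        for a, b in zip(text, text[1:]):
            if a == letter and b.isalpha():
                if b in ROUND:
                    shapes["round"] += 1
                elif b in STRAIGHT:
                    shapes["straight"] += 1
                else:
                    shapes["other"] += 1
        return shapes

    text = open("corpus.txt", encoding="utf-8").read()
    print(follower_shapes(text, "r"))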

kentlew

> (Like wanting to know how often on average an “r” would be followed by a “round glyph” vs. a straight.)

Your dedication is admirable. But I have to ask: Would it really matter how often an "r" is followed by round vs. straight? Won't you want both combinations to look acceptable and well-spaced?

There was a time when disk space and processing time were at more of a premium and there was a point of diminishing returns for adding kerning pairs for unlikely (or practically non-existent) combinations. But things have changed quite a bit in the past several years. Also, with class kerning capabilities in OpenType, these frequency issues are much less important considerations, it seems to me.

-- Kent.

AtoZ

For a more extensive and up-to-date word list than the Brown Corpus, take a look at the Corpus of Contemporary American English. Here's a description from Wikipedia:

The freely-searchable 385+ million word Corpus of Contemporary American English (COCA) is the largest corpus of American English currently available, and the only publicly-available corpus of American English to contain a wide array of texts from a number of genres....

The corpus is composed of more than 385 million words in more than 150,000 texts, including 20 million words each year from 1990-2008. For each year (and therefore overall, as well), the corpus is evenly divided between the five genres of spoken, fiction, popular magazines, newspapers, and academic journals. The texts come from a variety of sources:

Spoken: (79 million words) Transcripts of unscripted conversation from nearly 150 different TV and radio programs.

Fiction: (75 million words) Short stories and plays from literary magazines, children’s magazines, popular magazines, first chapters of first edition books 1990-present, and movie scripts.

Popular Magazines: (81 million words) Nearly 100 different magazines, with a good mix (overall, and by year) between specific domains (news, health, home and gardening, women, financial, religion, sports, etc).

Newspapers: (76 million words) Ten newspapers from across the US, with a good mix between different sections of the newspapers, such as local news, opinion, sports, financial, etc.

Academic Journals: (76 million words) Nearly 100 different peer-reviewed journals. These were selected to cover the entire range of the Library of Congress classification system.

For your purposes you probably wouldn't want the spoken material, but it can easily be filtered out.
 
         ~~~~~~~~~~~~~~~~~
         When going from A to Z,
         I often end up At Oz.

nina

Kent,
"Your dedication is admirable. But I have to ask: Would it really matter how often an “r” is followed by round vs. straight? Won’t you want both combinations to look acceptable and well-spaced?"
Indeed I'm probably off on kind of a longish-term learning tangent here, rather than aiming for an actual, quick, tangible result that I'm then going to put to use in my font.
This curiosity about the frequency of letter combinations (especially regarding the "r" example mentioned) also wasn't triggered by thinking about kerning, but rather by the lettershapes themselves. In different fonts, the "r" – not due to spacing/kerning, but judging from the shape of the arm/beak – seems to go especially well with subsequent round shapes; or it has particular problems with an "a" following it; or it may look too close to an "m" when followed by an "n", etc. That made me wonder which combinations would be most important to consider in an "r" (and I'm assuming there must be similar considerations for other glyphs as well).
Of course I agree the goal has to be to make it look decent, at the very least, next to as close to "anything" as possible – but I'm wondering if there is one especially frequent combination pattern that would take precedence due to its frequency.

AtoZ: Thank you! That's awesome. I was wondering if the Brown corpus isn't somewhat out of date by now.
