character frequencies?

paul d hunt's picture

Does anyone know of a good resource for character frequency information for various languages? Online is preferable, but anything else is fine too.
I have found a few sites, like this: letter frequencies (rankings for various languages). But I'm looking for something fairly extensive, perhaps with statistics, and definitely Unicode compliant. I've also found a tool that compiles letter frequencies. Has anyone used this, or does anyone know of a better one that might do what I've outlined above (generate character frequencies and stats while keeping Unicode intact)? Just thought I'd put a few feelers out there...
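
For illustration, here is a minimal sketch of what such a tool might do, assuming Python 3 and text that is already decoded as Unicode. The function name `char_frequencies` and the `letters_only` switch are made up for this example and do not come from any of the tools mentioned here. Normalizing to NFC first is an assumption on my part: it makes "e" plus a combining accent count as the same character as a precomposed "é".

```python
import unicodedata
from collections import Counter


def char_frequencies(text, letters_only=True):
    """Return {character: relative frequency}, most frequent first.

    Hypothetical helper for illustration only.
    """
    # Assumption: NFC normalization, so decomposed and precomposed
    # forms of the same accented letter are counted together.
    text = unicodedata.normalize("NFC", text)
    if letters_only:
        chars = [c for c in text if c.isalpha()]
    else:
        # Keep punctuation and digits, drop whitespace only.
        chars = [c for c in text if not c.isspace()]
    counts = Counter(chars)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Case is left untouched, since upper and lower case are
    # different glyphs from a type design point of view.
    return {c: n / total for c, n in counts.most_common()}


if __name__ == "__main__":
    sample = "Ærøskøbing ligger på Ærø."
    for char, freq in char_frequencies(sample).items():
        print(f"{char}\tU+{ord(char):04X}\t{freq:.3f}")
```

Reading the input with `open(path, encoding="utf-8")` should keep non-Latin scripts intact, provided the file really is UTF-8.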

Bert Vanderveen's picture

Didn't Luc(as) de Groot collect extensive information regarding this? Couldn't find it on his site, but you could contact him directly.

. . .
Bert Vanderveen BNO

Gus Winterbottom's picture

You could try the codebreakers at the NSA. No, seriously -- this NSA website lists a number of declassified printed documents you might be able to request regarding letter or digraph frequencies in Russian, French, Polish, and Japanese.

This site has a letter frequency list for English, French, German, and Spanish, and Wikipedia adds Esperanto. And Appendix A of Army field manual 34-40-2, Basic Cryptanalysis, has a list of English digraph frequencies. The whole manual is available as a zipped collection of PDFs here.

(Later edit: I also found this NSA document (PDF), an introduction to cryptanalysis, that has some interesting information on pages 11 through 17. Unfortunately, it's a book that was scanned into PDF, and dates back to 1938, so it's not likely to be Unicode friendly -- but it does have a claim to being authoritative.)

Si_Daniels's picture

>Didn’t Luc(as) de Groot collect extensive information regarding this?

Luc's data extends to common pairs for various languages. Helps plan kerning.

Cheers, Si

Linda Cunningham's picture

When I lived in DC, a friend of mine worked for the NSA. As has been noted, their material isn't released until a substantial number of years after it's been collected, so it's probably not all that useful.

russellm's picture

I doubt the language would have changed all that much.

-=®=-

Linda Cunningham's picture

You'd be surprised -- between the start of WWII and now, there's been some serious Anglicization of most other languages in the world, and that radically alters character frequency.

(Except for France, of course, where they are quite rude wrt "English" invading "their" language. Fold them in with many non-Latin languages that are using English words written in their own character forms and all bets are off....)

Tim Ahrens's picture

I have done some extensive analysis in this field. The texts generated by my test text generator are synthesized on the basis of triplet frequency lists, which I obtained by a very thorough analysis of texts.
For example, for English, as an input I used several texts, 5 literature, 5 scientific and 2 economic, each 5-10 MB in size. In the end, I did not take the arithmetic average but something similar to the median so as to make sure subject-specific key words in a certain text do not spoil the overall result.
As you can see, I have data for 22 languages but some of them are only based on 4-5 texts.
I could convert my frequency lists to character frequencies or pair frequencies if there is a general interest. Btw, why were you interested in the first place, Paul? What would you use them for?
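
A rough sketch of the approach described above, for anyone who wants to try it on their own texts. This is not the actual generator code: it assumes character-level triplets, uses the plain median where the post says "something similar to the median", and all function names are hypothetical. The conversion to character frequencies sums triplet frequencies by their first character, which is only an approximation because the last two characters of each text never start a triplet.

```python
from collections import Counter
from statistics import median


def trigram_frequencies(text):
    """Relative frequency of every 3-character sequence in one text."""
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()} if total else {}


def median_frequencies(per_text):
    """Combine per-text frequency dicts by taking the median per key.

    Assumption: plain median, with 0.0 for texts where the key is
    absent, so a sequence that is common in only one text (a
    subject-specific keyword) does not dominate the result.
    """
    keys = set().union(*per_text)
    return {k: median(f.get(k, 0.0) for f in per_text) for k in keys}


def to_char_frequencies(trigram_freqs):
    """Approximate single-character frequencies from triplet frequencies."""
    chars = Counter()
    for tri, freq in trigram_freqs.items():
        chars[tri[0]] += freq  # marginalize over the 2nd and 3rd character
    total = sum(chars.values())
    return {c: v / total for c, v in chars.items()} if total else {}


if __name__ == "__main__":
    texts = ["the cat sat on the mat", "the dog ate the bone", "a theory of the thing"]
    combined = median_frequencies([trigram_frequencies(t) for t in texts])
    top = sorted(to_char_frequencies(combined).items(), key=lambda kv: -kv[1])
    print(top[:5])
```

Pair frequencies could be derived the same way by summing over `tri[:2]` instead of `tri[0]`.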

dan_reynolds's picture

Paul, you can use Typotheque's Letter Frequency Meter to an extent, even with non-Latin scripts. The first column of the results it gives seems to me, at first glance, to list glyph occurrences correctly in any Unicode-encoded text. It is Mac only, but you could get on one of your roommates' machines…
