Does anyone know of a good resource for character frequency information for various languages? Preferably online, but offline is fine too.
I have found a few sites, like this one: letter frequencies (rankings for various languages). But I'm looking for something fairly extensive, perhaps with statistics, and definitely Unicode compliant. I've also found a tool that compiles letter frequencies. Has anyone used it, or does anyone know of a better one that does what I've outlined above (generates character frequencies and stats while keeping Unicode intact)? Just thought I'd put a few feelers out there...
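For anyone who wants to roll their own in the meantime, counting characters in a Unicode-safe way takes only a few lines. This is a minimal sketch, not any of the tools mentioned in this thread: the function name `char_frequencies` and the `letters_only` switch are made up for illustration, and it assumes the text is already decoded (e.g. read from a UTF-8 file).

```python
import unicodedata
from collections import Counter

def char_frequencies(text, letters_only=True):
    """Return (character, count, share) tuples, most frequent first."""
    # NFC normalization: "e" + combining acute is counted as the single
    # character "é" wherever a precomposed form exists.
    text = unicodedata.normalize("NFC", text)
    if letters_only:
        # Unicode general categories starting with "L" are letters.
        chars = [c for c in text if unicodedata.category(c).startswith("L")]
    else:
        chars = list(text)
    counts = Counter(chars)
    total = sum(counts.values())
    return [(c, n, n / total) for c, n in counts.most_common()]

if __name__ == "__main__":
    for ch, n, share in char_frequencies("Déjà vu, déjà lu"):
        print(f"U+{ord(ch):04X} {ch} {n} {share:.3f}")
```

Whether upper and lower case should be merged (for example with `str.casefold()`) depends on what the numbers are for, so the sketch leaves case alone.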
15 Feb 2008 — 8:49am
Didn't Luc(as) de Groot collect extensive information regarding this? Couldn't find it on his site, but you could contact him directly.
. . .
Bert Vanderveen BNO
15 Feb 2008 — 10:57am
You could try the codebreakers at the NSA. No, seriously -- this NSA website lists a number of declassified printed documents you might be able to request regarding letter or digraph frequencies in Russian, French, Polish, and Japanese.
This site has a letter frequency list for English, French, German, and Spanish, and Wikipedia adds Esperanto. And Appendix A of Army field manual 34-40-2, Basic Cryptanalysis, has a list of English digraph frequencies. The whole manual is available as a zipped collection of PDFs here.
(Later edit: I also found this NSA document (PDF), an introduction to cryptanalysis, that has some interesting information on pages 11 through 17. Unfortunately, it's a book that was scanned into PDF, and dates back to 1938, so it's not likely to be Unicode friendly -- but it does have a claim to being authoritative.)
15 Feb 2008 — 11:19am
>Didn’t Luc(as) de Groot collect extensive information regarding this?
Luc's data extends to common pairs for various languages. Helps plan kerning.
Cheers, Si
15 Feb 2008 — 9:15pm
When I lived in DC, a friend of mine worked for the NSA. As has been noted, their stuff doesn't get released until a substantial number of years after it's been collected, so it's probably not all that useful.
15 Feb 2008 — 9:29pm
I doubt the language would have changed all that much.
-=®=-
15 Feb 2008 — 9:35pm
You'd be surprised -- between the start of WWII and now, there's been some serious Anglicization of most other languages in the world, and that radically alters character frequency.
(Except for France, of course, where they are quite rude wrt "English" invading "their" language. Fold them in with many non-Latin languages that are using English words written in their own character forms and all bets are off....)
16 Feb 2008 — 3:10am
I have done some extensive analysis in this field. The texts generated by my test text generator are synthesized on the basis of triplet frequency lists, which I obtained by a very thorough analysis of texts.
For example, for English, as an input I used several texts, 5 literature, 5 scientific and 2 economic, each 5-10 MB in size. In the end, I did not take the arithmetic average but something similar to the median so as to make sure subject-specific key words in a certain text do not spoil the overall result.
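To illustrate the idea of combining several texts without letting one subject-heavy text dominate, here is a rough sketch. It uses a plain per-item median of relative frequencies, which is an assumption on my part: the post only says "something similar to the median", and the actual method is not described. The function name `robust_frequencies` is hypothetical.

```python
from statistics import median

def robust_frequencies(per_text_counts):
    """per_text_counts: list of dicts mapping item -> raw count, one dict per text."""
    # Turn raw counts into relative frequencies so that long and short
    # texts carry equal weight.
    relative = []
    for counts in per_text_counts:
        total = sum(counts.values())
        if total == 0:
            continue  # skip empty texts
        relative.append({k: v / total for k, v in counts.items()})
    keys = set().union(*relative) if relative else set()
    # Median across texts; an item missing from a text counts as 0 there.
    med = {k: median(r.get(k, 0.0) for r in relative) for k in keys}
    # Medians need not sum to 1, so renormalize when possible.
    s = sum(med.values())
    return {k: v / s for k, v in med.items()} if s else med
```

With three texts where only one is full of a subject-specific item, that item's median is zero, while an arithmetic mean would have kept a sizeable share for it.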
As you can see, I have data for 22 languages but some of them are only based on 4-5 texts.
I could convert my frequency lists to character frequencies or pair frequencies if there is a general interest. Btw, why were you interested in the first place, Paul? What would you use them for?
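For what it's worth, a conversion like that can be sketched by summing triplet counts over their leading characters. This is only an illustration under assumptions: it presumes the triplets were counted with a sliding window over running text (the post does not say how the lists were built), and the names `count_trigrams` and `marginalize` are made up here. The last two characters of each source text never start a triplet, so the derived counts are slightly low at the edges.

```python
from collections import Counter

def count_trigrams(text):
    """Count every run of three consecutive characters (sliding window)."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def marginalize(trigram_counts):
    """Derive approximate pair and single-character counts from triplet counts."""
    pairs = Counter()
    chars = Counter()
    for tri, n in trigram_counts.items():
        # Each triplet contributes its leading pair and its leading character.
        pairs[tri[:2]] += n
        chars[tri[0]] += n
    return pairs, chars

if __name__ == "__main__":
    pairs, chars = marginalize(count_trigrams("banana"))
    print(pairs.most_common())
    print(chars.most_common())
```

On a corpus of several megabytes the edge effect is negligible, but on a short sample like "banana" the final "na" pair and the final "n" and "a" are visibly missing from the totals.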
16 Feb 2008 — 3:47am
Paul, you can use Typotheque's Letter Frequency Meter to an extent, even with non-Latin scripts. The first column of the results it gives seems to me at first glance to list glyph occurrences correctly in any Unicode-encoded text. It is Mac only, but you could get on one of your roommates' machines…