Does anyone know of any resources which would provide information on which letter *pairs* frequently occur in different languages? It would be especially useful if this included information on diacritics.
I'm currently dealing with some consonant-vowel ligatures, and want to figure out if there are diacritical combinations which can be safely omitted. I'd tried googling for various diacritical combinations, but the useful data ends up buried amid results drawn from a miscellany of legacy CJK encodings.
André
17 Aug 2009 — 7:27pm
Chthonic, Django Reinhardt, Jzanus, Ljubljana, llama... you get the idea: too many possibilities...
17 Aug 2009 — 8:30pm
One source of information I have used for similar purposes is Open Office Dictionaries and other dictionaries for spelling checkers. For instance, if you click on the link for Canadian English (zip file) you get a folder containing a file with extension .dic with 62341 entries (including "derived" entries). Other dictionaries can be much larger. The .dic file is plain text. If you remove what follows the slash after each word, you get a file on which you can run programs to extract pairs, count them, etc. Of course, that gives no information on the frequency with which those pairs occur in actual texts but that gives information on possible pairs for the language you chose. Some dictionaries are utf-8 encoded, others are latin1 and so on. The encoding is given at the first line of a second file with extension .aff. Some programming ability is thus required.
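As a rough sketch of the steps just described (strip what follows the slash, then extract and count pairs), something like the following could work. The function names and the sample entries are made up for illustration; the sketch also assumes the file has already been decoded with the encoding named in the .aff file, and that a purely numeric first line is an entry count to be skipped.

```python
import unicodedata
from collections import Counter

def dic_words(lines):
    """Yield the bare word from each .dic entry, dropping the '/FLAGS' part."""
    for line in lines:
        word = line.strip().split("/", 1)[0]
        # Skip blank lines and a purely numeric line (assumed to be the entry count).
        if word and not word.isdigit():
            # NFC keeps a letter and its diacritic together as one character where possible.
            yield unicodedata.normalize("NFC", word)

def letter_pairs(words):
    """Count adjacent letter pairs inside each word."""
    pairs = Counter()
    for word in words:
        for a, b in zip(word, word[1:]):
            if a.isalpha() and b.isalpha():
                pairs[a + b] += 1
    return pairs

# Made-up entries, not taken from a real dictionary file.
sample = ["3", "llama/S", "señor/S", "año"]
pairs = letter_pairs(dic_words(sample))
print(sorted(pairs.items()))
```

For a real file, the lines could come from something like `open("en_CA.dic", encoding="utf-8")`, where both the file name and the encoding are placeholders to be replaced by whatever the .aff file says.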
17 Aug 2009 — 8:31pm
Well, yes there will be lots of possibilities, but some pairs are still going to be cross-linguistically more common than others, and diacritics which are not commonly used may not occur adjacent to others -- for example, I *think* that if one had an sa ligature in a font, that it would be more important to also implement sá than șä. But I'm basing that on the fact that ä doesn't occur in Rumanian and AFAIK that's the only language which uses ș. Even within a language which contains a variety of diacritics, it's not necessarily the case that all of those diacritics will occur adjacent to one another, and while it's relatively easy to find information on which diacritics are used in which languages, I haven't found information on diacritic pairs.
André
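One way to approach the diacritic-pair question from a word list is to keep only those letter pairs in which at least one letter carries a diacritic. This is a minimal sketch, assuming that "has a diacritic" can be approximated by "decomposes into a base letter plus a combining mark" under Unicode normalization; letters such as ø or ł do not decompose this way and would be missed. The function names and sample words are hypothetical.

```python
import unicodedata

def has_diacritic(ch):
    """True if ch decomposes (NFD) into something that includes a combining mark."""
    return any(unicodedata.combining(c) for c in unicodedata.normalize("NFD", ch))

def diacritic_pairs(words):
    """Return the set of adjacent letter pairs where either letter has a diacritic."""
    found = set()
    for word in words:
        word = unicodedata.normalize("NFC", word)
        for a, b in zip(word, word[1:]):
            if a.isalpha() and b.isalpha() and (has_diacritic(a) or has_diacritic(b)):
                found.add(a + b)
    return found

# Made-up word list, for illustration only.
found = diacritic_pairs(["sá", "sat", "și"])
print(sorted(found))
```

Pairs that never show up in the result for any language of interest would be candidates for omission, with the caveat that a dictionary only shows what is possible, not how often it occurs.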
17 Aug 2009 — 8:36pm
Thanks Michel -- I'd tried using the Mac OS built-in dictionary for those languages I've installed, but it doesn't support wildcards (or if it does, the asterisk isn't used for this). Never thought, though, to try opening the actual file (a senior moment).
André
17 Aug 2009 — 9:10pm
I use terminal windows and unix utilities to find those files and process them. Maybe you can do better with Mac utilities, I don't know. For dictionaries installed by Firefox, I type the command "cd $HOME/Library/Ap*ort/Firefox" in a terminal window and then
find . -name "*.dic"
gives me the list of those dictionaries. They can be copied into a temporary folder and batch processed.
Michel
17 Aug 2009 — 9:15pm
I always wish there would be some linguistics textbook that covers this stuff. Maybe Steve Peters will chime in here with some help.
If you have time to figure out the syntax to sift through text file wordlists it’s pretty easy to put this stuff together using Python or just Bash scripting (grep "öö" file.txt | wc -l). The OpenWall wordlists disk is worth its low price if you don’t need to analyze actual text. Ask around in the netsec world and I’m sure even more dictionaries exist. Project Gutenberg and similar resources probably have real texts covering many of the languages you need to analyze.
17 Aug 2009 — 9:26pm
I have somewhere a python script that counts bigrams in a utf-8 encoded source. To get the list of words, I just use "awk 'BEGIN{FS="/"}{print $1}' *.dic". If that can be useful, I'll try to find the script. That's just a few lines of code, never more.
17 Aug 2009 — 10:31pm
James wrote: I always wish there would be some linguistics textbook that covers this stuff.
Linguistics texts generally aren't that concerned with orthography, so this isn't a likely source. You'll find lots of information on the pairings of various sounds, but any statistics presented will likely involve IPA rather than orthographic representations.
André
18 Aug 2009 — 1:08am
Frequency analysis is what you really need - a dictionary would not be enough. This would require some long texts in all the languages of interest. I don't know of a good general source for these, but someone must have compiled such.
Some years ago Luc(as) de Groot (http://www.lucasfonts.com/) did some good work on compiling resources for kerning and building some tools for it. I think he called it Kernologica. He should be able to point you in some useful directions.
18 Aug 2009 — 5:20am
Frequency analysis is what you really need.
Most obviously. To get frequencies (absolute or relative) of bigrams, all you need is a very basic script that can be run on some utf-8 encoded input. To get such a script (for alphabetic bigrams), you can just copy what is between the cut lines and paste it in a terminal window and you will get an executable file named bigrams in your current folder.
----
cat >bigrams <<'EOF'
#!/usr/bin/env python3
# M. Boyer 2009
# Count alphabetic bigrams in a utf-8 encoded text file.
import codecs, sys

infile = codecs.open(sys.argv[1], "r", "utf-8")
text = infile.read(); infile.close()
tallies = {}; nbdata = 0; prev = ' '

def tallyq(c):
    # only letters are tallied; anything else breaks the pair
    return c.isalpha()

for char in text:
    if tallyq(prev) and tallyq(char):
        datum = prev + char  # ; datum = datum.lower()
        nbdata = nbdata + 1
        if datum in tallies:
            tallies[datum] = tallies[datum] + 1
        else:
            tallies[datum] = 1
    prev = char

# one line per bigram: bigram;count;percentage of all bigrams
for d in tallies:
    print('%s;%d;%.3f%%' % (d, tallies[d], 100.0 * tallies[d] / nbdata))
EOF
chmod 755 bigrams
----
Then you decide what you want to run it on. For instance, if you want to run it on Chekhov's text Дама с собачкой (The lady with the little dog), you can type (or copy and paste) the line
lynx -dump http://lib.ru/LITRA/CHEHOW/d.txt > dama.txt
and then run (maybe after removing some html references at the bottom)
./bigrams dama.txt | sort
Here is a copy-paste of part of the output:
то;372;1.927%
тп;1;0.005%
тр;85;0.440%
тс;31;0.161%
ту;26;0.135%
тф;2;0.010%
тх;1;0.005%
тч;7;0.036%
There were 372 occurrences of то, which represent 1.927% of all bigrams (after cleaning the text).
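Sorting alphabetically is handy for browsing, but for frequency work it may be more useful to rank the pairs by count. Here is a small sketch that parses lines in the `bigram;count;percent` format shown above and sorts them by count; the function name is made up, and the input lines are a few of the ones quoted from the output.

```python
def parse_bigram_line(line):
    """Split one 'bigram;count;percent%' line into (bigram, count, percent)."""
    bigram, count, percent = line.strip().split(";")
    return bigram, int(count), float(percent.rstrip("%"))

# A few of the output lines quoted above.
lines = ["то;372;1.927%", "тп;1;0.005%", "тр;85;0.440%", "тс;31;0.161%"]

# Most frequent pairs first.
rows = sorted((parse_bigram_line(l) for l in lines), key=lambda r: r[1], reverse=True)
for bigram, count, percent in rows:
    print(bigram, count, percent)
```

On the command line, a numeric sort on the second field would presumably achieve the same thing.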
With the internet, there are now many sources of texts in all languages. There is also nothing to prevent you from running the script on a dictionary to know possible combinations; it seems you then don't need the frequencies but it may still be interesting to see what were the words containing bigrams with very low frequencies. A simple grep answers the question.
Michel
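For those who would rather stay in Python than use grep, the lookup of words behind rare bigrams might be sketched as follows. This assumes the tallies are already available as a dictionary mapping bigram to count; the function name, the threshold parameter, and the sample data are all invented for illustration.

```python
def words_with_rare_bigrams(words, tallies, max_count=1):
    """Map each rare bigram (count <= max_count) to the words that contain it."""
    rare = {b for b, n in tallies.items() if n <= max_count}
    hits = {}
    for word in words:
        for a, b in zip(word, word[1:]):
            if a + b in rare:
                hits.setdefault(a + b, set()).add(word)
    return hits

# Made-up tallies and word list, for illustration only.
tallies = {"th": 120, "he": 150, "ht": 1, "ch": 40, "hh": 1}
words = ["the", "height", "withhold", "check"]
hits = words_with_rare_bigrams(words, tallies)
print(hits)
```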
[added] I guess the mac does not come with lynx installed. I must have installed it myself. That example may be more for Linux than mac users. Sorry.
18 Aug 2009 — 6:29am
Nice!
18 Aug 2009 — 11:41am
Ohai.
The LetterMeter from Peter Bilak and Just van Rossum can analyze a text for single-letter and letter-pair occurrence. Then it is just a matter of feeding it with the texts you deem appropriate.
Says the website:
LetterMeter is a text analysis tool, used in the Type&Media classes (postgraduate course of type design) at the Royal Academy of Art in The Hague. LetterMeter is designed for comparing multilingual texts and measuring the frequency of particular glyphs.
Because it is Unicode based, it will work with the majority of languages. The current version will recognize Latin, Greek and Cyrillic glyphs, and sort them according to their formal attributes. LetterMeter's results include statistics for the incidences of round/square/open/diagonal left and right sides of glyphs, ratios of vowels/consonants, and counts of glyphs with accents, ascenders and descenders, in any given text(s).
LetterMeter was developed jointly by Peter Bilak and Just van Rossum, whom I would like to thank for the Python programming. Vera Evstafieva helped with the Cyrillic specifications, and Panos Haratzopoulos with the Greek.
LetterMeter is created using Python and works only on Mac OS X. Although it is available for free, it is copyrighted, and you may not redistribute it. All rights reserved, © 2003, Peter Bilak, Just van Rossum.
For TEH DOWNLOADS at Typotheque
19 Aug 2009 — 12:59pm
Here is another tool I made using the above code (I replaced semicolons by tabs, and added basic choices). It can be used from absolutely any computer (well... you tell me if it works on an iPhone). Link.
On a PC, if you save the resulting statistics as a text file, you can then import it in Excel for further processing. On the mac, I have found no way to import utf-8 text into Excel. Hard to believe!
Michel
20 Aug 2009 — 5:33pm
Here are some results I got for English. http://groups.google.ca/group/comp.lang.postscript/msg/34c2bb049b42f668?...
I used to use a C program to count the most common digrams, then augment it against punctuation, to generate kerning pair lists for URW Kernus.
21 Aug 2009 — 7:22am
I guess there are indeed good references for English.
Before continuing, let me say that Lynx for Mac OS X can be downloaded from http://www.apple.com/downloads/macosx/unix_open_source/lynxtextwebbrowse.... To use it at the command line, you add /Applications to your path. I assume this is done, and that "Terminal > Window Settings > Display" is set to Unicode (UTF-8). What follows is then good for Linux and Mac users who are used to unix commands.
Now, some digrams may cause more than kerning problems. For instance, in the Typophile thread f + umlauts, Florian Hardwig mentions that the digrams fä, fö, fü may cause a clash between the umlauts and the f. Those combinations occur in German. How often? Let's check.
On the Project Gutenberg Catalog, I find Kant's Kritik der reinen Vernunft. On that page I see no html version, and no utf-8 version. I see a plain text iso-8859-1 file and if I right click the "main site" link and paste it I get that the iso-8859-1 text has url
http://www.gutenberg.org/dirs/etext04/8ikc210.txt
I will thus need to tell lynx to expect iso8859-1 text; I will save the result in kritik.txt as follows (on the command line):
lynx --dump -assume_charset=ISO8859-1 http://www.gutenberg.org/dirs/etext04/8ikc210.txt > kritik.txt
The resulting file kritik.txt now contains the utf8 text (lynx did the reencoding).
Now I look at the digrams in kritik.txt; I do not try to be efficient; the bigrams code above is not, and as long as I get my answer in reasonable time, that's fine with me. I'll just find all bigrams in the text and then egrep those containing fä, fö, fü (I replaced semicolons by tabs in the bigrams code).
./bigrams kritik.txt | egrep "f[äöü]"
and I get the output
fö 27 0.003%
fü 697 0.079%
fä 255 0.029%
which means that there is a total of 27+697+255 = 979 possible clashes in Kant's text. In my library, the book is 847 pages. On average, that is more than one possible clash per page. A few simple and inefficient scripts, unix commands and pipes often give answers faster than sophisticated programs.
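The same filter and the per-page arithmetic can be sketched in Python, assuming the tab-separated output shown above has been captured as text. The variable names are made up; the counts and the page number are the ones quoted in this post.

```python
import re

# The tab-separated output lines quoted above.
output = "fö\t27\t0.003%\nfü\t697\t0.079%\nfä\t255\t0.029%"

# Same idea as egrep "f[äöü]": keep lines whose bigram is f plus an umlauted vowel.
pattern = re.compile(r"^f[äöü]\t")
total = sum(int(line.split("\t")[1])
            for line in output.splitlines() if pattern.match(line))

pages = 847  # page count of the printed edition mentioned above
per_page = total / pages
print(total, round(per_page, 2))
```

The pattern is anchored at the start of the line here, which is slightly stricter than the egrep call; with two-letter bigrams per line the two should pick out the same lines.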
Michel