Python and codepoints above FFFF

Michel Boyer

Here is a Python script that dumps the characters of a UTF-8 input file to standard output. The script works fine on Linux, but if the input contains characters above U+FFFF it does not behave as expected on the Mac, whatever version of Python I use (I tried Python 2.5 and 2.6 on OS X 10.5, and Python 2.5, 2.6 and 3.1 on OS X 10.6).

Here is the dumpchars script:
---
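import sys, codecs, unicodedata

# Read the whole input file, decoding it as UTF-8.
infile = codecs.open(sys.argv[1], "r", "utf-8")
text = infile.read()
infile.close()

def dump(char):
    # Print the code point in hex, followed by its Unicode name if it has one.
    try:
        print('%04X %s' % (ord(char), unicodedata.name(char)))
    except ValueError:
        # unicodedata.name() raises ValueError for unnamed characters (e.g. newline).
        print('%04X' % ord(char))

for char in text:
    dump(char)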
---

If, on Linux (even on the OLPC), I run python dumpchars in.txt on a file in.txt containing one line with the characters a, b and a CJK unified ideograph (the ideograph makes Typophile's editor crash, so I had to remove that line from this post)

I get

0061 LATIN SMALL LETTER A
0062 LATIN SMALL LETTER B
230B7 CJK UNIFIED IDEOGRAPH-230B7
000A

as expected. If I do the same on a Mac with any version of Python, from 2.5 to 3.1, I get this:

0061 LATIN SMALL LETTER A
0062 LATIN SMALL LETTER B
D84C
DCB7
000A

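Characters above U+FFFF are not handled properly: the ideograph comes out as two surrogate code units (D84C and DCB7) instead of the single code point 230B7. I get the same output on a PC (I tried with Python 2.6).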

Does anyone know of a solution to this problem, aside from always having to rely on a "Linux box"?

Michel

twardoch

Michel,

unfortunately, by default Python is compiled in UCS-2 ("narrow") mode, so each string element holds only a 16-bit value and characters outside the Unicode BMP show up as surrogate pairs. It's possible to build Python from source in UCS-4 ("wide") mode, but that means doing some compilation yourself. It's also possible to install a UCS-4 build of Python from MacPorts (using Porticus).
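If rebuilding Python is not an option, one possible workaround is to recombine the surrogate pairs by hand. The following is only a minimal sketch, not something tested in this thread: the helper names is_narrow_build, combine_surrogates and code_points are made up for illustration, and it only recovers the code point number (it does not make unicodedata.name work on a narrow build).

```python
import sys

def is_narrow_build():
    # sys.maxunicode is 0xFFFF on narrow (UCS-2) builds and 0x10FFFF on
    # wide (UCS-4) builds, as well as on Python 3.3 and later.
    return sys.maxunicode == 0xFFFF

def combine_surrogates(high, low):
    # Standard UTF-16 decoding: each surrogate carries 10 bits of the
    # code point, offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def code_points(text):
    """Yield integer code points, joining surrogate pairs where present."""
    i = 0
    n = len(text)
    while i < n:
        cp = ord(text[i])
        if 0xD800 <= cp <= 0xDBFF and i + 1 < n:
            low = ord(text[i + 1])
            if 0xDC00 <= low <= 0xDFFF:
                # High surrogate followed by low surrogate: one character.
                yield combine_surrogates(cp, low)
                i += 2
                continue
        # BMP character, unpaired surrogate, or a full code point on a wide build.
        yield cp
        i += 1

if __name__ == "__main__":
    # Hypothetical sample: the surrogate pair reported in the output above.
    for cp in code_points(u"ab\ud84c\udcb7\n"):
        print('%04X' % cp)
```

In the dumpchars script, the final loop would then iterate over code_points(text) and print each number, instead of iterating over text directly. On a wide build the function simply passes every character through unchanged, so the same script can run on both kinds of build.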

Adam

Michel Boyer

Adam

I now have a MacPorts version of Python 2.6 that works as I expected.

Thanks,

Michel
