The length of a unicode string is not always equal to the number of characters in it. (This depends on how Python was compiled.)
Two one-character unicode strings:
>>> u'\U00010000'
u'\U00010000'
>>> u'\U00008000'
u'\U00008000'
But they aren't exactly the same:
>>> len(u'\U00008000')
1
>>> len(u'\U00010000')
2
Can you guess what the two characters will be?
>>> u'\U00010000'[0]
u'\ud800'
>>> u'\U00010000'[1]
u'\udc00'
>>> u'\ud800' + u'\udc00'
u'\U00010000'
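The two halves are a UTF-16 surrogate pair, and the values can be worked out by hand. Here is a minimal sketch of that arithmetic; the helper name utf16_surrogates is made up for illustration and is not part of Python:

```python
def utf16_surrogates(code_point):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair.

    Hypothetical helper, for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary range")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go in the low surrogate
    return high, low


high, low = utf16_surrogates(0x10000)
print("%#x %#x" % (high, low))  # 0xd800 0xdc00
```

U+10000 is the first code point that needs a pair, which is why both halves land on the very first surrogate values.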
(Mahmoud):
The length of a unicode string is actually the number of fixed-size code units used to store it in memory. The first character (耀 for the curious) is half the size of the second character (𐀀). They were arbitrarily chosen because one fits into a single two-byte unit in memory, and the other spills over into a second unit (a surrogate pair, four bytes in total).
You can check how your Python build stores these characters in memory by running:
>>> import sys
>>> sys.maxunicode
If it's 1114111 (0x10FFFF), then you've got the UCS-4 (wide) in-memory representation and will get a len of 1 for both characters above. If it's 65535 (0xFFFF), then you've got UCS-2 (narrow), and you'll get the confusing and arguably wrong lengths.
These settings are configured when Python is built, and cannot be changed at runtime. Python 3.3 and later eliminate this distinction altogether (PEP 393).
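If you need a character count that does not depend on the build, one option is to count a surrogate pair as a single character. The following is a rough sketch under that assumption; is_wide_build and code_point_len are names invented here, not built-ins:

```python
import sys


def is_wide_build():
    """True when sys.maxunicode is above 0xFFFF (one code point per item)."""
    return sys.maxunicode > 0xFFFF


def code_point_len(s):
    """Count code points, treating a high+low surrogate pair as one.

    Hypothetical helper, for illustration only.
    """
    count = 0
    i = 0
    while i < len(s):
        cp = ord(s[i])
        is_high = 0xD800 <= cp <= 0xDBFF
        if is_high and i + 1 < len(s) and 0xDC00 <= ord(s[i + 1]) <= 0xDFFF:
            i += 1  # skip the low surrogate that completes the pair
        i += 1
        count += 1
    return count


print(code_point_len(u'\u8000'))        # 1
print(code_point_len(u'\ud800\udc00'))  # 1
```

On a wide build this does the same work as len() for ordinary strings, so it only earns its keep when the code has to run on narrow builds too.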
Kind of like "hey guys, check it out you can just duct tape down the dead-man's switch on this power tool and use it one handed". In Python.
Thursday, March 17, 2016
unicode + ord
The ord() built-in may return values greater than 255 when handed a 1-character unicode string:
>>> ord(u'\U00008000')
32768
This means that chr(ord(s)) will not always work, because chr() only accepts values in range(256):
>>> chr(ord(u'\U00008000'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: chr() arg not in range(256)
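In Python 2 the unicode counterpart of chr() is unichr(), which does round-trip with ord() for this character. A small sketch that picks whichever one the interpreter has (the name to_char is just an illustrative alias):

```python
try:
    to_char = unichr   # Python 2: unichr() accepts code points above 255
except NameError:
    to_char = chr      # Python 3: chr() already covers all of Unicode

s = u'\u8000'
print(to_char(ord(s)) == s)  # True
```

Note that on a narrow Python 2 build unichr() is itself limited to code points below 0x10000, so the wide/narrow distinction still applies to characters outside that range.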