Verdana Hates Pinyin
I stumbled across an article on lostlaowai.com
www.lostlaowai.com/survival-chinese
which led me to poke around the site a bit. At the above URL, I noticed that some of the combining diacritical marks (tone marks) used in writing pinyin were not rendering properly. I had not seen this problem before. It didn’t make sense.
Things that don’t make sense bug me. And being something of a character geek, I couldn’t let it go. So I tried to reproduce the problem in a test example. I couldn’t. That’s when I discovered a quirky Mac OS X copy+paste issue. I sensed there was a problem but the truth was elusive. You can’t see that copy+paste changes the string characters unless you look at a binary dump of the file (which I did).
Okay, the Mandarin word for ‘good’ is 好, which is written ‘hǎo’ in pinyin. It’s possible to write the pinyin using code points from just Unicode’s Latin blocks.

Latin Extended-B (Latin)
latin small letter a with caron
Unicode 01CE    UTF-8 C7 8E

h     ǎ     o
0068  01CE  006F
It’s also possible to write the pinyin using Combining Diacritical Marks.
Combining Diacritical Marks (Combining Marks)
combining caron
Unicode 030C    UTF-8 CC 8C

h     a     030C  o
0068  0061  030C  006F
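The two spellings are easy to tell apart once you look at code points and bytes rather than glyphs. Here is a minimal Python sketch; the helper names `code_points` and `utf8_hex` are mine, made up for illustration.

```python
def code_points(text: str) -> str:
    """Return the code points of text as space-separated 4-digit hex."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


def utf8_hex(text: str) -> str:
    """Return the UTF-8 encoding of text as space-separated hex bytes."""
    return " ".join(f"{byte:02X}" for byte in text.encode("utf-8"))


# Escapes are used so the two forms cannot be confused in the source.
precomposed = "h\u01ceo"   # h + LATIN SMALL LETTER A WITH CARON + o
combining = "ha\u030co"    # h + a + COMBINING CARON + o

print(code_points(precomposed))  # 0068 01CE 006F
print(utf8_hex(precomposed))     # 68 C7 8E 6F
print(code_points(combining))    # 0068 0061 030C 006F
print(utf8_hex(combining))       # 68 61 CC 8C 6F
print(precomposed == combining)  # False
```

Both strings should look the same on screen with a capable font, yet they compare unequal and differ in length.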
Note that the combining mark comes after the character it decorates. This is in contrast to Mac OS X’s U.S. Extended Keyboard input method, where the modifier letter is typed before the character to decorate. However, the modifier letter is not a combining mark. You cannot use it to create a byte sequence that a browser renders as hǎo; it will come out as hˇao.

Spacing Modifier Letters (Modifier Letters)
caron
Unicode 02C7    UTF-8 CB 87

h     02C7  a     o
0068  02C7  0061  006F

NOTE: the caron does not combine with the a; OS X does not modify the 'a' to have a caron above.

The OS X input method uses the modifier letter to look up an equivalent code point in Unicode’s Latin blocks.

Using OS X's US Extended Keyboard Input Method
opt-v + a

h     02C7  a     o            h     ǎ     o
0068  02C7  0061  006F   ==>   0068  01CE  006F

Note: the caron combines with the a; OS X automatically converts 02C7 + 0061 into 01CE.
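The difference between the modifier letter and the combining mark shows up in their Unicode properties. The sketch below checks those properties with Python's `unicodedata` module, then imitates the dead-key lookup. The `DEAD_KEY_TO_MARK` table and `apply_dead_key` function are hypothetical; I have no idea how OS X actually implements the lookup, this only reproduces the observed result.

```python
import unicodedata

CARON_MODIFIER = "\u02c7"    # spacing modifier letter
CARON_COMBINING = "\u030c"   # combining mark

print(unicodedata.category(CARON_MODIFIER))    # Lm (modifier letter)
print(unicodedata.category(CARON_COMBINING))   # Mn (nonspacing mark)
print(unicodedata.combining(CARON_MODIFIER))   # 0 (does not combine)
print(unicodedata.combining(CARON_COMBINING))  # 230 (combines above)

# Hypothetical lookup table: dead-key modifier letter -> combining mark.
DEAD_KEY_TO_MARK = {CARON_MODIFIER: CARON_COMBINING}


def apply_dead_key(dead_key: str, base: str) -> str:
    """Imitate the input method: compose the base letter with the mark."""
    mark = DEAD_KEY_TO_MARK[dead_key]
    return unicodedata.normalize("NFC", base + mark)


typed = "h" + apply_dead_key(CARON_MODIFIER, "a") + "o"
print(" ".join(f"{ord(ch):04X}" for ch in typed))  # 0068 01CE 006F

# The raw modifier letter is left alone by normalization.
raw = "h\u02c7ao"
print(unicodedata.normalize("NFC", raw) == raw)  # True
```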
To check the code points, I used this handy tool:
people.w3.org/rishida/scripts/uniview/conversion.php
- open the OS X character palette
- go to the URL above
- place the cursor in the upper left box labeled Characters
- type the letter h into the box
- type the letter a into the box
- from the character palette, insert character 030C into the box
- type the letter o into the box
- click the convert button just above the Characters box; the UTF-16 Code units box will have the sequence (in Unicode code points) 0068 0061 030C 006F
- select and copy (cmd+c) the contents of the Characters box
- immediately paste the contents back into the Characters box
- click the convert button just above the Characters box; the UTF-16 Code units box now has the sequence 0068 01CE 006F
Aha! The copy and paste operation changed the string’s character code points! Imagine my surprise.
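The before and after sequences match what Unicode's NFC (canonical composition) normalization produces. Whether OS X literally runs NFC during copy and paste is my assumption, but the effect on this string can be reproduced in Python:

```python
import unicodedata


def hex_points(text: str) -> str:
    """Return the code points of text as space-separated 4-digit hex."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


before_paste = "ha\u030co"  # what was typed into the Characters box
after_paste = unicodedata.normalize("NFC", before_paste)

print(hex_points(before_paste))  # 0068 0061 030C 006F
print(hex_points(after_paste))   # 0068 01CE 006F

# NFD goes the other way and splits the letter from its mark again.
round_trip = unicodedata.normalize("NFD", after_paste)
print(round_trip == before_paste)  # True
```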
That mystery solved, I next dove into the lostlaowai source code. This was my first encounter with character entity encoding of combining diacritical marks. Rather than type the characters directly into the source code, like this
hǎo
lostlaowai encoded the non-ASCII characters like this
hǎo
even though the page encoding was declared as UTF-8
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
Maybe it’s a Joomla thing. lostlaowai uses Joomla.
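For anyone who has not met this kind of encoding, here is a sketch of how numeric character references relate to the characters they stand for. Python can produce them with the `xmlcharrefreplace` error handler, which emits decimal references; I don't know whether the lostlaowai markup used the decimal or the hexadecimal form, so treat the exact spelling as an assumption.

```python
import html


def to_numeric_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


combining = "ha\u030co"    # a followed by COMBINING CARON
precomposed = "h\u01ceo"   # LATIN SMALL LETTER A WITH CARON

print(to_numeric_refs(combining))    # ha&#780;o
print(to_numeric_refs(precomposed))  # h&#462;o

# A browser decodes the references back into the same characters,
# so the encoding by itself does not change what the font has to render.
print(html.unescape("ha&#780;o") == combining)   # True
print(html.unescape("ha&#x30C;o") == combining)  # True (hexadecimal form)
```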
After a quick bout of deleting blocks of source code, I isolated the culprit!
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<title>wonderful.html</title>
</head>
<body>
<pre>
好極了!
1. hǎo jíle!
2. hǎo jíle!
<span style="font-family: Verdana, Arial, Helvetica, sans-serif;">
3. hǎo jíle!
4. hǎo jíle!
</span><span style="font-family: Arial, Helvetica, sans-serif;">
5. hǎo jíle!
6. hǎo jíle!
</span></pre>
</body>
</html>
Source code: wonderful.html
Adding Verdana to the font family causes the problem. I searched to see if anyone else had seen this problem. Indeed. Wikipedia.org has an entry on a similar bug and fileformat.info lists the five marks supported by Verdana. That’s sad. Verdana only supports 5 of the 112 code points in Unicode’s Combining Diacritical Marks block.
The Verdana typeface, released in 1996, was created for and is owned by Microsoft. If Microsoft hasn’t fixed Verdana after more than a decade, I’ll assume they never will, and prudence suggests avoiding it.
At least avoid using Verdana when writing pinyin with combining diacritical marks. If you must use Verdana, then use code points from Unicode’s Latin blocks. On the Mac, this is the default when typing these characters in directly using the U.S. Extended keyboard.
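The 112 figure is just the size of the block, U+0300 through U+036F. A quick way to find out whether a piece of text depends on that block is to scan it for code points in the range. This is a minimal sketch; `combining_marks_in` is a made-up helper, and it makes no attempt to say which marks a given font supports.

```python
BLOCK_START = 0x0300  # first code point of Combining Diacritical Marks
BLOCK_END = 0x036F    # last code point of the block

print(BLOCK_END - BLOCK_START + 1)  # 112


def combining_marks_in(text: str) -> list[str]:
    """List the distinct combining marks (as hex) that text relies on."""
    found = {f"{ord(ch):04X}" for ch in text if BLOCK_START <= ord(ch) <= BLOCK_END}
    return sorted(found)


decomposed = "ha\u030co ji\u0301le!"   # marks follow the a and the i
precomposed = "h\u01ceo j\u00edle!"    # single code points, no marks

print(combining_marks_in(decomposed))   # ['0301', '030C']
print(combining_marks_in(precomposed))  # []
```

An empty list means the text never asks the font for a combining mark, which is the situation you want when Verdana is in the font family.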
Character  ā     á     ǎ     à
Unicode    0101  00E1  01CE  00E0
------------------------------
Character  ē     é     ě     è
Unicode    0113  00E9  011B  00E8
------------------------------
Character  ī     í     ǐ     ì
Unicode    012B  00ED  01D0  00EC
------------------------------
Character  ō     ó     ǒ     ò
Unicode    014D  00F3  01D2  00F2
------------------------------
Character  ū     ú     ǔ     ù
Unicode    016B  00FA  01D4  00F9
------------------------------
Character  ǖ     ǘ     ǚ     ǜ
Unicode    01D6  01D8  01DA  01DC
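A chart like this is easy to cross-check by composing each vowel with each tone mark and printing the resulting code point. The sketch below leans on NFC normalization from Python's `unicodedata` module; the names `VOWELS`, `TONE_MARKS` and `tone_row` are my own.

```python
import unicodedata

VOWELS = ["a", "e", "i", "o", "u", "\u00fc"]  # the last one is u with diaeresis
# Combining macron, acute, caron, grave: tones 1 through 4.
TONE_MARKS = ["\u0304", "\u0301", "\u030c", "\u0300"]


def tone_row(vowel: str) -> list[str]:
    """Return the code points (hex) of the four precomposed toned vowels."""
    row = []
    for mark in TONE_MARKS:
        composed = unicodedata.normalize("NFC", vowel + mark)
        row.append(" ".join(f"{ord(ch):04X}" for ch in composed))
    return row


for vowel in VOWELS:
    print(" ".join(tone_row(vowel)))
# 0101 00E1 01CE 00E0
# 0113 00E9 011B 00E8
# 012B 00ED 01D0 00EC
# 014D 00F3 01D2 00F2
# 016B 00FA 01D4 00F9
# 01D6 01D8 01DA 01DC
```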
If you have to convert an existing web page (like the lostlaowai page mentioned above), you could take advantage of the copy+paste quirk in OS X. Simply open the web page, copy the pinyin and paste it into a text editor (e.g., back into the source). The original text is not rendered properly but that’s OK. The character codes are correct. When you paste it into the editor, OS X will convert the char+mark into a single char from the Latin code block.
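If you are not on a Mac, or have many pages to fix, the same conversion can be scripted. The following is a sketch under my own assumptions: that the page stores the marks as numeric character references, and that NFC normalization is an acceptable stand-in for the copy+paste step. The function names are hypothetical.

```python
import re
import unicodedata

# Matches decimal (&#780;) and hexadecimal (&#x30C;) character references.
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def decode_non_ascii_refs(markup: str) -> str:
    """Turn numeric references for non-ASCII code points into characters."""
    def replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Leave ASCII references (such as &#60; for "<") alone so the
        # markup stays intact; also skip values outside the Unicode range.
        if 0x80 <= code_point <= 0x10FFFF:
            return chr(code_point)
        return match.group(0)
    return NUMERIC_REF.sub(replace, markup)


def precompose_pinyin(markup: str) -> str:
    """Decode the references, then merge letter + mark pairs via NFC."""
    return unicodedata.normalize("NFC", decode_non_ascii_refs(markup))


source = "<p>ha&#780;o ji&#x301;le! &#60;ok&#62;</p>"
converted = precompose_pinyin(source)
print(converted == "<p>h\u01ceo j\u00edle! &#60;ok&#62;</p>")  # True
```

Note that NFC touches every composable sequence in the page, not only pinyin, so it is worth diffing the result before publishing it.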
Finally, the character ‘a’ in pinyin is sometimes written using the Unicode code point 0251 ‘ɑ’, which is still a Latin letter but lives in the block called “IPA Extensions”. It has a different look from the standard ASCII character ‘a’. There is no set of precomposed code points based on ‘ɑ’ that could replace the accented characters in the chart above, so a tone mark on ‘ɑ’ has to be written with a combining mark.
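You can confirm that the alpha form has no precomposed counterpart by asking for canonical composition and checking whether the mark gets absorbed. A small sketch using Python's `unicodedata`:

```python
import unicodedata

CARON = "\u030c"  # combining caron

ascii_a = unicodedata.normalize("NFC", "a" + CARON)
ipa_alpha = unicodedata.normalize("NFC", "\u0251" + CARON)

print(unicodedata.name("\u0251"))  # LATIN SMALL LETTER ALPHA
print(len(ascii_a))    # 1: composed into U+01CE
print(len(ipa_alpha))  # 2: the letter and the mark stay separate
```

So text that uses ‘ɑ’ with tone marks will always depend on the font's support for combining diacritical marks.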
1 Comment
Wow Kelly, I have never met anyone so amazingly versed in digital character… umm… stuff. Well done!
Thanks for the note about the pinyin. I’ve fixed it all. Couldn’t bring myself to nix Verdana (as I like how it renders), but I simply replaced the encoded stuff with the appropriate characters and it seems to be displaying on a Mac now.
I *believe* (and this is a test of the memory) the reason they were encoded in the first place (as nothing else on the site is) was a character to pinyin tool I used to expedite the process back 3 years ago or so when I initially set up the site. The text has just been carried over in various formats since then.
Thanks again!
Ryan
2009.08.11 03:47