Unicode Normalization and NSString

Although I use Korean every day, I never had much of a chance to figure out how Unicode is actually designed.

Mac OS X supports Unicode, and when people say “Unicode” in that context, they usually mean UTF-16, an encoding that stores each character as one or two 16-bit (2-byte) code units.
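As a small illustration of the UTF-16 code-unit count, here is a sketch in Python (used only as a convenient stand-in; it is not what Cocoa runs internally). The helper name utf16_units is made up for this example.

```python
# Count how many 16-bit code units UTF-16 needs for a piece of text.
# "utf-16-le" is used so that no byte-order mark is added to the output.
def utf16_units(text: str) -> int:
    return len(text.encode("utf-16-le")) // 2

print(utf16_units("A"))           # Latin letter: one code unit
print(utf16_units("\uC790"))      # Hangul syllable 자: one code unit
print(utf16_units("\U0001F600"))  # character outside the BMP: two code units
```

Characters outside the Basic Multilingual Plane take a surrogate pair, which is why "2 bytes per character" only holds for the common case.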

One day I noticed something strange. An NSString made with “자연” looked different from the same text read from a Final Cut Pro (FCP) project file.

I wondered why, so I asked on Apple’s Cocoa mailing list.

I got an answer from Ken Thomas, who suggested checking Unicode normalization.

Here is the link:

Unicode Standard Annex #15
Unicode Normalization Forms

There are four normalization forms for Unicode text.

  1. Normalization Form D (NFD)
  2. Normalization Form C (NFC)
  3. Normalization Form KD (NFKD)
  4. Normalization Form KC (NFKC)

Form D means Canonical “D”ecomposition, while Form C means Canonical “C”omposition.
The K in KD and KC stands for “Compatibility”.
The annex linked above explains the forms very well with pictures, so take a look at it.
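To make canonical composition and decomposition concrete, here is a minimal sketch using Python's standard unicodedata module as a stand-in for the NSString methods. The helper code_points is made up for this example, and the text is written with escapes so that the starting form is unambiguous.

```python
import unicodedata

# "자연" written as two precomposed Hangul syllables (U+C790, U+C5F0).
text = "\uC790\uC5F0"

nfc = unicodedata.normalize("NFC", text)  # canonical composition
nfd = unicodedata.normalize("NFD", text)  # canonical decomposition

def code_points(s):
    """Return the code points of s as a list of 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in s]

print(code_points(nfc))  # two syllables
print(code_points(nfd))  # five conjoining jamo
print(nfc == nfd)        # False: same text on screen, different code points
```

In Form C the word is two code points; in Form D each syllable is split into its leading consonant, vowel, and optional trailing consonant, giving five.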

With that background, you can see what the -precomposed.. and -decomposed.. methods in the NSString documentation mean:

  1. -precomposedStringWithCanonicalMapping : Form C
  2. -precomposedStringWithCompatibilityMapping : Form KC
  3. -decomposedStringWithCanonicalMapping : Form D
  4. -decomposedStringWithCompatibilityMapping : Form KD
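The difference between the canonical and compatibility variants shows up on characters such as ligatures. The sketch below again uses Python's unicodedata rather than Cocoa; the table FORM_FOR_METHOD and the function apply_method are hypothetical helpers that simply mirror the list above.

```python
import unicodedata

# Hypothetical lookup table mirroring the list above:
# NSString method name -> Unicode normalization form.
FORM_FOR_METHOD = {
    "precomposedStringWithCanonicalMapping": "NFC",
    "precomposedStringWithCompatibilityMapping": "NFKC",
    "decomposedStringWithCanonicalMapping": "NFD",
    "decomposedStringWithCompatibilityMapping": "NFKD",
}

def apply_method(method_name, text):
    """Normalize text with the form that the named NSString method corresponds to."""
    return unicodedata.normalize(FORM_FOR_METHOD[method_name], text)

ligature = "\uFB01"  # LATIN SMALL LIGATURE FI

for name in FORM_FOR_METHOD:
    result = apply_method(name, ligature)
    print(name, [f"U+{ord(ch):04X}" for ch in result])
```

The canonical forms leave the ligature alone, while the compatibility forms replace it with the two plain letters "f" and "i".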

There are two more issues I would like to mention.
First, we know that Mac OS X uses Unicode natively. But which normalization form does it use?

The NSString method - (const char *)fileSystemRepresentation and the NSFileManager method - (const char *)fileSystemRepresentationWithPath:(NSString *)path return the file name in the Unicode form that Mac OS X uses for the file system.

It is said to be a mostly decomposed form. So when you pick a file with NSOpenPanel, the path string you get back will typically be in decomposed form.
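To see why this matters at the byte level, the sketch below compares the UTF-8 bytes of a composed and a decomposed file name. It uses plain NFD from Python's unicodedata as an approximation; the file system's own form is only "mostly" decomposed, so this is an assumption, and the file name 자연.mov is invented for the example.

```python
import unicodedata

# Hypothetical file name: "자연.mov", written with precomposed syllables.
name = "\uC790\uC5F0.mov"

composed_bytes = unicodedata.normalize("NFC", name).encode("utf-8")
decomposed_bytes = unicodedata.normalize("NFD", name).encode("utf-8")

print(len(composed_bytes))                 # 2 syllables * 3 bytes + 4 ASCII bytes
print(len(decomposed_bytes))               # 5 jamo * 3 bytes + 4 ASCII bytes
print(composed_bytes == decomposed_bytes)  # False
```

Any code that compares paths byte by byte, for instance through a C API, would treat these two names as different even though they display identically.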

Second, how do we compare two Unicode strings, then? I expected to have to handle the two cases myself, but NSString is smart enough to handle them for us.

NSString’s compare: methods accept strings in any form and can compare a composed string with a decomposed one. If the two represent the same characters or words, compare: returns NSOrderedSame, meaning that they are equal.
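The idea of a form-independent comparison can be sketched as follows. This is not a description of how NSString implements compare:; it is one simple way to get the same result, written in Python with the made-up helper canonically_equal.

```python
import unicodedata

def canonically_equal(a, b):
    """Compare two strings after bringing both to the same canonical form (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

composed = "\uC790\uC5F0"                             # "자연" as two syllables
decomposed = unicodedata.normalize("NFD", composed)   # "자연" as five jamo

print(composed == decomposed)                   # False: code points differ
print(canonically_equal(composed, decomposed))  # True: same text
```

Normalizing both sides to NFC instead would work just as well; what matters is that both strings end up in the same form before the comparison.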

Impressive, isn’t it?

5 responses to this post.

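  1. So you are saying that NSString’s compare: method checks whether two strings are identical regardless of their normalization form? If so, that would suggest it converts everything to NFD internally before comparing.
    However, from what I saw recently while implementing Hanja input for 하늘입력기 (the Haneul input method), a string search through NSPredicate does not find a match when the normalization forms differ. That is probably a problem with sqlite rather than with NSPredicate, though.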


  2. Posted by jongampark on July 21, 2008 at 8:25 PM

    When I tested, one string was a video clip name from a Final Cut Pro project, which held “자연” in decomposed form, and the other was the same text in composed form. So I assumed that they were converted to a common form and then compared.

    Could there be something special about NSPredicate? Did you use strings returned by a query? Doesn’t SQLite return strings as they are stored?
    I’m curious. Can you test it and post the result?

    By the way, I tested on OS X 10.5, so things may be somewhat different on 10.4. I have noticed some differences between the two.


  3. I built the NSPredicate in order to send sqlite a query like the following.

    select * from Expansion where abbrev = '한국'

    The NSPredicate was created like this:

    [NSPredicate predicateWithFormat:@"abbrev == %@", aKey];

    The point is that the search fails when the ‘한국’ in aKey is NFD and the ‘한국’ stored in the Expansion table is NFC.
    I suspect it has nothing to do with NSString or NSPredicate and is a problem in sqlite’s internal query engine.


  4. Posted by jongampark on July 24, 2008 at 6:45 PM

    Well, so do you mean that it successfully finds the data if aKey contains NFC “한국” and the value in the table is also NFC?
    If so, I don’t think it is an sqlite problem. Did you check how the data is stored in the sqlite DB?

    By the way, do you use the Core Data?


  5. Ah, great tip. Thanks! I was trying to interface my Cocoa application with Ruby to be able to decompose Unicode strings. I never saw these methods in NSString before; they are normally called something like normalize_D. This saves me a lot of trouble :-)
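The lookup failure discussed in comments 1 to 4 can be reproduced outside Cocoa. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the table and column names follow comment 3, while the helper lookup and the choice to normalize the key to NFC are assumptions made for illustration. With sqlite's default BINARY collation, = compares text byte by byte, so an NFD key does not match an NFC row.

```python
import sqlite3
import unicodedata

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Expansion (abbrev TEXT)")

# Store "한국" in composed form (NFC), as in the scenario from comment 3.
stored = "\uD55C\uAD6D"
conn.execute("INSERT INTO Expansion (abbrev) VALUES (?)", (stored,))

def lookup(key):
    """Return rows whose abbrev equals key under sqlite's default comparison."""
    return conn.execute(
        "SELECT abbrev FROM Expansion WHERE abbrev = ?", (key,)
    ).fetchall()

nfd_key = unicodedata.normalize("NFD", stored)

print(len(lookup(nfd_key)))                                # 0: bytes differ
print(len(lookup(unicodedata.normalize("NFC", nfd_key))))  # 1: forms now match
```

One way around the mismatch is to pick a single form, normalize every value before storing it, and normalize every key the same way before querying.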

