ICU Text Transforms in Foundation

Updates:

  1. Jan 31, 2016
    Added an example for the Publishing rule.
  2. Feb 1, 2016
    Added a sample code snippet to illustrate how you would use the NSString API with a custom transform rule.
  3. Dec 13, 2016
    Updated the code to Swift 3.

ICU string transforms are cool. The ICU libraries provide a bunch of powerful text transformations that are very useful for processing user input, especially if your code needs to handle languages other than English and scripts other than Latin. For instance, you could transliterate a text written in Simplified Chinese to Latin characters, strip accents and other diacritical marks, delete invisible characters, and convert the input to lowercase before you feed the normalized string into your database’s search API, all in a single line of code.

On Apple platforms, string transforms have been exposed through the CFStringTransform function in Core Foundation for a long time. Read Mattt Thompson’s excellent overview on NSHipster for more on this API.

With iOS 9 and OS X 10.11, string transforms have made the jump to the Foundation framework and are also exposed on Swift’s String type. Documentation for the new applyingTransform(_:reverse:) method is still sparse, but the docs for CFStringTransform tell you what you need to know, and Nate Cook shows a few examples in this NSHipster article. Here’s how you would do the conversion from Chinese to Latin:

import Foundation
let shanghai = "上海"
shanghai.applyingTransform(.toLatin, reverse: false)
// → "shàng hǎi"

So far, so good. Apple currently provides constants for 16 possible transforms. Most of these refer to script transliterations, and then there are some others that let you strip combining marks and diacritics from the input, or convert characters to their code point numbers or official Unicode names. Additionally, most transforms can be reversed via the second argument to applyingTransform. This is already very powerful, especially when you chain multiple transformations. For example, here we transliterate the Chinese text first, then strip combining and diacritical marks:

shanghai.applyingTransform(.toLatin, reverse: false)?
    .applyingTransform(.stripCombiningMarks, reverse: false)?
    .applyingTransform(.stripDiacritics, reverse: false)
// → "shang hai"
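To make the reverse argument a little more concrete, here is a minimal sketch that converts a character to its Unicode name and back again. It only uses the built-in .toUnicodeName constant. The variable names are mine, and I have left out the expected output on purpose, so run it to see what the intermediate string looks like.

```swift
import Foundation

// "é" written as an explicit escape so the input is unambiguous.
let original = "\u{E9}"

// Forward direction: character → Unicode name.
let named = original.applyingTransform(.toUnicodeName, reverse: false)

// Reverse direction: the same transform constant, with reverse set to true.
let restored = named?.applyingTransform(.toUnicodeName, reverse: true)

print(named ?? "transform failed")
print(restored ?? "transform failed")
```

Both calls return an optional, so the second step is chained with `?.` just like in the example above.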

Freeform Transforms

What I never realized, although it is mentioned both in the CFStringTransform documentation and in the NSHipster article, is that you can even go a step further. ICU defines its own syntax for specifying a transform, and if you pass a string conforming to this syntax to applyingTransform or CFStringTransform, it just works. Like this:

// Convert non-ASCII characters to ASCII,
// convert to lowercase, delete spaces
let toLowercaseASCIINoSpaces =
    StringTransform(rawValue: "Latin-ASCII; Lower; [:Separator:] Remove;")
"Café au lait".applyingTransform(toLowercaseASCIINoSpaces, reverse: false)
// → "cafeaulait"

The CFStringTransform API takes a plain string. With Swift 3, the loosely typed string constants were converted to the dedicated StringTransform type as part of SE-0033. As a consequence, you have to use the StringTransform.init(rawValue:) initializer for your custom transform rules. If you want to use a custom transform in many places, consider defining it as a static constant in an extension of StringTransform:

extension StringTransform {
    static let toLowercaseASCIINoSpaces =
        StringTransform(rawValue: "Latin-ASCII; Lower; [:Separator:] Remove;")
}

"Café au lait".applyingTransform(.toLowercaseASCIINoSpaces, reverse: false)
// → "cafeaulait"

The documentation for this in the ICU User Guide is very good and includes lots of examples. I encourage you to check it out. Here are some of my own examples:

Convert to lowercase.

| Input | Transform | Result |
| --- | --- | --- |
| HELLO WORLD | Lower | hello world |

Convert only vowels to lowercase. The square brackets specify a filter. The rule following the filter is only applied to characters that match the filter.

| Input | Transform | Result |
| --- | --- | --- |
| HELLO WORLD | [AEIOU] Lower | HeLLo WoRLD |

Convert to Latin, then to ASCII, then to lowercase. Separate multiple rules with semicolons. The Latin-ASCII step removes diacritical marks and will also try to convert symbols and punctuation from outside the ASCII range to their nearest ASCII equivalent.

| Input | Transform | Result |
| --- | --- | --- |
| 上海 | Any-Latin; Latin-ASCII; Lower | shang hai |
| København | Any-Latin; Latin-ASCII; Lower | kobenhavn |
| กรุงเทพมหานคร | Any-Latin; Latin-ASCII; Lower | krungthephmhankhr |
| Αθήνα | Any-Latin; Latin-ASCII; Lower | athena |
| “Æ « © 1984” | Any-Latin; Latin-ASCII; Lower | "ae << (c) 1984" |

Delete punctuation. The Remove rule can be very powerful. The filter (in brackets) can consist either of a set of characters the rule should apply to (see above) or, as in this case, of a named Unicode character category.

| Input | Transform | Result |
| --- | --- | --- |
| “Make it so,” said Picard. | [:Punctuation:] Remove | Make it so said Picard |

Delete everything that is not a letter. Use a caret ^ to negate a filter.

| Input | Transform | Result |
| --- | --- | --- |
| 5 plus 6 equals 11 👍! | [:^Letter:] Remove | plusequals |
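The three filter styles shown so far (an explicit character set, a named category, and a negated category) can be tried from Swift with a small helper. This is only a sketch: applyRule is a hypothetical function of my own, not part of Foundation, and it falls back to the unchanged input if the transform returns nil. The results in the comments are the ones listed in the tables above.

```swift
import Foundation

// Hypothetical helper (not a Foundation API): applies an ICU rule string
// and returns the input unchanged if the transform fails.
func applyRule(_ rule: String, to input: String) -> String {
    let transform = StringTransform(rawValue: rule)
    return input.applyingTransform(transform, reverse: false) ?? input
}

// Filter given as an explicit set of characters.
print(applyRule("[AEIOU] Lower", to: "HELLO WORLD"))
// → "HeLLo WoRLD"

// Filter given as a named Unicode category.
print(applyRule("[:Punctuation:] Remove", to: "“Make it so,” said Picard."))
// → "Make it so said Picard"

// Negated filter: remove everything that is not a letter.
print(applyRule("[:^Letter:] Remove", to: "5 plus 6 equals 11 👍!"))
// → "plusequals"
```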

Convert to typographical punctuation. The Publishing rule converts straight punctuation marks into their typographical equivalents.

| Input | Transform | Result |
| --- | --- | --- |
| "How's it going?" | Publishing | “How’s it going?” |

Convert to hex representation. Several formats are supported. The default format is Java. Note that the Java format outputs UTF-16 code units (the emoji is encoded as a surrogate pair) while the other formats output code points.

| Input | Transform | Result |
| --- | --- | --- |
| 😃! | Hex | \uD83D\uDE03\u0021 |
| 😃! | Hex/Java | \uD83D\uDE03\u0021 |
| 😃! | Hex/Unicode | U+1F603U+0021 |
| 😃! | Hex/Perl | \x{1F603}\x{21} |
| 😃! | Hex/XML | &#x1F603;&#x21; |
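The difference between code units and code points is easy to see if you build the same strings by hand from Swift's utf16 and unicodeScalars views. The following is a rough sketch for illustration, not how ICU implements the Hex rules. The variable names are my own.

```swift
import Foundation

let input = "\u{1F603}!"   // 😃!

// UTF-16 code units, formatted like the Hex/Java output.
// The emoji lies outside the Basic Multilingual Plane, so it takes two units.
let codeUnits = input.utf16
    .map { String(format: "\\u%04X", $0) }
    .joined()

// Unicode code points, formatted like the Hex/Unicode output.
let codePoints = input.unicodeScalars
    .map { String(format: "U+%04X", $0.value) }
    .joined()

print(codeUnits)   // \uD83D\uDE03\u0021
print(codePoints)  // U+1F603U+0021

// The same conversion through the ICU rule, for comparison.
let viaICU = input.applyingTransform(StringTransform(rawValue: "Hex/Unicode"),
                                     reverse: false)
print(viaICU ?? "transform failed")
```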

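Normalize to one of the Unicode normalization forms. The compatibility forms (NFKD and NFKC) also replace characters such as the superscript ⁸ with their plain equivalents.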

| Input | Transform | Result |
| --- | --- | --- |
| é | NFD; Hex/Unicode | U+0065U+0301 |
| é | NFC; Hex/Unicode | U+00E9 |
| 2⁸ | NFKD | 28 |
| 2⁸ | NFKC | 28 |
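One thing to watch out for when you experiment with the normalization rules in Swift: String equality is based on canonical equivalence, so the NFD and NFC versions of a string compare as equal. To see the difference you have to look at the Unicode scalars. Here is a small sketch that does this. The codePoints function is a hypothetical helper of mine, and decomposedStringWithCanonicalMapping is Foundation's own NFD conversion, shown for comparison.

```swift
import Foundation

// Hypothetical helper: lists the code points of a string in U+XXXX notation.
func codePoints(_ string: String) -> [String] {
    return string.unicodeScalars.map { String(format: "U+%04X", $0.value) }
}

let precomposed = "\u{E9}"   // é as a single code point

// Decompose via the ICU rule, then compose again.
let nfd = precomposed.applyingTransform(StringTransform(rawValue: "NFD"), reverse: false)
let nfc = nfd?.applyingTransform(StringTransform(rawValue: "NFC"), reverse: false)

print(nfd.map(codePoints) ?? [])
print(nfc.map(codePoints) ?? [])

// Foundation's built-in NFD conversion.
print(codePoints(precomposed.decomposedStringWithCanonicalMapping))
// → ["U+0065", "U+0301"]
```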

Imagine having to write all of this yourself.

I learned about freeform transform rules from Florian and Daniel’s Core Data book. They explain how you can use string transforms to normalize the search terms a user has entered before feeding them to the database. This can vastly improve search performance and yield better search results.
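As a rough sketch of that idea (my own illustration, not the code from the book), you could wrap one of the rule chains from above in a String extension and run both the stored values and the user's query through it. The name normalizedForSearch is hypothetical, and returning the original string when the transform fails is just one possible choice.

```swift
import Foundation

extension String {
    // Hypothetical helper: transliterate to Latin, reduce to ASCII, lowercase.
    // Falls back to the unmodified string if the transform returns nil.
    func normalizedForSearch() -> String {
        let rule = StringTransform(rawValue: "Any-Latin; Latin-ASCII; Lower")
        return applyingTransform(rule, reverse: false) ?? self
    }
}

// Normalize the value once when you store it ...
let storedName = "København".normalizedForSearch()

// ... and normalize the search term the same way before querying.
let query = "KOBENHAVN".normalizedForSearch()

print(storedName)           // "kobenhavn", as in the table above
print(storedName == query)
```

Storing the normalized form in a separate field next to the original means the database can do a plain string comparison at query time.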