AttributedString’s Codable format and what it has to do with Unicode

Here’s a simple AttributedString with some formatting:

import Foundation

let str = try! AttributedString(
  markdown: "Café **Sol**",
  options: .init(interpretedSyntax: .inlineOnly)
)

AttributedString is Codable. If your task was to design the encoding format for an attributed string, what would you come up with? Something like this seems reasonable (in JSON with comments):

{
  "text": "Café Sol",
  "runs": [
    {
      // start..<end in Character offsets
      "range": [5, 8],
      "attrs": {
        "strong": true
      }
    }
  ]
}

This stores the text alongside an array of runs of formatting attributes. Each run consists of a character range and an attribute dictionary.

Unicode is complicated

But this format is bad and can break in various ways. The problem is that the character offsets that define the runs aren’t guaranteed to be stable. The definition of what constitutes a Character, i.e. a user-perceived character, or a Unicode grapheme cluster, can and does change in new Unicode versions. If we decoded an attributed string that had been serialized

  • on a different OS version (before Swift 5.6, Swift used the OS’s Unicode library for determining character boundaries),
  • or by code compiled with a different Swift version (since Swift 5.6, Swift uses its own grapheme breaking algorithm that will be updated alongside the Unicode standard)1, the character ranges might no longer represent the original intent, or even become invalid.

Normalization forms

So let’s use UTF-8 byte offsets for the ranges, I hear you say. This avoids the first issue but still isn’t safe, because some characters, such as the é in the example string, have more than one representation in Unicode: it can be either the standalone character é (Latin small letter e with acute) or the combination of e + ◌́ (Combining acute accent). The Unicode standard calls these variants normalization forms.2 The first form needs 2 bytes in UTF-8, whereas the second uses 3 bytes, so subsequent ranges would be off by one if the string and the ranges used different normalization forms.

Now in theory, the string itself and the ranges should use the same normalization form upon serialization, avoiding the problem. But this is almost impossible to guarantee if the serialized data passes through other systems that may (inadvertently or not) change the Unicode normalization of the strings that pass through them.

A safer option would be to store the text not as a string but as a blob of UTF-8 bytes, because serialization/networking/storage layers generally don’t mess with binary data. But even then you’d have to be careful in the encoding and decoding code to apply the formatting attributes before any normalization takes place. Depending on how your programming language handles Unicode, this may not be so easy.

Foundation’s solution

The people on the Foundation team know all this, of course, and chose a better encoding format for Attributed String. Let’s take a look.3

let encoder = JSONEncoder()
encoder.outputFormatting = [.prettyPrinted, .sortedKeys]
let jsonData = try encoder.encode(str)
let json = String(decoding: jsonData, as: UTF8.self)

This is how our sample string is encoded:

[
  "Café ",
  {

  },
  "Sol",
  {
    "NSInlinePresentationIntent" : 2
  }
]

This is an array of runs, where each run consists of a text segment and a dictionary of formatting attributes. The important point is that the formatting attributes are directly associated with the text segments they belong to, not indirectly via brittle byte or character offsets. (This encoding format is also more space-efficient and possibly better represents the in-memory layout of AttributedString, but that’s beside the point for this discussion.)

There’s still a (smaller) potential problem here if the character boundary rules change for code points that span two adjacent text segments: the last character of run N and the first character of run N+1 might suddenly form a single character (grapheme cluster) in a new Unicode version. In that case, the decoding code will have to decide which formatting attributes to apply to this new character. But this is a much smaller issue because it only affects the characters in question. Unlike our original example, where an off-by-one error in run N would affect all subsequent runs, all other runs are untouched.

Related forum discussion: Itai Ferber on why Character isn’t Codable.

Storing string offsets is a bad idea

We can extract a general lesson out of this: Don’t store string indices or offsets if possible. They aren’t stable over time or across runtime environments.

  1. On Apple platforms, the Swift standard library ships as part of the OS so I’d guess that the standard library’s grapheme breaking algorithm will be based on the same Unicode version that ships with the corresponding OS version. This is effectively no change in behavior compared to the pre-Swift 5.6 world (where the OS’s ICU library determined the Unicode version).

    On non-ABI-stable platforms (e.g. Linux and Windows), the Unicode version used by your program is determined by the version of the Swift compiler your program is compiled with, if my understanding is correct. ↩︎

  2. The Swift standard library doesn’t have APIs for Unicode normalization yet, but you can use the corresponding NSString APIs, which are automatically added to String when you import Foundation:

    import Foundation
    
    let precomposed = "é".precomposedStringWithCanonicalMapping
    let decomposed  = "é".decomposedStringWithCanonicalMapping
    precomposed == decomposed // → true
    precomposed.unicodeScalars.count // → 1
    decomposed.unicodeScalars.count  // → 2
    precomposed.utf8.count // → 2
    decomposed.utf8.count  // → 3
    

    ↩︎

  3. By the way, I see a lot of code using String(jsonData, encoding: .utf8)! to create a string from UTF-8 data. String(decoding: jsonData, as: UTF8.self) saves you a force-unwrap and is arguably “cleaner” because it doesn’t depend on Foundation. Since it never fails, it’ll insert replacement characters into the string if it encounters invalid byte sequences. ↩︎