How OrderedSet works

Updates:

  1. May 1, 2021
    Clarified that existing array indices in the hash table must be adjusted on insertion.

The new Swift Collections library provides efficient Swift implementations for useful data structures, starting with Deque, OrderedSet, and OrderedDictionary in the initial release.

The documentation for OrderedSet says this about the implementation:

An OrderedSet stores its members in a regular Array value (exposed by the elements property). It also maintains a standalone hash table containing array indices alongside the array; this is used to implement fast membership tests.

I didn’t understand how this standalone hash table can get away with only storing array indices — wouldn’t it also need to store the corresponding elements? So I checked out the source code to find out.

I imagined a pseudo implementation of OrderedSet that uses an array (for the ordered elements) and a dictionary that maps elements to array indices (for fast lookups), like this:

struct OrderedSet<Element: Hashable> {
  private var elements: [Element]
  /// For fast lookups. Maps from Element to Array.Index.
  private var hashTable: [Element: Int]
}

This is conceptually correct, but storing every element twice is obviously not memory-efficient. So how does OrderedSet do it? My error was to equate “hash table” with “set or dictionary” in my head, but it turns out there are other kinds of hash tables.

OrderedSet implements a custom hash table that stores one set of values (the array indices), but uses entirely different values (the set’s elements) for hashing and equality checks.

For testing if an item is in the set, OrderedSet performs these steps:

  1. Compute the item’s hash value.
  2. Find the bucket for this hash value in the hash table.
  3. If the bucket is empty, the item is not in the set ➡ return false.
  4. Otherwise, use the array index in the bucket to access the corresponding element in the elements array.
  5. Compare the item to be tested with the stored element.
  6. If they’re equal, we found the item ➡ return true.
  7. Otherwise, advance to the next bucket and go back to step 3.

For inserting an item at a specific index (appending an item is a special case of this):

  1. Check if the item is already in the set (see above). If yes, we’re done because set elements must be unique. If no, we now have an empty bucket in the hash table to store the item’s array index in.
  2. Store the array index in the empty bucket we found in step 1.
  3. Adjust the existing array indices in the hash table to account for the inserted item. Conceptually, we need to increment every array index that’s larger than the index of the inserted item. The actual implementation is cleverer, but that’s not relevant for this discussion.
  4. Insert the item into the elements array at the specified index.

This strategy saves a ton of memory, especially for large element types.

In addition, the hash table saves memory by only allocating as many bits per bucket as needed for the current capacity. For example, an OrderedSet with a current capacity of 180 elements only needs 8 bits per bucket to store every possible array index. As elements are added, OrderedSet will regenerate the hash table with larger buckets before it becomes too full.

The source code for the custom hash table implementation is in the OrderedCollections/HashTable directory.