In last week’s Swift Talk episode, Florian and Chris wrote a SortedArray
type: an array that keeps its elements sorted according to a given sort predicate at all times. This is great because it encodes one more invariant in the type system — clients that use this type in place of a regular Array
never have to worry about accidentally forgetting to keep the array sorted manually.
Florian and Chris intentionally left some desirable features out in order to keep the video reasonably short. I’d like to show you an implementation here that is more efficient for some operations and has some nice additional features. None of this is particularly difficult to write, but I think it shows quite well how you can approach writing a custom collection type that fits in seamlessly with the rest of the standard library.
You can look at the full code on GitHub, but here are some highlights.
Specialized collection protocols
In addition to the Collection
conformance shown in the video, SortedArray
conforms to RandomAccessCollection
. It makes accessing elements at arbitrary indices much faster, and we need this later to implement an efficient binary search algorithm.
The implementation is straightforward because we can forward everything to the array we use for backing storage. Because we use Int
as our Index
type, we don’t even have to implement the otherwise required index(_:offsetBy:)
and distance(from:to:)
methods — the standard library provides default implementations for Strideable
indices.
SortedArray
can’t conform to either MutableCollection or RangeReplaceableCollection
because their semantics of inserting/replacing elements at specific positions are incompatible with our invariant to keep elements sorted.
ExpressibleByArrayLiteral
SortedArray
does not conform to ExpressibleByArrayLiteral
, i.e. you can’t do this:
let sorted: SortedArray = [3,1,2]
This would be nice to have, but since you can’t pass an explicit sort predicate alongside the array literal, it would only work with elements that are Comparable
, and Swift 3 doesn’t support conditional protocol conformance yet, which is needed to write:
extension SortedArray: ExpressibleByArrayLiteral where Element: Comparable {
...
}
Maybe in Swift 4.
Binary search
One of the big advantages of having a sorted collection is that you can find an element very efficiently using binary search. Binary search finds an element in logarithmic rather than linear time.
To implement this I first wrote a helper method named search(for:)
. You can check out the full code on GitHub; here I’d like to specifically discuss the return type:
fileprivate enum Match<Index: Comparable> {
case found(at: Index)
case notFound(insertAt: Index)
}
extension SortedArray {
/// Searches the array for `newElement` using binary search.
///
/// - Returns: If `newElement` is in the array, returns `.found(at: index)`
/// where `index` is the index of the element in the array.
/// If `newElement` is not in the array, returns `.notFound(insertAt: index)`
/// where `index` is the index where the element should be inserted to
/// preserve the sort order.
/// If the array contains multiple elements that are equal to `newElement`,
/// there is no guarantee which of these is found.
///
/// - Complexity: O(_log(n)_), where _n_ is the size of the array.
fileprivate func search(for newElement: Element) -> Match<Index> {
...
}
}
The standard library’s index(of:)
method returns an Optional<Index>
, which is nil
if no match was found. search(for:)
does something similar, but its return type is a custom enum that carries an element index as payload in both the .found
and the .notFound
case. This allows us to use the same algorithm for searching and inserting: the returned index is where we need to insert a new element to maintain the sort order.
With the algorithm in place, here are the implementations for index(of:)
and contains(_:)
:
extension SortedArray {
/// Returns the first index where the specified value appears
/// in the collection.
///
/// - Complexity: O(_log(n)_), where _n_ is the size of the array.
public func index(of element: Element) -> Index? {
switch search(for: element) {
case let .found(at: index): return index
case .notFound(insertAt: _): return nil
}
}
/// Returns a Boolean value indicating whether the sequence contains
/// the given element.
///
/// - Complexity: O(_log(n)_), where _n_ is the size of the array.
public func contains(_ element: Element) -> Bool {
return index(of: element) != nil
}
}
Note that these are not just more efficient than the default implementations in the standard library — they are also more generic. The standard library’s versions of these methods require a where Iterator.Element: Comparable
constraint that we can omit because SortedArray
always has a sort predicate.
Inserting elements
The next task is to improve the efficiency of inserting elements by taking advantage of binary search. I decided to provide two insert methods: the first inserts a single element at the correct position to maintain the sort order. It uses binary search to find the correct insertion index in O(log n). Inserting the new element into the backing store array takes at worst O(n) time because all subsequent elements have to be moved to make room.
The second method inserts a sequence of new elements. Here I opted to append the new elements to the backing array first and then re-sort it once. This can be much faster than finding the correct insertion index repeatedly (if the inserted sequence is longer than log n elements).
extension SortedArray {
/// Inserts a new element into the array, preserving the sort order.
///
/// - Returns: the index where the new element was inserted.
/// - Complexity: O(_n_) where _n_ is the size of the array. O(_log n_) if the new
/// element can be appended, i.e. if it is ordered last in the resulting array.
@discardableResult
public mutating func insert(_ newElement: Element) -> Index {
let index = insertionIndex(for: newElement)
// This should be O(1) if the element is to be inserted at the end,
// O(_n) in the worst case (inserted at the front).
_elements.insert(newElement, at: index)
return index
}
/// Inserts all elements from `elements` into `self`, preserving the sort order.
///
/// This can be faster than inserting the individual elements one after another because
/// we only need to re-sort once.
///
/// - Complexity: O(_n * log(n)_) where _n_ is the size of the resulting array.
public mutating func insert<S: Sequence>(contentsOf newElements: S) where S.Iterator.Element == Element {
_elements.append(contentsOf: newElements)
_elements.sort(by: areInIncreasingOrder)
}
}
Other efficiency gains
Chris and Florian already showed in the episode that we can provide more efficient variants of min()
and max()
because the minimum and maximum elements in a sorted collection are always the first and last:1
extension SortedArray {
/// Returns the minimum element in the sequence.
///
/// - Complexity: O(1).
@warn_unqualified_access
public func min() -> Element? {
return first
}
/// Returns the maximum element in the sequence.
///
/// - Complexity: O(1).
@warn_unqualified_access
public func max() -> Element? {
return last
}
}
The @warn_unqualified_access
directive tells the compiler to issue a warning when you call one of these methods from inside the type’s implementation (or an extension) without an explicit self.
prefix. This helps avoid confusion between the methods and the free functions min(_:_:)
and max(_:_:)
.
As with index(of:)
and contains(_:)
, our min()
and max()
variants are more generic because they don’t rely on the element type being Comparable
. We get higher performance with fewer constraints.
Only protocol requirements create customization points
Note that none of these four methods are protocol requirements of the Sequence
or Collection
protocols, i.e. they aren’t part of the protocols’ definitions. They are only default implementations without being requirements. As a consequence, calls to these methods are statically dispatched because they aren’t customization points2.
The implementations in SortedArray
don’t override the default implementations (because only requirements can be overridden), they only shadow them. Your code will take advantage of the more efficient implementations when you work directly with a variable of type SortedArray
, but they will never be called in a generic context. Example:
let numbers = SortedArray(unsorted: [3,2,1])
// This will call SortedArray.max(), as expected:
let a = numbers.max()
func myMax<S: Sequence>(_ sequence: S ) -> S.Iterator.Element?
where S.Iterator.Element: Comparable {
return sequence.max()
}
// This will call Sequence.max() (less efficient):
let b = myMax(numbers)
There’s nothing we can do about this, short of lobbying on swift-evolution to make these methods protocol requirements (and I’m not sure it’s a good idea).
Update February 9, 2017: I missed that methods like index(of:)
, contains(_:)
etc. can’t currently be requirements on Sequence
or Collection
because they need an Iterator.Element: Equatable
constraint and there’s no way to define a protocol requirement with a generic constraint. Becca Royal-Gordon raised this issue on swift-evolution and asks if this is something that should be added to the language.
Slicing
I toyed with making SortedArray
’s backing store an ArraySlice
rather than an Array
. The advantage would be that it would be very easy to define SortedArray.SubSequence = ArraySlice
, i.e. to make SortedArray
its own slice type. This would make working with slices very convenient because something like sortedArray.prefix(5)
would return another SortedArray
and not the default type RandomAccessSlice
.
In the end I decided against it because it is discouraged to hold ArraySlice
instances for long periods of time. Even a tiny slice of a very large array holds on to its base array forever, and this can lead to high memory usage the user doesn’t expect — even if the base array’s memory technically hasn’t leaked, the slice prevents it from being freed in a timely manner.
Performance of generic types imported from other modules
If you intend to use the SortedArray
type in your own code (or any other performance-critical generic type, for that matter), I suggest you don’t import it as a third-party module but add the source file directly to your module.
As of Swift 3, Swift can’t perform generics specialization across module boundaries. In other words, if you use a SortedArray<Int>
in your code and SortedArray
is defined in another module, the compiler can’t generate optimized code for an array of Int
elements — it can only use the standard code path that boxes each generic value in a container and performs method dispatch through an associated witness table. This can easily slow down your code by one or two orders of magitude:
Current versions of the Swift compiler are unable to specialize generic types that are imported from external modules other than the standard library. … This limitation puts a considerable limit on the raw performance achievable by collection types imported from external modules, especially if they are parameterized with simple, extremely optimizable value types such as
Int
or evenString
. Relying on import will incur a 10-200x slowdown when your collection is holding these most basic value types. (The effect is much reduced for reference types, though.)
The standard library is the only module that is exempt from this limitation. Standard library types are always available for specialization from any module.
I hope the Swift compiler team finds a way around this in the future. I don’t know how that would work, though. The compiler currently ships with an unoffical attribute called @_specialize
(which will get a new syntax soon, it seems). Annotating a function with this attribute and a list of types instructs the compiler to emit specialized code for the specified types. The in-progress version of the attribute also seems to support constraints like _Trivial64
to cover all trivial value types of a certain size.
Conclusion
The full implementation is 200 lines including comments, plus about the same in tests.
As you can see, there is a lot to consider in a custom collection type. And it’s all about the interface — we haven’t even touched the underlying implementation (that’s done by our backing array). But I think it’s worth it for the result. We end up with a type that behaves almost exactly like the built-in collection types and can be passed interchangeably to any algorithm that operates on Sequence
or Collection
.
The very real performance impact of using generic types across module boundaries is a bummer, though.
-
The semantics of the
min()
andmax()
methods I present here are not the same asSequence.min()
andSequence.max()
. While theSequence
methods always use<
as the comparison predicate, the variants forSortedArray
use the type’s own comparison predicate.As a result, if you initialize a
SortedArray
with a different predicate than<
, the methods will return different results than theSequence
methods (but consistent withSortedArray
’s semantics):let numbers = SortedArray(unsorted: [3,5,1], areInIncreasingOrder: >) numbers.min() // → 5 Array(numbers).min() // → 1
(The standard library provides
min(by:)
andmax(by:)
to give you the option to use a custom comparison predicate.)If this distinction is important to you, it can make sense to introduce a
SortedCollection
protocol that codifies these semantics. The protocol would include the comparison predicate as one of its requirements:protocol SortedCollection: Collection { var areInIncreasingOrder: (Iterator.Element, Iterator.Element) -> Bool { get } ... }
-
Search for “customization point” in the transcript of the linked WWDC session:
The answer is that a protocol requirement creates a customization point.