How We Beat C++ STL Binary Search

At Realm, we’re always looking for ways to optimize for speed. This is the story of how we developed our own fast binary search function, with an execution time on average 24% faster than the C++ STL.


Realm stores data in binary tree structures. For example, suppose Realm were holding objects with the following IDs in a linked list:

[Figure: a linked list holding the IDs 6 → 7 → 10 → 11 → 14 → 20 → 22]

If you wanted to find the object with ID equal to 14, this would require 5 operations, i.e. O(n) complexity.

One way to speed this up is to use binary search:

“The binary search algorithm begins by comparing the target value to the value of the middle element of the sorted array. If the target value is equal to the middle element’s value, the position is returned. If the target value is smaller, the search continues on the lower half of the array, or if the target value is larger, the search continues on the upper half of the array. This process continues until the element is found and its position is returned, or there are no more elements left to search for in the array and a “not found” indicator is returned.”
-Wikipedia

Instead of a linked list, the objects can be held in a binary tree. This structure enables identifying objects via a binary search by traversing the tree, starting from the root node. The binary tree for the IDs would look like this:

[Figure: the same IDs arranged as a binary tree, with 11 at the root, 7 and 20 as its children, and 6, 10, 14, and 22 as leaves]

The search to identify the object with ID equal to 14 now takes only 3 operations (11 → 20 → 14), i.e. O(log2(n)) complexity.
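
As a sketch of how such a traversal works (a minimal node type and lookup written for this example; not Realm’s internal representation):

#include <cstddef>

struct Node {
    size_t id;
    Node* left;   // subtree with smaller IDs
    Node* right;  // subtree with larger IDs
};

// Return the node holding the given ID, or nullptr if it is absent.
// Each comparison discards half of the remaining tree, so a balanced
// tree is searched in O(log2(n)) steps.
const Node* find(const Node* root, size_t id)
{
    while (root && root->id != id)
        root = id < root->id ? root->left : root->right;
    return root;
}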

Given Realm’s reliance on it, binary search was an obvious thing to try to optimize.

In the C++ standard library (STL) there are several binary search functions, including std::lower_bound() and std::upper_bound(). We decided to mimic the behavior of std::upper_bound() and use it as our reference implementation in unit tests. This function is a version of binary search that finds the position of the first element that compares greater than the specified value:


// For a sorted list, vec, return the index of the first element that compares
// greater to the value, or return vec.size() if no such element is found.
size_t match = std::upper_bound(vec.begin(), vec.end(), value) - vec.begin();

For example, using the same IDs as above, let’s apply std::upper_bound() with the value 13. It will return 4, because 14 (at index 4) is the first element that compares greater than 13.
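
As a runnable illustration with the same IDs:

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    std::vector<size_t> vec = {6, 7, 10, 11, 14, 20, 22};

    // The first element greater than 13 is 14, at index 4.
    size_t match = std::upper_bound(vec.begin(), vec.end(), size_t(13)) - vec.begin();
    std::cout << match << std::endl; // prints 4
}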

Our Test Setup

We are currently targeting mobile devices, with additional architectures to come, so we selected the following devices for our benchmarks:



Huawei P7 phone with a 32-bit HiSilicon Kirin 910T (ARMv7 / Cortex-A9)
HP Envy dv6 laptop with a 64-bit Intel Core i7-3630QM

We use the clang++ compiler for iOS, g++ for Android and Linux, and Visual Studio for Windows, so we chose the following setups:

We did not use Thumb mode on ARM because it turned out to yield around half the performance of ARM mode, at least for our particular code.

Baseline

Let’s first investigate std::upper_bound() itself to get a baseline to compare against. We use the size_t type for the element type (the template parameter T) because it matches the native register size of the processor (32 or 64 bits).

We chose a list size of 8192 elements. With 4-byte elements this takes up 32 kilobytes on the 32-bit ARM and fits in its L1 data cache. On x64 the 8-byte elements take up 64 kilobytes, which is double the size of its L1 data cache, so we get many L1 misses. Because of such differences, this post should not be read as an ARM vs. x64 benchmark.

We created a wrapper around std::upper_bound() which returns a size_t instead of an iterator:

template <class T> INLINE size_t fast_upper_bound0(const vector<T>& vec, T value)
{
  return std::upper_bound(vec.begin(), vec.end(), value) - vec.begin();
}
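
To measure it, a minimal harness along these lines can be used (a sketch based on std::chrono; this is our own scaffolding, not the exact harness behind the numbers in this post):

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <vector>

int main()
{
    using namespace std::chrono;

    std::vector<size_t> vec(8192);
    std::iota(vec.begin(), vec.end(), size_t(0)); // sorted values 0..8191

    size_t checksum = 0;
    auto start = steady_clock::now();
    for (size_t i = 0; i < 10000000; ++i) {
        // Derive a scattered search value; an unpredictable value keeps the
        // branch predictor from learning the comparison outcomes.
        size_t value = (i * 2654435761u) % 8192;
        checksum += std::upper_bound(vec.begin(), vec.end(), value) - vec.begin();
    }
    auto elapsed = duration_cast<milliseconds>(steady_clock::now() - start);
    std::cout << elapsed.count() << " ms (checksum " << checksum << ")\n";
}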

Let’s take a peek at the inner loop of upper_bound() inside the stl_algo.h header file of libstdc++…

while (__len > 0)
{
    _DistanceType __half = __len >> 1;
    _ForwardIterator __middle = __first;
    std::advance(__middle, __half);
    if (__val < *__middle)
        __len = __half;
    else
    {
        __first = __middle;
        ++__first;
        __len = __len - __half - 1;
    }
}
return __first;

… and benchmark it:

Version 1

If we look at the STL version, its if-statement has two branches. The first contains only an assignment:

__len = __half;

and the second contains three statements:

__first = __middle;
++__first;
__len = __len - __half - 1;

Our first attempt at optimizing this instead used two statements and one statement, respectively:

template <class T> INLINE size_t fast_upper_bound1(const vector<T>& vec, T value)
{
    size_t size = vec.size();
    size_t i = 0;
    size_t size_2 = size;

    while (size_2 > 0) {
        size_t half = size_2 / 2;
        size_t mid = i + half;
        T probe = vec[mid];
        if (value >= probe) {
            i = mid + 1;
            size_2 -= half + 1;
        }
        else {
            size_2 = half;
        }
    }
    return i;
}

As can be seen from the generated assembly, the compiler produces a simple one-to-one translation of this on all our platforms; it contains one code block for each of the cases, and branch instructions to control which block is executed.

This means the compilers make no algorithmic rewrites, so the performance doesn’t vary much from the STL. We can see that g++ 4.9.1 on ARM does some loop unrolling and also places the second if-case in a separate, distant code block that is branched to and from.

Version 2

We tried simplifying the algorithm to still contain an if-statement with two cases, but this time with just one simple assignment in each block, and with both assignments having the same source variable:

template <class T> INLINE size_t fast_upper_bound2(const vector<T>& vec, T value)
{
    size_t m_len = vec.size();
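    // Assigning -1 to a size_t wraps to SIZE_MAX: low plays the role of
    // "index -1", one position before the first element.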
    size_t low = -1;
    size_t high = m_len;
    assert(m_len < std::numeric_limits<size_t>::max() / 2);
    while (high - low > 1) {
        size_t probe = (low + high) / 2;
        T v = vec[probe];
        if (v > value)
            high = probe;
        else
            low = probe;
    }

    if (high == m_len)
        return m_len;
    else
        return high;
}

This gave a significant speedup with g++ and clang because they turned the branches into conditional moves. Visual Studio did not, and its performance did not improve:

When to use conditional branches and when to use conditional moves depends on whether or not the condition can be correctly predicted by the CPU for each iteration of the loop.

If the condition is almost always true, or almost always false, or if it has a predictable pattern such as:

[true, false, true, false, true, false]

then the branch predictor of the CPU will predict it correctly most of the time. In this case, a conditional branch is very cheap and rivals the speed of a conditional move.

However, if the condition is not predictable, a conditional branch is expensive because the CPU throws away all the speculative work in its pipeline on a mispredict. This is the case in our binary search when deciding whether to continue in the left or the right half of the remaining list, which follows no predictable pattern. As a result, a conditional move is preferred.

That said, which method is faster depends on the microarchitecture and context. CPUs have no prediction logic for conditional moves, and some microarchitectures may fetch a source operand from expensive main memory even when it turns out not to be needed. Benchmarking your own use case is recommended.
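
To make the distinction concrete, here is a small example of our own (not from the binary search itself): counting elements above a threshold, once with a branch and once branchlessly.

#include <cstddef>
#include <vector>

// Branchy: cheap if `x > threshold` is predictable, expensive if it is
// essentially random, because each mispredict flushes the pipeline.
size_t count_greater_branch(const std::vector<size_t>& v, size_t threshold)
{
    size_t n = 0;
    for (size_t x : v) {
        if (x > threshold)
            ++n;
    }
    return n;
}

// Branchless: the comparison is consumed arithmetically, so there is
// nothing to predict; compilers typically emit setcc/cmov-style code.
size_t count_greater_branchless(const std::vector<size_t>& v, size_t threshold)
{
    size_t n = 0;
    for (size_t x : v)
        n += x > threshold ? 1 : 0;
    return n;
}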

Version 3

In this version we eliminate the if-statement completely, at the cost of more unconditional arithmetic.

template <class T> INLINE size_t fast_upper_bound3(const vector<T>& vec, T value)
{
    size_t size = vec.size();
    size_t low = 0;

    while (size > 0) {
        size_t half = size / 2;
        size_t other_half = size - half;
        size_t probe = low + half;
        size_t other_low = low + other_half;
        T v = vec[probe];
        size = half;
        low = v <= value ? other_low : low;
    }

    return low;
}

We ran into the problem that clang refused to turn the ternary operator low = v <= value ? other_low : low into a conditional move; this was visible in the generated assembly.

So we commented it out and replaced it with a function called choose(), which uses inline assembly to issue a cmovaeq instruction:

template <class T> INLINE size_t choose(T a, T b, size_t src1, size_t src2)
{
#if defined(__clang__) && defined(__x86_64)
    size_t res = src1;
    // Unsigned compare of b against a, then a conditional move:
    // res = (b >= a) ? src2 : src1, forced into a single cmovae.
    asm("cmpq %1, %2; cmovaeq %4, %0"
        : "=q" (res)
        : "q" (a), "q" (b), "q" (src1), "q" (src2), "0" (res)
        : "cc");
    return res;
#else
    return b >= a ? src2 : src1;
#endif
}
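
With this helper, the conditional assignment inside fast_upper_bound3’s loop becomes:

low = choose(v, value, low, other_low);

Note that choose(v, value, low, other_low) evaluates value >= v ? other_low : low, matching the ternary it replaces.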

This gave the following results for our third attempt:

For x64 g++ 4.8.2 and x64 clang 3.4 we now get just one conditional move instruction, compared to two in the second attempt. This is expected, since we now have only a single conditional assignment to low in the C++ code.

On ARM, some of the conditional moves have simply been replaced by other conditional instructions which gives no speedup.

Unrolled Version 3

Because our lists are usually long, we tried unrolling the while loop into 3 repeated sequences:

template <class T> INLINE size_t fast_upper_bound4(const vector<T>& vec, T value)
{
    size_t size = vec.size();
    size_t low = 0;

    while (size >= 8) {
        size_t half = size / 2;
        size_t other_half = size - half;
        size_t probe = low + half;
        size_t other_low = low + other_half;
        T v = vec[probe];
        size = half;
        low = choose(v, value, low, other_low);

        half = size / 2;
        other_half = size - half;
        probe = low + half;
        other_low = low + other_half;
        v = vec[probe];
        size = half;
        low = choose(v, value, low, other_low);

        half = size / 2;
        other_half = size - half;
        probe = low + half;
        other_low = low + other_half;
        v = vec[probe];
        size = half;
        low = choose(v, value, low, other_low);
    }

    while (size > 0) {
        size_t half = size / 2;
        size_t other_half = size - half;
        size_t probe = low + half;
        size_t other_low = low + other_half;
        T v = vec[probe];
        size = half;
        low = choose(v, value, low, other_low);
    }

    return low;
}

This gave minor speedups for all platforms except x64 g++ which remained the same.

We also wanted to test the effect of the -Os flag (optimize for size) over -O3 (optimize for speed). We did this for ARM and x64 with the following results:

We did not create assembly listings of this unrolled attempt.

Summary

Here are all of the benchmarks over the various attempts:

Benchmarks per Architecture

Benchmarks per Version

Concluding Thoughts

We developed a fast binary search function with an execution time on average 24% faster than the STL’s. Averaging our version’s execution time relative to std::upper_bound() across the five compiler/platform setups gives:

(76% + 81% + 80% + 75% + 70%) / 5 ≈ 76%

We saw the cost of mispredicted branches and how to hint the compiler into using conditional moves instead. Such hinting is difficult and requires you to verify the generated assembly code.

Some binary search algorithms end with a short linear search. In our unrolled attempt we replaced the trailing non-unrolled loop with a linear search. This gave tiny speedups of ~1% in the average case but slowed down many special cases. We wanted an optimization that is faster for any input, so we rejected it.

We have also experimented with interpolated binary search. It divides the remaining list at a position estimated from the list’s leftmost and rightmost values relative to the value being searched for. The estimate requires an integer division, which is a very expensive instruction, and we found that it only pays off when the values in the list are distributed very homogeneously. At the same time its worst-case runtime of O(n) is much worse than our O(log2(n)).
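
For reference, here is a minimal sketch of the idea (our own illustration, written for unsigned integer values; it returns the index of a match rather than following upper_bound() semantics, and assumes the position estimate does not overflow):

#include <cstddef>
#include <vector>

// Interpolation search: probe where `value` would sit if the values were
// evenly distributed between vec[low] and vec[high].
template <class T>
size_t interpolation_search(const std::vector<T>& vec, T value)
{
    if (vec.empty())
        return 0;
    size_t low = 0;
    size_t high = vec.size() - 1;
    while (value >= vec[low] && value <= vec[high]) {
        if (vec[high] == vec[low])
            return vec[low] == value ? low : vec.size();
        // The position estimate costs an integer division per iteration.
        size_t pos = low + (size_t)((value - vec[low]) * (high - low) / (vec[high] - vec[low]));
        if (vec[pos] == value)
            return pos;
        if (vec[pos] < value)
            low = pos + 1;
        else
            high = pos - 1;
    }
    return vec.size(); // not found
}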

Next Steps

Haven’t tried Realm and you’re looking for a performant mobile database? Download the latest version for iOS and Android:

Do you want to work on optimizations like this? Realm is hiring and we’d love to meet you! Feel free to email us at jobs@realm.io and introduce yourself.

Source code

Complete source code can be downloaded as a file here.




Adam Fish

Adam is the Director of Product at Realm, where he manages the product development for currently supported mobile platforms and upcoming new products. He has a strong background in entrepreneurship and software development, having previously co-founded Roobiq, a mobile-first sales productivity app used by sales teams worldwide.
