Concurrent, lock-free vector

While a concurrent alternative to an std::vector is generally not as useful as, for example, a queue or hash-map, it is still an interesting data structure to develop.

This post describes the design choices made when designing a lock-free data structure, and hopefully will be of interest to some. By no means does it aim to be the most performant or “the best” way to write a lock-free vector.

The aim is to create a concurrent vector with push back, erase, random access look-up, iteration and dynamic resize facilities.

We will reuse the double-reference-counted pointer structure explained and used in the concurrent hash-map post. A double reference guarded pointer is a garbage-collecting mechanism that ensures that data evicted from a concurrent data structure, data that might still be in use, isn’t destroyed prematurely. It is an integral part to the design of the concurrent vector.
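As a rough sketch of the interface this post relies on — not the packed-counter implementation from the hash-map post — a double_reference_guard can be modeled with std::shared_ptr and its atomic free functions; every name here is illustrative:

```cpp
#include <atomic>
#include <memory>
#include <utility>

// Simplified stand-in for the post's double_reference_guard, modeled with
// std::shared_ptr atomics purely to illustrate the acquire/emplace/drop
// interface. The real structure packs its counters next to the pointer.
template <typename T>
class double_reference_guard {
    std::shared_ptr<T> ptr;

public:
    // Returns a guard: a shared owner that keeps the element alive even if
    // it is concurrently dropped or replaced.
    std::shared_ptr<T> acquire() const { return std::atomic_load(&ptr); }

    // Atomically constructs a new element in place of the old one.
    template <typename... Ts>
    void emplace(Ts&&... ts) {
        std::atomic_store(&ptr, std::make_shared<T>(std::forward<Ts>(ts)...));
    }

    // Atomically erases the element, leaving an empty slot.
    void drop() { std::atomic_store(&ptr, std::shared_ptr<T>{}); }
};
```

A guard acquired before an element was dropped keeps that element alive until the guard itself goes out of scope, which is exactly the garbage-collecting behavior described above.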

Source code is available under the CC Attribution-NonCommercial 4.0 license at my GitHub repository.

## Design Philosophy

As with every concurrent data structure, we need to be aware of some caveats to design a robust data structure, and make the most out of it.

First of all, concurrent data structures will always be out-performed by their non-concurrent counterparts for obvious reasons. Atomic operations are slow. It is generally wise to make sure that your requirements are well defined, before using or writing lock-free procedures.

Data integrity and correctness must be ensured, in the sense that data won’t be lost unless erased or overwritten. Likewise the data structure won’t be left in an undefined state, even if an exception is thrown (e.g. by a copy constructor).

While all read/write access to the data structure is thread-safe, mutating the user data stored by the vector is not guaranteed to be so.

Regarding visibility of mutations to readers: As this is a concurrent data structure, we make just a single guarantee about what read operations actually return. A mutation performed by a write operation is guaranteed to be visible to a reader if the read operation has started after the write operation was concluded, and there is a release-acquire memory fence between the operations. However the read operation might see a newer mutation, even one that is still in progress, and multiple concurrent readers might see different values. It is up to the consumer to synchronize and utilize memory barriers where applicable.
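The visibility guarantee above is the standard release-acquire pairing; a minimal, self-contained illustration (not code from the vector itself):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// A write made before a release store is guaranteed visible to a reader
// whose acquire load observes that store.
int payload = 0;
std::atomic<bool> published{false};

void writer() {
    payload = 42;                                     // the mutation
    published.store(true, std::memory_order_release); // conclude the write
}

void reader() {
    while (!published.load(std::memory_order_acquire))
        ;                       // spin until the write is observed
    assert(payload == 42);      // guaranteed by the release-acquire pairing
}
```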

The major issues to overcome are erasing and inserting elements in the middle of the vector. Traditionally this requires moving elements around to compact the vector or make space for new elements, however moving elements is not “lock-free friendly”. To overcome that we simply don’t move elements. Deleted elements become “tombstones”, i.e. empty slots, and we don’t support inserting elements in the middle of the vector. Of course, new elements can still be inserted overwriting existing elements, or tombstones.

For the very same reasons, we also never shrink the vector.

Resizing the vector might happen in parallel with other read/write operations, therefore we can’t move elements and the elements must be copy-constructible. It is the consumer’s responsibility to ensure that the copy constructors are safe with regard to the consumer’s access patterns.

## Structure

The vector allocates a contiguous block of memory for N slots (“storage”), each slot a double-reference-counted pointer to a (possibly empty) element. Each slot takes `2 x sizeof(void*)` bytes on 32-bit architectures, or `1 x sizeof(void*)` bytes on x86-64 architectures, as we can exploit the 48-bit address space (see canonical form addresses of the AMD64 architecture) and use the upper 16 bits of the pointer for the counter. This also allows us to use atomic operations on the pointer.
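A sketch of the 48-bit packing — the mask and shift values here are illustrative, not taken from the actual source:

```cpp
#include <cstdint>

// User-space canonical x86-64 addresses fit in the low 48 bits, so a
// 16-bit counter can ride in the upper bits and the pointer + counter
// pair fits in a single lock-free 64-bit atomic word.
constexpr std::uint64_t ptr_mask = (1ull << 48) - 1;

inline std::uint64_t pack(void *p, std::uint16_t counter) {
    return (reinterpret_cast<std::uint64_t>(p) & ptr_mask)
         | (static_cast<std::uint64_t>(counter) << 48);
}

inline void *unpack_ptr(std::uint64_t word) {
    return reinterpret_cast<void *>(word & ptr_mask);
}

inline std::uint16_t unpack_counter(std::uint64_t word) {
    return static_cast<std::uint16_t>(word >> 48);
}
```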

As the vector is resizeable, we also need to use a double-reference-counted pointer for the storage itself; this way we can replace it with a new one without compromising current readers. We call that pointer the “root” of the vector.

The size of the vector is an integer that keeps track of where we push back. The size is a global vector property and is not tied to the storage (unlike the capacity), therefore its lifetime should also be unaffected by the storage.

For example:

```cpp
template <typename T>
class concurrent_vector {
    using element_type = double_reference_guard<T>;

    struct storage {
        element_type *data;
        std::size_t capacity;
    };

    mutable double_reference_guard<storage> root;
    std::atomic<std::size_t> elements{ 0 };
};
```


All the read/write operations on the vector are done in the form of iterators. However, those iterators have to be slightly different from STL iterators, as in our case the underlying data can change while we iterate. An iterator holds an acquired guard to the vector’s root and an index. As the vector never shrinks, random-access iterators are safe and can support all fancy arithmetic operators. The iterator only points to the double-reference-counted element pointers; to read a value, the consumer should first acquire a guard from an iterator, check that it is valid and then read.
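The per-slot read pattern — acquire a guard, check it for validity, then read — can be sketched as follows, with std::shared_ptr standing in for the double-reference-counted element pointer and all names hypothetical:

```cpp
#include <atomic>
#include <memory>
#include <vector>

// Each slot is a reference-counted pointer; a reader acquires a guard per
// slot, checks it (tombstones yield an empty guard), then reads the value.
using slot = std::shared_ptr<int>;

int sum_live_elements(const std::vector<slot> &storage) {
    int sum = 0;
    for (const auto &s : storage) {
        auto guard = std::atomic_load(&s); // acquire a guard to the slot
        if (guard)                         // skip tombstones (empty slots)
            sum += *guard;
    }
    return sum;
}
```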

## Resizing

Once an operation (like pushing an element or reserving space) initiates a resize, it needs to create the new vector and atomically compare-exchange it to a dedicated resize pointer. Only one thread will “win” the exchange and the rest should destroy their created vectors and join the resize.

When resizing we need to copy elements from the old storage to a new one and then atomically replace the root with the new storage. As multiple threads might participate in the resize, each thread should choose a starting index at random and iterate over the storage. At each slot that wasn’t copied yet, a resizer attempts to acquire a guard to the element in the old storage and, if the slot is not empty, copies the element to the new storage.

Once a thread finishes, it compare-exchanges the root pointer with the new vector. Here as well only one thread will be successful in setting the new root. Resizers can terminate early if they detect that somebody has already replaced the root.
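Ignoring the per-slot synchronization (guards and compare-exchanges on a “copied” flag), the cooperative copy each resizer performs can be sketched single-threadedly like this; the names are illustrative:

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Each resizer starts at a random slot and walks the whole old storage
// once, copying only the slots nobody copied yet. Random starting points
// spread concurrent resizers across the storage instead of having them
// contend on the same leading slots.
void copy_slots(const std::vector<int> &old_storage,
                std::vector<int> &new_storage,
                std::vector<bool> &copied) {
    std::mt19937 rng{std::random_device{}()};
    const std::size_t n = old_storage.size();
    std::size_t start =
        std::uniform_int_distribution<std::size_t>{0, n - 1}(rng);
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t slot = (start + i) % n;    // wrap around the storage
        if (copied[slot])
            continue;                          // another resizer got here first
        new_storage[slot] = old_storage[slot];
        copied[slot] = true;
    }
}
```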

To make sure that the resize data and its lifetime are shared correctly between threads, we employ a std::shared_ptr and its atomic overloads (std::atomic_compare_exchange_strong and std::atomic_load).
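The single-winner election on the resize state might look like this sketch; resize_state and the surrounding names are hypothetical:

```cpp
#include <cstddef>
#include <memory>

// Hypothetical struct holding the state of an in-progress resize.
struct resize_state { std::size_t new_capacity; };

std::shared_ptr<resize_state> pending;  // null when no resize is ongoing

// Returns the shared resize state; only one caller's candidate "wins" the
// compare-exchange, everybody else joins the winner's resize.
std::shared_ptr<resize_state> start_or_join_resize(std::size_t capacity) {
    auto candidate = std::make_shared<resize_state>(resize_state{capacity});
    std::shared_ptr<resize_state> expected{nullptr};
    if (std::atomic_compare_exchange_strong(&pending, &expected, candidate))
        return candidate;  // we won: our state is the shared one
    return expected;       // somebody beat us: join their resize
}
```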

In line with our design principles, we also must make sure that any mutation that happened while resizing is reflected in the new vector. To enforce correctness, all writers that mutated the vector check if a resize is ongoing, in which case they join the resize and, once it is done, repeat the mutation on the new vector.

## Writes

Mutating an element in the vector can only be done by calling the emplace() method on that element’s double-reference-counted pointer to create a new element in its place, or by using the drop() method to erase an element. Those mutations are atomic, safe and are the building blocks for the rest of the write operations.

As discussed, to ensure correctness, mutations must make sure that a resize didn’t replace the old root:

```cpp
template <typename... Ts>
void emplace(iterator &iter, Ts&&... ts) {
    do {
        iter->emplace(std::forward<Ts>(ts)...);
    } while (!mutate(iter));
}

void erase(iterator &iter) {
    do {
        iter->drop();
    } while (!mutate(iter));
}
```


mutate(iter) checks whether the vector’s root is still the same root as acquired by the iterator. If it is, it returns true and we are done; otherwise it joins a resize (if any) and updates the iterator’s root pointer.
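A simplified, single-threaded model of mutate() — with std::shared_ptr standing in for the root’s double-reference-counted pointer, the join-resize step elided, and all names illustrative — could look like:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

struct storage { std::vector<int> slots; };

// The vector's current root; replaced atomically by a resize.
std::shared_ptr<storage> root = std::make_shared<storage>();

// An iterator holds the root it acquired plus an index.
struct iterator_t {
    std::size_t index;
    std::shared_ptr<storage> acquired_root;
};

// True if the iterator's root is still current, i.e. the mutation landed in
// the live storage. Otherwise re-acquire the root and report failure so the
// caller repeats the mutation.
bool mutate(iterator_t &iter) {
    auto current = std::atomic_load(&root);
    if (current == iter.acquired_root)
        return true;
    iter.acquired_root = current;  // root changed: retry against the new root
    return false;
}
```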

This gives rise to more complex write operations, like emplace_back:

```cpp
template <typename... Ts>
void emplace_back(Ts&&... ts) {
    auto r = root.acquire();

    // Atomically increase vector size, taking ownership of a slot
    const std::size_t location = elements.fetch_add(1);

    // If we have space, emplace
    if (r->capacity > location) {
        iterator iter(location, std::move(r));
        emplace(iter, std::forward<Ts>(ts)...);
        return;
    }

    // Otherwise, resize and emplace at the chosen location.
    resize((location + 1) * 2);
    r = root.acquire();
    emplace(iterator(location, std::move(r)), std::forward<Ts>(ts)...);
}
```


emplace_back atomically increases the size counter, taking ownership of a slot. At this point it simply inserts the element in this slot, initiating a resize first if necessary. The size counter can grow sharply beyond the capacity with multiple writers pushing into the vector, therefore the resize operation should make sure that the completed resize actually satisfies the size requirements, otherwise initiate a new resize.
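That size-versus-capacity check can be sketched as a loop (names hypothetical; the actual resize is the cooperative procedure described earlier):

```cpp
#include <atomic>
#include <cstddef>

std::atomic<std::size_t> elements{0};  // the vector's size counter
std::size_t capacity = 4;              // stand-in for the storage capacity

// With many concurrent writers the size counter can leap past any single
// resize target, so the resize driver loops until the completed capacity
// actually covers the current size, reusing the (location + 1) * 2 growth
// policy from emplace_back.
void ensure_capacity() {
    std::size_t size = elements.load();
    while (capacity < size) {
        capacity = (size + 1) * 2;  // perform one resize to this target
        size = elements.load();     // writers may have grown size meanwhile
    }
}
```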

Other, fancier mutators are possible, some are available with the source.

## Conclusion

Admittedly, a concurrent vector is an odd animal, and its performance will be hampered by the number of atomic operations required for the most basic tasks. Nonetheless, it might be of use in situations where locking a mutex on a rendering/UI/networking thread is unacceptable. Iterating and reading offer guaranteed constant-time, lock-free performance.