From the make-it-go-fast dept.
Use the machine's full capabilities
Introduction
Any software that does a significant amount of processing has to adapt to modern CPU design, which for some decades has meant multiple cores, or multiple CPUs, in a single system.
There are several ways to approach, and conquer, the multi-core battle in a software family like Flowee, and on this page I'll address the various approaches and their benefits.
Terminology
- Event-queue
- When a single thread is provided with an 'event-queue', that thread can be used for any type of job; jobs are processed one by one. (See the sketch after this list.)
- Chunking
- The practice of taking a single long process and splitting it into individual 'chunks' in order to make it possible to run it in an event-queue. The goal is to spend only a limited amount of time before returning to the event-queue.
- Locking
- When data can't safely be accessed from multiple threads at once, we can use locking (a mutex) to make a second thread wait until it is safe.
- Lock-Free
- Lock-free programming moves the protection of data members from the programmer to the CPU, and in many cases it is vastly preferred for its speed and its avoidance of waiting.
- Worker-pool
- An array of threads, typically one per hardware core, each of them running an event-queue, ready to process any task posted to them.
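To make the 'event-queue' term concrete, here is a minimal sketch (my own illustration; Flowee's actual implementation differs): a single thread pops std::function jobs off a queue and runs them one by one.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class EventQueue {
public:
    EventQueue() : m_thread([this] { run(); }) {}
    ~EventQueue() {
        post(nullptr); // an empty job signals shutdown
        m_thread.join();
    }
    // can be called from any thread
    void post(std::function<void()> job) {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_jobs.push(std::move(job));
        m_cond.notify_one();
    }
private:
    void run() {
        while (true) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m_mutex);
                m_cond.wait(lock, [this] { return !m_jobs.empty(); });
                job = std::move(m_jobs.front());
                m_jobs.pop();
            }
            if (!job)
                return; // shutdown requested
            job(); // jobs run strictly one by one, in posting order
        }
    }
    std::mutex m_mutex;
    std::condition_variable m_cond;
    std::queue<std::function<void()>> m_jobs;
    std::thread m_thread; // declared last so the members above exist before run() starts
};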
Multi-threading in Flowee the Hub
The starting point for multi-threading is knowing that we use a worker-pool owned by the WorkerThreads class (implemented on top of Boost ASIO). The worker pool allows any component anywhere in the application to schedule a job to be executed. Posting a job can be done from any thread, and the job can be executed on any thread owned by the worker-pool.
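As a sketch of that idea, this is roughly what posting a job to a Boost-ASIO-backed pool looks like. The boost::asio::thread_pool API used here is standard Boost; how WorkerThreads wires things up internally may differ.

#include <boost/asio/post.hpp>
#include <boost/asio/thread_pool.hpp>
#include <thread>

int main() {
    // one thread per hardware core, each running an event-queue
    boost::asio::thread_pool pool(std::thread::hardware_concurrency());

    // posting can be done from any thread; the job runs on whichever
    // pool thread picks it up
    boost::asio::post(pool, [] {
        // ... the job ...
    });

    pool.join(); // wait for all posted work to finish
}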
It is thus important to realize that practically all parts of the application should be made thread-safe. In new classes we do this by encapsulating the data and either locking it or making the data members lock-free (or a combination of the two). Specifically, those data members should be private to the class.
This is in stark contrast to old code that still uses global or public locks (implemented as mutexes) like cs_main. New code introducing such constructs will not be accepted.
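A sketch of the pattern we do accept, with an illustrative class (not actual Flowee code): the mutex and the data it protects are both private, so no other code can take the lock or touch the data directly.

#include <mutex>

class PeerRegistry { // illustrative name only
public:
    void addPeer(int id) {
        std::lock_guard<std::mutex> lock(m_lock);
        m_peerCount++;
        m_lastPeerId = id;
    }
    int peerCount() const {
        std::lock_guard<std::mutex> lock(m_lock);
        return m_peerCount;
    }
private:
    mutable std::mutex m_lock; // private, unlike a global lock such as cs_main
    int m_peerCount = 0;
    int m_lastPeerId = -1;
};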
Example 1
Let's take an example from the Validation::Engine:
namespace Validation {

class Engine {
public:
    Settings addBlock(const FastBlock &block, std::uint32_t onResultFlags, CNode *pFrom = nullptr);
};

class Settings {
public:
    Settings();
    Settings(const Validation::Settings &other);
    Validation::Settings start();
    const std::string &error() const;
    ...
    void setCheckPoW(bool on);
    void waitUntilFinished() const;
    void waitHeaderFinished() const;
private:
    std::shared_ptr<ValidationSettingsPrivate> d;
};

}
void test() {
    // just add and let it complete in the background.
    validator.addBlock(block, Validation::PunishBadNode, pFrom);

    // wait until done, we can ask for errors afterwards.
    auto future = validator.addBlock(block2, 0).start();
    future.waitUntilFinished();

    // set settings on the job before we start it.
    auto settings = validator.addBlock(block3, 0);
    settings.setCheckPoW(false);
    settings.start();
}
In this example we show how the new validation engine allows you to add a block to be validated. The 'addBlock()' method returns immediately, because the base assumption in Flowee is that APIs are non-blocking. We need to assume this because the validation engine shares its worker threads with the network and API handling; blocking would mean wasting CPU time in which we could have handled another user request.
In our first call we pass in PunishBadNode. We delegate the punishing of a node to the validation engine so that the thread that handled the incoming data can forget about the block the moment it has passed it to the validation engine.
In specific cases the programmer can decide to block, for instance when mining, and then you can call waitUntilFinished(). For this reason the 'addBlock()' method returns a Settings instance which can be used as a 'future' to wait for the block validation to finish in another thread.
The third example shows how we use the settings object to pass in more options. In Flowee we love maintainable code, so the option of a method with a dozen arguments didn't sound very wise. The solution is shown at the bottom of the code above: we return a 'Settings' object on which more validation options can be set, and when we are happy we call start() to actually commence the validation. We could call waitUntilFinished() on that settings object if we wished to do so.
Please note that in the above example we use std::shared_ptr and various similar classes. Those are effective examples of lock-free programming: the reference counting is thread-safe and requires no mutexes.
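To illustrate that point, here is a sketch (under the assumption that the shared object itself is only read, not written): copies of a std::shared_ptr can be handed to many threads without a mutex, because the reference count is updated atomically.

#include <memory>
#include <thread>
#include <vector>

struct ValidationSettingsPrivate { int flags = 0; }; // stand-in for the real class

int main() {
    auto d = std::make_shared<ValidationSettingsPrivate>();
    std::vector<std::thread> threads;
    for (int i = 0; i < 8; ++i) {
        // each lambda gets its own copy of the shared_ptr; the atomic
        // ref-count makes copying and destroying safe across threads,
        // and the last copy to go away deletes the object
        threads.emplace_back([d] { int f = d->flags; (void)f; });
    }
    for (auto &t : threads)
        t.join();
}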
Example 2
An example from the transaction validation to show chunking:
struct BlockValidationState {
    FastBlock m_block;
    mutable std::atomic<int> m_txChunkLeftToStart;

    void calculateTxCheckChunks(int &chunks, int &itemsPerChunk) const;

    void checkSignaturesChunk() {
        const int totalTxCount = (int) m_block.transactions().size();
        int chunkToStart = m_txChunkLeftToStart.fetch_sub(1) - 1;
        int chunks, itemsPerChunk;
        calculateTxCheckChunks(chunks, itemsPerChunk);
        const int txIndexStart = itemsPerChunk * chunkToStart;
        const int txMax = std::min(txIndexStart + itemsPerChunk, totalTxCount);
        for (int txIndex = txIndexStart; txIndex < txMax; ++txIndex) {
            // validate a tx
        }
    }
};
The scenario is this: we have a block with a large number of transactions, the current hardware has a specific number of cores, and we want to divide the total number of transactions to check evenly over those cores, each core doing its own chunk. We could prepare everything up front, but here is a simpler way.
We call checkSignaturesChunk() a total of 8 times, should the machine have 8 cores, and the calls can run in parallel. The fun detail is the use of the m_txChunkLeftToStart atomic int, which allows for lock-free operation. We set it to 8 before we start, and as each core starts to execute the checkSignaturesChunk() method, it reaches this line:
int chunkToStart = m_txChunkLeftToStart.fetch_sub(1) - 1;
The fetch_sub(1) on the atomic ensures that each core gets a different number out, without any locking going on. It is practically free as long as two threads don't happen to access that specific variable at exactly the same time; in that case the CPU simply tries again, so it is not expensive either.
The end result is that each core will validate the scripts of a different subset of transactions, in parallel and without any locking.
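Here is a self-contained sketch of the same trick using only the standard library (my own example, not Flowee code): every worker grabs a unique chunk number via fetch_sub and processes its own slice, with no locks anywhere.

#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

int main() {
    const int totalItems = 1000; // e.g. transactions in a block
    const int chunks = (int) std::max(1u, std::thread::hardware_concurrency());
    const int itemsPerChunk = (totalItems + chunks - 1) / chunks;
    std::atomic<int> chunksLeftToStart(chunks);

    std::vector<std::thread> workers;
    for (int i = 0; i < chunks; ++i) {
        workers.emplace_back([&] {
            // fetch_sub returns the previous value, so each worker sees
            // a different chunk number even if they race on this line
            const int chunk = chunksLeftToStart.fetch_sub(1) - 1;
            const int start = chunk * itemsPerChunk;
            const int end = std::min(start + itemsPerChunk, totalItems);
            for (int item = start; item < end; ++item) {
                // validate item here
            }
        });
    }
    for (auto &w : workers)
        w.join();
}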
For the curious among you: in the final code there is a second atomic, called m_txChunkLeftToFinish, which also starts out at 8. Can you guess what the point of this atomic is?