A Study on Parallel Execution: Everything You Need to Know

by Sin7Y, January 3rd, 2023

Preface

This research compares implementations of Ethereum-like systems and analyzes the difficulties and possibilities of achieving parallel transaction execution.


It’s worth noting that the chains analyzed in this research are based on the account model; UTXO-based chains are not included.

Research Objects

  1. FISCO-BCOS, a consortium blockchain that supports parallel execution of transactions during block verification.


  2. Khipu public chain, a Scala implementation of the Ethereum protocol.


  3. Aptos public chain, built on the Move virtual machine.

Difficulties With Parallel Execution

Let’s take a look at the traditional transaction execution process.


The execution module takes out each transaction from the block and executes it sequentially.


The world state is modified during execution; the changes from each completed transaction accumulate, and after the last transaction in the block completes, the latest world state is reached.


The execution of the next block strictly depends on the world state left by the current block, so this sequential, single-threaded execution process is not well suited to parallelization.
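To make the baseline concrete, here is a minimal Python sketch of this sequential loop; it is an illustration rather than any client's actual code, and WorldState and apply_tx are simplified placeholders that track only balances.

class WorldState:
    # Simplified world state that tracks only account balances.
    def __init__(self, accounts=None):
        self.accounts = dict(accounts or {})

def apply_tx(state, tx):
    # Each transaction reads and mutates the single shared world state.
    state.accounts[tx["from"]] = state.accounts.get(tx["from"], 0) - tx["value"]
    state.accounts[tx["to"]] = state.accounts.get(tx["to"], 0) + tx["value"]

def execute_block_sequentially(state, block_txs):
    # Strictly ordered: transaction i+1 sees the state produced by transaction i,
    # and the next block starts from the state this block leaves behind.
    for tx in block_txs:
        apply_tx(state, tx)
    return state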



Below are the main conflicts faced by current Ethereum parallel execution methods:


  1. Account Conflict: If two threads process the balance or other attributes of the same address at the same time, how do we ensure the result is consistent with sequential processing, i.e., that the world state remains a deterministic finite state machine?


  2. Storage Conflict at the Same Address: both contracts modify the storage of the same global variable.


  3. Cross-Contract Call Conflict: If contract A is deployed first, contract B must wait until the deployment of contract A is completed before calling it. However, when transactions run in parallel, there is no such ordering, which leads to conflict.

Parallel Execution Schemes

FISCO-BCOS

Abstract

FISCO-BCOS 2.0 uses a graph structure in transaction processing. The developers designed a Parallel Transaction Executor (PTE) based on the Directed Acyclic Graph model (DAG).


PTE can help you fully utilize the advantages of a multi-core processor so that the transactions in the block can be executed in parallel to the extent possible.


At the same time, it provides a simple and friendly programming interface for the user, so that the user does not have to care about the tedious details of parallel implementation.


The experimental results of the benchmark test program show that, compared with the traditional serial execution scheme, PTE running on a 4-core processor under ideal conditions achieves roughly a 200%~300% performance improvement, and the gain scales with the number of cores.


The more cores, the better the performance.

General Scheme

A directed graph with no cycles is referred to as a Directed Acyclic Graph (DAG).


In a batch of transactions, the mutually exclusive resources occupied by each transaction are identified; then a transaction dependency DAG is constructed according to the order of the transactions in the block and the occupancy relationships of the mutually exclusive resources.


As shown in the figure below, all transactions with an in-degree of 0 (no predecessor dependencies) can be executed in parallel. The transaction DAG on the right can be obtained by topologically sorting the original transaction list on the left.


Modular Architecture


  • Users initiate transactions directly or indirectly through the SDK.


  • The transaction is then synchronized between nodes, and the node with packaging rights invokes the Sealer to take a certain number of transactions from the TxPool and package them into a block. The block is then sent to the Consensus unit in preparation for inter-node consensus.


  • Transaction validation is performed before consensus is reached, and this is where PTE begins its work. As can be seen from the architecture diagram, PTE first reads the transactions in the block in order and feeds them into the DAG Constructor, which builds a transaction DAG covering all transactions according to their dependencies. PTE then wakes up the worker pool and executes the transaction DAG in parallel using multiple threads. The Joiner suspends the main thread until every thread in the worker pool has finished executing the DAG, at which point the Joiner computes the state root and receipt root from each transaction's state-modification record and returns the result to the upper-level caller.


  • After the block is verified, it is committed to the chain. If, after execution, the state of every node is consistent, consensus is reached, and the block is written to the underlying storage, permanently recording it on the blockchain.

Construction Process of Transaction DAG


  1. Take all the transactions in the block from the packed block.


  2. Initialize a DAG instance with the number of transactions as the maximum number of vertices.


  3. Read all transactions in order. If a transaction is mergeable (parallelizable), resolve its conflict fields and check whether any previous transaction conflicts with it; if so, construct a dependency edge between the corresponding transactions. If the transaction is not mergeable, it must be executed after all previous transactions have completed, so dependency edges are created between it and all of its predecessors.


Note: A transaction that has dependency edges to all of its predecessors cannot be merged with any other transaction and can only be executed sequentially.
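Below is a minimal Python sketch of this construction; it is an illustration under the stated rules, not FISCO-BCOS's actual C++ implementation. Each transaction is paired with the set of mutually exclusive resources declared for it (None marks a non-mergeable transaction), and making later transactions also wait for the most recent non-mergeable one is an added safety assumption.

from collections import defaultdict

def build_tx_dag(txs):
    # txs: list of (tx_id, resource_set); resource_set is None for transactions
    # whose conflicts cannot be determined (non-mergeable).
    edges = defaultdict(set)      # predecessor index -> successor indexes
    in_degree = [0] * len(txs)
    last_touch = {}               # resource -> index of the last tx that touched it
    barrier = None                # index of the most recent non-mergeable tx

    for i, (_, resources) in enumerate(txs):
        if resources is None:
            deps = set(range(i))  # must run after every earlier transaction
            barrier = i
        else:
            deps = {last_touch[r] for r in resources if r in last_touch}
            if barrier is not None:
                deps.add(barrier)  # stay safe w.r.t. the unknown effects above
            for r in resources:
                last_touch[r] = i
        for d in deps:
            if i not in edges[d]:
                edges[d].add(i)
                in_degree[i] += 1
    return edges, in_degree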

DAG Execution Process


  1. The main thread first initializes a small group of threads based on the number of hardware cores; if the number of cores cannot be determined, no additional threads are created.


  2. While the DAG has not finished executing, each thread loops, waiting to pop a ready transaction (in-degree 0) from the DAG's waitPop method. If a transaction is successfully obtained, it is executed; if not, the DAG has completed execution and the thread exits.
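A minimal Python sketch of this worker loop follows; Python threads stand in for the node's native thread pool, execute is a hypothetical per-transaction callback, and edges/in_degree are the structures produced by the construction sketch above.

import queue
import threading

def run_dag(txs, edges, in_degree, execute, num_workers=4):
    ready = queue.Queue()                 # transactions with in-degree 0
    remaining = [len(txs)]                # number of unexecuted transactions
    lock = threading.Lock()
    for i, d in enumerate(in_degree):
        if d == 0:
            ready.put(i)

    def worker():
        while True:
            try:
                i = ready.get(timeout=0.1)   # "waitPop": wait for a ready tx
            except queue.Empty:
                with lock:
                    if remaining[0] == 0:    # DAG fully executed: thread exits
                        return
                continue
            execute(txs[i])
            with lock:
                remaining[0] -= 1
                for j in edges.get(i, ()):   # release successors
                    in_degree[j] -= 1
                    if in_degree[j] == 0:
                        ready.put(j)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()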

Problems and Solutions

  1. For the same block, how do we ensure that all nodes complete execution and arrive at the same state (the three roots match)?


FISCO BCOS verifies that the triple (state root, transaction root, and receipt root) is identical across nodes to determine whether their states agree. A transaction root is a hash value calculated from all the transactions in the block.


As long as all consensus nodes process the same block data, the transaction root must be the same, which is relatively easy to guarantee. The key is to ensure that the state root and receipt root generated after execution are the same.


It is well known that the order of execution between instructions executed in parallel on different CPU cores cannot be predicted in advance, and the same is true for transactions executed in parallel.


In the traditional transaction execution scheme, the state root changes once every transaction is executed, and the changed state root is written into the transaction receipt.


After all transactions are executed, the final state root represents the current state of the blockchain. At the same time, a receipt root is calculated based on all transaction receipts.


It can be seen that in the traditional implementation, the state root acts as a global shared variable.


When transactions are executed in parallel and out of order, the traditional calculation of state root is no longer applicable because transactions are executed in a different order on different machines and the final state root is not guaranteed to be consistent, nor is the receipt root guaranteed to be consistent.


In FISCO BCOS, transactions are first executed in parallel and the history of each transaction’s state change is recorded. After all transactions are executed, a state root is calculated based on history.


At the same time, the state root recorded in each transaction receipt becomes the final state root after all transactions have been executed, ensuring that the consensus nodes can still reach agreement even when transactions are executed in parallel.
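As a toy illustration of this idea (not FISCO-BCOS's actual Merkle-Patricia-trie hashing), a deterministic root can be derived by replaying each transaction's recorded change history in canonical block order, independent of the order in which worker threads happened to finish:

import hashlib
import json

def state_root_from_history(base_state, change_records):
    # change_records[i] is the dict of key -> value writes recorded for block tx i.
    state = dict(base_state)
    for changes in change_records:        # replay in canonical block order
        state.update(changes)
    encoded = json.dumps(state, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()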


  2. How do we determine whether two transactions are dependent?


If two transactions are not dependent but are judged to be, it will lead to unnecessary performance loss. Conversely, if both transactions rewrite the state of the same account but are executed in parallel, the final state of the account may be uncertain.


Therefore, the determination of dependency is an important issue that affects performance and can even determine whether the blockchain can work properly.


In a simple transfer transaction, we can judge whether two transactions are dependent based on the addresses of the sender and receiver. Take the following three transfer transactions as an example, A→B, C→D, and D→E.


It is easy to see that the D→E transaction depends on the result of the C→D transaction, but the A→B transaction has nothing to do with the other two transactions, so it can be executed in parallel.


This kind of analysis is true in a blockchain that only supports simple transfers, but it may not be as accurate in a Turing-complete blockchain that runs smart contracts, because we don’t know exactly what’s going on in a user-written transfer contract. Here’s what might happen.


It seems that the A→B transaction has nothing to do with the account states of C and D, but in the user's underlying implementation, A is a special account, and a fee must be deducted from account C for every transfer made through account A.


In this scenario, the three transactions are all related, so they cannot be executed in parallel. If the transactions are divided according to the previous dependency analysis method, it is bound to cause mistakes.


Can we automatically deduce the actual dependencies of a transaction from the content of the user's contract? The answer is no. As mentioned earlier, it is difficult to determine contract dependencies and the execution path through static analysis.


In FISCO BCOS, the specification of transaction dependencies is left to developers, who are more familiar with the contract content. Specifically, the mutually exclusive resources that a transaction depends on are represented by a set of strings.


FISCO BCOS exposes the interface to the developer, who defines the resources that the transaction depends on in the form of strings and informs the executor on the chain.


The executor will automatically arrange all the transactions in the block into the transaction DAG according to the transaction dependencies specified by the developer.


For example, in a simple transfer contract, the developer simply specifies that the dependency for each transfer transaction is {sender address + receiver address}.


Further, if the developer introduces another third-party address in the transfer logic, then the dependency needs to be defined as {sender address + recipient address + third-party address}.


This approach is intuitive, simple, and general, and applies to all smart contracts, but it also increases the responsibility on the developer’s shoulders.


The developer must be very careful when specifying transaction dependencies. If the dependencies are not written correctly, the consequences are unpredictable.

Parallel Framework Contract

In order for developers to use the framework of parallel contracts, FISCO BCOS has set some specifications for contract writing. The specifications are as follows:

Parallel Mutually Exclusive

Whether two transactions can be executed in parallel depends on whether they are mutually exclusive. Two transactions are mutually exclusive when the sets of contract storage variables they access intersect.


For example, in an asset transfer scenario, a transaction is a transfer operation between users. transfer(X, Y) represents the interface for transferring from user X to user Y, and mutual exclusion is defined as follows.



  • Mutually exclusive parameter: Parameter related to the “read/write” operation of the contract storage variable in the contract interface. Take transfer interface transfer(X, Y) for example. X and Y are mutually exclusive parameters.


  • Mutex: The specific mutex content extracted from a transaction according to the mutex parameters. Take the transfer interface transfer(X, Y) for example: in a transfer transaction calling this interface with the concrete parameters transfer(A, B), the mutex of the operation is [A, B]; for another transaction calling transfer(A, C), the mutex is [A, C].


Determining whether two transactions can be executed in parallel comes down to determining whether their mutexes intersect. Transactions whose mutexes do not intersect can be executed in parallel.
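The check itself is just a set intersection, as the following small sketch (with made-up mutex sets) shows:

def can_parallelize(mutex_a, mutex_b):
    # Two transactions may run in parallel only if their mutexes do not intersect.
    return set(mutex_a).isdisjoint(mutex_b)

# transfer(A, B) vs. transfer(A, C): mutexes share "A", so they conflict.
assert not can_parallelize(["A", "B"], ["A", "C"])
# transfer(A, B) vs. transfer(C, D): disjoint mutexes, so they can run in parallel.
assert can_parallelize(["A", "B"], ["C", "D"])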


FISCO-BCOS provides two ways to write parallel contracts: precompiled contracts and Solidity contracts. Only the latter is described here; the approach for precompiled contracts is similar.

Solidity Contract Parallel Framework

To write a parallel Solidity contract, simply make ParallelContract.sol the base class of the contract you want to parallelize, and call the registerParallelFunction() method to register the interfaces that can be executed in parallel.


The Parallel Contract code is as follows:

pragma solidity ^0.4.25;

// Interface of the precompiled configuration contract
contract ParallelConfigPrecompiled
{
    function registerParallelFunctionInternal(address, string, uint256) public returns (int);
    function unregisterParallelFunctionInternal(address, string) public returns (int);    
}

// Base class for parallel contracts: it must be registered, and subcontracts must implement the enableParallel/disableParallel interfaces
contract ParallelContract
{
    ParallelConfigPrecompiled precompiled = ParallelConfigPrecompiled(0x1006);
    
    function registerParallelFunction(string functionName, uint256 criticalSize) public 
    {
        precompiled.registerParallelFunctionInternal(address(this), functionName, criticalSize);
    }
    
    function unregisterParallelFunction(string functionName) public
    {
        precompiled.unregisterParallelFunctionInternal(address(this), functionName);
    }
    
    function enableParallel() public;
    function disableParallel() public;
}


The following example is a transfer contract written based on a parallel framework contract:


pragma solidity ^0.4.25;

import "./ParallelContract.sol";  // Introduce ParallelContract.sol

contract ParallelOk is ParallelContract // use ParallelContract as a base class
{
    // Contract implementation
    mapping (string => uint256) _balance;  // Global mapping
    
    // The mutually exclusive parameters from and to are the first two parameters of transfer(). Note that the framework's requirements on contract layout are quite strict, which can be inconvenient for contract authors
    function transfer(string from, string to, uint256 num) public
    {
        _balance[from] -= num; // from is a key of the global mapping and a mutually exclusive parameter
        _balance[to] += num;   // to is a key of the global mapping and a mutually exclusive parameter
    }
		
    // The mutex parameter name comes first in the argument list of set()
    function set(string name, uint256 num) public
    {
        _balance[name] = num;
    }

    function balanceOf(string name) public view returns (uint256)
    {
        return _balance[name];
    }
    
    // Register contract interfaces that can be parallel
    function enableParallel() public
    {
        // The function signature string (note: no spaces after ","); the leading arguments are the mutex parameters (mutex parameters must come first when designing the function)
        // The number 2 indicates that the first two parameters are mutex parameters; the system decodes the mutexes according to the function signature and ABI
        registerParallelFunction("transfer(string,string,uint256)", 2); // critical: string string
        registerParallelFunction("set(string,uint256)", 1); // critical: string
    } 

    // Deregister the parallel contract interface
    function disableParallel() public
    {
        unregisterParallelFunction("transfer(string,string,uint256)");
        unregisterParallelFunction("set(string,uint256)"); 
    } 
}


Determine Whether the Interface Can Be Parallelized

A parallel contract interface must satisfy:

  • No external contracts are called.
  • No other function interfaces are called.

Determine the Mutex Parameter

Before programming an interface, determine its mutually exclusive parameters: these are the parameters that govern access to mutually exclusive global variables. The rules for determining mutually exclusive parameters are as follows:


  • If the global mapping is accessed by the interface, the key of the mapping is the mutually exclusive parameter.


  • If the global array is accessed by the interface, the subscript of the array is the mutually exclusive parameter.


  • If the interface accesses global variables of a simple type, all such global variables share one mutex parameter, and the different variable names are used as the mutex objects.


For example, there are multiple global variables of simple types in the contract, and different interfaces access different global variables.


If you want different interfaces to run in parallel, you need to add a mutex parameter to the interface's parameter list indicating which global variable is modified by the call.


When calling the interface, the name of the modified global variable is passed as the mutex parameter to identify the mutex of the transaction.


For example: if setA(int x) modifies the global variable globalA, setA needs to be redefined as setA(string aflag, int x) and called as setA("globalA", 10); the variable name "globalA" indicates that the mutex of this transaction is globalA.

Determine the Parameter Type and Order

After determining mutually exclusive parameters, determine the parameter type and order according to the rules. The rules are as follows:


  • Interface parameters are limited to string, address, uint256, and int256 (more types to be supported in the future).


  • Mutually exclusive parameters must all appear in interface parameters.


  • All mutually exclusive parameters must come first in the interface's parameter list.


It can be seen that parallel transaction execution in FISCO-BCOS depends heavily on users writing contracts that follow these specifications.


If user-written contracts do not follow the specifications and the system carries out parallel execution anyway, the state roots of the ledger may become inconsistent across nodes.

Khipu

Abstract

Khipu believes it is unrealistic to expect users to identify and label, without error, the range of addresses that could create conflicts at the time of writing the contract. This is in contrast to the view of FISCO-BCOS.


Whether, where, and under what conditions a race condition will appear can only be determined with certainty by taking the current state into account.


With current contract programming languages, static analysis of the code makes it almost impossible to obtain such judgments completely correctly and without omissions.


Khipu has made a more comprehensive attempt to address this issue and has implemented a complete process for it.

General Scheme

In Khipu, each transaction in the same block starts from the world state of the previous block and is executed in parallel, recording the three race conditions described above encountered along all execution paths.


The parallel execution phase is followed by a merge phase, in which the parallel world states are merged one by one. When merging a transaction, the recorded conditions are first checked to determine whether it conflicts with the transactions already merged.


If not, it is merged directly. If so, the transaction is re-executed starting from the already merged world state.


The final merged world state is checked against the hash in the block. This is the last line of defense: if the check fails, the previous merge is abandoned and the block is executed again.

Parallelism Index

Here, Khipu introduces an index of parallelism, which refers to the proportion of transactions in a block that can directly combine results without having to be executed again.


Khipu’s observation from replaying Ethereum over several days, from the genesis block to the newest block, shows that this ratio (parallelism) reaches about 80% on average.


In general, if computing tasks can be fully parallelized, the scalability of a single chain is unlimited, because you can always add more CPU cores to a node. If this is not the case, the maximum theoretical speedup is limited by Amdahl's law:


The limit to which you can speed up the system depends on the reciprocal of the parts that cannot be parallelized. So, if you can parallelize 99%, you can speed up to 100 times. But if you can only achieve 95% parallelization, then you can only get up to 20 times faster.


Of all transactions on Ethereum, about 80% can be parallelized and 20% cannot, so Khipu's speedup limit is around 5 times.
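The bound being applied here is Amdahl's law: with parallelizable fraction p and n cores, the speedup is

S(n) = \frac{1}{(1 - p) + p/n}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{1 - p}

so p = 0.99 gives a ceiling of 100x, p = 0.95 gives 20x, and p = 0.80 gives 1/0.2 = 5x, matching the figures quoted above.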

Conflict Markers

By examining the instructions in EVM code, it was found that only a limited number of instructions create reads and writes to storage, so these reads and writes can be recorded to form read/write sets; however, static code analysis cannot guarantee that all such accesses are captured.


Therefore, each transaction is pre-executed once when processing a block. The pre-execution tells us whether transactions read from or write to the same account or storage slot, and creates a readSet and a writeSet for each transaction.


If there are 100 transactions in a block, these 100 transactions can be executed in parallel via a thread pool. Each of them starts from the same initial world state, and execution produces 100 readSets and writeSets as well as 100 new states.


When pre-execution is over, the next stage of processing begins. Ideally, if the 100 readSets and writeSets do not conflict, they can be merged directly to produce the final world state of all transactions in the block. However, transactions are rarely this ideal.


The correct way to deal with this is to compare the readSet and writeSet produced by the first transaction with those produced by the second transaction and check whether they have read or written the same account or storage.


If so, the two transactions conflict. The second transaction is then re-executed, starting from the state produced by the first transaction.


Similarly, as the merge proceeds, the conflict set keeps accumulating, and whenever subsequent transactions conflict with previous ones, they are executed sequentially, until all transactions have been executed.
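The following Python sketch captures the shape of this merge phase under simplifying assumptions; pre_execute and execute_on are hypothetical helpers that return a transaction's readSet, writeSet, and state delta, the conflict check is deliberately conservative, and none of this is Khipu's actual Scala code.

def merge_block(parent_state, txs, pre_execute, execute_on):
    # Phase 1: every tx is (conceptually) executed in parallel from the parent
    # state; shown as a plain loop here for brevity.
    results = [pre_execute(parent_state, tx) for tx in txs]

    # Phase 2: merge one by one, re-executing any transaction that conflicts
    # with what has already been merged.
    merged_state = dict(parent_state)
    merged_writes = set()
    for tx, (read_set, write_set, delta) in zip(txs, results):
        conflict = (read_set & merged_writes) or (write_set & merged_writes)
        if conflict:
            read_set, write_set, delta = execute_on(merged_state, tx)
        merged_state.update(delta)
        merged_writes |= write_set
    # The caller finally checks merged_state's hash against the block header.
    return merged_state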


Replaying transactions on the Ethereum mainnet shows that blocks with many conflicts are mostly those containing interrelated exchange transactions, which is consistent with this process.


General Process


Specific Parallel Process

Aptos

Abstract

Aptos is built on Diem's Move language and MoveVM to create a high-throughput chain that enables parallel execution. Aptos' approach is to detect dependencies automatically while remaining transparent to users and developers.


That is, transactions are not required to explicitly state which part of the state (memory location) they use.

General Scheme

Aptos uses a modified version of software transactional memory (STM) called Block-STM and implements its parallel execution engine on top of it.


Block-STM uses MVCC (multi-version concurrency control) to avoid write-write conflicts. All writes to the same location are stored together with their versions, which contain the writing transaction's ID and the number of times it has been re-executed.


When a transaction (tx) reads a memory location, it obtains from MVCC the value written to that location by the latest preceding transaction, along with the associated version, which is used to determine whether there is a read/write conflict.


In Block-STM, transactions are pre-ordered within a block and divided among processor threads for parallel execution. During parallel execution, each transaction is optimistically executed as if it had no dependencies.


The memory locations modified by each transaction are recorded. After execution, all transaction results are validated. If validation finds that a transaction accessed a memory location modified by a preceding transaction, that transaction is invalid.


Its results are then discarded and the transaction is re-executed. This process is repeated until all transactions in the block have been executed. Block-STM speeds up execution when multiple processor cores are used; the speedup depends on how interdependent the transactions are.
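The sketch below illustrates only the multi-versioned read/validate idea described above; the real Block-STM collaborative scheduler, incarnation handling, and abort logic are omitted, and all names here are hypothetical Python, not Aptos' actual code.

class MultiVersionMemory:
    def __init__(self, base):
        self.base = base          # world state from the previous block
        self.versions = {}        # location -> {tx_index: (value, incarnation)}

    def write(self, loc, tx_index, incarnation, value):
        self.versions.setdefault(loc, {})[tx_index] = (value, incarnation)

    def read(self, loc, tx_index):
        # Return (value, version): the latest write to loc by a transaction
        # that precedes tx_index in the block, or the base state if none.
        writes = self.versions.get(loc, {})
        prior = [i for i in writes if i < tx_index]
        if not prior:
            return self.base.get(loc), None
        w = max(prior)
        value, incarnation = writes[w]
        return value, (w, incarnation)

def validate(read_log, mem, tx_index):
    # A transaction is valid if every (location, version) it observed during
    # execution is still what a fresh read would return; otherwise re-execute it.
    return all(mem.read(loc, tx_index)[1] == version for loc, version in read_log)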


It can be seen that the scheme used by Aptos is roughly similar to the Khipu mentioned above, but there are some differences in implementation, which are detailed as follows:


  • Khipu uses parallel execution with sequential verification and merging for intra-block transactions, whereas Aptos parallelizes both execution and validation. Each scheme has advantages and disadvantages: Khipu is easier to implement but slightly less efficient, while Block-STM relies on synchronization and signaling across many threads, which is highly efficient but harder to implement.


  • Since Move natively supports global resource addressing, Aptos will reorder transactions, even across blocks, when doing so benefits parallel execution. Aptos claims that this scheme can not only improve parallel efficiency but also mitigate the MEV problem; however, whether this affects the user experience remains to be seen.


  • Aptos keeps the resulting write set in memory during execution for maximum execution speed and then uses it as a cache for the next block to be executed. Any repeated writes only need to be flushed to stable storage once.

Benchmark Test

After integrating Block-STM, Aptos ran a benchmark comparing sequential and parallel execution of a block of 10k transactions. The results are shown below:


The figure shows that Block-STM achieves a 16x speedup over sequential execution with 32 parallel threads, and over an 8x speedup under high contention.

Conclusion

Based on the above comparison and analysis, it can be concluded that some schemes require users to write storage according to established rules when writing contracts so that dependencies can be found by static and dynamic analysis.


Solana and Sui use similar schemes, but the user perception is different. This scheme is essentially a storage model change to obtain better analysis results.


Khipu and Aptos are user-agnostic schemes. The burden of enabling parallel execution does not fall on developers, who do not need to think about it when writing their contracts.


The virtual machine dynamically analyzes the dependency relationships before execution, thus implementing the parallel execution without dependency relationships.


This is difficult to implement, and the degree of parallelism depends to some extent on how transactions are divided among accounts. When there are many conflicting transactions, performance deteriorates significantly because of constant re-execution.


Aptos has mentioned that it will further optimize the analysis of user-written contracts in the future to detect dependencies more accurately and achieve faster execution.


Simply converting a serial scheme into a parallel one can bring a 3x to 16x improvement in transaction throughput in a public-chain environment; combined with large blocks and large gas limits, L2 throughput could be further improved, potentially by about 100x.


From an engineering perspective, concerning both implementation effort and efficiency, OlaVM will most likely adopt the Khipu scheme plus a customized storage model, which improves performance while avoiding the complexity introduced by Block-STM and facilitates better engineering optimization.


References

  1. FISCO-BCOS Github, FISCO-BCOS
  2. Khipu GitHub, GitHub - khipu-io/khipu: An enterprise blockchain platform based on Ethereum
  3. Aptos GitHub, GitHub - aptos-labs/aptos-core: Aptos is a layer 1 blockchain built to support the widespread use of blockchain through better technology and user experience.

About us

Founded in 2021 and powered by top-notch blockchain developers, Sin7y is a project incubator and blockchain technology research team that explores the most important and cutting-edge technologies, including EVM, Layer2, cross-chain, privacy computing, autonomous payment solutions, etc.


We are currently working on an EVM-compatible, fast, and scalable ZKVM called OlaVM. If you are interested in talking with us, feel free to join our TG group or email us at [email protected]