Compiler Optimizations¶
A core tenet of MP-SPDZ is to reduce the number of communication rounds by parallelizing operations. There are two mechanisms for this, called merging and CISC. Merging is the more basic mechanism, doing just round reduction and nothing else. CISC in addition uses vectorization, which reduces both the cost during compilation and the footprint. In the following, we will explain the two mechanisms using examples.
Merging¶
Merging is the original mechanism forming the basis for Keller et al. It is based on extending the common three-operand instruction design to any number of argument tuples. Consider the following high-level code:
sint(1) * sint(2)
sint(3) * sint(4)
Putting this into Programs/Source/2mul.py and running

./compile.py 2mul -a debug

results in the following content in debug-2mul-0:
# 2mul-0--0
ldsi s2, 1 # 0
ldsi s3, 2 # 1
ldsi s4, 3 # 2
ldsi s5, 4 # 3
muls 8, 1, s0, s2, s3, 1, s1, s4, s5 # 4
You can see that all the inputs loaded in the first few lines appear in the last one. The arguments are as follows (see also Compiler.instructions.muls): 8 indicates the number of arguments to follow, and every four arguments correspond to one multiplication. The first number within every four arguments is the vector size, followed by the result register and the input registers.
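To make this encoding concrete, the following plain-Python snippet (an illustration only, not part of the compiler) splits the argument list of the muls instruction above into its per-multiplication groups:

# toy decoder for the argument list of the merged muls instruction above
args = [1, 's0', 's2', 's3', 1, 's1', 's4', 's5']  # the 8 arguments after the count
for i in range(0, len(args), 4):
    size, res, x, y = args[i:i + 4]
    print('vector size %d: %s = %s * %s' % (size, res, x, y))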
On the backend side, the multiplication is implemented in the muls() member function in ../Processor/Processor.hpp. Notice that protocol.exchange() (where the communication happens) is only called once. See Low-Level Interface for more information.
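The round saving can be pictured with the following Python-style sketch; the method names are assumptions loosely modelled on the low-level interface, and the actual implementation is C++:

# sketch of a merged multiplication: all products share one communication round
def muls(protocol, pairs):
    protocol.init_mul()
    for x, y in pairs:
        protocol.prepare_mul(x, y)   # queue the local part of each product
    protocol.exchange()              # a single communication round for all pairs
    return [protocol.finalize_mul() for _ in pairs]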
Within the compiler, the optimization is executed in the Merger class in ../Compiler/allocator.py. The dependency_graph() member function builds a dependency graph for all instructions, and instructions are merged in longest_path_merge().
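The idea can be summarized by the following simplified sketch, which assigns every instruction a depth in the dependency graph and groups instructions at the same depth; the real Merger is considerably more involved (it merges along the longest path and has to respect ordering and memory accesses):

from collections import defaultdict

# highly simplified sketch of round-based merging (the helper depends_on is
# assumed to return the instructions an instruction depends on)
def merge_rounds(instructions, depends_on):
    depth = {}
    for inst in instructions:  # instructions in program order
        depth[inst] = 1 + max((depth[d] for d in depends_on(inst)), default=0)
    rounds = defaultdict(list)
    for inst in instructions:
        rounds[depth[inst]].append(inst)
    # mergeable instructions (e.g. multiplications) at the same depth
    # can share one communication round
    return [rounds[d] for d in sorted(rounds)]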
CISC¶
This mechanism takes its name from Complex Instruction Set Computer, which refers to the fact that more complex operations are treated as instructions internally before being merged and spelled out as vectorized instructions. For a simple example, consider the following high-level code:
program.use_trunc_pr = True
sfix(1) * sfix(2)
sfix(3) * sfix(4)
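As a quick aside, the constants appearing in the bytecode below come from the fixed-point encoding: an sfix value x is represented by the integer round(x * 2^f). A plain-Python sketch, assuming the default precision of f = 16 fractional bits, which matches the constants here:

f = 16
def encode(x):
    return round(x * 2 ** f)

print([encode(x) for x in (1, 2, 3, 4)])  # [65536, 131072, 196608, 262144]
# a product of two encoded values carries 2 * f fractional bits, which is why
# the integer multiplication has to be followed by a truncation by f bits
print(encode(1) * encode(2))              # 8589934592 == 2 * 2 ** (2 * f)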
The resulting file debug-2fmul-0 starts similarly to the one above:
# 2fmul-0-cisc-1
ldsi s8, 65536 # 0
ldsi s9, 131072 # 1
ldsi s10, 196608 # 2
ldsi s11, 262144 # 3
muls 8, 1, s0, s8, s9, 1, s7, s10, s11 # 4
This corresponds to the fact that fixed-point multiplication starts with an integer multiplication before truncation. The file continues as follows:
# 2fmul-0--2
concats 5, s3(2), 1, s0, 1, s7 # 5
vmovs 2, s3(2), s3(2) # 6
# 2fmul-0-update-3
jmp 11 # 7
The first instruction creates a vector of size two from the multiplication results, and the last instruction jumps over the next code block just to end up here:
# 2fmul-0-end-TruncPr(2)_47_16-5
ldint ci0, 3 # 19
stmint ci0, 8192 # 20
jmp -14 # 21
This code prepares for a function call. Functions are used for code reusability, i.e., the same code only has to be compiled once per tape and vector size. The last instruction jumps to the start of the function here (the code block jumped over above):
# 2fmul-0-begin-TruncPr(2)_47_16-4
ldarg ci1 # 8
vldi 2, c0(2), 32768 # 9
vmulci 2, c2(2), c0(2), 2147483647 # 10
vaddci 2, c0(2), c2(2), 32768 # 11
vaddm 2, s5(2), s3(2), c0(2) # 12
vtrunc_pr 2, 4, s3(2), s5(2), 47, 16 # 13
vsubsi 2, s5(2), s3(2), 1073741824 # 14
vmovs 2, s3(2), s5(2) # 15
vmovs 2, s1(2), s3(2) # 16
ldmint ci1, 8192 # 17
jmpi ci1 # 18
The actual truncation happens in the vectorized instructions starting with v. The only communication-relevant instruction is (v)trunc_pr, where the truncation by 16 bits is done via a protocol defined in the virtual machine. For example, the implementation for Rep3 is found in ../Protocols/Replicated.hpp. The other vectorized instructions are required to turn negative values into positive ones, which is a precondition for the protocol. See the protocol in Section 3.1 of Catrina and Saxena for an explanation. Finally, the last two instructions load the address to jump back to, which is here:
# 2fmul-0-call-TruncPr(2)_47_16-6
picks s0, s1(2), 0, 1 # 22
picks s0, s1(2), 1, 1 # 23
The two instructions extract the results from the vector, which makes them available individually for further computation.
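To make the handling of negative values more concrete, the following cleartext sketch mirrors the offset-then-truncate-then-subtract pattern in the function body above; the actual protocol of course operates on secret shares, and the truncation it performs is probabilistic rather than an exact shift:

f, k = 16, 47                 # parameters as in TruncPr(2)_47_16
offset = 1 << (k - 1)         # 2 ** 46, shifts signed values into the positive range

def trunc_cleartext(x):
    shifted = x + offset              # corresponds to vaddm
    truncated = shifted >> f          # corresponds to (v)trunc_pr
    return truncated - (offset >> f)  # corresponds to vsubsi (2 ** 30 = 1073741824)

prod = 65536 * 131072                 # integer product of sfix(1) and sfix(2)
print(trunc_cleartext(prod))          # 131072, i.e. sfix(2) with f = 16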
While this example saves relatively little by using the CISC functionality, this isn’t the case for other usages. For example, removing program.use_trunc_pr results in several hundred instructions, and more involved mathematical functions correspond to thousands of instructions.
Internally, the CISC functionality is implemented in function decorators in ../Compiler/instructions_base.py. Our example uses ret_cisc() on TruncPr() in ../Compiler/floatingpoint.py. This is because TruncPr() returns an sint. Other decorators are cisc() (the result is stored in the first argument, which must be an sint) and sfix_cisc() (the result and all arguments are instances of sfix).
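As a purely hypothetical sketch of how such a decorator might be applied to a user-defined function (the function name and exact usage are assumptions, not taken from the MP-SPDZ source):

from Compiler import instructions_base
from Compiler.types import sfix

# hypothetical example: sfix_cisc() would compile this function once per tape
# and vector size; result and all arguments have to be sfix instances
@instructions_base.sfix_cisc
def my_square(x):
    return x * x

a = my_square(sfix(1.5))
b = my_square(sfix(2.5))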