Compiler Optimizations¶
A core tenet of MP-SPDZ is to reduce the number of communication rounds by parallelizing operations. There are two mechanisms for this, called merging and CISC. Merging is the more basic mechanism, doing just round reduction and nothing else. CISC in addition uses vectorization, which reduces both the cost during compilation and the footprint. In the following, we will explain the two mechanisms using examples.
Merging¶
Merging is the original mechanism forming the basis for Keller et al. It is based on extending the common three-operand instruction design to any number of argument tuples. Consider the following high-level code:
sint(1) * sint(2)
sint(3) * sint(4)
Putting this into Programs/Source/2mul.py and running

./compile.py 2mul -a debug

results in the following content in debug-2mul-0:
# 2mul-0--0
ldsi s2, 1 # 0
ldsi s3, 2 # 1
ldsi s4, 3 # 2
ldsi s5, 4 # 3
muls 8, 1, s0, s2, s3, 1, s1, s4, s5 # 4
You can see that all the inputs loaded in the first few lines appear in the last one. The arguments are as follows (see also Compiler.instructions.muls): 8 indicates the number of arguments to follow, and every four arguments correspond to one multiplication. The first number within every four arguments is the vector size, followed by the result register and the input registers.
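To make this encoding concrete, the following plain-Python snippet (an illustration only, not part of the compiler) splits the argument list of the muls instruction above into its per-multiplication groups:

# toy decoder for the argument list of the merged muls instruction above
args = [1, 's0', 's2', 's3', 1, 's1', 's4', 's5']  # the 8 arguments after the count
for i in range(0, len(args), 4):
    size, res, x, y = args[i:i + 4]
    print('vector size %d: %s = %s * %s' % (size, res, x, y))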
On the backend side, the multiplication is implemented in the muls() member function in ../Processor/Processor.hpp. Notice that protocol.exchange() (where the communication happens) is only called once. See Low-Level Interface for more information.
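The round saving can be pictured with the following Python-style sketch; the method names are assumptions loosely modelled on the low-level interface, and the actual implementation is C++:

# sketch of a merged multiplication: all products share one communication round
def muls(protocol, pairs):
    protocol.init_mul()
    for x, y in pairs:
        protocol.prepare_mul(x, y)   # queue the local part of each product
    protocol.exchange()              # a single communication round for all pairs
    return [protocol.finalize_mul() for _ in pairs]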
Within the compiler, the optimization is executed in the Merger class in ../Compiler/allocator.py. The dependency_graph() member function builds a dependency graph for all instructions, and instructions are merged in longest_path_merge().
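The idea can be summarized by the following simplified sketch, which assigns every instruction a depth in the dependency graph and groups instructions at the same depth; the real Merger is considerably more involved (it merges along the longest path and has to respect ordering and memory accesses):

from collections import defaultdict

# highly simplified sketch of round-based merging (the helper depends_on is
# assumed to return the instructions an instruction depends on)
def merge_rounds(instructions, depends_on):
    depth = {}
    for inst in instructions:  # instructions in program order
        depth[inst] = 1 + max((depth[d] for d in depends_on(inst)), default=0)
    rounds = defaultdict(list)
    for inst in instructions:
        rounds[depth[inst]].append(inst)
    # mergeable instructions (e.g. multiplications) at the same depth
    # can share one communication round
    return [rounds[d] for d in sorted(rounds)]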
CISC¶
This mechanism takes its name from Complex Instruction Set Computer, which refers to the fact that more complex operations are treated as instructions internally before being merged and spelled out as vectorized instructions. For a simple example, consider the following high-level code:
program.use_trunc_pr = True
sfix(1) * sfix(2)
sfix(3) * sfix(4)
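As a quick aside, the constants appearing in the bytecode below come from the fixed-point encoding: an sfix value x is represented by the integer round(x * 2^f). A plain-Python sketch, assuming the default precision of f = 16 fractional bits, which matches the constants here:

f = 16
def encode(x):
    return round(x * 2 ** f)

print([encode(x) for x in (1, 2, 3, 4)])  # [65536, 131072, 196608, 262144]
# a product of two encoded values carries 2 * f fractional bits, which is why
# the integer multiplication has to be followed by a truncation by f bits
print(encode(1) * encode(2))              # 8589934592 == 2 * 2 ** (2 * f)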
The resulting file debug-2fmul-0 starts similarly to the one above:
# 2fmul-0-cisc-1
ldsi s8, 65536 # 0
ldsi s9, 131072 # 1
ldsi s10, 196608 # 2
ldsi s11, 262144 # 3
muls 8, 1, s0, s8, s9, 1, s7, s10, s11 # 4
This corresponds to the fact that fixed-point multiplication starts with an integer multiplication before truncation. The file continues as follows:
# 2fmul-0--2
concats 5, s3(2), 1, s0, 1, s7 # 5
vmovs 2, s3(2), s3(2) # 6
# 2fmul-0-update-3
jmp 11 # 7
The first instruction creates a vector of size two from the multiplication results, and the last instruction jumps over the next code block just to end up here:
# 2fmul-0-end-TruncPr(2)_47_16-5
ldint ci0, 3 # 19
stmint ci0, 8192 # 20
jmp -14 # 21
This code prepares for a function call. Functions are used for code reusability, i.e., the same code only has to be compiled once per tape and vector size. The last instruction jumps to the start of the function here (the code block jumped over above):
# 2fmul-0-begin-TruncPr(2)_47_16-4
ldarg ci1 # 8
vldi 2, c0(2), 32768 # 9
vmulci 2, c2(2), c0(2), 2147483647 # 10
vaddci 2, c0(2), c2(2), 32768 # 11
vaddm 2, s5(2), s3(2), c0(2) # 12
vtrunc_pr 2, 4, s3(2), s5(2), 47, 16 # 13
vsubsi 2, s5(2), s3(2), 1073741824 # 14
vmovs 2, s3(2), s5(2) # 15
vmovs 2, s1(2), s3(2) # 16
ldmint ci1, 8192 # 17
jmpi ci1 # 18
The actual truncation happens in the vectorized instructions starting with v. The only communication-relevant instruction is (v)trunc_pr, where the truncation by 16 bits is done via a protocol defined in the virtual machine. For example, the implementation for Rep3 is found in ../Protocols/Replicated.hpp. The other vectorized instructions are required to turn negative values into positive ones, which is a precondition for the protocol. See the protocol in Section 3.1 of Catrina and Saxena for an explanation. Finally, the last two instructions load the address to jump back to, which is here:
# 2fmul-0-call-TruncPr(2)_47_16-6
picks s0, s1(2), 0, 1 # 22
picks s0, s1(2), 1, 1 # 23
The two instructions extract the results from the vector, which makes them available individually for further computation.
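To make the handling of negative values more concrete, the following cleartext sketch mirrors the offset-then-truncate-then-subtract pattern in the function body above; the actual protocol of course operates on secret shares, and the truncation it performs is probabilistic rather than an exact shift:

f, k = 16, 47                 # parameters as in TruncPr(2)_47_16
offset = 1 << (k - 1)         # 2 ** 46, shifts signed values into the positive range

def trunc_cleartext(x):
    shifted = x + offset              # corresponds to vaddm
    truncated = shifted >> f          # corresponds to (v)trunc_pr
    return truncated - (offset >> f)  # corresponds to vsubsi (2 ** 30 = 1073741824)

prod = 65536 * 131072                 # integer product of sfix(1) and sfix(2)
print(trunc_cleartext(prod))          # 131072, i.e. sfix(2) with f = 16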
While this example saves relatively little by using the CISC functionality, this isn’t the case for other usages. For example, removing program.use_trunc_pr results in several hundred instructions, and more involved mathematical functions correspond to thousands of instructions.
Internally, the CISC functionality is implemented in function decorators in ../Compiler/instructions_base.py. Our example uses ret_cisc() on TruncPr() in ../Compiler/floatingpoint.py. This is because TruncPr() returns an sint. Other decorators are cisc() (the result is stored in the first argument, which must be an sint) and sfix_cisc() (the result and all arguments are instances of sfix).
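As a purely hypothetical sketch of how such a decorator might be applied to a user-defined function (the function name and exact usage are assumptions, not taken from the MP-SPDZ source):

from Compiler import instructions_base
from Compiler.types import sfix

# hypothetical example: sfix_cisc() would compile this function once per tape
# and vector size; result and all arguments have to be sfix instances
@instructions_base.sfix_cisc
def my_square(x):
    return x * x

a = my_square(sfix(1.5))
b = my_square(sfix(2.5))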