Machine Learning

The purpose of this document is to demonstrate the machine learning functionality of MP-SPDZ, a software package implementing multi-party computation, one of the most important privacy-enhancing techniques. Please see this gentle introduction for more information on multi-party computation and the installation instructions for how to install the software.

MP-SPDZ supports a number of machine learning algorithms such as logistic and linear regression, decision trees, and some common deep learning functionality. The latter includes the SGD and Adam optimizers and the following layer types: dense, 2D convolution, 2D max-pooling, and dropout.

The machine learning code only works with arithmetic machines; that is, you cannot compile it with -B.

This document explains how to input data, how to train a model, and how to use an existing model for prediction.

Data Input

It’s easiest to input data if it’s available during compilation, either centrally or per party. Another way is to only define the data size in the high-level code and put the data independently into the right files used by the virtual machine.

Integrated Data Input

If the data is available during compilation, for example as a PyTorch or numpy tensor, you can use Compiler.types.sfix.input_tensor_via() and Compiler.types.sint.input_tensor_via(). Consider the following code from breast_logistic.mpc (requiring scikit-learn):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# normalize column-wise
X /= X.max(axis=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

X_train = sfix.input_tensor_via(0, X_train)
y_train = sint.input_tensor_via(0, y_train)

This downloads the Wisconsin Breast Cancer dataset, normalizes the sample data, splits it into a training and a test set, and then converts it to the relevant MP-SPDZ data structures. Under the hood, the data is stored in Player-Data/Input-Binary-P0-0, which is where binary-encoded inputs for player 0 are read from. You therefore have to copy this file if you run the program somewhere other than where you compiled it.

MP-SPDZ also allows splitting the data input between parties, for example horizontally:

a = sfix.input_tensor_via(0, X_train[len(X_train) // 2:])
b = sfix.input_tensor_via(1, X_train[:len(X_train) // 2])
X_train = a.concat(b)

a = sint.input_tensor_via(0, y_train[len(y_train) // 2:])
b = sint.input_tensor_via(1, y_train[:len(y_train) // 2])
y_train = a.concat(b)

The concatenation creates a unified secret tensor that can be used for training over the whole dataset. Similarly, you can split a dataset vertically:

a = sfix.input_tensor_via(0, X_train[:,:X_train.shape[1] // 2])
b = sfix.input_tensor_via(1, X_train[:,X_train.shape[1] // 2:])
X_train = a.concat_columns(b)

The three approaches in this section can be run as follows:

Scripts/compile-run.py -E ring breast_logistic
Scripts/compile-run.py -E ring breast_logistic horizontal
Scripts/compile-run.py -E ring breast_logistic vertical

In the last variant, the labels are all input via party 0.

Finally, MP-SPDZ also facilitates inputting data that is only available party by party. Party 0 can run:

a = sfix.input_tensor_via(0, X_train[:,:X_train.shape[1] // 2])
b = sfix.input_tensor_via(1, shape=X_train[:,X_train.shape[1] // 2:].shape)
X_train = a.concat_columns(b)
y_train = sint.input_tensor_via(0, y_train)

while party 1 runs:

a = sfix.input_tensor_via(0, shape=X_train[:,:X_train.shape[1] // 2].shape)
b = sfix.input_tensor_via(1, X_train[:,X_train.shape[1] // 2:])
X_train = a.concat_columns(b)
y_train = sint.input_tensor_via(0, shape=y_train.shape)

Note that the respective party only accesses the shape of the data they don't input.

You can run this case by executing the following for party 0:

./compile.py breast_logistic party0
./semi-party.x 0 breast_logistic-party0

and the following for party 1 (on the same host):

./compile.py breast_logistic party1
./semi-party.x 1 breast_logistic-party1

The compilation will output a hash at the end, which has to agree between the parties. Otherwise the virtual machine will abort with an error message. To run the two parties on different hosts, use the networking options.
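For example, a sketch assuming two hosts, where <host-of-party-0> is a placeholder for the address of the first host and -h is the party executables' option for specifying the host running party 0:

./semi-party.x -h <host-of-party-0> 0 breast_logistic-party0

on the first host and

./semi-party.x -h <host-of-party-0> 1 breast_logistic-party1

on the second.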

Data preprocessing

Sometimes it’s necessary to preprocess data. We’re using the following code from torch_mnist_dense.mpc to demonstrate this:

import torchvision

ds = torchvision.datasets.MNIST(root='/tmp', train=train, download=True)
# normalize to [0,1] before input
samples = sfix.input_tensor_via(0, ds.data / 255)
labels = sint.input_tensor_via(0, ds.targets, one_hot=True)

This downloads either the training or the test set of MNIST (depending on train) and then processes it to make it usable. The sample data is normalized from 8-bit integers to the interval \([0,1]\) by dividing by 255. This is done within PyTorch for efficiency. Then, the labels are encoded as one-hot vectors because this is required for multi-class training in MP-SPDZ.
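For illustration only, the following plain numpy sketch (not MP-SPDZ code; the arrays are made up) shows what a one-hot encoding of integer labels looks like:

import numpy as np

labels = np.array([2, 0, 9])   # example labels out of 10 classes
one_hot = np.eye(10)[labels]   # one row per label, with a 1 at the label's index
print(one_hot[0])              # [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]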

Independent Data Input

The example code in keras_mnist_dense.mpc trains a dense neural network for MNIST. It starts by defining tensors to hold data:

training_samples = sfix.Tensor([60000, 28, 28])
training_labels = sint.Tensor([60000, 10])

test_samples = sfix.Tensor([10000, 28, 28])
test_labels = sint.Tensor([10000, 10])

The tensors are then filled with inputs from party 0 in the order that is used by convert.sh in the preparation code:

training_labels.input_from(0)
training_samples.input_from(0)

test_labels.input_from(0)
test_samples.input_from(0)

The virtual machine then expects the data as whitespace-separated text in Player-Data/Input-P0-0. If you use binary=True with input_from(), the input is expected in Player-Data/Input-Binary-P0-0, value by value as single-precision float or 64-bit integer in the machine byte order (most likely little-endian these days).
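As a sketch, assuming placeholder numpy arrays with the shapes above, the text input file could be produced independently of compilation as follows (only the training part is shown; the test set would follow in the same order):

import numpy as np

# hypothetical placeholder data with the shapes expected above
training_labels = np.zeros((60000, 10))
training_samples = np.zeros((60000, 28, 28))

with open('Player-Data/Input-P0-0', 'w') as f:
    for a in (training_labels, training_samples):
        f.write(' '.join(str(x) for x in a.flatten()))
        f.write(' ')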

Training

There are a number of interfaces for different algorithms.

Logistic regression with SGD

This is available via SGDLogistic. We will use breast_logistic.mpc as an example.

After inputting the data as above, you can call the following:

from Compiler import ml

log = ml.SGDLogistic(20, 2, program)
log.fit(X_train, y_train)

This trains a logistic regression model in secret for 20 epochs with mini-batches of size 2. Passing the program object as a parameter makes the model honor further command-line arguments. Most notably, you can add approx to use a three-piece approximate sigmoid function:

Scripts/compile-emulate.py breast_logistic approx

Omitting it invokes the default sigmoid function.

To check accuracy during training, you can call the following instead of fit():

log.fit_with_testing(X_train, y_train, X_test, y_test)

This outputs losses and accuracy for both the training and test set after every epoch.

You can use predict() to predict labels and predict_proba() to predict probabilities. The following outputs the correctness (0 for correct, \(\pm 1\) for incorrect) and a measure of how far off the probability estimate is:

print_ln('%s', (log.predict(X_test) - y_test.get_vector()).reveal())
print_ln('%s', (log.predict_proba(X_test) - y_test.get_vector()).reveal())

Linear regression with SGD

This is available via SGDLinear. It implements an interface similar to logistic regression. The main difference is that there is only predict() for prediction as there is no notion of labels in this case. See diabetes.mpc for an example of linear regression.
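A minimal sketch of the analogous calls, assuming SGDLinear takes the same constructor arguments as SGDLogistic and that X_train, y_train, and X_test are secret tensors prepared as in the data input section:

from Compiler import ml

lin = ml.SGDLinear(20, 2, program)  # 20 epochs, mini-batches of size 2
lin.fit(X_train, y_train)
print_ln('%s', lin.predict(X_test).reveal())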

PyTorch interface

MP-SPDZ supports importing sequential models from PyTorch using layers_from_torch() as shown in this code snippet in torch_mnist_dense.mpc:

import torch.nn as nn

net = nn.Sequential(
  nn.Flatten(),
  nn.Linear(28 * 28, 128),
  nn.ReLU(),
  nn.Linear(128, 128),
  nn.ReLU(),
  nn.Linear(128, 10)
)

from Compiler import ml

ml.set_n_threads(int(program.args[2]))

layers = ml.layers_from_torch(net, training_samples.shape, 128)

optimizer = ml.SGD(layers)
optimizer.fit(
  training_samples,
  training_labels,
  epochs=int(program.args[1]),
  batch_size=128,
  validation_data=(test_samples, test_labels),
  program=program
)

This trains a network with three dense layers on MNIST using SGD, softmax, and cross-entropy loss. The number of epochs and the number of threads are taken from the command line. For example, the following trains the network for 10 epochs using 4 threads:

Scripts/compile-emulate.py torch_mnist_dense 10 4

See Programs/Source/torch_*.mpc for further examples of the PyTorch functionality, fit() for further training options, and Adam for an alternative Optimizer.
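For instance, a sketch of the Adam alternative, assuming it accepts the same layer list and fit() arguments as SGD:

optimizer = ml.Adam(layers)
optimizer.fit(
  training_samples,
  training_labels,
  epochs=int(program.args[1]),
  batch_size=128,
  validation_data=(test_samples, test_labels),
  program=program
)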

Keras interface

The following Keras-like code sets up a model with three dense layers and then trains it:

from Compiler import ml
tf = ml

layers = [
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(10,  activation='softmax')
]

model = tf.keras.models.Sequential(layers)

optim = tf.keras.optimizers.SGD(momentum=0.9, learning_rate=0.01)

model.compile(optimizer=optim)

opt = model.fit(
  training_samples,
  training_labels,
  epochs=1,
  batch_size=128,
  validation_data=(test_samples, test_labels)
)

See Programs/Source/keras_*.mpc for further examples using the Keras interface.

Decision trees

MP-SPDZ can train decision trees for binary labels by using the algorithm by Hamada et al. The following example in breast_tree.mpc trains a tree of height five before outputting the difference between the prediction on a test set and the ground truth:

from Compiler.decision_tree import TreeClassifier
tree = TreeClassifier(max_depth=5)
tree.fit(X_train, y_train)
print_ln('%s', (tree.predict(X_test) - y_test.get_vector()).reveal())

You can run the example as follows:

Scripts/compile-emulate.py breast_tree

It is also possible to output the accuracy after every level:

tree.fit_with_testing(X_train, y_train, X_test, y_test)

You can output the trained tree as follows:

tree.output()

The format of the output follows the description of Hamada et al.

MP-SPDZ by default uses probabilistic rounding for fixed-point division, which is used to compute Gini coefficients in decision tree training. This has the effect that the tree isn’t deterministic. You can switch to deterministic rounding as follows:

sfix.round_nearest = True

breast_tree.mpc uses the following code to allow switching the rounding mode on the command line:

sfix.set_precision_from_args(program)

Nearest rounding can then be activated as follows:

Scripts/compile-emulate.py breast_tree nearest

Data preparation

MP-SPDZ currently supports continuous and binary attributes but not discrete non-binary attributes. However, such attributes can be converted as follows using the pandas library:

import pandas
from sklearn.model_selection import train_test_split
from Compiler import decision_tree

data = pandas.read_csv(
  'https://datahub.io/machine-learning/adult/r/adult.csv')

data, attr_types = decision_tree.preprocess_pandas(data)

# label is last column
X = data[:,:-1]
y = data[:,-1]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

This downloads the adult dataset and converts discrete attributes to binary using one-hot encoding. See easy_adult for the full example. attr_types has to be used to indicate the attribute types during training:

tree.fit(X_train, y_train, attr_types=attr_types)

Loading pre-trained models

It is possible to import pre-trained models from PyTorch as shown in torch_mnist_lenet_predict.mpc:

import torch
import torch.nn as nn
import torchvision

net = nn.Sequential(
  nn.Conv2d(1, 20, 5),
  nn.ReLU(),
  nn.MaxPool2d(2),
  nn.Conv2d(20, 50, 5),
  nn.ReLU(),
  nn.MaxPool2d(2),
  nn.Flatten(),
  nn.ReLU(),
  nn.Linear(800, 500),
  nn.ReLU(),
  nn.Linear(500, 10)
)

# train for a bit
transform = torchvision.transforms.Compose(
  [torchvision.transforms.ToTensor()])
ds = torchvision.datasets.MNIST(root='/tmp', transform=transform, train=True)
optimizer = torch.optim.Adam(net.parameters(), amsgrad=True)
criterion = nn.CrossEntropyLoss()

for i, data in enumerate(torch.utils.data.DataLoader(ds, batch_size=128)):
  inputs, labels = data
  optimizer.zero_grad()
  outputs = net(inputs)
  loss = criterion(outputs, labels)
  loss.backward()
  optimizer.step()

This trains LeNet on MNIST for one epoch. The model can then be input and used in MP-SPDZ:

from Compiler import ml
layers = ml.layers_from_torch(net, training_samples.shape, 128, input_via=0)
optimizer = ml.Optimizer(layers)
n_correct, loss = optimizer.reveal_correctness(test_samples, test_labels, 128, running=True)
print_ln('Secure accuracy: %s/%s', n_correct, len(test_samples))

This outputs the accuracy of the network. You can use eval() instead of reveal_correctness() to retrieve probability distributions or top guesses (the latter with top=True) for any sample data.
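For example, the following sketch (assuming eval() takes the sample tensor like reveal_correctness() above) reveals the top guess for the first test sample:

guesses = optimizer.eval(test_samples, top=True)
print_ln('top guess for first test sample: %s', guesses[0].reveal())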

Storing and loading models

Both the Keras interface and the native Optimizer class support an interface to iterate through all model parameters. The following code from torch_mnist_dense.mpc uses it to store the model on disk in secret-shared form:

for var in optimizer.trainable_variables:
  var.write_to_file()

The example code in torch_mnist_dense_predict.mpc then uses the model stored above for prediction. Much of the setup is the same, but instead of training it reads the model from disk:

optimizer = ml.Optimizer(layers)

start = 0
for var in optimizer.trainable_variables:
  start = var.read_from_file(start)

Then it runs the accuracy test:

n_correct, loss = optimizer.reveal_correctness(test_samples, test_labels, 128)
print_ln('Accuracy: %s/%s', n_correct, len(test_samples))

Using var.input_from(player) instead, the model would be input privately by a party.
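For example, the following sketch (mirroring the loop above) would have party 0 input the model privately:

for var in optimizer.trainable_variables:
  var.input_from(0)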

Exporting models

Models can be exported as follows:

optimizer.reveal_model_to_binary()

if optimizer is an instance of Compiler.ml.Optimizer. The model parameters are then stored in Player-Data/Binary-Output-P<playerno>-0. They can be imported for use in PyTorch:

import numpy
import torch

f = open('Player-Data/Binary-Output-P0-0', 'rb')

state = net.state_dict()

for name in state:
    shape = state[name].shape
    size = numpy.prod(shape)
    var = numpy.fromfile(f, 'double', count=size)
    var = var.reshape(shape)
    state[name] = torch.Tensor(var)

net.load_state_dict(state)

if net is a PyTorch module with the correct meta-parameters. This demonstrates that the parameters are stored with double precision in the canonical order.

There are a number of scripts in Scripts, namely torch_cifar_alex_import.py, torch_mnist_dense_import.py, and torch_mnist_lenet_import.py, which import the models output by torch_alex_test.mpc, torch_mnist_dense.mpc, and torch_mnist_lenet_predict.mpc. For example, you can run:

$ Scripts/compile-emulate.py torch_mnist_lenet_predict
...
Secure accuracy: 9822/10000
...
$ Scripts/torch_mnist_lenet_import.py
Test accuracy of the network: 98.22 %

The accuracy values might vary from run to run because the model is freshly trained, but the two outputs should match each other.