
Commit 89bac79

Update README.md
1 parent 47f554f commit 89bac79

File tree: 1 file changed (+64, -46 lines)


README.md

# Byzantine-resilient Decentralized Machine Learning: Codebase for ByRDiE, BRIDGE, and Variants

## Table of Contents
<!-- MarkdownTOC -->
<a name="introduction"></a>
# General Information
This repo provides implementations of **Byzantine-resilient Distributed Coordinate Descent for Decentralized Learning (ByRDiE)**, **Byzantine-resilient Decentralized Gradient Descent (BRIDGE)**, and different variants of the BRIDGE algorithm. In addition, it includes code to implement decentralized machine learning in the presence of Byzantine (malicious) nodes. The codebase in particular can be used to reproduce the decentralized learning experiments reported in the overview paper entitled "[Adversary-resilient Distributed and Decentralized Statistical Inference and Machine Learning](https://ieeexplore.ieee.org/document/9084329)" that appeared in IEEE Signal Processing Magazine in May 2020.

## License and Citation
The code in this repo is being released under the GNU General Public License v3.0; please refer to the [LICENSE](./LICENSE) file in the repo for detailed legalese pertaining to the license. In particular, if you use any part of this code then you must cite both the original papers as well as this codebase as follows:

**Paper Citations:**
- Z. Yang and W.U. Bajwa, "ByRDiE: Byzantine-resilient distributed coordinate descent for decentralized learning," IEEE Trans. Signal Inform. Proc. over Netw., vol. 5, no. 4, pp. 611-627, Dec. 2019; doi: [10.1109/TSIPN.2019.2928176](https://doi.org/10.1109/TSIPN.2019.2928176).
- Z. Yang and W.U. Bajwa, "BRIDGE: Byzantine-resilient decentralized gradient descent," arXiv preprint, Aug. 2019; [arXiv:1908.08098](https://arxiv.org/abs/1908.08098).
- Z. Yang, A. Gang, and W.U. Bajwa, "Adversary-resilient distributed and decentralized statistical inference and machine learning: An overview of recent advances under the Byzantine threat model," IEEE Signal Processing Mag., vol. 37, no. 3, pp. 146-159, May 2020; doi: [10.1109/MSP.2020.2973345](https://doi.org/10.1109/MSP.2020.2973345).

**Codebase Citation:** J. Shenouda, Z. Yang, and W.U. Bajwa, "Codebase---Adversary-resilient distributed and decentralized statistical inference and machine learning: An overview of recent advances under the Byzantine threat model," GitHub Repository, 2020; doi: [TBD](#).

## Summary of Experiments
The codebase uses implementations of ByRDiE, BRIDGE, and BRIDGE variants to generate results for Byzantine-resilient decentralized learning. The generated results correspond to experiments in which we simulate a decentralized network that trains a linear multiclass classifier on the [MNIST dataset](http://yann.lecun.com/exdb/mnist/) using a one-layer neural network that is implemented in TensorFlow. The network consists of twenty nodes, with each node assigned two thousand training samples from the MNIST dataset. Similar to the overview paper (Yang et al., 2020), the codebase provides two sets of experiments:

1. Train the neural network using Distributed Gradient Descent (DGD), ByRDiE, BRIDGE, and three variants of BRIDGE, namely, BRIDGE-Median, BRIDGE-Krum, and BRIDGE-Bulyan, with the Byzantine-resilient algorithms defending against at most two Byzantine nodes while no nodes actually undergo Byzantine failure. This is the faultless setting, and the code produces a plot similar to Figure 3(a) in the paper (Yang et al., 2020) in this case.
2. Train the neural network using the same six methods as above, with the Byzantine-resilient algorithms defending against at most two Byzantine nodes while exactly two nodes undergo Byzantine failure and communicate random values instead of the actual gradients to their neighbors. This is the faulty setting, and the code produces a plot similar to Figure 3(b) in the paper (Yang et al., 2020) in this case.

For experiments in both the faultless and the faulty settings, we ran ten Monte Carlo trials in parallel and averaged the classification accuracy across trials before plotting.
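To make the data setup concrete, the following sketch shows how a training pool can be split into disjoint shards of two thousand samples for twenty nodes, matching the setup described above. This is an illustrative stand-in, not the repo's actual code; the variable names and the fixed seed are our own.

```python
import numpy as np

num_nodes, per_node = 20, 2000   # twenty nodes, two thousand MNIST samples each
rng = np.random.default_rng(0)   # fixed seed, in the spirit of the predetermined per-trial seeds

# Shuffle the indices of a 40,000-sample training pool and give each node a disjoint shard.
perm = rng.permutation(num_nodes * per_node)
node_shards = perm.reshape(num_nodes, per_node)   # row i = sample indices held by node i
```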

## Summary of Code
The `dec_BRIDGE.py` and `dec_ByRDiE.py` scripts serve as the "driver" or "main" files, where we set up the experiments and call the necessary functions to learn the machine learning model in a decentralized manner. The actual implementations of the various screening methods (ByRDiE, BRIDGE, and variants of BRIDGE) are carried out in the `DecLearning.py` module. While these specific implementations are written for the particular case of training a one-layer neural network using TensorFlow, the core of these implementations can be easily adapted to other machine learning problems.
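As a concrete illustration of the kind of screening BRIDGE performs, here is a minimal NumPy sketch of a coordinate-wise trimmed mean. This is not the repo's actual code; the function name `bridge_screen` and its interface are ours, and `b` denotes the number of Byzantine nodes being defended against.

```python
import numpy as np

def bridge_screen(grads: np.ndarray, b: int) -> np.ndarray:
    """Coordinate-wise trimmed mean: per coordinate, drop the b largest and
    b smallest values received from neighbors, then average the rest.

    grads: (num_neighbors, dim) array of gradients received from neighbors.
    b: number of Byzantine nodes to defend against (requires 2*b < num_neighbors).
    """
    sorted_grads = np.sort(grads, axis=0)            # sort each coordinate independently
    trimmed = sorted_grads[b:grads.shape[0] - b]     # discard b extremes on each side
    return trimmed.mean(axis=0)                      # average the surviving values

# A Byzantine neighbor sending huge values gets screened out:
honest = np.ones((4, 3))             # four honest gradients, all ones
byzantine = np.full((1, 3), 1e6)     # one faulty gradient with wild values
screened = bridge_screen(np.vstack([honest, byzantine]), b=1)   # -> [1., 1., 1.]
```

The same trimming idea applies per scalar coordinate in ByRDiE's coordinate-descent updates, which is why the two methods share most of their screening machinery.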

## Computing Environment
All of our computational experiments were carried out on a Linux high-performance computing (HPC) cluster provided by the Rutgers Office of Advanced Research Computing; specifically, all of our experiments were run on:

Lenovo NextScale nx360 servers:

- 2 x 12-core Intel Xeon E5-2680 v3 "Haswell" processors
- 128 GB RAM
- 1 TB local scratch disk

However, we only allocated 4GB of RAM when submitting each of our jobs.

## Requirements and Dependencies
This code is written in Python and uses TensorFlow. To reproduce the environment with the dependencies needed to run the code in this repo, we recommend that users create a `conda` environment using the `environment.yml` YAML file provided in the repo. Assuming the conda management system is installed on the user's system, this can be done as follows:

```shell
$ conda env create -f environment.yml
```

In case users don't have conda installed on their system, they should check the `environment.yml` file for the appropriate version of Python as well as the necessary dependencies, with their respective versions, needed to run the code in this repo.
## Data
The MNIST dataset we used in our experiments can be found in the `./data` directory. The `./data/MNIST/raw` directory contains the raw MNIST data, as available from [http://yann.lecun.com/exdb/mnist/](http://yann.lecun.com/exdb/mnist/), while the `./data/MNIST_read.py` script reads the data into `numpy` arrays that are then *pickled* for use in the experiments. The pickled numpy arrays are already available in the `./data/MNIST/pickled` directory, so there is no need to rerun our script in order to perform the experiments.
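For reference, the raw MNIST files follow the IDX format; the sketch below is a simplified stand-in for the kind of parsing `./data/MNIST_read.py` performs (the function name `read_idx` is illustrative, not the script's actual API).

```python
import pickle
import struct
import numpy as np

def read_idx(data: bytes) -> np.ndarray:
    """Parse an IDX-format buffer (as used by the raw MNIST files) into a numpy array."""
    # Header: two zero bytes, a dtype code (0x08 = unsigned byte), number of dimensions.
    zero, dtype_code, ndims = struct.unpack_from(">HBB", data, 0)
    assert zero == 0 and dtype_code == 0x08, "not an unsigned-byte IDX buffer"
    dims = struct.unpack_from(">" + "I" * ndims, data, 4)   # big-endian dimension sizes
    offset = 4 + 4 * ndims                                  # data starts after the header
    return np.frombuffer(data, dtype=np.uint8, offset=offset).reshape(dims)

# The parsed arrays can then be pickled for reuse, as done for ./data/MNIST/pickled:
#   with open("images.pkl", "wb") as f:
#       pickle.dump(images, f)
```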

<a name="bridge"></a>
# BRIDGE Experiments
We performed decentralized learning using BRIDGE and some of its variants based on distributed learning screening methods, namely Median, Krum, and Bulyan. To train the one-layer neural network on MNIST with BRIDGE or its variants, run the `dec_BRIDGE.py` script. When no screening method is selected, training is done with distributed gradient descent (DGD) without screening. Each Monte Carlo trial ran in about one hundred seconds on our machines for each of the screening methods.

```shell
usage: dec_BRIDGE.py [-h] [-b BYZANTINE] [-gb GOBYZANTINE]
                     [-s {BRIDGE,Median,Krum,Bulyan}]
                     monte_trial

positional arguments:
  monte_trial           A number between 0 and 9 to indicate which Monte
                        Carlo trial to run

optional arguments:
  -h, --help            Show this help message and exit
  -b BYZANTINE, --byzantine BYZANTINE
                        Maximum number of Byzantine nodes to defend against;
                        if none then it defaults to 0
  -gb GOBYZANTINE, --goByzantine GOBYZANTINE
                        Boolean to indicate if the specified number of
                        Byzantine nodes actually send out faulty values
  -s {BRIDGE,Median,Krum,Bulyan}, --screening {BRIDGE,Median,Krum,Bulyan}
                        Screening method to use (BRIDGE, Median, Krum,
                        Bulyan); default is distributed gradient descent
                        without screening
```
## Examples
1) BRIDGE defending against at most two Byzantine nodes with no faulty nodes in the network (faultless setting).
```shell
$ python dec_BRIDGE.py 0 -b=2 -s=BRIDGE
```
2) BRIDGE defending against at most two Byzantine nodes with exactly two faulty nodes in the network (faulty setting).
```shell
$ python dec_BRIDGE.py 0 -b=2 -gb=True -s=BRIDGE
```
The user can run each of the possible screening methods ten times in parallel by varying `monte_trial` between 0 and 9, for ten independent Monte Carlo trials; each trial uses a predetermined random number generator seed so that the results are reproducible across runs.
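Assuming a Unix shell, one way to launch all ten trials in parallel is a simple background-job loop. This is a sketch, not a script from the repo; the `DRY_RUN` guard is ours for illustration and simply prints the commands, so clear it to actually launch the jobs.

```shell
# Launch the ten Monte Carlo trials in parallel, one background job per seed.
# DRY_RUN=echo prints the commands; set DRY_RUN="" to actually run them.
DRY_RUN=echo
for i in $(seq 0 9); do
  $DRY_RUN python dec_BRIDGE.py "$i" -b=2 -s=BRIDGE &   # add -gb=True for the faulty setting
done
wait   # block until every background job has finished
```

The same pattern works for `dec_ByRDiE.py`; on an HPC cluster, submitting one job per value of `monte_trial` achieves the same effect.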
<a name="byrdie"></a>
# ByRDiE Experiments
We performed decentralized learning using ByRDiE, both in the faultless setting and in the presence of actual Byzantine nodes. To train the one-layer neural network on MNIST with ByRDiE, run the `dec_ByRDiE.py` script. Each Monte Carlo trial for ByRDiE ran in about two days on our machines.
```shell
usage: dec_ByRDiE.py [-h] [-b BYZANTINE] [-gb GOBYZANTINE] monte_trial

positional arguments:
  monte_trial           A number between 0 and 9 to indicate which Monte
                        Carlo trial to run

optional arguments:
  -h, --help            Show this help message and exit
  -b BYZANTINE, --byzantine BYZANTINE
                        Maximum number of Byzantine nodes to defend against;
                        if none then it defaults to 0
  -gb GOBYZANTINE, --goByzantine GOBYZANTINE
                        Boolean to indicate if the specified number of
                        Byzantine nodes actually send out faulty values
```
## Examples
1) ByRDiE defending against at most two Byzantine nodes with no faulty nodes in the network (faultless setting).
```shell
$ python dec_ByRDiE.py 0 -b=2
```
2) ByRDiE defending against at most two Byzantine nodes with exactly two faulty nodes in the network (faulty setting).
```shell
$ python dec_ByRDiE.py 0 -b=2 -gb=True
```
The user can run ByRDiE ten times in parallel by varying `monte_trial` between 0 and 9, for ten independent Monte Carlo trials; each trial uses a predetermined random number generator seed so that the results are reproducible across runs.
<a name="plotting"></a>
# Plotting
All results generated by `dec_BRIDGE.py` and `dec_ByRDiE.py` are saved in the `./result` folder. After running ten independent trials for each Byzantine-resilient decentralized method as described above, run the `plot.py` script to generate plots similar to Figure 3 in the paper (Yang et al., 2020).

**Note:** Due to the loss of the original implementations of the decentralized Krum and Bulyan screening methods, the experiments with these screening methods will not perfectly reproduce the results found in Figure 3 of (Yang et al., 2020). Nonetheless, the results from the implementations in this codebase are consistent with the discussions and conclusions made in the paper.
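The averaging step behind those plots is conceptually simple. The sketch below uses stand-in random curves (the real accuracy curves come from `./result`, and the array shapes here are illustrative, not `plot.py`'s actual interface):

```python
import numpy as np

# Stand-in accuracy curves: 10 Monte Carlo trials x 50 recorded iterations.
# (The real curves are produced by dec_BRIDGE.py / dec_ByRDiE.py.)
rng = np.random.default_rng(0)
curves = rng.uniform(0.80, 0.95, size=(10, 50))

# Average classification accuracy across the ten trials, per iteration --
# this per-iteration average is what gets plotted against iterations.
avg_curve = curves.mean(axis=0)
```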
# Contributors
The algorithmic implementations and experiments were originally developed by the authors of the papers listed above:
- [Zhixiong Yang](https://www.linkedin.com/in/zhixiong-yang-67139152/)
- [Arpita Gang](https://www.linkedin.com/in/arpita-gang-41444930/)
- [Waheed U. Bajwa](http://www.inspirelab.us/)
The packaging, publicizing, and reproducibility of this codebase were made possible by:
- [Joseph Shenouda](https://github.com/joeshenouda)
