|
1 | | -# Byzantine Resilient Decentralized Learning |
| 1 | +# Byzantine-resilient Decentralized Machine Learning: Codebase for ByRDiE, BRIDGE, and Variants |
2 | 2 |
|
3 | 3 | ## Table of Contents |
4 | 4 | <!-- MarkdownTOC --> |
|
11 | 11 |
|
12 | 12 | <a name="introduction"></a> |
13 | 13 | # General Information |
14 | | -This repo includes implementations of **Byzantine-resilient Decentralized Gradient Descent(BRIDGE)** and its variants as well as **Byzantine-resilient Distributed Coordinate Descent for decentralized learning(ByRDiE)** to perform decentralized learning in the presence of Byzantine nodes. Specifically, this codebase impelements the decentralized learning experiment found in [Adversary-resilient Distributed and Decentralized |
15 | | -Statistical Inference and Machine Learning](https://ieeexplore.ieee.org/document/9084329) |
| 14 | +This repo provides implementations of **Byzantine-resilient Distributed Coordinate Descent for Decentralized Learning (ByRDiE)**, **Byzantine-resilient Decentralized Gradient Descent (BRIDGE)**, and different variants of the BRIDGE algorithm. In addition, it includes code to implement decentralized machine learning in the presence of Byzantine (malicious) nodes. The codebase in particular can be used to reproduce the decentralized learning experiments reported in the overview paper entitled "[Adversary-resilient Distributed and Decentralized Statistical Inference and Machine Learning](https://ieeexplore.ieee.org/document/9084329)" that appeared in IEEE Signal Processing Magazine in May 2020. |
16 | 15 |
|
17 | 16 | ## License and Citation |
18 | 17 | The code in this repo is being released under the GNU General Public License v3.0; please refer to the [LICENSE](./LICENSE) file in the repo for detailed legalese pertaining to the license. In particular, if you use any part of this code then you must cite both the original papers as well as this codebase as follows: |
19 | 18 |
|
20 | 19 | **Paper Citations:** |
21 | | -- Z. Yang, A. Gang and W. U. Bajwa, "Adversary-Resilient Distributed and Decentralized Statistical Inference and Machine Learning: An Overview of Recent Advances Under the Byzantine Threat Model," in IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 146-159, May 2020, doi: [10.1109/MSP.2020.2973345](https://doi.org/10.1109/MSP.2020.2973345). |
22 | | -- Yang, Zhixiong, and Waheed U. Bajwa. "BRIDGE: Byzantine-resilient decentralized gradient descent." arXiv preprint arXiv:1908.08098 (2019). |
23 | | -- Z. Yang and W. U. Bajwa, "ByRDiE: Byzantine-Resilient Distributed Coordinate Descent for Decentralized Learning," in IEEE Transactions on Signal and Information Processing over Networks, vol. 5, no. 4, pp. 611-627, Dec. 2019, doi: [10.1109/TSIPN.2019.2928176](https://doi.org/10.1109/TSIPN.2019.2928176). |
24 | 20 |
|
25 | | -**Codebase Citation:** J. Shenouda, Z. Yang, W. U. Bajwa, "Codebase---Adversary-Resilient Distributed and Decentralized Statistical Inference and Machine Learning: An Overview of Recent Advances Under the Byzantine Threat Model," GitHub Repository, 2020 |
| 21 | +- Z. Yang and W.U. Bajwa, "ByRDiE: Byzantine-resilient distributed coordinate descent for decentralized learning," IEEE Trans. Signal Inform. Proc. over Netw., vol. 5, no. 4, pp. 611-627, Dec. 2019; doi: [10.1109/TSIPN.2019.2928176](https://doi.org/10.1109/TSIPN.2019.2928176). |
| 22 | +- Z. Yang and W.U. Bajwa, "BRIDGE: Byzantine-resilient decentralized gradient descent," arXiv preprint, Aug. 2019; [arXiv:1908.08098](https://arxiv.org/abs/1908.08098). |
| 23 | +- Z. Yang, A. Gang, and W.U. Bajwa, "Adversary-resilient distributed and decentralized statistical inference and machine learning: An overview of recent advances under the Byzantine threat model," IEEE Signal Processing Mag., vol. 37, no. 3, pp. 146-159, May 2020; doi: [10.1109/MSP.2020.2973345](https://doi.org/10.1109/MSP.2020.2973345). |
| 24 | + |
| 25 | +**Codebase Citation:** J. Shenouda, Z. Yang, and W.U. Bajwa, "Codebase---Adversary-resilient distributed and decentralized statistical inference and machine learning: An overview of recent advances under the Byzantine threat model," GitHub Repository, 2020; doi: [TBD](#). |
26 | 26 |
|
27 | 27 | ## Summary of Experiments |
28 | | -In these experiments we simulate a decentralized network to learn a linear multiclassifier on the MNIST dataset using a one layer neural network implemented in TensorFlow. Our network consists of twenty nodes each assigned two thousand training samples from the MNIST dataset. Similar to the paper we conduct two experiments. |
| 28 | +The codebase uses implementations of ByRDiE, BRIDGE, and BRIDGE variants to generate results for Byzantine-resilient decentralized learning. The generated results correspond to experiments in which we simulate a decentralized network that trains a linear multiclass classifier on the [MNIST dataset](http://yann.lecun.com/exdb/mnist/) using a one-layer neural network that is implemented in TensorFlow. The network consists of twenty nodes, with each node assigned two thousand training samples from the MNIST dataset. Similar to the overview paper (Yang et al., 2020), the codebase provides two sets of experiments: |
29 | 29 |
|
30 | | -1. Train the neural network using Distributed Gradient Descent (DGD), ByRDiE, BRIDGE and the variants of BRIDGE namely, Median, Krum and Bulyan defending against two byzantine nodes while no nodes actually undergo byzantine failure. This is the faultless setting and will produce a plot similar to Figure 3a in the paper. |
| 30 | +1. Train the neural network using Distributed Gradient Descent (DGD), ByRDiE, BRIDGE, and three variants of BRIDGE, namely, BRIDGE--Median, BRIDGE--Krum and BRIDGE--Bulyan, with the Byzantine-resilient algorithms defending against at most two Byzantine nodes while no nodes actually undergo Byzantine failure. This is the faultless setting and the code produces a plot similar to Figure 3(a) in the paper (Yang et al., 2020) in this case. |
| 31 | +2. Train the neural network using the same six methods as above, with the Byzantine-resilient algorithms defending against at most two Byzantine nodes while exactly two nodes undergo Byzantine failure and communicate random values, instead of the actual gradient, to their neighbors. This is the faulty setting and the code produces a plot similar to Figure 3(b) in the paper (Yang et al., 2020) in this case. |
31 | 32 |
|
32 | | -2. Train the neural network using all six methods as above while defending against two byzantine nodes and indeed two nodes undergo byzantine failure communicating random values instead of the actual gradient. This is the faulty setting and will produce a plot similar to Figure 3b in the paper. |
| 33 | +For experiments in both the faultless and the faulty setting, we ran ten Monte Carlo trials in parallel and averaged the classification accuracy before plotting. |
33 | 34 |
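The averaging step described above can be illustrated with a minimal sketch. It assumes each trial yields a list of accuracy values recorded at the same checkpoints; the function name `average_trials` and the toy numbers are hypothetical and are not taken from `plot.py`.

```python
from statistics import fmean


def average_trials(accuracy_per_trial):
    """Average accuracy curves element-wise across Monte Carlo trials.

    accuracy_per_trial: list of equal-length lists, one per trial, where
    entry t is the classification accuracy recorded at checkpoint t.
    """
    lengths = {len(curve) for curve in accuracy_per_trial}
    if len(lengths) != 1:
        raise ValueError("all trials must record the same number of checkpoints")
    # zip(*...) groups the t-th checkpoint of every trial together.
    return [fmean(values) for values in zip(*accuracy_per_trial)]


# Toy data (made-up numbers): three trials, four checkpoints each.
trials = [
    [0.10, 0.50, 0.70, 0.80],
    [0.20, 0.40, 0.70, 0.90],
    [0.30, 0.60, 0.70, 0.70],
]
averaged = average_trials(trials)
print(averaged)
```

With ten trials per method, the same call would produce one averaged curve per method, which is what would then be plotted.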
|
34 | | -For each of these experiments, the faultless and faulty setting, we ran each sc ten Monte Carlo trials in parallel and averaged the classification accuracy before plotting. |
| 35 | +## Summary of Code |
| 36 | +The `dec_BRIDGE.py` and `dec_ByRDiE.py` scripts serve as the "driver" or "main" files, where we set up the experiments and call the functions needed to train the machine learning model in a decentralized manner. The actual implementations of the various screening methods (ByRDiE, BRIDGE, and variants of BRIDGE) are carried out in the `DecLearning.py` module. While these specific implementations are written for the particular case of training a single-layer neural network using TensorFlow, the core of these implementations can be easily adapted for other machine learning problems. |
35 | 37 |
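To give a feel for what a screening step does, here is a minimal sketch of a coordinate-wise trimmed mean, a screening rule that is widely used in the Byzantine-resilient learning literature. It is an illustration under stated assumptions only: the function names are hypothetical, plain Python lists stand in for TensorFlow tensors, and nothing here is taken from `DecLearning.py`.

```python
def trimmed_mean_screen(own_value, neighbor_values, b):
    """Screen one coordinate: drop the b largest and b smallest neighbor
    values, then average the survivors together with the node's own value."""
    if len(neighbor_values) <= 2 * b:
        raise ValueError("need more than 2*b neighbor values to trim b from each end")
    ordered = sorted(neighbor_values)
    kept = ordered[b:len(ordered) - b]
    return (own_value + sum(kept)) / (len(kept) + 1)


def screen_vector(own, neighbors, b):
    """Apply the scalar rule independently to every coordinate."""
    return [
        trimmed_mean_screen(own[k], [nbr[k] for nbr in neighbors], b)
        for k in range(len(own))
    ]


# Toy example: five neighbors, one of which sends wildly wrong values.
own = [1.0, 2.0]
neighbors = [
    [1.1, 2.1],
    [0.9, 1.9],
    [1.0, 2.0],
    [100.0, -100.0],  # values sent by a faulty node
    [1.2, 2.2],
]
screened = screen_vector(own, neighbors, b=1)
print(screened)
```

In this toy example the outlying values are discarded in both coordinates, so the screened vector stays close to the honest nodes' values.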
|
36 | 38 | ## Computing Environment |
37 | 39 | All of our computational experiments were carried out on a Linux high-performance computing (HPC) cluster provided by the Rutgers Office of Advanced Research Computing; specifically, all of our experiments were run on: |
38 | 40 |
|
39 | | -Lenovo NextScale nx360 servers |
| 41 | +Lenovo NextScale nx360 servers: |
40 | 42 |
|
41 | 43 | - 2 x 12-core Intel Xeon E5-2680 v3 "Haswell" processors |
42 | 44 | - 128 GB RAM |
43 | 45 | - 1 TB local scratch disk |
44 | 46 |
|
45 | | -However we only allocated 4GB of RAM when submitting each of our jobs. |
| 47 | +However, we only allocated 4GB of RAM when submitting each of our jobs. |
46 | 48 |
|
47 | 49 | ## Requirements and Dependencies |
48 | | -To reproduce the environment with necessary dependencies create a conda environment using the `environment.yml` provided. |
| 50 | +This code is written in Python and uses TensorFlow. To reproduce the environment with the dependencies needed to run the code in this repo, we recommend that users create a `conda` environment using the `environment.yml` YAML file provided in the repo. Assuming conda is installed on the user's system, this can be done as follows: |
49 | 51 |
|
50 | | -``` |
51 | | -conda env create -f environment.yml |
| 52 | +```shell |
| 53 | +$ conda env create -f environment.yml |
52 | 54 | ``` |
53 | 55 |
|
| 56 | +If users do not have conda installed on their system, they should consult the `environment.yml` file for the appropriate version of Python, as well as the dependencies and their respective versions needed to run the code in the repo. |
| 57 | + |
54 | 58 | ## Data |
55 | | -The MNIST dataset we used can be found in the `./data` directory. The `./data/MNIST/raw` directory contains the raw MNIST data and the `./data/MNIST_read.py` script reads the data into numpy arrays which are then pickled for use in the experiments. The pickled numpy arrays are already avaliable in the `./data/MNIST/pickled` directory so there is no need to rerun our script in order to perform the experiments. |
| 59 | +The MNIST dataset we used in our experiments can be found in the `./data` directory. The `./data/MNIST/raw` directory contains the raw MNIST data, as available from [http://yann.lecun.com/exdb/mnist/](http://yann.lecun.com/exdb/mnist/), while the `./data/MNIST_read.py` script reads the data into `numpy` arrays that are then *pickled* for use in the experiments. The pickled numpy arrays are already available in the `./data/MNIST/pickled` directory, so there is no need to rerun our script in order to perform the experiments. |
56 | 60 |
|
57 | 61 | <a name="bridge"></a> |
58 | 62 | # BRIDGE Experiments |
59 | | -We performed decentralized learning using BRIDGE and some of its variants based on distributed learning screening methods, namely Median, Krum and Bulyan. To train the one layer neural network on MNIST with BRIDGE or its variants run the `dec_BRIDGE.py` script. When no screening method is training is done with distributed gradient descent and no screening. Each Monte Carlo trial ran in about one hundred seconds on our machines for all screening methods. |
| 63 | +We performed decentralized learning using BRIDGE and some of its variants based on distributed learning screening methods, namely Median, Krum and Bulyan. To train the one-layer neural network on MNIST with BRIDGE or its variants, run the `dec_BRIDGE.py` script. When no screening method is selected, training is done with distributed gradient descent (DGD) without screening. Each Monte Carlo trial ran in about one hundred seconds on our machines for each of the screening methods. |
60 | 64 |
|
61 | | -``` |
| 65 | +```shell |
62 | 66 | usage: dec_BRIDGE.py [-h] [-b BYZANTINE] [-gb GOBYZANTINE] |
63 | 67 | [-s {BRIDGE,Median,Krum,Bulyan}] |
64 | 68 | monte_trial |
65 | 69 |
|
66 | 70 | positional arguments: |
67 | | - monte_trial Specify which monte carlo trial to run |
| 71 | + monte_trial A number between 0 and 9 to indicate which |
| 72 | + Monte Carlo trial to run |
68 | 73 |
|
69 | 74 | optional arguments: |
70 | | - -h, --help show this help message and exit |
| 75 | + -h, --help Show this help message and exit |
71 | 76 | -b BYZANTINE, --byzantine BYZANTINE |
72 | | - Number of Byzantine nodes to defend against, if none |
73 | | - defaults to 0 |
| 77 | + Maximum number of Byzantine nodes to defend |
| 78 | + against; if none then it defaults to 0 |
74 | 79 | -gb GOBYZANTINE, --goByzantine GOBYZANTINE |
75 | 80 | Boolean to indicate if the specified number of |
76 | 81 | Byzantine nodes actually send out faulty values |
77 | 82 | -s {BRIDGE,Median,Krum,Bulyan}, --screening {BRIDGE,Median,Krum,Bulyan} |
78 | | - Screening method to use (BRIDGE,Median, Krum, Bulyan), |
79 | | - default no screening is done regular gradient descent |
| 83 | + Screening method to use (BRIDGE,Median,Krum,Bulyan); |
| 84 | + default is distributed gradient descent without screening |
80 | 85 | ``` |
81 | 86 |
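For readers adapting the driver script, the command-line interface shown above could be reproduced with `argparse` roughly as follows. This is a sketch reconstructed from the usage text only, not the parser in `dec_BRIDGE.py`; the `str_to_bool` helper and the `False` default for `-gb` are assumptions.

```python
import argparse


def str_to_bool(text):
    """Parse 'True'/'False'-style strings. Plain type=bool would treat any
    non-empty string, including 'False', as True."""
    lowered = text.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {text!r}")


def build_parser():
    parser = argparse.ArgumentParser(prog="dec_BRIDGE.py")
    parser.add_argument("monte_trial", type=int,
                        help="A number between 0 and 9 to indicate which "
                             "Monte Carlo trial to run")
    parser.add_argument("-b", "--byzantine", type=int, default=0,
                        help="Maximum number of Byzantine nodes to defend against")
    # Assumed default: no node actually goes Byzantine unless requested.
    parser.add_argument("-gb", "--goByzantine", type=str_to_bool, default=False,
                        help="Whether the Byzantine nodes actually send faulty values")
    parser.add_argument("-s", "--screening", default=None,
                        choices=["BRIDGE", "Median", "Krum", "Bulyan"],
                        help="Screening method; omit for DGD without screening")
    return parser


# Parse the same arguments as the faulty-setting invocation of BRIDGE.
args = build_parser().parse_args(["0", "-b=2", "-gb=True", "-s=BRIDGE"])
print(args)
```

The explicit boolean parser is a deliberate choice: it lets `-gb=False` mean what it says, which `type=bool` would not.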
|
82 | | -**Example:** BRIDGE defending against two Byzantine nodes with no faulty nodes (faultless). |
| 87 | +## Examples |
83 | 88 |
|
84 | | -`python dec_BRIDGE.py 0 -b=2 -s=BRIDGE` |
| 89 | +1) BRIDGE defending against at most two Byzantine nodes with no faulty nodes in the network (faultless setting). |
85 | 90 |
|
86 | | -With two faulty nodes (faulty) |
| 91 | +```shell |
| 92 | +$ python dec_BRIDGE.py 0 -b=2 -s=BRIDGE |
| 93 | +``` |
| 94 | +2) BRIDGE defending against at most two Byzantine nodes with exactly two faulty nodes in the network (faulty setting). |
| 95 | +
|
| 96 | +```shell |
| 97 | +$ python dec_BRIDGE.py 0 -b=2 -gb=True -s=BRIDGE |
| 98 | +``` |
87 | 99 |
|
88 | | -`python dec_BRIDGE.py 0 -b=2 -gb=True -s=BRIDGE` |
| 100 | +The user can run each of the possible screening methods ten times in parallel by varying `monte_trial` between 0 and 9, which gives ten independent Monte Carlo trials. Each trial uses a predetermined random number generator seed so that its results can be reproduced in every run. |
89 | 101 |
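One way to script the ten trials is sketched below. It builds the command lines documented above and, optionally, launches them as parallel subprocesses. The helper names `trial_commands` and `launch_all` are hypothetical, and on an HPC cluster a job scheduler would normally replace `launch_all`.

```python
import subprocess
import sys


def trial_commands(script, screening=None, byzantine=2, go_byzantine=False, trials=10):
    """Build one command line per Monte Carlo trial (trial indices 0..trials-1)."""
    commands = []
    for trial in range(trials):
        cmd = [sys.executable, script, str(trial), f"-b={byzantine}"]
        if go_byzantine:
            cmd.append("-gb=True")
        if screening is not None:
            cmd.append(f"-s={screening}")
        commands.append(cmd)
    return commands


def launch_all(commands):
    """Start every command as its own process and wait for all to finish."""
    processes = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in processes]


# Build (but do not run) the commands for the faulty setting with BRIDGE.
commands = trial_commands("dec_BRIDGE.py", screening="BRIDGE", go_byzantine=True)
for cmd in commands:
    print(" ".join(cmd[1:]))
```

Calling `launch_all(commands)` from the repo root would start all ten trials at once, so it is only sensible on a machine with enough cores and memory for ten concurrent runs.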
|
90 | | -Run each of the possible screening methods ten times in parallel by varying `monte_trial` between 0 and 9 for ten independent Monte Carlo trials. |
91 | 102 | <a name="byrdie"></a> |
92 | 103 | # ByRDiE Experiments |
93 | | -We performed decentralized learning using ByRDiE, both in the faultless setting and in the presence of Byzantine nodes. To train the one layer neural network on MNIST with ByRDiE run the `dec_ByRDiE.py` script. Each Monte Carlo trial for ByRDiE ran in about two days on our machines. |
| 104 | +We performed decentralized learning using ByRDiE, both in the faultless setting and in the presence of actual Byzantine nodes. To train the one-layer neural network on MNIST with ByRDiE, run the `dec_ByRDiE.py` script. Each Monte Carlo trial for ByRDiE ran in about two days on our machines. |
94 | 105 |
|
95 | | -``` |
| 106 | +```shell |
96 | 107 | usage: dec_ByRDiE.py [-h] [-b BYZANTINE] [-gb GOBYZANTINE] monte_trial |
97 | 108 |
|
98 | 109 | positional arguments: |
99 | | - monte_trial Specify which monte carlo trial to run |
| 110 | + monte_trial A number between 0 and 9 to indicate which |
| 111 | + Monte Carlo trial to run |
100 | 112 |
|
101 | 113 | optional arguments: |
102 | | - -h, --help show this help message and exit |
| 114 | + -h, --help Show this help message and exit |
103 | 115 | -b BYZANTINE, --byzantine BYZANTINE |
104 | | - Number of Byzantine nodes to defend against, if none |
105 | | - defaults to 0 |
| 116 | + Maximum number of Byzantine nodes to defend |
| 117 | + against; if none then it defaults to 0 |
106 | 118 | -gb GOBYZANTINE, --goByzantine GOBYZANTINE |
107 | 119 | Boolean to indicate if the specified number of |
108 | 120 | Byzantine nodes actually send out faulty values |
109 | 121 | ``` |
110 | | -**Example:** ByRDiE defending against two byzantine nodes with no faulty nodes |
111 | | -`python dec_ByRDiE.py 0 -b=2` |
112 | 122 |
|
113 | | -with two faulty nodes |
| 123 | +## Examples |
| 124 | +1) ByRDiE defending against at most two Byzantine nodes with no faulty nodes in the network (faultless setting). |
| 125 | +
|
| 126 | +```shell |
| 127 | +$ python dec_ByRDiE.py 0 -b=2 |
| 128 | +``` |
| 129 | +
|
| 130 | +2) ByRDiE defending against at most two Byzantine nodes with exactly two faulty nodes in the network (faulty setting). |
114 | 131 |
|
115 | | -`python dec_ByRDiE.py 0 -b=2 -gb=True` |
| 132 | +```shell |
| 133 | +$ python dec_ByRDiE.py 0 -b=2 -gb=True |
| 134 | +``` |
116 | 135 |
|
117 | | -Run `dec_ByRDiE.py` ten times in parallel by varying `monte_trial` between 0 and 9 for ten independent Monte Carlo trials. |
| 136 | +The user can run ByRDiE ten times in parallel by varying `monte_trial` between 0 and 9, which gives ten independent Monte Carlo trials. Each trial uses a predetermined random number generator seed so that its results can be reproduced in every run. |
118 | 137 |
|
119 | 138 | <a name="plotting"></a> |
120 | 139 | # Plotting |
| 140 | +All results generated by `dec_BRIDGE.py` and `dec_ByRDiE.py` are saved in the `./result` folder. After running ten independent trials for each Byzantine-resilient decentralized method as described above, run the `plot.py` script to generate plots similar to Figure 3 in the paper (Yang et al., 2020). |
121 | 141 |
|
122 | | -All results get saved in `./result` folder, after running ten independent trials for each screening method as described above run the `plot.py` script to generate the plots similar to Figure 3 in the paper. |
123 | | - |
124 | | -**Note:** Due to a loss in the original implementation of the decentralized Krum and Bulyan screening methods the experiments with these screening methods will not perfectly reproduce the results found in Figure 3 of the paper. Nonetheless the results from the implementations in this codebase are consistent with the discussions and conclusions made in the paper. |
| 142 | +**Note:** Due to a loss in the original implementation of the decentralized Krum and Bulyan screening methods, the experiments with these screening methods will not perfectly reproduce the results found in Figure 3 of (Yang et al., 2020). Nonetheless, the results from the implementations in this codebase are consistent with the discussions and conclusions made in the paper. |
125 | 143 |
|
126 | 144 | # Contributors |
127 | | -The original implementation was provided by the author of the paper: |
| 145 | +The algorithmic implementations and experiments were originally developed by the authors of the papers listed above: |
128 | 146 |
|
129 | 147 | - [Zhixiong Yang](https://www.linkedin.com/in/zhixiong-yang-67139152/) |
130 | 148 | - [Arpita Gang](https://www.linkedin.com/in/arpita-gang-41444930/) |
131 | 149 | - [Waheed U. Bajwa](http://www.inspirelab.us/) |
132 | 150 |
|
133 | | -The publicization and reproducibility of the code was made possible by: |
| 151 | +The reproducibility and public release of this codebase were made possible by: |
134 | 152 |
|
135 | 153 | - [Joseph Shenouda](https://github.com/joeshenouda) |