This repository documents my journey in learning about deep learning parallelism, including Distributed Data Parallel (DDP), tensor parallelism, pipeline parallelism, and more. I have gathered a variety of materials from sources such as CSDN, PyTorch tutorials, and GitHub repositories. I am thrilled to share this resource, and I hope it will be of help to you, even if just a little.
Here is a brief summary of the repository:
| experiment | short description | frameworks / APIs |
|---|---|---|
| Alexnet_deepspeed_parallelism (One Node) | different partitioning strategies in DeepSpeed | pytorch, deepspeed |
| Alexnet_pytorch_UnbanlancedDDP_parallelism | hand-written DDP gradient communication with different (unbalanced) partitions | pytorch |
| ResNet_parameter_server_parallelism | parameter server using torch.rpc (sketch below) | pytorch, rpc |
| Resnet50_OneNode_Pipline_parallelism | pipeline parallelism using torch.rpc | pytorch, rpc |
| ResNet152_multinode_DDP_parallelism | multinode DDP | pytorch |
| ring_allreduce | allreduce over a ring topology (sketch below) | pytorch |
| Rnn_parameter_server_parallelism | parameter server using torch.rpc | pytorch, rpc |
| Tonymodel_multinode_DDP_torchrun_parallelism | PyTorch DDP launched with torchrun (sketch below) | pytorch |
| TonyModel_tensor_parallelism | tensor parallelism using DTensor (sketch below) | pytorch, DTensor |
| Transformer_data_pipeline_parallelism | data parallelism integrated with pipeline parallelism | pytorch, rpc |
| Transformer_pipeline_parallelism | pipeline parallelism for a Transformer across two nodes | pytorch, rpc |
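
To give a flavour of the techniques in the table, here are a few minimal sketches. They are illustrations under stated assumptions, not the repository's actual code. First, the parameter-server pattern with torch.rpc: trainer processes pull parameters from, and push gradients to, a process that owns the weights. The process names (`ps`, `trainerN`), the toy weight, and the fake gradient are all placeholders.

```python
# A minimal parameter-server sketch over torch.rpc. The names "ps"/"trainerN",
# the toy weight, and the fake gradient are placeholders, not the repo's code.
import os
import torch
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp

_weight = torch.zeros(4)            # lives on (and is only mutated by) the "ps" rank

def get_weight():
    return _weight

def apply_gradient(grad, lr=0.1):
    _weight.sub_(lr * grad)         # plain SGD step, executed on the server
    return _weight.clone()

def run(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    name = "ps" if rank == 0 else f"trainer{rank}"
    rpc.init_rpc(name, rank=rank, world_size=world_size)
    if rank != 0:
        w = rpc.rpc_sync("ps", get_weight)      # pull current parameters
        fake_grad = torch.ones_like(w)          # stand-in for a real gradient
        w = rpc.rpc_sync("ps", apply_gradient, args=(fake_grad,))
        print(f"{name} sees updated weight {w}")
    rpc.shutdown()                  # blocks until outstanding RPCs drain

if __name__ == "__main__":
    mp.spawn(run, args=(3,), nprocs=3)
```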
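
Next, multinode DDP as launched by torchrun. A minimal sketch assuming a toy linear model and random data; on real GPU nodes you would use `backend="nccl"` and move tensors to the local device.

```python
# Minimal DDP training loop, meant to be launched with torchrun. The toy
# model, data, and hyperparameters are placeholders.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT.
    dist.init_process_group(backend="gloo")   # "nccl" for GPU training

    model = nn.Linear(10, 1)                  # stand-in for the real network
    ddp_model = DDP(model)                    # allreduces gradients across ranks
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for _ in range(5):
        optimizer.zero_grad()
        inputs, targets = torch.randn(20, 10), torch.randn(20, 1)  # each rank loads its own shard
        loss = loss_fn(ddp_model(inputs), targets)
        loss.backward()                       # gradient allreduce overlaps with backward
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched on each node with something like `torchrun --nnodes=2 --nproc_per_node=1 --rdzv_backend=c10d --rdzv_endpoint=<host>:29500 train.py` (host, port, and file name are whatever your setup uses).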
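
The ring_allreduce experiment is easiest to grasp as the classic 2·(N−1)-step algorithm: a reduce-scatter pass followed by an allgather pass around the ring. Below is a single-process simulation for intuition only; a real implementation would replace the inner loops with point-to-point send/recv between ranks.

```python
# Toy single-process simulation of ring allreduce; `tensors[r]` plays the
# role of rank r's local buffer. Assumes tensor length divisible by the
# number of ranks. Illustration only, not the repo's code.
import torch

def ring_allreduce(tensors):
    """Sum the tensors as N ranks on a ring would, in 2*(N-1) steps."""
    n = len(tensors)
    chunks = [list(t.chunk(n)) for t in tensors]  # each "rank" splits into n chunks

    # Phase 1: reduce-scatter. In step s, rank r receives chunk (r-s-1) mod n
    # from its left neighbour and adds it to its own copy. Afterwards rank r
    # owns the fully reduced chunk (r+1) mod n.
    for s in range(n - 1):
        for r in range(n):
            idx = (r - s - 1) % n
            chunks[r][idx] = chunks[r][idx] + chunks[(r - 1) % n][idx]

    # Phase 2: allgather. The fully reduced chunks travel once more around
    # the ring, so every rank ends up holding every reduced chunk.
    for s in range(n - 1):
        for r in range(n):
            idx = (r - s) % n
            chunks[r][idx] = chunks[(r - 1) % n][idx]

    return [torch.cat(c) for c in chunks]

# Sanity check: four "ranks", each holding an 8-element tensor.
ranks = [torch.randn(8) for _ in range(4)]
expected = torch.stack(ranks).sum(dim=0)
for result in ring_allreduce(ranks):
    assert torch.allclose(result, expected)
```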
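
Finally, DTensor-based tensor parallelism. This sketch assumes the experimental `torch.distributed._tensor` module (the import path has moved between PyTorch releases, so treat it as a sketch, not the repo's exact usage); it simply shards one tensor row-wise across ranks.

```python
# DTensor sketch: shard one weight matrix row-wise across ranks. Assumes the
# experimental torch.distributed._tensor API; module paths differ by version.
import torch
import torch.distributed as dist
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

dist.init_process_group(backend="gloo")
mesh = DeviceMesh("cpu", list(range(dist.get_world_size())))

weight = torch.randn(8, 8)                             # the full (logical) tensor
sharded = distribute_tensor(weight, mesh, [Shard(0)])  # split dim 0 across ranks

# Each rank only materialises its own shard.
print(f"rank {dist.get_rank()} holds {sharded.to_local().shape}")
dist.destroy_process_group()
```

Run with e.g. `torchrun --nproc_per_node=2 dtensor_demo.py`; with two ranks, each holds a 4×8 shard of the 8×8 tensor.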