MPI Comm cleanup update #124
Conversation
…on rule: removing obsolete files
…gging output still to be removed
…ved unnecessary debugging output
… HPC benchmark with this connection rule
… remote connection creation time
…rce nodes are actually used in the connections
…on of remote connections
… fast building of the maps when the source neurons of a remote connection command are in a sequence of contiguous integers
…rce neurons of a remote connection command are in a sequence of contiguous integers in target host (RemoteConnectSource)
…n the source neurons of a remote connection command are in a sequence of contiguous integers in source host (RemoteConnectTarget)
…en the source neurons of a remote connection command are in a sequence of contiguous integers in source host (RemoteConnectTarget)
…GPU memory allocation, with additional timers for remote connection creation
…onnectDistributedFixedIndegree
… bit-packing compression
…ber of MPI processes with the command SetNHosts(n_hosts) (usage sketch after this list)
… should be memorized
…rs needed only for specific choices of the algorithm
…allocating arrays useful for output spike buffer only using the number of local nodes rather than local+image nodes
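One of the commits above introduces a SetNHosts(n_hosts) command to fix the number of MPI processes. The following is a minimal, untested usage sketch: only SetNHosts(n_hosts) itself is taken from the commit message, while the module name nestgpu and the surrounding Create/Connect/Simulate calls are assumptions based on the existing NEST GPU Python interface.

    # Minimal sketch, not from the repository: only SetNHosts(n_hosts) is named
    # in the commit message; the remaining calls are assumptions that follow the
    # usual NEST GPU Python interface.
    import nestgpu as ngpu

    n_hosts = 2                        # number of MPI processes the job is launched with
    ngpu.SetNHosts(n_hosts)            # declare the number of hosts before building the network

    neurons = ngpu.Create("iaf_psc_exp", 1000)   # local population on this host
    ngpu.Connect(neurons, neurons,
                 {"rule": "fixed_indegree", "indegree": 10},
                 {"weight": 0.1, "delay": 1.0})

    ngpu.Simulate(100.0)               # simulate 100 ms

Such a script would then presumably be launched through an MPI launcher, e.g. mpirun -np 2, assuming the new communication scheme still relies on one.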
Thanks for this tremendous work! Could you please also remove the raster plots and the data, unless they are needed for tests? And the Python examples and tests should be revised.
jhnnsnk left a comment
Thank you for this tremendous work! 👍
Ping @lucapontisso @gmtiddia
Thanks Jose for the huge work!
I've just had a look at the code and it is OK for me. Thanks a lot for the huge work!
Thank you for all this work, Jose! However, I have to ask you one more thing, for practical reasons. The Python tests, previously in the folder python/test and usually launched with the bash scripts test_all.sh and test_mpi.sh, do not work any more because the data files for the test folder have been moved. I know that this is a temporary solution, since as soon as possible they should be handled in a similar way to the NEST (CPU) tests; however, until we have that solution it would be better to keep them working in the old way, because they are used after every change to the code to check that everything is working properly. For the same reason I ask you to put back all the files that were in the folder python/hpc_benchmark/test/, i.e. in the subfolders data_check, data_check_dfi, test_hpc_benchmark_hg, test_hpc_benchmark_p2p, test_hpc_benchmark_wg, test_hpc_benchmark_wg_dfi, and the files in the Potjans_2014 folder.
Dear @golosio, I have gone over all the tests and I indeed noticed some broken tests; however, this is not due to the re-organization but partly because of incorrect test design and partly because of outdated tests. In particular, for the different tests of the […]. As for other tests, most of the tests found in the […]. As for the […]. Finally, the files in the […]. If you have issues running any of the tests that I confirmed to work for me, I would be happy to discuss and possibly debug with you; however, for the moment I did not find any other issue aside from those stated above, and in any case I would suggest fixing the outdated tests in another pull request.
Thanks @JoseJVS, I confirm that all the tests are working for me as well. The only issue I found is that after upgrading the CUDA toolkit to version 13, I receive the error: […]. I use the Potjans_2014 script for a quick check that code changes do not affect the simulation time on the same hardware. Concerning the hpc_benchmark tests, I agree, but until we prepare clean tests they are the fastest way to check that the different MPI communication schemes are working consistently, and test_hpc_benchmark_wg_dfi was the only test we had for the distributed fixed_indegree connection rule. But clearly I can keep a script out of the repo for this purpose, if you think that is better.
Dear @golosio, I have added the […]. As for the raster plot comparisons, I agree that these can be used as a temporary solution until we develop a proper verification pipeline. As for the Potjans_2014, I agree that having a quick performance reference is handy when developing changes; however, that performance data is only applicable to your development workstation and cannot serve as a reference for anyone else.
Thank you again @JoseJVS. The check_kernel_params.sh and hpc_benchmark_wg_check_kernel_params.py scripts were temporary checks and are now obsolete; you did well to remove them. We have to develop some checks for the kernel parameter configuration, but I would do that in a separate pull request. About the Potjans_2014, I think I did not explain myself well: I do not need the data folder with the raster plots or other data, I would only need the scripts to run the simulation. But if you think it would be better to keep the scripts in a separate repository, that would be fine. By the way, the folders Potjans_2014_hc and Potjans_2014_s are obsolete and should be removed.
Thanks for the quick back-and-forth, @golosio!
Dear all, I get the error: GPUassert: invalid device symbol /leonardo/home/userexternal/nest-gpu/src/input_spike_buffer.h 857 (cuda/12.2). With the fork on Bruno's repo (mpi_comm branch), everything is fine with the same environment.
Dear @lucapontisso, when you mention Bruno's mpi_comm branch, do you mean […]? The current clean-up branch in this pull request is branched off of mpi_comm_dev2.
Dear @JoseJVS, the (https://github.com/golosio/nest-gpu/tree/mpi_comm) branch works for me on Leonardo. For reference, the full error message is: GPUassert: invalid device symbol /leonardo/home/userexternal/lpontis1/NEST-GPU-TEST/nest-gpu-bruno-fork/nest-gpu/src/input_spike_buffer.h 857
I understood the problem.
Ah okay, interesting to note about the sm information, @lucapontisso.
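For readers who run into the same GPUassert: "invalid device symbol" usually means the device code in the binary was not compiled for the SM architecture of the GPU it runs on. The snippet below is a small diagnostic sketch, not part of the repository, that prints each GPU's compute capability so the build can target the matching sm_XX value; it assumes an NVIDIA driver recent enough for nvidia-smi to support the compute_cap query field.

    # Diagnostic sketch (assumption: a recent nvidia-smi that supports the
    # compute_cap query field). Prints each GPU's name and compute capability,
    # e.g. "NVIDIA A100-SXM4-64GB, 8.0", which corresponds to compiling device
    # code for sm_80.
    import subprocess

    def gpu_architectures():
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
        return [line.strip() for line in out.stdout.splitlines() if line.strip()]

    if __name__ == "__main__":
        for gpu in gpu_architectures():
            print(gpu)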
Done! |
Thanks for all the reviews!
This is an updated version of the latest MPI comm branch from Bruno's fork.
I have cleaned up unrelated benchmarking scripts as well as old Automake-related scripts and deprecated C++ examples.
I also bumped the version to 2.0.0.