@JoseJVS (Collaborator) commented Oct 15, 2025

This is an updated version of the latest MPI comm branch from Bruno's fork.
I have cleaned up unrelated benchmarking scripts as well as old automake-related scripts and deprecated C++ examples.
I also bumped the version to 2.0.0.

golosio added 30 commits March 4, 2025 01:05
…rce nodes are actually used in the connections
… fast building of the maps when the source neurons of a remote connection command are in a sequence of contiguous integers
…rce neurons of a remote connection command are in a sequence of contiguous integers in target host (RemoteConnectSource)
…n the source neurons of a remote connection command are in a sequence of contiguous integers in source host (RemoteConnectTarget)
…en the source neurons of a remote connection command are in a sequence of contiguous integers in source host (RemoteConnectTarget)
…GPU memory allocation, with additional timers for remote connection creation
…ber of MPI processes with the command SetNHosts(n_hosts)
…rs needed only for specific choices of the algorithm
…allocating arrays useful for output spike buffer only using the number of local nodes rather than local+image nodes
@jhnnsnk (Collaborator) left a comment

Thanks for this tremendous work! Could you please also remove the raster plots and the data, unless they are needed for tests? The Python examples and tests should also be revised.

@jhnnsnk (Collaborator) left a comment

Thank you for this tremendous work! 👍

@JoseJVS (Collaborator, Author) commented Oct 31, 2025

Ping @lucapontisso @gmtiddia

@lucapontisso (Collaborator)

> Ping @lucapontisso @gmtiddia

Thanks, Jose, for the huge work!
Sorry for the late reply. I will conclude the review over the weekend.

@gmtiddia (Collaborator) commented Oct 31, 2025

I've just had a look at the code; it is OK for me. Thanks a lot for the huge work!

@golosio (Collaborator) commented Oct 31, 2025

Thank you for all this work, Jose! However, I have to ask you one more thing, for practical reasons. The Python tests, previously in the folder python/test and usually launched with the bash scripts test_all.sh and test_mpi.sh, no longer work, because the data files for the test folder have been moved. I know this is a temporary solution, because as soon as possible they should be handled in a similar way to the NEST (CPU) tests; however, until we have that solution, it would be better to keep them working in the old way, because they are used after every change to the code to check that everything is working properly. For the same reason, I ask you to put back all the files that were in the folder python/hpc_benchmark/test/, i.e. in the subfolders data_check, data_check_dfi, test_hpc_benchmark_hg, test_hpc_benchmark_p2p, test_hpc_benchmark_wg, and test_hpc_benchmark_wg_dfi, as well as the files in the Potjans_2014 folder.

@JoseJVS (Collaborator, Author) commented Nov 3, 2025

Dear @golosio, I have gone over all the tests and indeed noticed some broken ones; however, this is not due to the reorganization, but partly to incorrect test design and partly to outdated tests.

In particular, for the various hpc_benchmark tests, a binary comparison of the produced raster plots will show differences when different matplotlib versions are used.
I have confirmed this in both the mpi_comm_dev2 branch of your repo and this pull request's mpi_comm_clean branch; the firing rate comparison, however, remains correct.
To summarize: all check.sh scripts that compare against the raster plots stored in the data* directories fail due to differing matplotlib versions (at least on my setup), those that compare against raster plots generated by the p2p test succeed (i.e., all raster plots generated with the same setup), and all check2.sh scripts succeed (all firing rate comparisons).
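For illustration only (this is not the actual check2.sh script, whose details may differ), a firing-rate comparison of that kind could look roughly like the sketch below; the file names and tolerances are placeholders.

```python
# Minimal sketch of a rate-based check (placeholder file names and tolerances):
# compare firing rates from a reference run and the current run instead of
# binary-comparing raster-plot images, which depend on the matplotlib version.
import numpy as np

reference = np.loadtxt("data_check/firing_rates.dat")  # hypothetical reference rates
current = np.loadtxt("test_run/firing_rates.dat")      # hypothetical rates from this run

# A small tolerance absorbs harmless numerical differences between runs
# while still catching real changes in network activity.
if np.allclose(reference, current, rtol=1e-3, atol=1e-6):
    print("Firing rate check passed")
else:
    raise SystemExit("Firing rate check FAILED")
```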
There is also a check_kernel_params.sh bash script that attempts to run a nonexistent Python file, but this issue is already present in the mpi_comm_dev2 branch of your repo.

As for the other tests, most of those found in the new_features directory (previously test_new_features) are outdated and no longer execute properly (I believe this is due to updates to the command for setting neuron parameters).

As for the test_all.sh scripts you mention, previously found in the test directory and now in static_tests, I have successfully run all tests after reorganizing the mpi_comm_clean branch.
I did remove one file I should not have; I restored it in the commits I recently pushed and confirmed that the rest of the test_* bash scripts work correctly.

Finally, the files in the Potjans_2014 directory contain only performance information from different timers and, from what I found, are not used in any test.
In any case, performance data should not be used for verification tests: it is highly volatile, depends on the hardware used, and is, for all practical purposes, unique to the specific run.
For the purpose of storing performance information, we have published our benchmarking data on Zenodo.

If you have issues running any of the tests that I confirmed to work for me, I would be happy to discuss and possibly debug with you. For the moment, however, I did not find any issues other than those stated above, and I would in any case suggest fixing the outdated tests in another pull request.

@golosio (Collaborator) commented Nov 3, 2025

Thank you @JoseJVS, I confirm that all the tests are working for me as well. The only issue I found is that, after upgrading the CUDA toolkit to version 13, I receive the error:
... error: namespace "std" has no member "vector"
unless I include the <vector> header at the top of remote_spike.h:

#ifndef REMOTE_SPIKE_H
#define REMOTE_SPIKE_H
#include <vector>

I use the Potjans_2014 script as a quick check that code changes do not affect the simulation time on the same hardware. Concerning the hpc_benchmark tests, I agree, but until we prepare clean tests they are the fastest way to check that the different MPI communication schemes work consistently, and test_hpc_benchmark_wg_dfi was the only test we had for the distributed fixed_indegree connection rule. But clearly I can keep such a script outside the repo for this purpose, if you think that is better.

@JoseJVS (Collaborator, Author) commented Nov 3, 2025

Dear @golosio, I have added the <vector> include in remote_spike.h; I guess we will soon need to perform thorough tests of compatibility across CUDA versions...

As for the raster plot comparisons, I agree that these can be used as a temporary solution until we develop a proper verification pipeline.
In the case of test_hpc_benchmark_wg_dfi, I think it makes sense to use this model as our verification for the distributed fixed indegree rule. However, the issue I was alluding to in my previous comment is that precisely this test directory contains the check_kernel_params.sh script, which executes a hpc_benchmark_wg_check_kernel_params.py that does not exist.
Perhaps you could push this Python script to your repo so that I can then include it in this test directory?

As for the Potjans_2014, I agree that having a quick performance reference is handy when developing changes; however, that performance data is only applicable to your development workstation and cannot serve as a reference for anyone else.
Of course, documenting the performance behavior of our code across different hardware is also important (which is why we publish our data on Zenodo). As a compromise, I would suggest either adding your hardware specification and performance data directly to the README (which you would have to keep updated whenever we optimize something), or defining a naming standard for these performance files and adding that pattern to the .gitignore, so that any developer can produce their own performance references for their development machine without interfering with the commit history or git status.

@golosio (Collaborator) commented Nov 3, 2025

Thank you again @JoseJVS. The check_kernel_params.sh and hpc_benchmark_wg_check_kernel_params.py were temporary checks and are now obsolete; you did well to remove them. We have to develop some checks for the kernel parameter configuration, but I would do that in a separate pull request. About the Potjans_2014, I think I did not explain myself well: I do not need the data folder, with the raster plots or other data; I would only need the scripts to run the simulation. But if you think it would be better to keep the scripts in a separate repository, that is fine. By the way, the folders Potjans_2014_hc and Potjans_2014_s are obsolete and should be removed.

@JoseJVS (Collaborator, Author) commented Nov 3, 2025

Thanks for the quick back-and-forth, @golosio!
I have just pushed a commit where I removed check_kernel_params.sh.
I also see we have reached an agreement on the data files of the Potjans_2014 model.
For now I think it is fine to leave the scripts of this model in our repo, as they are still NEST GPU specific (I did remove the directories Potjans_2014_hc and Potjans_2014_s, though).
Once we have unified our interface with PyNEST we can begin the discussion on migrating the model scripts to the dedicated (new) microcircuit repository managed by the NEST Initiative.

@lucapontisso (Collaborator)

Dear all,
I tried the new version (JoseJVS:mpi_comm_clean, up to commit ec5f6a0)
on Leonardo, and I receive this error when trying to import nestgpu in Python:

GPUassert: invalid device symbol /leonardo/home/userexternal/nest-gpu/src/input_spike_buffer.h 857
terminate called after throwing an instance of 'ngpu_exception'
what(): CUDA error

Loaded modules: cuda/12.2, openmpi/4.1.6--gcc--12.2.0-cuda-12.2

With the mpi_comm branch of Bruno's fork, everything is fine in the same environment.

@JoseJVS (Collaborator, Author) commented Nov 6, 2025

Dear @lucapontisso,

When you mention Bruno's mpi_comm branch, do you mean this mpi_comm branch or this mpi_comm_dev2 branch?

The current clean-up branch in this pull request is branched off of mpi_comm_dev2.

@lucapontisso (Collaborator) commented Nov 7, 2025

> Dear @lucapontisso,
>
> When you mention Bruno's mpi_comm branch do you mean: this mpi_comm branch or this mpi_comm_dev2 branch?
>
> The current clean-up branch in this pull request is branched off of mpi_comm_dev2.

Dear @JoseJVS,

the mpi_comm branch (https://github.com/golosio/nest-gpu/tree/mpi_comm) works for me on Leonardo.
I just tried mpi_comm_dev2 (again on Leonardo) and I receive the error I mentioned before when trying to import nestgpu in Python:

GPUassert: invalid device symbol /leonardo/home/userexternal/lpontis1/NEST-GPU-TEST/nest-gpu-bruno-fork/nest-gpu/src/input_spike_buffer.h 857
terminate called after throwing an instance of 'ngpu_exception'
what(): CUDA error
Aborted

@lucapontisso (Collaborator) commented Nov 7, 2025

I understood the problem.
My build script was compiling for a slightly different NVIDIA architecture than the A100 (sm_89 instead of sm_80). It worked anyway up to the mpi_comm branch, but not with the newest version.
After changing the script and recompiling, everything works.
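(Added for reference, not part of the original report: device code built only for sm_89 cannot be loaded on an sm_80 GPU such as the A100, which matches the "invalid device symbol" error above. A quick way to see which architecture to target is to query the GPU's compute capability, e.g. with the hedged sketch below; it assumes a driver recent enough that nvidia-smi supports the compute_cap query field.)

```python
# Hedged sketch: print each visible GPU's compute capability so the build can
# target the matching sm_XX (e.g. 8.0 -> sm_80 for an A100).
# Assumes a recent NVIDIA driver whose nvidia-smi supports the compute_cap field.
import subprocess

csv = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout

for line in csv.strip().splitlines():
    name, cap = (field.strip() for field in line.split(","))
    print(f"{name}: compute capability {cap} -> compile for sm_{cap.replace('.', '')}")
```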

@JoseJVS (Collaborator, Author) commented Nov 7, 2025

Ah okay, good to know about the sm setting, @lucapontisso.
I was under the impression that the Ampere architecture could handle sm=89, but I guess we can investigate this further in a different PR.
Am I right to understand you also approve of this PR now?

@lucapontisso (Collaborator)

> Ah okay, interesting to note @lucapontisso about the sm information. I was under the impression that the Ampere architecture could handle sm=89, but I guess we can investigate this further on a different PR. Am I right to understand you also approve of this PR now?

Done!
Thank you again @JoseJVS for the tremendous work :)

@JoseJVS (Collaborator, Author) commented Nov 8, 2025

Thanks for all the reviews!

@JoseJVS merged commit 105e807 into nest:nest-gpu-2.0-mpi-comm on Nov 8, 2025.