TheAssembler1 (Collaborator) commented May 20, 2025

This PR fixes issue #266 on the client side, as exercised by the client tests. The server has similar issues. The following client-side functions created bulk handles and never destroyed them:

PDC_Client_transfer_request
PDC_Client_transfer_request_all
PDC_Client_transfer_request_metadata_query
PDC_Client_transfer_request_metadata_query2

These functions now take an additional hg_bulk_t * parameter, which they set to the bulk handle created within the function. The caller passes a pointer into the array of bulk handles that belongs to the corresponding region transfer.

To support this, an array of hg_bulk_t has been added to the PDC region transfer struct. Each bulk handle that is created is stored in the array, and when the transfer is closed HG_Bulk_free is called on every handle in the array.
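The ownership pattern described above can be sketched as follows. This is a minimal illustration rather than the code in this PR: it compiles without Mercury, so `mock_bulk_t`, `mock_bulk_create`, and `mock_bulk_free` are hypothetical stand-ins for `hg_bulk_t`, `HG_Bulk_create`, and `HG_Bulk_free`, and `transfer_request_t`, `client_send_request`, and the other helper names are assumptions made up for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for Mercury's hg_bulk_t, HG_Bulk_create and
 * HG_Bulk_free, so the sketch builds without Mercury installed. */
typedef struct mock_bulk {
    size_t size;
} *mock_bulk_t;
#define MOCK_BULK_NULL ((mock_bulk_t)NULL)

static int live_handles = 0; /* handles created but not yet freed */

static int mock_bulk_create(size_t size, mock_bulk_t *out)
{
    mock_bulk_t h = malloc(sizeof(*h));
    if (h == NULL)
        return -1;
    h->size = size;
    *out    = h;
    live_handles++;
    return 0;
}

static int mock_bulk_free(mock_bulk_t h)
{
    if (h == MOCK_BULK_NULL)
        return 0; /* nothing to release */
    free(h);
    live_handles--;
    return 0;
}

/* Hypothetical region-transfer struct: owns one handle slot per request. */
typedef struct {
    mock_bulk_t *bulk_handles;
    int          n_handles;
} transfer_request_t;

static int transfer_create(transfer_request_t *t, int n)
{
    t->bulk_handles = malloc((size_t)n * sizeof(*t->bulk_handles));
    if (t->bulk_handles == NULL)
        return -1;
    for (int i = 0; i < n; i++)
        t->bulk_handles[i] = MOCK_BULK_NULL;
    t->n_handles = n;
    return 0;
}

/* Plays the role of a client request function: it creates a bulk handle
 * and reports it to the caller through the extra out-parameter. */
static int client_send_request(size_t nbytes, mock_bulk_t *bulk_out)
{
    return mock_bulk_create(nbytes, bulk_out);
}

static int transfer_start(transfer_request_t *t, size_t nbytes)
{
    for (int i = 0; i < t->n_handles; i++) {
        /* Pass a pointer into the transfer's own handle array. */
        if (client_send_request(nbytes, &t->bulk_handles[i]) != 0)
            return -1;
    }
    return 0;
}

/* On close, free every recorded handle, then the array itself. */
static int transfer_close(transfer_request_t *t)
{
    int ret = 0;
    for (int i = 0; i < t->n_handles; i++) {
        if (mock_bulk_free(t->bulk_handles[i]) != 0)
            ret = -1;
        t->bulk_handles[i] = MOCK_BULK_NULL;
    }
    free(t->bulk_handles);
    t->bulk_handles = NULL;
    t->n_handles    = 0;
    return ret;
}

/* Runs one create/start/close cycle; returns the number of leaked handles. */
int demo_transfer_lifecycle(void)
{
    transfer_request_t t;
    if (transfer_create(&t, 4) != 0)
        return -1;
    if (transfer_start(&t, 1024) != 0) {
        transfer_close(&t);
        return -1;
    }
    assert(live_handles == 4);
    if (transfer_close(&t) != 0)
        return -1;
    return live_handles;
}
```

Keeping the handles in the transfer struct gives each handle a single owner with a clear lifetime. Slots that were never filled stay at the null value, so closing a partially started transfer is still safe in this sketch.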

Benchmark configuration used to compare develop (old) against this branch (new):

  • System: Perlmutter
  • Nodes: 4
  • Clients: 32 per node
  • Servers: 1 per node
  • Mercury: Single-threaded
  • Cache: Enabled
  • Commands:
    • VPICIO_MTS: ./vpicio_mts 8388608 5 20
    • VPICIO/BDCATS: ./vpicio ./bdcats

VPICIO_MTS Develop time (old):

| Step | Time Type | Time Value (s) |
|------|-----------|----------------|
| 0 | Obj create time | 3.79819e-02 |
| 0 | Transfer create time | 1.89206e-04 |
| 0 | Transfer start time | 8.81842e-01 |
| 0 | Transfer wait time | 5.68597e-02 |
| 0 | Transfer close time | 7.25157e-02 |
| 0 | Obj close time | 9.77335e-04 |
| 1 | Obj create time | 1.18654e-03 |
| 1 | Transfer create time | 2.67920e-05 |
| 1 | Transfer start time | 1.02360e+00 |
| 1 | Transfer wait time | 1.32388e-01 |
| 1 | Transfer close time | 1.20518e-01 |
| 1 | Obj close time | 7.96650e-05 |
| 2 | Obj create time | 7.67836e-03 |
| 2 | Transfer create time | 3.07790e-05 |
| 2 | Transfer start time | 9.94380e-01 |
| 2 | Transfer wait time | 1.03283e-01 |
| 2 | Transfer close time | 6.81297e-02 |
| 2 | Obj close time | 6.10180e-05 |
| 3 | Obj create time | 1.26300e-03 |
| 3 | Transfer create time | 2.60810e-05 |
| 3 | Transfer start time | 1.01179e+00 |
| 3 | Transfer wait time | 6.62225e-02 |
| 3 | Transfer close time | 8.85037e-02 |
| 3 | Obj close time | 5.97460e-05 |
| 4 | Obj create time | 1.30326e-03 |
| 4 | Transfer create time | 2.59000e-05 |
| 4 | Transfer start time | 1.04613e+00 |
| 4 | Transfer wait time | 5.82398e+00 |
| 4 | Transfer close time | 1.02357e-01 |
| 4 | Obj close time | 4.92950e-05 |

VPICIO_MTS New time:

| Step | Time Type | Time Value (s) |
|------|-----------|----------------|
| 0 | Obj create time | 3.66105e-02 |
| 0 | Transfer create time | 2.55625e-04 |
| 0 | Transfer start time | 8.81592e-01 |
| 0 | Transfer wait time | 7.21270e-02 |
| 0 | Transfer close time | 5.83703e-02 |
| 0 | Obj close time | 1.91511e-03 |
| 1 | Obj create time | 1.55751e-03 |
| 1 | Transfer create time | 1.11616e-04 |
| 1 | Transfer start time | 9.94855e-01 |
| 1 | Transfer wait time | 4.39608e-02 |
| 1 | Transfer close time | 1.08936e-01 |
| 1 | Obj close time | 8.30160e-04 |
| 2 | Obj create time | 1.20452e-03 |
| 2 | Transfer create time | 1.42807e-04 |
| 2 | Transfer start time | 1.02291e+00 |
| 2 | Transfer wait time | 1.04457e-01 |
| 2 | Transfer close time | 9.65955e-02 |
| 2 | Obj close time | 2.97206e-04 |
| 3 | Obj create time | 2.77333e-03 |
| 3 | Transfer create time | 5.50470e-05 |
| 3 | Transfer start time | 1.01106e+00 |
| 3 | Transfer wait time | 5.92011e-02 |
| 3 | Transfer close time | 9.21980e-02 |
| 3 | Obj close time | 3.01645e-04 |
| 4 | Obj create time | 1.82119e-03 |
| 4 | Transfer create time | 1.10264e-04 |
| 4 | Transfer start time | 1.07994e+00 |
| 4 | Transfer wait time | 5.38074e+00 |
| 4 | Transfer close time | 1.15311e-01 |
| 4 | Obj close time | 1.50832e-04 |

The following results were obtained using the time command.

VPICIO/BDCATS Develop time (old):

VPICIO

real    0m15.183s
user    0m0.054s
sys     0m0.122s

BDCATS

real    0m25.232s
user    0m0.065s
sys     0m0.122s

VPICIO/BDCATS New time:

VPICIO

real    0m15.354s
user    0m0.069s
sys     0m0.113s

BDCATS

real    0m25.279s
user    0m0.089s
sys     0m0.103s

Corresponding issue: #266

@TheAssembler1 TheAssembler1 requested a review from a team as a code owner May 20, 2025 13:49
@TheAssembler1 TheAssembler1 requested review from jeanbez and houjun May 20, 2025 13:50
@TheAssembler1 TheAssembler1 self-assigned this May 20, 2025
@TheAssembler1 TheAssembler1 added the type: bug Something isn't working label May 20, 2025
@jeanbez jeanbez changed the title Propogate HGfinalize error on PDCclose Draft: Propogate HGfinalize error on PDCclose May 20, 2025
@TheAssembler1 TheAssembler1 changed the title Draft: Propogate HGfinalize error on PDCclose Draft: Propogate HG_Finalize error on PDCclose May 20, 2025
@TheAssembler1 TheAssembler1 changed the title Draft: Propogate HG_Finalize error on PDCclose Draft: Client Propogate HG_Finalize error on PDCclose Jul 5, 2025
@TheAssembler1 TheAssembler1 changed the title Draft: Client Propogate HG_Finalize error on PDCclose Client Propogate HG_Finalize error on PDCclose Jul 5, 2025
jeanbez (Member) commented Jul 15, 2025

@houjun could you please approve this one as well?

@jeanbez jeanbez merged commit 7c118b2 into hpc-io:develop Jul 15, 2025
8 checks passed
jeanbez added a commit that referenced this pull request Jul 21, 2025
* Add pdc_logger.h to installation (#245)

* sync with gitlab (#248)

* Fix restart issue (#228)

* Fix cache flush (#226)

* Fix a thread race issue that may cause memory error when larger than cache max size data is transferred

* Add a test that writes more data than server cache size

* Fix CI run command

* Fix restart issue

* Update nersc.yml (#238)

* Since PDCinit returns a uint64_t, 0 should indicate failure (#233)

Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* Check the return value of `PDC_Client_init` in `PDC_init` (#230)

* Check that return value of PDC_Client_init in PDC_init

* Change return to 0

This will make it simpler when merging #233 (comment)

---------

Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* Change `printf` to PDC logger (#232)

* Changed all printf to use pdc logger

Also removed large blocks of comments and changed the pdc logger
to print the file name, function, and line number.

* Change typo of LOG_INFO to LOG_ERROR

* Correct grammar from fail -> failed

* update grammer succesfully close -> successfully closed

* switch type of LOG_INFO to LOG_ERROR

* Add logging docs and fix some LOG_INFO->LOG_JUST_PRINT

* update clang formatting

---------

Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* Malloc correct size for pdc_obj_metadata_pkg (#237)

Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* PDCregion_transfer_create validate client buf, local region, and remote regions (#236)

Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

---------

Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>
Co-authored-by: Noah Lewis <47840925+TheAssembler1@users.noreply.github.com>

* Fix return metadata dtype (#246)

Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* Region info transfer struct type and helper functions (#247)

* Fix cache flush (#226)

* Fix a thread race issue that may cause memory error when larger than cache max size data is transferred

* Add a test that writes more data than server cache size

* Fix CI run command

* checkpoint

* Switch variables such as count_0, start_0, and size0... to arrays

This will reduce code duplication, reduce bugs, and make it easier
to switch to support n-dimensional data.

* clang format

* checkpoint

* created better function names and documentation

* remove

* Committing clang-format changes

* clang format

* remove file

* change for use helper function

* fix bug with incorrect helper function call

---------

Co-authored-by: Houjun Tang <htang4@lbl.gov>
Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* Fix issues with PDC tools (#249)

* Fix issues with PDC tools

* Correct LOG_ERROR to LOG_INFO

* Committing clang-format changes

---------

Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* Fix printing in `PGOTO_ERROR` and `PGOTO_ERROR_VOID` (#250)

Print new line by default in `PGOTO_ERROR` and `PGOTO_ERROR_VOID`

Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* Group Tests Into Folders (#252)

* Fix cache flush (#226)

* Fix a thread race issue that may cause memory error when larger than cache max size data is transferred

* Add a test that writes more data than server cache size

* Fix CI run command

* Grouped commons tests into folders

This commit also changes the src/tests/CmakeLists.txt to build tests
within their new folders

* add deprecated folder remove buf_map folder

* Update run_multiple_mpi_test.sh

* Update dependencies-macos.sh

* Update dependencies-macos.sh

---------

Co-authored-by: Houjun Tang <htang4@lbl.gov>
Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>
Co-authored-by: Jean Luca Bez <jeanlucabez@gmail.com>

* Return the same obj_id if the obj is just created or already opened (#254)

* Return the same obj_id if the obj is just created or already opened

* Committing clang-format changes

* Update doc

* Update dependencies-macos.sh

---------

Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* add option to choose interface (#255)

* add option to connect to a given network interface
* Committing clang-format changes
* fix conflict
* include header
* enable output on failure

---------

Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>

* Fix multithreading compilation (#259)

* fix multithreading compilation

* Committing clang-format changes

---------

Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* Fix segmentation fault of calling `PDCobj_create_mpi` twice with duplicate object name (#262)

* Validate success of PDC_obj_create and PDC_find_id in PDCobj_create_mpi

* Committing clang-format changes

---------

Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* Use `PDC_malloc`, `PDC_free`, `PDC_calloc`, and `PDC_realloc` (#260)

* checkpoint

* replace free with PDC_free and calloc with PDC_calloc

* Committing clang-format changes

* fix more mallocs to PDC_malloc

* more PDC_free fixes

* Committing clang-format changes

* Update ubuntu-cache.yml

* remove eno1

* fix realloc

* Committing clang-format changes

* Update ubuntu-no-cache.yaml

* Fix several bugs with error checking with object dim allocation

* Committing clang-format changes

* fix bug

* Committing clang-format changes

* Update ubuntu-no-cache.yaml

* Update ubuntu-cache.yml

* Set default value of ndim to 1 in PDCprop_create when using PDC_OBJ_CREATE

* Committing clang-format changes

* Malloc when defaulting to ndim size 1.
Only free hostname when we PDC_malloc the memory
because pointers returned by getenv are not malloced
and could point to static memory.

* Committing clang-format changes

* Update README.md

minor change to trigger the pipeline

---------

Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>
Co-authored-by: Jean Luca Bez <jeanlucabez@gmail.com>

* Fix Sphinx documentation errors and warnings (#265)

* Fix all sphinx warnings and errors. Removed repeat declarations of functions.

* Committing clang-format changes

* remove def of EXTENSION_MAPPING

* gitignore for docs and fix c structs

* Committing clang-format changes

---------

Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* Replace `docs/README.md` -> steps to build docs (#268)

* Replace docs/README.md -> steps to build docs

* Update README.md

---------

Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* Use `FUNC_ENTER` and `FUNC_LEAVE` (#270)

* use func enter and func leave in all functions

* Committing clang-format changes

* fix infinite recursion between memory management, hash table, and per function timing

* Committing clang-format changes

* add profiling to CI

---------

Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* New test macros and code cleanup (#261)

* checkpoint

* Committing clang-format changes

* some tests

* Committing clang-format changes

* checkpoint

* open_obj uses new test macros

* Committing clang-format changes

* read_obj uses TASSERT

* read_obj uses TASSERT

* Committing clang-format changes

* cont_del and cont_getid use test macros

* convert more tests to use macros

* convert more tests to macros

* Committing clang-format changes

* Committing clang-format changes

* clang format

* use test helper in cont_info and cont_add_del

* more tests use macros

* Committing clang-format changes

* use tests macros in more tests

* use PGOTO* macros instead of goto

* clang format

* more log fixes

* logging cleanup and more usage of test macros

* Committing clang-format changes

* clang format and fix CMakeLists for tests

* use tests macros in transfer overlap 2D/3D

* use TASSERT in more tests

* Committing clang-format changes

* use test asserts

* all tests on the CI use TASSERT

* fix printing and newlines in tests

* print time, file name, function name, and line number in debug print

---------

Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>

* Tests logging typo fix (#273)

* Fixed logging typos

* Committing clang-format changes

---------

Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>

* Rename pdc_server.exe to pdc_server for consistency (#275)

Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* Update vpicio_mts.c (#276)

Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* Client Propogate `HG_Finalize` error on `PDCclose` (#263)

* all but 4 close errors are fixed

* Committing clang-format changes

* client side HG_Finalize now passes on serial tests

* Committing clang-format changes

* cleanup

* Committing clang-format changes

* Update pdc_region_transfer.c

* free bulk handles during region transfer close

---------

Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>

* Standardize ID Lookup Null Checks and Error Handling (#281)

* cleanup finding id's

* Committing clang-format changes

---------

Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>

* Obj open fix (#279)

* Fix seg fault for PDCobj_open on non-existent object

* Committing clang-format changes

* Remove log from NULL check

* Log message when object metadata isn't found.

---------

Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* Fix multithread (#274)

* move hash table mutex to hashtable source files

* Committing clang-format changes

* add multithread compile test

---------

Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* Fix seg fault when mercury initialization fails (#283)

* check for NULL parameters in hash table

* Committing clang-format changes

---------

Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

---------

Co-authored-by: Noah Lewis <47840925+TheAssembler1@users.noreply.github.com>
Co-authored-by: Houjun Tang <htang4@lbl.gov>
Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>