Some programs, for example llama.cpp, load a fatbin containing 100+ PTX files. We don't support this case yet and currently assume that a fatbin contains only one PTX file. Add support for fatbins with multiple PTX files.
What to do:
- Use `cuobjdump --extract-ptx` to extract all PTX files
- Patch each PTX file in the same way as we patch a single PTX file
- Use `fatbinary` to compose the patched PTX files into a fatbin
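
The steps above could be wired together roughly as in the sketch below. It is only an illustration, not the project's implementation: `patch_ptx` stands in for whatever the existing single-PTX patcher is, and the helper names (`extract_cmd`, `target_arch`, `compose_cmd`, `rebuild_fatbin`) are hypothetical. The `fatbinary` flags (`-64`, `--create=`, `--image3=kind=ptx,sm=...,file=...`) are assumed from what `nvcc --dryrun` prints on recent toolkits and should be checked against the installed CUDA version. The sketch also assumes that `cuobjdump` writes the extracted `*.ptx` files into its working directory and that each PTX file carries a `.target sm_XX` directive.

```python
import re
import subprocess
from pathlib import Path
from typing import Callable, List, Tuple


def extract_cmd(fatbin: Path) -> List[str]:
    # "all" asks cuobjdump to extract every embedded PTX file.
    return ["cuobjdump", "--extract-ptx", "all", str(fatbin)]


def target_arch(ptx_text: str) -> str:
    # Read the SM version from the PTX `.target` directive, e.g. "sm_75" -> "75".
    m = re.search(r"^\s*\.target\s+sm_(\w+)", ptx_text, re.MULTILINE)
    if m is None:
        raise ValueError("no .target sm_XX directive found in PTX")
    return m.group(1)


def compose_cmd(images: List[Tuple[Path, str]], out: Path) -> List[str]:
    # One --image3 entry per patched PTX file (flag spelling is an assumption).
    cmd = ["fatbinary", "-64", f"--create={out}"]
    for ptx, arch in images:
        cmd.append(f"--image3=kind=ptx,sm={arch},file={ptx}")
    return cmd


def rebuild_fatbin(fatbin: Path, out: Path,
                   patch_ptx: Callable[[str], str], workdir: Path) -> None:
    # 1. Extract all PTX files into workdir.
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(extract_cmd(fatbin.resolve()), cwd=workdir, check=True)

    # 2. Patch each PTX file with the same routine used for a single PTX.
    images = []
    for ptx in sorted(workdir.glob("*.ptx")):
        patched = patch_ptx(ptx.read_text())
        ptx.write_text(patched)
        images.append((ptx, target_arch(patched)))

    # 3. Compose the patched PTX files back into one fatbin.
    subprocess.run(compose_cmd(images, out), check=True)
```

Keeping the command construction in separate functions makes it possible to check the generated command lines without having the CUDA toolkit installed; only `rebuild_fatbin` actually invokes the tools.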