[FEATURE] GPU: Support patching fatbin with multiple PTXs

Some programs, such as `llama.cpp`, load a fatbin containing 100+ PTX files. We do not support this case yet: the current patching code assumes that a fatbin contains only one PTX file. We should add support for fatbins with multiple PTX files.

What to do:
- Use `cuobjdump --extract-ptx all` to extract all PTX files
- Patch each PTX file the same way we currently patch a single PTX file
- Use `fatbinary` to combine the patched PTX files into one fatbin
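
The three steps above could be sketched roughly as follows. This is only an illustration under several assumptions, not the project's actual implementation. The function names (`extract_ptx`, `patch_all`, `fatbinary_args`, `repack`) and the `patch_one` callback, which stands in for the existing single-PTX patching logic, are hypothetical. The sketch also assumes that `cuobjdump` writes the extracted files into the current working directory, that each PTX file carries a `.target sm_XX` directive, and that `fatbinary` accepts the `--create=` and `--image=profile=...,file=...` arguments in the form shown by `nvcc --dryrun`. These should be checked against the installed CUDA toolkit.

```python
import re
import subprocess
from pathlib import Path
from typing import Callable, List

# Matches the PTX `.target sm_XX` directive, e.g. ".target sm_80".
TARGET_RE = re.compile(r"^\s*\.target\s+sm_(\w+)", re.MULTILINE)


def extract_ptx(binary: Path, workdir: Path) -> List[Path]:
    """Extract every PTX image embedded in `binary` into `workdir`."""
    workdir.mkdir(parents=True, exist_ok=True)
    # Assumption: cuobjdump drops the extracted files into the current directory.
    subprocess.run(
        ["cuobjdump", "--extract-ptx", "all", str(binary.resolve())],
        cwd=workdir,
        check=True,
    )
    return sorted(workdir.glob("*.ptx"))


def patch_all(ptx_files: List[Path], patch_one: Callable[[str], str]) -> List[Path]:
    """Rewrite each PTX file in place using the (hypothetical) single-PTX patcher."""
    for path in ptx_files:
        path.write_text(patch_one(path.read_text()))
    return list(ptx_files)


def fatbinary_args(ptx_files: List[Path], output: str) -> List[str]:
    """Build a fatbinary command line with one PTX image per input file."""
    # Assumption: argument forms mirror what `nvcc --dryrun` prints.
    args = ["fatbinary", "-64", f"--create={output}"]
    for path in ptx_files:
        match = TARGET_RE.search(path.read_text())
        if match is None:
            raise ValueError(f"no .target directive found in {path}")
        args.append(f"--image=profile=compute_{match.group(1)},file={path}")
    return args


def repack(binary: Path, workdir: Path, output: str,
           patch_one: Callable[[str], str]) -> None:
    """Extract, patch, and recombine all PTX images of `binary`."""
    patched = patch_all(extract_ptx(binary, workdir), patch_one)
    subprocess.run(fatbinary_args(patched, output), check=True)
```

Building the `fatbinary` command in a separate function keeps the part that depends on the toolkit's argument format in one place, and lets it be inspected without running any CUDA tool.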