`profiler + cudagraph` hangs forever for rope on main branch. repro: `python3 run.py --op rope --metrics=latency --latency-measure-mode=profiler --cudagraph`