I have implemented a module using OpenMP and CUDA that runs faster than your CuPy implementation while maintaining its memory efficiency.
shikishima-TasakiLab/Involution-PyTorch
It also supports TorchScript and 16-bit floats.
shikishima-TasakiLab/Involution-PyTorch#1