Fast and generic implementation using OpenMP and CUDA

I have implemented an involution module using OpenMP (CPU) and CUDA (GPU) that runs faster while keeping the memory efficiency of your CuPy implementation.

[shikishima-TasakiLab/Involution-PyTorch](https://github.com/shikishima-TasakiLab/Involution-PyTorch)

It also supports TorchScript and 16-bit floats (half precision).

shikishima-TasakiLab/Involution-PyTorch#1