Thank you for your great work! In the paper, I see that sparsity can reduce memory movement in the shift operation, but in the code, the shift operation, i.e., `ssl_cuda_kernel`, always copies or moves all channels. The sparsity therefore does not reduce the memory cost of the shift operation. So I wonder whether the shift implementation in inference mode should differ from the one used in training. If so, would you mind sharing the `ssl_cuda_kernel` implementation for inference mode? Thanks a lot!
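For context, here is a minimal NumPy sketch of what I have in mind by a sparsity-aware inference shift. This is purely my own illustration, not your `ssl_cuda_kernel`: the function name, the `(dy, dx)` per-channel shift format, and the zero-padding behavior are all assumptions. The idea is that channels whose (rounded) shift is `(0, 0)` incur no data movement at all, while only the remaining channels are copied:

```python
import numpy as np

def sparse_shift_inference(x, shifts):
    """Hypothetical inference-mode sparse shift (illustration only).

    x: array of shape (C, H, W).
    shifts: list of per-channel integer shifts (dy, dx), assumed already
    rounded for inference. Channels with shift (0, 0) are returned as
    views into x, so they cost no memory movement; only channels with a
    nonzero shift are actually copied (with zero padding at the border).
    Returns (list of per-channel outputs, number of channels moved).
    """
    H, W = x.shape[1], x.shape[2]
    out = []
    moved = 0
    for c, (dy, dx) in enumerate(shifts):
        if dy == 0 and dx == 0:
            out.append(x[c])  # zero shift: a view, no copy
            continue
        moved += 1
        shifted = np.zeros_like(x[c])
        # Copy only the overlapping region; the rest stays zero-padded.
        src = x[c][max(0, -dy):H - max(0, dy), max(0, -dx):W - max(0, dx)]
        shifted[max(0, dy):H - max(0, -dy), max(0, dx):W - max(0, -dx)] = src
        out.append(shifted)
    return out, moved
```

In a fused CUDA kernel the same idea would presumably mean skipping (or aliasing) the zero-shift channels instead of unconditionally copying every channel, which is where I would expect the memory-traffic savings from sparsity to come from.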