In _make_fuse_layers, the upsampling is applied after the 1x1 convolution. However, the paper describes the upsampling as coming before the convolution:
"If x > r, f_{xr}(R) upsamples the input representation R through the bilinear upsampling followed by a 1 × 1 convolution for aligning the number of channels."
Moreover, the paper uses bilinear upsampling, while the implementation uses mode='nearest'.
Is there a reason for these two differences?
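
For reference, here is a minimal, dependency-free sketch of the two orderings being compared. It is not the repository's code: it assumes the fuse branch consists only of a 1x1 convolution and an upsampling step (any normalization layer in the real implementation is ignored), and the helper names, weights, and input values are made up for illustration. Feature maps are plain nested lists shaped [C][H][W], and the bilinear variant assumes half-pixel centers with edge clamping.

```python
# Hypothetical helpers for illustration only; feature maps are [C][H][W] lists.

def conv1x1(x, weight, bias):
    """Pointwise convolution: mixes channels independently at each pixel."""
    h, w = len(x[0]), len(x[0][0])
    return [
        [[bias[o] + sum(weight[o][c] * x[c][i][j] for c in range(len(x)))
          for j in range(w)] for i in range(h)]
        for o in range(len(weight))
    ]

def upsample_nearest(x, scale):
    """Repeat every pixel `scale` times along height and width."""
    return [
        [[ch[i // scale][j // scale] for j in range(len(ch[0]) * scale)]
         for i in range(len(ch) * scale)]
        for ch in x
    ]

def _source_index(dst, scale, size):
    """Map an output index to (low, high, fraction), assuming half-pixel centers."""
    src = max((dst + 0.5) / scale - 0.5, 0.0)
    lo = min(int(src), size - 1)
    hi = min(lo + 1, size - 1)
    return lo, hi, src - lo

def upsample_bilinear(x, scale):
    """Bilinear interpolation with edge clamping (interpolation weights sum to 1)."""
    out = []
    for ch in x:
        h, w = len(ch), len(ch[0])
        rows = []
        for i in range(h * scale):
            i0, i1, fi = _source_index(i, scale, h)
            row = []
            for j in range(w * scale):
                j0, j1, fj = _source_index(j, scale, w)
                top = (1 - fj) * ch[i0][j0] + fj * ch[i0][j1]
                bottom = (1 - fj) * ch[i1][j0] + fj * ch[i1][j1]
                row.append((1 - fi) * top + fi * bottom)
            rows.append(row)
        out.append(rows)
    return out

def max_abs_diff(a, b):
    """Largest elementwise difference between two [C][H][W] maps."""
    return max(abs(p - q)
               for ca, cb in zip(a, b)
               for ra, rb in zip(ca, cb)
               for p, q in zip(ra, rb))

# Made-up input: 2 channels, 2x3 pixels, mapped to 3 output channels.
x = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
     [[-1.0, 0.5, 2.0], [0.0, 1.5, -2.5]]]
weight = [[0.5, -1.0], [2.0, 0.25], [-0.75, 1.0]]
bias = [0.1, -0.2, 0.3]

diffs = {}
for upsample in (upsample_nearest, upsample_bilinear):
    conv_then_up = upsample(conv1x1(x, weight, bias), 2)   # order in the code
    up_then_conv = conv1x1(upsample(x, 2), weight, bias)   # order in the paper
    diffs[upsample.__name__] = max_abs_diff(conv_then_up, up_then_conv)

print(diffs)
```

Under these assumptions the two orderings should agree up to floating-point rounding for either interpolation mode, because a 1x1 convolution acts only across channels while the upsampling acts only across spatial positions. Applying the convolution first also means it runs on scale² fewer pixels. This only covers the ordering; nearest and bilinear upsampling still produce different outputs from each other, so the second difference remains open.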