julia> B = BandedMatrix(0=>ones(1000));
julia> D = Diagonal(B);
julia> @btime $B * $B;
118.442 μs (2 allocations: 7.98 KiB)
julia> @btime $D * $D;
761.298 ns (1 allocation: 7.94 KiB)
julia> B * B == D * D
true
This is a significant difference, and it would be good to identify why this is slow.