Hello, authors!
I found a small discrepancy between the paper and its implementation.
The paper states that the two Updated Agg Tokens, for video and audio, are fused by linear aggregation, but it does not explain how the weight matrix for the two features is computed.
However, in the code, the two features are only concatenated along the channel dimension after dimension expansion. The code is as follows.
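To make the question concrete, here is how I read "linear aggregation" compared with plain concatenation. The notation is my own assumption and does not come from the paper: $t_v$ and $t_a$ stand for the updated video and audio Agg Tokens, and $W_v$, $W_a$, $b$ are hypothetical learnable parameters.

```latex
% One possible form of linear aggregation (assumed, not taken from the paper):
z_{\text{lin}} = W_v\, t_v + W_a\, t_a + b

% Plain concatenation of the two tokens, with no learned weights:
z_{\text{cat}} = [\, t_v \,;\, t_a \,]
```

In the first form the contribution of each modality is controlled by learned weights, whereas in the second the two tokens are simply placed side by side and any weighting is left to later layers.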