Hi, I'm also doing research on Zero-Shot Temporal Action Localization, and I found that if I use a Transformer to model CLIP video-frame features, I get high mAP on the training set but low mAP on the test set. My guess is that after passing through the Transformer, the CLIP frame features become hard to match with the text features. What is the core of solving this problem?
Can you help me? 😭
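For reference, here is a minimal sketch of the setup I described above. All names and dimensions are illustrative assumptions (e.g. a 512-dim embedding as in CLIP ViT-B/32), not from any particular codebase: CLIP frame features go through a Transformer encoder, and the refined features are matched against CLIP text features by cosine similarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 512  # assumed CLIP embedding size (ViT-B/32)


class TemporalEncoder(nn.Module):
    """Transformer encoder over per-frame CLIP features (illustrative)."""

    def __init__(self, dim=EMBED_DIM, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, dim) CLIP image features
        return self.encoder(frame_feats)


def frame_text_similarity(frame_feats, text_feats):
    # Cosine similarity between every refined frame feature and every
    # class/text embedding -> (batch, num_frames, num_classes).
    f = F.normalize(frame_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    return f @ t.T


torch.manual_seed(0)
frames = torch.randn(2, 16, EMBED_DIM)  # dummy CLIP frame features
texts = torch.randn(5, EMBED_DIM)       # dummy CLIP text class embeddings
refined = TemporalEncoder()(frames)
sims = frame_text_similarity(refined, texts)
print(sims.shape)  # (batch, num_frames, num_classes)
```

This is where I observe the train/test gap: training the encoder end-to-end fits the training classes well, but the similarity scores generalize poorly to unseen classes.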