Hi, I'm also doing research on Zero-Shot Temporal Action Localization, and I found that if I use a Transformer to model CLIP video-frame features, I get high mAP on the training set but low mAP on the test set. My guess is that after passing through the Transformer, the CLIP frame features become hard to match with the text features. What is the core of solving this problem?
Can you help me? 😭
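For reference, here is a minimal sketch of the setup I described above. All names and dimensions are illustrative assumptions (e.g. a 512-dim embedding as in CLIP ViT-B/32), not from any particular codebase: CLIP frame features go through a Transformer encoder, and the refined features are matched against CLIP text features by cosine similarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 512  # assumed CLIP embedding size (ViT-B/32)


class TemporalEncoder(nn.Module):
    """Transformer encoder over per-frame CLIP features (illustrative)."""

    def __init__(self, dim=EMBED_DIM, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, dim) CLIP image features
        return self.encoder(frame_feats)


def frame_text_similarity(frame_feats, text_feats):
    # Cosine similarity between every refined frame feature and every
    # class/text embedding -> (batch, num_frames, num_classes).
    f = F.normalize(frame_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    return f @ t.T


torch.manual_seed(0)
frames = torch.randn(2, 16, EMBED_DIM)  # dummy CLIP frame features
texts = torch.randn(5, EMBED_DIM)       # dummy CLIP text class embeddings
refined = TemporalEncoder()(frames)
sims = frame_text_similarity(refined, texts)
print(sims.shape)  # (batch, num_frames, num_classes)
```

This is where I observe the train/test gap: training the encoder end-to-end fits the training classes well, but the similarity scores generalize poorly to unseen classes.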