吳紹宏
In response to the daily proliferation of multimedia content, video streaming platforms can use models to predict video tags and thereby provide more accurate content recommendations to users. However, most existing video tag prediction models focus on single-label prediction, which is insufficient to represent the complete information of a video. This paper treats multi-label prediction as a set of single-label prediction tasks and adopts Vision-Language Models (VLMs), which are pre-trained on large-scale image-text datasets, as the backbone architecture. By keeping the text encoder and image encoder frozen, we introduce a prompt learning method that prevents the model from overfitting and improves its ability to correctly predict unseen categories. Two sets of prompts are added as learnable parameters, representing the features of positive and negative text samples, respectively. For the visual component, multiple frames sampled from a video are encoded by the image encoder to extract per-frame features, which are then fused by a Temporal Modeling Module to obtain the overall video features. The cosine similarities between the positive and negative text features and the video features yield positive and negative logits, which are passed through a softmax function to obtain the probability of the positive class. Finally, an asymmetric loss function is applied to compute the loss value. Additionally, we use the video features as text tokens to further enhance the effectiveness of prompt learning.
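To make the scoring and training steps described above concrete, the following is a minimal sketch in PyTorch. It uses random tensors in place of the frozen encoders' outputs, mean pooling as a stand-in for the Temporal Modeling Module, and a standard asymmetric loss formulation; all variable names and hyperparameter values (gamma_pos, gamma_neg, margin) are illustrative assumptions rather than values taken from this work.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: B videos, T frames, C candidate tags, D embedding dim.
B, T, C, D = 2, 8, 10, 512

# Stand-ins for encoder outputs: per-frame features from the frozen image
# encoder, and per-class text features produced by the frozen text encoder
# from the learnable positive and negative prompts.
frame_feats = torch.randn(B, T, D)      # (B, T, D) frame-level features
pos_text_feats = torch.randn(C, D)      # (C, D) positive-prompt features
neg_text_feats = torch.randn(C, D)      # (C, D) negative-prompt features

# Temporal Modeling Module stand-in: mean pooling over the frame axis.
video_feats = frame_feats.mean(dim=1)   # (B, D) overall video features

# Cosine similarity between video features and positive/negative text features.
video_feats = F.normalize(video_feats, dim=-1)
pos_logits = video_feats @ F.normalize(pos_text_feats, dim=-1).t()   # (B, C)
neg_logits = video_feats @ F.normalize(neg_text_feats, dim=-1).t()   # (B, C)

# Softmax over the (negative, positive) logit pair per class gives the
# probability of the positive class for each tag.
pair = torch.stack([neg_logits, pos_logits], dim=-1)   # (B, C, 2)
p_pos = pair.softmax(dim=-1)[..., 1]                   # (B, C)

def asymmetric_loss(p, y, gamma_pos=0.0, gamma_neg=4.0, margin=0.05, eps=1e-8):
    """Sketch of a standard asymmetric loss: different focusing parameters
    for positives and negatives, with a probability shift on negatives."""
    p_neg = (p - margin).clamp(min=0)
    loss_pos = y * (1 - p).pow(gamma_pos) * torch.log(p.clamp(min=eps))
    loss_neg = (1 - y) * p_neg.pow(gamma_neg) * torch.log((1 - p_neg).clamp(min=eps))
    return -(loss_pos + loss_neg).mean()

labels = torch.randint(0, 2, (B, C)).float()  # multi-label targets
loss = asymmetric_loss(p_pos, labels)
```

In this sketch only the prompt parameters (and the temporal module) would receive gradients during training, since both encoders remain frozen.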