
Questions about the CLIP code

article 2025/2/28 15:31:51

First, the text has to be converted to tokens with CLIPTokenizer. The tokens are then turned into embeddings, using either CLIPTextModelWithProjection or CLIPTextModel.
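The two steps above can be sketched with the Hugging Face transformers API. This is a minimal sketch, not a definitive implementation: the checkpoint name openai/clip-vit-base-patch32 and the helper names load_clip_text and encode_text are assumptions chosen for illustration and do not come from the original notes.

```python
def load_clip_text(model_name="openai/clip-vit-base-patch32", with_projection=False):
    """Load a tokenizer and a CLIP text encoder (downloads weights on first use).

    model_name is an assumed example checkpoint, not one named in the notes.
    """
    # Imported lazily so encode_text below can be read and exercised on its own.
    from transformers import CLIPTextModel, CLIPTextModelWithProjection, CLIPTokenizer

    tokenizer = CLIPTokenizer.from_pretrained(model_name)
    model_cls = CLIPTextModelWithProjection if with_projection else CLIPTextModel
    model = model_cls.from_pretrained(model_name).eval()
    return tokenizer, model


def encode_text(texts, tokenizer, model):
    """text -> tokens -> one embedding per input string."""
    # Step 1: text -> token ids, padded to a common length, as PyTorch tensors.
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    # Step 2: tokens -> embeddings.
    outputs = model(**inputs)
    # CLIPTextModelWithProjection exposes text_embeds;
    # CLIPTextModel only exposes pooler_output (plus last_hidden_state).
    text_embeds = getattr(outputs, "text_embeds", None)
    if text_embeds is not None:
        return text_embeds
    return outputs.pooler_output
```

Typical usage would be `tokenizer, model = load_clip_text(with_projection=True)` followed by `encode_text(["a photo of a cat"], tokenizer, model)`. For inference it is common to wrap the call in `torch.no_grad()`.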

Differences between CLIPTextModelWithProjection and CLIPTextModel:

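CLIPTextModel outputs pooler_output, whereas CLIPTextModelWithProjection outputs text_embeds.
text_embeds lies in the same space as image_embeds, so CLIPTextModelWithProjection is the one to use when comparing text with images by similarity.
If all you need is an encoding of the text, either model works (CLIPTextModel uses slightly less GPU memory).
CLIPTextModelWithProjection = CLIPTextModel + one Linear layer
Reference: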
https://github.com/huggingface/transformers/issues/21465#issuecomment-1419080756
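To make "CLIPTextModel + one Linear layer" concrete, here is a toy sketch in plain Python. It does not call transformers: every number is made up, and the helper names linear_no_bias and cosine_similarity are hypothetical. It assumes the projection is a bias-free linear map, as it is in the transformers implementation, applied to pooler_output to produce text_embeds, which can then be compared with image_embeds.

```python
import math


def linear_no_bias(weight, x):
    """Compute y = W x, with W given as a list of rows (out_dim x in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


# Made-up values, for illustration only.
pooler_output = [1.0, 0.0, 2.0]      # what CLIPTextModel returns (hidden_size = 3)
text_projection = [                  # the extra Linear layer (projection_dim = 2)
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.5],
]

# What CLIPTextModelWithProjection adds: project into the shared space.
text_embeds = linear_no_bias(text_projection, pooler_output)

# Only the projected vector has the same dimension as image_embeds,
# so only it can be compared with an image embedding.
image_embeds = [2.0, 2.0]
similarity = cosine_similarity(text_embeds, image_embeds)
print(text_embeds, similarity)
```

In this toy, pooler_output has three components while image_embeds has two, so the two cannot be compared until the projection has been applied.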

CLIPTextTransform = CLIPTokenizer + CLIPTextModel
CLIPTextWithProjectionTransform = CLIPTokenizer + CLIPTextModelWithProjection
