
unexpected similarity based on embedding #34

@VedaHung

Description

Dear experts,

Thanks for building the model for Chinese.
I tried to use your model to calculate semantic similarity (see the code below):
import torch
from transformers import BertTokenizerFast, AutoModel
from sklearn.metrics.pairwise import cosine_similarity  # assumed source of cosine_similarity

tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
albert_model = AutoModel.from_pretrained('ckiplab/albert-tiny-chinese')

def encode_text(text):
    text_code = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    input_ids = text_code['input_ids']
    attention_mask = text_code['attention_mask']
    token_type_ids = text_code['token_type_ids']
    print('input_ids', input_ids)
    print('attention_mask', attention_mask)
    print('token_type_ids', token_type_ids)
    with torch.no_grad():
        output = albert_model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        embed = output.pooler_output
    return embed

cs1 = cosine_similarity(encode_text('蘋果'), encode_text('鳳梨'))
cs2 = cosine_similarity(encode_text('蘋果'), encode_text('塑膠'))

I expected cs1 > cs2, since 蘋果 ("apple") and 鳳梨 ("pineapple") are both fruits while 塑膠 ("plastic") is unrelated, but that is not the case. How do you interpret results where the higher similarity occurs between unrelated words? And what can I do to get better semantic-relatedness results from your model? Thanks!
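Would mean pooling over last_hidden_state (ignoring padding tokens) be a better choice than pooler_output here? Below is a minimal sketch of what I mean, assuming the same tokenizer and model as above; the helper name encode_text_mean_pooled is only for illustration.

def encode_text_mean_pooled(text):
    # Same tokenization as before, but average the token embeddings instead of
    # using the pooler output.
    text_code = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        output = albert_model(**text_code)
    hidden = output.last_hidden_state                          # (1, seq_len, hidden_size)
    mask = text_code['attention_mask'].unsqueeze(-1).float()   # (1, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)                        # sum over real (non-padding) tokens
    counts = mask.sum(dim=1).clamp(min=1)                      # number of real tokens
    return summed / counts                                     # (1, hidden_size)

cs1 = cosine_similarity(encode_text_mean_pooled('蘋果'), encode_text_mean_pooled('鳳梨'))
cs2 = cosine_similarity(encode_text_mean_pooled('蘋果'), encode_text_mean_pooled('塑膠'))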

Sincerely,
Veda
