Description
Dear experts,
Thanks for building the models for Chinese.
I tried to use your model to calculate semantic similarity (see below):
import torch
from torch.nn.functional import cosine_similarity  # import for the similarity calls below
from transformers import BertTokenizerFast, AutoModel

tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
albert_model = AutoModel.from_pretrained('ckiplab/albert-tiny-chinese')

def encode_text(text):
    # Tokenize a single string and return the pooled embedding.
    text_code = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    input_ids = text_code['input_ids']
    attention_mask = text_code['attention_mask']
    token_type_ids = text_code['token_type_ids']
    print('input_ids', input_ids)
    print('attention_mask', attention_mask)
    print('token_type_ids', token_type_ids)
    with torch.no_grad():
        output = albert_model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
    embed = output.pooler_output  # (1, hidden_size) pooled representation
    return embed

cs1 = cosine_similarity(encode_text('蘋果'), encode_text('鳳梨'))  # apple vs. pineapple (related)
cs2 = cosine_similarity(encode_text('蘋果'), encode_text('塑膠'))  # apple vs. plastic (unrelated)
I expected cs1 > cs2 (apple should be more similar to pineapple than to plastic), but that is not what I get. How should I interpret results where the higher similarity occurs between unrelated words, and what can I do to get better semantic-relatedness results from your model? Thanks!
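Would mean-pooling the token embeddings from last_hidden_state, instead of using pooler_output, be more appropriate here? Below is a rough sketch of what I have in mind, reusing the same tokenizer and model as above; the masking and pooling details are just my own guess, not something from your documentation.

# Sketch: mean-pool the non-padding token embeddings instead of pooler_output.
def encode_text_mean(text):
    text_code = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        output = albert_model(**text_code)
    hidden = output.last_hidden_state                         # (1, seq_len, hidden_size)
    mask = text_code['attention_mask'].unsqueeze(-1).float()  # (1, seq_len, 1), 1.0 for real tokens
    summed = (hidden * mask).sum(dim=1)                       # sum over non-padding token vectors
    counts = mask.sum(dim=1).clamp(min=1)                     # number of non-padding tokens
    return summed / counts                                    # (1, hidden_size) mean-pooled embedding

cs1 = cosine_similarity(encode_text_mean('蘋果'), encode_text_mean('鳳梨'))
cs2 = cosine_similarity(encode_text_mean('蘋果'), encode_text_mean('塑膠'))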
Sincerely,
Veda