
unexpected similarity based on embedding #34

@VedaHung

Description

Dear experts,

Thanks for building the model for Chinese.
I tried to use your model to calculate semantic similarity (see the code below):
import torch
from transformers import BertTokenizerFast, AutoModel
from sklearn.metrics.pairwise import cosine_similarity  # assumed source of cosine_similarity

tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
albert_model = AutoModel.from_pretrained('ckiplab/albert-tiny-chinese')

def encode_text(text):
    text_code = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    input_ids = text_code['input_ids']
    attention_mask = text_code['attention_mask']
    token_type_ids = text_code['token_type_ids']
    print('input_ids', input_ids)
    print('attention_mask', attention_mask)
    print('token_type_ids', token_type_ids)
    with torch.no_grad():
        output = albert_model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        embed = output.pooler_output
    return embed

cs1 = cosine_similarity(encode_text('蘋果'), encode_text('鳳梨'))
cs2 = cosine_similarity(encode_text('蘋果'), encode_text('塑膠'))

I expected cs1 > cs2, since 蘋果 ("apple") and 鳳梨 ("pineapple") are both fruits while 塑膠 ("plastic") is unrelated, but that is not the case. How do you interpret results where the higher similarity occurs between unrelated words? And what can I do to get better semantic-relatedness results from your model? Thanks!
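Would mean pooling over last_hidden_state (ignoring padding tokens) be a better choice than pooler_output here? Below is a minimal sketch of what I mean, assuming the same tokenizer and model as above; the helper name encode_text_mean_pooled is only for illustration.

def encode_text_mean_pooled(text):
    # Same tokenization as before, but average the token embeddings instead of
    # using the pooler output.
    text_code = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        output = albert_model(**text_code)
    hidden = output.last_hidden_state                          # (1, seq_len, hidden_size)
    mask = text_code['attention_mask'].unsqueeze(-1).float()   # (1, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)                        # sum over real (non-padding) tokens
    counts = mask.sum(dim=1).clamp(min=1)                      # number of real tokens
    return summed / counts                                     # (1, hidden_size)

cs1 = cosine_similarity(encode_text_mean_pooled('蘋果'), encode_text_mean_pooled('鳳梨'))
cs2 = cosine_similarity(encode_text_mean_pooled('蘋果'), encode_text_mean_pooled('塑膠'))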

Sincerely,
Veda
