Modified BertClassifier class, new BertEmbeddingGenerator class #27
base: main
Conversation
Hi, I looked into it, and running `pytest tests/` on CPU returns the following error:
```
___________________________ ERROR collecting tests/test_tokenize_with_pooling.py ____________________________
tests/test_tokenize_with_pooling.py:15: in <module>
    MODEL = BertClassifierWithPooling(**MODEL_PARAMS, device="cpu")
belt_nlp/bert_with_pooling.py:47: in __init__
    super().__init__(
belt_nlp/bert.py:66: in __init__
    self.neural_network.to(device)
.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1137: in to
    raise TypeError('nn.Module.to only accepts floating point or complex '
E   TypeError: nn.Module.to only accepts floating point or complex dtypes, but got desired dtype=torch.bool
```
It works on the main branch. Can you check it out?
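A plausible reading of the traceback (an assumption on my part, not confirmed in the thread): a boolean such as trust_remote_code ends up in the device argument, and nn.Module.to parses a bare True as dtype=torch.bool. A minimal sketch that reproduces the same error:

```python
import torch

model = torch.nn.Linear(4, 2)
model.to("cpu")  # fine: a device string

try:
    # A bool passed where a device is expected is parsed as dtype=torch.bool.
    model.to(True)
except TypeError as err:
    print(err)  # "nn.Module.to only accepts floating point or complex dtypes, ..."
```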
```python
from belt_nlp.bert_embedder_pooling import BertEmbeddingGenerator
```
Tests should be placed in the directory ./tests
Yeah, sorry, I just added it in a hurry. Try running the tests again now; I forgot to add trust_remote_code to the other classes.
- Good job adding the class for generating embeddings.
- I checked that the added test works as expected.
- The code needs some serious refactoring to make it compatible with our project - I added hints and suggestions in the comments.
- As a consequence, while the code works, it might be difficult to modify or improve in its current form.
- It would be great if you could rewrite it according to the composition-over-inheritance principle.
```python
chunk_size=500,
stride=125,
minimal_chunk_length=200,
pretrained_model_name_or_path='ltg/norbert3-base',
```
This confused me... It seems that NorBERT is a model for the Norwegian language, yet you are using it with English examples. I suggest using the default model bert-base-uncased for test purposes.
I agree. The model was set to NorBERT to test the trust_remote_code parameter, but for future test purposes bert-base is more suitable. I'll take a look at your other comments and try to improve the implementation when I have time.
```python
minimal_chunk_length=200,
pretrained_model_name_or_path='ltg/norbert3-base',
trust_remote_code=True,
device="cuda",
```
I suggest running all tests on cpu.
```python
# Dummy test texts based on the structure I expect to process text in my project
applications = {
    "app_1": [
        "The first document of the first application delves deeply into the importance of artificial intelligence (AI) in modern software development, exploring various facets such as algorithm optimization, data analysis, and automated decision-making processes. It emphasizes how AI technologies have become indispensable in creating more efficient, intelligent, and user-friendly applications. The document further discusses the integration of AI in developing software solutions, highlighting case studies where AI-driven applications have significantly improved performance and user engagement. Additionally, it touches upon the ethical considerations and potential biases in AI, advocating for responsible development practices that ensure fairness and transparency. The discussion extends to the future of software development in the AI era, speculating on emerging trends, potential challenges, and the evolving role of developers in an increasingly automated industry. [the same paragraph repeats three more times so that the text exceeds the 512-token limit]",
```
I suggest using public datasets as examples - see the IMDB review data we used in notebooks. It would also be nice to use the same examples in all tests/notebooks.
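For example, a handful of IMDB reviews could be pulled in with the datasets library (a sketch; the notebooks may load the data differently):

```python
from datasets import load_dataset

# Grab a few long movie reviews as realistic test texts.
imdb = load_dataset("imdb", split="train[:5]")
sample_texts = [row["text"] for row in imdb]
```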
```python
from belt_nlp.splitting import transform_list_of_texts


class BertEmbeddingGenerator(BertClassifier):
```
This inheritance from BertClassifier is confusing - this is not a classifier. I suggest refactoring it using the composition-over-inheritance principle.
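A rough sketch of what composition could look like here (class and parameter names below are illustrative assumptions, not the project's actual API; chunk splitting would still be delegated to the existing transform_list_of_texts):

```python
import torch
from transformers import AutoModel, AutoTokenizer


class BertEmbeddingGenerator:
    """Holds the model and tokenizer as components instead of
    inheriting from BertClassifier."""

    def __init__(self, pretrained_model_name_or_path: str, device: str = "cpu",
                 trust_remote_code: bool = False, pooling_strategy: str = "mean"):
        self.tokenizer = AutoTokenizer.from_pretrained(
            pretrained_model_name_or_path, trust_remote_code=trust_remote_code
        )
        self.model = AutoModel.from_pretrained(
            pretrained_model_name_or_path, trust_remote_code=trust_remote_code
        ).to(device)
        self.device = device
        self.pooling_strategy = pooling_strategy

    def embed(self, text: str) -> torch.Tensor:
        # Chunk splitting (transform_list_of_texts) is omitted here; a single
        # truncated forward pass stands in for the per-chunk passes.
        tokens = self.tokenizer(text, return_tensors="pt", truncation=True).to(self.device)
        with torch.no_grad():
            sequence_output = self.model(**tokens).last_hidden_state[0]
        if self.pooling_strategy == "mean":
            return sequence_output.mean(dim=0).cpu()
        return sequence_output.max(dim=0).values.cpu()
```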
```python
}

belt_embeds = embed_doc(applications, belt_model)
print_embeddings_length(belt_embeds)
```
This test would work better as a notebook.
We use the following distinction:
- In tests we put automatic checks that are run to catch errors. There are asserts but no prints.
- In notebooks we put instructive code to show how it works step by step. There are prints but no asserts.
This file looks to me like the second case.
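As a concrete illustration, the final print could become an assertion (the structure of belt_embeds and the hidden size of 768 are assumptions based on the snippet above):

```python
belt_embeds = embed_doc(applications, belt_model)

# A test checks properties instead of printing them:
for app_id, doc_embeddings in belt_embeds.items():
    for embedding in doc_embeddings:
        assert embedding.shape[-1] == 768  # hidden size of bert-base-uncased
```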
```python
if self.pooling_strategy == "mean":
    pooled_output = torch.mean(sequence_output, dim=0).detach().cpu()
elif self.pooling_strategy == "max":
    pooled_output = torch.max(sequence_output, dim=0).values
```
I am curious about this part:
- For the classifier we aggregate probabilities over chunks using the mean or max function - this makes sense heuristically. For example, when we build a classifier for finding hate speech, it is sufficient that the hate speech appears in a single chunk. Hence the max aggregation.
- I am not sure if aggregating embeddings is that simple - mean aggregation still makes sense. However, I do not know how to interpret the max aggregation... Do you have some heuristic, theoretical, or experimental rationale in mind?
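To make the question concrete, here is a toy comparison of the two aggregations over a stack of chunk embeddings (shapes are illustrative):

```python
import torch

chunk_embeddings = torch.randn(3, 768)  # 3 chunks, hidden size 768

mean_pooled = chunk_embeddings.mean(dim=0)       # averages each dimension over chunks
max_pooled = chunk_embeddings.max(dim=0).values  # element-wise max per dimension

# Mean pooling keeps every chunk's contribution. Element-wise max mixes
# coordinates from different chunks, so the result need not lie near any
# single chunk's embedding, which is what makes it hard to interpret.
```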
Added a script with text for testing. The new class can probably be avoided, and things could probably be more efficient; I just threw this together quickly for my own testing. From what I can see it seems to do what I want, but I'm not very experienced with modifying code like this. I would appreciate some feedback on whether it produces embeddings for texts over the 512-token limit. From u/Hellbink on reddit. The goal is to use the embeddings for similarity comparisons later.
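Since the stated goal is similarity, a minimal sketch of comparing two pooled document embeddings (the tensors here are stand-ins for real outputs):

```python
import torch
import torch.nn.functional as F

embedding_a = torch.randn(768)  # stand-in for a pooled document embedding
embedding_b = torch.randn(768)

similarity = F.cosine_similarity(embedding_a, embedding_b, dim=0)
print(f"cosine similarity: {similarity.item():.3f}")
```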