Conversation

@DevBerge DevBerge commented Feb 4, 2024

Added a script with text for testing. The new class could probably be avoided and things could probably be made more efficient; I just threw this together quickly for my own testing. From what I can see it does what I want, but I'm not very experienced with modifying code like this. I'd appreciate feedback on whether it actually produces embeddings for texts over the 512-token limit (requested by u/Hellbink on reddit). The goal is to use the embeddings for similarity comparisons later.

@MichalBrzozowski91 MichalBrzozowski91 self-assigned this Feb 23, 2024
Contributor

@MichalBrzozowski91 MichalBrzozowski91 left a comment

Hi, I looked into it. Running the tests (pytest tests/) on CPU returns the following error:

___________________________ ERROR collecting tests/test_tokenize_with_pooling.py ____________________________
tests/test_tokenize_with_pooling.py:15: in <module>
    MODEL = BertClassifierWithPooling(**MODEL_PARAMS, device="cpu")
belt_nlp/bert_with_pooling.py:47: in __init__
    super().__init__(
belt_nlp/bert.py:66: in __init__
    self.neural_network.to(device)
.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1137: in to
    raise TypeError('nn.Module.to only accepts floating point or complex '
E   TypeError: nn.Module.to only accepts floating point or complex dtypes, but got desired dtype=torch.bool

It works on the main branch. Can you check it out?
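For context, this is the error torch raises whenever Module.to receives a non-floating, non-complex dtype; the traceback suggests a boolean value (my guess: trust_remote_code passed positionally) ended up in the device argument. A minimal reproduction, independent of this repository:

```python
import torch

# nn.Module.to rejects non-floating, non-complex dtypes such as
# torch.bool; passing one reproduces the error from the traceback.
module = torch.nn.Linear(2, 2)
try:
    module.to(torch.bool)
except TypeError as err:
    # "nn.Module.to only accepts floating point or complex dtypes, ..."
    print(err)
```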

@@ -0,0 +1,68 @@
from belt_nlp.bert_embedder_pooling import BertEmbeddingGenerator
Contributor
Tests should be placed in the ./tests directory.

Author

Yeah, sorry, I added it in a hurry. Try running the tests again now; I forgot to add trust_remote_code to the other classes.

Contributor

@MichalBrzozowski91 MichalBrzozowski91 left a comment


  • Good job adding the class for generating embeddings.
  • I checked that the added test works as expected.
  • The code needs a serious refactor to make it compatible with our project; I added some hints and suggestions in the comments.
  • As a consequence, while the code works, it might be difficult to modify or improve in its current form.
  • It would be great if you could rewrite it according to the composition-over-inheritance principle.
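To illustrate the last point, here is a minimal sketch of what a composition-based design could look like. The name BertEmbedder and the callable model are hypothetical, not the project's actual API; in the real code the model would wrap a HuggingFace BERT instance.

```python
import torch


class BertEmbedder:
    """Hypothetical sketch: the embedder *holds* a model rather than
    inheriting from BertClassifier. `model` is any callable mapping a
    chunk tensor to an embedding tensor."""

    def __init__(self, model, pooling_strategy: str = "mean"):
        if pooling_strategy not in ("mean", "max"):
            raise ValueError(f"Unknown pooling strategy: {pooling_strategy}")
        self.model = model  # composition: a collaborator, not a base class
        self.pooling_strategy = pooling_strategy

    def embed(self, chunks: list) -> torch.Tensor:
        # One embedding per chunk, then aggregate over the chunk dimension.
        outputs = torch.stack([self.model(chunk) for chunk in chunks])
        if self.pooling_strategy == "mean":
            return outputs.mean(dim=0)
        return outputs.max(dim=0).values
```

With this shape, the pooling logic can be unit-tested with a stub model and the classifier class stays untouched.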

chunk_size=500,
stride=125,
minimal_chunk_length = 200,
pretrained_model_name_or_path='ltg/norbert3-base',
Contributor

This confused me... Norbert is a model for the Norwegian language, yet you are using it with English examples.
I suggest using the default model bert-base-uncased for test purposes.
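For example, the test parameters could look like this (a sketch; the key names mirror the parameters quoted in the diff above and should match what BertClassifierWithPooling actually accepts):

```python
# Hypothetical test configuration using the default English model.
MODEL_PARAMS = {
    "chunk_size": 500,
    "stride": 125,
    "minimal_chunk_length": 200,
    "pretrained_model_name_or_path": "bert-base-uncased",
}
```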

Author

I agree. The model was set to norbert to test the trust_remote_code parameter, but for future test purposes bert-base is more suitable. I'll take a look at your other comments and try to improve the implementation when I have time.

minimal_chunk_length = 200,
pretrained_model_name_or_path='ltg/norbert3-base',
trust_remote_code=True,
device="cuda",
Contributor

I suggest running all tests on CPU.

# Dummy test texts based on the structure I expect to process text in my project
applications = {
"app_1": [
"The first document of the first application delves deeply into the importance of artificial intelligence (AI) in modern software development, exploring various facets such as algorithm optimization, data analysis, and automated decision-making processes. It emphasizes how AI technologies have become indispensable in creating more efficient, intelligent, and user-friendly applications. The document further discusses the integration of AI in developing software solutions, highlighting case studies where AI-driven applications have significantly improved performance and user engagement. Additionally, it touches upon the ethical considerations and potential biases in AI, advocating for responsible development practices that ensure fairness and transparency. The discussion extends to the future of software development in the AI era, speculating on emerging trends, potential challenges, and the evolving role of developers in an increasingly automated industry.The first document of the first application delves deeply into the importance of artificial intelligence (AI) in modern software development, exploring various facets such as algorithm optimization, data analysis, and automated decision-making processes. It emphasizes how AI technologies have become indispensable in creating more efficient, intelligent, and user-friendly applications. The document further discusses the integration of AI in developing software solutions, highlighting case studies where AI-driven applications have significantly improved performance and user engagement. Additionally, it touches upon the ethical considerations and potential biases in AI, advocating for responsible development practices that ensure fairness and transparency. 
The discussion extends to the future of software development in the AI era, speculating on emerging trends, potential challenges, and the evolving role of developers in an increasingly automated industry.The first document of the first application delves deeply into the importance of artificial intelligence (AI) in modern software development, exploring various facets such as algorithm optimization, data analysis, and automated decision-making processes. It emphasizes how AI technologies have become indispensable in creating more efficient, intelligent, and user-friendly applications. The document further discusses the integration of AI in developing software solutions, highlighting case studies where AI-driven applications have significantly improved performance and user engagement. Additionally, it touches upon the ethical considerations and potential biases in AI, advocating for responsible development practices that ensure fairness and transparency. The discussion extends to the future of software development in the AI era, speculating on emerging trends, potential challenges, and the evolving role of developers in an increasingly automated industry. The first document of the first application delves deeply into the importance of artificial intelligence (AI) in modern software development, exploring various facets such as algorithm optimization, data analysis, and automated decision-making processes. It emphasizes how AI technologies have become indispensable in creating more efficient, intelligent, and user-friendly applications. The document further discusses the integration of AI in developing software solutions, highlighting case studies where AI-driven applications have significantly improved performance and user engagement. Additionally, it touches upon the ethical considerations and potential biases in AI, advocating for responsible development practices that ensure fairness and transparency. 
The discussion extends to the future of software development in the AI era, speculating on emerging trends, potential challenges, and the evolving role of developers in an increasingly automated industry.",
Contributor

I suggest using public datasets as examples; see the IMDB review data we used in the notebooks. It would also be nice to use the same examples in all tests and notebooks.

from belt_nlp.splitting import transform_list_of_texts


class BertEmbeddingGenerator(BertClassifier):
Contributor

This inheritance from BertClassifier is confusing: this class is not a classifier. I suggest refactoring it using the composition-over-inheritance principle.

}

belt_embeds = embed_doc(applications, belt_model)
print_embeddings_length(belt_embeds)
Contributor

This test would work better as a notebook.
We use the following distinction:

  • In tests we put automatic checks to be run to see whether anything errors. There are asserts but no prints.
  • In notebooks we put instructive code that shows how things work step by step. There are prints but no asserts.

This file looks to me like the second case.
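For example, a test in ./tests following this convention might look like the following (a hypothetical sketch, not the project's actual test):

```python
import torch


def test_mean_pooled_embedding_keeps_hidden_size():
    # Three stand-in chunk embeddings with hidden size 8.
    chunk_embeddings = torch.randn(3, 8)
    pooled = chunk_embeddings.mean(dim=0)  # mean pooling over chunks
    # Assert instead of print: the pooled vector keeps the hidden size.
    assert pooled.shape == (8,)
```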

if self.pooling_strategy == "mean":
pooled_output = torch.mean(sequence_output, dim=0).detach().cpu()
elif self.pooling_strategy == "max":
pooled_output = torch.max(sequence_output, dim=0).values
Contributor

I am curious about this part:

  • For the classifier we aggregate probabilities over chunks using the mean or max function, which makes sense heuristically. For example, when we build a classifier for detecting hate speech, it is sufficient that the hate speech appears in a single chunk; hence the max aggregation.
  • I am not sure aggregating embeddings is that simple. Mean aggregation still makes sense, but I do not know how to interpret max aggregation... Do you have a heuristic, theoretical, or experimental rationale in mind?
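To make the question concrete, here is a toy example of the two aggregations over chunk embeddings (the values are made up):

```python
import torch

# Two chunk embeddings with two dimensions each (made-up values).
chunk_embeddings = torch.tensor([[1.0, -2.0],
                                 [3.0,  0.0]])

mean_pooled = chunk_embeddings.mean(dim=0)       # [2., -1.]
max_pooled = chunk_embeddings.max(dim=0).values  # [3.,  0.]
```

Note that max takes per-dimension maxima, so max_pooled need not equal any single chunk's embedding, which is part of why its interpretation as a document embedding is unclear.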
