Conversation

@DevBerge DevBerge commented Feb 4, 2024

Added a script with text for testing. The new class could probably be avoided and things could probably be made more efficient; I just threw this together quickly for my own testing. From what I can see it does what I want, but I'm not very experienced with modifying code like this. I'd appreciate feedback on whether it actually produces embeddings for texts over the 512-token limit (requested by u/Hellbink on reddit). The goal is to use the embeddings for similarity comparisons later.

@MichalBrzozowski91 MichalBrzozowski91 self-assigned this Feb 23, 2024
Contributor

@MichalBrzozowski91 MichalBrzozowski91 left a comment

Hi, I looked into it. Running the tests (pytest tests/) on CPU returns the following error:

___________________________ ERROR collecting tests/test_tokenize_with_pooling.py ____________________________
tests/test_tokenize_with_pooling.py:15: in <module>
    MODEL = BertClassifierWithPooling(**MODEL_PARAMS, device="cpu")
belt_nlp/bert_with_pooling.py:47: in __init__
    super().__init__(
belt_nlp/bert.py:66: in __init__
    self.neural_network.to(device)
.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1137: in to
    raise TypeError('nn.Module.to only accepts floating point or complex '
E   TypeError: nn.Module.to only accepts floating point or complex dtypes, but got desired dtype=torch.bool

It works on the main branch. Can you check it out?
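For context, this is the error torch raises whenever Module.to receives a non-floating, non-complex dtype; the traceback suggests a boolean value (my guess: trust_remote_code passed positionally) ended up in the device argument. A minimal reproduction, independent of this repository:

```python
import torch

# nn.Module.to rejects non-floating, non-complex dtypes such as
# torch.bool; passing one reproduces the error from the traceback.
module = torch.nn.Linear(2, 2)
try:
    module.to(torch.bool)
except TypeError as err:
    # "nn.Module.to only accepts floating point or complex dtypes, ..."
    print(err)
```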

@@ -0,0 +1,68 @@
from belt_nlp.bert_embedder_pooling import BertEmbeddingGenerator
Contributor
Tests should be placed in the ./tests directory.

Author

Yeah, sorry, I added it in a hurry. Try running the tests again now; I forgot to add trust_remote_code to the other classes.

Contributor

@MichalBrzozowski91 MichalBrzozowski91 left a comment


  • Good job adding the class for generating embeddings.
  • I checked that the added test works as expected.
  • The code needs a serious refactor to make it compatible with our project; I added some hints and suggestions in the comments.
  • As a consequence, while the code works, it might be difficult to modify or improve in its current form.
  • It would be great if you could rewrite it according to the composition-over-inheritance principle.
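To illustrate the last point, here is a minimal sketch of what a composition-based design could look like. The name BertEmbedder and the callable model are hypothetical, not the project's actual API; in the real code the model would wrap a HuggingFace BERT instance.

```python
import torch


class BertEmbedder:
    """Hypothetical sketch: the embedder *holds* a model rather than
    inheriting from BertClassifier. `model` is any callable mapping a
    chunk tensor to an embedding tensor."""

    def __init__(self, model, pooling_strategy: str = "mean"):
        if pooling_strategy not in ("mean", "max"):
            raise ValueError(f"Unknown pooling strategy: {pooling_strategy}")
        self.model = model  # composition: a collaborator, not a base class
        self.pooling_strategy = pooling_strategy

    def embed(self, chunks: list) -> torch.Tensor:
        # One embedding per chunk, then aggregate over the chunk dimension.
        outputs = torch.stack([self.model(chunk) for chunk in chunks])
        if self.pooling_strategy == "mean":
            return outputs.mean(dim=0)
        return outputs.max(dim=0).values
```

With this shape, the pooling logic can be unit-tested with a stub model and the classifier class stays untouched.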

chunk_size=500,
stride=125,
minimal_chunk_length = 200,
pretrained_model_name_or_path='ltg/norbert3-base',
Contributor

This confused me... Norbert is a model for the Norwegian language, yet you are using it with English examples.
I suggest using the default model bert-base-uncased for test purposes.
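For example, the test parameters could look like this (a sketch; the key names mirror the parameters quoted in the diff above and should match what BertClassifierWithPooling actually accepts):

```python
# Hypothetical test configuration using the default English model.
MODEL_PARAMS = {
    "chunk_size": 500,
    "stride": 125,
    "minimal_chunk_length": 200,
    "pretrained_model_name_or_path": "bert-base-uncased",
}
```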

Author

I agree. The model was set to norbert to test the trust_remote_code parameter, but for future test purposes bert-base is more suitable. I'll take a look at your other comments and try to improve the implementation when I have time.

minimal_chunk_length = 200,
pretrained_model_name_or_path='ltg/norbert3-base',
trust_remote_code=True,
device="cuda",
Contributor

I suggest running all tests on CPU.

# Dummy test texts based on the structure I expect to process text in my project
applications = {
"app_1": [
"The first document of the first application delves deeply into the importance of artificial intelligence (AI) in modern software development, exploring various facets such as algorithm optimization, data analysis, and automated decision-making processes. It emphasizes how AI technologies have become indispensable in creating more efficient, intelligent, and user-friendly applications. The document further discusses the integration of AI in developing software solutions, highlighting case studies where AI-driven applications have significantly improved performance and user engagement. Additionally, it touches upon the ethical considerations and potential biases in AI, advocating for responsible development practices that ensure fairness and transparency. The discussion extends to the future of software development in the AI era, speculating on emerging trends, potential challenges, and the evolving role of developers in an increasingly automated industry.The first document of the first application delves deeply into the importance of artificial intelligence (AI) in modern software development, exploring various facets such as algorithm optimization, data analysis, and automated decision-making processes. It emphasizes how AI technologies have become indispensable in creating more efficient, intelligent, and user-friendly applications. The document further discusses the integration of AI in developing software solutions, highlighting case studies where AI-driven applications have significantly improved performance and user engagement. Additionally, it touches upon the ethical considerations and potential biases in AI, advocating for responsible development practices that ensure fairness and transparency. 
The discussion extends to the future of software development in the AI era, speculating on emerging trends, potential challenges, and the evolving role of developers in an increasingly automated industry.The first document of the first application delves deeply into the importance of artificial intelligence (AI) in modern software development, exploring various facets such as algorithm optimization, data analysis, and automated decision-making processes. It emphasizes how AI technologies have become indispensable in creating more efficient, intelligent, and user-friendly applications. The document further discusses the integration of AI in developing software solutions, highlighting case studies where AI-driven applications have significantly improved performance and user engagement. Additionally, it touches upon the ethical considerations and potential biases in AI, advocating for responsible development practices that ensure fairness and transparency. The discussion extends to the future of software development in the AI era, speculating on emerging trends, potential challenges, and the evolving role of developers in an increasingly automated industry. The first document of the first application delves deeply into the importance of artificial intelligence (AI) in modern software development, exploring various facets such as algorithm optimization, data analysis, and automated decision-making processes. It emphasizes how AI technologies have become indispensable in creating more efficient, intelligent, and user-friendly applications. The document further discusses the integration of AI in developing software solutions, highlighting case studies where AI-driven applications have significantly improved performance and user engagement. Additionally, it touches upon the ethical considerations and potential biases in AI, advocating for responsible development practices that ensure fairness and transparency. 
The discussion extends to the future of software development in the AI era, speculating on emerging trends, potential challenges, and the evolving role of developers in an increasingly automated industry.",
Contributor

I suggest using public datasets as examples; see the IMDB review data we used in the notebooks. It would also be nice to use the same examples in all tests and notebooks.

from belt_nlp.splitting import transform_list_of_texts


class BertEmbeddingGenerator(BertClassifier):
Contributor

This inheritance from BertClassifier is confusing: this class is not a classifier. I suggest refactoring it using the composition-over-inheritance principle.

}

belt_embeds = embed_doc(applications, belt_model)
print_embeddings_length(belt_embeds)
Contributor

This test would work better as a notebook.
We use the following distinction:

  • In tests we put automatic checks to be run to see whether anything errors. There are asserts but no prints.
  • In notebooks we put instructive code that shows how things work step by step. There are prints but no asserts.

This file looks to me like the second case.
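For example, a test in ./tests following this convention might look like the following (a hypothetical sketch, not the project's actual test):

```python
import torch


def test_mean_pooled_embedding_keeps_hidden_size():
    # Three stand-in chunk embeddings with hidden size 8.
    chunk_embeddings = torch.randn(3, 8)
    pooled = chunk_embeddings.mean(dim=0)  # mean pooling over chunks
    # Assert instead of print: the pooled vector keeps the hidden size.
    assert pooled.shape == (8,)
```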

if self.pooling_strategy == "mean":
pooled_output = torch.mean(sequence_output, dim=0).detach().cpu()
elif self.pooling_strategy == "max":
pooled_output = torch.max(sequence_output, dim=0).values
Contributor

I am curious about this part:

  • For the classifier we aggregate probabilities over chunks using the mean or max function, which makes sense heuristically. For example, when we build a classifier for detecting hate speech, it is sufficient that the hate speech appears in a single chunk; hence the max aggregation.
  • I am not sure aggregating embeddings is that simple. Mean aggregation still makes sense, but I do not know how to interpret max aggregation... Do you have a heuristic, theoretical, or experimental rationale in mind?
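To make the question concrete, here is a toy example of the two aggregations over chunk embeddings (the values are made up):

```python
import torch

# Two chunk embeddings with two dimensions each (made-up values).
chunk_embeddings = torch.tensor([[1.0, -2.0],
                                 [3.0,  0.0]])

mean_pooled = chunk_embeddings.mean(dim=0)       # [2., -1.]
max_pooled = chunk_embeddings.max(dim=0).values  # [3.,  0.]
```

Note that max takes per-dimension maxima, so max_pooled need not equal any single chunk's embedding, which is part of why its interpretation as a document embedding is unclear.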
