I tried to correct spelling mistakes in a large corpus of texts.
import spacy
import contextualSpellCheck
spacy_nlp = spacy.load(
'en_core_web_sm',
# disable=['ner']
    disable=['parser', 'ner']  # disable extra components for efficiency
)
contextualSpellCheck.add_to_pipe(spacy_nlp)
corpus_spacy = [spacy_nlp(doc) for doc in corpus_raw]
At first, I faced this error:
ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe('sentencizer'). Alternatively, add the dependency parser or sentence recognizer, or set sentence boundaries by setting doc[i].is_sent_start.
So, I added the sentencizer component to the pipeline.
import spacy
import contextualSpellCheck
spacy_nlp = spacy.load(
'en_core_web_sm',
# disable=['ner']
    disable=['parser', 'ner']  # disable extra components for efficiency
)
spacy_nlp.add_pipe('sentencizer')
contextualSpellCheck.add_to_pipe(spacy_nlp)
corpus_spacy = [spacy_nlp(doc) for doc in corpus_raw]
This time I faced this error:
RuntimeError: The expanded size of the tensor (837) must match the existing size (512) at non-singleton dimension 1. Target sizes: [1, 837]. Tensor sizes: [1, 512]
I guess this is due to BERT's 512-token input limit. However, I believe there should be a way to catch this error and skip the spell check for documents that are too long.
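Something along these lines is what I have in mind (just a rough sketch; spacy_nlp_plain is a hypothetical second pipeline without the spell checker that I use only as a fallback):

import spacy
import contextualSpellCheck

# Pipeline with the spell checker
spacy_nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
spacy_nlp.add_pipe('sentencizer')
contextualSpellCheck.add_to_pipe(spacy_nlp)

# Fallback pipeline without the spell checker, for texts that are too long for BERT
spacy_nlp_plain = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
spacy_nlp_plain.add_pipe('sentencizer')

corpus_spacy = []
for doc in corpus_raw:
    try:
        corpus_spacy.append(spacy_nlp(doc))
    except RuntimeError:
        # Text exceeds BERT's input size; process it without the spell check
        corpus_spacy.append(spacy_nlp_plain(doc))

Ideally, though, the library itself would handle (or at least document) this case instead of requiring a workaround on the user side.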