
[BUG] Sentence context greater than 512 characters #64

@xei

Description

I tried to correct spelling mistakes in a large text.

import spacy
import contextualSpellCheck

spacy_nlp = spacy.load(
    'en_core_web_sm',
    disable=['parser', 'ner']  # disable extra components for efficiency
)
contextualSpellCheck.add_to_pipe(spacy_nlp)

# corpus_raw: a list of raw text strings (defined elsewhere)
corpus_spacy = [spacy_nlp(doc) for doc in corpus_raw]

At first, I faced this error:
ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe('sentencizer'). Alternatively, add the dependency parser or sentence recognizer, or set sentence boundaries by setting doc[i].is_sent_start.

So, I added the sentencizer component to the pipeline.

import spacy
import contextualSpellCheck

spacy_nlp = spacy.load(
    'en_core_web_sm',
    disable=['parser', 'ner']  # disable extra components for efficiency
)
spacy_nlp.add_pipe('sentencizer')
contextualSpellCheck.add_to_pipe(spacy_nlp)

corpus_spacy = [spacy_nlp(doc) for doc in corpus_raw]

This time I faced this error:
RuntimeError: The expanded size of the tensor (837) must match the existing size (512) at non-singleton dimension 1. Target sizes: [1, 837]. Tensor sizes: [1, 512]
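
The 512 in the traceback matches BERT's maximum sequence length, so a quick way to confirm which sentences overflow is to count sub-word tokens per sentence. Here is a sketch; it assumes the bert-base-cased tokenizer (which contextualSpellCheck appears to use by default) and the 'contextual spellchecker' pipe name, both assumptions on my part:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

# Temporarily disable the spell checker so long texts can be inspected
# without triggering the RuntimeError. The pipe name is an assumption.
with spacy_nlp.select_pipes(disable=['contextual spellchecker']):
    for text in corpus_raw:
        for sent in spacy_nlp(text).sents:
            n_tokens = len(tokenizer(sent.text)['input_ids'])
            if n_tokens > 512:
                print(n_tokens, sent.text[:80])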

I guess this is due to BERT's 512-token input limit. However, I believe there should be a way to catch this error and bypass the spell check for the offending text.
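
Until the library handles this internally, one workaround is to catch the error and re-run the offending document with the spell checker disabled. Again a sketch, not the library's own API; the 'contextual spellchecker' pipe name is an assumption based on the README:

def safe_nlp(text):
    """Run the full pipeline; fall back to no spell check on overflow."""
    try:
        return spacy_nlp(text)
    except RuntimeError:
        # A sentence exceeded the 512-token window; skip spell checking only,
        # keeping the rest of the pipeline (tagger, sentencizer) intact.
        with spacy_nlp.select_pipes(disable=['contextual spellchecker']):
            return spacy_nlp(text)

corpus_spacy = [safe_nlp(doc) for doc in corpus_raw]

This way the rest of the corpus still gets spell-checked, and only the over-long documents are processed without it.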
