Rare crash triggered by an invalid ignore pattern: UnboundLocalError: local variable 'compiledPattern' referenced before assignment #596

@JustAnotherArchivist

Job 2mt13kxolzln2i6awfxyprnud crashed with this traceback:

Pattern ^https?://www\.pinterest.\com/.*\.js$ is invalid (error: bad escape \c at position 25).  Ignored.
ERROR Fatal exception.
Traceback (most recent call last):
  File "/home/archivebot/.pyenv/versions/3.6.15/envs/archivebot-20241016/lib/python3.6/site-packages/wpull/application/app.py", line 157, in run
    yield from pipeline.process()
  File "/home/archivebot/.pyenv/versions/3.6.15/envs/archivebot-20241016/lib/python3.6/site-packages/wpull/pipeline/pipeline.py", line 194, in process
    yield from self._process_one_worker()
  File "/home/archivebot/.pyenv/versions/3.6.15/envs/archivebot-20241016/lib/python3.6/site-packages/wpull/pipeline/pipeline.py", line 215, in _process_one_worker
    task.result()
  File "/home/archivebot/.pyenv/versions/3.6.15/envs/archivebot-20241016/lib/python3.6/site-packages/wpull/pipeline/pipeline.py", line 119, in process
    item = yield from self.process_one(_worker_id=worker_id)
  File "/home/archivebot/.pyenv/versions/3.6.15/envs/archivebot-20241016/lib/python3.6/site-packages/wpull/pipeline/pipeline.py", line 103, in process_one
    yield from task.process(item)
  File "/home/archivebot/.pyenv/versions/3.6.15/envs/archivebot-20241016/lib/python3.6/site-packages/wpull/application/tasks/download.py", line 492, in process
    yield from session.app_session.factory['Processor'].process(session)
  File "/home/archivebot/.pyenv/versions/3.6.15/envs/archivebot-20241016/lib/python3.6/site-packages/wpull/processor/delegate.py", line 29, in process
    return (yield from processor.process(item_session))
  File "/home/archivebot/.pyenv/versions/3.6.15/envs/archivebot-20241016/lib/python3.6/site-packages/wpull/processor/web.py", line 92, in process
    return (yield from session.process())
  File "/home/archivebot/.pyenv/versions/3.6.15/envs/archivebot-20241016/lib/python3.6/site-packages/wpull/processor/web.py", line 174, in process
    ok = yield from self._process_robots()
  File "/home/archivebot/.pyenv/versions/3.6.15/envs/archivebot-20241016/lib/python3.6/site-packages/wpull/processor/web.py", line 201, in _process_robots
    request))
  File "/home/archivebot/.pyenv/versions/3.6.15/envs/archivebot-20241016/lib/python3.6/site-packages/wpull/processor/web.py", line 367, in _should_fetch_reason_with_robots
    self._fetch_rule.check_initial_web_request(self._item_session, request)
  File "/home/archivebot/.pyenv/versions/3.6.15/envs/archivebot-20241016/lib/python3.6/site-packages/wpull/processor/rule.py", line 179, in check_initial_web_request
    item_session, verdict, reason, test_info
  File "/home/archivebot/.pyenv/versions/3.6.15/envs/archivebot-20241016/lib/python3.6/site-packages/wpull/processor/rule.py", line 130, in consult_hook
    PluginFunctions.accept_url, item_session, verdict, reasons,
  File "/home/archivebot/.pyenv/versions/3.6.15/envs/archivebot-20241016/lib/python3.6/site-packages/wpull/application/hook.py", line 81, in call
    return self._callbacks[name](*args, **kwargs)
  File "/home/archivebot/.pyenv/versions/3.6.15/envs/archivebot-20241016/lib/python3.6/site-packages/wpull/application/plugin.py", line 49, in wrapper
    return func(*args, **kwargs)
  File "archive_bot_plugin.py", line 227, in accept_url
    pattern = self.settings.ignore_url(item_session.url_record)
  File "/home/archivebot/ArchiveBot-c/pipeline/archivebot/wpull/settings.py", line 50, in ignore_url
    return self.ignoracle.ignores(record_info)
  File "/home/archivebot/ArchiveBot-c/pipeline/archivebot/wpull/ignoracle.py", line 110, in ignores
    self._compiled.append((pattern, compiledPattern))
UnboundLocalError: local variable 'compiledPattern' referenced before assignment

The crash only occurs when the invalid ignore pattern is the first one yielded by the pattern set iterator, because compiledPattern has not been assigned yet at that point. Otherwise, the compiled regex left over from the previous pattern is appended a second time, which causes no harm apart from a very minor performance impact.

The fix is for the exception handler to continue to the next pattern, so that the append is skipped whenever a pattern fails to compile.
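
A minimal sketch of the failure mode and the proposed fix. The real loop in ignoracle.py is not shown in this issue, so the function names, the loop shape, and the use of `re.compile` with `re.error` below are assumptions that only mirror the behaviour visible in the traceback.

```python
import re

# Pattern copied from the log line above; "\c" is a bad escape for re.compile.
BAD = r'^https?://www\.pinterest.\com/.*\.js$'
GOOD = r'^https?://example\.com/'  # hypothetical valid pattern, for contrast


def compile_patterns_buggy(patterns):
    """Assumed shape of the current code: the handler falls through to append."""
    compiled = []
    for pattern in patterns:
        try:
            compiled_pattern = re.compile(pattern)
        except re.error:
            pass  # the invalid pattern is reported, but the loop body carries on
        # Runs even after a failed compile: unbound on the first iteration,
        # stale (left over from the previous pattern) on later ones.
        compiled.append((pattern, compiled_pattern))
    return compiled


def compile_patterns_fixed(patterns):
    """Same loop, with the handler skipping to the next pattern."""
    compiled = []
    for pattern in patterns:
        try:
            compiled_pattern = re.compile(pattern)
        except re.error:
            continue  # skip the append for a pattern that did not compile
        compiled.append((pattern, compiled_pattern))
    return compiled
```

With the invalid pattern first, `compile_patterns_buggy([BAD, GOOD])` raises UnboundLocalError; with it second, the compiled form of GOOD is appended twice. `compile_patterns_fixed` returns only the valid entry in both orders.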