Splash memory leak #312

@Ethan353

I use scrapy-splash to make requests in my crawling service. Over time, the service's RAM usage increases continuously until it consumes all the RAM of the VM. The weird thing is that the Splash service itself works properly; it is the services that use Splash for requests that leak memory. For more detail, here are my code snippet and the Splash config I use:
code:

if condition_to_use_splash:
    return SplashRequest(url, errback=self.errback, callback=self.parse, meta=metadata, args={'wait': 7})
else:
    return FormRequest(url, dont_filter=True, errback=self.errback, method=method, formdata=parameter, meta=metadata)

config:

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DOWNLOADER_MIDDLEWARES = {
    'solaris_scrapy.solaris_scrapy.middlewares.ProxyMiddleware': 100,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    'scrapy_fake_useragent.middleware.RetryUserAgentMiddleware': 401,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
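
Since the growth shows up in the crawling service rather than in Splash, one option that might help bound it on the Scrapy side is Scrapy's built-in memory usage extension. The sketch below is not part of my current config; the threshold values are illustrative assumptions and would need tuning for the VM. It does not fix a leak, it only warns and then stops the crawl once the limit is exceeded.

```python
# settings.py (sketch; thresholds below are assumptions, not measured values)

# Enable Scrapy's memory usage extension for the crawling process.
MEMUSAGE_ENABLED = True

# Log a warning once the process exceeds this many megabytes.
MEMUSAGE_WARNING_MB = 1024

# Shut the crawl down once the process exceeds this many megabytes.
MEMUSAGE_LIMIT_MB = 2048

# How often (in seconds) the extension checks memory usage.
MEMUSAGE_CHECK_INTERVAL_SECONDS = 60.0
```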

I use the Splash 3.1 image, and this is the Docker Compose definition of my Splash service:

services:
  splash:
    image: scrapinghub/splash:3.1
    ports:
      - "port:port"
    networks:
      - net

Note that I run my code in a Docker container on a VM.
What do you think I should do about this? I am also aware of the memory limit, maxrss, and slots options for preventing Splash from using a lot of RAM, but using them causes my crawling service to miss a bunch of websites. How should I handle this in my code?
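
To narrow down where the crawling process itself accumulates memory, one possible approach is to compare `tracemalloc` snapshots taken some time apart. This is only a minimal sketch, assuming the growth comes from Python-level objects; the helper name `top_growth` and the stand-in allocation are hypothetical and not part of my service.

```python
import tracemalloc


def top_growth(before, after, limit=5):
    """Return up to `limit` source lines whose allocated size grew between snapshots."""
    # compare_to sorts the differences by absolute size change, largest first.
    stats = after.compare_to(before, "lineno")
    # Keep only lines that allocated more memory than before.
    return [stat for stat in stats if stat.size_diff > 0][:limit]


if __name__ == "__main__":
    tracemalloc.start()
    before = tracemalloc.take_snapshot()

    # Stand-in for objects the crawler keeps alive between requests.
    retained = [bytes(1024) for _ in range(1000)]

    after = tracemalloc.take_snapshot()
    for stat in top_growth(before, after):
        print(stat)
```

In a real crawl, the two snapshots would be taken several minutes apart while requests are running, so that the reported lines point at whatever keeps growing.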
