I use scrapy-splash for requests in my crawling service. After some time, the services' RAM usage increases continuously, and after a while they consume all of the VM's RAM. The weird thing is that the Splash service itself works properly, but the services that use Splash for requests have the memory leak. For more detail, here are my code snippet and the Splash config I use:
code:
if condition_to_use_splash:
    return SplashRequest(url, errback=self.errback, callback=self.parse,
                         meta=metadata, args={'wait': 7})
else:
    return FormRequest(url, dont_filter=True, errback=self.errback,
                       method=method, formdata=parameter, meta=metadata)
config:
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DOWNLOADER_MIDDLEWARES = {
    'solaris_scrapy.solaris_scrapy.middlewares.ProxyMiddleware': 100,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    'scrapy_fake_useragent.middleware.RetryUserAgentMiddleware': 401,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
I use Splash 3.1 as the Splash image, and this is my Splash service's docker-compose file:
services:
  splash:
    image: scrapinghub/splash:3.1
    ports:
      - "port:port"
    networks:
      - net
Note that I run my code in a Docker container on a VM.
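As a side note, to verify that the growth really is in the Scrapy process and not in Splash, I can dump live object counts with Scrapy's built-in trackref utility. A minimal sketch (the spider name is a stand-in for my real spider, and the spider_idle hook is just one convenient place to call it):

from scrapy import Spider, signals
from scrapy.utils.trackref import print_live_refs

class MySpider(Spider):
    name = 'my_spider'  # stand-in name

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # print live object counts whenever the spider goes idle
        crawler.signals.connect(spider.dump_refs, signal=signals.spider_idle)
        return spider

    def dump_refs(self):
        # Request/Response counts that grow without bound would mean
        # something in my code is keeping references to them alive
        print_live_refs()

Steadily growing Response counts there would point at the leak being on my side rather than in Splash.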
What do you think I should do about this? I'm also aware of the memory limit, maxrss, and slots options for preventing Splash from using lots of RAM, but that approach causes my crawling service to miss a bunch of websites. How should I handle it in my code?
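For reference, this is the kind of mitigation I mean; the --maxrss and --slots values and the restart policy below are placeholders I would have to tune, not what I currently run:

services:
  splash:
    image: scrapinghub/splash:3.1
    # --maxrss makes Splash exit once its RSS exceeds the given size in MB;
    # --slots caps concurrent renders. Both numbers are placeholders.
    command: --maxrss 2000 --slots 5
    # bring Splash back up after it exits due to --maxrss
    restart: unless-stopped
    ports:
      - "port:port"
    networks:
      - net

The downside is that renders in flight when Splash restarts fail, which is how my crawler ends up missing websites.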