python - Scrapy crawler ignores `DOWNLOADER_MIDDLEWARES` when run as a script
I'd like to acquire data, using Scrapy, from a few different sites and then perform analysis on that data. Since both the crawlers and the code that analyzes the data relate to the same project, I'd like to store everything in the same Git repository. I created a minimal reproducible example on GitHub.

The structure of the project looks like this:
```
./crawlers
./crawlers/__init__.py
./crawlers/myproject
./crawlers/myproject/__init__.py
./crawlers/myproject/myproject
./crawlers/myproject/myproject/__init__.py
./crawlers/myproject/myproject/items.py
./crawlers/myproject/myproject/pipelines.py
./crawlers/myproject/myproject/settings.py
./crawlers/myproject/myproject/spiders
./crawlers/myproject/myproject/spiders/__init__.py
./crawlers/myproject/myproject/spiders/example.py
./crawlers/myproject/scrapy.cfg
./scrapyscript.py
```
From the `./crawlers/myproject` folder, I can execute the crawler by typing:

```
scrapy crawl example
```
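The spider itself isn't reproduced here; for context, a minimal sketch of what a spider named `example` might look like (hypothetical; the actual `spiders/example.py` in the repo may differ):

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    # Hypothetical sketch of the spider referenced by `scrapy crawl example`
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # Just log each visited URL; the real spider presumably extracts items
        self.log("Visited %s" % response.url)
```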
The crawler uses a downloader middleware, specifically alecxe's excellent scrapy-fake-useragent. `settings.py` contains:

```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}
```
When executed using `scrapy crawl ...`, the user agent looks like a real browser. Here's a sample record from the webserver:

```
24.8.42.44 - - [16/Jun/2015:05:07:59 +0000] "GET / HTTP/1.1" 200 27161 "-" "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36"
```
Looking at the documentation, it's possible to run the equivalent of `scrapy crawl ...` from a script. My `scrapyscript.py` file, based on the documentation, looks like this:
```python
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings

from crawlers.myproject.myproject.spiders.example import ExampleSpider

spider = ExampleSpider()
settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()
```
When I execute the script, I can see that the crawler makes page requests. Unfortunately, it's ignoring `DOWNLOADER_MIDDLEWARES`. The user agent, for example, is no longer spoofed:

```
24.8.42.44 - - [16/Jun/2015:05:32:04 +0000] "GET / HTTP/1.1" 200 27161 "-" "Scrapy/0.24.6 (+http://scrapy.org)"
```
Somehow, when the crawler is executed from the script, it seems to be ignoring the settings in `settings.py`.

Can you see what I'm doing wrong?
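One way to confirm that the settings module is not being picked up is to print the relevant setting from the script before starting the crawl (a sketch, using the same `get_project_settings()` call as above):

```python
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
# If settings.py was not located, this prints an empty dict ({})
# rather than the DOWNLOADER_MIDDLEWARES mapping configured above.
print(settings.get('DOWNLOADER_MIDDLEWARES'))
```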
In order for `get_project_settings()` to find the desired `settings.py`, set the `SCRAPY_SETTINGS_MODULE` environment variable:
```python
import os
import sys

# ...

sys.path.append(os.path.join(os.path.curdir, "crawlers/myproject"))
os.environ['SCRAPY_SETTINGS_MODULE'] = 'myproject.settings'

settings = get_project_settings()
```
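Putting the pieces together, a complete `scrapyscript.py` might look like this (a sketch combining the original script with the fix, assuming the project layout shown above and the Scrapy 0.24-era API used in the question):

```python
import os
import sys

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings

# Make the inner Scrapy project importable and point Scrapy at its settings
sys.path.append(os.path.join(os.path.curdir, "crawlers/myproject"))
os.environ['SCRAPY_SETTINGS_MODULE'] = 'myproject.settings'

from crawlers.myproject.myproject.spiders.example import ExampleSpider

spider = ExampleSpider()
settings = get_project_settings()  # now loads myproject.settings
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()
```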
Note that, due to the location of the runner script, you need to add `myproject` to `sys.path`. Alternatively, you could move `scrapyscript.py` under the `myproject` directory.
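For that second option, `get_project_settings()` falls back to locating `scrapy.cfg` by searching upward from the current working directory, so if `scrapyscript.py` lives next to `scrapy.cfg` and is run from that directory, no environment variable or `sys.path` tweak should be needed (a sketch, assuming that layout):

```python
# scrapyscript.py moved to ./crawlers/myproject/, next to scrapy.cfg,
# and run from that directory.
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
# Should print the DOWNLOADER_MIDDLEWARES mapping from settings.py
print(settings.get('DOWNLOADER_MIDDLEWARES'))
```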