Obey Robots.txt
See original GitHub issue

Is scrapy-splash not compatible with obeying robots.txt? Every time I make a request, Scrapy attempts to download robots.txt from the Docker instance running Splash. My settings file is below. I suspect the problem is a misordering of the middlewares, but I'm not sure what the ordering should look like.
# -*- coding: utf-8 -*-
# Scrapy settings for ishop project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'ishop'
SPIDER_MODULES = ['ishop.spiders']
NEWSPIDER_MODULE = 'ishop.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'ishop (+http://www.ishop.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# DOWNLOAD_DELAY = 5
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'ishop.middlewares.IshopSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'ishop.middlewares.IshopDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'ishop.pipelines.HbasePipeline': 100
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
SPIDER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 25,
    'frontera.contrib.scrapy.middlewares.seeds.file.FileSeedLoader': 650,
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 999,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810
}
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'
# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]
FRONTERA_SETTINGS = 'ishop.frontera.spiders' # module path to your Frontera spider config module
SPLASH_URL = 'http://127.0.0.1:8050'
# SPLASH_URL= 'http://172.17.0.2:8050'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
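For context, requests in this setup presumably go through scrapy_splash.SplashRequest rather than plain scrapy.Request. Below is a minimal spider sketch that pairs with the settings above, assuming Splash is reachable at SPLASH_URL; the spider name, start URL, wait argument, and parse logic are illustrative placeholders (the Frontera seed loader is ignored for brevity), not code from the original issue.

# Minimal spider sketch matching the settings above (illustrative only).
import scrapy
from scrapy_splash import SplashRequest


class ExampleSpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        # With ROBOTSTXT_OBEY = True, Scrapy fetches robots.txt before
        # downloading this request. The behaviour reported in this issue
        # is that the robots.txt request ends up targeting the Splash
        # host (e.g. http://127.0.0.1:8050/robots.txt) instead of the
        # site being crawled.
        yield SplashRequest(
            'https://example.com/',
            callback=self.parse,
            endpoint='render.html',
            args={'wait': 0.5},
        )

    def parse(self, response):
        yield {'title': response.css('title::text').get()}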
Issue Analytics
- State:
- Created 5 years ago
- Reactions: 3
- Comments: 9
Top Results From Across the Web

- About obeying the robots.txt file: You can set the Web Crawler to either ignore or obey the robots.txt exclusion standard, as well as any META ROBOTS tags in...
- Obeying Robots Rules: A robots.txt file contains a list of User Agents and associated paths the User Agents are allowed, or disallowed, to follow. If you...
- Robots.txt Introduction and Guide | Google Search Central: Robots.txt is used to manage crawler traffic. Explore this robots.txt introduction guide to learn what robots.txt files are and how to use them...
- A Complete Guide to Robots.txt & Why It Matters - SEMrush: Robots.txt files tell search engine bots which URLs they can crawl and, more importantly, which ones they can't...
- Robots.txt Testing In The SEO Spider - Screaming Frog: All major search engine bots conform to the robots exclusion standard, and will read and obey the instructions of the robots.txt file, before...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Can you share this? Disabling robots.txt handling entirely is not an option.
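No concrete workaround is quoted in this excerpt. Since disabling robots.txt handling outright is not acceptable here, one stopgap is to check the rules manually in the spider before yielding each SplashRequest. The sketch below uses Python's standard urllib.robotparser; it is not a fix taken from this thread, and the spider name, URLs, and user agent string are placeholders.

# Workaround sketch (not from this thread): check robots.txt manually
# with urllib.robotparser before yielding Splash requests. Spider name,
# URLs, and user agent are placeholders.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

import scrapy
from scrapy_splash import SplashRequest


class PoliteSplashSpider(scrapy.Spider):
    name = 'polite_splash'
    user_agent = 'ishop (+http://www.ishop.com)'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._robot_parsers = {}  # per-host cache of parsed robots.txt

    def _allowed(self, url):
        # Fetch and cache robots.txt for the target host, then ask it
        # whether this user agent may crawl the given URL. Note that
        # RobotFileParser.read() is a blocking call, so this approach
        # only suits a small number of hosts.
        host = urlparse(url).netloc
        if host not in self._robot_parsers:
            parser = RobotFileParser()
            parser.set_url('https://%s/robots.txt' % host)
            parser.read()
            self._robot_parsers[host] = parser
        return self._robot_parsers[host].can_fetch(self.user_agent, url)

    def start_requests(self):
        for url in ['https://example.com/']:
            if self._allowed(url):
                yield SplashRequest(url, callback=self.parse, args={'wait': 0.5})

    def parse(self, response):
        yield {'title': response.css('title::text').get()}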