
IPv6 support? Problem running home page example from an IPv6 network


I’m running into problems trying to run the example from the scrapy.org home page while on the FOSDEM IPv6-only Wi-Fi network. (The same spider works fine from an IPv4 network.)
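
For reference, a minimal spider of the shape the log output below suggests (the spider name and start URL are taken from that output; the selector is only illustrative and may differ from the actual home-page example of the time):

# myspider.py: a minimal stand-in for the home-page example
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://blog.scrapinghub.com']

    def parse(self, response):
        # Yield one item per post title; plain-dict items need Scrapy >= 1.0,
        # older releases (such as the 0.24 used below) required Item classes.
        for title in response.css('h2 a::text').extract():
            yield {'title': title}

Saved as myspider.py, it is run with scrapy runspider myspider.py, as in the transcripts below.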

If both IPv4 and IPv6 are enabled on my computer (OS X Yosemite) and IPv4 is configured for DHCP (so on this IPv6-only network it only gets a self-assigned 169.254.x.x address), then I get timeout errors:

$ scrapy runspider myspider.py
:0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'.  Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied.  Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification.  Many valid certificate/hostname mappings may be rejected.
2015-01-31 12:13:32+0100 [scrapy] INFO: Scrapy 0.24.4 started (bot: scrapybot)
2015-01-31 12:13:32+0100 [scrapy] INFO: Optional features available: ssl, http11
2015-01-31 12:13:32+0100 [scrapy] INFO: Overridden settings: {}
2015-01-31 12:13:32+0100 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-01-31 12:13:32+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-01-31 12:13:32+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-01-31 12:13:32+0100 [scrapy] INFO: Enabled item pipelines:
2015-01-31 12:13:32+0100 [blogspider] INFO: Spider opened
2015-01-31 12:13:32+0100 [blogspider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-01-31 12:13:32+0100 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-01-31 12:13:32+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-01-31 12:14:32+0100 [blogspider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-01-31 12:14:48+0100 [blogspider] DEBUG: Retrying <GET http://blog.scrapinghub.com> (failed 1 times): TCP connection timed out: 60: Operation timed out.
2015-01-31 12:15:32+0100 [blogspider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-01-31 12:16:03+0100 [blogspider] DEBUG: Retrying <GET http://blog.scrapinghub.com> (failed 2 times): TCP connection timed out: 60: Operation timed out.
2015-01-31 12:16:32+0100 [blogspider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-01-31 12:17:18+0100 [blogspider] DEBUG: Gave up retrying <GET http://blog.scrapinghub.com> (failed 3 times): TCP connection timed out: 60: Operation timed out.
2015-01-31 12:17:18+0100 [blogspider] ERROR: Error downloading <GET http://blog.scrapinghub.com>: TCP connection timed out: 60: Operation timed out.
2015-01-31 12:17:18+0100 [blogspider] INFO: Closing spider (finished)
2015-01-31 12:17:18+0100 [blogspider] INFO: Dumping Scrapy stats:
    {'downloader/exception_count': 3,
     'downloader/exception_type_count/twisted.internet.error.TCPTimedOutError': 3,
     'downloader/request_bytes': 657,
     'downloader/request_count': 3,
     'downloader/request_method_count/GET': 3,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 1, 31, 11, 17, 18, 494774),
     'log_count/DEBUG': 5,
     'log_count/ERROR': 1,
     'log_count/INFO': 10,
     'scheduler/dequeued': 3,
     'scheduler/dequeued/memory': 3,
     'scheduler/enqueued': 3,
     'scheduler/enqueued/memory': 3,
     'start_time': datetime.datetime(2015, 1, 31, 11, 13, 32, 955024)}
2015-01-31 12:17:18+0100 [blogspider] INFO: Spider closed (finished)

If I turn off IPv4 completely, Scrapy fails with “No route to host” errors:

$ scrapy runspider myspider.py
:0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'.  Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied.  Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification.  Many valid certificate/hostname mappings may be rejected.
2015-01-31 12:10:06+0100 [scrapy] INFO: Scrapy 0.24.4 started (bot: scrapybot)
2015-01-31 12:10:06+0100 [scrapy] INFO: Optional features available: ssl, http11
2015-01-31 12:10:06+0100 [scrapy] INFO: Overridden settings: {}
2015-01-31 12:10:06+0100 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-01-31 12:10:06+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-01-31 12:10:06+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-01-31 12:10:06+0100 [scrapy] INFO: Enabled item pipelines:
2015-01-31 12:10:06+0100 [blogspider] INFO: Spider opened
2015-01-31 12:10:06+0100 [blogspider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-01-31 12:10:06+0100 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-01-31 12:10:06+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-01-31 12:10:06+0100 [blogspider] DEBUG: Retrying <GET http://blog.scrapinghub.com> (failed 1 times): No route to host: 51: Network is unreachable.
2015-01-31 12:10:06+0100 [blogspider] DEBUG: Retrying <GET http://blog.scrapinghub.com> (failed 2 times): No route to host: 51: Network is unreachable.
2015-01-31 12:10:06+0100 [blogspider] DEBUG: Gave up retrying <GET http://blog.scrapinghub.com> (failed 3 times): No route to host: 51: Network is unreachable.
2015-01-31 12:10:06+0100 [blogspider] ERROR: Error downloading <GET http://blog.scrapinghub.com>: No route to host: 51: Network is unreachable.
2015-01-31 12:10:06+0100 [blogspider] INFO: Closing spider (finished)
2015-01-31 12:10:06+0100 [blogspider] INFO: Dumping Scrapy stats:
    {'downloader/exception_count': 3,
     'downloader/exception_type_count/twisted.internet.error.NoRouteError': 3,
     'downloader/request_bytes': 657,
     'downloader/request_count': 3,
     'downloader/request_method_count/GET': 3,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 1, 31, 11, 10, 6, 482287),
     'log_count/DEBUG': 5,
     'log_count/ERROR': 1,
     'log_count/INFO': 7,
     'scheduler/dequeued': 3,
     'scheduler/dequeued/memory': 3,
     'scheduler/enqueued': 3,
     'scheduler/enqueued/memory': 3,
     'start_time': datetime.datetime(2015, 1, 31, 11, 10, 6, 463382)}
2015-01-31 12:10:06+0100 [blogspider] INFO: Spider closed (finished)

Note that I can open blog.scrapinghub.com in Safari, so the target website does support IPv6 and the problem appears to be on Scrapy’s side.
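
Incidentally, a quick way to confirm that independently of Scrapy is to ask the resolver for AAAA records from the same machine (a minimal sketch):

import socket

# Restricting getaddrinfo to AF_INET6 returns only AAAA results;
# it raises socket.gaierror if the host has no IPv6 address.
for family, _, _, _, sockaddr in socket.getaddrinfo(
        'blog.scrapinghub.com', 80, socket.AF_INET6, socket.SOCK_STREAM):
    print(sockaddr[0])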

Issue Analytics

  • State: closed
  • Created: 9 years ago
  • Reactions: 2
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

2 reactions
glyph commented, Dec 12, 2018

Indeed, this is still an issue. Scrapy disables Twisted’s IPv6 support by installing a non-IPv6-aware resolver. The problem is here:

https://github.com/scrapy/scrapy/blob/1fd1702a11a56ecbe9851ba4f9d3c10797e262dd/scrapy/crawler.py#L289

If you don’t want to trust the operating system’s DNS caching for some reason, you can use the more modern API to install a custom resolver: https://twistedmatrix.com/documents/18.9.0/api/twisted.internet.interfaces.IReactorPluggableNameResolver.html#installNameResolver

And, rather than subclassing a resolver within Twisted (you shouldn’t need the internal _GAIResolver to be made public), you can write a generalized caching layer: a twisted.internet.interfaces.IHostnameResolver that takes another IHostnameResolver as an argument and caches its results, then simply pass the previous value of reactor.nameResolver to it.

Hope that this helps!
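
A minimal sketch of the wrapper glyph describes, assuming a Twisted recent enough to provide the IReactorPluggableNameResolver API linked above; the class names are illustrative and this is not Scrapy’s actual implementation:

from zope.interface import implementer
from twisted.internet.interfaces import (
    IHostResolution, IHostnameResolver, IResolutionReceiver)


@implementer(IHostResolution)
class _CachedResolution:
    """Minimal IHostResolution handed out when replaying cached results."""
    def __init__(self, name):
        self.name = name

    def cancel(self):
        pass  # nothing in flight to cancel when serving from the cache


@implementer(IResolutionReceiver)
class _RecordingReceiver:
    """Forwards every callback to the real receiver and records the addresses."""
    def __init__(self, receiver, cache, hostName):
        self._receiver = receiver
        self._cache = cache
        self._hostName = hostName
        self._addresses = []

    def resolutionBegan(self, resolution):
        self._receiver.resolutionBegan(resolution)

    def addressResolved(self, address):
        self._addresses.append(address)
        self._receiver.addressResolved(address)

    def resolutionComplete(self):
        self._cache[self._hostName] = list(self._addresses)
        self._receiver.resolutionComplete()


@implementer(IHostnameResolver)
class CachingHostnameResolver:
    """Caches the results of another IHostnameResolver (IPv6 results included)."""
    def __init__(self, wrapped):
        self._wrapped = wrapped
        self._cache = {}

    def resolveHostName(self, resolutionReceiver, hostName, portNumber=0,
                        addressTypes=None, transportSemantics='TCP'):
        cached = self._cache.get(hostName)
        if cached is not None:
            # Replay previously seen addresses without a new lookup.
            resolution = _CachedResolution(hostName)
            resolutionReceiver.resolutionBegan(resolution)
            for address in cached:
                resolutionReceiver.addressResolved(address)
            resolutionReceiver.resolutionComplete()
            return resolution
        recorder = _RecordingReceiver(resolutionReceiver, self._cache, hostName)
        return self._wrapped.resolveHostName(
            recorder, hostName, portNumber, addressTypes, transportSemantics)

Installing it as glyph suggests would then look like reactor.installNameResolver(CachingHostnameResolver(reactor.nameResolver)), wrapping whatever name resolver the reactor already has.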

1 reaction
nyov commented, Mar 24, 2015

An implementation based on socket.getaddrinfo: https://github.com/scrapy/scrapy/pull/1104
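
Not the code from that pull request, but a rough sketch of what a getaddrinfo-based resolver in the older IResolverSimple style might look like (the class name is made up for illustration):

import socket

from zope.interface import implementer
from twisted.internet import threads
from twisted.internet.interfaces import IResolverSimple


@implementer(IResolverSimple)
class GetaddrinfoResolver:
    """Resolves names via socket.getaddrinfo in the reactor's thread pool."""
    def getHostByName(self, name, timeout=None):
        # The timeout argument is ignored in this sketch.
        def _lookup():
            # getaddrinfo consults /etc/hosts plus both A and AAAA records,
            # so the first result may be an IPv6 address, which is more than
            # IResolverSimple officially promises and part of why the newer
            # IHostnameResolver API exists.
            family, socktype, proto, canonname, sockaddr = socket.getaddrinfo(
                name, 0, 0, socket.SOCK_STREAM)[0]
            return sockaddr[0]
        return threads.deferToThread(_lookup)

It would be installed with reactor.installResolver(GetaddrinfoResolver()).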
