How to work with a very large “allowed_domains” attribute in scrapy?
Because allowed_domains is very large, Scrapy throws an exception when compiling this regex:

regex = r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains if d is not None)
How do I solve this problem?
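To make the failure mode concrete, here is a minimal sketch (not Scrapy's actual middleware code, just the same pattern-building idea) of how the regex above is assembled: every allowed domain is escaped and joined into one giant alternation, so with tens of thousands of domains the compiled pattern becomes enormous.

```python
import re

# Sketch of how Scrapy's OffsiteMiddleware builds its host-matching regex.
# Each domain is escaped and joined into a single alternation; with tens of
# thousands of domains this pattern grows huge, which is what triggers the
# compile-time error described above.
allowed_domains = ['example.com', 'example.org', 'example.net']
pattern = r'^(.*\.)?(%s)$' % '|'.join(
    re.escape(d) for d in allowed_domains if d is not None)
rx = re.compile(pattern)

print(bool(rx.match('www.example.com')))      # True: subdomain of an allowed domain
print(bool(rx.match('example.com.evil.io')))  # False: the suffix is not allowed
```

The `(.*\.)?` prefix is what lets subdomains through, and the anchored alternation is what forces the whole domain list into one pattern.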
Issue Analytics
- State:
- Created 7 years ago
- Comments: 15 (8 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
A regex with e.g. 50K domains should be super-fast with pyre2; for such regexes stdlib re matching time grows linearly with the number of domains, but re2 matches in time independent of how many domains are in the regex. I’m using a similar approach in https://github.com/scrapinghub/adblockparser.
@15310944349 just pass the max_mem argument to re2.compile, like it is done here.