How to work with a very large “allowed_domains” attribute in scrapy?

See original GitHub issue

Because allowed_domains is very large, an exception is thrown at this line:

regex = r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains if d is not None)

How do I solve this problem?

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 15 (8 by maintainers)

Top GitHub Comments

4 reactions
kmike commented, Apr 7, 2016

A regex with e.g. 50K domains should be super-fast with pyre2. For such regexes, stdlib re matching is O(N) in the number of domains, but re2 can match in O(1) time with respect to the number of domains in the regex. I’m using a similar approach in https://github.com/scrapinghub/adblockparser.
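As a rough illustration of the comment above, here is a minimal sketch that builds the same kind of alternation regex as the line in the question and compiles it with an RE2 binding when one is available. It assumes the optional third-party module can be imported as `re2`; if it cannot, the sketch falls back to the stdlib `re`, which still works but scales worse with many domains. The helper names `build_host_regex` and `is_allowed` are hypothetical and not part of Scrapy.

```python
import re

try:
    # Assumption: an RE2 binding (e.g. pyre2) is installed and importable as `re2`.
    import re2 as regex_engine
except ImportError:
    # Fallback: stdlib engine, same pattern syntax for this simple regex.
    regex_engine = re


def build_host_regex(allowed_domains):
    """Compile one regex matching every allowed domain and its subdomains."""
    domains = [d for d in allowed_domains if d is not None]
    if not domains:
        # Assumption for this sketch: an empty list means "no restriction".
        return regex_engine.compile("")
    pattern = r"^(.*\.)?(%s)$" % "|".join(re.escape(d) for d in domains)
    return regex_engine.compile(pattern)


def is_allowed(host_regex, host):
    """Return True if the host name matches the compiled domain regex."""
    return host_regex.search(host) is not None


host_regex = build_host_regex(["example.com", "scrapy.org", None])
print(is_allowed(host_regex, "docs.scrapy.org"))
print(is_allowed(host_regex, "example.com.evil.net"))
```

The regex is compiled once and reused for every host, so the cost of a large domain list is paid at compile time rather than per request.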

1 reaction
kmike commented, Apr 18, 2016

@15310944349 just pass the max_mem argument to re.compile, like it is done here.
