How to work with a very large “allowed_domains” attribute in scrapy?
Because allowed_domains is very large, Scrapy throws an exception when compiling this regex:

regex = r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains if d is not None)
How do I solve this problem?
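To make the failure mode concrete, here is a minimal sketch (not Scrapy's actual middleware code, just the same pattern-building idea) of how the regex above is assembled: every allowed domain is escaped and joined into one giant alternation, so with tens of thousands of domains the compiled pattern becomes enormous.

```python
import re

# Sketch of how Scrapy's OffsiteMiddleware builds its host-matching regex.
# Each domain is escaped and joined into a single alternation; with tens of
# thousands of domains this pattern grows huge, which is what triggers the
# compile-time error described above.
allowed_domains = ['example.com', 'example.org', 'example.net']
pattern = r'^(.*\.)?(%s)$' % '|'.join(
    re.escape(d) for d in allowed_domains if d is not None)
rx = re.compile(pattern)

print(bool(rx.match('www.example.com')))      # True: subdomain of an allowed domain
print(bool(rx.match('example.com.evil.io')))  # False: the suffix is not allowed
```

The `(.*\.)?` prefix is what lets subdomains through, and the anchored alternation is what forces the whole domain list into one pattern.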
Issue Analytics
- State:
- Created 7 years ago
- Comments: 15 (8 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
A regex with e.g. 50K domains should be super-fast with pyre2; for such regexes stdlib re matching time grows linearly with the number of domains, but re2 matches in time independent of how many domains are in the regex. I’m using a similar approach in https://github.com/scrapinghub/adblockparser.
@15310944349 just pass the max_mem argument to re2.compile, like it is done here.