Strategy same-domain is not respected
When using regexps with enqueueLinks and setting the strategy to same-domain, the crawler still veers off from the original domain when crawling.
Example:
import { CheerioCrawler } from "crawlee";

const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(request.url);
        // Add all links from page to RequestQueue
        await enqueueLinks({
            strategy: "same-domain",
            regexps: [/([\./\-_]{0,1}(19|20)\d{2})[\./\-_]{0,1}(([0-3]{0,1}[0-9][\./\-_])|(\w{3,5}[\./\-_]))([0-3]{0,1}[0-9][./\-]{0,1})?/],
        });
    },
});

// Run the crawler with initial request
await crawler.run(["https://www.readingeagle.com/"]);
System information:
- OS: macOS 12.5.1 (Monterey)
- Node.js version 18.4
According to the docs, I would expect the same-domain strategy to keep the enqueued URLs within readingeagle.com, but after running for a bit the crawler visits other domains as well.
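To make the expected behaviour concrete, here is a minimal sketch of the check I would expect same-domain to apply on top of the regexps. The helper names naiveDomain and isSameDomain are hypothetical and not part of Crawlee. The comparison of the last two hostname labels is a simplifying assumption: it gives wrong answers for multi-part public suffixes such as co.uk, and it is not how the library itself resolves domains.

```javascript
// Hypothetical helper: reduce a hostname to its last two labels.
// Assumption: ignores multi-part public suffixes such as "co.uk".
function naiveDomain(hostname) {
    return hostname.split(".").slice(-2).join(".");
}

// Hypothetical helper: true when both URLs share the same naive domain,
// so subdomains of the base domain are accepted.
function isSameDomain(candidateUrl, baseUrl) {
    try {
        const candidate = new URL(candidateUrl);
        const base = new URL(baseUrl);
        return naiveDomain(candidate.hostname) === naiveDomain(base.hostname);
    } catch {
        // Unparsable URLs are treated as off-domain.
        return false;
    }
}

// Possible workaround inside the requestHandler from the example
// (assumes transformRequestFunction skips a request when it returns
// a falsy value; dateRegex stands for the regexp used in the example):
//
// await enqueueLinks({
//     regexps: [dateRegex],
//     transformRequestFunction: (req) =>
//         isSameDomain(req.url, request.loadedUrl ?? request.url) ? req : false,
// });
```

With this filter, a link such as https://www.example.com/2022/09/01/ would match the date regexp but still be dropped, because its domain differs from readingeagle.com.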
Issue Analytics
- State:
- Created a year ago
- Comments:9 (4 by maintainers)
Thanks @vladfrangu, @B4nan and @metalwarrior665 for all the help. Keep up the great work!
Ok thanks.