Strategy same-domain is not respected

When using regexps with enqueueLinks and setting the strategy to same-domain, the crawler still veers off the original domain while crawling.

Example:

import { CheerioCrawler } from "crawlee";

const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(request.url);
        
        // Add all links from page to RequestQueue
        await enqueueLinks({
            strategy: "same-domain",
            regexps: [/([\./\-_]{0,1}(19|20)\d{2})[\./\-_]{0,1}(([0-3]{0,1}[0-9][\./\-_])|(\w{3,5}[\./\-_]))([0-3]{0,1}[0-9][./\-]{0,1})?/],
        });
    },
});

// Run the crawler with initial request
await crawler.run(["https://www.readingeagle.com/"]);

System information:

  • OS: macOS 12.5.1 Monterey
  • Node.js version 18.4

According to the docs, I would expect the same-domain strategy to keep the crawl to URLs within readingeagle.com. After running for a bit, however, the crawler goes to other domains as well.
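For illustration, the expectation described here can be written down as a small predicate. This is only a sketch under stated assumptions: `isOnDomain` is a hypothetical helper, not part of Crawlee, and it approximates "same domain" with a plain hostname suffix check against a domain the caller supplies, which may differ from how the library itself derives the domain. The sample URLs are made up.

```javascript
// Hypothetical helper, not part of Crawlee: returns true when `candidateUrl`
// is hosted on `baseDomain` or on one of its subdomains.
function isOnDomain(candidateUrl, baseDomain) {
    let hostname;
    try {
        hostname = new URL(candidateUrl).hostname.toLowerCase();
    } catch {
        return false; // not an absolute, parseable URL
    }
    const domain = baseDomain.toLowerCase();
    // The leading dot keeps look-alike hosts such as "notreadingeagle.com" out.
    return hostname === domain || hostname.endsWith(`.${domain}`);
}

// Made-up sample links, as they might be discovered on a page.
const found = [
    "https://www.readingeagle.com/2022/08/30/some-story/",
    "https://readingeagle.com/2022/08/29/other-story/",
    "https://www.example.com/2022/08/30/offsite/",
    "https://notreadingeagle.com/2022/08/30/lookalike/",
];

// Only the first two URLs are on readingeagle.com or a subdomain of it.
const kept = found.filter((url) => isOnDomain(url, "readingeagle.com"));
console.log(kept);
```

As a possible workaround, a check like this could be applied to each candidate request in the transformRequestFunction option of enqueueLinks, returning a falsy value to skip off-domain links. Whether that is needed depends on the Crawlee version in use.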

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

1 reaction
tsrdatatech commented, Aug 30, 2022

Thanks @vladfrangu, @B4nan and @metalwarrior665 for all the help. Keep up the great work!

0 reactions
tsrdatatech commented, Sep 7, 2022

Ok thanks.
