CheerioCrawler overwrites request cookies when persistCookiesPerSession is false

Describe the bug
In Apify 0.2.*, it was possible to use SessionPools with CheerioCrawler but manage request cookie setting manually by toggling persistCookiesPerSession to false.

A change to _requestFunction() in Apify 2.* (and I think Apify 1.*, but we skipped that version) has made this impossible. The issue is caused by this line: https://github.com/apify/apify-js/blob/9418dde95cbb1e7bc125e7ad533f55535f8359c5/src/crawlers/cheerio_crawler.js#L649 (only this.useSessionPool is checked whereas previously this.persistCookiesPerSession was checked).

This means that if CheerioCrawler is configured to use a SessionPool (e.g. for use with proxies) and persistCookiesPerSession is false, any cookies set via a preNavigationHook (or prepareRequestFunction() in earlier Apify versions) are overwritten.
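The overwrite described above can be modelled outside the SDK. The following is a minimal sketch, not the actual cheerio_crawler.js source: the function name `applyCookiesUnconditionally` and the stub `emptySession` are hypothetical, and the stub assumes that a session with persistCookiesPerSession set to false holds no cookies and so returns an empty cookie string.

```javascript
// Hypothetical stand-in for a session from a SessionPool. Assumption: with
// persistCookiesPerSession = false the session stores no cookies, so the
// cookie string is empty.
const emptySession = {
  getCookieString: (url) => '',
};

// Simplified model of the reported check: only `useSessionPool` is consulted,
// so the Cookie header is always replaced when a session pool is in use.
function applyCookiesUnconditionally(request, session, useSessionPool) {
  if (useSessionPool) {
    request.headers.Cookie = session.getCookieString(request.url);
  }
  return request;
}

// A request whose Cookie header was set manually, e.g. by a pre navigation hook.
const overwritten = applyCookiesUnconditionally(
  { url: 'https://example.com', headers: { Cookie: 'token=set-by-hook' } },
  emptySession,
  true,
);

// The manually set value has been replaced by the session's cookie string.
console.log(JSON.stringify(overwritten.headers.Cookie));
```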

To Reproduce
Configure CheerioCrawler to use a SessionPool and not persist cookies per session. Try setting cookies on the request via prepareRequestFunction() (or a pre navigation hook) and notice they get overwritten.
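As a rough illustration of that setup, the options below show only the parts relevant to the report. This is a partial sketch: the cookie value is made up, and a real crawler would also need a request source and a handlePageFunction before the object could be passed to the CheerioCrawler constructor.

```javascript
// Partial CheerioCrawler options (Apify SDK 2.x option names); only the
// settings relevant to this report are shown.
const crawlerOptions = {
  useSessionPool: true,            // sessions are needed, e.g. for proxy rotation
  persistCookiesPerSession: false, // cookies are meant to be managed manually
  preNavigationHooks: [
    // Sets the Cookie header by hand; the value is a made-up example.
    async ({ request }) => {
      request.headers = { ...request.headers, Cookie: 'token=set-by-hook' };
    },
  ],
};

// Running the hook against a plain object shows the header it would set.
const sampleRequest = { url: 'https://example.com', headers: {} };
crawlerOptions.preNavigationHooks[0]({ request: sampleRequest });
console.log(sampleRequest.headers.Cookie);
```

With the behaviour reported here, the header set by the hook would then be replaced before the request is sent.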

Expected behavior
Same behaviour as in Apify 0.2.* (i.e. manual cookie setting is possible in conjunction with a session pool). Or at least a migration path to preserve this ability with later versions of Apify SDK.

System information:

  • OS: Ubuntu 20
  • Node.js version: 16
  • Apify SDK version: 2.0.7

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 18 (17 by maintainers)

Top GitHub Comments

mnmkng commented, Oct 6, 2021 (1 reaction)

@corford thanks for the API improvement suggestions. I can’t say that it will be soon because we’re working on other things now, but we want to get back to the drawing board with SessionPool and improve the API significantly based on our experience so far. All suggestions are very welcome.

B4nan commented, Oct 5, 2021 (1 reaction)

I see, then it probably makes sense to do what @szmarczak suggested and change the line to something like:

const { headers } = request;

if (this.useSessionPool && !(headers.Cookie || headers.cookie)) {
  headers.Cookie = session.getCookieString(request.url);
}

That will help with your particular issue and should not break anything else. We can iterate on this later if we see a reproduction that would require merging of cookies, but I’d rather not implement something that is almost impossible to reproduce (without hacks).
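To see how the suggested guard behaves, it can be wrapped in a standalone function and run against a stubbed session. This is only a sketch under assumptions: `applyCookiesIfAbsent` and `stubSession` are hypothetical names, and the stub returns a fixed cookie string instead of reading from a real SessionPool.

```javascript
// Hypothetical stand-in for a session; returns a fixed cookie string.
const stubSession = {
  getCookieString: (url) => 'sid=from-session',
};

// The suggested guard, lifted into a plain function: session cookies are only
// applied when the request has no Cookie header of its own (either casing).
function applyCookiesIfAbsent(request, session, useSessionPool) {
  const { headers } = request;
  if (useSessionPool && !(headers.Cookie || headers.cookie)) {
    headers.Cookie = session.getCookieString(request.url);
  }
  return request;
}

// A manually set cookie is left alone.
const manual = applyCookiesIfAbsent(
  { url: 'https://example.com', headers: { Cookie: 'token=set-by-hook' } },
  stubSession,
  true,
);

// A request without a cookie header falls back to the session's cookies.
const fallback = applyCookiesIfAbsent(
  { url: 'https://example.com', headers: {} },
  stubSession,
  true,
);

console.log(manual.headers.Cookie, '|', fallback.headers.Cookie);
```

Note that the guard does not merge the two sources: when a manual cookie is present, session cookies are skipped entirely, which matches the comment's decision to leave merging for later.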
