CheerioCrawler cookies not set by 302 redirect response. Expected?

Describe the bug

When a CheerioCrawler request results in a redirect, the set-cookie header from the 302 response is not put into the cookie header of the subsequent request to the redirected-to URL. Many sites use a redirect to validate that a browser supports cookies, so crawling these sites will fail using CheerioCrawler, even if useSessionPool and persistCookiesPerSession are both true.

It’s possible that I’m misunderstanding useSessionPool and persistCookiesPerSession. My assumption was that setting both of them to true would produce cookie behavior that emulates a real browser, where each response can set cookies that will automatically be sent back to the server in subsequent requests in the same session.
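
To make the expected browser-like behavior concrete, here is a minimal sketch of following a redirect while carrying cookies forward. It does not use the Apify SDK or got: `send`, `fakeServer`, `followRedirects`, and the `example.com` URLs are hypothetical stand-ins, and the cookie handling ignores Domain, Path, and expiry attributes that a real cookie jar must honor.

```javascript
// Store "name=value" pairs from Set-Cookie headers in a Map (name -> value).
// Attributes after the first ';' (Path, Domain, Expires...) are ignored here.
function storeSetCookie(jar, setCookieHeaders) {
  for (const header of setCookieHeaders || []) {
    const pair = header.split(';')[0];
    const eq = pair.indexOf('=');
    if (eq === -1) continue;
    jar.set(pair.slice(0, eq).trim(), pair.slice(eq + 1).trim());
  }
}

// Serialize the jar into a Cookie request header value.
function cookieHeader(jar) {
  return [...jar].map(([name, value]) => `${name}=${value}`).join('; ');
}

// `send(url, headers)` is a hypothetical single-hop HTTP client that returns
// { statusCode, headers } and never follows redirects on its own.
function followRedirects(send, url, jar, maxHops = 5) {
  for (let hop = 0; hop <= maxHops; hop++) {
    const headers = jar.size > 0 ? { cookie: cookieHeader(jar) } : {};
    const response = send(url, headers);
    // Cookies set by ANY hop, including a 302, are kept for the next hop.
    storeSetCookie(jar, response.headers['set-cookie']);
    const isRedirect = response.statusCode >= 300 && response.statusCode < 400;
    if (!isRedirect || !response.headers.location) return { url, response };
    url = new URL(response.headers.location, url).href;
  }
  throw new Error('Too many redirects');
}

// Fake server: /login sets a cookie and redirects; /check verifies it.
function fakeServer(url, headers) {
  const { pathname } = new URL(url);
  if (pathname === '/login') {
    return {
      statusCode: 302,
      headers: { 'set-cookie': ['sid=abc123; Path=/'], location: '/check' },
    };
  }
  const ok = (headers.cookie || '').includes('sid=abc123');
  return { statusCode: 200, headers: {}, body: ok ? 'cookies ok' : 'cookies disabled' };
}

const jar = new Map();
const result = followRedirects(fakeServer, 'https://example.com/login', jar);
console.log(result.url, result.response.body);
```

If the `storeSetCookie` call is skipped for the 302 hop, the fake server answers with `cookies disabled`, which mirrors the failure described in this issue.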

Is that how it is supposed to work? If not, how can I get response cookies into the next request, both when the “next request” is a redirect (so there’s no call to handlePageFunction) and when it’s a regular response? The Session.setCookiesFromResponse method looks promising, but I’m not sure where the session instance is accessible or when to call that method.

BTW, retaining cookies across requests is such a common use case that it’d be good to document it beyond this GitHub issue.

To Reproduce

import Apify from 'apify';
const { main, openRequestQueue, CheerioCrawler, utils } = Apify;
utils.log.setLevel(utils.log.LEVELS.ERROR);

main(async () => {
  const requestQueue = await openRequestQueue();
  await requestQueue.addRequest(
    { url: 'https://www.ventureloop.com/ventureloop/login.php' }
  );
  const crawler = new CheerioCrawler({
    maxRequestRetries: 1,
    requestQueue,
    useSessionPool: true,
    persistCookiesPerSession: true,
    preNavigationHooks: [
      // Log the cookie header about to be sent with the request.
      async (cc) => {
        const cookie = cc.request.headers['cookie'];
        console.log(`pre-nav: cookie is '${cookie}'`);
      }],
    postNavigationHooks: [
      // Log the cookie header again once the response has arrived.
      async (cc) => {
        const cookie = cc.request.headers['cookie'];
        console.log(`post-nav: cookie is '${cookie}'`);
      }],
    handlePageFunction: async ({ request, $ }) => {
      console.log('original URL: ', request.url);
      console.log('loaded URL: ', request.loadedUrl);
      console.log('Site output indicating cookies not retained:');
      console.log($('#formContainer').text().trim());
    },
    handleFailedRequestFunction: async (x) => {
      console.log('handleFailedRequestFunction: ', x.request.url);
      console.log(JSON.stringify(x.request.errorMessages));
    }
  });
  await crawler.run();
});

Expected

Cookies set by the first request are sent back to the server when the redirected-to URL is requested.

Actual

The server complains that cookies are disabled on the client. See output below from the code above:

pre-nav: cookie is 'undefined'
post-nav: cookie is 'undefined'
original URL:  https://www.ventureloop.com/ventureloop/login.php
loaded URL:  https://www.ventureloop.com/ventureloop/message.php?source=login.php
Site output indicating cookies not retained:
We have detected that you do not have cookies enabled on your browser. Our system requires the use of cookies in order to proceed. Please contact us at Help if you have further questions (...more text is omitted)

System information:

  • OS: macOS
  • Node.js version: v16.7.0
  • Apify SDK version: 2.2.0

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 10 (8 by maintainers)

Top GitHub Comments

2 reactions
mnmkng commented, Jan 4, 2022

I think we never really thought about redirect cookies so this is an omission on the SDK end. We wrongly assumed that redirect cookies would be sent by got, if it automatically handles redirects, but it also makes sense that it wouldn’t do it, if you don’t provide a cookieJar. A bit unexpected behavior, but not incorrect.

So we need to fix this on our end. Most likely similar to how you’re doing it in the hot fix. Provide some temporary cookieJar for all requests and then extract the cookies out of there. @B4nan The important thing is to keep the cookies mapped to sessions and not leak them across requests.
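
The idea of a temporary jar per request that is mapped back to a session could look roughly like the sketch below. This is only an illustration of the approach described in the comment above, not the SDK's actual fix: the `SessionCookies` class, its method names, and the session IDs are all hypothetical, and cookies are reduced to plain name/value pairs.

```javascript
// Hypothetical per-session cookie store; NOT the Apify SDK's implementation.
class SessionCookies {
  constructor() {
    this.bySession = new Map(); // sessionId -> Map(name -> value)
  }

  // Create a throwaway jar for one request, seeded with a copy of the
  // session's cookies so redirect hops can read and extend them.
  createRequestJar(sessionId) {
    return new Map(this.bySession.get(sessionId) || []);
  }

  // After the request (including any redirects) finishes, copy the jar's
  // cookies back into the owning session only.
  commit(sessionId, requestJar) {
    const stored = this.bySession.get(sessionId) || new Map();
    for (const [name, value] of requestJar) stored.set(name, value);
    this.bySession.set(sessionId, stored);
  }
}

const store = new SessionCookies();

// A request under session "A": a redirect hop sets a cookie in its temp jar.
const jarA = store.createRequestJar('A');
jarA.set('sid', 'abc123');
store.commit('A', jarA);

// A later request under session "A" sees the cookie; session "B" does not.
const nextA = store.createRequestJar('A');
const firstB = store.createRequestJar('B');
console.log(nextA.get('sid'), firstB.has('sid'));
```

Because each request works on a copy, cookies picked up during a request stay invisible to other sessions and only reach the owning session at `commit`, which is one way to avoid leaking them across requests.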

1 reaction
mnmkng commented, Jan 3, 2022

I only gave it a minute, so I’m sure I missed things.

We will investigate this for sure.
