
Using pagination URLs always returns the same posts

See original GitHub issue

Hi again! I’m testing the start_url and request_url_callback params to resume scraping after a temporary ban, but the scraper always returns the same posts once a ban occurs. For example, using this code:

import time
from facebook_scraper import *

start_url = None

def handle_pagination_url(url):
    start_url = url

set_cookies("cookies.json")
row = 1
while True:
    try:
        for post in get_posts(
            "Nintendo",
            pages=None,
            start_url=start_url,
            request_url_callback=handle_pagination_url,
            options={"allow_extra_requests": False, "comments": False,
                     "reactors": False, "posts_per_page": 200},
            timeout=120,
        ):
            if row % 9 == 0:  # simulate a ban every 9th post
                raise exceptions.TemporarilyBanned
            print(post.get('post_url', '0'))
            row += 1
        print("All done")
        break
    except exceptions.TemporarilyBanned:
        row += 1
        print("Temporarily banned, sleeping for 1")
        time.sleep(2)

I always get the same posts again and again. The same problem occurs on every group and page I try, though in groups the posts come back in a different order.

https://facebook.com/Nintendo/posts/4217919734959114
https://facebook.com/Nintendo/posts/4217774244973663
https://facebook.com/Nintendo/posts/4214518608632560
https://facebook.com/Nintendo/posts/4214033132014441
https://facebook.com/Nintendo/posts/4194934713924283
https://facebook.com/Nintendo/posts/4193752747375813
https://facebook.com/Nintendo/posts/4191173794300375
https://facebook.com/Nintendo/posts/4188174317933656
Temporarily banned, sleeping for 1
https://facebook.com/Nintendo/posts/4217919734959114
https://facebook.com/Nintendo/posts/4217774244973663
https://facebook.com/Nintendo/posts/4214518608632560
https://facebook.com/Nintendo/posts/4214033132014441
https://facebook.com/Nintendo/posts/4194934713924283
https://facebook.com/Nintendo/posts/4193752747375813
https://facebook.com/Nintendo/posts/4191173794300375
https://facebook.com/Nintendo/posts/4188174317933656
Temporarily banned, sleeping for 1
https://facebook.com/Nintendo/posts/4217919734959114
https://facebook.com/Nintendo/posts/4217774244973663
https://facebook.com/Nintendo/posts/4214518608632560
https://facebook.com/Nintendo/posts/4214033132014441
https://facebook.com/Nintendo/posts/4194934713924283
https://facebook.com/Nintendo/posts/4193752747375813
https://facebook.com/Nintendo/posts/4191173794300375
https://facebook.com/Nintendo/posts/4188174317933656

Thanks in advance!

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

1 reaction
neon-ninja commented, Jun 14, 2021

By 200 posts, I was referring to your code that sets "posts_per_page": 200. But actually, this setting only works for pages, not groups, so you’ll be getting the group default of 20-40 posts per page, and you can’t change that.

Pagination is triggered when necessary, i.e. once you’ve consumed all of the posts on the current page. The scraper makes the transition between pages as seamless as possible for you. Because a web request is only made when fetching each page, by definition you’re only going to get a TemporarilyBanned exception when requesting a new page, never while processing posts from a page that has already been fetched.

You should probably increase your time.sleep beyond 60 seconds; in my experience, temporary bans last much longer than that. Otherwise, your code looks fine to me.
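The page-boundary behaviour described above can be sketched with plain Python, independent of facebook_scraper (fetch_page, get_posts_sketch, and the TemporarilyBanned class below are hypothetical stand-ins, not library API):

```python
# Sketch of the pagination behaviour: a ban can only surface when a
# *new page* is requested, never while iterating posts already fetched.

class TemporarilyBanned(Exception):
    """Stand-in for the scraper's rate-limit exception."""

# Three fake "pages": each maps a URL to (posts, next_url).
PAGES = {
    "p1": (["post1", "post2"], "p2"),
    "p2": (["post3", "post4"], "p3"),
    "p3": (["post5"], None),
}

calls = {"n": 0}

def fetch_page(url):
    """Fake network request: bans on its 2nd call, succeeds otherwise."""
    calls["n"] += 1
    if calls["n"] == 2:
        raise TemporarilyBanned
    return PAGES[url]

def get_posts_sketch(start_url, request_url_callback):
    url = start_url
    while url is not None:
        request_url_callback(url)     # report the page about to be fetched
        posts, url = fetch_page(url)  # the only place a ban can occur
        for post in posts:            # pure iteration, never raises a ban
            yield post

last_url = "p1"

def remember(url):
    global last_url
    last_url = url

collected = []
while True:
    try:
        for post in get_posts_sketch(last_url, remember):
            collected.append(post)
        break
    except TemporarilyBanned:
        pass  # real code would time.sleep(...) here before resuming

print(collected)  # all five posts, none repeated
```

Because the ban surfaces on the request for a page whose posts were never yielded, resuming from the last reported URL simply re-requests that page, so no posts are lost or duplicated. This is why it matters that the callback actually persists the URL.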

1 reaction
dimitriskourg commented, Jun 14, 2021

@roma-glushko Thanks for the code! You are amazing! 🙌 @neon-ninja I tested the code above, and it seems like pagination is triggered every 21 posts? Here is the output below:


Storing next search page https://m.facebook.com/groups/thankyounextofficial/..
https://m.facebook.com/groups/thankyounextofficial/permalink/993478204766901/
https://m.facebook.com/groups/thankyounextofficial/permalink/992989824815739/
https://m.facebook.com/groups/thankyounextofficial/permalink/992930011488387/
https://m.facebook.com/groups/thankyounextofficial/permalink/988110088637046/
https://m.facebook.com/groups/thankyounextofficial/permalink/993119688136086/
https://m.facebook.com/groups/thankyounextofficial/permalink/993401701441218/
https://m.facebook.com/groups/thankyounextofficial/permalink/993091338138921/
https://m.facebook.com/groups/thankyounextofficial/permalink/994893594625362/
https://m.facebook.com/groups/thankyounextofficial/permalink/992700148178040/
https://m.facebook.com/groups/thankyounextofficial/permalink/992944198153635/
https://m.facebook.com/groups/thankyounextofficial/permalink/992623498185705/
https://m.facebook.com/groups/thankyounextofficial/permalink/904229037025152/
https://m.facebook.com/groups/thankyounextofficial/permalink/992942021487186/
https://m.facebook.com/groups/thankyounextofficial/permalink/992625384852183/
https://m.facebook.com/groups/thankyounextofficial/permalink/992620218186033/
https://m.facebook.com/groups/thankyounextofficial/permalink/992621141519274/
https://m.facebook.com/groups/thankyounextofficial/permalink/993298128118242/
https://m.facebook.com/groups/thankyounextofficial/permalink/993335001447888/
https://m.facebook.com/groups/thankyounextofficial/permalink/993284051452983/
https://m.facebook.com/groups/thankyounextofficial/permalink/993065651474823/
https://m.facebook.com/groups/thankyounextofficial/permalink/993076514807070/
Storing next search page https://m.facebook.com/groups/438870180227709?bac=MTYyMzYyNzQxNzo5OTMwNzY1MTQ4MDcwNzA6OTkzMDc2NTE0ODA3MDcwLDAsMDoyMDpLdz09&multi_permalinks..
https://m.facebook.com/groups/thankyounextofficial/permalink/993478204766901/
https://m.facebook.com/groups/thankyounextofficial/permalink/992820888165966/
https://m.facebook.com/groups/thankyounextofficial/permalink/993141388133916/
https://m.facebook.com/groups/thankyounextofficial/permalink/993369798111075/
https://m.facebook.com/groups/thankyounextofficial/permalink/992855104829211/
https://m.facebook.com/groups/thankyounextofficial/permalink/993152168132838/
https://m.facebook.com/groups/thankyounextofficial/permalink/992881214826600/
https://m.facebook.com/groups/thankyounextofficial/permalink/993186961462692/
https://m.facebook.com/groups/thankyounextofficial/permalink/991186581662730/
https://m.facebook.com/groups/thankyounextofficial/permalink/993510978096957/
https://m.facebook.com/groups/thankyounextofficial/permalink/992623461519042/
https://m.facebook.com/groups/thankyounextofficial/permalink/994067041374684/
https://m.facebook.com/groups/thankyounextofficial/permalink/993000954814626/
https://m.facebook.com/groups/thankyounextofficial/permalink/674035226711202/
https://m.facebook.com/groups/thankyounextofficial/permalink/909691633145559/
https://m.facebook.com/groups/thankyounextofficial/permalink/993228291458559/
https://m.facebook.com/groups/thankyounextofficial/permalink/993332578114797/
https://m.facebook.com/groups/thankyounextofficial/permalink/990354088412646/
https://m.facebook.com/groups/thankyounextofficial/permalink/993133881468000/
https://m.facebook.com/groups/thankyounextofficial/permalink/993009264813795/
https://m.facebook.com/groups/thankyounextofficial/permalink/993427331438655/
Storing next search page https://m.facebook.com/groups/438870180227709?bac=MTYyMzYyNjgzODo5OTM0MjczMzE0Mzg2NTU6OTkzNDI3MzMxNDM4NjU1LDAsMToyMDpLdz09&multi_permalinks..

So, when is pagination triggered? And if it is triggered every 200 posts, is there a way to trigger it more often? Because when a temporary ban occurs before 200 posts, start_url remains the same and the same posts get scraped again. Finally, I use this code to continue scraping data after I have been banned. Is that correct? I’m sorry for the spam, but I really want to fix this problem! Thanks again!

import time
from typing import Optional
from facebook_scraper import *

class SearchPagePersistor:
    search_page_url: Optional[str] = None

    def get_current_search_page(self) -> Optional[str]:
        return self.search_page_url

    def set_search_page(self, page_url: str) -> None:
        print('Storing next search page {}..'.format(page_url))
        self.search_page_url = page_url

# ...
search_page_persistor: SearchPagePersistor = SearchPagePersistor()  # could be inited with a specific search page URL

while True:
    try:
        for post_idx, post in enumerate(get_posts(
            "Nintendo",
            cookies="cookies.json",
            page_limit=None,  # try to get all pages and then decide where to stop
            start_url=search_page_persistor.get_current_search_page(),
            request_url_callback=search_page_persistor.set_search_page,
            options={"allow_extra_requests": False, "comments": False, "reactors": False, "posts_per_page": 200},
            timeout=120
        )):
            print(post.get('post_url'))
        print("Finished!")
        break
    except exceptions.TemporarilyBanned:
        print("Temporarily banned, sleeping for 1m")
        time.sleep(60)
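If the process itself might die between bans (crash, reboot), the persistor above can be extended to write the last pagination URL to disk so a fresh run resumes where the old one stopped. A minimal sketch along the same lines (the file name is arbitrary, and this class is not part of facebook_scraper):

```python
import json
import os

class DiskBackedPersistor:
    """Like SearchPagePersistor above, but survives process restarts
    by writing the last pagination URL to a small JSON file."""

    def __init__(self, path="search_page.json"):
        self.path = path

    def get_current_search_page(self):
        # Returns None on a fresh start, so the scrape begins at page 1.
        if os.path.exists(self.path):
            with open(self.path) as f:
                return json.load(f).get("url")
        return None

    def set_search_page(self, page_url):
        with open(self.path, "w") as f:
            json.dump({"url": page_url}, f)
```

Its get_current_search_page / set_search_page methods can be passed to get_posts exactly as in the snippet above.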

Read more comments on GitHub >

