
Issue received while scraping with comments

See original GitHub issue

post_urls = [10158609033378741, 10159694338583054, 1200387097127839, 3176654032562645, 204010715100946]

Running the code below, I get this error: NameError: name 'time' is not defined.

from facebook_scraper import *
import time  # required for time.sleep() in the retry loop

set_cookies("cookies.txt")
results = []
start_url = None
post_result = []

# remember the most recent comment pagination URL so scraping can resume after a temporary ban
def handle_pagination_url(url):
    global start_url
    start_url = url

while True:
    try:
        post = next(
            get_posts(
                post_urls=[10158609033378741],
                options={
                    "comments": "generator",
                    "comment_start_url": start_url,
                    "comment_request_url_callback": handle_pagination_url,
                },
            )
        )
        comments = list(post["comments_full"])
        for comment in comments:
            # materialize the replies generator into a plain list
            comment["replies"] = list(comment["replies"])
            results.append(comment)
            
        print("All done")
        post.update({"comments_full":results})
        post_result.append(post)
        break
    except exceptions.TemporarilyBanned:
        # back off, then retry; start_url lets the next call resume comment pagination where it left off
        print("Temporarily banned, sleeping for 10m")
        time.sleep(600)
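Once the loop finishes, post_result holds the post dict with its comments fully materialized. A minimal sketch for persisting it (the output filename and the default=str fallback for datetime fields are assumptions, not part of the original issue):

import json

with open("post_with_comments.json", "w", encoding="utf-8") as f:
    json.dump(post_result, f, ensure_ascii=False, indent=2, default=str)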

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 23 (13 by maintainers)

Top GitHub Comments

4 reactions
milesgratz commented, Sep 3, 2021

@fashandatafields: this example code for private groups uses very cautious sleep timers (and, admittedly, pretty shoddy Python), but you can try it this way and adjust the timers as necessary. I’ve yet to be banned with it. It appends to pandas DataFrames as it goes, so even if you are temporarily banned you should be able to write out the comments you’ve parsed so far. It writes to CSV at the end, but you could change the pandas to_csv calls to to_json (a sketch of that swap follows the script).

Lots of ways to do this.

#   https://github.com/kevinzg/facebook-scraper
#   
#
#---------------------------------------------------------------------------------------
#   https://github.com/kevinzg/facebook-scraper/issues/409
#---------------------------------------------------------------------------------------
#   Best practice is to reduce your requests per second ;). 
#   Try to stay under one request per second. What were you doing? 
#   Some requests (getting friend lists for example) are more sensitive to temp bans than others.
#
#---------------------------------------------------------------------------------------
#   https://github.com/kevinzg/facebook-scraper/issues/383
#---------------------------------------------------------------------------------------
#   When you scroll down on a Page or a Group on Facebook, it loads posts in pages. 
#   This varies depending on whether you're scraping Pages or Groups - for Pages, 
#   you get 2 posts on the first request, then 4 posts per page for each subsequent request 
#   (unless you change that default with the posts_per_page parameter). 
#   Groups give you roughly 20 posts per page, but this can vary by +-10 posts.
#   The number of posts per page for a group cannot be changed. 
#   The pages parameter limits how many pages to retrieve. 
#   Extracting comments from posts is a separate process, so yes, 
#   the page limit is independent from the number of comments. 
#   The number of comments would vary from Page to Page,
#   as some Pages are more popular than others, 
#   and tend to have more comments per post.
#   
#   As get_posts returns a generator, you can consume just enough posts until you have your
#   comment limit, then terminate iteration. You can set pages to None, so that you'll continue
#   retrieving posts until you hit your comment limit. Here is a code example:
#
#   comments = 0
#   for post in get_posts("Nintendo", cookies="cookies.txt", pages=None, options={"comments": 3000}):
#     comments += len(post["comments_full"] or [])
#     if comments > 3000:
#       break

#   Note that I also set the limit of comments on a given post to 3000 -
#   so if the first post you get has > 3000 comments (possible for Pages belonging to movie stars), 
#   you'll only get comments for that one post
#---------------------------------------------------------------------------------------

from facebook_scraper import *
import pandas as pd
from time import sleep
from datetime import datetime
from random import randint

fbpage_id = 'xxxxxxxxxxxxxxxx'
posts_file = '/opt/fbscraping/data/' + fbpage_id + '_posts.csv'
comments_file = '/opt/fbscraping/data/' + fbpage_id + '_comments.csv'
replies_file = '/opt/fbscraping/data/' + fbpage_id + '_replies.csv'
cookies_file = '/opt/fbscraping/data/cookies.json'

# ------------------- current sleeping behavior -------------------
#   [POSTS] @ begin iteration
#    -> [COMMENTS] @ start, sleep for 3-7 seconds 
#    ----> [REPLIES] @ end, sleep for 10-40 seconds
#    -> [COMMENTS] @ end, sleep for 5-15 seconds
#   [POSTS] @ end, sleep for 37-89 seconds
# -----------------------------------------------------------------

# define pagination info
start_url = None
def handle_pagination_url(url):
    global start_url
    start_url = url

# define pandas dataframe
#   (refer to code example here: https://github.com/kevinzg/facebook-scraper/issues/414)
posts_df_ori = pd.DataFrame(columns = ['username', 'time', 'likes', 'comments', 'shares', 'reactions', 'post_text'])
comments_df_ori = pd.DataFrame(columns = ['post_id', 'commenter_name', 'comment_time', 'comment_reactors', 'replies', 'comment_text'])
replies_df_ori = pd.DataFrame(columns = ['post_id', 'parent_comment_id', 'commenter_name', 'comment_time', 'comment_reactors', 'comment_text'])

# [ALL_POSTS] retrieve all posts
print("[", datetime.now().strftime("%x %-I:%M:%S %p"), "][",  fbpage_id, "] STARTED - Retrieving posts")
pi=0
all_posts = get_posts(
        group=fbpage_id,
        extra_info=True,
        cookies = cookies_file,
        pages=1,
        timeout = 60,
        options={
            "comments": "generator",
            "comment_start_url": start_url,
            "comment_request_url_callback": handle_pagination_url
        },
    )

# [ALL_POSTS] iterate through using next() pagination 
while post := next(all_posts, None):
    pi += 1
    try:        
        # [POST] pandas dataframe
        print("[", datetime.now().strftime("%x %-I:%M:%S %p"), "][",  fbpage_id, "][", post["post_id"], "] Appending post info to 'posts_df_ori' dataframe. Post index: ", pi, " Total comments: ", post["comments"])
        post_dataframe = post
        post_df = pd.DataFrame.from_dict(post_dataframe, orient='index')
        post_df = post_df.transpose()
        posts_df_ori = pd.concat([posts_df_ori, post_df], ignore_index=True)  # concat, since DataFrame.append is gone in pandas 2.x

        # [COMMENT] begin loop
        ci=0
        comments = post["comments_full"]
        for comment in comments:
            
            # [COMMENT] determine replies
            ci += 1
            comment["replies"] = list(comment["replies"])
            
            # [COMMENT] pandas dataframe - transpose and add post_id
            comment_dataframe = comment
            comment_df = pd.DataFrame.from_dict(comment_dataframe, orient='index')
            comment_df = comment_df.transpose()
            comment_df.insert(0,'post_id',post['post_id'])

            # [COMMENT] append new object with post_id and comment* data to master, sleep
            sleepCalc = randint(3,7)
            print("[", datetime.now().strftime("%x %-I:%M:%S %p"), "][",  fbpage_id, "][", post["post_id"], "] Appending comments info to 'comments_df_ori' dataframe. Post index: ", pi, " Comment index: ", ci, " Sleeping for: ", sleepCalc)
            comments_df_ori = pd.concat([comments_df_ori, comment_df], ignore_index=True)
            sleep(sleepCalc)  # pause 3-7 seconds per comment, per the sleeping behavior described above

            # [COMMENT] determine if replies exist 
            if comment["replies"]:
                ri = 0
                replies = comment['replies']
                for reply in replies:
                    ri += 1

                    # [COMMENT][REPLIES] pandas dataframe - transpose and add post_id, parent_comment_id
                    reply_dataframe = reply
                    reply_df = pd.DataFrame.from_dict(reply_dataframe, orient='index')
                    reply_df = reply_df.transpose()
                    reply_df.insert(0,'post_id',post['post_id'])
                    reply_df.insert(1,'parent_comment_id',comment['comment_id'])

                    # [COMMENT][REPLIES] append new object with post_id, parent_comment_id, and comment* data to master, sleep
                    sleepCalc = randint(10,40)
                    print("[", datetime.now().strftime("%x %-I:%M:%S %p"), "][",  fbpage_id, "][", post["post_id"], "][", comment['comment_id'],"] Appending replies to 'replies_df_ori' dataframe. Post index: ", pi, " Comment index: ", ci, "Replies index: ", ri, " Sleeping for: ", sleepCalc)
                    replies_df_ori = pd.concat([replies_df_ori, reply_df], ignore_index=True)
                    sleep(sleepCalc)

            # [COMMENT] sleep for sleepCalc duration
            sleepCalc = randint(5,15)
            sleep(sleepCalc)

        # [POST] sleep between posts
        sleepCalc = randint(37,89)
        print("---------------------------sleeping for ", sleepCalc, " seconds-----------------------------")
        sleep(sleepCalc)  # pause 37-89 seconds between posts, per the sleeping behavior described above

    except exceptions.TemporarilyBanned:
        print("Temporarily banned..... HALTING")
        break

    except Exception as err:
        print("Error... let's try continuing...?", err)

        
# [ALL_POSTS] finished looping through all posts
print("========================================================================")
print("-------------------FINISHED LOOPING THROUGH ALL POSTS-------------------")
print("========================================================================")

############################################################
# finish
############################################################
print("[", datetime.now().strftime("%x %-I:%M:%S %p"), "][",  fbpage_id, "] COMPLETED - Writing posts and comments to file")
posts_df_ori.to_csv(posts_file, encoding='utf-8', index = False)
comments_df_ori.to_csv(comments_file, encoding='utf-8', index = False)
replies_df_ori.to_csv(replies_file, encoding='utf-8', index = False)
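As noted in the comment above, the final to_csv calls can be swapped for JSON output; a minimal sketch of that change (the .json filenames are assumptions):

posts_df_ori.to_json(posts_file.replace('.csv', '.json'), orient='records', force_ascii=False)
comments_df_ori.to_json(comments_file.replace('.csv', '.json'), orient='records', force_ascii=False)
replies_df_ori.to_json(replies_file.replace('.csv', '.json'), orient='records', force_ascii=False)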

1 reaction
fashan7 commented, Sep 6, 2021

@milesgratz can we get it all into one JSON, i.e. posts, comments, and replies in a single JSON file? The comments and replies need to stay in their proper chain (replies nested under their parent comments).
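One way to get there from the three DataFrames in the script above is to nest replies under their parent comments and comments under their posts before dumping a single file. A minimal sketch, assuming the post_id and comment_id fields survive into the DataFrames (the script appends the full post and comment dicts, so they should), and using a hypothetical combined.json output path:

import json

nested_posts = []
for post in posts_df_ori.to_dict(orient="records"):
    # attach this post's comments
    post_comments = comments_df_ori[comments_df_ori["post_id"] == post["post_id"]].to_dict(orient="records")
    for comment in post_comments:
        # attach each comment's replies via parent_comment_id
        comment["replies"] = replies_df_ori[
            replies_df_ori["parent_comment_id"] == comment["comment_id"]
        ].to_dict(orient="records")
    post["comments_full"] = post_comments
    nested_posts.append(post)

with open("combined.json", "w", encoding="utf-8") as f:
    json.dump(nested_posts, f, ensure_ascii=False, indent=2, default=str)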
