
Issue received while scraping with comments

See original GitHub issue

post_urls = [10158609033378741, 10159694338583054, 1200387097127839, 3176654032562645, 204010715100946]

Running the code below, I get this error: NameError: name 'time' is not defined.

from facebook_scraper import *
import time  # required for time.sleep() in the retry loop

set_cookies("cookies.txt")
results = []
start_url = None
post_result = []

# remember the most recent comment pagination URL so scraping can resume after a temporary ban
def handle_pagination_url(url):
    global start_url
    start_url = url

while True:
    try:
        post = next(
            get_posts(
                post_urls=[10158609033378741],
                options={
                    "comments": "generator",
                    "comment_start_url": start_url,
                    "comment_request_url_callback": handle_pagination_url,
                },
            )
        )
        comments = list(post["comments_full"])
        for comment in comments:
            # materialize the replies generator into a plain list
            comment["replies"] = list(comment["replies"])
            results.append(comment)
            
        print("All done")
        post.update({"comments_full":results})
        post_result.append(post)
        break
    except exceptions.TemporarilyBanned:
        # back off, then retry; start_url lets the next call resume comment pagination where it left off
        print("Temporarily banned, sleeping for 10m")
        time.sleep(600)
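Once the loop finishes, post_result holds the post dict with its comments fully materialized. A minimal sketch for persisting it (the output filename and the default=str fallback for datetime fields are assumptions, not part of the original issue):

import json

with open("post_with_comments.json", "w", encoding="utf-8") as f:
    json.dump(post_result, f, ensure_ascii=False, indent=2, default=str)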

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 23 (13 by maintainers)

Top GitHub Comments

4 reactions
milesgratz commented, Sep 3, 2021

@fashandatafields: this example code for private groups uses very cautious sleep timers (and, admittedly, pretty shoddy Python), but you can try it this way and adjust the timers as necessary. I’ve yet to be banned with it. It appends to pandas DataFrames as it goes, so even if you are temporarily banned you should be able to write out the comments you’ve parsed so far. It writes to CSV at the end, but you could change the pandas to_csv calls to to_json (a sketch of that swap follows the script).

Lots of ways to do this.

#   https://github.com/kevinzg/facebook-scraper
#   
#
#---------------------------------------------------------------------------------------
#   https://github.com/kevinzg/facebook-scraper/issues/409
#---------------------------------------------------------------------------------------
#   Best practice is to reduce your requests per second ;). 
#   Try to stay under one request per second. What were you doing? 
#   Some requests (getting friend lists for example) are more sensitive to temp bans than others.
#
#---------------------------------------------------------------------------------------
#   https://github.com/kevinzg/facebook-scraper/issues/383
#---------------------------------------------------------------------------------------
#   When you scroll down on a Page or a Group on Facebook, it loads posts in pages. 
#   This varies depending on whether you're scraping Pages or Groups - for Pages, 
#   you get 2 posts on the first request, then 4 posts per page for each subsequent request 
#   (unless you change that default with the posts_per_page parameter). 
#   Groups give you roughly 20 posts per page, but this can vary by +-10 posts.
#   The number of posts per page for a group cannot be changed. 
#   The pages parameter limits how many pages to retrieve. 
#   Extracting comments from posts is a separate process, so yes, 
#   the page limit is independent from the number of comments. 
#   The number of comments would vary from Page to Page,
#   as some Pages are more popular than others, 
#   and tend to have more comments per post.
#   
#   As get_posts returns a generator, you can consume just enough posts until you have your
#   comment limit, then terminate iteration. You can set pages to None, so that you'll continue
#   retrieving posts until you hit your comment limit. Here is a code example:
#
#   comments = 0
#   for post in get_posts("Nintendo", cookies="cookies.txt", pages=None, options={"comments": 3000}):
#     comments += len(post["comments_full"] or [])
#     if comments > 3000:
#       break

#   Note that I also set the limit of comments on a given post to 3000 -
#   so if the first post you get has > 3000 comments (possible for Pages belonging to movie stars), 
#   you'll only get comments for that one post
#---------------------------------------------------------------------------------------

from facebook_scraper import *
import pandas as pd
from time import sleep
from datetime import datetime
from random import randint

fbpage_id = 'xxxxxxxxxxxxxxxx'
posts_file = '/opt/fbscraping/data/' + fbpage_id + '_posts.csv'
comments_file = '/opt/fbscraping/data/' + fbpage_id + '_comments.csv'
replies_file = '/opt/fbscraping/data/' + fbpage_id + '_replies.csv'
cookies_file = '/opt/fbscraping/data/cookies.json'

# ------------------- current sleeping behavior -------------------
#   [POSTS] @ begin iteration
#    -> [COMMENTS] @ start, sleep for 3-7 seconds 
#    ----> [REPLIES] @ end, sleep for 10-40 seconds
#    -> [COMMENTS] @ end, sleep for 5-15 seconds
#   [POSTS] @ end, sleep for 37-89 seconds
# -----------------------------------------------------------------

# define pagination info
start_url = None
def handle_pagination_url(url):
    global start_url
    start_url = url

# define pandas dataframe
#   (refer to code example here: https://github.com/kevinzg/facebook-scraper/issues/414)
posts_df_ori = pd.DataFrame(columns = ['username', 'time', 'likes', 'comments', 'shares', 'reactions', 'post_text'])
comments_df_ori = pd.DataFrame(columns = ['post_id', 'commenter_name', 'comment_time', 'comment_reactors', 'replies', 'comment_text'])
replies_df_ori = pd.DataFrame(columns = ['post_id', 'parent_comment_id', 'commenter_name', 'comment_time', 'comment_reactors', 'comment_text'])

# [ALL_POSTS] retrieve all posts
print("[", datetime.now().strftime("%x %-I:%M:%S %p"), "][",  fbpage_id, "] STARTED - Retrieving posts")
pi=0
all_posts = get_posts(
        group=fbpage_id,
        extra_info=True,
        cookies = cookies_file,
        pages=1,
        timeout = 60,
        options={
            "comments": "generator",
            "comment_start_url": start_url,
            "comment_request_url_callback": handle_pagination_url
        },
    )

# [ALL_POSTS] iterate through using next() pagination 
while post := next(all_posts, None):
    pi += 1
    try:        
        # [POST] pandas dataframe
        print("[", datetime.now().strftime("%x %-I:%M:%S %p"), "][",  fbpage_id, "][", post["post_id"], "] Appending post info to 'posts_df_ori' dataframe. Post index: ", pi, " Total comments: ", post["comments"])
        post_dataframe = post
        post_df = pd.DataFrame.from_dict(post_dataframe, orient='index')
        post_df = post_df.transpose()
        posts_df_ori = pd.concat([posts_df_ori, post_df], ignore_index=True)  # concat, since DataFrame.append is gone in pandas 2.x

        # [COMMENT] begin loop
        ci=0
        comments = post["comments_full"]
        for comment in comments:
            
            # [COMMENT] determine replies
            ci += 1
            comment["replies"] = list(comment["replies"])
            
            # [COMMENT] pandas dataframe - transpose and add post_id
            comment_dataframe = comment
            comment_df = pd.DataFrame.from_dict(comment_dataframe, orient='index')
            comment_df = comment_df.transpose()
            comment_df.insert(0,'post_id',post['post_id'])

            # [COMMENT] append new object with post_id and comment* data to master, sleep
            sleepCalc = randint(3,7)
            print("[", datetime.now().strftime("%x %-I:%M:%S %p"), "][",  fbpage_id, "][", post["post_id"], "] Appending comments info to 'comments_df_ori' dataframe. Post index: ", pi, " Comment index: ", ci, " Sleeping for: ", sleepCalc)
            comments_df_ori = pd.concat([comments_df_ori, comment_df], ignore_index=True)
            sleep(sleepCalc)  # pause 3-7 seconds per comment, per the sleeping behavior described above

            # [COMMENT] determine if replies exist 
            if comment["replies"]:
                ri = 0
                replies = comment['replies']
                for reply in replies:
                    ri += 1

                    # [COMMENT][REPLIES] pandas dataframe - transpose and add post_id, parent_comment_id
                    reply_dataframe = reply
                    reply_df = pd.DataFrame.from_dict(reply_dataframe, orient='index')
                    reply_df = reply_df.transpose()
                    reply_df.insert(0,'post_id',post['post_id'])
                    reply_df.insert(1,'parent_comment_id',comment['comment_id'])

                    # [COMMENT][REPLIES] append new object with post_id, parent_comment_id, and comment* data to master, sleep
                    sleepCalc = randint(10,40)
                    print("[", datetime.now().strftime("%x %-I:%M:%S %p"), "][",  fbpage_id, "][", post["post_id"], "][", comment['comment_id'],"] Appending replies to 'replies_df_ori' dataframe. Post index: ", pi, " Comment index: ", ci, "Replies index: ", ri, " Sleeping for: ", sleepCalc)
                    replies_df_ori = pd.concat([replies_df_ori, reply_df], ignore_index=True)
                    sleep(sleepCalc)

            # [COMMENT] sleep for sleepCalc duration
            sleepCalc = randint(5,15)
            sleep(sleepCalc)

        # [POST] sleep between posts
        sleepCalc = randint(37,89)
        print("---------------------------sleeping for ", sleepCalc, " seconds-----------------------------")
        sleep(sleepCalc)  # pause 37-89 seconds between posts, per the sleeping behavior described above

    except exceptions.TemporarilyBanned:
        print("Temporarily banned..... HALTING")
        break

    except Exception as err:
        print("Error... let's try continuing...?", err)

        
# [ALL_POSTS] finished looping through all posts
print("========================================================================")
print("-------------------FINISHED LOOPING THROUGH ALL POSTS-------------------")
print("========================================================================")

############################################################
# finish
############################################################
print("[", datetime.now().strftime("%x %-I:%M:%S %p"), "][",  fbpage_id, "] COMPLETED - Writing posts and comments to file")
posts_df_ori.to_csv(posts_file, encoding='utf-8', index = False)
comments_df_ori.to_csv(comments_file, encoding='utf-8', index = False)
replies_df_ori.to_csv(replies_file, encoding='utf-8', index = False)
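As noted in the comment above, the final to_csv calls can be swapped for JSON output; a minimal sketch of that change (the .json filenames are assumptions):

posts_df_ori.to_json(posts_file.replace('.csv', '.json'), orient='records', force_ascii=False)
comments_df_ori.to_json(comments_file.replace('.csv', '.json'), orient='records', force_ascii=False)
replies_df_ori.to_json(replies_file.replace('.csv', '.json'), orient='records', force_ascii=False)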

1 reaction
fashan7 commented, Sep 6, 2021

@milesgratz can we get it all into one JSON, i.e. posts, comments, and replies in a single JSON file? The comments and replies need to stay in their proper chain (replies nested under their parent comments).
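One way to get there from the three DataFrames in the script above is to nest replies under their parent comments and comments under their posts before dumping a single file. A minimal sketch, assuming the post_id and comment_id fields survive into the DataFrames (the script appends the full post and comment dicts, so they should), and using a hypothetical combined.json output path:

import json

nested_posts = []
for post in posts_df_ori.to_dict(orient="records"):
    # attach this post's comments
    post_comments = comments_df_ori[comments_df_ori["post_id"] == post["post_id"]].to_dict(orient="records")
    for comment in post_comments:
        # attach each comment's replies via parent_comment_id
        comment["replies"] = replies_df_ori[
            replies_df_ori["parent_comment_id"] == comment["comment_id"]
        ].to_dict(orient="records")
    post["comments_full"] = post_comments
    nested_posts.append(post)

with open("combined.json", "w", encoding="utf-8") as f:
    json.dump(nested_posts, f, ensure_ascii=False, indent=2, default=str)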
