Scrape does not get the full post when there are 2 layers of <See more...>

@neon-ninja When a post has long text (post_text) with a 'double layer' of 'See more' links that need to be clicked, the extractor only manages to get the first layer. What I have tested, with facebook-scraper==0.2.42 from git master:

  1. Using 2 different accounts (with 2 different cookies) in Chrome and also in Firefox. I used EditThisCookie in Chrome and Cookie Quick Manager in Firefox.
  2. Using both the Windows CLI and a .py script.
  3. With --encoding utf-8 and without an encoding.

For the CLI I used this command:

facebook-scraper --filename najibFullPost1.csv --pages 5 najibrazak -c C:\\Users\\insane\\Desktop\\NajibRazak\\cookies.json -v --encoding utf-8

The output for one layer of 'See more' is fine, but when there are two layers only the first layer is captured:

1 layer output: [screenshots: post click, post link]

2 layer output: [screenshots: post click, post link]

I have read reports from others facing this issue, but none of them seems to solve the problem.

However, fetching the post directly by its ID:

>>> from facebook_scraper import get_posts, enable_logging
>>> import logging
>>> import pprint
>>> enable_logging(logging.DEBUG)
>>> for post in get_posts(post_urls=[10157944979490952]):
...     print(post['text'])
...

returns the full post text, but running the CLI against the page username does not.

Side note: I had a problem where the output file contained an empty line between each record (row). I fixed it by adding newline='' to the open() call:

with open(filename, 'w', encoding=encoding, newline='') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(list_of_posts)
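
As a self-contained illustration of the newline='' fix above, here is a minimal sketch. The helper name `write_posts_csv` and the sample rows are hypothetical and not part of facebook-scraper. The csv writer already ends each row with \r\n, so opening the file with newline='' stops Windows text mode from translating that into \r\r\n, which is what shows up as blank rows.

```python
import csv
import os
import tempfile


def write_posts_csv(filename, posts, encoding="utf-8"):
    """Write a list of dicts to CSV without extra blank rows (hypothetical helper)."""
    keys = list(posts[0].keys())
    # newline='' disables newline translation, so the writer's own "\r\n"
    # row terminator is written to disk unchanged on every platform.
    with open(filename, "w", encoding=encoding, newline="") as output_file:
        dict_writer = csv.DictWriter(output_file, keys)
        dict_writer.writeheader()
        dict_writer.writerows(posts)


# Made-up sample rows, for illustration only
posts = [{"post_id": "1", "text": "first"}, {"post_id": "2", "text": "second"}]
path = os.path.join(tempfile.mkdtemp(), "posts.csv")
write_posts_csv(path, posts)

# Read the raw bytes back to inspect the actual line endings
with open(path, "rb") as f:
    raw = f.read()
print(raw)
```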

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 23

Top GitHub Comments

neon-ninja commented, Jun 14, 2021

Ok, I think I see the problem. For me, the HTML is

<p> 1. i-Sinar dan i-Lestari juga… <a href="/story.php?story_fbid=10157944979490952&amp;id=157851205951&amp;_ft_=mf_story_key.10157944979490952%3Atop_level_post_id.10157944979490952%3Atl_objid.10157944979490952%3Acontent_owner_id_new.157851205951%3Athrowback_story_fbid.10157944979490952%3Apage_id.157851205951%3Astory_location.4%3Astory_attachment_style.photo%3Atds_flgs.3%3Aott.AX-KtQoVMZIEDTeL&amp;__tn__=%2C%3B" data-gt="{&quot;tn&quot;:&quot;,;&quot;}">More</a></p>

but for you, it’s

<p>
       1. i-Sinar dan i-Lestari juga…
       <a data-gt="{&quot;tn&quot;:&quot;,;&quot;}" href="/story.php?story_fbid=10157944979490952&amp;id=157851205951&amp;_ft_=mf_story_key.10157944979490952%3Atop_level_post_id.10157944979490952%3Atl_objid.10157944979490952%3Acontent_owner_id_new.157851205951%3Athrowback_story_fbid.10157944979490952%3Apage_id.157851205951%3Astory_location.4%3Astory_attachment_style.photo%3Atds_flgs.3%3Aott.AX-KtQoVMZIEDTeL&amp;__tn__=%2C%3B">
        More
       </a>
      </p>

which the regex (?<=…\s)<a href="([^"]+) does not match, because data-gt precedes the href. This regex can be simplified; try this commit: https://github.com/kevinzg/facebook-scraper/commit/e7b2a50cb39ecccd66d43e0a8ff66b65f9e75311
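
To illustrate the mismatch described above, here is a minimal sketch that runs the quoted regex against shortened versions of both HTML variants. The `relaxed` pattern is a hypothetical alternative written only for this illustration and is not taken from the linked commit. The sample HTML strings are abbreviated stand-ins for the real markup.

```python
import re

# Pattern quoted in the comment: requires "…", exactly one whitespace
# character, and href as the first attribute of the <a> tag.
strict = re.compile(r'(?<=…\s)<a href="([^"]+)')

# Hypothetical relaxed pattern (an assumption, not the project's actual fix):
# allows any whitespace after "…" and any attributes before href.
relaxed = re.compile(r'…\s*<a\b[^>]*?\shref="([^"]+)')

# Abbreviated stand-in for the first variant: href comes first
html_href_first = (
    '<p> 1. i-Sinar dan i-Lestari juga… '
    '<a href="/story.php?story_fbid=1" data-gt="{}">More</a></p>'
)

# Abbreviated stand-in for the second variant: data-gt precedes href
html_data_gt_first = (
    '<p>\n 1. i-Sinar dan i-Lestari juga…\n'
    ' <a data-gt="{}" href="/story.php?story_fbid=1">\n More\n </a>\n</p>'
)

strict_first = strict.search(html_href_first)
strict_second = strict.search(html_data_gt_first)
relaxed_first = relaxed.search(html_href_first)
relaxed_second = relaxed.search(html_data_gt_first)

print(strict_first.group(1) if strict_first else None)
print(strict_second.group(1) if strict_second else None)
print(relaxed_first.group(1) if relaxed_first else None)
print(relaxed_second.group(1) if relaxed_second else None)
```

In this sketch the strict pattern finds the link only in the first variant, while the relaxed one finds it in both, which is consistent with only the first layer of 'See more' being followed.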

neon-ninja commented, Jun 14, 2021

Git master
