Support scraping data from local files
Hi @roclark! Really like the project. What do you think of supporting local HTML files that have been downloaded from sports-reference in advance?
It could be nice to let users specify that they have pre-downloaded certain resources through some kind of API configuration, perhaps with a mapping like {'some-resource-id': 'path_to_resource_page.html'}.
After looking through the code a bit, maybe this could happen in utils.py with a new function that retrieves a document, choosing between PyQuery(url=x) and PyQuery(filename=x)?
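A rough sketch of what such a helper in utils.py might look like. The function names, the local_files mapping, and the resource-id scheme below are all hypothetical illustrations of the idea above, not part of the project:

```python
def resolve_source(resource_id, url, local_files=None):
    """Decide where a resource should come from.

    local_files is the proposed user-supplied mapping, e.g.
    {'some-resource-id': 'path_to_resource_page.html'}. Returns a
    (keyword, value) pair matching PyQuery's constructor keywords.
    """
    local_files = local_files or {}
    path = local_files.get(resource_id)
    if path:
        return ('filename', path)
    return ('url', url)


def retrieve_html_page(resource_id, url, local_files=None):
    """Build a PyQuery document from either a local file or the network."""
    # Imported here so the routing logic above stays testable without pyquery.
    from pyquery import PyQuery as pq
    kind, value = resolve_source(resource_id, url, local_files)
    # PyQuery accepts both url= and filename= keywords, covering both cases.
    return pq(**{kind: value})
```

Callers in the library would then pass the user's mapping through to this single entry point instead of constructing PyQuery objects directly.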
Issue Analytics
- State:
- Created 5 years ago
- Comments:5 (3 by maintainers)
I like your idea of the downloader; I think that would clear things up a lot here! Also, giving users the option to specify local HTML files, and downloading them only when necessary, would help minimize confusion as to which pages are needed for each class.
You are more than welcome to create a PR for this if you desire. I am focused on adding a few other features at the moment (including creating a website for a related project), so I don’t think I will be able to get to this immediately, but as mentioned, this is definitely something I see value in and would like to include. It just might take me a bit before I can call it complete. 😄
Hey @vesper8, thanks for the additional feedback! I think now is a good time to revisit this, and you make a great point about lowering the server load on their side. I will try to incorporate this into one of the upcoming releases. In the utility module, I can create a way to route the retrieval of the webpage and pull it from a local directory instead. I will work on this a bit and see if I can get something going!