misuse of abstract base classes + monolithic JobFunnel class + schema validation + localisation
Description
Currently we are using the JobFunnel class for too much. I want to break it down into the following:
    import datetime
    from abc import ABC, abstractmethod
    from typing import List

    class Job:
        def __init__(self, title: str, company: str, location: str, tags: List[str], post_date: datetime.date, key_id: str, url: str) -> None:
            ...

    class Scraper(ABC):
        @abstractmethod
        def scrape(self) -> List[Job]:
            pass

    def main():
        # instantiate scrapers
        # run filter on list of Job
        # dump pickle
        # write out CSV
        ...
Note: if I get to it, I’d also like our filters to be an ABC.
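A minimal sketch of how the proposed pieces could fit together. Everything beyond `Job` and `Scraper` is an assumption made for illustration: the `JobFilter` ABC, `StaticScraper`, `CompanyBlockFilter`, `run`, and the file paths are hypothetical names, not code from JobFunnel.

```python
import csv
import datetime
import pickle
from abc import ABC, abstractmethod
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Job:
    title: str
    company: str
    location: str
    tags: List[str]
    post_date: datetime.date
    key_id: str
    url: str


class Scraper(ABC):
    @abstractmethod
    def scrape(self) -> List[Job]:
        """Return the jobs found by this scraper."""


class JobFilter(ABC):  # hypothetical: the "filters as an ABC" idea
    @abstractmethod
    def keep(self, job: Job) -> bool:
        """Return True if the job should stay in the results."""


class StaticScraper(Scraper):  # hypothetical stand-in that returns canned data
    def scrape(self) -> List[Job]:
        return [
            Job("Engineer", "Acme", "Waterloo", ["python"],
                datetime.date(2020, 1, 1), "a1", "https://example.com/a1"),
            Job("Analyst", "Blocked Inc", "Toronto", [],
                datetime.date(2020, 1, 2), "b2", "https://example.com/b2"),
        ]


class CompanyBlockFilter(JobFilter):  # hypothetical example filter
    def __init__(self, blocked: List[str]) -> None:
        self.blocked = set(blocked)

    def keep(self, job: Job) -> bool:
        return job.company not in self.blocked


def run(scrapers: List[Scraper], filters: List[JobFilter],
        pickle_path: str, csv_path: str) -> List[Job]:
    # instantiate scrapers happens in the caller; here we just run them
    jobs = [job for scraper in scrapers for job in scraper.scrape()]
    # run filters on the list of Job
    jobs = [job for job in jobs if all(f.keep(job) for f in filters)]
    # dump pickle (plain dicts, so the cache does not depend on the Job class)
    with open(pickle_path, "wb") as fh:
        pickle.dump([asdict(job) for job in jobs], fh)
    # write out CSV, skipping the file entirely when nothing survived
    if jobs:
        with open(csv_path, "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(asdict(jobs[0])))
            writer.writeheader()
            writer.writerows(asdict(job) for job in jobs)
    return jobs
```

With this split, adding a new job board means writing one `Scraper` subclass, and the pipeline in `run` stays unchanged.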
Steps to Reproduce
This is a structural technical debt issue. (n/a)
Expected behavior
The abstract base class should not be half-abstract. We need separation between JobFunnel, main(), and the inherited scrapers.
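To illustrate what a fully abstract base buys us, here is a small sketch using the standard `abc` module. The class names `IncompleteScraper` and `CompleteScraper` and the helper `can_instantiate` are hypothetical, and `scrape` returns strings only to keep the example short.

```python
from abc import ABC, abstractmethod
from typing import List


class Scraper(ABC):
    @abstractmethod
    def scrape(self) -> List[str]:
        """Every concrete scraper must implement this."""


class IncompleteScraper(Scraper):  # hypothetical: forgets to implement scrape()
    pass


class CompleteScraper(Scraper):  # hypothetical: implements the full interface
    def scrape(self) -> List[str]:
        return ["job-1"]


def can_instantiate(cls) -> bool:
    """Return True if cls() succeeds, False if ABC enforcement rejects it."""
    try:
        cls()
    except TypeError:
        return False
    return True
```

Python refuses to instantiate `Scraper` itself or any subclass that leaves `scrape` unimplemented, so a scraper that skips part of the interface fails at construction time rather than somewhere in the middle of a run.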
Actual behavior
JobFunnel being monolithic and half-abstract has allowed us to implement three script-like scrapers which share too many methods, without an actual Job object.
Environment
n/a
Current Status:
- Job Object
- Support for Internationalization
- BaseScraper with get/set scraping logic
- New YAML and CLI implemented
- Schema Validation with Cerberus
- Caching
- Filtering with lists
- Indeed
- Monster
- GlassDoorStatic (works but seems like it has bugs so fixing this).
- Wage Scraping
- GlassDoor Dynamic/Driven
- Duplicates list file support
- Integrate TFIDF similarity filter (special case filter)
- Prevent writing out empty CSVs in --no-scrape mode
- Prevent delayed get/set for jobs which fail filters
- Fix multi-page Monster scraping
- handle duplicated jobs special case
- Make JobFilter class
- Add TAG scraping to Monster
- Implement job filtering as own class
- Fix paths from -s yaml being overwritten with defaults with CLI
- Fix concurrency issue with dependencies for get/set
- Monkey / general usability testing
- Update main README
- Update other READMEs + tutorials
- Add versioning to cache files (i.e wrapper for dict with metadata)
- Review various FIXMES in-code
- Fix build (Travis CI)
- Test setup.py
- Fix demo GIF
- Document how to write new scrapers with localization
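For the cache-versioning item above, one possible shape is a dict wrapper that carries a version field next to the cached jobs. This is only a sketch under assumptions: `CACHE_VERSION`, `save_cache`, `load_cache`, and the payload keys are hypothetical names, not the format JobFunnel uses.

```python
import pickle
from typing import Any, Dict

CACHE_VERSION = 1  # hypothetical; bump whenever the cached layout changes


def save_cache(path: str, jobs: Dict[str, Any]) -> None:
    """Write the jobs wrapped in a dict that records the cache version."""
    payload = {"version": CACHE_VERSION, "jobs": jobs}
    with open(path, "wb") as fh:
        pickle.dump(payload, fh)


def load_cache(path: str) -> Dict[str, Any]:
    """Load cached jobs, rejecting files written with a different layout."""
    with open(path, "rb") as fh:
        payload = pickle.load(fh)
    found = payload.get("version") if isinstance(payload, dict) else None
    if found != CACHE_VERSION:
        raise ValueError(
            f"cache version {found!r} does not match expected {CACHE_VERSION}"
        )
    return payload["jobs"]
```

An older, unversioned cache has no `version` key, so it is rejected with a clear error instead of being misread as the new layout.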
Future work:
- Google jobs scraper
- Ycombinator job scraper
- Assess the update experience from V2.0 --> V3.0, provide a guide
- Cut a release
- Add WAGE scraping to Indeed
- Add REMOTE scraping to Indeed
- Add REMOTE scraping to Monster
Issue Analytics
- Created 3 years ago
- Comments: 31 (22 by maintainers)
@thebigG FYI, I’ve just fixed a bug with the CLI parser where YAML paths were not being respected.
I have already started testing. Like I said before, I'm starting to test cli.py and will take it from there. Love the new JobFunnel architecture by the way, great job 🚀