ValueError: empty vocabulary
Description
Standard search produces web scrape error
Steps to Reproduce
Run a standard search with the following providers:
- ‘Indeed’
- ‘Monster’
- ‘GlassDoor’
Expected behavior
Results of query
Actual behavior
No loglevel
Traceback (most recent call last):
File "C:\Users\phcre\AppData\Local\Programs\Python\Python38\Scripts\funnel-script.py", line 11, in <module>
load_entry_point('JobFunnel', 'console_scripts', 'funnel')()
File "c:\users\phcre\documents\jobs\jobfunnel\jobfunnel\__main__.py", line 55, in main
jf.update_masterlist()
File "c:\users\phcre\documents\jobs\jobfunnel\jobfunnel\jobfunnel.py", line 330, in update_masterlist
tfidf_filter(self.scrape_data, masterlist)
File "c:\users\phcre\documents\jobs\jobfunnel\jobfunnel\tools\filters.py", line 118, in tfidf_filter
duplicate_ids = tfidf_filter(cur_dict)
File "c:\users\phcre\documents\jobs\jobfunnel\jobfunnel\tools\filters.py", line 90, in tfidf_filter
similarities = cosine_similarity(vectorizer.fit_transform(query_words))
File "c:\users\phcre\appdata\local\programs\python\python38\lib\site-packages\sklearn\feature_extraction\text.py", line 1840, in fit_transform
X = super().fit_transform(raw_documents)
File "c:\users\phcre\appdata\local\programs\python\python38\lib\site-packages\sklearn\feature_extraction\text.py", line 1198, in fit_transform
vocabulary, X = self._count_vocab(raw_documents,
File "c:\users\phcre\appdata\local\programs\python\python38\lib\site-packages\sklearn\feature_extraction\text.py", line 1129, in _count_vocab
raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words
query_words contains only empty strings, so the vectorizer has no vocabulary to build and fit_transform raises the ValueError above.
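One way to avoid the crash is to check for usable text before vectorizing. The sketch below is a minimal guard, not JobFunnel's actual fix: `has_usable_text` and `safe_vectorize` are hypothetical names, and the `vectorize` argument is a stand-in for the real `vectorizer.fit_transform` call.

```python
from typing import Callable, List, Optional, Sequence


def has_usable_text(documents: Sequence[str]) -> bool:
    """Return True if at least one document has a non-whitespace character."""
    return any(doc.strip() for doc in documents)


def safe_vectorize(
    documents: Sequence[str],
    vectorize: Callable[[List[str]], object],
) -> Optional[object]:
    """Call `vectorize` only when there is text to build a vocabulary from.

    `vectorize` is a stand-in for e.g. `vectorizer.fit_transform`.
    Returns None when every document is empty, so the caller can skip
    the duplicate filter instead of crashing.
    """
    if not has_usable_text(documents):
        return None
    try:
        return vectorize(list(documents))
    except ValueError:
        # scikit-learn raises the same ValueError when the documents
        # contain only stop words; treat that as "nothing to compare".
        return None


# The values reported in this issue: eight empty blurbs.
query_words = ['', '', '', '', '', '', '', '']
result = safe_vectorize(query_words, vectorize=lambda docs: len(docs))
print(result)  # None: the vectorizer stand-in is never called
```

Returning None rather than raising lets the caller decide whether to skip de-duplication for this run or to abort with a clearer message about the empty scrape.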
Debug Loglevel
GET http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/source {}
http://127.0.0.1:50081 "GET /session/02b7e485dd5ae5ae4fb5c16bf406267a/source HTTP/1.1" 200 381722
Finished Request
Found 8 glassdoor results for query=Advertising-Marketing-Coordinator-Account-Agency
GET http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {}
http://127.0.0.1:50081 "GET /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 200 144
Finished Request
getting glassdoor page 1 : https://www.glassdoor.com/Job/allen-advertising-marketing-coordinator-account-agency-jobs-SRCH_IL.0,5_IC1139946_KE6,54.htm?radius=25
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/Job/allen-advertising-marketing-coordinator-account-agency-jobs-SRCH_IL.0,5_IC1139946_KE6,54.htm?radius=25"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 200 14
Finished Request
GET http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/source {}
http://127.0.0.1:50081 "GET /session/02b7e485dd5ae5ae4fb5c16bf406267a/source HTTP/1.1" 200 381666
Finished Request
DELETE http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/window {}
http://127.0.0.1:50081 "DELETE /session/02b7e485dd5ae5ae4fb5c16bf406267a/window HTTP/1.1" 200 14
Finished Request
found 8 unique job ids and 0 duplicates from glassdoor
removed 0 jobs present in filter-list
removed 0 jobs in blacklist from master-list
Calculating delay...
Done! Starting scrape!
delay of 0.00s, getting glassdoor search: https://www.glassdoor.com/partner/jobListing.htm?pos=101&ao=68087&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_c26c12d6&cb=1591932436271&jobListingId=3596513699
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/partner/jobListing.htm?pos=101&ao=68087&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_c26c12d6&cb=1591932436271&jobListingId=3596513699"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 770
Finished Request
delay of 22.19s, getting glassdoor search: https://www.glassdoor.com/partner/jobListing.htm?pos=102&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_4b2ba71c&cb=1591932436271&jobListingId=3593859227
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/partner/jobListing.htm?pos=102&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_4b2ba71c&cb=1591932436271&jobListingId=3593859227"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
delay of 22.34s, getting glassdoor search: https://www.glassdoor.com/partner/jobListing.htm?pos=103&ao=58033&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_9dd170bc&cb=1591932436271&jobListingId=3319079566
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/partner/jobListing.htm?pos=103&ao=58033&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_9dd170bc&cb=1591932436271&jobListingId=3319079566"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
delay of 24.76s, getting glassdoor search: https://www.glassdoor.com/partner/jobListing.htm?pos=104&ao=926135&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_3f575b87&cb=1591932436271&jobListingId=3582441465
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/partner/jobListing.htm?pos=104&ao=926135&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_3f575b87&cb=1591932436271&jobListingId=3582441465"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
delay of 27.24s, getting glassdoor search: https://www.glassdoor.com/partner/jobListing.htm?pos=105&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_98a55e04&cb=1591932436271&jobListingId=3584976096
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/partner/jobListing.htm?pos=105&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_98a55e04&cb=1591932436271&jobListingId=3584976096"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
delay of 29.04s, getting glassdoor search: https://www.glassdoor.com/partner/jobListing.htm?pos=106&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_ca4062d5&cb=1591932436271&jobListingId=3579768726
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/partner/jobListing.htm?pos=106&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_ca4062d5&cb=1591932436271&jobListingId=3579768726"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
delay of 29.64s, getting glassdoor search: https://www.glassdoor.com/partner/jobListing.htm?pos=107&ao=60939&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_64a152f4&cb=1591932436271&jobListingId=3504589748
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/partner/jobListing.htm?pos=107&ao=60939&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_64a152f4&cb=1591932436271&jobListingId=3504589748"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
delay of 18.15s, getting glassdoor search: https://www.glassdoor.com/partner/jobListing.htm?pos=108&ao=60939&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_81ad2932&cb=1591932436272&jobListingId=3543437733
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/partner/jobListing.htm?pos=108&ao=60939&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_81ad2932&cb=1591932436272&jobListingId=3543437733"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
glassdoor scrape job took 173.619s
removed 0 jobs present in filter-list
removed 0 jobs in blacklist from master-list
removed 0 jobs present in filter-list
removed 0 jobs in blacklist from master-list
Traceback (most recent call last):
File "C:\Users\phcre\AppData\Local\Programs\Python\Python38\Scripts\funnel-script.py", line 11, in <module>
load_entry_point('JobFunnel', 'console_scripts', 'funnel')()
File "c:\users\asdf\documents\jobs\jobfunnel\jobfunnel\__main__.py", line 55, in main
jf.update_masterlist()
File "c:\users\asdf\documents\jobs\jobfunnel\jobfunnel\jobfunnel.py", line 330, in update_masterlist
tfidf_filter(self.scrape_data, masterlist)
File "c:\users\asdf\documents\jobs\jobfunnel\jobfunnel\tools\filters.py", line 118, in tfidf_filter
duplicate_ids = tfidf_filter(cur_dict)
File "c:\users\asdf\documents\jobs\jobfunnel\jobfunnel\tools\filters.py", line 90, in tfidf_filter
similarities = cosine_similarity(vectorizer.fit_transform(query_words))
File "c:\users\asdf\appdata\local\programs\python\python38\lib\site-packages\sklearn\feature_extraction\text.py", line 1840, in fit_transform
X = super().fit_transform(raw_documents)
File "c:\users\asdf\appdata\local\programs\python\python38\lib\site-packages\sklearn\feature_extraction\text.py", line 1198, in fit_transform
vocabulary, X = self._count_vocab(raw_documents,
File "c:\users\asdf\appdata\local\programs\python\python38\lib\site-packages\sklearn\feature_extraction\text.py", line 1129, in _count_vocab
raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words
Is the webdriver manager returning 404 errors? Every POST to a job listing URL in the log above comes back with a 404.
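To check that hypothesis against the debug output, the response lines can be tallied by method and status code. This is a rough sketch: the regular expression is written against the log lines as pasted in this issue, and `count_statuses` is a hypothetical helper, not part of JobFunnel or Selenium.

```python
import re
from collections import Counter
from typing import Iterable

# Matches lines such as:
#   http://127.0.0.1:50081 "POST /session/<id>/url HTTP/1.1" 404 899
RESPONSE_LINE = re.compile(
    r'"(?P<method>[A-Z]+) \S+ HTTP/[\d.]+" (?P<status>\d{3}) \d+'
)


def count_statuses(lines: Iterable[str]) -> Counter:
    """Count (method, status) pairs found in webdriver debug log lines."""
    counts: Counter = Counter()
    for line in lines:
        match = RESPONSE_LINE.search(line)
        if match:
            key = (match.group('method'), int(match.group('status')))
            counts[key] += 1
    return counts


# A few lines copied from the debug log in this issue.
sample = [
    'http://127.0.0.1:50081 "GET /session/02b7e485dd5ae5ae4fb5c16bf406267a/source HTTP/1.1" 200 381666',
    'http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 770',
    'http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899',
    'Finished Request',
]
print(count_statuses(sample))
```

Lines that do not look like a response, such as "Finished Request", are ignored.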
Variable Contents
prev_dict
None
cur_dict.values()
odict_values([{'status': 'new', 'title': 'Account Manager Digital Marketing - Professional Services - Entertainment and Media Industry Opportunity', 'company': 'Gannett', 'location': 'Plano, TX', 'date': '', 'blurb': '', 'tags': '', 'link': 'https://www.glassdoor.com/partner/jobListing.htm?pos=101&ao=68087&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_c26c12d6&cb=1591932436271&jobListingId=3596513699', 'id': '3596513699', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'Account Coordinator - Marketing', 'company': 'The Point Group', 'location': 'Dallas, TX', 'date': '', 'blurb': '', 'tags': '', 'link': 'https://www.glassdoor.com/partner/jobListing.htm?pos=102&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_4b2ba71c&cb=1591932436271&jobListingId=3593859227', 'id': '3593859227', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'Marketing Coordinator', 'company': 'Gourmet Marketing LLC', 'location': 'Plano, TX', 'date': '', 'blurb': '', 'tags': '', 'link': 'https://www.glassdoor.com/partner/jobListing.htm?pos=103&ao=58033&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_9dd170bc&cb=1591932436271&jobListingId=3319079566', 'id': '3319079566', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'Account Coordinator - Client Service', 'company': 'RKD Group, Inc.', 'location': 'Richardson, TX', 'date': '', 'blurb': '', 'tags': '', 'link': 'https://www.glassdoor.com/partner/jobListing.htm?pos=104&ao=926135&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_3f575b87&cb=1591932436271&jobListingId=3582441465', 'id': '3582441465', 'provider': 'glassdoor', 
'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'COLLEGE GRADS & INTERNS - Entry Level Marketing & Advertising', 'company': 'Millennium Events Management', 'location': 'Dallas, TX', 'date': '', 'blurb': '', 'tags': '', 'link': 'https://www.glassdoor.com/partner/jobListing.htm?pos=105&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_98a55e04&cb=1591932436271&jobListingId=3584976096', 'id': '3584976096', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'Senior Account Executive (Marketing/Advertising)', 'company': 'The Point Group', 'location': 'Dallas, TX', 'date': '', 'blurb': '', 'tags': '', 'link': 'https://www.glassdoor.com/partner/jobListing.htm?pos=106&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_ca4062d5&cb=1591932436271&jobListingId=3579768726', 'id': '3579768726', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'Account Coordinator - Client Service', 'company': 'RKD Group', 'location': 'Richardson, TX', 'date': '', 'blurb': '', 'tags': '', 'link': 'https://www.glassdoor.com/partner/jobListing.htm?pos=107&ao=60939&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_64a152f4&cb=1591932436271&jobListingId=3504589748', 'id': '3504589748', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'Digital Account Coordinator', 'company': 'RKD Group', 'location': 'Richardson, TX', 'date': '', 'blurb': '', 'tags': '', 'link': 'https://www.glassdoor.com/partner/jobListing.htm?pos=108&ao=60939&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_81ad2932&cb=1591932436272&jobListingId=3543437733', 'id': 
'3543437733', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}])
query_ids
['3596513699', '3593859227', '3319079566', '3582441465', '3584976096', '3579768726', '3504589748', '3543437733']
query_words
['', '', '', '', '', '', '', '']
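Every job in cur_dict above has an empty 'blurb', which matches the empty query_words. The sketch below assumes that query_words is derived from each job's 'blurb' field and that cur_dict is keyed by job id (both are assumptions based on the values shown, not on the JobFunnel source). `ids_missing_blurb` is a hypothetical helper, and the two sample jobs are abbreviated from the dump above.

```python
from collections import OrderedDict
from typing import Dict, List, Mapping


def ids_missing_blurb(jobs: Mapping[str, Dict[str, str]]) -> List[str]:
    """Return the ids of jobs whose 'blurb' is missing or blank."""
    return [
        job_id
        for job_id, job in jobs.items()
        if not job.get('blurb', '').strip()
    ]


# Two entries abbreviated from the cur_dict dump in this issue;
# keying by job id is an assumption.
cur_dict = OrderedDict([
    ('3596513699', {'title': 'Account Manager Digital Marketing',
                    'blurb': '', 'provider': 'glassdoor'}),
    ('3593859227', {'title': 'Account Coordinator - Marketing',
                    'blurb': '', 'provider': 'glassdoor'}),
])

query_words = [job['blurb'] for job in cur_dict.values()]
print(query_words)                  # ['', '']
print(ids_missing_blurb(cur_dict))  # ['3596513699', '3593859227']
```

If the list of ids matches the whole scrape, as it does here, the problem is likely upstream in the page fetch rather than in the filter itself.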
Environment
- Operating system and version: Windows 10
- beautifulsoup4>=4.6.3 (4.9.1)
- lxml>=4.2.4 (4.5.1)
- requests>=2.19.1 (2.23.0)
- python-dateutil>=2.8.0 (2.8.1)
- PyYAML>=5.1 (5.3.1)
- scikit-learn>=0.21.2 (0.23.1)
- nltk>=3.4.1 (3.5)
- scipy>=1.4.1 (1.4.1)
- selenium>=3.141.0 (3.141.0)
- webdriver-manager>=2.4.0 (3.1.0)
- soupsieve>1.2 (2.0.1)
- certifi>=2017.4.17 (2020.4.5.2)
- urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 (1.25.9)
- chardet<4,>=3.0.2 (3.0.4)
- idna<3,>=2.5 (2.9)
- six>=1.5 (1.15.0)
- threadpoolctl>=2.0.0 (2.1.0)
- joblib>=0.11 (0.15.1)
- numpy>=1.13.3 (1.18.5)
- click (7.1.2)
- tqdm (4.46.1)
- atomicwrites>=1.0; (1.4.0)
- packaging (20.4)
- pluggy<1.0,>=0.12 (0.13.1)
Yeah, no problem.
> funnel -s ..\mel\settings.yaml --log_level debug
settings.yaml
Awesome! I will try this as soon as possible.