ValueError: empty vocabulary

Issue Template

Description

Running a standard search fails with a web scraping error.

Steps to Reproduce

Run a standard search with the following providers:

  • ‘Indeed’
  • ‘Monster’
  • ‘GlassDoor’

Expected behavior

Results of query

Actual behavior

No loglevel

Traceback (most recent call last):
  File "C:\Users\phcre\AppData\Local\Programs\Python\Python38\Scripts\funnel-script.py", line 11, in <module>
    load_entry_point('JobFunnel', 'console_scripts', 'funnel')()
  File "c:\users\phcre\documents\jobs\jobfunnel\jobfunnel\__main__.py", line 55, in main
    jf.update_masterlist()
  File "c:\users\phcre\documents\jobs\jobfunnel\jobfunnel\jobfunnel.py", line 330, in update_masterlist
    tfidf_filter(self.scrape_data, masterlist)
  File "c:\users\phcre\documents\jobs\jobfunnel\jobfunnel\tools\filters.py", line 118, in tfidf_filter
    duplicate_ids = tfidf_filter(cur_dict)
  File "c:\users\phcre\documents\jobs\jobfunnel\jobfunnel\tools\filters.py", line 90, in tfidf_filter
    similarities = cosine_similarity(vectorizer.fit_transform(query_words))
  File "c:\users\phcre\appdata\local\programs\python\python38\lib\site-packages\sklearn\feature_extraction\text.py", line 1840, in fit_transform
    X = super().fit_transform(raw_documents)
  File "c:\users\phcre\appdata\local\programs\python\python38\lib\site-packages\sklearn\feature_extraction\text.py", line 1198, in fit_transform
    vocabulary, X = self._count_vocab(raw_documents,
  File "c:\users\phcre\appdata\local\programs\python\python38\lib\site-packages\sklearn\feature_extraction\text.py", line 1129, in _count_vocab
    raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words

query_words contains only empty strings, so the vectorizer's fit_transform has no tokens from which to build a vocabulary.
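To see why a list of empty strings leaves nothing to fit, here is a minimal stdlib-only sketch. It assumes scikit-learn's default `token_pattern` of `(?u)\b\w\w+\b` (tokens of two or more word characters) and ignores stop-word filtering and every other vectorizer option. `build_vocabulary` is a hypothetical helper written for this illustration, not part of JobFunnel or scikit-learn.

```python
import re

# Default token pattern used by scikit-learn's text vectorizers:
# tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def build_vocabulary(documents):
    """Collect the set of lowercase tokens found across all documents."""
    vocabulary = set()
    for doc in documents:
        vocabulary.update(TOKEN_PATTERN.findall(doc.lower()))
    return vocabulary


# The value of query_words reported in this issue: eight empty blurbs.
query_words = ['', '', '', '', '', '', '', '']
empty_vocab = build_vocabulary(query_words)

# A single non-empty blurb is enough to produce tokens.
some_vocab = build_vocabulary(['', 'Marketing coordinator role'])

print(empty_vocab)         # set()
print(sorted(some_vocab))  # ['coordinator', 'marketing', 'role']
```

With no tokens at all, there is nothing to count, which is the condition the "empty vocabulary" ValueError reports.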

Debug Loglevel

GET http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/source {}
http://127.0.0.1:50081 "GET /session/02b7e485dd5ae5ae4fb5c16bf406267a/source HTTP/1.1" 200 381722
Finished Request
Found 8 glassdoor results for query=Advertising-Marketing-Coordinator-Account-Agency
GET http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {}
http://127.0.0.1:50081 "GET /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 200 144
Finished Request
getting glassdoor page 1 : https://www.glassdoor.com/Job/allen-advertising-marketing-coordinator-account-agency-jobs-SRCH_IL.0,5_IC1139946_KE6,54.htm?radius=25
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/Job/allen-advertising-marketing-coordinator-account-agency-jobs-SRCH_IL.0,5_IC1139946_KE6,54.htm?radius=25"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 200 14
Finished Request
GET http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/source {}
http://127.0.0.1:50081 "GET /session/02b7e485dd5ae5ae4fb5c16bf406267a/source HTTP/1.1" 200 381666
Finished Request
DELETE http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/window {}
http://127.0.0.1:50081 "DELETE /session/02b7e485dd5ae5ae4fb5c16bf406267a/window HTTP/1.1" 200 14
Finished Request
found 8 unique job ids and 0 duplicates from glassdoor
removed 0 jobs present in filter-list
removed 0 jobs in blacklist from master-list
Calculating delay...
Done! Starting scrape!
delay of 0.00s, getting glassdoor search: https://www.glassdoor.com/partner/jobListing.htm?pos=101&ao=68087&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_c26c12d6&cb=1591932436271&jobListingId=3596513699
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/partner/jobListing.htm?pos=101&ao=68087&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_c26c12d6&cb=1591932436271&jobListingId=3596513699"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 770
Finished Request
delay of 22.19s, getting glassdoor search: https://www.glassdoor.com/partner/jobListing.htm?pos=102&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_4b2ba71c&cb=1591932436271&jobListingId=3593859227
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/partner/jobListing.htm?pos=102&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_4b2ba71c&cb=1591932436271&jobListingId=3593859227"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
delay of 22.34s, getting glassdoor search: https://www.glassdoor.com/partner/jobListing.htm?pos=103&ao=58033&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_9dd170bc&cb=1591932436271&jobListingId=3319079566
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/partner/jobListing.htm?pos=103&ao=58033&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_9dd170bc&cb=1591932436271&jobListingId=3319079566"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
delay of 24.76s, getting glassdoor search: https://www.glassdoor.com/partner/jobListing.htm?pos=104&ao=926135&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_3f575b87&cb=1591932436271&jobListingId=3582441465
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/partner/jobListing.htm?pos=104&ao=926135&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_3f575b87&cb=1591932436271&jobListingId=3582441465"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
delay of 27.24s, getting glassdoor search: https://www.glassdoor.com/partner/jobListing.htm?pos=105&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_98a55e04&cb=1591932436271&jobListingId=3584976096
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/partner/jobListing.htm?pos=105&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_98a55e04&cb=1591932436271&jobListingId=3584976096"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
delay of 29.04s, getting glassdoor search: https://www.glassdoor.com/partner/jobListing.htm?pos=106&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_ca4062d5&cb=1591932436271&jobListingId=3579768726
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/partner/jobListing.htm?pos=106&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_ca4062d5&cb=1591932436271&jobListingId=3579768726"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
delay of 29.64s, getting glassdoor search: https://www.glassdoor.com/partner/jobListing.htm?pos=107&ao=60939&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_64a152f4&cb=1591932436271&jobListingId=3504589748
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/partner/jobListing.htm?pos=107&ao=60939&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_64a152f4&cb=1591932436271&jobListingId=3504589748"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
delay of 18.15s, getting glassdoor search: https://www.glassdoor.com/partner/jobListing.htm?pos=108&ao=60939&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_81ad2932&cb=1591932436272&jobListingId=3543437733
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/partner/jobListing.htm?pos=108&ao=60939&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_81ad2932&cb=1591932436272&jobListingId=3543437733"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
glassdoor scrape job took 173.619s
removed 0 jobs present in filter-list
removed 0 jobs in blacklist from master-list
removed 0 jobs present in filter-list
removed 0 jobs in blacklist from master-list
Traceback (most recent call last):
  File "C:\Users\phcre\AppData\Local\Programs\Python\Python38\Scripts\funnel-script.py", line 11, in <module>
    load_entry_point('JobFunnel', 'console_scripts', 'funnel')()
  File "c:\users\asdf\documents\jobs\jobfunnel\jobfunnel\__main__.py", line 55, in main
    jf.update_masterlist()
  File "c:\users\asdf\documents\jobs\jobfunnel\jobfunnel\jobfunnel.py", line 330, in update_masterlist
    tfidf_filter(self.scrape_data, masterlist)
  File "c:\users\asdf\documents\jobs\jobfunnel\jobfunnel\tools\filters.py", line 118, in tfidf_filter
    duplicate_ids = tfidf_filter(cur_dict)
  File "c:\users\asdf\documents\jobs\jobfunnel\jobfunnel\tools\filters.py", line 90, in tfidf_filter
    similarities = cosine_similarity(vectorizer.fit_transform(query_words))
  File "c:\users\asdf\appdata\local\programs\python\python38\lib\site-packages\sklearn\feature_extraction\text.py", line 1840, in fit_transform
    X = super().fit_transform(raw_documents)
  File "c:\users\asdf\appdata\local\programs\python\python38\lib\site-packages\sklearn\feature_extraction\text.py", line 1198, in fit_transform
    vocabulary, X = self._count_vocab(raw_documents,
  File "c:\users\asdf\appdata\local\programs\python\python38\lib\site-packages\sklearn\feature_extraction\text.py", line 1129, in _count_vocab
    raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words

Is the webdriver manager returning 404 errors?
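One way to check this from the debug output is to tally the response lines by method and status code. The sketch below is only an illustration: `tally_statuses` and the regular expression are assumptions based on the line format visible in the log above, not part of JobFunnel or Selenium.

```python
import re
from collections import Counter

# Matches the response lines in the debug log, e.g.
#   http://127.0.0.1:50081 "POST /session/<id>/url HTTP/1.1" 404 899
RESPONSE_LINE = re.compile(r'"(GET|POST|DELETE) (\S+) HTTP/1\.1" (\d{3}) \d+')


def tally_statuses(log_lines):
    """Count (method, status) pairs among webdriver response lines."""
    counts = Counter()
    for line in log_lines:
        match = RESPONSE_LINE.search(line)
        if match:
            method, _path, status = match.groups()
            counts[(method, int(status))] += 1
    return counts


# A few lines copied from the debug log above.
sample = [
    'http://127.0.0.1:50081 "GET /session/02b7e485dd5ae5ae4fb5c16bf406267a/source HTTP/1.1" 200 381666',
    'Finished Request',
    'http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 770',
    'http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899',
]
counts = tally_statuses(sample)
print(counts[("POST", 404)])  # 2
print(counts[("GET", 200)])   # 1
```

Run against the full log, a count of eight POST 404 responses would match the eight job listings whose blurbs came back empty.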

Variable Contents

prev_dict

None

cur_dict.values()

odict_values([{'status': 'new', 'title': 'Account Manager Digital Marketing - Professional Services - Entertainment and Media Industry Opportunity', 'company': 'Gannett', 'location': 'Plano, TX', 'date': '', 'blurb': '', 'tags': '', 'link': 'https://www.glassdoor.com/partner/jobListing.htm?pos=101&ao=68087&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_c26c12d6&cb=1591932436271&jobListingId=3596513699', 'id': '3596513699', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'Account Coordinator - Marketing', 'company': 'The Point Group', 'location': 'Dallas, TX', 'date': '', 'blurb': '', 'tags': '', 'link': 'https://www.glassdoor.com/partner/jobListing.htm?pos=102&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_4b2ba71c&cb=1591932436271&jobListingId=3593859227', 'id': '3593859227', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'Marketing Coordinator', 'company': 'Gourmet Marketing LLC', 'location': 'Plano, TX', 'date': '', 'blurb': '', 'tags': '', 'link': 'https://www.glassdoor.com/partner/jobListing.htm?pos=103&ao=58033&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_9dd170bc&cb=1591932436271&jobListingId=3319079566', 'id': '3319079566', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'Account Coordinator - Client Service', 'company': 'RKD Group, Inc.', 'location': 'Richardson, TX', 'date': '', 'blurb': '', 'tags': '', 'link': 'https://www.glassdoor.com/partner/jobListing.htm?pos=104&ao=926135&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_3f575b87&cb=1591932436271&jobListingId=3582441465', 'id': '3582441465', 'provider': 'glassdoor', 
'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'COLLEGE GRADS & INTERNS - Entry Level Marketing & Advertising', 'company': 'Millennium Events Management', 'location': 'Dallas, TX', 'date': '', 'blurb': '', 'tags': '', 'link': 'https://www.glassdoor.com/partner/jobListing.htm?pos=105&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_98a55e04&cb=1591932436271&jobListingId=3584976096', 'id': '3584976096', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'Senior Account Executive (Marketing/Advertising)', 'company': 'The Point Group', 'location': 'Dallas, TX', 'date': '', 'blurb': '', 'tags': '', 'link': 'https://www.glassdoor.com/partner/jobListing.htm?pos=106&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_ca4062d5&cb=1591932436271&jobListingId=3579768726', 'id': '3579768726', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'Account Coordinator - Client Service', 'company': 'RKD Group', 'location': 'Richardson, TX', 'date': '', 'blurb': '', 'tags': '', 'link': 'https://www.glassdoor.com/partner/jobListing.htm?pos=107&ao=60939&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_64a152f4&cb=1591932436271&jobListingId=3504589748', 'id': '3504589748', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'Digital Account Coordinator', 'company': 'RKD Group', 'location': 'Richardson, TX', 'date': '', 'blurb': '', 'tags': '', 'link': 'https://www.glassdoor.com/partner/jobListing.htm?pos=108&ao=60939&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_81ad2932&cb=1591932436272&jobListingId=3543437733', 'id': 
'3543437733', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}]) 

query_ids

['3596513699', '3593859227', '3319079566', '3582441465', '3584976096', '3579768726', '3504589748', '3543437733']

query_words

['', '', '', '', '', '', '', '']
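Since every blurb is empty here, one possible guard is to separate jobs that have blurb text from those that do not before anything reaches the vectorizer. This is a sketch under assumptions, not JobFunnel's actual fix: `split_by_blurb` is a hypothetical helper, and the second sample entry (id and blurb) is invented purely for illustration.

```python
def split_by_blurb(jobs):
    """Split a {job_id: job} mapping into jobs with and without blurb text."""
    with_text, without_text = {}, {}
    for job_id, job in jobs.items():
        # Treat whitespace-only blurbs as empty as well.
        target = with_text if job.get('blurb', '').strip() else without_text
        target[job_id] = job
    return with_text, without_text


# Two entries shaped like the cur_dict values above (fields trimmed);
# the second entry is invented for illustration only.
cur_dict = {
    '3596513699': {'title': 'Account Manager Digital Marketing', 'blurb': ''},
    '0000000000': {'title': 'Marketing Coordinator', 'blurb': 'Plan campaigns'},
}
usable, skipped = split_by_blurb(cur_dict)

if usable:
    query_ids = list(usable)
    query_words = [job['blurb'] for job in usable.values()]
    # Only now would query_words be passed to vectorizer.fit_transform(...)
else:
    query_ids, query_words = [], []

print(query_ids)      # ['0000000000']
print(list(skipped))  # ['3596513699']
```

A guard like this does not cover blurbs that consist only of stop words, so the vectorizer call may still need its own error handling.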

Environment

  • Operating system and version: Windows 10

  • beautifulsoup4>=4.6.3 (4.9.1)
  • lxml>=4.2.4 (4.5.1)
  • requests>=2.19.1 (2.23.0)
  • python-dateutil>=2.8.0 (2.8.1)
  • PyYAML>=5.1 (5.3.1)
  • scikit-learn>=0.21.2 (0.23.1)
  • nltk>=3.4.1 (3.5)
  • scipy>=1.4.1 (1.4.1)
  • selenium>=3.141.0 (3.141.0)
  • webdriver-manager>=2.4.0 (3.1.0)
  • soupsieve>1.2 (2.0.1)
  • certifi>=2017.4.17 (2020.4.5.2)
  • urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 (1.25.9)
  • chardet<4,>=3.0.2 (3.0.4)
  • idna<3,>=2.5 (2.9)
  • six>=1.5 (1.15.0)
  • threadpoolctl>=2.0.0 (2.1.0)
  • joblib>=0.11 (0.15.1)
  • numpy>=1.13.3 (1.18.5)
  • click (7.1.2)
  • tqdm (4.46.1)
  • atomicwrites>=1.0; (1.4.0)
  • packaging (20.4)
  • pluggy<1.0,>=0.12 (0.13.1)

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (1 by maintainers)

Top GitHub Comments

1 reaction
phcreery commented, Jun 18, 2020

Yea, no problem.

> funnel -s ..\mel\settings.yaml --log_level debug

> python3 --version
Python 3.8.3

> pip3 --version
pip 19.2.3 from c:\users\phcre\appdata\local\programs\python\python38\lib\site-packages\pip (python 3.8)

settings.yaml

# All paths are relative to this file.

# Paths.
# place the search right next to this file
output_path: './'

# Providers from which to search (case insensitive)
providers:
  - 'Indeed'
  - 'Monster'
  - 'GlassDoor' # This used to take ~10x longer to run than the other providers
   

# Filters.
search_terms:
  region:
    province: 'TX'
    city:     'Allen'
    domain:   'com'
    radius:   30

  keywords:
    - 'Advertising'
    - 'Marketing'
    - 'Coordinator'
    - 'Account'
    - 'Agency'

black_list:
  - 'Sales'
  - 'Media'
  - 'Digital'
  - 'Social'

# Logging level options are: critical, error, warning, info, debug, notset
log_level: 'info'

# Saves duplicates removed by tfidf filter to duplicate_list.csv
save_duplicates: False

# Turn on or off delaying
# set_delay: True 

# Delaying algorithm configuration
delay_config:
    # Functions used for delaying algorithm, options are: constant, linear, sigmoid
    function: 'linear'
    # Maximum delay/upper bound for converging random delay
    delay: 30
    # Minimum delay/lower bound for random delay  
    min_delay: 15 
    # Random delay
    random: True 
    # Converging random delay, only used if 'random' is set to True
    converge: True 
0 reactions
phcreery commented, Jun 27, 2020

Awesome! I will try this as soon as possible.
