Running scrapy shell against a local file

Before Scrapy 1.0, I could execute:

$ scrapy shell index.html

Since Scrapy 1.0, the same command raises ValueError: Missing scheme in request url: index.html:

$ scrapy shell index.html
2015-10-12 15:32:59 [scrapy] INFO: Scrapy 1.0.3 started (bot: scrapybot)
2015-10-12 15:32:59 [scrapy] INFO: Optional features available: ssl, http11, boto
2015-10-12 15:32:59 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
Traceback (most recent call last):
  File "/Users/user/.virtualenvs/so/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/Users/user/.virtualenvs/so/lib/python2.7/site-packages/scrapy/cmdline.py", line 143, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/Users/user/.virtualenvs/so/lib/python2.7/site-packages/scrapy/cmdline.py", line 89, in _run_print_help
    func(*a, **kw)
  File "/Users/user/.virtualenvs/so/lib/python2.7/site-packages/scrapy/cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "/Users/user/.virtualenvs/so/lib/python2.7/site-packages/scrapy/commands/shell.py", line 50, in run
    spidercls = spidercls_for_request(spider_loader, Request(url),
  File "/Users/user/.virtualenvs/so/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 24, in __init__
    self._set_url(url)
  File "/Users/user/.virtualenvs/so/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 59, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: index.html 
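
The traceback shows the bare filename being rejected because it carries no URL scheme. The condition can be illustrated with the standard library alone. This is a minimal sketch, not Scrapy's actual implementation, and has_scheme is a hypothetical helper name used only for illustration:

```python
from urllib.parse import urlparse


def has_scheme(url: str) -> bool:
    """Return True if the string carries a URL scheme such as http or file."""
    return bool(urlparse(url).scheme)


# A bare filename has no scheme; an explicit file URL does.
print(has_scheme("index.html"))                           # False
print(has_scheme("file:///absolute/path/to/index.html"))  # True
```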

As a workaround, I’ve used the file:// scheme with the absolute path to the file:

$ scrapy shell file:///absolute/path/to/index.html
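
If typing the absolute path by hand is inconvenient, the URL can be generated from a relative filename. The sketch below uses only pathlib from the standard library; local_file_url is a hypothetical helper name, and the sketch assumes the file lives relative to the current working directory:

```python
from pathlib import Path


def local_file_url(filename: str) -> str:
    """Build a file:// URL from a path relative to the current directory."""
    # resolve() makes the path absolute; as_uri() requires an absolute path.
    return Path(filename).resolve().as_uri()


print(local_file_url("index.html"))
```

The printed string can then be passed to scrapy shell in place of the bare filename.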

According to a comment on the related Stack Overflow question, http://stackoverflow.com/questions/33088877/scrapy-shell-against-a-local-file, the relevant change was introduced here.

Would it be possible, and would it make sense, to bring the previous behavior back so that we can run the shell against a local file as easily as scrapy shell filename?

Thanks!

Issue Analytics

  • State: closed
  • Created 8 years ago
  • Comments: 17 (11 by maintainers)

Top GitHub Comments

1 reaction
tarunteckedge commented, Sep 10, 2018

Good to see that scrapy shell “./path/to/file/hello.html” works.

But the same URL doesn’t work in a spider. Can anyone help with that, or confirm that it is not supposed to work there?

0 reactions
alecxe commented, Jan 29, 2016

Thanks everyone for the time! Glad to see it being a part of 1.1.
