Spider Unhandled Error: f.seek(-size-self.SIZE_SIZE, os.SEEK_END) exceptions.IOError
The spider was working correctly until this:
2014-08-08 11:55:01+0200 [seloger] DEBUG: Scraped from <200 http://www.seloger.com/annonces/achat/maison/nanteau-sur-essonne-77/91520225.htm?bd=Detail_Nav&div=2238&idtt=2&idtypebien=all&tri=d_dt_crea>
{'area': u'Nanteau sur Essonne',
'id': '91520225',
'nbroom': u'2',
'nbsleepingroom': u'1',
'phone': u'02 38 34 87 48',
'price': u'88000',
'source': 'seloger',
'surface': u'18',
'text': u"5'de malesherbes, espace et libert\xe9, c'est la sensation que vous \xe9prouverez en visitant ce terrain de loisirs de 2236 m\xb2 avec chalet et garage ferm\xe9, en bordure de rivi\xe8re dans un cadre exceptionnel ! \xc0 saisir ! Classe \xe9nergie: vierge.",
'title': u'Maison Nanteau Sur Essonne 2 pi\xe8ce (s) 18 m\xb2',
'type': 'maison',
'url': u'http://www.seloger.com/annonces/achat/maison/nanteau-sur-essonne-77/91520225.htm?p=CCCPqX4AAKvgYyNg'}
2014-08-08 11:55:01+0200 [-] Unhandled Error
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 93, in start
self.start_reactor()
File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 130, in start_reactor
reactor.run(installSignalHandlers=False) # blocking call
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 1169, in run
self.mainLoop()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 1178, in mainLoop
self.runUntilCurrent()
--- <exception caught here> ---
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 800, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/Library/Python/2.7/site-packages/scrapy/utils/reactor.py", line 41, in __call__
return self._func(*self._a, **self._kw)
File "/Library/Python/2.7/site-packages/scrapy/core/engine.py", line 107, in _next_request
if not self._next_request_from_scheduler(spider):
File "/Library/Python/2.7/site-packages/scrapy/core/engine.py", line 134, in _next_request_from_scheduler
request = slot.scheduler.next_request()
File "/Library/Python/2.7/site-packages/scrapy/core/scheduler.py", line 64, in next_request
request = self._dqpop()
File "/Library/Python/2.7/site-packages/scrapy/core/scheduler.py", line 94, in _dqpop
d = self.dqs.pop()
File "/Library/Python/2.7/site-packages/queuelib/pqueue.py", line 43, in pop
m = q.pop()
File "/Library/Python/2.7/site-packages/scrapy/squeue.py", line 18, in pop
s = super(SerializableQueue, self).pop()
File "/Library/Python/2.7/site-packages/queuelib/queue.py", line 157, in pop
self.f.seek(-size-self.SIZE_SIZE, os.SEEK_END)
exceptions.IOError: [Errno 22] Invalid argument
2014-08-08 11:55:01+0200 [-] Unhandled Error
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 93, in start
self.start_reactor()
File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 130, in start_reactor
reactor.run(installSignalHandlers=False) # blocking call
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 1169, in run
self.mainLoop()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 1178, in mainLoop
self.runUntilCurrent()
--- <exception caught here> ---
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 800, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/Library/Python/2.7/site-packages/scrapy/utils/reactor.py", line 41, in __call__
return self._func(*self._a, **self._kw)
File "/Library/Python/2.7/site-packages/scrapy/core/engine.py", line 107, in _next_request
if not self._next_request_from_scheduler(spider):
File "/Library/Python/2.7/site-packages/scrapy/core/engine.py", line 134, in _next_request_from_scheduler
request = slot.scheduler.next_request()
File "/Library/Python/2.7/site-packages/scrapy/core/scheduler.py", line 64, in next_request
request = self._dqpop()
File "/Library/Python/2.7/site-packages/scrapy/core/scheduler.py", line 94, in _dqpop
d = self.dqs.pop()
File "/Library/Python/2.7/site-packages/queuelib/pqueue.py", line 43, in pop
m = q.pop()
File "/Library/Python/2.7/site-packages/scrapy/squeue.py", line 18, in pop
s = super(SerializableQueue, self).pop()
File "/Library/Python/2.7/site-packages/queuelib/queue.py", line 157, in pop
self.f.seek(-size-self.SIZE_SIZE, os.SEEK_END)
exceptions.IOError: [Errno 22] Invalid argument
2014-08-08 11:55:03+0200 [seloger] INFO: Crawled 174 pages (at 23 pages/min), scraped 165 items (at 17 items/min)
2014-08-08 11:56:03+0200 [seloger] INFO: Crawled 174 pages (at 0 pages/min), scraped 165 items (at 0 items/min)
2014-08-08 11:57:03+0200 [seloger] INFO: Crawled 174 pages (at 0 pages/min), scraped 165 items (at 0 items/min)
…and nothing more; it stopped there.
Has anyone ever seen something similar?
Thanks.
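For context, the failing call in queuelib's pop() seeks backwards from the end of the on-disk queue file by a record size plus the width of the size field. If that size value is wrong (for example because the file was truncated or corrupted), the target offset lands before the start of the file and the seek fails with exactly this error. A minimal sketch, outside Scrapy and with a made-up size value, that reproduces the same IOError:

```python
import os

# Minimal sketch, not queuelib itself: a tiny stand-in for a corrupt queue
# file whose trailing length field claims a record far larger than the file.
with open("p1.broken", "wb") as f:
    f.write(b"\x00" * 8)

SIZE_SIZE = 4      # width of the length field, as in the traceback
size = 100000      # hypothetical corrupt value supposedly read from the file

with open("p1.broken", "rb") as f:
    # Seeking to a negative absolute position fails with
    # IOError: [Errno 22] Invalid argument on Python 2 (OSError on Python 3).
    f.seek(-size - SIZE_SIZE, os.SEEK_END)
```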
Issue Analytics
- Created 9 years ago
- Reactions: 1
- Comments: 12 (5 by maintainers)
Top GitHub Comments
I have the same problem with scrapy 1.4.0. It occurs when I use the scheduler's disk queue with -s JOBDIR=.scrapy/crawljob. The directory requests.queue is in .scrapy/crawljob, and the files in it are p1, p2, p3 and active.json. When I open the file p1, there are only a few unreadable characters, so I guess the issue happens because the content of p1 is corrupt. So I removed p1, renamed p2 to p1, renamed p3 to p2, and changed the content of active.json to [1, 2]. After that, I ran the command with -s JOBDIR=.scrapy/crawljob again and it worked. Hope it helps.
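A hedged sketch of that manual repair, assuming the layout described in the comment above (a JOBDIR at .scrapy/crawljob whose requests.queue directory holds p1, p2, p3 and active.json); the paths and file names come from the comment, not from Scrapy's documentation:

```python
import json
import os

# Sketch of the workaround above: drop the corrupt chunk p1, shift the
# remaining chunks down by one, and update active.json accordingly.
queue_dir = ".scrapy/crawljob/requests.queue"

os.remove(os.path.join(queue_dir, "p1"))
os.rename(os.path.join(queue_dir, "p2"), os.path.join(queue_dir, "p1"))
os.rename(os.path.join(queue_dir, "p3"), os.path.join(queue_dir, "p2"))

# active.json lists the chunk numbers still in use; after the shift it is [1, 2].
with open(os.path.join(queue_dir, "active.json"), "w") as f:
    json.dump([1, 2], f)
```

Needless to say, this discards whatever requests were stored in the corrupt chunk.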
I have the same issue. I was running just one spider, so it is not a concurrency issue. What happened in my case, however, is that I ran out of disk space, which made the spider crash. It happened once, and a restart worked. Then it happened again, and now I am getting this. So it's not always happening, but it seems bound to happen eventually when running out of disk space. Unfortunately I can't scroll back far enough to see the out-of-disk-space error; I guess that would have helped even more with the investigation. 😦
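Since a disk-full crash can leave a half-written record at the end of the queue file, a small pre-flight check before resuming with the same JOBDIR might help. This is only a sketch under the assumption, suggested by the traceback, that each record ends with a 4-byte big-endian length field; the function and its name are hypothetical, not part of Scrapy or queuelib:

```python
import os
import struct

def tail_record_intact(path, size_format=">L", size_size=4):
    """Necessary (not sufficient) condition for the last record of a
    LIFO-style disk queue file to have been written completely."""
    file_size = os.path.getsize(path)
    if file_size < size_size:
        return file_size == 0          # an empty file is fine, a stub is not
    with open(path, "rb") as f:
        f.seek(-size_size, os.SEEK_END)
        (record_size,) = struct.unpack(size_format, f.read(size_size))
    # The record plus its length field must not reach past the start of the file.
    return record_size + size_size <= file_size
```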