
Spider Unhandled Error: f.seek(-size-self.SIZE_SIZE, os.SEEK_END) exceptions.IOError


The spider was working correctly until this:

2014-08-08 11:55:01+0200 [seloger] DEBUG: Scraped from <200 http://www.seloger.com/annonces/achat/maison/nanteau-sur-essonne-77/91520225.htm?bd=Detail_Nav&div=2238&idtt=2&idtypebien=all&tri=d_dt_crea>
    {'area': u'Nanteau sur Essonne',
     'id': '91520225',
     'nbroom': u'2',
     'nbsleepingroom': u'1',
     'phone': u'02 38 34 87 48',
     'price': u'88000',
     'source': 'seloger',
     'surface': u'18',
     'text': u"5'de malesherbes, espace et libert\xe9, c'est la sensation que vous \xe9prouverez en visitant ce terrain de loisirs de 2236 m\xb2 avec chalet et garage ferm\xe9, en bordure de rivi\xe8re dans un cadre exceptionnel ! \xc0 saisir ! Classe \xe9nergie: vierge.",
     'title': u'Maison Nanteau Sur Essonne 2 pi\xe8ce (s) 18 m\xb2',
     'type': 'maison',
     'url': u'http://www.seloger.com/annonces/achat/maison/nanteau-sur-essonne-77/91520225.htm?p=CCCPqX4AAKvgYyNg'}
2014-08-08 11:55:01+0200 [-] Unhandled Error
    Traceback (most recent call last):
      File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 93, in start
        self.start_reactor()
      File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 130, in start_reactor
        reactor.run(installSignalHandlers=False)  # blocking call
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 1169, in run
        self.mainLoop()
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 1178, in mainLoop
        self.runUntilCurrent()
    --- <exception caught here> ---
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 800, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/Library/Python/2.7/site-packages/scrapy/utils/reactor.py", line 41, in __call__
        return self._func(*self._a, **self._kw)
      File "/Library/Python/2.7/site-packages/scrapy/core/engine.py", line 107, in _next_request
        if not self._next_request_from_scheduler(spider):
      File "/Library/Python/2.7/site-packages/scrapy/core/engine.py", line 134, in _next_request_from_scheduler
        request = slot.scheduler.next_request()
      File "/Library/Python/2.7/site-packages/scrapy/core/scheduler.py", line 64, in next_request
        request = self._dqpop()
      File "/Library/Python/2.7/site-packages/scrapy/core/scheduler.py", line 94, in _dqpop
        d = self.dqs.pop()
      File "/Library/Python/2.7/site-packages/queuelib/pqueue.py", line 43, in pop
        m = q.pop()
      File "/Library/Python/2.7/site-packages/scrapy/squeue.py", line 18, in pop
        s = super(SerializableQueue, self).pop()
      File "/Library/Python/2.7/site-packages/queuelib/queue.py", line 157, in pop
        self.f.seek(-size-self.SIZE_SIZE, os.SEEK_END)
    exceptions.IOError: [Errno 22] Invalid argument

2014-08-08 11:55:01+0200 [-] Unhandled Error
    (the same traceback repeats, again ending in exceptions.IOError: [Errno 22] Invalid argument)

2014-08-08 11:55:03+0200 [seloger] INFO: Crawled 174 pages (at 23 pages/min), scraped 165 items (at 17 items/min)
2014-08-08 11:56:03+0200 [seloger] INFO: Crawled 174 pages (at 0 pages/min), scraped 165 items (at 0 items/min)
2014-08-08 11:57:03+0200 [seloger] INFO: Crawled 174 pages (at 0 pages/min), scraped 165 items (at 0 items/min)

and nothing more… the crawl just stalled there.

Has anyone ever seen something similar?

Thanks
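For context, my reading of the traceback (an inference, not something stated in the issue): queuelib's disk queue appears to store each record as the payload followed by a fixed-width length field, and pop() reads that trailing length and then seeks back size + SIZE_SIZE bytes from the end of the file. If the queue file is truncated or corrupt, the recorded size can exceed the file length, the seek target goes negative, and the OS rejects it with EINVAL. A minimal sketch of that failure mode (the record layout and the ">L" size format are assumptions based on the traceback):

```python
import os
import struct
import tempfile

# Assumed on-disk record layout: <payload><4-byte big-endian length>.
SIZE_SIZE = 4

# Build a "corrupt" queue file whose trailing length field claims a
# payload far larger than the file itself (file is 6 bytes total).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"xx" + struct.pack(">L", 10000))

err = None
with open(path, "rb") as f:
    # Read the trailing length field, as pop() would.
    f.seek(-SIZE_SIZE, os.SEEK_END)
    (size,) = struct.unpack(">L", f.read(SIZE_SIZE))
    try:
        # The seek from the traceback: target offset 6 - 10004 is
        # negative, which lseek() rejects with EINVAL.
        f.seek(-size - SIZE_SIZE, os.SEEK_END)
    except OSError as e:
        err = e.errno

os.unlink(path)
print(err)  # 22 (EINVAL) on Linux/macOS
```

This matches the symptom in both tracebacks: the error is not in the spider itself but in reading back a damaged on-disk request queue.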

Issue Analytics

  • State: open
  • Created: 9 years ago
  • Reactions: 1
  • Comments: 12 (5 by maintainers)

Top GitHub Comments

7 reactions
lubobill1990 commented on Feb 14, 2018

I have the same problem with Scrapy 1.4.0:

2018-02-14 11:50:03 [twisted] CRITICAL: Unhandled Error
Traceback (most recent call last):
  File "/home/lubo/.conda/envs/scrapy/lib/python3.6/site-packages/scrapy/commands/crawl.py", line 58, in run
    self.crawler_process.start()
  File "/home/lubo/.conda/envs/scrapy/lib/python3.6/site-packages/scrapy/crawler.py", line 291, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/home/lubo/.conda/envs/scrapy/lib/python3.6/site-packages/twisted/internet/base.py", line 1243, in run
    self.mainLoop()
  File "/home/lubo/.conda/envs/scrapy/lib/python3.6/site-packages/twisted/internet/base.py", line 1252, in mainLoop
    self.runUntilCurrent()
--- <exception caught here> ---
  File "/home/lubo/.conda/envs/scrapy/lib/python3.6/site-packages/twisted/internet/base.py", line 878, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/home/lubo/.conda/envs/scrapy/lib/python3.6/site-packages/scrapy/utils/reactor.py", line 41, in __call__
    return self._func(*self._a, **self._kw)
  File "/home/lubo/.conda/envs/scrapy/lib/python3.6/site-packages/scrapy/core/engine.py", line 122, in _next_request
    if not self._next_request_from_scheduler(spider):
  File "/home/lubo/.conda/envs/scrapy/lib/python3.6/site-packages/scrapy/core/engine.py", line 149, in _next_request_from_scheduler
    request = slot.scheduler.next_request()
  File "/home/lubo/.conda/envs/scrapy/lib/python3.6/site-packages/scrapy/core/scheduler.py", line 71, in next_request
    request = self._dqpop()
  File "/home/lubo/.conda/envs/scrapy/lib/python3.6/site-packages/scrapy/core/scheduler.py", line 106, in _dqpop
    d = self.dqs.pop()
  File "/home/lubo/.conda/envs/scrapy/lib/python3.6/site-packages/queuelib/pqueue.py", line 43, in pop
    m = q.pop()
  File "/home/lubo/.conda/envs/scrapy/lib/python3.6/site-packages/scrapy/squeues.py", line 19, in pop
    s = super(SerializableQueue, self).pop()
  File "/home/lubo/.conda/envs/scrapy/lib/python3.6/site-packages/queuelib/queue.py", line 162, in pop
    self.f.seek(-size-self.SIZE_SIZE, os.SEEK_END)
builtins.OSError: [Errno 22] Invalid argument

This problem occurs when I use the scheduler's disk queue with -s JOBDIR=.scrapy/crawljob.

The requests.queue directory is inside .scrapy/crawljob, and the files in it are:

drwxrwxr-x 2 xx yy      4096 Feb 11 23:24 ./
drwxrwxr-x 3 xx yy      4096 Feb  2 16:42 ../
-rw-rw-r-- 1 xx yy         9 Feb 14 11:50 active.json
-rw-rw-r-- 1 xx yy         4 Feb 14 11:50 p1
-rw-rw-r-- 1 xx yy 304687732 Feb 14 11:50 p2
-rw-rw-r-- 1 xx yy   9885041 Feb 14 11:50 p3

When I open the file p1, it contains only a few unreadable bytes.

I guess the issue happens because the content of p1 is corrupt.

So I removed p1, renamed p2 to p1, renamed p3 to p2, and changed the content of active.json to [1, 2].

After that, I ran the command with -s JOBDIR=.scrapy/crawljob again, and it worked.

Hope it helps.
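The manual fix above can be sketched as a small script. This is just the commenter's file shuffling replayed against a throwaway directory, not an official recovery procedure; the real files live under JOBDIR/requests.queue, and it is probably wise to stop the spider and back up the JOBDIR before editing it by hand:

```shell
#!/bin/sh
# Replay of the workaround against a throwaway copy of the queue directory;
# the file names (p1, p2, p3, active.json) are the ones listed above.
set -e
q="$(mktemp -d)/requests.queue"
mkdir -p "$q"
printf 'corrupt'   > "$q/p1"           # stand-in for the corrupt file
printf 'queue2'    > "$q/p2"
printf 'queue3'    > "$q/p3"
printf '[0, 1, 2]' > "$q/active.json"

rm "$q/p1"                             # drop the corrupt queue file
mv "$q/p2" "$q/p1"                     # shift the remaining files down
mv "$q/p3" "$q/p2"
printf '[1, 2]' > "$q/active.json"     # record which queues remain

ls "$q"
```

Note that this discards whatever requests were stored in the corrupt file, so the resumed crawl may skip some pages.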

1 reaction
ibz commented on Jun 27, 2018

I have the same issue. I was running just one spider, so it is not a concurrency issue. What happened in my case, however, is that I ran out of disk space, which made the spider crash. It happened once, and a restart worked. Then it happened again, and now I am getting this error. So it does not always happen, but it should eventually happen when you run out of disk space. Unfortunately, I can't scroll back far enough to see the out-of-disk-space error; I guess that would have helped even more with the investigation. 😦
