msgpack errors when using iter() with intervals between each batch call
Good Day!
I’ve encountered this peculiar issue when trying to save memory by processing the items in chunks. Here’s a stripped-down version of the code for reproducing the issue:
```python
import pandas as pd
from scrapinghub import ScrapinghubClient


def read_job_items_by_chunk(jobkey, chunk=10000):
    """In order to prevent OOM issues, the jobs' data must be read in
    chunks.

    This will return a generator of pandas DataFrames.
    """
    client = ScrapinghubClient("APIKEY123")
    item_generator = client.get_job(jobkey).items.iter()

    while item_generator:
        yield pd.DataFrame(
            [next(item_generator) for _ in range(chunk)]
        )


for df_chunk in read_job_items_by_chunk('123/123/123'):
    # having a small chunk-size like 10000 won't have any problems
    ...

for df_chunk in read_job_items_by_chunk('123/123/123', chunk=25000):
    # having a big chunk-size like 25000 will throw out errors like the one below
    ...
```
Here’s the common error it throws:
```
<omitted stack trace above>
    [next(item_generator) for _ in range(chunk)]
  File "/usr/local/lib/python2.7/site-packages/scrapinghub/client/proxy.py", line 115, in iter
    _path, requests_params, **apiparams
  File "/usr/local/lib/python2.7/site-packages/scrapinghub/hubstorage/serialization.py", line 33, in mpdecode
    for obj in unpacker:
  File "msgpack/_unpacker.pyx", line 459, in msgpack._unpacker.Unpacker.__next__ (msgpack/_unpacker.cpp:459)
  File "msgpack/_unpacker.pyx", line 390, in msgpack._unpacker.Unpacker._unpack (msgpack/_unpacker.cpp:390)
  File "/usr/local/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 67: invalid start byte
```
Moreover, it throws a different error when using a much bigger chunk size, like 50000:
```
<omitted stack trace above>
    [next(item_generator) for _ in range(chunk)]
  File "/usr/local/lib/python2.7/site-packages/scrapinghub/client/proxy.py", line 115, in iter
    _path, requests_params, **apiparams
  File "/usr/local/lib/python2.7/site-packages/scrapinghub/hubstorage/serialization.py", line 33, in mpdecode
    for obj in unpacker:
  File "msgpack/_unpacker.pyx", line 459, in msgpack._unpacker.Unpacker.__next__ (msgpack/_unpacker.cpp:459)
  File "msgpack/_unpacker.pyx", line 390, in msgpack._unpacker.Unpacker._unpack (msgpack/_unpacker.cpp:390)
TypeError: unhashable type: 'dict'
```
I find that the workaround/solution for this would be to have a lower value for `chunk`. So far, 1000 works great.
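Another workaround I've been sketching is to fetch each chunk with its own request instead of keeping a single `iter()` stream open while a DataFrame is being processed. This is only a rough sketch: it assumes `items.iter()` forwards the Items API's `start`/`count` parameters (as the python-scrapinghub docs suggest) and reuses the `APIKEY123` placeholder from the snippet above.

```python
import pandas as pd
from scrapinghub import ScrapinghubClient


def read_job_items_by_page(jobkey, chunk=10000):
    """Yield pandas DataFrames of up to ``chunk`` items each.

    Every chunk is requested separately via ``start``/``count``, so no
    msgpack stream stays open while the previous chunk is processed.
    """
    client = ScrapinghubClient("APIKEY123")
    items = client.get_job(jobkey).items
    offset = 0
    while True:
        # '<jobkey>/<offset>' is the item-key format accepted by ``start``
        batch = list(items.iter(start='{}/{}'.format(jobkey, offset), count=chunk))
        if not batch:
            break
        yield pd.DataFrame(batch)
        offset += len(batch)
```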
This uses the `scrapy:1.5` stack in Scrapy Cloud.
I’m guessing this might have something to do with the long wait while each pandas DataFrame chunk is being processed: by the time the next batch of items is iterated, the server might have deallocated the pointer to it, or something along those lines.
May I ask if there might be a solution for this, since a much bigger chunk size would help with the speed of our jobs?
I’ve marked it as a bug for now, as this is quite unexpected/undocumented behavior.
Cheers!
Top GitHub Comments
Hi @vshlapakov, I’ve made a PR in #133 based on your suggestion. I think having this convenient method would be really helpful in cases where we’re processing a large number of items.
@manycoding, I see that this might also be of use to arche, given your issue in https://github.com/scrapinghub/arche/issues/140.
Thanks!
Thanks @vshlapakov! The project was using `2.0.3`. I’ll try `2.1.1` to confirm whether it indeed fixes the wrong iteration behavior. I should have some results to verify in a few weeks. Otherwise, I’ll try the pagination suggestion you’ve introduced. Cheers!
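For completeness, here’s roughly how I expect to use the chunked helper once the upgrade lands. This assumes the method is exposed as `items.list_iter(chunksize=...)`, as the newer python-scrapinghub docs describe, so double-check the name against the installed version.

```python
import pandas as pd
from scrapinghub import ScrapinghubClient

client = ScrapinghubClient("APIKEY123")
job = client.get_job('123/123/123')

# Each iteration yields a list of at most ``chunksize`` items fetched in
# its own request, so there is no long-lived msgpack stream to go stale.
for item_chunk in job.items.list_iter(chunksize=25000):
    df_chunk = pd.DataFrame(item_chunk)
    # process df_chunk here
```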