
msgpack errors when using iter() with intervals between each batch call


Good Day!

I’ve encountered this peculiar issue when trying to save memory by processing the items in chunks. Here’s a stripped-down version of the code to reproduce the issue:

import pandas as pd

from scrapinghub import ScrapinghubClient

def read_job_items_by_chunk(jobkey, chunk=10000):
    """In order to prevent OOM issues, the jobs' data must be read in
    chunks.

    This will return a generator of pandas DataFrames.
    """

    client = ScrapinghubClient("APIKEY123")

    item_generator = client.get_job(jobkey).items.iter()

    # Note: a generator object is always truthy, so this condition never
    # becomes False on its own; the loop only ends once next() raises
    # StopIteration after the job's item stream is exhausted.
    while item_generator:
        yield pd.DataFrame(
            [next(item_generator) for _ in range(chunk)]
        )

for df_chunk in read_job_items_by_chunk('123/123/123'):
    # having a small chunk size like 10000 won't cause any problems
    pass

for df_chunk in read_job_items_by_chunk('123/123/123', chunk=25000):
    # having a big chunk size like 25000 will throw errors like the one below
    pass

Here’s the common error it throws:

<omitted stack trace above>

    [next(item_generator) for _ in range(chunk)]
  File "/usr/local/lib/python2.7/site-packages/scrapinghub/client/proxy.py", line 115, in iter
    _path, requests_params, **apiparams
  File "/usr/local/lib/python2.7/site-packages/scrapinghub/hubstorage/serialization.py", line 33, in mpdecode
    for obj in unpacker:
  File "msgpack/_unpacker.pyx", line 459, in msgpack._unpacker.Unpacker.__next__ (msgpack/_unpacker.cpp:459)
  File "msgpack/_unpacker.pyx", line 390, in msgpack._unpacker.Unpacker._unpack (msgpack/_unpacker.cpp:390)
  File "/usr/local/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 67: invalid start byte

Moreover, it throws a different error when using a much bigger chunk size, like 50000:

<omitted stack trace above>

    [next(item_generator) for _ in range(chunk)]
  File "/usr/local/lib/python2.7/site-packages/scrapinghub/client/proxy.py", line 115, in iter
    _path, requests_params, **apiparams
  File "/usr/local/lib/python2.7/site-packages/scrapinghub/hubstorage/serialization.py", line 33, in mpdecode
    for obj in unpacker:
  File "msgpack/_unpacker.pyx", line 459, in msgpack._unpacker.Unpacker.__next__ (msgpack/_unpacker.cpp:459)
  File "msgpack/_unpacker.pyx", line 390, in msgpack._unpacker.Unpacker._unpack (msgpack/_unpacker.cpp:390)
TypeError: unhashable type: 'dict'

I find that the workaround/solution for this is to use a lower value for chunk. So far, 1000 works great.
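
Concretely, with the helper above, the workaround amounts to passing a smaller chunk argument:

# Workaround from the report: a chunk size of 1000 has been reliable so far.
for df_chunk in read_job_items_by_chunk('123/123/123', chunk=1000):
    pass  # process each DataFrame here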

This uses the scrapy:1.5 stack in Scrapy Cloud.

I’m guessing this might have something to do with the long pause that happens while a pandas DataFrame chunk is being processed: by the time the next batch of items is requested from the iterator, the server may have deallocated whatever it was holding for that response, or something along those lines.
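
If that hunch is right, one way around it is to avoid holding a single long-lived iterator open across the slow DataFrame work and instead issue a fresh, bounded request per chunk. The sketch below is only an illustration of that idea; it assumes items.iter() forwards start and count pagination parameters to the items API, so verify those parameter names against your installed client version before relying on them:

import pandas as pd

from scrapinghub import ScrapinghubClient

def read_job_items_by_page(jobkey, chunk=10000):
    """Yield DataFrames, fetching each chunk with its own bounded request.

    Assumes items.iter() accepts start/count item-key pagination, which
    should be confirmed for the installed scrapinghub client version.
    """
    client = ScrapinghubClient("APIKEY123")
    items = client.get_job(jobkey).items

    offset = 0
    while True:
        # Item keys follow the '<jobkey>/<itemno>' format.
        batch = list(items.iter(start='{}/{}'.format(jobkey, offset),
                                count=chunk))
        if not batch:
            break
        offset += len(batch)
        yield pd.DataFrame(batch)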

May I ask if there might be a solution for this? A much bigger chunk size would help with the speed of our jobs.

I’ve marked it as a bug for now, as this is quite unexpected/undocumented behavior.

Cheers!

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 10 (10 by maintainers)
Top GitHub Comments

2 reactions
BurnzZ commented, Oct 4, 2019

Hi @vshlapakov, I’ve made a PR in #133 based on your suggestion. I think having this convenient method would be really helpful in cases where we’re processing a large number of items.

@manycoding, I see that this might also be of use to arche from your issue in https://github.com/scrapinghub/arche/issues/140.

Thanks!
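
For readers following along, the kind of usage that convenience method is meant to enable would look roughly like the sketch below; the list_iter name and its chunksize parameter are assumptions on my part about what the PR adds, so check the installed client version and its changelog before using them:

import pandas as pd

from scrapinghub import ScrapinghubClient

client = ScrapinghubClient("APIKEY123")
job = client.get_job('123/123/123')

# Hypothetical chunked-helper usage; the method name and chunksize
# parameter are assumed and should be verified against the client version.
for batch in job.items.list_iter(chunksize=1000):
    df_chunk = pd.DataFrame(batch)
    # process each DataFrame chunk here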

1 reaction
BurnzZ commented, May 19, 2019

Thanks @vshlapakov! The project was using 2.0.3. I’ll try 2.1.1 to confirm whether it indeed fixes the wrong iteration behavior. I should have some results to verify in a few weeks.

Otherwise, I’ll try the pagination suggestion you’ve introduced. Cheers!
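
As a side note, when checking whether the newer client actually ends up in the Scrapy Cloud image, a quick way to confirm the installed version (generic Python, not specific to this library) is:

import pkg_resources

# Prints the installed scrapinghub client version, e.g. '2.0.3' or '2.1.1'.
print(pkg_resources.get_distribution('scrapinghub').version)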

