
msgpack errors when using iter() with intervals between each batch call


Good Day!

I’ve encountered this peculiar issue when trying to save memory by processing the items in chunks. Here’s a stripped-down version of the code to reproduce the issue:

import pandas as pd

from scrapinghub import ScrapinghubClient

def read_job_items_by_chunk(jobkey, chunk=10000):
    """In order to prevent OOM issues, the jobs' data must be read in
    chunks.

    This will return a generator of pandas DataFrames.
    """

    client = ScrapinghubClient("APIKEY123")

    item_generator = client.get_job(jobkey).items.iter()

    # Note: a generator object is always truthy, so this condition never
    # becomes False on its own; the loop only ends once next() raises
    # StopIteration after the job's item stream is exhausted.
    while item_generator:
        yield pd.DataFrame(
            [next(item_generator) for _ in range(chunk)]
        )

for df_chunk in read_job_items_by_chunk('123/123/123'):
    # having a small chunk size like 10000 won't cause any problems
    pass

for df_chunk in read_job_items_by_chunk('123/123/123', chunk=25000):
    # having a big chunk size like 25000 will throw errors like the one below
    pass

Here’s the common error it throws:

<omitted stack trace above>

    [next(item_generator) for _ in range(chunk)]
  File "/usr/local/lib/python2.7/site-packages/scrapinghub/client/proxy.py", line 115, in iter
    _path, requests_params, **apiparams
  File "/usr/local/lib/python2.7/site-packages/scrapinghub/hubstorage/serialization.py", line 33, in mpdecode
    for obj in unpacker:
  File "msgpack/_unpacker.pyx", line 459, in msgpack._unpacker.Unpacker.__next__ (msgpack/_unpacker.cpp:459)
  File "msgpack/_unpacker.pyx", line 390, in msgpack._unpacker.Unpacker._unpack (msgpack/_unpacker.cpp:390)
  File "/usr/local/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 67: invalid start byte

Moreover, it throws a different error when using a much bigger chunk size, like 50000:

<omitted stack trace above>

    [next(item_generator) for _ in range(chunk)]
  File "/usr/local/lib/python2.7/site-packages/scrapinghub/client/proxy.py", line 115, in iter
    _path, requests_params, **apiparams
  File "/usr/local/lib/python2.7/site-packages/scrapinghub/hubstorage/serialization.py", line 33, in mpdecode
    for obj in unpacker:
  File "msgpack/_unpacker.pyx", line 459, in msgpack._unpacker.Unpacker.__next__ (msgpack/_unpacker.cpp:459)
  File "msgpack/_unpacker.pyx", line 390, in msgpack._unpacker.Unpacker._unpack (msgpack/_unpacker.cpp:390)
TypeError: unhashable type: 'dict'

I find that the workaround/solution for this is to use a lower value for chunk. So far, 1000 works great.
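
Concretely, with the helper above, the workaround amounts to passing a smaller chunk argument:

# Workaround from the report: a chunk size of 1000 has been reliable so far.
for df_chunk in read_job_items_by_chunk('123/123/123', chunk=1000):
    pass  # process each DataFrame here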

This uses the scrapy:1.5 stack in Scrapy Cloud.

I’m guessing this might have something to do with the long pause that happens while a pandas DataFrame chunk is being processed: by the time the next batch of items is requested from the iterator, the server may have deallocated whatever it was holding for that response, or something along those lines.
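
If that hunch is right, one way around it is to avoid holding a single long-lived iterator open across the slow DataFrame work and instead issue a fresh, bounded request per chunk. The sketch below is only an illustration of that idea; it assumes items.iter() forwards start and count pagination parameters to the items API, so verify those parameter names against your installed client version before relying on them:

import pandas as pd

from scrapinghub import ScrapinghubClient

def read_job_items_by_page(jobkey, chunk=10000):
    """Yield DataFrames, fetching each chunk with its own bounded request.

    Assumes items.iter() accepts start/count item-key pagination, which
    should be confirmed for the installed scrapinghub client version.
    """
    client = ScrapinghubClient("APIKEY123")
    items = client.get_job(jobkey).items

    offset = 0
    while True:
        # Item keys follow the '<jobkey>/<itemno>' format.
        batch = list(items.iter(start='{}/{}'.format(jobkey, offset),
                                count=chunk))
        if not batch:
            break
        offset += len(batch)
        yield pd.DataFrame(batch)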

May I ask if there might be a solution for this? A much bigger chunk size would help with the speed of our jobs.

I’ve marked it as a bug for now, as this is quite unexpected/undocumented behavior.

Cheers!

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 10 (10 by maintainers)
Top GitHub Comments

2 reactions
BurnzZ commented, Oct 4, 2019

Hi @vshlapakov, I’ve made a PR in #133 based on your suggestion. I think having this convenient method would be really helpful in cases where we’re processing a large number of items.

@manycoding, I see that this might also be of use to arche from your issue in https://github.com/scrapinghub/arche/issues/140.

Thanks!
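
For readers following along, the kind of usage that convenience method is meant to enable would look roughly like the sketch below; the list_iter name and its chunksize parameter are assumptions on my part about what the PR adds, so check the installed client version and its changelog before using them:

import pandas as pd

from scrapinghub import ScrapinghubClient

client = ScrapinghubClient("APIKEY123")
job = client.get_job('123/123/123')

# Hypothetical chunked-helper usage; the method name and chunksize
# parameter are assumed and should be verified against the client version.
for batch in job.items.list_iter(chunksize=1000):
    df_chunk = pd.DataFrame(batch)
    # process each DataFrame chunk here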

1 reaction
BurnzZ commented, May 19, 2019

Thanks @vshlapakov! The project was using 2.0.3. I’ll try 2.1.1 to confirm whether it indeed fixes the wrong iteration behavior. I should have some results to verify in a few weeks.

Otherwise, I’ll try the pagination suggestion you’ve introduced. Cheers!
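
As a side note, when checking whether the newer client actually ends up in the Scrapy Cloud image, a quick way to confirm the installed version (generic Python, not specific to this library) is:

import pkg_resources

# Prints the installed scrapinghub client version, e.g. '2.0.3' or '2.1.1'.
print(pkg_resources.get_distribution('scrapinghub').version)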

