Add a data callback to put_object / read_part_data to allow for e.g. easy checksum generation without double reads


Hello Minio team,

I’d like to suggest adding a data_callback argument to the put_object function. There are a handful of use cases, including simple checksum generation during upload via something like this:

import hashlib

shasum = hashlib.sha256()
# data_callback is the argument proposed in this issue
client.put_object(bucket, name, data=fstream, data_callback=shasum.update)
print(shasum.hexdigest())

Implementation

This would be easily implemented by passing the data_callback argument from minio.api.Minio.put_object() / fput_object() to minio.helpers.read_part_data(), and calling it with data_callback(data) there.
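As a rough sketch of the idea, and not minio-py's actual code, the hypothetical `read_part` helper below stands in for `read_part_data()`: it reads up to `size` bytes and hands each chunk to an optional `data_callback`. The function name, its signature, and the read loop are assumptions made for illustration only.

```python
import hashlib
import io
from typing import BinaryIO, Callable, Optional


def read_part(
    stream: BinaryIO,
    size: int,
    data_callback: Optional[Callable[[bytes], None]] = None,
) -> bytes:
    """Read up to `size` bytes, passing each chunk to `data_callback`.

    Hypothetical stand-in for a part reader; not minio-py's implementation.
    """
    chunks = []
    remaining = size
    while remaining > 0:
        data = stream.read(remaining)
        if not data:  # end of stream
            break
        if data_callback is not None:
            data_callback(data)
        chunks.append(data)
        remaining -= len(data)
    return b"".join(chunks)


# Example: hash a stream while reading it in 4-byte parts.
payload = b"hello minio"
source = io.BytesIO(payload)
shasum = hashlib.sha256()
parts = []
while True:
    part = read_part(source, 4, data_callback=shasum.update)
    if not part:
        break
    parts.append(part)
```

Because the callback sees every chunk exactly once and in order, the digest matches a hash of the whole payload even though the data is never held in one piece by the caller.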

Context

I’m working in a situation where I need to handle files without caching them to read twice (piping an incoming HTTP stream directly into a minio upload) but also need to work with the data as it is read (to generate a checksum). Since minio-py calls the .read() method itself, there is no easy way to hook into the incoming data stream.

My proposed implementation would allow a user to work with the data in chunks as it is read. Possible use cases include:

  • Callback to hashlib to generate checksum (as above) or multiple checksums
  • Callback to magic to determine mimetype
  • Scanning / processing parts of the data
  • Function combining one or more of above

This would all be possible without needing to read the file/stream twice, so it doesn’t need to be kept on the client. This would be helpful when working with large files.
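The use cases listed above could be combined in a single callback. The sketch below is only an illustration: `combine_callbacks` and `PrefixSniffer` are hypothetical names, and the loop over `chunks` simulates whatever would call the proposed `data_callback`. The sniffer just keeps the first few bytes so that a type detector could inspect them afterwards.

```python
import hashlib
from typing import Callable

DataCallback = Callable[[bytes], None]


def combine_callbacks(*callbacks: DataCallback) -> DataCallback:
    """Return one callback that forwards each chunk to every given callback."""

    def combined(data: bytes) -> None:
        for callback in callbacks:
            callback(data)

    return combined


class PrefixSniffer:
    """Keep the first `limit` bytes seen, e.g. to inspect a file signature later."""

    def __init__(self, limit: int = 16) -> None:
        self.limit = limit
        self.prefix = b""

    def __call__(self, data: bytes) -> None:
        if len(self.prefix) < self.limit:
            self.prefix += data[: self.limit - len(self.prefix)]


sha256 = hashlib.sha256()
sha512 = hashlib.sha512()
sniffer = PrefixSniffer(limit=8)
on_data = combine_callbacks(sha256.update, sha512.update, sniffer)

# Simulate chunks arriving from an upload stream.
chunks = [b"\x89PNG\r\n", b"\x1a\n", b"rest of the file"]
for chunk in chunks:
    on_data(chunk)
```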

Alternative implementations

Some adjustments to the progress argument could be made to get similar functionality. However, the proposed solution allows for simple tasks without threading / handing data between threads.

Workarounds

It is possible to implement a new class with a custom read() method that would allow for similar behavior. However, this is not straightforward, and creating a checksum

I’m curious to hear feedback. If the maintainers think this is a good idea, I’d be happy to submit a PR.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

tgross35 commented, Feb 26, 2022

For anyone who comes across this in the future - my solution is about as follows:

import hashlib
from typing import Any, BinaryIO


class StreamBase:
    """Stream overwriting class for custom writers and readers.

    Based on Python's codecs module."""

    allowed_methods = ("tell",)

    def __init__(self, stream: BinaryIO) -> None:
        self.stream = stream
        self.checksum = hashlib.sha256()

    def checksum_dump(self) -> str:
        "Create a string in format e.g. sha256:ffff..."
        name = self.checksum.name
        digest = self.checksum.hexdigest()
        return f"{name}:{digest}"

    def __enter__(self):
        """Start context manager."""
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        """Clean up after context."""
        self.stream.close()

    def __getattr__(self, name, getattr=getattr) -> Any:
        """Inherit all other allowed methods from the underlying stream."""
        if name in self.allowed_methods:
            return getattr(self.stream, name)
        # Raise so that hasattr() and duck-typing checks behave as expected.
        raise AttributeError(name)


class StreamHashReader(StreamBase):
    """Wrapper to generate a hash during read method."""

    def read(self, *args, **kwargs) -> bytes:
        """Update the checksum and read the data from the stream."""
        data = self.stream.read(*args, **kwargs)
        self.checksum.update(data)
        return data

Then just use new_stream = StreamHashReader(incoming_stream) and pass new_stream to Minio (or wherever else). Once the stream has been fully read, new_stream.checksum_dump() returns the SHA-256 checksum.
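The same wrapper idea can be generalized to accept any callback, which gets close to the originally proposed data_callback without changes to the client library. This is a minimal sketch under that assumption: `CallbackReader` is a hypothetical name, only `read()` is intercepted, and the read loop stands in for whatever consumer (such as an upload client) pulls data from the stream.

```python
import hashlib
import io
from typing import BinaryIO, Callable


class CallbackReader:
    """Wrap a binary stream and pass every chunk read to `callback`.

    Hypothetical helper; only `read()` is intercepted.
    """

    def __init__(self, stream: BinaryIO, callback: Callable[[bytes], None]) -> None:
        self.stream = stream
        self.callback = callback

    def read(self, size: int = -1) -> bytes:
        data = self.stream.read(size)
        if data:  # skip the empty read that signals end of stream
            self.callback(data)
        return data


payload = b"example payload" * 1000
shasum = hashlib.sha256()
reader = CallbackReader(io.BytesIO(payload), shasum.update)

# Stand-in for the consumer reading the stream in chunks.
received = bytearray()
while True:
    chunk = reader.read(4096)
    if not chunk:
        break
    received += chunk
```

A consumer that needs more than `read()` from the stream (for example `tell()`) would need those methods forwarded too, as `StreamBase` does through `allowed_methods`.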

tgross35 commented, Feb 10, 2022

Understood, thanks for the quick response.

