Add a data callback to put_object / read_part_data to allow for e.g. easy checksum generation without double reads
Hello Minio team,
I’d like to suggest adding a data_callback argument to the put_object function. There are a handful of use cases, including simple checksum generation during upload via something like this:
shasum = hashlib.sha256()
client.put_object(bucket, name, data=fstream, data_callback=shasum.update)
print(shasum.hexdigest())
Implementation
This would be easily implemented by passing the data_callback argument from minio.api.Minio.put_object() / fput_object() to minio.helpers.read_part_data(), and calling it with data_callback(data) there.
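To make the proposal concrete, here is a minimal sketch of the idea. It assumes a simplified, standalone stand-in for the read helper; the function body and its signature here are illustrative assumptions, not minio-py's actual implementation, and data_callback is the proposed (not existing) parameter.

```python
import hashlib
import io


def read_part_data(stream, size, data_callback=None):
    """Read up to `size` bytes from `stream`, handing each chunk to `data_callback`.

    Simplified stand-in for minio.helpers.read_part_data(); the real helper's
    signature and behaviour may differ.
    """
    chunks = []
    remaining = size
    while remaining > 0:
        data = stream.read(remaining)
        if not data:
            break  # end of stream reached before `size` bytes were read
        if data_callback is not None:
            data_callback(data)  # proposed hook: the caller sees each chunk once
        chunks.append(data)
        remaining -= len(data)
    return b"".join(chunks)


# The checksum is updated as a side effect of the single read pass.
shasum = hashlib.sha256()
part = read_part_data(io.BytesIO(b"hello world"), 5, data_callback=shasum.update)
```

Because the callback only receives bytes that were actually read, the digest covers exactly the data handed on for upload.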
Context
I’m working in a situation where I need to handle files without caching them to read twice (input HTTP stream directly to a minio stream) but also need to work with the data as it is read (to generate a checksum). Since minio-py handles the .read() method, there is no way to easily interface with the incoming data stream.
My proposed implementation would allow a user to work with the data in chunks as it is read. Possible use cases include:
- Callback to hashlib to generate checksum (as above) or multiple checksums
- Callback to magic to determine mimetype
- Scanning / processing parts of the data
- Function combining one or more of above
This would all be possible without needing to read the file/stream twice, so it doesn’t need to be kept on the client. This would be helpful when working with large files.
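For the combined case in the list above, a small helper could fan each chunk out to several callbacks. This is only a sketch: combine_callbacks is a hypothetical name, and the loop at the end merely simulates an uploader handing over chunks as it reads them.

```python
import hashlib


def combine_callbacks(*callbacks):
    """Return one callable that forwards each chunk to every callback in order."""
    def combined(data):
        for callback in callbacks:
            callback(data)
    return combined


sha256sum = hashlib.sha256()
sha512sum = hashlib.sha512()
chunk_sizes = []

# Two checksums plus a simple byte counter, all fed from the same chunks.
on_data = combine_callbacks(
    sha256sum.update,
    sha512sum.update,
    lambda data: chunk_sizes.append(len(data)),
)

# Simulated upload: each chunk is seen exactly once.
for chunk in (b"hello ", b"world"):
    on_data(chunk)

total_bytes = sum(chunk_sizes)
```

The resulting on_data callable could then be passed wherever a single data_callback is expected.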
Alternative implementations
Some adjustments to the progress argument could be made to get similar functionality. However, the proposed solution allows for simple tasks without threading / handing data between threads.
Workarounds
It is possible to implement a new class with a custom read() method that would allow for similar behavior. However, this is not straightforward, and creating a checksum
I’m curious to hear feedback. If the maintainers think this is a good idea, I’d be happy to submit a PR.
Issue Analytics
- State:
- Created 2 years ago
- Comments:6 (3 by maintainers)
For anyone who comes across this in the future - my solution is about as follows:
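A minimal sketch of what such a wrapper might look like, assuming a class named StreamHashReader with a checksum_dump() method as used in the text that follows. The body is an illustration written under those assumptions, not the commenter's original code; the algorithm parameter is also an assumption.

```python
import hashlib


class StreamHashReader:
    """Wrap a readable binary stream and hash everything read through it."""

    def __init__(self, stream, algorithm="sha256"):
        self._stream = stream
        self._hash = hashlib.new(algorithm)

    def read(self, size=-1):
        # Delegate to the wrapped stream, then feed the same bytes to the hash.
        data = self._stream.read(size)
        self._hash.update(data)
        return data

    def checksum_dump(self):
        # Hex digest of all bytes read so far; complete once the stream is drained.
        return self._hash.hexdigest()
```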
Then just use new_stream = StreamHashReader(incoming_stream) and pass new_stream to Minio (or wherever else). new_stream.checksum_dump() can return the sha sum after the stream is finished.
Understood, thanks for the quick response.