Thread-safe file import on S3
I often find that importing large files at the beginning of my workflow is a major time bottleneck. For instance, if I want to map a pair of fastq files from the EBI FTP site, it can take several hours to import each one using toil.importFile(). The same issue applies to, say, the phased chromosome 1000 Genomes VCFs.
This time could be cut in half if I could import the two fastq files at once. This actually works fine on a local job store, but does not work on an S3 job store. If I recall correctly, it’s because the boto session object is not thread-safe. But I don’t think it would be a major change to use multiple sessions (or maybe there’s a thread-safe API somewhere?). It would be a big speedup for importing large files.
Coupling this with some kind of asynchronous / batch import interface would be even better.
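For illustration, here is a minimal sketch of the "multiple sessions" idea combined with a simple batch interface. It is not Toil's implementation. It assumes boto3 and follows the boto3 guidance of giving each thread its own session. The helper names (`get_thread_session`, `upload_one`, `upload_many`) are hypothetical, and it uploads local files rather than pulling from an FTP site.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Per-thread storage, so a session is never shared between threads.
_thread_state = threading.local()


def default_session_factory():
    # Imported lazily so the sketch can be exercised without boto3 installed.
    import boto3
    return boto3.session.Session()


def get_thread_session(session_factory=default_session_factory):
    """Return the calling thread's own session, creating it on first use."""
    session = getattr(_thread_state, "session", None)
    if session is None:
        session = session_factory()
        _thread_state.session = session
    return session


def upload_one(path, bucket, key, session_factory=default_session_factory):
    """Hypothetical helper: upload one local file using this thread's session."""
    s3 = get_thread_session(session_factory).client("s3")
    s3.upload_file(path, bucket, key)
    return key


def upload_many(jobs, max_workers=2, session_factory=default_session_factory):
    """Hypothetical batch interface: jobs is an iterable of (path, bucket, key).

    Uploads run concurrently; the returned keys are in submission order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(upload_one, path, bucket, key, session_factory)
            for path, bucket, key in jobs
        ]
        # result() re-raises any exception from the worker thread.
        return [future.result() for future in futures]
```

With this shape, importing a pair of fastq files would be something like `upload_many([("r1.fq.gz", "my-bucket", "r1"), ("r2.fq.gz", "my-bucket", "r2")])`, where the file names and bucket are placeholders.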
┆Issue is synchronized with this Jira Story ┆Issue Number: TOIL-380
Issue Analytics
- Created 4 years ago
- Comments:5 (3 by maintainers)
Top GitHub Comments
@w-gao @adamnovak Reference is here: https://boto3.amazonaws.com/v1/documentation/api/1.14.31/guide/session.html#multithreading-or-multiprocessing-with-sessions
➤ Adam Novak commented:
Lon says he is fixing this.