Unable to import large files from S3 buckets
I'm trying to use 1000 Genomes data that is already hosted in an AWS bucket. The file is a raw reads file of roughly 13 GiB. When calling toil.importFile on
s3://1000genomes/phase3/data/HG01977/sequence_read/SRR360135_1.filt.fastq.gz
I get the following error:
2016-04-26 14:13:37,623 ERROR:root: Got exception 'S3ResponseError: 400 Bad Request
<Error><Code>InvalidRequest</Code><Message>The specified copy source is larger than the maximum allowable size for a copy source: 5368709120</Message><RequestId>CCB50A66B881B563</RequestId><HostId>1dAVKee/hXEaLCvisKyxUHlWtLbZwqcedEXOsKVSzAoE5ISrsuykzg9teW9vzidQaIqImJxbNgI=</HostId></Error>' while writing 's3://1000genomes/phase3/data/HG01977/sequence_read/SRR360135_1.filt.fastq.gz'
Traceback (most recent call last):
  File "/usr/local/bin/cwltoil", line 9, in <module>
    load_entry_point('toil==3.2.0a2', 'console_scripts', 'cwltoil')()
  File "/usr/local/lib/python2.7/dist-packages/toil/cwl/cwltoil.py", line 580, in main
    adjustFiles(builder.job, functools.partial(writeFile, toil.importFile, {}))
  File "/usr/local/lib/python2.7/dist-packages/cwltool/process.py", line 128, in adjustFiles
    adjustFiles(rec[d], op)
  File "/usr/local/lib/python2.7/dist-packages/cwltool/process.py", line 131, in adjustFiles
    adjustFiles(d, op)
  File "/usr/local/lib/python2.7/dist-packages/cwltool/process.py", line 131, in adjustFiles
    adjustFiles(d, op)
  File "/usr/local/lib/python2.7/dist-packages/cwltool/process.py", line 126, in adjustFiles
    rec["path"] = op(rec["path"])
  File "/usr/local/lib/python2.7/dist-packages/toil/cwl/cwltoil.py", line 162, in writeFile
    index[x] = (writeFunc(rp), os.path.basename(x))
  File "/usr/local/lib/python2.7/dist-packages/toil/common.py", line 570, in importFile
    return self.jobStore.importFile(srcUrl)
  File "/usr/local/lib/python2.7/dist-packages/toil/jobStores/abstractJobStore.py", line 260, in importFile
    return self._importFile(findJobStoreForUrl(url), url)
  File "/usr/local/lib/python2.7/dist-packages/toil/jobStores/aws/jobStore.py", line 279, in _importFile
    info.copyFrom(srcKey)
  File "/usr/local/lib/python2.7/dist-packages/toil/jobStores/aws/jobStore.py", line 826, in copyFrom
    headers=self._s3EncryptionHeaders()).version_id
  File "/usr/local/lib/python2.7/dist-packages/toil/jobStores/aws/jobStore.py", line 865, in _copyKey
    headers=headers)
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/bucket.py", line 888, in copy_key
    response.reason, body)
boto.exception.S3ResponseError: S3ResponseError: 400 Bad Request
<Error><Code>InvalidRequest</Code><Message>The specified copy source is larger than the maximum allowable size for a copy source: 5368709120</Message><RequestId>CCB50A66B881B563</RequestId><HostId>1dAVKee/hXEaLCvisKyxUHlWtLbZwqcedEXOsKVSzAoE5ISrsuykzg9teW9vzidQaIqImJxbNgI=</HostId></Error>
Is it really necessary to copy the file into the jobStore bucket?
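For reference, 5368709120 bytes is 5 GiB, the cap on a single S3 CopyObject request, so any server-side copy of a larger object has to be done as a multipart copy. Below is a minimal sketch of that workaround using boto3's managed copy (the traceback above uses boto 2's copy_key instead); the destination bucket and key are placeholders, and this is not the code path Toil itself uses.

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Source object from the public 1000 Genomes bucket (~13 GiB).
copy_source = {
    "Bucket": "1000genomes",
    "Key": "phase3/data/HG01977/sequence_read/SRR360135_1.filt.fastq.gz",
}

# boto3's managed copy switches to a multipart copy (UploadPartCopy) for
# objects above multipart_threshold, so the 5 GiB limit on a single
# CopyObject request never applies.
config = TransferConfig(multipart_chunksize=256 * 1024 ** 2)  # 256 MiB parts

# "my-jobstore-bucket" and the destination key are placeholders for
# wherever the imported file should end up.
s3.copy(copy_source, "my-jobstore-bucket",
        "imported/SRR360135_1.filt.fastq.gz", Config=config)
```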
Issue Analytics
- Created 7 years ago
- Comments: 6 (4 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
It would be a useful additional feature to be able to use public buckets directly without having to make a copy.
I agree, #819
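As a rough illustration of what "using public buckets directly" could look like (this is not an existing Toil feature; the bucket and key come from the issue above and the region is assumed), boto3 can stream a public object anonymously without making any copy:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) client: public buckets like 1000genomes can be
# read without credentials. Region is assumed to be us-east-1.
s3 = boto3.client("s3", region_name="us-east-1",
                  config=Config(signature_version=UNSIGNED))

obj = s3.get_object(
    Bucket="1000genomes",
    Key="phase3/data/HG01977/sequence_read/SRR360135_1.filt.fastq.gz",
)

# Stream the ~13 GiB object in chunks rather than copying it into
# another bucket first.
with open("SRR360135_1.filt.fastq.gz", "wb") as out:
    for chunk in obj["Body"].iter_chunks(chunk_size=8 * 1024 * 1024):
        out.write(chunk)
```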