Crash with large datasets
It seems that large datasets overflow the message-size field used by "Pool". I ran cooler cload successfully with small datasets, but with several billion contacts:
cooler cload tabix binfiles/bin_1kb.bed dataset.scf.bgz dataset.cool
This is 1 kb binning, with > 1 billion contacts in dataset.scf, and I get the following error:
INFO:cooler:Using 8 cores
/usr/local/lib/python2.7/dist-packages/cooler/_reader.py:232: UserWarning: NOTE: When using the Tabix aggregator, make sure the order of chromosomes in the provided chromsizes agrees with the chromosome ordering of read ends in the contact list file.
"NOTE: When using the Tabix aggregator, make sure the order of "
INFO:cooler:Creating cooler at "/scratch/dataset.cool::/"
INFO:cooler:Writing chroms
INFO:cooler:Writing bins
INFO:cooler:Writing pixels
INFO:cooler:chrM
INFO:cooler:chrY
INFO:cooler:chr21
INFO:cooler:chr22
INFO:cooler:chrX
INFO:cooler:chr19
INFO:cooler:chr20
INFO:cooler:chr18
INFO:cooler:chr9
INFO:cooler:chr17
INFO:cooler:chr16
INFO:cooler:chr15
INFO:cooler:chr13
INFO:cooler:chr14
INFO:cooler:chr8
INFO:cooler:chr7
INFO:cooler:chr6
INFO:cooler:chr5
INFO:cooler:chr12
INFO:cooler:chr4
INFO:cooler:chr11
Traceback (most recent call last):
  File "/usr/local/bin/cooler", line 11, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/cooler/cli/cload.py", line 186, in tabix
    create(cool_path, chromsizes, bins, iterator, metadata, assembly)
  File "/usr/local/lib/python2.7/dist-packages/cooler/io.py", line 164, in create
    filepath, target, n_bins, iterator, h5opts, lock)
  File "/usr/local/lib/python2.7/dist-packages/cooler/_writer.py", line 174, in write_pixels
    for chunk in iterator:
  File "/usr/local/lib/python2.7/dist-packages/cooler/_reader.py", line 313, in __iter__
    for df in self._map(self.aggregate, chroms):
  File "/usr/local/lib/python2.7/dist-packages/multiprocess/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/local/lib/python2.7/dist-packages/multiprocess/pool.py", line 567, in get
    raise self._value
multiprocess.pool.MaybeEncodingError: Error sending result: '[ bin1_id bin2_id count
0 384852 1127363 1
0 384858 977744 1
2 384858 1890814 1
1 384858 1977029 1
2 384862 1685662 1
1 384862 2310165 2
3 384862 2603767 1
0 384862 2892389 1
0 384876 2196724 1
0 384887 1889333 1
1 384889 1079339 1
0 384889 2820833 1
0 384890 2308658 1
0 384894 1879589 1
0 384896 2261960 1
20 384899 417637 1
27 384899 528891 1
24 384899 943862 1
3 384899 1228810 1
5 384899 1241744 1
28 384899 1297021 1
26 384899 1485562 1
1 384899 1530749 1
22 384899 1530863 2
2 384899 1531657 1
4 384899 1539596 1
21 384899 1551590 1
6 384899 1662865 1
18 384899 1664090 1
10 384899 1668517 1
.. ... ... ...
43 519732 2160311 1
57 519732 2175747 1
74 519732 2186566 1
107 519732 2199037 1
26 519732 2217834 1
64 519732 2220820 1
82 519732 2254178 1
83 519732 2261352 1
44 519732 2272952 1
2 519732 2302728 1
23 519732 2356330 1
0 519732 2373634 1
106 519732 2396659 1
69 519732 2418477 1
6 519732 2435624 1
59 519732 2435848 1
50 519732 2518239 1
98 519732 2550221 1
38 519732 2558119 1
21 519732 2559581 1
66 519732 2596125 1
70 519732 2616625 1
24 519732 2626677 1
37 519732 2724006 1
101 519732 2737631 1
41 519732 2772148 1
13 519732 2838601 1
11 519732 2862142 1
71 519732 2868541 2
72 519732 2922302 1
[68481974 rows x 3 columns]]'. Reason: 'IOError('bad message length',)'
Probably an overflow of the message-size field used by Pool when it sends the pickled result back?
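The "bad message length" IOError is consistent with that guess: multiprocessing frames each pickled worker result with a fixed-width length field, which on many platforms was a 32-bit signed integer, so a single pickled payload larger than roughly 2 GiB cannot be sent back through the pipe. An aggregated per-chromosome DataFrame with ~68 million rows can cross that limit. A minimal sketch of a workaround (illustrative only, not cooler's actual fix) is to split a worker's result into pieces that each pickle below the limit before sending:

```python
import pickle

# multiprocessing historically framed pipe messages with a 32-bit signed
# length, so a pickled result above ~2 GiB (2**31 - 1 bytes) failed with
# "bad message length". Exact behavior varies by Python version/platform;
# this limit is used here purely for illustration.
MAX_MSG = 2**31 - 1

def split_result(rows, max_bytes=MAX_MSG):
    """Yield sub-lists of `rows` whose pickled size stays under max_bytes.

    Hypothetical helper: a worker would yield these pieces (or write them
    to disk) instead of returning one huge object through the Pool.
    """
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(pickle.dumps(chunk)) >= max_bytes:
            # The last row pushed us over the limit: emit the chunk
            # without it and start a fresh chunk seeded with that row.
            chunk.pop()
            yield chunk
            chunk = [row]
    if chunk:
        yield chunk
```

Re-pickling the growing chunk on every append is quadratic and only acceptable for a sketch; a real implementation would track an estimated size or split by row count.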
Issue Analytics
- Created 6 years ago
- Comments: 11 (7 by maintainers)
Top GitHub Comments
Yes, it seems to be working fine now. At least the cload process exits without errors and the output files seem to be consistent – I’ve tried to read a few small chunks of data from the Python API.
Indeed, especially to balance the thread loading. If you expect your multithreading jobs to be very unbalanced, e.g. a size(chr1)/size(chr21) ratio of about 5, then as a rule of thumb I'd create at least 5 times as many jobs as concurrent threads under a fair scheduler.
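That rule of thumb can be sketched as a small job splitter. All names here are hypothetical (cooler's own scheduling works differently): given chromosome lengths, it cuts each chromosome into spans sized so that the total number of jobs is at least `oversubscribe` times the worker count, letting a fair scheduler even out the chr1-vs-chr21 imbalance:

```python
def make_jobs(chrom_sizes, n_workers, oversubscribe=5):
    """Split chromosomes into roughly equal genomic spans.

    chrom_sizes: dict mapping chromosome name -> length in bp.
    Returns a list of (chrom, start, end) tasks whose count is at least
    oversubscribe * n_workers, so long chromosomes become several jobs
    instead of one oversized job.
    """
    total = sum(chrom_sizes.values())
    # Target span length that yields ~oversubscribe jobs per worker.
    target = max(1, total // (oversubscribe * n_workers))
    jobs = []
    for chrom, size in chrom_sizes.items():
        for start in range(0, size, target):
            jobs.append((chrom, start, min(start + target, size)))
    return jobs
```

Each task would then be mapped over the Pool (e.g. with `imap_unordered`), so idle workers pick up remaining spans of the large chromosomes while others finish the small ones.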