
Crash with large datasets


It seems that large datasets overflow the message size field used by “Pool”. I ran cooler cload successfully with small datasets, but with several billion contacts:

cooler cload tabix binfiles/bin_1kb.bed dataset.scf.bgz dataset.cool

This is 1 kb binning with > 1 billion contacts in dataset.scf, and I get the following error:

INFO:cooler:Using 8 cores
/usr/local/lib/python2.7/dist-packages/cooler/_reader.py:232: UserWarning: NOTE: When using the Tabix aggregator, make sure the order of chromosomes in the provided chromsizes agrees with the chromosome ordering of read ends in the contact list file.
  "NOTE: When using the Tabix aggregator, make sure the order of "
INFO:cooler:Creating cooler at "/scratch/dataset.cool::/"
INFO:cooler:Writing chroms
INFO:cooler:Writing bins
INFO:cooler:Writing pixels
INFO:cooler:chrM
INFO:cooler:chrY
INFO:cooler:chr21
INFO:cooler:chr22
INFO:cooler:chrX
INFO:cooler:chr19
INFO:cooler:chr20
INFO:cooler:chr18
INFO:cooler:chr9
INFO:cooler:chr17
INFO:cooler:chr16
INFO:cooler:chr15
INFO:cooler:chr13
INFO:cooler:chr14
INFO:cooler:chr8
INFO:cooler:chr7
INFO:cooler:chr6
INFO:cooler:chr5
INFO:cooler:chr12
INFO:cooler:chr4
INFO:cooler:chr11
Traceback (most recent call last):
  File "/usr/local/bin/cooler", line 11, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/cooler/cli/cload.py", line 186, in tabix
    create(cool_path, chromsizes, bins, iterator, metadata, assembly)
  File "/usr/local/lib/python2.7/dist-packages/cooler/io.py", line 164, in create
    filepath, target, n_bins, iterator, h5opts, lock)
  File "/usr/local/lib/python2.7/dist-packages/cooler/_writer.py", line 174, in write_pixels
    for chunk in iterator:
  File "/usr/local/lib/python2.7/dist-packages/cooler/_reader.py", line 313, in __iter__
    for df in self._map(self.aggregate, chroms):
  File "/usr/local/lib/python2.7/dist-packages/multiprocess/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/local/lib/python2.7/dist-packages/multiprocess/pool.py", line 567, in get
    raise self._value
multiprocess.pool.MaybeEncodingError: Error sending result: '[     bin1_id  bin2_id  count
0     384852  1127363      1
0     384858   977744      1
2     384858  1890814      1
1     384858  1977029      1
2     384862  1685662      1
1     384862  2310165      2
3     384862  2603767      1
0     384862  2892389      1
0     384876  2196724      1
0     384887  1889333      1
1     384889  1079339      1
0     384889  2820833      1
0     384890  2308658      1
0     384894  1879589      1
0     384896  2261960      1
20    384899   417637      1
27    384899   528891      1
24    384899   943862      1
3     384899  1228810      1
5     384899  1241744      1
28    384899  1297021      1
26    384899  1485562      1
1     384899  1530749      1
22    384899  1530863      2
2     384899  1531657      1
4     384899  1539596      1
21    384899  1551590      1
6     384899  1662865      1
18    384899  1664090      1
10    384899  1668517      1
..       ...      ...    ...
43    519732  2160311      1
57    519732  2175747      1
74    519732  2186566      1
107   519732  2199037      1
26    519732  2217834      1
64    519732  2220820      1
82    519732  2254178      1
83    519732  2261352      1
44    519732  2272952      1
2     519732  2302728      1
23    519732  2356330      1
0     519732  2373634      1
106   519732  2396659      1
69    519732  2418477      1
6     519732  2435624      1
59    519732  2435848      1
50    519732  2518239      1
98    519732  2550221      1
38    519732  2558119      1
21    519732  2559581      1
66    519732  2596125      1
70    519732  2616625      1
24    519732  2626677      1
37    519732  2724006      1
101   519732  2737631      1
41    519732  2772148      1
13    519732  2838601      1
11    519732  2862142      1
71    519732  2868541      2
72    519732  2922302      1

[68481974 rows x 3 columns]]'. Reason: 'IOError('bad message length',)'

Probably an overflow of the message size field used by Pool…?
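
For what it’s worth, here is a minimal sketch of what I suspect is going on, assuming the Python 2.7-era multiprocessing/multiprocess pipe protocol that prefixes every message with a signed 32-bit length (the names below are placeholders, not cooler code):

import multiprocessing as mp

def huge_result(_):
    # roughly 2.2 GB of payload; its pickle exceeds 2**31 - 1 bytes, so
    # sending it back over the pool's result pipe should fail with the same
    # "bad message length" IOError as above (assumption: Python 2.7-style
    # 32-bit message length limit)
    return b"x" * (2200 * 1024 * 1024)

if __name__ == "__main__":
    pool = mp.Pool(2)
    try:
        pool.map(huge_result, range(2))  # fails while workers return results
    finally:
        pool.close()
        pool.join()

The 68,481,974-row chunk in the traceback is roughly 68.5 M rows x 3 int64 columns plus an index, i.e. on the order of 2 GB before pickling, so it is plausibly just over that limit.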

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 11 (7 by maintainers)

Top GitHub Comments

2 reactions
ezorita commented, May 8, 2017

Yes, it seems to be working fine now. At least the cload process exits without errors and the output files seem to be consistent – I’ve tried to read a few small chunks of data from the Python API.
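
Roughly the kind of spot check I mean (the file path and region are placeholders):

import cooler

c = cooler.Cooler("dataset.cool")
print(c.chromnames)                            # chromosome order stored in the file
print(c.bins()[:5])                            # first few 1 kb bins
print(c.pixels()[:5])                          # first few nonzero pixels
mat = c.matrix(balance=False).fetch("chr21")   # small dense chunk of raw counts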

1 reaction
ezorita commented, May 8, 2017

That might be the way to go, too… Yet, still a good idea to control the chunk sizes 😉

Indeed, especially to balance the load across threads. If you expect your multithreading jobs to be very unbalanced, like a size(chr1)/size(chr21) ~ 5 ratio, as a rule of thumb I’d make at least 5 times more jobs than concurrent threads with a fair scheduler.
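
Something like this hypothetical sketch, just to illustrate the rule of thumb (chromsizes, split_spans and process_span are made-up names, not cooler’s API):

from multiprocessing import Pool

def split_spans(chromsizes, n_jobs):
    # cut each chromosome into windows so the total number of spans is at
    # least n_jobs; large chromosomes yield many windows, small ones few
    step = max(1, sum(chromsizes.values()) // n_jobs)
    spans = []
    for chrom, size in chromsizes.items():
        for start in range(0, size, step):
            spans.append((chrom, start, min(start + step, size)))
    return spans

def process_span(span):
    chrom, start, end = span
    # ... aggregate the contacts whose first read end falls in
    # chrom:[start, end) and return a small chunk of pixels ...
    return (chrom, start, end)

if __name__ == "__main__":
    chromsizes = {"chr1": 249250621, "chr21": 48129895}  # hg19 sizes, for illustration
    n_workers = 8
    spans = split_spans(chromsizes, n_jobs=n_workers * 5)  # >= 5x more jobs than workers
    pool = Pool(n_workers)
    try:
        # imap_unordered hands a new span to whichever worker frees up first,
        # which evens out the chr1-vs-chr21 imbalance
        for chunk in pool.imap_unordered(process_span, spans):
            pass  # write each small chunk to the output as it arrives
    finally:
        pool.close()
        pool.join()

Smaller spans would also keep each returned chunk well under the 2 GB message limit, which addresses the original crash as well.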

