Improve CoNLL reader to read large datasets
Is your feature request related to a problem? Please describe.
We are attempting to load a directory of CoNLL files totalling 9000+ files and 400+ MB using `CoNLL().readDataset(spark, dataset)`; however, this fails with OOM exceptions even on a driver with 24 GB of RAM available. The proposed workaround for loading this dataset all at once for training is the following:
> [R]ead the files one by one, and then write as parquet, and then read [all] at once
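A minimal sketch of that workaround, assuming Spark NLP's Scala API (`com.johnsnowlabs.nlp.training.CoNLL`) and hypothetical paths (`/data/conll` for the input directory, `/data/conll_parquet` for the intermediate parquet output). The idea is to materialize only one file's rows at a time and let parquet accumulate the combined dataset:

```scala
import com.johnsnowlabs.nlp.training.CoNLL
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("conll-to-parquet")
  .getOrCreate()

// Hypothetical paths; adjust to your environment.
val conllDir = new java.io.File("/data/conll")
val parquetPath = "/data/conll_parquet"

val conll = CoNLL()

// Read each CoNLL file on its own and append it to a single parquet dataset,
// so only one file's worth of annotated rows is held in memory at a time.
conllDir
  .listFiles()
  .filter(_.getName.endsWith(".conll"))
  .foreach { file =>
    conll
      .readDataset(spark, file.getAbsolutePath)
      .write
      .mode("append")
      .parquet(parquetPath)
  }

// Re-read the combined dataset once for training.
val trainingData = spark.read.parquet(parquetPath)
```

The same pattern works from PySpark; the key point is writing in append mode so the driver never has to hold the full 400+ MB of annotated rows at once.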
Describe the solution you’d like
I would like to avoid reading individual files into a DataFrame, writing the DataFrames to parquet, and then re-reading those files back into a DataFrame. This approach feels like a workaround that goes against the spirit of what the CoNLL reader should be able to do. It would be wonderful for the reader to have some extra params, or methods on the `CoNLL` class, that allow for efficiently loading larger CoNLL training sets.
Issue Analytics
- Created: 2 years ago
- Comments: 9 (5 by maintainers)
@ethnhll, @maziyarpanahi we have a candidate implementation in the healthcare library; we will share it with Ethan soon.
I don’t know if that’s possible; at some point the memory has to be enough for the file that is being processed. The changes here were to support multiple CoNLL files and to speed things up by caching and processing in parallel.
I can only recommend increasing the memory and, at the same time, breaking your large CoNLL file into smaller ones so they can be processed (at this point there is no other way).
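For the splitting step, here is a rough sketch in plain Scala (no Spark) that breaks one large CoNLL file into smaller chunk files on sentence boundaries, i.e. the blank lines between sentences. The paths and the sentences-per-chunk value are assumptions to adjust for your data and memory budget:

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

// Hypothetical paths and chunk size; adjust to your data and memory budget.
val largeFile = "/data/eng.train.conll"
val outDir = new File("/data/conll_chunks")
outDir.mkdirs()
val sentencesPerChunk = 5000

def newWriter(idx: Int): PrintWriter =
  new PrintWriter(new File(outDir, f"chunk_$idx%04d.conll"))

val source = Source.fromFile(largeFile)
try {
  var chunkIdx = 0
  var sentencesInChunk = 0
  var writer = newWriter(chunkIdx)
  for (line <- source.getLines()) {
    writer.println(line)
    // A blank line marks the end of a sentence in CoNLL format.
    if (line.trim.isEmpty) {
      sentencesInChunk += 1
      if (sentencesInChunk >= sentencesPerChunk) {
        // Start a new chunk file once the current one has enough sentences.
        writer.close()
        chunkIdx += 1
        sentencesInChunk = 0
        writer = newWriter(chunkIdx)
      }
    }
  }
  writer.close()
} finally {
  source.close()
}
```

Because every chunk starts and ends on a sentence boundary, the resulting files should remain valid input for `CoNLL().readDataset`.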