Improve CoNLL reader to read large datasets
Is your feature request related to a problem? Please describe.
We are attempting to load a directory of CoNLL files totalling 9000+ files and 400+ MB using `CoNLL().readDataset(spark, dataset)`; however, this fails with OOM exceptions even on a driver with 24 GB of RAM available. The proposed workaround for loading this dataset all at once for training is the following:
> [R]ead the files one by one, and then write as parquet, and then read [all] at once
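A minimal sketch of that workaround, assuming Spark NLP's Scala API (`com.johnsnowlabs.nlp.training.CoNLL`) and hypothetical paths (`/data/conll` for the input directory, `/data/conll_parquet` for the intermediate parquet output). The idea is to materialize only one file's rows at a time and let parquet accumulate the combined dataset:

```scala
import com.johnsnowlabs.nlp.training.CoNLL
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("conll-to-parquet")
  .getOrCreate()

// Hypothetical paths; adjust to your environment.
val conllDir = new java.io.File("/data/conll")
val parquetPath = "/data/conll_parquet"

val conll = CoNLL()

// Read each CoNLL file on its own and append it to a single parquet dataset,
// so only one file's worth of annotated rows is held in memory at a time.
conllDir
  .listFiles()
  .filter(_.getName.endsWith(".conll"))
  .foreach { file =>
    conll
      .readDataset(spark, file.getAbsolutePath)
      .write
      .mode("append")
      .parquet(parquetPath)
  }

// Re-read the combined dataset once for training.
val trainingData = spark.read.parquet(parquetPath)
```

The same pattern works from PySpark; the key point is writing in append mode so the driver never has to hold the full 400+ MB of annotated rows at once.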
Describe the solution you’d like
I would like to avoid reading individual files into a DataFrame, writing the DataFrames to parquet, and then re-reading those files back into a DataFrame. This approach feels like a workaround that goes against the spirit of what the CoNLL reader should be able to do. It would be wonderful for the reader to have some extra params, or methods on the `CoNLL` class, that allow for efficiently loading larger CoNLL training sets.
Issue Analytics
- Created: 2 years ago
- Comments: 9 (5 by maintainers)
@ethnhll, @maziyarpanahi we have a candidate implementation in the healthcare library; we will share it with Ethan soon.
I don’t know if that’s possible; at some point the memory has to be enough for the file that is being processed. The changes here were to support multiple CoNLL files and to speed things up by caching and processing in parallel.
I can only recommend increasing the memory and, at the same time, breaking your large CoNLL file into smaller ones so they can be processed (at this point there is no other way).
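For the splitting step, here is a rough sketch in plain Scala (no Spark) that breaks one large CoNLL file into smaller chunk files on sentence boundaries, i.e. the blank lines between sentences. The paths and the sentences-per-chunk value are assumptions to adjust for your data and memory budget:

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

// Hypothetical paths and chunk size; adjust to your data and memory budget.
val largeFile = "/data/eng.train.conll"
val outDir = new File("/data/conll_chunks")
outDir.mkdirs()
val sentencesPerChunk = 5000

def newWriter(idx: Int): PrintWriter =
  new PrintWriter(new File(outDir, f"chunk_$idx%04d.conll"))

val source = Source.fromFile(largeFile)
try {
  var chunkIdx = 0
  var sentencesInChunk = 0
  var writer = newWriter(chunkIdx)
  for (line <- source.getLines()) {
    writer.println(line)
    // A blank line marks the end of a sentence in CoNLL format.
    if (line.trim.isEmpty) {
      sentencesInChunk += 1
      if (sentencesInChunk >= sentencesPerChunk) {
        // Start a new chunk file once the current one has enough sentences.
        writer.close()
        chunkIdx += 1
        sentencesInChunk = 0
        writer = newWriter(chunkIdx)
      }
    }
  }
  writer.close()
} finally {
  source.close()
}
```

Because every chunk starts and ends on a sentence boundary, the resulting files should remain valid input for `CoNLL().readDataset`.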