
Improve CoNLL reader to read large datasets

See original GitHub issue

Is your feature request related to a problem? Please describe.
We are attempting to load a directory of CoNLL files totaling 9,000+ files and 400+ MB using CoNLL().readDataset(spark, dataset); however, this fails with OOM exceptions even on a driver with 24 GB of RAM available. The proposed workaround for loading this dataset all at once for training is the following:

[R]ead the files one by one, and then write as parquet, and then read [all] at once

Describe the solution you’d like
I would like to avoid reading individual files into a DataFrame, writing those DataFrames to parquet, and then re-reading the parquet files back into a single DataFrame. That approach feels like a workaround that runs against the spirit of what the CoNLL reader should be able to do. It would be wonderful for the reader to have some extra params, or methods on the CoNLL class, that allow larger CoNLL training sets to be loaded efficiently.
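
For concreteness, the quoted workaround might look something like the following minimal sketch, assuming the Spark NLP Python API (sparknlp.training.CoNLL) and hypothetical paths:

    import glob
    import sparknlp
    from sparknlp.training import CoNLL

    spark = sparknlp.start()  # or reuse an existing SparkSession

    conll_dir = "/data/conll"            # hypothetical directory of .conll files
    parquet_dir = "/data/conll_parquet"  # hypothetical parquet staging area

    # Read the files one by one so only a single file is materialized
    # at a time, appending each result to the parquet staging area.
    for path in sorted(glob.glob(f"{conll_dir}/*.conll")):
        df = CoNLL().readDataset(spark, path)
        df.write.mode("append").parquet(parquet_dir)

    # Read the whole training set back in one pass; parquet is read
    # lazily and in parallel, which avoids the driver-side OOM.
    training_data = spark.read.parquet(parquet_dir)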

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
albertoandreottiATgmail commented, Nov 12, 2021

@ethnhll, @maziyarpanahi we have a candidate implementation in the healthcare library; will share with Ethan soon.

0 reactions
maziyarpanahi commented, May 16, 2022

I don’t know if that’s possible; at some point, the memory has to be enough for the file being processed. The changes here were to support multiple CoNLL files and to speed things up by caching and processing in parallel.

I can only recommend increasing the memory and, at the same time, breaking your large CoNLL file into smaller ones so they can be processed (at this point there is no other way).
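
To illustrate that suggestion, here is a minimal sketch (plain Python, hypothetical file names) that splits one large CoNLL file into smaller chunks on the blank lines that conventionally separate sentences in CoNLL files:

    import os

    src = "large.conll"         # hypothetical input file
    out_dir = "conll_chunks"    # hypothetical output directory
    sentences_per_chunk = 5000  # tune this to fit your available memory

    os.makedirs(out_dir, exist_ok=True)

    def flush(lines, part):
        # Write the accumulated lines out as one smaller CoNLL file.
        path = os.path.join(out_dir, f"part-{part:05d}.conll")
        with open(path, "w", encoding="utf-8") as out:
            out.writelines(lines)

    chunk, sentences, part = [], 0, 0
    with open(src, encoding="utf-8") as f:
        for line in f:
            chunk.append(line)
            if not line.strip():  # a blank line ends a sentence
                sentences += 1
                if sentences == sentences_per_chunk:
                    flush(chunk, part)
                    chunk, sentences, part = [], 0, part + 1
    if chunk:  # flush the final partial chunk
        flush(chunk, part)

Depending on the reader, each chunk may also need its own -DOCSTART- header line; check what your pipeline expects before training on the splits.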

Read more comments on GitHub.

