Creating several partitions on a single file
tl;dr: Is it possible (or could it be made possible) to parallelize the processing of a single large file with dask?
Use case example: processing a Wikipedia articles dump.
I’m trying to reproduce a tokenization task from gensim with dask: http://radimrehurek.com/gensim/wiki.html#preparing-the-corpus
Convert the articles to plain text (process Wiki markup) and store the result as sparse TF-IDF vectors. In Python, this is easy to do on-the-fly and we don’t even need to uncompress the whole archive to disk. There is a script included in gensim that does just that, run:
$ python -m gensim.scripts.make_wiki
That script uses the class WikiCorpus, which uses multiprocessing the following way:
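A minimal sketch of the general pattern, not gensim's actual code: stream items from the file, group them into fixed-size chunks, and map each chunk over a pool of workers. The names `chunked`, `tokenize` and `process_stream` are made up for illustration, and `ThreadPool` stands in for `multiprocessing.Pool` (which has the same `imap` interface) so that the sketch runs without a `__main__` guard.

```python
from itertools import islice
from multiprocessing.pool import ThreadPool


def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def tokenize(article):
    """Placeholder for the real per-article work (stripping markup, etc.)."""
    return article.lower().split()


def process_stream(articles, workers=4, chunksize=8):
    """Tokenize a stream of articles chunk by chunk, preserving order."""
    with ThreadPool(workers) as pool:
        for chunk in chunked(articles, chunksize):
            # Only one chunk of raw articles is held in memory at a time.
            for tokens in pool.imap(tokenize, chunk):
                yield tokens
```

Because the file is read by a single producer and only the per-item work is spread across workers, this helps only when that work, rather than decompression and parsing, is the bottleneck.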
The file is currently ~11GB compressed.
Solving this could also help wikiextractor, a library for extracting wiki dumps: https://github.com/attardi/wikiextractor
where they were also asking for help with multi-threading/multi-processing that same task: https://github.com/attardi/wikiextractor/issues/4
I’d like to tackle this problem with dask, but I don’t know how or where to start. Is it possible to create several bags after parsing a single file, similar to what gensim does in that script?
cc: @mrocklin
Top GitHub Comments
I had a good conversation about this with @eriknw.
Use case: I have a large file that I know how to parse into items that are not line-delimited. I have a function that yields those items. I want an easy way to express this: one worker takes x of those items and does some computation with them, another worker takes the next x items and does some computation with them, and so on.
Maybe something like:
or something like
To get the partitions, you could pass the count of items and the chunksize.
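One way to read the count-and-chunksize idea, as a sketch under stated assumptions rather than an existing dask feature: turn the item count and chunksize into `(start, stop)` index pairs, and let each partition re-run the parser and keep only its own slice. `iter_items` is a hypothetical parser (here, blank-line-separated records), and `partition_bounds`, `load_partition` and `bag_from_file` are illustrative names. Only `dask.delayed` and `dask.bag.from_delayed` are real dask calls, and they are imported inside `bag_from_file` so the other helpers work without dask installed.

```python
from itertools import islice


def iter_items(path):
    """Hypothetical parser: yield one item per blank-line-separated record."""
    record = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                record.append(line.rstrip("\n"))
            elif record:
                yield "\n".join(record)
                record = []
    if record:
        yield "\n".join(record)


def partition_bounds(count, chunksize):
    """Return (start, stop) index pairs that together cover range(count)."""
    return [(start, min(start + chunksize, count))
            for start in range(0, count, chunksize)]


def load_partition(path, start, stop):
    """Re-run the parser and keep only items start .. stop - 1."""
    return list(islice(iter_items(path), start, stop))


def bag_from_file(path, count, chunksize):
    """Build a dask bag with one partition per (start, stop) pair."""
    import dask
    import dask.bag as db

    parts = [dask.delayed(load_partition)(path, start, stop)
             for start, stop in partition_bounds(count, chunksize)]
    return db.from_delayed(parts)
```

The trade-off is that every partition parses the file from the beginning up to its `stop` index, so later partitions repeat most of the parsing work. This only pays off when the per-item computation is much more expensive than parsing, or when the parser can seek directly to a known offset.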
Probably @eriknw can express this idea better than me.
@chdoig did you ever manage to find a dask-friendly method of parsing large WARC files? I am looking into this to process Common Crawl data.