Creating several partitions on a single file

tl;dr: Is it possible (or could it be made possible) to parallelize the processing of a single large file with dask?

Use case example: processing a Wikipedia articles dump.

I’m trying to reproduce a tokenization task from gensim with dask: http://radimrehurek.com/gensim/wiki.html#preparing-the-corpus

Convert the articles to plain text (process Wiki markup) and store the result as sparse TF-IDF vectors. In Python, this is easy to do on-the-fly and we don’t even need to uncompress the whole archive to disk. There is a script included in gensim that does just that, run:

$ python -m gensim.scripts.make_wiki

That script uses the class WikiCorpus, which uses multiprocessing in the following way:

https://github.com/piskvorky/gensim/blob/f1a904d61cdea8e44681342678a5eefd178d7b12/gensim/corpora/wikicorpus.py#L291-L308
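The linked code roughly follows a "read sequentially, process in chunks" pattern: one process pulls articles from the dump, groups them into chunks, and hands each chunk to a pool of workers. Below is a minimal standard-library sketch of that pattern, not gensim's actual implementation. The names `chunked`, `tokenize`, and `process_stream` are hypothetical, and `tokenize` is only a stand-in for the real per-article work.

```python
from itertools import islice


def chunked(iterable, size):
    """Lazily yield lists of up to `size` items from any iterable."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def tokenize(text):
    """Hypothetical stand-in for the per-article work (markup stripping, etc.)."""
    return text.lower().split()


def process_stream(texts, pool_map, chunk_size=100):
    """Read items sequentially and process each chunk through `pool_map`.

    `pool_map` is any map-like callable, for example the builtin `map`
    (serial) or the `imap` method of a `multiprocessing.Pool` (parallel).
    """
    for chunk in chunked(texts, chunk_size):
        # Only one chunk is held in memory at a time.
        yield from pool_map(tokenize, chunk)


# Serial usage with the builtin map; the input is a generator, never a full list.
articles = (f"Article number {i}" for i in range(5))
results = list(process_stream(articles, map, chunk_size=2))
```

To run the per-chunk work in parallel, pass `pool.imap` from a `multiprocessing.Pool` instead of `map`. Reading stays single-threaded either way, so this helps only when parsing or tokenizing, rather than decompression, dominates the run time.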

The file is currently ~11GB compressed.

Solving this could also help wikiextractor, a library for extracting wiki dumps: https://github.com/attardi/wikiextractor

There, they were also asking for help with multi-threading/multi-processing that same task: https://github.com/attardi/wikiextractor/issues/4

I’d like to tackle this problem with dask, but I don’t know how or where to start. Is it possible to create several bags after parsing a single file, similar to what gensim does in that script?

cc: @mrocklin

Issue Analytics

  • State: closed
  • Created 8 years ago
  • Comments: 17 (16 by maintainers)

Top GitHub Comments

chdoig commented, Jun 12, 2015

I had a good conversation about this with @eriknw.

Use case: I have a large file whose items are not line-delimited, but I know how to parse it, and I have a function that yields those items. I want an easy way to express: one worker takes x of those items and does some computation with them, another worker takes the next x items and does some computation with them, and so on.

Maybe something like:

# No dask: I create the extract_pages function
import bz2
texts = ((text, title, pageid) for title, text, pageid in extract_pages(bz2.BZ2File(fname)))

# dask (from_generator is the API being proposed here, not an existing dask.bag function)
import dask.bag as db
db.from_generator(texts).map(...).compute()

or something like

# No dask, provided by user, custom to file format
import gzip
import warc

def read_file(f):
    while True:
        record = f.read_record()
        if record is None:  # stop at end of file, assuming read_record() returns None there
            break
        yield record.payload.read()

# dask
import dask.bag as db
b = (db.from_sequence([fname])
     .map(lambda l: warc.WARCFile(fileobj=gzip.open(l))) # custom read your file
     .map(lambda f: read_file(f)) # custom parsing of your file
     .concat().take(100) # intent: chunk every 100 items (take(100) itself returns only the first 100)
)

# Do something with each partition
b.map(...).concat().compute()

To get the partitions, you could pass the count of items and the chunksize.
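One way to approximate the "count of items plus chunksize" idea with existing dask calls is to build one delayed task per chunk and combine them with `dask.bag.from_delayed`. The sketch below is an assumption-laden illustration rather than an established recipe: `read_records`, `load_chunk`, and `bag_from_file` are hypothetical names, and `read_records` simply yields non-empty lines where a real parser would yield pages or WARC payloads.

```python
from itertools import islice

import dask
import dask.bag as db


def read_records(path):
    """Hypothetical parser: yields one record per non-empty line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield line


@dask.delayed
def load_chunk(path, start, stop):
    # Each task re-reads the file from the top and keeps only its own slice.
    return list(islice(read_records(path), start, stop))


def bag_from_file(path, n_records, chunk_size):
    """Build a bag with one partition per `chunk_size` records."""
    parts = [
        load_chunk(path, start, min(start + chunk_size, n_records))
        for start in range(0, n_records, chunk_size)
    ]
    return db.from_delayed(parts)
```

A call such as `bag_from_file(fname, n_records=1000, chunk_size=100).map(...).compute()` would then run each partition as its own task. The trade-off is that every task parses the file up to its own offset, so the total parsing work grows with the number of partitions. This is only worthwhile when the per-record computation is much more expensive than parsing.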

Probably @eriknw can express this idea better than me.

timClicks commented, May 10, 2018

@chdoig did you ever manage to find a dask-friendly method of parsing large WARC files? I am looking into this to process Common Crawl data.
