
Persist/checkpoint to disk

See original GitHub issue

Do we have some function to persist data to disk when not using a cluster? It would just be a small function that calls compute, writes the data to disk, and then loads it back again. I'm currently writing my own wrapper function.

ddf = ddf.checkpoint(to_parquet, filename...)

# do more work with ddf; computation is faster since ddf is persisted
a = ddf[...]

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

2 reactions
martindurant commented, Apr 30, 2019

This is tangentially related to the concept of persist in Intake, which has exactly one output file format for each type of data source. In dask, we could say that the canonical storage for dataframes is parquet and for arrays zarr… or something like that, but this is not a simple problem at all. It may be that Intake or some other pipeline-like system would be a good layer over dask to handle intermediate persistence.

1 reaction
martindurant commented, Apr 30, 2019
Read more comments on GitHub >

Top Results From Across the Web

What is the difference between spark checkpoint and persist to ...
Persist (MEMORY_AND_DISK) will store the data frame to disk and memory temporarily without breaking the lineage of the program, i.e. df.rdd...
Read more >
What is the difference between spark checkpoint ... - Intellipaat
Checkpointing stores the RDD in HDFS and deletes the lineage which created it. When we persist an RDD...
Read more >
Persist, Cache, Checkpoint in Apache Spark - LinkedIn
The best strategy is to restart from some checkpoint in case of failure. Checkpointing saves some stage of the RDD on disk and breaks...
Read more >
Checkpoint Deep Dive - Fugue Tutorials
Spark persist caches data into memory (or executor local disk); it does not break the lineage in Spark execution. It can't...
Read more >
16. Cache and checkpoint: enhancing Spark's performances
Spark offers two methods to leverage caching: cache() and persist(). The cache() method is really a synonym for the persist() method. However,...
Read more >
