
General performance optimizations


Tasks

For the past couple of weeks I’ve been investigating datashader’s performance and how we can improve it. I’m now documenting my remaining tasks in case I get pulled away on a different project. Below is a list of the tasks/issues I’m currently addressing:

  • Extend the filetimes.py and filetimes.yml benchmarking environment to find the optimal file format for datashader/dask (Issue #129)
  • Benchmark numba compared to handwritten ufuncs in vaex (Issue #310)
  • Gather perf information about dask locking behavior (Issue #314)
  • Investigate why Cachey leads to better runtime performance for repeat datashader aggregations
  • Document memory usage findings (Issue #305)
  • Investigate how datashader’s performance changes with data types (doubles vs floats, etc) (Issue #305)
  • Verify that repeat aggregations no longer depend on file format (Issue #129)
  • Investigate distributed scheduler vs threaded scheduler for single-machine use case (#331, #332, #334)
  • Identify issues hindering the distributed scheduler from performing more effectively - credit goes to @martindurant (#332, #336, #337)

Performance takeaways

Below are some performance-related takeaways that fell out of my experiments and optimizations with datashader and dask:

General

  • Use the latest version of numba (>=0.33). It includes bug fixes that provide ~3-5x speedups in many cases (numba/numba#2345, numba/numba#2349, numba/numba#2350)

  • When interacting with data on the filesystem, store it in the Apache Parquet format when possible. Use Snappy compression when writing out Parquet files, and convert columns to categorical dtypes (when possible) before writing, since Parquet supports categoricals in its binary format (#129)
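
    As a rough sketch (mine, not from the original issue), writing a dask dataframe this way might look like the following; the dataframe and column names are hypothetical:

    # convert a low-cardinality column to a categorical before writing,
    # so Parquet stores it in its native categorical encoding
    dask_df['name'] = dask_df['name'].astype('category')
    dask_df.to_parquet('data.parq', compression='snappy')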

  • Use the categorical dtype for columns whose data takes on a limited, fixed set of possible values. Categorical columns use a more memory-efficient data representation and are optimized for common operations such as sorting and finding unique values. Example of how to convert a column to the categorical dtype:

    df[colname] = df[colname].astype('category')
    
  • There is promise in enhancing datashader’s performance even further by using single-precision floats (np.float32) instead of double-precision floats (np.float64). In past experiments this cut down both the time to load data off of disk (assuming the data was written out in single precision) and datashader’s aggregation times. Take care with this approach, as single precision (in any software application, not just datashader) produces different numerical results than double precision (#305)
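
    A minimal sketch of the conversion (the column names here are hypothetical):

    import numpy as np

    # halve the per-element memory footprint; expect roughly 7 significant
    # decimal digits instead of ~16
    df['x'] = df['x'].astype(np.float32)
    df['y'] = df['y'].astype(np.float32)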

  • When using pandas dataframes, there will be a speedup if you cache the cvs.x_range and cvs.y_range values and pass them back into the Canvas() constructor for future instantiations. As of #344, dask dataframes automatically memoize the x_range and y_range calculations; this works for dask because dask dataframes are immutable, unlike pandas dataframes (#129)
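
    A sketch of the pattern (mine, not from the issue): the ranges are computed once up front and reused, standing in for the cached cvs.x_range/cvs.y_range values; the column names and plot dimensions are illustrative:

    import datashader

    # compute the data extents once...
    x_range = (float(df['x'].min()), float(df['x'].max()))
    y_range = (float(df['y'].min()), float(df['y'].max()))

    # ...and pass them into each Canvas so they are not recomputed per aggregation
    cvs = datashader.Canvas(plot_width=900, plot_height=600,
                            x_range=x_range, y_range=y_range)
    agg = cvs.points(df, 'x', 'y')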

Single machine

  • A rule of thumb for the number of partitions to use when converting pandas dataframes into dask dataframes is multiprocessing.cpu_count(). This allows dask to use one thread per core for parallelizing computations (#129)

  • When the entire dataset fits into memory at once, persist the dataframe as a dask dataframe prior to passing it into datashader (#129). One example of how to do this:

    from dask import dataframe as dd
    import datashader
    import multiprocessing

    # one partition per core; .persist() loads the partitions into memory up front
    dask_df = dd.from_pandas(df, npartitions=multiprocessing.cpu_count()).persist()
    ...
    cvs = datashader.Canvas(...)
    agg = cvs.points(dask_df, ...)
    
  • When the entire dataset doesn’t fit into memory at once, use the distributed scheduler (#331) without persisting (there is an outstanding issue, #332, that illustrates the problem with combining the distributed scheduler and persist).
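
    A minimal sketch (mine, not from the issue) of enabling the distributed scheduler on a single machine; creating a dask.distributed Client makes it the default scheduler:

    from dask.distributed import Client

    # spins up a local cluster of worker processes; note that no .persist()
    # follows, so partitions are read lazily rather than pinned in RAM
    client = Client()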

Multiple machines

  • Use the distributed scheduler to farm computations out to remote machines. client.persist(dask_df) may help in certain cases, but be sure to call distributed.wait() to block until the data has been loaded into RAM on each worker (#332)
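
    A sketch of that pattern (the scheduler address is a placeholder; Client, persist, and wait are part of the dask.distributed API):

    from dask.distributed import Client, wait

    client = Client('scheduler-host:8786')  # address of the running dask-scheduler
    dask_df = client.persist(dask_df)       # begin loading partitions on the workers
    wait(dask_df)                           # block until the data is in RAM on each worker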


Top GitHub Comments

gbrener commented, Apr 20, 2017

Benchmarking with numba exposed a bug affecting performance under multithreaded workloads. Once it is fixed, there should be a significant performance increase for datashader (at least 3x in many cases): https://github.com/numba/numba/issues/2345

jbednar commented, May 8, 2017

Thanks for all the great work, @gbrener! Reflecting these recommendations into our documentation is now on our to-do list.
