Getting data out of a dataframe can be slow (toArray, toPairs, etc)


Hi, this is related to https://github.com/data-forge/data-forge-ts/issues/11. I’m opening a new issue since I don’t have permission to reopen the previous one, and I’m not sure whether the problem is in toPairs or in the use of fillGaps and rollingWindow. The issue is very slow performance from toPairs. What follows is copied from my comment on the other issue.

I just pushed up a change to our test repo. The changes are:

  1. Update package.json to use the most recent version of data-forge
  2. Slightly change the reported timings to make it clearer where the performance issue happens. Specifically, the slowdown looks like it’s coming from the call to toPairs().

The tests I’m looking at are method-1.js and method-2.js. The only difference between them is:

$ diff method-1.js method-2.js
76c76
< const mySeries = dfWithoutGaps.getSeries('value');
---
> const mySeries = new dataForge.Series(dfWithoutGaps.getSeries('value').toArray());

Output from running the tests:

cberthiaume@slow-lane:~/data-forge-performance-test-issue-11$ node method-1.js
Time to require: 980.9740000000002
Time to create DataFrame and getSeries: 8.874000000000024
Time for rolling window: 0.08599999999978536
Time for toPairs: 2067.8320000000003
cberthiaume@slow-lane:~/data-forge-performance-test-issue-11$ node method-2.js
Time to require: 975.4
Time to create DataFrame and getSeries: 63.06100000000015
Time for rolling window: 0.10500000000001819
Time for toPairs: 17.16599999999994
cberthiaume@slow-lane:~/data-forge-performance-test-issue-11$
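
The figures above read like millisecond timings. As a minimal sketch of how per-step timings of this kind could be collected in Node, assuming the global `performance.now()` timer is available (recent Node releases); `timeStep` is a hypothetical helper and is not taken from the test repo:

```javascript
// Hypothetical helper: runs fn once, logs the elapsed milliseconds,
// and returns both the result and the measured time.
function timeStep(label, fn) {
  const start = performance.now();
  const result = fn();
  const elapsedMs = performance.now() - start;
  console.log(`Time for ${label}: ${elapsedMs}`);
  return { result, elapsedMs };
}

// Usage with a stand-in workload (not a data-forge call).
const timed = timeStep('summing 1..1000', () => {
  let total = 0;
  for (let i = 1; i <= 1000; i++) total += i;
  return total;
});
```

Wrapping each step separately, as the test scripts appear to do, is what makes it possible to attribute the cost to a single call rather than to the script as a whole.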

The key difference is the time taken by toPairs(): about 2068 ms in method-1.js versus about 17 ms in method-2.js. Our use case requires us to call toPairs(), and the only way to get acceptable performance is to recreate the series as shown in the diff above. However, our needs have changed, and the slowdown introduced by this workaround is becoming a bottleneck. Is there a better way to get good performance from toPairs() without using this workaround? Should I open a separate ticket to track this?

Thanks again for all your help.

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 22 (15 by maintainers)

Top GitHub Comments

ashleydavis commented, Mar 20, 2019 (1 reaction)

Thanks for continuing to give feedback. I’ll look at this soon.

ashleydavis commented, Sep 9, 2019 (0 reactions)

Probably because it’s going through a JavaScript iterator.
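
One way to picture this comment, as a toy model only and not data-forge's actual internals: a lazily evaluated sequence re-runs its pipeline every time it is iterated, while an array built from it pays that cost once. The names `lazyMap`, `pairsOf`, and the `calls` counter are hypothetical and exist only for illustration.

```javascript
let calls = 0; // counts how often the per-element transform runs

// Hypothetical lazy sequence: nothing runs until it is iterated,
// and every new iteration re-runs the transform from scratch.
function lazyMap(source, fn) {
  return {
    *[Symbol.iterator]() {
      for (const item of source) yield fn(item);
    },
  };
}

// Build [index, value] pairs by walking any iterable.
function pairsOf(iterable) {
  const pairs = [];
  let index = 0;
  for (const value of iterable) pairs.push([index++, value]);
  return pairs;
}

const lazy = lazyMap([1, 2, 3, 4], (x) => { calls++; return x * 10; });

const lazyPairsA = pairsOf(lazy);
const lazyPairsB = pairsOf(lazy);         // the transform runs a second time
const callsAfterLazy = calls;

const materialized = Array.from(lazy);    // pays the transform cost once more
const callsAfterMaterialize = calls;
const arrayPairs = pairsOf(materialized); // plain array walk, no transform
const callsAfterArrayPairs = calls;
```

In this model the counter grows on every pass over the lazy sequence but stays flat once the values live in an array, which is the same shape as the workaround of rebuilding the series from toArray(). Whether that is what actually happens inside the library is not established by the thread.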
