Provide an ability to work with float instead of decimal
Overview
I tried using frictionless extract on a small-ish, mostly numeric dataset (~50,000 x 500), and it took a very long time (12.5 mins) to process the data.
Using the API directly was a bit faster, but still slow (2 mins).
I profiled the two different scenarios (attached).
To reproduce
Create some fake data:
#!/usr/bin/env python
import random
import pandas as pd
import numpy as np
from string import ascii_lowercase
# set random seeds
random.seed(1)
np.random.seed(1)
# create a fake data set with a single column of "ID's" and the rest numeric data
m = 50000
n = 500
# create random row ids
ids = [''.join(random.choice(ascii_lowercase) for i in range(10)) for j in range(m)]
# generate numeric portion of dataset and combine with row ids into a single dataframe
dat = pd.DataFrame(np.random.normal(0, 1, m * n).reshape([m, n]))
dat = pd.concat([pd.Series(ids), dat], axis=1)
# add colnames and store result
dat.columns = ["col_" + str(i) for i in range(n + 1)]
dat.to_csv("test_data.csv", index=False)
- CLI call:
# normal call
frictionless extract test_data.csv
# to generate a profile
python -m cProfile -o profile_frictionless_cli.prof $HOME/.local/bin/frictionless extract test_data.csv
- API call:
# profile_frictionless.py
#!/usr/bin/env python
from frictionless import extract
rows = extract('test_data.csv')
# to profile:
python -m cProfile -o profile_frictionless.prof profile_frictionless.py
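For readers without Snakeviz, a saved .prof file can also be summarized with the standard-library pstats module. The following is a minimal sketch, not part of the original report: busy_work and top_functions are hypothetical names, and the tiny workload only stands in for the real extract call so that the example runs on its own.

```python
import cProfile
import io
import os
import pstats
import tempfile


def busy_work(n):
    """Stand-in workload; replace with the real call being profiled."""
    return sum(str(i).count("1") for i in range(n))


def top_functions(prof_path, limit=10):
    """Return a text report of the `limit` costliest entries by cumulative time."""
    buffer = io.StringIO()
    stats = pstats.Stats(prof_path, stream=buffer)
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()


# Record a small profile and dump it to disk, like `python -m cProfile -o ...`
profiler = cProfile.Profile()
profiler.enable()
busy_work(20000)
profiler.disable()

prof_path = os.path.join(tempfile.mkdtemp(), "example.prof")
profiler.dump_stats(prof_path)

report = top_functions(prof_path)
print(report)
```

Pointing top_functions at profile_frictionless.prof instead of the temporary file should list the same hot spots that the Snakeviz view shows graphically.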
Observations
For the CLI calls, it seems like most of the time is spent in methods relating to pretty printing the tables (wcwidth.py and _pydecimal.py).
The actual number of calls is insanely large though – for the example 50,000 x 500 dataset, there are 75,000,000 calls to _pydecimal.py, which is 3x the number of individual values in the dataset (25,000,000).
If you follow the smaller execution path (~180/770s) on the right side, the result is similar. It goes:
main.py -> table.py -> table.py -> row.py -> field.py -> number.py -> _pydecimal.py
Calling the API directly basically looks like the smaller portion of the CLI profile, with a similar execution path to that shown above taking up most of the time.
I’m not sure how feasible it would be to try and optimize this / reduce the number of calls since, at a glance, it seems mostly to be from code outside of frictionless.
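To get a rough feel for how much of this is down to the number type itself, the cost of converting cell strings with float and with decimal.Decimal can be timed in isolation. This is only a sketch under stated assumptions: it does not call any frictionless code, parse_all is a hypothetical helper, and the timings depend on the machine and on whether Python's C-accelerated decimal implementation is available (the standard library falls back to the pure-Python _pydecimal module when it is not).

```python
import random
import timeit
from decimal import Decimal

random.seed(1)
# 100,000 cells formatted roughly the way they would appear in a CSV file
cells = [repr(random.gauss(0, 1)) for _ in range(100_000)]


def parse_all(cast):
    """Convert every cell with `cast` (for example float or Decimal)."""
    return [cast(cell) for cell in cells]


# Time three full passes over the cells for each number type
float_seconds = timeit.timeit(lambda: parse_all(float), number=3)
decimal_seconds = timeit.timeit(lambda: parse_all(Decimal), number=3)

print(f"float:   {float_seconds:.3f}s")
print(f"Decimal: {decimal_seconds:.3f}s")
```

A large gap between the two numbers would suggest that the conversion type matters; a small gap would point at per-cell overhead elsewhere in the call chain.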
It might be possible to parallelize some of the code, although if it’s just calls to a pretty-print function that renders directly to the screen, that isn’t likely to work, or may be a pain to implement even if it could…
My cat would also like to add:
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
Great point.
But more seriously, perhaps a better/much simpler solution would be to just have frictionless default to only printing the top N rows of data?
This is similar to how R/Pandas/Julia print matrices and dataframes to the screen.
You could similarly limit the number of columns displayed by default to make it more readable. If you could detect the terminal width, then the exact number could probably be optimized.
An option could then be included (e.g. --num-rows) to allow the user to override the default, with “0” indicating that all rows should be rendered.
Just a thought…
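The idea of a row limit plus a terminal-width-aware column limit could look roughly like the sketch below. It uses only the Python standard library and is not frictionless code: preview_rows, fit_columns, and the fixed cell_width of 12 characters are all hypothetical choices made for illustration.

```python
import csv
import io
import itertools
import shutil


def preview_rows(lines, num_rows=10):
    """Return the header plus at most `num_rows` data rows; 0 means all rows."""
    reader = csv.reader(lines)
    header = next(reader)
    limit = None if num_rows == 0 else num_rows
    # islice stops reading early, so the remaining rows are never parsed
    return header, list(itertools.islice(reader, limit))


def fit_columns(header, cell_width=12):
    """Keep as many leading columns as fit in the current terminal width."""
    terminal_width = shutil.get_terminal_size(fallback=(80, 24)).columns
    return header[: max(1, terminal_width // cell_width)]


sample = io.StringIO("col_0,col_1,col_2\na,1.5,2.5\nb,3.5,4.5\nc,5.5,6.5\n")
header, rows = preview_rows(sample, num_rows=2)
print(fit_columns(header))
print(rows)
```

Because the reader is consumed lazily, a small default limit avoids both parsing and rendering the rest of the file, which is where most of the CLI time appeared to go.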
Thanks again for the help on discord! It is much appreciated.
Cheers, Keith
Snakeviz profiles
profile_frictionless.prof.gz profile_frictionless_cli.prof.gz
System info
- Arch Linux 5.8.12 64-bit
- Python 3.8.5
- frictionless 3.19.2
- py-yaml 5.3.1
Please preserve this line to notify @roll (lead of this repository)
Issue Analytics
- Created 3 years ago
- Comments: 6 (5 by maintainers)
Top GitHub Comments
Great! Thanks for taking the time to work on this and keep me updated. I appreciate it.
Support for float numbers was added in #569, which has its own separate value, but my expectations were wrong: it didn’t really help to solve the performance issue with numbers.
I’ve created a new card to continue the work: https://github.com/frictionlessdata/frictionless-py/issues/568