Provide an ability to work with float instead of decimal

Overview

I tried using frictionless extract on a small-ish mostly numeric dataset (~50,000 x 500), and it took a very long time (12.5 mins) to process the data.

Using the API directly was a bit faster, but still slow (2 mins).

I profiled the two different scenarios (attached).

To reproduce

Create some fake data:

#!/usr/bin/env python
import random
import pandas as pd
import numpy as np
from string import ascii_lowercase

# set random seeds
random.seed(1)
np.random.seed(1)

# create a fake data set with a single column of "ID's" and the rest numeric data
m = 50000
n = 500

# create random row ids
ids = [''.join(random.choice(ascii_lowercase) for i in range(10)) for j in range(m)]

# generate numeric portion of dataset and combine with row ids into a single dataframe
dat = pd.DataFrame(np.random.normal(0, 1, m * n).reshape([m, n]))
dat = pd.concat([pd.Series(ids), dat], axis=1)

# add colnames and store result
dat.columns = ["col_" + str(i) for i in range(n + 1)]

dat.to_csv("test_data.csv", index=False)
  1. CLI call:
# normal call
frictionless extract test_data.csv

# to generate a profile
python -m cProfile -o profile_frictionless_cli.prof $HOME/.local/bin/frictionless extract test_data.csv
  2. API call:
#!/usr/bin/env python
# profile_frictionless.py
from frictionless import extract
rows = extract('test_data.csv')

# to profile (run from the shell, not from Python):
python -m cProfile -o profile_frictionless.prof profile_frictionless.py
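
For anyone who wants to look at the resulting .prof files without Snakeviz, the standard library's pstats module can read them. The sketch below is only an illustration: parse_numbers and top_functions are made-up helper names, and the tiny workload stands in for the real extract call. Point top_functions at profile_frictionless.prof to inspect the actual run.

```python
import cProfile
import pstats
import tempfile
from decimal import Decimal
from pathlib import Path


def parse_numbers(values):
    """Stand-in workload (hypothetical): parse each string cell as a Decimal."""
    return [Decimal(v) for v in values]


def top_functions(prof_path, limit=5):
    """Return (function name, total calls, cumulative seconds), costliest first."""
    stats = pstats.Stats(str(prof_path))
    rows = []
    # stats.stats maps (file, line, name) -> (primitive calls, total calls,
    # own time, cumulative time, callers)
    for (_file, _line, name), (_cc, ncalls, _tt, cumtime, _callers) in stats.stats.items():
        rows.append((name, ncalls, cumtime))
    rows.sort(key=lambda r: r[2], reverse=True)
    return rows[:limit]


# Write a small profile to a temporary file, then read it back.
prof_file = Path(tempfile.mkdtemp()) / "demo.prof"
cells = [str(i / 7) for i in range(10_000)]

profiler = cProfile.Profile()
profiler.enable()
parse_numbers(cells)
profiler.disable()
profiler.dump_stats(str(prof_file))

for name, calls, cumtime in top_functions(prof_file):
    print(f"{name:40.40s} calls={calls:>8d} cumtime={cumtime:.4f}s")
```

Sorting by cumulative time puts the functions at the top of the call chain first, which is usually the quickest way to see where a run like this spends its time.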

Observations

For the CLI calls, it seems like most of the time is spent in methods relating to pretty printing the tables (wcwidth.py and _pydecimal.py).

The actual number of calls is very large, though: for the example 50,000 x 500 dataset (25,000,000 values), there are 75,000,000 calls to _pydecimal.py, which is 3x the number of individual values in the dataset.

If you follow the smaller execution path (~180/770s) on the right side of the profile, the result is similar. It goes:

main.py -> table.py -> table.py -> row.py -> field.py -> number.py -> _pydecimal.py

Calling the API directly basically looks like the smaller portion of the CLI profile, with a similar execution path to that shown above taking up most of the time.
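
To get a rough feel for how much the number parsing alone costs, a micro-benchmark along these lines may help. This is a sketch with synthetic values, not frictionless code: parse_as_decimal and parse_as_float are hypothetical helpers, and the timings will vary by machine. One thing worth knowing is that _pydecimal is the pure-Python implementation of the standard decimal module. CPython uses the C accelerator (_decimal) instead when it can be imported, so the first few lines check for it.

```python
import timeit
from decimal import Decimal

try:
    import _decimal  # C accelerator that the decimal module prefers when available
    HAS_C_DECIMAL = True
except ImportError:
    HAS_C_DECIMAL = False  # decimal falls back to the pure-Python _pydecimal

# Synthetic cell values, roughly like the numeric strings read from test_data.csv.
cells = [f"{(i % 2000 - 1000) / 317:.15f}" for i in range(20_000)]


def parse_as_decimal(values):
    """Parse every cell into an exact Decimal."""
    return [Decimal(v) for v in values]


def parse_as_float(values):
    """Parse every cell into a binary float."""
    return [float(v) for v in values]


decimal_seconds = timeit.timeit(lambda: parse_as_decimal(cells), number=5)
float_seconds = timeit.timeit(lambda: parse_as_float(cells), number=5)

print(f"C decimal available: {HAS_C_DECIMAL}")
print(f"Decimal: {decimal_seconds:.3f}s  float: {float_seconds:.3f}s")
```

This only measures the raw constructors. Whatever the library does per cell on top of that is not covered here.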

I’m not sure how feasible it would be to try and optimize this / reduce the number of calls since, at a glance, it seems mostly to be from code outside of frictionless.

It might be possible to parallelize some of the code, although if most of the time goes to a pretty-print function that renders directly to the screen, that isn’t likely to work, or may be a pain to implement even if it could…

My cat would also like to add:

qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq

Great point.

But more seriously, perhaps a better/much simpler solution would be to just have frictionless default to only printing the top N rows of data?

This is similar to how R/Pandas/Julia print matrices and dataframes to the screen.

You could similarly limit the number of columns displayed by default to make it more readable. If you could detect the terminal width, then the exact number could probably be optimized.

An option could then be included (e.g. --num-rows) to allow the user to override the default, with “0” indicating that all rows should be rendered.
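
A minimal sketch of what that could look like, using only the standard library. This is not how frictionless is implemented: preview_rows, columns_that_fit and render_preview are hypothetical names, the fixed cell width is an arbitrary choice, and num_rows just mirrors the proposed --num-rows option (0 meaning "all rows").

```python
import csv
import io
import itertools
import shutil


def preview_rows(lines, num_rows=10):
    """Yield the header plus at most num_rows data rows (0 means all rows)."""
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:
        return
    yield header
    yield from (reader if num_rows == 0 else itertools.islice(reader, num_rows))


def columns_that_fit(header, cell_width=12, terminal_width=None):
    """Return how many fixed-width columns fit on one terminal line (at least 1)."""
    if terminal_width is None:
        terminal_width = shutil.get_terminal_size(fallback=(80, 24)).columns
    return max(1, min(len(header), terminal_width // (cell_width + 1)))


def render_preview(lines, num_rows=10, cell_width=12, terminal_width=None):
    """Return the preview as a list of text lines, truncated in both directions."""
    rows = preview_rows(lines, num_rows)
    header = next(rows, None)
    if header is None:
        return []
    n_cols = columns_that_fit(header, cell_width, terminal_width)
    out = []
    for row in itertools.chain([header], rows):
        cells = [cell[:cell_width].ljust(cell_width) for cell in row[:n_cols]]
        out.append(" ".join(cells).rstrip())
    return out


sample = io.StringIO("col_0,col_1,col_2\nabc,0.5,1.25\ndef,-0.1,2.0\nghi,3.3,4.4\n")
for line in render_preview(sample, num_rows=2, terminal_width=40):
    print(line)
```

Because islice stops reading after the requested rows, the rest of the file is never parsed or formatted, which is the point of the suggestion.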

Just a thought…

Thanks again for the help on discord! It is much appreciated.

Cheers, Keith

Snakeviz profiles

profile_frictionless.prof.gz profile_frictionless_cli.prof.gz

System info

  • Arch Linux 5.8.12 64-bit
  • Python 3.8.5
  • frictionless 3.19.2
  • py-yaml 5.3.1

Please preserve this line to notify @roll (lead of this repository)

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

khughitt commented, Dec 4, 2020 (1 reaction)

Great! Thanks for taking the time to work on this and keep me updated. I appreciate it.

roll commented, Dec 3, 2020 (1 reaction)

Support for float numbers was added in #569, which has its own separate value, but my expectations were wrong and it didn’t really help to solve the performance issue with numbers.

I’ve created a new card to continue: https://github.com/frictionlessdata/frictionless-py/issues/568
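
As a side note on why a float option can still be useful apart from speed, the two types behave differently. The snippet below uses plain standard-library Python, not the frictionless API, and is only meant to illustrate the trade-off between exact decimal values and native binary floats.

```python
from decimal import Decimal

decimal_sum = Decimal("0.1") + Decimal("0.2")
float_sum = 0.1 + 0.2

# Decimal arithmetic on these values is exact; binary floats are approximate.
print(decimal_sum == Decimal("0.3"))  # True
print(float_sum == 0.3)               # False

# Decimal keeps the trailing zero from the source text; float does not.
print(Decimal("1.10"))  # 1.10
print(float("1.10"))    # 1.1
```

Floats are what numpy and pandas work with natively, so getting them directly can save a conversion step when the data is headed there anyway.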
