pandas.dataframe.values floating number changes automatically

See original GitHub issue

Code Sample, a copy-pastable example if possible

test.xlsx

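import pandas as pd

# Read sheet 'test1' of the attached workbook, using the first row as the header.
df = pd.read_excel('test.xlsx', 'test1', header=0, index_col=None)
# The columns mix strings and floats, so .values is an object-dtype ndarray.
print(df.values)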

Problem description

I loaded a pandas DataFrame from the attached test.xlsx, whose content is as follows:

  name     c1           c2
0   r1  0.014  0.000-0.054
1   r2  0.984  0.025-1.785

As we can see, the c1 column appears well rounded. For some reason, I needed only the values as a numpy.ndarray, but the floating-point precision expands undesirably and the values change slightly, as follows:

array([['r1', 0.013999999999999999, '0.000-0.054'],
       ['r2', 0.9840000000000001, '0.025-1.785']], dtype=object)

What is odd is that some other, similar tables of mine produce the expected results, so this is really beyond me.

Expected Output

What I wanted was an exact correspondence with the DataFrame:

array([['r1', 0.014, '0.000-0.054'],
       ['r2', 0.984, '0.025-1.785']], dtype=object)
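One possible explanation for the mismatch, sketched here under assumptions: the DataFrame printout uses a limited display precision (the pandas display.precision option, 6 by default), whereas printing df.values shows the full repr of each float. The hypothetical frame below hard-codes the two values from the report instead of reading test.xlsx, and uses DataFrame.round as one possible workaround when only three decimals are meaningful.

```python
import pandas as pd

# Hypothetical stand-in for test.xlsx: the c1 floats are the values reported in the issue.
df = pd.DataFrame({
    "name": ["r1", "r2"],
    "c1": [0.013999999999999999, 0.9840000000000001],
    "c2": ["0.000-0.054", "0.025-1.785"],
})

# The DataFrame printout is formatted with limited display precision.
print(df)

# The object array prints each float with its full repr.
print(df.values)

# Possible workaround: round the float column before extracting the values.
rounded = df.round({"c1": 3})
print(rounded.values)
```

Rounding changes the stored values, so it only suits cases where the extra digits carry no information.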

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.8.0-59-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: zh_CN.utf8
LANG: en_US.UTF-8
LOCALE: zh_CN.UTF-8

pandas: 0.19.1
nose: None
pip: 10.0.1
setuptools: 26.1.1
Cython: None
numpy: 1.13.3
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 2.1.0
openpyxl: None
xlrd: 1.0.0
xlwt: 1.3.0
xlsxwriter: 0.7.3
lxml: None
bs4: 4.5.1
html5lib: 1.0b10
httplib2: 0.9.1
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

Issue Analytics

  • State:open
  • Created 5 years ago
  • Reactions:1
  • Comments:7 (4 by maintainers)

Top GitHub Comments

3 reactions
gfyoung commented, Feb 2, 2019

"As a side note, the almost identical xstrtod is defined again in parse_helper.h, which seems like an oversight."

Weird…I think it would be good to see if we could unify the two.

"This seems like a fairly straightforward fix of replacing xstrtod with the Python float evaluator PyOS_string_to_double from Python.h, unless there is a good reason to stick with the original xstrtod?"

I’m not sure what the reason was for implementing our own. However, I would encourage you to investigate the consequences of replacing it, from both an accuracy and a performance perspective.

2 reactions
david-liu-brattle-1 commented, Feb 2, 2019

@gfyoung The string parsing functions seem to call a custom-built xstrtod function:

https://github.com/pandas-dev/pandas/blob/bb43726e1f52a0ddee45fcf485690719f262870d/pandas/_libs/src/parser/tokenizer.c#L1532-L1534

which does a fine job of evaluating the string, but the issue here is that it does not evaluate it exactly as Python (or NumPy) does: float('0.014') == 0.014 == np.fromstring(b'0.014', sep=' ')[0], but xstrtod('0.014') != 0.014. For consistency's sake, I think a number read in by pandas as a string should be evaluated and written back out as the same number (currently 0.014 is written back out as 0.0139999999 after being evaluated). It's a fluke that this issue isn't being picked up by any of the tests. For example, "0.014" would make the following test fail if it were in the array: https://github.com/pandas-dev/pandas/blob/bb43726e1f52a0ddee45fcf485690719f262870d/pandas/tests/dtypes/test_inference.py#L398-L405

This seems like a fairly straightforward fix of replacing xstrtod with the Python float evaluator PyOS_string_to_double from Python.h, unless there is a good reason to stick with the original xstrtod?

As a side note, the almost identical xstrtod is defined again in parse_helper.h, which seems like an oversight: https://github.com/pandas-dev/pandas/blob/bb43726e1f52a0ddee45fcf485690719f262870d/pandas/_libs/src/parse_helper.h#L148-L151
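To illustrate how a hand-rolled parser can drift from Python's float(), here is a minimal sketch. naive_strtod is a hypothetical, simplified stand-in (accumulate the digits, then scale by powers of ten); it is not the actual pandas xstrtod, and whether it differs from float() for a given input depends on how the intermediate roundings fall.

```python
def naive_strtod(text):
    """Hypothetical simplified decimal parser, for illustration only.

    Handles unsigned input such as '0.014': digits with an optional decimal point.
    """
    number = 0.0
    exponent = 0
    seen_point = False
    for ch in text:
        if ch == "." and not seen_point:
            seen_point = True
        elif ch in "0123456789":
            # Accumulate every digit into one float and count the fractional digits.
            number = number * 10.0 + int(ch)
            if seen_point:
                exponent -= 1
        else:
            raise ValueError("unsupported character: %r" % ch)

    # Scale down by powers of ten using repeated squaring.
    # Each division rounds to the nearest double, so errors can accumulate.
    p10 = 10.0
    n = -exponent
    while n:
        if n & 1:
            number /= p10
        n >>= 1
        p10 *= p10
    return number


for s in ("0.014", "0.984"):
    exact = float(s)           # Python's correctly rounded conversion
    approx = naive_strtod(s)   # may be off in the last bit
    print(s, repr(exact), repr(approx), exact == approx)
```

Python's float() rounds once, directly from the decimal string, while the sketch rounds at every intermediate step, so any discrepancy shows up only in the last digits of the repr.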
