pandas.dataframe.values floating number changes automatically
Code Sample, a copy-pastable example if possible
# Your code here
import pandas as pd
df = pd.read_excel('test.xlsx', 'test1', header=0, index_col=None)
print(df.values)
Problem description
I loaded a pandas DataFrame from the attached test.xlsx, whose content is as follows:

  name     c1           c2
0   r1  0.014  0.000-0.054
1   r2  0.984  0.025-1.785

As we can see, the c1 column is well rounded. For some reasons I needed only the values as a numpy.ndarray, but the floating-point precision expands undesirably and the values change slightly, as follows:
array([['r1', 0.013999999999999999, '0.000-0.054'],
       ['r2', 0.9840000000000001, '0.025-1.785']], dtype=object)
What is odd is that I have some other, similar tables which produced the expected results. So this is really beyond me.
Expected Output
What I wanted was an array corresponding exactly to the DataFrame:

array([['r1', 0.014, '0.000-0.054'],
       ['r2', 0.984, '0.025-1.785']], dtype=object)
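A minimal sketch of a possible workaround, assuming the drift stays far below the displayed precision (the drifted value below is a stand-in produced by double rounding, not read from test.xlsx):

```python
# Stand-in for the drifted value pandas produced (not read from test.xlsx):
# dividing in two steps reproduces the same kind of one-ulp drift.
drifted = 1.4 / 100
print(drifted == 0.014)    # the drifted double differs from float('0.014')

# Rounding to a precision well above the drift snaps the value back
# to the correctly rounded double:
fixed = round(drifted, 9)
print(fixed == 0.014)
```

On a loaded frame this would be something like `df['c1'] = df['c1'].round(9)`, where `c1` is the affected column and 9 decimal places is an assumed acceptable precision.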
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.8.0-59-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: zh_CN.utf8
LANG: en_US.UTF-8
LOCALE: zh_CN.UTF-8

pandas: 0.19.1
nose: None
pip: 10.0.1
setuptools: 26.1.1
Cython: None
numpy: 1.13.3
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 2.1.0
openpyxl: None
xlrd: 1.0.0
xlwt: 1.3.0
xlsxwriter: 0.7.3
lxml: None
bs4: 4.5.1
html5lib: 1.0b10
httplib2: 0.9.1
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None
Issue Analytics
- State:
- Created 5 years ago
- Reactions: 1
- Comments:7 (4 by maintainers)
Top GitHub Comments
Weird…I think it would be good to see if we could unify the two.
I’m not sure what the reason was for implementing our own. However, I would encourage you to investigate the consequences of replacing it, from both an accuracy and a performance perspective.
@gfyoung The string parsing functions seem to call a custom-built `xstrtod` function:
https://github.com/pandas-dev/pandas/blob/bb43726e1f52a0ddee45fcf485690719f262870d/pandas/_libs/src/parser/tokenizer.c#L1532-L1534
It does a fine job of evaluating the string, but the issue here is that it does not evaluate it exactly as Python (or numpy) does:
`float('0.014') == 0.014 == np.fromstring(b'0.014', sep=' ')[0]`, but `xstrtod('0.014') != 0.014`. For consistency's sake, I think it makes sense that a number read in by pandas as a string should be evaluated and written back out as the same number (currently 0.014 is written back out as 0.0139999999 after being evaluated). It's a fluke that this issue isn't being picked up by any of the tests. For example, "0.014" would make the following test fail if it were in the array:
https://github.com/pandas-dev/pandas/blob/bb43726e1f52a0ddee45fcf485690719f262870d/pandas/tests/dtypes/test_inference.py#L398-L405

This seems like a fairly straightforward fix: replace `xstrtod` with the Python float evaluator `PyOS_string_to_double` from `Python.h`, unless there is a good reason to stick with the original `xstrtod`?

As a side note, an almost identical `xstrtod` is defined again in parse_helper.h, which seems like an oversight:
https://github.com/pandas-dev/pandas/blob/bb43726e1f52a0ddee45fcf485690719f262870d/pandas/_libs/src/parse_helper.h#L148-L151
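The drift described above can be sketched with a simplified Python model of an `xstrtod`-style parser (an illustration of the double-rounding effect, not the actual C implementation): the digits are accumulated into one double, then scaled by the decimal exponent through repeated division, and each intermediate division can round.

```python
def xstrtod_like(digits, neg_exp):
    """Simplified model of an xstrtod-style parser: accumulate the
    digits into a double, then divide by 10**neg_exp using binary
    exponentiation. Each intermediate division may round, so the
    result can differ from the correctly rounded float(...)."""
    number = float(digits)
    p10 = 10.0
    n = neg_exp
    while n:
        if n & 1:
            number /= p10   # each division may introduce extra rounding
        n >>= 1
        p10 *= p10          # 10, 100, 10000, ...
    return number

# "0.014" has digit string 014 -> 14 and a decimal exponent of -3:
# 14 / 10 rounds once, then / 100 rounds again.
print(xstrtod_like(14, 3))
print(xstrtod_like(14, 3) == float('0.014'))
```

A single correctly rounded conversion, as `PyOS_string_to_double` performs, rounds exactly once from the decimal string to the nearest double, which is why `float('0.014')` round-trips while the two-step scaling can land one ulp away.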