Rank(pct=True) behaves strangely on big data
Code Sample, a copy-pastable example if possible
import pandas as pd

smallData = pd.DataFrame({'a': [0]*10 + [1,2,3]})
print(smallData.a.rank(pct=True).tail())

bigData = pd.DataFrame({'a': [0]*100000000 + [1,2,3]})
print(bigData.a.rank(pct=True).tail())
When I use pd.DataFrame().rank(pct=True) on small data (see the first example), it returns percentages/percentiles as expected. However, when the data is large, the returned values are no longer valid percentages (some exceed 1). Maybe this is the expected output, but I just want to calculate percentiles on big data.
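As a possible workaround (my own sketch, not from the original report, assuming the default method='average'), dividing the raw ranks by the series length stays within [0, 1] even for large inputs:

# Hypothetical workaround: compute the percentile rank manually
# instead of relying on pct=True.
manual_pct = bigData.a.rank(method='average') / len(bigData.a)
print(manual_pct.tail())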
Output
8     0.423077
9     0.423077
10    0.846154
11    0.923077
12    1.000000

99999998     2.980232
99999999     2.980232
100000000    5.960465
100000001    5.960465
100000002    5.960465
Expected Output
I would expect something close to 0.5 for all the 0s and something close to 1 for the other values.
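A quick sanity check (my own arithmetic, not from the report) supports this expectation: with the default method='average', the 10**8 tied zeros share the average rank (1 + 10**8) / 2, and dividing by the total length 10**8 + 3 gives roughly 0.5, while the three remaining values rank near the top and should come out close to 1.

# Rough expected values under method='average' (illustrative only)
n_zeros = 100_000_000
n_total = n_zeros + 3
print((1 + n_zeros) / 2 / n_total)   # ~0.5 for every 0
print((n_zeros + 1) / n_total)       # ~1.0 for the value 1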
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Top GitHub Comments
Ha no worries. The changes are exactly the same, so it gets to the same spot. Let's stick with yours.
@WillAyd : Oops, just saw this after posting a PR of my own!