Rank(pct=True) behaves strangely on big data
Code Sample, a copy-pastable example if possible
import pandas as pd

smallData = pd.DataFrame({'a': [0]*10 + [1,2,3]})
print(smallData.a.rank(pct=True).tail())

bigData = pd.DataFrame({'a': [0]*100000000 + [1,2,3]})
print(bigData.a.rank(pct=True).tail())
When I use pd.DataFrame().rank(pct=True) on small data (see the first example), it returns percentages/percentiles as expected. However, when the data is large, the returned values are no longer valid percentages (some exceed 1). Maybe this is the expected output, but I just want to calculate percentiles on big data.
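As a possible workaround (my own sketch, not from the original report, assuming the default method='average'), dividing the raw ranks by the series length stays within [0, 1] even for large inputs:

# Hypothetical workaround: compute the percentile rank manually
# instead of relying on pct=True.
manual_pct = bigData.a.rank(method='average') / len(bigData.a)
print(manual_pct.tail())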
Output
8     0.423077
9     0.423077
10    0.846154
11    0.923077
12    1.000000

99999998     2.980232
99999999     2.980232
100000000    5.960465
100000001    5.960465
100000002    5.960465
Expected Output
I would expect something close to 0.5 for all the 0s and something close to 1 for the other values.
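A quick sanity check (my own arithmetic, not from the report) supports this expectation: with the default method='average', the 10**8 tied zeros share the average rank (1 + 10**8) / 2, and dividing by the total length 10**8 + 3 gives roughly 0.5, while the three remaining values rank near the top and should come out close to 1.

# Rough expected values under method='average' (illustrative only)
n_zeros = 100_000_000
n_total = n_zeros + 3
print((1 + n_zeros) / 2 / n_total)   # ~0.5 for every 0
print((n_zeros + 1) / n_total)       # ~1.0 for the value 1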
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Top GitHub Comments
Ha no worries. The changes are exactly the same, so it gets to the same spot. Let's stick with yours.
@WillAyd : Oops, just saw this after posting a PR of my own!