ENH: Add high-performance DecimalArray type?
Is your feature request related to a problem?
Doing operations with decimal types is currently very slow, since pandas stores them as Python Decimal objects. I think it would be nice to have a high-performance, vectorized DecimalArray ExtensionArray type. Would you be open to a PR adding a type like this, or is this something you think belongs outside the pandas library?
Describe the solution you’d like
Create an ExtensionArray datatype that internally represents numbers the way pyarrow does (as integers, with a precision and scale). (Require that all values have the same precision and scale.) Implement basic arithmetic operations (+, -, *, /, etc.).
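To make the representation concrete, here is a minimal, hypothetical sketch of the idea (the class name and fields are illustrative, not a proposed API): values are stored as unscaled integers sharing one scale, so addition stays plain integer addition and multiplication only needs a rescaling step.

```python
from dataclasses import dataclass
from decimal import Decimal

@dataclass
class FixedDecimalArray:
    """Hypothetical sketch: decimals stored as unscaled integers with a
    shared scale, mirroring pyarrow's decimal128 (value = int * 10**-scale)."""
    ints: list       # unscaled integer values
    precision: int   # max significant digits (not enforced in this sketch)
    scale: int       # digits after the decimal point

    def __add__(self, other):
        # Same scale on both sides, so addition is plain integer addition.
        assert self.scale == other.scale
        return FixedDecimalArray(
            [a + b for a, b in zip(self.ints, other.ints)],
            self.precision, self.scale)

    def __mul__(self, other):
        # Multiplying two scale-s values yields a scale-2s product;
        # divide by 10**scale to return to the fixed scale (truncating).
        assert self.scale == other.scale
        factor = 10 ** self.scale
        return FixedDecimalArray(
            [(a * b) // factor for a, b in zip(self.ints, other.ints)],
            self.precision, self.scale)

    def to_decimals(self):
        # Materialize Python Decimal objects only at the boundary.
        return [Decimal(v).scaleb(-self.scale) for v in self.ints]
```

For example, `FixedDecimalArray([123, 250], 5, 2)` represents `[1.23, 2.50]`, and adding `[1.00, 1.00]` to it yields the integer storage `[223, 350]`.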
API breaking implications
By adding __from_arrow__ and __to_arrow__ methods to DecimalArray, decimal data from pyarrow would be converted to DecimalArray rather than an object array by default, which would be a breaking change.
Describe alternatives you’ve considered
The status quo (loading as object arrays) works, but is slow.
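A quick stdlib sketch of why the object path is slow: reductions over Python Decimal objects dispatch and allocate per element, while an int-backed representation can defer to native integer arithmetic and convert once at the end (timings are illustrative only).

```python
from decimal import Decimal
import time

n = 200_000
scale = 2
ints = list(range(n))                              # int-backed representation
objs = [Decimal(i).scaleb(-scale) for i in ints]   # object-array status quo

t0 = time.perf_counter()
obj_sum = sum(objs, Decimal(0))                    # per-element Decimal dispatch
t_obj = time.perf_counter() - t0

t0 = time.perf_counter()
int_sum = Decimal(sum(ints)).scaleb(-scale)        # native int sum, one rescale
t_int = time.perf_counter() - t0

assert obj_sum == int_sum  # same result either way
print(f"Decimal objects: {t_obj:.4f}s  int-backed: {t_int:.4f}s")
```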
Additional context
This would also be really helpful for loading Parquet files that contain decimal types via pyarrow more quickly, since instantiating a Python Decimal object for each value can be very slow for large DataFrames.
Issue Analytics
- State:
- Created 3 years ago
- Reactions: 3
- Comments: 10 (6 by maintainers)
Top GitHub Comments
Note there is a bit more to this than just casting. What is exposed in pyarrow and works for decimals is, e.g., unique / dictionary_encode (which could already support groupby with a decimal column as key), and take for general subsetting/indexing. (I know this is not much computationally, but these are still essential functionalities to get it working as a pandas extension array.) Further, there are a bunch of element-wise kernels available in the Gandiva submodule, a JIT expression compiler for Arrow memory, which can be used from Python on decimal arrays (though again, no reductions).
Querying what is currently available for decimal types:
And there is currently work being done in pyarrow to refactor the kernel implementation, which amongst other things should allow those gandiva functions also to be reused in the precompiled kernels (see https://issues.apache.org/jira/browse/ARROW-8792). Once that has landed, it should be more straightforward to add a bunch of kernels for the decimal type.
It’s certainly the goal of pyarrow to provide the core computational building blocks (so a developer API, just not an end-user API), but it should (eventually) provide everything needed to allow fletcher or another project to build a decimal ExtensionArray.
I am not sure whether this is on purpose, or whether they would welcome contributions (cc @xhochy)
I agree that such conversion should always be avoided. But I also think it can be beneficial to already start experimenting with the pandas ExtensionArray side of things (to allow people to start using it for those things it implements, and provide feedback etc), before all necessary computational functionality is available in pyarrow.
I think it might be good to keep this issue open as general issue about “fast decimal support in pandas”, until there is an actual solution / project we can point users towards.
@sid-kap Something that works but is slow and available now is much better than something fast that is not available.