BaseN encoding speed improvement


By using libdivide4j, I was able to double the speed of your BaseN encoder for radixes that are not powers of 2.

public class FastDivNRemainderEncoder extends BaseNEncoder {

    private final int radix;
    private final int length;
    private final char padding;

    private static final int UUID_INTS = 4;
    private static final long HALF_LONG_MASK = 0x00000000ffffffffL;
    
    private final FastDivision.Magic magic;

    public FastDivNRemainderEncoder(BaseN base) {
        super(base);
        radix = base.getRadix();
        length = base.getLength();
        padding = base.getPadding();
        magic = FastDivision.magicUnsigned((long) radix);
    }

    @Override
    public String apply(@SuppressWarnings("null") UUID uuid) {

        // unsigned 128 bit number
        int[] number = new int[UUID_INTS];
        number[0] = (int) (uuid.getMostSignificantBits() >>> 32);
        number[1] = (int) (uuid.getMostSignificantBits() & HALF_LONG_MASK);
        number[2] = (int) (uuid.getLeastSignificantBits() >>> 32);
        number[3] = (int) (uuid.getLeastSignificantBits() & HALF_LONG_MASK);

        char[] buffer = new char[length];
        int b = length; // buffer index

        // fill in the buffer backwards using remainder operation
        while (!isZero(number)) {
            final int[] quotient = new int[UUID_INTS]; // division output
            final int remainder = remainder(number, quotient);
            buffer[--b] = alphabet.get(remainder);
            number = quotient;
        }

        // pad the remaining leading positions
        while (b > 0) {
            buffer[--b] = padding;
        }

        return new String(buffer);
    }

    protected int remainder(int[] number, int[] quotient /* division output */) {

        long temporary = 0;
        long remainder = 0;

        for (int i = 0; i < UUID_INTS; i++) {
            temporary = (remainder << 32) | (number[i] & HALF_LONG_MASK);
            // quotient[i] = (int) (temporary / divisor);
            long q = FastDivision.divideUnsignedFast(temporary, magic);
            // remainder = temporary % divisor;
            long r = temporary - q * magic.divider;
            
            quotient[i] = (int) q;
            remainder = (int) r;
        }

        return (int) remainder;
    }

    private boolean isZero(int[] number) {
        return number[0] == 0 && number[1] == 0 && number[2] == 0 && number[3] == 0;
    }
}

I don’t fully understand how the algorithm works, but it speeds up divmod quite a bit for radixes that are not powers of 2. I also don’t understand why you take the modulus (%) after you have already done the division. Maybe the JIT is smart enough to combine the two. Multiplying and then subtracting (as I did here) might speed it up as well.
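To make the multiply-then-subtract point concrete, here is a minimal sketch that does not depend on libdivide4j. It runs the same limb-by-limb long division with the plain `/` operator and derives the remainder as `temporary - q * radix` instead of issuing a second `%`. The class name `DivModSketch` and the 36-character alphabet are assumptions made for illustration; they are not part of the library.

```java
import java.util.UUID;

public class DivModSketch {

    private static final long HALF_LONG_MASK = 0x00000000ffffffffL;

    // Hypothetical alphabet for illustration; the radix is its length (36 here).
    static final String ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz";

    /** Divides the 128-bit number (four big-endian 32-bit limbs) in place and returns the remainder. */
    static int divideInPlace(int[] number, int radix) {
        long remainder = 0;
        for (int i = 0; i < number.length; i++) {
            // remainder < radix, so temporary stays far below 2^63 and signed division is safe
            long temporary = (remainder << 32) | (number[i] & HALF_LONG_MASK);
            long q = temporary / radix;
            remainder = temporary - q * radix; // multiply and subtract instead of a second division
            number[i] = (int) q;               // q < 2^32, so the low 32 bits hold it exactly
        }
        return (int) remainder;
    }

    /** Encodes the UUID as an unpadded base-N string, most significant digit first. */
    static String encode(UUID uuid) {
        int radix = ALPHABET.length();
        long msb = uuid.getMostSignificantBits();
        long lsb = uuid.getLeastSignificantBits();
        int[] number = { (int) (msb >>> 32), (int) msb, (int) (lsb >>> 32), (int) lsb };
        StringBuilder sb = new StringBuilder();
        do {
            sb.append(ALPHABET.charAt(divideInPlace(number, radix)));
        } while ((number[0] | number[1] | number[2] | number[3]) != 0);
        return sb.reverse().toString();
    }
}
```

Whether this beats `/` followed by `%` in practice depends on what the JIT already does with the pair, so it is worth benchmarking rather than assuming.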

I also have an in-house encoder that I call Fast57 (though it could be any BaseN) that divmods the MSB long and the LSB long separately using the fast division algorithm and then concatenates the results. That approach is about twice as fast as even the improved BaseN encoder above; however, it can’t handle arbitrary byte lengths since it relies on longs.
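The per-long approach could be sketched roughly as follows. This is only a guess at the shape of such an encoder, not the actual Fast57 code: the class name `TwoLongEncoderSketch`, the 57-character alphabet, and the fixed width of 11 characters per 64-bit half are all assumptions, and `Long.divideUnsigned` stands in for the libdivide4j fast division. Because each half is encoded independently, the output is not the same string as a true base-57 rendering of the full 128-bit value.

```java
import java.util.UUID;

public class TwoLongEncoderSketch {

    // Hypothetical 57-character alphabet (digits and letters minus 0, 1, I, O, l).
    static final String ALPHABET =
            "23456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz";
    static final int RADIX = ALPHABET.length();
    // 57^11 > 2^64 > 57^10, so 11 characters cover any unsigned 64-bit value.
    static final int WIDTH = 11;

    /** Writes one unsigned 64-bit value as WIDTH characters, most significant digit first. */
    static void encodeLong(long value, char[] buffer, int offset) {
        for (int i = offset + WIDTH - 1; i >= offset; i--) {
            long q = Long.divideUnsigned(value, RADIX); // stand-in for the fast division
            buffer[i] = ALPHABET.charAt((int) (value - q * RADIX)); // remainder via multiply and subtract
            value = q;
        }
    }

    static String encode(UUID uuid) {
        char[] buffer = new char[2 * WIDTH];
        encodeLong(uuid.getMostSignificantBits(), buffer, 0);
        encodeLong(uuid.getLeastSignificantBits(), buffer, WIDTH);
        return new String(buffer);
    }

    /** Reads WIDTH characters back into an unsigned 64-bit value (no input validation). */
    static long decodeLong(String text, int offset) {
        long value = 0;
        for (int i = offset; i < offset + WIDTH; i++) {
            value = value * RADIX + ALPHABET.indexOf(text.charAt(i));
        }
        return value;
    }

    static UUID decode(String text) {
        return new UUID(decodeLong(text, 0), decodeLong(text, WIDTH));
    }
}
```

Each half needs only single-word divisions, which is presumably where the speedup comes from; the cost is a fixed 22-character output and no support for inputs that are not exactly two longs.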

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
fabiolimace commented, Mar 14, 2022

Making a copy of libdivide4j is also adding a dependency. I intend to keep this lib self-contained and efficient at the same time, although I know it’s an unattainable ideal, as nothing is really independent. Alternatively, I could copy just part of libdivide4j, but I’m not confident about adding part of something I can’t fully understand. libdivide is really magic to me.

However, if a developer needs to speed up performance with libdivide4j, this can be done by injecting a custom division function. It can be considered an option. I didn’t notice any significant performance difference when plugging FastDivision wrapped in a CustomDivider. I think the JIT compiler does a pretty good job of optimizing it for us. Please take a look at this benchmark I just did:

Benchmark                      Mode  Cnt     Score     Error   Units
InjectedWithCustomDivider     thrpt    5  3156,578 ±  23,137  ops/ms
IncludedAsDependency          thrpt    5  3225,171 ±  99,144  ops/ms  +2%
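The idea of injecting a division function can be sketched as below. This is an illustration only: the `Divider` interface and the `InjectableDividerSketch` class are hypothetical names and do not reproduce the library's actual `CustomDivider` signature. The encoder delegates every limb division to whatever strategy it was given, so a faster routine can be plugged in without the encoder depending on it.

```java
import java.util.UUID;

public class InjectableDividerSketch {

    /** Hypothetical hook: maps a non-negative dividend to {quotient, remainder}. */
    @FunctionalInterface
    public interface Divider {
        long[] divide(long dividend);
    }

    private static final long HALF_LONG_MASK = 0x00000000ffffffffL;

    private final String alphabet;
    private final Divider divider;

    public InjectableDividerSketch(String alphabet, Divider divider) {
        this.alphabet = alphabet;
        this.divider = divider;
    }

    /** Default strategy: the plain / and % operators. */
    public static Divider plain(int radix) {
        return x -> new long[] { x / radix, x % radix };
    }

    /** Encodes the UUID as an unpadded base-N string, most significant digit first. */
    public String encode(UUID uuid) {
        long msb = uuid.getMostSignificantBits();
        long lsb = uuid.getLeastSignificantBits();
        int[] number = { (int) (msb >>> 32), (int) msb, (int) (lsb >>> 32), (int) lsb };
        StringBuilder sb = new StringBuilder();
        do {
            long remainder = 0;
            for (int i = 0; i < number.length; i++) {
                long temporary = (remainder << 32) | (number[i] & HALF_LONG_MASK);
                long[] qr = divider.divide(temporary); // the injected strategy does the division
                number[i] = (int) qr[0];
                remainder = qr[1];
            }
            sb.append(alphabet.charAt((int) remainder));
        } while ((number[0] | number[1] | number[2] | number[3]) != 0);
        return sb.reverse().toString();
    }
}
```

A caller who wants libdivide4j could then pass a lambda built from the calls shown in the issue, `FastDivision.divideUnsignedFast(x, magic)` for the quotient and `x - q * magic.divider` for the remainder, while everyone else keeps the plain default.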

I really appreciate the interest and advice other developers give this library. But I respectfully refuse to add libdivide4j as a dependency.

0 reactions
agentgt commented, Mar 15, 2022

I respect that and appreciate it. I wish more libraries did that. I was only pointing it out because you might have thought it was a ton of code, but it’s only one class. If en/decoding were the primary goal of this project I would make a bigger deal about it, but it’s not.

EDIT: To be clear I agree with your decision 👍
