Write a Cython/C in-place rstrip function to speed up writing of FITS tables

I was profiling the writing of FITS tables with string columns (which has terrible performance at the moment) and found that one of the bottlenecks is the following function in fitsrec.py:

import numpy as np


def _rstrip_inplace(array, chars=None):
    """
    Performs an in-place rstrip operation on string arrays.
    This is necessary since the built-in `np.char.rstrip` in Numpy does not
    perform an in-place calculation.  This can be removed if ever
    https://github.com/numpy/numpy/issues/6303 is implemented (however, for
    the purposes of this module the only in-place vectorized string functions
    we need are rstrip and encode).
    """

    for item in np.nditer(array, flags=['zerosize_ok'],
                                 op_flags=['readwrite']):
        item[...] = item.item().rstrip(chars)

At the moment, https://github.com/numpy/numpy/issues/6303 hasn’t been implemented yet, but it should be reasonably easy to write a C or Cython function that would implement something like this ourselves for now. Specifically, we need a function that loops over elements of a string array and replaces trailing spaces with null. I think the changes to Numpy would be a lot harder to do.
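
To make the requested behaviour concrete, here is a minimal pure-Python sketch of the byte-level loop such a C or Cython function might perform. It is only an illustration under assumptions: the name `rstrip_fixed_width` and its parameters are hypothetical, it works on a flat `bytearray` of fixed-width fields rather than on a NumPy array, and it is not the implementation proposed in this issue.

```python
def rstrip_fixed_width(buf, itemsize, pad=0x20):
    """Null out trailing pad bytes in each fixed-width field of buf, in place.

    Hypothetical helper: buf is a bytearray holding consecutive fields of
    `itemsize` bytes each, the way an 'S' string column is laid out in memory.
    """
    if itemsize <= 0 or len(buf) % itemsize:
        raise ValueError("buffer length must be a multiple of itemsize")
    for start in range(0, len(buf), itemsize):
        i = start + itemsize - 1
        # Walk backwards over existing null padding and trailing spaces,
        # stopping at the first byte that is neither.
        while i >= start and buf[i] in (0, pad):
            buf[i] = 0
            i -= 1
    return buf


data = bytearray(b"abc  d    ")  # two 5-byte fields: "abc  " and "d    "
rstrip_fixed_width(data, 5)
print(bytes(data))  # b'abc\x00\x00d\x00\x00\x00\x00'
```

The inner `while` loop is the part that would be worth moving to compiled code, since it touches each byte at most once and needs no temporary Python string objects.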

This could speed up writing FITS tables with character arrays by a factor of ~10 or more when combined with some other changes I’m preparing locally (PR forthcoming).

Issue Analytics

  • State: open
  • Created 6 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

mhvk commented, Nov 29, 2017

@astrofrog - the numpy code essentially does the same as the above loop (it ends up calling back to python for the rstrip method), so I agree that is no solution. I played around a little and the following speeds up the code by a factor of 2 (for bytes; for unicode it is essentially the same, sadly). Maybe worth including until we do write the cython function…

def _rstrip(array, chars=None):
    flat = array.flat
    for j, item in enumerate(array.flat):
        flat[j] = item.rstrip(chars)

… … … But one does a factor 40 better by turning the arrays temporarily into numbers:

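import numpy as np


def _rstrip2(array):
    dt = array.dtype
    if dt.kind not in 'SU':
        # A bare `raise` outside an except block is itself an error.
        raise TypeError("expected a bytes ('S') or unicode ('U') array")
    dt_str = str(dt)
    # View each string as unsigned integers, one per character:
    # 1 byte for bytes, 4 bytes for unicode.
    dt_int = dt_str[2:] + dt.str[0] + 'u' + ('1' if dt_str[1] == 'S' else '4')
    b = array.view(dt_int)
    # Work in chunks so the temporary masks stay small.
    for j in range(0, b.shape[0], 10000):
        c = b[j:j+10000]
        mask = np.ones(c.shape[:-1], dtype=bool)
        # Walk from the last character down to the first one, so that
        # strings consisting only of spaces are stripped completely.
        for i in range(-1, -c.shape[-1] - 1, -1):
            mask &= c[..., i] == 32
            c[..., i][mask] = 0
            mask = c[..., i] == 0
    return array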

Here the loop in j tries to avoid creating very large temporary mask arrays (which would perhaps defeat the purpose of doing this in-place to start with). As written, it is obviously not safe for strange array shapes.
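
One way the same integer-view idea might be made robust to arbitrary shapes is sketched below. This is an assumption-laden variant, not code from the thread: the name `rstrip_inplace_view` and the `chunk` parameter are hypothetical, and it simply refuses non-contiguous input rather than trying to handle it.

```python
import numpy as np


def rstrip_inplace_view(array, chunk=10000):
    """Strip trailing spaces in place by viewing characters as integers.

    Hypothetical sketch: accepts any shape (including 0-d), but only
    C-contiguous 'S' or 'U' arrays.
    """
    dt = array.dtype
    if dt.kind not in "SU":
        raise TypeError("expected a bytes ('S') or unicode ('U') array")
    if not array.flags.c_contiguous:
        # Reshaping a non-contiguous array could silently copy; refuse instead.
        raise ValueError("array must be C-contiguous")
    if array.size == 0 or dt.itemsize == 0:
        return array
    # One integer per character: 1 byte for 'S', 4 bytes for 'U'.
    char_dt = np.dtype("u1") if dt.kind == "S" else np.dtype(dt.byteorder + "u4")
    nchars = dt.itemsize // char_dt.itemsize
    # Flatten to (number of strings, characters per string); both steps are
    # views on the original memory because the array is contiguous.
    chars = array.reshape(-1).view(char_dt).reshape(-1, nchars)
    for start in range(0, chars.shape[0], chunk):
        block = chars[start:start + chunk]
        # True while every character seen so far (from the right) is
        # either a space or null padding.
        trailing = np.ones(block.shape[0], dtype=bool)
        for i in range(nchars - 1, -1, -1):
            col = block[:, i]
            is_space = col == 32
            trailing &= is_space | (col == 0)
            col[trailing & is_space] = 0
            if not trailing.any():
                break  # every string in this chunk has hit real content
    return array


a = np.array([["abc  ", "d  "], ["     ", "a b "]])
rstrip_inplace_view(a)
print(a.tolist())  # [['abc', 'd'], ['', 'a b']]
```

Flattening to two dimensions keeps the chunking along a single axis regardless of the input shape, so the temporary masks never exceed `chunk` elements.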

a = np.array(['abc  ', 'd  ']*10000)
%timeit _rstrip(a)
# 100 loops, best of 3: 12.4 ms per loop
%timeit _rstrip2(a.copy())
# 1000 loops, best of 3: 325 µs per loop
a = np.array([b'abc  ', b'd  ']*10000)
%timeit _rstrip(a)
# 100 loops, best of 3: 4.85 ms per loop
%timeit _rstrip2(a)
# 1000 loops, best of 3: 251 µs per loop

mhvk commented, Dec 20, 2017

I meant complexity mostly for code developers, of having to look at a differently coded piece to see what’s happening (for me at least, this is not a negligible hurdle).
