[FEATURE-REQUEST] Opposite String Startswith Search in VAEX Dataframe

Description

Say we have a string to match against the entries in a database.

We can accomplish that by a simple select and evaluate in VAEX:

df_vaex.select(df_vaex["name"].str.startswith(search_string))

However, this checks whether each database entry starts with search_string, not whether search_string starts with a database entry.

Can this be performed using Vaex?

Is your feature request related to a problem? Please describe.

search_string1 = "ASTHA MAT" 
search_string2 = "ASTHA MATERIALS INDIA" 

df_vaex.select(df_vaex["name"].str.startswith(search_string1))
df_vaex.evaluate(df_vaex["name"], selection=True)
# ASTHA MATERIALS

df_vaex.select(search_string2.startswith(df_vaex["name"]))
# TypeError: startswith first arg must be str or a tuple of str, not Expression

Additional context

It would be great to have a vectorized reverse search for quick lookups, similar to pandas.Series.isin!
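One way to get isin-style behaviour for this request is to note that a name is a prefix of the search string exactly when it equals one of the search string's own prefixes. The following is a minimal pure-Python sketch of that idea, not something proposed in the thread: the helper name `prefixes_of` and the sample `names` list are hypothetical, and passing the resulting set to a vectorized membership test on the column is an assumption.

```python
def prefixes_of(search_string):
    """Return every non-empty prefix of search_string, including itself."""
    return {search_string[:i] for i in range(1, len(search_string) + 1)}

search_string2 = "ASTHA MATERIALS INDIA"
names = ["ASTHA MATERIALS", "LOREM IPSUM", "ASTHA"]  # hypothetical column values

candidates = prefixes_of(search_string2)
# A name matches when it is one of the prefixes of the search string.
mask = [name in candidates for name in names]
```

Note that empty names are excluded here, whereas `str.startswith` treats the empty string as a prefix of every string.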

Issue Analytics

Created a year ago
Comments: 18 (7 by maintainers)

Top GitHub Comments

2 reactions

Ben-Epstein commented, Jul 13, 2022

@khanfarhan10 I would think regex is your best bet. If you want something a bit more specific, you could use a registered function:

import vaex
import pyarrow as pa

dict_data = dict(name=["ASTHA MATERIALS" , "LOREM IPSUM" ], locationID=[5454,6767]) # with other cols as well

df = vaex.from_dict(dict_data)
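search_string = "ASTHA MATERIALS INDIA"

# Register a custom function so it can be called through df.func
@vaex.register_function()
def str_contains_col(col_vals, str_search):
    # True where str_search starts with the column value (the reverse of str.startswith)
    return pa.array([str_search.startswith(v) for v in col_vals.to_pylist()])

# Build a boolean expression with one value per row of "name"
df.func.str_contains_col(df["name"], search_string)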
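The regex route mentioned at the top of this comment could be sketched roughly as follows: build a pattern from the search string that fully matches any of its non-empty prefixes, then test each name against it. This is only an illustration in plain Python using the standard `re` module; the helper name `prefix_regex` and the sample `names` list are hypothetical, and whether the generated pattern can be passed directly to a Vaex string-matching method is an assumption.

```python
import re

def prefix_regex(search_string):
    """Build a pattern that fully matches any non-empty prefix of search_string."""
    if not search_string:
        raise ValueError("search_string must be non-empty")
    pattern = ""
    # Nest optional groups from the last character inward, e.g. "ABC" -> A(?:B(?:C)?)?
    for ch in reversed(search_string[1:]):
        pattern = "(?:" + re.escape(ch) + pattern + ")?"
    return re.escape(search_string[0]) + pattern

search_string = "ASTHA MATERIALS INDIA"
compiled = re.compile(prefix_regex(search_string))

names = ["ASTHA MATERIALS", "LOREM IPSUM", "ASTHA MATERIALS INDIA LTD"]  # hypothetical
# fullmatch ensures the whole name is a prefix, not just part of it.
matches = [bool(compiled.fullmatch(name)) for name in names]
```

The pattern grows with the length of the search string, so for long search strings the registered-function approach may be simpler to maintain.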

2 reactions

maartenbreddels commented, Jun 24, 2022

Looks that way… although our apply should be parallel (multiprocessing). It's a good point, though; I'll see if we can support the reverse without too many changes.