Appetite for PR to accommodate `filters` with strings and `in` operator?

When using `ParquetFile()` or `read_parquet()`, there is a use case for `filters=` where the filter values are sets of strings, particularly with the `in` or equality operators. For example:

filters = [('planet','in','jedha'), ('planet','in','scarif')]

This isn’t currently accommodated. Instead, as seen in fastparquet.api.filter_val(), there is a lexicographic comparison involving max/min operators. (There is also no warning for this behavior, and this use of filters is not proscribed by the docs.)

Can I make a PR to modify filter_val() for this use case?
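
To make the request concrete, here is a minimal sketch of the row-group skipping behavior I have in mind, assuming each row group exposes min/max string statistics for the column. The name `can_skip_row_group` and its arguments are hypothetical and are not part of fastparquet's API.

```python
def can_skip_row_group(values, stat_min, stat_max):
    """Return True if none of the candidate values can occur in the row group.

    Hypothetical helper: `values` is the collection given to an `in`
    filter; `stat_min` and `stat_max` are the column's min/max statistics
    for one row group.
    """
    if isinstance(values, str):
        # Treat a bare string as a single candidate, not as characters.
        values = [values]
    # A value can only be present if it lies within [stat_min, stat_max].
    return not any(stat_min <= v <= stat_max for v in values)


# A row group whose 'planet' column spans 'alderaan' .. 'hoth'.
print(can_skip_row_group({"jedha", "scarif"}, "alderaan", "hoth"))  # True
print(can_skip_row_group({"endor", "scarif"}, "alderaan", "hoth"))  # False
```

Each candidate is checked against the range individually, so a row group is skipped only when every candidate falls outside its statistics.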

Issue Analytics

Created 5 years ago
Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction

martindurant commented, Aug 23, 2018

The best thing to do would be to look at similar tests and copy the style, e.g., https://github.com/dask/fastparquet/blob/master/fastparquet/test/test_api.py#L378 . The only magic here is tempdir (the function argument), which automatically creates a space to put temporary files, and cleans them up when the test is done.

You will want to fork this repo to your GitHub account (button in top-right) and set it as a remote URL (`git remote add`), before putting changes in a new branch locally and pushing to the new remote. You then create the PR from your copy of the repo on GitHub, and all the tests will run automatically on Travis CI.

0 reactions

martindurant commented, Aug 24, 2018

You may include fresh parquet data files, if most convenient, so long as they are small. You will see that there are a number of binary files in the repo already, under test-data/.

However, most tests generate their test data files on-the-fly in the given temporary directory. If possible, that would be preferable - it depends on whether fastparquet can conveniently produce the type of output you want to read.

(note: row_group_offsets is documented here in the docstring)

Before starting, you would do well to write a function that shows that the current implementation for "in" doesn’t do what you think it should. Perhaps it’s better to fix that than to introduce a new operator?
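
A first check along those lines needs no parquet file at all. The sketch below assumes, as the issue describes, that the "in" filter applies `min()` and `max()` to the filter value; `naive_bounds` is a hypothetical stand-in for that step, not fastparquet code.

```python
def naive_bounds(value):
    """Bounds that a min/max-based "in" filter would derive from `value`."""
    return min(value), max(value)


# A collection of strings gives bounds over whole strings...
print(naive_bounds(["jedha", "scarif"]))  # ('jedha', 'scarif')

# ...but a bare string is iterated character by character.
print(naive_bounds("jedha"))  # ('a', 'j')
```

Under the same assumption, even a proper collection only yields an outer range: a row group whose values all lie strictly between 'jedha' and 'scarif' would not be skipped, although it contains neither.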
