[BUG] Filtering with a user-defined function returns a different dataset than filtering with the pythonic query language
Bug Report
Current Behavior
Filtering a dataset on the same parameter with a user-defined function and with the pythonic query language (see Step 9: Dataset Filtering) doesn't return the same dataset.
Input Code
Initialization
import hub
import numpy as np

# Initialize the empty dataset
ds = hub.empty('/dataset/path')

with ds:
    # Classes
    ds.create_tensor('classes', htype='class_label', class_names=['class_0', 'class_1', 'class_2'])
    ds.classes.info.update(notes='Different dummy classes.')

ds.summary()
Dataset(path='/dataset/path', tensors=['classes'])
 tensor      htype       shape     dtype   compression
 -------    -------     -------   -------  -----------
 classes  class_label    (0,)     uint32      None
Adding elements
@hub.compute
def create_dataset(class_num, sample_out):
    """Add a new element with a specific class."""
    sample_out.append({
        "classes": np.uint32(class_num),
    })
    return sample_out

with ds:
    # Add 30 elements with randomly generated classes
    list_classes = list(np.random.randint(len(ds.classes.info.class_names), size=30))
    create_dataset().eval(list_classes, ds, num_workers=2)

ds.summary()
Evaluating create_dataset: 100%|██████████| 30/30 [00:00<00:00, 1556.54it/s]
Dataset(path='/dataset/path', tensors=['classes'])
 tensor      htype       shape     dtype   compression
 -------    -------     -------   -------  -----------
 classes  class_label   (30, 1)   uint32      None
Filtering using pythonic query language
ds_view = ds.filter("classes == 'class_0'", scheduler = 'threaded', num_workers = 0)
# Print the class index for all the elements in the new dataset view.
print(ds_view.classes[::].numpy()[:,0])
100%|██████████| 30/30 [00:00<00:00, 15650.39it/s]
[0 0 0 0 1 0 2 1 1]
Filtering using User-defined-function
@hub.compute
def filter_classes(sample_in, class_list, class_names):
    # Map the sample's class index to its class name, then test membership
    text_class = class_names[sample_in.classes.numpy()[0]]
    return text_class in class_list

ds_view = ds.filter(filter_classes(['class_0'], ds.classes.info.class_names), scheduler='threaded', num_workers=0)
# Print the class index for all the elements in the new dataset view.
print(ds_view.classes[::].numpy()[:,0])
100%|██████████| 30/30 [00:00<00:00, 18014.19it/s]
[0 0 0 0 0 0 0 0 0]
Expected behavior/code
Filtering with a user-defined function and filtering with the pythonic query language should return the same dataset when given the same parameters.
In our example, the view returned by the pythonic query language is incorrect: it contains elements with class indices 1 and 2 even though the filter asks for 'class_0'. The view returned by the user-defined function is correct, since it contains only class index 0.
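The expectation above can be checked independently of hub. The following is a minimal sketch in plain Python, not part of the original report: the helper names `expected_indices` and `view_is_consistent` are hypothetical, and the label lists are copied from the two printed outputs in this issue.

```python
from typing import List, Sequence


def expected_indices(labels: Sequence[int], class_names: Sequence[str], wanted: str) -> List[int]:
    """Return the positions whose class index maps to the wanted class name."""
    return [i for i, label in enumerate(labels) if class_names[int(label)] == wanted]


def view_is_consistent(view_labels: Sequence[int], class_names: Sequence[str], wanted: str) -> bool:
    """Return True if every label in a filtered view maps to the wanted class name."""
    return all(class_names[int(label)] == wanted for label in view_labels)


class_names = ["class_0", "class_1", "class_2"]

# Class indices printed for each view in this report.
query_view_labels = [0, 0, 0, 0, 1, 0, 2, 1, 1]  # pythonic query language
udf_view_labels = [0, 0, 0, 0, 0, 0, 0, 0, 0]    # user-defined function

print(view_is_consistent(query_view_labels, class_names, "class_0"))  # False
print(view_is_consistent(udf_view_labels, class_names, "class_0"))    # True
```

Applying `expected_indices` to the original `list_classes` would give the positions that a correct filter should select, which could then be compared against both views.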
Environment
- Python version(s): 3.8.10
- OS: Ubuntu 20.04.3 LTS
- IDE: VS-Code (1.66.2) + Jupyter Lab (3.3.0)
- Packages: hub==2.3.4 - latest
Issue Analytics
- State:
- Created a year ago
- Comments:6 (2 by maintainers)
@LucasVandroux I'll check in with @istranic and he'll update you shortly on this! Sorry for the holdup.
hey @LucasVandroux, sorry that you've run into this issue. I've brought it up with the team and they're on it, hang tight!
Tagging @istranic and @Diveafall for querying visibility.