[BUG] Filtering with user-defined-functions returns different dataset as when using filtering with pythonic query language

πŸ›πŸ› Bug Report

βš—οΈ Current Behavior

Filtering a dataset on the same parameter with a user-defined function and with the pythonic query language (see Step 9: Dataset Filtering) doesn’t return the same dataset.

Input Code

Initialization

import hub
import numpy as np

# Initialize the empty dataset
ds = hub.empty('/dataset/path')

with ds:    
    # Classes
    ds.create_tensor('classes', htype='class_label', class_names = ['class_0', 'class_1', 'class_2'])
    ds.classes.info.update(notes = 'Different dummy classes.')
    
ds.summary()
Dataset(path='/dataset/path', tensors=['classes'])

 tensor      htype      shape    dtype  compression
 -------    -------    -------  -------  ------- 
 classes  class_label   (0,)    uint32    None   

Adding elements

@hub.compute
def create_dataset(class_num, sample_out):
    """ Add new element with a specific class"""
    
    sample_out.append({
        "classes": np.uint32(class_num),
    })
        
    return sample_out

with ds:
    # Add 30 elements with randomly generated class
    list_classes = list(np.random.randint(len(ds.classes.info.class_names), size=30))
    create_dataset().eval(list_classes, ds, num_workers = 2)
    
ds.summary()
Evaluating create_dataset: 100%|████████████████████| 30/30 [00:00<00:00, 1556.54it/s]
Dataset(path='/dataset/path', tensors=['classes'])

 tensor      htype      shape    dtype  compression
 -------    -------    -------  -------  ------- 
 classes  class_label  (30, 1)  uint32    None 

Filtering using pythonic query language

ds_view = ds.filter("classes == 'class_0'", scheduler = 'threaded', num_workers = 0)

# Print the class index for all the elements in the new dataset view.
print(ds_view.classes[::].numpy()[:,0])
100%|████████████████████| 30/30 [00:00<00:00, 15650.39it/s]
[0 0 0 0 1 0 2 1 1]

Filtering using User-defined-function

@hub.compute
def filter_classes(sample_in, class_list, class_names):
    text_class = class_names[sample_in.classes.numpy()[0]]
    
    return text_class in class_list

ds_view = ds.filter(filter_classes(['class_0'], ds.classes.info.class_names), scheduler = 'threaded', num_workers = 0)

# Print the class index for all the elements in the new dataset view.
print(ds_view.classes[::].numpy()[:,0])
100%|████████████████████| 30/30 [00:00<00:00, 18014.19it/s]
[0 0 0 0 0 0 0 0 0]

Expected behavior/code

Both the user-defined function and the pythonic query language should return the same dataset when filtering with the same parameters.
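
To spell out that expectation without depending on hub, here is a minimal plain-Python sketch. The function names `select_by_query` and `select_by_udf` and the `labels` list are hypothetical and only model the two filtering styles; they are not hub APIs and the labels are not data from this report.

```python
class_names = ['class_0', 'class_1', 'class_2']
labels = [0, 2, 1, 0, 1, 0]  # illustrative class indices, not from the report


def select_by_query(labels, class_names, wanted):
    """Model of a name-based query: keep positions whose label equals the index of `wanted`."""
    target = class_names.index(wanted)
    return [pos for pos, label in enumerate(labels) if label == target]


def select_by_udf(labels, predicate):
    """Model of a user-defined filter: keep positions where the predicate is true."""
    return [pos for pos, label in enumerate(labels) if predicate(label)]


by_query = select_by_query(labels, class_names, 'class_0')
by_udf = select_by_udf(labels, lambda label: class_names[label] in ['class_0'])

print(by_query)            # [0, 3, 5]
print(by_query == by_udf)  # True
```

Under this model both styles select the same positions, which is the behavior the report expects from `ds.filter`.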

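In our example, both views contain 9 elements, but the one returned by the pythonic query language is clearly incorrect: it includes elements with class indices 1 and 2 even though the filter asked for 'class_0'. The view returned by the user-defined function contains only class index 0, as expected.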
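
One way to make the discrepancy concrete is a small check in plain Python, independent of hub. The helper name `unexpected_labels` is hypothetical; the two lists are copied from the printed outputs of the two filter runs in this report.

```python
CLASS_NAMES = ['class_0', 'class_1', 'class_2']


def unexpected_labels(view_indices, wanted, class_names=CLASS_NAMES):
    """Return the class indices in a filtered view whose class name is not `wanted`."""
    return [i for i in view_indices if class_names[i] != wanted]


# Class indices printed for each dataset view in the report.
query_view = [0, 0, 0, 0, 1, 0, 2, 1, 1]  # pythonic query language
udf_view = [0, 0, 0, 0, 0, 0, 0, 0, 0]    # user-defined function

print(unexpected_labels(query_view, 'class_0'))  # [1, 2, 1, 1]
print(unexpected_labels(udf_view, 'class_0'))    # []
```

A correct filter should leave this list empty; four of the nine elements in the query-language view fail the check.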

βš™οΈ Environment

  • Python version(s): 3.8.10
  • OS: Ubuntu 20.04.3 LTS
  • IDE: VS-Code (1.66.2) + Jupyter Lab (3.3.0)
  • Packages: hub==2.3.4 - latest

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
mikayelh commented, Apr 28, 2022

@LucasVandroux I’ll check in with @istranic and he’ll update you shortly on this! Sorry for the holdup.

1 reaction
mikayelh commented, Apr 21, 2022

hey @LucasVandroux, sorry that you’ve run into this issue. I’ve brought it up with the team and they’re on it, hang tight! 😃

Tagging @istranic and @Diveafall for querying visibility.
