Unexpected behavior when attempting to filter a DF by a column in another DF; works with pandas or when os.environ["MODIN_CPUS"] = "1" is applied


System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
  • Modin version (modin.__version__): 0.9.1
  • Python version: 3.8.11 via conda 4.10.3
  • Code we can use to reproduce:

    # method 1
    df = df[~df["field_name"].isin(suppression_df["field_name"])]

    # OR method 2
    df = (
        df.merge(suppression_df, on=["field_name"], how="left", indicator=True)
        .query('_merge == "left_only"')
        .drop(columns="_merge")
    )

df is created by merging several feather files into a single DF. The dimensions of the result don’t fluctuate. suppression_df is created via pd.read_sql with sqlalchemy (version below). Its dimensions don’t fluctuate either.

Describe the problem

Goal: to filter an existing data frame by a column of the same name in another data frame. I attempted both methods stated above. The filtering works in that some values are filtered out. However, not all of the values are filtered; some remain. The number of records that aren’t filtered out changes on every run.

After a while I suspected that the parallelism in Modin might be causing things to go awry. I switched to pandas and wham, every record/row that should’ve been dropped was dropped/filtered. With pandas the number of records/rows that were dropped remained the same every time I ran the script.

I then switched back to Modin and used this statement (os.environ["MODIN_CPUS"] = "1") to limit the cores to 1 (to emulate pandas). And sure enough, the filtering works as expected (all records/rows that should be dropped/filtered were).

Any thoughts about this?
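For reference, the behavior the report expects can be sketched in plain pandas. This is only an illustration under stated assumptions: the sample frames and their values are hypothetical, and only the column name "field_name" and the two filtering methods come from the report. Both methods should keep the same rows, and no suppressed value should remain afterwards.

```python
import pandas as pd

# Hypothetical sample data; "field_name" mirrors the column name used in the report.
df = pd.DataFrame({"field_name": ["a", "b", "c", "d", "b"], "value": [1, 2, 3, 4, 5]})
suppression_df = pd.DataFrame({"field_name": ["b", "d", "d"]})

# Method 1: boolean mask built with isin
by_isin = df[~df["field_name"].isin(suppression_df["field_name"])]

# Method 2: left merge with an indicator column, keeping rows found only in df.
# De-duplicating the keys first keeps the intermediate merge result from growing.
keys = suppression_df[["field_name"]].drop_duplicates()
by_merge = (
    df.merge(keys, on=["field_name"], how="left", indicator=True)
    .query('_merge == "left_only"')
    .drop(columns="_merge")
)

# Number of rows that should have been dropped but are still present
leftover = by_isin["field_name"].isin(suppression_df["field_name"]).sum()
```

Running the same leftover check on the Modin result is one way to put a number on the problem described above: any value above zero means suppressed rows survived the filter.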

Source code / logs

IDK how to do this. I’m happy to add more info but will require some assistance.

Using ray, version 1.1.0. I saw this somewhere in the documentation and use it:

    ray.init(
        ignore_reinit_error=True,
        log_to_driver=False
    )

  • IPython 7.27.0
  • PyCharm community 2020.3
  • sqlalchemy 1.4.22

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 29 (17 by maintainers)

Top GitHub Comments

amyskov commented, Jan 11, 2022 (2 reactions)

Hi @SaintRod! You can create a test database using pandas or Modin to make a reproducer; see the example below:

import os

# set before importing Modin so the value is in place when the engine starts
os.environ["MODIN_CPUS"] = "12"

import pandas
import modin.pandas as pd
# modin util to check equality of frames
from modin.pandas.test.utils import df_equals

table_name = "test_table"
query = f"SELECT * FROM {table_name}"
conn = "sqlite:///test_data.db"

# get test dataset
df = pandas.read_csv("s3://modin-datasets/cloud/taxi/trips_xaa.csv")
# put test data to sqlite database
df.to_sql(table_name, conn, index=False)

df_pandas = pandas.read_sql(query, conn)
df_pd = pd.read_sql(query, conn)

df_equals(df_pandas, df_pd) # frames are equal

In this example, an SQLite database and a portion of the NYC taxi benchmark dataset were used, and it works fine, so it looks like the issue you faced may occur only with a certain database engine or with some specific data. Can you please provide some more details about your case? Maybe you can tell us which database you use, or modify the dataset from the example above to trigger your issue?
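A smaller reproducer aimed at the filtering step itself might look like the sketch below. It is a pandas-only baseline under stated assumptions: the table name "suppression", the generated ids, and the in-memory SQLite connection are all hypothetical stand-ins for the real data. To exercise Modin, one would presumably need a file-backed database and a connection string, as in the example above.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for the real source (hypothetical data)
conn = sqlite3.connect(":memory:")
pd.DataFrame({"field_name": [f"id{i}" for i in range(0, 100, 3)]}).to_sql(
    "suppression", conn, index=False
)

# Frame to be filtered, plus the suppression list read back through SQL
df = pd.DataFrame({"field_name": [f"id{i}" for i in range(100)]})
suppression_df = pd.read_sql("SELECT field_name FROM suppression", conn)

# Same filter as method 1 in the report
filtered = df[~df["field_name"].isin(suppression_df["field_name"])]

# Count suppressed values that survived the filter; anything above 0 is the bug
remaining = int(filtered["field_name"].isin(suppression_df["field_name"]).sum())
print(f"rows kept: {len(filtered)}, suppressed rows remaining: {remaining}")
conn.close()
```

Comparing the printed counts across several runs, and across pandas and Modin, would show whether the nondeterminism described in the report can be triggered with synthetic data.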

RehanSD commented, Nov 14, 2022 (1 reaction)

Hi @SaintRod! No worries at all - I think we should probably just close this ticket for now, since we no longer know if it’s reproducible or not!


