CI: `test-fuzzydata` fails when `sample` uses a small `frac`

`test-fuzzydata` has been failing occasionally, as can be seen here: https://github.com/modin-project/modin/actions/runs/3062497807/jobs/4943541816

Stack trace
/usr/share/miniconda3/envs/modin/lib/python3.8/site-packages/fuzzydata/core/generator.py:379: in generate_workflow
    raise e
/usr/share/miniconda3/envs/modin/lib/python3.8/site-packages/fuzzydata/core/generator.py:367: in generate_workflow
    wf.execute_current_operation(next_label)
/usr/share/miniconda3/envs/modin/lib/python3.8/site-packages/fuzzydata/core/workflow.py:197: in execute_current_operation
    raise e
/usr/share/miniconda3/envs/modin/lib/python3.8/site-packages/fuzzydata/core/workflow.py:168: in execute_current_operation
    new_artifact = self.current_operation.execute(new_label)
/usr/share/miniconda3/envs/modin/lib/python3.8/site-packages/fuzzydata/core/operation.py:172: in execute
    result = self.materialize(new_label)
/usr/share/miniconda3/envs/modin/lib/python3.8/site-packages/fuzzydata/clients/pandas.py:114: in materialize
    new_df = eval(self.code)
<string>:1: in <module>
    ???
modin/logging/logger_decorator.py:128: in run_and_log
    return obj(*args, **kwargs)
modin/_compat/pandas_api/latest/base.py:255: in sample
    return self._sample(
modin/logging/logger_decorator.py:128: in run_and_log
    return obj(*args, **kwargs)
modin/pandas/base.py:2451: in _sample
    return self._default_to_pandas(
modin/logging/logger_decorator.py:128: in run_and_log
    return obj(*args, **kwargs)
modin/pandas/base.py:431: in _default_to_pandas
    result = getattr(self._pandas_class, op)(pandas_obj, *args, **kwargs)
/usr/share/miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/generic.py:5438: in sample
    size = sample.process_sampling_size(n, frac, replace)

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

suhailrehman commented, Sep 15, 2022 (1 reaction)

@pyrito These are precisely the kinds of errors I wanted this project to surface! Thanks for digging in!

I haven’t been monitoring this Modin integration closely, but I have some free time this week, and I think I can make a few changes to improve the testing and debugging experience:

  1. The same workflow should run on both Dask and Ray (right now, two separate random workflows are generated, one running on Ray and one on Dask).
  2. Artifacts are not being written out on failure, which makes it a significant pain to rerun and verify whether a bug has been fixed.
  3. fuzzydata should also run the workflow against pandas to measure performance regressions, perhaps only when the scale factor is large enough to warrant it. The primary focus right now is correctness, not performance.
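
As an illustration of the first point, one way to get the same random workflow on every backend is to derive it from an explicit seed. The sketch below is not fuzzydata's API: `plan_operations` and the operation names are hypothetical, and it only shows that a seeded generator yields an identical plan each time it is called.

```python
import random
from typing import List

# Hypothetical operation names, used only for illustration.
OPERATIONS = ["sample", "groupby", "merge", "pivot", "fill"]


def plan_operations(seed: int, length: int) -> List[str]:
    """Return a reproducible list of operation names for a given seed."""
    rng = random.Random(seed)  # private generator, independent of global state
    return [rng.choice(OPERATIONS) for _ in range(length)]


# The same seed gives the same plan, so every backend can replay it.
plan_for_ray = plan_operations(seed=42, length=5)
plan_for_dask = plan_operations(seed=42, length=5)
```

Logging the seed alongside a failure would also make a failing run easier to replay, which relates to the second point.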
pyrito commented, Sep 15, 2022 (1 reaction)

@mvashishtha @suhailrehman I spent some time digging into the fuzzydata code base, and I don’t think the issue is there. Rather, I think we are hitting a very specific edge case in how Modin handles `sample`. As you can see here: https://github.com/modin-project/modin/blob/eca9a936846faa31b116a1e58b1114d90cfa44d8/modin/pandas/base.py#L2439, it’s possible for `n` to end up as 0, in which case Modin executes the wrong code path, one that has both `n` and `frac` set.
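
The edge case can be sketched in plain Python. This is a simplified reduction, not Modin's actual code: it assumes `n` is derived by rounding `frac * axis_length`, and `validate_like_pandas` is a hypothetical stand-in for the pandas rule that `n` and `frac` may not both be passed. The "fixed" variant is only one possible way to avoid the conflict, not necessarily the fix that was merged.

```python
from typing import Optional


def validate_like_pandas(n: Optional[int], frac: Optional[float]) -> None:
    """Stand-in for the pandas check: `n` and `frac` are mutually exclusive."""
    if n is not None and frac is not None:
        raise ValueError("pass either `n` or `frac`, not both")


def buggy_sample_size(axis_length: int, frac: float) -> int:
    """Derive n from frac; when n rounds to 0, fall back with BOTH set."""
    n = int(round(frac * axis_length))
    if n == 0:
        # The fallback forwards the derived n together with the original frac.
        validate_like_pandas(n=n, frac=frac)
    return n


def fixed_sample_size(axis_length: int, frac: float) -> int:
    """Same derivation, but the fallback forwards only frac."""
    n = int(round(frac * axis_length))
    if n == 0:
        validate_like_pandas(n=None, frac=frac)
    return n
```

With 10 rows and `frac=0.01`, the derived `n` rounds to 0, so `buggy_sample_size(10, 0.01)` raises `ValueError`, while `fixed_sample_size(10, 0.01)` returns 0. With 1000 rows the derived `n` is 10 and the fallback is never reached, which is consistent with the failure appearing only occasionally.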
