Test failure test_frame_write_read_verify

See original GitHub issue

Test failure with 0.4.1 (and 0.4.0) cloned from this repo, with Python 3.8.

=================================== FAILURES ===================================
_ test_frame_write_read_verify[input_symbols8-10-hive-2-partitions8-filters8] __

tempdir = '/build/tmpighy8d7p', input_symbols = ['NOW', 'SPY', 'VIX']
input_days = 10, file_scheme = 'hive', input_columns = 2
partitions = ['symbol', 'dtTrade']
filters = [('dtTrade', '==', '2005-01-02T00:00:00.000000000')]

    @pytest.mark.parametrize('input_symbols,input_days,file_scheme,input_columns,'
                             'partitions,filters',
                             [
                                 (['NOW', 'SPY', 'VIX'], 2 * 252, 'hive', 2,
                                  ['symbol', 'year'], [('symbol', '==', 'SPY')]),
                                 (['now', 'SPY', 'VIX'], 2 * 252, 'hive', 2,
                                  ['symbol', 'year'], [('symbol', '==', 'SPY')]),
                                 (['TODAY', 'SPY', 'VIX'], 2 * 252, 'hive', 2,
                                  ['symbol', 'year'], [('symbol', '==', 'SPY')]),
                                 (['VIX*', 'SPY', 'VIX'], 2 * 252, 'hive', 2,
                                  ['symbol', 'year'], [('symbol', '==', 'SPY')]),
                                 (['QQQ*', 'SPY', 'VIX'], 2 * 252, 'hive', 2,
                                  ['symbol', 'year'], [('symbol', '==', 'SPY')]),
                                 (['QQQ!', 'SPY', 'VIX'], 2 * 252, 'hive', 2,
                                  ['symbol', 'year'], [('symbol', '==', 'SPY')]),
                                 (['Q%QQ', 'SPY', 'VIX'], 2 * 252, 'hive', 2,
                                  ['symbol', 'year'], [('symbol', '==', 'SPY')]),
                                 (['NOW', 'SPY', 'VIX'], 10, 'hive', 2,
                                  ['symbol', 'dtTrade'], [('symbol', '==', 'SPY')]),
                                 (['NOW', 'SPY', 'VIX'], 10, 'hive', 2,
                                  ['symbol', 'dtTrade'],
                                  [('dtTrade', '==',
                                    '2005-01-02T00:00:00.000000000')]),
                                 (['NOW', 'SPY', 'VIX'], 10, 'hive', 2,
                                  ['symbol', 'dtTrade'],
                                  [('dtTrade', '==',
                                    Timestamp('2005-01-01 00:00:00'))]),
                             ]
                             )
    def test_frame_write_read_verify(tempdir, input_symbols, input_days,
                                     file_scheme,
                                     input_columns, partitions, filters):
        if os.name == 'nt':
            pytest.xfail("Partitioning folder names contain special characters which are not supported on Windows")
    
        # Generate Temp Director for parquet Files
        fdir = str(tempdir)
        fname = os.path.join(fdir, 'test')
    
        # Generate Test Input Frame
        input_df = frame_symbol_dtTrade_type_strike(days=input_days,
                                                    symbols=input_symbols,
                                                    numbercolumns=input_columns)
        input_df.reset_index(inplace=True)
        write(fname, input_df, partition_on=partitions, file_scheme=file_scheme,
              compression='SNAPPY')
    
        # Read Back Whole Parquet Structure
        output_df = ParquetFile(fname).to_pandas()
        for col in output_df.columns:
            assert col in input_df.columns.values
        assert len(input_df) == len(output_df)
    
        # Read with filters
        filtered_output_df = ParquetFile(fname).to_pandas(filters=filters)
    
        # Filter Input Frame to Match What Should Be Expected from parquet read
        # Handle either string or non-string inputs / works for timestamps
        filterStrings = []
        for name, operator, value in filters:
            if isinstance(value, str):
                value = "'{}'".format(value)
            else:
                value = value.__repr__()
            filterStrings.append("{} {} {}".format(name, operator, value))
        filters_expression = " and ".join(filterStrings)
        filtered_input_df = input_df.query(filters_expression)
    
        # Check to Ensure Columns Match
        for col in filtered_output_df.columns:
            assert col in filtered_input_df.columns.values
        # Check to Ensure Number of Rows Match
>       assert len(filtered_input_df) == len(filtered_output_df)
E       assert 3 == 0
E         +3
E         -0

fastparquet/test/test_partition_filters_specialstrings.py:109: AssertionError
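
The failing parametrisation boils down to writing a hive-partitioned dataset with a datetime partition column and then filtering it back by the raw numpy-style datetime string. A minimal sketch of that scenario (the directory, frame contents and sizes below are illustrative, not taken from the test):

import os
import tempfile

import pandas as pd
from fastparquet import write, ParquetFile

# small frame with a datetime column used as a hive partition key
df = pd.DataFrame({
    'symbol': ['NOW', 'SPY', 'VIX'],
    'dtTrade': pd.to_datetime(['2005-01-02'] * 3),
    'close': [1.0, 2.0, 3.0],
})

fname = os.path.join(tempfile.mkdtemp(), 'test')
write(fname, df, partition_on=['symbol', 'dtTrade'], file_scheme='hive')

# as discussed in the comments below, under pandas 1.1 the dtTrade partition
# directories end up named after Timestamp's default string form, so this
# string filter no longer matches any partition value
out = ParquetFile(fname).to_pandas(
    filters=[('dtTrade', '==', '2005-01-02T00:00:00.000000000')])
print(len(out))  # 0 on the affected setup, 3 before the behaviour change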

Environment:

  • Dask version: -
  • Python version: 3.8.5
  • Operating System: NixOS
  • Install method (conda, pip, source):

Build/test/run-time dependencies:

$ nix show-derivation -f . python3.pkgs.fastparquet | jq  '.[].inputDrvs | keys'
[
  "/nix/store/0fl8gz98vq7k0xpphn0ayx36illf7v8c-python-remove-tests-dir-hook.drv",
  "/nix/store/11vhhyvc6cz433snizyqdkpg7k2q5zkf-python3.8-pytest-runner-5.2.drv",
  "/nix/store/1a661nb0dli97gw6qy50msp85ll680rz-python-imports-check-hook.sh.drv",
  "/nix/store/22c15w3md8d2jdi7awb2k50392by8x6g-python3.8-thrift-0.13.0.drv",
  "/nix/store/29r49aa4sz6hypb3gv5sdw330vj2j2ii-python3.8-numpy-1.19.1.drv",
  "/nix/store/377gwr2f2il0mi2kmq0yah2knhsyhsd5-hook.drv",
  "/nix/store/3h7k0zvr8psgmz4nyh17z1isjsj7px72-pip-install-hook.drv",
  "/nix/store/3vgc68qbg9c5qhb18xc41ihaqw0bng6l-python3.8-setuptools-47.3.1.drv",
  "/nix/store/4qry96ap0kpkjwjlsyc8p3m3hh6pg5pv-bash-4.4-p23.drv",
  "/nix/store/5y6w15gqfhiiw3v79ybqsai55c48k88p-python3.8-zstd-1.4.5.1.drv",
  "/nix/store/7r9z46n4rccnzdr3l3nxz1qvnsc6gcbz-setuptools-check-hook.drv",
  "/nix/store/7ryff7q11maypkrqg0k4hpj57m7xb5sw-python3.8-pandas-1.1.1.drv",
  "/nix/store/80pzh07z7qxq1j6v4bnj1qmrv9arwjmj-python3.8-pytest-5.4.3.drv",
  "/nix/store/877v1y795mz5qa2mji8mrrm6an7ryif8-python3.8-numba-0.51.1.drv",
  "/nix/store/9jp75f9q5spp2wwyml63yf7lkciqz4cr-source.drv",
  "/nix/store/caad1plf2ddqrjrmhvmraaksdcmhcn0q-python-catch-conflicts-hook.drv",
  "/nix/store/czk62c3arggf1w17nmxcgnxjslx9qxz6-python-remove-bin-bytecode-hook.drv",
  "/nix/store/myrlr2xv6zwmwm634frd01rirjxk1a40-python3.8-python-lz4-2.1.10.drv",
  "/nix/store/n0w17xq75lr9vx6qiw28097ymrifvkl0-python-recompile-bytecode-hook.drv",
  "/nix/store/nizihiiy8gcwn61sfd538vq0bf3ll5ll-stdenv-linux.drv",
  "/nix/store/q1q5dsc3pcx10clb38gyrbrgivl47kl8-python3-3.8.5.drv",
  "/nix/store/rhl55hlw72qc7a8qz82xp28xs1kq69qm-hook.drv",
  "/nix/store/sjg4vq1gjzipd76zzijxqq04bzlz2iqp-python3.8-python-snappy-0.5.4.drv",
  "/nix/store/vfrswqlwnpz9shla74fv6irdncngj8h5-python-namespaces-hook.sh.drv",
  "/nix/store/vk9rkwnkmgn9knnwbvxwjbzrxi45s965-setuptools-setup-hook.drv"
]

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 13 (7 by maintainers)

Top GitHub Comments

2 reactions
veprbl commented, Sep 27, 2020

It seems like the difference is occurring in the generation of the file path: https://github.com/dask/fastparquet/blob/a8cb8d1a28eb2db4ada233052cbc01bf815c2551/fastparquet/writer.py#L952-L971

There is a difference in the behaviour of groupby for a multi-index, as can be seen in the following example:

import numpy as np
import pandas as pd

# the type of the datetime key in .groupby(...).indices differs between pandas versions
print(pd.DataFrame([(np.datetime64("2020-01-01"), 12345)]).groupby([0, 1]).indices)

In a previous pandas version it used to preserve the type:

# nix-shell -p python3Packages.pandas -p python3Packages.numpy -I nixpkgs=https://github.com/NixOS/nixpkgs/archive/69cb94ebb3193fc5077ee99ab2b50353151466ae.tar.gz --run 'python3 -c "import numpy as np; import pandas as pd; print(pd.__version__); print(pd.DataFrame([(np.datetime64(\"2020-01-01\"), 12345)]).groupby([0, 1]).indices)"'
1.0.5
{(numpy.datetime64('2020-01-01T00:00:00.000000000'), 12345): array([0])}

but it now performs a conversion to Timestamp:

# nix-shell -p python3Packages.pandas -p python3Packages.numpy -I nixpkgs=https://github.com/NixOS/nixpkgs/archive/2dafde493f153dba0eb4b34cd49763ee78eda3d9.tar.gz --run 'python3 -c "import numpy as np; import pandas as pd; print(pd.__version__); print(pd.DataFrame([(np.datetime64(\"2020-01-01\"), 12345)]).groupby([0, 1]).indices)"'
1.1.0
{(Timestamp('2020-01-01 00:00:00'), 12345): array([0])}
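
A minimal illustration of the knock-on effect, assuming the hive directory name is built from the string form of the partition value (which is what the writer code linked above appears to do):

import numpy as np
import pandas as pd

val = np.datetime64("2005-01-02T00:00:00.000000000")

# what would end up in the partition directory name, before and after the change
print("dtTrade=%s" % val)                # dtTrade=2005-01-02T00:00:00.000000000
print("dtTrade=%s" % pd.Timestamp(val))  # dtTrade=2005-01-02 00:00:00

# the test's string filter '2005-01-02T00:00:00.000000000' therefore no longer
# matches the value recovered from the directory name, and zero rows come back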

> Hm, I have pandas 1.1.0, and it still passes for me locally 😐

@martindurant That might be because you’ve changed the compared value as a part of a8cb8d1a28eb2db4ada233052cbc01bf815c2551. That should have broken the test on older pandas versions such as 1.0.5.

0 reactions
martindurant commented, Sep 29, 2020

> The names of partitioning directories in the “hive” scheme were changed because the dates were rendered to string with a default format of the type

Correct, we think this is what’s going on

> it appears that the behaviour in 1.1.0 is not a bug

Well, it’s a change in behaviour, hence the problem for us. Perhaps wrapping the expected value in Timestamp solves this for all cases.
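
One reading of that suggestion, applied to the test's filter value (a sketch only, not the actual patch):

import pandas as pd

# before: the raw numpy-style string, which no longer matches the
# Timestamp-rendered directory name under pandas 1.1
filters = [('dtTrade', '==', '2005-01-02T00:00:00.000000000')]

# after: hand pandas the value, so the comparison is made against whatever
# form pandas itself renders into the hive directory name
filters = [('dtTrade', '==', pd.Timestamp('2005-01-02T00:00:00.000000000'))]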
