Test failure test_frame_write_read_verify

See original GitHub issue

Test failure with 0.4.1 (and 0.4.0) cloned from this repo, with Python 3.8.

=================================== FAILURES ===================================
_ test_frame_write_read_verify[input_symbols8-10-hive-2-partitions8-filters8] __

tempdir = '/build/tmpighy8d7p', input_symbols = ['NOW', 'SPY', 'VIX']
input_days = 10, file_scheme = 'hive', input_columns = 2
partitions = ['symbol', 'dtTrade']
filters = [('dtTrade', '==', '2005-01-02T00:00:00.000000000')]

    @pytest.mark.parametrize('input_symbols,input_days,file_scheme,input_columns,'
                             'partitions,filters',
                             [
                                 (['NOW', 'SPY', 'VIX'], 2 * 252, 'hive', 2,
                                  ['symbol', 'year'], [('symbol', '==', 'SPY')]),
                                 (['now', 'SPY', 'VIX'], 2 * 252, 'hive', 2,
                                  ['symbol', 'year'], [('symbol', '==', 'SPY')]),
                                 (['TODAY', 'SPY', 'VIX'], 2 * 252, 'hive', 2,
                                  ['symbol', 'year'], [('symbol', '==', 'SPY')]),
                                 (['VIX*', 'SPY', 'VIX'], 2 * 252, 'hive', 2,
                                  ['symbol', 'year'], [('symbol', '==', 'SPY')]),
                                 (['QQQ*', 'SPY', 'VIX'], 2 * 252, 'hive', 2,
                                  ['symbol', 'year'], [('symbol', '==', 'SPY')]),
                                 (['QQQ!', 'SPY', 'VIX'], 2 * 252, 'hive', 2,
                                  ['symbol', 'year'], [('symbol', '==', 'SPY')]),
                                 (['Q%QQ', 'SPY', 'VIX'], 2 * 252, 'hive', 2,
                                  ['symbol', 'year'], [('symbol', '==', 'SPY')]),
                                 (['NOW', 'SPY', 'VIX'], 10, 'hive', 2,
                                  ['symbol', 'dtTrade'], [('symbol', '==', 'SPY')]),
                                 (['NOW', 'SPY', 'VIX'], 10, 'hive', 2,
                                  ['symbol', 'dtTrade'],
                                  [('dtTrade', '==',
                                    '2005-01-02T00:00:00.000000000')]),
                                 (['NOW', 'SPY', 'VIX'], 10, 'hive', 2,
                                  ['symbol', 'dtTrade'],
                                  [('dtTrade', '==',
                                    Timestamp('2005-01-01 00:00:00'))]),
                             ]
                             )
    def test_frame_write_read_verify(tempdir, input_symbols, input_days,
                                     file_scheme,
                                     input_columns, partitions, filters):
        if os.name == 'nt':
            pytest.xfail("Partitioning folder names contain special characters which are not supported on Windows")
    
        # Generate Temp Director for parquet Files
        fdir = str(tempdir)
        fname = os.path.join(fdir, 'test')
    
        # Generate Test Input Frame
        input_df = frame_symbol_dtTrade_type_strike(days=input_days,
                                                    symbols=input_symbols,
                                                    numbercolumns=input_columns)
        input_df.reset_index(inplace=True)
        write(fname, input_df, partition_on=partitions, file_scheme=file_scheme,
              compression='SNAPPY')
    
        # Read Back Whole Parquet Structure
        output_df = ParquetFile(fname).to_pandas()
        for col in output_df.columns:
            assert col in input_df.columns.values
        assert len(input_df) == len(output_df)
    
        # Read with filters
        filtered_output_df = ParquetFile(fname).to_pandas(filters=filters)
    
        # Filter Input Frame to Match What Should Be Expected from parquet read
        # Handle either string or non-string inputs / works for timestamps
        filterStrings = []
        for name, operator, value in filters:
            if isinstance(value, str):
                value = "'{}'".format(value)
            else:
                value = value.__repr__()
            filterStrings.append("{} {} {}".format(name, operator, value))
        filters_expression = " and ".join(filterStrings)
        filtered_input_df = input_df.query(filters_expression)
    
        # Check to Ensure Columns Match
        for col in filtered_output_df.columns:
            assert col in filtered_input_df.columns.values
        # Check to Ensure Number of Rows Match
>       assert len(filtered_input_df) == len(filtered_output_df)
E       assert 3 == 0
E         +3
E         -0

fastparquet/test/test_partition_filters_specialstrings.py:109: AssertionError
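
The failing parametrisation boils down to writing a hive-partitioned dataset with a datetime partition column and then filtering it back by the raw numpy-style datetime string. A minimal sketch of that scenario (the directory, frame contents and sizes below are illustrative, not taken from the test):

import os
import tempfile

import pandas as pd
from fastparquet import write, ParquetFile

# small frame with a datetime column used as a hive partition key
df = pd.DataFrame({
    'symbol': ['NOW', 'SPY', 'VIX'],
    'dtTrade': pd.to_datetime(['2005-01-02'] * 3),
    'close': [1.0, 2.0, 3.0],
})

fname = os.path.join(tempfile.mkdtemp(), 'test')
write(fname, df, partition_on=['symbol', 'dtTrade'], file_scheme='hive')

# as discussed in the comments below, under pandas 1.1 the dtTrade partition
# directories end up named after Timestamp's default string form, so this
# string filter no longer matches any partition value
out = ParquetFile(fname).to_pandas(
    filters=[('dtTrade', '==', '2005-01-02T00:00:00.000000000')])
print(len(out))  # 0 on the affected setup, 3 before the behaviour change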

Environment:

  • Dask version: -
  • Python version: 3.8.5
  • Operating System: NixOS
  • Install method (conda, pip, source):

Build/test/run-time dependencies:

$ nix show-derivation -f . python3.pkgs.fastparquet | jq  '.[].inputDrvs | keys'
[
  "/nix/store/0fl8gz98vq7k0xpphn0ayx36illf7v8c-python-remove-tests-dir-hook.drv",
  "/nix/store/11vhhyvc6cz433snizyqdkpg7k2q5zkf-python3.8-pytest-runner-5.2.drv",
  "/nix/store/1a661nb0dli97gw6qy50msp85ll680rz-python-imports-check-hook.sh.drv",
  "/nix/store/22c15w3md8d2jdi7awb2k50392by8x6g-python3.8-thrift-0.13.0.drv",
  "/nix/store/29r49aa4sz6hypb3gv5sdw330vj2j2ii-python3.8-numpy-1.19.1.drv",
  "/nix/store/377gwr2f2il0mi2kmq0yah2knhsyhsd5-hook.drv",
  "/nix/store/3h7k0zvr8psgmz4nyh17z1isjsj7px72-pip-install-hook.drv",
  "/nix/store/3vgc68qbg9c5qhb18xc41ihaqw0bng6l-python3.8-setuptools-47.3.1.drv",
  "/nix/store/4qry96ap0kpkjwjlsyc8p3m3hh6pg5pv-bash-4.4-p23.drv",
  "/nix/store/5y6w15gqfhiiw3v79ybqsai55c48k88p-python3.8-zstd-1.4.5.1.drv",
  "/nix/store/7r9z46n4rccnzdr3l3nxz1qvnsc6gcbz-setuptools-check-hook.drv",
  "/nix/store/7ryff7q11maypkrqg0k4hpj57m7xb5sw-python3.8-pandas-1.1.1.drv",
  "/nix/store/80pzh07z7qxq1j6v4bnj1qmrv9arwjmj-python3.8-pytest-5.4.3.drv",
  "/nix/store/877v1y795mz5qa2mji8mrrm6an7ryif8-python3.8-numba-0.51.1.drv",
  "/nix/store/9jp75f9q5spp2wwyml63yf7lkciqz4cr-source.drv",
  "/nix/store/caad1plf2ddqrjrmhvmraaksdcmhcn0q-python-catch-conflicts-hook.drv",
  "/nix/store/czk62c3arggf1w17nmxcgnxjslx9qxz6-python-remove-bin-bytecode-hook.drv",
  "/nix/store/myrlr2xv6zwmwm634frd01rirjxk1a40-python3.8-python-lz4-2.1.10.drv",
  "/nix/store/n0w17xq75lr9vx6qiw28097ymrifvkl0-python-recompile-bytecode-hook.drv",
  "/nix/store/nizihiiy8gcwn61sfd538vq0bf3ll5ll-stdenv-linux.drv",
  "/nix/store/q1q5dsc3pcx10clb38gyrbrgivl47kl8-python3-3.8.5.drv",
  "/nix/store/rhl55hlw72qc7a8qz82xp28xs1kq69qm-hook.drv",
  "/nix/store/sjg4vq1gjzipd76zzijxqq04bzlz2iqp-python3.8-python-snappy-0.5.4.drv",
  "/nix/store/vfrswqlwnpz9shla74fv6irdncngj8h5-python-namespaces-hook.sh.drv",
  "/nix/store/vk9rkwnkmgn9knnwbvxwjbzrxi45s965-setuptools-setup-hook.drv"
]

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 13 (7 by maintainers)

Top GitHub Comments

2 reactions
veprbl commented, Sep 27, 2020

It seems like the difference is occurring in the generation of the file path: https://github.com/dask/fastparquet/blob/a8cb8d1a28eb2db4ada233052cbc01bf815c2551/fastparquet/writer.py#L952-L971

There is a difference in the behaviour of groupby for a multi-index, as can be seen in the following example:

import numpy as np
import pandas as pd

# the type of the datetime key in .groupby(...).indices differs between pandas versions
print(pd.DataFrame([(np.datetime64("2020-01-01"), 12345)]).groupby([0, 1]).indices)

In a previous pandas version it used to preserve the type:

# nix-shell -p python3Packages.pandas -p python3Packages.numpy -I nixpkgs=https://github.com/NixOS/nixpkgs/archive/69cb94ebb3193fc5077ee99ab2b50353151466ae.tar.gz --run 'python3 -c "import numpy as np; import pandas as pd; print(pd.__version__); print(pd.DataFrame([(np.datetime64(\"2020-01-01\"), 12345)]).groupby([0, 1]).indices)"'
1.0.5
{(numpy.datetime64('2020-01-01T00:00:00.000000000'), 12345): array([0])}

but it now performs a conversion to Timestamp:

# nix-shell -p python3Packages.pandas -p python3Packages.numpy -I nixpkgs=https://github.com/NixOS/nixpkgs/archive/2dafde493f153dba0eb4b34cd49763ee78eda3d9.tar.gz --run 'python3 -c "import numpy as np; import pandas as pd; print(pd.__version__); print(pd.DataFrame([(np.datetime64(\"2020-01-01\"), 12345)]).groupby([0, 1]).indices)"'
1.1.0
{(Timestamp('2020-01-01 00:00:00'), 12345): array([0])}
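
A minimal illustration of the knock-on effect, assuming the hive directory name is built from the string form of the partition value (which is what the writer code linked above appears to do):

import numpy as np
import pandas as pd

val = np.datetime64("2005-01-02T00:00:00.000000000")

# what would end up in the partition directory name, before and after the change
print("dtTrade=%s" % val)                # dtTrade=2005-01-02T00:00:00.000000000
print("dtTrade=%s" % pd.Timestamp(val))  # dtTrade=2005-01-02 00:00:00

# the test's string filter '2005-01-02T00:00:00.000000000' therefore no longer
# matches the value recovered from the directory name, and zero rows come back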

> Hm, I have pandas 1.1.0, and it still passes for me locally 😐

@martindurant That might be because you’ve changed the compared value as a part of a8cb8d1a28eb2db4ada233052cbc01bf815c2551. That should have broken the test on older pandas versions such as 1.0.5.

0 reactions
martindurant commented, Sep 29, 2020

> The names of partitioning directories in the “hive” scheme were changed because the dates were rendered to string with a default format of the type

Correct, we think this is what’s going on

> it appears that the behaviour in 1.1.0 is not a bug

Well, it’s a change in behaviour, hence the problem for us. Perhaps wrapping the expected value in Timestamp solves this for all cases.
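
One reading of that suggestion, applied to the test's filter value (a sketch only, not the actual patch):

import pandas as pd

# before: the raw numpy-style string, which no longer matches the
# Timestamp-rendered directory name under pandas 1.1
filters = [('dtTrade', '==', '2005-01-02T00:00:00.000000000')]

# after: hand pandas the value, so the comparison is made against whatever
# form pandas itself renders into the hive directory name
filters = [('dtTrade', '==', pd.Timestamp('2005-01-02T00:00:00.000000000'))]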
