Format Breaking Changes Candidates

This is a meta issue describing which changes we would like to make now but which are incompatible with the current format.

Type Normalization

The following type normalizations are currently not implemented but could be:

decimal128[P, S] -> decimal128[38, S] (38 is the max for 128 bits)
date{32, 64} -> date64
time{32, 64}[U] -> time64[U]
structs (nested normalization)
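
A minimal sketch of how the rules above could be expressed. It uses small hypothetical stand-in classes (`Decimal128`, `Date`, `Time`, `Struct`) rather than real Arrow types, and the `normalize` function is an assumption made for illustration, not the project's implementation.

```python
from dataclasses import dataclass
from typing import Any, Tuple


# Hypothetical stand-ins for Arrow types; not the pyarrow API.
@dataclass(frozen=True)
class Decimal128:
    precision: int
    scale: int


@dataclass(frozen=True)
class Date:
    bits: int  # 32 or 64


@dataclass(frozen=True)
class Time:
    bits: int  # 32 or 64
    unit: str


@dataclass(frozen=True)
class Struct:
    fields: Tuple[Tuple[str, Any], ...]  # (name, type) pairs


def normalize(t: Any) -> Any:
    """Apply the normalization rules listed above to one type descriptor."""
    if isinstance(t, Decimal128):
        # decimal128[P, S] -> decimal128[38, S]
        return Decimal128(38, t.scale)
    if isinstance(t, Date):
        # date{32, 64} -> date64
        return Date(64)
    if isinstance(t, Time):
        # time{32, 64}[U] -> time64[U]
        return Time(64, t.unit)
    if isinstance(t, Struct):
        # structs: normalize every field type recursively
        return Struct(tuple((name, normalize(ft)) for name, ft in t.fields))
    # Anything else is left untouched in this sketch.
    return t
```

The recursion in the `Struct` branch is what "nested normalization" refers to here: a struct is normalized by normalizing each of its field types.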

Index Handling

Reject non-integer/range indices, use reset_index and drop index information before writing data. Always restore as normalized (reset_index) indices, even when applying predicates.
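
One possible reading of this rule, sketched with pandas. The helper name `prepare_for_write` is hypothetical, and treating "integer or range index" as the accepted set is an assumption about what the proposal intends.

```python
import pandas as pd


def prepare_for_write(df: pd.DataFrame) -> pd.DataFrame:
    """Reject unsupported indices, then drop index information (hypothetical helper)."""
    idx = df.index
    # A RangeIndex has an integer dtype, so this check covers both accepted cases.
    if not pd.api.types.is_integer_dtype(idx.dtype):
        raise ValueError(f"unsupported index type: {type(idx).__name__}")
    # drop=True discards the old index instead of turning it into a column.
    return df.reset_index(drop=True)
```

With this reading, a DataFrame indexed by `[10, 20]` is written with a fresh `0..n-1` index, while a DataFrame indexed by strings is rejected.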

Pandas-specific Metadata

Pandas-specific metadata is part of the Arrow schema but is not part of the Arrow type system. It captures information like the index type. If Index Handling is implemented, we could drop the entire pandas metadata field. This would simplify interop with other languages/frameworks.

Labels

Use UUIDs everywhere and reject user-provided labels.
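
As a rough illustration, label generation could look like the following. The function name `new_partition_label` and its `user_label` parameter are hypothetical and only serve to show the "reject" part of the rule.

```python
import uuid
from typing import Optional


def new_partition_label(user_label: Optional[str] = None) -> str:
    """Return a fresh UUID-based label; refuse any caller-supplied label."""
    if user_label is not None:
        raise ValueError("user-provided labels are not accepted")
    # uuid4 is random-based; .hex gives a 32-character lowercase hex string.
    return uuid.uuid4().hex
```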

Issue Analytics

Created 4 years ago
Comments: 8 (4 by maintainers)

Top GitHub Comments

lr4d commented, Jun 6, 2019

If partition labels and partition indices (recently deprecated) are both removed, it will be nice for UX if parse_input_to_metapartition accepts what is currently the value of the data argument, given that data would be the only key remaining in the input dictionary.

That is, instead of the current:

import pandas as pd

dfs = [
    {
        "data": {
            "core-table": pd.DataFrame({"col1": ["x"]}),
            "aux-table": pd.DataFrame({"f": [1.1]}),
        }
    },
    {
        "data": {
            "core-table": pd.DataFrame({"col1": ["y"]}),
            "aux-table": pd.DataFrame({"f": [1.2]}),
        }
    },
]

Allow:

dfs = [
    {
        "core-table": pd.DataFrame({"col1": ["x"]}),
        "aux-table": pd.DataFrame({"f": [1.1]}),
    },
    {
        "core-table": pd.DataFrame({"col1": ["y"]}),
        "aux-table": pd.DataFrame({"f": [1.2]}),
    },
]
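
A small sketch of how both shapes could be accepted during a transition. The helper `unwrap_partition_input` is hypothetical and is not part of `parse_input_to_metapartition`; it assumes the wrapped form is recognizable as a dictionary whose only key is "data" and whose value is itself a dictionary of tables.

```python
from typing import Any, Dict


def unwrap_partition_input(partition: Dict[str, Any]) -> Dict[str, Any]:
    """Return the table mapping for either the wrapped or the flat input form."""
    # Wrapped form: {"data": {"core-table": ..., "aux-table": ...}}
    if set(partition) == {"data"} and isinstance(partition["data"], dict):
        return partition["data"]
    # Flat form: {"core-table": ..., "aux-table": ...}
    return partition
```

The `isinstance` check keeps a flat input with a single table that happens to be named "data" from being unwrapped by mistake.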

crepererum commented, May 24, 2019

Why do we upcast to the broadest width? Why not the smallest?

We use the largest because the common metadata then describes a type that can hold all variables of all partitions (aka the container type). That’s the whole point of the type system documentation, and it also explains why ints cannot be packed into floats or the other way around.

What would we do if pyarrow introduces a int96 or int128? Do we change our casting rules?

Depends. It would make sense to upcast to int128 in that case, but we might not want to do that if it means that all libs break because they cannot handle this type (numpy, for example). In other words (as also described in the type system docs): find a container type that still doesn’t break the ecosystem.

Most common in the sense that these types are used by pandas as defaults for the given type family.

Pandas is exactly NOT a good blueprint for the type system (see the float vs. int discussion again). Most common means “the container / common type can hold all values of all types that are upcast into that exact container / common type” (which is not the case for int<->float) and that the upcast “makes semantic sense” (see the discussion in the type system docs on why bools should not be upcast to ints).
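
The container-type idea can be sketched in a few lines. This is an illustration under stated assumptions, not the project's casting rules: types are plain strings, the family table `FAMILIES` is made up for the example, and `container_type` is a hypothetical name.

```python
from typing import Iterable

# Hypothetical type families, each ordered from narrowest to widest.
FAMILIES = {
    "int": ["int8", "int16", "int32", "int64"],
    "float": ["float16", "float32", "float64"],
    "bool": ["bool"],
}


def family_of(type_name: str) -> str:
    """Look up which family a type name belongs to."""
    for family, members in FAMILIES.items():
        if type_name in members:
            return family
    raise ValueError(f"unknown type: {type_name}")


def container_type(partition_types: Iterable[str]) -> str:
    """Return the widest type of the single family shared by all partitions."""
    families = {family_of(t) for t in partition_types}
    if not families:
        raise ValueError("no types given")
    if len(families) != 1:
        # int<->float and bool->int are rejected: no member of one family
        # is treated as a container for values of another family.
        raise TypeError(f"no common container type for families {sorted(families)}")
    # Always upcast to the broadest width in the family.
    return FAMILIES[families.pop()][-1]
```

Here `container_type(["int8", "int32"])` yields `"int64"`, while mixing `"int32"` with `"float64"` or `"bool"` with `"int8"` raises `TypeError`.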
