False positive ambiguous columns error when creating features

Tech Debt Title

Summary

Weird error related to ambiguous columns that are not really ambiguous?

Feature related:

Age: new-tech-debt-introduced

Present since: 2020-03-06

Estimated cost: investigation_needed

Type: coding

Description 📋

It seems that using a SQLExpressionTransform when creating features can lead to false positive errors about ambiguous columns.

An example:

source=Source(
                readers=[
                    TableReader(
                        id="availability",
                        database="datalake_ebdb_raw",
                        table="horariosemanalimovel_aud",
                    )
                    .with_(self.column_sum)
                    .with_(
                        pivot,
                        group_by_columns=["imovel_id", "rev"],
                        pivot_column="diaDaSemana",
                        agg_column="column_sum",
                        aggregation=functions.sum,
                        mock_value=0,
                        mock_type="int",
                        with_forward_fill=True,
                    ),
                    TableReader(
                        id="ure",
                        database="datalake_ebdb_clean",
                        table="user_revision_entity",
                    ),
                ],
                query=(
                    """
                    with coalesced_availability as (
                      select 
                        av.imovel_id as id,
                        av.rev,
                        coalesce(`1`, 0) as monday,
                        coalesce(`2`, 0) as tuesday,
                        coalesce(`3`, 0) as wednesday,
                        coalesce(`4`, 0) as thursday,
                        coalesce(`5`, 0) as friday,
                        coalesce(`6`, 0) as saturday,
                        coalesce(`7`, 0) as sunday
                      from availability av
                    ), houses as (
                      select
                        ha.id_house as ha_id,
                        ha.rev as ha_rev,
                        av.rev as av_rev,
                        av.monday,
                        av.tuesday,
                        av.wednesday,
                        av.thursday,
                        av.friday,
                        av.saturday,
                        av.sunday
                      from datalake_ebdb_clean.house_aud ha
                      full outer join coalesced_availability av
                        on av.id = ha.id_house
                          and av.rev <= ha.rev
                    )
                    select distinct
                      ha_id as id,
                      coalesce(av_rev, ha_rev) as ts_revision,
                      monday as available_slots_monday,
                      tuesday as available_slots_tuesday,
                      wednesday as available_slots_wednesday,
                      thursday as available_slots_thursday,
                      friday as available_slots_friday,
                      saturday as available_slots_saturday,
                      sunday as available_slots_sunday,
                      (monday + tuesday + wednesday + thursday + friday + saturday + sunday) as total_available_slots_weekly
                    from houses
                    """
                ),
            ),
            feature_set=FeatureSet(
                name="house_availability",
                entity="house",
                description=(
                    """
                    Holds availability information related to house
                    feature such as "available_slots_monday" or
                    "total_available_slots_weekly"
                    """
                ),
                keys=[
                    KeyFeature(
                        name="id",
                        description="The House's Main ID",
                    )
                ],
                timestamp=TimestampFeature(from_column="ts_revision", from_ms=True),
                features=[
                    Feature(
                        name="available_slots_monday",
                        description="Number indicating available hours for visit on monday",
                        transformation=SQLExpressionTransform(
                            expression="coalesce(available_slots_monday, 9)"
                        ),
                    ),
                  ...

It seems that the part:

Feature(
    name="available_slots_monday",
    description="Number indicating available hours for visit on monday",
    transformation=SQLExpressionTransform(
        expression="coalesce(available_slots_monday, 9)"
    ),
),

Causes the error:

org.apache.spark.sql.AnalysisException: Reference 'available_slots_monday' is ambiguous, could be: available_slots_monday, available_slots_monday.;

However, if I change the query so that it selects simply monday instead of monday as available_slots_monday, and then do:

Feature(
    name="available_slots_monday",
    description="Number indicating available hours for visit on monday",
    transformation=SQLExpressionTransform(
        expression="coalesce(monday, 9)"
    ),
),

it works!

Impact 💣

Some false positive errors that can be hard to debug.

Critical in: UNKNOWN

Solution Hints :squirrel:

Not sure

Observations 🤔

Files related or evidences (like: prints)

Complete error:

Issue Analytics

State:
Created 4 years ago
Reactions: 1
Comments: 7 (3 by maintainers)

Top GitHub Comments

2 reactions

rafaelleinio commented, Sep 14, 2020

I’m working on this fix. It changes just one line in SQLExpressionTransform. The transformation was using a select statement on the dataframe, so if the feature has the same name as a column already present in the df, the reference becomes ambiguous. I’m changing the return to use a withColumn statement, so Spark automatically overwrites the column when the names match. This is the same behaviour as the other transformations and features in Butterfree.
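The difference between the two approaches can be illustrated with a small model. The sketch below is plain Python, not Spark or Butterfree code: it tracks only column names, and the helpers select_star_plus, with_column and resolve are hypothetical names made up for this illustration. It assumes the old transform behaved like a select that keeps every existing column and appends the aliased expression, which is how the comment above describes it.

```python
# Toy model of column-name resolution; plain Python, not Spark itself.
# All helper names below are hypothetical and exist only for illustration.


class AmbiguousReference(Exception):
    """Raised when a name matches more than one column."""


def select_star_plus(columns, name):
    # Models a select of all existing columns plus a new aliased expression:
    # the new column is appended even if the name is already taken.
    return columns + [name]


def with_column(columns, name):
    # Models withColumn: a column with the same name is replaced,
    # so the list of names stays the same; otherwise the name is appended.
    if name in columns:
        return list(columns)
    return columns + [name]


def resolve(columns, name):
    # Models a later reference to `name`, which must match exactly one column.
    matches = [c for c in columns if c == name]
    if len(matches) > 1:
        raise AmbiguousReference(
            f"Reference '{name}' is ambiguous, could be: {', '.join(matches)}."
        )
    if not matches:
        raise KeyError(name)
    return matches[0]


# Columns produced by the source query in the issue (shortened).
source_columns = ["id", "ts_revision", "available_slots_monday"]

# Select-style transform: the feature name collides with a source column.
duplicated = select_star_plus(source_columns, "available_slots_monday")

# withColumn-style transform: the existing column is overwritten.
overwritten = with_column(source_columns, "available_slots_monday")
```

In this model, resolve(duplicated, "available_slots_monday") raises AmbiguousReference, while the same call on overwritten succeeds. It also shows why the workaround in the issue helps: if the query exposes the column as monday, the feature name no longer collides with any source column, so even the select-style transform yields unique names.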

1 reaction

rafaelleinio commented, Sep 19, 2020

Thanks! I’m closing this issue.
