Fails to read from parquet table with flat schema and fields containing "."

I observed that when loading parquet data from a table that has a mix of normal fields and fields containing “.” in their names, the columns whose names contain “.” cannot be accessed:

Sample data:

from pyspark.sql.functions import lit
spark.range(1).withColumn("a.b", lit(1)).withColumn("a_b", lit(2)).coalesce(1).write.parquet("/mnt/ivan/sample.parquet")
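
For context (a quick sanity check, not part of the original report): reading the file back with plain PySpark should show that “a.b” is a literal top-level field, i.e. the schema is flat rather than nested.

spark.read.parquet("/mnt/ivan/sample.parquet").printSchema()
# prints "id", "a.b" and "a_b" as three top-level fields;
# there is no struct column "a" anywhere in the schema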

Code works when accessing “a_b”:

df = ks.read_parquet("/mnt/ivan/sample.parquet")
df['a_b']

# Out[9]: 
# 0    2
# Name: a_b, dtype: int32

Accessing the “a.b” column fails:

df = ks.read_parquet("/mnt/ivan/sample.parquet")
df['a.b']

# KeyError: 'a.b'
# ---------------------------------------------------------------------------
# KeyError                                  Traceback (most recent call last)
# <command-3188040896828876> in <module>()
#       1 df = ks.read_parquet("/mnt/ivan/sample.parquet")
# ----> 2 df['a.b']
# 
# /databricks/python/lib/python3.5/site-packages/databricks/koalas/frame.py in __getitem__(self, key)
#     448 
#     449     def __getitem__(self, key):
# --> 450         return self._pd_getitem(key)
#     451 
#     452     def __setitem__(self, key, value):
# 
# /databricks/python/lib/python3.5/site-packages/databricks/koalas/frame.py in _pd_getitem(self, key)
#     427                 return Series(self._sdf.__getitem__(key), self, self._metadata.index_info)
#     428             except AnalysisException:
# --> 429                 raise KeyError(key)
#     430         if np.isscalar(key) or isinstance(key, (tuple, string_types)):
#     431             raise NotImplementedError(key)
# 
# KeyError: 'a.b'
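
The KeyError is raised by Koalas after it catches Spark’s AnalysisException (see the _pd_getitem frame above). As a rough sketch, assuming the same sample file, the underlying failure can be reproduced on the plain Spark DataFrame:

sdf = spark.read.parquet("/mnt/ivan/sample.parquet")
sdf["a.b"]
# raises pyspark.sql.utils.AnalysisException:
# Cannot resolve column name "a.b" among (id, a.b, a_b)  (message approximate)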

Escaping the column name with backticks (“`”) also does not work:

df = ks.read_parquet("/mnt/ivan/sample.parquet")
df['`a.b`']

# Out[11]: <repr(<databricks.koalas.series.Series at 0x7fe1347c53c8>) failed: KeyError: 'a.b'>
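
For comparison, and only as a sketch against the same sample file: backtick quoting does work on the underlying Spark DataFrame, which suggests the data itself is readable and the problem sits in how Koalas resolves the name:

sdf = spark.read.parquet("/mnt/ivan/sample.parquet")
sdf.select("`a.b`").show()  # backticks make Spark treat the dotted name literally; shows the single row with value 1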

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 14 (13 by maintainers)

Top GitHub Comments

2 reactions
sadikovi commented, May 13, 2019

I am still working on the problem; I found quite a few cases for which I need to check my patch. I will submit a PR this week, apologies for the delay.

0 reactions
HyukjinKwon commented, Jun 3, 2019

Here’s what’s going on now:

scala> val df = spark.range(1).toDF("a.b")
df: org.apache.spark.sql.DataFrame = [a.b: bigint]

scala> df("a.b")
org.apache.spark.sql.AnalysisException: Cannot resolve column name "a.b" among (a.b);
  at org.apache.spark.sql.Dataset.$anonfun$resolve$1(Dataset.scala:233)
  at scala.Option.getOrElse(Option.scala:138)
  at org.apache.spark.sql.Dataset.resolve(Dataset.scala:233)
  at org.apache.spark.sql.Dataset.col(Dataset.scala:1314)
  at org.apache.spark.sql.Dataset.apply(Dataset.scala:1281)
  ... 47 elided

“a.b” is being resolved as nested field access (i.e. as field “b” of a struct column “a”) rather than as a single column literally named “a.b”.
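
A small PySpark sketch (hypothetical data, not from the issue) of why the two readings collide:

from pyspark.sql.functions import lit, struct

# A struct column "a" with a field "b": here "a.b" legitimately means nested access.
nested = spark.range(1).select(struct(lit(1).alias("b")).alias("a"))
nested.select("a.b").show()   # resolves to the struct field "b"

# A flat column literally named "a.b": the same string is parsed the same way,
# so Spark looks for a struct column "a" that does not exist.
flat = spark.range(1).withColumn("a.b", lit(1))
flat.select("`a.b`").show()   # works: backticks force a literal column name
# flat.select("a.b")          # would fail to resolve (treated as nested access)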
