Fails to read from parquet table with flat schema and fields containing "."

I observed that when loading parquet data from a table that has a mix of normal fields and fields containing “.” in their names, the columns whose names contain “.” cannot be accessed:

Sample data:

from pyspark.sql.functions import lit
spark.range(1).withColumn("a.b", lit(1)).withColumn("a_b", lit(2)).coalesce(1).write.parquet("/mnt/ivan/sample.parquet")
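
For context (a quick sanity check, not part of the original report): reading the file back with plain PySpark should show that “a.b” is a literal top-level field, i.e. the schema is flat rather than nested.

spark.read.parquet("/mnt/ivan/sample.parquet").printSchema()
# prints "id", "a.b" and "a_b" as three top-level fields;
# there is no struct column "a" anywhere in the schema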

Code works when accessing “a_b”:

df = ks.read_parquet("/mnt/ivan/sample.parquet")
df['a_b']

# Out[9]: 
# 0    2
# Name: a_b, dtype: int32

Accessing the “a.b” column fails:

df = ks.read_parquet("/mnt/ivan/sample.parquet")
df['a.b']

# KeyError: 'a.b'
# ---------------------------------------------------------------------------
# KeyError                                  Traceback (most recent call last)
# <command-3188040896828876> in <module>()
#       1 df = ks.read_parquet("/mnt/ivan/sample.parquet")
# ----> 2 df['a.b']
# 
# /databricks/python/lib/python3.5/site-packages/databricks/koalas/frame.py in __getitem__(self, key)
#     448 
#     449     def __getitem__(self, key):
# --> 450         return self._pd_getitem(key)
#     451 
#     452     def __setitem__(self, key, value):
# 
# /databricks/python/lib/python3.5/site-packages/databricks/koalas/frame.py in _pd_getitem(self, key)
#     427                 return Series(self._sdf.__getitem__(key), self, self._metadata.index_info)
#     428             except AnalysisException:
# --> 429                 raise KeyError(key)
#     430         if np.isscalar(key) or isinstance(key, (tuple, string_types)):
#     431             raise NotImplementedError(key)
# 
# KeyError: 'a.b'
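
The KeyError is raised by Koalas after it catches Spark’s AnalysisException (see the _pd_getitem frame above). As a rough sketch, assuming the same sample file, the underlying failure can be reproduced on the plain Spark DataFrame:

sdf = spark.read.parquet("/mnt/ivan/sample.parquet")
sdf["a.b"]
# raises pyspark.sql.utils.AnalysisException:
# Cannot resolve column name "a.b" among (id, a.b, a_b)  (message approximate)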

Escaping the column name with backticks (“`”) also does not work:

df = ks.read_parquet("/mnt/ivan/sample.parquet")
df['`a.b`']

# Out[11]: <repr(<databricks.koalas.series.Series at 0x7fe1347c53c8>) failed: KeyError: 'a.b'>
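
For comparison, and only as a sketch against the same sample file: backtick quoting does work on the underlying Spark DataFrame, which suggests the data itself is readable and the problem sits in how Koalas resolves the name:

sdf = spark.read.parquet("/mnt/ivan/sample.parquet")
sdf.select("`a.b`").show()  # backticks make Spark treat the dotted name literally; shows the single row with value 1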

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 14 (13 by maintainers)

Top GitHub Comments

2 reactions
sadikovi commented, May 13, 2019

I am still working on the problem; I found quite a few cases for which I need to check my patch. I will submit a PR this week, apologies for the delay.

0 reactions
HyukjinKwon commented, Jun 3, 2019

Here’s what’s going on now:

scala> val df = spark.range(1).toDF("a.b")
df: org.apache.spark.sql.DataFrame = [a.b: bigint]

scala> df("a.b")
org.apache.spark.sql.AnalysisException: Cannot resolve column name "a.b" among (a.b);
  at org.apache.spark.sql.Dataset.$anonfun$resolve$1(Dataset.scala:233)
  at scala.Option.getOrElse(Option.scala:138)
  at org.apache.spark.sql.Dataset.resolve(Dataset.scala:233)
  at org.apache.spark.sql.Dataset.col(Dataset.scala:1314)
  at org.apache.spark.sql.Dataset.apply(Dataset.scala:1281)
  ... 47 elided

“a.b” is being resolved as nested field access (i.e. as field “b” of a struct column “a”) rather than as a single column literally named “a.b”.
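
A small PySpark sketch (hypothetical data, not from the issue) of why the two readings collide:

from pyspark.sql.functions import lit, struct

# A struct column "a" with a field "b": here "a.b" legitimately means nested access.
nested = spark.range(1).select(struct(lit(1).alias("b")).alias("a"))
nested.select("a.b").show()   # resolves to the struct field "b"

# A flat column literally named "a.b": the same string is parsed the same way,
# so Spark looks for a struct column "a" that does not exist.
flat = spark.range(1).withColumn("a.b", lit(1))
flat.select("`a.b`").show()   # works: backticks force a literal column name
# flat.select("a.b")          # would fail to resolve (treated as nested access)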
