Fails to read from parquet table with flat schema and fields containing "."
I observed that when loading parquet data from a table that has a mix of normal fields and fields that contain "." in their name, the dotted fields cannot be accessed.
Sample data:
from pyspark.sql.functions import lit
spark.range(1).withColumn("a.b", lit(1)).withColumn("a_b", lit(2)).coalesce(1).write.parquet("/mnt/ivan/sample.parquet")
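For reference, the resulting schema is flat; a quick check (a sketch, assuming the same path) shows there is no struct column, just a top-level field whose name happens to contain a dot:
spark.read.parquet("/mnt/ivan/sample.parquet").printSchema()
# Expected output (roughly; exact nullability flags may differ):
# root
#  |-- id: long (nullable = true)
#  |-- a.b: integer (nullable = true)
#  |-- a_b: integer (nullable = true)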
Accessing the “a_b” column works:
df = ks.read_parquet("/mnt/ivan/sample.parquet")
df['a_b']
# Out[9]:
# 0 2
# Name: a_b, dtype: int32
Accessing the “a.b” column fails:
df = ks.read_parquet("/mnt/ivan/sample.parquet")
df['a.b']
# KeyError: 'a.b'
# ---------------------------------------------------------------------------
# KeyError Traceback (most recent call last)
# <command-3188040896828876> in <module>()
# 1 df = ks.read_parquet("/mnt/ivan/sample.parquet")
# ----> 2 df['a.b']
#
# /databricks/python/lib/python3.5/site-packages/databricks/koalas/frame.py in __getitem__(self, key)
# 448
# 449 def __getitem__(self, key):
# --> 450 return self._pd_getitem(key)
# 451
# 452 def __setitem__(self, key, value):
#
# /databricks/python/lib/python3.5/site-packages/databricks/koalas/frame.py in _pd_getitem(self, key)
# 427 return Series(self._sdf.__getitem__(key), self, self._metadata.index_info)
# 428 except AnalysisException:
# --> 429 raise KeyError(key)
# 430 if np.isscalar(key) or isinstance(key, (tuple, string_types)):
# 431 raise NotImplementedError(key)
#
# KeyError: 'a.b'
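The traceback above shows the KeyError is raised in _pd_getitem after catching an AnalysisException. For comparison, plain PySpark raises that AnalysisException directly when the dot is left unescaped (a sketch against the same sample file):
sdf = spark.read.parquet("/mnt/ivan/sample.parquet")
sdf.select("a.b")
# raises AnalysisException: the unescaped name is parsed as field "b"
# of a (nonexistent) top-level column "a"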
Escaping the column name with backticks (`) also does not work:
df = ks.read_parquet("/mnt/ivan/sample.parquet")
df['`a.b`']
# Out[11]: <repr(<databricks.koalas.series.Series at 0x7fe1347c53c8>) failed: KeyError: 'a.b'>
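For comparison, backtick escaping does work when going through PySpark directly, which suggests the problem is in how Koalas handles the name rather than in Spark itself (a sketch against the same sample file):
sdf = spark.read.parquet("/mnt/ivan/sample.parquet")
sdf.select("`a.b`").show()
# +---+
# |a.b|
# +---+
# |  1|
# +---+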
I am still working on the problem; I found quite a few cases against which I need to check my patch. I will submit a PR this week, apologies for the delay.
Here’s what’s going on now: “a.b” is being resolved as nested field access.
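To illustrate the nested-access behavior, here is a minimal sketch (assumed setup, not from the original report): Spark’s analyzer treats an unescaped dot as struct-field access, so “a.b” means field “b” inside a struct column “a” rather than a top-level column named “a.b”.
from pyspark.sql import functions as F

# A struct column "a" with a field "b": the unescaped dot resolves the field.
nested = spark.range(1).select(F.struct(F.lit(1).alias("b")).alias("a"))
nested.select("a.b").show()
# +---+
# |  b|
# +---+
# |  1|
# +---+

# A flat column literally named "a.b": backticks force top-level resolution.
flat = spark.range(1).select(F.lit(2).alias("a.b"))
flat.select("`a.b`").show()
# +---+
# |a.b|
# +---+
# |  2|
# +---+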