Datetime column type could not be recognized in Spark
What happened:
We use fastparquet to write a pandas DataFrame with datetime columns. When we then read the Parquet file with Spark, all the datetime columns come back as 'bigint' type.
It worked in an older version (0.6.0), but breaks in the latest release, 0.7.0.
What you expected to happen:
The datetime columns should be read as timestamp type in Spark.
Minimal Complete Verifiable Example:
import pyspark
import pandas as pd
pdf = pd.DataFrame([[pd.to_datetime('2021-01-01')]], columns=['c'])
pdf.to_parquet('tmp.parquet', engine='fastparquet')
print(pdf.dtypes)
spark = pyspark.sql.SparkSession.builder.getOrCreate()
sdf = spark.read.format('parquet').load('tmp.parquet')
print(sdf.dtypes)
Output:
pandas schema:
c    datetime64[ns]
Spark schema:
[('c', 'bigint')]
Anything else we need to know?:
Environment:
- fastparquet version: 0.7.0
- Spark version: 3.0.1
- Dask version: N/A
- Python version: 3.7.9
- Operating System: CentOS 7.6
- Install method (conda, pip, source): pip/conda
Issue Analytics
- State:
- Created 2 years ago
- Comments:7 (4 by maintainers)
Top GitHub Comments
Have you tried times='int96' when writing? The previous behaviour was truncating pandas' ns-resolution timestamps to µs, which was also unfortunate.

Right, pandas doesn't like times in anything other than ns, but I think it can be done somehow.
On July 25, 2021 8:38:29 AM EDT, Yuan Zhou @.***> wrote: