
Datetime column type could not be recognized in Spark

See original GitHub issue

What happened:

We use fastparquet to write a pandas DataFrame with datetime columns. When we then read the Parquet file with Spark, all the datetime columns come back as ‘bigint’.

This worked in the previous release (0.6.0) but breaks in the latest release, 0.7.0.

What you expected to happen:

The columns should come back as timestamp type in Spark.

Minimal Complete Verifiable Example:

import pyspark
import pandas as pd

# Write a single-column datetime frame with the fastparquet engine.
pdf = pd.DataFrame([[pd.to_datetime('2021-01-01')]], columns=['c'])
pdf.to_parquet('tmp.parquet', engine='fastparquet')
print(pdf.dtypes)

# Read the same file back with Spark and compare the schema.
spark = pyspark.sql.SparkSession.builder.getOrCreate()
sdf = spark.read.format('parquet').load('tmp.parquet')
print(sdf.dtypes)

Output:

pandas schema: c    datetime64[ns]
Spark schema:  [('c', 'bigint')]
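A plausible root cause, consistent with the maintainer comment below: fastparquet 0.7.0 writes pandas ns-resolution timestamps as INT64 with a nanosecond timestamp logical type, which Spark 3.0 does not map to its timestamp type, so only the physical bigint is exposed. A minimal sketch of how to check this, using fastparquet’s own schema helper:

import fastparquet

# Print the parquet schema of the file written above; under 0.7.0 the
# 'c' column should appear as INT64 with a nanosecond timestamp logical
# type, rather than the microsecond-resolution INT64 that 0.6.0 produced.
pf = fastparquet.ParquetFile('tmp.parquet')
print(pf.schema.text)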

Anything else we need to know?:

Environment:

  • fastparquet version: 0.7.0
  • Spark version: 3.0.1
  • Dask version: N/A
  • Python version: 3.7.9
  • Operating System: CentOS 7.6
  • Install method (conda, pip, source): pip/conda

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
martindurant commented, Jul 23, 2021

Have you tried times='int96' when writing? The previous behaviour was truncating pandas’ ns-resolution timestamps to µs, which was also unfortunate.
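For example, a minimal sketch of that write path (times is a fastparquet write option, which pandas’ to_parquet forwards to the engine):

import pandas as pd

pdf = pd.DataFrame([[pd.to_datetime('2021-01-01')]], columns=['c'])

# Write timestamps in the legacy INT96 representation instead of
# INT64 nanoseconds; Spark reads INT96 back as a timestamp column.
pdf.to_parquet('tmp.parquet', engine='fastparquet', times='int96')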

0 reactions
martindurant commented, Jul 25, 2021

Right, pandas doesn’t like times in anything other than ns, but I think it can be done somehow.

On July 25, 2021 8:38:29 AM EDT, Yuan Zhou @.***> wrote:

I tested the solution with times='int96' and it works now!

But I didn’t find a way to convert time columns to datetime64[ms] or datetime64[us] type… I tried pd.to_datetime(pdf['c'], unit='ms') and pdf['c'].astype('datetime64[ms]'); neither worked.

(view on GitHub: https://github.com/dask/fastparquet/issues/646#issuecomment-886196137)
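For reference, a sketch of that limitation and a workaround on the pandas 1.x line used in this issue (pandas 2.0, released well after this thread, added native non-ns resolutions):

import pandas as pd

pdf = pd.DataFrame([[pd.to_datetime('2021-01-01')]], columns=['c'])

# On pandas 1.x, casting to a non-ns resolution raises TypeError:
#   pdf['c'].astype('datetime64[ms]')

# The values can be truncated to ms precision while keeping the ns dtype:
pdf['c'] = pdf['c'].dt.floor('ms')
print(pdf.dtypes)  # still datetime64[ns], but the values are ms-aligned

# pandas >= 2.0 accepts the astype('datetime64[ms]') cast directly.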
