
Datetime column type could not be recognized in Spark

See original GitHub issue

What happened:

We use fastparquet to write a pandas DataFrame with datetime columns. When we then read the Parquet file with Spark, all the datetime columns come back as ‘bigint’.

This worked in the previous release (0.6.0) but breaks in the latest release, 0.7.0.

What you expected to happen:

The columns should come back as timestamp type in Spark.

Minimal Complete Verifiable Example:

import pyspark
import pandas as pd

# Write a single-column datetime frame with the fastparquet engine.
pdf = pd.DataFrame([[pd.to_datetime('2021-01-01')]], columns=['c'])
pdf.to_parquet('tmp.parquet', engine='fastparquet')
print(pdf.dtypes)

# Read the same file back with Spark and compare the schema.
spark = pyspark.sql.SparkSession.builder.getOrCreate()
sdf = spark.read.format('parquet').load('tmp.parquet')
print(sdf.dtypes)

Output:

pandas schema: c    datetime64[ns]
Spark schema:  [('c', 'bigint')]
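A plausible root cause, consistent with the maintainer comment below: fastparquet 0.7.0 writes pandas ns-resolution timestamps as INT64 with a nanosecond timestamp logical type, which Spark 3.0 does not map to its timestamp type, so only the physical bigint is exposed. A minimal sketch of how to check this, using fastparquet’s own schema helper:

import fastparquet

# Print the parquet schema of the file written above; under 0.7.0 the
# 'c' column should appear as INT64 with a nanosecond timestamp logical
# type, rather than the microsecond-resolution INT64 that 0.6.0 produced.
pf = fastparquet.ParquetFile('tmp.parquet')
print(pf.schema.text)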

Anything else we need to know?:

Environment:

  • fastparquet version: 0.7.0
  • Spark version: 3.0.1
  • Dask version: N/A
  • Python version: 3.7.9
  • Operating System: CentOS 7.6
  • Install method (conda, pip, source): pip/conda

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
martindurant commented, Jul 23, 2021

Have you tried times='int96' when writing? The previous behaviour was truncating pandas’ ns-resolution timestamps to µs, which was also unfortunate.
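For example, a minimal sketch of that write path (times is a fastparquet write option, which pandas’ to_parquet forwards to the engine):

import pandas as pd

pdf = pd.DataFrame([[pd.to_datetime('2021-01-01')]], columns=['c'])

# Write timestamps in the legacy INT96 representation instead of
# INT64 nanoseconds; Spark reads INT96 back as a timestamp column.
pdf.to_parquet('tmp.parquet', engine='fastparquet', times='int96')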

0 reactions
martindurant commented, Jul 25, 2021

Right, pandas doesn’t like times in anything other than ns, but I think it can be done somehow.

On July 25, 2021 8:38:29 AM EDT, Yuan Zhou @.***> wrote:

I tested the solution with times='int96' and it works now!

But I didn’t find a way to convert time columns to datetime64[ms] or datetime64[us] type… I tried pd.to_datetime(pdf['c'], unit='ms') and pdf['c'].astype('datetime64[ms]'); neither worked.

(view on GitHub: https://github.com/dask/fastparquet/issues/646#issuecomment-886196137)
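For reference, a sketch of that limitation and a workaround on the pandas 1.x line used in this issue (pandas 2.0, released well after this thread, added native non-ns resolutions):

import pandas as pd

pdf = pd.DataFrame([[pd.to_datetime('2021-01-01')]], columns=['c'])

# On pandas 1.x, casting to a non-ns resolution raises TypeError:
#   pdf['c'].astype('datetime64[ms]')

# The values can be truncated to ms precision while keeping the ns dtype:
pdf['c'] = pdf['c'].dt.floor('ms')
print(pdf.dtypes)  # still datetime64[ns], but the values are ms-aligned

# pandas >= 2.0 accepts the astype('datetime64[ms]') cast directly.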
