
Timestamp metadata and Spark

See original GitHub issue

OK, so I create a pandas dataframe that has a timestamp column. I save this to Parquet using fastparquet and then read the data with Spark. I find that my Spark dataframe identifies my timestamp column as an integer column. Is there perhaps some special metadata that Spark is looking out for?

Example

In [1]: import pandas as pd

In [2]: import pyspark

In [3]: import fastparquet

In [4]: df = pd.DataFrame({'x': [1, 2, 3]})

In [5]: df['x'] = pd.to_datetime(df.x)

In [6]: df
Out[6]: 
                              x
0 1970-01-01 00:00:00.000000001
1 1970-01-01 00:00:00.000000002
2 1970-01-01 00:00:00.000000003

In [7]: sc = pyspark.SparkContext('local[4]')
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
17/02/27 17:13:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/02/27 17:13:11 WARN Utils: Your hostname, carbon resolves to a loopback address: 127.0.1.1; using 192.168.1.115 instead (on interface wlp4s0)
17/02/27 17:13:11 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/02/27 17:13:11 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.

In [8]: sql = pyspark.SQLContext(sc)

In [9]: fastparquet.write('foo.parquet', df)

In [10]: sdf = sql.read.parquet('foo.parquet')
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

In [11]: sdf
Out[11]: DataFrame[x: bigint]
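
For context: fastparquet stores timestamps as INT64 values by default, while Spark releases of that era (pre-2.2) generally only recognized the INT96 timestamp encoding, which is why the column comes back as bigint. Below is a minimal sketch of the write-side workaround, assuming a fastparquet version that supports the times keyword (the file name foo96.parquet is just for illustration):

import pandas as pd
import fastparquet

# Same data as the session above: integers interpreted as nanosecond
# timestamps since the epoch.
df = pd.DataFrame({'x': pd.to_datetime([1, 2, 3])})

# times='int96' stores timestamps in the INT96 layout that Spark itself
# writes and recognizes, instead of fastparquet's default INT64 encoding.
fastparquet.write('foo96.parquet', df, times='int96')

# Re-reading in the same Spark session should now report a timestamp
# column rather than bigint.
sdf = sql.read.parquet('foo96.parquet')
print(sdf)  # expected: DataFrame[x: timestamp]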

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 31 (19 by maintainers)

Top GitHub Comments

3 reactions
martindurant commented, Mar 6, 2017

Use to_pandas(timestamp96=['inserted2']) to automatically convert the S12-type column to times.
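
A minimal sketch of that call for the read direction, i.e. loading a Spark-written INT96 file back into pandas. The path spark_output.parquet is hypothetical; the column name 'inserted2' is the one from the comment above, so substitute your own:

import fastparquet

pf = fastparquet.ParquetFile('spark_output.parquet')  # hypothetical path

# Spark's INT96 timestamps surface in fastparquet as raw 12-byte (S12)
# fixed-length values; timestamp96=[...] asks to_pandas to decode the
# named columns into datetime64 instead.
df = pf.to_pandas(timestamp96=['inserted2'])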

1 reaction
martindurant commented, Sep 8, 2017

Not any time soon 😃

Read more comments on GitHub >

Top Results From Across the Web

How to Effectively Use Dates and Timestamps in Spark 3.0
Spark SQL defines the timestamp type as TIMESTAMP WITH SESSION TIME ZONE, which is a combination of the fields (YEAR, ...

current_timestamp Function For Processing Time in Streaming ...
Both are special for Spark Structured Streaming as StreamExecution replaces their underlying Catalyst expressions, CurrentTimestamp and CurrentDate ...

TIMESTAMP compatibility for Parquet files | CDP Private Cloud
When writing Parquet files, Hive and Spark SQL both normalize all TIMESTAMP values to the UTC time zone. During a query, Spark SQL ...

Spark Queries - Apache Iceberg
Metadata tables, like history and snapshots, can use the Iceberg table name ... Spark 3.3 and later supports time travel in SQL ...

How to read timestamp csv file in pyspark? - Stack Overflow
df_read_file = spark.read.format("com.databricks.spark.csv") ... .load("/app/HTA/SrcFiles/inbound/metadata/projectno_without_data_*")
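
The first and third results above concern Spark's session time zone semantics. As a hedged illustration of the configuration knob they describe (spark.sql.session.timeZone, available from Spark 2.2 onward; foo96.parquet is the illustrative file written earlier):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[4]').getOrCreate()

# Parquet timestamps are normalized to UTC on write; the session time
# zone only controls how values are rendered and parsed in SQL.
spark.conf.set('spark.sql.session.timeZone', 'UTC')

spark.read.parquet('foo96.parquet').show(truncate=False)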
