
Timestamp metadata and Spark

See original GitHub issue

OK, so I create a pandas dataframe that has a timestamp column. I save this to Parquet using fastparquet and then read the data with Spark. I find that my Spark dataframe identifies my timestamp column as an integer column. Is there perhaps some special metadata that Spark is looking out for?

Example

In [1]: import pandas as pd

In [2]: import pyspark

In [3]: import fastparquet

In [4]: df = pd.DataFrame({'x': [1, 2, 3]})

In [5]: df['x'] = pd.to_datetime(df.x)

In [6]: df
Out[6]: 
                              x
0 1970-01-01 00:00:00.000000001
1 1970-01-01 00:00:00.000000002
2 1970-01-01 00:00:00.000000003

In [7]: sc = pyspark.SparkContext('local[4]')
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
17/02/27 17:13:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/02/27 17:13:11 WARN Utils: Your hostname, carbon resolves to a loopback address: 127.0.1.1; using 192.168.1.115 instead (on interface wlp4s0)
17/02/27 17:13:11 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/02/27 17:13:11 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.

In [8]: sql = pyspark.SQLContext(sc)

In [9]: fastparquet.write('foo.parquet', df)

In [10]: sdf = sql.read.parquet('foo.parquet')
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

In [11]: sdf
Out[11]: DataFrame[x: bigint]
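
For context: fastparquet stores timestamps as INT64 values by default, while Spark releases of that era (pre-2.2) generally only recognized the INT96 timestamp encoding, which is why the column comes back as bigint. Below is a minimal sketch of the write-side workaround, assuming a fastparquet version that supports the times keyword (the file name foo96.parquet is just for illustration):

import pandas as pd
import fastparquet

# Same data as the session above: integers interpreted as nanosecond
# timestamps since the epoch.
df = pd.DataFrame({'x': pd.to_datetime([1, 2, 3])})

# times='int96' stores timestamps in the INT96 layout that Spark itself
# writes and recognizes, instead of fastparquet's default INT64 encoding.
fastparquet.write('foo96.parquet', df, times='int96')

# Re-reading in the same Spark session should now report a timestamp
# column rather than bigint.
sdf = sql.read.parquet('foo96.parquet')
print(sdf)  # expected: DataFrame[x: timestamp]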

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 31 (19 by maintainers)

Top GitHub Comments

3 reactions
martindurant commented, Mar 6, 2017

Use to_pandas(timestamp96=['inserted2']) to automatically convert the S12-type column to times.
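
A minimal sketch of that call for the read direction, i.e. loading a Spark-written INT96 file back into pandas. The path spark_output.parquet is hypothetical; the column name 'inserted2' is the one from the comment above, so substitute your own:

import fastparquet

pf = fastparquet.ParquetFile('spark_output.parquet')  # hypothetical path

# Spark's INT96 timestamps surface in fastparquet as raw 12-byte (S12)
# fixed-length values; timestamp96=[...] asks to_pandas to decode the
# named columns into datetime64 instead.
df = pf.to_pandas(timestamp96=['inserted2'])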

1 reaction
martindurant commented, Sep 8, 2017

Not any time soon 😃

Read more comments on GitHub >

Top Results From Across the Web

How to Effectively Use Dates and Timestamps in Spark 3.0
Spark SQL defines the timestamp type as TIMESTAMP WITH SESSION TIME ZONE, which is a combination of the fields (YEAR, ...

current_timestamp Function For Processing Time in Streaming ...
Both are special for Spark Structured Streaming as StreamExecution replaces their underlying Catalyst expressions, CurrentTimestamp and CurrentDate ...

TIMESTAMP compatibility for Parquet files | CDP Private Cloud
When writing Parquet files, Hive and Spark SQL both normalize all TIMESTAMP values to the UTC time zone. During a query, Spark SQL ...

Spark Queries - Apache Iceberg
Metadata tables, like history and snapshots, can use the Iceberg table name ... Spark 3.3 and later supports time travel in SQL ...

How to read timestamp csv file in pyspark? - Stack Overflow
df_read_file = spark.read.format("com.databricks.spark.csv") ... .load("/app/HTA/SrcFiles/inbound/metadata/projectno_without_data_*")
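
The first and third results above concern Spark's session time zone semantics. As a hedged illustration of the configuration knob they describe (spark.sql.session.timeZone, available from Spark 2.2 onward; foo96.parquet is the illustrative file written earlier):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[4]').getOrCreate()

# Parquet timestamps are normalized to UTC on write; the session time
# zone only controls how values are rendered and parsed in SQL.
spark.conf.set('spark.sql.session.timeZone', 'UTC')

spark.read.parquet('foo96.parquet').show(truncate=False)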
