Timestamp metadata and Spark
OK, so I create a pandas dataframe that has a timestamp column. I save it to parquet using fastparquet and then read the data back with Spark, and I find that the Spark dataframe identifies my timestamp column as an integer column. Is there perhaps some special metadata that Spark is looking for?
Example
In [1]: import pandas as pd
In [2]: import pyspark
In [3]: import fastparquet
In [4]: df = pd.DataFrame({'x': [1, 2, 3]})
In [5]: df['x'] = pd.to_datetime(df.x)
In [6]: df
Out[6]:
                               x
0  1970-01-01 00:00:00.000000001
1  1970-01-01 00:00:00.000000002
2  1970-01-01 00:00:00.000000003
In [7]: sc = pyspark.SparkContext('local[4]')
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
17/02/27 17:13:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/02/27 17:13:11 WARN Utils: Your hostname, carbon resolves to a loopback address: 127.0.1.1; using 192.168.1.115 instead (on interface wlp4s0)
17/02/27 17:13:11 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/02/27 17:13:11 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
In [8]: sql = pyspark.SQLContext(sc)
In [9]: fastparquet.write('foo.parquet', df)
In [10]: sdf = sql.read.parquet('foo.parquet')
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
In [11]: sdf
Out[11]: DataFrame[x: bigint]
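A possible write-side workaround, sketched here rather than taken verbatim from the issue: fastparquet's write function accepts a times keyword, and times='int96' stores timestamps in the legacy 96-bit encoding that Spark's Parquet reader of this era treats as a timestamp type. The file name foo96.parquet is just for illustration.

import pandas as pd
import fastparquet

df = pd.DataFrame({'x': pd.to_datetime([1, 2, 3])})

# times='int96' writes the timestamp column in the deprecated INT96
# layout instead of the default INT64 nanosecond integers.
fastparquet.write('foo96.parquet', df, times='int96')

# Reading foo96.parquet back in Spark should then report
# DataFrame[x: timestamp] rather than DataFrame[x: bigint].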
Use to_pandas(timestamp96=['inserted2']) to automatically convert the S12-type column to times.

Not any time soon 😃
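On the read-back side, a minimal sketch of what that comment suggests, assuming a fastparquet version that supports the timestamp96 keyword; 'inserted2' is the column name used in the thread and 'foo96.parquet' is a placeholder path:

import fastparquet

pf = fastparquet.ParquetFile('foo96.parquet')

# Columns listed in timestamp96 are stored as 12-byte (S12) int96 values;
# naming them here converts them back to datetime64 on read.
out = pf.to_pandas(timestamp96=['inserted2'])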