pandas results serializer proposal
Pandas DataFrames are very common in Python data workflows, and are often saved to disk in efficient binary formats.

I'm sharing the Parquet and Feather serializers that I am using, in case they might also benefit others.

I'm not sure where, or even if, these pandas-specific serializers fit into Prefect core. What are the team's thoughts? If we go down this route we will probably want to add CSV as well.
```python
from io import BytesIO

import pandas as pd
import pyarrow.feather as pf

from prefect.engine.serializers import Serializer


class FeatherSerializer(Serializer):
    def serialize(self, value: pd.DataFrame) -> bytes:
        # transform a Python object into bytes
        bytes_buffer = BytesIO()
        pf.write_feather(
            df=value,
            dest=bytes_buffer,
            version=2,
        )
        return bytes_buffer.getvalue()

    def deserialize(self, value: bytes) -> pd.DataFrame:
        # recover a Python object from bytes
        df_bytes_io = BytesIO(value)
        return pd.read_feather(df_bytes_io)


class ParquetSerializer(Serializer):
    def serialize(self, value: pd.DataFrame) -> bytes:
        # transform a Python object into bytes
        bytes_buffer = BytesIO()
        value.to_parquet(
            path=bytes_buffer,
            index=False,
        )
        return bytes_buffer.getvalue()

    def deserialize(self, value: bytes) -> pd.DataFrame:
        # recover a Python object from bytes
        df_bytes_io = BytesIO(value)
        return pd.read_parquet(df_bytes_io)
```
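The contract both classes implement is simply that `deserialize(serialize(obj))` returns an equivalent object, with `bytes` as the interchange type. As a minimal, dependency-free sketch of that same round-trip contract (the `JSONSerializer` name here is hypothetical and not part of Prefect; it only illustrates the pattern without requiring pandas or pyarrow):

```python
import json
from io import BytesIO


class JSONSerializer:
    """Stand-in illustrating the serialize/deserialize round-trip contract."""

    def serialize(self, value) -> bytes:
        # transform a Python object into bytes
        bytes_buffer = BytesIO()
        bytes_buffer.write(json.dumps(value).encode("utf-8"))
        return bytes_buffer.getvalue()

    def deserialize(self, value: bytes):
        # recover a Python object from bytes
        return json.loads(BytesIO(value).read().decode("utf-8"))


serializer = JSONSerializer()
payload = serializer.serialize({"a": [1, 2], "b": "x"})
restored = serializer.deserialize(payload)  # equivalent to the input dict
```

The pandas serializers above follow exactly this shape, just with a binary columnar format in place of JSON.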
Issue analytics: created 3 years ago · 7 comments

I may be rapidly losing my authority on the topic as I've not had much time to spend on this recently, but my personal perspective is that the best way to handle pandas serializing is to provide users with a single entry point: instead of having a `CSVSerializer`, `ParquetSerializer`, and so on, you provide a `PandasSerializer` that takes how you want to serialize as an argument. I was thinking the options would be only the things that pandas supports (i.e. only formats with a `DataFrame.to_[thing]` method), although I'm not sure where that would leave the Feather support, since in your example the serializer doesn't use a DataFrame method.

Fixed by #3020, closing.
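The single-entry-point idea can be sketched as a serializer that dispatches to the DataFrame's own `to_<format>` method by name. This is only an illustration of the dispatch pattern, not the actual implementation that landed in #3020; the `FakeFrame` class below is a stand-in so the sketch runs without pandas installed:

```python
from io import BytesIO


class PandasSerializer:
    """Sketch: one serializer, parameterized by format name.

    Dispatches to value.to_<file_type>, so it covers anything pandas
    exposes as a DataFrame.to_[thing] writer (csv, parquet, json, ...).
    Deserialization would dispatch symmetrically to pd.read_<file_type>.
    """

    def __init__(self, file_type: str, serialize_kwargs: dict = None):
        self.file_type = file_type
        self.serialize_kwargs = serialize_kwargs or {}

    def serialize(self, value) -> bytes:
        # look up e.g. DataFrame.to_csv / DataFrame.to_parquet by name
        writer = getattr(value, f"to_{self.file_type}")
        bytes_buffer = BytesIO()
        writer(bytes_buffer, **self.serialize_kwargs)
        return bytes_buffer.getvalue()


class FakeFrame:
    """Hypothetical stand-in for a DataFrame, used only for this demo."""

    def to_csv(self, buf, **kwargs):
        buf.write(b"a,b\n1,2\n")


payload = PandasSerializer("csv").serialize(FakeFrame())  # b"a,b\n1,2\n"
```

One wrinkle with this design, as noted above, is Feather: the example serializer writes via `pyarrow.feather.write_feather` rather than a `DataFrame.to_*` method, so it would need special-casing (or a switch to `DataFrame.to_feather`, which pandas does provide).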