pandas results serializer proposal


Pandas dataframes are very common in python data workflows, often saved to disk in efficient binary formats.

I’m sharing parquet and feather serializers that I am using and might also benefit others.

I’m not sure where, or even if, these pandas-specific serializers fit into Prefect core. What are the team’s thoughts? If we go down this route, we will probably want to add CSV as well.

from io import BytesIO
import pandas as pd
import pyarrow.feather as pf
from prefect.engine.serializers import Serializer


class FeatherSerializer(Serializer):

    def serialize(self, value: pd.DataFrame) -> bytes:
        # transform a Python object into bytes        
        bytes_buffer = BytesIO()
        pf.write_feather(
            df=value,
            dest=bytes_buffer,
            version=2,
        )
        return bytes_buffer.getvalue()

    def deserialize(self, value: bytes) -> pd.DataFrame:
        # recover a Python object from bytes
        df_bytes_io = BytesIO(value)
        return pd.read_feather(df_bytes_io)


class ParquetSerializer(Serializer):

    def serialize(self, value: pd.DataFrame) -> bytes:
        # transform a Python object into bytes
        bytes_buffer = BytesIO()
        value.to_parquet(
            path=bytes_buffer,
            index=False
        )
        return bytes_buffer.getvalue()

    def deserialize(self, value: bytes) -> pd.DataFrame:
        # recover a Python object from bytes
        df_bytes_io = BytesIO(value)
        return pd.read_parquet(df_bytes_io)
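
The CSV serializer mentioned above might follow the same pattern. The following is only a minimal sketch: the class name `CSVSerializer` is hypothetical, and it is written as a plain class so it runs without Prefect installed. In Prefect it would presumably subclass `Serializer` like the two classes above.

```python
from io import BytesIO

import pandas as pd


class CSVSerializer:
    """Hypothetical CSV counterpart to the Feather and Parquet serializers."""

    def serialize(self, value: pd.DataFrame) -> bytes:
        # to_csv returns a str when no path is given; encode it to bytes
        return value.to_csv(index=False).encode("utf-8")

    def deserialize(self, value: bytes) -> pd.DataFrame:
        # read_csv accepts any file-like object, including BytesIO
        return pd.read_csv(BytesIO(value))


# Round trip a small frame to check that the two methods are inverses
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
serializer = CSVSerializer()
restored = serializer.deserialize(serializer.serialize(df))
```

Unlike Parquet and Feather, CSV does not store column dtypes, so types are re-inferred on read and may differ from the original frame for columns such as dates or categoricals.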

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7

Top GitHub Comments

2 reactions
AndrewRook commented, Jul 9, 2020

I may be rapidly losing my authority on the topic, as I’ve not had much time to spend on this recently, but my personal perspective is that the best way to handle pandas serialization is to give users a single entry point: instead of having a CSVSerializer, a ParquetSerializer, and so on, you provide one PandasSerializer that takes the serialization format as an argument. I was thinking the options would be limited to what pandas supports (i.e. only formats with a DataFrame.to_[thing] method), although I’m not sure where that would leave feather support, since the serializer in your example doesn’t use a DataFrame method.
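
A single entry point of this kind could be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation that was eventually merged: the class name `PandasFormatSerializer` and its parameters `file_type`, `serialize_kwargs`, and `deserialize_kwargs` are hypothetical, the class does not subclass Prefect's `Serializer` so that it runs standalone, and it assumes a pandas version whose `to_csv` accepts a binary buffer.

```python
from io import BytesIO
from typing import Any, Dict, Optional

import pandas as pd


class PandasFormatSerializer:
    """Hypothetical serializer that dispatches on a pandas format name."""

    def __init__(
        self,
        file_type: str,
        serialize_kwargs: Optional[Dict[str, Any]] = None,
        deserialize_kwargs: Optional[Dict[str, Any]] = None,
    ) -> None:
        # Only accept formats that pandas can both write and read
        has_writer = hasattr(pd.DataFrame, f"to_{file_type}")
        has_reader = hasattr(pd, f"read_{file_type}")
        if not (has_writer and has_reader):
            raise ValueError(f"pandas has no to_{file_type}/read_{file_type} pair")
        self.file_type = file_type
        self.serialize_kwargs = serialize_kwargs or {}
        self.deserialize_kwargs = deserialize_kwargs or {}

    def serialize(self, value: pd.DataFrame) -> bytes:
        # Look up DataFrame.to_<file_type> and write into an in-memory buffer
        buffer = BytesIO()
        writer = getattr(value, f"to_{self.file_type}")
        writer(buffer, **self.serialize_kwargs)
        return buffer.getvalue()

    def deserialize(self, value: bytes) -> pd.DataFrame:
        # Look up pandas.read_<file_type> and read from an in-memory buffer
        reader = getattr(pd, f"read_{self.file_type}")
        return reader(BytesIO(value), **self.deserialize_kwargs)


# Usage: CSV round trip, passing writer options through serialize_kwargs
csv_serializer = PandasFormatSerializer("csv", serialize_kwargs={"index": False})
frame = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})
round_tripped = csv_serializer.deserialize(csv_serializer.serialize(frame))
```

Since pandas also provides `DataFrame.to_feather` and `pandas.read_feather` (both require pyarrow), `"feather"` would fit the same dispatch. The name check is deliberately loose: some pairs that pass it, such as `to_sql`/`read_sql`, do not take a bytes buffer, so a real implementation would likely need an explicit allow-list.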

0 reactions
jcrist commented, Aug 11, 2020

Fixed by #3020, closing.
