
pandas results serializer proposal

Pandas DataFrames are very common in Python data workflows and are often saved to disk in efficient binary formats.

I’m sharing the Parquet and Feather serializers that I am using, in case they benefit others.

I’m not sure where, or even if, these pandas-specific serializers fit into Prefect Core. What are the team’s thoughts? If we go down this route, we will probably want to add CSV as well.

from io import BytesIO
import pandas as pd
import pyarrow.feather as pf
from prefect.engine.serializers import Serializer


class FeatherSerializer(Serializer):

    def serialize(self, value: pd.DataFrame) -> bytes:
        # transform a Python object into bytes
        bytes_buffer = BytesIO()
        pf.write_feather(
            df=value,
            dest=bytes_buffer,
            version=2,
        )
        return bytes_buffer.getvalue()

    def deserialize(self, value: bytes) -> pd.DataFrame:
        # recover a Python object from bytes
        df_bytes_io = BytesIO(value)
        return pd.read_feather(df_bytes_io)


class ParquetSerializer(Serializer):

    def serialize(self, value: pd.DataFrame) -> bytes:
        # transform a Python object into bytes
        bytes_buffer = BytesIO()
        # index=False drops the DataFrame index, so it is not restored
        # when the bytes are deserialized
        value.to_parquet(
            path=bytes_buffer,
            index=False
        )
        return bytes_buffer.getvalue()

    def deserialize(self, value: bytes) -> pd.DataFrame:
        # recover a Python object from bytes
        df_bytes_io = BytesIO(value)
        return pd.read_parquet(df_bytes_io)
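
The CSV variant mentioned above could look roughly like the following. This is only a sketch in the same style as the two classes above: the name `CSVSerializer` is hypothetical, and the class is left standalone here so it runs without Prefect installed. Inside Prefect it would presumably subclass `Serializer` the same way the Feather and Parquet classes do.

```python
from io import BytesIO

import pandas as pd


class CSVSerializer:
    """Hypothetical CSV counterpart to the serializers above (sketch only)."""

    def serialize(self, value: pd.DataFrame) -> bytes:
        # to_csv returns a str when no path is given; encode it to bytes.
        # index=False mirrors the Parquet example and drops the index.
        return value.to_csv(index=False).encode("utf-8")

    def deserialize(self, value: bytes) -> pd.DataFrame:
        # recover a DataFrame from the CSV bytes
        return pd.read_csv(BytesIO(value))
```

Note that CSV does not store dtypes, so a round trip relies on pandas re-inferring them when reading.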

Issue Analytics

Created 3 years ago
Comments: 7

Top GitHub Comments

2 reactions

AndrewRook commented, Jul 9, 2020

I may be rapidly losing my authority on the topic as I’ve not had much time to spend on this recently, but my personal perspective is that the best way to handle pandas serializing is to provide users with a single entry point: instead of having a CSVSerializer, ParquetSerializer, and so on, you provide a PandasSerializer that takes the serialization format as an argument. I was thinking the options would be only the things that pandas supports (i.e. only formats with a DataFrame.to_[thing] method), although I’m not sure where that would leave the feather support, since in your example the serializer doesn’t use a DataFrame method.
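
One way to picture the single-entry-point idea is the sketch below. It is an illustration under assumptions, not the implementation that was eventually merged: the constructor arguments `file_type`, `serialize_kwargs`, and `deserialize_kwargs` are hypothetical names, the class is standalone rather than a Prefect `Serializer` subclass so it runs without Prefect, and writing text formats such as CSV into a `BytesIO` assumes pandas 1.2 or newer. It looks up `DataFrame.to_<file_type>` and `pd.read_<file_type>` by name. pandas also provides `DataFrame.to_feather` (backed by pyarrow), so Feather would fit the same pattern.

```python
from io import BytesIO
from typing import Optional

import pandas as pd


class PandasSerializer:
    """Hypothetical single entry point for DataFrame serialization (sketch)."""

    def __init__(
        self,
        file_type: str,
        serialize_kwargs: Optional[dict] = None,
        deserialize_kwargs: Optional[dict] = None,
    ) -> None:
        # Only accept formats that pandas can both write and read back,
        # i.e. DataFrame.to_<file_type> and pd.read_<file_type> both exist.
        if not hasattr(pd.DataFrame, f"to_{file_type}") or not hasattr(
            pd, f"read_{file_type}"
        ):
            raise ValueError(f"pandas cannot round-trip file type {file_type!r}")
        self.file_type = file_type
        self.serialize_kwargs = serialize_kwargs or {}
        self.deserialize_kwargs = deserialize_kwargs or {}

    def serialize(self, value: pd.DataFrame) -> bytes:
        # Dispatch to the matching DataFrame.to_* method and write into memory.
        buffer = BytesIO()
        writer = getattr(value, f"to_{self.file_type}")
        writer(buffer, **self.serialize_kwargs)
        return buffer.getvalue()

    def deserialize(self, value: bytes) -> pd.DataFrame:
        # Dispatch to the matching pd.read_* function.
        reader = getattr(pd, f"read_{self.file_type}")
        return reader(BytesIO(value), **self.deserialize_kwargs)


# Usage: choose the format once, then round-trip a small frame.
serializer = PandasSerializer("csv", serialize_kwargs={"index": False})
frame = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
restored = serializer.deserialize(serializer.serialize(frame))
```

Passing keyword arguments through to pandas keeps format-specific options (such as `index=False`) out of the serializer itself, at the cost of less validation up front.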

0 reactions

jcrist commented, Aug 11, 2020

Fixed by #3020, closing.