pyarrow.hdfs.HadoopFileSystem not serializable by Spark
Hi again,
I am using pyspark 2.4.0 and pyarrow 0.11.1, and Spark is not able to serialize pyarrow.hdfs.HadoopFileSystem. Have you encountered this issue before?
I get “HDFS connection failed” during serialization, which is a bit strange. I can open a new filesystem connection inside the Spark mapper with no problem, so it should not be a connection issue but rather a serialization issue.
Snippet from stacktrace:
File "/srv/hops/hopsdata/tmp/nm-local-dir/usercache/N8YmHUGK9tr5Q9_iz7ZdAb0oU66QXgWDdYzH4tE4wgI/appcache/application_1547648243443_0001/container_e01_1547648243443_0001_01_000002/pyspark.zip/pyspark/serializers.py", line 566, in loads return pickle.loads(obj, encoding=encoding) File "/srv/hops/anaconda/anaconda/envs/petastorm/lib/python3.6/site-packages/pyarrow/hdfs.py", line 37, in __init__ self._connect(host, port, user, kerb_ticket, driver, extra_conf) File "pyarrow/io-hdfs.pxi", line 105, in pyarrow.lib.HadoopFileSystem._connect File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status pyarrow.lib.ArrowIOError: HDFS connection failed
Thanks /Kim
Top GitHub Comments
I see that the local filesystem is serializable. Probably the libhdfs3-based filesystem is serializable as well, while the libhdfs one is not. That explains why we never ran into it. We’ll get it fixed in the next release. Thanks for bringing this up!
@maver1ck / @Limmen - #310 has now landed, so you should be able to pass a filesystem factory method to materialize_dataset() (or leave it empty to use the local filesystem). I’ll close this issue for now, but feel free to reopen if the issue resurfaces.
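For reference, a rough sketch of how such a factory might be passed once a petastorm release containing #310 is installed. The filesystem_factory keyword name, the Spark session, schema, and URLs below are assumptions for illustration; check the materialize_dataset() signature in your petastorm version.

from pyarrow import hdfs
from petastorm.etl.dataset_metadata import materialize_dataset

def hdfs_filesystem_factory():
    # Re-creates the connection wherever a filesystem is needed, instead of
    # pickling a driver-side HadoopFileSystem object.
    return hdfs.connect('namenode-host', 8020)

# `spark`, `MySchema`, and the output URL are placeholders.
with materialize_dataset(spark, 'hdfs://namenode-host:8020/datasets/example',
                         MySchema, row_group_size_mb=256,
                         filesystem_factory=hdfs_filesystem_factory):
    # ... write the dataset with Spark inside this context manager ...
    pass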