df.apply() giving a TypeError
System information
- **OS Platform**: Windows
- **Modin version** (`modin.__version__`): 0.8.1.1
- **Python version**: 3.7.2
- **Code we can use to reproduce**: `data["City_lat"] = data["City"].apply(lambda x: city_lattitude(x))`
Describe the problem
While using the DataFrame `apply()` function, I get `TypeError: can't pickle _thread.lock objects`.
Source code / logs
```
TypeError                                 Traceback (most recent call last)
<ipython-input-34-bd640fe633ce> in <module>
----> 1 data["City_lat"] = data["City"].apply(lambda x: city_lattitude(x))

~\Anaconda3\lib\site-packages\modin\pandas\series.py in apply(self, func, convert_dtype, args, **kwds)
    531         if isinstance(f, np.ufunc):
    532             return f(self)
--> 533         result = self.map(f)._query_compiler
    534         if return_type not in ["DataFrame", "Series"]:
    535             # sometimes result can be not a query_compiler, but scalar (for example

~\Anaconda3\lib\site-packages\modin\pandas\series.py in map(self, arg, na_action)
   1056         return self.__constructor__(
   1057             query_compiler=self._query_compiler.applymap(
-> 1058                 lambda s: arg(s)
   1059                 if pandas.isnull(s) is not True or na_action is None
   1060                 else s

~\Anaconda3\lib\site-packages\modin\data_management\functions\mapfunction.py in caller(query_compiler, *args, **kwargs)
     21         return query_compiler.__constructor__(
     22             query_compiler._modin_frame._map(
---> 23                 lambda x: function(x, *args, **kwargs), *call_args, **call_kwds
     24             )
     25         )

~\Anaconda3\lib\site-packages\modin\engines\base\frame\data.py in _map(self, func, dtypes, validate_index, validate_columns)
   1097             A new dataframe.
   1098         """
-> 1099         new_partitions = self._frame_mgr_cls.lazy_map_partitions(self._partitions, func)
   1100         if dtypes == "copy":
   1101             dtypes = self._dtypes

~\Anaconda3\lib\site-packages\modin\engines\base\frame\partition_manager.py in lazy_map_partitions(cls, partitions, map_func)
    281     @classmethod
    282     def lazy_map_partitions(cls, partitions, map_func):
--> 283         preprocessed_map_func = cls.preprocess_func(map_func)
    284         return np.array(
    285             [

~\Anaconda3\lib\site-packages\modin\engines\base\frame\partition_manager.py in preprocess_func(cls, map_func)
     49             being used).
     50         """
---> 51         return cls._partition_class.preprocess_func(map_func)
     52
     53     # END Abstract Methods

~\Anaconda3\lib\site-packages\modin\engines\ray\pandas_on_ray\frame\partition.py in preprocess_func(cls, func)
    153             A ray.ObjectID.
    154         """
--> 155         return ray.put(func)
    156
    157     def length(self):

~\Anaconda3\lib\site-packages\ray\worker.py in put(value)
   1454     with profiling.profile("ray.put"):
   1455         try:
-> 1456             object_ref = worker.put_object(value, pin_object=True)
   1457         except ObjectStoreFullError:
   1458             logger.info(

~\Anaconda3\lib\site-packages\ray\worker.py in put_object(self, value, object_ref, pin_object)
    263                 "inserting with an ObjectRef")
    264
--> 265         serialized_value = self.get_serialization_context().serialize(value)
    266         # This must be the first place that we construct this python
    267         # ObjectRef because an entry with 0 local references is created when

~\Anaconda3\lib\site-packages\ray\serialization.py in serialize(self, value)
    402             return RawSerializedObject(value)
    403         else:
--> 404             return self._serialize_to_msgpack(value)
    405
    406     def register_custom_serializer(self,

~\Anaconda3\lib\site-packages\ray\serialization.py in _serialize_to_msgpack(self, value)
    382             metadata = ray_constants.OBJECT_METADATA_TYPE_PYTHON
    383             pickle5_serialized_object = \
--> 384                 self._serialize_to_pickle5(metadata, python_objects)
    385         else:
    386             pickle5_serialized_object = None

~\Anaconda3\lib\site-packages\ray\serialization.py in _serialize_to_pickle5(self, metadata, value)
    342         except Exception as e:
    343             self.get_and_clear_contained_object_refs()
--> 344             raise e
    345         finally:
    346             self.set_out_of_band_serialization()

~\Anaconda3\lib\site-packages\ray\serialization.py in _serialize_to_pickle5(self, metadata, value)
    339             self.set_in_band_serialization()
    340             inband = pickle.dumps(
--> 341                 value, protocol=5, buffer_callback=writer.buffer_callback)
    342         except Exception as e:
    343             self.get_and_clear_contained_object_refs()

~\Anaconda3\lib\site-packages\ray\cloudpickle\cloudpickle_fast.py in dumps(obj, protocol, buffer_callback)
     68     with io.BytesIO() as file:
     69         cp = CloudPickler(file, protocol=protocol, buffer_callback=buffer_callback)
---> 70         cp.dump(obj)
     71     return file.getvalue()
     72

~\Anaconda3\lib\site-packages\ray\cloudpickle\cloudpickle_fast.py in dump(self, obj)
    654     def dump(self, obj):
    655         try:
--> 656             return Pickler.dump(self, obj)
    657         except RuntimeError as e:
    658             if "recursion" in e.args[0]:

TypeError: can't pickle _thread.lock objects
```
@amitmeel I'm not a Ray or Dask developer, so the following explanation may be somewhat inaccurate:
Ray needs to serialize objects in order to put them into shared storage that is available to every node; Ray uses the Plasma store for that purpose. As the serialization/deserialization mechanism, Ray uses pickle. You can learn more in the Ray documentation.
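The failure is easy to reproduce outside of Modin and Ray. Below is a minimal illustration (not from the original issue; the `GeocoderWrapper` class is a hypothetical stand-in): any object that holds a `threading.Lock` cannot be pickled, so it cannot be shipped to the Plasma store either.

```python
import pickle
import threading


class GeocoderWrapper:
    """Hypothetical stand-in for an object that keeps a lock internally,
    e.g. a geocoder/session captured by the function passed to apply()."""

    def __init__(self):
        self._lock = threading.Lock()  # _thread.lock objects cannot be pickled


try:
    pickle.dumps(GeocoderWrapper())
except TypeError as err:
    # On Python 3.7 this prints: can't pickle _thread.lock objects
    print(err)
```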
And a bit more about using geopy. As I understand it, calling `geocode` makes a request to the map provider's server, which is a network call and therefore a blocking operation. Even when we parallelize computations with Modin, every partition will still be slowed down by these blocking server calls. By default geopy makes its requests synchronously, but according to its documentation it also has an async mode, which should be much faster. So instead of the `df.apply` approach, which applies the function to every value in a Series sequentially (and in our case the function is blocking), I recommend doing something like the sketch below.
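The code block from the original comment did not survive the export to this page. What follows is a minimal reconstruction sketch of the async geopy idea, assuming geopy 2.x with its `AioHTTPAdapter` and the Nominatim provider; the `fetch_latitudes` helper and the `user_agent` value are illustrative, not the author's original code.

```python
import asyncio

from geopy.adapters import AioHTTPAdapter
from geopy.geocoders import Nominatim


async def fetch_latitudes(cities):
    # One geocoder shared by all requests; the requests run concurrently
    # instead of blocking one apply() call at a time.
    # Note: a real implementation should throttle to respect provider rate limits.
    async with Nominatim(
        user_agent="my-app",  # placeholder: use your own user agent
        adapter_factory=AioHTTPAdapter,
    ) as geolocator:
        locations = await asyncio.gather(
            *(geolocator.geocode(city) for city in cities)
        )
    return [loc.latitude if loc is not None else None for loc in locations]


# Convert the column to plain Python values so nothing unpicklable has to be
# serialized by the engine, then assign the result back.
data["City_lat"] = asyncio.run(fetch_latitudes(list(data["City"])))
```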
I think we can close it. If an object is not serializable, we cannot do distributed processing.
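As an aside (not part of the original thread): the unpicklable lock most likely lives inside an object captured by the applied function, for example a geocoder created once at module level. A minimal sketch, under that assumption, of keeping the function itself picklable by creating the problematic object lazily inside it; the `city_latitude` helper, provider, and `user_agent` are hypothetical.

```python
from geopy.geocoders import Nominatim

_geolocator = None  # created lazily; keep it None in the driver process


def city_latitude(city):
    # The geocoder (which may hold sessions/locks) is constructed only where
    # the function actually runs, so it is never pickled with the function.
    global _geolocator
    if _geolocator is None:
        _geolocator = Nominatim(user_agent="my-app")  # placeholder user agent
    location = _geolocator.geocode(city)
    return location.latitude if location is not None else None


data["City_lat"] = data["City"].apply(city_latitude)
```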