df.apply() giving a TypeError
System information
- **OS Platform**: Windows
- **Modin version** (`modin.__version__`): 0.8.1.1
- **Python version**: 3.7.2
- **Code we can use to reproduce**: `data["City_lat"] = data["City"].apply(lambda x: city_lattitude(x))`
Describe the problem
While using the DataFrame `apply()` function, I get `TypeError: can't pickle _thread.lock objects`.
Source code / logs
```
TypeError                                 Traceback (most recent call last)
<ipython-input-34-bd640fe633ce> in <module>
----> 1 data["City_lat"] = data["City"].apply(lambda x: city_lattitude(x))

~\Anaconda3\lib\site-packages\modin\pandas\series.py in apply(self, func, convert_dtype, args, **kwds)
    531         if isinstance(f, np.ufunc):
    532             return f(self)
--> 533         result = self.map(f)._query_compiler
    534         if return_type not in ["DataFrame", "Series"]:
    535             # sometimes result can be not a query_compiler, but scalar (for example

~\Anaconda3\lib\site-packages\modin\pandas\series.py in map(self, arg, na_action)
   1056         return self.__constructor__(
   1057             query_compiler=self._query_compiler.applymap(
-> 1058                 lambda s: arg(s)
   1059                 if pandas.isnull(s) is not True or na_action is None
   1060                 else s

~\Anaconda3\lib\site-packages\modin\data_management\functions\mapfunction.py in caller(query_compiler, *args, **kwargs)
     21         return query_compiler.__constructor__(
     22             query_compiler._modin_frame._map(
---> 23                 lambda x: function(x, *args, **kwargs), *call_args, **call_kwds
     24             )
     25         )

~\Anaconda3\lib\site-packages\modin\engines\base\frame\data.py in _map(self, func, dtypes, validate_index, validate_columns)
   1097             A new dataframe.
   1098         """
-> 1099         new_partitions = self._frame_mgr_cls.lazy_map_partitions(self._partitions, func)
   1100         if dtypes == "copy":
   1101             dtypes = self._dtypes

~\Anaconda3\lib\site-packages\modin\engines\base\frame\partition_manager.py in lazy_map_partitions(cls, partitions, map_func)
    281     @classmethod
    282     def lazy_map_partitions(cls, partitions, map_func):
--> 283         preprocessed_map_func = cls.preprocess_func(map_func)
    284         return np.array(
    285             [

~\Anaconda3\lib\site-packages\modin\engines\base\frame\partition_manager.py in preprocess_func(cls, map_func)
     49             being used).
     50         """
---> 51         return cls._partition_class.preprocess_func(map_func)
     52
     53     # END Abstract Methods

~\Anaconda3\lib\site-packages\modin\engines\ray\pandas_on_ray\frame\partition.py in preprocess_func(cls, func)
    153             A ray.ObjectID.
    154         """
--> 155         return ray.put(func)
    156
    157     def length(self):

~\Anaconda3\lib\site-packages\ray\worker.py in put(value)
   1454     with profiling.profile("ray.put"):
   1455         try:
-> 1456             object_ref = worker.put_object(value, pin_object=True)
   1457         except ObjectStoreFullError:
   1458             logger.info(

~\Anaconda3\lib\site-packages\ray\worker.py in put_object(self, value, object_ref, pin_object)
    263                 "inserting with an ObjectRef")
    264
--> 265         serialized_value = self.get_serialization_context().serialize(value)
    266         # This must be the first place that we construct this python
    267         # ObjectRef because an entry with 0 local references is created when

~\Anaconda3\lib\site-packages\ray\serialization.py in serialize(self, value)
    402             return RawSerializedObject(value)
    403         else:
--> 404             return self._serialize_to_msgpack(value)
    405
    406     def register_custom_serializer(self,

~\Anaconda3\lib\site-packages\ray\serialization.py in _serialize_to_msgpack(self, value)
    382             metadata = ray_constants.OBJECT_METADATA_TYPE_PYTHON
    383             pickle5_serialized_object = \
--> 384                 self._serialize_to_pickle5(metadata, python_objects)
    385         else:
    386             pickle5_serialized_object = None

~\Anaconda3\lib\site-packages\ray\serialization.py in _serialize_to_pickle5(self, metadata, value)
    342         except Exception as e:
    343             self.get_and_clear_contained_object_refs()
--> 344             raise e
    345         finally:
    346             self.set_out_of_band_serialization()

~\Anaconda3\lib\site-packages\ray\serialization.py in _serialize_to_pickle5(self, metadata, value)
    339             self.set_in_band_serialization()
    340             inband = pickle.dumps(
--> 341                 value, protocol=5, buffer_callback=writer.buffer_callback)
    342         except Exception as e:
    343             self.get_and_clear_contained_object_refs()

~\Anaconda3\lib\site-packages\ray\cloudpickle\cloudpickle_fast.py in dumps(obj, protocol, buffer_callback)
     68     with io.BytesIO() as file:
     69         cp = CloudPickler(file, protocol=protocol, buffer_callback=buffer_callback)
---> 70         cp.dump(obj)
     71     return file.getvalue()
     72

~\Anaconda3\lib\site-packages\ray\cloudpickle\cloudpickle_fast.py in dump(self, obj)
    654     def dump(self, obj):
    655         try:
--> 656             return Pickler.dump(self, obj)
    657         except RuntimeError as e:
    658             if "recursion" in e.args[0]:

TypeError: can't pickle _thread.lock objects
```
@amitmeel I'm not a Ray or Dask developer, so the following explanation may be somewhat inaccurate:
Ray needs to serialize objects in order to put them into shared storage that is available to every node; Ray uses the Plasma store for that purpose. As the serialization/deserialization mechanism, Ray uses pickle. You can learn more in the Ray documentation.
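The failure is easy to reproduce outside of Modin and Ray. Below is a minimal illustration (not from the original issue; the `GeocoderWrapper` class is a hypothetical stand-in): any object that holds a `threading.Lock` cannot be pickled, so it cannot be shipped to the Plasma store either.

```python
import pickle
import threading


class GeocoderWrapper:
    """Hypothetical stand-in for an object that keeps a lock internally,
    e.g. a geocoder/session captured by the function passed to apply()."""

    def __init__(self):
        self._lock = threading.Lock()  # _thread.lock objects cannot be pickled


try:
    pickle.dumps(GeocoderWrapper())
except TypeError as err:
    # On Python 3.7 this prints: can't pickle _thread.lock objects
    print(err)
```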
And a bit more about using geopy. As I understand it, calling `geocode` makes a request to the map provider's server, which is a network call and therefore a blocking operation. Even when we parallelize computations with Modin, every partition will still be slowed down by these blocking server calls. By default geopy makes its requests synchronously, but according to its documentation it also has an async mode, which should be much faster. So instead of the `df.apply` approach, which applies the function to every value in a Series sequentially (and in our case the function is blocking), I recommend doing something like the sketch below.
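The code block from the original comment did not survive the export to this page. What follows is a minimal reconstruction sketch of the async geopy idea, assuming geopy 2.x with its `AioHTTPAdapter` and the Nominatim provider; the `fetch_latitudes` helper and the `user_agent` value are illustrative, not the author's original code.

```python
import asyncio

from geopy.adapters import AioHTTPAdapter
from geopy.geocoders import Nominatim


async def fetch_latitudes(cities):
    # One geocoder shared by all requests; the requests run concurrently
    # instead of blocking one apply() call at a time.
    # Note: a real implementation should throttle to respect provider rate limits.
    async with Nominatim(
        user_agent="my-app",  # placeholder: use your own user agent
        adapter_factory=AioHTTPAdapter,
    ) as geolocator:
        locations = await asyncio.gather(
            *(geolocator.geocode(city) for city in cities)
        )
    return [loc.latitude if loc is not None else None for loc in locations]


# Convert the column to plain Python values so nothing unpicklable has to be
# serialized by the engine, then assign the result back.
data["City_lat"] = asyncio.run(fetch_latitudes(list(data["City"])))
```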
I think we can close it. If an object is not serializable, we cannot do distributed processing.
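As an aside (not part of the original thread): the unpicklable lock most likely lives inside an object captured by the applied function, for example a geocoder created once at module level. A minimal sketch, under that assumption, of keeping the function itself picklable by creating the problematic object lazily inside it; the `city_latitude` helper, provider, and `user_agent` are hypothetical.

```python
from geopy.geocoders import Nominatim

_geolocator = None  # created lazily; keep it None in the driver process


def city_latitude(city):
    # The geocoder (which may hold sessions/locks) is constructed only where
    # the function actually runs, so it is never pickled with the function.
    global _geolocator
    if _geolocator is None:
        _geolocator = Nominatim(user_agent="my-app")  # placeholder user agent
    location = _geolocator.geocode(city)
    return location.latitude if location is not None else None


data["City_lat"] = data["City"].apply(city_latitude)
```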