BIGNUMERIC causing ModuleNotFoundError: No module named 'google.cloud.spark'
Seen on a default Dataproc Cluster on Compute Engine.
Repro
```python
from typing import Any, Dict, Iterable, List

from pyspark.sql import Row

def foo(rows: Iterable[Row]) -> List[int]:
    row_dicts: List[Dict[str, Any]] = [row.asDict() for row in rows]  # Error raised here
    return [1]

df = ...  # read from a BigQuery table that has a BIGNUMERIC column
x: int = df.rdd.mapPartitions(foo).sum()
```
Stack trace

```
row_dicts: List[Dict[str, Any]] = [row.asDict() for row in rows]
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 138, in load_stream
    yield self._read_with_length(stream)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 160, in _read_with_length
    return self.loads(obj)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 430, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 893, in _parse_datatype_json_string
    return _parse_datatype_json_value(json.loads(json_string))
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 910, in _parse_datatype_json_value
    return _all_complex_types[tpe].fromJson(json_value)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 596, in fromJson
    return StructType([StructField.fromJson(f) for f in json["fields"]])
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 596, in <listcomp>
    return StructType([StructField.fromJson(f) for f in json["fields"]])
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 441, in fromJson
    _parse_datatype_json_value(json["type"]),
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 912, in _parse_datatype_json_value
    return UserDefinedType.fromJson(json_value)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 755, in fromJson
    m = __import__(pyModule, globals(), locals(), [pyClass])
ModuleNotFoundError: No module named 'google.cloud.spark'
```
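The last frame shows the failure mode: the schema JSON embedded in the pickled rows names a `pyClass`, and pyspark resolves it with a plain `__import__` on the executor. A minimal stand-alone illustration (no pyspark needed; the `pyClass` string is copied from the debugging output in this issue):

```python
# pyspark splits the UDT's pyClass string into module and class and
# imports the module at deserialization time. If the connector's Python
# library is not on the executor's path, the import raises.
py_class = "google.cloud.spark.bigquery.big_numeric_support.BigNumericUDT"
py_module, py_cls = py_class.rsplit(".", 1)
try:
    m = __import__(py_module, globals(), locals(), [py_cls])
except ModuleNotFoundError as err:
    print(err)  # e.g. "No module named 'google.cloud.spark'"
```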
Workaround fix

- SSH onto the Spark machines.
- Find the path to the `google.cloud` package, e.g. `/opt/conda/default/lib/python3.8/site-packages/google/cloud`.
- Copy in the `spark` folder from https://github.com/GoogleCloudDataproc/spark-bigquery-connector/tree/master/spark-bigquery-python-lib/src/main/python/google/cloud/spark
```
gary@cluster-9629-m:~$ python
Python 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import google.cloud
>>> google.cloud.__path__
_NamespacePath(['/opt/conda/default/lib/python3.8/site-packages/google/cloud'])
>>> exit()
gary@cluster-9629-m:~$ cd '/opt/conda/default/lib/python3.8/site-packages/google/cloud'
gary@cluster-9629-m:/opt/conda/default/lib/python3.8/site-packages/google/cloud$ # copy in spark folder
gary@cluster-9629-m:/opt/conda/default/lib/python3.8/site-packages/google/cloud$ find spark
spark
spark/bigquery
spark/bigquery/__init__.py
spark/bigquery/big_query_connector_utils.py
spark/bigquery/big_numeric_support.py
spark/__init__.py
```
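After copying the folder in, a quick sanity check confirms the import now resolves on a node. This is a sketch; the module path is the one named in the stack trace, and the helper name is made up for illustration:

```python
import importlib

def udt_module_available(name: str = "google.cloud.spark.bigquery.big_numeric_support") -> bool:
    """Return True if the connector's Python UDT module can be imported."""
    try:
        importlib.import_module(name)
        return True
    except ModuleNotFoundError:
        return False

# On a patched node this should print True; before the workaround, False.
print(udt_module_available())
```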
Debugging process

Monkey patched `UserDefinedType.fromJson` to log the value it fails on:

`{'type': 'udt', 'class': 'org.apache.spark.bigquery.BigNumericUDT', 'pyClass': 'google.cloud.spark.bigquery.big_numeric_support.BigNumericUDT', 'sqlType': 'string'}`
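The patch itself can be sketched like this. The stand-in class keeps the example runnable without pyspark; with pyspark installed, the same wrapper applied to `pyspark.sql.types.UserDefinedType.fromJson` logs the offending value before the import is attempted:

```python
import functools

class FakeUDT:  # stand-in for pyspark.sql.types.UserDefinedType
    @classmethod
    def fromJson(cls, json_value):
        return json_value["pyClass"]

_original = FakeUDT.fromJson  # keep a reference to the unpatched method

@functools.wraps(_original)
def logged_from_json(json_value):
    # Log the raw schema JSON before delegating to the original method.
    print("UserDefinedType.fromJson called with:", json_value)
    return _original(json_value)

FakeUDT.fromJson = logged_from_json

FakeUDT.fromJson({"type": "udt", "pyClass": "google.cloud.spark.bigquery.big_numeric_support.BigNumericUDT"})
```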
Issue Analytics

- Created a year ago
- Comments: 5 (1 by maintainers)
@gary-arcana have you tried the steps from the README?

Also, please make sure that you have included the connector's jar in the cluster (using the connectors init action) or by using the `--jars` option. Also verify that `gs://spark-lib/bigquery/spark-bigquery-support-0.26.0.zip` is configured in `spark.submit.pyFiles`, or add it at runtime.

Thanks @davidrabinowitz, I'll give it a go and provide an update later.

And no - I didn't see the README. We can't make Spark's exception better, but maybe we can add the instructions to `BigNumericUDT`'s docstring (or point it to the README). I had a lot of trouble googling for a solution. Hopefully this GitHub issue will be picked up by the search engine.
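For reference, the maintainer's suggestion could look like this at submit time. This is a sketch, not a verified command: the zip URI is quoted from the comment above, while the jar name, its version, and `my_job.py` are assumptions to be matched against your connector release.

```shell
# Submit with both the connector jar and its Python support zip.
# Keep the jar and zip versions in sync (0.26.0 is assumed here).
spark-submit \
  --jars gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.26.0.jar \
  --conf spark.submit.pyFiles=gs://spark-lib/bigquery/spark-bigquery-support-0.26.0.zip \
  my_job.py  # my_job.py is a placeholder for your PySpark script
```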