
BIGNUMERIC causing ModuleNotFoundError: No module named 'google.cloud.spark'


On a default Dataproc cluster on Compute Engine:

Repro

    def foo(rows: Iterable[Row]) -> List[int]:
        row_dicts: List[Dict[str, Any]] = [row.asDict() for row in rows]  # raises ModuleNotFoundError
        return [1]

    df = ...  # read from a BigQuery table that has a BIGNUMERIC column

    x: int = df.rdd.mapPartitions(foo).sum()
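
For reference, a self-contained version of the repro might look like this (a sketch: the session setup and table name are assumed, and any BigQuery table with a BIGNUMERIC column should trigger the error):

    from typing import Any, Dict, Iterable, List

    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.getOrCreate()

    def foo(rows: Iterable[Row]) -> List[int]:
        # Iterating the rows unpickles them on the executor; the pickled
        # schema references the BigNumeric UDT, whose pyClass import fails.
        row_dicts: List[Dict[str, Any]] = [row.asDict() for row in rows]
        return [1]

    # "my_dataset.my_table" is a placeholder for a table with a BIGNUMERIC column.
    df = spark.read.format("bigquery").option("table", "my_dataset.my_table").load()

    x: int = df.rdd.mapPartitions(foo).sum()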

Stack trace

  row_dicts: List[Dict[str, Any]] = [row.asDict() for row in rows]
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 138, in load_stream
    yield self._read_with_length(stream)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 160, in _read_with_length
    return self.loads(obj)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 430, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 893, in _parse_datatype_json_string
    return _parse_datatype_json_value(json.loads(json_string))
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 910, in _parse_datatype_json_value
    return _all_complex_types[tpe].fromJson(json_value)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 596, in fromJson
    return StructType([StructField.fromJson(f) for f in json["fields"]])
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 596, in <listcomp>
    return StructType([StructField.fromJson(f) for f in json["fields"]])
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 441, in fromJson
    _parse_datatype_json_value(json["type"]),
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 912, in _parse_datatype_json_value
    return UserDefinedType.fromJson(json_value)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 755, in fromJson
    m = __import__(pyModule, globals(), locals(), [pyClass])
ModuleNotFoundError: No module named 'google.cloud.spark'

Workaround

  1. SSH onto each of the Spark machines
  2. Find the path to the google.cloud package, e.g. /opt/conda/default/lib/python3.8/site-packages/google/cloud
  3. Copy in the spark folder from https://github.com/GoogleCloudDataproc/spark-bigquery-connector/tree/master/spark-bigquery-python-lib/src/main/python/google/cloud/spark, as in the transcript below
gary@cluster-9629-m:~$ python
Python 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) 
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import google.cloud
>>> google.cloud.__path__
_NamespacePath(['/opt/conda/default/lib/python3.8/site-packages/google/cloud'])
>>> exit()
gary@cluster-9629-m:~$ cd '/opt/conda/default/lib/python3.8/site-packages/google/cloud'
gary@cluster-9629-m:/opt/conda/default/lib/python3.8/site-packages/google/cloud$ # copy in spark folder
gary@cluster-9629-m:/opt/conda/default/lib/python3.8/site-packages/google/cloud$ find spark
spark
spark/bigquery
spark/bigquery/__init__.py
spark/bigquery/big_query_connector_utils.py
spark/bigquery/big_numeric_support.py
spark/__init__.py
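
After copying, a minimal import check (run with the same Python that Spark uses) confirms the fix; this is exactly the module that UserDefinedType.fromJson tries to import:

    # If this import succeeds, executors can resolve the UDT's pyClass
    # (google.cloud.spark.bigquery.big_numeric_support.BigNumericUDT).
    from google.cloud.spark.bigquery import big_numeric_support
    print(big_numeric_support.__file__)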

Debugging process

Monkey-patched UserDefinedType.fromJson to log the value it was failing on; the offending UDT JSON was:

`{'type': 'udt', 'class': 'org.apache.spark.bigquery.BigNumericUDT', 'pyClass': 'google.cloud.spark.bigquery.big_numeric_support.BigNumericUDT', 'sqlType': 'string'}`

https://github.com/apache/spark/blob/d1f8a503a26bcfb4e466d9accc5fa241a7933667/python/pyspark/sql/types.py#L755
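
A minimal version of such a patch might look like this (a sketch, not the author’s exact code):

    from pyspark.sql.types import UserDefinedType

    # Keep the underlying function of the original classmethod.
    _original_fromJson = UserDefinedType.fromJson.__func__

    def _patched_fromJson(cls, json_value):
        # Log the UDT descriptor before fromJson tries to import its
        # pyClass, so the failing module shows up next to the traceback.
        print("UserDefinedType.fromJson:", json_value)
        return _original_fromJson(cls, json_value)

    UserDefinedType.fromJson = classmethod(_patched_fromJson)

Since the failure happens while rows are deserialized on the executors, the patch has to run there too, e.g. in a module shipped to the workers.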

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

2 reactions
davidrabinowitz commented, Aug 18, 2022

@gary-arcana have you tried the steps from the README?

try:
    import pkg_resources

    pkg_resources.declare_namespace(__name__)
except ImportError:
    import pkgutil

    __path__ = pkgutil.extend_path(__path__, __name__)

Also, please make sure that you have included the connector’s jar in the cluster (using the connectors init action or the --jars option). Also verify that gs://spark-lib/bigquery/spark-bigquery-support-0.26.0.zip is configured in spark.submit.pyFiles, or add it at runtime:

spark.sparkContext.addPyFile("gs://spark-lib/bigquery/spark-bigquery-support-0.26.0.zip")
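
For completeness, a sketch of supplying both at session creation (the jar location shown is an assumed example of the connector’s published GCS artifact; on Dataproc the equivalent --jars and --py-files flags can be passed at submit time instead):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # Connector jar; this gs://spark-lib path is an assumed example.
        .config("spark.jars", "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar")
        # Python support zip named in the comment above.
        .config("spark.submit.pyFiles", "gs://spark-lib/bigquery/spark-bigquery-support-0.26.0.zip")
        .getOrCreate()
    )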

0 reactions
ghost commented, Aug 18, 2022

Thanks @davidrabinowitz, I’ll give it a go and provide an update later.

And no, I didn’t see the README. We can’t make Spark’s exception better, but maybe we can add the instructions to BigNumericUDT’s docstring (or point it to the README).

I had a lot of trouble googling for a solution; hopefully this GitHub issue will be picked up by search engines.
