
BIGNUMERIC causing ModuleNotFoundError: No module named 'google.cloud.spark'


On a default Dataproc cluster on Compute Engine:

Repro

    def foo(rows: Iterable[Row]) -> List[int]:
        row_dicts: List[Dict[str, Any]] = [row.asDict() for row in rows]  # raises ModuleNotFoundError
        return [1]

    df = ...  # read from a BigQuery table that has a BIGNUMERIC column

    x: int = df.rdd.mapPartitions(foo).sum()
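
For reference, a self-contained version of the repro might look like this (a sketch: the session setup and table name are assumed, and any BigQuery table with a BIGNUMERIC column should trigger the error):

    from typing import Any, Dict, Iterable, List

    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.getOrCreate()

    def foo(rows: Iterable[Row]) -> List[int]:
        # Iterating the rows unpickles them on the executor; the pickled
        # schema references the BigNumeric UDT, whose pyClass import fails.
        row_dicts: List[Dict[str, Any]] = [row.asDict() for row in rows]
        return [1]

    # "my_dataset.my_table" is a placeholder for a table with a BIGNUMERIC column.
    df = spark.read.format("bigquery").option("table", "my_dataset.my_table").load()

    x: int = df.rdd.mapPartitions(foo).sum()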

Stack trace

  row_dicts: List[Dict[str, Any]] = [row.asDict() for row in rows]
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 138, in load_stream
    yield self._read_with_length(stream)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 160, in _read_with_length
    return self.loads(obj)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 430, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 893, in _parse_datatype_json_string
    return _parse_datatype_json_value(json.loads(json_string))
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 910, in _parse_datatype_json_value
    return _all_complex_types[tpe].fromJson(json_value)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 596, in fromJson
    return StructType([StructField.fromJson(f) for f in json["fields"]])
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 596, in <listcomp>
    return StructType([StructField.fromJson(f) for f in json["fields"]])
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 441, in fromJson
    _parse_datatype_json_value(json["type"]),
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 912, in _parse_datatype_json_value
    return UserDefinedType.fromJson(json_value)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 755, in fromJson
    m = __import__(pyModule, globals(), locals(), [pyClass])
ModuleNotFoundError: No module named 'google.cloud.spark'

Workaround

  1. SSH onto each of the Spark machines
  2. Find the path to the google.cloud package, e.g. /opt/conda/default/lib/python3.8/site-packages/google/cloud
  3. Copy in the spark folder from https://github.com/GoogleCloudDataproc/spark-bigquery-connector/tree/master/spark-bigquery-python-lib/src/main/python/google/cloud/spark, as in the transcript below
gary@cluster-9629-m:~$ python
Python 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) 
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import google.cloud
>>> google.cloud.__path__
_NamespacePath(['/opt/conda/default/lib/python3.8/site-packages/google/cloud'])
>>> exit()
gary@cluster-9629-m:~$ cd '/opt/conda/default/lib/python3.8/site-packages/google/cloud'
gary@cluster-9629-m:/opt/conda/default/lib/python3.8/site-packages/google/cloud$ # copy in spark folder
gary@cluster-9629-m:/opt/conda/default/lib/python3.8/site-packages/google/cloud$ find spark
spark
spark/bigquery
spark/bigquery/__init__.py
spark/bigquery/big_query_connector_utils.py
spark/bigquery/big_numeric_support.py
spark/__init__.py
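
After copying, a minimal import check (run with the same Python that Spark uses) confirms the fix; this is exactly the module that UserDefinedType.fromJson tries to import:

    # If this import succeeds, executors can resolve the UDT's pyClass
    # (google.cloud.spark.bigquery.big_numeric_support.BigNumericUDT).
    from google.cloud.spark.bigquery import big_numeric_support
    print(big_numeric_support.__file__)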

Debugging process

Monkey-patched UserDefinedType.fromJson to log the value it was failing on; the offending UDT JSON was:

`{'type': 'udt', 'class': 'org.apache.spark.bigquery.BigNumericUDT', 'pyClass': 'google.cloud.spark.bigquery.big_numeric_support.BigNumericUDT', 'sqlType': 'string'}`

https://github.com/apache/spark/blob/d1f8a503a26bcfb4e466d9accc5fa241a7933667/python/pyspark/sql/types.py#L755
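
A minimal version of such a patch might look like this (a sketch, not the author’s exact code):

    from pyspark.sql.types import UserDefinedType

    # Keep the underlying function of the original classmethod.
    _original_fromJson = UserDefinedType.fromJson.__func__

    def _patched_fromJson(cls, json_value):
        # Log the UDT descriptor before fromJson tries to import its
        # pyClass, so the failing module shows up next to the traceback.
        print("UserDefinedType.fromJson:", json_value)
        return _original_fromJson(cls, json_value)

    UserDefinedType.fromJson = classmethod(_patched_fromJson)

Since the failure happens while rows are deserialized on the executors, the patch has to run there too, e.g. in a module shipped to the workers.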

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

2 reactions
davidrabinowitz commented, Aug 18, 2022

@gary-arcana have you tried the steps from the README?

try:
    import pkg_resources

    pkg_resources.declare_namespace(__name__)
except ImportError:
    import pkgutil

    __path__ = pkgutil.extend_path(__path__, __name__)

Also, please make sure that you have included the connector’s jar in the cluster (using the connectors init action or the --jars option). Also verify that gs://spark-lib/bigquery/spark-bigquery-support-0.26.0.zip is configured in spark.submit.pyFiles, or add it at runtime:

spark.sparkContext.addPyFile("gs://spark-lib/bigquery/spark-bigquery-support-0.26.0.zip")
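
For completeness, a sketch of supplying both at session creation (the jar location shown is an assumed example of the connector’s published GCS artifact; on Dataproc the equivalent --jars and --py-files flags can be passed at submit time instead):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # Connector jar; this gs://spark-lib path is an assumed example.
        .config("spark.jars", "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar")
        # Python support zip named in the comment above.
        .config("spark.submit.pyFiles", "gs://spark-lib/bigquery/spark-bigquery-support-0.26.0.zip")
        .getOrCreate()
    )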

0 reactions
ghost commented, Aug 18, 2022

Thanks @davidrabinowitz, I’ll give it a go and provide an update later.

And no, I didn’t see the README. We can’t make Spark’s exception better, but maybe we can add the instructions to BigNumericUDT’s docstring (or point it to the README).

I had a lot of trouble googling for a solution; hopefully this GitHub issue will be picked up by search engines.
