
[BUG] Conflicting columns when loading a Spark model


Willingness to contribute

Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.

MLflow version

mlflow-skinny 1.28

System information

  • Databricks Runtime 11.2 ML
  • Python 3.9

Describe the problem

After upgrading from 1.27 to 1.28, loading a Spark model fails with: java.lang.AssertionError: assertion failed: Conflicting partition column names detected.

Tracking information

  • MLflow version: 1.28.0
  • Tracking URI: databricks
  • Artifact URI: dbfs:/databricks/mlflow-tracking/3773864572821124/c149c22b99f34bb5ac023d2ae0f679a8/artifacts

Code to reproduce issue


%pip install mlflow-skinny==1.28

from typing import List
from pyspark.sql import DataFrame
from pyspark.sql.types import StructType, StructField, IntegerType

data: List = [
    (1, 3, 1),
    (1, 3, 1),
    (1, 3, 1),
    (1, 3, 1),
    (1, 3, 1),
    (2, 4, 1),
    (2, 3, 1),
    (3, 3, 1),
    (3, 4, 1),
]

schema: StructType = StructType([
    StructField("utilisateur_identifiant", IntegerType(), True),
    StructField("diamant_identifiant", IntegerType(), True),
    StructField("nombre_de_fois_achetes", IntegerType(), True),
])

diamants_pre_features: DataFrame = spark.createDataFrame(data=data, schema=schema)


from pyspark.ml.recommendation import ALS, ALSModel

als: ALS = ALS(
  userCol="utilisateur_identifiant", 
  itemCol="diamant_identifiant", 
  ratingCol="nombre_de_fois_achetes",
  implicitPrefs=True,
  alpha=40,
  nonnegative=True
)
model: ALSModel = als.fit(diamants_pre_features)

import mlflow
mlflow.set_experiment("/Users/nastasia/ALS_experiment")

with mlflow.start_run() as last_run:
  mlflow.spark.log_model(model, "als_exp")

from mlflow.tracking import MlflowClient
# Get last run from Mlflow experiment
client = MlflowClient()

model_experiment_id = client.get_experiment_by_name("/Users/nastasia/ALS_experiment").experiment_id

runs = client.search_runs(
    model_experiment_id, order_by=["start_time DESC"]
)

run_uuid = runs[0].info.run_uuid

# can be loaded from s3
# model = ALSModel.load(sources_jobs['ALS_model'])
loaded_model = mlflow.spark.load_model(f"runs:/{run_uuid}/als_exp")

Stack trace


Py4JJavaError                             Traceback (most recent call last)
<command-3935055487045470> in <cell line: 15>()
     13 # can be loaded from s3
     14 # model = ALSModel.load(sources_jobs['ALS_model'])
---> 15 loaded_model = mlflow.spark.load_model(f"runs:/{run_uuid}/als_exp")

/local_disk0/.ephemeral_nfs/envs/pythonEnv-704f188f-34a4-414e-9394-fe156dae6392/lib/python3.9/site-packages/mlflow/spark.py in load_model(model_uri, dfs_tmpdir)
    784             get_databricks_profile_uri_from_artifact_uri(root_uri)
    785         ):
--> 786             return PipelineModel.load(mlflowdbfs_path)
    787 
    788     return _load_model(

/databricks/spark/python/pyspark/ml/util.py in load(cls, path)
    444     def load(cls, path: str) -> RL:
    445         """Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
--> 446         return cls.read().load(path)
    447 
    448 

/databricks/spark/python/pyspark/ml/pipeline.py in load(self, path)
    282         metadata = DefaultParamsReader.loadMetadata(path, self.sc)
    283         if "language" not in metadata["paramMap"] or metadata["paramMap"]["language"] != "Python":
--> 284             return JavaMLReader(cast(Type["JavaMLReadable[PipelineModel]"], self.cls)).load(path)
    285         else:
    286             uid, stages = PipelineSharedReadWrite.load(metadata, self.sc, path)

/databricks/spark/python/pyspark/ml/util.py in load(self, path)
    393         if not isinstance(path, str):
    394             raise TypeError("path should be a string, got type %s" % type(path))
--> 395         java_obj = self._jread.load(path)
    396         if not hasattr(self._clazz, "_from_java"):
    397             raise NotImplementedError(

/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1319 
   1320         answer = self.gateway_client.send_command(command)
-> 1321         return_value = get_return_value(
   1322             answer, self.gateway_client, self.target_id, self.name)
   1323 

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    194     def deco(*a: Any, **kw: Any) -> Any:
    195         try:
--> 196             return f(*a, **kw)
    197         except Py4JJavaError as e:
    198             converted = convert_exception(e.java_exception)

/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o631.load.
: java.lang.AssertionError: assertion failed: Conflicting partition column names detected:

	Partition column name list #0: part-00000-tid-2225352087800616616-e8b36a0f-887c-4049-96ad-de143491c840-1160-1-c000.snappy.parquet?X-Amz-Security-Token, _cloud_type_, _file_size_
	Partition column name list #1: part-00001-tid-2225352087800616616-e8b36a0f-887c-4049-96ad-de143491c840-1161-1-c000.snappy.parquet?X-Amz-Security-Token, _cloud_type_, _file_size_
	Partition column name list #2: part-00002-tid-2225352087800616616-e8b36a0f-887c-4049-96ad-de143491c840-1162-1-c000.snappy.parquet?X-Amz-Security-Token, _cloud_type_, _file_size_
	Partition column name list #3: part-00003-tid-2225352087800616616-e8b36a0f-887c-4049-96ad-de143491c840-1163-1-c000.snappy.parquet?X-Amz-Security-Token, _cloud_type_, _file_size_

For partitioned table directories, data files should only live in leaf directories.
And directories at the same level should have the same partition column name.
	at scala.Predef$.assert(Predef.scala:223)
	at org.apache.spark.sql.execution.datasources.PartitioningUtils$.resolvePartitions(PartitioningUtils.scala:482)
	at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:213)
	at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:142)
	at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:212)
	at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:106)
	at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:53)
	at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:192)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:460)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:368)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:324)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:324)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:237)
	at org.apache.spark.ml.recommendation.ALSModel$ALSModelReader.load(ALS.scala:558)
	at org.apache.spark.ml.recommendation.ALSModel$ALSModelReader.load(ALS.scala:548)
	at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$5(Pipeline.scala:277)
	at org.apache.spark.ml.MLEvents.withLoadInstanceEvent(events.scala:161)
	at org.apache.spark.ml.MLEvents.withLoadInstanceEvent$(events.scala:156)
	at org.apache.spark.ml.util.Instrumentation.withLoadInstanceEvent(Instrumentation.scala:43)
	at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$4(Pipeline.scala:277)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
	at scala.collection.TraversableLike.map(TraversableLike.scala:286)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
	at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$3(Pipeline.scala:274)
	at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:284)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:284)
	at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:268)
	at org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$7(Pipeline.scala:356)
	at org.apache.spark.ml.MLEvents.withLoadInstanceEvent(events.scala:161)
	at org.apache.spark.ml.MLEvents.withLoadInstanceEvent$(events.scala:156)
	at org.apache.spark.ml.util.Instrumentation.withLoadInstanceEvent(Instrumentation.scala:43)
	at org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$6(Pipeline.scala:355)
	at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:284)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:284)
	at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:355)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
	at py4j.Gateway.invoke(Gateway.java:306)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
	at java.lang.Thread.run(Thread.java:748)

Other info / logs

No response

What component(s) does this bug affect?

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/pipelines: Pipelines, Pipeline APIs, Pipeline configs, Pipeline Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow’s components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

What language(s) does this bug affect?

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 11

Top GitHub Comments

2 reactions
BenWilson2 commented, Oct 5, 2022

Hi @NastasiaSaby, if you’d like to use MLflow 1.28 or 1.29, we’re fairly certain that setting this variable in your notebook will make things work (for now):

import os

os.environ["DISABLE_MLFLOWDBFS"] = "true"

Alternatively, you can set the environment variable at cluster creation. We’re working on a fix, but it won’t be available until the next stable release. Once again, sorry for the regression!
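For reference, a minimal sketch of applying this workaround to the repro above. It assumes the variable only needs to be in place before mlflow.spark.load_model is called, since that is where the mlflowdbfs path would otherwise be used:

import os
import mlflow

# Workaround suggested by the MLflow maintainers for the partition-column
# assertion error on MLflow 1.28/1.29: set this before loading the model.
os.environ["DISABLE_MLFLOWDBFS"] = "true"

loaded_model = mlflow.spark.load_model(f"runs:/{run_uuid}/als_exp")

On Databricks, the same variable can also be set in the cluster configuration (Advanced options, Environment variables) so that it applies to every notebook attached to that cluster.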

0 reactions
dbczumar commented, Nov 2, 2022

@BenWilson2 When will the next stable release be? I can confirm that the workaround below works for sparkdl.xgboost, but I’m wondering how long I need to keep the following line in my code. Thank you!


os.environ["DISABLE_MLFLOWDBFS"] = "true"

@chengyineng38 @NastasiaSaby A fix for this issue has been included in the Databricks Runtime and is currently in the process of being released to production workspaces. By the end of this week, it should be safe to remove the DISABLE_MLFLOWDBFS workaround code from your notebooks after restarting your cluster(s).
