[BUG] Conflicting partition columns when loading a Spark model
Willingness to contribute
Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.
MLflow version
mlflow-skinny 1.28
System information
- Databricks platform: Databricks Runtime 11.2 ML
- Python version: 3.9 (per the stack trace below)
- yarn version, if running the dev UI: N/A
Describe the problem
After upgrading from MLflow 1.27 to 1.28, loading a Spark model fails with:
java.lang.AssertionError: assertion failed: Conflicting partition column names detected:
Tracking information
MLflow version: 1.28.0
Tracking URI: databricks
Artifact URI: dbfs:/databricks/mlflow-tracking/3773864572821124/c149c22b99f34bb5ac023d2ae0f679a8/artifacts
Code to reproduce issue
%pip install mlflow-skinny==1.28

from typing import List
from pyspark.sql import DataFrame
from pyspark.sql.types import StructType, StructField, IntegerType

# Build a small user/item/rating dataset
data: List = [
    (1, 3, 1),
    (1, 3, 1),
    (1, 3, 1),
    (1, 3, 1),
    (1, 3, 1),
    (2, 4, 1),
    (2, 3, 1),
    (3, 3, 1),
    (3, 4, 1),
]
schema: StructType = StructType([
    StructField("utilisateur_identifiant", IntegerType(), True),
    StructField("diamant_identifiant", IntegerType(), True),
    StructField("nombre_de_fois_achetes", IntegerType(), True),
])
diamants_pre_features: DataFrame = spark.createDataFrame(data=data, schema=schema)

from pyspark.ml.recommendation import ALS, ALSModel

# Train a simple implicit-feedback ALS model
als: ALS = ALS(
    userCol="utilisateur_identifiant",
    itemCol="diamant_identifiant",
    ratingCol="nombre_de_fois_achetes",
    implicitPrefs=True,
    alpha=40,
    nonnegative=True,
)
model: ALSModel = als.fit(diamants_pre_features)

import mlflow

# Log the fitted model to an MLflow experiment
mlflow.set_experiment("/Users/nastasia/ALS_experiment")
with mlflow.start_run() as last_run:
    mlflow.spark.log_model(model, "als_exp")

from mlflow.tracking import MlflowClient

# Get the last run from the MLflow experiment
client = MlflowClient()
model_experiment_id = client.get_experiment_by_name("/Users/nastasia/ALS_experiment").experiment_id
runs = client.search_runs(
    model_experiment_id, order_by=["start_time DESC"]
)
run_uuid = runs[0].info.run_uuid

# can be loaded from s3
# model = ALSModel.load(sources_jobs['ALS_model'])
loaded_model = mlflow.spark.load_model(f"runs:/{run_uuid}/als_exp")
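As an aside, the run id used above can also be read directly from the last_run object returned by mlflow.start_run(), which avoids the search_runs round trip; a minimal sketch:

# Equivalent to searching the experiment for the most recent run
run_uuid = last_run.info.run_id
loaded_model = mlflow.spark.load_model(f"runs:/{run_uuid}/als_exp")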
Stack trace
Py4JJavaError Traceback (most recent call last)
<command-3935055487045470> in <cell line: 15>()
13 # can be loaded from s3
14 # model = ALSModel.load(sources_jobs['ALS_model'])
---> 15 loaded_model = mlflow.spark.load_model(f"runs:/{run_uuid}/als_exp")
/local_disk0/.ephemeral_nfs/envs/pythonEnv-704f188f-34a4-414e-9394-fe156dae6392/lib/python3.9/site-packages/mlflow/spark.py in load_model(model_uri, dfs_tmpdir)
784 get_databricks_profile_uri_from_artifact_uri(root_uri)
785 ):
--> 786 return PipelineModel.load(mlflowdbfs_path)
787
788 return _load_model(
/databricks/spark/python/pyspark/ml/util.py in load(cls, path)
444 def load(cls, path: str) -> RL:
445 """Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
--> 446 return cls.read().load(path)
447
448
/databricks/spark/python/pyspark/ml/pipeline.py in load(self, path)
282 metadata = DefaultParamsReader.loadMetadata(path, self.sc)
283 if "language" not in metadata["paramMap"] or metadata["paramMap"]["language"] != "Python":
--> 284 return JavaMLReader(cast(Type["JavaMLReadable[PipelineModel]"], self.cls)).load(path)
285 else:
286 uid, stages = PipelineSharedReadWrite.load(metadata, self.sc, path)
/databricks/spark/python/pyspark/ml/util.py in load(self, path)
393 if not isinstance(path, str):
394 raise TypeError("path should be a string, got type %s" % type(path))
--> 395 java_obj = self._jread.load(path)
396 if not hasattr(self._clazz, "_from_java"):
397 raise NotImplementedError(
/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py in __call__(self, *args)
1319
1320 answer = self.gateway_client.send_command(command)
-> 1321 return_value = get_return_value(
1322 answer, self.gateway_client, self.target_id, self.name)
1323
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
194 def deco(*a: Any, **kw: Any) -> Any:
195 try:
--> 196 return f(*a, **kw)
197 except Py4JJavaError as e:
198 converted = convert_exception(e.java_exception)
/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o631.load.
: java.lang.AssertionError: assertion failed: Conflicting partition column names detected:
Partition column name list #0: part-00000-tid-2225352087800616616-e8b36a0f-887c-4049-96ad-de143491c840-1160-1-c000.snappy.parquet?X-Amz-Security-Token, _cloud_type_, _file_size_
Partition column name list #1: part-00001-tid-2225352087800616616-e8b36a0f-887c-4049-96ad-de143491c840-1161-1-c000.snappy.parquet?X-Amz-Security-Token, _cloud_type_, _file_size_
Partition column name list #2: part-00002-tid-2225352087800616616-e8b36a0f-887c-4049-96ad-de143491c840-1162-1-c000.snappy.parquet?X-Amz-Security-Token, _cloud_type_, _file_size_
Partition column name list #3: part-00003-tid-2225352087800616616-e8b36a0f-887c-4049-96ad-de143491c840-1163-1-c000.snappy.parquet?X-Amz-Security-Token, _cloud_type_, _file_size_
For partitioned table directories, data files should only live in leaf directories.
And directories at the same level should have the same partition column name.
at scala.Predef$.assert(Predef.scala:223)
at org.apache.spark.sql.execution.datasources.PartitioningUtils$.resolvePartitions(PartitioningUtils.scala:482)
at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:213)
at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:142)
at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:212)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:106)
at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:53)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:460)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:368)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:324)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:324)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:237)
at org.apache.spark.ml.recommendation.ALSModel$ALSModelReader.load(ALS.scala:558)
at org.apache.spark.ml.recommendation.ALSModel$ALSModelReader.load(ALS.scala:548)
at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$5(Pipeline.scala:277)
at org.apache.spark.ml.MLEvents.withLoadInstanceEvent(events.scala:161)
at org.apache.spark.ml.MLEvents.withLoadInstanceEvent$(events.scala:156)
at org.apache.spark.ml.util.Instrumentation.withLoadInstanceEvent(Instrumentation.scala:43)
at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$4(Pipeline.scala:277)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at scala.collection.TraversableLike.map(TraversableLike.scala:286)
at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$3(Pipeline.scala:274)
at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:284)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:284)
at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:268)
at org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$7(Pipeline.scala:356)
at org.apache.spark.ml.MLEvents.withLoadInstanceEvent(events.scala:161)
at org.apache.spark.ml.MLEvents.withLoadInstanceEvent$(events.scala:156)
at org.apache.spark.ml.util.Instrumentation.withLoadInstanceEvent(Instrumentation.scala:43)
at org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$6(Pipeline.scala:355)
at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:284)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:284)
at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:355)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:306)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
at java.lang.Thread.run(Thread.java:748)
Other info / logs
No response
What component(s) does this bug affect?
- area/artifacts: Artifact stores and artifact logging
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages
- area/examples: Example code
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/pipelines: Pipelines, Pipeline APIs, Pipeline configs, Pipeline Templates
- area/projects: MLproject format, project running backends
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/server-infra: MLflow Tracking server backend
- area/tracking: Tracking Service, tracking client APIs, autologging
What interface(s) does this bug affect?
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/docker: Docker use across MLflow’s components, such as MLflow Projects and MLflow Models
- area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
- area/windows: Windows support
What language(s) does this bug affect?
- language/r: R APIs and clients
- language/java: Java APIs and clients
- language/new: Proposals for new client languages
What integration(s) does this bug affect?
- integrations/azure: Azure and Azure ML integrations
- integrations/sagemaker: SageMaker integrations
- integrations/databricks: Databricks integrations
Hi @NastasiaSaby, if you’d like to use MLflow 1.28 or 1.29, we’re fairly certain that setting this variable in your notebook will make things work (for now):
Alternatively, you can set the environment variable at cluster creation. We’re working on a fix for this that won’t be available until the next stable release. Once again, sorry for the regression!
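A minimal sketch of the notebook-level workaround, assuming the variable in question is the DISABLE_MLFLOWDBFS environment variable named later in this thread (the exact value to set is an assumption):

import os

# Assumed workaround: disable the mlflowdbfs artifact scheme so that
# mlflow.spark.load_model falls back to the regular DBFS artifact path.
# DISABLE_MLFLOWDBFS is taken from this thread; "True" as the value is an assumption.
os.environ["DISABLE_MLFLOWDBFS"] = "True"

import mlflow
loaded_model = mlflow.spark.load_model(f"runs:/{run_uuid}/als_exp")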
@chengyineng38 @NastasiaSaby A fix for this issue has been included in the Databricks Runtime and is currently in the process of being released to production workspaces. By the end of this week, it should be safe to remove the DISABLE_MLFLOWDBFS workaround code from your notebooks after restarting your cluster(s).