
[SUPPORT] Unable to read Hudi MOR data set in a test on 0.7

I’m trying out Hudi 0.7 locally via PySpark, and oddly I can write data that can be read back as Parquet but not as Hudi. It seems an InMemoryFileIndex constructor is missing?

Stack trace from the pytest run:

============================================================== FAILURES ==============================================================
_____________________________________________________________ test_hudi ______________________________________________________________
answer = 'xro66', gateway_client = <py4j.java_gateway.GatewayClient object at 0x1069a83d0>, target_id = 'o65', name = 'load'

    def get_return_value(answer, gateway_client, target_id=None, name=None):
        """Converts an answer received from the Java gateway into a Python object.

        For example, string representation of integers are converted to Python
        integer, string representation of objects are converted to JavaObject
        instances, etc.

        :param answer: the string returned by the Java gateway
        :param gateway_client: the gateway client used to communicate with the Java
            Gateway. Only necessary if the answer is a reference (e.g., object,
            list, map)
        :param target_id: the name of the object from which the answer comes from
            (e.g., *object1* in `object1.hello()`). Optional.
        :param name: the name of the member from which the answer comes from
            (e.g., *hello* in `object1.hello()`). Optional.
        """
        if is_error(answer)[0]:
            if len(answer) > 1:
                type = answer[1]
                value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
                if answer[1] == REFERENCE_TYPE:
                    raise Py4JJavaError(
                        "An error occurred while calling {0}{1}{2}.\n".
>                       format(target_id, ".", name), value)
E                   py4j.protocol.Py4JJavaError: An error occurred while calling o65.load.
E                   : java.lang.NoSuchMethodError: 'void org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(org.apache.spark.sql.SparkSession, scala.collection.Seq, scala.collection.immutable.Map, scala.Option, org.apache.spark.sql.execution.datasources.FileStatusCache)'
E                       at org.apache.hudi.HoodieSparkUtils$.createInMemoryFileIndex(HoodieSparkUtils.scala:89)
E                       at org.apache.hudi.MergeOnReadSnapshotRelation.buildFileIndex(MergeOnReadSnapshotRelation.scala:127)
E                       at org.apache.hudi.MergeOnReadSnapshotRelation.<init>(MergeOnReadSnapshotRelation.scala:72)
E                       at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:89)
E                       at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:53)
E                       at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:339)
E                       at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
E                       at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
E                       at scala.Option.getOrElse(Option.scala:189)
E                       at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
E                       at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:214)
E                       at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
E                       at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
E                       at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
E                       at java.base/java.lang.reflect.Method.invoke(Method.java:567)
E                       at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
E                       at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
E                       at py4j.Gateway.invoke(Gateway.java:282)
E                       at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
E                       at py4j.commands.CallCommand.execute(CallCommand.java:79)
E                       at py4j.GatewayConnection.run(GatewayConnection.java:238)
E                       at java.base/java.lang.Thread.run(Thread.java:830)

../../.pyenv/versions/3.7.6/envs/spark/lib/python3.7/site-packages/py4j/protocol.py:328: Py4JJavaError

To Reproduce

Steps to reproduce the behavior:

  1. Save this sample pytest code as test_hudi.py:
import pytest
from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import SparkSession

from pyspark.sql import Row


def test_hudi(tmp_path):
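    # NB: spark.jars.packages only takes effect when the first SparkContext
    # is created, so it must be set here, before any other Spark usage.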
    SparkContext.getOrCreate(
        conf=SparkConf()
        .setAppName("testing")
        .setMaster("local[1]")
        .set(
            "spark.jars.packages",
            "org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.1",
        )
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.sql.hive.convertMetastoreParquet", "false")
    )
    spark = SparkSession.builder.getOrCreate()

    hudi_options = {
        "hoodie.table.name": "test",
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
        "hoodie.datasource.write.partitionpath.field": "year,month,day",
        "hoodie.datasource.write.table.name": "test",
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.precombine.field": "ts",
    }
    df = spark.createDataFrame(
        [
            Row(id=1, year=2020, month=7, day=5, ts=1),
        ]
    )
    df.write.format("hudi").options(**hudi_options).mode("append").save(str(tmp_path))
    read_df = spark.read.format("parquet").load(str(tmp_path) + "/*/*/*")
    # This works and prints:
    # [Row(_hoodie_commit_time='20210210160002', _hoodie_commit_seqno='20210210160002_0_1', _hoodie_record_key='id:1', _hoodie_partition_path='2020/7/5', _hoodie_file_name='e8febcc9-58b6-4174-8e83-90842d5492b0-0_0-21-12005_20210210160002.parquet', id=1, year=2020, month=7, day=5, ts=1)]
    print(read_df.collect())

    read_df = spark.read.format("hudi").load(str(tmp_path) + "/*/*/*")
    # This does not
    print(read_df.collect())
  2. Install pytest (I’m using 6.1.1) and run: py.test -s --verbose test_hudi.py (NB: tmp_path is a pytest fixture for creating a temporary directory; see https://docs.pytest.org/en/stable/tmpdir.html)

Expected behavior

The read should work; I’m not sure why the InMemoryFileIndex constructor is missing. It could be something wrong with my setup. I’m using Scala 2.12.10 locally.
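
A hedged aside, not part of the original report: a NoSuchMethodError on a Spark-internal class like InMemoryFileIndex typically indicates a binary mismatch between the locally installed PySpark and the Spark/Scala versions the Hudi bundle (hudi-spark-bundle_2.12:0.7.0, which targets Spark 3.0.x / Scala 2.12) was compiled against. A minimal sanity-check sketch, assuming a local PySpark install:

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# These should line up with the versions the Hudi bundle targets
# (Spark 3.0.x and Scala 2.12 for hudi-spark-bundle_2.12:0.7.0).
print("PySpark:", pyspark.__version__)
print("Spark:  ", spark.version)
# Scala version of the running JVM, fetched via the py4j gateway.
print("Scala:  ", spark.sparkContext._jvm.scala.util.Properties.versionString())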

Environment Description

  • Hudi version: 0.7.0
  • Spark version: 3.0.0
  • Hive version: N/A
  • Hadoop version: N/A
  • Storage (HDFS/S3/GCS…): Local
  • Running on Docker? (yes/no): no

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 16 (10 by maintainers)

Top GitHub Comments

1 reaction
nsivabalan commented, Feb 22, 2021

I am still triaging the issue. Here is the tracking ticket in the meantime: https://issues.apache.org/jira/browse/HUDI-1568. It is happening even for Spark, not just PySpark.

1 reaction
nsivabalan commented, Feb 21, 2021

Got it. Yes, I was able to repro it now. Will investigate further. Thanks.
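
As a hypothetical follow-up (not from the thread): one way to confirm the mismatch the stack trace points at is to list, via the py4j gateway, which InMemoryFileIndex constructors the running Spark actually exposes, and compare them against the signature in the NoSuchMethodError:

from pyspark.sql import SparkSession

# Hypothetical diagnostic: print the public constructors of Spark's
# InMemoryFileIndex in the running JVM.
spark = SparkSession.builder.master("local[1]").getOrCreate()
jvm = spark.sparkContext._jvm
cls = jvm.java.lang.Class.forName(
    "org.apache.spark.sql.execution.datasources.InMemoryFileIndex"
)
for ctor in cls.getConstructors():
    print(ctor.toString())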

