
Spark-Bench failing with remote HDFS

See original GitHub issue

Spark-Bench Version

992a278

Spark Version on Your Cluster

2.2.0

Scala Version on Your Spark Cluster

2.11.8

Your Exact Configuration File (with system details anonymized)

spark-bench = {
  spark-submit-config = [{
    spark-home = "/opt/kubespark" 
    spark-args = {
      master = "k8s://https://x.x.x.x:6443"
      executor-memory = "4g"
    }
    conf = {
    }
    suites-parallel = false
    workload-suites = [
      {
        descr = "Generate a dataset, then take that same dataset and write it out to Parquet format"
        benchmark-output = "hdfs://hdfs:9000/tmp/csv-vs-parquet/results-data-gen.csv"
        save-mode = "overwrite"
        // We need to generate the dataset first through the data generator, then we take that dataset and convert it to Parquet.
        parallel = false
        workloads = [
          {
            name = "data-generation-kmeans"
            rows = 10000000
            cols = 24
            output = "hdfs://hdfs:9000/tmp/csv-vs-parquet/kmeans-data.csv"
            save-mode = "ignore"
          },
          {
            name = "sql"
            query = "select * from input"
            input = "hdfs://hdfs:9000/tmp/csv-vs-parquet/kmeans-data.csv"
            output = "hdfs://hdfs:9000/tmp/csv-vs-parquet/kmeans-data.parquet"
            save-mode = "ignore"
          }
        ]
      },
      {
        descr = "Run two different SQL queries over the dataset in two different formats"
        benchmark-output = "hdfs://hdfs:9000/tmp/csv-vs-parquet/results-sql.csv"
        save-mode = "overwrite"
        parallel = false
        repeat = 10
        workloads = [
          {
            name = "sql"
            input = ["hdfs://hdfs:9000/tmp/csv-vs-parquet/kmeans-data.csv", "hdfs://hdfs:9000/tmp/csv-vs-parquet/kmeans-data.parquet"]
            query = ["select * from input", "select `0`, `22` from input where `0` < -0.9"]
            cache = false
          }
        ]
      }
    ]
  }]
}
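
One note on the config above: the conf block is left empty, so the driver resolves paths against whatever fs.defaultFS its local Hadoop configuration supplies (typically file:///). As a hypothetical workaround, not something suggested in this thread, Spark's standard spark.hadoop.* passthrough can make the remote HDFS the default filesystem:

conf = {
  // Hypothetical workaround: spark-submit copies spark.hadoop.* keys into the
  // Hadoop Configuration, so this makes hdfs://hdfs:9000 the default filesystem.
  "spark.hadoop.fs.defaultFS" = "hdfs://hdfs:9000"
}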

Relevant Stack Trace (If Applicable)

18/02/08 14:26:15 INFO WatchConnectionManager: Current reconnect backoff is 1000 milliseconds (T0)
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: hdfs:/tmp/csv-vs-parquet/results-data-gen.csv, expected: file:///
	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
	at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
	at com.ibm.sparktc.sparkbench.utils.SparkFuncs$.pathExists(SparkFuncs.scala:68)
	at com.ibm.sparktc.sparkbench.utils.SparkFuncs$.verifyCanWrite(SparkFuncs.scala:40)
	at com.ibm.sparktc.sparkbench.utils.SparkFuncs$.verifyCanWriteOrThrow(SparkFuncs.scala:83)
	at com.ibm.sparktc.sparkbench.utils.SparkFuncs$.verifyOutput(SparkFuncs.scala:33)
	at com.ibm.sparktc.sparkbench.workload.SuiteKickoff$.run(SuiteKickoff.scala:61)
	at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$$anonfun$com$ibm$sparktc$sparkbench$workload$MultipleSuiteKickoff$$runSuitesSerially$1.apply(MultipleSuiteKickoff.scala:38)
	at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$$anonfun$com$ibm$sparktc$sparkbench$workload$MultipleSuiteKickoff$$runSuitesSerially$1.apply(MultipleSuiteKickoff.scala:38)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$.com$ibm$sparktc$sparkbench$workload$MultipleSuiteKickoff$$runSuitesSerially(MultipleSuiteKickoff.scala:38)
	at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$$anonfun$run$1.apply(MultipleSuiteKickoff.scala:28)
	at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$$anonfun$run$1.apply(MultipleSuiteKickoff.scala:25)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$.run(MultipleSuiteKickoff.scala:25)
	at com.ibm.sparktc.sparkbench.cli.CLIKickoff$.main(CLIKickoff.scala:30)
	at com.ibm.sparktc.sparkbench.cli.CLIKickoff.main(CLIKickoff.scala)

Description of Problem, Any Other Info

Spark-Bench is not playing nice with remote HDFS systems.
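
The error message is telling here: the hdfs://hdfs:9000/... output path is being checked against the local filesystem (expected: file:///). My reading, not confirmed in this thread, is that the verification code obtains the default FileSystem instead of deriving one from the path's own URI. A minimal Scala sketch of the difference, using the standard Hadoop API:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val out  = new Path("hdfs://hdfs:9000/tmp/csv-vs-parquet/results-data-gen.csv")

// Fails when fs.defaultFS is file:///, since the local FS rejects an hdfs://
// path with exactly the "Wrong FS ... expected: file:///" error in the trace:
val defaultFs = FileSystem.get(conf)
// defaultFs.exists(out)

// Works regardless of the default FS: resolve the FileSystem from the path's URI.
val pathFs = out.getFileSystem(conf)
val exists = pathFs.exists(out)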

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
AndriiSushko commented, Feb 15, 2018

Thanks for such a quick reply. Spark is in a standalone mode, 3 nodes (1 master and 3 slaves). HDFS is up, running and available. I’ll check Input/Output tests and fill out a separate issue if I don’t figure this out.

UPD: @ecurtin please check issue #154 and let me know if you need some more details.

1 reaction
ecurtin commented, Feb 15, 2018

@yekaifeng @AndriiSushko I am happy to do what I can to help, but I’ll need more information from you because I can’t seem to reproduce the issue.

Here’s what I’ll need to help you:

  1. Please fill out a separate issue with the info requested in the issue template (your config file, whether your cluster is YARN, Standalone, etc.) and the stack trace of your issue.
  2. Ensure that you’re able to read and write to HDFS from outside of Spark-Bench. Optionally, you can use OutputTest and InputTest from SparkTests to see whether things are working; first, change the URLs in the run scripts in the bin directory of SparkTests to reflect your environment. (A similar standalone check is sketched below.)
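
For a quick version of point 2 (a sketch of mine, not part of SparkTests), reading and writing the same HDFS paths from a plain spark-shell session isolates HDFS connectivity from Spark-Bench itself:

// Run inside spark-shell; the host and port mirror the config above and
// may need adjusting for your environment.
val df = spark.range(1000).toDF("n")

// Write a small dataset to the remote HDFS...
df.write.mode("overwrite").csv("hdfs://hdfs:9000/tmp/spark-bench-io-test")

// ...then read it back. If either step fails, the problem is HDFS
// connectivity or configuration rather than Spark-Bench.
val back = spark.read.csv("hdfs://hdfs:9000/tmp/spark-bench-io-test")
println(back.count())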

Top Results From Across the Web

CODAIT/spark-bench - Files not writing to remote HDFS
After that I change it to return true, commented generation and "Classic benchmarking" also run successfully. So I believe either getHadoopFS( ...
Solved: Remote spark-submit HDFS error
I am trying to launch a spark job from a remote host to the HDP sandbox running on my Mac but keep getting...
Failed to Write Files to HDFS, and "item limit of / is ...
The client or upper-layer component logs indicate that a file fails to be written to a directory on HDFS. The error information is...
Spark Remote execution to Cluster fails - HDFS connection ...
I am having issues submitting a spark-submit remote job from a machine outside from the Spark Cluster running on YARN ...
Dell EMC PowerStore: Apache Spark Solution Guide
ESXi™ hypervisor automatically restarts or migrates failed Spark and HDFS servers to a different ESXi node. This process resumes operations ...
