
Spark-Bench failing with remote HDFS

See original GitHub issue

Spark-Bench Version

992a278

Spark Version on Your Cluster

2.2.0

Scala Version on Your Spark Cluster

2.11.8

Your Exact Configuration File (with system details anonymized)

spark-bench = {
  spark-submit-config = [{
    spark-home = "/opt/kubespark" 
    spark-args = {
      master = "k8s://https://x.x.x.x:6443"
      executor-memory = "4g"
    }
    conf = {
    }
    suites-parallel = false
    workload-suites = [
      {
        descr = "Generate a dataset, then take that same dataset and write it out to Parquet format"
        benchmark-output = "hdfs://hdfs:9000/tmp/csv-vs-parquet/results-data-gen.csv"
        save-mode = "overwrite"
        // We need to generate the dataset first through the data generator, then we take that dataset and convert it to Parquet.
        parallel = false
        workloads = [
          {
            name = "data-generation-kmeans"
            rows = 10000000
            cols = 24
            output = "hdfs://hdfs:9000/tmp/csv-vs-parquet/kmeans-data.csv"
            save-mode = "ignore"
          },
          {
            name = "sql"
            query = "select * from input"
            input = "hdfs://hdfs:9000/tmp/csv-vs-parquet/kmeans-data.csv"
            output = "hdfs://hdfs:9000/tmp/csv-vs-parquet/kmeans-data.parquet"
            save-mode = "ignore"
          }
        ]
      },
      {
        descr = "Run two different SQL queries over the dataset in two different formats"
        benchmark-output = "hdfs://hdfs:9000/tmp/csv-vs-parquet/results-sql.csv"
        save-mode = "overwrite"
        parallel = false
        repeat = 10
        workloads = [
          {
            name = "sql"
            input = ["hdfs://hdfs:9000/tmp/csv-vs-parquet/kmeans-data.csv", "hdfs://hdfs:9000/tmp/csv-vs-parquet/kmeans-data.parquet"]
            query = ["select * from input", "select `0`, `22` from input where `0` < -0.9"]
            cache = false
          }
        ]
      }
    ]
  }]
}
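
One note on the config above: the conf block is left empty, so the driver resolves paths against whatever fs.defaultFS its local Hadoop configuration supplies (typically file:///). As a hypothetical workaround, not something suggested in this thread, Spark's standard spark.hadoop.* passthrough can make the remote HDFS the default filesystem:

conf = {
  // Hypothetical workaround: spark-submit copies spark.hadoop.* keys into the
  // Hadoop Configuration, so this makes hdfs://hdfs:9000 the default filesystem.
  "spark.hadoop.fs.defaultFS" = "hdfs://hdfs:9000"
}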

Relevant Stack Trace (If Applicable)

18/02/08 14:26:15 INFO WatchConnectionManager: Current reconnect backoff is 1000 milliseconds (T0)
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: hdfs:/tmp/csv-vs-parquet/results-data-gen.csv, expected: file:///
	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
	at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
	at com.ibm.sparktc.sparkbench.utils.SparkFuncs$.pathExists(SparkFuncs.scala:68)
	at com.ibm.sparktc.sparkbench.utils.SparkFuncs$.verifyCanWrite(SparkFuncs.scala:40)
	at com.ibm.sparktc.sparkbench.utils.SparkFuncs$.verifyCanWriteOrThrow(SparkFuncs.scala:83)
	at com.ibm.sparktc.sparkbench.utils.SparkFuncs$.verifyOutput(SparkFuncs.scala:33)
	at com.ibm.sparktc.sparkbench.workload.SuiteKickoff$.run(SuiteKickoff.scala:61)
	at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$$anonfun$com$ibm$sparktc$sparkbench$workload$MultipleSuiteKickoff$$runSuitesSerially$1.apply(MultipleSuiteKickoff.scala:38)
	at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$$anonfun$com$ibm$sparktc$sparkbench$workload$MultipleSuiteKickoff$$runSuitesSerially$1.apply(MultipleSuiteKickoff.scala:38)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$.com$ibm$sparktc$sparkbench$workload$MultipleSuiteKickoff$$runSuitesSerially(MultipleSuiteKickoff.scala:38)
	at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$$anonfun$run$1.apply(MultipleSuiteKickoff.scala:28)
	at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$$anonfun$run$1.apply(MultipleSuiteKickoff.scala:25)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$.run(MultipleSuiteKickoff.scala:25)
	at com.ibm.sparktc.sparkbench.cli.CLIKickoff$.main(CLIKickoff.scala:30)
	at com.ibm.sparktc.sparkbench.cli.CLIKickoff.main(CLIKickoff.scala)

Description of Problem, Any Other Info

Spark-Bench is not playing nice with remote HDFS systems.
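
The error message is telling here: the hdfs://hdfs:9000/... output path is being checked against the local filesystem (expected: file:///). My reading, not confirmed in this thread, is that the verification code obtains the default FileSystem instead of deriving one from the path's own URI. A minimal Scala sketch of the difference, using the standard Hadoop API:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val out  = new Path("hdfs://hdfs:9000/tmp/csv-vs-parquet/results-data-gen.csv")

// Fails when fs.defaultFS is file:///, since the local FS rejects an hdfs://
// path with exactly the "Wrong FS ... expected: file:///" error in the trace:
val defaultFs = FileSystem.get(conf)
// defaultFs.exists(out)

// Works regardless of the default FS: resolve the FileSystem from the path's URI.
val pathFs = out.getFileSystem(conf)
val exists = pathFs.exists(out)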

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
AndriiSushko commented, Feb 15, 2018

Thanks for such a quick reply. Spark is in a standalone mode, 3 nodes (1 master and 3 slaves). HDFS is up, running and available. I’ll check Input/Output tests and fill out a separate issue if I don’t figure this out.

UPD: @ecurtin please check issue #154 and let me know if you need some more details.

1 reaction
ecurtin commented, Feb 15, 2018

@yekaifeng @AndriiSushko I am happy to do what I can to help, but I’ll need more information from you because I can’t seem to reproduce the issue.

Here’s what I’ll need to help you:

  1. Please fill out a separate issue with the info requested in the issue template (your config file, whether your cluster is YARN, Standalone, etc.) and the stack trace of your issue.
  2. Ensure that you’re able to read and write to HDFS from outside of Spark-Bench. Optionally, you can use OutputTest and InputTest from SparkTests to see whether things are working; first, change the URLs in the run scripts in the bin directory of SparkTests to reflect your environment. (A similar standalone check is sketched below.)
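
For a quick version of point 2 (a sketch of mine, not part of SparkTests), reading and writing the same HDFS paths from a plain spark-shell session isolates HDFS connectivity from Spark-Bench itself:

// Run inside spark-shell; the host and port mirror the config above and
// may need adjusting for your environment.
val df = spark.range(1000).toDF("n")

// Write a small dataset to the remote HDFS...
df.write.mode("overwrite").csv("hdfs://hdfs:9000/tmp/spark-bench-io-test")

// ...then read it back. If either step fails, the problem is HDFS
// connectivity or configuration rather than Spark-Bench.
val back = spark.read.csv("hdfs://hdfs:9000/tmp/spark-bench-io-test")
println(back.count())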

Top Results From Across the Web

CODAIT/spark-bench - Files not writing to remote HDFS
After that I change it to return true, commented generation and "Classic benchmarking" also run successfully. So I believe either getHadoopFS( ...
Solved: Remote spark-submit HDFS error
I am trying to launch a spark job from a remote host to the HDP sandbox running on my Mac but keep getting...
Failed to Write Files to HDFS, and "item limit of / is ...
The client or upper-layer component logs indicate that a file fails to be written to a directory on HDFS. The error information is...
Spark Remote execution to Cluster fails - HDFS connection ...
I am having issues submitting a spark-submit remote job from a machine outside from the Spark Cluster running on YARN ...
Dell EMC PowerStore: Apache Spark Solution Guide
ESXi™ hypervisor automatically restarts or migrates failed Spark and HDFS servers to a different ESXi node. This process resumes operations ...
