Files not writing to remote HDFS


Spark-Bench version (version number, tag, or git commit hash)

dc78dad

Details of your cluster setup (Spark version, Standalone/Yarn/Local/Etc)

Spark 2.1.1, Standalone mode

Scala version on your cluster

2.11.6

Your exact configuration file (with system details anonymized for security)

To test them separately, I sometimes commented out either the data-generation suite or the benchmarking suite.

spark-bench = {
  spark-submit-parallel = false
  spark-submit-config = [{
    suites-parallel = false
    workload-suites = [
      {
        descr = "Generating data for the benchmarks to use"
        parallel = false
        repeat = 1 // generate once and done!
        benchmark-output = "console"
        workloads = [
          {
            name = "data-generation-kmeans"
            output = "hdfs://hostname:9000/tmp/spark-bench-test/kmeans-data.parquet"
            rows = 10000
            cols = 14
          }
        ]
      },
      {
        descr = "Classic benchmarking"
        parallel = false
        repeat = 1 
        benchmark-output = "console"
        workloads = [
          {
            name = "kmeans"
            input = "hdfs://hostname:9000/tmp/spark-bench-test/kmeans-data.parquet"
          }
        ]
      }
    ]
  }]
}

Relevant stacktrace

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: hdfs://hostname:9000/tmp/spark-bench-test/kmeans-data.parquet, expected: file:///
	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
	at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
	at com.ibm.sparktc.sparkbench.utils.SparkFuncs$.pathExists(SparkFuncs.scala:87)
	at com.ibm.sparktc.sparkbench.utils.SparkFuncs$.verifyCanWrite(SparkFuncs.scala:41)
	at com.ibm.sparktc.sparkbench.utils.SparkFuncs$.verifyCanWriteOrThrow(SparkFuncs.scala:102)
	at com.ibm.sparktc.sparkbench.utils.SparkFuncs$.verifyOutput(SparkFuncs.scala:34)
	at com.ibm.sparktc.sparkbench.workload.Workload$class.run(Workload.scala:49)
	at com.ibm.sparktc.sparkbench.datageneration.mlgenerator.KMeansDataGen.run(KMeansDataGen.scala:44)
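
The “Wrong FS” message is Hadoop’s standard complaint when a Path with one scheme (here hdfs://) is handed to a FileSystem instance bound to a different scheme (here the local file:/// filesystem). A minimal sketch of the distinction, assuming a SparkSession named spark is in scope (illustrative only, not spark-bench code):

import org.apache.hadoop.fs.{FileSystem, Path}

val conf = spark.sparkContext.hadoopConfiguration
val p = new Path("hdfs://hostname:9000/tmp/spark-bench-test/kmeans-data.parquet")

// Resolves against fs.defaultFS; without a core-site.xml pointing at HDFS this
// is the local filesystem, so calling exists(p) on it throws "Wrong FS".
val defaultFs = FileSystem.get(conf)

// Resolves the filesystem from the path's own scheme, so hdfs:// paths work.
val schemeFs = p.getFileSystem(conf)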

Description of your problem and any other relevant info

The actual error occurs at: com.ibm.sparktc.sparkbench.utils.SparkFuncs$.pathExists(SparkFuncs.scala:87)

For a quick test I changed com.ibm.sparktc.sparkbench.utils.SparkFuncs.pathExists as follows:

def pathExists(path: String, spark: SparkSession): Boolean = { false }

and data generation ran successfully. After that I changed it to return true, commented out the generation suite, and “Classic benchmarking” also ran successfully. So I believe that either getHadoopFS(path, spark).exists(new org.apache.hadoop.fs.Path(path)) does not work as expected for both files and folders, or com.ibm.sparktc.sparkbench.utils.SparkFuncs$.pathExists(SparkFuncs.scala:87) is called somewhere in an erroneous way.
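
For reference, here is a minimal sketch of a pathExists that derives the filesystem from the path itself rather than from the default filesystem; it is only an illustration of the suspected fix, not necessarily what spark-bench ended up doing:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

def pathExists(path: String, spark: SparkSession): Boolean = {
  val hadoopPath = new Path(path)
  // Derive the FileSystem from the path's scheme (hdfs://, file://, ...)
  // instead of from fs.defaultFS, so remote HDFS paths resolve correctly.
  val fs: FileSystem = hadoopPath.getFileSystem(spark.sparkContext.hadoopConfiguration)
  fs.exists(hadoopPath)
}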

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
ecurtin commented, Feb 15, 2018

Closed by #155. The distribution will be created and uploaded to GitHub Releases in ~15 minutes, or check out master. Thank you @AndriiSushko for your thorough and clear bug report!

0 reactions
ecurtin commented, Feb 16, 2018

@AndriiSushko I would be super happy to have you as a contributor! Send me an email if you wanna chat about design or Scala or anything. 😃


