Files not writing to remote HDFS
Spark-Bench version (version number, tag, or git commit hash)
dc78dad
Details of your cluster setup (Spark version, Standalone/Yarn/Local/Etc)
Spark 2.1.1, Standalone mode
Scala version on your cluster
2.11.6
Your exact configuration file (with system details anonymized for security)
From time to time I commented out either the generation suite or the benchmarking suite to test them separately.
spark-bench = {
  spark-submit-parallel = false
  spark-submit-config = [{
    suites-parallel = false
    workload-suites = [
      {
        descr = "Generating data for the benchmarks to use"
        parallel = false
        repeat = 1 // generate once and done!
        benchmark-output = "console"
        workloads = [
          {
            name = "data-generation-kmeans"
            output = "hdfs://hostname:9000/tmp/spark-bench-test/kmeans-data.parquet"
            rows = 10000
            cols = 14
          }
        ]
      },
      {
        descr = "Classic benchmarking"
        parallel = false
        repeat = 1
        benchmark-output = "console"
        workloads = [
          {
            name = "kmeans"
            input = "hdfs://hostname:9000/tmp/spark-bench-test/kmeans-data.parquet"
          }
        ]
      }
    ]
  }]
}
Relevant stacktrace
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: hdfs://hostname:9000/tmp/spark-bench-test/kmeans-data.parquet, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
at com.ibm.sparktc.sparkbench.utils.SparkFuncs$.pathExists(SparkFuncs.scala:87)
at com.ibm.sparktc.sparkbench.utils.SparkFuncs$.verifyCanWrite(SparkFuncs.scala:41)
at com.ibm.sparktc.sparkbench.utils.SparkFuncs$.verifyCanWriteOrThrow(SparkFuncs.scala:102)
at com.ibm.sparktc.sparkbench.utils.SparkFuncs$.verifyOutput(SparkFuncs.scala:34)
at com.ibm.sparktc.sparkbench.workload.Workload$class.run(Workload.scala:49)
at com.ibm.sparktc.sparkbench.datageneration.mlgenerator.KMeansDataGen.run(KMeansDataGen.scala:44)
Description of your problem and any other relevant info
The actual error occurs at com.ibm.sparktc.sparkbench.utils.SparkFuncs$.pathExists(SparkFuncs.scala:87).
For a quick test I changed com.ibm.sparktc.sparkbench.utils.SparkFuncs.pathExists
as follows:
def pathExists(path: String, spark: SparkSession): Boolean = { false }
and data generation ran successfully. After that I changed it to return true, commented out generation, and "Classic benchmarking" also ran successfully. So I believe that either
getHadoopFS(path, spark).exists(new org.apache.hadoop.fs.Path(path))
doesn't work as expected for both files and folders, or com.ibm.sparktc.sparkbench.utils.SparkFuncs$.pathExists(SparkFuncs.scala:87)
is used somewhere in an erroneous way.
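The "Wrong FS: hdfs://…, expected: file:///" message suggests that pathExists ends up asking the default (local) FileSystem about an hdfs:// path. A minimal sketch of one possible fix, resolving the FileSystem from the path's own URI instead; this is only my assumption about the approach, not necessarily the change that actually landed:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

// Resolve the FileSystem from the path's own scheme (hdfs://, file://, ...)
// rather than from fs.defaultFS, so remote paths are checked against
// the right filesystem.
def pathExists(path: String, spark: SparkSession): Boolean = {
  val hadoopPath = new Path(path)
  val fs = hadoopPath.getFileSystem(spark.sparkContext.hadoopConfiguration)
  fs.exists(hadoopPath)
}

With a variant like this, the exists check runs against the filesystem that actually owns the path, so both the data-generation output check and the benchmark input check would hit HDFS instead of the local filesystem.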
Top GitHub Comments
Closed by #155. A distribution will be created and uploaded to GitHub Releases in ~15 minutes, or check out master. Thank you @AndriiSushko for your thorough and clear bug report!

@AndriiSushko I would be super happy to have you as a contributor! Send me an email if you wanna chat about design or Scala or anything. 😃