Spark-Bench failing with remote HDFS
Spark-Bench Version
992a278
Spark Version on Your Cluster
2.2.0
Scala Version on Your Spark Cluster
2.11.8
Your Exact Configuration File (with system details anonymized)
spark-bench = {
  spark-submit-config = [{
    spark-home = "/opt/kubespark"
    spark-args = {
      master = "k8s://https://x.x.x.x:6443"
      executor-memory = "4g"
    }
    conf = {
    }
    suites-parallel = false
    workload-suites = [
      {
        descr = "Generate a dataset, then take that same dataset and write it out to Parquet format"
        benchmark-output = "hdfs://hdfs:9000/tmp/csv-vs-parquet/results-data-gen.csv"
        save-mode = "overwrite"
        // We need to generate the dataset first through the data generator, then we take that dataset and convert it to Parquet.
        parallel = false
        workloads = [
          {
            name = "data-generation-kmeans"
            rows = 10000000
            cols = 24
            output = "hdfs://hdfs:9000/tmp/csv-vs-parquet/kmeans-data.csv"
            save-mode = "ignore"
          },
          {
            name = "sql"
            query = "select * from input"
            input = "hdfs://hdfs:9000/tmp/csv-vs-parquet/kmeans-data.csv"
            output = "hdfs://hdfs:9000/tmp/csv-vs-parquet/kmeans-data.parquet"
            save-mode = "ignore"
          }
        ]
      },
      {
        descr = "Run two different SQL queries over the dataset in two different formats"
        benchmark-output = "hdfs://hdfs:9000/tmp/csv-vs-parquet/results-sql.csv"
        save-mode = "overwrite"
        parallel = false
        repeat = 10
        workloads = [
          {
            name = "sql"
            input = ["hdfs://hdfs:9000/tmp/csv-vs-parquet/kmeans-data.csv", "hdfs://hdfs:9000/tmp/csv-vs-parquet/kmeans-data.parquet"]
            query = ["select * from input", "select `0`, `22` from input where `0` < -0.9"]
            cache = false
          }
        ]
      }
    ]
  }]
}
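A possible workaround, offered here as an assumption rather than anything confirmed by the spark-bench maintainers: since the failure happens while the driver checks the output path against the local file system, populating the otherwise-empty conf block with Spark's spark.hadoop.* passthrough may point the driver's Hadoop configuration at the remote HDFS. The namenode address below is simply copied from the hdfs:// paths in the config above.

    conf = {
      // Hypothetical fix: Spark copies any "spark.hadoop.*" entry into the Hadoop
      // Configuration, so this sets fs.defaultFS for the driver and executors.
      "spark.hadoop.fs.defaultFS" = "hdfs://hdfs:9000"
    }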
Relevant Stack Trace (If Applicable)
18/02/08 14:26:15 INFO WatchConnectionManager: Current reconnect backoff is 1000 milliseconds (T0)
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: hdfs:/tmp/csv-vs-parquet/results-data-gen.csv, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
at com.ibm.sparktc.sparkbench.utils.SparkFuncs$.pathExists(SparkFuncs.scala:68)
at com.ibm.sparktc.sparkbench.utils.SparkFuncs$.verifyCanWrite(SparkFuncs.scala:40)
at com.ibm.sparktc.sparkbench.utils.SparkFuncs$.verifyCanWriteOrThrow(SparkFuncs.scala:83)
at com.ibm.sparktc.sparkbench.utils.SparkFuncs$.verifyOutput(SparkFuncs.scala:33)
at com.ibm.sparktc.sparkbench.workload.SuiteKickoff$.run(SuiteKickoff.scala:61)
at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$$anonfun$com$ibm$sparktc$sparkbench$workload$MultipleSuiteKickoff$$runSuitesSerially$1.apply(MultipleSuiteKickoff.scala:38)
at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$$anonfun$com$ibm$sparktc$sparkbench$workload$MultipleSuiteKickoff$$runSuitesSerially$1.apply(MultipleSuiteKickoff.scala:38)
at scala.collection.immutable.List.foreach(List.scala:381)
at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$.com$ibm$sparktc$sparkbench$workload$MultipleSuiteKickoff$$runSuitesSerially(MultipleSuiteKickoff.scala:38)
at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$$anonfun$run$1.apply(MultipleSuiteKickoff.scala:28)
at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$$anonfun$run$1.apply(MultipleSuiteKickoff.scala:25)
at scala.collection.immutable.List.foreach(List.scala:381)
at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$.run(MultipleSuiteKickoff.scala:25)
at com.ibm.sparktc.sparkbench.cli.CLIKickoff$.main(CLIKickoff.scala:30)
at com.ibm.sparktc.sparkbench.cli.CLIKickoff.main(CLIKickoff.scala)
Description of Problem, Any Other Info
Spark-Bench is not working with a remote HDFS: the hdfs:// output paths in the config above are checked against the local file system, and the run fails before any workload starts.
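The "Wrong FS ... expected: file:///" message is what Hadoop's FileSystem.checkPath throws when a FileSystem instance created for one scheme (here the local file:/// default) is handed a path with a different scheme (hdfs://...). Below is a minimal sketch of the two behaviours, assuming the output check obtains its FileSystem from the default Hadoop configuration rather than from the path itself; the actual SparkFuncs code in spark-bench may differ.

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object WrongFsSketch {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()   // fs.defaultFS is file:/// unless configured otherwise
        val out  = new Path("hdfs://hdfs:9000/tmp/csv-vs-parquet/results-data-gen.csv")

        // What the stack trace suggests: the default (local) file system is asked
        // about an hdfs:// path, and checkPath throws
        // "Wrong FS: hdfs:/..., expected: file:///".
        val localFs = FileSystem.get(conf)
        // localFs.exists(out)           // would throw IllegalArgumentException here

        // Resolving the file system from the path's own URI avoids the mismatch,
        // provided the HDFS namenode is reachable from the driver.
        val pathFs = FileSystem.get(new URI(out.toString), conf)   // or out.getFileSystem(conf)
        println(pathFs.getUri)           // hdfs://hdfs:9000
        println(pathFs.exists(out))
      }
    }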
Top GitHub Comments
Thanks for such a quick reply. Spark is in standalone mode, 3 nodes (1 master and 3 slaves). HDFS is up, running, and reachable. I'll check the Input/Output tests and open a separate issue if I don't figure this out.
Update: @ecurtin please check issue #154 and let me know if you need any more details.
@yekaifeng @AndriiSushko I am happy to do what I can to help, but I’ll need more information from you because I can’t seem to reproduce the issue.
Here's what I'll need to help you:
- the bin directory of SparkTests, updated to reflect your environment.