Problem with special characters in file path
Hi Ruslan,
I am a colleague of @wrangel. We are having problems with special characters in the paths of our data files (anything that gets percent-escaped in a URI). Here is the full stack trace:
```
java.io.FileNotFoundException: File /mnt/landingzone/source/daily/1007_rs/2019-09-20/with%20space does not exist
  at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:539)
  at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:752)
  at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:529)
  at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
  at za.co.absa.cobrix.spark.cobol.utils.HDFSUtils$.getBlocksLocations(HDFSUtils.scala:56)
  at za.co.absa.cobrix.spark.cobol.utils.HDFSUtils$.getBlocksLocations(HDFSUtils.scala:37)
  at za.co.absa.cobrix.spark.cobol.source.index.IndexBuilder$$anonfun$2.apply(IndexBuilder.scala:146)
  at za.co.absa.cobrix.spark.cobol.source.index.IndexBuilder$$anonfun$2.apply(IndexBuilder.scala:145)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
  at za.co.absa.cobrix.spark.cobol.source.index.IndexBuilder$.toRDDWithLocality(IndexBuilder.scala:145)
  at za.co.absa.cobrix.spark.cobol.source.index.IndexBuilder$.buildIndexForVarLenReaderWithFullLocality(IndexBuilder.scala:69)
  at za.co.absa.cobrix.spark.cobol.source.index.IndexBuilder$.buildIndex(IndexBuilder.scala:50)
  at za.co.absa.cobrix.spark.cobol.source.CobolRelation.indexes$lzycompute(CobolRelation.scala:80)
  at za.co.absa.cobrix.spark.cobol.source.CobolRelation.indexes(CobolRelation.scala:80)
  at za.co.absa.cobrix.spark.cobol.source.CobolRelation.buildScan(CobolRelation.scala:92)
  at org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:308)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
  at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
  at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
  at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
  at scala.collection.Iterator$class.foreach(Iterator.scala:891)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
  at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
  at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1334)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
  at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
  at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
  at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
  at scala.collection.Iterator$class.foreach(Iterator.scala:891)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
  at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
  at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1334)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
  at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
  at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
  at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
  at scala.collection.Iterator$class.foreach(Iterator.scala:891)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
  at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
  at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1334)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
  at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
  at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
  at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
  at scala.collection.Iterator$class.foreach(Iterator.scala:891)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
  at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
  at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1334)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
  at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
  at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
  at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
  at scala.collection.Iterator$class.foreach(Iterator.scala:891)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
  at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
  at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1334)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
  at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
  at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
  at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
  at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
  at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
  at org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:100)
  at org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:67)
  at org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:91)
  at org.apache.spark.sql.Dataset.persist(Dataset.scala:2968)
  at org.apache.spark.sql.Dataset.cache(Dataset.scala:2978)
  at ch.swisscard.bigdataanalyticspoc.app.stage1.cobolparser.CobolFileReader.run(CobolFileReader.scala:54)
  at ch.swisscard.bigdataanalyticspoc.app.stage1.cobolparser.CobolParser.parse(CobolParser.scala:18)
  at ch.swisscard.bigdataanalyticspoc.app.stage1.Stage1$$anonfun$1.apply(Stage1.scala:53)
  at ch.swisscard.bigdataanalyticspoc.app.stage1.Stage1$$anonfun$1.apply(Stage1.scala:49)
  at scala.util.Try$.apply(Try.scala:192)
  at ch.swisscard.bigdataanalyticspoc.app.stage1.Stage1$.ch$swisscard$bigdataanalyticspoc$app$stage1$Stage1$$parseFile(Stage1.scala:49)
  at ch.swisscard.bigdataanalyticspoc.app.stage1.Stage1$$anonfun$run$1$$anonfun$apply$1.apply(Stage1.scala:35)
  at ch.swisscard.bigdataanalyticspoc.app.stage1.Stage1$$anonfun$run$1$$anonfun$apply$1.apply(Stage1.scala:31)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at ch.swisscard.bigdataanalyticspoc.app.stage1.Stage1$$anonfun$run$1.apply(Stage1.scala:30)
  at ch.swisscard.bigdataanalyticspoc.app.stage1.Stage1$$anonfun$run$1.apply(Stage1.scala:22)
  at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
  at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
  at ch.swisscard.bigdataanalyticspoc.app.stage1.Stage1$.run(Stage1.scala:22)
  at ch.swisscard.bigdataanalyticspoc.app.Main$.delayedEndpoint$ch$swisscard$bigdataanalyticspoc$app$Main$1(Main.scala:15)
  at ch.swisscard.bigdataanalyticspoc.app.Main$delayedInit$body.apply(Main.scala:11)
  at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
  at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
  at scala.App$$anonfun$main$1.apply(App.scala:76)
  at scala.App$$anonfun$main$1.apply(App.scala:76)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
  at scala.App$class.main(App.scala:76)
  at ch.swisscard.bigdataanalyticspoc.app.Main$.main(Main.scala:11)
  at ch.swisscard.bigdataanalyticspoc.app.Main.main(Main.scala)
```
We did some digging and found that the path is converted into a URI once too often. Cobrix converts the path in FileUtils.getFiles():84: the path is turned into a URI and then back into a raw path, which retains the URI's percent escaping ('with space' becomes 'with%20space'). That string is then handed to Hadoop, which converts the Path into a URI a second time ('with%20space' becomes 'with%2520space'). getPath() later undoes one level of escaping ('with%2520space' becomes 'with%20space' again), and the result is used as the path of a java.io.File object; see RawLocalFileSystem.pathToFile():86.
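For illustration, here is a minimal, self-contained sketch of the double escaping described above (this is not Cobrix's actual code, and the `/tmp/with space` path is made up):

```scala
import java.io.File
import org.apache.hadoop.fs.Path

object DoubleEscapeDemo extends App {
  // A file whose name contains a character that gets escaped in URIs:
  val file = new File("/tmp/with space")

  // File.toURI percent-escapes the space once: file:/tmp/with%20space
  val uri = file.toURI

  // getRawPath keeps the escaping: "/tmp/with%20space"
  val rawPath = uri.getRawPath

  // Hadoop's Path(String) constructor parses its argument as a URI and
  // always quotes '%', so the already-escaped path is escaped again:
  val hadoopPath = new Path(rawPath)
  println(hadoopPath.toUri.getRawPath) // /tmp/with%2520space

  // RawLocalFileSystem.pathToFile() then decodes one level via getPath()
  // and tries to open the non-existent "/tmp/with%20space":
  println(hadoopPath.toUri.getPath)    // /tmp/with%20space

  // Passing the decoded path (uri.getPath) instead avoids the double escape:
  println(new Path(uri.getPath).toUri.getPath) // /tmp/with space
}
```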
We would appreciate your help in resolving this.
Thanks, Patrick
Top GitHub Comments
I can now confirm that it works for us as well.
Thanks again, Patrick
The snapshot repository is not searched by default; you can enable it temporarily by adding this to your Maven profile:
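A minimal sketch of such a profile, assuming Cobrix snapshot builds are published to the Sonatype OSS snapshots repository (the profile and repository ids are arbitrary; adjust the URL if the snapshots live elsewhere):

```xml
<profile>
  <id>allow-snapshots</id>
  <repositories>
    <repository>
      <!-- Assumed location of the snapshot builds -->
      <id>oss-snapshots</id>
      <url>https://oss.sonatype.org/content/repositories/snapshots</url>
      <releases>
        <enabled>false</enabled>
      </releases>
      <snapshots>
        <enabled>true</enabled>
      </snapshots>
    </repository>
  </repositories>
</profile>
```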
The fix worked well on our cluster, so we will release 1.1.1 soon.
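Once 1.1.1 is published, picking it up should be a one-line dependency bump; a sketch in sbt, assuming the usual za.co.absa.cobrix group id and spark-cobol artifact:

```scala
// Cobrix Spark-COBOL data source; %% appends your Scala binary version
// (e.g. spark-cobol_2.11) to the artifact name.
libraryDependencies += "za.co.absa.cobrix" %% "spark-cobol" % "1.1.1"
```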