question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Dependency issues with Spark's built-in commons-compress

See original GitHub issue

I can use the library when I run spark on my local windows machine and read excel files on the same machine. However, when I upload the files to WASB on Azure and use HDInsight cluster for running spark jobs (either local or cluster mode), I get the following error:

java.lang.IllegalArgumentException: InputStream of class class org.apache.commons.compress.archivers.zip.ZipArchiveInputStream is not implementing InputStreamStatistics. at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.<init>(ZipArchiveThresholdInputStream.java:63) at org.apache.poi.openxml4j.opc.internal.ZipHelper.openZipStream(ZipHelper.java:180) at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:104) at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:298) at org.apache.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:129) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.poi.ss.usermodel.WorkbookFactory.createWorkbook(WorkbookFactory.java:314) at org.apache.poi.ss.usermodel.WorkbookFactory.createXSSFWorkbook(WorkbookFactory.java:296) at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:214) at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:180) at com.crealytics.spark.excel.ExcelRelation$$anonfun$openWorkbook$2$$anonfun$apply$4.apply(ExcelRelation.scala:66) at com.crealytics.spark.excel.ExcelRelation$$anonfun$openWorkbook$2$$anonfun$apply$4.apply(ExcelRelation.scala:66) at scala.Option.fold(Option.scala:158) at com.crealytics.spark.excel.ExcelRelation$$anonfun$openWorkbook$2.apply(ExcelRelation.scala:66) at com.crealytics.spark.excel.ExcelRelation$$anonfun$openWorkbook$2.apply(ExcelRelation.scala:66) at scala.Option.getOrElse(Option.scala:121) at com.crealytics.spark.excel.ExcelRelation.openWorkbook(ExcelRelation.scala:64) at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:71) at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:70) at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:264) at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:263) at scala.Option.getOrElse(Option.scala:121) at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:263) at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:91) at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:39) at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:14) at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:309) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156) ... 53 elided

Issue Analytics

  • State:closed
  • Created 5 years ago
  • Comments:29

github_iconTop GitHub Comments

2reactions
xvinoshcommented, Aug 17, 2020

@nightscape : In my case, I tried all the versions from 0.12.1 to 0.13.5, none worked. Downloaded the latest version of common compress manually which spark-shell showed as if it has downloaded while launching the spark shell with packages option but actually did not(as I could not find anywhere in the maven repo dir where it said, it’s downloaded) Version: 1.20 Then explicitly mentioned the jar name in the driver’s classpath as mentioned below: $ spark-shell --driver-class-path /home/xvinosh/.m2/repository/org/apache/commons/commons-compress/1.20/commons-compress-1.20jar

This worked. Hope it helps other.

1reaction
jornfrankecommented, Jun 27, 2019

Maybe some of your dependencies have POI as a dependency and then this dependency does not use the shaded commons-io

Read more comments on GitHub >

github_iconTop Results From Across the Web

Dependency issues with Spark's built-in commons-compress -
I can use the library when I run spark on my local windows machine and read excel files on the same machine. However,...
Read more >
Commons Compress – Dependency Information
Dependency Information · Apache Maven · Apache Buildr · Apache Ivy · Groovy Grape · Gradle/Grails · Scala SBT · Leiningen.
Read more >
Built-in Dependencies - Data Lake Insight
DLI built-in dependencies are provided by the platform by default. In case of conflicts, you do not need to upload them when packaging...
Read more >
Spark is ignoring overridden libraries, using provided ones ...
It seems it uses the (older) commons-compress which Spark uses, but ignores the one I baked into my jar. How can I tell...
Read more >
Fixed Issues in Apache Spark | CDP Private Cloud
Review the list of Spark issues that are resolved in Cloudera Runtime 7.1.7 SP1.
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found