
[BUG] option 'ignoreAfterHeader' not work

See original GitHub issue

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public static void main(String[] args) {
    SparkSession sparkSession = SparkSession.builder()
            .master("local[*]")
            .appName("demo")
            .getOrCreate();
    // Read the sheet with a header row and ask spark-excel to skip one row after it.
    Dataset<Row> rows = sparkSession.read()
            .format("com.crealytics.spark.excel")
            .option("dataAddress", "'Sheet1'!A1")
            .option("header", true)
            .option("ignoreAfterHeader", 1L)
            .option("maxRowsInMemory", 20)
            .load("file:///Users/td/Downloads/20w_2id_AtypeSM3.xlsx");
    rows.show();
}
The first rows of the Excel file (the second row holds Chinese column labels: test sequence number, look-back date, ID card number, mobile number):

test_id time_point id_number mobile
测试序号 回溯日期 身份证号 手机号
1 2021/12/8 17:06 cbdddb8e8421b23498480570d7d75330538a6882f5dfdc3b64115c647f3328c4 cbdddb8e8421b23498480570d7d75330538a6882f5dfdc3b64115c647f3328c4
2 2021/12/8 17:06 a0dc2d74b9b0e3c87e076003dbfe472a424cb3032463cb339e351460765a822e a0dc2d74b9b0e3c87e076003dbfe472a424cb3032463cb339e351460765a822e
3 2021/12/8 17:06 55e3192d096e62d4f9cd00e734a949de2b8e55b13d9b85b1d2d2999c9db2e72c 55e3192d096e62d4f9cd00e734a949de2b8e55b13d9b85b1d2d2999c9db2e72c
4 2021/12/8 17:06 9b602e9b9e8556eff1a28962d4580b34d9bf054f4831f4f924d4a6dfad660e88 9b602e9b9e8556eff1a28962d4580b34d9bf054f4831f4f924d4a6dfad660e88
5 2021/12/8 17:06 5c0d4f4953843ed6f3c54ea7ca2cc4a86d8b7723c3bf0f3fd403d4c61a77feca 5c0d4f4953843ed6f3c54ea7ca2cc4a86d8b7723c3bf0f3fd403d4c61a77feca

Maven dependency:

<dependency>
    <groupId>com.crealytics</groupId>
    <artifactId>spark-excel_2.12</artifactId>
    <version>0.14.0</version>
</dependency>

I wanted to use the 'ignoreAfterHeader' option to skip the second row (the row of Chinese labels), but it had no effect.

console output:

+--------+-------------+--------------------+--------------------+
| test_id|   time_point|           id_number|              mobile|
+--------+-------------+--------------------+--------------------+
|测试序号|     回溯日期|            身份证号|              手机号|
|       1|12/8/21 17:06|cbdddb8e8421b2349...|cbdddb8e8421b2349...|
|       2|12/8/21 17:06|a0dc2d74b9b0e3c87...|a0dc2d74b9b0e3c87...|
|       3|12/8/21 17:06|55e3192d096e62d4f...|55e3192d096e62d4f...|
|       4|12/8/21 17:06|9b602e9b9e8556eff...|9b602e9b9e8556eff...|
|       5|12/8/21 17:06|5c0d4f4953843ed6f...|5c0d4f4953843ed6f...|
|       6|12/8/21 17:06|f83340f3147b49827...|f83340f3147b49827...|
|       7|12/8/21 17:06|d712cf4114c03dc43...|d712cf4114c03dc43...|
|       8|12/8/21 17:06|fefad899b5dc20858...|fefad899b5dc20858...|
|       9|12/8/21 17:06|8e7a98f9565619a4d...|8e7a98f9565619a4d...|
|      10|12/8/21 17:06|3eaa72f81914fb894...|3eaa72f81914fb894...|
|      11|12/8/21 17:06|d5744897e47fb6d78...|d5744897e47fb6d78...|
|      12|12/8/21 17:06|6f61c3af9dcc39522...|6f61c3af9dcc39522...|
|      13|12/8/21 17:06|abe1b0a5a9e58808c...|abe1b0a5a9e58808c...|
|      14|12/8/21 17:06|87c186adf88a37443...|87c186adf88a37443...|
|      15|12/8/21 17:06|7b4073a22410aafc3...|7b4073a22410aafc3...|
|      16|12/8/21 17:06|dab089f470a4bcb77...|dab089f470a4bcb77...|
|      17|12/8/21 17:06|1f78641036c71b8e6...|1f78641036c71b8e6...|
|      18|12/8/21 17:06|47fb25b4d4af9f2da...|47fb25b4d4af9f2da...|
|      19|12/8/21 17:06|8f1818a052ee87314...|8f1818a052ee87314...|
+--------+-------------+--------------------+--------------------+

Expected Behavior

I expect the 'ignoreAfterHeader' option to skip the given number of rows after the header, so the row of Chinese labels should not appear in the DataFrame.
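
If 'ignoreAfterHeader' is only honored by the newer 'excel' (V2) data source rather than by 'com.crealytics.spark.excel', the same read through the short format name would look roughly like this; a minimal sketch, reusing the file and options from the report and assuming a spark-excel build that registers the short name:

// Sketch only: same file and options as the report, but via the short "excel"
// format name (registered by newer spark-excel releases). Whether the older
// com.crealytics.spark.excel source honors ignoreAfterHeader is exactly what
// this issue is about.
Dataset<Row> rows = sparkSession.read()
        .format("excel")
        .option("dataAddress", "'Sheet1'!A1")
        .option("header", true)
        .option("ignoreAfterHeader", 1L)
        .option("maxRowsInMemory", 20)
        .load("file:///Users/td/Downloads/20w_2id_AtypeSM3.xlsx");
rows.show();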

Steps To Reproduce

public static void main(String[] args) {
    SparkSession sparkSession = SparkSession.builder()
            .master("local[*]")
            .appName("demo")
            .getOrCreate();
    Dataset<Row> rows = sparkSession.read()
            .format("com.crealytics.spark.excel")
            .option("dataAddress", "'new贷前画像-DCPACP指标3.0'!A1")
            .option("header", true)
            .option("ignoreAfterHeader", 1L)
            .option("maxRowsInMemory", 20)
            .load("file:///Users/td/Downloads/20w_2id_AtypeSM3.xlsx");
    rows.show();
}
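
A possible stopgap while the option has no effect is to drop the label row after the read; a minimal sketch, continuing from the repro above and keyed on the test_id label text shown in the sample data:

// requires: import static org.apache.spark.sql.functions.col;
// Workaround sketch: filter out the row of Chinese labels after reading,
// using the test_id value taken from the sample above ("测试序号").
Dataset<Row> dataRows = rows.filter(col("test_id").notEqual("测试序号"));
dataRows.show();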

Environment

- Spark version: 3.1.1
- Spark-Excel version: 0.14.0
- OS: macOS
- Cluster environment: local[*]

Anything else?

No response

Issue Analytics

  • State:open
  • Created a year ago
  • Comments:13

Top GitHub Comments

2 reactions
github-actions[bot] commented, Jul 16, 2022

Please check these potential duplicates:

  • [#615] [BUG] partitionBy not working as expected (62.91%)

If this issue is a duplicate, please add any additional info to the ticket with the most information and close this one.
0 reactions
mgyboom commented, Oct 17, 2022

@mgyboom can you try 0.18.3 which should now be correctly cross-published for all Spark versions.

Although I am using the 3.1.1_0.18.3 version, it still fails:

java.lang.ClassNotFoundException: Failed to find data source: excel. Please find packages at http://spark.apache.org/third-party-projects.html
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:689)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:743)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:266)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
	at cn.tongdun.sparkdatahandler.handler.impl.ExcelHandler.read(ExcelHandler.java:34)
	at cn.tongdun.sparkdatahandler.handler.impl.ExcelHandler.read(ExcelHandler.java:13)
	at cn.tongdun.sparkdatahandler.BaseMain.read(BaseMain.java:87)
	at cn.tongdun.sparkdatahandler.Sharding.main(Sharding.java:58)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1030)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1039)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: excel.DefaultSource
	at java.base/java.net.URLClassLoader.findClass(Unknown Source)
	at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
	at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:663)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:663)
	at scala.util.Failure.orElse(Try.scala:224)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:663)
	... 19 more
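
This ClassNotFoundException means Spark could not resolve the short format name 'excel' on the running classpath, which usually points at the spark-excel jar not being shipped with the job (wrong --jars/--packages coordinate or Scala version) rather than at the option itself. A rough probe of the driver classpath; the class names are assumptions based on recent spark-excel layouts, not taken from this thread:

// Classpath probe: check whether the spark-excel data source classes are
// visible to the driver before Spark's own lookup runs. Class names are
// assumptions, not confirmed by this thread.
String[] candidates = {
        "com.crealytics.spark.excel.DefaultSource",      // V1 source, format "com.crealytics.spark.excel"
        "com.crealytics.spark.excel.v2.ExcelDataSource"  // assumed V2 source behind the short name "excel"
};
for (String name : candidates) {
    try {
        Class.forName(name);
        System.out.println("found " + name);
    } catch (ClassNotFoundException e) {
        System.out.println("missing " + name + " -- check the --jars/--packages setup");
    }
}
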
Read more comments on GitHub >
