
[BUG] option 'ignoreAfterHeader' not work

See original GitHub issue

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public static void main(String[] args) {
    SparkSession sparkSession = SparkSession.builder()
            .master("local[*]")
            .appName("demo")
            .getOrCreate();
    // Read the sheet with a header row and ask spark-excel to skip one row after it.
    Dataset<Row> rows = sparkSession.read()
            .format("com.crealytics.spark.excel")
            .option("dataAddress", "'Sheet1'!A1")
            .option("header", true)
            .option("ignoreAfterHeader", 1L)
            .option("maxRowsInMemory", 20)
            .load("file:///Users/td/Downloads/20w_2id_AtypeSM3.xlsx");
    rows.show();
}
The first rows of the Excel file (the second row holds Chinese column labels: test sequence number, look-back date, ID card number, mobile number):

test_id time_point id_number mobile
测试序号 回溯日期 身份证号 手机号
1 2021/12/8 17:06 cbdddb8e8421b23498480570d7d75330538a6882f5dfdc3b64115c647f3328c4 cbdddb8e8421b23498480570d7d75330538a6882f5dfdc3b64115c647f3328c4
2 2021/12/8 17:06 a0dc2d74b9b0e3c87e076003dbfe472a424cb3032463cb339e351460765a822e a0dc2d74b9b0e3c87e076003dbfe472a424cb3032463cb339e351460765a822e
3 2021/12/8 17:06 55e3192d096e62d4f9cd00e734a949de2b8e55b13d9b85b1d2d2999c9db2e72c 55e3192d096e62d4f9cd00e734a949de2b8e55b13d9b85b1d2d2999c9db2e72c
4 2021/12/8 17:06 9b602e9b9e8556eff1a28962d4580b34d9bf054f4831f4f924d4a6dfad660e88 9b602e9b9e8556eff1a28962d4580b34d9bf054f4831f4f924d4a6dfad660e88
5 2021/12/8 17:06 5c0d4f4953843ed6f3c54ea7ca2cc4a86d8b7723c3bf0f3fd403d4c61a77feca 5c0d4f4953843ed6f3c54ea7ca2cc4a86d8b7723c3bf0f3fd403d4c61a77feca

Maven dependency:

<dependency>
    <groupId>com.crealytics</groupId>
    <artifactId>spark-excel_2.12</artifactId>
    <version>0.14.0</version>
</dependency>

I wanted to use the 'ignoreAfterHeader' option to skip the second row (the row of Chinese labels), but it had no effect.

console output:

+--------+-------------+--------------------+--------------------+
| test_id|   time_point|           id_number|              mobile|
+--------+-------------+--------------------+--------------------+
|测试序号|     回溯日期|            身份证号|              手机号|
|       1|12/8/21 17:06|cbdddb8e8421b2349...|cbdddb8e8421b2349...|
|       2|12/8/21 17:06|a0dc2d74b9b0e3c87...|a0dc2d74b9b0e3c87...|
|       3|12/8/21 17:06|55e3192d096e62d4f...|55e3192d096e62d4f...|
|       4|12/8/21 17:06|9b602e9b9e8556eff...|9b602e9b9e8556eff...|
|       5|12/8/21 17:06|5c0d4f4953843ed6f...|5c0d4f4953843ed6f...|
|       6|12/8/21 17:06|f83340f3147b49827...|f83340f3147b49827...|
|       7|12/8/21 17:06|d712cf4114c03dc43...|d712cf4114c03dc43...|
|       8|12/8/21 17:06|fefad899b5dc20858...|fefad899b5dc20858...|
|       9|12/8/21 17:06|8e7a98f9565619a4d...|8e7a98f9565619a4d...|
|      10|12/8/21 17:06|3eaa72f81914fb894...|3eaa72f81914fb894...|
|      11|12/8/21 17:06|d5744897e47fb6d78...|d5744897e47fb6d78...|
|      12|12/8/21 17:06|6f61c3af9dcc39522...|6f61c3af9dcc39522...|
|      13|12/8/21 17:06|abe1b0a5a9e58808c...|abe1b0a5a9e58808c...|
|      14|12/8/21 17:06|87c186adf88a37443...|87c186adf88a37443...|
|      15|12/8/21 17:06|7b4073a22410aafc3...|7b4073a22410aafc3...|
|      16|12/8/21 17:06|dab089f470a4bcb77...|dab089f470a4bcb77...|
|      17|12/8/21 17:06|1f78641036c71b8e6...|1f78641036c71b8e6...|
|      18|12/8/21 17:06|47fb25b4d4af9f2da...|47fb25b4d4af9f2da...|
|      19|12/8/21 17:06|8f1818a052ee87314...|8f1818a052ee87314...|
+--------+-------------+--------------------+--------------------+

Expected Behavior

I expect the 'ignoreAfterHeader' option to skip the given number of rows after the header, so the row of Chinese labels should not appear in the DataFrame.
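
If 'ignoreAfterHeader' is only honored by the newer 'excel' (V2) data source rather than by 'com.crealytics.spark.excel', the same read through the short format name would look roughly like this; a minimal sketch, reusing the file and options from the report and assuming a spark-excel build that registers the short name:

// Sketch only: same file and options as the report, but via the short "excel"
// format name (registered by newer spark-excel releases). Whether the older
// com.crealytics.spark.excel source honors ignoreAfterHeader is exactly what
// this issue is about.
Dataset<Row> rows = sparkSession.read()
        .format("excel")
        .option("dataAddress", "'Sheet1'!A1")
        .option("header", true)
        .option("ignoreAfterHeader", 1L)
        .option("maxRowsInMemory", 20)
        .load("file:///Users/td/Downloads/20w_2id_AtypeSM3.xlsx");
rows.show();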

Steps To Reproduce

public static void main(String[] args) {
    SparkSession sparkSession = SparkSession.builder()
            .master("local[*]")
            .appName("demo")
            .getOrCreate();
    Dataset<Row> rows = sparkSession.read()
            .format("com.crealytics.spark.excel")
            .option("dataAddress", "'new贷前画像-DCPACP指标3.0'!A1")
            .option("header", true)
            .option("ignoreAfterHeader", 1L)
            .option("maxRowsInMemory", 20)
            .load("file:///Users/td/Downloads/20w_2id_AtypeSM3.xlsx");
    rows.show();
}
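
A possible stopgap while the option has no effect is to drop the label row after the read; a minimal sketch, continuing from the repro above and keyed on the test_id label text shown in the sample data:

// requires: import static org.apache.spark.sql.functions.col;
// Workaround sketch: filter out the row of Chinese labels after reading,
// using the test_id value taken from the sample above ("测试序号").
Dataset<Row> dataRows = rows.filter(col("test_id").notEqual("测试序号"));
dataRows.show();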

Environment

- Spark version: 3.1.1
- Spark-Excel version: 0.14.0
- OS: macOS
- Cluster environment: local[*]

Anything else?

No response

Issue Analytics

  • State:open
  • Created a year ago
  • Comments:13

Top GitHub Comments

2 reactions
github-actions[bot] commented, Jul 16, 2022

Please check these potential duplicates:

  • [#615] [BUG] partitionBy not working as expected (62.91%)

If this issue is a duplicate, please add any additional info to the ticket with the most information and close this one.
0 reactions
mgyboom commented, Oct 17, 2022

@mgyboom can you try 0.18.3 which should now be correctly cross-published for all Spark versions.

Although I am using the 3.1.1_0.18.3 version, it still fails:

java.lang.ClassNotFoundException: Failed to find data source: excel. Please find packages at http://spark.apache.org/third-party-projects.html
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:689)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:743)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:266)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
	at cn.tongdun.sparkdatahandler.handler.impl.ExcelHandler.read(ExcelHandler.java:34)
	at cn.tongdun.sparkdatahandler.handler.impl.ExcelHandler.read(ExcelHandler.java:13)
	at cn.tongdun.sparkdatahandler.BaseMain.read(BaseMain.java:87)
	at cn.tongdun.sparkdatahandler.Sharding.main(Sharding.java:58)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1030)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1039)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: excel.DefaultSource
	at java.base/java.net.URLClassLoader.findClass(Unknown Source)
	at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
	at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:663)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:663)
	at scala.util.Failure.orElse(Try.scala:224)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:663)
	... 19 more
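
This ClassNotFoundException means Spark could not resolve the short format name 'excel' on the running classpath, which usually points at the spark-excel jar not being shipped with the job (wrong --jars/--packages coordinate or Scala version) rather than at the option itself. A rough probe of the driver classpath; the class names are assumptions based on recent spark-excel layouts, not taken from this thread:

// Classpath probe: check whether the spark-excel data source classes are
// visible to the driver before Spark's own lookup runs. Class names are
// assumptions, not confirmed by this thread.
String[] candidates = {
        "com.crealytics.spark.excel.DefaultSource",      // V1 source, format "com.crealytics.spark.excel"
        "com.crealytics.spark.excel.v2.ExcelDataSource"  // assumed V2 source behind the short name "excel"
};
for (String name : candidates) {
    try {
        Class.forName(name);
        System.out.println("found " + name);
    } catch (ClassNotFoundException e) {
        System.out.println("missing " + name + " -- check the --jars/--packages setup");
    }
}
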
Read more comments on GitHub >
