[BUG] Option 'ignoreAfterHeader' does not work
Is there an existing issue for this?
- I have searched the existing issues
Current Behavior
public static void main(String[] args) {
    SparkSession sparkSession = SparkSession.builder()
            .master("local[*]")
            .appName("demo")
            .getOrCreate();
    Dataset<Row> rows = sparkSession.read()
            .format("com.crealytics.spark.excel")
            .option("dataAddress", "'Sheet1'!A1")
            .option("header", true)
            .option("ignoreAfterHeader", 1L)
            .option("maxRowsInMemory", 20)
            .load("file:///Users/td/Downloads/20w_2id_AtypeSM3.xlsx");
    rows.show();
}
test_id | time_point | id_number | mobile |
---|---|---|---|
测试序号 | 回溯日期 | 身份证号 | 手机号 |
1 | 2021/12/8 17:06 | cbdddb8e8421b23498480570d7d75330538a6882f5dfdc3b64115c647f3328c4 | cbdddb8e8421b23498480570d7d75330538a6882f5dfdc3b64115c647f3328c4 |
2 | 2021/12/8 17:06 | a0dc2d74b9b0e3c87e076003dbfe472a424cb3032463cb339e351460765a822e | a0dc2d74b9b0e3c87e076003dbfe472a424cb3032463cb339e351460765a822e |
3 | 2021/12/8 17:06 | 55e3192d096e62d4f9cd00e734a949de2b8e55b13d9b85b1d2d2999c9db2e72c | 55e3192d096e62d4f9cd00e734a949de2b8e55b13d9b85b1d2d2999c9db2e72c |
4 | 2021/12/8 17:06 | 9b602e9b9e8556eff1a28962d4580b34d9bf054f4831f4f924d4a6dfad660e88 | 9b602e9b9e8556eff1a28962d4580b34d9bf054f4831f4f924d4a6dfad660e88 |
5 | 2021/12/8 17:06 | 5c0d4f4953843ed6f3c54ea7ca2cc4a86d8b7723c3bf0f3fd403d4c61a77feca | 5c0d4f4953843ed6f3c54ea7ca2cc4a86d8b7723c3bf0f3fd403d4c61a77feca |
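I wanted to use 'ignoreAfterHeader' to skip the second line of the sheet (the row of Chinese column labels: test number, backtrack date, ID card number, mobile number), but it had no effect: that row still appears as the first data row.
Console output: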
+--------+-------------+--------------------+--------------------+
| test_id|   time_point|           id_number|              mobile|
+--------+-------------+--------------------+--------------------+
|测试序号|     回溯日期|            身份证号|              手机号|
|       1|12/8/21 17:06|cbdddb8e8421b2349...|cbdddb8e8421b2349...|
|       2|12/8/21 17:06|a0dc2d74b9b0e3c87...|a0dc2d74b9b0e3c87...|
|       3|12/8/21 17:06|55e3192d096e62d4f...|55e3192d096e62d4f...|
|       4|12/8/21 17:06|9b602e9b9e8556eff...|9b602e9b9e8556eff...|
|       5|12/8/21 17:06|5c0d4f4953843ed6f...|5c0d4f4953843ed6f...|
|       6|12/8/21 17:06|f83340f3147b49827...|f83340f3147b49827...|
|       7|12/8/21 17:06|d712cf4114c03dc43...|d712cf4114c03dc43...|
|       8|12/8/21 17:06|fefad899b5dc20858...|fefad899b5dc20858...|
|       9|12/8/21 17:06|8e7a98f9565619a4d...|8e7a98f9565619a4d...|
|      10|12/8/21 17:06|3eaa72f81914fb894...|3eaa72f81914fb894...|
|      11|12/8/21 17:06|d5744897e47fb6d78...|d5744897e47fb6d78...|
|      12|12/8/21 17:06|6f61c3af9dcc39522...|6f61c3af9dcc39522...|
|      13|12/8/21 17:06|abe1b0a5a9e58808c...|abe1b0a5a9e58808c...|
|      14|12/8/21 17:06|87c186adf88a37443...|87c186adf88a37443...|
|      15|12/8/21 17:06|7b4073a22410aafc3...|7b4073a22410aafc3...|
|      16|12/8/21 17:06|dab089f470a4bcb77...|dab089f470a4bcb77...|
|      17|12/8/21 17:06|1f78641036c71b8e6...|1f78641036c71b8e6...|
|      18|12/8/21 17:06|47fb25b4d4af9f2da...|47fb25b4d4af9f2da...|
|      19|12/8/21 17:06|8f1818a052ee87314...|8f1818a052ee87314...|
+--------+-------------+--------------------+--------------------+
Expected Behavior
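I expect the 'ignoreAfterHeader' option to take effect: with a value of 1, the row immediately after the header should be skipped, so the first data row should be the one with test_id 1.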
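The expectation can be made concrete with a minimal plain-Java sketch of the semantics described above. This is not the spark-excel implementation: the class `IgnoreAfterHeaderSketch` and the method `dataRows` are hypothetical names used only for illustration, the sheet is modeled as a list of string rows, and only the first two columns of the sample are included.

```java
import java.util.List;

public class IgnoreAfterHeaderSketch {

    // Hypothetical helper: treats row 0 as the header and returns the rows
    // that remain after skipping `ignoreAfterHeader` rows directly below it.
    static List<List<String>> dataRows(List<List<String>> sheet, int ignoreAfterHeader) {
        if (ignoreAfterHeader < 0) {
            throw new IllegalArgumentException("ignoreAfterHeader must be >= 0");
        }
        // Clamp so that a short (or empty) sheet yields an empty result.
        int firstDataRow = Math.min(sheet.size(), 1 + ignoreAfterHeader);
        return sheet.subList(firstDataRow, sheet.size());
    }

    public static void main(String[] args) {
        // First two columns of the sample sheet from this report.
        List<List<String>> sheet = List.of(
                List.of("test_id", "time_point"),
                List.of("测试序号", "回溯日期"),
                List.of("1", "2021/12/8 17:06"),
                List.of("2", "2021/12/8 17:06"));

        // With ignoreAfterHeader = 1 the Chinese label row is dropped.
        for (List<String> row : dataRows(sheet, 1)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

With `ignoreAfterHeader` set to 0 the sketch returns every row below the header, which matches the output actually observed in this report.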
Steps To Reproduce
public static void main(String[] args) {
    SparkSession sparkSession = SparkSession.builder()
            .master("local[*]")
            .appName("demo")
            .getOrCreate();
    Dataset<Row> rows = sparkSession.read()
            .format("com.crealytics.spark.excel")
            .option("dataAddress", "'new贷前画像-DCPACP指标3.0'!A1")
            .option("header", true)
            .option("ignoreAfterHeader", 1L)
            .option("maxRowsInMemory", 20)
            .load("file:///Users/td/Downloads/20w_2id_AtypeSM3.xlsx");
    rows.show();
}
Environment
- Spark version: 3.1.1
- Spark-Excel version: 0.14.0
- OS: MacOS
- Cluster environment: local[*]
Anything else?
No response
Top GitHub Comments
Please check these potential duplicates:
I am using version 3.1.1_0.18.3 and the option still has no effect.