Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Error reading variable length ASCII file

See original GitHub issue

Describe the bug

Running into an issue when trying to read a variable-length, newline-separated ASCII file using Cobrix. Please see the stack trace below:

at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:239)
	at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:210)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:757)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$3(ShuffleMapTask.scala:81)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$1(ShuffleMapTask.scala:81)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:150)
	at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:119)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.Task.run(Task.scala:91)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:813)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1643)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:816)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:672)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ArithmeticException: BigInteger would overflow supported range
	at java.math.BigInteger.reportOverflow(BigInteger.java:1084)
	at java.math.BigInteger.pow(BigInteger.java:2391)
	at java.math.BigDecimal.bigTenToThe(BigDecimal.java:3574)
	at java.math.BigDecimal.bigMultiplyPowerTen(BigDecimal.java:3707)
	at java.math.BigDecimal.setScale(BigDecimal.java:2448)
	at java.math.BigDecimal.setScale(BigDecimal.java:2515)
	at scala.math.BigDecimal.setScale(BigDecimal.scala:646)
	at org.apache.spark.sql.types.Decimal.set(Decimal.scala:147)
	at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:559)
	at org.apache.spark.sql.types.Decimal$.fromDecimal(Decimal.scala:582)
	at org.apache.spark.sql.types.Decimal.fromDecimal(Decimal.scala)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection$NestedClass_6.StaticInvoke_1338$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection$NestedClass_6.createNamedStruct_36_4$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection$NestedClass_6.createNamedStruct_36$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection$NestedClass_7.createNamedStruct_52_2$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection$NestedClass_7.createNamedStruct_52$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection$NestedClass_7.createNamedStruct_53_5$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection$NestedClass_7.createNamedStruct_53$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection$NestedClass_8.writeFields_35_15$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection$NestedClass_8.writeFields_35$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
	at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:235)
	... 30 more
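
The Caused by section points at the JDK itself: BigDecimal.setScale has to multiply the unscaled value by a power of ten, and when the decimal's exponent is extreme (as can happen when a text field is misparsed as a number), BigInteger.pow exceeds its supported magnitude. A minimal, Cobrix-free sketch of that JDK behaviour:

// Standalone sketch (not Cobrix code): reproduces the JDK-level failure seen
// in the "Caused by" frames above.
object DecimalOverflowRepro {
  def main(args: Array[String]): Unit = {
    val huge = new java.math.BigDecimal("1E+1000000000") // scale = -1000000000
    // setScale(0) must multiply the unscaled value by 10^1000000000, whose bit
    // length exceeds BigInteger's supported range and throws
    // java.lang.ArithmeticException: BigInteger would overflow supported range
    huge.setScale(0)
  }
}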

To Reproduce

Steps to reproduce the behaviour OR commands run:

  1. Read a variable-length ASCII file of a bigger size (at least 100 GB)
  2. Use the following options to read:
     .option("encoding", "ascii")
     .option("is_text", "true")
  3. See the error (a sketch for narrowing down the offending record follows this list)
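
Since the input is plain newline-separated text, one Cobrix-independent way to look for the record that triggers the overflow is to scan the raw lines with Spark and flag any numeric field that cannot fit in a bounded decimal. A rough sketch; the field offsets and the 38-digit bound (Spark's maximum decimal precision) are assumptions to adapt from the actual copybook:

import org.apache.spark.sql.SparkSession

// Hypothetical debugging aid: flag raw lines whose numeric field is not a
// plain, bounded number. The offsets are placeholders; take the real start
// position and length from the copybook.
object FindSuspectRecords {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("find-suspect-records").getOrCreate()

    val inputFile = "<path-to-data>" // placeholder
    val start     = 10               // hypothetical field start offset
    val len       = 18               // hypothetical field length

    val suspicious = spark.read
      .textFile(inputFile)
      .filter { line =>
        val field =
          if (line.length >= start + len) line.substring(start, start + len).trim
          else line.trim
        // Anything other than a sign, up to 38 digits, and an optional
        // fraction risks misparsing or overflowing a Spark DecimalType.
        !field.matches("""-?\d{1,38}(\.\d{1,38})?""")
      }

    suspicious.show(20, truncate = false)
  }
}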

Expected behaviour

The file should parse correctly.

Screenshots

Please see the stack trace provided above.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 13

Top GitHub Comments

1 reaction
pritdb commented on Dec 2, 2021

Thanks a lot @yruslan for all the updates. I don’t currently have access to the environment where this occurred, but I’ve asked the folks who do to test it out. I’ll keep you posted when I hear back from them.

1 reaction
pritdb commented on Nov 12, 2021

Hi @yruslan,

Here are the versions: Apache Spark 3.1.2, Scala 2.12.10

And the read & write code:

val df = spark
  .read
  .format("cobol")
  .option("copybook", "<path-to-copybook>")
  .option("encoding", "ascii")
  .option("is_text", "true")
  .option("schema_retention_policy", "collapse_root")
  .option("drop_value_fillers", "false")
  .load(inputFile)

// Causes the error
df.count()

// Causes the same error
df
  .write
  .partitionBy("field-1")
  .format("delta")
  .mode("overwrite")
  .option("replaceWhere", s"field-1 = '$field1_value' ")
  .save(outputPath)
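
Since both actions fail in the same serializer, a speculative way to bisect the input (this is not from the issue thread) is to force row materialization partition by partition and report which partition throws:

// Speculative sketch: iterate each partition under a try/catch so the job
// survives long enough to report which partition holds the bad record.
val failures = df.rdd
  .mapPartitionsWithIndex { (idx, rows) =>
    try {
      rows.foreach(_ => ()) // forces the row conversion that throws
      Iterator.empty
    } catch {
      case e: ArithmeticException => Iterator((idx, e.getMessage))
    }
  }
  .collect()

failures.foreach { case (idx, msg) => println(s"Partition $idx failed: $msg") }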

Top Results From Across the Web

  • Error reading variable length UTF8 string using H5LT C++ API
    I am unable to read the following string using the H5LTread_dataset_string API. The code below works fine for fixed-length strings. Variable length ...
  • Reading (variable length) input from a file in C - Stack Overflow
    Here is a simple program to read each line, regardless of its size. #include <stdio.h> #include <stdlib.h> int read_data(){ char *line ...
  • Variable length records - wrong length READ - IBM
    Enterprise COBOL Version 5.1 corrects READ statement processing of wrong-length records.
  • Problem loading the BMF from raw ASCII files
    If the file has variable length records, then you can use the following code to read only the first character of each row...
  • File Type - Accounting Support Resources
    If the length of each record in the ASCII text file is variable, the file ... If Working Papers displays an error message...
