Parsing the data file incorrectly

I am currently working on a project that converts an EBCDIC binary file to a UTF-8 text file. I am using Cobrix, but the output seems to be incorrect. I am using the following script to load the data file and the copybook:

dataFrame = spark.read.format("cobol").options(copybook = copyBook).option("is_record_sequence", "true").load(filename)

The output is showing as below:

A screenshot of the original data as presented on the mainframe:

It looks like the parser always skips the "EOB_FAMILY_NUM" field, so that field is always null. Other fields are mismatched as well. I have tried adding more options, such as .option("rdw_adjustment", 4), but that doesn't solve the issue. Is there anything I can do to solve it?

I have also attached a screenshot of the copybook below:


jianyu-gong commented, Jan 4, 2021

Hi Ruslan, after getting my new files with RDW, Cobrix is working perfectly fine now. Thanks again!

yruslan commented, Dec 26, 2020

To read files with variable-length records, there must be a way to determine the length of each record. RDW is the best way: it is general, explicit, and deterministic. So if you can preserve the RDW, it will be very easy to extract data from the file. Other options are more complicated. If one of the record fields contains the record size, it can be used. If there is no such field, but there is a field that determines the record type, a custom record extractor can be used.
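
To illustrate why an RDW makes record boundaries deterministic, here is a minimal sketch in plain Python. It is not Cobrix code. It assumes a common RDW layout: a 4-byte header whose first two bytes hold a big-endian record length and whose last two bytes are unused. Whether that length counts the header itself varies between files, so the sketch exposes it as a flag. The function name split_rdw_records and its parameters are hypothetical, chosen only for this example.

```python
import struct
from typing import Iterator

RDW_SIZE = 4  # assumed header size: 2 length bytes + 2 unused bytes


def split_rdw_records(data: bytes,
                      length_includes_header: bool = True) -> Iterator[bytes]:
    """Yield record payloads from a byte string of RDW-prefixed records.

    Assumption: the first two header bytes are a big-endian length.
    """
    pos = 0
    while pos + RDW_SIZE <= len(data):
        # Read the 2-byte big-endian length at the start of the header.
        (length,) = struct.unpack_from(">H", data, pos)
        payload_len = length - RDW_SIZE if length_includes_header else length
        start = pos + RDW_SIZE
        end = start + payload_len
        if payload_len < 0 or end > len(data):
            raise ValueError(f"bad record length {length} at offset {pos}")
        yield data[start:end]
        pos = end  # the next header starts right after this payload
    if pos != len(data):
        raise ValueError(f"{len(data) - pos} trailing bytes at offset {pos}")


# Two records whose lengths (7 and 6) include the 4-byte header.
sample = b"\x00\x07\x00\x00ABC" + b"\x00\x06\x00\x00DE"
records = list(split_rdw_records(sample))
```

Each header tells the reader exactly where the next record begins, so no knowledge of the field layout is needed to split the file. Without the header, the length has to be inferred from a field inside the record, which is why the other options are more complicated.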

You can send the file and the copybook (or links to them on GDrive/Dropbox) to yruslan@gmail.com.