Parsing the data file incorrectly
I am currently working on a project converting an EBCDIC binary file to a UTF-8 text file. I am using Cobrix, but the output seems to be incorrect. I am using the following script to load the data file and copybook:
dataFrame = spark.read.format("cobol").option("copybook", copyBook).option("is_record_sequence", "true").load(filename)
The output is shown below:
A screenshot of the original data as presented on the mainframe:
It looks like when parsing the data, it always skips the "EOB_FAMILY_NUM" field, and that field is always null. Other fields are mismatched as well. I have tried adding more options, such as
.option("rdw_adjustment", 4), but that doesn't solve the issue. Do you know anything I can do to resolve this?
I also attached a screenshot of the copybook below:
- Created 3 years ago
Top GitHub Comments
Hi Ruslan, after getting my new files with RDW, Cobrix is working perfectly fine now. Thanks again!
In order to read variable-record-length files, there must be a way to determine the length of each record. The RDW is the best way: it is general, explicit, and deterministic. So if you can preserve the RDW, it is very easy to extract data from the file. Other options are more complicated. If one of the record fields contains the record size, it can be used. If there is no such field, but there is a field that determines the record type, a custom record extractor can be used.
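To illustrate why the RDW makes record splitting deterministic, here is a minimal sketch in plain Python of how RDW-prefixed records can be walked. It assumes the common layout where each record starts with a 4-byte Record Descriptor Word whose first two bytes hold a big-endian record length; whether that length includes the 4-byte RDW itself varies between systems, which is the kind of discrepancy the `rdw_adjustment` option compensates for. The function name and the `rdw_includes_header` flag are illustrative, not part of Cobrix.

```python
import io
import struct

def read_rdw_records(stream, rdw_includes_header=False):
    """Split a binary stream of RDW-prefixed records into payloads.

    Assumed layout (common on z/OS variable-length datasets):
    bytes 0-1 of each RDW = big-endian record length,
    bytes 2-3 = usually zeros (or segment descriptor flags).
    """
    records = []
    while True:
        header = stream.read(4)
        if len(header) < 4:           # end of stream
            break
        (length,) = struct.unpack(">H", header[:2])
        if rdw_includes_header:
            length -= 4               # length counted the RDW itself
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated record")
        records.append(payload)
    return records

# Synthetic example: two records of 3 and 5 bytes,
# with RDW lengths that exclude the 4-byte header.
data = struct.pack(">HH", 3, 0) + b"abc" + struct.pack(">HH", 5, 0) + b"hello"
print(read_rdw_records(io.BytesIO(data)))  # [b'abc', b'hello']
```

Because every record carries its own length, the reader never has to guess where one record ends and the next begins, which is exactly why files that lose the RDW during transfer from the mainframe become so hard to parse.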
You can send the file and the copybook (or links to them on GDrive/Dropbox) to firstname.lastname@example.org.