skip_blank has different values during data preparation
I’m running a standard training on parallel data that contains empty source or target lines, and data_io.py raises an error while building buckets:
IndexError: index 897268 is out of bounds for axis 0 with size 897268
The parallel_iter() function in data_io.py is always called with the skip_blank argument set to True, except right here. This line makes us keep the sentence pairs containing “blanks”, which seems to cause the mismatch reflected in the error above. I no longer get the error when I set skip_blank to True (or when I remove the sentence pairs containing blanks from the data).
@mjpost This line came with this PR. Would it be an issue to keep the default skip_blank value here?
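The mismatch described above can be reproduced in miniature. The following is only a sketch under stated assumptions, not the actual data_io.py code: the parallel_iter() here is a hypothetical stand-in with the same name, fill_lengths() is an invented helper, and a preallocated Python list stands in for the array that the real code indexes. It shows that sizing a buffer with blanks skipped and then filling it with blanks kept runs one index past the end.

```python
def parallel_iter(sources, targets, skip_blank=True):
    """Hypothetical stand-in: yield (source, target) pairs,
    optionally dropping pairs where either side is empty."""
    for src, tgt in zip(sources, targets):
        if skip_blank and (not src.strip() or not tgt.strip()):
            continue
        yield src, tgt


def fill_lengths(sources, targets, skip_blank_on_fill):
    """Invented helper: count pairs in one pass, fill a buffer in a second pass."""
    # Pass 1: count pairs with blanks skipped and size the buffer from that count.
    num_pairs = sum(1 for _ in parallel_iter(sources, targets, skip_blank=True))
    lengths = [0] * num_pairs
    # Pass 2: fill the buffer. If this pass keeps blanks, it yields more
    # pairs than the buffer was sized for.
    pairs = parallel_iter(sources, targets, skip_blank=skip_blank_on_fill)
    for i, (src, _tgt) in enumerate(pairs):
        lengths[i] = len(src.split())
    return lengths


sources = ["a b c", "", "d e"]
targets = ["x y", "z", "w"]

# Both passes skip blanks: the counts agree.
consistent = fill_lengths(sources, targets, skip_blank_on_fill=True)

# Second pass keeps blanks: the third pair has no slot in the buffer.
try:
    fill_lengths(sources, targets, skip_blank_on_fill=False)
    mismatch_raised = False
except IndexError:
    mismatch_raised = True
```

With these toy inputs the consistent call fills two slots, while the inconsistent call fails on the first index past the end of the buffer, which is the same pattern as the reported index 897268 against size 897268.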
Issue Analytics
- State:
- Created 5 years ago
- Comments:6 (6 by maintainers)

Thanks for reporting! This indeed doesn’t seem right.
I should also have written: thanks for tracking this down and filing a perfect bug report. It seems you uncovered a real hole in my use cases 😃