Request: Ignore Dataset transforms when iterating to the most recent checkpoint when resuming training
🚀 Feature request
It’d be great if, when resuming training from a checkpoint on a Dataset that has a format/transform function applied, that transform could be skipped while iterating up to the last checkpoint step.
Motivation
I doubt it’s much of an issue most of the time, but I’ve started playing with dataset.set_transform() for some heavy preprocessing, and just iterating through samples up to the current checkpoint step takes a ridiculously long time compared to a dataset without a transform applied. And I don’t think there’s any case where those transformed samples would actually be used, right?
See this conversation in the forum for more backstory and my rudimentary thoughts on how I’d accomplish it.
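For context, here is a minimal sketch of the kind of setup this is about, assuming datasets.Dataset.set_transform; the dataset contents and the heavy_preprocess function are made up for illustration:

```python
from datasets import Dataset

# Toy dataset; in practice this would be far larger.
ds = Dataset.from_dict({"text": ["hello world", "resuming training is slow"]})

def heavy_preprocess(batch):
    # Placeholder for expensive on-the-fly work
    # (tokenization, audio decoding, image augmentation, ...).
    return {"input_ids": [[len(t)] for t in batch["text"]]}

# The transform runs lazily every time an example is accessed,
# including during the skip-to-checkpoint replay when training resumes.
ds.set_transform(heavy_preprocess)

print(ds[0])  # heavy_preprocess is invoked here
```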
Your contribution
I’m hesitant to try updating any of the trainer code myself since it’s so complicated, and needs to cover so many edge cases I’m not familiar with.
Top GitHub Comments
This is already there 😃 Just pass along --ignore_data_skip in your script, or ignore_data_skip=True in your TrainingArguments.
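A minimal sketch of what that looks like in code (the output directory and the example script name below are placeholders, not from the original thread):

```python
from transformers import TrainingArguments

# ignore_data_skip tells the Trainer not to fast-forward through the
# already-seen batches when resuming from a checkpoint, so the dataset
# transform is never run on them. The trade-off is that the resumed run
# will not see the data in exactly the same order as the interrupted run.
args = TrainingArguments(
    output_dir="out",        # placeholder output directory
    ignore_data_skip=True,
)

# Equivalent command-line form when using one of the example scripts:
#   python run_clm.py ... --ignore_data_skip
```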
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.